# Scraping Data from r/Joker

In the previous notebook, I requested text data from the Batman subreddit with Reddit's API. In this notebook, I'll be repeating the process for the Joker subreddit. 

### Read In Modules

In [47]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import requests
import time

### Extracting Data from Reddit API

I will write a for loop to request and store a large volume of data from Reddit's API. First, though, I need to examine how the raw data is structured.

In [49]:
url_j = 'https://www.reddit.com/r/joker.json'

In [50]:
headers = {'user-agent': 'Christiaan'}

In [51]:
res = requests.get(url_j, headers=headers)

In [52]:
res.status_code

200

The response parses into a nested dictionary. Let's take a closer look at its key-value pairs.

In [53]:
the_json = res.json()

In [54]:
sorted(the_json['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [55]:
the_json['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [56]:
len(the_json['data']['children'])

25

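My first request returned 25 posts. To page through the subreddit, I'll take the name of the final post, which Reddit also reports under the `after` key, and pass it as a parameter on the next request. That way, each request picks up where the previous one ended instead of returning the same posts again.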

In [57]:
pd.DataFrame(the_json['data']['children'])

Unnamed: 0,data,kind
0,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
1,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
2,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
3,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
4,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
5,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
6,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
7,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
8,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
9,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3


In [58]:
the_json['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'joker',
   'selftext': '',
   'author_fullname': 't2_i8jkn',
   'saved': False,
   'mod_reason_title': None,
   'gilded': 0,
   'clicked': False,
   'title': 'Got my first tattoo! Credits to _alexbadea_ on instagram.',
   'link_flair_richtext': [],
   'subreddit_name_prefixed': 'r/joker',
   'hidden': False,
   'pwls': None,
   'link_flair_css_class': None,
   'downs': 0,
   'thumbnail_height': 140,
   'hide_score': False,
   'name': 't3_a5tjey',
   'quarantine': False,
   'link_flair_text_color': 'dark',
   'author_flair_background_color': None,
   'subreddit_type': 'public',
   'ups': 76,
   'domain': 'i.redd.it',
   'media_embed': {},
   'thumbnail_width': 140,
   'author_flair_template_id': None,
   'is_original_content': False,
   'user_reports': [],
   'secure_media': None,
   'is_reddit_media_domain': True,
   'is_meta': False,
   'category': None,
   'secure_media_embed': {},
   'link_flair_text': None,
   'ca

In [59]:
[post['data']['name'] for post in the_json['data']['children']]

['t3_a5tjey',
 't3_a5rlgb',
 't3_a5bfmt',
 't3_a56yrm',
 't3_a5bxnl',
 't3_a4vgjy',
 't3_a50vks',
 't3_a45mu2',
 't3_a3ykoi',
 't3_a3fbs8',
 't3_a39yoy',
 't3_a3606h',
 't3_a2tf5x',
 't3_a2ymqt',
 't3_a2ipgc',
 't3_a29iq7',
 't3_a22m6b',
 't3_a1udjt',
 't3_a1xd6y',
 't3_a1jjzw',
 't3_a14tiv',
 't3_a0yfhd',
 't3_a0toct',
 't3_a0faut',
 't3_a0dsu0']

In [60]:
the_json['data']['after']

't3_a0dsu0'

In [61]:
# Use the token returned by the request above rather than a hard-coded post ID
param = {'after': the_json['data']['after']}
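As a quick sanity check on the pagination token, the sketch below uses a trimmed stand-in for the listing (only the first and last post names from the output above, with every other field omitted) to confirm that `after` matches the name of the final post and to build the parameter dictionary from it. The names `sample_listing` and `next_page_params` are illustrative and not part of Reddit's API.

```python
# Trimmed stand-in for the parsed response; only the fields needed here.
sample_listing = {
    'data': {
        'children': [
            {'kind': 't3', 'data': {'name': 't3_a5tjey'}},
            {'kind': 't3', 'data': {'name': 't3_a0dsu0'}},
        ],
        'after': 't3_a0dsu0',
    }
}


def next_page_params(listing):
    """Return the query parameters for the page that follows `listing`."""
    after = listing['data']['after']
    # An `after` of None means there is no token to send.
    return {} if after is None else {'after': after}


last_name = sample_listing['data']['children'][-1]['data']['name']
print(last_name == sample_listing['data']['after'])
print(next_page_params(sample_listing))
```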

# Writing a For Loop to Make Multiple Requests from the API

To streamline the scraping process, I will build a for loop that sends four requests each time it runs. The first cell initializes an empty `posts` list and sets `after` to `None` before looping. After running that cell once, I'll rerun only the loop itself in a separate cell, so the `posts` list and the `after` variable don't reset each time I make more requests.

In [62]:
posts = []
after = None
for i in range(4):
    print(i)
    if after is None:
        params = {}
    else:
        params = {'after':after}
    url = url_j
    res = requests.get(url, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        posts.extend(the_json['data']['children'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
    time.sleep(1)

0
1
2
3


In [76]:
for i in range(4):
    print(i)
    if after is None:
        params = {}
    else:
        params = {'after':after}
    url = url_j
    res = requests.get(url, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        posts.extend(the_json['data']['children'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
    time.sleep(1)

0
1
2
3
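An alternative to splitting the loop across two cells is to wrap it in a function that takes the current `after` token and hands back the new posts along with the updated token. This is only a sketch: `fetch_listing_pages`, its parameters, and the injectable `get` argument are my own illustrative choices, and stopping when `after` comes back as `None` assumes that value marks the last available page.

```python
import time


def fetch_listing_pages(url, headers, after=None, n_requests=4, pause=1.0, get=None):
    """Request up to `n_requests` pages and return (new_posts, after).

    `get` defaults to requests.get; passing a stand-in makes the
    function usable without network access.
    """
    if get is None:
        import requests
        get = requests.get

    new_posts = []
    for _ in range(n_requests):
        params = {} if after is None else {'after': after}
        res = get(url, params=params, headers=headers)
        if res.status_code != 200:
            print(res.status_code)
            break
        listing = res.json()['data']
        new_posts.extend(listing['children'])
        after = listing['after']
        if after is None:
            # Assumed to mark the last page; stop rather than start over.
            break
        time.sleep(pause)
    return new_posts, after
```

In this notebook it could be called as `new_posts, after = fetch_listing_pages(url_j, headers, after=after)` followed by `posts.extend(new_posts)`, so repeated calls keep accumulating posts instead of resetting them.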


# Exploring the Dictionary and Text-Based Data

I've rerun the loop cell above about ten times and now have a good amount of data from the Joker subreddit. I'll store it in a dataframe and drill down through the dictionary to find my text corpus.

In [77]:
len(posts)

1075

In [78]:
joker_df = pd.DataFrame(posts)

In [79]:
joker_df.head()

Unnamed: 0,data,kind
0,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
1,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
2,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
3,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3
4,"{'approved_at_utc': None, 'subreddit': 'joker'...",t3


In [80]:
joker_df.shape

(1075, 2)
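The two-column frame keeps every post nested inside the `data` column. One way to drill down is to build the frame from the inner dictionaries instead, so each key becomes its own column. The sketch below uses made-up records shaped like the items in `posts`; the `name` values and the body text are placeholders, not scraped data.

```python
import pandas as pd

# Made-up records shaped like the items in `posts`; names are placeholders.
sample_posts = [
    {'kind': 't3', 'data': {'name': 't3_example1', 'title': 'Graphic Novels',
                            'selftext': 'Sample body text'}},
    {'kind': 't3', 'data': {'name': 't3_example2', 'title': 'da fuq?',
                            'selftext': ''}},
]

# One row per post, one column per key in the inner 'data' dictionary.
flat_df = pd.DataFrame([p['data'] for p in sample_posts])

# Keep only the identifier and the two text fields.
text_df = flat_df[['name', 'title', 'selftext']]
print(text_df.shape)
```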

# Storing Post Text and Titles in a Dataframe; Saving to CSV

Now that I have a clear picture of where my text data lives in the dictionary, I'll extract each post's title and body text (the `selftext` field, which the variable and file names below call comments), store each in its own dataframe, and save each to a CSV file.

In [81]:
post_lst = [p['data']['title'] for p in posts]
data_dict = {
    'posts':post_lst,
    'label': np.full(len(post_lst), 1)
}

post_df = pd.DataFrame.from_dict(data_dict, orient='columns')
# Display only: this does not modify post_df, so the CSV saved below still includes any duplicates
post_df[~post_df.duplicated()]

Unnamed: 0,posts,label
0,Got my first tattoo! Credits to _alexbadea_ on...,1
1,The Joker as portrayed in The Dark Knight was ...,1
2,I did a Joker Sculpture and wanted to share it...,1
3,"This is a very atmospheric, eerie playlist to ...",1
4,da fuq?,1
5,"Congratulations, /r/Joker! You are Subreddit o...",1
6,Crazy theory that just sparked in my mind,1
7,This is a selection of songs from comic-based ...,1
8,The Live Action Jokers Digital Art,1
9,Graphic Novels,1


In [82]:
post_df.to_csv('jokertitles1.csv')

In [83]:
post_lst = [p['data']['selftext'] for p in posts]
data_dict_comments = {
    'posts':post_lst,
    'label': np.full(len(post_lst), 1)
}

post_df_comments = pd.DataFrame.from_dict(data_dict_comments, orient='columns')
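# Display only: this does not modify post_df_comments, so the CSV saved below still includes any duplicates
post_df_comments[~post_df_comments.duplicated()]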

Unnamed: 0,posts,label
0,,1
6,What if the joker lied to us - not just about ...,1
9,Hi everyone!\n\nIm a massive fan of the Batman...,1
11,So this is probably a long shot lol and I neve...,1
25,I remember reading a comic where the Joker got...,1
32,We're introduced to Arthur. He's a likable guy...,1
33,What other actors would do an amazing job play...,1
50,from which game or movie? I have tried to find...,1
83,What if this was supposed to be a prequel to D...,1
85,"Some time ago, somebody tell me that the Joker...",1


In [84]:
post_df_comments.to_csv('jokercomments1.csv')
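If the deduplicated rows are what should end up on disk, the filtered frame has to be assigned before saving. The sketch below shows one way to do that on made-up titles (with one repeated on purpose), writing to an in-memory buffer in place of a file such as `jokertitles1.csv`. Whether to drop duplicates at this stage or later is a judgment call, so treat it as an option rather than the notebook's method.

```python
import io

import numpy as np
import pandas as pd

# Made-up titles standing in for `post_lst`; one is repeated on purpose.
sample_titles = ['Graphic Novels', 'da fuq?', 'Graphic Novels']
sample_df = pd.DataFrame({
    'posts': sample_titles,
    'label': np.full(len(sample_titles), 1),
})

# drop_duplicates returns a new frame, so the result has to be assigned.
deduped = sample_df.drop_duplicates().reset_index(drop=True)

# index=False leaves the row index out of the file, which avoids an
# 'Unnamed: 0' column when the CSV is read back in.
buffer = io.StringIO()  # stands in for a file path
deduped.to_csv(buffer, index=False)

reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
print(reloaded.columns.tolist())
print(len(sample_df), len(deduped))
```

The same pattern applies to the `selftext` frame, where most of the duplicates are empty strings from posts that have no body text.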

In [85]:
post_df_comments['posts'][6]

'What if the joker lied to us - not just about his origin but how he\'s the joker, what if the joker was always the joker, he had no beginning or end, the whole one bad day thing could just be either:\n1. His way of coping with being different\n2. Him manipulating his victims so they feel sympathy for him and making them feel like he\'s right and that anyone could just snap so that they actually do\n3. He\'s just fucking liar and is taking the shit out of you\n\nLet\'s go to the killing joke (ima paraphrase), \n "Jim said he wanted it to be done by the book, no matter what you did to him he\'s still as sane as he\'s ever been, because normal people DONT snap!" \ntheres also this\nJoker - ".. Why aren\'t you LAUGHING!?"\nBatman - "Because i\'ve heard it before - and it wasn\'t funny the first time."\n\nI think it\'s pretty apparent the Joker was a loose cannon even before becoming a villain'

In this notebook, I scraped the Joker subreddit for text data. In the next notebook, I'll read in all of the data that I've collected from each subreddit and combine them into a single dataframe.