# Requesting Posts from r/Batman

In this notebook, I'll be scraping the Batman subreddit for text data.

### Read In Modules

In [53]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import requests
import time

### Extracting Data from Reddit API

I'll write a for loop to request and store a large volume of data from Reddit's API. But first, I need to examine how the raw data is structured.

In [55]:
url_b = 'https://www.reddit.com/r/batman.json'

In [56]:
headers = {'user-agent': 'Christiaan'}

In [57]:
res = requests.get(url_b, headers=headers)

In [58]:
res.status_code

200

In [59]:
the_json = res.json()

This data is structured as a nested dictionary. Let's take a closer look at its key-value pairs.

In [60]:
sorted(the_json['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [61]:
the_json['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [62]:
len(the_json['data']['children'])

27

My first request returned 27 posts. I'll find the ID of the final post and pass it as the `after` parameter in my next request. That way, each request returns posts I haven't already received.

In [63]:
pd.DataFrame(the_json['data']['children'])

| | data | kind |
|---|---|---|
| 0 | {'approved_at_utc': None, 'subreddit': 'batman... | t3 |
| 1 | {'approved_at_utc': None, 'subreddit': 'batman... | t3 |
| 2 | {'approved_at_utc': None, 'subreddit': 'batman... | t3 |
| 3 | {'approved_at_utc': None, 'subreddit': 'batman... | t3 |
| 4 | {'approved_at_utc': None, 'subreddit': 'batman... | t3 |
| 5 | {'approved_at_utc': None, 'subreddit': 'batman... | t3 |
| 6 | {'approved_at_utc': None, 'subreddit': 'batman... | t3 |
| 7 | {'approved_at_utc': None, 'subreddit': 'batman... | t3 |
| 8 | {'approved_at_utc': None, 'subreddit': 'batman... | t3 |
| 9 | {'approved_at_utc': None, 'subreddit': 'batman... | t3 |


In [64]:
[post['data']['name'] for post in the_json['data']['children']]

['t3_a4yp6w',
 't3_a5g7mp',
 't3_a61uey',
 't3_a5t05c',
 't3_a5riuo',
 't3_a5zskr',
 't3_a64w1x',
 't3_a64ex4',
 't3_a5ur4l',
 't3_a654nj',
 't3_a5x9rp',
 't3_a5n87q',
 't3_a5wofy',
 't3_a5uf2v',
 't3_a5zodz',
 't3_a63g70',
 't3_a5v1zz',
 't3_a5wwuo',
 't3_a63upn',
 't3_a5uqtm',
 't3_a5n7oy',
 't3_a626ce',
 't3_a5lnxd',
 't3_a5wkcv',
 't3_a61bt7',
 't3_a5yxwy',
 't3_a5p9o1']

In [65]:
the_json['data']['after']

't3_a5p9o1'

In [66]:
# Use the 'after' value from the response above instead of a hard-coded ID
param = {'after': the_json['data']['after']}
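To make the pagination step concrete, here is a minimal sketch that works on a hand-written stand-in for one response rather than a live request. The `listing` dictionary is trimmed down to the keys used in this notebook, and `next_page_params` is a hypothetical helper name, not part of Reddit's API or of `requests`.

```python
# A trimmed-down, hand-written stand-in for one listing response.
# Real responses carry many more fields inside each child's 'data'.
listing = {
    'data': {
        'after': 't3_a5p9o1',
        'before': None,
        'children': [
            {'kind': 't3', 'data': {'name': 't3_a5yxwy'}},
            {'kind': 't3', 'data': {'name': 't3_a5p9o1'}},
        ],
    },
}


def next_page_params(listing):
    """Return the query parameters that ask for the page after `listing`."""
    after = listing['data']['after']
    # With no 'after' token there is nothing to pass along
    return {} if after is None else {'after': after}


# In the output above, 'after' matches the name of the last post on the page
last_name = listing['data']['children'][-1]['data']['name']
params = next_page_params(listing)
print(last_name, params)
```

The resulting dictionary can be passed straight to `requests.get(url_b, params=params, headers=headers)`.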

# Writing a For Loop to Make Multiple Requests from the API

To streamline the scraping process, I'll build a for loop that sends four requests each time it runs. The first cell initializes an empty `posts` list, sets `after` to `None`, and then runs the loop. After running that cell once, I rerun only the loop, which I've copied into its own cell. That way, the `posts` list and the `after` variable don't reset each time I make more requests.

In [None]:
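posts = []    # accumulates the raw post dictionaries across requests
after = None  # no 'after' token yet, so the first request starts at the top
for i in range(4):
    print(i)
    # Only pass 'after' once a previous response has supplied one
    params = {} if after is None else {'after': after}
    res = requests.get(url_b, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        posts.extend(the_json['data']['children'])
        after = the_json['data']['after']  # marks where the next page starts
    else:
        print(res.status_code)
        break
    time.sleep(1)  # pause between requests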

In [82]:
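# Rerun this cell to keep paging; posts and after carry over from earlier runs
for i in range(4):
    print(i)
    # Only pass 'after' once a previous response has supplied one
    params = {} if after is None else {'after': after}
    res = requests.get(url_b, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        posts.extend(the_json['data']['children'])
        after = the_json['data']['after']  # marks where the next page starts
    else:
        print(res.status_code)
        break
    time.sleep(1)  # pause between requests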

0
1
2
3
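As an alternative to keeping the loop in two cells, the same logic could be wrapped in a function that takes the current `posts` and `after` as arguments and returns the updated values. This is only a sketch: `collect_posts`, `fetch_page`, `fake_pages`, and `fake_fetch` are hypothetical names, and the fake pages exist so the example runs without touching the network. It also assumes that `after` comes back as `None` once there are no further pages.

```python
import time


def collect_posts(fetch_page, n_requests=4, after=None, posts=None, pause=1.0):
    """Request up to n_requests pages, resuming from `after`.

    fetch_page(params) is a caller-supplied function (hypothetical) that
    returns a (status_code, parsed_json) tuple for a single request.
    """
    posts = [] if posts is None else posts
    for _ in range(n_requests):
        params = {} if after is None else {'after': after}
        status, listing = fetch_page(params)
        if status != 200:
            print(status)
            break
        posts.extend(listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            break  # assumed to mean the listing is exhausted
        time.sleep(pause)
    return posts, after


# Stand-in for the API so the sketch runs offline: two made-up pages
fake_pages = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}}], 'after': 't3_a'}},
    't3_a': {'data': {'children': [{'data': {'name': 't3_b'}}], 'after': None}},
}


def fake_fetch(params):
    """Look up the fake page that follows the given 'after' token."""
    return 200, fake_pages[params.get('after')]


posts, after = collect_posts(fake_fetch, n_requests=4, pause=0)
print(len(posts), after)
```

In this notebook, `fetch_page` could be a small wrapper around `requests.get(url_b, params=params, headers=headers)` that returns the status code together with `res.json()`. Passing the returned `posts` and `after` back into the next call continues from where the previous call stopped.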


# Exploring the dictionary and text-based data

I've run the block above about ten times and collected roughly a thousand posts from r/Batman. I'll store them in a dataframe and drill down through the dictionary to find my text corpus.

In [83]:
len(posts)

991

In [84]:
df = pd.DataFrame(posts)
df.shape

(991, 2)

In [85]:
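# Count of unique post IDs; a cell only displays its last expression, so this value is not shown below
len(set([p['data']['name'] for p in posts]))
posts[1]['data']['title']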

'Weekly Batman Discussion Thread - Which criminal group or organisation represents the biggest threat for Batman?'
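Counting unique names shows whether `posts` contains repeats, which could happen if the same page were requested twice. A minimal sketch of removing them while keeping the original order follows; `unique_posts` and the three-item `sample` list are hypothetical, made up for illustration.

```python
def unique_posts(posts):
    """Keep the first occurrence of each post, keyed on its 'name' field."""
    seen = set()
    unique = []
    for post in posts:
        name = post['data']['name']
        if name not in seen:
            seen.add(name)
            unique.append(post)
    return unique


# Made-up sample in the same shape as the entries in `posts`
sample = [
    {'data': {'name': 't3_a4yp6w'}},
    {'data': {'name': 't3_a5g7mp'}},
    {'data': {'name': 't3_a4yp6w'}},  # repeat of the first entry
]
deduped = unique_posts(sample)
print(len(sample), len(deduped))
```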

In [86]:
for i in range(5): 
    print(the_json['data']['children'][i]['data']['selftext'])




I read a Batman comic in which a guy who fell out a window in a batman costume wakes up and thinks he actually is batman. He tries to fight criminals and is ultimately rescued by the real Batman. He does manage to rescue a prostitute in between. Anyone know which one?



I pledge to hereby watch this movie and pay my hard earned money while eating overpriced layered buttered popcorn with dill pickle and white cheddar powder lightly sprinkled on top.

I solemnly swear to tell all my friends, family and coworkers that Adam Driver will be the best Batman we see to date.

What say you Reddit.

&amp;#x200B;

[The Dark Knight?](https://i.redd.it/od764k0okru11.jpg)

&amp;#x200B;

&amp;#x200B;
&amp;#x200B;

https://i.redd.it/jhpbbodvvqu11.png



#[**Welcome, Batmembers!**](https://i.imgur.com/1KVYcCm.png)

The new remastered Blu-rays of Batman: The Animated Series comes out Tuesday, October 30th.  Check back *Friday, November 2nd* for the first episode of the rewatch right here on r/Batman!

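IndexError: list index out of range

The loop raises an IndexError because `the_json` holds only the most recent response, and that page contained fewer than five posts.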
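Slicing the list instead of indexing it by position avoids the IndexError above when a page holds fewer posts than expected. The sketch below uses a hypothetical helper, `preview_selftext`, and a made-up two-post page.

```python
def preview_selftext(children, n=5):
    """Return the selftext of at most n posts without indexing past the end."""
    return [child['data']['selftext'] for child in children[:n]]


# Made-up page with only two posts, fewer than the five requested
short_page = [
    {'data': {'selftext': ''}},
    {'data': {'selftext': 'Favourite robin and why?'}},
]
for text in preview_selftext(short_page):
    print(text)
```

In the notebook, the call would be `preview_selftext(the_json['data']['children'])` for the latest page, or `preview_selftext(posts)` for everything collected so far.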

# Storing post bodies and titles to a dataframe; Saving to CSV

Now that I have a clear picture of where my text data lives in the dictionary, I'll extract each post's body text (the `selftext` field) and its title, store each in its own dataframe, and save each to a CSV file.

In [87]:
post_lst = [p['data']['selftext'] for p in posts]
data_dict = {
    'posts':post_lst,
    'label': np.full(len(post_lst), 1)
}

post_df = pd.DataFrame.from_dict(data_dict, orient='columns')
# Display the de-duplicated rows; post_df itself is left unchanged
post_df[~post_df.duplicated()]

| | posts | label |
|---|---|---|
| 0 | Hey there, citizens of Gotham. Welcome to our ... | 1 |
| 1 | Hi all, and welcome back to the weekly Batman ... | 1 |
| 2 | | 1 |
| 6 | lately I have been reading a lot of the 1990's... | 1 |
| 7 | Trivia: The Batmobile from the 60's TV show is... | 1 |
| 9 | I am searching for two batman comics I read as... | 1 |
| 15 | Good day all!\n\n&amp;#x200B;\n\nI'm looking f... | 1 |
| 18 | Christian Bale must comeback as Batman. | 1 |
| 21 | Favourite robin and why?\n( just interested) | 1 |
| 24 | Hey Batman subreddit! I remember this really s... | 1 |


In [88]:
post_df.to_csv('batmancomments1.csv')

In [89]:
post_lst = [p['data']['title'] for p in posts]
data_dict_titles = {
    'posts':post_lst,
    'label': np.full(len(post_lst), 1)
}

post_df_titles = pd.DataFrame.from_dict(data_dict_titles, orient='columns')
# Display the de-duplicated rows; post_df_titles itself is left unchanged
post_df_titles[~post_df_titles.duplicated()]

| | posts | label |
|---|---|---|
| 0 | Weekly Batman Comics (12/12/2018): The Batman ... | 1 |
| 1 | Weekly Batman Discussion Thread - Which crimin... | 1 |
| 2 | Hi all got this in the mail today saw there wa... | 1 |
| 3 | Hardy tells a story about Bale during TDKR´s f... | 1 |
| 4 | Wholesome | 1 |
| 5 | I made this in Photoshop. Let me know what you... | 1 |
| 6 | Do you like the New 52 Batman over the pre 2011? | 1 |
| 7 | The 60s TV Batmobile appears in the race in Re... | 1 |
| 8 | I hope this scene happens in a JL movie. | 1 |
| 9 | searching for two batman comics | 1 |


In [90]:
post_df_titles.to_csv('batmantitles1.csv')
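The two extraction cells above could also be combined into a single pass that keeps each title next to its body text. This is a sketch under assumptions: `posts_to_rows` is a hypothetical helper, and the sample post reuses a title and body that appear in the outputs above.

```python
def posts_to_rows(posts, label=1):
    """Flatten raw post dictionaries into one row per post."""
    rows = []
    for post in posts:
        data = post['data']
        rows.append({
            'name': data['name'],          # post ID, useful for spotting repeats
            'title': data['title'],
            'selftext': data['selftext'],  # empty string for link or image posts
            'label': label,
        })
    return rows


# Sample entry in the same shape as the items in `posts`
sample_posts = [
    {'data': {'name': 't3_a5g7mp',
              'title': 'searching for two batman comics',
              'selftext': 'I am searching for two batman comics'}},
]
rows = posts_to_rows(sample_posts)
print(rows[0]['title'])
```

The rows can be passed to `pd.DataFrame(rows)`. Saving with `to_csv(filename, index=False)` leaves the index out of the file, so no `Unnamed: 0` column appears when the CSV is read back in.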

Now that the r/Batman data is saved to CSV, I'll repeat the process above for r/Joker in a new notebook. That way, I can reuse this code for a new subreddit without risking the loss of the data I've pulled from r/Batman.