In this notebook, we collect about 5,000 posts (up to 2,500 each) from the Breaking Bad and Better Call Saul subreddits using the Pushshift API.

# Imports:

In [60]:
import requests
import math
import time
import pandas as pd
import matplotlib.pyplot as plt


# Scrape Data:


The functions below download submissions from Reddit for the chosen subreddits and then convert them into a data frame.

In [61]:

def get_subreddit(subreddit, subreddit_type, size, page_size=500):
    """Download up to `size` posts of the given type from a subreddit via Pushshift."""
    assert subreddit_type in ('submission', 'comment')
    url = f"https://api.pushshift.io/reddit/search/{subreddit_type}"
    final_list = []
    last_time = None  # timestamp of the oldest post fetched so far
    for i in range(math.ceil(size / page_size)):
        print(f'Downloading {page_size} out of {size} posts, number {i+1}')
        params = {
            'subreddit': subreddit,
            'size': page_size
        }
        if last_time is not None:
            # Request only posts older than the last one already collected
            params['before'] = last_time

        res = requests.get(url, params)
        assert res.status_code == 200
        data = res.json()
        post_list = data['data']
        if not post_list:
            # No older posts left; stop instead of failing on post_list[-1]
            break
        last_time = post_list[-1]['created_utc']
        final_list.extend(post_list)
        time.sleep(2)  # pause between requests
    return final_list
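
The paging logic above can be tried without network access. The following is a minimal sketch, not the notebook's actual download code: `paginate`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names introduced here, and the fake source simply stands in for the API. It shows how the `created_utc` of the last post on each page becomes the `before` cursor for the next request, and why the result can be shorter than the requested size when the source runs out.

```python
import math


def paginate(fetch_page, size, page_size=500):
    """Collect up to `size` items, newest first, using a `before` cursor.

    `fetch_page(before, limit)` is a hypothetical callable standing in for
    the HTTP request; it returns a list of dicts with a 'created_utc' key.
    """
    collected = []
    before = None  # no cursor on the first request
    for _ in range(math.ceil(size / page_size)):
        page = fetch_page(before, page_size)
        if not page:  # the source has no older items
            break
        collected.extend(page)
        before = page[-1]['created_utc']  # oldest item seen so far
    return collected[:size]


# Fake data source: 1,200 posts with timestamps 1200, 1199, ..., 1.
FAKE_POSTS = [{'created_utc': t} for t in range(1200, 0, -1)]


def fake_fetch(before, limit):
    """Return the newest `limit` fake posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS
             if before is None or p['created_utc'] < before]
    return older[:limit]


posts = paginate(fake_fetch, size=2500, page_size=500)
print(len(posts))
```

Because the cursor is a strict "older than" bound in this sketch, no post is returned twice; a real API may treat posts sharing the same timestamp differently.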

In [62]:
def make_reddit_df(reddit_list): 
    df = pd.DataFrame(reddit_list)
    df = df[['subreddit', 'selftext', 'title']]
    return df
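
Selecting columns with `df[[...]]` raises a `KeyError` if any of the three fields is absent from every downloaded record. As a hedged alternative, assuming that missing fields should simply become empty values, the sketch below uses `DataFrame.reindex`; `make_reddit_df_safe` and the sample records are hypothetical and only illustrate the idea.

```python
import pandas as pd

COLUMNS = ['subreddit', 'selftext', 'title']


def make_reddit_df_safe(reddit_list, columns=COLUMNS):
    """Build a data frame with exactly `columns`, filling absent ones with NaN."""
    df = pd.DataFrame(reddit_list)
    # reindex keeps the requested columns, drops the rest,
    # and creates any missing column filled with NaN
    return df.reindex(columns=columns)


# Hypothetical records: the second one has no 'selftext' field
sample = [
    {'subreddit': 'breakingbad', 'title': 'If only',
     'selftext': '[removed]', 'score': 1},
    {'subreddit': 'breakingbad', 'title': 'No body text'},
]
sample_df = make_reddit_df_safe(sample)
print(sample_df.shape)
```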

In [63]:
# Breaking Bad Subreddit:
bb_submission_list = get_subreddit('breakingbad', 'submission', 2500, page_size=500)
bb_submission_df = make_reddit_df(bb_submission_list)

Downloading 500 out of 2500 posts, number 1
Downloading 500 out of 2500 posts, number 2
Downloading 500 out of 2500 posts, number 3
Downloading 500 out of 2500 posts, number 4
Downloading 500 out of 2500 posts, number 5


In [64]:
bb_submission_df.head()

,subreddit,selftext,title
0,breakingbad,[removed],If only
1,breakingbad,Ok so the show itself is a “scary” and violent...,I’m squeamish/gets scared easily - anyone wann...
2,breakingbad,If Hank simply watched security footage from t...,Motels didn't have cameras back then?
3,breakingbad,[removed],Walter Jr outfit in this scene:
4,breakingbad,You can’t get near him and have to lay low. Wh...,If you were Walt and the store was out of Etch...


In [65]:
# Better Call Saul Subreddit:
bcs_submission_list = get_subreddit('betterCallSaul', 'submission', 2500, page_size=500)
bcs_submission_df = make_reddit_df(bcs_submission_list)
bcs_submission_df.shape

Downloading 500 out of 2500 posts, number 1
Downloading 500 out of 2500 posts, number 2
Downloading 500 out of 2500 posts, number 3
Downloading 500 out of 2500 posts, number 4
Downloading 500 out of 2500 posts, number 5


(2496, 3)

Now we save the data frames as CSV files:

In [75]:
bcs_submission_df.to_csv('data/bettercallsaul.csv')

In [76]:
bb_submission_df.to_csv('data/breakingbad.csv')

In [77]:
subreddits_df = pd.concat([bcs_submission_df, bb_submission_df], axis = 0)

In [78]:
subreddits_df.to_csv('data/subreddits.csv')
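
One detail worth noting about the cells above: `pd.concat` keeps each frame's original row labels, and `to_csv` writes the index as an extra unnamed column, which pandas shows as `Unnamed: 0` when the file is read back. The sketch below shows one possible way to avoid both, assuming the row index carries no information. The small frames and the in-memory buffer are stand-ins for the real data and files.

```python
import io

import pandas as pd

# Small stand-in frames with the same three columns as the real data
bcs_df = pd.DataFrame({
    'subreddit': ['betterCallSaul', 'betterCallSaul'],
    'selftext': ['text one', 'text two'],
    'title': ['title one', 'title two'],
})
bb_df = pd.DataFrame({
    'subreddit': ['breakingbad', 'breakingbad'],
    'selftext': ['text three', 'text four'],
    'title': ['title three', 'title four'],
})

# ignore_index=True renumbers rows 0..n-1 instead of repeating 0, 1, 0, 1
combined = pd.concat([bcs_df, bb_df], axis=0, ignore_index=True)

# index=False leaves the row index out of the CSV
buffer = io.StringIO()
combined.to_csv(buffer, index=False)

# Reading the CSV back yields only the three data columns
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```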

Next, we perform NLP and EDA on the gathered data (notebook 2.NPL-EDA).