<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Reddit NLP Classification Challenge
Notebook 1: Data Scraping

This notebook is the starting point for the Reddit NLP Classification Challenge. It scrapes posts from two subreddits on reddit.com using the Pushshift API. The data will later serve as the training and testing sets for this project.


Contents of this notebook include:  

- [Data Scraping](#Data-Scraping)
- [Initial Feature Selection](#Initial-Feature-Selection)

### Import Libraries

This project uses the requests library to pull data from the Pushshift API and the time library to pace my requests so that they don't flood the service. Since the datasets are text-heavy, I'm also setting the pandas display options (column width, rows, and columns) to unlimited so that I can review the content easily.

In [1]:
import requests
import pandas as pd
import time

In [2]:
# credit to my classmate Devin
pd.set_option('display.max_colwidth', 0)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

### Data Scraping

The Pushshift API has a limit of 100 posts per request, so I need multiple requests to retrieve enough data for my models. My goal is to retrieve around 10,000 posts for each subreddit, so I built a function to automate the process. The function does the following:

1. retrieves posts based on the subreddit name and the total number of posts that I pass in;
2. concatenates all the posts retrieved for a subreddit into one dataframe;
3. drops the posts that have been removed and thus show no selftext in the system.

I used a while loop for this, which makes retrieving the data much more efficient than issuing requests by hand. After each request, the function prints a status update showing how many legitimate (non-removed) posts I have retrieved so far, which makes it much easier to spot when something goes wrong. For example, during a test run I noticed that the total number of posts started to decrease as the requests went on, which does not make sense. I then realized that I had not reset the index after dropping the "removed" posts, so newly added posts kept being dropped.
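The index problem described above can be reproduced on toy data. The following is a minimal sketch, not the original failing code: the two small dataframes and their contents are made up for illustration. It shows why dropping by index label after `pd.concat` can remove more rows than intended, and two ways around it.

```python
import pandas as pd

# Two toy "batches"; the second row of the first batch is a removed post
batch_1 = pd.DataFrame({'selftext': ['joke a', '[removed]', 'joke b']})
batch_2 = pd.DataFrame({'selftext': ['joke c', 'joke d', 'joke e']})

# Without a reset, concat keeps each batch's own labels: 0, 1, 2, 0, 1, 2
stacked = pd.concat([batch_1, batch_2])
removed_labels = stacked[stacked['selftext'] == '[removed]'].index
buggy = stacked.drop(removed_labels)  # drops every row labelled 1, including 'joke d'

# Resetting the index first makes the labels unique, so only the removed row goes
stacked_reset = stacked.reset_index(drop=True)
fixed = stacked_reset.drop(
    stacked_reset[stacked_reset['selftext'] == '[removed]'].index
)

# A boolean mask avoids label lookups altogether
masked = stacked[stacked['selftext'] != '[removed]'].reset_index(drop=True)

print(len(buggy), len(fixed), len(masked))  # 4 5 5
```

The boolean mask is arguably the simpler option, since it never depends on the index labels being unique.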

I set the cutoff for the latest post to 10/10/2020 midnight PDT, one week before the data was retrieved. I did this because I believe posts of a certain "age" may better represent the subreddits: new posts are usually still subject to review by reddit and tend not to have enough interactions (comments, upvotes, etc.).
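The cutoff is passed to Pushshift as a Unix timestamp. As a quick sanity check, the sketch below derives that number from the date using only the standard library. It assumes PDT can be treated as a fixed UTC-7 offset, which holds for this single date; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# PDT is UTC-7; a fixed offset is enough for a single known date
PDT = timezone(timedelta(hours=-7), name='PDT')

cutoff = datetime(2020, 10, 10, 0, 0, 0, tzinfo=PDT)
cutoff_epoch = int(cutoff.timestamp())
print(cutoff_epoch)  # 1602313200

# Converting back confirms the value used for the 'before' parameter
round_trip = datetime.fromtimestamp(cutoff_epoch, tz=PDT)
print(round_trip.isoformat())  # 2020-10-10T00:00:00-07:00
```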

At the end of each loop iteration, I pause for 10 seconds using the time.sleep function, out of respect for the API community.

In [3]:
def get_data(subreddit, total_size):
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
        'subreddit': subreddit,
        'size': 100, # the maximum number of posts Pushshift allows per request
        'before': 1602313200  # 10/10/2020 00:00:00 PDT
    }
    df = pd.DataFrame()
    total = 0 # number of non-removed posts collected so far
    times = 0 # number of successful requests
    while total < total_size:
        res = requests.get(url, params=params)
        if res.status_code == 200:
            times += 1
            posts = res.json()['data']
            if not posts:
                # nothing left to fetch; stop instead of looping forever
                print('No more posts returned, stopping early')
                break
            # page backwards: the next request starts before the last post of this batch
            # (posts[-1] also works when fewer than 100 posts come back)
            params['before'] = posts[-1]['created_utc']
            df_po = pd.DataFrame(posts)
            df = pd.concat([df, df_po])
            # reset the index before dropping so that labels are unique across batches
            df.reset_index(drop = True, inplace = True)
            df = df.drop(df[df['selftext'] == '[removed]'].index)
            df.reset_index(drop = True, inplace = True)
            total = df.shape[0]
            print (f'This is no.{times} batch, status: {res.status_code}, total {total} data retrieved')

        else:
            print(res.status_code)
        time.sleep(10)
    return df

In [5]:
jokes = get_data('jokes', 10000)

This is no.1 batch, status: 200, total 84 data retrieved
This is no.2 batch, status: 200, total 170 data retrieved
This is no.3 batch, status: 200, total 258 data retrieved
This is no.4 batch, status: 200, total 344 data retrieved
This is no.5 batch, status: 200, total 428 data retrieved
This is no.6 batch, status: 200, total 508 data retrieved
This is no.7 batch, status: 200, total 597 data retrieved
This is no.8 batch, status: 200, total 685 data retrieved
This is no.9 batch, status: 200, total 771 data retrieved
This is no.10 batch, status: 200, total 859 data retrieved
This is no.11 batch, status: 200, total 946 data retrieved
This is no.12 batch, status: 200, total 1027 data retrieved
This is no.13 batch, status: 200, total 1115 data retrieved
This is no.14 batch, status: 200, total 1203 data retrieved
This is no.15 batch, status: 200, total 1288 data retrieved
This is no.16 batch, status: 200, total 1370 data retrieved
This is no.17 batch, status: 200, total 1456 data retrieved
T

It took me 115 requests to get 10,000+ posts that contain actual text, which suggests that posts in the jokes subreddit tend to stay up.

In [6]:
tales = get_data('talesfromretail', 10000)

This is no.1 batch, status: 200, total 34 data retrieved
This is no.2 batch, status: 200, total 68 data retrieved
This is no.3 batch, status: 200, total 97 data retrieved
This is no.4 batch, status: 200, total 125 data retrieved
This is no.5 batch, status: 200, total 161 data retrieved
This is no.6 batch, status: 200, total 190 data retrieved
This is no.7 batch, status: 200, total 215 data retrieved
This is no.8 batch, status: 200, total 246 data retrieved
This is no.9 batch, status: 200, total 270 data retrieved
This is no.10 batch, status: 200, total 306 data retrieved
This is no.11 batch, status: 200, total 331 data retrieved
This is no.12 batch, status: 200, total 356 data retrieved
This is no.13 batch, status: 200, total 399 data retrieved
This is no.14 batch, status: 200, total 435 data retrieved
This is no.15 batch, status: 200, total 468 data retrieved
This is no.16 batch, status: 200, total 515 data retrieved
This is no.17 batch, status: 200, total 551 data retrieved
This is n

In contrast, it took me 203 requests to retrieve enough posts for the tales from retail subreddit, which suggests that many posts there get removed. This might be because posts in tales from retail are usually real-life stories, and original posters may change their minds about sharing personal experiences after a while. It could also be that posts belonging to other subreddits were submitted to the wrong group, which is plausible since there are many "tales from ..." subreddits.
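To put rough numbers on this comparison, the share of posts that survived can be estimated from the request counts and the final row counts reported by the notebook (10,008 for jokes and 10,030 for tales). This is a back-of-the-envelope sketch: it assumes every request returned the full 100 posts, and `kept_share` is a hypothetical helper, not part of the original notebook.

```python
def kept_share(kept_posts, n_requests, per_request=100):
    """Estimate the fraction of fetched posts that were not '[removed]'.

    Assumes each request returned exactly `per_request` posts.
    """
    return kept_posts / (n_requests * per_request)

jokes_share = kept_share(10008, 115)
tales_share = kept_share(10030, 203)

print(f'jokes: {jokes_share:.0%} kept')  # jokes: 87% kept
print(f'tales: {tales_share:.0%} kept')  # tales: 49% kept
```

Under that assumption, roughly half of the tales from retail posts were removed, compared with about one in eight for jokes.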

In [16]:
print(f'Jokes shape {jokes.shape}')
print(f'Tales shape {tales.shape}')

Jokes shape (10008, 73)
Tales shape (10030, 88)


After scraping, I have 10,000+ legitimate posts for both groups, which I consider enough for the later stages. However, while comparing the shapes of the two datasets, I noticed that tales from retail has many more columns, which made me want to look into it.

### Initial Feature Selection

Since my problem statement is about classifying posts, many of the features I retrieved from the Pushshift API may not be relevant. I therefore trimmed the features down, keeping only those that will be helpful for my EDA and modeling. As mentioned just above, the jokes and tales datasets also differ in their columns, so I started by examining the exact columns of both groups.

In [9]:
jokes.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'removed_by_category', 'retrieved_on', 'score', 'selftext',
       'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title',
       'total_awards_received', 'treatment

In [10]:
tales.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_css_class',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts',
       'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'title', 'total_awards_received',
       'treatmen

I used set to find the columns that tales has but jokes doesn't. They don't seem very relevant to my analysis, so I decided not to add them to jokes.

In [33]:
set(tales.columns) - set(jokes.columns)

{'author_created_utc',
 'author_flair_template_id',
 'category',
 'content_categories',
 'distinguished',
 'gilded',
 'media_embed',
 'og_description',
 'og_title',
 'removal_reason',
 'removed_by',
 'secure_media_embed',
 'steward_reports',
 'suggested_sort',
 'updated_utc'}
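The set difference above only looks in one direction. A fuller comparison would also check for columns that exist only in jokes and list the columns both datasets share. The sketch below shows the idea on two small stand-in dataframes; the column names are picked for illustration, and `a` and `b` are not the real `jokes` and `tales`.

```python
import pandas as pd

# Toy stand-ins with made-up column sets
a = pd.DataFrame(columns=['title', 'selftext', 'score'])
b = pd.DataFrame(columns=['title', 'selftext', 'score', 'gilded', 'category'])

only_in_b = set(b.columns) - set(a.columns)   # columns b has but a lacks
only_in_a = set(a.columns) - set(b.columns)   # columns a has but b lacks
shared = a.columns.intersection(b.columns)    # columns present in both

print(sorted(only_in_b))  # ['category', 'gilded']
print(sorted(only_in_a))  # []
print(sorted(shared))     # ['score', 'selftext', 'title']
```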

I handpicked the features that I believe are important to my EDA and modeling. The most important among them are, of course, the text columns, namely title and selftext. I also included score, number of comments, upvote ratio, and created utc so that I can get an idea of what kind of post each one is and better understand the context of the analysis.

In [26]:
features = ['author', 'title', 'selftext', 'score', 'num_comments', 'upvote_ratio', 'created_utc', 'subreddit']

In [27]:
jokes_pick = jokes[features]
tales_pick = tales[features]

One more thing I want to check is whether there are any duplicates in the text columns. If there are, I drop them.

In [28]:
jokes_pick = jokes_pick.drop_duplicates('title', keep = 'first')
jokes_pick = jokes_pick.drop_duplicates('selftext', keep = 'first')
jokes_pick.reset_index(drop = True, inplace = True)

In [29]:
tales_pick = tales_pick.drop_duplicates('title', keep = 'first')
tales_pick = tales_pick.drop_duplicates('selftext', keep = 'first')
tales_pick.reset_index(drop = True, inplace = True)
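Before dropping, it can be useful to count how many duplicates each column actually contains. The following sketch does this on a made-up four-row dataframe (`toy` is illustrative, not the scraped data). It also shows that the two `drop_duplicates` calls run in sequence, so a row flagged as a selftext duplicate in the full data may survive once its earlier twin has already been removed by the title pass.

```python
import pandas as pd

toy = pd.DataFrame({
    'title':    ['A', 'A', 'B', 'C'],
    'selftext': ['x', 'y', 'y', 'z'],
})

# Count duplicates per column on the untouched data
dup_titles = toy.duplicated('title', keep='first').sum()
dup_texts = toy.duplicated('selftext', keep='first').sum()
print(dup_titles, dup_texts)  # 1 1

# Same two-step dedup as in the cells above
deduped = (
    toy.drop_duplicates('title', keep='first')
       .drop_duplicates('selftext', keep='first')
       .reset_index(drop=True)
)
print(len(deduped))  # 3
```

Here only one row is dropped even though each column reports one duplicate, because removing the second 'A' title also removes the first of the two 'y' texts.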

In [30]:
print(f'Jokes new shape {jokes_pick.shape}')
print(f'Tales new shape {tales_pick.shape}')

Jokes new shape (9526, 8)
Tales new shape (9434, 8)


There I have it. There don't seem to be many duplicate posts in either subreddit. With around 9,500 posts for each subreddit, I'm comfortable moving forward to the next stage of the project.

### Export Data

In [31]:
jokes_pick.to_csv('../data/jokes.csv', index = False)
tales_pick.to_csv('../data/talesfromretail.csv', index = False)
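One thing worth keeping in mind when these files are loaded in the next notebook: with default settings, `pd.read_csv` turns empty strings into NaN. The sketch below illustrates this with a toy dataframe written to a temporary directory. The file name and data are made up, and whether the scraped posts contain any empty selftext values is an assumption, not something this notebook checks.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'title': ['A', 'B'], 'selftext': ['some text', '']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')
    toy.to_csv(path, index=False)

    default_read = pd.read_csv(path)                      # '' comes back as NaN
    kept_blank = pd.read_csv(path, keep_default_na=False) # '' stays ''

print(default_read['selftext'].isna().sum())  # 1
print(kept_blank['selftext'].tolist())        # ['some text', '']
```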