# Project 3: Web APIs & NLP

## Contents:
- [Problem Statement](#Problem-Statement)
- [Setup](#Setup)
- [Query processing for Pushshift's API](#Query-processing-for-Pushshift's-API)
- [Next Steps](#Next-Steps)

## Problem Statement
Reddit is a massive collection of forums where registered members can share news and content or comment on other members' posts. Posts are organised by subject into user-created boards called "communities" or "subreddits", which cover topics such as news, politics, religion, science, movies, video games, music, books, sports, fitness, cooking, pets, and image-sharing. As aspiring data scientists who have just completed a lesson on natural language processing (NLP), we want to find out whether NLP can classify posts from two different subreddits based on their title and text content. Such a classification model could help Reddit as a company, or subreddit moderators, to make sure that the posts shown on a subreddit are relevant to its community.


### Background

For this project, we'll be classifying posts from two similar subreddits:
- r/GME: A subreddit where Reddit users share and discuss content related to GameStop ($GME) stock
- r/dogecoin: A subreddit where Reddit users share and discuss content related to Dogecoin

Reddit has made various headlines, most notably when its community helped pump GameStop’s share price to triple-digit growth in late January 2021. This mania later spread to Dogecoin, which saw massive price run-ups *([source](https://www.coindesk.com/investors-pump-250m-into-reddit-following-social-media-sites-prominent-role-in-gamestop-mania))*. Using NLP to analyse the posts from the two subreddits may help to shed some light on the GameStop and Dogecoin mania.

This is a supervised learning task, since the labels (the expected output, i.e. a binary representation of the subreddit name) are provided. It is also a classification task, since we are predicting a discrete class label. More specifically, it is a binary classification problem: the goal is to build a classifier that distinguishes between just two classes, namely whether a post is from r/GME or r/dogecoin.
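
As a minimal sketch of what "binary representation of subreddit name" could look like, the snippet below maps the `subreddit` column to a 0/1 target. The rows are toy placeholders rather than collected posts, and the column name `is_gme` and the choice of 1 for r/GME are assumptions made for illustration only.

```python
import pandas as pd

# Toy rows for illustration only; these are not posts from the collected data.
posts = pd.DataFrame({
    "subreddit": ["GME", "dogecoin", "GME"],
    "title": ["example title a", "example title b", "example title c"],
})

# Hypothetical encoding: 1 for r/GME, 0 for r/dogecoin.
label_map = {"GME": 1, "dogecoin": 0}
posts["is_gme"] = posts["subreddit"].map(label_map)

print(posts[["subreddit", "is_gme"]])
```

Which class is encoded as 1 matters for precision and recall, since both are computed with respect to the positive class.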


### Performance Measure
We will evaluate the performance of our model using F1-score as the North Star metric.

The equation to compute F1-score:
![f1-score.png](../assets/f1-score.png)


The F1-score is the harmonic mean of Precision and Recall. It therefore takes both False Positives and False Negatives into account and strikes a balance between Precision and Recall.

Our goal is to get the F1-score as close to 1 as possible. This means keeping both False Positives and False Negatives low, so that posts are classified to the correct subreddit and misclassifications are minimised.
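
To make the arithmetic concrete, here is a small sketch that computes Precision, Recall, and the F1-score directly from confusion-matrix counts. The helper name `f1_from_counts` and the counts passed to it are hypothetical and chosen only to illustrate the calculation; they are not results from this project.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall, f1 = f1_from_counts(tp=80, fp=20, fn=10)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot reach a high F1-score by maximising only one of Precision or Recall.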

## Setup
All libraries used in this notebook are imported here.

In [1]:
# import libraries
import os
import pandas as pd
import time
import requests
from datetime import datetime

In [2]:
# path to store the dataset downloaded
output_path = '../data'
os.makedirs(output_path, exist_ok=True)

Define a function to collect posts from a subreddit using Pushshift's API.

*Note: Pushshift returns at most 100 posts per request.*

In [3]:
def get_submissions(subreddit, start_datetime, end_datetime):
    """
    Get Reddit submissions from the Pushshift API within a specified date range.
    Read more: https://github.com/pushshift/api
    
    Args:
    subreddit(string): Subreddit name.
    start_datetime(string): Start datetime in UTC. Acceptable string format is in "dd/mm/yyyy hh:mm:ss".
    end_datetime(string): End datetime in UTC. Acceptable string format is in "dd/mm/yyyy hh:mm:ss".
    
    Returns:
    A dataframe of text-only posts from subreddit.
           
    Raises:
    HTTPError when request status is not a 200.
    """

    # base url
    url = "https://api.pushshift.io/reddit/search/submission"
    
    # create an empty list to hold the dataframes
    df_list = []
    
    # convert start_datetime & end_datetime into epoch timestamps
    epoch = datetime(1970, 1, 1)
    start_datetime = int((datetime.strptime(start_datetime, "%d/%m/%Y %H:%M:%S") - epoch).total_seconds())
    end_datetime = int((datetime.strptime(end_datetime, "%d/%m/%Y %H:%M:%S") - epoch).total_seconds())
    
    while end_datetime > start_datetime:
        res = requests.get(url,
                           # query parameters
                           params={"subreddit": subreddit,
                                   "size": 100,
                                   "after": start_datetime,
                                   "before": end_datetime,
                                   # return text-only posts
                                   "is_self": True})
        try:
            # if the response was successful, no exception will be raised
            res.raise_for_status()
            
        except requests.exceptions.HTTPError as e:
            # not a 200
            print("Error: " + str(e))
            raise
        
        else:
            # runs only if no exception was raised
            print('Fetching data from {}'.format(res.url))
            payload = res.json()
            # flatten the nested dictionary
            df = pd.json_normalize(payload['data'])
            
            if len(df) > 0:
                # select the required columns; .copy() avoids a SettingWithCopyWarning on the assignment below
                df = df[['id', 'full_link', 'author', 'created_utc', 'subreddit', 'selftext', 'title', 'num_comments', 'score']].copy()
                # update start_datetime to loop forward
                start_datetime = int(df['created_utc'].max())
                # convert epoch time to readable time
                df['created_utc'] = pd.to_datetime(df['created_utc'], unit='s')
                # append to list
                df_list.append(df)
            
                # pause for 3 seconds before the next pull of 100 posts
                time.sleep(3)
                
            else:
                break
            
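    print('---')
    print('Task Completed')
    # pd.concat raises a ValueError on an empty list, so return an empty dataframe instead
    if not df_list:
        return pd.DataFrame()
    # ignore_index gives the combined dataframe one continuous index
    return pd.concat(df_list, axis=0, ignore_index=True)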
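
The date strings passed to `get_submissions` are interpreted as UTC and converted to epoch seconds. The sketch below shows an equivalent, timezone-aware way to do that conversion with the standard library; the helper name `to_epoch_utc` is hypothetical and not part of the function above.

```python
from datetime import datetime, timezone


def to_epoch_utc(dt_string, fmt="%d/%m/%Y %H:%M:%S"):
    """Parse a UTC datetime string and return whole seconds since the Unix epoch."""
    parsed = datetime.strptime(dt_string, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


print(to_epoch_utc("21/01/2021 00:00:00"))  # 1611187200
print(to_epoch_utc("02/02/2021 23:59:59"))  # 1612310399
```

These two values are the `after` and `before` parameters that appear in the first r/GME request URL logged in this notebook.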

## Query processing for Pushshift's API
Fetch data from r/GME, the GameStop stock subreddit.
- We are interested in collecting all posts from 21st Jan 2021 to 2nd Feb 2021 because that was the period in which GameStop shares surged as the stock gained popularity.

In [4]:
# specify params
subreddit1 = 'GME'
start_datetime1 = '21/01/2021 00:00:00'
end_datetime1 = '02/02/2021 23:59:59'

# query result to dataframe
GME_df = get_submissions(subreddit1, start_datetime1, end_datetime1)

Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611187200&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611393367&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611576757&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611595388&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611620490&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611672634&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611760548&before=1612310399&is_self=True
Fetching data from h

Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612238085&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612264710&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612274843&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612277728&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612280448&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612283006&before=1612310399&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612287321&before=1612310399&is_self=True
Fetching data from h

In [5]:
print('Our dataset has {} rows and {} columns.'.format(GME_df.shape[0], GME_df.shape[1]))

Our dataset has 6805 rows and 9 columns.


In [6]:
# check the first five rows
display(GME_df.head())

| | id | full_link | author | created_utc | subreddit | selftext | title | num_comments | score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | l1n3vd | https://www.reddit.com/r/GME/comments/l1n3vd/g... | Jeffamazon | 2021-01-21 00:52:21 | GME | Just stumbled upon this sub. Didn't know it ex... | Greetings GME Gang | 55 | 1 |
| 1 | l1o60a | https://www.reddit.com/r/GME/comments/l1o60a/h... | stoney-the-tiger | 2021-01-21 01:48:42 | GME | [removed] | Help Make Q4 Great | 0 | 1 |
| 2 | l1wi6q | https://www.reddit.com/r/GME/comments/l1wi6q/r... | B1ake1 | 2021-01-21 11:13:54 | GME | HOLD THE LINES \n\n\n120 shares @27 | Remember lads, scared money don't make money. | 0 | 1 |
| 3 | l22dsa | https://www.reddit.com/r/GME/comments/l22dsa/h... | Dustin_James_Kid | 2021-01-21 16:55:19 | GME | I’m new and trying to learn. This stock scares... | How do we know when the squeeze has happened? | 8 | 1 |
| 4 | l22r5n | https://www.reddit.com/r/GME/comments/l22r5n/w... | MailNurse | 2021-01-21 17:11:48 | GME | Price is sub 40 now. :(. | WHERE ARE THE FUCKING REINFORCEMENTS | 20 | 1 |


In [7]:
# check for duplicates
print('Our dataset has {} duplicated rows.'.format(GME_df.duplicated(subset=['id']).sum()))

Our dataset has 0 duplicated rows.


In [8]:
print('Time range of', subreddit1 ,'posts collected.')
print('Min:', GME_df['created_utc'].min())
print('Max:', GME_df['created_utc'].max())

Time range of GME posts collected.
Min: 2021-01-21 00:52:21
Max: 2021-02-02 23:55:59


From experience, it is safer to guard the export with a function so that re-running the notebook does not overwrite the data originally collected.

In [9]:
# create a function to prevent the files from being overwritten
def df_to_csv(df, filename, export=False):
    """Write df to output_path/filename only when export is True."""
    if export:
        df.to_csv(os.path.join(output_path, filename), index=False)
        print('File has been exported. Existing file has been overwritten.')
    else:
        print('File has not been exported. Existing file has not been overwritten.')

In [10]:
# save data
df_to_csv(GME_df, 'GME.csv', export=False) # set export=True only when we want the existing file to be overwritten

File has not been exported. Existing file has not been overwritten.
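
The `df_to_csv` helper above relies on remembering to leave `export=False`. As an alternative sketch, the guard can check whether the target file already exists and refuse to overwrite it unless asked explicitly. The function name `save_csv_if_absent`, its `overwrite` flag, and the toy dataframe are assumptions for illustration and are not used elsewhere in this notebook.

```python
import os
import tempfile

import pandas as pd


def save_csv_if_absent(df, path, overwrite=False):
    """Write df to path unless the file exists and overwrite is False.

    Returns True if the file was written, False otherwise.
    """
    if os.path.exists(path) and not overwrite:
        return False
    df.to_csv(path, index=False)
    return True


# Toy demonstration in a temporary directory, so no project files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "example.csv")
    toy_df = pd.DataFrame({"id": ["a1", "b2"]})
    first_write = save_csv_if_absent(toy_df, target)    # file does not exist yet
    second_write = save_csv_if_absent(toy_df, target)   # file exists, so it is skipped
    forced_write = save_csv_if_absent(toy_df, target, overwrite=True)

print(first_write, second_write, forced_write)
```

With this variant the first run of the notebook writes the file, and later runs leave it untouched without any argument needing to be toggled.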


Fetch data from r/dogecoin, the Dogecoin subreddit.
- Because fetching the initially selected time period took too long, we limit the collection to posts from a single day, 5th May 2021.

In [11]:
# specify params
subreddit2 = 'dogecoin'
start_datetime2 = '05/05/2021 00:00:00'
end_datetime2 = '05/05/2021 23:59:59'

# query result to dataframe
dogecoin_df = get_submissions(subreddit2, start_datetime2, end_datetime2)

Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620172800&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620174293&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620175666&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620177049&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620177977&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620178841&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620179763&before=162025919

Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620220279&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620221168&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620222030&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620222887&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620223813&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620224480&before=1620259199&is_self=True
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620225392&before=162025919

In [12]:
print('Our dataset has {} rows and {} columns.'.format(dogecoin_df.shape[0], dogecoin_df.shape[1]))

Our dataset has 10750 rows and 9 columns.


In [13]:
# check the first five rows
display(dogecoin_df.head())

| | id | full_link | author | created_utc | subreddit | selftext | title | num_comments | score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | n525ha | https://www.reddit.com/r/dogecoin/comments/n52... | Ok_Salt_7206 | 2021-05-05 00:00:38 | dogecoin | Only serious guys please!\n\nPlease pm\n\nP.S.... | 6000 dogecoins for $83. Pm Fast who need | 10 | 1 |
| 1 | n5262o | https://www.reddit.com/r/dogecoin/comments/n52... | Ruskgodkrewdoge | 2021-05-05 00:01:26 | dogecoin | | When doge passes eth the coin will be worth 4 ... | 32 | 2 |
| 2 | n5263b | https://www.reddit.com/r/dogecoin/comments/n52... | PingPing01 | 2021-05-05 00:01:27 | dogecoin | [removed] | Buy HODL this is what we need | 0 | 1 |
| 3 | n526uy | https://www.reddit.com/r/dogecoin/comments/n52... | T1DLiving | 2021-05-05 00:02:29 | dogecoin | Hey fellow shibes, my birthday is in a couple ... | Birthday in a couple days | 3 | 1 |
| 4 | n526w3 | https://www.reddit.com/r/dogecoin/comments/n52... | Malbec177 | 2021-05-05 00:02:31 | dogecoin | Everyone should wish Elon Musks son Little X a... | Happy birthday to Elons son Little X! | 1 | 1 |


In [14]:
# check for duplicates
print('Our dataset has {} duplicated rows.'.format(dogecoin_df.duplicated(subset=['id']).sum()))

Our dataset has 0 duplicated rows.


In [15]:
print('Time range of', subreddit2 ,'posts collected.')
print('Min:', dogecoin_df['created_utc'].min())
print('Max:', dogecoin_df['created_utc'].max())

Time range of dogecoin posts collected.
Min: 2021-05-05 00:00:38
Max: 2021-05-05 23:59:50


In [16]:
# save data
df_to_csv(dogecoin_df, 'dogecoin.csv', export=False) # set export=True only when we want the existing file to be overwritten

File has not been exported. Existing file has not been overwritten.


## Next Steps

In the [next notebook](02_Preprocessing_and_Modelling.ipynb), we'll explore, clean, and preprocess the data collected. Finally, we'll compare the performance of several models to find the best-performing one.