# Project 3: Subreddit Classification with Pushshift API and NLP

## Part I - Project Intro and Data Cleaning

Author: Charles Ramey

Date: 04/02/2023

---

## Problem Statement

Hockey For Everyone ("HFE") is a fictional startup company whose mission is to design and sell affordable, high-quality hockey equipment that meets the needs of players of all backgrounds and abilities. This project aims to determine the most effective way to market hockey gear to Reddit users by analyzing two subreddits: r/hockey and r/hockeyplayers. The goal is to identify which subreddit tends to focus more on non-professional hockey by analyzing common words and phrases used in submissions to each subreddit. The project explores data scraped from Reddit using the Pushshift API and tests a variety of classification machine learning models to accurately classify submissions by subreddit, providing the HFE marketing team with a tool to test advertisement language before rollout.

#### Notebook Links

Part II - Exploratory Data Analysis (EDA)
- [`Part-2_eda.ipynb`](../code/Part-2_eda.ipynb)

Part III - Modeling
- [`Part-3_modeling.ipynb`](../code/Part-3_modeling.ipynb)

Part IV - Conclusion, Recommendations, and Sources
- [`Part-4_conclusion-and-recommendations.ipynb`](../code/Part-4_conclusion-and-recommendations.ipynb)

### Contents

- [Background](#Background)
- [Data Import](#Data-Import)
- [Cleaning](#Cleaning)

## Background

Reddit is a popular social media platform with a large and diverse user base. Companies often use Reddit to market their products to specific communities, or subreddits. For HFE, identifying the most relevant subreddit to target is critical to the success of their marketing campaign. By analyzing the language used in submissions to [r/hockey](https://www.reddit.com/r/hockey/) and [r/hockeyplayers](https://www.reddit.com/r/hockeyplayers/), the company can gain insight into the interests and needs of its potential customers. The project will use [natural language processing (NLP)](https://www.ibm.com/topics/natural-language-processing) to identify distinguishing patterns and trends in the language used in the two hockey-centric subreddits.

Data from the two subreddits is collected using the [Pushshift API](https://medium.com/mcd-unison/using-pushshift-api-for-data-analysis-on-reddit-b08d339c48b8) and the Python [`requests`](https://requests.readthedocs.io/) library. Pushshift is a platform that collects and archives a variety of Reddit data, from the number of posts in a subreddit to the number of upvotes a comment has received. Users of the API can obtain the public data they are searching for by combining the appropriate search endpoint URL with a number of different [query parameters](http://api.pushshift.io/redoc#operation/search_reddit_subreddits_reddit_search_subreddit_get). In this project, approximately 200,000 of the most recent submissions were scraped from the two subreddits up until March 28, 2023 at 12:00pm. The data was then cleaned and condensed to roughly 63,000 submissions, which were then used to train and test a variety of classification machine learning models.
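
As a minimal sketch of how an endpoint URL and query parameters combine into a single request URL, the snippet below builds the query string with the standard library. The endpoint and parameter names are the ones used later in this notebook; no request is actually sent, so the sketch runs offline.

```python
from urllib.parse import urlencode

# Endpoint and parameters as used later in this notebook
base_url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit': 'hockey',   # which subreddit to search
    'limit': 1000,           # submissions per request
    'until': 1680019200,     # only submissions created before this Unix epoch
    'sort': 'created_utc',   # sort by creation time
    'order': 'desc',         # newest first
}

# urlencode turns the dictionary into "key=value&key=value..."
query_url = f"{base_url}?{urlencode(params)}"
print(query_url)
```

Passing the same dictionary to `requests.get(base_url, params=params)` produces an equivalent URL, which is the approach taken in the Data Import section.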

### Disclaimer

During the creation of this notebook and the analysis contained herein, the author frequently referenced available documentation for imported libraries. Much of the coding syntax within this notebook is derived from various online sources, including Stack Overflow, other public forums, and OpenAI's ChatGPT. The author does not claim the following code as fully original, and sources are cited where possible.

### Library Imports

In [1]:
import numpy as np
import pandas as pd

import requests
import time
import json
from datetime import datetime

## Data Import

This section will use the Pushshift API to collect the initial uncleaned data for both subreddits. The analysis will focus only on submissions (posts) and will not consider comments. Thus, the Pushshift endpoint for submission data is used as the base URL to which additional query parameters will be appended.

In [2]:
# establish base url for submission endpoint
base_url = 'https://api.pushshift.io/reddit/search/submission'
base_url

'https://api.pushshift.io/reddit/search/submission'

For submission timestamps, the Pushshift API works in [Unix epoch time](https://www.epochconverter.com/), the number of seconds elapsed since January 1, 1970 at 00:00:00 coordinated universal time (UTC). The code cells below establish three things:

1. `current_time` stores the Unix epoch timestamp that corresponds to the time at which this notebook was last run.
2. `before_time` defines the most recent point of the window over which this project analyzes submissions; collection starts here and works backwards in time. As previously stated, this Unix epoch timestamp corresponds to March 28, 2023 at 12:00pm in US Eastern Daylight Time (UTC-4), which is 16:00 UTC.
3. The following cell prints out the number of days after March 28, 2023 at 12:00pm that this notebook is being run. This is to give the viewer a relative idea of the time frame of consideration.
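
To make the epoch value less opaque, the sketch below converts `before_time` to a readable datetime and back using only the standard library. The fixed UTC-4 offset is an assumption that the "12:00pm" in the text refers to US Eastern Daylight Time.

```python
from datetime import datetime, timedelta, timezone

before_time = 1680019200  # value used in this notebook

# Interpret the epoch as a timezone-aware UTC datetime
utc_dt = datetime.fromtimestamp(before_time, tz=timezone.utc)

# Assumption: "12:00pm" refers to US Eastern Daylight Time (UTC-4)
eastern = timezone(timedelta(hours=-4))
local_dt = utc_dt.astimezone(eastern)

print(utc_dt.isoformat())    # 2023-03-28T16:00:00+00:00
print(local_dt.isoformat())  # 2023-03-28T12:00:00-04:00

# Going the other way: aware datetime -> epoch seconds
round_trip = int(local_dt.timestamp())
print(round_trip == before_time)  # True
```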

In [3]:
# Check the current time (as unix epoch timestamp)
current_time = round(time.time())
current_time

1680456746

In [4]:
# Set start time for submission collection
# For this project, we will look at the submissions leading up to
# March 28, 2023 at 12:00pm, noon.
# The Unix epoch for this datetime is 1680019200
before_time = 1680019200
before_time

1680019200

In [5]:
print(f"This notebook was run {round((current_time-before_time)/(24*60*60), 2)} days after the studied time frame.")

This notebook was run 5.06 days after the studied time frame.


In [6]:
# This cell will limit dataframe displays to 10 columns for readability
pd.set_option('display.max_columns', 10)

#### r/hockey

The following code cells call the Pushshift API in a loop to collect the necessary submission data for the r/hockey subreddit. The data is first gathered in a list and then converted to a Pandas dataframe under the alias `h`, for r/`h`ockey.

In [7]:
'''
Code adapted from Devin Faye, General Assembly
'''
# initializing list to collect raw submissions for r/hockey (converted to a dataframe later)
h = []

# initializing variable to store the total time required to scrape the data
total_time = 0

# setting query parameters for Pushshift API request
params_h = {
    'subreddit' : 'hockey', # select submissions from r/hockey
    'limit' : 1000, # limit selection to 1000 submissions
    'until' : before_time, # start scraping at 03-28-2023 12:00:00 and work backwards in time
    'sort' : 'created_utc', # sort the data time-wise
    'order' : 'desc' # order in descending order (most recent to oldest)
}

# Creating a for loop to iterate over the 1000 submission limit until
# the desired number of submissions have been collected.
# The range is set to 150 to collect 150,000 submissions.
# This number was selected through iteration, such that the cleaned dataframes
# for both subreddits will contain approximately equal submission counts.
for _ in range(150):
    # setting a loop start time to calculate time per loop
    start = time.time()
    # Rather than print each response code to make sure the requests are working,
    # we stop the loop if the request fails or the response code signals an error
    try:
        # sending request to Pushshift URL with chosen query parameters
        res_h = requests.get(base_url, params=params_h)
        res_h.raise_for_status()
    except requests.RequestException as e:
        print("An exception occurred:", e)
        break
    # Parsing the response once; it holds up to 1000 submissions
    data = res_h.json()['data']
    # Stop if there are no older submissions left to collect
    if not data:
        break
    # Adding the pulled data to the master list, h
    h += data
    # Moving the end of the scraping window back to the oldest submission in this batch
    params_h['until'] = min(post['created_utc'] for post in data)
    # Setting a loop end time to calculate time per loop
    end = time.time()
    # Calculating time to scrape the max number of submissions (1000) for this iteration
    time_to = end - start
    # Adding the scrape time of this loop to the total
    total_time += time_to
    # Wait for one second before making the next request to avoid overloading the API
    time.sleep(1)
        
# Print the number of submissions retrieved and the rate of retrieval
print(f"Retrieved {len(h)} total submissions with a scrape time of {total_time / (len(h) / 1000)} seconds per 1000 submissions.")

Retrieved 149940 total submissions with a scrape time of 3.3522249921824403 seconds per 1000 submissions.


In [8]:
# convert the result to a pandas dataframe
h = pd.DataFrame(h)

In [9]:
# verify that there are no duplicates
h.shape, h.id.nunique()

((149940, 106), 149940)
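
The scraping loop above pages backwards through time by repeatedly moving `until` to the oldest `created_utc` seen so far. The sketch below isolates that pattern in a helper so it can be exercised without network access. The names `collect_submissions`, `fake_fetch`, and `ARCHIVE` are hypothetical, and the fake fetcher assumes `until` is exclusive, which is an assumption about the API rather than documented behaviour.

```python
def collect_submissions(fetch_page, subreddit, before, pages, page_size=1000):
    """Page backwards in time, at most `pages` requests of `page_size` posts."""
    params = {
        'subreddit': subreddit,
        'limit': page_size,
        'until': before,
        'sort': 'created_utc',
        'order': 'desc',
    }
    collected = []
    for _ in range(pages):
        batch = fetch_page(dict(params))  # pass a copy so the fetcher cannot mutate our state
        if not batch:
            break  # nothing older is available
        collected.extend(batch)
        # The next page ends where the oldest post of this batch begins
        params['until'] = min(post['created_utc'] for post in batch)
    return collected


# Hypothetical in-memory archive standing in for the API: 25 posts at times 1..25
ARCHIVE = [{'id': f'p{t}', 'created_utc': t} for t in range(1, 26)]

def fake_fetch(params):
    """Return up to `limit` posts created strictly before `until`, newest first."""
    older = [p for p in ARCHIVE if p['created_utc'] < params['until']]
    older.sort(key=lambda p: p['created_utc'], reverse=True)
    return older[:params['limit']]


posts = collect_submissions(fake_fetch, 'hockey', before=100, pages=5, page_size=10)
unique_ids = {p['id'] for p in posts}
print(len(posts), len(unique_ids))  # 25 25
```

In real use, `fetch_page` would wrap `requests.get(base_url, params=params)` and return `response.json()['data']`, and the same helper could serve both subreddits by changing only `subreddit` and `pages`.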

#### r/hockeyplayers

The following code cells call the Pushshift API in a loop to collect the necessary submission data for the r/hockeyplayers subreddit. The data is first gathered in a list and then converted to a Pandas dataframe under the alias `hp`, for r/`h`ockey`p`layers.

In [10]:
'''
Code adapted from Devin Faye, General Assembly
'''
# initializing list to collect raw submissions for r/hockeyplayers (converted to a dataframe later)
hp = []

# initializing variable to store the total time required to scrape the data
total_time = 0

# setting query parameters for Pushshift API request
params_hp = {
    'subreddit' : 'hockeyplayers', # select submission from r/hockeyplayers
    'limit' : 1000, # limit selection to 1000 submissions
    'until' : before_time, # start scraping at 03-28-2023 12:00:00 and work backwards in time
    'sort' : 'created_utc', # sort the data time-wise
    'order' : 'desc' # order in descending order (most recent to oldest)
}

# Creating a for loop to iterate over the 1000 submission limit until
# the desired number of submissions have been collected.
# The range is set to 50 to collect 50,000 submissions.
# This number was selected because the r/hockeyplayers subreddit
# has just over 50,000 total submissions since it was created.
for _ in range(50):
    # setting a loop start time to calculate time per loop
    start = time.time()
    # Rather than print each response code to make sure the requests are working,
    # we stop the loop if the request fails or the response code signals an error
    try:
        # sending request to Pushshift URL with chosen query parameters
        res_hp = requests.get(base_url, params=params_hp)
        res_hp.raise_for_status()
    except requests.RequestException as e:
        print("An exception occurred:", e)
        break
    # Parsing the response once; it holds up to 1000 submissions
    data = res_hp.json()['data']
    # Stop if there are no older submissions left to collect
    if not data:
        break
    # Adding the pulled data to the master list, hp
    hp += data
    # Moving the end of the scraping window back to the oldest submission in this batch
    params_hp['until'] = min(post['created_utc'] for post in data)
    # Setting a loop end time to calculate time per loop
    end = time.time()
    # Calculating time to scrape the max number of submissions (1000) for this iteration
    time_to = end - start
    # Adding the scrape time of this loop to the total
    total_time += time_to
    # Wait for one second before making the next request to avoid overloading the API
    time.sleep(1)
        
# Print the number of submissions retrieved and the rate of retrieval
print(f"Retrieved {len(hp)} total submissions with a scrape time of {total_time / (len(hp) / 1000)} seconds per 1000 submissions.")

Retrieved 49987 total submissions with a scrape time of 2.368637295739617 seconds per 1000 submissions.


In [11]:
# convert the result to a pandas dataframe
hp = pd.DataFrame(hp)

In [12]:
# verify that there are no duplicates
hp.shape, hp.id.nunique()

((49987, 122), 49987)

---
## Cleaning

In this section, the initial dataframes are cleaned and condensed, removing unwanted and unnecessary data. The r/hockey and r/hockeyplayers dataframes are ultimately reduced to two columns, `subreddit` and `text`, and combined into a single, common dataframe, `df`, which will be used for data exploration and modeling.

In [13]:
h.head(2)

| | subreddit | selftext | author_fullname | gilded | title | ... | tournament_data | event_end | event_is_live | event_start | removal_reason |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hockey | I haven't been able to find out whether he is ... | t2_dxf3o | 0 | Does anyone know Jordan Staals take on Pride j... | ... | | | | | |
| 1 | hockey | | t2_v2ee6yxk | 0 | Controversial ref call in the 6:th quarter fin... | ... | | | | | |


In [14]:
hp.head(2)

| | subreddit | selftext | author_fullname | gilded | title | ... | approved_at_utc | banned_at_utc | from_kind | from_id | from |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hockeyplayers | Has anyone tried the new Warrior Ritual X4 Pro... | t2_vipi2zpy | 0 | Warrior Goalie Pants | ... | | | | | |
| 1 | hockeyplayers | | t2_d1md0ej | 0 | Goalies, any tips for a player going in net fo... | ... | | | | | |


In [15]:
# Checking that hockey is the only class represented in the subreddit column
h['subreddit'].value_counts()

hockey                 149905
u_valhalla-hockey          27
u_rl-hockey-god             4
u_Howitzer-Hockey           1
u_All-things-hockey         1
u_gratts-hockey             1
u_All-Hockey                1
Name: subreddit, dtype: int64

Whether due to an issue with the API or some fault with the request code, a few submissions carry a `subreddit` label other than `hockey`. Each of these labels begins with `u_`, the prefix Reddit uses for posts made to a user's profile rather than to a community, so they do not belong to r/hockey. Since these 35 submissions make up only a small fraction of the data that was pulled, we will simply drop them.

In [16]:
# Using only submissions with the correct subreddit name
h = h[h['subreddit'] == 'hockey']

In [17]:
# Checking that hockeyplayers is the only class represented in the subreddit column
hp['subreddit'].value_counts()

hockeyplayers    49987
Name: subreddit, dtype: int64

In [18]:
# List the features with one or more null values and the number of null values they contain
h.isna().sum().sort_values(ascending = False).loc[lambda x: x > 0]

removal_reason                   149905
discussion_type                  149905
view_count                       149905
removed_by                       149905
content_categories               149905
event_start                      149904
event_is_live                    149904
event_end                        149904
tournament_data                  149904
top_awarded_type                 149880
distinguished                    149822
poll_data                        149796
call_to_action                   149735
category                         149735
author_cakeday                   149396
gallery_data                     149130
media_metadata                   148711
is_gallery                       148419
crosspost_parent_list            145943
crosspost_parent                 145943
suggested_sort                   144354
edited_on                        141575
link_flair_template_id           121884
link_flair_css_class             102029
link_flair_text                  101455


In [19]:
# From a quick review, none of these features will be relevant to this analysis,
# so they will be dropped from the dataframe before further cleaning
h = h.dropna(axis=1)

In [20]:
# List the features with one or more null values and the number of null values they contain
hp.isna().sum().sort_values(ascending = False).loc[lambda x: x > 0]

from               49987
likes              49987
link_flair_text    49987
view_count         49987
removed_by         49987
                   ...  
spoiler             4751
contest_mode        4387
domain                90
url                   90
archived              71
Length: 100, dtype: int64

In [21]:
# Dropping columns with null values
hp = hp.dropna(axis=1)
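
Note that `dropna(axis=1)` removes a column if it contains even one null value, so nearly complete columns in r/hockeyplayers (for example `url` with 90 nulls or `archived` with 71) are discarded along with the empty ones. That is harmless here because only `subreddit`, `title`, and `selftext` are used later. As an illustrative alternative, not what this notebook does, the sketch below contrasts the strict behaviour with the `thresh` argument on a hypothetical toy dataframe; the 50% cutoff is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete column, one nearly complete, one empty
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'url': ['u1', None, 'u3', 'u4'],   # one missing value
    'view_count': [np.nan] * 4,         # entirely missing
})

# Behaviour used in this notebook: any null value drops the whole column
strict = toy.dropna(axis=1)

# Alternative: keep columns that have at least 50% non-null values
lenient = toy.dropna(axis=1, thresh=int(0.5 * len(toy)))

print(list(strict.columns))   # ['title']
print(list(lenient.columns))  # ['title', 'url']
```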

In [22]:
# Checking for common columns between the two dataframes
common_cols = h.columns.intersection(hp.columns)

In [23]:
# From here on, we will only clean based on shared features
h = h.reindex(columns=common_cols)
hp = hp.reindex(columns=common_cols)

In [24]:
h.shape, hp.shape

((149905, 22), (49987, 22))

In [25]:
h.columns

Index(['subreddit', 'selftext', 'gilded', 'title', 'media_embed',
       'secure_media_embed', 'score', 'thumbnail', 'edited', 'is_self',
       'over_18', 'locked', 'subreddit_id', 'id', 'author', 'num_comments',
       'permalink', 'stickied', 'created_utc', 'retrieved_utc', 'updated_utc',
       'utc_datetime_str'],
      dtype='object')

The `is_self` field flags self posts, i.e. text posts written by the submitter, as opposed to link posts that point to other content. We want to use these posts to help minimize duplicates and focus on users who are posting original content relevant to them.

In [26]:
# Check to see how many posts are self (text) posts
h['is_self'].value_counts()

False    91264
True     58641
Name: is_self, dtype: int64

In [27]:
# Check to see how many posts are self (text) posts
hp['is_self'].value_counts()

True     35354
False    14633
Name: is_self, dtype: int64

In [28]:
# Let's drop the non-self submissions by keeping only rows where is_self is True
h = h[h['is_self']]
hp = hp[hp['is_self']]

In [29]:
# Now we'll reduce the dataframe to just the subreddit, title, and body text
# These are the fields that will be used for training our model in Part III
h = h[['subreddit', 'title', 'selftext']]
hp = hp[['subreddit', 'title', 'selftext']]

In [30]:
# To get the most text to train the model, 
# we will combine the title and selftext fields into a single text column
h['text'] = h['title'] + ' ' + h['selftext']
hp['text'] = hp['title'] + ' ' + hp['selftext']
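
In this notebook, columns containing nulls were dropped earlier, so `title` and `selftext` have no missing values and the concatenation above is safe. On data where `selftext` can be missing, `+` propagates the missing value and the whole `text` entry becomes NaN. The sketch below shows the difference on a hypothetical toy dataframe; filling with an empty string is one possible workaround, offered as an assumption rather than the author's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second post has no body text
toy = pd.DataFrame({
    'title': ['New skates', 'Stick flex question'],
    'selftext': ['Any recommendations?', np.nan],
})

# Plain concatenation: a missing body makes the combined text missing too
naive = toy['title'] + ' ' + toy['selftext']

# Filling missing values first keeps the title; strip() removes the trailing space
safe = (toy['title'].fillna('') + ' ' + toy['selftext'].fillna('')).str.strip()

print(naive.isna().tolist())  # [False, True]
print(safe.tolist())          # ['New skates Any recommendations?', 'Stick flex question']
```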

In [31]:
# Now that we've engineered a new feature from title and selftext,
# We will remove these fields from the dataframe
h = h.drop(columns=['title','selftext'])
hp = hp.drop(columns=['title','selftext'])

In [32]:
h.head()

| | subreddit | text |
|---|---|---|
| 0 | hockey | Does anyone know Jordan Staals take on Pride j... |
| 2 | hockey | [Game Thread][Hockey Federation of Ukraine Cha... |
| 3 | hockey | Elimination/Clinching Scenarios + Daily Free T... |
| 4 | hockey | Is McDavid's Contract Worth It? I'm guessing h... |
| 9 | hockey | How’s the Tank for Bedard? Who’s the favourite... |


In [33]:
hp.head()

| | subreddit | text |
|---|---|---|
| 0 | hockeyplayers | Warrior Goalie Pants Has anyone tried the new ... |
| 1 | hockeyplayers | Goalies, any tips for a player going in net fo... |
| 2 | hockeyplayers | Huge Hockey Facility up for Auction Lots and l... |
| 4 | hockeyplayers | Skates for kids with wide feet Hope I'm in the... |
| 6 | hockeyplayers | Gear for sled hockey Posting here for a friend... |


The most useful submissions are those that contain more text and context. We may still get some useful phrases from two-word posts (for example, "new skates"); however, one-word submissions are not expected to be especially useful, so they will be dropped.

In [34]:
# get a word count column
h['word_count'] = h['text'].apply(lambda x: len(x.split()))
hp['word_count'] = hp['text'].apply(lambda x: len(x.split()))

In [35]:
# drop submissions with one word or fewer
h = h[h['word_count'] > 1]
hp = hp[hp['word_count'] > 1]

In [36]:
# drop word count column since it is no longer needed
h = h.drop(columns='word_count')
hp = hp.drop(columns='word_count')

With the Pushshift API, it is possible to get submissions whose content has been deleted or removed; their text contains the placeholder `[deleted]` or `[removed]`. Not only are these submissions unhelpful, but they are also common, so the words "deleted" and "removed" would likely show up frequently in both subreddits. We will remove these submissions.

In [37]:
# drop submissions that have been deleted/removed
# (raw string with both brackets escaped so they are matched literally)
h = h[~h['text'].str.contains(r'\[deleted\]|\[removed\]')]
hp = hp[~hp['text'].str.contains(r'\[deleted\]|\[removed\]')]

In [38]:
h.shape, hp.shape

((32177, 2), (30726, 2))
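
The placeholder filter can be checked on a small example. The sketch below uses a hypothetical toy dataframe to show that the pattern matches only the bracketed placeholders, so an ordinary sentence that merely contains the word "removed" is kept.

```python
import pandas as pd

# Hypothetical toy data: two placeholder posts and two ordinary ones
toy = pd.DataFrame({'text': [
    'Warrior Goalie Pants Has anyone tried them?',
    'Trade rumours [removed]',
    'Question about skates [deleted]',
    'He was removed from the game',
]})

# Raw string so the backslashes reach the regex engine; brackets are escaped
# because an unescaped "[" would start a character class
pattern = r'\[deleted\]|\[removed\]'
mask = toy['text'].str.contains(pattern, regex=True)

kept = toy[~mask]
print(kept['text'].tolist())
# ['Warrior Goalie Pants Has anyone tried them?', 'He was removed from the game']
```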

Now the dataframes have been sufficiently cleaned, and both subreddits are approximately equally represented, with just over 30,000 submissions each. Up to this point, we have been removing entire rows from the data; all further cleaning will be performed on the text itself to prepare it for EDA and modeling. We will combine the dataframes at this point and save the combined dataframe before manipulating the text.

In [39]:
# Combine the dataframes into a single dataframe for preprocessing
df = pd.concat([h, hp])

In [40]:
df['subreddit'].value_counts()

hockey           32177
hockeyplayers    30726
Name: subreddit, dtype: int64
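
The class counts above also give a reference point for the modeling stage: a classifier that always predicts the larger class would be right in proportion to that class's share. The sketch below computes the shares from the counts shown; the variable names are illustrative.

```python
# Submission counts per subreddit, taken from the output above
counts = {'hockey': 32177, 'hockeyplayers': 30726}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Accuracy of always predicting the most common class
baseline_accuracy = max(shares.values())

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
# hockey: 51.2%
# hockeyplayers: 48.8%
# Majority-class baseline accuracy: 51.2%
```

Any model trained on this data should beat roughly 51% accuracy to be considered informative.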

### Save Cleaned Dataframes

In [41]:
df.to_csv('../data/combined_text.csv', index=False)
h.to_csv('../data/hockey_text.csv', index=False)
hp.to_csv('../data/hockeyplayers_text.csv', index=False)
