# Data Collection

---

### Table of contents

* 1. [Project Introduction](#intro)
    * 1.1 [Workflow sample](#workflow_sample)
* 2. [Using Pushshift's API to pull data from subreddits](#API)
* 3. [Pulling data from r/DMAcademy and r/truezelda](#pull-data-1)
* 4. [Pulling data from r/PoliticalDiscussion](#pull-data-2)

<a id='intro'></a>

---
### Project Introduction

This project is all about Natural Language Processing (NLP) and binary classification models. The goals are to use [Pushshift's](https://github.com/pushshift/api) API to collect posts from two subreddits, then to create and compare two different models that can predict which subreddit a given post came from. 

#### Subreddit selection

This project requires text-rich subreddit posts in order for the final models to be well-informed. Initially, I considered using [r/oceanography](https://www.reddit.com/r/oceanography/), but found too many posts containing only images or videos. I then considered [r/DnD](https://www.reddit.com/r/DnD/) and [r/legendofzelda](https://www.reddit.com/r/legendofzelda/), but again found too many images and not enough text. I settled on using the slightly smaller but more discussion-focused subreddits [r/DMAcademy](https://www.reddit.com/r/DMAcademy/) and [r/truezelda](https://www.reddit.com/r/truezelda/). 

#### Breaking this project down into steps:
1. Use Pushshift's API to collect subreddit posts - collect 5,000 posts from [r/DMAcademy](https://www.reddit.com/r/DMAcademy/) and 5,000 posts from [r/truezelda](https://www.reddit.com/r/truezelda/)
2. Data cleaning and preprocessing - drop removed or deleted posts from our dataset, drop null values; remove hyperlinks, digits, punctuation, and bot messages from posts
3. Conduct EDA - investigate word count per post, character count per post, and post sentiment averaged across each subreddit individually 
4. Fit models to data - train a Bernoulli Naive Bayes model and a Support Vector Machine (SVM) classifier to predict which subreddit a post came from
5. Compare model results and metrics - compare misclassification rates and accuracy scores of each model

#### After accomplishing the initial goal

After completing the minimum goals for this project, I cleaned up my workflow and applied it to a new goal: collecting posts from one subreddit, then creating and comparing two models that can **predict which year** a given post from that subreddit came from. For this goal, I needed a subreddit whose vocabulary would change enough over time for a model to use in its predictions. I chose [r/PoliticalDiscussion](https://www.reddit.com/r/PoliticalDiscussion/) because it met the main requirement of being a text-rich subreddit, and I thought political eras would see a significant enough change in discussion topics and vocabulary for my models to pick up. This subreddit was created in 2011 and focuses on US politics, so I chose to pull posts from two years in which presidential elections were held, 2012 and 2020. I was working on data collection on March 31, 2021, so I pulled data from March 31, 2012 and March 31, 2020.

<a id='workflow_sample'></a>


#### Workflow sample

Before jumping into the deep end and pulling 20,000 subreddit posts, we can pull a small sample of posts from a single subreddit to prove that we can indeed use this API!

In [1]:
# Imports
import re
import requests
import pandas as pd
import matplotlib.pyplot as plt
import time

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [17]:
# Define Pushshift's search endpoint URL for requests
url = 'https://api.pushshift.io/reddit/search/' + 'submission'

In [20]:
# create parameters 
params = {
'subreddit': 'DMAcademy',
'size': 10}

In [None]:
# initiate pull request
res = requests.get(url, params)

In [None]:
# Check out request's status code
res.status_code

A status code of 200 indicates a successful pull! An example of an error at this stage of the process would be getting a 404 status code.
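As a small sketch of that check, the standard library's `http.HTTPStatus` can translate a numeric code into a readable phrase. `describe_status` is a hypothetical helper written for this illustration; it is not part of `requests` or of Pushshift.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (ok, phrase) for an HTTP status code. Hypothetical helper."""
    ok = 200 <= code < 300  # 2xx codes indicate success
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not a registered HTTP status
        phrase = 'Unknown'
    return ok, phrase


print(describe_status(200))
print(describe_status(404))
```

In the notebook itself, `describe_status(res.status_code)` would give the same information for the live response. `requests` also offers `res.raise_for_status()`, which raises an exception for 4xx and 5xx codes.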

In [22]:
# get data from json file
data = res.json()
posts = data['data']
df = pd.DataFrame(posts)


In [19]:
# Inspect the posts we just pulled
df[['selftext', 'title']]


Now that we've done a proof of concept, we can put the above code into a function to pull many more posts over an extended period of time.

<a id='API'></a>

---

### Using Pushshift's API to pull data from subreddits

I knew that I would be pulling data from several subreddits for this project, so I created a function to streamline the data pull requests.

You can read more about Pushshift's API on this [GitHub page](https://github.com/pushshift/api). There is also a [YouTube video](https://www.youtube.com/watch?v=AcrjEWsMi_E) walkthrough of setting up this API. 

### **Warning**
Pulling large volumes of data through Pushshift's API puts you at risk of getting banned by the server host. Therefore, it's recommended that you pull small numbers of posts over an extended period of time. I set up a function with a built-in delay that pulls 100 posts per minute for a user-specified number of iterations. To get 5,000 posts, I ran this function for 50 iterations for each subreddit/time period. I've since altered the code in this notebook so that it only runs 2 iterations per pull, to demonstrate the process without forcing readers to run four 50-minute data pulls.
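The arithmetic behind those numbers can be sketched as below. `pull_plan` and the two constants are hypothetical names introduced for illustration; the values (100 posts per request, a 60-second pause) are the ones described above.

```python
POSTS_PER_PULL = 100   # 'size' requested per iteration in this notebook
DELAY_SECONDS = 60     # pause between iterations


def pull_plan(target_posts, posts_per_pull=POSTS_PER_PULL, delay=DELAY_SECONDS):
    """Return (iterations, minutes of waiting) needed to reach target_posts."""
    iterations = -(-target_posts // posts_per_pull)  # ceiling division
    minutes = iterations * delay / 60
    return iterations, minutes


iters, minutes = pull_plan(5000)
print(iters, minutes)
```

For 5,000 posts this gives 50 iterations and 50 minutes of delay per pull, so four pulls add up to roughly 200 minutes of waiting, not counting the time each request takes.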

In [5]:
# Create a function to pull data using Pushshift's API
# Input: type of data to pull (can be 'submission' or 'comment'),
# desired number of pull iterations, desired subreddit, desired time

def get_posts(pull_type, iters, subreddit, desired_time):
    
    # Define Pushshift's search endpoint URL for requests
    url = 'https://api.pushshift.io/reddit/search/' + pull_type
        
    # Create empty master dataframe to fill
    master_df = pd.DataFrame()
    
    # Loop through the specified number of iterations
    for i in range(iters):
        # Set API parameters
        params = {
        'subreddit': subreddit,
        'size': 100,
        'before':desired_time}
        
        # Pull data
        res = requests.get(url, params)
        data = res.json()
        posts = data['data']
        df = pd.DataFrame(posts)
        
        # Concatenate data to master dataframe
        frames = [df, master_df]
        master_df = pd.concat(frames, axis=0, ignore_index=True)
        
        # Get time of oldest post in this data
        # This resets the API parameters so that you pull older posts every iteration
        desired_time = df['created_utc'].min()
        print(f'Completed {i+1} iterations, {iters-i-1} iterations remaining')
        
        # Time delay so you don't get banned by Pushshift
        time.sleep(60)
    
    # Return dataframe containing all collected posts
    return master_df
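The key idea in `get_posts` is the moving `before` cursor: each iteration asks for posts older than the oldest post seen so far. The sketch below isolates that logic so it can be run without touching the API. `paginate_before` and `fake_fetch` are hypothetical names, the fake data is invented, and the fake treats `before` as strictly exclusive, which is an assumption and has not been checked against Pushshift's actual behavior.

```python
def paginate_before(fetch_page, iters, before):
    """Collect posts by repeatedly requesting items older than `before`.

    fetch_page(before) is a hypothetical callable standing in for the API
    request; it returns a list of dicts that each have a 'created_utc' key.
    """
    collected = []
    for _ in range(iters):
        page = fetch_page(before)
        if not page:  # nothing older is left, so stop early
            break
        collected.extend(page)
        # Move the cursor back to the oldest post on this page
        before = min(post['created_utc'] for post in page)
    return collected


# Invented data source: posts with timestamps 1..10, served two at a time
FAKE_POSTS = [{'id': str(t), 'created_utc': t} for t in range(1, 11)]


def fake_fetch(before, size=2):
    """Return the `size` newest fake posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return sorted(older, key=lambda p: p['created_utc'], reverse=True)[:size]


posts = paginate_before(fake_fetch, iters=3, before=11)
print([p['created_utc'] for p in posts])
```

Unlike `get_posts`, this sketch stops early when a page comes back empty, which avoids trying to read `created_utc` from an empty result. The 60-second delay is left out here because no real server is being contacted.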

<a id='pull-data-1'></a>

---

### Pulling data from r/DMAcademy and r/truezelda

First goal: create and compare models that can predict whether a post came from subreddit A or subreddit B.

Second goal: create and compare models that can predict whether a post from one subreddit came from year A or year B. 

Thus, I pulled data from three different subreddits but did four total pulls: one from [r/DMAcademy](https://www.reddit.com/r/DMAcademy/), one from [r/truezelda](https://www.reddit.com/r/truezelda/), one from [r/PoliticalDiscussion](https://www.reddit.com/r/PoliticalDiscussion/) in the year 2012, and one from r/PoliticalDiscussion in the year 2020. 

I chose to pull 5,000 posts from each subreddit/time period to ensure that my models would be well-informed. It was recommended that my models be trained on a minimum of 2,000 posts from each subreddit, so I pulled well over that to ensure that I would have enough posts to work with even after dropping potentially hundreds of unusable posts.

To start, I pulled 5,000 posts from r/DMAcademy and stored them in a dataframe. The `get_posts` function pulls 100 posts per iteration, so passing it 50 iterations will produce 50 * 100, or 5,000, posts. I passed `int(time.time())` to the function to pull the 5,000 most recent posts at the time of writing. When I pulled posts from r/PoliticalDiscussion, I passed the function a specific time in [Unix or Epoch time](https://en.wikipedia.org/wiki/Unix_time) (formatted as the number of seconds since 00:00:00 Jan 1, 1970, an arbitrary date) to pull posts from a specific date and time in 2012 and 2020.

In [6]:
# pull data from r/DMAcademy
dmacademy_df = get_posts('submission', 2, 'DMAcademy', int(time.time()))

Completed 1 iterations, 1 iterations remaining
Completed 2 iterations, 0 iterations remaining


In [7]:
# Inspect the dataframe
dmacademy_df

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,url,whitelist_status,wls,link_flair_css_class,post_hint,preview,author_cakeday,author_flair_background_color,author_flair_text_color,removed_by_category
0,[],False,sparklyladybug,,[],,text,t2_2mkj57z9,False,False,...,https://www.reddit.com/r/DMAcademy/comments/mi...,all_ads,6,,,,,,,
1,[],False,Skaldicthorn,,[],,text,t2_50ard3ld,False,False,...,https://www.reddit.com/r/DMAcademy/comments/mi...,all_ads,6,,,,,,,
2,[],False,Ready_Tap_4327,,[],,text,t2_87x6hj8f,False,False,...,https://www.reddit.com/r/DMAcademy/comments/mi...,all_ads,6,,,,,,,
3,[],False,cochon_de_lait,,[],,text,t2_6vi9b,False,False,...,https://www.reddit.com/r/DMAcademy/comments/mi...,all_ads,6,,,,,,,
4,[],False,Root1nTootinPutin,,[],,text,t2_hskctv9,False,False,...,https://www.reddit.com/r/DMAcademy/comments/mi...,all_ads,6,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,[],False,Dragonboy233,,[],,text,t2_4enr9fgd,False,False,...,https://www.reddit.com/r/DMAcademy/comments/mi...,all_ads,6,,,,,,,
196,[],False,Fadinglight656,,[],,text,t2_95mbl9ol,False,False,...,https://www.reddit.com/r/DMAcademy/comments/mi...,all_ads,6,,,,,,,
197,[],False,Miner49r10,,[],,text,t2_pmomu,False,False,...,https://www.reddit.com/r/DMAcademy/comments/mi...,all_ads,6,Guide,,,,,,
198,[],False,Wabafeedle,,[],,text,t2_xhccq,False,False,...,https://www.reddit.com/r/DMAcademy/comments/mi...,all_ads,6,,,,,,,


In [15]:
# Confirm the number of unique posts we just pulled
dmacademy_df['id'].nunique()

5000

The printout above shows us the number of **unique post IDs** contained in our dataframe. Since this matches the number of posts we asked for, we didn't pull any duplicate posts, hooray!

This all looks good, so now we can pull posts from r/truezelda.

In [11]:
# Pull posts from r/truezelda
truezelda_df = get_posts('submission', 2, 'truezelda', int(time.time()))

ConnectionError: HTTPSConnectionPool(host='api.pushshift.io', port=443): Max retries exceeded with url: /reddit/search/submission?subreddit=truezelda&size=100&before=1617468318 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcc1026bdc0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

In [10]:
# Inspect the dataframe
truezelda_df


In [None]:
# Check for number of unique posts
truezelda_df['id'].nunique()

In [None]:
# Are the missing ids nulls?
truezelda_df['id'].isnull().sum()
# Nope! 

Sometimes, you may pull duplicate posts. I'm not sure why this happens, but I've only seen it turn up in a very small fraction of the data, so we can dismiss it and use what we have.

If you want to investigate the full datasets without waiting to pull them through the API, run the cell below to read them in from CSV files and check the numbers of unique posts.

In [14]:
dmacademy_df = pd.read_csv('../data/dmacademy.csv')
truezelda_df = pd.read_csv('../data/truezelda.csv')

print(dmacademy_df['id'].nunique())
print(truezelda_df['id'].nunique())

5000
4996


See what I mean about pulling duplicate posts? Since there appear to be only 4 duplicates out of 5,000, I'm going to keep what I have and move on.
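If the duplicates ever did need to be removed at this stage, a minimal sketch with pandas could look like the following. The toy dataframe and its values are invented for illustration; only the `id` column name comes from the data above.

```python
import pandas as pd

# Toy stand-in for a pulled dataframe; the second 'b2' row is a duplicate
toy_df = pd.DataFrame({
    'id': ['a1', 'b2', 'b2', 'c3'],
    'title': ['first', 'second', 'second', 'third'],
})

# Count rows whose id has already appeared earlier in the frame
n_duplicates = toy_df['id'].duplicated().sum()

# Keep the first occurrence of each id and rebuild a clean index
deduped_df = toy_df.drop_duplicates(subset='id', keep='first').reset_index(drop=True)

print(n_duplicates)
print(len(deduped_df))
```

Applied to `truezelda_df`, the same two calls would report how many rows repeat an `id` and return a frame with one row per post.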

### Save data to .csv files
Now that we've pulled the data needed for the first goal of this project, let's save it all as `.csv`s.

In [None]:
# Set index=False to avoid creating an unnecessary index column
dmacademy_df.to_csv('../data/dmacademy.csv', index=False)
truezelda_df.to_csv('../data/truezelda.csv', index=False)

<a id='pull-data-2'></a>

---
### Pulling data from r/PoliticalDiscussion

Now let's pull posts from r/PoliticalDiscussion. I'm pulling data from different years using an [Epoch time converter](https://www.epochconverter.com/). 

The specific Epoch times I used are 1333169208, which is Saturday, March 31, 2012 4:46:48 AM (GMT), and 1585630008, which is Tuesday, March 31, 2020 4:46:48 AM (GMT). Same date and time, but different years!
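The same conversion can be done in Python instead of through a website. This is a small sketch using only the standard library; `to_epoch` is a hypothetical helper name, and it assumes the date and time are expressed in UTC.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Convert a UTC date and time to whole seconds since the Unix epoch."""
    dt = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(dt.timestamp())


print(to_epoch(2012, 3, 31, 4, 46, 48))
print(to_epoch(2020, 3, 31, 4, 46, 48))
```

The two printed values are 1333169208 and 1585630008, matching the Epoch times used for the pulls below.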

I abbreviated the name political discussion to poli_dis for the dataframes below. I will continue to use this abbreviation throughout the rest of the notebooks in this project.

In [None]:
poli_dis_2012_df = get_posts('comment', 2, 'PoliticalDiscussion', 1333169208) 

In [None]:
poli_dis_2012_df['id'].nunique()

In [None]:
poli_dis_2020_df = get_posts('comment', 2, 'PoliticalDiscussion', 1585630008)

In [None]:
poli_dis_2020_df['id'].nunique()

Since the goal here isn't to predict on the 'subreddit' column but to predict on the year a post was made, we should add our target variable to our data.

In [None]:
poli_dis_2012_df['year'] = '2012'
poli_dis_2020_df['year'] = '2020'

In [None]:
print(poli_dis_2012_df['id'].nunique())
print(poli_dis_2020_df['id'].nunique())

In [None]:
# Note that the datasets from different years have different numbers of columns
print(poli_dis_2012_df.shape)
print(poli_dis_2020_df.shape)
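Because the two pulls can come back with different columns, one way to line them up before modeling is to keep only the columns they share. The sketch below uses toy frames with invented column names and values; it is one possible approach, not necessarily what the later notebooks in this project do.

```python
import pandas as pd

# Toy frames standing in for the two pulls; column sets differ on purpose
df_2012 = pd.DataFrame({'id': ['a'], 'text': ['old text'], 'year': ['2012']})
df_2020 = pd.DataFrame({'id': ['b'], 'text': ['new text'], 'year': ['2020'],
                        'extra_field': [True]})

# Keep only the columns both frames share, in the first frame's order
shared_cols = [col for col in df_2012.columns if col in df_2020.columns]

# Stack the two frames into one with a fresh index
combined = pd.concat([df_2012[shared_cols], df_2020[shared_cols]],
                     axis=0, ignore_index=True)

print(combined.shape)
print(list(combined.columns))
```

Stacking without the column filter would also work, but pandas would then fill the columns missing from one year with `NaN`.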

If you want to investigate the full datasets without waiting to pull them through the API, run the cell below to read them in from CSV files and check the numbers of unique posts.

In [None]:
poli_dis_2012_df = pd.read_csv('../data/poli_dis_2012.csv')
poli_dis_2020_df = pd.read_csv('../data/poli_dis_2020.csv')

### Save data to .csv files
Now that we've pulled the data needed for the second model, let's save it as `.csv`s. I commented out the lines below to prevent myself from accidentally overwriting my datasets.

In [None]:
# poli_dis_2012_df.to_csv('../data/poli_dis_2012.csv', index=False)
# poli_dis_2020_df.to_csv('../data/poli_dis_2020.csv', index=False)