# Data Collection

---

### Project Introduction

---

### Subreddit selection
---

In [1]:
# Imports
import re
import requests
import pandas as pd
import matplotlib.pyplot as plt
import time

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

---
### Using Pushshift's API to pull data from subreddits

I knew I would be pulling data from several subreddits for this project, so I wrote a function to streamline the requests.

You can read more about Pushshift's API on this [GitHub page](https://github.com/pushshift/api). There is also a [YouTube video](https://www.youtube.com/watch?v=AcrjEWsMi_E) walkthrough of setting up this API. 

In [2]:
# Pull posts or comments from a subreddit a specified number of times, starting before a specified time
def get_posts(pull_type, iters, subreddit, desired_time):
    
    # Define Pushshift's URL for requests; pull_type is 'submission' or 'comment'
    url = 'https://api.pushshift.io/reddit/search/' + str(pull_type)
        
    # Create empty master dataframe to fill
    master_df = pd.DataFrame()
    
    # Loop through the specified number of iterations
    for i in range(iters):
        # Set API parameters
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': desired_time}
        
        # Pull data and stop with an error if the request failed
        res = requests.get(url, params=params)
        res.raise_for_status()
        data = res.json()
        posts = data['data']
        df = pd.DataFrame(posts)
        
        # Concatenate data to master dataframe
        frames = [df, master_df]
        master_df = pd.concat(frames, axis=0, ignore_index=True)
        
        # Get time of oldest post in this data
        # This resets the API parameters so that you pull older posts every iteration
        desired_time = df['created_utc'].min()
        print(f'Completed {i+1} iterations, {iters-i-1} iterations remaining')
        
        # Time delay so you don't get banned by Pushshift
        time.sleep(60)
    
    # Return dataframe containing all collected posts
    return master_df
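
The loop above pages backward in time by resetting `before` to the oldest `created_utc` seen so far. The sketch below isolates that cursor logic so it can be tried without network access. `paginate_before`, `fetch_page`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names made up for illustration. They are not part of Pushshift or of `get_posts`, and the stand-in assumes `before` means strictly older than the cutoff.

```python
def paginate_before(fetch_page, iters, start_time):
    """Collect records page by page, moving the `before` cursor back in time."""
    collected = []
    before = start_time
    for _ in range(iters):
        page = fetch_page(before)
        if not page:
            # Nothing older is available, so stop early instead of failing
            break
        collected.extend(page)
        # The oldest timestamp on this page becomes the next cutoff
        before = min(record['created_utc'] for record in page)
    return collected


# Hypothetical stand-in for the API: ten fake posts with timestamps 1..10
FAKE_POSTS = [{'id': n, 'created_utc': n} for n in range(1, 11)]


def fake_fetch(before, size=3):
    """Return up to `size` of the newest fake posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return sorted(older, key=lambda p: p['created_utc'], reverse=True)[:size]


posts = paginate_before(fake_fetch, iters=5, start_time=11)
ids = [p['id'] for p in posts]
```

The early exit on an empty page is one possible guard. In `get_posts`, an empty page would produce a dataframe with no `created_utc` column, so a check like this may be worth adding there as well.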

---
### Pulling data from subreddits

I went down two different routes for this project: first, create a model that can predict whether a post came from subreddit A or subreddit B; second, create a model that can predict whether a post from one subreddit came from year A or year B. Thus, I pulled data from three different subreddits but made four pulls in total: one from [r/DMAcademy](https://www.reddit.com/r/DMAcademy/), one from [r/truezelda](https://www.reddit.com/r/truezelda/), one from [r/PoliticalDiscussion](https://www.reddit.com/r/PoliticalDiscussion/) in the year 2012, and one from r/PoliticalDiscussion in the year 2020. Each pull totaled 5,000 subreddit posts (post title and main text only, no comments), except for the 2020 pull from r/PoliticalDiscussion. For whatever reason, this subreddit gave me an error when I tried to pull the last 100 posts, so I settled for 4,900 posts from that year.

I chose 5,000 posts to ensure that my models will be well-informed. It was recommended that my models be trained on 2,000 posts from each subreddit as a **minimum**, but I know that posts can be removed/deleted online, so I pulled well over the minimum recommended number to ensure that I would have enough posts to work with. However, 10,000 total posts is a lot for my models to crunch, so maybe aim for 3,000 or 4,000 per subreddit next time.

### **Warning**
Do not run any of the cells below unless you have ~4 hours to spare.

To start, let's pull 5,000 posts from the DMAcademy subreddit and store them in a dataframe. Remember, the `get_posts` function pulls 100 posts per iteration, so passing it 50 iterations will produce 50 * 100 posts, or 5,000! Also, I passed `int(time.time())` to my `get_posts` function to pull the 5,000 most recent posts at the time of writing. When we move on to the political discussion posts, you'll see me use a specific time in [Unix or Epoch time](https://en.wikipedia.org/wiki/Unix_time) (the number of seconds since 00:00:00 UTC on Jan 1, 1970, an arbitrary date) to pull posts from before a specific date and time in 2012 and 2020.

In [3]:
# If you want to investigate the missing links without waiting 100 minutes to pull data, uncomment the two lines below and run this cell

# dmacademy_df = pd.read_csv('../data/dmacademy.csv')
# truezelda_df = pd.read_csv('../data/truezelda.csv')

In [4]:
# dmacademy_df = get_posts('submission', 50, 'DMAcademy', int(time.time()))

Completed 1 iterations, 49 iterations remaining


KeyboardInterrupt: 

In [None]:
# Inspect the dataframe
dmacademy_df

In [None]:
dmacademy_df['full_link'].nunique()

The printout above shows us the number of **unique reddit links** contained in our dataframe. This tells us that we didn't pull any duplicate posts, hooray! 

This all looks good, so now we can pull posts from r/truezelda.

In [None]:
# truezelda_df = get_posts('submission', 50, 'truezelda', int(time.time()))

In [None]:
truezelda_df

In [None]:
truezelda_df['id'].nunique()

In [None]:
# Are the missing ids nulls?
truezelda_df['id'].isnull().sum()
# Nope! 

For whatever reason, it looks like we may have pulled 4 duplicate posts. Since that is only 0.08% of our data from this subreddit, let's ignore them and use what we have.
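
If you would rather drop the repeats than ignore them, a minimal sketch with pandas follows. `sample_df` is a hypothetical miniature of a pulled dataframe, not real data from this project; the same two calls should apply to `truezelda_df` on its `id` column.

```python
import pandas as pd

# Hypothetical miniature of a pulled dataframe; the second 'a1' row is a repeat
sample_df = pd.DataFrame({
    'id': ['a1', 'b2', 'a1', 'c3'],
    'title': ['First', 'Second', 'First', 'Third'],
})

# Count rows whose id has already appeared earlier in the frame
n_duplicates = sample_df['id'].duplicated().sum()

# Keep the first occurrence of each id and renumber the index
deduped_df = sample_df.drop_duplicates(subset='id', keep='first').reset_index(drop=True)
```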

---
### Save data to .csv files
Now that we've pulled the data needed for the first model, let's save it as `.csv`s.

In [None]:
# Set index=False to avoid creating an unnecessary index column
dmacademy_df.to_csv('../data/dmacademy.csv', index=False)
truezelda_df.to_csv('../data/truezelda.csv', index=False)

---
### Political discussion subreddit
Now let's pull posts from r/PoliticalDiscussion. I'm pulling data from different years using an [Epoch time converter](https://www.epochconverter.com/). 
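
The same conversion can be done in Python instead of on a website. This is a small sketch using only the standard library; `to_epoch` is a hypothetical helper name, and it assumes the date and time are given in UTC (GMT), which is how the two cutoffs for the r/PoliticalDiscussion pulls in this notebook are expressed.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Return the Unix timestamp (whole seconds) for a UTC date and time."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


# March 31, 4:46:48 AM UTC in each year
cutoff_2012 = to_epoch(2012, 3, 31, 4, 46, 48)
cutoff_2020 = to_epoch(2020, 3, 31, 4, 46, 48)
print(cutoff_2012, cutoff_2020)  # prints 1333169208 1585630008
```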

In [None]:
# Define Pushshift's URL for comment requests

url = 'https://api.pushshift.io/reddit/search/' + 'comment'

In [None]:
# Cutoff: Tuesday, March 31, 2020 4:46:48 AM GMT
params = {
    'subreddit': 'PoliticalDiscussion',
    'size': 100,
    'before': 1585630008}

# Pull data
res = requests.get(url, params=params)
data = res.json()
posts = data['data']
df = pd.DataFrame(posts)

In [None]:
res.status_code

In [None]:
df['body']

In [None]:
df['body'].isnull().sum()

In [None]:
df.loc[df['body']=='[removed]']['body']

In [None]:
df.loc[df['body'].str.contains('Hello, /u/')]['body']

In [None]:
df.loc[df['body'].str.contains('I am a bot')]['body']

In [None]:
# If you want to investigate the missing links without waiting 100 minutes to pull data, uncomment the two lines below and run this cell

# poli_dis_2012_df = pd.read_csv('../data/poli_dis_2012.csv')
# poli_dis_2020_df = pd.read_csv('../data/poli_dis_2020.csv')

In [None]:
# poli_dis_2012_df = get_posts('comment', 50, 'PoliticalDiscussion', 1333169208)
# This time is Saturday, March 31, 2012 4:46:48 AM GMT
# or Friday, March 30, 2012 9:46:48 PM GMT-07:00

In [None]:
poli_dis_2012_df.loc[poli_dis_2012_df['body']=='[deleted]']['body']

In [None]:
poli_dis_2012_df.loc[poli_dis_2012_df['body']=='[removed]']['body']

In [None]:
poli_dis_2012_df['body'].isnull().sum()

In [None]:
poli_dis_2020_df['body'][:100]

In [None]:
poli_dis_2012_df['body'].nunique()

In [None]:
#poli_dis_2020_df = get_posts('comment', 50, 'PoliticalDiscussion', 1585630008)
# Tuesday, March 31, 2020 4:46:48 AM

In [None]:
poli_dis_2012_df['year'] = '2012'
poli_dis_2020_df['year'] = '2020'

In [None]:
print(poli_dis_2012_df['id'].nunique())
print(poli_dis_2020_df['id'].nunique())

In [None]:
print(poli_dis_2020_df['body'].nunique())
print(poli_dis_2012_df['body'].nunique())

In [None]:
print(poli_dis_2012_df.shape)
print(poli_dis_2020_df.shape)

# talk about data here

---
### Save data to .csv files
Now that we've pulled the data needed for the second model, let's save it as `.csv`s.

In [None]:
poli_dis_2012_df.to_csv('../data/poli_dis_2012.csv', index=False)
poli_dis_2020_df.to_csv('../data/poli_dis_2020.csv', index=False)

---


### Pick up here!!!

So far we've accomplished:
* getting data - 200 subreddit posts so far (not comments)
* count vectorizing the subreddit posts
* passing data to TWO MODELS
    * Bernoulli Naive Bayes model: when we have 0/1 variables.
    * TF-IDF Multinomial Naive Bayes model: when our variables are positive integers.

To do next:
* get MORE DATA - source for pulling data on time delay: https://gist.github.com/tecoholic/1242694
* get different types of data - try comments! try using titles alongside selftext! 
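
For the last idea, one way to use titles alongside selftext is to join them into a single text column before vectorizing. This is only a sketch: `sample_df` and the new `text` column are hypothetical, and it assumes submission pulls carry `title` and `selftext` columns.

```python
import pandas as pd

# Hypothetical rows; the second post has a title but no body text
sample_df = pd.DataFrame({
    'title': ['Help with pacing', 'Timeline theory'],
    'selftext': ['My sessions run long.', None],
})

# Treat a missing title or body as empty text, join with a space, and trim the ends
sample_df['text'] = (
    sample_df['title'].fillna('') + ' ' + sample_df['selftext'].fillna('')
).str.strip()
```

Filling missing values first keeps a post with an empty body from turning into a missing value after concatenation.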