# Notebook 01

# Problem Statement: How well does a classification model hold up over time?

I am interested in exploring how a successful classification model holds up over time. If a model is built on a dataset collected at a specific point in time, how far into the future can it be applied and still be successful? Here, success means performing better than the baseline accuracy. As a bonus, I would like to identify whether COVID-19 specifically had an impact, by building the model from posts gathered in Sept 2019 and applying it to posts gathered during subsequent months. To add a degree of difficulty, the subreddits need to have similar content. And to collect enough data over time, the subreddit memberships need to be very high, so that they generate sufficient volume. I am also curious how well the model generalizes when applied to posts from other subreddits.

I start with 2 categories: practical how-to advice (r/LifeProTips) and deeper existential musings (r/Showerthoughts). A big advantage of these subreddits is that they are among the largest, with roughly 20M members each. 1000 posts are extracted from each subreddit, working backwards from midnight on the 20th of each month.

Notebook sequence:
1. Webscraping
2. Exploratory Data Analysis
3. Logistic Regression Model
4. Random Forest, Extra Trees, Support Vector Machine Models
5. Apply Models to Datasets & Evaluate

## 1. Scraping Reddit using API

In this notebook, posts are first scraped from 2 subreddits, 'LifeProTips' (19M members) and 'Showerthoughts' (22M members), using the Pushshift API. For each subreddit, 1000 posts are extracted starting from midnight on the 20th of each month and moving backwards in time, for 17 consecutive months (Oct 2019 to Feb 2021). Each set of 1000 posts spans approximately 6 days and is saved in its own dataframe.

The training/test set for modeling (notebook #3) will be created from a dataset grabbed in Sept 2019.
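
The backwards pagination described above can be sketched independently of the HTTP call. This is a minimal illustration, not the notebook's actual code: `paginate_backwards`, `fetch_batch`, and `fake_fetch` are hypothetical names, and the sketch assumes each batch arrives newest first, so that the last post in a batch is the oldest. In the notebook, the role of `fetch_batch` is played by the request with `subreddit`, `size`, and `before` parameters.

```python
def paginate_backwards(fetch_batch, before, n_batches=10):
    """Collect posts by walking backwards in time, one batch per call.

    fetch_batch(before) is a hypothetical callable that returns a list of
    post dicts (assumed newest first), each with a 'created_utc' key.
    """
    collected = []
    for _ in range(n_batches):
        posts = fetch_batch(before)
        if not posts:
            break  # nothing older is available
        collected.extend(posts)
        # the next batch starts just before the oldest post seen so far
        before = posts[-1]['created_utc']
    return collected


def fake_fetch(before, size=100):
    """Stand-in for the API: one made-up post per hour, newest first."""
    return [{'created_utc': before - 3600 * k} for k in range(1, size + 1)]


sample_posts = paginate_backwards(fake_fetch, before=1568937600)
print(len(sample_posts))  # 10 batches of 100 posts
```

Using the last element of each batch as the next `before` value, rather than a fixed index such as 99, keeps the loop working when a batch comes back with fewer than 100 posts.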

In [38]:
# Grab posts from these subreddits: 'LifeProTips' & 'Showerthoughts'
# Extract 1000 posts starting from the 20th day of each month, midnight (and moving backwards), for a 17-month period (starting before Covid)
# Every 100 posts covers ~12 hours, so 1000 posts spans ~6 days 
# Place each 1000 posts in its own df
# Use the posts from Sept 20, 2019 as the training set (train on a dataset away from the 2 known big events: COVID-19 and the Perseverance landing)
# Test all other sets of posts
# Compare accuracy over time

In [14]:
import pandas as pd
import numpy as np
import time
import requests

url = "https://api.pushshift.io/reddit/search/submission"

In [36]:
# use Epoch Converter to generate Epoch timestamps for all months
# https://www.epochconverter.com/

# timestamps from Oct 2019 - Feb 2021
timestamp_all = [1571529600, 1574208000, 1576800000, 1579478400, 
                1582156800, 1584662400, 1587340800, 1589932800, 1592611200, 1595203200,
                1597881600, 1600560000, 1603152000, 1605830400, 1608422400,
                1611100800, 1613779200]

# placeholder row used to seed each of the 17 dataframes; it is dropped after scraping

dict = {'subreddit': 'init', 'selftext': 'init', 'title': 'init'}

# lists that will hold the generated dataframe names
df1_name = []
df2_name = []

subr = ['LifeProTips','Showerthoughts'] #19M, 22M
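
As an alternative to copying values from Epoch Converter, the same timestamps can be generated with the standard library. This is a sketch under the assumption that "midnight" means 00:00 UTC, which is consistent with the hard-coded values above; `monthly_timestamps` and `generated_timestamps` are hypothetical names.

```python
from datetime import datetime, timezone


def monthly_timestamps(start_year, start_month, n_months, day=20):
    """Epoch seconds for 00:00 UTC on `day` of n_months consecutive months."""
    stamps = []
    year, month = start_year, start_month
    for _ in range(n_months):
        dt = datetime(year, month, day, tzinfo=timezone.utc)
        stamps.append(int(dt.timestamp()))
        # advance one month, rolling over into the next year after December
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return stamps


# Oct 2019 through Feb 2021: 17 months
generated_timestamps = monthly_timestamps(2019, 10, 17)
print(generated_timestamps[0], generated_timestamps[-1])
```

Generating the list this way avoids transcription errors and makes it easy to extend the study period by changing `n_months`.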

In [11]:
# scrape 'LifeProTips'
for i in range(len(timestamp_all)):
    
    # generate dataframe name
    df1_name.append(f"df_{i+1}_{subr[0]}")
    
    # create dataframe with 'init' values
    # Google: python create variable name from string
    # https://www.daniweb.com/programming/software-development/threads/111526/setting-a-string-as-a-variable-name
    vars()[df1_name[i]] = pd.DataFrame([dict])
    timestamp = timestamp_all[i]
    
    # loop through 10 times to collect 1000 posts in total
    for j in range(10):
        params = {
            'subreddit':subr[0],
            'size': 100,
            'before': timestamp
        }
        
        res = requests.get(url, params=params)
        res.raise_for_status()  # stop on an HTTP error instead of parsing a bad response
        data = res.json()
        posts = data['data']
        if not posts:
            break  # no older posts available
        df = pd.DataFrame(posts)

        # concatenate dataframes
        vars()[df1_name[i]] = pd.concat([vars()[df1_name[i]],df[['subreddit','selftext','title']]])
        
        # continue from the last post in this batch (the oldest, assuming newest-first ordering)
        timestamp = posts[-1]['created_utc']
        # check that the 'for' loop is grabbing unique posts
        # print(vars()[df1_name[i]]['title'].iloc[(vars()[df1_name[i]].shape[0])-1])
        time.sleep(3) # rest for 3 seconds before doing this again
    
    # drop the first row
    vars()[df1_name[i]] = vars()[df1_name[i]].iloc[1:]
    
    # save df
    vars()[df1_name[i]].to_csv(f'../data/raw/{subr[0]}_{i}.csv', index=False)

In [12]:
# scrape 'Showerthoughts'
for i in range(len(timestamp_all)):
    df2_name.append(f"df_{i+1}_{subr[1]}")
    vars()[df2_name[i]] = pd.DataFrame([dict])
    timestamp = timestamp_all[i]
    
    for j in range(10):
        params = {
            'subreddit':subr[1],
            'size': 100,
            'before': timestamp
        }
        
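        res = requests.get(url, params=params)
        res.raise_for_status()  # stop on an HTTP error instead of parsing a bad response
        data = res.json()
        posts = data['data']
        if not posts:
            break  # no older posts available
        df = pd.DataFrame(posts)
        vars()[df2_name[i]] = pd.concat([vars()[df2_name[i]],df[['subreddit','selftext','title']]])
        # continue from the last post in this batch (the oldest, assuming newest-first ordering)
        timestamp = posts[-1]['created_utc']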
        # check that the 'for' loop is grabbing unique posts
        # print(vars()[df2_name[i]]['title'].iloc[(vars()[df2_name[i]].shape[0])-1])
        time.sleep(3)
    
    # drop the first row
    vars()[df2_name[i]] = vars()[df2_name[i]].iloc[1:]
    
    # save df
    vars()[df2_name[i]].to_csv(f'../data/raw/{subr[1]}_{i}.csv', index=False)
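
The commented-out `print` lines in the loops above spot-check that each batch returns new posts. A more systematic check, sketched here with pandas, counts rows that repeat an earlier row. `count_duplicate_posts` is a hypothetical helper and the sample frame contains made-up values; note that identical titles can also be genuine reposts, so a nonzero count is a prompt to look closer rather than proof of a scraping error.

```python
import pandas as pd


def count_duplicate_posts(df, cols=('title', 'selftext')):
    """Return how many rows repeat an earlier row on the given columns."""
    return int(df.duplicated(subset=list(cols)).sum())


# toy frame standing in for one scraped dataframe (values are made up)
sample = pd.DataFrame({
    'subreddit': ['LifeProTips'] * 4,
    'selftext': ['a', 'b', 'a', 'c'],
    'title': ['t1', 't2', 't1', 't3'],
})

n_dupes = count_duplicate_posts(sample)
print(n_dupes)
```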

## 2. Extract 1000 posts per subreddit

Extract posts from Sept 2019 (the training/test set for modeling) into separate files, with a different file naming convention.

In [37]:
# List of other potential subreddits
# subr = ['financialindependence','FinancialPlanning'] #866k, 203k
# subr = ['books','news'] #19.1M, 22.9M
# subr = ['stocks','gardening'] #2.4M, 3.5M
# subr = ['boardgames','gaming'] #3.4M, 29.4M (all games inc. boardgames, except sports)
# subr = ['todayilearned','nottheonion'] #25M, 18.9M
# subr = ['space','science'] #18.2M, 26M

# timestamp for 9/20/2019, midnight UTC
timestamp = 1568937600
filename = ['df_' + subr[0], 'df_' + subr[1]]

vars()[filename[0]] = pd.DataFrame([dict])
vars()[filename[1]] = pd.DataFrame([dict])

##################################
####### FIRST DATAFRAME ##########
##################################

for j in range(10):
    params = {
        'subreddit':subr[0],
        'size': 100,
        'before': timestamp
    }

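    res = requests.get(url, params=params)
    res.raise_for_status()  # stop on an HTTP error instead of parsing a bad response
    data = res.json()
    posts = data['data']
    if not posts:
        break  # no older posts available
    df = pd.DataFrame(posts)
    vars()[filename[0]] = pd.concat([vars()[filename[0]],df[['subreddit','selftext','title']]])
    # continue from the last post in this batch (the oldest, assuming newest-first ordering)
    timestamp = posts[-1]['created_utc']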
    time.sleep(3)

# drop the first row
vars()[filename[0]] = vars()[filename[0]].iloc[1:]

# save df
vars()[filename[0]].to_csv(f'../data/{filename[0]}.csv', index=False)

##################################
####### SECOND DATAFRAME #########
##################################

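# reset the start time: the first loop left `timestamp` at its oldest post, well before 9/20/2019
timestamp = 1568937600
for j in range(10):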
    params = {
        'subreddit':subr[1],
        'size': 100,
        'before': timestamp
    }

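    res = requests.get(url, params=params)
    res.raise_for_status()  # stop on an HTTP error instead of parsing a bad response
    data = res.json()
    posts = data['data']
    if not posts:
        break  # no older posts available
    df = pd.DataFrame(posts)
    vars()[filename[1]] = pd.concat([vars()[filename[1]],df[['subreddit','selftext','title']]])
    # continue from the last post in this batch (the oldest, assuming newest-first ordering)
    timestamp = posts[-1]['created_utc']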
    time.sleep(3)

# drop the first row
vars()[filename[1]] = vars()[filename[1]].iloc[1:]

# save df
vars()[filename[1]].to_csv(f'../data/{filename[1]}.csv', index=False)