# Project 3
# Using NLP to classify posts into one of two subreddits

For this project, I compare the Phish subreddit with the Grateful Dead subreddit. In this notebook, I scrape posts from Reddit using the Pushshift API.

## Using the Pushshift Reddit API to gather posts

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import time
import requests
import json

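I'll use the function that Brian showed us in class to scrape posts through the Pushshift Reddit API. It requests up to 500 posts per call, moves the `after` cutoff back `skip` days with each call, then concatenates the results and drops duplicates.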

In [2]:
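def query_pushshift(subreddit, kind='submission', skip=30, times=25,
                    subfield=['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self'],
                    comfields=['body', 'score', 'created_utc']):
    """Collect posts from one subreddit via Pushshift and return a DataFrame.

    Makes times - 1 requests of up to 500 items each, moving the `after`
    cutoff back `skip` days per request. For submissions, keeps only the
    `subfield` columns, drops duplicates, and keeps self (text) posts.
    `comfields` is currently unused.
    """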

    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    mylist = []
    
    for x in range(1, times):
        
        URL = "{}&after={}d".format(stem, skip * x)
        print(URL)
        response = requests.get(URL)
        # fail loudly on a 4xx/5xx response; unlike assert, this still runs under python -O
        response.raise_for_status()
        mine = response.json()['data']
        df = pd.DataFrame(mine)
        mylist.append(df)
        time.sleep(2)
        
    full = pd.concat(mylist, sort=False)
    
    if kind == "submission":
        
        full = full[subfield]
        
        full = full.drop_duplicates()
        
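        # keep only self (text) posts; .copy() avoids a SettingWithCopyWarning when the timestamp column is added
        full = full.loc[full['is_self'] == True].copy()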
        
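    def get_date(created):
        # created_utc is a Unix epoch in UTC, so convert in UTC rather than in the machine's local time
        return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc).date()
    
    full['timestamp'] = full["created_utc"].apply(get_date)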

    print(full.shape)
    
    return full
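
To make the paging logic easier to inspect, here is a minimal sketch that builds the same request URLs without making any network calls. The helper name `build_urls` is hypothetical and not part of the function above; it only mirrors how `stem` and the `after` parameter are formatted there.

```python
def build_urls(subreddit, kind="submission", skip=30, times=25):
    """Return the URLs that query_pushshift would request, in order."""
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    # range(1, times) yields times - 1 values, so the defaults give 24 requests
    return ["{}&after={}d".format(stem, skip * x) for x in range(1, times)]


urls = build_urls("gratefuldead")
print(len(urls))   # number of requests made with the default arguments
print(urls[0])     # first window: cutoff 30 days back
print(urls[-1])    # last window: cutoff skip * (times - 1) days back
```

With the defaults, the loop covers cutoffs from 30 to 720 days back, not 750, because `range(1, times)` stops one short of `times`.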

Gathering the Dead data:

In [3]:
dead = query_pushshift("gratefuldead")

https://api.pushshift.io/reddit/search/submission/?subreddit=gratefuldead&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=gratefuldead&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=gratefuldead&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=gratefuldead&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=gratefuldead&size=500&after=150d
https://api.pushshift.io/reddit/search/submission/?subreddit=gratefuldead&size=500&after=180d
https://api.pushshift.io/reddit/search/submission/?subreddit=gratefuldead&size=500&after=210d
https://api.pushshift.io/reddit/search/submission/?subreddit=gratefuldead&size=500&after=240d
https://api.pushshift.io/reddit/search/submission/?subreddit=gratefuldead&size=500&after=270d
https://api.pushshift.io/reddit/search/submission/?subreddit=gratefuldead&size=500&after=300d
https://api.pushshift.io/reddit/search/submission/?subreddit=gr

Gathering the Phish data:

In [4]:
phish = query_pushshift("phish")

https://api.pushshift.io/reddit/search/submission/?subreddit=phish&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=phish&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=phish&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=phish&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=phish&size=500&after=150d
https://api.pushshift.io/reddit/search/submission/?subreddit=phish&size=500&after=180d
https://api.pushshift.io/reddit/search/submission/?subreddit=phish&size=500&after=210d
https://api.pushshift.io/reddit/search/submission/?subreddit=phish&size=500&after=240d
https://api.pushshift.io/reddit/search/submission/?subreddit=phish&size=500&after=270d
https://api.pushshift.io/reddit/search/submission/?subreddit=phish&size=500&after=300d
https://api.pushshift.io/reddit/search/submission/?subreddit=phish&size=500&after=330d
https://api.pushshift.io/reddit/search/submiss

In [5]:
len(dead)

5874

In [6]:
len(phish)

5784

With 5,874 Grateful Dead posts and 5,784 Phish posts, the two datasets are large enough to model and nearly balanced. I'll save them to CSV files and proceed.
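
As a quick check on the balance, the sketch below computes the share of the larger class from the two lengths reported above. That share is the accuracy a classifier would get by always predicting the majority subreddit. The variable names are illustrative only.

```python
# Lengths reported by len(dead) and len(phish) above
n_dead, n_phish = 5874, 5784

total = n_dead + n_phish
# Accuracy of always guessing the larger class
majority_share = max(n_dead, n_phish) / total

print(f"total posts: {total}")
print(f"majority-class baseline accuracy: {majority_share:.4f}")
```

The baseline comes out to roughly 50.4%, so any later model has to beat what is essentially a coin flip.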

In [7]:
dead.reset_index(inplace=True, drop=True)
phish.reset_index(inplace=True, drop=True)

In [8]:
dead.to_csv("./dead.csv", index=False)
phish.to_csv("./phish.csv", index=False)
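
The `index=False` argument matters when the files are read back later. The sketch below uses a small made-up DataFrame and in-memory buffers, not the scraped data, to show what happens to the columns with and without it.

```python
import io

import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({"title": ["a", "b"], "subreddit": ["phish", "phish"]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Without `index=False`, the reloaded frame picks up an extra `Unnamed: 0` column that would have to be dropped before modeling.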