# Data Gathering

This notebook queries the r/PandR and r/DunderMifflin subreddits through the Pushshift API to gather our data.

## Loading Libraries

In [1]:
import pandas as pd
import datetime as dt
import time
import requests

In [47]:
# adapted from Mahdi Shadkam-Farrokhi project 3 intro lesson

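def query_pushshift(subreddit, kind = 'submission', day_window = 180, n = 20):
    """Query Pushshift `n` times for `subreddit`, moving `after` back by `day_window` days each time."""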
    SUBFIELDS = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self']
    
    # establish base url and stem
    BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}" # also known as the "API endpoint" 
    stem = f"{BASE_URL}?subreddit={subreddit}&size=500" # always pulling max of 500
    
    # instantiate empty list for temp storage
    posts = []
    
    # send `n` requests, each reaching `day_window` days further back,
    # pausing 2 seconds between requests to avoid hammering the API
    for i in range(1, n + 1):
        URL = "{}&after={}d".format(stem, day_window * i)
        print("Querying from: " + URL)
        response = requests.get(URL)
        response.raise_for_status()  # stop on HTTP error codes (4xx/5xx)
        mine = response.json()['data']
        df = pd.DataFrame.from_dict(mine)
        posts.append(df)
        time.sleep(2)
    
    # pd.concat storage list
    full = pd.concat(posts, sort=False)
    
    # if submission
    if kind == "submission":
        # select desired columns
        full = full[SUBFIELDS]
        # drop duplicates (the query windows can overlap)
        full = full.drop_duplicates()
        # keep only self (text) posts; `.copy()` avoids a SettingWithCopyWarning below
        full = full.loc[full['is_self'] == True].copy()

    # create `timestamp` column (note: `fromtimestamp` uses the local timezone, not UTC)
    full['timestamp'] = full["created_utc"].map(dt.date.fromtimestamp)
    
    print("Query Complete!")    
    return full 
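
The request URLs follow directly from the loop in `query_pushshift`: each pass multiplies `day_window` by the loop counter and appends it as the `after` parameter. The sketch below rebuilds the same URLs without sending any requests, which is a cheap way to check the windows before hitting the API. The helper name `build_query_urls` is hypothetical and not part of the original notebook.

```python
def build_query_urls(subreddit, kind="submission", day_window=180, n=20):
    """Return the URLs `query_pushshift` would request, without sending them."""
    base_url = f"https://api.pushshift.io/reddit/search/{kind}"
    stem = f"{base_url}?subreddit={subreddit}&size=500"
    # one URL per window: after=180d, after=360d, ...
    return [f"{stem}&after={day_window * i}d" for i in range(1, n + 1)]


urls = build_query_urls("PandR", n=3)
for url in urls:
    print(url)
```

With the defaults (`day_window=180`, `n=20`) the last window would be `after=3600d`.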

In [48]:
parks_df = query_pushshift('PandR')

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=180d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=360d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=540d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=720d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=900d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=1080d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=1260d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=1440d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=1620d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=5

In [55]:
parks_df.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 4 | Letter to Brendanawicz | This guy had no chance. He started out being c... | PandR | 1571349530 | RandySwango | 4 | 18 | True | 2019-10-17 |
| 9 | Just started watching the show on Amazon Prime... | At least on season two as I've noticed the sub... | PandR | 1571352782 | Oo00oOo00oOO | 0 | 1 | True | 2019-10-17 |
| 11 | Office Ladies podcast just came out to discuss... | | PandR | 1571353982 | ImplicationOfDanger | 6 | 11 | True | 2019-10-17 |
| 21 | Noticed they didn't mention jerry in the S3 recap | The beginning of season 3 episode 1 has a reca... | PandR | 1571367525 | hogmanjr100 | 4 | 6 | True | 2019-10-17 |
| 30 | Star trek movies rule | My favorite seasons are definitely 3/5/7. Seas... | PandR | 1571407817 | DEEP_HURTING | 0 | 3 | True | 2019-10-18 |
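
The `timestamp` column is built with `dt.date.fromtimestamp`, which converts to the timezone of the machine running the notebook, while `created_utc` is in UTC. For posts made near midnight the two dates can differ by a day. A minimal sketch of the difference, using two `created_utc` values that appear in this notebook's outputs (the variable names are illustrative):

```python
import datetime as dt

first_post = 1571349530  # first row of parks_df.head()
old_post = 1306295681    # row 4670 of the combined reddit.csv output


def utc_date(epoch_seconds):
    """Convert a Unix timestamp to a calendar date in UTC."""
    return dt.datetime.fromtimestamp(epoch_seconds, tz=dt.timezone.utc).date()


print(utc_date(first_post))  # 2019-10-17
print(utc_date(old_post))    # 2011-05-25

# The local-time version depends on where the code runs
print(dt.date.fromtimestamp(old_post))
```

The second post is listed as 2011-05-24 in the combined table, one day earlier than its UTC date, which is consistent with the notebook having been run in a timezone behind UTC.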


In [50]:
parks_df.to_csv('../data/parks.csv', index=False)

In [51]:
office_df = query_pushshift('DunderMifflin')

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=DunderMifflin&size=500&after=180d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=DunderMifflin&size=500&after=360d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=DunderMifflin&size=500&after=540d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=DunderMifflin&size=500&after=720d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=DunderMifflin&size=500&after=900d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=DunderMifflin&size=500&after=1080d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=DunderMifflin&size=500&after=1260d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=DunderMifflin&size=500&after=1440d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=DunderMifflin&size=500&after=1620d
Querying from: 

In [49]:
parks_df.shape

(2049, 9)

In [52]:
office_df.shape

(2626, 9)

In [56]:
office_df.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 8 | I really wish we were able to see Michael's op... | Title says it all basically. People who've see... | DunderMifflin | 1571343765 | Corndogred | 3 | 2 | True | 2019-10-17 |
| 16 | PLOT HOLE ALERT | In the segment "Kevin cooks stuff in the offic... | DunderMifflin | 1571346137 | Loyalarad2007 | 10 | 1 | True | 2019-10-17 |
| 19 | Looking for a specific blooper | What season do I have to look at to find the b... | DunderMifflin | 1571347273 | busterhead | 1 | 1 | True | 2019-10-17 |
| 25 | I just realized the guy at the high school job... | | DunderMifflin | 1571350005 | lianagolucky | 3 | 6 | True | 2019-10-17 |
| 29 | “I’m doing the best I can here so you can get ... | Why did that create scenes like this for Pam. ... | DunderMifflin | 1571350614 | StrongHandDan | 11 | 1 | True | 2019-10-17 |


In [54]:
office_df.to_csv('../data/office.csv', index=False)

In [58]:
office_df['title'][:10]

8     I really wish we were able to see Michael's op...
16                                      PLOT HOLE ALERT
19                       Looking for a specific blooper
25    I just realized the guy at the high school job...
29    “I’m doing the best I can here so you can get ...
31                            Dwight as Hannibal Lecter
35    The scene where dwight tries to comfort pam wh...
49                          Season 9 Andy is the worst!
53                          I just noticed something...
55                                 Season 9 is terrible
Name: title, dtype: object

In [60]:
reddit = pd.concat([parks_df, office_df])

In [61]:
reddit.to_csv('../data/reddit.csv', index = False)
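
`pd.concat` keeps each frame's original row labels, and `to_csv(..., index=False)` leaves them out of the file. That is why the table read back from `reddit.csv` is numbered from 0 instead of showing labels such as 4, 9, 11. The sketch below shows the same round trip on two tiny stand-in frames written to an in-memory buffer instead of `../data/reddit.csv`. The frame contents are copied from rows shown in this notebook, and the variable names are illustrative.

```python
import io

import pandas as pd

# Stand-ins for parks_df and office_df, keeping their original row labels
parks = pd.DataFrame(
    {"title": ["Letter to Brendanawicz"], "subreddit": ["PandR"]}, index=[4]
)
office = pd.DataFrame(
    {"title": ["PLOT HOLE ALERT"], "subreddit": ["DunderMifflin"]}, index=[16]
)

combined = pd.concat([parks, office])
print(list(combined.index))  # original labels survive the concat: [4, 16]

# Round trip through CSV without the index column
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.index))  # fresh 0-based index: [0, 1]
```

Passing `ignore_index=True` to `pd.concat` would renumber the rows in memory as well, which avoids duplicate labels if the combined frame is used before being saved.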

In [62]:
pd.read_csv('../data/reddit.csv')

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Letter to Brendanawicz | This guy had no chance. He started out being c... | PandR | 1571349530 | RandySwango | 4 | 18 | True | 2019-10-17 |
| 1 | Just started watching the show on Amazon Prime... | At least on season two as I've noticed the sub... | PandR | 1571352782 | Oo00oOo00oOO | 0 | 1 | True | 2019-10-17 |
| 2 | Office Ladies podcast just came out to discuss... | | PandR | 1571353982 | ImplicationOfDanger | 6 | 11 | True | 2019-10-17 |
| 3 | Noticed they didn't mention jerry in the S3 recap | The beginning of season 3 episode 1 has a reca... | PandR | 1571367525 | hogmanjr100 | 4 | 6 | True | 2019-10-17 |
| 4 | Star trek movies rule | My favorite seasons are definitely 3/5/7. Seas... | PandR | 1571407817 | DEEP_HURTING | 0 | 3 | True | 2019-10-18 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4670 | Ultimate troll (X-post from f7u12) | Here it is:\n\nhttp://i.imgur.com/78ONN.jpg | DunderMifflin | 1306295681 | [deleted] | 2 | 21 | True | 2011-05-24 |
| 4671 | Let's face it: The Office is over and peaked s... | Mine:\n\n1. Michael Scott\n2. Dwight Schrute\n... | DunderMifflin | 1306301631 | [deleted] | 2 | 0 | True | 2011-05-25 |
| 4672 | Did anyone catch the Producer's Cut of the "Go... | Missed the episode myself. The internet is s... | DunderMifflin | 1306474964 | notnotbuddy | 4 | 10 | True | 2011-05-27 |
| 4673 | Best Michael vs. Toby scene? | I personally love [this one](http://youtu.be/-... | DunderMifflin | 1306625629 | french91 | 17 | 9 | True | 2011-05-28 |


## Problem Statement

Mahdi's: A hacker claimed these two shows were so similar that there was no point in keeping them separate, and shuffled the two subreddits together -- but Mahdi was able to separate them again.

Mine?: NBC wants to see how people on the internet engage with some of its most famous sitcoms. They assigned an intern to gather all the posts he could find on Reddit for the team to analyze later. This intern is, well, an intern, and he put all the posts into one folder! We were able to separate most of the posts, but the last two shows are still stuck together: "The Office" and "Parks and Recreation." Given that they share some of the same creators, character names, and even actors (looking at you, Rashida Jones), our job is to build a model that can sort through these Reddit posts and separate them back into their appropriate subreddits (The Office -- r/DunderMifflin, or Parks and Recreation -- r/PandR).

On top of that, NBC is also interested in what content made it possible to differentiate between the two shows. We will build a classification model and measure our success through accuracy.
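
Since success is measured through accuracy, it helps to know the score a model would get by always guessing the larger class. A minimal sketch using the row counts reported by `parks_df.shape` and `office_df.shape` above (the variable names are illustrative):

```python
n_parks = 2049   # rows in parks_df
n_office = 2626  # rows in office_df

# Accuracy of always predicting the majority class (r/DunderMifflin)
baseline_accuracy = max(n_parks, n_office) / (n_parks + n_office)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")  # Baseline accuracy: 0.562
```

Any classifier we build should beat roughly 56% accuracy to be worth using.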

- Features: TF-IDF / count vectorizer columns + other features (number of col) + (dummied tags?)
- Don't have to use the extra features
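
The text-only version of that plan could look roughly like the sketch below. It assumes scikit-learn is available and uses `LogisticRegression` purely as a placeholder, since these notes do not name a model. The ten titles are copied from the outputs shown earlier in this notebook and are far too few to train anything meaningful; they only make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

titles = [
    "Letter to Brendanawicz",
    "Noticed they didn't mention jerry in the S3 recap",
    "Star trek movies rule",
    "PLOT HOLE ALERT",
    "Looking for a specific blooper",
    "Dwight as Hannibal Lecter",
    "Season 9 Andy is the worst!",
    "I just noticed something...",
    "Season 9 is terrible",
    "Best Michael vs. Toby scene?",
]
labels = ["PandR"] * 3 + ["DunderMifflin"] * 7

# Vectorize the titles, then fit a linear classifier on the TF-IDF columns
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(titles, labels)

# Pair each vocabulary term with its coefficient to see which words
# pull a post toward one subreddit or the other
vocab = model.named_steps["tfidf"].vocabulary_
terms = sorted(vocab, key=vocab.get)  # terms ordered by column index
coefs = model.named_steps["clf"].coef_[0]
positive_class = model.named_steps["clf"].classes_[1]

ranked = sorted(zip(coefs, terms), reverse=True)
print(f"Terms most associated with {positive_class}:")
for coef, term in ranked[:5]:
    print(f"  {term}: {coef:.3f}")
```

Ranking the coefficients this way is one possible route to answering what content separates the two shows. On the full dataset the model should be scored on a held-out test split, not on the rows it was trained on.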