# Data Collection

---

### Project Introduction

---

### Subreddit selection
---

In [104]:
# Imports
import re
import requests
import pandas as pd
import matplotlib.pyplot as plt
import time

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

---
### Using Pushshift's API to pull data from subreddits

I knew I would be pulling data from several subreddits for this project, so I wrote a function to streamline the API requests.

You can read more about Pushshift's API on this [GitHub page](https://github.com/pushshift/api). There is also a [YouTube video](https://www.youtube.com/watch?v=AcrjEWsMi_E) walkthrough of setting up this API. 

In [114]:
# Pull data a specified number of times, from a specified subreddit, starting before a specified time
def get_posts(pull_type, iters, subreddit, desired_time):

    # Build the Pushshift search URL; pull_type is 'submission' or 'comment'
    url = 'https://api.pushshift.io/reddit/search/' + str(pull_type)

    # Create empty master dataframe to fill
    master_df = pd.DataFrame()

    # Loop through the specified number of iterations
    for i in range(iters):
        # Set API parameters
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': desired_time}

        # Pull data and fail loudly on a bad HTTP status
        res = requests.get(url, params=params)
        res.raise_for_status()
        posts = res.json()['data']

        # Stop early if there is nothing older left to pull
        if not posts:
            print(f'No data returned on iteration {i+1}, stopping early')
            break
        df = pd.DataFrame(posts)

        # Concatenate data to master dataframe (older batches go on top)
        frames = [df, master_df]
        master_df = pd.concat(frames, axis=0, ignore_index=True)

        # Get time of oldest post in this data
        # This resets the API parameters so that you pull older posts every iteration
        desired_time = df['created_utc'].min()
        print(f'Completed {i+1} iterations, {iters-i-1} iterations remaining')

        # Time delay so you don't get banned by Pushshift
        time.sleep(60)

    # Return dataframe containing all collected posts
    return master_df
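
To see how the `before` cursor walks backwards through time without touching the network, here is a minimal offline sketch. The names `paginate` and `fake_fetch` are hypothetical and not part of Pushshift or this notebook. The sketch only assumes that each page is a list of dicts with a `created_utc` key, and that `before` returns items strictly older than the given time. It also stops when a page comes back empty, since an empty page has no `created_utc` to continue from.

```python
def paginate(fetch_page, start_time, max_pages):
    """Collect up to max_pages pages, moving the `before` cursor back each time.

    fetch_page is a hypothetical callable: fetch_page(before) -> list of dicts,
    each with a 'created_utc' key and all strictly older than `before`.
    """
    collected = []
    before = start_time
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:
            # Nothing older is available, so there is no new cursor to compute
            break
        collected.extend(page)
        # The oldest item on this page becomes the upper bound for the next page
        before = min(item['created_utc'] for item in page)
    return collected


# Fake data source standing in for the API: ten "posts" created at t = 1..10
ALL_POSTS = [{'id': n, 'created_utc': n} for n in range(1, 11)]


def fake_fetch(before, size=4):
    """Return at most `size` fake posts older than `before`, newest first."""
    older = [p for p in ALL_POSTS if p['created_utc'] < before]
    return sorted(older, key=lambda p: p['created_utc'], reverse=True)[:size]


posts = paginate(fake_fetch, start_time=11, max_pages=5)
print(len(posts), len({p['id'] for p in posts}))
```

Because the fake source uses a strict "older than" bound, every post is collected exactly once. With a real API, posts that share the page's oldest timestamp could be skipped or repeated depending on how the bound is treated, which is one reason to check `nunique()` after a pull.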

---
### Pulling data from subreddits

I took two routes for this project: first, a model that predicts whether a post came from subreddit A or subreddit B; second, a model that predicts whether a post from a single subreddit was written in year A or year B. To support both, I pulled data from three subreddits in four total pulls: one from [r/DMAcademy](https://www.reddit.com/r/DMAcademy/), one from [r/truezelda](https://www.reddit.com/r/truezelda/), one from [r/PoliticalDiscussion](https://www.reddit.com/r/PoliticalDiscussion/) in the year 2012, and one from r/PoliticalDiscussion in the year 2020. Each pull totaled 5,000 subreddit posts (post title and main text only, no comments), except for the 2020 pull from r/PoliticalDiscussion. For whatever reason, this subreddit gave me an error when I tried to pull the last 100 posts, so I settled for 4,900 from that year.

I chose 5,000 posts so that my models would have plenty of data to learn from. It was recommended that my models be trained on 2,000 posts from each subreddit as a **minimum**, but I know that posts can be removed or deleted online, so I pulled well over the recommended minimum to make sure I would have enough posts to work with. However, 10,000 total posts is a lot for my models to crunch, so 3,000 or 4,000 per subreddit may be a better target next time.

### **Warning**
Do not run any of the cells below unless you have ~4 hours to spare.
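
That estimate follows from the settings used in this notebook: each pull runs 50 iterations with a 60-second pause, and there are four pulls in total. Here is a quick back-of-the-envelope check that counts only the sleep time and ignores request latency:

```python
iterations_per_pull = 50   # iterations passed to get_posts
sleep_seconds = 60         # time.sleep(60) inside get_posts
pulls = 4                  # DMAcademy, truezelda, PoliticalDiscussion 2012 and 2020

# Sleep time for one pull, in minutes
minutes_per_pull = iterations_per_pull * sleep_seconds / 60

# Sleep time for all pulls, in hours
total_hours = pulls * minutes_per_pull / 60

print(f'{minutes_per_pull:.0f} minutes per pull, about {total_hours:.1f} hours in total')
```

That works out to 50 minutes per pull and a little over three hours of waiting overall, before any time spent on the requests themselves.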

To start, let's pull 5,000 posts from the DMAcademy subreddit and store them in a dataframe. Remember, the `get_posts` function pulls 100 posts per iteration, so passing it 50 will produce 50 * 100 posts, or 5,000! I also passed `int(time.time())` to `get_posts` to pull the 5,000 most recent posts at the time of writing. When we move on to the political discussion pulls, you'll see me pass a specific [Unix or Epoch time](https://en.wikipedia.org/wiki/Unix_time) (the number of seconds since 00:00:00 UTC on Jan 1, 1970) to pull data from before a specific date and time in 2012 and 2020.

In [6]:
# If you want to investigate the missing links without waiting 100 minutes to pull data, uncomment the two lines below and run this cell

# dmacademy_df = pd.read_csv('../data/dmacademy.csv')
# truezelda_df = pd.read_csv('../data/truezelda.csv')

In [33]:
dmacademy_df = get_posts('submission', 50, 'DMAcademy', int(time.time()))

Completed 1 iterations, 0 iterations remaining


In [13]:
# Inspect the dataframe
dmacademy_df

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,link_flair_css_class,post_hint,preview,author_flair_background_color,author_flair_text_color,removed_by_category,gilded,author_cakeday,suggested_sort,banned_by
0,[],False,Vashael,,[],,text,t2_7p5eovfe,False,False,...,,,,,,,,,,
1,[],False,Atarihero76,,[],,text,t2_48oeurxu,False,False,...,Guide,self,"{'enabled': False, 'images': [{'id': '9XSNOgfA...",,,,,,,
2,[],False,JethroBuldean,,[],,text,t2_1plx2r4z,False,False,...,,,,,,,,,,
3,[],False,Mechaaniac,,[],,text,t2_a04c8tpe,False,False,...,,,,,,,,,,
4,[],False,Hungerforhuman,,[],,text,t2_a2ouksrx,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,[],False,AngelsJos,,[],,text,t2_9571c28u,False,True,...,,,,,,,,,,
4996,[],False,Light_of_Avalon,,[],,text,t2_871qw,False,False,...,,,,,,,,,,
4997,[],False,Randoff-Runemaker,,[],,text,t2_ab1yxy0e,False,False,...,,,,,,,,,,
4998,[],False,Shatyel,,[],,text,t2_1v4sccw1,False,False,...,,,,,,,,,,


In [14]:
dmacademy_df['full_link'].nunique()

5000

The printout above shows us the number of **unique reddit links** contained in our dataframe. This tells us that we didn't pull any duplicate posts, hooray! 

This all looks good, so now we can pull posts from r/truezelda.

In [16]:
truezelda_df = get_posts('submission', 50, 'truezelda', int(time.time()))

Completed 1 iterations, 49 iterations remaining
Completed 2 iterations, 48 iterations remaining
Completed 3 iterations, 47 iterations remaining
Completed 4 iterations, 46 iterations remaining
Completed 5 iterations, 45 iterations remaining
Completed 6 iterations, 44 iterations remaining
Completed 7 iterations, 43 iterations remaining
Completed 8 iterations, 42 iterations remaining
Completed 9 iterations, 41 iterations remaining
Completed 10 iterations, 40 iterations remaining
Completed 11 iterations, 39 iterations remaining
Completed 12 iterations, 38 iterations remaining
Completed 13 iterations, 37 iterations remaining
Completed 14 iterations, 36 iterations remaining
Completed 15 iterations, 35 iterations remaining
Completed 16 iterations, 34 iterations remaining
Completed 17 iterations, 33 iterations remaining
Completed 18 iterations, 32 iterations remaining
Completed 19 iterations, 31 iterations remaining
Completed 20 iterations, 30 iterations remaining
Completed 21 iterations, 29 i

In [17]:
truezelda_df

Unnamed: 0,author,author_flair_css_class,author_flair_text,brand_safe,can_mod_post,contest_mode,created_utc,domain,full_link,id,...,updated_utc,steward_reports,og_description,og_title,removed_by_category,removed_by,url_overridden_by_dest,link_flair_template_id,link_flair_text,poll_data
0,xxx_mlgnoscope_xxx,,,True,False,False,1510504803,self.truezelda,https://www.reddit.com/r/truezelda/comments/7c...,7cge06,...,,,,,,,,,,
1,Ender_Skywalker,,,True,False,False,1510444647,self.truezelda,https://www.reddit.com/r/truezelda/comments/7c...,7cbzex,...,,,,,,,,,,
2,[deleted],,,True,False,False,1510444560,self.truezelda,https://www.reddit.com/r/truezelda/comments/7c...,7cbz3u,...,,,,,,,,,,
3,bluestreakxp,,,True,False,False,1510275857,self.truezelda,https://www.reddit.com/r/truezelda/comments/7b...,7bxsop,...,,,,,,,,,,
4,Kholdstare93,,,True,False,False,1510270811,self.truezelda,https://www.reddit.com/r/truezelda/comments/7b...,7bxbp1,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4991,lutyrannus,,,,False,False,1614809046,self.truezelda,https://www.reddit.com/r/truezelda/comments/lx...,lx5kjs,...,,,,,,,,64e0e3ac-d29d-11ea-a357-0e5060aaa7f3,Open Discussion,
4992,novacav,,,,False,False,1614794221,self.truezelda,https://www.reddit.com/r/truezelda/comments/lx...,lx01je,...,,,,,,,,64e0e3ac-d29d-11ea-a357-0e5060aaa7f3,Open Discussion,
4993,[deleted],,,,False,False,1614789322,self.truezelda,https://www.reddit.com/r/truezelda/comments/lw...,lwy2zq,...,,,,,deleted,,,7cfab9de-d5b0-11ea-86b5-0e8cdba62949,Question,
4994,yuckygross,,,,False,False,1614737483,self.truezelda,https://www.reddit.com/r/truezelda/comments/lw...,lwinf0,...,,,,,,,,7cfab9de-d5b0-11ea-86b5-0e8cdba62949,Question,


In [5]:
truezelda_df['id'].nunique()

NameError: name 'truezelda_df' is not defined

In [None]:
# Are the missing ids nulls?
truezelda_df['id'].isnull().sum()
# Nope! 

For whatever reason, it looks like we may have pulled 4 duplicate posts. Since this is only 0.08% of our data from this subreddit (4 out of 5,000), let's ignore it and use what we have.
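
If you would rather remove duplicates than ignore them, here is a minimal pandas sketch. It uses a small made-up dataframe rather than the real pull, and it assumes the `id` column identifies a post uniquely, which is how the check above uses it.

```python
import pandas as pd

# Toy stand-in for truezelda_df: the post with id 'b' appears twice
toy_df = pd.DataFrame({
    'id': ['a', 'b', 'b', 'c'],
    'title': ['first', 'second', 'second', 'third'],
})

# Count rows whose id has already appeared earlier in the dataframe
n_dupes = toy_df.duplicated(subset='id').sum()

# Keep the first occurrence of each id and renumber the index
deduped = toy_df.drop_duplicates(subset='id', keep='first').reset_index(drop=True)

print(n_dupes, len(deduped))
```

On the real data the same two calls would be applied to `truezelda_df` before saving it.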

---
### Save data to .csv files
Now that we've pulled the data needed for the first model, let's save each dataframe to a `.csv` file.

In [19]:
# Set index=False to avoid creating an unnecessary index column
dmacademy_df.to_csv('../data/dmacademy.csv', index=False)
truezelda_df.to_csv('../data/truezelda.csv', index=False)

---
### Political discussion subreddit
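Now let's pull data from r/PoliticalDiscussion. To pull from specific years, I converted dates to Unix timestamps using an [Epoch time converter](https://www.epochconverter.com/). Note that the cells below query the `comment` endpoint, so each row is a comment and its text lives in the `body` column.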
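
The same conversion can be done in Python with the standard library instead of a website. This sketch assumes the timestamps are meant as GMT/UTC, which matches the dates noted next to the pulls below. The helper name `to_epoch` is made up for this example.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Return Unix time (whole seconds since 1970-01-01 00:00:00 UTC)."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


# The two cut-off times used for the r/PoliticalDiscussion pulls
ts_2012 = to_epoch(2012, 3, 31, 4, 46, 48)
ts_2020 = to_epoch(2020, 3, 31, 4, 46, 48)
print(ts_2012, ts_2020)

# Going the other way: turn a timestamp back into a readable UTC date
readable_2020 = datetime.fromtimestamp(ts_2020, tz=timezone.utc).isoformat()
print(readable_2020)
```

Passing `tzinfo=timezone.utc` matters here: without it, `datetime` treats the values as local time and the resulting timestamp shifts by your UTC offset.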

In [123]:
# Define reddit's URL for requests

url = 'https://api.pushshift.io/reddit/search/' + 'comment'

In [124]:
params = {
    'subreddit': 'PoliticalDiscussion',
    'size': 100,
    'before': 1585630008}
# Tuesday, March 31, 2020 4:46:48 AM GMT
# Pull data
res = requests.get(url, params=params)
data = res.json()
posts = data['data']
df = pd.DataFrame(posts)

In [125]:
res.status_code

200

In [109]:
df['body']

0                                              Prob not
1     Yes the country would be better off if those s...
2     [A reminder for everyone](https://www.reddit.c...
3     I don’t see this pandemic as a solid argument ...
4     Do you think America would be better off witho...
                            ...                        
95    Hello, /u/Gransanto102. Thanks for contributin...
96    Hello, /u/SocialObserver3802. Thanks for contr...
97    Hello, /u/SthenicFreeze. Thanks for contributi...
98    What specifically didn’t you approve of? The o...
99                                            [removed]
Name: body, Length: 100, dtype: object

In [110]:
df['body'].isnull().sum()

0

In [111]:
df.loc[df['body']=='[removed]']['body']

7     [removed]
55    [removed]
81    [removed]
99    [removed]
Name: body, dtype: object

In [112]:
df.loc[df['body'].str.contains('Hello, /u/')]['body']

39    Hello, /u/qikqu35t10n. Thanks for contributing...
40    Hello, /u/loganalytics. Thanks for contributin...
86    Hello, /u/makinggrace. Thanks for contributing...
87    Hello, /u/cc_hk. Thanks for contributing! Unfo...
94    Hello, /u/Realtalkdo3. Thanks for contributing...
95    Hello, /u/Gransanto102. Thanks for contributin...
96    Hello, /u/SocialObserver3802. Thanks for contr...
97    Hello, /u/SthenicFreeze. Thanks for contributi...
Name: body, dtype: object

In [113]:
# If you want to investigate the missing links without waiting 100 minutes to pull data, uncomment the two lines below and run this cell

# poli_dis_2012_df = pd.read_csv('../data/poli_dis_2012.csv')
# poli_dis_2020_df = pd.read_csv('../data/poli_dis_2020.csv')

In [115]:
poli_dis_2012_df = get_posts('comment', 50, 'PoliticalDiscussion', 1333169208)
# This time is Saturday, March 31, 2012 4:46:48 AM GMT
# or Friday, March 30, 2012 9:46:48 PM GMT-07:00

Completed 1 iterations, 49 iterations remaining
Completed 2 iterations, 48 iterations remaining
Completed 3 iterations, 47 iterations remaining
Completed 4 iterations, 46 iterations remaining
Completed 5 iterations, 45 iterations remaining
Completed 6 iterations, 44 iterations remaining
Completed 7 iterations, 43 iterations remaining
Completed 8 iterations, 42 iterations remaining
Completed 9 iterations, 41 iterations remaining
Completed 10 iterations, 40 iterations remaining
Completed 11 iterations, 39 iterations remaining
Completed 12 iterations, 38 iterations remaining
Completed 13 iterations, 37 iterations remaining
Completed 14 iterations, 36 iterations remaining
Completed 15 iterations, 35 iterations remaining
Completed 16 iterations, 34 iterations remaining
Completed 17 iterations, 33 iterations remaining
Completed 18 iterations, 32 iterations remaining
Completed 19 iterations, 31 iterations remaining
Completed 20 iterations, 30 iterations remaining
Completed 21 iterations, 29 i

In [116]:
poli_dis_2012_df.loc[poli_dis_2012_df['body']=='[deleted]']['body']

11      [deleted]
34      [deleted]
77      [deleted]
83      [deleted]
119     [deleted]
          ...    
4928    [deleted]
4929    [deleted]
4930    [deleted]
4931    [deleted]
4932    [deleted]
Name: body, Length: 267, dtype: object

In [117]:
poli_dis_2012_df.loc[poli_dis_2012_df['body']=='[removed]']['body']

Series([], Name: body, dtype: object)

In [118]:
poli_dis_2012_df['body'].isnull().sum()

0

In [119]:
poli_dis_2020_df['body'][:100]

0                                              Prob not
1     Yes the country would be better off if those s...
2     [A reminder for everyone](https://www.reddit.c...
3     I don’t see this pandemic as a solid argument ...
4     Do you think America would be better off witho...
                            ...                        
95    Hello, /u/Gransanto102. Thanks for contributin...
96    Hello, /u/SocialObserver3802. Thanks for contr...
97    Hello, /u/SthenicFreeze. Thanks for contributi...
98    What specifically didn’t you approve of? The o...
99                                            [removed]
Name: body, Length: 100, dtype: object

In [120]:
poli_dis_2012_df['body'].nunique()

4725

In [126]:
poli_dis_2020_df = get_posts('comment', 50, 'PoliticalDiscussion', 1585630008)
# This time is Tuesday, March 31, 2020 4:46:48 AM GMT

Completed 1 iterations, 49 iterations remaining
Completed 2 iterations, 48 iterations remaining
Completed 3 iterations, 47 iterations remaining
Completed 4 iterations, 46 iterations remaining
Completed 5 iterations, 45 iterations remaining
Completed 6 iterations, 44 iterations remaining
Completed 7 iterations, 43 iterations remaining
Completed 8 iterations, 42 iterations remaining
Completed 9 iterations, 41 iterations remaining
Completed 10 iterations, 40 iterations remaining
Completed 11 iterations, 39 iterations remaining
Completed 12 iterations, 38 iterations remaining
Completed 13 iterations, 37 iterations remaining
Completed 14 iterations, 36 iterations remaining
Completed 15 iterations, 35 iterations remaining
Completed 16 iterations, 34 iterations remaining
Completed 17 iterations, 33 iterations remaining
Completed 18 iterations, 32 iterations remaining
Completed 19 iterations, 31 iterations remaining
Completed 20 iterations, 30 iterations remaining
Completed 21 iterations, 29 i

In [127]:
poli_dis_2020_df['body'][400]

'Texas has changed because of California voters fleeing the state and its taxes. As urban population grows Texas is getting more purple.'

In [128]:
poli_dis_2020_df.loc[poli_dis_2020_df['body'].str.contains('Hello, /u/')]['body']

37      Hello, /u/PMmeURsprintPROGRAMS. Thanks for con...
38      Hello, /u/pleasedontbullyme_. Thanks for contr...
39      Hello, /u/chickenman86. Thanks for contributin...
40      Hello, /u/dumbirds. Thanks for contributing! U...
67      Hello, /u/Theduder89. Thanks for contributing!...
                              ...                        
4987    Hello, /u/cc_hk. Thanks for contributing! Unfo...
4994    Hello, /u/Realtalkdo3. Thanks for contributing...
4995    Hello, /u/Gransanto102. Thanks for contributin...
4996    Hello, /u/SocialObserver3802. Thanks for contr...
4997    Hello, /u/SthenicFreeze. Thanks for contributi...
Name: body, Length: 182, dtype: object

In [129]:
poli_dis_2020_df.loc[poli_dis_2020_df['body'].str.contains('I am a bot')]['body']

1       [A reminder for everyone](https://www.reddit.c...
3       [A reminder for everyone](https://www.reddit.c...
19      [A reminder for everyone](https://www.reddit.c...
42      [A reminder for everyone](https://www.reddit.c...
45      [A reminder for everyone](https://www.reddit.c...
                              ...                        
4909    Your submission was automatically removed.\n\n...
4947    [A reminder for everyone](https://www.reddit.c...
4962    [A reminder for everyone](https://www.reddit.c...
4989    [A reminder for everyone](https://www.reddit.c...
4990    [A reminder for everyone](https://www.reddit.c...
Name: body, Length: 271, dtype: object

In [132]:
poli_dis_2020_df.loc[poli_dis_2020_df['body']=='[removed]']['body'].count()

219

In [133]:
poli_dis_2020_df.loc[poli_dis_2020_df['body']=='[deleted]']['body'].count()

19

In [134]:
poli_dis_2012_df['year'] = '2012'
poli_dis_2020_df['year'] = '2020'

In [137]:
print(poli_dis_2012_df['id'].nunique())
print(poli_dis_2020_df['id'].nunique())

5000
5000


In [138]:
print(poli_dis_2020_df['body'].nunique())
print(poli_dis_2012_df['body'].nunique())

4448
4725


In [143]:
print(poli_dis_2012_df.shape)
print(poli_dis_2020_df.shape)

(5000, 23)
(5000, 36)


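Both pulls returned 5,000 comments with unique `id`s, but not every `body` is usable text. The 2012 pull contains 267 `[deleted]` comments and no `[removed]` ones, while the 2020 pull contains 219 `[removed]` and 19 `[deleted]` comments, plus 182 comments containing "Hello, /u/" and 271 containing "I am a bot". The 2012 data has 4,725 unique bodies and the 2020 data has 4,448. The two dataframes also have different numbers of columns (23 for 2012, 36 for 2020).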
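
The checks above suggest a first cleaning pass before modeling: drop placeholder bodies and automated comments. The sketch below runs on a tiny made-up dataframe. The `is_noise` helper and the choice of which patterns count as noise are assumptions based on the strings searched for in this notebook, not a definitive filter.

```python
import pandas as pd

PLACEHOLDERS = ['[removed]', '[deleted]']
BOT_MARKERS = ['Hello, /u/', 'I am a bot']


def is_noise(body):
    """Flag placeholder bodies and comments containing a known automated marker."""
    mask = body.isin(PLACEHOLDERS)
    for marker in BOT_MARKERS:
        # regex=False treats the marker as a literal string, not a pattern
        mask |= body.str.contains(marker, regex=False)
    return mask


# Made-up comments covering each case seen in the real pulls
toy_df = pd.DataFrame({'body': [
    'Prob not',
    '[removed]',
    'Hello, /u/someone. Thanks for contributing!',
    'A reminder for everyone. I am a bot.',
    '[deleted]',
    'Texas is getting more purple.',
]})

# Keep only the rows that are not flagged as noise
clean_df = toy_df.loc[~is_noise(toy_df['body'])].reset_index(drop=True)
print(clean_df['body'].tolist())
```

On the real data the same mask could be applied to `poli_dis_2012_df['body']` and `poli_dis_2020_df['body']`. Since the 2020 pull has more removed and automated comments, leaving them in could let a model separate the years by moderation boilerplate rather than by what people actually wrote.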

---
### Save data to .csv files
Now that we've pulled the data needed for the second model, let's save each dataframe to a `.csv` file.

In [145]:
poli_dis_2012_df.to_csv('../data/poli_dis_2012.csv', index=False)
poli_dis_2020_df.to_csv('../data/poli_dis_2020.csv', index=False)

---


### Pick up here!!!

So far we've accomplished:
* getting data - 200 subreddit posts so far (not comments)
* count vectorizing the subreddit posts
* passing data to TWO MODELS
    * Bernoulli Naive Bayes model: for binary (0/1) features.
    * Multinomial Naive Bayes model with TF-IDF features: designed for non-negative counts, though it also works with TF-IDF weights in practice.

To do next:
* get MORE DATA - source for pulling data on time delay: https://gist.github.com/tecoholic/1242694
* get different types of data - try comments! try using titles alongside selftext! 
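
For the last to-do item, one simple way to use titles alongside selftext is to join them into a single text column before vectorizing. This is a sketch on made-up rows. It assumes the submission dataframes carry `title` and `selftext` columns, and the new `text` column name is just a placeholder.

```python
import pandas as pd

# Made-up submissions; the second one has no selftext
toy_df = pd.DataFrame({
    'title': ['How do I start a campaign?', 'Timeline question'],
    'selftext': ['I am a new DM.', None],
})

# Replace missing values with empty strings so the concatenation never yields NaN,
# then strip the trailing space left behind when selftext is empty
toy_df['text'] = (
    toy_df['title'].fillna('') + ' ' + toy_df['selftext'].fillna('')
).str.strip()

print(toy_df['text'].tolist())
```

The combined `text` column can then be passed to `CountVectorizer` or `TfidfVectorizer` in place of a single field.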