# Using Reddit's API for Predicting Subreddits

In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled Python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```
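
Even with a custom user agent, a burst of requests can still come back as 429. The sketch below is one way to retry with an increasing pause; `get_with_retry`, `max_tries` and `base_delay` are hypothetical names introduced here, and the request itself is passed in as a callable so the helper does not depend on how the request is built.

```python
import time


def get_with_retry(send, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or tries run out.

    `send` is any zero-argument callable returning an object with a
    `status_code` attribute, for example
    `lambda: requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})`.
    """
    res = None
    for attempt in range(max_tries):
        res = send()
        if res.status_code != 429:
            return res
        # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)
    return res  # still 429 after the final attempt
```

The caller should still check `res.status_code` afterwards, since the last response is returned even when every attempt was throttled.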

In [1]:
import requests
import json
import datetime
from time import sleep
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from bs4 import BeautifulSoup   
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
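def fetch_page(subreddit, after=None):
    # Request one page of a subreddit's listing as JSON.
    url = f'https://www.reddit.com/r/{subreddit}/.json'
    headers = {'User-agent': 'webscraping bot'}
    # Send `after` so each call returns the page following the last post seen;
    # without it, every request returns the same first page.
    params = {'after': after} if after else {}
    r = requests.get(url, headers=headers, params=params)
    return r.json()['data']['children']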

def parse_post(post):
    keep = ['subreddit','selftext','title','score','name','author','num_comments','permalink','stickied','url']
    data = post['data']
    return {k: v for k, v in data.items() if k in keep}

def parsed_page(page):
    parsed_posts = []
    after = ''
    for post in page:
        p = parse_post(post)
        after = p['name']
        parsed_posts.append(p)
    return parsed_posts, after

def fetch_subreddit(subreddit, pages = 4):
    all_posts = []
    after = ''
    for i in range(pages):
        print(f'Fetching Page {i}')
        page = fetch_page(subreddit, after)
        parsed_posts, after = parsed_page(page)
        all_posts.extend(parsed_posts)
        sleep(5)
    return all_posts
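
Pagination is the step most likely to go wrong silently, so it helps to test it without touching the network. The sketch below assumes each listing page is a dict with `data.children` and a `data.after` token for the next page; `collect_posts` and `fetch_listing` are hypothetical names, and posts are de-duplicated on their `name` field.

```python
def collect_posts(fetch_listing, pages=4):
    """Walk a listing page by page, following the `after` token.

    `fetch_listing(after)` is a hypothetical callable that returns the
    parsed JSON of one listing page (a dict with a top-level 'data' key).
    """
    seen, posts, after = set(), [], None
    for _ in range(pages):
        listing = fetch_listing(after)['data']
        for child in listing['children']:
            post = child['data']
            if post['name'] not in seen:   # skip anything already collected
                seen.add(post['name'])
                posts.append(post)
        after = listing.get('after')
        if after is None:                  # no further pages
            break
    return posts
```

If the number of unique `name` values is much smaller than the number of rows collected, the same page is probably being fetched repeatedly.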


In [3]:
subreddit1 = fetch_subreddit('dota2', pages = 40)
subreddit2 = fetch_subreddit('baseball', pages = 40)
sub1 = pd.DataFrame(subreddit1)
sub2 = pd.DataFrame(subreddit2)

Fetching Page 0
Fetching Page 1
Fetching Page 2
Fetching Page 3
Fetching Page 4
Fetching Page 5
Fetching Page 6
Fetching Page 7
Fetching Page 8
Fetching Page 9
Fetching Page 10
Fetching Page 11
Fetching Page 12
Fetching Page 13
Fetching Page 14
Fetching Page 15
Fetching Page 16
Fetching Page 17
Fetching Page 18
Fetching Page 19
Fetching Page 20
Fetching Page 21
Fetching Page 22
Fetching Page 23
Fetching Page 24
Fetching Page 25
Fetching Page 26
Fetching Page 27
Fetching Page 28
Fetching Page 29
Fetching Page 30
Fetching Page 31
Fetching Page 32
Fetching Page 33
Fetching Page 34
Fetching Page 35
Fetching Page 36
Fetching Page 37
Fetching Page 38
Fetching Page 39
Fetching Page 0
Fetching Page 1
Fetching Page 2
Fetching Page 3
Fetching Page 4
Fetching Page 5
Fetching Page 6
Fetching Page 7
Fetching Page 8
Fetching Page 9
Fetching Page 10
Fetching Page 11
Fetching Page 12
Fetching Page 13
Fetching Page 14
Fetching Page 15
Fetching Page 16
Fetching Page 17
Fetching Page 18
Fetching Page 19


In [4]:
sub1['r_dota2'] = 1
sub1['r_baseball'] = 0

In [47]:
sub1.head(5)

|  | author | name | num_comments | permalink | score | selftext | stickied | subreddit | title | url | r_dota2 | r_baseball |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | VRCkid | t3_ag5vek | 171 | /r/DotA2/comments/ag5vek/rdota2_2019_survey_up... | 274 | Hey /r/DotA2!\n\nSince we're at the start of a... | True | DotA2 | /r/Dota2 2019 Survey &amp; Updates | https://www.reddit.com/r/DotA2/comments/ag5vek... | 1 | 0 |
| 1 | VRCbot | t3_aiaulg | 64 | /r/DotA2/comments/aiaulg/the_349th_weekly_stup... | 10 | \nReady the questions! Feel free to ask anythi... | True | DotA2 | The 349th Weekly Stupid Questions Thread | https://www.reddit.com/r/DotA2/comments/aiaulg... | 1 | 0 |
| 2 | qwer4790 | t3_ai9hp2 | 321 | /r/DotA2/comments/ai9hp2/shadow_in_hospital_to... | 1241 |  | False | DotA2 | Shadow in hospital too due to bad food | https://i.redd.it/yuvgsf9unrb21.jpg | 1 | 0 |
| 3 | PSGLGD-Baka | t3_ai8f9v | 169 | /r/DotA2/comments/ai8f9v/lmao_l_know_why_they_... | 1257 |  | False | DotA2 | Lmao, l know why they lose | https://i.redd.it/fneh4ppatqb21.gif | 1 | 0 |
| 4 | lelANDtoplel | t3_ai9qin | 99 | /r/DotA2/comments/ai9qin/congratulations_to_th... | 474 | With their win over Evil Geniuses, Virtus Pro ... | False | DotA2 | Congratulations to the first team to qualify f... | https://www.reddit.com/r/DotA2/comments/ai9qin... | 1 | 0 |


In [6]:
sub2['r_dota2'] = 0
sub2['r_baseball'] = 1

In [48]:
sub2.head(5)

|  | author | name | num_comments | permalink | score | selftext | stickied | subreddit | title | url | r_dota2 | r_baseball |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BaseballBot | t3_aia3lp | 27 | /r/baseball/comments/aia3lp/general_discussion... | 2 | #So what's this thread for?\n\n* ~~Discussion ... | True | baseball | [General Discussion] Around the Horn - 1/21/19 | https://www.reddit.com/r/baseball/comments/aia... | 0 | 1 |
| 1 | thedeejus | t3_a6rwz4 | 125 | /r/baseball/comments/a6rwz4/on_the_downvote_br... | 310 | Hey everybody, we're well aware of the apparen... | True | baseball | On the downvote brigade | https://www.reddit.com/r/baseball/comments/a6r... | 0 | 1 |
| 2 | RealJimBoeheim | t3_ai69pc | 222 | /r/baseball/comments/ai69pc/harper_confirmed_j... | 2288 |  | False | baseball | [Harper] Confirmed: Just called Tony Romo to s... | https://twitter.com/Bharper3407/status/1087200... | 0 | 1 |
| 3 | Wizedex | t3_aiavsr | 42 | /r/baseball/comments/aiavsr/ja_happ_reds_outbi... | 101 |  | False | baseball | J.A. Happ: Reds outbid the Yankees and offered... | https://www.mlbtraderumors.com/2019/01/rosenth... | 0 | 1 |
| 4 | Too_Hood_95 | t3_aiadom | 65 | /r/baseball/comments/aiadom/verducci_if_mlb_im... | 51 |  | False | baseball | [Verducci] If MLB Implements a Significant Rul... | https://www.si.com/mlb/2019/01/17/rob-manfred-... | 0 | 1 |


In [8]:
data = pd.concat([sub1, sub2], ignore_index = True)

In [49]:
data.head(5)

|  | author | name | num_comments | permalink | score | selftext | stickied | subreddit | title | url | r_dota2 | r_baseball |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | VRCkid | t3_ag5vek | 171 | /r/DotA2/comments/ag5vek/rdota2_2019_survey_up... | 274 | Hey /r/DotA2!\n\nSince we're at the start of a... | True | DotA2 | /r/Dota2 2019 Survey &amp; Updates | https://www.reddit.com/r/DotA2/comments/ag5vek... | 1 | 0 |
| 1 | VRCbot | t3_aiaulg | 64 | /r/DotA2/comments/aiaulg/the_349th_weekly_stup... | 10 | \nReady the questions! Feel free to ask anythi... | True | DotA2 | The 349th Weekly Stupid Questions Thread | https://www.reddit.com/r/DotA2/comments/aiaulg... | 1 | 0 |
| 2 | qwer4790 | t3_ai9hp2 | 321 | /r/DotA2/comments/ai9hp2/shadow_in_hospital_to... | 1241 |  | False | DotA2 | Shadow in hospital too due to bad food | https://i.redd.it/yuvgsf9unrb21.jpg | 1 | 0 |
| 3 | PSGLGD-Baka | t3_ai8f9v | 169 | /r/DotA2/comments/ai8f9v/lmao_l_know_why_they_... | 1257 |  | False | DotA2 | Lmao, l know why they lose | https://i.redd.it/fneh4ppatqb21.gif | 1 | 0 |
| 4 | lelANDtoplel | t3_ai9qin | 99 | /r/DotA2/comments/ai9qin/congratulations_to_th... | 474 | With their win over Evil Geniuses, Virtus Pro ... | False | DotA2 | Congratulations to the first team to qualify f... | https://www.reddit.com/r/DotA2/comments/ai9qin... | 1 | 0 |


#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [10]:
target = data['r_dota2']
X1 = data['title']
X2 = data['selftext']

In [50]:
target[:5]

0    1
1    1
2    1
3    1
4    1
Name: r_dota2, dtype: int64

In [51]:
X1[:5]

0                   /r/Dota2 2019 Survey &amp; Updates
1             The 349th Weekly Stupid Questions Thread
2               Shadow in hospital too due to bad food
3                           Lmao, l know why they lose
4    Congratulations to the first team to qualify f...
Name: title, dtype: object

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
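
To make the difference between count, binary and TF-IDF features concrete, here is a small sketch on two made-up cleaned titles (the `docs` list is invented for illustration and is not taken from the scraped data).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two made-up cleaned titles, for illustration only.
docs = ["team liquid wins wins", "yankees trade gray"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

binary_vec = CountVectorizer(binary=True)
binary = binary_vec.fit_transform(docs).toarray()

tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# Vocabulary terms ordered by their column index.
terms = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(terms)
print(counts[0])          # raw counts: "wins" is counted twice
print(binary[0])          # presence/absence only: "wins" becomes 1
print(tfidf[0].round(2))  # weighted and length-normalised
```

With short titles most words appear at most once, so count and binary features may turn out nearly identical; the difference is more likely to matter for the longer post text.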

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X1, target, random_state=42, test_size=0.3)

In [52]:
X_train[:5]

415                         Turning Point for Team Liquid
273                            Lmao, l know why they lose
1782       [General Discussion] Around the Horn - 1/21/19
250     The major host shaved his beard off after the ...
413                                The Chinese Masterplan
Name: title, dtype: object

In [15]:
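def title_to_words(title):
    # Strip HTML markup and entities; naming the parser avoids a bs4 warning.
    title_text = BeautifulSoup(title, 'html.parser').get_text()
    lower_case = title_text.lower()
    # Keep only runs of the letters a-z (drops digits and punctuation).
    retokenizer = RegexpTokenizer(r'[a-z]+')
    words = retokenizer.tokenize(lower_case)
    # Requires the NLTK stopword corpus: nltk.download('stopwords')
    stops = set(stopwords.words('english'))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)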
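
To see what each cleaning step does, the sketch below mirrors the same pipeline using only the standard library. It is a stand-in, not the notebook's function: `html.unescape` replaces BeautifulSoup, `re.findall` replaces `RegexpTokenizer`, and `STOPS` is a tiny hypothetical stop list rather than NLTK's English list.

```python
import html
import re

# Tiny stand-in stop list (an assumption for this sketch); the notebook
# uses NLTK's much longer English list.
STOPS = {"the", "to", "of", "and", "a", "in", "why", "they"}


def clean_title(title, stops=STOPS):
    """Unescape HTML entities, lowercase, keep a-z runs, drop stop words."""
    text = html.unescape(title).lower()
    words = re.findall(r"[a-z]+", text)
    return " ".join(w for w in words if w not in stops)


print(clean_title("Lmao, l know why they lose"))         # lmao l know lose
print(clean_title("/r/Dota2 2019 Survey &amp; Updates"))  # r dota survey updates
```

Note that the letters-only pattern splits `dota2` into `dota` and discards the digit, which may or may not be what you want for subreddit-specific tokens.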

In [16]:
num_titles = X_train.size
cleaned_train_titles = []

for i in range(num_titles):
    # Print a progress message every 50 titles
    if (i + 1) % 50 == 0:
        print("Title %d of %d\n" % (i + 1, num_titles))
    cleaned_train_titles.append(title_to_words(X_train.iloc[i]))

Title 50 of 1512

Title 100 of 1512

Title 150 of 1512

Title 200 of 1512

Title 250 of 1512

Title 300 of 1512

Title 350 of 1512

Title 400 of 1512

Title 450 of 1512

Title 500 of 1512

Title 550 of 1512

Title 600 of 1512

Title 650 of 1512

Title 700 of 1512

Title 750 of 1512

Title 800 of 1512

Title 850 of 1512

Title 900 of 1512

Title 950 of 1512

Title 1000 of 1512

Title 1050 of 1512

Title 1100 of 1512

Title 1150 of 1512

Title 1200 of 1512

Title 1250 of 1512

Title 1300 of 1512

Title 1350 of 1512

Title 1400 of 1512

Title 1450 of 1512

Title 1500 of 1512



In [17]:
num_titles2 = X_test.size
cleaned_test_titles = []

for i in range(num_titles2):
    # Print a progress message every 50 titles
    if (i + 1) % 50 == 0:
        print("Title %d of %d\n" % (i + 1, num_titles2))
    cleaned_test_titles.append(title_to_words(X_test.iloc[i]))

Title 50 of 648

Title 100 of 648

Title 150 of 648

Title 200 of 648

Title 250 of 648

Title 300 of 648

Title 350 of 648

Title 400 of 648

Title 450 of 648

Title 500 of 648

Title 550 of 648

Title 600 of 648



In [53]:
cleaned_test_titles[:5]

['general discussion around horn',
 'director looks cool match',
 'rosenthal one possible delay announcement reds trade yankees sonny gray jonheyman said yesterday finalized reds might trying sign gray extension part deal confirmed discussion would make sense',
 'starting shortstops guess year',
 'maffia works']

In [54]:
cleaned_train_titles[:5]

['turning point team liquid',
 'lmao l know lose',
 'general discussion around horn',
 'major host shaved beard first ehome vs fnatic game',
 'chinese masterplan']

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)

vectorizer = CountVectorizer(max_features=200)

# Learn the vocabulary from the training titles only
vectorizer.fit(cleaned_train_titles)

train_features = vectorizer.transform(cleaned_train_titles)
test_features = vectorizer.transform(cleaned_test_titles)

In [42]:
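# Pair each vocabulary word with its logistic regression coefficient.
# On scikit-learn 1.0 or later, use vectorizer.get_feature_names_out() instead.
features = dict(zip(vectorizer.get_feature_names(), log_reg.coef_[0]))
feature_df = pd.DataFrame(features, index=['coef']).T
feature_df['abs_coef'] = feature_df['coef'].abs()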

In [45]:
feature_df.sort_values('abs_coef', ascending=False).head(20)

|  | coef | abs_coef |
|---|---|---|
| qualify | 1.317536 | 1.317536 |
| year | -1.269444 | 1.269444 |
| give | -1.099655 | 1.099655 |
| see | -0.898013 | 0.898013 |
| lot | 0.814378 | 0.814378 |
| already | -0.798472 | 0.798472 |
| end | -0.786887 | 0.786887 |
| molitor | -0.772896 | 0.772896 |
| seasons | -0.772896 | 0.772896 |
| right | -0.71557 | 0.71557 |


In [56]:
# On scikit-learn 1.0 or later, use vectorizer.get_feature_names_out() instead.
vocab = vectorizer.get_feature_names()
vocab[:5]

['admins', 'already', 'also', 'another', 'anything']

In [22]:
train_features

<1512x200 sparse matrix of type '<class 'numpy.int64'>'
	with 7681 stored elements in Compressed Sparse Row format>

In [23]:
test_features

<648x200 sparse matrix of type '<class 'numpy.int64'>'
	with 2603 stored elements in Compressed Sparse Row format>

## Predicting subreddit using Random Forests + Another Classifier

In [24]:
log_reg = LogisticRegression()
log_reg.fit(train_features, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [25]:
log_reg.score(train_features, y_train)

0.9351851851851852

In [26]:
log_reg.score(test_features, y_test)

0.9074074074074074

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

#### Thought experiment: What is the baseline accuracy for this model?

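The baseline accuracy is the score from always predicting the more common subreddit; because both subreddits were scraped with the same number of pages, it should be close to 0.5. Using title text alone, the logistic regression model beats that comfortably, scoring about 0.935 on the training set and 0.907 on the test set.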
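
The baseline for a classifier is the accuracy of always guessing the most common class. A minimal sketch of that calculation follows; `baseline_accuracy` is a hypothetical helper and the labels are made up.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Made-up labels for illustration: 6 posts from one subreddit, 4 from the other.
print(baseline_accuracy([1] * 6 + [0] * 4))  # 0.6
```

In the notebook the same number can be read from `target.value_counts(normalize=True).max()`.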

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [46]:
dt = RandomForestClassifier()

dt.fit(train_features, y_train)
print(dt.score(train_features, y_train))
dt.score(test_features, y_test)

0.9351851851851852




0.9074074074074074

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.
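
For the bonus, one possible setup is sketched below on a made-up toy corpus. Putting the vectorizer inside the `Pipeline` means its vocabulary is learned from the training folds only during cross-validation. The step names `vec` and `clf` and the parameter grid are arbitrary choices for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Made-up cleaned titles and labels (1 = first subreddit, 0 = second).
titles = [
    "team liquid qualify major", "hero patch nerf", "major qualifier bracket",
    "support hero guide", "patch notes hero", "team roster major",
    "yankees sign pitcher", "home run record", "pitcher trade rumors",
    "world series game", "spring training roster", "hall fame vote",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression()),
])

# "<step name>__<parameter>" selects which setting of which step to tune.
param_grid = {
    "vec__binary": [False, True],
    "clf__C": [0.1, 1.0, 10.0],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(titles, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data you would pass the cleaned training titles and `y_train` to `search.fit`, then score `search.best_estimator_` once on the held-out test set.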

In [28]:
s = cross_val_score(dt, train_features, y_train, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.935 ± 0.013
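
Accuracy alone hides which subreddit the errors fall on. As a sketch of the "other metrics" mentioned above, the code below counts the four confusion-matrix cells and derives precision and recall by hand; the helper names and the labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and predictions for illustration.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(confusion_counts(y_true, y_pred))   # (2, 1, 1, 2)
print(precision_recall(y_true, y_pred))
```

In practice `confusion_matrix` and `classification_report` from `sklearn.metrics` report the same quantities.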


#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [29]:
dt2 = DecisionTreeClassifier()

dt2.fit(train_features, y_train)

print(dt2.score(test_features, y_test))

s2 = cross_val_score(dt2, train_features, y_train, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Decision Tree", s2.mean().round(3), s2.std().round(3)))

0.9074074074074074
Decision Tree Score:	0.935 ± 0.013


In [30]:
dt3 = BaggingClassifier()

dt3.fit(train_features, y_train)

s3 = cross_val_score(dt3, train_features, y_train, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Bagging", s3.mean().round(3), s3.std().round(3)))

Bagging Score:	0.935 ± 0.013


### I'll repeat the procedure using the post text instead of the title

In [31]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, target, random_state=42, test_size=0.3)

In [32]:
num_text = X2_train.size
cleaned_train_text = []

for i in range(num_text):
    # Print a progress message every 50 posts
    if (i + 1) % 50 == 0:
        print("Title %d of %d\n" % (i + 1, num_text))
    cleaned_train_text.append(title_to_words(X2_train.iloc[i]))

num_text2 = X2_test.size
cleaned_test_text = []

for i in range(num_text2):
    # Print a progress message every 50 posts
    if (i + 1) % 50 == 0:
        print("Title %d of %d\n" % (i + 1, num_text2))
    cleaned_test_text.append(title_to_words(X2_test.iloc[i]))

Title 50 of 1512

Title 100 of 1512

Title 150 of 1512

Title 200 of 1512

Title 250 of 1512

Title 300 of 1512

Title 350 of 1512

Title 400 of 1512

Title 450 of 1512

Title 500 of 1512

Title 550 of 1512

Title 600 of 1512

Title 650 of 1512

Title 700 of 1512

Title 750 of 1512

Title 800 of 1512

Title 850 of 1512

Title 900 of 1512

Title 950 of 1512

Title 1000 of 1512

Title 1050 of 1512

Title 1100 of 1512

Title 1150 of 1512

Title 1200 of 1512

Title 1250 of 1512

Title 1300 of 1512

Title 1350 of 1512

Title 1400 of 1512

Title 1450 of 1512

Title 1500 of 1512

Title 50 of 648

Title 100 of 648

Title 150 of 648

Title 200 of 648

Title 250 of 648

Title 300 of 648

Title 350 of 648

Title 400 of 648

Title 450 of 648

Title 500 of 648

Title 550 of 648

Title 600 of 648



In [57]:
cleaned_train_text[:5]

['',
 '',
 'thread discussion yesterday games excitement today games staring window waiting spring general questions mildly interesting facts praising santa anything else worth sharing asking warrant post game threads use games schedule sidebar navigate team want game thread featured posts links join official r baseball discord server https www reddit com r baseball comments bxp official rbaseball discord server new r baseball baseball check https www reddit com r baseball comments ofctq new comers guide common baseball terms newcomers guide common baseball terms u aagpeng reviewed r baseball offseason schedule hot stove guide https www reddit com r baseball comments taccj rbaseball offseason schedule hot stove guide book club january book rule work ben lindbergh sam miller discussion held january https www reddit com r baseball comments ac vmu reminder january book month note best user experience recommend disabling reddit redesign using r baseball week schedule day feature sunday bas

In [34]:
vectorizer = CountVectorizer(max_features=200)

# Learn the vocabulary from the training post text only
vectorizer.fit(cleaned_train_text)

train_features2 = vectorizer.transform(cleaned_train_text)
test_features2 = vectorizer.transform(cleaned_test_text)

In [35]:
train_features2

<1512x200 sparse matrix of type '<class 'numpy.int64'>'
	with 8565 stored elements in Compressed Sparse Row format>

In [36]:
test_features2

<648x200 sparse matrix of type '<class 'numpy.int64'>'
	with 3995 stored elements in Compressed Sparse Row format>

In [37]:
log_reg = LogisticRegression()
log_reg.fit(train_features2, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [None]:
log_reg.score(train_features2, y_train)

In [38]:
log_reg.score(test_features2, y_test)

0.5416666666666666

In [39]:
dt = RandomForestClassifier()
dt.fit(train_features2, y_train)
dt.score(test_features2, y_test)



0.6234567901234568

In [40]:
s = cross_val_score(dt, train_features2, y_train, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.632 ± 0.017
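
One possible reason the post-text scores are lower is that many posts have no body at all, as the empty strings at the start of `cleaned_train_text` suggest. The sketch below, on made-up posts shaped like the dicts returned by `parse_post`, measures how common that is and shows one way to combine title and body so every post has some text. Both helper names are hypothetical.

```python
def empty_body_share(posts):
    """Fraction of posts whose 'selftext' is missing or blank."""
    empty = sum(1 for p in posts if not (p.get("selftext") or "").strip())
    return empty / len(posts)


def combined_text(post):
    """Join title and body so every post has some text to vectorize."""
    parts = (post.get("title") or "", post.get("selftext") or "")
    return " ".join(part for part in parts if part)


# Made-up posts shaped like the dicts returned by parse_post.
posts = [
    {"title": "Shadow in hospital", "selftext": ""},
    {"title": "Weekly questions", "selftext": "Ask anything here"},
]
print(empty_body_share(posts))             # 0.5
print([combined_text(p) for p in posts])
```

Whether the combined text actually improves on titles alone is something to check with the same cross-validation used above.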


# Executive Summary
---
Put your executive summary in a Markdown cell below.

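Using the 200 most frequent title words, the Random Forest model predicted the subreddit well: about 0.935 in cross-validation and 0.907 on the test set. Post text was far less useful, with scores of roughly 0.62 to 0.63, possibly because many posts have no body text.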