# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

In [34]:
### New question: how Democrat / Republican is your subreddit? (cats, dogs; narwhal, bacon)

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit throttles Python's default user agent. You'll need to set a custom `User-agent` header to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [1]:
import pandas as pd

In [2]:
import requests
import json

In [23]:
URL = "https://www.reddit.com/r/asoiaf/.json"

In [24]:
## YOUR CODE HERE
res = requests.get(URL, headers={'User-agent': 'Conor Barry Bot 0.1'})

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [25]:
data = res.json()
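The dictionary returned by `res.json()` nests each post under `data['data']['children'][i]['data']`. As a minimal sketch of working with that shape, the hypothetical helper `extract_posts` below flattens a listing into one small dictionary per post. The `sample` dictionary is a hand-made stand-in with made-up values, not real Reddit output.

```python
def extract_posts(listing):
    """Return one flat dict per post from a Reddit listing dictionary."""
    posts = []
    for child in listing['data']['children']:
        post = child['data']
        posts.append({
            'subreddit': post['subreddit'],
            'title': post['title'],
            # Fall back to an empty string if the key is missing.
            'selftext': post.get('selftext', ''),
        })
    return posts


# Hand-made stand-in for res.json(); the values are illustrative only.
sample = {
    'data': {
        'after': 't3_example',
        'children': [
            {'kind': 't3', 'data': {'subreddit': 'asoiaf', 'title': 'First post', 'selftext': ''}},
            {'kind': 't3', 'data': {'subreddit': 'asoiaf', 'title': 'Second post', 'selftext': 'Body text'}},
        ],
    }
}

posts = extract_posts(sample)
print(posts[1]['title'])
```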

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3)  # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
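The three steps and the throttling note can be sketched as one loop. This is only an illustration: `collect_children` and its `fetch_page` argument are hypothetical names, and the sketch assumes `after` comes back as `None` once there are no more pages. Passing the page fetcher in as a function lets you try the loop on fake data before pointing it at Reddit.

```python
import time


def collect_children(fetch_page, max_pages, pause_seconds=3):
    """Follow the 'after' token across pages and gather every child entry.

    fetch_page is a hypothetical callable: it takes an 'after' token (None for
    the first page) and returns the parsed JSON dictionary for that page.
    """
    children = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        children.extend(page['data']['children'])
        after = page['data']['after']
        if after is None:  # no further pages to request
            break
        time.sleep(pause_seconds)  # throttle between requests
    return children


# Fake two-page source standing in for real requests; the values are made up.
fake_pages = {
    None: {'data': {'children': ['post_a', 'post_b'], 'after': 't3_fake'}},
    't3_fake': {'data': {'children': ['post_c'], 'after': None}},
}
all_children = collect_children(lambda after: fake_pages[after], max_pages=5, pause_seconds=0)
print(len(all_children))
```

For real data, `fetch_page` could wrap the `requests.get(...)` call shown earlier and append `?after=...` to the URL whenever the token is not `None`.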

In [6]:
import time

In [7]:
## YOUR CODE HERE
data['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [8]:
data['data']['after']

't3_9bahet'

In [13]:
print('   ',len(data['data']['children']))

    27


In [37]:
URL = "https://www.reddit.com/r/Democrat/.json"
res = requests.get(URL, headers={'User-agent': 'Conor Barry Bot 0.1'})
data = res.json()

In [57]:
data['data']

{'modhash': '',
 'dist': 14,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'Republican',
    'selftext': '',
    'author_fullname': 't2_x9cry',
    'saved': False,
    'mod_reason_title': None,
    'gilded': 0,
    'clicked': False,
    'title': 'Man Gives His Excuse For Stealing Teen’s MAGA Hat – Cap Same As KKK Hood',
    'link_flair_richtext': [],
    'subreddit_name_prefixed': 'r/Republican',
    'hidden': False,
    'pwls': 6,
    'link_flair_css_class': None,
    'downs': 0,
    'thumbnail_height': 60,
    'hide_score': False,
    'name': 't3_8wvq99',
    'quarantine': False,
    'link_flair_text_color': 'dark',
    'author_flair_background_color': '',
    'subreddit_type': 'public',
    'ups': 23,
    'domain': 'dailycaller.com',
    'media_embed': {},
    'thumbnail_width': 140,
    'author_flair_template_id': None,
    'is_original_content': False,
    'user_reports': [],
    'secure_media': None,
    'is_reddit_media_domain': False,
    'i

In [60]:
data_list = []
URL = "https://www.reddit.com/r/Democrats/.json"

for step in range(20):
    res = requests.get(URL, headers={'User-agent': 'Conor Barry Bot 0.1'})
    data = res.json()

    # Keep only the fields needed for modeling from each post
    for child in data['data']['children']:
        post = child['data']
        temp = {}
        temp['subreddit'] = post['subreddit_name_prefixed'].replace('r/', '')
        temp['title'] = post['title']
        temp['selftext'] = post['selftext']
        data_list.append(temp)

    new_after = data['data']['after']
    if new_after is None:  # no more pages; avoids concatenating None below
        break
    URL = "https://www.reddit.com/r/Democrats/.json?after=" + new_after
    print('    iteration {} worked. URL: {}'.format(step, URL))
    time.sleep(3)
    
df_democrats_all = pd.DataFrame(data_list)
print('    size: ', df_democrats_all.shape)
df_democrats_all.head()

    iteration 0 worked. URL: https://www.reddit.com/r/Democrats/.json?after=t3_9blrny
    iteration 1 worked. URL: https://www.reddit.com/r/Democrats/.json?after=t3_9b93fw
    iteration 2 worked. URL: https://www.reddit.com/r/Democrats/.json?after=t3_9b31sf
    iteration 3 worked. URL: https://www.reddit.com/r/Democrats/.json?after=t3_9aupq1
    iteration 4 worked. URL: https://www.reddit.com/r/Democrats/.json?after=t3_9ailj2
    iteration 5 worked. URL: https://www.reddit.com/r/Democrats/.json?after=t3_9al1re
    iteration 6 worked. URL: https://www.reddit.com/r/Democrats/.json?after=t3_9a7vsn
    iteration 7 worked. URL: https://www.reddit.com/r/Democrats/.json?after=t3_9a4jli
    iteration 8 worked. URL: https://www.reddit.com/r/Democrats/.json?after=t3_9a2n2q
    iteration 9 worked. URL: https://www.reddit.com/r/Democrats/.json?after=t3_99sqoc
    iteration 10 worked. URL: https://www.reddit.com/r/Democrats/.json?after=t3_99qmyi
    iteration 11 worked. URL: https://www.reddit.com/

| | selftext | subreddit | title |
|---|---|---|---|
| 0 | We are two of r/democrats more vocal members f... | democrats | THE TIME FOR UNITY IS NOW - A Progressive and ... |
| 1 | | democrats | Trump's Popularity Is Still Rock Solid Because... |
| 2 | | democrats | Trump finally realizes that he confessed to ob... |
| 3 | | democrats | Republicans can’t even agree to take a segrega... |
| 4 | In his Op-Ed piece called "The Commander of Fe... | democrats | Trump will burn in hell for spreading one fear... |


### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.
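One way to save regularly is to append each batch of posts to the same file as soon as it is scraped. The sketch below uses the standard-library `csv` module rather than pandas, purely to keep it dependency-free; `append_rows` and the file name `posts.csv` are hypothetical, and the rows are made up.

```python
import csv
import os
import tempfile


def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV file, writing the header only for a new file."""
    is_new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new_file:
            writer.writeheader()
        writer.writerows(rows)


# Demonstrate two successive saves into a temporary folder.
columns = ['subreddit', 'title']
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'posts.csv')
    append_rows(path, [{'subreddit': 'democrats', 'title': 'first batch'}], columns)
    append_rows(path, [{'subreddit': 'democrats', 'title': 'second batch'}], columns)
    with open(path, newline='', encoding='utf-8') as handle:
        saved = list(csv.DictReader(handle))

print(len(saved))
```

Calling `append_rows` once per loop iteration means a crash costs you at most the batch that was in flight.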

In [None]:
# Export to csv
df_democrats_all.to_csv('./InitialDemocratsPosts.csv')

In [65]:
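# df_republicans_all is assumed to have been scraped from r/Republican the same way as above.
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
df_init_DemRep = pd.concat([df_democrats_all, df_republicans_all], ignore_index=True, verify_integrity=True)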

In [67]:
df_init_DemRep.shape

(1001, 3)

In [68]:
df_init_DemRep['subreddit'].value_counts()

democrats     501
Republican    500
Name: subreddit, dtype: int64

In [79]:
mask={
    'democrats':1, 
    'Republican':0
}
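# map() returns a new Series; the result is not assigned here, so 'subreddit' keeps its string labels
df_init_DemRep['subreddit'].map(mask)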
df_init_DemRep.isnull().sum()

selftext     0
subreddit    0
title        0
dtype: int64

In [71]:
# Many assumptions here, but they seem necessary
# Subscriber figures as of 8/30/2018
dem_subscribers = 66.2  # thousand
rep_subscribers = 52.4  # thousand
dem_prior = dem_subscribers / (dem_subscribers + rep_subscribers)
dem_prior

0.5581787521079259

In [72]:
# Loosely corresponds to the 2016 presidential election figure of 51%
# The higher figure here jibes with the qualitative assumption that Reddit leans Democrat a little more

#### Conor's Gameplan

1. Get Dems, repubs into single frame
    1a. Convert data
    1b. Temporarily downselect titles only
2. Run logit
3. Answer basic questions
4. Create separate title vector (with word_ format for feature names) for each
5. Collect these together to one mega frame and re-run
6. Synthesize Democrat / Republican-ness

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
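To make the difference between count and binary features concrete, here is a toy sketch in plain Python. It is a simplified stand-in, not how scikit-learn's vectorizers are implemented: it only lowercases and splits on whitespace, whereas the real vectorizers apply their own tokenization, stop-word removal, and (for TF-IDF) weighting. The title is made up.

```python
from collections import Counter


def count_features(text):
    """Count how often each lowercase whitespace-separated token occurs."""
    return Counter(text.lower().split())


def binary_features(text):
    """Record only whether each token occurs (1), not how often."""
    return {token: 1 for token in count_features(text)}


title = 'Trump says Trump will win'
counts = count_features(title)
flags = binary_features(title)
print(counts['trump'], flags['trump'])
```

Post titles are short, so most tokens appear once and the two representations will often look very similar.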

In [88]:
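from sklearn.feature_extraction.text import CountVectorizer
# ENGLISH_STOP_WORDS is imported from the text module; the old stop_words module is no longer importable in recent scikit-learn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns

In [63]:
# Convert the frozenset to a list, which the vectorizers accept for stop_words
my_stop_words = list(ENGLISH_STOP_WORDS)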

In [80]:
df_data = df_init_DemRep.loc[:, ['subreddit', 'title']]

In [142]:
X_train, X_test, y_train, y_test = train_test_split(df_data.title, df_data.subreddit, random_state=42)

In [143]:
TfdVec=TfidfVectorizer(stop_words=my_stop_words, ngram_range=(1,3)) #best n_grams <5
# toarray() gives a plain ndarray; get_feature_names_out() replaces the removed get_feature_names()
X_train_transform = TfdVec.fit_transform(X_train).toarray()
X_test_transform = TfdVec.transform(X_test).toarray()
df_X_train_transform = pd.DataFrame(X_train_transform, columns=TfdVec.get_feature_names_out())
df_X_test_transform = pd.DataFrame(X_test_transform, columns=TfdVec.get_feature_names_out())
log_reg = LogisticRegression()
log_reg.fit(df_X_train_transform, y_train)
log_reg.score(df_X_test_transform, y_test)

0.6972111553784861

In [144]:
feature_values =(dict(zip(list(df_X_train_transform.columns), list(log_reg.coef_[0]))))
sorted_d = sorted((value, key) for (key,value) in feature_values.items())

print('   republican indicators:')
for i in range(10):
    print('    ', sorted_d[i])
#print('    democratic indicators: ', sorted_d[:-10])
print('\n    democratic indicators:')
for i in range(10):
    print('    ', sorted_d[-i-1])

   republican indicators:
     (-0.5984480636388538, 'left')
     (-0.5488044144002266, 'gun')
     (-0.5262355220612943, 'liberals')
     (-0.5231938719769645, 'good')
     (-0.5134134530368043, 'socialism')
     (-0.5030253676182352, 'congressional')
     (-0.490210697278464, 'hc')
     (-0.4891010749006024, 'media')
     (-0.446918686010309, 'free')
     (-0.4345843387239702, 'life')

    democratic indicators:
     (1.866470345595284, 'trump')
     (1.217342670132449, 'president')
     (1.0114776609558307, 'mccain')
     (0.8149181734446282, 'cohen')
     (0.7846011810419642, 'twitter')
     (0.7738936478928158, 'democrats')
     (0.7323648092965098, 'plan')
     (0.7116658175775133, 'rourke')
     (0.6726589036408543, 'kavanaugh')
     (0.6533326653291052, 'beto')
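Two things help when reading these coefficients. First, the labels here are strings, and scikit-learn sorts class labels, so a positive coefficient points toward the label that sorts last. Second, a logistic-regression coefficient is a change in log-odds, so exponentiating it gives an odds multiplier per one-unit increase in the feature. The sketch below works through both using the rounded `trump` coefficient from the output above. TF-IDF values are fractions between 0 and 1, so the effect on any single title is smaller than this multiplier suggests.

```python
import math

# scikit-learn orders class labels by sorting them; uppercase letters sort
# before lowercase ones, so 'Republican' comes first and 'democrats' second.
labels = sorted(['democrats', 'Republican'])
positive_class = labels[1]

# Coefficient for 'trump', copied (rounded) from the output above.
trump_coef = 1.8665
odds_multiplier = math.exp(trump_coef)

print(positive_class, odds_multiplier)
```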


#### Refit using binary features

In [163]:
TfdVec=TfidfVectorizer(stop_words=my_stop_words, ngram_range=(1,1), binary=True) #best n_grams <5
X_train_transform = TfdVec.fit_transform(X_train).toarray()
X_test_transform = TfdVec.transform(X_test).toarray()
df_X_train_transform = pd.DataFrame(X_train_transform, columns=TfdVec.get_feature_names_out())
df_X_test_transform = pd.DataFrame(X_test_transform, columns=TfdVec.get_feature_names_out())
log_reg = LogisticRegression()
log_reg.fit(df_X_train_transform, y_train)
log_reg.score(df_X_test_transform, y_test)

0.7051792828685259

In [164]:
feature_values =(dict(zip(list(df_X_train_transform.columns), list(log_reg.coef_[0]))))
sorted_d = sorted((value, key) for (key,value) in feature_values.items())

print('   republican indicators:')
for i in range(10):
    print('    ', sorted_d[i])
#print('    democratic indicators: ', sorted_d[:-10])
print('\n    democratic indicators:')
for i in range(10):
    print('    ', sorted_d[-i-1])

   republican indicators:
     (-0.8543113703113804, 'left')
     (-0.8152754312971255, 'good')
     (-0.775454522607984, 'gun')
     (-0.7702200864564787, 'liberals')
     (-0.7600401777948265, 'supporters')
     (-0.7522166252206345, 'hc')
     (-0.7306916639294441, 'work')
     (-0.728088623906636, 'congressional')
     (-0.7145830163857193, 'free')
     (-0.6865350734897988, 'media')

    democratic indicators:
     (1.7823235746145472, 'trump')
     (1.7271626220987657, 'president')
     (1.2197904298138396, 'mccain')
     (1.1515761426609172, 'rourke')
     (1.0821573931416122, 'democrats')
     (1.0472360743100555, 'kavanaugh')
     (1.0147076937557096, 'twitter')
     (0.9839460599021884, 'plan')
     (0.9610823165330641, 'beto')
     (0.9509836918380705, 'cohen')


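Answer: Accuracy improved slightly, from 0.697 to 0.705, and the top 10 keywords are largely the same. Note that this refit also narrowed the n-gram range from (1,3) to (1,1), so the gain cannot be attributed to binary features alone.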

## Predicting subreddit using Random Forests + Another Classifier

In [165]:
## YOUR CODE HERE

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

In [166]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?

In [174]:
## Baseline accuracy is about 50%: the classes are nearly balanced (501 vs. 500 posts)
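A common way to define the baseline is the accuracy you would get by always predicting the most frequent class. A minimal sketch, using the class counts reported by `value_counts()` earlier in this notebook (`majority_baseline` is a hypothetical helper name):

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())


# Counts taken from the value_counts() output for the combined frame.
full_counts = {'democrats': 501, 'Republican': 500}
baseline = majority_baseline(full_counts)
print(baseline)
```

The same function can be applied to the test-set counts, which may differ slightly from the full data after the random split.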

In [159]:
from sklearn.metrics import confusion_matrix

In [168]:
cm = confusion_matrix(y_test, log_reg.predict(df_X_test_transform))
df_cm = pd.DataFrame(cm, columns=['Predicted Repub', 'Pred Dem'], index=['Actual Rep', 'Actual Dem'])
df_cm

| | Predicted Repub | Pred Dem |
|---|---|---|
| Actual Rep | 89 | 39 |
| Actual Dem | 35 | 88 |


In [172]:
y_test.shape[0]

251

In [173]:
#Actual Accuracy:
(df_cm.iloc[0,0] + df_cm.iloc[1,1] )/y_test.shape[0]

0.7051792828685259

In [176]:
# Given the baseline, the model improved over random guessing (50% dem) by about 41% in relative terms
(df_cm.iloc[0,0] + df_cm.iloc[1,1] )/y_test.shape[0] / .5

1.4103585657370519
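Accuracy is only one summary of the confusion matrix above. As a sketch of the "other metrics" you might report, the block below derives precision, recall, and F1 by hand from the same four counts, treating Democrat as the positive class (that choice is an assumption made for illustration).

```python
# Counts copied from the confusion matrix above: rows are actual classes,
# columns are predicted classes.
true_rep, false_dem = 89, 39   # actual Republican: predicted Rep, predicted Dem
false_rep, true_dem = 35, 88   # actual Democrat:   predicted Rep, predicted Dem

total = true_rep + false_dem + false_rep + true_dem
accuracy = (true_rep + true_dem) / total

# Of the posts predicted Democrat, the share that really were Democrat.
precision_dem = true_dem / (true_dem + false_dem)
# Of the posts that really were Democrat, the share the model found.
recall_dem = true_dem / (true_dem + false_rep)
# Harmonic mean of precision and recall.
f1_dem = 2 * precision_dem * recall_dem / (precision_dem + recall_dem)

print(accuracy, precision_dem, recall_dem, f1_dem)
```

In scikit-learn, `classification_report` from `sklearn.metrics` reports these figures for every class at once.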

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [151]:
from sklearn.ensemble import RandomForestClassifier

In [177]:
TfdVec=TfidfVectorizer(stop_words=my_stop_words, ngram_range=(1,1)) #best n_grams <5
X_train_transform = TfdVec.fit_transform(X_train).toarray()
X_test_transform = TfdVec.transform(X_test).toarray()
df_X_train_transform = pd.DataFrame(X_train_transform, columns=TfdVec.get_feature_names_out())
df_X_test_transform = pd.DataFrame(X_test_transform, columns=TfdVec.get_feature_names_out())
rfc = RandomForestClassifier()
rfc.fit(df_X_train_transform, y_train)
rfc.score(df_X_test_transform, y_test)

0.6573705179282868

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.

In [179]:
cm = confusion_matrix(y_test, log_reg.predict(df_X_test_transform))
df_cm = pd.DataFrame(cm, columns=['Predicted Repub', 'Pred Dem'], index=['Actual Rep', 'Actual Dem'])
df_cm
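# This is essentially the same as the earlier matrix because it still calls log_reg, not rfc; use rfc.predict(df_X_test_transform) to evaluate the random forest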

| | Predicted Repub | Pred Dem |
|---|---|---|
| Actual Rep | 89 | 39 |
| Actual Dem | 36 | 87 |


In [153]:
from sklearn.model_selection import cross_val_score

In [186]:
rfc = RandomForestClassifier()
for i in range(2,10):
    print('    ', i, cross_val_score(rfc, df_X_train_transform, y_train, cv=i).mean())

     2 0.6293333333333333
     3 0.6333333333333333
     4 0.6426356809648425
     5 0.6597924648502896
     6 0.66
     7 0.6759630124116105
     8 0.6625299521968958
     9 0.6652323580034424
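To see what the `cv` argument controls, here is a plain-Python sketch of how k-fold cross-validation partitions the rows: each fold serves once as the validation set while the remaining rows are used for training. This is a simplified illustration with a hypothetical helper name. For classifiers, `cross_val_score` with an integer `cv` uses stratified folds, which also balance the class proportions; this sketch does not.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(len(train), validation)
```

Larger `k` leaves more rows for training in each fold, which is one plausible reason the mean scores above drift upward as `cv` grows.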


In [199]:
X_train, X_test, y_train, y_test = train_test_split(df_data.title, 
                                                    df_data.subreddit,
                                                    test_size=.3, 
                                                    random_state=42)

In [200]:
TfdVec=TfidfVectorizer(stop_words=my_stop_words, ngram_range=(1,1), binary=True) #best n_grams <5
X_train_transform = TfdVec.fit_transform(X_train).toarray()
X_test_transform = TfdVec.transform(X_test).toarray()
df_X_train_transform = pd.DataFrame(X_train_transform, columns=TfdVec.get_feature_names_out())
df_X_test_transform = pd.DataFrame(X_test_transform, columns=TfdVec.get_feature_names_out())
log_reg = LogisticRegression()
log_reg.fit(df_X_train_transform, y_train)
log_reg.score(df_X_test_transform, y_test)

0.707641196013289

##### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [201]:
from sklearn.naive_bayes import MultinomialNB

In [203]:
TfdVec=TfidfVectorizer(stop_words=my_stop_words, ngram_range=(1,1), binary=True) #best n_grams <5
X_train_transform = TfdVec.fit_transform(X_train).toarray()
X_test_transform = TfdVec.transform(X_test).toarray()
df_X_train_transform = pd.DataFrame(X_train_transform, columns=TfdVec.get_feature_names_out())
df_X_test_transform = pd.DataFrame(X_test_transform, columns=TfdVec.get_feature_names_out())
mnb = MultinomialNB()
mnb.fit(df_X_train_transform, y_train)
mnb.score(df_X_test_transform, y_test)

0.6478405315614618

##### Pipeline and GridSearchCV

In [205]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from nltk.stem.snowball import EnglishStemmer

In [231]:
pipe = Pipeline([
    ('TfdVec2', TfidfVectorizer()),
    ('log_reg2', LogisticRegression())
])



param_grid2 = {
    'TfdVec2__stop_words':[None, my_stop_words],
    'TfdVec2__binary': [False, True],
    'log_reg2__C': [1,10,30] }

gs2 = GridSearchCV(pipe, param_grid2, cv=4)
# Getting weird parameters here. Suspect the vectorization is not going great
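For reference, here is a small end-to-end sketch of fitting a grid search over a text pipeline. It is not the search run in this notebook: the titles and labels are made up, the step names `tfidf` and `log_reg` are arbitrary, and the grid is deliberately tiny. The point it illustrates is that the pipeline should be fit on raw text, so the vectorizer is refit on the training portion of each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy titles and labels, only to make the sketch runnable.
titles = [
    'senate votes on healthcare plan', 'governor unveils healthcare plan',
    'new healthcare plan gains support', 'healthcare plan debate continues',
    'court rules on gun rights', 'rally defends gun rights',
    'gun rights bill advances', 'gun rights group grows',
]
labels = ['democrats'] * 4 + ['Republican'] * 4

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('log_reg', LogisticRegression()),
])

# Parameter names follow the pattern <step name>__<parameter name>.
param_grid = {
    'tfidf__binary': [False, True],
    'log_reg__C': [1, 10],
}

# The pipeline receives raw text, so the vectorizer is refit inside each fold.
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(titles, labels)
print(search.best_params_)
```

After fitting, `search.best_score_` holds the mean cross-validated accuracy of the best combination, and `search.best_estimator_` is a pipeline refit on all the data passed to `fit`.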


# Executive Summary
---
Put your executive summary in a Markdown cell below.