# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled Python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```
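
If a 429 still shows up after setting the header, one option is to wait and try again. The sketch below is only an illustration of that idea: `get_with_retry`, `max_tries`, and `wait_seconds` are hypothetical names, and `get` is assumed to be any callable shaped like `requests.get` (it is passed in so the helper can be tried without touching the network).

```python
import time


def get_with_retry(get, url, user_agent, max_tries=3, wait_seconds=3, sleep=time.sleep):
    """Call get(url, headers=...) and retry while the server answers 429.

    `get` is assumed to behave like requests.get: it returns an object
    with a `status_code` attribute.
    """
    headers = {'User-agent': user_agent}
    res = None
    for attempt in range(max_tries):
        res = get(url, headers=headers)
        if res.status_code != 429:
            break
        # Too many requests: pause before the next try (no pause after the last one).
        if attempt < max_tries - 1:
            sleep(wait_seconds)
    return res
```

With `requests` imported, a call might look like `res = get_with_retry(requests.get, URL, 'YOUR NAME Bot 0.1')`.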

In [1]:
import requests
import json
import pandas as pd
import time

In [2]:
url = "http://www.reddit.com/r/theonion.json"

In [3]:
## YOUR CODE HERE
res = requests.get(url, headers={'User-agent': 'Chris Manley Bot 0.1'})

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [4]:
data = res.json()
data

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 25,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'TheOnion',
     'selftext': '',
     'author_fullname': 't2_fqzfymc',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'Americans Observing 911 By Trying Not To Masturbate',
     'link_flair_richtext': [],
     'subreddit_name_prefixed': 'r/TheOnion',
     'hidden': False,
     'pwls': None,
     'link_flair_css_class': None,
     'downs': 0,
     'thumbnail_height': 105,
     'hide_score': False,
     'name': 't3_9ew2hp',
     'quarantine': False,
     'link_flair_text_color': 'dark',
     'author_flair_background_color': None,
     'subreddit_type': 'public',
     'ups': 330,
     'domain': 'youtube.com',
     'media_embed': {'content': '&lt;iframe width="459" height="344" src="https://www.youtube.com/embed/dRoVFK2XkN0?feature=oembed&amp;enablejsapi=1" frameborder="0" allow="autoplay

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 
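
Step 2 can be sketched with the standard library so that the query string is always well formed. `page_url` is a hypothetical helper name, and `t3_9ew2hp` is simply the post name from the sample output above, used here as a stand-in for a real `after` value.

```python
from urllib.parse import urlencode


def page_url(subreddit, after=None):
    """Build the listing URL for a subreddit, optionally continuing after a post name."""
    base = 'http://www.reddit.com/r/{}.json'.format(subreddit)
    if after is None:
        return base  # first page: no query string needed
    return base + '?' + urlencode({'after': after})


print(page_url('boardgames'))
print(page_url('boardgames', after='t3_9ew2hp'))
```

Inside a loop, the `after` value returned by each response would be fed back into `page_url` until it comes back as `None`.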

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3)  # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!

In [5]:
def scraper(subreddit, posts, sorter, age, n):    
    after = None
    if age == 0:
        url = 'https://www.reddit.com/r/{}/{}.json'.format(subreddit, sorter)
    else:
        url = 'https://www.reddit.com/r/{}/{}/.json?t={}'.format(subreddit, sorter, age)
    for i in range(n):
        if after is None:
            current_url = url
        elif age == 0:
            current_url = url + '?after=' + after
        else:
            current_url = url + '&after=' + after
        res = requests.get(current_url, headers={'User-agent': 'Chris 1.0'})
        if res.status_code != 200:
            print('Status Error', res.status_code)
            break
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']
        time.sleep(2)
        if after is None:
            break
        if (i + 1) % 25 == 0:
            print(i + 1)
            
    return

In [6]:
def main_scraper(subreddit, posts, sorter_list, age_list, n):
    
    for sorter in sorter_list:
        if sorter in ['top', 'controversial']:
            for age in age_list:
                scraper(subreddit, posts, sorter, age, n)
                print('Finished', sorter, age)
        else:
            scraper(subreddit, posts, sorter, 0, n)  
            print("Finished", sorter)
    
    return

In [7]:
posts = []

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.
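
One way to checkpoint is to rewrite the CSV after every batch of requests. The sketch below uses only the standard library; `save_posts` is a hypothetical helper, and it assumes every post dictionary carries a unique `name` field (as in the sample response above), which it uses to drop posts collected more than once.

```python
import csv


def save_posts(posts, path, key='name'):
    """Write a list of post dictionaries to a CSV file, dropping repeated posts.

    `key` is the field used to spot duplicates; 'name' is assumed to be
    Reddit's unique post identifier (e.g. 't3_9ew2hp').
    """
    unique = {}
    for post in posts:
        unique.setdefault(post.get(key), post)  # keep the first copy of each post
    rows = list(unique.values())
    # Collect every column that appears in any post, in first-seen order.
    fieldnames = []
    for row in rows:
        for field in row:
            if field not in fieldnames:
                fieldnames.append(field)
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling something like `save_posts(posts, './theonion.csv')` after each `scraper` call would keep the file on disk up to date. The same result can be reached with pandas via `pd.DataFrame(posts).to_csv(...)`.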

In [8]:
# This was done in a separate notebook. I will import the csv data into this notebook.
# As the scraping takes some time, I'm just trying to avoid unnecessary scrapes.
# I've included the functions I used to scrape above.
# I'll copy the code to run my functions here:

# main_scraper('theonion', posts, ['hot', 'controversial', 'new', 'top'], ['week','month', 'year', 'all'], n=1000)

onion_data = pd.read_csv('./theonion.csv')
notonion_data = pd.read_csv('./nottheonion.csv')


In [9]:
onion_data['selftext'].fillna('', inplace=True)

onion_data['text'] = onion_data['title'] + onion_data['selftext']

onion_data[['text', 'selftext', 'title']].head()

| | text | selftext | title |
|---|---|---|---|
| 0 | Fabled Lost City Of Gold Finally Discovered Of... | | Fabled Lost City Of Gold Finally Discovered Of... |
| 1 | Purina Introduces ‘Own Shit’ Dog Food Flavor | | Purina Introduces ‘Own Shit’ Dog Food Flavor |
| 2 | Does Brett Kavanaugh’s 1996 Legal Essay ‘Donal... | | Does Brett Kavanaugh’s 1996 Legal Essay ‘Donal... |
| 3 | PR Disaster: Nike Is Under Fire After It Relea... | | PR Disaster: Nike Is Under Fire After It Relea... |
| 4 | Is It Fair To Not Pay College Football Players... | | Is It Fair To Not Pay College Football Players... |


Checking out the Not the Onion data:

In [10]:
notonion_data['selftext'].fillna('', inplace=True)

In [11]:
notonion_data['text'] = notonion_data['title'] + notonion_data['selftext']

In [12]:
notonion_data[['text', 'selftext', 'title']].head()

| | text | selftext | title |
|---|---|---|---|
| 0 | Rocket City Trash Pandas chosen as new Madison... | | Rocket City Trash Pandas chosen as new Madison... |
| 1 | 2K asks fans to tell Belgium they want loot boxes | | 2K asks fans to tell Belgium they want loot boxes |
| 2 | Coastal Labs Studying Increased Flooding Consi... | | Coastal Labs Studying Increased Flooding Consi... |
| 3 | Twentieth Century Fox pulls scene from 'The Pr... | | Twentieth Century Fox pulls scene from 'The Pr... |
| 4 | Tour of Britain bike turned into penis | | Tour of Britain bike turned into penis |


In [13]:
onion_data['subreddit'].replace(to_replace='TheOnion', value=1, inplace=True)

In [14]:
notonion_data['subreddit'].replace(to_replace='nottheonion', value=0, inplace=True)

In [15]:
df2 = pd.concat(objs=(onion_data, notonion_data), axis=0, ignore_index=True)

In [16]:
full_df = df2.loc[:, ['text', 'subreddit']]

In [17]:
full_df.head()

| | text | subreddit |
|---|---|---|
| 0 | Fabled Lost City Of Gold Finally Discovered Of... | 1 |
| 1 | Purina Introduces ‘Own Shit’ Dog Food Flavor | 1 |
| 2 | Does Brett Kavanaugh’s 1996 Legal Essay ‘Donal... | 1 |
| 3 | PR Disaster: Nike Is Under Fire After It Relea... | 1 |
| 4 | Is It Fair To Not Pay College Football Players... | 1 |


In [18]:
X = full_df['text']
y = full_df['subreddit'] 

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
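
To make the count-versus-binary distinction concrete, here is a minimal sketch on three made-up titles (not taken from the scraped data). It relies on the `binary` option of `CountVectorizer`, which records only whether a word appears rather than how often.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['man bites dog', 'dog bites dog', 'man says news']  # toy titles, not from the data set

count_vec = CountVectorizer()               # raw term counts
binary_vec = CountVectorizer(binary=True)   # 1 if the word appears at all, else 0

counts = count_vec.fit_transform(docs).toarray()
flags = binary_vec.fit_transform(docs).toarray()

vocab = count_vec.vocabulary_               # maps each word to its column index
dog_col = vocab['dog']
print(counts[:, dog_col])  # 'dog' counted per title
print(flags[:, dog_col])   # 'dog' present or absent per title
```

For short headlines most words occur only once, so the two encodings may end up very similar; comparing model scores on both is the way to find out.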

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.tokenize import RegexpTokenizer
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemmatizer = WordNetLemmatizer()
import matplotlib.pyplot as plt

import numpy as np

In [20]:
full_df['subreddit'].value_counts()/full_df.shape[0]

1    0.515219
0    0.484781
Name: subreddit, dtype: float64

The data set is well balanced. TheOnion is the majority class at about 51.5%, so that is my baseline accuracy.

In [21]:
full_df['text'].values

array(['Fabled Lost City Of Gold Finally Discovered Off I-95 Outside Baltimore',
       'Purina Introduces ‘Own Shit’ Dog Food Flavor',
       'Does Brett Kavanaugh’s 1996 Legal Essay ‘Donald Trump Should Be Allowed To Commit Crimes If He Becomes President’ Disqualify Him From The Supreme Court?',
       ...,
       'CCTV caught deaf, intellectually disabled trio discussing plot to murder housemate, court hears',
       '‘Covfefe’ on list of vanity license plates banned in Georgia',
       "Woman killed by train while investigating 'goatman' myth"],
      dtype=object)

In [22]:
X = full_df['text']
y = full_df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=28, stratify=y)

In [23]:
X_train.head()

3386    News: Safety First: To Avoid Rio’s Polluted Wa...
2145        I Can't Stand It When Jews Talk During Movies
403     ‘In Office, I Only Ate 7 Almonds A Day. As A P...
5428    This barber will publicly shame your misbehavi...
6166    Sean Spicer stole a mini fridge from Junior Wh...
Name: text, dtype: object

In [24]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: subreddit, dtype: int64

In [25]:
cvec = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = 'english') 

X_train_trans = cvec.fit_transform(X_train)
X_test_trans = cvec.transform(X_test)

In [26]:
X_train_trans

<5026x11819 sparse matrix of type '<class 'numpy.int64'>'
	with 42361 stored elements in Compressed Sparse Row format>

In [27]:
# X_train_trans = X_train_trans.toarray()

In [28]:
X_train_trans.shape

(5026, 11819)

In [29]:
X_train_trans.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [30]:
trans_df = pd.DataFrame(X_train_trans.todense(),
                   columns=cvec.get_feature_names())

In [31]:
trans_df.head()

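| | 00 | 000 | 000th | 001 | 030 | 045 | 07 | 10 | 100 | 1000 | ... | zombies | zone | zones | zoo | zooey | zookeeper | zoologist | zopittybop | zuckerberg | zune |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |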


In [32]:
cvec_top100 = trans_df.sum().sort_values(ascending=False).head(100)

In [33]:
cvec_top100

man           407
say           291
trump         216
new           212
said          212
life          202
says          200
woman         199
just          177
news          168
people        138
police        120
like          112
year          110
video         107
day            97
school         96
old            96
time           92
000            88
report         87
sex            85
white          84
house          78
years          77
world          73
nation         72
know           68
don            67
women          66
             ... 
wife           40
students       40
court          40
need           40
love           40
work           39
wants          39
claims         38
scientists     38
finally        37
rules          37
having         37
god            37
government     37
getting        36
win            36
death          36
right          36
week           36
dog            36
judge          36
help           35
student        35
person         34
shit      

In [34]:
tvec = TfidfVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = 'english') 

X_train_tvec = tvec.fit_transform(X_train)
X_test_tvec = tvec.transform(X_test)

In [35]:
X_train_tvec

<5026x11819 sparse matrix of type '<class 'numpy.float64'>'
	with 42361 stored elements in Compressed Sparse Row format>

In [36]:
X_train_tvec.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [37]:
tvec_df = pd.DataFrame(X_train_tvec.todense(),
                   columns=tvec.get_feature_names())

In [38]:
tvec_df.head()

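| | 00 | 000 | 000th | 001 | 030 | 045 | 07 | 10 | 100 | 1000 | ... | zombies | zone | zones | zoo | zooey | zookeeper | zoologist | zopittybop | zuckerberg | zune |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |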


In [39]:
tvec_top100 = tvec_df.sum().sort_values(ascending=False).head(100)

In [40]:
tvec_top100

man           72.270332
say           55.685287
said          46.312897
says          42.774111
woman         42.391388
trump         42.051258
new           41.797679
life          39.098307
just          34.054436
people        29.627007
news          28.758629
police        27.553071
year          25.204689
video         24.598569
like          24.220693
school        22.924767
report        22.448118
day           22.192520
old           21.739853
nation        21.470238
sex           21.446779
time          21.268394
000           20.196123
white         19.828250
years         19.222980
house         18.861946
know          18.335411
world         18.071483
obama         18.071020
way           17.906507
                ...    
kids          10.898276
work          10.878524
students      10.855711
court         10.687230
claims        10.647629
right         10.645894
little        10.614547
dog           10.479658
getting       10.476167
mom           10.470964
judge         10

## Predicting subreddit using Random Forests + Another Classifier

In [41]:
## YOUR CODE HERE
rf = RandomForestClassifier()

grid = {'n_estimators': [int(x) for x in np.linspace(start = 1000, stop = 2000, num = 10)],
               'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)]}
               #'min_samples_split': [2, 5, 10],
               #'min_samples_leaf': [1, 2, 4]}

rf_grid = GridSearchCV(estimator = rf, 
                       param_grid = grid, 
                       cv = 3, 
                       verbose=2,
                       n_jobs = -1)


#rf = RandomForestClassifier()
#rfcv = cross_val_score(rf, X_train_tvec, y_train)

In [46]:
rf_grid = rf_grid.fit(X_train_tvec, y_train)

Fitting 3 folds for each of 110 candidates, totalling 330 fits
[CV] max_depth=10, n_estimators=1000 .................................
[CV] max_depth=10, n_estimators=1000 .................................
[CV] max_depth=10, n_estimators=1000 .................................
[CV] max_depth=10, n_estimators=1111 .................................
[CV] .................. max_depth=10, n_estimators=1000, total=   8.0s
[CV] max_depth=10, n_estimators=1111 .................................
[CV] .................. max_depth=10, n_estimators=1000, total=   8.1s
[CV] max_depth=10, n_estimators=1111 .................................
[CV] .................. max_depth=10, n_estimators=1000, total=   8.1s
[CV] max_depth=10, n_estimators=1222 .................................
[CV] .................. max_depth=10, n_estimators=1111, total=   9.1s
[CV] max_depth=10, n_estimators=1222 .................................
[CV] .................. max_depth=10, n_estimators=1111, total=   9.8s
[CV] max_depth

[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  2.1min


[CV] .................. max_depth=20, n_estimators=1111, total=  18.5s
[CV] max_depth=20, n_estimators=1222 .................................
[CV] .................. max_depth=20, n_estimators=1111, total=  18.5s
[CV] max_depth=20, n_estimators=1222 .................................
[CV] .................. max_depth=20, n_estimators=1111, total=  18.5s
[CV] max_depth=20, n_estimators=1333 .................................
[CV] .................. max_depth=20, n_estimators=1222, total=  17.8s
[CV] max_depth=20, n_estimators=1333 .................................
[CV] .................. max_depth=20, n_estimators=1222, total=  17.3s
[CV] max_depth=20, n_estimators=1333 .................................
[CV] .................. max_depth=20, n_estimators=1222, total=  17.2s
[CV] max_depth=20, n_estimators=1444 .................................
[CV] .................. max_depth=20, n_estimators=1333, total=  18.6s
[CV] max_depth=20, n_estimators=1444 .................................
[CV] .

[CV] .................. max_depth=40, n_estimators=1000, total=  29.0s
[CV] max_depth=40, n_estimators=1111 .................................
[CV] .................. max_depth=40, n_estimators=1000, total=  28.8s
[CV] max_depth=40, n_estimators=1222 .................................
[CV] .................. max_depth=40, n_estimators=1111, total=  30.3s
[CV] max_depth=40, n_estimators=1222 .................................
[CV] .................. max_depth=40, n_estimators=1111, total=  29.8s
[CV] max_depth=40, n_estimators=1222 .................................
[CV] .................. max_depth=40, n_estimators=1111, total=  28.9s
[CV] max_depth=40, n_estimators=1333 .................................
[CV] .................. max_depth=40, n_estimators=1222, total=  33.2s
[CV] max_depth=40, n_estimators=1333 .................................
[CV] .................. max_depth=40, n_estimators=1222, total=  34.4s
[CV] max_depth=40, n_estimators=1333 .................................
[CV] .

[CV] .................. max_depth=60, n_estimators=1000, total=  37.3s
[CV] max_depth=60, n_estimators=1111 .................................
[CV] .................. max_depth=50, n_estimators=2000, total= 1.1min
[CV] max_depth=60, n_estimators=1111 .................................
[CV] .................. max_depth=60, n_estimators=1000, total=  36.5s
[CV] max_depth=60, n_estimators=1111 .................................
[CV] .................. max_depth=60, n_estimators=1000, total=  35.6s
[CV] max_depth=60, n_estimators=1222 .................................
[CV] .................. max_depth=60, n_estimators=1111, total=  40.9s
[CV] max_depth=60, n_estimators=1222 .................................


[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 21.3min


[CV] .................. max_depth=60, n_estimators=1111, total=  41.0s
[CV] max_depth=60, n_estimators=1222 .................................
[CV] .................. max_depth=60, n_estimators=1111, total=  40.4s
[CV] max_depth=60, n_estimators=1333 .................................
[CV] .................. max_depth=60, n_estimators=1222, total=  45.0s
[CV] max_depth=60, n_estimators=1333 .................................
[CV] .................. max_depth=60, n_estimators=1222, total=  43.9s
[CV] max_depth=60, n_estimators=1333 .................................
[CV] .................. max_depth=60, n_estimators=1222, total=  43.7s
[CV] max_depth=60, n_estimators=1444 .................................
[CV] .................. max_depth=60, n_estimators=1333, total=  50.0s
[CV] max_depth=60, n_estimators=1444 .................................
[CV] .................. max_depth=60, n_estimators=1333, total=  51.9s
[CV] max_depth=60, n_estimators=1444 .................................
[CV] .

[CV] .................. max_depth=80, n_estimators=1000, total=  45.1s
[CV] max_depth=80, n_estimators=1222 .................................
[CV] .................. max_depth=80, n_estimators=1111, total=  52.2s
[CV] max_depth=80, n_estimators=1222 .................................
[CV] .................. max_depth=80, n_estimators=1111, total=  52.3s
[CV] max_depth=80, n_estimators=1222 .................................
[CV] .................. max_depth=80, n_estimators=1111, total=  51.6s
[CV] max_depth=80, n_estimators=1333 .................................
[CV] .................. max_depth=80, n_estimators=1222, total=  55.2s
[CV] max_depth=80, n_estimators=1333 .................................
[CV] .................. max_depth=80, n_estimators=1222, total=  54.0s
[CV] max_depth=80, n_estimators=1333 .................................
[CV] .................. max_depth=80, n_estimators=1222, total=  53.4s
[CV] max_depth=80, n_estimators=1444 .................................
[CV] .

[CV] .................. max_depth=90, n_estimators=2000, total= 1.7min
[CV] max_depth=100, n_estimators=1111 ................................
[CV] ................. max_depth=100, n_estimators=1000, total=  56.0s
[CV] max_depth=100, n_estimators=1111 ................................
[CV] ................. max_depth=100, n_estimators=1000, total=  52.5s
[CV] max_depth=100, n_estimators=1222 ................................
[CV] ................. max_depth=100, n_estimators=1111, total=  59.3s
[CV] max_depth=100, n_estimators=1222 ................................
[CV] ................. max_depth=100, n_estimators=1111, total= 1.0min
[CV] max_depth=100, n_estimators=1222 ................................
[CV] ................. max_depth=100, n_estimators=1111, total=  59.2s
[CV] max_depth=100, n_estimators=1333 ................................
[CV] ................. max_depth=100, n_estimators=1222, total= 1.1min
[CV] max_depth=100, n_estimators=1333 ................................
[CV] .

[Parallel(n_jobs=-1)]: Done 330 out of 330 | elapsed: 76.8min finished


In [48]:
rf_grid.best_score_

0.7495025865499403

In [49]:
rf_grid.best_params_

{'max_depth': 40, 'n_estimators': 2000}

In [51]:
rf_grid.best_estimator_.feature_importances_

array([6.12761983e-07, 3.62793512e-04, 2.28100514e-06, ...,
       0.00000000e+00, 4.54235377e-05, 0.00000000e+00])

In [74]:
features_strength = zip(tvec.get_feature_names(), rf_grid.best_estimator_.feature_importances_);

In [75]:
sorted(features_strength, key=lambda x: abs(x[1]), reverse=True)

[('said', 0.041258000100986444),
 ('life', 0.036204803982002064),
 ('say', 0.02977953048640284),
 ('says', 0.022717569737262355),
 ('news', 0.02241814258180089),
 ('police', 0.02116527752764488),
 ('just', 0.01906282540536102),
 ('trump', 0.015461176341787492),
 ('man', 0.012854822525454324),
 ('nation', 0.012588831627712821),
 ('onion', 0.010623209173151568),
 ('questions', 0.010070345558047974),
 ('arrested', 0.009805644409353787),
 ('woman', 0.00924945402057306),
 ('know', 0.008227134051692463),
 ('report', 0.007305601503758775),
 ('things', 0.007125976212802251),
 ('sex', 0.0070617982937521825),
 ('time', 0.006640179560920047),
 ('shit', 0.006639477413886805),
 ('way', 0.006070042290218351),
 ('texas', 0.005994931674570696),
 ('blog', 0.005670630144197652),
 ('fucking', 0.005465288063784184),
 ('claims', 0.005434161623265901),
 ('quiz', 0.0054215249948956655),
 ('chinese', 0.005003188492890732),
 ('announced', 0.004536657847337158),
 ('video', 0.004465760821434874),
 ('wife', 0.004

In [60]:
rf_grid.best_estimator_.predict(X_test_tvec)

array([1, 1, 1, ..., 0, 0, 1])

In [61]:
rf_grid.best_estimator_.score(X_test_tvec, y_test)

0.7619331742243437

In an earlier run, I tried `max_depth` up to 110 and `n_estimators` up to 1000 and got a score of about 0.75. The best parameters were `max_depth = 30` and `n_estimators = 1000`. I'm going to run this again with `n_estimators` up to 2000 to see whether that improves my score at all.

#### Thought experiment: What is the baseline accuracy for this model?

In [240]:
## YOUR CODE HERE
full_df['subreddit'].value_counts()/full_df.shape[0]

1    0.515219
0    0.484781
Name: subreddit, dtype: float64

Baseline accuracy is the share of the data set that belongs to the majority class. In this case, posts from The Onion make up 51.5% of my data set, so if I guessed that every post was from The Onion, I would be right roughly 51.5% of the time. Any model I create must therefore score better than 51.5% to be deemed better than that naive guess.
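
The same calculation can be written as a small helper. This is a sketch in plain Python; `baseline_accuracy` is a hypothetical name, and the label list is rebuilt from the post counts reported in the executive summary (3,453 and 3,249) rather than from the actual data frame.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# 1 = TheOnion, 0 = nottheonion; counts taken from the executive summary.
labels = [1] * 3453 + [0] * 3249
label, acc = baseline_accuracy(labels)
print(label, round(acc, 6))
```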

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.
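
For the bonus, one possible shape is sketched below. It runs on a handful of made-up headlines so that it finishes instantly; the parameter values in `param_grid` are illustrative assumptions, not tuned choices, and on the real data `texts` and `labels` would be replaced by `X_train` and `y_train`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data standing in for X_train / y_train (not the real posts).
texts = ['area man says news', 'nation says report', 'man arrested by police',
         'police say woman arrested', 'area nation news report', 'woman arrested in texas',
         'report says nation news', 'texas police arrested man']
labels = [1, 1, 0, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression()),
])

# A step name plus a double underscore addresses that step's parameters.
param_grid = {
    'tvec__stop_words': [None, 'english'],
    'tvec__ngram_range': [(1, 1), (1, 2)],
    'lr__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)
print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so the validation fold never leaks into the vocabulary. Swapping `scoring` for another metric name is one way to look beyond accuracy.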

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [47]:
## YOUR CODE HERE
lr = LogisticRegression()
lr.fit(X_train_tvec, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [48]:
print('Logreg intercept:', lr.intercept_)
print('Logreg coef(s):', lr.coef_)

Logreg intercept: [-0.1299145]
Logreg coef(s): [[ 0.10792601  0.19787256 -0.1440772  ... -0.09892511  0.24442562
   0.15475768]]
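
The raw coefficient array is hard to read on its own. A small helper that pairs each weight with its word can make it easier to see which terms pull a prediction toward each class. This is a sketch: `top_coefficients` is a hypothetical name, and the words and weights below are toy values, not output from the fitted model.

```python
def top_coefficients(feature_names, coefficients, k=3):
    """Return the k most negative and the k most positive (word, weight) pairs."""
    pairs = sorted(zip(feature_names, coefficients), key=lambda pair: pair[1])
    return pairs[:k], pairs[-k:][::-1]


# Toy values, not taken from the fitted model.
names = ['arrested', 'nation', 'police', 'quiz', 'texas']
weights = [-1.2, 0.9, -0.8, 1.5, -0.3]
toward_0, toward_1 = top_coefficients(names, weights, k=2)
print(toward_0)  # words pushing predictions toward class 0
print(toward_1)  # words pushing predictions toward class 1
```

With the objects defined in this notebook, the call would presumably be `top_coefficients(tvec.get_feature_names(), lr.coef_[0])`, since `lr.coef_` has one row for a binary problem.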


In [49]:
cross_val_score(lr, X_train_tvec, y_train, cv=5).mean()

0.7815389402302957

In [50]:
lr.score(X_test_tvec, y_test)

0.7911694510739857

In [None]:
y_hat = lr.predict(X_test_tvec)
y_hat_proba = lr.predict_proba(X_test_tvec)

#### Fitting the same model to different data

In [51]:
lit_data = pd.read_csv('./literature.csv')
physics_data = pd.read_csv('./physics.csv')

In [52]:
lit_data['selftext'].fillna('', inplace=True)
lit_data['text'] = lit_data['title'] + lit_data['selftext']
lit_data[['text', 'selftext', 'title']].head()

| | text | selftext | title |
|---|---|---|---|
| 0 | The FBI’s Spying on Writers Was Literary Criti... | | The FBI’s Spying on Writers Was Literary Criti... |
| 1 | Chekov &amp; Tolstoy | | Chekov &amp; Tolstoy |
| 2 | John Steinbeck was a sadistic womaniser, says ... | | John Steinbeck was a sadistic womaniser, says ... |
| 3 | In Praise of the Epistolary Novel | | In Praise of the Epistolary Novel |
| 4 | Jorge Luis Borges Selects 74 Books for Your Pe... | | Jorge Luis Borges Selects 74 Books for Your Pe... |


In [53]:
physics_data['selftext'].fillna('', inplace=True)
physics_data['text'] = physics_data['title'] + physics_data['selftext']
physics_data[['text', 'selftext', 'title']].head()

| | text | selftext | title |
|---|---|---|---|
| 0 | `Physics Questions Thread - Week 36, 2018**Tues...` | `**Tuesday Physics Questions: 04-Sep-2018**\n\n...` | Physics Questions Thread - Week 36, 2018 |
| 1 | Careers/Education Questions Thread - Week 36, ... | `**Thursday Careers &amp; Education Advice Thre...` | Careers/Education Questions Thread - Week 36, ... |
| 2 | This symbol is engraved outside the institute ... | | This symbol is engraved outside the institute ... |
| 3 | Teaching some physics with a homemade experiment | | Teaching some physics with a homemade experiment |
| 4 | Today is the 10-year anniversary of the first ... | | Today is the 10-year anniversary of the first ... |


In [54]:
lit_data['subreddit'].replace(to_replace='literature', value=1, inplace=True)

In [73]:
physics_data['subreddit'].replace(to_replace='Physics', value=0, inplace=True)

In [74]:
lit_phys_df = pd.concat(objs=(lit_data, physics_data), axis=0, ignore_index=True)

In [75]:
full_df2 = lit_phys_df.loc[:, ['text', 'subreddit']]

In [76]:
full_df2.tail()

| | text | subreddit |
|---|---|---|
| 6399 | Best Explanation of Quantum Field Theory That ... | 0 |
| 6400 | Quantum mechanics for high-school students | 0 |
| 6401 | PhD Comics Explains the Higgs Boson: Jorge Cha... | 0 |
| 6402 | The answer to any thermodynamics question | 0 |
| 6403 | Verlinde's new theory of gravity passes first ... | 0 |


In [77]:
full_df2.shape

(6404, 2)

In [78]:
X2 = full_df2['text']
y2 = full_df2['subreddit'] 

In [79]:
full_df2['subreddit'].value_counts()/full_df2.shape[0]

0    0.62742
1    0.37258
Name: subreddit, dtype: float64

In [80]:
full_df2.head()

| | text | subreddit |
|---|---|---|
| 0 | The FBI’s Spying on Writers Was Literary Criti... | 1 |
| 1 | Chekov &amp; Tolstoy | 1 |
| 2 | John Steinbeck was a sadistic womaniser, says ... | 1 |
| 3 | In Praise of the Epistolary Novel | 1 |
| 4 | Jorge Luis Borges Selects 74 Books for Your Pe... | 1 |


In [82]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.30, random_state=28, stratify=y2)

In [92]:
second_tvec = TfidfVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = 'english') 

X2_train_tvec = second_tvec.fit_transform(X2_train)
X2_test_tvec = second_tvec.transform(X2_test)

In [93]:
X2_train_tvec

<4482x19172 sparse matrix of type '<class 'numpy.float64'>'
	with 99592 stored elements in Compressed Sparse Row format>

In [94]:
X2_train_tvec.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [95]:
tvec_df2 = pd.DataFrame(X2_train_tvec.todense(),
                   columns=second_tvec.get_feature_names())

In [97]:
tvec_df2.head()

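| | 00 | 000 | 00000022 | 0002 | 0005 | 000th | 000x | 002 | 005 | 0061493341 | ... | δt | λcdm | λcdmso | μs | μν | ρc | дбгб | синтэкс | അമ | യവര |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |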


In [98]:
lr2 = LogisticRegression()
lr2.fit(X2_train_tvec, y2_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [99]:
cross_val_score(lr2, X2_train_tvec, y2_train, cv=5).mean()

0.9332860527154004

In [100]:
lr2.score(X2_test_tvec, y2_test)

0.9500520291363164

# Executive Summary
---
Put your executive summary in a Markdown cell below.

The institution for integrity in news reporting has asked General Assembly to analyze real and 'fake' news headlines in an effort to understand how similar these headlines can actually sound. The goal of this presentation is to 

The Subreddits: The Onion is a well-known humor and satire site, famous for creating entirely fake and funny news stories. While most people are aware that The Onion is a satire site, replies from individuals who seem to take these stories as real news can still be found all over the internet.

Not the Onion is a subreddit devoted to real news stories from reliable news sources that sound like they could be Onion articles.

While most people can usually tell the difference


- The Onion: 0.515219 of the data set, 3,453 unique posts
- Not the Onion: 0.484781 of the data set, 3,249 unique posts

