### Gabriela Osorio
#### DSI Project 3 - Creating a Binary Subreddit Classifier - TO vs. LA
#### November 5, 2018

#### Preamble
> Reddit is an online public content-sharing platform organized into categories known as subreddits. Subreddits are composed of user-submitted posts that can be text, media, or both. Users interact with posts through upvotes, downvotes, comments, and of course, views. This project outlines the creation of a subreddit classifier that predicts which subreddit a given post comes from. Specifically, it's a binary classifier for the Toronto and Los Angeles subreddits. The model can then be extended to explore which characteristics matter most to the classifier.

#### Quick Model Summary
> **Input**: 'Title' <br>
**Output**: Binary label ('LA' or 'TO')<br>
**Type**: Binary Classifier: Random Forest, Support Vector Machine <br>
**Metrics of Success**: Accuracy <br>


## PART 1: Webscraping Using the Reddit API

We begin by scraping posts from the Toronto and LA subreddits using Reddit's API. This portion was created from a template provided by Max Humber, course instructor, so it should not be mistaken for the author's original work. 

The scrape collects the following potentially interesting and influential features of each post:
- subreddit: to be part of the target vector later on 
- title: text input  
- selftext : actual text from the post
- downs : downvotes, negative points
- ups : upvotes, positive points
- num_comments : number of comments
- permalink 
- name 
- author 
- is_original_content : binary answer to "Is content in selftext original?"
- edited : binary answer to "Has this post been edited?"
- media_only : binary answer to "Does the post only have a photo?" 
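
The scraping code itself is not shown here, so the following is only a minimal sketch of how the fields above could be pulled out of one page of results. It assumes the listing payload nests posts under `data -> children -> data`; the helper name `extract_posts` and the sample payload are hypothetical and are not part of the instructor's template.

```python
# Hypothetical helper: flatten one listing payload into one dict per post.
# Assumption: posts sit under payload["data"]["children"][i]["data"].
FIELDS = [
    "subreddit", "title", "selftext", "downs", "ups", "num_comments",
    "permalink", "name", "author", "is_original_content", "edited",
    "media_only",
]

def extract_posts(listing, fields=FIELDS):
    """Return one row dict per post, keeping only the requested fields."""
    children = listing.get("data", {}).get("children", [])
    # .get() leaves a field as None when the post does not carry it.
    return [{f: child["data"].get(f) for f in fields} for child in children]

# Hand-made payload standing in for a real API response.
sample = {
    "data": {
        "children": [
            {"kind": "t3", "data": {"subreddit": "toronto",
                                    "title": "TTC delays again",
                                    "ups": 12, "downs": 0,
                                    "num_comments": 4}},
        ]
    }
}

rows = extract_posts(sample)
print(rows[0]["title"], rows[0]["selftext"])
```

A list of row dicts like this can be passed straight to `pd.DataFrame(rows)` before exporting to CSV.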

### PART 1A: Scraping

### PART 1B: Creating and Exporting Scraped Goods as CSV 

In [39]:
#!mkdir data

In [40]:
#now = str(datetime.datetime.now())[:19]

#filename = f'data/datasci scrape {now}.csv'
#filename

In [41]:
#df.to_csv(filename, index=False)

## PART 2: Preprocessing

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer
# nltk's English stop-word list (requires the NLTK 'stopwords' corpus to be downloaded)
from nltk.corpus import stopwords
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import r2_score, accuracy_score
from sklearn import svm
import seaborn as sns
import pandas as pd


In [43]:
TO=pd.read_csv('./data/TO.csv')
TO.shape

(965, 12)

In [44]:
LA= pd.read_csv('./data/LA.csv')
LA.shape

(970, 12)

In [45]:
cities=[TO,LA]
tola=pd.concat(cities)
tola.dtypes

author                 object
downs                   int64
edited                 object
is_original_content      bool
media_only               bool
name                   object
num_comments            int64
permalink              object
selftext               object
subreddit              object
title                  object
ups                     int64
dtype: object

#### Let's hold our horses here.
Though we extracted many potential features for our upcoming predictive models, this model focuses only on the post title for now.

> Later, we might get to use this as our features: **X=tola.drop(columns='subreddit')**

For now though, we'll stick to **X=tola['title']**

In [46]:
tola.shape

(1935, 12)

In [47]:
tola.tail()

| | author | downs | edited | is_original_content | media_only | name | num_comments | permalink | selftext | subreddit | title | ups |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 965 | westondeboer | 0 | False | False | False | t3_9pew82 | 3 | /r/LosAngeles/comments/9pew82/helicopter_makes... | | LosAngeles | Helicopter makes emergency landing in Dodger S... | 24 |
| 966 | max_raid | 0 | False | False | False | t3_9pltyt | 6 | /r/LosAngeles/comments/9pltyt/spectrum_interne... | In Encino and haven't had internet since I wok... | LosAngeles | Spectrum Internet Outage? | 1 |
| 967 | throwdatwey | 0 | False | False | False | t3_9ph8al | 7 | /r/LosAngeles/comments/9ph8al/flash_in_the_sky/ | Just saw a huge flash in what I believe was th... | LosAngeles | Flash in the sky? | 5 |
| 968 | comolaflor24 | 0 | False | False | False | t3_9pagh0 | 12 | /r/LosAngeles/comments/9pagh0/car_broke_down_o... | Shout out to the random good Samaritans who he... | LosAngeles | Car broke down on the Expo Line track | 96 |
| 969 | BlankVerse | 0 | False | False | False | t3_9pd0yq | 53 | /r/LosAngeles/comments/9pd0yq/least_shocking_n... | | LosAngeles | Least shocking news ever: Report says owners d... | 34 |


In [48]:
my_stopwords = stopwords.words('english')
my_stopwords.extend(['amp','x200b','\n'])


In [49]:
print(((tola['selftext'].isnull().sum())/1881)),
'Posts missing their text'

0.7304625199362041


'Posts missing their text'
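
The cell above divides by a hard-coded 1881, while `tola.shape` reports 1935 rows. As a hedged alternative, the share of missing values can be read directly off the column so that no row count needs to be typed in. The sketch below uses a small made-up frame (`toy`), not the scraped data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the scraped frame: 3 of 4 posts have no body text.
toy = pd.DataFrame({
    "title": ["Flash in the sky?", "Spectrum Internet Outage?",
              "TTC delays", "Raptors tonight"],
    "selftext": [np.nan, "In Encino and no internet", np.nan, np.nan],
})

# isna() yields booleans; their mean is the fraction of missing rows.
missing_share = toy["selftext"].isna().mean()
print(f"{missing_share:.0%} of posts are missing their text")
```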

Let's get ready to set up our actual experiment now! As we just saw, about 73% of our data points don't have any text in the post and are just a title. Our X (features) will be the title strings, and our y (target) will be the TO and LA labels. Right now those values are strings that say either 'toronto' or 'LosAngeles', so we will convert the target to binary integers: 1 for Toronto and 0 for Los Angeles.
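
As a small illustration of that encoding step, here is a sketch on made-up rows. It uses `Series.map` with an explicit dictionary, which is one possible way to do it; the name `label_map` and the check for unmapped values are additions for illustration only.

```python
import pandas as pd

# Made-up rows using the same two subreddit strings as the scraped data.
toy = pd.DataFrame({"subreddit": ["toronto", "LosAngeles", "toronto"]})

label_map = {"toronto": 1, "LosAngeles": 0}
toy["label"] = toy["subreddit"].map(label_map)

# map() turns any string missing from label_map into NaN, so verify coverage.
assert toy["label"].notna().all(), "found a subreddit not covered by label_map"
print(toy)
```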

In [50]:
tola['subreddit']

0         toronto
1         toronto
2         toronto
3         toronto
4         toronto
5         toronto
6         toronto
7         toronto
8         toronto
9         toronto
10        toronto
11        toronto
12        toronto
13        toronto
14        toronto
15        toronto
16        toronto
17        toronto
18        toronto
19        toronto
20        toronto
21        toronto
22        toronto
23        toronto
24        toronto
25        toronto
26        toronto
27        toronto
28        toronto
29        toronto
          ...    
940    LosAngeles
941    LosAngeles
942    LosAngeles
943    LosAngeles
944    LosAngeles
945    LosAngeles
946    LosAngeles
947    LosAngeles
948    LosAngeles
949    LosAngeles
950    LosAngeles
951    LosAngeles
952    LosAngeles
953    LosAngeles
954    LosAngeles
955    LosAngeles
956    LosAngeles
957    LosAngeles
958    LosAngeles
959    LosAngeles
960    LosAngeles
961    LosAngeles
962    LosAngeles
963    LosAngeles
964    Los

In [51]:
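X = tola['title']
# Assign the result back instead of using inplace=True on a selected column,
# which newer pandas versions may not propagate to the parent DataFrame.
tola['subreddit'] = tola['subreddit'].replace({'toronto': 1, 'LosAngeles': 0})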


In [52]:
y=tola['subreddit']

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.2, random_state=42)
tfidvec = TfidfVectorizer(stop_words = my_stopwords)
tfidvec.fit(X_train)
X_train= tfidvec.transform(X_train)
X_test = tfidvec.transform(X_test)

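In [54]:
# toarray() returns a plain ndarray; todense() returns np.matrix,
# which recent scikit-learn releases reject as estimator input.
X_train = X_train.toarray()

In [55]:
X_test = X_test.toarray()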

Okay, so now our features are in dense numeric form. We're ready to model.
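
To make the vectorizing step concrete, here is a minimal sketch on made-up titles. The vectorizer learns its vocabulary and IDF weights from the training titles only and then reuses them on the test titles, so words that appear only in the test set are dropped. The short stop-word list is a stand-in for `my_stopwords`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up titles; the real notebook uses tola['title'].
train_titles = [
    "TTC subway delays downtown",
    "Traffic on the 405 freeway",
    "Raptors game tonight downtown",
]
test_titles = ["Dodgers traffic near the freeway"]

vec = TfidfVectorizer(stop_words=["the", "on"])  # stand-in for my_stopwords
X_train_toy = vec.fit_transform(train_titles)    # learn vocabulary + IDF from train only
X_test_toy = vec.transform(test_titles)          # reuse them; unseen words are ignored

print(X_train_toy.shape, X_test_toy.shape)
print("dodgers" in vec.vocabulary_)
```

Both results are SciPy sparse matrices. Many scikit-learn estimators, including logistic regression and `SVC`, accept them directly, so converting to a dense array is optional for those models.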

## Part 3: Modeling

### Part 3A: Selecting and Fitting Models

Building out a function to get back info from the different models:

In [56]:
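from sklearn.metrics import confusion_matrix

def conf_matrix(model, X_test):
    # Relies on the global y_test created by train_test_split above.
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    acc = round(accuracy_score(y_test, y_pred), 2)
    # For binary labels, ravel() returns the counts in this order.
    tn, fp, fn, tp = cm.ravel()
    print(f"True Negatives: {tn}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Positives: {tp}")
    print(f"Accuracy Score:{acc}")
    # Labels are encoded LosAngeles = 0 and toronto = 1, and confusion_matrix
    # orders rows and columns by sorted label, so LA comes first.
    return pd.DataFrame(cm, columns = ['Predicted LA','Predicted TO'], index = ['Actual LA', 'Actual TO'])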
             

- **Support Vector Machine**

Trying out a Support Vector Machine (for the first time)

In [57]:
from sklearn import svm
svc = svm.SVC(gamma=.6)
svc.fit(X_train, y_train)
conf_matrix(svc, X_test)


True Negatives: 169
False Positives: 25
False Negatives: 31
True Positives: 162
Accuracy Score:0.86


| | Predicted LA | Predicted TO |
|---|---|---|
| Actual LA | 169 | 25 |
| Actual TO | 31 | 162 |
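
As a worked check of the numbers printed for the SVM, the accuracy can be recomputed by hand from the four counts. Precision and recall are included only as an illustration; they treat Toronto (encoded as 1) as the positive class and are not metrics the notebook reports.

```python
# Counts printed above for the support vector machine.
tn, fp, fn, tp = 169, 25, 31, 162

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all test posts labelled correctly
precision = tp / (tp + fp)     # of posts predicted Toronto, share that were Toronto
recall = tp / (tp + fn)        # of actual Toronto posts, share that were found

print(total, round(accuracy, 2))
print(round(precision, 3), round(recall, 3))
```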


- **Random Forest Classifier**

In [58]:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(X_train, y_train)
conf_matrix(rfc, X_test)

True Negatives: 180
False Positives: 14
False Negatives: 61
True Positives: 132
Accuracy Score:0.81


| | Predicted LA | Predicted TO |
|---|---|---|
| Actual LA | 180 | 14 |
| Actual TO | 61 | 132 |


- **Adaptive Boosting** 

In [59]:
from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier()
ada.fit(X_train, y_train)
conf_matrix(ada, X_test)

True Negatives: 187
False Positives: 7
False Negatives: 80
True Positives: 113
Accuracy Score:0.78


| | Predicted LA | Predicted TO |
|---|---|---|
| Actual LA | 187 | 7 |
| Actual TO | 80 | 113 |


- **Basic Logistic Regression**

In [60]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(X_train, y_train)
conf_matrix(lr, X_test)

True Negatives: 170
False Positives: 24
False Negatives: 32
True Positives: 161
Accuracy Score:0.86


| | Predicted LA | Predicted TO |
|---|---|---|
| Actual LA | 170 | 24 |
| Actual TO | 32 | 161 |


**Analysis:** -----------------------------------------------------------------------------
- Four models were tried: basic logistic regression, random forest classifier, adaptive boosting, and support vector machine. Test accuracy ranged from 0.78 (adaptive boosting) to 0.86 (logistic regression and support vector machine), with the random forest at 0.81.

We'll now apply **GridSearchCV** to these models to tune their parameters and automate the search for the best settings.

### Part 3B: Optimizing Models' Parameters for Highest Accuracy

- **Optimizing the Logistic Regression:**

In [61]:
lr = LogisticRegression(solver='lbfgs', max_iter=200) 
lr.fit(X_train, y_train)

print(lr.score(X_test, y_test))
print(np.mean(y_test))

0.8552971576227391
0.49870801033591733
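
The second number printed above is the mean of `y_test`, which is the share of test posts labelled Toronto. A short worked example shows how that turns into a majority-class baseline to compare the 0.855 accuracy against. The counts are taken from the confusion matrices earlier (193 Toronto posts out of 387 test posts).

```python
n_test = 387      # test posts (sum of any confusion matrix above)
n_toronto = 193   # actual Toronto posts (false negatives + true positives)

share_toronto = n_toronto / n_test
# Always guessing the more common class gets this accuracy with no model at all.
baseline = max(share_toronto, 1 - share_toronto)

print(round(share_toronto, 4), round(baseline, 4))
```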


In [62]:
gs_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,0,100)
}

lr_gridsearch = GridSearchCV(LogisticRegression(), gs_params, cv=10, verbose=1)

In [63]:
%%time
lr_gridsearch = lr_gridsearch.fit(X_train, y_train)

Fitting 10 folds for each of 200 candidates, totalling 2000 fits
CPU times: user 2min 44s, sys: 28.5 s, total: 3min 12s
Wall time: 1min 57s


[Parallel(n_jobs=1)]: Done 2000 out of 2000 | elapsed:  2.0min finished
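
The "200 candidates" and "2000 fits" in the log follow directly from the grid: 2 penalties times 1 solver times 100 values of C, each fitted once per fold. A small sketch using scikit-learn's `ParameterGrid` reproduces that count without running any fits.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

gs_params = {
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
    "C": np.logspace(-5, 0, 100),
}

n_candidates = len(ParameterGrid(gs_params))  # every combination of the lists above
n_fits = n_candidates * 10                    # cv=10 means one fit per fold

print(n_candidates, n_fits)
```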


In [64]:
lr_gridsearch.best_score_

0.8753229974160207

In [65]:
lr_gridsearch.best_params_

{'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}

**That was pretty tedious work for optimizing just one model. We'll now use what we know about classes, functions, and dictionaries to optimize all of the models automatically. The EstimatorSelectionHelper code is originally from David S. Batista. I've modified it for use in Python 3 with sklearn's current library.**

In [66]:
class EstimatorSelectionHelper:
    
    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}
    
    def fit(self, X, y, **grid_kwargs):
        for key in self.keys:
            print('Running GridSearchCV for %s.' % key)
            model = self.models[key]
            params = self.params[key]
            grid_search = GridSearchCV(model, params, **grid_kwargs)
            grid_search.fit(X, y)
            self.grid_searches[key] = grid_search
        print('Done.')
    
    def score_summary(self, sort_by='mean_test_score'):
        frames = []
        for name, grid_search in self.grid_searches.items():
            frame = pd.DataFrame(grid_search.cv_results_)
            frame = frame.filter(regex='^(?!.*param_).*$')
            frame['estimator'] = len(frame)*[name]
            frames.append(frame)
        df = pd.concat(frames)
        
        df = df.sort_values([sort_by], ascending=False)
        df = df.reset_index()
        # Name the axis explicitly; the positional form is not accepted by newer pandas.
        df = df.drop(columns=['rank_test_score', 'index'])
        
        columns = df.columns.tolist()
        columns.remove('estimator')
        columns = ['estimator']+columns
        df = df[columns]
        return df

In [67]:
models1 = { 
    'Logistic Regression': lr,
    'SVC': svc,
    'RandomForestClassifier': rfc,
    'AdaBoostClassifier': ada,
    
}

params1 = {
    'Logistic Regression': {'penalty':['l1','l2'],'solver':['liblinear'],'C':np.logspace(-5,0,100)},
    'SVC': [
        {'kernel': ['linear'], 'C': [1, 10], 'gamma': [0.001, 0.01, 1, 10]},
        {'kernel': ['rbf'], 'C': [1, 10], 'gamma': [0.001, 0.0001]}], 
    'RandomForestClassifier': { 'n_estimators': [16,32]},
    'AdaBoostClassifier':  { 'n_estimators': [16, 32] }
}

- **Scoring**

In [68]:
helper1 = EstimatorSelectionHelper(models1, params1)
helper1.fit(X_test, y_test, return_train_score=True, n_jobs=2)
df=helper1.score_summary()
df

Running GridSearchCV for Logistic Regression.
Running GridSearchCV for SVC.
Running GridSearchCV for RandomForestClassifier.
Running GridSearchCV for AdaBoostClassifier.
Done.


| | estimator | mean_fit_time | std_fit_time | mean_score_time | std_score_time | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.014150 | 0.003115 | 0.001903 | 0.000412 | {'C': 0.7924828983539169, 'penalty': 'l2', 'so... | 0.800000 | 0.775194 | 0.765625 | 0.780362 | 0.014501 | 0.996109 | 1.000000 | 0.992278 | 0.996129 | 0.003153 |
| 1 | Logistic Regression | 0.016338 | 0.002496 | 0.001925 | 0.000555 | {'C': 0.49770235643321137, 'penalty': 'l2', 's... | 0.807692 | 0.767442 | 0.765625 | 0.780362 | 0.019452 | 0.980545 | 1.000000 | 0.988417 | 0.989654 | 0.007991 |
| 2 | Logistic Regression | 0.014096 | 0.002548 | 0.002270 | 0.000458 | {'C': 0.8902150854450392, 'penalty': 'l2', 'so... | 0.800000 | 0.775194 | 0.765625 | 0.780362 | 0.014501 | 0.996109 | 1.000000 | 0.992278 | 0.996129 | 0.003153 |
| 3 | Logistic Regression | 0.015905 | 0.003055 | 0.001851 | 0.000334 | {'C': 0.7054802310718645, 'penalty': 'l2', 'so... | 0.800000 | 0.775194 | 0.765625 | 0.780362 | 0.014501 | 0.996109 | 1.000000 | 0.992278 | 0.996129 | 0.003153 |
| 4 | Logistic Regression | 0.016287 | 0.002306 | 0.001486 | 0.000118 | {'C': 0.6280291441834247, 'penalty': 'l2', 'so... | 0.800000 | 0.767442 | 0.765625 | 0.777778 | 0.015822 | 0.984436 | 1.000000 | 0.988417 | 0.990951 | 0.006602 |
| 5 | Logistic Regression | 0.014596 | 0.001678 | 0.001602 | 0.000020 | {'C': 0.22051307399030456, 'penalty': 'l2', 's... | 0.792308 | 0.775194 | 0.765625 | 0.777778 | 0.011045 | 0.976654 | 0.988372 | 0.980695 | 0.981907 | 0.004860 |
| 6 | Logistic Regression | 0.015195 | 0.002345 | 0.001489 | 0.000163 | {'C': 0.5590810182512223, 'penalty': 'l2', 'so... | 0.800000 | 0.767442 | 0.765625 | 0.777778 | 0.015822 | 0.980545 | 1.000000 | 0.988417 | 0.989654 | 0.007991 |
| 7 | Logistic Regression | 0.015360 | 0.005676 | 0.001435 | 0.000259 | {'C': 0.44306214575838776, 'penalty': 'l2', 's... | 0.807692 | 0.759690 | 0.765625 | 0.777778 | 0.021413 | 0.980545 | 0.996124 | 0.988417 | 0.988362 | 0.006360 |
| 8 | SVC | 0.389107 | 0.005279 | 0.173696 | 0.002417 | {'C': 1, 'gamma': 10, 'kernel': 'linear'} | 0.784615 | 0.751938 | 0.789062 | 0.775194 | 0.016544 | 0.996109 | 0.996124 | 0.996139 | 0.996124 | 0.000012 |
| 9 | SVC | 0.386825 | 0.009657 | 0.177336 | 0.007160 | {'C': 1, 'gamma': 1, 'kernel': 'linear'} | 0.784615 | 0.751938 | 0.789062 | 0.775194 | 0.016544 | 0.996109 | 0.996124 | 0.996139 | 0.996124 | 0.000012 |


In [69]:
df.columns

Index(['estimator', 'mean_fit_time', 'std_fit_time', 'mean_score_time',
       'std_score_time', 'params', 'split0_test_score', 'split1_test_score',
       'split2_test_score', 'mean_test_score', 'std_test_score',
       'split0_train_score', 'split1_train_score', 'split2_train_score',
       'mean_train_score', 'std_train_score'],
      dtype='object')

In [70]:
df1=df[['estimator','params', 'mean_test_score', 'std_test_score',
       'mean_train_score', 'std_train_score']]
df1

| | estimator | params | mean_test_score | std_test_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|
| 0 | Logistic Regression | {'C': 0.7924828983539169, 'penalty': 'l2', 'so... | 0.780362 | 0.014501 | 0.996129 | 0.003153 |
| 1 | Logistic Regression | {'C': 0.49770235643321137, 'penalty': 'l2', 's... | 0.780362 | 0.019452 | 0.989654 | 0.007991 |
| 2 | Logistic Regression | {'C': 0.8902150854450392, 'penalty': 'l2', 'so... | 0.780362 | 0.014501 | 0.996129 | 0.003153 |
| 3 | Logistic Regression | {'C': 0.7054802310718645, 'penalty': 'l2', 'so... | 0.780362 | 0.014501 | 0.996129 | 0.003153 |
| 4 | Logistic Regression | {'C': 0.6280291441834247, 'penalty': 'l2', 'so... | 0.777778 | 0.015822 | 0.990951 | 0.006602 |
| 5 | Logistic Regression | {'C': 0.22051307399030456, 'penalty': 'l2', 's... | 0.777778 | 0.011045 | 0.981907 | 0.004860 |
| 6 | Logistic Regression | {'C': 0.5590810182512223, 'penalty': 'l2', 'so... | 0.777778 | 0.015822 | 0.989654 | 0.007991 |
| 7 | Logistic Regression | {'C': 0.44306214575838776, 'penalty': 'l2', 's... | 0.777778 | 0.021413 | 0.988362 | 0.006360 |
| 8 | SVC | {'C': 1, 'gamma': 10, 'kernel': 'linear'} | 0.775194 | 0.016544 | 0.996124 | 0.000012 |
| 9 | SVC | {'C': 1, 'gamma': 1, 'kernel': 'linear'} | 0.775194 | 0.016544 | 0.996124 | 0.000012 |
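
The helper run above passes `X_test` and `y_test` to `fit`, so its cross-validated scores come from the 387 held-out rows rather than from the training split. A more conventional setup, sketched below on made-up titles, searches on the training split only and scores the refit best model once on the untouched test split. Wrapping the vectorizer in a `Pipeline` is a design choice added here, not something the notebook does; it lets each fold learn its own vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Made-up titles repeated to get enough rows; 1 = Toronto, 0 = Los Angeles.
titles = ["TTC subway delays", "Raptors win at home", "Snow on the Gardiner",
          "CN Tower lights up", "Traffic on the 405", "Dodgers win at home",
          "Hiking in Griffith Park", "Lakers game at Staples"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    titles, labels, stratify=labels, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])
param_grid = {"clf__C": [0.1, 1.0], "clf__penalty": ["l1", "l2"]}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_tr, y_tr)                     # search uses the training split only
test_accuracy = grid.score(X_te, y_te)   # single final check on held-out data

print(grid.best_params_, test_accuracy)
```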


#### Conclusion

There isn't a clear practical use for simply classifying a post into a subreddit. For one, 100% of posts on Reddit belong to a subreddit, so a post always has its subreddit attached to it. It's just not very impressive in utility. At first glance, that is.

The process of creating a binary classifier can bring to light other factors about whatever is being studied. Keep in mind, though, that the Reddit users this sample comes from are demographically biased, which limits how far the findings generalize to other applications.

Learning which keywords are most important to a city may give you an idea about the city's culture, or about what matters to the people there. So many things could be discovered from this sort of exploration. It could also be used in market research analysis (yawn) by comparing one brand's subreddit with another's, or one show's fanbase with another's. This binary classification model can also be adapted to message filtering (like sorting spam out of email).

The applications don't end there, however. Sociologists would have a field day with this tool because of its far-reaching applications in thematic analysis.