# Model Creation and Analysis
## Natural Language Processing Reddit Project
### Zach Tretter, April 2020
General Assembly, Data Science Immersion Cohort 11, Boston

# Executive Summary

### Problem Statement
Explore the power of binary classification models to differentiate between two subreddits.

### Project Methodology
Two subreddits were considered : r/conservative and r/libertarian.  These subreddits were chosen because there is enough overlap between them to be interesting but still enough to distinguishable.  They are of comparable size (~350K users) and both advertise themselves as the "the place for __ on reddit". Additionally, other political subreddits have insane post volume that would have limited analysis to the past couple weeks.

Data Extraction consisted of pulling 100K posts from each subreddit and then filtering for only those with more than 10 upvotes (i.e. score).  Upvoted content reflects that which is "representative" of a subreddit.  Over 2/3 of all posts on these subreddits received no upvotes.  A random sample of 5K was taken from each subreddit to maintain balanced classes.  The independent variable is submission title due to the prevalence of image and meme posts that are not yet interpretable for this analyst's machine learning abilities.

Text Preprocessing consisted principaly of lemmatizing.  Numbers were kept in as "2020" and "16" have strong political meaning.  The subreddit names were removed as filtering based on these would not be interesting

### Model Performance

Baseline Score is 50%

| Model                   | Train Score | Test Score | Test F1 Score |
|-------------------------|-------------|------------|---------------|
| Logistic CVEC           | 0.809733    | 0.7240     | 0.739623      |
| Logistic TFIDF          | 0.832800    | 0.7292     | 0.735856      |
| Multinomial Bayes CVEC  | 0.793200    | 0.7160     | 0.719589      |
| Multinomial Bayes TFIDF | 0.797333    | 0.7164     | 0.719874      |
| Support Vector CVEC     | 0.928400    | 0.7112     | 0.713264      |
| Support Vector TFIDF    | 0.910000    | 0.6924     | 0.702284      |
| Random Forest CVEC      | 0.858000    | 0.6744     | 0.709907      |
| Random Forest TFIDF     | 0.858000    | 0.6744     | 0.709907      |

### Principal Findings

* All Models performed similarly on test data (~70%)
    * SVCs and Random Forests particularly suffered from overfit
    * TFIDF performed marginally better than Count Vectorizer for Logistic/Bayes/SVC
        * Reverse performance with Random Forest
    * Consistently more inaccurate predictions of conservative than of libertarian
    

* Findings from Grid Searches (Text Vectorizer Parmeters)
    * Max Features : Models almost always select for more features
    * Max_df : Consistently selected to be between 0.2 and 0.4
    * Min_df : Not impactful, default chosen
    * StopWords: Logistic selected for English, other classifiers selected none
    * n-grams: Length 1,1 almost always selected
    
 
#### These models show we can be ~40% more accurate than a random guess!

## Table of Contents
- [Executive Summary](#Executive-Summary)
- [Imports](#Imports)
- [Data Import and Cleaning](#Data-Import-and-Cleaning)
- [DataFrame Cleaning](#DataFrame-Cleaning)
- [Pre-Processing](#Pre-Processing)
- [Global Model Constructs](#Global-Model-Constructs)
- [Logistic Regression](#Logistic-Regression)
- [Multinomial Bayes](#Multinomial-Bates)
- [Support Vector Machine](#Support-Vector-Machine)
- [Random Forest](#Random-Forest)
- [Summary Metrics](#Summary-Metrics)

## Imports

In [2]:
# General Functionality
import numpy as np
import pandas as pd
import time

# Text Processing
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Model Construction
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Transformer and Classifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC



In [3]:
# Read in our Data
df = pd.read_csv('~/ga/project_3/DATA/subreddit_data.csv')

# drop unnamed zero column
df.drop(columns = 'Unnamed: 0',inplace=True)

## DataFrame Cleaning
* Verify the Subreddits
* Set a score threshold
* Draw a random sample
* Reduce to core features of interest

In [4]:
# Identify our Subreddits
class1subreddit = df[df['class']==1]['subreddit'].unique()[0]
class0subreddit = df[df['class']==0]['subreddit'].unique()[0]
print(f"r/{class1subreddit} is Class 1 and r/{class0subreddit} is Class 0")

r/Libertarian is Class 1 and r/Conservative is Class 0


In [5]:
# Set a Score Threshold
score_threshold = 10

# Set a Sample Size to ensure parity in our classes
sample_size = 5_000

# Verify there is sufficient volume at this threshold
df[(df['score']>=score_threshold)].groupby('class')['title'].count()

class
0    23809
1    12210
Name: title, dtype: int64

In [6]:
# Draw Samples from each subreddit
df_sample_class0 = df[
    (df['score']>=score_threshold) & 
    (df['class']==0)].sample(n = sample_size, 
                             random_state=42).copy()

df_sample_class1 = df[
    (df['score']>=score_threshold) & 
    (df['class']==1)].sample(n = sample_size,
                             random_state=42).copy()

df = df_sample_class0.append(df_sample_class1 )

In [7]:
# Verify our classes are balanced
df['class'].value_counts()

1    5000
0    5000
Name: class, dtype: int64

In [8]:
# Reduce to core columns for now
df = df[['subreddit','title','selftext','score','class']]

# Fill in NaNs with dashes
df = df.fillna("-")

In [9]:
# View our Dataframe
df

Unnamed: 0,subreddit,title,selftext,score,class
54930,Conservative,"Pro-Abortion Activists March in Mexico City, S...",-,63,0
54823,Conservative,Nigeria’s Democracy Is In Peril As Leaders Imp...,-,12,0
94104,Conservative,Horrible: Major British University Warns Jewis...,-,49,0
98500,Conservative,NYT: The Tables Have Turned -- Time To Investi...,-,94,0
3764,Conservative,Dow Jumps Nearly 800 Points After Sanders Drop...,-,19,0
...,...,...,...,...,...
197341,Libertarian,Federal Court: First Amendment Protects Sharin...,-,47,1
140397,Libertarian,What specific regulations are causing high hea...,I think libertarians often lose credibility wh...,10,1
139188,Libertarian,A Russian example of why government regulation...,"I'm not Russian, but I've lived here for five ...",26,1
139483,Libertarian,Rand Paul sucks,-,30,1


## Text Pre-Processing

In [10]:
stopwords_custom =  [
    'libertarian',
    'libertarianism',
    'conservative'
]

In [11]:
def joint_token_lemma_function(document):
    
    tokenizer = RegexpTokenizer("\w+")
    
    token_list = tokenizer.tokenize(document.lower())
    
    lemmatizer = WordNetLemmatizer()
    
    lemmas = [lemmatizer.lemmatize(i) for i in token_list]
    
    words_to_remove = set(stopwords_custom)
    
    words_we_want = [word for word in lemmas if word not in words_to_remove]
    
    return " ".join(words_we_want)

In [12]:
# Sample Test Strings for our Word Cleaning Function
test_string_conservative = "New York City mayor Bill de Blasio is shocked that the criminals he released over virus concerns committed new crimes. He said Monday it is “unconscionable” that criminals released early from prison over coronavirus fears would commit new crimes."
print(test_string_conservative)
print('\n')
test_string_libertarian = "Poll: 74 percent of voters concerned about losing freedoms due to COVID-19"
print(test_string_libertarian)
print("\n")
test_string_custom = "The rest of this string should read as the word test repeated four times test conservative test libertarian test libertarianism test"
print(test_string_custom)

New York City mayor Bill de Blasio is shocked that the criminals he released over virus concerns committed new crimes. He said Monday it is “unconscionable” that criminals released early from prison over coronavirus fears would commit new crimes.


Poll: 74 percent of voters concerned about losing freedoms due to COVID-19


The rest of this string should read as the word test repeated four times test conservative test libertarian test libertarianism test


In [13]:
print(joint_token_lemma_function(test_string_conservative),"\n")
print(joint_token_lemma_function(test_string_libertarian),"\n")
print(joint_token_lemma_function(test_string_custom))

new york city mayor bill de blasio is shocked that the criminal he released over virus concern committed new crime he said monday it is unconscionable that criminal released early from prison over coronavirus fear would commit new crime 

poll 74 percent of voter concerned about losing freedom due to covid 19 

the rest of this string should read a the word test repeated four time test test test test


## Global Model Constructs
* Apply our text cleaning function to the dataframe
* Define X and y
* Train Test Split X and y
* Create two functions to summarize model results
* Use a consistent set of parameters to gridsearch over for CVEC and TFIDF

In [14]:
df['modified_title']=df['title'].apply(joint_token_lemma_function)

In [19]:
df[['subreddit','score','class','title','modified_title']]

Unnamed: 0,subreddit,score,class,title,modified_title
54930,Conservative,63,0,"Pro-Abortion Activists March in Mexico City, S...",pro abortion activist march in mexico city set...
54823,Conservative,12,0,Nigeria’s Democracy Is In Peril As Leaders Imp...,nigeria s democracy is in peril a leader impri...
94104,Conservative,49,0,Horrible: Major British University Warns Jewis...,horrible major british university warns jewish...
98500,Conservative,94,0,NYT: The Tables Have Turned -- Time To Investi...,nyt the table have turned time to investigate ...
3764,Conservative,19,0,Dow Jumps Nearly 800 Points After Sanders Drop...,dow jump nearly 800 point after sander drop ou...
...,...,...,...,...,...
197341,Libertarian,47,1,Federal Court: First Amendment Protects Sharin...,federal court first amendment protects sharing...
140397,Libertarian,10,1,What specific regulations are causing high hea...,what specific regulation are causing high heal...
139188,Libertarian,26,1,A Russian example of why government regulation...,a russian example of why government regulation...
139483,Libertarian,30,1,Rand Paul sucks,rand paul suck


In [20]:
X = df['modified_title']
y = df['class']

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state = 123, 
                                                    stratify = y)

In [21]:
def display_pipeline_results(gridsearchobject):
    print(f"The best score is {gridsearchobject.best_score_} with these parameters:")
    for i in gridsearchobject.best_params_.items():
        print("\t",i[0],"at",i[1])
    print(f"The train score of {gridsearchobject.best_estimator_.score(X_train,y_train)}")
    print(f"The test score of {gridsearchobject.best_estimator_.score(X_test,y_test)}") 

def display_confusionmatrix(y_test,predictions):
    cm = confusion_matrix(y_test,predictions)
    column_titles = ['Predicted '+class1subreddit, 'Predicted '+class0subreddit]
    row_titles = ['Actually '+class1subreddit, 'Actually '+class0subreddit]
    return pd.DataFrame(cm,columns=column_titles,index = row_titles)

In [52]:
# Identify the text parameters we'll be grid searching over for CVEC and TFIDF
text_parameters = {
    'max_features': [300, 800, 2500],
    'min_df':[1], # this attribute seems to be irrelevant
    'max_df':[0, 0.2,0.4,0.6, 0.8, 1],
    'ngram_range':[[1,1], [1,2]],
    'stop_words':['english', None]
}

## Logistic Regression
For Count Vectorize and then Term Frequency–Inverse Document Frequency
* Build a Pipeline and Gridsearch
* Fit it
* View the scores

In [53]:
# Build the Pipline
pipe = Pipeline([
    ('cvec',CountVectorizer()),
    ('lr',LogisticRegression(max_iter = 250))
])

params = {
    'cvec__max_features' : text_parameters['max_features'],
    'cvec__min_df':text_parameters['min_df'],
    'cvec__max_df':text_parameters['max_df'],
    'cvec__ngram_range' : text_parameters['ngram_range'],
    'cvec__stop_words' : text_parameters['stop_words'],
    'lr__C' : [0.1,1,10] 
}

gs_cvec_lr = GridSearchCV(pipe, params, cv = 5,
                          n_jobs = -2,
                          verbose = 5)

In [54]:
# Let there be fit!
start_the_clock = time.time()

gs_cvec_lr.fit(X_train, y_train)

print(f"{time.time() - start_the_clock} seconds to run \n")


Fitting 5 folds for each of 216 candidates, totalling 1080 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 11 concurrent workers.
[Parallel(n_jobs=-2)]: Done  78 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-2)]: Done 258 tasks      | elapsed:    6.7s
[Parallel(n_jobs=-2)]: Done 453 tasks      | elapsed:   14.6s
[Parallel(n_jobs=-2)]: Done 615 tasks      | elapsed:   21.6s
[Parallel(n_jobs=-2)]: Done 813 tasks      | elapsed:   29.5s
[Parallel(n_jobs=-2)]: Done 1047 tasks      | elapsed:   39.4s


41.18470573425293 seconds to run 



[Parallel(n_jobs=-2)]: Done 1080 out of 1080 | elapsed:   41.0s finished


In [55]:
model_in_this_cell = gs_cvec_lr

# Display Results for this Pipline
display_pipeline_results(model_in_this_cell)

# Display an roc area under the curve value
print(f"Our AUC is {metrics.roc_auc_score(y_test,model_in_this_cell.predict_proba(X_test)[:,1])}")

# Display Confusion Matrix
display_confusionmatrix(y_test,model_in_this_cell.predict(X_test))

The best score is 0.7433333333333334 with these parameters:
	 cvec__max_df at 0.2
	 cvec__max_features at 2500
	 cvec__min_df at 1
	 cvec__ngram_range at [1, 1]
	 cvec__stop_words at english
	 lr__C at 0.1
The train score of 0.8097333333333333
The test score of 0.724
Our AUC is 0.7881670399999999


Unnamed: 0,Predicted Libertarian,Predicted Conservative
Actually Libertarian,830,420
Actually Conservative,270,980


In [56]:
# Unpack our best CVEC inputs
cvec_params = [x for x in list(gs_cvec_lr.best_params_.items()) if 'cvec' in x[0]]

# Create a Count Vectorizer
cvec = CountVectorizer(cvec_params)

# Fit and Transform to the X train data
X_train_cvec = cvec.fit_transform(X_train)
# Transform the X test data
X_test_cvec = cvec.transform(X_test)

# Create a dataframe that we'll use to look at our coefficients
# Transpose so that the features are on the y-axis
X_train_cvec_df = pd.DataFrame(X_train_cvec.toarray(),
                               columns=cvec.get_feature_names()).T

# Instantiate a Logistic Regression
lr = LogisticRegression(max_iter = 500,C=0.1)

# Fit it to the transformed X_train
lr.fit(X_train_cvec,y_train)

# Add a column for our feature coefficients
X_train_cvec_df['coefficients']=list(np.exp(lr.coef_[0]))

In [57]:
# Coefficients Most Predictive of the Class 0 Subreddit
print(f'Most predictive of r/{class0subreddit}')
print(1/X_train_cvec_df.sort_values('coefficients')['coefficients'].head(10))
print("\n")

# Coefficients Most Predictive of the Class 1 Subreddit
print(f'Most predictive of r/{class1subreddit}')
print(X_train_cvec_df.sort_values('coefficients',ascending=False)['coefficients'].head(10))

Most predictive of r/Conservative
democrat    3.189244
cnn         2.975044
biden       2.900565
liberal     2.551663
aoc         2.450560
dems        2.308733
climate     2.259992
watch       2.242193
omar        2.241133
clinton     2.201184
Name: coefficients, dtype: float64


Most predictive of r/Libertarian
government    2.434765
tariff        1.970836
paul          1.832067
police        1.824586
rand          1.815529
amash         1.790284
market        1.726388
marijuana     1.702862
gun           1.694839
drug          1.670149
Name: coefficients, dtype: float64


In [58]:
# Build the Pipline
pipe = Pipeline([
    ('tfidf',TfidfVectorizer()),
    ('lr',LogisticRegression(max_iter = 500))
])

params = {
    'tfidf__max_features' : text_parameters['max_features'],
    'tfidf__min_df':text_parameters['min_df'],
    'tfidf__max_df':text_parameters['max_df'],
    'tfidf__ngram_range' : text_parameters['ngram_range'],
    'tfidf__stop_words' : text_parameters['stop_words'],
    'lr__C' : [0.1,1,10] 
}

gs_tfidf_lr = GridSearchCV(pipe, params,cv = 5,
                          n_jobs = -2,
                          verbose = 5)

In [59]:
# Let there be fit!
start_the_clock = time.time()

gs_tfidf_lr.fit(X_train, y_train)

print(f"{time.time() - start_the_clock} seconds to run \n")


Fitting 5 folds for each of 216 candidates, totalling 1080 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 11 concurrent workers.
[Parallel(n_jobs=-2)]: Done  78 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-2)]: Done 258 tasks      | elapsed:    8.4s
[Parallel(n_jobs=-2)]: Done 420 tasks      | elapsed:   15.0s
[Parallel(n_jobs=-2)]: Done 582 tasks      | elapsed:   22.5s
[Parallel(n_jobs=-2)]: Done 780 tasks      | elapsed:   30.8s
[Parallel(n_jobs=-2)]: Done 1014 tasks      | elapsed:   42.4s
[Parallel(n_jobs=-2)]: Done 1080 out of 1080 | elapsed:   44.9s finished


45.21828484535217 seconds to run 



In [60]:
model_in_this_cell = gs_tfidf_lr

# Display Results for this Pipline
display_pipeline_results(model_in_this_cell)

# Display an roc area under the curve value
print(f"Our AUC is {metrics.roc_auc_score(y_test,model_in_this_cell.predict_proba(X_test)[:,1])}")

# Display Confusion Matrix
display_confusionmatrix(y_test,model_in_this_cell.predict(X_test))

The best score is 0.7405333333333333 with these parameters:
	 lr__C at 1
	 tfidf__max_df at 0.2
	 tfidf__max_features at 2500
	 tfidf__min_df at 1
	 tfidf__ngram_range at [1, 2]
	 tfidf__stop_words at english
The train score of 0.8328
The test score of 0.7292
Our AUC is 0.79758752


Unnamed: 0,Predicted Libertarian,Predicted Conservative
Actually Libertarian,880,370
Actually Conservative,307,943


In [61]:
# Unpack our best TFIDF inputs
tfidf_params = [x for x in list(gs_tfidf_lr.best_params_.items()) if 'tfidf' in x[0]]

# Create a Count Vectorizer
tfidf = TfidfVectorizer(tfidf_params)

# Fit and Transform to the X train data
X_train_tfidf = tfidf.fit_transform(X_train)
# Transform the X test data
X_test_tfidf = tfidf.transform(X_test)

# Create a dataframe that we'll use to look at our coefficients
# Transpose so that the features are on the y-axis
X_train_tfidf_df = pd.DataFrame(X_train_tfidf.toarray(),columns=tfidf.get_feature_names()).T

# Instantiate a Logistic Regression
lr = LogisticRegression(max_iter = 500)

# Fit it to the transformed X_train
lr.fit(X_train_tfidf,y_train)

# Add a column for our feature coefficients
X_train_tfidf_df['coefficients']=list(np.exp(lr.coef_[0]))

In [62]:
# Coefficients Most Predictive of the Class 0 Subreddit
print(f'Most predictive of r/{class0subreddit}')
print(1/X_train_tfidf_df.sort_values('coefficients')['coefficients'].head(10))
print("\n")

# Coefficients Most Predictive of the Class 1 Subreddit
print(f'Most predictive of r/{class1subreddit}')
print(X_train_tfidf_df.sort_values('coefficients',ascending=False)['coefficients'].head(10))

Most predictive of r/Conservative
democrat    60.242516
biden       28.319568
cnn         27.375831
liberal     15.062547
aoc         14.698459
dems        13.874114
climate     13.173285
omar        11.671782
clinton     10.864782
pelosi       9.961375
Name: coefficients, dtype: float64


Most predictive of r/Libertarian
government    32.987725
police         8.935407
tariff         8.225776
paul           7.656773
gun            7.615683
rand           5.956973
this           5.824979
amash          5.638409
drug           5.373929
law            5.249525
Name: coefficients, dtype: float64


## Multinomial Bayes
For Count Vectorize and then Term Frequency–Inverse Document Frequency
* Build a Pipeline and Gridsearch
* Fit it
* View the scores

In [63]:
# Build the Pipline
pipe = Pipeline([
    ('cvec',CountVectorizer()),
    ('mnb',MultinomialNB())
])

params = {
    'cvec__max_features' : text_parameters['max_features'],
    'cvec__min_df':text_parameters['min_df'],
    'cvec__max_df':text_parameters['max_df'],
    'cvec__ngram_range' : text_parameters['ngram_range'],
    'cvec__stop_words' : text_parameters['stop_words'],
    'mnb__fit_prior' : [True,False] 
}

gs_cvec_mnb = GridSearchCV(pipe, params, cv = 5,
                          n_jobs = -2,
                          verbose = 5)

In [64]:
# Let there be fit!
start_the_clock = time.time()

gs_cvec_mnb.fit(X_train, y_train)

print(f"{time.time() - start_the_clock} seconds to run \n")

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done  96 tasks      | elapsed:    2.2s
[Parallel(n_jobs=8)]: Done 276 tasks      | elapsed:    8.8s
[Parallel(n_jobs=8)]: Done 528 tasks      | elapsed:   19.8s


28.235747575759888 seconds to run 



[Parallel(n_jobs=8)]: Done 720 out of 720 | elapsed:   27.9s finished


In [65]:
model_in_this_cell = gs_cvec_mnb

# Display Results for this Pipline
display_pipeline_results(model_in_this_cell)

# Display an roc area under the curve value
print(f"Our AUC is {metrics.roc_auc_score(y_test,model_in_this_cell.predict_proba(X_test)[:,1])}")

# Display Confusion Matrix
display_confusionmatrix(y_test,model_in_this_cell.predict(X_test))

The best score is 0.7190666666666666 with these parameters:
	 cvec__max_df at 0.4
	 cvec__max_features at 2500
	 cvec__min_df at 1
	 cvec__ngram_range at [1, 1]
	 cvec__stop_words at None
	 mnb__fit_prior at True
The train score of 0.7932
The test score of 0.716
Our AUC is 0.7932870399999998


Unnamed: 0,Predicted Libertarian,Predicted Conservative
Actually Libertarian,879,371
Actually Conservative,339,911


In [66]:
# Build the Pipline
pipe = Pipeline([
    ('tfidf',TfidfVectorizer()),
    ('mnb',MultinomialNB())
])

params = {
    'tfidf__max_features' : text_parameters['max_features'],
    'tfidf__min_df':text_parameters['min_df'],
    'tfidf__max_df':text_parameters['max_df'],
    'tfidf__ngram_range' : text_parameters['ngram_range'],
    'tfidf__stop_words' : text_parameters['stop_words'],
    'mnb__fit_prior' : [True,False] 
}

gs_tfidf_mnb = GridSearchCV(pipe, params, cv = 5,
                          n_jobs = -2,
                          verbose = 5)

In [67]:
# Let there be fit!
start_the_clock = time.time()

gs_tfidf_mnb.fit(X_train, y_train)

print(f"{time.time() - start_the_clock} seconds to run \n")

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done  96 tasks      | elapsed:    2.5s
[Parallel(n_jobs=8)]: Done 276 tasks      | elapsed:    8.8s
[Parallel(n_jobs=8)]: Done 528 tasks      | elapsed:   16.7s


23.40483593940735 seconds to run 



[Parallel(n_jobs=8)]: Done 720 out of 720 | elapsed:   23.2s finished


In [69]:
model_in_this_cell = gs_tfidf_mnb

# Display Results for this Pipline
display_pipeline_results(model_in_this_cell)

# Display an roc area under the curve value
print(f"Our AUC is {metrics.roc_auc_score(y_test,model_in_this_cell.predict_proba(X_test)[:,1])}")

# Display Confusion Matrix
display_confusionmatrix(y_test,model_in_this_cell.predict(X_test))

The best score is 0.7196 with these parameters:
	 mnb__fit_prior at True
	 tfidf__max_df at 0.2
	 tfidf__max_features at 2500
	 tfidf__min_df at 1
	 tfidf__ngram_range at [1, 1]
	 tfidf__stop_words at None
The train score of 0.7973333333333333
The test score of 0.7164
Our AUC is 0.79702656


Unnamed: 0,Predicted Libertarian,Predicted Conservative
Actually Libertarian,880,370
Actually Conservative,339,911


## Support Vector Machine
For Count Vectorize and then Term Frequency–Inverse Document Frequency
* Build a Pipeline and Gridsearch
* Fit it
* View the scores

In [72]:
# Build the Pipline
pipe = Pipeline([
    ('cvec',CountVectorizer()),
    ('ss',StandardScaler(with_mean = False)),
    ('svc',SVC())
])

params = {
    'cvec__max_features' : text_parameters['max_features'],
    'cvec__min_df':text_parameters['min_df'],
    'cvec__max_df':text_parameters['max_df'],
    'cvec__ngram_range' : text_parameters['ngram_range'],
    'cvec__stop_words' : text_parameters['stop_words'],
}

gs_cvec_svc = GridSearchCV(pipe, params, cv = 5,
                          n_jobs = -2,
                          verbose = 5)

In [73]:
# Let there be fit!
start_the_clock = time.time()

gs_cvec_svc.fit(X_train, y_train)

print(f"{time.time() - start_the_clock} seconds to run \n")

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 tasks      | elapsed:    1.2s
[Parallel(n_jobs=8)]: Done  56 tasks      | elapsed:    2.7s
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed:   46.9s
[Parallel(n_jobs=8)]: Done 272 tasks      | elapsed:  1.9min
[Parallel(n_jobs=8)]: Done 360 out of 360 | elapsed:  2.4min finished


148.11143517494202 seconds to run 



In [74]:
model_in_this_cell = gs_cvec_svc

# Display Results for this Pipline
display_pipeline_results(model_in_this_cell)

# # Display an roc area under the curve value
# print(f"Our AUC is {metrics.roc_auc_score(y_test,model_in_this_cell.predict_proba(X_test)[:,1])}")

# Display Confusion Matrix
display_confusionmatrix(y_test,model_in_this_cell.predict(X_test))

The best score is 0.7146666666666668 with these parameters:
	 cvec__max_df at 0.4
	 cvec__max_features at 2500
	 cvec__min_df at 1
	 cvec__ngram_range at [1, 1]
	 cvec__stop_words at None
The train score of 0.9284
The test score of 0.7112


Unnamed: 0,Predicted Libertarian,Predicted Conservative
Actually Libertarian,880,370
Actually Conservative,352,898


In [78]:
# Build the Pipline
pipe = Pipeline([
    ('tfidf',TfidfVectorizer()),
    ('ss',StandardScaler(with_mean = False)),
    ('svc',SVC())
])

params = {
    'tfidf__max_features' : text_parameters['max_features'],
    'tfidf__min_df':text_parameters['min_df'],
    'tfidf__max_df':text_parameters['max_df'],
    'tfidf__ngram_range' : text_parameters['ngram_range'],
    'tfidf__stop_words' : text_parameters['stop_words'],
}

gs_tfidf_svc = GridSearchCV(pipe, params, cv = 5,
                          n_jobs = -2,
                          verbose = 10)

In [79]:
# Let there be fit!
start_the_clock = time.time()

gs_tfidf_svc.fit(X_train, y_train)

print(f"{time.time() - start_the_clock} seconds to run \n")

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 11 concurrent workers.
[Parallel(n_jobs=-2)]: Done   3 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-2)]: Done  10 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-2)]: Done  19 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-2)]: Done  28 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-2)]: Done  39 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-2)]: Done  50 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-2)]: Done  63 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-2)]: Done  76 tasks      | elapsed:    9.8s
[Parallel(n_jobs=-2)]: Done  91 tasks      | elapsed:   15.5s
[Parallel(n_jobs=-2)]: Done 106 tasks      | elapsed:   23.5s
[Parallel(n_jobs=-2)]: Done 123 tasks      | elapsed:   32.4s
[Parallel(n_jobs=-2)]: Done 140 tasks      | elapsed:   38.8s
[Parallel(n_jobs=-2)]: Done 159 tasks      | elapsed:   48.8s
[Parallel(n_jobs=-2)]: Done 178 tasks      | elapsed:   59.3s
[Parallel(n_jobs=-2)]: Done 199 tasks      | elapsed:  

134.5156753063202 seconds to run 



In [80]:
model_in_this_cell = gs_tfidf_svc

# Display Results for this Pipline
display_pipeline_results(model_in_this_cell)

# # Display an roc area under the curve value
# print(f"Our AUC is {metrics.roc_auc_score(y_test,model_in_this_cell.predict_proba(X_test)[:,1])}")

# Display Confusion Matrix
display_confusionmatrix(y_test,model_in_this_cell.predict(X_test))

The best score is 0.7174666666666666 with these parameters:
	 tfidf__max_df at 0.4
	 tfidf__max_features at 800
	 tfidf__min_df at 1
	 tfidf__ngram_range at [1, 1]
	 tfidf__stop_words at None
The train score of 0.91
The test score of 0.6924


Unnamed: 0,Predicted Libertarian,Predicted Conservative
Actually Libertarian,824,426
Actually Conservative,343,907


## Random Forest
For Count Vectorize and then Term Frequency–Inverse Document Frequency
* Build a Pipeline and Gridsearch
* Fit it
* View the scores

In [83]:
# Build the Pipline
pipe = Pipeline([
    ('cvec',CountVectorizer()),
    ('rf',RandomForestClassifier()),
])

params = {
    'cvec__max_features' : text_parameters['max_features'],
    'cvec__min_df':text_parameters['min_df'],
    'cvec__max_df':text_parameters['max_df'],
    'cvec__ngram_range' : text_parameters['ngram_range'],
    'cvec__stop_words' : text_parameters['stop_words'],
    
    'rf__n_estimators' : [5,25],
    'rf__max_depth' : [5,15,45]}

gs_cvec_rf = GridSearchCV(pipe, params, cv = 5,
                          n_jobs = -2,
                          verbose = 5)

In [84]:
# Let there be fit!
start_the_clock = time.time()

gs_cvec_rf.fit(X_train, y_train)

print(f"{time.time() - start_the_clock} seconds to run \n")

Fitting 5 folds for each of 432 candidates, totalling 2160 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 11 concurrent workers.
[Parallel(n_jobs=-2)]: Done  50 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-2)]: Done 140 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-2)]: Done 266 tasks      | elapsed:    8.4s
[Parallel(n_jobs=-2)]: Done 428 tasks      | elapsed:   14.0s
[Parallel(n_jobs=-2)]: Done 626 tasks      | elapsed:   23.4s
[Parallel(n_jobs=-2)]: Done 860 tasks      | elapsed:   33.6s
[Parallel(n_jobs=-2)]: Done 1130 tasks      | elapsed:   44.0s
[Parallel(n_jobs=-2)]: Done 1436 tasks      | elapsed:   57.9s
[Parallel(n_jobs=-2)]: Done 1778 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-2)]: Done 2160 out of 2160 | elapsed:  1.5min finished


89.6806092262268 seconds to run 



In [85]:
model_in_this_cell = gs_cvec_rf

# Display Results for this Pipline
display_pipeline_results(model_in_this_cell)

# # Display an roc area under the curve value
print(f"Our AUC is {metrics.roc_auc_score(y_test,model_in_this_cell.predict_proba(X_test)[:,1])}")

# Display Confusion Matrix
display_confusionmatrix(y_test,model_in_this_cell.predict(X_test))

The best score is 0.7097333333333333 with these parameters:
	 cvec__max_df at 0.2
	 cvec__max_features at 2500
	 cvec__min_df at 1
	 cvec__ngram_range at [1, 1]
	 cvec__stop_words at None
	 rf__max_depth at 45
	 rf__n_estimators at 25
The train score of 0.858
The test score of 0.6744
Our AUC is 0.73879744


Unnamed: 0,Predicted Libertarian,Predicted Conservative
Actually Libertarian,690,560
Actually Conservative,254,996


In [86]:
# Build the Pipline
pipe = Pipeline([
    ('tfidf',TfidfVectorizer()),
    ('rf',RandomForestClassifier()),
])

params = {
    'tfidf__max_features' : text_parameters['max_features'],
    'tfidf__min_df':text_parameters['min_df'],
    'tfidf__max_df':text_parameters['max_df'],
    'tfidf__ngram_range' : text_parameters['ngram_range'],
    'tfidf__stop_words' : text_parameters['stop_words'],
    
    'rf__n_estimators' : [5,25],
    'rf__max_depth' : [5,15,45]}

gs_tfidf_rf = GridSearchCV(pipe, params, cv = 5,
                          n_jobs = -2,
                          verbose = 5)

In [87]:
# Let there be fit!
start_the_clock = time.time()

gs_tfidf_rf.fit(X_train, y_train)

print(f"{time.time() - start_the_clock} seconds to run \n")

Fitting 5 folds for each of 432 candidates, totalling 2160 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Batch computation too fast (0.1466s.) Setting batch_size=2.
[Parallel(n_jobs=8)]: Done   2 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done   9 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    1.0s
[Parallel(n_jobs=8)]: Done  52 tasks      | elapsed:    1.6s
[Parallel(n_jobs=8)]: Done  74 tasks      | elapsed:    2.3s
[Parallel(n_jobs=8)]: Done  96 tasks      | elapsed:    3.0s
[Parallel(n_jobs=8)]: Done 122 tasks      | elapsed:    3.8s
[Parallel(n_jobs=8)]: Done 148 tasks      | elapsed:    4.6s
[Parallel(n_jobs=8)]: Done 178 tasks      | elapsed:    5.7s
[Parallel(n_jobs=8)]: Done 208 tasks      | elapsed:    6.6s
[Parallel(n_jobs=8)]: Done 242 tasks      | elapsed:    7.8s
[Parallel(n_jobs=8)]: Done 276 tasks      | elapsed:    9.1s
[Parallel(n_jobs=8)]: Done 314 tasks      | elapse

101.39671230316162 seconds to run 



In [88]:
model_in_this_cell = gs_tfidf_rf

# Display Results for this Pipline
display_pipeline_results(model_in_this_cell)

# # Display an roc area under the curve value
print(f"Our AUC is {metrics.roc_auc_score(y_test,model_in_this_cell.predict_proba(X_test)[:,1])}")

# Display Confusion Matrix
display_confusionmatrix(y_test,model_in_this_cell.predict(X_test))

The best score is 0.7083999999999999 with these parameters:
	 rf__max_depth at 45
	 rf__n_estimators at 25
	 tfidf__max_df at 0.4
	 tfidf__max_features at 2500
	 tfidf__min_df at 1
	 tfidf__ngram_range at [1, 1]
	 tfidf__stop_words at english
The train score of 0.8276
The test score of 0.6804
Our AUC is 0.75092896


Unnamed: 0,Predicted Libertarian,Predicted Conservative
Actually Libertarian,676,574
Actually Conservative,225,1025


## Summary Metrics

In [145]:
from sklearn.metrics import f1_score

model_dict = {
    'models':{
        'Logistic CVEC':gs_cvec_lr,
        'Logistic TFIDF':gs_tfidf_lr,
        'Multinomial Bayes CVEC':gs_cvec_mnb,
        'Multinomial Bayes TFIDF':gs_tfidf_mnb,
        'Support Vector CVEC':gs_cvec_svc,
        'Support Vector TFIDF':gs_tfidf_svc,
        'Random Forest CVEC':gs_cvec_rf,
        'Random Forest TFIDF':gs_cvec_rf
    }}

model_dict['Train Score'] = [i.best_estimator_.score(X_train,y_train)
                           for i in model_dict['models'].values()]

model_dict['Test Score'] = [i.best_estimator_.score(X_test,y_test)
                           for i in model_dict['models'].values()]

model_dict['Test F1 Score'] = [f1_score(y_test,i.predict(X_test)) 
                                 for i in model_dict['models'].values()]


pd.DataFrame( data = [model_dict['Train Score'],
                      model_dict['Test Score'],
                      model_dict['Test F1 Score']],
            columns = list(model_dict['models'].keys()),
            index = ['Train Score','Test Score','Test F1 Score']).T

Unnamed: 0,Train Score,Test Score,Test F1 Score
Logistic CVEC,0.809733,0.724,0.739623
Logistic TFIDF,0.8328,0.7292,0.735856
Multinomial Bayes CVEC,0.7932,0.716,0.719589
Multinomial Bayes TFIDF,0.797333,0.7164,0.719874
Support Vector CVEC,0.9284,0.7112,0.713264
Support Vector TFIDF,0.91,0.6924,0.702284
Random Forest CVEC,0.858,0.6744,0.709907
Random Forest TFIDF,0.858,0.6744,0.709907


In [129]:
# Parameters
# List of Models
list_of_models = [gs_cvec_lr, 
                  gs_tfidf_lr,
                  gs_cvec_mnb, 
                  gs_tfidf_mnb,
                  gs_cvec_svc, 
                  gs_tfidf_svc,
                  gs_cvec_rf, 
                  gs_tfidf_rf]

for i in list_of_models:
    display_pipeline_results(i)

The best score is 0.7433333333333334 with these parameters:
	 cvec__max_df at 0.2
	 cvec__max_features at 2500
	 cvec__min_df at 1
	 cvec__ngram_range at [1, 1]
	 cvec__stop_words at english
	 lr__C at 0.1
The train score of 0.8097333333333333
The test score of 0.724
The best score is 0.7405333333333333 with these parameters:
	 lr__C at 1
	 tfidf__max_df at 0.2
	 tfidf__max_features at 2500
	 tfidf__min_df at 1
	 tfidf__ngram_range at [1, 2]
	 tfidf__stop_words at english
The train score of 0.8328
The test score of 0.7292
The best score is 0.7190666666666666 with these parameters:
	 cvec__max_df at 0.4
	 cvec__max_features at 2500
	 cvec__min_df at 1
	 cvec__ngram_range at [1, 1]
	 cvec__stop_words at None
	 mnb__fit_prior at True
The train score of 0.7932
The test score of 0.716
The best score is 0.7196 with these parameters:
	 mnb__fit_prior at True
	 tfidf__max_df at 0.2
	 tfidf__max_features at 2500
	 tfidf__min_df at 1
	 tfidf__ngram_range at [1, 1]
	 tfidf__stop_words at None
The