# Project 3: Reddit API Classification & Natural Language Processing

## Tom Ludlow, DSI-NY-6

Using NLP to identify posts from **r/audioengineering** and **r/livesound**

# Notebook 3: Model Selection

This notebook contains the code and processes used to assess the effectiveness of potential classification models when used with our pre-processed data, including:
- Multinomial Naive Bayes
- K-Nearest Neighbors
- Logistic Regression Classifier
- Random Forest
- AdaBoost
- Gradient Boost

Models are tested using two vectorization transformers: **CountVectorizer, TF-IDF**

A GridSearch is run across all models to rule out non-viable options.  The models with the most predictive potential are then selected and optimized in the next notebook.

### Contents:
- [**GridSearch - CountVectorizer**](#CountVectorizer)
- [**GridSearch - TF-IDF**](#TF-IDF)
- [**Results Assessment**](#Results-assessment)

**Libraries**

In [13]:
# library imports
import requests
import time
import pandas as pd
import numpy as np
import ast
import re
from tqdm import tqdm

# preprocessing imports
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# modeling imports
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

import warnings
warnings.filterwarnings('ignore')

In [2]:
# random state var
r = 1220

In [3]:
X_train = pd.read_csv('./csv/181220_X_train.csv', index_col=0)
X_test = pd.read_csv('./csv/181220_X_test.csv', index_col=0)
y_train = pd.read_csv('./csv/181220_y_train.csv', index_col=0)
y_test = pd.read_csv('./csv/181220_y_test.csv', index_col=0)

In [4]:
y_train = pd.DataFrame(y_train, columns=['is_ls'])
y_test = pd.DataFrame(y_test, columns=['is_ls'])

In [5]:
y_train.shape

(1427, 1)

In [6]:
X_train.shape

(1427, 2)

In [7]:
X_test.shape

(476, 2)

In [8]:
y_test.shape

(476, 1)

## GridSearchCV

The `GridSearchCV` tool lets us search over multiple hyperparameter values for each of our models.  It fits a model for every combination of the hyperparameters we specify, scores each combination with cross-validation, and refits the highest-scoring combination on the full training set.
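
As a rough illustration of what "every combination" means, the sketch below expands a parameter grid by hand with `itertools.product`. The grid values mirror the ones defined later in this notebook; the helper name `expand_grid` is hypothetical and is not part of scikit-learn, which performs this expansion internally.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every combination of the values in a {name: [values]} grid."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

# Same shape as the grids defined later in this notebook.
param_grid = {
    "cv__stop_words": ["english"],
    "cv__ngram_range": [(1, 1), (1, 2)],
}

combos = expand_grid(param_grid)
n_folds = 3  # matches cv=3 in the GridSearchCV calls in this notebook

for combo in combos:
    print(combo)

# Each combination is fit once per fold; the best one is then refit once.
print("cross-validation fits:", len(combos) * n_folds)
```

With one stop-word setting and two n-gram ranges, each model here is searched over only two candidates, so the search acts as a quick benchmark rather than a full tuning run.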

We will run a single model for each of the following 6 classifiers:
 - Multinomial Naive Bayes
 - K-Nearest Neighbors
 - Logistic Regression
 - Random Forest
 - AdaBoost (adaptive boost)
 - Gradient Boost
 
We will run two GridSearches to benchmark these models for two feature extraction techniques: `CountVectorizer` and `TfidfVectorizer`.  We can use the accuracy of the results to narrow our model selection to the most effective approaches.

As these models execute, the results will be displayed, then stored into a DataFrame for final comparison.
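
To make the difference between the two vectorizers concrete, here is a minimal pure-Python sketch on a made-up three-post corpus. It assumes scikit-learn's default TF-IDF settings (smoothed IDF, L2 normalization) and uses plain whitespace tokenization with no stop-word removal, which is simpler than what the library does. The corpus and the function names are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration, not taken from the Reddit data).
docs = [
    "compressor on the vocal bus",
    "monitor mix for the vocal",
    "feedback from the stage monitor",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

def count_vector(tokens):
    """Raw term counts in vocabulary order (the idea behind CountVectorizer)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def tfidf_vector(tokens):
    """Counts reweighted by smoothed IDF, then scaled to unit L2 length."""
    n_docs = len(tokenized)
    weights = []
    for term, count in zip(vocab, count_vector(tokens)):
        doc_freq = sum(term in doc for doc in tokenized)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        weights.append(count * idf)
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

counts = count_vector(tokenized[0])
tfidf = tfidf_vector(tokenized[0])

# Show only the terms that occur in the first post.
for term, c, w in zip(vocab, counts, tfidf):
    if c:
        print(f"{term:12s} count={c}  tfidf={w:.3f}")
```

Every term in the first post has a count of 1, but TF-IDF gives "compressor" (found in one post) more weight than "vocal" (two posts) or "the" (all three), which is the behavior we are benchmarking against raw counts.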

### CountVectorizer

In [9]:
steps_list_gr_cv = [ # list of pipeline steps for each model combo
    [('cv',CountVectorizer()),('multi_nb',MultinomialNB())],
    [('cv',CountVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('cv',CountVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())],
    [('cv',CountVectorizer()),('rf',RandomForestClassifier())],
    [('cv',CountVectorizer()),('ada',AdaBoostClassifier())],
    [('cv',CountVectorizer()),('gb',GradientBoostingClassifier())]
]

In [10]:
steps_titles = ['multi_nb','knn','logreg','rf','ada','gb']

In [11]:
pipe_params_cv = [
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]},
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]},
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]},
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]},
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]},
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]}
]


In [12]:
# instantiate results DataFrame

grid_results = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','tn','fp','fn','tp'])
grid_results.head()

|  | model | best_params | train_accuracy | test_accuracy | tn | fp | fn | tp |
|---|---|---|---|---|---|---|---|---|


In [14]:
X_train_pre_post = X_train['post_lm']
X_test_pre_post = X_test['post_lm']

In [15]:
for i in tqdm(range(len(steps_list_gr_cv))):           # timed loop over the index of each pipeline
    pipe = Pipeline(steps=steps_list_gr_cv[i])         # configure pipeline for each model
    grid = GridSearchCV(pipe, pipe_params_cv[i], cv=3) # set up a 3-fold grid search over the model's params

    model_results = {}

    # pass the target as a 1-D Series rather than a one-column DataFrame
    grid.fit(X_train_pre_post, y_train['is_ls'])
    
    print('Model: ',steps_titles[i])
    model_results['model'] = steps_titles[i]

    print('Best Params: ', grid.best_params_)
    model_results['best_params'] = grid.best_params_

    # accuracy of the refit best estimator on the training data
    train_accuracy = grid.score(X_train_pre_post, y_train['is_ls'])
    print(train_accuracy, '\n')
    model_results['train_accuracy'] = train_accuracy
    
    # accuracy of the same estimator on the held-out test data
    test_accuracy = grid.score(X_test_pre_post, y_test['is_ls'])
    print(test_accuracy, '\n')
    model_results['test_accuracy'] = test_accuracy

    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test['is_ls'], grid.predict(X_test_pre_post)).ravel() 
    print("True Negatives: %s" % tn)
    model_results['tn'] = tn

    print("False Positives: %s" % fp)  
    model_results['fp'] = fp

    print("False Negatives: %s" % fn)
    model_results['fn'] = fn

    print("True Positives: %s" % tp, '\n')
    model_results['tp'] = tp

    # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
    grid_results = pd.concat([grid_results, pd.DataFrame([model_results])], ignore_index=True)

Model:  multi_nb
Best Params:  {'cv__ngram_range': (1, 2), 'cv__stop_words': 'english'}
0.995795374912

0.836134453782

True Negatives: 195
False Positives: 36
False Negatives: 42
True Positives: 203

Model:  knn
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.547302032235

0.533613445378

True Negatives: 10
False Positives: 221
False Negatives: 1
True Positives: 244

Model:  logreg
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.999299229152

0.766806722689

True Negatives: 174
False Positives: 57
False Negatives: 54
True Positives: 191

Model:  rf
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.985984583041

0.773109243697

True Negatives: 187
False Positives: 44
False Negatives: 64
True Positives: 181

Model:  ada
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.861948142957

0.77731092437

True Negatives: 178
False Positives: 53
False Negatives: 53
True Positives: 192

Model:  gb
Best Params:  {'cv__ngram_range': (1, 2), 'cv__stop_words': 'english'}
0.913805185704

0.813025210084

True Negatives: 168
False Positives: 63
False Negatives: 26
True Positives: 219

100%|██████████| 6/6 [00:21<00:00,  4.78s/it]

In [16]:
grid_results_cv = grid_results

In [17]:
grid_results.sort_values('test_accuracy',ascending=False)

|  | model | best_params | train_accuracy | test_accuracy | tn | fp | fn | tp |
|---|---|---|---|---|---|---|---|---|
| 0 | multi_nb | {'cv__ngram_range': (1, 2), 'cv__stop_words': ... | 0.995795 | 0.836134 | 195.0 | 36.0 | 42.0 | 203.0 |
| 5 | gb | {'cv__ngram_range': (1, 2), 'cv__stop_words': ... | 0.913805 | 0.813025 | 168.0 | 63.0 | 26.0 | 219.0 |
| 4 | ada | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.861948 | 0.777311 | 178.0 | 53.0 | 53.0 | 192.0 |
| 3 | rf | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.985985 | 0.773109 | 187.0 | 44.0 | 64.0 | 181.0 |
| 2 | logreg | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.999299 | 0.766807 | 174.0 | 57.0 | 54.0 | 191.0 |
| 1 | knn | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.547302 | 0.533613 | 10.0 | 221.0 | 1.0 | 244.0 |
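
The `test_accuracy` column can be cross-checked against the confusion-matrix counts, and the same counts help explain why K-Nearest Neighbors scores so poorly. The sketch below uses the figures reported in the table above for two of the models; the `summarize` helper is hypothetical.

```python
def summarize(tn, fp, fn, tp):
    """Derive accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "sensitivity": tp / (tp + fn),  # share of positive-class (is_ls=1) posts identified
        "specificity": tn / (tn + fp),  # share of negative-class (is_ls=0) posts identified
    }

# Counts copied from the CountVectorizer results table (tn, fp, fn, tp).
reported = {
    "multi_nb": (195, 36, 42, 203),
    "knn": (10, 221, 1, 244),
}

results = {name: summarize(*counts) for name, counts in reported.items()}

for name, metrics in results.items():
    print(name, {key: round(value, 6) for key, value in metrics.items()})
```

The recomputed accuracies agree with the table. For `knn`, sensitivity is close to 1 while specificity is under 5%, meaning the model labels nearly every post as the positive class and therefore barely beats a constant guess.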


### TF-IDF

In [18]:
steps_list_gr_tf = [ # list of pipeline steps for each model combo
    [('tf',TfidfVectorizer()),('multi_nb',MultinomialNB())],
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())],
    [('tf',TfidfVectorizer()),('rf',RandomForestClassifier())],
    [('tf',TfidfVectorizer()),('ada',AdaBoostClassifier())],
    [('tf',TfidfVectorizer()),('gb',GradientBoostingClassifier())]
]

In [19]:
steps_titles = ['multi_nb','knn','logreg','rf','ada','gb']

In [20]:
pipe_params_tf = [
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]}
]


In [21]:
# instantiate results DataFrame

grid_results = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','tn','fp','fn','tp'])
grid_results.head()

|  | model | best_params | train_accuracy | test_accuracy | tn | fp | fn | tp |
|---|---|---|---|---|---|---|---|---|


In [22]:
X_train_pre_post = X_train['post_lm']
X_test_pre_post = X_test['post_lm']

In [23]:
for i in tqdm(range(len(steps_list_gr_tf))):           # timed loop over the index of each pipeline
    pipe = Pipeline(steps=steps_list_gr_tf[i])         # configure pipeline for each model
    grid = GridSearchCV(pipe, pipe_params_tf[i], cv=3) # set up a 3-fold grid search over the model's params

    model_results = {}

    # pass the target as a 1-D Series rather than a one-column DataFrame
    grid.fit(X_train_pre_post, y_train['is_ls'])
    
    print('Model: ',steps_titles[i])
    model_results['model'] = steps_titles[i]

    print('Best Params: ', grid.best_params_)
    model_results['best_params'] = grid.best_params_

    # accuracy of the refit best estimator on the training data
    train_accuracy = grid.score(X_train_pre_post, y_train['is_ls'])
    print(train_accuracy, '\n')
    model_results['train_accuracy'] = train_accuracy
    
    # accuracy of the same estimator on the held-out test data
    test_accuracy = grid.score(X_test_pre_post, y_test['is_ls'])
    print(test_accuracy, '\n')
    model_results['test_accuracy'] = test_accuracy

    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test['is_ls'], grid.predict(X_test_pre_post)).ravel()
    print("True Negatives: %s" % tn)
    model_results['tn'] = tn

    print("False Positives: %s" % fp)
    model_results['fp'] = fp

    print("False Negatives: %s" % fn)
    model_results['fn'] = fn

    print("True Positives: %s" % tp, '\n')
    model_results['tp'] = tp

    # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
    grid_results = pd.concat([grid_results, pd.DataFrame([model_results])], ignore_index=True)

Model:  multi_nb
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.957252978276

0.819327731092

True Negatives: 195
False Positives: 36
False Negatives: 50
True Positives: 195

Model:  knn
Best Params:  {'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}
0.540995094604

0.529411764706

True Negatives: 7
False Positives: 224
False Negatives: 0
True Positives: 245

Model:  logreg
Best Params:  {'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}
0.999299229152

0.821428571429

True Negatives: 183
False Positives: 48
False Negatives: 37
True Positives: 208

Model:  rf
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.995795374912

0.756302521008

True Negatives: 184
False Positives: 47
False Negatives: 69
True Positives: 176

Model:  ada
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.868955851437

0.760504201681

True Negatives: 176
False Positives: 55
False Negatives: 59
True Positives: 186

Model:  gb
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.924316748423

0.800420168067

True Negatives: 174
False Positives: 57
False Negatives: 38
True Positives: 207

100%|██████████| 6/6 [00:20<00:00,  4.69s/it]


In [24]:
grid_results_tf = grid_results

## Results assessment

We add a column, `tt_gap`, for the gap between the train and test accuracy scores.  A large gap suggests that a model is overfitting the training data.

In [25]:
grid_results_tf['tt_gap'] = grid_results_tf['train_accuracy'] - grid_results_tf['test_accuracy']
grid_results_cv['tt_gap'] = grid_results_cv['train_accuracy'] - grid_results_cv['test_accuracy']

The **baseline accuracy** is the accuracy we would achieve by always predicting the majority class, `is_ls=1`, which equals that class's share of the dataset.  Here, we normalize our value counts to show a baseline accuracy of **51.4%**.

In [26]:
# baseline accuracy
y_train.is_ls.value_counts(normalize=True)

1    0.514366
0    0.485634
Name: is_ls, dtype: float64
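
The same calculation can be sketched without pandas: the baseline is simply the share of the most common label. The label list below is rebuilt from the test-set class totals implied by the confusion matrices earlier in this notebook (231 negatives, 245 positives), so it is an illustration of the method rather than a reload of the data; the variable names are assumptions.

```python
from collections import Counter

# Test-set class totals implied by the confusion matrices (tn + fp, fn + tp).
labels = [0] * 231 + [1] * 245

label_counts = Counter(labels)
majority_label, majority_count = label_counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority label.
baseline = majority_count / len(labels)

print("majority label:", majority_label)
print("baseline accuracy:", round(baseline, 4))
```

The test-set baseline comes out within a fraction of a percentage point of the training-set figure, so comparing test accuracy against the 51.4% training baseline is a reasonable shortcut here.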

In [27]:
grid_results_tf['ba_gap'] = grid_results_tf['test_accuracy'] - y_train.is_ls.value_counts(normalize=True)[1]
grid_results_cv['ba_gap'] = grid_results_cv['test_accuracy'] - y_train.is_ls.value_counts(normalize=True)[1]

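By consolidating our results and sorting them by `test_accuracy`, we can assess which models will be the best starting points.  Overall, the CountVectorizer and TF-IDF models performed similarly.  Because a CountVectorizer model registered the highest single test score (Multinomial Naive Bayes, 0.836), we will use CountVectorizer as our default vectorizer, keeping TF-IDF only for Logistic Regression, where it scored noticeably higher.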

In [28]:
grid_results_cv.sort_values('test_accuracy',ascending=False)

|  | model | best_params | train_accuracy | test_accuracy | tn | fp | fn | tp | tt_gap | ba_gap |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | multi_nb | {'cv__ngram_range': (1, 2), 'cv__stop_words': ... | 0.995795 | 0.836134 | 195.0 | 36.0 | 42.0 | 203.0 | 0.159661 | 0.321769 |
| 5 | gb | {'cv__ngram_range': (1, 2), 'cv__stop_words': ... | 0.913805 | 0.813025 | 168.0 | 63.0 | 26.0 | 219.0 | 0.10078 | 0.298659 |
| 4 | ada | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.861948 | 0.777311 | 178.0 | 53.0 | 53.0 | 192.0 | 0.084637 | 0.262945 |
| 3 | rf | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.985985 | 0.773109 | 187.0 | 44.0 | 64.0 | 181.0 | 0.212875 | 0.258743 |
| 2 | logreg | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.999299 | 0.766807 | 174.0 | 57.0 | 54.0 | 191.0 | 0.232493 | 0.252441 |
| 1 | knn | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.547302 | 0.533613 | 10.0 | 221.0 | 1.0 | 244.0 | 0.013689 | 0.019248 |


In [29]:
grid_results_tf.sort_values('test_accuracy',ascending=False)

|  | model | best_params | train_accuracy | test_accuracy | tn | fp | fn | tp | tt_gap | ba_gap |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | logreg | {'tf__ngram_range': (1, 2), 'tf__stop_words': ... | 0.999299 | 0.821429 | 183.0 | 48.0 | 37.0 | 208.0 | 0.177871 | 0.307063 |
| 0 | multi_nb | {'tf__ngram_range': (1, 1), 'tf__stop_words': ... | 0.957253 | 0.819328 | 195.0 | 36.0 | 50.0 | 195.0 | 0.137925 | 0.304962 |
| 5 | gb | {'tf__ngram_range': (1, 1), 'tf__stop_words': ... | 0.924317 | 0.80042 | 174.0 | 57.0 | 38.0 | 207.0 | 0.123897 | 0.286054 |
| 4 | ada | {'tf__ngram_range': (1, 1), 'tf__stop_words': ... | 0.868956 | 0.760504 | 176.0 | 55.0 | 59.0 | 186.0 | 0.108452 | 0.246138 |
| 3 | rf | {'tf__ngram_range': (1, 1), 'tf__stop_words': ... | 0.995795 | 0.756303 | 184.0 | 47.0 | 69.0 | 176.0 | 0.239493 | 0.241937 |
| 1 | knn | {'tf__ngram_range': (1, 2), 'tf__stop_words': ... | 0.540995 | 0.529412 | 7.0 | 224.0 | 0.0 | 245.0 | 0.011583 | 0.015046 |
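
To compare the two vectorizers model by model rather than by eye, the sketch below pairs up the `test_accuracy` values reported in the two tables above. The dictionary layout and variable names are illustrative assumptions; in the notebook itself the same comparison could be done by merging the two results DataFrames on `model`.

```python
# test_accuracy values copied from the two results tables.
cv_acc = {"multi_nb": 0.836134, "knn": 0.533613, "logreg": 0.766807,
          "rf": 0.773109, "ada": 0.777311, "gb": 0.813025}
tf_acc = {"multi_nb": 0.819328, "knn": 0.529412, "logreg": 0.821429,
          "rf": 0.756303, "ada": 0.760504, "gb": 0.800420}

tfidf_wins = []
for model in cv_acc:
    diff = tf_acc[model] - cv_acc[model]  # positive means TF-IDF scored higher
    winner = "TF-IDF" if diff > 0 else "CountVectorizer"
    if diff > 0:
        tfidf_wins.append(model)
    print(f"{model:9s} cv={cv_acc[model]:.3f}  tf={tf_acc[model]:.3f}  diff={diff:+.3f}  {winner}")

print("models where TF-IDF scored higher:", tfidf_wins)
```

On these figures, Logistic Regression is the only model that improves with TF-IDF, and it does so by roughly five percentage points; every other model scores slightly higher with CountVectorizer.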


Looking at model types, we can see that the CountVectorizer Multinomial Naive Bayes and TF-IDF Logistic Regression models performed best on the initial run.  We will select these two, along with the Random Forest model, which the project requirements call for, and the Gradient Boost decision-tree ensemble, the second-best CountVectorizer model, as a further candidate for improving accuracy.  We will continue to optimize each of these models in the next notebook.

### Model Selections: 
#### 1. Lemmatized CountVectorizer Multinomial Naive-Bayes
  - `cv__ngram_range=(1,2)`
  - `cv__stop_words='english'`
  
#### 2. Lemmatized CountVectorizer Random Forest 
*(project requirement)*
  - `cv__ngram_range=(1,1)`
  - `cv__stop_words='english'`
  
#### 3. Lemmatized CountVectorizer Gradient-Boost Decision Tree
  - `cv__ngram_range=(1,2)`
  - `cv__stop_words='english'`
  
#### 4. Lemmatized TF-IDF Scaled Logistic Regression
  - `tf__ngram_range=(1,2)`
  - `tf__stop_words='english'`

## Continue to Notebook 4: Model Optimization