---
# Model Selection
---


This notebook contains the code and processes used to assess the effectiveness of potential classification models when used with our pre-processed data, including:

- Naive Bayes (Pre-requisite model)
    - **The columns of features would be all integer counts, so `MultinomialNB` is the best choice here.**
    - `BernoulliNB` is best when every column of X is a 0/1 indicator (a.k.a. dummy variables).
    - `GaussianNB` is best when the columns of X are normally distributed.
    
- K-Nearest Neighbors

- Logistic Regression Classifier
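
The Naive Bayes choice in the list above can be illustrated with a minimal sketch. The toy corpus and labels are invented for illustration and are not the project data; the point is only that a default `CountVectorizer` produces integer counts (suited to `MultinomialNB`), while `binary=True` produces 0/1 indicators (suited to `BernoulliNB`).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy corpus and labels, invented for illustration only.
docs = ["help help please", "good day today", "please help", "good good day"]
labels = [1, 0, 1, 0]

# A default CountVectorizer yields non-negative integer term counts.
cvec = CountVectorizer()
counts = cvec.fit_transform(docs)
mnb = MultinomialNB().fit(counts, labels)

# binary=True collapses counts to 0/1 indicators, the input BernoulliNB models.
bvec = CountVectorizer(binary=True)
flags = bvec.fit_transform(docs)
bnb = BernoulliNB().fit(flags, labels)

print("largest count:", counts.toarray().max())
print("largest indicator:", flags.toarray().max())
print("MultinomialNB prediction:", mnb.predict(cvec.transform(["help please"]))[0])
```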

The models are then optimized through an iterative approach. For each model, the parameters and results of each GridSearch run are stored in a results dictionary. The train/test split and the logistic regression estimator use a fixed `random_state`, and `cv=5` produces unshuffled stratified folds for a classifier, so cross-validation splits are consistent between runs and we can directly compare the effect of hyperparameters.

We start with a wide range for fields of interest, and narrow around the optimally selected value and gauge the degree of accuracy increase (or decrease). Through trial and error, we are able to select hyperparameters that will promote the most accurate modeling results.
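
The wide-then-narrow search described above could be automated with a small helper like the one below. This is a sketch only: `narrow_grid` is a hypothetical name, not something defined in this notebook, and in the notebook itself the ranges were adjusted by hand.

```python
def narrow_grid(best, width, n_points=5):
    """Return n_points evenly spaced candidates centred on `best`, spanning +/- width.

    Values are rounded to 10 decimals to avoid float noise such as
    1.2000000000000002 appearing in the candidate list.
    """
    step = 2 * width / (n_points - 1)
    return [round(best - width + i * step, 10) for i in range(n_points)]


# Hypothetical example: a coarse search picked alpha = 1.2, so the next
# search looks at a finer grid around that value.
fine_alphas = narrow_grid(best=1.2, width=0.2)
print(fine_alphas)
```

The returned list can be passed directly as a value in a `GridSearchCV` parameter dictionary, for example `{"mnb__alpha": fine_alphas}`.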

#### Imports and load file

In [1]:
# library imports
import requests
import time
import pandas as pd
import numpy as np
import re

from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import warnings
warnings.filterwarnings('ignore')

In [2]:
# open cleaned data
df_clean = pd.read_csv('./dataset/df_clean.csv')

---

In [3]:
df_clean.head()

Unnamed: 0,is_sw,cleaned_post_stem,cleaned_post_lem
0,0,new wiki avoid accident encourag suicid spot c...,new wiki avoid accidentally encouraging suicid...
1,0,remind absolut activ kind allow day want recog...,reminder absolutely activism kind allowed day ...
2,0,haha help ye suicid ye get help post mobil bel...,haha help yes suicidal yes getting help posted...
3,0,someon pleas talk anyon pleas absolut one turn...,someone please talk anyone please absolutely o...
4,0,usual respond post feel like post lot peopl ne...,usually responding post feel like posting lot ...


Set the X matrix to contain the `cleaned_post_lem` feature column, and our y target vector to `is_sw`.

In [4]:
X = df_clean[['cleaned_post_lem']]
y = df_clean['is_sw']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [6]:
print('X_train length:{}'.format(len(X_train)))
print('y_train length:{}'.format(len(y_train)))
print('X_test length:{}'.format(len(X_test)))
print('y_test length:{}'.format(len(y_test)))      

X_train length:1431
y_train length:1431
X_test length:478
y_test length:478
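
The split sizes above follow from the default `test_size` of 25% in `train_test_split`, and `stratify=y` keeps the class balance the same in both partitions. A minimal sketch on invented toy labels (not the project data) shows both effects:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with a 60/40 class balance, invented for illustration only.
y_toy = pd.Series([0] * 60 + [1] * 40, name="is_sw")
X_toy = pd.DataFrame({"cleaned_post_lem": ["post"] * len(y_toy)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

# Default split holds out 25% of the rows for testing.
print("train rows:", len(X_tr), "test rows:", len(X_te))
# With stratify, the share of positive labels matches across partitions.
print("positive share (train):", y_tr.mean())
print("positive share (test):", y_te.mean())
```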


---

## 1. Model Optimization
---

### Model selections

##### CountVectorizer Multinomial Naive-Bayes (project requirement)

- Best Parameters:  {'cvec__max_df': 0.5, 'cvec__max_features': 5000, 'cvec__min_df': 3, 'cvec__ngram_range': (1, 2)}

##### CountVectorizer Logistic Regression

- Best Parameters:  {'cvec__max_df': 0.5, 'cvec__max_features': 10000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2)}

##### CountVectorizer K-Nearest Neighbors

- Best Parameters:  {'cvec__max_df': 0.2, 'cvec__max_features': 1000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2)}

Using the count-vectorizer parameters found above, we further optimize the models. The train/test split uses a fixed `random_state` and `cv=5` produces unshuffled stratified folds, so cross-validation splits are consistent between runs and we can directly compare the effect of hyperparameters. Each search starts with a wide range for the fields of interest, then narrows around the best value found to gauge the change in accuracy.
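
One way to make the fold assignment explicit, rather than relying on the defaults behind `cv=5`, is to pass a seeded splitter as the `cv` argument. The sketch below assumes `StratifiedKFold` with shuffling; the notebook itself passes the integer `cv=5`, so this is an alternative, not the method used here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels, invented for illustration only.
y_demo = np.array([0] * 10 + [1] * 10)
X_demo = np.zeros((20, 1))

# A seeded, shuffled splitter produces the same folds on every call.
seeded_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds_first = [test_idx.tolist() for _, test_idx in seeded_cv.split(X_demo, y_demo)]
folds_second = [test_idx.tolist() for _, test_idx in seeded_cv.split(X_demo, y_demo)]

print("identical folds across calls:", folds_first == folds_second)
# The splitter could then be passed as GridSearchCV(pipe, params, cv=seeded_cv).
```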

---
### CountVectorizer Multinomial Naive-Bayes

In [7]:
# parameters for GridSearch using Pipeline
mnb_params = {"mnb__alpha":np.arange(1,1.5,.1), 
              "cvec__max_features":[4000, 5000, 7000, 10000]}

# steps defining pipeline sequence and fixed parameters for GridSearch
mnb_steps = [('cvec',CountVectorizer(ngram_range= (1, 2))),
            ('mnb',MultinomialNB())]

In [8]:
# establish model pipeline by reference to steps list
pipe = Pipeline(mnb_steps)

In [9]:
%%time

# empty dict to store results
mnb_post_results = {} 

# optimize GridSearch hyperparameters on `cv=5` cross validation runs
grid = GridSearchCV(pipe, mnb_params, cv=5) 
# fit to our training data
grid.fit(X_train['cleaned_post_lem'], y_train) 


# print/store training accuracy
print('Train Accuracy: ',grid.score(X_train['cleaned_post_lem'], y_train))
mnb_post_results['train_accuracy'] = grid.score(X_train['cleaned_post_lem'], y_train) 

# print/store test accuracy
print('Test Accuracy: ',grid.score(X_test['cleaned_post_lem'], y_test))
mnb_post_results['test_accuracy'] = grid.score(X_test['cleaned_post_lem'], y_test) 

# print/store best parameters
print('Best Parameters: ',grid.best_params_)
mnb_post_results['best_params'] = grid.best_params_ 

tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test['cleaned_post_lem'])).ravel() # inspect counted results in matrix
print("True Negatives: %s" % tn)
mnb_post_results['tn'] = tn
print("False Positives: %s" % fp)
mnb_post_results['fp'] = fp
print("False Negatives: %s" % fn)
mnb_post_results['fn'] = fn
print("True Positives: %s" % tp)
mnb_post_results['tp'] = tp

Train Accuracy:  0.8714185883997205
Test Accuracy:  0.6778242677824268
Best Parameters:  {'cvec__max_features': 4000, 'mnb__alpha': 1.2000000000000002}
True Negatives: 190
False Positives: 56
False Negatives: 98
True Positives: 134
Wall time: 1min 27s


##### Result Metrics: Multinomial Naive-Bayes

In [10]:
# accuracy (verification since .score(test) should get the same result)
(mnb_post_results['tn'] + mnb_post_results['tp']) / (mnb_post_results['tn'] + mnb_post_results['fp'] + mnb_post_results['fn'] + mnb_post_results['tp'])

0.6778242677824268

In [11]:
# sensitivity
mnb_post_results['tp'] / (mnb_post_results['tp'] + mnb_post_results['fn'])

0.5775862068965517

In [12]:
# specificity
mnb_post_results['tn'] / (mnb_post_results['tn'] + mnb_post_results['fp'])

0.7723577235772358

In [13]:
# precision
mnb_post_results['tp'] / (mnb_post_results['tp'] + mnb_post_results['fp'])

0.7052631578947368
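
The four metric cells above repeat the same arithmetic for every model, so they could be collected into one helper. A minimal sketch, where `classification_metrics` is a hypothetical name not defined elsewhere in this notebook, checked against the Naive Bayes confusion counts printed above:

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute the four metrics used in this notebook from confusion counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
    }


# Confusion counts taken from the Multinomial Naive-Bayes output above.
mnb_metrics = classification_metrics(tn=190, fp=56, fn=98, tp=134)
for name, value in mnb_metrics.items():
    print(f"{name}: {value:.4f}")
```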

---
### CountVectorizer Logistic Regression

In [14]:
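# parameters for GridSearch using Pipeline
lr_params = {"lr__penalty":['l1', 'l2'], 
             "lr__C": np.arange(1,1.5,.1),
             "lr__tol":[.00035],
             "cvec__max_features":[5000,10000,20000,30000]}

# steps defining pipeline sequence and fixed parameters for GridSearch
# solver='liblinear' supports both 'l1' and 'l2'; the newer default 'lbfgs' rejects 'l1'
lr_steps = [('cvec',CountVectorizer(ngram_range= (1, 2))),
            ('lr',LogisticRegression(solver='liblinear', random_state=42))]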

In [15]:
# establish model pipeline by reference to steps list
pipe = Pipeline(lr_steps)

In [16]:
%%time

# empty dict to store results
lr_post_results = {}

# optimize GridSearch hyperparameters on `cv=5` cross validation runs
grid = GridSearchCV(pipe, lr_params, cv=5)
# fit to our training data
grid.fit(X_train['cleaned_post_lem'], y_train)

# print/store training accuracy
print('Train Accuracy: ',grid.score(X_train['cleaned_post_lem'], y_train))
lr_post_results['train_accuracy'] = grid.score(X_train['cleaned_post_lem'], y_train)

# print/store test accuracy
print('Test Accuracy: ',grid.score(X_test['cleaned_post_lem'], y_test))
lr_post_results['test_accuracy'] = grid.score(X_test['cleaned_post_lem'], y_test)

# print/store best parameters
print('Best Parameters: ',grid.best_params_)
lr_post_results['best_params'] = grid.best_params_

tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test['cleaned_post_lem'])).ravel()
print("True Negatives: %s" % tn)
lr_post_results['tn'] = tn
print("False Positives: %s" % fp)
lr_post_results['fp'] = fp
print("False Negatives: %s" % fn)
lr_post_results['fn'] = fn
print("True Positives: %s" % tp)
lr_post_results['tp'] = tp


Train Accuracy:  0.9937106918238994
Test Accuracy:  0.6924686192468619
Best Parameters:  {'cvec__max_features': 30000, 'lr__C': 1.0, 'lr__penalty': 'l2', 'lr__tol': 0.00035}
True Negatives: 187
False Positives: 59
False Negatives: 88
True Positives: 144


##### Result Metrics: CountVectorized Logistic Regression

In [18]:
# accuracy (verification since .score(test) should get the same result)
(lr_post_results['tn'] + lr_post_results['tp']) / (lr_post_results['tn'] + lr_post_results['fp'] + lr_post_results['fn'] + lr_post_results['tp'])

0.6924686192468619

In [19]:
# sensitivity
lr_post_results['tp'] / (lr_post_results['tp'] + lr_post_results['fn'])

0.6206896551724138

In [20]:
# specificity
lr_post_results['tn'] / (lr_post_results['tn'] + lr_post_results['fp'])

0.7601626016260162

In [21]:
# precision
lr_post_results['tp'] / (lr_post_results['tp'] + lr_post_results['fp'])

0.7093596059113301

---
### CountVectorizer K-Nearest Neighbors

In [22]:
# parameters for GridSearch using Pipeline
knn_params = {"knn__n_neighbors":np.arange(4,20,2),
              "cvec__max_features":[500, 1000, 3000]}

# steps defining pipeline sequence and fixed parameters for GridSearch
knn_steps = [('cvec',CountVectorizer(ngram_range= (1, 2))),
            ('knn',KNeighborsClassifier())]

In [23]:
# establish model pipeline by reference to steps list
pipe = Pipeline(knn_steps)

In [24]:
%%time

# empty dict to store results
knn_post_results = {} 

# optimize GridSearch hyperparameters on `cv=5` cross validation runs
grid = GridSearchCV(pipe, knn_params, cv=5) 
# fit to our training data
grid.fit(X_train['cleaned_post_lem'], y_train) 


# print/store training accuracy
print('Train Accuracy: ',grid.score(X_train['cleaned_post_lem'], y_train))
knn_post_results['train_accuracy'] = grid.score(X_train['cleaned_post_lem'], y_train) 

# print/store test accuracy
print('Test Accuracy: ',grid.score(X_test['cleaned_post_lem'], y_test))
knn_post_results['test_accuracy'] = grid.score(X_test['cleaned_post_lem'], y_test) 

# print/store best parameters
print('Best Parameters: ',grid.best_params_)
knn_post_results['best_params'] = grid.best_params_ 

tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test['cleaned_post_lem'])).ravel() # inspect counted results in matrix
print("True Negatives: %s" % tn)
knn_post_results['tn'] = tn
print("False Positives: %s" % fp)
knn_post_results['fp'] = fp
print("False Negatives: %s" % fn)
knn_post_results['fn'] = fn
print("True Positives: %s" % tp)
knn_post_results['tp'] = tp

Train Accuracy:  0.6373165618448637
Test Accuracy:  0.606694560669456
Best Parameters:  {'cvec__max_features': 500, 'knn__n_neighbors': 12}
True Negatives: 215
False Positives: 31
False Negatives: 157
True Positives: 75
Wall time: 1min 46s


##### Result Metrics: CountVectorized KNN

In [25]:
# accuracy (verification since .score(test) should get the same result)
(knn_post_results['tn'] + knn_post_results['tp']) / (knn_post_results['tn'] + knn_post_results['fp'] + knn_post_results['fn'] + knn_post_results['tp'])

0.606694560669456

In [26]:
# sensitivity
knn_post_results['tp'] / (knn_post_results['tp'] + knn_post_results['fn'])

0.3232758620689655

In [27]:
# specificity
knn_post_results['tn'] / (knn_post_results['tn'] + knn_post_results['fp'])

0.8739837398373984

In [28]:
# precision
knn_post_results['tp'] / (knn_post_results['tp'] + knn_post_results['fp'])

0.7075471698113207

### Optimized features

##### Model 1: Multinomial Naive-Bayes

- Lemmatizer
- CountVectorizer
    - ngram_range=(1,2)
- GridSearch
    - cvec__max_features=4000
    - mnb__alpha=1.2
- Train Accuracy:  0.8714185883997205
- Test Accuracy:  0.6778242677824268


##### Model 2: Logistic Regression

- Lemmatizer
- CountVectorizer
    - ngram_range=(1,2)
- GridSearch
    - cvec__max_features=30000
    - lr__penalty='l2'
    - lr__C=1
    - lr__tol=.00035
- Train Accuracy:  0.9937106918238994
- Test Accuracy:  0.6924686192468619


##### Model 3: K-Nearest Neighbors

- Lemmatizer
- CountVectorizer
    - ngram_range=(1,2)
- GridSearch
    - cvec__max_features=500
    - knn__n_neighbors=12
- Train Accuracy:  0.6373165618448637
- Test Accuracy:  0.606694560669456


---
### Evaluation

Test accuracy for the Logistic Regression model was higher than for the K-Nearest Neighbors model, although both improved with hyperparameter tuning. Multinomial Naive Bayes, however, saw a decrease in test accuracy.

The highest-performing model is Logistic Regression, and its coefficients make the results easy to interpret. Moreover, we see improvements in accuracy, specificity, sensitivity and precision. Despite tuning the parameters, I could not get above 69.2% test accuracy, unfortunately.

I spent some time trying to generalize the pipelines in order to efficiently run many different grid searches over many models and parameters. This generalized function has room for improvement, but it helped me stay organized during testing.
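
The generalized function itself is not shown in this notebook. The sketch below is one possible shape for it, under the assumption that it wraps the repeated fit, score and confusion-matrix steps from the cells above; `run_grid` and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def run_grid(steps, params, X_train, y_train, X_test, y_test, cv=5):
    """Fit a GridSearchCV over a pipeline and return its results as a dict."""
    grid = GridSearchCV(Pipeline(steps), params, cv=cv)
    grid.fit(X_train, y_train)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted.
    tn, fp, fn, tp = confusion_matrix(
        y_test, grid.predict(X_test), labels=[0, 1]
    ).ravel()
    return {
        "train_accuracy": grid.score(X_train, y_train),
        "test_accuracy": grid.score(X_test, y_test),
        "best_params": grid.best_params_,
        "tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp),
    }


# Toy data, invented for illustration only.
train_docs = ["need help now", "please help me", "feel so alone", "cannot go on",
              "great day out", "love this song", "nice walk today", "fun game night"]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]
test_docs = ["help me please", "great song today"]
test_y = [1, 0]

results = run_grid(
    steps=[("cvec", CountVectorizer()), ("mnb", MultinomialNB())],
    params={"mnb__alpha": [0.5, 1.0]},
    X_train=train_docs, y_train=train_y,
    X_test=test_docs, y_test=test_y,
    cv=2,  # small fold count only because the toy data is tiny
)
print(results)
```

Calling the helper once per model and collecting the returned dictionaries in a list gives a table of runs via `pd.DataFrame(list_of_results)`.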