# Project 3: Natural Language Processing and Classification

Benjamin Chee, DSI-SG-17

Classifying posts from r/xboxone and r/PS5

# Notebook 4: Model Optimisation

This notebook contains the code used to train and tune classification models on our prepared data.

The following were used:
- Multinomial Naive Bayes
- K-Nearest Neighbors
- Logistic Regression Classifier
- Random Forest

GridSearchCV was then used to optimise each model, and each optimised model was evaluated on the test set.

Contents:
- GridSearch - CountVectorizer
- GridSearch - TF-IDF


## Libraries

In [3]:
import datetime
import time
import re
import pandas as pd
import numpy as np
from tqdm import tqdm

# general scikit-learn imports
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

#NLP
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# metrics
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score,roc_curve

In [4]:
#initialise date time
date_run = datetime.datetime.now()
date= date_run.date()

In [5]:
#reading output from notebook 1
df_pre=pd.read_csv('./csv/df_pre_2020-10-01.csv')

In [7]:
df_pre.dropna(inplace=True)

### Train-Test Split

In [9]:
X = df_pre[['post_st','post_lm']]
y = df_pre['from_ps5']

In [5]:
df_pre.head()

Unnamed: 0,post_st,post_lm,from_ps5
0,tech weekli xbox one tech support thi is the t...,tech weekly xbox one tech support this is the ...,0
1,gta iv one of my fav game ever nearli a lock ...,gta iv one of my fav game ever nearly a locked...,0
2,more seri x load time comparison,more series x load time comparison,0
3,digit foundri xbox seri x backward compat test...,digital foundry xbox series x backwards compat...,0
4,do you rememb when thi pictur blew our mind,do you remember when this picture blew our mind,0


In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [11]:
len(X_train)

1021

In [12]:
len(y_train)

1021
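As a quick illustration of what `stratify=y` does in the split above, the sketch below uses synthetic labels (an assumption for illustration, not the project data) and checks that the class balance is preserved in both splits. With the default `test_size` of 0.25, the test set holds a quarter of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (an assumption for illustration): 60% class 0, 40% class 1
y_demo = np.array([0] * 600 + [1] * 400)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# With stratify, both splits should keep roughly the original class balance
train_share = yd_train.mean()
test_share = yd_test.mean()
print(f"positive share: full={y_demo.mean():.3f}, "
      f"train={train_share:.3f}, test={test_share:.3f}")
```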

## Count Vectorised Multinomial Naive Bayes


In [13]:
mnb_runs = pd.DataFrame(columns=['train_accuracy','test_accuracy','bp','tn','fp','fn','tp'])


In [14]:
mnb_runs.head()

Unnamed: 0,train_accuracy,test_accuracy,bp,tn,fp,fn,tp


In [16]:
# parameters for GridSearch using Pipeline, formatted to call named estimators
mnb_params = {'mnb__alpha':np.arange(1.04,1.06,0.005), 'cv__max_features':np.arange(3450,3550,20)}

# steps defining pipeline sequence and fixed parameters for GridSearch
mnb_steps = [('cv',CountVectorizer(stop_words='english', ngram_range=(1,1))),
            ('mnb',MultinomialNB())]

In [17]:
pipe = Pipeline(mnb_steps) 


In [18]:
X_train_post = X_train['post_lm']
X_test_post = X_test['post_lm']

In [19]:
mnb_post_results = {} # empty dict to store results

grid = GridSearchCV(pipe, mnb_params, cv=5) # optimize GridSearch hyperparameters on `cv=5` cross validation runs
grid.fit(X_train_post, y_train) # fit to our training data

print('Train Accuracy: ',grid.score(X_train_post, y_train))
mnb_post_results['train_accuracy'] = grid.score(X_train_post, y_train) # print/store training accuracy

print('Test Accuracy: ',grid.score(X_test_post, y_test))
mnb_post_results['test_accuracy'] = grid.score(X_test_post, y_test) # print/store test accuracy

print('BP: ',grid.best_params_)
mnb_post_results['bp'] = grid.best_params_ # print/store best parameters

tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test_post)).ravel() # inspect counted results in matrix
print(f'True Negatives: {tn}')
mnb_post_results['tn'] = tn

print(f'False Positives: {fp}')
mnb_post_results['fp'] = fp

print(f'False Negatives: {fn}')
mnb_post_results['fn'] = fn

print(f'True Positives: {tp}', '\n')
mnb_post_results['tp'] = tp

Train Accuracy:  0.9618021547502449
Test Accuracy:  0.906158357771261
BP:  {'cv__max_features': 3490, 'mnb__alpha': 1.0499999999999998}
True Negatives: 191
False Positives: 21
False Negatives: 11
True Positives: 118 
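The best `mnb__alpha` is printed as `1.0499999999999998` rather than `1.05` because `np.arange` with a non-integer step produces values subject to floating-point rounding. The sketch below reproduces the same grid and shows two possible ways to get cleaner values. `np.linspace` and `np.round` are suggestions here, not what this notebook used.

```python
import numpy as np

# Same call as in mnb_params above
alphas_arange = np.arange(1.04, 1.06, 0.005)
print(alphas_arange.tolist())  # full-precision values, may show float artefacts

# Alternative (an assumption, not the notebook's choice):
# linspace fixes the number of grid points and includes the endpoint
alphas_linspace = np.linspace(1.04, 1.06, 5)

# Rounding gives clean values for reporting
alphas_rounded = np.round(alphas_arange, 3)
print(alphas_rounded.tolist())
```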



In [85]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
mnb_runs = pd.concat([mnb_runs, pd.DataFrame([mnb_post_results])], ignore_index=True)

In [86]:
mnb_runs.head()


Unnamed: 0,train_accuracy,test_accuracy,bp,tn,fp,fn,tp
0,0.962782,0.897361,"{'cv__max_features': 3500, 'mnb__alpha': 1.2}",190,22,13,116
1,0.96572,0.909091,"{'cv__max_features': 3800, 'mnb__alpha': 1.1}",193,19,12,117
2,0.961802,0.906158,"{'cv__max_features': 3500, 'mnb__alpha': 1.05}",191,21,11,118
3,0.961802,0.906158,"{'cv__max_features': 3490, 'mnb__alpha': 1.049...",191,21,11,118
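The same fit, score and confusion-matrix block is repeated for every model in this notebook. One way to reduce the repetition is a small helper that returns the tracked metrics as a dict. The sketch below is only an illustration: `summarise_grid` is a hypothetical name, and the toy corpus is made up rather than taken from the project data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def summarise_grid(grid, X_train, y_train, X_test, y_test):
    """Collect the metrics tracked in this notebook for one fitted GridSearchCV."""
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted
    tn, fp, fn, tp = confusion_matrix(
        y_test, grid.predict(X_test), labels=[0, 1]
    ).ravel()
    return {
        'train_accuracy': grid.score(X_train, y_train),
        'test_accuracy': grid.score(X_test, y_test),
        'bp': grid.best_params_,
        'tn': tn, 'fp': fp, 'fn': fn, 'tp': tp,
    }


# Toy corpus (made up for illustration, not the project data)
docs_train = ['xbox game pass', 'xbox series x', 'xbox controller', 'xbox live gold',
              'ps5 dualsense', 'ps5 console restock', 'ps5 launch games',
              'ps5 dualsense haptics']
labels_train = [0, 0, 0, 0, 1, 1, 1, 1]
docs_test = ['xbox series s', 'ps5 restock']
labels_test = [0, 1]

# Parameter names follow the '<step name>__<parameter>' convention used above
toy_pipe = Pipeline([('cv', CountVectorizer()), ('mnb', MultinomialNB())])
toy_grid = GridSearchCV(toy_pipe, {'mnb__alpha': [0.5, 1.0]}, cv=2)
toy_grid.fit(docs_train, labels_train)

toy_results = summarise_grid(toy_grid, docs_train, labels_train, docs_test, labels_test)
toy_runs = pd.DataFrame([toy_results])  # one row per run
print(toy_runs[['train_accuracy', 'test_accuracy', 'tn', 'fp', 'fn', 'tp']])
```

Each later run could then be added as one more row, which keeps the result columns consistent across models.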


## TF-IDF Random Forest


In [58]:
rf_runs = pd.DataFrame(columns=['train_accuracy','test_accuracy','bp','tn','fp','fn','tp'])

In [54]:
rf_runs.head()

Unnamed: 0,train_accuracy,test_accuracy,bp,tn,fp,fn,tp


In [55]:
rf_params = {"rf__n_estimators":np.arange(92,96,1), "rf__max_depth": np.arange(7,9,1), 
             "tf__max_features":[None,35000,40000,45000]}
rf_steps = [('tf',TfidfVectorizer(stop_words='english', ngram_range=(1,1))),
             ('rf',RandomForestClassifier(random_state=42))]
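As a rough sketch of how large this search is, `ParameterGrid` from scikit-learn can count the candidate combinations in `rf_params`. With `cv=5`, each candidate is fitted five times, and `GridSearchCV` performs one more fit on the full training set because `refit=True` is its default.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as rf_params above
rf_params_demo = {"rf__n_estimators": np.arange(92, 96, 1),
                  "rf__max_depth": np.arange(7, 9, 1),
                  "tf__max_features": [None, 35000, 40000, 45000]}

n_candidates = len(ParameterGrid(rf_params_demo))  # 4 * 2 * 4 combinations
n_cv_fits = n_candidates * 5                       # cv=5 folds per candidate
print(f"{n_candidates} candidates, {n_cv_fits} cross-validation fits (+1 refit)")
```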

In [56]:
pipe = Pipeline(rf_steps)

In [65]:
rf_post_results = {}

grid = GridSearchCV(pipe, rf_params, cv=5)
grid.fit(X_train_post, y_train)

print('Train Accuracy: ',grid.score(X_train_post, y_train))
rf_post_results['train_accuracy'] = grid.score(X_train_post, y_train)

print('Test Accuracy: ',grid.score(X_test_post, y_test))
rf_post_results['test_accuracy'] = grid.score(X_test_post, y_test)

print('BP: ',grid.best_params_)
rf_post_results['bp'] = grid.best_params_

tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test_post)).ravel()
print(f'True Negatives: {tn}')
rf_post_results['tn'] = tn

print(f'False Positives: {fp}')
rf_post_results['fp'] = fp

print(f'False Negatives: {fn}')
rf_post_results['fn'] = fn

print(f'True Positives: {tp}', '\n')
rf_post_results['tp'] = tp

Train Accuracy:  0.7806072477962782
Test Accuracy:  0.7390029325513197
BP:  {'rf__max_depth': 8, 'rf__n_estimators': 93, 'tf__max_features': None}
True Negatives: 212
False Positives: 0
False Negatives: 89
True Positives: 40 



In [60]:
rf_runs = pd.concat([rf_runs, pd.DataFrame([rf_post_results])], ignore_index=True)

In [61]:
rf_runs.head()

Unnamed: 0,train_accuracy,test_accuracy,bp,tn,fp,fn,tp
0,0.780607,0.739003,"{'rf__max_depth': 8, 'rf__n_estimators': 93, '...",212,0,89,40


## Count Vectorised K Nearest Neighbours


In [25]:
knn_runs = pd.DataFrame(columns=['train_accuracy','test_accuracy','bp','tn','fp','fn','tp'])


In [26]:
knn_params = {'knn__n_neighbors':np.arange(2,5,1)}
knn_steps = [('cv',CountVectorizer(stop_words='english', ngram_range=(1,1))),
            ('sc',StandardScaler(with_mean=False)),
            ('knn',KNeighborsClassifier())]

In [27]:
pipe = Pipeline(knn_steps)
knn_post_results = {}

grid = GridSearchCV(pipe, knn_params, cv=3)
grid.fit(X_train_post, y_train)

print('Train Accuracy: ',grid.score(X_train_post, y_train))
knn_post_results['train_accuracy'] = grid.score(X_train_post, y_train)

print('Test Accuracy: ',grid.score(X_test_post, y_test))
knn_post_results['test_accuracy'] = grid.score(X_test_post, y_test)

print('BP: ',grid.best_params_)
knn_post_results['bp'] = grid.best_params_

tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test_post)).ravel()
print(f'True Negatives: {tn}')
knn_post_results['tn'] = tn

print(f'False Positives: {fp}')
knn_post_results['fp'] = fp

print(f'False Negatives: {fn}')
knn_post_results['fn'] = fn

print(f'True Positives: {tp}', '\n')
knn_post_results['tp'] = tp

Train Accuracy:  0.791380999020568
Test Accuracy:  0.6744868035190615
BP:  {'knn__n_neighbors': 3}
True Negatives: 204
False Positives: 8
False Negatives: 103
True Positives: 26 



In [140]:
knn_runs = pd.concat([knn_runs, pd.DataFrame([knn_post_results])], ignore_index=True)

In [141]:
knn_runs.head()

Unnamed: 0,train_accuracy,test_accuracy,bp,tn,fp,fn,tp
0,0.791381,0.674487,{'knn__n_neighbors': 3},204,8,103,26
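The KNN pipeline uses `StandardScaler(with_mean=False)` because `CountVectorizer` returns a sparse matrix, and centring a sparse matrix would turn its zeros into non-zero values. The sketch below, on a made-up toy corpus, shows that the default scaler is rejected on sparse input while the `with_mean=False` variant keeps the output sparse.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Toy documents (made up for illustration)
docs = ['xbox series x load times', 'ps5 dualsense controller', 'xbox game pass']
counts = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix

# Centring is refused on sparse input, so the default scaler raises ValueError
try:
    StandardScaler().fit_transform(counts)
    centring_failed = False
except ValueError:
    centring_failed = True

# Scaling by the standard deviation only keeps the matrix sparse
scaled = StandardScaler(with_mean=False).fit_transform(counts)
print(centring_failed, type(scaled).__name__, scaled.shape)
```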


## Count Vectorised Logistic Regression


In [62]:
lr_runs = pd.DataFrame(columns=['train_accuracy','test_accuracy','bp','tn','fp','fn','tp'])


In [49]:
lr_params = {"lr__penalty":['l1'], "lr__C": [1.44,1.45,1.46,1.47,1.43],
             "lr__tol":[.001], "cv__max_features":[17000,16750,17250]}
lr_steps = [('cv',CountVectorizer(stop_words='english', ngram_range=(1,1))),
            ('sc',StandardScaler(with_mean=False)),
            ('lr',LogisticRegression(solver='saga',random_state=42,max_iter=9999,n_jobs=-1))]

In [50]:
pipe = Pipeline(lr_steps)
lr_post_results = {}

grid = GridSearchCV(pipe, lr_params, cv=3)
grid.fit(X_train_post, y_train)

print('Train Accuracy: ',grid.score(X_train_post, y_train))
lr_post_results['train_accuracy'] = grid.score(X_train_post, y_train)

print('Test Accuracy: ',grid.score(X_test_post, y_test))
lr_post_results['test_accuracy'] = grid.score(X_test_post, y_test)

print('BP: ',grid.best_params_)
lr_post_results['bp'] = grid.best_params_

tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test_post)).ravel()
print(f'True Negatives: {tn}')
lr_post_results['tn'] = tn

print(f'False Positives: {fp}')
lr_post_results['fp'] = fp

print(f'False Negatives: {fn}')
lr_post_results['fn'] = fn

print(f'True Positives: {tp}', '\n')
lr_post_results['tp'] = tp

Train Accuracy:  0.9980411361410382
Test Accuracy:  0.8563049853372434
BP:  {'cv__max_features': 17000, 'lr__C': 1.44, 'lr__penalty': 'l1', 'lr__tol': 0.001}
True Negatives: 196
False Positives: 16
False Negatives: 33
True Positives: 96 



In [63]:
lr_runs = pd.concat([lr_runs, pd.DataFrame([lr_post_results])], ignore_index=True)

In [64]:
lr_runs.head()

Unnamed: 0,train_accuracy,test_accuracy,bp,tn,fp,fn,tp
0,0.998041,0.856305,"{'cv__max_features': 17000, 'lr__C': 1.44, 'lr...",196,16,33,96


## Optimized Model Features

### Model 1: Multinomial Naive-Bayes

- Lemmatizer
- CountVectorizer
    - stop_words='english'
    - ngram_range=(1,1)
- GridSearch
    - cv__max_features=3490
    - mnb__alpha=1.05

### Model 2: Random Forest

- Lemmatizer
- TfidfVectorizer
    - stop_words='english'
    - ngram_range=(1,1)
- GridSearch
    - tf__max_features=None
    - rf__criterion='gini'
    - rf__n_estimators=93
    - rf__max_depth=8

### Model 3: KNN

- Lemmatizer
- CountVectorizer
    - stop_words='english'
    - ngram_range=(1,1)
- GridSearch
    - cv__max_features=None
    - knn__n_neighbors=3

### Model 4: Logistic Regression

- Lemmatizer
- CountVectorizer
    - stop_words='english'
    - ngram_range=(1,1)
- GridSearch
    - cv__max_features=1700
    - lr__C=1.44

With our model hyperparameters optimized through GridSearch, we build each desired model pipeline.

## Model 1 Optimized: Multinomial Naive Bayes

In [66]:
m1_steps = [('m1_cv',CountVectorizer(stop_words='english', ngram_range=(1,1), max_features=3490)),
           ('m1_mnb',MultinomialNB(alpha=1.05))]

In [67]:
pipe_1 = Pipeline(m1_steps)
pipe_1.fit(X_train.post_lm, y_train)


Pipeline(steps=[('m1_cv',
                 CountVectorizer(max_features=3490, stop_words='english')),
                ('m1_mnb', MultinomialNB(alpha=1.05))])

In [140]:
pred_proba = [i[1] for i in pipe_1.predict_proba(X_test.post_lm)]

pred_df = pd.DataFrame({'true_values': y_test,
                        'pred_probs':pred_proba})

In [141]:
roc_auc_score(pred_df['true_values'], pred_df['pred_probs'])

0.9749524645312271
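The list comprehension above keeps column 1 of `predict_proba`, which is the predicted probability of the positive class (`from_ps5 = 1`). A minimal sketch on synthetic data (an assumption for illustration) shows that an array slice gives the same values, and that the same probabilities can be passed to `roc_curve`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic one-feature data (an assumption for illustration)
rng = np.random.default_rng(42)
X_toy = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)]).reshape(-1, 1)
y_toy = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression().fit(X_toy, y_toy)

# Column 1 of predict_proba corresponds to clf.classes_[1], the positive class
proba_loop = [row[1] for row in clf.predict_proba(X_toy)]  # pattern used in this notebook
proba_slice = clf.predict_proba(X_toy)[:, 1]               # equivalent array slice

auc = roc_auc_score(y_toy, proba_slice)
fpr, tpr, thresholds = roc_curve(y_toy, proba_slice)
print(f"ROC AUC on toy data: {auc:.3f}")
```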

In [69]:
pipe_1.score(X_test.post_lm, y_test)


0.906158357771261

In [70]:
tn, fp, fn, tp = confusion_matrix(y_test, pipe_1.predict(X_test.post_lm)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print("\nAccuracy: ", (tn + tp) / (tn + fp + fn + tp))
print("Sensitivity: ", tp / (tp + fn))
print("Specificity: ", tn / (tn + fp))
print("Precision: ", tp / (tp + fp))

True Negatives: 191
False Positives: 21
False Negatives: 11
True Positives: 118

Accuracy:  0.906158357771261
Sensitivity:  0.9147286821705426
Specificity:  0.9009433962264151
Precision:  0.8489208633093526
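The hand-computed ratios above can be cross-checked against scikit-learn's metric functions. The sketch below rebuilds label arrays from the four confusion-matrix counts reported for Model 1 and applies `accuracy_score`, `recall_score` and `precision_score`. Specificity is obtained as the recall of class 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Confusion-matrix counts reported above for Model 1
tn, fp, fn, tp = 191, 21, 11, 118

# Rebuild label arrays that reproduce those counts
y_true_demo = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

accuracy = accuracy_score(y_true_demo, y_pred_demo)
sensitivity = recall_score(y_true_demo, y_pred_demo)               # tp / (tp + fn)
specificity = recall_score(y_true_demo, y_pred_demo, pos_label=0)  # tn / (tn + fp)
precision = precision_score(y_true_demo, y_pred_demo)              # tp / (tp + fp)
print(accuracy, sensitivity, specificity, precision)
```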


## Model 2 Optimized: TF-IDF Random Forest

In [71]:
m2_steps = [('m2_tf',TfidfVectorizer(stop_words='english', ngram_range=(1,1))),
           ('m2_rf',RandomForestClassifier(criterion='gini', n_estimators=93, max_depth=8))]

In [146]:
pipe_2 = Pipeline(m2_steps)
pipe_2.fit(X_train.post_lm, y_train)

Pipeline(steps=[('m2_tf', TfidfVectorizer(stop_words='english')),
                ('m2_rf',
                 RandomForestClassifier(max_depth=8, n_estimators=93))])

In [147]:
pred_proba = [i[1] for i in pipe_2.predict_proba(X_test.post_lm)]

pred_df = pd.DataFrame({'true_values': y_test,
                        'pred_probs':pred_proba})

In [148]:
roc_auc_score(pred_df['true_values'], pred_df['pred_probs'])

0.9694676027497441

In [73]:
pipe_2.score(X_test.post_lm, y_test)

0.8064516129032258

In [74]:
tn, fp, fn, tp = confusion_matrix(y_test, pipe_2.predict(X_test.post_lm)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print("\nAccuracy: ", (tn + tp) / (tn + fp + fn + tp))
print("Sensitivity: ", tp / (tp + fn))
print("Specificity: ", tn / (tn + fp))
print("Precision: ", tp / (tp + fp))

True Negatives: 212
False Positives: 0
False Negatives: 66
True Positives: 63

Accuracy:  0.8064516129032258
Sensitivity:  0.4883720930232558
Specificity:  1.0
Precision:  1.0


## Model 3 Optimized: K Nearest Neighbours

In [109]:
m3_steps = [('m3_cv',CountVectorizer(stop_words='english', ngram_range=(1,1), max_features=None)),
            ('m3_sc',StandardScaler(with_mean=False)),
            ('m3_knn',KNeighborsClassifier(n_neighbors=3))]

In [149]:
pipe_3 = Pipeline(m3_steps)
pipe_3.fit(X_train.post_lm, y_train)

Pipeline(steps=[('m3_cv', CountVectorizer(stop_words='english')),
                ('m3_sc', StandardScaler(with_mean=False)),
                ('m3_knn', KNeighborsClassifier(n_neighbors=3))])

In [150]:
pred_proba = [i[1] for i in pipe_3.predict_proba(X_test.post_lm)]

pred_df = pd.DataFrame({'true_values': y_test,
                        'pred_probs':pred_proba})

In [151]:
roc_auc_score(pred_df['true_values'], pred_df['pred_probs'])

0.7144032470381747

In [111]:
pipe_3.score(X_test.post_lm, y_test)

0.6744868035190615

In [112]:
tn, fp, fn, tp = confusion_matrix(y_test, pipe_3.predict(X_test.post_lm)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print("\nAccuracy: ", (tn + tp) / (tn + fp + fn + tp))
print("Sensitivity: ", tp / (tp + fn))
print("Specificity: ", tn / (tn + fp))
print("Precision: ", tp / (tp + fp))

True Negatives: 204
False Positives: 8
False Negatives: 103
True Positives: 26

Accuracy:  0.6744868035190615
Sensitivity:  0.20155038759689922
Specificity:  0.9622641509433962
Precision:  0.7647058823529411


## Model 4 Optimized: Logistic Regression

In [100]:
# Note: max_features=1700 here, whereas the grid search above selected cv__max_features=17000
m4_steps = [('m4_cv',CountVectorizer(stop_words='english', ngram_range=(1,1), max_features=1700)),
            ('m4_sc',StandardScaler(with_mean=False)),
            ('m4_lr',LogisticRegression(penalty='l1',solver='saga', C=1.44, tol=.001,max_iter=9999))]

In [152]:
pipe_4 = Pipeline(m4_steps)
pipe_4.fit(X_train.post_lm, y_train)

Pipeline(steps=[('m4_cv',
                 CountVectorizer(max_features=1700, stop_words='english')),
                ('m4_sc', StandardScaler(with_mean=False)),
                ('m4_lr',
                 LogisticRegression(C=1.44, max_iter=9999, penalty='l1',
                                    solver='saga', tol=0.001))])

In [153]:
pred_proba = [i[1] for i in pipe_4.predict_proba(X_test.post_lm)]

pred_df = pd.DataFrame({'true_values': y_test,
                        'pred_probs':pred_proba})

In [154]:
roc_auc_score(pred_df['true_values'], pred_df['pred_probs'])

0.9323533713617084

In [102]:
pipe_4.score(X_test.post_lm, y_test)

0.8709677419354839

In [78]:
tn, fp, fn, tp = confusion_matrix(y_test, pipe_4.predict(X_test.post_lm)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print("\nAccuracy: ", (tn + tp) / (tn + fp + fn + tp))
print("Sensitivity: ", tp / (tp + fn))
print("Specificity: ", tn / (tn + fp))
print("Precision: ", tp / (tp + fp))

True Negatives: 193
False Positives: 19
False Negatives: 25
True Positives: 104

Accuracy:  0.8709677419354839
Sensitivity:  0.8062015503875969
Specificity:  0.910377358490566
Precision:  0.8455284552845529


## Optimized Model Features

### Model 1: Multinomial Naive-Bayes

- Lemmatizer
- CountVectorizer
    - stop_words='english'
    - ngram_range=(1,1)
- GridSearch
    - cv__max_features=3490
    - mnb__alpha=1.05

### Model 2: Random Forest

- Lemmatizer
- TfidfVectorizer
    - stop_words='english'
    - ngram_range=(1,1)
- GridSearch
    - tf__max_features=None
    - rf__criterion='gini'
    - rf__n_estimators=93
    - rf__max_depth=8

### Model 3: KNN

- Lemmatizer
- CountVectorizer
    - stop_words='english'
    - ngram_range=(1,1)
- GridSearch
    - cv__max_features=None
    - knn__n_neighbors=3

### Model 4: Logistic Regression

- Lemmatizer
- CountVectorizer
    - stop_words='english'
    - ngram_range=(1,1)
- GridSearch
    - cv__max_features=1700
    - lr__C=1.44
    
The test-set scores of the optimised models are shown below:

|Model|Test-accuracy|ROC AUC score|
|---|---|---|
|Multinomial Naive-Bayes|0.91|0.97|
|Random Forest|0.81|0.97|
|KNN|0.67|0.71|
|Logistic Regression|0.87|0.93|

The best-performing model is Model 1, the Multinomial Naive-Bayes model, which has the highest test accuracy (0.91) and a ROC AUC of 0.97.
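The comparison can also be kept as a small DataFrame, which makes the ranking reproducible. The sketch below copies the test accuracy and ROC AUC values printed for the four optimised pipelines and sorts by accuracy, then by ROC AUC. The column names are chosen here for illustration.

```python
import pandas as pd

# Scores copied from the optimised pipelines in this notebook
scores = pd.DataFrame({
    'model': ['Multinomial Naive-Bayes', 'Random Forest', 'KNN', 'Logistic Regression'],
    'test_accuracy': [0.906158357771261, 0.8064516129032258,
                      0.6744868035190615, 0.8709677419354839],
    'roc_auc': [0.9749524645312271, 0.9694676027497441,
                0.7144032470381747, 0.9323533713617084],
})

# Rank by test accuracy first, ROC AUC second
ranked = scores.sort_values(['test_accuracy', 'roc_auc'],
                            ascending=False).reset_index(drop=True)
best_model = ranked.loc[0, 'model']
print(ranked.round(2))
```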

### Continue to Notebook 5: Model Testing, Data Interpretation, Conclusion