# Project 3: Natural Language Processing and Classification

Benjamin Chee, DSI-SG-17

Classifying posts from r/xboxone and r/PS5

# Notebook 3: Model Selection

This notebook contains the code used to train and compare classification models on our prepared data.

The following were used:
- Multinomial Naive Bayes
- K-Nearest Neighbors
- Logistic Regression Classifier
- Random Forest

Two vectorisation methods were used:
- CountVectorizer
- TF-IDF

GridSearchCV was then used to optimise each model.

Contents:
- GridSearch - CountVectorizer
- GridSearch - TF-IDF


## Libraries

In [20]:
import datetime
import time
import re
import pandas as pd
import numpy as np
from tqdm import tqdm

# general scikitlearn imports
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

#NLP
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# metrics
from sklearn.metrics import confusion_matrix, accuracy_score

In [2]:
#initialise date time
date_run = datetime.datetime.now()
date= date_run.date()

In [177]:
#reading output from notebook 1
df_pre=pd.read_csv('./csv/df_pre_2020-10-01.csv')

In [178]:
df_pre.dropna(inplace=True)

In [183]:
len(df_pre)

1362

### Train-Test Split

In [56]:
X = df_pre[['post_st','post_lm']]
y = df_pre['from_ps5']

In [145]:
df_pre.head(30)

Unnamed: 0,post_st,post_lm,from_ps5
0,tech weekli xbox one tech support thi is the t...,tech weekly xbox one tech support this is the ...,0
1,gta iv one of my fav game ever nearli a lock ...,gta iv one of my fav game ever nearly a locked...,0
2,more seri x load time comparison,more series x load time comparison,0
3,digit foundri xbox seri x backward compat test...,digital foundry xbox series x backwards compat...,0
4,do you rememb when thi pictur blew our mind,do you remember when this picture blew our mind,0
5,an entir game gener ha gone and we have not ha...,an entire gaming generation ha gone and we hav...,0
6,did anyon els notic the live wallpap on the se...,did anyone else notice the live wallpaper on t...,0
7,year of play xbox ha brought me to thi point ...,year of playing xbox ha brought me to this po...,0
8,xsx is the most quiet xbox ever i'll conclud t...,xsx is the most quiet xbox ever i'll conclude ...,0
9,each gamer score you get on oct will convert...,each gamer score you get on oct will convert...,0


In [57]:
X.isnull().sum()

post_st    0
post_lm    0
dtype: int64

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [59]:
len(X_train)

1021

In [60]:
len(y_train)

1021
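
The split above passes `stratify=y`, which keeps the class proportions the same in the training and test sets. The sketch below illustrates this on hypothetical labels (`X_demo` and `y_demo` are made up for the example and are not the project data), relying on the default `test_size` of 0.25.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 zeros and 40 ones (illustration only)
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# With the default test_size of 0.25, 75 rows go to training and 25 to testing
train_share = y_tr.mean()   # share of class 1 in the training labels
test_share = y_te.mean()    # share of class 1 in the test labels
print(len(y_tr), len(y_te))
print(train_share, test_share)
```

Without `stratify`, a random split of a small dataset can leave the two sets with noticeably different class balances, which makes test accuracy harder to compare against a baseline.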

### GridSearchCV
The GridSearchCV tool lets us specify several candidate values for each hyperparameter. It fits a model for every combination of those values using cross-validation and selects the highest-scoring one.

We will run a single model for each of the following 4 classifiers:

- Multinomial Naive Bayes
- K-Nearest Neighbors
- Logistic Regression
- Random Forest

We will run two GridSearches to benchmark these models with two feature extraction techniques: CountVectorizer and TfidfVectorizer. We can use the accuracy of the results to narrow our model selection to the most effective approaches.

As these models execute, the results will be displayed, then stored into a DataFrame for final comparison.
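
As a minimal sketch of the mechanism described above, the example below runs a grid search over a small pipeline. The `texts` and `labels` values are a hypothetical toy corpus, not the project data; the pipeline step names and the parameter grid mirror the ones used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = PS5-style post, 0 = Xbox-style post
texts = [
    "dualsense controller haptic feedback", "playstation exclusive game trailer",
    "ps5 console restock today", "dualsense battery life playstation",
    "playstation store sale ps5", "ps5 exclusive trailer dualsense",
    "game pass new title xbox", "xbox series x load time",
    "xbox controller elite series", "game pass ultimate xbox deal",
    "series x backward compatibility xbox", "xbox live gold game pass",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipe = Pipeline([("cv", CountVectorizer()), ("multi_nb", MultinomialNB())])

# Keys use the "<step name>__<parameter>" convention to reach into the pipeline
param_grid = {"cv__ngram_range": [(1, 1), (1, 2)], "cv__stop_words": ["english"]}

grid = GridSearchCV(pipe, param_grid, cv=3)   # 3-fold cross-validation per combination
grid.fit(texts, labels)

n_candidates = len(grid.cv_results_["params"])   # number of combinations tried
print(n_candidates)
print(grid.best_params_)
print(grid.best_score_)                          # mean cross-validated accuracy of the best combination
```

By default GridSearchCV refits the best combination on the full training data, so later calls to `grid.score` and `grid.predict` use that refitted pipeline.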

In [215]:
# list of pipeline steps for each model combo
steps_list_gr_cv = [ 
    [('cv',CountVectorizer()),('multi_nb',MultinomialNB())],
    [('cv',CountVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('cv',CountVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())],
    [('cv',CountVectorizer()),('rf',RandomForestClassifier())]

]
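
The KNN and logistic regression pipelines above use `StandardScaler(with_mean=False)`. A short sketch of why, using hypothetical toy posts: the vectorizers return sparse matrices, and scikit-learn does not center sparse input because subtracting the mean would make almost every entry non-zero.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

docs = ["xbox series x", "ps5 console", "xbox game pass"]  # hypothetical toy posts
counts = CountVectorizer().fit_transform(docs)             # SciPy sparse matrix

# The default scaler tries to center the data, which is rejected for sparse input
try:
    StandardScaler().fit_transform(counts)
    centering_failed = False
except ValueError:
    centering_failed = True

# with_mean=False only divides each column by its standard deviation,
# so zeros stay zero and the matrix stays sparse
scaled = StandardScaler(with_mean=False).fit_transform(counts)
print(centering_failed, scaled.shape, scaled.nnz == counts.nnz)
```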

In [197]:
steps_titles = ['multi_nb','knn','logreg','rf']

In [219]:
pipe_params_cv = [
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]}
]

In [255]:
# instantiate results DataFrame
grid_results = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','tn','fp','fn','tp'])
grid_results.head()

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,tn,fp,fn,tp


In [217]:
X_train_pre_post = X_train['post_lm']
X_test_pre_post = X_test['post_lm']

In [260]:
for i in tqdm(range(len(steps_list_gr_cv))):           # loop over each model's pipeline, with a progress bar
    pipe = Pipeline(steps=steps_list_gr_cv[i])         # configure pipeline for each model
    grid = GridSearchCV(pipe, pipe_params_cv[i], cv=3) # set up a 3-fold grid search over the model's params

    model_results = {}

    grid.fit(X_train_pre_post, y_train)
    
    print('Model: ',steps_titles[i])
    model_results['model'] = steps_titles[i]

    print('Best Params: ', grid.best_params_)
    model_results['best_params'] = grid.best_params_

    # accuracy of the refitted best estimator on the training set
    train_accuracy = grid.score(X_train_pre_post, y_train)
    print(train_accuracy, '\n')
    model_results['train_accuracy'] = train_accuracy
    
    # accuracy of the refitted best estimator on the test set
    test_accuracy = grid.score(X_test_pre_post, y_test)
    print(test_accuracy, '\n')
    model_results['test_accuracy'] = test_accuracy

    # Display the confusion matrix counts in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test_pre_post)).ravel() 
    print(f'True Negatives: {tn}')
    model_results['tn'] = tn

    print(f'False Positives: {fp}')
    model_results['fp'] = fp

    print(f'False Negatives: {fn}')
    model_results['fn'] = fn

    print(f'True Positives: {tp}', '\n')
    model_results['tp'] = tp
    
    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    grid_results = pd.concat([grid_results, pd.DataFrame([model_results])], ignore_index=True)

Model:  multi_nb
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.9755142017629774 

0.9149560117302052 

True Negatives: 196
False Positives: 16
False Negatives: 13
True Positives: 116 

Model:  knn
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.7267384916748286 

0.6686217008797654 

True Negatives: 209
False Positives: 3
False Negatives: 110
True Positives: 19 

Model:  logreg
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
1.0 

0.8269794721407625 

True Negatives: 199
False Positives: 13
False Negatives: 46
True Positives: 83 

Model:  rf
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
1.0 

0.906158357771261 

True Negatives: 209
False Positives: 3
False Negatives: 29
True Positives: 100 

100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.21s/it]

In [261]:
grid_results_cv = grid_results


In [262]:
grid_results.sort_values('test_accuracy',ascending=False)


Unnamed: 0,model,best_params,train_accuracy,test_accuracy,tn,fp,fn,tp
0,multi_nb,"{'cv__ngram_range': (1, 1), 'cv__stop_words': ...",0.975514,0.914956,196,16,13,116
3,rf,"{'cv__ngram_range': (1, 1), 'cv__stop_words': ...",1.0,0.906158,209,3,29,100
2,logreg,"{'cv__ngram_range': (1, 1), 'cv__stop_words': ...",1.0,0.826979,199,13,46,83
1,knn,"{'cv__ngram_range': (1, 1), 'cv__stop_words': ...",0.726738,0.668622,209,3,110,19
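
The `tn`, `fp`, `fn` and `tp` columns come from unpacking `confusion_matrix(...).ravel()`. The sketch below, on hypothetical labels and predictions, shows the order in which scikit-learn returns the four counts for binary labels and how accuracy follows from them.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = r/PS5, 0 = r/xboxone)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels [0, 1] the matrix is [[tn, fp], [fn, tp]],
# so ravel() yields the four counts in exactly that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tn + tp) / (tn + fp + fn + tp)
print(tn, fp, fn, tp, accuracy)
```

Applying the same formula to the Multinomial Naive Bayes row in the table above gives (196 + 116) / 341 ≈ 0.915, which matches its `test_accuracy`.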


### TF-IDF
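
TfidfVectorizer differs from CountVectorizer in that it down-weights words that appear in many documents. A minimal sketch on hypothetical toy posts (not the project data), using the default settings of both vectorizers:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy posts: "console" appears in every post, "ps5" in only one
docs = ["xbox console", "ps5 console", "xbox console console"]

cv = CountVectorizer()
tf = TfidfVectorizer()
count_matrix = cv.fit_transform(docs).toarray()
tfidf_matrix = tf.fit_transform(docs).toarray()

col = cv.vocabulary_   # word -> column index; both vectorizers learn the same vocabulary here

# Row 1 is the post "ps5 console": equal raw counts, unequal TF-IDF weights
print(count_matrix[1, col["ps5"]], count_matrix[1, col["console"]])
print(tfidf_matrix[1, col["ps5"]], tfidf_matrix[1, col["console"]])
```

In the second post both words occur once, so their counts are equal, but the rarer word "ps5" receives the larger TF-IDF weight.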

In [201]:
steps_list_gr_tf = [ # list of pipeline steps for each model combo
    [('tf',TfidfVectorizer()),('multi_nb',MultinomialNB())],
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())],
    [('tf',TfidfVectorizer()),('rf',RandomForestClassifier())]
]

In [199]:
steps_titles = ['multi_nb','knn','logreg','rf']


In [200]:
pipe_params_tf = [
    {'tf__stop_words':['english'], 'tf__ngram_range':[(1,1),(1,2)]},
    {'tf__stop_words':['english'], 'tf__ngram_range':[(1,1),(1,2)]},
    {'tf__stop_words':['english'], 'tf__ngram_range':[(1,1),(1,2)]},
    {'tf__stop_words':['english'], 'tf__ngram_range':[(1,1),(1,2)]}
]

In [263]:
# instantiate results DataFrame
grid_results = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','tn','fp','fn','tp'])
grid_results.head()

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,tn,fp,fn,tp


In [203]:
X_train_pre_post = X_train['post_lm']
X_test_pre_post = X_test['post_lm']

In [264]:
for i in tqdm(range(len(steps_list_gr_tf))):           # loop over each model's pipeline, with a progress bar
    pipe = Pipeline(steps=steps_list_gr_tf[i])         # configure pipeline for each model
    grid = GridSearchCV(pipe, pipe_params_tf[i], cv=3) # set up a 3-fold grid search over the model's params

    model_results = {}

    grid.fit(X_train_pre_post, y_train)
    
    print('Model: ',steps_titles[i])
    model_results['model'] = steps_titles[i]

    print('Best Params: ', grid.best_params_)
    model_results['best_params'] = grid.best_params_

    # accuracy of the refitted best estimator on the training set
    train_accuracy = grid.score(X_train_pre_post, y_train)
    print(train_accuracy, '\n')
    model_results['train_accuracy'] = train_accuracy
    
    # accuracy of the refitted best estimator on the test set
    test_accuracy = grid.score(X_test_pre_post, y_test)
    print(test_accuracy, '\n')
    model_results['test_accuracy'] = test_accuracy

    # Display the confusion matrix counts in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test_pre_post)).ravel()
    print(f'True Negatives: {tn}')
    model_results['tn'] = tn

    print(f'False Positives: {fp}')
    model_results['fp'] = fp

    print(f'False Negatives: {fn}')
    model_results['fn'] = fn

    print(f'True Positives: {tp}', '\n')
    model_results['tp'] = tp

    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    grid_results = pd.concat([grid_results, pd.DataFrame([model_results])], ignore_index=True)

Model:  multi_nb
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.9324191968658179 

0.8211143695014663 

True Negatives: 211
False Positives: 1
False Negatives: 60
True Positives: 69 

Model:  knn
Best Params:  {'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}
0.6199804113614104 

0.6217008797653959 

True Negatives: 212
False Positives: 0
False Negatives: 129
True Positives: 0 

Model:  logreg
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
1.0 

0.841642228739003 

True Negatives: 198
False Positives: 14
False Negatives: 40
True Positives: 89 

Model:  rf
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
1.0 

0.9178885630498533 

True Negatives: 211
False Positives: 1
False Negatives: 27
True Positives: 102 

100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.20s/it]

In [265]:
grid_results_tf = grid_results


In [266]:
grid_results.sort_values('test_accuracy',ascending=False)

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,tn,fp,fn,tp
3,rf,"{'tf__ngram_range': (1, 1), 'tf__stop_words': ...",1.0,0.917889,211,1,27,102
2,logreg,"{'tf__ngram_range': (1, 1), 'tf__stop_words': ...",1.0,0.841642,198,14,40,89
0,multi_nb,"{'tf__ngram_range': (1, 1), 'tf__stop_words': ...",0.932419,0.821114,211,1,60,69
1,knn,"{'tf__ngram_range': (1, 2), 'tf__stop_words': ...",0.61998,0.621701,212,0,129,0


### Results assessment
We add columns for the gap between train and test accuracy scores. This tells us how much overfitting may be present in each model.


The baseline accuracy is the accuracy we would get by always predicting the majority class, which here is from_ps5=0. Normalising the value counts of the training target shows a baseline accuracy of 62.0%.

In [176]:
y_train.value_counts(normalize=True)

0    0.61998
1    0.38002
Name: from_ps5, dtype: float64
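
A minimal sketch of how a majority-class baseline can be read off normalised value counts. The `y_demo` series is hypothetical and only mimics the 62/38 split shown above.

```python
import pandas as pd

# Hypothetical labels with roughly the same class split as y_train (illustration only)
y_demo = pd.Series([0] * 62 + [1] * 38, name="from_ps5")

shares = y_demo.value_counts(normalize=True)
majority_class = shares.idxmax()     # label a constant classifier would always predict
baseline_accuracy = shares.max()     # accuracy of that constant prediction
print(majority_class, baseline_accuracy)
```

A model is only useful if it beats this figure. The TF-IDF KNN model, with tp=0 and a test accuracy of 0.6217, is effectively predicting from_ps5=0 for every post.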

In [267]:
#comparing train vs test score to assess generalisability
grid_results_tf['tt_delta'] = grid_results_tf['train_accuracy'] - grid_results_tf['test_accuracy']
grid_results_cv['tt_delta'] = grid_results_cv['train_accuracy'] - grid_results_cv['test_accuracy']

In [268]:
#comparing vs baseline (share of the majority class, from_ps5=0)
baseline_accuracy = y_train.value_counts(normalize=True).max()
grid_results_tf['ba_delta'] = grid_results_tf['test_accuracy'] - baseline_accuracy
grid_results_cv['ba_delta'] = grid_results_cv['test_accuracy'] - baseline_accuracy


In [269]:
grid_results_cv.sort_values('test_accuracy',ascending=False)


Unnamed: 0,model,best_params,train_accuracy,test_accuracy,tn,fp,fn,tp,tt_delta,ba_delta
0,multi_nb,"{'cv__ngram_range': (1, 1), 'cv__stop_words': ...",0.975514,0.914956,196,16,13,116,0.060558,0.294976
3,rf,"{'cv__ngram_range': (1, 1), 'cv__stop_words': ...",1.0,0.906158,209,3,29,100,0.093842,0.286178
2,logreg,"{'cv__ngram_range': (1, 1), 'cv__stop_words': ...",1.0,0.826979,199,13,46,83,0.173021,0.206999
1,knn,"{'cv__ngram_range': (1, 1), 'cv__stop_words': ...",0.726738,0.668622,209,3,110,19,0.058117,0.048641


In [270]:
grid_results_tf.sort_values('test_accuracy',ascending=False)

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,tn,fp,fn,tp,tt_delta,ba_delta
3,rf,"{'tf__ngram_range': (1, 1), 'tf__stop_words': ...",1.0,0.917889,211,1,27,102,0.082111,0.297908
2,logreg,"{'tf__ngram_range': (1, 1), 'tf__stop_words': ...",1.0,0.841642,198,14,40,89,0.158358,0.221662
0,multi_nb,"{'tf__ngram_range': (1, 1), 'tf__stop_words': ...",0.932419,0.821114,211,1,60,69,0.111305,0.201134
1,knn,"{'tf__ngram_range': (1, 2), 'tf__stop_words': ...",0.61998,0.621701,212,0,129,0,-0.00172,0.00172


Looking at model types, the CountVectorizer Multinomial Naive Bayes (test accuracy 0.915) and the TF-IDF Random Forest (test accuracy 0.918) performed best on this initial run. We select these two, together with the CountVectorizer K-Nearest Neighbors and Logistic Regression models listed below, and will continue to optimise each of them.

Model Selections:

1. Lemmatized CountVectorizer Multinomial Naive-Bayes

    - cv__ngram_range=(1,1)

    - cv__stop_words='english'


2. Lemmatized TF-IDF Random Forest

    - tf__ngram_range=(1,1)

    - tf__stop_words='english'


3. Lemmatized CountVectorizer K nearest neighbours

    - cv__ngram_range=(1,1)

    - cv__stop_words='english'


4. Lemmatized CountVectorizer Logistic Regression

    - cv__ngram_range=(1,1)

    - cv__stop_words='english'



### Continue to Notebook 4: Model Optimisation