# Project 3 Part 3 - Feature Selection and Modelling

This notebook will cover:

- Selection of the X and y features
- Modelling using various model types:
    - K-Nearest Neighbours
    - Random Forest
    - Multinomial Naive Bayes
    - Support Vector Machine
    - Logistic Regression

In [1]:
# import the necessary libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, accuracy_score, plot_roc_curve, roc_auc_score, recall_score, precision_score, f1_score
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn import svm
from tabulate import tabulate

In [2]:
# import the dataset
onion_or_not = pd.read_csv('../Datasets/onion_or_not.csv')
print(onion_or_not.shape)
onion_or_not.head()

(10000, 8)


Unnamed: 0,subreddit,title,created_utc,title_length,title_word_count,new_text,newtext_length,newtext_word_count
0,nottheonion,Woman self-isolates in plane toilet for five h...,1640966006,101,14,woman self isolates in plane toilet for five h...,98,16
1,nottheonion,Humanity's Final Arms Race: UN Fails to Agree ...,1640965471,67,12,humanity s final arm race un fails to agree on...,63,13
2,nottheonion,Prince Andrew accuser seeks evidence he could ...,1640963938,55,9,prince andrew accuser seek evidence he could n...,54,9
3,nottheonion,Upset over their grievances not being addresse...,1640963308,220,34,upset over their grievance not being addressed...,207,33
4,nottheonion,Thousands of coma patients may be conscious bu...,1640963023,99,14,thousand of coma patient may be conscious but ...,95,15


In [14]:
onion_or_not[onion_or_not.title.str.contains('&amp;')]

Unnamed: 0,subreddit,title,created_utc,title_length,title_word_count,new_text,newtext_length,newtext_word_count
78,nottheonion,Capital Hospital Islamabad Jobs 2022 for Medic...,1640887476,176,24,capital hospital islamabad job 2022 for medica...,158,23
363,nottheonion,Dad goes to Indian restaurant on Xmas Day &amp...,1640680167,82,13,dad go to indian restaurant on xmas day amp co...,76,14
369,nottheonion,Santa kicked out of AT&amp;T Stadium after put...,1640672450,74,12,santa kicked out of at amp t stadium after put...,73,14
606,nottheonion,Logan Paul claims people think Jake Paul is th...,1640373611,91,15,logan paul claim people think jake paul is thi...,88,16
621,nottheonion,Cookies &amp; Cream She’s Eating Herself Out,1640365535,44,7,cooky amp cream she s eating herself out,40,8
...,...,...,...,...,...,...,...,...
6841,TheOnion,"Gamers, We Just Spent 4 Days Trapped In A Roll...",1593025130,172,29,gamers we just spent 4 day trapped in a rolled...,165,32
7844,TheOnion,"Bad News, Gamers! ‘Mario &amp; Sonic At The Ol...",1572992691,91,17,bad news gamers mario amp sonic at the olympic...,84,17
8290,TheOnion,Sable &amp; Rosenfeld Launches Ad Campaign Reb...,1565132300,89,13,sable amp rosenfeld launch ad campaign rebrand...,83,13
9161,TheOnion,Man Entering Fog Of Insanity Asked If This His...,1553781556,80,15,man entering fog of insanity asked if this his...,78,16


In [12]:
onion_or_not['title'].iloc[9161]

'Man Entering Fog Of Insanity Asked If This His First Time At Dave &amp; Buster’s'

In [25]:
onion_or_not['test'] = onion_or_not.title.apply(lambda x: x.replace("&amp; ", ""))

In [26]:
onion_or_not['test'].iloc[9161]

'Man Entering Fog Of Insanity Asked If This His First Time At Dave Buster’s'
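
The `replace` call above removes the ampersand entirely, so "Dave &amp; Buster’s" becomes "Dave Buster’s". As an alternative sketch (not what this notebook does), the standard-library `html.unescape` decodes the entity back to a literal `&`. The variable names here are illustrative only.

```python
import html

# Example title copied from the output above
raw_title = "Man Entering Fog Of Insanity Asked If This His First Time At Dave &amp; Buster’s"

# html.unescape converts HTML entities such as &amp; back to their characters
decoded_title = html.unescape(raw_title)
print(decoded_title)
```

On a DataFrame column the same function could be applied with `onion_or_not['title'].apply(html.unescape)`.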

In [31]:
onion_or_not[onion_or_not.title.str.contains('The Onion')]

Unnamed: 0,subreddit,title,created_utc,title_length,title_word_count,new_text,newtext_length,newtext_word_count,test
1817,TheOnion,Look at The Onion,1639613288,17,4,look at the onion,17,4,Look at The Onion
4304,TheOnion,The Onion Reviews 'Licorice Pizza',1637771822,34,5,the onion review licorice pizza,31,5,The Onion Reviews 'Licorice Pizza'
4478,nottheonion,The Onion on Not The Onion,1637680099,26,6,the onion on not the onion,26,6,The Onion on Not The Onion
5307,TheOnion,‘The Onion’ Accidentally Sent Our Sex Columnis...,1634006860,69,11,the onion accidentally sent our sex columnist ...,67,11,‘The Onion’ Accidentally Sent Our Sex Columnis...
5323,TheOnion,"The Onion, FDA Commissioner: I Give up on you ...",1633208339,50,10,the onion fda commissioner i give up on you pig,47,10,"The Onion, FDA Commissioner: I Give up on you ..."
...,...,...,...,...,...,...,...,...,...
8951,TheOnion,The Onion’s Legal Analysts Have Completed Thei...,1555625847,106,18,the onion s legal analyst have completed their...,104,19,The Onion’s Legal Analysts Have Completed Thei...
9101,TheOnion,The Onion Reviews ‘Pet Sematary’,1554393665,32,5,the onion review pet sematary,29,5,The Onion Reviews ‘Pet Sematary’
9270,TheOnion,The Onion Looks Back At 'Back To The Future',1552474341,44,9,the onion look back at back to the future,41,9,The Onion Looks Back At 'Back To The Future'
9495,TheOnion,The Onion is dying,1550471127,18,4,the onion is dying,18,4,The Onion is dying


In [32]:
onion_or_not[onion_or_not.title.str.contains('TheOnion')]

Unnamed: 0,subreddit,title,created_utc,title_length,title_word_count,new_text,newtext_length,newtext_word_count,test
884,nottheonion,Best of r/NotTheOnion 2021: Nominations now open!,1640168032,49,7,best of r nottheonion 2021 nomination now open,46,8,Best of r/NotTheOnion 2021: Nominations now open!
6450,TheOnion,Tell me this News report doesn’t remind you of...,1602808974,55,10,tell me this news report doesn t remind you of...,55,11,Tell me this News report doesn’t remind you of...
7224,TheOnion,TheOnion.com Has Been Designated As A Pandemic...,1585171615,115,18,theonion com ha been designated a a pandemic s...,113,20,TheOnion.com Has Been Designated As A Pandemic...


In [33]:
onion_or_not[onion_or_not.title.str.contains('theonion')]

Unnamed: 0,subreddit,title,created_utc,title_length,title_word_count,new_text,newtext_length,newtext_word_count,test
2284,nottheonion,/r/nottheonion being renamed to /r/stupidmods,1639232070,45,5,r nottheonion being renamed to r stupidmods,43,7,/r/nottheonion being renamed to /r/stupidmods
4024,TheOnion,“Nottheonion” mods totally aren’t snowflake to...,1637936387,69,10,nottheonion mod totally aren t snowflake toddl...,64,11,“Nottheonion” mods totally aren’t snowflake to...
4026,TheOnion,“Nottheonion” mods are totally not pent up but...,1637936279,93,17,nottheonion mod are totally not pent up butt h...,84,16,“Nottheonion” mods are totally not pent up but...
8933,TheOnion,Guess If a Headline Is from r/theonion or r/no...,1555881934,56,9,guess if a headline is from r theonion or r no...,55,11,Guess If a Headline Is from r/theonion or r/no...


In [34]:
onion_or_not[onion_or_not.title.str.contains('nottheonion')]

Unnamed: 0,subreddit,title,created_utc,title_length,title_word_count,new_text,newtext_length,newtext_word_count,test
2284,nottheonion,/r/nottheonion being renamed to /r/stupidmods,1639232070,45,5,r nottheonion being renamed to r stupidmods,43,7,/r/nottheonion being renamed to /r/stupidmods
8933,TheOnion,Guess If a Headline Is from r/theonion or r/no...,1555881934,56,9,guess if a headline is from r theonion or r no...,55,11,Guess If a Headline Is from r/theonion or r/no...


## Generating Binary Classifier Column

Since we want to create a model that will effectively predict which news is fake, we will use TheOnion as positive, since it is satire and not real news.


**1: TheOnion<br>
0: nottheonion**

In [143]:
# create column for 1s and 0s for subreddit
# TheOnion : 1, nottheonion : 0
onion_or_not['onion'] = [1 if value == 'TheOnion' else 0 for value in onion_or_not.subreddit.values]
onion_or_not.head()

Unnamed: 0,subreddit,title,created_utc,title_length,title_word_count,new_text,newtext_length,newtext_word_count,onion
0,nottheonion,Woman self-isolates in plane toilet for five h...,1640966006,101,14,woman self isolates in plane toilet for five h...,98,16,0
1,nottheonion,Humanity's Final Arms Race: UN Fails to Agree ...,1640965471,67,12,humanity s final arm race un fails to agree on...,63,13,0
2,nottheonion,Prince Andrew accuser seeks evidence he could ...,1640963938,55,9,prince andrew accuser seek evidence he could n...,54,9,0
3,nottheonion,Upset over their grievances not being addresse...,1640963308,220,34,upset over their grievance not being addressed...,207,33,0
4,nottheonion,Thousands of coma patients may be conscious bu...,1640963023,99,14,thousand of coma patient may be conscious but ...,95,15,0


In [144]:
# as stated in the previous notebook, the classes are perfectly balanced
print(onion_or_not.onion.value_counts(normalize=False))
onion_or_not.onion.value_counts(normalize=True)

0    5000
1    5000
Name: onion, dtype: int64


0    0.5
1    0.5
Name: onion, dtype: float64

### The X and y features

We will use `new_text` to predict whether a title is from TheOnion or not. `new_text` is the tokenized and lemmatized version of the original `title` text.<br>
Hence, our X will be:

In [145]:
X = onion_or_not.new_text
X.head()

0    woman self isolates in plane toilet for five h...
1    humanity s final arm race un fails to agree on...
2    prince andrew accuser seek evidence he could n...
3    upset over their grievance not being addressed...
4    thousand of coma patient may be conscious but ...
Name: new_text, dtype: object

Our y will simply be the `onion` column of 1s and 0s:

In [146]:
y = onion_or_not.onion
y.head()

0    0
1    0
2    0
3    0
4    0
Name: onion, dtype: int64

### Train/Test Split

In [147]:
# split into train (75%) and test (25%) sets, stratified on y
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42,
                                                    stratify=y)
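
Passing `stratify=y` keeps the class proportions the same in both splits. The sketch below checks this on synthetic labels (a stand-in for the real data, not the notebook's dataset); the `_demo` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples with perfectly balanced labels
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([0, 1] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the 50/50 class balance
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```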

## Important metrics for this dataset

### Which is worse, FALSE POSITIVES or FALSE NEGATIVES?

**False Positive - News that is real, but wrongly classified as fake.<br>
False Negative - News that is fake, but wrongly classified as real.**<br>

False positives are important, as the implications of classifying real news as fake can be serious. For example, real news about vaccines being classified as fake can have a serious impact on the healthcare system and vaccination efforts of a nation.

That said, false negatives are equally important. For example, classifying fake news about a terrorist attack as real can cause undue panic and anxiety, and place unnecessary stress on security and defence personnel.

Since avoiding false positives and avoiding false negatives are equally important for our problem, we need to balance precision and recall. We will thus **use the f1 score as the main metric**. The f1 score is defined as the harmonic mean of precision and recall.<sup>1</sup><br>
Nevertheless, we will still look at other metrics, namely:
- Accuracy
- Recall
- Precision
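
To make the relationship between these metrics concrete, here is a minimal sketch that computes precision, recall and the f1 score from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts, for illustration only
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot reach a high f1 score by maximising only precision or only recall.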

## Modelling

We will model the data using the following models:
- K-Nearest Neighbours
- Random Forest
- Multinomial Naive Bayes
- Support Vector Machine
- Logistic Regression

The model with the best f1 score will be chosen as our primary model, on which we will perform further study and from which we will draw recommendations and conclusions.

### Baseline Model
It is tempting to use an accuracy score of 0.5 as the baseline. However, this would be uninformative: the dataset was deliberately balanced, so 0.5 is simply what a classifier that always predicts one class would achieve. We will instead use k-Nearest Neighbours with default settings as the baseline model, as it is a proper model yet simple enough for a baseline evaluation.
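
For reference, the 0.5 figure can be reproduced with scikit-learn's `DummyClassifier`. This is a minimal sketch on synthetic balanced labels, not on the notebook's data, and the `_demo` names are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic balanced labels; the features are irrelevant to a dummy model
X_demo = np.zeros((100, 1))
y_demo = np.array([0, 1] * 50)

# Always predicts a single class, regardless of the input
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_demo, y_demo)
dummy_accuracy = dummy.score(X_demo, y_demo)
print(dummy_accuracy)
```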

In [148]:
# create KNN pipeline
# the default CountVectorizer and KNN settings are used (no tuning), as this is a baseline
pipe_knn_base = Pipeline([
    ('cvec', CountVectorizer()), # instantiate CountVectorizer
    ('knn', KNeighborsClassifier()) # Instantiate KNN
])

In [149]:
# fit the train data
pipe_knn_base.fit(X_train, y_train)

Pipeline(steps=[('cvec', CountVectorizer()), ('knn', KNeighborsClassifier())])

In [150]:
# get the score on the train and test sets
print(f'Baseline train accuracy score: {pipe_knn_base.score(X_train, y_train)}')
print(f'Baseline test accuracy score: {pipe_knn_base.score(X_test, y_test)}')

Baseline train accuracy score: 0.7388
Baseline test accuracy score: 0.6076


The model is overfit: the train accuracy (0.7388) is well above the test accuracy (0.6076).

In [151]:
# what are the other test set metrics for baseline KNN model?
print(f'Baseline test RECALL: {recall_score(y_test, pipe_knn_base.predict(X_test))}')
print(f'Baseline test PRECISION: {precision_score(y_test, pipe_knn_base.predict(X_test))}')
print(f'Baseline test f1 SCORE: {f1_score(y_test, pipe_knn_base.predict(X_test))}')

Baseline test RECALL: 0.6392
Baseline test PRECISION: 0.6012039127163281
Baseline test f1 SCORE: 0.6196200077549439


**Summary of the test scores for the baseline KNN Model**

In [152]:
print(tabulate([['Metric', 'Score'],
                ['Accuracy', pipe_knn_base.score(X_test, y_test)],
               ['Recall', recall_score(y_test, pipe_knn_base.predict(X_test))],
               ['Precision', precision_score(y_test, pipe_knn_base.predict(X_test))],
               ['f1 Score', f1_score(y_test, pipe_knn_base.predict(X_test))]],
               headers='firstrow'))

Metric        Score
---------  --------
Accuracy   0.6076
Recall     0.6392
Precision  0.601204
f1 Score   0.61962


Overall the baseline scores are weak: every metric falls between 0.60 and 0.64, only modestly better than the 0.5 expected on a balanced dataset.

### k-Nearest Neighbors (KNN) Model Hyperparameter Tuning
Since the baseline performed rather poorly, we will tune the hyperparameters of the KNN model to see if the score can be improved.

In [153]:
# create KNN pipeline
pipe_knn = Pipeline([
    ('cvec', CountVectorizer()), # instantiate CountVectorizer
    ('knn', KNeighborsClassifier()) # Instantiate KNN
])

In [154]:
# CountVectorizer and KNN parameters
pipe_knn_params = {
    'cvec__max_features': [None, 1000, 2000, 3000],
    'cvec__min_df': [1, 2, 3],
    'cvec__max_df': [.8, .85],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)],
    'cvec__stop_words': [None, 'english'],
    'knn__n_neighbors': [1, 3, 5, 7]
}

In [155]:
# Gridsearch for KNN
gs_knn = GridSearchCV(pipe_knn,
                     param_grid=pipe_knn_params,
                     cv=5,
                     n_jobs=-1)
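
When no `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's default score, which for classifiers is accuracy. Since the f1 score is our main metric, one option would be to pass `scoring='f1'`. The sketch below shows this on a tiny made-up corpus; the texts, labels and the name `gs_f1` are assumptions and not part of this notebook's results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up toy corpus, for illustration only
texts = ["area man wins award", "local man eats soup", "area woman finds cat",
         "local dog runs away", "area man loses keys", "local woman wins race",
         "court rules on appeal", "police arrest suspect today", "city council votes on tax",
         "court hears new case", "police find missing car", "city opens new park"]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipe = Pipeline([("cvec", CountVectorizer()), ("knn", KNeighborsClassifier())])

# scoring="f1" makes the search select parameters by f1 instead of accuracy
gs_f1 = GridSearchCV(pipe, param_grid={"knn__n_neighbors": [1, 3]}, scoring="f1", cv=2)
gs_f1.fit(texts, labels)
print(gs_f1.best_params_, gs_f1.best_score_)
```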

In [156]:
# fit the training data
gs_knn.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('knn', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'cvec__max_df': [0.8, 0.85],
                         'cvec__max_features': [None, 1000, 2000, 3000],
                         'cvec__min_df': [1, 2, 3],
                         'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
                         'cvec__stop_words': [None, 'english'],
                         'knn__n_neighbors': [1, 3, 5, 7]})

In [157]:
# what are the best parameters for the KNN model?
print(gs_knn.best_params_)

# best score for KNN model
gs_knn.best_score_

{'cvec__max_df': 0.8, 'cvec__max_features': None, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': 'english', 'knn__n_neighbors': 1}


0.6612

In [158]:
# what are the KNN accuracy scores?
print(f'KNN train accuracy score: {gs_knn.score(X_train, y_train)}')
print(f'KNN test accuracy score: {gs_knn.score(X_test, y_test)}')

KNN train accuracy score: 0.9997333333333334
KNN test accuracy score: 0.6824


Comparing the tuned KNN model with the baseline gives a number of observations:
- Removing English stop words improved the score.
- Using uni-grams only improved the score.
- The best number of nearest neighbours for this model is **1**.
- Although the test accuracy improved from 0.6076 to 0.6824, overfitting is still present and severe: the train accuracy is 0.9997.

This will help with adjusting the parameters for the other models.
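
A near-perfect train accuracy is expected with `n_neighbors=1`: each training point is its own nearest neighbour, so the train score says little about how well the model generalises. The sketch below illustrates this on random synthetic data with no real signal; the `_demo` names are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))      # random features, no real signal
y_demo = rng.integers(0, 2, size=200)   # random labels

# With k=1, every training point is predicted from itself (distance 0)
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)
train_accuracy = knn_1.score(X_demo, y_demo)
print(train_accuracy)
```

A train score slightly below 1.0, as seen above, could arise when identical feature vectors carry different labels, although that has not been checked here.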

In [159]:
# what are the test scores of the other metrics for KNN model
print(f'KNN test RECALL: {recall_score(y_test, gs_knn.predict(X_test))}')
print(f'KNN test PRECISION: {precision_score(y_test, gs_knn.predict(X_test))}')
print(f'KNN test f1 SCORE: {f1_score(y_test, gs_knn.predict(X_test))}')

KNN test RECALL: 0.9008
KNN test PRECISION: 0.6269487750556793
KNN test f1 SCORE: 0.7393302692055155


Recall (0.6392 to 0.9008) and the f1 score (0.6196 to 0.7393) improve significantly over the baseline, while Precision improves only slightly (0.6012 to 0.6269).<br>
The high recall shows that we have very few false negatives.<br>
The low precision indicates a high number of false positives.

**Summary of the test scores for the KNN Model with hyperparameter tuning**

In [24]:
print(tabulate([['Metric', 'Score'],
                ['Accuracy', gs_knn.score(X_test, y_test)],
               ['Recall', recall_score(y_test, gs_knn.predict(X_test))],
               ['Precision', precision_score(y_test, gs_knn.predict(X_test))],
               ['f1 Score', f1_score(y_test, gs_knn.predict(X_test))]],
               headers='firstrow'))

Metric        Score
---------  --------
Accuracy   0.6824
Recall     0.9008
Precision  0.626949
f1 Score   0.73933


### Random Forest with hyperparameter tuning

In [25]:
# pipeline for RandomForest
pipe_rf = Pipeline([
    ('cvec', CountVectorizer()), # instantiate CountVectorizer
    ('rf', RandomForestClassifier()) # instantiate RandomForest
])

In [26]:
# CountVectorizer and RandomForest parameters
pipe_rf_params = {
    'cvec__max_features': [None, 1000, 2000, 3000],
    'cvec__min_df': [1, 2, 3],
    'cvec__max_df': [.8, .85],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)],
    'cvec__stop_words': [None, 'english'],
    'rf__n_estimators': [100, 200, 300],
    'rf__max_depth': [None, 1, 2, 3, 4, 5],
    'rf__random_state': [42],
}

In [27]:
# GridSearch for RandomForest
gs_rf = GridSearchCV(pipe_rf,
                     param_grid=pipe_rf_params,
                     cv=5,
                     n_jobs=-1)

In [28]:
# Fit the training data to the RandomForest model
gs_rf.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('rf', RandomForestClassifier())]),
             n_jobs=-1,
             param_grid={'cvec__max_df': [0.8, 0.85],
                         'cvec__max_features': [None, 1000, 2000, 3000],
                         'cvec__min_df': [1, 2, 3],
                         'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
                         'cvec__stop_words': [None, 'english'],
                         'rf__max_depth': [None, 1, 2, 3, 4, 5],
                         'rf__n_estimators': [100, 200, 300],
                         'rf__random_state': [42]})

In [29]:
# what are the best parameters for the RandomForest model?
print(gs_rf.best_params_)

# best score for RandomForest model
gs_rf.best_score_

{'cvec__max_df': 0.8, 'cvec__max_features': None, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': None, 'rf__max_depth': None, 'rf__n_estimators': 200, 'rf__random_state': 42}


0.7797333333333334

In [30]:
# what are the RandomForest accuracy scores?
print(f'RandomForest train accuracy score: {gs_rf.score(X_train, y_train)}')
print(f'RandomForest test accuracy score: {gs_rf.score(X_test, y_test)}')

RandomForest train accuracy score: 1.0
RandomForest test accuracy score: 0.7932


Some interesting observations for RandomForest:
- Although the test accuracy (0.7932) is well above the KNN test accuracy (0.6824), the model is still overfit, with a train accuracy of 1.0.
- Uni-grams work best for this model.
- Keeping stop words worked best for this model.

In [31]:
# what are the test scores of the other metrics for RandomForest model
print(f'RandomForest test RECALL: {recall_score(y_test, gs_rf.predict(X_test))}')
print(f'RandomForest test PRECISION: {precision_score(y_test, gs_rf.predict(X_test))}')
print(f'RandomForest test f1 SCORE: {f1_score(y_test, gs_rf.predict(X_test))}')

RandomForest test RECALL: 0.8624
RandomForest test PRECISION: 0.7575544624033732
RandomForest test f1 SCORE: 0.8065843621399178


Of all the scores, Precision was the lowest, similar to that of KNN, indicating that RandomForest also predicted a proportionally higher number of false positives.


**Summary of the test scores for the RandomForest Model with hyperparameter tuning**

In [32]:
print(tabulate([['Metric', 'Score'],
                ['Accuracy', gs_rf.score(X_test, y_test)],
               ['Recall', recall_score(y_test, gs_rf.predict(X_test))],
               ['Precision', precision_score(y_test, gs_rf.predict(X_test))],
               ['f1 Score', f1_score(y_test, gs_rf.predict(X_test))]],
               headers='firstrow'))

Metric        Score
---------  --------
Accuracy   0.7932
Recall     0.8624
Precision  0.757554
f1 Score   0.806584


### Multinomial Naive Bayes (MNB) with hyperparameter tuning

In [160]:
# create a pipeline for MNB
pipe_nb = Pipeline([
    ('cvec', CountVectorizer()), # instantiate CountVectorizer
    ('nb', MultinomialNB()) # instantiate MNB
])

In [161]:
# CountVectorizer and MNB parameters
pipe_nb_params = {
    'cvec__max_features': [None, 1000, 2000, 3000],
    'cvec__min_df': [1, 2, 3],
    'cvec__max_df': [.8, .85],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)],
    'cvec__stop_words': [None, 'english'],
    'nb__alpha': np.linspace(0.5, 1.8, 8),
    'nb__fit_prior': [True, False]
}

In [162]:
# Instantiate GridSearchCV for MNB
gs_nb = GridSearchCV(pipe_nb, 
                     param_grid=pipe_nb_params,
                     cv=5,
                     n_jobs=-1)

In [163]:
# fit the training data to the MNB model
gs_nb.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('nb', MultinomialNB())]),
             n_jobs=-1,
             param_grid={'cvec__max_df': [0.8, 0.85],
                         'cvec__max_features': [None, 1000, 2000, 3000],
                         'cvec__min_df': [1, 2, 3],
                         'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
                         'cvec__stop_words': [None, 'english'],
                         'nb__alpha': array([0.5       , 0.68571429, 0.87142857, 1.05714286, 1.24285714,
       1.42857143, 1.61428571, 1.8       ]),
                         'nb__fit_prior': [True, False]})

In [164]:
# what are the best parameters for the MNB model?
print(gs_nb.best_params_)

# best score for MNB model
gs_nb.best_score_

{'cvec__max_df': 0.8, 'cvec__max_features': None, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': None, 'nb__alpha': 0.8714285714285714, 'nb__fit_prior': True}


0.8172

In [165]:
# what are the MNB accuracy scores?
print(f'MNB train accuracy score: {gs_nb.score(X_train, y_train)}')
print(f'MNB test accuracy score: {gs_nb.score(X_test, y_test)}')

MNB train accuracy score: 0.9972
MNB test accuracy score: 0.842


Interesting observations from MNB:
- An n-gram range of uni-grams to tri-grams worked best.
- Keeping stop words worked best for the model.
- The best alpha value was approximately **0.8714**.
- The model is a strong predictor with a relatively high test accuracy (0.842), but it is still overfit (train accuracy 0.9972).

In [166]:
# what are the test scores of the other metrics for MNB model
print(f'MNB test RECALL: {recall_score(y_test, gs_nb.predict(X_test))}')
print(f'MNB test PRECISION: {precision_score(y_test, gs_nb.predict(X_test))}')
print(f'MNB test f1 SCORE: {f1_score(y_test, gs_nb.predict(X_test))}')

MNB test RECALL: 0.8512
MNB test PRECISION: 0.835820895522388
MNB test f1 SCORE: 0.8434403487911216


The scores do not differ much from one another, with Precision being the lowest and Recall being the highest.<br>
The f1 score outperforms that of the baseline, and of KNN and RandomForest with hyperparameter tuning.

**Summary of the test scores for the Multinomial Naive Bayes Model with hyperparameter tuning**

In [40]:
print(tabulate([['Metric', 'Score'],
                ['Accuracy', gs_nb.score(X_test, y_test)],
               ['Recall', recall_score(y_test, gs_nb.predict(X_test))],
               ['Precision', precision_score(y_test, gs_nb.predict(X_test))],
               ['f1 Score', f1_score(y_test, gs_nb.predict(X_test))]],
               headers='firstrow'))

Metric        Score
---------  --------
Accuracy   0.842
Recall     0.8512
Precision  0.835821
f1 Score   0.84344


### Logistic Regression with hyperparameter tuning

In [64]:
# create pipeline for LogisticRegression
pipe_log = Pipeline([
    ('cvec', CountVectorizer()), # instantiate CountVectorizer
    ('log', LogisticRegression()) # instantiate LogisticRegression
])

In [85]:
pipe_log_params = {
    'cvec__max_features': [None, 1000, 2000, 3000],
    'cvec__min_df': [1, 2, 3],
    'cvec__max_df': [.8, .85],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)],
    'cvec__stop_words': [None, 'english'],
    'log__penalty': ['l1', 'l2'], # 'l1' is not supported by the default 'lbfgs' solver, so its fits score nan
    'log__max_iter': [100, 300, 500],
    'log__random_state': [42]
}

In [101]:
# Instantiate GridSearchCV for Logistic Regression
gs_log = GridSearchCV(pipe_log, 
                     param_grid=pipe_log_params,
                     cv=5,
                     n_jobs=-1)

In [102]:
# fit the training data to the LogisticRegression model
gs_log.fit(X_train, y_train)

(Truncated warning output listing the mean cross-validation scores. In the portion shown, the scores alternate between `nan` and finite values ranging from 0.768 to 0.802, e.g. `nan 0.7772 nan 0.7772 nan 0.7772`.)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('log', LogisticRegression())]),
             n_jobs=-1,
             param_grid={'cvec__max_df': [0.8, 0.85],
                         'cvec__max_features': [None, 1000, 2000, 3000],
                         'cvec__min_df': [1, 2, 3],
                         'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
                         'cvec__stop_words': [None, 'english'],
                         'log__max_iter': [100, 300, 500],
                         'log__penalty': ['l1', 'l2'],
                         'log__random_state': [42]})
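
The `nan` entries in the output above are consistent with the `l1` candidates failing to fit: the default `lbfgs` solver of `LogisticRegression` does not support the `l1` penalty. One possible way to search over both penalties is to pass a list of grids that pairs each penalty with a compatible solver, such as `liblinear`. This is a sketch only; `log_grid`, `l1_model` and the synthetic data are assumptions, and the grid was not run on this dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid

# Each dict pairs a penalty with solvers that support it
log_grid = [
    {"log__penalty": ["l1"], "log__solver": ["liblinear"]},
    {"log__penalty": ["l2"], "log__solver": ["lbfgs", "liblinear"]},
]
n_combinations = len(ParameterGrid(log_grid))

# Synthetic data, only to show that an l1 fit succeeds with liblinear
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
l1_model = LogisticRegression(penalty="l1", solver="liblinear").fit(X_demo, y_demo)

print(n_combinations, l1_model.coef_.shape)
```

A list of dicts like `log_grid` can be merged with the `cvec__` settings and passed to `GridSearchCV` as `param_grid`.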

In [103]:
# what are the best parameters for the LogisticRegression model?
print(gs_log.best_params_)

# best score for LogisticRegression model
gs_log.best_score_

{'cvec__max_df': 0.8, 'cvec__max_features': None, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': None, 'log__max_iter': 100, 'log__penalty': 'l2', 'log__random_state': 42}


0.8019999999999999

In [104]:
# what are the LogisticRegression accuracy scores?
print(f'LogisticRegression train accuracy score: {gs_log.score(X_train, y_train)}')
print(f'LogisticRegression test accuracy score: {gs_log.score(X_test, y_test)}')

LogisticRegression train accuracy score: 1.0
LogisticRegression test accuracy score: 0.8208


Interesting observations with Logistic Regression:
- An n-gram range of uni-grams to tri-grams worked best for this model.
- Keeping stop words worked best for this model.
- The l2 penalty (ridge-style regularisation) was selected. The l1 fits returned `nan` scores, so l2 was effectively the only penalty evaluated.
- The model is also overfit, with a train accuracy of 1.0 against a test accuracy of 0.8208.

In [105]:
# what are the test scores of the other metrics for LogisticRegression model
print(f'LogisticRegression test RECALL: {recall_score(y_test, gs_log.predict(X_test))}')
print(f'LogisticRegression test PRECISION: {precision_score(y_test, gs_log.predict(X_test))}')
print(f'LogisticRegression test f1 SCORE: {f1_score(y_test, gs_log.predict(X_test))}')

LogisticRegression test RECALL: 0.8392
LogisticRegression test PRECISION: 0.8094135802469136
LogisticRegression test f1 SCORE: 0.8240377062058131


The scores do not differ much from one another, with Precision being the lowest and Recall being the highest.<br>
The f1 score outperforms that of the baseline, and of KNN and RandomForest with hyperparameter tuning, but is slightly worse than that of MNB with hyperparameter tuning.

**Summary of the test scores for the Logistic Regression Model with hyperparameter tuning**

In [106]:
print(tabulate([['Metric', 'Score'],
                ['Accuracy', gs_log.score(X_test, y_test)],
               ['Recall', recall_score(y_test, gs_log.predict(X_test))],
               ['Precision', precision_score(y_test, gs_log.predict(X_test))],
               ['f1 Score', f1_score(y_test, gs_log.predict(X_test))]],
               headers='firstrow'))

Metric        Score
---------  --------
Accuracy   0.8208
Recall     0.8392
Precision  0.809414
f1 Score   0.824038


### Support Vector Machine (SVM) with hyperparameter tuning

In [92]:
# create pipeline for SVM
pipe_svm = Pipeline([
    ('cvec', CountVectorizer()), # instantiate CountVectorizer
    ('svm', svm.SVC()) # instantiate SVM
])

In [94]:
pipe_svm_params = {
    'cvec__max_features': [None, 1000, 2000, 3000],
    'cvec__min_df': [1, 2, 3],
    'cvec__max_df': [.8, .85],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)],
    'cvec__stop_words': [None, 'english'],
    'svm__kernel': ['rbf', 'linear'],
    'svm__gamma': ['scale', 'auto'],
    'svm__random_state': [42]
}
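
In `SVC`, `gamma` is the kernel coefficient for the `rbf`, `poly` and `sigmoid` kernels and has no effect on the `linear` kernel, so the grid above fits each linear configuration twice. The sketch below counts the candidates for the original layout and for an alternative layout that varies `gamma` only for `rbf`; `random_state` is omitted because it has a single value, and the names `cvec_grid`, `full_grid` and `split_grid` are assumptions.

```python
from sklearn.model_selection import ParameterGrid

cvec_grid = {
    "cvec__max_features": [None, 1000, 2000, 3000],
    "cvec__min_df": [1, 2, 3],
    "cvec__max_df": [0.8, 0.85],
    "cvec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "cvec__stop_words": [None, "english"],
}

# Original layout: every kernel is crossed with every gamma
full_grid = {**cvec_grid, "svm__kernel": ["rbf", "linear"], "svm__gamma": ["scale", "auto"]}

# Alternative layout: gamma is only varied for the rbf kernel
split_grid = [
    {**cvec_grid, "svm__kernel": ["rbf"], "svm__gamma": ["scale", "auto"]},
    {**cvec_grid, "svm__kernel": ["linear"]},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```

With 5-fold cross-validation, the alternative layout would avoid a quarter of the fits without changing which configurations are effectively compared.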

In [96]:
# Instantiate GridSearchCV for SVM
gs_svm = GridSearchCV(pipe_svm, 
                     param_grid=pipe_svm_params,
                     cv=5,
                     n_jobs=-1)

In [97]:
# fit the training data to the SVM model
gs_svm.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('svm', SVC())]),
             n_jobs=-1,
             param_grid={'cvec__max_df': [0.8, 0.85],
                         'cvec__max_features': [None, 1000, 2000, 3000],
                         'cvec__min_df': [1, 2, 3],
                         'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
                         'cvec__stop_words': [None, 'english'],
                         'svm__gamma': ['scale', 'auto'],
                         'svm__kernel': ['rbf', 'linear'],
                         'svm__random_state': [42]})

In [98]:
# what are the best parameters for the SVM model?
print(gs_svm.best_params_)

# best score for SVM model
gs_svm.best_score_

{'cvec__max_df': 0.8, 'cvec__max_features': None, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': None, 'svm__gamma': 'scale', 'svm__kernel': 'linear', 'svm__random_state': 42}


0.7993333333333333

In [107]:
# what are the SVM accuracy scores?
print(f'SVM train accuracy score: {gs_svm.score(X_train, y_train)}')
print(f'SVM test accuracy score: {gs_svm.score(X_test, y_test)}')

SVM train accuracy score: 1.0
SVM test accuracy score: 0.818


Interesting observations from Support Vector Machine:
- Uni-grams to Bi-grams range worked the best for the model
- No stop word removal worked best for the model
- The linear kernel worked better than the Radial Basis Function (rbf) kernel
- Model is overfit, with the train score being 1.0.
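The overfitting noted above can be quantified as the gap between the train score and the held-out scores. The sketch below uses only the scores reported in this section. The `extra_params` entry is a hypothetical addition, not something the notebook ran: `C` is the regularization parameter of `SVC` (smaller values regularize more strongly), and the candidate values listed are illustrative assumptions.

```python
# Scores reported above for the tuned SVM pipeline.
train_accuracy = 1.0
cv_accuracy = 0.7993333333333333   # gs_svm.best_score_
test_accuracy = 0.818

# A large positive gap means the model fits the training data far better
# than data it has not seen.
train_cv_gap = train_accuracy - cv_accuracy
train_test_gap = train_accuracy - test_accuracy

print(f'Train minus CV accuracy:   {train_cv_gap:.3f}')
print(f'Train minus test accuracy: {train_test_gap:.3f}')

# Hypothetical extra grid entry (assumed values) that could be merged into
# pipe_svm_params to let the search consider stronger regularization.
extra_params = {'svm__C': [0.01, 0.1, 1, 10]}
```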

In [99]:
# what are the test scores of the other metrics for the SVM model?
print(f'SVM test RECALL: {recall_score(y_test, gs_svm.predict(X_test))}')
print(f'SVM test PRECISION: {precision_score(y_test, gs_svm.predict(X_test))}')
print(f'SVM test f1 SCORE: {f1_score(y_test, gs_svm.predict(X_test))}')

SVM test RECALL: 0.84
SVM test PRECISION: 0.8045977011494253
SVM test f1 SCORE: 0.8219178082191781


The scores are all close to one another, with Precision being the lowest (0.805) and Recall being the highest (0.84).<br>
The f1 score outperforms that of the baseline, and of KNN and Random Forest with hyperparameter tuning, but is slightly worse than that of MNB and Logistic Regression with hyperparameter tuning.
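The f1 score sits between precision and recall because it is their harmonic mean. As a quick check, the sketch below recomputes it from the two SVM test scores reported above. The variable names are chosen for illustration.

```python
# Test scores reported above for the tuned SVM model.
precision = 0.8045977011494253
recall = 0.84

# f1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f'f1 recomputed from precision and recall: {f1:.6f}')
```

The result agrees with the value returned by `f1_score` to the precision shown in the notebook output.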

**Summary of the test scores for the Support Vector Machine Model with hyperparameter tuning**

In [100]:
print(tabulate([['Metric', 'Score'],
                ['Accuracy', gs_svm.score(X_test, y_test)],
               ['Recall', recall_score(y_test, gs_svm.predict(X_test))],
               ['Precision', precision_score(y_test, gs_svm.predict(X_test))],
               ['f1 Score', f1_score(y_test, gs_svm.predict(X_test))]],
               headers='firstrow'))

Metric        Score
---------  --------
Accuracy   0.818
Recall     0.84
Precision  0.804598
f1 Score   0.821918


## Summary of modelling scores

In [114]:
print(tabulate([['Model', 'Accuracy', 'Recall', 'Precision', 'f1 Score'],
                ['Baseline', pipe_knn_base.score(X_test, y_test), recall_score(y_test, pipe_knn_base.predict(X_test)), precision_score(y_test, pipe_knn_base.predict(X_test)), f1_score(y_test, pipe_knn_base.predict(X_test))],
               ['k-Nearest Neighbors', gs_knn.score(X_test, y_test), recall_score(y_test, gs_knn.predict(X_test)), precision_score(y_test, gs_knn.predict(X_test)), f1_score(y_test, gs_knn.predict(X_test))],
               ['Random Forest', gs_rf.score(X_test, y_test), recall_score(y_test, gs_rf.predict(X_test)), precision_score(y_test, gs_rf.predict(X_test)), f1_score(y_test, gs_rf.predict(X_test))],
               ['Multinomial Naive Bayes', gs_nb.score(X_test, y_test), recall_score(y_test, gs_nb.predict(X_test)), precision_score(y_test, gs_nb.predict(X_test)), f1_score(y_test, gs_nb.predict(X_test))],
               ['Logistic Regression', gs_log.score(X_test, y_test), recall_score(y_test, gs_log.predict(X_test)), precision_score(y_test, gs_log.predict(X_test)), f1_score(y_test, gs_log.predict(X_test))],
               ['Support Vector Machine', gs_svm.score(X_test, y_test), recall_score(y_test, gs_svm.predict(X_test)), precision_score(y_test, gs_svm.predict(X_test)), f1_score(y_test, gs_svm.predict(X_test))]],
               headers='firstrow'))

print('') # leave a space

print('The Baseline model is k-Nearest Neighbors without any hyperparameter tuning.')

Model                      Accuracy    Recall    Precision    f1 Score
-----------------------  ----------  --------  -----------  ----------
Baseline                     0.6076    0.6392     0.601204    0.61962
k-Nearest Neighbors          0.6824    0.9008     0.626949    0.73933
Random Forest                0.7932    0.8624     0.757554    0.806584
Multinomial Naive Bayes      0.842     0.8512     0.835821    0.84344
Logistic Regression          0.8208    0.8392     0.809414    0.824038
Support Vector Machine       0.818     0.84       0.804598    0.821918

The Baseline model is k-Nearest Neighbors without any hyperparameter tuning.


As can be seen from the table above, Multinomial Naive Bayes has the best accuracy (0.842), precision (0.836) and f1 score (0.843), with the f1 score being the most important metric in this project.

k-Nearest Neighbors has the best recall (0.901), but at the cost of the lowest precision among the tuned models (0.627).

As Multinomial Naive Bayes leads on the most metrics, including the f1 score, it will serve as our chosen model.
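The selection above can be reproduced directly from the summary table. The sketch below copies the reported test scores into a dictionary and picks the top model for each metric. The names `scores` and `best_model` are hypothetical helpers introduced for illustration, not part of the original notebook.

```python
# Test scores copied from the summary table above.
scores = {
    'Baseline':                {'accuracy': 0.6076, 'recall': 0.6392, 'precision': 0.601204, 'f1': 0.61962},
    'k-Nearest Neighbors':     {'accuracy': 0.6824, 'recall': 0.9008, 'precision': 0.626949, 'f1': 0.73933},
    'Random Forest':           {'accuracy': 0.7932, 'recall': 0.8624, 'precision': 0.757554, 'f1': 0.806584},
    'Multinomial Naive Bayes': {'accuracy': 0.842,  'recall': 0.8512, 'precision': 0.835821, 'f1': 0.84344},
    'Logistic Regression':     {'accuracy': 0.8208, 'recall': 0.8392, 'precision': 0.809414, 'f1': 0.824038},
    'Support Vector Machine':  {'accuracy': 0.818,  'recall': 0.84,   'precision': 0.804598, 'f1': 0.821918},
}


def best_model(metric):
    """Return the name of the model with the highest score for `metric`."""
    return max(scores, key=lambda model: scores[model][metric])


for metric in ['accuracy', 'recall', 'precision', 'f1']:
    print(f'Best {metric}: {best_model(metric)}')
```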

# References
1. https://www.analyticsvidhya.com/blog/2020/11/a-tour-of-evaluation-metrics-for-machine-learning/