---
# Text Feature Selection
---

This notebook contains the code and processes used to assess how well two text feature extraction methods work with the pre-processed data. Using the column of lemmatized words, the two vectorizers, CountVectorizer and TfidfVectorizer (TF-IDF), are each paired with the following classifiers:
- Naive Bayes
- K-Nearest Neighbors
- Logistic Regression Classifier

A GridSearchCV is run across all models to rule out non-viable options. The vectorizer that gives the models with the most predictive potential is then selected and optimized in the next notebook.

---

#### Imports and load file

In [1]:
# library imports
import requests
import time
import pandas as pd
import numpy as np
import re

from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import warnings
warnings.filterwarnings('ignore')

In [2]:
# open cleaned data
df_clean = pd.read_csv('./dataset/df_clean.csv')

---

In [3]:
df_clean.head()

Unnamed: 0,is_sw,cleaned_post_stem,cleaned_post_lem
0,0,new wiki avoid accident encourag suicid spot c...,new wiki avoid accidentally encouraging suicid...
1,0,remind absolut activ kind allow day want recog...,reminder absolutely activism kind allowed day ...
2,0,haha help ye suicid ye get help post mobil bel...,haha help yes suicidal yes getting help posted...
3,0,someon pleas talk anyon pleas absolut one turn...,someone please talk anyone please absolutely o...
4,0,usual respond post feel like post lot peopl ne...,usually responding post feel like posting lot ...


Set the feature matrix X to the cleaned_post_lem column, and the target vector y to is_sw.

In [4]:
X = df_clean[['cleaned_post_lem']]
y = df_clean['is_sw']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [6]:
print('X_train length:{}'.format(len(X_train)))
print('y_train length:{}'.format(len(y_train)))
print('X_test length:{}'.format(len(X_test)))
print('y_test length:{}'.format(len(y_test)))      

X_train length:1431
y_train length:1431
X_test length:478
y_test length:478


---
### 1. Text Feature Extraction Comparison by means of Modelling

As the feature words are still unstructured text, we employ count vectorization and TF-IDF to transform the cleaned posts above into numeric features that can be passed to a model.

- Count vectorization creates columns (also known as vectors), where each column counts how many times a given word is observed in each post.
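To make the idea of word-count columns concrete, here is a minimal pure-Python sketch of what count vectorization produces. The two toy posts are hypothetical, and whitespace tokenization is an assumption made for brevity; scikit-learn's CountVectorizer applies its own token pattern and returns a sparse matrix instead of nested lists.

```python
from collections import Counter

# Hypothetical toy posts, not taken from the dataset
toy_posts = [
    "need help need someone",
    "someone please talk",
]

# Tokenize on whitespace (an assumption; CountVectorizer uses its own token pattern)
tokenized = [post.split() for post in toy_posts]

# One column per distinct word, in alphabetical order
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per post; each cell counts how often that word occurs in the post
count_matrix = []
for tokens in tokenized:
    word_counts = Counter(tokens)
    count_matrix.append([word_counts[word] for word in vocab])

print(vocab)
for row in count_matrix:
    print(row)
```

Each row is one post and each column one word, so the word "need" contributes a 2 in the first row and a 0 in the second.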



---
#### Count Vectorizer



In [7]:
# List of steps in pipeline model for each classifier model
steps_list_gr_cvec = [
    [('cvec',CountVectorizer()),('multi_nb',MultinomialNB())],
    [('cvec',CountVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('cvec',CountVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())]
]

In [8]:
steps_titles = ['multi_nb','knn','logreg']

In [9]:
# set parameters for models
pipe_params_cvec = [
    {"cvec__ngram_range":[(1,1),(1,2)], 'cvec__max_features': [1000, 5000, 10000], 'cvec__min_df': [2, 3], 'cvec__max_df': [.2, 0.25, .5, .8],},
    {"cvec__ngram_range":[(1,1),(1,2)], 'cvec__max_features': [1000, 5000, 10000], 'cvec__min_df': [2, 3], 'cvec__max_df': [.2, 0.25, .5, .8],},
    {"cvec__ngram_range":[(1,1),(1,2)], 'cvec__max_features': [1000, 5000, 10000], 'cvec__min_df': [2, 3], 'cvec__max_df': [.2, 0.25, .5, .8],}
]

In [10]:
# create results DataFrame
grid_results_cvec = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','tn','fp','fn','tp'])
grid_results_cvec.head()

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,tn,fp,fn,tp


In [11]:
%%time

for i in range(len(steps_list_gr_cvec)):                 
    pipe = Pipeline(steps=steps_list_gr_cvec[i])         # configure pipeline for each model
    grid = GridSearchCV(pipe, pipe_params_cvec[i], cv=5) # wrap pipeline in a 5-fold grid search over that model's params

    model_results = {}

    grid.fit(X_train['cleaned_post_lem'], y_train)
    
    print('Model: ',steps_titles[i])
    model_results['model'] = steps_titles[i]

    print('Best Params: ', grid.best_params_)
    model_results['best_params'] = grid.best_params_

    print(grid.score(X_train['cleaned_post_lem'], y_train), '\n')
    model_results['train_accuracy'] = grid.score(X_train['cleaned_post_lem'], y_train)
    
    print(grid.score(X_test['cleaned_post_lem'], y_test), '\n')
    model_results['test_accuracy'] = grid.score(X_test['cleaned_post_lem'], y_test)

    tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test['cleaned_post_lem'])).ravel()
    print("True Negatives: %s" % tn)
    model_results['tn'] = tn

    print("False Positives: %s" % fp)
    model_results['fp'] = fp

    print("False Negatives: %s" % fn)
    model_results['fn'] = fn

    print("True Positives: %s" % tp, '\n')
    model_results['tp'] = tp

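    # DataFrame.append was removed in pandas 2.0; pd.concat works on older and newer versions
    grid_results_cvec = pd.concat([grid_results_cvec, pd.DataFrame([model_results])], ignore_index=True)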

Model:  multi_nb
Best Params:  {'cvec__max_df': 0.5, 'cvec__max_features': 5000, 'cvec__min_df': 3, 'cvec__ngram_range': (1, 2)}
0.8909853249475891 

0.698744769874477 

True Negatives: 193
False Positives: 53
False Negatives: 91
True Positives: 141 

Model:  knn
Best Params:  {'cvec__max_df': 0.2, 'cvec__max_features': 1000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2)}
0.6638714185883997 

0.5857740585774058 

True Negatives: 232
False Positives: 14
False Negatives: 184
True Positives: 48 

Model:  logreg
Best Params:  {'cvec__max_df': 0.5, 'cvec__max_features': 10000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2)}
0.9958071278825996 

0.6234309623430963 

True Negatives: 170
False Positives: 76
False Negatives: 104
True Positives: 128 

Wall time: 3min 46s


---
#### Term Frequency-Inverse Document Frequency (TF-IDF)

- Common words are penalized.
- Rare words have more influence.
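The two bullets above can be illustrated with a small pure-Python sketch. It uses the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, which is the formula scikit-learn documents for its default `smooth_idf=True` setting; the L2 row normalization that TfidfVectorizer also applies by default is omitted here. The toy posts and helper names are hypothetical.

```python
import math

# Hypothetical toy posts, not taken from the dataset
toy_posts = [
    "feel sad today",
    "feel tired today",
    "feel hopeless",
]
tokenized = [post.split() for post in toy_posts]
n_docs = len(tokenized)


def doc_freq(word):
    """Number of posts that contain the word at least once."""
    return sum(word in tokens for tokens in tokenized)


def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq(word))) + 1


def tfidf(word, tokens):
    """Raw term count in the post multiplied by the word's idf (no normalization)."""
    return tokens.count(word) * idf(word)


# Both words occur once in the third post, but "feel" appears in every post
common = tfidf("feel", tokenized[2])
rare = tfidf("hopeless", tokenized[2])

print("feel:", round(common, 3))
print("hopeless:", round(rare, 3))
```

Although both words appear exactly once in the third post, "hopeless" receives the larger weight because it occurs in only one of the three posts.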

In [12]:
# List of steps in pipeline model for each classifier model
steps_list_gr_tvec = [
    [('tvec',TfidfVectorizer()),('multi_nb',MultinomialNB())],
    [('tvec',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('tvec',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())]
]

In [13]:
steps_titles = ['multi_nb','knn','logreg']

In [14]:
# set parameters for models
pipe_params_tvec = [
    {"tvec__ngram_range":[(1,1),(1,2)], 'tvec__max_features': [1000, 5000, 10000], 'tvec__min_df': [2, 3], 'tvec__max_df': [.2, 0.25, .5, .8],},
    {"tvec__ngram_range":[(1,1),(1,2)], 'tvec__max_features': [1000, 5000, 10000], 'tvec__min_df': [2, 3], 'tvec__max_df': [.2, 0.25, .5, .8],},
    {"tvec__ngram_range":[(1,1),(1,2)], 'tvec__max_features': [1000, 5000, 10000], 'tvec__min_df': [2, 3], 'tvec__max_df': [.2, 0.25, .5, .8],}
]

In [15]:
# create results DataFrame
grid_results_tvec = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','tn','fp','fn','tp'])
grid_results_tvec.head()

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,tn,fp,fn,tp


In [16]:
%%time

for i in range(len(steps_list_gr_tvec)):                 
    pipe = Pipeline(steps=steps_list_gr_tvec[i])         # configure pipeline for each model
    grid = GridSearchCV(pipe, pipe_params_tvec[i], cv=5) # wrap pipeline in a 5-fold grid search over that model's params

    model_results = {}

    grid.fit(X_train['cleaned_post_lem'], y_train)
    
    print('Model: ',steps_titles[i])
    model_results['model'] = steps_titles[i]

    print('Best Params: ', grid.best_params_)
    model_results['best_params'] = grid.best_params_

    print(grid.score(X_train['cleaned_post_lem'], y_train), '\n')
    model_results['train_accuracy'] = grid.score(X_train['cleaned_post_lem'], y_train)
    
    print(grid.score(X_test['cleaned_post_lem'], y_test), '\n')
    model_results['test_accuracy'] = grid.score(X_test['cleaned_post_lem'], y_test)

    tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test['cleaned_post_lem'])).ravel()
    print("True Negatives: %s" % tn)
    model_results['tn'] = tn

    print("False Positives: %s" % fp)
    model_results['fp'] = fp

    print("False Negatives: %s" % fn)
    model_results['fn'] = fn

    print("True Positives: %s" % tp, '\n')
    model_results['tp'] = tp

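    # DataFrame.append was removed in pandas 2.0; pd.concat works on older and newer versions
    grid_results_tvec = pd.concat([grid_results_tvec, pd.DataFrame([model_results])], ignore_index=True)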

Model:  multi_nb
Best Params:  {'tvec__max_df': 0.5, 'tvec__max_features': 1000, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 1)}
0.8301886792452831 

0.6924686192468619 

True Negatives: 197
False Positives: 49
False Negatives: 98
True Positives: 134 

Model:  knn
Best Params:  {'tvec__max_df': 0.25, 'tvec__max_features': 1000, 'tvec__min_df': 3, 'tvec__ngram_range': (1, 2)}
0.6198462613556953 

0.5376569037656904 

True Negatives: 238
False Positives: 8
False Negatives: 213
True Positives: 19 

Model:  logreg
Best Params:  {'tvec__max_df': 0.2, 'tvec__max_features': 5000, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 2)}
0.9944095038434662 

0.6485355648535565 

True Negatives: 169
False Positives: 77
False Negatives: 91
True Positives: 141 

Wall time: 3min 33s


___
### 2. Results assessment

Add columns measuring two differences in accuracy: training set minus test set, and test set minus baseline. The first indicates the level of overfitting that may be present in each model; the second shows how much each model improves on the baseline.

In [17]:
# identify majority as baseline accuracy
df_clean['is_sw'].value_counts(normalize=True)

0    0.514929
1    0.485071
Name: is_sw, dtype: float64

The baseline accuracy is the accuracy obtained by always predicting the majority class, based solely on the class proportions in the dataset. Normalizing the value counts shows that the majority class is is_sw=0, at 51.5% of posts, so we take 51.5% as the baseline accuracy.
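As a quick arithmetic check, the sketch below derives the baseline from the class shares printed above and compares it with one reported test accuracy. The shares and the test accuracy are copied from this notebook's outputs; the variable names are illustrative only.

```python
# Class shares copied from the value_counts(normalize=True) output above
class_share = {0: 0.514929, 1: 0.485071}

# The baseline model always predicts the most frequent class
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

# Test accuracy reported for CountVectorizer + Multinomial Naive Bayes
test_accuracy = 0.698745

# Improvement over the baseline (the bl_diff column)
bl_diff = test_accuracy - baseline_accuracy

print("majority class:", majority_class)
print("baseline accuracy:", round(baseline_accuracy, 3))
print("bl_diff:", round(bl_diff, 3))
```

A model is only useful if its test accuracy clearly exceeds this baseline, which is what the bl_diff column records.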

In [25]:
# specificity
grid_results_tvec['specificity'] = grid_results_tvec['tn'] / (grid_results_tvec['tn']+grid_results_tvec['fp'])
grid_results_cvec['specificity'] = grid_results_cvec['tn'] / (grid_results_cvec['tn']+grid_results_cvec['fp'])

In [19]:
# false positive rate
grid_results_tvec['fpr'] = grid_results_tvec['fp'] / (grid_results_tvec['tn']+grid_results_tvec['fp'])
grid_results_cvec['fpr'] = grid_results_cvec['fp'] / (grid_results_cvec['tn']+grid_results_cvec['fp'])

In [21]:
# difference of accuracy scores between training and test set = tt_diff
grid_results_tvec['tt_diff'] = grid_results_tvec['train_accuracy'] - grid_results_tvec['test_accuracy']
grid_results_cvec['tt_diff'] = grid_results_cvec['train_accuracy'] - grid_results_cvec['test_accuracy']

In [22]:
# difference between test accuracy and baseline accuracy = bl_diff
grid_results_tvec['bl_diff'] =  grid_results_tvec['test_accuracy'] - df_clean['is_sw'].value_counts(normalize=True)[0]
grid_results_cvec['bl_diff'] =  grid_results_cvec['test_accuracy'] - df_clean['is_sw'].value_counts(normalize=True)[0]

In [26]:
# show grid results for CountVect models
grid_results_cvec.sort_values('test_accuracy',ascending=False)

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,tn,fp,fn,tp,fpr,tt_diff,bl_diff,specificity
0,multi_nb,"{'cvec__max_df': 0.5, 'cvec__max_features': 50...",0.890985,0.698745,193,53,91,141,0.215447,0.192241,0.183815,0.784553
2,logreg,"{'cvec__max_df': 0.5, 'cvec__max_features': 10...",0.995807,0.623431,170,76,104,128,0.308943,0.372376,0.108502,0.691057
1,knn,"{'cvec__max_df': 0.2, 'cvec__max_features': 10...",0.663871,0.585774,232,14,184,48,0.0569106,0.078097,0.070845,0.943089


In [27]:
# show grid results for tf-idf models
grid_results_tvec.sort_values('test_accuracy',ascending=False)

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,tn,fp,fn,tp,fpr,tt_diff,bl_diff,specificity
0,multi_nb,"{'tvec__max_df': 0.5, 'tvec__max_features': 10...",0.830189,0.692469,197,49,98,134,0.199187,0.13772,0.177539,0.800813
2,logreg,"{'tvec__max_df': 0.2, 'tvec__max_features': 50...",0.99441,0.648536,169,77,91,141,0.313008,0.345874,0.133606,0.686992
1,knn,"{'tvec__max_df': 0.25, 'tvec__max_features': 1...",0.619846,0.537657,238,8,213,19,0.0325203,0.082189,0.022728,0.96748


The CountVectorizer and TF-IDF models performed similarly. To decide which models are best to optimize, we sort each results table by test_accuracy. Additionally, two of the three models using the CountVectorizer (multi_nb and knn) registered fewer false negatives than their TF-IDF counterparts. False negatives are "SuicideWatch" posts wrongly predicted to be in the "Depression" subreddit. Before tuning this is less of a concern, but we should take note of it for model tuning and optimization. (The false negative rate is 1 - sensitivity; 1 - specificity is the false positive rate.)
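The relationships between the confusion-matrix rates can be checked with the counts reported above for the CountVectorizer and Multinomial Naive Bayes model. This is a small sketch using only those four numbers; sensitivity and the false negative rate are not columns in the results tables and are computed here for illustration.

```python
# Confusion-matrix counts reported above for CountVectorizer + multi_nb
tn, fp, fn, tp = 193, 53, 91, 141

# Rates over the actual negatives (is_sw = 0)
specificity = tn / (tn + fp)          # true negative rate
false_positive_rate = fp / (tn + fp)  # equals 1 - specificity

# Rates over the actual positives (is_sw = 1)
sensitivity = tp / (tp + fn)          # true positive rate (recall)
false_negative_rate = fn / (tp + fn)  # equals 1 - sensitivity

accuracy = (tp + tn) / (tn + fp + fn + tp)

print("specificity:", round(specificity, 6))
print("false positive rate:", round(false_positive_rate, 6))
print("sensitivity:", round(sensitivity, 6))
print("false negative rate:", round(false_negative_rate, 6))
print("accuracy:", round(accuracy, 6))
```

The specificity, false positive rate, and accuracy computed this way agree with the specificity, fpr, and test_accuracy values in the multi_nb row of the CountVectorizer results table.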

**Since a CountVectorizer model registered the highest test accuracy (0.699, with Multinomial Naive Bayes), we will use CountVectorizer as our vectorizer.**

