# DSI19 Project 3 - Model Selection
---

## Table of Contents

* [1. Establishing Baseline](#chapter1)
* [2. Data Preparation](#chapter2)
    * [2.1 Create Train/Test Data](#chapter2_1)
    * [2.2 Create X and y Variables](#chapter2_2)
* [3. Model Selection](#chapter3)
    * [3.1 Creating Pipelines](#chapter3_1)
    * [3.2 Defining Function](#chapter3_2)
    * [3.3 Gathering Results](#chapter3_3)
    * [3.4 Results Analysis and Model Selection](#chapter3_4)

In [3]:
# Library imports

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score

## 1. Establishing Baseline <a class="anchor" id="chapter1"></a>
---

As this is a binary classification problem, the baseline score is the proportion of the majority class, i.e. the accuracy obtained by always predicting the more common of the 2 classes.

In [4]:
text_df = pd.read_csv('../data/processed_text.csv')

In [5]:
display(text_df.head())

| Unnamed: 0 | text_raw | is_tifu | text_token | text_base | text_lem | text_stem |
|---|---|---|---|---|---|---|
| 0 | TIFU By accidentally being racist to an Asian ... | 1 | ['tifu', 'by', 'accidentally', 'being', 'racis... | tifu by accidentally being racist to an asian ... | tifu by accidentally being racist to an asian ... | tifu by accident be racist to an asian friend ... |
| 1 | TIFU by ordering dish towels from Amazon (NSFW... | 1 | ['tifu', 'by', 'ordering', 'dish', 'towels', '... | tifu by ordering dish towels from amazon nsfw ... | tifu by ordering dish towel from amazon nsfw o... | tifu by order dish towel from amazon nsfw obli... |
| 2 | TIFU because I told the teacher I like drugs P... | 1 | ['tifu', 'because', 'told', 'the', 'teacher', ... | tifu because told the teacher like drugs prett... | tifu because told the teacher like drug pretty... | tifu becaus told the teacher like drug pretti ... |
| 3 | TIFU by losing my phone and getting picked up ... | 1 | ['tifu', 'by', 'losing', 'my', 'phone', 'and',... | tifu by losing my phone and getting picked up ... | tifu by losing my phone and getting picked up ... | tifu by lose my phone and get pick up by campu... |
| 4 | TIFU by setting out a digital picture frame TI... | 1 | ['tifu', 'by', 'setting', 'out', 'digital', 'p... | tifu by setting out digital picture frame tifu... | tifu by setting out digital picture frame tifu... | tifu by set out digit pictur frame tifu by set... |


In [6]:
# Baseline
print(text_df['is_tifu'].value_counts(normalize=True))

1    0.502014
0    0.497986
Name: is_tifu, dtype: float64


The baseline score is 50.2%, the share of posts from the `/r/tifu` subreddit.
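As a minimal sketch of how such a baseline is derived, the snippet below computes the majority-class share for a small hypothetical label list. The `labels` values are made up for illustration and are not the project data.

```python
from collections import Counter

# Hypothetical labels: 1 = /r/tifu, 0 = /r/confessions (not the project data)
labels = [1, 0, 1, 1, 0, 0, 1]

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 3))
```

Any model worth selecting should score clearly above this value on unseen data.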

## 2. Data Preparation <a class="anchor" id="chapter2"></a>
---

For the purposes of model selection, our feature will be `text_base` and target will be `is_tifu`.

20% of the data will be held out as the test set, which will be used for model evaluation once the best model has been selected.

### 2.1 Create Train/Test Data <a class="anchor" id="chapter2_1"></a>

In [7]:
# Split the dataset into a train and test set
test_csv = text_df.sample(frac=.2)
train_csv = text_df.drop(list(test_csv.index),axis=0)

In [8]:
# Check that test data extracted is no longer in train data
display(test_csv[test_csv['text_raw'].isin(train_csv['text_raw'])])

| Unnamed: 0 | text_raw | is_tifu | text_token | text_base | text_lem | text_stem |
|---|---|---|---|---|---|---|


In [9]:
# Save the train and test data as csv files
train_csv.to_csv('../data/train.csv',index=False)
test_csv.to_csv('../data/test.csv',index=False)
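The split above uses `DataFrame.sample` without a seed, so the rows that land in the test set differ from run to run. A possible alternative, sketched here on a hypothetical `toy_df` rather than the project data, is `train_test_split` with `random_state` and `stratify`. This is an assumption about what a reproducible version could look like, not the method used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for text_df (not the project data)
toy_df = pd.DataFrame({
    'text_base': [f'post {i}' for i in range(10)],
    'is_tifu': [1, 0] * 5,
})

# random_state fixes the shuffle; stratify keeps the class balance in both parts
train_part, test_part = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df['is_tifu'], random_state=42
)

print(len(train_part), len(test_part))
print(set(train_part.index) & set(test_part.index))  # empty set: no overlap
```

The seed value `42` is arbitrary; any fixed integer gives a repeatable split.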

### 2.2 Create `X` and `y` Variables <a class="anchor" id="chapter2_2"></a>

Using `train.csv`, the data will be split into a training data set and a validation data set.

In [133]:
# Creating X and y variables
X = train_csv['text_base']
y = train_csv['is_tifu']

## 3. Model Selection <a class="anchor" id="chapter3"></a>
---

With `X` and `y` defined, the model pipelines will now be set up in order to cross validate and select the best performing model. The split into `training` and `validation` sets is performed inside the evaluation function in section 3.2.

Pipelines will be created based on the following.

Feature extraction tools:
- CountVectorizer
- TfidfVectorizer

Classification models:
- MultinomialNB
- LogisticRegression
- KNeighborsClassifier
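The pairing of every feature extraction tool with every classifier can be sketched with `itertools.product`. The lists below hold plain string labels for illustration only; they are placeholders, not the estimator objects themselves.

```python
from itertools import product

# Labels only; these strings are illustrative placeholders
feature_tools = ['CountVectorizer', 'TfidfVectorizer']
classifiers = ['MultinomialNB', 'LogisticRegression', 'KNeighborsClassifier']

combinations = list(product(feature_tools, classifiers))
for tool, clf in combinations:
    print(tool, '+', clf)

print(len(combinations))
```

This gives six pairings. The notebook builds eight pipelines because `KNeighborsClassifier` is tried with two values of `n_neighbors` (3 and 5).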

### 3.1 Creating Pipelines <a class="anchor" id="chapter3_1"></a>

In [134]:
# Creating dictionary for 2 text feature extraction tools, with English stop words
text_feature = {'cvec' : CountVectorizer(stop_words='english'), 
                'tvec': TfidfVectorizer(stop_words='english')}

# Creating dictionary for classification models
model = {'multi_nb':MultinomialNB(),
        'knn3':KNeighborsClassifier(n_neighbors=3,n_jobs=-1),
        'knn5':KNeighborsClassifier(n_neighbors=5,n_jobs=-1),
        'lr':LogisticRegression(max_iter=2000,n_jobs=-1)}

In [135]:
pipelines = [] # Create empty list of pipelines

for text, selector in text_feature.items(): # Iterate through text feature extraction dictionary
    for name, mode in model.items(): # Iterate through classification model dictionary
        pipe = Pipeline([            # Create pipeline
            (text,selector),
            (name,mode)
        ])
        pipelines.append(pipe) # Append pipeline to list

display(pipelines)

[Pipeline(steps=[('cvec', CountVectorizer(stop_words='english')),
                 ('multi_nb', MultinomialNB())]),
 Pipeline(steps=[('cvec', CountVectorizer(stop_words='english')),
                 ('knn3', KNeighborsClassifier(n_jobs=-1, n_neighbors=3))]),
 Pipeline(steps=[('cvec', CountVectorizer(stop_words='english')),
                 ('knn5', KNeighborsClassifier(n_jobs=-1))]),
 Pipeline(steps=[('cvec', CountVectorizer(stop_words='english')),
                 ('lr', LogisticRegression(max_iter=2000, n_jobs=-1))]),
 Pipeline(steps=[('tvec', TfidfVectorizer(stop_words='english')),
                 ('multi_nb', MultinomialNB())]),
 Pipeline(steps=[('tvec', TfidfVectorizer(stop_words='english')),
                 ('knn3', KNeighborsClassifier(n_jobs=-1, n_neighbors=3))]),
 Pipeline(steps=[('tvec', TfidfVectorizer(stop_words='english')),
                 ('knn5', KNeighborsClassifier(n_jobs=-1))]),
 Pipeline(steps=[('tvec', TfidfVectorizer(stop_words='english')),
                 ('lr

### 3.2 Defining Function <a class="anchor" id="chapter3_2"></a>

In [136]:
def model_eval(X,y): # Takes in X, and y data
    
    # Train test split to create train and validation data
    X_train, X_val, y_train, y_val = train_test_split(X, y, stratify = y) 
    
    # Parameter grid for CountVectorizer
    cvec_params = {
        'cvec__max_features': [50, 100, 150],
        'cvec__min_df': [1, 2, 3],
        'cvec__max_df': [.5, .6, .9],
        'cvec__ngram_range': [(1,1), (1,2), (1,3)],
    }
    
    # Parameter grid for TfidfVectorizer
    tvec_params = {
        'tvec__ngram_range' : [(1,1), (1,2), (1,3)],
        'tvec__min_df' : [1, 2, 3],
        'tvec__max_df' : [.5,.6,.8, .9]
    }
    
    # Creating lists to store outputs for each pipe
    cross_score = [] # Cross validation score of the pipeline before tuning
    opt_params = [] # Optimal parameters from gridsearch
    train_score = [] # Best mean cross validation score from gridsearch on training set
    val_score = [] # Score on validation set
    tn_list = [] # True negatives predicted, interpreted as predicted /r/confessions and is /r/confessions
    fp_list = [] # False positives predicted, interpreted as predicted /r/tifu and is /r/confessions
    fn_list = [] # False negatives predicted, interpreted as predicted /r/confessions and is /r/tifu
    tp_list = [] # True positives predicted, interpreted as predicted /r/tifu and is /r/tifu
    
    for pipe in pipelines: # Iteration through all the pipes created earlier
        
        cross = cross_val_score(pipe,X_train,y_train,cv=5).mean() # Obtaining cross validation score of the pipe
        cross_score.append(cross) # Appending cross validation score to list
        
        if pipe.steps[0][0] == 'cvec': # Checking which text feature extraction tool is used in the pipe
            params = cvec_params # Uses cvec param grid if cvec
        else:
            params = tvec_params # Uses tvec param grid if tvec
        
        # Grid search for optimal params
        gs = GridSearchCV(pipe,
                          param_grid=params,
                          cv=5,
                          verbose=1)
        
        # Fit Grid search on training data
        gs.fit(X_train,y_train)
        
        # Obtaining outputs of confusion matrix, based on fitted grid search
        tn, fp, fn, tp = confusion_matrix(y_val, gs.predict(X_val)).ravel()
        
        opt_params.append(gs.best_params_) # Appending best parameters of pipeline
        train_score.append(gs.best_score_) # Appending best mean cross validation score from gridsearch on training data
        val_score.append(gs.score(X_val,y_val)) # Appending score based on validation data
        tn_list.append(tn) # Appending number of true negatives
        fp_list.append(fp) # Appending number of false positives
        fn_list.append(fn) # Appending number of false negatives
        tp_list.append(tp) # Appending number of true positives
        
        print(gs.best_score_)
        print(gs.best_params_)    
        
    # Creating dataframe to store outputs
    eval_df = pd.DataFrame([cross_score,
                            opt_params,
                            train_score,
                            val_score,
                            tn_list,
                            fp_list,
                            fn_list,
                            tp_list],
                          index=['crossval_score','opt_params','train_score','val_score','tn','fp','fn','tp']).T
    
    return eval_df
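The grid search step inside the function can be illustrated in isolation. The following is a minimal sketch on a hypothetical toy corpus (the `docs` and `labels` values are invented), showing the `<step name>__<parameter>` key convention and what `best_params_` and `best_score_` hold. The reduced grid and `cv=2` are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (not the project data): 1 = tifu-like, 0 = confession-like
docs = ['tifu by dropping my phone', 'tifu by missing the exam',
        'tifu by burning dinner', 'tifu by locking my keys inside',
        'i confess i stole candy', 'i confess i lied to my boss',
        'i confess i skipped class', 'i confess i broke the vase']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([('cvec', CountVectorizer()), ('multi_nb', MultinomialNB())])

# Keys follow the '<step name>__<parameter>' convention
param_grid = {'cvec__ngram_range': [(1, 1), (1, 2)]}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=2)
gs.fit(docs, labels)

print(gs.best_params_)  # chosen value for each key in param_grid
print(gs.best_score_)   # mean cross-validated accuracy of the best candidate
```

Note that `best_score_` is a mean over the validation folds of the grid search, not an accuracy measured on the full training set.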
        

### 3.3 Gathering Results <a class="anchor" id="chapter3_3"></a>

The function gathers the following for each of the pipelines created earlier:
- `crossval_score`: Cross validation score of the pipeline before tuning
- `opt_params`: Optimal parameters for text feature extraction tool, either `CountVectorizer` or `TfidfVectorizer`
- `train_score`: Best mean cross validation accuracy from the grid search on the train data set
- `val_score`: Accuracy of the best grid search estimator on the validation data set
- `tn`: Number of True Negatives
- `fp`: Number of False Positives
- `fn`: Number of False Negatives
- `tp`: Number of True Positives
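The order of the four confusion matrix counts matters when unpacking them. A small sketch with hypothetical labels (not the project data) shows the order returned by `confusion_matrix(...).ravel()` for binary 0/1 labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = /r/tifu, 0 = /r/confessions
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# For binary labels 0/1, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
```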

In [137]:
# Run function on X and y data
eval_df = model_eval(X,y)

Fitting 5 folds for each of 81 candidates, totalling 405 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 405 out of 405 | elapsed:  2.5min finished


0.8900284800112512
{'cvec__max_df': 0.6, 'cvec__max_features': 150, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 2)}
Fitting 5 folds for each of 81 candidates, totalling 405 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 405 out of 405 | elapsed:  2.9min finished


0.8740620934566294
{'cvec__max_df': 0.6, 'cvec__max_features': 50, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3)}
Fitting 5 folds for each of 81 candidates, totalling 405 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 405 out of 405 | elapsed:  2.7min finished


0.8790935621110367
{'cvec__max_df': 0.6, 'cvec__max_features': 50, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 3)}
Fitting 5 folds for each of 81 candidates, totalling 405 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 405 out of 405 | elapsed:  2.6min finished


0.9941281952111389
{'cvec__max_df': 0.6, 'cvec__max_features': 50, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 1)}
Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:  1.4min finished


0.8614675995921381
{'tvec__max_df': 0.6, 'tvec__min_df': 3, 'tvec__ngram_range': (1, 1)}
Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:  1.5min finished


0.7682676417847474
{'tvec__max_df': 0.8, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 2)}
Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:  1.4min finished


0.7808551035476952
{'tvec__max_df': 0.8, 'tvec__min_df': 1, 'tvec__ngram_range': (1, 1)}
Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:  2.2min finished


0.9311803382440843
{'tvec__max_df': 0.8, 'tvec__min_df': 3, 'tvec__ngram_range': (1, 2)}
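The candidate and fit counts in the log follow directly from the sizes of the two parameter grids and the 5 folds. A short worked check, with the grids copied from `model_eval` and a hypothetical helper `n_candidates`:

```python
from math import prod

# Grids copied from model_eval
cvec_params = {
    'cvec__max_features': [50, 100, 150],
    'cvec__min_df': [1, 2, 3],
    'cvec__max_df': [.5, .6, .9],
    'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
}
tvec_params = {
    'tvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tvec__min_df': [1, 2, 3],
    'tvec__max_df': [.5, .6, .8, .9],
}
folds = 5

def n_candidates(grid):
    """Number of parameter combinations in a grid (hypothetical helper)."""
    return prod(len(values) for values in grid.values())

for name, grid in [('cvec', cvec_params), ('tvec', tvec_params)]:
    candidates = n_candidates(grid)
    print(name, candidates, candidates * folds)
```

The `CountVectorizer` grid gives 3 x 3 x 3 x 3 = 81 candidates (405 fits) and the `TfidfVectorizer` grid gives 3 x 3 x 4 = 36 candidates (180 fits), matching the log.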


In [138]:
display(eval_df)

| | crossval_score | opt_params | train_score | val_score | tn | fp | fn | tp |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.858943 | {'cvec__max_df': 0.6, 'cvec__max_features': 15... | 0.890028 | 0.854271 | 146 | 52 | 6 | 194 |
| 1 | 0.506297 | {'cvec__max_df': 0.6, 'cvec__max_features': 50... | 0.874062 | 0.889447 | 183 | 15 | 29 | 171 |
| 2 | 0.501262 | {'cvec__max_df': 0.6, 'cvec__max_features': 50... | 0.879094 | 0.899497 | 186 | 12 | 28 | 172 |
| 3 | 0.979853 | {'cvec__max_df': 0.6, 'cvec__max_features': 50... | 0.994128 | 0.994975 | 197 | 1 | 1 | 199 |
| 4 | 0.832945 | {'tvec__max_df': 0.6, 'tvec__min_df': 3, 'tvec... | 0.861468 | 0.819095 | 140 | 58 | 14 | 186 |
| 5 | 0.766598 | {'tvec__max_df': 0.8, 'tvec__min_df': 1, 'tvec... | 0.768268 | 0.738693 | 137 | 61 | 43 | 157 |
| 6 | 0.780855 | {'tvec__max_df': 0.8, 'tvec__min_df': 1, 'tvec... | 0.780855 | 0.746231 | 148 | 50 | 51 | 149 |
| 7 | 0.910197 | {'tvec__max_df': 0.8, 'tvec__min_df': 3, 'tvec... | 0.93118 | 0.919598 | 177 | 21 | 11 | 189 |


Given the gathered results, a few columns will be added to help evaluate the models:
- label the respective pipelines
- compute sensitivity score
- compute specificity score

In [139]:
# Create list of tuples for labeling the pipelines
steps = []
for pipe in pipelines:
    # Append the step names directly so the target variable y is not overwritten
    steps.append((pipe.steps[0][0], pipe.steps[1][0]))
    
eval_df['steps'] = steps # Add column of labels to evaluation dataframe

# Compute sensitivity score
eval_df['sensitivity'] = eval_df['tp'] / (eval_df['tp']+eval_df['fn'])

# Compute specificity score
eval_df['specificity'] = eval_df['tn'] / (eval_df['tn']+eval_df['fp'])

# Re-order columns for visualisation
eval_df = eval_df[['steps', 
                   'crossval_score',
                   'opt_params', 
                   'train_score', 
                   'val_score', 
                   'tn', 
                   'fp', 
                   'fn', 
                   'tp', 
                   'sensitivity', 
                   'specificity']]

### 3.4 Results Analysis and Model Selection <a class="anchor" id="chapter3_4"></a>

In [140]:
display(eval_df)

| | steps | crossval_score | opt_params | train_score | val_score | tn | fp | fn | tp | sensitivity | specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (cvec, multi_nb) | 0.858943 | {'cvec__max_df': 0.6, 'cvec__max_features': 15... | 0.890028 | 0.854271 | 146 | 52 | 6 | 194 | 0.97 | 0.737374 |
| 1 | (cvec, knn3) | 0.506297 | {'cvec__max_df': 0.6, 'cvec__max_features': 50... | 0.874062 | 0.889447 | 183 | 15 | 29 | 171 | 0.855 | 0.924242 |
| 2 | (cvec, knn5) | 0.501262 | {'cvec__max_df': 0.6, 'cvec__max_features': 50... | 0.879094 | 0.899497 | 186 | 12 | 28 | 172 | 0.86 | 0.939394 |
| 3 | (cvec, lr) | 0.979853 | {'cvec__max_df': 0.6, 'cvec__max_features': 50... | 0.994128 | 0.994975 | 197 | 1 | 1 | 199 | 0.995 | 0.994949 |
| 4 | (tvec, multi_nb) | 0.832945 | {'tvec__max_df': 0.6, 'tvec__min_df': 3, 'tvec... | 0.861468 | 0.819095 | 140 | 58 | 14 | 186 | 0.93 | 0.707071 |
| 5 | (tvec, knn3) | 0.766598 | {'tvec__max_df': 0.8, 'tvec__min_df': 1, 'tvec... | 0.768268 | 0.738693 | 137 | 61 | 43 | 157 | 0.785 | 0.691919 |
| 6 | (tvec, knn5) | 0.780855 | {'tvec__max_df': 0.8, 'tvec__min_df': 1, 'tvec... | 0.780855 | 0.746231 | 148 | 50 | 51 | 149 | 0.745 | 0.747475 |
| 7 | (tvec, lr) | 0.910197 | {'tvec__max_df': 0.8, 'tvec__min_df': 3, 'tvec... | 0.93118 | 0.919598 | 177 | 21 | 11 | 189 | 0.945 | 0.893939 |


- Based on the results gathered, the pipeline with `CountVectorizer` and `LogisticRegression` returned the highest cross validation score (0.980).
- It also achieved the highest grid search score on the training data (0.994) and the highest accuracy on the validation data (0.995), with only a small difference between the two scores.
- This suggests that the model has low bias and low variance, which is desirable.
- The sensitivity score for this pipeline is 0.995, indicating that all but one of the /r/tifu posts in the validation set were predicted correctly.
- The specificity score for this pipeline is 0.995, indicating that only one /r/confessions post in the validation set was misclassified.

Given these results, the pipeline selected is pipeline 3.

In [143]:
print(pipelines[3]) # Indicating steps in pipeline for selected model

Pipeline(steps=[('cvec', CountVectorizer(stop_words='english')),
                ('lr', LogisticRegression(max_iter=2000, n_jobs=-1))])


The optimal `CountVectorizer` parameters are as follows.

In [142]:
print(eval_df['opt_params'][3])

{'cvec__max_df': 0.6, 'cvec__max_features': 50, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 1)}
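One possible way to rebuild the selected pipeline with these parameters is `Pipeline.set_params`. The sketch below fits it on a hypothetical toy corpus (`docs` and `labels` are invented, not the project data) purely to show the mechanics. The `n_jobs` argument is left out here as a simplification.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (not the project data): 1 = tifu-like, 0 = confession-like
docs = ['tifu by dropping my phone', 'tifu by missing the exam',
        'tifu by burning dinner', 'i confess i stole candy',
        'i confess i lied to my boss', 'i confess i skipped class']
labels = [1, 1, 1, 0, 0, 0]

# Optimal parameters reported by the grid search above
best_params = {'cvec__max_df': 0.6, 'cvec__max_features': 50,
               'cvec__min_df': 1, 'cvec__ngram_range': (1, 1)}

final_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words='english')),
    ('lr', LogisticRegression(max_iter=2000)),
])
final_pipe.set_params(**best_params)  # apply the tuned vectorizer settings
final_pipe.fit(docs, labels)

vocab_size = len(final_pipe.named_steps['cvec'].vocabulary_)
print(vocab_size)                # capped at max_features
print(final_pipe.predict(docs))  # predictions on the toy corpus
```

In the project itself, the pipeline would be fitted on the training data and then scored once on the held-out `test.csv`.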
