<a href="https://colab.research.google.com/github/estherkxy/GA_Projects/blob/main/Capstone/code/2_MLmodels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Models

Notebook 2/6

In [21]:
#!pip install transformers
#!pip install tweet-preprocessor

In [1]:
# Importing libraries needed for text preprocessing and modelling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import regex as re
import collections
import transformers
import preprocessor as p
import nltk

nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
#from IPython.display import IFrame
from collections import Counter
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# plot_roc_curve is not imported: it is unused in this notebook and was removed in scikit-learn 1.2
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, \
                            accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, \
                             GradientBoostingClassifier, AdaBoostClassifier, \
                             VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as Pipeline1


import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Loading datasets
tw = pd.read_csv('./drive/My Drive/GA/capstone/data/tweets_train.csv', encoding = 'latin-1')
train_df = pd.read_csv('./drive/My Drive/GA/capstone/data/train_tweets_clean.csv')
test_df = pd.read_csv('./drive/My Drive/GA/capstone/data/tweets_test.csv')

In [4]:
stop_words = stopwords.words('english')
extra_words = ['get', 'a', 'it', 'why', 'wont', 'u', 'ee', 'apple', 'aapl', 
               'via', 'inc', 'make', 'rt']
stop_words.extend(extra_words)

The next step is to select and tune a model that predicts the sentiment of each tweet. This is, in effect, a **binary classification** problem. To find the best model, we'll carry out the following steps:

1. Run a Train-Test-Split on our data
2. Transform data using a vectorizer
3. Fit model to training data
4. Generate predictions using test data
5. Evaluate model based on various evaluation metrics (accuracy, precision, recall, ROC-AUC).
6. Select the best model and tune hyper-parameters

Besides `CountVectorizer()`, we'll also be using `TfidfVectorizer()`. The two are similar, except that `TfidfVectorizer()` scales each raw term count by the term's inverse document frequency. This means that it downweights words that appear in many tweets, while upweighting rarer words.
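To make the difference concrete, here is a minimal sketch on a three-document toy corpus. The documents are invented for illustration and are not taken from the tweet data. It compares the inverse document frequency that `TfidfVectorizer()` learns for a word found in every document with that of a word found in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical): "battery" is in every document, "warranty" in one
docs = [
    "battery life is great",
    "battery drains fast",
    "battery replaced under warranty",
]

# CountVectorizer stores raw term counts only
cvec = CountVectorizer()
counts = cvec.fit_transform(docs)
last_doc_counts = {term: counts[2, idx] for term, idx in cvec.vocabulary_.items()}

# TfidfVectorizer additionally learns one IDF weight per term
tvec = TfidfVectorizer()
tvec.fit(docs)
idf = {term: tvec.idf_[idx] for term, idx in tvec.vocabulary_.items()}

print("counts in last doc:", last_doc_counts["battery"], last_doc_counts["warranty"])
print("idf weights:", round(idf["battery"], 3), round(idf["warranty"], 3))
```

Both words have a raw count of 1 in the last document, but the TF-IDF weighting gives the rarer word "warranty" a larger weight than the ubiquitous "battery".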

We'll test three classification techniques: Logistic Regression, K-Nearest Neighbors and Complement Naive Bayes.

F1-score and accuracy will be our main metrics here, with more weight on F1-score because we want to minimize false negatives. Here, false negatives are tweets with negative sentiment (y = 1) that are incorrectly predicted as non-negative (y = 0). These are the more damaging errors: a negative tweet predicted as non-negative would be missed when filtering for negative tweets for purposes such as customer satisfaction monitoring.
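A small sketch of how these quantities relate. It assumes that label 1 marks a negative tweet and is therefore the positive class for scikit-learn's metrics, which is what the baseline section implies when it describes the 65% majority class 0 as non-negative. The labels below are invented for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

# Assumption: 1 = negative sentiment (positive class), 0 = non-negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)  # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)       # 2*tp / (2*tp + fp + fn)

print(f"fn={fn}, accuracy={acc:.2f}, recall={rec:.2f}, f1={f1:.3f}")
```

Accuracy looks acceptable at 0.70 even though half of the negative tweets are missed (recall 0.50). The F1-score reflects those misses, which is why it is the more informative metric here.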


#### Baseline

In [5]:
# Baseline
y = train_df['sentiment']
y.value_counts(normalize=True)

0    0.651178
1    0.348822
Name: sentiment, dtype: float64

To have something to compare our models against, we can use the normalized class frequencies of y. These represent the simplest model we can use: always predicting the majority class (0, non-negative) would be correct about 65% of the time. This will serve as the baseline for our model evaluation. It also shows that our classes are imbalanced.
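The same baseline can be expressed as a model. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical labels with a 65/35 split (not the actual tweet data) to show that a majority-class predictor reaches 65% accuracy while scoring zero on F1.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Hypothetical labels mimicking the 65/35 class split
y_toy = np.array([0] * 65 + [1] * 35)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

# The baseline never predicts class 1, so its F1 for that class is 0
baseline_f1 = f1_score(y_toy, baseline.predict(X_toy), zero_division=0)

print(baseline_acc, baseline_f1)
```

This is one reason accuracy alone is a weak yardstick on imbalanced data: a model has to beat 65% accuracy and also achieve a non-trivial F1.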

Two ways to deal with the class imbalance are to use SMOTE or to assign class weights. We will test both methods and compare the results. As this is a classification problem, we will be using the following models:

- Logistic Regression
- Complement Naive Bayes classifier
- K Nearest Neighbor


Note: The Complement Naive Bayes classifier was designed to correct the “severe assumptions” made by the standard Multinomial Naive Bayes classifier. It is particularly suited for imbalanced data sets.
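For the class-weight option, scikit-learn's `class_weight='balanced'` setting weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on hypothetical labels with roughly the same split as our target.

```python
from collections import Counter

# Hypothetical labels with roughly the 65/35 split seen in the baseline
y_toy = [0] * 651 + [1] * 349

counts = Counter(y_toy)
n_samples, n_classes = len(y_toy), len(counts)

# 'balanced' heuristic: n_samples / (n_classes * count of that class)
class_weights = {label: n_samples / (n_classes * count)
                 for label, count in counts.items()}

print({label: round(w, 3) for label, w in class_weights.items()})
```

The minority class ends up with a weight of about 1.43 against 0.77 for the majority class, so errors on minority-class tweets cost more during training. SMOTE instead balances the classes by synthesizing new minority-class samples.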

# Train-Validation split

In [6]:
X = train_df['text']
y = train_df['sentiment']

In [7]:
# Split our data into train and validation data: 80:20 split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [8]:
X_train.shape

(2513,)
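The `stratify=y` argument keeps the class proportions the same in both splits. A minimal sketch on hypothetical labels (the `_toy` names are illustrative and not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 65 zeros and 35 ones
y_toy = np.array([0] * 65 + [1] * 35)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Share of class 1 in each split
train_share = y_tr.mean()
val_share = y_va.mean()
print(len(y_tr), len(y_va), train_share, val_share)
```

Without stratification the minority share could drift between the splits, which would make validation scores harder to compare with the baseline.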

# Model Preparation & Tuning

Here, I instantiated a range of different vectorizers and models. To simplify my workflow slightly, I opted to create a function that utilises Sklearn's Pipeline tool that allows for easy fitting and transformation of data.
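As a minimal sketch of the Pipeline idea, the example below chains a vectorizer and a classifier on a handful of invented sentences. The step names `cvec` and `lr` mirror the keys used in this notebook, and the `step__parameter` naming is what lets a grid search address each step's settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy data: 1 = negative sentiment, 0 = non-negative (assumption)
texts = ["love the new phone", "battery died again", "great camera",
         "screen cracked already", "really happy with it", "worst update ever"]
labels = [0, 1, 0, 1, 0, 1]

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("lr", LogisticRegression(random_state=42)),
])

# Parameters of a step are addressed as <step name>__<parameter name>
pipe.set_params(cvec__ngram_range=(1, 2), lr__C=0.5)

# fit() vectorizes the text, then trains the classifier on the result
pipe.fit(texts, labels)
preds = pipe.predict(["battery died", "love it"])
print(preds)
```

Because the vectorizer is fitted inside the pipeline, it is refitted on each training fold during cross-validation, so no vocabulary information leaks from the held-out fold.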

In [9]:
# Instantiate vectorizers
vectorizers = {'cvec': CountVectorizer(stop_words = stop_words),
               'tvec': TfidfVectorizer(stop_words = stop_words)}

In [10]:
# Vectorizer hyperparameters
cvec_params = {
    # Setting a limit of n-number of features included/vocab size
    'cvec__max_features': [None, 1_000],

    # Setting a minimum number of times the word/token has to appear in n-documents
    'cvec__min_df':[2, 3, 4],
    
    # Setting an upper threshold/max percentage of n% of documents from corpus 
    'cvec__max_df': [0.2, 0.3, 0.4],
    
    #'cvec__stop_words': [stop_words],
    
    # Testing with unigrams only, and with unigrams plus bigrams
    'cvec__ngram_range':[(1,1), (1,2)]
}

tvec_params = {
    'tvec__max_features': [None],
    'tvec__min_df':[3, 4, 5],
    'tvec__max_df': [0.2, 0.3, 0.4],
    #'tvec__stop_words': [stop_words],
    'tvec__ngram_range': [(1,1), (1,2)]
}
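To illustrate what `min_df`, `max_df` and `ngram_range` do to the vocabulary, here is a small sketch on an invented five-document corpus. The thresholds are chosen to suit the toy data and are not the values tuned above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus: "apple" appears in every document
toy_docs = [
    "apple stock up",
    "apple stock down",
    "apple phone great",
    "apple phone bad",
    "apple watch",
]

# Keep terms found in at least 2 documents and in at most 80% of documents
uni = CountVectorizer(min_df=2, max_df=0.8)
uni.fit(toy_docs)
uni_vocab = sorted(uni.vocabulary_)

# Same thresholds, but also consider bigrams
bi = CountVectorizer(min_df=2, max_df=0.8, ngram_range=(1, 2))
bi.fit(toy_docs)
bi_vocab = sorted(bi.vocabulary_)

print(uni_vocab)
print(bi_vocab)
```

"apple" is dropped by `max_df` because it occurs in every document, and the one-off words are dropped by `min_df`. With bigrams enabled, "apple stock" and "apple phone" survive because each occurs in two documents.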

In [11]:
# Instantiate models (default)
models = {'lr': LogisticRegression(random_state = 42),
          'knn': KNeighborsClassifier(),
          'cnb': ComplementNB()
        }

In [12]:
# model hyperparameters
lr_params = {'lr__C':[0.5, 1, 10],
             'lr__class_weight':[None, 'balanced'],
             'lr__penalty':['l2', None],
             'lr__solver':['newton-cg', 'sag']
             }

cnb_params = {'cnb__alpha': [0.1, 1, 2],
              'cnb__fit_prior': [True, False],
              'cnb__norm': [True, False]
              }

knn_params = {'knn__algorithm':['auto'],
              'knn__weights':['uniform', 'distance']
              }

There are a few parameters here that need a bit of explanation:

- Alpha sets the amount of additive (Laplace) smoothing in our Naive Bayes model. Smoothing acts as a form of regularization: it prevents zero probabilities when a word in a test tweet was never seen with a given class in the training data.


- C is the inverse of the regularization strength in Logistic Regression. When C is high, regularization is weak, so the model fits the training data more closely at a higher risk of overfitting. Conversely, when C is low, regularization is strong, which 'smooths' out the decision boundary and could potentially lead to underfitting.


- Gamma is a kernel parameter for SVM that controls how far the influence of a single training point reaches. When gamma is high, only nearby points have a strong influence on the decision boundary. Conversely, a low gamma means far away points are also considered when calculating the decision boundary.
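The effect of C can be seen in the size of the fitted coefficients. The sketch below uses a synthetic dataset from `make_classification` (not the tweet data) and compares the L2 norm of the Logistic Regression coefficients for three values of C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=42)

coef_norms = {}
for C in [0.01, 1, 100]:
    lr = LogisticRegression(C=C, max_iter=5000).fit(X_toy, y_toy)
    # Larger C means a weaker L2 penalty, so coefficients are free to grow
    coef_norms[C] = float(np.linalg.norm(lr.coef_))

for C, norm in coef_norms.items():
    print(f"C={C}: coefficient norm = {norm:.3f}")
```

Small C shrinks the coefficients towards zero (a smoother, more conservative model), while large C lets them grow to fit the training data more closely.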


In [52]:
# Function to run model -- input vectorizer and model
def run_model(vec, mod, vec_params=None, mod_params=None, smote_model=True):
    """Grid-search a vectorizer + model pipeline and report validation metrics."""
    # Avoid mutable default arguments
    vec_params = vec_params or {}
    mod_params = mod_params or {}

    steps = [(vec, vectorizers[vec])]
    if smote_model:
        # SMOTE sits inside the imblearn pipeline so that only the training
        # folds are oversampled; set random_state so our score does not change
        steps.append(('sampling', SMOTE(random_state=42)))
    steps.append((mod, models[mod]))
    pipe = Pipeline1(steps) if smote_model else Pipeline(steps)

    gs = GridSearchCV(pipe, param_grid={**vec_params, **mod_params},
                      scoring='f1', cv=5, verbose=1, n_jobs=-1)
    gs.fit(X_train, y_train)

    # Retrieve metrics
    predictions = gs.predict(X_val)
    train_predict = gs.predict(X_train)

    results = {}
    results['model'] = mod
    results['vectorizer'] = vec
    # gs.score() uses the search's scoring metric, so 'train' and 'test' are F1 scores
    results['train'] = gs.score(X_train, y_train)
    results['test'] = gs.score(X_val, y_val)
    results['train_acc'] = accuracy_score(y_train, train_predict)
    results['val_acc'] = accuracy_score(y_val, predictions)
    # Computed from hard class predictions rather than predicted probabilities
    results['roc'] = roc_auc_score(y_val, predictions)
    results['precision'] = precision_score(y_val, predictions)
    results['recall'] = recall_score(y_val, predictions)

    # Store results in the list that matches the resampling strategy
    if smote_model:
        smote_tuning_list.append(results)
    else:
        tuning_list.append(results)

    print('### BEST PARAMS ###')
    display(gs.best_params_)
    display(gs.best_score_)

    print('### METRICS ###')
    display(results)

    tn, fp, fn, tp = confusion_matrix(y_val, predictions).ravel()
    print(f"True Negatives: {tn}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Positives: {tp}")

    return gs
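Before launching a search, it can help to check how many candidates the merged grid produces. The sketch below counts them with scikit-learn's `ParameterGrid`, using the `cvec_params` and `lr_params` grids defined earlier (repeated here so the snippet is self-contained). The 5 comes from `cv=5` in `run_model`.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as defined earlier in the notebook
cvec_params = {
    'cvec__max_features': [None, 1_000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [0.2, 0.3, 0.4],
    'cvec__ngram_range': [(1, 1), (1, 2)],
}
lr_params = {
    'lr__C': [0.5, 1, 10],
    'lr__class_weight': [None, 'balanced'],
    'lr__penalty': ['l2', None],
    'lr__solver': ['newton-cg', 'sag'],
}

# Merging the dicts gives the full Cartesian product of both grids
grid = ParameterGrid({**cvec_params, **lr_params})
n_candidates = len(grid)
n_fits = n_candidates * 5  # one fit per candidate per CV fold

print(n_candidates, n_fits)
```

The count grows multiplicatively with every option added, which is worth keeping in mind given that each search here takes several minutes.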

In [53]:
# Instantiate list to store tuning results
tuning_list = []
smote_tuning_list = []

In [54]:
# Logistic Regression with CVEC - no smote
%%time
cvec_lr = run_model('cvec', 'lr', vec_params=cvec_params, mod_params=lr_params, 
                    smote_model=False)

Fitting 5 folds for each of 864 candidates, totalling 4320 fits
### BEST PARAMS ###


{'cvec__max_df': 0.2,
 'cvec__max_features': None,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 2),
 'lr__C': 1,
 'lr__class_weight': 'balanced',
 'lr__penalty': 'l2',
 'lr__solver': 'sag'}

0.7203455026684711

### METRICS ###


{'model': 'lr',
 'precision': 0.7263681592039801,
 'recall': 0.6666666666666666,
 'roc': 0.7662601626016259,
 'test': 0.6952380952380953,
 'train': 0.9518477043673013,
 'train_acc': 0.9657779546358933,
 'val_acc': 0.7965023847376789,
 'vectorizer': 'cvec'}

True Negatives: 355
False Positives: 55
False Negatives: 73
True Positives: 146
CPU times: user 31.6 s, sys: 2.06 s, total: 33.6 s
Wall time: 5min 2s
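As a sanity check, the reported metrics can be re-derived by hand from the confusion-matrix counts printed above. The sketch below does this in plain Python. It also shows that the `roc` value equals the mean of recall and specificity, because `roc_auc_score` is given hard class predictions rather than probabilities.

```python
# Counts printed for Logistic Regression with CVEC (no SMOTE)
tn, fp, fn, tp = 355, 55, 73, 146

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# With a single operating point, ROC AUC reduces to (TPR + TNR) / 2
specificity = tn / (tn + fp)
auc_from_labels = (recall + specificity) / 2

print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f}")
print(f"accuracy={accuracy:.6f} roc={auc_from_labels:.6f}")
```

The values agree with the `precision`, `recall`, `test`, `val_acc` and `roc` entries in the metrics dictionary, which confirms that `test` holds the validation F1-score. A probability-based ROC AUC would need `predict_proba` or `decision_function` outputs instead of class labels.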


In [55]:
# Logistic Regression with CVEC - smote
%%time
cvec_lr_sm = run_model('cvec', 'lr', vec_params=cvec_params, 
                       mod_params=lr_params, smote_model=True)

Fitting 5 folds for each of 864 candidates, totalling 4320 fits
### BEST PARAMS ###


{'cvec__max_df': 0.2,
 'cvec__max_features': None,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1),
 'lr__C': 1,
 'lr__class_weight': None,
 'lr__penalty': 'l2',
 'lr__solver': 'newton-cg'}

0.6696381723887457

### METRICS ###


{'model': 'lr',
 'precision': 0.6178861788617886,
 'recall': 0.6940639269406392,
 'roc': 0.7323978171288562,
 'test': 0.6537634408602151,
 'train': 0.8949972512369434,
 'train_acc': 0.9239952248308795,
 'val_acc': 0.7440381558028617,
 'vectorizer': 'cvec'}

True Negatives: 316
False Positives: 94
False Negatives: 67
True Positives: 152
CPU times: user 34.3 s, sys: 1.04 s, total: 35.4 s
Wall time: 5min 53s


In [56]:
# Logistic Regression with TVEC - no smote
%%time
tvec_lr = run_model('tvec', 'lr', vec_params=tvec_params, mod_params=lr_params, 
                    smote_model=False)

Fitting 5 folds for each of 432 candidates, totalling 2160 fits
### BEST PARAMS ###


{'lr__C': 0.5,
 'lr__class_weight': 'balanced',
 'lr__penalty': 'l2',
 'lr__solver': 'newton-cg',
 'tvec__max_df': 0.2,
 'tvec__max_features': None,
 'tvec__min_df': 3,
 'tvec__ngram_range': (1, 2)}

0.7126432679540826

### METRICS ###


{'model': 'lr',
 'precision': 0.6807511737089202,
 'recall': 0.6621004566210046,
 'roc': 0.7481233990422096,
 'test': 0.6712962962962964,
 'train': 0.8509423186750428,
 'train_acc': 0.8961400716275368,
 'val_acc': 0.7742448330683624,
 'vectorizer': 'tvec'}

True Negatives: 342
False Positives: 68
False Negatives: 74
True Positives: 145
CPU times: user 16.3 s, sys: 1 s, total: 17.3 s
Wall time: 2min 6s


In [57]:
# Logistic Regression with TVEC - smote
%%time
tvec_lr_sm = run_model('tvec', 'lr', vec_params=tvec_params, 
                       mod_params=lr_params, smote_model=True)

Fitting 5 folds for each of 432 candidates, totalling 2160 fits
### BEST PARAMS ###


{'lr__C': 0.5,
 'lr__class_weight': None,
 'lr__penalty': 'l2',
 'lr__solver': 'newton-cg',
 'tvec__max_df': 0.2,
 'tvec__max_features': None,
 'tvec__min_df': 3,
 'tvec__ngram_range': (1, 1)}

0.6947480783037555

### METRICS ###


{'model': 'lr',
 'precision': 0.6462882096069869,
 'recall': 0.6757990867579908,
 'roc': 0.7391190555741174,
 'test': 0.6607142857142857,
 'train': 0.847206385404789,
 'train_acc': 0.8933545563072025,
 'val_acc': 0.7583465818759937,
 'vectorizer': 'tvec'}

True Negatives: 329
False Positives: 81
False Negatives: 71
True Positives: 148
CPU times: user 17 s, sys: 496 ms, total: 17.5 s
Wall time: 2min 31s


In [58]:
# Complement Naive Bayes with CVEC - no smote
%%time
cvec_cnb = run_model('cvec', 'cnb', vec_params=cvec_params, 
                     mod_params=cnb_params, smote_model=False)

Fitting 5 folds for each of 432 candidates, totalling 2160 fits
### BEST PARAMS ###


{'cnb__alpha': 2,
 'cnb__fit_prior': True,
 'cnb__norm': True,
 'cvec__max_df': 0.2,
 'cvec__max_features': None,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1)}

0.7174690523166714

### METRICS ###


{'model': 'cnb',
 'precision': 0.6539923954372624,
 'recall': 0.7853881278538812,
 'roc': 0.7817184541708431,
 'test': 0.7136929460580913,
 'train': 0.8413793103448276,
 'train_acc': 0.8810187027457222,
 'val_acc': 0.78060413354531,
 'vectorizer': 'cvec'}

True Negatives: 319
False Positives: 91
False Negatives: 47
True Positives: 172
CPU times: user 14.8 s, sys: 1.01 s, total: 15.8 s
Wall time: 2min 2s


In [59]:
# Complement Naive Bayes with CVEC - smote
%%time
cvec_cnb_sm = run_model('cvec', 'cnb', vec_params=cvec_params, 
                        mod_params=cnb_params, smote_model=True)

Fitting 5 folds for each of 432 candidates, totalling 2160 fits
### BEST PARAMS ###


{'cnb__alpha': 2,
 'cnb__fit_prior': True,
 'cnb__norm': False,
 'cvec__max_df': 0.2,
 'cvec__max_features': None,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1)}

0.7184261995852019

### METRICS ###


{'model': 'cnb',
 'precision': 0.6579925650557621,
 'recall': 0.8082191780821918,
 'roc': 0.7919144670898766,
 'test': 0.7254098360655737,
 'train': 0.8416050686378036,
 'train_acc': 0.8806207719856745,
 'val_acc': 0.7869634340222575,
 'vectorizer': 'cvec'}

True Negatives: 318
False Positives: 92
False Negatives: 42
True Positives: 177
CPU times: user 15.8 s, sys: 620 ms, total: 16.4 s
Wall time: 2min 25s


In [60]:
# Complement Naive Bayes with TVEC - no smote
%%time
tvec_cnb = run_model('tvec', 'cnb', vec_params=tvec_params, 
                     mod_params=cnb_params, smote_model=False)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
### BEST PARAMS ###


{'cnb__alpha': 2,
 'cnb__fit_prior': True,
 'cnb__norm': True,
 'tvec__max_df': 0.2,
 'tvec__max_features': None,
 'tvec__min_df': 4,
 'tvec__ngram_range': (1, 1)}

0.7190590045128832

### METRICS ###


{'model': 'cnb',
 'precision': 0.6363636363636364,
 'recall': 0.7351598173515982,
 'roc': 0.7553847867245796,
 'test': 0.6822033898305085,
 'train': 0.8014941302027748,
 'train_acc': 0.8519697572622363,
 'val_acc': 0.7615262321144675,
 'vectorizer': 'tvec'}

True Negatives: 318
False Positives: 92
False Negatives: 58
True Positives: 161
CPU times: user 7.69 s, sys: 197 ms, total: 7.88 s
Wall time: 58.3 s


In [61]:
# Complement Naive Bayes with TVEC - smote
%%time
tvec_cnb_sm = run_model('tvec', 'cnb', vec_params=tvec_params, 
                        mod_params=cnb_params, smote_model=True)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
### BEST PARAMS ###


{'cnb__alpha': 1,
 'cnb__fit_prior': True,
 'cnb__norm': False,
 'tvec__max_df': 0.2,
 'tvec__max_features': None,
 'tvec__min_df': 4,
 'tvec__ngram_range': (1, 1)}

0.7167462011686648

### METRICS ###


{'model': 'cnb',
 'precision': 0.623574144486692,
 'recall': 0.7488584474885844,
 'roc': 0.753697516427219,
 'test': 0.6804979253112033,
 'train': 0.8074468085106383,
 'train_acc': 0.8559490648627139,
 'val_acc': 0.7551669316375199,
 'vectorizer': 'tvec'}

True Negatives: 311
False Positives: 99
False Negatives: 55
True Positives: 164
CPU times: user 8.12 s, sys: 270 ms, total: 8.39 s
Wall time: 1min 9s


In [62]:
# KNearestNeighbor Classifier with CVEC - no smote
%%time
cvec_knn = run_model('cvec', 'knn', vec_params=cvec_params, 
                     mod_params=knn_params, smote_model=False)

Fitting 5 folds for each of 72 candidates, totalling 360 fits
### BEST PARAMS ###


{'cvec__max_df': 0.2,
 'cvec__max_features': None,
 'cvec__min_df': 4,
 'cvec__ngram_range': (1, 2),
 'knn__algorithm': 'auto',
 'knn__weights': 'distance'}

0.46151446202845053

### METRICS ###


{'model': 'knn',
 'precision': 0.9012345679012346,
 'recall': 0.3333333333333333,
 'roc': 0.656910569105691,
 'test': 0.48666666666666664,
 'train': 0.9861431870669746,
 'train_acc': 0.990449661758854,
 'val_acc': 0.7551669316375199,
 'vectorizer': 'cvec'}

True Negatives: 402
False Positives: 8
False Negatives: 146
True Positives: 73
CPU times: user 3.05 s, sys: 93.5 ms, total: 3.15 s
Wall time: 27.9 s


In [63]:
# KNearestNeighbor Classifier with CVEC - smote
%%time
cvec_knn_sm = run_model('cvec', 'knn', vec_params=cvec_params, 
                        mod_params=knn_params, smote_model=True)

Fitting 5 folds for each of 72 candidates, totalling 360 fits
### BEST PARAMS ###


{'cvec__max_df': 0.2,
 'cvec__max_features': 1000,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 2),
 'knn__algorithm': 'auto',
 'knn__weights': 'distance'}

0.6006452755128876

### METRICS ###


{'model': 'knn',
 'precision': 0.47570332480818417,
 'recall': 0.8493150684931506,
 'roc': 0.6746575342465753,
 'test': 0.6098360655737705,
 'train': 0.9544444444444444,
 'train_acc': 0.9673696776760844,
 'val_acc': 0.6216216216216216,
 'vectorizer': 'cvec'}

True Negatives: 205
False Positives: 205
False Negatives: 33
True Positives: 186
CPU times: user 3.33 s, sys: 89.4 ms, total: 3.41 s
Wall time: 33.5 s


In [64]:
# KNearestNeighbor Classifier with TVEC - no smote
%%time
tvec_knn = run_model('tvec', 'knn', vec_params=tvec_params, 
                     mod_params=knn_params, smote_model=False)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
### BEST PARAMS ###


{'knn__algorithm': 'auto',
 'knn__weights': 'distance',
 'tvec__max_df': 0.2,
 'tvec__max_features': None,
 'tvec__min_df': 5,
 'tvec__ngram_range': (1, 1)}

0.3587523524730831

### METRICS ###


{'model': 'knn',
 'precision': 0.7536231884057971,
 'recall': 0.2374429223744292,
 'roc': 0.5979897538701414,
 'test': 0.3611111111111111,
 'train': 0.9826187717265353,
 'train_acc': 0.9880620771985674,
 'val_acc': 0.7074721780604134,
 'vectorizer': 'tvec'}

True Negatives: 393
False Positives: 17
False Negatives: 167
True Positives: 52
CPU times: user 1.72 s, sys: 43.4 ms, total: 1.76 s
Wall time: 13.6 s


In [65]:
# KNearestNeighbor Classifier with TVEC - smote
%%time
tvec_knn_sm = run_model('tvec', 'knn', vec_params=tvec_params, 
                        mod_params=knn_params, smote_model=True)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
### BEST PARAMS ###


{'knn__algorithm': 'auto',
 'knn__weights': 'distance',
 'tvec__max_df': 0.2,
 'tvec__max_features': None,
 'tvec__min_df': 3,
 'tvec__ngram_range': (1, 1)}

0.6348995239187852

### METRICS ###


{'model': 'knn',
 'precision': 0.4824120603015075,
 'recall': 0.8767123287671232,
 'roc': 0.6871366521884397,
 'test': 0.6223662884927066,
 'train': 0.9878962536023055,
 'train_acc': 0.9916434540389972,
 'val_acc': 0.629570747217806,
 'vectorizer': 'tvec'}

True Negatives: 204
False Positives: 206
False Negatives: 27
True Positives: 192
CPU times: user 1.9 s, sys: 53.4 ms, total: 1.96 s
Wall time: 16.4 s


In [66]:
tuning_list_df = pd.DataFrame(tuning_list)
tuning_list_df['smote'] = 0
tuning_list_df

|   | model | vectorizer | train | test | train_acc | val_acc | roc | precision | recall | smote |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lr | cvec | 0.951848 | 0.695238 | 0.965778 | 0.796502 | 0.76626 | 0.726368 | 0.666667 | 0 |
| 1 | lr | tvec | 0.850942 | 0.671296 | 0.89614 | 0.774245 | 0.748123 | 0.680751 | 0.6621 | 0 |
| 2 | cnb | cvec | 0.841379 | 0.713693 | 0.881019 | 0.780604 | 0.781718 | 0.653992 | 0.785388 | 0 |
| 3 | cnb | tvec | 0.801494 | 0.682203 | 0.85197 | 0.761526 | 0.755385 | 0.636364 | 0.73516 | 0 |
| 4 | knn | cvec | 0.986143 | 0.486667 | 0.99045 | 0.755167 | 0.656911 | 0.901235 | 0.333333 | 0 |
| 5 | knn | tvec | 0.982619 | 0.361111 | 0.988062 | 0.707472 | 0.59799 | 0.753623 | 0.237443 | 0 |


In [67]:
smote_tuning_list_df = pd.DataFrame(smote_tuning_list)
smote_tuning_list_df['smote'] = 1
smote_tuning_list_df

|   | model | vectorizer | train | test | train_acc | val_acc | roc | precision | recall | smote |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lr | cvec | 0.894997 | 0.653763 | 0.923995 | 0.744038 | 0.732398 | 0.617886 | 0.694064 | 1 |
| 1 | lr | tvec | 0.847206 | 0.660714 | 0.893355 | 0.758347 | 0.739119 | 0.646288 | 0.675799 | 1 |
| 2 | cnb | cvec | 0.841605 | 0.72541 | 0.880621 | 0.786963 | 0.791914 | 0.657993 | 0.808219 | 1 |
| 3 | cnb | tvec | 0.807447 | 0.680498 | 0.855949 | 0.755167 | 0.753698 | 0.623574 | 0.748858 | 1 |
| 4 | knn | cvec | 0.954444 | 0.609836 | 0.96737 | 0.621622 | 0.674658 | 0.475703 | 0.849315 | 1 |
| 5 | knn | tvec | 0.987896 | 0.622366 | 0.991643 | 0.629571 | 0.687137 | 0.482412 | 0.876712 | 1 |


In [68]:
# Combining unsmoted and smote results
results_df = pd.concat([tuning_list_df, smote_tuning_list_df])
results_df.reset_index(drop=True, inplace=True)
results_df

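|   | model | vectorizer | train | test | train_acc | val_acc | roc | precision | recall | smote |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lr | cvec | 0.951848 | 0.695238 | 0.965778 | 0.796502 | 0.76626 | 0.726368 | 0.666667 | 0 |
| 1 | lr | tvec | 0.850942 | 0.671296 | 0.89614 | 0.774245 | 0.748123 | 0.680751 | 0.6621 | 0 |
| 2 | cnb | cvec | 0.841379 | 0.713693 | 0.881019 | 0.780604 | 0.781718 | 0.653992 | 0.785388 | 0 |
| 3 | cnb | tvec | 0.801494 | 0.682203 | 0.85197 | 0.761526 | 0.755385 | 0.636364 | 0.73516 | 0 |
| 4 | knn | cvec | 0.986143 | 0.486667 | 0.99045 | 0.755167 | 0.656911 | 0.901235 | 0.333333 | 0 |
| 5 | knn | tvec | 0.982619 | 0.361111 | 0.988062 | 0.707472 | 0.59799 | 0.753623 | 0.237443 | 0 |
| 6 | lr | cvec | 0.894997 | 0.653763 | 0.923995 | 0.744038 | 0.732398 | 0.617886 | 0.694064 | 1 |
| 7 | lr | tvec | 0.847206 | 0.660714 | 0.893355 | 0.758347 | 0.739119 | 0.646288 | 0.675799 | 1 |
| 8 | cnb | cvec | 0.841605 | 0.72541 | 0.880621 | 0.786963 | 0.791914 | 0.657993 | 0.808219 | 1 |
| 9 | cnb | tvec | 0.807447 | 0.680498 | 0.855949 | 0.755167 | 0.753698 | 0.623574 | 0.748858 | 1 |
| 10 | knn | cvec | 0.954444 | 0.609836 | 0.96737 | 0.621622 | 0.674658 | 0.475703 | 0.849315 | 1 |
| 11 | knn | tvec | 0.987896 | 0.622366 | 0.991643 | 0.629571 | 0.687137 | 0.482412 | 0.876712 | 1 |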
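One way to read the combined results is to rank the runs by validation F1 (the `test` column). The sketch below rebuilds a trimmed version of the results with the `test` and `recall` values reported above and sorts it. The `ranking` name is illustrative only.

```python
import pandas as pd

# (model, vectorizer, smote, validation F1, recall) copied from the results above
rows = [
    ('lr',  'cvec', 0, 0.695238, 0.666667),
    ('lr',  'tvec', 0, 0.671296, 0.662100),
    ('cnb', 'cvec', 0, 0.713693, 0.785388),
    ('cnb', 'tvec', 0, 0.682203, 0.735160),
    ('knn', 'cvec', 0, 0.486667, 0.333333),
    ('knn', 'tvec', 0, 0.361111, 0.237443),
    ('lr',  'cvec', 1, 0.653763, 0.694064),
    ('lr',  'tvec', 1, 0.660714, 0.675799),
    ('cnb', 'cvec', 1, 0.725410, 0.808219),
    ('cnb', 'tvec', 1, 0.680498, 0.748858),
    ('knn', 'cvec', 1, 0.609836, 0.849315),
    ('knn', 'tvec', 1, 0.622366, 0.876712),
]
ranking = pd.DataFrame(rows, columns=['model', 'vectorizer', 'smote', 'test', 'recall'])

# Highest validation F1 first
ranking = ranking.sort_values('test', ascending=False).reset_index(drop=True)
best = ranking.iloc[0]
print(ranking.head(3))
```

On these numbers, Complement Naive Bayes with `CountVectorizer` and SMOTE has the highest validation F1. KNN with SMOTE reaches higher recall, but at a much lower F1.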


In [69]:
# Exporting results
results_df.to_csv('./drive/My Drive/GA/capstone/data/MLmetrics_df.csv')