# Model Testing

* KNN
* Logistic Regression
* Decision Tree Classifier
* Random Forest Classifier
* Bagging Classifier with KNN & Decision Tree Classifier
* AdeBoost with Decision Tree Classifier
* Multi-layer Perception Classifier
* Support Vector Classifier (rbf, poly, sigmoid)
* Linear Support Vector Classifier

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Functions" data-toc-modified-id="Functions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Functions</a></span></li><li><span><a href="#Model-Predictors-and-Target-Prep" data-toc-modified-id="Model-Predictors-and-Target-Prep-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Model Predictors and Target Prep</a></span></li><li><span><a href="#Train-Test-Split" data-toc-modified-id="Train-Test-Split-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Train-Test-Split</a></span></li><li><span><a href="#Dummyfication-of-Categorical-variables" data-toc-modified-id="Dummyfication-of-Categorical-variables-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Dummyfication of Categorical variables</a></span></li><li><span><a href="#CountVectorizer-Model-Testing" data-toc-modified-id="CountVectorizer-Model-Testing-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>CountVectorizer Model Testing</a></span></li><li><span><a href="#TfidfVectorizer-Model-Testing" data-toc-modified-id="TfidfVectorizer-Model-Testing-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>TfidfVectorizer Model Testing</a></span></li><li><span><a href="#Large-Scale-Model-Testing-with-TfidfVectorizer-and-Engineered-Features" data-toc-modified-id="Large-Scale-Model-Testing-with-TfidfVectorizer-and-Engineered-Features-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Large Scale Model Testing with TfidfVectorizer and Engineered Features</a></span></li></ul></div>

## Imports

**Basics**

In [1]:
import numpy as np
import pandas as pd
import re

In [2]:
import warnings
warnings.filterwarnings('ignore')

**NLP**

In [3]:
from nltk.corpus import stopwords

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

**Visualisation**

In [5]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

import scikitplot as skplt
from matplotlib.colors import ListedColormap

**Modeling with sklearn**

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import cross_val_score


from sklearn import cluster


from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

from sklearn.svm import SVC, LinearSVC

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

## Functions

In [7]:
def StandardScaler_processing(X_train, X_test):
    '''
    This function takes train and test X variables and fit and transforms the StandardScaler() on the
    X_train and transform the X_test. 
    It returns the standarised X_train and X_test dataframes.
    '''
    
    scaler = StandardScaler()
    X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
    
    return X_train, X_test

In [8]:
def dummyifying(X, columns):
    '''
    This function accept the predictor variables and the specified columns that should be dummified and
    returns the dataframe only containing the specified dummified features.
    '''
    X = pd.get_dummies(X, columns=columns, drop_first=True)
    return X

In [9]:
def cvec_processing(processing_column, X_train, y_train, X_test, y_test):
    '''
    
    '''
    stop = stopwords.words('english')
    stop += ['oh', 'ah', 've','ll','ooh','oooh','uh','aah','aaah','yeah','bum','na','la','doo','nah', 'eh','pow',
             'di','oo','whoa','naa','em','ga','da','hi','sha','ba','wee']

    cvec = CountVectorizer(strip_accents='unicode',
                       stop_words=stop, 
                       ngram_range=(1, 3),
                      min_df=0.01,
                      max_features=1000)

    train_matrix_c = cvec.fit_transform(X_train[processing_column])
    
    #remove the lyrics_processed column
    #X_train.drop('lyrics_processed', axis=1, inplace=True)
    
    CVEC_train = pd.DataFrame(train_matrix_c.toarray(),
                  columns=cvec.get_feature_names())
    
    #resetting the X_train index so it can be joined with the CVEC_train
    X_train.reset_index(drop=True, inplace=True)
    
    #resetting the y_train index 
    y_train.reset_index(drop=True, inplace=True)
    
    #joining the dataframe and the cvec_train dataframe
    X_train = pd.concat([X_train, CVEC_train], axis=1, sort=False)
    
    
    test_matrix_c = cvec.transform(X_test[processing_column])
    CVEC_test = pd.DataFrame(test_matrix_c.toarray(),
                  columns=cvec.get_feature_names())
    
    #remove the lyrics_processed column
    #X_test.drop('lyrics_processed', axis=1, inplace=True)
    
    X_test.reset_index(drop=True, inplace=True)
    
    y_test.reset_index(drop=True, inplace=True)

    X_test = pd.concat([X_test, CVEC_test], axis=1, sort=False)
    
    return X_train, y_train, X_test, y_test

In [10]:
def tvec_processing(processing_column, X_train, y_train, X_test, y_test):
    '''
    
    '''
    stop = stopwords.words('english')
    stop += ['oh', 'ah', 've','ll','ooh','oooh','uh','aah','aaah','yeah','bum','na','la','doo','nah', 'eh','pow',
             'di','oo','whoa','naa','em','ga','da','hi','sha','ba','wee']

    tvec = TfidfVectorizer(strip_accents='unicode',
                       stop_words=stop, 
                       ngram_range=(1, 3),
                      max_features=1000,
                      max_df = 0.9,
                      min_df = 0.01,
                      sublinear_tf=True)

    train_matrix_t = tvec.fit_transform(X_train[processing_column])

    
    #remove the lyrics_processed column
    #X_train.drop('lyrics_processed', axis=1, inplace=True)
    
    TVEC_train = pd.DataFrame(train_matrix_t.toarray(),
                  columns=tvec.get_feature_names())
    
    #resetting the X_train index so it can be joined with the CVEC_train
    X_train.reset_index(drop=True, inplace=True)
    
    #resetting the y_train index 
    y_train.reset_index(drop=True, inplace=True)
    
    #joining the dataframe and the cvec_train dataframe
    X_train = pd.concat([X_train, TVEC_train], axis=1, sort=False)
    
    
    test_matrix_t = tvec.transform(X_test[processing_column])
    TVEC_test = pd.DataFrame(test_matrix_t.toarray(),
                  columns=tvec.get_feature_names())
    
    #remove the lyrics_processed column
    #X_test.drop('lyrics_processed', axis=1, inplace=True)
    
    X_test.reset_index(drop=True, inplace=True)
    
    y_test.reset_index(drop=True, inplace=True)

    X_test = pd.concat([X_test, TVEC_test], axis=1, sort=False)
    
    return X_train, y_train, X_test, y_test

In [11]:
def model_table(name,
                model,
                processing,
                X_train,
                y_train,
                X_test,
                y_test):
    '''
    The model function can accept a given model name, the model object, the X_train, y_train, X_test and y_test
    as arguments to return a dataframe with the supplied model name, parameters, train and test set's accuracy, 
    precision, recall and F1.
    '''
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    model_name = []
    parameters = []
    train_accuracy_scores = []
    test_accuracy_scores = []
    train_precisions = []
    test_precisions = []
    train_recalls = []
    test_recalls = []
    train_f1_scores = []
    test_f1_scores = []
    cross_val_scores = []

    model_name.append(name)
    parameters.append(model)
    train_accuracy_scores.append('{0:.3f}'.format(model.score(
        X_train, y_train)))
    test_accuracy_scores.append('{0:.3f}'.format(model.score(X_test, y_test)))
    train_precisions.append('{0:.3f}'.format(
        precision_score(y_train, y_pred_train, average='macro')))
    test_precisions.append('{0:.3f}'.format(
        precision_score(y_test, y_pred_test, average='macro')))
    train_recalls.append('{0:.3f}'.format(
        recall_score(y_train, y_pred_train, average='macro')))
    test_recalls.append('{0:.3f}'.format(
        recall_score(y_test, y_pred_test, average='macro')))
    train_f1_scores.append('{0:.3f}'.format(
        f1_score(y_train, y_pred_train, average='macro')))
    test_f1_scores.append('{0:.3f}'.format(
        f1_score(y_test, y_pred_test, average='macro')))
    cross_val_scores.append('{0:.3f}'.format(cross_val_score(model, X_train, y_train, cv=5).mean()))

    return pd.DataFrame({
        'Model': model_name,
        'Parameters': parameters,
        'Processing': processing,
        'Train: Accuracy': train_accuracy_scores,
        'Train: Precision': train_precisions,
        'Train: Recall': train_recalls,
        'Train: F1': train_f1_scores,
        'Test: Accuracy': test_accuracy_scores,
        'Test: Precision': test_precisions,
        'Test: Recall': test_recalls,
        'Test: F1': test_f1_scores,
        'Cross-Val Score': cross_val_scores
    })

In [12]:
def grid_search_models(models, pars, model_names,processings, scoring, X_train, y_train, X_test, y_test):
    '''
    This function is able to carry out multiple gridsearches in a row using the scikit-learn
    GridSearchCV() class. 
    It accepts a list of models (models) from scikit-learn, a list of corresponding parameters (pars)
    and a list of corresponding model names(model_names) and a list of previous processing steps(processings) 
    predefined by the user.
    Furthermore, it accepts train and test set objects and the preferred scoring for then 
    gridsearch. 
    It only returns a table of all models run an it's results using the 
    model_table () function which uses scikit-learns classes.
    Results include:
        - model name
        - parameters
        - accuracies,
        - precisions
        - recalls
        - F1 scores
        - cross-val score of the best model 
        
    Parameters:
    ------------------------------
    models: List of models predefined by the user.
    pars: List of parameters predefined by the user.
    model_names: List of strings of how the models are displayed in the table.
    processings: List of strings describing each models previous processing steps.
    scoring: None,'accuracy', 'precision', 'recall', etc please refer to scikit-learn documentation.
    '''
    
    temp_df = pd.DataFrame({
        'Model': [],
        'Parameters': [],
        'Processing': [],
        'Train: Accuracy': [],
        'Train: Precision': [],
        'Train: Recall': [],
        'Train: F1': [],
        'Test: Accuracy': [],
        'Test: Precision': [],
        'Test: Recall': [],
        'Test: F1': [],
        'Cross-Val Score': []
    })

    

    print("starting Gridsearch")
    for i in range(len(models)):
        print('Running Model {} / {}.'.format(i + 1, len(models)))
        gs = GridSearchCV(models[i],
                          pars[i],
                          verbose=2,
                          refit=True,
                          n_jobs=-1,
                          iid=False,
                        scoring=scoring)
        
        gs_fit = gs.fit(X_train, y_train)

        temp2_df = model_table(name=model_names[i],
                               model=gs_fit.best_estimator_,
                               processing = processings[i],
                               X_train=X_train,
                               y_train=y_train,
                               X_test=X_test,
                               y_test=y_test)
        temp_df = pd.concat([temp_df, temp2_df])
    return temp_df

## Model Predictors and Target Prep

* Removed artists which have less than 10 songs
* Removed songs with only 'Instrumental'/no lyrics
* Defined X and y variables
* Baseline

In [13]:
data = pd.read_csv('/Users/constancemaurer/GA DSI 12/DSI12-lessons/projects/project-capstone/personal-github/Resources/capstone_feature_engineered.csv')

In [14]:
data.shape

(1336, 38)

In [15]:
len(data.artist_name.unique())

43

* Removed songs with only 'Instrumental'/no lyrics

In [16]:
data.drop(data[data.lyrics_processed.isnull()].index, inplace=True)

* Removed artists which have less than 10 songs

In [17]:
data.artist_name.value_counts()

The Black Keys           76
Taylor Swift             58
Mac Miller               58
Lady Gaga                54
Red Hot Chili Peppers    52
Rihanna                  48
Beyoncé                  48
Slipknot                 47
The Weeknd               47
System Of A Down         45
Kings of Leon            43
Linkin Park              43
Beach House              43
Radiohead                41
Lil Wayne                40
John Legend              40
Foo Fighters             38
Arctic Monkeys           36
Gorillaz                 36
Tyler, The Creator       33
A$AP Rocky               33
Kendrick Lamar           30
Frank Ocean              29
Wiz Khalifa              29
Fall Out Boy             26
Korn                     24
Aretha Franklin          23
Adele                    23
LCD Soundsystem          22
Nirvana                  21
Amy Winehouse            19
Tame Impala              17
Passion Pit              16
Metallica                16
Led Zeppelin             15
Fleetwood Mac       

In [18]:
list(data[
    (data.artist_name=='Queen')|
    (data.artist_name=='Sister Sledge')|
    (data.artist_name=='The Beatles')|
    (data.artist_name=='Twin Shadow')].index)

[815,
 816,
 817,
 818,
 819,
 820,
 962,
 963,
 964,
 965,
 1135,
 1136,
 1137,
 1272,
 1273]

In [19]:
data.drop(index=list(data[(data.artist_name == 'Queen') | 
                          (data.artist_name == 'Sister Sledge')| 
                          (data.artist_name == 'The Beatles') |
                          (data.artist_name == 'Twin Shadow')].index),
          inplace=True)

In [20]:
data.shape

(1312, 38)

In [21]:
len(data.artist_name.unique())

39

In [22]:
data.to_csv('Model_data.csv', index=False)

**Defining X and y variables**

Excluded from X:
- 'track_name'
- 'artist_name'
- 'release_year'
- 'spotify_uri'
- 'lyrics'
- 'genre'
- 'track_id'

In [23]:
data.columns

Index(['track_name', 'artist_name', 'release_year', 'spotify_uri', 'lyrics',
       'genre', 'track_id', 'popularity', 'acousticness', 'danceability',
       'duration_ms', 'energy', 'instrumentalness', 'key', 'liveness',
       'loudness', 'mode', 'speechiness', 'tempo', 'time_signature', 'valence',
       'n_sentences', 'word_count', 'character_count', 'n_syllables',
       'unique_word_count', 'n_long_words', 'n_monosyllable_words',
       'n_polysyllable_words', 'lyrics_processed', 'vader_compound',
       'vader_neg', 'vader_neu', 'vader_pos', 'objectivity_score',
       'pos_vs_neg', 'TTR', 'MTLD'],
      dtype='object')

In [24]:
X = data[['popularity',
       'acousticness', 'danceability', 'duration_ms', 'energy',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'valence', 'n_sentences',
       'word_count', 'character_count', 'n_syllables', 'unique_word_count',
       'n_long_words', 'n_monosyllable_words', 'n_polysyllable_words','vader_compound', 'vader_neg', 'vader_neu',
       'vader_pos', 'objectivity_score', 'pos_vs_neg', 'TTR', 'MTLD','lyrics_processed']]

In [25]:
y = data.artist_name

**Baseline**

In [26]:
y.value_counts(normalize=True).max()

0.057926829268292686

## Train-Test-Split

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, shuffle=True, random_state=42)

**Export train and test sets**

In [28]:
train_set = pd.concat([X_train, y_train], axis=1, sort=False)
train_set.reset_index(drop=True, inplace=True)
train_set.to_csv('train_set.csv', index=False)

In [29]:
test_set = pd.concat([X_test, y_test], axis=1, sort=False)
test_set.reset_index(drop=True, inplace=True)
test_set.to_csv('test_set.csv', index=False)

## Dummyfication of Categorical variables

In [30]:
X_train.columns

Index(['popularity', 'acousticness', 'danceability', 'duration_ms', 'energy',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'valence', 'n_sentences',
       'word_count', 'character_count', 'n_syllables', 'unique_word_count',
       'n_long_words', 'n_monosyllable_words', 'n_polysyllable_words',
       'vader_compound', 'vader_neg', 'vader_neu', 'vader_pos',
       'objectivity_score', 'pos_vs_neg', 'TTR', 'MTLD', 'lyrics_processed'],
      dtype='object')

In [31]:
X_train.dtypes

popularity                int64
acousticness            float64
danceability            float64
duration_ms               int64
energy                  float64
instrumentalness        float64
key                      object
liveness                float64
loudness                float64
mode                     object
speechiness             float64
tempo                   float64
time_signature           object
valence                 float64
n_sentences               int64
word_count                int64
character_count           int64
n_syllables               int64
unique_word_count         int64
n_long_words              int64
n_monosyllable_words      int64
n_polysyllable_words      int64
vader_compound          float64
vader_neg               float64
vader_neu               float64
vader_pos               float64
objectivity_score       float64
pos_vs_neg              float64
TTR                     float64
MTLD                    float64
lyrics_processed         object
dtype: o

In [32]:
X_train = dummyifying(X_train, ['key', 'time_signature','mode'])

In [33]:
X_train.head()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,...,key_D#,key_E,key_F,key_F#,key_G,key_G#,time_signature_3/4,time_signature_4/4,time_signature_5/4,mode_Minor
1139,44,0.0695,0.374,184720,0.925,0.456,0.262,-2.413,0.0852,115.657,...,0,1,0,0,0,0,0,1,0,1
615,61,0.000564,0.327,219800,0.895,0.0159,0.104,-7.428,0.0367,169.39,...,0,0,0,0,0,0,0,1,0,0
746,53,0.132,0.715,260974,0.794,0.0,0.361,-5.426,0.163,119.994,...,0,0,0,0,0,0,0,1,0,0
296,65,3.6e-05,0.231,255960,0.866,0.000552,0.29,-5.131,0.0517,138.311,...,0,0,0,0,0,0,0,1,0,0
180,72,0.143,0.545,253747,0.649,1.6e-05,0.0894,-4.062,0.0324,99.099,...,0,0,0,1,0,0,0,1,0,0


In [34]:
X_test = dummyifying(X_test, ['key', 'time_signature','mode'])

In [35]:
X_test.head()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,...,key_D#,key_E,key_F,key_F#,key_G,key_G#,time_signature_3/4,time_signature_4/4,time_signature_5/4,mode_Minor
262,53,0.00687,0.372,209720,0.898,6e-06,0.316,-3.338,0.118,178.829,...,0,0,0,0,0,0,0,1,0,0
84,79,0.00616,0.288,201726,0.758,0.0,0.303,-5.692,0.0371,97.094,...,0,0,0,1,0,0,0,1,0,1
117,36,0.847,0.465,131200,0.431,0.0,0.36,-10.847,0.0304,93.981,...,0,0,0,0,1,0,0,1,0,0
1324,47,0.108,0.694,259717,0.626,0.0,0.343,-8.185,0.278,135.966,...,0,0,0,0,0,0,0,1,0,0
654,71,0.000707,0.365,248587,0.751,0.0,0.318,-5.429,0.304,79.119,...,0,0,0,0,0,0,0,1,0,0


## CountVectorizer Model Testing

Converts a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

**Pros**

- only requires a few text cleaning steps
- simple way to both tokenize a collection of text documents and build a corpus
- stop word removal is built into the class and can easily be extended with NLTK

**Cons**

- based on BOW (bag of words) and does not capture the position in semantics, co-occurrences in the document
- suitable for a small corpus
- vectors can become extremely sparse, particularly as vocabularies get larger, which can have a significant impact on the speed and performance of machine learning models

In [36]:
X_train_c, y_train_c, X_test_c, y_test_c = cvec_processing('lyrics_processed', X_train, y_train, X_test, y_test)

In [37]:
X_train_c.drop('lyrics_processed', axis=1, inplace=True)
X_test_c.drop('lyrics_processed', axis=1, inplace=True)

In [38]:
X_train_c, X_test_c = StandardScaler_processing(X_train_c, X_test_c)

In [39]:
model1 = KNeighborsClassifier()

model2 = LogisticRegression()

model3 = RandomForestClassifier(random_state=42)

model4 = SVC()


parameters1 = {
    'n_neighbors': range(1, 51),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

parameters2 = {
    'penalty': ['l1', 'l2'],
    'solver': ['saga'],
    'multi_class': ['auto'],
    'C': np.logspace(-1, 1, 40),
}

parameters3 = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [500],
    'max_depth': [40, 50, 60, 70] + [None],
    'max_features':
    [None, 10, 50, 100, 500, 1000, X_train.shape[1]]
}

parameters4 = {
    'C': [0.01, 0.1, 1.0],
    'kernel': ['rbf', 'poly'],
    'gamma': [0.01, 0.1, 1.0],
}

In [40]:
model_names = ['KNN', 
               'Logistic Regression', 
               'Random Forest', 
               'Support Vector Machine']

models = [model1, 
          model2, 
          model3, 
          model4]

pars = [parameters1, 
        parameters2, 
        parameters3, 
        parameters4]

processings = ['StandardScaler(), CountVectorizer()',
               'StandardScaler(), CountVectorizer()',
               'StandardScaler(), CountVectorizer()',
               'StandardScaler(), CountVectorizer()']

In [41]:
cvec_df = grid_search_models(models, 
                             pars, 
                             model_names, 
                             processings,
                             'accuracy', 
                             X_train_c, 
                             y_train_c, 
                             X_test_c,
                             y_test_c)

starting Gridsearch
Running Model 1 / 4.
Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    7.7s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:   18.4s
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:   37.3s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:  1.0min finished


Running Model 2 / 4.
Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  9.3min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 16.5min finished


Running Model 3 / 4.
Fitting 3 folds for each of 70 candidates, totalling 210 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   38.6s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  6.2min
[Parallel(n_jobs=-1)]: Done 210 out of 210 | elapsed: 10.0min finished


Running Model 4 / 4.
Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:   11.5s finished


In [42]:
cvec_df

Unnamed: 0,Model,Parameters,Processing,Train: Accuracy,Train: Precision,Train: Recall,Train: F1,Test: Accuracy,Test: Precision,Test: Recall,Test: F1,Cross-Val Score
0,KNN,"KNeighborsClassifier(algorithm='auto', leaf_si...","StandardScaler(), CountVectorizer()",1.0,1.0,1.0,1.0,0.08,0.113,0.074,0.057,0.066
0,Logistic Regression,"LogisticRegression(C=7.896522868499725, class_...","StandardScaler(), CountVectorizer()",0.996,0.998,0.994,0.996,0.304,0.239,0.232,0.223,0.302
0,Random Forest,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), CountVectorizer()",1.0,1.0,1.0,1.0,0.452,0.445,0.36,0.361,0.433
0,Support Vector Machine,"SVC(C=1.0, cache_size=200, class_weight=None, ...","StandardScaler(), CountVectorizer()",0.994,0.998,0.991,0.994,0.106,0.009,0.055,0.015,0.09


In [43]:
cvec_df.to_csv('CountVectorizer_model_test.csv', index=False)

## TfidfVectorizer Model Testing

TF–IDF is computed by the scaled frequency of the appearance of the term in the document, normalized by the inverse of the scaled frequency of the term in the entire corpus. This gives less frequent words a higher weight compared to common words.

**Pros**

- only requires a few text cleaning steps
- simple way to both tokenize a collection of text documents and build a corpus giving higher weights to rare words
- stop word removal is built into the class and can easily be extended with NLTK
- returns a sparse matrix representation in the form of ((doc, term), tfidf) which can easily be converted to dataframe

**Cons**

- based on BOW (bag of words) and does not capture the position in semantics, co-occurrences in the document
- suitable for small corpus
- vectors can become extremely sparse, particularly as vocabularies get larger, which can have a significant impact on the speed and performance of machine learning models

In [44]:
X_train_t, y_train_t, X_test_t, y_test_t = tvec_processing('lyrics_processed', X_train, y_train, X_test, y_test)

In [45]:
X_train_t.drop('lyrics_processed', axis=1, inplace=True)
X_test_t.drop('lyrics_processed', axis=1, inplace=True)

In [46]:
X_train_t, X_test_t = StandardScaler_processing(X_train_t, X_test_t)

In [47]:
model_names = ['KNN', 
               'Logistic Regression', 
               'Random Forest', 
               'Support Vector Machine']

models = [model1, 
          model2, 
          model3, 
          model4]

pars = [parameters1, 
        parameters2, 
        parameters3, 
        parameters4]

processings = ['StandardScaler(), TfidfVectorizer()',
               'StandardScaler(), TfidfVectorizer()',
               'StandardScaler(), TfidfVectorizer()',
               'StandardScaler(), TfidfVectorizer()']

In [48]:
tvec_df = grid_search_models(models, 
                             pars, 
                             model_names, 
                             processings,
                             'accuracy', 
                             X_train_t, 
                             y_train_t, 
                             X_test_t,
                             y_test_t)

starting Gridsearch
Running Model 1 / 4.
Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:   17.2s
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:   37.4s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:  1.1min finished


Running Model 2 / 4.
Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  9.1min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 16.1min finished


Running Model 3 / 4.
Fitting 3 folds for each of 70 candidates, totalling 210 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   45.4s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  7.9min
[Parallel(n_jobs=-1)]: Done 210 out of 210 | elapsed: 13.3min finished


Running Model 4 / 4.
Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    5.6s
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:   13.4s finished


In [49]:
tvec_df

Unnamed: 0,Model,Parameters,Processing,Train: Accuracy,Train: Precision,Train: Recall,Train: F1,Test: Accuracy,Test: Precision,Test: Recall,Test: F1,Cross-Val Score
0,KNN,"KNeighborsClassifier(algorithm='auto', leaf_si...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.103,0.187,0.077,0.067,0.104
0,Logistic Regression,"LogisticRegression(C=5.541020330009492, class_...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.43,0.357,0.364,0.343,0.399
0,Random Forest,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.403,0.35,0.29,0.269,0.398
0,Support Vector Machine,"SVC(C=0.01, cache_size=200, class_weight=None,...","StandardScaler(), TfidfVectorizer()",0.997,0.999,0.994,0.996,0.061,0.027,0.028,0.008,0.063


In [50]:
tvec_df.to_csv('TfidfVectorizer_model_test.csv', index=False)

## Large Scale Model Testing with TfidfVectorizer and Engineered Features

**K-Nearest Neighbor Classifier (KNN)**

Pros
* No training involved 
* Naturally handles multiclass classification and can learn complex decision boundaries
* Uses feature similarity to predict the cluster that the new point will fall into
* Does not assume any probability distributions on the training data. This can come in handy where the probability distribution is unknown.
* Can quickly respond to changes in training data. KNN employs lazy learning, which generalizes during testing and this allows it to change during real-time use.

Cons
* The distance metric to choose it not obvious and difficult to justify in many cases
* Performs poorly on high dimensional datasets 
* Sensitive to localized data - localized anomalies affect outcomes significantly, rather than for an algorithm that uses a generalized view of the data.
* Computation time - Lazy learning requires that most of KNN's computation be done during testing, rather than during training. This can be an issue for large datasets.
* Biased towards classes with more entries
* Relies on a correlation between closeness and similarity. One workaround for this issue is dimension reduction, which reduces the number of working variable dimensions (but can lose variable trends in the process).

**Logistic Regression**

Pros
* Low variance
* Probability scores for target variable which can be investigated with P-R curves, ROC curves etc (only in binary target easily addressable)
* Works well with diagonal (feature) decision boundaries
* Overfitting can be addressed though regularization 
* Multi-collinearity is not really an issue and can be countered with L2 regularization to an extent
* One-vesrus-one performs well on large dataset

Cons
* High bias
* May overfit when provided with large numbers of features
* Usually evaluates one-versus-all when predicting the classes
* Doesn’t perform well when feature space is too large
* Doesn’t handle large number of categorical features/variables well
* Relies on transformations for non-linear features
* Can only learn linear hypothesis functions so are less suitable to complex relationships between features and target

**Decision Tree Classifier**

Pros
* easy to interpret visually when the trees only contain several levels
* Can easily handle qualitative (categorical) features
* Works well with decision boundaries parellel to the feature axis

Cons
* Prone to overfitting
* Runs for a long time
* Possible issues with diagonal decision boundaries
* Can be very non-robust, meaning that small changes in the training dataset can lead to quite major differences in the hypothesis function that gets learned 
* Generally have worse performance than ensemble methods

**Bagging Classifier - Decision Tree**

Pros
* Reduces variance in comparison to regular decision trees
* Can provide variable importance measures
* Can easily handle qualitative (categorical) features
* Out of bag (OOB) estimates can be used for model validation

Cons
* Not as easy to visually interpret
* Does not reduce variance if the features are correlated

**Random Forest Classifier**

Pros
* Decorrelates trees (relative to bagged trees), which is especially useful when there is a lot of correlation
* Reduced variance in comparison to regular decision tree
* Has the ability to address class imbalance by using the balanced class weight flag
* Scales to large datasets

Cons
* Not as easy to visually interpret
* Long computation time when used in GridSearch
* Tends to overfit on the training data but is claimed to not be susceptible to that

**AdaBoost Classifier - Decision Tree**

Pros
* Can easily handle qualitative (categorical) features
* Somewhat more interpretable than bagged trees/random forest as the user can define the size of each tree resulting in a collection of stumps (1 level) which can be viewed as an additive model

Cons
* Unlike bagging and random forests, can overfit if number of trees is too large

**Multi-layer Perceptron Classifier**

Pros
* 


Cons
* 


**Support Vector Classifier**

Pros
* Performs similarly to logistic regression when linear separation
* Performs well with non-linear boundary depending on the kernel used
* Handle high dimensional data well
* Fairly robust against overfitting, especially in higher dimensional space

Cons
* Susceptible to overfitting/training issues depending on kernel
* Do not scale well to large datasets (difficult to parallelize) 


In [51]:
X_train_t, y_train_t, X_test_t, y_test_t = tvec_processing('lyrics_processed', X_train, y_train, X_test, y_test)

In [52]:
X_train_t.drop('lyrics_processed', axis=1, inplace=True)
X_test_t.drop('lyrics_processed', axis=1, inplace=True)

In [53]:
X_train_t, X_test_t = StandardScaler_processing(X_train_t, X_test_t)

In [54]:
knn = KNeighborsClassifier()

log = LogisticRegression(max_iter=1000)

dt = DecisionTreeClassifier(random_state=42)

bag_tree = BaggingClassifier(base_estimator=dt, n_estimators=100)

rf = RandomForestClassifier(random_state=42)

ada = AdaBoostClassifier(random_state=42)


knn_params = {
    'n_neighbors': range(1, 50),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

log1_params = {
    'penalty': ['l1'],
    'solver': ['liblinear'],
    'multi_class': ['ovr'],
    'C': np.logspace(-1, 1, 50)
}

log2_params = {
    'penalty': ['l2'],
    'solver': ['lbfgs', 'sag'],
    'multi_class': ['multinomial'],
    'C': np.logspace(-1, 1, 50)
}

log_saga_params = {
    'penalty': ['l1', 'l2'],
    'solver': ['saga'],
    'multi_class': ['auto'],
    'C': np.logspace(0, 1, 40)
}

dt_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(1, 50)) + [None],
    'max_features': [None, 20, 40, 60, 80, 100, 500, 1000, X_train_t.shape[1]],
    'min_samples_split': [2, 10, 25, 50]
}

bag_tree_params = {
    'max_samples': np.linspace(0.8, 1.0, 3),
    'max_features': range(int(3 / 4. * X_train_t.shape[1]),
                          X_train_t.shape[1] + 1)
}

rf_params = {
    'n_estimators': [500],
    'criterion': ['gini', 'entropy'],
    'max_depth': [40, 50, 60, 70] + [None],
    'max_features': [None, 100, 150, 200]
}

ada_params = {
    'base_estimator': [
        DecisionTreeClassifier(max_depth=None),
        DecisionTreeClassifier(max_depth=3),
        DecisionTreeClassifier(max_depth=50)
    ],
    'n_estimators': [50, 100, 150, 200],
    'learning_rate':
    np.linspace(0.1, 1, 20)
}

In [55]:
model_names = [
    'KNN', 'Logistic Regression L1', 'Logistic Regression L2',
    'Logistic Regression, Saga Solver', 'Decision Tree',
    'Bagging with Decision Tree', 'Random Forest',
    'AdaBoost Classifier with Decision Tree'
]

models = [knn, log, log, log, dt, bag_tree, rf, ada]

pars = [
    knn_params, log1_params, log2_params, log_saga_params, dt_params,
    bag_tree_params, rf_params, ada_params
]

processings = [
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()']

In [56]:
grid_df = grid_search_models(models, 
                             pars, 
                             model_names, 
                             processings,
                             'accuracy', 
                             X_train_t, 
                             y_train_t, 
                             X_test_t,
                             y_test_t)

starting Gridsearch
Running Model 1 / 8.
Fitting 3 folds for each of 196 candidates, totalling 588 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:   15.0s
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:   32.8s
[Parallel(n_jobs=-1)]: Done 588 out of 588 | elapsed:   55.0s finished


Running Model 2 / 8.
Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   12.5s
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:  1.5min finished


Running Model 3 / 8.
Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   50.6s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  8.4min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 33.1min finished


Running Model 4 / 8.
Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  8.0min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed: 54.7min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 94.7min finished


Running Model 5 / 8.
Fitting 3 folds for each of 3600 candidates, totalling 10800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    3.0s
[Parallel(n_jobs=-1)]: Done 412 tasks      | elapsed:    4.6s
[Parallel(n_jobs=-1)]: Done 1630 tasks      | elapsed:   12.2s
[Parallel(n_jobs=-1)]: Done 3328 tasks      | elapsed:   24.9s
[Parallel(n_jobs=-1)]: Done 5518 tasks      | elapsed:   41.0s
[Parallel(n_jobs=-1)]: Done 7597 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 9178 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 10777 out of 10800 | elapsed:  1.8min remaining:    0.2s
[Parallel(n_jobs=-1)]: Done 10800 out of 10800 | elapsed:  1.8min finished


Running Model 6 / 8.
Fitting 3 folds for each of 786 candidates, totalling 2358 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   22.1s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 624 tasks      | elapsed:  9.7min
[Parallel(n_jobs=-1)]: Done 989 tasks      | elapsed: 15.8min
[Parallel(n_jobs=-1)]: Done 1434 tasks      | elapsed: 23.5min
[Parallel(n_jobs=-1)]: Done 1961 tasks      | elapsed: 32.7min
[Parallel(n_jobs=-1)]: Done 2358 out of 2358 | elapsed: 39.6min finished


Running Model 7 / 8.
Fitting 3 folds for each of 40 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   34.4s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:  5.4min finished


Running Model 8 / 8.
Fitting 3 folds for each of 240 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:    2.7s
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 624 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  2.4min finished


In [57]:
grid_df.reset_index(drop=True, inplace=True)
grid_df

Unnamed: 0,Model,Parameters,Processing,Train: Accuracy,Train: Precision,Train: Recall,Train: F1,Test: Accuracy,Test: Precision,Test: Recall,Test: F1,Cross-Val Score
0,KNN,"KNeighborsClassifier(algorithm='auto', leaf_si...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.103,0.187,0.077,0.067,0.104
1,Logistic Regression L1,"LogisticRegression(C=2.442053094548651, class_...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.338,0.305,0.295,0.289,0.293
2,Logistic Regression L2,"LogisticRegression(C=0.21209508879201905, clas...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.426,0.356,0.336,0.328,0.391
3,"Logistic Regression, Saga Solver","LogisticRegression(C=7.443803013251689, class_...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.433,0.402,0.369,0.363,0.387
4,Decision Tree,"DecisionTreeClassifier(class_weight=None, crit...","StandardScaler(), TfidfVectorizer()",0.414,0.366,0.335,0.314,0.247,0.228,0.211,0.197,0.241
5,Bagging with Decision Tree,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.426,0.386,0.343,0.338,0.4
6,Random Forest,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.437,0.366,0.341,0.33,0.433
7,AdaBoost Classifier with Decision Tree,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",0.856,0.929,0.832,0.863,0.289,0.301,0.211,0.211,0.206


In [58]:
grid_df.Parameters[0]

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=25, p=2,
                     weights='distance')

In [59]:
bag_knn = BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=5),
                            n_estimators=50)

bag_best_knn = BaggingClassifier(base_estimator=grid_df.Parameters[0],
                                 n_estimators=50)

bag_best_tree = BaggingClassifier(base_estimator=grid_df.Parameters[4],
                                  n_estimators=100)

MLP = MLPClassifier(max_iter=1000)



bag_params = {
    'max_samples': np.linspace(0.8, 1.0, 3),
    'max_features': range(int(3 / 4. * X_train_t.shape[1]),
                          X_train_t.shape[1] + 1)
}

neur_params = {
    'solver': ['adam'],
    'alpha': [10**(-10), 10**(-5), 10**(-2)],
    'hidden_layer_sizes': [(8, 8, 8), (10, 10, 10), (20, 10, 8, 8)],
    'activation': ['identity', 'relu', 'logistic', 'tanh'],
    'random_state': [42],
    'batch_size': ['auto', 50, 500],
    'early_stopping': [True],
    'validation_fraction': [0.2]
}

In [60]:
model_names2 = [
    'Bagging with KNN', 'Bagging with Best KNN', 'Bagging with Best Decision Tree',
    'Multi-layer Perceptron Classifier'
]

models2 = [bag_knn, bag_best_knn, bag_best_tree, MLP]

pars2 = [
    bag_params, bag_params, bag_params, neur_params
]

processings2 = [
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()']

In [61]:
grid_df2 = grid_search_models(models2, 
                             pars2, 
                             model_names2, 
                             processings2,
                             'accuracy', 
                             X_train_t, 
                             y_train_t, 
                             X_test_t,
                             y_test_t)

starting Gridsearch
Running Model 1 / 4.
Fitting 3 folds for each of 786 candidates, totalling 2358 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  6.3min
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed: 15.5min
[Parallel(n_jobs=-1)]: Done 624 tasks      | elapsed: 28.5min
[Parallel(n_jobs=-1)]: Done 989 tasks      | elapsed: 47.2min
[Parallel(n_jobs=-1)]: Done 1434 tasks      | elapsed: 73.6min
[Parallel(n_jobs=-1)]: Done 1961 tasks      | elapsed: 107.2min
[Parallel(n_jobs=-1)]: Done 2358 out of 2358 | elapsed: 130.1min finished


Running Model 2 / 4.
Fitting 3 folds for each of 786 candidates, totalling 2358 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   59.7s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  6.3min
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed: 17.0min
[Parallel(n_jobs=-1)]: Done 624 tasks      | elapsed: 32.1min
[Parallel(n_jobs=-1)]: Done 989 tasks      | elapsed: 52.9min
[Parallel(n_jobs=-1)]: Done 1434 tasks      | elapsed: 79.2min
[Parallel(n_jobs=-1)]: Done 1961 tasks      | elapsed: 110.8min
[Parallel(n_jobs=-1)]: Done 2358 out of 2358 | elapsed: 134.5min finished


Running Model 3 / 4.
Fitting 3 folds for each of 786 candidates, totalling 2358 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    9.0s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:   54.7s
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 624 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 989 tasks      | elapsed:  6.5min
[Parallel(n_jobs=-1)]: Done 1434 tasks      | elapsed:  9.5min
[Parallel(n_jobs=-1)]: Done 1961 tasks      | elapsed: 13.2min
[Parallel(n_jobs=-1)]: Done 2358 out of 2358 | elapsed: 16.0min finished


Running Model 4 / 4.
Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:   10.4s
[Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed:   20.6s finished


In [62]:
grid_df2

Unnamed: 0,Model,Parameters,Processing,Train: Accuracy,Train: Precision,Train: Recall,Train: F1,Test: Accuracy,Test: Precision,Test: Recall,Test: F1,Cross-Val Score
0,Bagging with KNN,"(KNeighborsClassifier(algorithm='auto', leaf_s...","StandardScaler(), TfidfVectorizer()",0.528,0.927,0.537,0.618,0.053,0.084,0.061,0.035,0.036
0,Bagging with Best KNN,"(KNeighborsClassifier(algorithm='auto', leaf_s...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.103,0.179,0.076,0.061,0.102
0,Bagging with Best Decision Tree,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",0.7,0.753,0.592,0.613,0.369,0.259,0.264,0.239,0.338
0,Multi-layer Perceptron Classifier,"MLPClassifier(activation='identity', alpha=1e-...","StandardScaler(), TfidfVectorizer()",0.722,0.694,0.658,0.65,0.148,0.101,0.109,0.101,0.142


In [63]:
test_table = pd.concat([grid_df, grid_df2], axis=0, sort=False)
test_table.reset_index(drop=True, inplace=True)
test_table

Unnamed: 0,Model,Parameters,Processing,Train: Accuracy,Train: Precision,Train: Recall,Train: F1,Test: Accuracy,Test: Precision,Test: Recall,Test: F1,Cross-Val Score
0,KNN,"KNeighborsClassifier(algorithm='auto', leaf_si...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.103,0.187,0.077,0.067,0.104
1,Logistic Regression L1,"LogisticRegression(C=2.442053094548651, class_...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.338,0.305,0.295,0.289,0.293
2,Logistic Regression L2,"LogisticRegression(C=0.21209508879201905, clas...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.426,0.356,0.336,0.328,0.391
3,"Logistic Regression, Saga Solver","LogisticRegression(C=7.443803013251689, class_...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.433,0.402,0.369,0.363,0.387
4,Decision Tree,"DecisionTreeClassifier(class_weight=None, crit...","StandardScaler(), TfidfVectorizer()",0.414,0.366,0.335,0.314,0.247,0.228,0.211,0.197,0.241
5,Bagging with Decision Tree,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.426,0.386,0.343,0.338,0.4
6,Random Forest,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.437,0.366,0.341,0.33,0.433
7,AdaBoost Classifier with Decision Tree,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",0.856,0.929,0.832,0.863,0.289,0.301,0.211,0.211,0.206
8,Bagging with KNN,"(KNeighborsClassifier(algorithm='auto', leaf_s...","StandardScaler(), TfidfVectorizer()",0.528,0.927,0.537,0.618,0.053,0.084,0.061,0.035,0.036
9,Bagging with Best KNN,"(KNeighborsClassifier(algorithm='auto', leaf_s...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.103,0.179,0.076,0.061,0.102


In [64]:
svc = SVC()

linearSVC = LinearSVC()


rbf_params = {
    'C': np.linspace(0.01, 10, 20),
    'gamma': np.linspace(0, 1, 10),
    'kernel': ['rbf']
}

poly_params = {
    'C': np.linspace(0.01, 10, 20),
    'gamma': np.linspace(0.01, 1, 10),
    'kernel': ['poly']
}

sigmoid_params = {
    'C': np.linspace(0.01, 5, 20),
    'gamma': np.linspace(0, 1, 10),
    'kernel': ['sigmoid']
}

linearSVC_params = {
    'penalty': ['l2'],
    'loss': ['hinge', 'squared_hinge'],
    'tol': [0.0001],
    'C': np.linspace(0.001, 10, 30),
    'multi_class': ['ovr']
}

linearSVC_params2 = {
    'penalty': ['l2', 'l1'],
    'loss': ['hinge', 'squared_hinge'],
    'tol': [0.0001],
    'C': np.linspace(0.001, 10, 30),
    'multi_class': ['crammer_singer']
}

In [65]:
model_names3 = [
    'Support Vector Classifier - rbf', 'Support Vector Classifier - poly',
    'Support Vector Classifier - sigmoid', 'Linear Support Vector Classifier',
    'Linear Support Vector Classifier'
]

models3 = [svc, svc, svc, linearSVC, linearSVC]

pars3 = [
    rbf_params, poly_params, sigmoid_params, linearSVC_params,
    linearSVC_params2
]

processings3 = [
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()',
    'StandardScaler(), TfidfVectorizer()'
]

In [66]:
grid_df3 = grid_search_models(models3, 
                             pars3, 
                             model_names3, 
                             processings3,
                             'accuracy', 
                             X_train_t, 
                             y_train_t, 
                             X_test_t,
                             y_test_t)

starting Gridsearch
Running Model 1 / 5.
Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    3.6s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:   22.1s
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:   54.9s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:  1.6min finished


Running Model 2 / 5.
Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:   23.7s
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:   56.4s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:  1.6min finished


Running Model 3 / 5.
Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:   16.5s
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:   36.2s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:  1.0min finished


Running Model 4 / 5.
Fitting 3 folds for each of 60 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   39.1s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  8.3min
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 10.2min finished


Running Model 5 / 5.
Fitting 3 folds for each of 120 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    6.6s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:  5.8min finished


In [67]:
test_table = pd.concat([test_table, grid_df3], axis=0, sort=False)
test_table.reset_index(drop=True, inplace=True)
test_table

Unnamed: 0,Model,Parameters,Processing,Train: Accuracy,Train: Precision,Train: Recall,Train: F1,Test: Accuracy,Test: Precision,Test: Recall,Test: F1,Cross-Val Score
0,KNN,"KNeighborsClassifier(algorithm='auto', leaf_si...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.103,0.187,0.077,0.067,0.104
1,Logistic Regression L1,"LogisticRegression(C=2.442053094548651, class_...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.338,0.305,0.295,0.289,0.293
2,Logistic Regression L2,"LogisticRegression(C=0.21209508879201905, clas...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.426,0.356,0.336,0.328,0.391
3,"Logistic Regression, Saga Solver","LogisticRegression(C=7.443803013251689, class_...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.433,0.402,0.369,0.363,0.387
4,Decision Tree,"DecisionTreeClassifier(class_weight=None, crit...","StandardScaler(), TfidfVectorizer()",0.414,0.366,0.335,0.314,0.247,0.228,0.211,0.197,0.241
5,Bagging with Decision Tree,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.426,0.386,0.343,0.338,0.4
6,Random Forest,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.437,0.366,0.341,0.33,0.433
7,AdaBoost Classifier with Decision Tree,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",0.856,0.929,0.832,0.863,0.289,0.301,0.211,0.211,0.206
8,Bagging with KNN,"(KNeighborsClassifier(algorithm='auto', leaf_s...","StandardScaler(), TfidfVectorizer()",0.528,0.927,0.537,0.618,0.053,0.084,0.061,0.035,0.036
9,Bagging with Best KNN,"(KNeighborsClassifier(algorithm='auto', leaf_s...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.103,0.179,0.076,0.061,0.102


In [69]:
test_table.sort_values('Cross-Val Score',ascending=False, inplace=True)
test_table

Unnamed: 0,Model,Parameters,Processing,Train: Accuracy,Train: Precision,Train: Recall,Train: F1,Test: Accuracy,Test: Precision,Test: Recall,Test: F1,Cross-Val Score
6,Random Forest,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.437,0.366,0.341,0.33,0.433
5,Bagging with Decision Tree,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.426,0.386,0.343,0.338,0.4
2,Logistic Regression L2,"LogisticRegression(C=0.21209508879201905, clas...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.426,0.356,0.336,0.328,0.391
3,"Logistic Regression, Saga Solver","LogisticRegression(C=7.443803013251689, class_...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.433,0.402,0.369,0.363,0.387
16,Linear Support Vector Classifier,"LinearSVC(C=0.001, class_weight=None, dual=Tru...","StandardScaler(), TfidfVectorizer()",0.996,0.997,0.993,0.995,0.388,0.304,0.319,0.299,0.384
15,Linear Support Vector Classifier,"LinearSVC(C=0.001, class_weight=None, dual=Tru...","StandardScaler(), TfidfVectorizer()",0.986,0.993,0.976,0.983,0.35,0.236,0.265,0.235,0.353
10,Bagging with Best Decision Tree,"(DecisionTreeClassifier(class_weight=None, cri...","StandardScaler(), TfidfVectorizer()",0.7,0.753,0.592,0.613,0.369,0.259,0.264,0.239,0.338
14,Support Vector Classifier - sigmoid,"SVC(C=0.27263157894736845, cache_size=200, cla...","StandardScaler(), TfidfVectorizer()",0.14,0.121,0.107,0.106,0.278,0.221,0.218,0.206,0.302
1,Logistic Regression L1,"LogisticRegression(C=2.442053094548651, class_...","StandardScaler(), TfidfVectorizer()",1.0,1.0,1.0,1.0,0.338,0.305,0.295,0.289,0.293
4,Decision Tree,"DecisionTreeClassifier(class_weight=None, crit...","StandardScaler(), TfidfVectorizer()",0.414,0.366,0.335,0.314,0.247,0.228,0.211,0.197,0.241


In [70]:
test_table.to_csv('Model_Test_sorted.csv', index=False)