<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Data-Preprocessing-(Data-Transformations)" data-toc-modified-id="Data-Preprocessing-(Data-Transformations)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Preprocessing (Data Transformations)</a></span><ul class="toc-item"><li><span><a href="#Drug-Outcome-variable-transformations" data-toc-modified-id="Drug-Outcome-variable-transformations-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Drug Outcome variable transformations</a></span><ul class="toc-item"><li><span><a href="#Remove-string-and-change-to-integer" data-toc-modified-id="Remove-string-and-change-to-integer-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Remove string and change to integer</a></span></li><li><span><a href="#Create-3-broader-outcome-variables-(Stimulants,-Depressants-and-Hallucinogens)" data-toc-modified-id="Create-3-broader-outcome-variables-(Stimulants,-Depressants-and-Hallucinogens)-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Create 3 broader outcome variables (<em>Stimulants, Depressants and Hallucinogens</em>)</a></span></li><li><span><a href="#Recode-from-6-levels-to-3-levels" data-toc-modified-id="Recode-from-6-levels-to-3-levels-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Recode from 6 levels to 3 levels</a></span></li></ul></li></ul></li><li><span><a href="#Rebalance" data-toc-modified-id="Rebalance-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Rebalance</a></span></li><li><span><a href="#Testing-Hyperparameters---GridSeardCV" data-toc-modified-id="Testing-Hyperparameters---GridSeardCV-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Testing Hyperparameters - GridSeardCV</a></span><ul class="toc-item"><li><span><a href="#LinearSVC-GridSerachCV" data-toc-modified-id="LinearSVC-GridSerachCV-4.1"><span 
class="toc-item-num">4.1&nbsp;&nbsp;</span>LinearSVC GridSerachCV</a></span></li><li><span><a href="#Logistic-Regression-GridSearchCV" data-toc-modified-id="Logistic-Regression-GridSearchCV-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Logistic Regression GridSearchCV</a></span></li><li><span><a href="#Random-Forest-GridSearchCV" data-toc-modified-id="Random-Forest-GridSearchCV-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Random Forest GridSearchCV</a></span></li><li><span><a href="#Neural-Network-GridSearchCV" data-toc-modified-id="Neural-Network-GridSearchCV-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Neural Network GridSearchCV</a></span></li></ul></li><li><span><a href="#Final-Tables---Parameters-tested" data-toc-modified-id="Final-Tables---Parameters-tested-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Final Tables - Parameters tested</a></span></li></ul></div>

# Introduction

In this notebook, hyperparameters for the **stimulant models** are optimized using GridSearchCV, which exhaustively evaluates every combination of the parameter values defined in the parameter grids in the Testing Hyperparameters section.

The best values for the parameters for each model are output and then used to recalibrate the models in the previous notebook to provide the final models.
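
As a rough illustration of what an exhaustive search means here, the number of candidates GridSearchCV evaluates is the product of the lengths of the value lists in a grid. The sketch below counts candidates for the logistic regression and LinearSVC grids defined later in this notebook, using only the standard library. The helper name `count_candidates` is hypothetical and not part of scikit-learn.

```python
from itertools import product
from math import prod

# grids copied from the Testing Hyperparameters section
lr_grid = {
    'estimator__penalty': ['l1', 'l2'],
    'estimator__C': [1.0, 0.1, .001, .0001],
    'estimator__solver': ['liblinear', 'newton-cg', 'lbfgs', 'saga'],
}
svm_grid = {
    'estimator__penalty': ['l1', 'l2'],
    'estimator__dual': [True, False],
    'estimator__C': [1.0, 0.1, .001, .0001],
    'estimator__loss': ['squared_hinge', 'hinge'],
    'estimator__max_iter': [1000, 5000, 10000],
    'estimator__multi_class': ['ovr', 'crammer_singer'],
}

def count_candidates(grid):
    # hypothetical helper: one candidate per combination of values
    return prod(len(values) for values in grid.values())

# enumerate the logistic regression candidates explicitly
lr_candidates = [dict(zip(lr_grid, combo)) for combo in product(*lr_grid.values())]

cv_folds = 5
n_lr = count_candidates(lr_grid)
n_svm = count_candidates(svm_grid)
print(n_lr, n_lr * cv_folds)    # candidates and total fits for logistic regression
print(n_svm, n_svm * cv_folds)  # candidates and total fits for LinearSVC
```

The counts agree with the "Fitting 5 folds for each of ... candidates" lines printed by the searches further down, which is a quick way to estimate run time before launching a search.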

# Data Preprocessing (Data Transformations)

In this first section, the dataset is prepared for modelling in a series of data transformations.

The eighteen drug outcome variables are collapsed into three new outcome variables representing broader classes of drugs: _**Stimulants, Depressants, and Hallucinogens**_.

Additionally, the 7 levels of drug use (0 to 6) are collapsed to three new levels, coded as in the recoding function below: _**0 - unlikely to use, 2 - medium use, 3 - high use**_
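
The two transformations described above can also be sketched in vectorized form. This is a minimal illustration on made-up toy values, not the notebook's actual cells: it assumes three stimulant columns in the `CLn` string format, takes the row-wise maximum as the broader class, and uses `pd.cut` to map levels 0-1, 2-3 and 4-6 to the codes 0, 2 and 3 returned by the `recode` function defined later.

```python
import pandas as pd

# toy data in the same 'CLn' format as the drug columns (values are made up)
toy = pd.DataFrame({
    'AMPHET': ['CL2', 'CL0', 'CL1'],
    'NICO':   ['CL4', 'CL1', 'CL0'],
    'CAFF':   ['CL6', 'CL3', 'CL1'],
})

# strip the 'CL' prefix and convert each column to integers
levels = toy.apply(lambda col: col.str.replace('CL', '', regex=False).astype(int))

# broader class = highest usage level across its member drugs
stimulants = levels.max(axis=1)

# collapse levels 0-6 into three codes: 0-1 -> 0, 2-3 -> 2, 4-6 -> 3
stim_final = pd.cut(stimulants, bins=[-1, 1, 3, 6], labels=[0, 2, 3]).astype(int)

print(pd.DataFrame({'stimulants': stimulants, 'stim_final': stim_final}))
```

Taking the maximum means a respondent is assigned the usage level of their most recently used drug in the class, which is the same rule the `create_*` functions below apply row by row.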


In [4]:
#import libraries
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
import re
import numpy as np
%matplotlib inline

In [5]:
#read in dataset
df = pd.read_csv("../drug_consumption_cap_20230505.csv")
df

Unnamed: 0,ID,Age,Gender,Education,Country,Ethnicity,NEO_N,NEO_E,NEO_O,NEO_A,...,ECST,HEROIN,KETA,LEGALH,LSD,METH,MUSHRM,NICO,SEMER,VSA
0,1,0.49788,0.48246,-0.05921,0.96082,0.12600,0.31287,-0.57545,-0.58331,-0.91699,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
1,2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,...,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0
2,3,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.62090,...,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0,CL0
3,4,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,...,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0,CL0
4,5,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.63340,-0.45174,-0.30172,...,CL1,CL0,CL0,CL1,CL0,CL0,CL2,CL2,CL0,CL0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,1884,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,-1.19430,1.74091,1.88511,0.76096,...,CL0,CL0,CL0,CL3,CL3,CL0,CL0,CL0,CL0,CL5
1881,1885,-0.95197,-0.48246,-0.61113,-0.57009,-0.31685,-0.24649,1.74091,0.58331,0.76096,...,CL2,CL0,CL0,CL3,CL5,CL4,CL4,CL5,CL0,CL0
1882,1886,-0.07854,0.48246,0.45468,-0.57009,-0.31685,1.13281,-1.37639,-1.27553,-1.77200,...,CL4,CL0,CL2,CL0,CL2,CL0,CL2,CL6,CL0,CL0
1883,1887,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,0.91093,-1.92173,0.29338,-1.62090,...,CL3,CL0,CL0,CL3,CL3,CL0,CL3,CL4,CL0,CL0


## Drug Outcome variable transformations

### Remove string and change to integer

In [6]:
#select only the drug variable columns
df.iloc[:,13:]

Unnamed: 0,ALC,AMPHET,AMYL,BENZOS,CAFF,CANNABIS,CHOC,COCAINE,CRACK,ECST,HEROIN,KETA,LEGALH,LSD,METH,MUSHRM,NICO,SEMER,VSA
0,CL5,CL2,CL0,CL2,CL6,CL0,CL5,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
1,CL5,CL2,CL2,CL0,CL6,CL4,CL6,CL3,CL0,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0
2,CL6,CL0,CL0,CL0,CL6,CL3,CL4,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0,CL0
3,CL4,CL0,CL0,CL3,CL5,CL2,CL4,CL2,CL0,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0,CL0
4,CL4,CL1,CL1,CL0,CL6,CL3,CL6,CL0,CL0,CL1,CL0,CL0,CL1,CL0,CL0,CL2,CL2,CL0,CL0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,CL5,CL0,CL0,CL0,CL4,CL5,CL4,CL0,CL0,CL0,CL0,CL0,CL3,CL3,CL0,CL0,CL0,CL0,CL5
1881,CL5,CL0,CL0,CL0,CL5,CL3,CL4,CL0,CL0,CL2,CL0,CL0,CL3,CL5,CL4,CL4,CL5,CL0,CL0
1882,CL4,CL6,CL5,CL5,CL6,CL6,CL6,CL4,CL0,CL4,CL0,CL2,CL0,CL2,CL0,CL2,CL6,CL0,CL0
1883,CL5,CL0,CL0,CL0,CL6,CL6,CL5,CL0,CL0,CL3,CL0,CL0,CL3,CL3,CL0,CL3,CL4,CL0,CL0


In [7]:
#remove 'CL' prefix
df.iloc[:,13:] = df.iloc[:,13:].applymap(lambda x: re.sub('CL','',x))
df.iloc[:,13:]

Unnamed: 0,ALC,AMPHET,AMYL,BENZOS,CAFF,CANNABIS,CHOC,COCAINE,CRACK,ECST,HEROIN,KETA,LEGALH,LSD,METH,MUSHRM,NICO,SEMER,VSA
0,5,2,0,2,6,0,5,0,0,0,0,0,0,0,0,0,2,0,0
1,5,2,2,0,6,4,6,3,0,4,0,2,0,2,3,0,4,0,0
2,6,0,0,0,6,3,4,0,0,0,0,0,0,0,0,1,0,0,0
3,4,0,0,3,5,2,4,2,0,0,0,2,0,0,0,0,2,0,0
4,4,1,1,0,6,3,6,0,0,1,0,0,1,0,0,2,2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,5,0,0,0,4,5,4,0,0,0,0,0,3,3,0,0,0,0,5
1881,5,0,0,0,5,3,4,0,0,2,0,0,3,5,4,4,5,0,0
1882,4,6,5,5,6,6,6,4,0,4,0,2,0,2,0,2,6,0,0
1883,5,0,0,0,6,6,5,0,0,3,0,0,3,3,0,3,4,0,0


In [8]:
#recode as integer field type
df.iloc[:,13:] = df.iloc[:,13:].apply(lambda x: x.astype(int))
#check for field type of outcomes (should be integers)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885 entries, 0 to 1884
Data columns (total 32 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         1885 non-null   int64  
 1   Age        1885 non-null   float64
 2   Gender     1885 non-null   float64
 3   Education  1885 non-null   float64
 4   Country    1885 non-null   float64
 5   Ethnicity  1885 non-null   float64
 6   NEO_N      1885 non-null   float64
 7   NEO_E      1885 non-null   float64
 8   NEO_O      1885 non-null   float64
 9   NEO_A      1885 non-null   float64
 10  NEO_C      1885 non-null   float64
 11  IMP        1885 non-null   float64
 12  SS         1885 non-null   float64
 13  ALC        1885 non-null   int32  
 14  AMPHET     1885 non-null   int32  
 15  AMYL       1885 non-null   int32  
 16  BENZOS     1885 non-null   int32  
 17  CAFF       1885 non-null   int32  
 18  CANNABIS   1885 non-null   int32  
 19  CHOC       1885 non-null   int32  
 20  COCAINE 

### Create 3 broader outcome variables (*Stimulants, Depressants and Hallucinogens*)

In [9]:
#testing function to group drugs to create a new outcome variable
def create_drug_test(row):      
    return max(row['ALC'],row['AMPHET'],row['AMYL'],\
              row['BENZOS'],row['CANNABIS'])

In [10]:
#test on first three rows - before and after
display(df.iloc[:3,13:])

#selection from row
display(df.iloc[:3,13:].apply(lambda x: create_drug_test(x), axis=1))

Unnamed: 0,ALC,AMPHET,AMYL,BENZOS,CAFF,CANNABIS,CHOC,COCAINE,CRACK,ECST,HEROIN,KETA,LEGALH,LSD,METH,MUSHRM,NICO,SEMER,VSA
0,5,2,0,2,6,0,5,0,0,0,0,0,0,0,0,0,2,0,0
1,5,2,2,0,6,4,6,3,0,4,0,2,0,2,3,0,4,0,0
2,6,0,0,0,6,3,4,0,0,0,0,0,0,0,0,1,0,0,0


0    5
1    5
2    6
dtype: int64

In [11]:
#function to group drugs to create a new stimulants outcome variable
def create_stimulants(row):      
    return max(row['AMPHET'],row['NICO'],row['COCAINE'],\
              row['CRACK'],row['CAFF'],row['CHOC'])

In [12]:
#function to group drugs to create a new depressants outcome variable
def create_depressants(row):      
    return max(row['ALC'],row['AMYL'],row['BENZOS'],row['VSA'],row['HEROIN'],\
              row['METH'])

In [13]:
#function to group drugs to create a new hallucinogens outcome variable
def create_hallucinogens(row):      
    return max(row['CANNABIS'],row['ECST'],row['KETA'],row['LSD'],\
               row['MUSHRM'],row['LEGALH'])

In [14]:
#apply() recoding functions to create outcome variables
df["stimulants"] = df.iloc[:,13:].apply(lambda x: create_stimulants(x).astype(int), axis=1)
df["depressants"] = df.iloc[:,13:].apply(lambda x: create_depressants(x).astype(int), axis=1)
df["hallucinogens"] = df.iloc[:,13:].apply(lambda x: create_hallucinogens(x).astype(int), axis=1)



In [15]:
#show df
df

Unnamed: 0,ID,Age,Gender,Education,Country,Ethnicity,NEO_N,NEO_E,NEO_O,NEO_A,...,LEGALH,LSD,METH,MUSHRM,NICO,SEMER,VSA,stimulants,depressants,hallucinogens
0,1,0.49788,0.48246,-0.05921,0.96082,0.12600,0.31287,-0.57545,-0.58331,-0.91699,...,0,0,0,0,2,0,0,6,5,0
1,2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,...,0,2,3,0,4,0,0,6,5,4
2,3,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.62090,...,0,0,0,1,0,0,0,6,6,3
3,4,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,...,0,0,0,0,2,0,0,5,4,2
4,5,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.63340,-0.45174,-0.30172,...,1,0,0,2,2,0,0,6,4,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,1884,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,-1.19430,1.74091,1.88511,0.76096,...,3,3,0,0,0,0,5,4,5,5
1881,1885,-0.95197,-0.48246,-0.61113,-0.57009,-0.31685,-0.24649,1.74091,0.58331,0.76096,...,3,5,4,4,5,0,0,5,5,5
1882,1886,-0.07854,0.48246,0.45468,-0.57009,-0.31685,1.13281,-1.37639,-1.27553,-1.77200,...,0,2,0,2,6,0,0,6,5,6
1883,1887,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,0.91093,-1.92173,0.29338,-1.62090,...,3,3,0,3,4,0,0,6,5,6


### Recode from 7 levels to 3 levels

In [16]:
#define recoding function
def recode(val):
    #values 4 to 6 -> 3 (high use)
    if val >= 4:
        return 3
    #values 2 and 3 -> 2 (medium use)
    elif val >= 2:
        return 2
    #values 0 and 1 -> 0 (unlikely to use)
    else:
        return 0

In [17]:
#recode to 3 levels for each drug outcome
df[['stim_final','dep_final','hallu_final']] = df[['stimulants','depressants','hallucinogens']].applymap(lambda x: recode(x))

In [18]:
#output
df

Unnamed: 0,ID,Age,Gender,Education,Country,Ethnicity,NEO_N,NEO_E,NEO_O,NEO_A,...,MUSHRM,NICO,SEMER,VSA,stimulants,depressants,hallucinogens,stim_final,dep_final,hallu_final
0,1,0.49788,0.48246,-0.05921,0.96082,0.12600,0.31287,-0.57545,-0.58331,-0.91699,...,0,2,0,0,6,5,0,3,3,0
1,2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,...,0,4,0,0,6,5,4,3,3,3
2,3,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.62090,...,1,0,0,0,6,6,3,3,3,2
3,4,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,...,0,2,0,0,5,4,2,3,3,2
4,5,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.63340,-0.45174,-0.30172,...,2,2,0,0,6,4,3,3,3,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,1884,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,-1.19430,1.74091,1.88511,0.76096,...,0,0,0,5,4,5,5,3,3,3
1881,1885,-0.95197,-0.48246,-0.61113,-0.57009,-0.31685,-0.24649,1.74091,0.58331,0.76096,...,4,5,0,0,5,5,5,3,3,3
1882,1886,-0.07854,0.48246,0.45468,-0.57009,-0.31685,1.13281,-1.37639,-1.27553,-1.77200,...,2,6,0,0,6,5,6,3,3,3
1883,1887,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,0.91093,-1.92173,0.29338,-1.62090,...,3,4,0,0,6,5,6,3,3,3


In [19]:
#value counts check (count each level per outcome column)
df[['stim_final','dep_final','hallu_final']].apply(pd.Series.value_counts)

Unnamed: 0,stim_final,dep_final,hallu_final
0,4,42,584
2,8,215,418
3,1873,1628,883


# Rebalance

In this section, the outcome variable classes are rebalanced by oversampling. Two methods were compared: SMOTE and RandomOverSampler. The four selected classifiers were then recalibrated on the rebalanced data. All of the rebalanced models overfit less than the initial models. SMOTE gave better overall model accuracy and was selected as the preferred balancing method.
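
To make the difference between the two methods concrete: RandomOverSampler duplicates existing minority-class rows, whereas SMOTE creates new points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that interpolation step only, on made-up data. It is not the imbalanced-learn implementation used in this notebook, and `smote_like_sample` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(42)

# toy minority-class samples with two features (made-up values)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def smote_like_sample(X, k, rng):
    """Create one synthetic point between a random sample and a near neighbour."""
    i = rng.integers(len(X))
    # distances from the chosen sample to every sample (itself included)
    dists = np.linalg.norm(X - X[i], axis=1)
    # the k nearest neighbours, skipping position 0 (the sample itself)
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    # random position on the line segment between the two samples
    lam = rng.random()
    return X[i] + lam * (X[j] - X[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

The `k` argument plays the role of `k_neighbors` in the SMOTE call below. A small value such as 2 is needed when the rarest class has only a handful of training rows, since each sample must have at least that many same-class neighbours.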

In [20]:
#import libraries for re-balancing data
from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

In [21]:
#get independent variables (columns 0-12: ID plus the 12 predictor columns)
x_vars = df.iloc[:,:13]
#x_vars = df_d
#get drug outcome variables
out = df[['stim_final','dep_final','hallu_final']]
#create test train split
X_train, X_test, y_train, y_test = train_test_split(x_vars,out, test_size=.4, random_state=42)

In [22]:
#instantiate oversampler
over = SMOTE(k_neighbors=2,random_state=42)
#over = RandomOverSampler(random_state=42)

In [23]:
#resample training data
X_sampled,y_sampled = over.fit_resample(X_train,y_train["stim_final"])

In [24]:
#check counts for outcome variable
y_sampled.value_counts()

0    1124
2    1124
3    1124
Name: stim_final, dtype: int64

# Testing Hyperparameters - GridSearchCV

In this section, the parameter search is carried out with GridSearchCV, using 5-fold cross-validation with each classifier wrapped in OneVsRestClassifier.

In [25]:
#binarize stimulant outcome
y = label_binarize(y_sampled, classes=[0, 2, 3])
#get number of classes
n_classes = y.shape[1]
#set seeded random state for repeatability
random_state = np.random.RandomState(42)

# shuffle and split training and test sets
#X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_sampled, y, test_size=0.5, random_state=0)
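
For readers unfamiliar with `label_binarize`: with three classes it returns an indicator matrix with one column per class, which is why `n_classes` is read from `y.shape[1]`. The NumPy sketch below reproduces that layout on a few made-up labels. It is an illustration of the output format, not a replacement for the scikit-learn call.

```python
import numpy as np

# class codes used in this notebook and a few made-up labels
classes = np.array([0, 2, 3])
labels = np.array([3, 0, 2, 3])

# one row per sample, one column per class, 1 where the label matches the class
indicator = (labels[:, None] == classes[None, :]).astype(int)
print(indicator)

# number of classes, read the same way as in the cell above
n_classes = indicator.shape[1]
```

With a target in this format, OneVsRestClassifier fits one binary model per column, and the default accuracy score counts a row as correct only when all three columns are predicted correctly.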

In [26]:
#import libraries for GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
import time

In [43]:
#finalized parameter grids
param_grid = [#parameters for Logistic Regression
    {'estimator__penalty' : ['l1', 'l2'],
    'estimator__C': [1.0, 0.1, .001, .0001],
    'estimator__solver' : ['liblinear','newton-cg', 'lbfgs','saga']}
    ]

param_grid_svm = [#parameters for svm
    {'estimator__penalty' : ['l1', 'l2'],
     'estimator__dual': [True, False],
    'estimator__C': [1.0, 0.1, .001, .0001],
    'estimator__loss' : ['squared_hinge','hinge'],
    'estimator__max_iter': [1000, 5000, 10000],
    'estimator__multi_class': ['ovr', 'crammer_singer']}
    ]

param_grid_RF = [#parameters for Random Forest
    {#'estimator' : [OneVsRestClassifier(RandomForestClassifier())],
    #'estimator__n_estimators' : [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
    'estimator__n_estimators' : [200, 400, 600, 800],
    'estimator__max_features' : ['auto', 'sqrt','log2'],
    'estimator__bootstrap' : [True, False],
    #'estimator__max_depth': [5,10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'estimator__max_depth': [5,10, 20, 30, 40, 50, None],
     'estimator__min_samples_leaf': [1, 2, 4],
     'estimator__min_samples_split': [2, 5, 10],
    'estimator__criterion': ['gini','entropy','log_loss']}
]

param_grid_MLP = [#parameters for Neural Network
    
    {#'estimator' : [OneVsRestClassifier(MLPClassifier())],
    'estimator__hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'estimator__activation': ['tanh', 'relu'],
    'estimator__solver': ['sgd', 'adam'],
    'estimator__alpha': [0.0001, 0.05, 0.1],
    'estimator__learning_rate': ['constant','adaptive', 'invscaling'],
    'estimator__max_iter' :[200,300,400]}
    
    ]



## LinearSVC GridSearchCV

In [46]:
#SVM instantiate
SVM_clf = GridSearchCV(OneVsRestClassifier(svm.LinearSVC(C=.001, random_state=42, max_iter=10000, dual=False)),
                       param_grid = param_grid_svm, cv = 5, verbose=True, n_jobs=-1)

In [47]:
#get start time
start = time.time()

#fit
best_SVM_clf = SVM_clf.fit(X_sampled, y)

#print duration
print(f"GridSearchCV Total Time in seconds: {time.time()-start}")

Fitting 5 folds for each of 192 candidates, totalling 960 fits
GridSearchCV Total Time in seconds: 2003.7156116962433


 0.68570744 0.68570744        nan 0.47300582 0.68570744 0.68570744
        nan 0.18500978 0.68570744 0.68570744        nan 0.16538784
 0.68570744 0.68570744        nan 0.50055303 0.68570744 0.68570744
 0.80136982 0.79692626 0.68570744 0.68570744 0.80136982 0.79692626
 0.68570744 0.68570744 0.80136982 0.79692626 0.68570744 0.68570744
        nan        nan 0.68570744 0.68570744        nan        nan
 0.68570744 0.68570744        nan        nan 0.68570744 0.68570744
        nan 0.24527223 0.72546302 0.72546302        nan 0.30272909
 0.72546302 0.72546302        nan 0.49112914 0.72546302 0.72546302
        nan 0.26567799 0.72546302 0.72546302        nan 0.43624838
 0.72546302 0.72546302        nan 0.35992263 0.72546302 0.72546302
 0.78773975 0.78773931 0.72546302 0.72546302 0.78773975 0.78773931
 0.72546302 0.72546302 0.78773975 0.78773931 0.72546302 0.72546302
        nan        nan 0.72546302 0.72546302        nan        nan
 0.72546302 0.72546302        nan        nan 0.72546302 0.7254

In [48]:
#get best parameters
best_SVM_clf.best_estimator_

OneVsRestClassifier(estimator=LinearSVC(dual=False, penalty='l1',
                                        random_state=42))

In [28]:
#pipe = Pipeline([('estimator' , OneVsRestClassifier(LogisticRegression()))])

## Logistic Regression GridSearchCV

In [29]:
#logistic Regression Instantiate
LR_clf = GridSearchCV(OneVsRestClassifier(LogisticRegression()), param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)

In [30]:
#start time creation
start = time.time()

#fit
best_LR_clf = LR_clf.fit(X_sampled, y)

#print duration
print(f"GridSearchCV Total Time in seconds: {time.time()-start}")

Fitting 5 folds for each of 32 candidates, totalling 160 fits
GridSearchCV Total Time in seconds: 3.5903987884521484


 0.77379492 0.00118519 0.7681798         nan        nan 0.00118519
 0.72458556 0.7580943  0.7349656  0.00118519 0.00088889        nan
        nan 0.00088889 0.13879679 0.16044664 0.16282053 0.00118519
 0.00088889        nan        nan 0.00088889 0.00148148 0.04506297
 0.04506297 0.00118519]


In [32]:
#get best parameters
best_LR_clf.best_estimator_

OneVsRestClassifier(estimator=LogisticRegression(penalty='l1',
                                                 solver='liblinear'))

In [40]:
#show results dataframe
pd.DataFrame(best_LR_clf.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_estimator__C,param_estimator__penalty,param_estimator__solver,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.072606,0.006096072,0.007579,0.0004883836,1.0,l1,liblinear,"{'estimator__C': 1.0, 'estimator__penalty': 'l...",0.647407,0.77037,0.804154,1.0,0.833828,0.811152,0.113756,1
1,0.00379,0.0007461363,0.0,0.0,1.0,l1,newton-cg,"{'estimator__C': 1.0, 'estimator__penalty': 'l...",,,,,,,,32
2,0.004388,0.0004886945,0.0,0.0,1.0,l1,lbfgs,"{'estimator__C': 1.0, 'estimator__penalty': 'l...",,,,,,,,31
3,0.302591,0.00720851,0.00738,0.0007981539,1.0,l1,saga,"{'estimator__C': 1.0, 'estimator__penalty': 'l...",0.004444,0.001481,0.0,0.0,0.0,0.001185,0.001728,15
4,0.041688,0.001828044,0.007979,0.001544987,1.0,l2,liblinear,"{'estimator__C': 1.0, 'estimator__penalty': 'l...",0.591111,0.757037,0.807122,1.0,0.84273,0.7996,0.132179,3
5,0.412496,0.01640714,0.007381,0.0004884026,1.0,l2,newton-cg,"{'estimator__C': 1.0, 'estimator__penalty': 'l...",0.631111,0.764444,0.802671,1.0,0.841246,0.807894,0.119318,2
6,0.143416,0.003699759,0.006981,1.907349e-07,1.0,l2,lbfgs,"{'estimator__C': 1.0, 'estimator__penalty': 'l...",0.562963,0.748148,0.649852,0.995549,0.912463,0.773795,0.160545,4
7,0.230982,0.009824628,0.007779,0.0007462128,1.0,l2,saga,"{'estimator__C': 1.0, 'estimator__penalty': 'l...",0.004444,0.001481,0.0,0.0,0.0,0.001185,0.001728,15
8,0.050465,0.007585002,0.006981,0.0006306757,0.1,l1,liblinear,"{'estimator__C': 0.1, 'estimator__penalty': 'l...",0.50963,0.724444,0.778932,1.0,0.827893,0.76818,0.15886,5
9,0.003391,0.0004883637,0.0,0.0,0.1,l1,newton-cg,"{'estimator__C': 0.1, 'estimator__penalty': 'l...",,,,,,,,28
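
The rows with missing scores in the table above are the combinations of the `l1` penalty with the `newton-cg` and `lbfgs` solvers, which fail to fit. One way to avoid spending fits on them is to pass GridSearchCV a list of grids, one per penalty, each listing only compatible solvers. The sketch below shows such a grid and counts its candidates. The name `param_grid_valid` is hypothetical, and the grid was not run in this notebook.

```python
from math import prod

# pair l1 only with the solvers that returned scores for it in the table above
param_grid_valid = [
    {'estimator__penalty': ['l1'],
     'estimator__C': [1.0, 0.1, .001, .0001],
     'estimator__solver': ['liblinear', 'saga']},
    {'estimator__penalty': ['l2'],
     'estimator__C': [1.0, 0.1, .001, .0001],
     'estimator__solver': ['liblinear', 'newton-cg', 'lbfgs', 'saga']},
]

# candidates per sub-grid are summed, because each dictionary is searched separately
n_candidates = sum(prod(len(v) for v in grid.values()) for grid in param_grid_valid)
print(n_candidates)
```

This reduces the search from 32 candidates to 24 without dropping any combination that produced a score.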


## Random Forest GridSearchCV

In [33]:
#instantiate RF
RF_clf = GridSearchCV(OneVsRestClassifier(RandomForestClassifier()), param_grid = param_grid_RF, cv = 5, verbose=True, n_jobs=-1)

In [34]:
#create start time
start = time.time()

#fit
best_RF_clf = RF_clf.fit(X_sampled, y)

#print duration
print(f"GridSearchCV Total Time in seconds: {time.time()-start}")

Fitting 5 folds for each of 4536 candidates, totalling 22680 fits




GridSearchCV Total Time in seconds: 6918.977512598038


In [35]:
#get best parameters
best_RF_clf.best_estimator_

OneVsRestClassifier(estimator=RandomForestClassifier(bootstrap=False,
                                                     max_depth=20,
                                                     max_features='log2',
                                                     n_estimators=200))

## Neural Network GridSearchCV

In [36]:
#instantiate NN
MLP_clf = GridSearchCV(OneVsRestClassifier(MLPClassifier()), param_grid = param_grid_MLP, cv = 5, verbose=True, n_jobs=-1)

In [37]:
#create start time
start = time.time()

#fit
best_MLP_clf = MLP_clf.fit(X_sampled, y)

#print duration
print(f"GridSearchCV Total Time in seconds: {time.time()-start}")

Fitting 5 folds for each of 324 candidates, totalling 1620 fits
GridSearchCV Total Time in seconds: 638.8849527835846


In [38]:
#get best parameters
best_MLP_clf.best_estimator_

OneVsRestClassifier(estimator=MLPClassifier(activation='tanh', max_iter=400))

# Final Tables - Parameters tested

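The parameters and values tested for each model are output as dataframes. Each column lists the values tested for one parameter; columns with fewer values are padded with NaN.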
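
The padding comes from building each table out of `pd.Series` objects of different lengths: pandas aligns them on a shared index and fills the gaps with NaN. The sketch below shows this on a shortened, illustrative grid. It also shows why integer parameters such as `max_iter` appear as floats (for example `1000.0`) in the tables.

```python
import pandas as pd

# shortened grid for illustration; the lists deliberately differ in length
grid = {
    'estimator__penalty': ['l1', 'l2'],
    'estimator__C': [1.0, 0.1, .001, .0001],
    'estimator__max_iter': [1000, 5000, 10000],
}

# each list becomes a Series; shorter ones are padded with NaN on alignment
table = pd.DataFrame({name: pd.Series(values) for name, values in grid.items()})
print(table)

# the NaN padding forces the integer column to a float dtype
print(table['estimator__max_iter'].dtype)
```

These tables list the values tried per parameter, not the combinations: each row is just the n-th value of every list, so a row should not be read as one tested candidate.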

In [73]:
#parameters tested for logistic regression
param_grid_lr =    {'estimator__penalty' : ['l1', 'l2'],
    'estimator__C': [1.0, 0.1, .001, .0001],
    'estimator__solver' : ['liblinear','newton-cg', 'lbfgs','saga']}

#lr params df
lr_parms = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in param_grid_lr.items() ]))


#svm params
param_grid_svm = {
    'estimator__penalty' : ['l1', 'l2'],
     'estimator__dual': [True, False],
    'estimator__C': [1.0, 0.1, .001, .0001],
    'estimator__loss' : ['squared_hinge','hinge'],
    'estimator__max_iter': [1000, 5000, 10000],
    'estimator__multi_class': ['ovr', 'crammer_singer']}

#svm params df
svm_parms = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in param_grid_svm.items() ]))

#RF params tested
param_grid_RF = {#'estimator' : [OneVsRestClassifier(RandomForestClassifier())],
    #'estimator__n_estimators' : [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
    'estimator__n_estimators' : [200, 400, 600, 800],
    'estimator__max_features' : ['auto', 'sqrt','log2'],
    'estimator__bootstrap' : [True, False],
    #'estimator__max_depth': [5,10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'estimator__max_depth': [5,10, 20, 30, 40, 50, None],
     'estimator__min_samples_leaf': [1, 2, 4],
     'estimator__min_samples_split': [2, 5, 10],
    'estimator__criterion': ['gini','entropy','log_loss']}

#RF params df
RF_parms = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in param_grid_RF.items() ]))

#Neural Network params tested
param_grid_MLP = {
    #'estimator' : [OneVsRestClassifier(MLPClassifier())],
    'estimator__hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'estimator__activation': ['tanh', 'relu'],
    'estimator__solver': ['sgd', 'adam'],
    'estimator__alpha': [0.0001, 0.05, 0.1],
    'estimator__learning_rate': ['constant','adaptive', 'invscaling'],
    'estimator__max_iter' :[200,300,400]}

#Neural Network params df
MLP_parms = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in param_grid_MLP.items() ]))    

In [63]:
#inspect the (parameter, Series) pairs used to build the logistic regression table
param_grid_lr =    {'estimator__penalty' : ['l1', 'l2'],
    'estimator__C': [1.0, 0.1, .001, .0001],
    'estimator__solver' : ['liblinear','newton-cg', 'lbfgs','saga']}

[ (k,pd.Series(v)) for k,v in param_grid_lr.items() ]

[('estimator__penalty',
  0    l1
  1    l2
  dtype: object),
 ('estimator__C',
  0    1.0000
  1    0.1000
  2    0.0010
  3    0.0001
  dtype: float64),
 ('estimator__solver',
  0    liblinear
  1    newton-cg
  2        lbfgs
  3         saga
  dtype: object)]

In [69]:
#show df
lr_parms

Unnamed: 0,estimator__penalty,estimator__C,estimator__solver
0,l1,1.0,liblinear
1,l2,0.1,newton-cg
2,,0.001,lbfgs
3,,0.0001,saga


In [70]:
#show df
svm_parms

Unnamed: 0,estimator__penalty,estimator__dual,estimator__C,estimator__loss,estimator__max_iter,estimator__multi_class
0,l1,True,1.0,squared_hinge,1000.0,ovr
1,l2,False,0.1,hinge,5000.0,crammer_singer
2,,,0.001,,10000.0,
3,,,0.0001,,,


In [72]:
#show df
RF_parms

Unnamed: 0,estimator__n_estimators,estimator__max_features,estimator__bootstrap,estimator__max_depth,estimator__min_samples_leaf,estimator__min_samples_split,estimator__criterion
0,200.0,auto,True,5.0,1.0,2.0,gini
1,400.0,sqrt,False,10.0,2.0,5.0,entropy
2,600.0,log2,,20.0,4.0,10.0,log_loss
3,800.0,,,30.0,,,
4,,,,40.0,,,
5,,,,50.0,,,
6,,,,,,,


In [74]:
#show df
MLP_parms

Unnamed: 0,estimator__hidden_layer_sizes,estimator__activation,estimator__solver,estimator__alpha,estimator__learning_rate,estimator__max_iter
0,"(50, 50, 50)",tanh,sgd,0.0001,constant,200
1,"(50, 100, 50)",relu,adam,0.05,adaptive,300
2,"(100,)",,,0.1,invscaling,400
