<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Data-Preprocessing-(Data-Transformations)" data-toc-modified-id="Data-Preprocessing-(Data-Transformations)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Preprocessing (Data Transformations)</a></span><ul class="toc-item"><li><span><a href="#Drug-Outcome-variable-transformations" data-toc-modified-id="Drug-Outcome-variable-transformations-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Drug Outcome variable transformations</a></span><ul class="toc-item"><li><span><a href="#Remove-string-and-change-to-integer" data-toc-modified-id="Remove-string-and-change-to-integer-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Remove string and change to integer</a></span></li><li><span><a href="#Create-3-broader-outcome-variables-(Stimulants,-Depressants-and-Hallucinogens)" data-toc-modified-id="Create-3-broader-outcome-variables-(Stimulants,-Depressants-and-Hallucinogens)-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Create 3 broader outcome variables (<em>Stimulants, Depressants and Hallucinogens</em>)</a></span></li><li><span><a href="#Recode-from-6-levels-to-3-levels" data-toc-modified-id="Recode-from-6-levels-to-3-levels-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Recode from 6 levels to 3 levels</a></span></li></ul></li></ul></li><li><span><a href="#Initial-Models" data-toc-modified-id="Initial-Models-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Initial Models</a></span><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Support-Vector-Machine" data-toc-modified-id="Support-Vector-Machine-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Support Vector Machine</a></span></li><li><span><a 
href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#K-nn-classifier" data-toc-modified-id="K-nn-classifier-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>K-nn classifier</a></span></li><li><span><a href="#Decision-Tree" data-toc-modified-id="Decision-Tree-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Decision Tree</a></span></li><li><span><a href="#Gradient-Boosted-Trees" data-toc-modified-id="Gradient-Boosted-Trees-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span>Gradient Boosted Trees</a></span></li><li><span><a href="#Linear-Discriminant-Analysis-(LDA)" data-toc-modified-id="Linear-Discriminant-Analysis-(LDA)-3.7"><span class="toc-item-num">3.7&nbsp;&nbsp;</span>Linear Discriminant Analysis (LDA)</a></span></li><li><span><a href="#Neural-Network" data-toc-modified-id="Neural-Network-3.8"><span class="toc-item-num">3.8&nbsp;&nbsp;</span>Neural Network</a></span></li></ul></li></ul></div>

In [4]:
#import libraries
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
import re
%matplotlib inline

# Introduction

This module has two parts: completing the data transformations, and building initial models for the stimulants outcome. It follows the additional exploratory analysis and the bivariate and multivariate regressions used for feature selection in the previous module. The following models are fitted with default settings for a first look at performance:

1. Support Vector Machines
2. Multinomial Logit Regression (Softmax Regression)
3. K-Nearest Neighbours classifier
4. Decision Trees
5. Random Forests/Gradient Boosted Trees
6. Linear Discriminant Analysis (LDA)
7. Neural Network


# Data Preprocessing (Data Transformations)

In this first section, the dataset is prepared for modelling through a series of data transformations.

Eighteen of the drug-use variables (every drug column except SEMER, which is not assigned to a group in the functions below) are collapsed into three new outcome variables, each representing a broader class of drugs: _**Stimulants, Depressants, and Hallucinogens**_.

Additionally, the seven levels of drug use (CL0 to CL6) are collapsed into three levels, which the `recode` function below codes as _**0 - unlikely to use, 2 - medium use, 3 - high use**_.


In [2]:
#read in dataset
df = pd.read_csv("../drug_consumption_cap_20230505.csv")
df

Unnamed: 0,ID,Age,Gender,Education,Country,Ethnicity,NEO_N,NEO_E,NEO_O,NEO_A,...,ECST,HEROIN,KETA,LEGALH,LSD,METH,MUSHRM,NICO,SEMER,VSA
0,1,0.49788,0.48246,-0.05921,0.96082,0.12600,0.31287,-0.57545,-0.58331,-0.91699,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
1,2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,...,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0
2,3,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.62090,...,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0,CL0
3,4,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,...,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0,CL0
4,5,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.63340,-0.45174,-0.30172,...,CL1,CL0,CL0,CL1,CL0,CL0,CL2,CL2,CL0,CL0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,1884,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,-1.19430,1.74091,1.88511,0.76096,...,CL0,CL0,CL0,CL3,CL3,CL0,CL0,CL0,CL0,CL5
1881,1885,-0.95197,-0.48246,-0.61113,-0.57009,-0.31685,-0.24649,1.74091,0.58331,0.76096,...,CL2,CL0,CL0,CL3,CL5,CL4,CL4,CL5,CL0,CL0
1882,1886,-0.07854,0.48246,0.45468,-0.57009,-0.31685,1.13281,-1.37639,-1.27553,-1.77200,...,CL4,CL0,CL2,CL0,CL2,CL0,CL2,CL6,CL0,CL0
1883,1887,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,0.91093,-1.92173,0.29338,-1.62090,...,CL3,CL0,CL0,CL3,CL3,CL0,CL3,CL4,CL0,CL0


## Drug Outcome variable transformations

### Remove string and change to integer

In [3]:
#select only the drug variable columns
df.iloc[:,13:]

Unnamed: 0,ALC,AMPHET,AMYL,BENZOS,CAFF,CANNABIS,CHOC,COCAINE,CRACK,ECST,HEROIN,KETA,LEGALH,LSD,METH,MUSHRM,NICO,SEMER,VSA
0,CL5,CL2,CL0,CL2,CL6,CL0,CL5,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
1,CL5,CL2,CL2,CL0,CL6,CL4,CL6,CL3,CL0,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0
2,CL6,CL0,CL0,CL0,CL6,CL3,CL4,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0,CL0
3,CL4,CL0,CL0,CL3,CL5,CL2,CL4,CL2,CL0,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0,CL0
4,CL4,CL1,CL1,CL0,CL6,CL3,CL6,CL0,CL0,CL1,CL0,CL0,CL1,CL0,CL0,CL2,CL2,CL0,CL0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,CL5,CL0,CL0,CL0,CL4,CL5,CL4,CL0,CL0,CL0,CL0,CL0,CL3,CL3,CL0,CL0,CL0,CL0,CL5
1881,CL5,CL0,CL0,CL0,CL5,CL3,CL4,CL0,CL0,CL2,CL0,CL0,CL3,CL5,CL4,CL4,CL5,CL0,CL0
1882,CL4,CL6,CL5,CL5,CL6,CL6,CL6,CL4,CL0,CL4,CL0,CL2,CL0,CL2,CL0,CL2,CL6,CL0,CL0
1883,CL5,CL0,CL0,CL0,CL6,CL6,CL5,CL0,CL0,CL3,CL0,CL0,CL3,CL3,CL0,CL3,CL4,CL0,CL0


In [9]:
#remove 'CL' prefix
df.iloc[:,13:] = df.iloc[:,13:].applymap(lambda x: re.sub('CL','',x))
df.iloc[:,13:]

Unnamed: 0,ALC,AMPHET,AMYL,BENZOS,CAFF,CANNABIS,CHOC,COCAINE,CRACK,ECST,HEROIN,KETA,LEGALH,LSD,METH,MUSHRM,NICO,SEMER,VSA
0,5,2,0,2,6,0,5,0,0,0,0,0,0,0,0,0,2,0,0
1,5,2,2,0,6,4,6,3,0,4,0,2,0,2,3,0,4,0,0
2,6,0,0,0,6,3,4,0,0,0,0,0,0,0,0,1,0,0,0
3,4,0,0,3,5,2,4,2,0,0,0,2,0,0,0,0,2,0,0
4,4,1,1,0,6,3,6,0,0,1,0,0,1,0,0,2,2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,5,0,0,0,4,5,4,0,0,0,0,0,3,3,0,0,0,0,5
1881,5,0,0,0,5,3,4,0,0,2,0,0,3,5,4,4,5,0,0
1882,4,6,5,5,6,6,6,4,0,4,0,2,0,2,0,2,6,0,0
1883,5,0,0,0,6,6,5,0,0,3,0,0,3,3,0,3,4,0,0


In [16]:
#recode as integer field type
df.iloc[:,13:] = df.iloc[:,13:].apply(lambda x: x.astype(int))
#check for field type of outcomes (should be integers)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885 entries, 0 to 1884
Data columns (total 32 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         1885 non-null   int64  
 1   Age        1885 non-null   float64
 2   Gender     1885 non-null   float64
 3   Education  1885 non-null   float64
 4   Country    1885 non-null   float64
 5   Ethnicity  1885 non-null   float64
 6   NEO_N      1885 non-null   float64
 7   NEO_E      1885 non-null   float64
 8   NEO_O      1885 non-null   float64
 9   NEO_A      1885 non-null   float64
 10  NEO_C      1885 non-null   float64
 11  IMP        1885 non-null   float64
 12  SS         1885 non-null   float64
 13  ALC        1885 non-null   int32  
 14  AMPHET     1885 non-null   int32  
 15  AMYL       1885 non-null   int32  
 16  BENZOS     1885 non-null   int32  
 17  CAFF       1885 non-null   int32  
 18  CANNABIS   1885 non-null   int32  
 19  CHOC       1885 non-null   int32  
 20  COCAINE 
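The two steps above (stripping the `CL` prefix, then casting to integer) can also be done column by column with pandas string methods. The sketch below is only an illustration on a made-up two-column frame; `toy` and `toy_int` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the drug columns (values are made up)
toy = pd.DataFrame({"ALC": ["CL5", "CL6"], "CAFF": ["CL6", "CL0"]})

# For each column: remove the literal 'CL' prefix, then cast to integer
toy_int = toy.apply(
    lambda col: col.str.replace("CL", "", regex=False).astype(int)
)

print(toy_int.dtypes)
```

Building a new frame rather than assigning back through `iloc` avoids relying on how a given pandas version handles dtype changes during in-place assignment.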

### Create 3 broader outcome variables (*Stimulants, Depressants and Hallucinogens*)

In [37]:
#testing function to group drugs to create a new outcome variable
def create_drug_test(row):      
    return max(row['ALC'],row['AMPHET'],row['AMYL'],\
              row['BENZOS'],row['CANNABIS'])

In [38]:
#test on first three rows - before and after
display(df.iloc[:3,13:])

#selection from row
display(df.iloc[:3,13:].apply(lambda x: create_drug_test(x), axis=1))

Unnamed: 0,ALC,AMPHET,AMYL,BENZOS,CAFF,CANNABIS,CHOC,COCAINE,CRACK,ECST,HEROIN,KETA,LEGALH,LSD,METH,MUSHRM,NICO,SEMER,VSA
0,5,2,0,2,6,0,5,0,0,0,0,0,0,0,0,0,2,0,0
1,5,2,2,0,6,4,6,3,0,4,0,2,0,2,3,0,4,0,0
2,6,0,0,0,6,3,4,0,0,0,0,0,0,0,0,1,0,0,0


0    5
1    5
2    6
dtype: int64

In [39]:
#function to group drugs to create a new stimulants outcome variable
def create_stimulants(row):      
    return max(row['AMPHET'],row['NICO'],row['COCAINE'],\
              row['CRACK'],row['CAFF'],row['CHOC'])

In [40]:
#function to group drugs to create a new depressants outcome variable
def create_depressants(row):      
    return max(row['ALC'],row['AMYL'],row['BENZOS'],row['VSA'],row['HEROIN'],\
              row['METH'])

In [41]:
#function to group drugs to create a new hallucinogens outcome variable
def create_hallucinogens(row):      
    return max(row['CANNABIS'],row['ECST'],row['KETA'],row['LSD'],\
               row['MUSHRM'],row['LEGALH'])

In [50]:
df["stimulants"] = df.iloc[:,13:].apply(lambda x: create_stimulants(x).astype(int), axis=1)
df["depressants"] = df.iloc[:,13:].apply(lambda x: create_depressants(x).astype(int), axis=1)
df["hallucinogens"] = df.iloc[:,13:].apply(lambda x: create_hallucinogens(x).astype(int), axis=1)
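The row-wise functions above take the maximum usage level across a group of drugs. A minimal sketch of the same idea using a column-wise `max(axis=1)` is shown below; the toy values are invented and `stimulant_cols` is a hypothetical helper name.

```python
import pandas as pd

# Toy usage levels for two respondents (made-up values)
toy = pd.DataFrame({
    "AMPHET": [2, 0], "NICO": [2, 0], "COCAINE": [0, 3],
    "CRACK": [0, 0], "CAFF": [6, 1], "CHOC": [5, 2],
})

# Same drug grouping as create_stimulants
stimulant_cols = ["AMPHET", "NICO", "COCAINE", "CRACK", "CAFF", "CHOC"]

# Highest usage level across the group, computed for every row at once
toy["stimulants"] = toy[stimulant_cols].max(axis=1)

print(toy["stimulants"].tolist())
```

This gives the same result as applying the function row by row, but it avoids a Python-level loop over the rows.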



In [51]:
df

Unnamed: 0,ID,Age,Gender,Education,Country,Ethnicity,NEO_N,NEO_E,NEO_O,NEO_A,...,LEGALH,LSD,METH,MUSHRM,NICO,SEMER,VSA,stimulants,depressants,hallucinogens
0,1,0.49788,0.48246,-0.05921,0.96082,0.12600,0.31287,-0.57545,-0.58331,-0.91699,...,0,0,0,0,2,0,0,6,5,0
1,2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,...,0,2,3,0,4,0,0,6,5,4
2,3,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.62090,...,0,0,0,1,0,0,0,6,6,3
3,4,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,...,0,0,0,0,2,0,0,5,4,2
4,5,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.63340,-0.45174,-0.30172,...,1,0,0,2,2,0,0,6,4,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,1884,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,-1.19430,1.74091,1.88511,0.76096,...,3,3,0,0,0,0,5,4,5,5
1881,1885,-0.95197,-0.48246,-0.61113,-0.57009,-0.31685,-0.24649,1.74091,0.58331,0.76096,...,3,5,4,4,5,0,0,5,5,5
1882,1886,-0.07854,0.48246,0.45468,-0.57009,-0.31685,1.13281,-1.37639,-1.27553,-1.77200,...,0,2,0,2,6,0,0,6,5,6
1883,1887,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,0.91093,-1.92173,0.29338,-1.62090,...,3,3,0,3,4,0,0,6,5,6


### Recode from 6 levels to 3 levels

In [52]:
#define recoding function: collapse usage levels 0-6 into three groups
def recode(val):
    
    #values 4 and above -> 3 (high use)
    if val >= 4:
        return 3
    
    #values 2 and 3 -> 2 (medium use)
    elif val >= 2:
        return 2
    
    #values 0 and 1 -> 0 (unlikely to use)
    else:
        return 0

In [57]:
df[['stim_final','dep_final','hallu_final']] = df[['stimulants','depressants','hallucinogens']].applymap(lambda x: recode(x))
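As an illustration of the recoding rule, the sketch below applies the same thresholds to all seven possible levels using `numpy.select`. It is an alternative formulation under the assumption that the intended codes are 0, 2 and 3, as returned by `recode`; `levels` and `recoded` are hypothetical names.

```python
import numpy as np
import pandas as pd

# All seven original usage levels
levels = pd.Series([0, 1, 2, 3, 4, 5, 6])

# np.select takes the first condition that is true for each element,
# so ">= 4" is checked before ">= 2"; everything else falls to the default
recoded = pd.Series(
    np.select([levels >= 4, levels >= 2], [3, 2], default=0),
    index=levels.index,
)

print(recoded.tolist())
```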

In [58]:
df

Unnamed: 0,ID,Age,Gender,Education,Country,Ethnicity,NEO_N,NEO_E,NEO_O,NEO_A,...,MUSHRM,NICO,SEMER,VSA,stimulants,depressants,hallucinogens,stim_final,dep_final,hallu_final
0,1,0.49788,0.48246,-0.05921,0.96082,0.12600,0.31287,-0.57545,-0.58331,-0.91699,...,0,2,0,0,6,5,0,3,3,0
1,2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,...,0,4,0,0,6,5,4,3,3,3
2,3,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.62090,...,1,0,0,0,6,6,3,3,3,2
3,4,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,...,0,2,0,0,5,4,2,3,3,2
4,5,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.63340,-0.45174,-0.30172,...,2,2,0,0,6,4,3,3,3,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,1884,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,-1.19430,1.74091,1.88511,0.76096,...,0,0,0,5,4,5,5,3,3,3
1881,1885,-0.95197,-0.48246,-0.61113,-0.57009,-0.31685,-0.24649,1.74091,0.58331,0.76096,...,4,5,0,0,5,5,5,3,3,3
1882,1886,-0.07854,0.48246,0.45468,-0.57009,-0.31685,1.13281,-1.37639,-1.27553,-1.77200,...,2,6,0,0,6,5,6,3,3,3
1883,1887,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,0.91093,-1.92173,0.29338,-1.62090,...,3,4,0,0,6,5,6,3,3,3


# Initial Models

In the following section, initial models are fitted for the _**stimulants outcome variable**_. The models fitted are:

- Support Vector Machines
- Multinomial Logit Regression (Softmax Regression)
- K-Nearest Neighbours classifier
- Decision Trees
- Random Forests/Gradient Boosted Trees
- Linear Discriminant Analysis (LDA)
- Neural Network

The model hyperparameters are left at their default values for the time being, and the classes have not yet been balanced. Another set of models will be run with balanced classes (SMOTE will be applied).

Model accuracy scores are also quickly calculated on the test set (test/train split: 30/70).
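Because the classes are not yet balanced, it can help to compare each accuracy score against a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the 95/5 split is an assumption chosen for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels: 95 samples of class 3 and 5 samples of class 0
y = np.array([3] * 95 + [0] * 5)
X = np.zeros((len(y), 1))  # the features are ignored by this baseline

# Always predict the most frequent class seen during fitting
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)

print(baseline_acc)
```

A model is only informative to the extent that it beats this kind of baseline, which is one reason accuracy alone can be misleading on unbalanced outcomes.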

The section below provides a description of the methods used before going into the actual model calibration, fitting and scoring.


**Support Vector Machine**

Support Vector Machines are a supervised learning method that can be used for both classification and regression (Shmilovici, 2009). Given labelled data, the algorithm looks for an optimal decision boundary, a hyperplane in n-dimensional space (where n is the number of input features), that separates the data points according to their labels. The hyperplane selected is the one that separates the positive and negative classes by the greatest margin, which helps the model generalize. In the basic formulation the hyperplane is defined by a linear equation, so in two dimensions the boundary is a straight line. SVMs can, however, be adapted to multiclass problems and, through kernels, to non-linear decision boundaries (Burkov, 2019).
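To illustrate the difference between a linear and a non-linear boundary, the sketch below fits two SVMs to synthetic two-class data with a curved class boundary. The data come from `make_moons`, not from the drug dataset, and the kernel choices are assumptions made for this example only.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with a curved class boundary
X, y = make_moons(n_samples=400, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Linear kernel: the decision boundary is a straight line
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)

# RBF kernel: the decision boundary can bend around the classes
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print(f"linear: {linear_acc:.3f}, rbf: {rbf_acc:.3f}")
```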

**Logistic Regression and Multinomial Logits**

Logistic Regression is a classification model that models the probability of a binary outcome (Burkov, 2019). A linear combination of the inputs gives the log odds of the positive class, and the standard logistic (sigmoid) function squeezes this value into the interval between 0 and 1 so that it can be read as a probability. Negative log odds correspond to probabilities below .5 and map to “0”; positive log odds correspond to probabilities above .5 and map to “1”.

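The log odds can be mapped back to probabilities with the following equation (Kleinbaum, Dietz, Gail, Klein, & Klein, 2002):

$$\log\frac{p(x)}{1-p(x)} = \beta_0 + x\cdot\beta \quad\Longleftrightarrow\quad p(x) = \frac{e^{\beta_0 + x\cdot\beta}}{1 + e^{\beta_0 + x\cdot\beta}} = \frac{1}{1 + e^{-(\beta_0 + x\cdot\beta)}}$$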

A threshold probability is set (e.g., .5), and whenever p > .5 the prediction is the positive class (“1”). Multinomial Logistic Regression extends this model to produce probabilities for more than two classes.
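The mapping from log odds to probabilities, the .5 threshold, and the multiclass extension can be sketched in a few lines of NumPy. The input values are arbitrary examples, and `softmax` is shown here as the usual multiclass generalization of the sigmoid.

```python
import numpy as np

def sigmoid(z):
    """Map log odds z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Map one score per class to probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtract the max for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

# Binary case: negative, zero and positive log odds
log_odds = np.array([-2.0, 0.0, 2.0])
probs = sigmoid(log_odds)
labels = (probs > 0.5).astype(int)  # apply the .5 threshold

# Multiclass case: three class scores
class_probs = softmax(np.array([1.0, 2.0, 3.0]))

print(probs, labels, class_probs)
```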

**k-Nearest Neighbors Classifier**

The k-Nearest Neighbors algorithm is a non-parametric supervised learning algorithm that can be used for classification or regression. When the algorithm sees a new, unlabelled sample x, it finds the k training examples closest to x, where closeness is measured by a distance computed over the n input features. A distance metric such as Euclidean distance, Manhattan distance or cosine similarity is calculated between x and every data point in the training set. The k data points with the smallest distance are deemed the closest to x, and the majority label among them is assigned to x (Burkov, 2019).
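The procedure just described can be sketched from scratch as follows. This is a minimal illustration using Manhattan distance on invented data, not the scikit-learn implementation used later; `knn_predict`, `X_toy` and `y_toy` are hypothetical names.

```python
from collections import Counter

import numpy as np

def knn_predict(X_known, y_known, x_new, k=3):
    """Label x_new by majority vote among its k nearest known points."""
    # Manhattan distance from x_new to every known point
    distances = np.abs(X_known - x_new).sum(axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority label among those neighbours
    votes = Counter(y_known[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two small clusters with labels 0 and 3 (made-up data)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_toy = np.array([0, 0, 0, 3, 3])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]))
pred_near_five = knn_predict(X_toy, y_toy, np.array([5.0, 5.1]))

print(pred_near_origin, pred_near_five)
```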

**Decision Tree/Random Forest/Gradient Boosting Tree**

A decision tree is a non-parametric model that builds an acyclic graph (Burkov, 2019). At each node the algorithm chooses a rule to split the data on: when a feature value is above a threshold, the data point follows one branch; otherwise it follows the other. When no more splits can be made, a leaf node is reached and the data point is assigned a class. To determine whether a split is good, entropy is calculated. Entropy is high when the classes in a node are evenly mixed and zero when only one class is present, so a good split is one that reduces entropy.

A random forest extends this concept by generating multiple trees, each trained on a random sample of the training data and each considering a new random subset of features at every split. The outputs are combined at the end (e.g., through majority vote) to get a final classification. This reduces the correlation between trees, which would otherwise hurt predictive accuracy, and lowers the variance of the final model, reducing the chance of overfitting.

Another extension of the decision tree is the Gradient Boosting Tree. Multiple trees are again built, but each tree depends on the ones before it: the residuals of the current model are calculated and used as the new labels for the next tree. Each added tree therefore targets the errors that remain, so the residuals shrink as trees are added (Burkov, 2019).
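The entropy criterion can be made concrete with a short sketch. The functions below compute the entropy of a set of labels and the reduction in entropy (information gain) from a candidate split; the label lists are invented, and `entropy` and `information_gain` are hypothetical helper names.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

mixed_node = [0, 0, 1, 1]   # evenly mixed classes: maximum entropy for two classes
pure_node = [1, 1, 1, 1]    # a single class: zero entropy

# A split that separates the two classes perfectly
gain = information_gain(mixed_node, [0, 0], [1, 1])

print(entropy(mixed_node), gain)
```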

**Neural Networks**

A neural network consists of a series of nested functions called layers (Burkov, 2019). Each layer after the input applies an activation function (e.g., a logistic function) chosen by the analyst and can have multiple units. The last layer is the output layer, which combines the outputs of the second-to-last layer into the final output: a single value for regression, or one value per class for classification. To get from the inputs (x) to the first layer, a weight is applied to each input before the weighted inputs are fed into every unit of that layer. Each unit then passes the result of its activation function on to the units in the next layer. This continues until the output layer produces a final regression value or class prediction. The network then compares its output with the expected output and uses the error, in a process called backpropagation, to adjust the weights of each layer and produce a better overall prediction.
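A forward pass through a small network can be sketched in NumPy as below. The layer sizes (13 inputs, 8 hidden units, 3 classes), the ReLU activation and the random weights are all assumptions for illustration; they do not describe the `MLPClassifier` fitted later, and no training (backpropagation) is performed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Activation function: pass positive values, zero out negative ones."""
    return np.maximum(0.0, z)

def softmax(z):
    """Row-wise softmax: one probability per class, each row summing to 1."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

n_features, n_hidden, n_classes = 13, 8, 3

# Randomly initialised weights and zero biases for the two layers
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_classes)), np.zeros(n_classes)

# A batch of four made-up input rows
X_batch = rng.normal(size=(4, n_features))

# Weighted inputs -> hidden activations -> class probabilities
hidden = relu(X_batch @ W1 + b1)
class_probs = softmax(hidden @ W2 + b2)

print(class_probs.shape)
```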



## Setup

In [86]:
#import libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

In [61]:
#subset independent variables (columns 0-12; note that this includes the ID column)
x_vars = df.iloc[:,:13]

#subset drug outcome variables
out = df[['stim_final','dep_final','hallu_final']]

#create test train set
X_train, X_test, y_train, y_test = train_test_split(x_vars,out, test_size=.3, random_state=42)
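With unbalanced classes, a plain random split can leave the test set with very few examples of a rare class. One option, sketched below on made-up labels, is the `stratify` argument of `train_test_split`, which keeps class proportions roughly equal in both sets. This is a suggestion rather than what the notebook does, and the 90/10 class split is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 90 samples of class 3 and 10 samples of class 0
y = np.array([3] * 90 + [0] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

# stratify=y preserves the 90/10 class ratio in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(len(y_te), int((y_te == 0).sum()))
```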

## Support Vector Machine

In [62]:
#instantiate SVM object
lin_clf = svm.LinearSVC()

#fit model
SVC = lin_clf.fit(X_train,y_train["stim_final"])





In [63]:
#accuracy score on test data set
lin_clf.score(X_test,y_test["stim_final"])

0.8639575971731449

## Logistic Regression

In [64]:
#instantiate and fit multinomial logistic regression
mnlogit = LogisticRegression(multi_class='multinomial', fit_intercept=True,
                             solver='lbfgs').fit(X_train, y_train["stim_final"])


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [65]:
#accuracy score for training set
mnlogit.score(X_train,y_train["stim_final"])

0.9946929492039424

In [66]:
#accuracy score for test set
mnlogit.score(X_test,y_test["stim_final"])

0.991166077738516

## K-nn classifier

In [69]:
#instantiate k-nnclassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train,y_train["stim_final"])

KNeighborsClassifier(n_neighbors=3)

In [70]:
#accuracy score
neigh.score(X_test,y_test["stim_final"])

0.991166077738516

## Decision Tree

In [72]:
#instantiate DT
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train,y_train["stim_final"])

DecisionTreeClassifier(random_state=0)

In [73]:
#accuracy score
clf.score(X_test,y_test["stim_final"])

0.9858657243816255

## Gradient Boosted Trees

In [77]:
#instantiate (note: this is the regressor, so the class codes are treated as numeric values)
reg = GradientBoostingRegressor(random_state=0)
#fit
reg.fit(X_train,y_train["stim_final"])

GradientBoostingRegressor(random_state=0)

In [79]:
#R^2 score - training set (a regressor's .score returns R^2, not accuracy)
reg.score(X_train,y_train["stim_final"])

0.9456004825780904

In [80]:
#R^2 score - test set (a regressor's .score returns R^2, not accuracy; R^2 can be negative)
reg.score(X_test,y_test["stim_final"])

-1.971166826424449
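Since the outcome is a class label, a score comparable with the other models would come from `GradientBoostingClassifier`, whose `.score` method returns accuracy. The sketch below shows the pattern on synthetic data only; no result for the drug dataset is implied, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data (not the drug dataset)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Classifier variant of gradient boosting, default hyperparameters
gbc = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# For classifiers, .score is the mean accuracy on the given data
test_acc = gbc.score(X_te, y_te)
print(test_acc)
```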

## Linear Discriminant Analysis (LDA)

In [83]:
#instantiate
LDA = LinearDiscriminantAnalysis()

#fit
LDA.fit(X_train,y_train["stim_final"])

LinearDiscriminantAnalysis()

In [84]:
#accuracy score - training set
LDA.score(X_train,y_train["stim_final"])

0.9931766489764974

In [85]:
#accuracy score -test set
LDA.score(X_test,y_test["stim_final"])

0.9876325088339223

## Neural Network

In [88]:
#instantiate
NN = MLPClassifier(random_state=1, max_iter=300).fit(X_train,y_train["stim_final"])

In [91]:
#accuracy score - train set
NN.score(X_train,y_train["stim_final"])

0.9946929492039424

In [90]:
#accuracy score -test set
NN.score(X_test,y_test["stim_final"])

0.991166077738516