# Machine Learning Processing

### Contents

 - I. [Data Importing & Processing](#Data-Import-and-Processing)
 - II. [Modeling Plan](#Modeling-Plan)
     - [Iteration 0](#Iteration-0:-Tree-based-Naivete)
     - [Iteration 1](#Iteration-1:-Baby's-First-Pipe)
     - [Iteration 2](#Iteration-2:-Give-the-Computer-a-Chance)
     - [Iteration 3](#Iteration-3:-Give-the-PCA-some-backup)
     - [Progress So Far](#Progress-Report-Card)
     - [Optimization]
 - III. [Results & Discussion]
     - [Business Impact]
     - [Permutation Importance]

In [90]:
# Import packages etc.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pickle

## Data Import and Processing

In [91]:
with open('Xtrain.pickle', 'rb') as f:
    provdat = pickle.load(f)
with open('Xtest.pickle', 'rb') as f:
    provtarg = pickle.load(f)
with open('Xdata.pickle', 'rb') as f:
    provgroup = pickle.load(f)

In [92]:
# Examining Pre-Aggregated Per-Provider Data
print('Data Shape: ' + str(provdat.shape))
provdat.head()

Data Shape: (5410, 70)


Provider,Number_of_Claims,Num_Unique_Patients,Num_Unique_Docs,Num_Unique_Diag_Codes,Num_Unique_Proc_Codes,Mean_InscClaimAmtReimbursed,Mean_DeductibleAmtPaid,Mean_Reimbursement_per_Day,Mean_Length_of_Stay,Mean_Length_of_Claim,...,Perc_ClaimStartDt_March,Perc_ClaimStartDt_April,Perc_ClaimStartDt_May,Perc_ClaimStartDt_June,Perc_ClaimStartDt_July,Perc_ClaimStartDt_August,Perc_ClaimStartDt_September,Perc_ClaimStartDt_October,Perc_ClaimStartDt_November,Perc_ClaimStartDt_December
PRV51001,25,24,19,72,3,4185.6,213.6,3700.0,6.0,2.44,...,0.12,0.08,0.2,0.12,0.08,0.04,0.04,0.0,0.0,0.08
PRV51003,132,117,51,365,39,4588.41,502.17,2137.42,6.16,4.67,...,0.08,0.08,0.06,0.11,0.08,0.1,0.07,0.08,0.08,0.05
PRV51004,149,138,48,267,0,350.13,2.08,0.0,0.0,2.43,...,0.12,0.07,0.05,0.07,0.09,0.07,0.08,0.07,0.05,0.11
PRV51005,1165,495,7,1294,0,241.12,3.18,0.0,0.0,2.09,...,0.1,0.08,0.07,0.09,0.08,0.08,0.09,0.1,0.06,0.07
PRV51007,72,58,11,165,1,468.19,45.33,1080.56,6.33,1.96,...,0.12,0.06,0.12,0.12,0.04,0.08,0.1,0.03,0.08,0.08


In [93]:
# Examining Per-Provider Incidence of Fraud
print('Data Shape: ' + str(provtarg.shape))
provtarg.head()

Data Shape: (5410,)


Provider
PRV51001    0
PRV51003    1
PRV51004    0
PRV51005    1
PRV51007    0
Name: PotentialFraud, dtype: int64

In [94]:
# Combining Target data with Provider data to generate supervised learning dataset
xdata = pd.merge(provdat.reset_index(), provtarg.reset_index(), on='Provider').set_index('Provider')
xdata.head()

Provider,Number_of_Claims,Num_Unique_Patients,Num_Unique_Docs,Num_Unique_Diag_Codes,Num_Unique_Proc_Codes,Mean_InscClaimAmtReimbursed,Mean_DeductibleAmtPaid,Mean_Reimbursement_per_Day,Mean_Length_of_Stay,Mean_Length_of_Claim,...,Perc_ClaimStartDt_April,Perc_ClaimStartDt_May,Perc_ClaimStartDt_June,Perc_ClaimStartDt_July,Perc_ClaimStartDt_August,Perc_ClaimStartDt_September,Perc_ClaimStartDt_October,Perc_ClaimStartDt_November,Perc_ClaimStartDt_December,PotentialFraud
PRV51001,25,24,19,72,3,4185.6,213.6,3700.0,6.0,2.44,...,0.08,0.2,0.12,0.08,0.04,0.04,0.0,0.0,0.08,0
PRV51003,132,117,51,365,39,4588.41,502.17,2137.42,6.16,4.67,...,0.08,0.06,0.11,0.08,0.1,0.07,0.08,0.08,0.05,1
PRV51004,149,138,48,267,0,350.13,2.08,0.0,0.0,2.43,...,0.07,0.05,0.07,0.09,0.07,0.08,0.07,0.05,0.11,0
PRV51005,1165,495,7,1294,0,241.12,3.18,0.0,0.0,2.09,...,0.08,0.07,0.09,0.08,0.08,0.09,0.1,0.06,0.07,1
PRV51007,72,58,11,165,1,468.19,45.33,1080.56,6.33,1.96,...,0.06,0.12,0.12,0.04,0.08,0.1,0.03,0.08,0.08,0


In [95]:
from sklearn.model_selection import train_test_split

# Split the data into train & test sets for supervised learning
Xtrain, Xtest, ytrain, ytest = train_test_split(xdata.iloc[:, :-1],
                                               xdata.iloc[:, -1],  # select the target as a 1-D Series
                                               test_size=0.2,
                                               random_state=430)

# Quick Heads Up
print('Training Data shape: ' + str(Xtrain.shape))
print('Test Data Shape: ' + str(Xtest.shape))

Training Data shape: (4328, 70)
Test Data Shape: (1082, 70)
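
Fraud labels are often imbalanced, and a plain random split can leave the two partitions with slightly different positive rates. The sketch below shows the `stratify` argument of `train_test_split` on synthetic data; the 10% positive rate and the `X_demo` / `y_demo` names are assumptions for illustration, not properties of the provider dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the provider table: 1000 rows, roughly 10% positives (assumed rate)
rng = np.random.default_rng(430)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.10).astype(int)

# stratify=y_demo keeps the class ratio nearly identical in both partitions
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=430, stratify=y_demo
)

print('Positive rate (full): ', y_demo.mean())
print('Positive rate (train):', ytr.mean())
print('Positive rate (test): ', yte.mean())
```

Whether stratification matters here depends on how rare `PotentialFraud == 1` actually is in the full table.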


## Modeling Plan
Our general strategy starts with a broad first pass over several model families:
 - Logistic Regression
 - KNN
 - Discriminant Analysis
 - Naive Bayes
 - SVM
 - CatBoost
 - LightGBM

This grouping covers well-worn classical methods with a long history & extensive documentation alongside newer methods. Of particular interest among the latter are **CatBoost & LightGBM**, two gradient boosting libraries that have shown a great deal of promise in terms of efficiency & accuracy.

### Iterative Strategy | What We Care About 
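 - Iteration 0: Naive, untouched modeling with tree-based models; just fit.
 - Iteration 1: Pipeline = scale (standardize, no normalization) & fit; no unsupervised step.
     - 1a for gridsearch?
 - Iteration 2: Pipeline = scale (standardize, no normalization), PCA, & fit.
     - 2a for gridsearch?
 - Iteration 3: Scale, append the PCA components to the original features, & fit.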
 
### Naming Procedure
We can abbreviate the model name alongside the iteration number to keep track (e.g. "cb0" for "CatBoost, Iteration 0").
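
One way to keep this naming scheme organized is a dictionary keyed by the abbreviated names. The sketch below is only an illustration on synthetic data: the `make_iteration1` helper and the `models` dictionary are hypothetical and do not appear elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the provider features (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=430)

def make_iteration1(estimator):
    """Iteration 1 recipe: standardize, then fit the estimator."""
    return Pipeline(steps=[('scaler', StandardScaler()), ('model', estimator)])

# Keys follow the "<abbreviation><iteration>" naming convention
models = {
    'lgr1': make_iteration1(LogisticRegression(max_iter=1000)),
    'knn1': make_iteration1(KNeighborsClassifier()),
    'gnb1': make_iteration1(GaussianNB()),
}

for name, model in models.items():
    model.fit(X_demo, y_demo)
    print(name, 'train accuracy:', round(model.score(X_demo, y_demo), 3))
```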

### Iteration 0: Tree-based Naivete
[Back to Top](#Machine-Learning-Processing)

Both CatBoost & LightGBM can handle practically unprocessed data such as we have imported; for details, please consult their respective documentation. Here we fit both with default settings to get some initial baseline results.

In [96]:
from catboost import CatBoostClassifier
cb0 = CatBoostClassifier(verbose=False)
cb0.fit(Xtrain, ytrain)
# score() on a classifier returns mean accuracy
print('CatBoost Train Accuracy:', cb0.score(Xtrain, ytrain))
print('CatBoost Test Accuracy:', cb0.score(Xtest, ytest))

CatBoost Train Accuracy: 0.9972273567467652
CatBoost Test Accuracy: 0.9316081330868762


In [97]:
from lightgbm import LGBMClassifier
lgbm0 = LGBMClassifier()
lgbm0.fit(Xtrain, ytrain)
# score() on a classifier returns mean accuracy
print('LightGBM Train Accuracy:', lgbm0.score(Xtrain, ytrain))
print('LightGBM Test Accuracy:', lgbm0.score(Xtest, ytest))

LightGBM Train Accuracy: 1.0
LightGBM Test Accuracy: 0.9343807763401109
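
For classifiers, `score` reports mean accuracy (the fraction of correct predictions) rather than R². The sketch below checks this on synthetic data with a scikit-learn classifier and also prints a confusion matrix, since accuracy alone can hide how the minority class is handled. The 90/10 class weights are an assumption made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (assumed 90/10 split) for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.9, 0.1], random_state=430)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2,
                                      random_state=430, stratify=y_demo)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
pred = clf.predict(Xte)

# Both lines print the same number: the fraction of correct predictions
print('score():       ', clf.score(Xte, yte))
print('accuracy_score:', accuracy_score(yte, pred))

# Rows are true classes, columns are predicted classes
print(confusion_matrix(yte, pred))
```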


### Iteration 1: Baby's First Pipe
[Back to Top](#Machine-Learning-Processing)

Our goal here is to see how each model performs with standard scaling built into a pipeline. The features need some cleanup first, since `StandardScaler` cannot handle infinite values.

In [98]:
# Locate infinite values: returns (row positions, column positions)
badindices = np.where(np.isinf(Xtrain))
badindices

(array([ 220, 1330, 1347, 2198, 2599, 2614, 2920, 2946, 3951, 4321],
       dtype=int64),
 array([24, 24, 24, 24, 24, 24, 24, 24, 24, 24], dtype=int64))

In [99]:
# Fill value for column 24; note that negated positions select rows counted from the end
Xtrain.iloc[-badindices[0], 24].mean()

27.769

In [100]:
# Overwrite the infinite entries in column 24 with that fill value
Xtrain.iloc[badindices[0], 24] = Xtrain.iloc[-badindices[0], 24].mean()

In [101]:
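# Apply the training-derived fill value to the test set (assumes its infinities also sit in column 24)
badtests = np.where(np.isinf(Xtest))
Xtest.iloc[badtests[0], 24] = Xtrain.iloc[-badindices[0], 24].mean()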

Cleanup Complete
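
Note that `-badindices[0]` negates the row positions, so `.iloc` picks rows counted from the end of the frame rather than excluding the infinite rows. If the intent is the mean of all finite values in the column, one possible alternative is sketched below on toy data. The frame and variable names are hypothetical, and this is not the procedure that produced the scores in this notebook; it would also fill any pre-existing NaN values.

```python
import numpy as np
import pandas as pd

# Toy frames with one infinite entry each (stand-ins for Xtrain / Xtest)
train_demo = pd.DataFrame({'a': [1.0, 2.0, np.inf, 3.0], 'b': [4.0, 5.0, 6.0, 7.0]})
test_demo = pd.DataFrame({'a': [np.inf, 10.0], 'b': [1.0, 2.0]})

# Mask infinities as NaN so that mean() skips them; use training data only
finite_means = train_demo.replace([np.inf, -np.inf], np.nan).mean()

# Apply the same training-derived means to both partitions
train_clean = train_demo.replace([np.inf, -np.inf], np.nan).fillna(finite_means)
test_clean = test_demo.replace([np.inf, -np.inf], np.nan).fillna(finite_means)

print(train_clean)
print(test_clean)
```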

In [102]:
# IT1: Logistic Regression
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
lgr1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('logistic', LogisticRegression(n_jobs=-2))])
lgr1.fit(Xtrain, ytrain)
print('Logistic Regression Train Score:', lgr1.score(Xtrain, ytrain))
print('Logistic Regression Test Score:', lgr1.score(Xtest, ytest))


Logistic Regression Train Score: 0.9376155268022182
Logistic Regression Test Score: 0.922365988909427


In [103]:
# IT1: KNN
from sklearn.neighbors import KNeighborsClassifier
knn1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('knn', KNeighborsClassifier(n_jobs=-2))])
knn1.fit(Xtrain, ytrain)
print('KNearestNeighbors Train Score:', knn1.score(Xtrain, ytrain))
print('KNearestNeighbors Test Score:', knn1.score(Xtest, ytest))


KNearestNeighbors Train Score: 0.9396950092421442
KNearestNeighbors Test Score: 0.9186691312384473


In [104]:
# IT1: LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('lda', LinearDiscriminantAnalysis())])
lda1.fit(Xtrain, ytrain)
print('LDA Train Score:', lda1.score(Xtrain, ytrain))
print('LDA Test Score:', lda1.score(Xtest, ytest))

LDA Train Score: 0.9348428835489834
LDA Test Score: 0.9195933456561922


In [105]:
# IT1: GNB
from sklearn.naive_bayes import GaussianNB
gnb1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('gnb', GaussianNB())])
gnb1.fit(Xtrain, ytrain)
print('GNB Train Score', gnb1.score(Xtrain, ytrain))
print('GNB Test Score', gnb1.score(Xtest, ytest))


GNB Train Score 0.6984750462107209
GNB Test Score 0.6728280961182994


In [106]:
# IT1: SVM
from sklearn.svm import SVC
svc1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('svc', SVC())])
svc1.fit(Xtrain, ytrain)
print('SVC Train Score', svc1.score(Xtrain, ytrain))
print('SVC Test Score', svc1.score(Xtest, ytest))


SVC Train Score 0.9413123844731978
SVC Test Score 0.9242144177449169


In [107]:
# IT1: CatBoost
cb1 = Pipeline(steps=[('scaler', StandardScaler()),
                     ('catboost', CatBoostClassifier(verbose=False))])
cb1.fit(Xtrain, ytrain)
print('Catboost Train Score', cb1.score(Xtrain, ytrain))
print('Catboost Test Score', cb1.score(Xtest, ytest))

Catboost Train Score 0.9976894639556377
Catboost Test Score 0.9353049907578558


In [108]:
# IT1: LightGBM
lgbm1 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('lightgbm', LGBMClassifier(n_jobs=-2))])
lgbm1.fit(Xtrain, ytrain)
print('LightGBM Train Score', lgbm1.score(Xtrain, ytrain))
print('LightGBM Test Score', lgbm1.score(Xtest, ytest))


LightGBM Train Score 1.0
LightGBM Test Score 0.9343807763401109


### Iteration 2: Give the Computer a Chance

[Back to Top](#Machine-Learning-Processing)



In [109]:
# IT2: Logistic slot; note that the final step below is LGBMClassifier, so lgr2 matches lgbm2
from sklearn.decomposition import PCA 
lgr2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('lightgbm', LGBMClassifier(n_jobs=-2))])
lgr2.fit(Xtrain, ytrain)
print('LightGBM Train Score', lgr2.score(Xtrain, ytrain))
print('LightGBM Test Score', lgr2.score(Xtest, ytest))


LightGBM Train Score 1.0
LightGBM Test Score 0.9242144177449169
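
The pipeline above sits in the logistic regression slot but ends in `LGBMClassifier`, which is why its scores match the LightGBM cell for this iteration. A minimal sketch of what a scale, PCA, then logistic regression pipeline could look like follows. It uses synthetic data and the hypothetical name `lgr2_sketch`, so it says nothing about how logistic regression would score on the provider data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=8, random_state=430)

# Same scaler and PCA settings as the Iteration 2 cells, with a logistic final step
lgr2_sketch = Pipeline(steps=[('scaler', StandardScaler()),
                              ('PCA', PCA(n_components=0.99, random_state=430)),
                              ('logistic', LogisticRegression(max_iter=1000))])
lgr2_sketch.fit(X_demo, y_demo)

print('Components kept:', lgr2_sketch.named_steps['PCA'].n_components_)
print('Train accuracy:', lgr2_sketch.score(X_demo, y_demo))
```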


In [110]:
# IT2: KNN
knn2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('knn', KNeighborsClassifier(n_jobs=-2))])
knn2.fit(Xtrain, ytrain)
print('KNearestNeighbors Train Score:', knn2.score(Xtrain, ytrain))
print('KNearestNeighbors Test Score:', knn2.score(Xtest, ytest))

KNearestNeighbors Train Score: 0.9396950092421442
KNearestNeighbors Test Score: 0.9214417744916821


In [111]:
# IT2: LDA
lda2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('lda', LinearDiscriminantAnalysis())])
lda2.fit(Xtrain, ytrain)
print('LDA Train Score:', lda2.score(Xtrain, ytrain))
print('LDA Test Score:', lda2.score(Xtest, ytest))

LDA Train Score: 0.9336876155268022
LDA Test Score: 0.9232902033271719


In [112]:
# IT2: GNB
gnb2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('gnb', GaussianNB())])
gnb2.fit(Xtrain, ytrain)
print('GNB Train Score:', gnb2.score(Xtrain, ytrain))
print('GNB Test Score:', gnb2.score(Xtest, ytest))

GNB Train Score: 0.5459796672828097
GNB Test Score: 0.5425138632162662


In [113]:
# IT2: SVM
svc2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('svc', SVC())])
svc2.fit(Xtrain, ytrain)
print('SVC Train Score:', svc2.score(Xtrain, ytrain))
print('SVC Test Score:', svc2.score(Xtest, ytest))

SVC Train Score: 0.9394639556377079
SVC Test Score: 0.9242144177449169


In [114]:
# IT2: CatBoost
cb2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('Catboost', CatBoostClassifier(verbose=False))])
cb2.fit(Xtrain, ytrain)
print('CatBoost Train Score:', cb2.score(Xtrain, ytrain))
print('CatBoost Test Score:', cb2.score(Xtest, ytest))

CatBoost Train Score: 0.9988447319778189
CatBoost Test Score: 0.9205175600739371


In [115]:
# IT2: LightGBM
lgbm2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('LightGBM', LGBMClassifier())])
lgbm2.fit(Xtrain, ytrain)
print('LightGBM Train Score', lgbm2.score(Xtrain, ytrain))
print('LightGBM Test Score', lgbm2.score(Xtest, ytest))


LightGBM Train Score 1.0
LightGBM Test Score 0.9242144177449169


### Progress Report Card

[Back to Top](#Machine-Learning-Processing)

In [116]:
# Function to append PCA components to the original features
def PCAmerge(train, test, n_comp=0.95, r_state=42):
    """Fit PCA on train, then append the component columns to both train and test."""
    from sklearn.decomposition import PCA
    pca = PCA(n_components=n_comp, random_state=r_state)
    train_pca = pd.DataFrame(pca.fit_transform(train), index=train.index)
    test_pca = pd.DataFrame(pca.transform(test), index=test.index)
    # Name the components PCA1, PCA2, ...
    PCA_col_list = ['PCA' + str(num) for num in range(1, train_pca.shape[1] + 1)]
    train_pca.columns = PCA_col_list
    test_pca.columns = PCA_col_list
    train = pd.concat([train, train_pca], axis=1)
    test = pd.concat([test, test_pca], axis=1)
    return train, test
    

In [117]:
# Work in progress: making PCAmerge pipeline-compatible. Not so easy, but have a lead on something called Kedro:
# https://towardsdatascience.com/kedro-prepare-to-pimp-your-pipeline-f8f68c263466
# from sklearn.base import BaseEstimator, TransformerMixin
# class PCAmergePipe(BaseEstimator, TransformerMixin):
#     def __init__(self, n_comp = 0.99, r_state = 42):
#             self.n_comp = n_comp
#             self.r_state = r_state
#     def fit_transform(self, X, y):
#             from sklearn.decomposition import PCA 
#             pca = PCA(n_components = n_comp, random_state = r_state)
#             df_pca = pd.DataFrame(pca.fit_transform(self), index=self.index)
#             PCA_col_list = []
#             for num in range(1, df_pca.shape[1]+1):
#                 PCA_col_list.append(('PCA' + str(num)))
#             df_pca.columns = PCA_col_list
#             self = pd.concat([self, df_pca], axis=1)
#             return self
# #     def transform(self, x):
# #             return x[self.ind_feature,:]

In [118]:
Xtrain_Scaled

Provider,Number_of_Claims,Num_Unique_Patients,Num_Unique_Docs,Num_Unique_Diag_Codes,Num_Unique_Proc_Codes,Mean_InscClaimAmtReimbursed,Mean_DeductibleAmtPaid,Mean_Reimbursement_per_Day,Mean_Length_of_Stay,Mean_Length_of_Claim,...,Perc_ClaimStartDt_March,Perc_ClaimStartDt_April,Perc_ClaimStartDt_May,Perc_ClaimStartDt_June,Perc_ClaimStartDt_July,Perc_ClaimStartDt_August,Perc_ClaimStartDt_September,Perc_ClaimStartDt_October,Perc_ClaimStartDt_November,Perc_ClaimStartDt_December
PRV56682,-0.301220,-0.331952,-0.253264,-0.393598,-0.024100,-0.253894,0.048215,0.490073,0.591932,-0.151775,...,0.210209,-0.865219,1.166372,-0.376840,0.291816,0.267843,0.351610,-0.829907,0.409958,0.886706
PRV54630,-0.058777,-0.303445,-0.457363,0.005891,-0.373829,-0.419732,-0.499430,-0.621766,-0.690894,-0.200650,...,-0.100263,0.217539,-0.294321,-0.156588,0.404522,0.373150,-0.211133,0.403024,-0.048665,-0.264897
PRV55179,-0.099184,0.003010,0.388189,0.088029,0.850222,0.205114,0.226065,0.994156,0.699064,0.009514,...,0.003228,-0.077759,-0.002182,-0.486965,0.066405,-0.153387,0.689256,-0.110697,0.065991,-0.055514
PRV57312,0.855892,0.509017,0.009149,1.196890,-0.373829,-0.428211,-0.490469,-0.621766,-0.690894,-0.000261,...,0.003228,-0.274624,0.095197,0.173789,0.066405,0.057228,0.013965,-0.007953,0.065991,-0.160206
PRV54923,-0.080818,-0.061132,0.038306,-0.083714,-0.373829,-0.395312,-0.502472,-0.621766,-0.690894,-0.327726,...,0.106719,-0.373056,-0.294321,0.063663,0.517228,0.267843,0.013965,-0.007953,0.180646,0.049177
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
PRV55620,-0.315913,-0.367587,-0.515677,-0.505604,-0.373829,-0.472727,-0.459694,-0.621766,-0.690894,-0.972881,...,-0.928187,-0.176191,-0.196941,0.504166,-0.159007,1.215610,-0.098584,-0.829907,1.441859,-0.055514
PRV55267,0.157952,0.088532,-0.224107,0.424048,-0.373829,-0.435083,-0.493969,-0.621766,-0.690894,-0.259301,...,-0.100263,-0.471489,0.192577,-0.046462,0.066405,0.373150,0.013965,-0.110697,0.065991,0.153868
PRV52407,-0.213059,-0.339079,-0.428206,-0.255457,-0.373829,-0.399273,-0.472122,-0.621766,-0.690894,-0.713842,...,-0.410734,0.020674,-0.002182,0.394040,0.855346,-0.364002,-0.661327,0.403024,-0.048665,0.153868
PRV54154,-0.326933,-0.388967,-0.282421,-0.535472,-0.373829,-0.468592,-0.503289,-0.621766,-0.690894,-0.117562,...,0.831153,0.808134,-0.878598,-0.046462,-0.046301,-0.048079,1.026902,0.916745,-0.851255,-0.788352


### Iteration 3: Give the PCA some backup

[Back to Top](#Machine-Learning-Processing)



In [120]:
# IT3: PCA Scaled
from sklearn.decomposition import PCA 
from sklearn.preprocessing import StandardScaler
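scaled = StandardScaler()
Xtrain_Scaled = pd.DataFrame(scaled.fit_transform(Xtrain), columns = Xtrain.columns, index = Xtrain.index)
# Note: fit_transform re-fits the scaler on the test set; scaled.transform(Xtest) would reuse the training statistics
Xtest_Scaled = pd.DataFrame(scaled.fit_transform(Xtest), columns = Xtest.columns, index = Xtest.index)
# Append the PCA components to the scaled original features
XtrainPCA, XtestPCA = PCAmerge(Xtrain_Scaled, Xtest_Scaled, n_comp = 0.95)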

In [121]:
XtestPCA.shape[1]

113
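
The scaling cell above calls `fit_transform` on the test set, which re-estimates the mean and standard deviation from the test rows. The usual convention is to fit the scaler on the training rows only and reuse those statistics for the test rows. The sketch below shows that pattern on synthetic frames; the `train_demo` / `test_demo` names and the distributions are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic train/test frames whose means differ slightly (illustration only)
rng = np.random.default_rng(430)
cols = ['f1', 'f2', 'f3']
train_demo = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(200, 3)), columns=cols)
test_demo = pd.DataFrame(rng.normal(loc=6.0, scale=2.0, size=(50, 3)), columns=cols)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
train_scaled = pd.DataFrame(scaler.fit_transform(train_demo),
                            columns=cols, index=train_demo.index)
# Reuse the training statistics for the test rows
test_scaled = pd.DataFrame(scaler.transform(test_demo),
                           columns=cols, index=test_demo.index)

print(train_scaled.mean().round(3))  # approximately 0 by construction
print(test_scaled.mean().round(3))   # expressed in training units, so generally not 0
```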

In [122]:
# IT3: Logistic Regression
from sklearn.linear_model import LogisticRegression
lgr3 = LogisticRegression(n_jobs=-1)
lgr3.fit(XtrainPCA, ytrain)
print('Logistic Regression Train Score:', lgr3.score(XtrainPCA, ytrain))
print('Logistic Regression Test Score:', lgr3.score(XtestPCA, ytest))


Logistic Regression Train Score: 0.9371534195933456
Logistic Regression Test Score: 0.922365988909427


In [123]:
# IT3: KNN
from sklearn.neighbors import KNeighborsClassifier
knn3 = KNeighborsClassifier(n_jobs=-1)
knn3.fit(XtrainPCA, ytrain)
print('KNearestNeighbors Train Score:', knn3.score(XtrainPCA, ytrain))
print('KNearestNeighbors Test Score:', knn3.score(XtestPCA, ytest))


KNearestNeighbors Train Score: 0.9406192236598891
KNearestNeighbors Test Score: 0.9140480591497228


In [124]:
# IT3: LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda3 = LinearDiscriminantAnalysis()
lda3.fit(XtrainPCA, ytrain)
print('LDA Train Score:', lda3.score(XtrainPCA, ytrain))
print('LDA Test Score:', lda3.score(XtestPCA, ytest))

LDA Train Score: 0.9348428835489834
LDA Test Score: 0.9297597042513863


In [125]:
# IT3: GNB
from sklearn.naive_bayes import GaussianNB
gnb3 = GaussianNB()
gnb3.fit(XtrainPCA, ytrain)
print('GNB Train Score', gnb3.score(XtrainPCA, ytrain))
print('GNB Test Score', gnb3.score(XtestPCA, ytest))


GNB Train Score 0.6351663585951941
GNB Test Score 0.6349353049907579


In [126]:
# IT3: SVM
from sklearn.svm import SVC
svc3 = SVC()
svc3.fit(XtrainPCA, ytrain)
print('SVC Train Score', svc3.score(XtrainPCA, ytrain))
print('SVC Test Score', svc3.score(XtestPCA, ytest))


SVC Train Score 0.9396950092421442
SVC Test Score 0.9214417744916821


In [127]:
# IT3: CatBoost
from catboost import CatBoostClassifier
cb3 = CatBoostClassifier(verbose=False)
cb3.fit(XtrainPCA, ytrain)
print('Catboost Train Score', cb3.score(XtrainPCA, ytrain))
print('Catboost Test Score', cb3.score(XtestPCA, ytest))

Catboost Train Score 0.9983826247689463
Catboost Test Score 0.9306839186691312


In [128]:
# IT3: LightGBM
from lightgbm import LGBMClassifier
lgbm3 = LGBMClassifier(n_jobs=-1)
lgbm3.fit(XtrainPCA, ytrain)
print('LightGBM Train Score', lgbm3.score(XtrainPCA, ytrain))
print('LightGBM Test Score', lgbm3.score(XtestPCA, ytest))


LightGBM Train Score 1.0
LightGBM Test Score 0.9214417744916821


In [None]:
# Generate Progress Report Card
resul = pd.DataFrame(columns=['Model', 'Logistic', 'KNN', 'LDA', 'GNB', 'SVM', 'CatBoost', 'LightGBM'])
resul.set_index('Model', inplace=True)
resul.loc['Iteration0 Train'] = ['NA','NA','NA','NA','NA',0,0]
resul.loc['Iteration0 Test'] = ['NA','NA','NA','NA','NA',0,0]
resul.loc['Iteration1 Train'] = [0,0,0,0,0,0,0]
resul.loc['Iteration1 Test'] = [0,0,0,0,0,0,0]
resul.loc['Iteration2 Train'] = [0,0,0,0,0,0,0]
resul.loc['Iteration2 Test'] = [0,0,0,0,0,0,0]
resul.loc['Iteration3 Train'] = [0,0,0,0,0,0,0]
resul.loc['Iteration3 Test'] = [0,0,0,0,0,0,0]

# Group the Models
it0 = [cb0, lgbm0]
it1 = [lgr1, knn1, lda1, gnb1, svc1, cb1, lgbm1]
it2 = [lgr2, knn2, lda2, gnb2, svc2, cb2, lgbm2]
it3 = [lgr3, knn3, lda3, gnb3, svc3, cb3, lgbm3]
# Tabulate Iteration 0; only Catboost & LGBM
col = 5
for model in it0:
    resul.iloc[0, col] = model.score(Xtrain, ytrain)
    resul.iloc[1, col] = model.score(Xtest, ytest)
    col+=1
    
# Tabulate Iteration 1; Piped w/ Scaler
col = 0
for model in it1:
    resul.iloc[2, col] = model.score(Xtrain, ytrain)
    resul.iloc[3, col] = model.score(Xtest, ytest)
    col+=1

# Tabulate Iteration 2; Piped w/ Scaler & PCA
col = 0
for model in it2:
    resul.iloc[4, col] = model.score(Xtrain, ytrain)
    resul.iloc[5, col] = model.score(Xtest, ytest)
    col+=1

# Tabulate Iteration 3; Scaled features w/ appended PCA components
col = 0
for model in it3:
    resul.iloc[6, col] = model.score(XtrainPCA, ytrain)
    resul.iloc[7, col] = model.score(XtestPCA, ytest)
    col+=1
    

In [131]:
resul

Model,Logistic,KNN,LDA,GNB,SVM,CatBoost,LightGBM
Iteration0 Train,,,,,,0.997227,1.0
Iteration0 Test,,,,,,0.931608,0.934381
Iteration1 Train,0.937616,0.939695,0.934843,0.698475,0.941312,0.997689,1.0
Iteration1 Test,0.922366,0.918669,0.919593,0.672828,0.924214,0.935305,0.934381
Iteration2 Train,1.0,0.939695,0.933688,0.54598,0.939464,0.998845,1.0
Iteration2 Test,0.924214,0.921442,0.92329,0.542514,0.924214,0.920518,0.924214
Iteration3 Train,0.937153,0.940619,0.934843,0.635166,0.939695,0.998383,1.0
Iteration3 Test,0.922366,0.914048,0.92976,0.634935,0.921442,0.930684,0.921442
