# Machine Learning Processing

### Contents

 - I. [Data Importing & Processing](#Data-Import-and-Processing)
 - II. [Modeling Plan](#Modeling-Plan)
     - [Iteration 0](#Iteration-0:-Tree-based-Naivete)
     - [Iteration 1](#Iteration-1:-Baby's-First-Pipe)
     - [Iteration 2](#Iteration-2:-Give-the-Computer-a-Chance)
         - [Progress So Far](#Progress-Report-Card)
     - [Iteration 1 Optimization](#Iteration-1-Optimizations)
     - [Iteration 2 Optimization](#Iteration-2-Optimizations)
         - [Second Report Card](#Post-Optimization-Report-Card)
 - III. [PCA Weighted Approach](#PCA-Boosting-Technique)
     - [Non-Optimized](#PCABoosted-Modeling)
     - [GridSearch Optimization](#Optimized-PCABoosted-Modeling)
     - [General Report Card](#PCABoosted-Report-Card)
 - IV. [Results & Discussion](#Results-&-Discussion)
     - [Permutation Importance]
     - [Predictions]
     - [Business Relevants & Benefits]

In [1]:
# Import packages etc.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pickle

## Data Import and Processing

In [2]:
provdat = pickle.load(open('Xtrain.pickle', 'rb'))
provtarg = pickle.load(open('Xtest.pickle', 'rb'))
provgroup = pickle.load(open('Xdata.pickle', 'rb'))

In [3]:
# Examining Pre-Aggregated Per-Provider Data
print('Data Shape: ' + str(provdat.shape))
provdat.head()

Data Shape: (5410, 70)


Unnamed: 0_level_0,Number_of_Claims,Num_Unique_Patients,Num_Unique_Docs,Num_Unique_Diag_Codes,Num_Unique_Proc_Codes,Mean_InscClaimAmtReimbursed,Mean_DeductibleAmtPaid,Mean_Reimbursement_per_Day,Mean_Length_of_Stay,Mean_Length_of_Claim,...,Perc_ClaimStartDt_March,Perc_ClaimStartDt_April,Perc_ClaimStartDt_May,Perc_ClaimStartDt_June,Perc_ClaimStartDt_July,Perc_ClaimStartDt_August,Perc_ClaimStartDt_September,Perc_ClaimStartDt_October,Perc_ClaimStartDt_November,Perc_ClaimStartDt_December
Provider,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
PRV51001,25,24,19,72,3,4185.6,213.6,3700.0,6.0,2.44,...,0.12,0.08,0.2,0.12,0.08,0.04,0.04,0.0,0.0,0.08
PRV51003,132,117,51,365,39,4588.41,502.17,2137.42,6.16,4.67,...,0.08,0.08,0.06,0.11,0.08,0.1,0.07,0.08,0.08,0.05
PRV51004,149,138,48,267,0,350.13,2.08,0.0,0.0,2.43,...,0.12,0.07,0.05,0.07,0.09,0.07,0.08,0.07,0.05,0.11
PRV51005,1165,495,7,1294,0,241.12,3.18,0.0,0.0,2.09,...,0.1,0.08,0.07,0.09,0.08,0.08,0.09,0.1,0.06,0.07
PRV51007,72,58,11,165,1,468.19,45.33,1080.56,6.33,1.96,...,0.12,0.06,0.12,0.12,0.04,0.08,0.1,0.03,0.08,0.08


In [4]:
# Examining Per-Provider Incidence of Fradulence
print('Data Shape: ' + str(provtarg.shape))
provtarg.head()

Data Shape: (5410,)


Provider
PRV51001    0
PRV51003    1
PRV51004    0
PRV51005    1
PRV51007    0
Name: PotentialFraud, dtype: int64

In [5]:
# Combining Target data with Provider data to generate supervised learning dataset
xdata = pd.merge(provdat.reset_index(), provtarg.reset_index(), on='Provider').set_index('Provider')
xdata.head()

Unnamed: 0_level_0,Number_of_Claims,Num_Unique_Patients,Num_Unique_Docs,Num_Unique_Diag_Codes,Num_Unique_Proc_Codes,Mean_InscClaimAmtReimbursed,Mean_DeductibleAmtPaid,Mean_Reimbursement_per_Day,Mean_Length_of_Stay,Mean_Length_of_Claim,...,Perc_ClaimStartDt_April,Perc_ClaimStartDt_May,Perc_ClaimStartDt_June,Perc_ClaimStartDt_July,Perc_ClaimStartDt_August,Perc_ClaimStartDt_September,Perc_ClaimStartDt_October,Perc_ClaimStartDt_November,Perc_ClaimStartDt_December,PotentialFraud
Provider,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
PRV51001,25,24,19,72,3,4185.6,213.6,3700.0,6.0,2.44,...,0.08,0.2,0.12,0.08,0.04,0.04,0.0,0.0,0.08,0
PRV51003,132,117,51,365,39,4588.41,502.17,2137.42,6.16,4.67,...,0.08,0.06,0.11,0.08,0.1,0.07,0.08,0.08,0.05,1
PRV51004,149,138,48,267,0,350.13,2.08,0.0,0.0,2.43,...,0.07,0.05,0.07,0.09,0.07,0.08,0.07,0.05,0.11,0
PRV51005,1165,495,7,1294,0,241.12,3.18,0.0,0.0,2.09,...,0.08,0.07,0.09,0.08,0.08,0.09,0.1,0.06,0.07,1
PRV51007,72,58,11,165,1,468.19,45.33,1080.56,6.33,1.96,...,0.06,0.12,0.12,0.04,0.08,0.1,0.03,0.08,0.08,0


In [6]:
from sklearn.model_selection import train_test_split

# Split the data into train & test sets for supervised learning
Xtrain, Xtest, ytrain, ytest = train_test_split(xdata.iloc[:, :-1],
                                               xdata.iloc[:, -1:],
                                               test_size=0.2,
                                               random_state=430)

# Quick Heads Up
print('Training Data shape: ' + str(Xtrain.shape))
print('Test Data Shape: ' + str(Xtest.shape))

Training Data shape: (4328, 70)
Test Data Shape: (1082, 70)


## Modeling Plan
The general strategy of our approach incorporates an initial, broad spectrum testing of models, including the following.
 - Logistic Regression
 - KNN
 - Discriminant Analysis
 - Naive Bayes
 - SVM
 - CatBoost
 - Light GBM

This grouping should represent a decent collection of well-worn, classical methods with exceptionally long history & documentation as well as newer, useful methodologies. Of particular interest to this last point will be **Catboost & Light GBM**, two Gradient Boosting models that have shown a great deal of promise in terms of efficiency & accuracy.

### Iterative Strategy | What We Care About 
 - Iteration 0: Naive, untouched modeling with tree-based models; just fit.
 - Iteration 1: Pipeline == SCALE(No Norm) & FIT; no unsupervised.
     - 1a for gridsearch?
 - Iteration 2: Pipeline == SCALE(No Norm), PCA, & FIT.
     - 2a for gridsearch?
 
### Naming Procedure
We can abbreviate the model name alongside iteration number to keep track (eg. "cb0" for "Catboost, Iteration 0)

### Iteration 0: Tree-based Naivete
[Back to Top](#Machine-Learning-Processing)

Both Catboost & Light GBM are quite capable of handling practically unprocessed data, such as we have imported. For further reference, please consult their respective documentation. Here we will process these for some initial yield results.

In [7]:
from catboost import CatBoostClassifier
cb0 = CatBoostClassifier(verbose=False)
cb0.fit(Xtrain, ytrain)
cb0.score(Xtest, ytest)
print('CatBoost Train R2 Score:', cb0.score(Xtrain, ytrain))
print('CatBoost Test R2 Score:', cb0.score(Xtest, ytest))

CatBoost Train R2 Score: 0.9972273567467652
CatBoost Test R2 Score: 0.9316081330868762


In [8]:
from lightgbm import LGBMClassifier
lgbm0 = LGBMClassifier()
lgbm0.fit(Xtrain, ytrain)
print('LightGBM Boost Train R2 Score:', lgbm0.score(Xtrain, ytrain))
print('LightGBM Boost Test R2 Score:', lgbm0.score(Xtest, ytest))

  return f(*args, **kwargs)


LightGBM Boost Train R2 Score: 1.0
LightGBM Boost Test R2 Score: 0.9343807763401109


### Iteration 1: Baby's First Pipe
[Back to Top](#Machine-Learning-Processing)

Our goal here will be to test the results with some initial scaling built into a pipeline. 

In [9]:
badindices = np.where(np.isinf(Xtrain))
badindices

(array([ 220, 1330, 1347, 2198, 2599, 2614, 2920, 2946, 3951, 4321]),
 array([24, 24, 24, 24, 24, 24, 24, 24, 24, 24]))

In [10]:
Xtrain.iloc[-badindices[0], 24].mean()

27.769

In [11]:
Xtrain.iloc[badindices[0], 24] = Xtrain.iloc[-badindices[0], 24].mean()

In [12]:
badtests = np.where(np.isinf(Xtest))
Xtest.iloc[badtests[0], 24] = Xtrain.iloc[-badindices[0], 24].mean()

In [13]:
# IT1: Logistic Regression
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
lgr1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('logistic', LogisticRegression(n_jobs=-2))])
lgr1.fit(Xtrain, ytrain)
print('Logistic Regression Train Score:', lgr1.score(Xtrain, ytrain))
print('Logistic Regression Test Score:', lgr1.score(Xtest, ytest))

  return f(*args, **kwargs)


Logistic Regression Train Score: 0.9376155268022182
Logistic Regression Test Score: 0.922365988909427


In [14]:
# IT1: KNN
from sklearn.neighbors import KNeighborsClassifier
knn1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('knn', KNeighborsClassifier(n_jobs=-2))])
knn1.fit(Xtrain, ytrain)
print('KNearestNeighbors Train Score:', knn1.score(Xtrain, ytrain))
print('KNearestNeighbors Test Score:', knn1.score(Xtest, ytest))

  return self._fit(X, y)


KNearestNeighbors Train Score: 0.9396950092421442
KNearestNeighbors Test Score: 0.9186691312384473


In [15]:
# IT1: LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('lda', LinearDiscriminantAnalysis())])
lda1.fit(Xtrain, ytrain)
print('LDA Train Score:', lda1.score(Xtrain, ytrain))
print('LDA Test Score:', lda1.score(Xtest, ytest))

LDA Train Score: 0.9348428835489834
LDA Test Score: 0.9195933456561922


  return f(*args, **kwargs)


In [16]:
# IT1: GNB
from sklearn.naive_bayes import GaussianNB
gnb1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('gnb', GaussianNB())])
gnb1.fit(Xtrain, ytrain)
print('GNB Train Score', gnb1.score(Xtrain, ytrain))
print('GNB Test Score', gnb1.score(Xtest, ytest))

GNB Train Score 0.6984750462107209
GNB Test Score 0.6728280961182994


  return f(*args, **kwargs)


In [17]:
# IT1: SVM
from sklearn.svm import SVC
svc1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('svc', SVC())])
svc1.fit(Xtrain, ytrain)
print('SVC Train Score', svc1.score(Xtrain, ytrain))
print('SVC Test Score', svc1.score(Xtest, ytest))

  return f(*args, **kwargs)


SVC Train Score 0.9413123844731978
SVC Test Score 0.9242144177449169


In [18]:
# IT1: CatBoost
cb1 = Pipeline(steps=[('scaler', StandardScaler()),
                     ('catboost', CatBoostClassifier(random_state=430, verbose=False))])
cb1.fit(Xtrain, ytrain)
print('Catboost Train Score', cb1.score(Xtrain, ytrain))
print('Catboost Test Score', cb1.score(Xtest, ytest))

Catboost Train Score 0.9974584103512015
Catboost Test Score 0.933456561922366


In [19]:
# IT1: LightGBM
lgbm1 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('lightgbm', LGBMClassifier(n_jobs=-2, random_state=430))])
lgbm1.fit(Xtrain, ytrain)
print('LightGBM Train Score', lgbm1.score(Xtrain, ytrain))
print('LightGBM Test Score', lgbm1.score(Xtest, ytest))

  return f(*args, **kwargs)


LightGBM Train Score 1.0
LightGBM Test Score 0.9343807763401109


## Iteration 2: Give the Computer a Chance

[Back to Top](#Machine-Learning-Processing)



In [20]:
# IT2: Logistic
from sklearn.decomposition import PCA 
lgr2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('logistic', LogisticRegression(n_jobs=-2))])
lgr2.fit(Xtrain, ytrain)
print('LightGBM Train Score', lgr2.score(Xtrain, ytrain))
print('LightGBM Test Score', lgr2.score(Xtest, ytest))

  return f(*args, **kwargs)


LightGBM Train Score 0.9366913123844732
LightGBM Test Score 0.922365988909427


In [21]:
# IT2: KNN
knn2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('knn', KNeighborsClassifier(n_jobs=-2))])
knn2.fit(Xtrain, ytrain)
print('LightGBM Train Score', knn2.score(Xtrain, ytrain))
print('LightGBM Test Score', knn2.score(Xtest, ytest))

  return self._fit(X, y)


LightGBM Train Score 0.9396950092421442
LightGBM Test Score 0.9214417744916821


In [22]:
# IT2: LDA
lda2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('lda', LinearDiscriminantAnalysis())])
lda2.fit(Xtrain, ytrain)
print('LightGBM Train Score', lda2.score(Xtrain, ytrain))
print('LightGBM Test Score', lda2.score(Xtest, ytest))

LightGBM Train Score 0.9336876155268022
LightGBM Test Score 0.9232902033271719


  return f(*args, **kwargs)


In [23]:
# IT2: GNB
gnb2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('gnb', GaussianNB())])
gnb2.fit(Xtrain, ytrain)
print('LightGBM Train Score', gnb2.score(Xtrain, ytrain))
print('LightGBM Test Score', gnb2.score(Xtest, ytest))

LightGBM Train Score 0.5459796672828097
LightGBM Test Score 0.5425138632162662


  return f(*args, **kwargs)


In [24]:
# IT2: SVM
svc2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('svc', SVC())])
svc2.fit(Xtrain, ytrain)
print('LightGBM Train Score', svc2.score(Xtrain, ytrain))
print('LightGBM Test Score', svc2.score(Xtest, ytest))

  return f(*args, **kwargs)


LightGBM Train Score 0.9394639556377079
LightGBM Test Score 0.9242144177449169


In [25]:
# IT2: CatBoost
cb2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('Catboost', CatBoostClassifier(random_state=430, verbose=False))])
cb2.fit(Xtrain, ytrain)
print('LightGBM Train Score', cb2.score(Xtrain, ytrain))
print('LightGBM Test Score', cb2.score(Xtest, ytest))

LightGBM Train Score 0.9988447319778189
LightGBM Test Score 0.9242144177449169


In [26]:
# IT2: LightGBM
lgbm2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('LightGBM', LGBMClassifier(random_state=430))])
lgbm2.fit(Xtrain, ytrain)
print('LightGBM Train Score', lgbm2.score(Xtrain, ytrain))
print('LightGBM Test Score', lgbm2.score(Xtest, ytest))

  return f(*args, **kwargs)


LightGBM Train Score 1.0
LightGBM Test Score 0.9242144177449169


### Progress Report Card

[Back to Top](#Machine-Learning-Processing)

In [100]:
# Generate Progress Report Card
resul = pd.DataFrame(columns=['Model', 'Logistic', 'KNN', 'LDA', 'GNB', 'SVM', 'CatBoost', 'LightGBM'])
resul.set_index('Model', inplace=True)
resul.loc['Iteration0 Train'] = ['NA','NA','NA','NA','NA',0,0]
resul.loc['Iteration0 Test'] = ['NA','NA','NA','NA','NA',0,0]
resul.loc['Iteration1 Train'] = [0,0,0,0,0,0,0]
resul.loc['Iteration1 Test'] = [0,0,0,0,0,0,0]
resul.loc['Iteration2 Train'] = [0,0,0,0,0,0,0]
resul.loc['Iteration2 Test'] = [0,0,0,0,0,0,0]

# Group the Models
it0 = [cb0, lgbm0]
it1 = [lgr1, knn1, lda1, gnb1, svc1, cb1, lgbm1]
it2 = [lgr2, knn2, lda2, gnb2, svc2, cb2, lgbm2]

# Tabulate Iteration 0; only Catboost & LGBM
col = 5
for model in it0:
    resul.iloc[0, col] = model.score(Xtrain, ytrain)
    resul.iloc[1, col] = model.score(Xtest, ytest)
    col+=1
    
# Tabulate Iteration 1; Piped w/ Scaler
col = 0
for model in it1:
    resul.iloc[2, col] = model.score(Xtrain, ytrain)
    resul.iloc[3, col] = model.score(Xtest, ytest)
    col+=1

col = 0
for model in it2:
    resul.iloc[4, col] = model.score(Xtrain, ytrain)
    resul.iloc[5, col] = model.score(Xtest, ytest)
    col+=1
    
resul.head(6)

Unnamed: 0_level_0,Logistic,KNN,LDA,GNB,SVM,CatBoost,LightGBM
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Iteration0 Train,,,,,,0.997227,1.0
Iteration0 Test,,,,,,0.931608,0.934381
Iteration1 Train,0.937616,0.939695,0.934843,0.698475,0.941312,0.997458,1.0
Iteration1 Test,0.922366,0.918669,0.919593,0.672828,0.924214,0.933457,0.934381
Iteration2 Train,0.936691,0.939695,0.933688,0.54598,0.939464,0.998845,1.0
Iteration2 Test,0.922366,0.921442,0.92329,0.542514,0.924214,0.924214,0.924214


## Optimized Testing
[Back to Top](#Machine-Learning-Processing)

We already have a number of extremely promising results, the strongest of which seem confined - for now - to the tree-based CatBoost & LightGBM models, eschewing PCA, which further underlines the potential issue of blindly employing unsupervised learning.

For further results comparison (future processing from here):

[Compare Post-Opt](#Post-Optimization-Report-Card)

[Compare PCABoosted Results](#)

[Compare PCABoosted Opt](#)

### Iteration 0 Optimizations

In [29]:
# IT0-A: Optimized Naive Catboost
from sklearn.model_selection import GridSearchCV

grid = {'depth': [5, 6, 7, 8, 9],
       'min_data_in_leaf': [1, 2, 4]}
cb0_a = GridSearchCV(CatBoostClassifier(verbose=False, random_state=430),
                    grid, n_jobs=-2, cv=3)
cb0_a.fit(Xtrain, ytrain)

print('CB Optimized Train Score:', cb0_a.score(Xtrain, ytrain))
print('CB Optimized Test Score:', cb0_a.score(Xtest, ytest))
print('Best Params:', cb0_a.best_params_)

CB Optimized Train Score: 0.9946857670979667
CB Optimized Test Score: 0.9279112754158965
Best Params: {'depth': 5, 'min_data_in_leaf': 1}


In [30]:
# IT0-A: Optimized LightGBM
grid = {'num_leaves': [44, 45, 46, 47, 48],
       'max_depth': [9, 10, 11, 12, 13],
       'min_data_in_leaf': [15, 16, 17, 18, 19]}
lgbm0_a = GridSearchCV(LGBMClassifier(n_jobs=-2, random_state=430),
                      grid, n_jobs=-2, cv=3)
lgbm0_a.fit(Xtrain, ytrain)

print('LightGBM Optimized Train Score:', lgbm0_a.score(Xtrain, ytrain))
print('LightGBM Optimized Test Score:', lgbm0_a.score(Xtest, ytest))
print('Best Params:', lgbm0_a.best_params_)

  return f(*args, **kwargs)


LightGBM Optimized Train Score: 1.0
LightGBM Optimized Test Score: 0.9297597042513863
Best Params: {'max_depth': 12, 'min_data_in_leaf': 17, 'num_leaves': 46}


### Iteration 1 Optimizations
[Back to Top](#Machine-Learning-Processing)


In [31]:
# IT1-A: Optimized LogisticRegression, lgr1_a
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('logistic', LogisticRegression(n_jobs=-2))])
grid = {'logistic__C': np.linspace(-4, 4, 50),
       'logistic__penalty': ['l1', 'l2']}
lgr1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lgr1_a.fit(Xtrain, ytrain)

print('LogisticRegression Optimized Train Score:', lgr1_a.score(Xtrain, ytrain))
print('LogisticRegression Optimized Test Score:', lgr1_a.score(Xtest, ytest))
print('Best Params:', lgr1_a.best_params_)

        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan 0.9336882         nan 0.93345704
        nan 0.93299455        nan 0.93253239        nan 0.93230139
        nan 0.93230139        nan 0.93207039        nan 0.93230139
        nan 0.93276355        nan 0.93276355        nan 0.93276355
        nan 0.93299471        nan 0.93299471        nan 0.93299471
        nan 0.93253255        nan 0.93253255        nan 0.93253255
        nan 0.93253255        nan 0.93253255        nan 0.93276355
        nan 0.93276355        nan 0.93276355        nan 0.9325

LogisticRegression Optimized Train Score: 0.9369223659889094
LogisticRegression Optimized Test Score: 0.9242144177449169
Best Params: {'logistic__C': 0.08163265306122458, 'logistic__penalty': 'l2'}


In [32]:
# IT1-A: Optimized KNN, knn1_a
ksqrt = int(np.round(np.sqrt(Xtrain.shape[0])))       # Get sqrt of observations for k
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('knn', KNeighborsClassifier(n_jobs=-2))])
grid = {'knn__n_neighbors': [ksqrt-5, ksqrt-3, ksqrt, ksqrt+3, ksqrt+5],
       'knn__weights': ['uniform', 'distance'],
       'knn__metric': ['euclidean', 'manhattan']}
knn1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
knn1_a.fit(Xtrain, ytrain)

print('KNN Optimized Train Score:', knn1_a.score(Xtrain, ytrain))
print('KNN Optimized Test Score:', knn1_a.score(Xtest, ytest))
print('Best Params:', knn1_a.best_params_)

  return self._fit(X, y)


KNN Optimized Train Score: 1.0
KNN Optimized Test Score: 0.9242144177449169
Best Params: {'knn__metric': 'euclidean', 'knn__n_neighbors': 66, 'knn__weights': 'distance'}


In [33]:
# IT1-A: Optimized LDA
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('lda', LinearDiscriminantAnalysis())])
grid = {'lda__solver': ['svd', 'lsqr', 'eigen']}
lda1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lda1_a.fit(Xtrain, ytrain)

print('LDA Optimized Train Score:', lda1_a.score(Xtrain, ytrain))
print('LDA Optimized Test Score:', lda1_a.score(Xtest, ytest))
print('Best Params:', lda1_a.best_params_)

LDA Optimized Train Score: 0.9359981515711645
LDA Optimized Test Score: 0.9195933456561922
Best Params: {'lda__solver': 'lsqr'}


  return f(*args, **kwargs)


In [34]:
# IT1-A: Optimized GNB
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('gnb', GaussianNB())])
grid = {'gnb__var_smoothing': np.logspace(0, -9, num=100)}
gnb1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
gnb1_a.fit(Xtrain, ytrain)

print('GNB Optimized Train Score:', gnb1_a.score(Xtrain, ytrain))
print('GNB Optimized Test Score:', gnb1_a.score(Xtest, ytest))
print('Best Params:', gnb1_a.best_params_)

GNB Optimized Train Score: 0.6984750462107209
GNB Optimized Test Score: 0.6728280961182994
Best Params: {'gnb__var_smoothing': 0.0002310129700083158}


  return f(*args, **kwargs)


In [35]:
# IT1-A: Optimized SVM
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('svc', SVC())])
grid = {'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
       'svc__gamma': ['scale'],
       'svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}
svc1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
svc1_a.fit(Xtrain, ytrain)

print('SVC Optimized Train Score:', svc1_a.score(Xtrain, ytrain))
print('SVC Optimized Test Score:', svc1_a.score(Xtest, ytest))
print('Best Params:', svc1_a.best_params_)

  return f(*args, **kwargs)


SVC Optimized Train Score: 0.9376155268022182
SVC Optimized Test Score: 0.9306839186691312
Best Params: {'svc__C': 0.1, 'svc__gamma': 'scale', 'svc__kernel': 'linear'}


In [36]:
# IT1-A: Optimized CatBoost
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('catboost', CatBoostClassifier(verbose=False))])
grid = {'catboost__depth': [5, 6, 7, 8, 9],
       'catboost__min_data_in_leaf': [1, 2, 4]}
cb1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
cb1_a.fit(Xtrain, ytrain)

print('CB Optimized Train Score:', cb1_a.score(Xtrain, ytrain))
print('CB Optimized Test Score:', cb1_a.score(Xtest, ytest))
print('Best Params:', cb1_a.best_params_)



CB Optimized Train Score: 0.9993068391866913
CB Optimized Test Score: 0.9316081330868762
Best Params: {'catboost__depth': 7, 'catboost__min_data_in_leaf': 1}


In [46]:
# IT1-A: Optimized LightGBM
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('lgbm', LGBMClassifier(n_jobs=-2, random_state=430))])
grid = {'lgbm__num_leaves': [31, 44, 45, 46, 47, 48],
       'lgbm__max_depth': [-1, 9, 10, 11, 12, 13],
       'lgbm__min_data_in_leaf': [15, 16, 17, 18, 19, 20]}
lgbm1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lgbm1_a.fit(Xtrain, ytrain)

print('LightGBM Optimized Train Score:', lgbm1_a.score(Xtrain, ytrain))
print('LightGBM Optimized Test Score:', lgbm1_a.score(Xtest, ytest))
print('Best Params:', lgbm1_a.best_params_)

  return f(*args, **kwargs)


LightGBM Optimized Train Score: 1.0
LightGBM Optimized Test Score: 0.9306839186691312
Best Params: {'lgbm__max_depth': 11, 'lgbm__min_data_in_leaf': 17, 'lgbm__num_leaves': 46}


### Iteration 2 Optimizations
[Back to Top](#Machine-Learning-Processing)

In [38]:
# IT2-A: Optimized PCA'd LogisticRegression
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('logistic', LogisticRegression(n_jobs=-2))])
grid = {'pca__n_components': [0.99],
       'logistic__C': np.linspace(-4, 4, 50),
       'logistic__penalty': ['l1', 'l2']}
lgr2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lgr2_a.fit(Xtrain, ytrain)

print('LogisticRegression Optimized Train Score:', lgr2_a.score(Xtrain, ytrain))
print('LogisticRegression Optimized Test Score:', lgr2_a.score(Xtest, ytest))
print('Best Params:', lgr2_a.best_params_)

        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan 0.93392           nan 0.93415084
        nan 0.934151          nan 0.93391984        nan 0.93391984
        nan 0.93391984        nan 0.93391984        nan 0.93391984
        nan 0.93391984        nan 0.93391984        nan 0.93391984
        nan 0.93391984        nan 0.93368868        nan 0.93345752
        nan 0.93345752        nan 0.93345752        nan 0.93345752
        nan 0.93345752        nan 0.93345752        nan 0.93345752
        nan 0.93345752        nan 0.93345752        nan 0.9334

LogisticRegression Optimized Train Score: 0.9366913123844732
LogisticRegression Optimized Test Score: 0.9214417744916821
Best Params: {'logistic__C': 0.408163265306122, 'logistic__penalty': 'l2', 'pca__n_components': 0.99}


In [39]:
# IT2-A: Optimized, PCA'd KNN     
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('knn', KNeighborsClassifier(n_jobs=-2))])
grid = {'pca__n_components': [0.99],
       'knn__n_neighbors': [ksqrt-5, ksqrt-3, ksqrt, ksqrt+3, ksqrt+5],   # already collected ksqrt
       'knn__weights': ['uniform', 'distance'],
       'knn__metric': ['euclidean', 'manhattan']}
knn2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
knn2_a.fit(Xtrain, ytrain)

print('KNN Optimized Train Score:', knn2_a.score(Xtrain, ytrain))
print('KNN Optimized Test Score:', knn2_a.score(Xtest, ytest))
print('Best Params:', knn2_a.best_params_)

  return self._fit(X, y)


KNN Optimized Train Score: 1.0
KNN Optimized Test Score: 0.9242144177449169
Best Params: {'knn__metric': 'euclidean', 'knn__n_neighbors': 61, 'knn__weights': 'distance', 'pca__n_components': 0.99}


In [40]:
# IT2-A: Optimized, PCA'd LDA
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('lda', LinearDiscriminantAnalysis())])
grid = {'pca__n_components': [0.99],
        'lda__solver': ['svd', 'lsqr', 'eigen']}
lda2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lda2_a.fit(Xtrain, ytrain)

print('LDA Optimized Train Score:', lda2_a.score(Xtrain, ytrain))
print('LDA Optimized Test Score:', lda2_a.score(Xtest, ytest))
print('Best Params:', lda2_a.best_params_)

LDA Optimized Train Score: 0.9336876155268022
LDA Optimized Test Score: 0.9232902033271719
Best Params: {'lda__solver': 'svd', 'pca__n_components': 0.99}


  return f(*args, **kwargs)


In [41]:
# IT2-A: Optimized, PCA'd GNB
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('gnb', GaussianNB())])
grid = {'pca__n_components': [0.99],
        'gnb__var_smoothing': np.logspace(0, -9, num=100)}
gnb2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
gnb2_a.fit(Xtrain, ytrain)

print('GNB Optimized Train Score:', gnb2_a.score(Xtrain, ytrain))
print('GNB Optimized Test Score:', gnb2_a.score(Xtest, ytest))
print('Best Params:', gnb2_a.best_params_)

GNB Optimized Train Score: 0.927449168207024
GNB Optimized Test Score: 0.9149722735674677
Best Params: {'gnb__var_smoothing': 0.8111308307896871, 'pca__n_components': 0.99}


  return f(*args, **kwargs)


In [42]:
# IT2-A: Optimized, PCA'd SVM
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('svc', SVC())])
grid = {'pca__n_components': [0.99],
       'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
       'svc__gamma': ['scale'],
       'svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}
svc2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
svc2_a.fit(Xtrain, ytrain)

print('SVC Optimized Train Score:', svc2_a.score(Xtrain, ytrain))
print('SVC Optimized Test Score:', svc2_a.score(Xtest, ytest))
print('Best Params:', svc2_a.best_params_)

  return f(*args, **kwargs)


SVC Optimized Train Score: 0.9376155268022182
SVC Optimized Test Score: 0.9269870609981515
Best Params: {'pca__n_components': 0.99, 'svc__C': 0.1, 'svc__gamma': 'scale', 'svc__kernel': 'linear'}


In [43]:
# IT2-A: Optimized, PCA'd CatBoost
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('catboost', CatBoostClassifier(verbose=False))])
grid = {'pca__n_components': [0.99],
       'catboost__depth': [5, 6, 7, 8, 9],
       'catboost__min_data_in_leaf': [1, 2, 4]}
cb2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
cb2_a.fit(Xtrain, ytrain)

print('CB Optimized Train Score:', cb2_a.score(Xtrain, ytrain))
print('CB Optimized Test Score:', cb2_a.score(Xtest, ytest))
print('Best Params:', cb2_a.best_params_)

CB Optimized Train Score: 1.0
CB Optimized Test Score: 0.9195933456561922
Best Params: {'catboost__depth': 9, 'catboost__min_data_in_leaf': 1, 'pca__n_components': 0.99}


In [44]:
# IT2-A: Optimized, PCA'd LightGBM
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('lgbm', LGBMClassifier(n_jobs=-2, random_state=430))])
grid = {'pca__n_components': [0.99],
       'lgbm__num_leaves': [31, 36, 41, 46, 51],
       'lgbm__max_depth': [5, 7, 9, 11, 13, 15, 17],
       'lgbm__min_data_in_leaf': [15, 16, 17, 18, 19]}
lgbm2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lgbm2_a.fit(Xtrain, ytrain)

print('LightGBM Optimized Train Score:', lgbm2_a.score(Xtrain, ytrain))
print('LightGBM Optimized Test Score:', lgbm2_a.score(Xtest, ytest))
print('Best Params:', lgbm2_a.best_params_)

  return f(*args, **kwargs)


LightGBM Optimized Train Score: 1.0
LightGBM Optimized Test Score: 0.922365988909427
Best Params: {'lgbm__max_depth': 15, 'lgbm__min_data_in_leaf': 17, 'lgbm__num_leaves': 41, 'pca__n_components': 0.99}


### Post-Optimization Report Card
[Back to Top](#Machine-Learning-Processing)

Compare:

[Original Optimized Testing](#Progress-Report-Card)

In [101]:
# Generate Progress Report Card
resul_a = pd.DataFrame(columns=['Model', 'Logistic', 'KNN', 'LDA', 'GNB', 'SVM', 'CatBoost', 'LightGBM'])
resul_a.set_index('Model', inplace=True)
resul_a.loc['Iteration0 Train'] = ['NA','NA','NA','NA','NA',0,0]
resul_a.loc['Iteration0 Test'] = ['NA','NA','NA','NA','NA',0,0]
resul_a.loc['Iteration1 Train'] = [0,0,0,0,0,0,0]
resul_a.loc['Iteration1 Test'] = [0,0,0,0,0,0,0]
resul_a.loc['Iteration2 Train'] = [0,0,0,0,0,0,0]
resul_a.loc['Iteration2 Test'] = [0,0,0,0,0,0,0]

# Group the Models
it0_a = [cb0_a, lgbm0_a]
it1_a = [lgr1_a, knn1_a, lda1_a, gnb1_a, svc1_a, cb1_a, lgbm1_a]
it2_a = [lgr2_a, knn2_a, lda2_a, gnb2_a, svc2_a, cb2_a, lgbm2_a]

# Tabulate Iteration 0; only Catboost & LGBM
col = 5
for model in it0_a:
    resul.iloc[0, col] = model.score(Xtrain, ytrain)
    resul.iloc[1, col] = model.score(Xtest, ytest)
    col+=1
    
# Tabulate Iteration 1; Piped w/ Scaler
col = 0
for model in it1_a:
    resul.iloc[2, col] = model.score(Xtrain, ytrain)
    resul.iloc[3, col] = model.score(Xtest, ytest)
    col+=1

col = 0
for model in it2_a:
    resul.iloc[4, col] = model.score(Xtrain, ytrain)
    resul.iloc[5, col] = model.score(Xtest, ytest)
    col+=1
    
resul.head(6)

Unnamed: 0_level_0,Logistic,KNN,LDA,GNB,SVM,CatBoost,LightGBM
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Iteration0 Train,,,,,,0.994686,1.0
Iteration0 Test,,,,,,0.927911,0.92976
Iteration1 Train,0.936922,1.0,0.935998,0.698475,0.937616,0.999307,1.0
Iteration1 Test,0.924214,0.924214,0.919593,0.672828,0.930684,0.931608,0.930684
Iteration2 Train,0.936691,1.0,0.933688,0.927449,0.937616,1.0,1.0
Iteration2 Test,0.921442,0.924214,0.92329,0.914972,0.926987,0.919593,0.922366


### Pickling Grouped Models

In [89]:
### Pickling Exports
pickle.dump(it0, open('iteration0_group.pickle', 'wb'))
pickle.dump(it1, open('iteration1_group.pickle', 'wb'))
pickle.dump(it2, open('iteration2_group.pickle', 'wb'))
pickle.dump(it0_a, open('iteration0_a_group.pickle', 'wb'))
pickle.dump(it1_a, open('iteration1_a_group.pickle', 'wb'))
pickle.dump(it2_a, open('iteration2_a_group.pickle', 'wb'))

# PCA Boosting Technique
[Back to Top](#Machine-Learning-Processing)


In [48]:
def PCAmerge(train, test, n_comp = 0.95, r_state = 42):
    from sklearn.decomposition import PCA 
    pca = PCA(n_components = n_comp, random_state = r_state)
    train_pca = pd.DataFrame(pca.fit_transform(train), index=train.index)
    test_pca = pd.DataFrame(pca.transform(test), index=test.index)
    PCA_col_list = []
    for num in range(1, train_pca.shape[1]+1):
        PCA_col_list.append(('PCA' + str(num)))
    train_pca.columns = PCA_col_list
    train = pd.concat([train, train_pca], axis=1)
    test_pca.columns = PCA_col_list
    test = pd.concat([test, test_pca], axis=1)
    return train, test


In [49]:
# IT3: PCA Scaled
scaled = StandardScaler()
Xtrain_Scaled = pd.DataFrame(scaled.fit_transform(Xtrain), columns = Xtrain.columns, index = Xtrain.index)
Xtest_Scaled = pd.DataFrame(scaled.fit_transform(Xtest), columns = Xtest.columns, index = Xtest.index)
XtrainPCA, XtestPCA = PCAmerge(Xtrain_Scaled, Xtest_Scaled, n_comp = 0.95)

In [50]:
XtestPCA.shape[1]

113

## PCABoosted Modeling
[Back to Top](#Machine-Learning-Processing)

Considering that the bespoke function would not play well with Pipeline, we have pre-processed the data here, first with scaling and then with the PCA Boost. Therefore, we can simply fit the raw models in this first, un-optimized modeling procedure.

In [54]:
# PCA_IT1: Logistic Regression, p_lgr1
p_lgr1 = LogisticRegression(n_jobs=-2)
p_lgr1.fit(XtrainPCA, ytrain)
print('Train Score:', p_lgr1.score(XtrainPCA, ytrain))
print('Test Score:', p_lgr1.score(XtestPCA, ytest))

  return f(*args, **kwargs)


Train Score: 0.9371534195933456
Test Score: 0.922365988909427


In [55]:
# PCA_IT1: KNN, p_knn1
p_knn1 = KNeighborsClassifier(n_jobs=-2)
p_knn1.fit(XtrainPCA, ytrain)
print('Train Score:', p_knn1.score(XtrainPCA, ytrain))
print('Test Score:', p_knn1.score(XtestPCA, ytest))

  return self._fit(X, y)


Train Score: 0.9406192236598891
Test Score: 0.9140480591497228


In [56]:
# PCA_IT1: LDA, p_lda1
p_lda1 = LinearDiscriminantAnalysis()
p_lda1.fit(XtrainPCA, ytrain)
print('Train Score:', p_lda1.score(XtrainPCA, ytrain))
print('Test Score:', p_lda1.score(XtestPCA, ytest))

Train Score: 0.9348428835489834
Test Score: 0.9297597042513863


  return f(*args, **kwargs)


In [57]:
# PCA_IT1: GNB, p_gnb1
p_gnb1 = GaussianNB()
p_gnb1.fit(XtrainPCA, ytrain)
print('Train Score:', p_gnb1.score(XtrainPCA, ytrain))
print('Test Score:', p_gnb1.score(XtestPCA, ytest))

Train Score: 0.6351663585951941
Test Score: 0.6349353049907579


  return f(*args, **kwargs)


In [58]:
# PCA_IT1: SVM, p_svc1
p_svc1 = SVC()
p_svc1.fit(XtrainPCA, ytrain)
print('Train Score:', p_svc1.score(XtrainPCA, ytrain))
print('Test Score:', p_svc1.score(XtestPCA, ytest))

  return f(*args, **kwargs)


Train Score: 0.9396950092421442
Test Score: 0.9214417744916821


In [59]:
# PCA_IT1: CatBoost, p_cb1
p_cb1 = CatBoostClassifier(verbose=False)
p_cb1.fit(XtrainPCA, ytrain)
print('Train Score:', p_cb1.score(XtrainPCA, ytrain))
print('Test Score:', p_cb1.score(XtestPCA, ytest))

Train Score: 0.9983826247689463
Test Score: 0.9306839186691312


In [60]:
# PCA_IT1: LightGBM, p_lgbm1
p_lgbm1 = LGBMClassifier(n_jobs=-2, random_state=430)
p_lgbm1.fit(XtrainPCA, ytrain)
print('Train Score:', p_lgbm1.score(XtrainPCA, ytrain))
print('Test Score:', p_lgbm1.score(XtestPCA, ytest))

  return f(*args, **kwargs)


Train Score: 1.0
Test Score: 0.9214417744916821


## Optimized PCABoosted Modeling
[Back to Top](#Machine-Learning-Processing)

In [69]:
# PCA_IT1-A: Logistic Regression, p_lgr1_a
grid = {'C': np.linspace(-4, 4, 50),
       'penalty': ['l1', 'l2']}
p_lgr1_a = GridSearchCV(LogisticRegression(n_jobs=-2),
                       grid, n_jobs=-2, cv=3)
p_lgr1_a.fit(XtrainPCA, ytrain)
print('Train Score:', p_lgr1_a.score(XtrainPCA, ytrain))
print('Test Score:', p_lgr1_a.score(XtestPCA, ytest))
print('Best Parameters:', p_lgr1_a.best_params_)

        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan 0.93415052        nan 0.93345688
        nan 0.93253239        nan 0.93230139        nan 0.93207039
        nan 0.93207039        nan 0.93230139        nan 0.93253239
        nan 0.93253239        nan 0.93276355        nan 0.93276355
        nan 0.93276355        nan 0.93299471        nan 0.93299471
        nan 0.93276339        nan 0.93253239        nan 0.93299455
        nan 0.93299455        nan 0.93253239        nan 0.93276355
        nan 0.93253239        nan 0.93253239        nan 0.9323

Train Score: 0.9366913123844732
Test Score: 0.9251386321626618
Best Parameters: {'C': 0.08163265306122458, 'penalty': 'l2'}


In [70]:
# PCA_IT1-A: KNN, p_knn1_a
grid = {'n_neighbors': [ksqrt-5, ksqrt-3, ksqrt-1, 
                        ksqrt, ksqrt+3, ksqrt+5],
       'weights': ['uniform', 'distance'],
       'metric': ['euclidean', 'manhattan']}
p_knn1_a = GridSearchCV(KNeighborsClassifier(n_jobs=-2),
                       grid, n_jobs=-2, cv=3)
p_knn1_a.fit(XtrainPCA, ytrain)
print('Train Score:', p_knn1_a.score(XtrainPCA, ytrain))
print('Test Score:', p_knn1_a.score(XtestPCA, ytest))
print('Best Parameters:', p_knn1_a.best_params_)

  return self._fit(X, y)


Train Score: 1.0
Test Score: 0.9186691312384473
Best Parameters: {'metric': 'euclidean', 'n_neighbors': 61, 'weights': 'distance'}


In [71]:
# PCA_IT1-A: LDA, p_lda1_a
grid = {'solver': ['svd', 'lsqr', 'eigen']}
p_lda1_a = GridSearchCV(LinearDiscriminantAnalysis(),
                       grid, n_jobs=-2, cv=3)
p_lda1_a.fit(XtrainPCA, ytrain)
print('Train Score:', p_lda1_a.score(XtrainPCA, ytrain))
print('Test Score:', p_lda1_a.score(XtestPCA, ytest))
print('Best Parameters:', p_lda1_a.best_params_)

Train Score: 0.9359981515711645
Test Score: 0.3539741219963031
Best Parameters: {'solver': 'lsqr'}


  return f(*args, **kwargs)


In [72]:
# PCA_IT1-A: GNB, p_gnb1_a
grid = {'var_smoothing': np.logspace(0, -9, num=100)}
p_gnb1_a = GridSearchCV(GaussianNB(),
                       grid, n_jobs=-2, cv=3)
p_gnb1_a.fit(XtrainPCA, ytrain)
print('Train Score:', p_gnb1_a.score(XtrainPCA, ytrain))
print('Test Score:', p_gnb1_a.score(XtestPCA, ytest))
print('Best Parameters:', p_gnb1_a.best_params_)

Train Score: 0.8858595194085028
Test Score: 0.8835489833641405
Best Parameters: {'var_smoothing': 1.0}


  return f(*args, **kwargs)


In [73]:
# PCA_IT1-A: SVC, p_svc1_a
grid = {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
       'gamma': ['scale'],
       'C': [0.001, 0.01, 0.1, 1, 10, 100]}
p_svc1_a = GridSearchCV(SVC(), grid, n_jobs=-2, cv=3)
p_svc1_a.fit(XtrainPCA, ytrain)
print('Train Score:', p_svc1_a.score(XtrainPCA, ytrain))
print('Test Score:', p_svc1_a.score(XtestPCA, ytest))
print('Best Parameters:', p_svc1_a.best_params_)

  return f(*args, **kwargs)


Train Score: 0.9376155268022182
Test Score: 0.9260628465804066
Best Parameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}


In [74]:
# PCA_IT1-A: CatBoost, p_cb1_a
grid = {'depth': [5, 6, 7, 8, 9],
       'min_data_in_leaf': [1, 2, 4]}
p_cb1_a = GridSearchCV(CatBoostClassifier(verbose=False),
                      grid, n_jobs=-2, cv=3)
p_cb1_a.fit(XtrainPCA, ytrain)
print('Train Score:', p_cb1_a.score(XtrainPCA, ytrain))
print('Test Score:', p_cb1_a.score(XtestPCA, ytest))
print('Best Parameters:', p_cb1_a.best_params_)



Train Score: 1.0
Test Score: 0.9297597042513863
Best Parameters: {'depth': 8, 'min_data_in_leaf': 1}


In [76]:
# PCA_IT1-A: LightGBM, p_lgbm1_a
grid = {'num_leaves': [31, 36, 41, 46, 51],
       'max_depth': [5, 7, 9, 11, 13, 15, 17],
       'min_data_in_leaf': [15, 16, 17, 18, 19, 20]}
p_lgbm1_a = GridSearchCV(LGBMClassifier(n_jobs=-2, random_state=430),
                        grid, n_jobs=-2, cv=3)
p_lgbm1_a.fit(XtrainPCA, ytrain)
print('Train Score:', p_lgbm1_a.score(XtrainPCA, ytrain))
print('Test Score:', p_lgbm1_a.score(XtestPCA, ytest))
print('Best Parameters:', p_lgbm1_a.best_params_)

  return f(*args, **kwargs)


Train Score: 1.0
Test Score: 0.9297597042513863
Best Parameters: {'max_depth': 15, 'min_data_in_leaf': 15, 'num_leaves': 41}


## PCABoosted Report Card
[Back to Top](#Machine-Learning-Processing)

Rather than combine literally all of our information to this point, here I would like to summarize some of the best general performance and make an observation in the process: Pipelining PCA processing into a model generally seems to decrease overall testing performance across most models, which otherwise seemed to fare much better.

By contrast, PCA Boosting clearly features some potential and should be a great contrast to our preceding results. Therefore we will direct our focus to our results in Pipelined Scaling & PCA Boosting Procedures.

Compare

[Original Unoptimized](#Progress-Report-Card)

[Original Optimized](#Post-Optimization-Report-Card)

In [95]:
# Generate Progress Report Card
p_resul = pd.DataFrame(columns=['Model', 'Logistic', 'KNN', 'LDA', 'GNB', 'SVM', 'CatBoost', 'LightGBM'])
p_resul.set_index('Model', inplace=True)
p_resul.loc['Unoptimized Train'] = [0,0,0,0,0,0,0] # it1
p_resul.loc['Unoptimized Test'] = [0,0,0,0,0,0,0] # it1
p_resul.loc['Optimized Train'] = [0,0,0,0,0,0,0] #it1_a
p_resul.loc['Optimized Test'] = [0,0,0,0,0,0,0] #it1_a
p_resul.loc['PCABoost Unopt Train'] = [0,0,0,0,0,0,0] #p_it1
p_resul.loc['PCABoost Unopt Test'] = [0,0,0,0,0,0,0] #p_it1
p_resul.loc['PCABoost Opt Train'] = [0,0,0,0,0,0,0] #p_it1_a
p_resul.loc['PCABoost Opt Test'] = [0,0,0,0,0,0,0] #p_it1_a


# Group the PCABoosts
p_it1 = [p_lgr1, p_knn1, p_lda1, p_gnb1, 
         p_svc1, p_cb1, p_lgbm1]
p_it1_a = [p_lgr1_a, p_knn1_a, p_lda1_a, 
           p_gnb1_a, p_svc1_a, p_cb1_a, p_lgbm1_a]

# Tabulate Iteration 1; Piped w/ Scaler
col = 0
for model in it1:
    p_resul.iloc[0, col] = model.score(Xtrain, ytrain)
    p_resul.iloc[1, col] = model.score(Xtest, ytest)
    col+=1

col = 0
for model in it1_a:
    p_resul.iloc[2, col] = model.score(Xtrain, ytrain)
    p_resul.iloc[3, col] = model.score(Xtest, ytest)
    col+=1

col = 0
for model in p_it1:
    p_resul.iloc[4, col] = model.score(XtrainPCA, ytrain)
    p_resul.iloc[5, col] = model.score(XtestPCA, ytest)
    col+=1

col = 0
for model in p_it1_a:
    p_resul.iloc[6, col] = model.score(XtrainPCA, ytrain)
    p_resul.iloc[7, col] = model.score(XtestPCA, ytest)
    col+=1

p_resul.head(8)

Unnamed: 0_level_0,Logistic,KNN,LDA,GNB,SVM,CatBoost,LightGBM
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Unoptimized Train,0.937616,0.939695,0.934843,0.698475,0.941312,0.997458,1.0
Unoptimized Test,0.922366,0.918669,0.919593,0.672828,0.924214,0.933457,0.934381
Optimized Train,0.936922,1.0,0.935998,0.698475,0.937616,0.999307,1.0
Optimized Test,0.924214,0.924214,0.919593,0.672828,0.930684,0.931608,0.930684
PCABoost Unopt Train,0.937153,0.940619,0.934843,0.635166,0.939695,0.998383,1.0
PCABoost Unopt Test,0.922366,0.914048,0.92976,0.634935,0.921442,0.930684,0.921442
PCABoost Opt Train,0.936691,1.0,0.935998,0.88586,0.937616,1.0,1.0
PCABoost Opt Test,0.925139,0.918669,0.353974,0.883549,0.926063,0.92976,0.92976


### Pickle New Groups

In [96]:
# Pickling Exports
pickle.dump(p_it1, open('p_iteration1_group.pickle', 'wb'))
pickle.dump(p_it1_a, open('p_iteration1_a_group.pickle', 'wb'))

**NB:** It is worth noting here that LightGBM's unoptimized run was met with exactly the same results from the naive, un-scaled test in Iteration 0.

In [98]:
# Iteration 0: LightGBM Test Score
print('LightGBM Train Score', lgbm0.score(Xtrain, ytrain))
print('LightGBM Test Score', lgbm0.score(Xtest, ytest))

LightGBM Train Score 1.0
LightGBM Test Score 0.9343807763401109


# Results & Discussion
[Back to Top](#Machine-Learning-Processing)

To recap, much of our exploration here has been to probe the usefulness of unsupervised learning in a pipelined format versus its application as a boosting procedure. Simultaneously, we have employed a diverse number of model types, including some newer devices in the CatBoostClassifier and the LightGBMClassifier.

From much of our results, we have revealed that both uses of PCA have their respective uses, the latter, the PCA Boosting Procedure, seems to have the greatest effect of narrowing overfit results. However, the greatest results of accuracy come to us through the raw, LightGBM results.

With that said, we will now turn our attention to the matter of insight from our findings. To whit, we will break this down into three parts:
   - Feature Importance
   - Prediction Refinement
   - Business Relevance & Benefits