# Machine Learning Processing

### Contents

 - I. [Data Importing & Processing](#Data-Import-and-Processing)
 - II. [Modeling Plan](#Modeling-Plan)
     - [Iteration 0](#Iteration-0:-Tree-based-Naivete)
     - [Iteration 1](#Iteration-1:-Baby's-First-Pipe)
     - [Iteration 2](#Iteration-2:-Give-the-Computer-a-Chance)
     - [Progress So Far](#Progress-Report-Card)
     - [Iteration 1 Optimization](#Iteration-1-Optimizations)
     - [Iteration 2 Optimization](#Iteration-2-Optimizations)
 - III. [PCA Weighted Approach](#Testing-PCA-Functions)
 - IV. [Results & Discussion]
     - [Business Impact]
     - [Permutation Importance]

In [1]:
# Import packages etc.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pickle

## Data Import and Processing

In [2]:
provdat = pickle.load(open('Xtrain.pickle', 'rb'))
provtarg = pickle.load(open('Xtest.pickle', 'rb'))
provgroup = pickle.load(open('Xdata.pickle', 'rb'))

In [3]:
# Examining Pre-Aggregated Per-Provider Data
print('Data Shape: ' + str(provdat.shape))
provdat.head()

Data Shape: (5410, 70)


| Provider | Number_of_Claims | Num_Unique_Patients | Num_Unique_Docs | Num_Unique_Diag_Codes | Num_Unique_Proc_Codes | Mean_InscClaimAmtReimbursed | Mean_DeductibleAmtPaid | Mean_Reimbursement_per_Day | Mean_Length_of_Stay | Mean_Length_of_Claim | ... | Perc_ClaimStartDt_March | Perc_ClaimStartDt_April | Perc_ClaimStartDt_May | Perc_ClaimStartDt_June | Perc_ClaimStartDt_July | Perc_ClaimStartDt_August | Perc_ClaimStartDt_September | Perc_ClaimStartDt_October | Perc_ClaimStartDt_November | Perc_ClaimStartDt_December |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PRV51001 | 25 | 24 | 19 | 72 | 3 | 4185.6 | 213.6 | 3700.0 | 6.0 | 2.44 | ... | 0.12 | 0.08 | 0.2 | 0.12 | 0.08 | 0.04 | 0.04 | 0.0 | 0.0 | 0.08 |
| PRV51003 | 132 | 117 | 51 | 365 | 39 | 4588.41 | 502.17 | 2137.42 | 6.16 | 4.67 | ... | 0.08 | 0.08 | 0.06 | 0.11 | 0.08 | 0.1 | 0.07 | 0.08 | 0.08 | 0.05 |
| PRV51004 | 149 | 138 | 48 | 267 | 0 | 350.13 | 2.08 | 0.0 | 0.0 | 2.43 | ... | 0.12 | 0.07 | 0.05 | 0.07 | 0.09 | 0.07 | 0.08 | 0.07 | 0.05 | 0.11 |
| PRV51005 | 1165 | 495 | 7 | 1294 | 0 | 241.12 | 3.18 | 0.0 | 0.0 | 2.09 | ... | 0.1 | 0.08 | 0.07 | 0.09 | 0.08 | 0.08 | 0.09 | 0.1 | 0.06 | 0.07 |
| PRV51007 | 72 | 58 | 11 | 165 | 1 | 468.19 | 45.33 | 1080.56 | 6.33 | 1.96 | ... | 0.12 | 0.06 | 0.12 | 0.12 | 0.04 | 0.08 | 0.1 | 0.03 | 0.08 | 0.08 |


In [4]:
# Examining Per-Provider Incidence of Fraud
print('Data Shape: ' + str(provtarg.shape))
provtarg.head()

Data Shape: (5410,)


Provider
PRV51001    0
PRV51003    1
PRV51004    0
PRV51005    1
PRV51007    0
Name: PotentialFraud, dtype: int64

In [5]:
# Combining Target data with Provider data to generate supervised learning dataset
xdata = pd.merge(provdat.reset_index(), provtarg.reset_index(), on='Provider').set_index('Provider')
xdata.head()

| Provider | Number_of_Claims | Num_Unique_Patients | Num_Unique_Docs | Num_Unique_Diag_Codes | Num_Unique_Proc_Codes | Mean_InscClaimAmtReimbursed | Mean_DeductibleAmtPaid | Mean_Reimbursement_per_Day | Mean_Length_of_Stay | Mean_Length_of_Claim | ... | Perc_ClaimStartDt_April | Perc_ClaimStartDt_May | Perc_ClaimStartDt_June | Perc_ClaimStartDt_July | Perc_ClaimStartDt_August | Perc_ClaimStartDt_September | Perc_ClaimStartDt_October | Perc_ClaimStartDt_November | Perc_ClaimStartDt_December | PotentialFraud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PRV51001 | 25 | 24 | 19 | 72 | 3 | 4185.6 | 213.6 | 3700.0 | 6.0 | 2.44 | ... | 0.08 | 0.2 | 0.12 | 0.08 | 0.04 | 0.04 | 0.0 | 0.0 | 0.08 | 0 |
| PRV51003 | 132 | 117 | 51 | 365 | 39 | 4588.41 | 502.17 | 2137.42 | 6.16 | 4.67 | ... | 0.08 | 0.06 | 0.11 | 0.08 | 0.1 | 0.07 | 0.08 | 0.08 | 0.05 | 1 |
| PRV51004 | 149 | 138 | 48 | 267 | 0 | 350.13 | 2.08 | 0.0 | 0.0 | 2.43 | ... | 0.07 | 0.05 | 0.07 | 0.09 | 0.07 | 0.08 | 0.07 | 0.05 | 0.11 | 0 |
| PRV51005 | 1165 | 495 | 7 | 1294 | 0 | 241.12 | 3.18 | 0.0 | 0.0 | 2.09 | ... | 0.08 | 0.07 | 0.09 | 0.08 | 0.08 | 0.09 | 0.1 | 0.06 | 0.07 | 1 |
| PRV51007 | 72 | 58 | 11 | 165 | 1 | 468.19 | 45.33 | 1080.56 | 6.33 | 1.96 | ... | 0.06 | 0.12 | 0.12 | 0.04 | 0.08 | 0.1 | 0.03 | 0.08 | 0.08 | 0 |


In [6]:
from sklearn.model_selection import train_test_split

# Split the data into train & test sets for supervised learning.
# The target is passed as a 1-D Series (iloc[:, -1]), the shape the estimators expect.
Xtrain, Xtest, ytrain, ytest = train_test_split(xdata.iloc[:, :-1],
                                                xdata.iloc[:, -1],
                                                test_size=0.2,
                                                random_state=430)

# Quick check of the resulting shapes
print('Training Data shape: ' + str(Xtrain.shape))
print('Test Data Shape: ' + str(Xtest.shape))

Training Data shape: (4328, 70)
Test Data Shape: (1082, 70)
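
The split above samples rows at random. If the fraud label turns out to be imbalanced (an assumption; the class balance is not shown here), passing `stratify` keeps the positive rate nearly equal in both partitions. A minimal sketch on synthetic data, where `X_demo`, `y_demo`, and the 10% positive rate are hypothetical stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the provider data (assumed ~10% positives)
rng = np.random.default_rng(430)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.10).astype(int)

# stratify=y_demo preserves the class proportions in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=430, stratify=y_demo)

rate_tr, rate_te = y_tr.mean(), y_te.mean()
print('Positive rate train / test:', rate_tr, rate_te)
```

Whether stratification changes the scores reported below would need to be checked by rerunning the notebook.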


## Modeling Plan
Our general strategy starts with a broad initial test of the following models:
 - Logistic Regression
 - K-Nearest Neighbors (KNN)
 - Linear Discriminant Analysis (LDA)
 - Gaussian Naive Bayes (GNB)
 - Support Vector Machine (SVM)
 - CatBoost
 - LightGBM

This group combines well-worn classical methods, which have a long history & extensive documentation, with newer methodologies. Of particular interest among the newer ones are **CatBoost & LightGBM**, two gradient boosting models that have shown a great deal of promise in terms of efficiency & accuracy.

### Iterative Strategy | What We Care About
 - Iteration 0: Naive, untouched modeling with tree-based models; just fit.
 - Iteration 1: Pipeline of scaling (standardization only, no normalization) & fit; no unsupervised step.
     - 1a for grid search?
 - Iteration 2: Pipeline of scaling (standardization only, no normalization), PCA, & fit.
     - 2a for grid search?

### Naming Procedure
We abbreviate the model name alongside the iteration number to keep track (e.g. "cb0" for "CatBoost, Iteration 0").

### Iteration 0: Tree-based Naivete
[Back to Top](#Machine-Learning-Processing)

Both CatBoost & LightGBM can handle practically unprocessed data such as we have imported; for details, please consult their respective documentation. Here we fit both without any preprocessing to get some initial results.

In [7]:
from catboost import CatBoostClassifier
cb0 = CatBoostClassifier(verbose=False)
cb0.fit(Xtrain, ytrain)
# score() on a classifier returns mean accuracy, not R2
print('CatBoost Train Accuracy:', cb0.score(Xtrain, ytrain))
print('CatBoost Test Accuracy:', cb0.score(Xtest, ytest))

CatBoost Train Accuracy: 0.9972273567467652
CatBoost Test Accuracy: 0.9316081330868762


In [8]:
from lightgbm import LGBMClassifier
lgbm0 = LGBMClassifier()
lgbm0.fit(Xtrain, ytrain)
# score() on a classifier returns mean accuracy, not R2
print('LightGBM Train Accuracy:', lgbm0.score(Xtrain, ytrain))
print('LightGBM Test Accuracy:', lgbm0.score(Xtest, ytest))

LightGBM Train Accuracy: 1.0
LightGBM Test Accuracy: 0.9343807763401109
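
The numbers printed by `.score` on a scikit-learn-style classifier are mean accuracy. Accuracy is easier to read next to a majority-class baseline, since a model that never flags fraud can still score high when positives are rare. A minimal sketch on synthetic data (the 90/10 class weights and all `*_demo` names are assumptions, not properties of the provider data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumed ~10% positives)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.9, 0.1], random_state=430)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=430, stratify=y_demo)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
baseline_recall = recall_score(y_te, baseline.predict(X_te))

# .score() and accuracy_score agree for classifiers
same_as_accuracy = np.isclose(model.score(X_te, y_te),
                              accuracy_score(y_te, model.predict(X_te)))

print('Baseline accuracy:', baseline_acc)
print('Baseline recall on the positive class:', baseline_recall)
print('Model accuracy:', model.score(X_te, y_te))
```

Applying the same comparison to `Xtest` and `ytest` would show how much of the reported accuracy comes from the class balance alone.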


### Iteration 1: Baby's First Pipe
[Back to Top](#Machine-Learning-Processing)

Our goal here is to test each model with standard scaling built into a pipeline. Because `StandardScaler` cannot handle infinite values, we first locate and replace them.

In [9]:
# Locate infinite values in the training features: (row positions, column positions)
badindices = np.where(np.isinf(Xtrain))
badindices

(array([ 220, 1330, 1347, 2198, 2599, 2614, 2920, 2946, 3951, 4321]),
 array([24, 24, 24, 24, 24, 24, 24, 24, 24, 24]))

In [10]:
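# Mean of the finite training values in column 24. Negating the positions
# (-badindices[0]) would select rows counted from the end rather than
# exclude the inf rows, so a boolean mask is used instead.
col24 = Xtrain.iloc[:, 24]
finite_mean = col24[np.isfinite(col24)].mean()

In [11]:
# Replace the inf entries in the training set with that mean
Xtrain.iloc[badindices[0], 24] = finite_mean

In [12]:
# Apply the same training-set mean to any inf entries in the test set
badtests = np.where(np.isinf(Xtest))
Xtest.iloc[badtests[0], 24] = finite_mean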
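
One thing to watch in this kind of imputation is that, in pandas and NumPy, negating an array of positions does not exclude those rows; it selects rows counted from the end. A minimal sketch on a toy column (the `demo` frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'ratio': [1.0, np.inf, 3.0, 5.0, np.inf, 7.0]})
bad_rows = np.where(np.isinf(demo['ratio']))[0]   # positions 1 and 4

# Negated positions are -1 and -4: the last row and the fourth-from-last row
from_end = demo['ratio'].iloc[-bad_rows]

# A boolean mask keeps exactly the finite entries: (1 + 3 + 5 + 7) / 4
finite_mean = demo.loc[np.isfinite(demo['ratio']), 'ratio'].mean()

# Impute the inf entries with the mean of the finite ones
demo.loc[~np.isfinite(demo['ratio']), 'ratio'] = finite_mean
print(demo['ratio'].tolist())
```

Computing the replacement value from the training rows only, and reusing it for the test rows, keeps test information out of the preprocessing.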

In [13]:
# IT1: Logistic Regression
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
lgr1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('logistic', LogisticRegression(n_jobs=-2))])
lgr1.fit(Xtrain, ytrain)
print('Logistic Regression Train Score:', lgr1.score(Xtrain, ytrain))
print('Logistic Regression Test Score:', lgr1.score(Xtest, ytest))

Logistic Regression Train Score: 0.9376155268022182
Logistic Regression Test Score: 0.922365988909427


In [14]:
# IT1: KNN
from sklearn.neighbors import KNeighborsClassifier
knn1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('knn', KNeighborsClassifier(n_jobs=-2))])
knn1.fit(Xtrain, ytrain)
print('KNearestNeighbors Train Score:', knn1.score(Xtrain, ytrain))
print('KNearestNeighbors Test Score:', knn1.score(Xtest, ytest))

KNearestNeighbors Train Score: 0.9396950092421442
KNearestNeighbors Test Score: 0.9186691312384473


In [15]:
# IT1: LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('lda', LinearDiscriminantAnalysis())])
lda1.fit(Xtrain, ytrain)
print('LDA Train Score:', lda1.score(Xtrain, ytrain))
print('LDA Test Score:', lda1.score(Xtest, ytest))

LDA Train Score: 0.9348428835489834
LDA Test Score: 0.9195933456561922


In [16]:
# IT1: GNB
from sklearn.naive_bayes import GaussianNB
gnb1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('gnb', GaussianNB())])
gnb1.fit(Xtrain, ytrain)
print('GNB Train Score', gnb1.score(Xtrain, ytrain))
print('GNB Test Score', gnb1.score(Xtest, ytest))

GNB Train Score 0.6984750462107209
GNB Test Score 0.6728280961182994


In [17]:
# IT1: SVM
from sklearn.svm import SVC
svc1 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('svc', SVC())])
svc1.fit(Xtrain, ytrain)
print('SVC Train Score', svc1.score(Xtrain, ytrain))
print('SVC Test Score', svc1.score(Xtest, ytest))

SVC Train Score 0.9413123844731978
SVC Test Score 0.9242144177449169


In [18]:
# IT1: CatBoost
cb1 = Pipeline(steps=[('scaler', StandardScaler()),
                     ('catboost', CatBoostClassifier(random_state=430, verbose=False))])
cb1.fit(Xtrain, ytrain)
print('Catboost Train Score', cb1.score(Xtrain, ytrain))
print('Catboost Test Score', cb1.score(Xtest, ytest))

Catboost Train Score 0.9976894639556377
Catboost Test Score 0.9353049907578558


In [19]:
# IT1: LightGBM
lgbm1 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('lightgbm', LGBMClassifier(n_jobs=-2, random_state=430))])
lgbm1.fit(Xtrain, ytrain)
print('LightGBM Train Score', lgbm1.score(Xtrain, ytrain))
print('LightGBM Test Score', lgbm1.score(Xtest, ytest))

LightGBM Train Score 1.0
LightGBM Test Score 0.9343807763401109


### Iteration 2: Give the Computer a Chance

[Back to Top](#Machine-Learning-Processing)
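
Iteration 2 places `PCA(n_components=0.99)` between the scaler and each classifier. With a fraction between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch on synthetic data (the nine-column `X_demo`, built from three underlying signals, is a hypothetical stand-in for the provider features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(430)
base = rng.normal(size=(500, 3))
noise = 0.01 * rng.normal(size=(500, 6))
# Nine columns driven by only three underlying signals plus small noise
X_demo = np.hstack([base, base @ rng.normal(size=(3, 6)) + noise])

X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA(n_components=0.99, random_state=430).fit(X_scaled)

kept = pca.n_components_                          # components actually retained
explained = pca.explained_variance_ratio_.sum()   # cumulative variance kept
print('Components kept:', kept, 'of', X_demo.shape[1])
print('Variance explained:', explained)
```

On the fitted pipelines below, the retained count can be read the same way, for example via `lgr2.named_steps['PCA'].n_components_`.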



In [20]:
# IT2: Logistic
from sklearn.decomposition import PCA 
lgr2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('logistic', LogisticRegression(n_jobs=-2))])
lgr2.fit(Xtrain, ytrain)
print('Logistic Regression Train Score', lgr2.score(Xtrain, ytrain))
print('Logistic Regression Test Score', lgr2.score(Xtest, ytest))

Logistic Regression Train Score 1.0
Logistic Regression Test Score 0.9242144177449169


In [21]:
# IT2: KNN
knn2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('knn', KNeighborsClassifier(n_jobs=-2))])
knn2.fit(Xtrain, ytrain)
print('KNearestNeighbors Train Score', knn2.score(Xtrain, ytrain))
print('KNearestNeighbors Test Score', knn2.score(Xtest, ytest))

KNearestNeighbors Train Score 0.9396950092421442
KNearestNeighbors Test Score 0.9214417744916821


In [22]:
# IT2: LDA
lda2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('lda', LinearDiscriminantAnalysis())])
lda2.fit(Xtrain, ytrain)
print('LDA Train Score', lda2.score(Xtrain, ytrain))
print('LDA Test Score', lda2.score(Xtest, ytest))

LDA Train Score 0.9336876155268022
LDA Test Score 0.9232902033271719


In [23]:
# IT2: GNB
gnb2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('gnb', GaussianNB())])
gnb2.fit(Xtrain, ytrain)
print('GNB Train Score', gnb2.score(Xtrain, ytrain))
print('GNB Test Score', gnb2.score(Xtest, ytest))

GNB Train Score 0.5459796672828097
GNB Test Score 0.5425138632162662


In [24]:
# IT2: SVM
svc2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('svc', SVC())])
svc2.fit(Xtrain, ytrain)
print('SVC Train Score', svc2.score(Xtrain, ytrain))
print('SVC Test Score', svc2.score(Xtest, ytest))

SVC Train Score 0.9394639556377079
SVC Test Score 0.9242144177449169


In [25]:
# IT2: CatBoost
cb2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('Catboost', CatBoostClassifier(random_state=430, verbose=False))])
cb2.fit(Xtrain, ytrain)
print('Catboost Train Score', cb2.score(Xtrain, ytrain))
print('Catboost Test Score', cb2.score(Xtest, ytest))

Catboost Train Score 0.9988447319778189
Catboost Test Score 0.9205175600739371


In [26]:
# IT2: LightGBM
lgbm2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('LightGBM', LGBMClassifier(random_state=430))])
lgbm2.fit(Xtrain, ytrain)
print('LightGBM Train Score', lgbm2.score(Xtrain, ytrain))
print('LightGBM Test Score', lgbm2.score(Xtest, ytest))

LightGBM Train Score 1.0
LightGBM Test Score 0.9242144177449169


### Progress Report Card

[Back to Top](#Machine-Learning-Processing)

In [27]:
# Generate Progress Report Card
resul = pd.DataFrame(columns=['Model', 'Logistic', 'KNN', 'LDA', 'GNB', 'SVM', 'CatBoost', 'LightGBM'])
resul.set_index('Model', inplace=True)
resul.loc['Iteration0 Train'] = ['NA','NA','NA','NA','NA',0,0]
resul.loc['Iteration0 Test'] = ['NA','NA','NA','NA','NA',0,0]
resul.loc['Iteration1 Train'] = [0,0,0,0,0,0,0]
resul.loc['Iteration1 Test'] = [0,0,0,0,0,0,0]
resul.loc['Iteration2 Train'] = [0,0,0,0,0,0,0]
resul.loc['Iteration2 Test'] = [0,0,0,0,0,0,0]

# Group the Models
it0 = [cb0, lgbm0]
it1 = [lgr1, knn1, lda1, gnb1, svc1, cb1, lgbm1]
it2 = [lgr2, knn2, lda2, gnb2, svc2, cb2, lgbm2]

# Tabulate Iteration 0; only Catboost & LGBM
col = 5
for model in it0:
    resul.iloc[0, col] = model.score(Xtrain, ytrain)
    resul.iloc[1, col] = model.score(Xtest, ytest)
    col+=1
    
# Tabulate Iteration 1; Piped w/ Scaler
col = 0
for model in it1:
    resul.iloc[2, col] = model.score(Xtrain, ytrain)
    resul.iloc[3, col] = model.score(Xtest, ytest)
    col+=1

# Tabulate Iteration 2; Piped w/ Scaler & PCA
col = 0
for model in it2:
    resul.iloc[4, col] = model.score(Xtrain, ytrain)
    resul.iloc[5, col] = model.score(Xtest, ytest)
    col+=1
    
resul.head(6)

| Model | Logistic | KNN | LDA | GNB | SVM | CatBoost | LightGBM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Iteration0 Train | NA | NA | NA | NA | NA | 0.997227 | 1.0 |
| Iteration0 Test | NA | NA | NA | NA | NA | 0.931608 | 0.934381 |
| Iteration1 Train | 0.937616 | 0.939695 | 0.934843 | 0.698475 | 0.941312 | 0.997689 | 1.0 |
| Iteration1 Test | 0.922366 | 0.918669 | 0.919593 | 0.672828 | 0.924214 | 0.935305 | 0.934381 |
| Iteration2 Train | 1.0 | 0.939695 | 0.933688 | 0.54598 | 0.939464 | 0.998845 | 1.0 |
| Iteration2 Test | 0.924214 | 0.921442 | 0.92329 | 0.542514 | 0.924214 | 0.920518 | 0.924214 |
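
The report card above is filled cell by cell with positional counters. A more compact pattern, sketched here under the assumption that fitted models are kept in a nested dictionary (the `models` registry and the synthetic data are hypothetical), builds the same kind of table from row dictionaries:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=430)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=430)

# Hypothetical registry: {iteration label: {column name: fitted model}}
models = {'Iteration1': {'Logistic': LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
                         'KNN': KNeighborsClassifier().fit(X_tr, y_tr)}}

# One row of train scores and one row of test scores per iteration
rows = {}
for iteration, fitted in models.items():
    rows[f'{iteration} Train'] = {name: m.score(X_tr, y_tr) for name, m in fitted.items()}
    rows[f'{iteration} Test'] = {name: m.score(X_te, y_te) for name, m in fitted.items()}

# Outer keys become columns, so transpose to get one row per iteration/split
report = pd.DataFrame(rows).T
print(report)
```

Keying scores by model name avoids relying on column positions, so adding or reordering models does not shift values into the wrong column.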


## Optimized Testing
[Back to Top](#Machine-Learning-Processing)

### Iteration 0 Optimizations

In [83]:
# IT0-A: Optimized Naive Catboost
from sklearn.model_selection import GridSearchCV
grid = {'depth': [5, 6, 7, 8, 9],
       'min_data_in_leaf': [1, 2, 4]}
cb0_a = GridSearchCV(CatBoostClassifier(verbose=False, random_state=430),
                    grid, n_jobs=-2, cv=3)
cb0_a.fit(Xtrain, ytrain)

print('CB Optimized Train Score:', cb0_a.score(Xtrain, ytrain))
print('CB Optimized Test Score:', cb0_a.score(Xtest, ytest))
print('Best Params:', cb0_a.best_params_)

CB Optimized Train Score: 0.9946857670979667
CB Optimized Test Score: 0.9279112754158965
Best Params: {'depth': 5, 'min_data_in_leaf': 1}


In [84]:
# IT0-A: Optimized LightGBM
grid = {'num_leaves': [44, 45, 46, 47, 48],
       'max_depth': [9, 10, 11, 12, 13],
       'min_data_in_leaf': [15, 16, 17, 18, 19]}
lgbm0_a = GridSearchCV(LGBMClassifier(n_jobs=-2, random_state=430),
                      grid, n_jobs=-2, cv=3)
lgbm0_a.fit(Xtrain, ytrain)

print('LightGBM Optimized Train Score:', lgbm0_a.score(Xtrain, ytrain))
print('LightGBM Optimized Test Score:', lgbm0_a.score(Xtest, ytest))
print('Best Params:', lgbm0_a.best_params_)

LightGBM Optimized Train Score: 1.0
LightGBM Optimized Test Score: 0.9297597042513863
Best Params: {'max_depth': 12, 'min_data_in_leaf': 17, 'num_leaves': 46}


### Iteration 1 Optimizations
[Back to Top](#Machine-Learning-Processing)


In [36]:
# IT1-A: Optimized LogisticRegression, lgr1_a
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('logistic', LogisticRegression(n_jobs=-2))])
# C must be positive, so only the positive part of the linear grid is searched;
# the default lbfgs solver supports only the l2 penalty.
C_grid = np.linspace(-4, 4, 50)
grid = {'logistic__C': C_grid[C_grid > 0],
       'logistic__penalty': ['l2']}
lgr1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lgr1_a.fit(Xtrain, ytrain)

print('LogisticRegression Optimized Train Score:', lgr1_a.score(Xtrain, ytrain))
print('LogisticRegression Optimized Test Score:', lgr1_a.score(Xtest, ytest))
print('Best Params:', lgr1_a.best_params_)

LogisticRegression Optimized Train Score: 0.9369223659889094
LogisticRegression Optimized Test Score: 0.9242144177449169
Best Params: {'logistic__C': 0.08163265306122458, 'logistic__penalty': 'l2'}
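
The grid above is spaced linearly in `C`. Regularization strength is commonly searched on a logarithmic scale instead, and an `l1` penalty needs a solver that supports it. A minimal sketch on synthetic data, assuming the `liblinear` solver (a swapped-in choice, not the notebook's configuration) and a hypothetical nine-point grid:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=430)

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('logistic', LogisticRegression(solver='liblinear'))])

# Log-spaced, strictly positive C values; liblinear accepts both l1 and l2
grid = {'logistic__C': np.logspace(-4, 4, 9),
        'logistic__penalty': ['l1', 'l2']}
search = GridSearchCV(pipe, grid, cv=3).fit(X_demo, y_demo)

# Count candidates whose cross-validated score is NaN (i.e. failed fits)
n_failed = int(np.isnan(search.cv_results_['mean_test_score']).sum())
print('Failed candidates:', n_failed)
print('Best Params:', search.best_params_)
```

Every candidate in this grid is a valid configuration, so no cross-validation fits are wasted on combinations that cannot run.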


In [46]:
# IT1-A: Optimized KNN, knn1_a
ksqrt = int(np.round(np.sqrt(Xtrain.shape[0])))       # Get sqrt of observations for k
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('knn', KNeighborsClassifier(n_jobs=-2))])
grid = {'knn__n_neighbors': [ksqrt-5, ksqrt-3, ksqrt, ksqrt+3, ksqrt+5],
       'knn__weights': ['uniform', 'distance'],
       'knn__metric': ['euclidean', 'manhattan']}
knn1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
knn1_a.fit(Xtrain, ytrain)

print('KNN Optimized Train Score:', knn1_a.score(Xtrain, ytrain))
print('KNN Optimized Test Score:', knn1_a.score(Xtest, ytest))
print('Best Params:', knn1_a.best_params_)

KNN Optimized Train Score: 1.0
KNN Optimized Test Score: 0.9242144177449169
Best Params: {'knn__metric': 'euclidean', 'knn__n_neighbors': 66, 'knn__weights': 'distance'}
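
A training score of exactly 1.0 is expected whenever `weights='distance'` is selected: each training point is its own nearest neighbor at distance zero, so its own label dominates the vote. That makes the training score uninformative for this setting. A minimal sketch on synthetic data (the `*_demo` names and `n_neighbors=15` are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=430)

uniform = KNeighborsClassifier(n_neighbors=15, weights='uniform').fit(X_demo, y_demo)
distance = KNeighborsClassifier(n_neighbors=15, weights='distance').fit(X_demo, y_demo)

# Scoring on the same data used for fitting
train_uniform = uniform.score(X_demo, y_demo)
train_distance = distance.score(X_demo, y_demo)
print('Uniform weights, train accuracy:', train_uniform)
print('Distance weights, train accuracy:', train_distance)
```

For distance-weighted KNN, the test score or the cross-validated `best_score_` of the grid search is the figure to compare across models.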


In [69]:
# IT1-A: Optimized LDA
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('lda', LinearDiscriminantAnalysis())])
grid = {'lda__solver': ['svd', 'lsqr', 'eigen']}
lda1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lda1_a.fit(Xtrain, ytrain)

print('LDA Optimized Train Score:', lda1_a.score(Xtrain, ytrain))
print('LDA Optimized Test Score:', lda1_a.score(Xtest, ytest))
print('Best Params:', lda1_a.best_params_)

LDA Optimized Train Score: 0.9359981515711645
LDA Optimized Test Score: 0.9195933456561922
Best Params: {'lda__solver': 'lsqr'}


In [68]:
# IT1-A: Optimized GNB
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('gnb', GaussianNB())])
grid = {'gnb__var_smoothing': np.logspace(0, -9, num=100)}
gnb1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
gnb1_a.fit(Xtrain, ytrain)

print('GNB Optimized Train Score:', gnb1_a.score(Xtrain, ytrain))
print('GNB Optimized Test Score:', gnb1_a.score(Xtest, ytest))
print('Best Params:', gnb1_a.best_params_)

GNB Optimized Train Score: 0.6984750462107209
GNB Optimized Test Score: 0.6728280961182994
Best Params: {'gnb__var_smoothing': 0.0002310129700083158}


In [47]:
# IT1-A: Optimized SVM
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('svc', SVC())])
grid = {'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
       'svc__gamma': ['scale'],
       'svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}
svc1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
svc1_a.fit(Xtrain, ytrain)

print('SVC Optimized Train Score:', svc1_a.score(Xtrain, ytrain))
print('SVC Optimized Test Score:', svc1_a.score(Xtest, ytest))
print('Best Params:', svc1_a.best_params_)

SVC Optimized Train Score: 0.9376155268022182
SVC Optimized Test Score: 0.9306839186691312
Best Params: {'svc__C': 0.1, 'svc__gamma': 'scale', 'svc__kernel': 'linear'}


In [66]:
# IT1-A: Optimized CatBoost
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('catboost', CatBoostClassifier(verbose=False))])
grid = {'catboost__depth': [5, 6, 7, 8, 9],
       'catboost__min_data_in_leaf': [1, 2, 4]}
cb1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
cb1_a.fit(Xtrain, ytrain)

print('CB Optimized Train Score:', cb1_a.score(Xtrain, ytrain))
print('CB Optimized Test Score:', cb1_a.score(Xtest, ytest))
print('Best Params:', cb1_a.best_params_)

CB Optimized Train Score: 0.9993068391866913
CB Optimized Test Score: 0.9316081330868762
Best Params: {'catboost__depth': 7, 'catboost__min_data_in_leaf': 1}


In [64]:
# IT1-A: Optimized LightGBM
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                          ('lgbm', LGBMClassifier(n_jobs=-2, random_state=430))])
grid = {'lgbm__num_leaves': [44, 45, 46, 47, 48],
       'lgbm__max_depth': [9, 10, 11, 12, 13],
       'lgbm__min_data_in_leaf': [15, 16, 17, 18, 19]}
lgbm1_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lgbm1_a.fit(Xtrain, ytrain)

print('LightGBM Optimized Train Score:', lgbm1_a.score(Xtrain, ytrain))
print('LightGBM Optimized Test Score:', lgbm1_a.score(Xtest, ytest))
print('Best Params:', lgbm1_a.best_params_)

LightGBM Optimized Train Score: 1.0
LightGBM Optimized Test Score: 0.9306839186691312
Best Params: {'lgbm__max_depth': 11, 'lgbm__min_data_in_leaf': 17, 'lgbm__num_leaves': 46}


### Iteration 2 Optimizations
[Back to Top](#Machine-Learning-Processing)

In [74]:
# IT2-A: Optimized PCA'd LogisticRegression
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('logistic', LogisticRegression(n_jobs=-2))])
# C must be positive, so only the positive part of the linear grid is searched;
# the default lbfgs solver supports only the l2 penalty.
C_grid = np.linspace(-4, 4, 50)
grid = {'pca__n_components': [0.99],
       'logistic__C': C_grid[C_grid > 0],
       'logistic__penalty': ['l2']}
lgr2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lgr2_a.fit(Xtrain, ytrain)

print('LogisticRegression Optimized Train Score:', lgr2_a.score(Xtrain, ytrain))
print('LogisticRegression Optimized Test Score:', lgr2_a.score(Xtest, ytest))
print('Best Params:', lgr2_a.best_params_)

LogisticRegression Optimized Train Score: 0.9366913123844732
LogisticRegression Optimized Test Score: 0.9214417744916821
Best Params: {'logistic__C': 0.408163265306122, 'logistic__penalty': 'l2', 'pca__n_components': 0.99}


In [75]:
# IT2-A: Optimized, PCA'd KNN     
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('knn', KNeighborsClassifier(n_jobs=-2))])
grid = {'pca__n_components': [0.99],
       'knn__n_neighbors': [ksqrt-5, ksqrt-3, ksqrt, ksqrt+3, ksqrt+5],   # already collected ksqrt
       'knn__weights': ['uniform', 'distance'],
       'knn__metric': ['euclidean', 'manhattan']}
knn2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
knn2_a.fit(Xtrain, ytrain)

print('KNN Optimized Train Score:', knn2_a.score(Xtrain, ytrain))
print('KNN Optimized Test Score:', knn2_a.score(Xtest, ytest))
print('Best Params:', knn2_a.best_params_)

KNN Optimized Train Score: 1.0
KNN Optimized Test Score: 0.9242144177449169
Best Params: {'knn__metric': 'euclidean', 'knn__n_neighbors': 61, 'knn__weights': 'distance', 'pca__n_components': 0.99}


In [76]:
# IT2-A: Optimized, PCA'd LDA
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('lda', LinearDiscriminantAnalysis())])
grid = {'pca__n_components': [0.99],
        'lda__solver': ['svd', 'lsqr', 'eigen']}
lda2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lda2_a.fit(Xtrain, ytrain)

print('LDA Optimized Train Score:', lda2_a.score(Xtrain, ytrain))
print('LDA Optimized Test Score:', lda2_a.score(Xtest, ytest))
print('Best Params:', lda2_a.best_params_)

LDA Optimized Train Score: 0.9336876155268022
LDA Optimized Test Score: 0.9232902033271719
Best Params: {'lda__solver': 'svd', 'pca__n_components': 0.99}


In [77]:
# IT2-A: Optimized, PCA'd GNB
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('gnb', GaussianNB())])
grid = {'pca__n_components': [0.99],
        'gnb__var_smoothing': np.logspace(0, -9, num=100)}
gnb2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
gnb2_a.fit(Xtrain, ytrain)

print('GNB Optimized Train Score:', gnb2_a.score(Xtrain, ytrain))
print('GNB Optimized Test Score:', gnb2_a.score(Xtest, ytest))
print('Best Params:', gnb2_a.best_params_)

GNB Optimized Train Score: 0.927449168207024
GNB Optimized Test Score: 0.9149722735674677
Best Params: {'gnb__var_smoothing': 0.8111308307896871, 'pca__n_components': 0.99}


In [78]:
# IT2-A: Optimized, PCA'd SVM
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('svc', SVC())])
grid = {'pca__n_components': [0.99],
       'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
       'svc__gamma': ['scale'],
       'svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}
svc2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
svc2_a.fit(Xtrain, ytrain)

print('SVC Optimized Train Score:', svc2_a.score(Xtrain, ytrain))
print('SVC Optimized Test Score:', svc2_a.score(Xtest, ytest))
print('Best Params:', svc2_a.best_params_)

SVC Optimized Train Score: 0.9376155268022182
SVC Optimized Test Score: 0.9269870609981515
Best Params: {'pca__n_components': 0.99, 'svc__C': 0.1, 'svc__gamma': 'scale', 'svc__kernel': 'linear'}


In [79]:
# IT2-A: Optimized, PCA'd CatBoost
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('catboost', CatBoostClassifier(verbose=False))])
grid = {'pca__n_components': [0.99],
       'catboost__depth': [5, 6, 7, 8, 9],
       'catboost__min_data_in_leaf': [1, 2, 4]}
cb2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
cb2_a.fit(Xtrain, ytrain)

print('CB Optimized Train Score:', cb2_a.score(Xtrain, ytrain))
print('CB Optimized Test Score:', cb2_a.score(Xtest, ytest))
print('Best Params:', cb2_a.best_params_)

CB Optimized Train Score: 1.0
CB Optimized Test Score: 0.9195933456561922
Best Params: {'catboost__depth': 9, 'catboost__min_data_in_leaf': 1, 'pca__n_components': 0.99}


In [80]:
# IT2-A: Optimized, PCA'd LightGBM
pipeline = Pipeline(steps=[('scaler', StandardScaler()),
                           ('pca', PCA(random_state=430)),
                           ('lgbm', LGBMClassifier(n_jobs=-2, random_state=430))])
grid = {'pca__n_components': [0.99],
       'lgbm__num_leaves': [31, 36, 41, 46, 51],
       'lgbm__max_depth': [5, 7, 9, 11, 13, 15, 17],
       'lgbm__min_data_in_leaf': [15, 16, 17, 18, 19]}
lgbm2_a = GridSearchCV(pipeline, grid, n_jobs=-2, cv=3)
lgbm2_a.fit(Xtrain, ytrain)

print('LightGBM Optimized Train Score:', lgbm2_a.score(Xtrain, ytrain))
print('LightGBM Optimized Test Score:', lgbm2_a.score(Xtest, ytest))
print('Best Params:', lgbm2_a.best_params_)

LightGBM Optimized Train Score: 1.0
LightGBM Optimized Test Score: 0.922365988909427
Best Params: {'lgbm__max_depth': 15, 'lgbm__min_data_in_leaf': 17, 'lgbm__num_leaves': 41, 'pca__n_components': 0.99}


### Post-Optimization Report Card
[Back to Top](#Machine-Learning-Processing)

In [85]:
# Generate Post-Optimization Report Card
resul_a = pd.DataFrame(columns=['Model', 'Logistic', 'KNN', 'LDA', 'GNB', 'SVM', 'CatBoost', 'LightGBM'])
resul_a.set_index('Model', inplace=True)
resul_a.loc['Iteration0 Train'] = ['NA','NA','NA','NA','NA',0,0]
resul_a.loc['Iteration0 Test'] = ['NA','NA','NA','NA','NA',0,0]
resul_a.loc['Iteration1 Train'] = [0,0,0,0,0,0,0]
resul_a.loc['Iteration1 Test'] = [0,0,0,0,0,0,0]
resul_a.loc['Iteration2 Train'] = [0,0,0,0,0,0,0]
resul_a.loc['Iteration2 Test'] = [0,0,0,0,0,0,0]

# Group the Models
it0_a = [cb0_a, lgbm0_a]
it1_a = [lgr1_a, knn1_a, lda1_a, gnb1_a, svc1_a, cb1_a, lgbm1_a]
it2_a = [lgr2_a, knn2_a, lda2_a, gnb2_a, svc2_a, cb2_a, lgbm2_a]

# Tabulate Iteration 0; only Catboost & LGBM
col = 5
for model in it0_a:
    resul_a.iloc[0, col] = model.score(Xtrain, ytrain)
    resul_a.iloc[1, col] = model.score(Xtest, ytest)
    col+=1
    
# Tabulate Iteration 1; Piped w/ Scaler
col = 0
for model in it1_a:
    resul_a.iloc[2, col] = model.score(Xtrain, ytrain)
    resul_a.iloc[3, col] = model.score(Xtest, ytest)
    col+=1

# Tabulate Iteration 2; Piped w/ Scaler & PCA
col = 0
for model in it2_a:
    resul_a.iloc[4, col] = model.score(Xtrain, ytrain)
    resul_a.iloc[5, col] = model.score(Xtest, ytest)
    col+=1
    
resul_a.head(6)

| Model | Logistic | KNN | LDA | GNB | SVM | CatBoost | LightGBM |
|---|---|---|---|---|---|---|---|
| Iteration0 Train | | | | | | 0.994686 | 1.0 |
| Iteration0 Test | | | | | | 0.927911 | 0.92976 |
| Iteration1 Train | 0.936922 | 1.0 | 0.935998 | 0.698475 | 0.937616 | 0.999307 | 1.0 |
| Iteration1 Test | 0.924214 | 0.924214 | 0.919593 | 0.672828 | 0.930684 | 0.931608 | 0.930684 |
| Iteration2 Train | 0.936691 | 1.0 | 0.933688 | 0.927449 | 0.937616 | 1.0 | 1.0 |
| Iteration2 Test | 0.921442 | 0.924214 | 0.92329 | 0.914972 | 0.926987 | 0.919593 | 0.922366 |
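
The positional `iloc` bookkeeping in the report-card cell can also be written by keying each fitted model by its iteration and column label, so no row or column counters are needed. This is only a minimal sketch under assumptions: the data is synthetic, the two scikit-learn pipelines stand in for the notebook's fitted models, and `make_pipe` and `report_card` are hypothetical helper names that do not appear in the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the provider data (assumption, not the real dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)


def make_pipe(estimator):
    # Hypothetical helper: scaler followed by the given estimator
    return Pipeline(steps=[('scaler', StandardScaler()), ('model', estimator)])


# Fitted models keyed by (iteration label, column label)
fitted = {
    ('Iteration1', 'Logistic'): make_pipe(LogisticRegression()).fit(X_tr, y_tr),
    ('Iteration1', 'GNB'): make_pipe(GaussianNB()).fit(X_tr, y_tr),
}


def report_card(fitted, X_tr, y_tr, X_te, y_te):
    # Collect one Train row and one Test row per iteration
    rows = {}
    for (iteration, name), model in fitted.items():
        rows.setdefault(iteration + ' Train', {})[name] = model.score(X_tr, y_tr)
        rows.setdefault(iteration + ' Test', {})[name] = model.score(X_te, y_te)
    return pd.DataFrame.from_dict(rows, orient='index').rename_axis('Model')


card = report_card(fitted, X_tr, y_tr, X_te, y_te)
print(card)
```

Models missing from an iteration (such as the first five columns of Iteration 0) would simply come out as NaN, because each row only carries the keys it was given.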


## Testing PCA Functions
[Back to Top](#Machine-Learning-Processing)

In [20]:
# Function to merge PCA data back to original
from sklearn.decomposition import PCA

def PCAmerge(df, n_comp=0.99, r_state=42):
    """Return df with its principal components appended as columns PCA1, PCA2, ..."""
    # A float n_comp between 0 and 1 keeps enough components to explain that share of variance
    pca = PCA(n_components=n_comp, random_state=r_state)
    # Note: the PCA is fit on whatever df is passed in
    df_pca = pd.DataFrame(pca.fit_transform(df), index=df.index)
    # Name the component columns PCA1..PCAk
    df_pca.columns = ['PCA' + str(num) for num in range(1, df_pca.shape[1] + 1)]
    return pd.concat([df, df_pca], axis=1)

In [27]:
# TRYING TO MAKE IT PIPELINE-COMPATIBLE. NOT SO EASY BUT HAVE A LEAD ON SOMETHING CALLED KEDRO
# https://towardsdatascience.com/kedro-prepare-to-pimp-your-pipeline-f8f68c263466
# from sklearn.base import BaseEstimator, TransformerMixin
# class PCAmergePipe(BaseEstimator, TransformerMixin):
#     def __init__(self, n_comp = 0.99, r_state = 42):
#             self.n_comp = n_comp
#             self.r_state = r_state
#     def fit_transform(self, X, y):
#             from sklearn.decomposition import PCA 
#             pca = PCA(n_components = n_comp, random_state = r_state)
#             df_pca = pd.DataFrame(pca.fit_transform(self), index=self.index)
#             PCA_col_list = []
#             for num in range(1, df_pca.shape[1]+1):
#                 PCA_col_list.append(('PCA' + str(num)))
#             df_pca.columns = PCA_col_list
#             self = pd.concat([self, df_pca], axis=1)
#             return self
# #     def transform(self, x):
# #             return x[self.ind_feature,:]
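
One possible way to make the merge pipeline-compatible without extra tooling is a small scikit-learn transformer that implements both `fit` and `transform`. The sketch below is an assumption about how that could look, not the notebook's final method: `PCAAppender` is a hypothetical name, it returns a NumPy array rather than a DataFrame (so the `PCA1..PCAk` column names are dropped), and the check at the end uses random data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA


class PCAAppender(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: append PCA components to the input features."""

    def __init__(self, n_comp=0.99, r_state=42):
        # Store hyperparameters unchanged so get_params/set_params and cloning work
        self.n_comp = n_comp
        self.r_state = r_state

    def fit(self, X, y=None):
        # Learn the components once, from the data passed to fit (the training set)
        self.pca_ = PCA(n_components=self.n_comp, random_state=self.r_state)
        self.pca_.fit(X)
        return self

    def transform(self, X):
        # Project with the already-fitted components and stack them next to X
        return np.hstack([np.asarray(X), self.pca_.transform(X)])


# Small synthetic check (random data, not the provider dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
appender = PCAAppender(n_comp=2).fit(X_demo)
out = appender.transform(X_demo)
print(out.shape)
```

Because `TransformerMixin` supplies `fit_transform` from `fit` and `transform`, the class does not need to define it. Splitting the two steps also means the components learned on the training set are reused on the test set instead of being refit.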

In [21]:
PCAmerge(Xtrain, 0.99, 430)

| Provider | Number_of_Claims | Num_Unique_Patients | Num_Unique_Docs | Num_Unique_Diag_Codes | Num_Unique_Proc_Codes | Mean_InscClaimAmtReimbursed | Mean_DeductibleAmtPaid | Mean_Reimbursement_per_Day | Mean_Length_of_Stay | Mean_Length_of_Claim | ... | Perc_ClaimStartDt_July | Perc_ClaimStartDt_August | Perc_ClaimStartDt_September | Perc_ClaimStartDt_October | Perc_ClaimStartDt_November | Perc_ClaimStartDt_December | PCA1 | PCA2 | PCA3 | PCA4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PRV56682 | 19 | 19 | 11 | 66 | 4 | 863.68 | 168.63 | 1400.00 | 4.67 | 2.68 | ... | 0.11 | 0.11 | 0.11 | 0.00 | 0.11 | 0.16 | 2721.536335 | 1510.184013 | -13.625119 | 797.323463 |
| PRV54630 | 85 | 23 | 4 | 173 | 0 | 302.12 | 1.18 | 0.00 | 0.00 | 2.58 | ... | 0.12 | 0.12 | 0.06 | 0.12 | 0.07 | 0.05 | -1335.867823 | 1104.179480 | -797.684800 | -412.460162 |
| PRV55179 | 74 | 66 | 33 | 195 | 14 | 2417.97 | 223.01 | 2034.73 | 5.06 | 3.01 | ... | 0.09 | 0.07 | 0.14 | 0.07 | 0.08 | 0.07 | 986.023464 | -598.616462 | 637.878703 | 1041.106514 |
| PRV57312 | 334 | 137 | 20 | 492 | 0 | 273.41 | 3.92 | 0.00 | 0.00 | 2.99 | ... | 0.09 | 0.09 | 0.08 | 0.08 | 0.08 | 0.06 | -4582.450114 | 427.695665 | 188.733308 | -431.012063 |
| PRV54923 | 79 | 57 | 21 | 149 | 0 | 384.81 | 0.25 | 0.00 | 0.00 | 2.32 | ... | 0.13 | 0.11 | 0.08 | 0.08 | 0.09 | 0.08 | 3447.416497 | 2657.819578 | 4.046877 | -407.644689 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PRV55620 | 15 | 14 | 2 | 36 | 0 | 122.67 | 13.33 | 0.00 | 0.00 | 1.00 | ... | 0.07 | 0.20 | 0.07 | 0.00 | 0.20 | 0.07 | -3546.492028 | 532.989551 | -1060.429670 | -408.405878 |
| PRV55267 | 144 | 78 | 12 | 285 | 0 | 250.14 | 2.85 | 0.00 | 0.00 | 2.46 | ... | 0.09 | 0.12 | 0.08 | 0.07 | 0.08 | 0.09 | -2798.762230 | 697.676356 | -782.952063 | -419.406958 |
| PRV52407 | 43 | 18 | 5 | 103 | 0 | 371.40 | 9.53 | 0.00 | 0.00 | 1.53 | ... | 0.16 | 0.05 | 0.02 | 0.12 | 0.07 | 0.09 | 1978.409508 | 2053.424472 | -709.897248 | -400.561357 |
| PRV54154 | 12 | 11 | 10 | 28 | 0 | 136.67 | 0.00 | 0.00 | 0.00 | 2.75 | ... | 0.08 | 0.08 | 0.17 | 0.17 | 0.00 | 0.00 | -7403.879256 | -608.482271 | -1064.031067 | -435.582912 |
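
`PCAmerge` was applied here to the raw `Xtrain`, so the component values sit on the scale of the largest columns (thousands), and only four components were needed to reach the 0.99 variance threshold. The sketch below uses synthetic data (an assumption, not the provider dataset) to illustrate how much standardizing first can change the number of components PCA keeps. It is meant as an illustration of the scale sensitivity, not as a statement about what the real data would do.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: one column in the thousands, five columns of unit scale
big = rng.normal(scale=1000.0, size=(500, 1))
small = rng.normal(scale=1.0, size=(500, 5))
X_demo = np.hstack([big, small])

# Same variance threshold as the notebook, with and without standardizing
raw = PCA(n_components=0.99).fit(X_demo)
scaled = PCA(n_components=0.99).fit(StandardScaler().fit_transform(X_demo))

print('components kept, raw:   ', raw.n_components_)
print('components kept, scaled:', scaled.n_components_)
```

On unscaled input the large column alone accounts for nearly all of the variance, so the threshold is met almost immediately. After standardizing, each independent column contributes a similar share.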


## Iteration 3: Give the PCA some backup

[Back to Top](#Machine-Learning-Processing)



In [28]:
# IT3: LightGBM on scaled features plus merged PCA components
# Fails as written: Pipeline needs every intermediate step to implement fit and transform
from sklearn.decomposition import PCA
from lightgbm import LGBMClassifier
lgbm3 = Pipeline(steps=[('scaler', StandardScaler()),
                        ('PCAmergePipe', PCAmergePipe()),
                        ('lightgbm', LGBMClassifier(n_jobs=-2))])
lgbm3.fit(Xtrain, ytrain)
print('LightGBM Train Score', lgbm3.score(Xtrain, ytrain))
print('LightGBM Test Score', lgbm3.score(Xtest, ytest))

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'PCAmergePipe()' (type <class '__main__.PCAmergePipe'>) doesn't
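
The error is raised because `Pipeline` checks that each intermediate step exposes both `fit` and `transform`. One option that avoids a custom class altogether is scikit-learn's `FeatureUnion`, which concatenates the outputs of several transformers. The following is a hedged sketch on synthetic data: `LogisticRegression` is only a stand-in for `LGBMClassifier`, and the step names `original`, `merge` and `clf` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in for the provider data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)

# FunctionTransformer() with no function is an identity step, so the union
# outputs the scaled features side by side with their PCA components
merge = FeatureUnion([('original', FunctionTransformer()),
                      ('pca', PCA(n_components=0.99, random_state=430))])

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('merge', merge),
                       ('clf', LogisticRegression(max_iter=1000))])  # stand-in for LGBMClassifier
pipe.fit(X_demo, y_demo)

# Inspect what the classifier actually receives
merged = pipe[:-1].transform(X_demo)
print(merged.shape)
```

Because the union sits after the scaler inside the pipeline, the PCA is fit on standardized training data only and is reused unchanged when scoring the test set.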

In [105]:
# IT2: KNN
knn2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('knn', KNeighborsClassifier(n_jobs=-2))])
knn2.fit(Xtrain, ytrain)
print('KNN Train Score', knn2.score(Xtrain, ytrain))
print('KNN Test Score', knn2.score(Xtest, ytest))

KNN Train Score 0.9396950092421442
KNN Test Score 0.9214417744916821


In [106]:
# IT2: LDA
lda2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('lda', LinearDiscriminantAnalysis())])
lda2.fit(Xtrain, ytrain)
print('LDA Train Score', lda2.score(Xtrain, ytrain))
print('LDA Test Score', lda2.score(Xtest, ytest))

LDA Train Score 0.9336876155268022
LDA Test Score 0.9232902033271719


In [107]:
# IT2: GNB
gnb2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('gnb', GaussianNB())])
gnb2.fit(Xtrain, ytrain)
print('GNB Train Score', gnb2.score(Xtrain, ytrain))
print('GNB Test Score', gnb2.score(Xtest, ytest))

GNB Train Score 0.5459796672828097
GNB Test Score 0.5425138632162662


In [108]:
# IT2: SVM
svc2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('svc', SVC())])
svc2.fit(Xtrain, ytrain)
print('SVM Train Score', svc2.score(Xtrain, ytrain))
print('SVM Test Score', svc2.score(Xtest, ytest))

SVM Train Score 0.9394639556377079
SVM Test Score 0.9242144177449169


In [109]:
# IT2: CatBoost
cb2 = Pipeline(steps=[('scaler', StandardScaler()),
                      ('PCA', PCA(n_components=0.99, random_state=430)),
                      ('Catboost', CatBoostClassifier(verbose=False))])
cb2.fit(Xtrain, ytrain)
print('CatBoost Train Score', cb2.score(Xtrain, ytrain))
print('CatBoost Test Score', cb2.score(Xtest, ytest))

CatBoost Train Score 0.9988447319778189
CatBoost Test Score 0.9205175600739371


In [110]:
# IT2: LightGBM
lgbm2 = Pipeline(steps=[('scaler', StandardScaler()),
                       ('PCA', PCA(n_components=0.99, random_state=430)),
                       ('LightGBM', LGBMClassifier())])
lgbm2.fit(Xtrain, ytrain)
print('LightGBM Train Score', lgbm2.score(Xtrain, ytrain))
print('LightGBM Test Score', lgbm2.score(Xtest, ytest))

LightGBM Train Score 1.0
LightGBM Test Score 0.9242144177449169
