# Sklearn Toolkit and Pipeline Demonstration

This document demonstrates the sklearn pipeline using data from the ASHRAE - Great Energy Predictor III competition, available at https://www.kaggle.com/c/ashrae-energy-prediction/data?select=test.csv

In [1]:
import os
import glob
import scipy
from tqdm.notebook import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

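import scipy.stats
import sklearn
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.svm
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.exceptions import NotFittedError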

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split

from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
files = glob.glob("data/*.csv")
data = {}
for f in tqdm(files):
    # Use the file name without directory or extension as the key (portable across OSes)
    name = os.path.splitext(os.path.basename(f))[0]
    data[name] = pd.read_csv(f)

print('Downloaded data:')
for name in data: print('\t{}'.format(name))



Downloaded data:
	building_metadata
	sample_submission
	test
	train
	weather_test
	weather_train


## Problem Setup:

For this demonstration, the "problem" has been reimagined as a classification problem. This document aims to answer: given a building's energy use over one year, its floor area, and the local climate conditions, can we estimate its primary use type?

In [3]:
# Aggregate meter readings by building
train_set = data['train'].groupby('building_id')['meter_reading'].sum().to_frame().reset_index()

# Organize Weather Data by min, max, mean
weather = data['weather_train'][['site_id', 'timestamp', 'air_temperature']]
weather_max = weather.groupby('site_id')['air_temperature'].max().to_frame().rename(columns={'air_temperature': 'temp_max'})
weather_mean = weather.groupby('site_id')['air_temperature'].mean().to_frame().rename(columns={'air_temperature': 'temp_mean'})
weather_min = weather.groupby('site_id')['air_temperature'].min().to_frame().rename(columns={'air_temperature': 'temp_min'})
weather = pd.concat([weather_max, weather_mean, weather_min], axis=1).reset_index()

# Exclude year and floors due to excess missing data
building = data['building_metadata'][['site_id', 'building_id', 'primary_use', 'square_feet']]

# Merge data
merged_data = pd.merge(train_set, building, on='building_id')
merged_data = pd.merge(merged_data, weather, on=['site_id'])
merged_data = merged_data[['primary_use', 'square_feet', 'meter_reading', 'temp_max', 'temp_mean', 'temp_min']]
merged_data

Unnamed: 0,primary_use,square_feet,meter_reading,temp_max,temp_mean,temp_min
0,Education,7432,1.286461e+06,36.1,22.836021,1.7
1,Education,2720,6.576176e+05,36.1,22.836021,1.7
2,Education,5376,1.278194e+05,36.1,22.836021,1.7
3,Education,23685,2.069071e+06,36.1,22.836021,1.7
4,Education,116607,8.578074e+06,36.1,22.836021,1.7
...,...,...,...,...,...,...
1444,Entertainment/public assembly,19619,5.570443e+04,33.9,9.357618,-23.9
1445,Education,4298,3.525474e+04,33.9,9.357618,-23.9
1446,Entertainment/public assembly,11265,2.684063e+04,33.9,9.357618,-23.9
1447,Lodging/residential,29775,1.397959e+06,33.9,9.357618,-23.9


In [4]:
print(merged_data['primary_use'].value_counts())

Education                        549
Office                           279
Entertainment/public assembly    184
Public services                  156
Lodging/residential              147
Other                             25
Healthcare                        23
Parking                           22
Warehouse/storage                 13
Manufacturing/industrial          12
Retail                            11
Services                          10
Technology/science                 6
Food sales and service             5
Utility                            4
Religious worship                  3
Name: primary_use, dtype: int64


For the models below to work, the low-frequency building types (those with fewer than 10 buildings) were excluded from the data set before splitting. Perhaps the errors associated with including them could be resolved in a future revision.

In [5]:
use_type_count = merged_data['primary_use'].value_counts()
remove = use_type_count[use_type_count<10].index.tolist()
trimmed = merged_data[~merged_data['primary_use'].isin(remove)]

Our final step is to split our data into a training set and a testing set.

In [6]:
# Assign X and y features
X = trimmed[['meter_reading', 'square_feet', 'temp_max', 'temp_mean', 'temp_min']]
y = trimmed[['primary_use']].values.ravel()

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
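
Because the building types are heavily imbalanced, a plain random split can leave rare classes under-represented in one of the two sets. The sketch below is not part of the original analysis. It uses synthetic labels (an assumption for illustration) to show how the `stratify` argument of `train_test_split` keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only, not the ASHRAE data)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array(['Education'] * 80 + ['Office'] * 20)

# stratify=y_demo asks for the same class proportions in both splits
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0, stratify=y_demo
)

train_share = np.mean(yd_train == 'Office')
test_share = np.mean(yd_test == 'Office')
print('Office share: train {:.0%}, test {:.0%}'.format(train_share, test_share))
```

Whether stratification would resolve the errors seen with the rare building types has not been verified here.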

### *Part 1 &mdash; Default Models, Dummy Classifiers, and Cross Validation*

The sklearn toolbox offers a variety of dummy models, which employ simple strategies and provide a baseline to compare your models against. In the example below, we will compare an unoptimized decision tree with the simple strategy of guessing the most common building type.

https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html
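
As a minimal sketch of the baseline idea, the block below fits a `DummyClassifier` with `strategy='most_frequent'` on a small synthetic data set (the data and the names `X_demo`, `y_demo`, and `baseline` are assumptions for illustration). Its accuracy is simply the share of the most common class, which is the number any real model has to beat.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic data: 7 'Education' and 3 'Office' labels; the dummy ignores the features
X_demo = np.zeros((10, 2))
y_demo = np.array(['Education'] * 7 + ['Office'] * 3)

baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print('Baseline accuracy: {:.1%}'.format(baseline_acc))  # prints 70.0%
```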

In [9]:
# Baseline 
# dum = DummyClassifier(strategy='most_frequent')
# dum.fit(X_train,y_train)

# print('The DummyClassifier has a score of\n {:0.1f}% training set\n {:0.1f}% testing set'.format(
#     dum.score(X_train, y_train)*100, dum.score(X_test, y_test)*100
# ))

# Decision tree using default hyperparameters
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train,y_train)

print('')
print('The DecisionTreeClassifier has a score of\n {:0.1f}% training set\n {:0.1f}% testing set'.format(
    dt.score(X_train, y_train)*100, dt.score(X_test, y_test)*100
))


The DecisionTreeClassifier has a score of
 100.0% training set
 41.5% testing set


The decision tree scores 100% on the training set but only 41.5% on the testing set, which indicates overfitting. To fix this, we need to tune the hyperparameters of the model so that it generalizes better. For example, one hyperparameter of a decision tree is its maximum depth, so we might suspect that our trained decision tree is too deep, hence the overfitting.

In [11]:
# Depth of the decision tree when max_depth is not constrained
dt.get_depth()

25

In [9]:
for depth in range(2,10):
    print('for max_depth={}:  score on test data:{:0.1%}'.format(
        depth, DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train).score(X_test, y_test)
    ))

for max_depth=2:  score on test data:45.3%
for max_depth=3:  score on test data:46.7%
for max_depth=4:  score on test data:46.7%
for max_depth=5:  score on test data:47.0%
for max_depth=6:  score on test data:46.0%
for max_depth=7:  score on test data:48.8%
for max_depth=8:  score on test data:46.7%
for max_depth=9:  score on test data:46.3%


Note: excerpt from lab 7 from Professor Andrew Delong's course: COMP6321 (Machine Learning)
<div style="border-bottom: 3px solid black; margin-bottom:5px"></div>

As a rule, data marked as a "test set" should ALMOST NEVER be used for training, or even for model selection. All modeling choices (parameters, best model) must be made based on training data, ONLY. Otherwise you will very likely fool yourself, or others, into thinking your system will perform well on held-out data when it will not. 

"Peeking" at the test data, directly or indirectly, or even measuring the performance on test data too often, is even considered cheating. In fact, at least <a href="https://www.cio.com/article/2935233/baidu-fires-researcher-involved-in-ai-contest-flap.html">one well-known machine learning scientist was <b>fired from his job</b></a> for trying to tune hyperparameters directly to the test data.


***K*-fold cross validation** is a specific procedure for estimating held-out performance using only the training set. It creates *K* different (training, validation) splits and then averages the validation performance measured on each one. (Beware that scikit-learn's [description of cross validation](https://scikit-learn.org/stable/modules/cross_validation.html#k-fold) sometimes refers to the *K* individual validation sets as "test sets", which can be confusing since they are not really test sets.) The *K*-fold cross validation procedure is depicted below. When there are *K* splits the result is *K* different performance estimates, one for each of the held-out folds.
<img src="img/grid_search_cross_validation.png" width="550">

(Image source: https://scikit-learn.org/stable/modules/cross_validation.html)

Note that the "test data" depicted above is not needed for the cross validation procedure itself, and is only used as an (optional) final performance evaluation, after the model selection procedure.

Use the **[sklearn.metrics.accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)** function or, equivalently, the **[score](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.score)** method of your *DecisionTreeClassifier* to compute the training and testing accuracies.

Use the **[sklearn.model_selection.cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)** function to do the cross validation. It will return an array of *K* values, so you need to average them to get an overall estimate.

<div style="border-bottom: 3px solid black; margin-bottom:5px"></div>
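
A minimal sketch of the procedure described above, using synthetic data from `make_classification` rather than the building data (the data set and variable names are assumptions for illustration). `cross_val_score` returns one score per fold, and the mean is the overall held-out estimate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# cv=5 produces five (training, validation) splits and five validation scores
fold_scores = cross_val_score(tree, X_demo, y_demo, cv=5)
mean_score = np.mean(fold_scores)
print('per-fold accuracy:', np.round(fold_scores, 3))
print('held-out estimate: {:.1%}'.format(mean_score))
```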

For time-based data, you can use a time series split instead of a k-fold split, so that each model is validated only on observations that come after its training data.

<img src="img/kFold.png" width="550">
<img src="img/TimeSeries.png" width="550">

(Image source: https://datascience.stackexchange.com/questions/41378/how-to-apply-stacking-cross-validation-for-time-series-data)
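
The time series split can be sketched with `sklearn.model_selection.TimeSeriesSplit`. The example below uses twelve ordered placeholder samples (an assumption for illustration) and prints the indices of each split; every training window ends before its validation window begins.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve samples assumed to be ordered in time (illustrative only)
X_demo = np.arange(12).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X_demo))
for train_idx, test_idx in splits:
    print('train:', train_idx, 'validate:', test_idx)
```

The splitter can be passed as the `cv` argument of `cross_val_score` or `RandomizedSearchCV` in place of an integer.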

In [10]:
# Using k-fold cross validation on the training data
dt = DecisionTreeClassifier(random_state=1, max_depth=7).fit(X_train, y_train)

print('The training accuracy is: {:.1%}'.format(dt.score(X_train, y_train)))
print('The testing accuracy is: {:.1%}'.format(dt.score(X_test, y_test)))

for i in range(2,7):
    accuracy = sklearn.model_selection.cross_val_score(dt, X_train, y_train, cv=i)
    print('held-out accuracy ({}-fold): {:.1%}'.format(i, np.mean(accuracy)))

The training accuracy is: 60.8%
The testing accuracy is: 48.8%
held-out accuracy (2-fold): 42.7%
held-out accuracy (3-fold): 43.6%
held-out accuracy (4-fold): 45.2%
held-out accuracy (5-fold): 44.1%
held-out accuracy (6-fold): 46.3%


From this we can conclude that cross validation gives us a reasonable estimate (42.7% to 46.3%, against a testing accuracy of 48.8%) of how our model will perform on the testing set, without having to use the testing set.

### *Part 2 &mdash; Hyperparameter tuning using random search*

<img src="img/gridsearchvsrandomsearch.png" width="550">

(Image source: https://blog.usejournal.com/a-comparison-of-grid-search-and-randomized-search-using-scikit-learn-29823179bc85?gi=f8c3537f6b61)

In [12]:
dist = {
    # Note: range(100) includes 0, which is not a valid max_depth, so that candidate fails to fit
    'max_depth': range(100),
}

cv_dt = sklearn.model_selection.RandomizedSearchCV(
    estimator=dt,
    param_distributions=dist,
    verbose=1,
    cv=5, n_iter=100, n_jobs=8
).fit(X_train, y_train)
pd.DataFrame(cv_dt.cv_results_).sort_values('mean_test_score', ascending=False)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    3.6s
[Parallel(n_jobs=8)]: Done 480 tasks      | elapsed:    4.3s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    4.3s finished


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
5,0.007002,0.000632,0.002601,0.000494,5,{'max_depth': 5},0.423581,0.471616,0.489083,0.475983,0.469298,0.465912,0.022245,1
3,0.005201,0.000403,0.002603,0.000490,3,{'max_depth': 3},0.427948,0.462882,0.467249,0.467249,0.464912,0.458048,0.015138,2
6,0.005798,0.000399,0.002456,0.000457,6,{'max_depth': 6},0.393013,0.462882,0.489083,0.484716,0.456140,0.457167,0.034430,3
4,0.007204,0.001599,0.003000,0.000004,4,{'max_depth': 4},0.414847,0.471616,0.475983,0.502183,0.416667,0.456259,0.034688,4
2,0.005399,0.000489,0.002598,0.000488,2,{'max_depth': 2},0.410480,0.458515,0.454148,0.475983,0.469298,0.453685,0.022940,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40,0.009203,0.000749,0.002802,0.000753,40,{'max_depth': 40},0.349345,0.379913,0.427948,0.401747,0.355263,0.382843,0.029268,18
99,0.008000,0.000003,0.001600,0.000489,99,{'max_depth': 99},0.349345,0.379913,0.427948,0.401747,0.355263,0.382843,0.029268,18
15,0.009202,0.000399,0.002599,0.000490,15,{'max_depth': 15},0.344978,0.401747,0.410480,0.393013,0.359649,0.381973,0.025269,98
19,0.009202,0.000400,0.002400,0.000490,19,{'max_depth': 19},0.349345,0.384279,0.406114,0.401747,0.355263,0.379350,0.023335,99


In [12]:
rf = RandomForestClassifier()

dist = {
    'max_depth': scipy.stats.reciprocal(1, 100),
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

cv_rf = sklearn.model_selection.RandomizedSearchCV(
    estimator=rf,
    param_distributions=dist,
    verbose=1,
    cv=5, n_iter=100, n_jobs=8
).fit(X_train, y_train)
pd.DataFrame(cv_rf.cv_results_).sort_values('mean_test_score', ascending=False)

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    2.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    9.9s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:   21.9s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:   25.1s finished


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_bootstrap,param_max_depth,param_max_features,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
84,0.326417,0.016007,0.023976,0.002734,True,6.293079,log2,"{'bootstrap': True, 'max_depth': 6.29307899501...",0.489083,0.506550,0.506550,0.524017,0.473684,0.499977,0.017172,1
20,0.442300,0.017478,0.025406,0.003384,True,8.572146,log2,"{'bootstrap': True, 'max_depth': 8.57214611071...",0.445415,0.502183,0.524017,0.515284,0.504386,0.498257,0.027564,2
53,0.324190,0.013936,0.024007,0.001674,False,7.181467,sqrt,"{'bootstrap': False, 'max_depth': 7.1814674676...",0.493450,0.497817,0.493450,0.524017,0.482456,0.498238,0.013854,3
75,0.313110,0.006264,0.021806,0.001722,True,6.647599,log2,"{'bootstrap': True, 'max_depth': 6.64759882789...",0.489083,0.497817,0.515284,0.510917,0.478070,0.498234,0.013726,4
51,0.326872,0.008330,0.021805,0.001327,False,7.114845,sqrt,"{'bootstrap': False, 'max_depth': 7.1148448014...",0.493450,0.493450,0.502183,0.532751,0.469298,0.498226,0.020445,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26,0.303288,0.005674,0.023806,0.001833,True,1.541399,log2,"{'bootstrap': True, 'max_depth': 1.54139870162...",0.388646,0.379913,0.379913,0.379913,0.377193,0.381115,0.003910,92
43,0.324073,0.012280,0.023406,0.001744,True,1.413404,log2,"{'bootstrap': True, 'max_depth': 1.41340397662...",0.388646,0.379913,0.379913,0.379913,0.377193,0.381115,0.003910,92
49,0.215849,0.019358,0.024605,0.002727,False,1.574983,log2,"{'bootstrap': False, 'max_depth': 1.5749825483...",0.388646,0.379913,0.379913,0.379913,0.377193,0.381115,0.003910,92
97,0.255099,0.019266,0.021815,0.003107,True,1.450475,log2,"{'bootstrap': True, 'max_depth': 1.45047545688...",0.388646,0.379913,0.379913,0.379913,0.377193,0.381115,0.003910,92
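
The sampled `max_depth` values in the table above are floats because `scipy.stats.reciprocal` is a continuous distribution. Tree depth is conceptually an integer, and some scikit-learn releases may reject non-integer values (an assumption to check against your installed version). A minimal sketch of one alternative is to sample integers with `scipy.stats.randint`. Note that this samples uniformly rather than log-uniformly, so it is not an exact replacement. `ParameterSampler` is used here only to preview the candidates that a random search would draw.

```python
import numpy as np
import scipy.stats
from sklearn.model_selection import ParameterSampler

# Integer-valued search space (randint excludes the upper bound, so this covers 1..100)
int_dist = {
    'max_depth': scipy.stats.randint(1, 101),
    'max_features': ['sqrt', 'log2'],
}

# Preview five candidates without fitting any model
candidates = list(ParameterSampler(int_dist, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```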


In [13]:
svc = sklearn.svm.SVC()
dist = {
    'C': scipy.stats.reciprocal(1, 1000),
    'gamma': scipy.stats.reciprocal(1, 1000),
}

cv_svc = sklearn.model_selection.RandomizedSearchCV(
    estimator=svc,
    param_distributions=dist,
    verbose=1,
    cv=5, n_iter=100, n_jobs=8
).fit(X_train, y_train)
pd.DataFrame(cv_svc.cv_results_).sort_values('mean_test_score', ascending=False)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    2.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:   10.7s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:   24.9s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:   28.8s finished


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_gamma,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.411890,0.020038,0.029807,0.004262,859.779953,5.784698,"{'C': 859.7799530321922, 'gamma': 5.7846978549...",0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1
63,0.427492,0.017846,0.029309,0.003682,177.09167,47.233154,"{'C': 177.09166975443986, 'gamma': 47.23315361...",0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1
73,0.400709,0.017627,0.028508,0.001480,517.8824,13.372108,"{'C': 517.8824000570073, 'gamma': 13.372107754...",0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1
72,0.412092,0.016301,0.029211,0.002227,15.845447,3.143069,"{'C': 15.845446759956616, 'gamma': 3.143069065...",0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1
71,0.434496,0.012129,0.030207,0.002136,4.156799,1.045736,"{'C': 4.156798944193392, 'gamma': 1.0457363007...",0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30,0.446183,0.021357,0.027801,0.002497,31.556887,150.129367,"{'C': 31.556887104893608, 'gamma': 150.1293670...",0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1
29,0.441678,0.020261,0.028608,0.000803,13.284505,61.500206,"{'C': 13.2845050871431, 'gamma': 61.5002061368...",0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1
28,0.440657,0.019827,0.036119,0.005846,24.327763,13.981333,"{'C': 24.3277629246435, 'gamma': 13.9813334321...",0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1
27,0.428899,0.012832,0.033007,0.004777,376.225322,6.295912,"{'C': 376.2253217029127, 'gamma': 6.2959120465...",0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1


In [18]:
lr = sklearn.linear_model.LogisticRegression(penalty='none', solver='sag')
dist = {
    'fit_intercept': [True, False],
}

cv_lr = sklearn.model_selection.RandomizedSearchCV(
    estimator=lr,
    param_distributions=dist,
    verbose=1,
    cv=5, n_iter=100, n_jobs=8
).fit(X_train, y_train)
pd.DataFrame(cv_lr.cv_results_).sort_values('mean_test_score', ascending=False)

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished


Fitting 5 folds for each of 2 candidates, totalling 10 fits




Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_fit_intercept,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.084378,0.007909,0.00332,0.000367,True,{'fit_intercept': True},0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1
1,0.072823,0.010301,0.003202,0.0004,False,{'fit_intercept': False},0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1


### *Part 3 &mdash; Sklearn Pipeline*

In [20]:
svcs = sklearn.pipeline.Pipeline(steps=[
    ('scaler', sklearn.preprocessing.StandardScaler()),
    ('model',  sklearn.svm.SVC())
])

In [21]:
sklearn.set_config(display='diagram')
svcs

In [22]:
def print_param_names(estimator):
    for name in estimator.get_params():
        print(name)

print_param_names(svcs)

memory
steps
verbose
scaler
model
scaler__copy
scaler__with_mean
scaler__with_std
model__C
model__break_ties
model__cache_size
model__class_weight
model__coef0
model__decision_function_shape
model__degree
model__gamma
model__kernel
model__max_iter
model__probability
model__random_state
model__shrinking
model__tol
model__verbose
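
The names listed above follow the `<step name>__<parameter>` convention, which is how a search can reach into individual pipeline steps. As a small sketch (the pipeline name `demo_pipe` and the chosen values are assumptions for illustration), the same names can be passed to `set_params`:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

demo_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', SVC()),
])

# 'model__C' addresses parameter C of the step named 'model', and so on
demo_pipe.set_params(model__C=10.0, scaler__with_mean=False)

print(demo_pipe.named_steps['model'].C)
print(demo_pipe.named_steps['scaler'].with_mean)
```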


In [23]:
dist = {
    'scaler__with_mean': [True, False],
    'model__C': scipy.stats.reciprocal(1, 100),
    'model__gamma': scipy.stats.reciprocal(1, 100),
}
cv_svcs = sklearn.model_selection.RandomizedSearchCV(
    estimator=svcs,
    param_distributions=dist,
    verbose=1,
    cv=5, n_iter=100, n_jobs=8
).fit(X_train, y_train)
pd.DataFrame(cv_svcs.cv_results_).sort_values('mean_test_score', ascending=False)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  52 tasks      | elapsed:    1.0s
[Parallel(n_jobs=8)]: Done 352 tasks      | elapsed:    6.4s
[Parallel(n_jobs=8)]: Done 485 out of 500 | elapsed:    8.6s remaining:    0.2s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    8.7s finished


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__C,param_model__gamma,param_scaler__with_mean,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
59,0.151424,0.009890,0.015604,0.000491,1.246122,72.526019,False,"{'model__C': 1.2461219523498164, 'model__gamma...",0.484716,0.458515,0.537118,0.502183,0.451754,0.486857,0.030987,1
61,0.089419,0.014041,0.010100,0.001747,7.368663,1.667916,True,"{'model__C': 7.368662994578213, 'model__gamma'...",0.471616,0.471616,0.506550,0.519651,0.460526,0.485992,0.022880,2
96,0.072265,0.005841,0.009658,0.000537,1.223413,6.250581,True,"{'model__C': 1.2234134768491174, 'model__gamma...",0.489083,0.471616,0.515284,0.497817,0.451754,0.485111,0.021828,3
75,0.095842,0.011037,0.008233,0.003125,5.146538,2.130122,False,"{'model__C': 5.146537669648605, 'model__gamma'...",0.462882,0.467249,0.515284,0.510917,0.464912,0.484249,0.023638,4
90,0.110067,0.021341,0.006443,0.003676,20.50364,1.233929,False,"{'model__C': 20.50364000853946, 'model__gamma'...",0.445415,0.467249,0.515284,0.510917,0.478070,0.483387,0.026480,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,0.128832,0.007680,0.013604,0.002059,24.859493,33.033821,True,"{'model__C': 24.85949345629452, 'model__gamma'...",0.462882,0.436681,0.484716,0.458515,0.399123,0.448384,0.028971,96
49,0.167845,0.013682,0.014799,0.000745,45.846585,37.30371,True,"{'model__C': 45.84658542465399, 'model__gamma'...",0.445415,0.427948,0.493450,0.458515,0.412281,0.447522,0.027784,97
20,0.137001,0.022370,0.014970,0.003205,23.70972,35.308304,False,"{'model__C': 23.709719930191678, 'model__gamma...",0.454148,0.436681,0.489083,0.458515,0.399123,0.447510,0.029503,98
76,0.146203,0.013887,0.011251,0.000701,35.96977,35.969273,True,"{'model__C': 35.96976961739412, 'model__gamma'...",0.449782,0.441048,0.484716,0.449782,0.407895,0.446644,0.024503,99


In [24]:
cv_svcs.best_estimator_

In [25]:
dt_best = cv_dt.best_estimator_.fit(X_train, y_train)
rf_best = cv_rf.best_estimator_.fit(X_train, y_train)
svcs_best = cv_svcs.best_estimator_.fit(X_train, y_train)

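# The DummyClassifier baseline from Part 1 is commented out there, so fit it here
dum = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)

print('The results are:')
models = [dum, dt_best, rf_best, svcs_best]
names = ['dum', 'dt', 'rf', 'svcs']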
for model, name in zip(models, names):
    print('{}:\ttrain acc:{:0.1%}\t test acc:{:0.1%}'.format(
        name, model.score(X_train, y_train), model.score(X_test, y_test)
    ))


### *Part 4 &mdash; Custom Transformers and Estimators*

For this section, we will write a custom transformer and a custom estimator. Up to now you may be thinking that this all looks good, but that your data requires special preprocessing and models based on your own expertise, so the pre-made sklearn models will not work for you. Fortunately, you can integrate your particular needs into an sklearn-style model. The advantage is that you can rapidly compare your models against a vast library of pre-made models and take advantage of the sklearn tools, such as random search.

For the transformer below, we imagine a situation where we are unsure whether the weather inputs are actually useful. Perhaps it would be better to simply designate the location and let the models learn that some buildings share a location, rather than trying to derive meaning from the max, min, and mean weather conditions.

This custom transformer will do one of the following, depending on its arguments:
1. Do nothing to the input features
2. Normalize all the input features using an sklearn-style pre-processor
3. Encode the weather data as a single categorical location feature using an sklearn-style encoder
4. Both encode the weather data and normalize the non-weather data

In [29]:
class pre_processer(BaseEstimator, TransformerMixin):
    """Optionally normalize the features and/or encode the weather columns.

    Expects a DataFrame whose first two columns are non-weather features and
    whose last three columns are temp_max, temp_mean and temp_min.
    """
    def __init__(self, normalizer=None, encoder=None):
        self.normalizer = normalizer
        self.encoder = encoder

    def fit(self, x, y=None):
        x_ = x
        if self.encoder is not None:
            x_left = x_.iloc[:, 0:2]
            # The product of the three temperature statistics acts as a per-site key
            x_right = (x_.iloc[:, 2] * x_.iloc[:, 3] * x_.iloc[:, 4]).to_frame()
            self.encoder.fit(x_right)
            if self.normalizer is not None:
                # Only the non-weather columns are normalized when encoding
                self.normalizer.fit(x_left)
        elif self.normalizer is not None:
            self.normalizer.fit(x_)
        return self

    def transform(self, x, y=None):
        x_ = x
        if self.encoder is not None:
            x_left = x_.iloc[:, 0:2]
            x_right = (x_.iloc[:, 2] * x_.iloc[:, 3] * x_.iloc[:, 4]).to_frame()
            x_right = pd.DataFrame(self.encoder.transform(x_right))
            if self.normalizer is not None:
                x_left = pd.DataFrame(self.normalizer.transform(x_left))
            # Align the row indices before concatenating side by side
            x_left.reset_index(drop=True, inplace=True)
            x_right.reset_index(drop=True, inplace=True)
            return pd.concat([x_left, x_right], ignore_index=True, axis=1)
        elif self.normalizer is not None:
            return pd.DataFrame(self.normalizer.transform(x_))
        else:
            return x_


In [36]:
pre = pre_processer()

pre = pre_processer(
    normalizer=sklearn.preprocessing.StandardScaler(),
    encoder=OneHotEncoder()
)

pre.fit_transform(X_train)

Unnamed: 0,0,1,2
0,-0.029349,-0.253839,"(0, 1)\t1.0"
1,-0.035644,-0.299195,"(0, 0)\t1.0"
2,-0.035935,-0.755706,"(0, 9)\t1.0"
3,-0.035308,-0.118857,"(0, 0)\t1.0"
4,-0.035061,0.073133,"(0, 6)\t1.0"
...,...,...,...
1139,-0.033885,-0.091068,"(0, 3)\t1.0"
1140,-0.035943,-0.779623,"(0, 12)\t1.0"
1141,-0.025827,0.616199,"(0, 4)\t1.0"
1142,-0.035858,-0.034194,"(0, 6)\t1.0"
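
The third column in the output above holds text such as `(0, 1)\t1.0`, which suggests that the encoder returned a SciPy sparse matrix and that `pd.DataFrame` stored each sparse row as a single object. That would be consistent with the missing scores for the `OneHotEncoder()` candidates in the searches in this section, although this has not been verified here. A minimal sketch of one possible fix is to convert the encoder output to a dense array before building the DataFrame. The helper name `to_dense_frame` and the toy site keys are hypothetical.

```python
import numpy as np
import pandas as pd
import scipy.sparse
from sklearn.preprocessing import OneHotEncoder

def to_dense_frame(encoded):
    """Return encoder output as a DataFrame with one numeric column per category."""
    if scipy.sparse.issparse(encoded):
        encoded = encoded.toarray()  # expand the sparse matrix to a dense array
    return pd.DataFrame(np.asarray(encoded))

# Toy per-site keys (illustrative only): three distinct sites across four buildings
site_key = pd.DataFrame({'site_key': [1.0, 2.0, 1.0, 3.0]})

encoded = OneHotEncoder().fit_transform(site_key)
dense = to_dense_frame(encoded)
print(dense.shape)
```

In the transformer above, the equivalent change would be to wrap the result of `self.encoder.transform(x_right)` in such a helper.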


In [37]:
svcspecialboi = sklearn.pipeline.Pipeline(steps=[
    ('pre_processing', pre_processer()),
    ('model',  sklearn.svm.SVC())
])

In [39]:
print_param_names(svcspecialboi)

memory
steps
verbose
pre_processing
model
pre_processing__encoder
pre_processing__normalizer
model__C
model__break_ties
model__cache_size
model__class_weight
model__coef0
model__decision_function_shape
model__degree
model__gamma
model__kernel
model__max_iter
model__probability
model__random_state
model__shrinking
model__tol
model__verbose


In [41]:
dist = {
    'pre_processing__encoder': [None, LabelEncoder(), OneHotEncoder()],
    'pre_processing__normalizer': [None, sklearn.preprocessing.StandardScaler()],
    'model__C': scipy.stats.reciprocal(1, 100),
    'model__gamma': scipy.stats.reciprocal(1, 100)
}
cv_svcspecialboi = sklearn.model_selection.RandomizedSearchCV(
    estimator=svcspecialboi,
    param_distributions=dist,
    verbose=1,
    cv=5, n_iter=100, n_jobs=8
).fit(X_train, y_train)
pd.DataFrame(cv_svcspecialboi.cv_results_).sort_values('mean_test_score', ascending=False)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    1.0s
[Parallel(n_jobs=8)]: Done 208 tasks      | elapsed:    5.1s
[Parallel(n_jobs=8)]: Done 485 out of 500 | elapsed:   11.5s remaining:    0.3s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:   11.9s finished


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__C,param_model__gamma,param_pre_processing__encoder,param_pre_processing__normalizer,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
30,0.111322,0.008606,0.014106,0.004931,1.12371,40.112954,,StandardScaler(),"{'model__C': 1.1237097219568681, 'model__gamma...",0.480349,0.454148,0.528384,0.502183,0.460526,0.485118,0.027385,1
69,0.086482,0.008712,0.013711,0.004488,11.830255,1.343733,,StandardScaler(),"{'model__C': 11.830254746558149, 'model__gamma...",0.462882,0.471616,0.506550,0.515284,0.464912,0.484249,0.022138,2
29,0.167166,0.017037,0.012496,0.003901,68.714069,1.022096,LabelEncoder(),StandardScaler(),"{'model__C': 68.714069421818, 'model__gamma': ...",0.471616,0.449782,0.519651,0.502183,0.469298,0.482506,0.025023,3
21,0.223781,0.007145,0.027382,0.004339,1.165174,86.241984,LabelEncoder(),StandardScaler(),"{'model__C': 1.1651737672773308, 'model__gamma...",0.493450,0.445415,0.537118,0.497817,0.438596,0.482479,0.036443,4
82,0.090794,0.012158,0.013439,0.002824,11.318951,1.506199,LabelEncoder(),StandardScaler(),"{'model__C': 11.318950573919613, 'model__gamma...",0.458515,0.454148,0.510917,0.515284,0.469298,0.481633,0.026199,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81,0.070161,0.007246,0.000000,0.000000,7.920748,1.164487,OneHotEncoder(),StandardScaler(),"{'model__C': 7.920747811750177, 'model__gamma'...",,,,,,,,66
85,0.063014,0.001096,0.000000,0.000000,5.932807,2.048302,OneHotEncoder(),,"{'model__C': 5.932807016394855, 'model__gamma'...",,,,,,,,61
88,0.068815,0.003868,0.000000,0.000000,12.282362,1.315329,OneHotEncoder(),StandardScaler(),"{'model__C': 12.282362036139409, 'model__gamma...",,,,,,,,67
89,0.072525,0.007332,0.000000,0.000000,26.550917,5.975423,OneHotEncoder(),StandardScaler(),"{'model__C': 26.550916550798977, 'model__gamma...",,,,,,,,68


Next we will write a custom sklearn estimator. First, the same pre-processing transformer is reused in a pipeline with logistic regression.

In [42]:
lr_specialboi = sklearn.pipeline.Pipeline(steps=[
    ('pre_processing', pre_processer()),
    ('model',  sklearn.linear_model.LogisticRegression())
])
dist = {
    'pre_processing__encoder': [None, LabelEncoder(), OneHotEncoder()],
    'pre_processing__normalizer': [None, sklearn.preprocessing.StandardScaler()],
}
cv_lrspecialboi = sklearn.model_selection.RandomizedSearchCV(
    estimator=lr_specialboi,
    param_distributions=dist,
    verbose=1,
    cv=5, n_iter=100, n_jobs=8
).fit(X_train, y_train)
pd.DataFrame(cv_lrspecialboi.cv_results_).sort_values('mean_test_score', ascending=False)

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=8)]: Done  30 out of  30 | elapsed:    0.7s finished


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_pre_processing__normalizer,param_pre_processing__encoder,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
1,0.135174,0.018131,0.008057,0.001924,StandardScaler(),,{'pre_processing__normalizer': StandardScaler(...,0.388646,0.371179,0.393013,0.427948,0.434211,0.402999,0.024145,1
3,0.179205,0.006676,0.010647,0.000827,StandardScaler(),LabelEncoder(),{'pre_processing__normalizer': StandardScaler(...,0.379913,0.371179,0.41048,0.388646,0.421053,0.394254,0.018714,2
0,0.330704,0.080381,0.005638,0.004663,,,"{'pre_processing__normalizer': None, 'pre_proc...",0.371179,0.379913,0.379913,0.379913,0.377193,0.377622,0.003389,3
2,0.356787,0.072825,0.009221,0.000997,,LabelEncoder(),"{'pre_processing__normalizer': None, 'pre_proc...",0.371179,0.379913,0.379913,0.379913,0.377193,0.377622,0.003389,3
4,0.110362,0.020613,0.0,0.0,,OneHotEncoder(),"{'pre_processing__normalizer': None, 'pre_proc...",,,,,,,,5
5,0.094868,0.007431,0.0,0.0,StandardScaler(),OneHotEncoder(),{'pre_processing__normalizer': StandardScaler(...,,,,,,,,6
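The two OneHotEncoder rows above have empty scores, which is how the search reports candidates whose fits raised an exception. The sketch below reproduces that behaviour on synthetic data (an assumption; it does not use the ASHRAE frames) with a deliberately invalid LogisticRegression setting. It uses GridSearchCV rather than RandomizedSearchCV only to keep the candidate order deterministic.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the building data used above)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# C=-1.0 is deliberately invalid, so every fit for that candidate fails
grid = {'C': [1.0, -1.0]}

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the fit-failure warnings for this demo
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid=grid,
        cv=3,
        error_score=np.nan,  # failed fits are scored as NaN instead of raising
    ).fit(X_demo, y_demo)

results = pd.DataFrame(search.cv_results_)[['param_C', 'mean_test_score']]
print(results)
```

Passing error_score='raise' instead stops at the first failing fit and shows the underlying exception, which may help explain why the OneHotEncoder candidates fail here.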


In [25]:
def check_size(guess_num, length):
    # guess_num indexes the sorted unique labels, so it must be strictly less than length
    if guess_num >= length:
        raise ValueError('The number needs to be less than the number of unique values in the training set ({})'.format(length))
    

class silly_estimator(BaseEstimator, ClassifierMixin):
    def __init__(self, guess_number=0):
        # Index (into the sorted unique training labels) of the class to always predict
        self.guess_number = guess_number
        
    def fit(self, x, y):
        y_ = y
        self.options = np.unique(np.array(y_))
        self.length = len(self.options)
        check_size(self.guess_number, self.length)
        return self
    
    def predict(self, x):
        x_ = x
        length, _ = x_.shape
        guess = self.options[self.guess_number]
        return np.full((length,), guess).tolist()
    
    def score(self, x, y):
        x_ = x
        y_ = y
        y_pred = self.predict(x_)
        return sklearn.metrics.accuracy_score(y_, y_pred)
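As a rough sketch of the same idea written against the usual scikit-learn conventions: the constructor only stores hyperparameters, fitted state lives in attributes with a trailing underscore, and ClassifierMixin supplies an accuracy-based score. The class name ConstantGuessClassifier and the toy labels are hypothetical, and the mixin is listed first because that is the order recent scikit-learn releases recommend.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_is_fitted


class ConstantGuessClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical variant of silly_estimator: always predicts one fixed class."""

    def __init__(self, guess_number=0):
        # Only store hyperparameters here; validation happens in fit
        self.guess_number = guess_number

    def fit(self, X, y):
        # Sorted unique labels seen during training
        self.classes_ = np.unique(np.asarray(y))
        if not 0 <= self.guess_number < len(self.classes_):
            raise ValueError(
                'guess_number must be in [0, {})'.format(len(self.classes_))
            )
        return self

    def predict(self, X):
        check_is_fitted(self, 'classes_')
        n_samples = np.asarray(X).shape[0]
        # Repeat the chosen label once per input row
        return np.full(n_samples, self.classes_[self.guess_number])


# Toy data (assumption): the features are ignored by this estimator
X_demo = np.zeros((4, 2))
y_demo = np.array(['Office', 'Education', 'Education', 'Lodging'])

clf = ConstantGuessClassifier(guess_number=0).fit(X_demo, y_demo)
predictions = clf.predict(X_demo)
accuracy = clf.score(X_demo, y_demo)  # score is inherited from ClassifierMixin
print(predictions, accuracy)
```

Because score is inherited, the class needs no score method of its own unless a metric other than accuracy is wanted.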

In [26]:
sil = silly_estimator(guess_number=0)
sil.fit(X_train, y_train).predict(X_test)

['Education',
 'Education',
 'Education',
 ...

In [27]:
dist = {
    'guess_number': range(12),
}
cv_sil = sklearn.model_selection.RandomizedSearchCV(
    error_score=0,
    estimator=sil,
    param_distributions=dist,
    verbose=1,
    cv=5, n_iter=100, n_jobs=8
).fit(X_train, y_train)
pd.DataFrame(cv_sil.cv_results_).sort_values('mean_test_score', ascending=False)



Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  60 out of  60 | elapsed:    0.0s finished


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_guess_number,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.000996,7e-06,0.000201,0.000403,0,{'guess_number': 0},0.379913,0.379913,0.379913,0.379913,0.377193,0.379369,0.001088,1
5,0.000799,0.0004,0.000399,0.000489,5,{'guess_number': 5},0.19214,0.187773,0.187773,0.187773,0.192982,0.189688,0.002361,2
1,0.000799,0.000399,0.000401,0.000491,1,{'guess_number': 1},0.135371,0.135371,0.135371,0.131004,0.135965,0.134617,0.001821,3
8,0.001201,0.000405,0.0,0.0,8,{'guess_number': 8},0.104803,0.10917,0.10917,0.10917,0.105263,0.107516,0.002032,4
3,0.001002,6e-06,0.000196,0.000392,3,{'guess_number': 3},0.104803,0.104803,0.104803,0.104803,0.105263,0.104895,0.000184,5
6,0.000398,0.000488,0.000598,0.000488,6,{'guess_number': 6},0.017467,0.021834,0.021834,0.017467,0.017544,0.019229,0.002127,6
2,0.001208,0.000406,0.0,0.0,2,{'guess_number': 2},0.017467,0.017467,0.0131,0.017467,0.017544,0.016609,0.001755,7
7,0.001001,7e-06,0.0002,0.000399,7,{'guess_number': 7},0.0131,0.0131,0.017467,0.0131,0.013158,0.013985,0.001741,8
4,0.000806,0.000403,0.0006,0.00049,4,{'guess_number': 4},0.008734,0.008734,0.008734,0.0131,0.013158,0.010492,0.002153,9
11,0.001009,4e-06,0.000404,0.000495,11,{'guess_number': 11},0.0131,0.008734,0.008734,0.008734,0.008772,0.009615,0.001743,10
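The ranking above follows class frequency: a constant guess is correct exactly as often as its class occurs, so each mean_test_score is roughly that class's share of the data. A minimal sketch on made-up labels (an assumption, not the building data) showing that the best constant guess matches DummyClassifier with strategy='most_frequent':

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 5 Education, 3 Office, 2 Lodging
y_demo = np.array(['Education'] * 5 + ['Office'] * 3 + ['Lodging'] * 2)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by a constant guess

# Share of samples per class, which is the accuracy of always guessing that class
classes, counts = np.unique(y_demo, return_counts=True)
frequencies = dict(zip(classes, counts / counts.sum()))

# The built-in baseline always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)

print(frequencies)
print(baseline_accuracy)
```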


In [28]:
class multi_model(BaseEstimator, ClassifierMixin):
    def __init__(self, model1=None, model2=None, model3=None, dominant_estimator=0):
        # Validate the arguments directly; the attributes do not exist until assigned below
        if model1 is None:
            raise ValueError("Model1 is not defined")
        if model2 is None and model3 is not None:
            raise ValueError("Define models in order")
        self.model1 = model1
        self.model2 = model2
        self.model3 = model3
        self.dominant_estimator = dominant_estimator # Right now this does nothing
        
    def fit(self, x, y):
        x_ = x
        y_ = y
        self.model1.fit(x_, y_)
        if self.model2 is not None:
            self.model2.fit(x_, y_)
        if self.model3 is not None:
            self.model3.fit(x_, y_)
        return self
    
    def predict(self, x):
        if self.model1 is None:
            raise ValueError("Model1 is not defined")
        x_ = x
        estimate_model1 = self.model1.predict(x_).reshape(-1,1)
        estimate = estimate_model1
        if self.model2 is not None:
            estimate_model2 = self.model2.predict(x_).reshape(-1,1)
            estimate = np.hstack([estimate_model1, estimate_model2])
        if self.model3 is not None:
            estimate_model3 = self.model3.predict(x_).reshape(-1,1)
            estimate = np.hstack([estimate_model1, estimate_model2, estimate_model3])
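        # Majority vote across the models for each row, flattened to shape (n_samples,)
        y_pred, _ = scipy.stats.mode(estimate, axis=1)
        return np.asarray(y_pred).ravel()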
    
    def score(self, x, y):
        x_ = x
        y_ = y
        y_pred = self.predict(x_)
        return sklearn.metrics.accuracy_score(y_, y_pred)
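For comparison, scikit-learn ships a built-in majority-vote ensemble, VotingClassifier with voting='hard', which does much the same job as the class above. The sketch below is a swapped-in alternative rather than the class defined in this notebook, and it runs on synthetic data (an assumption) so that it is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the building data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Hard voting: each fitted model casts one vote per row, and the majority label wins
voter = VotingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('svc', SVC()),
    ],
    voting='hard',
)
voter.fit(X_demo, y_demo)

y_pred = voter.predict(X_demo)
train_accuracy = voter.score(X_demo, y_demo)
print(y_pred.shape, train_accuracy)
```

The named estimators are also exposed as nested parameters (for example dt__max_depth), so the ensemble can be tuned inside a pipeline in the same way as the custom class.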

In [27]:
mm = multi_model(
    model1=DecisionTreeClassifier(),
    model2=RandomForestClassifier(),
    model3=sklearn.svm.SVC()
)
mm.fit(X_train, y_train).predict(X_train)


In [30]:
mm.score(X_test, y_test)

0.45993031358885017

In [31]:
custom_predictor = sklearn.pipeline.Pipeline(steps=[
    ('pre_processing', pre_processer()),
    ('model',  multi_model(
        model1=DecisionTreeClassifier(),
        model2=RandomForestClassifier(),
        model3=sklearn.svm.SVC()
    ))
])

In [33]:
print_param_names(custom_predictor)

memory
steps
verbose
pre_processing
model
pre_processing__encoder
pre_processing__normalizer
model__dominant_estimator
model__model1__ccp_alpha
model__model1__class_weight
model__model1__criterion
model__model1__max_depth
model__model1__max_features
model__model1__max_leaf_nodes
model__model1__min_impurity_decrease
model__model1__min_impurity_split
model__model1__min_samples_leaf
model__model1__min_samples_split
model__model1__min_weight_fraction_leaf
model__model1__presort
model__model1__random_state
model__model1__splitter
model__model1
model__model2__bootstrap
model__model2__ccp_alpha
model__model2__class_weight
model__model2__criterion
model__model2__max_depth
model__model2__max_features
model__model2__max_leaf_nodes
model__model2__max_samples
model__model2__min_impurity_decrease
model__model2__min_impurity_split
model__model2__min_samples_leaf
model__model2__min_samples_split
model__model2__min_weight_fraction_leaf
model__model2__n_estimators
model__model2__n_jobs
model__model2__o

In [34]:
dist = {
    'pre_processing__encoder': [None, LabelEncoder(), OrdinalEncoder()],
    'pre_processing__normalizer': [None, sklearn.preprocessing.StandardScaler()],
    
    'model__model1__max_depth': scipy.stats.reciprocal(1, 100),
    
    'model__model2__max_depth': scipy.stats.reciprocal(1, 100),
    'model__model2__max_features': ['sqrt', 'log2'],
    'model__model2__bootstrap': [True, False],
    
    'model__model3__C': scipy.stats.reciprocal(1, 100),
    'model__model3__gamma': scipy.stats.reciprocal(1, 100)
}
cv_custom_predictor = sklearn.model_selection.RandomizedSearchCV(
    estimator=custom_predictor,
    param_distributions=dist,
    verbose=1,
    cv=3, n_iter=500, n_jobs=8
).fit(X_train, y_train)
pd.DataFrame(cv_custom_predictor.cv_results_).sort_values('mean_test_score', ascending=False)

Fitting 3 folds for each of 500 candidates, totalling 1500 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    3.2s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:   15.4s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:   36.9s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:  1.1min
[Parallel(n_jobs=8)]: Done 1234 tasks      | elapsed:  1.7min
[Parallel(n_jobs=8)]: Done 1500 out of 1500 | elapsed:  2.0min finished


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__model1__max_depth,param_model__model2__bootstrap,param_model__model2__max_depth,param_model__model2__max_features,param_model__model3__C,param_model__model3__gamma,param_pre_processing__encoder,param_pre_processing__normalizer,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
495,0.378752,0.013075,0.059516,0.003540,883.950316,True,7.332897,sqrt,25.161478,3.257314,LabelEncoder(),StandardScaler(),{'model__model1__max_depth': 883.9503158507903...,0.460733,0.506562,0.485564,0.484286,0.018731,1
238,0.459291,0.011407,0.077018,0.015559,801.89405,True,744.245563,sqrt,20.190889,2.14143,LabelEncoder(),StandardScaler(),{'model__model1__max_depth': 801.8940495546717...,0.437173,0.506562,0.501312,0.481682,0.031546,2
390,0.430612,0.010230,0.063014,0.002447,525.033648,True,670.794374,sqrt,29.028823,1.993288,OrdinalEncoder(),StandardScaler(),{'model__model1__max_depth': 525.0336475987777...,0.452880,0.509186,0.482940,0.481669,0.023005,3
349,0.460436,0.006650,0.063347,0.003401,740.324456,True,40.945485,sqrt,1.18195,11.971238,OrdinalEncoder(),StandardScaler(),{'model__model1__max_depth': 740.3244556471642...,0.447644,0.511811,0.477690,0.479048,0.026214,4
98,0.486775,0.012661,0.074351,0.008997,253.484299,True,496.780687,log2,2.389883,4.453322,LabelEncoder(),StandardScaler(),{'model__model1__max_depth': 253.4842992400476...,0.447644,0.506562,0.482940,0.479048,0.024210,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
445,0.640858,0.012913,0.077354,0.002624,160.770952,False,263.537749,sqrt,91.529115,5.295329,,,"{'model__model1__max_depth': 160.770952301703,...",0.405759,0.440945,0.406824,0.417843,0.016341,496
364,0.665408,0.032177,0.085686,0.004110,304.744195,False,62.409059,sqrt,46.87203,35.530147,,,{'model__model1__max_depth': 304.7441953713288...,0.390052,0.433071,0.419948,0.414357,0.018002,497
303,0.686267,0.014546,0.086686,0.006184,224.647883,False,901.695884,sqrt,1.586559,34.021064,,,{'model__model1__max_depth': 224.6478828287265...,0.392670,0.435696,0.414698,0.414355,0.017567,498
308,0.626474,0.013276,0.094356,0.009745,592.778932,False,714.089032,sqrt,3.45489,10.205316,,,{'model__model1__max_depth': 592.7789318054223...,0.397906,0.422572,0.419948,0.413475,0.011061,499
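The sampled max_depth values in the table are floats because scipy.stats.reciprocal is a continuous distribution. Tree depth is conceptually an integer, and newer scikit-learn releases may reject non-integer values, so one option is to draw from scipy.stats.randint instead. The sketch below only contrasts the two samplers; note that randint is uniform over the integers rather than log-uniform, so it is a substitute and not an exact equivalent.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)

# Continuous log-uniform draws between 1 and 100 (floats)
float_depths = stats.reciprocal(1, 100).rvs(size=5, random_state=rng)

# Uniform integer draws from 1 to 100 inclusive (the upper bound of randint is exclusive)
int_depths = stats.randint(1, 101).rvs(size=5, random_state=rng)

print(float_depths)
print(int_depths)
```

In a search space this would be written as, for example, 'model__model1__max_depth': scipy.stats.randint(1, 101).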


In [35]:
dt_best = cv_dt.best_estimator_.fit(X_train, y_train)
rf_best = cv_rf.best_estimator_.fit(X_train, y_train)
svcs_best = cv_svcs.best_estimator_.fit(X_train, y_train)
custom_predictor_best = cv_custom_predictor.best_estimator_.fit(X_train, y_train)

print('The results are:')
models = [dum, dt_best, rf_best, svcs_best, custom_predictor_best]
names = ['dum', 'dt', 'rf', 'svcs', 'custmm']
for model, name in zip(models, names):
    print('{}:\ttrain:{:0.1%}\t test:{:0.1%}'.format(
        name, model.score(X_train, y_train), model.score(X_test, y_test)
    ))



The results are:
dum:	train:37.9%	 test:40.1%
dt:	train:51.7%	 test:47.0%
rf:	train:57.3%	 test:53.0%
svcs:	train:53.1%	 test:49.8%
custmm:	train:70.9%	 test:51.9%




In [36]:
sklearn.set_config(display='diagram')
cv_custom_predictor.best_estimator_