## Using SKL_search.py

In this notebook I demonstrate how to use the classes from SKL_search.py to build an efficient, flexible pipeline for feature construction and model building with scikit-learn. Full credit and appreciation to Panagiotis Katsaroumpas (https://github.com/codiply) for the core class in this code, EstimatorSelectionHelper, which he covers in his [blog post](http://www.codiply.com/blog/hyperparameter-grid-search-across-multiple-models-in-scikit-learn/). The example data is a binary classification problem with 20 pre-processed features from the [Predicting Poverty](https://www.drivendata.org/competitions/50/worldbank-poverty-prediction/page/97/) competition. The data and models themselves are unimportant; the point of the code is the process it automates.

In [2]:
#standard imports
import pandas as pd
import numpy as np

Import some models to search over.

In [78]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier

In [79]:
X = pd.read_csv('train_data.csv', index_col = 0)

In [80]:
y = pd.read_csv('train_output.csv', index_col = 0)

In [81]:
from SKL_search import *

Use a pipeline for consistency and to make feature construction easily tunable. Preprocessing or other additional steps can be added to the pipeline here for a more end-to-end process. In our case the data is already 'clean'. Typical pre-processing pipelines would deal with categorical data and scale numeric data.
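As a rough illustration of the kind of preprocessing step described above, here is a minimal sketch using scikit-learn's `ColumnTransformer`. The toy frame and its column names (`income`, `rooms`, `region`) are hypothetical and are not part of the competition data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; the column names are illustrative only.
toy = pd.DataFrame({
    "income": [10.0, 20.0, 30.0, 40.0],
    "rooms": [1.0, 2.0, 3.0, 4.0],
    "region": ["north", "south", "north", "east"],
})

# Scale the numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["income", "rooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

# Further steps (feature selection, a model) could be appended to this list.
prep_pipeline = Pipeline([("preprocess", preprocess)])
toy_prepared = prep_pipeline.fit_transform(toy)
print(toy_prepared.shape)
```

Note that `chi2`, used by the kbest step in this notebook, requires non-negative inputs, so `MinMaxScaler` may be the safer choice if scaling runs before that step.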

In [82]:
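# Explicit imports, in case the star import above does not re-export these names
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import chi2

num_pca = 3           # k passed to the PCA-based selector
num_kbest = 3         # k passed to the chi-squared k-best selector
thresh_fromMod = 0.1  # threshold passed to the model-based selector
full_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('pca', PCAFeatureSelector(k=num_pca)),
        ('kbest', KBestFeatureSelector(k=num_kbest, scorefunc=chi2)),
        ('fromMod', FromModelFeatureSelector(
            model=RandomForestClassifier(random_state=4),
            threshold=thresh_fromMod)),
    ])),
])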
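The selector classes above come from SKL_search.py, which is not shown in this notebook. Assuming they behave like thin wrappers around scikit-learn's own `PCA`, `SelectKBest` and `SelectFromModel` (an assumption, not something confirmed here), a roughly equivalent union built only from library classes might look like the sketch below. The synthetic data stands in for the competition features.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data: 20 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0)
# chi2 needs non-negative inputs, so rescale everything to [0, 1].
X_demo = MinMaxScaler().fit_transform(X_demo)

builtin_union = FeatureUnion([
    ("pca", PCA(n_components=3)),                  # 3 principal components
    ("kbest", SelectKBest(score_func=chi2, k=3)),  # 3 best features by chi2
    ("fromMod", SelectFromModel(                   # features with importance >= 0.1
        RandomForestClassifier(n_estimators=50, random_state=4),
        threshold=0.1)),
])

# The outputs of the three branches are concatenated column-wise.
X_union = builtin_union.fit_transform(X_demo, y_demo)
print(X_union.shape)
```

The number of output columns is 3 + 3 plus however many features clear the importance threshold, so it depends on the data.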

If you want to create a training and test set, fit the transformation on the training set, but do *not* refit it on the test set (this would be 'cheating'). Apply the transformation fitted on the training set to the test set, which is easily done with the pipeline's fit and transform methods. However, this is an aside, since the EstimatorSelectionHelper class can use cross-validation, which is preferable when possible.

In [83]:
from sklearn.model_selection import train_test_split
X_tr, X_test, Y_tr, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 5)

In [84]:
pipe_fit = full_pipeline.fit(X_tr,np.array(Y_tr).reshape(-1))

In [85]:
X_tr_select = pipe_fit.transform(X_tr)

In [86]:
X_test_select = pipe_fit.transform(X_test)

For this example we will use the full data and allow the class to use cross validation for training. 

In [87]:
X_select_full = full_pipeline.fit_transform(X,np.array(y).reshape(-1))
X_select_full.shape

(29913, 8)
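Because the feature pipeline here is fitted on all rows before cross-validation starts, the selection step has already seen the rows that later serve as validation folds. One common way to avoid this is to cross-validate the feature step and the classifier together, so the selector is refitted inside each training fold. The sketch below uses `SelectKBest` with `f_classif` and synthetic data as stand-ins; it is not the notebook's own pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0)

# Feature selection and model live in one estimator, so each CV split
# fits the selector on the training fold only.
leak_free = Pipeline([
    ("kbest", SelectKBest(score_func=f_classif, k=5)),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

scores = cross_val_score(leak_free, X_demo, y_demo, cv=3, scoring="accuracy")
print(scores.mean())
```

The same combined pipeline can be passed to `GridSearchCV`, with step-prefixed parameter names such as `kbest__k` or `clf__max_depth`.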

In [88]:
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier, 
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.svm import SVC

models1 = { 
    'RandomForestClassifier': RandomForestClassifier(),
    'AdaBoostClassifier': AdaBoostClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'SVC': SVC(probability = True),
}

params1 = {
    'RandomForestClassifier': {'n_estimators': [50, 100], 'max_depth': [5, 10],
                               'class_weight': ['balanced']},
    'AdaBoostClassifier': {'base_estimator': [RandomForestClassifier(min_samples_leaf=10)],
                           'n_estimators': [10, 20], 'learning_rate': [1, 2]},
    'GradientBoostingClassifier': {'n_estimators': [100, 200], 'learning_rate': [0.1, 0.5]},
    'SVC': [
        # 'degree' only affects the 'poly' kernel, so here it doubles the
        # number of linear candidates without changing the fitted model
        {'kernel': ['linear'], 'C': [2.5, 3], 'degree': [2, 3]},
        {'kernel': ['rbf'], 'C': [10, 20, 50]}],
}

In [74]:
import time
start = time.time()
helper1 = EstimatorSelectionHelper(models1, params1)
helper1.fit(X_select_full, np.array(y).reshape(-1), scoring='accuracy', n_jobs=-1, cv = 2)
time_taken = time.time()-start
print('Time taken for estimator selection search: ', np.round(time_taken,1), ' seconds')

Running GridSearchCV for RandomForestClassifier.
Fitting 2 folds for each of 4 candidates, totalling 8 fits


[Parallel(n_jobs=-1)]: Done   3 out of   8 | elapsed:    2.8s remaining:    4.7s
[Parallel(n_jobs=-1)]: Done   5 out of   8 | elapsed:    3.0s remaining:    1.7s
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    4.6s finished


Running GridSearchCV for AdaBoostClassifier.
Fitting 2 folds for each of 4 candidates, totalling 8 fits


[Parallel(n_jobs=-1)]: Done   3 out of   8 | elapsed:    6.5s remaining:   10.9s
[Parallel(n_jobs=-1)]: Done   5 out of   8 | elapsed:    7.4s remaining:    4.4s
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:   11.7s finished


Running GridSearchCV for GradientBoostingClassifier.
Fitting 2 folds for each of 4 candidates, totalling 8 fits


[Parallel(n_jobs=-1)]: Done   3 out of   8 | elapsed:    3.2s remaining:    5.4s
[Parallel(n_jobs=-1)]: Done   5 out of   8 | elapsed:    3.6s remaining:    2.1s
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    4.7s finished


Running GridSearchCV for SVC.
Fitting 2 folds for each of 7 candidates, totalling 14 fits


[Parallel(n_jobs=-1)]: Done  10 out of  14 | elapsed:  1.9min remaining:   44.4s


Time taken for estimator selection search:  368.762


[Parallel(n_jobs=-1)]: Done  14 out of  14 | elapsed:  5.8min finished
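The log above shows one `GridSearchCV` run per entry in the models dictionary. As a minimal sketch of that idea (not the actual EstimatorSelectionHelper code; the function name `search_all` and the demo dictionaries are hypothetical), the loop could be written like this:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def search_all(models, params, X, y, cv=2, scoring="accuracy"):
    """Run one GridSearchCV per model and stack the per-candidate results."""
    frames = []
    for name, model in models.items():
        search = GridSearchCV(model, params[name], cv=cv, scoring=scoring)
        search.fit(X, y)
        frame = pd.DataFrame(search.cv_results_)
        frame.insert(0, "estimator", name)  # remember which model each row belongs to
        frames.append(frame)
    # Parameter columns that a model does not use are filled with NaN.
    return pd.concat(frames, ignore_index=True)


# Small synthetic problem so the sketch runs quickly.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
}
demo_params = {
    "tree": {"max_depth": [2, 4]},
    "forest": {"n_estimators": [10, 20]},
}
results = search_all(demo_models, demo_params, X_demo, y_demo)
print(results[["estimator", "params", "mean_test_score"]])
```

Keeping the models and their grids in two dictionaries with matching keys, as the notebook does, makes it easy to add or drop a model without touching the search loop.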


Use the score summary method to display a useful dataframe summarising your findings. Update your features/parameters and keep firing away to find the perfect model!

In [77]:
helper1.score_summary(sort_by='min_score')

| | estimator | min_score | mean_score | max_score | std_score | C | base_estimator | class_weight | degree | kernel | learning_rate | max_depth | n_estimators |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | SVC | 0.764458 | 0.764484 | 0.764509 | 2.55569e-05 | 50.0 | | | | rbf | | | |
| 17 | SVC | 0.764458 | 0.764484 | 0.764509 | 2.55569e-05 | 20.0 | | | | rbf | | | |
| 16 | SVC | 0.764458 | 0.764484 | 0.764509 | 2.55569e-05 | 10.0 | | | | rbf | | | |
| 15 | SVC | 0.764458 | 0.764484 | 0.764509 | 2.55569e-05 | 3.0 | | | 3.0 | linear | | | |
| 14 | SVC | 0.764458 | 0.764484 | 0.764509 | 2.55569e-05 | 3.0 | | | 2.0 | linear | | | |
| 13 | SVC | 0.764458 | 0.764484 | 0.764509 | 2.55569e-05 | 2.5 | | | 3.0 | linear | | | |
| 12 | SVC | 0.764458 | 0.764484 | 0.764509 | 2.55569e-05 | 2.5 | | | 2.0 | linear | | | |
| 8 | GradientBoostingClassifier | 0.759377 | 0.760138 | 0.760899 | 0.000760878 | | | | | | 0.1 | | 100.0 |
| 9 | GradientBoostingClassifier | 0.755031 | 0.755056 | 0.755082 | 2.52417e-05 | | | | | | 0.1 | | 200.0 |
| 5 | AdaBoostClassifier | 0.748412 | 0.751747 | 0.755082 | 0.00333473 | | RandomForestClassifier(bootstrap=True, class_w... | | | | 1.0 | | 20.0 |
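A summary with these min/mean/max/std columns can be derived from the per-fold scores that `GridSearchCV` stores in `cv_results_`. The sketch below shows one way to do that for a single search on synthetic data; it is an illustration of the idea, not the `score_summary` implementation itself.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [2, 4, 6]}, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

rows = []
for i, candidate in enumerate(search.cv_results_["params"]):
    # Collect this candidate's test score from every CV split.
    fold_scores = np.array([
        search.cv_results_[f"split{k}_test_score"][i]
        for k in range(search.n_splits_)
    ])
    rows.append({
        "estimator": "DecisionTreeClassifier",
        "min_score": fold_scores.min(),
        "mean_score": fold_scores.mean(),
        "max_score": fold_scores.max(),
        "std_score": fold_scores.std(),
        **candidate,  # one column per tuned hyperparameter
    })

# Highest worst-fold score first, matching sort_by='min_score' above.
summary = pd.DataFrame(rows).sort_values("min_score", ascending=False)
print(summary)
```

Sorting by the minimum fold score favours candidates that perform consistently across folds rather than those with a single strong split.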
