## Advanced tuning of parameters

In this tutorial, we apply the skills from the previous tutorials to build a classifier using `Pipeline` and `FeatureUnion`.

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pickle
import time

from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier

### Data
We will use a diabetes dataset to build a classifier that predicts whether a person has diabetes from information about their health. The dataset can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing).

In [25]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [3]:
data = pd.read_csv('pima-indians-diabetes.csv', header=0, sep=';')
data

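| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |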


### Task

Build a classifier that predicts the target variable `class` from the remaining attributes. The model should be fitted with a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters using `Pipeline` and `GridSearchCV`.

In [19]:
# Features: every column except the target
X = data.drop('class', axis=1)
# Target: the `class` column
y = data['class']

In [12]:
pca = PCA()
SKB = SelectKBest()

In [13]:
combined_features = FeatureUnion([("pca", pca), ("univ_select", SKB)])
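
To make the role of `FeatureUnion` concrete, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` and the name `union_demo` are illustrative and are not part of the tutorial's dataset). It assumes 2 PCA components and 3 selected features, and shows that the union applies both transformers to the same input and concatenates their outputs column-wise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in for the diabetes data (assumption): 100 rows, 8 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = rng.integers(0, 2, size=100)

union_demo = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=3)),
])

# Both transformers see the same input; their outputs are stacked side by side:
# 2 PCA components + 3 selected original features = 5 columns.
X_combined = union_demo.fit_transform(X_demo, y_demo)
print(X_combined.shape)  # (100, 5)
```

The classifier that follows the union in a pipeline therefore receives both the projected components and the selected raw features.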

In [14]:
RF = RandomForestClassifier()

### GridSearch

In [15]:
pipeline = Pipeline([("features", combined_features), ("RF", RF)])

In [21]:
param_grid = {
    "features__pca__n_components": [1, 2, 3],
    "features__univ_select__k": [1, 2, 3],
    "RF__n_estimators": [100, 200],
    "RF__max_depth": [8, 9, 10],
}

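The keys of the grid follow the `step__parameter` naming convention, with one `__` per nesting level (pipeline step, then transformer inside the union, then its parameter). The sketch below rebuilds an equivalent pipeline under the illustrative names `demo_pipeline` and `demo_grid`, checks every key against `get_params()`, and counts the candidates with `ParameterGrid`.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline

# Same structure as the tutorial's pipeline; nothing is fitted here.
demo_pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest())])),
    ("RF", RandomForestClassifier()),
])

demo_grid = {
    "features__pca__n_components": [1, 2, 3],
    "features__univ_select__k": [1, 2, 3],
    "RF__n_estimators": [100, 200],
    "RF__max_depth": [8, 9, 10],
}

# Every grid key must be a parameter name the pipeline exposes.
valid_names = set(demo_pipeline.get_params().keys())
unknown = [name for name in demo_grid if name not in valid_names]
print(unknown)  # []

# Number of parameter combinations: 3 * 3 * 2 * 3.
n_candidates = len(ParameterGrid(demo_grid))
print(n_candidates)  # 54
```

With 5-fold cross-validation, 54 candidates correspond to 270 individual fits.
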
In [22]:
grid_search = GridSearchCV(pipeline, param_grid, verbose=10, refit=True)
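
The search in this notebook is fitted on all of `X` and `y`, so its scores are cross-validated estimates and no untouched test set remains. One possible variation, sketched here on synthetic data from `make_classification`, is to hold out a test split with `train_test_split` and to fix the folds and random seeds for reproducibility. The names `demo_pipeline`, `small_grid`, and `demo_search`, the reduced grid, and the forest size are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic stand-in for the diabetes data (assumption): 300 rows, 8 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Keep 25% of the rows aside; the grid search never sees them.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

demo_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("pca", PCA(n_components=2)),
        ("univ_select", SelectKBest(k=3)),
    ])),
    ("RF", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# A deliberately small grid so the sketch runs quickly.
small_grid = {
    "features__pca__n_components": [1, 2],
    "RF__max_depth": [8, 10],
}

demo_search = GridSearchCV(
    demo_pipeline,
    small_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
demo_search.fit(X_train, y_train)

# score() uses the refitted best pipeline on the held-out rows.
test_accuracy = demo_search.score(X_test, y_test)
print(demo_search.best_params_)
print(test_accuracy)
```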

In [23]:
grid_search.fit(X, y)

Fitting 5 folds for each of 54 candidates, totalling 270 fits
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=1 
[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=1, score=0.714, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=1 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s


[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=1, score=0.656, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=1 
[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=1, score=0.721, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=1 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.4s remaining:    0.0s


[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=1, score=0.745, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=1 
[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=1, score=0.680, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=2 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.6s remaining:    0.0s


[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=2, score=0.753, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=2 
[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=2, score=0.701, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=2 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.9s remaining:    0.0s


[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=2, score=0.753, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=2 
[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=2, score=0.797, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=2 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    1.1s remaining:    0.0s


[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=2, score=0.719, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3 
[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3, score=0.786, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3 
[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3, score=0.714, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3 
[CV]  RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3, score=0.779, total=   0.1s
[CV] RF__max_depth=8, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3 
[...]

[CV]  RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1, score=0.727, total=   0.2s
[CV] RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1 
[CV]  RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1, score=0.675, total=   0.2s
[CV] RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1 
[CV]  RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1, score=0.727, total=   0.2s
[CV] RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1 
[CV]  RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1, score=0.732, total=   0.2s
[CV] RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1 
[...]

[CV]  RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2, score=0.688, total=   0.3s
[CV] RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2 
[CV]  RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2, score=0.760, total=   0.3s
[CV] RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2 
[CV]  RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2, score=0.765, total=   0.3s
[CV] RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2 
[CV]  RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2, score=0.732, total=   0.3s
[CV] RF__max_depth=8, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=3 
[...]

[CV]  RF__max_depth=9, RF__n_estimators=100, features__pca__n_components=2, features__univ_select__k=3, score=0.779, total=   0.1s
[CV] RF__max_depth=9, RF__n_estimators=100, features__pca__n_components=2, features__univ_select__k=3 
[CV]  RF__max_depth=9, RF__n_estimators=100, features__pca__n_components=2, features__univ_select__k=3, score=0.804, total=   0.1s
[CV] RF__max_depth=9, RF__n_estimators=100, features__pca__n_components=2, features__univ_select__k=3 
[CV]  RF__max_depth=9, RF__n_estimators=100, features__pca__n_components=2, features__univ_select__k=3, score=0.725, total=   0.1s
[CV] RF__max_depth=9, RF__n_estimators=100, features__pca__n_components=3, features__univ_select__k=1 
[CV]  RF__max_depth=9, RF__n_estimators=100, features__pca__n_components=3, features__univ_select__k=1, score=0.740, total=   0.1s
[CV] RF__max_depth=9, RF__n_estimators=100, features__pca__n_components=3, features__univ_select__k=1 
[...]

[CV]  RF__max_depth=9, RF__n_estimators=200, features__pca__n_components=2, features__univ_select__k=1, score=0.758, total=   0.3s
[CV] RF__max_depth=9, RF__n_estimators=200, features__pca__n_components=2, features__univ_select__k=1 
[CV]  RF__max_depth=9, RF__n_estimators=200, features__pca__n_components=2, features__univ_select__k=1, score=0.706, total=   0.3s
[CV] RF__max_depth=9, RF__n_estimators=200, features__pca__n_components=2, features__univ_select__k=2 
[CV]  RF__max_depth=9, RF__n_estimators=200, features__pca__n_components=2, features__univ_select__k=2, score=0.740, total=   0.3s
[CV] RF__max_depth=9, RF__n_estimators=200, features__pca__n_components=2, features__univ_select__k=2 
[CV]  RF__max_depth=9, RF__n_estimators=200, features__pca__n_components=2, features__univ_select__k=2, score=0.708, total=   0.3s
[CV] RF__max_depth=9, RF__n_estimators=200, features__pca__n_components=2, features__univ_select__k=2 
[...]

[CV]  RF__max_depth=10, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=2, score=0.699, total=   0.1s
[CV] RF__max_depth=10, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3 
[CV]  RF__max_depth=10, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3, score=0.766, total=   0.1s
[CV] RF__max_depth=10, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3 
[CV]  RF__max_depth=10, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3, score=0.727, total=   0.2s
[CV] RF__max_depth=10, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3 
[CV]  RF__max_depth=10, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3, score=0.753, total=   0.1s
[CV] RF__max_depth=10, RF__n_estimators=100, features__pca__n_components=1, features__univ_select__k=3 
[...]

[CV]  RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1, score=0.714, total=   0.3s
[CV] RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1 
[CV]  RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1, score=0.662, total=   0.3s
[CV] RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1 
[CV]  RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1, score=0.727, total=   0.3s
[CV] RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1 
[CV]  RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1, score=0.699, total=   0.3s
[CV] RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=1, features__univ_select__k=1 
[...]

[CV]  RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2, score=0.753, total=   0.3s
[CV] RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2 
[CV]  RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2, score=0.695, total=   0.3s
[CV] RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2 
[CV]  RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2, score=0.766, total=   0.3s
[CV] RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2 
[CV]  RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2, score=0.758, total=   0.3s
[CV] RF__max_depth=10, RF__n_estimators=200, features__pca__n_components=3, features__univ_select__k=2 
[...]

[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:   55.6s finished


GridSearchCV(estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('pca',
                                                                        PCA()),
                                                                       ('univ_select',
                                                                        SelectKBest())])),
                                       ('RF', RandomForestClassifier())]),
             param_grid={'RF__max_depth': [8, 9, 10],
                         'RF__n_estimators': [100, 200],
                         'features__pca__n_components': [1, 2, 3],
                         'features__univ_select__k': [1, 2, 3]},
             verbose=10)

In [24]:
grid_search.best_params_

{'RF__max_depth': 9,
 'RF__n_estimators': 200,
 'features__pca__n_components': 3,
 'features__univ_select__k': 3}
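
`best_params_` alone does not show how close the other candidates came. Both `features__pca__n_components` and `features__univ_select__k` were selected at the largest value in the grid, which may be a hint that a wider grid is worth trying. A fitted search also exposes `cv_results_`, `best_score_`, and `best_estimator_`. The sketch below shows one way to inspect them, using a small synthetic search under the illustrative name `demo_search` rather than the tutorial's `grid_search`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a tiny search (assumptions for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    {"max_depth": [2, 4, 8]},
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, ordered from best to worst mean CV score.
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)

# Mean cross-validated score of the best candidate.
print(demo_search.best_score_)

# With refit=True (the default), the best pipeline is refitted on all the
# data passed to fit() and can be used directly for prediction.
best_model = demo_search.best_estimator_
predictions = best_model.predict(X_demo[:5])
print(predictions)
```

Comparing `mean_test_score` with `std_test_score` across rows helps judge whether the top candidates differ by more than fold-to-fold noise.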