## Advanced tuning of parameters

In this tutorial, we will apply the skills from previous tutorials and build a classifier using the `Pipeline` and `FeatureUnion` classes from sklearn.

In [1]:
# IMPORT PACKAGES
import numpy as np
import pandas as pd

### Data

We will build a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [3]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [6]:
df = pd.read_csv('pima-indians-diabetes.csv', sep=';')

In [7]:
df.head()

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


### Task

Build a classifier that predicts the target variable `class` using the rest of the attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters by running a parameter search (for example `GridSearchCV` or `RandomizedSearchCV`) over the `Pipeline`.

> #### Note
> **This exercise focuses on implementing the pipeline. Since our dataset has only 9 columns, PCA is probably not the best technique for data preparation from a methodological point of view.**

In [18]:
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier

In [12]:
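# Note: the 'Unnamed: 0' index column shown in df.head() stays in X here;
# drop it as well if it should not be used as a feature.
X = df.drop('class', axis=1)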
y = df['class']

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27, stratify=y)

In [14]:
feature_union = FeatureUnion([('pca', PCA()), 
                              ('select_best', SelectKBest())])
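
To make the role of `FeatureUnion` concrete, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are stand-ins invented for illustration, not the diabetes dataset). It shows that the union fits each transformer on the same input and concatenates their outputs column-wise, so the classifier sees the PCA components and the selected original features side by side.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in data (an assumption, not the diabetes dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# 3 principal components + 2 selected original features.
demo_union = FeatureUnion([('pca', PCA(n_components=3)),
                           ('select_best', SelectKBest(k=2))])

# y is passed through because SelectKBest is a supervised selector.
X_combined = demo_union.fit_transform(X_demo, y_demo)
print(X_combined.shape)  # (100, 5): 3 PCA columns followed by 2 selected columns
```

The output width is simply `n_components + k`, which is why both values are tuned later in the search.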

In [15]:
pipe = Pipeline(steps=[('features', feature_union),
                       ('classifier', RandomForestClassifier())])
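
Parameters of nested steps are addressed as `step__substep__parameter`. As a rough sketch, the valid names can be listed with `get_params()` before writing the search space (the names `demo_pipe`, `tunable` and `nested` are introduced here only for illustration).

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline

# Same structure as the pipeline in this tutorial, rebuilt so the sketch is self-contained.
demo_pipe = Pipeline(steps=[
    ('features', FeatureUnion([('pca', PCA()),
                               ('select_best', SelectKBest())])),
    ('classifier', RandomForestClassifier()),
])

# Every key containing '__' refers to something inside a named step.
tunable = sorted(name for name in demo_pipe.get_params() if '__' in name)

# Parameters that live inside the FeatureUnion step.
nested = [name for name in tunable if name.startswith('features__')]
print(nested)
```

Any key printed here can be used as a key in the parameter dictionary passed to the search.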

In [16]:
params = {'classifier__bootstrap': [True, False],
          'classifier__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
          'classifier__n_estimators': [250, 500, 750, 1000], 
          'classifier__min_samples_split': [2, 5], 
          'features__pca__n_components': [3, 5],
          'features__select_best__k': [3, 6]}
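
A quick way to see why a randomized search can be attractive is to count how many combinations an exhaustive grid would contain. The sketch below uses a reduced, illustrative dictionary (`demo_params` is an assumption, not the exact search space above) and `ParameterGrid` to count them.

```python
from sklearn.model_selection import ParameterGrid

# Reduced illustrative search space (an assumption for this sketch).
demo_params = {'classifier__bootstrap': [True, False],
               'classifier__min_samples_split': [2, 5],
               'features__pca__n_components': [3, 5],
               'features__select_best__k': [3, 6]}

# An exhaustive grid search evaluates every combination in every CV fold.
n_combinations = len(ParameterGrid(demo_params))
print(n_combinations)       # 16
print(n_combinations * 5)   # 80 fits with 5-fold cross-validation
```

`RandomizedSearchCV` instead samples `n_iter` settings (10 by default), so the number of fits stays fixed as more values are added to the lists.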

In [20]:
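# Randomized search: samples n_iter (default 10) settings from params
# instead of trying every combination, scoring each with 5-fold CV.
grid = RandomizedSearchCV(pipe, param_distributions=params, cv=5)
grid.fit(X_train, y_train)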

Traceback (most recent call last):
  File "C:\Users\engwb\anaconda3\envs\lhl-env\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\engwb\anaconda3\envs\lhl-env\lib\site-packages\sklearn\pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\engwb\anaconda3\envs\lhl-env\lib\site-packages\sklearn\pipeline.py", line 307, in _fit
    **fit_params_steps[name])
  File "C:\Users\engwb\anaconda3\envs\lhl-env\lib\site-packages\joblib\memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\engwb\anaconda3\envs\lhl-env\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\engwb\anaconda3\envs\lhl-env\lib\site-packages\sklearn\pipeline.py", line 980, in fit_transform
    results = self._parallel_func(X, y, fit_params, _fit_transform_one

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('features',
                                              FeatureUnion(transformer_list=[('pca',
                                                                              PCA()),
                                                                             ('select_best',
                                                                              SelectKBest())])),
                                             ('classifier',
                                              RandomForestClassifier())]),
                   param_distributions={'classifier__bootstrap': [True, False],
                                        'classifier__max_depth': [10, 20, 30,
                                                                  40, 50, 60,
                                                                  70, 80, 90,
                                                                  100, None],
                             

In [21]:
best_model = grid.best_estimator_
best_hyperparams = grid.best_params_
best_acc = grid.score(X_test, y_test)
print(f'Best test set accuracy: {best_acc}\nAchieved with hyperparameters: {best_hyperparams}')

Best test set accuracy: 0.7662337662337663
Achieved with hyperparameters: {'features__select_best__k': 3, 'features__pca__n_components': 5, 'classifier__n_estimators': 1000, 'classifier__min_samples_split': 2, 'classifier__max_depth': 30, 'classifier__bootstrap': True}
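
With the default `refit=True`, the search object refits the best configuration on the whole training set, so `best_estimator_` can be used directly for prediction. The following is a small sketch on synthetic data (`make_classification` output and the tiny search space are assumptions chosen to keep it fast) that checks this behaviour.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions={'max_depth': [2, 4, None],
                         'min_samples_split': [2, 5]},
    n_iter=3, cv=3, random_state=0)
search.fit(X_tr, y_tr)

# best_estimator_ is already fitted on all of X_tr; search.score delegates to it.
refit_model = search.best_estimator_
same_score = refit_model.score(X_te, y_te) == search.score(X_te, y_te)
print(same_score)  # True
```

Rebuilding the pipeline by hand with the best parameters also works, but it trains a new model rather than reusing the one the search already fitted.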


In [22]:
feature_union = FeatureUnion([('pca', PCA(n_components=5)),
                              ('select_best', SelectKBest(k=3))])
pipe = Pipeline(steps=[('features', feature_union),
                       ('classifier', RandomForestClassifier(n_estimators=1000,
                                                             min_samples_split=2,
                                                             max_depth=30,
                                                             bootstrap=True))])

In [23]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('pca', PCA(n_components=5)),
                                                ('select_best',
                                                 SelectKBest(k=3))])),
                ('classifier',
                 RandomForestClassifier(max_depth=30, n_estimators=1000))])

In [24]:
pipe.score(X_test, y_test)

0.7597402597402597
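
This score differs slightly from the one reported by the search even though the hyperparameters are the same. A likely reason is that `RandomForestClassifier` is randomized and no `random_state` was set, so each fit grows different trees. The sketch below, on synthetic data (an assumption, with a hypothetical helper `fit_and_score`), shows that fixing the seed makes repeated fits give the same score.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)


def fit_and_score(seed):
    """Hypothetical helper: fit a fresh pipeline with a fixed forest seed."""
    union = FeatureUnion([('pca', PCA(n_components=5)),
                          ('select_best', SelectKBest(k=3))])
    model = Pipeline(steps=[
        ('features', union),
        ('classifier', RandomForestClassifier(n_estimators=50, random_state=seed)),
    ])
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)


# Two independent fits with the same seed reproduce the same test accuracy.
first_run = fit_and_score(seed=0)
second_run = fit_and_score(seed=0)
print(first_run == second_run)  # True
```

Passing `random_state` to `RandomizedSearchCV` as well would make the sampled parameter settings repeatable.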