## Advanced tuning of parameters

In this tutorial, we will apply the skills from previous tutorials and build a classifier using the `Pipeline` and `FeatureUnion` classes from sklearn.

In [1]:
# IMPORT PACKAGES

In [20]:
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier

### Data

We will build a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [6]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url, names=names)

In [7]:
df.head()

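|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |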


### Task

Build a classifier that predicts the target variable `class` using the rest of the attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters by running a grid search (`GridSearchCV`) over the `Pipeline`.

> #### Note
> **In this exercise, we are focusing on the implementation of the pipeline. Since the dataset has only 9 columns (8 features plus the target), PCA is probably not the best technique to use during data preparation from a methodology point of view.**

In [16]:
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
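
The split above is random, so the test score reported later will change from run to run. A minimal sketch of a reproducible, stratified split follows. It uses synthetic stand-in data from `make_classification` rather than the diabetes file, and the names `X_demo`, `y_demo` and the `random_state=42` value are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the diabetes data (768 rows, 8 features).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=0
)

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of positive labels in each part
```

Whether to stratify is a judgment call; it tends to matter more when one class is much rarer than the other.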

In [55]:
pca = PCA()
kbest = SelectKBest(k=3)  # set k explicitly: the default k=10 exceeds the 8 features available here
random_forest = RandomForestClassifier()

In [56]:
combined_features = FeatureUnion([("pca", pca), ("kbest", kbest)])
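
`FeatureUnion` fits each transformer on the same input and stacks their outputs side by side. The sketch below illustrates this on synthetic stand-in data (an assumption, not the diabetes file): 2 principal components plus 3 selected columns give 5 output columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in data with 8 features, like the diabetes dataset.
X_demo, y_demo = make_classification(
    n_samples=100, n_features=8, n_informative=4, random_state=0
)

union = FeatureUnion([("pca", PCA(n_components=2)), ("kbest", SelectKBest(k=3))])
combined = union.fit_transform(X_demo, y_demo)

# The last 3 columns are exactly what SelectKBest alone would return.
kbest_only = SelectKBest(k=3).fit_transform(X_demo, y_demo)

print(combined.shape)
print(np.allclose(combined[:, 2:], kbest_only))
```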

In [57]:
pipeline = Pipeline([
    ("features", combined_features),
    ("classifier", random_forest),
])

In [63]:
param_grid = {"features__pca__n_components": [1,2,3],
              "features__kbest__k": [1,2,3],
              "classifier__n_estimators": [10,50,100,150,200],
              "classifier__max_depth": [3,5,10,None]
             }
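
The keys in the grid follow the `step__parameter` convention, with a second `__` level for transformers nested inside the `FeatureUnion`. As a rough check, the sketch below rebuilds the same pipeline under hypothetical names (`demo_pipeline`, `demo_grid`), confirms that every key is a parameter the pipeline knows about, and counts the combinations the grid expands to.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline

demo_pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("kbest", SelectKBest(k=3))])),
    ("classifier", RandomForestClassifier()),
])

demo_grid = {
    "features__pca__n_components": [1, 2, 3],
    "features__kbest__k": [1, 2, 3],
    "classifier__n_estimators": [10, 50, 100, 150, 200],
    "classifier__max_depth": [3, 5, 10, None],
}

# get_params() lists every name that can be set on the pipeline, nested ones included.
valid_names = demo_pipeline.get_params()
unknown = [name for name in demo_grid if name not in valid_names]

# ParameterGrid expands the dict into every combination: 3 * 3 * 5 * 4.
n_candidates = len(ParameterGrid(demo_grid))

print(unknown)
print(n_candidates)
```

With 5-fold cross-validation, this candidate count multiplied by 5 gives the number of fits reported when the search runs.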

In [64]:
grid_search = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1)

In [65]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 180 candidates, totalling 900 fits


GridSearchCV(estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('pca',
                                                                        PCA()),
                                                                       ('kbest',
                                                                        SelectKBest(k=3))])),
                                       ('classifier',
                                        RandomForestClassifier())]),
             n_jobs=-1,
             param_grid={'classifier__max_depth': [3, 5, 10, None],
                         'classifier__n_estimators': [10, 50, 100, 150, 200],
                         'features__kbest__k': [1, 2, 3],
                         'features__pca__n_components': [1, 2, 3]},
             verbose=1)

In [66]:
grid_search.score(X_test, y_test)

0.7077922077922078

In [67]:
grid_search.best_params_

{'classifier__max_depth': 3,
 'classifier__n_estimators': 100,
 'features__kbest__k': 3,
 'features__pca__n_components': 1}
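
Besides `best_params_`, the fitted search also exposes the mean cross-validated score of the winning combination (`best_score_`) and a full table of results (`cv_results_`). The sketch below shows one way to inspect them. To keep it quick it uses synthetic stand-in data, a deliberately small grid, and 3 folds, all of which are assumptions for illustration rather than the settings used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic stand-in data with 8 features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

demo_pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("kbest", SelectKBest(k=3))])),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])
small_grid = {
    "features__pca__n_components": [1, 2],
    "classifier__max_depth": [3, None],
}

demo_search = GridSearchCV(demo_pipeline, small_grid, cv=3)
demo_search.fit(X_tr, y_tr)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(demo_search.cv_results_)
summary = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(summary)

# Cross-validated score of the best combination vs. accuracy on held-out data.
print(demo_search.best_score_)
test_accuracy = demo_search.score(X_te, y_te)
print(test_accuracy)
```

Comparing `best_score_` with the held-out accuracy gives a rough sense of how well the cross-validated estimate carries over to unseen data.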