## Advanced tuning of parameters

In this tutorial, we apply the skills from the previous tutorials and build a classifier using the `Pipeline` and `FeatureUnion` classes from scikit-learn.

In [115]:
# IMPORT PACKAGES
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV


from sklearn.ensemble import RandomForestClassifier

### Data

We will build a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [116]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [117]:
df=pd.read_csv("pima-indians-diabetes.csv", sep=';')

In [118]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   preg    768 non-null    int64  
 1   plas    768 non-null    int64  
 2   pres    768 non-null    int64  
 3   skin    768 non-null    int64  
 4   test    768 non-null    int64  
 5   mass    768 non-null    float64
 6   pedi    768 non-null    float64
 7   age     768 non-null    int64  
 8   class   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


### Task

Build a classifier that predicts the target variable `class` using the rest of the attributes. The model should be fitted using a pipeline that contains:
- PCA method √
- SelectKBest method √
- FeatureUnion √
- Random Forest √

Choose the best set of parameters using `Pipeline` and `GridSearchCV`.

> #### Note
> **In this exercise, we focus on implementing the pipeline. Since the dataset has only 9 columns (8 features plus the target), PCA is probably not the best data-preparation technique from a methodological point of view.**
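
To make the role of `FeatureUnion` in the task concrete, here is a minimal sketch of what it produces: it fits each transformer on the same input and concatenates their outputs column-wise. The data is a synthetic stand-in generated with `make_classification` (an assumption, not the diabetes file), and the names `X_demo`, `y_demo`, and `union` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in for the diabetes data: 8 numeric features, binary target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Two transformers applied to the same input.
union = FeatureUnion([
    ("pca", PCA(n_components=2)),        # 2 principal components
    ("univ_select", SelectKBest(k=3)),   # 3 best original features
])

# The outputs are stacked side by side: 2 + 3 = 5 columns.
X_combined = union.fit_transform(X_demo, y_demo)
print(X_demo.shape, "->", X_combined.shape)
```

The combined matrix is what the classifier at the end of the pipeline would receive, so the number of columns it sees depends on both `n_components` and `k`.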

In [121]:
# Split the features from the target; `class` is what we want to predict
X = df[col_names]
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y)

# No scaling for now
# scaler = StandardScaler()

# Which model do we want to use?
# Initialize the classifier
RandomFo = RandomForestClassifier()

# Do we want dimensionality reduction?
# With only 8 features PCA is not really needed; we use it to practise the pipeline
pca = PCA()  # n_components is left at its default; PCA(n_components=2) broke here

# Maybe some of the original features were good, too?
selection = SelectKBest(k=3)


In [122]:
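# X_train = scaler.fit_transform(X_train)
# Manual version: reduce the training data with PCA, then fit the classifier.
# Store the result in a separate variable so that X_train stays untransformed.
X_train_pca = pca.fit_transform(X_train)
RandomFo.fit(X_train_pca, y_train)

# The same steps as a pipeline, which also applies the fitted PCA to the test data
pipe = Pipeline([
        # ('scaler', StandardScaler()),
        ('reduce_dim', PCA()),
        ('classifier', RandomForestClassifier())
        ])

pipe = pipe.fit(X_train, y_train)
print('Testing score: ', pipe.score(X_test, y_test))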



Testing score:  0.59375
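
When tuning a pipeline, `GridSearchCV` addresses the parameters of nested steps by name, joining step names and parameter names with a double underscore. The sketch below lists those names for a small pipeline; the step names `features` and `clf` are hypothetical and chosen only for this illustration.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline

# A small nested pipeline with illustrative step names.
demo_pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest())])),
    ("clf", RandomForestClassifier()),
])

# Every tunable parameter is addressable as <step>__<substep>__<parameter>.
param_names = sorted(demo_pipeline.get_params().keys())
for name in param_names:
    if name.endswith(("__n_components", "__k", "__n_estimators", "__max_depth")):
        print(name)

# The same names work with set_params, which is what the grid search uses internally.
demo_pipeline.set_params(features__pca__n_components=2)
```

Printing the names before writing a parameter grid is a cheap way to catch a misspelled key, which would otherwise only surface as an error when the search starts.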


In [123]:

# FeatureUnion runs several transformers on the same input and concatenates their outputs
# Build a transformer from PCA and univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])  # PCA components plus the k best original features

# Create the pipeline: the combined features feed the classifier
pipeline = Pipeline([("featuresYO", combined_features), ("RandomFo", RandomFo)])

# Set up the parameter grid
param_grid = {"featuresYO__pca__n_components": [1, 2, 3],
              "featuresYO__univ_select__k": [1, 2, 3],
              'RandomFo__n_estimators': [10, 100],
              'RandomFo__max_depth': [4, 5, 6, 7, 8],
              # 'auto' was removed in scikit-learn 1.3 (for classifiers it was
              # the same as 'sqrt'); None means all features are considered
              'RandomFo__max_features': [None, 'sqrt', 'log2']
              }

# Create a grid search object
grid_search = GridSearchCV(pipeline, param_grid, verbose=10, refit=True)

# Fit the model and tune the parameters
grid_search.fit(X_train, y_train)

# GridSearchCV fits clones of `pipeline`, so use the refitted best estimator
best_pipeline = grid_search.best_estimator_
print(best_pipeline.named_steps["RandomFo"].feature_importances_)
print('Testing score: ', best_pipeline.score(X_test, y_test))




[CV]  RandomFo__max_depth=8, RandomFo__max_features=sqrt, RandomFo__n_estimators=100, featuresYO__pca__n_components=1, featuresYO__univ_select__k=1, score=0.739, total=   0.2s
[CV] RandomFo__max_depth=8, RandomFo__max_features=sqrt, RandomFo__n_estimators=100, featuresYO__pca__n_components=1, featuresYO__univ_select__k=1 
[CV]  RandomFo__max_depth=8, RandomFo__max_features=sqrt, RandomFo__n_estimators=100, featuresYO__pca__n_components=1, featuresYO__univ_select__k=1, score=0.774, total=   0.1s
[CV] RandomFo__max_depth=8, RandomFo__max_features=sqrt, RandomFo__n_estimators=100, featuresYO__pca__n_components=1, featuresYO__univ_select__k=1 
[CV]  RandomFo__max_depth=8, RandomFo__max_features=sqrt, RandomFo__n_estimators=100, featuresYO__pca__n_components=1, featuresYO__univ_select__k=1, score=0.704, total=   0.1s
[CV] RandomFo__max_depth=8, RandomFo__max_features=sqrt, RandomFo__n_estimators=100, featuresYO__pca__n_components=1, featuresYO__univ_select__k=2 
[CV]  RandomFo__max_depth=8,

[Parallel(n_jobs=1)]: Done 1350 out of 1350 | elapsed:  2.0min finished


In [None]:
grid_search.best_params_
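
Besides `best_params_`, a fitted `GridSearchCV` object also exposes `best_score_`, `best_estimator_`, and `cv_results_`. The following is a minimal sketch of how these can be inspected; it uses synthetic data and a deliberately tiny grid so that it runs quickly, and the names `demo_search` and `results` are illustrative rather than part of the exercise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 8 features, binary target.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# A tiny grid: two candidates, three folds each.
demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    {"max_depth": [2, 4]},
    cv=3,
)
demo_search.fit(X_tr, y_tr)

print(demo_search.best_params_)                        # best parameter combination
print(demo_search.best_score_)                         # its mean cross-validated accuracy
print(demo_search.best_estimator_.score(X_te, y_te))   # accuracy on held-out data

# cv_results_ holds one row per candidate; a DataFrame makes it easy to rank them.
results = pd.DataFrame(demo_search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))
```

Comparing `best_score_` with the held-out score gives a rough indication of whether the selected parameters generalize beyond the cross-validation folds.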