# Algorithm Chains and Pipelines
* Chaining together many different processing steps and ML models
* Using the Pipeline class with GridSearchCV to search over parameters for all processing steps at once

In [1]:
# Import required packages:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [2]:
# load and split the data:
cancer = load_breast_cancer()  # returned as a sklearn.utils.Bunch
# split the dataset into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

In [3]:
# compute minimum and maximum of the training data
scaler = MinMaxScaler().fit(X_train)

In [4]:
#rescale the training data
X_train_scaled = scaler.transform(X_train)

In [5]:
#learn an SVM on the scaled training data:
svm = SVC()
svm.fit(X_train_scaled, y_train)
# scale the test data and score the scaled data:
X_test_scaled = scaler.transform(X_test)
print("Test score {:.2f}".format(svm.score(X_test_scaled, y_test)))

Test score 0.97


* We want to find better parameters for SVC using GridSearchCV.
* **The following shows a naive approach (for illustration only):**

In [6]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
             'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=5) #cv=5: 5-fold cross validation
grid.fit(X_train_scaled, y_train)
print('Best cv accuracy: {:.2f}'.format(grid.best_score_))
print('Test set score: {:.2f}'.format(grid.score(X_test_scaled, y_test)))
print('Best parameters: ', grid.best_params_)

Best cv accuracy: 0.97
Test set score: 0.97
Best parameters:  {'C': 10, 'gamma': 1}


## Problem with this approach:
* For each split in the CV, part of the original training set is used as the training part of the split and the rest as the test part of the split.
* The test part is used to measure how new data will look to a model trained on the training part.
* **By scaling the data beforehand, we already used information from the test part of the split** (the scaler's minimum and maximum were computed on the complete training set).
* The splits in the CV therefore no longer reflect how new data will look to the modelling process.
* **This leads to overly optimistic results and may lead to suboptimal parameters being selected.**
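
The leak described above can be made visible on a single split. The following is a minimal sketch, not part of the original notebook: it assumes the same data split as in the cells above and uses `StratifiedKFold` to mimic one of the five folds. The names `leaky_scaler` and `clean_scaler` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

# Take the first of five stratified folds of the training set
train_idx, val_idx = next(StratifiedKFold(n_splits=5).split(X_train, y_train))
X_fold_train, X_fold_val = X_train[train_idx], X_train[val_idx]

# Naive approach: the scaler has already seen the validation part of the fold
leaky_scaler = MinMaxScaler().fit(X_train)
# Correct approach: the scaler only sees the training part of the fold
clean_scaler = MinMaxScaler().fit(X_fold_train)

val_leaky = leaky_scaler.transform(X_fold_val)
val_clean = clean_scaler.transform(X_fold_val)

# With the leaky scaler the validation part always lies within [0, 1];
# with the clean scaler it may fall outside that range, as truly new data would
print("leaky range:", val_leaky.min(), val_leaky.max())
print("clean range:", val_clean.min(), val_clean.max())
```

The validation part scaled by `leaky_scaler` cannot leave the [0, 1] range, because its own minimum and maximum went into the scaler. New data has no such guarantee, which is why the fold no longer behaves like unseen data.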



## Solution
* Splitting the data set (CV) should be done before any preprocessing.
* **Pipeline class:** allows gluing multiple processing steps into a single scikit-learn estimator.
* Most common use: chaining preprocessing steps (like scaling) with a supervised model

### Building Pipelines:

In [7]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])
# Pipeline with 2 steps: first scaling (MinMaxScaler), then an SVM (an instance of SVC)

In [8]:
# fit the pipeline like any other estimator:
pipe.fit(X_train, y_train)
# 1st: calls fit on the first step (scaler), then transforms the training data with it
# 2nd: fits the SVM on the scaled training data
print('Test score {:.2f}'.format(pipe.score(X_test, y_test)))

Test score 0.97


* score method on pipe:
    * 1st: transforms the test data using the scaler
    * 2nd: calls the score method on the SVM using the scaled test data
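
To make the two steps concrete, the sketch below repeats them by hand and compares the result with `pipe.score`. It assumes the same data split and pipeline as above; `manual_score` and `pipe_score` are illustrative names, and the fitted steps are looked up through the pipeline's `named_steps` attribute.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])
pipe.fit(X_train, y_train)

# Fetch the fitted steps by the names given when building the pipeline
fitted_scaler = pipe.named_steps['scaler']
fitted_svm = pipe.named_steps['svm']

# 1st: transform the test data, 2nd: score the SVM on the transformed data
manual_score = fitted_svm.score(fitted_scaler.transform(X_test), y_test)
pipe_score = pipe.score(X_test, y_test)

print("Manual score and pipeline score agree:", manual_score == pipe_score)
```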

### Using Pipelines in Grid Searches

* Slight change to the usual approach:
    * For each parameter, we need to specify which step of the pipeline it belongs to
    * Syntax: the step name, followed by a double underscore (`__`), followed by the parameter name

In [9]:
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100], 'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
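
If it is unclear which keys a parameter grid may contain, the pipeline can list them itself. This is a small sketch, assuming the same two-step pipeline as above: `get_params` returns every settable parameter, and `set_params` accepts the same names. The variable names `param_names` and `svm_params` are illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])

# All parameter names the pipeline exposes, including those of its steps
param_names = sorted(pipe.get_params().keys())

# Keep only the ones that belong to the step named 'svm'
svm_params = [name for name in param_names if name.startswith('svm__')]
print(svm_params)

# The same names work with set_params, which is what a grid search relies on
pipe.set_params(svm__C=10, svm__gamma=1)
print(pipe.named_steps['svm'].C, pipe.named_steps['svm'].gamma)
```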

In [10]:
# Now we can use GridSearchCV as normal:

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5) 
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_)) 
print("Test set score: {:.2f}".format(grid.score(X_test, y_test))) 
print("Best parameters: {}".format(grid.best_params_))

Best cross-validation accuracy: 0.97
Test set score: 0.97
Best parameters: {'svm__C': 10, 'svm__gamma': 1}
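
After the search, the best model is itself a fitted pipeline. The following sketch assumes the default `refit=True` and uses a reduced grid (an assumption, chosen only to keep the run short). It checks that the scaler inside `grid.best_estimator_` was refit on the whole training set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])

# Reduced grid, for illustration only
small_grid = {'svm__C': [0.1, 1, 10], 'svm__gamma': [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid=small_grid, cv=5)
grid.fit(X_train, y_train)

# best_estimator_ is a Pipeline refit on all of X_train with the best parameters
best_pipe = grid.best_estimator_
print(type(best_pipe).__name__)

# The scaler inside it has seen every training sample, but no test sample
n_seen = best_pipe.named_steps['scaler'].n_samples_seen_
print(n_seen, X_train.shape[0])
```

During the search itself, each fold fits its own copy of the scaler on the training part of that fold only; the refit on the full training set happens once at the end.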


* Leaking information from the test fold when estimating the scale of the data usually does not have a terrible impact, whereas using the test fold in feature extraction or feature selection can lead to substantial differences in outcomes.
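
The feature selection case can be illustrated with a small experiment. The sketch below is an assumption-laden illustration, not a result from this notebook: it uses synthetic data in which the target is pure noise and independent of the features, selects features with `SelectPercentile`, and fits a `Ridge` regressor. Any apparent predictive power can then only come from leakage.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (assumption): features and target are independent noise
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))

# Leaky: select features using ALL samples, then cross-validate
select = SelectPercentile(score_func=f_regression, percentile=5).fit(X, y)
X_selected = select.transform(X)
leaky_r2 = np.mean(cross_val_score(Ridge(), X_selected, y, cv=5))

# Correct: selection happens inside the pipeline, separately for each fold
pipe = Pipeline([
    ('select', SelectPercentile(score_func=f_regression, percentile=5)),
    ('ridge', Ridge()),
])
honest_r2 = np.mean(cross_val_score(pipe, X, y, cv=5))

print("R^2 with selection outside CV: {:.2f}".format(leaky_r2))
print("R^2 with selection inside CV:  {:.2f}".format(honest_r2))
```

Because the features carry no information about the target, the score from the pipeline version is the trustworthy one. A clearly higher score from the first variant is an artifact of choosing features with knowledge of the test folds.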