# Scikit-learn Pipelines

* A `Pipeline` combines multiple processing steps into a single estimator.
* It has `fit`, `predict`, and `score` methods, just like any sklearn machine learning model.
* Pipelines are built from a list of steps, which are (`name`, `algorithm`) tuples.
  * The `name` can be anything that does not contain `'__'`.
  * `'__'` is reserved for referring to an algorithm's hyperparameters, in the form `name__parameter`.
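
As a minimal sketch of the naming rule above, the snippet below builds an unfitted pipeline and reads and changes hyperparameters through `get_params` and `set_params`. The step names `scaler` and `NN` are arbitrary labels, and the values being set are chosen purely for illustration.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# "scaler" and "NN" are arbitrary step names; they must not contain "__".
demo_pipe = Pipeline([("scaler", StandardScaler()),
                      ("NN", MLPClassifier(max_iter=1000))])

# Every hyperparameter is addressable as <step name>__<parameter name>.
nn_params = sorted(p for p in demo_pipe.get_params() if p.startswith("NN__"))
print(nn_params[:5])

# set_params uses the same names to change hyperparameters in place.
demo_pipe.set_params(NN__hidden_layer_sizes=(50,), scaler__with_mean=False)
print(demo_pipe.named_steps["NN"].hidden_layer_sizes)  # (50,)
print(demo_pipe.named_steps["scaler"].with_mean)       # False
```

The names listed by `get_params` are the same keys that a grid search accepts when the pipeline is used as its estimator.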




### Example 1: classify breast cancer data
We will classify the breast cancer data using:
1.   a data preprocessing step to standardise the features, and
2.   a neural network classifier.



In [1]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
print('Size of the dataset: ', cancer.data.shape[0])
print('Number of features: ', cancer.data.shape[1])
print('Number of classes: ', cancer.target.max() + 1)
print()

steps = [("scaler", StandardScaler()),
         ("NN", MLPClassifier(max_iter=1000))]
pipe = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

pipe.fit(X_train, y_train)
print("Train score: {:.2f}".format(pipe.score(X_train, y_train)))
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))

Size of the dataset:  569
Number of features:  30
Number of classes:  2

Train score: 1.00
Test score: 0.97


To obtain the trained model, we use the `named_steps` dictionary.

In [2]:
clf = pipe.named_steps['NN']
print(clf)
for i in range(len(clf.hidden_layer_sizes)):
  print(i+1, 'hidden layer with ',clf.hidden_layer_sizes[i], ' neurons')

MLPClassifier(max_iter=1000)
1 hidden layer with  100  neurons


We can also visualise the pipeline.

In [3]:
from sklearn import set_config
set_config(display="diagram")
pipe

## Pipelines and GridSearch
We can use the pipeline as the estimator in a `GridSearchCV`. To do so, we need to define a grid of values for the algorithm's hyperparameters. We refer to each hyperparameter by combining the pipeline step name with the parameter name (e.g., step `NN` and parameter `hidden_layer_sizes` are combined as `NN__hidden_layer_sizes`).

In [4]:
from sklearn.model_selection import GridSearchCV

param_grid = {'NN__hidden_layer_sizes': [(10,), (50,), (100,), (10,10), (50,50), (100,100)],
              'NN__activation':['relu', 'logistic']}

steps = [("scaler", StandardScaler()),
         ("NN", MLPClassifier(max_iter=1000))]
pipe = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))

Best cross-validation accuracy: 0.97
Test set score: 0.98
Best parameters: {'NN__activation': 'logistic', 'NN__hidden_layer_sizes': (100,)}


Again, we can retrieve the best neural network from the `best_estimator_` attribute of the `GridSearchCV` (itself a fitted pipeline), using the corresponding step `name` from the pipeline.

In [5]:
clf = grid.best_estimator_.named_steps['NN']
print(clf)
for i in range(len(clf.hidden_layer_sizes)):
  print(i+1, 'hidden layer with ',clf.hidden_layer_sizes[i], ' neurons and ',clf.activation, 'activation function' )

MLPClassifier(activation='logistic', max_iter=1000)
1 hidden layer with  100  neurons and  logistic activation function


### Example 2: predict the price of houses
In this example we will use the California housing dataset (`fetch_california_housing`). We will also create a pipeline with two preprocessing steps and one regression model.

In [6]:
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

housing = fetch_california_housing()

steps = [("scaler", StandardScaler()),
         ("poly_features", PolynomialFeatures()), # default degree = 2
         ("linear_regression", Ridge())] # default regularization strength = 1.0

pipe = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, random_state=0)
pipe.fit(X_train, y_train)
print("Train score: {:.2f}".format(pipe.score(X_train, y_train)))
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))

Train score: 0.69
Test score: -0.64


Let's use grid search to tune the polynomial `degree` of `PolynomialFeatures` and the regularization strength (`alpha`) of `Ridge`.

In [7]:
param_grid = {'poly_features__degree': [1, 2, 3],
              'linear_regression__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=1)
grid.fit(X_train, y_train)
print("Best cross-validation R^2: {:.2f}".format(grid.best_score_))
print("Test score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))

Best cross-validation R^2: 0.61
Test score: 0.59
Best parameters: {'linear_regression__alpha': 10, 'poly_features__degree': 1}
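
Note that for a regression pipeline, `score` (and therefore the default `GridSearchCV` score) is the coefficient of determination R², not a classification accuracy. R² can be negative when a model predicts worse than a constant mean predictor. The following sketch checks the equivalence with `r2_score` on small synthetic data, which is an assumption made here to keep the example fast and offline; it does not use the housing data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only, not the housing dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

reg_pipe = Pipeline([("scaler", StandardScaler()),
                     ("linear_regression", Ridge())])
reg_pipe.fit(X_demo, y_demo)

# For a regressor, score() is the R^2 of the predictions.
pipe_score = reg_pipe.score(X_demo, y_demo)
manual_r2 = r2_score(y_demo, reg_pipe.predict(X_demo))
print(np.isclose(pipe_score, manual_r2))  # True
```

If a different metric is preferred, `GridSearchCV` accepts a `scoring` argument, for example `scoring="neg_mean_squared_error"`.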


### Feature unions

Sometimes you want to apply several preprocessing techniques to the same input and use all of the resulting features together. To do so, we use the `FeatureUnion` class, which concatenates the outputs of its transformers.
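
As a rough illustration of what a feature union produces, the sketch below joins two transformers on random data and compares the result with stacking their individual outputs by hand. The data and the choice of `PCA` plus `StandardScaler` are assumptions made only for this demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Random demo data: 20 samples, 4 features (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 4))

union = FeatureUnion([("pca", PCA(n_components=2)),
                      ("scaled", StandardScaler())])
X_union = union.fit_transform(X_demo)

# The outputs are placed side by side: 2 PCA columns + 4 scaled columns.
print(X_union.shape)  # (20, 6)

# Equivalent manual construction: fit each transformer on its own and stack.
manual = np.hstack([PCA(n_components=2).fit_transform(X_demo),
                    StandardScaler().fit_transform(X_demo)])
print(np.allclose(X_union, manual))
```

Each transformer sees the full original input; the union does not chain them one after another as a pipeline would.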

### Example 3: Classify the Iris dataset
In this example, we will combine the first two principal components with the single best original feature (`SelectKBest` with `k=1`) and classify the data using a neural network.

In [8]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()

X, y = iris.data, iris.target

pca = PCA(n_components=2)
selection = SelectKBest(k=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

combined_features = FeatureUnion([("pca", pca), ("raw_select", selection)])

X_features_train = combined_features.fit(X_train, y_train).transform(X_train)
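# The pipeline below transforms the test data itself, so this line is not needed here.
# If you pass the features to a classifier directly, call only .transform on the
# test set (never .fit), so that the test data does not influence the preprocessing:
# X_features_test = combined_features.transform(X_test)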
print("Combined space has", X_features_train.shape[1], "features")

steps = [("features", combined_features),
         ("NN", MLPClassifier(hidden_layer_sizes=(12,), activation='logistic', max_iter=5000))]

pipe = Pipeline(steps)

pipe.fit(X_train, y_train)

print("Train score: {:.2f}".format(pipe.score(X_train, y_train)))
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))

Combined space has 3 features
Train score: 0.96
Test score: 0.97


Now, let's use grid search to find the number of principal components and raw features, as well as the number of hidden neurons. The activation function stays fixed at `logistic`, since it is not part of the grid.

In [9]:
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__raw_select__k=[1, 2, 3, 4],
                  NN__hidden_layer_sizes=[(4,), (8,), (12,), (16,)])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=1)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Test score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))

Best cross-validation accuracy: 0.99
Test score: 0.95
Best parameters: {'NN__hidden_layer_sizes': (8,), 'features__pca__n_components': 3, 'features__raw_select__k': 3}
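
The nested names in this grid follow the same `'__'` convention, applied twice: pipeline step, then transformer inside the union, then parameter. A small sketch of how such a name resolves, using an unfitted pipeline with the same step names as above (the values set here are illustrative, not the grid search result):

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

demo_pipe = Pipeline([
    ("features", FeatureUnion([("pca", PCA(n_components=2)),
                               ("raw_select", SelectKBest(k=1))])),
    ("NN", MLPClassifier(hidden_layer_sizes=(12,), max_iter=5000)),
])

# Read left to right: step "features" -> transformer "pca" -> parameter.
demo_pipe.set_params(features__pca__n_components=3,
                     features__raw_select__k=2)

# transformer_list holds the (name, transformer) pairs of the union.
transformers = dict(demo_pipe.named_steps["features"].transformer_list)
print(transformers["pca"].n_components)  # 3
print(transformers["raw_select"].k)      # 2
```

The same lookup on `grid.best_estimator_` would give access to the fitted transformers selected by the search.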


### Exercise

Load the [digits dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) and build a pipeline to classify the data.

After loading the dataset you should:
* Preprocess the data (you can also use grid search to find the hyperparameters of the preprocessing methods, if any), for example:
  * Standardize the data,
  * Apply min-max normalization,
  * Reduce the dimensionality using PCA,
  * Select the best features,
  * etc.

* Select a classifier
  * Set the classifier hyperparameters using grid search

* Retrieve the best model