# Machine Learning Pipeline 

This notebook is meant to be a brief and simple introduction to pipelines with the hope that it will spark your interest to learn more.    

### Why should you create a pipeline?
* Reusable across projects
* Test new ideas (components easily)
* Reduce bugs/erros
* Prevents data leaking


In [14]:
from sklearn import datasets
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split,cross_val_score, GridSearchCV
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, \
RobustScaler, MaxAbsScaler
from sklearn import set_config
import numpy as np
import pandas as pd

## 0 - Load data

In [2]:
# import some data to play with
iris = datasets.load_iris()
X = iris.data[50:]  # we only take the first two classes
y = iris.target[50:] # binary classification

# Split the data into train and test (val)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, 
                                                    random_state=123)

## 1 - Simple pipeline
We will explore sklearn [pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline) class

In [3]:
# Without the pipeline
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = SVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           1       1.00      0.90      0.95        10
           2       0.91      1.00      0.95        10

    accuracy                           0.95        20
   macro avg       0.95      0.95      0.95        20
weighted avg       0.95      0.95      0.95        20



How does this scale when we have more steps?

In [4]:
# With the pipeline
steps = [('scaler', StandardScaler()), # preprocessing steps
         ('SVM', SVC())]               # model

pipeline = Pipeline(steps) 
pipeline.fit(X_train, y_train) 
y_pred = pipeline.predict(X_test)
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           1       1.00      0.90      0.95        10
           2       0.91      1.00      0.95        10

    accuracy                           0.95        20
   macro avg       0.95      0.95      0.95        20
weighted avg       0.95      0.95      0.95        20



## 2 - Pipeline with cross-validation

Without the pipeline, if you want to prevent data leaking, you need to standardize separetly on every fold! 



In [6]:
steps = [('scaler', StandardScaler()), # preprocessing steps
         ('SVM', SVC())]               # model

pipeline = Pipeline(steps)

# We send X and y complete
cross_val_score(pipeline, X, y, cv=10, scoring ="f1")

array([1.        , 0.90909091, 1.        , 0.90909091, 0.88888889,
       0.88888889, 0.8       , 1.        , 1.        , 1.        ])

## 3 - Pipeline with GridSearch

In [7]:
# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = {
    'SVM__C': np.logspace(-4, 4, 4),
}

steps = [('scaler', StandardScaler()), # preprocessing steps
         ('SVM', SVC())]               # model

pipeline = Pipeline(steps)

search = GridSearchCV(pipeline, param_grid, n_jobs=-1, cv = 10, scoring = "f1")
search.fit(X, y)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Best parameter (CV score=0.932):
{'SVM__C': 0.0001}


## 4 - Pipeline with hyper-hyper parameter GridSearch
What if you want to explore different preprocessing steps (scalers)? 

In [12]:
scalers =  [
    StandardScaler(),
    RobustScaler(),
    'passthrough'] # none


steps = [('scaler', StandardScaler()), # preprocessing steps
         ('clf', SVC())]               # Model

param_grid = {
    'scaler': scalers,
    'clf__C': np.logspace(-4, 4, 4),
}

pipeline = Pipeline(steps)

search = GridSearchCV(pipeline, param_grid, n_jobs=-1, cv = 10, scoring = "recall")
search.fit(X, y)
print("Scaler: %s" % scaler)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)


Scaler: StandardScaler()
Best parameter (CV score=0.940):
{'clf__C': 0.0001, 'scaler': StandardScaler()}


**What if we want to explore different models?**

Note: here we could do it in different steps so that we don't scale the data multiple times.
We could scale once and then test different classifiers.
We decided to use the complete pipeline for code consistency. 


In [13]:
scalers =  [
    StandardScaler(),
    RobustScaler(),
    'passthrough'] # none

classifiers = [
    SVC(),
    LogisticRegression()
]

steps = [('scaler', StandardScaler()), # preprocessing steps
         ('clf', SVC())]               # Model

param_grid = {
    'scaler': scalers,
    'clf': classifiers,
    'clf__C': np.logspace(-4, 4, 4),
}

pipeline = Pipeline(steps)

search = GridSearchCV(pipeline, param_grid, n_jobs=-1, cv = 10, scoring = "recall")
search.fit(X, y)
print("Scaler: %s" % scaler)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Scaler: StandardScaler()
Best parameter (CV score=0.960):
{'clf': LogisticRegression(C=21.54434690031882), 'clf__C': 21.54434690031882, 'scaler': StandardScaler()}


What other things can you play with? 

- Preprocessing data
    - Standardization, or mean removal and variance scaling
    - Non-linear transformation
    - Normalization
    - Encoding categorical features
    - Discretization
    - Imputation of missing values
    - Generating polynomial features
- Imputation of missing values
    - Univariate vs. Multivariate Imputation
    - Univariate feature imputation
    - Multivariate feature imputation
    - Nearest neighbors imputation
    - Marking imputed values
- Feature selection
- Dimensionality reduction
- Modeling

More ideas [here](https://scikit-learn.org/stable/data_transforms.html)

**What if I can't find the one I need?**
[Create it](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156)! 
