# Machine Learning Pipeline 

This notebook is a brief, simple introduction to scikit-learn pipelines, meant to spark your interest in learning more.

### Why should you create a pipeline?
* Reuse the same code across projects
* Test new ideas (components) easily
* Reduce bugs/errors
* Prevent data leakage


In [1]:
from sklearn import datasets
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, \
    RobustScaler, MaxAbsScaler
from sklearn import set_config
import numpy as np
import pandas as pd

## 0 - Load data

In [2]:
# import some data to play with
iris = datasets.load_iris()
X = iris.data[50:]    # keep only the last two classes (versicolor and virginica)
y = iris.target[50:]  # binary classification; the labels stay 1 and 2
# With the default pos_label=1, the "f1" and "recall" scorers treat label 1 as the positive class
# Split the data into train and test (val)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=123)

## 1 - Simple pipeline
We will explore scikit-learn's [`Pipeline`](https://scikit-learn.org/stable/modules/compose.html#pipeline) class.

In [3]:
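# Without the pipeline
scaler = StandardScaler()
# Keep the raw X_train/X_test untouched so later cells can feed them to a pipeline
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

model = SVC()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
# classification_report expects (y_true, y_pred); with the arguments in this
# order, the precision and recall columns below are swapped
print(classification_report(y_pred, y_test))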

              precision    recall  f1-score   support

           1       1.00      0.90      0.95        10
           2       0.91      1.00      0.95        10

    accuracy                           0.95        20
   macro avg       0.95      0.95      0.95        20
weighted avg       0.95      0.95      0.95        20



How does this scale when we have more steps?

In [4]:
# With the pipeline
steps = [('scaler', StandardScaler()), # preprocessing steps
         ('SVM', SVC())]               # model

pipeline = Pipeline(steps) 
pipeline.fit(X_train, y_train) 
y_pred = pipeline.predict(X_test)
# Same argument order as in the previous cell, so the two reports are comparable
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           1       1.00      0.90      0.95        10
           2       0.91      1.00      0.95        10

    accuracy                           0.95        20
   macro avg       0.95      0.95      0.95        20
weighted avg       0.95      0.95      0.95        20



## 2 - Pipeline with cross-validation

Without the pipeline, if you want to prevent data leakage, you need to standardize separately on every fold!
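
To make the point concrete, here is a minimal sketch of the manual per-fold standardization next to the pipeline version. It assumes the same Iris subset used in this notebook and an unshuffled `StratifiedKFold` with 10 splits; the names `manual_scores` and `pipeline_scores` are only illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[50:], iris.target[50:]  # classes 1 and 2, as in this notebook

cv = StratifiedKFold(n_splits=10)  # shared by both versions so the folds match

# Manual version: refit the scaler on the training part of every fold.
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    fold_scaler = StandardScaler().fit(X[train_idx])
    fold_model = SVC().fit(fold_scaler.transform(X[train_idx]), y[train_idx])
    fold_pred = fold_model.predict(fold_scaler.transform(X[test_idx]))
    manual_scores.append(f1_score(y[test_idx], fold_pred))  # pos_label=1 by default

# Pipeline version: cross_val_score refits every step on each training fold.
pipeline = Pipeline([('scaler', StandardScaler()), ('SVM', SVC())])
pipeline_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")

print(np.allclose(manual_scores, pipeline_scores))
```

The loop is short here, but every extra preprocessing step would have to be repeated inside it, which is where mistakes tend to creep in.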



In [5]:
steps = [('scaler', StandardScaler()), # preprocessing steps
         ('SVM', SVC())]               # model

pipeline = Pipeline(steps)

# Pass the complete X and y: the scaler is refit on the training part of every fold
cross_val_score(pipeline, X, y, cv=10, scoring="f1")

array([1.        , 0.90909091, 1.        , 0.90909091, 0.88888889,
       0.88888889, 0.8       , 1.        , 1.        , 1.        ])

## 3 - Pipeline with GridSearch

In [6]:
# Parameters of pipeline steps are addressed as '<step name>__<parameter name>':
param_grid = {
    'SVM__C': np.logspace(-4, 4, 4),
}

steps = [('scaler', StandardScaler()), # preprocessing steps
         ('SVM', SVC())]               # model

pipeline = Pipeline(steps)

search = GridSearchCV(pipeline, param_grid, n_jobs=-1, cv = 10, scoring = "f1")
search.fit(X, y)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Best parameter (CV score=0.932):
{'SVM__C': 0.0001}
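
If you are unsure which names are valid keys for `param_grid`, one option is to ask the pipeline itself. The sketch below rebuilds the same two-step pipeline and lists the step parameters through `get_params()`; the variable `tunable` is just an illustrative name.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([('scaler', StandardScaler()), ('SVM', SVC())])

# Every key containing '__' is a parameter of a step and can be used in param_grid.
tunable = sorted(name for name in pipeline.get_params() if '__' in name)
print(tunable)

# set_params follows the same naming scheme.
pipeline.set_params(SVM__C=10.0, scaler__with_mean=False)
print(pipeline.named_steps['SVM'].C)
```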


## 4 - Pipeline with hyper-hyper parameter GridSearch
What if you want to explore different preprocessing steps (scalers)? 

**What if we want to explore different models?**

Note: we could split this into separate steps, scaling the data once and then testing the different classifiers, so that the data is not scaled again for every candidate.
We decided to use the complete pipeline for code consistency.
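
One way to keep the complete pipeline while avoiding some of the repeated scaling is the `memory` argument of `Pipeline`, which caches fitted transformers. The following is only a sketch under assumptions: the temporary cache directory and the names `cached_pipe` and `cached_search` are illustrative, and for a scaler this cheap the cache may well cost more than it saves.

```python
import shutil
import tempfile

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[50:], iris.target[50:]

cache_dir = tempfile.mkdtemp()  # hypothetical cache location for this sketch
try:
    cached_pipe = Pipeline(
        [('scaler', StandardScaler()), ('clf', SVC())],
        memory=cache_dir,  # fitted transformers are cached on disk
    )
    param_grid = {
        'clf': [SVC(), LogisticRegression(max_iter=1000)],
        'clf__C': np.logspace(-4, 4, 4),
    }
    cached_search = GridSearchCV(cached_pipe, param_grid, cv=10, scoring="recall")
    cached_search.fit(X, y)
    print(cached_search.best_params_)
finally:
    shutil.rmtree(cache_dir, ignore_errors=True)  # remove the cache afterwards
```

Within a fold, the scaler sees the same training data for every classifier candidate, so its fitted state can be reused instead of recomputed.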


In [7]:
from sklearn.model_selection import StratifiedKFold

def create_iterator(X, y):
    '''
    Create an iterator that splits the data into 10 folds, stratified by the target variable.
    :param X:           Feature matrix (an array of indexes works as well).
    :param y:           Target labels used for the stratification.
    :return:            An iterator yielding (train_index, test_index) pairs.
    '''
    return StratifiedKFold(n_splits=10, shuffle=True, random_state=123).split(X, y)

In [8]:
temp = create_iterator(X, y)

In [None]:
for i, (train_index, test_index) in enumerate(temp):
    print("Iteration: ", i)
    # Check that the class proportions are the same in train and test
    # (the first bincount entry is always 0 because the labels are 1 and 2)
    print("Train: ", np.bincount(y[train_index])/len(y[train_index]))
    print("Test: ", np.bincount(y[test_index])/len(y[test_index]))
    print("_______")
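
One detail worth knowing about the iterator returned by `.split()`: it is a generator, so it can only be consumed once. The sketch below uses hypothetical toy data (`X_demo`, `y_demo`) to show this, along with one way to make the splits reusable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X_demo = np.zeros((20, 2))              # hypothetical toy features
y_demo = np.array([0] * 10 + [1] * 10)  # hypothetical balanced labels

splits = StratifiedKFold(n_splits=5, shuffle=True, random_state=123).split(X_demo, y_demo)

first_pass = len(list(splits))   # consumes the generator
second_pass = len(list(splits))  # nothing is left to iterate over
print(first_pass, second_pass)   # prints: 5 0

# Materialising the splits in a list makes them reusable.
reusable_splits = list(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=123).split(X_demo, y_demo)
)
print(len(reusable_splits))
```

Passing the `StratifiedKFold` object itself as `cv`, as done in the grid search of this notebook, avoids the issue altogether.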

In [14]:
# my way of doing it
from sklearn.model_selection import GridSearchCV
# import random forest (only needed for the commented-out search space below)
from sklearn.ensemble import RandomForestClassifier as RandomF

# Create a pipeline whose steps are placeholders; the grid search fills them in
pipe = Pipeline([
    ('scaler', None),  # placeholder scaler
    ('clf', None)      # placeholder estimator
    ])

# Candidate learning algorithms and their hyperparameters
search_space = [
                {
                'clf': [LogisticRegression(), SVC()], # Actual Estimator
                'scaler': [StandardScaler(), 'passthrough', RobustScaler()],
                'clf__C': np.logspace(-4, 4, 4), # Both estimators expose a C parameter, so one entry tunes it for both.
                },
                # {
                # 'clf': [RandomF()],  # Actual Estimator
                # 'scaler': [StandardScaler(), RobustScaler()],
                # 'clf__n_estimators': [1, 10, 100, 1000],
                # 'clf__max_depth': [None, 5, 10, 20], # Random forest has different parameters compared to the other two
                # }
            ]
# Create grid search 
gs = GridSearchCV(pipe, search_space, n_jobs=-1,
                  cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=123),
                  scoring="recall")
gs.fit(X, y)
print("Best parameter (CV score=%0.3f):" % gs.best_score_)
print(gs.best_params_)

Best parameter (CV score=0.960):
{'clf': LogisticRegression(C=0.046415888336127774), 'clf__C': 0.046415888336127774, 'scaler': StandardScaler()}


In [17]:
# print the first five rows of the results
pd.DataFrame(gs.cv_results_).head(5)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf | param_clf__C | param_scaler | params | split0_test_score | split1_test_score | ... | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001183 | 0.000351 | 0.00052 | 2.6e-05 | LogisticRegression(C=0.046415888336127774) | 0.0001 | StandardScaler() | {'clf': LogisticRegression(C=0.046415888336127... | 0.8 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 0.6 | 1.0 | 0.8 | 1.0 | 0.9 | 0.134164 | 20 |
| 1 | 0.001074 | 0.000267 | 0.000467 | 3.8e-05 | LogisticRegression(C=0.046415888336127774) | 0.0001 | passthrough | {'clf': LogisticRegression(C=0.046415888336127... | 0.8 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | 1.0 | 0.8 | 0.9 | 0.134164 | 20 |
| 2 | 0.001303 | 0.000263 | 0.000508 | 5.8e-05 | LogisticRegression(C=0.046415888336127774) | 0.0001 | RobustScaler() | {'clf': LogisticRegression(C=0.046415888336127... | 0.8 | 1.0 | ... | 1.0 | 1.0 | 0.8 | 0.6 | 0.8 | 0.8 | 1.0 | 0.84 | 0.149666 | 24 |
| 3 | 0.001155 | 0.000168 | 0.000565 | 0.000104 | LogisticRegression(C=0.046415888336127774) | 0.046416 | StandardScaler() | {'clf': LogisticRegression(C=0.046415888336127... | 0.8 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | 1.0 | 1.0 | 0.96 | 0.08 | 1 |
| 4 | 0.001628 | 0.000202 | 0.000645 | 0.000268 | LogisticRegression(C=0.046415888336127774) | 0.046416 | passthrough | {'clf': LogisticRegression(C=0.046415888336127... | 0.8 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | 1.0 | 1.0 | 0.94 | 0.091652 | 7 |
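
The full `cv_results_` frame is wide, and `head(5)` shows the candidates in grid order rather than by quality. A possible way to read it more easily is to sort by `rank_test_score` and keep a few columns. The sketch below fits a small search of its own so that it runs stand-alone; `demo_search` and `summary` are illustrative names, and the grid is a reduced version of the one above.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[50:], iris.target[50:]

demo_search = GridSearchCV(
    Pipeline([('scaler', StandardScaler()), ('clf', SVC())]),
    {'scaler': [StandardScaler(), RobustScaler(), 'passthrough'],
     'clf__C': np.logspace(-4, 4, 4)},
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=123),
    scoring="recall",
)
demo_search.fit(X, y)

# Keep only the columns needed to compare candidates, best rank first.
columns = ["param_scaler", "param_clf__C", "mean_test_score",
           "std_test_score", "rank_test_score"]
summary = pd.DataFrame(demo_search.cv_results_).sort_values("rank_test_score")[columns]
print(summary.head(5))
```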


In [18]:
steps = [('scaler', StandardScaler()), # preprocessing steps
         ('clf', SVC())]               # Model

param_grid = {
    'scaler':  [StandardScaler(), RobustScaler(),'passthrough'],
    'clf': [SVC(), LogisticRegression()],
    'clf__C': np.logspace(-4, 4, 4),
}

pipeline = Pipeline(steps)

search = GridSearchCV(pipeline, param_grid, n_jobs=-1, cv = 10, scoring = "recall")
search.fit(X, y)
print("Scaler: %s" % search.best_params_['scaler'])  # report the selected scaler, not the leftover `scaler` variable
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Scaler: StandardScaler()
Best parameter (CV score=0.960):
{'clf': LogisticRegression(C=21.54434690031882), 'clf__C': 21.54434690031882, 'scaler': StandardScaler()}


In [19]:
# print the first five rows of the results
pd.DataFrame(search.cv_results_).head(5)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf | param_clf__C | param_scaler | params | split0_test_score | split1_test_score | ... | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000679 | 0.000105 | 0.00055 | 0.000106 | SVC() | 0.0001 | StandardScaler() | {'clf': SVC(), 'clf__C': 0.0001, 'scaler': Sta... | 0.8 | 1.0 | ... | 1.0 | 0.8 | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 | 0.94 | 0.091652 | 7 |
| 1 | 0.001104 | 0.000222 | 0.000528 | 1.7e-05 | SVC() | 0.0001 | RobustScaler() | {'clf': SVC(), 'clf__C': 0.0001, 'scaler': Rob... | 0.6 | 1.0 | ... | 1.0 | 0.8 | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 | 0.92 | 0.132665 | 14 |
| 2 | 0.000494 | 7.4e-05 | 0.000492 | 4.1e-05 | SVC() | 0.0001 | passthrough | {'clf': SVC(), 'clf__C': 0.0001, 'scaler': 'pa... | 0.6 | 1.0 | ... | 1.0 | 1.0 | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 | 0.94 | 0.128062 | 7 |
| 3 | 0.000683 | 8.4e-05 | 0.000528 | 2.6e-05 | SVC() | 0.046416 | StandardScaler() | {'clf': SVC(), 'clf__C': 0.046415888336127774,... | 0.8 | 1.0 | ... | 1.0 | 0.8 | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 | 0.94 | 0.091652 | 7 |
| 4 | 0.001036 | 0.000108 | 0.000529 | 3.1e-05 | SVC() | 0.046416 | RobustScaler() | {'clf': SVC(), 'clf__C': 0.046415888336127774,... | 0.6 | 1.0 | ... | 1.0 | 0.8 | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 | 0.92 | 0.132665 | 14 |


In [29]:
scalers =  [
    StandardScaler(),
    RobustScaler(),
    'passthrough'] # no scaling

classifiers = [
    SVC(),
    LogisticRegression()
]

steps = [('scaler', 'passthrough'), # preprocessing steps
         ('clf', 'passthrough')]               # Model

param_grid = {
    'scaler': scalers,
    'clf': classifiers,
    'clf__C': np.logspace(-4, 4, 4),
}

pipeline = Pipeline(steps)

search2 = GridSearchCV(pipeline, param_grid, n_jobs=-1, cv = 10, scoring = "recall")
search2.fit(X, y)
print("Scaler: %s" % search2.best_params_['scaler'])  # report the selected scaler, not the leftover `scaler` variable
print("Best parameter (CV score=%0.3f):" % search2.best_score_)
print(search2.best_params_)

Scaler: RobustScaler()
Best parameter (CV score=0.470):
{'clf': SVC(C=21.54434690031882), 'clf__C': 21.54434690031882, 'scaler': RobustScaler()}


  _warn_prf(average, modifier, msg_start, len(result))

What other things can you play with? 

- Preprocessing data
    - Standardization, or mean removal and variance scaling
    - Non-linear transformation
    - Normalization
    - Encoding categorical features
    - Discretization
    - Imputation of missing values
    - Generating polynomial features
- Imputation of missing values
    - Univariate vs. Multivariate Imputation
    - Univariate feature imputation
    - Multivariate feature imputation
    - Nearest neighbors imputation
    - Marking imputed values
- Feature selection
- Dimensionality reduction
- Modeling

More ideas [here](https://scikit-learn.org/stable/data_transforms.html)

**What if I can't find the one I need?**
[Create it](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156)! 
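
As a rough idea of what "creating it" can look like, here is a minimal sketch of a custom transformer plugged into a pipeline. The class `Log1pTransformer` is hypothetical and only serves as an illustration; it is not taken from the linked guide.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that applies log(1 + x) to every feature."""

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn, but fit must return self.
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


iris = datasets.load_iris()
X, y = iris.data[50:], iris.target[50:]

custom_pipe = Pipeline([
    ('log', Log1pTransformer()),   # custom step
    ('scaler', StandardScaler()),
    ('SVM', SVC()),
])
custom_scores = cross_val_score(custom_pipe, X, y, cv=10, scoring="f1")
print(custom_scores.mean())
```

Inheriting from `TransformerMixin` provides `fit_transform`, and `BaseEstimator` provides `get_params`/`set_params`, so the custom step can be cloned and tuned like the built-in ones.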


In [30]:
# print all the results
pd.DataFrame(search2.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf | param_clf__C | param_scaler | params | split0_test_score | split1_test_score | ... | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000664 | 0.000111 | 0.000535 | 3.7e-05 | SVC(C=21.54434690031882) | 0.0001 | StandardScaler() | {'clf': SVC(C=21.54434690031882), 'clf__C': 0.... | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18 |
| 1 | 0.000959 | 7.3e-05 | 0.000522 | 2.2e-05 | SVC(C=21.54434690031882) | 0.0001 | RobustScaler() | {'clf': SVC(C=21.54434690031882), 'clf__C': 0.... | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18 |
| 2 | 0.000437 | 6.8e-05 | 0.00048 | 4.7e-05 | SVC(C=21.54434690031882) | 0.0001 | passthrough | {'clf': SVC(C=21.54434690031882), 'clf__C': 0.... | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18 |
| 3 | 0.000651 | 0.000173 | 0.00053 | 8.7e-05 | SVC(C=21.54434690031882) | 0.046416 | StandardScaler() | {'clf': SVC(C=21.54434690031882), 'clf__C': 0.... | 0.5 | 0.6 | ... | 0.6 | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.31 | 0.317648 | 16 |
| 4 | 0.001017 | 0.000132 | 0.00053 | 2.6e-05 | SVC(C=21.54434690031882) | 0.046416 | RobustScaler() | {'clf': SVC(C=21.54434690031882), 'clf__C': 0.... | 0.2 | 0.4 | ... | 0.5 | 0.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.21 | 0.242693 | 17 |
| 5 | 0.000463 | 5.4e-05 | 0.000496 | 3.3e-05 | SVC(C=21.54434690031882) | 0.046416 | passthrough | {'clf': SVC(C=21.54434690031882), 'clf__C': 0.... | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18 |
| 6 | 0.000585 | 0.000126 | 0.000486 | 2.6e-05 | SVC(C=21.54434690031882) | 21.544347 | StandardScaler() | {'clf': SVC(C=21.54434690031882), 'clf__C': 21... | 1.0 | 1.0 | ... | 0.9 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.46 | 0.467333 | 8 |
| 7 | 0.000864 | 6.1e-05 | 0.000491 | 2.2e-05 | SVC(C=21.54434690031882) | 21.544347 | RobustScaler() | {'clf': SVC(C=21.54434690031882), 'clf__C': 21... | 1.0 | 1.0 | ... | 0.9 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.47 | 0.473392 | 1 |
| 8 | 0.000385 | 8.7e-05 | 0.000527 | 0.000144 | SVC(C=21.54434690031882) | 21.544347 | passthrough | {'clf': SVC(C=21.54434690031882), 'clf__C': 21... | 1.0 | 1.0 | ... | 0.9 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.46 | 0.467333 | 8 |
| 9 | 0.000589 | 7.4e-05 | 0.000499 | 2.8e-05 | SVC(C=21.54434690031882) | 10000.0 | StandardScaler() | {'clf': SVC(C=21.54434690031882), 'clf__C': 10... | 1.0 | 0.8 | ... | 0.9 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.44 | 0.447661 | 12 |
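
In the `search2` results, every candidate scores 0.0 on splits 5 to 9, which is why the best mean recall stays below 0.5. A likely explanation, offered here as an assumption rather than a confirmed diagnosis: when the final step of the pipeline is the `'passthrough'` placeholder, scikit-learn does not recognise the pipeline as a classifier, so `cv=10` resolves to a plain, unshuffled `KFold` instead of `StratifiedKFold`. Because `y` is sorted by class, half of the test folds then contain no samples of the positive class. The sketch below checks how an integer `cv` is resolved for a placeholder pipeline and for a regular one.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import is_classifier
from sklearn.model_selection import check_cv
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[50:], iris.target[50:]  # sorted: 50 samples of class 1, then 50 of class 2

placeholder_pipe = Pipeline([('scaler', 'passthrough'), ('clf', 'passthrough')])
real_pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC())])

resolved = {}
for name, pipe in [('placeholder', placeholder_pipe), ('real', real_pipe)]:
    # check_cv turns an integer cv into a splitter, depending on the estimator type.
    splitter = check_cv(10, y, classifier=is_classifier(pipe))
    _, first_test = next(iter(splitter.split(X, y)))
    resolved[name] = (type(splitter).__name__, np.unique(y[first_test]))
    print(name, *resolved[name])
```

If this is indeed the cause, passing an explicit splitter such as `cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=123)` to `GridSearchCV`, as the earlier `gs` search does, should keep both classes in every fold.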
