# Sklearn: Pipelines

![](./assets/pipes.jpg)
Source: https://eponline.com/articles/2019/04/19/-/media/ENV/eponline/Images/2019/09/LeadPipesFoundAroundCountry.jpg

## The Sklearn API Design

![](assets/ml_map.png)

Image Source: [Sklearn Tutorials](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

### Design Principles

The [Scikit-Learn API paper](http://arxiv.org/abs/1309.0238) outlines its design principles as:

+ __Consistency__: All objects share a common interface drawn from a limited set of methods, with consistent documentation.

+ __Inspection__: All specified parameter values are exposed as public attributes.

+ __Limited object hierarchy__: Only algorithms are represented by Python classes; datasets are represented in standard formats (```NumPy``` arrays, ```Pandas``` DataFrames, ```SciPy``` sparse matrices) and parameter names use standard Python strings.

+ __Composition__: Many machine learning tasks can be expressed as sequences of more fundamental algorithms, and Scikit-Learn makes use of this wherever possible.

+ __Sensible defaults__: When models require user-specified parameters, the library defines an appropriate default value.
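Some of these principles can be observed directly on estimator objects. The snippet below is a minimal sketch: the choice of ```LogisticRegression``` and ```StandardScaler``` and the value ```C=0.5``` are assumptions made for illustration, not examples taken from the paper.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Sensible defaults: an estimator can be created without any arguments.
default_clf = LogisticRegression()

# Inspection: constructor parameters are stored as public attributes
# and are also available through get_params().
clf = LogisticRegression(C=0.5)  # C=0.5 is an arbitrary illustrative value
print(clf.C)
print(clf.get_params()["C"])

# Consistency: different kinds of estimators share the same core methods.
for est in (clf, StandardScaler()):
    print(type(est).__name__, hasattr(est, "fit"), hasattr(est, "get_params"))
```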

### API Usage

![](assets/estimator-api.png)
Source: http://pages.stat.wisc.edu/~sraschka/teaching/stat479-fs2018/

The ```Scikit-Learn``` estimator API is used as follows (we have seen multiple examples in the modules so far):

+ Select a class of models by importing the appropriate estimator class from ```scikit-learn```.
+ Select a model and set its hyperparameters by instantiating this class with desired values.
+ Arrange data into a features matrix and a target vector.
+ Fit the model to your data by calling the ```fit()``` method on the model instance.
+ Apply the model to new data, for example a test set during evaluation or unseen data in production, by calling the ```predict()``` method on the fitted model instance.
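The steps above can be sketched end to end as follows. The estimator (```KNeighborsClassifier```), the value ```n_neighbors=3```, and the split settings are arbitrary choices for illustration; any other estimator would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Step 1: select a class of models by importing its estimator class.
from sklearn.neighbors import KNeighborsClassifier

# Step 2: instantiate the class with the desired hyperparameters
# (n_neighbors=3 is an arbitrary illustrative value).
model = KNeighborsClassifier(n_neighbors=3)

# Step 3: arrange the data into a features matrix X and a target vector y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Step 4: fit the model to the training data.
model.fit(X_train, y_train)

# Step 5: apply the fitted model to data it has not seen.
y_pred = model.predict(X_test)
print(y_pred.shape)
```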

## Why Pipelines?

Scikit-learn pipelines chain preprocessing steps and a final estimator into a single object, which simplifies the modeling process. They have several key benefits:
+ Workflows become much easier to read and understand.
+ They enforce which steps are applied and in what order, so the same transformations run when fitting and when predicting.
+ They help produce models and results that are reproducible.

![](assets/sklearn-pipeline.png)
Source: http://pages.stat.wisc.edu/~sraschka/teaching/stat479-fs2018/
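To make the benefit concrete, the sketch below compares a manual scale-then-classify workflow with the equivalent pipeline. The dataset, split settings, and ```max_iter=200``` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Manual workflow: each step must be repeated, in the right order,
# for the training data and again for the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics
clf = LogisticRegression(max_iter=200)
clf.fit(X_train_scaled, y_train)
manual_pred = clf.predict(X_test_scaled)

# Pipeline workflow: one object handles scaling and classification.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Both workflows perform the same computation.
print(np.array_equal(manual_pred, pipe_pred))
```

With the pipeline there is no way to forget to scale the test data or to accidentally refit the scaler on it, since ```predict()``` applies the already fitted transformations itself.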

## Hands-on with Pipelines

In [0]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

In [0]:
iris = load_iris()
X, y = iris.data, iris.target

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                        test_size=0.2, shuffle=True,
                                        random_state=42, stratify=y)

In [0]:
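# prepare pipeline: standardize the features, then fit a
# one-vs-rest logistic regression on the scaled data
# (note: the multi_class argument is deprecated as of scikit-learn 1.5)
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(multi_class='ovr', solver='lbfgs'))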

In [4]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='ovr', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

In [0]:
y_pred = pipe.predict(X_test)

In [6]:
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score
from sklearn.metrics import f1_score, roc_auc_score
import pandas as pd

cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.89      0.80      0.84        10
           2       0.82      0.90      0.86        10

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.90        30
weighted avg       0.90      0.90      0.90        30
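
The ```macro avg``` and ```weighted avg``` rows of the report differ only in how per-class scores are combined: the macro average is a plain mean, while the weighted average weights each class by its support. The sketch below checks this for precision on a small set of toy labels, which are invented for illustration and are not the iris predictions above.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, precision_score

# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0, 1]

# With the default average=None, one value per class is returned.
per_class_precision, _, _, support = precision_recall_fscore_support(
    y_true, y_pred)

# Macro average: unweighted mean over classes.
macro = per_class_precision.mean()
# Weighted average: mean weighted by the number of true samples per class.
weighted = np.average(per_class_precision, weights=support)

print(macro, precision_score(y_true, y_pred, average="macro"))
print(weighted, precision_score(y_true, y_pred, average="weighted"))
```

In the iris report the two rows coincide because the stratified test set contains the same number of samples for every class.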



In [7]:
score_df = pd.DataFrame({'accuracy': accuracy_score(y_test, y_pred),
                         'precision': precision_score(y_test, y_pred, average='weighted'),
                         'recall': recall_score(y_test, y_pred, average='weighted'),
                         'f1': f1_score(y_test, y_pred, average='weighted')},
                        index=pd.Index([0]))

score_df

|   | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.9 | 0.902357 | 0.9 | 0.899749 |
