<a href="https://colab.research.google.com/github/cagBRT/Pipelines/blob/main/2_PipeLines_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Pipelines**

The purpose of the pipeline is to assemble **several steps that can be cross-validated together** while setting different parameters.

A pipeline can be used to chain multiple estimators into one. <br>
This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification.<br>

Pipeline serves multiple purposes here:<br>

- Convenience and encapsulation
You only have to call fit and predict once on your data to fit a whole sequence of estimators.

- Joint parameter selection
You can grid search over parameters of all estimators in the pipeline at once.

- Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
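
The safety point can be illustrated with a minimal sketch. The iris dataset, the StandardScaler step and the step names 'scale' and 'clf' are assumptions chosen only for illustration. Because the scaler sits inside the pipeline, cross_val_score refits it on the training portion of each fold, so no statistics from the held-out fold reach the model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative pipeline: scaling followed by a classifier
safe_pipe = Pipeline([('scale', StandardScaler()), ('clf', SVC())])

# The whole pipeline is cloned and refit on each training fold,
# so the scaler never sees the held-out samples
scores = cross_val_score(safe_pipe, X, y, cv=5)
print(scores.mean())
```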

**All estimators in a pipeline, except the last one, must be transformers** (i.e. must have a transform method). <br>
The last estimator may be any type (transformer, classifier, etc.).
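
A quick way to see this rule in action is to check which estimators expose a transform method, and then try to use a classifier as an intermediate step. This is only a sketch: the step names are made up, and depending on the scikit-learn version the error may be raised when the pipeline is constructed or when it is fitted, so both calls sit inside the try block.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# PCA is a transformer, SVC is not
pca_has_transform = hasattr(PCA(), 'transform')
svc_has_transform = hasattr(SVC(), 'transform')
print(pca_has_transform, svc_has_transform)

# A classifier in a non-final position is rejected with a TypeError
rejected = False
try:
    bad_pipe = Pipeline([('clf_first', SVC()), ('clf_last', SVC())])
    bad_pipe.fit(X, y)
except TypeError as err:
    rejected = True
    print(err)
```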

[Pipeline User Guide](https://scikit-learn.org/stable/modules/compose.html#pipeline)


The Pipeline is built using a list of (key, value) pairs, where <br>
- the key is a string containing the name you want to give this step and <br>
- value is an estimator object:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

This pipeline applies Principal Component Analysis (PCA) and then a Support Vector Machine (SVM) classifier

In [None]:
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
pipe
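
As a minimal sketch of the "fit and predict once" idea, the same two-step pipeline can be trained and used like a single estimator. The iris dataset and n_components=2 are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# n_components=2 is an arbitrary choice for this sketch
pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('clf', SVC())])

# One call fits the PCA, transforms the data, and fits the SVC
pipe.fit(X, y)

# One call applies the fitted PCA and then predicts with the SVC
predictions = pipe.predict(X[:5])
print(predictions)
```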

The utility function make_pipeline is a shorthand for constructing pipelines;<br>

it takes a variable number of estimators and returns a pipeline, filling in the names automatically:

Import the make_pipeline function

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer

This pipeline first binarizes the data, then passes the result to the Naive Bayes model

Binarize data (set feature values to 0 or 1) according to a threshold.

In [None]:
make_pipeline(Binarizer(), MultinomialNB())
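
To see the names that make_pipeline fills in, the steps attribute can be inspected. In this sketch the variable names auto_pipe and step_names are illustrative; the generated step names are the lowercased class names.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer

auto_pipe = make_pipeline(Binarizer(), MultinomialNB())

# Each step is a (name, estimator) pair; collect only the names
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)  # ['binarizer', 'multinomialnb']
```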

Example of [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html)

In [None]:
from sklearn.preprocessing import Binarizer
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
transformer = Binarizer().fit(X)
transformer

transformer.transform(X)
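
The effect of the threshold can be sketched by comparing the default (0.0) with a stricter value. The threshold of 1.0 and the variable names below are chosen only for illustration; values strictly greater than the threshold become 1, all others become 0.

```python
from sklearn.preprocessing import Binarizer

X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

# Default threshold 0.0: every value > 0 becomes 1
default_result = Binarizer().fit_transform(X)
print(default_result)

# Threshold 1.0: only values > 1 become 1
strict_result = Binarizer(threshold=1.0).fit_transform(X)
print(strict_result)
```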

**To access steps in the pipeline**

In [None]:
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
print("Pipe",pipe)
print("Pipe steps",pipe.steps[0], pipe.steps[1])
print("Pipe[0]=",pipe[0])
print("Indexing by step name:",pipe['reduce_dim'])
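
Two further ways of reaching steps may be useful: the named_steps attribute and slicing. The sketch below assumes the same two-step pipeline; first_step and sub_pipe are illustrative names. Slicing returns a new Pipeline that contains only the selected steps.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# named_steps gives access to each estimator by its step name
first_step = pipe.named_steps['reduce_dim']
print(first_step)

# Slicing keeps the (name, estimator) pairs and returns a sub-pipeline
sub_pipe = pipe[:1]
print(sub_pipe)
print(len(pipe))
```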

**Nested Parameters**<br>
Use the estimator__parameter syntax (the step name, two underscores, then the parameter name) to access the parameters of an estimator in the pipeline

In [None]:
#set the 'C' parameter for the SVM
pipe.set_params(clf__C=10)
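
The same naming scheme works for reading parameters back and for setting several at once. This is a small sketch; setting n_components=2 on the PCA step is an assumption added for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# Set one parameter on each step in a single call
pipe.set_params(clf__C=10, reduce_dim__n_components=2)

# Read the values back, either through get_params or from the step itself
print(pipe.get_params()['clf__C'])
print(pipe.named_steps['reduce_dim'].n_components)
```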

**Setting parameters for Grid Searches**

In this example, the grid search tries 2, 5 and 10 components for the PCA and 0.1, 10 and 100 for the 'C' parameter of the SVM

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = dict(reduce_dim__n_components=[2, 5, 10],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
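
The grid search does nothing until it is fitted. A minimal sketch of running it follows, assuming the digits dataset (it has 64 features, enough for 10 PCA components), a 500-sample subset and cv=3 to keep the run short. None of these choices come from the text above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # subset only to keep the search fast

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
param_grid = dict(reduce_dim__n_components=[2, 5, 10],
                  clf__C=[0.1, 10, 100])

# 3 x 3 = 9 parameter combinations, each evaluated with 3-fold CV
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_score_)
```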

**Getting the features of the pipeline**

In [None]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest

In [None]:
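from sklearn.linear_model import LogisticRegression

# Whole steps can also be searched over: 'reduce_dim' is either skipped
# ('passthrough') or a PCA, and 'clf' is either of the two classifiers
param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)

# Build and fit a new pipeline that keeps the 2 best features of iris
iris = load_iris()
pipe = Pipeline(steps=[
   ('select', SelectKBest(k=2)),
   ('clf', LogisticRegression())])
pipe.fit(iris.data, iris.target)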

pipe[:-1].get_feature_names_out()