## Pipeline


As you have been learning over the past few weeks, a typical machine learning task involves a varying number of data preprocessing steps, ranging from imputing missing values to creating new features. The work is iterative: we need to run the same or slightly modified preprocessing steps multiple times to understand their efficacy. At the same time, we might need to compare a wide range of algorithms for their usefulness in building models on the preprocessed data. Scikit-learn helps automate such workflows of repetitive tasks with `Pipeline`.

Apart from automating the workflow, `Pipeline` provides another significant benefit. One thing we need to be wary of in machine learning is leaking information from the dev and test data into the training process. A common way to let our guard down is during data preprocessing, for example by applying scaling or normalization to the entire training dataset before it is split into train and dev folds for model building and hyperparameter tuning. The scaling statistics then already reflect the dev folds. Chaining the preprocessing and model-building steps in a pipeline helps you avoid such mistakes, because every step is fit only on the data passed to the pipeline's `fit` method.
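As a minimal sketch of the leakage point, assuming scikit-learn's bundled Iris data and a plain `StandardScaler` followed by `LogisticRegression` (both chosen here purely for illustration, as is `max_iter=1000`), passing the whole pipeline to `cross_val_score` means the scaler is refit on the training portion of each fold:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is a pipeline step, so it is re-fit on the training part of
# every fold instead of being fit once on all of X.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# One accuracy score per fold; the held-out fold never influences the scaler.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

The leaky alternative would be to call `StandardScaler().fit_transform(X)` once and then cross-validate the classifier on the already scaled array.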





Consider a machine learning workflow with the following tasks:

- Imputing missing values
- Creating polynomial features
- Applying feature scaling
- Fitting a classification model

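We will combine the above four tasks into a pipeline to build a classification model on the Iris dataset. We will build the pipeline for two classification algorithms: logistic regression and random forest. The Iris dataset has no missing values, so the imputation step is included only to illustrate the mechanics.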



### Load the Iris dataset

In [129]:

import warnings
warnings.filterwarnings('ignore')
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target


### Split the dataset into training and testing sets

In [127]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.30, random_state=1)

### Create a transformer instance


`ColumnTransformer` enables separate transformations of different columns or subsets of columns. Create a transformer that applies mean imputation to the first and second columns and generates polynomial features from the third and fourth columns.

In [156]:
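import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
# Mean-impute columns 0 and 1; expand columns 2 and 3 into degree-2 polynomial features
transformer = ColumnTransformer([("norm1", SimpleImputer(missing_values=np.nan, strategy='mean'), [0, 1]),
                                 ('poly', PolynomialFeatures(2), [2, 3])])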
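To see what a transformer like this produces, here is a minimal sketch on a hypothetical three-row array (`X_toy` and `toy_transformer` are illustrative names, not part of the workflow above). It assumes the default `remainder='drop'`, which has no effect here because all four columns are assigned to a transformer:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data: four columns, one missing value in column 0.
X_toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [np.nan, 4.0, 5.0, 6.0],
    [3.0, 6.0, 7.0, 8.0],
])

toy_transformer = ColumnTransformer([
    ("impute", SimpleImputer(missing_values=np.nan, strategy="mean"), [0, 1]),
    ("poly", PolynomialFeatures(2), [2, 3]),
])

X_out = toy_transformer.fit_transform(X_toy)

# 2 imputed columns + 6 polynomial terms (1, a, b, a^2, a*b, b^2) = 8 columns
print(X_out.shape)   # (3, 8)
# The missing entry becomes the mean of the observed values: (1 + 3) / 2
print(X_out[1, 0])   # 2.0
```

The output columns follow the order in which the transformers are listed, so the imputed columns come first and the polynomial terms after them.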

### Chain the transformer, scaler, and classifier into a pipeline


In [157]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
logistic_model = Pipeline(steps=[('transformer', transformer),
                                 ('scaler', StandardScaler()),
                                 ('LR', LogisticRegression())])

In [158]:
from sklearn import ensemble
# Same preprocessing steps, with a random forest as the final estimator
rf_model = Pipeline(steps=[('transformer', transformer),
                           ('scaler', StandardScaler()),
                           ('RF', ensemble.RandomForestClassifier())])

### Finally, fit the two pipeline estimators and evaluate them

In [159]:

lr = logistic_model.fit(X_train,y_train)

rf = rf_model.fit(X_train,y_train)

In [160]:
from sklearn import metrics
y_pred = lr.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

0.8888888888888888

In [161]:
y_pred = rf.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

0.9555555555555556
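Because a pipeline is itself an estimator, it can also be handed to a tool such as `GridSearchCV` for the hyperparameter tuning mentioned at the start of this section. The following is a sketch only: the grid values are arbitrary examples, and `max_iter=1000` is an assumption added to reduce the chance of convergence warnings. Parameters of a step are addressed as `<step name>__<parameter name>`:

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

pipe = Pipeline(steps=[
    ("transformer", ColumnTransformer([
        ("norm1", SimpleImputer(strategy="mean"), [0, 1]),
        ("poly", PolynomialFeatures(2), [2, 3]),
    ])),
    ("scaler", StandardScaler()),
    ("LR", LogisticRegression(max_iter=1000)),
])

# Step parameters use double underscores; nested steps chain the same way.
# The candidate values below are arbitrary and only for illustration.
param_grid = {
    "LR__C": [0.1, 1.0, 10.0],
    "transformer__poly__degree": [1, 2],
}

# Every candidate is cross-validated on the training split only, and all
# preprocessing steps are re-fit inside each fold.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

The test split is touched only once, at the very end, to score the best pipeline found by the search.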

See [this Stack Overflow question](https://stackoverflow.com/questions/40708077/what-is-the-difference-between-pipeline-and-make-pipeline-in-scikit) to learn the difference between `Pipeline` and `make_pipeline`.
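In short, the practical difference is in how steps are named. A small sketch, using a scaler and logistic regression as stand-in steps (the variable names `named` and `auto` are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline: you choose the step names yourself.
named = Pipeline(steps=[("scaler", StandardScaler()),
                        ("LR", LogisticRegression())])

# make_pipeline: step names are generated from the lowercased class names.
auto = make_pipeline(StandardScaler(), LogisticRegression())

print([name for name, _ in named.steps])  # ['scaler', 'LR']
print([name for name, _ in auto.steps])   # ['standardscaler', 'logisticregression']
```

The step names matter mostly when you refer to step parameters later, for example as `LR__C` versus `logisticregression__C`.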