# Introduction to Pipelines                    29/June/2020
Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so that you can use the whole bundle as if it were a single step.

Pipelines offer several benefits:

* **Cleaner Code**: Accounting for data at each step of preprocessing can get messy. With a pipeline, you do not need to manually keep track of the training and validation data at each step.
* **Fewer Bugs**: There are fewer opportunities to misapply a step or forget a preprocessing step.
* **Easier to Productionize**: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. A pipeline helps because the preprocessing travels with the model as a single object.
* **More Options for Model Validation**: You will see an example in the next tutorial, which covers cross-validation.

## Three steps are required to build a full pipeline
* **Step 1: Defining the preprocessing steps** - This step handles the following:
     * Handling of columns with numerical data using imputation techniques
     * Handling of columns with categorical data using encoding techniques
* **Step 2: Defining the model**
* **Step 3: Creation and evaluation of pipeline**
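
The three steps above can be sketched end to end on a tiny, made-up dataset. This is only an illustrative sketch: the `toy_*` names, the column names `rooms` and `kind`, and the choice of `LinearRegression` are assumptions made for brevity and are not part of this notebook's Melbourne housing workflow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one numerical and one categorical column
toy_X = pd.DataFrame({
    "rooms": [2.0, 3.0, np.nan, 4.0],
    "kind": pd.Series(["h", "u", "h", np.nan], dtype=object),
})
toy_y = pd.Series([300.0, 250.0, 320.0, 410.0])

# Step 1: define the preprocessing (imputation for numbers,
# imputation followed by one-hot encoding for categories)
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), ["rooms"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["kind"]),
])

# Step 2: define the model (a simple stand-in for this sketch)
toy_model = LinearRegression()

# Step 3: bundle both into one pipeline, then fit and predict in single calls
toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("model", toy_model),
])
toy_pipeline.fit(toy_X, toy_y)
toy_pred = toy_pipeline.predict(toy_X)
print(toy_pred.shape)
```

The raw frame, missing values included, goes straight into `fit` and `predict`; the pipeline applies the imputation and encoding internally.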


#### Notebook References:
* https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf
* https://www.kaggle.com/alexisbcook/pipelines
* https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
* Dataset - https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv('./melb_data.csv')

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [5]:
# Defining target and predictors
y = data['Price'] # target/dependent variable/label
x = data.drop(columns='Price') # features/independent variables/attributes/predictors

In [6]:
# Splitting the data into training and validation sets (80% / 20%)
x_train_full, x_valid_full, y_train_full, y_valid_full = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=0)

In [8]:
# checking length of training and validation dataset
len(x_train_full),len(x_valid_full),len(y_train_full),len(y_valid_full)

(10864, 2716, 10864, 2716)

In [12]:
# Creating a list to store the numerical data columns from the training data set.
num_cols = [col for col in x_train_full.columns if x_train_full[col].dtype in ['float64','int64']]
num_cols

['Rooms',
 'Distance',
 'Postcode',
 'Bedroom2',
 'Bathroom',
 'Car',
 'Landsize',
 'BuildingArea',
 'YearBuilt',
 'Lattitude',
 'Longtitude',
 'Propertycount']

In [14]:
# Creating a list of the low-cardinality categorical columns
# (object dtype with fewer than 10 unique values) from the training data set.
cat_cols = [col for col in x_train_full.columns
            if x_train_full[col].dtype == 'object' and x_train_full[col].nunique() < 10]
cat_cols

['Type', 'Method', 'Regionname']
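
As a possible alternative to checking `dtype` column by column, pandas' `select_dtypes` can do the type filtering. The sketch below uses a small made-up frame; the `toy` names and the `max_cardinality` threshold of 3 are assumptions for illustration (the notebook itself uses a threshold of 10).

```python
import pandas as pd

# Hypothetical toy frame: two numerical columns, one low-cardinality
# categorical column, and one high-cardinality text column
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Distance": [2.5, 8.0, 13.3],
    "Type": pd.Series(["h", "u", "h"], dtype=object),
    "Address": pd.Series(["1 A St", "2 B Rd", "3 C Ave"], dtype=object),
})
max_cardinality = 3  # illustrative threshold only

# All numeric columns (integer and float)
toy_num_cols = toy.select_dtypes(include="number").columns.tolist()

# Object columns with fewer unique values than the threshold
toy_cat_cols = [
    col for col in toy.select_dtypes(include="object").columns
    if toy[col].nunique() < max_cardinality
]

print(toy_num_cols)
print(toy_cat_cols)
```

Here `Address` is excluded because every value is unique, which is the same reason high-cardinality columns such as `Suburb` and `Address` drop out of `cat_cols` above: one-hot encoding them would create a very large number of columns.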

In [21]:
# Preparing the training and validation set for the selected numerical and categorical column features only

sel_cols = cat_cols + num_cols
x_train = x_train_full[sel_cols].copy()
x_valid = x_valid_full[sel_cols].copy()

In [22]:
len(x_train.columns),len(x_valid.columns)

(15, 15)

In [23]:
x_train.head()

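|  | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 1.0 | 0.0 | NaN | 1940.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 1.0 | 193.0 | NaN | NaN | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 1.0 | 555.0 | NaN | NaN | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 1.0 | 265.0 | NaN | 1995.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 2.0 | 673.0 | 673.0 | 1970.0 | -37.7623 | 144.8272 | 4217.0 |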


## Step 1 - Defining the pipeline for preprocessing of data

In [24]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

In [34]:
# Preprocessing of numerical data
# strategy='constant' replaces missing numerical values with fill_value, which defaults to 0
num_trans = SimpleImputer(strategy='constant')

# Preprocessing of the categorical data
cate_trans = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='most_frequent')),
    ('OneHot',OneHotEncoder(handle_unknown='ignore'))])

# Bundling the preprocessors to handle numerical and categorical data

preprocessor = ColumnTransformer(transformers=[
    ('num',num_trans,num_cols),
    ('cat',cate_trans,cat_cols)
])
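
To see what a preprocessor of this shape produces, the sketch below rebuilds the same structure on a small made-up frame and inspects the transformed array. The `toy` names and values are assumptions for illustration; only the transformer configuration mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "Rooms": [2.0, np.nan, 4.0, 3.0],
    "Type": pd.Series(["h", "u", "h", np.nan], dtype=object),
})

# Same structure as the notebook's preprocessor, on the toy columns
toy_num_trans = SimpleImputer(strategy="constant")  # fills with 0 by default
toy_cat_trans = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("OneHot", OneHotEncoder(handle_unknown="ignore")),
])
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", toy_num_trans, ["Rooms"]),
    ("cat", toy_cat_trans, ["Type"]),
])

toy_out = toy_preprocessor.fit_transform(toy)
# The result may be a sparse matrix depending on its density; densify to inspect
toy_out = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)
print(toy_out.shape)
print(toy_out)
```

The output has one column for `Rooms` (with the missing value replaced by 0) followed by one column per category of `Type` (with the missing value replaced by the most frequent category, `h`).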

## Step 2 - Defining the model

In [35]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators = 100,random_state=0)


## Step 3 - Creation and evaluation of pipeline

In [37]:
# bundling the preprocessing steps and modelling steps into a single pipeline

final_pipeline = Pipeline(steps=[
    ('preprocessor',preprocessor),
    ('model',model)
])

# Fitting the pipeline: preprocesses the training data, then fits the model
final_pipeline.fit(x_train,y_train_full)

# Preprocessing the validation dataset and making predictions using the final_pipeline
pred = final_pipeline.predict(x_valid)

In [38]:
# evaluating the model
from sklearn.metrics import mean_absolute_error

print('MAE: ', mean_absolute_error(y_valid_full, pred))  # signature is (y_true, y_pred)

MAE:  160679.18917034855
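
For reference, mean absolute error is the average of the absolute differences between the true and predicted values. The small check below computes it by hand and compares it with `mean_absolute_error`; the three values in each array are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true prices and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

# Manual computation: mean of |y_true - y_hat| = (10 + 10 + 30) / 3
manual_mae = np.mean(np.abs(y_true - y_hat))

# Library computation; arguments are (y_true, y_pred)
sklearn_mae = mean_absolute_error(y_true, y_hat)

print(manual_mae, sklearn_mae)
```

Because the metric uses absolute differences, swapping the two arguments does not change the value, but passing the true values first matches the documented signature.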


## Conclusion

Pipelines make code easier to reuse. Once a pipeline has been fitted on the training dataset, the same transformation and preprocessing steps are applied to the validation and test datasets with a single call.
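
The reuse described here can be sketched as follows: a pipeline is fitted once and then applied to new rows without repeating any preprocessing code. The data and the `train_X`, `new_X`, and `reuse_pipeline` names are assumptions for illustration, and the forest is kept small so the example runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data
train_X = pd.DataFrame({
    "Rooms": [2.0, 3.0, np.nan, 4.0],
    "Type": pd.Series(["h", "u", "h", "t"], dtype=object),
})
train_y = [300.0, 250.0, 320.0, 410.0]

reuse_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="constant"), ["Rooms"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
    ])),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
reuse_pipeline.fit(train_X, train_y)

# New rows: one missing value and one category ('x') never seen in training
new_X = pd.DataFrame({
    "Rooms": [np.nan, 5.0],
    "Type": pd.Series(["u", "x"], dtype=object),
})
new_pred = reuse_pipeline.predict(new_X)  # imputation and encoding happen inside
print(new_pred)
```

With `handle_unknown="ignore"`, the unseen category is encoded as all zeros instead of raising an error, so the fitted pipeline can be applied to new data as is.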

Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.

They have several key benefits:
* They make the workflow much easier to read and understand.
* They enforce the implementation and order of steps in the project.
* These in turn make the work much more reproducible.