# PIPELINES

In this tutorial, you will learn how to use pipelines to clean up your modeling code.

# Introduction

Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

`Cleaner Code`: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
    
`Fewer Bugs`: There are fewer opportunities to misapply a step or forget a preprocessing step.
    
`Easier to Productionize`: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
    
`More Options for Model Validation`: You will see an example in the next tutorial, which covers cross-validation.
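
To make the first two benefits concrete, here is a minimal sketch on made-up data (the arrays `X_tr`, `y_tr`, and `X_va` are hypothetical and not part of the Melbourne dataset). It does imputation and model fitting by hand, then wraps the same two steps in a `Pipeline`. Both routes should give the same predictions, but the pipeline version never has to juggle intermediate copies of the training and validation data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data with a missing value in each split
X_tr = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_tr = np.array([1.0, 2.0, 3.0, 4.0])
X_va = np.array([[5.0, np.nan], [6.0, 60.0]])

# Manual route: each step must be applied to both splits, in the right order
imputer = SimpleImputer(strategy="mean")
X_tr_imputed = imputer.fit_transform(X_tr)  # learn the column means on training data
X_va_imputed = imputer.transform(X_va)      # reuse those means on validation data
manual_model = LinearRegression().fit(X_tr_imputed, y_tr)
manual_preds = manual_model.predict(X_va_imputed)

# Pipeline route: one object remembers the steps and their order
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])
pipe.fit(X_tr, y_tr)
pipe_preds = pipe.predict(X_va)

same_predictions = np.allclose(manual_preds, pipe_preds)
print(same_predictions)
```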

# Example

As in the previous tutorial, we will work with the [Melbourne Housing dataset](https://github.com/bharathkreddy/ML-Bootcamp/blob/master/data/melb_data.csv).

We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in `X_train`, `X_valid`, `y_train`, and `y_valid`. 

We want to practice sklearn pipelines on columns with low cardinality and missing values. You can find the details about how to do this in [this notebook](https://github.com/bharathkreddy/ML-Bootcamp/blob/master/Working%20with%20categorical%20data.ipynb).

You can follow the code in the notebook above until line 80 to arrive at `X_train`, `X_valid`, `y_train`, and `y_valid`.
> I am going to take a shortcut to get there, but I suggest you follow what you have learnt.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('./cat_the_dat/data/melb_data.csv')

y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

***
We take a peek at the training data with the `head()` method below. Notice that the data contains both categorical data and columns with missing values. With a pipeline, it's easy to deal with both!
***

In [3]:
X_train.head()

| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 1.0 | 0.0 | NaN | 1940.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 1.0 | 193.0 | NaN | NaN | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 1.0 | 555.0 | NaN | NaN | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 1.0 | 265.0 | NaN | 1995.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 2.0 | 673.0 | 673.0 | 1970.0 | -37.7623 | 144.8272 | 4217.0 |


***
We construct the full pipeline in three steps.

### Step 1: Define Preprocessing Steps

Similar to how a pipeline bundles together preprocessing and modeling steps, we use the `ColumnTransformer` class to bundle together different preprocessing steps. The code below:

1. Imputes missing values in numerical data, and
2. Imputes missing values and applies a one-hot encoding to categorical data.

***

In [5]:
from sklearn.pipeline import Pipeline            # chains several steps into a single estimator
from sklearn.impute import SimpleImputer         # fills in missing values
from sklearn.preprocessing import OneHotEncoder  # one-hot encodes categorical columns
from sklearn.compose import ColumnTransformer    # applies different transformers to different columns

# Numerical columns: replace missing values with the column mean.

numerical_transformer = SimpleImputer(strategy='mean')

# Categorical columns need two steps, which we combine into a single pipeline.
# STEP 1: impute missing values, since categorical columns might also have gaps.
# NOTE: we can't use the mean here, because categories do not have a mean.
# Instead we replace missing entries with the most frequent category (the mode).
# STEP 2: AFTER imputing the missing values, we apply one-hot encoding.
# The syntax is Pipeline(steps=[(a), (b), ...]), so a pipeline is a LIST of steps.
# Each step is written as ("name", transformer), so steps are TUPLES.
# The name of a step can be anything you want.

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# Bundle preprocessing for numerical and categorical data.
# ColumnTransformer applies each transformer to its own set of columns.
# The syntax is similar: `transformers` is a LIST of TUPLES, each written as
# ("name", transformer object defined above, columns to apply that transformer to).


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
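
The `handle_unknown='ignore'` argument matters when validation data contains a category that never appeared in the training data. A small sketch with made-up values (the `train_types` and `valid_types` frames are hypothetical): the unseen category is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the category "t" appears only in the validation frame
train_types = pd.DataFrame({"Type": ["h", "u", "h"]})
valid_types = pd.DataFrame({"Type": ["u", "t"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_types)  # learns the categories "h" and "u"

# The encoder returns a sparse matrix by default, so convert it for display
encoded_valid = encoder.transform(valid_types).toarray()
print(encoder.categories_)
print(encoded_valid)
```

With the default setting (`handle_unknown='error'`), the same `transform` call would raise an error on the unseen category.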


***
### Next, fit and transform our data using the preprocessor
***

In [13]:
X_train_transformed = preprocessor.fit_transform(X_train)

In [14]:
# Let's see what our transformed data looks like.
X_train_transformed

array([[1.000e+00, 5.000e+00, 3.182e+03, ..., 1.000e+00, 0.000e+00,
        0.000e+00],
       [2.000e+00, 8.000e+00, 3.016e+03, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       [3.000e+00, 1.260e+01, 3.020e+03, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       ...,
       [4.000e+00, 6.700e+00, 3.058e+03, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [3.000e+00, 1.200e+01, 3.073e+03, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [4.000e+00, 6.400e+00, 3.011e+03, ..., 0.000e+00, 1.000e+00,
        0.000e+00]])

In [15]:
# Oh! So the transformation removes the column headers. No worries, we will try to add them back using our original dataset.
# Also, the pandas DataFrame has been converted to a NumPy n-dimensional array.
type(X_train_transformed)

numpy.ndarray

In [18]:
# To get this back to a DataFrame, we first convert it to a DataFrame, then try to add column headers from our original dataset.
X_train_transformed = pd.DataFrame(X_train_transformed)
type(X_train_transformed)

pandas.core.frame.DataFrame

In [17]:
X_train_transformed.columns = X_train.columns 

ValueError: Length mismatch: Expected axis has 28 elements, new values have 15 elements

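## What's happening? Why did we get an error?

The error says "Length mismatch: Expected axis has 28 elements, new values have 15 elements".
Can you guess why the number of columns changed?
Yes, our one-hot encoder replaced each categorical column with one indicator column per category. The 3 categorical columns became 16 indicator columns, and together with the 12 numerical columns that makes 28. Within each original column's group of indicators, exactly one value per row is 1 ("hot") and the rest are 0. Note also that the output lists the numerical columns first, because that is the order of the transformers in `preprocessor`.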

In [19]:
X_train_transformed

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,27
0,1.0,5.0,3182.0,1.0,1.0,1.0,0.0,153.764119,1940.000000,-37.85984,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,2.0,8.0,3016.0,2.0,2.0,1.0,193.0,153.764119,1964.839866,-37.85800,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,3.0,12.6,3020.0,3.0,1.0,1.0,555.0,153.764119,1964.839866,-37.79880,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,3.0,13.0,3046.0,3.0,1.0,1.0,265.0,153.764119,1995.000000,-37.70830,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,3.0,13.3,3020.0,3.0,1.0,2.0,673.0,673.000000,1970.000000,-37.76230,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10859,3.0,5.2,3056.0,3.0,1.0,2.0,212.0,153.764119,1964.839866,-37.77695,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
10860,3.0,10.5,3081.0,3.0,1.0,1.0,748.0,101.000000,1950.000000,-37.74160,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10861,4.0,6.7,3058.0,4.0,2.0,2.0,441.0,255.000000,2002.000000,-37.73572,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
10862,3.0,12.0,3073.0,3.0,1.0,1.0,606.0,153.764119,1964.839866,-37.72057,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


*** 

But as you will see, we don't need our data in pandas DataFrame form to run models. Below, I add a model to our pipeline, and it runs without converting anything back to a DataFrame.

For now, don't worry if this is a lot of information. Just remember these things:
1. You can also apply one transformation at a time.
2. Pipelines are good to know but not always necessary.
3. If you can read pipeline code and recreate its steps one transformation at a time, that is more than enough.
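
To illustrate what "one transformation at a time" can look like, here is a sketch on a small made-up frame (`toy`, `num_cols`, and `cat_cols` are hypothetical names). It applies each transformer by hand, stacks the results, and checks that the outcome matches what a `ColumnTransformer` produces.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with a gap in each column
toy = pd.DataFrame({
    "Rooms": [2.0, np.nan, 3.0, 4.0],
    "Type": pd.Series(["h", "u", np.nan, "h"], dtype="object"),
})
num_cols, cat_cols = ["Rooms"], ["Type"]

# One transformation at a time
num_imputed = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])
cat_imputed = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])
cat_encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(cat_imputed).toarray()
manual = np.hstack([num_imputed, cat_encoded])  # numerical columns first, then indicators

# The same steps bundled together
toy_bundled = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), num_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
]).fit_transform(toy)
if hasattr(toy_bundled, "toarray"):
    toy_bundled = toy_bundled.toarray()

matches = np.allclose(manual.astype(float), np.asarray(toy_bundled, dtype=float))
print(matches)
```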

***
### Adding a model to the pipeline
We can even add a model to the pipeline. Let's try that.

***

In [20]:
# STEP 1 : Import and instantiate the model

from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()

# STEP 2: Add it to the pipeline. The syntax for the pipeline is the same as shown above.

my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', linear_model)
                             ])

***
The final step is to fit the pipeline on the training data and predict on the validation data.

***

In [21]:
# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

In [24]:
from sklearn.metrics import r2_score  # r2_score computes the R-squared metric we learnt about in linear regression.

In [25]:
r2_score(y_valid, preds)  # our model has an R-squared of about 0.6136, and you can see that pipelines work just fine.

0.6136103508606694
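
As a reminder of what `r2_score` reports: R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. A short check with made-up numbers (`y_true_demo` and `y_pred_demo` are hypothetical, not taken from the housing data):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total sum of squares
r2_by_hand = 1 - ss_res / ss_tot

agrees = np.isclose(r2_by_hand, r2_score(y_true_demo, y_pred_demo))
print(r2_by_hand, agrees)
```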

# END OF NOTEBOOK