In this notebook, we're going to learn about pipelines, which keep our data preprocessing and modeling code organized by bundling the preprocessing and modeling steps together.<br>
This notebook is inspired by this [notebook](https://www.kaggle.com/alexisbcook/pipelines).<br>
We're going to use the dataset from [Melbourne Housing Snapshot](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home).<br>
We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in X_train, X_valid, y_train, and y_valid.<br>
Let's get started.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
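
To make the column selection above concrete, here is a minimal sketch on a made-up frame. The `toy` data, the `Address` column values, and the `max_cardinality` threshold of 4 are all assumptions for illustration (the notebook itself uses 10 on the real data). The string columns are created with an explicit `object` dtype so the same dtype check as above applies.

```python
import pandas as pd

# Hypothetical toy frame: 'Type' has few unique values, 'Address' has many
toy = pd.DataFrame({
    "Type": pd.Series(["h", "u", "h", "t"], dtype="object"),
    "Address": pd.Series(["1 A St", "2 B St", "3 C St", "4 D St"], dtype="object"),
    "Rooms": [2, 3, 4, 1],
    "Distance": [2.5, 8.0, 13.3, 5.0],
})

max_cardinality = 4  # illustrative threshold only; the notebook uses 10

# Text columns whose number of unique values is below the threshold
low_card_cols = [c for c in toy.columns
                 if toy[c].dtype == "object" and toy[c].nunique() < max_cardinality]

# Numerical columns, selected by dtype as in the cell above
num_cols = [c for c in toy.columns if toy[c].dtype in ["int64", "float64"]]

print(low_card_cols)
print(num_cols)
```

Here `Address` is dropped because every row has a different value, which would produce one one-hot column per row.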

Let's see how this dataset looks.

In [2]:
X_train.head()

Unnamed: 0,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
12167,u,S,Southern Metropolitan,1,5.0,3182.0,1.0,1.0,1.0,0.0,,1940.0,-37.85984,144.9867,13240.0
6524,h,SA,Western Metropolitan,2,8.0,3016.0,2.0,2.0,1.0,193.0,,,-37.858,144.9005,6380.0
8413,h,S,Western Metropolitan,3,12.6,3020.0,3.0,1.0,1.0,555.0,,,-37.7988,144.822,3755.0
2919,u,SP,Northern Metropolitan,3,13.0,3046.0,3.0,1.0,1.0,265.0,,1995.0,-37.7083,144.9158,8870.0
6043,h,S,Western Metropolitan,3,13.3,3020.0,3.0,1.0,2.0,673.0,673.0,1970.0,-37.7623,144.8272,4217.0


Unlike the earlier notebooks on missing values and categorical variables, here we're going to walk through the whole pipelining process in 3 steps.

# Step 1: Define Preprocessing Steps.
To bundle together the different preprocessing steps, we're going to use the **ColumnTransformer**.<br>
The code we'll write below:
* Imputes missing values in *numerical* data, and
* Imputes missing values and applies a one-hot-encoding to *categorical* data

In [3]:
# first import all the necessary things 
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

The necessary libraries have been imported.<br>
Now, we're going to create an imputer for the numerical data.<br>
We set the strategy to *constant*, so every missing value is replaced with fill_value. Since we don't pass fill_value, missing numerical values are filled with 0. [see here](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

In [4]:
# preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy = 'constant')
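
As a quick check of what the *constant* strategy does, here is a small sketch on a made-up array. The array `X_num` and the alternative `fill_value=-1` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numerical data with two missing entries
X_num = np.array([[1.0, np.nan],
                  [np.nan, 4.0],
                  [3.0, 6.0]])

# Default fill_value: numerical gaps are replaced with 0
zero_filled = SimpleImputer(strategy="constant").fit_transform(X_num)

# Explicit fill_value: gaps are replaced with the value we choose
custom_filled = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X_num)

print(zero_filled)
print(custom_filled)
```

Whether 0 is a sensible stand-in depends on the column; for something like YearBuilt it is clearly not a real value, so this is a convenient choice rather than a principled one.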

Next comes the preprocessing for categorical data (as outlined at the start of Step 1).<br>
As stated above, we're going to use both the imputer and one-hot encoding in the pipeline.<br>
We name the imputation step '*categ_imputer*' and the one-hot-encoding step '*categ_onehot*', and pass them to **Pipeline** as a list of tuples.<br>

In [5]:
# preprocessing for categorical data
categorical_transformer = Pipeline(steps = [
    ('categ_imputer', SimpleImputer(strategy = 'most_frequent')), 
    ('categ_onehot', OneHotEncoder(handle_unknown = 'ignore'))
])
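
To see what these two steps do together, here is a minimal sketch on a made-up single-column frame. The `toy_train` and `toy_valid` data are assumptions for illustration; the step names match the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one missing value in training, one unseen category ('t') in validation
toy_train = pd.DataFrame({"Type": pd.Series(["h", "u", "h", np.nan], dtype="object")})
toy_valid = pd.DataFrame({"Type": pd.Series(["u", "t"], dtype="object")})

toy_cat = Pipeline(steps=[
    ("categ_imputer", SimpleImputer(strategy="most_frequent")),
    ("categ_onehot", OneHotEncoder(handle_unknown="ignore")),
])

# The missing value becomes 'h' (the most frequent), then columns are created for 'h' and 'u'
train_encoded = toy_cat.fit_transform(toy_train).toarray()

# The unseen category 't' is encoded as all zeros instead of raising an error
valid_encoded = toy_cat.transform(toy_valid).toarray()

print(train_encoded)
print(valid_encoded)
```

The `handle_unknown = 'ignore'` setting is what lets the validation row with `'t'` pass through; without it, the encoder would raise an error on categories it did not see during fitting.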


As we said earlier, let's bundle the preprocessing for numerical and categorical data.<br>
We name our transformers *num_trans* and *cate_trans*. num_trans is applied to **numerical_cols** and cate_trans is applied to **categorical_cols**.

In [6]:
bundle_preprocessor = ColumnTransformer(
transformers = [
    ('num_trans', numerical_transformer, numerical_cols),
    ('cate_trans', categorical_transformer, categorical_cols)
])

In [7]:
bundle_preprocessor

ColumnTransformer(transformers=[('num_trans',
                                 SimpleImputer(strategy='constant'),
                                 ['Rooms', 'Distance', 'Postcode', 'Bedroom2',
                                  'Bathroom', 'Car', 'Landsize', 'BuildingArea',
                                  'YearBuilt', 'Lattitude', 'Longtitude',
                                  'Propertycount']),
                                ('cate_trans',
                                 Pipeline(steps=[('categ_imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('categ_onehot',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Type', 'Method', 'Regionname'])])
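
The bundled preprocessor can also be tried on its own before any model is attached. The sketch below rebuilds the same structure on a made-up two-column frame; the `toy` data and the `toy_preprocessor` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame with one categorical and one numerical column, each with a gap
toy = pd.DataFrame({
    "Type": pd.Series(["h", "u", "h", np.nan], dtype="object"),
    "Rooms": [2.0, np.nan, 4.0, 3.0],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num_trans", SimpleImputer(strategy="constant"), ["Rooms"]),
    ("cate_trans", Pipeline(steps=[
        ("categ_imputer", SimpleImputer(strategy="most_frequent")),
        ("categ_onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Type"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on how many zeros it contains
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

# Columns follow the transformer order: Rooms first, then the one-hot columns for Type
print(transformed)
```

Note that the output columns follow the order of the `transformers` list, not the column order of the input frame.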

# Step 2: Defining Model.

Here we just define a model, e.g. a random forest.

In [8]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators = 100, random_state = 0)

# Step 3: Creating and Evaluating the Pipeline.

Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps.<br>
Previously, we preprocessed the dataset first, then defined the model, then fit the model on the preprocessed data, and only then made predictions.<br>
*In the end, we computed the MAE at each step. Now we'll compute the MAE only once, with preprocessing and model definition bundled together.*<br>
There are a few important things to notice:

* With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do *imputation*, *one-hot encoding*, and *model training* in separate steps. This becomes especially *messy* if we have to deal with both numerical and categorical variables!)
* With the pipeline, we supply the unprocessed features in **X_valid** to the **predict()** command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)

Now, we define the pipeline to bundle the preprocessing and modeling steps.<br>
Inside the Pipeline class, we pass the preprocessor and the model as a list of tuples.

In [9]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', bundle_preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

MAE: 160679.18917034855
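For contrast, here is a rough sketch of the manual workflow next to the pipeline version, on synthetic numerical data. Everything in it (the random data, the variable names, `n_estimators=10`) is an assumption for illustration, and it only covers imputation, not one-hot encoding. With the same `random_state`, both routes should produce the same predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic data (hypothetical): 60 rows, 3 features, target driven by the first feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

X_tr, X_va = X_toy[:45], X_toy[45:]
y_tr = y_toy[:45]

# Without a pipeline: each step is done by hand, for both training and validation data
imputer = SimpleImputer(strategy="constant")
X_tr_imp = imputer.fit_transform(X_tr)
X_va_imp = imputer.transform(X_va)  # easy to forget this step
manual_model = RandomForestRegressor(n_estimators=10, random_state=0)
manual_model.fit(X_tr_imp, y_tr)
manual_preds = manual_model.predict(X_va_imp)

# With a pipeline: raw data goes in, preprocessing happens inside fit() and predict()
toy_pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer(strategy="constant")),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
toy_pipeline.fit(X_tr, y_tr)
pipeline_preds = toy_pipeline.predict(X_va)

print(np.allclose(manual_preds, pipeline_preds))
```

The pipeline does not change the result; it removes the bookkeeping, in particular the risk of forgetting to transform the validation data.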


In [10]:
from sklearn.metrics import mean_absolute_error

# Name the preprocessor step 'my_preprocessor' and the model step 'my_model'
my_pipelines = Pipeline(steps = [
    ('my_preprocessor', bundle_preprocessor),
    ('my_model', model)
])

So, we've defined our pipeline for the predictions.<br>
Now, let's fit it and predict on validation data.

In [11]:
# Since the pipeline handles preprocessing, we fit on the raw X_train/y_train directly (earlier tutorials needed preprocessed copies)
my_pipelines.fit(X_train, y_train)

preds = my_pipelines.predict(X_valid)

MAE after pipelining:

In [12]:
print("MAE by pipelining:", mean_absolute_error(y_valid, preds))

MAE by pipelining: 160679.18917034855


# Congratulations!
We're done with pipelining!
