
# Developing a Pipeline

This notebook shows how to build a pipeline that covers data preprocessing, model development and validation: it prepares the data for machine learning, fits a model and assesses the results.

The benefits of pipelines include:

1. Cleaner Code
2. Fewer Bugs
3. Easier to Productionise
4. More Options for Model Validation

In [None]:
# Import the required Libraries
import pandas as pd
import numpy as np

# Data Processing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Pipelining
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Model Development and Validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

#### 1. Import and Process the required Data

---



##### 1.1 Import of Data

In [None]:
# Paths to the training and test data sets
file_path_train = '/content/drive/MyDrive/Colab Notebooks/Kaggle Course/Intermediate Machine Learning/train.csv'
file_path_test = '/content/drive/MyDrive/Colab Notebooks/Kaggle Course/Intermediate Machine Learning/test.csv'

# Read the data
X_full = pd.read_csv(file_path_train, index_col = 'Id')
X_test_full = pd.read_csv(file_path_test, index_col = 'Id')

# Remove rows with a missing target value
X_full.dropna(axis = 0, subset = ['SalePrice'], inplace = True)   # inplace = True modifies X_full directly
# Assign the dependent variable (target)
y = X_full.SalePrice

# Separate the target from the predictors
X_full.drop(['SalePrice'], axis = 1, inplace = True)

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_full, y, 
                                                  train_size = 0.8, test_size = 0.2, random_state = 0)

##### 1.2 Defining Categorical and Numerical Columns

Next, the columns used in the analysis are selected: categorical columns with low cardinality and all numerical columns.

In [None]:
# Cardinality refers to the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train.columns if 
                    X_train[cname].nunique() < 10 and 
                    X_train[cname].dtype == 'object']

# Select numerical columns
numerical_cols = [cname for cname in X_train.columns if 
                  X_train[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train[my_cols].copy()
X_val = X_val[my_cols].copy()
X_test = X_test_full[my_cols].copy()
X_train.head()

Id,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,RoofMatl,MasVnrType,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Heating,HeatingQC,CentralAir,Electrical,KitchenQual,Functional,FireplaceQu,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
619,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Norm,Norm,1Fam,1Story,Hip,CompShg,BrkFace,Ex,TA,PConc,Ex,TA,Av,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,Gd,Attchd,Unf,TA,TA,Y,,,,New,Partial,20,90.0,11694,9,5,2007,2007,452.0,48,0,1774,1822,1828,0,0,1828,0,0,2,0,3,1,9,1,2007.0,3,774,0,108,0,0,260,0,0,7,2007
871,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,PosN,Norm,1Fam,1Story,Hip,CompShg,,TA,TA,CBlock,TA,TA,No,Unf,Unf,GasA,Gd,N,SBrkr,TA,Typ,,Detchd,Unf,TA,TA,Y,,,,WD,Normal,20,60.0,6600,5,5,1962,1962,0.0,0,0,894,894,894,0,0,894,0,0,1,0,2,1,5,0,1962.0,1,308,0,0,0,0,0,0,0,8,2009
93,RL,Pave,Grvl,IR1,HLS,AllPub,Inside,Gtl,Norm,Norm,1Fam,1Story,Gable,CompShg,,TA,Gd,BrkTil,Gd,TA,No,ALQ,Unf,GasA,Ex,Y,SBrkr,TA,Typ,,Detchd,Unf,TA,TA,Y,,,,WD,Normal,30,80.0,13360,5,7,1921,2006,0.0,713,0,163,876,964,0,0,964,1,0,1,0,2,1,5,0,1921.0,2,432,0,0,44,0,0,0,0,8,2009
818,RL,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,Norm,Norm,1Fam,1Story,Hip,CompShg,BrkFace,Gd,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,Gd,Attchd,RFn,TA,TA,Y,,,,WD,Normal,20,,13265,8,5,2002,2002,148.0,1218,0,350,1568,1689,0,0,1689,1,0,2,0,3,1,7,2,2002.0,3,857,150,59,0,0,0,0,0,7,2008
303,RL,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Norm,Norm,1Fam,1Story,Gable,CompShg,BrkFace,Gd,TA,PConc,Gd,TA,No,Unf,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal,20,118.0,13704,7,5,2001,2002,150.0,0,0,1541,1541,1541,0,0,1541,0,0,2,0,3,1,6,1,2001.0,3,843,468,81,0,0,0,0,0,1,2006
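
The column selection from section 1.2 can be tried out on a small frame before running it on the full data. The sketch below is only an illustration: the frame `toy`, its values and the threshold `max_cardinality = 3` are assumptions made for the example (the notebook itself uses a threshold of 10 on the housing data).

```python
import pandas as pd

# Hypothetical toy data, not taken from the housing data set.
# The text columns get an explicit object dtype so that the dtype check
# below behaves the same way across pandas versions.
toy = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl", "Pave", "Pave"], dtype="object"),
    "Neighborhood": pd.Series(["A", "B", "C", "D"], dtype="object"),
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, None, 68.0, 60.0],
})

max_cardinality = 3  # illustrative threshold only

# Categorical columns with few distinct values (nunique() ignores missing values)
low_card_cols = [c for c in toy.columns
                 if toy[c].dtype == "object" and toy[c].nunique() < max_cardinality]

# Numerical columns, selected by dtype family instead of exact dtype names
num_cols = toy.select_dtypes(include="number").columns.tolist()

print(low_card_cols)  # ['Street']
print(num_cols)       # ['LotArea', 'LotFrontage']
```

`Neighborhood` is dropped here because it has four distinct values, which is the same effect the cardinality filter has on high-cardinality columns in the housing data.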


#### 2. Develop the Pipeline

---

In this section, a pipeline containing preprocessing and model development will be created. 

##### 2.1 Preprocessing

To handle missing values, the SimpleImputer class from sklearn is used. It is applied to both the numerical and the categorical data.

Categorical values are transformed using one-hot encoding (OneHotEncoder). Because only columns with low cardinality (fewer than 10 unique values) were selected, one-hot encoding does not inflate the number of features excessively.

In [None]:
# Numerical data: fill missing values with a constant (the default fill_value for numerical data is 0)
numerical_transformer = SimpleImputer(strategy = 'constant')


# Categorical data: impute the most frequent value, then apply one-hot encoding
categorical_transformer = Pipeline(steps =
                                   [('imputer', SimpleImputer(strategy='most_frequent')),
                                    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))])

# Bundle the preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers =[
                   ('num', numerical_transformer, numerical_cols),
                   ('cat', categorical_transformer, categorical_cols)])
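
To see what such a preprocessor produces, it can help to run the same structure on a few hand-made rows. The following is a minimal sketch under stated assumptions: the frames `toy_train` and `toy_new`, their column names and the name `toy_preprocessor` are invented for the example, and `sparse_threshold=0` is set only so that the result is a dense array that prints readably (the notebook keeps the default).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training rows with one gap per column
toy_train = pd.DataFrame({
    "area": [100.0, np.nan, 300.0, 200.0],
    "street": pd.Series(["Pave", "Pave", "Grvl", np.nan], dtype="object"),
})
# Hypothetical new row: missing number and a category unseen during fit
toy_new = pd.DataFrame({
    "area": [np.nan],
    "street": pd.Series(["Dirt"], dtype="object"),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="constant"), ["area"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["street"]),
    ],
    sparse_threshold=0,  # assumption for readability: always return a dense array
)

# Columns of the result: area, street == 'Grvl', street == 'Pave'
train_out = toy_preprocessor.fit_transform(toy_train)
new_out = toy_preprocessor.transform(toy_new)

print(train_out)
print(new_out)
```

The missing `area` is filled with 0, the missing `street` is filled with the most frequent value `Pave`, and the unseen category `Dirt` is encoded as all zeros because of `handle_unknown="ignore"`.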

##### 2.2 Model Development

In this section, a RandomForestRegressor is defined.

In [None]:
# Model Definition
model = RandomForestRegressor(n_estimators = 100, random_state = 0)

##### 2.3 Pipeline Consolidation

In this step, the preprocessor and the model are consolidated into a single pipeline to streamline all steps. The pipeline is then used to fit the model, generate predictions and validate the results.

In [None]:
# Consolidate Preprocessing and Model Development
my_pipe = Pipeline(steps = [('preprocessor', preprocessor),
                            ('model', model)])

# Fit the pipeline on the training data (preprocessing and model in one call)
my_pipe.fit(X_train, y_train)

# Predict on the validation set; the fitted preprocessing is applied automatically
preds = my_pipe.predict(X_val)

# Validate the pipeline
error = mean_absolute_error(y_val, preds)
print('The MAE amounts to',error)



The MAE amounts to 17861.780102739725


##### 2.4 Improving the pipeline

The pipeline can be improved by adjusting either the processing of the data, or the model itself. Here, we will adjust the processing and check if the performance improves.


In [None]:
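# Change the imputation strategy for categorical data via the 'strategy' keyword:
# 'constant' fills gaps with the placeholder 'missing_value' instead of the most frequent value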
categorical_transformer = Pipeline(steps =
                                   [('imputer', SimpleImputer(strategy='constant')),
                                    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))])

In [None]:
# Rerun the bundling of the preprocessor and the pipeline
preprocessor = ColumnTransformer(
    transformers =[
                   ('num', numerical_transformer, numerical_cols),
                   ('cat', categorical_transformer, categorical_cols)])

# Consolidate Preprocessing and Model Development
my_pipe = Pipeline(steps = [('preprocessor', preprocessor),
                            ('model', model)])

In [None]:
# Refit the pipeline and generate predictions on the validation set
my_pipe.fit(X_train, y_train)
preds = my_pipe.predict(X_val)

In [None]:
# Validate the pipeline
error = mean_absolute_error(y_val, preds)
print('The MAE amounts to',error)

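The MAE amounts to 17621.3197260274

The MAE drops from about 17,862 to about 17,621, so imputing a constant for missing categorical values performs slightly better on this validation split.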
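
A difference measured on a single validation split can partly be noise. Since a pipeline bundles preprocessing and model into one estimator, it can be passed directly to cross-validation, which is one of the additional validation options mentioned at the top. The sketch below is only an illustration and not part of the original analysis: it uses synthetic data (`demo_X`, `demo_y`), a hypothetical helper `build_pipeline` and a smaller forest so that it runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (assumption: not related to the housing data)
rng = np.random.default_rng(0)
n = 200
demo_X = pd.DataFrame({
    "area": rng.uniform(50, 250, n),
    "quality": pd.Series(rng.choice(["Fa", "TA", "Gd"], n), dtype="object"),
})
demo_y = 1000 * demo_X["area"] + rng.normal(0, 5000, n)
# Remove some categorical entries so that the imputer has work to do
demo_X.loc[rng.choice(n, 20, replace=False), "quality"] = np.nan


def build_pipeline(cat_strategy):
    """Hypothetical helper: same structure as my_pipe, with a configurable
    imputation strategy for the categorical column."""
    preprocessor = ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="constant"), ["area"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy=cat_strategy)),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["quality"]),
    ])
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    return Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])


cv_mae = {}
for strategy in ["most_frequent", "constant"]:
    # scikit-learn reports the negated MAE, so the sign is flipped back
    scores = -cross_val_score(build_pipeline(strategy), demo_X, demo_y,
                              cv=5, scoring="neg_mean_absolute_error")
    cv_mae[strategy] = scores.mean()
    print(strategy, "mean MAE over 5 folds:", cv_mae[strategy])
```

In each fold the imputers and the encoder are refitted on the training part only, so no information from the held-out part leaks into the preprocessing.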


#### 3. Generate Test Predictions

Using the improved pipeline, we can now generate predictions for the test set. 

In [None]:
# Prediction generation using the pipeline.
preds_test = my_pipe.predict(X_test)
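
The predictions come back as a plain array, so they are usually paired with the row identifiers again before being stored or shared. The following is a minimal sketch with made-up values: `demo_ids` stands in for `X_test.index` and `demo_preds` for `preds_test`, and the column name `SalePrice` is reused from the training data as an assumption about the desired output format.

```python
import numpy as np
import pandas as pd

# Stand-ins for X_test.index and preds_test (hypothetical values)
demo_ids = pd.Index([1461, 1462, 1463], name="Id")
demo_preds = np.array([120000.0, 155000.5, 98000.25])

# Pair each prediction with its Id
output = pd.DataFrame({"SalePrice": demo_preds}, index=demo_ids)

# Without a path, to_csv() returns the CSV text instead of writing a file
csv_text = output.to_csv()
print(csv_text)
```

Passing a file path to `to_csv` would write the same table to disk.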