## Pipeline with categorical and numerical features

In this notebook we present complete pipelines, which chain together data preprocessing and Machine Learning models.
We will use the `auto-mpg-orig` dataset, which has missing values (in the `hp` feature), numerical features and categorical features.

Our data preprocessing must:
1. Impute missing data in the numerical columns. In our example dataset the only missing data is in column `hp`, which is numerical, so we can limit imputation to the numerical columns. In the general case we could also impute categorical columns. To impute with the column median, we use sklearn's `SimpleImputer` with `strategy='median'`. For a categorical column, `strategy='most_frequent'` imputes the mode instead. For more advanced imputation techniques, one could use `KNNImputer` or `IterativeImputer` (the latter is still experimental and must be enabled explicitly through `sklearn.experimental`).
2. One-hot-encode categorical columns. To this end we use sklearn's `OneHotEncoder`.
3. Standardise numerical columns. We use `StandardScaler`.
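
To make the two imputation strategies mentioned in step 1 concrete, here is a minimal sketch on a toy table. The data and the variable names (`toy`, `median_imputer`, `mode_imputer`) are invented for illustration and are not part of the `auto-mpg-orig` dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration (not taken from auto-mpg-orig.csv)
toy = pd.DataFrame({
    'hp': [100.0, np.nan, 150.0, 90.0],
    'origin': pd.Series(['Europe', 'Japan', np.nan, 'Japan'], dtype=object),
})

# Median imputation for a numerical column:
# the missing value is replaced by the median of 100, 150 and 90
median_imputer = SimpleImputer(strategy='median')
hp_imputed = median_imputer.fit_transform(toy[['hp']])

# Mode imputation for a categorical column:
# the missing value is replaced by the most frequent category
mode_imputer = SimpleImputer(strategy='most_frequent')
origin_imputed = mode_imputer.fit_transform(toy[['origin']])

print(hp_imputed.ravel())
print(origin_imputed.ravel())
```

In this toy table the missing `hp` becomes 100 (the median of the observed values) and the missing `origin` becomes `'Japan'` (the most frequent category).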

### Read data, mark missing values and categorical columns

In [1]:
import pandas as pd

In [2]:
d = pd.read_csv('auto-mpg-orig.csv')

In [3]:
# Convert non-numeric entries (the '?' placeholders) in column hp to NaN
d.hp = pd.to_numeric(d.hp, errors='coerce')

In [4]:
# Mark 'origin' as a categorical column, with the correct category names
d.origin = pd.Categorical(d.origin.replace({1: 'America', 2: 'Europe', 3: 'Japan'}))

In [5]:
label = 'mpg'
features = [col for col in d.columns if col != label]

# On top of the list of all features, we now also want
# the list of categorical features (only one in our example)
# and the list of numerical features (those which are not categorical)
categorical_features = ['origin']
numerical_features = [col for col in features if col not in categorical_features]

In [6]:
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
import numpy as np

In [7]:
X, y = d[features], d[label]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

### Create the preprocessing step of the pipeline

Because preprocessing is composed of many steps, we can use pipelines to create a preprocessor.
Later on, we will create a pipeline which joins together the preprocessor and the Machine Learning model (so we will be creating a pipeline in which one of the steps is... another pipeline!).

This approach is useful because, in our example, preprocessing is going to be the same no matter what Machine Learning model we use.
Therefore, by creating a generic preprocessor, we can "recycle" it and combine it with whatever ML model we want.

When we need to take different actions on different columns, we use a `ColumnTransformer`.
Examples of this situation are:
* one-hot-encode *only* categorical columns (and do nothing to numerical columns);
* impute *only* numerical columns (and do nothing to categorical columns);
* standardise *only* numerical columns.

A column transformer object takes a list of tuples.
Each tuple has three elements:
1. The first one is just a name to describe it (i.e., `'numerical'` and `'categorical'` in the code below).
2. The second one is the actual object which does the preprocessing (i.e., the pipelines in the code below).
3. The third one is a subset of columns on which we want to apply the preprocessing (i.e., `numerical_features` and `categorical_features` in the code below).
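
The three-element tuples described above can be sketched on a toy table as follows. The data and the name `toy_transformer` are made up for illustration; the transformers are used with their default settings, which is a simplification of the pipelines built later in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration
toy = pd.DataFrame({
    'weight': [1000.0, 2000.0, 3000.0],
    'origin': pd.Series(['Europe', 'Japan', 'Europe'], dtype=object),
})

toy_transformer = ColumnTransformer([
    # (name, object doing the preprocessing, columns it applies to)
    ('numerical', StandardScaler(), ['weight']),
    ('categorical', OneHotEncoder(), ['origin']),
])

toy_out = toy_transformer.fit_transform(toy)

# One standardised column plus one indicator column per category
print(toy_out.shape)  # (3, 3)
```

The outputs of the individual transformers are concatenated column-wise, in the order in which the tuples are listed.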

The subsets of columns we pass to the various tuples (as their third element) should partition the entire set of features.
In other words, no column should appear in more than one subset, and every column should appear in some subset.
In our case, this means that `numerical_features` and `categorical_features` share no elements, and that no column is missing from both. Note that scikit-learn does not enforce this: a column listed in two subsets is simply transformed twice and shows up twice in the output.

If a column never appears in any of the column subsets, it will be dropped from the set of features (this is the default behaviour, `remainder='drop'`)!
Then, it's natural to ask: what if I don't want to apply any preprocessing to some columns?
If I simply exclude them from every subset of columns, they will be dropped.
The answer is to add them to their own subset and, when adding the corresponding tuple to `ColumnTransformer`, pass the string `'passthrough'` as the second element instead of a preprocessor.
This is a special value which means: leave these columns untouched.
For example, if I didn't want to one-hot-encode the categorical features, I could modify the code below and replace `('categorical', categorical_preprocessor, categorical_features)` with `('categorical', 'passthrough', categorical_features)`. Note, however, that the linear models used later cannot handle the string-valued `origin` column, so this substitution only makes sense with a model that accepts such columns.
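
The difference between omitting a column and passing it through can be checked on a small example. This is only a sketch: the toy table and the names `dropping` and `keeping` are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration
toy = pd.DataFrame({
    'weight': [1000.0, 2000.0, 3000.0],
    'year': [70, 75, 80],
})

# 'year' is not listed anywhere, so it is dropped
# (remainder='drop' is the default)
dropping = ColumnTransformer([
    ('scaled', StandardScaler(), ['weight']),
])

# 'year' is listed with the special string 'passthrough',
# so it is kept unchanged
keeping = ColumnTransformer([
    ('scaled', StandardScaler(), ['weight']),
    ('untouched', 'passthrough', ['year']),
])

dropped_out = dropping.fit_transform(toy)
kept_out = keeping.fit_transform(toy)

print(dropped_out.shape)  # (3, 1)
print(kept_out.shape)     # (3, 2)
```

An alternative with the same effect for *all* unlisted columns is to construct the `ColumnTransformer` with `remainder='passthrough'`.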

Finally, let's say a few words about the preprocessors themselves.
I create two of them, one for numerical columns (`numerical_preprocessor`) and one for categorical columns (`categorical_preprocessor`).
As already mentioned, because preprocessing each of these two subsets of columns could involve multiple steps, it is reasonable to use a pipeline to create the preprocessors.
The numerical preprocessor first performs data imputation (using the column median for columns which have missing values) and then standardisation.
The categorical preprocessor only performs one-hot encoding in this particular example but, in principle, I could be dealing with data in which imputation is required for categorical columns, too. I ask `OneHotEncoder` to return a dense array rather than a sparse matrix, because a dense array is easier to inspect and is accepted by every sklearn estimator, whereas only some estimators support sparse input. The argument that controls this was called `sparse=False` before scikit-learn 1.2 and is called `sparse_output=False` since.
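
To see what the numerical preprocessor does step by step, here is a small sketch on a single toy column. The values and the names `toy_hp` and `toy_numerical` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value, invented for illustration
toy_hp = np.array([[90.0], [np.nan], [110.0], [100.0]])

toy_numerical = make_pipeline(
    SimpleImputer(strategy='median'),  # step 1: fill NaN with the median
    StandardScaler(),                  # step 2: zero mean, unit variance
)
toy_scaled = toy_numerical.fit_transform(toy_hp)

# The median learned during fit is stored on the fitted imputer step
learned_median = toy_numerical.named_steps['simpleimputer'].statistics_[0]

print(learned_median)
print(toy_scaled.ravel())
```

Because the imputed value equals the median (100), which here is also the mean after imputation, the formerly missing entry is mapped to 0 by the scaler.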

In [8]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

In [9]:
numerical_preprocessor = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)

categorical_preprocessor = make_pipeline(
    OneHotEncoder(sparse_output=False)  # this argument was named `sparse` before scikit-learn 1.2
)

preprocessing = ColumnTransformer([
    ('numerical', numerical_preprocessor, numerical_features),
    ('categorical', categorical_preprocessor, categorical_features)
])

### Combining preprocessing and the models

For this example I am using two linear models with, respectively, Lasso and Ridge regularisation terms.

In [10]:
lasso = make_pipeline(
    preprocessing,
    Lasso(max_iter=10000))

ridge = make_pipeline(
    preprocessing,
    Ridge())

### Hyperparameter tuning via grid search

In [11]:
lasso_alpha = np.logspace(start=-3, stop=0, num=20)
ridge_alpha = np.logspace(start=-1, stop=2, num=20)

In [12]:
lasso_cv = GridSearchCV(
    estimator=lasso,
    param_grid={
        'lasso__alpha': lasso_alpha
    },
    cv=5,
    scoring='neg_mean_squared_error')

ridge_cv = GridSearchCV(
    estimator=ridge,
    param_grid={
        'ridge__alpha': ridge_alpha
    },
    cv=5,
    scoring='neg_mean_squared_error')
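
The keys `'lasso__alpha'` and `'ridge__alpha'` in the parameter grids above follow sklearn's `<step name>__<parameter>` convention, where `make_pipeline` derives each step name from the lowercased class name. The sketch below illustrates this on a reduced, hypothetical pipeline (`toy_pipeline`) that uses a plain `StandardScaler` in place of the full preprocessor.

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Reduced pipeline, for illustration only
toy_pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10000))

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in toy_pipeline.steps]
print(step_names)  # ['standardscaler', 'lasso']

# '<step name>__<parameter>' addresses a parameter of that step
toy_pipeline.set_params(lasso__alpha=0.5)
print(toy_pipeline.named_steps['lasso'].alpha)  # 0.5
```

`GridSearchCV` uses the same mechanism internally to set each candidate value of `alpha` on the model step of the pipeline.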

### Fitting of the estimators

In [13]:
lasso_cv.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('numerical',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer(strategy='median')),
                                                                                         ('standardscaler',
                                                                                          StandardScaler())]),
                                                                         ['cylinders',
                                                                          'displacement',
                                                                          'hp',
                                                                          'weight',
                                          

In [14]:
lasso_cv.best_params_

{'lasso__alpha': 0.0379269019073225}

In [15]:
ridge_cv.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('numerical',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer(strategy='median')),
                                                                                         ('standardscaler',
                                                                                          StandardScaler())]),
                                                                         ['cylinders',
                                                                          'displacement',
                                                                          'hp',
                                                                          'weight',
                                          

In [16]:
ridge_cv.best_params_

{'ridge__alpha': 2.636650898730358}

### Evaluating the MSE on the test set

In [17]:
from sklearn.metrics import mean_squared_error

In [18]:
mean_squared_error(y_test, lasso_cv.predict(X_test))

8.86825679928795

In [19]:
mean_squared_error(y_test, ridge_cv.predict(X_test))

8.812101791027336

In [20]:
# Ridge has the (slightly) lower test MSE: refit its best pipeline on the full dataset
winner = ridge_cv.best_estimator_.fit(X, y)