# Modular pipelines with scikit-learn

In this notebook you will practice how to make your analyses modular, using `sklearn` tools. You will see how this structure can help build more sophisticated setups.

We will work on a classification task, using a credit risk dataset from Taiwan, with the goal of predicting the risk of credit default.

[Dataset reference](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).

#### Import libraries

In [1]:
# Hide warnings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import pickle

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, FunctionTransformer, PolynomialFeatures
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, ShuffleSplit
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# For custom transformers
from sklearn.base import TransformerMixin, BaseEstimator

Load the dataset from `data/default.xls`:

In [5]:
pip install xlrd



In [6]:
data = pd.read_excel("data/default.xls", skiprows=1)
data.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


## Basic pipeline

#### Create data matrices

- Create a feature matrix `X` and a target vector `y` from `data`. 
    * The target to be predicted is given in the last column. 
    * The feature matrix should consist of the rest of the columns (except for `'ID'`).
- Create train/test splits with `train_test_split()`, specifying that the data should be stratified by `y`.

In [9]:
# Your code here...
# Drop the target and the 'ID' column, which is an identifier rather than a feature
X = data.drop(columns=['ID', 'default payment next month'])
y = data['default payment next month']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

Let's have a look at the unique values in the target to check whether the dataset is imbalanced:

In [10]:
print("Unique values in y: {}".format(np.unique(y)))
print("Number of 1s in y: {}/{}".format(sum(y), len(y)))

Unique values in y: [0 1]
Number of 1s in y: 6636/30000


In [11]:
X_train.shape, X_test.shape

((22500, 23), (7500, 23))

#### Fit model and classify
Let's use `.fit()` and find the score of a Decision Tree classifier:

In [12]:
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

print("Score: ", tree_clf.score(X_test, y_test))

y_pred = tree_clf.predict(X_test)

print(classification_report(y_test, y_pred))

Score:  0.7276
              precision    recall  f1-score   support

           0       0.83      0.81      0.82      5841
           1       0.39      0.42      0.41      1659

    accuracy                           0.73      7500
   macro avg       0.61      0.62      0.61      7500
weighted avg       0.73      0.73      0.73      7500



#### Scikit-learn pipeline

Now use scikit-learn's pipeline tools to get a comparable result in three lines of code. Create a simple Decision Tree pipeline with `make_pipeline()` and test its predictions with `cross_val_score()`. From now on, use a single split by setting `cv` to `ShuffleSplit(n_splits=1)`.

In [21]:
# Your code here...
split = ShuffleSplit(n_splits=1)

pipeline = make_pipeline(DecisionTreeClassifier())

cv_scores = cross_val_score(pipeline, X, y, cv = split)


In [22]:
cv_scores

array([0.72133333])

### Preprocessing pipeline

Our initial analysis overlooked several aspects that may hinder the performance of a predictive model. One is that the features should be transformed into a form that suits the model. Here are the features we want to build:
* `SEX`, `EDUCATION`, `MARRIAGE`: One-hot encoded features for categorical columns.
* `PAY_...`, `BILL_AMT...`, `PAY_AMT...`: these features correspond to bills and payments made in the previous months. They're highly correlated, so we will first scale them and then apply PCA.
* `AGE`, `LIMIT_BAL`: these are numerical values. We will use `FeatureUnion` to create log and polynomial versions of them.
* `Unpaid_by_user` / `avg_unpaid`: the ratio of the amount unpaid by the user to the average unpaid bill last month across all users. This should give us an idea of how bad the user is at paying bills.

We'll go through each of them step-by-step, leveraging scikit-learn transformers.

#### Column one-hot transformation

We will create one-hot encoded features for the categorical columns `SEX`, `EDUCATION`, `MARRIAGE`. The standard process would be to create the new columns and add them to a new data matrix, and for each such preprocessing step we would have to create a new data set variation. Instead, we can create a transformation to be used in a pipeline.

Create a one-hot transformer for the categorical variables, using `make_column_transformer()` and the `OneHotEncoder()`. The `ColumnTransformer` allows you to apply different transformations in parallel to different columns. Remember to set `remainder='passthrough'` to retain all other columns. 

Create a pipeline with the one-hot preprocessor and a Decision Tree. Test the model predictions with `cross_val_score()`.


In [25]:
# Your code here...
ohe_columns = ['SEX','EDUCATION','MARRIAGE']

categorical_transformer = make_column_transformer(
    (OneHotEncoder(sparse_output=False), ohe_columns),
    remainder='passthrough'
)

ohe_pipeline = Pipeline(steps= [
    ('ohe_preprocessor', categorical_transformer),
    ('model', DecisionTreeClassifier())
])

ohe_cv_scores = cross_val_score(ohe_pipeline, X, y, cv = split)

In [26]:
ohe_cv_scores

array([0.72333333])

#### Reducing correlated features with Scaler and PCA

We will scale the payment columns and reduce their dimensionality with PCA.

Create a sequential transformer that first scales the data and then applies PCA.

Choose a sensible value for `n_components`, which sets the number of components to keep (note that `n_components` can be specified as a `float` saying what proportion of variance is retained).

In [31]:
# Your code here...
# The 'PAY_' prefix also matches the 'PAY_AMT...' columns, so exclude them here
pay_columns = [col for col in X_train.columns
               if col.startswith('PAY_') and not col.startswith('PAY_AMT')]
bill_amt_columns = [col for col in X_train.columns if col.startswith('BILL_AMT')]
pay_amt_columns = [col for col in X_train.columns if col.startswith('PAY_AMT')]

scale_pca_columns = pay_columns + bill_amt_columns + pay_amt_columns

scale_pca_pipeline = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('pca', PCA(n_components=0.8))
])

# Optional check: fit and transform the selected columns
# transformed_data = scale_pca_pipeline.fit_transform(X_train[scale_pca_columns])
# print("Transformed data shape:", transformed_data.shape)

Great, let's keep this transformer for later when we'll group all the pieces together.
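
To see what a `float` value of `n_components` does, here is a minimal sketch on synthetic data (the random matrix below is an assumption standing in for the payment columns, not the real dataset). It fits the same scale-then-PCA structure and reports how many components were kept and how much variance they retain.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 correlated columns driven by 2 latent factors plus noise
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
features = latent @ mixing + 0.05 * rng.normal(size=(500, 6))

demo_pipeline = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=0.8)),  # keep enough components for 80% of the variance
])
reduced = demo_pipeline.fit_transform(features)

pca = demo_pipeline.named_steps["pca"]
retained = pca.explained_variance_ratio_.sum()
print("Components kept:", pca.n_components_)
print("Variance retained:", round(retained, 3))
```

The fitted attribute `n_components_` holds the number of components that PCA actually selected to reach the requested proportion.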

#### Adding handcrafted features with `FeatureUnion`

`FeatureUnion` can be applied on the data to add features (for example: interaction features or transformed features). In this dataset we want to create new features by combining and transforming `AGE` and `LIMIT_BAL`.

Create a transformer that adds the log features with `FunctionTransformer()`, and a polynomial pairwise transformation (check the `PolynomialFeatures` documentation). For the latter transformer, specify `include_bias=False` and `interaction_only=True`.

Test the transformation by calling `.fit_transform()` on `X_train`, columns `AGE` and `LIMIT_BAL`. How many new columns does this transformation create?

In [38]:
# Your code here...

numerical_transformer = FeatureUnion([
    ('log', FunctionTransformer(np.log)),
    ('polynomial', PolynomialFeatures(2, include_bias=False, 
                                      interaction_only=True))
])

numerical_transformer.fit_transform(X_train[["AGE", "LIMIT_BAL"]]).shape

(22500, 5)

Note that if you have two features `a` and `b`, then `PolynomialFeatures(2, include_bias=False, interaction_only=True)` produces the features `a`, `b` and `a*b`. Without `interaction_only=True`, it would also include the squared terms `a**2` and `b**2`.
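
A quick way to check this is to transform a single made-up sample with `a = 2` and `b = 3` (values chosen only for illustration) and compare the two settings:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features: a = 2, b = 3
sample = np.array([[2.0, 3.0]])

interactions_only = PolynomialFeatures(2, include_bias=False, interaction_only=True)
full_degree_two = PolynomialFeatures(2, include_bias=False)

interaction_features = interactions_only.fit_transform(sample)
full_features = full_degree_two.fit_transform(sample)

print(interaction_features)  # columns: a, b, a*b
print(full_features)         # columns: a, b, a**2, a*b, b**2
```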

#### Custom Transformer

We'll create a custom transformer that indicates if a user is above or below average for the unpaid bill last month: `unpaid_by_user > avg_unpaid`.

For this we need to memorise the average amount unpaid by all users in the `.fit()` step, and then apply the transformation for any user in the `.transform()` step.

We've defined the template for the class you need to implement below. Note that it extends `TransformerMixin` and `BaseEstimator`; that's required to define a new `Transformer` in the right format for `sklearn`.

Follow the instructions below. You can run the code in the next cell to test your implementation.

1. Implement the `.fit()` method:
  * This method takes `X` and `y` (`y` is optional). 
  * It needs to compute `avg_unpaid_6`, `avg_unpaid_5`, and `avg_unpaid_4`: the average amount unpaid across all users in the given training set for months 6, 5 and 4. For example, `avg_unpaid_6` is the mean of `BILL_AMT6` - `PAY_AMT6`.
  * To be compatible with other `sklearn` tools, `.fit()` needs to return `self`.


2. Implement the `.transform()` method:
  * This method takes `X` as input and needs to return a new DataFrame with columns `Unpaid_ratio_6`, `Unpaid_ratio_5`, and `Unpaid_ratio_4`, containing the same number of rows as `X`.
  * Each value indicates if the amount unpaid by user (`BILL_AMT` - `PAY_AMT`) is higher than the average amount saved in `.fit()`.
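
The two steps above can be sketched as follows. This is one possible reading of the instructions, not the reference solution: it assumes each output value is a 0/1 flag, and it is tried on a tiny made-up DataFrame (`toy`) rather than the real data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class UnpaidTransformer(TransformerMixin, BaseEstimator):
    """Flag users whose unpaid amount exceeds the training-set average."""

    months = (4, 5, 6)

    def fit(self, X, y=None):
        # Memorise the mean of BILL_AMT<m> - PAY_AMT<m> for each month
        for m in self.months:
            unpaid = X[f"BILL_AMT{m}"] - X[f"PAY_AMT{m}"]
            setattr(self, f"avg_unpaid_{m}", unpaid.mean())
        return self

    def transform(self, X, y=None):
        # Compare each user's unpaid amount with the stored averages
        result = pd.DataFrame(index=X.index)
        for m in self.months:
            unpaid = X[f"BILL_AMT{m}"] - X[f"PAY_AMT{m}"]
            average = getattr(self, f"avg_unpaid_{m}")
            result[f"Unpaid_ratio_{m}"] = (unpaid > average).astype(int)
        return result


# Made-up data with two users (assumption: values chosen for easy checking)
toy = pd.DataFrame({
    "BILL_AMT4": [100, 300], "PAY_AMT4": [0, 100],
    "BILL_AMT5": [50, 50], "PAY_AMT5": [50, 0],
    "BILL_AMT6": [400, 100], "PAY_AMT6": [100, 100],
})

transformer = UnpaidTransformer().fit(toy)
flags = transformer.transform(toy)
print(transformer.avg_unpaid_6)
print(flags)
```

Because the averages are computed in `.fit()` only, transforming test data later reuses the training-set averages instead of recomputing them.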

In [None]:
class UnpaidTransformer(TransformerMixin, BaseEstimator):
    
    def fit(self, X, y=None):
        # Your code here...
        pass

    def transform(self, X, y=None):
        # Your code here...
        pass

In [None]:
# Instantiate the transformer
unpaid_transformer = UnpaidTransformer()

# Test .fit()
unpaid_transformer.fit(X_train)
print(unpaid_transformer.avg_unpaid_6)

# Test .transform()
unpaid_transformer.transform(X_train).head()

#### Build a joint preprocessor

Now that we have all the transformers we need, we can create a single object using `ColumnTransformer`. First we need to define a list of columns that will be processed separately:

In [None]:
payment_cols = [col for col in X_train.columns if col.startswith("PAY_") or col.startswith("BILL_")]

num_cols = ["AGE", "LIMIT_BAL"]

# Features needed to compute the unpaid features:
unpaid_cols = ["PAY_AMT4", "PAY_AMT5", "PAY_AMT6", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6"]
cat_cols = ["SEX", "EDUCATION", "MARRIAGE"]

Now let's create a `ColumnTransformer` object, which takes a list of tuples as input. Each tuple needs to have the following format: (`name`, `transformer`, `columns`).

In [None]:
# Your code here...


Now we have a preprocessor object that we can reuse for any new data. The next step is to create a predictive model with it.

## Optimizing models

In order to evaluate different predictive algorithms, we can create several pipeline objects, each combining our preprocessor together with a different algorithm. 

Create new pipeline objects (`dtc` and `rfc`) with our joint preprocessor for the Decision Tree and the Random Forest classifiers. Test them with `cross_val_score()`.

In [None]:
# Your code here...


#### Using grid search with a pipeline

One major advantage of using a pipeline is that we now have both the processing **and** the predictive modeling in one object, which can be trained and optimized jointly. This means that on every training fold both the transformations and the model are refit, which prevents data leakage from the held-out data and makes the evaluation more reliable.

When tuning our model we can tune the hyperparameters of the model, and also some processing parameters, such as the number of PCA components. This would be much harder with a regular sequential preprocessing-then-training scenario.

Below we use grid search to tune the `max_depth` and `min_samples_split` of the Random Forest model, together with the `n_components` in PCA.

Note: to refer to a parameter, you need to use the name of the step, followed by `__`, then the parameter name. Since the PCA model is deep inside our pipeline, the name is a bit more complicated. Using `.named_steps` allows us to see what's inside our pipeline model:

In [None]:
rfc.named_steps

Here PCA sits in `columntransformer`, then `payments` (the Pipeline), then `pca`, so its number of components is addressed as `columntransformer__payments__pca__n_components` (you can use `.get_params()` to check that the parameter name is right).

Use `GridSearchCV()` to find the parameters with the best accuracy. Print these parameters and the corresponding score, with `.best_params_` and `.best_score_`.
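
The nested naming scheme can be tried on a small synthetic example first. In the sketch below the data and the grid values are assumptions chosen to keep the search fast; the step names follow the text above (`columntransformer` and `randomforestclassifier` come from `make_pipeline()`, while `payments` and `pca` are set explicitly).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n_rows = 200
bill_cols = [f"BILL_AMT{m}" for m in range(1, 7)]

# Synthetic stand-in for the real data
X_toy = pd.DataFrame(rng.normal(size=(n_rows, 6)), columns=bill_cols)
X_toy["AGE"] = rng.integers(21, 70, n_rows)
y_toy = (X_toy["BILL_AMT1"] + X_toy["BILL_AMT2"] > 0).astype(int)

payments = Pipeline([("scale", MinMaxScaler()), ("pca", PCA())])
preprocessor = ColumnTransformer([("payments", payments, bill_cols)],
                                 remainder="passthrough")
rfc = make_pipeline(preprocessor, RandomForestClassifier(n_estimators=20, random_state=0))

param_grid = {
    # step name, then nested names, joined by double underscores
    "columntransformer__payments__pca__n_components": [2, 4],
    "randomforestclassifier__max_depth": [3, None],
    "randomforestclassifier__min_samples_split": [2, 10],
}

# Sanity check: every key in the grid must be a valid parameter name
assert set(param_grid) <= set(rfc.get_params())

grid = GridSearchCV(rfc, param_grid, cv=3, scoring="accuracy")
grid.fit(X_toy, y_toy)
print(grid.best_params_)
print(grid.best_score_)
```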

In [None]:
# Your code here...


#### Recover and fit the model with the best hyper-parameters

The best hyper-parameters found by the grid search are stored in `.best_params_`. You can apply them to your pipeline with `.set_params()` and then fit it again.
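
A minimal sketch of the `.set_params()` mechanics, using a small pipeline on synthetic data. The dictionary `best_params` holds hypothetical values standing in for the `.best_params_` attribute of a fitted `GridSearchCV`:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

model = make_pipeline(MinMaxScaler(), PCA(), DecisionTreeClassifier(random_state=0))

# Hypothetical values standing in for grid.best_params_
best_params = {"pca__n_components": 3, "decisiontreeclassifier__max_depth": 4}

# Unpack the dictionary into keyword arguments, then refit with the new settings
model.set_params(**best_params)
model.fit(X_toy, y_toy)

fitted_tree = model.named_steps["decisiontreeclassifier"]
print(model.named_steps["pca"].n_components_, fitted_tree.get_depth())
```

When `GridSearchCV` is run with its default `refit=True`, the already refitted model is also available as `.best_estimator_`.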

In [None]:
# Your code here...


### Saving pre-trained models

Now save the best pre-trained model to be used later, using the `pickle` library. Write your model to `"data/rf_model.pickle"`.

In [None]:
# Your code here...


Now load your model and print its score again, to check that you have successfully saved it.

In [None]:
with open("data/rf_model.pickle", "rb") as f:
    rfc_pickle = pickle.load(f)
cross_val_score(rfc_pickle, X, y, cv=split)

Pickle is really useful to save trained models, but keep in mind that it does not save dependencies. To be able to load your model, you need to import the code it relies on: external libraries and our own class definitions. Because of this, it is common to get errors when trying to load a model trained with an old version of a library.
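
The save-and-load round trip can be sketched as below. To keep the example self-contained it uses synthetic data, a small stand-in pipeline, and a temporary directory instead of `data/`; all of these are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
model = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0))
model.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_model.pickle")

    # Write the fitted pipeline to disk in binary mode
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Read it back as a new object
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should behave exactly like the original one
same_predictions = (restored.predict(X_toy) == model.predict(X_toy)).all()
print("Predictions match after reload:", same_predictions)
```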

#### Reproducible experiments

To make sure your model did not just get lucky in a given training session, it is important to know how to run reproducible experiments. Apart from using the same code and data, a reproducible run needs the same random number generator state.

Run three instances of training your pipeline and testing it with `cross_val_score()`. Before two of them set the random seed to `1234`, with `np.random.seed()`. Do the results match exactly?
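
The idea can be tried on synthetic data first. The sketch below uses a small stand-in pipeline and relies on the fact that scikit-learn estimators and splitters left with `random_state=None` draw from NumPy's global random number generator:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)


def run_experiment():
    # No random_state is set: the split and the tree both use the global generator
    model = make_pipeline(MinMaxScaler(), DecisionTreeClassifier())
    return cross_val_score(model, X_toy, y_toy, cv=ShuffleSplit(n_splits=1))


np.random.seed(1234)
first = run_experiment()

np.random.seed(1234)
second = run_experiment()

# Not reseeded: the generator state has moved on, so this run may differ
third = run_experiment()

print(first, second, third)
```

An alternative that does not depend on global state is to pass an explicit `random_state` to each splitter and estimator.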

In [None]:
# Your code here...


In [None]:
# Your code here...


In [None]:
# Your code here...
