# Debugging scikit-learn pipelines

- toc: false
- branch: master
- badges: true
- comments: true
- categories: [machine-learning, programming, scikit-learn]

### Introduction

The idea of this post is to show how we can alter the behaviour of scikit-learn pipelines so that they are easier to debug. For this purpose, we'll:

* Create a pipeline with a bug in it.
* Create a tool to debug such a pipeline.
* Fix the error and see what else the debugging tool offers us.

We'll do this with the arrests data from the `scikit-lego` library and a very simple pipeline.

### Data preparation

First of all, let's load the packages that we are going to use:

In [1]:
# Main tools
import numpy as np
import pandas as pd
import seaborn as sns

# Data
from sklego.datasets import load_arrests

# Scikit-learn classes
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Non-scikit-learn classes
from category_encoders import OneHotEncoder


# fastcore helpers
from fastcore.dispatch import typedispatch

Next, let's load the data. The dataset describes the police treatment of individuals arrested in Toronto for simple possession of small quantities of marijuana, and it is often used to study bias in ML. We use it here only to show how to debug a pipeline; any other dataset would work just as well. It contains both numeric and categorical columns. We're going to try to predict whether an arrested person was `released` or not.

In [2]:
df_arrests = load_arrests(as_frame=True)
df_arrests.head()

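|   | released | colour | year | age | sex | employed | citizen | checks |
|---|----------|--------|------|-----|-----|----------|---------|--------|
| 0 | Yes | White | 2002 | 21 | Male | Yes | Yes | 3 |
| 1 | No | Black | 1999 | 17 | Male | Yes | Yes | 3 |
| 2 | Yes | White | 2000 | 24 | Male | Yes | Yes | 3 |
| 3 | No | Black | 2000 | 46 | Male | Yes | Yes | 1 |
| 4 | Yes | Black | 1999 | 27 | Female | Yes | Yes | 1 |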


Let's do some very basic preprocessing: split the dataset into features (X) and target (y), encoding the target as 0/1:

In [3]:
df_arrests_x = df_arrests.drop(columns='released')
df_arrests_y = df_arrests['released'].map({'Yes': 0, 'No': 1})

We want to train an ML model to predict `released`. For that purpose, we're going to use a scikit-learn pipeline that scales the data, imputes missing values and trains a logistic regression. Simple enough, it should work:

In [4]:
# Define pipeline
steps_list = [
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer()),
    ('learner', LogisticRegression())
]
pipe = Pipeline(steps_list)


We try to train the model, but unfortunately we get an error, and it is not very informative:

`ValueError: could not convert string to float: 'White'`

The message doesn't even tell us which step of the pipeline failed, and this makes it hard to debug:

In [5]:
pipe.fit(df_arrests_x, df_arrests_y)

ValueError: could not convert string to float: 'White'

To be able to debug this error, we are going to build a class derived from `Pipeline`. As a first step, we'll build `DebugPipeline` with a `debug_fit` method that reproduces the same error as the standard `fit`. In essence, the `fit` of a `Pipeline` runs `fit_transform` for every transformer, feeding the output of each one to the next, and then runs `fit` on the learner with the transformed data:

In [6]:
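class DebugPipeline(Pipeline):

    def debug_fit(self, X, y):
        # Fit and apply every transformer in turn; y is passed along,
        # as `Pipeline.fit` does, so supervised transformers also work
        for transformer_name, transformer in self.steps[:-1]:
            X = transformer.fit_transform(X, y)
        # The last step is the learner: fit it on the transformed data
        return self.steps[-1][1].fit(X, y)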


In [7]:
pipe = DebugPipeline(steps_list)

Indeed, we get the same error:

In [8]:
pipe.debug_fit(df_arrests_x, df_arrests_y)

ValueError: could not convert string to float: 'White'
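
Since `debug_fit` is meant to mirror `fit`, it is also worth checking that the two agree when nothing goes wrong. The following is a minimal sketch of such a check, not part of the original notebook: `toy_x`, `toy_y` and `make_steps` are hypothetical names, and the data are made up. It fits one pipeline with `fit` and an identical one with a `debug_fit` in the spirit of the one above, then compares the learned coefficients:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DebugPipeline(Pipeline):

    def debug_fit(self, X, y):
        # Same logic as in the post: chain fit_transform, then fit the learner
        for transformer_name, transformer in self.steps[:-1]:
            X = transformer.fit_transform(X, y)
        return self.steps[-1][1].fit(X, y)


def make_steps():
    # Fresh, unfitted estimators on every call so the two pipelines share nothing
    return [
        ('scaler', StandardScaler()),
        ('imputer', SimpleImputer()),
        ('learner', LogisticRegression()),
    ]


# Hypothetical toy data (assumption): two numeric columns and a binary target
toy_x = np.array([[21., 3.], [17., 3.], [24., 1.], [46., 1.], [27., 0.], [35., 2.]])
toy_y = np.array([0, 1, 0, 1, 0, 1])

reference = Pipeline(make_steps()).fit(toy_x, toy_y)
debugged = DebugPipeline(make_steps())
debugged.debug_fit(toy_x, toy_y)

# Both pipelines should have learned the same logistic regression
same_coefficients = np.allclose(
    reference.named_steps['learner'].coef_,
    debugged.named_steps['learner'].coef_,
)
print(same_coefficients)
```

This only covers the simple case shown here; the real `Pipeline.fit` also handles things like caching and `'passthrough'` steps, which this sketch ignores.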

Now we have a `debug_fit` method that we can control more easily than the `fit` method. The idea is to log several properties of the input data at each step in order to understand where the pipeline is failing and why.

As a bonus trick, we'll use `typedispatch` from `fastcore` so that the same `log_dtypes` function logs the column types of either a pandas `DataFrame` or a NumPy array, picking the right implementation from the type of its argument.

Let's create the logging functions:

In [9]:
@typedispatch
def log_dtypes(df: pd.DataFrame):
    types_dict = dict(df.dtypes.items())
    dtypes_str = ", ".join(
        "({}, {})".format(name, dtype) for name, dtype in types_dict.items()
    )

    return f"types=[{dtypes_str}]"


@typedispatch
def log_dtypes(df: np.ndarray):
    types_dict = {i: col.dtype for i, col in enumerate(df.T)}
    dtypes_str = ", ".join(
        "({}, {})".format(name, dtype) for name, dtype in types_dict.items()
    )

    return f"types=[{dtypes_str}]"


def log_shape(df):
    return f"n_obs={df.shape[0]} n_col={df.shape[1]}"
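
If you would rather not depend on `fastcore`, a similar effect can probably be had with `functools.singledispatch` from the standard library. The sketch below is an assumption-laden alternative, not what the post uses: the name `log_dtypes_std` is hypothetical, and the tiny inputs at the bottom are made up for illustration.

```python
from functools import singledispatch

import numpy as np
import pandas as pd


@singledispatch
def log_dtypes_std(data):
    # Fallback for input types without a registered implementation
    raise TypeError(f"unsupported input type: {type(data).__name__}")


@log_dtypes_std.register
def _(data: pd.DataFrame):
    # One (column name, dtype) pair per DataFrame column
    pairs = ", ".join(f"({name}, {dtype})" for name, dtype in data.dtypes.items())
    return f"types=[{pairs}]"


@log_dtypes_std.register
def _(data: np.ndarray):
    # A NumPy array has a single dtype, so every column reports the same one
    pairs = ", ".join(f"({i}, {data.dtype})" for i in range(data.shape[1]))
    return f"types=[{pairs}]"


# Made-up inputs, only to show the dispatch in action
frame_log = log_dtypes_std(pd.DataFrame({"age": [21.0, 17.0]}))
array_log = log_dtypes_std(np.array([[1.0, 2.0], [3.0, 4.0]]))
print(frame_log)
print(array_log)
```

`singledispatch` chooses the implementation from the type of the first argument only, which is all this logging helper needs.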


Let's use these functions before the `fit_transform`:

In [10]:

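class DebugPipeline(Pipeline):

    def debug_fit(self, X, y):
        for transformer_name, transformer in self.steps[:-1]:
            # Log the input of the step *before* running it, so the
            # information is printed even if the step raises
            print(log_shape(X))
            print(log_dtypes(X))
            print(f"{transformer_name} to be applied")
            # y is passed along, as `Pipeline.fit` does
            X = transformer.fit_transform(X, y)
        return self.steps[-1][1].fit(X, y)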


Now let's call `debug_fit`. We can see that the pipeline fails in the scaler step, and that the input data at this step include columns of type `object`. We need to convert the categorical variables to numeric ones first, since `StandardScaler` cannot handle strings.

In [11]:
pipe = DebugPipeline(steps_list)
pipe.debug_fit(df_arrests_x, df_arrests_y)

n_obs=5226 n_col=7
types=[(colour, object), (year, int64), (age, int64), (sex, object), (employed, object), (citizen, object), (checks, int64)]
scaler to be applied


ValueError: could not convert string to float: 'White'
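
Another way to find the failing step is to put its name directly into the exception. The following is a minimal sketch of that idea, not part of the original notebook: the class name `NamedErrorPipeline` and the toy data are hypothetical, and re-raising as `RuntimeError` is just one possible design choice.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class NamedErrorPipeline(Pipeline):

    def debug_fit(self, X, y):
        for name, transformer in self.steps[:-1]:
            try:
                X = transformer.fit_transform(X, y)
            except Exception as exc:
                # Re-raise with the step name, chaining the original error
                raise RuntimeError(f"step '{name}' failed: {exc}") from exc
        return self.steps[-1][1].fit(X, y)


# Hypothetical toy data (assumption): one string column and one numeric column
toy_x = pd.DataFrame({
    "colour": ["White", "Black", "White", "Black"],
    "age": [21, 17, 24, 46],
})
toy_y = pd.Series([0, 1, 0, 1])

pipe = NamedErrorPipeline([
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer()),
    ('learner', LogisticRegression()),
])

failure_message = None
try:
    pipe.debug_fit(toy_x, toy_y)
except RuntimeError as err:
    failure_message = str(err)
    print(failure_message)
```

Because the original exception is chained with `from exc`, the full traceback is still available when the error is not caught.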

Let's solve this issue by adding a one-hot encoding step at the start of the pipeline:

In [12]:
steps_list = [
    ('one-hot', OneHotEncoder()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer()),
    ('learner', LogisticRegression())
]


In [13]:
pipe = DebugPipeline(steps_list)
pipe.debug_fit(df_arrests_x, df_arrests_y)

n_obs=5226 n_col=7
types=[(colour, object), (year, int64), (age, int64), (sex, object), (employed, object), (citizen, object), (checks, int64)]
one-hot to be applied
n_obs=5226 n_col=11
types=[(colour_1, int64), (colour_2, int64), (year, int64), (age, int64), (sex_1, int64), (sex_2, int64), (employed_1, int64), (employed_2, int64), (citizen_1, int64), (citizen_2, int64), (checks, int64)]
scaler to be applied
n_obs=5226 n_col=11
types=[(0, float64), (1, float64), (2, float64), (3, float64), (4, float64), (5, float64), (6, float64), (7, float64), (8, float64), (9, float64), (10, float64)]
imputer to be applied



LogisticRegression()

And now, for free, we get to see the columns and their types at each step of the pipeline. This helps us understand the features that our model is using and gives us a more intuitive picture of the pipeline.

### Summary

* Because scikit-learn's `Pipeline` is a regular class, we can subclass it and slightly change its behaviour to make it easier to debug.
* With this implementation, we get logging of the data at each step for free.
* We could even log the predictions by building a `debug_predict`.
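
As a rough illustration of the last point, a `debug_predict` could look like the sketch below. This is an assumption about how it might be written, not code from the original notebook: it logs only the shape of the data, and the toy arrays are made up. It expects the pipeline to have been fitted already.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DebugPipeline(Pipeline):

    def debug_predict(self, X):
        # At prediction time the transformers are already fitted,
        # so we call transform instead of fit_transform
        for transformer_name, transformer in self.steps[:-1]:
            print(f"n_obs={X.shape[0]} n_col={X.shape[1]}")
            print(f"{transformer_name} to be applied")
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


# Hypothetical toy data (assumption): two numeric columns and a binary target
toy_x = np.array([[0., 1.], [1., 0.], [2., 1.], [3., 0.], [4., 1.], [5., 0.]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

pipe = DebugPipeline([
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer()),
    ('learner', LogisticRegression()),
])
pipe.fit(toy_x, toy_y)

debug_predictions = pipe.debug_predict(toy_x)
reference_predictions = pipe.predict(toy_x)
```

On this toy data the two prediction arrays should match, since `debug_predict` applies the same fitted steps as `predict` and only adds the logging.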