# 👁️ Visualizing scikit-learn pipelines in Jupyter

We keep this notebook in order to:

- make it available to users who want to reproduce it locally
- archive the script in case we want to re-record this video after the
  scikit-learn UI changes in a future release.

## First we load the dataset

We need to define our data and target. In this case we turn the sale price into a binary target (above or below 200,000 dollars), so we will build a classification model.

In [1]:
import pandas as pd

# Colab: https://github.com/INRIA/scikit-learn-mooc/raw/main/datasets/house_prices.csv
ames_housing = pd.read_csv("https://github.com/INRIA/scikit-learn-mooc/raw/main/datasets/house_prices.csv", na_values='?')

target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)

We inspect the dataframe (pandas shows only the first and last rows):

In [2]:
data

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,12,2008,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,8,2007,WD,Normal
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,,0,2,2010,WD,Normal
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,0,,GdPrv,Shed,2500,5,2010,WD,Normal
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,4,2010,WD,Normal


In [3]:
target

0       1
1       0
2       1
3       0
4       1
       ..
1455    0
1456    1
1457    1
1458    0
1459    0
Name: SalePrice, Length: 1460, dtype: int64
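
The 0/1 values above come from comparing each price to the threshold. As a minimal sketch on made-up prices (the `prices` values below are hypothetical and are not taken from the Ames dataset), we can check how the comparison behaves at the threshold and how balanced the resulting classes are:

```python
import pandas as pd

# Hypothetical prices, chosen only to illustrate the thresholding.
prices = pd.Series([150_000, 250_000, 200_000, 320_000], name="SalePrice")

# The comparison is strict: a price of exactly 200_000 falls in class 0.
toy_target = (prices > 200_000).astype(int)

# Share of each class; useful to know before reading an accuracy score.
class_share = toy_target.value_counts(normalize=True)

print(toy_target.tolist())
print(class_share)
```

Running `target.value_counts(normalize=True)` on the real target gives the same kind of summary for the full dataset.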

For the sake of simplicity, we can cherry-pick some features and only retain
this arbitrary subset of data:

In [11]:
numeric_features = ['LotArea', 'FullBath', 'HalfBath']
categorical_features = ['Neighborhood', 'HouseStyle']
data = data[numeric_features + categorical_features]
data

Unnamed: 0,LotArea,FullBath,HalfBath,Neighborhood,HouseStyle
0,8450,2,1,CollgCr,2Story
1,9600,2,0,Veenker,1Story
2,11250,2,1,CollgCr,2Story
3,9550,1,0,Crawfor,2Story
4,14260,2,1,NoRidge,2Story
...,...,...,...,...,...
1455,7917,2,1,Gilbert,2Story
1456,13175,2,0,NWAmes,1Story
1457,9042,2,0,Crawfor,2Story
1458,9717,1,0,NAmes,1Story


## Then we create the pipeline

The first step is to define the preprocessing for each type of feature:

In [5]:
from sklearn.impute import SimpleImputer
?SimpleImputer

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_transformer = OneHotEncoder(handle_unknown='ignore')
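
To see what these two choices do in isolation, here is a small sketch on toy arrays (the values are made up for illustration). It shows that `strategy='median'` fills a missing value with the column median, and that `handle_unknown='ignore'` encodes a category unseen during `fit` as a row of zeros instead of raising an error:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Median imputation: the NaN is replaced by the median of 1, 3 and 10.
X_num = np.array([[1.0], [np.nan], [3.0], [10.0]])
imputed = SimpleImputer(strategy="median").fit_transform(X_num)
print(imputed.ravel())

# One-hot encoding: "SFoyer" is not seen during fit.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["1Story"], ["2Story"]], dtype=object))
unseen = np.array([["2Story"], ["SFoyer"]], dtype=object)

# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.transform(unseen).toarray()
print(encoded)
```

This matters for cross-validation: a rare category may be absent from a training fold but present in the matching test fold.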

The next step is to apply each transformer to its own set of columns using `ColumnTransformer`:

In [7]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
])

Then we define the model and chain the preprocessor and the classifier in a single pipeline:

In [8]:
from sklearn.linear_model import LogisticRegression

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])
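
The step names given to `Pipeline` are more than labels: they let us reach into the pipeline. The sketch below uses a simplified stand-in (`toy_model`, with a plain `StandardScaler` in place of the full preprocessor) to show how a step can be retrieved by name and how a nested parameter can be set with the `<step>__<parameter>` syntax:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in for the pipeline above (assumption for this sketch).
toy_model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Steps are stored in order as (name, estimator) pairs.
step_names = [name for name, _ in toy_model.steps]

# Retrieve a step by name, then change one of its parameters
# through the pipeline using the "step__parameter" convention.
classifier = toy_model.named_steps["classifier"]
toy_model.set_params(classifier__C=0.1)

print(step_names)
print(classifier.C)
```

The same naming convention is what a hyperparameter search would use to tune parameters inside a pipeline.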

Let's visualize it!

In [9]:
from sklearn import set_config
set_config(display="diagram")

model
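
Outside a notebook, the same diagram can be obtained as an HTML string. The following is a minimal sketch using `estimator_html_repr` from `sklearn.utils` on a simplified stand-in pipeline (`toy_model` is an assumption for this example, not the model defined above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Simplified stand-in pipeline, used only to produce a diagram.
toy_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# HTML markup of the diagram, as a plain Python string.
html = estimator_html_repr(toy_model)
print(len(html))
```

The resulting string can be written to a `.html` file and opened in a browser, which is convenient when sharing a pipeline description with someone who does not run Jupyter.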

## Finally we score the model

In [10]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=5)
scores = cv_results["test_score"]
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} ± {scores.std():.3f}")

The mean cross-validation accuracy is: 0.859 ± 0.018
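
To read such a score it helps to know what `cross_validate` returns and what a trivial model would achieve. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, unrelated to the housing data) and compares a logistic regression with a `DummyClassifier` that always predicts the most frequent class:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

baseline_results = cross_validate(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5
)
logistic_results = cross_validate(LogisticRegression(), X, y, cv=5)

# cross_validate returns a dict of arrays, with one entry per fold.
result_keys = sorted(logistic_results.keys())
print(result_keys)

print(f"baseline: {baseline_results['test_score'].mean():.3f}")
print(f"logistic: {logistic_results['test_score'].mean():.3f}")
```

Applying the same baseline to `data` and `target` would show how much of the reported accuracy comes from the class imbalance alone.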


<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">ℹ️ Note</p>
<p>In this case, the pipeline correctly predicts whether the price of a house
is above or below the 200,000 dollar threshold about 86% of the time. But
be aware that this score was obtained by picking some features by hand, which
is not necessarily the best thing we can do for this classification task. In this
example we can hope that fitting a more complex machine learning pipeline on a
richer set of features can improve upon this performance level.</p>
<p class="last">Reducing a price estimation problem to a binary classification problem with a
single threshold at 200,000 dollars is probably too coarse to be useful
in practice. Treating this problem as a regression problem is probably a better
idea. We will see later in this MOOC how to train and evaluate the performance
of various regression models.</p>
</div>