# 📝 Exercise M1.05

The goal of this exercise is to **evaluate the impact of feature preprocessing
on a pipeline** that uses a `decision-tree`-based classifier instead of a `logistic
regression`.

- The first question is to empirically evaluate whether scaling numerical
  features is helpful or not;
- The second question is to evaluate whether it is empirically better (both
  from a computational and a statistical perspective) to use integer coded or
  one-hot encoded categories.

In [None]:
import pandas as pd

# Colab: https://github.com/INRIA/scikit-learn-mooc/blob/main/datasets/adult-census.csv?raw=true
adult_census = pd.read_csv("https://github.com/INRIA/scikit-learn-mooc/blob/main/datasets/adult-census.csv?raw=true")

In [None]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])
data

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States
...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States
48838,40,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States
48839,58,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States
48840,22,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States


As in the previous notebooks, we use the utility `make_column_selector`
to select only columns with a specific data type: columns of dtype `object`
are treated as categorical, and all remaining columns as numerical.

In [None]:
from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)
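
To make the selection step concrete, here is a minimal sketch on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only and are not taken from the census data. It shows that calling a selector on a DataFrame returns a plain list of column names.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame: two integer columns and one object (string) column.
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "hours-per-week": [40, 50, 40],
    "workclass": pd.Series(["Private", "Private", "Local-gov"], dtype=object),
})

# Same selectors as above: split columns by dtype.
toy_numerical = selector(dtype_exclude=object)(toy)
toy_categorical = selector(dtype_include=object)(toy)

print(toy_numerical)
print(toy_categorical)
```

The resulting lists can be passed directly as the column specification of a `ColumnTransformer`, which is how `numerical_columns` and `categorical_columns` are used in the rest of this exercise.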

## Reference pipeline (no numerical scaling and integer-coded categories)

First let's time the pipeline we used in the main notebook to serve as a
reference:

In [None]:
import time

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)
preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

start = time.time()
cv_results = cross_validate(model, data, target)
elapsed_time = time.time() - start

scores = cv_results["test_score"]

print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} ± {scores.std():.3f} "
      f"with a fitting time of {elapsed_time:.3f}")

The mean cross-validation accuracy is: 0.873 ± 0.002 with a fitting time of 37.814


## Scaling numerical features

Let's write a similar pipeline that also scales the numerical features using
`StandardScaler` (or similar):

In [None]:
# Write your code here.
# Scale numerical columns; keep the ordinal encoding for categorical columns.
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
mi_preprocessor = ColumnTransformer(transformers=[
    ('Numericas', numeric_transformer, numerical_columns),
    ('Categoricas', categorical_preprocessor, categorical_columns)])
mi_modelo = make_pipeline(mi_preprocessor, HistGradientBoostingClassifier())

In [None]:
from sklearn import set_config
set_config(display="diagram")

mi_modelo

Test the model with StandardScaler

In [None]:
start = time.time()
resultados = cross_validate(mi_modelo, data, target)
elapsed_time = time.time() - start

puntuaciones = resultados["test_score"]

print("The mean cross-validation accuracy is: "
      f"{puntuaciones.mean():.3f} ± {puntuaciones.std():.3f} "
      f"with a fitting time of {elapsed_time:.3f}")

The mean cross-validation accuracy is: 0.873 ± 0.003 with a fitting time of 8.024
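
The accuracy is essentially unchanged by scaling. One way to see why this is expected is that tree-based models split on thresholds, so they depend only on the ordering of the values within each feature, and standardization preserves that ordering. The sketch below checks this on synthetic data with a single `DecisionTreeClassifier`, used here as a simpler stand-in for the gradient-boosting model. The data, the column scale factors, and the `max_depth=3` setting are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption), with deliberately different column scales.
X_syn, y_syn = make_classification(n_samples=300, n_features=5,
                                   random_state=0)
X_syn = X_syn * np.array([1.0, 10.0, 100.0, 0.1, 1000.0])
X_scaled = StandardScaler().fit_transform(X_syn)

# Fit the same tree on raw and on standardized features.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_raw.fit(X_syn, y_syn)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_scaled.fit(X_scaled, y_syn)

# Compare the predictions of the two trees sample by sample.
agreement = np.mean(tree_raw.predict(X_syn) == tree_scaled.predict(X_scaled))
print(f"Fraction of identical predictions: {agreement:.3f}")
```

Both trees should partition the samples in the same way, so the two sets of predictions are expected to agree on essentially every sample.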


## One-hot encoding of categorical variables

We observed that integer coding of categorical variables can be very
detrimental for linear models. However, it does not seem to be the case for
`HistGradientBoostingClassifier` models, as the cross-validation score
of the reference pipeline with `OrdinalEncoder` is reasonably good.

Let's see if we can get an even better accuracy with `OneHotEncoder`.

Hint: `HistGradientBoostingClassifier` does not yet support sparse input
data. You might want to use
`OneHotEncoder(handle_unknown="ignore", sparse=False)` to force the use of a
dense representation as a workaround. In scikit-learn 1.2 and later, the
`sparse` parameter is named `sparse_output`.
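
As a small illustration of the two encodings compared in this exercise, the sketch below shows how each one represents a category that was not seen during `fit`. The tiny `train` and `unseen` arrays are made up for this example. The dense conversion uses `.toarray()` on the default sparse output, an assumption made here so that the snippet does not depend on the name of the dense-output keyword.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical training categories and one category unseen during fit.
train = np.array([["Private"], ["Local-gov"], ["Private"]], dtype=object)
unseen = np.array([["Never-worked"]], dtype=object)

ordinal = OrdinalEncoder(handle_unknown="use_encoded_value",
                         unknown_value=-1).fit(train)
one_hot = OneHotEncoder(handle_unknown="ignore").fit(train)

# Integer coding maps the unknown category to the sentinel value -1.
ordinal_code = ordinal.transform(unseen)
# One-hot coding maps it to a row of zeros, one column per known category.
one_hot_row = one_hot.transform(unseen).toarray()

print(ordinal_code)  # [[-1.]]
print(one_hot_row)   # [[0. 0.]]
```

One-hot encoding produces one column per known category, so the feature matrix grows with the number of distinct categories, whereas integer coding keeps a single column per original feature.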

In [None]:
from sklearn.preprocessing import OneHotEncoder
categorical_hot = OneHotEncoder(sparse=False,
                                handle_unknown="infrequent_if_exists")

In [None]:
# Write your code here.
preprocessor_hot = ColumnTransformer(transformers=[
    ('Numericas', numeric_transformer, numerical_columns),
    ('Categoricas', categorical_hot, categorical_columns)])
modelo_hot = make_pipeline(preprocessor_hot, HistGradientBoostingClassifier())
modelo_hot

Test the model with StandardScaler and OneHotEncoder

In [None]:
start = time.time()
# Evaluate the one-hot pipeline (modelo_hot), not the ordinal one (mi_modelo).
resultados_hot = cross_validate(modelo_hot, data, target)
elapsed_time = time.time() - start

puntuaciones_hot = resultados_hot["test_score"]

print("The mean cross-validation accuracy is: "
      f"{puntuaciones_hot.mean():.3f} ± {puntuaciones_hot.std():.3f} "
      f"with a fitting time of {elapsed_time:.3f}")

The mean cross-validation accuracy is: 0.873 ± 0.002 with a fitting time of 8.218
