In [1]:
import pandas as pd

In [2]:
adult_census = pd.read_csv('data/adult-census.csv')

In [3]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

In [4]:
from sklearn.compose import make_column_selector as selector

In [5]:
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)

## Reference pipeline (no numerical scaling and integer-coded categories)

In [6]:
%%time 
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
# Only needed for scikit-learn < 1.0; in later versions this import is a no-op.
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)
preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

The mean cross-validation accuracy is: 0.873 +/- 0.002
Wall time: 11.3 s


## Scaling numerical features

In [7]:
%%time
from sklearn.preprocessing import StandardScaler

preprocessor = ColumnTransformer([
    ('numerical', StandardScaler(), numerical_columns),
    ('categorical', OrdinalEncoder(handle_unknown="use_encoded_value",
                                   unknown_value=-1), categorical_columns)
])

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

The mean cross-validation accuracy is: 0.874 +/- 0.003
Wall time: 11 s


## Analysis

Both the accuracy (0.874 vs. 0.873) and the training time (about 11 s in both runs) are approximately the same as for the reference pipeline; any time difference you might observe is not significant.

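Scaling numerical features has no effect on most decision tree models in general and on HistGradientBoostingClassifier in particular: a tree split compares a feature to a threshold, so it depends only on the ordering of the values, which standardization preserves.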
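
The point about thresholds can be checked with a minimal sketch. The code below is a hand-written decision stump in plain Python, not the algorithm used by HistGradientBoostingClassifier; the `age` and `high_income` values are hypothetical toy data, and `gini` and `best_split` are illustrative helpers. It searches for the best threshold on a raw feature and on its standardized version, then compares which samples end up on the left side of the split.

```python
from statistics import mean, pstdev


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return the set of sample indices sent left by the best threshold split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float("inf"), None
    # Try a split between every pair of consecutive sorted samples.
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        if values[left[-1]] == values[right[0]]:
            continue  # no threshold can separate equal values
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right]))
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


# Hypothetical toy data: ages and a binary target.
age = [22, 25, 31, 38, 45, 52, 58, 63]
high_income = [0, 0, 0, 1, 0, 1, 1, 1]

# Standardize to zero mean and unit variance (population standard deviation).
mu, sigma = mean(age), pstdev(age)
age_scaled = [(a - mu) / sigma for a in age]

raw_partition = best_split(age, high_income)
scaled_partition = best_split(age_scaled, high_income)
print("Same partition:", raw_partition == scaled_partition)  # True
```

The threshold value itself changes with scaling, but the samples on each side of it do not, so the resulting tree makes the same predictions.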

## One-hot encoding of categorical variables

For linear models, we have observed that integer coding of categorical variables can be very detrimental. However, for HistGradientBoostingClassifier models this does not seem to be the case: the cross-validation accuracy of the reference pipeline with OrdinalEncoder is good.

Let’s see if we can get an even better accuracy with OneHotEncoder.

Hint: HistGradientBoostingClassifier does not yet support sparse input data. You might want to use OneHotEncoder(handle_unknown="ignore", sparse=False) to force the use of a dense representation as a workaround.

In [8]:
%%time
from sklearn.preprocessing import OneHotEncoder

# Note: scikit-learn 1.2 renamed this parameter; use sparse_output=False there.
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore", sparse=False)
preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

The mean cross-validation accuracy is: 0.874 +/- 0.002
Wall time: 22.1 s


## Analysis 

In terms of accuracy, the result is almost exactly the same as with OrdinalEncoder (0.874 vs. 0.873). The reason is that HistGradientBoostingClassifier is expressive and robust enough to deal with the misleading ordering of integer-coded categories, which was not the case for linear models.
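
One way to picture this robustness is the toy sketch below, written in plain Python. It is an illustration under assumed data, not the classifier's actual implementation: the integer codes and targets are hypothetical, and `one_split_predict` and `two_split_predict` are hand-written stand-ins for trees of depth 1 and 2. Only the category coded 1 is positive, so no single threshold on the code can separate it, but two nested thresholds can.

```python
# Hypothetical toy data: an arbitrary integer code per category, where only
# the category coded 1 is associated with the positive class.
codes = [0, 0, 1, 1, 2, 2, 3, 3]
target = [0, 0, 1, 1, 0, 0, 0, 0]


def accuracy(predictions, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


def one_split_predict(code, threshold, left_label, right_label):
    """Depth-1 tree: a single threshold on the integer code."""
    return left_label if code <= threshold else right_label


def two_split_predict(code, low, high):
    """Depth-2 tree: predict 1 only when low < code <= high."""
    if code <= low:
        return 0
    return 1 if code <= high else 0


# Best accuracy reachable with any single threshold and any leaf labelling.
best_single = max(
    accuracy([one_split_predict(c, t, left, right) for c in codes], target)
    for t in (0.5, 1.5, 2.5)
    for left in (0, 1)
    for right in (0, 1)
)

# Two nested thresholds isolate the category coded 1 exactly.
nested = accuracy([two_split_predict(c, 0.5, 1.5) for c in codes], target)

print(f"single split: {best_single:.3f}, two splits: {nested:.3f}")
# single split: 0.750, two splits: 1.000
```

A model restricted to a monotonic dependence on the code is in the same position as the single split, whereas a tree can keep splitting until the arbitrary ordering no longer matters.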

However, the training time is roughly twice as long (22.1 s vs. 11.3 s), because OneHotEncoder generates approximately 10 times more features than OrdinalEncoder.
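
The difference in width comes from how the two encodings are built, which the following sketch shows on a hypothetical toy table (not the census data). The helpers `ordinal_encode` and `one_hot_encode` are simplified plain-Python stand-ins for the scikit-learn encoders, written only to count the output columns.

```python
# Hypothetical toy table with two categorical columns.
rows = [
    {"workclass": "Private", "education": "HS-grad"},
    {"workclass": "Self-emp", "education": "Bachelors"},
    {"workclass": "Private", "education": "Masters"},
    {"workclass": "Local-gov", "education": "HS-grad"},
    {"workclass": "Private", "education": "Doctorate"},
]
columns = ["workclass", "education"]


def ordinal_encode(rows, columns):
    """Map each category to an integer: one output column per input column."""
    categories = {c: sorted({r[c] for r in rows}) for c in columns}
    return [[categories[c].index(r[c]) for c in columns] for r in rows]


def one_hot_encode(rows, columns):
    """Emit one 0/1 indicator per (column, category) pair."""
    categories = {c: sorted({r[c] for r in rows}) for c in columns}
    return [
        [int(r[c] == cat) for c in columns for cat in categories[c]]
        for r in rows
    ]


ordinal = ordinal_encode(rows, columns)
one_hot = one_hot_encode(rows, columns)

print("ordinal width:", len(ordinal[0]))  # 2, one per categorical column
print("one-hot width:", len(one_hot[0]))  # 7, one per distinct category (3 + 4)
```

In the notebook itself, comparing `preprocessor.fit_transform(data).shape` for the two preprocessors would give the actual column counts on the census data.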

Note that the current implementation of HistGradientBoostingClassifier is still incomplete: once sparse representations are handled correctly, training time might improve with this kind of encoding.

The main take-away message is that arbitrary integer coding of categories is perfectly fine for HistGradientBoostingClassifier and yields fast training times.