In [24]:
import pandas as pd

adult_census = pd.read_csv("../input/adultcensus/adult-census.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")

target_name = "class"
target = adult_census[target_name]

data = adult_census.drop(columns=[target_name])

## Selection based on data types

We will separate categorical and numerical variables by their data type:
as we saw previously, the `object` data type corresponds to categorical
columns (strings). The `make_column_selector` helper selects the
corresponding columns.

In [25]:
from sklearn.compose import make_column_selector as selector

num_col_selector = selector(dtype_exclude=object)
cat_col_selector = selector(dtype_include=object)

num_col = num_col_selector(data)
cat_col = cat_col_selector(data)
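
To see what the selectors return, here is a minimal sketch on a toy frame. The frame `toy` and its values are hypothetical and only stand in for the census data; the string column is given `dtype=object` explicitly so the example does not depend on the default string dtype of the installed pandas version.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame: one numerical column and one string column.
toy = pd.DataFrame({
    "age": [25, 40],
    "workclass": pd.Series(["Private", "State-gov"], dtype=object),
})

# Calling a selector on a DataFrame returns a list of column names.
toy_num_cols = make_column_selector(dtype_exclude=object)(toy)
toy_cat_cols = make_column_selector(dtype_include=object)(toy)

print(toy_num_cols)
print(toy_cat_cols)
```

Each selector returns a plain Python list of column names, which is why `num_col` and `cat_col` can be passed directly to a `ColumnTransformer`.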

## Dispatch columns to a specific processor

In the previous sections, we saw that we need to treat data differently
depending on their nature (i.e. numerical or categorical).

Scikit-learn provides a `ColumnTransformer` class which will send specific
columns to a specific transformer, making it easy to fit a single predictive
model on a dataset that combines both kinds of variables together
(heterogeneously typed tabular data).

We first decide which preprocessing to apply to each kind of column:

* **one-hot encoding** will be applied to categorical columns. We also
  use `handle_unknown="ignore"` so that a category absent from the training
  set (typically a rare one) is encoded as all zeros instead of raising an
  error at prediction time.
* **numerical scaling** will be applied to numerical columns, which will be
  standardized.

Now, we create our `ColumnTransformer` by specifying three values for each
preprocessor: its name, the transformer, and the columns it applies to.
First, let's create the preprocessors for the numerical and categorical
parts.

In [26]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder

cat_preprocessor = OneHotEncoder(handle_unknown="ignore")
num_preprocessor = StandardScaler()
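
The effect of `handle_unknown="ignore"` can be illustrated with a small sketch. The category values below are hypothetical and are not taken from the train/test split used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")

# Fit on two known categories (hypothetical values).
encoder.fit(np.array([["Private"], ["State-gov"]], dtype=object))

# "Never-worked" was not seen during fit: its row is encoded as all zeros
# instead of raising an error.
encoded = encoder.transform(
    np.array([["Private"], ["Never-worked"]], dtype=object)
).toarray()

print(encoded)
```

Without `handle_unknown="ignore"`, the default behaviour is to raise an error when `transform` meets a category that was not present during `fit`.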

In [27]:
# Now we create a transformer and associate each of these preprocessors with their respective columns
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ("one-hot-encoder", cat_preprocessor, cat_col),
    ("standard_scaler", num_preprocessor, num_col)])

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

In [29]:
# Interactive diagram
from sklearn import set_config

set_config(display="diagram")
model

In [30]:
# Train-test-split
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

In [31]:
# Let's train the model on the train set
_ = model.fit(data_train, target_train)

In [32]:
data_test.head()

|  | age | workclass | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7762 | 56 | Private | HS-grad | Divorced | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States |
| 23881 | 25 | Private | HS-grad | Married-civ-spouse | Transport-moving | Own-child | Other | Male | 0 | 0 | 40 | United-States |
| 30507 | 43 | Private | Bachelors | Divorced | Prof-specialty | Not-in-family | White | Female | 14344 | 0 | 40 | United-States |
| 28911 | 32 | Private | HS-grad | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 40 | United-States |
| 19484 | 39 | Private | Bachelors | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 30 | United-States |


In [33]:
model.predict(data_test)[:5]

array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)

In [34]:
target_test[:5]

7762      <=50K
23881     <=50K
30507      >50K
28911     <=50K
19484     <=50K
Name: class, dtype: object

In [35]:
model.score(data_test, target_test)

0.8575874211776268
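
For a classifier, `score` reports the mean accuracy, that is, the fraction of predictions that match the true labels. A minimal sketch on the five rows printed above, assuming the target labels carry the same leading space as the predicted ones:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# First five predictions and targets, copied from the outputs above.
# Assumption: target labels have the same leading space as the predictions.
predicted = np.array([" <=50K", " <=50K", " >50K", " <=50K", " >50K"])
actual = np.array([" <=50K", " <=50K", " >50K", " <=50K", " <=50K"])

# Accuracy is the mean of the element-wise comparison.
manual_accuracy = (predicted == actual).mean()
library_accuracy = accuracy_score(actual, predicted)

print(manual_accuracy, library_accuracy)
```

Four of the five predictions match, so both values are 0.8 on this small sample; the score above is the same quantity computed over the full test set.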

## Evaluation of the model with cross-validation

* A predictive model should be evaluated by cross-validation. Our model can
  be used with the cross-validation tools of scikit-learn like any other
  predictor:

In [38]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=5)
cv_results

{'fit_time': array([1.37903023, 1.08465838, 1.04651189, 1.11944914, 1.06203604]),
 'score_time': array([0.05375409, 0.0538826 , 0.06022191, 0.05862856, 0.05624771]),
 'test_score': array([0.8512642 , 0.8498311 , 0.84756347, 0.85247748, 0.85524161])}
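
When `cv` is an integer and the estimator is a classifier, scikit-learn splits the data with a stratified k-fold strategy. The sketch below checks this on synthetic data; the dataset and its sizes are hypothetical stand-ins, not the census data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary classification problem (hypothetical sizes).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=500)

# An integer cv and an explicit, unshuffled StratifiedKFold give the same folds.
scores_int = cross_validate(clf, X, y, cv=5)["test_score"]
scores_explicit = cross_validate(
    clf, X, y, cv=StratifiedKFold(n_splits=5)
)["test_score"]

print(np.allclose(scores_int, scores_explicit))
```

Stratification keeps the class proportions roughly equal across folds, which matters for an imbalanced target such as the income class.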

In [39]:
scores = cv_results["test_score"]
print(
    "The mean cross-validation accuracy is "
    f"{scores.mean():.3f} +/- {scores.std():.3f}"
)

The mean cross-validation accuracy is 0.851 +/- 0.003


* The compound model has a higher predictive accuracy than the separate numerical and categorical models.

## Fitting a more powerful model

**Linear models** are nice because they are usually cheap to train,
**small** to deploy, **fast** to predict and give a **good baseline**.

However, it is often useful to check whether more complex models such as an
ensemble of decision trees can lead to higher predictive performance. In this
section we will use such a model called **gradient-boosting trees** and
evaluate its generalization performance. More precisely, the scikit-learn model
we will use is called `HistGradientBoostingClassifier`. Note that boosting
models will be covered in more detail in a future module.

For tree-based models, the handling of numerical and categorical variables is
simpler than for linear models:
* we do **not need to scale the numerical features**
* using an **ordinal encoding for the categorical variables** is fine even if
  the encoding results in an arbitrary ordering

Therefore, for `HistGradientBoostingClassifier`, the preprocessing pipeline
is slightly simpler than the one we saw earlier for the `LogisticRegression`:

In [40]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

cat_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = ColumnTransformer(
    [("categorical", cat_preprocessor, cat_col)], remainder="passthrough"
)

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
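
The `handle_unknown="use_encoded_value"` and `unknown_value=-1` settings can be illustrated with a small sketch. The category values below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Known categories are mapped to integer codes following their sorted order.
encoder.fit(np.array([["Private"], ["State-gov"]], dtype=object))

# "Never-worked" was not seen during fit, so it is mapped to -1.
codes = encoder.transform(
    np.array([["State-gov"], ["Never-worked"]], dtype=object)
)

print(codes)
```

Mapping unseen categories to a dedicated code keeps prediction from failing on categories that only appear in the test set.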

In [44]:
# Let us now check its generalization performance
_ = model.fit(data_train, target_train)

In [45]:
model.score(data_test, target_test)

0.8794529522561625

We can observe that the gradient-boosting model reaches a higher test
accuracy than the logistic regression pipeline (0.879 versus 0.858 on the
same test split). This is often what we observe whenever the dataset has a
large number of samples and a limited number of informative features (e.g.
fewer than 1000) with a mix of numerical and categorical variables.

This explains why gradient-boosted machines are very popular among
data science practitioners who work with tabular data.