In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [3]:
df = pd.read_csv('adult_cencus.csv')

In [4]:
df.sample(n=5)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,class
26466,36,Private,331395,Assoc-voc,11,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,<=50K
43071,32,Self-emp-not-inc,135304,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,3942,0,32,United-States,<=50K
47992,20,Private,398166,11th,7,Never-married,Other-service,Own-child,Black,Male,0,0,40,United-States,<=50K
30897,18,,267399,12th,8,Never-married,,Own-child,White,Female,0,0,12,United-States,<=50K
11264,57,Private,191983,Some-college,10,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,50,United-States,<=50K


In [5]:
df['class'].value_counts(dropna=False)

 <=50K    37155
 >50K     11687
Name: class, dtype: int64

#### Separate the feature matrix from the target vector

In [6]:
data, target = df.drop(columns=['class']), df['class']

#### Separate numerical and categorical features

In [14]:
from sklearn.compose import make_column_selector as selector

numerical = selector(dtype_include=np.number)(data)
categorical = selector(dtype_include=object)(data)

In [15]:
numerical

['age',
 'fnlwgt',
 'education_num',
 'capital_gain',
 'capital_loss',
 'hours_per_week']

In [16]:
categorical

['workclass',
 'education',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native_country']

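#### Since the `education` variable carries the same information as `education_num`, it could be dropped (the deletion below is commented out, so `education` is kept in the rest of this notebook).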

In [11]:
# del categorical[categorical.index('education')]

In [12]:
categorical

['workclass',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native_country']

## Strategies to encode categories

### Encoding ordinal categories

The most intuitive strategy is to encode each category with a different
number. The `OrdinalEncoder` will transform the data in this manner.
We will start by encoding a single column to understand how the encoding
works.

In [17]:
data_categorical = data[categorical]
data_categorical.sample(n=3)

Unnamed: 0,workclass,education,marital_status,occupation,relationship,race,sex,native_country
35743,Private,HS-grad,Never-married,Craft-repair,Not-in-family,White,Male,United-States
24082,Private,HS-grad,Married-civ-spouse,Other-service,Husband,White,Male,United-States
12594,State-gov,Some-college,Divorced,Adm-clerical,Not-in-family,Black,Female,United-States


In [18]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
education_encoded = encoder.fit_transform(data_categorical[['education']])

In [19]:
education_encoded[:5]

array([[ 9.],
       [ 9.],
       [11.],
       [ 1.],
       [ 9.]])

We see that each category in `"education"` has been replaced by a numeric
value. We could check the mapping between the categories and the numerical
values by checking the fitted attribute `categories_`.

In [21]:
encoder.categories_

[array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object)]

Now, we can check the encoding applied on all categorical features.

In [22]:
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

array([[ 6.,  9.,  4.,  0.,  1.,  4.,  1., 38.],
       [ 5.,  9.,  2.,  3.,  0.,  4.,  1., 38.],
       [ 3., 11.,  0.,  5.,  1.,  4.,  1., 38.],
       [ 3.,  1.,  2.,  5.,  0.,  2.,  1., 38.],
       [ 3.,  9.,  2.,  9.,  5.,  2.,  0.,  4.]])

In [23]:
data_encoded.shape

(48842, 8)

In [24]:
data_encoded = pd.DataFrame(data_encoded, columns=data_categorical.columns)
data_encoded.head()

Unnamed: 0,workclass,education,marital_status,occupation,relationship,race,sex,native_country
0,6.0,9.0,4.0,0.0,1.0,4.0,1.0,38.0
1,5.0,9.0,2.0,3.0,0.0,4.0,1.0,38.0
2,3.0,11.0,0.0,5.0,1.0,4.0,1.0,38.0
3,3.0,1.0,2.0,5.0,0.0,2.0,1.0,38.0
4,3.0,9.0,2.0,9.0,5.0,2.0,0.0,4.0


We see that the categories have been encoded for each feature (column)
independently. We also note that the number of features before and after the
encoding is the same.

However, be careful when applying this encoding strategy:
using this integer representation leads downstream predictive models
to assume that the values are ordered (0 < 1 < 2 < 3... for instance).

By default, `OrdinalEncoder` uses a lexicographical strategy to map string
category labels to integers. This strategy is arbitrary and often
meaningless. For instance, suppose the dataset has a categorical variable
named `"size"` with categories such as "S", "M", "L", "XL". We would like the
integer representation to respect the meaning of the sizes by mapping them to
increasing integers such as `0, 1, 2, 3`.
However, the lexicographical strategy used by default would map the labels
"S", "M", "L", "XL" to 2, 1, 0, 3, by following the alphabetical order.

The `OrdinalEncoder` class accepts a `categories` constructor argument to
pass categories in the expected ordering explicitly.
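
As a minimal sketch of the `"size"` example above (the toy DataFrame is made up for illustration and is not part of the census data), the default lexicographical mapping can be compared with an explicit ordering passed through `categories`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column, used only to illustrate the ordering issue.
sizes = pd.DataFrame({"size": ["S", "M", "L", "XL", "M"]})

# Default behaviour: categories are sorted alphabetically (L, M, S, XL).
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes).ravel()

# Explicit ordering: one list of categories per encoded column.
ordered_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
ordered_codes = ordered_encoder.fit_transform(sizes).ravel()

print(default_codes)  # S -> 2, M -> 1, L -> 0, XL -> 3
print(ordered_codes)  # S -> 0, M -> 1, L -> 2, XL -> 3
```

Only the second encoder produces integers whose order matches the meaning of the sizes.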

If a categorical variable does not carry any meaningful order information
then this encoding might be misleading to downstream statistical models and
you might consider using one-hot encoding instead (see below).

### Encoding nominal categories (without assuming any order)

`OneHotEncoder` is an alternative encoder that prevents downstream
models from making a false assumption about the ordering of categories. For a
given feature, it will create as many new columns as there are possible
categories. For a given sample, the value of the column corresponding to the
category will be set to `1` while all the columns of the other categories
will be set to `0`.

We will start by encoding a single feature (e.g. `"education"`) to illustrate
how the encoding works.

In [25]:
from sklearn.preprocessing import OneHotEncoder

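# Dense output is easier to display; in scikit-learn >= 1.2 this argument is named `sparse_output`.
encoder = OneHotEncoder(sparse=False)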
education_encoded = encoder.fit_transform(data_categorical[['education']])

In [26]:
education_encoded[:5]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]])

In [27]:
data_encoded = encoder.fit_transform(data_categorical)

In [28]:
data_encoded[:2]

array([[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0.]])

In [29]:
column_encoded = encoder.get_feature_names_out(data_categorical.columns)
data_encoded = pd.DataFrame(data_encoded, columns=column_encoded)
data_encoded.head()

Unnamed: 0,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,workclass_nan,education_ 10th,...,native_country_ Puerto-Rico,native_country_ Scotland,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia,native_country_nan
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Choosing an encoding strategy

Choosing an encoding strategy will depend on the underlying models and the
type of categories (i.e. ordinal vs. nominal).

    Note
    In general, OneHotEncoder is the encoding strategy used when the downstream models are linear models, while OrdinalEncoder is often a good strategy with tree-based models.


## Evaluate our predictive pipeline

We can now integrate this encoder inside a machine learning pipeline. Let's train a linear classifier on the encoded data and check the generalization performance of this machine learning pipeline using
cross-validation.

Before we create the pipeline, we have to linger on the `native_country`
column. Let's recall some statistics regarding this column.

In [30]:
data['native_country'].value_counts()

 United-States                 43832
 Mexico                          951
 Philippines                     295
 Germany                         206
 Puerto-Rico                     184
 Canada                          182
 El-Salvador                     155
 India                           151
 Cuba                            138
 England                         127
 China                           122
 South                           115
 Jamaica                         106
 Italy                           105
 Dominican-Republic              103
 Japan                            92
 Guatemala                        88
 Poland                           87
 Vietnam                          86
 Columbia                         85
 Haiti                            75
 Portugal                         67
 Taiwan                           65
 Iran                             59
 Greece                           49
 Nicaragua                        49
 Peru                             46
 

Some categories, such as `Holand-Netherlands`, occur very rarely. This will
be a problem during cross-validation: if the samples of such a category end up
only in the test set during splitting, then the encoder will not have seen the
category during training and will not be able to encode it.

In scikit-learn, there are two solutions to bypass this issue:

* list all the possible categories and provide it to the encoder via the
  keyword argument `categories`;
* use the parameter `handle_unknown`.

Here, we will use the latter solution for simplicity.
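
The two options can be sketched on a toy split (the small `train` and `test` frames below are hypothetical and only mimic a rare category landing in the test fold):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy split: the rare category appears only in the test part.
train = pd.DataFrame({"native_country": ["United-States", "Mexico", "United-States"]})
test = pd.DataFrame({"native_country": ["Mexico", "Holand-Netherlands"]})

# Default behaviour (handle_unknown='error'): transform fails on the unseen category.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True

# Option 1: list every possible category up front.
all_countries = [["Holand-Netherlands", "Mexico", "United-States"]]
listed = OneHotEncoder(categories=all_countries).fit(train)
encoded_listed = listed.transform(test).toarray()

# Option 2: ignore unknown categories; they become a row of zeros.
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_ignored = tolerant.transform(test).toarray()

print(strict_failed)
print(encoded_listed)
print(encoded_ignored)
```

With `handle_unknown="ignore"`, the unseen category is encoded as all zeros, so the pipeline keeps running without having to enumerate the categories by hand.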

We can now create our machine learning pipeline.

In [35]:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    LogisticRegression(max_iter=1000)
)

Here, we need to increase the maximum number of iterations to obtain a fully converged LogisticRegression and silence a ConvergenceWarning. Contrary to the numerical features, the one-hot encoded categorical features are all on the same scale (values are 0 or 1), so they would not benefit from scaling. In this case, increasing max_iter is the right thing to do.

Finally, we can check the model's generalization performance using only the categorical columns.

In [36]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(
    model, data_categorical, target,
    cv=5,
)

In [38]:
results = pd.DataFrame(cv_results)
results

Unnamed: 0,fit_time,score_time,test_score
0,1.472827,0.039396,0.830075
1,1.456715,0.039932,0.829563
2,1.398185,0.040395,0.838247
3,1.386399,0.038378,0.832412
4,1.47418,0.038133,0.835278


In [39]:
print(f"accuracy: {results['test_score'].mean().round(3)} +/- {results['test_score'].std().round(3)}")

accuracy: 0.833 +/- 0.004



In this notebook we have:
* seen two common strategies for encoding categorical features: **ordinal
  encoding** and **one-hot encoding**;
* used a **pipeline** to use a **one-hot encoder** before fitting a logistic
  regression.

# 📝 Exercise

The goal of this exercise is to evaluate the impact of using an arbitrary
integer encoding for categorical variables along with a linear
classification model such as Logistic Regression.

To do so, let's try to use `OrdinalEncoder` to preprocess the categorical
variables. This preprocessor is assembled in a pipeline with
`LogisticRegression`. The generalization performance of the pipeline can be evaluated by cross-validation and then compared to the score obtained when using `OneHotEncoder`.

In [40]:
categorical

['workclass',
 'education',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native_country']

In [41]:
data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country'],
      dtype='object')

In [51]:
data_categorical.isna().sum()

workclass         2799
education            0
marital_status       0
occupation        2809
relationship         0
race                 0
sex                  0
native_country     857
dtype: int64

In [52]:
# Reassign instead of using inplace=True: `data_categorical` is a slice of `data`,
# and modifying it in place triggers a SettingWithCopyWarning.
data_categorical = data_categorical.fillna('Unknown')


In [53]:
data_categorical.isna().sum()

workclass         0
education         0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
native_country    0
dtype: int64

In [54]:
model_ordinal = make_pipeline(
    OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
    LogisticRegression(max_iter=1000),
)

In [55]:
cv_results_ordinal = cross_validate(
    model_ordinal,
    data_categorical,
    target,
    cv=5,
    error_score='raise'
)

In [56]:
results_ordinal = pd.DataFrame(cv_results_ordinal)

results_ordinal

Unnamed: 0,fit_time,score_time,test_score
0,0.567914,0.071839,0.75627
1,1.010058,0.071373,0.758931
2,0.678564,0.076399,0.757064
3,0.63416,0.071778,0.759009
4,0.876727,0.073054,0.756143


In [57]:
print(f"accuracy: {results_ordinal['test_score'].mean().round(3)} +/- {results_ordinal['test_score'].std().round(3)}")

accuracy: 0.757 +/- 0.001


I observe that the linear model trained with ordinal encoding performs clearly worse (accuracy of 0.757) than the same model trained with one-hot encoded variables (accuracy of 0.833).
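
For context, the class counts printed at the start of the notebook give the accuracy of a model that always predicts the majority class. A quick sketch (the dictionary simply copies those counts by hand):

```python
# Class counts copied from the `value_counts` output at the top of the notebook.
class_counts = {"<=50K": 37155, ">50K": 11687}

# Accuracy obtained by always predicting the most frequent class.
majority_rate = max(class_counts.values()) / sum(class_counts.values())
print(f"majority-class accuracy: {majority_rate:.3f}")
```

This gives roughly 0.761, so the ordinal-encoded pipeline (0.757) does no better than this trivial baseline, whereas the one-hot pipeline (0.833) improves on it.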