# Categorical Encoding Demo and Examples

This is a Jupyter notebook for exploring the categorical-encoding library discussed in a [Feature Labs article] I wrote on the topic.

## Encoder API

In [1]:
import categorical_encoding as ce
import featuretools as ft

from featuretools.tests.testing_utils import make_ecommerce_entityset

In [2]:
es = make_ecommerce_entityset()
f1 = ft.Feature(es["log"]["product_id"])
f2 = ft.Feature(es["log"]["purchased"])
f3 = ft.Feature(es["log"]["value"])
f4 = ft.Feature(es["log"]["countrycode"])

features = [f1, f2, f3, f4]
ids = [0, 1, 2, 3, 4, 5]
feature_matrix = ft.calculate_feature_matrix(features, es,
                                             instance_ids=ids)
print(feature_matrix)

    product_id  purchased  value countrycode
id                                          
0    coke zero       True    0.0          US
1    coke zero       True    5.0          US
2    coke zero       True   10.0          US
3          car       True   15.0          US
4          car       True   20.0          US
5   toothpaste       True    0.0          AL


A train-test split is standard in machine learning pipelines. Here, I simulate one by arbitrarily assigning rows to either the training set or the test set.

In [3]:
train_data = feature_matrix.iloc[[0, 1, 4, 5]]
print(train_data)

    product_id  purchased  value countrycode
id                                          
0    coke zero       True    0.0          US
1    coke zero       True    5.0          US
4          car       True   20.0          US
5   toothpaste       True    0.0          AL


In [4]:
test_data = feature_matrix.iloc[[2, 3]]
print(test_data)

   product_id  purchased  value countrycode
id                                         
2   coke zero       True   10.0          US
3         car       True   15.0          US


Next, we initialize the encoder, fit it on the training data, and apply the fitted encoder to the test data.

In [5]:
enc = ce.Encoder(method='leave_one_out')

train_enc = enc.fit_transform(train_data, features, train_data['value'])

test_enc = enc.transform(test_data)

In [6]:
print(train_enc)

    PRODUCT_ID_leave_one_out  purchased  value  COUNTRYCODE_leave_one_out
id                                                                       
0                       5.00       True    0.0                      12.50
1                       0.00       True    5.0                      10.00
4                       6.25       True   20.0                       2.50
5                       6.25       True    0.0                       6.25


In [7]:
print(test_enc)

    PRODUCT_ID_leave_one_out  purchased  value  COUNTRYCODE_leave_one_out
id                                                                       
2                       2.50       True   10.0                   8.333333
3                       6.25       True   15.0                   8.333333


Note that the encoder uses only the training data to learn its encoding; the test data is encoded directly using the learned mappings.
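The `product_id` values above can be reproduced by hand. The following is a minimal pandas sketch of leave-one-out encoding, not the library's implementation. The function names are hypothetical, and the fallback to the global training mean for categories seen only once is an assumption inferred from the output above.

```python
import pandas as pd

# Same rows as train_data / test_data above, reduced to one categorical column.
train = pd.DataFrame(
    {"product_id": ["coke zero", "coke zero", "car", "toothpaste"],
     "value": [0.0, 5.0, 20.0, 0.0]},
    index=pd.Index([0, 1, 4, 5], name="id"),
)
test = pd.DataFrame(
    {"product_id": ["coke zero", "car"]},
    index=pd.Index([2, 3], name="id"),
)


def fit_leave_one_out(categories, target):
    """Learn per-category target sums and counts, plus the global mean."""
    stats = target.groupby(categories).agg(["sum", "count"])
    return stats, target.mean()


def encode_train(categories, target, stats, global_mean):
    """Encode training rows with the mean target of the *other* rows in the category."""
    total = categories.map(stats["sum"])
    count = categories.map(stats["count"])
    # A category seen once has no other rows: the divisor becomes NaN,
    # and the result falls back to the global mean (assumed behavior).
    loo = (total - target) / (count - 1).where(count > 1)
    return loo.fillna(global_mean)


def encode_test(categories, stats, global_mean):
    """Encode new rows from the learned statistics only; no test targets are used."""
    count = categories.map(stats["count"])
    mean = categories.map(stats["sum"]) / count
    # Singleton or unseen categories fall back to the global training mean.
    return mean.where(count > 1, global_mean)


stats, global_mean = fit_leave_one_out(train["product_id"], train["value"])
train_loo = encode_train(train["product_id"], train["value"], stats, global_mean)
test_loo = encode_test(test["product_id"], stats, global_mean)
print(train_loo.tolist())  # [5.0, 0.0, 6.25, 6.25]
print(test_loo.tolist())   # [2.5, 6.25]
```

The test rows never contribute their own `value` to the encoding, which is why fitting on the training set alone avoids leaking test targets.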

Normally, we would have to repeat the entire categorical encoding process for a new feature matrix such as the following.

In [8]:
fm2 = ft.calculate_feature_matrix(features, es, instance_ids=[6,7])
print(fm2)

    product_id  purchased  value countrycode
id                                          
6   toothpaste       True    1.0          AL
7   toothpaste       True    2.0          AL


However, because the encoder integrates with Featuretools, we can generate already-encoded data directly from the encoded features.

In [9]:
features_encoded = enc.get_features()

fm2_encoded = ft.calculate_feature_matrix(features_encoded, es, instance_ids=[6,7])
print(fm2_encoded)

    PRODUCT_ID_leave_one_out  purchased  value  COUNTRYCODE_leave_one_out
id                                                                       
6                       6.25       True    1.0                       6.25
7                       6.25       True    2.0                       6.25


## Encoding Methods Examples

For reference, here is our original feature matrix:

In [10]:
feature_matrix

    product_id  purchased  value countrycode
id
0    coke zero       True    0.0          US
1    coke zero       True    5.0          US
2    coke zero       True   10.0          US
3          car       True   15.0          US
4          car       True   20.0          US
5   toothpaste       True    0.0          AL


### Classic Encoders

In [17]:
# Creates a new column for each unique value. 
enc_one_hot = ce.Encoder(method='one_hot')
fm_enc_one_hot = enc_one_hot.fit_transform(feature_matrix, features)
fm_enc_one_hot

    product_id = coke zero  product_id = car  product_id = toothpaste  purchased  value  countrycode = US  countrycode = AL
id
0                        1                 0                        0       True    0.0                 1                 0
1                        1                 0                        0       True    5.0                 1                 0
2                        1                 0                        0       True   10.0                 1                 0
3                        0                 1                        0       True   15.0                 1                 0
4                        0                 1                        0       True   20.0                 1                 0
5                        0                 0                        1       True    0.0                 0                 1
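For comparison, a similar indicator layout can be produced with plain pandas. This is a sketch using `pandas.get_dummies` rather than the library's encoder; the `" = "` separator is chosen only to mimic the column names above, and pandas orders the new columns alphabetically by category rather than by first appearance.

```python
import pandas as pd

df = pd.DataFrame(
    {"product_id": ["coke zero", "coke zero", "coke zero", "car", "car", "toothpaste"],
     "countrycode": ["US", "US", "US", "US", "US", "AL"]},
    index=pd.Index(range(6), name="id"),
)

# One indicator column per unique value of each listed column.
one_hot = pd.get_dummies(
    df, columns=["product_id", "countrycode"], prefix_sep=" = ", dtype=int
)
print(one_hot)
```

Each row has exactly one 1 per original categorical column, so the number of new columns grows with the number of unique values.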


In [12]:
# Each unique string value is assigned a counting number specific to that value.
enc_ord = ce.Encoder(method='ordinal')
fm_enc_ord = enc_ord.fit_transform(feature_matrix, features)
fm_enc_ord

    PRODUCT_ID_ordinal  purchased  value  COUNTRYCODE_ordinal
id                                                           
0                    1       True    0.0                    1
1                    1       True    5.0                    1
2                    1       True   10.0                    1
3                    2       True   15.0                    1
4                    2       True   20.0                    1
5                    3       True    0.0                    2
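The mapping behind the ordinal output can be sketched in a few lines of plain Python. The helper names are hypothetical, and returning `-1` for a category not seen during fitting is an assumption made for illustration, not a statement about the library's behavior.

```python
def fit_ordinal(values):
    """Assign 1, 2, 3, ... to categories in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def transform_ordinal(values, mapping, unknown=-1):
    """Look up each value; categories unseen at fit time get `unknown` (assumed)."""
    return [mapping.get(value, unknown) for value in values]


product_id = ["coke zero", "coke zero", "coke zero", "car", "car", "toothpaste"]
mapping = fit_ordinal(product_id)
print(transform_ordinal(product_id, mapping))           # [1, 1, 1, 2, 2, 3]
print(transform_ordinal(["car", "bicycle"], mapping))   # [2, -1]
```

The integers carry no meaning beyond identity, so models that treat them as ordered quantities may read structure into the data that is not there.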


In [21]:
# The categories' values are first Ordinal encoded,
# the resulting integers are converted to binary,
# then the resulting digits are split into columns.
enc_bin = ce.Encoder(method='binary')
fm_enc_bin = enc_bin.fit_transform(feature_matrix, features)
fm_enc_bin

    PRODUCT_ID_binary[0]  PRODUCT_ID_binary[1]  PRODUCT_ID_binary[2]  purchased  value  COUNTRYCODE_binary[0]  COUNTRYCODE_binary[1]
id
0                      0                     0                     1       True    0.0                      0                      1
1                      0                     0                     1       True    5.0                      0                      1
2                      0                     0                     1       True   10.0                      0                      1
3                      0                     1                     0       True   15.0                      0                      1
4                      0                     1                     0       True   20.0                      0                      1
5                      0                     1                     1       True    0.0                      1                      0
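The three steps in the comment (ordinal encode, convert to binary, split digits into columns) can be sketched in plain Python. The function name is hypothetical, and the column-count rule below is an assumption chosen because it matches the widths in the output above (three columns for three products, two for two country codes).

```python
def binary_encode(values):
    """Ordinal-encode the values, then split each code's binary digits into columns."""
    # Step 1: ordinal codes 1, 2, 3, ... in order of first appearance.
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping) + 1)

    # Assumed width rule: bits needed for (n_categories - 1), plus one extra column.
    width = (len(mapping) - 1).bit_length() + 1

    # Steps 2 and 3: zero-padded binary string, one digit per column.
    return [[int(bit) for bit in format(mapping[value], f"0{width}b")]
            for value in values]


product_id = ["coke zero", "coke zero", "coke zero", "car", "car", "toothpaste"]
countrycode = ["US", "US", "US", "US", "US", "AL"]
print(binary_encode(product_id))
# [[0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 1, 0], [0, 1, 0], [0, 1, 1]]
print(binary_encode(countrycode))
# [[0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [1, 0]]
```

Binary encoding needs roughly log2 of the number of categories in columns, which is why it scales better than one-hot encoding for high-cardinality features.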


In [23]:
# Uses a hashing algorithm to map each category value to one of a fixed number of columns.
enc_hash = ce.Encoder(method='hashing')
fm_enc_hash = enc_hash.fit_transform(feature_matrix, features)
fm_enc_hash

id,PRODUCT_ID_hashing[0],PRODUCT_ID_hashing[1],PRODUCT_ID_hashing[2],PRODUCT_ID_hashing[3],PRODUCT_ID_hashing[4],PRODUCT_ID_hashing[5],PRODUCT_ID_hashing[6],PRODUCT_ID_hashing[7],purchased,value,COUNTRYCODE_hashing[0],COUNTRYCODE_hashing[1],COUNTRYCODE_hashing[2],COUNTRYCODE_hashing[3],COUNTRYCODE_hashing[4],COUNTRYCODE_hashing[5],COUNTRYCODE_hashing[6],COUNTRYCODE_hashing[7]
0,0,0,0,0,1,0,0,0,True,0.0,0,0,1,0,0,0,0,0
1,0,0,0,0,1,0,0,0,True,5.0,0,0,1,0,0,0,0,0
2,0,0,0,0,1,0,0,0,True,10.0,0,0,1,0,0,0,0,0
3,0,1,0,0,0,0,0,0,True,15.0,0,0,1,0,0,0,0,0
4,0,1,0,0,0,0,0,0,True,20.0,0,0,1,0,0,0,0,0
5,0,0,0,1,0,0,0,0,True,0.0,0,1,0,0,0,0,0,0
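The idea behind the hashing output can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: the choice of MD5, the modulo step, and the function name are all assumptions, so the column a given category lands in may differ from the output above. Only the eight-column width is taken from that output.

```python
import hashlib


def hash_encode(values, n_components=8):
    """Map each value to one of `n_components` columns via a hash of its string form."""
    rows = []
    for value in values:
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        column = int(digest, 16) % n_components  # deterministic column index
        row = [0] * n_components
        row[column] = 1
        rows.append(row)
    return rows


product_id = ["coke zero", "coke zero", "coke zero", "car", "car", "toothpaste"]
encoded = hash_encode(product_id)
for value, row in zip(product_id, encoded):
    print(value, row)
```

No mapping has to be stored, and unseen categories are handled for free, at the cost of possible collisions when two categories hash to the same column.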


### Bayesian Encoders

In [25]:
# Replaces each category value with a weighted blend of that category's mean target and the overall mean target.
enc_targ = ce.Encoder(method='target')
fm_enc_targ = enc_targ.fit_transform(feature_matrix, features, feature_matrix['value'])
fm_enc_targ

    PRODUCT_ID_target  purchased  value  COUNTRYCODE_target
id
0            5.397343       True    0.0            9.970023
1            5.397343       True    5.0            9.970023
2            5.397343       True   10.0            9.970023
3           15.034704       True   15.0            9.970023
4           15.034704       True   20.0            9.970023
5            8.333333       True    0.0            8.333333
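To make the "weighted average" concrete, here is a plain-Python sketch of one smoothing scheme that reproduces the numbers above. It is not taken from the library's source: the sigmoid weighting, the parameter names `min_samples_leaf` and `smoothing` with their values of 1 and 1.0, and the rule that single-row categories receive the overall mean are all assumptions that happen to match this output.

```python
import math
from collections import defaultdict


def target_encode(categories, targets, min_samples_leaf=1, smoothing=1.0):
    """Blend each category's mean target with the overall mean (the prior)."""
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    mapping = {}
    for category, n in counts.items():
        if n == 1:
            # Assumed: a single observation is not trusted, so use the prior.
            mapping[category] = prior
        else:
            # The weight approaches 1 as the category gets more observations.
            weight = 1 / (1 + math.exp(-(n - min_samples_leaf) / smoothing))
            mapping[category] = prior * (1 - weight) + (sums[category] / n) * weight
    return [mapping[category] for category in categories]


product_id = ["coke zero", "coke zero", "coke zero", "car", "car", "toothpaste"]
value = [0.0, 5.0, 10.0, 15.0, 20.0, 0.0]
encoded = target_encode(product_id, value)
print([round(x, 6) for x in encoded])
# [5.397343, 5.397343, 5.397343, 15.034704, 15.034704, 8.333333]
```

For `coke zero`, the category mean is 5.0 and the overall mean is about 8.33; with three observations the weight is about 0.88, which gives 0.88 × 5.0 + 0.12 × 8.33 ≈ 5.40.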


In [27]:
# Similar to target encoding, except that each row's own target value is left out when computing its category average.
enc_leave = ce.Encoder(method='leave_one_out')
fm_enc_leave = enc_leave.fit_transform(feature_matrix, features, feature_matrix['value'])
fm_enc_leave

    PRODUCT_ID_leave_one_out  purchased  value  COUNTRYCODE_leave_one_out
id
0                   7.500000       True    0.0                  12.500000
1                   5.000000       True    5.0                  11.250000
2                   2.500000       True   10.0                  10.000000
3                  20.000000       True   15.0                   8.750000
4                  15.000000       True   20.0                   7.500000
5                   8.333333       True    0.0                   8.333333


In [28]:
print(fm_enc_leave.to_html())

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>PRODUCT_ID_leave_one_out</th>
      <th>purchased</th>
      <th>value</th>
      <th>COUNTRYCODE_leave_one_out</th>
    </tr>
    <tr>
      <th>id</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>7.500000</td>
      <td>True</td>
      <td>0.0</td>
      <td>12.500000</td>
    </tr>
    <tr>
      <th>1</th>
      <td>5.000000</td>
      <td>True</td>
      <td>5.0</td>
      <td>11.250000</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2.500000</td>
      <td>True</td>
      <td>10.0</td>
      <td>10.000000</td>
    </tr>
    <tr>
      <th>3</th>
      <td>20.000000</td>
      <td>True</td>
      <td>15.0</td>
      <td>8.750000</td>
    </tr>
    <tr>
      <th>4</th>
      <td>15.000000</td>
      <td>True</td>
      <td>20.0</td>
      <td>7.500000</td>
    </tr>
    <tr>
      <th>