This post covers how to handle categorical features. We use one-hot encoding to convert a categorical feature with K nominals (`levels` in R's terminology) into a K-vector that has a single 1 at the position of the observed nominal and 0 everywhere else.
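
To make the K-vector idea concrete, here is a minimal sketch for a toy feature with K = 3. The names `levels`, `index` and `one_hot` are hypothetical and used only for illustration; the rest of the post relies on scikit-learn instead.

```python
import numpy as np

# Toy feature with K = 3 nominals (hypothetical example values).
levels = ['A', 'B', 'C']
index = {level: i for i, level in enumerate(levels)}

def one_hot(value):
    """Return a K-vector with a single 1 at the position of `value`."""
    vec = np.zeros(len(levels))
    vec[index[value]] = 1.0
    return vec

print(one_hot('B'))  # [0. 1. 0.]
```

Note that `one_hot('D')` raises a `KeyError` here, which is the same kind of failure on unseen values that the rest of the post deals with.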

In ML applications, we typically divide the data set into a training set and a test set. It is not uncommon for some of a feature's nominals to appear in the test set but not in the training set. This can easily break the encoding procedure if not treated carefully. We demonstrate how to handle this kind of situation.

Finally, we also consider cases when some values are missing. 

A good discussion thread can be found on Stack Overflow:

http://stackoverflow.com/questions/10196860/python-pandas-how-to-turn-a-dataframe-with-factors-into-a-design-matrix-for-l

In [5]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [3]:
train_df = pd.DataFrame({'col1': ['A', 'B', 'A', 'C', 'B'],
                         'col2': ['Y', 'X', 'X', 'Y', 'Y']})
train_df

Unnamed: 0,col1,col2
0,A,Y
1,B,X
2,A,X
3,C,Y
4,B,Y


In [4]:
test_df = pd.DataFrame({'col1': ['B', 'A', 'E', 'B'],
                        'col2': ['Z', 'X', 'Y', 'Y']})
test_df

Unnamed: 0,col1,col2
0,B,Z
1,A,X
2,E,Y
3,B,Y


We will use scikit-learn utilities to do the heavy lifting. The procedure is the following:

1) For each feature, numerically encode the nominals, using the `LabelEncoder` class

2) For each encoded feature (values converted from 'A', 'B', 'C', ... to 0, 1, 2, ...), use `OneHotEncoder` to convert it to a K-vector

3) Collect all the processed columns to form the design matrix


In [11]:
le = LabelEncoder()
le.fit(train_df['col1'])
print('nominals for col1: {0:}'.format(le.classes_))

nominals for col1: ['A' 'B' 'C']


In [14]:
# The naive implementation above breaks when an unseen nominal is present, as the following shows.
# By `unseen`, we mean that the value did not exist when we fit the encoding model.
le.transform(['D'])

ValueError: y contains new labels: ['D']

In [32]:
# The trick we use is to add a sentinel level called '0_unknown' to the fitted values.
# Because LabelEncoder sorts its classes and '0' sorts before the uppercase letters,
#  the sentinel is the first element in the list of nominals.
# Before we pass test data to the encoder, we first check whether there are unseen values; if yes,
#  we replace those unseen values with the sentinel value.

le_safer = LabelEncoder()
nominals = ['0_unknown'] + list(train_df['col1'].unique())
le_safer.fit(nominals)
print('nominal values: {0:}'.format(le_safer.classes_))
print('')
test_values = ['C', 'D', 'B']
print('encoded {0:}'.format(test_values))
print(le_safer.transform([e if e in le_safer.classes_[1:] else le_safer.classes_[0] for e in test_values]))

nominal values: ['0_unknown' 'A' 'B' 'C']

encoded ['C', 'D', 'B']
[3 0 2]
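
The sentinel trick above can be wrapped in two small helpers so that the fitting and the replacement of unseen values stay together. This is only a sketch: the names `SENTINEL`, `fit_safe_encoder` and `safe_transform` are hypothetical, and the code assumes that the sentinel string sorts before every real nominal, as it does for the uppercase letters used in this post.

```python
from sklearn.preprocessing import LabelEncoder

SENTINEL = '0_unknown'  # assumed to sort before every real nominal

def fit_safe_encoder(train_values):
    """Fit a LabelEncoder on the training nominals plus the sentinel."""
    le = LabelEncoder()
    le.fit([SENTINEL] + list(set(train_values)))
    # LabelEncoder stores its classes sorted; the trick needs the sentinel first.
    if le.classes_[0] != SENTINEL:
        raise ValueError('sentinel does not sort before the nominals')
    return le

def safe_transform(le, values):
    """Encode values, mapping anything not seen during fit to the sentinel."""
    known = set(le.classes_[1:])
    return le.transform([v if v in known else SENTINEL for v in values])

le_demo = fit_safe_encoder(['A', 'B', 'A', 'C', 'B'])
codes = safe_transform(le_demo, ['C', 'D', 'B'])
print(codes)  # [3 0 2]
```

Building `known` as a set once also avoids scanning the class array for every value, which matters for long columns.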


In [None]:
# Let's display the data frames with added columns

In [40]:
categorical_columns = ['col1', 'col2']
label_encoders = []
for col in categorical_columns:
    nominals = ['0_unknown'] + list(train_df[col].unique())
    le = LabelEncoder()
    le.fit(nominals)
    train_df['encoded_'+col] = le.transform(train_df[col])
    
    test_values = [e if e in le.classes_[1:] else le.classes_[0] for e in test_df[col].tolist()]
    test_df['encoded_'+col] = le.transform(test_values)
    
    label_encoders.append(le)

In [41]:
train_df

Unnamed: 0,col1,col2,encoded_col1,encoded_col2
0,A,Y,1,2
1,B,X,2,1
2,A,X,1,1
3,C,Y,3,2
4,B,Y,2,2


In [42]:
test_df

Unnamed: 0,col1,col2,encoded_col1,encoded_col2
0,B,Z,2,0
1,A,X,1,1
2,E,Y,0,2
3,B,Y,2,2


In [38]:
# note that in test_df, the value 'E' in col1 and the value 'Z' in col2 are both 
#  handled properly by the sentinel value

In [47]:
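# Now, use OneHotEncoder to convert the categorical columns into a wide vector.
# Note: the `n_values` and `categorical_features` arguments used here were removed in
#  scikit-learn 0.22, so this cell needs an older release to run as written.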
n_values = []
for le in label_encoders:
    n_values.append(len(le.classes_))

encoded_columns = ['encoded_'+col for col in categorical_columns]

one_hot_encoder = OneHotEncoder(n_values=n_values, categorical_features=[0, 1],
                                handle_unknown='error', sparse=False)
one_hot_encoder.fit(train_df[encoded_columns])
train_data = one_hot_encoder.transform(train_df[encoded_columns])
test_data = one_hot_encoder.transform(test_df[encoded_columns])

In [48]:
train_data

array([[ 0.,  1.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  1.]])

In [49]:
test_data

array([[ 0.,  0.,  1.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  1.]])
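
On recent scikit-learn releases the same encoding can be expressed with the `categories` argument, which lists the allowed integer codes per column. The sketch below is an assumption about how one might port the cell, not part of the original notebook; it hard-codes the integer codes from the tables above, uses hypothetical names (`train_codes`, `test_codes`, `train_onehot`, `test_onehot`), and calls `.toarray()` on the default sparse output rather than passing a dense-output flag, since that flag has changed name between releases.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Integer codes copied from the encoded_col1 / encoded_col2 columns above.
train_codes = np.array([[1, 2], [2, 1], [1, 1], [3, 2], [2, 2]])
test_codes = np.array([[2, 0], [1, 1], [0, 2], [2, 2]])

# Number of classes per column, sentinel included (4 for col1, 3 for col2).
n_values = [4, 3]
categories = [list(range(n)) for n in n_values]

encoder = OneHotEncoder(categories=categories, handle_unknown='error')
encoder.fit(train_codes)

# The default output is a sparse matrix; convert it to a dense array.
train_onehot = encoder.transform(train_codes).toarray()
test_onehot = encoder.transform(test_codes).toarray()
print(test_onehot)
```

Because code 0 is listed in `categories`, the sentinel gets its own column even though it never occurs in the training codes.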

In [50]:
# Now we consider the cases where some values are missing

In [69]:
train_df = pd.DataFrame({'col1': ['A', 'B', 'A', 'C', 'B'],
                         'col2': ['Y', 'X', 'X', 'Y', 'Y']})
# `.ix` was removed in pandas 1.0; `.loc` does the same label-based assignment here
train_df.loc[1, 'col1'] = '?'
train_df.loc[3, 'col2'] = '?'
train_df

Unnamed: 0,col1,col2
0,A,Y
1,?,X
2,A,X
3,C,?
4,B,Y


In [70]:
test_df = pd.DataFrame({'col1': ['B', 'A', 'E', 'B', 'A'],
                        'col2': ['Z', 'X', 'Y', 'Y', 'X']})
test_df.loc[0, 'col1'] = '?'
test_df.loc[4, 'col2'] = '?'
test_df

Unnamed: 0,col1,col2
0,?,Z
1,A,X
2,E,Y
3,B,Y
4,A,?


In [71]:
# We will add one column next to each of the original columns to indicate 
#  if the value in the original column is missing or not

categorical_columns = ['col1', 'col2']
for col in categorical_columns:
    train_df[col+'_missing'] = (train_df[col]=='?').astype(int)
    test_df[col+'_missing'] = (test_df[col]=='?').astype(int)
indicator_columns = [col+'_missing' for col in categorical_columns]

In [72]:
train_df

Unnamed: 0,col1,col2,col1_missing,col2_missing
0,A,Y,0,0
1,?,X,1,0
2,A,X,0,0
3,C,?,0,1
4,B,Y,0,0


In [73]:
test_df

Unnamed: 0,col1,col2,col1_missing,col2_missing
0,?,Z,1,0
1,A,X,0,0
2,E,Y,0,0
3,B,Y,0,0
4,A,?,0,1


In [92]:
# We follow the procedure above to encode the categorical columns. One modification is that when we fit an encoding
#   model on the training nominals, we exclude the missing values, i.e., '?' is not used

label_encoders = []
for col in categorical_columns:
    nominals = list(train_df[col].unique())
    nominals = ['0_unknown'] + [e for e in nominals if e != '?']
    print('{0:} nominals: {1:}'.format(col, nominals))
        
    le = LabelEncoder()
    le.fit(nominals)
    
    values = [e if e in le.classes_[1:] else le.classes_[0] for e in train_df[col].tolist()]
    train_df['encoded_'+col] = le.transform(values)

    values = [e if e in le.classes_[1:] else le.classes_[0] for e in test_df[col].tolist()]
    test_df['encoded_'+col] = le.transform(values)
        
    label_encoders.append(le)        
    

col1 nominals: ['0_unknown', 'A', 'C', 'B']
col2 nominals: ['0_unknown', 'Y', 'X']


In [78]:
train_df

Unnamed: 0,col1,col2,col1_missing,col2_missing,encoded_col1,encoded_col2
0,A,Y,0,0,1,2
1,?,X,1,0,0,1
2,A,X,0,0,1,1
3,C,?,0,1,3,0
4,B,Y,0,0,2,2


In [79]:
test_df

Unnamed: 0,col1,col2,col1_missing,col2_missing,encoded_col1,encoded_col2
0,?,Z,1,0,0,0
1,A,X,0,0,1,1
2,E,Y,0,0,0,2
3,B,Y,0,0,2,2
4,A,?,0,1,1,0


In [80]:
# Note that both missing values and unseen values are encoded to 0 in this scheme.
#  The numerical value 0 corresponds to the sentinel nominal 0_unknown we added at the beginning of the classes.
#
# However, for missing values, the additional indicator column is also turned on,
#  so when the encoded nominal is 0, we can still tell whether it is due to an unseen value or a missing value.
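
As a small illustration of that last point, the sketch below recovers the reason for each code from the pair (encoded value, missing indicator). The frame `demo` copies the `encoded_col1` and `col1_missing` values of `test_df` shown above; the helper `zero_code_reason` and the column `col1_status` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Values copied from test_df above (encoded_col1 and col1_missing).
demo = pd.DataFrame({'encoded_col1': [0, 1, 0, 2, 1],
                     'col1_missing': [1, 0, 0, 0, 0]})

def zero_code_reason(encoded, missing):
    """Classify a row as 'seen', 'missing' or 'unseen'."""
    if encoded != 0:
        return 'seen'
    # Code 0 is the sentinel: the indicator tells the two cases apart.
    return 'missing' if missing == 1 else 'unseen'

demo['col1_status'] = [zero_code_reason(e, m)
                       for e, m in zip(demo['encoded_col1'], demo['col1_missing'])]
print(demo['col1_status'].tolist())
# ['missing', 'seen', 'unseen', 'seen', 'seen']
```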


In [103]:
n_values = []
for le in label_encoders:
    n_values.append(len(le.classes_))

encoded_columns = ['encoded_'+col for col in categorical_columns]
    
# As before, `n_values` and `categorical_features` require scikit-learn older than 0.22
one_hot_encoder = OneHotEncoder(n_values=n_values, categorical_features=[0, 1],
                                handle_unknown='error', sparse=False)

one_hot_encoder.fit(train_df[encoded_columns])

train_data = one_hot_encoder.transform(train_df[encoded_columns])
train_data = np.hstack([train_data, train_df[indicator_columns].values])
test_data = one_hot_encoder.transform(test_df[encoded_columns])
test_data = np.hstack([test_data, test_df[indicator_columns].values])

In [105]:
train_df[encoded_columns+indicator_columns]

Unnamed: 0,encoded_col1,encoded_col2,col1_missing,col2_missing
0,1,2,0,0
1,0,1,1,0
2,1,1,0,0
3,3,0,0,1
4,2,2,0,0


In [106]:
train_data

array([[ 0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.]])

In [108]:
test_df[encoded_columns+indicator_columns]

Unnamed: 0,encoded_col1,encoded_col2,col1_missing,col2_missing
0,0,0,1,0
1,1,1,0,0
2,0,2,0,0
3,2,2,0,0
4,1,0,0,1


In [109]:
test_data

array([[ 1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.]])
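
For comparison, recent scikit-learn releases can also skip the `LabelEncoder` step, because `OneHotEncoder` accepts string columns directly and offers `handle_unknown='ignore'`. This is a different technique from the sentinel scheme used in this post: an unseen value gets no column of its own and is encoded as all zeros for that feature. The sketch below uses hypothetical names (`train`, `test`, `enc`, `test_matrix`) and the data from the first example, without missing values.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'col1': ['A', 'B', 'A', 'C', 'B'],
                      'col2': ['Y', 'X', 'X', 'Y', 'Y']})
test = pd.DataFrame({'col1': ['B', 'A', 'E', 'B'],
                     'col2': ['Z', 'X', 'Y', 'Y']})

# Unseen values ('E' in col1, 'Z' in col2) become all-zero blocks.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# Columns are A, B, C for col1 followed by X, Y for col2.
test_matrix = enc.transform(test).toarray()
print(enc.categories_)
print(test_matrix)
```

With this variant an unseen value and a sentinel column are no longer the same thing, so a model cannot learn a dedicated weight for "unknown"; whether that matters depends on the application.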