In [1]:
import numpy as np
import pandas as pd

# Categorical data encoding

In general, models and algorithms can only interpret and operate on numerical data (via mathematics), not on data stored in text form ("categorical data").

+ Linear regression uses coefficients to quantify relationships between variables.
+ Decision trees rely on numerical conditions for splitting nodes.
+ Distance-based algorithms (K-Nearest Neighbors) calculate distances between data points.
         
+ Some model types (random forests, decision trees) can deal with categorical data as input, but they often perform better with numerical data.

**"Feature encoding"** is the process of transforming categorical data into numerical data.

## 1- CATEGORICAL VARIABLES

### 1.1 Nominal variables

**Nominal** variables are only names/labels, without specific order or hierarchy.

Ex: 
+ 'Movie genre' is a nominal variable: "Comedy", "Thriller", "Drama", "Crime", "Romance", "Fantasy", "Science fiction", "Western", ...
+ 'Weather condition' is a nominal variable: "Sunny", "Rainy", "Windy", "Stormy", "Cloudy", ...
+ 'Geographical classification' is a nominal variable: "Urban", "Suburban", "Rural", ...
+ 'Colors'
+ 'Zip codes'
+ ...

In [2]:
colors = ['Blue', 'Yellow', 'Green', 'Red', 'Blue', 'Green', 'Purple']

### 1.2 Ordinal variables

**Ordinal** variables have an inherent order or hierarchy.

Ex: 
+ 'Customer satisfaction' is an ordinal variable: "Very satisfied", "Satisfied", "Neutral", "Dissatisfied", "Very dissatisfied".\
  It is clear that "Very satisfied" represents a higher satisfaction level than "Satisfied", which in turn is higher than "Neutral", etc.
+ 'Education level' is an ordinal variable: "No degree", "Highschool", "Bachelor", "Master", "Doctorate".
+ 'Size' is an ordinal variable: "Very small", "Small", "Medium", "Large", "Very large".
+ 'Movie ratings' is an ordinal variable, from "One star" to "Five stars".

In [3]:
customer_satisfaction = ["Very dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very satisfied"]

Note:
+ The distance between two consecutive categories is often not quantifiable.
+ The distance between two consecutive categories is not necessarily the same: for instance, the gap between "Very satisfied" and "Satisfied" may differ from the gap between "Satisfied" and "Neutral". In other words, categories are not necessarily equally spaced.
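The idea of an order without a measurable distance can be sketched with an ordered pandas categorical. The `answers` series below is hypothetical data used only for illustration; the category list is the one from the cell above.

```python
import pandas as pd

# Ordered categories, from lowest to highest satisfaction
levels = ["Very dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very satisfied"]

# Hypothetical survey answers, used only for illustration
answers = pd.Series(["Neutral", "Very satisfied", "Dissatisfied"])

# Declare the order explicitly so pandas can compare and sort the values
ordered_answers = answers.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Comparisons follow the declared order, not the alphabetical one
print(ordered_answers > "Dissatisfied")

# Position of each answer within `levels` (a rank, not a measured distance)
print(ordered_answers.cat.codes.tolist())
```

The codes only express rank: nothing in them says that the step from "Neutral" to "Satisfied" is as large as the step from "Satisfied" to "Very satisfied".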

## 2- ENCODING TECHNIQUES

### 2.1: Label encoding:

Simply replaces categorical labels with integer labels.

In [4]:
from sklearn.preprocessing import LabelEncoder

# print colors
print('colors:\n', colors)

# define the encoder
le = LabelEncoder()

# fit the encoder and transform data
encoded_colors = le.fit_transform(colors)

# print the encoded colors
print('\ncolors label encoded:\n', encoded_colors)

# inverse transformation
print(list(le.inverse_transform(encoded_colors)))

colors:
 ['Blue', 'Yellow', 'Green', 'Red', 'Blue', 'Green', 'Purple']

colors label encoded:
 [0 4 1 3 0 1 2]
['Blue', 'Yellow', 'Green', 'Red', 'Blue', 'Green', 'Purple']


_Use case:_ Nominal variables (note that scikit-learn intends `LabelEncoder` for the target `y` rather than for input features).\
_Pros:_ Simple, No increase in dimensionality.\
_Cons:_ Can introduce artificial ordering/importance when used with nominal variables, and the arbitrary assignment may not reflect the true distance between categories.
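To see where the artificial ordering comes from, it can help to inspect the mapping the encoder learned. A minimal sketch, reusing the same color list: `LabelEncoder` stores the sorted unique labels in `classes_`, and the position of a label in that array is its code.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Blue', 'Yellow', 'Green', 'Red', 'Blue', 'Green', 'Purple']

le = LabelEncoder().fit(colors)

# classes_ holds the sorted unique labels; the index of a label is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
```

The codes simply follow alphabetical order, so 'Yellow' ends up "larger" than 'Blue' for no reason related to the data.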

### 2.2: One-Hot encoding:

Creates one new binary (0 or 1) feature per category.\
Also known as 'dummy encoding'.

#### 2.2.1 with scikit-learn

In [5]:
from sklearn.preprocessing import OneHotEncoder

In [6]:
# print colors
print(colors)

# define the encoder
ohe = OneHotEncoder(sparse_output=False)

# fit the encoder and transform data
ohencoded_colors = ohe.fit_transform(np.array(colors).reshape(-1, 1))

# print the encoded colors
print(ohencoded_colors)

['Blue', 'Yellow', 'Green', 'Red', 'Blue', 'Green', 'Purple']
[[1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]]


#### 2.2.2 with pandas

In [7]:
# convert the array into a pandas dataframe
colors_df = pd.DataFrame({'color': colors})

# print the dataframe
print('colors dataframe:\n', colors_df)

# encode the colors
ohencoded_colors_df = pd.get_dummies(colors_df, prefix='is', prefix_sep='_', sparse=False)

# print the encoded dataframe
print('\ncolors dataframe one-hot encoded:\n',  ohencoded_colors_df)

# inverse transformation
print('\ninverse transformation:\n', pd.from_dummies(ohencoded_colors_df, sep='_'))

colors dataframe:
     color
0    Blue
1  Yellow
2   Green
3     Red
4    Blue
5   Green
6  Purple

colors dataframe one-hot encoded:
    is_Blue  is_Green  is_Purple  is_Red  is_Yellow
0     True     False      False   False      False
1    False     False      False   False       True
2    False      True      False   False      False
3    False     False      False    True      False
4     True     False      False   False      False
5    False      True      False   False      False
6    False     False       True   False      False

inverse transformation:
        is
0    Blue
1  Yellow
2   Green
3     Red
4    Blue
5   Green
6  Purple


_Use case:_ Nominal variables with a small number of categories.\
_Pros:_ Does not introduce artificial ordering/importance, easy interpretation.\
_Cons:_ Significant increase in dimensionality, can result in a sparse matrix (where most elements are 0).
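One common way to soften the dimensionality cost is to drop one of the binary columns: each row of a one-hot encoding contains exactly one 1, so any single column can be recovered from the others. A minimal sketch with the `drop_first` option of `pd.get_dummies`, on the same color data:

```python
import pandas as pd

colors_df = pd.DataFrame(
    {'color': ['Blue', 'Yellow', 'Green', 'Red', 'Blue', 'Green', 'Purple']}
)

# full one-hot encoding: one column per category
full = pd.get_dummies(colors_df, prefix='is')

# reduced encoding: the first category (alphabetically) is dropped
reduced = pd.get_dummies(colors_df, prefix='is', drop_first=True)

print(full.shape, reduced.shape)
print(reduced.columns.tolist())
```

In the reduced frame, a row of all `False` values stands for the dropped category ('Blue' here). This only saves one column per variable, so it does not solve the high-cardinality problem.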

### 2.3: Binary encoding:

Categories are encoded in the binary numerical form, where numbers are represented using only 0 and 1.\
(0 is 0, 1 is 1, 2 is 10, 3 is 11, 4 is 100, 5 is 101, 6 is 110, etc)\
Binary encoding functions as follows:\
Category -- _label encoding_ --> Integer -- _binary encoding_ --> Binary number
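The two steps can be sketched by hand with pandas and NumPy. This is an illustration of the principle, not the implementation used by any library; it assumes the integer labels start at 1 and follow the order of first appearance, and the column names `color_0`, `color_1`, ... are chosen for illustration.

```python
import pandas as pd

colors = ['Blue', 'Yellow', 'Green', 'Red', 'Blue', 'Green', 'Purple']

# Step 1: label encoding (assumption: integers start at 1, in order of first appearance)
codes = pd.factorize(colors)[0] + 1

# Step 2: number of binary digits needed to write the largest integer
width = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first
binary_df = pd.DataFrame(
    {f'color_{i}': (codes >> (width - 1 - i)) & 1 for i in range(width)}
)
binary_df.insert(0, 'color', colors)
print(binary_df)
```

Five categories fit in three binary columns, where one-hot encoding would need five.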

In [8]:
import category_encoders as ce

Category Encoders is a package of scikit-learn-style transformers that take pandas DataFrames as input (and can return them as output).\
Note: It can be useful to call `sklearn.set_config(transform_output='pandas')`.

In [9]:
# print the dataframe
print('colors dataframe:\n', colors_df)

# define the encoder
bin_encoder = ce.BinaryEncoder(cols=['color'], return_df=True)

# fit the encoder and transform data
binencoded_colors_df = bin_encoder.fit_transform(colors_df)

# print the encoded dataframe
print('\ncolors dataframe binary encoded:\n',  binencoded_colors_df)

# inverse transformation
print('\ninverse transformation:\n', bin_encoder.inverse_transform(binencoded_colors_df))

colors dataframe:
     color
0    Blue
1  Yellow
2   Green
3     Red
4    Blue
5   Green
6  Purple

colors dataframe binary encoded:
    color_0  color_1  color_2
0        0        0        1
1        0        1        0
2        0        1        1
3        1        0        0
4        0        0        1
5        0        1        1
6        1        0        1

inverse transformation:
     color
0    Blue
1  Yellow
2   Green
3     Red
4    Blue
5   Green
6  Purple


_Use case:_ Categorical variables with many categories.\
_Pros:_ Preserves more information than label encoding; dimensionality increases, but much less than with One-Hot encoding.\
_Cons:_ Interpretation not straightforward, not suited for ordinal variables (order not preserved).

### 2.4: Ordinal encoding:

Each category is assigned an integer, respecting the inherent ordering of the categories.

#### 2.4.1 with scikit-learn

In [10]:
from sklearn.preprocessing import OrdinalEncoder

In [11]:
# print customer satisfaction categories
print('customer satisfaction\n', customer_satisfaction)

# create fake data
cust = ['customer1', 'customer2', 'customer3', 'customer4', 'customer5']
sat = ['Neutral', 'Satisfied', 'Very satisfied', 'Very dissatisfied', 'Dissatisfied']
data = pd.DataFrame(list(zip(cust,sat)), columns=['customer', 'satisfaction'])

# print fake data
print('\nfake data:\n', data)

# define the encoder
# Note that we need to provide the ordered categories to the encoder
ord_encoder = OrdinalEncoder(categories=[["Very dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very satisfied"]])

# fit the encoder and transform data
# Note that we need to provide one list of ordered categories for each encoded column
# Since we only want to encode the 'satisfaction' column, we drop the 'customer' column
ordencoded_data = ord_encoder.fit_transform(data.drop(columns=['customer']))

# print the encoded column
print('\ndata ordinal encoded:\n',  ordencoded_data)

# inverse transformation
print('\ninverse transformation:\n', ord_encoder.inverse_transform(ordencoded_data))

customer satisfaction
 ['Very dissatisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Very satisfied']

fake data:
     customer       satisfaction
0  customer1            Neutral
1  customer2          Satisfied
2  customer3     Very satisfied
3  customer4  Very dissatisfied
4  customer5       Dissatisfied

data ordinal encoded:
 [[2.]
 [3.]
 [4.]
 [0.]
 [1.]]

inverse transformation:
 [['Neutral']
 ['Satisfied']
 ['Very satisfied']
 ['Very dissatisfied']
 ['Dissatisfied']]


#### 2.4.2 with category_encoders

In [12]:
# use the `ce` namespace so that scikit-learn's OrdinalEncoder is not shadowed
import category_encoders as ce

In [13]:
# define the ordering scheme
order = [{'col': 'satisfaction', # NAME OF THE COLUMN IN THE DATAFRAME
              'mapping': {'Very dissatisfied':0,
                          'Dissatisfied':1,
                          'Neutral':2,
                          'Satisfied':3,
                          'Very satisfied':4}
             }]
# define the encoder
ord_encoder = ce.OrdinalEncoder(mapping=order)

# fit the encoder and transform data
result = ord_encoder.fit_transform(data)

# print the encoded dataframe
print('\ndata ordinal encoded:\n',  result)

# inverse transformation
print('\ninverse transformation:\n', ord_encoder.inverse_transform(result))


data ordinal encoded:
     customer  satisfaction
0  customer1             2
1  customer2             3
2  customer3             4
3  customer4             0
4  customer5             1

inverse transformation:
     customer       satisfaction
0  customer1            Neutral
1  customer2          Satisfied
2  customer3     Very satisfied
3  customer4  Very dissatisfied
4  customer5       Dissatisfied


#### 2.4.3 with pandas

In [14]:
# create a dictionary and map the desired column
order_dict={'Very dissatisfied':0,
            'Dissatisfied':1,
            'Neutral':2,
            'Satisfied':3,
            'Very satisfied':4}
result = data.copy()
result['oe_satisfaction'] = data.satisfaction.map(order_dict)

# print the encoded dataframe
print('\ndata ordinal encoded:\n',  result)


data ordinal encoded:
     customer       satisfaction  oe_satisfaction
0  customer1            Neutral                2
1  customer2          Satisfied                3
2  customer3     Very satisfied                4
3  customer4  Very dissatisfied                0
4  customer5       Dissatisfied                1


_Use case:_ Ordinal variables.\
_Pros:_ Preserves ordinal information, does not increase dimensionality.\
_Cons:_ Does not reflect the magnitude of the difference between categories, not suited for nominal variables.

### 2.5: Frequency encoding:

Individual categories are replaced by their (relative) frequency in the data.

In [18]:
# convert the array into a pandas dataframe
colors_df = pd.DataFrame({'color': colors})

# print the dataframe
print('colors dataframe:\n', colors_df)

# compute the frequency of each category
freq = colors_df.color.value_counts(normalize=True)

# Map the frequencies to the category
freq_encoded = colors_df.copy()
freq_encoded['freq_encoding'] = freq_encoded.color.map(freq)

# print the dataframe
print('\nfrequency encoded dataframe:\n', freq_encoded)

colors dataframe:
     color
0    Blue
1  Yellow
2   Green
3     Red
4    Blue
5   Green
6  Purple

frequency encoded dataframe:
     color  freq_encoding
0    Blue       0.285714
1  Yellow       0.142857
2   Green       0.285714
3     Red       0.142857
4    Blue       0.285714
5   Green       0.285714
6  Purple       0.142857


_Use case:_ Nominal variables with high cardinality.\
_Pros:_ Can deal with high cardinality, does not increase dimensionality.\
_Cons:_ Loss of categorical information, several categories may have the same frequency, can be misleading if the distribution of frequencies does not reflect the meaning of the category.
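In practice the frequencies are usually learned on the training data and then applied to new data. A minimal sketch, assuming a hypothetical `test` frame and assuming that a category never seen during training should receive a frequency of 0 (other fallbacks are possible):

```python
import pandas as pd

train = pd.DataFrame(
    {'color': ['Blue', 'Yellow', 'Green', 'Red', 'Blue', 'Green', 'Purple']}
)
# hypothetical new data; 'Orange' never appears in the training data
test = pd.DataFrame({'color': ['Blue', 'Red', 'Orange']})

# learn the relative frequencies on the training data only
freq = train['color'].value_counts(normalize=True)

# apply them to the new data; unseen categories get 0 (an assumption of this sketch)
test_encoded = test.copy()
test_encoded['freq_encoding'] = test['color'].map(freq).fillna(0.0)
print(test_encoded)

# 'Blue' and 'Green' become indistinguishable after encoding
print(freq['Blue'] == freq['Green'])
```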

### 2.6: Mean encoding:

Individual categories are replaced by the mean value of the target variable within that category (also known as 'target encoding').

In [17]:
# convert the array into a pandas dataframe
colors_df = pd.DataFrame({'color': colors})

# Add some target data
colors_df['target'] = [200, 1000, 600, 100, 400, 800, 2000]

# print the dataframe
print('\ncolors dataframe:\n', colors_df)

# compute the target mean value for each category
mean = colors_df.groupby('color')['target'].mean()

# Map the mean values to the category
mean_encoded = colors_df.copy()
mean_encoded['mean_encoding'] = mean_encoded.color.map(mean)

# print the dataframe
print('\nmean encoded dataframe:\n', mean_encoded)


colors dataframe:
     color  target
0    Blue     200
1  Yellow    1000
2   Green     600
3     Red     100
4    Blue     400
5   Green     800
6  Purple    2000

mean encoded dataframe:
     color  target  mean_encoding
0    Blue     200          300.0
1  Yellow    1000         1000.0
2   Green     600          700.0
3     Red     100          100.0
4    Blue     400          300.0
5   Green     800          700.0
6  Purple    2000         2000.0


_Use case:_ Nominal variables with high cardinality.\
_Pros:_ May capture complex relationships (non-linear, interaction effects) between the categorical variable and the target variable.\
_Cons:_ Risk of overfitting, especially with small datasets or when rare categories are present.
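A common way to limit this overfitting is to shrink each category mean towards the global mean, with more shrinkage for categories that have few rows. The sketch below uses the formula `(n * category_mean + m * global_mean) / (n + m)`, where `n` is the number of rows in the category; the smoothing strength `m = 2` is an arbitrary value chosen for illustration and would normally be tuned.

```python
import pandas as pd

colors_df = pd.DataFrame({
    'color': ['Blue', 'Yellow', 'Green', 'Red', 'Blue', 'Green', 'Purple'],
    'target': [200, 1000, 600, 100, 400, 800, 2000],
})

m = 2  # smoothing strength (arbitrary value, for illustration only)
global_mean = colors_df['target'].mean()

# per-category mean and number of rows
stats = colors_df.groupby('color')['target'].agg(['mean', 'count'])

# weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

colors_df['smoothed_mean_encoding'] = colors_df['color'].map(smoothed)
print(colors_df)
```

'Purple' appears only once, so its encoded value is pulled well below its raw mean of 2000, which makes the feature less dependent on a single observation.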

## 3- RISKS AND ADVANTAGES

### 3.1 Risks to consider:

1. _Encoding misinterpretation:_

   Ex: To perform encoding, we assign 1 to 'green', 2 to 'blue', 3 to 'red'.
   The model could assume (wrongly) that 'blue' is twice as significant as 'green' and 'red' thrice as significant as 'green' and produce wrong predictions.

2. _High cardinality:_

   The dataset contains a large number of unique categories in a given variable (like 'City', 'Zip code', 'Lastname', ...)

   Ex 1: Encoding this variable using One-Hot Encoding may lead to an extremely large array containing mostly zeros.
   This is computationally expensive, and may confuse the algorithm.

   Ex 2: To avoid this problem, one may turn to simpler encoding methods, which may lead to underfitting (the encoding is too simplistic and crucial information has been lost) or overfitting (the model is biased towards the more frequent categories).

3. _Unseen data:_

   Ex: The training data contains the categories 'green',  'blue', 'red'.
       The testing data contains the categories 'green',  'blue', 'red', 'yellow', 'purple'.
   The model has no knowledge of the new categories, and does not know how to deal with them, which may lead to wrong predictions.
   Such a model is not robust on unseen data.

4. _Overfitting with rare categories:_

   Ex: If rare categories have a strong influence on the variable we want to predict, the model may adjust to these rare instances, and perform poorly on unseen data.

5. _Data leakage:_

   Information from outside the training data is used to create the model.

   Ex: performing 'mean encoding' before splitting the data into training and test sets, so that target values from the test set leak into the encoded feature.
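The leakage risk can be avoided by learning the encoding on the training set only and then applying it to the test set. A minimal sketch with hypothetical, already split data; using the training-set global mean as a fallback for unseen categories is an assumption of this sketch, not the only option.

```python
import pandas as pd

# hypothetical data, already split into training and test sets
train = pd.DataFrame({'color': ['Blue', 'Yellow', 'Green', 'Red', 'Blue'],
                      'target': [200, 1000, 600, 100, 400]})
test = pd.DataFrame({'color': ['Green', 'Purple'],
                     'target': [800, 2000]})

# learn the encoding from the training set only
train_means = train.groupby('color')['target'].mean()
fallback = train['target'].mean()  # used for categories unseen during training

# apply the same mapping to both sets; test targets are never used
train['mean_encoding'] = train['color'].map(train_means)
test['mean_encoding'] = test['color'].map(train_means).fillna(fallback)

print(train)
print(test)
```

'Purple' does not exist in the training set, so it receives the fallback value instead of a mean computed from its own (test) target.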

### 3.2 Advantages:

In general, models trained on suitably encoded categorical data:
+ perform better (in terms of quality of the prediction and processing time)
+ are more robust when dealing with unseen data containing categories not present in the training data
+ provide more informative feature importance

### 3.3 Tips:

+ It is not always obvious which encoding method is the best. Trying several methods and comparing model performance can help decide.
+ There are dimensionality reduction techniques (PCA, t-SNE) that can be applied after encoding, in case the dimensionality becomes too large.
+ Rare categories are often problematic. Frequency encoding might help, by assigning them a low frequency. It may be worth grouping rare categories into a generic one ('other') before encoding.
+ The category_encoders package contains more 'exotic' encoding algorithms, often with fancy mathematical definitions.
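Grouping rare categories before encoding can be sketched in a few lines of pandas. The `cities` series and the `min_count` threshold below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# hypothetical high-cardinality variable
cities = pd.Series(['Paris', 'Paris', 'Paris', 'Lyon', 'Lyon', 'Nice', 'Lille'])

min_count = 2  # categories seen fewer times than this are considered rare

# count the occurrences of each category and collect the rare ones
counts = cities.value_counts()
rare = counts[counts < min_count].index

# keep frequent categories, replace rare ones with a generic label
grouped = cities.where(~cities.isin(rare), 'other')
print(grouped.tolist())
```

The grouped variable has three categories instead of four, and any encoder applied afterwards treats 'other' like a regular category.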