# Doomed to failure: A story about target encoding and Tree-Based Algorithms
---

Have you ever worked with target encoders? Have you ever used a tree-based model? If you have worked with either, sooner or later you will be tempted to use both in the same pipeline to classify data points with categorical features. DON'T! At least not until you have read this article, which details one dangerous caveat of this combination.

In this article, I will guide you through the caveats that appear when a pipeline combines certain members of the target encoder family with tree-based models and features of extremely low or high entropy. Spoiler: data leakage and overfitting will wreck your pipeline's performance.

The good news is that these problems are easy to solve: just use CatBoost and let it handle the encoding for you. As I will argue, the encoder it uses by default for categorical features handles features with both extremely low and extremely high entropy remarkably well, while remaining within the realm of target encoding.

**Important: to simplify the discussion, we will frame it as a binary classification task.**


#### Package Imports

In [None]:
import category_encoders as ce
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.pipeline import Pipeline

## Background Knowledge

### Encoders

Encoding methodologies aim to transform non-numerical categorical feature values into numerical values. There are multiple approaches and each comes with its own advantages and caveats. For instance, to mention a few of them:

- **Label Encoding**: it simply assigns a unique integer to each category in the data.
    *Pros:*
    1. Simple

    *Cons:*
    1. May introduce ordinal relationships
    2. Features with an open-ended set of categories, many of them low-frequency, may need those rare categories lumped into a single catch-all category, losing whatever predictive power they carried.

- **One-Hot Encoding**: for each category seen for a feature, it creates a new dummy feature which simply indicates with 0 or 1 when the category appears.
    *Pros:*
    1. Simple
    2. Does not assume order between categories

    *Cons:*
    1. Dimensionality may increase considerably, which can make a dimensionality reduction technique necessary and may exhaust your system's memory (tip: if you still want to use it, look into sparse matrices).
    2. Computationally expensive.
    3. Features with an open-ended set of categories, many of them low-frequency, may need those rare categories lumped into a single catch-all category, losing whatever predictive power they carried.

- **Binary Encoding, Hashing Encoding and more. Let's not get carried away with an endless enumeration.**

Importantly, encoding methodologies and classifiers interact in different ways, so the resulting pipelines may inherit or develop different properties and caveats.
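
To make the first two encoders concrete, here is a minimal sketch using plain pandas. The `colors` column is a hypothetical toy example that does not come from the rest of this article, and `pd.factorize` / `pd.get_dummies` are just one of several ways to obtain these encodings.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: one integer per category, assigned in order of first appearance.
label_codes, label_categories = pd.factorize(colors)

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(colors, dtype=int)

print(list(label_codes))
print(one_hot)
```

Note how the label codes suggest an order between colors that does not exist, while the one-hot version needs one column per category.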

### Target Encoders

In its most basic form, a target encoder substitutes each category $c$ of a feature with the training set statistic 
$$\frac{N_{positive\_samples\_with\_category\_c\_for\_the\_feature}}{N_{all\_samples\_with\_category\_c\_for\_the\_feature}}$$

Let us see this basic encoder in action with a basic custom implementation. Firstly, we create a trivial dataset:

In [None]:
dataset_1 = pd.DataFrame(
    {
        "feature_1": ["A", "B", "A", "C", "B", "A"],
        "target": [0, 1, 1, 1, 0, 0],
    }
)
dataset_1

|   | feature_1 | target |
|---|-----------|--------|
| 0 | A | 0 |
| 1 | B | 1 |
| 2 | A | 1 |
| 3 | C | 1 |
| 4 | B | 0 |
| 5 | A | 0 |


Then, we implement the encoding of the categories according to the most basic target encoder:

In [13]:
categories = dataset_1.feature_1.unique()
stats_map = {
    category: dataset_1[dataset_1.feature_1 == category].target.mean() for category in categories
}
dataset_1["feature_1_encoded"] = dataset_1.feature_1.map(stats_map)
dataset_1

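|   | feature_1 | target | feature_1_encoded |
|---|-----------|--------|-------------------|
| 0 | A | 0 | 0.333333 |
| 1 | B | 1 | 0.5 |
| 2 | A | 1 | 0.333333 |
| 3 | C | 1 | 1.0 |
| 4 | B | 0 | 0.5 |
| 5 | A | 0 | 0.333333 |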


The encoded feature, rather than the raw one, is what the ML algorithm actually sees. The idea is to encode each category by its tendency to be associated with the positive class. This way, an algorithm learns from the tendencies seen for all features, "weights" them, and is able to generalize to novel data points.

Nevertheless, note that the encoded value for C makes quite a strong statement given that we have only one instance of it (I know: the other categories are nearly as scarce from a statistical perspective). In other words, C could simply be a lucky sample, and in reality C could be more associated with zero labels. Good catch! Indeed, this first encoder comes with its own caveat: it easily overfits on scarce categories.
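
One way to make this scarcity visible is to put the sample count next to each encoded value. The following is only a sketch: it rebuilds the same toy data under a new name (`dataset`) so that it runs on its own, and the column names `encoding` and `n_samples` are chosen here for illustration.

```python
import pandas as pd

# Same toy data as above, rebuilt so the snippet is self-contained.
dataset = pd.DataFrame(
    {
        "feature_1": ["A", "B", "A", "C", "B", "A"],
        "target": [0, 1, 1, 1, 0, 0],
    }
)

# Per-category target mean (the basic encoding) and the number of samples behind it.
summary = dataset.groupby("feature_1")["target"].agg(encoding="mean", n_samples="count")
print(summary)
```

The row for C shows an encoding of 1.0 backed by a single sample, which is exactly the situation where the basic encoder is least trustworthy.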

Therefore, many variants arose that fold category frequency into the encodings. The most common approach is to smooth between the target statistic and the prior prevalence computed over the training set. For instance, the target encoder from the category-encoders library uses the smoothing:
$$enc(category) = p \cdot stat + (1 - p) \cdot prevalence$$
where:
$$p = \frac{1}{1 + \exp\big(-\frac{n_{value} - k}{f}\big)}$$
Here $n_{value}$ is the number of training samples with the given category, and $k$ and $f$ are respectively called the minimum samples per leaf and smoothing parameters. For further information, check out their [documentation](https://contrib.scikit-learn.org/category_encoders/targetencoder.html).
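
To get a feel for the smoothing, here is a minimal pure-Python sketch that applies it to the toy data, using the sigmoid weight $p = 1 / (1 + \exp(-(n_{value} - k) / f))$. The values `k = 1` and `f = 1` are assumptions picked for illustration, and the helper `smoothed_encoding` is hypothetical: it follows the formula as written here and is not guaranteed to reproduce the library's output exactly.

```python
import math
from collections import defaultdict

# Same toy data as in dataset_1.
features = ["A", "B", "A", "C", "B", "A"]
targets = [0, 1, 1, 1, 0, 0]

k = 1.0  # "minimum samples per leaf" (assumed value, for illustration)
f = 1.0  # "smoothing" (assumed value, for illustration)

# Prior prevalence of the positive class over the whole training set.
prevalence = sum(targets) / len(targets)

# Per-category sample counts and positive counts.
counts = defaultdict(int)
positives = defaultdict(int)
for category, label in zip(features, targets):
    counts[category] += 1
    positives[category] += label


def smoothed_encoding(category):
    """Blend the category's target mean with the prior using a sigmoid weight."""
    n = counts[category]
    stat = positives[category] / n
    p = 1.0 / (1.0 + math.exp(-(n - k) / f))
    return p * stat + (1.0 - p) * prevalence


for category in sorted(counts):
    print(category, round(smoothed_encoding(category), 4))
```

With these assumed parameters, the single-sample category C gets weight $p = 0.5$, so its encoding is pulled from 1.0 halfway toward the prevalence of 0.5, landing at 0.75. The better-supported category A stays much closer to its raw statistic.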