# Doomed to Failure: A Story About Target Encoding and Tree-Based Algorithms
---

Have you ever worked with target encoders? Have you ever used a tree-based model? If you have worked with either, you will likely be tempted at some point to use both in the same pipeline to classify data points with categorical features. DON'T! At least not until you have read this article, which details one dangerous caveat of this combination.

In this article, I will guide you through the problems that appear when a pipeline combines certain members of the target encoder family with tree-based models and features of extremely low or high entropy. Spoiler: data leakage and overfitting will break your pipeline's performance.

The good news is that these problems are easy to solve: just use CatBoost and let it handle the encoding for you. As I will argue, the encoder it uses by default for categorical features handles features with both extremely low and extremely high entropy very well, while still belonging to the realm of target encoding.

**Important: to simplify the discussion, we will frame it as a binary classification task.**


#### Package Imports

In [None]:
import category_encoders as ce
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.pipeline import Pipeline

## Background Knowledge

### Encoders

Encoding methodologies aim to transform non-numerical categorical feature values into numerical values. There are multiple approaches and each comes with its own advantages and caveats. For instance, to mention a few of them:

- **Label Encoding**: it simply assigns a unique integer to each category in the data.
    *Pros:*
    1. Simple

    *Cons:*
    1. May introduce spurious ordinal relationships between categories
    2. When a feature has an unknown number of categories, many of them low-frequency, the rare categories may end up clustered into a single marginal category, losing the predictive power they carried.

- **One-Hot Encoding**: for each category seen for a feature, it creates a new dummy feature which simply indicates with 0 or 1 when the category appears.
    *Pros:*
    1. Simple
    2. Does not assume order between categories

    *Cons:*
    1. Dimensionality may increase considerably, which can make a dimensionality reduction technique necessary and may exhaust your system's memory (tip: if you still want to use it, look into sparse matrices).
    2. Computationally expensive.
    3. When a feature has an unknown number of categories, many of them low-frequency, the rare categories may end up clustered into a single marginal category, losing the predictive power they carried.

- **Binary Encoding, Hashing Encoding**, and more. Let's not get carried away with an endless enumeration.
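
To make the first two encodings in the list concrete, here is a minimal sketch that uses only pandas. The `color` column and its values are made-up toy data, and the pandas calls are just one of several ways to obtain these encodings; they are not the method used later in this article.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature with three categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# Label encoding: each category is mapped to a unique integer.
# pandas sorts string categories alphabetically, so blue=0, green=1, red=2.
# Note the artificial order this introduces (blue < green < red).
label_encoded = df["color"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category seen in the data.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

With three categories the one-hot version already needs three columns, which hints at how quickly dimensionality grows for high-cardinality features.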

Importantly, encoding methodologies and classifiers interact in different ways, so the resulting pipelines may inherit or develop different properties and caveats.

### Target Encoders

In its most basic form, a target encoder substitutes each category $c$ of a feature with the training set statistic 
$$\frac{N_{\text{positive samples with category } c \text{ for the feature}}}{N_{\text{all samples with category } c \text{ for the feature}}}$$
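
For a binary target, this ratio is simply the mean of the target within each category. The following is a minimal pandas sketch of that basic form, assuming a made-up training set with a hypothetical `city` feature and a 0/1 `target` column. It applies no smoothing or regularization, so it may not match the defaults of the encoders in `category_encoders`.

```python
import pandas as pd

# Hypothetical training set: one categorical feature and a binary target.
train = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "target": [1, 0, 1, 0, 0, 1],
})

# For a 0/1 target, (positives with category c) / (all samples with category c)
# equals the mean of the target within category c.
encoding = train.groupby("city")["target"].mean()

# Replace each category by its training-set statistic.
train["city_encoded"] = train["city"].map(encoding)

print(encoding)
```

Here category `A` is encoded as 2/3 and `B` as 0, while `C`, which appears only once, is encoded as exactly its own label.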