# Concept of mean encodings 

The general idea of this technique is to add new variables derived from some feature and the target. In the simplest case, we encode each level of a categorical variable with the corresponding target mean.

![pics/ex_1.png](pics/ex_1.png)


Let's take a look at the following example. Here, we have some binary classification task in which we have a categorical variable, some city. And of course, we want to numerically encode it. The most obvious way and what people usually use is **label encoding**.

It's what we have in the second column.

Mean encoding is done differently: we encode every city with its corresponding mean target. For example, for Moscow, we have five rows with three 0s and two 1s. So we encode it with 2 divided by 5, or 0.4. We deal with the rest of the cities the same way, pretty straightforward. What I've described here is a very high-level idea. There are a huge number of pitfalls one should overcome in an actual competition. We won't go deep into details for now, just keep it in mind.

![pics/ex_1.png](pics/ex_2.png)

First, let me explain why it even works. Imagine that our dataset is much bigger and contains hundreds of different cities. Let's try to compare, very abstractly of course, mean encoding with label encoding.

We plot feature histograms for class 0 and class 1. In the case of label encoding, we'll always get a totally random picture because there's no logical order, but when we use the mean target to encode the feature, the classes look way more separable. The plot looks kind of sorted.

![pics/ex_1.png](pics/ex_3.png)

It turns out that this sorting quality of mean encoding is quite helpful. Remember, what is the most popular and effective way to solve a machine learning problem? It's gradient boosting on trees (GBM). One of its few downsides is an inability to handle high-cardinality categorical variables.

Trees have limited depth, and with mean encoding we can compensate for that: we can reach a better loss with shorter trees. Cross-validation loss might even look like this.

In general, the **more complicated and non-linear the feature-target dependency**, **the more effective mean encoding is**.

![ex_4.png](pics/ex_4.png)

Further in this section, you will learn how to construct mean encodings. There are actually a lot of ways. Also keep in mind that we use classification tasks only as an example. We can use mean encodings on other tasks as well. The main idea remains the same.

Despite the simplicity of the idea, you need to be very careful with validation. It's got to be impeccable. It's probably the most important part. Understanding correct, leak-free validation is also a basis for stacking.

Last but not least are extensions. There are countless possibilities to derive new features from the target variable. Sometimes, they produce a significant improvement for your models.

Let's start with some characteristics of datasets that indicate the usefulness of mean encoding.

The presence of categorical variables with a lot of levels is already a good indicator, but we need to go a little deeper.

Let's take a look at these learning logs from the Springleaf competition.

![ex_5.png](pics/ex_5.png)

I ran three models with different depths, 7, 9, and 11. Train logs are on the top plot. Validation logs are on the bottom one.

As you can see, as we increase the depth of the trees, our training score becomes better and better, nearly perfect, and that's the normal part.

But we don't actually overfit, and that's weird. Our validation score also increases, **which is a sign that trees need a huge number of splits to extract information from some variables**. We can check this in the model dump.

It turns out that some features have a tremendous amount of split points, like 1200 or 1600 and that's a lot. Our model tries to treat all those categories differently and they are also very important for predicting the target. We can help our model via mean encodings.

There are a number of ways to calculate encodings. The first one is the one we've been discussing so far: simply taking the mean of the target variable.

![ex_5.png](pics/ex_6.png)

Another popular option is weight of evidence, which takes the natural logarithm of the ratio of ones to zeros.

Or you can count the number of ones.

Or take the difference between the number of ones and the number of zeros. All of these are viable options.
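As a rough illustration of these four statistics, here is a minimal pandas sketch on made-up data. The column names, the second city, and the unscaled form of weight of evidence are assumptions for the example, not something taken from the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Moscow has three 0s and two 1s, as in the example above.
df = pd.DataFrame({
    "city":   ["Moscow"] * 5 + ["Tver"] * 3,
    "target": [0, 1, 0, 1, 0, 1, 1, 0],
})

grouped = df.groupby("city")["target"]
ones = grouped.sum()             # number of ones per category
zeros = grouped.count() - ones   # number of zeros per category

stats = pd.DataFrame({
    "mean": grouped.mean(),       # ones / (ones + zeros)
    "woe": np.log(ones / zeros),  # assumed unscaled form; infinite if a category has no zeros or no ones
    "count": ones,                # number of ones
    "diff": ones - zeros,         # ones minus zeros
})
print(stats)
```

Any of the columns in `stats` can then be mapped back onto the rows by category.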


Now, let's actually construct the features. We will do it on the Springleaf dataset.

![ex_5.png](pics/ex_7.png)

Suppose we've already split the data into train and validation, the `X_tr` and `X_val` data frames. This code snippet shows how to construct a mean encoding for an arbitrary column and map it into new data frames, `train_new` and `val_new`. We simply do a group-by on that column and take the mean of the target. The result is then mapped to the train and validation datasets with the map operator. After we've repeated this process for every column, we can fit an XGBoost model on this new data.
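The group-by and map steps described above can be sketched roughly as follows. The toy frames and the `_mean_target` suffix are assumptions for illustration; the original snippet lives in the slide image.

```python
import pandas as pd

# Hypothetical toy split; in practice X_tr and X_val come from your own train/validation split.
X_tr = pd.DataFrame({
    "city":   ["Moscow", "Moscow", "Moscow", "Moscow", "Moscow", "Tver", "Tver"],
    "target": [0, 1, 0, 1, 0, 1, 1],
})
X_val = pd.DataFrame({"city": ["Moscow", "Tver", "Klin"]})

cols = ["city"]  # columns to encode
train_new = pd.DataFrame(index=X_tr.index)
val_new = pd.DataFrame(index=X_val.index)

for col in cols:
    # Target mean per category, estimated on the training part only.
    means = X_tr.groupby(col)["target"].mean()
    train_new[col + "_mean_target"] = X_tr[col].map(means)
    val_new[col + "_mean_target"] = X_val[col].map(means)  # unseen categories become NaN
```

Note that every training row is encoded using its own target value, which is exactly the unregularized setup this section goes on to examine.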

![ex_5.png](pics/ex_8.png)

But something's definitely not right: after several epochs, training AUC is nearly 1, while on validation the score saturates around 0.55, which is practically noise.

It's a clear sign of terrible overfitting.


I'll explain what happened in a few moments. Right now, I want to point out that at least we validated correctly. We separated train and validation, and used only the train data to estimate mean encodings. If, for instance, we had estimated mean encodings before the train/validation split, we would not have noticed such overfitting.

Now, let's figure out the reason for the overfitting.

![ex_5.png](pics/ex_9.png)

When a category is rare, it's pretty common to get results like in this example: target 0 in train and target 1 in validation. Mean encoding turns into a perfect feature for such categories. That's why we immediately get very good scores on train and fail hard on validation.

So far, we've grasped the concept of mean encodings and walked through some trivial examples. We obviously cannot use mean encodings like this in practice. We need to deal with overfitting first; we need some kind of regularization.



# Regularization

![ex_5.png](pics/ex_10.png)

In the previous chapter, we realized that mean encodings cannot be used as is and require some kind of regularization on the training part of the data. Now, we'll go through four different methods of regularization: doing a cross-validation loop to construct mean encodings, smoothing based on the size of the category, adding random noise, and finally, calculating an expanding mean on some permutation of the data. We will go through all of these methods one by one.

![ex_5.png](pics/ex_11.png)

Let's start with CV loop regularization. It's a very intuitive and robust method. For a given data point, we don't want to use the target variable of that data point. So we separate the data into K non-intersecting subsets, or in other words, folds. To get the mean encoding value for some subset, we don't use data points from that subset and estimate the encoding only on the remaining subsets.

We iteratively walk through all the data subsets.

Usually, four or five folds are enough to get decent results. You don't need to tune this number.

It may seem that we have completely avoided leakage from target variable. Unfortunately, it's not true.

It will become apparent if we use a leave-one-out scheme to separate the data.

![ex_5.png](pics/ex_12.png)

I'll return to it a little later, but first let's learn how to apply this method in practice. Suppose that our training data is in a `df_tr` data frame.

We will add mean-encoded features into another data frame, `train_new`.

In the outer loop, we iterate through a stratified K-fold iterator in order to separate the training data into chunks. `X_tr` is used to estimate the encoding. `X_val` is used to apply the estimated encoding.

After that, we iterate through all the columns and map the estimated encodings to the `X_val` data frame. At the end of the outer loop, we fill the `train_new` data frame with the result. Finally, some rare categories may be present only in a single fold, so we don't have the data to estimate the target mean for them. That's why we end up with some NaNs. We can fill them with the global mean.
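A minimal sketch of this loop is given below. It assumes scikit-learn's `StratifiedKFold`, a made-up `df_tr`, and a `_mean_target` naming suffix; the snippet on the slide may differ in detail.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy training frame; "Klin" is a rare category with a single row.
df_tr = pd.DataFrame({
    "city":   ["Moscow"] * 10 + ["Tver"] * 9 + ["Klin"],
    "target": [0, 1] * 10,
})
cols = ["city"]
global_mean = df_tr["target"].mean()

# One NaN-filled array per encoded column, filled fold by fold.
encoded = {col: np.full(len(df_tr), np.nan) for col in cols}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for tr_idx, val_idx in skf.split(df_tr, df_tr["target"]):
    X_tr, X_val = df_tr.iloc[tr_idx], df_tr.iloc[val_idx]
    for col in cols:
        # Estimate on the other folds, apply to the held-out fold.
        means = X_tr.groupby(col)["target"].mean()
        encoded[col][val_idx] = X_val[col].map(means).to_numpy()

train_new = pd.DataFrame(
    {col + "_mean_target": values for col, values in encoded.items()},
    index=df_tr.index,
)
# Categories missing from the estimation folds produce NaN; fall back to the global mean.
train_new = train_new.fillna(global_mean)
```

Here the single `Klin` row never has another `Klin` row to estimate from, so it always ends up with the global mean.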

As you can see, the whole process is very simple.

![ex_5.png](pics/ex_13.png)


Now, let's return to the question of whether we leak information about target variable or not. Consider the following example.

Here we want to encode Moscow via leave-one-out scheme.

For the first row, we get 0.5, because there are two 1s and two 0s in the rest of the rows. Similarly, for the second row we get 0.25, and so on. But look closely at the resulting feature. It perfectly splits the data: rows with a feature value greater than or equal to 0.5 have target 0, and the rest of the rows have target 1. We didn't explicitly use the target variable, but our encoding is biased. Furthermore, this effect remains valid even for the K-fold scheme, just milder.
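The leak is easy to reproduce. The sketch below uses five Moscow rows with three 0s and two 1s; the exact row order is an assumption chosen so that the first two rows match the values quoted above.

```python
import pandas as pd

# Hypothetical row order: the first row has target 0, the second has target 1.
df = pd.DataFrame({"city": ["Moscow"] * 5, "target": [0, 1, 1, 0, 0]})

grp = df.groupby("city")["target"]
total = grp.transform("sum")
count = grp.transform("count")

# Leave-one-out mean: exclude the current row's own target from the estimate.
# (A category with a single row would give 0/0 = NaN here.)
df["loo"] = (total - df["target"]) / (count - 1)
print(df)

# A single threshold on the encoded feature recovers the target exactly.
leak = (df["loo"] >= 0.5) == (df["target"] == 0)
```

Every entry of `leak` is `True`, so a single split on the feature at 0.5 separates the classes perfectly on the training data.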

So is this type of regularization useless?

Definitely not. In practice, if you have enough data and use four or five folds, the encodings will work fine with this regularization strategy. Just be careful and use correct validation.

![ex_5.png](pics/ex_14.png)


Another regularization method is smoothing. It's based on the following idea: if a category is big and has a lot of data points, we can trust its estimated mean encoding, but if a category is rare, it's the opposite. The formula on the slide uses this idea. It has a hyperparameter **alpha that controls the amount of regularization**. When alpha is zero, we have no regularization, and when alpha approaches infinity, everything turns into the global mean.

In some sense, alpha is equal to the category size we can trust. It's also possible to use some other formula; basically, anything that punishes encodings of rare categories can be considered smoothing. Smoothing obviously won't work on its own, but we can combine it with, for example, CV loop regularization.
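The slide's formula is not reproduced in the text, so the sketch below assumes one common form that matches the behaviour described: `(mean * n + global_mean * alpha) / (n + alpha)`, where `n` is the category size. The function name and toy data are hypothetical.

```python
import pandas as pd

def smoothed_mean_encoding(df, col, target, alpha):
    """Per-category target mean shrunk toward the global mean (assumed formula)."""
    global_mean = df[target].mean()
    agg = df.groupby(col)[target].agg(["mean", "count"])
    # Big categories keep their own mean; rare ones are pulled toward the global mean.
    return (agg["mean"] * agg["count"] + global_mean * alpha) / (agg["count"] + alpha)

# Hypothetical toy data: a big category and a rare one.
df = pd.DataFrame({
    "city":   ["Moscow"] * 5 + ["Klin"],
    "target": [0, 1, 0, 1, 0, 1],
})

raw = smoothed_mean_encoding(df, "city", "target", alpha=0)        # no regularization
smooth = smoothed_mean_encoding(df, "city", "target", alpha=10)    # moderate shrinkage
heavy = smoothed_mean_encoding(df, "city", "target", alpha=1e9)    # close to the global mean
```

In a real pipeline this estimate would be computed inside the CV loop on the estimation folds only, since smoothing alone still uses each row's own target.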

![ex_15.png](pics/ex_15.png)

Another way to regularize encodings is to add some noise. Without regularization, mean encodings have better quality for the training data than for the test data. By adding noise, we simply degrade the quality of the encoding on the training data.

This method is pretty unstable, it's hard to make it work. The main problem is the amount of noise we need to add. Too much noise will turn the feature into garbage, while too little noise means worse regularization.

**This method is usually used together with leave one out regularization. You need to diligently fine tune it.**

So, it's probably not the best option if you don't have a lot of time.
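For illustration only, here is a sketch that combines leave-one-out encoding with noise. The multiplicative Gaussian form, the `sigma` parameter, and the function name are assumptions; the text does not specify how the noise is generated.

```python
import numpy as np
import pandas as pd

def loo_encoding_with_noise(df, col, target, sigma, rng):
    """Leave-one-out target mean multiplied by Gaussian noise centred at 1 (assumed form)."""
    grp = df.groupby(col)[target]
    loo = (grp.transform("sum") - df[target]) / (grp.transform("count") - 1)
    noise = rng.normal(loc=1.0, scale=sigma, size=len(df))
    return loo * noise

# Hypothetical toy data.
df = pd.DataFrame({
    "city":   ["Moscow"] * 5 + ["Tver"] * 3,
    "target": [0, 1, 1, 0, 0, 1, 1, 0],
})

rng = np.random.default_rng(0)
plain = loo_encoding_with_noise(df, "city", "target", sigma=0.0, rng=rng)  # no noise
noisy = loo_encoding_with_noise(df, "city", "target", sigma=0.1, rng=rng)  # sigma needs tuning
```

The noise is applied only to the training encoding; validation and test rows would be encoded with the plain per-category mean from the training data.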

![ex_15.png](pics/ex_16.png)

The last regularization method I'm going to cover is based on expanding mean. The idea is very simple.

We fix some sorting order of our data and use only rows from zero to n minus one to calculate encoding for row n.

You can check simple implementation in the code snippet.

`cumsum` stores the cumulative sum of the target variable up to, but not including, the given row, and `cumcnt` stores the cumulative count.
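A minimal sketch of the expanding mean on made-up data follows. The variable names `cumsum` and `cumcnt` come from the text; the toy frame and the global-mean fallback for the first row of each category are assumptions.

```python
import pandas as pd

# Hypothetical toy data, already in the fixed order we want to use
# (in practice you would typically shuffle the rows first).
df = pd.DataFrame({
    "city":   ["Moscow", "Tver", "Moscow", "Moscow", "Tver", "Moscow"],
    "target": [0, 1, 1, 0, 0, 1],
})
global_mean = df["target"].mean()

# Sum of targets over the previous rows of the same category (current row excluded).
cumsum = df.groupby("city")["target"].cumsum() - df["target"]
# Number of previous rows of the same category.
cumcnt = df.groupby("city").cumcount()

# The first row of each category has no history (0/0 = NaN); fall back to the global mean.
df["city_mean_target"] = (cumsum / cumcnt).fillna(global_mean)
print(df)
```

Early rows of each category are estimated from very little history, which is why the feature quality is uneven across rows.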

This method introduces the least amount of leakage from target variable and it requires no hyper parameter tuning. The only downside is that feature quality is not uniform.

But it's not a big deal. We can average models on encodings calculated from different data permutations.

It's also worth noting that the expanding mean method is the one used in CatBoost, a gradient boosting on trees library, which proves to perform magnificently on datasets with categorical features.


Okay, let's summarize what we've discussed in this video. We covered four different types of regularization.

Each of them has its own advantages and disadvantages. Sometimes unintuitively we introduce target variable leakage. But in practice, we can bear with it. 

**Personally, I recommend CV loop or expanding mean methods for practical tasks.**

They are the most robust and easy to tune. That was regularization. In the next chapter, I will tell you about various extensions and practical applications of mean encodings. Thank you.
