# Video: Using Target Encodings

This video shows an example using target encodings on a data set with many categorical columns.

Script:
* Target encodings are a neat trick that feels like it is cheating.
* For category values with lots of data, it gives the model the average target value for the category value.
* No hedging or approximations; it is the actual average target value.
* On the other hand, target encoding is not as strong for cases with lots of distinct values and not many rows for each value.
* In that case, the low number of rows per value means each encoding gets blended toward the overall mean, so the values often end up with nearly the same target encoding and all the low data cases get lumped together.
* So while target encodings feel like cheating where you have a lot of data, they do not have a special advantage for low data category values.
* Let's look at how to implement them in scikit-learn.
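
Before turning to scikit-learn, here is a minimal pure-Python sketch of the core idea: group the rows by category value and average the target. The toy data and the helper name `mean_target_by_category` are made up for illustration and are not part of the abalone example or of scikit-learn.

```python
from collections import defaultdict

# Hypothetical toy data: "a" has many rows, "b" and "c" have one row each.
categories = ["a"] * 6 + ["b", "c"]
targets = [10.0, 12.0, 11.0, 9.0, 13.0, 11.0, 4.0, 20.0]


def mean_target_by_category(categories, targets):
    """Return {category value: (row count, mean target)}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {c: (counts[c], sums[c] / counts[c]) for c in sums}


stats = mean_target_by_category(categories, targets)
for category, (count, mean) in stats.items():
    print(category, count, mean)
```

The well-populated value "a" gets a stable average, while the single-row values just echo their one target. That raw echo is too noisy to trust, which is why low data values are usually pulled toward the overall mean instead.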

In [None]:
import pandas as pd

In [None]:
abalone = pd.read_csv("https://raw.githubusercontent.com/bu-omds/bu-omds-data/main/data/abalone.tsv", sep="\t")

Script:
* Like with one hot encoding, scikit-learn provides a class to handle target encodings.
* In this case, it is sklearn.preprocessing.TargetEncoder, which was added in scikit-learn 1.3.

In [None]:
from sklearn.preprocessing import TargetEncoder

Script:
* Most of the default settings for TargetEncoder are reasonable, but you may need to set the target_type option.
* For the abalone data set, I had to set the target_type to "continuous", since the automatic detection mistakenly identified the Rings column as categorical, not numerical.


In [None]:
target_encoder = TargetEncoder(target_type="continuous")

Script:
* I suspect that was because the column was interpreted as integers, not real or floating point numbers.
* The other option that you might want to change is the smooth option.
* That option controls the blending to the mean for low data cases, but the default smoothing setting is reasonable, so you probably do not need to change it.
* Let's fit the target encoder now.
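
As an aside before the fit, the blending toward the mean can be sketched as a count-weighted average of the category mean and the overall mean. This is a simplified illustration with a fixed smoothing value; the function name `smoothed_encoding` and all the numbers are hypothetical, and the default smooth setting in scikit-learn chooses the amount of blending from the data rather than using a fixed number like this.

```python
def smoothed_encoding(count, category_mean, global_mean, smooth):
    """Blend a category mean with the global mean, weighted by row count."""
    # More rows means more weight on the category's own mean.
    weight = count / (count + smooth)
    return weight * category_mean + (1.0 - weight) * global_mean


global_mean = 11.25  # hypothetical overall target mean

# A value with lots of rows keeps almost exactly its own mean.
big = smoothed_encoding(1000, 9.0, global_mean, smooth=10.0)

# A value with a single row is pulled most of the way to the overall mean.
small = smoothed_encoding(1, 4.0, global_mean, smooth=10.0)

print(big, small)
```

With smooth set to zero in this sketch, every value would get its raw category mean with no blending at all.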


In [None]:
target_encoder.fit(abalone[["Sex"]], abalone["Rings"])

Script:
* One difference that you will notice here compared to one hot encoding is that fitting this transform requires the target y values to be passed in, just like a regression.
* Those targets are needed to calculate the average target values for the target encoding.
* Let's spot check what the target encoder chose.

In [None]:
target_encoder.categories_

Script:
* As with the one hot encoder, the category values of F, I and M were found.
* We can look to see what values were associated with them.

In [None]:
target_encoder.encodings_

Script:
* Let's pair those up so we can see which is which more easily.

In [None]:
dict(zip(target_encoder.categories_[0], target_encoder.encodings_[0]))

Script:
* According to this, female abalone have slightly more rings than male abalone in this data set, while the infant abalone have fewer.

In [None]:
target_encoder.transform(abalone[["Sex"]])

Script:
* We can again change this output format to be more pandas friendly.

In [None]:
target_encoder.set_output(transform="pandas")

In [None]:
target_encoder.transform(abalone[["Sex"]])

Script:
* In contrast to the one hot encoding, which makes several columns with different names, the target encoding makes just one output column.
* And that output column's name is the same as the original column.
* In this context, the usual approach is to just replace the old column wholesale.

In [None]:
abalone2 = abalone.copy()
abalone2["Sex"] = target_encoder.transform(abalone[["Sex"]])

abalone2

Script:
* And there you have it.
* The Sex column has been transformed from strings representing categories to numbers based on the average target values.