<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/Categorical_Encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Categorical Encoding<br>
This notebook covers a number of encoding algorithms from the category_encoders library (part of scikit-learn-contrib), with scikit-learn's own OneHotEncoder shown for comparison.

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Machine-Learning.git cloned-repo
%cd cloned-repo

In [None]:
from IPython.display import Image

**Categorical encoding is the process of transforming a categorical column into one (or more) numeric column(s).**

With numbers it’s easy to find relations (such as “bigger”, “smaller”, “double”, “half”).

With strings a computer can say pretty much only whether they are “equal” or “different”.

- **Supervised/Unsupervised**: when the encoding is based solely on the categorical column, it is unsupervised. If it is instead based on some function of the original column and a second (numeric) column, it is supervised.<br>
- **Output dimension**: the encoding of a categorical column may produce one numeric column (output dimension = 1) or many numeric columns (output dimension > 1).
<br>
- **Mapping**: if each level always has the same output, whether a scalar (e.g. OrdinalEncoder) or an array (e.g. OneHotEncoder), then the mapping is unique. If instead the same level is “allowed” to have different possible outputs, then the mapping is not unique.
<br>

In [None]:
Image("images/oneHot.png")

In [None]:
!pip install category_encoders
import category_encoders as ce

In [None]:
import pandas as pd
filename = 'BankChurners.csv'
df = pd.read_csv(filename)
df = df[df.columns[:-2]]
df = df.sample(frac=1.0, random_state=99)
print(df.shape)
df.head()

In [None]:
X= df.drop('Attrition_Flag', axis=1)
y=df["Attrition_Flag"]

In [None]:
X

In [None]:
y

# OrdinalEncoder
Each level is mapped to an integer, from 1 to L (where L is the number of levels).

In [None]:
sorted_x = sorted(set(df.Education_Level))
df.Education_Level = (df.Education_Level).replace(dict(zip(sorted_x, range(1, len(sorted_x) + 1))))

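Ordinal encoding often produces nonsense, especially if the levels have no intrinsic order: the integers imply a ranking and a spacing between levels that a model may treat as real. In the cell above the levels are simply sorted alphabetically.

It’s only a representation of convenience, often used to save memory or as an intermediate step for other types of encoding.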
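
When the levels do have a natural order, it is usually safer to state that order explicitly than to rely on alphabetical sorting. The following is a minimal sketch using a pandas ordered categorical; the toy column and the level order are assumptions made for illustration, not taken from the dataset.

```python
import pandas as pd

# Hypothetical levels, listed from lowest to highest (an assumption for this sketch).
level_order = ["High School", "College", "Graduate", "Doctorate"]
s = pd.Series(["College", "Doctorate", "High School", "Graduate", "College"])

# An ordered categorical assigns codes 0..L-1 following level_order; add 1 to get 1..L.
cat = pd.Categorical(s, categories=level_order, ordered=True)
codes = pd.Series(cat.codes + 1, index=s.index)
print(codes.tolist())
```

A value that is missing from `level_order` gets the code -1 (0 after the shift), which makes unexpected levels easy to spot.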

In [None]:
df

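# category_encoders OrdinalEncoder

In [None]:
# cols restricts the encoding to the listed column; the other columns pass through unchanged.
# X still holds the original string levels (df.Education_Level was overwritten above).
ce_ord = ce.OrdinalEncoder(cols = ['Education_Level'])
ce_ord.fit_transform(X)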



---



---



# CountEncoder

Each level is mapped to the number of observations of that level.

In [None]:
df = pd.read_csv(filename)
df = df[df.columns[:-2]]
df = df.sample(frac=1.0, random_state=99)
print(df.shape)
df.head()

In [None]:
df.Education_Level = df.Education_Level.replace(df.Education_Level.value_counts().to_dict())

Count encoding can be useful as an indicator of the “credibility” of each level. <br>
For instance, a machine learning algorithm may automatically decide to take into account the information brought by the level only if its count is above some threshold.
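
As a small illustration of the idea, the sketch below count-encodes a toy column and also computes the relative frequency, which carries the same information on a 0 to 1 scale. The column and its values are made up for the example.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a", "b"], name="level")

counts = s.value_counts()                 # a: 3, b: 2, c: 1
count_encoded = s.map(counts)             # raw number of observations per level
freq_encoded = s.map(counts / len(s))     # same information as a share of rows

print(count_encoded.tolist())
```

Note that two levels with the same count receive the same code, so the mapping is not invertible.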

In [None]:
df



---



---



# OneHotEncoder

OneHotEncoder is the encoding algorithm par excellence (and the most widely used).

Each level is mapped to a dummy column (i.e. a column of 0/1), indicating whether that level is carried by that row.

Your input is a single column; after encoding, your output consists of L columns (one for each level of the original column).

**This is why one-hot encoding should be handled with care:** you may end up with a dataframe that is far bigger than the original one.
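
To make the growth in width concrete, here is a minimal sketch with `pandas.get_dummies` on a made-up column: three levels become three 0/1 columns, and every row has exactly one 1.

```python
import pandas as pd

s = pd.Series(["red", "green", "blue", "green"], name="color")

# One dummy column per level; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(s, prefix="color", dtype=int)

print(dummies.shape)            # (rows, number of levels)
print(list(dummies.columns))
```

By the same logic, a column with 1,000 distinct levels would turn into 1,000 columns.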

In [None]:
df = pd.read_csv(filename)
df = df[df.columns[:-2]]
df = df.sample(frac=1.0, random_state=99)
print(df.shape)
df.head()

**Convert one column (Education_Level) to multiple columns**

In [None]:
ce_ohe = ce.OneHotEncoder(cols = ['Education_Level'])
ce_ohe.fit_transform(X, y)

**The downside**, as you can see, is that the new columns carry generic numeric suffixes instead of the level names. You can rename the columns yourself, or pass use_cat_names=True to the encoder so that the level names are used.

# sklearn OneHotEncoder
Use the sklearn OneHotEncoder instead

In [None]:
df = pd.read_csv(filename)
df = df[df.columns[:-2]]
df = df.sample(frac=1.0, random_state=99)
print(df.shape)
df.head()

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)

# df was shuffled, so reuse its index; otherwise pd.concat would align the
# encoded rows with the wrong observations.
df_encoded = pd.DataFrame(
    encoder.fit_transform(df[['Education_Level']]),
    columns=encoder.get_feature_names_out(['Education_Level']),
    index=df.index)

df = df.drop(['Education_Level'], axis=1)

df_ed_level_encoded = pd.concat([df, df_encoded], axis=1)

Scroll to the end of the columns and see the one hot encoding of Education_Level

In [None]:
df_ed_level_encoded



---



---



# SumEncoder

Sum Encoder compares the mean of the dependent variable (target) for a given level of a categorical column to the overall mean of the target.

Sum Encoding is very similar to OHE and both of them are commonly used in Linear Regression (LR) types of models.

With category_encoders, it looks like the code below.

SumEncoder (like the next three encoders) belongs to a class called “contrast encodings”. These encodings are designed to have a specific behaviour when used in regression problems. In other words, you use one of these encodings if you want the regression coefficients to have some specific properties.

In particular, SumEncoder is used when you want the regression coefficients of the levels to sum to zero.
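
The zero-sum property can be checked by hand. The sketch below builds a sum (deviation) contrast matrix with NumPy and fits an exact regression on four level means; the numbers are toy values, and the column layout is an assumption that may differ from what category_encoders produces.

```python
import numpy as np

level_means = np.array([1.0, 2.0, 0.0, 1.0])   # one observation per level a, b, c, d
k = len(level_means)

# Sum contrasts: identity rows for the first k-1 levels, a row of -1 for the last level.
contrasts = np.vstack([np.eye(k - 1), -np.ones((1, k - 1))])

# Design matrix = intercept column + contrast columns; the system is square, so solve it exactly.
design = np.column_stack([np.ones(k), contrasts])
beta = np.linalg.solve(design, level_means)

intercept, coefs = beta[0], beta[1:]
implied_last = -coefs.sum()                     # effect of the last level, d
print(intercept, coefs, implied_last)
```

The intercept equals the mean of the level means, each coefficient is a level's deviation from it, and the deviations of all four levels (including the implied one for `d`) add up to zero.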

**Simple example of SumEncoder**

In [None]:
df2 = pd.DataFrame({
    'color':["a", "b", "c", "d"],
    'outcome':[1, 2,  0, 1]})

In [None]:
df2

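Step 1: convert strings to numbers (ordinal encoding)<br>
Step 2: convert the numbers to sum-contrast columns<br>
SumEncoder performs both steps internally; the OrdinalEncoder call below only illustrates step 1, and the cell displays the SumEncoder output.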

In [None]:
ce_ord=ce.OrdinalEncoder(cols=['color'])
ce_ord.fit_transform(df2,df2.outcome)

ce_sum = ce.SumEncoder(cols = ['color'])
ce_sum.fit_transform(df2,df2.outcome)



---



**Dataframe example of SumEncoder**

In [None]:
df = pd.read_csv(filename)
df = df[df.columns[:-2]]
df = df.sample(frac=1.0, random_state=99)
print(df.shape)
df.head(20)

**SumEncode the categorical data**

In [None]:
ce_sum = ce.SumEncoder(cols = ['Education_Level'])
df_se= ce_sum.fit_transform(df)

In [None]:
df_se.head(20)

# BackwardDifferenceEncoder

This encoder is useful for ordinal variables, i.e. variables whose levels can be ordered in a meaningful way. BackwardDifferenceEncoder is designed to compare adjacent levels.

Suppose you have an ordinal variable (e.g. education level) and you want to know how it is related to a numeric variable (e.g. income). It may be interesting to compare each pair of consecutive levels (e.g. bachelors vs. high-school, masters vs. bachelors) with respect to the target variable. This is what BackwardDifferenceEncoder is designed for.
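
The adjacent-level comparison can be reproduced with a few lines of NumPy. This is a sketch under assumptions: the contrast matrix follows the textbook backward-difference coding, the level means are toy values, and the exact columns returned by category_encoders may be laid out differently.

```python
import numpy as np

def backward_difference_contrasts(k):
    """Contrast matrix (k levels x k-1 columns) comparing each level with the previous one."""
    c = np.empty((k, k - 1))
    for j in range(1, k):
        c[:j, j - 1] = -(k - j) / k   # the first j levels
        c[j:, j - 1] = j / k          # the remaining levels
    return c

level_means = np.array([1.0, 2.0, 0.0, 1.0])   # hypothetical target means of four ordered levels
k = len(level_means)
design = np.column_stack([np.ones(k), backward_difference_contrasts(k)])
beta = np.linalg.solve(design, level_means)

print(beta[0])    # intercept: mean of the level means
print(beta[1:])   # coefficients: differences between adjacent level means
```

With this coding, each regression coefficient is the target mean of one level minus the target mean of the level just before it.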



---



**Simple Example**

In [None]:
df2 = pd.DataFrame({
    'color':["a", "b", "c", "d"],
    'outcome':[1, 2,  0, 1]})
df2

In [None]:
X2 = df2.drop('outcome', axis = 1)
y2 = df2['outcome']  # pass the target as a Series rather than a one-column DataFrame

In [None]:
ce_backward = ce.BackwardDifferenceEncoder(cols = ['color'])
ce_backward.fit_transform(X2, y2)



---



In [None]:
ce_backward = ce.BackwardDifferenceEncoder(cols = ['Education_Level'])
ce_backward.fit_transform(X,y)

# HelmertEncoder
HelmertEncoder is very similar to BackwardDifferenceEncoder, but instead of being compared just to the previous one, each level is compared with all the previous levels.
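
To see what "compared with all the previous levels" means numerically, the sketch below uses the unscaled (reverse) Helmert contrast matrix on toy level means. The matrix layout is an assumption based on the standard definition and may not match the library's output column for column.

```python
import numpy as np

def helmert_contrasts(k):
    """Unscaled reverse Helmert contrasts: column j compares level j+1 with levels 1..j."""
    c = np.zeros((k, k - 1))
    for j in range(1, k):
        c[:j, j - 1] = -1
        c[j, j - 1] = j
    return c

level_means = np.array([1.0, 2.0, 0.0, 1.0])   # hypothetical target means of four ordered levels
k = len(level_means)
design = np.column_stack([np.ones(k), helmert_contrasts(k)])
beta = np.linalg.solve(design, level_means)

# Direct computation: each level mean minus the mean of all previous level means.
gaps = np.array([level_means[j] - level_means[:j].mean() for j in range(1, k)])
print(beta[1:] * np.arange(2, k + 1))   # coefficient j multiplied by (j + 1) ...
print(gaps)                             # ... equals the corresponding gap
```

Because the matrix is unscaled, coefficient j equals the gap divided by j + 1 rather than the gap itself.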



In [None]:
ce_helmert = ce.HelmertEncoder(cols = ['Education_Level'])
ce_helmert.fit_transform(X,y)

# Polynomial Encoder

PolynomialEncoder is designed to quantify linear, quadratic and cubic behaviors of the target variable with respect to the categorical variable.

How can a numeric variable have a linear (or quadratic, or cubic) relation with a variable that is not numeric? <br>

This is based on the assumption that the underlying categorical variable has levels that are not only ordinable, but also equally spaced.

**For this reason, use it with care, only when you are sure that the assumption is reasonable.**
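
The equal-spacing assumption is visible in how polynomial contrasts are built. The following sketch derives orthonormal linear, quadratic and cubic contrasts for four levels by orthogonalizing powers of the scores 1, 2, 3, 4; it illustrates the construction and is not necessarily the exact routine used by category_encoders.

```python
import numpy as np

def polynomial_contrasts(k):
    """Orthonormal polynomial contrasts for k equally spaced levels (degrees 1..k-1)."""
    scores = np.arange(1, k + 1, dtype=float)             # equal spacing is the key assumption
    powers = np.vander(scores - scores.mean(), N=k, increasing=True)  # columns 1, x, x^2, ...
    q, r = np.linalg.qr(powers)
    q = q * np.sign(np.diag(r))                           # fix the sign that QR leaves free
    return q[:, 1:]                                       # drop the constant column

c = polynomial_contrasts(4)
print(np.round(c, 4))   # columns: linear, quadratic, cubic
```

If the true distances between levels are uneven, the scores `1, 2, 3, 4` misrepresent them and the "linear" or "quadratic" labels lose their meaning.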

In [None]:
ce_poly = ce.PolynomialEncoder(cols = ['Education_Level'])
ce_poly.fit_transform(X,y)

# BinaryEncoder

BinaryEncoder is basically the same as OrdinalEncoder; the only difference is that the integers are converted to binary numbers, and each binary digit then becomes its own 0/1 column.
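
The two steps can be written out in plain Python. This is a sketch of the idea on five made-up levels, assuming the ordinal codes start at 1; the library's column names and handling of unknown values are not reproduced here.

```python
levels = ["a", "b", "c", "d", "e"]

# Step 1: ordinal encoding, 1..L.
ordinal = {level: i for i, level in enumerate(levels, start=1)}

# Step 2: write each integer in binary, padded to a common width; each digit is one column.
n_bits = max(ordinal.values()).bit_length()
binary = {level: [int(bit) for bit in format(i, f"0{n_bits}b")]
          for level, i in ordinal.items()}

for level, bits in binary.items():
    print(level, bits)
```

Five levels fit in three columns here, whereas one-hot encoding would need five.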

**Simple Example**

In [None]:
df2 = pd.DataFrame({
    'color':["a", "b", "c", "d","e"],
    'outcome':[1, 2,  0, 1,0]})
df2

In [None]:
bin_color=ce.BinaryEncoder(cols=['color'])
bin_color.fit_transform(df2.color,df2.outcome)

In [None]:
ce_bin = ce.BinaryEncoder(cols = ['Education_Level'])
ce_bin.fit_transform(X, y)

# BaseNEncoder

BaseNEncoder is simply a generalization of the BinaryEncoder. <br>

In fact, in BinaryEncoder the numbers are in base 2, whereas in BaseNEncoder they are in base n, where n is set through the base parameter.

The Base-N encoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base 1, but useful), a base of 2 is equivalent to binary encoding, and a base equal to the number of actual categories is equivalent to vanilla ordinal encoding.
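
As a rough illustration of the base-N representation, the sketch below converts ordinal codes to digits in base 3 and base 2. The helper `to_base` is hypothetical and written for this example; it is not part of category_encoders.

```python
def to_base(number, base, width):
    """Digits of a non-negative integer in the given base (>= 2), most significant first."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]

levels = ["a", "b", "c", "d", "e"]
ordinal = {level: i for i, level in enumerate(levels, start=1)}

base3 = {level: to_base(i, base=3, width=2) for level, i in ordinal.items()}
base2 = {level: to_base(i, base=2, width=3) for level, i in ordinal.items()}
print(base3)
print(base2)
```

A larger base needs fewer columns (two instead of three here), at the price of digits that are no longer just 0 or 1.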





**Simple Example**

In [None]:
df2 = pd.DataFrame({
    'color':["a", "b", "c", "d","e"],
    'outcome':[1, 2,  0, 1, 0]})
df2

In [None]:
# base defaults to 2, which reproduces BinaryEncoder; pass e.g. base=3 for base-3 digits.
ce_basen = ce.BaseNEncoder(cols = ['color'])
ce_basen.fit_transform(df2.color,df2.outcome)



---



---



In [None]:
ce_basen = ce.BaseNEncoder(cols = ['Education_Level'])
ce_basen.fit_transform(X, y)

# HashingEncoder

A multivariate hashing implementation with configurable dimensionality/precision.

The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.

It’s important to read about how max_process and max_sample work before setting them manually; inappropriate settings slow down encoding.

The default value of ‘max_process’ is 1 on Windows because multiprocessing might cause issues; see https://github.com/scikit-learn-contrib/categorical-encoding/issues/215 and https://docs.python.org/2/library/multiprocessing.html?highlight=process#windows

**Hashing** is the process of transforming any given key or string of characters into another value. This is usually a shorter, fixed-length value or key that represents the original string and makes it easier to find or use. The most popular use for hashing is the implementation of hash tables.

**Simple Example**

In [None]:
df2 = pd.DataFrame({
    'color':["a", "b", "c", "d","e"],
    'outcome':[1, 2,  0, 1, 0]})
df2

In [None]:
ce_hash = ce.HashingEncoder(cols = ['color'])
ce_hash.fit_transform(df2.color,df2.outcome)



---



---



In [None]:
ce_hash = ce.HashingEncoder(cols = ['Education_Level'])
ce_hash.fit_transform(X, y)

The fundamental property of hashing is that the resulting integer is approximately uniformly distributed. So, if you take the remainder of that integer modulo a big enough divisor, it’s unlikely that two different strings are mapped to the same value. Why would that be useful? Actually, this has a very practical application called the “hashing trick”.

Imagine that you want to make an email spam classifier using a logistic regression. You could do that by one-hot-encoding all the words contained in your dataset. The main downsides are that you would need to store the mapping in a separate dictionary and your model dimensions would change any time new strings appear.

These issues can be easily overcome by using the hashing trick: by hashing the input, you don’t need a dictionary anymore and your output dimension is fixed (it depends only on the divisor that you choose initially). Moreover, by the properties of hashing, a new string will likely have a different encoding from the existing ones.
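
A minimal sketch of the hashing trick, using only the standard library: each string is hashed with MD5 and reduced modulo a fixed number of buckets. The function names and the choice of 8 buckets are assumptions for illustration; this is not the implementation inside HashingEncoder.

```python
import hashlib

def hash_bucket(value, n_components=8):
    """Map a string to one of n_components buckets via MD5 and a modulo."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components     # n_components plays the role of the divisor

def hash_encode(value, n_components=8):
    """Fixed-width indicator row: a 1 in the bucket the value hashes to."""
    row = [0] * n_components
    row[hash_bucket(value, n_components)] = 1
    return row

seen = ["cheap", "offer", "meeting"]
for word in seen + ["brand-new-word"]:        # the last word was never seen before
    print(word, hash_encode(word))
```

No dictionary is stored and the row width never changes, but with few buckets two different strings can collide in the same position.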



# TargetEncoder


Some encoders behave differently depending on whether y is given or not. This is mainly due to regularisation in order to avoid overfitting.

**On training data transform should be called with y**,<br>
**On test data without**.

TargetEncoder replaces each level with a weighted average of the target mean within that level (the group mean) and the global target mean. The weight depends on the group size and on a parameter called “smoothing”. When smoothing is 0, we rely solely on group means. Then, as smoothing increases, the global mean weighs more and more, leading to stronger regularization.
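
The effect of smoothing can be sketched with a simple additive blend of group mean and global mean. The formula below is a simplified stand-in chosen for readability and is an assumption, not the exact weighting used by category_encoders; the function name is hypothetical.

```python
import pandas as pd

def smoothed_target_encode(levels, target, smoothing):
    """Blend each level's target mean with the global mean; more smoothing, more global mean."""
    global_mean = target.mean()
    stats = target.groupby(levels).agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + smoothing)   # large groups are trusted more
    encoding = weight * stats["mean"] + (1 - weight) * global_mean
    return levels.map(encoding)

levels = pd.Series(["a", "a", "a", "b"])
target = pd.Series([1, 1, 0, 0])

no_smoothing = smoothed_target_encode(levels, target, smoothing=0)    # pure group means
some_smoothing = smoothed_target_encode(levels, target, smoothing=1)  # pulled toward the global mean
print(no_smoothing.tolist())
print(some_smoothing.tolist())
```

The single-row level `b` moves much further toward the global mean than the three-row level `a`, which is the regularizing behaviour described above.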



**Simple Example**

In [None]:
df2 = pd.DataFrame({
    'color':["a", "b", "c", "d","e"],
    'outcome':[1, 2,  0, 1, 0]})
df2

In [None]:
# Target with default parameters
ce_target = ce.TargetEncoder(cols = ['color'])

ce_target.fit(df2.color, df2.outcome)
# Must pass the series for y in v1.2.8

ce_target.transform(df2.color, df2.outcome)

In [None]:
df = pd.read_csv(filename)
df = df[df.columns[:-2]]
df = df.sample(frac=1.0, random_state=99)
print(df.shape)
df.head()


In [None]:
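# Encode the target as 0/1; TargetEncoder needs a numeric target.
df['Attrition_Flag'] = df['Attrition_Flag'].map(
    {'Existing Customer': 0, 'Attrited Customer': 1})

In [None]:
# Rebuild X and y so that y holds the numeric target defined above.
X = df.drop('Attrition_Flag', axis=1)
y = df['Attrition_Flag']
ce_target = ce.TargetEncoder(cols = ['Education_Level'])
ce_target.fit(X, y)
ce_target.transform(X, y)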



---



---

