# Categorical encoders

In this notebook, you will learn how to use the different categorical encoders to transform the categorical variables in a dataset into numbers. For the demonstration, I use the Titanic dataset, available on [Kaggle](https://www.kaggle.com/c/titanic/data).

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from importlib import reload

pd.set_option('display.max_columns', None)

from feature_engine import categorical_encoders as ce

In [2]:
# function to load the titanic dataset, and get the first letter of the variable cabin
# we will load the dataset multiple times during the demo

def load_titanic():
    data = pd.read_csv('titanic.csv')
    data['Cabin'] = data['Cabin'].astype(str).str[0]
    data['Pclass'] = data['Pclass'].astype('O')
    # assign the result back instead of filling in place on a column selection
    data['Embarked'] = data['Embarked'].fillna('C')
    return data

## CountFrequencyCategoricalEncoder

The CountFrequencyCategoricalEncoder replaces the strings / labels in a variable with the count or frequency of the observations that show that label in the train set.

Briefly, if you select "count" and 10 observations in the train set show the colour blue, then blue is replaced by 10. Alternatively, if you select "frequency" and 10% of the observations in the train set show the colour blue, then blue is replaced by 0.1.
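
To make the two options concrete, here is a minimal pandas sketch of the same idea. It does not use the encoder itself; the toy `train` DataFrame and the column names are made up for illustration.

```python
import pandas as pd

# Toy train set (hypothetical): 10 observations of a colour variable.
train = pd.DataFrame({"colour": ["blue"] * 5 + ["red"] * 3 + ["green"] * 2})

# "count": each label maps to its number of observations in the train set.
count_map = train["colour"].value_counts().to_dict()

# "frequency": each label maps to its fraction of observations in the train set.
freq_map = train["colour"].value_counts(normalize=True).to_dict()

# Replace the labels with the learned numbers.
train["colour_count"] = train["colour"].map(count_map)
train["colour_freq"] = train["colour"].map(freq_map)
print(train.head())
```

The mappings are learned once from the train set and then reused, which is what the encoder's `encoder_dict_` attribute stores in the cells below.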

### Frequency

Labels are replaced by the fraction of the observations that show that label in the train set.

In [3]:
# load data
data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [4]:
# we will encode the below variables, they have no missing values
data[['Cabin', 'Pclass', 'Embarked']].isnull().sum()

Cabin       0
Pclass      0
Embarked    0
dtype: int64

In [5]:
count_enc = ce.CountFrequencyCategoricalEncoder(encoding_method = 'frequency', variables = ['Cabin', 'Pclass', 'Embarked'])
count_enc.fit(data)

CountFrequencyCategoricalEncoder(encoding_method='frequency',
                 variables=['Cabin', 'Pclass', 'Embarked'])

In [6]:
# within the encoder, you can explore the encoder_dict_ to find out
# the string replacements.

count_enc.encoder_dict_

{'Cabin': {'A': 0.016835016835016835,
  'B': 0.052749719416386086,
  'C': 0.06621773288439955,
  'D': 0.037037037037037035,
  'E': 0.03591470258136925,
  'F': 0.014590347923681257,
  'G': 0.004489337822671156,
  'T': 0.001122334455667789,
  'n': 0.7710437710437711},
 'Embarked': {'C': 0.19079685746352412,
  'Q': 0.08641975308641975,
  'S': 0.7227833894500562},
 'Pclass': {1: 0.24242424242424243,
  2: 0.20650953984287318,
  3: 0.5510662177328844}}

In [7]:
#transform the data: see the change in the head view

data = count_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,0.551066,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,0.771044,0.722783
1,2,1,0.242424,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,0.066218,0.190797
2,3,1,0.551066,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,0.771044,0.722783
3,4,1,0.242424,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,0.066218,0.722783
4,5,0,0.551066,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,0.771044,0.722783


### Count

Labels are replaced by the number of observations that show that label in the train set.

In [8]:
# load data

data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [9]:
# this time we encode only 1 variable

count_enc = ce.CountFrequencyCategoricalEncoder(encoding_method = 'count', variables = 'Cabin')
count_enc.fit(data)

CountFrequencyCategoricalEncoder(encoding_method='count', variables=['Cabin'])

In [10]:
# again, you can find the mappings in the encoder_dict_ attribute.

count_enc.encoder_dict_

{'Cabin': {'A': 15,
  'B': 47,
  'C': 59,
  'D': 33,
  'E': 32,
  'F': 13,
  'G': 4,
  'T': 1,
  'n': 687}}

In [11]:
# transform the data: see the change in the head view for Cabin
data = count_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,687,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,59,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,687,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,59,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,687,S


### Select categorical variables automatically

If we don't indicate which variables we want to encode, the encoder will find and encode all the categorical variables.
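
As a rough sketch of what "categorical" means here, I assume the encoder picks the columns of dtype object. That assumption is consistent with `load_titanic()` casting `Pclass` to `'O'`, but it is not taken from the library's documentation. The toy DataFrame below is made up.

```python
import pandas as pd

# Toy data (hypothetical). "size" holds integers but is stored as object,
# in the same way that load_titanic() casts Pclass to 'O'.
toy = pd.DataFrame({
    "colour": pd.Series(["blue", "red"], dtype="object"),
    "size": pd.Series([1, 2], dtype="object"),
    "price": [1.5, 2.0],
})

# Assumed selection rule: keep the columns whose dtype is object.
categorical = toy.select_dtypes(include="object").columns.tolist()
print(categorical)
```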

In [12]:
# load data

data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [13]:
# this time we omit the variables argument

count_enc = ce.CountFrequencyCategoricalEncoder(encoding_method = 'count')
count_enc.fit(data)

CountFrequencyCategoricalEncoder(encoding_method='count',
                 variables=['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'])

In [14]:
# we can see that the encoder selected automatically all the categorical variables

count_enc.variables

['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [15]:
# transform the data: see the change in the head view

data = count_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,491,1,577,22.0,1,0,1,7.25,687,644
1,2,1,216,1,314,38.0,1,0,1,71.2833,59,170
2,3,1,491,1,314,26.0,0,0,1,7.925,687,644
3,4,1,216,1,314,35.0,1,0,2,53.1,59,644
4,5,0,491,1,577,35.0,0,0,1,8.05,687,644


## MeanCategoricalEncoder

The MeanCategoricalEncoder replaces the labels of a variable with the mean value of the target for that label. For example, in the variable colour, if the mean value of the binary target is 0.5 for the label blue, then blue is replaced by 0.5.
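
The same idea can be sketched in plain pandas with a group-by. This is only an illustration on made-up data, not the encoder's own code.

```python
import pandas as pd

# Toy data (hypothetical): a colour variable and a binary target.
toy = pd.DataFrame({
    "colour": ["blue", "blue", "red", "red", "red", "green"],
    "target": [1, 0, 1, 1, 0, 0],
})

# Mean of the target per label.
mean_map = toy.groupby("colour")["target"].mean().to_dict()

# Replace each label with the mean of the target for that label.
toy["colour_encoded"] = toy["colour"].map(mean_map)
print(mean_map)
```

Because the mapping depends on the target, it should be learned from the train set only and then applied to new data.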

In [16]:
# load the data

data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [17]:
# we will transform the below 3 variables
mean_enc = ce.MeanCategoricalEncoder(variables = ['Cabin', 'Pclass', 'Embarked'])

# Note: this transformer needs the target to fit
mean_enc.fit(data, data.Survived)

MeanCategoricalEncoder(variables=['Cabin', 'Pclass', 'Embarked'])

In [18]:
# see the dictionary with the mappings per variable

mean_enc.encoder_dict_

{'Cabin': {'A': 0.4666666666666667,
  'B': 0.7446808510638298,
  'C': 0.5932203389830508,
  'D': 0.7575757575757576,
  'E': 0.75,
  'F': 0.6153846153846154,
  'G': 0.5,
  'T': 0.0,
  'n': 0.29985443959243085},
 'Embarked': {'C': 0.5588235294117647,
  'Q': 0.38961038961038963,
  'S': 0.33695652173913043},
 'Pclass': {1: 0.6296296296296297,
  2: 0.47282608695652173,
  3: 0.24236252545824846}}

In [19]:
mean_enc.variables

['Cabin', 'Pclass', 'Embarked']

In [20]:
# we can see the transformed variables in the head view

data = mean_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,0.242363,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,0.299854,0.336957
1,2,1,0.62963,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,0.59322,0.558824
2,3,1,0.242363,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,0.299854,0.336957
3,4,1,0.62963,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,0.59322,0.336957
4,5,0,0.242363,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,0.299854,0.336957


### Automatically select the variables

When no variables are specified, this encoder selects and encodes all the categorical variables.

In [21]:
# load the data

data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [22]:
# this time we do not specify the variables, so all categorical variables are encoded
mean_enc = ce.MeanCategoricalEncoder()

# Note: this transformer needs the target to fit
mean_enc.fit(data, data.Survived)

MeanCategoricalEncoder(variables=['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'])

In [23]:
mean_enc.variables

['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [24]:
# we can see the transformed variables in the head view

data = mean_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,0.242363,0,0.188908,22.0,1,0,0.0,7.25,0.299854,0.336957
1,2,1,0.62963,1,0.742038,38.0,1,0,1.0,71.2833,0.59322,0.558824
2,3,1,0.242363,1,0.742038,26.0,0,0,1.0,7.925,0.299854,0.336957
3,4,1,0.62963,1,0.742038,35.0,1,0,0.5,53.1,0.59322,0.336957
4,5,0,0.242363,0,0.188908,35.0,0,0,0.0,8.05,0.299854,0.336957


## WoERatioCategoricalEncoder

This encoder replaces the labels by the weight of evidence or the ratio of probabilities. It only works for binary classification.

    The weight of evidence is given by: np.log( p(1) / p(0) )
    
    The target probability ratio is given by: p(1) / p(0)
    
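Here p(1) and p(0) are the proportions of observations with target 1 and target 0 for a given label. This transformer will likely not work for variables with high cardinality or a high number of infrequent labels. It is recommended to use it after the RareLabelCategoricalEncoder.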
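
A minimal sketch of both formulas in pandas and numpy, on made-up data. It follows the definitions above and is not the encoder's own code.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a colour variable and a binary target.
toy = pd.DataFrame({
    "colour": ["blue"] * 4 + ["red"] * 4,
    "target": [1, 1, 1, 0, 1, 0, 0, 0],
})

# p(1): proportion of target == 1 per label; p(0) is its complement.
p1 = toy.groupby("colour")["target"].mean()
p0 = 1 - p1

# Target probability ratio and weight of evidence per label.
ratio_map = (p1 / p0).to_dict()
woe_map = np.log(p1 / p0).to_dict()
print(ratio_map)
print(woe_map)
```

Both quantities break down when p(0) or p(1) is zero for a label, since the ratio divides by zero or the logarithm is taken of zero. In the mean encoding output earlier in this notebook, the Cabin label T has a target mean of 0.0, which is why rare labels are grouped first.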
    
### Weight of evidence

In [25]:
# load the data

data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [26]:
## Rare value encoder first, see below for more details on this encoder

rare_encoder = ce.RareLabelCategoricalEncoder(tol = 0.03, n_categories=5, variables = ['Cabin', 'Pclass', 'Embarked'])
rare_encoder.fit(data)

# transform
data = rare_encoder.transform(data)

In [27]:
woe_enc = ce.WoERatioCategoricalEncoder(encoding_method = 'woe', variables = ['Cabin', 'Pclass', 'Embarked'])

# to fit you need to pass the target y
woe_enc.fit(data, data.Survived)

WoERatioCategoricalEncoder(encoding_method='woe',
              variables=['Cabin', 'Pclass', 'Embarked'])

In [28]:
woe_enc.encoder_dict_

{'Cabin': {'B': 1.0704414117014132,
  'C': 0.377294231141468,
  'D': 1.1394342831883648,
  'E': 1.0986122886681098,
  'Rare': 0.06062462181643484,
  'n': -0.8479911013161802},
 'Embarked': {'C': 0.23638877806423053,
  'Q': -0.44895022004790314,
  'S': -0.6768866596881652},
 'Pclass': {1: 0.5306282510621705,
  2: -0.10880285984879919,
  3: -1.1397703611616172}}

In [29]:
# transform and visualise the data

data = woe_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,-1.13977,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,-0.847991,-0.676887
1,2,1,0.530628,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,0.377294,0.236389
2,3,1,-1.13977,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,-0.847991,-0.676887
3,4,1,0.530628,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,0.377294,-0.676887
4,5,0,-1.13977,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,-0.847991,-0.676887


### Ratio

Similarly, it is recommended to remove rare labels and high cardinality before using this encoder.

In [30]:
# load the data

data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [31]:
## Rare value encoder first, see below for more details on this encoder

rare_encoder = ce.RareLabelCategoricalEncoder(tol = 0.03, n_categories=5, variables = ['Cabin', 'Pclass', 'Embarked'])
rare_encoder.fit(data)

# transform
data = rare_encoder.transform(data)

In [32]:
ratio_enc = ce.WoERatioCategoricalEncoder(encoding_method = 'ratio', variables = ['Cabin', 'Pclass', 'Embarked'])

# to fit you need to pass the target y
ratio_enc.fit(data, data.Survived)

WoERatioCategoricalEncoder(encoding_method='ratio',
              variables=['Cabin', 'Pclass', 'Embarked'])

In [33]:
ratio_enc.encoder_dict_

{'Cabin': {'B': 2.916666666666666,
  'C': 1.4583333333333333,
  'D': 3.125,
  'E': 3.0,
  'Rare': 1.0625,
  'n': 0.42827442827442824},
 'Embarked': {'C': 1.2666666666666668,
  'Q': 0.6382978723404256,
  'S': 0.5081967213114753},
 'Pclass': {1: 1.7000000000000002,
  2: 0.8969072164948453,
  3: 0.31989247311827956}}

In [34]:
# transform with the ratio encoder fitted above (not the earlier woe_enc)
data = ratio_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,0.319892,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,0.428274,0.508197
1,2,1,1.7,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,1.458333,1.266667
2,3,1,0.319892,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,0.428274,0.508197
3,4,1,1.7,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,1.458333,0.508197
4,5,0,0.319892,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,0.428274,0.508197


## OrdinalCategoricalEncoder

The OrdinalCategoricalEncoder replaces the variable labels with integers, from 0 to n - 1, where n is the number of different labels. If you select "arbitrary", the encoder assigns the numbers in the order in which the labels appear in the variable. If you select "ordered", the encoder assigns the numbers following the mean of the target for each label, in ascending order. So the label with the smallest target mean gets the number 0, and the label with the highest target mean gets the number n - 1.
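
Both numbering schemes can be sketched in a few lines of pandas. The toy data is made up, and the sketch only mirrors the behaviour seen in the mappings below rather than the encoder's actual code.

```python
import pandas as pd

# Toy data (hypothetical): a colour variable and a binary target.
toy = pd.DataFrame({
    "colour": ["blue", "blue", "red", "red", "green", "green"],
    "target": [1, 1, 1, 0, 0, 0],
})

# "ordered": sort the labels by target mean (ascending), then number them from 0.
ordered_labels = toy.groupby("colour")["target"].mean().sort_values().index
ordered_map = {label: i for i, label in enumerate(ordered_labels)}

# "arbitrary": number the labels in order of first appearance.
arbitrary_map = {label: i for i, label in enumerate(toy["colour"].unique())}

print(ordered_map)
print(arbitrary_map)
```

With "ordered", the integers are monotonically related to the target mean, which can help linear models. With "arbitrary", the numbers carry no such meaning.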

### Ordered

In [35]:
# load data
data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [36]:
# we will encode 3 variables:

ordinal_enc = ce.OrdinalCategoricalEncoder(encoding_method = 'ordered', variables = ['Pclass', 'Cabin', 'Embarked'])

# for this encoder, you need to pass the target as argument
ordinal_enc.fit(data, data.Survived)

OrdinalCategoricalEncoder(encoding_method='ordered',
             variables=['Pclass', 'Cabin', 'Embarked'])

In [37]:
# here we can see the mappings
ordinal_enc.encoder_dict_

{'Cabin': {'A': 2,
  'B': 6,
  'C': 4,
  'D': 8,
  'E': 7,
  'F': 5,
  'G': 3,
  'T': 0,
  'n': 1},
 'Embarked': {'C': 2, 'Q': 1, 'S': 0},
 'Pclass': {1: 2, 2: 1, 3: 0}}

In [38]:
# transform: see the numerical values in the former categorical variables

data = ordinal_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,1,0
1,2,1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,4,2
2,3,1,0,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,1,0
3,4,1,2,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,4,0
4,5,0,0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,1,0


### Arbitrary

In [39]:
# load data
data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [40]:
ordinal_enc = ce.OrdinalCategoricalEncoder(encoding_method = 'arbitrary', variables = 'Cabin')

# for this encoder we don't need to add the target. You can leave it or remove it.
ordinal_enc.fit(data, data.Survived)

OrdinalCategoricalEncoder(encoding_method='arbitrary', variables=['Cabin'])

In [41]:
ordinal_enc.encoder_dict_

{'Cabin': {'A': 5,
  'B': 6,
  'C': 1,
  'D': 4,
  'E': 2,
  'F': 7,
  'G': 3,
  'T': 8,
  'n': 0}}

Note that the numbers assigned to the labels differ between "arbitrary" and "ordered".

In [42]:
# transform: see the numerical values in the former categorical variables

data = ordinal_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,0,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,1,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,0,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,0,S


### Automatically select categorical variables

This encoder also selects all the categorical variables if None is passed to the variables argument when setting up the encoder.

In [43]:
# load data
data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [44]:
ordinal_enc = ce.OrdinalCategoricalEncoder(encoding_method = 'arbitrary')

# for this encoder we don't need to add the target. You can leave it or remove it.
ordinal_enc.fit(data)

OrdinalCategoricalEncoder(encoding_method='arbitrary',
             variables=['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'])

In [45]:
ordinal_enc.variables

['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [46]:
# transform: see the numerical values in the former categorical variables

data = ordinal_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,0,0,0,22.0,1,0,0,7.25,0,0
1,2,1,1,1,1,38.0,1,0,1,71.2833,1,1
2,3,1,0,2,1,26.0,0,0,2,7.925,0,0
3,4,1,1,3,1,35.0,1,0,3,53.1,1,0
4,5,0,0,4,0,35.0,0,0,4,8.05,0,0


## OneHotCategoricalEncoder

Performs one hot encoding. You can choose how many labels per variable are encoded into binary variables. When top_categories is set to None, all the labels are transformed into binary variables. When top_categories is set to an integer, for example 10, only the 10 most frequent categories are transformed into binary variables, and the rest are discarded.

The encoder can also create binary variables for all labels (drop_last = False), or omit the binary variable for the last label (drop_last = True), for use in linear models.
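
For comparison, here is a rough pandas equivalent of the three options on made-up data. Note one difference: `pd.get_dummies(..., drop_first=True)` drops the first label in sorted order, whereas the encoder above drops the last one. The `top` selection is my own sketch of the top_categories idea.

```python
import pandas as pd

# Toy data (hypothetical).
toy = pd.DataFrame({"colour": ["blue", "blue", "blue", "red", "red", "green"]})

# One binary column per label.
all_dummies = pd.get_dummies(toy["colour"], prefix="colour")

# k - 1 binary columns: pandas drops the FIRST label in sorted order.
k_minus_1 = pd.get_dummies(toy["colour"], prefix="colour", drop_first=True)

# Binary columns only for the 2 most frequent labels; other labels get all zeros.
top = toy["colour"].value_counts().head(2).index
top_dummies = pd.DataFrame(
    {f"colour_{label}": (toy["colour"] == label).astype(int) for label in top}
)

print(list(all_dummies.columns))
print(list(k_minus_1.columns))
print(list(top_dummies.columns))
```

Dropping one binary per variable avoids perfectly collinear columns, which is why it is suggested for linear models.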

### All binary, no top_categories

In [47]:
# load the data and reduce the number of labels to create fewer dummy variables

data = load_titanic()
data = rare_encoder.transform(data)

In [48]:
ohe_enc = ce.OneHotCategoricalEncoder(top_categories = None, variables = ['Pclass', 'Cabin', 'Embarked'], drop_last = False)
ohe_enc.fit(data)

OneHotCategoricalEncoder(drop_last=False, top_categories=None,
             variables=['Pclass', 'Cabin', 'Embarked'])

In [49]:
ohe_enc.drop_last

False

In [50]:
ohe_enc.encoder_dict_

{'Cabin': array(['n', 'C', 'E', 'Rare', 'D', 'B'], dtype=object),
 'Embarked': array(['S', 'C', 'Q'], dtype=object),
 'Pclass': array([3, 1, 2], dtype=object)}

In [51]:
data = ohe_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Pclass_3,Pclass_1,Pclass_2,Cabin_n,Cabin_C,Cabin_E,Cabin_Rare,Cabin_D,Cabin_B,Embarked_S,Embarked_C,Embarked_Q
0,1,0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,1,0,0,1,0,0,0,0,0,1,0,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,0,1,0,0,1,0,0,0,0,0,1,0
2,3,1,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,1,0,0,1,0,0,0,0,0,1,0,0
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,0,1,0,0,1,0,0,0,0,1,0,0
4,5,0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,1,0,0,1,0,0,0,0,0,1,0,0


### Dropping the last category for linear models

In [52]:
data = load_titanic()
data = rare_encoder.transform(data)

In [53]:
ohe_enc = ce.OneHotCategoricalEncoder(top_categories = None, variables = ['Pclass', 'Cabin', 'Embarked'], drop_last = True)
ohe_enc.fit(data)

ohe_enc.encoder_dict_

{'Cabin': ['n', 'C', 'E', 'Rare', 'D'],
 'Embarked': ['S', 'C'],
 'Pclass': [3, 1]}

In [54]:
data = ohe_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Pclass_3,Pclass_1,Cabin_n,Cabin_C,Cabin_E,Cabin_Rare,Cabin_D,Embarked_S,Embarked_C
0,1,0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,1,0,1,0,0,0,0,1,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,0,1,0,1,0,0,0,0,1
2,3,1,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,1,0,1,0,0,0,0,1,0
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,0,1,0,1,0,0,0,1,0
4,5,0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,1,0,1,0,0,0,0,1,0


### Selecting top_categories to encode

In [55]:
data = load_titanic()
data = rare_encoder.transform(data)

In [56]:
ohe_enc = ce.OneHotCategoricalEncoder(top_categories = 2, variables = ['Pclass', 'Cabin', 'Embarked'], drop_last = False)
ohe_enc.fit(data)

ohe_enc.encoder_dict_

{'Cabin': ['n', 'C'], 'Embarked': ['S', 'C'], 'Pclass': [3, 1]}

In [57]:
data = ohe_enc.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Pclass_3,Pclass_1,Cabin_n,Cabin_C,Embarked_S,Embarked_C
0,1,0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,1,0,1,0,1,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,0,1,0,1,0,1
2,3,1,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,1,0,1,0,1,0
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,0,1,0,1,1,0
4,5,0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,1,0,1,0,1,0


## RareLabelCategoricalEncoder

The RareLabelCategoricalEncoder groups labels that show a small number of observations in the dataset into a new category called 'Rare'. This helps to avoid overfitting.

The argument tol indicates the fraction of observations that a label needs to have in order not to be re-grouped into the "Rare" label (for example, tol = 0.03 means 3% of the observations). The argument n_categories indicates the minimum number of labels that a variable needs to have for any of its labels to be re-grouped into "Rare". If the number of labels is smaller than n_categories, the encoder will not group the labels for that variable.
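
The interplay of the two arguments can be sketched as follows. `group_rare_labels` is a hypothetical helper written for this illustration. How the exact boundaries are handled (a frequency equal to tol, or a label count equal to n_categories) is my assumption and may differ from the encoder.

```python
import pandas as pd

def group_rare_labels(series, tol=0.03, n_categories=5):
    """Replace labels with a frequency below tol by 'Rare' (illustrative only)."""
    # Too few labels: leave the variable untouched.
    if series.nunique() < n_categories:
        return series.copy()
    freqs = series.value_counts(normalize=True)
    frequent = freqs[freqs >= tol].index
    # Keep the frequent labels and replace everything else with 'Rare'.
    return series.where(series.isin(frequent), "Rare")

# 3 labels only: nothing is grouped, whatever the frequencies.
few = pd.Series(["S", "S", "C", "Q"])
print(group_rare_labels(few).unique())

# 6 labels over 100 observations: 'e' and 'f' (1% each) fall below tol = 0.03.
many = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 10 + ["d"] * 8 + ["e"] + ["f"])
grouped = group_rare_labels(many)
print(grouped.value_counts())
```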

In [58]:
## Load data

data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [59]:
## Rare value encoder

rare_encoder = ce.RareLabelCategoricalEncoder(tol = 0.03, n_categories=5, variables = ['Cabin', 'Pclass', 'Embarked'])
rare_encoder.fit(data)

# the encoder_dict_ contains a dictionary of the variable: frequent labels pair
rare_encoder.encoder_dict_

{'Cabin': Index(['n', 'C', 'B', 'D', 'E'], dtype='object'),
 'Embarked': array(['S', 'C', 'Q'], dtype=object),
 'Pclass': array([3, 1, 2], dtype=object)}

In [60]:
data = rare_encoder.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


### Automatically select all categorical variables

If no variable list is passed as an argument, the encoder selects all the categorical variables.

In [61]:
## Load data

data = load_titanic()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [62]:
## Rare value encoder

rare_encoder = ce.RareLabelCategoricalEncoder(tol = 0.03, n_categories=5)
rare_encoder.fit(data)

# the encoder_dict_ contains a dictionary of the variable: frequent labels pair
rare_encoder.encoder_dict_

{'Cabin': Index(['n', 'C', 'B', 'D', 'E'], dtype='object'),
 'Embarked': array(['S', 'C', 'Q'], dtype=object),
 'Name': Index([], dtype='object'),
 'Pclass': array([3, 1, 2], dtype=object),
 'Sex': array(['male', 'female'], dtype=object),
 'Ticket': Index([], dtype='object')}

In [63]:
data = rare_encoder.transform(data)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,Rare,male,22.0,1,0,Rare,7.25,n,S
1,2,1,1,Rare,female,38.0,1,0,Rare,71.2833,C,C
2,3,1,3,Rare,female,26.0,0,0,Rare,7.925,n,S
3,4,1,1,Rare,female,35.0,1,0,Rare,53.1,C,S
4,5,0,3,Rare,male,35.0,0,0,Rare,8.05,n,S
