## **Categorical Encoding**

Categorical encoding is an umbrella term for the techniques used to transform the strings or labels of categorical variables into numbers. This notebook covers four of them:

1. One-Hot encoding

2. Ordinal encoding

3. Count and Frequency encoding

4. Target encoding / Mean encoding


*Note: We will be using the Titanic dataset for our analysis.*

In [1]:
import numpy as np 
import pandas as pd 
import os
%matplotlib inline

os.chdir('/Users/appus/Downloads/Codes')
path=os.getcwd()
print(path)

C:\Users\appus\Downloads\Codes


In [2]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
# load the dataset
df_titanic = pd.read_csv('titanic.csv')
df_titanic.head(10)

Unnamed: 0,passenger_id,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,survived
0,1216,3,"Smyth, Miss. Julia",female,,0,0,335432,7.7333,,Q,1
1,699,3,"Cacic, Mr. Luka",male,38.0,0,0,315089,8.6625,,S,0
2,1267,3,"Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...",female,30.0,1,1,345773,24.15,,S,0
3,449,2,"Hocking, Mrs. Elizabeth (Eliza Needs)",female,54.0,1,3,29105,23.0,,S,1
4,576,2,"Veal, Mr. James",male,40.0,0,0,28221,13.0,,S,0
5,1083,3,"Olsen, Mr. Henry Margido",male,28.0,0,0,C 4001,22.525,,S,0
6,898,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,,S,0
7,560,2,"Sinkkonen, Miss. Anna",female,30.0,0,0,250648,13.0,,S,1
8,1079,3,"Ohman, Miss. Velin",female,22.0,0,0,347085,7.775,,S,1
9,908,3,"Jussila, Miss. Mari Aina",female,21.0,1,0,4137,9.825,,S,0


**1. One-Hot Encoding**

One-hot encoding (OHE) creates one binary variable for each distinct category present in a variable. Each binary variable takes the value 1 if the observation belongs to that category and 0 otherwise. OHE is suitable for linear models. However, it expands the feature space dramatically when a categorical variable has high cardinality (many distinct labels) or when there are many categorical variables. In addition, many of the derived dummy variables can be highly correlated.
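To make the feature-space expansion concrete, the sketch below counts the distinct labels per column and compares that with the shape of the encoded frame. The `toy` DataFrame and its values are hypothetical and only stand in for the real data.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration, not the Titanic file).
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "cabin": ["C22", "B57", "G6", "E33"],
})

# Distinct labels per column = dummy columns that OHE would create for it.
cardinality = toy.nunique()
print(cardinality)

# Encoding every column shows the expansion directly:
# 2 original columns become 2 + 4 = 6 dummy columns.
encoded = pd.get_dummies(toy)
print(encoded.shape)
```

Checking `nunique()` before encoding is a cheap way to spot columns that would add too many dummy variables.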

In [4]:
# New Copy
df_1= df_titanic.copy()

In [5]:
df_1.sex.head()

0    female
1      male
2    female
3    female
4      male
Name: sex, dtype: object

In [6]:
# Perform One hot encoding
pd.get_dummies(df_1['sex']).head()

Unnamed: 0,female,male
0,1,0
1,0,1
2,1,0
3,1,0
4,0,1


*To encode a categorical variable with k labels, k-1 dummy variables are enough, because the dropped label is implied whenever all the remaining dummies are 0. We can achieve this as follows:*

In [7]:
# obtaining k-1 labels
pd.get_dummies(df_1['sex'], drop_first=True).head()

Unnamed: 0,male
0,0
1,1
2,0
3,0
4,1


In [8]:
pd.get_dummies(df_1['embarked']).head()

Unnamed: 0,C,Q,S
0,0,1,0
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1


In [9]:
pd.get_dummies(df_1['embarked'], drop_first=True).head()

Unnamed: 0,Q,S
0,1,0
1,0,1
2,0,1
3,0,1
4,0,1
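The cells above encode one column at a time and do not attach the result to the DataFrame. A minimal sketch of encoding several columns in one call, assuming a hypothetical `toy` frame, could look like this. The `columns`, `drop_first` and `dtype` arguments are standard `pd.get_dummies` parameters.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration).
toy = pd.DataFrame({
    "sex": ["female", "male", "female"],
    "embarked": ["Q", "S", "C"],
    "fare": [7.73, 8.66, 24.15],
})

# Encode only the listed columns, keep k-1 dummies for each,
# and leave the numeric column untouched.
encoded = pd.get_dummies(
    toy, columns=["sex", "embarked"], drop_first=True, dtype=int
)
print(encoded)
```

The new columns are named `<column>_<label>`, which keeps the origin of each dummy variable visible.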


**2. Ordinal Encoding**

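When the categorical variable is ordinal, the most straightforward approach is to replace the labels with integers that follow their order. As we don't have any ordinal categorical variable in the data, I will use dummy data to demonstrate the mechanics. Note that the colours below have no natural order, and that `OrdinalEncoder` by default assigns integers according to the sorted (alphabetical) order of the labels, which is why blue, green and red become 0, 1 and 2.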


In [10]:
# example of an ordinal encoding
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# Dummy data
data = asarray([['red'], ['green'], ['blue']])
print(data)

[['red']
 ['green']
 ['blue']]


In [11]:
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

[[2.]
 [1.]
 [0.]]
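For a variable that really is ordered, alphabetical codes would usually be wrong. The sketch below passes the intended order through the `categories` parameter of `OrdinalEncoder`. The `sizes` data and its low/medium/high labels are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal variable (an assumption for illustration).
sizes = np.array([["medium"], ["low"], ["high"], ["low"]], dtype=object)

# One list of labels per feature, given in the order the codes should follow.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
codes = encoder.fit_transform(sizes)
print(codes.ravel())
```

With the order given explicitly, `low` maps to 0, `medium` to 1 and `high` to 2, regardless of alphabetical order.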


**3. Count and Frequency Encoding**

In count encoding we replace each category with the number of observations that show that category in the dataset. Similarly, in frequency encoding we replace each category with the proportion (or percentage) of observations that show it. This approach is heavily used in competitions. Below, each label of the categorical variable is replaced by its count, that is, the number of times the label appears in the dataset.

In [12]:
# New Copy
df_2= df_titanic.copy()

In [13]:
# replace missing values with the label 'Missing'
df_2['cabin'] = df_2['cabin'].fillna('Missing')
df_2

Unnamed: 0,passenger_id,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,survived
0,1216,3,"Smyth, Miss. Julia",female,,0,0,335432,7.7333,Missing,Q,1
1,699,3,"Cacic, Mr. Luka",male,38.0,0,0,315089,8.6625,Missing,S,0
2,1267,3,"Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...",female,30.0,1,1,345773,24.1500,Missing,S,0
3,449,2,"Hocking, Mrs. Elizabeth (Eliza Needs)",female,54.0,1,3,29105,23.0000,Missing,S,1
4,576,2,"Veal, Mr. James",male,40.0,0,0,28221,13.0000,Missing,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
845,158,1,"Hipkins, Mr. William Edward",male,55.0,0,0,680,50.0000,C39,S,0
846,174,1,"Kent, Mr. Edward Austin",male,58.0,0,0,11771,29.7000,B37,C,0
847,467,2,"Kantor, Mrs. Sinai (Miriam Sternin)",female,24.0,1,0,244367,26.0000,Missing,S,1
848,1112,3,"Peacock, Miss. Treasteall",female,3.0,1,1,SOTON/O.Q. 3101315,13.7750,Missing,S,0


In [14]:
df_2['cabin'].value_counts().to_dict()

{'Missing': 659,
 'G6': 4,
 'C22 C26': 4,
 'B57 B59 B63 B66': 4,
 'B96 B98': 4,
 'D': 4,
 'C101': 3,
 'C23 C25 C27': 3,
 'A34': 3,
 'F33': 3,
 'C78': 3,
 'C7': 2,
 'B18': 2,
 'C62 C64': 2,
 'E121': 2,
 'C65': 2,
 'B51 B53 B55': 2,
 'C80': 2,
 'E34': 2,
 'D26': 2,
 'B22': 2,
 'F G73': 2,
 'C55 C57': 2,
 'C31': 2,
 'E33': 2,
 'E101': 2,
 'C6': 2,
 'C86': 2,
 'E44': 2,
 'D20': 2,
 'D36': 2,
 'B35': 2,
 'C83': 2,
 'F2': 2,
 'C32': 2,
 'D17': 2,
 'D15': 2,
 'B69': 2,
 'C54': 2,
 'B49': 2,
 'C124': 2,
 'D21': 2,
 'C104': 1,
 'C91': 1,
 'A20': 1,
 'E10': 1,
 'C2': 1,
 'F38': 1,
 'E17': 1,
 'D28': 1,
 'B39': 1,
 'B77': 1,
 'C97': 1,
 'F G63': 1,
 'B94': 1,
 'C111': 1,
 'C82': 1,
 'C130': 1,
 'C132': 1,
 'T': 1,
 'D43': 1,
 'C49': 1,
 'C39': 1,
 'D56': 1,
 'B5': 1,
 'B20': 1,
 'B30': 1,
 'D35': 1,
 'D47': 1,
 'A7': 1,
 'A23': 1,
 'D11': 1,
 'E24': 1,
 'B50': 1,
 'B38': 1,
 'C126': 1,
 'A6': 1,
 'E63': 1,
 'D40': 1,
 'C106': 1,
 'C46': 1,
 'C93': 1,
 'E39 E41': 1,
 'F E46': 1,
 'E25': 1,
 'A29':

In [15]:
# first we make a dictionary that maps each label to the counts
X_frequency_map = df_2.cabin.value_counts().to_dict()

# now we replace each cabin label with its count using the map
df_2.cabin = df_2.cabin.map(X_frequency_map)
df_2

Unnamed: 0,passenger_id,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,survived
0,1216,3,"Smyth, Miss. Julia",female,,0,0,335432,7.7333,659,Q,1
1,699,3,"Cacic, Mr. Luka",male,38.0,0,0,315089,8.6625,659,S,0
2,1267,3,"Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...",female,30.0,1,1,345773,24.1500,659,S,0
3,449,2,"Hocking, Mrs. Elizabeth (Eliza Needs)",female,54.0,1,3,29105,23.0000,659,S,1
4,576,2,"Veal, Mr. James",male,40.0,0,0,28221,13.0000,659,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
845,158,1,"Hipkins, Mr. William Edward",male,55.0,0,0,680,50.0000,1,S,0
846,174,1,"Kent, Mr. Edward Austin",male,58.0,0,0,11771,29.7000,1,C,0
847,467,2,"Kantor, Mrs. Sinai (Miriam Sternin)",female,24.0,1,0,244367,26.0000,659,S,1
848,1112,3,"Peacock, Miss. Treasteall",female,3.0,1,1,SOTON/O.Q. 3101315,13.7750,659,S,0
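The cells above show count encoding only. Frequency encoding follows the same pattern, with proportions in place of counts. This is a minimal sketch on a hypothetical `cabin` Series, using `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical toy column (an assumption for illustration).
cabin = pd.Series(
    ["Missing", "Missing", "C22", "Missing", "G6", "C22"], name="cabin"
)

# Map each label to the proportion of rows that carry it.
frequency_map = cabin.value_counts(normalize=True).to_dict()

# Replace the labels with their proportions.
cabin_frequency = cabin.map(frequency_map)
print(cabin_frequency)
```

One caveat applies to both variants: labels that occur equally often receive the same code, so the model can no longer tell them apart.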


**4. Target / Mean Encoding**

In target encoding, also called mean encoding, we replace each category of a variable with the mean value of the target over the observations that show that category. This technique has 3 advantages:

1. it does not expand the feature space,

2. it captures some information regarding the target at the time of encoding the category, and

3. it creates a monotonic relationship between the variable and the target.

*Monotonic relationships between variable and target tend to improve linear model performance.*

In [16]:
# New Copy
df_3= df_titanic[['cabin','survived']].copy()

In [17]:
# replace missing values with the label 'Missing'
df_3['cabin'] = df_3['cabin'].fillna('Missing')

In [18]:
# Now we extract the first letter of the cabin
df_3['cabin'] = df_3['cabin'].astype(str).str[0]
df_3.head()

Unnamed: 0,cabin,survived
0,M,1
1,M,0
2,M,0
3,M,1
4,M,0


In [19]:
df_3.groupby(['cabin'])['survived'].mean()

cabin
A    0.583333
B    0.733333
C    0.580645
D    0.709677
E    0.791667
F    0.666667
G    0.500000
M    0.282246
T    0.000000
Name: survived, dtype: float64

In [20]:
#now let's do the same but capturing the result in a dictionary

ordered_labels = df_3.groupby(['cabin'])['survived'].mean().to_dict()
ordered_labels

{'A': 0.5833333333333334,
 'B': 0.7333333333333333,
 'C': 0.5806451612903226,
 'D': 0.7096774193548387,
 'E': 0.7916666666666666,
 'F': 0.6666666666666666,
 'G': 0.5,
 'M': 0.2822458270106222,
 'T': 0.0}

In [21]:
df_3['Cabin_ordered'] = df_3.cabin.map(ordered_labels)
df_3

Unnamed: 0,cabin,survived,Cabin_ordered
0,M,1,0.282246
1,M,0,0.282246
2,M,0,0.282246
3,M,1,0.282246
4,M,0,0.282246
...,...,...,...
845,C,0,0.580645
846,B,0,0.733333
847,M,1,0.282246
848,M,0,0.282246
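The mapping above is learned from, and applied to, the same rows. When a separate test set exists, the means are usually computed on the training rows only, so that no target information leaks from the test set. The sketch below illustrates this under stated assumptions: the `train` and `test` frames are hypothetical, and falling back to the global training mean for unseen labels is one common choice, not the only one.

```python
import pandas as pd

# Hypothetical train/test split (an assumption for illustration).
train = pd.DataFrame({
    "cabin": ["A", "A", "B", "B", "M", "M", "M", "M"],
    "survived": [1, 0, 1, 1, 0, 0, 1, 0],
})
test = pd.DataFrame({"cabin": ["A", "M", "T"]})

# Learn the per-category target mean from the training rows only.
target_means = train.groupby("cabin")["survived"].mean()

# Fallback for labels never seen in training (here 'T').
global_mean = train["survived"].mean()

# Apply the training map to the test set; unseen labels become NaN,
# which is then filled with the global training mean.
test["cabin_encoded"] = test["cabin"].map(target_means).fillna(global_mean)
print(test)
```

Categories with very few observations, such as `T` with a single passenger in the output above, give unstable means, which is another reason to have a fallback value.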
