# One Hot Encoding

In machine learning, one-hot encoding is a common way to handle categorical data. Many machine learning models require numeric inputs, so categorical variables must be converted to numbers during pre-processing.

https://en.wikipedia.org/wiki/One-hot#Machine_learning_and_statistics

In [None]:
# one hot encoding
import seaborn as sns

df = sns.load_dataset('penguins')
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [None]:
df.isnull().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

In [None]:
df.dropna(inplace=True)
df.isnull().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

In [None]:
df['species'].value_counts()

Adelie       146
Gentoo       119
Chinstrap     68
Name: species, dtype: int64

In [None]:
df['island'].value_counts()

Biscoe       163
Dream        123
Torgersen     47
Name: island, dtype: int64

In [None]:
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male


In [None]:
# train test split, dependent Variable = species
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop(['species'], axis=1),
                                                    df['species'],
                                                    test_size=.2,
                                                    random_state=42)

print(X_train.shape)
print(X_test.shape)

(266, 6)
(67, 6)


## Categorical Encoding
* Sklearn One Hot Encoding
* Dummy Trap
* Pandas get_dummies
* Label Encoder and Label Binarizer
* Weight of Evidence
* Frequency Encoding

### Categorical Data
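* Nominal: labels with no inherent order (Cat or Dog)
* Ordinal: labels with a natural order (Grades)
* One-hot encoding works better when a category has a limited number of labels
* Features with many labels need additional feature engineering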
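
For an ordinal feature such as grades, the order can be preserved by mapping each label to its rank instead of one-hot encoding it. A minimal sketch with pandas; the grade values and their order are made up for illustration:

```python
import pandas as pd

grades = pd.Series(['B', 'A', 'C', 'A'])
# Hypothetical order from lowest to highest grade
grade_order = ['C', 'B', 'A']

# An ordered Categorical assigns each label the position of its category
ordered = pd.Categorical(grades, categories=grade_order, ordered=True)
codes = pd.Series(ordered.codes, index=grades.index)
print(codes.tolist())
```

Nominal features such as Cat or Dog have no such order, which is why they are one-hot encoded instead.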

### Multicollinearity
* Predictors need to be independent of each other; a full set of dummy columns is not, because each column can be derived from the others
* https://www.theanalysisfactor.com/multicollinearity-explained-visually/
* https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
* Cats_and_Dogs = [Cat, Dog, Dog, Cat, Cat, Dog]
* Cats = [1, 0, 0, 1, 1, 0]
* Dogs = [0, 1, 1, 0, 0, 1]
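
The Cats and Dogs columns above are exact complements of each other, which a quick check makes concrete. A minimal sketch with NumPy; the variable names are illustrative:

```python
import numpy as np

pets = np.array(['Cat', 'Dog', 'Dog', 'Cat', 'Cat', 'Dog'])
cats = (pets == 'Cat').astype(int)
dogs = (pets == 'Dog').astype(int)

# Every row has exactly one of the two flags set, so the columns sum to 1
row_sums = cats + dogs
# One column fully determines the other (dogs == 1 - cats),
# so their correlation is -1 up to floating point error
corr = np.corrcoef(cats, dogs)[0, 1]
print(cats, dogs, row_sums, corr)
```

Because either column carries all of the information, keeping both adds nothing for the model.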



### Get Dummies

**Mismatch in Training and Test**

* Some labels in the train set don't show up in the test set, so `pd.get_dummies` produces different columns for each split

https://towardsdatascience.com/beware-of-the-dummy-variable-trap-in-pandas-727e8e6b8bde

In [None]:
import pandas as pd

df = pd.DataFrame({
    'Gender' : ['Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male','Male', 'Female','Male', 'Female'],
    'Age' : [41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 29],
    'EducationField': ['Life Sciences', 'Engineering', 'Life Sciences', 'Life Sciences', 'Medical', 'Life Sciences', 'Life Sciences', 'Life Sciences', 'Engineering', 'Medical', 'Life Sciences', 'Life Sciences'],
    'MonthlyIncome': [5993, 5130, 2090, 2909, 3468, 3068, 2670, 2693, 9526, 5237, 2426, 4193]
})

df.head()

Unnamed: 0,Gender,Age,EducationField,MonthlyIncome
0,Female,41,Life Sciences,5993
1,Male,49,Engineering,5130
2,Male,37,Life Sciences,2090
3,Male,33,Life Sciences,2909
4,Male,27,Medical,3468


In [None]:
pd.get_dummies(df)

Unnamed: 0,Age,MonthlyIncome,Gender_Female,Gender_Male,EducationField_Engineering,EducationField_Life Sciences,EducationField_Medical
0,41,5993,True,False,False,True,False
1,49,5130,False,True,True,False,False
2,37,2090,False,True,False,True,False
3,33,2909,False,True,False,True,False
4,27,3468,False,True,False,False,True
5,32,3068,True,False,False,True,False
6,59,2670,False,True,False,True,False
7,30,2693,False,True,False,True,False
8,38,9526,False,True,True,False,False
9,36,5237,True,False,False,False,True


### Multicollinearity

https://www.theanalysisfactor.com/multicollinearity-explained-visually

If the predictors are strongly correlated, the model cannot tell how much each one contributes to the target. In that case, the coefficients of a regression model no longer convey the effect of an individual variable. Passing `drop_first=True` to `pd.get_dummies` removes one dummy column per feature, so the remaining columns are no longer perfectly collinear.

In [None]:
pd.get_dummies(df, drop_first=True)

Unnamed: 0,Age,MonthlyIncome,Gender_Male,EducationField_Life Sciences,EducationField_Medical
0,41,5993,False,True,False
1,49,5130,True,False,False
2,37,2090,True,True,False
3,33,2909,True,True,False
4,27,3468,True,False,True
5,32,3068,False,True,False
6,59,2670,True,True,False
7,30,2693,True,True,False
8,38,9526,True,False,False
9,36,5237,False,False,True


In [None]:
from sklearn.model_selection import train_test_split

# Use separate names for the toy split so the penguins
# X_train / X_test / y_train / y_test created earlier are not overwritten
X = df.drop('MonthlyIncome', axis=1)
y = df['MonthlyIncome']
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(X, y, test_size=0.2, random_state=1)

pd.get_dummies(X_toy_train).head(3)

Unnamed: 0,Age,Gender_Female,Gender_Male,EducationField_Engineering,EducationField_Life Sciences,EducationField_Medical
10,35,False,True,False,True,False
1,49,False,True,True,False,False
6,59,False,True,False,True,False


In [None]:
pd.get_dummies(X_toy_test)

Unnamed: 0,Age,Gender_Male,EducationField_Life Sciences,EducationField_Medical
2,37,True,True,False
3,33,True,True,False
4,27,True,False,True
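
One way to repair the mismatch is to reindex the test dummies to the training columns. A minimal sketch, assuming small made-up `train` and `test` frames rather than the split above:

```python
import pandas as pd

train = pd.DataFrame({'Gender': ['Female', 'Male', 'Male'],
                      'EducationField': ['Engineering', 'Life Sciences', 'Medical']})
test = pd.DataFrame({'Gender': ['Male', 'Male'],
                     'EducationField': ['Life Sciences', 'Medical']})

train_dummies = pd.get_dummies(train)
# Reindex the test dummies to the training columns: labels absent from the
# test set become all-zero columns, and labels unseen in training are dropped
test_dummies = pd.get_dummies(test).reindex(columns=train_dummies.columns, fill_value=0)

print(test_dummies.columns.tolist())
```

A fitted `OneHotEncoder` avoids the problem altogether, because it remembers the training categories and applies them to any later data.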


### Label Encoder

In [None]:
# LabelEncoder vs. OneHotEncoder vs. LabelBinarizer
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
# https://stackoverflow.com/questions/50473381/scikit-learns-labelbinarizer-vs-onehotencoder
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer

pets = ['dog', 'cat', 'cat', 'dog', 'turtle', 'cat', 'cat', 'turtle', 'dog', 'cat']
print('cat = 0; dog = 1; turtle = 2')
le = LabelEncoder() # typically reserved for the dependent variable
int_values = le.fit_transform(pets)  # labels are numbered in sorted order
print('Pets:', pets)
print('Label Encoder:', int_values)
int_values = int_values.reshape(-1, 1)  # OneHotEncoder expects a 2D array
print(pd.Series(pets))

# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False)
ohe_values = ohe.fit_transform(int_values)  # keep the fitted encoder in `ohe`
print('One Hot Encoder:\n', ohe_values)

lb = LabelBinarizer()
print('Label Binarizer:\n', lb.fit_transform(int_values))

cat = 0; dog = 1; turtle = 2
Pets: ['dog', 'cat', 'cat', 'dog', 'turtle', 'cat', 'cat', 'turtle', 'dog', 'cat']
Label Encoder: [1 0 0 1 2 0 0 2 1 0]
0       dog
1       cat
2       cat
3       dog
4    turtle
5       cat
6       cat
7    turtle
8       dog
9       cat
dtype: object
One Hot Encoder:
 [[0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
Label Binarizer:
 [[0 1 0]
 [1 0 0]
 [1 0 0]
 [0 1 0]
 [0 0 1]
 [1 0 0]
 [1 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]]


In [None]:
pets = pd.DataFrame(pd.Series(pets), columns=['Pets'])
pets.head()

Unnamed: 0,Pets
0,dog
1,cat
2,cat
3,dog
4,turtle


In [None]:
ohe = OneHotEncoder(sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
ohe_pets = ohe.fit_transform(pets)
pets_df = pd.DataFrame(ohe_pets, columns=ohe.get_feature_names_out(['Pets']))
pets_df

Unnamed: 0,Pets_cat,Pets_dog,Pets_turtle
0,0.0,1.0,0.0
1,1.0,0.0,0.0
2,1.0,0.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0
6,1.0,0.0,0.0
7,0.0,0.0,1.0
8,0.0,1.0,0.0
9,1.0,0.0,0.0


### Dummy Trap

The dummy variable trap occurs when the dummy variables created by one-hot encoding are multicollinear: every row has exactly one dummy set to 1, so any one dummy can be predicted perfectly from the others. This makes the estimated coefficients of a regression model hard to interpret, because the individual effect of each dummy variable on the prediction cannot be separated from the others. Dropping one dummy column per feature avoids the trap.

https://www.learndatasci.com/glossary/dummy-variable-trap/
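
The trap can also be seen as a loss of rank in the design matrix of a regression with an intercept. A minimal sketch with NumPy, using the first five pets from the example above; the matrix names are illustrative:

```python
import numpy as np

# Full one-hot encoding of ['dog', 'cat', 'cat', 'dog', 'turtle']
# (columns: cat, dog, turtle)
dummies = np.array([[0, 1, 0],
                    [1, 0, 0],
                    [1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1]])
intercept = np.ones((len(dummies), 1))

full = np.hstack([intercept, dummies])            # intercept + all 3 dummies
reduced = np.hstack([intercept, dummies[:, 1:]])  # intercept + dummies with 'cat' dropped

# The dummy columns sum to the intercept column, so `full` has 4 columns but rank 3
print(np.linalg.matrix_rank(full), full.shape[1])
# After dropping one dummy, the number of columns matches the rank
print(np.linalg.matrix_rank(reduced), reduced.shape[1])
```

A rank-deficient design matrix has no unique least-squares solution, which is why the coefficients cannot be interpreted individually.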

In [None]:
pets_df.corr()

Unnamed: 0,Pets_cat,Pets_dog,Pets_turtle
Pets_cat,1.0,-0.654654,-0.5
Pets_dog,-0.654654,1.0,-0.327327
Pets_turtle,-0.5,-0.327327,1.0


In [None]:
ohe = OneHotEncoder(drop='first', sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
ohe_pets = ohe.fit_transform(pets)
pets_df = pd.DataFrame(ohe_pets, columns=ohe.get_feature_names_out(['Pets']))
pets_df

Unnamed: 0,Pets_dog,Pets_turtle
0,1.0,0.0
1,0.0,0.0
2,0.0,0.0
3,1.0,0.0
4,0.0,1.0
5,0.0,0.0
6,0.0,0.0
7,0.0,1.0
8,1.0,0.0
9,0.0,0.0


In [None]:
pets_df.corr()

Unnamed: 0,Pets_dog,Pets_turtle
Pets_dog,1.0,-0.327327
Pets_turtle,-0.327327,1.0


## One Hot Encoding

### For features with 3 - 5 unique labels

In [None]:
# use sklearn one hot encoder
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# `sparse` was renamed to `sparse_output` in scikit-learn 1.2
ohe = OneHotEncoder(categories='auto', drop='first', sparse_output=False, handle_unknown='ignore')

cat_features = ['island']
ohe_train = ohe.fit_transform(X_train[cat_features])
ohe_train = pd.DataFrame(ohe_train, columns=ohe.get_feature_names_out(cat_features))
ohe_train.index = X_train.index
X_train = ohe_train.join(X_train)
X_train.drop(cat_features, axis=1, inplace=True)

ohe_test = ohe.transform(X_test[cat_features])
ohe_test = pd.DataFrame(ohe_test, columns=ohe.get_feature_names_out(cat_features))
ohe_test.index = X_test.index
X_test = ohe_test.join(X_test)
X_test.drop(cat_features, axis=1, inplace=True)

In [None]:
X_train.sample(10)

Unnamed: 0,island_Dream,island_Torgersen,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
53,0.0,0.0,42.0,19.5,200.0,4050.0,Male
130,0.0,1.0,38.5,17.9,190.0,3325.0,Female
215,1.0,0.0,55.8,19.8,207.0,4000.0,Male
219,1.0,0.0,50.2,18.7,198.0,3775.0,Female
315,0.0,0.0,50.8,15.7,226.0,5200.0,Male
145,1.0,0.0,39.0,18.7,185.0,3650.0,Male
155,1.0,0.0,45.4,18.7,188.0,3525.0,Female
333,0.0,0.0,51.5,16.3,230.0,5500.0,Male
168,1.0,0.0,50.3,20.0,197.0,3300.0,Male
209,1.0,0.0,49.3,19.9,203.0,4050.0,Male


In [None]:
# `df` now holds the toy DataFrame from the Get Dummies section,
# so read the island labels from the fitted encoder instead
print(ohe.categories_[0])

['Biscoe' 'Dream' 'Torgersen']


## Binary Label Mapping for Features

### For features with two unique labels

In [None]:
X_train['sex'] = X_train['sex'].map({'Female': 0, 'Male': 1})
X_test['sex'] = X_test['sex'].map({'Female': 0, 'Male': 1})
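
A dictionary passed to `map` silently turns any value that is not a key into NaN, so it is worth checking the result. A minimal sketch on a made-up Series with one misspelled label:

```python
import pandas as pd

sex = pd.Series(['Male', 'Female', 'female'])  # the last value is a deliberate typo
mapped = sex.map({'Female': 0, 'Male': 1})

# Values missing from the dictionary become NaN
n_unmapped = mapped.isna().sum()
print(n_unmapped)
```

In the penguins data the rows with a missing `sex` were dropped earlier, so only the two expected labels should remain.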

## Alternative Encodings

### Features with more than 5 labels

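For features with many labels, one-hot encoding adds one column per label and quickly inflates the number of features. The articles below discuss alternatives: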

* https://medium.com/analytics-vidhya/stop-one-hot-encoding-your-categorical-variables-bbb0fba89809
* https://medium.com/swlh/stop-one-hot-encoding-your-categorical-features-avoid-curse-of-dimensionality-16743c32cea4
* https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02 (frequency and mean encoding)

### Frequency Encoding

In [None]:
# identify features with more than 5 labels and use frequency encoding
freq_feats = []

for feat in freq_feats:
    # learn label frequencies from the training set only, then reuse them for the test set
    freq = X_train[feat].value_counts(normalize=True)
    X_train[feat] = X_train[feat].map(freq)
    # labels unseen in training have no frequency; fill them with 0
    X_test[feat] = X_test[feat].map(freq).fillna(0)
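
Since `freq_feats` is empty here, the loop does nothing on the penguins data. A small worked example shows the idea, assuming frequencies are learned on the training data and reused for the test data; the `city` column and its values are made up:

```python
import pandas as pd

train = pd.DataFrame({'city': ['Oslo', 'Oslo', 'Lima', 'Oslo', 'Pune']})
test = pd.DataFrame({'city': ['Lima', 'Oslo', 'Cairo']})  # 'Cairo' never appears in train

# Relative frequency of each label, learned from the training set only
freq = train['city'].value_counts(normalize=True)

train['city_freq'] = train['city'].map(freq)
# Labels unseen in training have no frequency; 0 is one possible fallback
test['city_freq'] = test['city'].map(freq).fillna(0)

print(test)
```

Each label is replaced by a single number, so the feature stays one column wide no matter how many labels it has.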

## Map Dependent Variable to Number

* It is acceptable to use Label Encoder for the dependent variable; the explicit mapping below assigns the same codes

In [None]:
import numpy as np

print(np.sort(y_train.unique()))

['Adelie' 'Chinstrap' 'Gentoo']


In [None]:
y_train = y_train.map({'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2})
y_test = y_test.map({'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2})
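
The same codes can be produced with `LabelEncoder`, which also keeps the mapping so predictions can be turned back into species names. A minimal sketch on a short made-up list of labels:

```python
from sklearn.preprocessing import LabelEncoder

species = ['Gentoo', 'Adelie', 'Gentoo', 'Chinstrap']

le = LabelEncoder()
encoded = le.fit_transform(species)  # classes are numbered in sorted order
print(le.classes_)
print(encoded)

# inverse_transform recovers the original labels from the integer codes
decoded = le.inverse_transform(encoded)
print(decoded)
```

In a train/test setting the encoder would be fitted on `y_train` and only `transform` applied to `y_test`.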

In [None]:
X_train.head()

Unnamed: 0,island_Dream,island_Torgersen,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
230,0.0,0.0,40.9,13.7,214.0,4650.0,0
84,1.0,0.0,37.3,17.8,191.0,3350.0,0
303,0.0,0.0,50.0,15.9,224.0,5350.0,1
22,0.0,0.0,35.9,19.2,189.0,3800.0,0
29,0.0,0.0,40.5,18.9,180.0,3950.0,1


In [None]:
y_train.head()

230    2
84     0
303    2
22     0
29     0
Name: species, dtype: int64