# One Hot Encoding

## Categorical Encoding
* Sklearn One Hot Encoding
* Dummy Trap
* Pandas get_dummies
* LabelBinarizer
* Weight of Evidence
* Frequency Encoding

### Categorical Data
* Nominal (Cat or Dog)
* Ordinal (Grades)
* One-hot encoding works better when a category has a limited number of labels
* Features with many labels need further engineering (for example, frequency encoding)

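Nominal labels have no order, so each label gets its own indicator column; ordinal labels carry an order that an integer mapping can preserve. A minimal sketch, assuming a hypothetical `grade` column with the order F < D < C < B < A (the data and column names are illustrative, not from the notebook):

```python
import pandas as pd

# hypothetical data: 'pet' is nominal, 'grade' is ordinal
demo = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'cat'],
                     'grade': ['B', 'A', 'F', 'C']})

# ordinal: map labels to integers that respect the assumed order F < D < C < B < A
grade_order = {'F': 0, 'D': 1, 'C': 2, 'B': 3, 'A': 4}
demo['grade_encoded'] = demo['grade'].map(grade_order)

# nominal: one indicator column per label, no order implied
pet_dummies = pd.get_dummies(demo['pet'], prefix='pet')

print(demo)
print(pet_dummies)
```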
### Multicollinearity
* Predictors need to be independent of each other; one predictor should not be predictable from the others
* https://www.theanalysisfactor.com/multicollinearity-explained-visually/ 
* https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
* Cats_and_Dogs = [Cat, Dog, Dog, Cat, Cat, Dog]
* Cats = [1, 0, 0, 1, 1, 0]
* Dogs = [0, 1, 1, 0, 0, 1]
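The Cats and Dogs columns above can be checked directly. This short sketch uses only NumPy and the two lists from the bullets; it shows that the columns always sum to 1 and are perfectly negatively correlated, so one of them carries no extra information:

```python
import numpy as np

# the two dummy columns from the list above
cats = np.array([1, 0, 0, 1, 1, 0])
dogs = np.array([0, 1, 1, 0, 0, 1])

# every row has exactly one 1, so the columns always sum to 1
print(cats + dogs)

# Dogs is fully determined by Cats (Dogs = 1 - Cats), so the correlation is -1
corr = np.corrcoef(cats, dogs)[0, 1]
print(corr)
```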

### Mismatch in Training and Test

* Some labels in the train set don't show up in the test set (and vice versa), so encoding the two sets separately can produce different columns

https://towardsdatascience.com/beware-of-the-dummy-variable-trap-in-pandas-727e8e6b8bde
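One way to keep train and test columns identical is to fit the encoder on the train set only and let it ignore labels it has not seen. A minimal sketch, assuming a hypothetical split in which `bird` appears only in the test set; the encoder's default sparse output is converted with `toarray()` so the snippet does not depend on the `sparse`/`sparse_output` argument:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical split: 'turtle' only appears in train, 'bird' only in test
train = pd.DataFrame({'Pets': ['dog', 'cat', 'turtle', 'cat']})
test = pd.DataFrame({'Pets': ['cat', 'bird']})

# handle_unknown='ignore' encodes labels unseen during fit as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# the default output is a sparse matrix; toarray() makes it dense for printing
test_encoded = enc.transform(test).toarray()
print(enc.categories_)
print(test_encoded)
```

The test set keeps the three train columns, and the unseen `bird` row is all zeros.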

In [1]:
# sklearn OneHotEncoder
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
# https://stackoverflow.com/questions/50473381/scikit-learns-labelbinarizer-vs-onehotencoder
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer

pets = ['dog', 'cat', 'cat', 'dog', 'turtle', 'cat', 'cat', 'turtle', 'dog', 'cat']
print('cat = 0; dog = 1; turtle = 2')
le = LabelEncoder()
int_values = le.fit_transform(pets)
print('Pets:', pets)
print('Label Encoder:', int_values)
int_values = int_values.reshape(len(int_values), 1)
print(pd.Series(pets))

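# the sparse argument was renamed to sparse_output in scikit-learn 1.2
ohe = OneHotEncoder(sparse_output=False)
ohe_values = ohe.fit_transform(int_values)  # keep the fitted encoder in ohe
print('One Hot Encoder:\n', ohe_values)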

lb = LabelBinarizer()
print('Label Binarizer:\n', lb.fit_transform(int_values))

cat = 0; dog = 1; turtle = 2
Pets: ['dog', 'cat', 'cat', 'dog', 'turtle', 'cat', 'cat', 'turtle', 'dog', 'cat']
Label Encoder: [1 0 0 1 2 0 0 2 1 0]
0       dog
1       cat
2       cat
3       dog
4    turtle
5       cat
6       cat
7    turtle
8       dog
9       cat
dtype: object
One Hot Encoder:
 [[0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
Label Binarizer:
 [[0 1 0]
 [1 0 0]
 [1 0 0]
 [0 1 0]
 [0 0 1]
 [1 0 0]
 [1 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]]


In [2]:
pets = pd.DataFrame(pd.Series(pets), columns=['Pets'])
pets.head()

| | Pets |
|---|---|
| 0 | dog |
| 1 | cat |
| 2 | cat |
| 3 | dog |
| 4 | turtle |


In [3]:
ohe = OneHotEncoder(sparse_output=False)
ohe_pets = ohe.fit_transform(pets)
pets_df = pd.DataFrame(ohe_pets, columns=ohe.get_feature_names_out(['Pets']))
pets_df

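| | Pets_cat | Pets_dog | Pets_turtle |
|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 |
| 6 | 1.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 1.0 |
| 8 | 0.0 | 1.0 | 0.0 |
| 9 | 1.0 | 0.0 | 0.0 |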


### Dummy Trap

The Dummy Variable Trap occurs when one-hot encoding creates a dummy variable for every label of a feature. Each row has exactly one dummy set to 1, so the dummies always sum to 1 and any one of them can be computed exactly from the others (perfect multicollinearity). A regression model with an intercept then has no unique solution for the dummy coefficients, so the individual effect of each dummy on the prediction cannot be interpreted. Dropping one dummy per feature, for example with `drop='first'`, removes the redundancy.

https://www.learndatasci.com/glossary/dummy-variable-trap/
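The redundancy can be made concrete by looking at the rank of the design matrix. The sketch below reuses the pets list from earlier (renamed `pet_labels` here) and adds an intercept column of ones, which is an assumption about how the regression would be set up:

```python
import numpy as np
import pandas as pd

pet_labels = ['dog', 'cat', 'cat', 'dog', 'turtle', 'cat', 'cat', 'turtle', 'dog', 'cat']

# all three dummies plus an intercept column of ones
full = pd.get_dummies(pd.Series(pet_labels)).to_numpy(dtype=float)
X_full = np.column_stack([np.ones(len(pet_labels)), full])

# drop the first dummy (cat), keep the intercept
X_drop = np.column_stack([np.ones(len(pet_labels)), full[:, 1:]])

# 4 columns but rank 3: one column is a linear combination of the others
rank_full = np.linalg.matrix_rank(X_full)
# 3 columns and rank 3: the redundancy is gone
rank_drop = np.linalg.matrix_rank(X_drop)
print(X_full.shape[1], rank_full)
print(X_drop.shape[1], rank_drop)
```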

In [4]:
pets_df.corr()

| | Pets_cat | Pets_dog | Pets_turtle |
|---|---|---|---|
| Pets_cat | 1.0 | -0.654654 | -0.5 |
| Pets_dog | -0.654654 | 1.0 | -0.327327 |
| Pets_turtle | -0.5 | -0.327327 | 1.0 |


In [5]:
ohe = OneHotEncoder(drop='first', sparse_output=False)
ohe_pets = ohe.fit_transform(pets)
pets_df = pd.DataFrame(ohe_pets, columns=ohe.get_feature_names_out(['Pets']))
pets_df

| | Pets_dog | Pets_turtle |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 |
| 5 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 |
| 7 | 0.0 | 1.0 |
| 8 | 1.0 | 0.0 |
| 9 | 0.0 | 0.0 |


In [6]:
pets_df.corr()

| | Pets_dog | Pets_turtle |
|---|---|---|
| Pets_dog | 1.0 | -0.327327 |
| Pets_turtle | -0.327327 | 1.0 |


## Example 1

In [7]:
# one hot encoding
import seaborn as sns

df = sns.load_dataset('penguins')
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


In [8]:
df.isnull().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

In [9]:
df.dropna(inplace=True)
df.isnull().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

In [10]:
df['species'].value_counts()

Adelie       146
Gentoo       119
Chinstrap     68
Name: species, dtype: int64

In [11]:
df['island'].value_counts()

Biscoe       163
Dream        123
Torgersen     47
Name: island, dtype: int64

In [12]:
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


## Dependent Variable = species

In [13]:
# train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop(['species'], axis=1), 
                                                    df['species'], 
                                                    test_size=.2, 
                                                    random_state=42)

print(X_train.shape)
print(X_test.shape)

(266, 6)
(67, 6)


## One Hot Encoding For Features

### For features with more than 2 unique labels

In [14]:
# use sklearn one hot encoder
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categories='auto', drop='first', sparse_output=False, handle_unknown='ignore')

cat_features = ['island']
ohe_train = ohe.fit_transform(X_train[cat_features])
ohe_train = pd.DataFrame(ohe_train, columns=ohe.get_feature_names_out(cat_features))
ohe_train.index = X_train.index
X_train = ohe_train.join(X_train)
X_train.drop(cat_features, axis=1, inplace=True)

ohe_test = ohe.transform(X_test[cat_features])
ohe_test = pd.DataFrame(ohe_test, columns=ohe.get_feature_names_out(cat_features))
ohe_test.index = X_test.index
X_test = ohe_test.join(X_test)
X_test.drop(cat_features, axis=1, inplace=True)

In [15]:
X_train.sample(10)

| | island_Dream | island_Torgersen | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 227 | 0.0 | 0.0 | 46.7 | 15.3 | 219.0 | 5200.0 | Male |
| 125 | 0.0 | 1.0 | 40.6 | 19.0 | 199.0 | 4000.0 | Male |
| 166 | 1.0 | 0.0 | 45.9 | 17.1 | 190.0 | 3575.0 | Female |
| 262 | 0.0 | 0.0 | 45.3 | 13.7 | 210.0 | 4300.0 | Female |
| 214 | 1.0 | 0.0 | 45.7 | 17.0 | 195.0 | 3650.0 | Female |
| 151 | 1.0 | 0.0 | 41.5 | 18.5 | 201.0 | 4000.0 | Male |
| 135 | 1.0 | 0.0 | 41.1 | 17.5 | 190.0 | 3900.0 | Male |
| 259 | 0.0 | 0.0 | 48.7 | 15.7 | 208.0 | 5350.0 | Male |
| 228 | 0.0 | 0.0 | 43.3 | 13.4 | 209.0 | 4400.0 | Female |
| 302 | 0.0 | 0.0 | 47.4 | 14.6 | 212.0 | 4725.0 | Female |


In [16]:
import numpy as np

print(np.sort(df['island'].unique()))

['Biscoe' 'Dream' 'Torgersen']


## Bi Label Mapping for Features

### For features with only two unique labels

In [17]:
X_train['sex'] = X_train['sex'].map({'Female': 0, 'Male': 1})
X_test['sex'] = X_test['sex'].map({'Female': 0, 'Male': 1})

## Map Dependent Variable to Number

In [18]:
import numpy as np

print(np.sort(y_train.unique()))

['Adelie' 'Chinstrap' 'Gentoo']


In [19]:
y_train = y_train.map({'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2})
y_test = y_test.map({'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2})
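`LabelEncoder` assigns integers in sorted label order, which matches the hand-written dictionary above, and it can also translate predictions back to species names. A minimal sketch, assuming small hypothetical stand-ins for `y_train` and `y_test`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical stand-ins for y_train / y_test
y_train_demo = pd.Series(['Gentoo', 'Adelie', 'Gentoo', 'Chinstrap'])
y_test_demo = pd.Series(['Chinstrap', 'Adelie'])

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train_demo)  # fit on train only
y_test_enc = le.transform(y_test_demo)        # reuse the same mapping

print(le.classes_)
print(y_train_enc, y_test_enc)

# map encoded values back to species names
decoded = le.inverse_transform(y_test_enc)
print(decoded)
```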

In [20]:
X_train.head()

| | island_Dream | island_Torgersen | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 230 | 0.0 | 0.0 | 40.9 | 13.7 | 214.0 | 4650.0 | 0 |
| 84 | 1.0 | 0.0 | 37.3 | 17.8 | 191.0 | 3350.0 | 0 |
| 303 | 0.0 | 0.0 | 50.0 | 15.9 | 224.0 | 5350.0 | 1 |
| 22 | 0.0 | 0.0 | 35.9 | 19.2 | 189.0 | 3800.0 | 0 |
| 29 | 0.0 | 0.0 | 40.5 | 18.9 | 180.0 | 3950.0 | 1 |


In [21]:
y_train.head()

230    2
84     0
303    2
22     0
29     0
Name: species, dtype: int64

## Example 2

In [22]:
# get data
import pandas as pd

houses = pd.read_csv('https://raw.githubusercontent.com/gitmystuff/Datasets/main/house-prices.csv') 
print(houses.shape)
print(houses.head())
print(houses.info())

(1460, 81)
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008  

In [23]:
# train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    houses.drop('SalePrice', axis=1), 
    houses['SalePrice'], 
    test_size=0.25, 
    random_state=42)

### Get Dummies

In [24]:
# using pandas get_dummies
import pandas as pd

# encode train and test separately; a label missing from one set produces no column there
train_dummy = pd.get_dummies(X_train[['GarageType', 'GarageQual']], drop_first=True)
test_dummy = pd.get_dummies(X_test[['GarageType', 'GarageQual']], drop_first=True)
print(train_dummy.shape)
print(test_dummy.shape)

(1095, 9)
(365, 7)
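
The shapes differ because some garage labels in the train set never occur in the test set, so `get_dummies` creates no column for them. One common fix, sketched here on hypothetical toy data rather than the house prices file, is to reindex the test dummies to the train columns:

```python
import pandas as pd

# hypothetical toy split: 'CarPort' is missing from test, '2Types' is missing from train
train = pd.DataFrame({'GarageType': ['Attchd', 'Detchd', 'CarPort', 'Attchd']})
test = pd.DataFrame({'GarageType': ['Detchd', 'Attchd', '2Types']})

train_dummy = pd.get_dummies(train, drop_first=True)
test_dummy = pd.get_dummies(test, drop_first=True)
print(train_dummy.columns.tolist())
print(test_dummy.columns.tolist())

# keep exactly the train columns: missing ones are filled with 0,
# columns that exist only in test are dropped
test_aligned = test_dummy.reindex(columns=train_dummy.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

After alignment, a label that was never seen in training (`2Types` here) is encoded as all zeros, which makes it indistinguishable from the dropped baseline label. Fitting a `OneHotEncoder` on the train set, as in the next section, avoids the separate-encoding problem altogether.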


### One Hot Encoder

In [25]:
# using one hot encoder
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop='first', sparse_output=False)

ohe_train = ohe.fit_transform(X_train[['GarageType', 'GarageQual']].dropna())
ohe_train = pd.DataFrame(ohe_train, columns=ohe.get_feature_names_out(['GarageType', 'GarageQual']))
print(ohe_train.shape)
print(ohe_train.head())

(1037, 9)
   GarageType_Attchd  GarageType_Basment  GarageType_BuiltIn  \
0                1.0                 0.0                 0.0   
1                1.0                 0.0                 0.0   
2                0.0                 0.0                 0.0   
3                1.0                 0.0                 0.0   
4                1.0                 0.0                 0.0   

   GarageType_CarPort  GarageType_Detchd  GarageQual_Fa  GarageQual_Gd  \
0                 0.0                0.0            0.0            0.0   
1                 0.0                0.0            0.0            0.0   
2                 0.0                1.0            0.0            0.0   
3                 0.0                0.0            0.0            0.0   
4                 0.0                0.0            0.0            0.0   

   GarageQual_Po  GarageQual_TA  
0            0.0            1.0  
1            0.0            1.0  
2            0.0            1.0  
3            0.0        

In [26]:
# ohe is already trained
ohe_test = ohe.transform(X_test[['GarageType', 'GarageQual']].dropna())
ohe_test = pd.DataFrame(ohe_test, columns=ohe_train.columns)
print(ohe_test.shape)
print(ohe_test.head())

(342, 9)
   GarageType_Attchd  GarageType_Basment  GarageType_BuiltIn  \
0                1.0                 0.0                 0.0   
1                1.0                 0.0                 0.0   
2                0.0                 0.0                 0.0   
3                0.0                 0.0                 0.0   
4                1.0                 0.0                 0.0   

   GarageType_CarPort  GarageType_Detchd  GarageQual_Fa  GarageQual_Gd  \
0                 0.0                0.0            0.0            0.0   
1                 0.0                0.0            0.0            0.0   
2                 0.0                1.0            0.0            0.0   
3                 0.0                1.0            0.0            0.0   
4                 0.0                0.0            0.0            0.0   

   GarageQual_Po  GarageQual_TA  
0            0.0            1.0  
1            0.0            1.0  
2            0.0            1.0  
3            0.0         

### One Hot Encoding Alternatives

For features with many labels

* https://medium.com/analytics-vidhya/stop-one-hot-encoding-your-categorical-variables-bbb0fba89809
* https://medium.com/swlh/stop-one-hot-encoding-your-categorical-features-avoid-curse-of-dimensionality-16743c32cea4
* https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02 (frequency and mean encoding)
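
Mean encoding, mentioned in the last link, replaces each label with the average target value for that label. A minimal sketch on hypothetical toy data (the column names echo the house prices set, but the values are made up); falling back to the overall train mean for unseen labels is one possible choice, not the only one:

```python
import pandas as pd

# hypothetical toy data
train = pd.DataFrame({'Neighborhood': ['A', 'A', 'B', 'B', 'C'],
                      'SalePrice': [100, 200, 300, 500, 250]})
test = pd.DataFrame({'Neighborhood': ['A', 'C', 'D']})

# learn the mean target per label from the train set only
means = train.groupby('Neighborhood')['SalePrice'].mean()
global_mean = train['SalePrice'].mean()

train['Neighborhood_mean'] = train['Neighborhood'].map(means)
# labels unseen in training ('D') fall back to the overall train mean
test['Neighborhood_mean'] = test['Neighborhood'].map(means).fillna(global_mean)
print(test)
```

Because the encoding uses the target, it should only ever be learned from training rows.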

In [27]:
# review features with multiple labels
for val in X_train.columns.sort_values():
    print(val, len(X_train[val].dropna().unique()))

1stFlrSF 631
2ndFlrSF 345
3SsnPorch 16
Alley 2
BedroomAbvGr 8
BldgType 5
BsmtCond 4
BsmtExposure 4
BsmtFinSF1 531
BsmtFinSF2 109
BsmtFinType1 6
BsmtFinType2 6
BsmtFullBath 4
BsmtHalfBath 3
BsmtQual 4
BsmtUnfSF 655
CentralAir 2
Condition1 9
Condition2 6
Electrical 4
EnclosedPorch 94
ExterCond 5
ExterQual 4
Exterior1st 15
Exterior2nd 16
Fence 4
FireplaceQu 5
Fireplaces 4
Foundation 6
FullBath 4
Functional 7
GarageArea 383
GarageCars 5
GarageCond 5
GarageFinish 3
GarageQual 5
GarageType 6
GarageYrBlt 94
GrLivArea 707
HalfBath 3
Heating 6
HeatingQC 5
HouseStyle 8
Id 1095
KitchenAbvGr 4
KitchenQual 4
LandContour 4
LandSlope 3
LotArea 845
LotConfig 5
LotFrontage 105
LotShape 4
LowQualFinSF 19
MSSubClass 15
MSZoning 5
MasVnrArea 278
MasVnrType 4
MiscFeature 4
MiscVal 18
MoSold 12
Neighborhood 25
OpenPorchSF 184
OverallCond 9
OverallQual 10
PavedDrive 3
PoolArea 7
PoolQC 3
RoofMatl 7
RoofStyle 6
SaleCondition 6
SaleType 9
ScreenPorch 64
Street 2
TotRmsAbvGrd 12
TotalBsmtSF 603
Utilities 2
Wood

In [28]:
# find features with more than five labels and use frequency encoding
freq_feats = []

for val in X_train.columns.sort_values():
    if len(X_train[val].dropna().unique()) > 5:
        freq_feats.append(val)

print(freq_feats)

# whatever you do for X_train, do for X_test:
# learn the frequencies from X_train only and reuse them for X_test
for feat in freq_feats:
    freq = X_train.groupby(feat).size()/len(X_train)
    X_train[feat] = X_train[feat].map(freq)
    # labels that never occur in X_train map to NaN in X_test
    X_test[feat] = X_test[feat].map(freq)

['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'BedroomAbvGr', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFinType1', 'BsmtFinType2', 'BsmtUnfSF', 'Condition1', 'Condition2', 'EnclosedPorch', 'Exterior1st', 'Exterior2nd', 'Foundation', 'Functional', 'GarageArea', 'GarageType', 'GarageYrBlt', 'GrLivArea', 'Heating', 'HouseStyle', 'Id', 'LotArea', 'LotFrontage', 'LowQualFinSF', 'MSSubClass', 'MasVnrArea', 'MiscVal', 'MoSold', 'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'PoolArea', 'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'ScreenPorch', 'TotRmsAbvGrd', 'TotalBsmtSF', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd']


In [29]:
# check for bi-label features
bi_labels = []
cutoff = 0.98  # skip features where one label covers 98% or more of the rows

for feat in X_train.columns.sort_values():
    if len(X_train[feat].unique()) == 2:
        val_cnts = X_train[feat].value_counts()
        perc = val_cnts.max()/len(X_train[feat])
        if perc < cutoff:
            print(val_cnts, perc)
            bi_labels.append(feat)

print(bi_labels)

Y    1017
N      78
Name: CentralAir, dtype: int64 0.9287671232876712
['CentralAir']


In [30]:
# bi-label mapping
# whatever you do for X_train, do for X_test
for feat in bi_labels:
    # value_counts() sorts by frequency, so label0 is the most common label
    label0, label1 = X_train[feat].value_counts().index.tolist()[:2]
    mapping = {label0: 0, label1: 1}
    X_train[feat] = X_train[feat].map(mapping)
    X_test[feat] = X_test[feat].map(mapping)

### Encoding Order

* Bilabel Mapping (2 labels)
* Frequency Encoding (more than 5 labels)
* One Hot Encoding (3 to 5 labels)
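
The thresholds in this list can be turned into a small helper that proposes an encoding per column. This is a sketch under stated assumptions: `plan_encoding` and the toy frame are hypothetical, and the helper only looks at non-numeric columns, which is stricter than the notebook cells above (those also frequency-encode numeric columns).

```python
import pandas as pd

def plan_encoding(df, max_ohe_labels=5):
    """Propose an encoding for each non-numeric column based on its label count."""
    plan = {}
    for col in df.select_dtypes(exclude='number').columns:
        n_labels = df[col].nunique()  # NaN is not counted
        if n_labels < 2:
            continue  # constant column, nothing to encode
        if n_labels == 2:
            plan[col] = 'bilabel'
        elif n_labels <= max_ohe_labels:
            plan[col] = 'one_hot'
        else:
            plan[col] = 'frequency'
    return plan

# hypothetical toy frame
demo = pd.DataFrame({
    'CentralAir': ['Y', 'N', 'Y', 'Y', 'N', 'Y'],
    'GarageFinish': ['Fin', 'RFn', 'Unf', 'Fin', 'Unf', 'RFn'],
    'Neighborhood': ['a', 'b', 'c', 'd', 'e', 'f'],
    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115],
})
plan = plan_encoding(demo)
print(plan)
```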