## One Hot Encoding

One Hot Encoding transforms categorical data into numerical data by turning each categorical feature into a set of binary features. We create one binary feature per category: its value is 1 when the category is present and 0 otherwise. For any data point, exactly one of these features is hot (1) and the rest are cold (0).

The new features are referred to as dummy features.

One Hot Encoding is used when your data is nominal, that is, when the variable has no natural relationship, ranking or ordering among its categories. For example, suppose we have a "color" categorical variable with three categories: red, green and blue. After One Hot Encoding, each category is represented by a binary vector: red as [1,0,0], green as [0,1,0] and blue as [0,0,1].
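To make the color example concrete, here is a minimal sketch in plain Python. The helper `one_hot` and the list `COLORS` are illustrative names, not part of any library, and the column order (red, green, blue) is an assumption chosen to match the vectors above.

```python
def one_hot(value, categories):
    """Return a one-hot list for `value`, given an ordered list of categories."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# The column order is fixed by this list; it is an arbitrary choice.
COLORS = ["red", "green", "blue"]

encoded = {color: one_hot(color, COLORS) for color in COLORS}
for color, vector in encoded.items():
    print(color, vector)
# red [1, 0, 0]
# green [0, 1, 0]
# blue [0, 0, 1]
```

Each vector contains exactly one 1, which is where the name "one hot" comes from.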

Note that if the number of categories is very large, One Hot Encoding will produce an equally large number of features: one new column per distinct category.
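A quick way to see this growth is to encode a column with many distinct values and look at the resulting shape. The sketch below uses a made-up `zip_code` column with 1,000 distinct values; the data is synthetic and only meant to illustrate the column count.

```python
import pandas as pd

# Hypothetical high-cardinality column: 1,000 distinct ZIP-code-like strings.
zip_codes = pd.Series([f"{i:05d}" for i in range(1000)], name="zip_code")

encoded = pd.get_dummies(zip_codes, dtype=float)

# One column is created per distinct category.
print(zip_codes.nunique())
print(encoded.shape)
```

Checking `nunique()` on a column before encoding it is a cheap way to estimate how many features the encoding will add.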

You can use pandas' `get_dummies()` function or sklearn's `OneHotEncoder` for implementing One Hot Encoding.

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
# Create a sample DataFrame
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6, 7],
    "city": ["New York", "London", "Tokyo", "New York", "Paris", "Tokyo", "Paris"],
    "age": [30, 25, 40, 22, 35, 42, 51],
    "income": [50000, 40000, 75000, 35000, 60000, 90000, 65000]
})


df

Unnamed: 0,customer_id,city,age,income
0,1,New York,30,50000
1,2,London,25,40000
2,3,Tokyo,40,75000
3,4,New York,22,35000
4,5,Paris,35,60000
5,6,Tokyo,42,90000
6,7,Paris,51,65000


What if we want to apply One Hot Encoding on the city variable here?

In [97]:
# One Hot encoding using pandas' get_dummies function... really simple!
pd.get_dummies(df, columns=['city'], dtype=float)

Unnamed: 0,customer_id,age,income,city_London,city_New York,city_Paris,city_Tokyo
0,1,30,50000,0.0,1.0,0.0,0.0
1,2,25,40000,1.0,0.0,0.0,0.0
2,3,40,75000,0.0,0.0,0.0,1.0
3,4,22,35000,0.0,1.0,0.0,0.0
4,5,35,60000,0.0,0.0,1.0,0.0
5,6,42,90000,0.0,0.0,0.0,1.0
6,7,51,65000,0.0,0.0,1.0,0.0


In [98]:
cities = df['city'].unique()
cities

array(['New York', 'London', 'Tokyo', 'Paris'], dtype=object)

In [99]:
# One Hot Encoding using sklearn
# create an object of one hot encoder class
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform='pandas')

# handle_unknown='ignore' encodes categories that were not seen during fit as all zeros instead of raising an error
# sparse_output=False returns a dense result, and set_output(transform='pandas') returns it as a pandas DataFrame

In [100]:
# apply the transform to the city column by calling the fit_transform method on the OneHotEncoder object

ohetransform = ohe.fit_transform(df[['city']])

In [101]:
ohetransform

Unnamed: 0,city_London,city_New York,city_Paris,city_Tokyo
0,0.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0
3,0.0,1.0,0.0,0.0
4,0.0,0.0,1.0,0.0
5,0.0,0.0,0.0,1.0
6,0.0,0.0,1.0,0.0


In [102]:
# merge this back into original dataset using pandas concatenation
df = pd.concat([df, ohetransform], axis=1)
df

Unnamed: 0,customer_id,city,age,income,city_London,city_New York,city_Paris,city_Tokyo
0,1,New York,30,50000,0.0,1.0,0.0,0.0
1,2,London,25,40000,1.0,0.0,0.0,0.0
2,3,Tokyo,40,75000,0.0,0.0,0.0,1.0
3,4,New York,22,35000,0.0,1.0,0.0,0.0
4,5,Paris,35,60000,0.0,0.0,1.0,0.0
5,6,Tokyo,42,90000,0.0,0.0,0.0,1.0
6,7,Paris,51,65000,0.0,0.0,1.0,0.0


In [103]:
# the city column can be dropped now that we've one-hot encoded it
# drop returns a new DataFrame, so assign the result to keep the change
df = df.drop(columns=['city'])
df

Unnamed: 0,customer_id,age,income,city_London,city_New York,city_Paris,city_Tokyo
0,1,30,50000,0.0,1.0,0.0,0.0
1,2,25,40000,1.0,0.0,0.0,0.0
2,3,40,75000,0.0,0.0,0.0,1.0
3,4,22,35000,0.0,1.0,0.0,0.0
4,5,35,60000,0.0,0.0,1.0,0.0
5,6,42,90000,0.0,0.0,0.0,1.0
6,7,51,65000,0.0,0.0,1.0,0.0
