# Feature Encoding
Feature encoding is the process of converting categorical data into numerical data, and it is an important step in data preprocessing because most machine learning models expect numeric input. In this notebook, I walk through four popular encoding techniques and their implementation in Python:
1. Label Encoding
2. Ordinal Encoding
3. One-Hot Encoding
4. Binary Encoding

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
# Load the data
df=sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [5]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [8]:
# Encode the time column with label encoding.
# LabelEncoder sorts the classes alphabetically, so Dinner -> 0 and Lunch -> 1.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
le = LabelEncoder()
df['time'] = le.fit_transform(df['time'])
df['time'].value_counts()

time
0    176
1     68
Name: count, dtype: int64
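
To check which integer stands for which category, the fitted encoder can be inspected through `classes_` and reversed with `inverse_transform`. The snippet below is a minimal sketch on a small hand-made Series (the sample values are made up for illustration and are not rows from `tips`):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data, for illustration only
time = pd.Series(['Dinner', 'Lunch', 'Dinner', 'Lunch', 'Dinner'])

le = LabelEncoder()
codes = le.fit_transform(time)

# classes_ is sorted; the position of each class is its integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = list(le.inverse_transform(codes))
print(restored)
```

Note that scikit-learn documents `LabelEncoder` as a tool for encoding target labels; for input features, `OrdinalEncoder` is the usual choice.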

In [9]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [10]:
# Ordinal-encode the day column with an explicit order: Thur < Fri < Sat < Sun
oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
df['day'] = oe.fit_transform(df[['day']]).ravel()  # flatten the (n, 1) result to 1-D
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,3.0,0,2
1,10.34,1.66,Male,No,3.0,0,3
2,21.01,3.5,Male,No,3.0,0,3
3,23.68,3.31,Male,No,3.0,0,2
4,24.59,3.61,Female,No,3.0,0,4
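
The position of each label in the `categories` list becomes its code, which is why every `Sun` row shows `3.0`. A minimal sketch on made-up rows (the sample DataFrame is hypothetical) that makes the mapping visible and shows how to reverse it:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data, for illustration only
days = pd.DataFrame({'day': ['Sun', 'Thur', 'Sat', 'Fri', 'Sun']})

# The list position defines the code: Thur=0, Fri=1, Sat=2, Sun=3
oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
encoded = oe.fit_transform(days[['day']])

codes = encoded.ravel().tolist()
print(codes)

# inverse_transform maps the codes back to the original labels
decoded = oe.inverse_transform(encoded).ravel().tolist()
print(decoded)
```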


In [16]:
# One-hot encode the smoker column
ohe = OneHotEncoder()
smoker = ohe.fit_transform(df[['smoker']]).toarray()  # sparse matrix -> dense array
print(smoker.shape[1])  # number of one-hot columns, one per category

2


In [19]:
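# Name the new columns after their categories and reuse df's index so the rows align on concat
smoker = pd.DataFrame(smoker, columns=ohe.get_feature_names_out(['smoker']), index=df.index)
df = pd.concat([df, smoker], axis=1)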
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,smoker_0,smoker_1,smoker_No,smoker_Yes
0,16.99,1.01,Female,No,3.0,0,2,1.0,0.0,,
1,10.34,1.66,Male,No,3.0,0,3,1.0,0.0,,
2,21.01,3.5,Male,No,3.0,0,3,1.0,0.0,,
3,23.68,3.31,Male,No,3.0,0,2,1.0,0.0,,
4,24.59,3.61,Female,No,3.0,0,4,1.0,0.0,,


In [20]:
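# Drop the smoker_0/smoker_1 columns if present; errors='ignore' avoids a KeyError when they are absent
df.drop(columns=['smoker_0', 'smoker_1'], inplace=True, errors='ignore')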

In [21]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,smoker_No,smoker_Yes
0,16.99,1.01,Female,No,3.0,0,2,,
1,10.34,1.66,Male,No,3.0,0,3,,
2,21.01,3.5,Male,No,3.0,0,3,,
3,23.68,3.31,Male,No,3.0,0,2,,
4,24.59,3.61,Female,No,3.0,0,4,,
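
As an alternative to `OneHotEncoder` plus a manual `concat`, pandas can one-hot encode a column in a single step with `pd.get_dummies`. This is a different technique from the cells above, shown here as a sketch on made-up rows (the `df_demo` frame is hypothetical):

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df_demo = pd.DataFrame({
    'tip': [1.01, 1.66, 3.00],
    'smoker': ['No', 'No', 'Yes'],
})

# columns=['smoker'] replaces that column with one indicator column per category;
# dtype=float gives 1.0/0.0 values like OneHotEncoder does
encoded = pd.get_dummies(df_demo, columns=['smoker'], dtype=float)
print(encoded)
```

Because `get_dummies` works on the DataFrame itself, the new columns keep the original index and there is no alignment step to get wrong.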


In [None]:
!pip install category_encoders

In [24]:
from category_encoders import BinaryEncoder
be = BinaryEncoder()
df = sns.load_dataset('tips')  # reload the original, unencoded data
day_transformed = be.fit_transform(df['day'])
print(day_transformed)

     day_0  day_1  day_2
0        0      0      1
1        0      0      1
2        0      0      1
3        0      0      1
4        0      0      1
..     ...    ...    ...
239      0      1      0
240      0      1      0
241      0      1      0
242      0      1      0
243      0      1      1

[244 rows x 3 columns]


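The day column has 4 unique values, but binary encoding needs only 3 columns instead of the 4 that one-hot encoding would create. Each day is first mapped to an integer, and that integer is then written in binary across the columns. In the output above, Sun is encoded as 001, Sat as 010 and Thur as 011; the remaining day, Fri, gets the next code, 100, so the all-zero pattern is not used for any of the four days.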
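
The arithmetic behind the three columns can be sketched in plain Python. This is a toy illustration of the idea, not the `category_encoders` implementation: the helper `binary_encode` is hypothetical, and it assumes that codes start at 1 and are assigned in order of first appearance, which is consistent with the printed output above.

```python
def binary_encode(values):
    """Toy binary encoder: integer codes start at 1, in order of first appearance."""
    ordinals = {}
    for v in values:
        if v not in ordinals:
            ordinals[v] = len(ordinals) + 1
    # Number of bits needed to write the largest integer code
    n_bits = max(ordinals.values()).bit_length()
    # Write each code as a zero-padded binary string, one character per column
    return {v: format(code, f'0{n_bits}b') for v, code in ordinals.items()}

# Days in the order they first appear in the tips data
codes = binary_encode(['Sun', 'Sat', 'Thur', 'Fri'])
print(codes)
```

With 4 categories the largest code is 4, which is `100` in binary, so three columns are enough; one-hot encoding would need four.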