# 7. **Feature Encoding**

Feature encoding is the process of transforming `categorical features` into `numeric features`. This is necessary because most machine learning algorithms expect numeric input and cannot work with raw text labels. There are many ways to encode categorical features, and each method has its own advantages and disadvantages. In this notebook, we will explore some of the most popular methods:

- Label encoding
- One-hot encoding
- Ordinal encoding
- Binary encoding
- Frequency/count encoding

In [35]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [36]:
# load the dataset:
df = sns.load_dataset('tips')
df.head()

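| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |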


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


### As we can see, there are 4 categorical columns in the dataset:
1. sex
2. smoker
3. day
4. time

In [38]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [39]:
# Encode the categorical variables using different scikit-learn encoders.
# Import the encoders:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Encode the time variable using LabelEncoder.
# Note: scikit-learn intends LabelEncoder for target labels (y); it is used on a feature here for illustration.
le = LabelEncoder()
df['encoded_time'] = le.fit_transform(df['time'])
df.head()

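| | total_bill | tip | sex | smoker | day | time | size | encoded_time |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 0 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 |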


In [40]:
df['encoded_time'].value_counts()

# So 0 is Dinner and 1 is Lunch (LabelEncoder assigns codes in sorted order of the labels).

encoded_time
0    176
1     68
Name: count, dtype: int64
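
Rather than inferring the mapping from the counts, the fitted encoder can be asked for it directly. The sketch below is a minimal illustration on a small hand-made Series (an assumption, standing in for `df['time']`); it uses the `classes_` attribute and the `inverse_transform` method of `LabelEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df['time'] (illustrative data, not the tips dataset)
time = pd.Series(['Dinner', 'Lunch', 'Dinner', 'Dinner', 'Lunch'])

le = LabelEncoder()
codes = le.fit_transform(time)

# classes_ holds the labels in sorted order; a label's position is its code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted `le` object around is what makes it possible to decode predictions or encode new rows consistently later on.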

In [41]:
# let's see the day column and encode it using OrdinalEncoder:

df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [42]:
# Use OrdinalEncoder with an explicit category order (Thur < Fri < Sat < Sun):
oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
df['encoded_day'] = oe.fit_transform(df[['day']])
df.head()


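| | total_bill | tip | sex | smoker | day | time | size | encoded_time | encoded_day |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 | 3.0 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0 | 3.0 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 0 | 3.0 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0 | 3.0 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 | 3.0 |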


In [43]:
df['encoded_day'].value_counts()

encoded_day
2.0    87
3.0    76
0.0    62
1.0    19
Name: count, dtype: int64
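
The codes above come out as floats (`3.0`, `2.0`, ...) because `OrdinalEncoder` returns `float64` by default. A small sketch of how integer codes could be requested instead, assuming a toy DataFrame in place of the tips data and using the encoder's `dtype` parameter:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for df[['day']] (illustrative data, not the tips dataset)
days = pd.DataFrame({'day': ['Sun', 'Thur', 'Sat', 'Fri', 'Sun']})

# categories fixes the order; dtype=int returns integer codes instead of floats
oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']], dtype=int)

# fit_transform returns a 2D array with one column, so flatten it before assigning
days['encoded_day'] = oe.fit_transform(days[['day']]).ravel()

print(days)
print(oe.categories_)  # the order the codes were assigned in
```

The explicit `categories` list matters: without it the encoder would sort the labels alphabetically, and the codes would no longer follow the order of the week.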

In [44]:

# Initialize the OneHotEncoder
ohe = OneHotEncoder()

# Fit the encoder and transform the data
encoded_sex = ohe.fit_transform(df[['sex']]).toarray()

# Create a DataFrame with the encoded columns
encoded_sex_df = pd.DataFrame(encoded_sex, columns=ohe.get_feature_names_out(['sex']))

# Concatenate the original DataFrame with the new encoded columns
df = pd.concat([df, encoded_sex_df], axis=1)

# Display the resulting DataFrame
print(df)

     total_bill   tip     sex smoker   day    time  size  encoded_time  \
0         16.99  1.01  Female     No   Sun  Dinner     2             0   
1         10.34  1.66    Male     No   Sun  Dinner     3             0   
2         21.01  3.50    Male     No   Sun  Dinner     3             0   
3         23.68  3.31    Male     No   Sun  Dinner     2             0   
4         24.59  3.61  Female     No   Sun  Dinner     4             0   
..          ...   ...     ...    ...   ...     ...   ...           ...   
239       29.03  5.92    Male     No   Sat  Dinner     3             0   
240       27.18  2.00  Female    Yes   Sat  Dinner     2             0   
241       22.67  2.00    Male    Yes   Sat  Dinner     2             0   
242       17.82  1.75    Male     No   Sat  Dinner     2             0   
243       18.78  3.00  Female     No  Thur  Dinner     2             0   

     encoded_day  sex_Female  sex_Male  
0            3.0         1.0       0.0  
1            3.0         0.0 

# Another example of one-hot encoding:


In [45]:
# Load another dataset:
df = sns.load_dataset('titanic')
df.head()

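| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |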


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [47]:
# One-hot encode the embarked column (its 2 missing values get their own embarked_nan column)
titanic = sns.load_dataset('titanic')

onehot_encoder = OneHotEncoder()
embarked_onehot = onehot_encoder.fit_transform(titanic[['embarked']])
embarked_onehot_df = pd.DataFrame(embarked_onehot.toarray(), columns=onehot_encoder.get_feature_names_out(['embarked']))
titanic = pd.concat([titanic.reset_index(drop=True), embarked_onehot_df.reset_index(drop=True)], axis=1)
titanic.head()

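| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | embarked_C | embarked_Q | embarked_S | embarked_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True | 0.0 | 0.0 | 1.0 | 0.0 |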
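
The `embarked_nan` column shows up because `embarked` contains missing values, and the encoder treats them as a category of their own. Whether that is desirable depends on the task. The sketch below shows two possible ways of handling it with pandas, on a toy DataFrame (an assumption, not the titanic data): keeping an explicit indicator with `dummy_na=True`, or filling the gaps with the most frequent value before encoding.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic[['embarked']] with one missing value (illustrative data)
port = pd.DataFrame({'embarked': ['S', 'C', np.nan, 'S', 'Q']})

# Option 1: keep missing values visible as their own indicator column
with_nan = pd.get_dummies(port, columns=['embarked'], dummy_na=True, dtype=int)

# Option 2: fill missing values with the most frequent category, then encode
most_frequent = port['embarked'].mode()[0]
filled = port.fillna({'embarked': most_frequent})
without_nan = pd.get_dummies(filled, columns=['embarked'], dtype=int)

print(with_nan)
print(without_nan)
```

Filling with the mode is only one simple choice; it hides the fact that the value was missing, so the indicator column can be the safer option when missingness itself may be informative.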


In [48]:
# !pip install category_encoders

In [49]:
# load the dataset:
df = sns.load_dataset('tips')
df.head()

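| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |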


In [50]:
from category_encoders import BinaryEncoder

# Initialize the encoder with the column to be encoded
binary_encoder = BinaryEncoder(cols=['day'])

# Fit and transform the DataFrame
df_binary = binary_encoder.fit_transform(df)

# Display the transformed DataFrame
print(df_binary)

     total_bill   tip     sex smoker  day_0  day_1  day_2    time  size
0         16.99  1.01  Female     No      0      0      1  Dinner     2
1         10.34  1.66    Male     No      0      0      1  Dinner     3
2         21.01  3.50    Male     No      0      0      1  Dinner     3
3         23.68  3.31    Male     No      0      0      1  Dinner     2
4         24.59  3.61  Female     No      0      0      1  Dinner     4
..          ...   ...     ...    ...    ...    ...    ...     ...   ...
239       29.03  5.92    Male     No      0      1      0  Dinner     3
240       27.18  2.00  Female    Yes      0      1      0  Dinner     2
241       22.67  2.00    Male    Yes      0      1      0  Dinner     2
242       17.82  1.75    Male     No      0      1      0  Dinner     2
243       18.78  3.00  Female     No      0      1      1  Dinner     2

[244 rows x 9 columns]
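
In the output above, `Sun` becomes `0 0 1`, `Sat` becomes `0 1 0` and `Thur` becomes `0 1 1`: each category gets an integer, and that integer is written out in binary, one column per bit. The following is a hand-written pandas sketch that reproduces this pattern on toy data. It assumes categories are numbered from 1 in order of first appearance, which matches the output shown here but is not taken from the `category_encoders` source.

```python
import pandas as pd

# Toy stand-in for df['day'] (illustrative data, not the tips dataset)
day = pd.Series(['Sun', 'Sun', 'Sat', 'Thur', 'Fri', 'Sat'], name='day')

# Step 1: number the categories 1..k in order of first appearance (assumption)
codes = pd.factorize(day)[0] + 1

# Step 2: number of bits needed to write the largest code in binary
n_bits = int(codes.max()).bit_length()

# Step 3: one column per bit, most significant bit first
binary = pd.DataFrame(
    {f'day_{i}': (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)},
    index=day.index,
)

print(pd.concat([day, binary], axis=1))
```

With only four categories the saving over one-hot encoding is small (3 columns instead of 4), but the number of columns grows with the logarithm of the number of categories, which is why binary encoding is usually considered for high-cardinality features.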


In [51]:
# use pandas for feature encoding

df = sns.load_dataset('tips')
df.head()

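| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |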


In [52]:
# One-hot encode the day column with pandas get_dummies
# (pandas is already imported as pd at the top of the notebook)
df_dummies = pd.get_dummies(df, columns=['day'])
df_dummies.head()

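| | total_bill | tip | sex | smoker | time | size | day_Thur | day_Fri | day_Sat | day_Sun |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Dinner | 2 | False | False | False | True |
| 1 | 10.34 | 1.66 | Male | No | Dinner | 3 | False | False | False | True |
| 2 | 21.01 | 3.5 | Male | No | Dinner | 3 | False | False | False | True |
| 3 | 23.68 | 3.31 | Male | No | Dinner | 2 | False | False | False | True |
| 4 | 24.59 | 3.61 | Female | No | Dinner | 4 | False | False | False | True |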
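
Frequency/count encoding is in the list of methods at the top of this notebook but has no cell of its own. The idea is to replace each category with how often it occurs in the column. Here is a minimal pandas sketch on toy data (an assumption, not the tips dataset), using `value_counts` and `map`:

```python
import pandas as pd

# Toy stand-in for df[['day']] (illustrative data, not the tips dataset)
day = pd.DataFrame({'day': ['Sat', 'Sun', 'Sat', 'Thur', 'Sat', 'Sun', 'Fri', 'Sat']})

# Count encoding: replace each category with its number of occurrences
counts = day['day'].value_counts()
day['day_count'] = day['day'].map(counts)

# Frequency encoding: the same idea, expressed as a share of all rows
freqs = day['day'].value_counts(normalize=True)
day['day_freq'] = day['day'].map(freqs)

print(day)
```

One thing to watch: categories that occur equally often (`Thur` and `Fri` in this toy data) end up with the same encoded value, so the model can no longer tell them apart.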


---
## Now we know the different techniques of feature encoding, including:
1. Label encoding
2. One-hot encoding
3. Ordinal encoding
4. Frequency/count encoding
5. Binary encoding

---

# About Me:

**Muhammd Faizan**

3rd Year BS Computer Science student at University of Agriculture, Faisalabad.\
Contact me for queries/collabs/correction

[Kaggle](https://www.kaggle.com/faizanyousafonly/)\
[Linkedin](https://www.linkedin.com/in/mrfaizanyousaf/)\
[GitHub](https://github.com/faizan-yousaf/)\
[Email] faizan6t45@gmail.com or faizanyousaf815@gmail.com \
[Phone/WhatsApp]() +923065375389