Link to Medium blog post: https://annettechiu.medium.com/what-is-one-hot-encoding-when-doing-data-cleaning-2f28fbd029ac

# What are Integer Encoding and One-Hot Encoding in Feature Engineering?

import pandas as pd

# Creating the DataFrame
df2 = pd.DataFrame([
    ['Green', 'M', 10.1, 1],
    ['Red', 'L', 13.5, 2],
    ['Blue', 'XL', 13.5, 2],
    ['Red', 'L', 10, 1]
])
df2.columns = ['Color', 'Size', 'Price', 'Classlable']

# One-hot Encoding
onehot_encoding = pd.get_dummies(df2['Color'], prefix='Color')

# Dropping the 'Color' column from the original DataFrame
df2 = df2.drop('Color', axis=1)

# Concatenating the one-hot encoded DataFrame with the original DataFrame (minus the 'Color' column)
df2 = pd.concat([onehot_encoding, df2], axis=1)

# Size Mapping
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df2['Size'] = df2['Size'].map(size_mapping)

df2


|   | Color_Blue | Color_Green | Color_Red | Size | Price | Classlable |
|---|---|---|---|---|---|---|
| 0 | False | True | False | 1 | 10.1 | 1 |
| 1 | False | False | True | 2 | 13.5 | 2 |
| 2 | True | False | False | 3 | 13.5 | 2 |
| 3 | False | False | True | 2 | 10.0 | 1 |


In this case, the order of the feature's values is not important. Treating an unordered feature such as color as if it were ordered can confuse the learning algorithm, which may then learn relationships that do not exist in the data. It may be tempting to transform red into 1, blue into 2, and green into 3 to avoid increasing the dimensionality, but that would imply an order among the values in this category, and this specific order would influence the model's decisions. Therefore, we one-hot encode the column and use `pd.concat` to combine the one-hot columns with the rest of the original DataFrame, so that the algorithm can read every feature as a number.
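To make the point about implied order concrete, the sketch below compares the distances between colors under both encodings. The integer mapping `color_to_int` and the helper `pairwise_distances` are hypothetical and only used for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['Green', 'Red', 'Blue'], name='Color')

# Hypothetical integer mapping, for illustration only
color_to_int = {'Red': 1, 'Blue': 2, 'Green': 3}
int_codes = colors.map(color_to_int).to_numpy(dtype=float)

# One-hot encoding of the same three values
onehot = pd.get_dummies(colors, prefix='Color').astype(float)


def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]  # treat a 1-D array as one feature column
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


int_dist = pairwise_distances(int_codes)
onehot_dist = pairwise_distances(onehot.to_numpy())

# With integer codes, Green and Red end up twice as far apart as Red and Blue.
print(int_dist)
# With one-hot codes, every pair of different colors is equally far apart.
print(onehot_dist)
```

A distance-based or linear model would read the unequal integer distances as meaningful, even though no color is "closer" to another in reality.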



## Some categorical data need integer encoding rather than one-hot encoding


We must be careful, because some features in the data frame should not be converted to one-hot format. For these, we use integer encoding to convert the categorical data to numerical data. For example, the values in the Size column have a natural order. Assuming that a bigger size should carry a larger value for the model, we set XL to 3, L to 2, and M to 1: `{'XL': 3, 'L': 2, 'M': 1}`. We then use the `map` function to convert each size to its meaningful number. Together with the one-hot columns combined through `pd.concat`, the data frame is then ready to go.
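The steps described here can be sketched end to end as follows. This is a minimal sketch, assuming the same small DataFrame as the notebook. The check for unmapped sizes, the use of `pd.get_dummies(..., columns=[...])` in place of a manual `pd.concat`, and `inverse_size_mapping` are additions for illustration, not part of the original code.

```python
import pandas as pd

df = pd.DataFrame(
    [['Green', 'M', 10.1, 1],
     ['Red', 'L', 13.5, 2],
     ['Blue', 'XL', 13.5, 2],
     ['Red', 'L', 10, 1]],
    columns=['Color', 'Size', 'Price', 'Classlable'],
)

# Integer (ordinal) encoding: the order M < L < XL is preserved
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
encoded_size = df['Size'].map(size_mapping)

# Series.map yields NaN for values missing from the dict, so fail early
if encoded_size.isna().any():
    unknown = df.loc[encoded_size.isna(), 'Size'].unique()
    raise ValueError(f'Unmapped sizes: {unknown}')
df['Size'] = encoded_size.astype(int)

# One-hot encode the nominal column; with columns=[...] get_dummies drops
# 'Color' and appends the Color_* columns in a single call
df = pd.get_dummies(df, columns=['Color'])

# Hypothetical inverse mapping to recover the original size labels
inverse_size_mapping = {code: label for label, code in size_mapping.items()}
original_sizes = df['Size'].map(inverse_size_mapping)

print(df)
```

Note that with this variant the one-hot columns land at the end of the frame rather than at the front, which does not matter to most learning algorithms.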



![image.png](attachment:image.png)
