# Encoding Categorical Data

Most machine learning models require numerical input, so categorical data must be encoded as numbers before training. This section covers several ways to do that, from basic techniques such as label encoding and one-hot encoding to more advanced techniques such as target encoding and frequency encoding.

---

## Table of Contents

1. [Understanding Categorical Data](#1-understanding-categorical-data)
2. [Label Encoding](#2-label-encoding)
3. [One-Hot Encoding](#3-one-hot-encoding)
4. [Advanced Encoding Techniques](#4-advanced-encoding-techniques)

---

## 1. Understanding Categorical Data

First, let's identify the categorical variables in the dataset and understand their nature.

In [2]:
import pandas as pd

data = pd.read_csv('categorical_data.csv')

In [3]:
data.head()

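|   | ID | Product_Name | Category | Region | Price | Sales | Discount_Percentage | Customer_Rating |
|---|----|--------------|----------|--------|-------|-------|---------------------|-----------------|
| 0 | 1 | Product_1 | Furniture | South | 870.28 | 6451 | 13.86 | 4.5 |
| 1 | 2 | Product_2 | Books | South | 525.13 | 7745 | 39.55 | 3.6 |
| 2 | 3 | Product_3 | Furniture | West | 809.6 | 8295 | 43.45 | 2.6 |
| 3 | 4 | Product_4 | Books | South | 953.33 | 7416 | 49.26 | 1.9 |
| 4 | 5 | Product_5 | Food | North | 363.77 | 9391 | 19.13 | 2.0 |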


In [4]:
categorical_cols = data.select_dtypes(include=['object', 'category']).columns
print("Categorical columns in the dataset:", categorical_cols.tolist())

Categorical columns in the dataset: ['Product_Name', 'Category', 'Region']
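
Part of understanding the nature of these columns is knowing how many distinct values each one holds, since that often guides the choice of encoding. The following is a minimal sketch on a hypothetical five-row miniature of the dataset (the `sample` frame and its values are assumptions for illustration, not the real file):

```python
import pandas as pd

# Hypothetical miniature of the dataset; values are illustrative only.
sample = pd.DataFrame({
    "Product_Name": ["Product_1", "Product_2", "Product_3", "Product_4", "Product_5"],
    "Category": ["Furniture", "Books", "Furniture", "Books", "Food"],
    "Region": ["South", "South", "West", "South", "North"],
    "Price": [870.28, 525.13, 809.60, 953.33, 363.77],
})

# Select the non-numeric columns.
cat_cols = sample.select_dtypes(exclude="number").columns

# Count distinct values per categorical column, highest first.
cardinality = sample[cat_cols].nunique().sort_values(ascending=False)
print(cardinality)
```

A column such as `Product_Name`, where nearly every row has its own value, is usually a poor candidate for one-hot encoding because it would add one column per distinct value.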


## 2. Label Encoding

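Label encoding assigns each category a unique integer. It suits ordinal data, where the categories have a natural order. Note that `LabelEncoder` numbers the classes in sorted (here alphabetical) order, so the integers below do not reflect any meaningful ranking of `Category`. scikit-learn documents `LabelEncoder` as intended for target labels and provides `OrdinalEncoder` for input features.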

In [5]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
data["Category_Label"] = label_encoder.fit_transform(data["Category"])

print("Encoded 'Category' column:\n", data[['Category', 'Category_Label']].head())

Encoded 'Category' column:
     Category  Category_Label
0  Furniture               4
1      Books               0
2  Furniture               4
3      Books               0
4       Food               3
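
When a feature really is ordinal, the integer order should follow the category order rather than the alphabet. The sketch below contrasts the two on a hypothetical `Size` column (not part of the dataset above), assuming scikit-learn's `OrdinalEncoder` with an explicit `categories` list:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature; "Size" is not a column in the dataset above.
sizes = pd.DataFrame({"Size": ["Medium", "Small", "Large", "Small"]})

# LabelEncoder sorts classes alphabetically: Large=0, Medium=1, Small=2.
alphabetical = LabelEncoder().fit_transform(sizes["Size"])

# OrdinalEncoder accepts an explicit order: Small=0, Medium=1, Large=2.
ordinal_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
sizes["Size_Ordinal"] = (
    ordinal_encoder.fit_transform(sizes[["Size"]]).ravel().astype(int)
)

print(alphabetical.tolist())
print(sizes)
```

With the explicit order, larger integers correspond to larger sizes, which is the property a model can actually exploit.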


## 3. One-Hot Encoding

One-hot encoding is commonly used for nominal categorical variables, which have no inherent order. It creates one binary column per category, set to true for the rows that belong to that category.

In [7]:
data_onehot = pd.get_dummies(data, columns=["Category"], prefix="One_Hot")

data_onehot.head()

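|   | ID | Product_Name | Region | Price | Sales | Discount_Percentage | Customer_Rating | Category_Label | One_Hot_Books | One_Hot_Clothing | One_Hot_Electronics | One_Hot_Food | One_Hot_Furniture |
|---|----|--------------|--------|-------|-------|---------------------|-----------------|----------------|---------------|------------------|---------------------|--------------|-------------------|
| 0 | 1 | Product_1 | South | 870.28 | 6451 | 13.86 | 4.5 | 4 | False | False | False | False | True |
| 1 | 2 | Product_2 | South | 525.13 | 7745 | 39.55 | 3.6 | 0 | True | False | False | False | False |
| 2 | 3 | Product_3 | West | 809.6 | 8295 | 43.45 | 2.6 | 4 | False | False | False | False | True |
| 3 | 4 | Product_4 | South | 953.33 | 7416 | 49.26 | 1.9 | 0 | True | False | False | False | False |
| 4 | 5 | Product_5 | North | 363.77 | 9391 | 19.13 | 2.0 | 3 | False | False | False | True | False |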


## 4. Advanced Encoding Techniques

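Target encoding replaces each category with the mean of the target variable over the rows in that category. It is useful for high-cardinality features because it adds a single numeric column instead of one column per category. The example below treats `Price` as the target. In practice, the means should be computed from the training data only, otherwise target information leaks into the features.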

In [11]:
# Target-encode 'Category' with the per-category mean of 'Price'
data['Category_Target'] = data.groupby('Category')['Price'].transform('mean')

# Display the target-encoded column
print("Target-encoded 'Category' column:\n", data[['Category', 'Category_Target']].head())

Target-encoded 'Category' column:
     Category  Category_Target
0  Furniture       520.646750
1      Books       490.265000
2  Furniture       520.646750
3      Books       490.265000
4       Food       463.772353
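
The per-category means above are computed over the full dataset. A minimal sketch of the leakage-safe variant follows, using hypothetical `train` and `test` frames with made-up prices: the means are learned from the training rows only, and categories unseen in training fall back to the overall training mean (that fallback is an assumption for this example, not the only option).

```python
import pandas as pd

# Hypothetical split; prices are made up for illustration.
train = pd.DataFrame({
    "Category": ["Furniture", "Books", "Furniture", "Books", "Food"],
    "Price": [800.0, 500.0, 600.0, 400.0, 300.0],
})
test = pd.DataFrame({"Category": ["Books", "Furniture", "Toys"]})

# Learn the encoding from the training data only.
category_means = train.groupby("Category")["Price"].mean()
global_mean = train["Price"].mean()

# Apply it to the test data; unseen categories get the global training mean.
test["Category_Target"] = test["Category"].map(category_means).fillna(global_mean)
print(test)
```

Here `Toys` never appears in the training data, so it receives the overall training mean rather than a missing value.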


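Frequency encoding replaces each category with the number of times it occurs in the dataset. It is simple and preserves information about the category distribution, but categories with equal counts become indistinguishable: in the output below, Furniture and Books both map to 40.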

In [12]:
# Frequency encoding the 'Category' column
data['Category_Frequency'] = data.groupby('Category')['Category'].transform('count')

# Display the frequency-encoded column
print("Frequency-encoded 'Category' column:\n", data[['Category', 'Category_Frequency']].head())


Frequency-encoded 'Category' column:
     Category  Category_Frequency
0  Furniture                  40
1      Books                  40
2  Furniture                  40
3      Books                  40
4       Food                  34
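
Raw counts depend on the size of the dataset. A common variant, sketched here on a hypothetical five-value series, uses relative frequencies instead, so the encoded values stay comparable across datasets of different sizes:

```python
import pandas as pd

# Hypothetical category values; illustrative only.
categories = pd.Series(
    ["Furniture", "Books", "Furniture", "Books", "Food"], name="Category"
)

# Share of rows belonging to each category (the shares sum to 1).
relative_freq = categories.value_counts(normalize=True)

# Replace each category with its share.
encoded = categories.map(relative_freq)
print(encoded)
```

As with raw counts, categories that occur equally often (here Furniture and Books) receive the same encoded value.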
