![Three approaches to encoding](https://developer-blogs.nvidia.com/wp-content/uploads/2022/02/ThreeApproachestoEncoding_Featured-Image.jpg)


## **1. Introduction to Encoding**

### **What is Encoding?**
Encoding is the process of converting categorical data (text or labels) into numerical format so that machine learning algorithms can process it. Most algorithms work with numerical data, so encoding is a crucial step in data preprocessing.

---

### **Why Encoding?**
- Most machine learning algorithms cannot process categorical data directly.
- Encoding transforms categorical data into a numerical format that algorithms can work with.
- The choice of encoding determines what information is kept: ordinal encoding preserves the order between categories, while one-hot encoding creates one binary column per category and implies no order.

---

### **Types of Encoding**
| **Encoding Type**       | **Description**                                                                 | **When to Use**                                                                 |
|--------------------------|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
| **Label Encoding**       | Converts each category into a unique integer.                                   | When encoding target labels, or ordinal categories whose assigned integers match their order. |
| **One-Hot Encoding**     | Creates binary columns for each category.                                       | When categories are nominal (no ordinal relationship).                          |
| **Ordinal Encoding**     | Converts categories into integers based on their order.                         | When categories have a clear order (e.g., small, medium, large).                |
| **Frequency Encoding**   | Replaces categories with their frequency in the dataset.                        | When the frequency of categories is meaningful.                                 |
| **Target Encoding**      | Replaces categories with the mean of the target variable for that category.     | When the target variable is correlated with the categories.                     |

---



## **2. Types of Encoding (Theoretical Explanation)**

### **1. Label Encoding**
- **What?** Converts each category into a unique integer.
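- **Why?** Useful for target labels. For ordinal categories, it is only appropriate when the assigned integers match the intended order.
- **Example Situation**: Encoding education levels. scikit-learn's `LabelEncoder` assigns integers in sorted order, giving "Bachelor’s" → 0, "High School" → 1, "Master’s" → 2, which is not the educational order; use ordinal encoding when the order must be preserved.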
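
The mapping that label encoding produces can be checked directly. The following is a minimal sketch using scikit-learn's `LabelEncoder`; the `levels` list and the `label_mapping` name are made up for illustration and are not part of any dataset used in this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical education levels, used only for illustration
levels = ["High School", "Bachelor's", "Master's", "Bachelor's"]

encoder = LabelEncoder()
codes = encoder.fit_transform(levels)

# classes_ holds the categories in sorted order; each code is an index into it
label_mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(label_mapping)
print(codes.tolist())
```

The integers follow the sorted order of the labels, not the order in which they appear or any real-world ranking.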

---

### **2. One-Hot Encoding**
- **What?** Creates binary columns for each category.
- **Why?** Useful for nominal categories with no ordinal relationship.
- **Example Situation**: Encoding colors (e.g., "Red", "Green", "Blue") into separate binary columns.
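
As a quick illustration of the color example, here is a small sketch using `pandas.get_dummies`. The `colors` frame is a made-up toy column, and `get_dummies` is used here only as a convenient alternative to scikit-learn's `OneHotEncoder`.

```python
import pandas as pd

# Hypothetical color column, used only for illustration
colors = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(colors["color"], prefix="color", dtype=int)

print(one_hot)
```

Each row has exactly one column set to 1, so no order between the colors is implied.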

---

### **3. Ordinal Encoding**
- **What?** Converts categories into integers based on their order.
- **Why?** Useful when categories have a clear order.
- **Example Situation**: Encoding sizes (e.g., "Small" → 0, "Medium" → 1, "Large" → 2).
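
For text categories, the order usually has to be supplied explicitly. The sketch below assumes scikit-learn's `OrdinalEncoder` with its `categories` argument; the `sizes` frame is a toy example invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical size column, used only for illustration
sizes = pd.DataFrame({"size": ["Medium", "Small", "Large", "Small"]})

# Pass the order explicitly; without it the encoder sorts the categories
# alphabetically (Large, Medium, Small), which is not the intended ranking
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
sizes["size_encoded"] = encoder.fit_transform(sizes[["size"]]).ravel()

print(sizes)
```

With the explicit list, "Small" maps to 0.0, "Medium" to 1.0, and "Large" to 2.0.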

---

### **4. Frequency Encoding**
- **What?** Replaces categories with their frequency in the dataset.
- **Why?** Useful when the frequency of categories is meaningful.
- **Example Situation**: Encoding cities based on how often they appear in the dataset.
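
A minimal sketch of the city example, using plain pandas. The city names and the `cities` frame are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical city column, used only for illustration
cities = pd.DataFrame({"city": ["Lahore", "Karachi", "Lahore", "Multan", "Lahore"]})

# Relative frequency of each city (share of rows in which it appears)
frequency = cities["city"].value_counts(normalize=True)

# Replace each city with its relative frequency
cities["city_frequency"] = cities["city"].map(frequency)

print(cities)
```

One caveat this toy data makes visible: categories that occur equally often (here "Karachi" and "Multan") receive the same encoded value and can no longer be told apart.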

---

### **5. Target Encoding**
- **What?** Replaces categories with the mean of the target variable for that category.
- **Why?** Useful in predictive models to capture relationships between categorical features and the target variable.
- **Example Situation**: Encoding product categories based on the average sales revenue associated with each category.

---







## **3. Practical Examples**





### **Dataset 1: Iris Dataset (Label Encoding)**

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

# Apply Label Encoding
encoder = LabelEncoder()
df['species_encoded'] = encoder.fit_transform(df['species'])

print(df[['species', 'species_encoded']].head())


  species  species_encoded
0  setosa                0
1  setosa                0
2  setosa                0
3  setosa                0
4  setosa                0


In [2]:
df.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'species', 'species_encoded'],
      dtype='object')

In [3]:
df['species'].value_counts()

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

In [4]:
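df.sample(10)

|    | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species    | species_encoded |
|----|-------------------|------------------|-------------------|------------------|------------|-----------------|
| 40 | 5.0               | 3.5              | 1.3               | 0.3              | setosa     | 0               |
| 73 | 6.1               | 2.8              | 4.7               | 1.2              | versicolor | 1               |
| 97 | 6.2               | 2.9              | 4.3               | 1.3              | versicolor | 1               |
| 82 | 5.8               | 2.7              | 3.9               | 1.2              | versicolor | 1               |
| 9  | 4.9               | 3.1              | 1.5               | 0.1              | setosa     | 0               |
| 2  | 4.7               | 3.2              | 1.3               | 0.2              | setosa     | 0               |
| 98 | 5.1               | 2.5              | 3.0               | 1.1              | versicolor | 1               |
| 18 | 5.7               | 3.8              | 1.7               | 0.3              | setosa     | 0               |
| 51 | 6.4               | 3.2              | 4.5               | 1.5              | versicolor | 1               |
| 48 | 5.3               | 3.7              | 1.5               | 0.2              | setosa     | 0               |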



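**Explanation**:  
- The `species` column is encoded into integers: "setosa" → 0, "versicolor" → 1, "virginica" → 2. `LabelEncoder` assigns the integers in sorted (alphabetical) order of the class names.
- Label encoding is acceptable here because `species` is the target label and the integers serve only as class identifiers. The species have no ordinal relationship, so the integers should not be read as a ranking; for a nominal input feature, one-hot encoding is usually the safer choice.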
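
Because the integers are only identifiers, it is often useful to map them back to the original names, for example when reporting predictions. A short sketch, assuming the same Iris data and a fitted `LabelEncoder`; the variable names `species`, `encoded`, and `decoded` are chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
species = iris.target_names[iris.target]

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)

# classes_ lists the original labels; the position of each label is its code
print(encoder.classes_)

# inverse_transform turns integer codes back into the original labels
decoded = encoder.inverse_transform([0, 1, 2])
print(decoded)
```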

---

### **Dataset 2: Titanic Dataset (One-Hot Encoding)**

In [18]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

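# Apply One-Hot Encoding
# Note: `sparse_output` replaces the older `sparse` argument (scikit-learn 1.2+; `sparse` was removed in 1.4)
encoder = OneHotEncoder(sparse_output=False)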
encoded_data = encoder.fit_transform(df[['Sex']])
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Sex']))

df = pd.concat([df, encoded_df], axis=1)
print(df[['Sex', 'Sex_female', 'Sex_male']].head())


      Sex  Sex_female  Sex_male
0    male         0.0       1.0
1  female         1.0       0.0
2  female         1.0       0.0
3  female         1.0       0.0
4    male         0.0       1.0




In [21]:
df.head()

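|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_female | Sex_male |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|------------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0.0 | 1.0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1.0 | 0.0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1.0 | 0.0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1.0 | 0.0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 0.0 | 1.0 |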


In [20]:
df['Sex'].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

**Explanation**:  
- The `Sex` column is split into two binary columns: `Sex_female` and `Sex_male`.
- Used because gender is a nominal category.

---

### **Dataset 3: Wine Quality Dataset (Ordinal Encoding)**

In [22]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Load the Wine Quality dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, delimiter=';')

# Apply Ordinal Encoding
encoder = OrdinalEncoder()
df['quality_encoded'] = encoder.fit_transform(df[['quality']])

print(df[['quality', 'quality_encoded']].head())


   quality  quality_encoded
0        5              2.0
1        5              2.0
2        5              2.0
3        6              3.0
4        5              2.0



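**Explanation**:  
- `OrdinalEncoder` maps the sorted unique values of `quality` to 0, 1, 2, ... (e.g., 3 → 0, 4 → 1, 5 → 2) and returns floats, which is why the output shows 2.0 rather than 2.
- Used because wine quality has an ordinal relationship. Since `quality` is already numeric, the encoding here only re-indexes the scores; ordinal encoding matters most for text categories, where the order must be supplied explicitly.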

---

### **Dataset 4: Cars Dataset (Frequency Encoding)**

In [10]:

import pandas as pd

# Load the Cars dataset
url = "https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv"
df = pd.read_csv(url)

# Apply Frequency Encoding

frequency_map = df['Type'].value_counts(normalize=True).to_dict()
df['Type_frequency'] = df['Type'].map(frequency_map)

print(df[['Type', 'Type_frequency']].head())


      Type  Type_frequency
0    Small        0.222222
1  Midsize        0.244444
2  Compact        0.177778
3  Midsize        0.244444
4  Midsize        0.244444


In [11]:
df['Type'].value_counts()

Type
Midsize    22
Small      20
Compact    16
Sporty     12
Large      11
Van         9
Name: count, dtype: int64

In [12]:
df['Type_frequency'].value_counts()

Type_frequency
0.244444    22
0.222222    20
0.177778    16
0.133333    12
0.122222    11
0.100000     9
Name: count, dtype: int64


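**Explanation**:  
- The `Type` column is mapped to the relative frequency of each category (e.g., "Midsize" appears in 22 of 90 rows → 0.244444).
- `value_counts` ignores missing values by default, so the frequencies are computed over the 90 rows with a known `Type`; rows where `Type` is missing get `NaN` in `Type_frequency`.
- Used when how common a car type is carries useful signal for the model.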

---

### **Dataset 5: Housing Dataset (Target Encoding)**

**Target Encoding** is a method of encoding categorical variables by replacing each category with a value derived from the target variable, usually the **mean** or **median** of the target variable for that category.

---

### **Why Use Target Encoding?**
1. Converts categorical data into numerical format while retaining information about the target variable.
2. Works well when there is a strong relationship between the categorical feature and the target variable.
3. Useful for high-cardinality categorical features (many unique values).

---

### **How It Works**
1. Group the data by the categorical feature.
2. Compute the target statistic (e.g., mean, median) for each category.
3. Replace each category with the corresponding statistic.
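
The three steps above can be sketched in a few lines of pandas. The `data` frame below uses small toy values assumed for illustration, and the column names `category` and `target` are placeholders.

```python
import pandas as pd

# Toy data, used only for illustration
data = pd.DataFrame({
    "category": ["A", "B", "A", "C", "B", "A"],
    "target": [100, 200, 150, 300, 250, 120],
})

# Steps 1 and 2: group by the categorical feature and compute the mean target
category_means = data.groupby("category")["target"].mean()

# Step 3: replace each category with its mean target value
data["category_encoded"] = data["category"].map(category_means)

print(category_means.round(2))
```

Swapping `.mean()` for `.median()` gives the median-based variant mentioned earlier.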

---

### **Formula**
For a categorical column `X` and target column `y`, the encoded value for a category is:

\[
X_{\text{encoded}} = \frac{\text{Sum of target values for the category}}{\text{Count of target values for the category}}
\]

---

### **Example**

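| Category | Target Value | Encoded Value |  
|----------|--------------|---------------|  
| A        | 100          | 123.33        |  
| B        | 200          | 225.00        |  
| A        | 150          | 123.33        |  
| C        | 300          | 300.00        |  
| B        | 250          | 225.00        |  
| A        | 120          | 123.33        |  

For **Category A**:
- Mean = \((100 + 150 + 120) / 3 = 123.33\)

For **Category B**:
- Mean = \((200 + 250) / 2 = 225.00\)

For **Category C**:
- Mean = \(300 / 1 = 300.00\)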

---

### **Advantages**
- Captures the relationship between the feature and the target variable.
- Produces compact numerical values, unlike one-hot encoding.

### **Disadvantages**
- Prone to **data leakage** if not handled carefully (e.g., using the entire dataset to compute means).
- May introduce noise for small categories or sparse data.
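
One common way to reduce the leakage risk is to compute the category means on the training rows only and then apply them to held-out rows. The sketch below is one possible approach under simple assumptions, not the only one: the toy `data` frame, the fixed row split, and the fallback to the training-set global mean for unseen categories are all choices made here for illustration.

```python
import pandas as pd

# Toy data, used only for illustration
data = pd.DataFrame({
    "category": ["A", "A", "B", "B", "A", "B", "C", "A"],
    "target": [100, 150, 200, 250, 120, 230, 300, 130],
})

# Simple fixed split for the sketch: first six rows train, last two test
train = data.iloc[:6].copy()
test = data.iloc[6:].copy()

# Statistics are computed from the training rows only
global_mean = train["target"].mean()
category_means = train.groupby("category")["target"].mean()

train["category_encoded"] = train["category"].map(category_means)

# Categories not seen in training (here "C") fall back to the global mean
test["category_encoded"] = test["category"].map(category_means).fillna(global_mean)

print(test)
```

The training rows are still encoded using their own target values, so stricter setups compute the means out-of-fold (each row encoded from folds that exclude it) and often shrink the means of small categories toward the global mean.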

---

### **When to Use?**
- When dealing with categorical features in regression or classification tasks.
- When the feature has many unique categories (high cardinality).

---


In [13]:
import pandas as pd

# Load the Housing dataset
url = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv"
df = pd.read_csv(url)

# Apply Target Encoding
target_mean = df.groupby('ocean_proximity')['median_house_value'].mean().to_dict()
df['ocean_proximity_encoded'] = df['ocean_proximity'].map(target_mean)

print(df[['ocean_proximity', 'ocean_proximity_encoded']].head())


  ocean_proximity  ocean_proximity_encoded
0        NEAR BAY             259212.31179
1        NEAR BAY             259212.31179
2        NEAR BAY             259212.31179
3        NEAR BAY             259212.31179
4        NEAR BAY             259212.31179


**Explanation**:  
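- The `ocean_proximity` column is replaced with the mean `median_house_value` for each category.
- Used because the target variable (`median_house_value`) is correlated with `ocean_proximity`.
- The means here are computed on the full dataset, which is fine for illustration. When training a model, compute them on the training split only to avoid the data leakage described above.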

---


## **4. Exercises for you**

1. **Label Encoding**: Use the `load_wine()` dataset and apply Label Encoding to the `target` column.  
2. **One-Hot Encoding**: Use the `Titanic` dataset and apply One-Hot Encoding to the `Embarked` column.  
3. **Ordinal Encoding**: Use the `Cars93` dataset and apply Ordinal Encoding to the `DriveTrain` column.  
4. **Frequency Encoding**: Use the `Iris` dataset and apply Frequency Encoding to the `species` column.  
5. **Target Encoding**: Use the `Housing` dataset and apply Target Encoding to the `ocean_proximity` column.

---

## **5. Key Takeaways**
- Encoding is essential for converting categorical data into numerical format.  
- Choose the encoding technique based on the nature of the data (ordinal, nominal, etc.).  
- Always visualize and understand the data before applying encoding.

----
If you have any questions, you can DM me on [LinkedIn](https://www.linkedin.com/in/ibrahimqasmi313/).