# 🏷️ Handling Categorical Variables: One-Hot vs. Label Encoding

Categorical variables represent **discrete groups or categories** (e.g., color, size, country).  
Since most machine learning algorithms require **numerical inputs**, these variables must be converted into numbers.  
The two most common encoding techniques are **One-Hot Encoding** and **Label Encoding**.

---

## 🔹 1. One-Hot Encoding

**Core Idea:**  
Creates **one new binary (0/1) column** for each unique category in the feature.  
Each row has a `1` in the column for its own category and `0` in all the others.  

### 📌 Example: Color Feature

| Original: Color | Red | Blue | Green |
|------------------|-----|------|-------|
| Red              |  1  |  0   |  0    |
| Blue             |  0  |  1   |  0    |
| Green            |  0  |  0   |  1    |
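
The table above can be reproduced with a short pandas sketch. The toy `df` is an assumption made for illustration; `pd.get_dummies` is the same function used on the Titanic data later in this notebook.

```python
import pandas as pd

# Toy data matching the table above (values are illustrative).
df = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})

# One indicator column per category; dtype=int gives 0/1 instead of True/False.
one_hot = pd.get_dummies(df["Color"], dtype=int)

# get_dummies orders the new columns alphabetically, so reorder to match the table.
one_hot = one_hot[["Red", "Blue", "Green"]]
print(one_hot)
```

Each row contains exactly one `1`, in the column of its own category.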

### ✅ Advantages
- **No False Ordering** → Suitable for **nominal data** (categories with no order, like *Color*, *Country*).  
- **Widely Compatible** → Works well with models like **Linear Regression, Logistic Regression, Neural Networks**.  

### ⚠️ Disadvantages
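- **Dummy Variable Trap** → With `k` categories, the `k` indicator columns always sum to 1, so any one column can be predicted exactly from the others (perfect multicollinearity in linear models with an intercept).  
  - Solution: Use **k-1** columns instead of k; the dropped category becomes the all-zeros baseline.  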
- **Curse of Dimensionality** → For features with many categories (e.g., ZIP codes), it creates very large and sparse datasets.  
  - Alternatives: **Feature Hashing, Target Encoding, Embeddings**.  
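
To make the dummy variable trap concrete, the sketch below compares the full `k`-column encoding with the `k-1` version. The toy `Color` values are an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

full = pd.get_dummies(df["Color"], dtype=int)                      # k = 3 columns
reduced = pd.get_dummies(df["Color"], dtype=int, drop_first=True)  # k - 1 = 2 columns

# With all k columns, every row sums to 1, so any column equals
# 1 minus the sum of the others (perfectly collinear with an intercept).
print(full.sum(axis=1).tolist())

# drop_first removes the first category in sorted order ("Blue" here);
# a Blue row is then represented by all zeros.
print(reduced.columns.tolist())
```

Which category is dropped only changes the baseline that the remaining columns are compared against.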

---

## 🔹 2. Label Encoding

**Core Idea:**  
Assigns each category a unique integer.  

### 📌 Example: Size Feature (Ordinal)

| Original: Size | Label Encoded |
|----------------|----------------|
| Small          | 0              |
| Medium         | 1              |
| Large          | 2              |
| X-Large        | 3              |
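
For ordinal data the integer order has to match the real order, which an automatic encoder does not guarantee. The sketch below uses an explicit mapping (the name `size_order` is hypothetical) and, for comparison, shows the order that scikit-learn's `LabelEncoder` would pick.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sizes = pd.Series(["Small", "Medium", "Large", "X-Large", "Medium"])

# Explicit mapping that preserves the real-world order from the table.
size_order = {"Small": 0, "Medium": 1, "Large": 2, "X-Large": 3}
encoded = sizes.map(size_order)
print(encoded.tolist())

# For comparison: LabelEncoder sorts labels alphabetically, so "Large" becomes 0
# and "Small" becomes 2, which does not match the intended order.
alphabetical = LabelEncoder().fit(sizes).classes_.tolist()
print(alphabetical)
```

An explicit mapping is a reasonable default whenever the order matters; scikit-learn's `OrdinalEncoder` with its `categories` argument is another way to express the same thing.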

### ✅ Advantages
- **Simple & Compact** → Only one column (no increase in dimensionality).  
- **Best for Ordinal Data** → When categories have a natural order (e.g., *Education Level, Size*).  

### ⚠️ Disadvantages
- **False Ordering Problem** → If used on **nominal data** (e.g., *Color*), the model might assume `Green (2) > Blue (1) > Red (0)`, which is meaningless.  
- **Numerical Bias** → Models that compute weighted sums or distances (e.g., linear models, KNN) treat the gaps between the integers as meaningful, as if *Green* were twice *Blue*.  

---

## 🔹 3. Key Differences & When to Use

| Feature          | One-Hot Encoding                                   | Label Encoding                          |
|------------------|----------------------------------------------------|------------------------------------------|
| **Dimensionality** | Expands (k or k-1 new columns)                     | Stays the same (1 column)                |
| **Nature of Data** | Best for **Nominal Data** (no order: *Color, Country*) | Best for **Ordinal Data** (has order: *Size, Education*) |
| **Order Assumption** | No order implied. Categories are equidistant.    | Implies order via integers.              |
| **Model Suitability** | Linear/Logistic Regression, SVM, Neural Nets     | Tree-Based Models (Decision Trees, RF, XGBoost) often handle it well |

---

## 🔑 Summary

- Use **One-Hot Encoding** when:
  - Categories are **nominal** (no order).  
  - Dataset size is manageable (not too many unique categories).  

- Use **Label Encoding** when:
  - Categories are **ordinal** (ordered).  
  - Tree-based models are used, since they can handle encoded integers without assuming linear relationships.  


# Complete Guide: One-Hot Encoding vs. Label Encoding

Categorical variables represent discrete groups or categories (e.g., Color, Country, Size). Since most machine learning models require **numerical input**, these categories must be transformed into numbers. Two of the most common approaches are **Label Encoding** and **One-Hot Encoding**.

---

## 1. Label Encoding (Detailed)

**Definition:**  
Label Encoding assigns a **unique integer** to each category. It does not create new columns; instead, it directly replaces category names with integer values.

**Process:**
- Each category is mapped to a number (0, 1, 2, ...).
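- The mapping depends on the tool: scikit-learn's `LabelEncoder` sorts the categories (alphabetically for strings), while `pandas.factorize` numbers them in order of appearance by default.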

**Key Characteristic:**  
This method **implies an order** because integers have a natural ranking (2 > 1 > 0).  

### Example: Color Feature (⚠️ Cautionary Tale)

| Original: Color | Label Encoded |
|-----------------|---------------|
| Red             | 2             |
| Blue            | 1             |
| Green           | 0             |

🚨 Problem: The model may incorrectly assume **Red > Blue > Green**, which makes no sense for colors.
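
A minimal sketch of how such codes are produced with scikit-learn's `LabelEncoder`. Because it sorts the classes, it assigns Blue=0, Green=1, Red=2 rather than the illustrative numbers in the table; either way, the integers carry no real meaning for colors.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Blue", "Green", "Blue"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the sorted categories; the position of each is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform recovers the original labels from the codes.
print(le.inverse_transform(codes).tolist())
```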

---

### ✅ When to Use Label Encoding:
- **Ordinal Features:**  
  Best for data with a **natural order**.  
  Example: Size → [Small, Medium, Large] → [0, 1, 2].
- **Tree-Based Models:**  
  Decision Trees, Random Forests, Gradient Boosting (XGBoost, LightGBM) handle label-encoded data well, since they split on thresholds and do not treat the gap between two codes as a distance.

### ❌ Limitations:
- **False Ordinal Relationship:**  
  Misleads algorithms that assume numerical proximity matters (Linear Regression, Logistic Regression, KNN).  
  Example: Encoding "Country" as integers introduces meaningless order.

---

## 2. One-Hot Encoding (Recap & Integration)

**Definition:**  
One-Hot Encoding solves the "false order" problem by creating **binary indicator variables**.

**Process:**
- For a categorical variable with *k* categories, it creates *k* new binary (0/1) columns.
- Each row has a **1** in exactly one column.

**Key Characteristic:**  
Does **not assume order**. Treats all categories as separate and equidistant.

### Example: Color Feature (✅ The Solution)

| Original: Color | Red | Blue | Green |
|-----------------|-----|------|-------|
| Red             | 1   | 0    | 0     |
| Blue            | 0   | 1    | 0     |
| Green           | 0   | 0    | 1     |

---

### ✅ Applications:
- **Nominal Features:**  
  Categories with **no order** (e.g., Color, Country, Product Type).
- **Order-Sensitive Models:**  
  Essential for Linear Models, Logistic Regression, Neural Networks, SVMs.

### ❌ Limitations:
- **Curse of Dimensionality:**  
  If a feature has many unique values (e.g., ZIP codes), it creates too many columns, making the dataset sparse and computationally heavy.
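
The scale of the problem can be seen in a small sketch. The ZIP-like codes below are randomly generated stand-ins, not real data; the point is only how wide and how empty the encoded matrix becomes.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality feature: 10,000 rows drawn from up to 5,000 made-up codes.
rng = np.random.default_rng(0)
zip_codes = rng.integers(10000, 15000, size=(10_000, 1)).astype(str)

encoder = OneHotEncoder(handle_unknown="ignore")  # returns a SciPy sparse matrix by default
encoded = encoder.fit_transform(zip_codes)

n_rows, n_cols = encoded.shape
density = encoded.nnz / (n_rows * n_cols)
print(n_rows, n_cols)
print(f"fraction of non-zero cells: {density:.5f}")
```

Every row holds a single non-zero entry, so the fraction of non-zero cells is one over the number of columns. The sparse format keeps memory manageable, but the model still has to learn one parameter per column.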

---

## 3. Summary: Key Differences & Decision Guide

| Feature          | One-Hot Encoding                                     | Label Encoding                              |
|------------------|------------------------------------------------------|---------------------------------------------|
| **Concept**      | Creates a new binary column for each category        | Replaces categories with unique integers    |
| **Dimensionality** | Increases (adds *k* or *k-1* columns)               | Remains the same (1 column)                 |
| **Order Assumption** | No order assumed → Best for **Nominal** data      | Order implied → Best for **Ordinal** data   |
| **Model Suitability** | Safe for most models (Linear, Logistic, SVM, NN) | Works well with Tree-Based Models (if ordinal) |
| **Main Risk**    | High dimensionality with many categories             | False ordering for nominal data             |

---

### 🔑 Rule of Thumb:
- Use **One-Hot Encoding** for **Nominal** data.  
- Use **Label Encoding** for **Ordinal** data (especially with Tree-Based models).


# Encoding Techniques for Categorical Variables

Machine learning models require numerical input, but categorical variables come in many forms (nominal, ordinal, high-cardinality). Depending on the feature type and the algorithm, different encoding strategies are better suited.

---

## Summary Table

| Encoding Technique  | Use Case                                                                 |
|---------------------|--------------------------------------------------------------------------|
| **One-Hot Encoding** | Nominal features with a small number of unique categories                |
| **Label Encoding**   | Ordinal features, or categorical features fed to tree-based models       |
| **Frequency Encoding** | High-cardinality features in both regression and classification tasks |
| **Target Encoding**  | High-cardinality features in supervised learning tasks                   |
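
Target encoding is listed in the table but not demonstrated elsewhere in this notebook, so here is a minimal sketch of the basic idea: replace each category with the mean of the target for that category. The column names (`city`, `target`) and values are hypothetical, and this plain version has no smoothing.

```python
import pandas as pd

# Hypothetical training data: a categorical feature and a binary target.
train = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B", "C"],
    "target": [1,   0,   1,   1,   0,   1],
})

# Mean of the target per category, computed on the training data only.
category_means = train.groupby("city")["target"].mean()
global_mean = train["target"].mean()

train["city_target_enc"] = train["city"].map(category_means)

# Unseen categories at prediction time fall back to the global mean.
new = pd.DataFrame({"city": ["A", "D"]})
new["city_target_enc"] = new["city"].map(category_means).fillna(global_mean)
print(new)
```

Because the encoding is derived from the target, computing it on the same rows used for training can leak information; smoothing or out-of-fold estimates are commonly used to reduce that risk.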

---

## 1. One-Hot Encoding
- **How it works:** Creates new binary columns (0/1) for each unique category.  
- **Best for:** Nominal data (no order), like Color = {Red, Blue, Green}.  
- **Advantages:** Avoids false ordering, works well with linear/logistic regression, SVM, and neural networks.  
- **Limitations:** High dimensionality when there are many unique categories (e.g., thousands of ZIP codes).  

---

## 2. Label Encoding
- **How it works:** Assigns each category a unique integer (e.g., Small=0, Medium=1, Large=2).  
- **Best for:** Ordinal data (categories with a natural order).  
- **Advantages:** Very efficient—does not increase dimensionality.  
- **Limitations:** Creates false ordering if applied to nominal variables. That makes it unsuitable for linear models on nominal data, although **tree-based models** (Decision Trees, Random Forests, XGBoost) usually cope with it.  

---

## 3. Frequency Encoding
- **How it works:** Replaces each category with the frequency (or count) of its occurrence in the dataset.  
  Example: a ticket number that appears twice in the dataset is replaced by `2`.  
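
A minimal sketch of frequency encoding on a toy column, showing both the count and the relative-frequency variants. The `city` values are made up for illustration.

```python
import pandas as pd

# Toy feature (values are illustrative).
df = pd.DataFrame({"city": ["Paris", "Rome", "Paris", "Oslo", "Paris", "Rome"]})

counts = df["city"].value_counts()                 # absolute counts per category
df["city_count"] = df["city"].map(counts)

shares = df["city"].value_counts(normalize=True)   # relative frequencies
df["city_freq"] = df["city"].map(shares)
print(df)
```

Two categories with the same count receive the same code, so this encoding can merge categories that the model might otherwise need to tell apart.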


In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Load the Titanic dataset from GitHub into a DataFrame
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv"
data = pd.read_csv(url)

In [6]:
# Separate the features (X) from the target column (y)
X = data.drop(columns=['Survived'])
y = data['Survived']

In [7]:
X

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [8]:
y

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [9]:
le = LabelEncoder()

In [11]:
# LabelEncoder sorts the classes alphabetically, so C -> 0, Q -> 1, S -> 2
X['Embarked'] = le.fit_transform(data['Embarked'])
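
One caveat for the cell above: the scikit-learn documentation describes `LabelEncoder` as a transformer for target values (`y`), with `OrdinalEncoder` as the counterpart for feature columns, and the commonly distributed Titanic data has a few missing `Embarked` values. How those are encoded may depend on the scikit-learn version, so it can be safer to handle them explicitly. The sketch below uses a small stand-in series rather than the real column and relies on pandas only.

```python
import numpy as np
import pandas as pd

# Stand-in for the Embarked column (values are illustrative, not the real data).
embarked = pd.Series(["S", "C", np.nan, "Q", "S"])

# Option 1: give missing values an explicit category before encoding.
filled = embarked.fillna("Missing")
codes_filled = filled.astype("category").cat.codes
print(codes_filled.tolist())

# Option 2: leave them missing; pandas category codes mark NaN as -1.
codes_raw = embarked.astype("category").cat.codes
print(codes_raw.tolist())
```

Either way, the treatment of missing values becomes a visible decision rather than a side effect of the encoder.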

In [12]:
X

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,2
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,0
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,2
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,2
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,2
...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,2
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,2
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,2
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,0


In [40]:
# One-hot encode Sex and Embarked; drop_first=True keeps k-1 indicator columns per feature
df_one_hot = pd.get_dummies(data, columns=['Sex','Embarked'], drop_first=True)


In [41]:
df_one_hot

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Ticket_frequency,Sex_male,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,1,True,False,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,1,False,False,False
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,1,False,False,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,2,False,False,True
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,1,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,1,True,False,True
887,888,1,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,1,False,False,True
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,2,False,False,True
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,1,True,False,False


In [31]:
# Apply frequency encoding: replace each ticket with the number of times it occurs
X['Ticket_frequency'] = X['Ticket'].map(data["Ticket"].value_counts())



In [32]:
# Display the frequency-encoded feature next to the original
print(X[['Ticket', 'Ticket_frequency']])

               Ticket  Ticket_frequency
0           A/5 21171                 1
1            PC 17599                 1
2    STON/O2. 3101282                 1
3              113803                 2
4              373450                 1
..                ...               ...
886            211536                 1
887            112053                 1
888        W./C. 6607                 2
889            111369                 1
890            370376                 1

[891 rows x 2 columns]


In [34]:
X

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Ticket_frequency
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,2,1
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,0,1
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,2,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,2,2
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,2,1
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,2,1
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,2,2
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,0,1


In [48]:
# Drop the target, the free-text columns, and Age (which has missing values)
X = df_one_hot.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin','Age'])
y = data["Survived"]

In [49]:
X

Unnamed: 0,PassengerId,Pclass,SibSp,Parch,Fare,Ticket_frequency,Sex_male,Embarked_Q,Embarked_S
0,1,3,1,0,7.2500,1,True,False,True
1,2,1,1,0,71.2833,1,False,False,False
2,3,3,0,0,7.9250,1,False,False,True
3,4,1,1,0,53.1000,2,False,False,True
4,5,3,0,0,8.0500,1,True,False,True
...,...,...,...,...,...,...,...,...,...
886,887,2,0,0,13.0000,1,True,False,True
887,888,1,0,0,30.0000,1,False,False,True
888,889,3,1,2,23.4500,2,False,False,True
889,890,1,0,0,30.0000,1,True,False,False


In [50]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [51]:
from sklearn.linear_model import LogisticRegression

# Fit logistic regression on the unscaled features
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
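
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling route follows, using synthetic stand-in data (not the Titanic features) with deliberately mismatched scales. The names `X_demo`, `y_demo`, and `pipe` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with very different feature scales
# (an ID-like column in the hundreds next to a 0/1 flag).
rng = np.random.default_rng(42)
n = 500
X_demo = np.column_stack([
    np.arange(1, n + 1),            # ID-like column, range 1..500
    rng.exponential(30.0, size=n),  # skewed, fare-like column
    rng.integers(0, 2, size=n),     # binary indicator
])
y_demo = (X_demo[:, 2] + rng.normal(0, 0.5, size=n) > 0.5).astype(int)

# Standardize each feature to zero mean and unit variance before fitting.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X_demo, y_demo)

iterations = int(pipe.named_steps["logisticregression"].n_iter_[0])
print("iterations used:", iterations)
print("training accuracy:", pipe.score(X_demo, y_demo))
```

Putting the scaler inside a pipeline means it is fitted on the training data only and then reused unchanged at prediction time.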


In [52]:
y_pred = model.predict(X_test)

In [53]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.7877094972067039