# 📘 NLP - One Hot Encoding (Complete Notes)

## 👋 Introduction

We're continuing our discussion on **Natural Language Processing (NLP)**.

Previously, we covered:

- Stemming  
- Lemmatization  
- Stopword removal  
- Data cleaning  

Now, we move to **converting text into vectors**, which is essential before applying ML algorithms.

---

## 🔤 One Hot Encoding

> One of the most basic techniques for converting words into a numerical format.

### 💡 Concept:
- Each **unique word** in the vocabulary is assigned a **binary vector**.
- Length of vector = total vocabulary size.
- Exactly one position (the word's index in the vocabulary) is `1`; the rest are `0`.

---

## 📄 Example Sentences

Let's take 3 sentences:

- D1: The food is good
- D2: The food is bad
- D3: Pizza is amazing


### Step 1: Build Vocabulary

Unique words across all three sentences, lowercased and listed in order of first appearance:

`['the', 'food', 'is', 'good', 'bad', 'pizza', 'amazing']`


Vocabulary size = **7**
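
The vocabulary step can be sketched in plain Python. This is a minimal illustration, assuming simple whitespace tokenization and lowercasing; the helper name `build_vocabulary` is hypothetical and not part of any library.

```python
# The three example sentences from above.
documents = [
    "The food is good",
    "The food is bad",
    "Pizza is amazing",
]


def build_vocabulary(docs):
    """Return the unique lowercased words, in order of first appearance."""
    vocabulary = []
    seen = set()
    for doc in docs:
        # Assumption: splitting on whitespace is enough for these sentences.
        for word in doc.lower().split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


vocabulary = build_vocabulary(documents)
print(vocabulary)       # ['the', 'food', 'is', 'good', 'bad', 'pizza', 'amazing']
print(len(vocabulary))  # 7
```

The position of each word in this list is the index that will hold the `1` in its one-hot vector.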

---

### Step 2: One Hot Encode Each Sentence

#### 🔹 Document D1: "The food is good"

| Word  | Vector                 |
|-------|-------------------------|
| the   | [1, 0, 0, 0, 0, 0, 0]   |
| food  | [0, 1, 0, 0, 0, 0, 0]   |
| is    | [0, 0, 1, 0, 0, 0, 0]   |
| good  | [0, 0, 0, 1, 0, 0, 0]   |

Shape: **4 × 7**

---

#### 🔹 Document D2: "The food is bad"

| Word  | Vector                 |
|-------|-------------------------|
| the   | [1, 0, 0, 0, 0, 0, 0]   |
| food  | [0, 1, 0, 0, 0, 0, 0]   |
| is    | [0, 0, 1, 0, 0, 0, 0]   |
| bad   | [0, 0, 0, 0, 1, 0, 0]   |

Shape: **4 × 7**

---

#### 🔹 Document D3: "Pizza is amazing"

| Word     | Vector                 |
|----------|-------------------------|
| pizza    | [0, 0, 0, 0, 0, 1, 0]   |
| is       | [0, 0, 1, 0, 0, 0, 0]   |
| amazing  | [0, 0, 0, 0, 0, 0, 1]   |

Shape: **3 × 7**
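
The tables above can be reproduced with a short sketch. It assumes the 7-word vocabulary from Step 1 and whitespace tokenization; `one_hot_document` is a hypothetical helper written for these notes.

```python
vocabulary = ["the", "food", "is", "good", "bad", "pizza", "amazing"]
# Map each word to its position in the vocabulary.
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot_document(sentence):
    """Return one one-hot row per word in the sentence."""
    rows = []
    for word in sentence.lower().split():
        vector = [0] * len(vocabulary)
        vector[word_to_index[word]] = 1  # single 1 at the word's index
        rows.append(vector)
    return rows


examples = [
    ("D1", "The food is good"),
    ("D2", "The food is bad"),
    ("D3", "Pizza is amazing"),
]
for name, sentence in examples:
    matrix = one_hot_document(sentence)
    # Rows = words in the sentence, columns = vocabulary size.
    print(name, f"{len(matrix)} x {len(matrix[0])}")
    for row in matrix:
        print("  ", row)
```

Each document becomes a matrix with one row per word, so D1 and D2 come out as 4 × 7 while D3 is 3 × 7.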

---

## ✅ Advantages

1. **Simple to implement**
   - Available out of the box via `sklearn.preprocessing.OneHotEncoder` or `pandas.get_dummies()`.

2. **Works for small datasets**
   - Effective for demos or early prototypes.

---

## ❌ Disadvantages

1. **Creates Sparse Matrix**
   - High-dimensional and almost entirely 0s.
   - Wasteful in memory and computation unless stored in a sparse format.
   - The large number of features can contribute to **overfitting** in ML models.

2. **Variable Input Lengths**
   - Each sentence can have a different word count (e.g. D3 is 3×7, while D1 is 4×7).
   - Most ML models require **fixed-size inputs**, so these matrices cannot be fed in directly.

3. **No Semantic Meaning**
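   - Take a three-word vocabulary:
     - food → [1, 0, 0]
     - pizza → [0, 1, 0]
     - burger → [0, 0, 1]
   - Every pair of vectors is the same distance apart (Euclidean distance √2, dot product 0), so the encoding cannot express that some words are more similar than others.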

4. **OOV Problem (Out of Vocabulary)**
   - Words not seen during training (e.g. “burger” with the 7-word vocabulary above) have no position in the vector, so they can't be encoded.

5. **Scalability**
   - A real-world dataset may have 50k+ unique words, so every single word becomes a 50k+-dimensional vector containing just one `1`.
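
A few of these drawbacks can be checked directly on the example vocabulary. The following is a rough sketch in plain Python; the helpers `one_hot` and `euclidean` are hypothetical names used only for illustration.

```python
import math

vocabulary = ["the", "food", "is", "good", "bad", "pizza", "amazing"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """One-hot vector for a known word; raises KeyError for unknown words."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# 1. Sparsity: fraction of zeros in the 4 x 7 matrix for D1.
d1 = [one_hot(w) for w in "the food is good".split()]
zeros = sum(row.count(0) for row in d1)
sparsity = zeros / (len(d1) * len(vocabulary))
print(f"zeros in D1: {zeros} of {len(d1) * len(vocabulary)}")

# 3. No semantic meaning: a related pair and an unrelated pair
#    are exactly the same distance apart.
dist_related = euclidean(one_hot("food"), one_hot("pizza"))
dist_unrelated = euclidean(one_hot("food"), one_hot("is"))
print(dist_related == dist_unrelated)

# 4. OOV: "burger" is not in the vocabulary, so there is no index for it.
try:
    one_hot("burger")
    oov_failed = False
except KeyError:
    oov_failed = True
print("burger could not be encoded:", oov_failed)
```

Even in this tiny example 24 of the 28 entries for D1 are zeros, and the proportion grows as the vocabulary gets larger.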

---

## 🧠 Summary Table

| Feature               | One Hot Encoding |
|------------------------|------------------|
| Easy to Implement      | ✅ Yes           |
| Sparse Matrix          | ⚠️ Yes           |
| Fixed Input Length     | ❌ No            |
| Semantic Info          | ❌ None          |
| OOV Handling           | ❌ Poor          |
| Scalable               | ❌ Not suitable  |

---

## 💻 Example in Python

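```python
import pandas as pd

# Sample words (document D1)
words = ['the', 'food', 'is', 'good']
df = pd.DataFrame({'words': words})

# One-hot encoding using pandas: one row per word, one column per unique word.
# Note: get_dummies orders the columns alphabetically (food, good, is, the),
# not in vocabulary order, and only knows the words present in this column.
# dtype=int gives 0/1 instead of the True/False returned by recent pandas versions.
encoded = pd.get_dummies(df['words'], dtype=int)
print(encoded)
```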