# One-Hot Encoding From Scratch

## 1. Introduction
Most machine learning models (linear regression, neural networks, and so on) operate on **numerical input**. They cannot work with text strings like "Apple", "Banana", or "Orange" directly.

**One-Hot Encoding** is a technique that converts categorical text variables like these into binary vectors that a model can process.

### The Problem with Label Encoding
Why don't we just assign numbers like:
*   Apple = 1
*   Banana = 2
*   Orange = 3

**The Issue:** The model may treat these labels as quantities. It might conclude that "Orange" (3) is greater than "Apple" (1), or that the average of Apple and Orange is Banana. The numbering implies an **order** (ordinality) that doesn't exist in the data.
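
To make the issue concrete, here is a minimal sketch that treats the label codes as ordinary numbers. The `label_encoding` dictionary is a hypothetical name introduced for illustration; its values are the ones from the list above.

```python
# Hypothetical label encoding, using the numbering from the list above.
label_encoding = {"Apple": 1, "Banana": 2, "Orange": 3}

# Arithmetic on the codes produces relationships that mean nothing for fruit.
mean_of_apple_and_orange = (label_encoding["Apple"] + label_encoding["Orange"]) / 2

# The "average" of Apple and Orange lands exactly on Banana's code.
print(mean_of_apple_and_orange == label_encoding["Banana"])  # True

# Orange appears to "outrank" Apple.
print(label_encoding["Orange"] > label_encoding["Apple"])  # True
```

A model that multiplies inputs by weights performs exactly this kind of arithmetic, which is why the arbitrary numbering can leak into its predictions.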

### The Solution: One-Hot Encoding
We create a new binary column (0 or 1) for every unique category.

| Fruit | Is_Apple | Is_Banana | Is_Orange |
|-------|----------|-----------|-----------|
| Apple | 1 | 0 | 0 |
| Banana| 0 | 1 | 0 |
| Orange| 0 | 0 | 1 |
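
One way to see why this representation avoids a false ordering is to measure how far apart the rows are. The sketch below copies the three rows from the table and uses a hypothetical helper, `squared_distance`, that is not part of the original notebook.

```python
# One-hot rows copied from the table above (column order: Apple, Banana, Orange).
one_hot = {
    "Apple":  [1, 0, 0],
    "Banana": [0, 1, 0],
    "Orange": [0, 0, 1],
}

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Every pair of distinct fruits is the same distance apart,
# so no fruit sits "between" the other two.
print(squared_distance(one_hot["Apple"], one_hot["Banana"]))   # 2
print(squared_distance(one_hot["Apple"], one_hot["Orange"]))   # 2
print(squared_distance(one_hot["Banana"], one_hot["Orange"]))  # 2
```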

In [1]:
# Sample data
# Imagine we have a dataset of colors
raw_data = ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Red']

print(f"Original Data: {raw_data}")

Original Data: ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Red']


## 2. Step-by-Step Implementation
To build this from scratch, we need to follow these logical steps:

1.  **Find Unique Categories:** Identify all the distinct labels in the dataset.
2.  **Create an Index:** Assign a specific index (0, 1, 2...) to each unique category.
3.  **Create Zero Vectors:** For each item, create a list of zeros whose length equals the number of unique categories.
4.  **Flip the Bit:** Set the element at that item's category index to **1**.

### Step 1: Find Unique Categories

In [2]:
# We use set() to remove duplicates and sorted() to make the order deterministic
# (a set has no guaranteed iteration order)
unique_categories = sorted(set(raw_data))

print(f"Unique Categories: {unique_categories}")
print(f"Number of Classes: {len(unique_categories)}")

Unique Categories: ['Blue', 'Green', 'Red']
Number of Classes: 3


### Step 2: Mapping Categories to Indices
We need a dictionary that tells us which position belongs to which color.
*   `'Blue'` $\rightarrow$ Index `0`
*   `'Green'` $\rightarrow$ Index `1`
*   `'Red'` $\rightarrow$ Index `2`

In [3]:
# Create a dictionary to map label -> index
category_to_index = {category: i for i, category in enumerate(unique_categories)}

print("Index Mapping:")
for cat, idx in category_to_index.items():
    print(f"  {cat} -> Index {idx}")

Index Mapping:
  Blue -> Index 0
  Green -> Index 1
  Red -> Index 2


### Steps 3 and 4: Generating the Vectors
Now we iterate through our original data. For each item:
1.  Create a list of zeros: `[0, 0, 0]`
2.  Find the index of the current item (e.g., 'Red' is index 2).
3.  Change the 0 at index 2 to a 1: `[0, 0, 1]`

In [4]:
one_hot_encoded_data = []

for item in raw_data:
    # 1. Create a zero vector of length equal to unique classes
    # If we have 3 classes, we create [0, 0, 0]
    vector = [0] * len(unique_categories)
    
    # 2. Find the index of the current item
    index = category_to_index[item]
    
    # 3. Set that position to 1
    vector[index] = 1
    
    # Append to our list
    one_hot_encoded_data.append(vector)

# Print results
print(f"{'Original':<10} | {'One-Hot Vector'}")
print("-" * 30)
for item, vector in zip(raw_data, one_hot_encoded_data):
    print(f"{item:<10} | {vector}")

Original   | One-Hot Vector
------------------------------
Red        | [0, 0, 1]
Blue       | [1, 0, 0]
Green      | [0, 1, 0]
Red        | [0, 0, 1]
Blue       | [1, 0, 0]
Red        | [0, 0, 1]
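
A quick way to sanity-check the result is to confirm that each vector contains exactly one 1 and that the vectors can be mapped back to the original labels. The sketch below is self-contained, so it repeats the values printed above; `decode` is a hypothetical helper added for illustration.

```python
# Self-contained copy of the values produced in the steps above.
unique_categories = ['Blue', 'Green', 'Red']
one_hot_encoded_data = [
    [0, 0, 1], [1, 0, 0], [0, 1, 0],
    [0, 0, 1], [1, 0, 0], [0, 0, 1],
]

def decode(vector, categories):
    """Hypothetical helper: return the category whose position holds the 1."""
    return categories[vector.index(1)]

# Each one-hot vector should contain exactly one 1.
assert all(sum(vector) == 1 for vector in one_hot_encoded_data)

# Decoding recovers the original list of colors.
decoded = [decode(vector, unique_categories) for vector in one_hot_encoded_data]
print(decoded)  # ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Red']
```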


## 3. Creating a Reusable Function
Now that we understand the logic, let's wrap this into a clean Python function that we can use on any list of text data.

In [5]:
def one_hot_encode(data):
    """
    Performs one-hot encoding on a list of categorical data.

    Returns a tuple of:
        1. The list of one-hot vectors
        2. The sorted list of unique category names (column headers)
    """
    # 1. Get unique categories
    unique_cats = sorted(set(data))
    
    # 2. Create mapping
    cat_to_index = {cat: i for i, cat in enumerate(unique_cats)}
    
    # 3. Encode
    encoded_output = []
    for item in data:
        vector = [0] * len(unique_cats)
        vector[cat_to_index[item]] = 1
        encoded_output.append(vector)
        
    return encoded_output, unique_cats

# --- Testing the function ---
animals = ['Cat', 'Dog', 'Bird', 'Cat', 'Bird']
encoded_animals, headers = one_hot_encode(animals)

print(f"Categories: {headers}\n")
for raw, vec in zip(animals, encoded_animals):
    print(f"{raw}: {vec}")

Categories: ['Bird', 'Cat', 'Dog']

Cat: [0, 1, 0]
Dog: [0, 0, 1]
Bird: [1, 0, 0]
Cat: [0, 1, 0]
Bird: [1, 0, 0]
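
Note that `one_hot_encode` derives its columns from whatever list it receives, so a second list with different categories would get a different column layout. One possible way to keep the layout stable is to pass the category list in explicitly. The sketch below is an illustration under assumptions, not part of the original notebook: `one_hot_encode_with_categories` is a hypothetical name, and encoding an unknown category as all zeros is one design choice among several (raising an error is another).

```python
def one_hot_encode_with_categories(data, categories):
    """Hypothetical variant: encode `data` against a fixed category list."""
    cat_to_index = {cat: i for i, cat in enumerate(categories)}

    encoded_output = []
    for item in data:
        vector = [0] * len(categories)
        # Assumption: a category missing from `categories` stays all zeros
        # instead of raising a KeyError.
        if item in cat_to_index:
            vector[cat_to_index[item]] = 1
        encoded_output.append(vector)

    return encoded_output

# Column layout taken from the first dataset above.
headers = ['Bird', 'Cat', 'Dog']
new_animals = ['Dog', 'Cat', 'Fish']

print(one_hot_encode_with_categories(new_animals, headers))
# [[0, 0, 1], [0, 1, 0], [0, 0, 0]]
```

Here `'Fish'` was never seen in the original data, so it maps to `[0, 0, 0]`, while `'Dog'` and `'Cat'` keep the same positions they had before.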
