#### Implementing one-hot encoding from scratch

In [62]:
import numpy as np
import pandas as pd

#### Sample data to apply OHE

Here we use a small dataset with two features, both categorical, and apply one-hot encoding to it.

In [63]:
# Creating a dictionary with your specific features
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']
}

# Converting to a DataFrame
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)

Original Dataset:
   Color    Size
0    Red   Small
1   Blue  Medium
2  Green   Large
3    Red  Medium
4   Blue   Small


Intuition of one-hot encoding:

First, we find the unique categories in each feature (column).

Then, for every category, we create a new array of zeros with one entry per row: np.zeros(df.shape[0]).

Finally, we loop over each feature column and change the 0 to 1 in the rows where that category appears.

In [64]:
# Collect the unique categories of each column
unique_categories = {}
for i in df.columns:  # in practice, loop over only the categorical columns instead of df.columns
    unique = np.unique(df[i])  # np.unique returns the categories in sorted order
    unique_categories[i] = unique

unique_categories

{'Color': array(['Blue', 'Green', 'Red'], dtype=object),
 'Size': array(['Large', 'Medium', 'Small'], dtype=object)}

#### Now we will create an np.zeros array for each category of every categorical feature

The naming convention is: column name + "_" + category name

Example: Color_Red

In [65]:
# creating arrays with zeros:
categories = {}
for key, value in unique_categories.items():
    for i in value:
        name = f"{key}_{i}"
        categories[name] = np.zeros(df.shape[0], dtype=int)
categories

{'Color_Blue': array([0, 0, 0, 0, 0]),
 'Color_Green': array([0, 0, 0, 0, 0]),
 'Color_Red': array([0, 0, 0, 0, 0]),
 'Size_Large': array([0, 0, 0, 0, 0]),
 'Size_Medium': array([0, 0, 0, 0, 0]),
 'Size_Small': array([0, 0, 0, 0, 0])}

#### Now that the playground is set up, we will loop over the data and fill in the values.

To do this, we loop through the categorical columns again. For each column we take its name (e.g. Color), then go through its values row by row and set the matching entry of the corresponding array to 1. For example, row i = 1 is Blue, so Color_Blue becomes [0, 1, 0, 0, 0] once that row is processed.

In [66]:
for key in unique_categories:
    for i in range(df.shape[0]):
        # Build the target array name from the column name and the value in row i
        name = f"{key}_{df[key].iloc[i]}"
        categories[name][i] = 1

categories

{'Color_Blue': array([0, 1, 0, 0, 1]),
 'Color_Green': array([0, 0, 1, 0, 0]),
 'Color_Red': array([1, 0, 0, 1, 0]),
 'Size_Large': array([0, 0, 1, 0, 0]),
 'Size_Medium': array([0, 1, 0, 1, 0]),
 'Size_Small': array([1, 0, 0, 0, 1])}
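
The row-by-row loop above makes the mechanics explicit. As a hedged alternative, the same arrays can be built by comparing a whole column against one category at a time. This is a minimal sketch that rebuilds the sample data so it runs on its own; the name `encoded_vectorized` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Same sample data as in the notebook
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
})

# Hypothetical name, used only for this illustration
encoded_vectorized = {}
for col in df.columns:
    for cat in np.unique(df[col]):
        # The comparison gives a boolean mask over all rows; cast it to 0/1 integers
        encoded_vectorized[f"{col}_{cat}"] = (df[col] == cat).to_numpy().astype(int)

print(encoded_vectorized)
```

The comparison replaces the inner loop over rows, so no zero arrays need to be allocated first.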

In [67]:
for key,value in categories.items():
    df[key] = value
df

   Color    Size  Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0    Red   Small           0            0          1           0            0           1
1   Blue  Medium           1            0          0           0            1           0
2  Green   Large           0            1          0           1            0           0
3    Red  Medium           0            0          1           0            1           0
4   Blue   Small           1            0          0           0            0           1
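
As a sanity check on the from-scratch result, the encoded columns can be compared with what `pd.get_dummies` produces for the same data. This is a sketch, not part of the original notebook; the name `dummies` is hypothetical. Note that `get_dummies` also removes the original Color and Size columns, which the manual version still contains at this point.

```python
import pandas as pd

# Same sample data as in the notebook
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
})

# dtype=int gives 0/1 integers instead of booleans
dummies = pd.get_dummies(df, columns=['Color', 'Size'], dtype=int)
print(dummies)
```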


#### Now we will drop the original Color and Size columns

In [71]:
# Drop the original categorical columns now that they are encoded
df = df.drop(columns=list(unique_categories.keys()))

The Redundancy Step: "The Dummy Variable Trap"

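At this point we have one redundant column per feature. Keeping it makes the encoded columns perfectly collinear (multicollinearity), which is the problem known as the dummy variable trap.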

If you have three colors (Red, Blue, Green), and for a specific row you see:

Color_Blue = 0

Color_Green = 0

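The computer can "perfectly predict" that Color_Red must be 1, because exactly one color column is 1 in every row. The third column adds no new information, and it can cause problems for certain models (like Linear Regression with an intercept): the exact mathematical dependency between features means the coefficients are no longer uniquely determined.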
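
The dependency can be checked numerically. The sketch below hard-codes the three Color columns from the example and adds an intercept column of ones, as a linear regression would (the intercept is an assumption of this illustration). With all three columns the design matrix has four columns but only rank 3; dropping one category restores full column rank.

```python
import numpy as np

# One-hot columns for Color; rows are Red, Blue, Green, Red, Blue
color_blue = np.array([0, 1, 0, 0, 1])
color_green = np.array([0, 0, 1, 0, 0])
color_red = np.array([1, 0, 0, 1, 0])

# Every row has exactly one 1, so the three columns always sum to 1
row_sums = color_blue + color_green + color_red

# Intercept column, as used by a linear model (assumed for this illustration)
intercept = np.ones(5, dtype=int)

X_full = np.column_stack([intercept, color_blue, color_green, color_red])
X_reduced = np.column_stack([intercept, color_green, color_red])

rank_full = np.linalg.matrix_rank(X_full)        # fewer than the 4 columns
rank_reduced = np.linalg.matrix_rank(X_reduced)  # equal to the 3 columns

print(row_sums, rank_full, rank_reduced)
```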

In [74]:
# Drop the first (alphabetically smallest) category column of each feature
for key,value in unique_categories.items():
    df = df.drop(columns=f"{key}_{value[0]}")
df

   Color_Green  Color_Red  Size_Medium  Size_Small
0            0          1            0           1
1            0          0            1           0
2            1          0            0           0
3            0          1            1           0
4            0          0            0           1
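
For comparison, `pd.get_dummies` has a `drop_first` option that drops the first category of each encoded column. On this data it should give the same four columns as the manual steps above. This is a sketch for cross-checking; the name `reduced` is hypothetical.

```python
import pandas as pd

# Same sample data as in the notebook
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
})

# drop_first=True removes the first category per feature (Color_Blue and Size_Large here)
reduced = pd.get_dummies(df, columns=['Color', 'Size'], drop_first=True, dtype=int)
print(reduced)
```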



### Wrong encoding for the Size column

One-hot encoding is not the best choice for the "Size" column. Ordinal encoding suits it better because it preserves the order of the categories. Unlike colors, which are just different, sizes have a natural hierarchy: Small is less than Medium, and Medium is less than Large. By assigning them numbers in that order (1, 2, 3), you're teaching the model that "Large" ranks above "Small." The order has to be given explicitly: a label encoder that numbers categories alphabetically would rank them Large, Medium, Small and lose the hierarchy. If you used one-hot encoding instead, the model would treat the sizes as unrelated indicator variables, losing the valuable "ranking" information that often helps a model make more accurate predictions with fewer columns.
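
A minimal sketch of this mixed approach, assuming the mapping Small = 1, Medium = 2, Large = 3. The names `size_order`, `Size_Ordinal`, and `encoded` are hypothetical. Size is mapped to integers with an explicit dictionary, while Color, which has no natural order, is still one-hot encoded.

```python
import pandas as pd

# Same sample data as in the notebook
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
})

# Explicit order; an alphabetical numbering would not preserve the ranking
size_order = {'Small': 1, 'Medium': 2, 'Large': 3}

encoded = df.copy()
encoded['Size_Ordinal'] = encoded['Size'].map(size_order)
encoded = encoded.drop(columns='Size')

# Color has no natural order, so it stays one-hot encoded
encoded = pd.get_dummies(encoded, columns=['Color'], drop_first=True, dtype=int)
print(encoded)
```

This represents Size with a single column instead of two. The integer codes also imply equal spacing between sizes, which is an assumption worth checking against the data.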