# Technique: 04 Encoding

### What is this?
Encoding means converting categorical values (words such as "Toronto" or "Master Degree") into numbers. Most models operate on numeric input, so each category needs a numeric representation before training.

### Why use it?
1. Most machine learning models accept only numeric input.
2. A well-chosen encoding can make the model more accurate; a poorly chosen one can mislead it.

### Two Main Methods:
1. **Label Encoding**: Assigning each category an integer (0, 1, 2, 3). Best for **Ordinal data** (data with an order, like High School < Bachelor < Master), provided the integers follow that order.
2. **One-Hot Encoding**: Creating one new column per category, marked 1 (True) when the row belongs to that category and 0 (False) otherwise. Best for **Nominal data** (data with no order, like North, South, East, West).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from data_generator import generate_dtt_dataset, GLOBAL_SEED

# Initialize Dataset
df = generate_dtt_dataset()
print(f'Dataset loaded with Global Seed: {GLOBAL_SEED}')
df.head()

Dataset loaded with Global Seed: 888


Unnamed: 0,Age,Annual_Salary,Household_Size,Education_Level,Region,Cluster_Feature_1,Cluster_Feature_2,Transaction_Amount
0,47,293814.560245,2.219505,Master,North,-0.666995,8.207288,3.257499
1,63,293814.560245,1.967882,High School,East,-1.843954,-8.553721,68.050678
2,55,293814.560245,1.82875,PhD,East,6.498745,-7.157678,8.111841
3,36,293814.560245,1.772328,Bachelor,East,-1.25746,-8.568788,27.766274
4,42,293814.560245,3.174114,Master,North,8.144192,-6.575686,13.867245


In [2]:
print("Education Levels (Ordinal):")
print(df['Education_Level'].unique())

print("\nRegions (Nominal):")
print(df['Region'].unique())

Education Levels (Ordinal):
<StringArray>
['Master', 'High School', 'PhD', 'Bachelor']
Length: 4, dtype: str

Regions (Nominal):
<StringArray>
['North', 'East', 'South', 'West']
Length: 4, dtype: str


## Method 1: Label Encoding
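We use this for **Education_Level** because the levels have a rank (High School < Bachelor < Master < PhD). 
For the numbers to be meaningful, they must follow that rank. Note that `LabelEncoder` assigns codes in alphabetical order, so in the output below Bachelor gets 0 and High School gets 1, which does not match the rank.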

In [3]:
from sklearn.preprocessing import LabelEncoder

# 1. Create the Encoder
le = LabelEncoder()

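# 2. Map the education levels to numbers
# Note: LabelEncoder sorts the categories alphabetically
# (Bachelor=0, High School=1, Master=2, PhD=3), so the codes do not
# follow the true rank. For a true rank, define the mapping manually.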
df['Education_Encoded'] = le.fit_transform(df['Education_Level'])

print("Education Level vs Encoded Number:")
print(df[['Education_Level', 'Education_Encoded']].head(10))

Education Level vs Encoded Number:
  Education_Level  Education_Encoded
0          Master                  2
1     High School                  1
2             PhD                  3
3        Bachelor                  0
4          Master                  2
5          Master                  2
6          Master                  2
7          Master                  2
8          Master                  2
9     High School                  1
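
Because the alphabetical codes above put Bachelor below High School, a manual mapping is one way to keep the real rank. The following is a minimal sketch, not part of the original notebook: the helper `encode_education` and the constant `EDUCATION_ORDER` are hypothetical names, and the rank order is an assumption based on the four levels printed earlier.

```python
import pandas as pd

# Assumed rank order, lowest to highest (the four levels seen in the dataset).
EDUCATION_ORDER = ["High School", "Bachelor", "Master", "PhD"]


def encode_education(levels: pd.Series) -> pd.Series:
    """Map each education level to its position in EDUCATION_ORDER (0 = lowest)."""
    rank = {level: position for position, level in enumerate(EDUCATION_ORDER)}
    codes = levels.map(rank)
    # Values missing from the mapping become NaN; fail loudly instead of
    # silently passing them on to a model.
    if codes.isna().any():
        unknown = levels[codes.isna()].unique().tolist()
        raise ValueError(f"Unknown education levels: {unknown}")
    return codes.astype("int64")


sample = pd.Series(["Master", "High School", "PhD", "Bachelor"])
encoded = encode_education(sample)
print(encoded.tolist())  # [2, 0, 3, 1]
```

With this mapping High School is 0 and PhD is 3, so a larger number always means a higher level. On the notebook's data the call would be `encode_education(df['Education_Level'])`.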


## Method 2: One-Hot Encoding
We use this for **Region** (North, South, etc.). 
Since North is not "bigger" than South, we create 4 new columns, one per region. 
If the person is from the North, the `Region_North` column is True (1), and the others are False (0).

In [4]:
# We can use pandas 'get_dummies' for One-Hot Encoding
region_encoded = pd.get_dummies(df['Region'], prefix='Region')

# Combine with our main data
df = pd.concat([df, region_encoded], axis=1)

print("New columns created for Region:")
print(region_encoded.head())

New columns created for Region:
   Region_East  Region_North  Region_South  Region_West
0        False          True         False        False
1         True         False         False        False
2         True         False         False        False
3         True         False         False        False
4        False          True         False        False
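
The columns above hold True/False. Two optional `get_dummies` parameters are worth knowing: `dtype=int` gives 0/1 values, and `drop_first=True` drops one column, since it can be inferred from the others. The sketch below uses a small made-up `regions` frame rather than the notebook's dataset.

```python
import pandas as pd

# Small illustrative frame (not the notebook's dataset).
regions = pd.DataFrame({"Region": ["North", "East", "East", "South", "West"]})

# dtype=int yields 0/1 instead of the default True/False.
one_hot = pd.get_dummies(regions["Region"], prefix="Region", dtype=int)

# drop_first=True removes the first category in sorted order (Region_East here);
# a row with all zeros then means "East".
one_hot_reduced = pd.get_dummies(
    regions["Region"], prefix="Region", dtype=int, drop_first=True
)

print(one_hot)
print(one_hot_reduced.columns.tolist())
# ['Region_North', 'Region_South', 'Region_West']
```

Dropping one column is mostly relevant for linear models, where a fully redundant column can cause problems; tree-based models usually work fine with all four columns.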


### Summary from Lecture Slides:
* **Label Encoding Problem**: The model may treat the codes as magnitudes and conclude that 3 is "greater" than 0. Only use this when the categories have a real order (Ordinal) and the codes follow it.
* **One-Hot Encoding Problem**: Each category becomes its own column, so 100 different cities produce 100 new columns. The data becomes very wide and mostly zeros.
* **Choice Matters**: Model performance can change noticeably depending on which encoding you pick.
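
To make the 100-cities problem concrete, the sketch below builds a made-up `cities` column with 100 distinct values and counts the resulting columns. It then shows one common workaround that is not from the lecture slides: keeping only the most frequent categories and collapsing the rest into "Other". The city names, row counts, and the choice of keeping 3 categories are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: 3 frequent cities (50 rows each) and 97 rare ones (1 row each).
frequent = ["Toronto", "Montreal", "Vancouver"]
rows = [city for city in frequent for _ in range(50)]
rows += [f"City_{i:03d}" for i in range(97)]
cities = pd.Series(rows, name="City")

# Plain one-hot encoding: one column per distinct city.
wide = pd.get_dummies(cities, prefix="City")
print(wide.shape)  # (247, 100)

# Workaround (assumption, not from the slides): keep the 3 most frequent
# cities and replace every other value with "Other" before encoding.
top = cities.value_counts().nlargest(3).index
collapsed = cities.where(cities.isin(top), "Other")
narrow = pd.get_dummies(collapsed, prefix="City")
print(narrow.shape)  # (247, 4)
```

The trade-off is that the rare cities can no longer be told apart, so this only makes sense when they carry little information individually.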