# **Encoding Categorical Variables:**
Most machine learning models work with numbers, not text labels. Encoding converts categorical data into numeric form without destroying meaning.

The challenge is to encode categories while preserving information and avoiding unintended bias.

#### **1. One-hot Encoding:**

Each category becomes its own binary column. No category is treated as “larger” or “smaller”, which makes this a good fit for nominal data.

In [1]:
# Red  Blue  Green
# 1    0     0
# 0    1     0
# 0    0     1

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

encoded = pd.get_dummies(df["color"])
print(encoded)


    blue  green    red
0  False  False   True
1   True  False  False
2  False   True  False
3  False  False   True
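The run above prints booleans because pandas 2.0 and later return `bool` dummy columns by default. Here is a minimal sketch, assuming the same toy `color` column, of getting 0/1 integers like the table at the top of the cell. It also shows `drop_first`, which removes one redundant column; using it is an assumption for illustration, not something the original run does.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# dtype=int yields 0/1 columns instead of booleans
encoded_int = pd.get_dummies(df["color"], dtype=int)

# drop_first=True removes the first category in sorted order ("blue");
# a row of all zeros then stands for "blue"
encoded_reduced = pd.get_dummies(df["color"], dtype=int, drop_first=True)

print(encoded_int)
print(encoded_reduced)
```

Dropping one column is often preferred for linear models, where the full set of dummy columns is perfectly collinear with the intercept.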


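#### **2. Label Encoding:**

Assigns an integer to each category. scikit-learn's `LabelEncoder` numbers the categories in sorted (alphabetical) order, so for the colors above the mapping is:

blue → 0
green → 1
red → 2

These integers imply an order (red > green > blue) that does not exist for nominal data. `LabelEncoder` is also intended for target labels rather than input features.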

In [2]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

labels = le.fit_transform(df["color"])
print(labels)


[2 0 1 2]
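To see which integer went to which category, the fitted encoder's `classes_` attribute can be inspected. A small sketch, rebuilding the same toy `df` so that it runs on its own:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

le = LabelEncoder()
labels = le.fit_transform(df["color"])

# classes_ holds the categories in sorted order; position = assigned integer
mapping = {category: int(code) for code, category in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integers
restored = le.inverse_transform(labels)
print(list(restored))
```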


#### **3. Ordinal Encoding:**

Used when categories have a real ranking, such as Small < Medium < Large. The order is passed explicitly through `categories` so that the integers follow it.

In [4]:
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large","small"]})

encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded = encoder.fit_transform(sizes)

print(encoded)

[[0.]
 [1.]
 [2.]
 [0.]]
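The explicit `categories` list matters because, without it, `OrdinalEncoder` falls back to sorted order. A short sketch comparing the two on the same toy `sizes` frame (the variable names `default_codes` and `ranked_codes` are only illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large", "small"]})

# Without `categories`, the encoder sorts alphabetically: large, medium, small
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# With an explicit list, the integers follow the intended ranking
ranked = OrdinalEncoder(categories=[["small", "medium", "large"]])
ranked_codes = ranked.fit_transform(sizes).ravel()

print(default_codes)  # small gets the largest code here
print(ranked_codes)   # small gets the smallest code here
```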


#### **4. High-Cardinality:**

High cardinality = many unique categories.
Examples:
- User ID → thousands of values
- City names → hundreds

**`One-hot encoding causes:`**
- Massive feature explosion
- Memory issues
- Overfitting
- Sparse data

**`Strategies:`**
- Group rare categories
- Frequency encoding
- Target encoding
- Embeddings (advanced)

In [6]:
# Group the rare category "green" into a catch-all "other" bucket
df["color"] = df["color"].replace({"green": "other"})
print(df)


   color
0    red
1   blue
2  other
3    red
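Frequency and target encoding from the strategy list can be sketched with plain pandas. The `city` and `bought` columns below are hypothetical data made up for illustration, and the target encoding shown is the simplest unsmoothed mean, not a production-ready variant.

```python
import pandas as pd

# Hypothetical data: a categorical "city" column and a binary target
data = pd.DataFrame({
    "city": ["paris", "paris", "rome", "oslo", "paris", "rome"],
    "bought": [1, 0, 1, 0, 1, 1],
})

# Frequency encoding: replace each category with its share of the rows
freq = data["city"].value_counts(normalize=True)
data["city_freq"] = data["city"].map(freq)

# Target (mean) encoding: replace each category with the mean target value
target_mean = data.groupby("city")["bought"].mean()
data["city_target"] = data["city"].map(target_mean)

print(data)
```

Both approaches keep a single numeric column no matter how many categories exist. In practice the statistics would be computed on the training split only and then mapped onto validation or test data, since a target encoding computed on the full data leaks the label into the feature.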


**Summary:**
- One-hot → safest for nominal categories
- Label encoding → numeric labels (risk of false order)
- Ordinal encoding → real ranking
- High cardinality → manage complexity