# 📘 Notebook 04 — Encoding Categorical Variables
📁 File name: 04_encoding_categorical_variables.ipynb

This notebook walks beginners through why we need to encode categorical variables, and how to do it using OneHotEncoder and OrdinalEncoder from scikit-learn. It also introduces basic handling of unknown categories and how to work with both encoders in a Pandas-friendly way.

📒 Notebook Sections
1. Title & Intro
2. Why Encoding Is Needed
3. Explore Categorical Columns
4. OneHotEncoder
5. OrdinalEncoder
6. Dealing with Unknown Categories
7. Summary & What’s Next

## 1. Title & Introduction (Markdown)
### 🔠 04 — Encoding Categorical Variables

In this notebook, you'll learn how to convert categorical (text-based) columns into numbers so that machine learning models can work with them.

We’ll cover:

- What encoding is and why it matters
- One-hot encoding using `OneHotEncoder`
- Ordinal encoding using `OrdinalEncoder`
- How to handle unknown categories during inference

## 2. Why Encoding Is Needed (Markdown)
### 🤔 Why Do We Need to Encode Categories?

Most machine learning models work with numbers, not text. When a dataset contains columns like `Gender`, `Country`, or `ProductType`, those columns must be converted into a numerical format before training.

Two common techniques:

- **One-Hot Encoding**: Creates one binary (0/1) column per category
- **Ordinal Encoding**: Replaces each category with an integer; use it only when the categories have a natural order (for example, low < medium < high)
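
To make the difference concrete, here is a minimal pure-Python sketch of what each technique produces for a tiny column. The `sizes` values and the `size_order` ranking are made up for illustration; the scikit-learn encoders used later in the notebook do the same job with more safeguards.

```python
# Hypothetical column of categories (illustrative values only)
sizes = ["small", "large", "medium", "small"]

# One-hot encoding: one binary column per distinct category
categories = sorted(set(sizes))  # ['large', 'medium', 'small']
one_hot = [[1 if value == cat else 0 for cat in categories] for value in sizes]

# Ordinal encoding: one integer per category, following an explicit order
size_order = ["small", "medium", "large"]  # assumed ranking, lowest to highest
ordinal = [size_order.index(value) for value in sizes]

print(categories)
print(one_hot)
print(ordinal)
```

Each row of `one_hot` contains exactly one 1, while `ordinal` keeps a single column whose integers carry the ranking.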



## 3. Explore Categorical Columns

In [None]:
import pandas as pd

# Load data
df = pd.read_csv("../data/sample_data.csv")

# Identify categorical columns
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
df[cat_cols].head()
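
Before choosing an encoder, it often helps to check how many distinct values each categorical column has, since one-hot encoding adds one column per category. The sketch below uses a small made-up DataFrame (`toy_df`, with hypothetical `Country` and `Education_Level` values) instead of the notebook's CSV so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
toy_df = pd.DataFrame({
    "Country": ["France", "Japan", "France", "Brazil"],
    "Education_Level": ["Bachelor", "Master", "Bachelor", "PhD"],
    "Age": [25, 32, 41, 29],
})
toy_cat_cols = ["Country", "Education_Level"]

# Number of distinct categories per column
cardinality = toy_df[toy_cat_cols].nunique()
print(cardinality)

# Frequency of each category in one column
country_counts = toy_df["Country"].value_counts()
print(country_counts)
```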

## 4. OneHotEncoder

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Select one column to demonstrate
sample_df = df[["Country"]].copy()  # example column, adjust based on your dataset

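# OneHotEncoder expects 2D input, hence the double brackets above.
# `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2;
# on earlier versions, use `sparse=False` instead.
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")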

encoded = encoder.fit_transform(sample_df)

# Convert to DataFrame
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(["Country"]))
encoded_df.head()

## 5. OrdinalEncoder

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Use another column
sample_df2 = df[["Education_Level"]]  # example, change to suit your dataset

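# By default, OrdinalEncoder numbers categories in sorted (alphabetical) order,
# which may not match their real ranking; pass `categories=` to set the order explicitly.
ordinal_encoder = OrdinalEncoder()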
encoded_ord = ordinal_encoder.fit_transform(sample_df2)

df_encoded_ord = pd.DataFrame(encoded_ord, columns=["Education_Level_Ordinal"])
df_encoded_ord.head()
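
When the order matters, it can be given to the encoder explicitly through the `categories` parameter. The sketch below assumes four hypothetical education levels and an assumed ranking (`edu_order`); adjust both to the values actually present in your dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical values for illustration
toy_edu = pd.DataFrame(
    {"Education_Level": ["High School", "PhD", "Bachelor", "Master"]}
)

# Assumed ranking, lowest to highest
edu_order = ["High School", "Bachelor", "Master", "PhD"]

# `categories` takes one list per input column
ordered_encoder = OrdinalEncoder(categories=[edu_order])
toy_edu["Education_Level_Ordinal"] = ordered_encoder.fit_transform(
    toy_edu[["Education_Level"]]
).ravel()
print(toy_edu)
```

Without `categories`, the same data would be numbered alphabetically (Bachelor, High School, Master, PhD), which places "Bachelor" below "High School".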

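📝 Note: In the `OneHotEncoder` cell in Section 4, requesting dense output (`sparse_output=False`, or `sparse=False` on scikit-learn versions before 1.2) returns a regular NumPy array that is easy to read in a notebook. `handle_unknown="ignore"` is helpful during inference, when a category that was not seen during fitting appears.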

## 6. Dealing with Unknown Categories (Advanced Option)

In [None]:
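# During inference, OneHotEncoder may encounter categories it never saw during fit.
# With the default handle_unknown='error', transform() raises a ValueError;
# handle_unknown='ignore' encodes the unseen category as all zeros instead.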

# Simulate new/unseen data
new_data = pd.DataFrame({"Country": ["Atlantis"]})
encoder.transform(new_data)  # will return all 0s for "Atlantis"
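
`OrdinalEncoder` can also be made tolerant of unseen categories. The sketch below assumes scikit-learn 0.24 or later, where `handle_unknown="use_encoded_value"` and `unknown_value` are available; the `train` and `unseen` frames and the choice of `-1` as the placeholder code are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and inference data
train = pd.DataFrame({"Country": ["Brazil", "France", "Japan"]})
unseen = pd.DataFrame({"Country": ["Atlantis", "France"]})

# Map any category not seen during fit to -1 instead of raising an error
safe_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
safe_encoder.fit(train)

codes = safe_encoder.transform(unseen)
print(codes)
```

Here "Atlantis" receives the placeholder code, while "France" keeps the integer it was assigned during fitting. The placeholder must not collide with a code already used for a known category.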

## 7. Summary & What’s Next (Markdown)
### ✅ Summary

- We explored two main encoding methods:
  - **OneHotEncoder**: Great for unordered categories
  - **OrdinalEncoder**: Use when category order matters
- We also learned how to handle unknown categories safely

➡️ **Next Up**: `05_binning_numerical_features.ipynb`  
We'll learn how to group continuous variables into buckets using `KBinsDiscretizer`.
