# Encoding Categorical Features for ML Models

## 📚 Learning Objectives

By completing this notebook, you will:
- Encode categorical features for ML models
- Use label encoding
- Use one-hot encoding
- Apply encoding techniques to datasets
- Understand when to use each method

## 🔗 Prerequisites

- ✅ Understanding of ML concepts
- ✅ Understanding of categorical data
- ✅ Scikit-learn knowledge

---

## Official Structure Reference

This notebook covers practical activities from **Course 01, Unit 2**:
- Encoding categorical features for ML models
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 2 Practical Content

---

## Introduction

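**Categorical encoding** converts categorical values into numbers that machine learning models can process. This notebook covers two common techniques: label encoding, which maps each category to an integer, and one-hot encoding, which creates one binary column per category.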

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

print("✅ Libraries imported!")
print("\nEncoding Categorical Features")
print("=" * 60)

# Sample categorical data
data = {
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['small', 'large', 'medium', 'small', 'large'],
    'category': ['A', 'B', 'A', 'C', 'B']
}

df = pd.DataFrame(data)
print("\nOriginal data:")
print(df)
print(f"\nData types:\n{df.dtypes}")

print("\n✅ Data prepared!")

In [None]:
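# Label Encoding - Assigns integer labels
print("=" * 60)
print("LABEL ENCODING")
print("=" * 60)

# Columns to encode, listed explicitly so the loop does not rely on dtype inference
categorical_cols = ['color', 'size', 'category']

# Keep one fitted encoder per column; reusing a single encoder would
# overwrite its learned classes on every fit and lose the earlier mappings
label_encoders = {}

for col in categorical_cols:
    encoder = LabelEncoder()
    df[f'{col}_encoded'] = encoder.fit_transform(df[col])
    label_encoders[col] = encoder
    print(f"\n{col} encoding:")
    print(f"  Original: {df[col].unique()}")
    print(f"  Encoded: {df[f'{col}_encoded'].unique()}")
    # classes_ is sorted; the position of each class is its integer code
    print(f"  Classes (index = code): {list(encoder.classes_)}")

print("\n✅ Label encoding completed!")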
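
`LabelEncoder` assigns codes in sorted order of the labels, so for `size` it yields large=0, medium=1, small=2, which does not match the natural size order. When the order matters, one option is scikit-learn's `OrdinalEncoder` with an explicit category list. The sketch below assumes the ordering small < medium < large, which the sample data does not state, and uses a standalone `sizes` DataFrame introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Standalone copy of the 'size' column from the sample data
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small', 'large']})

# Assumed ordering for illustration: small < medium < large
size_order = [['small', 'medium', 'large']]

ordinal_encoder = OrdinalEncoder(categories=size_order)

# OrdinalEncoder expects 2-D input and returns a 2-D float array,
# so flatten it and cast to int before storing it as a column
sizes['size_ordinal'] = (
    ordinal_encoder.fit_transform(sizes[['size']]).ravel().astype(int)
)

print(sizes)
```

With this ordering, `small` maps to 0, `medium` to 1, and `large` to 2, so comparisons between codes reflect the assumed size order.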

In [None]:
# One-Hot Encoding - Creates binary columns
print("=" * 60)
print("ONE-HOT ENCODING")
print("=" * 60)

# Using pandas get_dummies
df_onehot = pd.get_dummies(df[['color', 'size', 'category']], prefix=['color', 'size', 'category'])
print("\nOne-hot encoded data:")
print(df_onehot)

print(f"\nOriginal columns: {df[['color', 'size', 'category']].shape[1]}")
print(f"One-hot columns: {df_onehot.shape[1]}")

print("\n✅ One-hot encoding completed!")

## Summary

This notebook covered:
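- ✅ **Label Encoding**: Assigns integer labels to categories
- ✅ **One-Hot Encoding**: Creates binary columns for each category
- ✅ **When to use**: Integer encoding suits ordinal data, provided the integers follow the real category order (`LabelEncoder` assigns codes in sorted label order, which may not match it); one-hot encoding suits nominal data, where categories have no inherent order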

Categorical encoding is essential for preparing data for machine learning models.