# Handling Missing Values in ML: A Beginner's Story

Imagine a classroom with empty chairs. Some students are sick, some forgot, some skipped. Each absence has a story. Missing values in your dataset are just like those empty chairs.

## Why Data Goes Missing

**MCAR (Random Accident):** Missing Completely At Random. Sarah's phone died mid-survey, so her data vanished by pure chance, unrelated to anything about her or her answers.

**MAR (Predictable Pattern):** Missing At Random. Young patients skip "retirement status" questions, so the missingness is explained by age, a column you did observe, rather than by the hidden answer itself.

**MNAR (Hidden Truth):** Missing Not At Random. Billionaires leave income blank *because* it's high. The missingness depends on the very value that is missing, so the absence itself reveals something.
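A first look at which story you might be in can be sketched with pandas. The toy table and its column names below are made up for illustration, and a gap between the two groups is only a hint of a MAR-like pattern, not proof.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: retirement_status is blank for the younger patients.
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 60, 67],
    "retirement_status": [np.nan, np.nan, np.nan, "working",
                          "working", "working", "retired", "retired"],
})

# Share of missing values in each column.
missing_share = df.isna().mean()

# Label each row by whether retirement_status is missing, then compare
# an observed column (age) between the two groups.
status = df["retirement_status"].isna().map({True: "missing", False: "observed"})
mean_age_by_status = df.groupby(status)["age"].mean()

print(missing_share)
print(mean_age_by_status)
```

If the average age were about the same in both groups, the gaps would look more like a random accident. MNAR is the hard case: the data alone usually cannot confirm it, because the evidence is exactly what is missing.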

## Deletion: The Art of Letting Go

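**Listwise Deletion:** Like a librarian discarding books with torn pages. Raj deleted rows with missing values, and his 10,000 records became 3,000. Simple, but costly: he lost 70% of his data, and unless the values are MCAR, the rows that survive may no longer represent the full dataset.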
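Raj's approach is a one-liner in pandas. The sketch below uses a small invented table so the cost is easy to see; the column names are assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps scattered across columns.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 29, 42],
    "income": [52000, 61000, np.nan, 48000, 75000],
    "city": ["Pune", "Delhi", "Mumbai", None, "Chennai"],
})

# Listwise deletion: drop every row that has at least one missing value.
complete_rows = df.dropna()

rows_lost = len(df) - len(complete_rows)
print(f"Kept {len(complete_rows)} of {len(df)} rows ({rows_lost} dropped)")
```

Each of the three incomplete rows is missing only one value, yet all three are thrown away whole. That is how scattered gaps can shrink a dataset so quickly.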

**Column Deletion:** Priya dropped an 80% empty "fax number" column. Sometimes whole features aren't worth keeping.
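Priya's decision can be sketched as a threshold on each column's share of missing values. The 50% cutoff below is an assumption chosen for the example, not a standard value.

```python
import numpy as np
import pandas as pd

# Hypothetical contact table where fax_number is 80% empty.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "fax_number": [np.nan, np.nan, "555-0100", np.nan, np.nan],
    "email": ["a@x.org", None, "c@x.org", "d@x.org", "e@x.org"],
})

max_missing_share = 0.5  # assumed cutoff; pick one that suits your data

# Fraction of missing values per column, then the columns above the cutoff.
missing_share = df.isna().mean()
cols_to_drop = missing_share[missing_share > max_missing_share].index.tolist()

trimmed = df.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Before dropping, it is worth asking whether the emptiness itself carries meaning, since a mostly empty column can still be useful as a yes/no signal.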

## Imputation: Filling the Gaps

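**Mean/Median/Mode:** Tom filled blanks with averages: the mean or median for numbers, the mode for categories. Quick, but everyone started looking suspiciously "average": the column's spread shrinks, and its relationships with other columns weaken.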
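Here is a minimal pandas sketch of Tom's method on an invented table. It also measures the "suspiciously average" effect by comparing the standard deviation before and after filling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
df = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0, 90.0],
    "city": ["Pune", "Pune", None, "Delhi", "Pune"],
})

filled = df.copy()
filled["age_mean"] = df["age"].fillna(df["age"].mean())      # fills with 45.0
filled["age_median"] = df["age"].fillna(df["age"].median())  # fills with 35.0
filled["city"] = df["city"].fillna(df["city"].mode().iloc[0])  # most frequent value

# Spread of the age column before and after mean imputation.
std_before = df["age"].std()
std_after = filled["age_mean"].std()
print(f"std before: {std_before:.2f}, after mean fill: {std_after:.2f}")
```

The mean (45) is pulled up by the single 90-year-old, while the median (35) is not, which is why the median is often the safer choice for skewed columns.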

**KNN Imputation:** Maria estimated a missing house price by looking at similar nearby houses. Realistic, but slow on large datasets, because every gap needs a search for its nearest neighbors.
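Maria's idea maps onto scikit-learn's `KNNImputer`. The house data below is invented, and `n_neighbors=2` is an arbitrary choice for the sketch.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical houses. Columns: size in sqft, bedrooms, price in thousands.
X = np.array([
    [1000.0, 2.0, 200.0],
    [1050.0, 2.0, 210.0],
    [2000.0, 4.0, 400.0],
    [2100.0, 4.0, 420.0],
    [1020.0, 2.0, np.nan],  # the house with the missing price
])

# Fill each gap with the average of that column over the 2 most similar rows,
# where similarity is measured on the columns both rows have observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

estimated_price = X_filled[4, 2]
print(estimated_price)
```

The two small houses are the nearest neighbors, so the estimate is the average of their prices rather than the average of all four. On real data, features on very different scales (square feet versus bedrooms) are usually standardized first, otherwise the largest numbers dominate the distance.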

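**Iterative (MICE):** Short for Multiple Imputation by Chained Equations. Ahmed filled gaps column by column, predicting each incomplete column from the others and repeating the cycle for several rounds until the estimates settled. Powerful but complex.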
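A rough sketch of Ahmed's approach uses scikit-learn's `IterativeImputer`. That class is still marked experimental and has to be enabled with a special import; it is inspired by MICE but returns a single filled-in dataset rather than several. The synthetic data and the 10% missing rate are assumptions for the demo.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, unlocks IterativeImputer
from sklearn.impute import IterativeImputer

# Synthetic data: three columns that are strongly related to each other.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X_true = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    -x + rng.normal(scale=0.1, size=200),
])

# Knock out roughly 10% of the cells at random.
X = X_true.copy()
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan

# Each incomplete column is modelled from the others, for up to 10 rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

print(np.isnan(X_filled).sum())
```

Only the gaps are changed; every value that was observed comes back exactly as it went in.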

## Smart Algorithms & Indicators

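**Self-Handling Models:** XGBoost, LightGBM, and CatBoost accept missing numeric values natively, with no imputation step, like chefs cooking with missing ingredients. Each library has its own rule for routing those values, so check its defaults before relying on it.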

**Indicator Method:** David created a "was_income_missing" column before imputing. The absence itself became a useful feature.
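David's trick takes two lines in pandas. The order matters: record the gaps first, then fill them. The income figures below are invented, and the median fill is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with two blanks.
df = pd.DataFrame({"income": [52000.0, np.nan, 61000.0, np.nan, 48000.0]})

# Step 1: remember where the gaps were (1 = missing, 0 = observed).
df["was_income_missing"] = df["income"].isna().astype(int)

# Step 2: only now fill the gaps, here with the median of the observed values.
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```

If the steps were swapped, the indicator column would be all zeros, because there would be nothing left to flag.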

## Best Practices

1. **Investigate first:** understand the patterns before acting
2. **Avoid data leakage:** fit the imputer on the training data only, then apply it unchanged to validation and test data
3. **Document everything:** your future self will thank you
4. **Test multiple approaches:** let results guide your choice
5. **Use domain knowledge:** a fresh graduate's missing "years of experience" most likely means zero, not the average
6. **Match strategy to scale:** as a rough rule of thumb, under 5% missing suits simple methods; over 25% missing, question whether the feature is worth keeping
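The data leakage rule is the easiest one to break by accident, so here is a hedged sketch of one way to follow it with a scikit-learn pipeline. The synthetic data, the median strategy, and the logistic regression model are all assumptions for the demo.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data with about 10% of the cells blanked out.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# Split BEFORE imputing, so the test rows cannot influence the fill values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline learns the medians from X_train during fit and reuses
# those same medians when it sees X_test.
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# The fill values the imputer actually learned (training medians only).
train_medians = model.named_steps["simpleimputer"].statistics_
print(train_medians, test_accuracy)
```

Wrapping the imputer and the model together also keeps cross-validation honest, because the imputer is refitted inside each fold instead of once on the whole dataset.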

## Quick Reference

| Situation | Approach |
|-----------|----------|
| Few missing, random | Deletion or mean/median |
| Relationships between features matter | KNN imputation |
| Gaps in several related columns | Iterative (MICE) |
| Gradient-boosted trees (XGBoost, LightGBM, CatBoost) | Let the algorithm handle it |
| Missingness is meaningful | Create indicator columns |