
🧪 Notebook: 02_handling_missing_values.ipynb
This notebook helps beginners understand what missing data is, why it matters, and how to handle it with different imputation strategies. Handling missing values is one of the most important steps in any preprocessing pipeline.

📌 Notebook Sections

1. Title & Introduction
2. Detect Missing Values
3. Visualize Missingness (optional but helpful)
4. Impute Missing Numerical Data
5. Impute Missing Categorical Data
6. Add Missing Indicator Columns (advanced, optional)
7. Summary & What’s Next

### 1. Title & Introduction
### 🧼 02 — Handling Missing Values

In this notebook, we'll learn:

- How to detect missing values in a dataset
- Different strategies to fill (impute) missing values
- How to treat missing values in numeric and categorical features
- When to add "missing indicators"

👉 We'll use `SimpleImputer` from `scikit-learn` to demonstrate!


### 2. Detect Missing Values

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv("../data/sample_data.csv")

# Check how many missing values per column
df.isnull().sum()
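
Raw counts are hard to compare across datasets of different sizes, so it can help to look at the share of missing values per column as well. The following is a minimal sketch on a small made-up DataFrame; the `toy` frame and its column names are illustrative assumptions, not part of `sample_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Lahore", None, None, "Karachi"],
    "income": [50000, 62000, 58000, 45000],
})

# Count and percentage of missing values per column
missing_count = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100

# Keep only the columns that actually have gaps
report = pd.DataFrame({"missing_count": missing_count, "missing_pct": missing_pct})
print(report[report["missing_count"] > 0])
```

The same two expressions should work unchanged on `df` once the real dataset is loaded.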

### 3. Visualize Missingness (Optional)

In [None]:
import matplotlib.pyplot as plt

df.isnull().sum().sort_values(ascending=False).plot.bar()
plt.title("Missing Values per Column")
plt.ylabel("Count")
plt.show()

### 4. Impute Missing Values — Numerical Columns

In [None]:
from sklearn.impute import SimpleImputer

# Identify numeric columns with missing values
# ("number" covers every numeric dtype, not just float64 and int64)
numeric_cols = df.select_dtypes(include="number").columns
numeric_with_na = [col for col in numeric_cols if df[col].isnull().any()]

# Apply median imputation
num_imputer = SimpleImputer(strategy="median")
df[numeric_with_na] = num_imputer.fit_transform(df[numeric_with_na])

# Confirm missing values are gone
df[numeric_with_na].isnull().sum()
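
One common reason to prefer the median over the mean is that the median is much less sensitive to outliers. The sketch below compares the two strategies on a made-up income column with a single extreme value; the numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical incomes with one extreme value and one missing entry
toy = pd.DataFrame({"income": [30.0, 32.0, 35.0, 1000.0, np.nan]})

# Fill the gap with the mean and with the median of the observed values
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

print("mean fill:  ", mean_filled[-1, 0])    # (30 + 32 + 35 + 1000) / 4 = 274.25
print("median fill:", median_filled[-1, 0])  # (32 + 35) / 2 = 33.5
```

The mean is pulled toward the outlier, while the median stays close to the typical values, which is usually what we want for a filled-in entry.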


### 5. Impute Missing Values — Categorical Columns

In [None]:
# Identify categorical columns with missing values
categorical_cols = df.select_dtypes(include=["object", "category"]).columns
cat_with_na = [col for col in categorical_cols if df[col].isnull().sum() > 0]

# Use a constant fill like "Missing"
cat_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
df[cat_with_na] = cat_imputer.fit_transform(df[cat_with_na])

# Confirm missing values are gone
df[cat_with_na].isnull().sum()
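
A constant such as "Missing" keeps the gaps visible as their own category. An alternative is to fill with the most frequent category via `strategy="most_frequent"`. The sketch below shows both side by side on a made-up `city` column (the data is an assumption for illustration, not taken from `sample_data.csv`).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry
toy = pd.DataFrame({"city": ["Lahore", "Lahore", "Karachi", np.nan]}, dtype=object)

# Option A: replace the gap with the most common category
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy)

# Option B: replace the gap with an explicit placeholder category
const_filled = SimpleImputer(strategy="constant", fill_value="Missing").fit_transform(toy)

print("most_frequent:", mode_filled[:, 0])
print("constant:     ", const_filled[:, 0])
```

Filling with the mode hides the fact that a value was absent, so the constant placeholder is often the safer default when missingness itself may be informative.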


### 6. Add Missing Indicators (Optional)
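Indicators are helpful when the fact that a value is missing carries predictive signal (e.g., an applicant who reports no income might have a lower chance of loan approval). They must be built from the data as it was before imputation: once the values are filled in, `isnull()` returns `False` everywhere and every indicator would be 0.

In [None]:
# Section 4 already filled the NaNs in df, so read the mask from a fresh copy of the raw data
raw = pd.read_csv("../data/sample_data.csv")
for col in numeric_with_na:
    df[col + "_was_missing"] = raw[col].isnull().astype(int)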
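
`SimpleImputer` can also produce the indicator columns itself through its `add_indicator=True` option, which avoids having to get the order of the two steps right. The sketch below is one way to use it on a made-up frame; the `toy` data and the `_was_missing` naming are assumptions chosen to mirror the loop above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data: two columns have gaps, one is complete
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [50.0, 60.0, np.nan],
    "score": [1.0, 2.0, 3.0],
})

# Impute with the median and append one indicator column per feature that had gaps
imputer = SimpleImputer(strategy="median", add_indicator=True)
out = imputer.fit_transform(toy)

# indicator_.features_ holds the positions of the columns that contained missing values
indicator_names = [f"{toy.columns[i]}_was_missing" for i in imputer.indicator_.features_]
result = pd.DataFrame(out, columns=list(toy.columns) + indicator_names)
print(result)
```

By default only columns that had missing values during `fit` get an indicator, so `score` gets no extra column here.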


## ✅ Summary

In this notebook, we:

- Identified missing values in both numeric and categorical columns
- Used `SimpleImputer` to fill missing values with the median (numeric) and a constant (categorical)
- Added `_was_missing` indicator columns to record where numeric values were originally missing
- Learned that handling missing data is essential for building reliable ML models

➡️ **Next Up**: Scaling and Normalizing numeric features  
Check out `03_scaling_features.ipynb` to learn about `StandardScaler`, `MinMaxScaler`, and more!
