
# Data Cleaning: Handling Missing Data

Learn to detect, understand, and handle missing values with pandas, a critical preprocessing step before building ML models.

## 1. Setup and Imports

In [None]:
import pandas as pd
import numpy as np

## 2. Create or Load Sample Data

Create an inline dataset with intentional missing values.

In [None]:
data = {
    'Age': [25, 30, np.nan, 45, 50, np.nan],
    'Income': [50000, np.nan, 72000, 60000, np.nan, 80000],
    'Gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Male'],
    'Region': ['North', 'South', 'East', np.nan, 'West', 'East']
}
df = pd.DataFrame(data)
df

## 3. Identify Missing Data

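`df.info()` lists the non-null count for each column, so gaps show up quickly.

`df.isnull().sum()` counts the missing values (NaN/None) in each column.

`df.isnull().mean() * 100` gives the percentage of missing values in each column.

As a rule of thumb, a column with more than 20% missing values is a red flag and deserves a closer look before you drop or impute it.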

In [None]:
df.isnull().sum()

In [None]:
df.isnull().mean() * 100  # % missing
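
The count and percentage can be combined into a single summary table, with a flag for columns that exceed the 20% rule of thumb. This is a minimal sketch; the names `missing_summary` and `THRESHOLD_PCT` are illustrative and not part of pandas.

```python
import numpy as np
import pandas as pd

# Same toy dataset as in section 2
df = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, 50, np.nan],
    'Income': [50000, np.nan, 72000, 60000, np.nan, 80000],
    'Gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Male'],
    'Region': ['North', 'South', 'East', np.nan, 'West', 'East'],
})

# One row per column: how many values are missing and what share that is
missing_summary = pd.DataFrame({
    'n_missing': df.isnull().sum(),
    'pct_missing': df.isnull().mean() * 100,
})

# Hypothetical cutoff taken from the 20% rule of thumb
THRESHOLD_PCT = 20
missing_summary['flag'] = missing_summary['pct_missing'] > THRESHOLD_PCT

print(missing_summary)
```

On this dataset, `Age` and `Income` (2 of 6 values missing, about 33%) are flagged, while `Gender` and `Region` (1 of 6, about 17%) are not.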

## 4. Visualize Missing Data

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False, cmap='Reds')
plt.title("Missing Data Heatmap")
plt.show()

## 5. Handling Missing Data: Deletion

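When to use:
* Only a small share of values is missing (a common rule of thumb is <5%).
* Values are Missing Completely at Random (MCAR): the missingness is unrelated to any observed or unobserved values, so dropping rows does not bias the data.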

In [None]:
df_drop_rows = df.dropna()
df_drop_cols = df.dropna(axis=1)

In [None]:
df_drop_rows

In [None]:
df_drop_cols  # note: empty, because every column contains at least one NaN
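
Plain `dropna()` is aggressive on this dataset: only one complete row survives, and no column survives. A less drastic sketch uses the `subset` and `thresh` parameters of `DataFrame.dropna`; the variable names below are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, 50, np.nan],
    'Income': [50000, np.nan, 72000, 60000, np.nan, 80000],
    'Gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Male'],
    'Region': ['North', 'South', 'East', np.nan, 'West', 'East'],
})

# Drop a row only when a specific column is missing
df_subset = df.dropna(subset=['Age'])

# Keep rows that have at least 3 non-missing values
df_thresh = df.dropna(thresh=3)

# Keep columns that have at least 5 non-missing values
df_cols = df.dropna(axis=1, thresh=5)

print(len(df_subset), len(df_thresh), list(df_cols.columns))
# 4 5 ['Gender', 'Region']
```

Which variant is appropriate depends on which columns the downstream model actually needs.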

## 6. Handling Missing Data: Imputation

When to use:
* **Mean/median:** numeric columns.
  * Mean: the average; suits roughly symmetric (e.g., normal) distributions.
  * Median: the middle value; use it when outliers or skew would distort the mean.
* **Mode:** categorical columns with a clear dominant category.
  * `pandas.Series.mode` returns the most frequent value(s); `[0]` selects the first.
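
To see why the median is preferred when outliers are present, and why `mode()` is followed by `[0]`, here is a small sketch. Both series are hypothetical and invented for illustration only.

```python
import pandas as pd

# Hypothetical incomes with one extreme outlier
incomes = pd.Series([48000, 52000, 55000, 60000, 1000000])

print(incomes.mean())    # 243000.0, pulled far upward by the outlier
print(incomes.median())  # 55000.0, unaffected by the outlier

# mode() returns a Series because several values can tie for most frequent
tied = pd.Series(['A', 'B', 'A', 'B'])
print(tied.mode().tolist())  # ['A', 'B']
print(tied.mode()[0])        # 'A' (first of the sorted ties)
```

When categories tie, `mode()[0]` silently picks the first one in sorted order, so it is worth checking the value counts before imputing.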

In [None]:
# Before impute
df

In [None]:
# Mean / Median (Numerical)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Income'] = df['Income'].fillna(df['Income'].median())
df

In [None]:
# Mode (Categorical)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])  # 'Male' is the most frequent value
df['Region'] = df['Region'].fillna(df['Region'].mode()[0])  # 'East' is the most frequent value
df
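
The cells above overwrite `df` in place, which is why later sections rebuild the dataset. One possible alternative is a helper that picks a strategy by dtype and returns a copy. This is a sketch under assumptions: `impute_simple` is a hypothetical name, and it uses the median for every numeric column, which differs from the mean used for `Age` above (on this dataset the observed `Age` values have the same mean and median, 37.5).

```python
import numpy as np
import pandas as pd

def impute_simple(frame):
    """Return a copy with numeric NaNs filled by the median and others by the mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode()
            if not modes.empty:  # a fully missing column has no mode
                out[col] = out[col].fillna(modes.iloc[0])
    return out

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, 50, np.nan],
    'Income': [50000, np.nan, 72000, 60000, np.nan, 80000],
    'Gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Male'],
    'Region': ['North', 'South', 'East', np.nan, 'West', 'East'],
})

df_imputed = impute_simple(df)
print(df_imputed)
```

Because the helper works on a copy, the original `df` keeps its missing values and can be reused to compare other methods.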

## 7. Forward / Backward Fill (Time-Series Style)

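When to use:
* Time-dependent data (e.g., sensor readings or stock prices), where the previous or next observation is a reasonable stand-in for a missing one.

The sample rows here are not time-ordered; they are reused only to show the mechanics.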

In [None]:
# Reset data to missing values
data = {
    'Age': [25, 30, np.nan, 45, 50, np.nan],
    'Income': [50000, np.nan, 72000, 60000, np.nan, 80000],
    'Gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Male'],
    'Region': ['North', 'South', 'East', np.nan, 'West', 'East']
}
df = pd.DataFrame(data)

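# Forward fill: propagate the last valid observation down each column
df_ffill = df.ffill()

# Original data, for comparison
df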

In [None]:
df_ffill

In [None]:
# Reset data to missing values
data = {
    'Age': [25, 30, np.nan, 45, 50, np.nan],
    'Income': [50000, np.nan, 72000, 60000, np.nan, 80000],
    'Gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Male'],
    'Region': ['North', 'South', 'East', np.nan, 'West', 'East']
}
df = pd.DataFrame(data)

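# Backward fill: use the next valid observation in each column
df_bfill = df.bfill()

# The last Age value stays NaN because no later value exists to fill it
df_bfill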
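
Forward fill cannot fill leading gaps and backward fill cannot fill trailing ones. The sketch below shows both edge cases on a hypothetical series (values invented for illustration), plus the `limit` parameter, which caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a leading gap, a two-step gap, and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

print(s.ffill().tolist())          # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]
print(s.bfill().tolist())          # [1.0, 1.0, 4.0, 4.0, 4.0, nan]

# Chaining covers both ends
print(s.ffill().bfill().tolist())  # [1.0, 1.0, 1.0, 1.0, 4.0, 4.0]

# Fill at most one consecutive gap, so long outages stay visible
print(s.ffill(limit=1).tolist())   # [nan, 1.0, 1.0, nan, 4.0, 4.0]
```

Whether chaining is appropriate depends on the data: back-filling uses a future value, which may not be acceptable when the model must only see past information.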

## 8. Advanced Method (KNNImputer)

In [None]:
from sklearn.impute import KNNImputer
# Reset data to missing values
data = {
    'Age': [25, 30, np.nan, 45, 50, np.nan],
    'Income': [50000, np.nan, 72000, 60000, np.nan, 80000],
    'Gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Male'],
    'Region': ['North', 'South', 'East', np.nan, 'West', 'East']
}
df = pd.DataFrame(data)
df



In [None]:
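# KNNImputer works on numeric data only, so select the numeric columns first
num_cols = df.select_dtypes(include=[np.number]).columns
knn = KNNImputer(n_neighbors=2)  # each gap is filled from the 2 most similar rows
df_knn = pd.DataFrame(knn.fit_transform(df[num_cols]),
                      columns=num_cols, index=df.index)
df_knn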
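
The result above contains only the numeric columns. One way to keep the categorical columns alongside the imputed values is to write the result back into a copy of the full DataFrame. This is a sketch, with `df_out` and `num_cols` as illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, 50, np.nan],
    'Income': [50000, np.nan, 72000, 60000, np.nan, 80000],
    'Gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Male'],
    'Region': ['North', 'South', 'East', np.nan, 'West', 'East'],
})

# Numeric columns are imputed; everything else is left untouched
num_cols = list(df.select_dtypes(include=[np.number]).columns)

df_out = df.copy()
df_out[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

print(df_out)
print(df_out.isnull().sum())
```

`Gender` and `Region` still contain their missing values afterwards and need a separate step such as mode imputation. Note also that `KNNImputer` measures similarity with a NaN-aware Euclidean distance by default, so when several numeric features with very different scales are involved, scaling them first is usually worth considering.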

## 9. Validate Cleaning

In [None]:
# Reset data to missing values
data = {
    'Age': [25, 30, np.nan, 45, 50, np.nan],
    'Income': [50000, np.nan, 72000, 60000, np.nan, 80000],
    'Gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Male'],
    'Region': ['North', 'South', 'East', np.nan, 'West', 'East']
}
df = pd.DataFrame(data)

# Fill missing values: mean for Age, median for Income, mode for the categorical columns
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Income'] = df['Income'].fillna(df['Income'].median())
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])  # 'Male' is the most frequent value
df['Region'] = df['Region'].fillna(df['Region'].mode()[0])  # 'East' is the most frequent value
df

In [None]:
df.isnull().sum()  # every column should now report 0

In [None]:
df.describe(include='all')
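
Eyeballing the counts works in a notebook, but a check that fails loudly is easier to reuse in a pipeline. The helper below is a hypothetical sketch (`validate_no_missing` is not a pandas function) that raises if any missing values remain.

```python
import numpy as np
import pandas as pd

def validate_no_missing(frame, columns=None):
    """Raise ValueError if any of the given columns still contain missing values."""
    cols = list(frame.columns) if columns is None else list(columns)
    remaining = frame[cols].isnull().sum()
    remaining = remaining[remaining > 0]
    if not remaining.empty:
        raise ValueError(f"Missing values remain: {remaining.to_dict()}")
    return True

raw = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, 50, np.nan],
    'Income': [50000, np.nan, 72000, 60000, np.nan, 80000],
})

# Same mean/median imputation as in the cell above
clean = raw.copy()
clean['Age'] = clean['Age'].fillna(clean['Age'].mean())
clean['Income'] = clean['Income'].fillna(clean['Income'].median())

print(validate_no_missing(clean))  # True

try:
    validate_no_missing(raw)
except ValueError as err:
    print(err)  # reports the columns that still have gaps
```

Imputation should not change the number of rows, so comparing `len(clean)` with `len(raw)` is another cheap sanity check.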

## 10. Wrap-Up

Key takeaways:
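* Missing data shows up in almost every real-world dataset; handle it deliberately rather than ignoring it.
* Choose a method based on data type, missingness pattern (e.g., MCAR), and the percentage missing.
* Always document your cleaning decisions.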
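
One lightweight way to document cleaning decisions is to record them while imputing. The sketch below is only an illustration; the `strategies` mapping and the `cleaning_log` structure are assumptions, not an established convention.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, 50, np.nan],
    'Income': [50000, np.nan, 72000, 60000, np.nan, 80000],
    'Gender': ['Male', 'Female', np.nan, 'Female', 'Male', 'Male'],
    'Region': ['North', 'South', 'East', np.nan, 'West', 'East'],
})

# Hypothetical mapping of column -> imputation strategy
strategies = {'Age': 'mean', 'Income': 'median', 'Gender': 'mode', 'Region': 'mode'}

cleaning_log = []
for col, strategy in strategies.items():
    n_missing = int(df[col].isnull().sum())
    if strategy == 'mean':
        value = df[col].mean()
    elif strategy == 'median':
        value = df[col].median()
    else:
        value = df[col].mode()[0]
    df[col] = df[col].fillna(value)
    # Record what was done so the decision can be reviewed later
    cleaning_log.append({'column': col, 'strategy': strategy,
                         'fill_value': value, 'n_filled': n_missing})

log_df = pd.DataFrame(cleaning_log)
print(log_df)
```

Saving `log_df` next to the cleaned data gives reviewers a record of which values were imputed and how.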