# **Missing data:**

Missing data can introduce bias, distort results, and hide real patterns, so we must learn how to handle it.

#### **1. Types of missing data:**

**`1.1 MCAR - Missing Completely At Random:`**

Missingness is unrelated to any variable, observed or unobserved; it is purely random.

**Example:**
A sensor glitch randomly drops readings.

**Impact:**
- No systematic bias
- Safe to drop rows (usually)
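
To make the "no systematic bias" point concrete, here is a minimal sketch that simulates MCAR on synthetic sensor readings. The data, the 10% drop rate, and the variable names are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic sensor readings (illustrative data, not from a real device)
readings = pd.Series(rng.normal(loc=20.0, scale=2.0, size=10_000))

# MCAR: every reading has the same 10% chance of being dropped,
# regardless of its own value or any other variable
dropped = rng.random(readings.size) < 0.10
observed = readings.mask(dropped)

full_mean = readings.mean()
observed_mean = observed.mean()  # NaN values are skipped by default

print(f"Mean of all readings:      {full_mean:.2f}")
print(f"Mean of observed readings: {observed_mean:.2f}")
```

The two means should come out very close, which is why dropping MCAR rows mainly costs sample size rather than accuracy.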

**`1.2 MAR - Missing At Random:`**

Missingness depends on other observed variables, not on the missing value itself.

**Example:**
Income is missing more often for younger respondents.

**Impact:**
- Can introduce bias
- Often fixable using informed imputation
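
One possible form of "informed imputation" is to fill each gap using the variable that drives the missingness. The sketch below assumes a hypothetical survey with an `age_group` column and fills missing income with the median of the respondent's own group; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey: income is missing more often for younger respondents
survey = pd.DataFrame({
    "age_group": ["young", "young", "young", "young",
                  "older", "older", "older", "older"],
    "income": [30.0, np.nan, 34.0, np.nan,
               80.0, 90.0, 100.0, np.nan],
})

# Median income within each age group (NaN values are ignored)
group_median = survey.groupby("age_group")["income"].transform("median")

# Fill each gap with the median of that respondent's own group
survey["income_filled"] = survey["income"].fillna(group_median)

print(survey)
```

A single overall median would pull the young respondents' imputed income toward the older group's values; filling within groups avoids that.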

**`1.3 MNAR - Missing Not At Random:`**

Missingness depends on the missing value itself.

**Example:**
High earners skip the income question.

**Impact:**
- Serious bias risk
- Requires domain understanding
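
The bias risk can be illustrated with a small simulation. This is a sketch on synthetic incomes; the 80th-percentile cutoff and the skip probabilities are arbitrary assumptions, not estimates from real surveys.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# Synthetic, right-skewed incomes (illustrative only)
income = pd.Series(rng.lognormal(mean=4.0, sigma=0.5, size=10_000))

# MNAR: the chance of skipping the question depends on income itself.
# Assumed rates: 70% skip above the 80th percentile, 5% otherwise.
cutoff = income.quantile(0.80)
skip_prob = np.where(income > cutoff, 0.70, 0.05)
skipped = rng.random(income.size) < skip_prob
reported = income.mask(skipped)

true_mean = income.mean()
reported_mean = reported.mean()  # computed only on reported values

print(f"True mean income:     {true_mean:.1f}")
print(f"Reported mean income: {reported_mean:.1f}")
```

The reported mean falls below the true mean, and nothing in the observed data alone reveals by how much. That is why MNAR calls for domain knowledge about the reason values are missing.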

#### **2. Bias from dropping rows:**

`df.dropna()` looks simple, but it can silently bias the data. A few things to consider:

**Safe when:**
- Mostly MCAR
- Small percentage missing

**Dangerous when:**
- Systematic missingness
- Large data loss

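**Example bias:** If the rows with missing values belong mostly to high-income respondents, dropping them removes the top of the income distribution, and the model underestimates income.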

In [2]:
import pandas as pd
import numpy as np

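# Toy data: "age" is missing in two rows that also have above-average income
df = pd.DataFrame({
    "age": [25, 30, np.nan, 40, np.nan],
    "income": [50, 60, 80, 90, 100]
})

print("Original:\n", df)

# dropna() removes every row that contains at least one NaN
print("\nAfter dropna:\n", df.dropna())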

Original:
     age  income
0  25.0      50
1  30.0      60
2   NaN      80
3  40.0      90
4   NaN     100

After dropna:
     age  income
0  25.0      50
1  30.0      60
3  40.0      90
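
To see how much dropping rows shifts a summary statistic, one option is to compare the income mean before and after `dropna()` and to check the share of missing values per column first. This is a small sketch on the same toy data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, np.nan, 40, np.nan],
    "income": [50, 60, 80, 90, 100],
})

# Share of missing values per column: a quick check before dropping anything
missing_share = df.isna().mean()
print(missing_share)

# Mean income on all rows vs. only on rows that survive dropna()
mean_all = df["income"].mean()
mean_complete = df.dropna()["income"].mean()

print(f"Mean income, all rows:      {mean_all:.2f}")
print(f"Mean income, complete rows: {mean_complete:.2f}")
```

Here 40% of the `age` values are missing, and dropping those rows lowers the mean income from 76.00 to 66.67, because the dropped rows happen to carry higher incomes.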


#### **3. Mean vs Median imputation:**

Imputation means replacing missing values with estimates.

**`3.1 Mean imputation:`** Replace missing values with the average.

**Pros:**
- Easy
- Preserves dataset size

**Cons:**
- Sensitive to outliers
- Reduces variance
- Can distort distributions

In [None]:
# Fill only the "age" column; df.fillna(value) would put the age mean into every column
df["age"] = df["age"].fillna(df["age"].mean())
print("\nAfter fillna:\n", df)


After fillna:
          age  income
0  25.000000      50
1  30.000000      60
2  31.666667      80
3  40.000000      90
4  31.666667     100
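
The "reduces variance" drawback can be checked directly. The sketch below reuses the same toy ages and compares the standard deviation before and after mean imputation; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

age = pd.Series([25, 30, np.nan, 40, np.nan])

# Replace the gaps with the mean of the observed values
age_mean_filled = age.fillna(age.mean())

std_before = age.std()              # based on the 3 observed values
std_after = age_mean_filled.std()   # based on all 5 values after filling

print(f"Std before imputation: {std_before:.2f}")
print(f"Std after imputation:  {std_after:.2f}")
```

The standard deviation drops from about 7.64 to about 5.40: every imputed value sits exactly at the mean, so it adds rows without adding any spread.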


**`3.2 Median imputation:`** Replace missing values with the median (the middle value of the sorted data).

**Pros:**
- Robust to skew and outliers
- Usually a better fit for skewed real-world data

**Cons:**
- Still reduces variability

In [16]:
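# Rebuild the frame first: the previous cell already filled the gaps with the mean
df = pd.DataFrame({"age": [25, 30, np.nan, 40, np.nan], "income": [50, 60, 80, 90, 100]})
df["age"] = df["age"].fillna(df["age"].median())
print("\nAfter fillna with median:\n", df)

After fillna with median:
     age  income
0  25.0      50
1  30.0      60
2  30.0      80
3  40.0      90
4  30.0     100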
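
To illustrate why the median is described as robust to outliers, here is a small sketch with one deliberately implausible age. The value 200 is a made-up data-entry error, not part of the original example.

```python
import numpy as np
import pandas as pd

# One gap plus one implausible entry (200 is a hypothetical data-entry error)
age = pd.Series([25, 30, np.nan, 40, 200])

mean_fill = age.mean()      # pulled upward by the outlier
median_fill = age.median()  # stays near the typical values

print(f"Value used by mean imputation:   {mean_fill}")
print(f"Value used by median imputation: {median_fill}")

filled_with_median = age.fillna(median_fill)
print(filled_with_median)
```

The mean gives 73.75, an age that matches none of the typical respondents, while the median gives 35.0.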
