# Handling Missing Values

Missing data is extremely common in real-world datasets. How you handle it can significantly affect model performance and bias.
Before choosing a method, it’s important to understand why data is missing:

- **MCAR** - Missing Completely At Random

- **MAR** - Missing At Random

- **MNAR** - Missing Not At Random

In [29]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# 1. Generate synthetic dataset
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=4,
    n_redundant=2,
    random_state=42
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
df["target"] = y

print(df.head())

# 2. Introduce MCAR missingness
rng = np.random.default_rng(42)

missing_rate = 0.15  # 15% missing entries
mask = rng.random(df.shape) < missing_rate

df_mcar = df.copy()
df_mcar[mask] = np.nan

print("percentage of missing values per feature\n", df_mcar.isna().mean())

print("\nFinal MCAR data\n", df_mcar.head())

   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  target
0  -0.562931  -1.122449  -1.157244  -2.470523   0.496637  -0.326114       0
1   0.645020   2.236070  -1.062233   1.627740   0.366363   0.852120       1
2   2.737576   1.427951   1.053366  -1.467892   2.862033  -1.535227       1
3   0.231788   2.540746  -1.967838   1.777564   0.296485   1.419086       1
4   0.636727  -1.257201   0.334326  -1.136824  -1.973597  -1.997091       0
percentage of missing values per feature
 feature_0    0.154
feature_1    0.156
feature_2    0.152
feature_3    0.140
feature_4    0.170
feature_5    0.152
target       0.162
dtype: float64

Final MCAR data
    feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  target
0  -0.562931  -1.122449  -1.157244  -2.470523        NaN  -0.326114     0.0
1   0.645020        NaN  -1.062233   1.627740   0.366363   0.852120     1.0
2   2.737576   1.427951   1.053366        NaN   2.862033  -1.535227     1.0
3   0.231788   2.540746  -1.9678

### Case 1: Missing Completely At Random (MCAR)

When data is Missing Completely At Random, the probability of a value being missing is unrelated to any data, observed or unobserved.
Example: a sensor drops readings at random, regardless of what it is measuring.

For MCAR we can:

#### A) Drop rows with missing values

Simple, but can remove a huge amount of data.

In [30]:
df_dropped = df_mcar.dropna()
print(df_dropped.shape[0])


156


Only 156 of the 500 rows survive, so dropping rows removed over two-thirds of the data. This is usually not ideal, because losing so many samples reduces statistical power and can remove important patterns.
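
The size of this loss follows directly from the masking setup. As a quick check, assuming every cell is masked independently with the same 15% probability across all 7 columns (6 features plus the target), the chance that a row has no gaps at all is about 0.32, which matches the 156 rows left out of 500 reasonably well:

```python
# Expected share of complete rows when every cell is independently missing
# with the same probability (the MCAR setup used in this notebook).
missing_rate = 0.15   # per-cell missing probability
n_columns = 7         # 6 features + target
n_rows = 500

# A row survives dropna() only if all of its cells are present
p_complete = (1 - missing_rate) ** n_columns
expected_rows = n_rows * p_complete

print(f"P(row is complete)     = {p_complete:.3f}")     # 0.321
print(f"Expected complete rows = {expected_rows:.0f}")  # 160
```

The same formula shows why row dropping gets worse quickly as the number of columns grows, even at modest missing rates.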

#### B) Simple imputation (median/mean)

Keeps every row, but shrinks each feature's variance (all gaps in a column receive the same value) and may slightly reduce model performance.

In [16]:
df_median = df_mcar.fillna(df_mcar.median())
print(df_median.head())
print(df_median.shape[0])

   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  target
0  -0.562931  -1.122449  -1.157244  -2.470523  -0.435368  -0.326114     0.0
1   0.645020  -0.039903  -1.062233   1.627740   0.366363   0.852120     1.0
2   2.737576   1.427951   1.053366  -0.030986   2.862033  -1.535227     1.0
3   0.231788   2.540746  -1.967838   1.777564   0.296485   1.419086     0.0
4   0.636727  -1.257201   0.334326  -1.136824  -1.973597  -1.997091     0.0
500
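
The variance shrinkage can be seen on a single column. The sketch below uses a hypothetical standard-normal feature (not the notebook's data) and an arbitrary 30% MCAR rate, then compares the variance of the observed values with the variance after median filling:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for this illustration

# Hypothetical single feature with roughly unit variance
full = pd.Series(rng.normal(loc=0.0, scale=1.0, size=1000))

# Remove about 30% of the values completely at random
observed = full.mask(rng.random(full.size) < 0.30)

# Median imputation places every missing entry on the same value
imputed = observed.fillna(observed.median())

var_observed = observed.var()  # NaNs are skipped by default
var_imputed = imputed.var()

print(f"variance of observed values: {var_observed:.3f}")
print(f"variance after median fill:  {var_imputed:.3f}")
```

The filled column should show a clearly smaller variance, because a block of identical values has been added near the center of the distribution.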


### Case 2: Missing At Random (MAR)

MAR means missingness depends on other observed variables, not on the missing value itself.
Example:

- Younger people are less likely to report income

- Missingness depends on age, but not on income itself

Using a single global median would bias results (e.g., inflating incomes for young people, if they tend to earn less than the overall median).
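
To make the income example concrete, here is a minimal simulation. Everything in it is hypothetical: the age range, the linear income model, and the 60% / 10% missing rates are assumptions chosen only to produce a MAR pattern. It compares a global median fill with a median computed within age bands:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for this illustration
n = 5000

# Hypothetical population in which income rises with age
age = rng.integers(20, 65, size=n)  # ages 20..64
income = 20_000 + 1_000 * (age - 20) + rng.normal(0, 5_000, size=n)
people = pd.DataFrame({"age": age, "income": income})

# MAR: the chance of a missing income depends only on age (which is observed):
# 60% for people under 30, 10% for everyone else
p_missing = np.where(people["age"] < 30, 0.60, 0.10)
reported = people["income"].mask(rng.random(n) < p_missing)

# Option 1: one global median for every gap
global_fill = reported.fillna(reported.median())

# Option 2: median within coarse age bands, i.e. conditioning on the
# observed variable that drives the missingness
age_band = pd.cut(people["age"], bins=[19, 29, 39, 49, 64])
band_median = reported.groupby(age_band, observed=True).transform("median")
band_fill = reported.fillna(band_median)

young = people["age"] < 30
true_mean_young = people.loc[young, "income"].mean()
global_mean_young = global_fill[young].mean()
band_mean_young = band_fill[young].mean()

print(f"true mean income, under 30:   {true_mean_young:,.0f}")
print(f"after global median fill:     {global_mean_young:,.0f}")
print(f"after age-band median fill:   {band_mean_young:,.0f}")
```

Under these assumptions the global fill should overstate young people's income, while the age-band fill should stay close to the true value. That conditioning on observed variables is what KNN and iterative imputation do in a more general way.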

For MAR, better methods are:

#### A) KNN Imputation

Finds the k most similar rows (by distance on the features both rows have observed) and fills each gap with the average of those neighbors' values.

In [20]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(imputer.fit_transform(df_mcar), columns=df.columns)

print(df_knn.head())
print(df_knn.shape[0])

   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  target
0  -0.562931  -1.122449  -1.157244  -2.470523   0.024339  -0.326114     0.0
1   0.645020   1.723768  -1.062233   1.627740   0.366363   0.852120     1.0
2   2.737576   1.427951   1.053366  -0.136634   2.862033  -1.535227     1.0
3   0.231788   2.540746  -1.967838   1.777564   0.296485   1.419086     1.0
4   0.636727  -1.257201   0.334326  -1.136824  -1.973597  -1.997091     0.0
500


#### B) Multiple Imputation (MICE / IterativeImputer)

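Predicts each feature from the others in a round-robin fashion, repeating the cycle for several iterations. Note that a single `IterativeImputer` run returns one completed dataset; multiple imputation in the strict sense repeats the run with `sample_posterior=True` and different seeds, then pools the results.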

In [22]:
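from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required: IterativeImputer is experimental)
from sklearn.impute import IterativeImputer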

imputer = IterativeImputer(max_iter=10, random_state=42)
df_mice = pd.DataFrame(imputer.fit_transform(df_mcar), columns=df.columns)

print(df_mice.head())
print(df_mice.shape[0])

   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5    target
0  -0.562931  -1.122449  -1.157244  -2.470523   0.496544  -0.326114  0.000000
1   0.645020   2.238272  -1.062233   1.627740   0.366363   0.852120  1.000000
2   2.737576   1.427951   1.053366  -1.468055   2.862033  -1.535227  1.000000
3   0.231788   2.540746  -1.967838   1.777564   0.296485   1.419086  0.590913
4   0.636727  -1.257201   0.334326  -1.136824  -1.973597  -1.997091  0.000000
500
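
The cell above yields a single completed dataset. A minimal sketch of drawing several imputations is shown below. It rebuilds a smaller feature-only matrix so that it runs on its own; the number of imputations `m = 5` and the simple averaging at the end are assumptions for illustration, not a full pooling procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Small feature matrix with MCAR gaps (features only, no target)
X, _ = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=2, random_state=42)
rng = np.random.default_rng(42)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan

# Draw m completed datasets. With sample_posterior=True each run samples
# imputations from the predictive distribution instead of using the
# point estimate, so different seeds give different fills.
m = 5  # number of imputations, an arbitrary choice for this sketch
completed = []
for seed in range(m):
    imputer = IterativeImputer(max_iter=10, sample_posterior=True,
                               random_state=seed)
    completed.append(imputer.fit_transform(X_missing))

stacked = np.stack(completed)        # shape: (m, n_samples, n_features)
pooled_mean = stacked.mean(axis=0)   # average value per cell
between_std = stacked.std(axis=0)    # spread across the m imputations

was_missing = np.isnan(X_missing)
print("mean spread on imputed cells: ", between_std[was_missing].mean())
print("mean spread on observed cells:", between_std[~was_missing].mean())
```

The spread across imputations is a rough indication of how uncertain each filled value is. In a full analysis you would typically fit the downstream model on each completed dataset and combine the estimates, rather than average the data itself.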


Both KNN and MICE suit **MCAR** and **MAR** data, because they fill each gap using the other observed variables.
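
Because the data here is synthetic, the true values behind every gap are known, so imputation quality can be measured directly. The sketch below is one possible check, not part of the original notebook: it masks only the features (no target), reuses the same generator settings, and reports the RMSE on the hidden cells for three imputers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

X, _ = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_redundant=2, random_state=42)
rng = np.random.default_rng(42)
mask = rng.random(X.shape) < 0.15       # True where a cell is hidden
X_missing = np.where(mask, np.nan, X)

imputers = {
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(max_iter=10, random_state=42),
}

rmse = {}
for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X_missing)
    # Score only the cells that were hidden from the imputer
    rmse[name] = float(np.sqrt(np.mean((X_filled[mask] - X[mask]) ** 2)))

for name, value in rmse.items():
    print(f"{name:>9}: RMSE = {value:.3f}")
```

A lower RMSE means the filled values are closer to the truth. This kind of check is only possible when the ground truth is available, as with simulated missingness.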

### Case 3: Missing Not At Random (MNAR)

MNAR means the missingness depends on the value that is missing.
Example:

- People with very high income avoid reporting it

In this case, imputation alone cannot fix the bias. The unobserved data is fundamentally different from the observed data.
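
A small simulation illustrates why. The income distribution, the top-20% cutoff, and the 70% / 5% missing rates below are all hypothetical, chosen only to create an MNAR pattern in which high earners tend to withhold their income:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this illustration
n = 10_000

# Hypothetical right-skewed incomes
income = rng.lognormal(mean=10.5, sigma=0.5, size=n)

# MNAR: the chance of a missing value depends on the value itself.
# The top 20% of earners go unreported 70% of the time, everyone else 5%.
threshold = np.quantile(income, 0.80)
p_missing = np.where(income > threshold, 0.70, 0.05)
is_missing = rng.random(n) < p_missing

true_mean = income.mean()
observed_mean = income[~is_missing].mean()
missing_mean = income[is_missing].mean()

print(f"true mean:              {true_mean:,.0f}")
print(f"mean of observed data:  {observed_mean:,.0f}")
print(f"mean of missing values: {missing_mean:,.0f}")
```

The observed mean falls below the true mean, and any imputer that learns only from the observed values inherits that gap, since nothing in the observed data reveals how different the missing values are.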

Here's what we can do:

#### A) Add missing-indicator features

Allows the model to use the missingness itself as information.
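
One way to do this in scikit-learn is the `add_indicator=True` option of `SimpleImputer`, sketched below on a tiny hypothetical frame. The column names and the `_was_missing` suffix are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny hypothetical frame with gaps in both columns
df_small = pd.DataFrame({
    "income": [52_000.0, np.nan, 61_000.0, np.nan, 48_000.0],
    "age": [34.0, 51.0, np.nan, 45.0, 29.0],
})

# add_indicator=True appends one 0/1 column for each feature that had
# missing values during fit, marking the rows that were imputed
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(df_small)

# indicator_.features_ holds the indices of the columns that got a flag
flagged = df_small.columns[imputer.indicator_.features_]
columns = list(df_small.columns) + [f"{c}_was_missing" for c in flagged]
df_flagged = pd.DataFrame(filled, columns=columns)

print(df_flagged)
```

The model then sees both the filled value and a flag saying it was filled, so it can learn a separate effect for rows where the value was withheld.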

#### B) Sensitivity analysis

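Train multiple models under different assumptions about how the missing values differ from the imputed ones, for example that they are:

- 10% lower

- 20% lower

- 40% lower, etc.

Shift the values upward instead when large values are the ones being withheld, as in the income example.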

If conclusions stay stable → your result is robust.
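
A minimal sketch of this idea follows. To stay short it tracks a simple summary (the column mean) instead of a trained model, and the data, the 25% missing rate, and the median baseline are all assumptions for illustration. In practice the mean would be replaced by whatever model output or metric you care about:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for this illustration

# Hypothetical positive-valued column with about 25% of entries missing
values = pd.Series(rng.lognormal(mean=10.5, sigma=0.4, size=2000))
observed = values.mask(rng.random(values.size) < 0.25)

# Baseline fill, as if the missingness were ignorable
baseline_fill = observed.median()

# Re-impute under progressively stronger departures from the baseline
results = {}
for shift in [0.0, -0.10, -0.20, -0.40]:
    adjusted = observed.fillna(baseline_fill * (1 + shift))
    results[shift] = adjusted.mean()

for shift, mean in results.items():
    print(f"missing values assumed {shift:+.0%} vs baseline -> mean = {mean:,.0f}")
```

If the quantity of interest barely moves across the scenarios, the conclusion does not hinge on the missing-data assumption. If it swings widely, the result should be reported with that caveat.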

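(Because the target column was masked together with the features, the imputers also filled in labels: median imputation turned row 3's target from 1 into 0.0, and IterativeImputer produced an unrealistic 0.590913. For this mini project, I left these values as-is for simplicity. In a real-world scenario, I would clean or correct these values, or keep the target out of imputation altogether, but here I ignored them to keep the example straightforward.)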