<a href="https://colab.research.google.com/github/awsdevguru/PearsonMLFoundations/blob/dev/2_3_02_Data_Cleaning_Outliers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning: Outliers

## Mini-Glossary

* **Outlier:** A data point that's unusually far from most others; sometimes an error, sometimes the most interesting point.

* **Univariate (one variable):** Analyze one column at a time (e.g., IQR or Z-score on x).

* **Multivariate (many variables):** Analyze multiple columns together to spot odd combinations (e.g., Isolation Forest on x and y).

* **Z-score:** Number of standard deviations a value is from the mean; |z| > 3 is a common flag.

* **IQR (Interquartile Range):** Q3 - Q1 (the spread of the middle 50%). Flag points below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

* **Robust (statistical):** Method not easily swayed by a few extreme values (IQR is robust; mean/std often aren't).

* **Skew/Skewness:** Lopsided distribution (e.g., many small values and a long right tail of big ones).

* **Winsorize / Capping:** Clip extreme values to a set lower/upper bound (keep the row, reduce its impact).

* **Transformation (e.g., log1p):** Apply a function to stabilize variance or tame skew (log1p = log(1+x)).

* **Isolation Forest:** Anomaly detector that isolates unusual points via random splits; works well in multivariate settings.

* **Residual:** Difference between actual and predicted (y-ŷ); large residuals mean "unexpected given the model."

* **Decision boundary (intuition):** The line/surface a model uses to separate normal vs. unusual (or class A vs. B).

* **Contamination (IF param):** Expected proportion of outliers; guides the Isolation Forest on how many to flag.

* **Data leakage (general ML):** Using information from outside the training fold (or future data) that unfairly boosts performance.

* **Domain rules:** Real-world constraints (e.g., negative durations are impossible) that should be enforced first.

## Objective

* Detect outliers with Z-score (univariate), IQR (univariate, robust), and Isolation Forest (multivariate).

* Decide whether to keep, transform, cap (Winsorize), or remove outliers.

* Show impact on a simple model before/after treatment.

**Terms in play:**
* univariate = one column at a time
* multivariate = use multiple columns together
* Cap/Winsorize = clip extreme values to a limit instead of deleting them.

## 1) Setup

In [None]:
!pip install -q pandas numpy scipy matplotlib seaborn scikit-learn

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import zscore
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

sns.set_theme()
np.random.seed(73)

## 2) Create/Reproduce a Dataset with injected outliers

Note: We purposely add a few huge values so the methods have something obvious to flag.

In [None]:
n = 800
x = np.random.gamma(shape=2.0, scale=10.0, size=n)  # skewed, positive
y = 3.0 + 0.8*x + np.random.normal(0, 5, size=n)

n_out = 20
x_out = np.random.uniform(200, 400, size=n_out)
y_out = np.random.uniform(400, 800, size=n_out)

X = np.concatenate([x, x_out])
Y = np.concatenate([y, y_out])
df = pd.DataFrame({"x": X, "y": Y})
df.head()

In [None]:
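# The 20 injected outliers sit at the end of df (rows 800-819).
# Show rows 795-804 to see the boundary between ordinary and injected rows.
df.tail(25).head(10)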

## 3) Visual EDA (Spot outliers quickly)

Boxplot: points far from the “box” are likely outliers.

Scatter: helps see odd pairs (multivariate behavior of x & y together).

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16,4))
sns.histplot(df["x"], bins=40, ax=axes[0]); axes[0].set_title("Histogram: x")
sns.boxplot(y=df["x"], ax=axes[1]); axes[1].set_title("Boxplot: x")
sns.scatterplot(data=df, x="x", y="y", ax=axes[2]); axes[2].set_title("Scatter: x vs y")
plt.show()

## 4) Z-Score Method (univariate; assumes roughly normal)

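Z-score: "how many standard deviations away" a value is from the mean. A common rule flags |z| > 3 as an outlier. Because the mean and standard deviation are themselves pulled by extreme values, the rule works best on roughly normal data.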

In [None]:
df["x_z"] = zscore(df["x"])
z_thresh = 3.0
df["outlier_z"] = (df["x_z"].abs() > z_thresh)
df["outlier_z"].value_counts()

In [None]:
plt.figure(figsize=(6,4))
sns.scatterplot(data=df, x="x", y="y", hue="outlier_z", palette={False:"tab:blue", True:"tab:red"})
plt.title("Z-score flagged outliers")
plt.show()
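
To make the rule concrete, here is a small sketch that computes Z-scores by hand and checks them against `scipy.stats.zscore`. The array `values` is made-up toy data (an assumption for illustration, not taken from the dataset above).

```python
import numpy as np
from scipy.stats import zscore

# Toy data (hypothetical): 24 ordinary values and one extreme value
values = np.concatenate([np.tile([10.0, 11.0, 12.0, 13.0], 6), [100.0]])

# Z-score by hand: (value - mean) / population std (ddof=0 is scipy's default)
z_manual = (values - values.mean()) / values.std(ddof=0)

# The library call should give the same numbers
z_scipy = zscore(values)

# Apply the |z| > 3 rule
flagged = np.abs(z_manual) > 3.0
print(np.round(z_manual, 2))
print("Flagged values:", values[flagged])
```

Note that the extreme value takes part in the mean and standard deviation it is measured against, so it shrinks its own Z-score. This is one reason the method is not considered robust.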

## 5) IQR Method (univariate; robust to skew)

**IQR (Interquartile Range):** the spread of the middle 50% (Q3 - Q1). Anything below Q1 - 1.5×IQR or above Q3 + 1.5×IQR is flagged.
"Robust" = not easily thrown off by a few extreme points.

In [None]:
Q1, Q3 = df["x"].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR

df["outlier_iqr"] = (df["x"] < lower) | (df["x"] > upper)
(lower, upper), df["outlier_iqr"].value_counts()

In [None]:
plt.figure(figsize=(6,4))
sns.scatterplot(data=df, x="x", y="y", hue="outlier_iqr", palette={False:"tab:blue", True:"tab:orange"})
plt.title("IQR flagged outliers")
plt.show()
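
The fence arithmetic is easy to verify on a tiny series. The sketch below uses made-up values (`toy`, `toy_lower`, and `toy_upper` are illustrative names, not part of the notebook) and pandas' default linear-interpolation quantiles.

```python
import pandas as pd

# Toy data (hypothetical): nine evenly spaced values and one extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

# Quartiles and interquartile range
q1, q3 = toy.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences: 1.5 x IQR beyond each quartile
toy_lower = q1 - 1.5 * iqr
toy_upper = q3 + 1.5 * iqr

# Flag anything outside the fences
toy_outlier = (toy < toy_lower) | (toy > toy_upper)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{toy_lower}, {toy_upper}]")
print("Flagged:", toy[toy_outlier].tolist())
```

Because the quartiles depend only on the middle of the data, changing the extreme value from 50 to 5000 would leave both fences unchanged.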

## 6) Isolation Forest (multivariate; no distribution assumptions)

**Isolation Forest:** an algorithm that "isolates" unusual points using random splits. Works on multiple features at once, so it can catch odd combinations.

In [None]:
iso = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
df["iso_pred"] = iso.fit_predict(df[["x","y"]])  # -1 = outlier, 1 = inlier
df["outlier_iso"] = (df["iso_pred"] == -1)
df["outlier_iso"].value_counts()

In [None]:
plt.figure(figsize=(6,4))
sns.scatterplot(data=df, x="x", y="y", hue="outlier_iso", palette={False:"tab:blue", True:"tab:green"})
plt.title("Isolation Forest flagged outliers (multivariate)")
plt.show()
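
As a rough illustration of how `contamination` shapes the result, the sketch below fits an Isolation Forest on synthetic 2-D data with one obviously odd point. The data, the seed, and the `contamination=0.01` value are assumptions chosen for the demo, not recommendations.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 ordinary points plus one far-away point
ordinary = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
odd_point = np.array([[8.0, 8.0]])
X_toy = np.vstack([ordinary, odd_point])

# contamination=0.01 asks the model to flag roughly 1% of the rows
iso_toy = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = iso_toy.fit_predict(X_toy)      # -1 = outlier, 1 = inlier

# score_samples: lower means "easier to isolate", i.e. more anomalous
scores = iso_toy.score_samples(X_toy)

n_flagged = int((labels == -1).sum())
print("Rows flagged:", n_flagged, "of", len(X_toy))
print("Label of the far-away point:", labels[-1])
print("Its score vs. median score:", scores[-1], np.median(scores))
```

With a numeric `contamination`, the threshold is set so that about that fraction of the training rows is flagged, whether or not they are truly unusual. With `"auto"`, as in the cell above, scikit-learn uses a fixed score threshold instead, so the flagged fraction depends on the data.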

## 7) Consolidate Outlier Flags & Inspect

Tip: Combine signals. One strict strategy is "flag if any method flags it," which flags the most rows.

In [None]:
flags = ["outlier_z", "outlier_iqr", "outlier_iso"]
df["outlier_any"] = df[flags].any(axis=1)

print(df[flags].mean().sort_values(ascending=False))  # flag rates
df[df["outlier_any"]].head()
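
"Any method" is only one way to combine flags. A minimal sketch of two alternatives, majority vote and unanimous agreement, on a made-up flag table (`toy_flags` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical flag table: one row per data point, one column per method
flag_cols = ["outlier_z", "outlier_iqr", "outlier_iso"]
toy_flags = pd.DataFrame({
    "outlier_z":   [False, False, True,  True],
    "outlier_iqr": [False, True,  True,  True],
    "outlier_iso": [False, False, False, True],
})

# Count how many methods flagged each row
n_votes = toy_flags[flag_cols].sum(axis=1)

toy_flags["n_votes"] = n_votes
toy_flags["outlier_any"] = n_votes >= 1        # most aggressive
toy_flags["outlier_majority"] = n_votes >= 2   # at least 2 of 3 agree
toy_flags["outlier_all"] = n_votes == len(flag_cols)  # most conservative
print(toy_flags)
```

Which rule fits depends on the cost of each mistake: "any" risks discarding real events, while "all" risks keeping bad rows.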

## 8) Decision Paths (treatments)
### A) Keep (baseline model)

**Baseline:** no treatment; use this for comparison.

In [None]:
X_baseline = df[["x"]].values
y_baseline = df["y"].values

m0 = LinearRegression().fit(X_baseline, y_baseline)
pred0 = m0.predict(X_baseline)
print("Baseline  R2:", r2_score(y_baseline, pred0))
print("Baseline MAE:", mean_absolute_error(y_baseline, pred0))
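
The glossary's "residual" can also be inspected directly once a model is fitted. The following is a sketch on made-up data (a clean line with one bumped point), not a method the notebook itself applies:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): y = 2x + 1 exactly, except one point bumped upward
x_toy = np.arange(10, dtype=float)
y_toy = 2.0 * x_toy + 1.0
y_toy[5] += 30.0  # the "unexpected" observation

# Fit a line and compute residuals: actual minus predicted
line = LinearRegression().fit(x_toy.reshape(-1, 1), y_toy)
residuals = y_toy - line.predict(x_toy.reshape(-1, 1))

# The row with the largest absolute residual is the least expected one
worst = int(np.argmax(np.abs(residuals)))
print(np.round(residuals, 2))
print("Largest residual at index:", worst)
```

The bumped point also shifts the fitted line slightly, so every other point picks up a small residual too. That is the same pull the baseline model above is subject to.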

### B) Remove (drop flagged rows)

**Remove:** clean but may throw away rare-but-real events.

In [None]:
df_rm = df.loc[~df["outlier_any"]].copy()
m1 = LinearRegression().fit(df_rm[["x"]].values, df_rm["y"].values)
pred1 = m1.predict(df_rm[["x"]].values)

print("RemoveOutliers  R2:", r2_score(df_rm["y"].values, pred1))
print("RemoveOutliers MAE:", mean_absolute_error(df_rm["y"].values, pred1))
print("Rows kept:", len(df_rm), " / ", len(df))

### C) Cap (Winsorize) extremes

**Cap/Winsorize:** clip extreme values to the IQR limits. This reduces the impact of wild points without deleting rows.

In [None]:
df_cap = df.copy()
df_cap["x_cap"] = df_cap["x"].clip(lower, upper)

m2 = LinearRegression().fit(df_cap[["x_cap"]].values, df_cap["y"].values)
pred2 = m2.predict(df_cap[["x_cap"]].values)

print("Cap(IQR)  R2:", r2_score(df_cap["y"].values, pred2))
print("Cap(IQR) MAE:", mean_absolute_error(df_cap["y"].values, pred2))
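
What `clip` does to individual values is easiest to see on a handful of numbers. In this sketch the bounds 0 and 50 are arbitrary stand-ins for the IQR fences:

```python
import pandas as pd

# Toy values (hypothetical) with one low and one high extreme
raw = pd.Series([-40.0, 5.0, 12.0, 20.0, 300.0])

# Values below the lower bound become the lower bound,
# values above the upper bound become the upper bound
capped = raw.clip(lower=0.0, upper=50.0)

print(pd.DataFrame({"raw": raw, "capped": capped}))
print("Rows before/after:", len(raw), len(capped))
```

Every row survives, and values already inside the bounds are left untouched. In the cell above only `x` is capped, so the injected rows still carry their extreme `y` values.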

### D) Transform (e.g., log1p)

**log1p:** log(1+x) tames skew (long right tail) while keeping 0 safe (log(1+0)=0).

In [None]:
df_log = df.query("x >= 0").copy()
df_log["x_log1p"] = np.log1p(df_log["x"])

m3 = LinearRegression().fit(df_log[["x_log1p"]].values, df_log["y"].values)
pred3 = m3.predict(df_log[["x_log1p"]].values)

print("Log1p  R2:", r2_score(df_log["y"].values, pred3))
print("Log1p MAE:", mean_absolute_error(df_log["y"].values, pred3))
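
A quick numeric check shows how `log1p` compresses a long right tail and how `expm1` undoes it. The input values are made up for illustration:

```python
import numpy as np

# Toy values (hypothetical) spanning three orders of magnitude
x_raw = np.array([0.0, 9.0, 99.0, 999.0])

# log1p(x) = log(1 + x); zero maps to zero, large values are pulled in
x_log = np.log1p(x_raw)

# expm1 is the inverse: exp(x) - 1
x_back = np.expm1(x_log)

print("raw:      ", x_raw)
print("log1p:    ", np.round(x_log, 3))
print("recovered:", np.round(x_back, 3))
```

Each tenfold jump in `1 + x` adds the same constant (about 2.3) on the log scale, which is why far-out points end up much closer to the bulk of the data. The transform is only defined for x > -1, hence the `x >= 0` filter in the cell above.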

### E) Compare

In [None]:
summary = pd.DataFrame({
    "Approach": ["Baseline","RemoveOutliers","Cap(IQR)","Log1p"],
    "R2": [
        r2_score(y_baseline, pred0),
        r2_score(df_rm["y"].values, pred1),
        r2_score(df_cap["y"].values, pred2),
        r2_score(df_log["y"].values, pred3)
    ],
    "MAE": [
        mean_absolute_error(y_baseline, pred0),
        mean_absolute_error(df_rm["y"].values, pred1),
        mean_absolute_error(df_cap["y"].values, pred2),
        mean_absolute_error(df_log["y"].values, pred3)
    ],
    "Rows Used": [len(df), len(df_rm), len(df_cap), len(df_log)]
})
summary
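
One caveat when reading the table: each approach is scored on the rows it was trained on, and "RemoveOutliers" is scored on fewer, easier rows. A fairer comparison scores every model on the same held-out rows, with outlier thresholds learned from the training rows only (which also avoids the data leakage described in the glossary). The sketch below uses freshly generated synthetic data and hypothetical names; it illustrates the pattern and is not a rerun of the notebook's numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (assumption): linear trend plus a few wild rows
x_syn = rng.gamma(shape=2.0, scale=10.0, size=400)
y_syn = 3.0 + 0.8 * x_syn + rng.normal(0, 5, size=400)
x_syn = np.concatenate([x_syn, rng.uniform(200, 400, size=10)])
y_syn = np.concatenate([y_syn, rng.uniform(400, 800, size=10)])

X_train, X_test, y_train, y_test = train_test_split(
    x_syn.reshape(-1, 1), y_syn, test_size=0.25, random_state=0
)

# IQR fences computed from the training rows only
q1, q3 = np.percentile(X_train[:, 0], [25, 75])
iqr = q3 - q1
keep = (X_train[:, 0] >= q1 - 1.5 * iqr) & (X_train[:, 0] <= q3 + 1.5 * iqr)

# Two models: all training rows vs. training rows inside the fences
model_all = LinearRegression().fit(X_train, y_train)
model_clean = LinearRegression().fit(X_train[keep], y_train[keep])

# Both models are scored on the same, untouched test rows
mae_all = mean_absolute_error(y_test, model_all.predict(X_test))
mae_clean = mean_absolute_error(y_test, model_clean.predict(X_test))
print("Train rows kept:", int(keep.sum()), "/", len(keep))
print("Test MAE (all rows):     ", round(mae_all, 2))
print("Test MAE (outliers removed):", round(mae_clean, 2))
```

Whether the test set itself should contain the wild rows is a separate judgment call: it depends on whether such rows are errors or real events the model must handle.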

## 9) Domain Rules (sanity bounds)

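**Domain rules:** business/physics limits (e.g., a negative byte count is impossible). Enforce these before applying statistical methods. In this dataset `x` is never negative, so the clip below changes nothing and the metrics match the baseline; the cell only demonstrates the pattern.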

In [None]:
df_dr = df.copy()
df_dr["x_dom"] = df_dr["x"].clip(lower=0)

m4 = LinearRegression().fit(df_dr[["x_dom"]].values, df_dr["y"].values)
pred4 = m4.predict(df_dr[["x_dom"]].values)
print("DomainClip R2:", r2_score(df_dr["y"].values, pred4))
print("DomainClip MAE:", mean_absolute_error(df_dr["y"].values, pred4))
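
Clipping silently rewrites impossible values. It is often useful to count and flag them first, since they may point to an upstream bug. A minimal sketch with a made-up `duration_s` column (the column name and values are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical): durations in seconds; negatives are impossible
toy = pd.DataFrame({"duration_s": [12.0, -3.0, 45.0, 0.0, -0.5]})

# Step 1: flag violations so they can be reported or investigated
violates = toy["duration_s"] < 0
toy["duration_violation"] = violates
print("Rows violating the rule:", int(violates.sum()))

# Step 2: enforce the rule, keeping the original column for auditing
toy["duration_clipped"] = toy["duration_s"].clip(lower=0)
print(toy)
```

Keeping the flag column also lets a later model learn whether "this value was repaired" carries any signal.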