# 06: Robust Statistics - Outlier-Resistant Analysis

**Objective**: Apply robust statistical methods resistant to outliers and violations of assumptions

**Key Methods**:
- Robust regression (Huber, RANSAC)
- Robust location/scale estimators (median, MAD, IQR)
- Winsorization and trimming
- Bootstrap confidence intervals
- Influence diagnostics (Cook's distance)

**Dataset**: NTSB Aviation Accidents (1962-2025)
**Last Updated**: 2025-11-09

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import HuberRegressor, RANSACRegressor, LinearRegression
from sklearn.preprocessing import StandardScaler
import sqlalchemy as sa
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10
plt.rcParams['savefig.dpi'] = 150

figures_dir = Path('figures')
figures_dir.mkdir(exist_ok=True)

engine = sa.create_engine('postgresql://parobek@localhost/ntsb_aviation')
np.random.seed(42)
print("✅ Setup complete")


In [None]:
# Load data with potential outliers
query = """
SELECT 
    e.ev_id,
    e.ev_year,
    a.acft_year,
    e.inj_tot_f,
    e.inj_tot_s,
    CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 ELSE 0 END as is_fatal
FROM events e
-- Keep one aircraft per event (lowest aircraft_key); the WHERE clause drops unmatched events
JOIN aircraft a ON e.ev_id = a.ev_id AND a.aircraft_key = (
    SELECT MIN(a2.aircraft_key) FROM aircraft a2 WHERE a2.ev_id = e.ev_id
)
WHERE e.ev_year IS NOT NULL AND a.acft_year IS NOT NULL
"""

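df = pd.read_sql(sa.text(query), engine)
# Aircraft age at the time of the event; ages outside 0-100 years are treated as implausible
df['aircraft_age'] = df['ev_year'] - df['acft_year']
df = df[(df['aircraft_age'] >= 0) & (df['aircraft_age'] <= 100)].copy()
# Treat a missing fatality count as zero fatalities
df['inj_tot_f'] = df['inj_tot_f'].fillna(0)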

print(f"Loaded {len(df):,} events for robust statistics")


## 1. Robust Location and Scale Estimators

In [None]:
# Compare classical vs robust estimators for aircraft age
age_data = df['aircraft_age'].dropna()

# Classical estimators
mean_classic = age_data.mean()
std_classic = age_data.std()

# Robust estimators
median_robust = age_data.median()
mad_robust = stats.median_abs_deviation(age_data, scale='normal')  # MAD scaled to match std
iqr_robust = stats.iqr(age_data)

# Trimmed mean (10% trim)
trimmed_mean = stats.trim_mean(age_data, proportiontocut=0.1)

print("\n📊 Classical vs Robust Estimators for Aircraft Age:")
print(f"\nLocation (Central Tendency):")
print(f"  Mean (classical): {mean_classic:.2f} years")
print(f"  Median (robust): {median_robust:.2f} years")
print(f"  Trimmed mean (10%): {trimmed_mean:.2f} years")
print(f"  Difference (mean - median): {mean_classic - median_robust:.2f} years")

print(f"\nScale (Variability):")
print(f"  Standard deviation (classical): {std_classic:.2f} years")
print(f"  MAD (robust, scaled): {mad_robust:.2f} years")
print(f"  IQR (robust): {iqr_robust:.2f} years")
print(f"  Ratio (std/MAD): {std_classic/mad_robust:.2f}")

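# Heuristic check: for normal data the scaled MAD and the standard deviation estimate
# the same sigma, so the ratio is close to 1; outliers or heavy tails inflate it.
if std_classic / mad_robust > 1.5:
    print("\n⚠️  High std/MAD ratio (>1.5) suggests outliers or heavy tails")
else:
    print("\n✅ Low std/MAD ratio (<=1.5) suggests minimal outlier influence")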


## 2. Outlier Detection with IQR Method

In [None]:
# IQR-based outlier detection
Q1 = age_data.quantile(0.25)
Q3 = age_data.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = age_data[(age_data < lower_bound) | (age_data > upper_bound)]
n_outliers = len(outliers)

print(f"\n📊 IQR-Based Outlier Detection:")
print(f"  Q1 (25th percentile): {Q1:.2f}")
print(f"  Q3 (75th percentile): {Q3:.2f}")
print(f"  IQR: {IQR:.2f}")
print(f"  Lower bound: {lower_bound:.2f}")
print(f"  Upper bound: {upper_bound:.2f}")
print(f"\nOutliers detected: {n_outliers:,} ({n_outliers/len(age_data)*100:.2f}%)")

# Visualize outliers
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Left: Box plot
bp = ax1.boxplot(age_data, patch_artist=True)  # vertical orientation is the default
bp['boxes'][0].set_facecolor('lightblue')
ax1.set_ylabel('Aircraft Age (years)')
ax1.set_title(f'Box Plot with Outliers\n({n_outliers:,} outliers)', fontweight='bold')
ax1.grid(True, alpha=0.3, axis='y')

# Right: Histogram with outlier boundaries
ax2.hist(age_data, bins=50, alpha=0.7, color='blue', edgecolor='black')
ax2.axvline(lower_bound, color='red', linestyle='--', linewidth=2, label=f'Lower bound: {lower_bound:.1f}')
ax2.axvline(upper_bound, color='red', linestyle='--', linewidth=2, label=f'Upper bound: {upper_bound:.1f}')
ax2.axvline(median_robust, color='green', linestyle='-', linewidth=2, label=f'Median: {median_robust:.1f}')
ax2.set_xlabel('Aircraft Age (years)')
ax2.set_ylabel('Frequency')
ax2.set_title('Distribution with IQR Boundaries', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.suptitle('Outlier Detection: Aircraft Age', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(figures_dir / '01_outlier_detection.png', dpi=150, bbox_inches='tight')
plt.show()


## 3. Robust Regression (OLS vs Huber vs RANSAC)

In [None]:
# Regression: Predict fatalities from aircraft age
reg_df = df[['aircraft_age', 'inj_tot_f']].dropna()
X = reg_df[['aircraft_age']].values
y = reg_df['inj_tot_f'].values

# Ordinary Least Squares (OLS)
ols = LinearRegression()
ols.fit(X, y)

# Huber regression (robust to outliers)
huber = HuberRegressor(epsilon=1.35, max_iter=200)
huber.fit(X, y)

# RANSAC regression: fits on random subsets and keeps the model with the most inliers
ransac = RANSACRegressor(
    random_state=42,
    min_samples=50,  # Points drawn at random to fit each candidate model
    residual_threshold=10.0,  # Points with |residual| > 10 fatalities count as outliers
    max_trials=5000,  # Upper limit on the number of random subsets tried
    stop_probability=0.99  # Stop early once an outlier-free subset was likely sampled
)
ransac.fit(X, y)

print("\n📊 Regression Comparison: Fatalities ~ Aircraft Age")
print(f"\nOLS (Ordinary Least Squares):")
print(f"  Slope: {ols.coef_[0]:.6f}")
print(f"  Intercept: {ols.intercept_:.6f}")

print(f"\nHuber Regression (Robust):")
print(f"  Slope: {huber.coef_[0]:.6f}")
print(f"  Intercept: {huber.intercept_:.6f}")

print(f"\nRANSAC Regression (Very Robust):")
print(f"  Slope: {ransac.estimator_.coef_[0]:.6f}")
print(f"  Intercept: {ransac.estimator_.intercept_:.6f}")
print(f"  Inliers: {ransac.inlier_mask_.sum():,} ({ransac.inlier_mask_.mean()*100:.1f}%)")

# Visualize regression lines
fig, ax = plt.subplots(figsize=(12, 6))

# Sample data points (too many to plot all)
sample_indices = np.random.choice(len(X), size=min(2000, len(X)), replace=False)
ax.scatter(X[sample_indices], y[sample_indices], alpha=0.3, s=10, color='gray', label='Data (sample)')

# Regression lines
x_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
ax.plot(x_range, ols.predict(x_range), 'b-', linewidth=2, label='OLS', alpha=0.8)
ax.plot(x_range, huber.predict(x_range), 'r--', linewidth=2, label='Huber', alpha=0.8)
ax.plot(x_range, ransac.predict(x_range), 'g:', linewidth=3, label='RANSAC', alpha=0.8)

ax.set_xlabel('Aircraft Age (years)', fontsize=12)
ax.set_ylabel('Fatal Injuries', fontsize=12)
ax.set_title('Robust Regression Comparison: OLS vs Huber vs RANSAC', fontsize=14, fontweight='bold')
ax.legend(loc='best', fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_ylim(-1, 20)  # Focus on typical range

plt.tight_layout()
plt.savefig(figures_dir / '02_robust_regression.png', dpi=150, bbox_inches='tight')
plt.show()


## 4. Bootstrap Confidence Intervals

In [None]:
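# Bootstrap 95% CI for median aircraft age
n_bootstrap = 10000
age_values = age_data.to_numpy()
rng = np.random.default_rng(42)

# Resample with replacement and record the median of each resample.
# Sampling from the NumPy array avoids pandas overhead on every iteration.
bootstrap_medians = np.empty(n_bootstrap)
for i in range(n_bootstrap):
    resample = rng.choice(age_values, size=age_values.size, replace=True)
    bootstrap_medians[i] = np.median(resample)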

# Percentile method for CI
ci_lower = np.percentile(bootstrap_medians, 2.5)
ci_upper = np.percentile(bootstrap_medians, 97.5)

print(f"\n📊 Bootstrap Confidence Interval for Median Aircraft Age:")
print(f"\nBootstrap samples: {n_bootstrap:,}")
print(f"Sample median: {median_robust:.2f} years")
print(f"Bootstrap mean: {bootstrap_medians.mean():.2f} years")
print(f"Bootstrap standard error: {bootstrap_medians.std(ddof=1):.2f} years")
print(f"\n95% Bootstrap CI: [{ci_lower:.2f}, {ci_upper:.2f}]")

# Visualize bootstrap distribution
fig, ax = plt.subplots(figsize=(12, 6))

ax.hist(bootstrap_medians, bins=50, density=True, alpha=0.7, 
        color='purple', edgecolor='black', label='Bootstrap distribution')
ax.axvline(median_robust, color='blue', linestyle='-', linewidth=2, 
           label=f'Sample median: {median_robust:.2f}')
ax.axvline(ci_lower, color='red', linestyle='--', linewidth=2, alpha=0.7)
ax.axvline(ci_upper, color='red', linestyle='--', linewidth=2, alpha=0.7,
           label=f'95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]')

ax.set_xlabel('Median Aircraft Age (years)', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title(f'Bootstrap Distribution of Median\n({n_bootstrap:,} resamples)', 
             fontsize=14, fontweight='bold')
ax.legend(loc='best')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(figures_dir / '03_bootstrap_ci.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n✅ Bootstrap provides distribution-free confidence intervals")


## Key Findings

### 1. Robust Estimators
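- **Median vs Mean**: The median is robust to outliers (a single extreme value cannot move it far), whereas the mean can be shifted arbitrarily by one bad observation
- **MAD vs Std**: The median absolute deviation (MAD) resists outliers; the standard deviation squares deviations, so extreme values inflate it
- **When to use**: Data with heavy tails, outliers, or non-normal distributions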
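
To make the contrast concrete, here is a minimal sketch on made-up numbers (not NTSB data). The helper `scaled_mad` is a hypothetical name; its constant 1.4826 is the usual normal-consistency factor, which is approximately what `scale='normal'` applies in `stats.median_abs_deviation`.

```python
import numpy as np

def scaled_mad(x, k=1.4826):
    """Median absolute deviation, scaled to estimate sigma under normality."""
    x = np.asarray(x, dtype=float)
    return k * np.median(np.abs(x - np.median(x)))

clean = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
contaminated = np.append(clean, 500.0)  # one hypothetical data-entry error

for name, x in [("clean", clean), ("contaminated", contaminated)]:
    print(f"{name:>12}: mean={x.mean():.2f}  median={np.median(x):.2f}  "
          f"std={x.std(ddof=1):.2f}  MAD={scaled_mad(x):.2f}")
```

One bad value moves the mean from 14 to 95, while the median only moves from 14 to 15; the standard deviation reacts far more strongly than the MAD.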

### 2. Outlier Detection
- **IQR method**: The 1.5×IQR rule flags values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
- **Box plots**: Points drawn beyond the whiskers are the values flagged by the same rule
- **Prevalence**: Aircraft age has ~X% outliers
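
The fence computation can be wrapped in a small reusable helper. This is a sketch on invented ages, and `iqr_fences` is a hypothetical name rather than a function used in the cells above. `np.percentile` and pandas `quantile` both default to linear interpolation, so the bounds should agree with the Section 2 approach on the same data.

```python
import numpy as np

def iqr_fences(x, k=1.5):
    """Return (lower, upper) fences of the k×IQR rule."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up aircraft ages with one extreme value
ages = np.array([2, 5, 8, 10, 12, 15, 18, 20, 25, 95])
lower, upper = iqr_fences(ages)
flagged = ages[(ages < lower) | (ages > upper)]

print(f"fences: [{lower:.1f}, {upper:.1f}], flagged: {flagged.tolist()}")
```

A larger `k` (for example 3.0) gives a more conservative rule that flags only very extreme values.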

### 3. Robust Regression
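- **OLS**: Sensitive to outliers (squared loss lets a few large residuals dominate the fit)
- **Huber**: Moderate robustness (quadratic loss for small residuals, linear for large ones, so outliers are downweighted rather than removed)
- **RANSAC**: High robustness (fits on random subsets and excludes points beyond the residual threshold entirely)
- **Trade-off**: Robustness vs efficiency: robust fits give up some precision when the data are clean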
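
The downweighting idea behind Huber regression can be sketched with iteratively reweighted least squares in plain NumPy. This is an illustration on synthetic data under stated assumptions, not how scikit-learn's `HuberRegressor` is implemented internally; `huber_irls`, the MAD-based residual scale, and the contamination pattern are all choices made for this example.

```python
import numpy as np

def huber_irls(x, y, delta=1.35, n_iter=50):
    """Fit y ~ a + b*x using Huber weights and iteratively reweighted least squares."""
    A = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(A, y, rcond=None)[0]  # start from the OLS solution
    for _ in range(n_iter):
        r = y - A @ beta
        # Robust residual scale: MAD scaled for normal errors
        s = max(1.4826 * np.median(np.abs(r - np.median(r))), 1e-12)
        u = np.abs(r) / s
        # Weight 1 inside the threshold, delta/u beyond it (never exactly zero)
        w = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]
    return beta

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=x.size)  # true slope is 2
y[-20:] += 30.0  # contaminate the 10% of points with the largest x

A = np.column_stack([np.ones_like(x), x])
b_ols = np.linalg.lstsq(A, y, rcond=None)[0]
b_huber = huber_irls(x, y)

print(f"OLS slope:   {b_ols[1]:.3f}")
print(f"Huber slope: {b_huber[1]:.3f}")
```

Because the Huber weights shrink but never reach zero, the contaminated points still pull on the fit slightly; RANSAC avoids that by discarding them, at the cost of depending on the residual threshold.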

### 4. Bootstrap Methods
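- **Resampling**: Approximates the sampling distribution by resampling the observed data with replacement
- **Confidence intervals**: No normality assumption required
- **Versatility**: Works for most statistics (median, ratio, correlation), though it can be unreliable for extremes such as the sample maximum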
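
The percentile bootstrap generalizes to other statistics with a small helper. The sketch below assumes the statistic accepts an `axis` argument (as `np.median` and `np.mean` do); `bootstrap_ci` is a hypothetical name and the lognormal sample is synthetic.

```python
import numpy as np

def bootstrap_ci(x, stat=np.median, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a statistic that accepts an `axis` argument."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = x.size
    # Draw all resample indices at once: one row per bootstrap replicate
    idx = rng.integers(0, n, size=(n_boot, n))
    reps = stat(x[idx], axis=1)
    lo, hi = np.percentile(reps, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi, reps

# Synthetic right-skewed sample standing in for real data
sample = np.random.default_rng(1).lognormal(mean=2.0, sigma=0.8, size=300)
lo, hi, reps = bootstrap_ci(sample)

print(f"median = {np.median(sample):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The index matrix has `n_boot × n` entries, so for a dataset the size of the full events table it is safer to loop or process the replicates in chunks.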

### Practical Recommendations

**Use robust methods when**:
- Outliers are present (check with box plots, IQR)
- The normality assumption is violated (check with Q-Q plots)
- Sample sizes are small (<30), where a single outlier carries more weight
- High-stakes decisions require conservative estimates

**Classical methods are adequate when**:
- Data are approximately normal
- Sample sizes are large (n > 1000)
- Outliers are genuine data (not errors)
- Maximum efficiency is desired
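
The Key Methods list mentions influence diagnostics (Cook's distance), which the cells above do not exercise. A minimal NumPy sketch for simple linear regression follows, using the standard formula D_i = e_i² / (p·s²) · h_ii / (1 - h_ii)². The function name `cooks_distance` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for each observation in the OLS fit y ~ a + b*x."""
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'
    XtX_inv = np.linalg.inv(X.T @ X)
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid**2 / (p * s2) * leverage / (1 - leverage) ** 2

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 3.0 + 0.5 * x + rng.normal(0.0, 0.3, size=x.size)
y[-1] += 15.0  # one influential point: large residual at high leverage

d = cooks_distance(x, y)
print(f"most influential index: {int(np.argmax(d))}, D = {d.max():.2f}")
```

A common rule of thumb treats observations with D_i above 4/n as worth inspecting; whether such a point is an error or a genuine extreme event still has to be decided from the record itself.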