# 11: Conformal Prediction

**Distribution-Free Prediction Intervals with Coverage Guarantees**

---

## 🚨 If You Know sklearn But Not Uncertainty Quantification, Read This First

**What you already know (from standard ML)**:
- Point predictions: "The model predicts 5.23"
- Confidence intervals: "We're 95% confident the value is in [5.0, 5.5]"
- Gaussian assumptions underlie most interval methods

**What's different with conformal prediction**:

| Traditional Intervals | Conformal Prediction |
|----------------------|---------------------|
| Assumes Gaussian errors | **Distribution-free** |
| Asymptotic guarantees (need large n) | **Finite-sample** guarantees |
| Model-specific formulas | Works with **any model** |
| Hard to calibrate | Calibration is automatic |

**The key insight**:

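Conformal prediction makes no assumption about the shape of the error distribution; it only requires that calibration and test points be exchangeable (for example, i.i.d.). It uses a **calibration set** to measure how wrong your model typically is, then constructs intervals that cover the true value with at least the requested probability.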

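```python
# Traditional (assumes normality):
interval = (prediction - 1.96 * estimated_std, prediction + 1.96 * estimated_std)

# Conformal (no distributional assumption):
# q_hat is a quantile of the absolute residuals on the calibration data
interval = (prediction - q_hat, prediction + q_hat)
```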

**Coverage guarantee**: For any distribution, any model:
```
P(Y ∈ predicted_interval) ≥ 1 - α
```

This holds for **finite samples**, not just asymptotically!
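
The idea can be sketched from scratch in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the temporalcv implementation: `predict` and `simulate` are hypothetical helpers, the data are i.i.d. with heavy-tailed Student-t noise, and the conformity score is the absolute residual.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x):
    """Hypothetical stand-in for any fitted model."""
    return 2.0 * x

def simulate(n):
    """Toy i.i.d. data with heavy-tailed (Student-t) noise, i.e. non-Gaussian errors."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + rng.standard_t(df=3, size=n)
    return x, y

alpha = 0.10
x_cal, y_cal = simulate(500)    # calibration set
x_new, y_new = simulate(5000)   # fresh data, never used for calibration

# 1. Conformity scores: sorted absolute residuals on the calibration set.
scores = np.sort(np.abs(y_cal - predict(x_cal)))

# 2. Take the k-th smallest score, with k = ceil((n + 1)(1 - alpha)).
#    Capping k at n is a simplification; strictly, k > n means an infinite interval.
n = len(scores)
k = int(np.ceil((n + 1) * (1 - alpha)))
q_hat = scores[min(k, n) - 1]

# 3. Symmetric intervals around the new predictions.
lower = predict(x_new) - q_hat
upper = predict(x_new) + q_hat

empirical_coverage = np.mean((y_new >= lower) & (y_new <= upper))
print(f"q_hat = {q_hat:.3f}, coverage on fresh data = {empirical_coverage:.1%}")
```

Coverage on the fresh data should land near the 90% target even though the noise is far from Gaussian and no variance estimate was used.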

---

## What You'll Learn

1. **Why point predictions aren't enough**: Uncertainty quantification for decision-making
2. **Split conformal prediction**: Finite-sample coverage guarantees [T1]
3. **Adaptive conformal**: Handling distribution shift [T1]

**Prerequisites**: Notebooks 01-04 (Foundation tier)

---

## The Problem: Point Predictions Without Uncertainty

Consider this forecast:

```
Model prediction: 5.23
```

**Questions we can't answer:**
- How confident are we in this prediction?
- What's the range of likely outcomes?
- Should we make a high-stakes decision based on this?

**The solution**: Prediction intervals with coverage guarantees.

```
90% prediction interval: [4.87, 5.59]
```

Now we know: "We expect 90% of actuals to fall within this range."

In [None]:
import numpy as np
from sklearn.linear_model import Ridge

from temporalcv.conformal import (
    SplitConformalPredictor,
    AdaptiveConformalPredictor,
    walk_forward_conformal,
    evaluate_interval_quality,
    PredictionInterval,
)

np.random.seed(42)
print("temporalcv conformal prediction")

In [None]:
# Generate synthetic data
def generate_ar1(n: int, phi: float = 0.9, sigma: float = 0.1) -> np.ndarray:
    """Generate AR(1) process: y[t] = phi * y[t-1] + epsilon[t]"""
    y = np.zeros(n)
    y[0] = np.random.normal(0, sigma)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + np.random.normal(0, sigma)
    return y

# Create dataset
n = 500
y = generate_ar1(n, phi=0.9)

# Create features and split
X = np.column_stack([y[:-2], y[1:-1]])
y_target = y[2:]

train_size = int(len(y_target) * 0.7)
X_train, y_train = X[:train_size], y_target[:train_size]
X_test, y_test = X[train_size:], y_target[train_size:]

# Train model
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(f"Train: {len(X_train)}, Test: {len(X_test)}")

---

## 1. Split Conformal Prediction [T1]

**Key idea**: Use a calibration set to learn prediction uncertainty.

**Coverage guarantee**: For any distribution, any model:
```
P(Y ∈ Ĉ) ≥ 1 - α
```

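This is a **finite-sample** guarantee: it holds for any calibration-set size, provided calibration and test points are exchangeable. Autocorrelated series such as the AR(1) data used here satisfy that condition only approximately, so treat the empirical coverage below as a check rather than a given.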

**Citation**: Romano et al. (2019)

In [None]:
# Split test set: 30% calibration, 70% holdout
cal_size = int(len(X_test) * 0.3)
X_cal, y_cal = X_test[:cal_size], y_test[:cal_size]
X_holdout, y_holdout = X_test[cal_size:], y_test[cal_size:]

# Get predictions for each split
cal_preds = model.predict(X_cal)
holdout_preds = model.predict(X_holdout)

print(f"Calibration: {len(X_cal)}, Holdout: {len(X_holdout)}")

In [None]:
# Create and calibrate conformal predictor
conformal = SplitConformalPredictor(alpha=0.10)  # 90% intervals
conformal.calibrate(cal_preds, y_cal)

print(f"Calibrated quantile: {conformal.quantile_:.4f}")
print("This is the interval half-width (± around each prediction)")

In [None]:
# Generate prediction intervals
intervals = conformal.predict_interval(holdout_preds)

print(f"\nPrediction Intervals (first 5):")
for i in range(5):
    print(f"  Point: {intervals.point[i]:.3f}, Interval: [{intervals.lower[i]:.3f}, {intervals.upper[i]:.3f}]")

print(f"\nMean interval width: {intervals.mean_width:.4f}")

In [None]:
# Check coverage on HOLDOUT data (not calibration!)
coverage = intervals.coverage(y_holdout)

print(f"\nCoverage on holdout: {coverage:.1%}")
print(f"Target coverage: {intervals.confidence:.1%}")
print(f"Gap: {(coverage - intervals.confidence)*100:.1f}pp")

### Quantile Formula [T1]

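The conformal quantile level is computed as:

```
q = ceil((n + 1)(1 - α)) / n
```

The interval half-width is the empirical quantile of the calibration residuals at level q. Because q sits slightly above the naive 1 - α level, the result is valid in finite samples. For n = 45 (the calibration size used above) and α = 0.10:
- q = ceil(46 × 0.9) / 45 = ceil(41.4) / 45 = 42/45 = 0.933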
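
The arithmetic above can be checked directly. The sketch below assumes absolute-residual scores and uses made-up score values; `conformal_rank` is a hypothetical helper, not part of temporalcv.

```python
import math
import numpy as np

def conformal_rank(n: int, alpha: float) -> int:
    """Rank k = ceil((n + 1)(1 - alpha)) of the calibration score to use."""
    return math.ceil((n + 1) * (1 - alpha))

n, alpha = 45, 0.10
k = conformal_rank(n, alpha)
level = k / n
print(f"rank k = {k}, quantile level = {level:.3f}")  # rank k = 42, quantile level = 0.933

# Made-up calibration scores 0.01, 0.02, ..., 0.45 for illustration.
scores = np.arange(1, n + 1) / 100.0
q_hat = np.sort(scores)[k - 1]  # the 42nd smallest score
print(f"half-width q_hat = {q_hat:.2f}")
```

Indexing the sorted scores by rank avoids depending on a particular interpolation rule; `np.quantile` with its default linear interpolation would generally return a slightly different value.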

---

## 2. Walk-Forward Conformal [T2]

For time series, `walk_forward_conformal` handles the split automatically:

1. First 30% → calibration
2. Remaining 70% → holdout
3. Coverage evaluated **only on holdout**

In [None]:
# Walk-forward conformal (recommended for time series)
intervals, quality = walk_forward_conformal(
    predictions=predictions,  # All test predictions
    actuals=y_test,           # All test actuals
    calibration_fraction=0.3,
    alpha=0.10,               # 90% intervals
)

print("Walk-Forward Conformal Results:")
print(f"  Coverage: {quality['coverage']:.1%}")
print(f"  Target:   {quality['target_coverage']:.1%}")
print(f"  Gap:      {quality['coverage_gap']*100:.1f}pp")
print(f"\n  Mean width:      {quality['mean_width']:.4f}")
print(f"  Interval score:  {quality['interval_score']:.4f}")
print(f"  Quantile:        {quality['quantile']:.4f}")

### Interval Score (Proper Scoring Rule)

The interval score penalizes both:
- **Wide intervals** (lack of precision)
- **Miscoverage** (lack of calibration)

```
Score = width + (2/α) × penalty_for_miscoverage
```

Lower is better. This prevents gaming coverage with overly wide intervals.
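
A minimal NumPy sketch of this score follows. It uses the standard interval (Winkler) score, where the miscoverage penalty is the distance by which the actual value falls outside the interval; that temporalcv computes `interval_score` in exactly this form is an assumption.

```python
import numpy as np

def interval_score(lower, upper, actual, alpha):
    """Mean interval (Winkler) score; lower is better."""
    lower, upper, actual = map(np.asarray, (lower, upper, actual))
    width = upper - lower
    below = np.maximum(lower - actual, 0.0)  # shortfall when actual < lower
    above = np.maximum(actual - upper, 0.0)  # excess when actual > upper
    return float(np.mean(width + (2.0 / alpha) * (below + above)))

actual = np.array([1.0, 2.0, 3.0])

# Narrow intervals that cover every point: score equals the width.
narrow = interval_score(actual - 0.5, actual + 0.5, actual, alpha=0.10)
# Very wide intervals also cover, but pay for their width.
wide = interval_score(actual - 5.0, actual + 5.0, actual, alpha=0.10)
# Narrow intervals that miss by 0.1: width plus (2 / 0.10) * 0.1.
missed = interval_score(actual + 0.1, actual + 1.1, actual, alpha=0.10)

print(f"narrow={narrow:.2f}, wide={wide:.2f}, missed={missed:.2f}")
```

The wide intervals reach 100% coverage yet score worst of the three, which is the behavior the scoring rule is designed to produce.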

---

## 3. Adaptive Conformal [T1]

For **non-stationary** data (distribution shift), adaptive conformal adjusts online:

- If covered → decrease quantile (tighten intervals)
- If not covered → increase quantile (widen intervals)

**Citation**: Gibbs & Candès (2021)
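
The update rule can be sketched as follows. This follows the Gibbs & Candès formulation, which adjusts the working miscoverage level `alpha_t` rather than the quantile directly; `AdaptiveConformalPredictor` may implement the update differently. The rolling window, the simulated residuals, and the `aci_step` helper are illustrative assumptions.

```python
import numpy as np

def aci_step(alpha_t, covered, target_alpha, gamma):
    """One adaptive update: alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t)."""
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)

# A miss lowers alpha_t (wider intervals); a hit raises it slightly (tighter intervals).
after_miss = aci_step(0.10, covered=False, target_alpha=0.10, gamma=0.05)
after_hit = aci_step(0.10, covered=True, target_alpha=0.10, gamma=0.05)

# Simulated absolute residuals with a volatility jump halfway through.
rng = np.random.default_rng(1)
scores = np.abs(np.concatenate([rng.normal(0, 0.05, 300), rng.normal(0, 0.30, 300)]))

target_alpha, gamma, window = 0.10, 0.05, 100
alpha_t = target_alpha
covered_flags = []
for t in range(window, len(scores)):
    # Clip the level to [0, 1]; the original method uses an infinite interval when alpha_t < 0.
    level = float(np.clip(1.0 - alpha_t, 0.0, 1.0))
    half_width = np.quantile(scores[t - window:t], level)
    covered = bool(scores[t] <= half_width)
    covered_flags.append(covered)
    alpha_t = aci_step(alpha_t, covered, target_alpha, gamma)

online_coverage = float(np.mean(covered_flags))
print(f"after miss: {after_miss:.3f}, after hit: {after_hit:.3f}")
print(f"online coverage across the shift: {online_coverage:.1%}")
```

Larger `gamma` reacts faster to a shift but makes the interval width noisier in stable periods.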

In [None]:
# Simulate distribution shift
# First half: low volatility
y_shift1 = generate_ar1(100, phi=0.9, sigma=0.05)
# Second half: high volatility
y_shift2 = generate_ar1(100, phi=0.9, sigma=0.3)
y_shift2 = y_shift2 + y_shift1[-1]  # Continue from where we left off

y_shifted = np.concatenate([y_shift1, y_shift2])

# Create features
X_shifted = np.column_stack([y_shifted[:-2], y_shifted[1:-1]])
y_shifted_target = y_shifted[2:]

# Train on first portion
model_shift = Ridge()
model_shift.fit(X_shifted[:80], y_shifted_target[:80])
preds_shifted = model_shift.predict(X_shifted[80:])
actuals_shifted = y_shifted_target[80:]

print("Volatility shift occurs at index 100 of y_shifted (target index 98)")
print("Testing on target indices 80-197")

In [None]:
# Initialize adaptive conformal
adaptive = AdaptiveConformalPredictor(alpha=0.10, gamma=0.1)

# Initialize with first 20 points
adaptive.initialize(preds_shifted[:20], actuals_shifted[:20])

print(f"Initial quantile: {adaptive._current_quantile:.4f}")

In [None]:
# Online updates
coverages = []
quantiles = []

for i in range(20, len(preds_shifted)):
    pred = preds_shifted[i]
    actual = actuals_shifted[i]
    
    # Get interval
    lower, upper = adaptive.predict_interval(pred)
    covered = lower <= actual <= upper
    coverages.append(covered)
    
    # Update quantile based on coverage
    adaptive.update(pred, actual)
    quantiles.append(adaptive._current_quantile)

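# Check coverage early vs. late in the online period.
# Note: the loop starts at target index 100, just after the volatility
# shift (target index 98), so both halves lie in the high-volatility regime.
first_half = coverages[:len(coverages)//2]
second_half = coverages[len(coverages)//2:]

print(f"Coverage first half (early post-shift):  {np.mean(first_half):.1%}")
print(f"Coverage second half (late post-shift): {np.mean(second_half):.1%}")
print(f"Overall coverage: {np.mean(coverages):.1%}")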

In [None]:
# Show quantile adaptation
print(f"\nQuantile evolution:")
print(f"  Start:  {quantiles[0]:.4f}")
print(f"  Middle: {quantiles[len(quantiles)//2]:.4f}")
print(f"  End:    {quantiles[-1]:.4f}")

**Key insight**: Adaptive conformal widened intervals after the distribution shift to maintain coverage.

---

## 4. Interval Quality Metrics

Beyond coverage, we want to evaluate:

| Metric | What It Measures |
|--------|------------------|
| Coverage | Fraction of actuals within intervals |
| Mean width | Average interval size (smaller = more precise) |
| Interval score | Proper scoring rule (penalizes both width and miscoverage) |
| Conditional coverage | Coverage by prediction magnitude |
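
Conditional coverage can be estimated by binning on the size of the point prediction. The sketch below is one possible binning scheme (quantile bins of the absolute prediction) on simulated heteroscedastic data; it is not necessarily how `evaluate_interval_quality` computes the metric, and `coverage_by_magnitude` is a hypothetical helper.

```python
import numpy as np

def coverage_by_magnitude(point, lower, upper, actual, n_bins=3):
    """Coverage within quantile bins of |point prediction| (bin 0 = smallest)."""
    point, lower, upper, actual = map(np.asarray, (point, lower, upper, actual))
    covered = (actual >= lower) & (actual <= upper)
    magnitude = np.abs(point)
    edges = np.quantile(magnitude, np.linspace(0.0, 1.0, n_bins + 1))
    # Interior edges map each point to a bin id in 0 .. n_bins - 1.
    bin_ids = np.digitize(magnitude, edges[1:-1], right=True)
    return {
        b: float(covered[bin_ids == b].mean())
        for b in range(n_bins)
        if np.any(bin_ids == b)
    }

# Simulated data where noise grows with the prediction's magnitude.
rng = np.random.default_rng(2)
point = rng.uniform(-2.0, 2.0, 3000)
actual = point + rng.normal(0.0, 0.1 + 0.5 * np.abs(point))

# Constant-width intervals (in-sample quantile, for illustration only).
half_width = np.quantile(np.abs(actual - point), 0.90)
result = coverage_by_magnitude(point, point - half_width, point + half_width, actual)
for b, cov in result.items():
    print(f"bin {b}: coverage {cov:.1%}")
```

Overall coverage is about 90% here, yet the bins differ sharply: constant-width intervals over-cover small predictions and under-cover large ones, which marginal coverage alone would hide.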

In [None]:
# Evaluate interval quality
quality_metrics = evaluate_interval_quality(intervals, y_test[cal_size:])

print("Interval Quality Metrics:")
for key, value in quality_metrics.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

---

## Pitfall Section: Common Mistakes

### Pitfall 1: Coverage on Calibration Data

```python
# WRONG: Evaluate coverage on calibration data
conformal.calibrate(cal_preds, cal_actuals)
intervals = conformal.predict_interval(cal_preds)  # Same data!
coverage = intervals.coverage(cal_actuals)  # Inflated!

# RIGHT: Evaluate on fresh holdout data
conformal.calibrate(cal_preds, cal_actuals)
intervals = conformal.predict_interval(holdout_preds)  # Different data
coverage = intervals.coverage(holdout_actuals)  # Valid
```

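**Why it matters**: The quantile is chosen from the calibration residuals, so about 1 - α of the calibration points fall inside the intervals by construction. That number says nothing about how the intervals perform on new data.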

---

### Pitfall 2: Static Conformal on Shifting Data

```python
# WRONG: Use static conformal when data distribution shifts
conformal = SplitConformalPredictor(alpha=0.10)
conformal.calibrate(old_preds, old_actuals)  # Fixed quantile, won't adapt!

# RIGHT: Use adaptive conformal for distribution shift
adaptive = AdaptiveConformalPredictor(alpha=0.10, gamma=0.1)
# Updates online to maintain coverage
```

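**Why it matters**: A static quantile calibrated during a low-volatility period yields intervals that are too narrow once volatility rises, so coverage falls below target.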

---

### Pitfall 3: Gaming Coverage with Wide Intervals

```python
# WRONG: Report only coverage (can be gamed)
print(f"Coverage: {coverage:.1%}")  # Could be 99% with ±infinity intervals

# RIGHT: Report interval score (proper scoring rule)
quality = evaluate_interval_quality(intervals, actuals)
print(f"Interval score: {quality['interval_score']:.4f}")
```

**Why it matters**: Interval score penalizes both width and miscoverage.

---

## Key Insights

```
★ Insight ─────────────────────────────────────

1. Conformal prediction gives distribution-free guarantees [T1]
   - P(Y ∈ Ĉ) ≥ 1 - α holds for any distribution
   - Finite-sample valid (not just asymptotic)

2. Walk-forward conformal is time-series aware
   - Calibration on first 30%, holdout on rest
   - Never evaluate coverage on calibration data!

3. Adaptive conformal handles distribution shift [T1]
   - Online updates to maintain coverage
   - gamma controls adaptation speed [T3]

4. Use interval score, not just coverage
   - Proper scoring rule prevents gaming
   - Penalizes both width and miscoverage

───────────────────────────────────────────────
```

---

## Next Steps

| Notebook | Topic |
|----------|-------|
| **12** | Regime-stratified evaluation (capstone) |

---

**You've learned**: How to construct prediction intervals with coverage guarantees using conformal prediction, and how to adapt to distribution shift with online updates.