# 10: High-Persistence Metrics

**MC-SS, Move-Conditional Evaluation, and Theil's U**

---

## 🚨 If You Know sklearn But Not High-Persistence Data, Read This First

**What you already know (from standard ML)**:
- Lower MAE = better model
- 10% improvement over baseline = meaningful progress
- If your model beats the baseline, you've learned something useful

**What's different with high-persistence time series** (like Treasury rates, unemployment, GDP):

| Metric | Typical Value | Interpretation |
|--------|---------------|----------------|
| Model MAE | 0.0012 | Looks amazing! |
| Persistence MAE | 0.0013 | Also amazing... |
| Improvement | 8% | Sounds good! |

**The problem**: Both models are essentially predicting "no change."

```python
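# What's really happening:
y_t = 5.001     # y[t]
y_next = 5.002  # y[t+1]: barely moved!

persistence_pred = y_t  # persistence repeats the last observed value
model_pred = 5.0015     # your model splits the difference

# Both look "right" because the series barely moved
print(f"Persistence error: {abs(y_next - persistence_pred):.4f}")  # 0.0010
print(f"Model error:       {abs(y_next - model_pred):.4f}")        # 0.0005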
```

**The question we should ask**: When the series *actually moves*, does your model predict the move?

**The fix**: Move-Conditional Skill Score (MC-SS), which evaluates only on significant moves.

---

### Why 20% Improvement = "Too Good To Be True" [T3]

From 7+ post-mortems in production forecasting:
- **5-10% improvement**: Plausible with good feature engineering
- **10-20%**: Unusual but possible; warrants investigation  
- **>20%**: Almost always indicates leakage or bug

This threshold caught:
- BUG-001: Lag feature leakage (reported 35% improvement)
- BUG-005: Regime computation lookahead (reported 28% improvement)
- BUG-009: Internal-only validation (reported 42% improvement)
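
The bands above can be turned into a quick sanity check. The sketch below is illustrative only: `classify_improvement` is a hypothetical helper, not part of temporalcv, and the label for improvements under 5% is an assumption because the post-mortems do not define that band.

```python
def classify_improvement(model_mae: float, baseline_mae: float) -> str:
    """Map relative MAE improvement onto the rule-of-thumb bands above.

    Hypothetical helper for illustration; not part of temporalcv.
    """
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    improvement = 1 - model_mae / baseline_mae
    if improvement > 0.20:
        return "suspicious: check for leakage or bugs"
    if improvement > 0.10:
        return "unusual: investigate"
    if improvement >= 0.05:
        return "plausible"
    # No band is given below 5%; this label is an assumption.
    return "small or no improvement"


# The MAE values from the table above give roughly a 7.7% improvement.
print(classify_improvement(0.0012, 0.0013))  # plausible
```

A check like this does not prove a result is clean; it only flags when a closer look at the pipeline is worth the time.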

---

## What You'll Learn

1. **Why standard metrics fail on persistent data**: The persistence paradox
2. **Move-conditional metrics**: MC-SS isolates skill on significant events [T2]
3. **Theil's U statistic**: Scale-free comparison to naive baseline [T1]

**Prerequisites**: Notebooks 01-04 (Foundation tier)

---

## The Problem: The Persistence Paradox

Consider a highly persistent time series (like Treasury rates):

```
Model MAE:       0.0012
Persistence MAE: 0.0013
Improvement:     8%
```

**Looks good, right?** But wait...

Both are essentially "predicting no change" because:
- Series barely moves most of the time
- Model learned to copy y[t-1] (just like persistence)
- Small differences are just noise

**The paradox**: Standard MAE can't distinguish genuine skill from lucky noise.

In [None]:
import numpy as np
from sklearn.linear_model import Ridge

from temporalcv.persistence import (
    compute_move_threshold,
    compute_move_conditional_metrics,
    compute_direction_accuracy,
)
from temporalcv.metrics import compute_theils_u

np.random.seed(42)
print("temporalcv high-persistence metrics")

In [None]:
# Generate HIGH-PERSISTENCE data (phi = 0.98)
def generate_ar1(n: int, phi: float, sigma: float = 0.01, mu: float = 5.0) -> np.ndarray:
    """Generate a mean-reverting AR(1): y[t] = mu + phi * (y[t-1] - mu) + epsilon[t]"""
    y = np.zeros(n)
    y[0] = mu  # Start at the long-run level (like an interest rate)
    for t in range(1, n):
        # Revert toward mu, not toward zero; otherwise the series decays from 5.0 to ~0
        y[t] = mu + phi * (y[t - 1] - mu) + np.random.normal(0, sigma)
    return y

# High persistence like Treasury rates
n = 500
y = generate_ar1(n, phi=0.98, sigma=0.02)

# Create features and split
X = np.column_stack([y[:-2], y[1:-1]])
y_target = y[2:]

train_size = int(len(y_target) * 0.7)
X_train, y_train = X[:train_size], y_target[:train_size]
X_test, y_test = X[train_size:], y_target[train_size:]

print(f"Data range: {y.min():.3f} to {y.max():.3f}")
print(f"Daily changes std: {np.std(np.diff(y)):.4f}")

In [None]:
# Train model and compute predictions
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Persistence baseline: y[t-1]
persistence = X_test[:, -1]

# Standard MAE comparison
model_mae = np.mean(np.abs(y_test - predictions))
persistence_mae = np.mean(np.abs(y_test - persistence))
improvement = 1 - model_mae / persistence_mae

print("Standard MAE Comparison:")
print(f"  Model MAE:       {model_mae:.6f}")
print(f"  Persistence MAE: {persistence_mae:.6f}")
print(f"  Improvement:     {improvement:.1%}")

**The problem**: Both MAE values are tiny because the series barely moves.

**We need metrics that focus on *when the series actually moves*.**

---

## 1. Move Threshold Definition [T2]

A "move" is a change whose absolute value exceeds the **70th percentile** of historical absolute changes.

This means:
- ~30% of periods are "moves" (UP or DOWN)
- ~70% are "flat" (noise around no change)

**Critical**: The threshold MUST be computed from **training data only**!
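
As a rough illustration of this definition, the threshold can be sketched in plain NumPy. This assumes `compute_move_threshold` amounts to a percentile of absolute changes; the library's actual implementation may differ. `sketch_move_threshold` and the synthetic data are hypothetical.

```python
import numpy as np


def sketch_move_threshold(train_changes: np.ndarray, percentile: float = 70.0) -> float:
    """Percentile of |changes| from training data only (assumed definition)."""
    return float(np.percentile(np.abs(train_changes), percentile))


rng = np.random.default_rng(0)
demo_train_changes = rng.normal(0.0, 0.02, size=350)  # synthetic stand-in for training changes
demo_threshold = sketch_move_threshold(demo_train_changes)

# By construction, about 30% of the training changes exceed the threshold.
move_share = float(np.mean(np.abs(demo_train_changes) > demo_threshold))
print(f"threshold={demo_threshold:.4f}, share of moves={move_share:.1%}")
```

On the training set the move share is about 30% by construction. On the test set it can drift away from 30% if volatility changes, which is exactly the information the train-only threshold is meant to preserve.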

In [None]:
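# Compute changes (we evaluate on CHANGES, not levels)
train_changes = np.diff(y_train)
test_changes = np.diff(y_test)
# Predicted change = forecast minus the last observed value.
# np.diff(predictions) would measure the gap between consecutive forecasts instead.
pred_changes = predictions[1:] - y_test[:-1]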

# Compute threshold from TRAINING data only!
threshold = compute_move_threshold(train_changes, percentile=70.0)

print(f"Move threshold (70th percentile): {threshold:.6f}")
print(f"Computed from training data: n={len(train_changes)}")

In [None]:
# Classify test changes
n_up = np.sum(test_changes > threshold)
n_down = np.sum(test_changes < -threshold)
n_flat = np.sum(np.abs(test_changes) <= threshold)

print(f"\nTest data classification:")
print(f"  UP:   {n_up} ({n_up/len(test_changes):.1%})")
print(f"  DOWN: {n_down} ({n_down/len(test_changes):.1%})")
print(f"  FLAT: {n_flat} ({n_flat/len(test_changes):.1%})")

---

## 2. MC-SS (Move-Conditional Skill Score) [T2]

MC-SS measures skill **only on periods when the series actually moved**.

**Formula**:
```
MC-SS = 1 - (model_mae_on_moves / persistence_mae_on_moves)
```

**Interpretation**:
- MC-SS = 0: Model equals persistence on moves
- MC-SS > 0: Model beats persistence on moves
- MC-SS < 0: Model worse than persistence on moves
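
The formula can be checked by hand with a few lines of NumPy. This is a minimal sketch of the formula only, not the temporalcv implementation; `sketch_mc_ss` is a hypothetical name, and treating a move as `|actual change| > threshold` follows the definition in Section 1.

```python
import numpy as np


def sketch_mc_ss(pred_changes: np.ndarray, actual_changes: np.ndarray, threshold: float) -> float:
    """MC-SS = 1 - MAE(model on moves) / MAE(persistence on moves).

    Persistence predicts zero change, so its error on a move is |actual change|.
    """
    is_move = np.abs(actual_changes) > threshold
    if not is_move.any():
        return float("nan")  # no moves to evaluate on
    model_mae = np.mean(np.abs(actual_changes[is_move] - pred_changes[is_move]))
    persistence_mae = np.mean(np.abs(actual_changes[is_move]))
    return float(1 - model_mae / persistence_mae)


demo_actual = np.array([0.001, 0.030, -0.002, -0.040, 0.000])
demo_pred = np.array([0.000, 0.015, 0.000, -0.020, 0.001])

# Only the 0.030 and -0.040 changes count as moves; the model captures half of each.
print(sketch_mc_ss(demo_pred, demo_actual, threshold=0.01))  # about 0.5
```

Passing an all-zero forecast (persistence itself) gives a score of 0, matching the first interpretation bullet.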

In [None]:
# Compute move-conditional metrics
mc = compute_move_conditional_metrics(
    predictions=pred_changes,
    actuals=test_changes,
    threshold=threshold,  # From training data!
)

print("Move-Conditional Metrics:")
print(f"  MAE on UP moves:   {mc.mae_up:.6f}")
print(f"  MAE on DOWN moves: {mc.mae_down:.6f}")
print(f"  MAE on FLAT:       {mc.mae_flat:.6f}")
print(f"\n  MC-SS: {mc.skill_score:.3f}")
print(f"  Reliable? {mc.is_reliable} (n_up={mc.n_up}, n_down={mc.n_down})")

### Reliability Check

MC-SS requires sufficient samples in each direction:
- `is_reliable = True` if n_up >= 10 AND n_down >= 10
- If unreliable, treat MC-SS with caution [T3]
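
The rule is simple enough to sketch directly. The function below is a hypothetical stand-in that mirrors the stated rule (at least 10 UP and 10 DOWN moves); it is not the library's code, and `min_per_direction` is an illustrative parameter name.

```python
import numpy as np


def sketch_reliability(actual_changes: np.ndarray, threshold: float, min_per_direction: int = 10):
    """Count UP and DOWN moves and apply the n >= 10 per-direction rule."""
    n_up = int(np.sum(actual_changes > threshold))
    n_down = int(np.sum(actual_changes < -threshold))
    is_reliable = n_up >= min_per_direction and n_down >= min_per_direction
    return is_reliable, n_up, n_down


# 12 UP moves but only 3 DOWN moves: the DOWN side is too thin to trust.
lopsided = np.array([0.03] * 12 + [-0.03] * 3 + [0.0] * 35)
print(sketch_reliability(lopsided, threshold=0.01))  # (False, 12, 3)
```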

In [None]:
# MoveConditionalResult provides summary
print(f"\nTotal samples: {mc.n_total}")
print(f"Move fraction: {mc.move_fraction:.1%} (should be ~30%)")

---

## 3. Direction Accuracy

Did the model predict the **correct direction** of change?

**Two modes:**

| Mode | Classes | Use Case |
|------|---------|----------|
| 2-class | UP/DOWN | General direction accuracy |
| 3-class | UP/DOWN/FLAT | Fair comparison to persistence |

**Why 3-class matters**: Persistence always predicts FLAT (zero change).
Under 2-class scoring, a zero prediction matches neither UP nor DOWN, so persistence trivially gets 0% accuracy.
Under 3-class scoring, persistence gets credit whenever the actual change is FLAT.
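
To make the 3-class idea concrete, here is a minimal NumPy sketch. It assumes 3-class scoring labels each change as UP, DOWN, or FLAT using the move threshold and then counts matching labels; `to_three_class` and `sketch_direction_accuracy_3class` are hypothetical names, and `compute_direction_accuracy` may differ in detail.

```python
import numpy as np


def to_three_class(changes: np.ndarray, threshold: float) -> np.ndarray:
    """Label each change as +1 (UP), -1 (DOWN), or 0 (FLAT)."""
    return np.where(changes > threshold, 1, np.where(changes < -threshold, -1, 0))


def sketch_direction_accuracy_3class(
    pred_changes: np.ndarray, actual_changes: np.ndarray, threshold: float
) -> float:
    """Share of periods where predicted and actual labels agree."""
    pred_labels = to_three_class(pred_changes, threshold)
    actual_labels = to_three_class(actual_changes, threshold)
    return float(np.mean(pred_labels == actual_labels))


demo_actual = np.array([0.001, 0.030, -0.002, -0.040, 0.000])
demo_persistence = np.zeros_like(demo_actual)  # persistence predicts no change

# Three of the five periods are FLAT, so persistence scores 3/5.
print(sketch_direction_accuracy_3class(demo_persistence, demo_actual, threshold=0.01))  # 0.6
```

This is why 3-class accuracy is the fairer benchmark: a model only earns its keep if it beats the credit persistence already gets for the flat periods.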

In [None]:
# 2-class direction accuracy (sign-based)
acc_2class = compute_direction_accuracy(pred_changes, test_changes)

# 3-class direction accuracy (with threshold)
acc_3class = compute_direction_accuracy(
    pred_changes, test_changes, move_threshold=threshold
)

print("Direction Accuracy:")
print(f"  2-class (sign): {acc_2class:.1%}")
print(f"  3-class (UP/DOWN/FLAT): {acc_3class:.1%}")

In [None]:
# Compare to persistence (predicts 0 = FLAT)
persistence_changes = np.zeros_like(test_changes)  # Persistence predicts no change

pers_acc_2class = compute_direction_accuracy(persistence_changes, test_changes)
pers_acc_3class = compute_direction_accuracy(
    persistence_changes, test_changes, move_threshold=threshold
)

print("\nPersistence Direction Accuracy:")
print(f"  2-class: {pers_acc_2class:.1%} (trivially 0%)")
print(f"  3-class: {pers_acc_3class:.1%} (credit for FLAT periods)")

---

## 4. Theil's U Statistic [T1]

Theil's U compares model RMSE to naive (persistence) RMSE:

**Formula**:
```
U = RMSE(model) / RMSE(naive)
```

**Interpretation**:
- U < 1: Model beats naive
- U = 1: Model equals naive
- U > 1: Model worse than naive

**Citation**: Theil (1966)
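
The ratio is easy to reproduce by hand. The sketch below assumes the naive forecast is the previous actual value and drops the first test point, which has no predecessor inside the evaluation window; `compute_theils_u` may handle that edge differently. `sketch_theils_u` is a hypothetical name.

```python
import numpy as np


def sketch_theils_u(predictions: np.ndarray, actuals: np.ndarray) -> float:
    """U = RMSE(model) / RMSE(naive), with naive forecast = previous actual.

    The first observation is skipped because it has no previous actual
    inside the evaluation window (an assumption of this sketch).
    """
    model_rmse = np.sqrt(np.mean((actuals[1:] - predictions[1:]) ** 2))
    naive_rmse = np.sqrt(np.mean((actuals[1:] - actuals[:-1]) ** 2))
    return float(model_rmse / naive_rmse)


demo_levels = np.array([5.00, 5.02, 5.01, 5.05, 5.04])
demo_naive = np.concatenate([[5.00], demo_levels[:-1]])  # forecast = previous actual

print(sketch_theils_u(demo_naive, demo_levels))  # 1.0: identical to the naive forecast
```

Because numerator and denominator share the same units, U stays comparable across series with very different scales.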

In [None]:
# Compute Theil's U
theils_u = compute_theils_u(predictions, y_test)

print(f"Theil's U: {theils_u:.4f}")
if theils_u < 1:
    print(f"  Model beats persistence by {(1 - theils_u)*100:.1f}%")
else:
    print(f"  Model is {(theils_u - 1)*100:.1f}% worse than persistence")

---

## Complete Evaluation Template

Here's a production-ready function for high-persistence data:

In [None]:
def evaluate_high_persistence(
    predictions: np.ndarray,
    actuals: np.ndarray,
    train_actuals: np.ndarray,
    threshold_percentile: float = 70.0,
) -> dict:
    """
    Complete evaluation for high-persistence time series.
    
    Returns dict with standard metrics, move-conditional metrics,
    direction accuracy, and Theil's U.
    """
    # Standard metrics
    mae = float(np.mean(np.abs(actuals - predictions)))
    
    # Persistence baseline
    persistence_mae = float(np.mean(np.abs(np.diff(actuals))))
    
    # Move threshold from TRAINING data
    train_changes = np.diff(train_actuals)
    threshold = compute_move_threshold(train_changes, percentile=threshold_percentile)
    
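    # Move-conditional metrics
    # Predicted change = forecast minus the previous observed value
    # (np.diff(predictions) would compare consecutive forecasts instead)
    pred_changes = predictions[1:] - actuals[:-1]
    test_changes = np.diff(actuals)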
    
    mc = compute_move_conditional_metrics(
        pred_changes, test_changes, threshold=threshold
    )
    
    # Direction accuracy (3-class)
    dir_acc = compute_direction_accuracy(
        pred_changes, test_changes, move_threshold=threshold
    )
    
    # Theil's U
    theils_u = compute_theils_u(predictions, actuals)
    
    return {
        "mae": mae,
        "persistence_mae": persistence_mae,
        "improvement": 1 - mae / persistence_mae,
        "mc_ss": mc.skill_score,
        "mc_reliable": mc.is_reliable,
        "n_up": mc.n_up,
        "n_down": mc.n_down,
        "direction_accuracy": dir_acc,
        "theils_u": theils_u,
        "move_threshold": threshold,
    }

# Test the template
results = evaluate_high_persistence(predictions, y_test, y_train)

print("Complete Evaluation:")
print("="*50)
for key, value in results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

---

## Pitfall Section: Common Mistakes

### Pitfall 1: Threshold from Full Series

```python
# WRONG: Compute threshold from all data (leaks future!)
all_changes = np.diff(y_all)
threshold = compute_move_threshold(all_changes)  # BUG!

# RIGHT: Compute from training data only
train_changes = np.diff(y_train)
threshold = compute_move_threshold(train_changes)  # Safe
```

**Why it matters**: Using test data to define "moves" is lookahead bias (BUG-003).
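
A small synthetic case shows how much the threshold can shift. The sketch assumes the threshold is a percentile of absolute changes and uses made-up data in which the test period is more volatile than the training period; all names and values are illustrative.

```python
import numpy as np

# Synthetic changes: calm training period, volatile test period (illustrative values).
calm_train_changes = np.linspace(-0.01, 0.01, 350)
volatile_test_changes = np.linspace(-0.05, 0.05, 150)
all_demo_changes = np.concatenate([calm_train_changes, volatile_test_changes])

threshold_train_only = float(np.percentile(np.abs(calm_train_changes), 70))
threshold_leaky = float(np.percentile(np.abs(all_demo_changes), 70))

print(f"train-only threshold:  {threshold_train_only:.4f}")
print(f"full-series threshold: {threshold_leaky:.4f}")
```

The full-series threshold comes out higher because it has already seen the volatile test period. That changes which test periods count as moves, using information that would not exist at forecast time.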

---

### Pitfall 2: Using Levels Instead of Changes

```python
# WRONG: Pass raw levels to persistence metrics
mc = compute_move_conditional_metrics(predictions, actuals)  # ERROR!

# RIGHT: Pass changes (differences from the last observed value)
pred_changes = predictions[1:] - actuals[:-1]
actual_changes = np.diff(actuals)
mc = compute_move_conditional_metrics(pred_changes, actual_changes)
```

**Why it matters**: Persistence metrics assume "predicting zero change" as baseline.

---

### Pitfall 3: Ignoring Reliability Flag

```python
# WRONG: Trust MC-SS without checking reliability
print(f"MC-SS: {mc.skill_score:.3f}")  # May be unreliable!

# RIGHT: Check reliability first
if mc.is_reliable:
    print(f"MC-SS: {mc.skill_score:.3f}")
else:
    print(f"Warning: Only {mc.n_up} UP and {mc.n_down} DOWN samples")
```

**Why it matters**: n < 10 per direction → high variance → meaningless score.

---

## Key Insights

```
★ Insight ─────────────────────────────────────

1. Standard MAE fails on high-persistence data
   - Series barely moves → all models look similar
   - Need move-conditional evaluation

2. Move threshold = 70th percentile [T2]
   - ~30% moves, ~70% flat
   - MUST be computed from training data only

3. MC-SS isolates skill on significant moves
   - Check is_reliable flag (n >= 10 per direction)
   - Positive MC-SS = genuine forecasting skill

4. Theil's U provides scale-free comparison [T1]
   - U < 1 means model beats persistence
   - Works on levels (not changes)

───────────────────────────────────────────────
```

---

## Next Steps

| Notebook | Topic |
|----------|-------|
| **11** | Conformal prediction for uncertainty quantification |
| 12 | Regime-stratified evaluation (capstone) |

---

**You've learned**: How to properly evaluate forecasting models on high-persistence data using move-conditional metrics that isolate genuine skill from noise.