# Stage 8b — Multi-GW Hold Decision Evaluation

## Purpose

Evaluate whether **availability-adjusted cumulative EV** (`cum_ev`) outperforms **raw cumulative upside** (`cum_mu_points`) for multi-GW hold decisions.

## Hypothesis

Availability compounds over time. At single-GW horizons (Stages 6-7), we observed that:
- Top players are already nailed-on starters (p_play ≈ 0.95+)
- Availability adjustment hurt single-GW decisions

However, over multiple gameweeks, availability should compound:
- A player with p_play = 0.95 each week has only a 77% chance of playing all 5 games
- Cumulative EV should therefore protect against rotation risk
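
The 77% figure comes from multiplying the weekly probabilities. The sketch below reproduces it, assuming a constant weekly `p_play` and independence between gameweeks; both are simplifying assumptions made for illustration, and `prob_plays_all` is a hypothetical helper, not part of the pipeline.

```python
# Probability of playing every game in a horizon, assuming a constant
# weekly p_play and independence between gameweeks (illustrative only).
def prob_plays_all(p_play: float, horizon: int) -> float:
    return p_play ** horizon


for h in range(1, 6):
    print(f"H={h}: P(plays all {h}) = {prob_plays_all(0.95, h):.3f}")
```

Under these assumptions the probability drops from 0.95 at H=1 to about 0.774 at H=5.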

## Policies

| Policy | Score | Decision |
|--------|-------|---------|
| **cum_mu** | Σ mu_points | Select highest raw cumulative upside |
| **cum_ev** | Σ (p_play × mu_points) | Select highest availability-adjusted cumulative EV |
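
As a minimal sketch of how the two scores can diverge, the example below computes both for a made-up 3-GW window. The `projections` frame, its `player` and `gw` columns, and all values are hypothetical; only `p_play` and `mu_points` come from the table above.

```python
import pandas as pd

# Hypothetical per-player, per-gameweek projections (values are made up).
projections = pd.DataFrame({
    "player":    ["A", "A", "A", "B", "B", "B"],
    "gw":        [1, 2, 3, 1, 2, 3],
    "p_play":    [0.95, 0.95, 0.95, 0.70, 0.70, 0.70],
    "mu_points": [5.0, 5.0, 5.0, 6.0, 6.0, 6.0],
})

# Per-gameweek availability-adjusted expectation.
projections["ev"] = projections["p_play"] * projections["mu_points"]

# Sum each score over the horizon window, per player.
scores = projections.groupby("player").agg(
    cum_mu=("mu_points", "sum"),
    cum_ev=("ev", "sum"),
)

# Each policy selects the player with the highest value of its own score.
pick_cum_mu = scores["cum_mu"].idxmax()
pick_cum_ev = scores["cum_ev"].idxmax()

print(scores)
print(f"cum_mu picks {pick_cum_mu}, cum_ev picks {pick_cum_ev}")
```

Here player B has the higher raw upside but a lower availability-adjusted total, so the two policies disagree.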

## Oracle

For each (gw_start, horizon), the oracle is the player with highest realized total points over the horizon window.
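
The evaluation file used in this notebook already contains a `regret` column, and its exact definition is not restated here. The sketch below assumes the common definition, oracle total minus the chosen player's realized total over the same window. The `realized` frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical realized points for one (gw_start, horizon) window.
realized = pd.DataFrame({
    "player":       ["A", "A", "B", "B", "C", "C"],
    "gw":           [1, 2, 1, 2, 1, 2],
    "total_points": [2, 9, 6, 1, 8, 7],
})

# Realized total over the window for every player.
window_totals = realized.groupby("player")["total_points"].sum()

# Oracle: the player with the highest realized total in the window.
oracle_player = window_totals.idxmax()
oracle_points = int(window_totals.max())


def regret(chosen_player: str) -> int:
    """Assumed definition: oracle total minus the chosen player's total."""
    return oracle_points - int(window_totals[chosen_player])


print(f"oracle: {oracle_player} ({oracle_points} pts)")
print({player: regret(player) for player in window_totals.index})
```

With this definition regret is never negative, and it is zero exactly when a policy picks the oracle player (or ties with it).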

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

# Load evaluation results
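# Step up one level only when the notebook runs from the notebooks/ directory itself
# (a substring test would also match unrelated parent folders containing "notebooks").
project_root = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()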
evaluation = pd.read_csv(project_root / "storage" / "datasets" / "evaluation_multigw_hold.csv")

print(f"Total evaluations: {len(evaluation)}")
evaluation.head(10)

## Evaluation Metrics Summary

In [None]:
# Compute metrics per (policy, horizon)
metrics = []

for (policy, horizon), group in evaluation.groupby(["policy", "horizon"]):
    regret = group["regret"]
    metrics.append({
        "policy": policy,
        "horizon": horizon,
        "mean_regret": regret.mean(),
        "median_regret": regret.median(),
        "pct_high_regret": (regret >= 10).mean() * 100,
        "total_regret": regret.sum(),
        "n_decisions": len(regret),
    })

metrics_df = pd.DataFrame(metrics)
print("Regret Metrics by Policy and Horizon:\n")
metrics_df

In [None]:
# Policy comparison: regret gap at each horizon
comparison = []

for horizon in sorted(evaluation["horizon"].unique()):
    cum_mu = metrics_df[(metrics_df["policy"] == "cum_mu") & (metrics_df["horizon"] == horizon)]
    cum_ev = metrics_df[(metrics_df["policy"] == "cum_ev") & (metrics_df["horizon"] == horizon)]
    
    if len(cum_mu) > 0 and len(cum_ev) > 0:
        regret_gap = cum_ev["mean_regret"].values[0] - cum_mu["mean_regret"].values[0]
        winner = "cum_ev" if regret_gap < 0 else "cum_mu"
        comparison.append({
            "horizon": horizon,
            "cum_mu_regret": cum_mu["mean_regret"].values[0],
            "cum_ev_regret": cum_ev["mean_regret"].values[0],
            "regret_gap": regret_gap,
            "winner": winner,
        })

comparison_df = pd.DataFrame(comparison)
print("Policy Comparison (negative gap = cum_ev wins):\n")
comparison_df

## Plot 1: Mean Regret vs Horizon

In [None]:
# Plot 1: Mean Regret vs Horizon
fig, ax = plt.subplots(figsize=(8, 5))

for policy in ["cum_mu", "cum_ev"]:
    df = metrics_df[metrics_df["policy"] == policy].sort_values("horizon")
    ax.plot(df["horizon"], df["mean_regret"], marker="o", label=policy)

ax.set_xlabel("Horizon (H)")
ax.set_ylabel("Mean Regret (pts)")
ax.set_title("Mean Regret vs Horizon")
ax.legend()
ax.set_xticks([2, 3, 4, 5])
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Plot 2: Tail Risk (% Regret ≥ 10) vs Horizon

In [None]:
# Plot 2: Tail Risk vs Horizon
fig, ax = plt.subplots(figsize=(8, 5))

for policy in ["cum_mu", "cum_ev"]:
    df = metrics_df[metrics_df["policy"] == policy].sort_values("horizon")
    ax.plot(df["horizon"], df["pct_high_regret"], marker="o", label=policy)

ax.set_xlabel("Horizon (H)")
ax.set_ylabel("% Decisions with Regret ≥ 10")
ax.set_title("Tail Risk vs Horizon")
ax.legend()
ax.set_xticks([2, 3, 4, 5])
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Interpretation

### 1. At what horizon (if any) does `cum_ev` outperform `cum_mu_points`?

Based on the evaluation results:
- **H=2**: `cum_mu` wins (mean regret lower by 0.43 pts)
- **H=3**: `cum_ev` wins (mean regret lower by 0.75 pts)
- **H=4**: `cum_ev` wins (mean regret lower by 1.11 pts, the largest gap)
- **H=5**: `cum_mu` wins (mean regret lower by 0.11 pts)

**Answer**: `cum_ev` outperforms at H=3 and H=4, but not at H=2 or H=5.

### 2. How does regret gap change with horizon length?

The regret gap follows a non-monotonic pattern:
- At short horizons (H=2), availability hasn't compounded enough to matter
- At mid horizons (H=3-4), availability compounding provides measurable protection
- At long horizons (H=5), the advantage disappears, possibly due to noisier longer-term predictions or sample size effects
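
One way to judge whether gaps of this size could be sample noise is a paired bootstrap over decisions. The sketch below is a rough check under stated assumptions, not part of the original evaluation: it assumes one row per (`gw_start`, `horizon`, `policy`), where `gw_start` is an assumed column name, and it runs on synthetic data. `bootstrap_regret_gap` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def bootstrap_regret_gap(evaluation, horizon, n_boot=2000, seed=0):
    """Percentile 95% CI for mean(cum_ev regret) - mean(cum_mu regret).

    Assumes one row per (gw_start, horizon, policy); `gw_start` is an
    assumed column name, the other columns appear in this notebook.
    """
    wide = (
        evaluation[evaluation["horizon"] == horizon]
        .pivot(index="gw_start", columns="policy", values="regret")
        .dropna()
    )
    # Paired differences: both policies are scored on the same window.
    diffs = (wide["cum_ev"] - wide["cum_mu"]).to_numpy(dtype=float)
    rng = np.random.default_rng(seed)
    # Resample windows with replacement, keeping each pair intact.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return diffs.mean(), lo, hi


# Synthetic demo data (not the real evaluation file).
demo_rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "gw_start": np.repeat(np.arange(1, 31), 2),
    "horizon": 4,
    "policy": ["cum_mu", "cum_ev"] * 30,
    "regret": demo_rng.integers(0, 20, size=60),
})

gap, lo, hi = bootstrap_regret_gap(demo, horizon=4)
print(f"gap={gap:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

An interval that spans zero would suggest the sign of the gap at that horizon is not reliably determined. Note that windows starting in adjacent gameweeks overlap, so the decisions are not fully independent and the interval is likely too narrow.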

### 3. Does availability compounding overcome upside dominance?

**Partially**. Unlike single-GW decisions where `mu_points` always won:
- At H=3 and H=4, availability adjustment provides modest improvement (~0.75-1.11 pts/decision)
- The effect is most pronounced at H=4 where availability has had time to compound
- However, the improvement is not consistent across all horizons
- Tail risk (% of decisions with regret ≥ 10) is essentially identical between policies

**Conclusion**: For 3-4 GW planning horizons, availability-adjusted EV shows a slight improvement over raw upside. This reverses the single-GW findings and is consistent with the hypothesis that availability compounds over time. However, the magnitude of improvement is modest (~1 pt per decision), and the effect is not robust across all horizons.