# Conditional Training A/B Comparison

**Objective**: Measure whether training only on rows where `minutes > 0` improves production decision quality.

**Background**: Research (Stage 6) showed that separating "didn't play" from "played badly" reduces captain regret. This notebook tests whether applying that insight to production training yields measurable benefit.

**Models Compared**:
| Model | Description |
|-------|-------------|
| Baseline | Production model trained on all rows |
| Conditional | Production model trained only on `minutes > 0` |

**Metrics**:
- MAE / RMSE (sanity check)
- Captain regret vs oracle
- % of gameweeks (GW) with regret ≥ 10 pts
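The two models in the table above differ only in which training rows they see. A minimal sketch of that filter on toy data, assuming a `minutes` column and a hypothetical helper `select_training_rows`; the actual internals of `Pipeline(conditional_on_play=...)` are not shown in this notebook and may differ:

```python
import pandas as pd


def select_training_rows(train_df: pd.DataFrame, conditional_on_play: bool) -> pd.DataFrame:
    """Hypothetical helper: keep all rows, or only rows where the player appeared."""
    if conditional_on_play:
        return train_df[train_df["minutes"] > 0].copy()
    return train_df.copy()


# Toy data for illustration only, not taken from the project database.
toy_train = pd.DataFrame({
    "player_id": [1, 2, 3, 4],
    "minutes": [90, 0, 45, 0],
    "total_points": [6, 0, 2, 0],
})

baseline_rows = select_training_rows(toy_train, conditional_on_play=False)
conditional_rows = select_training_rows(toy_train, conditional_on_play=True)
print(len(baseline_rows), len(conditional_rows))
```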

In [1]:
import sys
from pathlib import Path

# Add src to path
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root / "src"))

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

from dugout.production.pipeline.runner import Pipeline
from dugout.production.features.definitions import FEATURE_COLUMNS

In [2]:
# Force reload of modules to pick up changes
import importlib
import dugout.production.features.builder
import dugout.production.pipeline.runner
importlib.reload(dugout.production.features.builder)
importlib.reload(dugout.production.pipeline.runner)

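# Re-import so `Pipeline` binds to the class from the reloaded module.
# `features.definitions` is not reloaded above, so FEATURE_COLUMNS is unchanged.
from dugout.production.pipeline.runner import Pipeline
from dugout.production.features.definitions import FEATURE_COLUMNS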
print("Modules reloaded")

Modules reloaded


## 1. Train Both Models

In [3]:
# Train baseline model (all rows)
print("=" * 60)
print("BASELINE MODEL (all rows)")
print("=" * 60)

baseline = Pipeline(conditional_on_play=False)
baseline.gather_data()
baseline.build_features()
baseline.split()
baseline.train()

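# Keep an in-memory reference: each train() call saves to the same
# model.joblib path, so the file on disk is overwritten by the next run.
baseline_model = baseline.model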

BASELINE MODEL (all rows)
Database: /Users/safarifgisa/Documents/Springboard/Google5DayAI/the-dugout/storage/fpl_2025_26.sqlite
Gathered 16,559 rows, 799 players
Built 12,620 feature rows
Training LightGBM...
Training residual model...
Model saved to /Users/safarifgisa/Documents/Springboard/Google5DayAI/the-dugout/storage/production/models/lightgbm_v1/model.joblib


In [4]:
# Train conditional model (minutes > 0 only)
print("\n" + "=" * 60)
print("CONDITIONAL MODEL (minutes > 0 only)")
print("=" * 60)

conditional = Pipeline(conditional_on_play=True)
conditional.gather_data()
conditional.build_features()
conditional.split()
conditional.train()

conditional_model = conditional.model


CONDITIONAL MODEL (minutes > 0 only)
Database: /Users/safarifgisa/Documents/Springboard/Google5DayAI/the-dugout/storage/fpl_2025_26.sqlite
Gathered 16,559 rows, 799 players
Built 12,620 feature rows
Conditional training: 2,693 / 6,564 rows (minutes > 0)
Training LightGBM...
Training residual model...
Model saved to /Users/safarifgisa/Documents/Springboard/Google5DayAI/the-dugout/storage/production/models/lightgbm_v1/model.joblib


## 2. Evaluate MAE / RMSE (Sanity Check)

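Both models are evaluated on the **same** test set, which includes all rows (zero-minute rows too). The conditional model never sees zero-minute rows during training, so a higher error on this set is expected; MAE and RMSE serve here as a sanity check, not as the decision metric.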
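Because the shared test set mixes zero-minute and played rows, an overall MAE can hide where each model's error comes from. The sketch below splits the error by appearance. It runs on toy data and assumes the test frame carries a `minutes` column, which this notebook does not confirm; `mae_by_played` is a hypothetical helper.

```python
import pandas as pd


def mae_by_played(df: pd.DataFrame, pred_col: str) -> pd.Series:
    """Mean absolute error of `pred_col`, split by whether minutes > 0."""
    abs_err = (df["total_points"] - df[pred_col]).abs()
    group = df["minutes"].gt(0).map({True: "played", False: "did_not_play"})
    return abs_err.groupby(group).mean()


# Toy data for illustration only.
toy_test = pd.DataFrame({
    "minutes": [0, 0, 90, 60],
    "total_points": [0, 0, 6, 2],
    "pred_a": [0.5, 0.5, 3.0, 3.0],  # stands in for an all-rows model
    "pred_b": [2.0, 2.0, 5.0, 3.0],  # stands in for a played-rows-only model
})

mae_a = mae_by_played(toy_test, "pred_a")
mae_b = mae_by_played(toy_test, "pred_b")
print(pd.DataFrame({"pred_a": mae_a, "pred_b": mae_b}))
```

In this toy case the second model is worse on zero-minute rows but better on played rows, which is the pattern an all-rows MAE would blur.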

In [5]:
# Use same test set for both
test_df = baseline.test_df.copy()

# Predictions
X_test = test_df[FEATURE_COLUMNS].values
test_df["pred_baseline"] = baseline_model.predict(X_test)
test_df["pred_conditional"] = conditional_model.predict(X_test)

y_true = test_df["total_points"].values

results = {
    "Baseline": {
        "MAE": mean_absolute_error(y_true, test_df["pred_baseline"]),
        "RMSE": np.sqrt(mean_squared_error(y_true, test_df["pred_baseline"])),
    },
    "Conditional": {
        "MAE": mean_absolute_error(y_true, test_df["pred_conditional"]),
        "RMSE": np.sqrt(mean_squared_error(y_true, test_df["pred_conditional"])),
    },
}

print("\nTest Set Performance (all rows):")
print("-" * 40)
for model_name, metrics in results.items():
    print(f"{model_name:12s}: MAE={metrics['MAE']:.3f}, RMSE={metrics['RMSE']:.3f}")


Test Set Performance (all rows):
----------------------------------------
Baseline    : MAE=1.041, RMSE=1.948
Conditional : MAE=1.912, RMSE=2.280


## 3. Captain Regret Evaluation

For each gameweek in the test set:
1. Select each model's captain as `argmax(predicted_points)`
2. Find the oracle captain (the player with the highest actual points that GW)
3. Regret = `oracle_points - captain_points`
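The three steps above can be worked through on a small example. This is a vectorized sketch on made-up data; the column names mirror the ones used in this notebook, and `pred` stands in for either model's prediction column.

```python
import pandas as pd

# Toy data for illustration only: two gameweeks, three players each.
toy = pd.DataFrame({
    "gw": [1, 1, 1, 2, 2, 2],
    "player_id": [10, 11, 12, 10, 11, 12],
    "total_points": [2, 13, 6, 9, 1, 5],
    "pred": [5.1, 4.0, 3.2, 2.5, 3.0, 4.4],
})

# Step 1: row index of the model's captain in each gameweek.
captain_idx = toy.groupby("gw")["pred"].idxmax()
# Step 2: best actual score in each gameweek (the oracle captain's points).
oracle_points = toy.groupby("gw")["total_points"].max()
# Step 3: regret = oracle points minus the points the chosen captain scored.
captain_points = toy.loc[captain_idx].set_index("gw")["total_points"]
regret = oracle_points - captain_points
print(regret)
```

In GW 1 the model picks player 10 (2 pts) while the best score was 13, so regret is 11. In GW 2 it picks player 12 (5 pts) against a best score of 9, so regret is 4.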

In [6]:
def compute_captain_regret(df: pd.DataFrame, pred_col: str) -> pd.DataFrame:
    """Compute captain regret per gameweek.
    
    Args:
        df: DataFrame with gw, player_id, total_points, and prediction column
        pred_col: Name of prediction column to use for captain selection
    
    Returns:
        DataFrame with per-GW regret stats
    """
    results = []
    
    for gw in df["gw"].unique():
        gw_df = df[df["gw"] == gw]
        
        # Oracle: player with max actual points
        oracle_idx = gw_df["total_points"].idxmax()
        oracle_points = gw_df.loc[oracle_idx, "total_points"]
        
        # Model pick: player with max predicted points
        captain_idx = gw_df[pred_col].idxmax()
        captain_points = gw_df.loc[captain_idx, "total_points"]
        captain_pred = gw_df.loc[captain_idx, pred_col]
        
        regret = oracle_points - captain_points
        
        results.append({
            "gw": gw,
            "oracle_points": oracle_points,
            "captain_points": captain_points,
            "captain_pred": captain_pred,
            "regret": regret,
        })
    
    return pd.DataFrame(results)

In [7]:
# Compute regret for both models
baseline_regret = compute_captain_regret(test_df, "pred_baseline")
conditional_regret = compute_captain_regret(test_df, "pred_conditional")

baseline_regret["model"] = "Baseline"
conditional_regret["model"] = "Conditional"

regret_df = pd.concat([baseline_regret, conditional_regret], ignore_index=True)

print("Per-GW Regret Summary:")
print(regret_df.groupby("model")["regret"].agg(["mean", "std", "min", "max"]).round(2))

Per-GW Regret Summary:
              mean   std  min  max
model                             
Baseline     13.25  1.26   12   15
Conditional  12.25  2.50    9   15


In [8]:
# Final comparison table
summary = pd.DataFrame({
    "Model": ["Baseline", "Conditional"],
    "MAE": [results["Baseline"]["MAE"], results["Conditional"]["MAE"]],
    "RMSE": [results["Baseline"]["RMSE"], results["Conditional"]["RMSE"]],
    "Mean Regret": [
        baseline_regret["regret"].mean(),
        conditional_regret["regret"].mean(),
    ],
    "% GW Regret >= 10": [
        (baseline_regret["regret"] >= 10).mean() * 100,
        (conditional_regret["regret"] >= 10).mean() * 100,
    ],
})

print("\n" + "=" * 70)
print("FINAL COMPARISON")
print("=" * 70)
print(summary.to_string(index=False))


FINAL COMPARISON
      Model      MAE     RMSE  Mean Regret  % GW Regret >= 10
   Baseline 1.041177 1.947962        13.25              100.0
Conditional 1.912234 2.280445        12.25               75.0


In [9]:
# Compute delta (positive = conditional model has lower mean regret)
baseline_mean = baseline_regret["regret"].mean()
delta_regret = baseline_mean - conditional_regret["regret"].mean()
delta_pct = delta_regret / baseline_mean * 100 if baseline_mean > 0 else 0

print(f"\nRegret Delta: {delta_regret:.2f} pts/GW")
print(f"Improvement: {delta_pct:.1f}%")

if delta_regret > 0.5:
    print("\n✅ CONCLUSION: Conditional training IMPROVES decision quality.")
    print("   Consider making conditional_on_play=True the default.")
elif delta_regret < -0.5:
    print("\n❌ CONCLUSION: Conditional training HURTS decision quality.")
    print("   Keep baseline (all rows) as default.")
else:
    print("\n⚠️  CONCLUSION: Effect is NEGLIGIBLE (< 0.5 pts/GW).")
    print("   Conditional training doesn't meaningfully impact captain decisions.")


Regret Delta: 1.00 pts/GW
Improvement: 7.5%

✅ CONCLUSION: Conditional training IMPROVES decision quality.
   Consider making conditional_on_play=True the default.
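With only a handful of test gameweeks, a difference in mean regret can be driven by one or two weeks. One way to gauge that is a paired bootstrap over gameweeks. The sketch below uses made-up per-GW regret values (not the results above) and a hypothetical helper `paired_bootstrap_ci`; it illustrates the technique and is not part of the pipeline.

```python
import numpy as np


def paired_bootstrap_ci(regret_a, regret_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(regret_a - regret_b), resampling gameweeks."""
    diffs = np.asarray(regret_a, dtype=float) - np.asarray(regret_b, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample gameweek positions with replacement; the pairing stays intact
    # because we resample the per-GW differences, not each model separately.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lower, upper


# Illustrative values only; each position is the same gameweek for both models.
toy_baseline = [14, 10, 12, 16, 11, 13]
toy_conditional = [12, 11, 12, 13, 12, 10]

mean_diff, lower, upper = paired_bootstrap_ci(toy_baseline, toy_conditional)
print(f"mean diff={mean_diff:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

If the interval spans zero, the available gameweeks do not clearly separate the two models, and a fixed point threshold alone may overstate the evidence.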


## 4. Captain Agreement Rate

How often do both models select the same captain?

In [10]:
# Check agreement per GW
agreement_count = 0
total_gws = test_df["gw"].nunique()

for gw in test_df["gw"].unique():
    gw_df = test_df[test_df["gw"] == gw]
    baseline_pick = gw_df.loc[gw_df["pred_baseline"].idxmax(), "player_id"]
    conditional_pick = gw_df.loc[gw_df["pred_conditional"].idxmax(), "player_id"]
    if baseline_pick == conditional_pick:
        agreement_count += 1

agreement_rate = agreement_count / total_gws * 100
print(f"Captain Agreement Rate: {agreement_rate:.1f}% ({agreement_count}/{total_gws} GWs)")

Captain Agreement Rate: 25.0% (1/4 GWs)
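To see where the models disagree, not just how often, the per-GW picks can be tabulated side by side. A sketch on toy data; the column names mirror this notebook's `test_df`, the values are illustrative, and `picks` is a hypothetical helper.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "gw": [1, 1, 2, 2],
    "player_id": [10, 11, 10, 11],
    "total_points": [8, 2, 1, 9],
    "pred_baseline": [6.0, 4.0, 5.0, 3.0],
    "pred_conditional": [6.5, 5.0, 4.0, 7.0],
})


def picks(df: pd.DataFrame, pred_col: str) -> pd.DataFrame:
    """Captain player_id and actual points per gameweek for one prediction column."""
    idx = df.groupby("gw")[pred_col].idxmax()
    return df.loc[idx].set_index("gw")[["player_id", "total_points"]]


side_by_side = picks(toy, "pred_baseline").join(
    picks(toy, "pred_conditional"),
    lsuffix="_baseline",
    rsuffix="_conditional",
)
side_by_side["agree"] = (
    side_by_side["player_id_baseline"] == side_by_side["player_id_conditional"]
)
print(side_by_side)
```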
