# Data Leakage Fix Verification

This notebook demonstrates the train/test split evaluation methodology that prevents data leakage in time series forecasting.

## Methodology

1. **Split data**: 80% train, 20% test
2. **Evaluate**: Train ONLY on train data, evaluate on test (unseen) data
3. **Production**: Retrain on ALL data for final forecast
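
The three steps above can be sketched on synthetic data. This is a minimal illustration, not the `gw2ml` implementation: `chronological_split` and `fit_mean_model` are hypothetical helpers, and the mean forecaster only stands in for a real model.

```python
import numpy as np

def chronological_split(values, train_fraction=0.8):
    """Split a 1-D array in time order: the first part is train, the rest is test."""
    split_idx = int(len(values) * train_fraction)
    return values[:split_idx], values[split_idx:]

def fit_mean_model(train_values):
    """Hypothetical stand-in model: always forecasts the training mean."""
    return float(np.mean(train_values))

# Synthetic price-like series (illustrative data only)
rng = np.random.default_rng(0)
prices = 170 + np.cumsum(rng.normal(0, 0.5, size=864))

# 1. Split: no shuffling, so the test set is strictly later than the train set
train, test = chronological_split(prices, 0.8)

# 2. Evaluate: fit on train only, score on the held-out test values
eval_model = fit_mean_model(train)
test_mae = float(np.mean(np.abs(test - eval_model)))

# 3. Production: refit on everything once the evaluation is finished
production_model = fit_mean_model(prices)

print(len(train), len(test))  # 691 173
```

The evaluation score comes from the model fitted in step 2; the model refitted in step 3 is never scored on the test set it has now seen.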

## Why This Matters

- **Old approach**: Model could see future data during walk-forward validation
- **New approach**: True out-of-sample evaluation on completely unseen data
- **Result**: Honest performance metrics + best production model
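
A toy example shows why leaked future values make metrics look better than they are. It is an assumption-laden illustration on a synthetic random walk, not a reproduction of the old pipeline: the "leaky" forecast is a centered moving average that includes the target value and the next one.

```python
import numpy as np

rng = np.random.default_rng(42)
series = 100.0 + np.cumsum(rng.normal(0.0, 1.0, size=500))

targets = series[1:-1]   # values at t = 1 .. n-2
honest = series[:-2]     # last-value forecast: uses only x[t-1]
# Centered mean: uses x[t-1], x[t] and x[t+1], i.e. the target and the future
leaky = (series[:-2] + series[1:-1] + series[2:]) / 3.0

honest_mae = float(np.mean(np.abs(targets - honest)))
leaky_mae = float(np.mean(np.abs(targets - leaky)))

print(f"honest MAE: {honest_mae:.3f}")
print(f"leaky MAE:  {leaky_mae:.3f}")
```

The leaky forecast scores better only because it has access to information that would not exist at prediction time.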

In [1]:
# Imports
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np
from pathlib import Path

from gw2ml.data.loaders import load_gw2_series
from gw2ml.evaluation.backtest import walk_forward_backtest
from gw2ml.modeling.arima import ARIMAModel
from gw2ml.modeling.xgboost import XGBoostModel
from gw2ml.metrics.registry import get_metric

In [None]:
# Configuration
# 19702 - Platinum Ore, 19976 - Mystic Coin
ITEM_ID = 19702  # GW2 item ID to analyze
DAYS_BACK = 3  # Days of historical data
VALUE_COLUMN = "buy_unit_price"
FORECAST_HORIZON = 24  # Steps ahead for both evaluation and production forecast
TRAIN_SPLIT = 0.8  # 80% train, 20% test

print(f"Configuration:")
print(f"  Item ID: {ITEM_ID}")
print(f"  Days back: {DAYS_BACK}")
print(f"  Train/Test split: {TRAIN_SPLIT:.0%} / {(1-TRAIN_SPLIT):.0%}")
print(f"  Forecast horizon: {FORECAST_HORIZON} steps (used for both backtest and production)")

Configuration:
  Item ID: 19702
  Days back: 3
  Train/Test split: 80% / 20%
  Forecast horizon: 60 steps (used for both backtest and production)


## Step 1: Load Data

Load historical price data for the item. This will be cached locally.

In [3]:
# Load data (uses caching)
print("Loading data...")
series_meta = load_gw2_series(
    item_id=ITEM_ID,
    days_back=DAYS_BACK,
    value_column=VALUE_COLUMN,
    fill_missing_dates=True,
)

series = series_meta.series
print(f"✓ Loaded {len(series)} data points")
print(f"  Time range: {series.start_time()} to {series.end_time()}")
print(f"  Value range: {series.values().min():.2f} to {series.values().max():.2f}")

Loading data...
✓ Loaded 864 data points
  Time range: 2026-01-02 14:40:00 to 2026-01-05 14:35:00
  Value range: 155.00 to 199.00


## Step 2: Split Data into Train/Test

**Critical**: The model will NEVER see test data during evaluation!

In [4]:
# Split into train/test
split_idx = int(len(series) * TRAIN_SPLIT)
train_series = series[:split_idx]
test_series = series[split_idx:]

print(f"Data split:")
print(f"  Train: {len(train_series)} points ({len(train_series)/len(series):.1%})")
print(f"  Test:  {len(test_series)} points ({len(test_series)/len(series):.1%})")
print(f"\nTrain period: {train_series.start_time()} to {train_series.end_time()}")
print(f"Test period:  {test_series.start_time()} to {test_series.end_time()}")

Data split:
  Train: 691 points (80.0%)
  Test:  173 points (20.0%)

Train period: 2026-01-02 14:40:00 to 2026-01-05 00:10:00
Test period:  2026-01-05 00:15:00 to 2026-01-05 14:35:00


## Step 3: Evaluate ARIMA on Test Set

Train ONLY on train data, evaluate on held-out test data.
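
For reference, MAPE can be computed as below. This is a sketch of the standard definition; whether the metric returned by the registry uses the same argument order and percent scaling is an assumption, and `mape_percent` is a hypothetical name.

```python
import numpy as np

def mape_percent(actuals, forecasts):
    """Mean absolute percentage error, expressed in percent."""
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    if np.any(actuals == 0):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return float(np.mean(np.abs((actuals - forecasts) / actuals)) * 100.0)

# Errors of 10% and 5% average to roughly 7.5%
example = mape_percent([100.0, 200.0], [110.0, 190.0])
print(f"{example:.2f}%")
```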

In [5]:
# ARIMA evaluation
print("Training ARIMA model...")
arima_params = {"p": 1, "d": 1, "q": 1, "seasonal_order": None}

arima_result = walk_forward_backtest(
    model_class=ARIMAModel,
    model_params=arima_params,
    series=series,
    train_series=train_series,  # Train on this ONLY
    test_series=test_series,    # Evaluate on this
    forecast_horizon=FORECAST_HORIZON,
    stride=1,
    verbose=True,
)

# Calculate metrics
mape_fn = get_metric("mape")
arima_mape = mape_fn(arima_result.actuals, arima_result.forecasts)

print(f"\n✓ ARIMA Results:")
print(f"  Test set MAPE: {arima_mape:.4f}%")
print(f"  Forecast points: {len(arima_result.forecasts)}")
print(f"  This is TRUE out-of-sample performance!")

Training ARIMA model...




✓ ARIMA Results:
  Test set MAPE: 1.9251%
  Forecast points: 114
  This is TRUE out-of-sample performance!


## Step 4: Visualize Walk-Forward Rolling Window

This shows how the model is evaluated with walk-forward validation, where the training window expands as the forecast origin moves forward:
- **Train on past data** (expanding window)
- **Predict N steps ahead** (forecast horizon)
- **Move forward** by stride steps
- **Repeat** until test set is covered

This ensures the model NEVER sees future data during evaluation!
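
The loop described above can be sketched with plain index arithmetic. `walk_forward_windows` is a hypothetical helper, not the `walk_forward_backtest` implementation; it assumes every window must fit entirely inside the available data. The example uses the 864-point series, the 691-point train set, and the 60-step horizon shown in the recorded configuration output.

```python
def walk_forward_windows(n_total, n_train, horizon, stride=1):
    """Yield (train_end, forecast_end) index pairs.

    Training uses indices [0, train_end); the forecast covers
    [train_end, forecast_end), so it always lies after the training data.
    """
    train_end = n_train
    while train_end + horizon <= n_total:
        yield train_end, train_end + horizon
        train_end += stride

windows = list(walk_forward_windows(n_total=864, n_train=691, horizon=60))

print(len(windows))  # 114
print(windows[0])    # (691, 751)
print(windows[-1])   # (804, 864)
```

Under these assumptions the window count equals the 114 forecast points reported in the recorded ARIMA output.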

In [None]:
# Helper function to convert TimeSeries to DataFrame
def to_df(ts):
    df = ts.to_dataframe()
    df.index = pd.to_datetime(df.index).tz_localize(None)
    return df

# Visualize the rolling window walk-forward process
print(f"Walk-Forward Backtest Details:")
print(f"  Total data points: {len(series)}")
print(f"  Train set: {len(train_series)} points")
print(f"  Test set: {len(test_series)} points")
print(f"  Forecast horizon: {FORECAST_HORIZON} steps")
print(f"  Stride: {arima_result.stride} step")

# Convert to dataframes first
train_df = to_df(train_series)
test_df = to_df(test_series)
actual_df = test_df

# Build full forecast windows for visualization
viz_model = ARIMAModel(**arima_params)
viz_model.fit(train_series)
combined_series = train_series.append(test_series)

window_forecasts = viz_model.historical_forecasts(
    series=combined_series,
    start=len(train_series),
    forecast_horizon=FORECAST_HORIZON,
    stride=arima_result.stride,
    retrain=True,
    last_points_only=False,
)

if not isinstance(window_forecasts, list):
    window_forecasts = [window_forecasts]

window_dfs = [to_df(wf) for wf in window_forecasts]
num_windows = len(window_dfs)

print(f"  Number of forecast windows: {num_windows}")

if num_windows:
    print("")
    print("Forecast coverage:")
    print(f"  First window: {window_dfs[0].index[0]} -> {window_dfs[0].index[-1]}")
    print(f"  Last window (with actuals): {window_dfs[-1].index[0]} -> {window_dfs[-1].index[-1]}")
    print(f"  Train ends: {train_df.index[-1]}")
    print(f"  Test starts: {test_df.index[0]}")

# Determine time step for the extra window
if len(test_df.index) > 1:
    step = test_df.index[1] - test_df.index[0]
else:
    step = pd.Timedelta(hours=1)

extra_window_start = test_df.index[-1] + step
extra_window_end = extra_window_start + step * (FORECAST_HORIZON - 1)

# Optional: production-style forecast after test data ends
production_forecast_df = None
production_model = ARIMAModel(**arima_params)
production_model.fit(series)
production_forecast_df = to_df(production_model.predict(FORECAST_HORIZON))

# Create visualization
fig = go.Figure()

# Show EVERY Nth window to demonstrate the rolling pattern
if num_windows:
    step_size = max(1, num_windows // 20)

    # Add subtle gradient showing all windows
    for idx in range(0, num_windows, step_size):
        wf_df = window_dfs[idx]
        window_start = wf_df.index[0]
        window_end = wf_df.index[-1]

        # Gradient from red (start) to blue (end)
        progress = idx / max(1, num_windows - 1)
        red = int(255 * (1 - progress))
        blue = int(255 * progress)
        color = f"rgba({red}, 100, {blue}, 0.05)"
        border_color = f"rgba({red}, 100, {blue}, 0.25)"

        # Add very subtle shaded region
        fig.add_vrect(
            x0=window_start,
            x1=window_end,
            fillcolor=color,
            layer="below",
            line_width=0.5,
            line_dash="dot",
            line_color=border_color,
        )

    # Highlight 3 specific windows for reference
    highlight_indices = [0, num_windows // 2, max(0, num_windows - 1)]
    highlight_colors = ["rgba(255, 0, 0, 0.15)", "rgba(0, 255, 0, 0.15)", "rgba(0, 100, 255, 0.15)"]
    border_colors = ["red", "green", "blue"]
    labels = ["W1", "W2", "W3"]

    for idx, color, border_color, label in zip(highlight_indices, highlight_colors, border_colors, labels):
        if idx < len(window_dfs):
            wf_df = window_dfs[idx]
            window_start = wf_df.index[0]
            window_end = wf_df.index[-1]

            # Highlighted window
            fig.add_vrect(
                x0=window_start,
                x1=window_end,
                fillcolor=color,
                layer="below",
                line_width=2,
                line_dash="dash",
                line_color=border_color,
            )

            # Label
            mid_idx = len(wf_df) // 2
            fig.add_annotation(
                x=wf_df.index[mid_idx], y=0.98, yref="paper",
                text=label,
                showarrow=False,
                font=dict(size=10, color=border_color, family="Arial Black"),
                bgcolor="rgba(255,255,255,0.9)",
                bordercolor=border_color,
                borderwidth=2,
            )

# Add one extra window after test data ends
fig.add_vrect(
    x0=extra_window_start,
    x1=extra_window_end,
    fillcolor="rgba(120, 120, 120, 0.08)",
    layer="below",
    line_width=1,
    line_dash="dash",
    line_color="rgba(120, 120, 120, 0.6)",
)

extra_mid = extra_window_start + (extra_window_end - extra_window_start) / 2
fig.add_annotation(
    x=extra_mid, y=0.98, yref="paper",
    text="Next (no actuals)",
    showarrow=False,
    font=dict(size=10, color="rgba(80, 80, 80, 1)", family="Arial Black"),
    bgcolor="rgba(255,255,255,0.9)",
    bordercolor="rgba(120, 120, 120, 0.6)",
    borderwidth=2,
)

# Add train/test split line
fig.add_vline(
    x=train_df.index[-1],
    line_dash="dash",
    line_color="black",
    line_width=2,
)

fig.add_annotation(
    x=train_df.index[-1], y=0.5, yref="paper",
    text="Train/Test<br>Split", showarrow=True, arrowhead=2,
    ax=40, ay=0,
    font=dict(size=11, color="black"),
    bgcolor="rgba(255,255,255,0.9)",
    bordercolor="black",
    borderwidth=2,
)

# Add the actual data lines ON TOP
# Train data (gray)
fig.add_trace(go.Scatter(
    x=train_df.index,
    y=train_df.iloc[:, 0],
    mode="lines",
    name="Train Data",
    line=dict(color="gray", width=2),
    opacity=0.6,
))

# Forecast windows (orange)
forecast_legend_shown = False
highlight_map = {
    0: "W1",
    num_windows // 2: "W2",
    max(0, num_windows - 1): "W3",
} if num_windows else {}

for idx, wf_df in enumerate(window_dfs):
    if idx in highlight_map:
        fig.add_trace(go.Scatter(
            x=wf_df.index,
            y=wf_df.iloc[:, 0],
            mode="lines",
            name=f"Forecast {highlight_map[idx]}",
            line=dict(color="#ff7f0e", width=2.5),
        ))
    else:
        showlegend = not forecast_legend_shown
        if showlegend:
            forecast_legend_shown = True
        fig.add_trace(go.Scatter(
            x=wf_df.index,
            y=wf_df.iloc[:, 0],
            mode="lines",
            name="Forecast Window",
            showlegend=showlegend,
            line=dict(color="rgba(255, 127, 14, 0.2)", width=1),
        ))

# Test data actual (blue, thick)
fig.add_trace(go.Scatter(
    x=actual_df.index,
    y=actual_df.iloc[:, 0],
    mode="lines",
    name="Test Actual",
    line=dict(color="#1f77b4", width=3),
))

# Production forecast (after data ends)
if production_forecast_df is not None:
    fig.add_trace(go.Scatter(
        x=production_forecast_df.index,
        y=production_forecast_df.iloc[:, 0],
        mode="lines",
        name="Production Forecast",
        line=dict(color="#2f2f2f", width=2, dash="dot"),
    ))

fig.update_layout(
    title=f"Walk-Forward Validation: {FORECAST_HORIZON}-Step Rolling Windows",
    xaxis_title="Time",
    yaxis_title=f"{VALUE_COLUMN}",
    hovermode="x unified",
    height=500,
    showlegend=True,
)

fig.show()

print("")
print("📊 Interpretation:")
print("  • GRAY: Training data (model never trained on test data)")
print("  • BLUE: Actual test prices (ground truth)")
print("  • ORANGE: Forecast windows (each forecast only shown within its test window)")
print(f"  • Shaded boxes: Rolling {FORECAST_HORIZON}-step windows starting at the split")
print("  • Dashed box: One extra window after test data ends")


## Step 5: Production Forecast (Retrain on ALL Data)

After evaluation, retrain on ALL data for the best production forecast.
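
A production forecast extends past the last observed timestamp. The sketch below builds those future timestamps and pairs them with a last-value forecast; `future_timestamps` is a hypothetical helper, the 5-minute step is taken from the recorded time ranges, and the flat forecast is only a placeholder for a model refit on all data.

```python
from datetime import datetime, timedelta

def future_timestamps(last_timestamp, step, horizon):
    """Return `horizon` timestamps spaced by `step`, starting one step after the data ends."""
    return [last_timestamp + step * (i + 1) for i in range(horizon)]

last_seen = datetime(2026, 1, 5, 14, 35)   # end of the loaded series in the recorded output
last_price = 180.0                          # illustrative value, not taken from the data

stamps = future_timestamps(last_seen, timedelta(minutes=5), horizon=24)
forecast = [(ts, last_price) for ts in stamps]  # placeholder: repeat the last value

print(stamps[0], "->", stamps[-1])
```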

In [None]:
# Test different amounts of historical data to find the sweet spot
import time

# Test different days_back values
days_to_test = [1, 2, 3, 5, 7, 10, 14, 21, 30]
results = []

print("Finding ARIMA's Sweet Spot - Testing different training window sizes...")
print(f"Testing with forecast horizon: {FORECAST_HORIZON} steps\n")

for days in days_to_test:
    print(f"  Testing {days} days of data...", end=" ")
    start_time = time.time()
    
    try:
        # Load data with this window size
        test_series_meta = load_gw2_series(
            item_id=ITEM_ID,
            days_back=days,
            value_column=VALUE_COLUMN,
            fill_missing_dates=True,
        )
        
        # Split into train/test (80/20)
        test_series_obj = test_series_meta.series
        split_idx = int(len(test_series_obj) * TRAIN_SPLIT)
        train_subset = test_series_obj[:split_idx]
        test_subset = test_series_obj[split_idx:]
        
        # Skip if not enough test data
        if len(test_subset) < FORECAST_HORIZON:
            print("⚠️  Not enough test data, skipping")
            continue
        
        # Run backtest
        backtest_result = walk_forward_backtest(
            model_class=ARIMAModel,
            model_params=arima_params,
            series=test_series_obj,
            train_series=train_subset,
            test_series=test_subset,
            forecast_horizon=FORECAST_HORIZON,
            stride=1,
            verbose=False,
        )
        
        # Calculate metrics
        mape = mape_fn(backtest_result.actuals, backtest_result.forecasts)
        elapsed = time.time() - start_time
        
        results.append({
            'days': days,
            'data_points': len(test_series_obj),
            'train_points': len(train_subset),
            'test_points': len(test_subset),
            'mape': mape,
            'time_seconds': elapsed,
            'forecasts': len(backtest_result.forecasts)
        })
        
        print(f"✓ MAPE: {mape:.4f}% ({len(test_series_obj)} points, {elapsed:.1f}s)")
        
    except Exception as e:
        print(f"✗ Failed: {str(e)[:50]}")
        continue

# Create visualization
if results:
    fig = make_subplots(
        rows=2, cols=1,
        subplot_titles=(
            "ARIMA Accuracy vs Training Data Size",
            "Training Time vs Data Size"
        ),
        vertical_spacing=0.15,
    )
    
    # Top plot: MAPE vs days
    fig.add_trace(go.Scatter(
        x=[r['days'] for r in results],
        y=[r['mape'] for r in results],
        mode='lines+markers',
        name='MAPE',
        line=dict(width=3, color='#1f77b4'),
        marker=dict(size=10),
    ), row=1, col=1)
    
    # Find the elbow point (where improvement slows down)
    # Initialize here so the summary below can read it even with fewer than 3 results
    sweet_spot_idx = 0  # 0 means "no sweet spot found"
    if len(results) >= 3:
        # Simple heuristic: first point where the MAPE improvement drops below
        # 30% of the previous improvement
        improvements = []
        for i in range(1, len(results)):
            prev_mape = results[i-1]['mape']
            curr_mape = results[i]['mape']
            improvement = prev_mape - curr_mape
            improvements.append(improvement)
        
        # Find sweet spot (diminishing returns)
        for i in range(1, len(improvements)):
            if improvements[i] < improvements[i-1] * 0.3:  # Less than 30% of previous improvement
                sweet_spot_idx = i
                break
        
        if sweet_spot_idx > 0:
            sweet_spot = results[sweet_spot_idx]
            fig.add_vline(
                x=sweet_spot['days'],
                line_dash="dash",
                line_color="red",
                annotation_text=f"Sweet Spot: {sweet_spot['days']} days",
                annotation_position="top",
                row=1, col=1
            )
    
    # Bottom plot: Time vs days
    fig.add_trace(go.Scatter(
        x=[r['days'] for r in results],
        y=[r['time_seconds'] for r in results],
        mode='lines+markers',
        name='Training Time',
        line=dict(width=3, color='#ff7f0e'),
        marker=dict(size=10),
    ), row=2, col=1)
    
    fig.update_xaxes(title_text="Days of Historical Data", row=1, col=1)
    fig.update_xaxes(title_text="Days of Historical Data", row=2, col=1)
    fig.update_yaxes(title_text="MAPE (%)", row=1, col=1)
    fig.update_yaxes(title_text="Time (seconds)", row=2, col=1)
    
    fig.update_layout(
        height=700,
        showlegend=False,
    )
    
    fig.show()
    
    # Print summary
    print(f"\n📊 Summary:")
    print(f"{'Days':<6} {'Points':<8} {'MAPE':<10} {'Time':<8} {'Notes'}")
    print("-" * 60)
    
    best_mape = min(results, key=lambda r: r['mape'])
    fastest = min(results, key=lambda r: r['time_seconds'])
    
    for r in results:
        notes = []
        if r == best_mape:
            notes.append("⭐ Best MAPE")
        if r == fastest:
            notes.append("⚡ Fastest")
        if sweet_spot_idx > 0 and r == results[sweet_spot_idx]:
            notes.append("🎯 Sweet Spot")
        
        print(f"{r['days']:<6} {r['data_points']:<8} {r['mape']:<10.4f} {r['time_seconds']:<8.1f} {' | '.join(notes)}")
    
    print(f"\n🎯 Recommendation:")
    if sweet_spot_idx > 0:
        rec = results[sweet_spot_idx]
        print(f"   Use {rec['days']} days of data ({rec['data_points']} points)")
        print(f"   - MAPE: {rec['mape']:.4f}%")
        print(f"   - Training time: {rec['time_seconds']:.1f}s")
        print(f"   - Balance between accuracy and speed")
    else:
        print(f"   Use {best_mape['days']} days for best accuracy")
        print(f"   - MAPE: {best_mape['mape']:.4f}%")
        
else:
    print("⚠️  No successful tests - unable to determine sweet spot")

✓ MAPE: 2.4613% (4032 points, 80.9s)
  Testing 21 days of data... 

## BONUS: Find ARIMA's Sweet Spot

How much historical data does ARIMA need for optimal performance? Let's test different training window sizes to find the point of diminishing returns.
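
One way to make "diminishing returns" concrete is to compare each MAPE improvement with the one before it. The sketch below follows the 30% ratio used in this notebook's sweet-spot cell; `find_sweet_spot` is a hypothetical helper, and the guard against non-positive previous gains is an addition that the notebook code does not have.

```python
def find_sweet_spot(mapes, ratio=0.3):
    """Return the index of the first point after which the gain collapses, or None.

    A point qualifies when the next improvement is smaller than `ratio`
    times the improvement that led to it.
    """
    improvements = [prev - curr for prev, curr in zip(mapes, mapes[1:])]
    for i in range(1, len(improvements)):
        previous_gain = improvements[i - 1]
        # Assumption: only compare against a real (positive) previous gain
        if previous_gain > 0 and improvements[i] < previous_gain * ratio:
            return i
    return None

# Illustrative MAPE values for growing training windows (not measured results)
mapes = [5.0, 3.0, 2.0, 1.9, 1.85]
print(find_sweet_spot(mapes))  # 2
```

Here the improvements are roughly 2.0, 1.0, 0.1 and 0.05, so the third configuration (index 2) is the last one that brought a substantial gain.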

## Step 6: Complete Picture

Shows the full workflow:
1. Train data
2. Test evaluation
3. Production forecast (trained on all data)

## Step 7: Compare with XGBoost (Optional)

Run the same evaluation for XGBoost to compare models.

In [21]:
# XGBoost evaluation
print("Training XGBoost model...")
xgb_params = {"lags": 12, "n_estimators": 100}

try:
    xgb_result = walk_forward_backtest(
        model_class=XGBoostModel,
        model_params=xgb_params,
        series=series,
        train_series=train_series,
        test_series=test_series,
        forecast_horizon=FORECAST_HORIZON,
        stride=1,
        verbose=True,
    )

    xgb_mape = mape_fn(xgb_result.actuals, xgb_result.forecasts)
    print(f"\n✓ XGBoost Results:")
    print(f"  Test set MAPE: {xgb_mape:.4f}%")
    
    # Compare models
    print(f"\n📊 Model Comparison:")
    print(f"  ARIMA MAPE:   {arima_mape:.4f}%")
    print(f"  XGBoost MAPE: {xgb_mape:.4f}%")
    winner = "ARIMA" if arima_mape < xgb_mape else "XGBoost"
    print(f"  Best model: {winner}")
    
except Exception as e:
    print(f"⚠️ XGBoost failed: {e}")
    print("  (This might be due to MPS/GPU incompatibility)")

Training XGBoost model...




✓ XGBoost Results:
  Test set MAPE: 2.2176%

📊 Model Comparison:
  ARIMA MAPE:   1.8693%
  XGBoost MAPE: 2.2176%
  Best model: ARIMA


## Summary

### ✅ What We Did

1. **Split data** into 80% train / 20% test
2. **Trained** models ONLY on train data
3. **Evaluated** on held-out test data (never seen during training)
4. **Retrained** on ALL data for production forecasts

### 🛡️ Why This Prevents Data Leakage

- Model **never sees** test data during evaluation phase
- Test metrics represent **true out-of-sample** performance
- Production model uses **all available data** for best forecasts

### 📈 Results

- **Test set MAPE**: Honest performance metric
- **Plots show**: Clear separation between train/test/forecast
- **No flat sections**: Real predictions at every time point

This methodology ensures your models are evaluated honestly and your production forecasts use all available information!