### Benchmark objective

This stage establishes a fair reference point for the forecasting models:
a set of simple baselines evaluated with a consistent
rolling-origin backtest.


### Loading the backtesting data

We use the data produced by the feature engineering step,
restricted to series flagged as eligible.


In [2]:
from pathlib import Path

import numpy as np
import pandas as pd

# Project layout: the feature tables live under data/features.
BASE_DIR = Path(".")
DATA_DIR = BASE_DIR / "data"
FEAT_DIR = DATA_DIR / "features"

DATA_PATH = FEAT_DIR / "features_level_a.parquet"
df = pd.read_parquet(DATA_PATH)

# Ensure the time column is a datetime before sorting and splitting.
df["week_start"] = pd.to_datetime(df["week_start"])


### Keys and target variable

We fix the key column names once
so that every model uses them consistently.


In [3]:
SERIES_KEY = ["country", "sku"]
TIME_COL = "week_start"
TARGET = "demand"


### Chronological consistency

We sort the data so that observations are in chronological order
within each series.


In [4]:
df = df.sort_values(SERIES_KEY + [TIME_COL]).reset_index(drop=True)


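### Forecast quality metrics

We define two basic metrics. MAE is expressed in demand units and therefore depends on the scale of a series.
WAPE divides the total absolute error by total demand, which makes it comparable across series with different demand levels.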


In [5]:
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def wape(y_true, y_pred):
    denom = np.sum(np.abs(y_true))
    return np.nan if denom == 0 else np.sum(np.abs(y_true - y_pred)) / denom
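
To illustrate why both metrics are reported, the sketch below applies them to two hypothetical series that differ only in scale. The numbers are made up for illustration, and the two functions are repeated so that the snippet runs on its own.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def wape(y_true, y_pred):
    denom = np.sum(np.abs(y_true))
    return np.nan if denom == 0 else np.sum(np.abs(y_true - y_pred)) / denom

# Hypothetical low-volume series and the same series scaled by 100.
y_small = np.array([10.0, 12.0, 8.0, 10.0])
p_small = np.array([9.0, 13.0, 10.0, 10.0])
y_large, p_large = 100 * y_small, 100 * p_small

mae_small, mae_large = mae(y_small, p_small), mae(y_large, p_large)
wape_small, wape_large = wape(y_small, p_small), wape(y_large, p_large)

print(mae_small, mae_large)    # MAE grows with the scale of demand
print(wape_small, wape_large)  # WAPE is unchanged by the rescaling

# A window with zero total demand has no defined WAPE.
wape_zero = wape(np.zeros(3), np.ones(3))
```

Because `wape` returns `NaN` for windows with zero total demand, any later averaging over folds skips those windows when pandas' default NaN handling is used.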


### Forecasting baselines

We implement the simplest reference models,
which set the minimum acceptable level of quality.


In [6]:
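def naive_last(y_hist, horizon):
    # Repeat the last observed value over the whole horizon.
    return np.repeat(y_hist.iloc[-1], horizon)

def seasonal_naive(y_hist, horizon, season=52):
    # Without a full season of history, fall back to the naive forecast.
    if len(y_hist) < season:
        return naive_last(y_hist, horizon)
    # Repeat the last observed season so that horizon > season is also covered.
    last_season = y_hist.iloc[-season:].to_numpy()
    reps = -(-horizon // season)  # ceiling division
    return np.tile(last_season, reps)[:horizon]

def drift(y_hist, horizon):
    # Extrapolate the average change between the first and the last observation.
    if len(y_hist) < 2:
        return naive_last(y_hist, horizon)
    slope = (y_hist.iloc[-1] - y_hist.iloc[0]) / (len(y_hist) - 1)
    return y_hist.iloc[-1] + slope * np.arange(1, horizon + 1)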
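
A toy run makes the behaviour of the three baselines concrete. The sketch below uses a made-up series of eight points with a season length of 4 instead of 52. The functions are restated compactly so that the snippet is self-contained.

```python
import numpy as np
import pandas as pd

def naive_last(y_hist, horizon):
    return np.repeat(y_hist.iloc[-1], horizon)

def seasonal_naive(y_hist, horizon, season=52):
    if len(y_hist) < season:
        return naive_last(y_hist, horizon)
    reps = -(-horizon // season)  # ceiling division
    return np.tile(y_hist.iloc[-season:].to_numpy(), reps)[:horizon]

def drift(y_hist, horizon):
    if len(y_hist) < 2:
        return naive_last(y_hist, horizon)
    slope = (y_hist.iloc[-1] - y_hist.iloc[0]) / (len(y_hist) - 1)
    return y_hist.iloc[-1] + slope * np.arange(1, horizon + 1)

# Made-up history: two "seasons" of four points each.
y = pd.Series([10, 20, 30, 40, 12, 22, 32, 42])

f_naive = naive_last(y, 3)                   # last value repeated
f_seasonal = seasonal_naive(y, 3, season=4)  # first 3 values of the last season
f_drift = drift(y, 3)                        # last value plus average step of 32/7

print(f_naive, f_seasonal, f_drift)
```

The seasonal forecast copies the values observed one season earlier, while drift continues the straight line through the first and last observation.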


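### Rolling-origin backtesting

The backtest uses an expanding training window: the cut-off starts after MIN_TRAIN observations
and moves forward by HORIZON weeks per fold, so the test windows do not overlap.
Evaluating several cut-offs shows how stable each model is over time.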


In [7]:
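HORIZON = 13     # forecast horizon in weeks
MIN_TRAIN = 104  # minimum training length in weeks (about two years)

def rolling_splits(n_obs, min_train, horizon):
    # Expanding window: training always starts at 0, the cut-off moves by `horizon`.
    # Each tuple is (train_start, train_end, test_start, test_end), end-exclusive.
    splits = []
    end = min_train
    while end + horizon <= n_obs:
        splits.append((0, end, end, end + horizon))
        end += horizon
    return splits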
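
To see what the splitter produces, the sketch below calls it for a hypothetical series of 150 weekly observations; the series length is an assumption chosen for illustration.

```python
def rolling_splits(n_obs, min_train, horizon):
    splits = []
    end = min_train
    while end + horizon <= n_obs:
        splits.append((0, end, end, end + horizon))
        end += horizon
    return splits

# Hypothetical series length of 150 weeks.
splits = rolling_splits(n_obs=150, min_train=104, horizon=13)
for train_start, train_end, test_start, test_end in splits:
    print(f"train [{train_start}, {train_end})  test [{test_start}, {test_end})")

# Series shorter than min_train + horizon yield no folds at all.
too_short = rolling_splits(n_obs=116, min_train=104, horizon=13)
```

With 150 observations the last fold ends at position 143, so the final 7 observations are never used for evaluation. A series needs at least 117 observations to contribute a single fold.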


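### Backtesting a single series

Every series is backtested with the same split rule.
The cut-offs are positional, so they fall at the same observation counts in every series,
but series of different lengths yield a different number of folds and the cut-offs need not coincide in calendar time.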


In [8]:
def backtest_series(s):
    y = s[TARGET].reset_index(drop=True)
    splits = rolling_splits(len(y), MIN_TRAIN, HORIZON)
    rows = []

    for tr_start, tr_end, te_start, te_end in splits:
        y_train = y.iloc[tr_start:tr_end]
        y_test = y.iloc[te_start:te_end]

        preds = {
            "naive": naive_last(y_train, len(y_test)),
            "seasonal_naive": seasonal_naive(y_train, len(y_test)),
            "drift": drift(y_train, len(y_test)),
        }

        for model, y_pred in preds.items():
            rows.append({
                "model": model,
                "mae": mae(y_test.values, y_pred),
                "wape": wape(y_test.values, y_pred),
            })

    return pd.DataFrame(rows)
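
Since the fold rows carry no fold index or date, it can help to inspect where the positional cut-offs land in calendar time. The sketch below is one way to do that; `fold_cutoffs` is a hypothetical helper that is not part of the notebook, and the two weekly series are synthetic.

```python
import pandas as pd

HORIZON = 13
MIN_TRAIN = 104

def rolling_splits(n_obs, min_train, horizon):
    splits = []
    end = min_train
    while end + horizon <= n_obs:
        splits.append((0, end, end, end + horizon))
        end += horizon
    return splits

def fold_cutoffs(s, time_col="week_start"):
    """Hypothetical helper: calendar boundaries of every fold of one series."""
    t = s[time_col].reset_index(drop=True)
    rows = []
    for fold, (_, tr_end, te_start, te_end) in enumerate(
        rolling_splits(len(t), MIN_TRAIN, HORIZON)
    ):
        rows.append({
            "fold": fold,
            "train_end": t.iloc[tr_end - 1],  # last week used for training
            "test_start": t.iloc[te_start],   # first forecast week
            "test_end": t.iloc[te_end - 1],   # last forecast week
        })
    return pd.DataFrame(rows)

# Two synthetic weekly series of equal length that start a year apart.
series_a = pd.DataFrame({"week_start": pd.date_range("2020-01-06", periods=130, freq="W-MON")})
series_b = pd.DataFrame({"week_start": pd.date_range("2021-01-04", periods=130, freq="W-MON")})

cut_a = fold_cutoffs(series_a)
cut_b = fold_cutoffs(series_b)
print(cut_a)
print(cut_b)
```

Both series get two folds at the same positions, yet their calendar cut-offs differ by a year. Storing such columns alongside the metrics would make fold-level results comparable across series.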


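### Running the benchmarks

The backtest runs for every series in the data, all of which are flagged as eligible.
Series with fewer than MIN_TRAIN + HORIZON (117) observations produce no folds and do not appear in the results.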


In [12]:
results = []

for key, g in df.groupby(SERIES_KEY, observed=True):
    r = backtest_series(g)
    r["country"] = key[0]
    r["sku"] = key[1]
    results.append(r)

bt = pd.concat(results, ignore_index=True)
bt.head(20)


Unnamed: 0,model,mae,wape,country,sku
0,naive,1954.153846,0.234143,Germany,00019-003-003
1,seasonal_naive,660.846154,0.079181,Germany,00019-003-003
2,drift,2045.613144,0.245101,Germany,00019-003-003
3,naive,1800.230769,0.197736,Germany,00019-003-003
4,seasonal_naive,1499.076923,0.164657,Germany,00019-003-003
5,drift,1896.179045,0.208274,Germany,00019-003-003
6,naive,611.923077,0.069457,Germany,00019-003-003
7,seasonal_naive,1052.307692,0.119443,Germany,00019-003-003
8,drift,615.675611,0.069883,Germany,00019-003-003
9,naive,2447.0,0.333655,Poland,02589-489-000


### Aggregating the benchmark results

We average the fold-level metrics per series and model
so that the baselines can be compared. The n_folds column records how many folds each average is based on.


In [13]:
summary = (
    bt.groupby(["country", "sku", "model"])
      .agg(
          mae=("mae", "mean"),
          wape=("wape", "mean"),
          n_folds=("mae", "count")
      )
      .reset_index()
)

summary.head(20)


Unnamed: 0,country,sku,model,mae,wape,n_folds
0,Germany,00019-003-003,drift,1519.155933,0.174419,3
1,Germany,00019-003-003,naive,1455.435897,0.167112,3
2,Germany,00019-003-003,seasonal_naive,1070.74359,0.121094,3
3,Poland,02589-489-000,drift,688.665848,0.084518,15
4,Poland,02589-489-000,naive,721.189744,0.088712,15
5,Poland,02589-489-000,seasonal_naive,964.474359,0.118481,15
6,Poland,62170-027-000,drift,28.73637,0.547761,1
7,Poland,62170-027-000,naive,27.0,0.514663,1
8,Poland,62170-027-000,seasonal_naive,30.269231,0.576979,1
9,Portugal,00012-619-000,drift,1958.296653,0.255375,6
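
The summary averages fold-level WAPE, which weights every fold equally. An alternative is a pooled WAPE, which divides the total absolute error by the total demand over all folds. The sketch below compares the two for one series, using the fold-level seasonal_naive rows for Germany / 00019-003-003 shown earlier. It assumes that all folds have the same length and a non-zero, finite WAPE.

```python
import pandas as pd

# Fold-level seasonal_naive results for Germany / 00019-003-003,
# copied from the fold table shown earlier.
folds = pd.DataFrame({
    "mae":  [660.846154, 1499.076923, 1052.307692],
    "wape": [0.079181, 0.164657, 0.119443],
})

# Mean of fold-level WAPE, as computed by the summary cell.
mean_fold_wape = folds["wape"].mean()

# Pooled WAPE: total absolute error divided by total absolute demand.
# With equal fold lengths, mae is proportional to a fold's absolute error
# and mae / wape to its absolute demand, so the fold length cancels out.
pooled_wape = folds["mae"].sum() / (folds["mae"] / folds["wape"]).sum()

print(round(mean_fold_wape, 6), round(pooled_wape, 6))
```

The two values are close for this series. They can diverge when demand levels differ strongly between folds, because the pooled variant gives more weight to high-demand folds.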


### Selecting the best benchmark

For each series we select the baseline with the lowest mean WAPE, breaking ties by MAE.
This sets the minimum quality bar for subsequent models.


In [16]:
best_per_series = (
    summary.sort_values(["country", "sku", "wape", "mae"], ascending=[True, True, True, True])
           .groupby(["country", "sku"], as_index=False)
           .first()
           .rename(columns={"model": "best_baseline"})
)

best_per_series.head(20)


Unnamed: 0,country,sku,best_baseline,mae,wape,n_folds
0,Germany,00019-003-003,seasonal_naive,1070.74359,0.121094,3
1,Poland,02589-489-000,drift,688.665848,0.084518,15
2,Poland,62170-027-000,naive,27.0,0.514663,1
3,Portugal,00012-619-000,seasonal_naive,1481.750205,0.196622,6
4,Portugal,00041-097-000,seasonal_naive,9.413682,0.33758,15
5,Romania,00012-432-000,seasonal_naive,234.177949,0.262325,15
6,Romania,00077-010-000,naive,219.486318,0.421451,15
7,Romania,07808-016-000,naive,3.136095,0.994129,13
8,Spain,00023-189-000,seasonal_naive,47.679092,0.228847,15
9,Spain,00119-066-001,seasonal_naive,391.637808,0.172094,2
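
One detail worth knowing: `GroupBy.first()` returns the first non-null value of each column separately, so a row containing NaN could in principle be stitched together from several rows. The sketch below shows an alternative selection that always keeps whole rows, applied to six rows copied from the summary table. It is an optional variant, not the notebook's method.

```python
import pandas as pd

# Rows copied from the summary table (two series, three baselines each).
summary = pd.DataFrame({
    "country": ["Germany"] * 3 + ["Poland"] * 3,
    "sku": ["00019-003-003"] * 3 + ["02589-489-000"] * 3,
    "model": ["drift", "naive", "seasonal_naive"] * 2,
    "mae": [1519.155933, 1455.435897, 1070.74359,
            688.665848, 721.189744, 964.474359],
    "wape": [0.174419, 0.167112, 0.121094,
             0.084518, 0.088712, 0.118481],
    "n_folds": [3, 3, 3, 15, 15, 15],
})

best = (
    summary.sort_values(["country", "sku", "wape", "mae"])
           .drop_duplicates(["country", "sku"], keep="first")  # keeps whole rows
           .rename(columns={"model": "best_baseline"})
           .reset_index(drop=True)
)
print(best)
```

On these rows the result agrees with the table above: seasonal_naive for the German series and drift for the Polish one.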


### Series difficulty

We rank the series by the WAPE of their best baseline,
which quickly surfaces the series that are hard to forecast and unstable.
With only 14 series in total, the two top-10 lists overlap.


In [17]:
hardest = best_per_series.sort_values(["wape", "mae"], ascending=[False, False]).head(10)
easiest = best_per_series.sort_values(["wape", "mae"], ascending=[True, True]).head(10)

display(hardest)
display(easiest)


Unnamed: 0,country,sku,best_baseline,mae,wape,n_folds
7,Romania,07808-016-000,naive,3.136095,0.994129,13
2,Poland,62170-027-000,naive,27.0,0.514663,1
6,Romania,00077-010-000,naive,219.486318,0.421451,15
12,Sweden,00295-629-000,naive,76.602564,0.340206,6
4,Portugal,00041-097-000,seasonal_naive,9.413682,0.33758,15
5,Romania,00012-432-000,seasonal_naive,234.177949,0.262325,15
8,Spain,00023-189-000,seasonal_naive,47.679092,0.228847,15
3,Portugal,00012-619-000,seasonal_naive,1481.750205,0.196622,6
11,Sweden,00011-294-000,seasonal_naive,556.502564,0.172493,15
9,Spain,00119-066-001,seasonal_naive,391.637808,0.172094,2


Unnamed: 0,country,sku,best_baseline,mae,wape,n_folds
1,Poland,02589-489-000,drift,688.665848,0.084518,15
13,Sweden,75-072-000,seasonal_naive,241.666667,0.09892,15
0,Germany,00019-003-003,seasonal_naive,1070.74359,0.121094,3
10,Spain,04592-030-000,seasonal_naive,240.294872,0.143226,15
9,Spain,00119-066-001,seasonal_naive,391.637808,0.172094,2
11,Sweden,00011-294-000,seasonal_naive,556.502564,0.172493,15
3,Portugal,00012-619-000,seasonal_naive,1481.750205,0.196622,6
8,Spain,00023-189-000,seasonal_naive,47.679092,0.228847,15
5,Romania,00012-432-000,seasonal_naive,234.177949,0.262325,15
4,Portugal,00041-097-000,seasonal_naive,9.413682,0.33758,15


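### Results aggregated by country

We aggregate the best-baseline results within each country
to compare forecast quality across markets.
With one to three series per country, the medians and 90th percentiles are indicative only,
and best_model_mode reports a single model even when the counts are tied (as for Poland).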


In [18]:
country_summary = (
    best_per_series.groupby("country")
    .agg(
        n_series=("sku", "count"),
        median_wape=("wape", "median"),
        mean_wape=("wape", "mean"),
        p90_wape=("wape", lambda x: x.quantile(0.9)),
        best_model_mode=("best_baseline", lambda x: x.value_counts().index[0]),
    )
    .reset_index()
    .sort_values("median_wape")
)

country_summary


Unnamed: 0,country,n_series,median_wape,mean_wape,p90_wape,best_model_mode
0,Germany,1,0.121094,0.121094,0.121094,seasonal_naive
4,Spain,3,0.172094,0.181389,0.217496,seasonal_naive
5,Sweden,3,0.172493,0.203873,0.306663,seasonal_naive
2,Portugal,2,0.267101,0.267101,0.323484,seasonal_naive
1,Poland,2,0.29959,0.29959,0.471648,drift
3,Romania,3,0.421451,0.559302,0.879594,naive


### Comparing the baseline models

We compare the baselines globally
to see which benchmark is strongest on average.
WAPE is the fairer basis for this comparison, because mean_mae is dominated by high-volume series.


In [19]:
global_summary = (
    summary.groupby("model")
    .agg(
        mean_wape=("wape", "mean"),
        median_wape=("wape", "median"),
        mean_mae=("mae", "mean"),
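        # summary has one row per (country, sku, model), so the row count per model
        # is the series count; nunique on sku alone would merge SKUs sold in several countries.
        n_series=("sku", "count"),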
    )
    .reset_index()
    .sort_values("median_wape")
)

global_summary


Unnamed: 0,model,mean_wape,median_wape,mean_mae,n_series
2,seasonal_naive,0.314366,0.212734,398.261577,14
1,naive,0.322122,0.250359,475.514302,14
0,drift,0.332497,0.256538,482.168244,14


### Benchmark registry

We build a table that later notebooks will use
as the reference point and baseline for comparisons.


In [22]:
baseline_registry = best_per_series.merge(
    country_summary[["country", "median_wape"]],
    on="country",
    how="left"
).rename(columns={"median_wape": "country_median_wape_best_baseline"})

baseline_registry.head(20)


Unnamed: 0,country,sku,best_baseline,mae,wape,n_folds,country_median_wape_best_baseline
0,Germany,00019-003-003,seasonal_naive,1070.74359,0.121094,3,0.121094
1,Poland,02589-489-000,drift,688.665848,0.084518,15,0.29959
2,Poland,62170-027-000,naive,27.0,0.514663,1,0.29959
3,Portugal,00012-619-000,seasonal_naive,1481.750205,0.196622,6,0.267101
4,Portugal,00041-097-000,seasonal_naive,9.413682,0.33758,15,0.267101
5,Romania,00012-432-000,seasonal_naive,234.177949,0.262325,15,0.421451
6,Romania,00077-010-000,naive,219.486318,0.421451,15,0.421451
7,Romania,07808-016-000,naive,3.136095,0.994129,13,0.421451
8,Spain,00023-189-000,seasonal_naive,47.679092,0.228847,15,0.172094
9,Spain,00119-066-001,seasonal_naive,391.637808,0.172094,2,0.172094
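
A later notebook could use the registry to express a candidate model's WAPE relative to the best baseline. The sketch below shows the mechanics on two registry rows. The candidate scores are hypothetical placeholders invented for illustration, and the column names `wape_model` and `skill_vs_baseline` are assumptions.

```python
import pandas as pd

# Two registry rows copied from the table above.
baseline_registry = pd.DataFrame({
    "country": ["Germany", "Poland"],
    "sku": ["00019-003-003", "02589-489-000"],
    "best_baseline": ["seasonal_naive", "drift"],
    "wape": [0.121094, 0.084518],
})

# HYPOTHETICAL candidate-model scores, invented purely to show the mechanics.
candidate = pd.DataFrame({
    "country": ["Germany", "Poland"],
    "sku": ["00019-003-003", "02589-489-000"],
    "wape_model": [0.100000, 0.090000],
})

comparison = baseline_registry.merge(
    candidate, on=["country", "sku"], how="left", validate="one_to_one"
)
# Skill > 0: the candidate beats the best baseline; skill < 0: it is worse.
comparison["skill_vs_baseline"] = 1 - comparison["wape_model"] / comparison["wape"]
print(comparison)
```

Joining on both `country` and `sku` mirrors `SERIES_KEY`, and `validate="one_to_one"` raises an error if either table contains duplicate series.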


### Saving the benchmark results

We save the results as input files
for the subsequent modelling and reporting stages.
Note that baseline_registry is not among the files written by this cell.


In [21]:
OUT_DIR = Path("data") / "backtesting"
OUT_DIR.mkdir(parents=True, exist_ok=True)

bt_path = OUT_DIR / "baseline_backtest_folds.parquet"
summary_path = OUT_DIR / "baseline_summary.parquet"
best_path = OUT_DIR / "baseline_best_per_series.parquet"
country_path = OUT_DIR / "baseline_country_summary.csv"
global_path = OUT_DIR / "baseline_global_summary.csv"

bt.to_parquet(bt_path, index=False)
summary.to_parquet(summary_path, index=False)
best_per_series.to_parquet(best_path, index=False)
country_summary.to_csv(country_path, index=False)
global_summary.to_csv(global_path, index=False)

bt_path, summary_path, best_path, country_path, global_path


(WindowsPath('data/backtesting/baseline_backtest_folds.parquet'),
 WindowsPath('data/backtesting/baseline_summary.parquet'),
 WindowsPath('data/backtesting/baseline_best_per_series.parquet'),
 WindowsPath('data/backtesting/baseline_country_summary.csv'),
 WindowsPath('data/backtesting/baseline_global_summary.csv'))
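
The registry itself can be persisted in the same way. The sketch below is a minimal round trip under stated assumptions: the file name `baseline_registry.csv` is hypothetical, a temporary directory stands in for `OUT_DIR`, and CSV is used only to keep the snippet free of a parquet dependency (`to_parquet` would work analogously).

```python
import tempfile
from pathlib import Path

import pandas as pd

# One registry row copied from the registry table, standing in for the full table.
baseline_registry = pd.DataFrame({
    "country": ["Germany"],
    "sku": ["00019-003-003"],
    "best_baseline": ["seasonal_naive"],
    "mae": [1070.74359],
    "wape": [0.121094],
    "n_folds": [3],
    "country_median_wape_best_baseline": [0.121094],
})

# A temporary directory keeps the sketch free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    registry_path = Path(tmp) / "baseline_registry.csv"  # hypothetical file name
    baseline_registry.to_csv(registry_path, index=False)
    # Read SKUs back as strings so that identifiers keep their leading zeros.
    restored = pd.read_csv(registry_path, dtype={"sku": str})

print(restored)
```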

### Benchmark summary

We built the baselines and a rolling-origin backtest.
The best baseline was selected for each series.
The results were saved in a form ready for comparison with ML and time-series models.
