# Welcome to Forecast Academy - Forecasting @ Scale
## 00 - Initial Setup
### What we’ll do

**Goal:** get your machine ready, download the M5 dataset via Nixtla, create a small teaching subset, and verify everything with a quick diagnostic.
You’ll do this once, then reuse the outputs in later lessons.
In this setup notebook we will:

1. **Create the course folder structure**  
   Organize inputs, outputs, and interim data so everything is easy to find.

2. **(Optional) Install/verify dependencies**  
   Make sure you have `datasetsforecast`, `pyarrow`, `tsforge`, and other packages ready.

3. **Download the M5 dataset**  
   Use Nixtla’s `datasetsforecast` loader to fetch *sales*, *calendar*, and *prices* data.

4. **Save raw data to `data/input/raw/`**  
   Store the full files as Parquet for faster loading and smaller size.

5. **Build a teaching subset**  
   Create a smaller sample (a few departments and stores) with `tsforge` or an inline filter, split it into train/test, and save it to `data/input/processed/`.

6. **Run a quick diagnostic**  
   Check that the train/test splits line up (date ranges, series IDs) so we know the subset is ready for the lessons.


In [4]:
## Create Project Paths and Folders

import os
from pathlib import Path

# notebook root
ROOT = Path.cwd()

# one folder up (..)
BASE = ROOT.parent

DATA_DIR = BASE / "data"
INPUT_RAW = DATA_DIR / "input" / "raw"
INPUT_PROCESSED = DATA_DIR / "input" / "processed"
INTERIM_DIR = DATA_DIR / "interim"
OUTPUT_DIR = DATA_DIR / "output"
OUTPUT_MODELS = OUTPUT_DIR / "models"
OUTPUT_FORECASTS = OUTPUT_DIR / "forecasts"
OUTPUT_DIAG = OUTPUT_DIR / "diagnostics"
OUTPUT_PLOTS = OUTPUT_DIR / "plots"
DOCS_FIGS = BASE / "docs" / "figures"

for p in [
    INPUT_RAW, INPUT_PROCESSED, INTERIM_DIR,
    OUTPUT_DIR, OUTPUT_MODELS, OUTPUT_FORECASTS, OUTPUT_DIAG, OUTPUT_PLOTS,
    DOCS_FIGS
]:
    p.mkdir(parents=True, exist_ok=True)

print("Created/verified folders:")
for p in [    INPUT_RAW, INPUT_PROCESSED, INTERIM_DIR,
    OUTPUT_DIR, OUTPUT_MODELS, OUTPUT_FORECASTS, OUTPUT_DIAG, OUTPUT_PLOTS,
    DOCS_FIGS]:
    print(" -", p)

Created/verified folders:
 - c:\Users\tacke\Documents\GitHub\forecast_academy\data\input\raw
 - c:\Users\tacke\Documents\GitHub\forecast_academy\data\input\processed
 - c:\Users\tacke\Documents\GitHub\forecast_academy\data\interim
 - c:\Users\tacke\Documents\GitHub\forecast_academy\data\output
 - c:\Users\tacke\Documents\GitHub\forecast_academy\data\output\models
 - c:\Users\tacke\Documents\GitHub\forecast_academy\data\output\forecasts
 - c:\Users\tacke\Documents\GitHub\forecast_academy\data\output\diagnostics
 - c:\Users\tacke\Documents\GitHub\forecast_academy\data\output\plots
 - c:\Users\tacke\Documents\GitHub\forecast_academy\docs\figures


### 2) (Optional) installs

If these aren’t installed in your VS Code environment yet, uncomment the `pip` lines in the cell after this list and run it:

- **datasetsforecast** → for the official M5 loader (Nixtla)  
- **pyarrow** → for fast Parquet I/O  
- **tsforge** → your package with teaching utilities  
- **pytimetk** → time-series EDA helpers (used later in the course)


In [None]:
# !pip install -U datasetsforecast pyarrow
# !pip install -U tsforge          # if published, or `pip install -e .` from your tsforge repo
# !pip install -U pytimetk         # optional: used in later EDA/feature lessons


### 3) Imports & Version Checks

In [5]:
import sys, platform
import pandas as pd
import numpy as np

print("Python:", sys.version.split()[0], "| OS:", platform.system())
print("pandas:", pd.__version__)


Python: 3.12.7 | OS: Windows
pandas: 2.3.2
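
The version check above only covers Python and pandas. A minimal sketch for checking the course packages as well, using the standard-library `importlib.metadata`; the helper name `report_versions` is hypothetical, and the package names are taken from the install list above.

```python
from importlib import metadata

def report_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

versions = report_versions(["datasetsforecast", "pyarrow", "tsforge", "pytimetk"])
for name, version in versions.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be added with the commented `pip` lines in the install cell.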


### 4) Download M5 via Nixtla and save raw to `data/input/raw/`

We’ll use Nixtla’s official loader (`datasetsforecast.m5.M5`) which downloads & caches M5 locally.  
We then save what we loaded as **Parquet files** in `data/input/raw/` so later notebooks don’t need to refetch.

In [10]:
from datasetsforecast.m5 import M5

RAW_CACHE = str(INPUT_RAW / "nixtla_cache")  # keep nixtla internals in a subfolder

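# Download using Nixtla
# NOTE: M5.load returns (target series, exogenous/calendar features, static series attributes).
# The third frame holds item/dept/cat/store/state IDs, not prices (see the column listing
# printed in the sanity checks); the name `prices_df` is kept so file names stay unchanged.
Y_df, X_df, prices_df = M5.load(directory=RAW_CACHE, cache=True)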

# Rename columns for teaching
sales_df = Y_df.rename(columns={"unique_id": "id", "ds": "date", "y": "sales"})
calendar_df = X_df.rename(columns={"ds": "date"})
sales_df["date"] = pd.to_datetime(sales_df["date"])
calendar_df["date"] = pd.to_datetime(calendar_df["date"])

# Save flat Parquet copies in data/input/raw/
sales_path = INPUT_RAW / "m5_sales.parquet"
calendar_path = INPUT_RAW / "m5_calendar.parquet"
prices_path = INPUT_RAW / "m5_prices.parquet"

sales_df.to_parquet(sales_path, index=False)
calendar_df.to_parquet(calendar_path, index=False)
prices_df.to_parquet(prices_path, index=False)

print("Saved cleaned raw files:")
for p in [sales_path, calendar_path, prices_path]:
    print(" -", p.relative_to(INPUT_RAW), f"({round(p.stat().st_size/1e6, 1)} MB)")


100%|██████████| 50.2M/50.2M [00:01<00:00, 46.6MiB/s]
INFO:datasetsforecast.utils:Successfully downloaded m5.zip, 50219189, bytes.
INFO:datasetsforecast.utils:Decompressing zip file...
INFO:datasetsforecast.utils:Successfully decompressed c:\Users\tacke\Documents\GitHub\forecast_academy\data\input\raw\nixtla_cache\m5\datasets\m5.zip
  without_leading_zeros = long['y'].gt(0).groupby(long['id']).transform('cummax')


Saved cleaned raw files:
 - m5_sales.parquet (85.0 MB)
 - m5_calendar.parquet (67.2 MB)
 - m5_prices.parquet (0.2 MB)


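The sales data we just downloaded comes from the official Kaggle M5 competition files, which ship in wide format (one column per day); the loader returns them in long format. Kaggle’s `sales_train_validation.csv` covers 1,913 days and ends on 2016-04-24, whereas the series loaded here run through 2016-06-19, as the date ranges printed by the sanity checks show. Kaggle’s private leaderboard was scored on a hidden 28-day test horizon, but for our learning we’ll create our own validation splits.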
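
A quick date-arithmetic check of the day counts, as a sketch: it assumes the first and last dates that the sanity checks print later in this notebook (2011-01-29 and 2016-06-19) and uses only the standard library.

```python
from datetime import date, timedelta

first_day = date(2011, 1, 29)   # first date in the sanity-check output
last_day = date(2016, 6, 19)    # last date in the sanity-check output

# Number of calendar days covered, counting both endpoints.
n_days = (last_day - first_day).days + 1

# Calendar date of day number 1913, counting the first day as day 1.
day_1913 = first_day + timedelta(days=1913 - 1)

# First day of a 28-day window that ends on the last day.
test_start = last_day - timedelta(days=28 - 1)

print(n_days)      # 1969
print(day_1913)    # 2016-04-24
print(test_start)  # 2016-05-23
```

The 28-day window starting 2016-05-23 matches the test date range reported in the sanity checks.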

### 5) Create a small teaching subset and save to `data/input/processed/`

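We want a clean slice (e.g., **FOODS_3 + HOUSEHOLD_1**, stores **CA_1 + TX_1**, ~5 items per store/department) that keeps lessons fast while preserving the hierarchy.

If `tsforge.datasets.load_m5_subset` is available, you can use it to build this slice.  
The cell below does not depend on it: it filters the raw sales by department and store, keeps the first 400 rows of each series, and writes train/test splits for both the full data and the subset. It does not cap the number of items per store/department.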
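
One way the "~5 items per store/department" cap could be applied, sketched on made-up IDs. It assumes IDs follow the pattern `<dept>_<item>_<store>` (for example `FOODS_3_001_CA_1`); the helper `pick_items_per_group` is hypothetical and is not part of `tsforge`.

```python
import pandas as pd

# Toy IDs, assumed to look like "FOODS_3_001_CA_1" (dept, item number, store).
ids = [
    f"{dept}_{item:03d}_{store}"
    for dept in ["FOODS_3", "HOUSEHOLD_1", "HOBBIES_1"]
    for store in ["CA_1", "TX_1", "WI_1"]
    for item in range(1, 9)
]
toy = pd.DataFrame({"id": ids})

# Split each ID into its department and store parts.
parts = toy["id"].str.extract(r"^(?P<dept>[A-Z]+_\d+)_\d+_(?P<store>[A-Z]+_\d+)$")
toy = pd.concat([toy, parts], axis=1)

def pick_items_per_group(df, depts, stores, n_items=5):
    """Hypothetical helper: keep the first n_items IDs per (store, dept)."""
    keep = df[df["dept"].isin(depts) & df["store"].isin(stores)]
    return keep.groupby(["store", "dept"], sort=False).head(n_items)

subset_ids = pick_items_per_group(toy, ["FOODS_3", "HOUSEHOLD_1"], ["CA_1", "TX_1"])
print(subset_ids.groupby(["store", "dept"]).size())
```

Matching on parsed `dept` and `store` columns with `isin` avoids the partial matches a substring search can produce; the resulting ID list could then be used to filter the sales frame.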

In [16]:
def make_train_test_split(df: pd.DataFrame, date_col: str = "date", horizon: int = 28):
    """
    Split a long-format time series panel into train/test sets.

    Parameters
    ----------
    df : DataFrame
        Must have a datetime column named `date_col`.
    date_col : str, default "date"
        Name of datetime column.
    horizon : int, default 28
        Forecast horizon length (days).

    Returns
    -------
    train_df, test_df : tuple of DataFrames
    """
    df = df.copy()
    max_date = df[date_col].max()
    cutoff = max_date - pd.Timedelta(days=horizon)
    train = df[df[date_col] <= cutoff]
    test = df[df[date_col] > cutoff]
    return train, test


INPUT_RAW = Path("..") / "data" / "input" / "raw"
INPUT_PROCESSED = Path("..") / "data" / "input" / "processed"

# Load raw parquet (already standardized in setup)
sales = pd.read_parquet(INPUT_RAW / "m5_sales.parquet")
calendar = pd.read_parquet(INPUT_RAW / "m5_calendar.parquet")
prices = pd.read_parquet(INPUT_RAW / "m5_prices.parquet")

# --------------------------------------------------------------------
# Build subset (teaching slice)
depts = ["FOODS_3", "HOUSEHOLD_1"]
stores = ["CA_1", "TX_1"]

sales_sub = (
    sales[sales["id"].str.contains("|".join(depts)) & sales["id"].str.contains("|".join(stores))]
    .groupby("id")
    .head(400)   # keep the FIRST 400 rows of each series (not the most recent 400)
)

# --------------------------------------------------------------------
# Drop zero-sales rows to simulate missingness (for teaching padding)
sales = sales[sales["sales"] > 0]
sales_sub = sales_sub[sales_sub["sales"] > 0]

# --------------------------------------------------------------------
# Train/test splits
train_full, test_full = make_train_test_split(sales, date_col="date", horizon=28)
train_sub, test_sub = make_train_test_split(sales_sub, date_col="date", horizon=28)

# --------------------------------------------------------------------
# Save outputs

# Sales splits
train_full.to_parquet(INPUT_PROCESSED / "m5_sales_train_full.parquet", index=False)
test_full.to_parquet(INPUT_PROCESSED / "m5_sales_test_full.parquet", index=False)

train_sub.to_parquet(INPUT_PROCESSED / "m5_sales_train_subset.parquet", index=False)
test_sub.to_parquet(INPUT_PROCESSED / "m5_sales_test_subset.parquet", index=False)

# Calendar & prices (save only full since they are known)
calendar.to_parquet(INPUT_PROCESSED / "m5_calendar_full.parquet", index=False)
prices.to_parquet(INPUT_PROCESSED / "m5_prices_full.parquet", index=False)

print("✅ Training/test sets saved:")
print(" - Sales (full + subset)")
print(" - Calendar (full only)")
print(" - Prices (full only)")


✅ Training/test sets saved:
 - Sales (full + subset)
 - Calendar (full only)
 - Prices (full only)
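
To see what the date cutoff does, here is a self-contained toy version of the same split rule (the last `horizon` calendar days of the whole panel go to test). The function name `split_by_cutoff` and the data are made up for illustration.

```python
import pandas as pd

def split_by_cutoff(df, date_col="date", horizon=28):
    # Same rule as make_train_test_split: one global cutoff for all series.
    cutoff = df[date_col].max() - pd.Timedelta(days=horizon)
    return df[df[date_col] <= cutoff], df[df[date_col] > cutoff]

dates = pd.date_range("2016-01-01", periods=60, freq="D")
toy = pd.DataFrame({
    "id": ["A"] * 60 + ["B"] * 40,
    "date": list(dates) + list(dates[:40]),  # series B stops 20 days before A
    "sales": 1,
})

train, test = split_by_cutoff(toy, horizon=28)
test_days = test.groupby("id")["date"].nunique()
print(test_days)
```

Because the cutoff is global, series A gets a full 28 test days while series B, which ends earlier, gets only 8. Series that end before the cutoff would not appear in test at all.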


### 6) Quick sanity checks

We’ll run a few checks to make sure the train/test splits are valid:

- **Confirm columns exist** and cover the expected date ranges  
- **Standardize an ID column** (`id = item_id + '_' + store_id`) if not already present  
- **Identify the value column** (`sales`, `value`, etc.) robustly  
- **Check consistency between train and test**:  
  - Train end date and test start date are exactly 1 day apart  
  - All series IDs in test are also present in train  
  - The number of unique series is consistent
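
The consistency checks listed above can also be expressed as pass/fail flags rather than printed values. A minimal sketch on toy data; the helper `check_split` is hypothetical and assumes `id` and `date` columns.

```python
import pandas as pd

def check_split(train, test, id_col="id", date_col="date"):
    """Hypothetical helper: return pass/fail flags for a train/test split."""
    gap_days = (test[date_col].min() - train[date_col].max()).days
    ids_train, ids_test = set(train[id_col]), set(test[id_col])
    return {
        "contiguous": gap_days == 1,                    # test starts the day after train ends
        "test_ids_in_train": ids_test <= ids_train,     # no unseen series in test
        "same_series_count": len(ids_train) == len(ids_test),
    }

dates = pd.date_range("2016-01-01", periods=10, freq="D")
toy = pd.DataFrame({
    "id": ["A"] * 10 + ["B"] * 6,
    "date": list(dates) + list(dates[:6]),  # series B has no rows in the last 4 days
    "sales": 1,
})
train = toy[toy["date"] <= dates[6]]
test = toy[toy["date"] > dates[6]]

report = check_split(train, test)
print(report)
```

In this toy case the split is contiguous and every test ID is in train, but the series counts differ because series B never reaches the test window.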

In [19]:
from pathlib import Path
import pandas as pd

INPUT_PROCESSED = Path("..") / "data" / "input" / "processed"

# Load subset train/test
train_sub = pd.read_parquet(INPUT_PROCESSED / "m5_sales_train_subset.parquet")
test_sub = pd.read_parquet(INPUT_PROCESSED / "m5_sales_test_subset.parquet")

# Load calendar & prices (full only)
calendar_full = pd.read_parquet(INPUT_PROCESSED / "m5_calendar_full.parquet")
prices_full = pd.read_parquet(INPUT_PROCESSED / "m5_prices_full.parquet")

# Ensure stable ID
if "id" not in train_sub.columns:
    train_sub["id"] = train_sub["item_id"].astype(str) + "_" + train_sub["store_id"].astype(str)
    test_sub["id"] = test_sub["item_id"].astype(str) + "_" + test_sub["store_id"].astype(str)

# Detect value column
possible_vals = ["sales","value","demand","qty","units"]
value_col = next((c for c in possible_vals if c in train_sub.columns), None)
if value_col is None:
    raise ValueError(f"Couldn't find a value column in {possible_vals}")

# -------------------------------
# Quick info
print("Subset TRAIN rows:", len(train_sub), 
      "| Date range:", train_sub["date"].min(), "→", train_sub["date"].max())
print("Subset TEST rows:", len(test_sub), 
      "| Date range:", test_sub["date"].min(), "→", test_sub["date"].max())
print("Calendar full cols:", list(calendar_full.columns)[:8], "...")
print("Prices full cols:", list(prices_full.columns)[:8], "...")
print("Value column detected:", value_col)

# -------------------------------
# Consistency checks
train_end = train_sub["date"].max()
test_start = test_sub["date"].min()

ids_train = set(train_sub["id"].unique())
ids_test = set(test_sub["id"].unique())

print("\nConsistency checks:")
print(" - Train last date:", train_end)
print(" - Test first date:", test_start)
print(" - Gap between train end and test start:", (test_start - train_end).days, "days")
print(" - All test IDs in train?", ids_test.issubset(ids_train))
print(" - # unique IDs in train:", len(ids_train), "| in test:", len(ids_test))


Subset TRAIN rows: 565353 | Date range: 2011-01-29 00:00:00 → 2016-05-22 00:00:00
Subset TEST rows: 170 | Date range: 2016-05-23 00:00:00 → 2016-06-19 00:00:00
Calendar full cols: ['unique_id', 'date', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'snap_CA', 'snap_TX'] ...
Prices full cols: ['unique_id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'] ...
Value column detected: sales

Consistency checks:
 - Train last date: 2016-05-22 00:00:00
 - Test first date: 2016-05-23 00:00:00
 - Gap between train end and test start: 1 days
 - All test IDs in train? True
 - # unique IDs in train: 2710 | in test: 13
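
The ID counts above (2710 in train, 13 in test) are consistent with how the subset was built: `groupby("id").head(400)` keeps the first 400 rows of each series, so a series can only reach the final 28 days if it started late. The toy example below illustrates that effect and contrasts it with `tail(400)`; it is a sketch on made-up data, offered as a likely explanation rather than a verified diagnosis.

```python
import pandas as pd

dates = pd.date_range("2015-01-01", periods=500, freq="D")
early = pd.DataFrame({"id": "early", "date": dates, "sales": 1})        # 500 days of history
late = pd.DataFrame({"id": "late", "date": dates[-100:], "sales": 1})   # starts 100 days before the end
toy = pd.concat([early, late], ignore_index=True)

first_rows = toy.groupby("id").head(400)  # first 400 rows per series
last_rows = toy.groupby("id").tail(400)   # last 400 rows per series

# Global cutoff: the final 28 calendar days form the test window.
cutoff = toy["date"].max() - pd.Timedelta(days=28)
ids_head = set(first_rows.loc[first_rows["date"] > cutoff, "id"])
ids_tail = set(last_rows.loc[last_rows["date"] > cutoff, "id"])

print("IDs in test window with head(400):", ids_head)
print("IDs in test window with tail(400):", ids_tail)
```

With `head(400)` only the late-starting series reaches the test window; with `tail(400)` both series do. If every subset series should appear in test, taking the most recent rows is one option to consider.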


### 7) (Optional) .gitignore helper

If you’re versioning this repo, it’s smart to ignore **raw data** and **outputs**.  
We’ll generate a starter `.gitignore` so large files don’t accidentally end up in git history. Run this cell once.

In [None]:
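# The data/ folders live under BASE (one level above the notebook folder), and .gitignore
# patterns are resolved relative to the file's own directory, so write it to BASE.
gitignore_path = BASE / ".gitignore"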
lines = [
    "# data (raw, interim, output)",
    "data/input/raw/",
    "data/interim/",
    "data/output/",
    "",
    "# OS/editor files",
    ".DS_Store",
    ".ipynb_checkpoints/",
    ""
]

if not gitignore_path.exists():
    gitignore_path.write_text("\n".join(lines))
    print("Created .gitignore with data folders ignored.")
else:
    print(".gitignore already exists — review to ensure data folders are ignored.")
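
When a `.gitignore` already exists, the cell above only prints a reminder. A sketch of an append-only alternative, run against a throwaway directory so no real file is touched; the helper `ensure_gitignore_entries` is hypothetical.

```python
from pathlib import Path
import tempfile

def ensure_gitignore_entries(path, entries):
    """Hypothetical helper: append entries missing from a .gitignore; return what was added."""
    path = Path(path)
    existing = path.read_text().splitlines() if path.exists() else []
    present = {line.strip() for line in existing}
    missing = [e for e in entries if e not in present]
    if missing:
        path.write_text("\n".join(existing + missing) + "\n")
    return missing

wanted = ["data/input/raw/", "data/output/", ".DS_Store"]

# Demo in a temporary directory with a pre-existing .gitignore.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / ".gitignore"
    demo.write_text(".DS_Store\n")
    added_first = ensure_gitignore_entries(demo, wanted)
    added_again = ensure_gitignore_entries(demo, wanted)  # second run adds nothing
    final_lines = demo.read_text().splitlines()

print(added_first, added_again, final_lines)
```

Running it twice leaves the file unchanged the second time, so the cell stays safe to re-run.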


### You’re set 🎉

- **Raw M5** lives in `data/input/raw/` (Parquet).  
- **Teaching subset** lives in `data/input/processed/`.  
- The sanity checks confirm that the subset’s train and test splits are contiguous (1-day gap) and that every test ID also appears in train.  

**Next:** move to `01_framing.ipynb` to define the forecast charter (target, grain, horizon, metrics) and start exploring the data.