# Welcome to Forecast Acadeny - Forecasting @ Scale
## 00 - Initial Setup
### What we’ll do

**Goal:** get your machine ready, download the M5 dataset via Nixtla, create a small teaching subset, and verify everything with a quick diagnostic.
You’ll do this once, then reuse the outputs in later lessons.
In this setup notebook we will:

1. **Create the course folder structure**  
   Organize inputs, outputs, and interim data so everything is easy to find.

2. **(Optional) Install/verify dependencies**  
   Make sure you have `datasetsforecast`, `pyarrow`, `tsforge`, and other packages ready.

3. **Download the M5 dataset**  
   Use Nixtla’s `datasetsforecast` loader to fetch *sales*, *calendar*, and *prices* data.

4. **Save raw data to `data/input/raw/`**  
   Store the full files as Parquet for faster loading and smaller size.

5. **Build a teaching subset**  
   Use `tsforge` to create a smaller sample (few departments/stores/items) and save it to `data/input/processed/`.

6. **Run a quick diagnostic**  
   Check the subset for completeness (no missing periods) so we know it’s healthy for forecasting.


In [1]:
## Create Project Paths and Folders

import os
from pathlib import Path

# notebook root
ROOT = Path.cwd()

# one folder up (..)
BASE = ROOT.parent

DATA_DIR = BASE / "data"
INPUT_RAW = DATA_DIR / "input" / "raw"
INPUT_PROCESSED = DATA_DIR / "input" / "processed"
INTERIM_DIR = DATA_DIR / "interim"
OUTPUT_DIR = DATA_DIR / "output"
OUTPUT_MODELS = OUTPUT_DIR / "models"
OUTPUT_FORECASTS = OUTPUT_DIR / "forecasts"
OUTPUT_DIAG = OUTPUT_DIR / "diagnostics"
OUTPUT_PLOTS = OUTPUT_DIR / "plots"
DOCS_FIGS = BASE / "docs" / "figures"

for p in [
    INPUT_RAW, INPUT_PROCESSED, INTERIM_DIR,
    OUTPUT_DIR, OUTPUT_MODELS, OUTPUT_FORECASTS, OUTPUT_DIAG, OUTPUT_PLOTS,
    DOCS_FIGS
]:
    p.mkdir(parents=True, exist_ok=True)

print("Created/verified folders:")
for p in [    INPUT_RAW, INPUT_PROCESSED, INTERIM_DIR,
    OUTPUT_DIR, OUTPUT_MODELS, OUTPUT_FORECASTS, OUTPUT_DIAG, OUTPUT_PLOTS,
    DOCS_FIGS]:
    print(" -", p)

Created/verified folders:
 - c:\Users\tacke\Documents\GitHub\tsforge\data\input\raw
 - c:\Users\tacke\Documents\GitHub\tsforge\data\input\processed
 - c:\Users\tacke\Documents\GitHub\tsforge\data\interim
 - c:\Users\tacke\Documents\GitHub\tsforge\data\output
 - c:\Users\tacke\Documents\GitHub\tsforge\data\output\models
 - c:\Users\tacke\Documents\GitHub\tsforge\data\output\forecasts
 - c:\Users\tacke\Documents\GitHub\tsforge\data\output\diagnostics
 - c:\Users\tacke\Documents\GitHub\tsforge\data\output\plots
 - c:\Users\tacke\Documents\GitHub\tsforge\docs\figures


### 2) (Optional) installs

If you didn’t install these in your VS Code environment already, uncomment and run:

- **datasetsforecast** → for the official M5 loader (Nixtla)  
- **pyarrow** → for fast Parquet I/O  
- **tsforge** → your package with teaching utilities  



In [2]:
# !pip install -U datasetsforecast pyarrow
# !pip install -U tsforge          # if published, or `pip install -e .` from your tsforge repo



### 3) Imports & Version Checks

In [3]:
import sys, platform
import pandas as pd
import numpy as np

print("Python:", sys.version.split()[0], "| OS:", platform.system())
print("pandas:", pd.__version__)


Python: 3.12.10 | OS: Windows
pandas: 2.3.2


In [4]:
#%pip list

### 4) Download M5 via Nixtla and save raw to `data/input/raw/`

We’ll use Nixtla’s official loader (`datasetsforecast.m5.M5`) which downloads & caches M5 locally.  
We then save what we loaded as **Parquet files** in `data/input/raw/` so later notebooks don’t need to refetch.

In [5]:
from pathlib import Path
import pandas as pd
from datasetsforecast.m5 import M5

# -------------------------------------------------------------------
# Paths
# -------------------------------------------------------------------
# Go directly to data/raw
INPUT_RAW = Path("..") / "data" / "input" / "raw"
INPUT_RAW.mkdir(parents=True, exist_ok=True)

# Tell Nixtla to build cache here
Y_df, X_df, meta_df = M5.load(directory=str(INPUT_RAW), cache=True)

# -------------------------------------------------------------------
# Load with Nixtla
# This gives us 3 dfs: Y_df (sales), X_df (calendar+snap+prices), meta_df (meta)
# -------------------------------------------------------------------
Y_df, X_df, meta_df = M5.load(directory=str(INPUT_RAW / "nixtla_cache"), cache=True)

# Messy up the data a bit to simulate real data
meta_df[['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']] = meta_df[['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']].astype(str)
sales_df = Y_df.merge(meta_df, on=['unique_id'], how="left")
sales_df = sales_df.rename(columns={"ds": "date", "y": "sales"})
sales_df = sales_df[['item_id','dept_id','cat_id','store_id','state_id','date','sales']]
sales_df = sales_df[sales_df.sales>0]  # remove zero sales rows

prices_df = X_df.merge(meta_df, on=['unique_id'], how="left")
prices_df = prices_df.rename(columns={"ds": "date", "sell_price": "price"})
calendar_df = prices_df[['date','event_name_1','event_type_1','event_name_2','event_type_2','snap_CA','snap_TX','snap_WI']].drop_duplicates()

prices_df = prices_df[['item_id','store_id','date','price']]

# -------------------------------------------------------------------
# Save parquet outputs
# -------------------------------------------------------------------
sales_path = INPUT_RAW / "00_m5_sales.parquet"
calendar_path = INPUT_RAW / "00_m5_calendar.parquet"
prices_path = INPUT_RAW / "00_m5_prices.parquet"
#meta_path = INPUT_RAW / "00_m5_meta.parquet"

sales_df.to_parquet(sales_path, index=False)
calendar_df.to_parquet(calendar_path, index=False)
prices_df.to_parquet(prices_path, index=False)

print("✅ Saved clean M5 files without Nixtla unique_id:")
print(" -", sales_path.name, f"({round(sales_path.stat().st_size/1e6,1)} MB)")
print(" -", calendar_path.name, f"({round(calendar_path.stat().st_size/1e6,1)} MB)")
print(" -", prices_path.name, f"({round(prices_path.stat().st_size/1e6,1)} MB)")


100%|██████████| 50.2M/50.2M [00:00<00:00, 67.6MiB/s]
INFO:datasetsforecast.utils:Successfully downloaded m5.zip, 50219189, bytes.
INFO:datasetsforecast.utils:Decompressing zip file...
INFO:datasetsforecast.utils:Successfully decompressed ..\data\input\raw\m5\datasets\m5.zip
  keep_mask = long.groupby('id')['y'].transform(first_nz_mask, engine='numba')
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  long.rename(columns={'id': 'unique_id', 'date': 'ds'}, inplace=True)
100%|██████████| 50.2M/50.2M [00:01<00:00, 48.3MiB/s]
INFO:datasetsforecast.utils:Successfully downloaded m5.zip, 50219189, bytes.
INFO:datasetsforecast.utils:Decompressing zip file...
INFO:datasetsforecast.utils:Successfully decompressed ..\data\input\raw\nixtla_cache\m5\datasets\m5.zip
  keep_mask = long.groupby('id')['y'].transform(first_nz_mask, engine='numba')


✅ Saved clean M5 files without Nixtla unique_id:
 - 00_m5_sales.parquet (39.1 MB)
 - 00_m5_calendar.parquet (0.0 MB)
 - 00_m5_prices.parquet (57.0 MB)


The sales data we just downloaded is the official Kaggle M5 training set (sales_train_validation.csv in wide format). It runs ~1913 days, ending 2016-06-19. Kaggle’s private leaderboard was scored on a hidden 28-day test horizon, but for our learning we’ll create our own validation splits.

#### Examine Sales Data
Here we will examine the sales data.  
The unique identifier in this data is item_id and store_id.  There are actually 2 hierarchies here, product and location. 

**Product Hierarchy:** 
An item belongs to a department which belongs to a  category, so our product hierachy is item -> department -> category.

**Location Hierarchy**
A store belongs to a state, so our hierarchy is store -> state.

When modeling, it is helpful to have a `unique_id` that we can reference easily when merging data or modeling, so lets create 2 things.
1. A `unique_id` field in our dataset.
2. A dataframe called  `meta_df` which stores the hierarchy information for our `unique_id`


In [6]:
sales_df = pd.read_parquet(r'C:\Users\tacke\Documents\GitHub\forecast_academy\data\input\raw\00_m5_sales.parquet')
sales_df['unique_id'] = sales_df['item_id'] + '_' + sales_df['store_id'] 
sales_df.to_parquet('../data/input/processed/00_m5_sales_full.parquet')
sales_df.head()

Unnamed: 0,item_id,dept_id,cat_id,store_id,state_id,date,sales,unique_id
0,FOODS_1_001,FOODS_1,FOODS,CA_1,CA,2011-01-29,3.0,FOODS_1_001_CA_1
1,FOODS_1_001,FOODS_1,FOODS,CA_1,CA,2011-02-01,1.0,FOODS_1_001_CA_1
2,FOODS_1_001,FOODS_1,FOODS,CA_1,CA,2011-02-02,4.0,FOODS_1_001_CA_1
3,FOODS_1_001,FOODS_1,FOODS,CA_1,CA,2011-02-03,2.0,FOODS_1_001_CA_1
4,FOODS_1_001,FOODS_1,FOODS,CA_1,CA,2011-02-05,2.0,FOODS_1_001_CA_1


In [7]:
meta_df = sales_df[['unique_id','item_id','dept_id','cat_id','store_id','state_id']].drop_duplicates()
meta_df.to_parquet('../data/input/processed/00_m5_meta_full.parquet')

meta_df.head()

Unnamed: 0,unique_id,item_id,dept_id,cat_id,store_id,state_id
0,FOODS_1_001_CA_1,FOODS_1_001,FOODS_1,FOODS,CA_1,CA
864,FOODS_1_001_CA_2,FOODS_1_001,FOODS_1,FOODS,CA_2,CA
1881,FOODS_1_001_CA_3,FOODS_1_001,FOODS_1,FOODS,CA_3,CA
2771,FOODS_1_001_CA_4,FOODS_1_001,FOODS_1,FOODS,CA_4,CA
3281,FOODS_1_001_TX_1,FOODS_1_001,FOODS_1,FOODS,TX_1,TX


In [8]:
prices_df = pd.read_parquet(r'C:\Users\tacke\Documents\GitHub\forecast_academy\data\input\raw\00_m5_prices.parquet')
prices_df['unique_id'] = prices_df['item_id'] + '_' + prices_df['store_id'] 
prices_df.to_parquet('../data/input/processed/00_m5_prices_full.parquet')
prices_df.head()

Unnamed: 0,item_id,store_id,date,price,unique_id
0,FOODS_1_001,CA_1,2011-01-29,2.0,FOODS_1_001_CA_1
1,FOODS_1_001,CA_1,2011-01-30,2.0,FOODS_1_001_CA_1
2,FOODS_1_001,CA_1,2011-01-31,2.0,FOODS_1_001_CA_1
3,FOODS_1_001,CA_1,2011-02-01,2.0,FOODS_1_001_CA_1
4,FOODS_1_001,CA_1,2011-02-02,2.0,FOODS_1_001_CA_1


In [9]:
calendar_df = pd.read_parquet(r'C:\Users\tacke\Documents\GitHub\forecast_academy\data\input\raw\00_m5_calendar.parquet')
calendar_df.replace('nan', pd.NA, inplace=True)
calendar_df.to_parquet('../data/input/processed/00_m5_calendar.parquet')
calendar_df.head()

  calendar_df.replace('nan', pd.NA, inplace=True)


Unnamed: 0,date,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI
0,2011-01-29,,,,,0,0,0
1,2011-01-30,,,,,0,0,0
2,2011-01-31,,,,,0,0,0
3,2011-02-01,,,,,1,1,0
4,2011-02-02,,,,,1,0,1


In [10]:
sales_df.shape

(19321177, 8)

### 5) Create a small teaching subset and save to `data/input/processed/`

Here is a challenge with m5, its over 19M rows of data.  While its not a deal breaker, for the purposes of this training, it is not necessary to forecast the entire population as it will just take unnecessary processing time. 
For this training, let's subset the data so it is easier to work with.  However, if you want to explore how things work on the full dataset, simply choose `subset=False` in the next block.

We will limit our subset to **FOODS** and **HOUSEHOLD** categor items, in the stores **CA_1 + TX_2 + TX_1**.
This keeps lessons fast while preserving hierarchy.

In [11]:
sales_df['cat_id'].unique()

array(['FOODS', 'HOBBIES', 'HOUSEHOLD'], dtype=object)

In [12]:
sales_df[sales_df['cat_id']=='HOUSEHOLD'].item_id.nunique()

1047

In [13]:
sales_df['store_id'].unique()

array(['CA_1', 'CA_2', 'CA_3', 'CA_4', 'TX_1', 'TX_2', 'TX_3', 'WI_1',
       'WI_2', 'WI_3'], dtype=object)

In [14]:
sales_df.date.min()

Timestamp('2011-01-29 00:00:00')

In [15]:
sales_df.date.max()

Timestamp('2016-06-19 00:00:00')

In [16]:
subset = True  # set to False to use full data (19M rows)
if subset:
    sales_df_sub = sales_df[sales_df['store_id'].isin(['CA_1','CA_2','TX_1']) & sales_df['cat_id'].isin(['HOBBIES','HOUSEHOLD'])]
    sales_df_sub = sales_df_sub[sales_df_sub['date']>='2012-06-19']
    unique_ids = sales_df_sub['unique_id'].unique()
    prices_df_sub = prices_df[prices_df.unique_id.isin(unique_ids)]
    meta_df_sub = prices_df[prices_df.unique_id.isin(unique_ids)]
    print(f"Subsetting to {sales_df_sub.shape[0]} rows of data")

sales_df_sub.to_parquet('../data/input/processed/00_m5_sales_subset.parquet')
meta_df_sub.to_parquet('../data/input/processed/00_m5_meta_subset.parquet')
prices_df_sub.to_parquet('../data/input/processed/00_m5_prices_subset.parquet') 

Subsetting to 2170376 rows of data


### 6) Make Training and Test Set

Because we do not want any cheating and make this as realistic as possible, we want to immediately split off a test set so even during the EDA process, we do not get any peaks in to the future data.
To be consistent with the M5, we will pull out 28 days at the end of the dataset.

In [17]:
sales_df_sub.date.max()

Timestamp('2016-06-19 00:00:00')

In [18]:
def make_train_test_split(df: pd.DataFrame, date_col: str = "date", horizon: int = 28):
    """
    Split a long-format time series panel into train/test sets.

    Parameters
    ----------
    df : DataFrame
        Must have a datetime column named `date_col`.
    date_col : str, default "date"
        Name of datetime column.
    horizon : int, default 28
        Forecast horizon length (days).

    Returns
    -------
    train_df, test_df : tuple of DataFrames
    """
    df = df.copy()
    max_date = df[date_col].max()
    cutoff = max_date - pd.Timedelta(days=horizon)
    train = df[df[date_col] <= cutoff]
    test = df[df[date_col] > cutoff]
    return train, test

INPUT_PROCESSED = Path("..") / "data" / "input" / "processed"

# Train/test splits
train_df, test_df = make_train_test_split(sales_df_sub, date_col="date", horizon=28)

# Save outputs
train_df.to_parquet(INPUT_PROCESSED / "00_m5_sales_train.parquet", index=False)
test_df.to_parquet(INPUT_PROCESSED / "00_m5_sales_test.parquet", index=False)

print("✅ Training/test sets saved:")



✅ Training/test sets saved:


### 7) Quick sanity checks

We’ll run a few checks to make sure the train/test splits are valid:

- **Check consistency between train and test**:  
  - Train end date and test start date are exactly 1 day apart  
  - All series IDs in test are also present in train  
  - The number of unique series is consistent

In [19]:
from pathlib import Path
import pandas as pd

# -------------------------------
# Quick info
print("Subset TRAIN rows:", len(train_df), 
      "| Date range:", train_df["date"].min(), "→", train_df["date"].max())
print("Subset TEST rows:", len(test_df), 
      "| Date range:", test_df["date"].min(), "→", test_df["date"].max())


# -------------------------------
# Consistency checks
train_end = train_df["date"].max()
test_start = test_df["date"].min()

ids_train = set(train_df["unique_id"].unique())
ids_test = set(test_df["unique_id"].unique())

print("\nConsistency checks:")
print(" - Train last date:", train_end)
print(" - Test first date:", test_start)
print(" - Gap between train end and test start:", (test_start - train_end).days, "days")
print(" - All test IDs in train?", ids_test.issubset(ids_train))
print(" - # unique IDs in train:", len(ids_train), "| in test:", len(ids_test))


Subset TRAIN rows: 2116761 | Date range: 2012-06-19 00:00:00 → 2016-05-22 00:00:00
Subset TEST rows: 53615 | Date range: 2016-05-23 00:00:00 → 2016-06-19 00:00:00

Consistency checks:
 - Train last date: 2016-05-22 00:00:00
 - Test first date: 2016-05-23 00:00:00
 - Gap between train end and test start: 1 days
 - All test IDs in train? True
 - # unique IDs in train: 4836 | in test: 4836


### 7) (Optional) .gitignore helper

If you’re versioning this repo, it’s smart to ignore **raw data** and **outputs**.  
We’ll generate a starter `.gitignore` so large files don’t accidentally end up in git history. Run this cell once.

In [20]:
gitignore_path = ROOT / ".gitignore"
lines = [
    "# data (raw, interim, output)",
    "data/input/raw/",
    "data/interim/",
    "data/output/",
    "",
    "# OS/editor files",
    ".DS_Store",
    ".ipynb_checkpoints/",
    ""
]

if not gitignore_path.exists():
    gitignore_path.write_text("\n".join(lines))
    print("Created .gitignore with data folders ignored.")
else:
    print(".gitignore already exists — review to ensure data folders are ignored.")


Created .gitignore with data folders ignored.


### You’re set 🎉

- **Raw M5** lives in `data/input/raw/` (Parquet).  
- **Teaching subset** lives in `data/input/processed/`.  

**Next:** move to `01_initial_eda.ipynb` to define the forecast charter (target, grain, horizon, metrics) and start exploring the data.