# Data Preparation & Quality Notebook

Lean, reproducible preprocessing for the M5 subset used in this capstone. 
**Scope:** M5 only, CPU-only workflow, no external augmentation. 
**Objectives:**
1. Load panel-level demand data (or build a small synthetic fallback).
2. Profile: shape, date range, item counts, missing & zero-demand rates.
3. Validate date continuity per item.
4. Flag simple outliers via z-score (>3).
5. Apply minimal cleaning (non-negative demand, fill NA with 0).
6. Persist a compact JSON + Markdown quality report to `artifacts/data/`.

This notebook directly supports GHGSat-aligned responsibilities: Data Exploration, Curation, Quality, and Rapid Prototyping.
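
Objectives 3 and 4 can be sketched as follows. This is a minimal illustration and not necessarily the code used later in the notebook. It assumes the `item_id`, `date`, and `demand` columns at daily frequency, z-scores computed per item, and outliers judged on the absolute z-score. The helper names `missing_days_per_item` and `flag_zscore_outliers` are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_days_per_item(df: pd.DataFrame) -> pd.Series:
    """Count calendar days absent between each item's first and last date."""
    def _gaps(dates: pd.Series) -> int:
        expected = pd.date_range(dates.min(), dates.max(), freq="D")
        return len(expected.difference(pd.DatetimeIndex(dates)))
    return df.groupby("item_id")["date"].apply(_gaps)


def flag_zscore_outliers(df: pd.DataFrame, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where |z| > threshold, with z computed per item."""
    grouped = df.groupby("item_id")["demand"]
    mean = grouped.transform("mean")
    # A constant series has std 0; map it to NaN so nothing is flagged.
    std = grouped.transform("std").replace(0, np.nan)
    z = (df["demand"] - mean) / std
    return z.abs() > threshold  # comparisons with NaN evaluate to False


# Tiny demo: item A skips 2024-01-03, item B is continuous and constant.
demo = pd.DataFrame({
    "item_id": ["A"] * 4 + ["B"] * 3,
    "date": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05",
        "2024-01-01", "2024-01-02", "2024-01-03",
    ]),
    "demand": [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0],
})
gaps = missing_days_per_item(demo)
outliers = flag_zscore_outliers(demo)
```

With the sample standard deviation, a series of n points cannot produce |z| above (n - 1) / sqrt(n), so a threshold of 3 can only fire for items with at least 11 observations.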

In [None]:
# Imports & paths
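import json
import math
import os
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Input: processed M5 panel subset (a synthetic fallback is generated if absent).
DATA_PROCESSED = Path('data/processed')
PANEL_PATH = DATA_PROCESSED / 'm5_panel_subset.parquet'
# Output: quality-report artifacts.
ARTIFACT_DIR = Path('artifacts/data')
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)
# Fixed seed so the synthetic fallback is reproducible.
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)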
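
With the artifact directory in place, objectives 5 and 6 could look roughly like the sketch below. It is an illustration under assumptions: the function names, the report fields, and the `data_quality_report.*` file names are hypothetical, and the demo writes to a temporary directory rather than `artifacts/data/`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def clean_demand(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with missing demand set to 0 and negatives clipped to 0."""
    out = df.copy()
    out["demand"] = out["demand"].fillna(0).clip(lower=0)
    return out


def write_quality_report(df: pd.DataFrame, out_dir: Path) -> dict:
    """Write a small JSON + Markdown summary (file names are illustrative)."""
    report = {
        "n_rows": int(len(df)),
        "n_items": int(df["item_id"].nunique()),
        "date_min": str(df["date"].min().date()),
        "date_max": str(df["date"].max().date()),
        "missing_rate": float(df["demand"].isna().mean()),
        "zero_rate": float((df["demand"] == 0).mean()),
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "data_quality_report.json").write_text(json.dumps(report, indent=2))
    lines = ["# Data Quality Report", ""]
    lines += [f"- **{key}**: {value}" for key, value in report.items()]
    (out_dir / "data_quality_report.md").write_text("\n".join(lines) + "\n")
    return report


# Demo on a tiny frame with one missing and one negative value.
raw = pd.DataFrame({
    "item_id": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"] * 2),
    "demand": [3.0, None, -1.0, 0.0],
})
with tempfile.TemporaryDirectory() as tmp:
    report_raw = write_quality_report(raw, Path(tmp))
    report_clean = write_quality_report(clean_demand(raw), Path(tmp))
    saved = json.loads((Path(tmp) / "data_quality_report.json").read_text())
```

Profiling the raw frame before cleaning matters here: once missing values are filled with 0, the missing rate is always 0 and the zero-demand rate absorbs the former gaps.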

In [None]:
# Load or create synthetic fallback panel (used if processed subset not present).
if PANEL_PATH.exists():
    panel_df = pd.read_parquet(PANEL_PATH)
    source_note = f'Loaded existing panel: {PANEL_PATH}'
else:
    print('Panel not found -> generating lightweight synthetic fallback (20 items x 200 days).')
    items = [f'ITEM_{i:03d}' for i in range(20)]
    dates = pd.date_range('2024-01-01', periods=200, freq='D')
    rows = []
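    for item in items:
        # Per-item demand level: an integer from 5 to 24.
        base = np.random.randint(5, 25)
        # Six sine cycles across the date range, amplitude drawn from U(3, 8).
        seasonal = np.sin(np.linspace(0, 12 * math.pi, len(dates))) * np.random.uniform(3, 8)
        # Gaussian noise with a per-item scale drawn from U(0.5, 2.0).
        noise = np.random.randn(len(dates)) * np.random.uniform(0.5, 2.0)
        # Demand cannot be negative: clip at 0, then round to 2 decimals.
        demand = (base + seasonal + noise).clip(min=0).round(2)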
        for d, val in zip(dates, demand):
            rows.append({'item_id': item, 'date': d, 'demand': float(val)})
    panel_df = pd.DataFrame(rows)
    DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
    panel_df.to_parquet(PANEL_PATH, index=False)
    source_note = 'Synthetic fallback (saved to processed path)'
panel_df['date'] = pd.to_datetime(panel_df['date'])
print(source_note)
panel_df.head()
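
A panel loaded from disk is not guaranteed to match the layout the synthetic branch produces, so a quick schema check before profiling may be worthwhile. The sketch below assumes the real subset uses the same `item_id`, `date`, and `demand` columns as the fallback; `validate_panel` and `REQUIRED_COLUMNS` are hypothetical names.

```python
import pandas as pd

# Column names taken from the synthetic fallback; the real subset is assumed to match.
REQUIRED_COLUMNS = ("item_id", "date", "demand")


def validate_panel(df: pd.DataFrame) -> None:
    """Raise ValueError if expected columns are absent or keys are duplicated."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"panel is missing columns: {missing}")
    if not pd.api.types.is_datetime64_any_dtype(df["date"]):
        raise ValueError("'date' must be a datetime column")
    n_dupes = int(df.duplicated(subset=["item_id", "date"]).sum())
    if n_dupes:
        raise ValueError(f"{n_dupes} duplicate (item_id, date) rows")


# Demo: a well-formed two-row panel passes silently.
good_panel = pd.DataFrame({
    "item_id": ["A", "A"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "demand": [1.0, 2.0],
})
validate_panel(good_panel)
```

In the notebook this would be called as `validate_panel(panel_df)` after the `pd.to_datetime` conversion.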

In [None]:
# Basic profiling summary
n_rows = len(panel_df)
n_items = panel_df['item_id'].nunique()