# Module 1.08: Data Preparation

> **Goal:** Transform first-contact data into a forecast-ready dataset.

**Module Type:** Transformation

| Input | Output |
|-------|--------|
| `1_06.parquet` (raw weekly data) | `1_08.parquet` (forecast-ready) |
| `1_06.json` (raw state report) | `1_08.json` (prepared state + decisions) |

By the end of this module, you'll have a dataset that is:
- **Continuous in time:** no missing weeks
- **Properly imputed:** domain-appropriate fill policy
- **Enriched with features:** known-at-time calendar attributes
- **Documented:** all decisions logged and traceable

| Step | What | Why |
|------|------|-----|
| 1 | Load + Review Prior State | Understand what we're starting with |
| 2 | Fill Gaps | Complete weekly timeline for every series |
| 3 | Impute Target | Apply domain-appropriate fill policy |
| 4 | Merge Calendar | Add known-at-time features |
| 5 | Document & Save | Log all decisions, compare to prior |

---

## 1. Setup

In [1]:
# =============================================================================
# SETUP
# =============================================================================

# --- Imports ---
import sys
import warnings
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from utilsforecast.preprocessing import fill_gaps

# --- Path Configuration ---
MODULE_DIR = Path().resolve()
PROJECT_ROOT = MODULE_DIR.parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

# --- Local Imports ---
from src import (
    CacheManager,
    first_contact_check,
    load_m5_calendar,
    aggregate_calendar_to_weekly,
)

# --- Settings ---
warnings.filterwarnings("ignore")
plt.style.use("seaborn-v0_8-whitegrid")

# --- Managers ---
cache = CacheManager(MODULE_DIR / ".cache")
outputs = CacheManager(PROJECT_ROOT / "outputs")

print(f"✓ Setup complete | Root: {PROJECT_ROOT.name}")

✓ Setup complete | Root: real-world-forecasting-foundations


---

## 2. Load & Review Prior State

Load the output of Module 1.06 and review its state report to understand what we're starting with.

In [2]:
# Load prior module's output + report
df, prior_report = outputs.load('1_06', with_report=True)

⚠ Cache '1_06' not found


In [3]:
# Review prior state
print("📊 Starting State (from 1.06):")
prior_report.summary_table()

📊 Starting State (from 1.06):


AttributeError: 'NoneType' object has no attribute 'summary_table'

In [None]:
# Check for issues to fix
prior_report.table()
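
The run captured above stopped because `1_06` was not found, which left `prior_report` as `None`. A small guard can turn that into an explicit message before any later cell runs. This is a minimal sketch: `require_loaded` is a hypothetical helper, and it assumes `outputs.load` returns `None` values on a cache miss, which is what the traceback suggests but is not confirmed here.

```python
def require_loaded(df, report, key: str):
    """Fail fast with a clear message if a prior module's output is missing."""
    # Assumption: a cache miss yields None for the data and/or the report
    if df is None or report is None:
        raise FileNotFoundError(
            f"Output '{key}' not found. Run the module that produces it first."
        )
    return df, report


# Demonstrate the failure path with placeholder values
try:
    require_loaded(None, None, "1_06")
except FileNotFoundError as err:
    message = str(err)
    print(message)
```

In the notebook, this would wrap the load call, for example `df, prior_report = require_loaded(*outputs.load('1_06', with_report=True), '1_06')`.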

### 2.1 Diagnose Gaps

How many series have missing weeks?

In [None]:
# Expected number of weeks per series, measured over the global date range
date_range = df['ds'].agg(['min', 'max'])
expected_weeks = ((date_range['max'] - date_range['min']).days // 7) + 1

# Actual weeks per series
actual_weeks = df.groupby('unique_id')['ds'].nunique()

# Series with fewer weeks than the global span
# (this also counts series that start late or end early)
series_with_gaps = (actual_weeks < expected_weeks).sum()
total_series = df['unique_id'].nunique()

print(f"Expected weeks per series: {expected_weeks}")
print(f"Series with gaps: {series_with_gaps:,} / {total_series:,} ({series_with_gaps/total_series:.1%})")
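
The count above compares every series against the global date range, so a series that starts late is counted the same way as one with missing weeks in the middle. The sketch below separates the two cases on toy data. It assumes the same `unique_id` and `ds` columns; `gap_summary` is a hypothetical helper, not part of `src`.

```python
import pandas as pd


def gap_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-series week counts: observed, expected within own span, interior gaps."""
    out = df.groupby("unique_id")["ds"].agg(
        first="min", last="max", observed="nunique"
    )
    # Weeks between the first and last observation of each series, inclusive
    out["expected_in_span"] = (out["last"] - out["first"]).dt.days // 7 + 1
    out["interior_gaps"] = out["expected_in_span"] - out["observed"]
    return out


weeks = pd.date_range("2020-01-04", periods=6, freq="W-SAT")
toy = pd.DataFrame({
    "unique_id": ["A"] * 6 + ["B"] * 3 + ["C"] * 3,
    # A: complete, B: two weeks missing in the middle, C: starts late
    "ds": list(weeks) + [weeks[0], weeks[1], weeks[4]] + list(weeks[3:]),
})
summary = gap_summary(toy)
print(summary[["observed", "expected_in_span", "interior_gaps"]])
```

Series `C` has fewer weeks than the global span but no interior gaps, so whether it needs filling depends on the `start` and `end` policy chosen in the next step.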

---

## 3. Fill Gaps

Ensure every series has a complete weekly timeline.

In [None]:
# Fill gaps using Nixtla's fill_gaps
# This creates rows for missing dates with NA values
df_filled = fill_gaps(
    df,
    freq='W-SAT',        # Weekly, anchored on Saturday (Walmart fiscal week)
    start='per_serie',   # Use min date per series
    end='per_serie'      # Use max date per series ('global' would extend to the latest date in the data)
)

print(f"Before: {len(df):,} rows")
print(f"After:  {len(df_filled):,} rows")
print(f"Added:  {len(df_filled) - len(df):,} gap rows")

In [None]:
# Add gap flag for traceability (must run before imputation, while gap rows are still NA)
df_filled['is_gap'] = df_filled['y'].isna()
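
To make the effect of gap filling concrete, here is a pandas-only sketch that reindexes each series onto a complete weekly grid between its own first and last date. It illustrates the idea on toy data and is not how `fill_gaps` is implemented; `fill_weekly_gaps` is a hypothetical name.

```python
import pandas as pd


def fill_weekly_gaps(df: pd.DataFrame, freq: str = "W-SAT") -> pd.DataFrame:
    """Insert a NaN row for every missing week between each series' first and last date."""
    pieces = []
    for uid, grp in df.groupby("unique_id"):
        full_index = pd.date_range(grp["ds"].min(), grp["ds"].max(), freq=freq)
        filled = (
            grp.set_index("ds")
            .reindex(full_index)   # missing weeks become all-NaN rows
            .rename_axis("ds")
            .reset_index()
        )
        filled["unique_id"] = uid  # restore the id on the newly created rows
        pieces.append(filled)
    return pd.concat(pieces, ignore_index=True)


toy = pd.DataFrame({
    "unique_id": ["A", "A", "A"],
    "ds": pd.to_datetime(["2020-01-04", "2020-01-11", "2020-01-25"]),
    "y": [5.0, 3.0, 4.0],
})
toy_filled = fill_weekly_gaps(toy)
toy_filled["is_gap"] = toy_filled["y"].isna()
print(toy_filled)
```

The week of 2020-01-18 appears as a new row with `y` missing and `is_gap` set to `True`; the observed rows are unchanged.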

---

## 4. Impute Target

Fill NA values in the target with a domain-appropriate strategy.

In [None]:
# Count NAs before imputation
na_before = df_filled['y'].isna().sum()
print(f"NAs before imputation: {na_before:,}")

In [None]:
# Imputation strategy: Zero fill
# Rationale: In retail, missing data typically means no sales occurred
df_filled['y'] = df_filled['y'].fillna(0)

na_after = df_filled['y'].isna().sum()
print(f"NAs after imputation: {na_after:,}")
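
Because `is_gap` is recorded before imputation, the zero fill can be undone or swapped for another policy later. The sketch below shows this on toy data. The forward-fill alternative is only an example of a different policy, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "unique_id": ["A"] * 4,
    "ds": pd.date_range("2020-01-04", periods=4, freq="W-SAT"),
    "y": [5.0, np.nan, np.nan, 4.0],
})
toy["is_gap"] = toy["y"].isna()   # flag first, then impute
toy["y"] = toy["y"].fillna(0)     # zero-fill policy used in this module

# Undo the imputation: put NaN back wherever the row was a filled gap
restored = toy["y"].mask(toy["is_gap"])

# Example alternative policy: carry the last observed value forward per series
toy["y_ffill"] = restored.groupby(toy["unique_id"]).ffill()
print(toy)
```

Genuine zeros in the original data are not flagged by `is_gap`, so they survive the round trip untouched.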

---

## 5. Merge Calendar Features

Add known-at-time calendar attributes.

### 5.1 Load & Aggregate Calendar

In [None]:
# Load raw daily calendar
calendar = load_m5_calendar(PROJECT_ROOT / 'data')
calendar.head()

In [None]:
# Aggregate to weekly
weekly_calendar = aggregate_calendar_to_weekly(calendar)
weekly_calendar.head()
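
The internals of `aggregate_calendar_to_weekly` are not shown in this module. As a rough sketch of the "events: any, SNAP: max" policy, the example below aggregates a toy daily calendar to Sun-Sat weeks. The column names (`date`, `event_name_1`, `snap_CA`) are modeled on the M5 calendar file, and `has_event` is a hypothetical output column; the real helper may differ.

```python
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2020-01-05", periods=14, freq="D"),  # two Sun-Sat weeks
    "event_name_1": [None] * 14,
    "snap_CA": [0] * 14,
})
daily.loc[3, "event_name_1"] = "ExampleEvent"  # one event day in week 1
daily.loc[[8, 9], "snap_CA"] = 1               # two SNAP days in week 2

# Label each day with the Saturday that ends its Sun-Sat week
daily["ds"] = daily["date"].dt.to_period("W-SAT").dt.end_time.dt.normalize()

weekly = (
    daily.assign(has_event=daily["event_name_1"].notna())
    .groupby("ds", as_index=False)
    .agg(
        has_event=("has_event", "any"),  # any event day flags the whole week
        snap_CA=("snap_CA", "max"),      # any SNAP day flags the whole week
    )
)
print(weekly)
```

A single event day is enough to flag the week, which is the "one event day = event week" assumption logged later in this module.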

### 5.2 Merge with Sales Data

In [None]:
# Merge on date
df_merged = df_filled.merge(weekly_calendar, on='ds', how='left')

print(f"Columns before: {df_filled.shape[1]}")
print(f"Columns after:  {df_merged.shape[1]}")
print(f"Added: {df_merged.shape[1] - df_filled.shape[1]} calendar features")
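
A left merge on `ds` can silently duplicate rows if the calendar has repeated weeks, or leave calendar columns empty if the week labels don't align. The sketch below uses the `validate` and `indicator` parameters of pandas `merge` to check both on toy data; the frames and the `has_event` column are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({
    "unique_id": ["A", "A", "B"],
    "ds": pd.to_datetime(["2020-01-11", "2020-01-18", "2020-01-11"]),
    "y": [5.0, 0.0, 2.0],
})
cal = pd.DataFrame({
    "ds": pd.to_datetime(["2020-01-11", "2020-01-18"]),
    "has_event": [True, False],
})

merged = sales.merge(
    cal,
    on="ds",
    how="left",
    validate="many_to_one",  # raises MergeError if the calendar has duplicate weeks
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
unmatched = int((merged["_merge"] == "left_only").sum())
assert len(merged) == len(sales), "left merge should not add rows"
assert unmatched == 0, f"{unmatched} sales rows have no calendar week"
merged = merged.drop(columns="_merge")
print(merged)
```

If `unmatched` is non-zero, the usual cause is a mismatch in week anchoring between the sales data and the aggregated calendar.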

In [None]:
df_merged.head()

### 5.3 What We're NOT Adding (Yet)

| Feature | Why Excluded | When to Add |
|---------|--------------|-------------|
| Price features | Requires lagging to avoid leakage | Feature Engineering module |
| Lag features | Created during model training | Modeling module |
| Outlier flags | Need baseline forecast first | Post-baseline module |

---

## 6. Document & Save

Create final report with all decisions logged.

### 6.1 Create Report with Decisions

In [None]:
# Create report for prepared data
report = first_contact_check(
    df_merged, 
    dataset_name='1.08 Prepared',
    prior_module='1_06'
)

In [None]:
# Log all decisions
report.log_decision(
    step='Gap Detection',
    decision='datetime_diagnostics()',
    assumption='Weekly frequency is correct',
    reversible=True,
    note='Re-run with different freq if needed'
)

report.log_decision(
    step='Gap Filling',
    decision="fill_gaps(freq='W-SAT')",
    assumption='Series should span full date range',
    reversible=True,
    note='is_gap flag preserved for traceability'
)

report.log_decision(
    step='Imputation',
    decision='NA → 0',
    assumption='Missing = no sales (retail domain)',
    reversible=True,
    note='Can re-impute using is_gap flag'
)

report.log_decision(
    step='Calendar Merge',
    decision='Weekly aggregation (events: any, SNAP: max)',
    assumption='One event day = event week',
    reversible=True,
    note='Raw calendar available for different aggregation'
)

### 6.2 View Data Evolution

In [None]:
# Show what changed
report.evolution(prior_report)

In [None]:
# Show decisions made
report.decisions_table()

In [None]:
# Or use combined display
report.evolution_display(prior_report)

### 6.3 Final Report Card

In [None]:
report.table()

### 6.4 Save Output

In [None]:
# Save to outputs/01_foundations/
outputs.save(
    df=df_merged,
    key='1_08',
    unit='01_foundations',
    report=report,
    config={
        'freq': 'W-SAT',
        'imputation': 'zero',
        'calendar_merged': True,
        'gaps_filled': True
    }
)

---

## 7. Summary

### What We Did

| Step | Before | After |
|------|--------|-------|
| Gap Filling | Series with gaps | Continuous timeline |
| Imputation | NAs in target | All values filled |
| Calendar Merge | 8 columns | 20+ columns |

### Key Assumptions

1. **Missing data = no sales.** Zero fill is domain-appropriate for retail.
2. **Events: any-in-week.** If any day in the week had an event, the week is flagged.
3. **SNAP: max-in-week.** If any day in the week had SNAP, the week is flagged.
4. **Fiscal week: Sun-Sat.** This follows Walmart's calendar, not the ISO week.
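
To see what the Sun-Sat assumption means in pandas terms, the sketch below shows that `W-SAT` labels each week by its closing Saturday, and that the same Sunday lands in a different week under a Monday-Sunday (ISO-style) grouping. The dates are arbitrary examples.

```python
import pandas as pd

# W-SAT timestamps always fall on a Saturday
weeks = pd.date_range("2020-01-01", periods=3, freq="W-SAT")
print(weeks.day_name().tolist())

# The same Sunday belongs to different weeks under the two conventions
sunday = pd.Timestamp("2020-01-05")
fiscal_week_end = sunday.to_period("W-SAT").end_time.normalize()  # Sun-Sat week
iso_week_end = sunday.to_period("W-SUN").end_time.normalize()     # Mon-Sun week
print(fiscal_week_end.date(), iso_week_end.date())
```

Under `W-SAT` the Sunday opens a new week ending the following Saturday, while under `W-SUN` it closes the week that began the previous Monday.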

---

## Next Steps

| Module | Focus |
|--------|-------|
| **1.11** | Plotting & visual diagnostics |
| **2.01** | Baseline models: naive, seasonal naive |
| **2.02** | Statistical models: ETS, ARIMA |