# Module 1.06: First Contact with the Data

> **Goal:** Run first-contact checks to confirm the data can support the 5Q Framework.

**Module Type:** Observation (light cleaning and weekly aggregation only; no imputation)

| Artifact | Description |
|----------|-------------|
| `1_06.parquet` | Weekly M5 sales (cleaned dtypes, duplicates removed) |
| `1_06.json` | First Contact Report (raw data state) |

Sections 3 through 6 each map to one question of the 5Q Framework (Q1 to Q4):

| Q | Name | What It Defines | First Contact Check |
|---|------|-----------------|---------------------|
| **Q1** | Decision | The Target | Is `y` clear, numeric, clean? |
| **Q2** | Metric | What "Good" Means | Issues that bias evaluation? (NAs, duplicates) |
| **Q3** | Horizon & Level | The Structure | Enough history? Right granularity? |
| **Q4** | Data & Drivers | What the Model Learns | Behavioral signals (zeros, volatility) |

## 1. Setup

In [None]:
# =============================================================================
# SETUP
# =============================================================================

# --- Imports ---
import sys
import warnings
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# --- Path Configuration ---
MODULE_DIR = Path().resolve()
PROJECT_ROOT = MODULE_DIR.parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

# --- Local Imports ---
from src import load_m5, CacheManager, first_contact_check

# --- Settings ---
warnings.filterwarnings("ignore")
plt.style.use("seaborn-v0_8-whitegrid")

# --- Managers ---
cache = CacheManager(MODULE_DIR / ".cache")       # Intermediate (gitignored)
outputs = CacheManager(PROJECT_ROOT / "outputs")  # Curriculum artifacts

print(f"✓ Setup complete | Root: {PROJECT_ROOT.name}")

## 2. Load Data

`messify=True` simulates real-world data issues (string dtypes, injected NaNs, duplicates, etc.).

In [None]:
daily_sales = load_m5(
    data_dir=MODULE_DIR / "data",
    cache=cache,
    cache_key='m5_messified',
    messify=True,
    messify_config={
        'random_state': 42,
        'zeros_to_na_frac': 0.30,
        'zeros_drop_frac': 0.02,
        'zeros_drop_gaps_frac': 0.10,
        'duplicates_add_n': 150,
        'na_drop_frac': None,
        'dtypes_corrupt': True,
    },
    include_hierarchy=True,
)

---

## 3. Q1: Decision — The Target

<div style="background: linear-gradient(135deg, #11998e 0%, #38ef7d 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px 0; max-width: 600px;">
<strong>Is the target clear, numeric, and clean?</strong>
</div>

### 3.1 Identify the Target

In [None]:
daily_sales.columns

### 3.2 Check Data Types

In [None]:
daily_sales[['ds', 'y']].dtypes

### 3.3 Fix Data Type Issues

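Messification corrupts dtypes. Parsing with `errors='coerce'` turns any unparseable value into `NaT` (dates) or `NaN` (numbers) instead of raising an error.

In [None]:
# Coerce both columns so bad values become NaT/NaN rather than raising
daily_sales['ds'] = pd.to_datetime(daily_sales['ds'], errors='coerce')
daily_sales['y'] = pd.to_numeric(daily_sales['y'], errors='coerce')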

In [None]:
daily_sales[['ds', 'y']].dtypes
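
As a small illustration of what coercion does, the sketch below uses toy values (not M5 data) with one unparseable entry per column. The column contents are made up for the example.

```python
import pandas as pd

# Toy target column stored as strings, with one entry that is not a number
raw_y = pd.Series(["3", "0", "n/a", "7.5"])
clean_y = pd.to_numeric(raw_y, errors="coerce")

# Toy date column with one entry that is not a date
raw_ds = pd.Series(["2011-01-29", "not a date", "2011-01-31"])
clean_ds = pd.to_datetime(raw_ds, errors="coerce")

print(clean_y.dtype, clean_y.isna().sum())    # numeric dtype, one NaN
print(clean_ds.dtype, clean_ds.isna().sum())  # datetime dtype, one NaT
```

Coerced values show up in the missing-value counts afterwards, so bad inputs are not silently lost from view.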

---

## 4. Q2: Metric — What "Good" Means

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px 0; max-width: 600px;">
<strong>Are there issues that would bias our evaluation?</strong><br>
<em>Missingness, duplicates, and orphan data corrupt metrics.</em>
</div>

### 4.1 Check for Missing Values

In [None]:
daily_sales.isna().sum()
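
Raw counts are hard to compare across columns, so it can help to view them next to the missing share. A minimal sketch on a toy frame (the column values are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ds": ["2011-01-29", None, "2011-01-31", "2011-02-01"],
    "y": [1.0, np.nan, np.nan, 4.0],
    "store_id": ["CA_1", "CA_1", "CA_1", "CA_1"],
})

# Count and share of missing values per column
na_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
})
print(na_summary)
```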

### 4.2 Handle Critical NAs

Drop rows with null dates, since a row without a date cannot be assigned to a week. Target NAs are left in place and handled in Module 1.08.

In [None]:
daily_sales = daily_sales.dropna(subset=['ds'])

### 4.3 Check for Duplicates

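Duplicates inflate aggregates and bias metrics. A row counts as a duplicate when it matches an earlier row on every column except `y`; the first occurrence is kept.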

In [None]:
non_target_cols = [c for c in daily_sales.columns if c != 'y']
n_dups = daily_sales.duplicated(subset=non_target_cols).sum()
print(f"Duplicates found: {n_dups:,}")

In [None]:
daily_sales = daily_sales.drop_duplicates(subset=non_target_cols)
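
The toy example below sketches why duplicates are matched on the non-target columns: a repeated series/date pair is flagged even when its `y` differs, and leaving it in would inflate the total. The data is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "unique_id": ["A", "A", "A", "B"],
    "ds": pd.to_datetime(["2011-01-29", "2011-01-29", "2011-01-30", "2011-01-29"]),
    "y": [2.0, 5.0, 1.0, 3.0],
})

key_cols = [c for c in toy.columns if c != "y"]

# True for the second and later occurrences of the same key
dup_mask = toy.duplicated(subset=key_cols)

# keep='first' is the default, so the earliest row per key survives
deduped = toy.drop_duplicates(subset=key_cols)

print(f"Duplicates: {dup_mask.sum()}")
print(f"Total y before: {toy['y'].sum()}, after: {deduped['y'].sum()}")
```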

---

## 5. Q3: Horizon & Level — The Structure

<div style="background: linear-gradient(135deg, #4facfe 0%, #00f2fe 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px 0; max-width: 600px;">
<strong>Do we have enough history at the right granularity?</strong>
</div>

### 5.1 Daily Data Summary

In [None]:
daily_sales.head(10)
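
To put a number on "enough history", one option is to summarize the date span and observation count per series. This is a sketch on toy data, assuming `ds` is already a datetime column and `unique_id` identifies a series; the `history` table and its column names are introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "unique_id": ["A"] * 4 + ["B"] * 2,
    "ds": pd.to_datetime([
        "2011-01-29", "2011-01-30", "2011-01-31", "2011-02-05",
        "2011-01-29", "2011-01-30",
    ]),
    "y": [1, 0, 2, 1, 3, 0],
})

# First date, last date, and number of rows per series
history = toy.groupby("unique_id")["ds"].agg(start="min", end="max", n_obs="count")

# Calendar span in days (inclusive) and how many of those days have no row
history["span_days"] = (history["end"] - history["start"]).dt.days + 1
history["missing_days"] = history["span_days"] - history["n_obs"]
print(history)
```

A large `missing_days` value relative to `span_days` points at gaps in the series rather than a short history.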

### 5.2 Aggregate to Weekly

Daily data is noisy. Weekly aggregation:
- Smooths noise
- Makes patterns visible
- Reduces data size

In [None]:
# Weekly buckets run Sunday-Saturday: 'W-SAT' periods end on Saturday,
# and start_time labels each week by its Sunday
daily_sales['week'] = daily_sales['ds'].dt.to_period('W-SAT').dt.start_time
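
To see how dates map to week labels, the sketch below applies the same `to_period('W-SAT')` call to nine consecutive toy dates starting on a Saturday. The date range is chosen only for illustration.

```python
import pandas as pd

# Nine consecutive days; 2011-01-29 is a Saturday
days = pd.Series(pd.date_range("2011-01-29", periods=9, freq="D"))

# 'W-SAT' periods end on Saturday, so start_time is the preceding Sunday
week_start = days.dt.to_period("W-SAT").dt.start_time

mapping = pd.DataFrame({
    "day": days.dt.strftime("%a %Y-%m-%d"),
    "week_start": week_start.dt.strftime("%a %Y-%m-%d"),
})
print(mapping)
```

The first date is a Saturday, so it closes the week that began the previous Sunday. A series that starts or ends mid-week therefore produces a partial first or last week, which is worth keeping in mind when reading the weekly totals.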

In [None]:
# Aggregation columns
id_cols = ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'unique_id']

# Aggregate: sum sales, keep hierarchy
# Note: a groupby sum skips NaN, so a week whose daily values are all NaN sums to 0
weekly_sales = (
    daily_sales
    .groupby(id_cols + ['week'], as_index=False)
    .agg({'y': 'sum'})
    .rename(columns={'week': 'ds'})
)
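
How missing daily values pass through a weekly sum depends on the `min_count` argument of the pandas groupby `sum`. The toy sketch below compares the default with `min_count=1`; it illustrates the pandas behavior and is not a statement of which variant this module intends.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "unique_id": ["A", "A", "A", "A"],
    "week": pd.to_datetime(["2011-01-30", "2011-01-30", "2011-02-06", "2011-02-06"]),
    "y": [2.0, np.nan, np.nan, np.nan],
})

grouped = toy.groupby(["unique_id", "week"], as_index=False)["y"]

# Default: NaN is skipped, and a week with only NaN sums to 0.0
default_sum = grouped.sum()

# min_count=1: a week needs at least one non-NaN value, otherwise it stays NaN
strict_sum = grouped.sum(min_count=1)

print(default_sum)
print(strict_sum)
```

With the default, an all-missing week is indistinguishable from a true zero-sales week; with `min_count=1` it stays visible as `NaN`.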

In [None]:
print(f"Daily:  {len(daily_sales):,} rows")
print(f"Weekly: {len(weekly_sales):,} rows")
print(f"Reduction: {1 - len(weekly_sales)/len(daily_sales):.1%}")

### 5.3 Visualize Daily vs Weekly

In [None]:
# Sample one series, sorted by date so the lines plot in time order
sample_id = daily_sales['unique_id'].iloc[0]
first_series = daily_sales[daily_sales['unique_id'] == sample_id].sort_values('ds')
sample_daily = first_series.head(365)
sample_weekly = weekly_sales[weekly_sales['unique_id'] == sample_id].sort_values('ds').head(52)

label = f"{first_series['item_id'].iloc[0]} @ {first_series['store_id'].iloc[0]}"

fig, axes = plt.subplots(2, 1, figsize=(12, 6))

axes[0].plot(sample_daily['ds'], sample_daily['y'], alpha=0.7, linewidth=0.6)
axes[0].set_title(f'Daily Sales (noisy)\n{label}')
axes[0].set_ylabel('Units')

axes[1].plot(sample_weekly['ds'], sample_weekly['y'], alpha=0.9, linewidth=1.5, color='tab:orange')
axes[1].set_title(f'Weekly Sales (patterns visible)\n{label}')
axes[1].set_ylabel('Units')

plt.tight_layout()
plt.show()

---

## 6. Q4: Data — What the Model Learns

<div style="background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px 0; max-width: 600px;">
<strong>What behavioral signals shape model selection?</strong><br>
<em>Intermittency, volatility, and data quality affect what the model can learn.</em>
</div>

### 6.1 Intermittency (Zeros + NAs)

In [None]:
n_zeros = (weekly_sales['y'] == 0).sum()
n_na = weekly_sales['y'].isna().sum()
n_total = len(weekly_sales)

print(f"Zeros: {n_zeros:,} ({n_zeros/n_total:.1%})")
print(f"NAs:   {n_na:,} ({n_na/n_total:.1%})")
print(f"Total intermittent: {(n_zeros+n_na)/n_total:.1%}")
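
The counts above are pooled across all series. Since intermittency and volatility vary by series, a per-series profile can be sketched as below. The toy data and the choice of coefficient of variation (standard deviation divided by mean) as the volatility measure are assumptions for illustration, not part of this module's report.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "unique_id": ["A"] * 4 + ["B"] * 4,
    "y": [0.0, 0.0, 4.0, np.nan, 10.0, 12.0, 8.0, 10.0],
})

# Per-series share of zeros, share of NAs, and coefficient of variation
profile = toy.groupby("unique_id")["y"].agg(
    zero_share=lambda s: (s == 0).mean(),
    na_share=lambda s: s.isna().mean(),
    cv=lambda s: s.std() / s.mean(),
)
print(profile)
```

Series A is both sparser and more volatile than series B, which is the kind of contrast that matters when choosing a model family per series.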

### 6.2 Data Shape Summary

In [None]:
weekly_sales.info(memory_usage='deep')

---

## 7. First Contact Report

Run the automated first-contact checks on the weekly data:

In [None]:
report = first_contact_check(weekly_sales, dataset_name='1.06 Weekly M5')

In [None]:
report.table()

In [None]:
report.summary_table()

---

## 8. Save Output

Save the weekly data for downstream modules. Gaps are left unfilled here; gap-filling happens in Module 1.08.

In [None]:
# Save to outputs/01_foundations/
outputs.save(
    df=weekly_sales,
    key='1_06',
    unit='01_foundations',
    report=report
)

---

## 9. Next Steps

| Module | Focus |
|--------|-------|
| **1.07** | Understand M5 structure (hierarchy, calendar, prices) |
| **1.08** | Data preparation (fill gaps, impute, merge calendar) |
| **1.11** | Plotting & visual diagnostics |