# Module 1.6: First Contact with the Data

> **Goal:** Run first-contact checks to confirm the data can support the 5Q Framework.

Each section maps to a 5Q question:

| Q | Name | What It Defines | First Contact Check |
|---|------|-----------------|---------------------|
| **Q1** | Decision | The Target | Is `y` clear, numeric, clean? |
| **Q2** | Metric | What "Good" Means | Issues that bias evaluation? (NAs, duplicates) |
| **Q3** | Horizon & Level | The Structure | Enough history? Right granularity? |
| **Q4** | Data & Drivers | What Model Learns | Behavioral signals (zeros, volatility) |

---

## 1. Setup

In [1]:
# --- Imports ---
import sys
import os
import warnings
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from dtype_diet import optimize_dtypes, report_on_dataframe
import forecast_foundations as ff
from tsforge.plots import plot_timeseries, plot_distribution

# --- Settings ---

# Project Root Setup
markers = ('.git', 'pyproject.toml', '.project-root')
p = Path.cwd().resolve()
PROJECT_ROOT = next((d for d in [p] + list(p.parents) if any((d / m).exists() for m in markers)), p)
os.chdir(PROJECT_ROOT)
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# Data Directory
DATA_DIR = PROJECT_ROOT / 'data'
OUTPUT_DIR = DATA_DIR / 'output'

## 2. Load Data

`messify=True` simulates real-world data issues (string dtypes, NaN injection, duplicates, etc.).

In [2]:
# M5 data downloads to ROOT_DIR/data, messified cache goes to DATA_DIR
daily_sales = ff.load_m5(
    DATA_DIR,
    messify=True,
    messify_config={
        'random_state': 42,
        'zeros_drop_frac': 1,              # Drop 100% of zero rows
        'duplicates_add_n': 150,           # Add 150 duplicates
        'dtypes_corrupt': True,            # Corrupt dtypes
    },
    include_hierarchy=True
)

LOADING M5 DATA
✓ M5 cache detected. Loading from local files...
✓ Loaded in 0.4s
  Shape: 47,649,940 rows × 3 columns
  Memory: 1,001.7 MB

🔧 Applying messification...
Step 1/6: Converting zeros to NAs...
  ✓ Converted 4,249,314 zeros to NAs
Step 2/6: Adding duplicate rows...
  ✓ Added 150 duplicate rows
Step 3/6: Dropping zero-demand rows (sparse reporting)...
  ✓ Dropped 24,079,523 zero-demand rows
Step 4/6: Dropping NA rows... [SKIPPED]
Step 5/6: Creating internal gaps... [SKIPPED]
Step 6/6: Corrupting data types...
  ✓ Converted ds to string
  ✓ Converted y to string

DATA MESSIFICATION SUMMARY

Original shape: 47,649,940 rows × 3 columns
Messified shape: 23,570,567 rows × 3 columns

Changes applied (5):
  1. Converted 4,249,314 zeros to NAs (15% of zeros)
  2. Added 150 duplicate rows
  3. Dropped 24,079,523 zero-demand rows (100% of zeros)
  4. Converted ds to string dtype
  5. Converted y to string dtype

✓ Data successfully messified!

🏗️ Expanding h

---

<div style="text-align: center;">

## 3. `Q1: Decision` - Defines the Target

<div style="background: linear-gradient(135deg, #2d42a7 0%, #3a2f7e 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px auto; max-width: 600px;">
<strong>Is the target clear, numeric, and clean?</strong><br>
</div>

</div>


### 3.1 Identify target columns

In [3]:
# What columns do we have? What are we forecasting?
daily_sales.columns

Index(['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'unique_id',
       'ds', 'y'],
      dtype='object')

### 3.2 Check dtypes

In [4]:
# Are ds and y the right types?
daily_sales[['ds', 'y']].dtypes

ds    object
y     object
dtype: object

### 3.3 Fix dtypes

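Messification converted both `ds` and `y` to strings. `pd.to_datetime` restores `ds`, and `pd.to_numeric` with `errors='coerce'` restores `y`, turning any unparseable value into a proper NaN instead of raising.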

In [5]:
daily_sales['ds'] = pd.to_datetime(daily_sales['ds'])
daily_sales['y'] = pd.to_numeric(daily_sales['y'], errors='coerce')

In [6]:
# Verify fix
daily_sales[['ds', 'y']].dtypes

ds    datetime64[ns]
y            float64
dtype: object
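
Because `errors='coerce'` fails silently, it can help to count how many values it turned into NaN. The following is a minimal sketch on a made-up toy series (the values and the names `raw_y` and `newly_missing` are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for a corrupted target column (values are invented for illustration)
raw_y = pd.Series(["3", "1.0", "n/a", None, "4"], dtype="object")

# Unparseable strings become NaN instead of raising an error
y_num = pd.to_numeric(raw_y, errors="coerce")

# Values that were present as text but could not be parsed as numbers
newly_missing = raw_y.notna() & y_num.isna()

print(y_num.dtype)               # float64
print(int(newly_missing.sum()))  # 1 (the "n/a" entry)
```

A non-zero count here means the coercion discarded information, which is worth investigating before moving on.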

### 3.4 Optimize memory

Rule of thumb: keep DataFrames under 1 GB to avoid memory issues. The cells below report memory in MB.

In [7]:
# Is the memory manageable?
daily_sales.memory_usage(deep=True).sum() / 1e6

568.975091

In [8]:
# use the dtype-diet package to optimize dtypes
daily_sales = optimize_dtypes(daily_sales, report_on_dataframe(daily_sales))


In [9]:
# memory after optimization
daily_sales.memory_usage(deep=True).sum() / 1e6


427.551689
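
For intuition about where such savings can come from, here is a hedged sketch of manual downcasting using plain pandas. It is not how `dtype_diet` works internally; the toy frame and the name `slim` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: a low-cardinality string column and a small-valued float column
toy = pd.DataFrame({
    "store_id": ["CA_1", "CA_2"] * 500,
    "y": np.arange(1000, dtype="float64") % 7,
})
before = toy.memory_usage(deep=True).sum()

slim = toy.copy()
# Repeated strings are stored once as categories plus small integer codes
slim["store_id"] = slim["store_id"].astype("category")
# Downcast float64 to the smallest float dtype that holds the values (float32)
slim["y"] = pd.to_numeric(slim["y"], downcast="float")

after = slim.memory_usage(deep=True).sum()
print(f"{before / 1e6:.3f} MB -> {after / 1e6:.3f} MB")
```

Downcasting floats trades precision for memory, so it is safest on small integer-like counts such as unit sales.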

---

<div style="text-align: center;">

## 4. `Q2: Metric` - Defines What "Good" Means

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px auto; max-width: 600px;">
<strong>Are there issues that would bias our evaluation?</strong><br>
<em>Missingness, duplicates, and orphan data corrupt metrics.</em>
</div>

</div>


### 4.1 Check for NAs

In [10]:
# Where are the NAs?
daily_sales.isna().sum()

item_id            0
dept_id            0
cat_id             0
store_id           0
state_id           0
unique_id          0
ds                 0
y            4249326
dtype: int64

### 4.2 Drop invalid dates

Drop rows with null dates (can't aggregate without a date).

In [11]:
daily_sales = daily_sales.dropna(subset=['ds'])

### 4.3 Check for orphans

Rows with null ID columns are orphan data: they can't be aggregated properly.

In [12]:
id_cols = [c for c in daily_sales.columns if c not in ['ds', 'y']]
daily_sales[id_cols].isna().sum()

item_id      0
dept_id      0
cat_id       0
store_id     0
state_id     0
unique_id    0
dtype: int64

### 4.4 Remove duplicates

Duplicates inflate aggregates and bias metrics. Remove before aggregating.

In [13]:
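# Every column except the target defines a row's identity
non_target_cols = [c for c in daily_sales.columns if c != 'y']
# Boolean mask: True for each repeat of an already-seen key (the first occurrence is kept)
is_dup = daily_sales.duplicated(subset=non_target_cols)

In [14]:
is_dup.value_counts()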

False    23570491
True           76
Name: count, dtype: int64

In [15]:
# Remove them
daily_sales = daily_sales.drop_duplicates(subset=non_target_cols)
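
To see why duplicates matter for aggregates, consider this small sketch on invented data (the frame `toy` and the key columns chosen here are illustrative assumptions):

```python
import pandas as pd

# Toy data: series A has the same day recorded twice
toy = pd.DataFrame({
    "unique_id": ["A", "A", "A", "B"],
    "ds": pd.to_datetime(["2011-01-29", "2011-01-29", "2011-01-30", "2011-01-29"]),
    "y": [3.0, 3.0, 1.0, 2.0],
})
key_cols = ["unique_id", "ds"]

# duplicated() marks repeats of a key; the first occurrence is not flagged
dup_mask = toy.duplicated(subset=key_cols)
deduped = toy.drop_duplicates(subset=key_cols)

print(int(dup_mask.sum()))                  # 1
print(toy["y"].sum(), deduped["y"].sum())   # 9.0 6.0
```

The duplicated row adds 3 units of demand that never existed, which would carry straight into any weekly total.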

---

<div style="text-align: center;">

## 5. `Q3: Horizon & Level` - Defines the Structure

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px auto; max-width: 600px;">
<strong>Do we have enough history at the right granularity?</strong><br>
</div>

</div>


### 5.1 Preview data

In [16]:
# What does the data look like?
daily_sales.head(10)

| | item_id | dept_id | cat_id | store_id | state_id | unique_id | ds | y |
|---|---------|---------|--------|----------|----------|-----------|----|---|
| 0 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-01-29 | 3.0 |
| 1 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-01 | 1.0 |
| 2 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-02 | 4.0 |
| 3 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-03 | 2.0 |
| 4 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-04 | NaN |
| 5 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-05 | 2.0 |
| 6 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-10 | 3.0 |
| 7 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-11 | 1.0 |
| 8 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-12 | 3.0 |
| 9 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-14 | 2.0 |


### 5.2 Check date range

Rule of thumb: history should cover at least 2-3x the forecast horizon. For a 12-week forecast, aim for 36+ weeks.

In [17]:
# min date
daily_sales['ds'].min()

Timestamp('2011-01-29 00:00:00')

In [18]:
# max date
daily_sales['ds'].max()

Timestamp('2016-06-19 00:00:00')

In [19]:
# number of weeks
((daily_sales['ds'].max() - daily_sales['ds'].min()).days // 7) + 1

282
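
The rule of thumb can be turned into a quick check. This is a sketch only: the helper `has_enough_history` and its `multiple=3` default are hypothetical, and the 12-week horizon is the example figure from the text above.

```python
import pandas as pd

def has_enough_history(start, end, horizon_weeks, multiple=3):
    """Return (weeks of history, whether it covers multiple x horizon)."""
    n_weeks = (pd.Timestamp(end) - pd.Timestamp(start)).days // 7 + 1
    return n_weeks, n_weeks >= multiple * horizon_weeks

# Date range observed in this dataset, checked against a 12-week horizon
n_weeks, ok = has_enough_history("2011-01-29", "2016-06-19", horizon_weeks=12)
print(n_weeks, ok)  # 282 True
```

With 282 weeks available, the 36-week threshold is cleared with a wide margin at the dataset level; individual series may still be much shorter.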

### 5.3 Drop invalid dates

Null dates were already dropped in 4.2, so this cell is a no-op here. It is repeated as a guard because rows without a date can't be aggregated.

In [20]:
daily_sales = daily_sales.dropna(subset=['ds'])

### 5.4 Check for outlier dates

Look for dates before 1900, future dates, or outlier dates far from the main range.

In [21]:
unique_dates = (
    daily_sales['ds']
    .dropna()
    .drop_duplicates()
    .sort_values()
)

unique_dates.head(5), unique_dates.tail(5)

(0      2011-01-29
 2199   2011-01-30
 2200   2011-01-31
 1      2011-02-01
 2      2011-02-02
 Name: ds, dtype: datetime64[ns],
 3257   2016-06-15
 1037   2016-06-16
 1038   2016-06-17
 1039   2016-06-18
 1040   2016-06-19
 Name: ds, dtype: datetime64[ns])
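
Eyeballing the head and tail works for clean data. A programmatic check might look like the sketch below, where the toy dates and the `lower` and `upper` bounds are assumptions chosen for illustration (in practice `upper` would be the data extraction date):

```python
import pandas as pd

# Toy dates: two plausible values and two obvious outliers
dates = pd.Series(pd.to_datetime(["1899-12-31", "2011-01-29", "2016-06-19", "2050-01-01"]))

lower = pd.Timestamp("1900-01-01")   # nothing should predate this
upper = pd.Timestamp("2016-12-31")   # hypothetical "as of" date for the extract

# Flag anything outside the plausible window
suspect = dates[(dates < lower) | (dates > upper)]
print(suspect.dt.strftime("%Y-%m-%d").tolist())  # ['1899-12-31', '2050-01-01']
```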

### 5.5 Aggregate to weekly

Weekly granularity aligns with business planning and reduces daily noise.

In [22]:
# Subset hierarchy columns for group by

hierarchy_cols = ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'unique_id']
hierarchy_cols

['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'unique_id']

In [23]:
# Create week column (W-SAT = weeks ending Saturday, i.e., Sunday through Saturday).
# start_time labels each week by its first day, so every label falls on a Sunday.
daily_sales['week'] = daily_sales['ds'].dt.to_period('W-SAT').dt.start_time

In [24]:
# Validate frequency: the week labels are Sundays, so pandas reports 'W-SUN'
pd.infer_freq(daily_sales['week'].drop_duplicates().sort_values())

'W-SUN'
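
The `'W-SUN'` result can look like a mismatch with `'W-SAT'`, but it follows from labeling each week by its start. A small sketch on a toy date range (22 consecutive days beginning on the dataset's first date) illustrates this:

```python
import pandas as pd

# 22 consecutive days starting Saturday 2011-01-29
days = pd.Series(pd.date_range("2011-01-29", periods=22, freq="D"))

# Weeks end on Saturday; start_time gives the Sunday that opens each week
week = days.dt.to_period("W-SAT").dt.start_time

print(week.dt.day_name().unique().tolist())                     # ['Sunday']
print(week.iloc[0].date())                                      # 2011-01-23
print(pd.infer_freq(pd.DatetimeIndex(week.drop_duplicates())))  # W-SUN
```

Note that the first label (2011-01-23) precedes the first observation, because 2011-01-29 is the last day of its Sunday-to-Saturday week. That first week therefore contains only one day of data.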

In [25]:
# Aggregate: group by all id columns + week, sum the target
weekly_sales = (
    daily_sales.groupby(hierarchy_cols + ['week'], as_index=False, observed=True)
    ['y']
    .sum()
    .rename(columns={'week': 'ds'})
)
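
One behavior worth knowing when summing a target that contains NaN: by default, pandas skips NaN and returns 0 for a group where every value is missing. The sketch below uses invented data to show the difference that `min_count=1` makes; whether an all-missing week should be 0 or NaN is a modeling decision this notebook does not settle.

```python
import numpy as np
import pandas as pd

# Toy data: series A has one known value, series B is entirely missing
toy = pd.DataFrame({
    "unique_id": ["A", "A", "B", "B"],
    "week": pd.to_datetime(["2011-01-30"] * 4),
    "y": [2.0, np.nan, np.nan, np.nan],
})
grouped = toy.groupby(["unique_id", "week"], as_index=False)["y"]

default_sum = grouped.sum()             # all-NaN group becomes 0.0
strict_sum = grouped.sum(min_count=1)   # all-NaN group stays NaN

print(default_sum["y"].tolist())  # [2.0, 0.0]
print(strict_sum["y"].tolist())   # [2.0, nan]
```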

In [26]:
plot_distribution(
    weekly_sales,
    id_col='unique_id',
    value_col='ds',
    agg='nunique',
    bins=50,
    show_mean=True,
    style={"title": "Series Length Distribution (weeks)"},
)

We see an increasing trend: new series were added over time. This is common in retail as stores expand product assortments. We'll handle ragged start dates in data prep.
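
Ragged starts can also be inspected in tabular form. The following sketch profiles each series by its first week and number of observed weeks; the toy frame and the name `profile` are illustrative assumptions:

```python
import pandas as pd

# Toy weekly data: series B starts two weeks after series A
toy = pd.DataFrame({
    "unique_id": ["A"] * 4 + ["B"] * 2,
    "ds": pd.to_datetime([
        "2011-01-30", "2011-02-06", "2011-02-13", "2011-02-20",
        "2011-02-13", "2011-02-20",
    ]),
    "y": [5.0, 3.0, 4.0, 6.0, 1.0, 2.0],
})

# One row per series: when it starts and how many distinct weeks it has
profile = toy.groupby("unique_id")["ds"].agg(first_week="min", n_weeks="nunique")
print(profile)
```

Sorting `profile` by `first_week` gives a quick view of which series entered the assortment late.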

### 5.6 Compare daily vs weekly

Compare the same series at daily vs weekly granularity. How does aggregation affect the signal?

In [27]:
sample_id = daily_sales['unique_id'].iloc[0]
sample_id

'FOODS_1_001_CA_1'

In [28]:
plot_timeseries(
    daily_sales,
    id_col='unique_id', date_col='ds', value_col='y',
    ids=sample_id,
    style={'title': f'Daily - {sample_id}'},
)


In [29]:
plot_timeseries(
    weekly_sales,
    id_col='unique_id', date_col='ds', value_col='y',
    ids=sample_id,
    style={'title': f'Weekly - {sample_id}'},
)

### 5.7 Compare memory usage before and after aggregation

In [30]:
# before aggregating
daily_sales.memory_usage(deep=True).sum() / 1e6

804.744949

In [31]:
# after aggregating
weekly_sales.memory_usage(deep=True).sum() / 1e6

129.596487

### 5.8 Plot total volume by category/dept

In [32]:
plot_timeseries(
    weekly_sales,
    id_col='unique_id', date_col='ds', value_col='y',
    group_col='cat_id',
    agg='sum',
    mode='overlay',
    style={'title': 'Total Weekly Volume by Category'},
)

In [33]:
plot_timeseries(
    weekly_sales,
    id_col='unique_id', date_col='ds', value_col='y',
    group_col='dept_id',
    agg='sum',
    mode='overlay',
    wrap=3,
    style={'title': 'Total Weekly Volume by Department'},
)

## 6. Save Output

In [None]:
# Cache for downstream modules (create the output folder first if it doesn't exist)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
weekly_sales.to_parquet(OUTPUT_DIR / '1.06_first_contact_output.parquet', index=False)
