# Module 1.6: First Contact with the Data

> **Goal:** Run first-contact checks to confirm the data supports the 5Q Framework, then clean and aggregate to weekly.

## 1. Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import tsforge as tsf

import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')

DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

print("✓ Setup complete")

✓ Setup complete


## 2. Load Data

`messify=True` simulates real-world data issues (string dtypes, NaN injection, duplicates).

In [22]:
df = tsf.load_m5(
    DATA_DIR,
    messify=True,
    messify_kwargs={
        'random_state': 42,
        'zero_to_na_pct': 0.30,
        'add_duplicates': True,
        'n_duplicates': 150,
        'corrupt_dtypes': True,
        'drop_na_frac': 0.10,  # Drop 10% of NA rows to simulate incomplete data
        'cache_dir': DATA_DIR
    },
    include_hierarchy=True,
    verbose=True
)

LOADING M5 DATA
✓ M5 cache detected. Loading from local files...
✓ Loaded in 1.1s
  Shape: 47,649,940 rows × 3 columns
  Memory: 638.4 MB
  Columns: unique_id, ds, y
  Returning: Y_df, X_df, S_df (all 3 dataframes)

🔧 Applying messification...
LOADING CACHED MESSIFIED DATA

📁 Cache file: m5_messy_n30490_rs42_zna30_dup150_dtype1_rmv2_dropna10.parquet
   Using cached version (skip messification)

💡 To regenerate: set force_refresh=True

✓ Loaded 46,250,639 rows × 3 columns
  Expanding hierarchy via S_df merge...
  ✓ Added hierarchy columns: ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']

LOAD COMPLETE
  Shape: 46,250,639 rows × 7 columns
  Columns: ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'ds', 'y']
  Applied: messified, hierarchy columns


## 3. Pre-Aggregation Checks (Daily Data)

Fix these issues BEFORE aggregating; otherwise they corrupt the weekly rollup.

### 3.1 Fix Data Types & Handle NAs

Messification converts `ds` and `y` to strings with literal `"nan"` values. Convert back to proper types.

In [23]:
df[['ds', 'y']].dtypes

ds    object
y     object
dtype: object

`errors='coerce'` converts unparseable values (including string `"nan"`) to proper NaN.

In [24]:
df['ds'] = pd.to_datetime(df['ds'])
df['y'] = pd.to_numeric(df['y'], errors='coerce')
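
To see the coercion on a small scale, here is a minimal sketch using made-up values (the toy series are assumptions, not M5 data). With `errors='coerce'`, unparseable strings become `NaN` for numbers and `NaT` for dates instead of raising an error.

```python
import pandas as pd

# Toy values (hypothetical), mimicking string-typed columns after messification
raw_y = pd.Series(["3", "0", "nan", "abc"])
raw_ds = pd.Series(["2011-01-29", "2011-01-30", "not a date"])

# Unparseable entries are replaced by NaN / NaT rather than raising
y = pd.to_numeric(raw_y, errors="coerce")
ds = pd.to_datetime(raw_ds, errors="coerce")

print(y.isna().tolist())   # which numeric entries ended up missing
print(ds.isna().tolist())  # which date entries ended up missing
```

The notebook's `pd.to_datetime` call does not pass `errors='coerce'`, so it would raise on an unparseable date; that is a reasonable choice when bad dates should stop the pipeline rather than pass silently.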

Check remaining NAs. We'll handle imputation in Module 1.10 after filling gaps.

In [25]:
df.isna().sum()

item_id           0
dept_id           0
cat_id            0
store_id          0
state_id          0
ds                0
y           7495875
dtype: int64

### 3.2 Check for Null IDs

Rows with null ID columns can't be properly aggregated.

In [26]:
# Check for null values in ID columns
col_ids = [c for c in df.columns if c not in ['ds', 'y']]
df[col_ids].isna().sum()

item_id     0
dept_id     0
cat_id      0
store_id    0
state_id    0
dtype: int64

### 3.3 Check for Weird Dates

Look for dates before 1900, future dates, or outlier dates far from the main range.

In [27]:
unique_dates = (
    df['ds']
    .dropna()
    .drop_duplicates()
    .sort_values()
)

unique_dates.head(5), unique_dates.tail(5)


(0   2011-01-29
 1   2011-01-30
 2   2011-01-31
 3   2011-02-01
 4   2011-02-02
 Name: ds, dtype: datetime64[ns],
 1915   2016-06-15
 1916   2016-06-16
 1917   2016-06-17
 1918   2016-06-18
 1919   2016-06-19
 Name: ds, dtype: datetime64[ns])
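
Eyeballing the head and tail works here because the range is clean. For messier data, a bounds check along the following lines may help. This is only a sketch: the helper name `flag_suspicious_dates` and its default bounds are hypothetical, not part of `tsforge`.

```python
import pandas as pd

def flag_suspicious_dates(dates, lower="1900-01-01", upper=None):
    """Return the dates that fall outside [lower, upper]; upper defaults to today."""
    lower = pd.Timestamp(lower)
    upper = pd.Timestamp.today().normalize() if upper is None else pd.Timestamp(upper)
    dates = pd.Series(pd.to_datetime(dates)).dropna()
    return dates[(dates < lower) | (dates > upper)]

# Toy input (hypothetical): one date before 1900 and one far in the future
toy = ["1899-12-31", "2011-01-29", "2016-06-19", "2099-01-01"]
bad = flag_suspicious_dates(toy, upper="2016-12-31")
print(bad.dt.strftime("%Y-%m-%d").tolist())
```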

### 3.4 Data Info & Memory

In [28]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46250639 entries, 0 to 46250638
Data columns (total 7 columns):
 #   Column    Dtype         
---  ------    -----         
 0   item_id   category      
 1   dept_id   category      
 2   cat_id    category      
 3   store_id  category      
 4   state_id  category      
 5   ds        datetime64[ns]
 6   y         float64       
dtypes: category(5), datetime64[ns](1), float64(1)
memory usage: 970.6 MB


### 3.5 Check for Duplicates

A duplicate is a row whose ID and date columns match another row, regardless of the target value. Remove duplicates before aggregating; otherwise the weekly sum double-counts them.

In [29]:
df.columns

Index(['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'ds', 'y'], dtype='object')

In [30]:
# Check for duplicates on all columns except target
non_target_cols = [c for c in df.columns if c != 'y']
dup_mask = df.duplicated(subset=non_target_cols, keep=False)

In [31]:
dup_mask.value_counts()

False    46250357
True          282
Name: count, dtype: int64

In [32]:
df = df.drop_duplicates(subset=non_target_cols)
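
Note that `keep=False` flags every copy of a duplicated key, including the first occurrence, so the number of flagged rows is larger than the number of rows `drop_duplicates` actually removes. A small sketch on toy data (the values are made up for illustration):

```python
import pandas as pd

# Toy frame (hypothetical): item A has two rows for the same date
toy = pd.DataFrame({
    "item_id": ["A", "A", "A", "B"],
    "ds": pd.to_datetime(["2011-01-29", "2011-01-29", "2011-01-30", "2011-01-29"]),
    "y": [3.0, 5.0, 1.0, 2.0],
})
key_cols = ["item_id", "ds"]

all_copies = toy.duplicated(subset=key_cols, keep=False)  # flags every copy
extra_copies = toy.duplicated(subset=key_cols)            # flags only repeats after the first

# Default keep='first': the first occurrence (y=3.0) survives
deduped = toy.drop_duplicates(subset=key_cols)
print(all_copies.sum(), extra_copies.sum(), len(deduped))
```

When duplicated keys carry different `y` values, as in the toy frame, keeping the first row is an arbitrary choice; whether that is acceptable depends on how the duplicates arose.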

### 3.6 Daily Data Summary

Confirm daily data looks correct before aggregating.

In [33]:
df.head(10)

| | item_id | dept_id | cat_id | store_id | state_id | ds | y |
|---|---------|---------|--------|----------|----------|----|---|
| 0 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | 2011-01-29 | 3.0 |
| 1 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | 2011-01-30 | 0.0 |
| 2 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | 2011-01-31 | 0.0 |
| 3 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | 2011-02-01 | 1.0 |
| 4 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | 2011-02-02 | 4.0 |
| 5 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | 2011-02-03 | 2.0 |
| 6 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | 2011-02-04 | NaN |
| 7 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | 2011-02-05 | 2.0 |
| 8 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | 2011-02-06 | 0.0 |
| 9 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | 2011-02-07 | 0.0 |


## 4. Aggregate to Weekly

Weekly granularity aligns with business planning and reduces daily noise.

Group by all non-target, non-date columns and sum the target.

In [34]:
# Get all columns except ds and y
group_cols = [c for c in df.columns if c not in ['ds', 'y']]

# Create week column
# W-SAT = weeks ending Saturday; start_time is the Sunday that opens each week
df['week'] = df['ds'].dt.to_period('W-SAT').dt.start_time
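
To make the `W-SAT` mapping concrete, the sketch below buckets a short run of days starting at the first date in the data, 2011-01-29 (a Saturday), and counts how many days land in each week. The date range is chosen for illustration only.

```python
import pandas as pd

# Nine consecutive days starting on the first date in the daily data
days = pd.Series(pd.date_range("2011-01-29", "2011-02-06", freq="D"))

# Same transformation as in the notebook: week label = Sunday that opens the week
week_start = days.dt.to_period("W-SAT").dt.start_time

# Number of daily rows that fall into each weekly bucket
days_per_week = week_start.value_counts().sort_index()
print(days_per_week)
```

Because the daily data begins on a Saturday, the first weekly bucket (labelled 2011-01-23) holds a single day. The last daily date, 2016-06-19, is itself a week start, which suggests the final weekly total also covers only one day. Partial edge weeks like these are worth keeping in mind when reading the weekly series.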

In [35]:
# Aggregate: group by all id columns + week, sum the target
df = (
    df.groupby(group_cols + ['week'], as_index=False, observed=True)
    ['y']
    .sum()
    .rename(columns={'week': 'ds'})
)

df.head(10)

| | item_id | dept_id | cat_id | store_id | state_id | ds | y |
|---|---------|---------|--------|----------|----------|----|---|
| 0 | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2013-07-14 | 1.0 |
| 1 | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2013-07-21 | 0.0 |
| 2 | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2013-07-28 | 2.0 |
| 3 | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2013-08-04 | 2.0 |
| 4 | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2013-08-11 | 6.0 |
| 5 | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2013-08-18 | 1.0 |
| 6 | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2013-08-25 | 2.0 |
| 7 | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2013-09-01 | 5.0 |
| 8 | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2013-09-08 | 1.0 |
| 9 | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2013-09-15 | 5.0 |
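
One detail of the aggregation worth knowing: a pandas groupby `.sum()` skips `NaN` values, and a group made up entirely of `NaN` sums to `0.0` by default. Passing `min_count=1` keeps such groups as `NaN` instead. The sketch below uses toy data (an assumption, not M5 rows) to show the difference; which behavior is appropriate depends on whether missing days should count as zero sales.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): item A has one missing day, item B has only missing days
toy = pd.DataFrame({
    "item_id": ["A", "A", "B", "B"],
    "week": pd.to_datetime(["2011-01-23"] * 4),
    "y": [2.0, np.nan, np.nan, np.nan],
})

# Default: NaN values are skipped, so an all-NaN group sums to 0.0
default_sum = toy.groupby(["item_id", "week"], as_index=False)["y"].sum()

# min_count=1: a group needs at least one non-NaN value, otherwise the result is NaN
strict_sum = toy.groupby(["item_id", "week"], as_index=False)["y"].sum(min_count=1)

print(default_sum["y"].tolist())
print(strict_sum["y"].tolist())
```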


### 4.1 Memory after aggregation

In [36]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6848061 entries, 0 to 6848060
Data columns (total 7 columns):
 #   Column    Dtype         
---  ------    -----         
 0   item_id   category      
 1   dept_id   category      
 2   cat_id    category      
 3   store_id  category      
 4   state_id  category      
 5   ds        datetime64[ns]
 6   y         float64       
dtypes: category(5), datetime64[ns](1), float64(1)
memory usage: 143.9 MB


### 4.2 Date Range

As a rule of thumb, the history should span at least 2-3x the forecast horizon for patterns to be meaningful. For a 12-week forecast, that means roughly 24-36 weeks at minimum.

In [37]:
# min date
df['ds'].min()

Timestamp('2011-01-23 00:00:00')

In [38]:
# max date
df['ds'].max()

Timestamp('2016-06-19 00:00:00')

In [39]:
# number of weeks
((df['ds'].max() - df['ds'].min()).days // 7) + 1

283
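
The rule of thumb can be checked directly against the date range above. A minimal sketch, with the 12-week horizon taken from the text and the two timestamps copied from the outputs above:

```python
import pandas as pd

horizon_weeks = 12                      # forecast horizon from the text
first_week = pd.Timestamp("2011-01-23")  # min date from the output above
last_week = pd.Timestamp("2016-06-19")   # max date from the output above

# Inclusive count of weekly periods between the first and last week
n_weeks = (last_week - first_week).days // 7 + 1

# 2x is the bare minimum, 3x the more comfortable target
min_needed = 2 * horizon_weeks
comfortable = 3 * horizon_weeks
print(n_weeks, min_needed, comfortable, n_weeks >= comfortable)
```

With 283 weeks of history, the data clears the 36-week target by a wide margin.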

## 5. First Contact Summary

Run all checks with a single function call, `first_contact_check_simple()`, from `tsforge`.

In [None]:
tsf.first_contact_check_simple(df)

FIRST CONTACT CHECK
✓ Required columns present (ds, y)
✓ ds is datetime
✓ y is numeric
✓ No NAs in ds
✓ No NAs in ID columns
ℹ 0 NAs in y (will impute in Module 1.10)
✓ No impossible dates
✓ No duplicates

Summary:
  Shape: 6,848,061 rows × 7 columns
  Series: 30,490
  Date range: 2011-01-23 to 2016-06-19
  Unique dates: 283
  Memory: 143.9 MB

✓ ALL CHECKS PASSED


True
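
For readers without `tsforge` at hand, several of the checks in the report above can be approximated in plain pandas. The sketch below is a hypothetical stand-in, not the library's implementation: the function name `basic_first_contact` and its return format are assumptions, and it covers only a subset of the checks (no impossible-date test, no summary).

```python
import pandas as pd
from pandas.api import types as ptypes

def basic_first_contact(df, date_col="ds", target_col="y"):
    """Hypothetical stand-in for some of the checks above; not tsforge's code."""
    results = {"required_columns": {date_col, target_col}.issubset(df.columns)}
    if not results["required_columns"]:
        return results  # remaining checks need both columns

    # Everything that is not the date or the target is treated as an ID column
    id_cols = [c for c in df.columns if c not in (date_col, target_col)]
    results["ds_is_datetime"] = bool(ptypes.is_datetime64_any_dtype(df[date_col]))
    results["y_is_numeric"] = bool(ptypes.is_numeric_dtype(df[target_col]))
    results["no_na_in_ds"] = not df[date_col].isna().any()
    results["no_na_in_ids"] = not df[id_cols].isna().any().any()
    results["no_duplicates"] = not df.duplicated(subset=id_cols + [date_col]).any()
    return results

# Toy weekly frame (hypothetical values)
toy = pd.DataFrame({
    "item_id": ["A", "A", "B"],
    "ds": pd.to_datetime(["2011-01-23", "2011-01-30", "2011-01-23"]),
    "y": [1.0, 0.0, 4.0],
})
report = basic_first_contact(toy)
print(report)
print(all(report.values()))
```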

## 6. Save Output

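Save the cleaned weekly data for Module 1.7. The weekly sum skips NaN values, so `y` contains no NAs at this point (as the check above reports); gap-filling and imputation happen in Module 1.10.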

In [44]:
# Save cleaned weekly data
output_path = DATA_DIR / '1_6_output.parquet'
df.to_parquet(output_path, index=False)

print(f"✓ Saved to {output_path}")
print(f"  Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"  Columns: {list(df.columns)}")

✓ Saved to data/1_6_output.parquet
  Shape: 6,848,061 rows × 7 columns
  Columns: ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'ds', 'y']


## 7. Next Steps

| Module | Focus |
|--------|-------|
| **1.7** | Understand M5 structure (hierarchy, calendar, prices) |
| **1.8** | Diagnostics (seasonality, volatility, trend) |
| **1.9** | Portfolio analysis with GenAI |
| **1.10** | Data preparation (fill gaps, calendar merge, imputation) |
| **1.11** | Plotting & visual diagnostics |