
# Walmart Department-Level Data Preparation Pipeline

This notebook documents the full transformation flow from the raw Walmart-provided files to the enriched modeling dataset used by Team 8. It recreates each stage so we can regenerate the final table (`data/processed/merged_data_model_ready_interactions.csv`) from scratch and understand every decision that shaped it.



## Objectives

1. Build the base store–department daily dataset from Walmart's raw delivery files.
2. Layer on external context (macro, labor, weather, search interest, calendars, tax holidays).
3. Remove redundant or low-utility features and document why each was removed.
4. Engineer department-aware interaction features to capture domain nuances (e.g., weather for sporting goods).
5. Persist intermediate artifacts and write the final modeling dataset for downstream notebooks.

> **Note:** Several auxiliary files under `data/external/` were downloaded or manually curated earlier. They are treated as inputs here so the notebook stays deterministic. Synthetic sources are clearly labeled.


## Imports & Configuration

In [10]:

import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', lambda x: f"{x:,.4f}")

BASE_DIR = Path.cwd()
DATA_DIR = BASE_DIR / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
EXTERNAL_DIR = DATA_DIR / 'external'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
EXTERNAL_DIR.mkdir(parents=True, exist_ok=True)



## 1. Load Core Walmart Data

We start from the original files provided by Walmart:

- `inbound_cases_team8.csv` — department-level inbound cases (with future horizon marked by missing `cases`).
- `trucks.csv` — planned truck arrivals per store/day.
- `stores_data.xlsx` — store attributes, including `state_name`, `market_area_nbr`, and `region_nbr`.
- `student mapping share.xlsx` — maps departments to student teams; we filter to Team 8 (departments 6, 9, 41, 67, 90).


In [11]:

# Raw files are read from the project root (BASE_DIR); RAW_DIR defined above is not used here.
cases_path = BASE_DIR / 'inbound_cases_team8.csv'
trucks_path = BASE_DIR / 'trucks.csv'
stores_path = BASE_DIR / 'stores_data.xlsx'
dept_path = BASE_DIR / 'student mapping share.xlsx'

cases = pd.read_csv(cases_path, parse_dates=['dt'])
trucks = pd.read_csv(trucks_path, parse_dates=['dt'])
stores = pd.read_excel(stores_path).rename(columns={'store_id5': 'store_id'})
dept_mapping = pd.read_excel(dept_path)

print('Cases shape:', cases.shape)
print('Trucks shape:', trucks.shape)
print('Stores shape:', stores.shape)
print('Dept mapping shape:', dept_mapping.shape)

cases.head()


Cases shape: (275000, 5)
Trucks shape: (52200, 3)
Stores shape: (100, 4)
Dept mapping shape: (57, 9)


Unnamed: 0,dept_id,store_id,dt,cases,student_group
0,6,10002,2025-02-20,63.0,8
1,6,10002,2025-02-28,56.0,8
2,6,10002,2025-02-19,62.0,8
3,90,10001,2025-02-25,67.0,8
4,9,10004,2025-02-07,53.0,8



## 2. Build Base Merged Dataset

Steps:

1. Filter the department mapping to the Team 8 assignments and keep descriptive columns.
2. Confirm the cases file already aligns with those departments.
3. Merge cases with trucks (`store_id`, `dt`), stores (`store_id`), and department metadata (`dept_id`).
4. Persist the baseline curated tables.
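
Step 2 is not enforced by the merge cell itself. A minimal sketch of one way to check it is shown here on small stand-in frames; the `*_demo` frames, their values, and the helper name `unexpected_departments` are illustrative assumptions, not part of the Walmart files or the original notebook.

```python
import pandas as pd

# Toy stand-ins for `cases` and `team8_depts`; values are illustrative only.
cases_demo = pd.DataFrame({
    "dept_id": [6, 9, 41, 67, 90, 6],
    "store_id": [10001, 10001, 10002, 10002, 10003, 10003],
})
team8_demo = pd.DataFrame({"dept_id": [6, 9, 41, 67, 90]})


def unexpected_departments(cases_df, dept_df):
    """Return the sorted dept_ids present in cases_df but absent from dept_df."""
    extra = set(cases_df["dept_id"].unique()) - set(dept_df["dept_id"].unique())
    return sorted(int(d) for d in extra)


extra_depts = unexpected_departments(cases_demo, team8_demo)
assert extra_depts == [], f"Cases contain unmapped departments: {extra_depts}"

# validate="many_to_one" makes pandas raise a MergeError if dept_id is
# duplicated on the right side, which would otherwise silently multiply rows.
checked = cases_demo.merge(
    team8_demo, on="dept_id", how="left", validate="many_to_one"
)
print(len(checked) == len(cases_demo))
```

On the real data, the same helper could be called as `unexpected_departments(cases, team8_depts)` before building `merged_base`.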


In [12]:

team8_depts = (
    dept_mapping[['dept_id', 'dept_desc', 'gmm_name', 'dmm_name', 'student_group']]
    .dropna(subset=['dept_id', 'student_group'])
    .loc[lambda df: df['student_group'] == 8]
)

print('Team 8 departments:', team8_depts['dept_id'].tolist())

merged_base = (
    cases
    .merge(trucks, on=['store_id', 'dt'], how='left', suffixes=('', '_truck'))
    .merge(stores, on='store_id', how='left')
    .merge(team8_depts[['dept_id', 'dept_desc', 'gmm_name', 'dmm_name']], on='dept_id', how='left')
)

print('Merged base shape:', merged_base.shape)
print('Date range:', merged_base['dt'].min().date(), '→', merged_base['dt'].max().date())
print('Missing cases (future horizon):', merged_base['cases'].isna().sum())

# Persist baseline outputs for reference
merged_base.to_csv(PROCESSED_DIR / 'merged_data.csv', index=False)
team8_depts.to_csv(PROCESSED_DIR / 'dept_reference.csv', index=False)
stores[['store_id', 'state_name', 'market_area_nbr', 'region_nbr']].to_csv(
    PROCESSED_DIR / 'stores_reference.csv', index=False
)

merged_base.head()


Team 8 departments: [41, 90, 67, 6, 9]
Merged base shape: (275000, 12)
Date range: 2024-03-14 → 2025-09-14
Missing cases (future horizon): 14000


Unnamed: 0,dept_id,store_id,dt,cases,student_group,trucks,state_name,market_area_nbr,region_nbr,dept_desc,gmm_name,dmm_name
0,6,10002,2025-02-20,63.0,8,2.0,NC,296,26,CAMERAS AND SUPPLIES,GENERAL MERCHANDISE,ENTERTAINMENT TOYS AND SEASONAL
1,6,10002,2025-02-28,56.0,8,2.0,NC,296,26,CAMERAS AND SUPPLIES,GENERAL MERCHANDISE,ENTERTAINMENT TOYS AND SEASONAL
2,6,10002,2025-02-19,62.0,8,2.0,NC,296,26,CAMERAS AND SUPPLIES,GENERAL MERCHANDISE,ENTERTAINMENT TOYS AND SEASONAL
3,90,10001,2025-02-25,67.0,8,3.0,MD,285,22,DAIRY,FOOD,CHILLED ALCOHOL AND CONVENIENCE
4,9,10004,2025-02-07,53.0,8,2.0,LA,66,13,SPORTING GOODS,GENERAL MERCHANDISE,HARDLINES



## 3. External Data Sources Overview

We enrich the base table with several external signals housed under `data/external/`:

| File | Source Type | Notes |
|------|-------------|-------|
| `fred_macro_series.csv` | **Real** (FRED download) | CPI (headline & food-at-home), consumer sentiment, initial jobless claims, fed funds rate, WTI crude |
| `state_unemployment_rates.csv` | **Real** (BLS LAUS API) | Monthly unemployment rates per state |
| `state_daily_weather.csv` | **Real** (Open-Meteo ERA5) | Daily max/min temp (°C) & precipitation (mm) per state capital |
| `google_trends_weekly.csv` | **Real** (Google Trends via `pytrends`) | Weekly US search interest for key department categories |
| `state_school_calendar_estimates.csv` | **Synthetic** | Approximate 2024–25 school start/end dates & back-to-school windows |
| `sports_events_major.csv` | **Synthetic** | Manually curated national sports milestones |
| `sales_tax_holidays.csv` | **Hybrid** | 2024 confirmed state holidays + projected 2025 repeats (distinguished by the `status` column) |

The notebook treats these as read-only inputs; we reprocess them to align with the daily store–dept grain.


## 4. Macro & Labor Indicators (Daily Alignment)

In [13]:

start_date = merged_base['dt'].min().normalize()
end_date = merged_base['dt'].max().normalize()

fred = pd.read_csv(EXTERNAL_DIR / 'fred_macro_series.csv', parse_dates=['observation_date'])
fred = fred.set_index('observation_date').sort_index().ffill()
fred_daily = fred.reindex(pd.date_range(fred.index.min(), end_date), method='ffill')
fred_daily = fred_daily.loc[start_date:]
fred_daily.index.name = 'dt'
fred_daily = fred_daily.reset_index()

fred_daily.to_csv(EXTERNAL_DIR / 'fred_macro_daily.csv', index=False)
fred_daily.head()


Unnamed: 0,dt,CPI_All_Items,CPI_FoodAtHome,Consumer_Sentiment,Initial_Jobless_Claims,Fed_Funds_Rate,Crude_Oil_Price
0,2024-03-14,312.107,325.455,79.4,213000.0,5.33,82.16
1,2024-03-15,312.107,325.455,79.4,213000.0,5.33,81.94
2,2024-03-16,312.107,325.455,79.4,213000.0,5.33,81.94
3,2024-03-17,312.107,325.455,79.4,213000.0,5.33,81.94
4,2024-03-18,312.107,325.455,79.4,213000.0,5.33,83.68
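
The daily alignment relies on `reindex(..., method='ffill')` to carry the most recent published value forward onto every calendar day. A small sketch with made-up monthly values (not actual FRED data) illustrates the behavior, including the fact that days before the first observation stay missing rather than being back-filled.

```python
import pandas as pd

# Made-up monthly observations; real FRED values are not reproduced here.
monthly = pd.Series(
    [100.0, 101.5],
    index=pd.to_datetime(["2024-03-01", "2024-04-01"]),
    name="demo_index",
)

# Daily calendar that starts before the first observation on purpose.
daily_index = pd.date_range("2024-02-28", "2024-04-02", freq="D")
daily = monthly.reindex(daily_index, method="ffill")

print(daily.loc[pd.Timestamp("2024-02-29")])  # missing: nothing to carry forward yet
print(daily.loc[pd.Timestamp("2024-03-15")])  # March value held until April arrives
print(daily.loc[pd.Timestamp("2024-04-02")])  # April value
```

This is why the cell above reindexes from the earliest observation date and only then slices to `start_date`: the first modeling day picks up the last value published before it.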


## 5. State Unemployment (Monthly → Daily)

In [14]:

unemp = pd.read_csv(EXTERNAL_DIR / 'state_unemployment_rates.csv', parse_dates=['date'])
date_index = pd.date_range(start_date, end_date)
state_frames = []
for state, grp in unemp.groupby('state'):
    filled = (
        grp.sort_values('date')
        .set_index('date')
        .reindex(date_index, method='ffill')
        .rename_axis('dt')
        .reset_index()
    )
    filled['state_name'] = state
    state_frames.append(filled[['dt', 'state_name', 'unemployment_rate']])
state_unemp_daily = pd.concat(state_frames, ignore_index=True)
state_unemp_daily.rename(columns={'unemployment_rate': 'state_unemployment_rate'}, inplace=True)
state_unemp_daily.to_csv(EXTERNAL_DIR / 'state_unemployment_daily.csv', index=False)
state_unemp_daily.head()


Unnamed: 0,dt,state_name,state_unemployment_rate
0,2024-03-14,AL,2.9
1,2024-03-15,AL,2.9
2,2024-03-16,AL,2.9
3,2024-03-17,AL,2.9
4,2024-03-18,AL,2.9


## 6. Weather Features

In [15]:

weather = pd.read_csv(EXTERNAL_DIR / 'state_daily_weather.csv', parse_dates=['dt'])
weather.head()


Unnamed: 0,dt,state_name,temp_max_c,temp_min_c,precip_mm
0,2024-03-14,AL,26.5,10.1,0.0
1,2024-03-15,AL,20.5,16.1,23.9
2,2024-03-16,AL,22.7,13.8,0.0
3,2024-03-17,AL,17.2,11.5,9.1
4,2024-03-18,AL,16.2,6.3,0.0


## 7. Google Trends (Weekly → Daily)

In [16]:

trends_weekly = pd.read_csv(EXTERNAL_DIR / 'google_trends_weekly.csv', parse_dates=['dt'])
trends_daily = (
    trends_weekly.set_index('dt').sort_index()
    .reindex(pd.date_range(trends_weekly['dt'].min(), end_date), method='ffill')
    .loc[start_date:]
    .rename_axis('dt')
    .reset_index()
)
trends_daily.to_csv(EXTERNAL_DIR / 'google_trends_daily.csv', index=False)
trends_daily.head()


Unnamed: 0,dt,trends_cameras,trends_sporting_goods,trends_team_sports,trends_party_supplies,trends_dairy_products
0,2024-03-14,48,67,6,3,1
1,2024-03-15,48,67,6,3,1
2,2024-03-16,48,67,6,3,1
3,2024-03-17,45,70,6,3,1
4,2024-03-18,45,70,6,3,1


## 8. Calendars & Events

In [17]:

school_cal = pd.read_csv(EXTERNAL_DIR / 'state_school_calendar_estimates.csv', parse_dates=['start_date','end_date','back_to_school_window_start','back_to_school_window_end'])
sports_events = pd.read_csv(EXTERNAL_DIR / 'sports_events_major.csv', parse_dates=['start_date','end_date'])
sales_tax = pd.read_csv(EXTERNAL_DIR / 'sales_tax_holidays.csv', parse_dates=['start_date','end_date'])

rows = []
for _, row in sales_tax.iterrows():
    start = max(row['start_date'].date(), start_date.date())
    end = min(row['end_date'].date(), end_date.date())
    for dt in pd.date_range(start, end):
        rows.append({
            'dt': dt,
            'state_name': row['state_name'],
            'sales_tax_event': row['event_name'],
            'sales_tax_category': row['category'],
            'sales_tax_status': row['status']
        })

sales_tax_daily = pd.DataFrame(rows)
sales_tax_daily['dt'] = pd.to_datetime(sales_tax_daily['dt'])
sales_tax_daily.to_csv(EXTERNAL_DIR / 'sales_tax_holidays_daily.csv', index=False)
sales_tax_daily.head()


Unnamed: 0,dt,state_name,sales_tax_event,sales_tax_category,sales_tax_status
0,2024-07-19,AL,Back to School Sales Tax Holiday,back_to_school,confirmed
1,2024-07-20,AL,Back to School Sales Tax Holiday,back_to_school,confirmed
2,2024-07-21,AL,Back to School Sales Tax Holiday,back_to_school,confirmed
3,2024-08-03,AR,Back to School Sales Tax Holiday,back_to_school,confirmed
4,2024-08-04,AR,Back to School Sales Tax Holiday,back_to_school,confirmed
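
Because the daily holiday table is later left-joined on (`dt`, `state_name`), two holidays that overlap in the same state on the same day would duplicate store–department rows. The sketch below shows one possible guard. The holiday names and dates are hypothetical and chosen to force an overlap, and `expand_to_daily` and `overlapping_days` are illustrative helper names, not functions from the notebook.

```python
import pandas as pd

# Hypothetical holiday table; names and dates are made up to force one overlap.
holidays_demo = pd.DataFrame({
    "state_name": ["AL", "AL", "AR"],
    "event_name": ["Holiday A", "Holiday B", "Holiday C"],
    "start_date": pd.to_datetime(["2024-07-19", "2024-07-21", "2024-08-03"]),
    "end_date": pd.to_datetime(["2024-07-21", "2024-07-22", "2024-08-04"]),
})


def expand_to_daily(holidays):
    """Return one row per (holiday, calendar day), inclusive of both endpoints."""
    frames = [
        pd.DataFrame({
            "dt": pd.date_range(row.start_date, row.end_date),
            "state_name": row.state_name,
            "event_name": row.event_name,
        })
        for row in holidays.itertuples(index=False)
    ]
    return pd.concat(frames, ignore_index=True)


def overlapping_days(daily):
    """Return rows whose (dt, state_name) key appears more than once."""
    return daily[daily.duplicated(subset=["dt", "state_name"], keep=False)]


daily_demo = expand_to_daily(holidays_demo)
overlaps = overlapping_days(daily_demo)
print(overlaps)
```

An empty result from `overlapping_days(sales_tax_daily)` would indicate that the later merge cannot inflate the row count.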


## 9. Assemble Enriched Dataset

In [18]:

merged = merged_base.copy()

# Merge macro & unemployment
merged = merged.merge(fred_daily, on='dt', how='left')
merged = merged.merge(state_unemp_daily, on=['dt','state_name'], how='left')

# Weather transformations
merged = merged.merge(weather, on=['dt','state_name'], how='left')
merged['temp_max_f'] = merged['temp_max_c'] * 9/5 + 32
merged['temp_min_f'] = merged['temp_min_c'] * 9/5 + 32
merged['temp_avg_f'] = (merged['temp_max_f'] + merged['temp_min_f']) / 2
merged['precip_in'] = merged['precip_mm'] / 25.4
merged['cooling_degree_days'] = (merged['temp_avg_f'] - 65).clip(lower=0)
merged['heating_degree_days'] = (65 - merged['temp_avg_f']).clip(lower=0)
# Drop the metric source columns now that the imperial versions exist
merged = merged.drop(columns=['temp_max_c','temp_min_c','precip_mm'], errors='ignore')

merged['is_weather_sensitive_sporting'] = merged['dept_id'].isin([9, 41]).astype(int)
merged['is_weather_sensitive_dairy'] = merged['dept_id'].eq(90).astype(int)
merged['is_weather_sensitive_celebration'] = merged['dept_id'].eq(67).astype(int)
merged['is_weather_sensitive_cameras'] = merged['dept_id'].eq(6).astype(int)
merged['hot_day_flag'] = ((merged['temp_max_f'] >= 90) & merged['is_weather_sensitive_sporting'].eq(1)).astype(int)
merged['cold_day_flag'] = ((merged['temp_min_f'] <= 32) & merged['is_weather_sensitive_dairy'].eq(1)).astype(int)
merged['storm_day_flag'] = ((merged['precip_in'] >= 0.5) & merged['is_weather_sensitive_sporting'].eq(1)).astype(int)

# Google Trends
merged = merged.merge(trends_daily, on='dt', how='left')
trend_cols = [col for col in merged.columns if col.startswith('trends_') and not col.endswith('_scaled')]
for col in trend_cols:
    merged[f'{col}_scaled'] = merged[col] / 100.0

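# School calendar flags (assumes one calendar row per state; a second row for
# the same state would overwrite the flags set by the first)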
merged['is_school_in_session'] = 0
merged['is_back_to_school_window'] = 0
for _, row in school_cal.iterrows():
    mask = merged['state_name'].eq(row['state_name'])
    merged.loc[mask, 'is_school_in_session'] = merged.loc[mask, 'dt'].between(row['start_date'], row['end_date']).astype(int)
    merged.loc[mask, 'is_back_to_school_window'] = merged.loc[mask, 'dt'].between(row['back_to_school_window_start'], row['back_to_school_window_end']).astype(int)

# Sports events overlay
merged['sports_event_flag'] = 0
merged['sports_event_name'] = ''
merged['sports_event_category'] = ''
for _, event in sports_events.iterrows():
    relevant_depts = []
    if 'Sporting Goods' in event['relevant_departments']:
        relevant_depts.append(9)
    if 'Team Sports' in event['relevant_departments']:
        relevant_depts.append(41)
    mask = merged['dept_id'].isin(relevant_depts) & merged['dt'].between(event['start_date'], event['end_date'])
    merged.loc[mask, 'sports_event_flag'] = 1
    merged.loc[mask, 'sports_event_name'] = event['event_name']
    merged.loc[mask, 'sports_event_category'] = event['sport_category']

# Sales tax holidays
merged = merged.merge(sales_tax_daily, on=['dt','state_name'], how='left')
merged['sales_tax_holiday_flag'] = merged['sales_tax_event'].notna().astype(int)
merged['sales_tax_holiday_status'] = merged['sales_tax_status'].fillna('none')
merged['sales_tax_holiday_category'] = merged['sales_tax_category'].fillna('none')
merged.drop(columns=['sales_tax_status','sales_tax_category'], inplace=True)

print('Enriched dataset shape:', merged.shape)
merged.head()


Enriched dataset shape: (275000, 51)


Unnamed: 0,dept_id,store_id,dt,cases,student_group,trucks,state_name,market_area_nbr,region_nbr,dept_desc,gmm_name,dmm_name,CPI_All_Items,CPI_FoodAtHome,Consumer_Sentiment,Initial_Jobless_Claims,Fed_Funds_Rate,Crude_Oil_Price,state_unemployment_rate,temp_max_f,temp_min_f,temp_avg_f,precip_in,cooling_degree_days,heating_degree_days,is_weather_sensitive_sporting,is_weather_sensitive_dairy,is_weather_sensitive_celebration,is_weather_sensitive_cameras,hot_day_flag,cold_day_flag,storm_day_flag,trends_cameras,trends_sporting_goods,trends_team_sports,trends_party_supplies,trends_dairy_products,trends_cameras_scaled,trends_sporting_goods_scaled,trends_team_sports_scaled,trends_party_supplies_scaled,trends_dairy_products_scaled,is_school_in_session,is_back_to_school_window,sports_event_flag,sports_event_name,sports_event_category,sales_tax_event,sales_tax_holiday_flag,sales_tax_holiday_status,sales_tax_holiday_category
0,6,10002,2025-02-20,63.0,8,2.0,NC,296,26,CAMERAS AND SUPPLIES,GENERAL MERCHANDISE,ENTERTAINMENT TOYS AND SEASONAL,319.775,333.45,64.7,224000.0,4.33,72.88,3.7,30.02,22.28,26.15,0.0197,0.0,38.85,0,0,0,1,0,0,0,64,26,6,3,1,0.64,0.26,0.06,0.03,0.01,1,0,0,,,,0,none,none
1,6,10002,2025-02-28,56.0,8,2.0,NC,296,26,CAMERAS AND SUPPLIES,GENERAL MERCHANDISE,ENTERTAINMENT TOYS AND SEASONAL,319.775,333.45,64.7,243000.0,4.33,69.97,3.7,61.16,42.8,51.98,0.0,0.0,13.02,0,0,0,1,0,0,0,44,26,7,4,2,0.44,0.26,0.07,0.04,0.02,1,0,0,,,,0,none,none
2,6,10002,2025-02-19,62.0,8,2.0,NC,296,26,CAMERAS AND SUPPLIES,GENERAL MERCHANDISE,ENTERTAINMENT TOYS AND SEASONAL,319.775,333.45,64.7,224000.0,4.33,72.58,3.7,35.42,23.9,29.66,0.2874,0.0,35.34,0,0,0,1,0,0,0,64,26,6,3,1,0.64,0.26,0.06,0.03,0.01,1,0,0,,,,0,none,none
3,90,10001,2025-02-25,67.0,8,3.0,MD,285,22,DAIRY,FOOD,CHILLED ALCOHOL AND CONVENIENCE,319.775,333.45,64.7,243000.0,4.33,69.15,3.0,58.1,37.76,47.93,0.0,0.0,17.07,0,1,0,0,0,0,0,44,26,7,4,2,0.44,0.26,0.07,0.04,0.02,1,0,0,,,,0,none,none
4,9,10004,2025-02-07,53.0,8,2.0,LA,66,13,SPORTING GOODS,GENERAL MERCHANDISE,HARDLINES,319.775,333.45,64.7,222000.0,4.33,71.32,4.4,79.7,64.4,72.05,0.0,7.05,0.0,1,0,0,0,0,0,0,60,25,6,4,2,0.6,0.25,0.06,0.04,0.02,1,0,0,,,,0,none,none
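
The weather block converts Celsius to Fahrenheit and derives degree days against a 65 °F base. As a quick arithmetic check, the sketch below recomputes those values for a single day. The Celsius inputs (-1.1 and -5.4) are back-calculated from the Fahrenheit columns of the first output row (30.02 and 22.28), so they are an assumption rather than values read from the weather file; the helper names are also illustrative.

```python
def c_to_f(temp_c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return temp_c * 9 / 5 + 32


def degree_days(temp_max_c, temp_min_c, base_f=65.0):
    """Return (avg_f, cooling_degree_days, heating_degree_days) for one day."""
    avg_f = (c_to_f(temp_max_c) + c_to_f(temp_min_c)) / 2
    cooling = max(avg_f - base_f, 0.0)   # degrees above the base
    heating = max(base_f - avg_f, 0.0)   # degrees below the base
    return avg_f, cooling, heating


# Assumed Celsius inputs, back-calculated from the first output row.
avg_f, cdd, hdd = degree_days(-1.1, -5.4)
print(round(avg_f, 2), round(cdd, 2), round(hdd, 2))
```

At most one of the two degree-day values is non-zero on any given day, which matches the `cooling_degree_days` and `heating_degree_days` columns in the output above.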


## 10. Remove Redundant / Low-Value Columns

In [19]:

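columns_to_drop = [
    'student_group',           # team identifier, not a modeling feature
    # Raw Google Trends columns: the *_scaled versions carry the same signal
    'trends_cameras',
    'trends_sporting_goods',
    'trends_team_sports',
    'trends_party_supplies',
    'trends_dairy_products'
]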
existing = [col for col in columns_to_drop if col in merged.columns]
merged.drop(columns=existing, inplace=True)
print('Columns now:', len(merged.columns))


Columns now: 45


## 11. Engineer Interaction Features

In [20]:

groups = {
    'sporting': merged['is_weather_sensitive_sporting'],
    'dairy': merged['is_weather_sensitive_dairy'],
    'celebration': merged['is_weather_sensitive_celebration'],
    'cameras': merged['is_weather_sensitive_cameras']
}

merged['cdd_sporting'] = merged['cooling_degree_days'] * groups['sporting']
merged['cdd_dairy'] = merged['cooling_degree_days'] * groups['dairy']
merged['hdd_dairy'] = merged['heating_degree_days'] * groups['dairy']
merged['precip_sporting'] = merged['precip_in'] * groups['sporting']
merged['precip_celebration'] = merged['precip_in'] * groups['celebration']

trend_map = {
    'trend_sporting_interaction': ('trends_sporting_goods_scaled', groups['sporting']),
    'trend_team_sports_interaction': ('trends_team_sports_scaled', groups['sporting']),
    'trend_party_interaction': ('trends_party_supplies_scaled', groups['celebration']),
    'trend_dairy_interaction': ('trends_dairy_products_scaled', groups['dairy']),
    'trend_cameras_interaction': ('trends_cameras_scaled', groups['cameras'])
}
for new_col, (trend_col, mask_series) in trend_map.items():
    merged[new_col] = merged[trend_col] * mask_series

merged['bts_sporting_flag'] = merged['is_back_to_school_window'] * groups['sporting']
merged['bts_celebration_flag'] = merged['is_back_to_school_window'] * groups['celebration']
merged['sports_event_sporting_flag'] = merged['sports_event_flag'] * groups['sporting']
merged['hot_back_to_school_flag'] = merged['hot_day_flag'] * merged['is_back_to_school_window'] * groups['sporting']

sales_tax_bts = ((merged['sales_tax_holiday_category'] == 'back_to_school') & (merged['sales_tax_holiday_flag'] == 1)).astype(int)
merged['sales_tax_back_to_school_flag'] = sales_tax_bts
merged['sales_tax_bts_sporting'] = sales_tax_bts * groups['sporting']

merged['cpi_food_gap'] = merged['CPI_FoodAtHome'] - merged['CPI_All_Items']

print('Final column count:', len(merged.columns))
merged.head()


Final column count: 62


Unnamed: 0,dept_id,store_id,dt,cases,trucks,state_name,market_area_nbr,region_nbr,dept_desc,gmm_name,dmm_name,CPI_All_Items,CPI_FoodAtHome,Consumer_Sentiment,Initial_Jobless_Claims,Fed_Funds_Rate,Crude_Oil_Price,state_unemployment_rate,temp_max_f,temp_min_f,temp_avg_f,precip_in,cooling_degree_days,heating_degree_days,is_weather_sensitive_sporting,is_weather_sensitive_dairy,is_weather_sensitive_celebration,is_weather_sensitive_cameras,hot_day_flag,cold_day_flag,storm_day_flag,trends_cameras_scaled,trends_sporting_goods_scaled,trends_team_sports_scaled,trends_party_supplies_scaled,trends_dairy_products_scaled,is_school_in_session,is_back_to_school_window,sports_event_flag,sports_event_name,sports_event_category,sales_tax_event,sales_tax_holiday_flag,sales_tax_holiday_status,sales_tax_holiday_category,cdd_sporting,cdd_dairy,hdd_dairy,precip_sporting,precip_celebration,trend_sporting_interaction,trend_team_sports_interaction,trend_party_interaction,trend_dairy_interaction,trend_cameras_interaction,bts_sporting_flag,bts_celebration_flag,sports_event_sporting_flag,hot_back_to_school_flag,sales_tax_back_to_school_flag,sales_tax_bts_sporting,cpi_food_gap
0,6,10002,2025-02-20,63.0,2.0,NC,296,26,CAMERAS AND SUPPLIES,GENERAL MERCHANDISE,ENTERTAINMENT TOYS AND SEASONAL,319.775,333.45,64.7,224000.0,4.33,72.88,3.7,30.02,22.28,26.15,0.0197,0.0,38.85,0,0,0,1,0,0,0,0.64,0.26,0.06,0.03,0.01,1,0,0,,,,0,none,none,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.64,0,0,0,0,0,0,13.675
1,6,10002,2025-02-28,56.0,2.0,NC,296,26,CAMERAS AND SUPPLIES,GENERAL MERCHANDISE,ENTERTAINMENT TOYS AND SEASONAL,319.775,333.45,64.7,243000.0,4.33,69.97,3.7,61.16,42.8,51.98,0.0,0.0,13.02,0,0,0,1,0,0,0,0.44,0.26,0.07,0.04,0.02,1,0,0,,,,0,none,none,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.44,0,0,0,0,0,0,13.675
2,6,10002,2025-02-19,62.0,2.0,NC,296,26,CAMERAS AND SUPPLIES,GENERAL MERCHANDISE,ENTERTAINMENT TOYS AND SEASONAL,319.775,333.45,64.7,224000.0,4.33,72.58,3.7,35.42,23.9,29.66,0.2874,0.0,35.34,0,0,0,1,0,0,0,0.64,0.26,0.06,0.03,0.01,1,0,0,,,,0,none,none,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.64,0,0,0,0,0,0,13.675
3,90,10001,2025-02-25,67.0,3.0,MD,285,22,DAIRY,FOOD,CHILLED ALCOHOL AND CONVENIENCE,319.775,333.45,64.7,243000.0,4.33,69.15,3.0,58.1,37.76,47.93,0.0,0.0,17.07,0,1,0,0,0,0,0,0.44,0.26,0.07,0.04,0.02,1,0,0,,,,0,none,none,0.0,0.0,17.07,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0,0,0,0,0,0,13.675
4,9,10004,2025-02-07,53.0,2.0,LA,66,13,SPORTING GOODS,GENERAL MERCHANDISE,HARDLINES,319.775,333.45,64.7,222000.0,4.33,71.32,4.4,79.7,64.4,72.05,0.0,7.05,0.0,1,0,0,0,0,0,0,0.6,0.25,0.06,0.04,0.02,1,0,0,,,,0,none,none,7.05,0.0,0.0,0.0,0.0,0.25,0.06,0.0,0.0,0.0,0,0,0,0,0,0,13.675


## 12. Persist Final Modeling Dataset

In [21]:

final_path = PROCESSED_DIR / 'merged_data_model_ready.csv'
final_interactions_path = PROCESSED_DIR / 'merged_data_model_ready_interactions.csv'

interaction_cols = [
    'cdd_sporting', 'cdd_dairy', 'hdd_dairy', 'precip_sporting', 'precip_celebration',
    'trend_sporting_interaction', 'trend_team_sports_interaction', 'trend_party_interaction',
    'trend_dairy_interaction', 'trend_cameras_interaction', 'bts_sporting_flag',
    'bts_celebration_flag', 'sports_event_sporting_flag', 'hot_back_to_school_flag',
    'sales_tax_back_to_school_flag', 'sales_tax_bts_sporting', 'cpi_food_gap'
]

base_columns = [col for col in merged.columns if col not in interaction_cols]
merged[base_columns].to_csv(final_path, index=False)
merged.to_csv(final_interactions_path, index=False)

print('Saved base modeling dataset to:', final_path)
print('Saved interaction dataset to:', final_interactions_path)
print('Final dataset shape:', merged.shape)
print('Date range:', merged['dt'].min().date(), '→', merged['dt'].max().date())


Saved base modeling dataset to: /Users/chanamalluvinay/Documents/wmt_proj/data/processed/merged_data_model_ready.csv
Saved interaction dataset to: /Users/chanamalluvinay/Documents/wmt_proj/data/processed/merged_data_model_ready_interactions.csv
Final dataset shape: (275000, 62)
Date range: 2024-03-14 → 2025-09-14


## 13. Sanity Checks

In [22]:

key_dupes = merged.duplicated(subset=['store_id','dept_id','dt']).sum()
print('Duplicate store-dept-day rows:', key_dupes)
print('Null overview in new columns:')
base_cols = ['dept_id','store_id','dt','cases','trucks','state_name','market_area_nbr','region_nbr','dept_desc','gmm_name','dmm_name']
new_columns = [col for col in merged.columns if col not in base_cols]
print(merged[new_columns].isnull().sum().sort_values(ascending=False).head(10))


Duplicate store-dept-day rows: 0
Null overview in new columns:
sales_tax_event               272380
CPI_All_Items                      0
precip_celebration                 0
sports_event_name                  0
sports_event_category              0
sales_tax_holiday_flag             0
sales_tax_holiday_status           0
sales_tax_holiday_category         0
cdd_sporting                       0
cdd_dairy                          0
dtype: int64
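
Two further checks may be worth running alongside the ones above: that flag columns contain only 0 and 1, and that an interaction column is zero for departments outside its group. The sketch below demonstrates both on a toy frame. The values are illustrative, and `non_binary_flags` and `leaks_outside_group` are hypothetical helper names, not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking a few columns of `merged`; values are illustrative only.
demo = pd.DataFrame({
    "dept_id": [9, 41, 90, 6],
    "cooling_degree_days": [7.05, 0.0, 3.2, 1.0],
    "is_weather_sensitive_sporting": [1, 1, 0, 0],
})
demo["cdd_sporting"] = (
    demo["cooling_degree_days"] * demo["is_weather_sensitive_sporting"]
)


def non_binary_flags(df, flag_cols):
    """Return the flag columns that hold any value other than 0 or 1."""
    return [col for col in flag_cols if not df[col].isin([0, 1]).all()]


def leaks_outside_group(df, interaction_col, group_col):
    """True if the interaction is non-zero on rows where the group flag is 0."""
    outside = df.loc[df[group_col] == 0, interaction_col]
    return bool((outside != 0).any())


print(non_binary_flags(demo, ["is_weather_sensitive_sporting"]))
print(leaks_outside_group(demo, "cdd_sporting", "is_weather_sensitive_sporting"))
```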



## Summary

- Recreated the original merged dataset (275k rows × 12 columns) and saved the supporting reference tables.
- Added macroeconomic, labor, weather, Google Trends, school calendar, sports event, and sales tax holiday signals.
- Removed the redundant `student_group` column and the raw Google Trends columns, keeping the scaled versions for modeling.
- Engineered targeted interaction features to capture department-specific sensitivities.
- Persisted both the base modeling dataset and the interaction-enhanced version, ensuring they can be regenerated at any time.

This notebook is the single source of truth for preparing the Team 8 modeling dataset.
