
# Train / Validation / Test Splits with Time-Series Features

We partition the historical portion of `data/processed/merged_data_model_ready_interactions.csv` (2024-03-14 → 2025-08-17) into train/validation/test folds. Rows from 2025-08-18 → 2025-09-14 remain untouched as the final forecast horizon.

**Default cutoffs (adjust as needed):**
- Train: 2024-03-14 → 2025-05-31
- Validation: 2025-06-01 → 2025-06-30
- Test: 2025-07-01 → 2025-08-17
- Holdout: 2025-08-18 → 2025-09-14 (not used for fitting/eval; reserved for final scoring)


## Imports & Parameters

In [1]:

import pandas as pd
from pathlib import Path

BASE_DIR = Path.cwd()
DATA_PATH = BASE_DIR / 'data' / 'processed' / 'merged_data_model_ready_interactions.csv'
OUTPUT_DIR = BASE_DIR / 'data' / 'processed' / 'timeseries_splits'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

TRAIN_END = pd.Timestamp('2025-05-31')
VAL_END = pd.Timestamp('2025-06-30')
TEST_END = pd.Timestamp('2025-08-17')

print('Train end:', TRAIN_END.date())
print('Validation end:', VAL_END.date())
print('Test end:', TEST_END.date())


Train end: 2025-05-31
Validation end: 2025-06-30
Test end: 2025-08-17


## Load Final Dataset

In [2]:

df = pd.read_csv(DATA_PATH, parse_dates=['dt'], low_memory=False)
print('Shape:', df.shape)
print('Date range:', df['dt'].min().date(), '→', df['dt'].max().date())
print('Departments:', df['dept_id'].unique().tolist())
df.head()


Shape: (275000, 62)
Date range: 2024-03-14 → 2025-09-14
Departments: [6, 90, 9, 41, 67]


| | dept_id | store_id | dt | cases | trucks | state_name | market_area_nbr | region_nbr | dept_desc | gmm_name | ... | trend_party_interaction | trend_dairy_interaction | trend_cameras_interaction | bts_sporting_flag | bts_celebration_flag | sports_event_sporting_flag | hot_back_to_school_flag | sales_tax_back_to_school_flag | sales_tax_bts_sporting | cpi_food_gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 10002 | 2025-02-20 | 63.0 | 2.0 | NC | 296 | 26 | CAMERAS AND SUPPLIES | GENERAL MERCHANDISE | ... | 0.0 | 0.0 | 0.64 | 0 | 0 | 0 | 0 | 0 | 0 | 13.675 |
| 1 | 6 | 10002 | 2025-02-28 | 56.0 | 2.0 | NC | 296 | 26 | CAMERAS AND SUPPLIES | GENERAL MERCHANDISE | ... | 0.0 | 0.0 | 0.44 | 0 | 0 | 0 | 0 | 0 | 0 | 13.675 |
| 2 | 6 | 10002 | 2025-02-19 | 62.0 | 2.0 | NC | 296 | 26 | CAMERAS AND SUPPLIES | GENERAL MERCHANDISE | ... | 0.0 | 0.0 | 0.64 | 0 | 0 | 0 | 0 | 0 | 0 | 13.675 |
| 3 | 90 | 10001 | 2025-02-25 | 67.0 | 3.0 | MD | 285 | 22 | DAIRY | FOOD | ... | 0.0 | 0.02 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.675 |
| 4 | 9 | 10004 | 2025-02-07 | 53.0 | 2.0 | LA | 66 | 13 | SPORTING GOODS | GENERAL MERCHANDISE | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.675 |


## Split by Date

In [3]:

history_mask = df['dt'] <= TEST_END
holdout_mask = df['dt'] > TEST_END

history_df = df.loc[history_mask].copy()
holdout_df = df.loc[holdout_mask].copy()

train_mask = history_df['dt'] <= TRAIN_END
val_mask = (history_df['dt'] > TRAIN_END) & (history_df['dt'] <= VAL_END)
test_mask = (history_df['dt'] > VAL_END)

train_df = history_df.loc[train_mask].copy()
val_df = history_df.loc[val_mask].copy()
test_df = history_df.loc[test_mask].copy()

print('Train rows:', len(train_df))
print('Validation rows:', len(val_df))
print('Test rows:', len(test_df))
print('Holdout rows:', len(holdout_df))


Train rows: 222000
Validation rows: 15000
Test rows: 24000
Holdout rows: 14000
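
As a sanity check, the cutoffs can be replayed on a synthetic one-row-per-day calendar to confirm that the four windows cover the full date range without gaps or overlap. This is a minimal sketch: `assign_split` is a hypothetical helper (not part of the notebook), and the calendar frame stands in for the real data.

```python
import pandas as pd

TRAIN_END = pd.Timestamp('2025-05-31')
VAL_END = pd.Timestamp('2025-06-30')
TEST_END = pd.Timestamp('2025-08-17')

def assign_split(dates):
    """Hypothetical helper: label each date using the same cutoffs as the masks above."""
    labels = pd.Series('holdout', index=dates.index)
    # Apply the widest window first so narrower (earlier) windows overwrite it.
    labels[dates <= TEST_END] = 'test'
    labels[dates <= VAL_END] = 'val'
    labels[dates <= TRAIN_END] = 'train'
    return labels

# Synthetic stand-in: one row per calendar day over the dataset's date range.
toy = pd.DataFrame({'dt': pd.date_range('2024-03-14', '2025-09-14', freq='D')})
toy['split'] = assign_split(toy['dt'])

days_per_split = toy['split'].value_counts().to_dict()
print(days_per_split)
```

The day counts come out as 444 / 30 / 48 / 28. Multiplied by 500 store-department series (the number implied by the row counts, assuming one row per series per day), they reproduce the 222000 / 15000 / 24000 / 14000 rows printed above.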



## Helper: Build Lag/Rolling Features Safely

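We define a function that, given a DataFrame (train/val/test), computes lagged and rolling statistics per `(store_id, dept_id)` series using only past data. For validation and test, we prepend the earlier splits as history so the first rows of each split have enough context, then keep only the rows that belong to the target split. The lags are positional (`shift`), so they match calendar-day lags only when every series has exactly one row per day.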
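
The prepend-then-drop idea can be seen on a toy series. The sketch below uses made-up values for a single store-department pair and a hypothetical `lag1` helper; it only illustrates why history is needed and is not the notebook's helper itself.

```python
import pandas as pd

# Toy series for one (store_id, dept_id) pair; the values are invented.
toy = pd.DataFrame({
    'store_id': 1,
    'dept_id': 6,
    'dt': pd.date_range('2025-05-29', periods=6, freq='D'),
    'cases': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})
toy_train = toy[toy['dt'] <= '2025-05-31']
toy_val = toy[toy['dt'] > '2025-05-31']

def lag1(frame):
    """Hypothetical helper: per-group one-step lag of cases on date-sorted rows."""
    ordered = frame.sort_values(['store_id', 'dept_id', 'dt'])
    return ordered.groupby(['store_id', 'dept_id'])['cases'].shift(1)

# Without history, the first validation row has nothing to look back on.
lag_without_history = lag1(toy_val)

# With history prepended, the lag is computed on the combined frame and
# only the validation rows (matched by index label) are kept afterwards.
combined = pd.concat([toy_train, toy_val])
lag_with_history = lag1(combined).loc[toy_val.index]

print(lag_without_history.tolist())
print(lag_with_history.tolist())
```

With history, the first validation row picks up the last training value (30.0) instead of a missing value. Matching rows by index label works here because the splits are slices of one frame with a unique index.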


In [4]:

def add_lag_features(target_df, history_df=None, group_cols=('store_id','dept_id'), date_col='dt'):
    """Return target_df with lag/rolling features. history_df provides prior rows (train or train+val)."""
    if history_df is not None:
        concat_df = pd.concat([history_df, target_df], axis=0)
    else:
        concat_df = target_df.copy()

    group_list = list(group_cols)
    concat_df = concat_df.sort_values(group_list + [date_col])

    def past_rolling(col, window, stat):
        # Shift by one row before rolling so each window ends at the previous row;
        # without the shift, a row's own value would leak into its rolling feature.
        return concat_df.groupby(group_list)[col].transform(
            lambda s: s.shift(1).rolling(window=window, min_periods=1).agg(stat)
        )

    # Lags are row-based, so they assume one row per day within each group.
    for lag in [1, 7, 14]:
        concat_df[f'cases_lag_{lag}'] = concat_df.groupby(group_list)['cases'].shift(lag)
    concat_df['cases_ma_7'] = past_rolling('cases', 7, 'mean')
    concat_df['cases_ma_14'] = past_rolling('cases', 14, 'mean')
    concat_df['cases_std_7'] = past_rolling('cases', 7, 'std')

    concat_df['trucks_lag_1'] = concat_df.groupby(group_list)['trucks'].shift(1)
    concat_df['trucks_lag_7'] = concat_df.groupby(group_list)['trucks'].shift(7)
    concat_df['trucks_ma_7'] = past_rolling('trucks', 7, 'mean')

    # Keep only the target rows; this relies on index labels being unique across splits.
    result = concat_df.loc[concat_df.index.isin(target_df.index)].copy()
    return result





In [5]:

train_with_lags = add_lag_features(train_df)
val_with_lags = add_lag_features(val_df, history_df=train_df)
test_with_lags = add_lag_features(test_df, history_df=pd.concat([train_df, val_df]))

print('Train lags nulls:', train_with_lags[['cases_lag_1','cases_lag_7','cases_lag_14']].isnull().sum().to_dict())
print('Val lags nulls:', val_with_lags[['cases_lag_1','cases_lag_7','cases_lag_14']].isnull().sum().to_dict())
print('Test lags nulls:', test_with_lags[['cases_lag_1','cases_lag_7','cases_lag_14']].isnull().sum().to_dict())


Train lags nulls: {'cases_lag_1': 500, 'cases_lag_7': 3500, 'cases_lag_14': 7000}
Val lags nulls: {'cases_lag_1': 0, 'cases_lag_7': 0, 'cases_lag_14': 0}
Test lags nulls: {'cases_lag_1': 0, 'cases_lag_7': 0, 'cases_lag_14': 0}
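
The train null counts follow from the panel structure: each series has no earlier value for its first `lag` rows, so the count is (number of series) × `lag`, here 500 × 1, 500 × 7 and 500 × 14. The sketch below reproduces that arithmetic on a small synthetic panel; the three series and their values are invented, and one row per series per day is assumed.

```python
import pandas as pd

n_days = 20
# Synthetic panel: 3 (store_id, dept_id) series with one row per day each.
pairs = [(1, 6), (1, 90), (2, 6)]
panel = pd.concat(
    [
        pd.DataFrame({
            'store_id': store,
            'dept_id': dept,
            'dt': pd.date_range('2024-03-14', periods=n_days, freq='D'),
            'cases': range(n_days),
        })
        for store, dept in pairs
    ],
    ignore_index=True,
)

panel = panel.sort_values(['store_id', 'dept_id', 'dt'])
grouped = panel.groupby(['store_id', 'dept_id'])['cases']

# The first `lag` rows of every series have no earlier value to copy from.
null_counts = {lag: int(grouped.shift(lag).isnull().sum()) for lag in (1, 7, 14)}
print(null_counts)
```

Validation and test show zero nulls because the prepended history supplies at least 14 earlier rows for every series.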


## Save Splits

In [6]:

# train_with_lags, val_with_lags and test_with_lags were computed in the previous cell.

train_with_lags.to_csv(OUTPUT_DIR / 'train_timeseries.csv', index=False)
val_with_lags.to_csv(OUTPUT_DIR / 'val_timeseries.csv', index=False)
test_with_lags.to_csv(OUTPUT_DIR / 'test_timeseries.csv', index=False)
holdout_df.to_csv(OUTPUT_DIR / 'holdout_forecast_window.csv', index=False)

print('Saved train/val/test/holdout files to', OUTPUT_DIR)


Saved train/val/test/holdout files to /Users/chanamalluvinay/Documents/wmt_proj/data/processed/timeseries_splits



## Notes
- Adjust `TRAIN_END`, `VAL_END`, and `TEST_END` at the top if you need different cutoffs.
- The lag features currently cover `cases` and `trucks` only; extend the helper function for additional metrics as needed.
- Forecast rows (with `cases` null) fall after `TEST_END`. They are saved to `holdout_forecast_window.csv` without lag features and can be scored after a model is trained.
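
One way to extend the lag features to further metrics is to loop over columns and lags instead of writing one line per feature. The sketch below is only an illustration: `add_group_lags` is a hypothetical helper, and `pallets` is a made-up column that does not exist in the dataset.

```python
import pandas as pd

def add_group_lags(frame, value_cols, lags, group_cols=('store_id', 'dept_id'), date_col='dt'):
    """Hypothetical helper: add `<col>_lag_<k>` for every column/lag combination."""
    group_list = list(group_cols)
    out = frame.sort_values(group_list + [date_col]).copy()
    for col in value_cols:
        for lag in lags:
            # shift() is applied within each group, so lags never cross series.
            out[f'{col}_lag_{lag}'] = out.groupby(group_list)[col].shift(lag)
    return out

# Toy frame with two stores; `pallets` is an invented extra metric.
toy = pd.DataFrame({
    'store_id': [1, 1, 1, 2, 2, 2],
    'dept_id': [6, 6, 6, 6, 6, 6],
    'dt': list(pd.date_range('2025-01-01', periods=3, freq='D')) * 2,
    'cases': [5.0, 6.0, 7.0, 50.0, 60.0, 70.0],
    'pallets': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

with_lags = add_group_lags(toy, value_cols=['cases', 'pallets'], lags=[1, 2])
print(with_lags[['store_id', 'dt', 'pallets_lag_1', 'pallets_lag_2']])
```

The same history-prepending step used for validation and test would still be needed around such a helper so that the first rows of each split have context.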
