# Corporación Favorita Grocery Sales Forecasting
**w03_d01_MODEL_baseline.ipynb**

**Author:** Alberto Diaz Durana  
**Date:** November 2025  
**Purpose:** Establish XGBoost baseline model with comprehensive evaluation

---

## Objectives

This notebook accomplishes the following:

- Load final feature-engineered dataset from Week 2 (w02_d05_FE_final.pkl)
- Create chronological train/test split (Jan-Feb train, March 2014 test)
- Train XGBoost baseline model with default parameters
- Evaluate with 6 comprehensive metrics (MAE, RMSE, Bias, MAD, rMAD, MAPE)
- Visualize predictions and residuals
- Document baseline performance for Week 3 comparison
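
As a rough illustration of the six metrics named above, the sketch below computes them with NumPy. The notebook does not define MAD and rMAD at this point, so the definitions here are assumptions: MAD is taken as the median absolute error, and rMAD as MAD divided by the mean absolute actual value. Bias is taken as mean(predicted - actual), and MAPE skips rows where the actual is zero. The function name `forecast_metrics` is hypothetical.

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """Return six error metrics as a dict (MAD/rMAD definitions are assumptions)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true                  # positive = over-forecast
    abs_err = np.abs(err)
    nonzero = y_true != 0                  # MAPE is undefined where actual == 0
    mad = float(np.median(abs_err))        # assumed: median absolute error
    return {
        "MAE": float(abs_err.mean()),
        "RMSE": float(np.sqrt((err ** 2).mean())),
        "Bias": float(err.mean()),
        "MAD": mad,
        "rMAD": mad / float(np.abs(y_true).mean()),  # assumed: MAD / mean |actual|
        "MAPE": float((abs_err[nonzero] / np.abs(y_true[nonzero])).mean() * 100),
    }

# Toy values, not notebook results
metrics = forecast_metrics([2, 4, 10], [3, 4, 7])
print(metrics)
```

A negative Bias under this sign convention indicates under-forecasting on average.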

---

## Business Context

**Why baseline modeling matters:**

- Establishes performance benchmark for hyperparameter tuning
- Validates feature engineering efforts from Week 2
- Identifies model strengths and weaknesses before optimization
- Provides interpretable metrics for stakeholder communication

**Expected outcomes:**
- RMSE baseline documented
- Feature importance from default XGBoost
- Clear under/over-forecasting patterns identified

---

In [2]:
# Cell 1: Imports and Project Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import time
warnings.filterwarnings('ignore')

# XGBoost and evaluation
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Visualization settings
plt.style.use('default')
sns.set_palette("husl")

# Reproducibility
np.random.seed(42)

print("Library versions:")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"xgboost: {xgb.__version__}")
print(f"matplotlib: {plt.matplotlib.__version__}")
print(f"seaborn: {sns.__version__}")

Library versions:
pandas: 2.1.4
numpy: 1.26.4
xgboost: 2.0.3
matplotlib: 3.10.7
seaborn: 0.13.2


## 1. Data Loading & Verification

**Objective:** Load final feature-engineered dataset from Week 2 Day 5

**Activities:**
- Load w02_d05_FE_final.pkl (300,896 rows × 57 columns expected)
- Verify dataset structure and shape
- Check feature count (29 engineered + 28 base = 57 total)
- Confirm temporal range (2013-01-02 to 2017-08-15)
- Verify no data quality issues before modeling

**Expected output:** 
- Dataset loaded successfully
- Shape: (300896, 57)
- Date range confirmed
- No missing values in critical features
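
The checks listed above can be collected into a small helper that fails fast when the loaded frame differs from expectations. This is a minimal sketch: `validate_dataset` and its parameters are hypothetical names, and the toy frame only stands in for the real pickle.

```python
import pandas as pd

def validate_dataset(df, expected_shape, date_col, date_min, date_max, critical_cols):
    """Raise AssertionError if shape, date range, or critical columns are off."""
    assert df.shape == expected_shape, f"unexpected shape {df.shape}"
    assert df[date_col].min() == pd.Timestamp(date_min), "unexpected start date"
    assert df[date_col].max() == pd.Timestamp(date_max), "unexpected end date"
    missing = df[critical_cols].isnull().sum()
    assert (missing == 0).all(), f"missing values in: {missing[missing > 0].index.tolist()}"
    return True

# Toy stand-in for the real dataset
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2017-08-15"]),
    "unit_sales": [3.0, 5.0],
})
ok = validate_dataset(toy, (2, 2), "date", "2013-01-02", "2017-08-15", ["unit_sales"])
```

For the real data, the call would presumably use the expected shape `(300896, 57)` and the date range listed above.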

In [3]:
# Determine paths (works from notebooks/ or project root)
current_dir = Path(__file__).parent if '__file__' in globals() else Path.cwd()
PROJECT_ROOT = current_dir.parent if current_dir.name == 'notebooks' else current_dir

DATA_PROCESSED = PROJECT_ROOT / 'data' / 'processed'
DATA_RESULTS = PROJECT_ROOT / 'data' / 'results' / 'features'
OUTPUTS_FIGURES = PROJECT_ROOT / 'outputs' / 'figures' / 'features'

print(f"\nProject root: {PROJECT_ROOT.resolve()}")
print(f"Data processed: {DATA_PROCESSED.resolve()}")
print(f"Results output: {DATA_RESULTS.resolve()}")
print(f"Figures output: {OUTPUTS_FIGURES.resolve()}")


Project root: D:\Demand-forecasting-in-retail
Data processed: D:\Demand-forecasting-in-retail\data\processed
Results output: D:\Demand-forecasting-in-retail\data\results\features
Figures output: D:\Demand-forecasting-in-retail\outputs\figures\features


In [4]:
# Cell 2: Load Final Feature-Engineered Dataset

# Load the final dataset from Week 2 Day 5
file_path = DATA_PROCESSED / 'w02_d05_FE_final.pkl'  
df = pd.read_pickle(file_path)

print("Dataset loaded successfully")
print(f"Shape: {df.shape}")
print(f"\nDate range: {df['date'].min()} to {df['date'].max()}")
print(f"\nColumn count: {len(df.columns)}")
print(f"\nFirst few columns:")
print(df.columns.tolist()[:10])
print(f"\nLast few columns:")
print(df.columns.tolist()[-10:])
print(f"\nData types summary:")
print(df.dtypes.value_counts())
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

Dataset loaded successfully
Shape: (300896, 57)

Date range: 2013-01-02 00:00:00 to 2017-08-15 00:00:00

Column count: 57

First few columns:
['id', 'date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion', 'family', 'class', 'perishable', 'city']

Last few columns:
['cluster_avg_sales', 'cluster_median_sales', 'cluster_std_sales', 'item_avg_sales', 'item_median_sales', 'item_std_sales', 'item_count', 'item_total_sales', 'promo_item_avg_interaction', 'promo_cluster_interaction']

Data types summary:
float64           30
int64              9
int32              9
object             8
datetime64[ns]     1
Name: count, dtype: int64

Memory usage: 219.9 MB


### 1.1 Data Quality Verification

**Objective:** Verify dataset quality before modeling

**Checks:**
- Temporal order preserved (critical for time series)
- Missing values assessment
- Feature types validation
- Basic statistics

In [5]:
# Cell 3: Data Quality Verification

print("=== DATA QUALITY CHECK ===\n")

# Check temporal order
print("1. Temporal Order:")
is_sorted = df.sort_values(['store_nbr', 'item_nbr', 'date']).equals(df)
print(f"   Dataset sorted by (store_nbr, item_nbr, date): {is_sorted}")

# Missing values
print(f"\n2. Missing Values:")
missing = df.isnull().sum()
missing_features = missing[missing > 0].sort_values(ascending=False)
if len(missing_features) > 0:
    print(f"   Features with missing values:")
    for col, count in missing_features.head(10).items():
        pct = (count / len(df)) * 100
        print(f"   - {col}: {count:,} ({pct:.2f}%)")
else:
    print("   No missing values found")

# Target variable statistics
print(f"\n3. Target Variable (unit_sales):")
print(f"   Mean: {df['unit_sales'].mean():.2f}")
print(f"   Median: {df['unit_sales'].median():.2f}")
print(f"   Std: {df['unit_sales'].std():.2f}")
print(f"   Min: {df['unit_sales'].min():.2f}")
print(f"   Max: {df['unit_sales'].max():.2f}")
print(f"   Zeros: {(df['unit_sales'] == 0).sum():,} ({(df['unit_sales'] == 0).sum() / len(df) * 100:.1f}%)")

# Feature categories
print(f"\n4. Feature Categories:")
print(f"   Total features: {len(df.columns)}")
lag_features = [col for col in df.columns if 'lag' in col.lower()]
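# Note: this pattern also matches aggregate columns such as cluster_avg_sales and item_std_sales, so the category counts overlap
rolling_features = [col for col in df.columns if ('avg' in col or 'std' in col) and 'sales' in col]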
oil_features = [col for col in df.columns if 'oil' in col.lower()]
agg_features = [col for col in df.columns if any(x in col for x in ['store_avg', 'cluster_avg', 'item_avg'])]
promo_features = [col for col in df.columns if 'promo' in col.lower()]

print(f"   - Lag features: {len(lag_features)}")
print(f"   - Rolling features: {len(rolling_features)}")
print(f"   - Oil features: {len(oil_features)}")
print(f"   - Aggregation features: {len(agg_features)}")
print(f"   - Promotion features: {len(promo_features)}")

print(f"\n5. Date Range for Modeling:")
print(f"   Full range: {df['date'].min().date()} to {df['date'].max().date()}")
print(f"   Total days: {(df['date'].max() - df['date'].min()).days}")

=== DATA QUALITY CHECK ===

1. Temporal Order:
   Dataset sorted by (store_nbr, item_nbr, date): True

2. Missing Values:
   Features with missing values:
   - unit_sales_lag30: 291,884 (97.00%)
   - holiday_type: 273,698 (90.96%)
   - holiday_name: 273,698 (90.96%)
   - unit_sales_lag14: 204,230 (67.87%)
   - unit_sales_lag7: 119,961 (39.87%)
   - unit_sales_lag1: 19,692 (6.54%)
   - unit_sales_7d_std: 19,692 (6.54%)
   - unit_sales_14d_std: 19,692 (6.54%)
   - unit_sales_30d_std: 19,692 (6.54%)
   - oil_price_lag30: 30 (0.01%)

3. Target Variable (unit_sales):
   Mean: 6.79
   Median: 3.00
   Std: 15.63
   Min: -92.00
   Max: 2534.00
   Zeros: 0 (0.0%)

4. Feature Categories:
   Total features: 57
   - Lag features: 7
   - Rolling features: 12
   - Oil features: 6
   - Aggregation features: 4
   - Promotion features: 4

5. Date Range for Modeling:
   Full range: 2013-01-02 to 2017-08-15
   Total days: 1686


**Key observations:**

- Temporal order is preserved.
- High NaN rates in the lag features are expected (sparse retail data).
- The dataset contains no zero-sales rows (sparse format: only recorded sales are stored).
- Feature categories align with the Week 2 engineering.

## 2. Train/Test Split (Chronological)

**Objective:** Create time-based train/test split for realistic forecasting evaluation

**Strategy:**
- **Training period:** January-February 2014 (2 months)
- **Test period:** March 2014 (1 month)
- **No shuffling:** Preserve temporal order (critical for time series)
- **Rationale:** Simulate realistic forecast scenario (predict next month using past 2 months)

**Expected split:**
- Train: ~60-70% of 2014 Q1 data
- Test: ~30-40% of 2014 Q1 data
- Course requirement: Model Jan-Mar 2014 only (Guayas, top-3 families)

In [6]:
# Cell 4: Chronological Train/Test Split

print("=== TRAIN/TEST SPLIT (CHRONOLOGICAL) ===\n")

# Filter to Jan-Mar 2014 only (course requirement)
print("1. Filter to Q1 2014 (Jan-Mar):")
df_2014q1 = df[(df['date'] >= '2014-01-01') & (df['date'] <= '2014-03-31')].copy()
print(f"   Original dataset: {len(df):,} rows")
print(f"   Q1 2014 subset: {len(df_2014q1):,} rows ({len(df_2014q1)/len(df)*100:.1f}%)")
print(f"   Date range: {df_2014q1['date'].min().date()} to {df_2014q1['date'].max().date()}")

# Split: Jan-Feb (train), March (test)
print(f"\n2. Chronological Split:")
train = df_2014q1[df_2014q1['date'] < '2014-03-01'].copy()
test = df_2014q1[df_2014q1['date'] >= '2014-03-01'].copy()

print(f"   Train (Jan-Feb 2014): {len(train):,} rows")
print(f"     Date range: {train['date'].min().date()} to {train['date'].max().date()}")
print(f"     Days: {(train['date'].max() - train['date'].min()).days + 1}")

print(f"   Test (March 2014): {len(test):,} rows")
print(f"     Date range: {test['date'].min().date()} to {test['date'].max().date()}")
print(f"     Days: {(test['date'].max() - test['date'].min()).days + 1}")

print(f"\n3. Split Ratio:")
print(f"   Train: {len(train)/len(df_2014q1)*100:.1f}%")
print(f"   Test: {len(test)/len(df_2014q1)*100:.1f}%")

# Verify no data leakage (test dates > train dates)
print(f"\n4. Data Leakage Check:")
latest_train_date = train['date'].max()
earliest_test_date = test['date'].min()
print(f"   Latest train date: {latest_train_date.date()}")
print(f"   Earliest test date: {earliest_test_date.date()}")
print(f"   No overlap: {latest_train_date < earliest_test_date}")

# Target variable distribution
print(f"\n5. Target Distribution:")
print(f"   Train unit_sales - Mean: {train['unit_sales'].mean():.2f}, Median: {train['unit_sales'].median():.2f}")
print(f"   Test unit_sales - Mean: {test['unit_sales'].mean():.2f}, Median: {test['unit_sales'].median():.2f}")

=== TRAIN/TEST SPLIT (CHRONOLOGICAL) ===

1. Filter to Q1 2014 (Jan-Mar):
   Original dataset: 300,896 rows
   Q1 2014 subset: 12,668 rows (4.2%)
   Date range: 2014-01-01 to 2014-03-31

2. Chronological Split:
   Train (Jan-Feb 2014): 7,982 rows
     Date range: 2014-01-01 to 2014-02-28
     Days: 59
   Test (March 2014): 4,686 rows
     Date range: 2014-03-01 to 2014-03-31
     Days: 31

3. Split Ratio:
   Train: 63.0%
   Test: 37.0%

4. Data Leakage Check:
   Latest train date: 2014-02-28
   Earliest test date: 2014-03-01
   No overlap: True

5. Target Distribution:
   Train unit_sales - Mean: 7.21, Median: 3.00
   Test unit_sales - Mean: 7.26, Median: 3.00


### Gap Period to Prevent Lag Feature Leakage

**The issue:**

- The maximum lag is 30 days (unit_sales_lag30).
- Predicting March 1 without a gap means that:
  - lag1 uses Feb 28 (training data)
  - lag7 uses Feb 22 (training data)
  - lag30 uses Jan 30 (training data)
- This creates subtle information leakage from training into test predictions.

**Standard solution:** Leave a gap period equal to the maximum lag (30 days) between train and test.
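
The lag arithmetic above can be checked with a few lines of pandas. The sketch below assumes the lags are calendar-day offsets, as the description implies; `lag_sources` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def lag_sources(target_date, lags, train_end):
    """Map each lag to the date it reads from and whether that date is in training."""
    target = pd.Timestamp(target_date)
    train_end = pd.Timestamp(train_end)
    rows = []
    for k in lags:
        source = target - pd.Timedelta(days=k)
        rows.append({
            "lag": k,
            "source_date": source.date().isoformat(),
            "in_train": source <= train_end,
        })
    return pd.DataFrame(rows)

# No gap: training ends on Feb 28 and the first test day is Mar 1
table = lag_sources("2014-03-01", lags=[1, 7, 14, 30], train_end="2014-02-28")
print(table)
```

Moving `train_end` earlier shows which lags stop reading training rows for a given test date.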

### 2.1 Train/Test Split (Chronological with Gap Period)

**Objective:** Create time-based train/test split with gap period to prevent lag feature leakage

**Strategy:**
- **Training period:** January 1 - February 21, 2014 (52 days)
- **Gap period:** February 22 - February 28, 2014 (7 days) - EXCLUDED from both sets
- **Test period:** March 1 - March 31, 2014 (31 days)
- **No shuffling:** Preserve temporal order (critical for time series)

**Rationale for 7-day gap:**
- Prevents leakage for lag7 feature (strongest autocorrelation: r=0.40)
- Balances training data availability (52 days) vs leak prevention
- lag1 predictions use gap period data (acceptable trade-off)
- lag14 and lag30 still use some training period (documented limitation)

**Trade-offs documented:**
- A strict 30-day gap would leave only 29 days of training data (Jan 1 - Jan 29), which is insufficient
- The 7-day gap pragmatically prevents the most critical leakage
- Acceptable for the scope of an academic project
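
The split strategy described above can be expressed as a small reusable function. This is a sketch on a toy frame with one row per day; `split_with_gap` is a hypothetical name and the real notebook data has many rows per date.

```python
import pandas as pd

def split_with_gap(df, date_col, train_end, test_start):
    """Return (train, gap, test); rows strictly between the two dates form the gap."""
    train_end = pd.Timestamp(train_end)
    test_start = pd.Timestamp(test_start)
    train = df[df[date_col] <= train_end]
    gap = df[(df[date_col] > train_end) & (df[date_col] < test_start)]
    test = df[df[date_col] >= test_start]
    return train, gap, test

# Toy frame: one row per calendar day in Q1 2014
toy = pd.DataFrame({"date": pd.date_range("2014-01-01", "2014-03-31", freq="D")})
train, gap, test = split_with_gap(toy, "date", "2014-02-21", "2014-03-01")
print(len(train), len(gap), len(test))  # 52 7 31
```

Keeping the gap rows as a separate frame, rather than discarding them silently, makes it easy to report how much data the gap excludes.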

In [7]:
# Cell 4: Chronological Train/Test Split with Gap Period

print("=== TRAIN/TEST SPLIT WITH GAP PERIOD ===\n")

# Filter to Jan-Mar 2014 only (course requirement)
print("1. Filter to Q1 2014 (Jan-Mar):")
df_2014q1 = df[(df['date'] >= '2014-01-01') & (df['date'] <= '2014-03-31')].copy()
print(f"   Original dataset: {len(df):,} rows")
print(f"   Q1 2014 subset: {len(df_2014q1):,} rows ({len(df_2014q1)/len(df)*100:.1f}%)")
print(f"   Date range: {df_2014q1['date'].min().date()} to {df_2014q1['date'].max().date()}")

# Split with 7-day gap: Train (Jan 1 - Feb 21), Gap (Feb 22-28), Test (Mar 1-31)
print(f"\n2. Chronological Split with Gap:")
train = df_2014q1[df_2014q1['date'] <= '2014-02-21'].copy()
gap = df_2014q1[(df_2014q1['date'] > '2014-02-21') & (df_2014q1['date'] < '2014-03-01')].copy()
test = df_2014q1[df_2014q1['date'] >= '2014-03-01'].copy()

print(f"   Train (Jan 1 - Feb 21, 2014): {len(train):,} rows")
print(f"     Date range: {train['date'].min().date()} to {train['date'].max().date()}")
print(f"     Days: {(train['date'].max() - train['date'].min()).days + 1}")

print(f"   Gap (Feb 22 - Feb 28, 2014): {len(gap):,} rows [EXCLUDED]")
print(f"     Date range: {gap['date'].min().date()} to {gap['date'].max().date()}")
print(f"     Days: {(gap['date'].max() - gap['date'].min()).days + 1}")
print(f"     Purpose: Prevent lag7 feature leakage")

print(f"   Test (March 1 - 31, 2014): {len(test):,} rows")
print(f"     Date range: {test['date'].min().date()} to {test['date'].max().date()}")
print(f"     Days: {(test['date'].max() - test['date'].min()).days + 1}")

print(f"\n3. Split Ratio (excluding gap):")
print(f"   Train: {len(train)/(len(train)+len(test))*100:.1f}%")
print(f"   Test: {len(test)/(len(train)+len(test))*100:.1f}%")
print(f"   Gap excluded: {len(gap):,} rows ({len(gap)/len(df_2014q1)*100:.1f}% of Q1)")

# Verify gap period
print(f"\n4. Gap Period Verification:")
latest_train_date = train['date'].max()
earliest_test_date = test['date'].min()
gap_days = (earliest_test_date - latest_train_date).days - 1
print(f"   Latest train date: {latest_train_date.date()}")
print(f"   Earliest test date: {earliest_test_date.date()}")
print(f"   Gap period: {gap_days} days")
print(f"   Prevents lag7 leakage: {gap_days >= 7}")

# Lag feature leakage analysis
print(f"\n5. Lag Feature Leakage Assessment:")
print(f"   lag1 (yesterday): Uses gap period data (acceptable)")
print(f"   lag7 (last week): No leakage (gap prevents)")
print(f"   lag14 (2 weeks ago): Partial overlap with training period")
print(f"   lag30 (1 month ago): Uses training period data")
print(f"   Decision: Pragmatic 7-day gap balances data availability vs leakage")

# Target variable distribution
print(f"\n6. Target Distribution:")
print(f"   Train unit_sales - Mean: {train['unit_sales'].mean():.2f}, Median: {train['unit_sales'].median():.2f}")
print(f"   Test unit_sales - Mean: {test['unit_sales'].mean():.2f}, Median: {test['unit_sales'].median():.2f}")
print(f"   Gap unit_sales - Mean: {gap['unit_sales'].mean():.2f}, Median: {gap['unit_sales'].median():.2f}")

=== TRAIN/TEST SPLIT WITH GAP PERIOD ===

1. Filter to Q1 2014 (Jan-Mar):
   Original dataset: 300,896 rows
   Q1 2014 subset: 12,668 rows (4.2%)
   Date range: 2014-01-01 to 2014-03-31

2. Chronological Split with Gap:
   Train (Jan 1 - Feb 21, 2014): 7,050 rows
     Date range: 2014-01-01 to 2014-02-21
     Days: 52
   Gap (Feb 22 - Feb 28, 2014): 932 rows [EXCLUDED]
     Date range: 2014-02-22 to 2014-02-28
     Days: 7
     Purpose: Prevent lag7 feature leakage
   Test (March 1 - 31, 2014): 4,686 rows
     Date range: 2014-03-01 to 2014-03-31
     Days: 31

3. Split Ratio (excluding gap):
   Train: 60.1%
   Test: 39.9%
   Gap excluded: 932 rows (7.4% of Q1)

4. Gap Period Verification:
   Latest train date: 2014-02-21
   Earliest test date: 2014-03-01
   Gap period: 7 days
   Prevents lag7 leakage: True

5. Lag Feature Leakage Assessment:
   lag1 (yesterday): Uses gap period data (acceptable)
   lag7 (last week): No leakage (gap prevents)
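   lag14 (2 weeks ago): Partial overlap with training period
   lag30 (1 month ago): Uses training period data
   Decision: Pragmatic 7-day gap balances data availability vs leakage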

## 3. Feature & Target Separation

**Objective:** Prepare X (features) and y (target) for modeling

**Strategy:**
- **Target variable:** unit_sales (what we predict)
- **Features to EXCLUDE:** 
  - id (identifier, not predictive)
  - date (used for split, not as feature)
  - store_nbr, item_nbr (identifiers - aggregations already capture patterns)
  - unit_sales (target variable)
  - Categorical text features (city, state, type, family, class) - use engineered features instead
  - Holiday text features (holiday_name, holiday_type) - sparse, mostly NaN
  
- **Features to INCLUDE:** All 29 engineered features + relevant base features
- **Handle NaN:** XGBoost handles natively (no imputation needed per DEC-011)

**Expected result:**
- X_train: (7,050 rows × ~40-45 features)
- X_test: (4,686 rows × ~40-45 features)
- y_train: (7,050 values)
- y_test: (4,686 values)
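
XGBoost expects numeric, boolean, or categorical columns, so any text or datetime column that survives the exclusion list has to be dropped or encoded before training. The sketch below shows one way to detect and drop such columns. The helper name `split_unsupported_columns` and the toy column `store_type_label` are hypothetical, and whether to drop or encode is a modeling choice this section does not settle.

```python
import pandas as pd

def split_unsupported_columns(X):
    """Return (frame without unsupported columns, list of the columns removed)."""
    # Everything that is not numeric, boolean, or categorical is treated as unsupported
    unsupported = X.select_dtypes(exclude=["number", "bool", "category"]).columns.tolist()
    return X.drop(columns=unsupported), unsupported

# Toy feature matrix; 'store_type_label' is a made-up text column
X_toy = pd.DataFrame({
    "onpromotion": [0, 1, 0],
    "unit_sales_lag1": [2.0, None, 5.0],   # NaN is kept; XGBoost handles it natively
    "store_type_label": ["A", "B", "A"],
})
X_clean, dropped = split_unsupported_columns(X_toy)
print(dropped)  # ['store_type_label']
```

An alternative, not shown here, is casting such columns to the pandas `category` dtype and passing `enable_categorical=True` to XGBoost.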

In [8]:
# Cell 5: Feature & Target Separation

print("=== FEATURE & TARGET SEPARATION ===\n")

# Define columns to exclude
exclude_cols = [
    'id',           # Identifier
    'date',         # Used for split, not feature
    'store_nbr',    # Identifier (aggregations capture store patterns)
    'item_nbr',     # Identifier (aggregations capture item patterns)
    'unit_sales',   # Target variable
    # Categorical text features (already encoded via aggregations)
    'city', 'state', 'type', 'family', 'class',
    # Holiday text features (sparse, mostly NaN)
    'holiday_name', 'holiday_type'
]

print("1. Define Features:")
feature_cols = [col for col in train.columns if col not in exclude_cols]
print(f"   Total columns in train: {len(train.columns)}")
print(f"   Excluded columns: {len(exclude_cols)}")
print(f"   Feature columns: {len(feature_cols)}")

print(f"\n2. Feature Categories:")
# Categorize features for documentation
base_features = [col for col in feature_cols if col in ['onpromotion', 'perishable', 'cluster']]
lag_features = [col for col in feature_cols if 'lag' in col.lower()]
rolling_features = [col for col in feature_cols if ('avg' in col or 'std' in col) and 'sales' in col]
oil_features = [col for col in feature_cols if 'oil' in col.lower()]
agg_features = [col for col in feature_cols if any(x in col for x in ['store_avg', 'cluster_avg', 'item_avg', 'store_median', 'cluster_median', 'item_median', 'store_std', 'cluster_std', 'item_std', 'item_count', 'item_total'])]
promo_features = [col for col in feature_cols if 'promo' in col.lower() and col != 'onpromotion']

print(f"   - Base features: {len(base_features)}")
print(f"   - Lag features: {len(lag_features)}")
print(f"   - Rolling features: {len(rolling_features)}")
print(f"   - Oil features: {len(oil_features)}")
print(f"   - Aggregation features: {len(agg_features)}")
print(f"   - Promotion interaction features: {len(promo_features)}")

print(f"\n3. Create X and y:")
X_train = train[feature_cols].copy()
y_train = train['unit_sales'].copy()
X_test = test[feature_cols].copy()
y_test = test['unit_sales'].copy()

print(f"   X_train shape: {X_train.shape}")
print(f"   y_train shape: {y_train.shape}")
print(f"   X_test shape: {X_test.shape}")
print(f"   y_test shape: {y_test.shape}")

print(f"\n4. Missing Values in Features:")
train_missing = X_train.isnull().sum().sort_values(ascending=False)
train_missing_pct = (train_missing / len(X_train) * 100).round(2)
features_with_nan = train_missing[train_missing > 0]
if len(features_with_nan) > 0:
    print(f"   Features with NaN in training set:")
    for feat, count in features_with_nan.head(10).items():
        print(f"   - {feat}: {count:,} ({train_missing_pct[feat]:.2f}%)")
else:
    print("   No missing values in training features")

print(f"\n5. Data Types:")
print(f"   {X_train.dtypes.value_counts().to_dict()}")

print(f"\n6. Feature List (first 15):")
for i, col in enumerate(feature_cols[:15], 1):
    print(f"   {i}. {col}")
print(f"   ... and {len(feature_cols) - 15} more features")

=== FEATURE & TARGET SEPARATION ===

1. Define Features:
   Total columns in train: 57
   Excluded columns: 12
   Feature columns: 45

2. Feature Categories:
   - Base features: 3
   - Lag features: 7
   - Rolling features: 12
   - Oil features: 6
   - Aggregation features: 12
   - Promotion interaction features: 3

3. Create X and y:
   X_train shape: (7050, 45)
   y_train shape: (7050,)
   X_test shape: (4686, 45)
   y_test shape: (4686,)

4. Missing Values in Features:
   Features with NaN in training set:
   - unit_sales_lag30: 7,050 (100.00%)
   - unit_sales_lag14: 7,005 (99.36%)
   - unit_sales_lag7: 4,842 (68.68%)
   - unit_sales_30d_std: 696 (9.87%)
   - unit_sales_14d_std: 696 (9.87%)
   - unit_sales_7d_std: 696 (9.87%)
   - unit_sales_lag1: 696 (9.87%)

5. Data Types:
   {dtype('float64'): 29, dtype('int32'): 9, dtype('int64'): 5, dtype('O'): 2}

6. Feature List (first 15):
   1. onpromotion
   2. perishable
   3. cluster
   4. year
   5. month
   6. day
   7. day_of_week
   