# FULL_01: Data Loading to Feature Engineering (Production Pipeline)

**Purpose:** Process full Guayas dataset (no sampling) with 33 optimized features  
**Source:** Consolidates w01_d01 through w02_d05 notebooks  
**Output:** `data/processed/full_featured_data.pkl`

**Key Decisions Applied:**
- DEC-014: 33 features (exclude rolling std, oil, promotion interactions)
- Full Guayas region, ALL families (no 300K sampling)

**Environment:** WSL2 Ubuntu 22.04, Python 3.11, GPU available

In [35]:
### Section 1: Environment Setup
# Source: w01_d01_SETUP_data_inventory.ipynb
# Standard libraries and path configuration for WSL2 environment

import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Path configuration
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
DATA_RAW = PROJECT_ROOT / 'data' / 'raw'
DATA_PROCESSED = PROJECT_ROOT / 'data' / 'processed'

# Ensure output directory exists
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"Raw data: {DATA_RAW}")
print(f"Processed data: {DATA_PROCESSED}")
print(f"\nPandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Project root: /home/berto/Demand-forecasting-in-retail
Raw data: /home/berto/Demand-forecasting-in-retail/data/raw
Processed data: /home/berto/Demand-forecasting-in-retail/data/processed

Pandas version: 2.3.3
NumPy version: 2.3.5


In [36]:
### Section 1.1: Data Inventory
# Source: w01_d01_SETUP_data_inventory.ipynb
# Verify all required Kaggle files are present

required_files = [
    'train.csv',
    'stores.csv', 
    'items.csv',
    'oil.csv',
    'holidays_events.csv',
    'transactions.csv'
]

print("Raw data inventory:")
print("-" * 50)
for f in required_files:
    filepath = DATA_RAW / f
    if filepath.exists():
        size_mb = filepath.stat().st_size / (1024 * 1024)
        print(f"  {f:<25} {size_mb:>8.1f} MB")
    else:
        print(f"  {f:<25} MISSING")
        

Raw data inventory:
--------------------------------------------------
  train.csv                   4765.9 MB
  stores.csv                     0.0 MB
  items.csv                      0.1 MB
  oil.csv                        0.0 MB
  holidays_events.csv            0.0 MB
  transactions.csv               1.5 MB


All files are present. train.csv is the full dataset (~125M rows) at roughly 4.7 GB (4,765.9 MB in the inventory above). Given the file size, it should be loaded in chunks and filtered to Guayas as it is read.
The next cell loads stores.csv first to identify the Guayas store numbers; train.csv is then loaded with that filter.
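As a minimal sketch of the chunked approach, the example below filters a toy CSV while also narrowing column dtypes at read time. The toy rows, the `keep_stores` subset, and the specific dtype choices are assumptions made for illustration; the notebook itself reads with pandas defaults.

```python
import io
import pandas as pd

# Toy stand-in for train.csv (assumption: same column names as the Kaggle file)
csv_text = """id,date,store_nbr,item_nbr,unit_sales,onpromotion
0,2013-01-02,24,103665,6.0,
1,2013-01-02,25,105574,12.0,
2,2013-01-03,26,105575,2.0,
3,2013-01-03,1,105575,4.0,
"""

keep_stores = [24, 26]  # hypothetical subset; the notebook derives this from stores.csv

# Narrow dtypes at read time so each chunk is smaller before filtering (assumed choices)
dtypes = {"id": "int64", "store_nbr": "int16", "item_nbr": "int32", "unit_sales": "float32"}

filtered = []
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtypes,
                         parse_dates=["date"], chunksize=2):
    # Keep only the rows for the stores of interest, then discard the chunk
    filtered.append(chunk[chunk["store_nbr"].isin(keep_stores)])

toy_train = pd.concat(filtered, ignore_index=True)
print(toy_train.dtypes)
print(toy_train)
```

Parsing `date` while reading avoids holding the dates as Python strings, which is usually the largest contributor to `memory_usage(deep=True)` for a frame like this.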

In [37]:
### Section 2: Data Loading
# Source: w01_d02_EDA_data_loading_filtering.ipynb
# Load stores metadata, filter to Guayas province

stores = pd.read_csv(DATA_RAW / 'stores.csv')

print("Stores dataset:")
print(f"  Shape: {stores.shape}")
print(f"\nStores by state:")
print(stores['state'].value_counts().to_string())

# Identify Guayas stores
guayas_stores = stores[stores['state'] == 'Guayas']['store_nbr'].tolist()
print(f"\nGuayas store numbers ({len(guayas_stores)} stores): {guayas_stores}")

Stores dataset:
  Shape: (54, 5)

Stores by state:
state
Pichincha                         19
Guayas                            11
Azuay                              3
Santo Domingo de los Tsachilas     3
Manabi                             3
Los Rios                           2
Cotopaxi                           2
Tungurahua                         2
El Oro                             2
Bolivar                            1
Imbabura                           1
Chimborazo                         1
Pastaza                            1
Santa Elena                        1
Loja                               1
Esmeraldas                         1

Guayas store numbers (11 stores): [24, 26, 27, 28, 29, 30, 32, 34, 35, 36, 51]


There are 11 Guayas stores. train.csv is about 4.7 GB, so it should be filtered to the Guayas stores while loading to save memory, either with chunked reading or by loading everything and filtering afterwards.
The next cell loads train.csv in chunks and keeps Guayas rows only.

In [38]:
### Section 2.1: Load Train Data with Guayas Filter
# Source: w01_d02_EDA_data_loading_filtering.ipynb
# Load train.csv in chunks, filter to Guayas stores only
# Full dataset: ~125M rows, Guayas subset: ~2M rows expected

import time

start_time = time.time()

# Read in chunks to manage memory (4.7GB file)
chunks = []
chunk_size = 1_000_000

print(f"Loading train.csv (chunked, filtering to {len(guayas_stores)} Guayas stores)...")

for i, chunk in enumerate(pd.read_csv(DATA_RAW / 'train.csv', chunksize=chunk_size)):
    guayas_chunk = chunk[chunk['store_nbr'].isin(guayas_stores)]
    chunks.append(guayas_chunk)
    if (i + 1) % 25 == 0:
        print(f"  Processed {(i+1) * chunk_size / 1e6:.0f}M rows...")

train = pd.concat(chunks, ignore_index=True)
del chunks  # Free memory

load_time = time.time() - start_time

print(f"\nGuayas train data loaded:")
print(f"  Shape: {train.shape}")
print(f"  Memory: {train.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"  Load time: {load_time:.1f} seconds")

Loading train.csv (chunked, filtering to 11 Guayas stores)...
  Processed 25M rows...
  Processed 50M rows...
  Processed 75M rows...
  Processed 100M rows...
  Processed 125M rows...

Guayas train data loaded:
  Shape: (22941656, 6)
  Memory: 3011.4 MB
  Load time: 64.0 seconds


The Guayas subset has 22.9M rows, much larger than the expected ~2M. This is the full Guayas data across all dates and all items, so it still has to be filtered by date range. The data structure should also be checked before feature engineering.
The next cell explores the data structure and date range.

In [39]:
### Section 2.2: Train Data Structure
# Source: w01_d02_EDA_data_loading_filtering.ipynb
# Examine date range, stores, items

train['date'] = pd.to_datetime(train['date'])

print("Train data structure:")
print(f"  Columns: {train.columns.tolist()}")
print(f"  Date range: {train['date'].min()} to {train['date'].max()}")
print(f"  Unique stores: {train['store_nbr'].nunique()}")
print(f"  Unique items: {train['item_nbr'].nunique()}")
print(f"  Unique dates: {train['date'].nunique()}")

print(f"\nDate coverage:")
print(f"  Total days in range: {(train['date'].max() - train['date'].min()).days + 1}")

print(f"\nSample rows:")
print(train.head(3).to_string(index=False))


Train data structure:
  Columns: ['id', 'date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion']
  Date range: 2013-01-02 00:00:00 to 2017-08-15 00:00:00
  Unique stores: 11
  Unique items: 4008
  Unique dates: 1680

Date coverage:
  Total days in range: 1687

Sample rows:
   id       date  store_nbr  item_nbr  unit_sales onpromotion
18789 2013-01-02         24    103665         6.0         NaN
18790 2013-01-02         24    105574        12.0         NaN
18791 2013-01-02         24    105575         2.0         NaN


The data has 22.9M rows, 11 stores, 4008 items, and dates from 2013-01-02 to 2017-08-15. It now needs to be filtered to the target period for DEC-016 (Q4 2013 + Q1 2014 for training, March 2014 for test). Per the handoff document, the split is:

- Training: Q4 2013 + Q1 2014 (Oct 1, 2013 - Feb 21, 2014)
- Gap: 7 days (Feb 22-28, 2014) per DEC-013
- Test: March 2014 (Mar 1-31, 2014)

The model therefore only needs data from roughly Oct 2013 to March 2014. Feature engineering uses lags of up to 30 days, so the data has to start in September 2013 at the latest.
The next cell filters the date range first to reduce memory; the supporting files (items, holidays) are loaded afterwards.
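The split boundaries above can be checked with a short sketch that labels every calendar day in the window. The boundary dates come from the text; the constant names and the `buffer` label are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Boundaries as stated in the text (DEC-016 split, DEC-013 gap)
BUFFER_START = pd.Timestamp("2013-09-01")  # warm-up days for the 30-day lag
TRAIN_START = pd.Timestamp("2013-10-01")
TRAIN_END = pd.Timestamp("2014-02-21")
TEST_START = pd.Timestamp("2014-03-01")
TEST_END = pd.Timestamp("2014-03-31")

dates = pd.Series(pd.date_range(BUFFER_START, TEST_END, freq="D"))

# Conditions are evaluated in order; the first match wins
split = np.select(
    [dates < TRAIN_START, dates <= TRAIN_END, dates < TEST_START],
    ["buffer", "train", "gap"],
    default="test",
)
counts = pd.Series(split).value_counts()
print(counts)
```

The 30 buffer days in September are exactly what the 30-day lag needs: the lag-30 value for October 1 is the sale on September 1.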

In [40]:
### Section 2.3: Filter to Target Period
# Source: w01_d02_EDA_data_loading_filtering.ipynb
# DEC-016: Q4 2013 + Q1 2014 training, March 2014 test
# Need September 2013 start for 30-day lag buffer

DATE_START = '2013-09-01'  # Buffer for 30-day lags
DATE_END = '2014-03-31'    # Test period end

train = train[(train['date'] >= DATE_START) & (train['date'] <= DATE_END)].copy()

print(f"Filtered to {DATE_START} - {DATE_END}:")
print(f"  Shape: {train.shape}")
print(f"  Memory: {train.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"  Date range: {train['date'].min().date()} to {train['date'].max().date()}")
print(f"  Unique dates: {train['date'].nunique()}")
print(f"  Unique items: {train['item_nbr'].nunique()}")

Filtered to 2013-09-01 - 2014-03-31:
  Shape: (2089152, 6)
  Memory: 167.1 MB
  Date range: 2013-09-01 to 2014-03-31
  Unique dates: 211
  Unique items: 2638


Filtering reduces the data from 22.9M to 2.1M rows, which is the target period. The remaining steps are:

- Load items metadata (for family, perishable)
- Load holidays data
- Create calendar features
- Create lag features
- Create rolling features
- Create aggregation features
- Assemble the final 33 features

The next cell loads the items data.

In [41]:
### Section 3: Load Supporting Data - Items
# Source: w01_d02_EDA_data_loading_filtering.ipynb
# Items metadata: family, class, perishable flag

items = pd.read_csv(DATA_RAW / 'items.csv')

print("Items metadata:")
print(f"  Shape: {items.shape}")
print(f"  Columns: {items.columns.tolist()}")

print(f"\nProduct families ({items['family'].nunique()}):")
family_counts = items['family'].value_counts()
print(family_counts.head(10).to_string())

print(f"\nPerishable distribution:")
print(items['perishable'].value_counts().to_string())

# Merge items info into train
train = train.merge(items[['item_nbr', 'family', 'class', 'perishable']], 
                    on='item_nbr', how='left')

print(f"\nTrain after items merge:")
print(f"  Shape: {train.shape}")

Items metadata:
  Shape: (4100, 4)
  Columns: ['item_nbr', 'family', 'class', 'perishable']

Product families (33):
family
GROCERY I        1334
BEVERAGES         613
CLEANING          446
PRODUCE           306
DAIRY             242
PERSONAL CARE     153
BREAD/BAKERY      134
HOME CARE         108
DELI               91
MEATS              84

Perishable distribution:
perishable
0    3114
1     986

Train after items merge:
  Shape: (2089152, 9)


The items metadata is merged. Stores (already loaded, but not yet merged), holidays, and oil remain. Per DEC-014, oil features are excluded, so oil processing is skipped. Holidays are loaded next.
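The unchanged row count after the merge (2,089,152 before and after) indicates that `item_nbr` is unique in the items table. A hedged sketch of how that expectation could be enforced is shown below, using the `validate` argument of `DataFrame.merge` on toy frames; the toy data and variable names are assumptions, and the notebook's own merge does not pass `validate`.

```python
import pandas as pd

# Toy frames (illustrative only): item 3 has no metadata row
sales = pd.DataFrame({"item_nbr": [1, 1, 2, 3], "unit_sales": [2.0, 1.0, 5.0, 4.0]})
items_toy = pd.DataFrame({"item_nbr": [1, 2], "family": ["GROCERY I", "DAIRY"]})

# validate="many_to_one" raises pandas.errors.MergeError if item_nbr
# is duplicated on the right side, which would otherwise multiply rows
merged = sales.merge(items_toy, on="item_nbr", how="left", validate="many_to_one")

rows_preserved = len(merged) == len(sales)
unmatched = int(merged["family"].isna().sum())
print(f"Rows preserved: {rows_preserved}, unmatched items: {unmatched}")
```

Counting NaN values in a merged column afterwards also reveals items that have sales but no metadata.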

In [42]:
### Section 3.1: Load Supporting Data - Holidays
# Source: w01_d02_EDA_data_loading_filtering.ipynb
# Holidays and events for Ecuador

holidays = pd.read_csv(DATA_RAW / 'holidays_events.csv')
holidays['date'] = pd.to_datetime(holidays['date'])

# Filter to target period
holidays = holidays[(holidays['date'] >= DATE_START) & (holidays['date'] <= DATE_END)].copy()

print("Holidays data:")
print(f"  Shape: {holidays.shape}")
print(f"  Columns: {holidays.columns.tolist()}")

print(f"\nHoliday types:")
print(holidays['type'].value_counts().to_string())

print(f"\nSample holidays in period:")
print(holidays[['date', 'type', 'locale', 'locale_name', 'description']].head(5).to_string(index=False))

Holidays data:
  Shape: (26, 6)
  Columns: ['date', 'type', 'locale', 'locale_name', 'description', 'transferred']

Holiday types:
type
Holiday       18
Additional     7
Transfer       1

Sample holidays in period:
      date     type   locale locale_name                         description
2013-09-28  Holiday    Local      Ibarra                 Fundacion de Ibarra
2013-10-07  Holiday    Local     Quevedo            Cantonizacion de Quevedo
2013-10-09  Holiday National     Ecuador          Independencia de Guayaquil
2013-10-11 Transfer National     Ecuador Traslado Independencia de Guayaquil
2013-11-02  Holiday National     Ecuador                     Dia de Difuntos


The holidays data is loaded. The stores metadata is already in memory but still has to be merged, and a complete calendar has to be built. The remaining steps are:

- Load transactions (optional, but the file is available)
- Create a complete calendar (all dates for all store-item combinations)
- Start feature engineering

Following the Week 2 notebook structure, the order should be:

- Create a complete calendar/skeleton
- Merge all the supporting data
- Run quality checks
- Do feature engineering

The next cell merges the stores metadata; the complete calendar follows.

In [43]:
### Section 3.2: Merge Stores Metadata
# Source: w01_d02_EDA_data_loading_filtering.ipynb
# Add store cluster, city, type information

train = train.merge(stores[['store_nbr', 'city', 'state', 'type', 'cluster']], 
                    on='store_nbr', how='left')

print("Train after stores merge:")
print(f"  Shape: {train.shape}")
print(f"  Columns: {train.columns.tolist()}")

print(f"\nStore clusters distribution:")
print(train.groupby('cluster')['store_nbr'].nunique().to_string())

print(f"\nStore types distribution:")
print(train.groupby('type')['store_nbr'].nunique().to_string())

Train after stores merge:
  Shape: (2089152, 13)
  Columns: ['id', 'date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion', 'family', 'class', 'perishable', 'city', 'state', 'type', 'cluster']

Store clusters distribution:
cluster
1     2
3     3
6     1
10    3
17    1

Store types distribution:
type
A    1
B    1
C    3
D    3
E    2


All metadata is merged. The next step is data quality checks and preprocessing: missing values, negative sales (returns), and outliers, following the Week 1 pattern.

In [44]:
### Section 4: Data Quality & Preprocessing
# Source: w01_d03_EDA_quality_preprocessing.ipynb
# Check for missing values, negative sales, outliers

print("Data Quality Assessment:")
print("-" * 60)

print("\nMissing values:")
missing = train.isnull().sum()
missing_pct = (missing / len(train) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing': missing[missing > 0],
    'Percent': missing_pct[missing > 0]
})
print(missing_df.to_string())

print(f"\nUnit sales statistics:")
print(f"  Min: {train['unit_sales'].min()}")
print(f"  Max: {train['unit_sales'].max()}")
print(f"  Mean: {train['unit_sales'].mean():.2f}")
print(f"  Median: {train['unit_sales'].median():.2f}")

print(f"\nNegative sales (returns):")
negative_count = (train['unit_sales'] < 0).sum()
negative_pct = (negative_count / len(train) * 100)
print(f"  Count: {negative_count} ({negative_pct:.2f}%)")

Data Quality Assessment:
------------------------------------------------------------

Missing values:
             Missing  Percent
onpromotion  2089152    100.0

Unit sales statistics:
  Min: -120.0
  Max: 4851.002
  Mean: 7.58
  Median: 3.90

Negative sales (returns):
  Count: 107 (0.01%)


onpromotion is 100% missing for this time period. This differs from the Week 3 development sample, where it had ~16% NaN. Possible explanations:

- The onpromotion column wasn't being tracked in late 2013/early 2014
- The data for Guayas stores specifically doesn't have promotion data
- The filtered period doesn't have promotion data

DEC-014 excluded promotion interaction features but kept the base onpromotion feature, so the missing values still have to be handled. For now they are filled with 0 (no promotion), and this is noted as a limitation.
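One way to distinguish between these explanations is to compute the share of missing onpromotion values per month on the unfiltered data. The sketch below shows the calculation on a toy frame; the dates and values are invented for illustration and say nothing about when promotion tracking actually starts in the real data.

```python
import numpy as np
import pandas as pd

# Toy data (invented): two rows without promotion info, two rows with it
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-12-30", "2013-12-31", "2014-04-01", "2014-04-02"]),
    "onpromotion": [np.nan, np.nan, False, True],
})

# Fraction of NaN values per calendar month
nan_share = toy["onpromotion"].isna().groupby(toy["date"].dt.to_period("M")).mean()
print(nan_share)
```

Running the same grouping by `store_nbr` instead of month would test the store-specific explanation.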

In [45]:
### Section 4.1: Handle Missing Values and Negative Sales
# Source: w01_d03_EDA_quality_preprocessing.ipynb
# Fill onpromotion NaN with 0 (no promotion)
# Keep negative sales (returns are valid business events), clipping only extremes below -100

# Fill onpromotion - 100% NaN in this period suggests no promotion tracking
train['onpromotion'] = train['onpromotion'].fillna(0).astype(int)

print("After preprocessing:")
print(f"  onpromotion filled: {(train['onpromotion'] == 0).sum()} zeros, {(train['onpromotion'] == 1).sum()} ones")

# Clip extreme negatives to reasonable return threshold
train.loc[train['unit_sales'] < -100, 'unit_sales'] = -100

print(f"  Negative sales clipped to >= -100")
print(f"  Current unit_sales range: [{train['unit_sales'].min()}, {train['unit_sales'].max()}]")

print(f"\nFinal train shape: {train.shape}")
print(f"Missing values remaining: {train.isnull().sum().sum()}")

After preprocessing:
  onpromotion filled: 2089152 zeros, 0 ones
  Negative sales clipped to >= -100
  Current unit_sales range: [-100.0, 4851.002]

Final train shape: (2089152, 13)
Missing values remaining: 0


The data is now clean. The next step is the calendar scaffold (a complete date grid), which makes missing store-item-date combinations explicit as zero sales. This is critical for time series.
Feature engineering then follows:

- Temporal features (lags, rolling)
- Calendar features (day of week, etc.)
- Holiday features
- Store/item aggregations

The next cell creates the calendar scaffold.

In [46]:
### Section 5: Create Complete Calendar
# Source: w01_d03_EDA_quality_preprocessing.ipynb
# Missing store-item-date combinations = zero sales
# Create complete date grid for all store-item pairs

from itertools import product

print("Creating complete calendar scaffold...")

# Get unique values
unique_dates = pd.date_range(start=DATE_START, end=DATE_END, freq='D')
unique_stores = train['store_nbr'].unique()
unique_items = train['item_nbr'].unique()

print(f"  Dates: {len(unique_dates)}")
print(f"  Stores: {len(unique_stores)}")
print(f"  Items: {len(unique_items)}")
print(f"  Expected combinations: {len(unique_dates) * len(unique_stores) * len(unique_items):,}")

# Create complete grid
calendar = pd.DataFrame(
    list(product(unique_dates, unique_stores, unique_items)),
    columns=['date', 'store_nbr', 'item_nbr']
)

print(f"\nCalendar scaffold shape: {calendar.shape}")
print(f"  Memory: {calendar.memory_usage(deep=True).sum() / 1e6:.1f} MB")

Creating complete calendar scaffold...
  Dates: 212
  Stores: 10
  Items: 2638
  Expected combinations: 5,592,560

Calendar scaffold shape: (5592560, 3)
  Memory: 134.2 MB


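The scaffold has 5.6M complete combinations (212 dates x 10 stores x 2,638 items). It covers 10 stores rather than the 11 identified for Guayas because unique_stores is taken from the date-filtered train data, so one Guayas store has no sales rows in this period. Likewise, it has 212 dates against the 211 observed in train, so one calendar date has no sales rows at all. The next step merges the actual sales data into this scaffold and fills missing combinations with 0 sales, expanding the data from 2.1M rows (sparse) to 5.6M rows (complete calendar).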
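As an alternative sketch for building the scaffold, `pd.MultiIndex.from_product` produces the same Cartesian grid without first materialising a Python list of tuples. The toy dates and IDs below are assumptions chosen to keep the example small; this is not the method used in the cell above.

```python
import pandas as pd

# Toy inputs (illustrative): 3 dates x 2 stores x 2 items
dates = pd.date_range("2013-09-01", "2013-09-03", freq="D")
store_ids = [24, 26]
item_ids = [96995, 103665]

# Cartesian product as a MultiIndex, then flattened into ordinary columns
grid = pd.MultiIndex.from_product(
    [dates, store_ids, item_ids],
    names=["date", "store_nbr", "item_nbr"],
).to_frame(index=False)

print(grid.shape)
print(grid.head())
```

The row order matches `itertools.product` (the last key varies fastest), so downstream merges behave the same way.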

In [47]:
### Section 5.1: Merge Sales Data with Complete Calendar
# Source: w01_d03_EDA_quality_preprocessing.ipynb
# Left join: calendar with actual sales, fill missing = 0

print("Merging actual sales with calendar scaffold...")

# Merge calendar with train data
df = calendar.merge(
    train[['date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion']],
    on=['date', 'store_nbr', 'item_nbr'],
    how='left'
)

# Fill missing sales with 0 (no sales that day)
df['unit_sales'] = df['unit_sales'].fillna(0)
df['onpromotion'] = df['onpromotion'].fillna(0).astype(int)

print(f"\nComplete dataset:")
print(f"  Shape: {df.shape}")
print(f"  Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

print(f"\nData density:")
original_rows = len(train)
complete_rows = len(df)
density = (original_rows / complete_rows * 100)
print(f"  Original rows: {original_rows:,}")
print(f"  Complete rows: {complete_rows:,}")
print(f"  Density: {density:.1f}% (sparsity: {100-density:.1f}%)")

Merging actual sales with calendar scaffold...

Complete dataset:
  Shape: (5592560, 5)
  Memory: 223.7 MB

Data density:
  Original rows: 2,089,152
  Complete rows: 5,592,560
  Density: 37.4% (sparsity: 62.6%)


The dataset shows 62.6% sparsity, far lower than the 99.1% noted earlier for the sampled data. This makes sense because this run uses the complete Guayas dataset (all families) rather than the sample.
The next step adds the metadata back to the complete dataset.

In [48]:
### Section 5.2: Merge Metadata Back
# Source: w01_d03_EDA_quality_preprocessing.ipynb
# Add items and stores metadata to complete calendar

print("Adding metadata to complete dataset...")

# Merge items metadata
df = df.merge(
    items[['item_nbr', 'family', 'class', 'perishable']],
    on='item_nbr',
    how='left'
)

# Merge stores metadata
df = df.merge(
    stores[['store_nbr', 'city', 'state', 'type', 'cluster']],
    on='store_nbr',
    how='left'
)

print(f"\nComplete dataset with metadata:")
print(f"  Shape: {df.shape}")
print(f"  Columns: {df.columns.tolist()}")
print(f"  Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

print(f"\nSample rows:")
print(df.head(3).to_string(index=False))

Adding metadata to complete dataset...

Complete dataset with metadata:
  Shape: (5592560, 12)
  Columns: ['date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion', 'family', 'class', 'perishable', 'city', 'state', 'type', 'cluster']
  Memory: 1767.5 MB

Sample rows:
      date  store_nbr  item_nbr  unit_sales  onpromotion       family  class  perishable      city  state type  cluster
2013-09-01         24     96995         2.0            0    GROCERY I   1093           0 Guayaquil Guayas    D        1
2013-09-01         24    103665         3.0            0 BREAD/BAKERY   2712           1 Guayaquil Guayas    D        1
2013-09-01         24    105574         8.0            0    GROCERY I   1045           0 Guayaquil Guayas    D        1


The complete dataset now has 5.6M rows. The 62.6% sparsity is consistent with retail data, where not all items sell every day in every store.
Next steps for feature engineering:

- Temporal features (lag features)
- Rolling statistics
- Calendar features
- Aggregations

Following DEC-014, the 33 features exclude:

- Rolling std features
- Oil features
- Promotion interaction features

Lag features come first.

In [49]:
### Section 6: Prepare for Feature Engineering
# Source: w02_d01_FE_lags.ipynb
# Sort by store, item, date to enable lag/rolling calculations

print("Sorting data for time series operations...")

df = df.sort_values(['store_nbr', 'item_nbr', 'date']).reset_index(drop=True)

print(f"\nDataset sorted:")
print(f"  Shape: {df.shape}")
print(f"  Order: store_nbr, item_nbr, date")

print(f"\nSample sorted data (first store-item pair):")
sample = df.head(5)[['date', 'store_nbr', 'item_nbr', 'unit_sales']]
print(sample.to_string(index=False))

Sorting data for time series operations...

Dataset sorted:
  Shape: (5592560, 12)
  Order: store_nbr, item_nbr, date

Sample sorted data (first store-item pair):
      date  store_nbr  item_nbr  unit_sales
2013-09-01         24     96995         2.0
2013-09-02         24     96995         2.0
2013-09-03         24     96995         2.0
2013-09-04         24     96995         0.0
2013-09-05         24     96995         0.0


The data is sorted by store, item, and date, so feature engineering can start. According to DEC-014, the 33 features are:

- Temporal (8): unit_sales_lag1, lag7, lag14, lag30, unit_sales_7d_avg, 14d_avg, 30d_avg, unit_sales_lag1_7d_corr
- Calendar (7): year, month, day, dayofweek, dayofyear, weekofyear, quarter
- Holiday (4): holiday_proximity, is_holiday, holiday_period, days_to_next_holiday
- Promotion (2): onpromotion, promo_item_interaction
- Store/Item (7): cluster, store_avg_sales, item_avg_sales, item_store_avg, cluster_avg_sales, family_avg_sales, city_avg_sales
- Derived (5): perishable, weekend, month_start, month_end, is_payday

Lag features come first.

In [50]:
### Section 7: Feature Engineering - Lag Features
# Source: w02_d01_FE_lags.ipynb
# Create lag 1, 7, 14, 30 for autocorrelation patterns

print("Creating lag features...")

# Group by store-item for time series operations
grouped = df.groupby(['store_nbr', 'item_nbr'])['unit_sales']

# Create lags
df['unit_sales_lag1'] = grouped.shift(1)
df['unit_sales_lag7'] = grouped.shift(7)
df['unit_sales_lag14'] = grouped.shift(14)
df['unit_sales_lag30'] = grouped.shift(30)

print(f"\nLag features created:")
print(f"  unit_sales_lag1, lag7, lag14, lag30")

print(f"\nNaN counts (expected at start of series):")
for col in ['unit_sales_lag1', 'unit_sales_lag7', 'unit_sales_lag14', 'unit_sales_lag30']:
    nan_count = df[col].isna().sum()
    nan_pct = (nan_count / len(df) * 100)
    print(f"  {col}: {nan_count:,} ({nan_pct:.2f}%)")

print(f"\nSample with lags (first store-item pair):")
sample = df[(df['store_nbr'] == 24) & (df['item_nbr'] == 96995)].head(35)
sample_cols = ['date', 'unit_sales', 'unit_sales_lag1', 'unit_sales_lag7', 'unit_sales_lag14', 'unit_sales_lag30']
print(sample[sample_cols].to_string(index=False))

Creating lag features...

Lag features created:
  unit_sales_lag1, lag7, lag14, lag30

NaN counts (expected at start of series):
  unit_sales_lag1: 26,380 (0.47%)
  unit_sales_lag7: 184,660 (3.30%)
  unit_sales_lag14: 369,320 (6.60%)
  unit_sales_lag30: 791,400 (14.15%)

Sample with lags (first store-item pair):
      date  unit_sales  unit_sales_lag1  unit_sales_lag7  unit_sales_lag14  unit_sales_lag30
2013-09-01         2.0              NaN              NaN               NaN               NaN
2013-09-02         2.0              2.0              NaN               NaN               NaN
2013-09-03         2.0              2.0              NaN               NaN               NaN
2013-09-04         0.0              2.0              NaN               NaN               NaN
2013-09-05         0.0              0.0              NaN               NaN               NaN
2013-09-06         3.0              0.0              NaN               NaN               NaN
2013-09-07         1.0             

The lag features are working correctly. The NaN counts are expected - lag30 has ~14% NaN because the first 30 days of each store-item series don't have lag30 values yet.
Next step is rolling average features (7d, 14d, 30d averages). Per DEC-014, we exclude rolling standard deviation features.
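The NaN counts follow directly from the number of store-item series: each series loses its first k rows for a lag of k. With 10 stores x 2,638 items = 26,380 series, lag1 gives 26,380 NaN, lag7 gives 184,660, and lag30 gives 791,400. A small sketch of the same check on toy data (the frame and names are illustrative assumptions):

```python
import pandas as pd

# Toy data: two store-item series of five days each
n_days = 5
toy = pd.DataFrame({
    "store_nbr": [1] * n_days + [2] * n_days,
    "item_nbr": [10] * (2 * n_days),
    "unit_sales": [float(i) for i in range(2 * n_days)],
})

grouped = toy.groupby(["store_nbr", "item_nbr"])["unit_sales"]
n_series = grouped.ngroups

# Each series contributes exactly k NaN values for a lag of k days
nan_counts = {k: int(grouped.shift(k).isna().sum()) for k in (1, 3)}
expected = {k: n_series * k for k in (1, 3)}
print(nan_counts, expected)

# Same arithmetic with the numbers reported above
real_series = 10 * 2638
print(real_series, real_series * 7, real_series * 14, real_series * 30)
```

A mismatch between observed and expected NaN counts would suggest that the frame is not sorted by date within each series or that the calendar is incomplete.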

Does DEC-014 (excluding rolling std features) still apply now that we have the full dataset instead of the 300K sample? We stick with DEC-014 for now.

Rationale to keep the exclusion:

- The decision was validated through ablation testing on the 300K sample
- Rolling std measures volatility, which is likely still noisy even with more data in a sparse (62.6% zeros) dataset
- Scientific method: apply learned decisions first, then validate at scale

Validation in Notebook 2:

- Addition 3 (Feature Importance Stability Check) will test whether DEC-014 holds
- If rolling std shows high importance on full data, that is documented as a finding
- DEC-017 could be created to revisit the decision if needed

In [51]:
### Section 8: Feature Engineering - Rolling Statistics
# Source: w02_d02_FE_rolling.ipynb
# Create 7, 14, 30-day rolling averages
# DEC-014: EXCLUDE rolling std features (caused 3.82% RMSE degradation)

print("Creating rolling average features...")

# Group by store-item for rolling calculations
grouped = df.groupby(['store_nbr', 'item_nbr'])['unit_sales']

# Create rolling averages (min_periods=1 to handle start of series)
df['unit_sales_7d_avg'] = grouped.transform(lambda x: x.rolling(window=7, min_periods=1).mean())
df['unit_sales_14d_avg'] = grouped.transform(lambda x: x.rolling(window=14, min_periods=1).mean())
df['unit_sales_30d_avg'] = grouped.transform(lambda x: x.rolling(window=30, min_periods=1).mean())

print(f"\nRolling average features created:")
print(f"  unit_sales_7d_avg, 14d_avg, 30d_avg")
print(f"  Rolling std features EXCLUDED per DEC-014")

print(f"\nSample with rolling averages (first store-item pair):")
sample = df[(df['store_nbr'] == 24) & (df['item_nbr'] == 96995)].head(35)
sample_cols = ['date', 'unit_sales', 'unit_sales_7d_avg', 'unit_sales_14d_avg', 'unit_sales_30d_avg']
print(sample[sample_cols].to_string(index=False))

Creating rolling average features...

Rolling average features created:
  unit_sales_7d_avg, 14d_avg, 30d_avg
  Rolling std features EXCLUDED per DEC-014

Sample with rolling averages (first store-item pair):
      date  unit_sales  unit_sales_7d_avg  unit_sales_14d_avg  unit_sales_30d_avg
2013-09-01         2.0           2.000000            2.000000            2.000000
2013-09-02         2.0           2.000000            2.000000            2.000000
2013-09-03         2.0           2.000000            2.000000            2.000000
2013-09-04         0.0           1.500000            1.500000            1.500000
2013-09-05         0.0           1.200000            1.200000            1.200000
2013-09-06         3.0           1.500000            1.500000            1.500000
2013-09-07         1.0           1.428571            1.428571            1.428571
2013-09-08         1.0           1.285714            1.375000            1.375000
2013-09-09         1.0           1.142857            

According to the handoff document, the next feature engineering steps are:

- Skip oil features (DEC-014 excluded them)
- Create aggregation features (store/item/cluster averages)
- Create calendar features
- Create holiday features
- Create derived features
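One caveat on the rolling averages from Section 8: the window includes the current day's unit_sales, which is why the first row's 7-day average equals its own sales value (2.0). If unit_sales on the same row is later used as the prediction target, the average would carry information from the target. Whether that applies depends on how the target is defined downstream, which this section does not show. A hedged sketch of a variant that uses only prior days, on toy data with illustrative column names:

```python
import pandas as pd

# Toy series mirroring the first five days of the sample above
toy = pd.DataFrame({
    "store_nbr": [24] * 5,
    "item_nbr": [96995] * 5,
    "unit_sales": [2.0, 2.0, 2.0, 0.0, 0.0],
})
grouped = toy.groupby(["store_nbr", "item_nbr"])["unit_sales"]

# Window that includes the current day (as in Section 8, with a 3-day window here)
toy["avg_incl_today"] = grouped.transform(
    lambda x: x.rolling(window=3, min_periods=1).mean()
)

# Shift by one day first so the window only covers earlier days
toy["avg_prior_days"] = grouped.transform(
    lambda x: x.shift(1).rolling(window=3, min_periods=1).mean()
)

print(toy[["unit_sales", "avg_incl_today", "avg_prior_days"]])
```

The shifted variant leaves the first row of each series as NaN, in the same way the lag features do.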

In [52]:
### Section 8.1: Lag-Correlation Feature
# Source: w02_d02_FE_rolling.ipynb
# Interaction between lag1 and the 7-day average (a plain product, not a correlation coefficient)

print("Creating lag-correlation feature...")

# Multiply lag1 by the 7d average; the "_corr" name follows the DEC-014 feature list
df['unit_sales_lag1_7d_corr'] = df['unit_sales_lag1'] * df['unit_sales_7d_avg']

print(f"\nLag-correlation feature created:")
print(f"  unit_sales_lag1_7d_corr = lag1 * 7d_avg")

print(f"\nFeature statistics:")
print(f"  Min: {df['unit_sales_lag1_7d_corr'].min():.2f}")
print(f"  Max: {df['unit_sales_lag1_7d_corr'].max():.2f}")
print(f"  Mean: {df['unit_sales_lag1_7d_corr'].mean():.2f}")
print(f"  NaN count: {df['unit_sales_lag1_7d_corr'].isna().sum():,}")

Creating lag-correlation feature...

Lag-correlation feature created:
  unit_sales_lag1_7d_corr = lag1 * 7d_avg

Feature statistics:
  Min: -4900.00
  Max: 3518485.71
  Mean: 81.81
  NaN count: 26,380


From the handoff document, the 33 features include:

- Temporal (8): unit_sales_lag1, lag7, lag14, lag30, unit_sales_7d_avg, 14d_avg, 30d_avg, unit_sales_lag1_7d_corr
- Calendar (7): year, month, day, dayofweek, dayofyear, weekofyear, quarter
- Holiday (4): holiday_proximity, is_holiday, holiday_period, days_to_next_holiday
- Promotion (2): onpromotion, promo_item_interaction
- Store/Item (7): cluster, store_avg_sales, item_avg_sales, item_store_avg, cluster_avg_sales, family_avg_sales, city_avg_sales
- Derived (5): perishable, weekend, month_start, month_end, is_payday

Oil features are EXCLUDED per DEC-014.
Calendar features come next.

In [53]:
### Section 9: Feature Engineering - Calendar Features
# Source: w02_d04_FE_aggregations.ipynb (temporal components)
# Extract year, month, day, dayofweek, etc.

print("Creating calendar features...")

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
df['dayofyear'] = df['date'].dt.dayofyear
df['weekofyear'] = df['date'].dt.isocalendar().week
df['quarter'] = df['date'].dt.quarter

print(f"\nCalendar features created:")
print(f"  year, month, day, dayofweek, dayofyear, weekofyear, quarter")

print(f"\nSample calendar features:")
sample = df[['date', 'year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']].head(10)
print(sample.to_string(index=False))

Creating calendar features...

Calendar features created:
  year, month, day, dayofweek, dayofyear, weekofyear, quarter

Sample calendar features:
      date  year  month  day  dayofweek  dayofyear  weekofyear  quarter
2013-09-01  2013      9    1          6        244          35        3
2013-09-02  2013      9    2          0        245          36        3
2013-09-03  2013      9    3          1        246          36        3
2013-09-04  2013      9    4          2        247          36        3
2013-09-05  2013      9    5          3        248          36        3
2013-09-06  2013      9    6          4        249          36        3
2013-09-07  2013      9    7          5        250          36        3
2013-09-08  2013      9    8          6        251          36        3
2013-09-09  2013      9    9          0        252          37        3
2013-09-10  2013      9   10          1        253          37        3
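
One detail worth checking in the calendar features is that `weekofyear` follows ISO 8601 week numbering, so the last days of December can belong to week 1 of the following ISO year while `year` still reports the calendar year. The sketch below is a small illustration on hand-picked dates (the `iso_year` column is an assumption added for illustration, not one of the 33 features); it also casts the nullable `UInt32` result of `isocalendar()` to a plain integer.

```python
import pandas as pd

# Hand-picked dates around the 2013/2014 boundary (illustrative only).
dates = pd.Series(pd.to_datetime(
    ["2013-09-01", "2013-12-29", "2013-12-30", "2014-01-01"]
))

# isocalendar() returns a DataFrame with columns: year, week, day.
iso = dates.dt.isocalendar()

calendar = pd.DataFrame({
    "date": dates,
    "year": dates.dt.year,                        # calendar year
    "iso_year": iso["year"].astype("int64"),      # ISO year (hypothetical extra column)
    "weekofyear": iso["week"].astype("int64"),    # cast from nullable UInt32
})

print(calendar.to_string(index=False))
```

If a tree model splits on `year` and `weekofyear` together, 2013-12-30 and 2013-12-31 look like "week 1 of 2013"; whether that matters for this pipeline is a judgment call.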


From the handoff, the 33 final features include Holiday (4): holiday_proximity, is_holiday, holiday_period, days_to_next_holiday.

In [54]:
### Section 10: Feature Engineering - Holiday Features
# Source: w02_d04_FE_aggregations.ipynb
# Create holiday proximity, is_holiday, holiday_period, days_to_next_holiday
# This maps unique dates (212) instead of iterating 5.6M rows.

print("Creating holiday features (vectorized)...")

# Create is_holiday flag (1 if date is a holiday)
holiday_dates_set = set(holidays['date'].dt.date)
df['is_holiday'] = df['date'].dt.date.isin(holiday_dates_set).astype(int)

# Vectorized holiday proximity - work with unique dates only
unique_dates = df['date'].unique()
holiday_dates_np = holidays['date'].values  # numpy datetime64

# Create mapping for unique dates
proximity_map = {}
for date in unique_dates:
    date_np = np.datetime64(date, 'D')  # Convert to numpy datetime64
    distances = (holiday_dates_np.astype('datetime64[D]') - date_np).astype(int)
    proximity_map[pd.Timestamp(date)] = int(distances[np.abs(distances).argmin()])

df['holiday_proximity'] = df['date'].map(proximity_map)

# Days to next holiday (computed once per unique date, then mapped to all rows)
days_to_next_map = {}
for date in unique_dates:
    date_np = np.datetime64(date, 'D')
    future_holidays = holiday_dates_np[holiday_dates_np > date]
    if len(future_holidays) > 0:
        days = int((future_holidays.min().astype('datetime64[D]') - date_np).astype(int))
    else:
        days = 999  # Sentinel: no later holiday exists in the holidays table
    days_to_next_map[pd.Timestamp(date)] = days

df['days_to_next_holiday'] = df['date'].map(days_to_next_map)

# Holiday period: 1 if within 3 days of holiday
df['holiday_period'] = (df['holiday_proximity'].abs() <= 3).astype(int)

print(f"\nHoliday features created:")
print(f"  is_holiday, holiday_proximity, days_to_next_holiday, holiday_period")

print(f"\nHoliday feature statistics:")
print(f"  Holidays in period: {df['is_holiday'].sum():,} rows")
print(f"  Holiday periods (±3 days): {df['holiday_period'].sum():,} rows")
print(f"  Holiday proximity range: [{df['holiday_proximity'].min()}, {df['holiday_proximity'].max()}]")

Creating holiday features (vectorized)...

Holiday features created:
  is_holiday, holiday_proximity, days_to_next_holiday, holiday_period

Holiday feature statistics:
  Holidays in period: 659,500 rows
  Holiday periods (±3 days): 1,899,360 rows
  Holiday proximity range: [-30, 29]
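
The cell above loops over the unique dates, which is already cheap. For reference, the same two distance features can be computed without any Python loop by using `np.searchsorted` on the sorted holiday dates. This is a minimal sketch, not the notebook's implementation: the function names and the toy dates are assumptions, the sign convention (holiday minus date) follows the cell above, and ties between the previous and next holiday are resolved toward the previous one, which matches `argmin` only if the holidays table is sorted by date.

```python
import numpy as np
import pandas as pd


def _as_day_numbers(values):
    """Convert date-like values to integer days since 1970-01-01."""
    stamps = pd.to_datetime(pd.Series(values)).to_numpy()
    return stamps.astype("datetime64[D]").astype("int64")


def holiday_distance_features(dates, holiday_dates, no_future_sentinel=999):
    """Return (holiday_proximity, days_to_next_holiday) as integer arrays."""
    days = _as_day_numbers(dates)
    hol = np.unique(_as_day_numbers(holiday_dates))  # sorted, de-duplicated
    if hol.size == 0:
        raise ValueError("holiday_dates must not be empty")

    nxt = np.searchsorted(hol, days, side="right")  # first holiday strictly after date
    prv = nxt - 1                                   # last holiday on or before date
    has_next = nxt < hol.size
    has_prev = prv >= 0

    far = np.iinfo(np.int64).max // 4  # stand-in distance when no holiday exists on that side
    next_gap = np.where(has_next, hol[np.clip(nxt, 0, hol.size - 1)] - days, far)
    prev_gap = np.where(has_prev, hol[np.clip(prv, 0, hol.size - 1)] - days, -far)

    # Signed distance to the nearest holiday (negative = holiday is in the past).
    proximity = np.where(-prev_gap <= next_gap, prev_gap, next_gap)
    days_to_next = np.where(has_next, next_gap, no_future_sentinel)
    return proximity, days_to_next


# Toy example (dates are made up for illustration).
toy_dates = ["2014-01-01", "2014-01-03", "2014-01-08", "2014-01-12"]
toy_holidays = ["2013-12-25", "2014-01-01", "2014-01-10", "2014-01-01"]
proximity, days_to_next = holiday_distance_features(toy_dates, toy_holidays)
print(proximity.tolist(), days_to_next.tolist())
```

With only 212 unique dates the loop and the `searchsorted` version should be practically equivalent in runtime; the vectorized form mainly helps if the same logic is later applied per row or to a much longer calendar.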


The holiday features are created. The remaining steps are:

- Aggregation features (store/item/cluster averages)
- Derived features (perishable, weekend, month_start, month_end, is_payday)
- Promotion interaction feature
- Final assembly and save

In [55]:
### Section 11: Feature Engineering - Aggregation Features
# Source: w02_d04_FE_aggregations.ipynb
# Create store, item, cluster, family, city average sales

print("Creating aggregation features...")

# Calculate various aggregations (use all data for global patterns)
agg_features = {}

# Store average sales
agg_features['store_avg_sales'] = df.groupby('store_nbr')['unit_sales'].transform('mean')

# Item average sales
agg_features['item_avg_sales'] = df.groupby('item_nbr')['unit_sales'].transform('mean')

# Item-store interaction average
agg_features['item_store_avg'] = df.groupby(['item_nbr', 'store_nbr'])['unit_sales'].transform('mean')

# Cluster average sales
agg_features['cluster_avg_sales'] = df.groupby('cluster')['unit_sales'].transform('mean')

# Family average sales
agg_features['family_avg_sales'] = df.groupby('family')['unit_sales'].transform('mean')

# City average sales
agg_features['city_avg_sales'] = df.groupby('city')['unit_sales'].transform('mean')

# Add to dataframe
for feature_name, feature_values in agg_features.items():
    df[feature_name] = feature_values

print(f"\nAggregation features created:")
for feature_name in agg_features.keys():
    print(f"  {feature_name}")

print(f"\nAggregation feature statistics:")
for feature_name in agg_features.keys():
    print(f"  {feature_name}: mean={df[feature_name].mean():.2f}, std={df[feature_name].std():.2f}")

Creating aggregation features...

Aggregation features created:
  store_avg_sales
  item_avg_sales
  item_store_avg
  cluster_avg_sales
  family_avg_sales
  city_avg_sales

Aggregation feature statistics:
  store_avg_sales: mean=2.83, std=1.54
  item_avg_sales: mean=2.83, std=4.67
  item_store_avg: mean=2.83, std=6.90
  cluster_avg_sales: mean=2.83, std=1.47
  family_avg_sales: mean=2.83, std=1.15
  city_avg_sales: mean=2.83, std=0.53
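
The aggregations above are computed over all rows, so each average also includes sales from whatever period is later held out for testing. If that is a concern, one alternative is to estimate the group means on the training rows only and merge them back. The sketch below assumes a hypothetical cutoff date and helper name (`add_group_mean`); groups that never appear in the fitting rows fall back to the global mean of the fitting rows.

```python
import pandas as pd


def add_group_mean(df, keys, target, name, fit_mask):
    """Attach the mean of `target` per `keys`, estimated only where fit_mask is True."""
    fit_rows = df.loc[fit_mask]
    means = fit_rows.groupby(keys)[target].mean().rename(name).reset_index()
    out = df.merge(means, on=keys, how="left")
    # Groups unseen in the fitting rows get the global fitting mean.
    out[name] = out[name].fillna(fit_rows[target].mean())
    return out


# Toy data (values are made up for illustration).
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-02-27", "2014-02-28", "2014-02-28", "2014-03-01", "2014-03-02"]
    ),
    "store_nbr": [1, 1, 2, 1, 3],
    "unit_sales": [2.0, 4.0, 10.0, 100.0, 7.0],
})

cutoff = pd.Timestamp("2014-03-01")  # hypothetical train/test boundary
toy = add_group_mean(
    toy, ["store_nbr"], "unit_sales", "store_avg_sales", toy["date"] < cutoff
)
print(toy)
```

Here the March spike of 100 units does not affect `store_avg_sales` for store 1, because only the February rows are used to estimate it.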


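Every aggregation reports the same mean (2.83) because a row-weighted average of group means equals the overall mean of unit_sales; only the spread differs by grouping level. The next cells create the derived features (weekend, month_start, month_end, is_payday) and then the promotion interaction feature.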

In [56]:
### Section 12: Feature Engineering - Derived Features
# Source: w02_d04_FE_aggregations.ipynb
# Create weekend, month_start, month_end, is_payday flags

print("Creating derived features...")

# Weekend flag (Saturday=5, Sunday=6)
df['weekend'] = (df['dayofweek'].isin([5, 6])).astype(int)

# Month start/end flags
df['month_start'] = (df['day'] <= 7).astype(int)
df['month_end'] = (df['day'] >= 24).astype(int)

# Payday flag (15th and last day of month)
df['is_payday'] = ((df['day'] == 15) | df['date'].dt.is_month_end).astype(int)

print(f"\nDerived features created:")
print(f"  weekend, month_start, month_end, is_payday")

print(f"\nDerived feature statistics:")
print(f"  Weekend days: {df['weekend'].sum():,} rows ({df['weekend'].mean()*100:.1f}%)")
print(f"  Month start (days 1-7): {df['month_start'].sum():,} rows ({df['month_start'].mean()*100:.1f}%)")
print(f"  Month end (days 24+): {df['month_end'].sum():,} rows ({df['month_end'].mean()*100:.1f}%)")
print(f"  Payday (15th, last): {df['is_payday'].sum():,} rows ({df['is_payday'].mean()*100:.1f}%)")

Creating derived features...

Derived features created:
  weekend, month_start, month_end, is_payday

Derived feature statistics:
  Weekend days: 1,609,180 rows (28.8%)
  Month start (days 1-7): 1,292,620 rows (23.1%)
  Month end (days 24+): 1,345,380 rows (24.1%)
  Payday (15th, last): 369,320 rows (6.6%)
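
As a quick sanity check of the flag definitions, the same rules can be applied to a single month and counted by hand. The sketch below uses a hypothetical helper (`derived_calendar_flags`) on February 2014, which starts on a Saturday and has 28 days.

```python
import pandas as pd


def derived_calendar_flags(dates):
    """Recompute the four date-based flags with the same rules as the cell above."""
    dates = pd.to_datetime(pd.Series(dates))
    day = dates.dt.day
    return pd.DataFrame({
        "date": dates,
        "weekend": (dates.dt.dayofweek >= 5).astype(int),       # Saturday=5, Sunday=6
        "month_start": (day <= 7).astype(int),                  # days 1-7
        "month_end": (day >= 24).astype(int),                   # days 24+
        "is_payday": ((day == 15) | dates.dt.is_month_end).astype(int),
    })


flags = derived_calendar_flags(pd.date_range("2014-02-01", "2014-02-28"))
print(flags.drop(columns="date").sum())
```

February 2014 gives 8 weekend days, 7 month-start days, 5 month-end days (24th to 28th), and 2 paydays (15th and 28th). Note that `month_end` covers a different number of days depending on month length.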


Next is the promotion interaction feature. Per DEC-014, these interactions are excluded:

- promo_holiday_category
- promo_item_avg
- promo_cluster

Only promo_item_interaction is kept.

In [57]:
### Section 13: Feature Engineering - Promotion Interaction
# Source: w02_d04_FE_aggregations.ipynb
# Create promo_item_interaction (onpromotion * item_avg_sales)
# DEC-014: EXCLUDE promo_holiday_category, promo_item_avg, promo_cluster

print("Creating promotion interaction feature...")

# Promotion * item average interaction
df['promo_item_interaction'] = df['onpromotion'] * df['item_avg_sales']

print(f"\nPromotion interaction feature created:")
print(f"  promo_item_interaction = onpromotion * item_avg_sales")
print(f"  Other promotion interactions EXCLUDED per DEC-014")

print(f"\nFeature statistics:")
print(f"  Non-zero interactions: {(df['promo_item_interaction'] > 0).sum():,}")
print(f"  Mean: {df['promo_item_interaction'].mean():.2f}")
print(f"  Max: {df['promo_item_interaction'].max():.2f}")

Creating promotion interaction feature...

Promotion interaction feature created:
  promo_item_interaction = onpromotion * item_avg_sales
  Other promotion interactions EXCLUDED per DEC-014

Feature statistics:
  Non-zero interactions: 0
  Mean: 0.00
  Max: 0.00


All zeros, as expected: onpromotion is 0 for every row in this period, so promo_item_interaction is constant and carries no signal here. The feature is still included for consistency with DEC-014.
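
Constant columns like this one can be listed explicitly, so that it is clear which features a model cannot learn from in the current period. This is a minimal sketch with a hypothetical helper name and toy data; it is not part of the original notebook.

```python
import pandas as pd


def constant_columns(df, columns):
    """Return the columns that hold a single distinct value (NaN counts as a value)."""
    return [c for c in columns if df[c].nunique(dropna=False) <= 1]


# Toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "onpromotion": [0, 0, 0, 0],
    "promo_item_interaction": [0.0, 0.0, 0.0, 0.0],
    "item_avg_sales": [1.5, 2.0, 1.5, 3.0],
})

print(constant_columns(toy, toy.columns))
```

On the toy frame this reports `onpromotion` and `promo_item_interaction`, mirroring what the statistics above show for the full dataset.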

In [58]:
### Section 14: Final Feature Set (33 Features per DEC-014)
# Source: w02_d05_FE_final.ipynb
# Verify we have exactly 33 features as specified

# Define the 33 features per DEC-014
feature_columns = [
    # Temporal (8)
    'unit_sales_lag1', 'unit_sales_lag7', 'unit_sales_lag14', 'unit_sales_lag30',
    'unit_sales_7d_avg', 'unit_sales_14d_avg', 'unit_sales_30d_avg',
    'unit_sales_lag1_7d_corr',
    
    # Calendar (7)
    'year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter',
    
    # Holiday (4)
    'holiday_proximity', 'is_holiday', 'holiday_period', 'days_to_next_holiday',
    
    # Promotion (2)
    'onpromotion', 'promo_item_interaction',
    
    # Store/Item (7)
    'cluster', 'store_avg_sales', 'item_avg_sales', 'item_store_avg',
    'cluster_avg_sales', 'family_avg_sales', 'city_avg_sales',
    
    # Derived (5)
    'perishable', 'weekend', 'month_start', 'month_end', 'is_payday'
]

print("Feature Set Summary (DEC-014):")
print("-" * 60)
print(f"Total features: {len(feature_columns)}")

# Verify all features exist
missing_features = [f for f in feature_columns if f not in df.columns]
if missing_features:
    print(f"\nMISSING FEATURES: {missing_features}")
else:
    print(f"\nAll {len(feature_columns)} features present in dataset")

print(f"\nFeatures by category:")
print(f"  Temporal (8): lags, rolling averages, correlation")
print(f"  Calendar (7): year, month, day, dayofweek, etc.")
print(f"  Holiday (4): proximity, flags, days to next")
print(f"  Promotion (2): onpromotion, item interaction")
print(f"  Store/Item (7): cluster, hierarchical averages")
print(f"  Derived (5): weekend, month flags, payday")

print(f"\nCurrent dataset shape: {df.shape}")

Feature Set Summary (DEC-014):
------------------------------------------------------------
Total features: 33

All 33 features present in dataset

Features by category:
  Temporal (8): lags, rolling averages, correlation
  Calendar (7): year, month, day, dayofweek, etc.
  Holiday (4): proximity, flags, days to next
  Promotion (2): onpromotion, item interaction
  Store/Item (7): cluster, hierarchical averages
  Derived (5): weekend, month flags, payday

Current dataset shape: (5592560, 42)
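
The check above only prints a message when features are missing. In a production pipeline it may be preferable to fail loudly. The sketch below is one way to do that; the function name and error wording are assumptions, and it additionally guards against duplicated names and a wrong feature count.

```python
def check_feature_columns(available, expected, n_expected):
    """Raise ValueError unless `expected` has n_expected unique names, all available."""
    expected = list(expected)
    available = set(available)
    duplicates = sorted({c for c in expected if expected.count(c) > 1})
    missing = [c for c in expected if c not in available]
    problems = []
    if len(expected) != n_expected:
        problems.append(f"expected {n_expected} features, got {len(expected)}")
    if duplicates:
        problems.append(f"duplicated: {duplicates}")
    if missing:
        problems.append(f"missing: {missing}")
    if problems:
        raise ValueError("; ".join(problems))
    return True


# Toy usage (column names taken from the feature list, counts are illustrative).
ok = check_feature_columns(
    available=["date", "year", "month", "weekend"],
    expected=["year", "month", "weekend"],
    n_expected=3,
)
print(ok)
```

In the notebook this would be called as `check_feature_columns(df.columns, feature_columns, 33)`.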


All 33 features are present. The dataset has 42 columns in total: 8 metadata columns (date, store_nbr, item_nbr, family, class, city, state, type), the target (unit_sales), and the 33 features.
The next cell handles the NaN values introduced by the lag features; after that, the final columns are selected for export.

In [59]:
### Section 15: Handle NaN Values
# Source: w02_d05_FE_final.ipynb
# NaN values exist at start of time series from lag features
# Per DEC-013: 7-day gap + 30-day lags means September data is buffer

print("Handling NaN values from lag features...")

print(f"\nNaN counts by feature:")
nan_counts = df[feature_columns].isnull().sum()
nan_counts = nan_counts[nan_counts > 0].sort_values(ascending=False)
for feat, count in nan_counts.items():
    pct = (count / len(df) * 100)
    print(f"  {feat}: {count:,} ({pct:.2f}%)")

# Strategy: Drop rows with ANY NaN in features
# These are the September rows (first 30 days), used as buffer for lags
df_clean = df.dropna(subset=feature_columns).copy()

print(f"\nAfter dropping NaN rows:")
print(f"  Original: {len(df):,} rows")
print(f"  Clean: {len(df_clean):,} rows")
print(f"  Dropped: {len(df) - len(df_clean):,} rows ({(len(df) - len(df_clean))/len(df)*100:.2f}%)")

print(f"\nDate range after cleaning:")
print(f"  Start: {df_clean['date'].min().date()}")
print(f"  End: {df_clean['date'].max().date()}")

Handling NaN values from lag features...

NaN counts by feature:
  unit_sales_lag30: 791,400 (14.15%)
  unit_sales_lag14: 369,320 (6.60%)
  unit_sales_lag7: 184,660 (3.30%)
  unit_sales_lag1: 26,380 (0.47%)
  unit_sales_lag1_7d_corr: 26,380 (0.47%)

After dropping NaN rows:
  Original: 5,592,560 rows
  Clean: 4,801,160 rows
  Dropped: 791,400 rows (14.15%)

Date range after cleaning:
  Start: 2013-10-01
  End: 2014-03-31
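
The NaN counts above follow a simple pattern: a lag of k days leaves the first k days of every store-item series empty. With 10 stores × 2,638 items = 26,380 series, lag30 gives 30 × 26,380 = 791,400 NaN rows, and dropping them removes exactly the first 30 days. The sketch below reproduces the arithmetic on a toy scaffold, assuming the lags are built with a per-series `groupby(...).shift(k)` (the lag cell itself is outside this section, so that is an assumption).

```python
import numpy as np
import pandas as pd

n_series, n_days = 3, 40  # toy sizes, not the real dataset

toy = pd.DataFrame({
    "series_id": np.repeat(np.arange(n_series), n_days),
    "date": np.tile(pd.date_range("2013-09-01", periods=n_days), n_series),
    "unit_sales": 1.0,
}).sort_values(["series_id", "date"])

lag_cols = []
for k in (1, 7, 14, 30):
    col = f"unit_sales_lag{k}"
    toy[col] = toy.groupby("series_id")["unit_sales"].shift(k)
    lag_cols.append(col)

nan_counts = toy[lag_cols].isna().sum()
clean = toy.dropna(subset=lag_cols)

print(nan_counts)
print(len(toy), "->", len(clean), "rows; first clean date:", clean["date"].min().date())
```

Each lag column has k × 3 missing values, and the cleaned frame starts on day 31 of the toy calendar (2013-10-01), matching the behaviour seen on the full data.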


We dropped the first 30 days (September 2013), which served as the buffer period for the lag features. The data now runs from October 1, 2013 through March 31, 2014, which is the range needed for DEC-016 training (Q4 2013 + Q1 2014) and test (March 2014).
Next, we prepare the final dataset with metadata columns and feature columns, then save it to pickle.

In [60]:
### Section 16: Prepare Final Dataset
# Source: w02_d05_FE_final.ipynb
# Organize columns: metadata + target + features

# Metadata columns (for filtering, not for modeling)
metadata_columns = ['date', 'store_nbr', 'item_nbr', 'family', 'class', 
                    'city', 'state', 'type']

# Target column
target_column = 'unit_sales'

# Final column order
final_columns = metadata_columns + [target_column] + feature_columns

# Create final dataset
df_final = df_clean[final_columns].copy()

print("Final dataset structure:")
print("-" * 60)
print(f"Shape: {df_final.shape}")
print(f"Memory: {df_final.memory_usage(deep=True).sum() / 1e6:.1f} MB")

print(f"\nColumn organization:")
print(f"  Metadata (8): {metadata_columns}")
print(f"  Target (1): {target_column}")
print(f"  Features (33): [see feature_columns list]")

print(f"\nDate coverage:")
print(f"  Start: {df_final['date'].min().date()}")
print(f"  End: {df_final['date'].max().date()}")
print(f"  Days: {df_final['date'].nunique()}")

print(f"\nStore-item coverage:")
print(f"  Stores: {df_final['store_nbr'].nunique()}")
print(f"  Items: {df_final['item_nbr'].nunique()}")
print(f"  Families: {df_final['family'].nunique()}")

Final dataset structure:
------------------------------------------------------------
Shape: (4801160, 42)
Memory: 2578.5 MB

Column organization:
  Metadata (8): ['date', 'store_nbr', 'item_nbr', 'family', 'class', 'city', 'state', 'type']
  Target (1): unit_sales
  Features (33): [see feature_columns list]

Date coverage:
  Start: 2013-10-01
  End: 2014-03-31
  Days: 182

Store-item coverage:
  Stores: 10
  Items: 2638
  Families: 32
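
At roughly 2.6 GB in memory, the final frame may benefit from smaller numeric dtypes before it is passed to model training. The sketch below is an optional idea, not something the notebook does: the helper name is hypothetical, and downcasting floats to `float32` trades precision for memory, so it should be checked against the downstream models first.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy with integer and float columns stored in the smallest safe dtype."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")  # float32 at the smallest
    return out


# Toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "weekend": np.array([0, 1, 0, 1], dtype="int64"),
    "cluster": np.array([1, 17, 3, 6], dtype="int64"),
    "unit_sales": np.array([0.0, 2.5, -1.0, 4.0], dtype="float64"),
})

small = downcast_numeric(toy)
print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum(), "bytes")
```

Binary flags and small categorical codes shrink from 8 bytes to 1 byte per value; object columns such as family or city are left untouched by this helper.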


We have 4.8M rows covering October 1, 2013 to March 31, 2014: the complete dataset with all 33 features. Next steps:

- Data quality validation
- Save the processed dataset
- Final summary with comparison to Week 3 sample

In [61]:
### Section 17: Final Data Quality Validation
# Source: w02_d05_FE_final.ipynb
# Verify data quality before export

print("Final data quality validation:")
print("-" * 60)

# Check for any remaining NaN in features
nan_check = df_final[feature_columns].isnull().sum().sum()
print(f"\nNaN values in features: {nan_check}")

# Check for infinities
inf_check = np.isinf(df_final[feature_columns].select_dtypes(include=[np.number])).sum().sum()
print(f"Infinite values in features: {inf_check}")

# Check target distribution
print(f"\nTarget (unit_sales) statistics:")
print(f"  Min: {df_final['unit_sales'].min():.2f}")
print(f"  Max: {df_final['unit_sales'].max():.2f}")
print(f"  Mean: {df_final['unit_sales'].mean():.2f}")
print(f"  Median: {df_final['unit_sales'].median():.2f}")
print(f"  Std: {df_final['unit_sales'].std():.2f}")

# Check feature value ranges (sample)
print(f"\nSample feature ranges:")
sample_features = ['unit_sales_lag1', 'unit_sales_7d_avg', 'cluster', 'weekend']
for feat in sample_features:
    print(f"  {feat}: [{df_final[feat].min():.2f}, {df_final[feat].max():.2f}]")


Final data quality validation:
------------------------------------------------------------

NaN values in features: 0
Infinite values in features: 0

Target (unit_sales) statistics:
  Min: -100.00
  Max: 4851.00
  Mean: 2.90
  Median: 0.00
  Std: 11.61

Sample feature ranges:
  unit_sales_lag1: [-100.00, 4851.00]
  unit_sales_7d_avg: [-8.71, 821.61]
  cluster: [1.00, 17.00]
  weekend: [0.00, 1.00]


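The data quality checks pass: no NaN or infinite values remain in the features. The target does contain negative values (minimum -100); in the Favorita data, negative unit_sales represent returns, and they are left unchanged here. Next, we save the final dataset: 4.8M rows with 42 columns (8 metadata + 1 target + 33 features).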
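
The validation cell prints its findings; collecting them in a dictionary makes them easier to assert on or log. The sketch below is a hypothetical helper (`quality_report`) that repeats the NaN and infinity checks and adds two target-level counts: rows with negative sales and the share of zero-sales rows.

```python
import numpy as np
import pandas as pd


def quality_report(df, feature_columns, target):
    """Summarize NaN/inf counts in the features and basic target properties."""
    numeric = df[feature_columns].select_dtypes(include=[np.number])
    return {
        "nan_in_features": int(df[feature_columns].isna().sum().sum()),
        "inf_in_features": int(np.isinf(numeric.to_numpy(dtype="float64")).sum()),
        "negative_target_rows": int((df[target] < 0).sum()),
        "zero_target_share": float((df[target] == 0).mean()),
    }


# Toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "feat_a": [1.0, np.nan, np.inf],
    "feat_b": [0, 1, 2],
    "unit_sales": [-1.0, 0.0, 5.0],
})

report = quality_report(toy, ["feat_a", "feat_b"], "unit_sales")
print(report)
```

How to treat the negative rows (keep, clip to zero, or drop) is a modeling decision that this sketch deliberately leaves open.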

In [62]:
### Section 18: Export Processed Dataset
# Source: w02_d05_FE_final.ipynb
# Save to pickle for efficient loading in Notebook 2

import time

output_file = DATA_PROCESSED / 'full_featured_data.pkl'

print("Exporting final dataset...")
print("-" * 60)

start_time = time.time()

# Save to pickle (preserves dtypes, faster than CSV)
df_final.to_pickle(output_file)

export_time = time.time() - start_time

print(f"\nExport complete:")
print(f"  File: {output_file}")
print(f"  Size: {output_file.stat().st_size / (1024**2):.1f} MB")
print(f"  Time: {export_time:.1f} seconds")

print(f"\nDataset summary:")
print(f"  Rows: {len(df_final):,}")
print(f"  Columns: {len(df_final.columns)}")
print(f"  Features: {len(feature_columns)}")
print(f"  Period: {df_final['date'].min().date()} to {df_final['date'].max().date()}")
print(f"  Stores: {df_final['store_nbr'].nunique()}")
print(f"  Items: {df_final['item_nbr'].nunique()}")


Exporting final dataset...
------------------------------------------------------------

Export complete:
  File: /home/berto/Demand-forecasting-in-retail/data/processed/full_featured_data.pkl
  Size: 1341.6 MB
  Time: 2.6 seconds

Dataset summary:
  Rows: 4,801,160
  Columns: 42
  Features: 33
  Period: 2013-10-01 to 2014-03-31
  Stores: 10
  Items: 2638
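
Before handing the file to the next notebook, a round-trip check can confirm that the pickle restores the same values and dtypes. The sketch below uses a temporary directory and a toy frame so it can run anywhere; the helper name is hypothetical. As a general caution, pickle files should only be loaded from trusted sources.

```python
import tempfile
from pathlib import Path

import pandas as pd


def roundtrip_pickle(df, path):
    """Write df to `path`, read it back, and verify values and dtypes are identical."""
    df.to_pickle(path)
    restored = pd.read_pickle(path)
    pd.testing.assert_frame_equal(df, restored)
    return restored


# Toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-10-01", "2013-10-02"]),
    "store_nbr": [24, 26],
    "unit_sales": [0.0, 3.0],
})

with tempfile.TemporaryDirectory() as tmp:
    restored = roundtrip_pickle(toy, Path(tmp) / "toy_featured_data.pkl")

print(restored.dtypes)
```

The same call on `df_final` and `output_file` would double the I/O time of the export, so it is probably best reserved for a one-off verification.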


## FULL_01 Summary: Data to Features Complete

### Execution Summary

| Step | Source | Result |
|------|--------|--------|
| Data Loading | w01_d01, w01_d02 | 22.9M rows → 2.1M (period filter) |
| Calendar Scaffold | w01_d03 | 5.6M complete combinations |
| Lag Features | w02_d01 | 4 lag features (1, 7, 14, 30 days) |
| Rolling Features | w02_d02 | 3 rolling avg (std excluded per DEC-014) |
| Holiday Features | w02_d04 | 4 holiday features |
| Aggregations | w02_d04 | 6 hierarchical averages |
| Derived Features | w02_d04 | 5 derived flags (perishable + 4 calendar flags) |
| Final Export | w02_d05 | 4.8M rows × 42 columns |

### Output Dataset

| Metric | Value |
|--------|-------|
| Rows | 4,801,160 |
| Columns | 42 (8 metadata + 1 target + 33 features) |
| Period | Oct 1, 2013 - Mar 31, 2014 |
| Stores | 10 (Guayas) |
| Items | 2,638 |
| Families | 32 |
| Sparsity | 62.6% |
| File | full_featured_data.pkl (1.3 GB) |

### Comparison: Sample vs Full

| Metric | Week 3 Sample | Full Pipeline |
|--------|---------------|---------------|
| Sampling | 300K rows | No sampling |
| Families | Top-3 | All 32 |
| Items | ~500 | 2,638 |
| Final rows | ~25K | 4.8M |

### Decisions Applied
- **DEC-014:** 33 features (excluded rolling std, oil, and all promotion interactions except promo_item_interaction)

### Issues Resolved
1. Holiday feature vectorization (5+ min → 3 sec)
2. DateTime type handling for numpy arithmetic

### Next Steps
**FULL_02_train_final_model.ipynb:**
- Apply DEC-016 split (Q4 2013 + Q1 2014 training)
- Apply DEC-013 gap (7 days)
- Train XGBoost + LSTM
- Compare to Week 3 baseline
- Export production artifacts
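
To make the split-with-gap step concrete, the sketch below shows one way a date-based split with a 7-day gap could look. It is only an illustration: the helper name, the toy dates, and the choice of 2014-03-01 as test start are assumptions (the notes above mention March 2014 as the test period, but DEC-016 and DEC-013 themselves are not reproduced here).

```python
import pandas as pd


def split_with_gap(df, test_start, gap_days, date_col="date"):
    """Train on rows before (test_start - gap_days); test on rows from test_start on."""
    test_start = pd.Timestamp(test_start)
    train_end = test_start - pd.Timedelta(days=gap_days)  # exclusive upper bound
    train = df[df[date_col] < train_end]
    test = df[df[date_col] >= test_start]
    return train, test


# Toy frame with one row per day (dates are made up for illustration).
toy = pd.DataFrame({"date": pd.date_range("2014-02-15", "2014-03-10")})
toy["unit_sales"] = 1.0

train, test = split_with_gap(toy, test_start="2014-03-01", gap_days=7)
print(train["date"].max().date(), test["date"].min().date())
```

The rows that fall inside the gap (here 2014-02-22 through 2014-02-28) are used by neither side, which keeps the most recent lag values of the training set from overlapping the test period.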