# 🔧 Feature Engineering Guide

A detailed guide to each workstream (WS0-WS6) in the pipeline.

## Overview
The pipeline consists of 7 workstreams:
- **WS0**: Aggregation & Master Grid
- **WS1**: Relational Features (Product, Household)
- **WS2**: Time-Series Features (Lag, Rolling, Calendar)
- **WS3**: Behavior Features (Clickstream)
- **WS4**: Price & Promotion Features
- **WS5**: Stockout Recovery Features
- **WS6**: Weather Features


In [3]:
# Setup
import sys
from pathlib import Path
import pandas as pd
import numpy as np

# Add project root to path
project_root = Path().resolve().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from src.config import setup_project_path, get_dataset_config, setup_logging
from src.pipelines._01_load_data import load_data

setup_project_path()
setup_logging()

# Import display for Jupyter notebooks
try:
    from IPython.display import display
except ImportError:
    # Fallback if not in Jupyter
    display = print

# Load data
dataframes, config = load_data()
print(f"Dataset: {config['name']}")


2025-11-13 01:31:27,255 - src.pipelines._01_load_data - INFO - [PIPELINE STEP 1: LOAD DATA]


2025-11-13 01:31:27,256 - src.pipelines._01_load_data - INFO - Active Dataset: FreshRetailNet-50K
2025-11-13 01:31:27,256 - src.pipelines._01_load_data - INFO - Data Directory: D:\datastorm\E-Grocery_Forecaster\data\2_raw
2025-11-13 01:31:27,257 - src.pipelines._01_load_data - INFO - Loading FreshRetailNet-50K dataset...
2025-11-13 01:31:27,258 - src.pipelines._01_load_data - INFO -   Loading freshretail_train.parquet...
2025-11-13 01:31:30,605 - src.pipelines._01_load_data - INFO - ✓ Converted 'dt' to 'hour_timestamp'
2025-11-13 01:31:30,611 - src.pipelines._01_load_data - INFO - ✓ Converted 'sale_amount' to 'sales_quantity'
2025-11-13 01:31:30,624 - src.pipelines._01_load_data - INFO - Cleaning raw data (handling errors and outliers)...
2025-11-13 01:31:31,457 - src.pipelines._01_load_data - INFO - ✓ Raw data cleaning complete.
2025-11-13 01:31:31,458 - src.pipelines._01_load_data - INFO - ✅ Data loading complete. Loaded 3 dataframes: ['sales', 'weather', 'products']
Dataset:

## WS0: Aggregation & Master Grid

**Purpose**: Convert transaction-level data into weekly or hourly aggregates and build a complete grid.
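
The internals of `ws0.create_master_grid` are not shown in this guide. As a rough illustration of the idea, aggregation followed by a complete grid could look like the sketch below. The function name `build_master_grid`, the daily frequency, and the zero-fill for missing buckets are assumptions made for this example; the column names follow the ones that appear elsewhere in this notebook.

```python
import pandas as pd


def build_master_grid(sales, keys=("PRODUCT_ID", "STORE_ID"),
                      time_col="hour_timestamp", value_col="sales_quantity",
                      freq="D"):
    """Hypothetical sketch: aggregate per (keys, time bucket), then fill gaps."""
    df = sales.copy()
    # Snap every timestamp to the start of its bucket (daily by default).
    df[time_col] = pd.to_datetime(df[time_col]).dt.floor(freq)
    agg = df.groupby([*keys, time_col], as_index=False)[value_col].sum()

    # Every bucket between the first and last observation.
    all_times = pd.DataFrame({
        time_col: pd.date_range(agg[time_col].min(), agg[time_col].max(), freq=freq)
    })
    # Only key combinations that were actually observed.
    combos = agg[list(keys)].drop_duplicates()
    grid = combos.merge(all_times, how="cross")

    # Left-join the aggregates; buckets with no sales become 0 (an assumption).
    out = grid.merge(agg, on=[*keys, time_col], how="left")
    out[value_col] = out[value_col].fillna(0)
    return out.sort_values([*keys, time_col]).reset_index(drop=True)


demo_sales = pd.DataFrame({
    "PRODUCT_ID": [1, 1, 1, 2],
    "STORE_ID": [10, 10, 10, 10],
    "hour_timestamp": ["2024-01-01 08:00", "2024-01-01 15:00",
                       "2024-01-03 09:00", "2024-01-02 12:00"],
    "sales_quantity": [2, 3, 1, 4],
})
demo_grid = build_master_grid(demo_sales)
print(demo_grid)
```

Filling gaps with 0 treats "no record" as "no sales", which may not hold when a gap is caused by a stockout, so the real pipeline may handle those rows differently.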


In [4]:
from src.features import ws0_aggregation as ws0

# Get sales data. Check each key with `is None`: the original
# `dataframes.get('sales') or dataframes.get('freshretail_train')` raised
# "ValueError: The truth value of a DataFrame is ambiguous", because `or`
# tries to coerce the first DataFrame to a single boolean.
sales_df = dataframes.get('sales')
if sales_df is None:
    sales_df = dataframes.get('freshretail_train')
print(f"Original shape: {sales_df.shape}")

# Create master grid
master_df = ws0.create_master_grid(sales_df, config)
print(f"Master grid shape: {master_df.shape}")
print(f"\nKey columns: {list(master_df.columns[:10])}")

# Show sample
display(master_df.head())
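
Chaining `dataframes.get(...)` lookups with `or` raises `ValueError` whenever the first lookup returns a DataFrame, because pandas refuses to reduce a DataFrame to a single boolean. A minimal sketch of a safer lookup follows; `first_available` is a hypothetical helper written for this example, not part of `src`.

```python
import pandas as pd


def first_available(frames, *names):
    """Return the first entry of `frames` that is not None, trying `names` in order."""
    for name in names:
        frame = frames.get(name)
        if frame is not None:
            return frame
    raise KeyError(f"None of {names} found in frames")


# Toy stand-in for the `dataframes` dict returned by load_data().
demo_frames = {"freshretail_train": pd.DataFrame({"sales_quantity": [1, 2]})}
picked = first_available(demo_frames, "sales", "freshretail_train")

# Truth-testing a DataFrame is what `or` does under the hood, and it fails.
try:
    bool(picked)
    truth_test_raised = False
except ValueError:
    truth_test_raised = True

print(picked.shape, truth_test_raised)
```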

## WS1: Relational Features

**Purpose**: Add product information and household demographics.
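
How `ws1.enrich_relational_features` joins these tables is not shown here. One plausible approach is a left join from a product lookup table onto the master grid, sketched below. The helper name `add_product_attributes`, the `PRODUCT_ID` join key, and the toy data are assumptions for illustration.

```python
import pandas as pd


def add_product_attributes(master, products, key="PRODUCT_ID",
                           cols=("DEPARTMENT", "BRAND")):
    """Hypothetical sketch: attach product attributes without changing row count."""
    # Keep one row per product so the join cannot duplicate master rows.
    lookup = products[[key, *cols]].drop_duplicates(subset=key)
    # validate= makes pandas raise if the lookup still has duplicate keys.
    return master.merge(lookup, on=key, how="left", validate="many_to_one")


demo_master = pd.DataFrame({
    "PRODUCT_ID": [1, 1, 2, 3],
    "SALES_VALUE": [5.0, 2.0, 4.0, 1.0],
})
demo_products = pd.DataFrame({
    "PRODUCT_ID": [1, 2, 2],
    "DEPARTMENT": ["PRODUCE", "DAIRY", "DAIRY"],
    "BRAND": ["Private", "National", "National"],
})
demo_enriched = add_product_attributes(demo_master, demo_products)
print(demo_enriched)
```

A left join keeps every master row; products missing from the lookup (product 3 in the toy data) end up with missing attribute values rather than being dropped.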


In [None]:
from src.features import ws1_relational_features as ws1

if config.get('has_relational', False):
    master_df = ws1.enrich_relational_features(master_df, dataframes)
    print(f"After WS1 shape: {master_df.shape}")
    
    # Show new columns
    relational_cols = [col for col in master_df.columns 
                      if col in ['DEPARTMENT', 'COMMODITY_DESC', 'BRAND', 'MANUFACTURER']]
    if relational_cols:
        print(f"\nRelational columns added: {relational_cols}")
        display(master_df[relational_cols].head())
else:
    print("WS1 skipped: has_relational=False in config")


## WS2: Time-Series Features

**Purpose**: Create lag features, rolling statistics, and calendar features (leak-safe).
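
"Leak-safe" here is taken to mean that a feature for a given row is computed only from earlier rows of the same series. The sketch below shows one common way to get that property with pandas: shift within each group before applying the rolling window. It is not the implementation of `ws2.add_lag_rolling_features`; the helper name, the feature names, and the default `time_col` are assumptions.

```python
import pandas as pd


def add_lag_and_rolling(df, keys=("PRODUCT_ID", "STORE_ID"),
                        time_col="hour_timestamp", target="SALES_VALUE",
                        lags=(1, 2), window=3):
    """Hypothetical sketch of leak-safe lag and rolling-mean features."""
    out = df.sort_values([*keys, time_col]).copy()
    grouped = out.groupby(list(keys), sort=False)[target]
    prefix = target.lower()

    # Lag k: the target value k steps earlier within the same series.
    for lag in lags:
        out[f"{prefix}_lag_{lag}"] = grouped.shift(lag)

    # Shift by one step first so the window never contains the current row.
    out[f"{prefix}_rolling_mean_{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return out


demo_ts = pd.DataFrame({
    "PRODUCT_ID": ["A"] * 4 + ["B"] * 2,
    "STORE_ID": [1] * 6,
    "hour_timestamp": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04",
        "2024-01-01", "2024-01-02",
    ]),
    "SALES_VALUE": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0],
})
demo_features = add_lag_and_rolling(demo_ts)
print(demo_features)
```

The first row of each series has no history, so its lag and rolling features are missing; values never cross from product A into product B because the shift happens inside each group.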


In [None]:
from src.features import ws2_timeseries_features as ws2

# Add time-series features
master_df = ws2.add_lag_rolling_features(master_df, use_config=False)
print(f"After WS2 shape: {master_df.shape}")

# Show time-series features
ts_cols = [col for col in master_df.columns 
          if any(x in col.lower() for x in ['lag', 'rolling', 'week_', 'month_', 'quarter'])]
print(f"\nTime-series features ({len(ts_cols)}):")
print(ts_cols[:15])

# Show sample
display(master_df[['PRODUCT_ID', 'STORE_ID', config['time_column'], 
                   'SALES_VALUE'] + ts_cols[:5]].head(10))


## WS4: Price & Promotion Features

**Purpose**: Compute price features and promotion indicators.
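
The exact features produced by `ws4.add_price_promotion_features` are not listed in this guide. As a hedged illustration, typical price and promotion features include a unit price, a discount share, and a binary promotion flag. In the sketch below, the `QUANTITY` and `RETAIL_DISC` columns, the helper name, and the feature definitions are all assumptions for the example.

```python
import pandas as pd


def add_price_features(df, value_col="SALES_VALUE", qty_col="QUANTITY",
                       discount_col="RETAIL_DISC"):
    """Hypothetical sketch: unit price, discount share, and promotion flag."""
    out = df.copy()

    # Unit price is undefined when nothing was sold, so mask zero quantities.
    qty = out[qty_col].where(out[qty_col] > 0)
    out["unit_price"] = out[value_col] / qty

    # Discounts are assumed to be stored as negative amounts; use magnitudes.
    disc = out[discount_col].abs()
    base = out[value_col] + disc  # price before discount
    out["discount_pct"] = (disc / base.where(base > 0)).fillna(0.0)

    # Flag rows where any discount was applied.
    out["is_promo"] = (disc > 0).astype(int)
    return out


demo_prices = pd.DataFrame({
    "SALES_VALUE": [8.0, 10.0, 0.0],
    "QUANTITY": [2, 5, 0],
    "RETAIL_DISC": [-2.0, 0.0, 0.0],
})
demo_priced = add_price_features(demo_prices)
print(demo_priced)
```

The feature names were chosen so that they would be picked up by a keyword filter on `price`, `discount`, and `promo`.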


In [None]:
from src.features import ws4_price_features as ws4

# Add price/promotion features
master_df = ws4.add_price_promotion_features(master_df, dataframes)
print(f"After WS4 shape: {master_df.shape}")

# Show price/promo features
price_cols = [col for col in master_df.columns 
             if any(x in col.lower() for x in ['price', 'discount', 'promo', 'display', 'mailer'])]
print(f"\nPrice/Promotion features ({len(price_cols)}):")
print(price_cols)

# Show sample
if price_cols:
    display(master_df[['PRODUCT_ID', 'STORE_ID', 'SALES_VALUE'] + price_cols[:5]].head())
