# 04: Feature Engineering Pipeline 🛠️

This notebook implements the **Feature Engineering** workflow required for the Anomaly Detection Agent (Day 4). We transform raw sales data into time-series metrics that allow us to detect statistical outliers.

### 🎯 Goals
1.  **Ingest Data:** Load the cleaned dataset using the `DataIngestorAgent`.
2.  **Engineer Features:** Apply time-series transformations using the `FeatureEngineer` class:
    * **Time Features:** Year, Month, Quarter, Day of Week.
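    * **Rolling Metrics:** Moving averages over a window of 3 to smooth out volatility. The sample output in this notebook shows the window applied over consecutive order rows sorted by date, not over calendar months.
    * **Lag Features:** The previous value of a metric (lag of 1), used as the baseline for growth calculations.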
3.  **Save Snapshot:** Export the enriched dataset (`sales_features.parquet`) for the Anomaly Agent.
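
The time features listed above can be derived from a datetime column with the pandas `.dt` accessor. The following is a minimal sketch, not the `FeatureEngineer` implementation: the tiny `orders` frame and the output column names (`Year`, `Month`, `Quarter`, `DayOfWeek`) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned data: two order dates taken from
# the sample outputs in this notebook.
orders = pd.DataFrame(
    {"Order Date": pd.to_datetime(["2016-11-08", "2014-01-03"])}
)

# The .dt accessor exposes calendar components of a datetime column.
dates = orders["Order Date"].dt
orders["Year"] = dates.year
orders["Month"] = dates.month
orders["Quarter"] = dates.quarter
orders["DayOfWeek"] = dates.dayofweek  # Monday=0 ... Sunday=6

print(orders)
```

Calendar components like these let a downstream model compare a sale against other sales from the same month or weekday, which is how seasonality is usually separated from genuine outliers.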

### 🏗️ Components Used
* `agents.feature_transforms.FeatureEngineer`: Handles the mathematical transformations.
* `agents.data_ingestor.DataIngestorAgent`: Loads and standardizes raw CSV data.

## 1) Imports and Data Loading

In [2]:
import sys
import os
import pandas as pd

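# Add the project root (the parent of the notebook's working directory) to
# sys.path so the `agents` package can be imported. `__file__` is not defined
# in notebooks, so the path is resolved from the current working directory.
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)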

from agents.data_ingestor import DataIngestorAgent
from agents.feature_transforms import FeatureEngineer

print("✅ Feature Engineering Pipeline Ready")

✅ Feature Engineering Pipeline Ready


In [3]:
# Load Clean Data
ingestor = DataIngestorAgent("../data/raw/superstore.csv")
df_clean = ingestor.clean_data()
print(f"Loaded {len(df_clean)} rows.")
df_clean.head(2)

2025-11-19 22:13:56,628 - agents.data_ingestor - INFO - Attempting read with encoding='utf-8'...
2025-11-19 22:13:56,649 - agents.data_ingestor - INFO - Attempting read with encoding='latin1'...
2025-11-19 22:13:56,680 - agents.data_ingestor - INFO - Success! Read 9994 rows.
2025-11-19 22:13:56,699 - agents.data_ingestor - INFO - Schema validation passed.


Loaded 9994 rows.


| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | Order Year | Order Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2016-152156 | 2016-11-08 | 2016-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 | 2 | 0.0 | 41.9136 | 2016 | 11 |
| 1 | 2 | CA-2016-152156 | 2016-11-08 | 2016-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 | 3 | 0.0 | 219.582 | 2016 | 11 |
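
The log above shows the ingestor trying `utf-8` first and then succeeding with `latin1`. A fallback of that kind could look roughly like the sketch below. The function name `read_csv_with_fallback` and the sample file are hypothetical, and the actual `DataIngestorAgent` code may differ.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in order; return (DataFrame, encoding that worked)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            # This encoding cannot decode the file; try the next one.
            last_error = err
    raise ValueError(
        f"Could not decode {path} with any of {encodings}"
    ) from last_error


# Demo on a hypothetical file: a Latin-1 encoded "é" is not valid UTF-8,
# so the first attempt fails and the second one succeeds.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "wb") as fh:
        fh.write("City,Sales\nMontréal,10.5\n".encode("latin1"))
    df_sample, used_encoding = read_csv_with_fallback(sample_path)

print(used_encoding)
```

Ordering matters here: `latin1` can decode any byte sequence, so it only makes sense as the last entry in the list.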


## 2) Apply Engineering

In [4]:
# Initialize Engineer
engineer = FeatureEngineer(df_clean)

# 1. Add Time Features
engineer.add_time_features()

# 2. Add Rolling Sales (moving average over a window of 3)
engineer.add_rolling_metrics(target_col='Sales', window=3)

# 3. Add Lag Features (previous value of Sales, lag of 1)
engineer.add_lag_features(target_col='Sales', lag=1)

# Get final DF
df_features = engineer.get_engineered_data()
df_features[['Order Date', 'Sales', 'Sales_Rolling_3', 'Sales_Lag_1']].head()

| | Order Date | Sales | Sales_Rolling_3 | Sales_Lag_1 |
|---|---|---|---|---|
| 7980 | 2014-01-03 | 16.448 | 16.448 | 0.0 |
| 739 | 2014-01-04 | 11.784 | 14.116 | 16.448 |
| 740 | 2014-01-04 | 272.736 | 100.322667 | 11.784 |
| 741 | 2014-01-04 | 3.54 | 96.02 | 272.736 |
| 1759 | 2014-01-05 | 19.536 | 98.604 | 3.54 |
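
The values in the output above are consistent with a rolling mean over 3 consecutive rows (with partial windows allowed at the start) and a one-row shift whose missing first value is filled with 0. The sketch below reproduces them with plain pandas. It is an inference from the displayed numbers, not the confirmed internals of `FeatureEngineer`.

```python
import pandas as pd

# The five Sales values shown in the output, in date order.
sales = pd.Series([16.448, 11.784, 272.736, 3.54, 19.536], name="Sales")

# min_periods=1 lets the first rows average over fewer than 3 values
# instead of producing NaN.
rolling_3 = sales.rolling(window=3, min_periods=1).mean()

# shift(1) moves each value down one row; the first row has no
# predecessor, so it is filled with 0.0.
lag_1 = sales.shift(1).fillna(0.0)

result = pd.DataFrame(
    {"Sales": sales, "Sales_Rolling_3": rolling_3, "Sales_Lag_1": lag_1}
)
print(result.round(6))
```

Because the window counts rows rather than calendar periods, a true 3-month average would need the data aggregated to monthly totals first, for example by resampling on `Order Date`.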


## 3) Save Snapshot

In [5]:
# Save this for Day 4 (Anomaly Detection)
output_path = "../data/processed/sales_features.parquet"
df_features.to_parquet(output_path)
print(f"✅ Saved engineered features to {output_path}")

✅ Saved engineered features to ../data/processed/sales_features.parquet
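
`DataFrame.to_parquet` does not create missing parent directories, so the save cell fails on a fresh checkout where `data/processed` does not exist yet. One possible guard is sketched below. The helper name `ensure_parent_dir` is hypothetical, and the demo uses a temporary directory rather than the project tree.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path_like):
    """Create the parent directory of a file path if needed; return the Path."""
    path = Path(path_like)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demo in a temporary directory that mimics the project's layout.
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(
        Path(tmp) / "data" / "processed" / "sales_features.parquet"
    )
    parent_created = target.parent.is_dir()

print(parent_created)
```

In this notebook the helper would be called on `output_path` just before `df_features.to_parquet(output_path)`.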


## 4) ⏭️ Next Step: Detecting Anomalies

Success! We have enriched our raw sales data with:
* **Rolling Averages** (Trends)
* **Lag Features** (Growth Rates)
* **Time Features** (Seasonality)

The snapshot is saved to `data/processed/sales_features.parquet`.
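
To illustrate how the lag feature relates to growth rates, here is a minimal sketch that derives a relative change from `Sales` and `Sales_Lag_1`. The column name `Sales_Growth` is an assumption, and a lag of 0 (the filled first row) is treated as "no baseline" rather than as a real value.

```python
import numpy as np
import pandas as pd

# Three rows taken from the engineered output shown earlier.
features = pd.DataFrame(
    {
        "Sales": [16.448, 11.784, 272.736],
        "Sales_Lag_1": [0.0, 16.448, 11.784],
    }
)

# A zero lag would cause a division by zero, so mark it as missing.
baseline = features["Sales_Lag_1"].replace(0.0, np.nan)

# Relative change versus the previous value: (current - previous) / previous.
features["Sales_Growth"] = (features["Sales"] - baseline) / baseline

print(features)
```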

We are now ready to open **`notebooks/05_anomaly_detection.ipynb`** to build the **Statistical Anomaly Agent**, which will consume this file to find outliers.