# 🔧 Notebook 2: Feature Engineering

**Author:** Amey Talkatkar | **Course:** MLOps with Agentic AI

## 🎯 Learning Objectives
- Transform raw data into ML-ready features
- Create lag and rolling window features
- Encode categorical variables
- Scale numerical features
- Handle train/test split properly
- Save processed data for DVC tracking

## 🔥 The Problem
A data scientist trained a model on raw data:
- Forgot to encode categories → Model crashed
- Fit the scaler on all data → Data leakage!
- No lag features → Model couldn't capture trends
- Result: Poor accuracy in production

**Solution: Proper feature engineering!**

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported")

## Step 1: Load Data

In [None]:
df = pd.read_csv('../data/raw/sales_data.csv', parse_dates=['date'])
print(f"Loaded: {len(df):,} rows")
df.head()

## Step 2: Create Lag Features
Capture historical patterns: yesterday's sales help predict today's.

In [None]:
# Sort by date for time series operations
df = df.sort_values('date').reset_index(drop=True)

# Create lag features (previous days' sales)
for lag in [1, 7, 30]:
    df[f'sales_lag_{lag}'] = df.groupby(['region', 'product'])['sales'].shift(lag)

# Rolling window features (statistics over the previous N days).
# shift(1) keeps the current day's sales (the target) out of each window.
for window in [7, 30]:
    df[f'sales_rolling_mean_{window}'] = df.groupby(['region', 'product'])['sales'].transform(
        lambda x: x.shift(1).rolling(window=window, min_periods=1).mean()
    )
    df[f'sales_rolling_std_{window}'] = df.groupby(['region', 'product'])['sales'].transform(
        lambda x: x.shift(1).rolling(window=window, min_periods=1).std()
    )

print("✅ Lag features created")
print(f"New columns: {[col for col in df.columns if 'lag' in col or 'rolling' in col]}")
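
A minimal sketch, on hypothetical toy data, of why the lag is computed per group: `groupby(...).shift()` keeps one region's sales from spilling into another region's lag column. The `prev_rolling_mean_2` column is an illustrative name, and shifting before rolling is one way to build a window from past days only.

```python
import pandas as pd

# Hypothetical toy data: two regions, one product, three days each
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "region": ["North"] * 3 + ["South"] * 3,
    "product": ["A"] * 6,
    "sales": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})
toy = toy.sort_values(["region", "product", "date"]).reset_index(drop=True)

grouped = toy.groupby(["region", "product"])["sales"]

# Lag 1: previous day's sales within the same region/product
toy["sales_lag_1"] = grouped.shift(1)

# Mean of up to the 2 previous days, excluding the current day
toy["prev_rolling_mean_2"] = grouped.transform(
    lambda s: s.shift(1).rolling(window=2, min_periods=1).mean()
)

print(toy)
```

The first row of each group has no history, so both new columns are NaN there. South's first lag is NaN rather than North's last value of 30.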

## Step 3: Encode Categorical Variables

In [None]:
# One-hot encoding for region, product, and season
# (drop_first=True drops one level per column to avoid redundant dummies)
df_encoded = pd.get_dummies(df, columns=['region', 'product', 'season'], drop_first=True)

print("✅ Categorical encoding complete")
print(f"Original columns: {len(df.columns)}")
print(f"After encoding: {len(df_encoded.columns)}")
print(f"\nNew binary columns: {[col for col in df_encoded.columns if col.startswith(('region_', 'product_', 'season_'))]}")
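
To make the effect of `drop_first=True` concrete, here is a small sketch on made-up data (the season values and prices are assumptions for illustration). The first category in sorted order becomes the implicit baseline and gets no column of its own.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column
toy = pd.DataFrame({
    "season": ["Winter", "Spring", "Summer", "Winter"],
    "price": [9.5, 12.0, 11.0, 10.0],
})

# dtype=int yields 0/1 columns instead of booleans
encoded = pd.get_dummies(toy, columns=["season"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

"Spring" sorts first, so it is dropped; a Spring row is represented by zeros in both `season_Summer` and `season_Winter`.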

## Step 4: Feature Selection
Choose features for modeling

In [None]:
# Define feature columns
feature_cols = [
    # Numerical features
    'price', 'quantity', 'month', 'day_of_week',
    # Lag features
    'sales_lag_1', 'sales_lag_7', 'sales_lag_30',
    # Rolling features
    'sales_rolling_mean_7', 'sales_rolling_mean_30',
    'sales_rolling_std_7', 'sales_rolling_std_30',
    # Binary features
    'is_weekend',
] + [col for col in df_encoded.columns if col.startswith(('region_', 'product_', 'season_'))]

target_col = 'sales'

print(f"✅ Selected {len(feature_cols)} features")
print(f"Target: {target_col}")

## Step 5: Handle Missing Values
Lag features produce NaN for the first rows of each region/product group

In [None]:
print(f"Missing values before: {df_encoded[feature_cols].isnull().sum().sum()}")

# Drop rows with missing lag features (the first 30 days of each region/product group)
df_clean = df_encoded.dropna(subset=feature_cols)

print(f"Missing values after: {df_clean[feature_cols].isnull().sum().sum()}")
print(f"Rows remaining: {len(df_clean):,} ({len(df_clean)/len(df)*100:.1f}%)")

## Step 6: Train/Test Split
⚠️ CRITICAL: Split BEFORE scaling to avoid data leakage!

In [None]:
# Prepare X and y
X = df_clean[feature_cols]
y = df_clean[target_col]

# Split chronologically (first 80% train, last 20% test).
# random_state is omitted because it has no effect when shuffle=False.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False  # No shuffle for time series!
)

print("✅ Train/Test Split:")
print(f"   Train: {len(X_train):,} rows ({len(X_train)/len(X)*100:.1f}%)")
print(f"   Test:  {len(X_test):,} rows ({len(X_test)/len(X)*100:.1f}%)")
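
A quick way to confirm the split is chronological is to compare dates across the boundary. The sketch below uses a hypothetical ten-day frame (column names `x` and `y` are placeholders) and assumes the rows are already sorted by date, as they are after Step 2.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy series: one row per day, already sorted by date
dates = pd.date_range("2024-01-01", periods=10, freq="D")
toy = pd.DataFrame({"date": dates, "x": np.arange(10), "y": np.arange(10) * 2.0})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["date", "x"]], toy["y"], test_size=0.2, shuffle=False
)

# With shuffle=False the last 20% of rows form the test set,
# so every training date precedes every test date
print(X_tr["date"].max(), "<", X_te["date"].min())
```

When several rows share a date, as with multiple regions and products per day, the boundary date can appear on both sides of the split, so a `<=` comparison is the appropriate check there.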

## Step 7: Feature Scaling
Fit the scaler ONLY on train data, then use it to transform test data

In [None]:
# Identify numerical columns to scale
numerical_cols = ['price', 'quantity', 'month', 'day_of_week'] + \
                 [col for col in feature_cols if 'lag' in col or 'rolling' in col]

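# Work on explicit copies so the column assignments below are unambiguous
X_train, X_test = X_train.copy(), X_test.copy()

# Fit scaler on train data ONLY
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])

# Transform test data with same scaler
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])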

print("✅ Features scaled")
print(f"   Scaled columns: {len(numerical_cols)}")
print(f"   Mean after scaling: {X_train[numerical_cols].mean().mean():.6f} (should be ~0)")
print(f"   Std after scaling: {X_train[numerical_cols].std().mean():.6f} (should be ~1)")
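
To see what the leakage looks like in numbers, the sketch below fits one scaler on train plus test and another on train only. The data is synthetic and the upward drift in the test period is an assumption chosen to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic feature: the test period has drifted upward relative to train
train = rng.normal(loc=100.0, scale=10.0, size=(80, 1))
test = rng.normal(loc=150.0, scale=10.0, size=(20, 1))

# Leaky: statistics computed from train AND test rows
leaky = StandardScaler().fit(np.vstack([train, test]))

# Clean: statistics computed from train rows only
clean = StandardScaler().fit(train)

print(f"Mean learned with leakage: {leaky.mean_[0]:.2f}")
print(f"Mean learned from train:   {clean.mean_[0]:.2f}")
```

The leaky scaler's mean is pulled toward the test distribution, so the training features already encode information about data the model is not supposed to have seen.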

## Step 8: Feature Importance Analysis
Which features are most useful?

In [None]:
# Quick linear correlation with target (a rough proxy for importance, not a model-based measure)
correlations = X_train.corrwith(y_train).sort_values(ascending=False)

plt.figure(figsize=(12, 8))
correlations.head(15).plot(kind='barh')
plt.title('Top 15 Features by Correlation with Sales')
plt.xlabel('Correlation')
plt.tight_layout()
plt.show()

print("\nTop 5 features:")
print(correlations.head(5))
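
Sorting by raw correlation puts strongly negative features at the bottom, even though they can be as informative as strongly positive ones. A possible variation, shown on hypothetical toy columns (`up`, `down`, `noise`), ranks by absolute correlation instead.

```python
import pandas as pd

# Hypothetical features: one rises with the target, one falls, one is unrelated
X = pd.DataFrame({
    "up": [1, 2, 3, 4, 5],
    "down": [5, 4, 3, 2, 1],
    "noise": [2, 1, 2, 1, 2],
})
y = pd.Series([10, 20, 30, 40, 50])

corr = X.corrwith(y)

# Order features by strength of the linear relationship, keeping the sign
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(ranked)
```

Here `down` ranks alongside `up`, whereas a plain descending sort would list it last, behind `noise`.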

## Step 9: Save Processed Data

In [None]:
import os

# Create output directory
os.makedirs('../data/processed', exist_ok=True)

# Save train/test splits
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False, header=True)
y_test.to_csv('../data/processed/y_test.csv', index=False, header=True)

# Save scaler for production use
import joblib
joblib.dump(scaler, '../data/processed/scaler.joblib')

print("✅ Processed data saved!")
print("   Location: ../data/processed/")
print("   Files: X_train.csv, X_test.csv, y_train.csv, y_test.csv, scaler.joblib")
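
The saved scaler is only useful if it can be reloaded and applied to new rows later. The sketch below shows that round trip with `joblib.load`; it writes to a temporary directory and uses tiny made-up arrays instead of the notebook's real paths and data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on hypothetical training values
train = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler().fit(train)

# Save and reload, as a serving job would do
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.joblib")
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# New rows are transformed with the statistics learned at training time
new_rows = np.array([[2.0], [4.0]])
scaled = restored.transform(new_rows)

print(scaled.ravel())
```

The reloaded scaler is never refit; it reuses the training mean and standard deviation, which keeps production transformations consistent with training.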

## ✅ Summary

### What We Created:
1. ✅ **Lag Features**: sales_lag_1, sales_lag_7, sales_lag_30
2. ✅ **Rolling Features**: mean and std for 7 and 30 days
3. ✅ **Encoded Categoricals**: region, product, season
4. ✅ **Scaled Numericals**: StandardScaler fit on train only
5. ✅ **Train/Test Split**: 80/20, no shuffle (time series)

### Why This Matters for MLOps:
- 🔄 **Reproducibility**: Saved scaler ensures consistent transformations
- 📊 **No Data Leakage**: Scaled after split
- 🎯 **Feature Store Ready**: Clean, processed features
- 📈 **DVC Tracking**: Can version processed data

---

**Next:** `03_Model_Training_Comparison.ipynb` - Train and compare 3 models

**© 2024 Amey Talkatkar**