# 01 - Feast Feature Store Setup

![Workflow](../docs/01-features-workflow.png)

## What This Notebook Does

| Step | Action | Output |
|------|--------|--------|
| 1 | Generate synthetic sales data | `sales_features.parquet` |
| 2 | Engineer lag/rolling features | 22 columns total (7 lag/rolling features) |
| 3 | `feast apply` | Register features in PostgreSQL registry |
| 4 | `feast materialize` | Populate online store for serving |

## Why Feast?

```
Without Feast:                    With Feast:
┌─────────┐                       ┌─────────┐
│Training │ ← manual features     │Training │ ← FeatureService
└─────────┘                       └─────────┘
┌─────────┐                       ┌─────────┐
│ Serving │ ← duplicate logic     │ Serving │ ← Same FeatureService
└─────────┘   (SKEW RISK!)        └─────────┘   (CONSISTENT)
```

**Prerequisites:** PostgreSQL, PVC (`shared`), RayCluster running.

In [None]:
%pip install -q "feast[postgres]==0.59.0" pandas pyarrow psycopg2-binary
import os, shutil, subprocess
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime, timedelta, timezone

## Configuration

| Variable | Value | Purpose |
|----------|-------|----------|
| `SHARED_ROOT` | `/opt/app-root/src/shared` | PVC mount point |
| `WEEKS` | 104 | 2 years of data |
| `STORES × DEPTS` | 45 × 14 | 630 unique entities |
| **Total Records** | **65,520** | Weekly granularity |

In [None]:
# PVC mounted at /opt/app-root/src/shared in RHOAI workbench
SHARED_ROOT = Path("/opt/app-root/src/shared")
FEATURE_REPO = SHARED_ROOT / "feature_repo"
DATA_DIR = SHARED_ROOT / "data"
DATA_DIR.mkdir(parents=True, exist_ok=True)

START_DATE, WEEKS, STORES, DEPTS, SEED = "2022-01-01", 104, 45, 14, 42
print(f"Records: {WEEKS * STORES * DEPTS:,}")

## Generate Synthetic Sales Data

Creates Walmart-style retail data with realistic patterns:

| Feature | Logic | Purpose |
|---------|-------|----------|
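| `weekly_sales` | (Base + Store) × Dept × Season × Holiday + noise | Target variable |
| `is_holiday` | ISO weeks 6, 27, 36, 47, 51 | Approximately Super Bowl, Independence Day, Labor Day, Thanksgiving, Christmas |
| `seasonal` | `1 + 0.3 × sin(2π × week/52)` | Multiplier that peaks near week 13 and bottoms out near week 39 |
| `temperature` | `60 + 20 × sin(2π × week/52)` + noise | Weather correlation |
| `fuel_price` | Uniform noise between 3.0 and 3.5 | Economic indicator |
| `cpi` | Linear trend, `220 + 0.1 × week index` | Economic indicator |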
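
To check where the seasonal term actually peaks, the multiplier can be tabulated on its own. This is a minimal standard-library sketch: the helper names `seasonal_multiplier` and `holiday_multiplier` are hypothetical, while the 0.3 amplitude, the 1.5 holiday uplift, and the holiday weeks are taken from the generation cell.

```python
import math

HOLIDAYS = {6, 27, 36, 47, 51}  # same holiday weeks as in the table above


def seasonal_multiplier(week_of_year: int) -> float:
    """Hypothetical helper: 1 + 0.3 * sin(2*pi*week/52)."""
    return 1 + 0.3 * math.sin(2 * math.pi * week_of_year / 52)


def holiday_multiplier(week_of_year: int) -> float:
    """Hypothetical helper: 1.5x uplift on holiday weeks, otherwise 1.0."""
    return 1.5 if week_of_year in HOLIDAYS else 1.0


weeks = range(1, 53)
peak_week = max(weeks, key=seasonal_multiplier)
trough_week = min(weeks, key=seasonal_multiplier)
print(f"peak: week {peak_week}, multiplier {seasonal_multiplier(peak_week):.2f}")
print(f"trough: week {trough_week}, multiplier {seasonal_multiplier(trough_week):.2f}")
```

Because the sine starts at zero in week 0, the multiplier is highest a quarter of the way through the year and lowest three quarters of the way through.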

In [None]:
np.random.seed(SEED)
base_date = datetime.fromisoformat(START_DATE).replace(tzinfo=timezone.utc)
HOLIDAYS = {6, 27, 36, 47, 51}  # Major holiday weeks
HOLIDAY_WEEKS = sorted(HOLIDAYS)

records = []
for week in range(WEEKS):
    dt = base_date + timedelta(weeks=week)
    woy, month = dt.isocalendar()[1], dt.month
    day = dt.day
    week_of_month = (day - 1) // 7 + 1
    # Check if last week of month
    next_week = dt + timedelta(weeks=1)
    is_month_end = 1 if next_week.month != month else 0
    # Days until the next holiday week (0 on a holiday week), wrapping around the year end
    days_to_holiday = min((h - woy) % 52 * 7 for h in HOLIDAY_WEEKS)
    
    seasonal = 1 + 0.3 * np.sin(2 * np.pi * woy / 52)
    for s in range(1, STORES + 1):
        for d in range(1, DEPTS + 1):
            sales = max(0, (50000 + s*5000) * (0.5 + d*0.2) * seasonal * (1.5 if woy in HOLIDAYS else 1) + np.random.normal(0, 2000))
            records.append({
                "store_id": s, "dept_id": d, "event_timestamp": dt, "weekly_sales": round(sales, 2),
                "week_of_year": woy, "month": month, "quarter": (month-1)//3+1, 
                "week_of_month": week_of_month, "is_month_end": is_month_end,
                "is_holiday": int(woy in HOLIDAYS), "days_to_holiday": days_to_holiday,
                "temperature": round(60 + 20*np.sin(2*np.pi*woy/52) + np.random.normal(0,5), 1),
                "fuel_price": round(3 + 0.5*np.random.random(), 2), "cpi": round(220 + week*0.1, 1), 
                "unemployment": round(5 + np.random.normal(0, 0.5), 1)
            })

sales_df = pd.DataFrame(records).sort_values(["store_id", "dept_id", "event_timestamp"]).reset_index(drop=True)
print(f"Generated {len(sales_df):,} rows")

## Feature Engineering

Add time-series features that capture historical patterns:

```
lag_1:  Sales from 1 week ago  → Most predictive
lag_2:  Sales from 2 weeks ago → Recent trend
lag_4:  Sales from 4 weeks ago → Monthly pattern
lag_8:  Sales from 8 weeks ago → Bi-monthly pattern

rolling_mean_4w:  4-week moving average (incl. current week) → Smoothed trend
rolling_std_4w:   4-week std deviation                       → Volatility
sales_vs_avg:     Current / rolling_mean_4w                  → Relative performance
```

**Note:** Lags that are missing at the start of each store/department series are filled with `rolling_mean_4w` to avoid NaN.
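
For a single store/department series, the lag and rolling definitions reduce to a shift and a windowed mean. The following is a plain-Python sketch with made-up numbers; the function names `lag`, `rolling_mean`, and `fill_missing` are hypothetical, and the notebook itself uses pandas `shift` and `rolling` instead.

```python
from statistics import mean


def lag(series, k):
    """Shift a series back by k steps; the first k entries have no history."""
    return ([None] * k + list(series))[:len(series)]


def rolling_mean(series, window=4):
    """Mean over the current value and up to window-1 earlier values."""
    return [mean(series[max(0, i - window + 1): i + 1]) for i in range(len(series))]


def fill_missing(values, fallback):
    """Replace missing entries with the value at the same position in fallback."""
    return [f if v is None else v for v, f in zip(values, fallback)]


sales = [100.0, 120.0, 110.0, 130.0, 150.0]  # illustrative weekly sales
roll4 = rolling_mean(sales)
lag_1 = fill_missing(lag(sales, 1), roll4)
lag_2 = fill_missing(lag(sales, 2), roll4)
print("rolling_mean_4w:", roll4)
print("lag_1:", lag_1)
print("lag_2:", lag_2)
```

Since the window includes the current week, the filled values in the first rows are derived from the current week's own sales rather than from earlier history.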

In [None]:
# Lag + rolling features
for lag in [1, 2, 4, 8]:
    sales_df[f"lag_{lag}"] = sales_df.groupby(["store_id", "dept_id"])["weekly_sales"].shift(lag)

g = sales_df.groupby(["store_id", "dept_id"])["weekly_sales"]
sales_df["rolling_mean_4w"] = g.transform(lambda x: x.rolling(4, min_periods=1).mean())
sales_df["rolling_std_4w"] = g.transform(lambda x: x.rolling(4, min_periods=2).std()).fillna(0)
sales_df["sales_vs_avg"] = (sales_df["weekly_sales"] / sales_df["rolling_mean_4w"].replace(0, 1)).fillna(1)

for lag in [1, 2, 4, 8]:
    sales_df[f"lag_{lag}"] = sales_df[f"lag_{lag}"].fillna(sales_df["rolling_mean_4w"])
sales_df = sales_df.fillna(0)
print(f"Features: {len(sales_df.columns)}")

## Save to Parquet

Save feature data to PVC for Feast to read:

```
/opt/app-root/src/shared/
└── data/
    ├── sales_features.parquet   # 65K rows, 22 cols
    └── store_features.parquet   # Store metadata
```

In [None]:
# Save sales features, then store metadata with one row per (store_id, dept_id)
sales_df.to_parquet(DATA_DIR / "sales_features.parquet", index=False)
stores = pd.DataFrame([
    {"store_id": s, "dept_id": d, "event_timestamp": base_date,
     "store_type": ["A", "B", "C"][s % 3], "store_size": 100000 + s * 10000,
     "region": f"region_{(s - 1) // 15 + 1}"}
    for s in range(1, STORES + 1) for d in range(1, DEPTS + 1)
])
stores.to_parquet(DATA_DIR / "store_features.parquet", index=False)
print(f"✅ Saved to {DATA_DIR}")

## Setup Feast Repository

Copy feature definitions to the shared PVC:

| File | Purpose |
|------|----------|
| `feature_store.yaml` | Registry (PostgreSQL), offline/online stores |
| `feature_store_ray.yaml` | Ray-enabled config for distributed retrieval |
| `features.py` | FeatureViews, Entities, FeatureServices |

**Key FeatureServices defined:**
- `training_features` → All features for model training
- `inference_features` → Subset for real-time serving

In [None]:
FEATURE_REPO.mkdir(parents=True, exist_ok=True)
# Flat structure: feature_repo/ is in same directory as notebook
src_dir = Path("/opt/app-root/src/feature_repo")
for f in ["feature_store.yaml", "feature_store_ray.yaml", "features.py"]:
    src = src_dir / f
    if src.exists():
        shutil.copy(src, FEATURE_REPO / f)
        print(f"✅ {f}")
    else:
        print(f"❌ Missing: {src}")

## Feast Apply

Register feature definitions in PostgreSQL:

```
feast apply
    │
    ├── Reads features.py
    ├── Creates tables in PostgreSQL registry
    └── Stores feature metadata (schemas, timestamps)
```

**Output:** FeatureViews (`store_features`, `sales_features`) registered.
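
The Feast CLI signals failure through its exit code. One way to stop the notebook on a failed `feast apply`, rather than continuing with an unregistered repo, is a small wrapper around `subprocess.run`. The helper name `run_cli` is hypothetical and not part of Feast; the sketch is demonstrated with the Python interpreter so that it runs without Feast installed.

```python
import subprocess
import sys


def run_cli(cmd, cwd=None):
    """Run a command, echo its stdout, and raise if it exits non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True, cwd=cwd)
    print(result.stdout, end="")
    if result.returncode != 0:
        raise RuntimeError(
            f"{' '.join(cmd)} failed ({result.returncode}):\n{result.stderr}"
        )
    return result


# Stand-in commands; in this notebook the call would look like
# run_cli(["feast", "apply"], cwd=str(FEATURE_REPO)).
ok = run_cli([sys.executable, "-c", "print('registered')"])
try:
    run_cli([sys.executable, "-c", "import sys; sys.exit(3)"])
    failed = False
except RuntimeError:
    failed = True
print(f"failure detected: {failed}")
```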

In [None]:
os.chdir(str(FEATURE_REPO))
result = subprocess.run(["feast", "apply"], capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print(f"ERROR: {result.stderr}")

## Feast Materialize

Populate the **online store** for low-latency serving:

```
Offline Store (Parquet)          Online Store (PostgreSQL)
┌────────────────────┐           ┌────────────────────┐
│ Full history       │  ──────▶  │ Latest values only │
│ 65K rows           │ materialize│ 630 entities       │
│ For training       │           │ For serving        │
└────────────────────┘           └────────────────────┘
```

**Why:** Training needs history; serving needs current values fast.
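
Conceptually, materialization keeps the most recent row per entity key inside the requested time window. The sketch below illustrates that idea in plain Python with made-up rows. It is not Feast's implementation: the function `latest_per_entity` is hypothetical, the inclusive window bounds are an assumption, and details such as TTL handling are not modeled.

```python
from datetime import datetime, timezone


def utc(year, month, day):
    """Build a timezone-aware UTC timestamp."""
    return datetime(year, month, day, tzinfo=timezone.utc)


def latest_per_entity(rows, start, end):
    """Keep the most recent row per (store_id, dept_id) with start <= ts <= end."""
    latest = {}
    for row in rows:
        ts = row["event_timestamp"]
        if not (start <= ts <= end):
            continue
        key = (row["store_id"], row["dept_id"])
        if key not in latest or ts > latest[key]["event_timestamp"]:
            latest[key] = row
    return latest


history = [  # illustrative offline-store rows
    {"store_id": 1, "dept_id": 1, "event_timestamp": utc(2022, 1, 1), "weekly_sales": 100.0},
    {"store_id": 1, "dept_id": 1, "event_timestamp": utc(2022, 1, 8), "weekly_sales": 120.0},
    {"store_id": 2, "dept_id": 1, "event_timestamp": utc(2022, 1, 1), "weekly_sales": 90.0},
]
online = latest_per_entity(history, utc(2022, 1, 1), utc(2022, 1, 31))
print(f"entities in online store: {len(online)}")
print(f"store 1 / dept 1 latest sales: {online[(1, 1)]['weekly_sales']}")
```

Applied to this notebook's data, the same reduction turns 65,520 historical rows into one row for each of the 630 store/department pairs.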

In [None]:
end_ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
result = subprocess.run(["feast", "materialize", f"{START_DATE}T00:00:00", end_ts], capture_output=True, text=True, cwd=str(FEATURE_REPO))
print(result.stdout)
if result.returncode != 0:
    print(f"ERROR: {result.stderr}")

## Verify Setup

Test online feature retrieval (what serving will use):

In [None]:
from feast import FeatureStore
store = FeatureStore(repo_path=str(FEATURE_REPO))
print(f"Views: {[fv.name for fv in store.list_feature_views()]}")
print(f"Services: {[fs.name for fs in store.list_feature_services()]}")

features = store.get_online_features(features=["sales_features:weekly_sales", "sales_features:lag_1"], entity_rows=[{"store_id": 1, "dept_id": 1}]).to_dict()
print(f"\n✅ Online: {features}")

---
## ✅ Complete!

**What we built:**
- 65K rows of synthetic sales data
- 22 columns, including 7 engineered features (lags, rolling stats)
- Feast registry in PostgreSQL
- Online store populated for serving

**Next:** `02-training.ipynb` → Train model with distributed feature retrieval