# 01 - Feast Feature Store Setup

![Workflow](../docs/01-features-workflow.png)

## What This Notebook Does

| Step | Action | Output |
|------|--------|--------|
| 1 | Generate synthetic sales data | `sales_features.parquet` |
| 2 | Engineer lag/rolling features | 22 columns total (keys, timestamp, target, features) |
| 3 | `feast apply` via Ray | Register features in PostgreSQL |
| 4 | `feast materialize` via Ray | Populate online store |

## Architecture (KubeRay + CodeFlare SDK)

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Notebook   │────▶│  KubeRay    │────▶│  PostgreSQL │
│  (Feast)    │     │  Cluster    │     │  Registry   │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       │ CodeFlare SDK     │ Distributed       │ Online Store
       │ Auto-Auth         │ Materialize       │ (Low-latency)
```

**Prerequisites:** Run `kubectl apply -k manifests/` first to deploy:
- PostgreSQL (registry + online store)
- RayCluster `feast-ray`
- Shared PVC

In [None]:
%pip install -q "feast[postgres,ray]==0.59.0" codeflare-sdk pandas pyarrow psycopg2-binary
import os, shutil, subprocess
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime, timedelta, timezone

# Setup CodeFlare SDK auth from service account token (for Ray cluster access)
token_path = "/var/run/secrets/kubernetes.io/serviceaccount/token"
if os.path.exists(token_path):
    with open(token_path) as f:
        os.environ["FEAST_RAY_AUTH_TOKEN"] = f.read().strip()
    k8s_host = os.environ.get("KUBERNETES_SERVICE_HOST", "")
    k8s_port = os.environ.get("KUBERNETES_SERVICE_PORT", "443")
    if k8s_host:
        os.environ["FEAST_RAY_AUTH_SERVER"] = f"https://{k8s_host}:{k8s_port}"
    os.environ["FEAST_RAY_SKIP_TLS"] = "true"
    print("🔐 CodeFlare SDK auth configured for KubeRay access")

## Configuration

| Variable | Value | Purpose |
|----------|-------|----------|
| `SHARED_ROOT` | `/opt/app-root/src/shared` | PVC mount point |
| `WEEKS` | 104 | 2 years of data |
| `STORES × DEPTS` | 45 × 14 | 630 unique entities |
| **Total Records** | **65,520** | Weekly granularity |

In [None]:
# PVC mounted at /opt/app-root/src/shared in RHOAI workbench
SHARED_ROOT = Path("/opt/app-root/src/shared")
FEATURE_REPO = SHARED_ROOT / "feature_repo"
DATA_DIR = SHARED_ROOT / "data"
DATA_DIR.mkdir(parents=True, exist_ok=True)

START_DATE, WEEKS, STORES, DEPTS, SEED = "2022-01-01", 104, 45, 14, 42
print("📊 Config:")
print(f"   Data dir: {DATA_DIR}")
print(f"   Feature repo: {FEATURE_REPO}")
print(f"   Total records: {WEEKS * STORES * DEPTS:,}")

## Step 1: Generate Synthetic Sales Data

Creates Walmart-style retail data with realistic patterns:

| Feature | Logic | Purpose |
|---------|-------|----------|
| `weekly_sales` | Base × Store × Dept × Season × Holiday + noise | Target variable |
| `is_holiday` | Weeks 6, 27, 36, 47, 51 | Super Bowl, July 4th, etc. |
| `seasonal` | `1 + 0.3 × sin(2π × week/52)` | Peaks near week 13, dips near week 39 |
| `temperature` | `60 + 20 × sin(2π × week/52)` + noise | Weather correlation |
| `fuel_price`, `cpi` | Uniform noise in 3.00 to 3.50; linear trend from 220 | Economic indicators |
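
To see where the seasonal term peaks, the multiplier can be evaluated on its own for each week of the year. The following is a minimal standard-library sketch that mirrors the `1 + 0.3 * sin(2π * week / 52)` expression and the 1.5× holiday uplift used in the generation cell; the helper names `seasonal_multiplier` and `holiday_multiplier` are illustrative and do not appear in the notebook.

```python
import math

HOLIDAYS = {6, 27, 36, 47, 51}  # same holiday weeks as the notebook


def seasonal_multiplier(week_of_year: int) -> float:
    """Sinusoidal seasonality with a 52-week period and +/-30% amplitude."""
    return 1 + 0.3 * math.sin(2 * math.pi * week_of_year / 52)


def holiday_multiplier(week_of_year: int) -> float:
    """Holiday weeks get a 50% uplift; all other weeks are unchanged."""
    return 1.5 if week_of_year in HOLIDAYS else 1.0


weeks = range(1, 53)
peak_week = max(weeks, key=seasonal_multiplier)
trough_week = min(weeks, key=seasonal_multiplier)
print(f"peak week: {peak_week}, trough week: {trough_week}")
```

With this formula the multiplier is highest around week 13 and lowest around week 39, so the two factors only partly overlap: the holiday uplift in weeks 47 and 51 is applied while the seasonal term is below 1.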

In [None]:
np.random.seed(SEED)
base_date = datetime.fromisoformat(START_DATE).replace(tzinfo=timezone.utc)
HOLIDAYS = {6, 27, 36, 47, 51}  # Major holiday weeks
HOLIDAY_WEEKS = sorted(HOLIDAYS)

records = []
for week in range(WEEKS):
    dt = base_date + timedelta(weeks=week)
    woy, month = dt.isocalendar()[1], dt.month
    day = dt.day
    week_of_month = (day - 1) // 7 + 1
    next_week = dt + timedelta(weeks=1)
    is_month_end = 1 if next_week.month != month else 0
    # Days until the next holiday week; the modulo wraps past week 52 into the new year
    days_to_holiday = min((h - woy) % 52 for h in HOLIDAY_WEEKS) * 7
    
    seasonal = 1 + 0.3 * np.sin(2 * np.pi * woy / 52)
    for s in range(1, STORES + 1):
        for d in range(1, DEPTS + 1):
            sales = max(0, (50000 + s*5000) * (0.5 + d*0.2) * seasonal * (1.5 if woy in HOLIDAYS else 1) + np.random.normal(0, 2000))
            records.append({
                "store_id": s, "dept_id": d, "event_timestamp": dt, "weekly_sales": round(sales, 2),
                "week_of_year": woy, "month": month, "quarter": (month-1)//3+1, 
                "week_of_month": week_of_month, "is_month_end": is_month_end,
                "is_holiday": int(woy in HOLIDAYS), "days_to_holiday": days_to_holiday,
                "temperature": round(60 + 20*np.sin(2*np.pi*woy/52) + np.random.normal(0,5), 1),
                "fuel_price": round(3 + 0.5*np.random.random(), 2), "cpi": round(220 + week*0.1, 1), 
                "unemployment": round(5 + np.random.normal(0, 0.5), 1)
            })

sales_df = pd.DataFrame(records).sort_values(["store_id", "dept_id", "event_timestamp"]).reset_index(drop=True)
print(f"✅ Generated {len(sales_df):,} rows")
print(f"   Date range: {sales_df['event_timestamp'].min().date()} to {sales_df['event_timestamp'].max().date()}")
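
The `days_to_holiday` column relies on modular arithmetic: in Python, `(h - woy) % 52` is never negative, so the value is the distance to the *next* holiday week, wrapping past the end of the year. A small sketch of that behavior, using a hypothetical helper `days_to_next_holiday` that is not part of the notebook:

```python
HOLIDAY_WEEKS = [6, 27, 36, 47, 51]  # same weeks as the notebook


def days_to_next_holiday(week_of_year: int) -> int:
    """Days until the next holiday week, wrapping past week 52 into the new year."""
    weeks_ahead = min((h - week_of_year) % 52 for h in HOLIDAY_WEEKS)
    return weeks_ahead * 7


for woy in (6, 7, 50, 52):
    print(woy, days_to_next_holiday(woy))
```

A holiday week itself yields 0, and the week right after a holiday reports the distance to the following holiday rather than to the one that just passed.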

In [None]:
# 📊 SAMPLE DATA: Raw sales data (before feature engineering)
print("📊 Sample: Raw sales data (5 rows)")
print(f"   Columns: {list(sales_df.columns)}")
print()
sales_df.head()

In [None]:
# 📊 SAMPLE DATA: Summary statistics for sales and exogenous columns
print("📊 Sample: Sales distribution")
print(sales_df[['weekly_sales', 'temperature', 'fuel_price', 'cpi']].describe().round(2))

## Step 2: Feature Engineering

Add time-series features that capture historical patterns:

```
lag_1:  Sales from 1 week ago  → Most predictive (35% importance)
lag_2:  Sales from 2 weeks ago → Recent trend
lag_4:  Sales from 4 weeks ago → Monthly pattern
lag_8:  Sales from 8 weeks ago → Bi-monthly pattern

rolling_mean_4w:  4-week moving average → Smoothed trend (28% importance)
rolling_std_4w:   4-week std deviation  → Volatility
sales_vs_avg:     Current / Average     → Relative performance
```
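
To make these definitions concrete, here is a dependency-free sketch on a single five-week series. It imitates what pandas `shift` and `rolling(4, min_periods=1).mean()` produce for one store/department group; the helpers `lag` and `rolling_mean` and the sales values are hypothetical and only serve as an illustration.

```python
from statistics import mean
from typing import List, Optional


def lag(values: List[float], k: int) -> List[Optional[float]]:
    """Shift the series forward by k steps; the first k slots have no history."""
    return [None] * min(k, len(values)) + values[: max(len(values) - k, 0)]


def rolling_mean(values: List[float], window: int) -> List[float]:
    """Trailing mean over up to `window` values, including the current one."""
    return [mean(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


sales = [100.0, 110.0, 120.0, 130.0, 140.0]  # made-up weekly sales
lag_1 = lag(sales, 1)
rolling_mean_4w = rolling_mean(sales, 4)

# Fill missing lags with the rolling mean of the same row, as the notebook does
lag_1_filled = [avg if lagged is None else lagged
                for lagged, avg in zip(lag_1, rolling_mean_4w)]
sales_vs_avg = [s / avg for s, avg in zip(sales, rolling_mean_4w)]

print(lag_1)
print(rolling_mean_4w)
print(lag_1_filled)
```

The first row has no history, so its lag is filled from the rolling mean, which at that point equals the current week's sales.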

In [None]:
# Lag features (most predictive - 35% importance)
for lag in [1, 2, 4, 8]:
    sales_df[f"lag_{lag}"] = sales_df.groupby(["store_id", "dept_id"])["weekly_sales"].shift(lag)

# Rolling statistics (28% importance)
g = sales_df.groupby(["store_id", "dept_id"])["weekly_sales"]
sales_df["rolling_mean_4w"] = g.transform(lambda x: x.rolling(4, min_periods=1).mean())
sales_df["rolling_std_4w"] = g.transform(lambda x: x.rolling(4, min_periods=2).std()).fillna(0)
sales_df["sales_vs_avg"] = (sales_df["weekly_sales"] / sales_df["rolling_mean_4w"].replace(0, 1)).fillna(1)

# Fill NaN lags with rolling mean (more realistic than 0)
for lag in [1, 2, 4, 8]:
    sales_df[f"lag_{lag}"] = sales_df[f"lag_{lag}"].fillna(sales_df["rolling_mean_4w"])
sales_df = sales_df.fillna(0)

print(f"✅ Features engineered: {len(sales_df.columns)} columns")

In [None]:
# 📊 SAMPLE DATA: After feature engineering (show lag and rolling features)
print("📊 Sample: Engineered features for Store 1, Dept 1")
feature_cols = ['event_timestamp', 'weekly_sales', 'lag_1', 'lag_2', 'rolling_mean_4w', 'rolling_std_4w', 'sales_vs_avg']
sales_df[(sales_df['store_id'] == 1) & (sales_df['dept_id'] == 1)][feature_cols].head(10)

In [None]:
# 📊 SAMPLE DATA: Feature correlation with target
print("📊 Feature correlation with weekly_sales:")
numeric_cols = ['lag_1', 'lag_2', 'lag_4', 'lag_8', 'rolling_mean_4w', 'rolling_std_4w', 
                'week_of_year', 'is_holiday', 'temperature']
correlations = sales_df[numeric_cols + ['weekly_sales']].corr()['weekly_sales'].drop('weekly_sales').sort_values(ascending=False)
print(correlations.round(3).to_string())

## Step 3: Save to Parquet

Save feature data to PVC for Feast to read:

```
/opt/app-root/src/shared/data/
├── sales_features.parquet   # 65K rows, 22 cols
└── store_features.parquet   # Store metadata
```

In [None]:
# Save sales features
sales_df.to_parquet(DATA_DIR / "sales_features.parquet", index=False)
print(f"✅ Saved: {DATA_DIR / 'sales_features.parquet'}")
print(f"   Shape: {sales_df.shape}")

# Create and save store features (static metadata)
stores = pd.DataFrame([
    {
        "store_id": s, "dept_id": d, "event_timestamp": base_date,
        "store_type": ["A", "B", "C"][s % 3],
        "store_size": 100000 + s * 10000,
        "region": f"region_{(s - 1) // 15 + 1}"
    }
    for s in range(1, STORES + 1) for d in range(1, DEPTS + 1)
])
stores.to_parquet(DATA_DIR / "store_features.parquet", index=False)
print(f"✅ Saved: {DATA_DIR / 'store_features.parquet'}")
print(f"   Shape: {stores.shape}")

In [None]:
# 📊 SAMPLE DATA: Store features
print("📊 Sample: Store features (5 rows)")
stores.head()

In [None]:
# 📊 SAMPLE DATA: Store type distribution
print("📊 Store type distribution:")
print(stores.groupby('store_type')['store_id'].nunique())

## Step 4: Setup Feast Repository

Copy feature definitions to the shared PVC:

| File | Purpose |
|------|----------|
| `feature_store.yaml` | Ray-enabled config (KubeRay + CodeFlare SDK) |
| `features.py` | FeatureViews, Entities, FeatureServices + **auto-auth** |

**Key Features:**
- `training_features` → All features for model training
- `inference_features` → Subset for real-time serving

**Auto-Auth:** `features.py` reads the service account token from `/var/run/secrets/kubernetes.io/serviceaccount/token`.

In [None]:
FEATURE_REPO.mkdir(parents=True, exist_ok=True)
(DATA_DIR / "ray_storage").mkdir(parents=True, exist_ok=True)

# Look for feature_repo in multiple possible locations
possible_paths = [
    Path("/opt/app-root/src/feature_repo"),
    Path("/opt/app-root/src/sales-demand-forecasting/feature_repo"),
    Path("../feature_repo"),
]

src_dir = None
for p in possible_paths:
    if p.exists() and (p / "features.py").exists():
        src_dir = p
        print(f"📁 Found feature_repo at: {src_dir}")
        break

if src_dir is None:
    raise FileNotFoundError(f"feature_repo not found in: {possible_paths}")

# Copy features.py (includes auto-auth for CodeFlare SDK)
shutil.copy(src_dir / "features.py", FEATURE_REPO / "features.py")
print("✅ features.py (with CodeFlare SDK auto-auth)")

# Use Ray config as main feature_store.yaml
shutil.copy(src_dir / "feature_store_ray.yaml", FEATURE_REPO / "feature_store.yaml")
print("✅ feature_store.yaml (Ray + KubeRay)")

In [None]:
# 📊 SAMPLE DATA: Show feature_store.yaml config
print("📊 Feast Config (feature_store.yaml):")
print("-" * 50)
with open(FEATURE_REPO / "feature_store.yaml") as f:
    print(f.read())

## Step 5: Feast Apply (via Remote KubeRay Cluster)

Register feature definitions in PostgreSQL using the deployed Ray cluster:

```
feast apply
    │
    ├── Reads features.py (auto-auth configures CodeFlare SDK)
    ├── Connects to KubeRay cluster "feast-ray" via CodeFlare SDK
    ├── Uses mTLS for secure communication
    └── Creates tables in PostgreSQL registry
```

**Note:** The Ray cluster must be running before this step. Deploy with `kubectl apply -f manifests/03-raycluster.yaml`

In [None]:
os.chdir(str(FEATURE_REPO))
print(f"📍 Working dir: {os.getcwd()}")
print("\n🚀 Running: feast apply")
print("-" * 50)

result = subprocess.run(["feast", "apply"], capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print(f"❌ ERROR: {result.stderr}")
else:
    print("✅ Features registered to PostgreSQL")
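
The cell above prints `stderr` when the command fails but lets the notebook continue, so later steps can run against an empty registry. One possible stricter variant is sketched below, assuming a hypothetical helper `run_cli` (not part of Feast or this repository) that raises on a non-zero exit code.

```python
import subprocess
import sys
from typing import List, Optional


def run_cli(cmd: List[str], cwd: Optional[str] = None) -> str:
    """Run a command, return its stdout, and raise if it exits non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True, cwd=cwd)
    if result.returncode != 0:
        raise RuntimeError(
            f"{' '.join(cmd)} failed with exit code {result.returncode}:\n{result.stderr}"
        )
    return result.stdout


# Demonstrated with the Python interpreter so the sketch runs anywhere.
# In the notebook the call would look like:
#   run_cli(["feast", "apply"], cwd=str(FEATURE_REPO))
print(run_cli([sys.executable, "-c", "print('ok')"]))
```

Passing `cwd` also removes the need for `os.chdir`, which changes the working directory for every subsequent cell.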

## Step 6: Feast Materialize (via Remote KubeRay Cluster)

Populate the **online store** using the deployed Ray cluster for distributed processing:

```
Offline Store (Parquet)     KubeRay Cluster     Online Store (PostgreSQL)
┌────────────────────┐    ┌───────────────┐    ┌────────────────────┐
│ Full history       │───▶│  Distributed  │───▶│ Latest values only │
│ 65K rows           │    │  Processing   │    │ 630 entities       │
│ For training       │    │ (feast-ray)   │    │ For serving        │
└────────────────────┘    └───────────────┘    └────────────────────┘
```
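
Conceptually, materialization reduces the full history to one row per entity key. The sketch below illustrates that reduction in plain Python; it is not how Feast or Ray implement it, it ignores the start/end window and any TTL, and the helper `latest_per_entity` and the sales values are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def latest_per_entity(rows: List[Row]) -> Dict[Tuple[int, int], Row]:
    """Keep only the most recent row for each (store_id, dept_id) key."""
    latest: Dict[Tuple[int, int], Row] = {}
    for row in rows:
        key = (row["store_id"], row["dept_id"])
        if key not in latest or row["event_timestamp"] > latest[key]["event_timestamp"]:
            latest[key] = row
    return latest


utc = timezone.utc
offline_rows = [  # illustrative values, not notebook output
    {"store_id": 1, "dept_id": 1, "event_timestamp": datetime(2023, 12, 16, tzinfo=utc), "weekly_sales": 41000.0},
    {"store_id": 1, "dept_id": 1, "event_timestamp": datetime(2023, 12, 23, tzinfo=utc), "weekly_sales": 43500.0},
    {"store_id": 2, "dept_id": 1, "event_timestamp": datetime(2023, 12, 23, tzinfo=utc), "weekly_sales": 52000.0},
]

online_rows = latest_per_entity(offline_rows)
print(len(online_rows), online_rows[(1, 1)]["weekly_sales"])
```

Applied to the notebook's data, the same idea turns 65,520 historical rows into 630 online rows, one per store/department pair.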

**Why Ray for Materialize:**
- Distributes work across KubeRay cluster `feast-ray`
- Faster for large datasets (>1M rows)
- Uses `batch_engine: ray.engine` in feature_store_ray.yaml
- CodeFlare SDK handles mTLS authentication automatically

In [None]:
end_ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
print(f"🚀 Running: feast materialize {START_DATE}T00:00:00 {end_ts}")
print("-" * 50)

result = subprocess.run(
    ["feast", "materialize", f"{START_DATE}T00:00:00", end_ts],
    capture_output=True, text=True, cwd=str(FEATURE_REPO)
)
print(result.stdout)
if result.returncode != 0:
    print(f"❌ ERROR: {result.stderr}")
else:
    print("✅ Features materialized to PostgreSQL online store")

## Step 7: Verify Setup

Test feature retrieval to ensure everything is working:

In [None]:
from feast import FeatureStore

store = FeatureStore(repo_path=str(FEATURE_REPO))

print("📋 Registered Objects:")
print(f"   Entities: {[e.name for e in store.list_entities()]}")
print(f"   FeatureViews: {[fv.name for fv in store.list_feature_views()]}")
print(f"   FeatureServices: {[fs.name for fs in store.list_feature_services()]}")

In [None]:
# 📊 SAMPLE DATA: Online feature lookup (what serving will use)
print("📊 Sample: Online feature lookup for Store 1, Dept 1")
print("-" * 50)

online_features = store.get_online_features(
    features=[
        "sales_features:weekly_sales",
        "sales_features:lag_1",
        "sales_features:rolling_mean_4w",
        "sales_features:is_holiday",
        "store_features:store_type",
        "store_features:store_size",
    ],
    entity_rows=[{"store_id": 1, "dept_id": 1}]
).to_dict()

for k, v in online_features.items():
    print(f"   {k}: {v[0]}")

In [None]:
# 📊 SAMPLE DATA: Online features for multiple entities
print("📊 Sample: Online features for multiple stores")
print("-" * 50)

entities = [
    {"store_id": 1, "dept_id": 1},
    {"store_id": 10, "dept_id": 5},
    {"store_id": 25, "dept_id": 10},
    {"store_id": 45, "dept_id": 14},
]

multi_features = store.get_online_features(
    features=["sales_features:weekly_sales", "sales_features:lag_1", "store_features:store_type"],
    entity_rows=entities
).to_df()

multi_features

In [None]:
# 📊 SAMPLE DATA: Historical features (what training will use via remote Ray)
print("📊 Sample: Historical feature retrieval via Remote KubeRay")
print("-" * 50)
print("This uses get_historical_features(), which distributes point-in-time (PIT) joins across the KubeRay cluster")
print()

# Small entity DataFrame for demo
entity_df = pd.DataFrame([
    {"store_id": 1, "dept_id": 1, "event_timestamp": datetime(2023, 6, 1, tzinfo=timezone.utc)},
    {"store_id": 1, "dept_id": 1, "event_timestamp": datetime(2023, 6, 15, tzinfo=timezone.utc)},
    {"store_id": 10, "dept_id": 5, "event_timestamp": datetime(2023, 6, 1, tzinfo=timezone.utc)},
    {"store_id": 25, "dept_id": 10, "event_timestamp": datetime(2023, 7, 1, tzinfo=timezone.utc)},
])

historical = store.get_historical_features(
    entity_df=entity_df,
    features=["sales_features:weekly_sales", "sales_features:lag_1", "sales_features:rolling_mean_4w", "store_features:store_type"]
).to_df()

print(f"✅ Retrieved {len(historical)} rows with {len(historical.columns)} columns")
print(f"   Columns: {list(historical.columns)}")
print()
historical
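
A point-in-time join returns, for each entity row, the most recent feature value recorded at or before that row's timestamp, so training never sees values from the future. The following is a minimal single-entity sketch of that rule, assuming a hypothetical helper `point_in_time_lookup` and made-up sales values; it ignores TTL and is not Feast's implementation.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from typing import List, Optional, Tuple


def point_in_time_lookup(
    history: List[Tuple[datetime, float]], as_of: datetime
) -> Optional[float]:
    """Return the latest value whose timestamp is <= as_of, or None if none exists.

    `history` must be sorted by timestamp in ascending order.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


utc = timezone.utc
history = [  # weekly rows for one store/department, illustrative values
    (datetime(2023, 5, 27, tzinfo=utc), 40000.0),
    (datetime(2023, 6, 3, tzinfo=utc), 41000.0),
    (datetime(2023, 6, 10, tzinfo=utc), 42000.0),
]

print(point_in_time_lookup(history, datetime(2023, 6, 1, tzinfo=utc)))
print(point_in_time_lookup(history, datetime(2023, 6, 15, tzinfo=utc)))
```

An entity timestamp of 2023-06-01 therefore picks up the row from 2023-05-27, not the later row from 2023-06-03.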

## ✅ Complete!

### What We Built

| Component | Count | Location |
|-----------|-------|----------|
| Sales records | 65,520 | `/opt/app-root/src/shared/data/sales_features.parquet` |
| Store records | 630 | `/opt/app-root/src/shared/data/store_features.parquet` |
| Features | 22 | Lags, rolling stats, temporal, economic |
| Registry | PostgreSQL | Feature metadata |
| Online Store | PostgreSQL | Latest values for serving |

### Feature Importance (typical retail forecasting)

| Feature Group | Importance | Examples |
|---------------|------------|----------|
| Lag features | 35% | `lag_1`, `lag_2`, `lag_4`, `lag_8` |
| Rolling stats | 28% | `rolling_mean_4w`, `rolling_std_4w` |
| Temporal | 18% | `week_of_year`, `month`, `quarter` |
| Holiday | 10% | `is_holiday`, `days_to_holiday` |
| Economic | 7% | `temperature`, `fuel_price`, `cpi` |
| Store | 2% | `store_type`, `store_size` |

### Next Steps

**Option A: Use Manifests (Recommended)**
```bash
kubectl apply -f manifests/05-dataprep-job.yaml   # Regenerate data
kubectl apply -f manifests/06-trainjob.yaml       # Train model
```

**Option B: Use Notebooks**
- `02-training.ipynb` → Train model with `get_historical_features()`
- `03-inference.ipynb` → Deploy model with KServe