# Phase 1B: Feature Store with Feast

**Goal:** Build a training dataset using Feast's point-in-time correct feature retrieval and compare model performance to the baseline.

**Key concepts:**
- Feature definitions as version-controlled code
- Point-in-time joins prevent future data leakage
- Materialization loads feature values from the offline store into the online store
- `get_historical_features()` builds point-in-time correct training data from the same feature definitions used for serving

## Setup and Imports

In [None]:
# TODO: Import necessary libraries
# - pandas for data manipulation
# - Feast FeatureStore for feature retrieval
# - lightgbm for model training
# - sklearn for preprocessing and evaluation
# - Any other libraries you need

## 1. Load the Dataset and Understand Timestamps

Before materializing features, you need to understand the timestamp range in your data.

In [None]:
# TODO: Load your fitness dataset
# - Read the CSV into a pandas DataFrame
# - Examine the timestamp column
# - Convert to datetime if needed
# - Determine min and max timestamps (you'll need these for 'feast materialize')

In [None]:
# TODO: Check timestamp range
# - Print the earliest timestamp
# - Print the latest timestamp
# - Verify timestamps are properly formatted
#
# Note: You'll use this range when running 'feast materialize' from the CLI
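
One possible way to find the range is sketched below. The small DataFrame is a synthetic stand-in for the fitness CSV, and the column names are assumptions; the ISO 8601 strings it produces are in the format that `feast materialize` accepts for its start and end arguments.

```python
import pandas as pd

# Synthetic stand-in for the fitness CSV; column names are assumptions.
df = pd.DataFrame({
    "participant_id": [1, 1, 2],
    "timestamp": [
        "2024-01-01 08:00:00",
        "2024-01-01 08:05:00",
        "2024-01-02 09:30:00",
    ],
})

# Parse strings into datetimes; errors="raise" surfaces malformed values early.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="raise")

# Earliest and latest event times in the data.
start_ts = df["timestamp"].min()
end_ts = df["timestamp"].max()

# Format as ISO 8601 strings for use on the command line.
start_iso = start_ts.strftime("%Y-%m-%dT%H:%M:%S")
end_iso = end_ts.strftime("%Y-%m-%dT%H:%M:%S")

print("Earliest:", start_iso)
print("Latest:  ", end_iso)
```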

## 2. Compute Engineered Features

Before Feast can materialize features, they need to exist in your data source.

**Feature engineering:**
1. Rolling heart rate statistics (5min and 15min windows)
2. Acceleration magnitude: `sqrt(accel_x² + accel_y² + accel_z²)`
3. Exertion intensity: `acceleration_magnitude × heart_rate`

In [None]:
# TODO: Compute rolling heart rate statistics
# - Sort data by participant_id and timestamp
# - Group by participant_id
# - Use rolling() with appropriate time windows (5min, 15min)
# - Compute mean and std for each window
# - Handle NaN values from initial windows
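
The steps above could be sketched roughly as follows, using synthetic readings. The column name `heart_rate` and the output names `hr_5min_mean`, `hr_5min_std`, and so on are assumptions; filling the initial NaN standard deviations with 0.0 is one choice among several.

```python
import pandas as pd

# Synthetic readings; the real dataset and its column names may differ.
df = pd.DataFrame({
    "participant_id": [1, 1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:02", "2024-01-01 00:04",
        "2024-01-01 00:06", "2024-01-01 01:00", "2024-01-01 01:02",
    ]),
    "heart_rate": [60.0, 70.0, 80.0, 90.0, 100.0, 110.0],
})

# Sort so that the grouped rolling output lines up row-for-row with df.
df = df.sort_values(["participant_id", "timestamp"]).reset_index(drop=True)

# Time-based windows need a datetime index; group so windows never span participants.
grouped = df.set_index("timestamp").groupby("participant_id")["heart_rate"]

for window in ("5min", "15min"):
    rolling = grouped.rolling(window)  # window covers (t - window, t]
    df[f"hr_{window}_mean"] = rolling.mean().to_numpy()
    # A window holding a single reading has no sample std; fill those with 0.0.
    df[f"hr_{window}_std"] = rolling.std().fillna(0.0).to_numpy()

print(df)
```

Each window ends at the current row and looks only backward in time, so a rolling value never includes readings taken after its own timestamp.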

In [None]:
# TODO: Compute acceleration magnitude
# - Calculate sqrt(accel_x² + accel_y² + accel_z²)
# - This represents total movement regardless of direction

In [None]:
# TODO: Compute exertion intensity
# - Multiply acceleration_magnitude by heart_rate
# - This is a proxy for how hard someone is working
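
A minimal sketch of the two derived features, on synthetic values chosen so the arithmetic is easy to check by hand. Column names follow the ones used in this notebook; the input values are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic sensor rows; values chosen so the results are easy to verify.
df = pd.DataFrame({
    "accel_x": [3.0, 0.0],
    "accel_y": [4.0, 0.0],
    "accel_z": [0.0, 2.0],
    "heart_rate": [100.0, 60.0],
})

# Euclidean norm of the acceleration vector: direction-independent movement.
df["acceleration_magnitude"] = np.sqrt(
    df["accel_x"] ** 2 + df["accel_y"] ** 2 + df["accel_z"] ** 2
)

# Proxy for effort: movement scaled by heart rate.
df["exertion_intensity"] = df["acceleration_magnitude"] * df["heart_rate"]

print(df[["acceleration_magnitude", "exertion_intensity"]])
```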

In [None]:
# TODO: Save the engineered features to a Parquet file
# - This file will be the FileSource for Feast (Feast's file source reads Parquet, not CSV)
# - Save to a location accessible from feature_repo/
# - Ensure all required columns are present:
#   - participant_id
#   - timestamp (or event_timestamp)
#   - heart rate rolling features
#   - acceleration_magnitude
#   - exertion_intensity
#   - activity (label)

## 3. Switch to CLI: Register and Materialize Features

**Pause here and switch to your terminal:**

```bash
cd feature_repo
feast apply
feast materialize <START_TIMESTAMP> <END_TIMESTAMP>
```

Use the timestamp range you identified above, written in ISO 8601 format (for example, `2024-01-01T00:00:00`).

**What's happening:**
- `feast apply` registers your feature definitions from `features.py`
- `feast materialize` loads feature values for that time range from the offline store into the online store

Once complete, return here to retrieve features.

## 4. Initialize Feast Feature Store

Connect to your Feast feature store to retrieve features.

In [None]:
# TODO: Initialize the Feast FeatureStore
# - Point to your feature_repo/ directory
# - This loads the feature_store.yaml configuration
#
# Example:
# from feast import FeatureStore
# store = FeatureStore(repo_path="../feature_repo")

In [None]:
# TODO: List available feature views to verify registration
# - Use store.list_feature_views()
# - Confirm your feature views appear

## 5. Create Entity DataFrame

An entity DataFrame contains:
- Entity key(s): e.g., participant_id
- Event timestamp: the point in time at which feature values should be looked up
- Labels: the target variable (activity)

This is what you'll join features onto.

In [None]:
# TODO: Create entity DataFrame
# - Select participant_id, timestamp, and activity (label)
# - Rename timestamp column to 'event_timestamp' (Feast convention)
# - Ensure event_timestamp is datetime type
# - This represents "I want features for participant X at time T"
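
One way this might look, using a synthetic frame in place of the real dataset. Parsing with `utc=True` assumes the raw timestamps are in UTC; adjust if yours carry a different time zone.

```python
import pandas as pd

# Synthetic stand-in for the engineered dataset; column names are assumptions.
raw = pd.DataFrame({
    "participant_id": [1, 1, 2],
    "timestamp": [
        "2024-01-01 08:00:00",
        "2024-01-01 08:05:00",
        "2024-01-01 08:00:00",
    ],
    "heart_rate": [72.0, 95.0, 64.0],
    "activity": ["walking", "running", "sitting"],
})

# Keep only the entity key, the lookup time, and the label.
entity_df = (
    raw[["participant_id", "timestamp", "activity"]]
    .rename(columns={"timestamp": "event_timestamp"})
    .copy()
)

# Assumption: raw timestamps are UTC, so parse them as timezone-aware UTC.
entity_df["event_timestamp"] = pd.to_datetime(entity_df["event_timestamp"], utc=True)

print(entity_df.dtypes)
```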

In [None]:
# TODO: Inspect entity DataFrame
# - Print first few rows
# - Check data types
# - Verify timestamp format matches Feast expectations

## 6. Retrieve Historical Features with Point-in-Time Correctness

**This is the core of Feast's value:**

`get_historical_features()` performs a point-in-time join:
- For each row in your entity DataFrame (participant_id + event_timestamp)
- Feast retrieves the most recent feature values recorded at or before that timestamp
- No future data leaks into your training set

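This guards against temporal leakage. Training/serving skew is reduced separately, by reusing the same feature definitions for online serving.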
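
To build intuition for what a point-in-time join does, here is a small pandas sketch using `merge_asof`. It is not how Feast implements retrieval, only an illustration of the same idea on synthetic data: each entity row receives the latest feature value at or before its own timestamp.

```python
import pandas as pd

# Feature values as they were recorded over time (synthetic).
features = pd.DataFrame({
    "participant_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 00:10", "2024-01-01 00:05"]
    ),
    "hr_5min_mean": [60.0, 90.0, 75.0],
})

# "I want features for participant X at time T" (synthetic).
entity_df = pd.DataFrame({
    "participant_id": [1, 1, 2, 2],
    "event_timestamp": pd.to_datetime(
        ["2024-01-01 00:05", "2024-01-01 00:12",
         "2024-01-01 00:01", "2024-01-01 00:06"]
    ),
})

# merge_asof requires both frames sorted by the "on" column.
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="participant_id",      # match within the same participant only
    direction="backward",     # never look at values recorded later
)

result = joined.sort_values(["participant_id", "event_timestamp"]).reset_index(drop=True)
print(result)
```

Participant 1 at 00:05 receives the 00:00 value (60.0), not the later 00:10 value, and participant 2 at 00:01 receives NaN because no feature had been recorded yet. The `tolerance` parameter of `merge_asof` plays a loosely similar role to a feature view's `ttl`, limiting how far back a match may reach.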

In [None]:
# TODO: Retrieve historical features
# - Use store.get_historical_features()
# - Pass your entity DataFrame
# - Specify which features to retrieve, as "feature_view:feature_name" references
# - Convert to pandas DataFrame with .to_df()
#
# Example:
# training_df = store.get_historical_features(
#     entity_df=entity_df,
#     features=[
#         "heart_rate_stats:hr_5min_mean",
#         "heart_rate_stats:hr_5min_std",
#         # ... list all features you want
#     ],
# ).to_df()

In [None]:
# TODO: Inspect the retrieved features
# - Print shape (rows, columns)
# - Display first few rows
# - Check for missing values
# - Verify feature columns were joined correctly

## 7. Prepare Training Data

Split features and labels, then create train/test split.

In [None]:
# TODO: Separate features and labels
# - X = feature columns (drop entity keys, timestamps, labels)
# - y = activity labels

In [None]:
# TODO: Encode labels if necessary
# - Use LabelEncoder or pd.factorize
# - Store the mapping for later interpretation
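
A short sketch of the `pd.factorize` option on synthetic labels. Passing `sort=True` makes the integer codes follow alphabetical class order, so the mapping is stable across runs.

```python
import pandas as pd

# Synthetic labels; the real activity classes may differ.
labels = pd.Series(["walking", "running", "sitting", "running"])

# codes: integer label per row; classes: the unique labels in sorted order.
codes, classes = pd.factorize(labels, sort=True)

# Keep the code -> label mapping for interpreting predictions later.
label_mapping = dict(enumerate(classes))

print(codes)
print(label_mapping)
```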

In [None]:
# TODO: Create train/test split
# - Use the same random_state as Phase 1A for fair comparison
# - Consider using time-based split instead of random for time series data
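
If you try the time-based option, a minimal sketch could look like this. The 80/20 cutoff and the synthetic frame are assumptions; the point is that every test row is later than every training row.

```python
import pandas as pd

# Synthetic training frame: one row per minute.
training_df = pd.DataFrame({
    "event_timestamp": pd.date_range("2024-01-01", periods=10, freq="1min"),
    "hr_5min_mean": [float(v) for v in range(60, 70)],
    "activity": ["walking", "running"] * 5,
})

# Order chronologically, then cut at 80% (assumed ratio).
training_df = training_df.sort_values("event_timestamp").reset_index(drop=True)
split_idx = int(len(training_df) * 0.8)

train_df = training_df.iloc[:split_idx]
test_df = training_df.iloc[split_idx:]

print(len(train_df), "train rows,", len(test_df), "test rows")
```

Note that a time-based split is not directly comparable to a random split from Phase 1A, so use the same splitting strategy for both models when comparing metrics.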

## 8. Train LightGBM Model with Feast Features

In [None]:
# TODO: Train LightGBM classifier
# - Use default hyperparameters (same as baseline)
# - Fit on X_train, y_train
# - Time the training duration

## 9. Evaluate and Compare to Baseline

In [None]:
# TODO: Make predictions on test set
# - Get predicted labels
# - Get predicted probabilities (for more detailed analysis)

In [None]:
# TODO: Compute evaluation metrics
# - Accuracy
# - Macro F1 score
# - Per-class F1 scores
# - Classification report

In [None]:
# TODO: Generate confusion matrix
# - Use sklearn.metrics.confusion_matrix
# - Visualize with seaborn heatmap
# - Compare to baseline confusion matrix

## 10. Compare Results: Baseline vs. Feast Features

**Key question:** Did Feast features improve performance?

**More important question:** What did you learn about the workflow?

In [None]:
# TODO: Create comparison table
# - Load baseline metrics from Phase 1A
# - Compare accuracy, F1 scores side-by-side
# - Note: Performance difference might be small — that's OK!

In [None]:
# TODO: Analyze feature importance
# - Use model.feature_importances_ (LGBMClassifier) or booster.feature_importance() (native Booster API)
# - Which Feast features are most predictive?
# - Do the engineered features (rolling stats, exertion) help?

## 11. Reflection: What Did We Learn?

**Technical workflow:**
- Features defined in code (`features.py`) rather than scattered across notebooks
- `feast apply` registers features; `feast materialize` loads them into the online store
- `get_historical_features()` ensures point-in-time correctness
- Same feature definitions will be used for online serving (Phase 3)

**Key insight:**
Even if model performance didn't dramatically improve, you now have:
1. **Reproducible feature definitions** (version-controlled)
2. **Protection against temporal leakage** (point-in-time joins)
3. **Serving consistency** (same features in training and production)

This is the foundation of production ML systems.

## Next Steps

✅ Phase 1A: Baseline model  
✅ Phase 1B: Feature store with Feast  
⬜ Phase 2: MLflow experiment tracking and model registry

In Phase 2, you'll run multiple experiments, log everything to MLflow, and manage model versions through a lifecycle (Staging → Production).