# Feature Store

**NOTE:** This project does not implement Feature Store.


We use Amazon S3 + Athena as the primary offline feature repository instead of SageMaker Feature Store, for the following reasons:
- Data scale and ingestion overhead: Our dataset contains ~26.7 million network flow records. SageMaker Feature Store supports datasets of this size, but ingesting them efficiently requires Spark- or Glue-based batch pipelines. Athena already gives direct, scalable access to the Parquet data in S3 with no additional ingestion jobs, which reduces operational complexity and development overhead.

- Batch training use case: Our model is trained in batch mode and needs neither online feature retrieval nor low-latency serving. Athena is built for analytical, batch-oriented queries and reads S3-stored Parquet data directly, making it a natural fit for offline model training.

- Cost and architectural simplicity: Athena + S3 avoids the additional infrastructure, catalog management, and ingestion costs of Feature Store while still serving as a scalable offline feature repository. For a single-model, single-team workflow, this approach is sufficient and more efficient.
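
As a rough illustration of the batch workflow described above, the sketch below builds a per-split training query against an Athena table. The function name `build_split_query`, the split value `'train'`, and the selected column list are assumptions made for illustration; the database, table, and column names follow the ones used in this notebook.

```python
def build_split_query(database: str, table: str, split: str, columns: list[str]) -> str:
    """Return a SELECT that pulls one data split for batch training."""
    # Only allow simple alphanumeric split names so the value can be inlined.
    if not split.isalnum():
        raise ValueError(f"unexpected split name: {split!r}")
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list} "
        f"FROM {database}.{table} "
        f"WHERE data_split = '{split}'"
    )


# 'train' is a hypothetical split value; adjust to the values in data_split.
query = build_split_query(
    "aai540_eda", "data_split", "train", ["duration", "pkt_total", "label"]
)
print(query)
```

The resulting string could be passed to a query helper such as the `read_sql` function defined in this notebook to load the split into a DataFrame.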

In [3]:
!pip -q install "PyAthena[SQLAlchemy]" sqlalchemy s3fs

In [24]:
import boto3
import sagemaker
import time
from time import gmtime, strftime
from sagemaker.session import Session
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, text
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_definition import FeatureDefinition, FeatureTypeEnum

## Connect to Athena

In [6]:
sess = sagemaker.Session()
region = boto3.Session().region_name
role = get_execution_role()  # IAM role later passed to fg.create()

results_bucket = sess.default_bucket()
athena_results_path = f"s3://{results_bucket}/athena/staging/"

database_name = "aai540_eda"

engine = create_engine(
    f"awsathena+rest://@athena.{region}.amazonaws.com:443/{database_name}",
    connect_args={"s3_staging_dir": athena_results_path, "region_name": region},
)
print("Region:", region)
print("Athena results:", athena_results_path)

Region: us-east-1
Athena results: s3://sagemaker-us-east-1-128131109986/athena/staging/


In [7]:
# Helper functions for queries
def exec_ddl(sql: str):
    with engine.begin() as conn:
        conn.execute(text(sql))

def read_sql(sql: str) -> pd.DataFrame:
    return pd.read_sql(sql, engine)

## Create an Athena table for the Feature Store

In [15]:
# drop table if it exists
exec_ddl(f"DROP TABLE IF EXISTS {database_name}.data_split_fs_v1")

# create FS-ready table using a surrogate unique key (row_number)
exec_ddl(f"""
CREATE TABLE {database_name}.data_split_fs_v1
WITH (
  format='PARQUET',
  external_location='s3://{results_bucket}/aai540/processed/data_split_fs_v1/',
  parquet_compression='SNAPPY'
) AS

WITH numbered AS (
  SELECT
    *,
    -- row number within each dataset; rows that tie on every ORDER BY column are numbered arbitrarily but still uniquely
    row_number() OVER (
      PARTITION BY source_dataset
      ORDER BY
        label,
        duration,
        pkt_total,
        bytes_total,
        pkt_fwd,
        pkt_bwd,
        bytes_fwd,
        bytes_bwd,
        original_attack_type,
        attack_category
    ) AS row_num
  FROM {database_name}.data_split
)

SELECT
  *,

  -- surrogate primary key (guaranteed unique for this snapshot)
  concat(
    cast(source_dataset as varchar), '-',
    cast(row_num as varchar)
  ) AS record_id,

  -- required by Feature Store (VARCHAR to avoid timezone issues)
  date_format(current_timestamp, '%Y-%m-%dT%H:%i:%sZ') AS event_time

FROM numbered
""")

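To make the surrogate-key logic in the CTAS statement easier to follow, here is a minimal pure-Python sketch of the same idea: number the rows within each `source_dataset` partition, then join the partition value and the row number into `record_id`. The toy rows and the helper name `add_record_ids` are hypothetical, and only two of the ordering columns are used.

```python
from itertools import groupby
from operator import itemgetter

# Toy rows with made-up values; only a few of the real columns are shown.
rows = [
    {"source_dataset": "ds_a", "label": 1, "duration": 0.5},
    {"source_dataset": "ds_b", "label": 0, "duration": 2.0},
    {"source_dataset": "ds_a", "label": 0, "duration": 1.2},
    {"source_dataset": "ds_a", "label": 0, "duration": 0.3},
]


def add_record_ids(rows, partition_key, order_keys):
    """Number rows within each partition and build '<partition>-<n>' ids."""
    # Sort by the partition column first, then by the ordering columns.
    ordered = sorted(rows, key=itemgetter(partition_key, *order_keys))
    out = []
    for part, group in groupby(ordered, key=itemgetter(partition_key)):
        # enumerate(..., start=1) plays the role of row_number().
        for row_num, row in enumerate(group, start=1):
            out.append({**row, "record_id": f"{part}-{row_num}"})
    return out


keyed = add_record_ids(rows, "source_dataset", ["label", "duration"])
for row in keyed:
    print(row["record_id"], row["label"], row["duration"])
```

Because the row number restarts at 1 in every partition, the partition value has to be part of the key; the row number alone would collide across datasets.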
### Verify table with ids

In [17]:
read_sql(f"""
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT record_id) AS unique_ids
FROM {database_name}.data_split_fs_v1
""")

   total_rows  unique_ids
0    26708942    26708942
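
If the two counts ever disagreed, a follow-up query could list the offending ids. The sketch below only builds the SQL text; the helper name `build_duplicate_id_query` and the default limit are assumptions, not part of the original notebook.

```python
def build_duplicate_id_query(
    database: str, table: str, id_column: str = "record_id", limit: int = 20
) -> str:
    """Return SQL that lists ids appearing more than once, most frequent first."""
    return (
        f"SELECT {id_column}, COUNT(*) AS n_rows\n"
        f"FROM {database}.{table}\n"
        f"GROUP BY {id_column}\n"
        f"HAVING COUNT(*) > 1\n"
        f"ORDER BY n_rows DESC\n"
        f"LIMIT {int(limit)}"
    )


dup_query = build_duplicate_id_query("aai540_eda", "data_split_fs_v1")
print(dup_query)
```

An empty result from this query is equivalent to `total_rows == unique_ids` in the check above.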


## Define Feature Group schema

In [21]:
feature_group_name = f"aai540-ids-features-v1-{strftime('%Y%m%d-%H%M%S', gmtime())}"
fg = FeatureGroup(name=feature_group_name, sagemaker_session=sess)

feature_definitions = [
    FeatureDefinition("record_id", FeatureTypeEnum.STRING),
    FeatureDefinition("event_time", FeatureTypeEnum.STRING),
    FeatureDefinition("data_split", FeatureTypeEnum.STRING),
    FeatureDefinition("label", FeatureTypeEnum.INTEGRAL),
    FeatureDefinition("original_attack_type", FeatureTypeEnum.STRING),
    FeatureDefinition("attack_category", FeatureTypeEnum.STRING),
    FeatureDefinition("source_dataset", FeatureTypeEnum.STRING),
    FeatureDefinition("duration", FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition("pkt_total", FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition("bytes_total", FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition("pkt_fwd", FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition("pkt_bwd", FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition("bytes_fwd", FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition("bytes_bwd", FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition("pkt_rate", FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition("byte_rate", FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition("bytes_per_pkt", FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition("pkt_ratio", FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition("byte_ratio", FeatureTypeEnum.FRACTIONAL),
]
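
Before ingesting anything, it can help to check that records match the declared feature types. The following is a minimal sketch that mirrors a subset of the schema above as plain strings, so it runs without the SageMaker SDK. The `SCHEMA` dict, the `validate_record` helper, and the sample records are all hypothetical.

```python
# Subset of the feature definitions, using the type names that
# describe() reports ('String', 'Integral', 'Fractional').
SCHEMA = {
    "record_id": "String",
    "event_time": "String",
    "label": "Integral",
    "duration": "Fractional",
}


def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    for name, ftype in schema.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if ftype == "String" and not isinstance(value, str):
            problems.append(f"{name}: expected str, got {type(value).__name__}")
        elif ftype == "Integral" and not is_int:
            problems.append(f"{name}: expected int, got {type(value).__name__}")
        elif ftype == "Fractional" and not (is_int or isinstance(value, float)):
            problems.append(f"{name}: expected number, got {type(value).__name__}")
    return problems


good = {"record_id": "ds_a-1", "event_time": "2026-02-08T07:18:21Z", "label": 1, "duration": 0.5}
bad = {"record_id": "ds_a-2", "event_time": "2026-02-08T07:18:21Z", "label": 1.5}

ok_problems = validate_record(good, SCHEMA)
bad_problems = validate_record(bad, SCHEMA)
print(ok_problems)
print(bad_problems)
```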


## Create Feature Group

In [23]:
# attach feature definitions
fg.feature_definitions = feature_definitions

# create Feature Group (offline only)
fg.create(
    s3_uri=f"s3://{sess.default_bucket()}/aai540/feature-store/offline/{feature_group_name}/",
    record_identifier_name="record_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=False,
)

print("Creating Feature Group:", feature_group_name)

Creating Feature Group: aai540-ids-features-v1-20260208-071821


In [25]:
while True:
    status = fg.describe()["FeatureGroupStatus"]
    print("Status:", status)
    if status == "Created":
        break
    if status == "CreateFailed":
        raise RuntimeError("Feature Group creation failed")
    time.sleep(10)

print("Feature Group ready")

Status: Created
Feature Group ready
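
The polling loop above runs until the status changes, with no upper bound on the wait. A possible variant with a timeout is sketched below. The helper name `wait_for_status`, its parameters, and the `fake_describe` stand-in are assumptions for illustration; in the notebook, `fg.describe` would be passed instead.

```python
import time


def wait_for_status(describe_fn, target="Created", failed="CreateFailed",
                    timeout_s=600.0, poll_s=10.0, sleep=time.sleep):
    """Poll describe_fn() until the target status, a failure, or a timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = describe_fn()["FeatureGroupStatus"]
        if status == target:
            return status
        if status == failed:
            raise RuntimeError(f"Feature Group entered status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        sleep(poll_s)


# Stand-in for fg.describe so the sketch runs without AWS access.
_responses = iter(["Creating", "Creating", "Created"])


def fake_describe():
    return {"FeatureGroupStatus": next(_responses)}


final_status = wait_for_status(fake_describe, poll_s=0.0)
print("Status:", final_status)
```

With a real Feature Group, the call would look like `wait_for_status(fg.describe)`.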


In [26]:
# one describe() call instead of two
catalog = fg.describe()["OfflineStoreConfig"]["DataCatalogConfig"]
offline_table = catalog["TableName"]
database = catalog["Database"]

print("Athena DB:", database)
print("Athena table:", offline_table)

Athena DB: sagemaker_featurestore
Athena table: aai540_ids_features_v1_20260208_071821_1770535169
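
Once records have been ingested, the offline store table can be queried through Athena like any other table. The sketch below builds a query that keeps only the latest version of each record. It assumes the offline store adds the metadata columns `api_invocation_time`, `write_time`, and `is_deleted`; verify them against the table schema before relying on this. The helper name `build_latest_snapshot_query` is hypothetical.

```python
def build_latest_snapshot_query(database: str, table: str,
                                id_column: str = "record_id",
                                time_column: str = "event_time") -> str:
    """Return SQL selecting the newest non-deleted row per record id."""
    return (
        "SELECT *\n"
        "FROM (\n"
        "  SELECT *,\n"
        "    row_number() OVER (\n"
        f"      PARTITION BY {id_column}\n"
        # Assumed metadata columns break ties between equal event times.
        f"      ORDER BY {time_column} DESC, api_invocation_time DESC, write_time DESC\n"
        "    ) AS row_rank\n"
        f'  FROM "{database}"."{table}"\n'
        ")\n"
        "WHERE row_rank = 1 AND NOT is_deleted"
    )


latest_query = build_latest_snapshot_query(
    "sagemaker_featurestore",
    "aai540_ids_features_v1_20260208_071821_1770535169",
)
print(latest_query)
```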


In [27]:
fg.describe()

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:128131109986:feature-group/aai540-ids-features-v1-20260208-071821',
 'FeatureGroupName': 'aai540-ids-features-v1-20260208-071821',
 'RecordIdentifierFeatureName': 'record_id',
 'EventTimeFeatureName': 'event_time',
 'FeatureDefinitions': [{'FeatureName': 'record_id', 'FeatureType': 'String'},
  {'FeatureName': 'event_time', 'FeatureType': 'String'},
  {'FeatureName': 'data_split', 'FeatureType': 'String'},
  {'FeatureName': 'label', 'FeatureType': 'Integral'},
  {'FeatureName': 'original_attack_type', 'FeatureType': 'String'},
  {'FeatureName': 'attack_category', 'FeatureType': 'String'},
  {'FeatureName': 'source_dataset', 'FeatureType': 'String'},
  {'FeatureName': 'duration', 'FeatureType': 'Fractional'},
  {'FeatureName': 'pkt_total', 'FeatureType': 'Fractional'},
  {'FeatureName': 'bytes_total', 'FeatureType': 'Fractional'},
  {'FeatureName': 'pkt_fwd', 'FeatureType': 'Fractional'},
  {'FeatureName': 'pkt_bwd', 'FeatureType': 'Fract