# SageMaker Feature Store Example

This notebook demonstrates how to create feature groups, ingest data into Amazon SageMaker Feature Store, read a record back from the online store, and query the offline store using Amazon Athena.

The steps included are:
1. **Setup and Initialization**: Import libraries and set up the SageMaker environment.
2. **Data Inspection**: Create and inspect the synthetic data.
3. **Feature Group Creation**: Create feature groups for the data.
4. **Data Ingestion**: Ingest the data into the feature groups.
5. **Querying the Data**: Retrieve a record from the online store and query the offline store using Athena.
6. **Cleanup** (Optional): Clean up the resources after verification.


## Step 1: Setup and Initialization

We begin by importing the necessary libraries and setting up the SageMaker environment. The `get_execution_role` function returns the IAM role attached to the notebook environment, and we use the session's default S3 bucket, together with a key prefix, to store the offline features.


In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.feature_store.feature_group import FeatureGroup
import pandas as pd
import time

# Inside SageMaker (for example JupyterLab in SageMaker Studio), the execution role can be looked up directly
role = get_execution_role()

# Define S3 bucket and prefix for the offline store
s3_bucket_name = sagemaker.Session().default_bucket()  # Default bucket of a new session for the current account and region
prefix = 'sagemaker-featurestore-introduction'


## Step 2: Data Inspection

Here, we create synthetic customer and order data for demonstration purposes. In practice, you would load your data from a CSV file or another data source.


In [None]:
# Step 2: Build and inspect the synthetic data
# (Load from your own data source here if needed)
customer_data = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'first_name': ['John', 'Jane', 'Doe'],
    'last_name': ['Doe', 'Smith', 'Johnson'],
    'age': [28, 34, 29],
    'account_balance': [1000.50, 1500.75, 2000.25]
})

orders_data = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'order_id': [101, 102, 103],
    'order_amount': [250.75, 100.50, 320.00],
    'order_date': pd.to_datetime(['2023-08-01', '2023-08-02', '2023-08-03'])
})

print(customer_data.head())
print(orders_data.head())
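
Feature Store supports three feature types: Integral, Fractional, and String. As a rough illustration of how the columns above map onto them, the sketch below classifies plain Python values. `infer_feature_type` is a hypothetical helper written for this notebook, not part of the SageMaker SDK, and it only approximates what `load_feature_definitions` does with DataFrame dtypes.

```python
from datetime import date, datetime


def infer_feature_type(values):
    """Return 'Integral', 'Fractional', or 'String' for a list of Python values.

    Hypothetical helper for illustration only; not the SDK's implementation.
    """
    if not values:
        raise ValueError("cannot infer a feature type from an empty column")
    # Dates have no dedicated feature type and must be converted first
    if any(isinstance(v, (date, datetime)) for v in values):
        raise TypeError("convert dates to ISO 8601 strings or epoch seconds first")
    # bool is a subclass of int in Python, so exclude it explicitly
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "Integral"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "Fractional"
    if all(isinstance(v, str) for v in values):
        return "String"
    raise TypeError("mixed or unsupported value types")


print(infer_feature_type([1, 2, 3]))                    # customer_id
print(infer_feature_type([1000.50, 1500.75, 2000.25]))  # account_balance
print(infer_feature_type(["John", "Jane", "Doe"]))      # first_name
```

Under these assumptions, a date column such as `order_date` has to be turned into a string or a number before the feature definitions are loaded.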


## Step 3: Feature Group Creation

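In this step, we create feature groups for the customer and order data. Feature groups are the core components of the SageMaker Feature Store and serve as containers for your features. Each feature group needs a record identifier feature and an event time feature; here those are `customer_id` and `EventTime`.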


In [None]:
# Step 3: Create Feature Groups
customers_feature_group_name = 'customers-feature-group-' + time.strftime('%d-%H-%M-%S', time.gmtime())
orders_feature_group_name = 'orders-feature-group-' + time.strftime('%d-%H-%M-%S', time.gmtime())

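# Pass the session explicitly, as the official Feature Store examples do
sagemaker_session = sagemaker.Session()
customers_feature_group = FeatureGroup(name=customers_feature_group_name, sagemaker_session=sagemaker_session)
orders_feature_group = FeatureGroup(name=orders_feature_group_name, sagemaker_session=sagemaker_session)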

# Generate a proper ISO 8601 string (UTC)
current_time_str = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())

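# Instead of numeric timestamps, assign this ISO string directly
customer_data["EventTime"] = current_time_str
orders_data["EventTime"] = current_time_str

# load_feature_definitions does not map datetime64 columns to a feature type,
# so store order_date as an ISO 8601 string as well
orders_data["order_date"] = pd.to_datetime(orders_data["order_date"]).dt.strftime('%Y-%m-%dT%H:%M:%SZ')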

customers_feature_group.load_feature_definitions(data_frame=customer_data)
orders_feature_group.load_feature_definitions(data_frame=orders_data)

# Create Feature Groups
customers_feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name='customer_id',
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=True
)

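# Note: with customer_id as the record identifier, the online store keeps only the
# latest order per customer; use order_id instead if every order must be retrievable
orders_feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name='customer_id',
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=True
)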
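
Feature Store accepts event times either as Unix epoch seconds (Fractional type) or as ISO 8601 strings in UTC such as `2023-08-01T00:00:00Z` (String type). The cell above uses the string form. A minimal sketch of producing that string from a `datetime`, assuming that naive datetimes are already in UTC; `to_event_time` is a hypothetical helper, not an SDK function:

```python
from datetime import datetime, timedelta, timezone


def to_event_time(moment):
    """Format a datetime as an ISO 8601 UTC string with second precision."""
    if moment.tzinfo is None:
        # Assumption: naive datetimes are already expressed in UTC
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A naive datetime is treated as UTC
print(to_event_time(datetime(2023, 8, 1)))

# An aware datetime is converted to UTC before formatting
plus_two = timezone(timedelta(hours=2))
print(to_event_time(datetime(2023, 8, 1, 2, 0, tzinfo=plus_two)))
```

Both calls describe the same instant, so they should print the same string.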


## Step 4: Data Ingestion

Once the feature groups are created, we can ingest the customer and order data into the respective feature groups. Because both groups have the online store enabled, records are available there right after ingestion; they are also written to the offline store in S3, which can take several minutes.


In [None]:
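# Wait for feature groups to be created
def check_feature_group_status(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group to be Created")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    # Any final status other than "Created" (for example "CreateFailed") is an error
    if status != "Created":
        raise RuntimeError(f"Failed to create FeatureGroup {feature_group.name}: status {status}")
    print(f"FeatureGroup {feature_group.name} successfully created.")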

check_feature_group_status(customers_feature_group)
check_feature_group_status(orders_feature_group)
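
The wait loop above polls for as long as the status stays `Creating`. One possible variation adds a timeout so the notebook cannot hang indefinitely. The sketch below is generic: `wait_for_status` and its parameters are hypothetical names, and the status source is passed in as a callable so the logic can be tried without AWS access.

```python
import time


def wait_for_status(get_status, target="Created", pending=("Creating",),
                    timeout_seconds=300, poll_seconds=5):
    """Poll get_status() until it returns target, fails, or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == target:
            return status
        if status not in pending:
            # A final status other than the target, for example "CreateFailed"
            raise RuntimeError(f"unexpected status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Simulated status sequence standing in for repeated describe() calls
fake_statuses = iter(["Creating", "Creating", "Created"])
print(wait_for_status(lambda: next(fake_statuses), poll_seconds=0))
```

With a real feature group, the callable could be `lambda: customers_feature_group.describe().get("FeatureGroupStatus")`.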


In [None]:
# Step 4: Ingest Data
customers_feature_group.ingest(data_frame=customer_data, max_workers=3, wait=True)
orders_feature_group.ingest(data_frame=orders_data, max_workers=3, wait=True)
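
`ingest` handles a whole DataFrame, but a single row can also be written with the runtime `put_record` API, which expects each record as a list of `FeatureName`/`ValueAsString` pairs. The sketch below shows that conversion for one row. `row_to_record` is a hypothetical helper, the event time value is illustrative, and the commented call assumes a `sagemaker-featurestore-runtime` client like the one created in Step 5.

```python
def row_to_record(row):
    """Convert a {feature_name: value} mapping into the PutRecord record format."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None  # assumption: features without a value are left out
    ]


record = row_to_record({
    "customer_id": 1,
    "first_name": "John",
    "account_balance": 1000.50,
    "EventTime": "2023-08-01T00:00:00Z",  # illustrative timestamp
})
print(record)

# With AWS access, the record could then be written like this:
# featurestore_runtime.put_record(
#     FeatureGroupName=customers_feature_group_name,
#     Record=record,
# )
```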


## Step 5: Querying the Data

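We first read a single record back from the online store with `get_record`, then query the offline store using Amazon Athena with a simple SQL query. Because the offline store is populated with a delay, the Athena query can return no rows if it runs right after ingestion; waiting a few minutes and re-running the cell usually resolves this.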


In [None]:
# Step 5: Retrieve Records from the Online Store
import boto3

# Initialize the SageMaker Feature Store runtime client
featurestore_runtime = boto3.client('sagemaker-featurestore-runtime')

# Retrieve a record from the online store
record_id = '1'  # Replace with the actual record ID you want to retrieve
response = featurestore_runtime.get_record(
    FeatureGroupName=customers_feature_group_name,
    RecordIdentifierValueAsString=record_id
)

# Print the record
print("Record retrieved from the online store:")
print(response['Record'])
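
The `Record` field is a list of `FeatureName`/`ValueAsString` pairs, which is awkward to read or index. The sketch below flattens it into a dictionary. `record_to_dict` is a hypothetical helper, and `sample_record` is hand-written from the synthetic data above to show the shape, not captured output.

```python
def record_to_dict(record):
    """Flatten a GetRecord 'Record' list into {feature_name: value_as_string}."""
    # Treat a missing record (None) as empty so the helper never raises on lookups that found nothing
    return {item["FeatureName"]: item["ValueAsString"] for item in (record or [])}


# Hand-written sample in the shape returned by get_record
sample_record = [
    {"FeatureName": "customer_id", "ValueAsString": "1"},
    {"FeatureName": "first_name", "ValueAsString": "John"},
    {"FeatureName": "age", "ValueAsString": "28"},
]

features = record_to_dict(sample_record)
print(features["first_name"])
print(int(features["age"]))  # values arrive as strings and need explicit casting
```

With a live response, `record_to_dict(response.get('Record'))` avoids a `KeyError` when no record matches the identifier.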


In [None]:
# Step 5: Query Data Using Athena
from sagemaker.feature_store.feature_group import FeatureGroup, AthenaQuery
import sagemaker

# Create a SageMaker session
sagemaker_session = sagemaker.Session()

# Describe the feature groups to get the correct table names
customers_feature_group_description = customers_feature_group.describe()
orders_feature_group_description = orders_feature_group.describe()

# Extract the table names from the description
customers_table_name = customers_feature_group_description['OfflineStoreConfig']['DataCatalogConfig']['TableName']
orders_table_name = orders_feature_group_description['OfflineStoreConfig']['DataCatalogConfig']['TableName']

print(f"Customers Table Name: {customers_table_name}")
print(f"Orders Table Name: {orders_table_name}")

# Now use the correct table name from the feature group description
query_string = f'SELECT * FROM "{customers_table_name}" LIMIT 10'

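# Create the AthenaQuery object, taking the catalog and database from the same description
data_catalog_config = customers_feature_group_description['OfflineStoreConfig']['DataCatalogConfig']
athena_query = AthenaQuery(
    catalog=data_catalog_config['Catalog'],
    database=data_catalog_config['Database'],
    table_name=customers_table_name,
    sagemaker_session=sagemaker_session
)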

# Execute the query
athena_query.run(query_string=query_string, output_location=f's3://{s3_bucket_name}/athena_results/')

# Wait for the query to complete
athena_query.wait()

# Convert the results to a DataFrame
query_results = athena_query.as_dataframe()

# Display the results
print(query_results)
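
The offline store is append-only: every write adds a row, and Feature Store adds the metadata columns `write_time`, `api_invocation_time`, and `is_deleted`. A `SELECT *` can therefore return several versions of the same record. The sketch below builds a query that keeps only the latest version per record identifier. `latest_records_query` is a hypothetical helper, the table name in the example is a placeholder, and the lowercase `eventtime` column assumes the Glue table stores feature names in lowercase, which is worth checking against the actual table schema.

```python
def latest_records_query(table_name, record_id_column="customer_id",
                         event_time_column="eventtime"):
    """Build an Athena query returning the newest non-deleted row per record id."""
    return (
        "SELECT * FROM ("
        f'SELECT *, row_number() OVER (PARTITION BY "{record_id_column}" '
        f'ORDER BY "{event_time_column}" DESC, api_invocation_time DESC, write_time DESC) '
        "AS row_num "
        f'FROM "{table_name}"'
        ") WHERE row_num = 1 AND NOT is_deleted"
    )


# Placeholder table name; in the notebook this would be customers_table_name
print(latest_records_query("customers_table_placeholder"))
```

The resulting string can be passed to `athena_query.run` in place of `query_string`.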

## Step 6: Cleanup (Optional)

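After you are done with the feature groups, you may want to delete them to avoid incurring additional charges. Deleting a feature group does not remove the offline store data already written to S3, so the cells below first delete the feature groups and then delete the objects under the offline store prefix and the Athena results prefix.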


In [None]:
# Step 6: Clean up (optional)
customers_feature_group.delete()
orders_feature_group.delete()


In [None]:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket(s3_bucket_name)
# Same prefix that was used during creation; a separate variable keeps this cell safe to re-run
offline_store_prefix = f'{prefix}/'

# Delete all objects under the prefix
bucket.objects.filter(Prefix=offline_store_prefix).delete()


In [None]:
athena_results_prefix = 'athena_results/'  # Or your specific prefix
bucket.objects.filter(Prefix=athena_results_prefix).delete()
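
S3 prefixes are plain string matches on the object key, so a filter on `athena_results` without the trailing slash would also match keys such as `athena_results_old/...`. Ending the prefix with a single `/` limits the delete to one "folder". A small sketch of that behavior, using made-up keys and a hypothetical `as_folder_prefix` helper:

```python
def as_folder_prefix(prefix):
    """Return the prefix with exactly one trailing slash."""
    cleaned = prefix.rstrip("/")
    if not cleaned:
        # Refuse an empty prefix, which would otherwise target the whole bucket
        raise ValueError("prefix must not be empty")
    return cleaned + "/"


# Made-up object keys for illustration
keys = [
    "athena_results/query-1.csv",
    "athena_results/query-1.csv.metadata",
    "athena_results_old/query-0.csv",
]

folder = as_folder_prefix("athena_results")
matched = [key for key in keys if key.startswith(folder)]
print(matched)
```

Printing the matched keys before calling `delete()` is a cheap dry run when working in a shared bucket.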
