# Feature Store with Lake Formation Governance

This notebook demonstrates two workflows for using SageMaker Feature Store with Lake Formation governance:

1. **Example 1**: Create Feature Group with Lake Formation enabled at creation time
2. **Example 2**: Create Feature Group first, then enable Lake Formation separately

Both workflows include record ingestion to verify everything works end-to-end.

## Prerequisites

- AWS credentials configured with permissions for SageMaker, S3, Glue, and Lake Formation
- An S3 bucket for the offline store
- An IAM role with Feature Store permissions

## Required IAM Permissions

This notebook uses two separate IAM roles:
1. **Execution Role**: The SageMaker execution role running this notebook
2. **Offline Store Role**: A dedicated role for Feature Store S3 access
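
Because the offline store role is supplied as an ARN string, a quick format check can catch a placeholder that was never replaced. The following is a minimal sketch; the helper name `is_role_arn` and the regular expression are assumptions made for illustration and are not part of the SageMaker SDK.

```python
import re

# Rough shape of an IAM role ARN: partition, 12-digit account ID, optional path, role name.
# This pattern is an illustrative assumption, not an official validator.
ROLE_ARN_PATTERN = re.compile(
    r"^arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[\w+=,.@/-]+$"
)


def is_role_arn(value: str) -> bool:
    """Return True if the value looks like a fully specified IAM role ARN."""
    return ROLE_ARN_PATTERN.match(value) is not None


# A template placeholder fails the check; a filled-in ARN passes.
print(is_role_arn("arn:aws:iam::<account-id>:role/<feature-store-role>"))
print(is_role_arn("arn:aws:iam::111122223333:role/SagemakerFeatureStoreOfflineRole"))
```

The check only inspects the string format; it does not confirm that the role exists or that it has the required permissions.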

### Execution Role Policy

The execution role needs permissions to manage Feature Groups and configure Lake Formation:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "FeatureGroupManagement",
            "Effect": "Allow",
            "Action": [
                "sagemaker:*"
            ],
            "Resource": "arn:aws:sagemaker:*:*:feature-group/*"
        },
        {
            "Sid": "LakeFormation",
            "Effect": "Allow",
            "Action": [
                "lakeformation:RegisterResource",
                "lakeformation:DeregisterResource",
                "lakeformation:GrantPermissions",
                "lakeformation:RevokePermissions"
            ],
            "Resource": "*"
        },
        {
            "Sid": "GlueCatalogAccess",
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:GetDatabase",
                "glue:DeleteTable"
            ],
            "Resource": [
                "arn:aws:glue:*:*:catalog",
                "arn:aws:glue:*:*:database/sagemaker_featurestore",
                "arn:aws:glue:*:*:table/sagemaker_featurestore/*"
            ]
        },
        {
            "Sid": "PassOfflineStoreRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/SagemakerFeatureStoreOfflineRole"
        },
        {
            "Sid": "LakeFormationServiceLinkedRole",
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:PutRolePolicy",
                "iam:GetRolePolicy"
            ],
            "Resource": "arn:aws:iam::*:role/aws-service-role/lakeformation.amazonaws.com/AWSServiceRoleForLakeFormationDataAccess"
        },
        {
            "Sid": "S3SagemakerDefaultBucket",
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:GetBucketAcl",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::sagemaker-*"
            ]
        },
        {
            "Sid": "CreateGlueTable",
            "Effect": "Allow",
            "Action": [
                "glue:CreateTable"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```
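
To sanity-check a policy document like the one above before attaching it, you can compare its `Allow` statements against the actions you expect to need. The sketch below is a local, simplified check: the helper names are hypothetical, the example policy is an abbreviated copy used only to exercise the code, and `Resource` and `Condition` elements are deliberately ignored, so it is no substitute for the IAM policy simulator.

```python
import fnmatch


def allowed_action_patterns(policy: dict) -> list:
    """Collect every action pattern from the policy's Allow statements."""
    patterns = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        # IAM allows "Action" to be a single string or a list of strings.
        if isinstance(actions, str):
            actions = [actions]
        patterns.extend(actions)
    return patterns


def missing_actions(policy: dict, required: list) -> list:
    """Return required actions that no Allow pattern matches (wildcards honored)."""
    patterns = [p.lower() for p in allowed_action_patterns(policy)]
    return [
        action
        for action in required
        if not any(fnmatch.fnmatchcase(action.lower(), p) for p in patterns)
    ]


# Abbreviated example policy (assumption: trimmed for illustration).
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["sagemaker:*"], "Resource": "*"},
        {
            "Effect": "Allow",
            "Action": ["lakeformation:RegisterResource", "lakeformation:GrantPermissions"],
            "Resource": "*",
        },
        {"Effect": "Allow", "Action": "iam:PassRole", "Resource": "*"},
    ],
}

required = [
    "sagemaker:CreateFeatureGroup",
    "lakeformation:RegisterResource",
    "iam:PassRole",
    "glue:GetTable",
]
print(missing_actions(example_policy, required))
```

With the abbreviated policy, only `glue:GetTable` is reported as missing, because the trimmed copy omits the Glue statement.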

## Lake Formation Admin Requirements

The person enabling Lake Formation governance must be a **Data Lake Administrator** in Lake Formation. There are two options depending on your organization's setup:

### Option 1: Single User (Data Lake Admin + Feature Store Admin)

If the caller has both of the following, they can pass `lake_formation_config` to `FeatureGroup.create()` to enable governance at creation time (Example 1):
- Data Lake Administrator privileges in Lake Formation
- Permissions to create Feature Groups in SageMaker

### Option 2: Separate Roles (ML Engineer + Data Lake Admin)

If the person creating the Feature Group is different from the Data Lake Administrator:

1. **ML Engineer** creates the Feature Group without Lake Formation using `FeatureGroup.create()`
2. **Data Lake Admin** later enables governance by calling `enable_lake_formation()` on the existing Feature Group (Example 2)
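
To decide which option applies, it helps to know whether the calling principal is listed as a Data Lake Administrator. The sketch below reads the response shape of the Lake Formation `GetDataLakeSettings` API. The helper names and the sample response are illustrative assumptions, and the live call additionally requires the `lakeformation:GetDataLakeSettings` permission, which the execution role policy above does not include.

```python
def admin_arns(settings_response: dict) -> set:
    """Extract the admin principal ARNs from a GetDataLakeSettings response."""
    admins = settings_response.get("DataLakeSettings", {}).get("DataLakeAdmins", [])
    return {admin["DataLakePrincipalIdentifier"] for admin in admins}


def is_data_lake_admin(settings_response: dict, principal_arn: str) -> bool:
    """Return True if the principal ARN appears in the Data Lake Administrators list."""
    return principal_arn in admin_arns(settings_response)


def fetch_settings(region_name=None) -> dict:
    """Fetch the live settings; needs AWS credentials and network access."""
    import boto3  # imported lazily so the helpers above work without AWS access

    client = boto3.client("lakeformation", region_name=region_name)
    return client.get_data_lake_settings()


# Hand-written sample response (assumption: example account ID and role name).
sample_response = {
    "DataLakeSettings": {
        "DataLakeAdmins": [
            {"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/DataLakeAdmin"}
        ]
    }
}

print(is_data_lake_admin(sample_response, "arn:aws:iam::111122223333:role/DataLakeAdmin"))
print(is_data_lake_admin(sample_response, "arn:aws:iam::111122223333:role/MLEngineer"))
```

In a live session you would pass `fetch_settings()` in place of `sample_response`. If the check returns `False` for your role, Option 2 is the likely path.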


## Setup

In [None]:
import time
from datetime import datetime
from datetime import timezone

import boto3
from botocore.exceptions import ClientError

# Import the FeatureGroup with Lake Formation support
from sagemaker.mlops.feature_store.feature_group import FeatureGroup, LakeFormationConfig
from sagemaker.core.shapes import (
    FeatureDefinition,
    FeatureValue,
    OfflineStoreConfig,
    OnlineStoreConfig,
    S3StorageConfig,
)
from sagemaker.core.helper.session_helper import Session as SageMakerSession, get_execution_role

## Configuration

In [None]:
# Use SageMaker session to get default bucket and execution role
sagemaker_session = SageMakerSession()
S3_BUCKET = sagemaker_session.default_bucket()
REGION = sagemaker_session.boto_session.region_name

# Execution role (for running this notebook)
EXECUTION_ROLE_ARN = get_execution_role(sagemaker_session)

# Offline store role (dedicated role for Feature Store S3 access)
# Replace with your dedicated offline store role ARN
OFFLINE_STORE_ROLE_ARN = "arn:aws:iam::<account-id>:role/<feature-store-role>"

print(f"S3 Bucket: {S3_BUCKET}")
print(f"Execution Role ARN: {EXECUTION_ROLE_ARN}")
print(f"Offline Store Role ARN: {OFFLINE_STORE_ROLE_ARN}")
print(f"Region: {REGION}")

## Common Feature Definitions

In [None]:
feature_definitions = [
    FeatureDefinition(feature_name="customer_id", feature_type="String"),
    FeatureDefinition(feature_name="event_time", feature_type="String"),
    FeatureDefinition(feature_name="age", feature_type="Integral"),
    FeatureDefinition(feature_name="total_purchases", feature_type="Integral"),
    FeatureDefinition(feature_name="avg_order_value", feature_type="Fractional"),
]

print("Feature Definitions:")
for fd in feature_definitions:
    print(f"  - {fd.feature_name}: {fd.feature_type}")

## Helper Function: Ingest Records

In [None]:
def ingest_sample_records(feature_group, num_records=3):
    """
    Ingest sample records into the Feature Group.
    
    Args:
        feature_group: The FeatureGroup to ingest records into
        num_records: Number of sample records to ingest
    """
    print(f"\nIngesting {num_records} sample records...")
    
    for i in range(num_records):
        event_time = datetime.now(timezone.utc).isoformat()
        record = [
            FeatureValue(feature_name="customer_id", value_as_string=f"cust_{i+1}"),
            FeatureValue(feature_name="event_time", value_as_string=event_time),
            FeatureValue(feature_name="age", value_as_string=str(25 + i * 5)),
            FeatureValue(feature_name="total_purchases", value_as_string=str(10 + i * 3)),
            FeatureValue(feature_name="avg_order_value", value_as_string=str(50.0 + i * 10.5)),
        ]
        
        feature_group.put_record(record=record)
        print(f"  Ingested record for customer: cust_{i+1}")
    
    print(f"Successfully ingested {num_records} records!")

## Helper Function: Cleanup

In [None]:
def cleanup_feature_group(fg):
    """
    Delete a FeatureGroup and its associated Glue table.
    
    Args:
        fg: The FeatureGroup to delete.
    """
    try:
        # Delete the Glue table if it exists
        if fg.offline_store_config is not None:
            try:
                fg.refresh()  # Ensure we have latest config
                data_catalog_config = fg.offline_store_config.data_catalog_config
                if data_catalog_config is not None:
                    database_name = data_catalog_config.database
                    table_name = data_catalog_config.table_name

                    if database_name and table_name:
                        glue_client = boto3.client("glue")
                        try:
                            glue_client.delete_table(DatabaseName=database_name, Name=table_name)
                            print(f"Deleted Glue table: {database_name}.{table_name}")
                        except ClientError as e:
                            # Ignore if table doesn't exist
                            if e.response["Error"]["Code"] != "EntityNotFoundException":
                                raise
            except Exception as e:
                # Don't fail cleanup if Glue table deletion fails
                print(f"Warning: Could not delete Glue table: {e}")

        # Delete the FeatureGroup
        fg.delete()
        print(f"Deleted Feature Group: {fg.feature_group_name}")
    except ClientError as e:
        print(f"Error during cleanup: {e}")

---
# Example 1: Create Feature Group with Lake Formation Enabled

This example creates a Feature Group with Lake Formation governance enabled at creation time using `LakeFormationConfig`.

In [None]:
# Generate unique name for example 1
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
FG_NAME_WORKFLOW1 = f"lf-demo-workflow1-{timestamp}"

print(f"Example 1 Feature Group: {FG_NAME_WORKFLOW1}")

In [None]:
# Configure online and offline stores
online_store_config = OnlineStoreConfig(enable_online_store=True)

offline_store_config_1 = OfflineStoreConfig(
    s3_storage_config=S3StorageConfig(
        s3_uri=f"s3://{S3_BUCKET}/feature-store-demo/"
    )
)

# Configure Lake Formation - enabled at creation
lake_formation_config = LakeFormationConfig()
lake_formation_config.enabled = True
lake_formation_config.use_service_linked_role = True
lake_formation_config.show_s3_policy = True

print("Store Config:")
print("  Online Store: enabled")
print(f"  Offline Store S3: s3://{S3_BUCKET}/feature-store-demo/")
print("\nLake Formation Config:")
print(f"  enabled: {lake_formation_config.enabled}")
print(f"  use_service_linked_role: {lake_formation_config.use_service_linked_role}")

In [None]:
# Create Feature Group with Lake Formation enabled
print("Creating Feature Group with Lake Formation enabled...")
print("This will:")
print("  1. Create the Feature Group with online + offline stores")
print("  2. Wait for 'Created' status")
print("  3. Register S3 with Lake Formation")
print("  4. Grant permissions to execution role")
print("  5. Revoke IAMAllowedPrincipals permissions")
print()

fg_workflow1 = FeatureGroup.create(
    feature_group_name=FG_NAME_WORKFLOW1,
    record_identifier_feature_name="customer_id",
    event_time_feature_name="event_time",
    feature_definitions=feature_definitions,
    online_store_config=online_store_config,
    offline_store_config=offline_store_config_1,
    role_arn=OFFLINE_STORE_ROLE_ARN,
    description="Workflow 1: Lake Formation enabled at creation",
    lake_formation_config=lake_formation_config,  # new field
    region=REGION,
)

print(f"\nFeature Group created: {fg_workflow1.feature_group_name}")
print(f"Status: {fg_workflow1.feature_group_status}")

In [None]:
# Verify Feature Group status
fg_workflow1.refresh()
print(f"Feature Group: {fg_workflow1.feature_group_name}")
print(f"Status: {fg_workflow1.feature_group_status}")
print(f"ARN: {fg_workflow1.feature_group_arn}")

In [None]:
# Ingest sample records to verify everything works
ingest_sample_records(fg_workflow1, num_records=3)

In [None]:
# Retrieve a sample record from the online store
print("Retrieving record for customer 'cust_1' from online store...")
response = fg_workflow1.get_record(record_identifier_value_as_string="cust_1")

print("\nRecord retrieved successfully!")
print("Features:")
for feature in response.record:
    print(f"  {feature.feature_name}: {feature.value_as_string}")

---
# Example 2: Create Feature Group, Then Enable Lake Formation

This example creates a Feature Group first without Lake Formation, then enables it separately using `enable_lake_formation()`.
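
For orientation, the three governance steps this workflow reports (register the S3 location, grant permissions, revoke the `IAMAllowedPrincipals` grant) correspond to Lake Formation API calls roughly as sketched below. This is not the SDK's implementation of `enable_lake_formation()`. The function name, the `RecordingClient` stand-in, the granted permission set, and all resource names are assumptions chosen for illustration. Only the boto3 Lake Formation methods `register_resource`, `grant_permissions`, and `revoke_permissions` are real API calls.

```python
class RecordingClient:
    """Stand-in for a boto3 Lake Formation client that records calls for a dry run."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any method name becomes a recorder, so no AWS request is ever made.
        def record(**kwargs):
            self.calls.append((name, kwargs))
            return {}

        return record


def enable_governance_sketch(lf_client, s3_arn, database, table, role_arn):
    """Issue the three Lake Formation calls in order."""
    table_resource = {"Table": {"DatabaseName": database, "Name": table}}

    # 1. Register the offline store S3 location with Lake Formation.
    lf_client.register_resource(ResourceArn=s3_arn, UseServiceLinkedRole=True)

    # 2. Grant table permissions to the role that should keep access.
    #    The permission list here is an assumption, not what the SDK grants.
    lf_client.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": role_arn},
        Resource=table_resource,
        Permissions=["SELECT", "DESCRIBE"],
    )

    # 3. Revoke the default grant so IAM policies alone no longer give table access.
    lf_client.revoke_permissions(
        Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        Resource=table_resource,
        Permissions=["ALL"],
    )


dry_run = RecordingClient()
enable_governance_sketch(
    dry_run,
    s3_arn="arn:aws:s3:::example-bucket/feature-store-demo/",
    database="sagemaker_featurestore",
    table="example_table",
    role_arn="arn:aws:iam::111122223333:role/ExampleRole",
)
print([name for name, _ in dry_run.calls])
```

Passing `boto3.client("lakeformation")` instead of `RecordingClient()` would send real requests, so only do that with a Data Lake Administrator identity and real resource names.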

In [None]:
# Generate unique name for example 2
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
FG_NAME_WORKFLOW2 = f"lf-demo-workflow2-{timestamp}"

print(f"Example 2 Feature Group: {FG_NAME_WORKFLOW2}")

In [None]:
# Configure online and offline stores
online_store_config_2 = OnlineStoreConfig(enable_online_store=True)

offline_store_config_2 = OfflineStoreConfig(
    s3_storage_config=S3StorageConfig(
        s3_uri=f"s3://{S3_BUCKET}/feature-store-demo/"
    ),
    table_format="Iceberg"
)

# Step 1: Create Feature Group WITHOUT Lake Formation
print("Step 1: Creating Feature Group without Lake Formation...")

fg_workflow2 = FeatureGroup.create(
    feature_group_name=FG_NAME_WORKFLOW2,
    record_identifier_feature_name="customer_id",
    event_time_feature_name="event_time",
    feature_definitions=feature_definitions,
    online_store_config=online_store_config_2,
    offline_store_config=offline_store_config_2,
    role_arn=OFFLINE_STORE_ROLE_ARN,
    description="Workflow 2: Lake Formation enabled after creation",
    region=REGION,
)

print(f"Feature Group created: {fg_workflow2.feature_group_name}")
print(f"Status: {fg_workflow2.feature_group_status}")

In [None]:
# Step 2: Wait for Feature Group to be ready
print("Step 2: Waiting for Feature Group to reach 'Created' status...")
fg_workflow2.wait_for_status(target_status="Created", poll=10, timeout=300)
print(f"Status: {fg_workflow2.feature_group_status}")

In [None]:
# Step 3: Enable Lake Formation governance
print("Step 3: Enabling Lake Formation governance...")
print("This will:")
print("  1. Register S3 with Lake Formation")
print("  2. Grant permissions to execution role")
print("  3. Revoke IAMAllowedPrincipals permissions")
print()

result = fg_workflow2.enable_lake_formation(  # new method
    use_service_linked_role=True
)

print(f"\nLake Formation setup results:")
print(f"  s3_registration: {result['s3_registration']}")
print(f"  permissions_granted: {result['permissions_granted']}")
print(f"  iam_principal_revoked: {result['iam_principal_revoked']}")

In [None]:
# Step 4: Ingest sample records to verify everything works
print("Step 4: Ingesting records to verify Lake Formation setup...")
ingest_sample_records(fg_workflow2, num_records=3)

In [None]:
# Step 5: Retrieve a sample record from the online store
print("Step 5: Retrieving record for customer 'cust_1' from online store...")
response = fg_workflow2.get_record(record_identifier_value_as_string="cust_1")

print("\nRecord retrieved successfully!")
print("Features:")
for feature in response.record:
    print(f"  {feature.feature_name}: {feature.value_as_string}")

In [None]:
# Verify Feature Group status
fg_workflow2.refresh()
print(f"Feature Group: {fg_workflow2.feature_group_name}")
print(f"Status: {fg_workflow2.feature_group_status}")
print(f"ARN: {fg_workflow2.feature_group_arn}")

---
# Cleanup

Delete the Feature Groups and associated Glue tables created in this demo.

In [None]:
# Uncomment to delete the Feature Groups
# cleanup_feature_group(fg_workflow1)
# cleanup_feature_group(fg_workflow2)

---
# Summary

This notebook demonstrated two workflows:

**Example 1: Lake Formation at Creation**
- Use `LakeFormationConfig` with `enabled=True` in `FeatureGroup.create()`
- Lake Formation is automatically configured after Feature Group creation
- Both online and offline stores enabled

**Example 2: Enable Lake Formation Later**
- Create Feature Group normally without Lake Formation
- Call `enable_lake_formation()` method after creation
- More control over when Lake Formation is enabled


---
# FAQ

## What is the S3 deny policy for?

When you enable Lake Formation governance, you control access to data through Lake Formation permissions. However, **IAM roles that already have direct S3 access will continue to have access** to the underlying data files, bypassing Lake Formation entirely.

The S3 deny policy closes this access path by explicitly denying S3 access to all principals except:
- The Lake Formation service-linked role (for data access)
- The Feature Store offline store role provided during Feature Group creation
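
As a rough illustration of what such a statement can look like, the sketch below builds a deny statement that exempts two role ARNs through the `aws:PrincipalArn` condition key. It is not the policy the SDK generates. The function name, `Sid`, action list, bucket, prefix, and account ID are all assumptions made for the example.

```python
import json


def build_deny_statement(bucket: str, prefix: str, allowed_role_arns: list) -> dict:
    """Build a bucket-policy statement denying object access to all but the listed roles."""
    return {
        "Sid": "DenyDirectAccessOutsideLakeFormation",
        "Effect": "Deny",
        "Principal": "*",
        # Assumed action list: object-level reads, writes, and deletes under the prefix.
        "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
        # With a negated operator and several values, the deny applies only
        # when the caller's ARN matches none of the listed ARNs.
        "Condition": {"ArnNotEquals": {"aws:PrincipalArn": sorted(allowed_role_arns)}},
    }


statement = build_deny_statement(
    bucket="example-bucket",
    prefix="feature-store-demo/",
    allowed_role_arns=[
        "arn:aws:iam::111122223333:role/aws-service-role/lakeformation.amazonaws.com/AWSServiceRoleForLakeFormationDataAccess",
        "arn:aws:iam::111122223333:role/SagemakerFeatureStoreOfflineRole",
    ],
)
print(json.dumps(statement, indent=2))
```

The output is a single statement, not a complete bucket policy; it still has to be reviewed and merged with whatever policy the bucket already has.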

## Why don't we apply the S3 deny policy automatically?

We provide the policy as a **recommendation** rather than applying it automatically for several important reasons:

### 1. Protect existing SageMaker workflows from breaking

Many customers already have SageMaker training and processing jobs wired directly to S3 URIs. An automatic S3 deny could cause those jobs to fail the moment governance is enabled on a table.

### 2. Support different personas and trust levels

Different users have different access needs:
- **Analysts / BI users** - should only see data through governed surfaces (Lake Formation tables, Athena, Redshift, etc.)
- **ML / Data engineers** - often need raw S3 access for training, feature engineering, and debugging

### 3. Enable gradual migration to stronger governance

Many customers want to phase in Lake Formation governance:
1. Start by governing table access only
2. Later tighten S3 access once they've refactored jobs and validated behavior

### 4. Avoid breaking existing bucket policies

Automatically modifying bucket policies could:
- Conflict with existing policy statements
- Lock out users or services unexpectedly
- Cause cascading failures across multiple applications sharing the bucket

For these reasons, treat the S3 policy as a starting point: review and validate it against your existing bucket policy and workloads before applying it.
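
One way to reduce the risk of clobbering existing statements is to merge a new statement into a copy of the current bucket policy, keyed by `Sid`, and review the result before applying it. The sketch below shows only that merge step. The helper name `upsert_statement` and the sample policy are hypothetical, and fetching or writing the policy (for example with the boto3 S3 client's `get_bucket_policy` and `put_bucket_policy`) is left out on purpose.

```python
import copy


def upsert_statement(policy: dict, statement: dict) -> dict:
    """Return a copy of the policy with the statement added, or replaced if its Sid exists."""
    merged = copy.deepcopy(policy) if policy else {"Version": "2012-10-17", "Statement": []}
    statements = merged.setdefault("Statement", [])
    # A policy document may hold a single statement as a dict instead of a list.
    if isinstance(statements, dict):
        statements = [statements]
        merged["Statement"] = statements

    for index, existing in enumerate(statements):
        if existing.get("Sid") == statement["Sid"]:
            statements[index] = statement  # replace in place, keep the original order
            break
    else:
        statements.append(statement)
    return merged


# Hypothetical existing bucket policy and new deny statement.
existing_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Sid": "AllowTeamRead", "Effect": "Allow", "Action": "s3:GetObject"}],
}
deny_statement = {"Sid": "DenyDirectAccess", "Effect": "Deny", "Action": "s3:GetObject"}

merged_once = upsert_statement(existing_policy, deny_statement)
merged_twice = upsert_statement(merged_once, deny_statement)
print([s["Sid"] for s in merged_twice["Statement"]])
```

Because the function returns a copy, the original policy stays available for comparison or rollback, and applying the same statement twice does not create a duplicate.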
