# SageMaker Feature Store

In this section, we use **Amazon SageMaker Feature Store** to manage, store, and retrieve machine learning features. Feature Store provides a centralized repository for engineered features, which keeps training and inference workflows consistent.  

### Steps Covered:  
1. **Feature Group Creation** – Defining and registering feature groups to store structured feature data.  
2. **Ingesting Data** – Storing transformed features into the Feature Store for reuse.  
3. **Retrieving Features** – Querying and loading features into training and inference pipelines.  
4. **Feature Versioning & Governance** – Ensuring traceability and reproducibility of features across different model iterations.  

SageMaker Feature Store lets us share features across use cases, query them from the offline store for training, and (when the online store is enabled) read them with low latency at inference time.  
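
As a rough illustration of the retrieval step, the sketch below flattens a `GetRecord`-style response into a plain dictionary. It assumes the feature group has an online store enabled, which is not the case for the offline-only group created later in this notebook. `record_to_dict` is a hypothetical helper name, and `sample_response` is hand-written to mimic the response shape rather than captured from a real call.

```python
from typing import Any, Dict, Optional


def record_to_dict(response: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Flatten a GetRecord-style response into {feature_name: value_as_string}.

    Returns None when the response carries no "Record" key, which is how a
    missing record shows up in the response.
    """
    record = response.get("Record")
    if record is None:
        return None
    return {item["FeatureName"]: item["ValueAsString"] for item in record}


# Hand-written stand-in for the result of a call such as:
#   featurestore_runtime.get_record(
#       FeatureGroupName="employee-attrition-feature-store",
#       RecordIdentifierValueAsString="10861",
#   )
# (assumption: only works when the online store is enabled)
sample_response = {
    "Record": [
        {"FeatureName": "Employee_ID", "ValueAsString": "10861"},
        {"FeatureName": "Age", "ValueAsString": "37"},
        {"FeatureName": "Attrition", "ValueAsString": "0"},
    ]
}

features = record_to_dict(sample_response)
print(features)
```

All values come back as strings, so an inference pipeline would still need to cast them to the numeric types the model expects.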

Target label encoding: `Attrition` = 1 means the employee stayed, 0 means the employee left.

In [1]:
!pip install sagemaker pandas boto3 awswrangler --quiet


In [2]:
# AWS imports
import boto3
import botocore
from botocore.exceptions import ClientError
import sagemaker
import awswrangler as wr
from sagemaker.feature_store.feature_group import FeatureGroup

# Data handling imports
import pandas as pd
from io import StringIO

# Misc imports
import time
from time import gmtime, strftime
from IPython.display import display, HTML






In [3]:
sess = sagemaker.Session()

bucket = sess.default_bucket()

role = sagemaker.get_execution_role()

region = boto3.Session().region_name

account_id = boto3.client("sts").get_caller_identity().get("Account")

sm = boto3.Session().client(service_name="sagemaker", region_name=region)

s3 = boto3.client('s3')

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

FILE_NAME="remaining_data.csv"
DATA_PATH = f"s3://{bucket}/aai-540-group-3-final-project/data/db_source/"
print(f"✅ Using S3 bucket: {DATA_PATH}")
print(f"✅ Using IAM Role: {role}")


✅ Using S3 bucket: s3://sagemaker-us-east-1-095342792399/aai-540-group-3-final-project/data/db_source/
✅ Using IAM Role: arn:aws:iam::095342792399:role/LabRole


In [4]:
# Load and Prepare Data
file_key = f"aai-540-group-3-final-project/data/db_source/{FILE_NAME}"

# Fetch the object from S3 (the body is a streaming, file-like object)
response = s3.get_object(Bucket=bucket, Key=file_key)

# Read the CSV content into a pandas DataFrame
data = pd.read_csv(response['Body'])

# Display the DataFrame
display(data)

Unnamed: 0,Employee ID,Age,Gender,Years at Company,Job Role,Monthly Income,Work-Life Balance,Job Satisfaction,Performance Rating,Number of Promotions,...,Number of Dependents,Job Level,Company Size,Company Tenure,Remote Work,Leadership Opportunities,Innovation Opportunities,Company Reputation,Employee Recognition,Attrition
0,10861,37,0,27,4,12617,1,0,0,1,...,2,0,2,57,1,0,0,0,0,0
1,33332,35,1,12,0,5935,3,0,0,2,...,1,2,1,19,0,0,0,3,0,0
2,17066,52,1,34,0,3908,3,3,0,1,...,2,1,1,63,0,1,0,3,2,0
3,62940,35,1,21,2,5663,2,2,0,0,...,2,2,1,70,0,0,0,1,2,1
4,65686,30,1,4,2,8184,1,0,2,4,...,3,1,2,50,0,0,0,3,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44694,33489,25,0,1,0,7550,0,0,0,0,...,3,1,2,17,1,0,0,2,2,1
44695,71741,38,1,23,0,4199,2,3,0,1,...,4,0,1,35,0,0,0,2,1,1
44696,14104,22,1,2,2,7631,2,0,0,0,...,3,0,2,41,0,0,0,2,2,0
44697,65630,50,1,36,4,7472,3,0,1,3,...,0,1,1,72,1,0,0,3,3,1


In [5]:
# Rename feature names to remove spaces
data.columns = (
    data.columns
    .str.replace(" ", "_")  # Replace spaces with underscores
    .str.replace("-", "_")  # Replace hyphens with underscores (optional)
    .str.replace("/", "_")  # Replace slashes with underscores (optional)
)

print("✅ Processed DataFrame:\n", data.head())


✅ Processed DataFrame:
    Employee_ID  Age  Gender  Years_at_Company  Job_Role  Monthly_Income  \
0        10861   37       0                27         4           12617   
1        33332   35       1                12         0            5935   
2        17066   52       1                34         0            3908   
3        62940   35       1                21         2            5663   
4        65686   30       1                 4         2            8184   

   Work_Life_Balance  Job_Satisfaction  Performance_Rating  \
0                  1                 0                   0   
1                  3                 0                   0   
2                  3                 3                   0   
3                  2                 2                   0   
4                  1                 0                   2   

   Number_of_Promotions  ...  Number_of_Dependents  Job_Level  Company_Size  \
0                     1  ...                     2          0            
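
The chained `str.replace` calls above handle spaces, hyphens, and slashes one at a time. A single regular expression can cover any other punctuation as well. This is a minimal sketch, not the notebook's method; `sanitize_column_name` is a hypothetical helper, and the sample names are taken from the raw column headers shown earlier.

```python
import re


def sanitize_column_name(name: str) -> str:
    """Replace each run of characters outside [A-Za-z0-9_] with one underscore."""
    cleaned = re.sub(r"[^0-9A-Za-z_]+", "_", name.strip())
    # Drop underscores left over at either end (e.g. from trailing punctuation)
    return cleaned.strip("_")


raw_columns = ["Employee ID", "Work-Life Balance", "Years at Company"]
clean_columns = [sanitize_column_name(c) for c in raw_columns]
print(clean_columns)
```

On a DataFrame, the same function could be applied with `data.columns = [sanitize_column_name(c) for c in data.columns]`.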

In [6]:
# Define the Feature Group Schema

feature_group_name = "employee-attrition-feature-store"

# Prepare the record identifier and event time columns
data["Employee_ID"] = data["Employee_ID"].astype(str)  # Record identifier stored as a string
data["EventTime"] = pd.to_datetime("now", utc=True).strftime("%Y-%m-%dT%H:%M:%SZ")  # ISO-8601 UTC timestamp

# Define feature group
feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=sess)
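
The `EventTime` string above ends in `Z`, so it should always be built from a UTC timestamp. The following sketch shows one way to produce and check that format with the standard library only. `utc_event_time` and `ISO_8601_UTC` are hypothetical names introduced for illustration.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Pattern for the second-resolution UTC format used in this notebook
ISO_8601_UTC = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")


def utc_event_time(moment: Optional[datetime] = None) -> str:
    """Format a timezone-aware datetime as an ISO-8601 UTC string.

    Uses the current UTC time when no datetime is given. Pass timezone-aware
    values only; naive datetimes would be interpreted as local time.
    """
    if moment is None:
        moment = datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


fixed = datetime(2025, 2, 17, 0, 16, 32, tzinfo=timezone.utc)
event_time = utc_event_time(fixed)
print(event_time)
```

Using one shared timestamp for every row, as this notebook does, marks the whole batch as a single ingestion event.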


In [7]:
import time
import botocore

def wait_for_feature_group_ready(feature_group_name, timeout=600, interval=10):
    """
    Waits for the Feature Group and Offline Store to reach ACTIVE state.
    - timeout: Maximum time (seconds) to wait.
    - interval: Time (seconds) between status checks.
    """
    start_time = time.time()
    while time.time() - start_time < timeout:
        response = sess.sagemaker_client.describe_feature_group(FeatureGroupName=feature_group_name)
        feature_group_status = response["FeatureGroupStatus"]
        offline_store_status = response.get("OfflineStoreStatus", {}).get("Status", "N/A")

        print(f"🔄 Feature Group Status: {feature_group_status}, Offline Store Status: {offline_store_status}")

        if feature_group_status == "Creating" or offline_store_status == "Creating":
            time.sleep(interval)  # Wait before checking again
        elif feature_group_status == "Created" and offline_store_status in ["Active", "N/A"]:
            print(f"✅ Feature Group '{feature_group_name}' is now ACTIVE!")
            print("⏳ Waiting 60 more seconds for metadata propagation...")
            time.sleep(60)  # ✅ Additional wait for metadata consistency
            return
        else:
            raise RuntimeError(f"❌ Unexpected state: {feature_group_status}, Offline Store: {offline_store_status}")

    raise TimeoutError(f"❌ Timeout: Feature Group '{feature_group_name}' did not become ACTIVE in {timeout} seconds.")
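
The polling pattern in `wait_for_feature_group_ready` can be separated from the AWS call so that it is easy to test without a live feature group. The sketch below is one possible shape, not part of the SageMaker SDK; `wait_for_status` and its parameters are hypothetical, and the status source is a stub that replays a fixed sequence.

```python
import time
from typing import Callable, Collection


def wait_for_status(
    get_status: Callable[[], str],
    target: str,
    pending: Collection[str],
    timeout: float = 600.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns target.

    Raises RuntimeError on any status that is neither target nor pending,
    and TimeoutError once the deadline has passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target:
            return status
        if status not in pending:
            raise RuntimeError(f"Unexpected status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{status}' after {timeout} seconds")
        sleep(interval)


# Stub that mimics a feature group moving from Creating to Created
statuses = iter(["Creating", "Creating", "Created"])
result = wait_for_status(
    lambda: next(statuses),
    target="Created",
    pending={"Creating"},
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
print(result)
```

In the notebook, `get_status` could wrap `describe_feature_group(...)["FeatureGroupStatus"]`, with the offline store status checked by a second call to the same helper.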



In [8]:

# Check if Feature Group exists
try:
    existing_feature_group = sess.sagemaker_client.describe_feature_group(
        FeatureGroupName=feature_group_name
    )
    print(f"✅ Feature Group '{feature_group_name}' already exists. Skipping creation.")
except botocore.exceptions.ClientError as e:
    if e.response["Error"]["Code"] == "ResourceNotFound":
        print(f"🔄 Feature Group '{feature_group_name}' not found. Creating a new one...")

        # Initialize Feature Group
        feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=sess)

        # Load feature definitions
        feature_group.load_feature_definitions(data_frame=data)

        # Create Feature Group with Offline Store
        feature_group.create(
            record_identifier_name="Employee_ID",
            event_time_feature_name="EventTime",
            role_arn=role,
            description="Feature store for employee attrition prediction",
            s3_uri=DATA_PATH,  # ✅ Ensure bucket exists in the correct region
        )

        print(f"🚀 Feature Group '{feature_group_name}' has been successfully created.")
    else:
        raise  # Raise any unexpected error

# **Wait until the Feature Group is truly ready**
wait_for_feature_group_ready(feature_group_name)

# **Final Check: Re-confirm Feature Group Status Before Ingestion**
final_status = sess.sagemaker_client.describe_feature_group(FeatureGroupName=feature_group_name)["FeatureGroupStatus"]
if final_status != "Created":
    raise RuntimeError(f"❌ Feature Group is still not ready: {final_status}")

print("🚀 Feature Group is READY. Proceeding with ingestion...")

# **Ingest Data** into Feature Store
feature_group.ingest(data_frame=data, max_workers=1, wait=True)

# **Describe Feature Group** to check status
feature_group.describe()



✅ Feature Group 'employee-attrition-feature-store' already exists. Skipping creation.
🔄 Feature Group Status: Created, Offline Store Status: Active
✅ Feature Group 'employee-attrition-feature-store' is now ACTIVE!
⏳ Waiting 60 more seconds for metadata propagation...
🚀 Feature Group is READY. Proceeding with ingestion...


{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:095342792399:feature-group/employee-attrition-feature-store',
 'FeatureGroupName': 'employee-attrition-feature-store',
 'RecordIdentifierFeatureName': 'Employee_ID',
 'EventTimeFeatureName': 'EventTime',
 'FeatureDefinitions': [{'FeatureName': 'Employee_ID',
   'FeatureType': 'String'},
  {'FeatureName': 'Age', 'FeatureType': 'Integral'},
  {'FeatureName': 'Gender', 'FeatureType': 'Integral'},
  {'FeatureName': 'Years_at_Company', 'FeatureType': 'Integral'},
  {'FeatureName': 'Job_Role', 'FeatureType': 'Integral'},
  {'FeatureName': 'Monthly_Income', 'FeatureType': 'Integral'},
  {'FeatureName': 'Work_Life_Balance', 'FeatureType': 'Integral'},
  {'FeatureName': 'Job_Satisfaction', 'FeatureType': 'Integral'},
  {'FeatureName': 'Performance_Rating', 'FeatureType': 'Integral'},
  {'FeatureName': 'Number_of_Promotions', 'FeatureType': 'Integral'},
  {'FeatureName': 'Overtime', 'FeatureType': 'Integral'},
  {'FeatureName': 'Distance_from_Home
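
The `FeatureDefinitions` in the output above follow from the DataFrame's column dtypes: integer columns become `Integral` and string columns become `String`. The sketch below approximates that mapping with pandas alone. It is an illustration of the idea, not the SDK's actual implementation; `feature_type_for` is a hypothetical helper.

```python
import pandas as pd


def feature_type_for(dtype) -> str:
    """Map a pandas dtype to a Feature Store type name (approximation)."""
    if pd.api.types.is_integer_dtype(dtype):
        return "Integral"
    if pd.api.types.is_float_dtype(dtype):
        return "Fractional"
    # Everything else (object, string, ...) is treated as a string feature
    return "String"


sample = pd.DataFrame(
    {
        "Employee_ID": ["10861"],
        "Age": [37],
        "Monthly_Income": [12617],
        "EventTime": ["2025-02-17T00:16:32Z"],
    }
)

definitions = [
    {"FeatureName": name, "FeatureType": feature_type_for(dtype)}
    for name, dtype in sample.dtypes.items()
]
print(definitions)
```

This is also why `Employee_ID` is cast to `str` before the definitions are loaded: left as `int64`, it would be registered as `Integral`.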

In [9]:
# Convert Employee_ID to string
data["Employee_ID"] = data["Employee_ID"].astype(str)

# Convert EventTime to correct format
data["EventTime"] = pd.to_datetime("now", utc=True).strftime("%Y-%m-%dT%H:%M:%SZ")

# Verify again
print("✅ Data Types After Conversion:\n", data.dtypes)
print("✅ Sample Data After Conversion:\n", data.head())


✅ Data Types After Conversion:
 Employee_ID                 object
Age                          int64
Gender                       int64
Years_at_Company             int64
Job_Role                     int64
Monthly_Income               int64
Work_Life_Balance            int64
Job_Satisfaction             int64
Performance_Rating           int64
Number_of_Promotions         int64
Overtime                     int64
Distance_from_Home           int64
Education_Level              int64
Marital_Status               int64
Number_of_Dependents         int64
Job_Level                    int64
Company_Size                 int64
Company_Tenure               int64
Remote_Work                  int64
Leadership_Opportunities     int64
Innovation_Opportunities     int64
Company_Reputation           int64
Employee_Recognition         int64
Attrition                    int64
EventTime                   object
dtype: object
✅ Sample Data After Conversion:
   Employee_ID  Age  Gender  Years_at_Company  



In [10]:
# Verify column names and data types before ingestion
print("🔍 Column Names:", data.columns)
print("🔍 Data Types:\n", data.dtypes)
print("🔍 First Few Rows:\n", data.head())


🔍 Column Names: Index(['Employee_ID', 'Age', 'Gender', 'Years_at_Company', 'Job_Role',
       'Monthly_Income', 'Work_Life_Balance', 'Job_Satisfaction',
       'Performance_Rating', 'Number_of_Promotions', 'Overtime',
       'Distance_from_Home', 'Education_Level', 'Marital_Status',
       'Number_of_Dependents', 'Job_Level', 'Company_Size', 'Company_Tenure',
       'Remote_Work', 'Leadership_Opportunities', 'Innovation_Opportunities',
       'Company_Reputation', 'Employee_Recognition', 'Attrition', 'EventTime'],
      dtype='object')
🔍 Data Types:
 Employee_ID                 object
Age                          int64
Gender                       int64
Years_at_Company             int64
Job_Role                     int64
Monthly_Income               int64
Work_Life_Balance            int64
Job_Satisfaction             int64
Performance_Rating           int64
Number_of_Promotions         int64
Overtime                     int64
Distance_from_Home           int64
Education_Level        
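
Printing columns and dtypes is a manual check. The same verification can be turned into assertions that fail before ingestion starts. The sketch below is one possible set of checks under the assumption that each batch should carry one row per record identifier; `ingestion_problems` is a hypothetical helper, and the sample frames are made up for the demo.

```python
from typing import List

import pandas as pd


def ingestion_problems(df: pd.DataFrame, record_id: str, event_time: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [
        f"missing required column: {column}"
        for column in (record_id, event_time)
        if column not in df.columns
    ]
    if problems:
        return problems
    if df[record_id].isna().any():
        problems.append(f"{record_id} contains null values")
    if df[record_id].dropna().duplicated().any():
        problems.append(f"{record_id} contains duplicate values in this batch")
    if df[event_time].isna().any():
        problems.append(f"{event_time} contains null values")
    return problems


clean = pd.DataFrame(
    {"Employee_ID": ["10861", "33332"], "EventTime": ["2025-02-17T00:16:32Z"] * 2}
)
flawed = pd.DataFrame(
    {"Employee_ID": ["10861", "10861", None], "EventTime": ["2025-02-17T00:16:32Z"] * 3}
)

print(ingestion_problems(clean, "Employee_ID", "EventTime"))
print(ingestion_problems(flawed, "Employee_ID", "EventTime"))
```

A cell could then stop early with `assert not ingestion_problems(data, "Employee_ID", "EventTime")` before calling `feature_group.ingest`.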

In [11]:
# Reload feature definitions with corrected data
feature_group.load_feature_definitions(data_frame=data)

# Ingest data into Feature Store with reduced parallel workers
feature_group.ingest(data_frame=data, max_workers=1, wait=True)


IngestionManagerPandas(feature_group_name='employee-attrition-feature-store', feature_definitions={'Employee_ID': {'FeatureName': 'Employee_ID', 'FeatureType': 'String'}, 'Age': {'FeatureName': 'Age', 'FeatureType': 'Integral'}, 'Gender': {'FeatureName': 'Gender', 'FeatureType': 'Integral'}, 'Years_at_Company': {'FeatureName': 'Years_at_Company', 'FeatureType': 'Integral'}, 'Job_Role': {'FeatureName': 'Job_Role', 'FeatureType': 'Integral'}, 'Monthly_Income': {'FeatureName': 'Monthly_Income', 'FeatureType': 'Integral'}, 'Work_Life_Balance': {'FeatureName': 'Work_Life_Balance', 'FeatureType': 'Integral'}, 'Job_Satisfaction': {'FeatureName': 'Job_Satisfaction', 'FeatureType': 'Integral'}, 'Performance_Rating': {'FeatureName': 'Performance_Rating', 'FeatureType': 'Integral'}, 'Number_of_Promotions': {'FeatureName': 'Number_of_Promotions', 'FeatureType': 'Integral'}, 'Overtime': {'FeatureName': 'Overtime', 'FeatureType': 'Integral'}, 'Distance_from_Home': {'FeatureName': 'Distance_from_Home

In [12]:
# Describe the Feature Group to check ingestion status
feature_group.describe()


{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:095342792399:feature-group/employee-attrition-feature-store',
 'FeatureGroupName': 'employee-attrition-feature-store',
 'RecordIdentifierFeatureName': 'Employee_ID',
 'EventTimeFeatureName': 'EventTime',
 'FeatureDefinitions': [{'FeatureName': 'Employee_ID',
   'FeatureType': 'String'},
  {'FeatureName': 'Age', 'FeatureType': 'Integral'},
  {'FeatureName': 'Gender', 'FeatureType': 'Integral'},
  {'FeatureName': 'Years_at_Company', 'FeatureType': 'Integral'},
  {'FeatureName': 'Job_Role', 'FeatureType': 'Integral'},
  {'FeatureName': 'Monthly_Income', 'FeatureType': 'Integral'},
  {'FeatureName': 'Work_Life_Balance', 'FeatureType': 'Integral'},
  {'FeatureName': 'Job_Satisfaction', 'FeatureType': 'Integral'},
  {'FeatureName': 'Performance_Rating', 'FeatureType': 'Integral'},
  {'FeatureName': 'Number_of_Promotions', 'FeatureType': 'Integral'},
  {'FeatureName': 'Overtime', 'FeatureType': 'Integral'},
  {'FeatureName': 'Distance_from_Home

In [17]:
import awswrangler as wr

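# Define Athena database and table names
# (the table name, including its numeric suffix, is generated when the feature group is created;
# it is also reported under OfflineStoreConfig -> DataCatalogConfig in feature_group.describe())
athena_database = "sagemaker_featurestore"
athena_table = "employee_attrition_feature_store_1739750199"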

# Example query: Fetch 5 records from the feature store
query = f"SELECT * FROM {athena_database}.{athena_table} LIMIT 5"

# Execute Athena query and return results as Pandas DataFrame
athena_df = wr.athena.read_sql_query(query, database=athena_database)

# Display results
print("✅ Retrieved Data from Offline Store (Athena):")
print(athena_df.head())

from IPython.display import display
display(athena_df)


✅ Retrieved Data from Offline Store (Athena):
  employee_id  age  gender  years_at_company  job_role  monthly_income  \
0       15667   39       1                24         0            4604   
1        7731   33       0                 9         2            8011   
2       28382   43       0                12         2            9001   
3        5088   37       1                 3         4            8017   
4       61027   51       1                34         4            9591   

   work_life_balance  job_satisfaction  performance_rating  \
0                  2                 0                   0   
1                  0                 0                   0   
2                  0                 3                   0   
3                  1                 0                   0   
4                  2                 3                   0   

   number_of_promotions  ...  remote_work  leadership_opportunities  \
0                     0  ...            1                        

Unnamed: 0,employee_id,age,gender,years_at_company,job_role,monthly_income,work_life_balance,job_satisfaction,performance_rating,number_of_promotions,...,remote_work,leadership_opportunities,innovation_opportunities,company_reputation,employee_recognition,attrition,eventtime,write_time,api_invocation_time,is_deleted
0,15667,39,1,24,0,4604,2,0,0,0,...,1,0,0,2,0,1,2025-02-17T00:16:32Z,2025-02-17 00:22:41.411,2025-02-17 00:17:42,False
1,7731,33,0,9,2,8011,0,0,0,1,...,0,0,0,1,0,1,2025-02-17T00:16:32Z,2025-02-17 00:22:41.411,2025-02-17 00:17:42,False
2,28382,43,0,12,2,9001,0,3,0,0,...,0,0,0,2,2,1,2025-02-17T00:16:32Z,2025-02-17 00:22:41.411,2025-02-17 00:17:42,False
3,5088,37,1,3,4,8017,1,0,0,2,...,1,1,0,3,2,0,2025-02-17T00:16:32Z,2025-02-17 00:22:41.411,2025-02-17 00:17:43,False
4,61027,51,1,34,4,9591,2,3,0,0,...,0,0,1,1,1,0,2025-02-17T00:16:32Z,2025-02-17 00:22:41.411,2025-02-17 00:17:43,False
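
Two refinements may be useful for the Athena query above. First, the table name can be read from the feature group description instead of being hard-coded. Second, because the offline store is append-only and this notebook ingests the data more than once, a table may hold several rows per `employee_id`; a window function can keep only the most recent one. The sketch below builds such a query as a string. `offline_table` and `latest_records_query` are hypothetical helpers, and `description` is a trimmed, hand-written stand-in for the output of `feature_group.describe()`.

```python
from typing import Any, Dict, Tuple


def offline_table(description: Dict[str, Any]) -> Tuple[str, str]:
    """Extract (database, table) from a describe_feature_group-style dict."""
    catalog = description["OfflineStoreConfig"]["DataCatalogConfig"]
    return catalog["Database"], catalog["TableName"]


def latest_records_query(database: str, table: str, record_id: str) -> str:
    """Build an Athena query that keeps the newest non-deleted row per record."""
    return f"""
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY {record_id}
               ORDER BY eventtime DESC, api_invocation_time DESC, write_time DESC
           ) AS row_num
    FROM "{database}"."{table}"
)
WHERE row_num = 1 AND NOT is_deleted
""".strip()


# Trimmed stand-in for feature_group.describe() (only the keys used here)
description = {
    "OfflineStoreConfig": {
        "DataCatalogConfig": {
            "Database": "sagemaker_featurestore",
            "TableName": "employee_attrition_feature_store_1739750199",
        }
    }
}

database, table = offline_table(description)
query = latest_records_query(database, table, record_id="employee_id")
print(query)
```

The resulting string could be passed to `wr.athena.read_sql_query(query, database=database)` in place of the `LIMIT 5` query.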


In [18]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}


In [19]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>