# Autism Prediction with Amazon SageMaker FeatureStore


This notebook is a basic breakdown of ingesting data into SageMaker and creating training and testing datasets for a random forest model.  
**Please be mindful:**  
This is not a notebook for final submission, but a draft with notes and edits to communicate my updates to the team.
___


Kernel `Python 3 (Data Science)` works well with this notebook.

The following policies need to be attached to the execution role:
- AmazonSageMakerFullAccess
- AmazonS3FullAccess

## Contents
1. [Background](#Background)
1. [Setup SageMaker FeatureStore](#Setup-SageMaker-FeatureStore)
1. [Inspect Dataset](#Inspect-Dataset)
1. [Ingest Data into FeatureStore](#Ingest-Data-into-FeatureStore)
1. [Build Training Dataset](#Build-Training-Dataset)
1. [Train and Deploy the Model](#Train-and-Deploy-the-Model)
1. [SageMaker FeatureStore At Inference](#SageMaker-FeatureStore-During-Inference)
1. [Cleanup Resources](#Cleanup-Resources)

## Background

Amazon SageMaker FeatureStore is a new SageMaker capability that makes it easy for customers to create and manage curated data for machine learning (ML) development. SageMaker FeatureStore enables data ingestion via a high TPS API and data consumption via the online and offline stores. 

This notebook provides an example of the APIs provided by SageMaker FeatureStore by walking through the process of training an autism prediction model. The notebook demonstrates how the dataset can be ingested into the FeatureStore, queried to create a training dataset, and quickly accessed during inference. 


### Terminology

A **FeatureGroup** is the main resource that contains the metadata for all the data stored in SageMaker FeatureStore. A FeatureGroup contains a list of FeatureDefinitions. A **FeatureDefinition** consists of a name and one of the following data types: Integral, String, or Fractional. The FeatureGroup also contains an **OnlineStoreConfig** and an **OfflineStoreConfig** controlling where the data is stored. Enabling the online store allows quick access to the latest value for a Record via the GetRecord API. The offline store, a required configuration, allows storage of historical data in your S3 bucket. 

Once a FeatureGroup is created, data can be added as Records. A **Record** can be thought of as a row in a table. Each record has a unique **RecordIdentifier** along with values for all other FeatureDefinitions in the FeatureGroup. 
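
To make the terminology concrete, the sketch below converts one table row into the list-of-features shape that the runtime API uses for a Record (`FeatureName` / `ValueAsString` pairs). The helper name `row_to_record` and the sample row are illustrative assumptions, not part of the SDK.

```python
from typing import Any, Dict, List


def row_to_record(row: Dict[str, Any]) -> List[Dict[str, str]]:
    """Convert a row (column -> value) into the FeatureStore record shape.

    Hypothetical helper for illustration; every value is serialized as a
    string, matching the `ValueAsString` field used by the runtime API.
    """
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
    ]


# Illustrative row: `id` plays the role of the RecordIdentifier.
sample_row = {"id": 1, "a1_score": 1, "age": 38.17, "gender": "f"}
sample_record = row_to_record(sample_row)
print(sample_record[0])
```

Each entry corresponds to one FeatureDefinition, and the `id` entry is the one the online store would use to look the record up.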

## Setup SageMaker FeatureStore

Let's start by setting up the SageMaker Python SDK and boto client. Note that this notebook requires a `boto3` version above `1.17.21`.

In [1]:
import boto3
import sagemaker

original_boto3_version = boto3.__version__
%pip install 'boto3>1.17.21'

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Note: you may need to restart the kernel to use updated packages.
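
If you would rather check the installed version before reinstalling, a comparison along these lines could be used. This is a minimal sketch: the helper names `version_tuple` and `meets_minimum` are hypothetical, and it assumes purely numeric version components.

```python
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse a dotted version such as '1.17.21' into a tuple of ints.

    Assumes purely numeric components; pre-release suffixes are not handled.
    """
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, minimum: str = "1.17.21") -> bool:
    """Return True if `installed` is strictly above `minimum`."""
    return version_tuple(installed) > version_tuple(minimum)


print(meets_minimum("1.26.0"))   # a newer release
print(meets_minimum("1.17.21"))  # the boundary version itself
```

In the notebook this would be called as `meets_minimum(boto3.__version__)`. Comparing integer tuples rather than raw strings matters because, as strings, `"1.9.0"` would sort above `"1.17.21"`.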


In [2]:
from sagemaker.session import Session

region = boto3.Session().region_name

boto_session = boto3.Session(region_name=region)

sagemaker_client = boto_session.client(service_name="sagemaker", region_name=region)
featurestore_runtime = boto_session.client(
    service_name="sagemaker-featurestore-runtime", region_name=region
)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime,
)

In [3]:
default_s3_bucket_name = feature_store_session.default_bucket()
print(default_s3_bucket_name)


sagemaker-us-west-2-204864359127


## Inspect Dataset  
Confirming that we are working with the correct dataset.
* SageMaker and the AWS ML tools are mainly useful for working with multiple large, unstructured datasets: AWS lets us combine and structure them cleanly, which helps us build better models. For the scope of this class project we use only **ONE** small dataset. We will **not** combine multiple datasets (_unless we choose to do so in the future_), so some steps were skipped.
* **Why this matters:** The purpose of the feature store is to provide a framework that accepts different datasets and organizes them into feature groups. Because we are working with only one dataset, a single feature group is all we need.

The objective of the model is to predict whether a patient is likely to have autism spectrum disorder (ASD).  
____

In [4]:
import pandas as pd

s3_uri = "s3://sagemaker-us-west-2-204864359127/autism_prediction/csv/7b6209d9b95c40288fc24b30de17561e.snappy.parquet"
df = pd.read_parquet(s3_uri, engine="pyarrow")

> Here we visualize the data, and inspect it.

In [5]:
import pandas as pd

# Read Parquet file from S3 (make sure you have pyarrow or fastparquet installed)
df = pd.read_parquet(s3_uri, engine='pyarrow')

# Show first few rows
print(df.head())

# Show datatypes
print(df.dtypes)


   id  a1_score  a2_score  a3_score  a4_score  a5_score  a6_score  a7_score  \
0   1         1         0         1         0         1         0         1   
1   2         0         0         0         0         0         0         0   
2   3         1         1         1         1         1         1         1   
3   4         0         0         0         0         0         0         0   
4   5         0         0         0         0         0         0         0   

   a8_score  a9_score  ...  gender       ethnicity jaundice austim  \
0         0         1  ...       f               ?       no     no   
1         0         0  ...       m               ?       no     no   
2         1         1  ...       m  White-European       no    yes   
3         0         0  ...       f               ?       no     no   
4         0         0  ...       m               ?       no     no   

   contry_of_res used_app_before     result     age_desc  relation class_asd  
0        Austria         

In [8]:
df['ethnicity'].unique()

array(['?', 'White-European', 'Middle Eastern ', 'Pasifika', 'Black',
       'Others', 'Hispanic', 'Asian', 'Turkish', 'South Asian', 'Latino',
       'others'], dtype=object)

In [9]:
df['contry_of_res'].unique()

array(['Austria', 'India', 'United States', 'South Africa', 'Jordan',
       'United Kingdom', 'Brazil', 'New Zealand', 'Canada', 'Kazakhstan',
       'United Arab Emirates', 'Australia', 'Ukraine', 'Iraq', 'France',
       'Malaysia', 'Viet Nam', 'Egypt', 'Netherlands', 'Afghanistan',
       'Oman', 'Italy', 'AmericanSamoa', 'Bahamas', 'Saudi Arabia',
       'Ireland', 'Aruba', 'Sri Lanka', 'Russia', 'Bolivia', 'Azerbaijan',
       'Armenia', 'Serbia', 'Ethiopia', 'Sweden', 'Iceland', 'Hong Kong',
       'Angola', 'China', 'Germany', 'Spain', 'Tonga', 'Pakistan', 'Iran',
       'Argentina', 'Japan', 'Mexico', 'Nicaragua', 'Sierra Leone',
       'Czech Republic', 'Niger', 'Romania', 'Cyprus', 'Belgium',
       'Burundi', 'Bangladesh'], dtype=object)

In [11]:
df['relation'].unique()

array(['Self', 'Relative', 'Parent', '?', 'Others',
       'Health care professional'], dtype=object)

In [12]:
df['age_desc'].unique()

array(['18 and more'], dtype=object)
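
The unique values above show a `'?'` placeholder, a trailing space in `'Middle Eastern '`, and both `'Others'` and `'others'`. One possible cleanup is sketched below; the function name `normalize_category` is hypothetical, and treating `'?'` as missing is an assumption about what the placeholder means. Note that title-casing also changes labels such as `'18 and more'`.

```python
from typing import Optional


def normalize_category(value: str) -> Optional[str]:
    """Trim whitespace, unify case, and map the '?' placeholder to None."""
    cleaned = value.strip()
    if cleaned == "?":
        return None  # treat the placeholder as a missing value (assumption)
    return cleaned.title()


raw_values = ["?", "Middle Eastern ", "Others", "others", "White-European"]
print([normalize_category(v) for v in raw_values])
```

With pandas, this could be applied per column, for example `df["ethnicity"].map(normalize_category)`. The notebook itself does not apply this step; the model below is trained on the raw categories.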

In [16]:
df.describe()

Unnamed: 0,id,a1_score,a2_score,a3_score,a4_score,a5_score,a6_score,a7_score,a8_score,a9_score,a10_score,age,result,class_asd
count,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0
mean,400.5,0.56,0.53,0.45,0.415,0.395,0.30375,0.3975,0.50875,0.495,0.6175,28.452118,8.537303,0.20125
std,231.0844,0.496697,0.499411,0.497805,0.49303,0.489157,0.460164,0.489687,0.500236,0.500288,0.486302,16.310966,4.807676,0.401185
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.71855,-6.137748,0.0
25%,200.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.198153,5.306575,0.0
50%,400.5,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,24.84835,9.605299,0.0
75%,600.25,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35.865429,12.514484,0.0
max,800.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,89.461718,15.853126,1.0


In [39]:
response = sagemaker_client.list_feature_groups()
for fg in response['FeatureGroupSummaries']:
    print(fg['FeatureGroupName'], fg['CreationTime'], fg['FeatureGroupStatus'])


transaction-feature-group-17-07-05-45 2025-06-17 07:05:49.848000+00:00 Created
identity-feature-group-17-07-05-45 2025-06-17 07:05:48.237000+00:00 Created


> **The cell below will delete your feature group**  
> _(Only run if you need to recreate)_

In [38]:
#feature_group_name = "autism-feature-group-2025-06-18-12-00-02"

# Delete the feature group
#sagemaker_client.delete_feature_group(FeatureGroupName=feature_group_name)

#print(f"Requested deletion for feature group: {feature_group_name}")


Requested deletion for feature group: autism-feature-group-2025-06-18-12-00-02


In [27]:
from sagemaker import get_execution_role

# Get your role as (role).
role = get_execution_role()
print(role)

arn:aws:iam::204864359127:role/LabRole


___
An event-time column is required for AWS Feature Store ingestion. Because our dataset did not come with an original "event_time" column, we must create one manually with pandas (`pd.to_datetime...`). AWS accepts the format _ISO-8601 with 'Z' suffix (UTC)_.

* It is here that I begin to understand that the AWS online feature store is best suited to time-dependent, real-time data processing.
* The purpose of the offline store is to capture historical data from the online store. Because we have only one static dataset, there is essentially nothing to capture and nothing to monitor.
* Here we ingest the dataset into the **online store** only so that it is written through to the **offline store**. We will then train our model from the offline store.
___
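
As a standalone illustration of the timestamp format described above, the sketch below formats a single `datetime` as ISO-8601 in UTC with millisecond precision and a `Z` suffix. The helper name `to_feature_store_timestamp` is hypothetical, and treating naive datetimes as UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone


def to_feature_store_timestamp(moment: datetime) -> str:
    """Format a datetime as ISO-8601 in UTC with milliseconds and a 'Z' suffix.

    Naive datetimes are assumed to already be in UTC (assumption of this sketch).
    """
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    moment = moment.astimezone(timezone.utc)
    # %f yields microseconds (6 digits); drop the last 3 to keep milliseconds.
    return moment.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"


example = datetime(2025, 6, 18, 12, 22, 49, 992000, tzinfo=timezone.utc)
print(to_feature_store_timestamp(example))
```

The pandas expression used in the notebook follows the same idea column-wise: format with `%f`, trim the last three digits, and append `Z`.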

In [49]:
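import time
from datetime import datetime
# The raw dataset has no event-time column, so stamp every row with the current UTC time if it is missing
if "event_time" not in df.columns:
    df["event_time"] = pd.Timestamp.now(tz="UTC")
# Convert to proper ISO-8601 format with millisecond precision and 'Z' suffix (UTC)
df["event_time"] = pd.to_datetime(df["event_time"], utc=True).dt.strftime('%Y-%m-%dT%H:%M:%S.%f').str[:-3] + 'Z'
event_time_feature_name = "event_time"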


In [41]:
from sagemaker.feature_store.feature_group import FeatureGroup
from time import strftime, gmtime

# Generate name
feature_group_name = f"autism-feature-group-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}"

# Initialize the Feature Group object
feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=feature_store_session)

# Load feature definitions from the DataFrame
feature_group.load_feature_definitions(data_frame=df)

[FeatureDefinition(feature_name='id', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='a1_score', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='a2_score', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='a3_score', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='a4_score', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='a5_score', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='a6_score', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='a7_score', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='a8_score', fe

In [42]:
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    print(f"FeatureGroup {feature_group.name} successfully created.")

# Create the feature group; `prefix` (the S3 key prefix for the offline store) must be defined in an earlier cell
feature_group.create(
    s3_uri=f"s3://{default_s3_bucket_name}/{prefix}",
    record_identifier_name='id',
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=True,
)

wait_for_feature_group_creation_complete(feature_group=feature_group)



Waiting for Feature Group Creation
Waiting for Feature Group Creation
Waiting for Feature Group Creation
Waiting for Feature Group Creation
FeatureGroup autism-feature-group-2025-06-18-12-22-51 successfully created.
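
The waiting loop above polls until the status leaves `Creating`, with no upper bound on how long it waits. A variant with a timeout might look like the sketch below. The function `wait_for_status` and its parameters are hypothetical, and the status source is passed in as a callable so the sketch runs without AWS access.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    pending: str = "Creating",
    success: str = "Created",
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll `get_status` until it leaves `pending`, or raise after a timeout."""
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{pending}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
        status = get_status()
    if status != success:
        raise RuntimeError(f"Unexpected final status: {status}")
    return status


# Stand-in status source for demonstration; replace with a real describe call.
fake_statuses = iter(["Creating", "Creating", "Created"])
print(wait_for_status(lambda: next(fake_statuses), poll_seconds=0.01))
```

In this notebook the callable could be `lambda: feature_group.describe().get("FeatureGroupStatus")`.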


> ## Feature Group Created  

The feature group can be confirmed in the AWS Glue or Athena console.

In [43]:
feature_group.describe()

{'FeatureGroupArn': 'arn:aws:sagemaker:us-west-2:204864359127:feature-group/autism-feature-group-2025-06-18-12-22-51',
 'FeatureGroupName': 'autism-feature-group-2025-06-18-12-22-51',
 'RecordIdentifierFeatureName': 'id',
 'EventTimeFeatureName': 'event_time',
 'FeatureDefinitions': [{'FeatureName': 'id', 'FeatureType': 'Integral'},
  {'FeatureName': 'a1_score', 'FeatureType': 'Integral'},
  {'FeatureName': 'a2_score', 'FeatureType': 'Integral'},
  {'FeatureName': 'a3_score', 'FeatureType': 'Integral'},
  {'FeatureName': 'a4_score', 'FeatureType': 'Integral'},
  {'FeatureName': 'a5_score', 'FeatureType': 'Integral'},
  {'FeatureName': 'a6_score', 'FeatureType': 'Integral'},
  {'FeatureName': 'a7_score', 'FeatureType': 'Integral'},
  {'FeatureName': 'a8_score', 'FeatureType': 'Integral'},
  {'FeatureName': 'a9_score', 'FeatureType': 'Integral'},
  {'FeatureName': 'a10_score', 'FeatureType': 'Integral'},
  {'FeatureName': 'age', 'FeatureType': 'Fractional'},
  {'FeatureName': 'gender', '

In [50]:
print(df.dtypes)
print(df.columns.tolist())

# Check for missing values
print(df.isnull().sum())

# Check sample rows for identifier and event time
print(df[['id', 'event_time']].head())


id                   int64
a1_score             int64
a2_score             int64
a3_score             int64
a4_score             int64
a5_score             int64
a6_score             int64
a7_score             int64
a8_score             int64
a9_score             int64
a10_score            int64
age                float64
gender              object
ethnicity           object
jaundice            object
austim              object
contry_of_res       object
used_app_before     object
result             float64
age_desc            object
relation            object
class_asd            int64
event_time          object
dtype: object
['id', 'a1_score', 'a2_score', 'a3_score', 'a4_score', 'a5_score', 'a6_score', 'a7_score', 'a8_score', 'a9_score', 'a10_score', 'age', 'gender', 'ethnicity', 'jaundice', 'austim', 'contry_of_res', 'used_app_before', 'result', 'age_desc', 'relation', 'class_asd', 'event_time']
id                 0
a1_score           0
a2_score           0
a3_score           0
a4_s

> We now ingest the data from the DataFrame _(sourced from S3)_ into the feature group

In [51]:
feature_group.ingest(
    data_frame=df,
    max_workers=3,
    wait=True
)


IngestionManagerPandas(feature_group_name='autism-feature-group-2025-06-18-12-22-51', feature_definitions={'id': {'FeatureName': 'id', 'FeatureType': 'Integral'}, 'a1_score': {'FeatureName': 'a1_score', 'FeatureType': 'Integral'}, 'a2_score': {'FeatureName': 'a2_score', 'FeatureType': 'Integral'}, 'a3_score': {'FeatureName': 'a3_score', 'FeatureType': 'Integral'}, 'a4_score': {'FeatureName': 'a4_score', 'FeatureType': 'Integral'}, 'a5_score': {'FeatureName': 'a5_score', 'FeatureType': 'Integral'}, 'a6_score': {'FeatureName': 'a6_score', 'FeatureType': 'Integral'}, 'a7_score': {'FeatureName': 'a7_score', 'FeatureType': 'Integral'}, 'a8_score': {'FeatureName': 'a8_score', 'FeatureType': 'Integral'}, 'a9_score': {'FeatureName': 'a9_score', 'FeatureType': 'Integral'}, 'a10_score': {'FeatureName': 'a10_score', 'FeatureType': 'Integral'}, 'age': {'FeatureName': 'age', 'FeatureType': 'Fractional'}, 'gender': {'FeatureName': 'gender', 'FeatureType': 'String'}, 'ethnicity': {'FeatureName': 'eth

We confirm that the data was ingested into the feature group by calling the `GetRecord` API with a `record_identifier_value` that identifies one specific row. The call below returns the value of each column for `id=280`. You may change the value, but identifiers outside the ingested `id` range (1 to 800) will not return a record.

In [57]:
record_identifier_value = str(280)

featurestore_runtime.get_record(
    FeatureGroupName=feature_group_name,
    RecordIdentifierValueAsString=record_identifier_value,
)


{'ResponseMetadata': {'RequestId': '5b7b2dc8-c7f9-41a2-bc9d-552eeee03578',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '5b7b2dc8-c7f9-41a2-bc9d-552eeee03578',
   'content-type': 'application/json',
   'content-length': '1762',
   'date': 'Wed, 18 Jun 2025 12:34:18 GMT'},
  'RetryAttempts': 0},
 'Record': [{'FeatureName': 'id', 'ValueAsString': '280'},
  {'FeatureName': 'a1_score', 'ValueAsString': '0'},
  {'FeatureName': 'a2_score', 'ValueAsString': '1'},
  {'FeatureName': 'a3_score', 'ValueAsString': '0'},
  {'FeatureName': 'a4_score', 'ValueAsString': '0'},
  {'FeatureName': 'a5_score', 'ValueAsString': '0'},
  {'FeatureName': 'a6_score', 'ValueAsString': '0'},
  {'FeatureName': 'a7_score', 'ValueAsString': '1'},
  {'FeatureName': 'a8_score', 'ValueAsString': '0'},
  {'FeatureName': 'a9_score', 'ValueAsString': '1'},
  {'FeatureName': 'a10_score', 'ValueAsString': '1'},
  {'FeatureName': 'age', 'ValueAsString': '25.83802099'},
  {'FeatureName': 'gender', 'ValueAsSt

In [56]:
df[df['id'] == 290]


Unnamed: 0,id,a1_score,a2_score,a3_score,a4_score,a5_score,a6_score,a7_score,a8_score,a9_score,...,ethnicity,jaundice,austim,contry_of_res,used_app_before,result,age_desc,relation,class_asd,event_time
289,290,1,1,0,0,0,0,1,0,0,...,South Asian,no,no,United Arab Emirates,no,13.545045,18 and more,?,0,2025-06-18T12:22:49.992Z


In [58]:
featurestore_runtime.batch_get_record(
    Identifiers=[
        {
            "FeatureGroupName": feature_group_name,
            "RecordIdentifiersValueAsString": ["280"],
        },
    ]
)

{'ResponseMetadata': {'RequestId': '2f01a542-8c2c-41ba-9f47-6d70b7c8bd83',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '2f01a542-8c2c-41ba-9f47-6d70b7c8bd83',
   'content-type': 'application/json',
   'content-length': '1916',
   'date': 'Wed, 18 Jun 2025 14:39:26 GMT'},
  'RetryAttempts': 0},
 'Records': [{'FeatureGroupName': 'autism-feature-group-2025-06-18-12-22-51',
   'RecordIdentifierValueAsString': '280',
   'Record': [{'FeatureName': 'id', 'ValueAsString': '280'},
    {'FeatureName': 'a1_score', 'ValueAsString': '0'},
    {'FeatureName': 'a2_score', 'ValueAsString': '1'},
    {'FeatureName': 'a3_score', 'ValueAsString': '0'},
    {'FeatureName': 'a4_score', 'ValueAsString': '0'},
    {'FeatureName': 'a5_score', 'ValueAsString': '0'},
    {'FeatureName': 'a6_score', 'ValueAsString': '0'},
    {'FeatureName': 'a7_score', 'ValueAsString': '1'},
    {'FeatureName': 'a8_score', 'ValueAsString': '0'},
    {'FeatureName': 'a9_score', 'ValueAsString': '1'},
    {'Fea
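
The `Record` lists returned above are easier to compare with a DataFrame row once flattened into a plain dictionary. The sketch below shows one way to do that. `record_to_dict` is a hypothetical helper, the sample response is hand-written and trimmed, and returning `None` when no `Record` entry is present is this sketch's own convention.

```python
from typing import Any, Dict, Optional


def record_to_dict(response: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Flatten a GetRecord-style response into {feature_name: value_as_string}.

    Returns None when the response carries no 'Record' entry, which this
    sketch treats as "no record found for that identifier".
    """
    features = response.get("Record")
    if not features:
        return None
    return {item["FeatureName"]: item["ValueAsString"] for item in features}


# Trimmed, hand-written response in the same shape as the output above.
sample_response = {
    "Record": [
        {"FeatureName": "id", "ValueAsString": "280"},
        {"FeatureName": "a1_score", "ValueAsString": "0"},
        {"FeatureName": "age", "ValueAsString": "25.83802099"},
    ]
}
print(record_to_dict(sample_response))
```

All values come back as strings, so numeric features need to be cast before they are compared with the DataFrame or passed to a model.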

In [59]:
print(feature_group.as_hive_ddl())

CREATE EXTERNAL TABLE IF NOT EXISTS sagemaker_featurestore.autism-feature-group-2025-06-18-12-22-51 (
  id INT
  a1_score INT
  a2_score INT
  a3_score INT
  a4_score INT
  a5_score INT
  a6_score INT
  a7_score INT
  a8_score INT
  a9_score INT
  a10_score INT
  age FLOAT
  gender STRING
  ethnicity STRING
  jaundice STRING
  austim STRING
  contry_of_res STRING
  used_app_before STRING
  result FLOAT
  age_desc STRING
  relation STRING
  class_asd INT
  event_time STRING
  write_time TIMESTAMP
  event_time TIMESTAMP
  is_deleted BOOLEAN
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION 's3://sagemaker-us-west-2-204864359127/sagemaker/autism-feature-group-2025-06-17-18-52-02/204864359127/sagemaker/us-west-2/offline-store/autism-feature-group-2025-06-18-12-22-51-1750249373/data'


## S3 Bucket Setup For The OfflineStore  
Now we prepare the offline store. In a real-world case with a single small dataset, we would not need the AWS online store and could work from a local source; this process is aimed at large or multiple datasets. Here, the only reason we created the online store is to satisfy the requirements of the offline store and of the course.

In [65]:
s3_client = boto3.client("s3")

feature_group_resolved_output_s3_uri = (
    feature_group.describe()
    .get("OfflineStoreConfig")
    .get("S3StorageConfig")
    .get("ResolvedOutputS3Uri")
)

feature_group_s3_prefix = feature_group_resolved_output_s3_uri.replace(
    f"s3://{default_s3_bucket_name}/", ""
)

offline_store_contents = None
while offline_store_contents is None:
    objects_in_bucket = s3_client.list_objects(
        Bucket=default_s3_bucket_name, Prefix=feature_group_s3_prefix
    )
    if "Contents" in objects_in_bucket and len(objects_in_bucket["Contents"]) > 1:
        offline_store_contents = objects_in_bucket["Contents"]
    else:
        print("Waiting for data in offline store...\n")
        time.sleep(60)

print("Data available.")

Data available.


> * The data is available in the offline store; we can confirm this by printing the first 10 object keys from the offline store  
> * Why Parquet files? The offline store writes records in batches as Parquet files, a compressed columnar format, partitioned by time (note the `year=/month=/day=/hour=` segments in the keys)  
> * To use these files, we must concatenate all Parquet files into one DataFrame  

In [85]:
for obj in offline_store_contents[:10]:
    print(obj["Key"])

sagemaker/autism-feature-group-2025-06-17-18-52-02/204864359127/sagemaker/us-west-2/offline-store/autism-feature-group-2025-06-18-12-22-51-1750249373/data/year=2025/month=06/day=18/hour=12/20250618T122249Z_2Ivs88dLYliG5jXn.parquet
sagemaker/autism-feature-group-2025-06-17-18-52-02/204864359127/sagemaker/us-west-2/offline-store/autism-feature-group-2025-06-18-12-22-51-1750249373/data/year=2025/month=06/day=18/hour=12/20250618T122249Z_8MXOrImlmY1Vt8iv.parquet
sagemaker/autism-feature-group-2025-06-17-18-52-02/204864359127/sagemaker/us-west-2/offline-store/autism-feature-group-2025-06-18-12-22-51-1750249373/data/year=2025/month=06/day=18/hour=12/20250618T122249Z_AzYGbk32s9l51NPL.parquet
sagemaker/autism-feature-group-2025-06-17-18-52-02/204864359127/sagemaker/us-west-2/offline-store/autism-feature-group-2025-06-18-12-22-51-1750249373/data/year=2025/month=06/day=18/hour=12/20250618T122249Z_JQdYf4nudlkJiN8u.parquet
sagemaker/autism-feature-group-2025-06-17-18-52-02/204864359127/sagemaker/us
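
The keys listed above embed `year=/month=/day=/hour=` path segments. If you ever need to filter files by time window before downloading them, those segments can be parsed directly from the key. The following is a minimal sketch: `parse_partition` is a hypothetical helper and the example key is a shortened placeholder, not a real object in the bucket.

```python
import re
from typing import Dict, Optional

PARTITION_PATTERN = re.compile(
    r"year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})/hour=(?P<hour>\d{2})/"
)


def parse_partition(key: str) -> Optional[Dict[str, int]]:
    """Extract the year/month/day/hour partition values from an object key."""
    match = PARTITION_PATTERN.search(key)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


# Shortened, illustrative key; the leading path is a placeholder.
example_key = "offline-store/example/data/year=2025/month=06/day=18/hour=12/part.parquet"
print(parse_partition(example_key))
```

For this project every file falls in the same hour, so the filter would select everything; the pattern becomes useful once records are ingested over a longer period.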

In [67]:
import pandas as pd

s3_uri = f"s3://{default_s3_bucket_name}/{offline_store_contents[0]['Key']}"
df = pd.read_parquet(s3_uri, engine="pyarrow")
print(df.head())


    id  a1_score  a2_score  a3_score  a4_score  a5_score  a6_score  a7_score  \
0  277         0         0         0         0         0         0         1   
1   30         1         1         1         1         1         1         1   
2   35         0         0         0         0         0         0         0   
3  578         1         1         0         1         1         1         1   
4   52         0         1         0         0         0         0         0   

   a8_score  a9_score  ...  relation  class_asd                event_time  \
0         0         0  ...      Self          0  2025-06-18T12:22:49.992Z   
1         1         1  ...      Self          1  2025-06-18T12:22:49.992Z   
2         1         0  ...      Self          0  2025-06-18T12:22:49.992Z   
3         1         1  ...      Self          1  2025-06-18T12:22:49.992Z   
4         1         0  ...         ?          0  2025-06-18T12:22:49.992Z   

                        write_time       api_invocation_

> * As stated before, the data in the offline store is distributed across multiple Parquet files; we combine them here:

In [75]:
import boto3
import pandas as pd
import io

bucket = "sagemaker-us-west-2-204864359127"
s3 = boto3.client("s3")

# Only keep valid .parquet keys
valid_keys = [obj["Key"] for obj in offline_store_contents if obj["Key"].endswith(".parquet")]

# If we had multiple datasets, we would pull only specific columns from the feature group HERE.
# But because we have only one dataset, it is more efficient to do feature engineering after the dataset is acquired

# Read and concatenate all Parquet files into a single DataFrame
dfs = []
for key in valid_keys:
    response = s3.get_object(Bucket=bucket, Key=key)
    dfs.append(pd.read_parquet(io.BytesIO(response['Body'].read())))

df_combined = pd.concat(dfs, ignore_index=True)


Here we confirm that the Parquet files were combined correctly:
>* `df_combined` is the complete dataset with 800 rows
>* `df` holds only one sample Parquet file, one of the pieces from which `df_combined` was assembled
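
One way to make this confirmation explicit is to check that every `id` appears exactly once after concatenation. The sketch below is an assumption-laden illustration: `summarize_ids` is a hypothetical helper, and it assumes ids run from 1 to the expected count, which matches the `describe()` output for the source data (min 1, max 800, count 800).

```python
from collections import Counter
from typing import Dict, Iterable, List


def summarize_ids(ids: Iterable[int], expected_count: int) -> Dict[str, List[int]]:
    """Report duplicated and missing ids, assuming ids run from 1 to expected_count."""
    counts = Counter(ids)
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    missing = sorted(set(range(1, expected_count + 1)) - set(counts))
    return {"duplicated": duplicated, "missing": missing}


# Toy input: id 2 appears twice and id 4 is absent.
print(summarize_ids([1, 2, 2, 3, 5], expected_count=5))
```

In the notebook this could be called as `summarize_ids(df_combined["id"], expected_count=800)`; both lists should come back empty if the files were combined without loss or duplication.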

In [77]:
print(df_combined.value_counts)


<bound method DataFrame.value_counts of       id  a1_score  a2_score  a3_score  a4_score  a5_score  a6_score  \
0    277         0         0         0         0         0         0   
1     30         1         1         1         1         1         1   
2     35         0         0         0         0         0         0   
3    578         1         1         0         1         1         1   
4     52         0         1         0         0         0         0   
..   ...       ...       ...       ...       ...       ...       ...   
795  769         1         1         1         1         1         1   
796  497         0         1         1         1         1         0   
797  770         1         0         1         0         1         0   
798  787         1         1         1         1         1         0   
799  263         1         1         1         1         1         1   

     a7_score  a8_score  a9_score  ...   contry_of_res  used_app_before  \
0           1       

In [68]:
df.describe()

Unnamed: 0,id,a1_score,a2_score,a3_score,a4_score,a5_score,a6_score,a7_score,a8_score,a9_score,a10_score,age,result,class_asd
count,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0
mean,279.727273,0.363636,0.545455,0.272727,0.454545,0.454545,0.363636,0.545455,0.454545,0.454545,0.636364,28.014649,9.29298,0.181818
std,257.816249,0.504525,0.522233,0.467099,0.522233,0.522233,0.504525,0.522233,0.522233,0.522233,0.504525,15.920551,4.70692,0.40452
min,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.891801,0.245984,0.0
25%,67.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.237132,7.45195,0.0
50%,137.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,24.536547,10.650866,0.0
75%,528.0,1.0,1.0,0.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,41.230618,12.012336,0.0
max,695.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,57.227613,15.770715,1.0


In [78]:
df_combined.dtypes

id                                   int64
a1_score                             int64
a2_score                             int64
a3_score                             int64
a4_score                             int64
a5_score                             int64
a6_score                             int64
a7_score                             int64
a8_score                             int64
a9_score                             int64
a10_score                            int64
age                                float64
gender                              object
ethnicity                           object
jaundice                            object
austim                              object
contry_of_res                       object
used_app_before                     object
result                             float64
age_desc                            object
relation                            object
class_asd                            int64
event_time                          object
write_time 

Moving forward with `df_combined`, we create a train/validation split from the data and prepare it for model training.

In [80]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Drop unneeded columns (adjust as necessary)
X = df_combined.drop(columns=[
    'id', 'relation', 'class_asd',
    'event_time', 'write_time', 'api_invocation_time',
    'is_deleted'
])

# Target column
y = df_combined['class_asd']

# Identify categorical columns
categorical_cols = X.select_dtypes(include='object').columns.tolist()

# Encode categorical features
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X[categorical_cols] = encoder.fit_transform(X[categorical_cols].astype(str))

# Train/Validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
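
To show what the ordinal encoding step does conceptually, here is a plain-Python sketch of the idea: each distinct category gets an integer code in sorted order, and categories not seen during fitting fall back to `-1`. This is a simplification for illustration, not scikit-learn's implementation, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List


def fit_ordinal_mapping(values: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct category an integer code in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(values: Iterable[str], mapping: Dict[str, int], unknown_value: int = -1) -> List[int]:
    """Encode values with the fitted mapping; unseen categories get `unknown_value`."""
    return [mapping.get(value, unknown_value) for value in values]


mapping = fit_ordinal_mapping(["no", "yes", "no"])
print(mapping)
print(encode(["yes", "maybe"], mapping))
```

In the cell above the encoder is fitted on the full feature table before the split, so the `-1` fallback would only come into play for categories first seen later, for example at inference time.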


In [81]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)


In [82]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)
print(classification_report(y_val, y_pred))


              precision    recall  f1-score   support

           0       0.91      0.91      0.91       128
           1       0.65      0.62      0.63        32

    accuracy                           0.86       160
   macro avg       0.78      0.77      0.77       160
weighted avg       0.85      0.86      0.86       160



In [86]:
print(f"Total samples: {len(df_combined)}")
print(df_combined['class_asd'].value_counts())


Total samples: 800
class_asd
0    639
1    161
Name: count, dtype: int64
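
The counts above show roughly four negatives for every positive. One common way to account for this is to weight classes inversely to their frequency using `n_samples / (n_classes * count)`, which is the formula behind scikit-learn's `class_weight="balanced"` option. The sketch below only computes the weights; whether they improve this model is untested here, and `balanced_class_weights` is a hypothetical helper name.

```python
from typing import Dict


def balanced_class_weights(class_counts: Dict[int, int]) -> Dict[int, float]:
    """Compute n_samples / (n_classes * count) for each class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts taken from the value_counts output above.
weights = balanced_class_weights({0: 639, 1: 161})
print({label: round(weight, 3) for label, weight in weights.items()})
```

To try this in the notebook, the classifier could be constructed as `RandomForestClassifier(random_state=42, class_weight="balanced")` and the classification report compared with the one above.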


## Conclusive Reasoning
* We chose a _random forest model_ because it suffices for a simple binary classification task. The focus of the project, however, is not model evaluation but Machine Learning Operations: we want to concentrate on using the **AWS services** and on why we would or would not use them. To truly fulfill the course requirements for this project, we must direct more attention towards storing and comparing different models and different datasets.  

* Because we have only one dataset (800 rows), the training is the same whether we use Athena, Feature Store, or other AWS tools. We do not experience the real effectiveness of the AWS ML tools.

## Cleanup Resources

In [41]:
# restore original boto3 version (IPython expands {original_boto3_version} inside the magic)
%pip install "boto3=={original_boto3_version}"

Note: you may need to restart the kernel to use updated packages.


