# Amazon SageMaker FeatureStore - Basic Example with Vectors

Kernel `Python 3 (Data Science)` works well with this notebook.

In this notebook, you will learn how to create a feature group for a dummy dataset in the SageMaker Feature Store. You will then learn how to ingest the feature columns into that feature group (both the Online and the Offline store) using the SageMaker Python SDK. You will also see how to get an ingested feature record from the Online store and how to run a SQL query against the Offline store using Amazon Athena.
Finally, you will test error scenarios such as a wrong feature type and null values.

## Setup SageMaker FeatureStore

Let's start by setting up the SageMaker Python SDK and boto client.

In [None]:
import boto3
import json
import sagemaker
from sagemaker.session import Session

region = boto3.Session().region_name
s3_client = boto3.client('s3', region_name=region)

boto_session = boto3.Session(region_name=region)

sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

#### S3 Bucket Setup For The OfflineStore

SageMaker FeatureStore writes the data in the OfflineStore of a FeatureGroup to an S3 bucket owned by you. To be able to write to your S3 bucket, SageMaker FeatureStore assumes an IAM role which has access to it. The role is also owned by you.
Note that the same bucket can be re-used across FeatureGroups. Data in the bucket is partitioned by FeatureGroup.

Set the default S3 bucket name. It is referenced throughout the notebook.

In [None]:
# You can modify the following to use a bucket of your choosing
default_s3_bucket_name = feature_store_session.default_bucket()
prefix = 'sagemaker-basic-featurestore-vectors-demo'

print(default_s3_bucket_name)

Set up the IAM role. This role gives SageMaker FeatureStore access to your S3 bucket. 

<div class="alert alert-block alert-warning">
<b>Note:</b> In this example we use the default SageMaker role, assuming it has both <b>AmazonSageMakerFullAccess</b> and <b>AmazonSageMakerFeatureStoreAccess</b> managed policies. If not, please make sure to attach them to the role before proceeding.
</div>

In [None]:
from sagemaker import get_execution_role

# You can modify the following to use a role of your choosing. See the documentation for how to create this.
role = get_execution_role()
print(role)

## Inspect Dataset

We create three random embeddings, each a vector of 128 floats, and then build a dummy dataset that contains them.

In [None]:
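import numpy as np  # needed here because this cell runs before the imports below

# three random vectors with 128 float values each
embeddings = np.random.rand(3, 128).tolist()
embeddings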

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import io

d = {
    'TransactionID': [1, 2, 3], 
    'TransactionAmt': [1000, 2000, 3000], 
    'card': ["mastercard","visa","diners"],
    'embeddings': embeddings
}
my_sample_data = pd.DataFrame(data=d)
my_sample_data

## Ingest Data into FeatureStore

In this step we will create the FeatureGroup that represents the sample transaction table.

#### Define FeatureGroups

In [None]:
from time import gmtime, strftime, sleep

my_sample_feature_group_name = 'my-sample-vectors-feature-group-' + strftime('%d-%H-%M-%S', gmtime())

In [None]:
from sagemaker.feature_store.feature_group import FeatureGroup

my_sample_feature_group = FeatureGroup(name=my_sample_feature_group_name, sagemaker_session=feature_store_session)

### Convert Object to String

In [None]:
my_sample_data.info()

In [None]:
import time

current_time_sec = int(round(time.time()))

def cast_object_to_string(data_frame):
    for label in data_frame.columns:
        if data_frame.dtypes[label] == 'object':
            data_frame[label] = data_frame[label].astype("str").astype("string")

# cast object dtype to string. The SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.
cast_object_to_string(my_sample_data)

### Add the required `EventTime` feature

In [None]:
# record identifier and event time feature names
record_identifier_feature_name = "TransactionID"
event_time_feature_name = "EventTime"

# append EventTime feature
my_sample_data[event_time_feature_name] = pd.Series([current_time_sec]*len(my_sample_data), dtype="float64")

# load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.
my_sample_feature_group.load_feature_definitions(data_frame=my_sample_data); # output is suppressed
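
The cell above stores `EventTime` as a fractional Unix timestamp in seconds. Feature Store also accepts an ISO-8601 string in UTC for the event time feature. The sketch below shows one way to produce both forms with the standard library; the helper names are hypothetical and not part of the SageMaker SDK.

```python
from datetime import datetime, timezone


def event_time_as_unix_seconds(moment: datetime) -> float:
    """Return the event time as fractional Unix seconds (Fractional feature type)."""
    return moment.timestamp()


def event_time_as_iso8601(moment: datetime) -> str:
    """Return the event time as an ISO-8601 UTC string (String feature type)."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A fixed, timezone-aware example moment (an assumption for illustration only).
moment = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)

unix_seconds = event_time_as_unix_seconds(moment)
iso_string = event_time_as_iso8601(moment)

print(unix_seconds)
print(iso_string)
```

Whichever form is chosen, the feature type of the event time column has to match it, so it is simplest to pick one form per feature group.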

In [None]:
my_sample_data

In [None]:
my_sample_data.info()

#### Create FeatureGroups in SageMaker FeatureStore

In [None]:
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    print(f"FeatureGroup {feature_group.name} successfully created.")

my_sample_feature_group.create(
    s3_uri=f"s3://{default_s3_bucket_name}/{prefix}",
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=True
)

wait_for_feature_group_creation_complete(feature_group=my_sample_feature_group)

Confirm the FeatureGroup has been created by calling the DescribeFeatureGroup API.

In [None]:
feature_group_describe_response = my_sample_feature_group.describe()
feature_group_describe_response

#### PutRecords into FeatureGroup

After the FeatureGroup has been created, we can put data into it by using the PutRecord API, which the SDK's `ingest` method calls for each row of the DataFrame. This API can handle high TPS and is designed to be called by different streams. The data from all of these Put requests is buffered and written to S3 in chunks. The files will be written to the offline store within a few minutes of ingestion.

In [None]:
my_sample_feature_group.ingest(
    data_frame=my_sample_data, max_workers=1, wait=True
)
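
For a single row, the request that reaches PutRecord is a list of feature name and string value pairs. The following is a minimal sketch of that shape, assuming the row is available as a plain dictionary; `row_to_put_record_payload` is a hypothetical helper, not an SDK function, and the sample values are made up.

```python
def row_to_put_record_payload(row: dict) -> list:
    """Convert one row (feature name -> value) into a PutRecord-style Record list."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None  # features without a value are simply left out
    ]


# Illustrative row shaped like the sample dataset (embeddings omitted for brevity).
sample_row = {
    "TransactionID": 1,
    "TransactionAmt": 1000,
    "card": "mastercard",
    "EventTime": 1700000000.0,
}

payload = row_to_put_record_payload(sample_row)
print(payload)

# With the runtime client from the setup cell, the payload could be sent as:
# featurestore_runtime.put_record(
#     FeatureGroupName=my_sample_feature_group_name, Record=payload
# )
```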

To confirm that data has been ingested, we can quickly retrieve a record from the online store:

In [None]:
record_identifier_value = str(1)

featurestore_runtime.get_record(FeatureGroupName=my_sample_feature_group_name, RecordIdentifierValueAsString=record_identifier_value)

#### Create Pandas DataFrame from FeatureStore response

In [None]:
record_identifier_value = str(3)

record = featurestore_runtime.get_record(FeatureGroupName=my_sample_feature_group_name, RecordIdentifierValueAsString=record_identifier_value)["Record"]
record

In [None]:
def map_feature_name_value(record):
    result_dict = {}
    for feature in record:
        result_dict[feature["FeatureName"]] = [feature["ValueAsString"]]
    return result_dict

In [None]:
record_as_dict = map_feature_name_value(record)
record_as_dict

In [None]:
pd.DataFrame(data=record_as_dict)
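
Every value in a GetRecord response comes back as a string, including the embedding vector that was cast to a string before ingestion. The sketch below shows one way to restore typed values, assuming the vector was stored as the string form of a list of floats (which is also valid JSON). The sample record and the `decode_record` helper are hypothetical.

```python
import json


def decode_record(record, feature_types):
    """Convert a list of FeatureName/ValueAsString pairs into typed Python values."""
    converters = {"Integral": int, "Fractional": float, "String": str}
    decoded = {}
    for feature in record:
        name = feature["FeatureName"]
        # Features without a known type are kept as strings.
        convert = converters[feature_types.get(name, "String")]
        decoded[name] = convert(feature["ValueAsString"])
    return decoded


# Shortened stand-in for the "Record" list returned by get_record.
sample_record = [
    {"FeatureName": "TransactionID", "ValueAsString": "3"},
    {"FeatureName": "TransactionAmt", "ValueAsString": "3000"},
    {"FeatureName": "card", "ValueAsString": "diners"},
    {"FeatureName": "embeddings", "ValueAsString": "[0.25, 0.5, 0.75]"},
    {"FeatureName": "EventTime", "ValueAsString": "1700000000.0"},
]
feature_types = {
    "TransactionID": "Integral",
    "TransactionAmt": "Integral",
    "card": "String",
    "embeddings": "String",
    "EventTime": "Fractional",
}

decoded = decode_record(sample_record, feature_types)
# The vector was stored as text, so parse it back into a list of floats.
decoded["embeddings"] = json.loads(decoded["embeddings"])
print(decoded)
```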

## Batch fetch multiple records from the Online Feature Store

Fetch a list of selected records from the feature group. Up to 100 records can be fetched from an online Feature Store in a single batch operation.

In [None]:
identifiers = [
    {
        'FeatureGroupName': my_sample_feature_group_name,
        'RecordIdentifiersValueAsString': ["1", "2","3"]
    }
]
        
batch_get_record_response = featurestore_runtime.batch_get_record(Identifiers=identifiers)
batch_get_record_response['Records']
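
When more than 100 identifiers are needed, the list has to be split across several batch calls. Here is a minimal sketch of that splitting; `chunked` and `build_batch_requests` are hypothetical helpers, and the feature group name is a placeholder.

```python
def chunked(values, size=100):
    """Yield consecutive slices of `values` with at most `size` items each."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def build_batch_requests(feature_group_name, record_ids, batch_size=100):
    """Build one `Identifiers` argument per batch of record identifiers."""
    return [
        [
            {
                "FeatureGroupName": feature_group_name,
                "RecordIdentifiersValueAsString": chunk,
            }
        ]
        for chunk in chunked(record_ids, batch_size)
    ]


record_ids = [str(i) for i in range(1, 251)]  # 250 illustrative identifiers
batch_requests = build_batch_requests("my-sample-vectors-feature-group", record_ids)
batch_sizes = [len(r[0]["RecordIdentifiersValueAsString"]) for r in batch_requests]
print(batch_sizes)

# Each element could then be passed to the runtime client, for example:
# for identifiers in batch_requests:
#     featurestore_runtime.batch_get_record(Identifiers=identifiers)
```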

## Update a record in the Feature Store

We will change the `TransactionAmt` of the first record from 1000 to 1111.

In [None]:
my_sample_data.replace(1000, 1111, inplace=True)
my_sample_data

Ingest the records again:

In [None]:
my_sample_feature_group.ingest(
    data_frame=my_sample_data, max_workers=1, wait=True
)

To confirm that the record has been updated, we can retrieve it from the online store:

In [None]:
record_identifier_value = str(1)

featurestore_runtime.get_record(FeatureGroupName=my_sample_feature_group_name, RecordIdentifierValueAsString=record_identifier_value)

The SageMaker Python SDK's FeatureGroup class can also generate Hive DDL commands. The table schema is generated from the feature definitions: columns are named after the features, and data types are inferred from the feature types.

In [None]:
print(my_sample_feature_group.as_hive_ddl())

Now let's wait for the data to appear in our offline store before building a dataset from it. This will take approximately 5 minutes.

In [None]:
s3_offline_location=feature_group_describe_response['OfflineStoreConfig']['S3StorageConfig']['ResolvedOutputS3Uri']
print(s3_offline_location)

my_sample_feature_group_s3_prefix = '/'.join(s3_offline_location.split("/")[3:])
print(my_sample_feature_group_s3_prefix)

offline_store_contents = None
while (offline_store_contents is None):
    objects_in_bucket = s3_client.list_objects(Bucket=default_s3_bucket_name,Prefix=my_sample_feature_group_s3_prefix)
    if ('Contents' in objects_in_bucket and len(objects_in_bucket['Contents']) > 1):
        offline_store_contents = objects_in_bucket['Contents']
    else:
        print('Waiting for data in offline store...\n')
        sleep(60)
    
print('Data available.')
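
The loop above keeps polling until objects show up, so it never ends if nothing is ever written. A bounded variant is sketched below, assuming a caller-supplied check function; `wait_until` is a hypothetical helper, and the fake check stands in for the S3 listing so the example runs without AWS access.

```python
import time


def wait_until(predicate, timeout_seconds=900, poll_seconds=60,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `predicate` until it returns True or the timeout elapses.

    Returns True if the predicate succeeded and False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


# Stand-in for "are there objects under the offline store prefix yet?"
attempts = {"count": 0}


def fake_offline_store_has_data():
    attempts["count"] += 1
    return attempts["count"] >= 3  # pretend data appears on the third check


found = wait_until(fake_offline_store_has_data, timeout_seconds=5, poll_seconds=0.01)
print(found, attempts["count"])
```

In the notebook, the predicate would wrap the `list_objects` call and return whether `Contents` is present in the response.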

## Build Training Dataset

SageMaker FeatureStore automatically builds the Glue Data Catalog for FeatureGroups (you can optionally turn it on/off while creating the FeatureGroup). In this example, we create a dataset with the feature values from our sample FeatureGroup by utilizing the auto-built Catalog. We run an Athena query that selects the data stored in the offline store in S3 for this FeatureGroup.

In [None]:
my_sample_query = my_sample_feature_group.athena_query()

my_sample_table = my_sample_query.table_name

query_string = 'SELECT * FROM "'+my_sample_table+'"'
print('Running ' + query_string)

# run Athena query. The output is loaded to a Pandas dataframe.
#dataset = pd.DataFrame()
my_sample_query.run(query_string=query_string, output_location='s3://'+default_s3_bucket_name+'/'+prefix+'/query_results/')
my_sample_query.wait()
dataset = my_sample_query.as_dataframe()

dataset
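
Because the records were ingested twice, `SELECT *` can return more than one row per `TransactionID`. One common way to keep only the latest version of each record is a window function over the metadata columns that the offline store adds (`api_invocation_time`, `write_time`, `is_deleted`). The sketch below only builds the query string; `latest_records_query` is a hypothetical helper, the table name is a placeholder, and the column names are lowercased on the assumption that they appear that way in the Athena table.

```python
def latest_records_query(table_name, record_id_column, event_time_column):
    """Build an Athena query that keeps the newest row per record identifier."""
    record_id = record_id_column.lower()
    event_time = event_time_column.lower()
    return (
        "SELECT * FROM ("
        f' SELECT *, ROW_NUMBER() OVER (PARTITION BY "{record_id}"'
        f' ORDER BY "{event_time}" DESC, api_invocation_time DESC, write_time DESC)'
        " AS row_rank"
        f' FROM "{table_name}"'
        ") WHERE row_rank = 1 AND NOT is_deleted"
    )


# Placeholder table name; in the notebook it would come from my_sample_query.table_name.
dedup_query = latest_records_query("my-sample-table", "TransactionID", "EventTime")
print(dedup_query)
```

The resulting string could be passed to `my_sample_query.run` in place of the plain `SELECT *` query.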

## Ingest erroneous data 

### Wrong Feature Type

In [None]:
d = {'TransactionID': [1, 2, 3], 'TransactionAmt': ["one", "two", "three"], 'card': ["mastercard","visa","diners"]}
my_erroneous_sample_data = pd.DataFrame(data=d)
my_erroneous_sample_data

In [None]:
cast_object_to_string(my_erroneous_sample_data)
cast_object_to_string(my_sample_data)
my_erroneous_sample_data

#### PutRecords into FeatureGroup

We will now put the erroneous data into the FeatureGroup by using the PutRecord API.

In [None]:
my_sample_feature_group.ingest(
    data_frame=my_erroneous_sample_data, max_workers=1, wait=True
)

### Null Values

In [None]:
my_erroneous_sample_data['TransactionAmt']=None
my_erroneous_sample_data

In [None]:
my_sample_feature_group.ingest(
    data_frame=my_erroneous_sample_data, max_workers=1, wait=True
)
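
Problems like the ones above can also be spotted on the client side before calling `ingest`. The following is a minimal sketch of such a check, assuming the feature types are known as a plain dictionary; `find_invalid_rows` is a hypothetical helper and does not reproduce the service's own validation.

```python
def find_invalid_rows(rows, feature_types, required=()):
    """Return (row index, feature name, reason) for values that look invalid."""
    parsers = {"Integral": int, "Fractional": float, "String": str}
    problems = []
    for index, row in enumerate(rows):
        for name, feature_type in feature_types.items():
            value = row.get(name)
            if value is None:
                if name in required:
                    problems.append((index, name, "missing required value"))
                continue
            try:
                # Parse the string form, since that is what gets sent to the store.
                parsers[feature_type](str(value))
            except ValueError:
                problems.append((index, name, f"not parseable as {feature_type}"))
    return problems


# Rows shaped like the erroneous examples above (no EventTime column).
rows = [
    {"TransactionID": 1, "TransactionAmt": "one", "card": "mastercard"},
    {"TransactionID": 2, "TransactionAmt": None, "card": "visa"},
]
feature_types = {
    "TransactionID": "Integral",
    "TransactionAmt": "Integral",
    "card": "String",
    "EventTime": "Fractional",
}

problems = find_invalid_rows(rows, feature_types, required=("TransactionID", "EventTime"))
for problem in problems:
    print(problem)
```

The check also flags that the erroneous DataFrame has no `EventTime` column, which is easy to overlook when building test data by hand.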

## Cleanup Resources

In [None]:
my_sample_feature_group.delete()