# Feature Store Ingestion Using SageMaker Feature Store
Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference. For example, in an application that recommends a music playlist, features could include song ratings, listening duration, and listener demographics. 

In this notebook, we will create feature groups and use the SageMaker Python SDK to ingest features into Feature Store.

This notebook is a prerequisite for the SageMaker inference pipeline notebook in the same folder. We create some feature groups and ingest user-item interaction data into them so that we can run real-time inference against the serial inference pipeline hosted in SageMaker.

The dataset used in this notebook is extracted from [MovieLens](https://movielens.org/), an open-source movie dataset commonly used for training and evaluating movie recommendation systems.

First, let's import the required dependencies.

In [None]:
import pandas as pd
import sagemaker
from time import strftime, gmtime
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.inputs import TableFormatEnum
import logging
from datetime import datetime, timezone, date
import time

In [None]:
# Use the module's name (the __name__ variable), not the literal string '__name__'
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

For the lab, we'll ingest the data in the `data` folder into the feature store.

In [None]:
%%sh
cd data 
tar -xzvf data.tar.gz

Let's load the dataset and examine its columns.

In [None]:
df = pd.read_csv("data/final_title_embeddings.csv")

In [None]:
df.head()

Feature Store relies on an event timestamp and a record identifier to define each individual record. The following utility function creates the timestamp for the dataset.

In [None]:
def generate_event_timestamp():
    # naive datetime representing local time
    naive_dt = datetime.now()
    # take timezone into account
    aware_dt = naive_dt.astimezone()
    # time in UTC
    utc_dt = aware_dt.astimezone(timezone.utc)
    # transform to ISO-8601 format
    event_time = utc_dt.isoformat(timespec='milliseconds')
    event_time = event_time.replace('+00:00', 'Z')
    return event_time
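
As a quick sanity check, the sketch below builds the same kind of timestamp directly from a UTC-aware `datetime` and verifies its shape with a regular expression. The helper name `utc_event_timestamp`, its `now` parameter, and the pattern are illustrative assumptions, not part of the notebook or the SageMaker SDK. The target shape is the ISO-8601 form `yyyy-MM-dd'T'HH:mm:ss.SSSZ`, which is what `generate_event_timestamp` produces.

```python
import re
from datetime import datetime, timezone

# Illustrative pattern for timestamps such as 2024-01-31T12:34:56.789Z
ISO_8601_MILLIS_UTC = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$")


def utc_event_timestamp(now=None):
    """Return an ISO-8601 UTC timestamp with millisecond precision."""
    # Hypothetical helper: `now` can be injected so the result is testable.
    current = now if now is not None else datetime.now(timezone.utc)
    # Normalize any timezone-aware datetime to UTC before formatting.
    current = current.astimezone(timezone.utc)
    return current.isoformat(timespec="milliseconds").replace("+00:00", "Z")


fixed = datetime(2024, 1, 31, 12, 34, 56, 789000, tzinfo=timezone.utc)
sample = utc_event_timestamp(fixed)
print(sample)  # 2024-01-31T12:34:56.789Z
```

Injecting a fixed `datetime` keeps the check deterministic; calling `utc_event_timestamp()` with no argument uses the current time.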

Create a SageMaker session so that we can use the SageMaker SDK for Feature Store operations.

In [None]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

In [None]:
current_timestamp = strftime('%m-%d-%H-%M', gmtime())

In [None]:
fs_prefix = 'movielens' 

In [None]:
titles_feature_group_name = f'{fs_prefix}-titles-{current_timestamp}'
titles_embeddings_mappings_feature_group_name = f'{fs_prefix}-titles-embeddings-mapping-{current_timestamp}'

Create the following feature groups:

1. Titles feature group - Used to store movie ids and their title embeddings.
2. Titles embedding mappings feature group - Maps the newly assigned index of each movie record to its original movie id. This is used for reverse lookup of the recommended movie based on a given index.
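
To make the reverse lookup concrete, here is a minimal sketch in plain Python. The movie ids are made up for illustration, and the dictionary only stands in for the mapping feature group; in this notebook the index comes from `reset_index()` on the DataFrame.

```python
# Hypothetical movie ids in the order they appear in the dataset.
movie_ids = [101, 205, 309]

# Assign each record a positional index, as `reset_index()` does for a
# DataFrame that has the default integer index.
index_to_movie_id = {index: movie_id for index, movie_id in enumerate(movie_ids)}


def lookup_movie_id(index):
    """Map a positional index back to the original movie id."""
    return index_to_movie_id[index]


recommended_index = 1
print(lookup_movie_id(recommended_index))  # 205
```

A model that returns positions in the embedding matrix can use this kind of mapping to recover the movie id that each position refers to.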

In [None]:
titles_feature_group = FeatureGroup(name=titles_feature_group_name, sagemaker_session=sagemaker_session)
titles_embeddings_mappings_feature_group = FeatureGroup(name=titles_embeddings_mappings_feature_group_name, sagemaker_session=sagemaker_session)

In [None]:
event_time = generate_event_timestamp()

In [None]:
df['event_time'] = event_time

In [None]:
# Take an explicit copy so that later column assignments do not touch (or warn about) `df`
fs_titles_df = df.loc[:, ['movieId', 'embeddings', 'event_time']].copy()
fs_title_embedding_mapping_df = df.reset_index().loc[:, ['index', 'movieId', 'event_time']]

In [None]:
fs_titles_df['embeddings'] = df['embeddings'].astype(str)
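
Because the embeddings are stored as a string feature, a consumer has to parse the string back into numbers after reading a record. The sketch below shows one possible round trip using JSON. It assumes the vector is serialized as a JSON-style list, which may not match the exact text format in `final_title_embeddings.csv`; the values are made up.

```python
import json

# Hypothetical embedding vector; real vectors come from the dataset.
embedding = [0.12, -0.5, 0.33]

# Serialize to a string so it can be stored as a string feature.
serialized = json.dumps(embedding)

# Parse the string back into a list of floats on the consumer side.
restored = json.loads(serialized)

print(serialized)
print(restored == embedding)
```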

We will use the column definitions from the pandas dataframe to define the feature groups. 

In [None]:
titles_feature_group.load_feature_definitions(data_frame=fs_titles_df)
titles_embeddings_mappings_feature_group.load_feature_definitions(data_frame=fs_title_embedding_mapping_df)

Define the table format. `TableFormatEnum.GLUE` selects the AWS Glue table format, which applies to the offline store.

In [None]:
table_format = TableFormatEnum.GLUE

In [None]:
def wait_for_feature_group_creation_complete(feature_group):
    # Poll the feature group until it leaves the 'Creating' state.
    status = feature_group.describe().get('FeatureGroupStatus')
    logger.info(f'Initial status: {status}')
    while status == 'Creating':
        logger.info(f'Waiting for feature group: {feature_group.name} to be created ...')
        time.sleep(5)
        status = feature_group.describe().get('FeatureGroupStatus')
    # Any final status other than 'Created' (for example 'CreateFailed') is an error.
    if status != 'Created':
        raise RuntimeError(f'Failed to create feature group {feature_group.name}: {status}')
    logger.info(f'FeatureGroup {feature_group.name} was successfully created.')
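
The polling loop above waits indefinitely while the status stays `Creating`. As a hedged variation, the sketch below adds a timeout. The function name `wait_for_status` and its parameters are assumptions made for illustration, and `fake_describe` is a stand-in for `feature_group.describe` so the example runs without AWS access.

```python
import time


def wait_for_status(describe, pending="Creating", success="Created",
                    timeout_seconds=300, poll_seconds=5, sleep=time.sleep):
    """Poll `describe()` until the status leaves `pending` or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    status = describe().get("FeatureGroupStatus")
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {pending} after {timeout_seconds} seconds")
        sleep(poll_seconds)
        status = describe().get("FeatureGroupStatus")
    if status != success:
        raise RuntimeError(f"Unexpected status: {status}")
    return status


# Stand-in for `feature_group.describe`: 'Creating' twice, then 'Created'.
responses = iter(["Creating", "Creating", "Created"])


def fake_describe():
    return {"FeatureGroupStatus": next(responses)}


final_status = wait_for_status(fake_describe, poll_seconds=0)
print(final_status)  # Created
```

With a real feature group, the call would look like `wait_for_status(titles_feature_group.describe)`.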

Create the feature groups using the SDK. Passing `s3_uri=False` disables the offline store, so only the online store is enabled for each group.

In [None]:
titles_feature_group.create(
    s3_uri=False,
    record_identifier_name='movieId',
    event_time_feature_name='event_time',
    role_arn=role,
    enable_online_store=True,
    table_format=table_format,
)

titles_embeddings_mappings_feature_group.create(
    s3_uri=False,
    record_identifier_name='index',
    event_time_feature_name='event_time',
    role_arn=role,
    enable_online_store=True,
    table_format=table_format,
)

Wait for the feature group creation to complete.

In [None]:
wait_for_feature_group_creation_complete(titles_feature_group)
wait_for_feature_group_creation_complete(titles_embeddings_mappings_feature_group)

After the feature groups are created successfully, we'll start ingesting the movie and reference mapping datasets into the corresponding feature groups.

In [None]:
titles_feature_group.ingest(data_frame=fs_titles_df, max_processes=16, wait=True)

In [None]:
titles_embeddings_mappings_feature_group.ingest(data_frame=fs_title_embedding_mapping_df, max_processes=16, wait=True)

## Validate a record from the feature store
In the following section, we run a simple integration test against the feature groups that we just created.
This checks that the feature groups were created successfully and are functioning as expected. We'll use these feature groups in the serial inference pipeline to provide real-time recommendations for users.

In [None]:
import boto3

In [None]:
featurestore_runtime_client = boto3.client('sagemaker-featurestore-runtime')

In [None]:
test_movieId = fs_titles_df.sample(n=1)['movieId'].values[0]

In [None]:
test_movieId

Run a query against the mappings feature group using a record index (here, the hard-coded index `1076`).

In [None]:
title_embedding_feature_record = featurestore_runtime_client.get_record(FeatureGroupName=titles_embeddings_mappings_feature_group_name, 
                                                        RecordIdentifierValueAsString=str(1076))

In [None]:
title_embedding_feature_record
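
The `get_record` response holds the features as a list of `FeatureName` / `ValueAsString` pairs under the `Record` key. The sketch below flattens that list into a dictionary for easier checks. The helper `record_to_dict` is hypothetical, and the sample response contains made-up values rather than real query output.

```python
def record_to_dict(response):
    """Flatten a `get_record` response into a {feature_name: value} dict."""
    # `Record` is absent when no record matches the identifier.
    return {
        feature["FeatureName"]: feature["ValueAsString"]
        for feature in response.get("Record", [])
    }


# Illustrative response; the values are made up.
sample_response = {
    "Record": [
        {"FeatureName": "index", "ValueAsString": "1076"},
        {"FeatureName": "movieId", "ValueAsString": "101"},
        {"FeatureName": "event_time", "ValueAsString": "2024-01-31T12:34:56.789Z"},
    ]
}

features = record_to_dict(sample_response)
print(features["movieId"])  # 101
print(record_to_dict({}))   # {}
```

In this notebook, the same helper could be applied as `record_to_dict(title_embedding_feature_record)`; an empty dictionary would indicate that no record exists for the given index.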

Store the feature group names for `sagemaker_inference_pipeline.ipynb`

In [None]:
%store titles_feature_group_name titles_embeddings_mappings_feature_group_name

## Conclusion
In this notebook, we created two feature groups from the MovieLens dataset using SageMaker Feature Store.
We also tested the feature groups to confirm successful ingestion and retrieval. These feature groups are used in the next step to provide real-time movie recommendations.