# Module 1: Introduction to SageMaker Feature Store

**Note:** Please set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`.

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Load and explore datasets](#Load-and-explore-datasets)
1. [Create feature definitions and groups](#Create-feature-definitions-and-groups)
1. [Ingest data into feature groups](#Ingest-data-into-feature-groups)
1. [Get feature record from the Online feature store](#Get-feature-record-from-the-Online-feature-store)
1. [List feature groups](#List-feature-groups)

# Background

In this notebook, you will learn how to create **3** feature groups for the `customers`, `products` and `orders` datasets in the SageMaker Feature Store. You will then learn how to ingest the feature columns into the created feature groups (both the Online and the Offline store) using the SageMaker Python SDK. You will also see how to get an ingested feature record from the Online store. Finally, you will learn how to list all the feature groups created within the Feature Store.

**Note:** The feature groups created in this notebook will be used in the upcoming modules.


# Setup

#### Imports

In [None]:
from sagemaker.feature_store.feature_group import FeatureGroup
from time import gmtime, strftime, sleep
from random import randint
import pandas as pd
import numpy as np
import subprocess
import sagemaker
import importlib
import logging
import time
import sys
from sagemaker.feature_store.inputs import TableFormatEnum

In [None]:
sm_version = sagemaker.__version__
# Compare (major, minor) as a tuple so that, e.g., 3.0 is not treated as older than 2.125
major, minor = (int(part) for part in sm_version.split('.')[:2])
if (major, minor) < (2, 125):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker==2.125.0'])
    importlib.reload(sagemaker)
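
The version check above depends on how the `major` and `minor` components are compared. As a minimal sketch (the helper name `needs_upgrade` is hypothetical and not part of the SageMaker SDK), comparing the two components as a tuple keeps a newer major release from being flagged as too old and ignores any extra suffix after the minor version:

```python
def needs_upgrade(version: str, minimum: tuple = (2, 125)) -> bool:
    """Return True when the (major, minor) part of `version` is below `minimum`."""
    # Only the first two components are parsed, so a suffix such as '.dev0' is ignored.
    major, minor = (int(part) for part in version.split('.')[:2])
    return (major, minor) < minimum


for candidate in ('2.124.3', '2.125.0', '3.0.0'):
    print(candidate, needs_upgrade(candidate))
```

Tuple comparison is lexicographic, so the minor version only matters when the major versions are equal.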

In [None]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

In [None]:
logger.info(f'Using SageMaker version: {sagemaker.__version__}')
logger.info(f'Using Pandas version: {pd.__version__}')

#### Essentials

In [None]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
logger.info(f'Default S3 bucket = {default_bucket}')
prefix = 'sagemaker-feature-store'

In [None]:
region = sagemaker_session.boto_region_name

# Load and explore datasets

In [None]:
customers_df = pd.read_csv('.././data/transformed/customers.csv')
customers_df.head(5)

In [None]:
customers_df.dtypes

In [None]:
customers_df['customer_id'] = customers_df['customer_id'].astype('string')
customers_df['event_time'] = customers_df['event_time'].astype('string')

In [None]:
customers_df.dtypes
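
The `event_time` column is cast to a string because Feature Store accepts event times either as fractional Unix seconds or as ISO-8601 strings in the form `yyyy-MM-dd'T'HH:mm:ssZ` or `yyyy-MM-dd'T'HH:mm:ss.SSSZ`. The sketch below assumes the string form is used and shows one way to produce such a value; the helper name `to_event_time` is hypothetical, and the actual format of the values in the CSV files is not shown in this notebook.

```python
from datetime import datetime, timezone


def to_event_time(moment: datetime) -> str:
    """Format a timezone-aware datetime as a UTC ISO-8601 string with second precision."""
    return moment.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')


# Made-up timestamp, used only for illustration
example = to_event_time(datetime(2021, 4, 13, 12, 30, 0, tzinfo=timezone.utc))
print(example)
```

Pass a timezone-aware `datetime`; a naive one would be interpreted as local time by `astimezone`.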

In [None]:
products_df = pd.read_csv('.././data/transformed/products.csv')
products_df.head(5)

In [None]:
products_df['product_id'] = products_df['product_id'].astype('string')
products_df['event_time'] = products_df['event_time'].astype('string')

In [None]:
products_df.dtypes

In [None]:
orders_df = pd.read_csv('.././data/transformed/orders.csv')
orders_df

In [None]:
orders_df['order_id'] = orders_df['order_id'].astype('string')
orders_df['customer_id'] = orders_df['customer_id'].astype('string')
orders_df['product_id'] = orders_df['product_id'].astype('string')
orders_df['event_time'] = orders_df['event_time'].astype('string')

In [None]:
orders_df.dtypes

In [None]:
customers_count = customers_df.shape[0]
%store customers_count
products_count = products_df.shape[0]
%store products_count
orders_count = orders_df.shape[0]
%store orders_count

# Create feature definitions and groups

In [None]:
current_timestamp = strftime('%m-%d-%H-%M', gmtime())

In [None]:
# prefix to track all the feature groups created as part of feature store champions workshop (fscw)
fs_prefix = 'fscw-' 

In [None]:
customers_feature_group_name = f'{fs_prefix}customers-{current_timestamp}'
%store customers_feature_group_name
products_feature_group_name = f'{fs_prefix}products-{current_timestamp}'
%store products_feature_group_name
orders_feature_group_name = f'{fs_prefix}orders-{current_timestamp}'
%store orders_feature_group_name

In [None]:
logger.info(f'Customers feature group name = {customers_feature_group_name}')
logger.info(f'Products feature group name = {products_feature_group_name}')
logger.info(f'Orders feature group name = {orders_feature_group_name}')

In [None]:
customers_feature_group = FeatureGroup(name=customers_feature_group_name, sagemaker_session=sagemaker_session)
products_feature_group = FeatureGroup(name=products_feature_group_name, sagemaker_session=sagemaker_session)
orders_feature_group = FeatureGroup(name=orders_feature_group_name, sagemaker_session=sagemaker_session)

In [None]:
customers_feature_group.load_feature_definitions(data_frame=customers_df)

In [None]:
products_feature_group.load_feature_definitions(data_frame=products_df)

In [None]:
orders_feature_group.load_feature_definitions(data_frame=orders_df)
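
`load_feature_definitions` infers each feature's type from the DataFrame column dtype, which is why the identifier and `event_time` columns were cast to `string` earlier. Feature Store has three feature types: `Integral`, `Fractional` and `String`. The sketch below is a simplified illustration of that kind of inference, not the SDK's actual mapping table; the function name, the column names `n_items` and `total`, and the assumption that an `object` column cannot be mapped are all hypothetical.

```python
def infer_feature_type(dtype_name: str) -> str:
    """Map a pandas dtype name to a Feature Store feature type (illustrative only)."""
    name = dtype_name.lower()
    if name.startswith(('int', 'uint')):
        return 'Integral'
    if name.startswith('float'):
        return 'Fractional'
    if name == 'string':
        return 'String'
    # Assumption: other dtypes (such as 'object') need an explicit cast first.
    raise ValueError(f'Cannot infer a feature type from dtype {dtype_name!r}')


# Hypothetical columns and dtypes
columns = {'customer_id': 'string', 'event_time': 'string', 'n_items': 'int64', 'total': 'float64'}
feature_types = {column: infer_feature_type(dtype) for column, dtype in columns.items()}
print(feature_types)
```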

Let's create the feature groups now.

Amazon SageMaker Feature Store supports the AWS Glue and Apache Iceberg table formats for the offline store. You can choose the table format when you’re creating a new feature group. 

In this notebook, we will be using the Iceberg table format. Using Apache Iceberg for storing features accelerates model development by enabling faster query performance when extracting ML training datasets, taking advantage of Iceberg table compaction. Depending on the design of your feature groups and their scale, you can experience training query performance improvements of 10x to 100x by using this new capability.

If you need to use the Glue table format, set the variable below to `'GLUE'`. For more information on offline store formats, please refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-offline.html).

In [None]:
table_format_param = 'ICEBERG' # or 'GLUE'

In [None]:
if table_format_param == 'ICEBERG':
    table_format = TableFormatEnum.ICEBERG
else:
    table_format = TableFormatEnum.GLUE

In [None]:
def wait_for_feature_group_creation_complete(feature_group):
    """Poll the feature group status every 5 seconds until creation finishes."""
    status = feature_group.describe().get('FeatureGroupStatus')
    logger.info(f'Initial status: {status}')
    while status == 'Creating':
        logger.info(f'Waiting for feature group: {feature_group.name} to be created ...')
        time.sleep(5)
        status = feature_group.describe().get('FeatureGroupStatus')
    # Any status other than 'Created' (for example 'CreateFailed') is treated as a failure
    if status != 'Created':
        raise RuntimeError(f'Failed to create feature group {feature_group.name}: {status}')
    logger.info(f'FeatureGroup {feature_group.name} was successfully created.')
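
The wait loop above polls until the status leaves `Creating`, with no upper bound on how long it waits. As a hedged sketch, the same pattern can be given a timeout. The function `wait_for_status` and its parameters are hypothetical, and a stand-in callable replaces `feature_group.describe().get('FeatureGroupStatus')` so that the example runs without AWS access.

```python
import time


def wait_for_status(get_status, target='Created', pending='Creating',
                    timeout_s=300.0, poll_s=5.0):
    """Poll get_status() until it leaves `pending`; fail on timeout or an unexpected status."""
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Still {pending!r} after {timeout_s} seconds')
        time.sleep(poll_s)
        status = get_status()
    if status != target:
        raise RuntimeError(f'Unexpected status: {status!r}')
    return status


# Stand-in for feature_group.describe().get('FeatureGroupStatus')
fake_statuses = iter(['Creating', 'Creating', 'Created'])
result = wait_for_status(lambda: next(fake_statuses), poll_s=0.0)
print(result)
```

Using `time.monotonic()` for the deadline keeps the timeout unaffected by system clock adjustments.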

In [None]:
customers_feature_group.create(s3_uri=f's3://{default_bucket}/{prefix}', 
                               record_identifier_name='customer_id', 
                               event_time_feature_name='event_time', 
                               role_arn=role, 
                               enable_online_store=True,
                               table_format=table_format 
                              )

In [None]:
wait_for_feature_group_creation_complete(customers_feature_group)

In [None]:
products_feature_group.create(s3_uri=f's3://{default_bucket}/{prefix}', 
                              record_identifier_name='product_id', 
                              event_time_feature_name='event_time', 
                              role_arn=role, 
                              enable_online_store=True,
                              table_format=table_format 
                             )

In [None]:
wait_for_feature_group_creation_complete(products_feature_group)

In [None]:
orders_feature_group.create(s3_uri=f's3://{default_bucket}/{prefix}', 
                            record_identifier_name='order_id', 
                            event_time_feature_name='event_time', 
                            role_arn=role, 
                            enable_online_store=True,
                            table_format=table_format 
                           )

In [None]:
wait_for_feature_group_creation_complete(orders_feature_group)

# Ingest data into feature groups 

In [None]:
%%time

logger.info(f'Ingesting data into feature group: {customers_feature_group.name} ...')
customers_feature_group.ingest(data_frame=customers_df, max_processes=16, wait=True)
logger.info(f'{len(customers_df)} customer records ingested into feature group: {customers_feature_group.name}')

In [None]:
%%time

logger.info(f'Ingesting data into feature group: {products_feature_group.name} ...')
products_feature_group.ingest(data_frame=products_df, max_processes=16, wait=True)
logger.info(f'{len(products_df)} product records ingested into feature group: {products_feature_group.name}')  

In [None]:
%%time

logger.info(f'Ingesting data into feature group: {orders_feature_group.name} ...')
orders_feature_group.ingest(data_frame=orders_df, max_processes=16, wait=True)
logger.info(f'{len(orders_df)} order records ingested into feature group: {orders_feature_group.name}')

# Get feature record from the Online feature store 

In [None]:
featurestore_runtime_client = sagemaker_session.boto_session.client('sagemaker-featurestore-runtime', region_name=region)

Retrieve a record from the customers feature group.

In [None]:
# Pick an ID that is known to exist in the ingested data
customer_id = customers_df['customer_id'].sample(1).iloc[0]
logger.info(f'customer_id={customer_id}')

In [None]:
feature_record = featurestore_runtime_client.get_record(FeatureGroupName=customers_feature_group_name, 
                                                        RecordIdentifierValueAsString=customer_id)
feature_record
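
The `get_record` response holds the features under a `Record` key as a list of `FeatureName` / `ValueAsString` pairs. A minimal sketch of flattening that list into a plain dictionary follows. The helper name `record_to_dict` is hypothetical, the sample response uses made-up values, and a response without a `Record` key is assumed to mean that no record was found.

```python
def record_to_dict(response: dict) -> dict:
    """Flatten a GetRecord-style response into {feature_name: value_as_string}."""
    # A missing 'Record' key is treated as "no record found" and yields an empty dict.
    return {item['FeatureName']: item['ValueAsString'] for item in response.get('Record', [])}


# Hypothetical response with made-up values
sample_response = {
    'Record': [
        {'FeatureName': 'customer_id', 'ValueAsString': 'C42'},
        {'FeatureName': 'event_time', 'ValueAsString': '2021-04-13T12:30:00Z'},
    ]
}
flat = record_to_dict(sample_response)
print(flat)
```

In the notebook, the same helper could be applied to `feature_record`. All values come back as strings, so numeric features need to be converted explicitly.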

# List feature groups 
Since all of our feature groups share a common name pattern, we list only the ones whose names contain the current month and day (e.g., `04-13`), which is the first five characters of `current_timestamp`.

In [None]:
import sys
sys.path.append('..')
from utilities.feature_store_helper import FeatureStore
fs = FeatureStore()

In [None]:
fs.list_feature_groups(current_timestamp[0:5])
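
The argument `current_timestamp[0:5]` is the `mm-dd` part of the `%m-%d-%H-%M` timestamp used in the feature group names. The sketch below shows that slicing and a plain substring filter on a list of names. How the workshop's `FeatureStore.list_feature_groups` helper actually matches names is not shown in this notebook, so `filter_by_name` is a hypothetical stand-in, and the names in the list are made up.

```python
from time import gmtime, strftime


def filter_by_name(names, fragment):
    """Keep the names that contain `fragment`."""
    return [name for name in names if fragment in name]


# A fixed time (the Unix epoch) keeps the example deterministic
timestamp = strftime('%m-%d-%H-%M', gmtime(0))
month_day = timestamp[0:5]

names = [f'fscw-customers-{timestamp}', 'fscw-orders-12-31-23-59', 'other-group']
matches = filter_by_name(names, month_day)
print(month_day, matches)
```

If matching is done by plain substring, a name whose hour and minute happen to equal the month and day fragment would also match, so a stricter filter may be preferable in a shared account.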