# Batch Ingestion
This notebook reads the raw data from an S3 bucket, transforms it for ingestion into SageMaker Feature Store, and then ingests it into an offline+online feature store. Refer to the [official SageMaker Feature Store documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html) and the [Python SDK documentation](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_featurestore.html).

We create two feature groups in this notebook:
1. An offline+online feature group for customer inputs that is used for ML model training.
2. An offline+online feature group for the destination features, which is used both for ML model training and real-time inference.

**Note:** Set the kernel to `conda_python3` for this notebook and select the `ml.t3.2xlarge` instance type as part of the user inputs to the CloudFormation template.

## Imports

In [1]:
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker import get_execution_role
from sagemaker.session import Session
from datetime import datetime
from pathlib import Path
import pandas as pd
import sagemaker
import logging
import boto3
import time
import sys
import os

In [2]:
# import from a different path
sys.path.insert(0, '../utils')
path = Path(os.path.abspath(os.getcwd()))
package_dir = f'{str(path.parent)}/utils'
print(package_dir)
import utils

/home/ec2-user/SageMaker/feature-store-expedia/utils


## Setup Logging

In [3]:
# use the module name (not the literal string '__name__') as the logger name
logger = logging.getLogger(__name__)
logging.basicConfig(format="%(asctime)s,%(filename)s,%(funcName)s,%(lineno)s,%(levelname)s,p%(process)s,%(message)s", level=logging.INFO)
logger.info(f'Using SageMaker version: {sagemaker.__version__}')
logger.info(f'Using Pandas version: {pd.__version__}')

2022-06-08 15:00:55,744,<ipython-input-3-e08248e9535e>,<module>,3,INFO,p27837,Using SageMaker version: 2.86.2
2022-06-08 15:00:55,746,<ipython-input-3-e08248e9535e>,<module>,4,INFO,p27837,Using Pandas version: 1.1.5


## Global Constants

In [4]:
# global constants
STACK_NAME = "expedia-feature-store-demo-v2"

# number of worker processes to use for batch ingesting data into feature store
MAX_WORKERS = 8

# number of principal components to keep for the destinations dataset
PC_TO_KEEP = 3

## Setup Config Variables
Read the config variables used by this notebook from the CloudFormation stack outputs and parameters.

In [5]:
# read output variables from the CloudFormation stack; these are used as parameters
# throughout the code
data_bucket_name = utils.get_cfn_stack_outputs(STACK_NAME, 'DataBucketName')
athena_query_results_bucket_name = utils.get_cfn_stack_outputs(STACK_NAME, 'AthenaQueryResultsBucketName')
feature_store_bucket_name = utils.get_cfn_stack_outputs(STACK_NAME, 'FeatureStoreBucketName')
logger.info(f"data_bucket_name={data_bucket_name},\nathena_query_results_bucket_name={athena_query_results_bucket_name},\nfeature_store_bucket_name={feature_store_bucket_name}")

2022-06-08 15:00:56,132,<ipython-input-5-f06a6a5b0cb4>,<module>,6,INFO,p27837,data_bucket_name=expedia-customer-behavior-data-195cbf60,
athena_query_results_bucket_name=athena-query-results-195cbf60,
feature_store_bucket_name=expedia-feature-store-offline-195cbf60


In [6]:
# read params from the CloudFormation stack. The stack provides a convenient
# way to supply configuration parameters for a notebook workflow without having to use
# Parameter Store or other services for providing config.
customer_inputs_fg_name = utils.get_cfn_stack_parameters(STACK_NAME, 'CustomerInputFeatureGroupName')
destinations_fg_name = utils.get_cfn_stack_parameters(STACK_NAME, 'DestinationsFeatureGroupName')
app_name = utils.get_cfn_stack_parameters(STACK_NAME, 'AppName')

always_recreate_fg = utils.get_cfn_stack_parameters(STACK_NAME, 'AlwaysRecreateFeatureGroup')
always_recreate_fg = always_recreate_fg == "true"

raw_data_dir = utils.get_cfn_stack_parameters(STACK_NAME, 'RawDataDir')
training_dataset_fname = utils.get_cfn_stack_parameters(STACK_NAME, 'TrainingDatasetFileName')
test_dataset_fname = utils.get_cfn_stack_parameters(STACK_NAME, 'TestDatasetFileName')
destination_features_fname = utils.get_cfn_stack_parameters(STACK_NAME, 'DestinationFeaturesFileName')

# If an existing feature group with the same name is not going to be deleted, then
# append a unique suffix to the feature group name to create a new unique feature group name
if not always_recreate_fg:
    dttm = datetime.now()
    suffix = f"{dttm.year}-{dttm.month}-{dttm.day}-{dttm.hour}-{dttm.minute}"
    customer_inputs_fg_name = f"{customer_inputs_fg_name}-{suffix}"
    destinations_fg_name = f"{destinations_fg_name}-{suffix}"

# log all params to help with debugging
logger.info(f"customer_inputs_fg_name={customer_inputs_fg_name},\ndestinations_fg_name={destinations_fg_name}\ndestination_features_fname={destination_features_fname}\n"
            f"always_recreate_fg={always_recreate_fg},\n"
            f"raw_data_dir={raw_data_dir},\ntraining_dataset_fname={training_dataset_fname},\n"
            f"test_dataset_fname={test_dataset_fname}, app_name={app_name}")

2022-06-08 15:00:57,177,<ipython-input-6-78641fb40f60>,<module>,25,INFO,p27837,customer_inputs_fg_name=expedia-customer-inputs-2022-6-8-15-0,
destinations_fg_name=expedia-destinations-2022-6-8-15-0
destination_features_fname=destinations.csv
always_recreate_fg=False,
raw_data_dir=raw_data,
training_dataset_fname=train.csv,
test_dataset_fname=test.csv, app_name=hotel_cluster_prediction
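
The suffix logged above (`2022-6-8-15-0`) is not zero-padded, so the generated names do not sort chronologically as plain strings. A minimal sketch of a zero-padded alternative is shown below; `padded_suffix` and `with_suffix` are hypothetical helper names and not part of this repo's `utils` module.

```python
from datetime import datetime


def padded_suffix(dttm: datetime) -> str:
    """Zero-padded year-month-day-hour-minute, e.g. 2022-06-08-15-00."""
    return dttm.strftime("%Y-%m-%d-%H-%M")


def with_suffix(base_name: str, dttm: datetime) -> str:
    """Append the padded suffix to a feature group base name (hypothetical helper)."""
    return f"{base_name}-{padded_suffix(dttm)}"


# fixed timestamp used only for illustration
print(with_suffix("expedia-customer-inputs", datetime(2022, 6, 8, 15, 0)))
```

With padding, names created on the same base sort in creation order, which makes old feature groups easier to find and clean up.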


## Read raw data from S3 bucket
The raw data lives in an S3 bucket. Note that the data must be uploaded manually to the raw data directory of the bucket (typically `raw_data`) before running this step. The data is read directly using the Pandas `read_csv` method. In another version of this code, Pandas will be replaced with PySpark.

We read two datasets here:
1. The customer inputs dataset from the train.csv file, which represents customers looking up hotels via the Expedia website.
2. The destination features dataset from destinations.csv, which holds embeddings for each destination. This will be joined with the customer inputs dataset at the time of model training.

In [7]:
# read data from the bucket in a pandas dataframe, this will be ingested in the feature store
s3a_uri = f"s3a://{data_bucket_name}/{raw_data_dir}/{training_dataset_fname}"
df = pd.read_csv(s3a_uri)
logger.info(f"shape of the dataframe read from {s3a_uri} is {df.shape}")

# drop rows with NA
df_customer_inputs = df.dropna()
logger.info(f"shape of the dataframe after dropna is {df_customer_inputs.shape}")
display(df_customer_inputs.head())

2022-06-08 15:00:57,561,api.py,from_bytes,356,INFO,p27837,ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-06-08 15:00:57,562,api.py,from_bytes,367,INFO,p27837,ascii should target any language(s) of ['Latin Based']
2022-06-08 15:00:57,567,api.py,from_bytes,385,INFO,p27837,We detected language [('English', 1.0), ('Simple English', 1.0), ('Indonesian', 0.9524)] using ascii
2022-06-08 15:00:57,568,api.py,from_bytes,419,INFO,p27837,ascii is most likely the one. Stopping the process.
2022-06-08 15:00:57,575,api.py,from_bytes,356,INFO,p27837,ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-06-08 15:00:57,576,api.py,from_bytes,367,INFO,p27837,ascii should target any language(s) of ['Latin Based']
2022-06-08 15:00:57,579,api.py,from_bytes,385,INFO,p27837,We detected language [('German', 0.8333), ('Hungarian', 0.8333), ('Slovak', 0.8333), ('English', 0.75), ('Dutch', 0.75), ('Italian', 0.75), ('Swedish', 0.75), ('Norwegian', 0.75), ('Czech', 0

Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,is_package,...,srch_children_cnt,srch_rm_cnt,srch_destination_id,srch_destination_type_id,is_booking,cnt,hotel_continent,hotel_country,hotel_market,hotel_cluster
0,2014-08-11 07:46:59,2,3,66,348,48862,2234.2641,12,0,1,...,0,1,8250,1,0,3,2,50,628,1
1,2014-08-11 08:22:12,2,3,66,348,48862,2234.2641,12,0,1,...,0,1,8250,1,1,1,2,50,628,1
2,2014-08-11 08:24:33,2,3,66,348,48862,2234.2641,12,0,0,...,0,1,8250,1,0,1,2,50,628,1
3,2014-08-09 18:05:16,2,3,66,442,35390,913.1932,93,0,0,...,0,1,14984,1,0,1,2,50,1457,80
4,2014-08-09 18:08:18,2,3,66,442,35390,913.6259,93,0,0,...,0,1,14984,1,0,1,2,50,1457,21


In [8]:
# read data from the bucket in a pandas dataframe, this will be ingested in the feature store
s3a_uri = f"s3a://{data_bucket_name}/{raw_data_dir}/{destination_features_fname}"
df_destinations = pd.read_csv(s3a_uri)
logger.info(f"shape of the dataframe read from {s3a_uri} is {df_destinations.shape}")

# drop rows with NA
df_destinations = df_destinations.dropna()
logger.info(f"shape of the dataframe after dropna is {df_destinations.shape}")
display(df_destinations.head())

2022-06-08 15:05:55,976,<ipython-input-8-852fe5e78f44>,<module>,4,INFO,p27837,shape of the dataframe read from s3a://expedia-customer-behavior-data-195cbf60/raw_data/destinations.csv is (62106, 150)
2022-06-08 15:05:56,036,<ipython-input-8-852fe5e78f44>,<module>,8,INFO,p27837,shape of the dataframe after dropna is (62106, 150)


Unnamed: 0,srch_destination_id,d1,d2,d3,d4,d5,d6,d7,d8,d9,...,d140,d141,d142,d143,d144,d145,d146,d147,d148,d149
0,0,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-1.897627,-2.198657,-2.198657,-1.897627,...,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657
1,1,-2.18169,-2.18169,-2.18169,-2.082564,-2.18169,-2.165028,-2.18169,-2.18169,-2.031597,...,-2.165028,-2.18169,-2.165028,-2.18169,-2.18169,-2.165028,-2.18169,-2.18169,-2.18169,-2.18169
2,2,-2.18349,-2.224164,-2.224164,-2.189562,-2.105819,-2.075407,-2.224164,-2.118483,-2.140393,...,-2.224164,-2.224164,-2.196379,-2.224164,-2.192009,-2.224164,-2.224164,-2.224164,-2.224164,-2.057548
3,3,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.115485,-2.177409,-2.177409,-2.177409,...,-2.161081,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409
4,4,-2.189562,-2.187783,-2.194008,-2.171153,-2.152303,-2.056618,-2.194008,-2.194008,-2.145911,...,-2.187356,-2.194008,-2.191779,-2.194008,-2.194008,-2.185161,-2.194008,-2.194008,-2.194008,-2.188037


## Data Transformation for Ingesting Into Feature Store
Before this data can be ingested into the SageMaker FeatureStore, certain transformations need to be done.

1. The date_time field, which will be used as the "Event Time", needs to be converted to ISO-8601 format, i.e. YYYY-MM-DDTHH:MM:SSZ.
2. The user_id field which will be used for "Record Identifier" needs to be converted to string.
3. All "object" type fields need to be converted to string.
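
To make the first transformation concrete, here is a minimal standard-library sketch of the string conversion for a single value. The function name `to_event_time` is hypothetical, and the timestamp is assumed to be in UTC (the same assumption this notebook makes).

```python
from datetime import datetime


def to_event_time(raw: str) -> str:
    """Convert 'YYYY-MM-DD HH:MM:SS' into 'YYYY-MM-DDTHH:MM:SSZ'.

    Assumption: the input has no timezone information and is treated as UTC.
    """
    parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_event_time("2015-09-03 17:09:54"))
```

The cells below do the same thing column-wise with Pandas.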

In [9]:
# convert to datetime first
df_customer_inputs.date_time = pd.to_datetime(df_customer_inputs.date_time)

# the above returns (for example) 2015-09-03 17:09:54, change this to 2015-09-03T17:09:54Z
# The dataset documentation does not mention the timezone of the date_time so will just assume it to be UTC.
df_customer_inputs.date_time = df_customer_inputs.date_time.map(lambda x: x.isoformat() + 'Z')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [10]:
# Convert user_id to string
df_customer_inputs.user_id = df_customer_inputs.user_id.astype("string")

# Also cast srch_destination_id to string. It is the record identifier of the destinations feature
# group, and a feature group record identifier can only be a string. This is the customer inputs table,
# not the destinations table, but the two are joined at model training time, so we do the cast here
# instead of doing it at join time.
df_customer_inputs.srch_destination_id = df_customer_inputs.srch_destination_id.astype("string")

In [11]:
# only keep rows where is_booking == 1: we are only concerned with events where the user actually booked a hotel, which is also what the test data contains
if "is_booking" in df_customer_inputs.columns:
    df_customer_inputs = df_customer_inputs[df_customer_inputs.is_booking == 1]
    logger.info(f"after removing all is_booking != 1 rows, shape of dataframe {df_customer_inputs.shape}")

2022-06-08 15:08:51,796,<ipython-input-11-85ff158cbdf4>,<module>,4,INFO,p27837,after removing all is_booking != 1 rows, shape of dataframe (1985514, 24)


### Create derived features
These features can then be stored in the Feature Store and used for training the model. This is the advantage of having a feature store: the derived features are available, ready to use, whenever we want to train an ML model, whether it is the one being created in this repo or one for a future use case.

In [12]:
# create derived features

# duration of the trip for which the hotel booking is needed seems to be intuitively important
df_customer_inputs['duration'] = (pd.to_datetime(df_customer_inputs.srch_co, errors='coerce') - pd.to_datetime(df_customer_inputs.srch_ci, errors='coerce')).astype('timedelta64[D]')

# how far is the trip from the time when the user was looking up the Expedia website
# (use the filtered dataframe for both operands instead of relying on index alignment with the raw df)
df_customer_inputs['days_to_trip'] = (pd.to_datetime(df_customer_inputs.srch_ci, errors='coerce') - pd.to_datetime(df_customer_inputs.date_time, errors='coerce').dt.tz_localize(None)).astype('timedelta64[D]')

# is the start or end of the trip on a weekend?
df_customer_inputs['start_of_trip_weekend'] = (pd.to_datetime(df_customer_inputs.srch_ci, errors='coerce').dt.weekday >= 5).astype(int)
df_customer_inputs['end_of_trip_weekend'] = (pd.to_datetime(df_customer_inputs.srch_co, errors='coerce').dt.weekday >= 5).astype(int)
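
As a sanity check on the derived features, the same arithmetic can be worked through for a single record with the standard library. This is only a sketch: `derive_trip_features` is a hypothetical helper, and the check-in and check-out dates are made up for illustration rather than taken from the dataset.

```python
from datetime import date, datetime


def derive_trip_features(search_time: datetime, check_in: date, check_out: date) -> dict:
    """Single-record version of the derived features computed above."""
    check_in_midnight = datetime.combine(check_in, datetime.min.time())
    return {
        # whole days between check-in and check-out
        "duration": (check_out - check_in).days,
        # whole days (rounded down) between the search and the check-in date
        "days_to_trip": (check_in_midnight - search_time).days,
        # weekday() returns 5 for Saturday and 6 for Sunday
        "start_of_trip_weekend": int(check_in.weekday() >= 5),
        "end_of_trip_weekend": int(check_out.weekday() >= 5),
    }


features = derive_trip_features(
    search_time=datetime(2014, 8, 11, 8, 22, 12),
    check_in=date(2014, 8, 29),   # a Friday (illustrative value)
    check_out=date(2014, 9, 2),   # a Tuesday (illustrative value)
)
print(features)
```

Note that `days_to_trip` counts only complete days, so a search made 17 days and 15 hours before check-in yields 17.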


In [13]:
# convert any "object" type columns to string
utils.cast_object_to_string(df_customer_inputs)

In [14]:
df_customer_inputs.head()

Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,is_package,...,is_booking,cnt,hotel_continent,hotel_country,hotel_market,hotel_cluster,duration,days_to_trip,start_of_trip_weekend,end_of_trip_weekend
1,2014-08-11T08:22:12Z,2,3,66,348,48862,2234.2641,12,0,1,...,1,1,2,50,628,1,4.0,17.0,0,0
79,2014-01-03T16:30:17Z,2,3,66,462,41898,2454.8588,1482,0,1,...,1,1,2,50,680,95,5.0,49.0,1,0
81,2014-01-03T16:44:56Z,2,3,66,462,41898,2454.8588,1482,0,1,...,1,1,2,50,680,95,5.0,49.0,1,0
83,2014-01-03T17:11:36Z,2,3,66,462,41898,2454.8588,1482,0,1,...,1,1,2,50,680,95,3.0,51.0,0,0
128,2014-10-29T14:32:19Z,2,3,66,174,40365,8456.8294,1713,0,0,...,1,1,3,5,89,38,1.0,12.0,0,0


In [15]:
# reduce the size of the dataset to make it more manageable for this demo
unique_user_id = list(df_customer_inputs.user_id.unique())
num_unique_user_ids = len(unique_user_id)
logger.info(f"there are {len(unique_user_id)} user_ids in the dataset")

# select 1% of the unique users
import random
FRACTION_OF_USER_IDS_TO_KEEP = 0.01
if FRACTION_OF_USER_IDS_TO_KEEP != 1:
    fraction_of_unique_user_ids = random.sample(unique_user_id, int(num_unique_user_ids*FRACTION_OF_USER_IDS_TO_KEEP))
    df_customer_inputs = df_customer_inputs[df_customer_inputs.user_id.isin(fraction_of_unique_user_ids)]
    logger.info(f"after filtering dataframe to keep {100*FRACTION_OF_USER_IDS_TO_KEEP}% of all user_ids, dataframe shape is {df_customer_inputs.shape}")

2022-06-08 15:09:14,896,<ipython-input-15-7c7d626332f7>,<module>,4,INFO,p27837,there are 596662 user_ids in the dataset
2022-06-08 15:09:15,541,<ipython-input-15-7c7d626332f7>,<module>,12,INFO,p27837,after filtering dataframe to keep 1.0% of all user_ids, dataframe shape is (20083, 28)
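
The sampling above is unseeded, so each run keeps a different 1% of users. If repeatable runs are preferred, a seeded variant along the following lines could be used. This is a sketch only: `sample_user_ids` and the default seed are assumptions, not part of the notebook.

```python
import random


def sample_user_ids(user_ids, fraction, seed=42):
    """Return a reproducible random subset of the unique user ids.

    Sorting first makes the result independent of the input order;
    a dedicated Random instance avoids touching the global random state.
    """
    ordered = sorted(set(user_ids))
    k = int(len(ordered) * fraction)
    return random.Random(seed).sample(ordered, k)


ids = [str(i) for i in range(1000)]
subset = sample_user_ids(ids, 0.01)
print(len(subset))
```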


## Initialize SageMaker and FeatureStore Runtime

In [16]:
role = get_execution_role()
region = boto3.Session().region_name
boto_session = boto3.Session(region_name=region)

sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)

featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

account_id = boto3.client('sts').get_caller_identity()["Account"]
logger.info(f"role={role}, region={region}, account_id={account_id}")



2022-06-08 15:09:15,924,<ipython-input-16-f1ccef0ce271>,<module>,10,INFO,p27837,role=arn:aws:iam::015469603702:role/expedia-feature-store-demo-v2-SageMakerRole-13TXOXWF10DUZ, region=us-east-1, account_id=015469603702


In [17]:
# Create a feature store session object
feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

## Cleanup Existing FeatureGroup (if needed)
To allow running this notebook multiple times without creating a new feature group on every run, a config parameter controls whether to delete an existing feature group with the same name. If the always-recreate parameter is set to false, a new feature group is created instead by appending the current date and time to the configured feature group name.

In [18]:
# get a list of feature groups
fg_list = sagemaker_client.list_feature_groups()
logger.info(f"there are {len(fg_list['FeatureGroupSummaries'])} feature groups")
logger.info(fg_list)
# if always recreate feature groups is set to True then delete any existing feature group that has one of the configured names
if always_recreate_fg and len(fg_list['FeatureGroupSummaries']) > 0:
    for fg in fg_list['FeatureGroupSummaries']:
        if fg['FeatureGroupName'] in [customer_inputs_fg_name, destinations_fg_name]:
            logger.warning(f"always_recreate_fg is True, going to delete feature group={fg['FeatureGroupName']}")
            sagemaker_client.delete_feature_group(FeatureGroupName=fg['FeatureGroupName'])
    time.sleep(5)

2022-06-08 15:09:16,081,<ipython-input-18-0ac5e8f47bac>,<module>,3,INFO,p27837,there are 10 feature groups
2022-06-08 15:09:16,082,<ipython-input-18-0ac5e8f47bac>,<module>,4,INFO,p27837,{'FeatureGroupSummaries': [{'FeatureGroupName': 'expedia-destinations-2022-6-8-14-37', 'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:015469603702:feature-group/expedia-destinations-2022-6-8-14-37', 'CreationTime': datetime.datetime(2022, 6, 8, 14, 46, 40, 727000, tzinfo=tzlocal()), 'FeatureGroupStatus': 'Created', 'OfflineStoreStatus': {'Status': 'Active'}}, {'FeatureGroupName': 'expedia-destinations-2022-6-8-13-56', 'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:015469603702:feature-group/expedia-destinations-2022-6-8-13-56', 'CreationTime': datetime.datetime(2022, 6, 8, 13, 59, 27, 483000, tzinfo=tzlocal()), 'FeatureGroupStatus': 'Created', 'OfflineStoreStatus': {'Status': 'Active'}}, {'FeatureGroupName': 'expedia-destinations-2022-6-7-22-9', 'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:01546960
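
`list_feature_groups` returns results in pages, so a single call may not include every feature group in the account. A minimal sketch of following `NextToken` across pages is shown below; the helper name `list_all_feature_group_names` is hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
def list_all_feature_group_names(sagemaker_client):
    """Collect feature group names across all pages of list_feature_groups."""
    names = []
    kwargs = {}
    while True:
        page = sagemaker_client.list_feature_groups(**kwargs)
        names.extend(s["FeatureGroupName"] for s in page["FeatureGroupSummaries"])
        token = page.get("NextToken")
        if not token:
            return names
        # ask for the next page on the following iteration
        kwargs = {"NextToken": token}


# usage in this notebook would look like:
# all_names = list_all_feature_group_names(sagemaker_client)
```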

## Create Feature Group
Create a Feature Group and then set its schema using the existing dataframe that contains the transformed data (already suitable for ingestion into the feature store).

In [19]:
feature_group = FeatureGroup(name=customer_inputs_fg_name, sagemaker_session=feature_store_session)
feature_group.load_feature_definitions(data_frame=df_customer_inputs)

[FeatureDefinition(feature_name='date_time', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='site_name', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='posa_continent', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='user_location_country', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='user_location_region', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='user_location_city', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='orig_destination_distance', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='user_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='is_mobile', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='is_package', feature_type=<Fe

This is the actual feature group creation step. Note that we usually want to create an **online + offline feature store**: online because we want to use it for real-time predictions, and offline because we want to use it for model training. In this particular use case a separate test dataset is provided, so an online store is much more relevant for the test dataset than for the training dataset; nevertheless, an offline+online store here does not hurt.

In [20]:
feature_group.create(
    s3_uri=f"s3://{feature_store_bucket_name}/{customer_inputs_fg_name}",
    record_identifier_name="user_id",
    event_time_feature_name="date_time",
    role_arn=role,
    enable_online_store=True,
    tags=[{'Key':'project','Value':'expedia-feature-store-demo'}]
)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:015469603702:feature-group/expedia-customer-inputs-2022-6-8-15-0',
 'ResponseMetadata': {'RequestId': 'f03596e1-f266-4779-9c9f-232d2f8aebff',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'f03596e1-f266-4779-9c9f-232d2f8aebff',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '114',
   'date': 'Wed, 08 Jun 2022 15:09:15 GMT'},
  'RetryAttempts': 0}}

In [21]:
utils.check_feature_group_status(feature_group)

2022-06-08 15:09:16,591,utils.py,check_feature_group_status,98,INFO,p27837,Waiting for Feature Group to be Created
2022-06-08 15:09:21,657,utils.py,check_feature_group_status,98,INFO,p27837,Waiting for Feature Group to be Created
2022-06-08 15:09:26,711,utils.py,check_feature_group_status,98,INFO,p27837,Waiting for Feature Group to be Created
2022-06-08 15:09:31,775,utils.py,check_feature_group_status,98,INFO,p27837,Waiting for Feature Group to be Created
2022-06-08 15:09:36,830,utils.py,check_feature_group_status,101,INFO,p27837,FeatureGroup expedia-customer-inputs-2022-6-8-15-0 successfully created.
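
The `utils.check_feature_group_status` helper is not shown in this notebook. A rough sketch of what such a wait loop might look like follows; it assumes only that `feature_group.describe()` returns a dict with a `FeatureGroupStatus` key. The function name, poll interval, and poll limit are illustrative choices, not the helper's actual implementation.

```python
import time


def wait_until_created(describe, poll_seconds=5, max_polls=60, sleep=time.sleep):
    """Poll `describe` until the feature group status is 'Created'.

    `describe` is any zero-argument callable returning a dict with a
    'FeatureGroupStatus' key, e.g. feature_group.describe.
    """
    for _ in range(max_polls):
        status = describe()["FeatureGroupStatus"]
        if status == "Created":
            return status
        if status != "Creating":
            # e.g. 'CreateFailed': stop instead of waiting forever
            raise RuntimeError(f"feature group creation ended with status {status}")
        sleep(poll_seconds)
    raise TimeoutError("feature group was not created within the polling limit")


# usage in this notebook would look like:
# wait_until_created(feature_group.describe)
```

Raising on any status other than `Creating` or `Created` avoids looping until the timeout when creation has already failed.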


In [22]:
#Ingest features into the feature group
# actually batch ingest the data into the feature store now
logger.info(f"about to begin ingestion of data into feature store, max_workers={MAX_WORKERS}")
feature_group.ingest(
    data_frame=df_customer_inputs, max_workers=MAX_WORKERS, wait=True
)

2022-06-08 15:09:36,843,<ipython-input-22-7dc3a283fdae>,<module>,3,INFO,p27837,about to begin ingestion of data into feature store, max_workers=8
2022-06-08 15:09:39,677,feature_group.py,_ingest_single_batch,211,INFO,p515,Started ingesting index 12555 to 15066
2022-06-08 15:09:39,767,feature_group.py,_ingest_single_batch,211,INFO,p515,Started ingesting index 10044 to 12555
2022-06-08 15:09:39,771,feature_group.py,_ingest_single_batch,211,INFO,p515,Started ingesting index 0 to 2511
2022-06-08 15:09:39,777,feature_group.py,_ingest_single_batch,211,INFO,p515,Started ingesting index 5022 to 7533
2022-06-08 15:09:39,832,feature_group.py,_ingest_single_batch,211,INFO,p515,Started ingesting index 2511 to 5022
2022-06-08 15:09:39,833,feature_group.py,_ingest_single_batch,211,INFO,p515,Started ingesting index 15066 to 17577
2022-06-08 15:09:39,866,feature_group.py,_ingest_single_batch,211,INFO,p515,Started ingesting index 17577 to 20083
2022-06-08 15:09:39,873,feature_group.py,_ingest_single_ba

IngestionManagerPandas(feature_group_name='expedia-customer-inputs-2022-6-8-15-0', sagemaker_fs_runtime_client_config=<botocore.config.Config object at 0x7f6aec7a0fd0>, max_workers=8, max_processes=1, profile_name=None, _async_result=<multiprocess.pool.MapResult object at 0x7f6b0f8a38d0>, _processing_pool=<pool ProcessPool(ncpus=1)>, _failed_indices=[])
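
The "Started ingesting index ... to ..." log lines above are consistent with the dataframe being split into contiguous row ranges, one per worker. The sketch below reproduces those ranges under the assumption that the batch size is the row count divided by `max_workers`, rounded up. It is an illustration of the arithmetic, not the SDK's code, and `batch_ranges` is a hypothetical name.

```python
import math


def batch_ranges(num_rows: int, max_workers: int):
    """Split [0, num_rows) into contiguous (start, end) ranges, one per worker."""
    if num_rows == 0:
        return []
    batch_size = math.ceil(num_rows / max_workers)
    return [
        (start, min(start + batch_size, num_rows))
        for start in range(0, num_rows, batch_size)
    ]


# 20083 rows and 8 workers, as in the ingestion above
for start, end in batch_ranges(20083, 8):
    print(f"index {start} to {end}")
```

With 20083 rows and 8 workers the batch size is 2511, so the first range is 0 to 2511 and the last is 17577 to 20083, matching the ranges that appear in the log.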

## Query ingested data from the "Online" feature store
This should immediately return the results.

In [23]:
# use batch_get_record to fetch the records for the first two user_ids
record_identifier_values = list((df_customer_inputs.user_id.unique()))[:2]
response=featurestore_runtime.batch_get_record(
    Identifiers=[
        {"FeatureGroupName": customer_inputs_fg_name, "RecordIdentifiersValueAsString": record_identifier_values}
    ]
)
response

{'ResponseMetadata': {'RequestId': '4375001d-6684-4113-ac41-8321aec680b4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '4375001d-6684-4113-ac41-8321aec680b4',
   'content-type': 'application/json',
   'content-length': '3346',
   'date': 'Wed, 08 Jun 2022 15:10:43 GMT'},
  'RetryAttempts': 0},
 'Records': [{'FeatureGroupName': 'expedia-customer-inputs-2022-6-8-15-0',
   'RecordIdentifierValueAsString': '23532',
   'Record': [{'FeatureName': 'date_time',
     'ValueAsString': '2014-11-03T15:57:09Z'},
    {'FeatureName': 'site_name', 'ValueAsString': '2'},
    {'FeatureName': 'posa_continent', 'ValueAsString': '3'},
    {'FeatureName': 'user_location_country', 'ValueAsString': '66'},
    {'FeatureName': 'user_location_region', 'ValueAsString': '254'},
    {'FeatureName': 'user_location_city', 'ValueAsString': '21713'},
    {'FeatureName': 'orig_destination_distance', 'ValueAsString': '1583.6919'},
    {'FeatureName': 'user_id', 'ValueAsString': '23532'},
    {'FeatureN
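
The response nests each record as a list of `FeatureName`/`ValueAsString` pairs, which is awkward to work with directly. A small sketch for flattening it into plain dictionaries keyed by record identifier follows; the helper names are hypothetical, and the sample response is a trimmed copy of the output above.

```python
def record_to_dict(record):
    """Turn [{'FeatureName': ..., 'ValueAsString': ...}, ...] into {name: value}."""
    return {feature["FeatureName"]: feature["ValueAsString"] for feature in record}


def records_by_identifier(response):
    """Map each record identifier in a batch_get_record response to its features."""
    return {
        item["RecordIdentifierValueAsString"]: record_to_dict(item["Record"])
        for item in response["Records"]
    }


# trimmed sample shaped like the response shown above
sample_response = {
    "Records": [
        {
            "FeatureGroupName": "expedia-customer-inputs-2022-6-8-15-0",
            "RecordIdentifierValueAsString": "23532",
            "Record": [
                {"FeatureName": "date_time", "ValueAsString": "2014-11-03T15:57:09Z"},
                {"FeatureName": "site_name", "ValueAsString": "2"},
                {"FeatureName": "user_id", "ValueAsString": "23532"},
            ],
        }
    ]
}
flat = records_by_identifier(sample_response)
print(flat["23532"]["date_time"])
```

All values come back as strings, so numeric features need to be cast before being passed to a model.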

## Query ingested data from the "Offline" feature store
The offline feature store is queried using Athena. The feature group object has an `athena_query` method that returns a query object for running Athena queries.

**Note:** It could take several minutes (up to 15) until the data is ingested and available for querying.

In [24]:
# optionally uncomment the sleep below to wait for at least some data to show up in the offline feature store
# time.sleep(60)

# the feature group provides a convenient Athena query object to query the offline feature store data
query = feature_group.athena_query()
customers_fg_table = query.table_name
logger.info(f"Athena table -> fg_table={customers_fg_table}")

2022-06-08 15:10:44,465,<ipython-input-24-402cfdbf5394>,<module>,7,INFO,p27837,Athena table -> fg_table=expedia-customer-inputs-2022-6-8-15-0-1654700956


In [25]:
query_string = f'SELECT * FROM "{customers_fg_table}" limit 10'
output_location=f's3://{athena_query_results_bucket_name}/{customer_inputs_fg_name}/query_results/'
logger.info(f"going to run this query -> {query_string} and store the results in {output_location}")

# run the query
query.run(query_string=query_string, output_location=output_location)

# wait for the results
query.wait()
df_fg = query.as_dataframe()

# results
df_fg.head()

2022-06-08 15:10:44,479,<ipython-input-25-7c31cd300965>,<module>,3,INFO,p27837,going to run this query -> SELECT * FROM "expedia-customer-inputs-2022-6-8-15-0-1654700956" limit 10 and store the results in s3://athena-query-results-195cbf60/expedia-customer-inputs-2022-6-8-15-0/query_results/
2022-06-08 15:10:44,719,session.py,wait_for_athena_query,4118,INFO,p27837,Query e7c3a9af-33ec-4373-84ed-c7640ed53867 is being executed.
2022-06-08 15:10:49,766,session.py,wait_for_athena_query,4127,INFO,p27837,Query e7c3a9af-33ec-4373-84ed-c7640ed53867 successfully executed.


Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,is_package,...,hotel_country,hotel_market,hotel_cluster,duration,days_to_trip,start_of_trip_weekend,end_of_trip_weekend,write_time,api_invocation_time,is_deleted
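
Besides the ingested features, the offline table has `write_time`, `api_invocation_time` and `is_deleted` columns, as the header above shows. The offline store keeps the history of records for a given identifier, whereas the online store serves only the latest one. A minimal sketch of reducing offline rows to the latest record per identifier is given below. It works on plain dictionaries, `latest_records` is a hypothetical helper, the sample rows are made up, and timestamps are assumed to be strings in one fixed ISO-8601 format so that string comparison matches chronological order.

```python
def latest_records(rows, id_col="user_id", event_col="date_time"):
    """Keep the most recent row per identifier and drop deleted records."""
    latest = {}
    for row in rows:
        # later event time wins; ties are broken by invocation and write time
        rank = (row[event_col], row["api_invocation_time"], row["write_time"])
        key = row[id_col]
        if key not in latest or rank > latest[key][0]:
            latest[key] = (rank, row)
    return [row for _, row in latest.values() if not row["is_deleted"]]


# made-up rows, shaped like the offline table
rows = [
    {"user_id": "12", "date_time": "2014-08-09T18:05:16Z", "hotel_cluster": 80,
     "api_invocation_time": "2022-06-08T15:09:40Z", "write_time": "2022-06-08T15:14:00Z", "is_deleted": False},
    {"user_id": "12", "date_time": "2014-08-11T08:22:12Z", "hotel_cluster": 1,
     "api_invocation_time": "2022-06-08T15:09:41Z", "write_time": "2022-06-08T15:14:00Z", "is_deleted": False},
    {"user_id": "93", "date_time": "2014-08-09T18:08:18Z", "hotel_cluster": 21,
     "api_invocation_time": "2022-06-08T15:09:42Z", "write_time": "2022-06-08T15:14:00Z", "is_deleted": True},
]
print(latest_records(rows))
```

For model training on historical events the full history is usually what is wanted; this reduction is relevant when the offline data should mirror what the online store would return.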


## Save variables for next stage

In [26]:
# write feature group names and query_string to a file, used when generating lineage
utils.write_param("customer_inputs_fg_name", customer_inputs_fg_name)
utils.write_param("customer_inputs_fg_table", customers_fg_table)
utils.write_param("customer_inputs_fg_query_string", query_string)


2022-06-08 15:10:49,978,utils.py,write_param,115,INFO,p27837,write_param, fpath=../config/customer_inputs_fg_name, writing customer_inputs_fg_name=expedia-customer-inputs-2022-6-8-15-0
2022-06-08 15:10:49,980,utils.py,write_param,115,INFO,p27837,write_param, fpath=../config/customer_inputs_fg_table, writing customer_inputs_fg_table=expedia-customer-inputs-2022-6-8-15-0-1654700956


## Create feature group for the destination features
We first run PCA on the destination features to reduce them to 3 principal components and then store the principal components in a separate feature group of their own.


In [27]:
# reduce dimensions of the destination features
from sklearn.decomposition import PCA

# the number of principal components to keep is just set to 3 here since this is a demo
# but in an actual production model this would be determined by examining a scree plot/variance explained rule/other criteria
pca = PCA(n_components=PC_TO_KEEP)
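
One common variance-explained rule is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. A minimal sketch is shown below; with scikit-learn the ratios would come from `pca.explained_variance_ratio_` after fitting, but the ratios and thresholds used here are made-up values for illustration, and `components_for_variance` is a hypothetical helper.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold):
    """Smallest k such that the first k components explain >= threshold of the variance."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= threshold:
            return k
    # threshold not reachable: keep every component
    return len(explained_variance_ratio)


# made-up ratios, sorted in decreasing order as PCA would return them
ratios = [0.5, 0.3, 0.1, 0.1]
print(components_for_variance(ratios, 0.75))
```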

In [28]:
# all columns except the srch_destination_id
cols_to_use = [c for c in df_destinations.columns if c != 'srch_destination_id']
destinations_pca = pca.fit_transform(df_destinations[cols_to_use])
df_destinations_pca = pd.DataFrame(destinations_pca, columns=[f'pc{x}' for x in range(1, (PC_TO_KEEP+1))])

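# typecast the destination id to string since it is going to be used as the record identifier in the
# feature store, which has to be a string; .values makes the assignment positional so it does not
# depend on df_destinations still having a default index after dropna
df_destinations_pca["srch_destination_id"] = df_destinations["srch_destination_id"].astype('string').values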

# there is no date time associated with these features in the input dataset, so just use the current datetime
# (datetime is already imported at the top of the notebook).
# datetime.utcnow().isoformat() will return something like '2022-06-07T22:08:19.399890', need to
# truncate it to yyyy-MM-dd'T'HH:mm:ss format to make it work with sagemaker feature store
datetime_iso8601_now = f"{datetime.utcnow().isoformat().split('.')[0]}Z"
df_destinations_pca["date_time"] = datetime_iso8601_now
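
`datetime.utcnow()` returns a naive datetime and is deprecated as of Python 3.12. A timezone-aware alternative that produces the same `yyyy-MM-dd'T'HH:mm:ssZ` string directly is sketched below; `utc_event_time_now` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def utc_event_time_now() -> str:
    """Current UTC time formatted as YYYY-MM-DDTHH:MM:SSZ (seconds precision)."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(utc_event_time_now())
```

Formatting with `strftime` drops the fractional seconds, so no string splitting is needed.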

In [29]:
df_destinations_pca.head()

Unnamed: 0,pc1,pc2,pc3,srch_destination_id,date_time
0,-0.044268,0.169419,0.032518,0,2022-06-08T15:10:52Z
1,-0.440761,0.077405,-0.091572,1,2022-06-08T15:10:52Z
2,0.001033,0.020677,0.012111,2,2022-06-08T15:10:52Z
3,-0.480467,-0.040345,-0.019321,3,2022-06-08T15:10:52Z
4,-0.207253,-0.042694,-0.011745,4,2022-06-08T15:10:52Z


In [30]:
df_destinations_pca['date_time'] = df_destinations_pca['date_time'].astype('string')

In [31]:
df_destinations_pca.dtypes

pc1                    float64
pc2                    float64
pc3                    float64
srch_destination_id     string
date_time               string
dtype: object

In [32]:
feature_group = FeatureGroup(name=destinations_fg_name, sagemaker_session=feature_store_session)
feature_group.load_feature_definitions(data_frame=df_destinations_pca)

[FeatureDefinition(feature_name='pc1', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='pc2', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='pc3', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='srch_destination_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='date_time', feature_type=<FeatureTypeEnum.STRING: 'String'>)]

In [33]:
feature_group.create(
    s3_uri=f"s3://{feature_store_bucket_name}/{destinations_fg_name}",
    record_identifier_name="srch_destination_id",
    event_time_feature_name="date_time",
    role_arn=role,
    enable_online_store=True,
    tags=[{'Key':'AppName','Value':app_name}]
)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:015469603702:feature-group/expedia-destinations-2022-6-8-15-0',
 'ResponseMetadata': {'RequestId': '8b4739d4-7024-4863-be6b-4335160b1827',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '8b4739d4-7024-4863-be6b-4335160b1827',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '111',
   'date': 'Wed, 08 Jun 2022 15:10:52 GMT'},
  'RetryAttempts': 0}}

In [34]:
utils.check_feature_group_status(feature_group)

2022-06-08 15:10:52,615,utils.py,check_feature_group_status,98,INFO,p27837,Waiting for Feature Group to be Created
2022-06-08 15:10:57,673,utils.py,check_feature_group_status,98,INFO,p27837,Waiting for Feature Group to be Created
2022-06-08 15:11:02,750,utils.py,check_feature_group_status,98,INFO,p27837,Waiting for Feature Group to be Created
2022-06-08 15:11:07,834,utils.py,check_feature_group_status,98,INFO,p27837,Waiting for Feature Group to be Created
2022-06-08 15:11:12,900,utils.py,check_feature_group_status,98,INFO,p27837,Waiting for Feature Group to be Created
2022-06-08 15:11:17,963,utils.py,check_feature_group_status,101,INFO,p27837,FeatureGroup expedia-destinations-2022-6-8-15-0 successfully created.
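
The log above shows the helper polling roughly every five seconds until the feature group is created. The source of `utils.check_feature_group_status` is not shown here, so the following is only a minimal sketch of such a polling loop. `wait_for_status`, its parameters, and the fake status sequence are assumptions made for illustration.

```python
import time


def wait_for_status(get_status, target="Created", pending="Creating",
                    poll_seconds=5, timeout_seconds=600, sleep=time.sleep):
    """Poll get_status() until it stops returning the pending status."""
    waited = 0
    status = get_status()
    while status == pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {pending!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    if status != target:
        raise RuntimeError(f"Unexpected status: {status!r}")
    return status


# Demo with a fake status source, so the sketch runs without AWS access.
statuses = iter(["Creating", "Creating", "Created"])
naps = []  # records the requested sleep durations instead of sleeping
final_status = wait_for_status(lambda: next(statuses), sleep=naps.append)
print(final_status, naps)
```

Against a real feature group, the status source could be `lambda: feature_group.describe()["FeatureGroupStatus"]`.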


In [35]:
# Batch-ingest the features into the feature group.
# wait=True blocks until all ingestion workers have finished.
logger.info(f"about to begin ingestion of data into feature store, max_workers={MAX_WORKERS}")
feature_group.ingest(
    data_frame=df_destinations_pca, max_workers=MAX_WORKERS, wait=True
)

2022-06-08 15:11:17,970,<ipython-input-35-abca2e197c58>,<module>,3,INFO,p27837,about to begin ingestion of data into feature store, max_workers=8
2022-06-08 15:11:20,050,feature_group.py,_ingest_single_batch,211,INFO,p1599,Started ingesting index 46584 to 54348
2022-06-08 15:11:20,063,feature_group.py,_ingest_single_batch,211,INFO,p1599,Started ingesting index 0 to 7764
2022-06-08 15:11:20,115,feature_group.py,_ingest_single_batch,211,INFO,p1599,Started ingesting index 15528 to 23292
2022-06-08 15:11:20,127,feature_group.py,_ingest_single_batch,211,INFO,p1599,Started ingesting index 54348 to 62106
2022-06-08 15:11:20,141,feature_group.py,_ingest_single_batch,211,INFO,p1599,Started ingesting index 31056 to 38820
2022-06-08 15:11:20,152,feature_group.py,_ingest_single_batch,211,INFO,p1599,Started ingesting index 38820 to 46584
2022-06-08 15:11:20,178,feature_group.py,_ingest_single_batch,211,INFO,p1599,Started ingesting index 23292 to 31056
2022-06-08 15:11:20,203,feature_group.py,_inges

IngestionManagerPandas(feature_group_name='expedia-destinations-2022-6-8-15-0', sagemaker_fs_runtime_client_config=<botocore.config.Config object at 0x7f6aec7a0fd0>, max_workers=8, max_processes=1, profile_name=None, _async_result=<multiprocess.pool.MapResult object at 0x7f6aec9b7128>, _processing_pool=<pool ProcessPool(ncpus=1)>, _failed_indices=[])
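
The index ranges in the log above are consistent with the DataFrame being split into `max_workers` contiguous batches of equal size, with the last batch shortened. The sketch below reproduces those ranges. It is inferred from the log output rather than taken from the SDK source, and `batch_bounds` is a hypothetical helper.

```python
import math


def batch_bounds(n_rows, max_workers):
    """Split n_rows into max_workers contiguous (start, end) index ranges."""
    batch_size = math.ceil(n_rows / max_workers)
    bounds = []
    for worker in range(max_workers):
        start = min(worker * batch_size, n_rows)
        end = min(start + batch_size, n_rows)
        bounds.append((start, end))
    return bounds


# 62106 is the last index reported in the log above; MAX_WORKERS was 8.
bounds = batch_bounds(62106, 8)
for start, end in bounds:
    print(f"index {start} to {end}")
```

The workers run concurrently, which is why the log lines appear out of order.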

In [36]:
# Read the first two records back from the online store with batch_get_record
record_identifier_values = list(df_destinations_pca.srch_destination_id.unique())[:2]
response = featurestore_runtime.batch_get_record(
    Identifiers=[
        {"FeatureGroupName": destinations_fg_name, "RecordIdentifiersValueAsString": record_identifier_values}
    ]
)
response

{'ResponseMetadata': {'RequestId': 'a38318bc-bccb-431e-b4b7-5dbb9bf8a2df',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'a38318bc-bccb-431e-b4b7-5dbb9bf8a2df',
   'content-type': 'application/json',
   'content-length': '875',
   'date': 'Wed, 08 Jun 2022 15:13:50 GMT'},
  'RetryAttempts': 0},
 'Records': [{'FeatureGroupName': 'expedia-destinations-2022-6-8-15-0',
   'RecordIdentifierValueAsString': '0',
   'Record': [{'FeatureName': 'pc1', 'ValueAsString': '-0.04426762797339248'},
    {'FeatureName': 'pc2', 'ValueAsString': '0.16941936159447119'},
    {'FeatureName': 'pc3', 'ValueAsString': '0.03251814805114878'},
    {'FeatureName': 'srch_destination_id', 'ValueAsString': '0'},
    {'FeatureName': 'date_time', 'ValueAsString': '2022-06-08T15:10:52Z'}]},
  {'FeatureGroupName': 'expedia-destinations-2022-6-8-15-0',
   'RecordIdentifierValueAsString': '1',
   'Record': [{'FeatureName': 'pc1', 'ValueAsString': '-0.4407610032019503'},
    {'FeatureName': 'pc2', 'ValueAsS
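
Each entry under `Records` holds a list of `FeatureName`/`ValueAsString` pairs, and every value comes back as a string. A minimal sketch of flattening such a response into plain dictionaries is shown below. `records_to_rows` is a hypothetical helper, and the sample response is trimmed to the first record from the output above.

```python
def records_to_rows(response):
    """Turn a batch_get_record-style response into a list of {feature: value} dicts."""
    rows = []
    for record in response.get("Records", []):
        rows.append({f["FeatureName"]: f["ValueAsString"] for f in record["Record"]})
    return rows


# Sample response, trimmed to the first record shown in the output above.
sample_response = {
    "Records": [
        {
            "FeatureGroupName": "expedia-destinations-2022-6-8-15-0",
            "RecordIdentifierValueAsString": "0",
            "Record": [
                {"FeatureName": "pc1", "ValueAsString": "-0.04426762797339248"},
                {"FeatureName": "pc2", "ValueAsString": "0.16941936159447119"},
                {"FeatureName": "pc3", "ValueAsString": "0.03251814805114878"},
                {"FeatureName": "srch_destination_id", "ValueAsString": "0"},
                {"FeatureName": "date_time", "ValueAsString": "2022-06-08T15:10:52Z"},
            ],
        }
    ]
}

rows = records_to_rows(sample_response)
# Values arrive as strings, so numeric features need an explicit conversion.
pc1 = float(rows[0]["pc1"])
print(rows[0]["srch_destination_id"], pc1)
```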

In [37]:
# Offline store writes are asynchronous; uncomment the sleep below to give at least some data time to show up before querying
# time.sleep(60)

# the feature group provides a convenient Athena query object for the offline feature store data
query = feature_group.athena_query()
destinations_fg_table = query.table_name
logger.info(f"Athena table -> fg_table={destinations_fg_table}")

2022-06-08 15:13:51,051,<ipython-input-37-3b4fd3e5966a>,<module>,7,INFO,p27837,Athena table -> fg_table=expedia-destinations-2022-6-8-15-0-1654701052


In [38]:
query_string = f'SELECT * FROM "{destinations_fg_table}" limit 10'
utils.write_param("destinations_fg_table", destinations_fg_table)
output_location=f's3://{athena_query_results_bucket_name}/{destinations_fg_name}/query_results/'
logger.info(f"going to run this query -> {query_string} and store the results in {output_location}")

# run the query
query.run(query_string=query_string, output_location=output_location)

# wait for the results
query.wait()
df_fg = query.as_dataframe()

# results
df_fg.head()

2022-06-08 15:13:51,063,utils.py,write_param,115,INFO,p27837,write_param, fpath=../config/destinations_fg_table, writing destinations_fg_table=expedia-destinations-2022-6-8-15-0-1654701052
2022-06-08 15:13:51,068,<ipython-input-38-fa8da0ead993>,<module>,4,INFO,p27837,going to run this query -> SELECT * FROM "expedia-destinations-2022-6-8-15-0-1654701052" limit 10 and store the results in s3://athena-query-results-195cbf60/expedia-destinations-2022-6-8-15-0/query_results/
2022-06-08 15:13:51,258,session.py,wait_for_athena_query,4118,INFO,p27837,Query 91daf10f-acb0-49d1-b2f1-34edfe564c2b is being executed.
2022-06-08 15:13:56,315,session.py,wait_for_athena_query,4127,INFO,p27837,Query 91daf10f-acb0-49d1-b2f1-34edfe564c2b successfully executed.


Unnamed: 0,pc1,pc2,pc3,srch_destination_id,date_time,write_time,api_invocation_time,is_deleted
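
The result above contains only the column header and no rows, which is what one would expect if the query ran before any data had reached the offline store. One way to handle this is to repeat the query until rows appear. The sketch below assumes a hypothetical `run_query` callable that, in this notebook, would wrap `query.run(...)`, `query.wait()` and `query.as_dataframe()`. The retry counts and the fake query are illustrative only.

```python
import time


def query_until_rows(run_query, max_attempts=10, wait_seconds=60, sleep=time.sleep):
    """Call run_query() until it returns a non-empty result or attempts run out."""
    rows = []
    for attempt in range(1, max_attempts + 1):
        rows = run_query()
        if len(rows) > 0:
            return rows
        if attempt < max_attempts:
            sleep(wait_seconds)
    return rows  # still empty after max_attempts


# Demo with a fake query that returns rows on the third call.
results = iter([[], [], [{"srch_destination_id": "0"}]])
naps = []  # records the requested sleep durations instead of sleeping
found = query_until_rows(lambda: next(results), sleep=naps.append)
print(found, naps)
```

Because the helper only relies on `len()`, it works with a list of rows as in the demo or with a pandas DataFrame.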


In [39]:
# write feature group name to a file
utils.write_param("destinations_fg_name", destinations_fg_name)

2022-06-08 15:13:56,485,utils.py,write_param,115,INFO,p27837,write_param, fpath=../config/destinations_fg_name, writing destinations_fg_name=expedia-destinations-2022-6-8-15-0
