End-to-end ML workflow to highlight SageMaker Feature Store
This repository demonstrates an end-to-end ML workflow using various AWS services such as SageMaker (Feature Store, Endpoints), Kinesis Data Streams, Lambda and DynamoDB.
The dataset used here is the Expedia hotel recommendations dataset from Kaggle, and the use case is predicting a hotel cluster based on user inputs and destination features. We ingest the raw data from an S3 bucket into Amazon SageMaker Feature Store and then read data from the Feature Store to train an ML model for predicting a hotel cluster. The trained model is deployed as a SageMaker endpoint. A simulated inference pipeline is created using Amazon Kinesis Data Streams and Lambda. Test data is put on the stream, which triggers a Lambda function. The function joins the event data (customer inputs) read from the stream with destination features read from the online SageMaker Feature Store and then invokes the SageMaker model endpoint to get a prediction for the hotel cluster. The predicted hotel cluster, along with the input data, is stored in a DynamoDB table.
A blog post providing a full walkthrough of using a feature store should be coming soon.
For a full explanation of SageMaker Feature Store you can read here, which describes the capability as:
Amazon SageMaker Feature Store is a purpose-built repository where you can store and access features so it’s much easier to name, organize, and reuse them across teams. SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent.
This implementation demonstrates how to do the following:
- Create multiple online and offline SageMaker Feature Groups to store transformed data readily usable for training ML models.
- Train a SageMaker XGBoost model and deploy it as an endpoint for real time inference.
- Simulate an inference pipeline by putting events on a Kinesis Data Stream and triggering a Lambda function.
- Read data from the online feature store in the Lambda function, combine it with event data, and invoke a SageMaker endpoint.
- Store prediction results and input data in a DynamoDB table.
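The inference path in the list above can be sketched roughly as follows. This is a minimal illustration, not the Lambda code shipped in hotel_cluster_predictions_v1.zip: every name in CONFIG (feature group, endpoint, table, column names) is a placeholder assumed for the example, and the DynamoDB table's key is assumed to be one of the event fields. The AWS clients are passed in as arguments so the logic can be exercised without an AWS account.

```python
import base64
import json

# Hypothetical names for illustration only; the real values are defined by
# the CloudFormation template and the notebooks.
CONFIG = {
    "feature_group": "destination-features",      # assumed feature group name
    "endpoint": "hotel-cluster-endpoint",          # assumed endpoint name
    "table": "hotel-cluster-predictions",          # assumed DynamoDB table name
    "id_field": "srch_destination_id",             # assumed record identifier
    "event_columns": ["user_id", "srch_destination_id"],  # assumed model inputs
    "destination_columns": ["d1", "d2"],           # assumed feature names
}


def decode_kinesis_record(record):
    """Kinesis delivers each payload base64-encoded under record['kinesis']['data']."""
    return json.loads(base64.b64decode(record["kinesis"]["data"]))


def record_to_dict(response):
    """Flatten a Feature Store GetRecord response into {feature_name: value}."""
    return {f["FeatureName"]: f["ValueAsString"] for f in response.get("Record", [])}


def predict_for_event(event_data, fs_runtime, sm_runtime, ddb, config):
    """Join event data with online features, call the endpoint, store the result."""
    response = fs_runtime.get_record(
        FeatureGroupName=config["feature_group"],
        RecordIdentifierValueAsString=str(event_data[config["id_field"]]),
    )
    dest_features = record_to_dict(response)

    # Build one CSV row: event fields first, then destination features.
    row = [str(event_data[c]) for c in config["event_columns"]]
    row += [dest_features[c] for c in config["destination_columns"]]
    result = sm_runtime.invoke_endpoint(
        EndpointName=config["endpoint"],
        ContentType="text/csv",
        Body=",".join(row),
    )
    prediction = result["Body"].read().decode("utf-8").strip()

    # Store the inputs and the prediction as string attributes.
    item = {k: {"S": str(v)} for k, v in event_data.items()}
    item["hotel_cluster"] = {"S": prediction}
    ddb.put_item(TableName=config["table"], Item=item)
    return prediction


def lambda_handler(event, context):
    import boto3  # imported lazily so the helpers above run without AWS access

    fs_runtime = boto3.client("sagemaker-featurestore-runtime")
    sm_runtime = boto3.client("sagemaker-runtime")
    ddb = boto3.client("dynamodb")
    return [
        predict_for_event(decode_kinesis_record(r), fs_runtime, sm_runtime, ddb, CONFIG)
        for r in event["Records"]
    ]
```

Passing the clients into predict_for_event is a design choice made for this sketch; it keeps the join-and-predict step separate from the AWS wiring in lambda_handler.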
Prior to running the steps under Instructions, you will need access to an AWS account where you have full admin privileges. The CloudFormation template will deploy an AWS Lambda function, an IAM role, and a new SageMaker notebook instance with this repo already cloned. In addition, basic knowledge of the following services will be valuable: Amazon Kinesis Data Streams, Amazon SageMaker, AWS Lambda, and AWS IAM roles.
PRE-REQ 1: The CloudFormation template also deploys a Lambda function, and the code for that function must be in an S3 bucket that exists before you run the CloudFormation template. Create an S3 bucket and place the hotel_cluster_predictions_v1.zip file in the bucket. Keep the name of this bucket handy, as it is required as input for the "Name of the S3 bucket for holding the zip file of the Lambda code" parameter in the CloudFormation template.
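If you prefer to script PRE-REQ 1 rather than use the console, a sketch along these lines should work with boto3. It assumes the zip file is expected at the root of the bucket, which this README does not state explicitly, and the bucket name in the usage comment is a placeholder. The helper also handles a common gotcha: S3 rejects a LocationConstraint for us-east-1 but requires one in other regions.

```python
import os


def bucket_kwargs(bucket_name, region):
    """Arguments for create_bucket; us-east-1 must not set a LocationConstraint."""
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def stage_lambda_code(s3_client, bucket_name, region,
                      zip_path="hotel_cluster_predictions_v1.zip"):
    """Create the bucket and upload the Lambda zip under its own file name."""
    s3_client.create_bucket(**bucket_kwargs(bucket_name, region))
    key = os.path.basename(zip_path)  # assumption: object sits at the bucket root
    s3_client.upload_file(zip_path, bucket_name, key)
    return key


# Usage (requires AWS credentials; the bucket name is a placeholder):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   stage_lambda_code(s3, "my-lambda-code-bucket", "us-east-1")
```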
PRE-REQ 2: If you have a CloudTrail trail created for your account, make sure that the "Exclude AWS KMS events" checkbox is checked under the CloudTrail -> Management Events setting. Checking this checkbox prevents AWS KMS events from being logged in CloudTrail. Ingesting data into the Feature Store triggers KMS events, and depending on the size of the data, logging them could result in a significant cost. For the purpose of this demo, it is therefore recommended that AWS KMS events are not logged in CloudTrail.
Use the CloudFormation template available in the templates folder of this repository to launch a CloudFormation stack. You must use expedia-feature-store-demo-v2 as the stack name (using a different name would require a change in the notebook code). All parameters needed by the template have default values; leave them unchanged unless you have a reason to change them.
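The stack can also be launched programmatically. The following is a sketch using boto3's CloudFormation client, not a tested deployment script: the template file name and the parameter key in the usage comment are placeholders, since the actual names are defined in the templates folder. Parameters that are not passed fall back to the template defaults, and because the template creates an IAM role, an IAM capability has to be acknowledged.

```python
STACK_NAME = "expedia-feature-store-demo-v2"  # the notebooks expect this name


def launch_stack(cfn, template_body, parameters, stack_name=STACK_NAME):
    """Create the stack and block until CloudFormation reports completion."""
    cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        # Only override what you need; omitted parameters keep their defaults.
        Parameters=[
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
        # Acknowledge that the template creates IAM resources.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return stack_name


# Usage (requires AWS credentials; file name and parameter key are placeholders):
#   import boto3
#   cfn = boto3.client("cloudformation", region_name="us-east-1")
#   with open("templates/<template-file>") as f:
#       launch_stack(cfn, f.read(), {"<LambdaCodeBucketParameter>": "my-lambda-code-bucket"})
```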
NOTE: This code has been tested only in the us-east-1 region. It is expected to work in other regions as well, but this has not been verified. You can view the CloudFormation template directly by looking here. The stack will take a few minutes to launch. When it completes, you can view the items created by clicking on the Resources tab.
Once the stack is complete, browse to Amazon SageMaker in the AWS console and click on the 'Notebook Instances' tab on the left.
Click either 'Jupyter' or 'JupyterLab' to access the SageMaker Notebook instance. The CloudFormation template has cloned this git repository into the notebook instance for you. All of the example code to work through is in the notebooks directory.
The dataset used for this code is available on Kaggle and can be downloaded directly from the Kaggle website. The dataset is NOT included as part of this repository. The CloudFormation template creates an S3 bucket to hold the raw data. The data downloaded from the Kaggle website needs to be uploaded to a folder called raw_data in this bucket. See the Outputs section of the CloudFormation stack and look for DataBucketName; this is the name of the bucket created by the CloudFormation stack, into which the raw data must be uploaded manually. Create a folder called raw_data in this bucket and upload the files destinations.csv from the Kaggle dataset to this bucket in the raw_data folder.
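As an alternative to uploading through the console, the lookup of DataBucketName and the upload can be sketched with boto3 as below. The list of local files is left to the caller, and the local path in the usage comment is a placeholder; the stack name, the DataBucketName output, and the raw_data prefix are the ones named in this README.

```python
import os


def get_stack_output(describe_stacks_response, output_key):
    """Pull a single output value out of a describe_stacks response."""
    outputs = describe_stacks_response["Stacks"][0].get("Outputs", [])
    for output in outputs:
        if output["OutputKey"] == output_key:
            return output["OutputValue"]
    raise KeyError(f"stack output not found: {output_key}")


def upload_raw_data(cfn, s3, local_files, stack_name="expedia-feature-store-demo-v2"):
    """Upload local files to raw_data/ in the bucket created by the stack."""
    response = cfn.describe_stacks(StackName=stack_name)
    bucket = get_stack_output(response, "DataBucketName")
    keys = []
    for path in local_files:
        # S3 has no real folders; the raw_data/ prefix plays that role.
        key = f"raw_data/{os.path.basename(path)}"
        s3.upload_file(path, bucket, key)
        keys.append(key)
    return bucket, keys


# Usage (requires AWS credentials; the local path is a placeholder):
#   import boto3
#   upload_raw_data(boto3.client("cloudformation"), boto3.client("s3"),
#                   ["/path/to/destinations.csv"])
```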
Running the Notebooks
There are a series of notebooks which should be run in order. Follow the step-by-step guide in each notebook:
- notebooks/0_batch_ingestion.ipynb - Create feature groups and ingest data from S3. Wait 15 minutes after running this notebook to ensure that all data is available in the offline feature store.
- notebooks/1_ml_model_training.ipynb - Read feature data, train an ML model, and deploy it as a SageMaker endpoint.
- notebooks/2_online_inference.ipynb - Make hotel cluster predictions on streaming customer inputs.
- notebooks/3_lineage_tracking.ipynb - Lineage tracking of feature data and ML models.
- notebooks/4_feature_monitoring.ipynb - Feature monitoring and profiling.
- View the Kinesis Stream that is used to ingest records.
- View the Lambda function that receives the Kinesis events, reads feature data from the online feature store and triggers the model prediction.
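For orientation, putting test events on the stream (the step that drives the online inference notebook) can look roughly like the sketch below. It is an illustration rather than the notebook's code: the stream name and the event fields in the usage comment are assumptions. Each event is serialized as JSON, and the partition key decides which shard receives the record.

```python
import json


def put_events(kinesis, stream_name, events, key_field):
    """Send each event as a JSON record; returns the number of records sent."""
    sent = 0
    for event in events:
        kinesis.put_record(
            StreamName=stream_name,
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=str(event[key_field]),  # controls shard assignment
        )
        sent += 1
    return sent


# Usage (requires AWS credentials; stream name and fields are placeholders):
#   import boto3
#   kinesis = boto3.client("kinesis")
#   put_events(kinesis, "my-demo-stream",
#              [{"user_id": 1, "srch_destination_id": 7}], "user_id")
```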
CLEAN UP - IMPORTANT
To destroy the AWS resources created as part of this example, complete the following two steps:
Run notebooks/5_cleanup.ipynb to delete the S3 objects and the SageMaker endpoint (these resources are not created by the CloudFormation template).
Go to CloudFormation in the AWS console, select expedia-feature-store-demo-v2 and click 'Delete'. Verify that all the resources (S3 buckets, SageMaker notebook, SageMaker endpoint, Lambda, DynamoDB) are indeed deleted.
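The second step can also be done from code. The sketch below uses boto3's delete_stack call and the stack_delete_complete waiter. It assumes the cleanup notebook has already run: CloudFormation cannot delete an S3 bucket that still contains objects, so the order of the two steps matters.

```python
def delete_demo_stack(cfn, stack_name="expedia-feature-store-demo-v2"):
    """Delete the stack and block until CloudFormation confirms the deletion."""
    cfn.delete_stack(StackName=stack_name)
    # The waiter raises an error if the deletion fails, for example when a
    # bucket created by the stack still holds objects.
    cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)
    return stack_name


# Usage (requires AWS credentials):
#   import boto3
#   delete_demo_stack(boto3.client("cloudformation", region_name="us-east-1"))
```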
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
The running cost for this demo is $15-$20 per day.