

Comparing Machine Learning Algorithms' Performance using Machine Learning Operations (MLOps) and the AWS Cloud Development Kit (CDK)

This project is built using the AWS Cloud Development Kit (AWS CDK). It shows how to fully automate the life cycle of comparing two machine learning algorithms. The project solves a regression use case, determining the age of abalone, using both the XGBoost and Linear Learner algorithms. The main goal is to show how to automate data transformation, training, creation of models and endpoint configurations, and finally how to automate predictions against the deployed endpoints to determine which model produces better prediction results. Everything runs in a completely serverless environment using the following AWS services:

  • Amazon Simple Storage Service (Amazon S3)
  • AWS Lambda
  • AWS Step Functions
  • Amazon SageMaker


Prerequisites

To provision the solution, you need the following:

  1. Install Git
  2. Install Python 3.x
  3. Install the AWS CDK

Setup

You can confirm that the CDK is working by running the command below.

cdk --version

Then clone this repo and deploy its infrastructure to your account using the commands below.

## If you use HTTPS, clone with the command below
git clone https://github.com/aws-samples/aws-comparing-algorithms-performance-mlops-cdk.git

## If you use SSH, clone with the command below
git clone git@github.com:aws-samples/aws-comparing-algorithms-performance-mlops-cdk.git

## install dependencies and deploy
cd aws-comparing-algorithms-performance-mlops-cdk/cdk-app/
npm install
cdk bootstrap
cdk deploy



cdk deploy

After the solution has been deployed to your account successfully, you can confirm this from the output of the cdk deploy command, which returns the CloudFormation stack ARN as shown below.



successful deployment


Fetching and exploring the dataset

We use the Abalone dataset, originally from the UCI Machine Learning Repository.

The dataset contains 9 fields ('Rings', 'Sex', 'Length', 'Diameter', 'Height', 'Whole Weight', 'Shucked Weight', 'Viscera Weight' and 'Shell Weight'), starting with the number of rings, which indicates the age of the abalone (age equals the number of rings plus 1.5). The rings are usually counted under a microscope to estimate the abalone's age, so we use our algorithms to predict the age from the other features instead. The data frame snippet below shows each feature of the abalone dataset and its corresponding datatype.

**Data frame snippet:**
age               int     11 8 15 7 7 8 20 14 9
Sex               factor  3 levels "F","I","M", encoded as 1, 2, 3
Length            float   0.255 0.145 0.53 0.42 0.32 0.423 ...
Diameter          float   0.363 0.164 0.42 0.355 0.151 0.32 ...
Height            float   0.019 0.05 0.235 0.225 0.16 0.0755 ...
Whole.weight      float   0.312 0.213 0.532 0.412 0.601 ...
Shucked.weight    float   0.4535 0.0564 0.2363 0.1153 0.0823 ...
Viscera.weight    float   0.111 0.0222 0.4123 0.2133 0.0345 ...
Shell.weight      float   0.31 0.08 0.21 0.155 0.044 0.11 ...

The features from Sex to Shell.weight are physical measurements that can be taken with the right tools, so we avoid the complexity of having to examine the abalone under a microscope to determine its age.

To download the dataset, run the Python script named "download_and_divide.py" in the "scripts" folder. The script first checks whether the required packages are available in your environment and installs them if needed, then downloads the original dataset and divides it into two datasets named "dataset1.csv" and "dataset2.csv". We use these two datasets to show how the solution can continuously incorporate more data and use it to train the models.

python3 scripts/download_and_divide.py



running python
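For reference, here is a minimal sketch of the idea behind the script; the download URL, file handling, and output naming are assumptions for illustration only, and the script in the "scripts" folder remains the authoritative implementation.

```python
# Illustrative sketch only: download the Abalone data and split it in half.
# The real logic lives in scripts/download_and_divide.py; the source URL
# below is an assumption and may differ from what the script actually uses.
import urllib.request

SOURCE_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"

with urllib.request.urlopen(SOURCE_URL) as response:
    lines = response.read().decode("utf-8").splitlines()

# Write the two halves so the pipeline can later be fed "new" data.
half = len(lines) // 2
with open("dataset1.csv", "w") as first_half:
    first_half.write("\n".join(lines[:half]) + "\n")
with open("dataset2.csv", "w") as second_half:
    second_half.write("\n".join(lines[half:]) + "\n")
```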


Starting the execution

Now that the infrastructure is set up and the datasets are ready, let's start the execution and explore what the solution does.

  1. We start by uploading the first half of the dataset, named "dataset1.csv", to the "/Inputs" folder of the bucket named "abalone-blog-<ACCOUNT_ID>-<REGION>" (a scripted alternative is sketched after the screenshots below).

Inputs Folder

Upload dataset1.csv
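If you prefer to script the upload instead of using the console, a minimal boto3 sketch such as the one below does the same thing; the bucket name placeholders and the object key are assumptions you would adapt to your own deployment.

```python
# Upload the first half of the dataset to the bucket's Inputs prefix.
# Replace the placeholders with your account ID and Region; the object key
# shown here is an assumption.
import boto3

bucket = "abalone-blog-<ACCOUNT_ID>-<REGION>"  # bucket created by the CDK stack
s3 = boto3.client("s3")
s3.upload_file("dataset1.csv", bucket, "Inputs/dataset1.csv")
```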


Architectural Overview

This architecture serves as an example of how you can build an MLOps pipeline that orchestrates the comparison of results between two algorithms' predictions.

The solution uses a completely serverless environment, so you don't have to worry about managing infrastructure. It also makes sure that the endpoints deployed for predictions are deleted immediately after the prediction results are collected, so you don't incur additional costs.

Solution Architecture

  1. The dataset is uploaded to Amazon S3 under the /Inputs directory (prefix).

  2. Once the dataset lands in the Inputs folder, an S3 event notification invokes a Lambda function.

  3. The Lambda function then initiates the MLOps pipeline built using a Step Functions state machine.

  4. The starting Lambda function collects the Region-specific training image URIs for both the Linear Learner and XGBoost algorithms, which are used to train both algorithms on the dataset (see the illustrative snippet after this list). It also gets the Amazon SageMaker Spark Container Image, which is used to run the SageMaker Processing job.

  5. The dataset is in libsvm format, which is accepted by the XGBoost algorithm as per the Input/Output Interface for the XGBoost Algorithm. However, this format is not supported by the Linear Learner algorithm as per the Input/Output Interface for the Linear Learner Algorithm, so we run a processing job using Amazon SageMaker Data Processing with Apache Spark. The processing job transforms the data from libsvm to CSV and divides the dataset into train, validation, and test datasets. The output of the processing job is stored under the "/Xgboost" and "/Linear" directories (prefixes).

datasets folder

train validation test
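As a rough illustration of step 4, the Region-specific training image URIs can be resolved with the SageMaker Python SDK as shown below; the algorithm versions are placeholder assumptions and may differ from what the pipeline's Lambda function actually uses.

```python
# Illustrative only: resolve Region-specific training image URIs with the
# SageMaker Python SDK. The versions are placeholder assumptions; the
# pipeline's starting Lambda function defines the values it actually uses.
from sagemaker import image_uris

region = "us-east-1"  # the Region where the stack is deployed

xgboost_image = image_uris.retrieve("xgboost", region=region, version="1.2-1")
linear_learner_image = image_uris.retrieve("linear-learner", region=region, version="1")

print(xgboost_image)
print(linear_learner_image)
```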

  6. The Step Functions workflow then performs the following steps in parallel:

    A: Train both algorithms.
    B: Create models out of the trained algorithms.
    C: Create endpoint configurations and deploy prediction endpoints for both models.
    D: Invoke a Lambda function to describe the deployed endpoints and wait for the endpoints to become available.
    E: Invoke a Lambda function to perform 3 live predictions using boto3 and the "test" sample taken from the dataset to calculate the average accuracy of each model.
    F: Invoke a Lambda function to delete the deployed endpoints so they don't incur any additional charges. (A boto3 sketch of steps D, E, and F follows this list.)

  7. Finally, a Lambda function is invoked to determine which model has the better prediction accuracy.
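For orientation, steps D, E, and F map to boto3 calls along the lines of the sketch below; the endpoint name and request payload are hypothetical, and the actual Lambda functions are the ones deployed by this repository.

```python
# Hypothetical sketch of the boto3 calls behind steps D, E, and F.
# The endpoint name and CSV payload are placeholders for illustration.
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")
endpoint_name = "abalone-xgboost-endpoint"  # placeholder name

# Step D: wait until the deployed endpoint is in service.
sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)

# Step E: perform a live prediction with one CSV-serialized test row.
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Body="2,0.455,0.365,0.095,0.514,0.2245,0.101,0.15",  # example feature vector
)
print(response["Body"].read().decode("utf-8"))

# Step F: delete the endpoint so it stops incurring charges.
sm.delete_endpoint(EndpointName=endpoint_name)
```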


The overall flow of the Step Functions execution can be viewed by referring to the Step Functions definition graph below.
step functions definition exported graph

Note that training, deployment of endpoints, and live predictions are executed in parallel. Also, once the predictions are performed, all deployed endpoints are automatically deleted so they don't incur any additional charges.

You can now watch the Step Functions workflow progress by going to the Step Functions console and locating the state machine named abaloneStepFunction.

Stepfunctions console

Stepfunctions Workflow console

Stepfunctions Workflow execution


Results

After waiting for the Step Functions workflow to complete, we can see the results below. This doesn't mean that one algorithm is better than the other in all circumstances; it just means that, based on the set of hyperparameters configured for each algorithm and the number of epochs performed, they produced this performance.

step functions results graph

To make sure that you configure a set of hyperparameters that minimizes the loss and produces a better version of the models, you would need to run a hyperparameter tuning job, which runs many training jobs on your dataset using the algorithms and ranges of hyperparameters that you specify. This helps you identify which set of hyperparameters gives the best results.
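Such a tuning job is not part of the deployed pipeline, but a rough sketch using the SageMaker Python SDK might look like the following; the execution role, bucket, S3 prefixes, and hyperparameter ranges are made-up placeholders.

```python
# Rough sketch of a hyperparameter tuning job for the XGBoost algorithm using
# the SageMaker Python SDK. Role, bucket, prefixes, and ranges are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

session = sagemaker.Session()
role = "<SAGEMAKER_EXECUTION_ROLE_ARN>"  # placeholder

xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<BUCKET>/tuning-output/",  # placeholder bucket
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="reg:squarederror", num_round=100)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

# Channel S3 URIs are placeholders for the prefixes produced by the processing job.
tuner.fit({
    "train": "s3://<BUCKET>/Xgboost/train/",
    "validation": "s3://<BUCKET>/Xgboost/validation/",
})
```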


What is next?

Now you can upload the other half of the dataset, named "dataset2.csv", to the S3 "Inputs" folder. This adds more data and increases the amount of data used to train the models, showing how the process can be repeated based on the frequency at which new data is collected for training.



upload dataset2.csv

Finally, you can use this comparison to determine which algorithm is best suited for your production environment. You can then configure your Step Functions workflow to update the configuration of the production endpoint with the better-performing algorithm (a possible API call for this is sketched after the diagram below).



update prod flow
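One possible building block for that extra step is the SageMaker UpdateEndpoint API, sketched below with placeholder names; the real workflow change would also need the endpoint configuration of the winning model to already exist.

```python
# Possible building block for promoting the better-performing model: point an
# existing production endpoint at a new endpoint configuration.
# Both names below are placeholders.
import boto3

sm = boto3.client("sagemaker")
sm.update_endpoint(
    EndpointName="abalone-production-endpoint",             # existing endpoint
    EndpointConfigName="abalone-winning-algorithm-config",  # config of the better model
)
```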


Cleanup

To delete all the infrastructure created, run the command below, which deletes the CloudFormation stack used to provision the resources.

cdk destroy

cdk destroy


Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.
