AthenaML Tutorial

Anthony Virtuoso edited this page Dec 3, 2019 · 4 revisions

This tutorial will show you how to use Amazon Athena ML to run a Federated Query that uses SageMaker inference to detect an anomalous value in our result.

  1. Create a new IAM role that Amazon SageMaker can use to run an Athena query to generate our training dataset, train a new model, and deploy that model to a SageMaker endpoint. To do this, attach the AmazonAthenaFullAccess, AmazonSageMakerFullAccess, and AmazonS3FullAccess managed policies to the role. Note that in a production setting you should scope down the AmazonS3FullAccess policy to include only the buckets that you require for training your model.
  2. Create a new SageMaker notebook using at least an ml.m5.xlarge instance type. Be sure to use the ARN of the role we created in the previous step as the IAM role that this notebook will use when interacting with other AWS services.
  3. While the SageMaker notebook launches, create an S3 bucket that we can use to store our training data and model.
  4. Download the Tutorial Notebook File hosted in this repository (right-click its link and choose "Save as").
  5. Once the SageMaker notebook is created, launch the Jupyter notebook and upload the notebook file you downloaded in the previous step.
  6. Open the Athena-ML.ipynb notebook.
  7. Update the bucket defined in the first cell of the notebook to use the bucket you created in the above steps.
  8. From the 'Cell' menu at the top of our notebook, select "Run All" to execute all steps in the notebook.
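
Steps 1 and 3 can also be scripted rather than done in the console. The following is a minimal sketch, assuming boto3 is installed and your credentials may create IAM roles and S3 buckets. The helper names, the role name, and the bucket name are illustrative and not part of the tutorial. The `iam` and `s3` arguments are expected to be `boto3.client("iam")` and `boto3.client("s3")`.

```python
import json

# The three managed policies named in step 1.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
    "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]


def sagemaker_trust_policy() -> str:
    """Return a trust policy that lets the SageMaker service assume the role."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    })


def create_tutorial_role(iam, role_name: str) -> str:
    """Create the role, attach the managed policies, and return the role ARN."""
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=sagemaker_trust_policy(),
    )
    for policy_arn in MANAGED_POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return role["Role"]["Arn"]


def create_tutorial_bucket(s3, bucket_name: str, region: str) -> None:
    """Create the bucket that will hold the training data and model."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3.create_bucket(**kwargs)


# Example usage (hypothetical names; requires AWS credentials):
#   import boto3
#   role_arn = create_tutorial_role(boto3.client("iam"), "athena-ml-tutorial-role")
#   create_tutorial_bucket(boto3.client("s3"), "my-athena-ml-bucket", "us-east-1")
```

The role ARN returned by `create_tutorial_role` is the value to supply when creating the notebook in step 2. As noted in step 1, AmazonS3FullAccess is broader than a production setup should allow.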

This notebook will:

  • Create a table pointing to the NYC Taxi Rides Dataset.
  • Run a query to generate a training dataset of rides by day.
  • Train a RandomForest Model to detect anomalies.
  • Deploy the model to a SageMaker endpoint that our application (or query) can call.

Once the notebook execution completes (this may take more than 5 minutes), you are ready to use this model from an Athena query.

  1. From the Athena Console, create a new workgroup called "AmazonAthenaPreviewFunctionality" (if you don't already have such a workgroup). This workgroup will enable Athena ML capabilities for your query while this functionality is in Preview.
  2. Run the query below, replacing <ENDPOINT_NAME> with the name of the SageMaker endpoint that the notebook deployed:

     USING FUNCTION detect_anomaly(b INT) RETURNS DOUBLE
         TYPE SAGEMAKER_INVOKE_ENDPOINT
         WITH (sagemaker_endpoint = '<ENDPOINT_NAME>')
     SELECT detect_anomaly(number),
            time,
            number
     FROM taxi_ridership_data
     LIMIT 10;
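
The same query can be submitted from code instead of the Athena Console. The following is a minimal sketch, assuming boto3 is installed, the `athena` argument is `boto3.client("athena")`, and the table lives in the database that the workgroup resolves by default. The helper names, the endpoint-name check, and the S3 output location are illustrative assumptions and not part of the tutorial.

```python
import re
import time

# Workgroup name from step 1 above; it enables Athena ML while in Preview.
WORKGROUP = "AmazonAthenaPreviewFunctionality"

QUERY_TEMPLATE = """\
USING FUNCTION detect_anomaly(b INT) RETURNS DOUBLE
    TYPE SAGEMAKER_INVOKE_ENDPOINT
    WITH (sagemaker_endpoint = '{endpoint}')
SELECT detect_anomaly(number), time, number
FROM taxi_ridership_data
LIMIT 10"""


def build_query(endpoint_name: str) -> str:
    """Fill the endpoint name into the tutorial query."""
    # Simple guard (an assumption of this sketch): letters, digits, hyphens only.
    if not re.fullmatch(r"[A-Za-z0-9-]+", endpoint_name):
        raise ValueError(f"unexpected endpoint name: {endpoint_name!r}")
    return QUERY_TEMPLATE.format(endpoint=endpoint_name)


def run_query(athena, endpoint_name: str, output_location: str,
              poll_seconds: float = 2.0) -> list:
    """Start the query, poll until it finishes, and return the result rows."""
    started = athena.start_query_execution(
        QueryString=build_query(endpoint_name),
        WorkGroup=WORKGROUP,
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {state}")
    # Returns the first page of results; the first row holds the column headers.
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


# Example usage (hypothetical names; requires AWS credentials):
#   import boto3
#   rows = run_query(boto3.client("athena"), "my-endpoint",
#                    "s3://my-athena-ml-bucket/query-results/")
```

Keeping `build_query` separate from `run_query` makes it easy to print and inspect the SQL before anything is sent to Athena.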