# Building and Auto Scaling your own algorithm container (SageMaker Studio)

In this notebook, we'll show how to package a simple Python example which showcases the [decision tree](https://scikit-learn.org/stable/modules/tree.html) algorithm from the widely used [scikit-learn](https://scikit-learn.org/stable/index.html) machine learning package. The example is purposefully fairly trivial since the point is to show the surrounding structure that you'll want to add to your own code so you can train and host it in Amazon SageMaker.

The ideas shown here will work in any language or environment. You'll need to choose the right tools for your environment to serve HTTP requests for inference, but good HTTP environments are available in every language these days.

After training and deploying the model, we add autoscaling to its endpoint with the Application Auto Scaling API. We first register the model as a scalable target, then define an autoscaling policy.

In this example, we use a single image to support training and hosting. This is easy because it means that we only need to manage one image and we can set it up to do everything. Sometimes you'll want separate images for training and hosting because they have different requirements. Just separate the parts discussed below into separate Dockerfiles and build two images. Choosing whether to have a single image or two images is really a matter of which is more convenient for you to develop and manage.

If you're only using Amazon SageMaker for training or hosting, but not both, there is no need to build the unused functionality into your container.

## Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker

### An overview of Docker

If you're familiar with Docker already, you can skip ahead to the next section.

For many data scientists, Docker containers are a new concept, but they are not difficult, as you'll see here. 

Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. Containers are isolated from each other and from the host environment, so the way you set up your program is the way it runs, no matter where you run it.

Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands, environment variables, etc.

In some ways, a Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run on the same physical machine or virtual machine instance.

Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. We'll see an example of that below. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.

Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run. It is the underpinning of many services built in the past few years, such as [Amazon ECS].

Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.

In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting. The following sections outline how to build containers for the SageMaker environment.

Some helpful links:

* [Docker home page](http://www.docker.com)
* [Getting started with Docker](https://docs.docker.com/get-started/)
* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
* [`docker run` reference](https://docs.docker.com/engine/reference/run/)

[Amazon ECS]: https://aws.amazon.com/ecs/

### How Amazon SageMaker runs your Docker container

Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container:

* In the example here, we don't define an `ENTRYPOINT` in the Dockerfile so Docker will run the command `train` at training time and `serve` at serving time. In this example, we define these as executable Python scripts, but they could be any program that we want to start in that environment.
* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program will be run at startup and its first argument will be `train` or `serve`. The program can then look at that argument and decide what to do.
* If you are building separate containers for training and hosting (or building only for one or the other), you can define a program as an `ENTRYPOINT` in the Dockerfile and ignore (or verify) the first argument passed in. 
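If you take the `ENTRYPOINT` route from the second bullet, the entrypoint program only has to branch on its first argument. The following is a minimal Python sketch of that dispatch. `run_training` and `run_serving` are hypothetical placeholders and are not part of the sample container.

```python
from typing import Sequence


def run_training() -> str:
    # Hypothetical placeholder: your training logic would go here.
    return "training"


def run_serving() -> str:
    # Hypothetical placeholder: start your HTTP server here (e.g. launch gunicorn).
    return "serving"


def main(argv: Sequence[str]) -> str:
    """Dispatch on the first argument SageMaker passes: `train` or `serve`."""
    if len(argv) < 2:
        raise SystemExit("usage: entrypoint (train|serve)")
    mode = argv[1]
    if mode == "train":
        return run_training()
    if mode == "serve":
        return run_serving()
    # Anything else is a usage error; exit non-zero.
    raise SystemExit(f"unknown mode: {mode!r}")
```

In a real entrypoint script you would finish with `main(sys.argv)`. A training-only container could accept only `train` and exit with an error otherwise, which is the verification the third bullet mentions.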

#### Running your container during training

When Amazon SageMaker runs training, your `train` script is run just like a regular Python program. A number of files are laid out for your use, under the `/opt/ml` directory:

    /opt/ml
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure

##### The input

* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since scikit-learn doesn't support distributed training, we'll ignore it here.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it's generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure. 
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
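As a sketch of how a `train` program might consume these inputs, the helpers below read `hyperparameters.json`, convert a string value to an integer, and list the files in a File-mode channel. The `prefix` argument is a hypothetical addition so the code can be pointed at a local copy of the tree; inside SageMaker it stays `/opt/ml`. The channel name `training` is what the SageMaker Python SDK uses when `fit` receives a single S3 location, as it does later in this notebook.

```python
import json
from pathlib import Path

PREFIX = "/opt/ml"  # root of the directory layout shown above


def load_hyperparameters(prefix=PREFIX):
    """Return hyperparameters.json as a dict; every value arrives as a string."""
    path = Path(prefix) / "input" / "config" / "hyperparameters.json"
    if not path.exists():
        return {}
    with open(path) as f:
        return json.load(f)


def list_channel_files(channel, prefix=PREFIX):
    """List every file copied into a File-mode channel, including subdirectories."""
    channel_dir = Path(prefix) / "input" / "data" / channel
    return sorted(str(p) for p in channel_dir.rglob("*") if p.is_file())


def get_int(hyperparameters, name, default=None):
    """Convert a string hyperparameter to int, falling back to a default if absent."""
    raw = hyperparameters.get(name)
    return int(raw) if raw is not None else default
```

For example, `get_int(load_hyperparameters(), "max_leaf_nodes")` would turn a string such as `"5"` into `5`. The name `max_leaf_nodes` is only an illustrative hyperparameter (it is a parameter of scikit-learn's `DecisionTreeClassifier`).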

##### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it will be ignored.
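The output contract can be sketched the same way: save the model under `/opt/ml/model` on success, and on failure write a description to `/opt/ml/output/failure` and exit with a non-zero status. In this sketch `fit` is any hypothetical zero-argument callable that returns a picklable model, and the file name `model.pkl` is an arbitrary choice.

```python
import pickle
import traceback
from pathlib import Path

PREFIX = "/opt/ml"  # root of the directory layout shown above


def run_training(fit, prefix=PREFIX):
    """Train with `fit()`, save the model, or record why training failed.

    Returns the process exit status: 0 on success, non-zero on failure.
    """
    model_dir = Path(prefix) / "model"
    output_dir = Path(prefix) / "output"
    try:
        model = fit()
        model_dir.mkdir(parents=True, exist_ok=True)
        # Everything in model_dir gets packaged and uploaded to S3 by SageMaker.
        with open(model_dir / "model.pkl", "wb") as f:
            pickle.dump(model, f)
        return 0
    except Exception as exc:
        output_dir.mkdir(parents=True, exist_ok=True)
        # The contents of this file surface as FailureReason in DescribeTrainingJob.
        with open(output_dir / "failure", "w") as f:
            f.write(f"Exception during training: {exc}\n{traceback.format_exc()}")
        return 255
```

A `train` script would then call `sys.exit(run_training(my_fit))`, where `my_fit` is your own training function.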

#### Running your container during hosting

Hosting has a very different model than training because hosting is responding to inference requests that come in via HTTP. In this example, we use our recommended Python serving stack to provide robust and scalable serving of inference requests:

![Request serving stack](stack.png)

This stack is implemented in the sample code here and you can mostly just leave it alone. 

Amazon SageMaker uses two URLs in the container:

* `/ping` will receive `GET` requests from the infrastructure. Your program should return 200 if the container is up and accepting requests.
* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these will be passed in as well. 
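To make the `/ping` and `/invocations` contract concrete, here is a minimal sketch written as a bare WSGI application, the interface that gunicorn serves. It deliberately avoids Flask (which the sample's `predictor.py` uses) so it runs with the standard library alone. `predict_rows` is a hypothetical stand-in for a real model, and only `text/csv` input is handled.

```python
def predict_rows(rows):
    # Hypothetical stand-in for a real model: label each row by its first feature.
    return ["large" if float(row[0]) >= 5.0 else "small" for row in rows]


def app(environ, start_response):
    """Minimal WSGI app serving the two routes SageMaker calls."""
    path = environ.get("PATH_INFO", "")
    method = environ.get("REQUEST_METHOD", "GET")

    if path == "/ping" and method == "GET":
        # Health check: 200 means the container is up and can take requests.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b""]

    if path == "/invocations" and method == "POST":
        if not environ.get("CONTENT_TYPE", "").startswith("text/csv"):
            start_response("415 Unsupported Media Type", [("Content-Type", "text/plain")])
            return [b"This sketch only accepts text/csv\n"]
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length).decode("utf-8")
        rows = [line.split(",") for line in body.splitlines() if line.strip()]
        payload = "\n".join(predict_rows(rows)) + "\n"
        start_response("200 OK", [("Content-Type", "text/csv")])
        return [payload.encode("utf-8")]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found\n"]
```

For a quick local check you could serve it with `wsgiref.simple_server.make_server("", 8080, app).serve_forever()`; port 8080 is the port SageMaker sends hosting requests to.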

The container will have the model files in the same place they were written during training:

    /opt/ml
    `-- model
        `-- <model files>



### The parts of the sample container

In the `container` directory are all the components you need to package the sample algorithm for Amazon SageMaker:

    .
    |-- Dockerfile
    |-- build_and_push.sh
    `-- decision_trees
        |-- nginx.conf
        |-- predictor.py
        |-- serve
        |-- train
        `-- wsgi.py

Let's discuss each of these in turn:

* __`Dockerfile`__ describes how to build your Docker container image. More details below.
* __`build_and_push.sh`__ is a script that uses the Dockerfile to build your container image and then pushes it to ECR. We'll invoke the commands directly later in this notebook, but you can just copy and run the script for your own algorithms.
* __`decision_trees`__ is the directory which contains the files that will be installed in the container.
* __`local_test`__ is a directory that shows how to test your new container on any computer that can run Docker, including an Amazon SageMaker notebook instance. Using this method, you can quickly iterate using small datasets to eliminate any structural bugs before you use the container with Amazon SageMaker. We'll walk through local testing later in this notebook.

In this simple application, we only install five files in the container. You may only need that many or, if you have many supporting routines, you may wish to install more. These five show the standard structure of our Python containers, although you are free to choose a different toolset and therefore could have a different layout. If you're writing in a different programming language, you'll certainly have a different layout depending on the frameworks and tools you choose.

The files that we'll put in the container are:

* __`nginx.conf`__ is the configuration file for the nginx front-end. Generally, you should be able to take this file as-is.
* __`predictor.py`__ is the program that actually implements the Flask web server and the decision tree predictions for this app. You'll want to customize the actual prediction parts to your application. Since this algorithm is simple, we do all the processing here in this file, but you may choose to have separate files for implementing your custom logic.
* __`serve`__ is the program started when the container is started for hosting. It simply launches the gunicorn server which runs multiple instances of the Flask app defined in `predictor.py`. You should be able to take this file as-is.
* __`train`__ is the program that is invoked when the container is run for training. You will modify this program to implement your training algorithm.
* __`wsgi.py`__ is a small wrapper used to invoke the Flask app. You should be able to take this file as-is.

In summary, the two files you will probably want to change for your application are `train` and `predictor.py`.

### The Dockerfile

The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter than a full operating system, however, because it takes advantage of Linux on the host machine for the basic operations.

For the Python science stack, we start from the `python:3.7-slim-buster` image (a slim Debian base with Python preinstalled) and use the standard tools (`apt-get` and `pip`) to install what scikit-learn needs. Finally, we add the code that implements our specific algorithm to the container and set up the right environment to run under.

Along the way, we clean up extra space. This makes the container smaller and faster to start.

Let's look at the Dockerfile for the example:

In [10]:
!cat container/Dockerfile

# Build an image that can do training and inference in SageMaker
# This is a Python 3 image that uses the nginx, gunicorn, flask stack
# for serving inferences in a stable way.

FROM python:3.7-slim-buster

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         nginx \
         ca-certificates

# Here we get all python packages.
# There's substantial overlap between scipy and numpy that we eliminate by
# linking them together. Likewise, pip leaves the install caches populated which uses
# a significant amount of space. These optimizations save a fair amount of space in the
# image, which reduces start up time.
RUN pip --no-cache-dir install numpy==1.16.2 scipy==1.2.1 scikit-learn==0.20.2 pandas flask gunicorn

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files whi

### Building and registering the container

Let's install the [SageMaker Docker Build](https://github.com/aws-samples/sagemaker-studio-image-build-cli) - a CLI for building Docker images in SageMaker Studio using AWS CodeBuild.

In [11]:
!pip install sagemaker-studio-image-build

  from cryptography.utils import int_from_bytes
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting sagemaker-studio-image-build
  Downloading sagemaker_studio_image_build-0.6.0.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sagemaker-studio-image-build
  Building wheel for sagemaker-studio-image-build (setup.py) ... done
  Created wheel for sagemaker-studio-image-build: filename=sagemaker_studio_image_build-0.6.0-py3-none-any.whl size=13469 sha256=3b1dfebabc60b907d9f4ade89425724ea0f290b9c8bd4eba8b6a275fdc11bf2b
  Stored in directory: /tmp/pip-ephem-wheel-cache-6ip12oqz/wheels/c1/9c/e8/cbf0266d9d9b1b6161f7ba9ddf572d02aacd411e8a5b4d186b
Successfully built sagemaker-studio-image-build
Installing collected packages: sagemaker-studio-image-build
Successfully installed sagemaker-studio-image-build-0.6.0
You should consider upgrading via the '/opt

The following shell code shows how to build the container image and push the container image to ECR using `sm-docker build`.

This code looks for an ECR repository in the account you're using and the current default region (if you're using a SageMaker notebook instance, this will be the region where the notebook instance was created). If the repository doesn't exist, the script will create it.

In [12]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-decision-trees

cd container

chmod +x decision_trees/train
chmod +x decision_trees/serve

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(python -c "import boto3;print(boto3.Session().region_name)")
region=${region:-us-west-2}

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --region "${region}" --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --region "${region}" --repository-name "${algorithm_name}" > /dev/null
fi

sm-docker build . --repository "${algorithm_name}:latest"

...
[Container] 2022/04/14 08:51:29 Waiting for agent ping

[Container] 2022/04/14 08:51:30 Waiting for DOWNLOAD_SOURCE
[Container] 2022/04/14 08:51:31 Phase is DOWNLOAD_SOURCE
[Container] 2022/04/14 08:51:31 CODEBUILD_SRC_DIR=/codebuild/output/src705175620/src
[Container] 2022/04/14 08:51:31 YAML location is /codebuild/output/src705175620/src/buildspec.yml
[Container] 2022/04/14 08:51:31 Setting HTTP client timeout to higher timeout for S3 source
[Container] 2022/04/14 08:51:31 Processing environment variables
[Container] 2022/04/14 08:51:31 No runtime version selected in buildspec.
[Container] 2022/04/14 08:51:31 Moving to directory /codebuild/output/src705175620/src
[Container] 2022/04/14 08:51:31 Configuring ssm agent with target id: codebuild:1c13eaf2-0b73-4bb4-bb47-4662fafef48a
[Container] 2022/04/14 08:51:31 Successfully updated ssm agent configuration
[Container] 2022/04/14 08:51:31 Registering with agent
[Container] 2022/04/14 08:51:31 Phases found in YAML: 3
[Container] 2022/

## Testing your algorithm on your local machine or on an Amazon SageMaker notebook instance

While you're first packaging an algorithm for use with Amazon SageMaker, you probably want to test it yourself to make sure it's working correctly. In the directory `container/local_test`, there is a framework for doing this. It includes three shell scripts for running and using the container and a directory structure that mimics the one outlined above.

The scripts are:

* `train_local.sh`: Run this with the name of the image and it will run training on the local tree. For example, you can run `$ ./train_local.sh sagemaker-decision-trees`. It will generate a model under the `test_dir/model` directory. You'll want to set up the directory `test_dir/input/data/...` with the correct channels and data for your algorithm. Also, you'll want to edit the file `test_dir/input/config/hyperparameters.json` to contain the hyperparameter settings that you want to test (as strings).
* `serve_local.sh`: Run this with the name of the image once you've trained the model and it should serve the model. For example, you can run `$ ./serve_local.sh sagemaker-decision-trees`. It will run and wait for requests. Simply use the keyboard interrupt to stop it.
* `predict.sh`: Run this with the name of a payload file and (optionally) the HTTP content type you want. The content type will default to `text/csv`. For example, you can run `$ ./predict.sh payload.csv text/csv`.
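If you'd rather drive the local server from Python than from `predict.sh`, the sketch below builds the equivalent POST request with `urllib`. It assumes `serve_local.sh` publishes the container's port 8080 on `localhost`; adjust `base_url` if your setup differs.

```python
import urllib.request


def make_invocation_request(payload: bytes, content_type: str = "text/csv",
                            base_url: str = "http://localhost:8080"):
    """Build (without sending) a POST to /invocations, like predict.sh does."""
    return urllib.request.Request(
        url=f"{base_url}/invocations",
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )


def invoke(payload_path: str, content_type: str = "text/csv") -> str:
    """Send a payload file to the locally running container and return the body."""
    with open(payload_path, "rb") as f:
        request = make_invocation_request(f.read(), content_type)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

With the container running, `print(invoke("payload.csv"))` would mirror `./predict.sh payload.csv text/csv`.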

The directories as shipped are set up to test the decision trees sample algorithm presented here.

## Part 2: Using your Algorithm in Amazon SageMaker

Once you have your container packaged, you can use it to train models and use the model for hosting or batch transforms. Let's do that with the algorithm we made above.

## Set up the environment

Here we specify the S3 prefix to use (the data will be stored in the session's default bucket) and the role that will be used for working with SageMaker.

In [13]:
import os
import re

import boto3
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

# S3 prefix under which the training data will be uploaded
prefix = "DEMO-scikit-byo-iris"

# Define the IAM role that SageMaker will use for training and hosting
role = get_execution_role()

## Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [14]:
import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()

## Upload the data for training

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we're using the classic [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which we have included.

We can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket.

In [15]:
WORK_DIRECTORY = "data"

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)

## Create an estimator and fit the model

In order to use SageMaker to fit our algorithm, we'll create an `Estimator` that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training:

* The __image URI__ of the container. This is built from the account, the region, and the repository name used in the shell commands above.
* The __role__. As defined above.
* The __instance count__ which is the number of machines to use for training.
* The __instance type__ which is the type of machine to use for training.
* The __output path__ determines where the model artifact will be written.
* The __session__ is the SageMaker session object that we defined above.

Then we call `fit()` on the estimator to train on the data that we uploaded above.

In [17]:
%%time

account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name
image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-decision-trees:latest".format(account, region)


tree = sage.estimator.Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.c4.2xlarge",
    output_path="s3://{}/output".format(sess.default_bucket()),
    sagemaker_session=sess,
)

tree.fit(data_location)

2022-04-14 08:58:08 Starting - Starting the training job...
2022-04-14 08:58:32 Starting - Preparing the instances for training
ProfilerReport-1649926688: InProgress
.........
2022-04-14 09:00:01 Downloading - Downloading input data
2022-04-14 09:00:01 Training - Training image download completed. Training in progress..
Starting the training.
Training complete.

2022-04-14 09:00:33 Uploading - Uploading generated training model
2022-04-14 09:00:33 Completed - Training job completed
Training seconds: 27
Billable seconds: 27
CPU times: user 322 ms, sys: 13 ms, total: 335 ms
Wall time: 2min 42s


## Hosting your model
You can use a trained model to get real-time predictions from an HTTP endpoint. The following steps walk you through the process.

### Deploy the model

Deploying the model to SageMaker hosting just requires a `deploy` call on the fitted estimator. This call takes an instance count, an instance type, and optionally serializer and deserializer objects. These are used by the predictor that is created for the endpoint.

In [143]:
%%time

from sagemaker.serializers import CSVSerializer

predictor = tree.deploy(1, "ml.m5.large", serializer=CSVSerializer())

CPU times: user 73 ms, sys: 13.1 ms, total: 86.1 ms
Wall time: 2min 1s


### Choose some data and use it for a prediction

In order to do some predictions, we'll extract some of the data we used for training and do predictions against it. This is, of course, bad statistical practice, but a good way to see how the mechanism works.

In [144]:
shape = pd.read_csv("data/iris.csv", header=None)
shape.sample(3)

|     | 0          | 1   | 2   | 3   | 4   |
|-----|------------|-----|-----|-----|-----|
| 57  | versicolor | 4.9 | 2.4 | 3.3 | 1.0 |
| 26  | setosa     | 5.0 | 3.4 | 1.6 | 0.4 |
| 11  | setosa     | 4.8 | 3.4 | 1.6 | 0.2 |


In [145]:
# drop the label column in the training set
shape.drop(shape.columns[[0]], axis=1, inplace=True)
shape.sample(3)

|     | 1   | 2   | 3   | 4   |
|-----|-----|-----|-----|-----|
| 49  | 5.0 | 3.3 | 1.4 | 0.2 |
| 132 | 6.4 | 2.8 | 5.6 | 2.2 |
| 4   | 5.0 | 3.6 | 1.4 | 0.2 |


In [146]:
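import itertools

# Pick rows 40-49 from each 50-row class block (setosa, versicolor, virginica)
a = [50 * i for i in range(3)]  # block offsets: 0, 50, 100
b = [40 + i for i in range(10)]  # positions within each block: 40..49
indices = [i + j for i, j in itertools.product(a, b)]

# Drop the last index (row 149), leaving 29 rows
test_data = shape.iloc[indices[:-1]]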

Prediction is as easy as calling predict with the predictor we got back from deploy and the data we want to do predictions with. The serializers take care of doing the data conversions for us.

In [147]:
print(predictor.predict(test_data.values).decode("utf-8"))

setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
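The same prediction can also be requested without the SageMaker Python SDK by calling the `InvokeEndpoint` API of the `sagemaker-runtime` service. The sketch below serializes rows to CSV with the standard `csv` module (an approximation of what `CSVSerializer` sends, not its actual implementation) and takes the boto3 client as an argument so it can be exercised with a stub.

```python
import csv
import io


def to_csv_payload(rows):
    """Serialize rows of numbers as CSV text, one line per row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def predict_via_runtime(runtime_client, endpoint_name, rows):
    """Send rows to the endpoint with InvokeEndpoint and return one label per line.

    `runtime_client` is assumed to be boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv_payload(rows).encode("utf-8"),
    )
    # The response body is a stream; read it and split the newline-separated labels.
    return response["Body"].read().decode("utf-8").splitlines()
```

With a real client this might look like `predict_via_runtime(boto3.client("sagemaker-runtime"), predictor.endpoint_name, test_data.values.tolist())`.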



## Automatically Scale Amazon SageMaker Models

Amazon SageMaker supports automatic scaling (autoscaling) for your hosted models. Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. When the workload increases, autoscaling brings more instances online. When the workload decreases, autoscaling removes unnecessary instances so that you don't pay for provisioned instances that you aren't using.

Let's create an Application Auto Scaling client to configure the autoscaling options.

In [148]:
client = boto3.client('application-autoscaling')

In [149]:
endpoint_name = predictor.endpoint_name
endpoint_name

'sagemaker-decision-trees-2022-04-14-14-36-03-248'

### Register a scalable target

A scalable target is a resource that Application Auto Scaling can scale out and scale in. Scalable targets are uniquely identified by the combination of resource ID, scalable dimension, and namespace.

When you register a new scalable target, you must specify values for minimum and maximum capacity. Current capacity will be adjusted within the specified range when scaling starts. Application Auto Scaling scaling policies will not scale capacity to values that are outside of this range.

We will define the resource ID and then register the SageMaker endpoint variant as a scalable target.

In [150]:
resource_id='endpoint/' + endpoint_name + '/variant/' + 'AllTraffic' # This is the format in which application autoscaling references the endpoint
resource_id

'endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic'

### Register a model

You can add autoscaling for a model with the AWS CLI or the Application Auto Scaling API. You first must register the model, then you must define an autoscaling policy.

 - `min-capacity` (`MinCapacity` in the Python API): the minimum number of instances for this model. Set it to at least 1. It must be equal to or less than the value specified for `max-capacity`.
 - `max-capacity` (`MaxCapacity` in the Python API): the maximum number of instances that Application Auto Scaling should manage. Set it to at least 1. It must be equal to or greater than the value specified for `min-capacity`.
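The capacity rules above can be checked before calling `register_scalable_target`. This sketch encodes them as stated here and also shows how any desired count ends up clamped to the registered range; `validate_capacity` and `clamp_capacity` are illustrative helpers, not part of boto3.

```python
def validate_capacity(min_capacity: int, max_capacity: int) -> None:
    """Raise ValueError if the min/max capacity rules are violated."""
    if min_capacity < 1:
        raise ValueError("min_capacity must be at least 1")
    if max_capacity < 1:
        raise ValueError("max_capacity must be at least 1")
    if min_capacity > max_capacity:
        raise ValueError("min_capacity must be less than or equal to max_capacity")


def clamp_capacity(desired: int, min_capacity: int, max_capacity: int) -> int:
    """Scaling policies never move capacity outside [min_capacity, max_capacity]."""
    validate_capacity(min_capacity, max_capacity)
    return max(min_capacity, min(desired, max_capacity))
```

With the values registered below (`MinCapacity=1`, `MaxCapacity=10`), a desired count of 15 would be held at 10.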

In [151]:
response = client.register_scalable_target(
    ServiceNamespace='sagemaker',  # SageMaker endpoints are registered in the "sagemaker" namespace
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=10
)
response

{'ResponseMetadata': {'RequestId': '5470a4a2-b4d5-4f00-be16-f965e8685620',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '5470a4a2-b4d5-4f00-be16-f965e8685620',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '2',
   'date': 'Thu, 14 Apr 2022 14:41:24 GMT'},
  'RetryAttempts': 0}}

To get information about the scalable targets registered for the SageMaker endpoint, you can use the `describe_scalable_targets` API call.

In [152]:
response = client.describe_scalable_targets(
    ServiceNamespace='sagemaker',
    ResourceIds=[resource_id]
)
response

{'ScalableTargets': [{'ServiceNamespace': 'sagemaker',
   'ResourceId': 'endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic',
   'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
   'MinCapacity': 1,
   'MaxCapacity': 10,
   'RoleARN': 'arn:aws:iam::062083580489:role/aws-service-role/sagemaker.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint',
   'CreationTime': datetime.datetime(2022, 4, 14, 14, 41, 25, 947000, tzinfo=tzlocal()),
   'SuspendedState': {'DynamicScalingInSuspended': False,
    'DynamicScalingOutSuspended': False,
    'ScheduledScalingSuspended': False}}],
 'ResponseMetadata': {'RequestId': '9e4a9800-9734-4815-8446-421ca9fb3c3d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '9e4a9800-9734-4815-8446-421ca9fb3c3d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '560',
   'date': 'Thu, 14 Apr 2022 14:41:25 GMT'},
  'RetryAttempts': 0}}

### Define a scaling policy

To specify the metrics and target values for a scaling policy, you configure a target-tracking scaling policy. You can use either a predefined metric or a custom metric.

To quickly define a target-tracking scaling policy for a variant, use the `SageMakerVariantInvocationsPerInstance` predefined metric. 

`SageMakerVariantInvocationsPerInstance` is the average number of times per minute that each instance for a variant is invoked. **We strongly recommend using this metric**.

To use a predefined metric in a scaling policy, create a target tracking configuration for your policy. In the target tracking configuration, include a `PredefinedMetricSpecification` for the predefined metric and a `TargetValue` for the target value of that metric.

The following example is a typical policy configuration for target-tracking scaling for a variant. In this configuration, we use the `SageMakerVariantInvocationsPerInstance` predefined metric to adjust the number of variant instances so that each instance has an `InvocationsPerInstance` metric of 10, that is, about 10 invocations per minute per instance.
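For intuition about `TargetValue`, a rough estimate of the instance count a target-tracking policy aims for is the total invocations per minute divided by the per-instance target, rounded up and clamped to the registered range. The function below is only that back-of-envelope approximation; the actual service acts on CloudWatch alarms and cooldowns rather than on this formula.

```python
import math


def estimate_instances(invocations_per_minute: float, target_per_instance: float,
                       min_capacity: int = 1, max_capacity: int = 10) -> int:
    """Approximate instance count so each instance sees about the target load."""
    if target_per_instance <= 0:
        raise ValueError("target_per_instance must be positive")
    needed = math.ceil(invocations_per_minute / target_per_instance)
    # Target tracking never goes outside the registered capacity range.
    return max(min_capacity, min(needed, max_capacity))
```

At a target of 10, about 25 invocations per minute would call for roughly 3 instances, while 500 per minute would be capped at the maximum of 10.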

In [153]:
response = client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker',  # The namespace of the AWS service that provides the resource.
    ResourceId=resource_id,  # endpoint/<endpoint-name>/variant/<variant-name>
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',  # SageMaker supports only instance count
    PolicyType='TargetTrackingScaling',  # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        # The target value for the metric; here the metric is SageMakerVariantInvocationsPerInstance.
        'TargetValue': 10.0,
        'PredefinedMetricSpecification': {
            # Average number of times per minute that each instance for a variant is invoked.
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        # Cooldown periods keep the policy from adding or removing instances
        # before the effects of previous scaling activities are visible.
        # Tune them to your instance startup time or other application needs.
        # ScaleInCooldown: seconds after a scale-in activity completes before another scale-in can start.
        'ScaleInCooldown': 60,
        # ScaleOutCooldown: seconds after a scale-out activity completes before another scale-out can start.
        'ScaleOutCooldown': 30,
        # 'DisableScaleIn': True|False - indicates whether scale in by the target tracking policy is disabled.
        # If True, the policy won't remove capacity from the scalable resource.
    },
)
response

{'PolicyARN': 'arn:aws:autoscaling:us-east-1:062083580489:scalingPolicy:452ccaf6-9083-4289-b8db-56601539f2c9:resource/sagemaker/endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic:policyName/Invocations-ScalingPolicy',
 'Alarms': [{'AlarmName': 'TargetTracking-endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic-AlarmHigh-343b9f8d-e73a-4743-b1e8-77d3ddf91e8c',
   'AlarmARN': 'arn:aws:cloudwatch:us-east-1:062083580489:alarm:TargetTracking-endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic-AlarmHigh-343b9f8d-e73a-4743-b1e8-77d3ddf91e8c'},
  {'AlarmName': 'TargetTracking-endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic-AlarmLow-a9e5cadf-420c-42f4-a9b9-f61dbdf27fd1',
   'AlarmARN': 'arn:aws:cloudwatch:us-east-1:062083580489:alarm:TargetTracking-endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic-AlarmLow-a9e5cadf-420c-42f4-a9b9-f61dbdf27fd1'}],
 'ResponseMetadata'

To get information about the scaling policy, you can use the `describe_scaling_policies` API call.

In [154]:
response = client.describe_scaling_policies(
    PolicyNames=[
        'Invocations-ScalingPolicy',
    ],
    ServiceNamespace='sagemaker',
    ResourceId=resource_id
)
response

{'ScalingPolicies': [{'PolicyARN': 'arn:aws:autoscaling:us-east-1:062083580489:scalingPolicy:452ccaf6-9083-4289-b8db-56601539f2c9:resource/sagemaker/endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic:policyName/Invocations-ScalingPolicy',
   'PolicyName': 'Invocations-ScalingPolicy',
   'ServiceNamespace': 'sagemaker',
   'ResourceId': 'endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic',
   'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
   'PolicyType': 'TargetTrackingScaling',
   'TargetTrackingScalingPolicyConfiguration': {'TargetValue': 10.0,
    'PredefinedMetricSpecification': {'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'},
    'ScaleOutCooldown': 30,
    'ScaleInCooldown': 60},
   'Alarms': [{'AlarmName': 'TargetTracking-endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic-AlarmHigh-343b9f8d-e73a-4743-b1e8-77d3ddf91e8c',
     'AlarmARN': 'arn:aws:cloudwatch:us-east-1:0
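The full response is long, so it can help to pull out only the settings that drive the test below. The helper is a minimal sketch: `summarize_target_tracking` is a hypothetical name, and it assumes the response shape shown above. Non-target-tracking policies (for example step scaling) are skipped, and a policy using a customized rather than predefined metric is labelled `"custom"`.

```python
from typing import Any, Dict, List


def summarize_target_tracking(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract target-tracking settings from a describe_scaling_policies response."""
    summaries = []
    for policy in response.get("ScalingPolicies", []):
        config = policy.get("TargetTrackingScalingPolicyConfiguration")
        if config is None:
            # Step-scaling policies carry no target-tracking configuration; skip them.
            continue
        metric = (
            config.get("PredefinedMetricSpecification", {})
            .get("PredefinedMetricType", "custom")  # "custom" = customized metric spec
        )
        summaries.append({
            "policy": policy["PolicyName"],
            "metric": metric,
            "target": config["TargetValue"],
            "scale_out_cooldown_s": config.get("ScaleOutCooldown"),
            "scale_in_cooldown_s": config.get("ScaleInCooldown"),
        })
    return summaries
```

Calling `summarize_target_tracking(response)` on the output above should yield one entry with a target of `10.0`, the `SageMakerVariantInvocationsPerInstance` metric, and cooldowns of 30 and 60 seconds.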

### Testing Autoscaling

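We defined a target-tracking scaling policy whose target is 10 invocations per instance per minute (the `SageMakerVariantInvocationsPerInstance` metric), with a scale-out cooldown of 30 seconds and a scale-in cooldown of 60 seconds. Let's invoke the endpoint for a few minutes and see what happens.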
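Before running the test, a back-of-envelope estimate helps set expectations. The loop below sends `counter` requests every 5 seconds, that is about `12 * counter` invocations per minute. The sketch assumes target tracking eventually settles near the smallest instance count that keeps invocations per instance at or below the target; this is a simplification, not how Application Auto Scaling computes capacity internally, and it ignores alarm evaluation periods and cooldowns, so the observed counts will lag behind it. The function names are hypothetical, and `min_capacity=1` / `max_capacity=10` are placeholders for whatever limits were registered for the scalable target.

```python
import math


def invocations_per_minute(counter: int, sleep_seconds: float = 5.0) -> float:
    """Approximate request rate of the test loop: `counter` requests per
    iteration, one iteration every `sleep_seconds` (request latency ignored)."""
    return counter * 60.0 / sleep_seconds


def estimate_instance_count(total_invocations_per_minute: float,
                            target_per_instance: float = 10.0,
                            min_capacity: int = 1,
                            max_capacity: int = 10) -> int:
    """Smallest instance count keeping invocations/instance <= target,
    clamped to the (assumed) registered min/max capacity."""
    needed = math.ceil(total_invocations_per_minute / target_per_instance)
    return max(min_capacity, min(max_capacity, needed))


for minute in (1, 5, 14):
    rate = invocations_per_minute(minute)
    print(minute, rate, estimate_instance_count(rate))
```

Even the first minute (about 12 invocations per minute on one instance) exceeds the target of 10, so scale-out should begin early in the test, and later in the test the estimate is capped by the maximum capacity.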

In [156]:
import time
from datetime import datetime

print(f"{datetime.now()} - Sending test traffic for 15 minutes to the endpoint {endpoint_name}. \nPlease wait...")

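counter = 1
for x in range(180):            # 180 iterations x 5 s sleep: about 15 minutes plus request latency
    for _ in range(counter):    # send `counter` requests per iteration; counter grows by one each minute
        result = predictor.predict(test_data.values).decode("utf-8")
    time.sleep(5)
    if x % 12 == 0 and x > 1:   # every 12 iterations (about one minute), raise the load
        print(f"{datetime.now()} - {counter} minutes passed.")
        counter = counter + 1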
        
print("Done!")

2022-04-14 14:42:01.947807 - Sending test traffic for 15 minutes to the endpoint sagemaker-decision-trees-2022-04-14-14-36-03-248. 
Please wait...
2022-04-14 14:43:07.307417 - 1 minutes passed.
2022-04-14 14:44:07.802461 - 2 minutes passed.
2022-04-14 14:45:08.522068 - 3 minutes passed.
2022-04-14 14:46:09.513451 - 4 minutes passed.
2022-04-14 14:47:10.647718 - 5 minutes passed.
2022-04-14 14:48:11.909096 - 6 minutes passed.
2022-04-14 14:49:13.610748 - 7 minutes passed.
2022-04-14 14:50:15.573243 - 8 minutes passed.
2022-04-14 14:51:17.689110 - 9 minutes passed.
2022-04-14 14:52:20.234165 - 10 minutes passed.
2022-04-14 14:53:22.945126 - 11 minutes passed.
2022-04-14 14:54:25.892102 - 12 minutes passed.
2022-04-14 14:55:28.923613 - 13 minutes passed.
2022-04-14 14:56:32.315079 - 14 minutes passed.
Done!


### Query Endpoint Autoscaling History

You can view the status of your endpoint's scaling activities with `DescribeScalingActivities` (`describe_scaling_activities` in boto3). It returns descriptive information about the scaling activities in the specified namespace from the previous six weeks.

The output shows the scale-out activities. If you wait a few minutes after the test traffic stops, you will also see the scale-in activities.
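A single `describe_scaling_activities` call returns one page of results, and six weeks of history can span several pages linked by `NextToken`. A minimal sketch that follows the pages with the boto3 paginator; `iter_scaling_activities` is a hypothetical helper name, and `client` and `resource_id` are assumed to be the Application Auto Scaling client and resource ID defined earlier in this notebook.

```python
from typing import Any, Dict, Iterator


def iter_scaling_activities(client: Any, resource_id: str) -> Iterator[Dict[str, Any]]:
    """Yield every scaling activity for one SageMaker variant, across all pages."""
    paginator = client.get_paginator("describe_scaling_activities")
    for page in paginator.paginate(ServiceNamespace="sagemaker", ResourceId=resource_id):
        # Each page holds a list of activities; NextToken handling is done by the paginator.
        yield from page.get("ScalingActivities", [])
```

For example, `activities = list(iter_scaling_activities(client, resource_id))` collects the full history into one list.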

In [158]:
response = client.describe_scaling_activities(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id
)

response

{'ScalingActivities': [{'ActivityId': '0b132bb7-9a69-44c9-9f73-eba338aabdc2',
   'ServiceNamespace': 'sagemaker',
   'ResourceId': 'endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic',
   'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
   'Description': 'Setting desired instance count to 10.',
   'Cause': 'monitor alarm TargetTracking-endpoint/sagemaker-decision-trees-2022-04-14-14-36-03-248/variant/AllTraffic-AlarmHigh-343b9f8d-e73a-4743-b1e8-77d3ddf91e8c in state ALARM triggered policy Invocations-ScalingPolicy',
   'StartTime': datetime.datetime(2022, 4, 14, 14, 54, 8, 25000, tzinfo=tzlocal()),
   'EndTime': datetime.datetime(2022, 4, 14, 14, 55, 51, 793000, tzinfo=tzlocal()),
   'StatusCode': 'Successful',
   'StatusMessage': 'Successfully set desired instance count to 10. Change successfully fulfilled by sagemaker.'},
  {'ActivityId': 'c31392ea-7821-4920-b5b8-033b7c7f05ef',
   'ServiceNamespace': 'sagemaker',
   'ResourceId': 'endpoint/sage

In [165]:
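for scaling_activity in response['ScalingActivities']:
    # EndTime (and possibly StatusMessage) is only present once an activity has finished
    end_time = scaling_activity.get('EndTime', 'in progress')
    print(f"{scaling_activity['StartTime']} - {end_time} - {scaling_activity.get('StatusMessage', '')}")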

2022-04-14 14:54:08.025000+00:00 - 2022-04-14 14:55:51.793000+00:00 - Successfully set desired instance count to 10. Change successfully fulfilled by sagemaker.
2022-04-14 14:51:08.040000+00:00 - 2022-04-14 14:53:27.434000+00:00 - Successfully set desired instance count to 7. Change successfully fulfilled by sagemaker.
2022-04-14 14:49:08.076000+00:00 - 2022-04-14 14:50:44.816000+00:00 - Successfully set desired instance count to 5. Change successfully fulfilled by sagemaker.
2022-04-14 14:47:08.021000+00:00 - 2022-04-14 14:48:50.007000+00:00 - Successfully set desired instance count to 3. Change successfully fulfilled by sagemaker.
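The activities come back newest first. To read them as a scale-out timeline, a minimal sketch can extract the desired instance count and sort by start time. It assumes the `Description` text keeps the "Setting desired instance count to N." wording seen in the output above; that wording is not a documented contract, so activities that do not match are simply skipped. `scaling_timeline` is a hypothetical helper name.

```python
import re
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple

# Keyed to the wording observed above: "Setting desired instance count to 10."
_COUNT_RE = re.compile(r"desired instance count to (\d+)")


def scaling_timeline(
    activities: List[Dict[str, Any]],
) -> List[Tuple[datetime, Optional[datetime], int]]:
    """Return (start, end, desired_count) tuples in chronological order."""
    timeline = []
    for activity in activities:
        match = _COUNT_RE.search(activity.get("Description", ""))
        if match is None:
            continue  # skip activities whose text does not follow this wording
        # EndTime is missing while an activity is still in progress.
        timeline.append((activity["StartTime"], activity.get("EndTime"), int(match.group(1))))
    return sorted(timeline, key=lambda item: item[0])
```

Applied to `response['ScalingActivities']` from this run, it should list the desired counts 3, 5, 7 and 10 in that order.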


### Optional cleanup
When you're done with the endpoint, you'll want to clean it up.
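If you want to remove the autoscaling configuration explicitly, for example to keep the endpoint but stop scaling it, or to tidy up before deleting it, the Application Auto Scaling API provides `delete_scaling_policy` and `deregister_scalable_target`. A minimal sketch, assuming `client` and `resource_id` are the ones defined earlier in this notebook and that the policy name matches the one used above; `remove_autoscaling` is a hypothetical helper name.

```python
from typing import Any

SCALABLE_DIMENSION = "sagemaker:variant:DesiredInstanceCount"


def remove_autoscaling(client: Any, resource_id: str,
                       policy_name: str = "Invocations-ScalingPolicy") -> None:
    """Delete the scaling policy, then deregister the variant as a scalable target."""
    # Remove the target-tracking policy attached to the variant.
    client.delete_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=SCALABLE_DIMENSION,
    )
    # Stop managing the variant's instance count with Application Auto Scaling.
    client.deregister_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=SCALABLE_DIMENSION,
    )
```

Run `remove_autoscaling(client, resource_id)` before `predictor.delete_endpoint()` if you choose to use it.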

In [142]:
predictor.delete_endpoint()