# Run Qwen QwQ 32B reasoning model efficiently on Amazon SageMaker AI with SGLang and auto scale down to zero

> This notebook has been tested on the Python 3 kernel of a SageMaker Jupyter notebook instance (ml.m5.xlarge) with 50 GB of disk.

Amazon SageMaker AI provides the ability to build Docker containers to run on SageMaker endpoints, where they listen for health checks on /ping and receive real-time inference requests on /invocations. Using SageMaker AI for inference offers several benefits:

- **Scalability**: SageMaker AI can automatically scale your inference endpoints up and down based on demand, ensuring your models can handle varying workloads.
- **High Availability**: SageMaker AI manages the infrastructure and maintains the availability of your inference endpoints, so you don't have to worry about managing the underlying resources.
- **Monitoring and Logging**: SageMaker AI provides built-in monitoring and logging capabilities, making it easier to track the performance and health of your inference endpoints.
- **Security**: SageMaker AI integrates with other AWS services, such as AWS Identity and Access Management (IAM), to provide robust security controls for your inference workloads.

Note that SageMaker provides [pre-built SageMaker AI Docker images](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) that can help you quickly start with the model inference on SageMaker. It also allows you to [bring your own Docker container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html) and use it inside SageMaker AI for training and inference. To be compatible with SageMaker AI, your container must have the following characteristics:

- Your container must have a web server listening on port 8080.
- Your container must accept `POST` requests on the `/invocations` endpoint and respond to `GET` health-check requests on the `/ping` endpoint.
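
To make this contract concrete, here is a minimal sketch of a compliant web server that uses only the Python standard library. It is illustrative only: this notebook relies on SGLang's own server for these routes, and the names `ContractHandler` and `make_server`, as well as the echo response, are hypothetical. The sketch assumes that health checks arrive as `GET /ping` and inference requests as `POST /invocations`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ContractHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers the two routes SageMaker calls."""

    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: any 200 response signals that the container is ready
        if self.path == "/ping":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        # Inference: a real server would run the model here; this one echoes
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            request = json.loads(self.rfile.read(length) or b"{}")
            self._send_json(200, {"echo": request})
        else:
            self._send_json(404, {"error": "not found"})

    def log_message(self, format, *args):
        # Silence per-request logging to keep the example output clean
        pass


def make_server(host="0.0.0.0", port=8080):
    """Build (but do not start) a server on the port SageMaker expects."""
    return HTTPServer((host, port), ContractHandler)

# To run it inside a container: make_server().serve_forever()
```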

In this notebook, we'll demonstrate how to adapt the [**SGLang**](https://github.com/sgl-project/sglang) framework to run on SageMaker AI endpoints. SGLang is a serving framework for large language models that provides state-of-the-art performance, including a fast backend runtime for efficient serving with RadixAttention, extensive model support, and an active open-source community. For more information refer to https://docs.sglang.ai/index.html and https://github.com/sgl-project/sglang.

By using SGLang and building a custom Docker container, you can run advanced AI models like the [Qwen QwQ 32B](https://huggingface.co/Qwen/QwQ-32B) on a SageMaker AI endpoint.


## Prepare the SGLang SageMaker container

SageMaker AI makes extensive use of Docker containers for build and runtime tasks. Using containers, you can train machine learning algorithms and deploy models quickly and reliably at any scale. See [this link](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-run-image) to understand how SageMaker AI runs your inference image. 

- For model inference, SageMaker AI runs the container as:
```
docker run image serve
```

- You can provide your entrypoint script in `exec` form to tell the container how to run inference, for example:
```
ENTRYPOINT ["python", "inference.py"]
```

- When deploying ML models, one option is to archive and compress the model artifacts into a `tar.gz` file and provide the S3 path of the archive as the `ModelDataUrl` in the [`CreateModel`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API request. SageMaker AI copies the archive from the S3 location and decompresses it into the `/opt/ml/model` directory before your container starts, for use by your inference code. For large models, however, SageMaker AI also allows you to [deploy uncompressed models](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-uncompressed.html). In this example, we will show you how to deploy the uncompressed Qwen QwQ 32B model.

- To receive inference requests, the container must have a web server listening on port `8080` that accepts `POST` requests on `/invocations` and answers health checks on `/ping`.

The diagram below shows, at a high level, how to prepare your own container image so that it is compatible with SageMaker AI hosting.

![inference](./img/sagemaker-real-time-inference.png)

If you already have a Docker image, see the instructions for [adapting your own inference container for SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html). Note also that the containers provided by SageMaker AI automatically implement a web server that responds to `/invocations` and `/ping` (health check) requests. You can find more about the [prebuilt SageMaker AI docker images for deep learning in our SageMaker doc](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html).




In [None]:
!pip install sagemaker boto3 huggingface_hub --upgrade --quiet

#### Create the entrypoint serve file 
The `serve` file will be used in `exec` form and executed when the container starts. The main command to start SGLang in the SageMaker Docker image is
```
python3 -m sglang.launch_server --model-path <your model path> --host 0.0.0.0 --port 8080
```
Here `--model-path` can be set to `/opt/ml/model`, because this is where SageMaker AI copies the model artifacts from S3 on the endpoint, and `--port` is set to **8080** as required by SageMaker hosting.
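
The `serve` script in this section maps optional environment variables onto SGLang launch flags. As a rough illustration of that mapping, the sketch below reproduces the same logic in Python for a subset of the variables. The function name `build_server_args` is hypothetical; the flags and variable names are the ones used by the script.

```python
def build_server_args(env):
    """Translate optional environment variables into SGLang launch flags.

    `env` is any mapping of variable names to strings, e.g. os.environ.
    Only a subset of the variables handled by the serve script is covered.
    """
    args = ["--host", "0.0.0.0", "--port", "8080"]

    # Environment variable -> flag that takes the variable's value
    valued_flags = {
        "TENSOR_PARALLEL_DEGREE": "--tp-size",
        "DATA_PARALLEL_DEGREE": "--dp-size",
        "MEM_FRACTION_STATIC": "--mem-fraction-static",
        "QUANTIZATION": "--quantization",
    }
    for name, flag in valued_flags.items():
        value = env.get(name)
        if value:  # mirrors the shell test [ -n "$VAR" ]
            args += [flag, value]

    # Fall back to the directory where SageMaker places the model artifacts
    args += ["--model-path", env.get("MODEL_ID") or "/opt/ml/model"]

    # Boolean switch: any non-empty value enables it
    if env.get("TORCH_COMPILE"):
        args.append("--enable-torch-compile")
    return args


print(build_server_args({"TENSOR_PARALLEL_DEGREE": "4"}))
```

Building the arguments as a list rather than one string avoids quoting problems when a value contains spaces.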

In [None]:
%%writefile serve
#!/bin/bash

echo "Starting server"

SERVER_ARGS="--host 0.0.0.0 --port 8080"

if [ -n "$TENSOR_PARALLEL_DEGREE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --tp-size ${TENSOR_PARALLEL_DEGREE}"
fi

if [ -n "$DATA_PARALLEL_DEGREE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --dp-size ${DATA_PARALLEL_DEGREE}"
fi

if [ -n "$EXPERT_PARALLEL_DEGREE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --ep-size ${EXPERT_PARALLEL_DEGREE}"
fi

if [ -n "$MEM_FRACTION_STATIC" ]; then
    SERVER_ARGS="${SERVER_ARGS} --mem-fraction-static ${MEM_FRACTION_STATIC}"
fi

if [ -n "$QUANTIZATION" ]; then
    SERVER_ARGS="${SERVER_ARGS} --quantization ${QUANTIZATION}"
fi

if [ -n "$CHUNKED_PREFILL_SIZE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --chunked-prefill-size ${CHUNKED_PREFILL_SIZE}"
fi

if [ -n "$MODEL_ID" ]; then
    SERVER_ARGS="${SERVER_ARGS} --model-path ${MODEL_ID}"
else
    SERVER_ARGS="${SERVER_ARGS} --model-path /opt/ml/model"
fi

if [ -n "$TORCH_COMPILE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --enable-torch-compile"
fi

if [ -n "$TORCHAO_CONFIG" ]; then
    SERVER_ARGS="${SERVER_ARGS} --torchao-config ${TORCHAO_CONFIG}"
fi

if [ -n "$KV_CACHE_DTYPE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --kv-cache-dtype ${KV_CACHE_DTYPE}"
fi

python3 -m sglang.launch_server $SERVER_ARGS

SGLang provides a base [Dockerfile here](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile). You can extend the base image directly with

```
# Extend from the base sglang image
FROM lmsysorg/sglang:v0.5.3rc0-cu126
```

In this example, we extend the base image at a specific tag and add the lines below to make it compatible with SageMaker:
```
COPY serve /usr/bin/serve
RUN chmod 777 /usr/bin/serve

ENTRYPOINT [ "/usr/bin/serve" ]
```
You can add additional layers to the container image to accommodate your specific use case.

In [None]:
%%writefile Dockerfile
FROM lmsysorg/sglang:v0.5.3rc0-cu126

COPY serve /usr/bin/serve
RUN chmod 777 /usr/bin/serve

ENTRYPOINT [ "/usr/bin/serve" ]

Next, we create an ECR repository for the custom Docker image, build the image locally, and push it to the repository. Make sure the IAM role you use here has permission to push to ECR.

The cell below might take some time, so please be patient. If you have already built the Docker image in another development environment, feel free to skip it.

In [None]:
%%sh
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
REPOSITORY_NAME=sglang-sagemaker

# Create ECR repository if needed
if aws ecr describe-repositories --repository-names "${REPOSITORY_NAME}" &>/dev/null; then
    echo "Repository ${REPOSITORY_NAME} already exists"
else
    echo "Creating ECR repository ${REPOSITORY_NAME}..."
    aws ecr create-repository \
        --repository-name "${REPOSITORY_NAME}" \
        --region "${REGION}"
fi

# Build the Docker image and push it to the ECR repository
docker build -t sglang .
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
docker tag sglang:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPOSITORY_NAME:latest
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPOSITORY_NAME:latest

## Create SageMaker AI endpoint for Qwen QwQ 32B model

### Introduction: [Qwen QwQ 32B](https://huggingface.co/Qwen/QwQ-32B)

[QwQ](https://huggingface.co/Qwen/QwQ-32B) is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ, which is capable of thinking and reasoning, can achieve significantly enhanced performance in downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.

- **Mathematical Reasoning**: Achieves an impressive 90.6% on MATH-500, outperforming Claude 3.5 (78.3%) and matching OpenAI's o1-mini (90.0%)
- **Advanced Mathematics**: Scores 50.0% on AIME (American Invitational Mathematics Examination), significantly higher than Claude 3.5 (16.0%)
- **Scientific Reasoning**: Demonstrates strong performance on GPQA with 65.2%, on par with Claude 3.5 (65.0%)
- **Programming**: Achieves 50.0% on LiveCodeBench, showing competitive performance with leading proprietary models

> [NOTE]
> QwQ-32B is released under the Apache 2.0 license, making it suitable for both research and commercial applications.

Let's get started deploying one of the most capable open-source reasoning models available today!


In [None]:
import json
import sagemaker
import boto3
import sys
import time
from typing import List, Dict
from datetime import datetime


boto_region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session(boto_session=boto3.Session(region_name=boto_region))
role = sagemaker.get_execution_role()
sagemaker_client = boto3.client("sagemaker")
sagemaker_runtime_client = boto3.client("sagemaker-runtime")
s3_client = boto3.client("s3")
model_bucket = sagemaker_session.default_bucket()  # bucket to house artifacts
s3_model_prefix = (
    "hf-large-models/model_qwen"  # folder within bucket where code artifact will go
)

prefix = sagemaker.utils.unique_name_from_base("DEMO")
print(f"prefix: {prefix}")

## Setup your SageMaker Real-time Endpoint 
### Create a SageMaker endpoint configuration

We begin by creating the endpoint configuration and set MinInstanceCount to 0. This allows the endpoint to scale in all the way down to zero instances when not in use.

There are a few parameters to set up for our endpoint. We start with the variant name and the instance type we want the endpoint to use. We also set *model_data_download_timeout_in_seconds* and *container_startup_health_check_timeout_in_seconds* as guardrails for when we deploy inference components to the endpoint. In addition, we use Managed Instance Scaling, which lets SageMaker scale the number of instances based on the requirements of your inference components. We set *MinInstanceCount* and *MaxInstanceCount* to size the endpoint for the workload you want to serve while keeping control of cost. Lastly, we set *RoutingStrategy* so that the endpoint routes requests to instances and inference components for the best performance.

The suggested instance types to host the QwQ 32B model can be `ml.g5.12xlarge`, `ml.g6.12xlarge`, `ml.g6e.12xlarge`.
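
As a rough sanity check on these instance sizes, the sketch below estimates the memory needed for the model weights and what is left over for the KV cache. It is back-of-the-envelope arithmetic under stated assumptions, not a capacity guarantee: about 32.5 billion parameters, 2 bytes per parameter (bf16), 4 GPUs with roughly 24 GB each on `ml.g5.12xlarge`, and a hypothetical usable fraction of 0.85. Activation memory and framework overhead are ignored.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory needed for the model weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def headroom_gb(num_params, num_gpus, gb_per_gpu,
                usable_fraction=0.85, bytes_per_param=2):
    """Memory left after loading the weights, summed over all GPUs.

    A negative result means the weights alone do not fit.
    """
    usable = num_gpus * gb_per_gpu * usable_fraction
    return usable - weight_memory_gb(num_params, bytes_per_param)


QWQ_32B_PARAMS = 32.5e9  # assumption: approximate parameter count

print(f"weights: {weight_memory_gb(QWQ_32B_PARAMS):.1f} GB")
print(f"headroom on 4 x 24 GB: {headroom_gb(QWQ_32B_PARAMS, 4, 24):.1f} GB")
print(f"headroom on 1 x 24 GB: {headroom_gb(QWQ_32B_PARAMS, 1, 24):.1f} GB")
```

Under these assumptions the weights do not fit on a single 24 GB GPU but do fit when sharded across four, which is consistent with using tensor parallelism across all accelerators of a 12xlarge instance.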

In [None]:
# Set a unique endpoint config name
endpoint_config_name = f"{prefix}-endpoint-config"
print(f"Demo endpoint config name: {endpoint_config_name}")

# Set variant name and instance type for hosting
variant_name = "AllTraffic"
instance_type = "ml.g5.12xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

min_instance_count = 0 # Minimum instance must be set to 0
max_instance_count = 3

sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": min_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

Next, create the endpoint from this configuration with the boto3 `create_endpoint` call. (The [`deploy` function](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L149) of the SageMaker Model class is an alternative that returns a [`Predictor`](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/base_predictor.py#L98) object for invoking the endpoint; this notebook calls boto3 directly.)

In [None]:
# Set a unique endpoint name
endpoint_name = f"{prefix}-endpoint"
print(f"Demo endpoint name: {endpoint_name}")

sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

In [None]:
sagemaker_session.wait_for_endpoint(endpoint_name)

## Download the model from Hugging Face and upload the model artifacts on Amazon S3
In this example, we download a copy of the model from Hugging Face, upload it to an S3 location in your AWS account, and then deploy the model to an endpoint from those artifacts.

First, download the model artifacts from Hugging Face.

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path
import os
import sagemaker

qwen_qwq_32b = "Qwen/QwQ-32B"

# Download the model into the directory where the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = qwen_qwq_32b
# Only download the config, tokenizer, and checkpoint files
allow_patterns = ["*.json", "*.safetensors", "*.bin", "*.txt"]

# Use snapshot_download to fetch the model, because the repository stores its files with Git LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)

In [None]:
# define a variable to contain the s3url of the location that has the model
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"
print(f"Pretrained model will be uploaded to ---- > {pretrained_model_location}")

Upload the model data to S3.

In [None]:
model_artifact = sagemaker_session.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
print(f"Model uploaded to --- > {model_artifact}")
print(f"This S3 prefix is used as the S3Uri of the model data source: {model_artifact}")

In [None]:
# optional
# !rm -rf {model_download_path}

We use the URI of the image built in the earlier step for the SGLang container.

In [None]:
image_uri = f'{sagemaker_session.account_id()}.dkr.ecr.{boto_region}.amazonaws.com/sglang-sagemaker:latest'

To find out more about the SageMaker `create_model` API call, see [the boto3 doc](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_model.html). Note that you can use **CompressionType** to specify how the model data is prepared.

If you choose `Gzip` and choose `S3Object` as the value of `S3DataType`, `S3Uri` identifies an object that is a gzip-compressed TAR archive. SageMaker will attempt to decompress and untar the object during model deployment.

If you choose `None` and `S3Prefix` as the value of `S3DataType`, then for each S3 object under the key name prefix referenced by `S3Uri`, SageMaker will trim its key by the prefix and use the remainder as the path (relative to `/opt/ml/model`) of the file holding the content of the S3 object. SageMaker will split the remainder by slash (/), using intermediate parts as directory names and the last part as the filename of the file holding the content of the S3 object.
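
The key-to-path rule described above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not SageMaker's implementation; the function name `container_path` and the bucket name in the example are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def container_path(s3_uri_prefix, object_key, model_dir="/opt/ml/model"):
    """Map an S3 object key to the file path it gets inside the container."""
    # Key prefix is the path part of the S3 URI, without the leading slash
    prefix = urlparse(s3_uri_prefix).path.lstrip("/")
    if not object_key.startswith(prefix):
        raise ValueError(f"{object_key!r} is not under prefix {prefix!r}")
    # Trim the prefix; the remainder is the path relative to model_dir
    remainder = object_key[len(prefix):].lstrip("/")
    # Intermediate parts become directories, the last part the filename
    return posixpath.join(model_dir, *remainder.split("/"))


prefix = "s3://my-bucket/hf-large-models/model_qwen/"
print(container_path(prefix, "hf-large-models/model_qwen/config.json"))
print(container_path(prefix, "hf-large-models/model_qwen/sub/weights.bin"))
```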


In [None]:
qwen_sglang_model = {
    "Image": image_uri,
    'ModelDataSource': {
                'S3DataSource': {
                    'S3Uri': pretrained_model_location,
                    'S3DataType': 'S3Prefix',
                    'CompressionType': 'None',
                }
            },
    "Environment": {
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
        'TENSOR_PARALLEL_DEGREE': '4',
    },
}
model_name_qwen = f"qwen-qwq-32b-sglang-{datetime.now().strftime('%y%m%d-%H%M%S')}"
# create SageMaker Model
sagemaker_client.create_model(
    ModelName=model_name_qwen,
    ExecutionRoleArn=role,
    Containers=[qwen_sglang_model],
)

We can now create the inference component, which will be deployed on the endpoint that you specify. Note that you can provide either a SageMaker model or a container in the specification. If you provide a container, you need to supply an image and an artifact URL as parameters. In this example we set it to the model name we prepared in the cells above. You can also set `ComputeResourceRequirements` to tell SageMaker what should be reserved for each copy of the inference component, and set the copy count to the number of copies you would like to deploy. These can be managed and scaled as the capabilities become available.

Note that in this example we set the `NumberOfAcceleratorDevicesRequired` to a value of `4`. By doing so we reserve 4 accelerators for each copy of this inference component so that we can use tensor parallel. 
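
The relationship between the tensor parallel degree and the reserved accelerators can be checked before the API call. The sketch below builds the `Specification` dictionary and computes how many copies fit on one instance. The helper names `build_ic_specification` and `copies_per_instance` are hypothetical, and the figure of 4 accelerators per instance is an assumption that matches the 12xlarge instance types suggested earlier.

```python
def build_ic_specification(model_name, tensor_parallel_degree,
                           accelerators_per_instance,
                           cpu_cores=1, min_memory_mb=1024):
    """Build the Specification argument for create_inference_component."""
    if tensor_parallel_degree > accelerators_per_instance:
        raise ValueError(
            "tensor parallel degree exceeds the accelerators on one instance"
        )
    return {
        "ModelName": model_name,
        "ComputeResourceRequirements": {
            # One accelerator per tensor-parallel shard
            "NumberOfAcceleratorDevicesRequired": tensor_parallel_degree,
            "NumberOfCpuCoresRequired": cpu_cores,
            "MinMemoryRequiredInMb": min_memory_mb,
        },
    }


def copies_per_instance(accelerators_per_instance, accelerators_per_copy):
    """How many copies of the component one instance can hold."""
    return accelerators_per_instance // accelerators_per_copy


spec = build_ic_specification("my-model", tensor_parallel_degree=4,
                              accelerators_per_instance=4)
print(spec["ComputeResourceRequirements"])
print(copies_per_instance(4, 4))
```

With 4 accelerators reserved per copy on a 4-accelerator instance, each additional copy requires an additional instance, so the maximum copy count and the maximum instance count should be sized together.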

In [None]:
inference_component_name_qwen = f"{prefix}-IC-qwen-32b-{datetime.now().strftime('%y%m%d-%H%M%S')}"
variant_name = "AllTraffic"

sagemaker_client.create_inference_component(
    InferenceComponentName=inference_component_name_qwen,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name_qwen,
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 4,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

In [None]:
while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name_qwen
    )
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

#### Invoke endpoint with boto3
You can invoke the endpoint with the boto3 `sagemaker-runtime` client. If you already have an endpoint, you only need its name and the name of the inference component, as in the example below.
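
The request and response bodies used in this section follow an OpenAI-style chat completions layout. A minimal sketch of building the request and extracting the answer text is shown here; the helper names `build_chat_request` and `extract_content` are hypothetical, and the sample response is hand-written for illustration rather than captured from the endpoint.

```python
import json


def build_chat_request(user_prompt, system_prompt=None,
                       max_tokens=512, temperature=0):
    """Build a chat request body in the shape used by this notebook."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {
        "model": "mymodel",
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def extract_content(response_body):
    """Return the assistant text from a JSON response body (str or bytes)."""
    data = json.loads(response_body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


request = build_chat_request("How many R are in STRAWBERRY?")
print(json.dumps(request, indent=2))

# Hand-written sample body, for illustration only
sample = b'{"choices": [{"message": {"role": "assistant", "content": "3"}}]}'
print(extract_content(sample))
```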

In [None]:
import boto3
import json
sagemaker_runtime = boto3.client('sagemaker-runtime')
# endpoint_name = "<existing-endpoint-name>"  # optionally target an existing endpoint instead

prompt = {
    'model':'mymodel',
    'messages':[
    {"role": "system", "content": "You are a helpful assistant that thinks and reasons before answering."},
    {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"}
],
    'temperature':0,
    'max_tokens':512,
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name_qwen,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)

#### Streaming response from the endpoint
Additionally, SGLang allows you to invoke the endpoint and receive a streaming response. Below is an example of how to interact with the endpoint with a streaming response.
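
Before looking at the full streaming class, it may help to see how a single streamed line is handled. The sketch below parses one line and returns the text fragment it carries, if any. It assumes server-sent-event framing (`data: ` followed by a JSON chunk), a `[DONE]` end marker, and a final usage-only chunk with an empty `choices` list, all of which follow the OpenAI-style streaming convention; the function name `parse_sse_line` is hypothetical.

```python
import json


def parse_sse_line(raw_line):
    """Return the text fragment carried by one streamed line, or None."""
    line = raw_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines and anything unexpected
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream marker (assumed convention)
    try:
        chunk = json.loads(payload)
    except json.JSONDecodeError:
        return None  # incomplete or non-JSON payload
    if not isinstance(chunk, dict):
        return None
    choices = chunk.get("choices") or []
    if not choices:
        return None  # e.g. the usage-only chunk at the end of the stream
    # A delta may carry only a role, so "content" can be missing
    return choices[0].get("delta", {}).get("content")


lines = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b'data: {"choices": [{"delta": {"content": "Paris"}}]}',
    b'data: {"choices": [], "usage": {"total_tokens": 12}}',
    b"data: [DONE]",
]
print([parse_sse_line(line) for line in lines])
```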

In [None]:
import io
import json

# Example class that processes an inference stream:
class SmrInferenceStream:
    
    def __init__(self, sagemaker_runtime, endpoint_name, inference_component_name=None):
        self.sagemaker_runtime = sagemaker_runtime
        self.endpoint_name = endpoint_name
        self.inference_component_name = inference_component_name
        # A buffered I/O stream to combine the payload parts:
        self.buff = io.BytesIO() 
        self.read_pos = 0
        
    def stream_inference(self, request_body):
        # Gets a streaming inference response 
        # from the specified model endpoint:
        response = self.sagemaker_runtime\
            .invoke_endpoint_with_response_stream(
                EndpointName=self.endpoint_name, 
                InferenceComponentName=self.inference_component_name,
                Body=json.dumps(request_body), 
                ContentType="application/json"
        )
        # Gets the EventStream object returned by the SDK:
        event_stream = response['Body']
        for event in event_stream:
            # Passes the contents of each payload part
            # to be concatenated:
            self._write(event['PayloadPart']['Bytes'])
            # Iterates over lines to parse whole JSON objects:
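            for line in self._readlines():
                line = line.decode('utf-8')
                # Strips the server-sent-event prefix when present:
                if line.startswith('data: '):
                    line = line[len('data: '):]
                try:
                    resp = json.loads(line)
                except json.JSONDecodeError:
                    # Skips blank lines and non-JSON markers such as "[DONE]":
                    continue
                if isinstance(resp, dict):
                    choices = resp.get('choices') or []
                    # The final usage-only chunk carries no choices:
                    if not choices:
                        continue
                    part = choices[0].get('delta', {}).get('content')
                    # Role-only or empty deltas carry no text:
                    if not part:
                        continue
                else:
                    part = resp
                # Returns parts incrementally:
                yield part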
    
    # Writes to the buffer to concatenate the contents of the parts:
    def _write(self, content):
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    # The JSON objects in buffer end with '\n'.
    # This method reads complete lines to yield a series of JSON objects:
    def _readlines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            # Leaves a partial trailing line in the buffer until the rest arrives:
            if not line.endswith(b'\n'):
                break
            self.read_pos += len(line)
            yield line[:-1]

In [None]:
request_body = {
    'model':'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals. Give a short answer"},
    ],
    'temperature':0,
    'max_tokens':512,
    'stream': True,
    'stream_options': {'include_usage': True}
}

smr_inference_stream = SmrInferenceStream(
    sagemaker_runtime, endpoint_name, inference_component_name_qwen)
stream = smr_inference_stream.stream_inference(request_body)
for part in stream:
    print(part, end='')

## Automatically Scale To Zero
### Scaling policies
Once the endpoint is deployed and InService, you can then add the necessary scaling policies:

* A [target tracking](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking.html) policy that scales the inference component copy count in to zero and between 1 and n copies.
* A [step scaling](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html) policy that allows the endpoint to scale out from zero.

These policies work together to provide cost-effective scaling - the endpoint can scale to zero when idle and automatically scale out as needed to handle incoming requests.

### Scaling policy for inference components copies (target tracking)
We start by creating the target tracking policy that scales the CopyCount of our inference component.

#### Register a new autoscaling target
After you create your SageMaker endpoint and inference components, you register a new auto scaling target for Application Auto Scaling. In the following code block, you set **MinCapacity** to **0**, which is required for your endpoint to scale down to zero.

In [None]:
aas_client = sagemaker_session.boto_session.client("application-autoscaling")
cloudwatch_client = sagemaker_session.boto_session.client("cloudwatch")

# Autoscaling parameters
resource_id = f"inference-component/{inference_component_name_qwen}"
service_namespace = "sagemaker"
scalable_dimension = "sagemaker:inference-component:DesiredCopyCount"

min_copy_count = 0
max_copy_count = 3

aas_client.register_scalable_target(
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    MinCapacity=min_copy_count,
    MaxCapacity=max_copy_count,
)

#### Configure Target Tracking Scaling Policy
Once you have registered your new scalable target, the next step is to define your target tracking policy. In the code example that follows, we set `TargetValue` to 5. This setting instructs Application Auto Scaling to add capacity when the number of concurrent requests per model copy reaches or exceeds 5. Here we take advantage of the more granular auto scaling metric `SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution` (set as the `PredefinedMetricType`) to monitor and react to changes in inference traffic more accurately. See this [blog](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-inference-launches-faster-auto-scaling-for-generative-ai-models/) for more information.
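
To build intuition for how a target value translates into a copy count, the sketch below uses a simplified proportional model: total concurrent requests divided by the target per copy, clamped to the registered capacity range. This is an approximation for illustration only and is not the actual Application Auto Scaling algorithm, which also applies alarms and cooldowns. The function name `estimate_desired_copies` is hypothetical.

```python
import math


def estimate_desired_copies(current_copies, concurrent_per_copy,
                            target_value, min_copies, max_copies):
    """Simplified estimate of the copy count that meets the target value."""
    # Total in-flight requests currently spread over all copies
    total_concurrent = current_copies * concurrent_per_copy
    # Copies needed so that each one handles at most target_value requests
    desired = math.ceil(total_concurrent / target_value)
    # Never leave the range registered with register_scalable_target
    return max(min_copies, min(max_copies, desired))


# One copy handling 12 concurrent requests against a target of 5
print(estimate_desired_copies(1, 12, target_value=5, min_copies=0, max_copies=3))
# Two copies handling 4 requests each
print(estimate_desired_copies(2, 4, target_value=5, min_copies=0, max_copies=3))
# No copies running: the per-copy metric gives nothing to scale on
print(estimate_desired_copies(0, 0, target_value=5, min_copies=0, max_copies=3))
```

In this simplified model the estimate stays at zero once there are zero copies, which is one way to see why a separate policy is needed to scale out from zero.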

In [None]:
aas_client.describe_scalable_targets(
    ServiceNamespace=service_namespace,
    ResourceIds=[resource_id],
    ScalableDimension=scalable_dimension,
)

# The policy name for the target tracking policy
target_tracking_policy_name = f"Target-tracking-policy-qwen-qwq-scale-to-zero-aas-{inference_component_name_qwen}"

aas_client.put_scaling_policy(
    PolicyName=target_tracking_policy_name,
    PolicyType="TargetTrackingScaling",
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution",
        },
        # Low TPS + load TPS
        "TargetValue": 5,  # you need to adjust this value based on your use case
        "ScaleInCooldown": 300,  # default
        "ScaleOutCooldown": 300,  # default
    },
)

Application Auto Scaling creates two CloudWatch alarms per scaling target. The first triggers scale-out actions after 30 seconds (using 3 sub-minute data points), while the second triggers scale-in after 15 minutes (using 90 sub-minute data points). The scaling action usually starts 1 to 2 minutes later than these windows, because it takes time for the endpoint to publish metrics to CloudWatch and for Application Auto Scaling to react.
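
The alarm windows quoted above follow from simple arithmetic if each sub-minute data point covers 10 seconds, which is the period implied by 3 data points spanning 30 seconds. The sketch below reproduces that calculation and adds the 1 to 2 minutes of reaction time mentioned in the text. The function names are hypothetical, and the 10-second period is an inference from the numbers above rather than a value read from the alarm configuration.

```python
def alarm_window_seconds(datapoints, period_seconds=10):
    """Length of the window an alarm must observe before it can fire."""
    return datapoints * period_seconds


def expected_trigger_range_seconds(datapoints, period_seconds=10,
                                   min_overhead=60, max_overhead=120):
    """Window plus the 1 to 2 minutes of publishing and reaction delay."""
    window = alarm_window_seconds(datapoints, period_seconds)
    return window + min_overhead, window + max_overhead


print(alarm_window_seconds(3))    # scale-out window in seconds
print(alarm_window_seconds(90))   # scale-in window in seconds
print(expected_trigger_range_seconds(90))
```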

### Scale out from zero policy (step scaling policy)
To enable your endpoint to scale out from zero instances, do the following:

#### Configure Step Scaling Policy
Create a step scaling policy that defines when and how to scale out from zero. When triggered, this policy adds 1 model copy, which lets SageMaker provision the instances required to handle incoming requests after being idle. The cell below defines the policy. Here we scale out from 0 to 1 model copy (`"ScalingAdjustment": 1`); adjust `ScalingAdjustment` as required for your use case.

In [None]:
# The policy name for the step scaling policy
step_scaling_policy_name = f"Step-scaling-policy-qwen-qwq-scale-to-zero-aas-{inference_component_name_qwen}"

aas_client.put_scaling_policy(
    PolicyName=step_scaling_policy_name,
    PolicyType="StepScaling",
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Maximum",
        "Cooldown": 60,
        "StepAdjustments":
          [
             {
               "MetricIntervalLowerBound": 0,
               "ScalingAdjustment": 1
             }
          ]
    },
)

In [None]:
resp = aas_client.describe_scaling_policies(
    PolicyNames=[step_scaling_policy_name],
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
)
step_scaling_policy_arn = resp['ScalingPolicies'][0]['PolicyARN']
print(f"step_scaling_policy_arn: {step_scaling_policy_arn}")

#### Create the CloudWatch alarm that will trigger our policy

Finally, create a CloudWatch alarm with the metric **NoCapacityInvocationFailures**. When triggered, the alarm initiates the previously defined scaling policy. For more information about the NoCapacityInvocationFailures metric, see [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-inference-component).

We have also set the following:
- EvaluationPeriods to 1
- DatapointsToAlarm to 1
- ComparisonOperator to GreaterThanOrEqualToThreshold

This results in a wait of about 1 minute before the step scaling policy is triggered.

In [None]:
# The alarm name for the step scaling alarm
step_scaling_alarm_name = f"step-scaling-alarm-qwen-qwq-scale-to-zero-aas-{inference_component_name_qwen}"

cloudwatch_client.put_metric_alarm(
    AlarmName=step_scaling_alarm_name,
    AlarmActions=[step_scaling_policy_arn],  # ARN of the step scaling policy created above
    MetricName='NoCapacityInvocationFailures',
    Namespace='AWS/SageMaker',
    Statistic='Maximum',
    Dimensions=[
        {
            'Name': 'InferenceComponentName',
            'Value': inference_component_name_qwen  # Replace with actual InferenceComponentName
        }
    ],
    Period=30, # Set a lower period 
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing'
)

From the CloudWatch console, you can check the alarms that were created.

![cloudwatch alarms](./img/cloudwatch-alarms.png)


### Testing the behaviour
Notice the `MinInstanceCount: 0` setting in the Endpoint configuration, which allows the endpoint to scale down to zero instances. With the scaling policy, CloudWatch alarm, and minimum instances set to zero, your SageMaker Inference Endpoint will now be able to automatically scale down to zero instances when not in use, helping you optimize your costs and resource utilization.

### IC copy count scales in to zero
We'll pause for a few minutes without making any invocations to our model. Based on our target tracking policy, when the SageMaker endpoint receives no requests for about 10 to 15 minutes, it automatically scales the number of model copies down to zero.
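
To confirm that the copy count has actually reached zero, one option is to poll the `RuntimeConfig.CurrentCopyCount` field returned by `describe_inference_component`. The sketch below is a generic polling helper under that assumption. The name `wait_for_copy_count` is hypothetical, and the describe call is passed in as a function so the helper can be exercised without AWS access.

```python
import time


def wait_for_copy_count(describe, target=0, timeout_s=1800, interval_s=30,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until CurrentCopyCount equals `target`, then return the description.

    `describe` is a zero-argument callable returning the response of
    describe_inference_component.
    """
    deadline = clock() + timeout_s
    while True:
        desc = describe()
        current = desc["RuntimeConfig"]["CurrentCopyCount"]
        if current == target:
            return desc
        if clock() >= deadline:
            raise TimeoutError(
                f"copy count is still {current} after {timeout_s} seconds"
            )
        sleep(interval_s)


# Offline demonstration with a fake describe call that reports 1, 1, then 0
fake_counts = iter([1, 1, 0])
result = wait_for_copy_count(
    lambda: {"RuntimeConfig": {"CurrentCopyCount": next(fake_counts)}},
    sleep=lambda seconds: None,  # skip real waiting in the demonstration
)
print(result)
```

In the notebook, `describe` could be `lambda: sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name_qwen)`.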

In [None]:
time.sleep(600)
start_time = time.time()
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name_qwen)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name_qwen)
print(desc)
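
The component status alone does not show how many copies are running. As a rough sketch, the copy counts can be read from the `RuntimeConfig` section of the `describe_inference_component` response; the helper name `get_copy_counts` is hypothetical.

```python
def get_copy_counts(sagemaker_client, inference_component_name):
    """Return (current, desired) copy counts for an inference component."""
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )
    runtime_config = desc.get("RuntimeConfig", {})
    return runtime_config.get("CurrentCopyCount"), runtime_config.get("DesiredCopyCount")


# Usage in this notebook (requires AWS credentials):
# current, desired = get_copy_counts(sagemaker_client, inference_component_name_qwen)
# print(f"CurrentCopyCount={current}, DesiredCopyCount={desired}")
```

Once the scale-in has completed, both values should be zero.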

### Endpoint's instances scale in to zero

After 10 additional minutes of inactivity, SageMaker automatically terminates all underlying instances of the endpoint, so no instance charges accrue while the endpoint is idle.

In [None]:
# Wait 10 more minutes; with no traffic the instance count should scale in to 0
time.sleep(600)
# Display the endpoint description and check that CurrentInstanceCount is zero
sagemaker_session.wait_for_endpoint(endpoint_name)
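
Rather than scanning the full endpoint description by eye, the instance count can be extracted directly. This is a small sketch that reads `CurrentInstanceCount` from each entry of `ProductionVariants` in the `describe_endpoint` response; the helper name `get_instance_counts` is hypothetical, and the value is returned as `None` if the field is absent.

```python
def get_instance_counts(sagemaker_client, endpoint_name):
    """Map each production variant name to its current instance count."""
    desc = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    return {
        variant["VariantName"]: variant.get("CurrentInstanceCount")
        for variant in desc.get("ProductionVariants", [])
    }


# Usage in this notebook (requires AWS credentials):
# print(get_instance_counts(sagemaker_client, endpoint_name))
```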

### Invoke the endpoint with a sample prompt

If we try to invoke our endpoint while instances are scaled down to zero, we get a validation error: `An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.`

In [None]:
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name_qwen,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)
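
Because the call above raises while the endpoint is scaled to zero, a client may prefer to detect that specific condition and carry on. The following sketch returns `None` for the no-capacity `ValidationError` quoted above and re-raises anything else. The helper name `try_invoke` is hypothetical, and inspecting the exception's `response` attribute (instead of catching `botocore.exceptions.ClientError` explicitly) is an assumption made to keep the example free of extra imports.

```python
import json


def try_invoke(runtime_client, endpoint_name, inference_component_name, payload):
    """Invoke the endpoint; return parsed JSON, or None if there is no capacity yet."""
    try:
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            InferenceComponentName=inference_component_name,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
    except Exception as exc:
        # botocore's ClientError carries the service error in exc.response["Error"].
        details = getattr(exc, "response", None)
        error = details.get("Error", {}) if isinstance(details, dict) else {}
        if error.get("Code") == "ValidationError" and "no capacity" in error.get("Message", ""):
            return None
        raise
    return json.loads(response["Body"].read().decode("utf-8"))


# Usage in this notebook (requires AWS credentials):
# result = try_invoke(sagemaker_runtime, endpoint_name, inference_component_name_qwen, prompt)
# print("No capacity yet" if result is None else result["choices"][0]["message"]["content"])
```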

### Scale out from zero kicks in
However, after about 1 minute our step scaling policy should kick in. SageMaker then provisions a new instance and deploys a copy of our inference component model to handle requests. This demonstrates the endpoint's ability to scale out from zero automatically when needed.

In [None]:
time.sleep(60)
start_time = time.time()
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name_qwen)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name_qwen)
print(desc)
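
The polling loop above has no upper bound, so it can run forever if the scale-out never completes. One possible variant, sketched below, polls with a deadline. The helper `poll_until` and its parameters are hypothetical, and the timeout values are placeholders rather than measured provisioning times.

```python
import time


def poll_until(check, timeout_s=900, poll_s=30, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a truthy value; raise TimeoutError otherwise."""
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        sleep(poll_s)


# Usage in this notebook (hypothetical wiring, requires AWS credentials):
# def component_ready():
#     desc = sagemaker_client.describe_inference_component(
#         InferenceComponentName=inference_component_name_qwen
#     )
#     return desc.get("RuntimeConfig", {}).get("CurrentCopyCount", 0) >= 1
#
# poll_until(component_ready, timeout_s=1200)
```

Passing `sleep` and `clock` as parameters keeps the helper easy to exercise without waiting in real time.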

#### Verify that our endpoint has successfully scaled out from zero

In [None]:
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name_qwen,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)

## Cleanup
Make sure to delete the endpoint and other artifacts that were created to avoid unnecessary cost. You can also go to the SageMaker AI console to delete all the resources created in this example.

- Deregister the scalable target
- Delete the CloudWatch alarms
- Delete the scaling policies

In [None]:
try:
    # Deregister the scalable target for AAS
    aas_client.deregister_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=scalable_dimension,
    )
    print(f"Scalable target for [b]{resource_id}[/b] deregistered. ✅")
except aas_client.exceptions.ObjectNotFoundException:
    print(f"Scalable target for [b]{resource_id}[/b] not found.")

print("---" * 10)

# Delete CloudWatch alarms created for Step scaling policy
try:
    cloudwatch_client.delete_alarms(AlarmNames=[step_scaling_alarm_name])
    print(f"Deleted CloudWatch step scaling scale-out alarm [b]{step_scaling_alarm_name}[/b] ✅")
except cloudwatch_client.exceptions.ResourceNotFound:
    print(f"CloudWatch scale-out alarm [b]{step_scaling_alarm_name}[/b] not found.")


# Delete step scaling policies
print("---" * 10)

try:
    aas_client.delete_scaling_policy(
        PolicyName=step_scaling_policy_name,
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=scalable_dimension,  # same dimension the scalable target was registered with
    )
    print(f"Deleted scaling policy [i green]{step_scaling_policy_name}[/] ✅")
except aas_client.exceptions.ObjectNotFoundException:
    print(f"Scaling policy [i]{step_scaling_policy_name}[/i] not found.")

In [None]:
sagemaker_client.delete_inference_component(InferenceComponentName=inference_component_name_qwen)
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)