# Faster autoscaling on Amazon SageMaker realtime endpoints with inference components (Application Autoscaling)

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)


---

In this notebook we show how the new faster autoscaling feature helps Amazon SageMaker inference endpoints scale out almost 6x faster than before.

We deploy Meta's `Llama3-8B-Instruct` model to an Amazon SageMaker realtime endpoint using Text Generation Inference (TGI) Deep Learning Container (DLC) and apply <span style='color:green'><b>Application Autoscaling</b></span> scaling policies to the endpoint.


<div class="alert alert-block alert-warning">
    Please select <b>m5.2xlarge</b> or larger instance types when running this on Amazon SageMaker Notebook Instance.<br/>
    Select <b>conda_pytorch_p310</b> kernel when running this notebook on Amazon SageMaker Notebook Instance. <br/><br/>
    Ensure python version for the kernel is <b>3.10.x</b> (3.11 is not supported). <br/>
</div>

---

## Prerequisites



<div style="border: 1px solid #f00; border-radius: 5px; padding: 10px; background-color: #fee;">
Before using this notebook please ensure you have access to an active access token from HuggingFace and have accepted the license agreement from Meta.

- **Step 1:** Create user access token in HuggingFace (HF). Refer [here](https://huggingface.co/docs/hub/security-tokens) on how to create HF tokens.
- **Step 2:** Log in to [HuggingFace](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main) and navigate to the **Meta-Llama-3-8B-Instruct** home page.
- **Step 3:** Accept the META LLAMA 3 COMMUNITY LICENSE AGREEMENT by following the instructions [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main).
- **Step 4:** Wait for the approval email from Meta (approval may take anywhere between 1 and 3 hours).
</div>

Install packages using uv, an extremely fast Python package installer.\
Read more about uv here <https://astral.sh/blog/uv>

In [None]:
# ensure python version of the selected kernel is not greater than 3.10
!python --version
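
The shell command above reports the `python` found on the PATH, which may not be the interpreter the kernel runs. The following is a minimal sketch of an in-kernel check, assuming the 3.10.x requirement stated at the top of this notebook; `is_supported_python` is a hypothetical helper name, not part of any library.

```python
import sys


def is_supported_python(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.10.x, the version this notebook targets."""
    major, minor = version_info[0], version_info[1]
    return (major, minor) == (3, 10)


if not is_supported_python():
    # Only a warning: the reader decides whether to switch kernels
    print(
        f"Warning: kernel runs Python {sys.version_info[0]}.{sys.version_info[1]}; "
        "this notebook expects 3.10.x"
    )
```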

In [None]:
!pip install uv && uv pip install -U ipywidgets
!uv pip install -r requirements.txt

In [None]:
# restart kernel
from IPython.core.display import HTML

HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
# load rich extension
%load_ext rich

In [None]:
import sys
import os

import time
from getpass import getpass
import boto3
import sagemaker
from rich import print
from sagemaker.deserializers import JSONDeserializer
from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

## Initialize SageMaker session

In [None]:
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = sess.boto_region_name

boto_session = boto3.Session(region_name=region)

sagemaker_client = sess.sagemaker_client
sagemaker_runtime_client = sess.sagemaker_runtime_client
cloudwatch_client = boto3.client("cloudwatch", region_name=region)

hf_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# retrieve the llm image uri
# tgi_dlc = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1-tgi2.0-gpu-py310-cu121-ubuntu22.04"
tgi_dlc = get_huggingface_llm_image_uri("huggingface", version="2.0.0")

print(f"TGI DLC: \n[b i green]{tgi_dlc}[/b i green]")
print(f"Region: [b blue]{region}[/b blue]")
print(f"Role: [b red]{role}[/b red]")

## Create Endpoint

1. Create `EndpointConfiguration`
2. Create Endpoint

In [None]:
# Set a unique endpoint config name
prefix = sagemaker.utils.unique_name_from_base("llama3")
print(f"prefix: {prefix}")

endpoint_config_name = f"{prefix}-endpoint-config"
print(f"Endpoint config name: {endpoint_config_name}")

# Set variant name and instance type for hosting
variant_name = "AllTraffic"
instance_type = "ml.g5.2xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

initial_instance_count = 1
max_instance_count = 2
print(f"Initial instance count: {initial_instance_count}")
print(f"Max instance count: {max_instance_count}")

epc_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": initial_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)
print(epc_response)

In [None]:
# Set a unique endpoint name
endpoint_name = f"{prefix}-endpoint"

ep_response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
# print(ep_response)
print(f"Creating endpoint: [b blue]{endpoint_name}...")
sess.wait_for_endpoint(endpoint_name)

## Deploy model

Create and deploy model using Amazon SageMaker HuggingFace TGI DLC

<https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy>

<div class="alert alert-block alert-warning">
<b>NOTE:</b> Remember to copy your Hugging Face Access Token from <a href="https://hf.co/">https://hf.co/</a> before running the below cell.<br/><br/>
Refer <a href="https://huggingface.co/docs/hub/security-tokens">here</a> to learn about creating HF tokens.
</div>

## Configure container and environment 

In [None]:
# print ecr image uri
print(f"llm image uri: [b green]{tgi_dlc}")

HF_TOKEN = os.getenv("HUGGING_FACE_HUB_TOKEN") or getpass("Enter HUGGINGFACE Access Token: ")

llama3model = {
    "Image": tgi_dlc,
    "Environment": {
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",  # model_id from hf.co/models
        "SM_NUM_GPUS": "1",  # Number of GPU used per replica
        "MAX_INPUT_LENGTH": "2048",  # Max length of input text
        "MAX_TOTAL_TOKENS": "4096",  # Max length of the generation (including input text)
        "MAX_BATCH_TOTAL_TOKENS": "8192",  # Limits the number of tokens that can be processed in parallel during the generation
        "MESSAGES_API_ENABLED": "true",  # Enable the messages API
        "HUGGING_FACE_HUB_TOKEN": HF_TOKEN,
    },
}

# create Model
deployment_name = "sm"
model_name = f"{deployment_name}-model-llama3"

print(f"Creating model: [b green]{model_name}...")
model_response = sagemaker_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    Containers=[llama3model],
)

print(model_response)

In [None]:
# Deploy model to Amazon SageMaker Inference Component
inference_component_name_llama3b = f"{prefix}-IC-llama3b"
variant_name = "AllTraffic"

ic_response = sagemaker_client.create_inference_component(
    InferenceComponentName=inference_component_name_llama3b,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": f"{deployment_name}-model-llama3",
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

# print(ic_response)

# Wait for IC to come InService
print(f"InferenceComponent [b magenta]{inference_component_name_llama3b}...")
while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name_llama3b
    )
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)
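
The loop above polls indefinitely and exits silently on `Failed`. As a hedged alternative, the sketch below adds a timeout and raises on failure. `wait_for_inference_component` is a hypothetical helper; it assumes `describe_fn` behaves like `sagemaker_client.describe_inference_component`, and the default timeout of 3600 seconds is an assumption chosen to mirror the container startup timeout used earlier.

```python
import time


def wait_for_inference_component(describe_fn, name, timeout_s=3600, poll_s=30, sleep_fn=time.sleep):
    """Poll until the inference component is InService, fails, or the timeout elapses.

    describe_fn is expected to accept InferenceComponentName=... and return a dict
    containing "InferenceComponentStatus".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = describe_fn(InferenceComponentName=name)["InferenceComponentStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Inference component {name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Inference component {name} still {status} after {timeout_s}s")
        sleep_fn(poll_s)
```

In this notebook it could be called as `wait_for_inference_component(sagemaker_client.describe_inference_component, inference_component_name_llama3b)`.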

## Inference

Invoke and test endpoint using messages API. Refer to HF [Messages API](https://huggingface.co/docs/text-generation-inference/messages_api) for more info.

In [None]:
# create predictor object
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    component_name=inference_component_name_llama3b,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

In [None]:
# Prompt to generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"},
]

# Generation arguments
parameters = {
    "model": hf_model_id,  # model id is required
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 512,
    "stop": ["<|eot_id|>"],
}

chat = predictor.predict({"messages": messages, **parameters})

# Unpack and print response
print(chat["choices"][0]["message"]["content"].strip())
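
The `Predictor` above hides how a request is routed to a specific inference component. The sketch below shows the same call made directly with the SageMaker runtime client, where `InferenceComponentName` selects the component. `invoke_chat` is a hypothetical helper, and the response shape is assumed to match the Messages API output unpacked in the cell above.

```python
import json


def invoke_chat(runtime_client, endpoint_name, component_name, messages, **parameters):
    """Send a Messages API request to one inference component and return the reply text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        InferenceComponentName=component_name,  # routes the request to this component
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps({"messages": messages, **parameters}),
    )
    # The response body is a stream; read and decode it as JSON
    result = json.loads(response["Body"].read())
    return result["choices"][0]["message"]["content"].strip()
```

With the objects defined in this notebook, a call might look like `invoke_chat(sagemaker_runtime_client, endpoint_name, inference_component_name_llama3b, messages, **parameters)`.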

## Apply Autoscaling policies to the endpoint

Apply Application Autoscaling Policy to endpoint

1. Register Scalable Target

In [None]:
as_min_capacity = 1
as_max_capacity = 2

resource_id = f"inference-component/{inference_component_name_llama3b}"

autoscaling_client = boto3.client("application-autoscaling", region_name=region)

# Register scalable target
scalable_target = autoscaling_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=as_min_capacity,
    MaxCapacity=as_max_capacity,  # Maximum number of inference component copies
)

scalable_target_arn = scalable_target["ScalableTargetARN"]
print(f"Resource ID: [b blue]{resource_id}")
print(f"Scalable_target_arn:\n[b green]{scalable_target_arn}")
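
To confirm the registration took effect, the scalable target can be read back. The sketch below wraps `describe_scalable_targets` from the Application Auto Scaling client; `get_capacity_bounds` is a hypothetical helper name, and the default dimension is the one used in the registration above.

```python
def get_capacity_bounds(
    autoscaling_client,
    resource_id,
    dimension="sagemaker:inference-component:DesiredCopyCount",
):
    """Return (MinCapacity, MaxCapacity) for a registered scalable target, or None if absent."""
    response = autoscaling_client.describe_scalable_targets(
        ServiceNamespace="sagemaker",
        ResourceIds=[resource_id],
        ScalableDimension=dimension,
    )
    targets = response["ScalableTargets"]
    if not targets:
        return None
    return targets[0]["MinCapacity"], targets[0]["MaxCapacity"]
```

For example, `get_capacity_bounds(autoscaling_client, resource_id)` should return the `as_min_capacity` and `as_max_capacity` values registered above.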

## Use the latest high-resolution Metrics to trigger auto-scaling

- The faster autoscaling feature introduces new <span style='color:green'><b>PredefinedMetricType</b></span> values for scaling policy configuration. For inference components, as in this notebook, the value is <span style='color:green'><b>SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution</b></span>; for endpoints that host models directly on a variant, it is <span style='color:green'><b>SageMakerVariantConcurrentRequestsPerModelHighResolution</b></span>.
- Creating a scaling policy with this metric type creates CloudWatch alarms that track a new metric called <span style='color:green'><b>ConcurrentRequestsPerCopy</b></span> (<span style='color:green'><b>ConcurrentRequestsPerModel</b></span> for variant-based endpoints).
- These high-resolution metrics are published to CloudWatch at sub-minute intervals (every 10 seconds, plus any jitter and delays).
- We should observe significantly faster scale-out times with this new metric.


### Steps to create Application autoscaling policy

- Create scaling policy
  - Set `PolicyType` to `TargetTrackingScaling`
  - Set `TargetValue` to `5.0`, so the policy adds or removes copies to keep the metric near 5 concurrent requests per inference component copy
  - Set `PredefinedMetricType` to `SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution`
  - Set `ScaleInCooldown` and `ScaleOutCooldown` values to `300` seconds
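
To build intuition for what a target value of `5.0` means, the sketch below uses a simplified proportional model: total concurrency divided by the target, rounded up and clamped to the registered capacity range. This is an illustration only, not the exact algorithm Application Auto Scaling runs (which also accounts for alarms and cooldowns); `estimate_desired_copies` is a hypothetical helper.

```python
import math


def estimate_desired_copies(
    current_copies,
    concurrent_per_copy,
    target_value=5.0,
    min_capacity=1,
    max_capacity=2,
):
    """Estimate the copy count that brings per-copy concurrency back to the target."""
    total_concurrency = current_copies * concurrent_per_copy
    desired = math.ceil(total_concurrency / target_value)
    # Clamp to the MinCapacity/MaxCapacity registered on the scalable target
    return max(min_capacity, min(max_capacity, desired))


# One copy serving 8 concurrent requests exceeds the target of 5, so the estimate is 2 copies
print(estimate_desired_copies(current_copies=1, concurrent_per_copy=8))
```

Under this model, one copy serving 3 concurrent requests stays at 1 copy, and any demand above 10 concurrent requests is capped at the maximum of 2 copies.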

In [None]:
# Create Target Tracking Scaling Policy
target_tracking_policy_response = autoscaling_client.put_scaling_policy(
    PolicyName="SageMakerICScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # Target of 5 concurrent requests per inference component copy
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution"
        },
        "ScaleInCooldown": 300,  # Cooldown period after scale-in activity
        "ScaleOutCooldown": 300,  # Cooldown period after scale-out activity
    },
)

# print(target_tracking_policy_response)
print(f"Policy ARN: [i blue]{target_tracking_policy_response['PolicyARN']}")

# print Cloudwatch Alarms
alarms = target_tracking_policy_response["Alarms"]

for alarm in alarms:
    print(f"[b]Alarm Name:[/b] [b magenta]{alarm['AlarmName']}")
    # print(f"[b]Alarm ARN:[/b] [i green]{alarm['AlarmARN']}[/i green]")
    print("===" * 15)
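
Scale-out only triggers when concurrency stays above the target. The sketch below is one hedged way to generate sustained concurrent load with a thread pool; `generate_load` is a hypothetical helper, `invoke_fn` is any zero-argument callable that sends one request, and the default concurrency and duration are assumptions to adjust for your workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def generate_load(invoke_fn, concurrency=10, duration_s=300):
    """Keep `concurrency` requests in flight for about `duration_s` seconds.

    Returns a (successes, failures) tuple summed across all workers.
    """
    deadline = time.monotonic() + duration_s

    def worker():
        ok = failed = 0
        # Each worker sends requests back to back until the deadline passes
        while time.monotonic() < deadline:
            try:
                invoke_fn()
                ok += 1
            except Exception:
                failed += 1
        return ok, failed

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        results = [future.result() for future in futures]
    return sum(r[0] for r in results), sum(r[1] for r in results)
```

With the predictor defined earlier, a call might look like `generate_load(lambda: predictor.predict({"messages": messages, **parameters}), concurrency=10)`. Keep in mind that sustained load on a GPU endpoint incurs cost, and that a second copy can only be placed if the endpoint has spare accelerator capacity or can add an instance.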

## Cleanup

- Deregister scalable target. This automatically deletes the associated scaling policy and CloudWatch alarms.
- Delete inference component
- Delete model
- Delete endpoint

In [None]:
try:
    # Deregister the scalable target for AAS
    autoscaling_client.deregister_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    )
    print(f"Scalable target for [b]{resource_id}[/b] deregistered. ✅")
except autoscaling_client.exceptions.ObjectNotFoundException:
    print(f"Scalable target for [b]{resource_id}[/b] not found.")

print("---" * 10)

try:
    print(f"Deleting inference components: [b magenta]{inference_component_name_llama3b} ✅")
    # Delete inference component
    sagemaker_client.delete_inference_component(
        InferenceComponentName=inference_component_name_llama3b
    )
except Exception as e:
    print(f"{e}")


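try:
    print(f"Deleting model: [b magenta]{model_name} ✅")
    # The endpoint config above attaches no model to the variant,
    # so delete the model created earlier by name
    sagemaker_client.delete_model(ModelName=model_name)
except Exception as e:
    print(f"{e}")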


try:
    print(f"Deleting endpoint: [b magenta]{predictor.endpoint_name} ✅")
    predictor.delete_endpoint()
except Exception as e:
    print(f"{e}")

print("---" * 10)
print("Done")

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb)