# Faster autoscaling on Amazon SageMaker realtime endpoints (Step Scaling)

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)


---

In this notebook, we show how the new faster autoscaling feature helps Amazon SageMaker inference endpoints scale almost 6x faster than before.

We deploy Meta's `Llama3-8B-Instruct` model to an Amazon SageMaker realtime endpoint using the Text Generation Inference (TGI) Deep Learning Container (DLC) and apply <span style='color:green'><b>Step Scaling</b></span> autoscaling policies to the endpoint.


<span class="alert alert-block alert-warning">Please use at least <strong>`m5.2xlarge`</strong> or larger instance types if running this on Amazon SageMaker Notebook Instance.</span>


## Prerequisites

<div style="border: 1px solid #f00; border-radius: 5px; padding: 10px; background-color: #fee;">
Before using this notebook please ensure you have access to an active access token from HuggingFace and have accepted the license agreement from Meta.

- **Step 1:** Create a user access token in HuggingFace (HF). See [here](https://huggingface.co/docs/hub/security-tokens) for how to create HF tokens.
- **Step 2:** Log in to [HuggingFace](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main) and navigate to the **Meta-Llama-3-8B-Instruct** home page.
- **Step 3:** Accept the META LLAMA 3 COMMUNITY LICENSE AGREEMENT by following the instructions [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main).
- **Step 4:** Wait for the approval email from Meta (approval may take anywhere between 1 and 3 hours).
</div>

Ensure the kernel's Python version is 3.10.

In [None]:
!python --version

Install packages using uv, an extremely fast Python package installer. Read more about uv at <https://astral.sh/blog/uv>.

In [None]:
!pip install uv && uv pip install -U ipywidgets
!uv pip install -r requirements.txt

Restart kernel after installing packages

In [None]:
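# Restart the kernel. The script below only works in the classic Jupyter Notebook UI;
# in other front ends, restart the kernel from the Kernel menu instead.
from IPython.display import HTML, display

# display() is needed because a bare HTML(...) expression is only rendered
# when it is the last statement in a cell.
display(HTML("<script>Jupyter.notebook.kernel.restart()</script>"))
print("Kernel restart requested.")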

In [None]:
# load rich extension
%load_ext rich

In [None]:
import glob
import json
import os
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from getpass import getpass
from pathlib import Path
from statistics import mean
from uuid import uuid4

import boto3
import botocore
import sagemaker
from botocore.config import Config
from rich import box, print
from rich.console import Console
from rich.progress import Progress, SpinnerColumn, TimeElapsedColumn
from rich.table import Table
from sagemaker.deserializers import JSONDeserializer
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

from utils.autoscaling import (
    monitor_scaling_events,
    print_scaling_times,
    test_concurrency_level,
)

from utils.llmperf import (
    print_llmperf_results,
    trigger_auto_scaling,
    monitor_process,
)

## Initiate sagemaker session

In [None]:
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = sess.boto_region_name
config = Config(retries=dict(max_attempts=10))

boto_session = boto3.Session(region_name=region)

sagemaker_client = sess.sagemaker_client
sagemaker_runtime_client = sess.sagemaker_runtime_client
cloudwatch_client = boto3.client("cloudwatch", region_name=region, config=config)

hf_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

print(f"HF Model ID: [b green]{hf_model_id}")
print(f"Region: [b blue]{region}")
print(f"Role: [b red]{role}")

## Deploy model

Create and deploy model using Amazon SageMaker HuggingFace TGI DLC

<https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy>

<div class="alert alert-block alert-warning">
<b>NOTE:</b> Remember to copy your Hugging Face Access Token from <a href="https://hf.co/">https://hf.co/</a> before running the below cell.<br/><br/>
Refer <a href="https://huggingface.co/docs/hub/security-tokens">here</a> to learn about creating HF tokens.
</div>

In [None]:
instance_type = "ml.g5.2xlarge"
suffix = f"{str(uuid4())[:5]}-{datetime.now().strftime('%d%b%Y')}"
model_name = f"Llama3-8B-fas-{suffix}"
endpoint_name = model_name
health_check_timeout = 900

HF_TOKEN = os.getenv("HUGGING_FACE_HUB_TOKEN") or getpass("Enter HUGGINGFACE Access Token: ")

# retrieve the llm image uri
# tgi_dlc = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1-tgi2.0-gpu-py310-cu121-ubuntu22.04"
tgi_dlc = get_huggingface_llm_image_uri("huggingface", version="2.0.0")

# Define Model and Endpoint configuration parameter
config = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",  # model_id from hf.co/models
    "SM_NUM_GPUS": "1",  # Number of GPU used per replica
    "MAX_INPUT_LENGTH": "2048",  # Max length of input text
    "MAX_TOTAL_TOKENS": "4096",  # Max length of the generation (including input text)
    "MAX_BATCH_TOTAL_TOKENS": "8192",  # Limits the number of tokens that can be processed in parallel during the generation
    "MESSAGES_API_ENABLED": "true",  # Enable the messages API
    "HUGGING_FACE_HUB_TOKEN": HF_TOKEN,
}

# create HuggingFaceModel with the image uri
print(f"Creating model: [b green]{model_name}...")
llm_model = HuggingFaceModel(name=model_name, role=role, image_uri=tgi_dlc, env=config)

# Deploy model to Amazon SageMaker endpoint
print(f"Deploying model to endpoint: [b magenta]{endpoint_name}...")
predictor = llm_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # 15 minutes to be able to load the model
)

## Inference

Invoke and test endpoint using messages API. Refer to HF [Messages API](https://huggingface.co/docs/text-generation-inference/messages_api) for more info.

In [None]:
# Prompt to generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"},
]

# Generation arguments
parameters = {
    "model": hf_model_id,  # model id is required
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 512,
    "stop": ["<|eot_id|>"],
}

chat = predictor.predict({"messages": messages, **parameters})

# Unpack and print response
print(chat["choices"][0]["message"]["content"].strip())

## Baseline average latency at various concurrency levels (Optional)

<div class="alert alert-block alert-info"><b>NOTE:</b> Running the following cell is optional<br/><br/>
Capturing average latency at several concurrency levels gives a fair idea of how many concurrent requests the endpoint can handle before its performance degrades significantly.<br/><br/>
This information helps you choose suitable values for the scaling policy.
</div>

<div class="alert alert-block alert-info">
<b>INFO: ℹ️</b> The signal to look for is the concurrency level at which average latency starts to increase significantly.<br/>
At this concurrency level the endpoint is overloaded and cannot serve requests in a timely fashion.<br/>
We use these values as the threshold values for autoscaling.
</div>
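
One possible way to turn the latency measurements into a threshold is sketched here. The helper `find_degradation_point` and the "2x the baseline latency" rule are assumptions made for illustration, not part of this notebook's utilities, and the sample numbers are invented placeholders rather than measured results.

```python
def find_degradation_point(latencies, factor=2.0):
    """Return the first concurrency level whose average latency is at least
    `factor` times the latency at the lowest tested level, or None."""
    levels = sorted(latencies)
    baseline = latencies[levels[0]]
    for level in levels[1:]:
        if latencies[level] >= factor * baseline:
            return level
    return None


# Placeholder values for illustration only; these are NOT measured results.
sample_latencies = {10: 2.1, 50: 2.6, 75: 4.9, 100: 9.3}

knee = find_degradation_point(sample_latencies)
print(f"Latency degrades noticeably at concurrency: {knee}")
```

A threshold somewhat below the returned level leaves headroom for new instances to come into service before the endpoint is saturated.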

In [None]:
# Define list of prompts
prompts = [
    "what is deep learning?",
    "what are various inference modes in Amazon SageMaker?",
    "Can I host Large language models on Amazon SageMaker?",
    "Does Amazon SageMaker support TensorRT-LLM?",
    "what is step scaling policy in the context of autoscaling ec2 instances on AWS?",
    "Why is the sky blue?",
    "List 5 benefits of incorporating limes into the diet.",
]

# Test different concurrency levels and measure average latency
concurrency_levels = [10, 50, 75, 100]  # Adjust these values as needed

for concurrency_level in concurrency_levels:
    try:
        avg_latency = test_concurrency_level(
            concurrency_level,
            prompts,
            messages,
            parameters,
            endpoint_name,
            sagemaker_runtime_client,
        )
        print(
            f"[b]Concurrency:[/b] {concurrency_level} requests,"
            f" [b]Average latency:[/b] {avg_latency:.2f} seconds"
        )
    except Exception as e:
        print(f"[b]At Concurrency[/b] {concurrency_level} requests, [b]Exception:[/b] \n{e}")
        continue

---

## Apply Step-Scaling autoscaling policies to endpoint

- **Step 1:** Register Scalable Target
- **Step 2:** Create Scale-Out Policy
- **Step 3:** Create Scale-In Policy
- **Step 4:** Create CloudWatch Alarms

First, register the endpoint variant as a scalable target. The step-scaling policies are attached to it in the cells that follow.

In [None]:
variant_name = "AllTraffic"
as_min_capacity = 1
as_max_capacity = 2

resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

autoscaling_client = boto3.client("application-autoscaling", region_name=region)

# Register scalable target
scalable_target = autoscaling_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=as_min_capacity,
    MaxCapacity=as_max_capacity,  # Replace with your desired maximum instances
)

scalable_target_arn = scalable_target["ScalableTargetARN"]
print(f"Resource ID: [b blue]{resource_id}")
print(f"Scalable_target_arn:\n[b green]{scalable_target_arn}")

### Create StepScaling <span style='color:green'>Scale-out</span> Policy
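
To make the `StepAdjustments` intervals concrete, here is a minimal sketch of how a breach could be mapped to a capacity change. Bounds are offsets from the alarm threshold, and a missing bound means unbounded on that side. The helpers `pick_adjustment` and `apply_adjustment` are hypothetical, and the inclusive/exclusive edge handling follows my reading of the Application Auto Scaling documentation, so treat it as an assumption rather than the service's exact logic.

```python
def pick_adjustment(metric_value, threshold, step_adjustments, scale_out=True):
    """Return the ScalingAdjustment of the step that contains the breach size."""
    delta = metric_value - threshold  # how far the metric is from the threshold
    for step in step_adjustments:
        lower = step.get("MetricIntervalLowerBound", float("-inf"))
        upper = step.get("MetricIntervalUpperBound", float("inf"))
        if scale_out:
            # Assumed: lower bound inclusive, upper bound exclusive.
            inside = lower <= delta < upper
        else:
            # Assumed: lower bound exclusive, upper bound inclusive.
            inside = lower < delta <= upper
        if inside:
            return step["ScalingAdjustment"]
    return 0


def apply_adjustment(current, adjustment, min_capacity, max_capacity):
    """Clamp the new instance count to the registered scalable-target limits."""
    return max(min_capacity, min(max_capacity, current + adjustment))


# Same step definitions as the scale-out policy in this notebook.
scale_out_steps = [
    {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 20, "ScalingAdjustment": 1},
    {"MetricIntervalLowerBound": 20, "ScalingAdjustment": 2},
]

for metric in (25, 45):
    adjustment = pick_adjustment(metric, 20.0, scale_out_steps)
    new_count = apply_adjustment(1, adjustment, min_capacity=1, max_capacity=2)
    print(f"metric={metric} adjustment={adjustment:+d} instances={new_count}")
```

With a threshold of 20 and a maximum capacity of 2, both example values end at two instances: the second step requests two additional instances, but the result is capped by the scalable target's `MaxCapacity`.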

In [None]:
# Configure step scaling scale-out policy
scale_out_policy_response = autoscaling_client.put_scaling_policy(
    PolicyName=f"{endpoint_name}-ScaleOutPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": 300,  # 5 minutes cooldown
        "MetricAggregationType": "Maximum",
        "StepAdjustments": [
            {
                "MetricIntervalLowerBound": 0,
                "MetricIntervalUpperBound": 20,
                "ScalingAdjustment": 1,  # Increase by one instance
            },
            {
                "MetricIntervalLowerBound": 20,
                "ScalingAdjustment": 2,  # Increase by 2 instances
            },
        ],
    },
)

# print(scale_out_policy_response)
scale_out_policy_arn = scale_out_policy_response["PolicyARN"]
print(f"Step scaling policy ARN: [i green]{scale_out_policy_arn}[/i green]")

### Create StepScaling <span style='color:green'>Scale-In</span> Policy

In [None]:
scale_in_policy_response = autoscaling_client.put_scaling_policy(
    PolicyName=f"{endpoint_name}-ScaleInPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": 300,  # Cooldown period after scale-in activity
        "MetricAggregationType": "Maximum",
        "StepAdjustments": [
            {
                "MetricIntervalUpperBound": 0,
                "MetricIntervalLowerBound": -20,
                "ScalingAdjustment": -1,  # Decrease by 1 instance
            },
            {"MetricIntervalUpperBound": -20, "ScalingAdjustment": -2},  # Decrease by 2 instances
        ],
    },
)

# print(scale_in_policy_response)
scale_in_policy_arn = scale_in_policy_response["PolicyARN"]
print(f"Step scaling policy ARN: [i green]{scale_in_policy_arn}[/i green]")

### Create CloudWatch alarms (Step-Scaling)

Create CloudWatch Alarms using new <span style='color:green'><b>ConcurrentRequestsPerModel</b></span> high-resolution Metric.

In [None]:
# Define the alarm parameters for scale-out
alarm_name_scale_out = f"Step-Scaling-AlarmHigh-SageMaker:{resource_id}"
metric_name = "ConcurrentRequestsPerModel"
namespace = "AWS/SageMaker"  # CloudWatch namespace that holds the metric
statistic = "Maximum"
period = 60  # length of each evaluation period, in seconds
evaluation_periods = 3
threshold = 20.0  # Threshold for scale-out
comparison_operator = "GreaterThanOrEqualToThreshold"
dimensions = [
    {"Name": "EndpointName", "Value": endpoint_name},
    {"Name": "VariantName", "Value": "AllTraffic"},
]
alarm_actions = [scale_out_policy_response["PolicyARN"]]
treat_missing_data = "ignore"

# create CloudWatch alarm for scale-out
response = cloudwatch_client.put_metric_alarm(
    AlarmName=alarm_name_scale_out,
    MetricName=metric_name,
    Namespace=namespace,
    Statistic=statistic,
    Period=period,
    EvaluationPeriods=evaluation_periods,
    Threshold=threshold,
    ComparisonOperator=comparison_operator,
    Dimensions=dimensions,
    AlarmActions=alarm_actions,
    TreatMissingData=treat_missing_data,
)

print(f"CloudWatch alarm created for scale-out:\n[b blue]{alarm_name_scale_out}")

# Define the alarm parameters for scale-in
alarm_name_scale_in = f"Step-Scaling-AlarmLow-SageMaker:{resource_id}"
comparison_operator = "LessThanOrEqualToThreshold"
threshold = 10.0  # Adjust based on your requirements
alarm_actions = [scale_in_policy_response["PolicyARN"]]

# Create CloudWatch alarm for scale-in
response = cloudwatch_client.put_metric_alarm(
    AlarmName=alarm_name_scale_in,
    MetricName=metric_name,
    Namespace=namespace,
    Statistic=statistic,
    Period=period,
    EvaluationPeriods=evaluation_periods,
    Threshold=threshold,
    ComparisonOperator=comparison_operator,
    Dimensions=dimensions,
    AlarmActions=alarm_actions,
    TreatMissingData=treat_missing_data,
)

print(f"CloudWatch alarm created for scale-in:\n[b blue]{alarm_name_scale_in}")

## Trigger autoscaling action

### Use LLMPerf to generate traffic to the endpoint

Refer to <https://github.com/philschmid/llmperf> for more details on LLMPerf.

Run the LLMPerf traffic generation script in the background using `subprocess.Popen`

<div class="alert alert-block alert-info">
<b>INFO:ℹ️</b> Refer to <a href="utils/llmperf.py">utils.llmperf</a> for `trigger_auto_scaling` function implementation
</div>
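
As a rough illustration of the pattern (not the actual `utils.llmperf` implementation), a background process can receive AWS credentials through the standard AWS environment variables. The helper name `launch_with_aws_env` and the placeholder credential values are assumptions; the child command here only echoes the region so the sketch runs without LLMPerf installed.

```python
import os
import subprocess
import sys


def launch_with_aws_env(command, access_key, secret_key, session_token, region):
    """Start `command` in the background with AWS credentials in its environment."""
    env = os.environ.copy()
    env.update(
        {
            "AWS_ACCESS_KEY_ID": access_key,
            "AWS_SECRET_ACCESS_KEY": secret_key,
            "AWS_DEFAULT_REGION": region,
        }
    )
    if session_token:  # only present for temporary credentials
        env["AWS_SESSION_TOKEN"] = session_token
    # Popen returns immediately, so the notebook can keep running other cells.
    return subprocess.Popen(
        command,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )


# Stand-in command: print the region seen by the child process.
demo_process = launch_with_aws_env(
    [sys.executable, "-c", "import os; print(os.environ['AWS_DEFAULT_REGION'])"],
    access_key="EXAMPLE",  # placeholder, not a real credential
    secret_key="EXAMPLE",  # placeholder, not a real credential
    session_token=None,
    region="us-west-2",
)
demo_output, _ = demo_process.communicate()
print(demo_output.strip())
```

In the notebook, the values could come from `boto_session.get_credentials().get_frozen_credentials()`, which exposes `access_key`, `secret_key`, and `token`.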


### Monitor Alarm Trigger times and Scaling event times
As LLMPerf continuously generates traffic to the endpoint, the load eventually triggers autoscaling.

The `monitor_scaling_events` function does the following:
- Calculates the time taken for the alarm to enter the `InAlarm` state.
- Checks whether the alarm is in the `InAlarm` state; if so, starts the scaling timer.
- Continuously monitors the `DesiredInstanceCount` property of the endpoint:
  - Waits until `CurrentInstanceCount == DesiredInstanceCount` and `EndpointStatus` is `InService`.
- Calculates the time taken to scale out and prints the times in a table.
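
The steps above can be sketched as a simple polling loop. This is a hypothetical illustration, not the code in `utils/autoscaling.py`: the function name `wait_for_scaling`, its return keys, and the fake clients are assumptions, while `describe_alarms` and `describe_endpoint` are the boto3 calls it relies on. Note that the CloudWatch API reports the alarm state as `ALARM`.

```python
import time


def wait_for_scaling(cloudwatch_client, sagemaker_client, endpoint_name, alarm_name,
                     sleep_time=5, timeout=1800, sleep=time.sleep):
    """Poll until the alarm fires, then until the endpoint finishes scaling."""
    start = time.monotonic()

    def check_timeout():
        if time.monotonic() - start > timeout:
            raise TimeoutError(f"No completed scaling event within {timeout}s")

    # Phase 1: wait for the CloudWatch alarm to enter the ALARM state.
    while True:
        alarms = cloudwatch_client.describe_alarms(AlarmNames=[alarm_name])["MetricAlarms"]
        if alarms and alarms[0]["StateValue"] == "ALARM":
            break
        check_timeout()
        sleep(sleep_time)
    alarm_at = time.monotonic()

    # Phase 2: wait for scaling to start, then for the instance counts to converge.
    scaling_started = False
    while True:
        endpoint = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
        variant = endpoint["ProductionVariants"][0]
        current = variant["CurrentInstanceCount"]
        desired = variant["DesiredInstanceCount"]
        in_service = endpoint["EndpointStatus"] == "InService"
        if current != desired or not in_service:
            scaling_started = True
        elif scaling_started:
            break
        check_timeout()
        sleep(sleep_time)

    return {
        "time_to_alarm_s": alarm_at - start,
        "time_to_scale_s": time.monotonic() - alarm_at,
        "instance_count": current,
    }


class _FakeCloudWatch:
    """Offline stand-in: reports OK once, then ALARM."""

    def __init__(self):
        self._states = iter(["OK", "ALARM"])

    def describe_alarms(self, AlarmNames):
        return {"MetricAlarms": [{"AlarmName": AlarmNames[0], "StateValue": next(self._states)}]}


class _FakeSageMaker:
    """Offline stand-in: idle, then updating, then scaled to two instances."""

    def __init__(self):
        self._snapshots = iter([("InService", 1, 1), ("Updating", 1, 2), ("InService", 2, 2)])

    def describe_endpoint(self, EndpointName):
        status, current, desired = next(self._snapshots)
        variant = {"CurrentInstanceCount": current, "DesiredInstanceCount": desired}
        return {"EndpointStatus": status, "ProductionVariants": [variant]}


demo_times = wait_for_scaling(
    _FakeCloudWatch(), _FakeSageMaker(), "demo-endpoint", "demo-alarm", sleep=lambda _: None
)
print(demo_times["instance_count"])
```

The sketch waits for scaling to start before checking for convergence, because immediately after the alarm fires the current and desired counts can still be equal.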

The cell below triggers the autoscaling action and immediately calls `monitor_scaling_events` on the scale-out alarm (AlarmHigh).

<div class="alert alert-block alert-info">
<b>INFO:ℹ️</b> Refer to <a href="utils/autoscaling.py">utils.autoscaling</a> for `monitor_scaling_events` function implementation
</div>

<div class="alert alert-block alert-info">
<b>NOTE: ⚠️</b> Per the <b>ScaleOut</b> alarm, scale-out actions start only after <b>ConcurrentRequestsPerModel >= 20</b> holds for 3 consecutive datapoints within <b>3 minutes</b>.
</div>
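
The "3 datapoints within 3 minutes" figure follows from the alarm settings: a 60-second period multiplied by 3 evaluation periods. A simplified sketch of that rule is shown here; `alarm_fires` is a hypothetical helper that ignores missing-data handling, and the datapoint values are illustrative.

```python
import operator


def alarm_fires(datapoints, threshold, evaluation_periods, compare=operator.ge):
    """True when each of the most recent `evaluation_periods` datapoints breaches."""
    if len(datapoints) < evaluation_periods:
        return False
    recent = datapoints[-evaluation_periods:]
    return all(compare(value, threshold) for value in recent)


period_seconds = 60
evaluation_periods = 3
window_seconds = period_seconds * evaluation_periods
print(f"Evaluation window: {window_seconds} seconds")

# Illustrative per-period maxima of ConcurrentRequestsPerModel.
print(alarm_fires([5, 25, 30, 22], threshold=20.0, evaluation_periods=3))
print(alarm_fires([25, 30, 12], threshold=20.0, evaluation_periods=3))
```

For the scale-in alarm the same check applies with `compare=operator.le` and a threshold of 10.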

In [None]:
# Trigger LLMPerf script to generate traffic to endpoint
num_concurrent_requests = 100
# LLMperf requires session credentials be passed in via environment variables.
# We'll use the current session to get these credentials.
creds = boto_session.get_credentials()
process = trigger_auto_scaling(creds, region, endpoint_name, num_concurrent_requests)
print(f"[b green]Process ID for LLMPerf: {process.pid}")

# Start monitoring scaling events
SLEEP_TIME = 5  # time to sleep
scaling_times = monitor_scaling_events(
    endpoint_name, alarm_name_scale_out, SLEEP_TIME, cloudwatch_client, sagemaker_client
)

# Print scaling times
console = Console()
table = print_scaling_times(scaling_times)
console.print(table)

### Monitor if the background process (llmperf) is completed.

In [None]:
# Monitor the background traffic generation process for completion
monitor_process(process)

## Print LLMPerf results

LLMPerf writes its results to the **results/** directory. The `summary.json` file contains the endpoint benchmarking data.
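
If you want to inspect the raw files yourself, a minimal sketch like the following could load them. The helper `load_summaries` and the `*summary.json` file-name pattern are assumptions, and the demo writes a placeholder file to a temporary directory so it runs without real LLMPerf output.

```python
import json
import tempfile
from pathlib import Path


def load_summaries(results_dir="results"):
    """Return {file name: parsed JSON} for every summary file under results_dir."""
    summaries = {}
    for path in sorted(Path(results_dir).glob("**/*summary.json")):
        with path.open() as handle:
            summaries[path.name] = json.load(handle)
    return summaries


# Demo with a placeholder file; the key below is not a real LLMPerf field.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo_summary.json"
    demo_file.write_text(json.dumps({"placeholder_metric": 1.0}))
    demo_summaries = load_summaries(tmp_dir)

for name, summary in demo_summaries.items():
    print(name, sorted(summary))
```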

In [None]:
print_llmperf_results(num_concurrent_requests)

### Monitor Scale-in action scaling times (Optional)


<div class="alert alert-block alert-info">
<b>NOTE: ⚠️</b> Per the <b>ScaleIn</b> alarm, scale-in actions start only after <b>ConcurrentRequestsPerModel <= 10</b> holds for 3 consecutive datapoints within <b>3 minutes</b>.
</div>

In [None]:
# Start monitoring scaling events
SLEEP_TIME = 5  # time to sleep
scaling_times = monitor_scaling_events(
    endpoint_name,
    alarm_name_scale_in,  # scale_in cloudwatch metric alarm name
    SLEEP_TIME,
    cloudwatch_client,
    sagemaker_client,
)

# Print scaling times
console = Console()
table = print_scaling_times(scaling_times)
console.print(table)

## Cleanup

- Delete cloudwatch alarms
- Delete scaling policies
- Deregister scalable target
- Delete model
- Delete endpoint
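
After the cleanup cell has run, you may want to confirm that nothing is left registered with Application Auto Scaling. The sketch below assumes a hypothetical helper, `remaining_autoscaling_resources`, built on the `describe_scalable_targets` and `describe_scaling_policies` calls; the fake client only lets the example run offline. In the notebook you would pass `autoscaling_client` and `resource_id`.

```python
def remaining_autoscaling_resources(autoscaling_client, resource_id):
    """List scalable targets and scaling policies still attached to resource_id."""
    targets = autoscaling_client.describe_scalable_targets(
        ServiceNamespace="sagemaker",
        ResourceIds=[resource_id],
    )["ScalableTargets"]
    policies = autoscaling_client.describe_scaling_policies(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
    )["ScalingPolicies"]
    return {
        "targets": [target["ResourceId"] for target in targets],
        "policies": [policy["PolicyName"] for policy in policies],
    }


class _FakeAutoScaling:
    """Offline stand-in that reports a fully cleaned-up resource."""

    def describe_scalable_targets(self, ServiceNamespace, ResourceIds):
        return {"ScalableTargets": []}

    def describe_scaling_policies(self, ServiceNamespace, ResourceId):
        return {"ScalingPolicies": []}


leftovers = remaining_autoscaling_resources(
    _FakeAutoScaling(), "endpoint/demo-endpoint/variant/AllTraffic"
)
print(leftovers)
```

Empty lists for both keys suggest the policies and the scalable target were removed.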

In [None]:
# Delete CloudWatch alarms created for Step scaling policy
alarm_names = [alarm_name_scale_out, alarm_name_scale_in]

for alarm in alarm_names:
    try:
        cloudwatch_client.delete_alarms(AlarmNames=[alarm])
        print(f"Deleted CloudWatch alarm [b]{alarm}[/b] ✅")
    except cloudwatch_client.exceptions.ResourceNotFoundException:
        print(f"CloudWatch alarm [b]{alarm}[/b] not found.")


# Delete scaling policies
print("---" * 10)
step_policies = [f"{endpoint_name}-ScaleInPolicy", f"{endpoint_name}-ScaleOutPolicy"]
for policy_name in step_policies:
    try:
        autoscaling_client.delete_scaling_policy(
            PolicyName=policy_name,
            ServiceNamespace="sagemaker",
            ResourceId=resource_id,
            ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        )
        print(f"Deleted scaling policy [i green]{policy_name} ✅")
    except autoscaling_client.exceptions.ObjectNotFoundException:
        print(f"Scaling policy [i]{policy_name}[/i] not found.")

# Deregister scalable target
try:
    autoscaling_client.deregister_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    )
    print(f"Scalable target for [b]{resource_id}[/b] deregistered. ✅")
except autoscaling_client.exceptions.ObjectNotFoundException:
    print(f"Scalable target for [b]{resource_id}[/b] not found.")

print("---" * 10)
# Delete model and endpoint
try:
    predictor.delete_model()
    # Report success only after the call returns without raising
    print(f"Deleted model: [b green]{model_name}[/b green] ✅")
except Exception as e:
    print(f"{e}")

try:
    predictor.delete_endpoint()
    print(f"Deleted endpoint: [b magenta]{predictor.endpoint_name}[/b magenta] ✅")
except Exception as e:
    print(f"{e}")

print("---" * 10)
print("Done")

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb)
