# Faster autoscaling on Amazon SageMaker realtime endpoints (Application Autoscaling)

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)


---

In this notebook we show how the new faster autoscaling feature helps scale Amazon SageMaker inference endpoints almost 6x faster than before.

We deploy Meta's `Llama3-8B-Instruct` model to an Amazon SageMaker realtime endpoint using Text Generation Inference (TGI) Deep Learning Container (DLC) and apply <span style='color:green'><b>Application Autoscaling</b></span> scaling policies to the endpoint.


<div class="alert alert-block alert-warning">
    Please select <b>m5.2xlarge</b> or larger instance types when running this on Amazon SageMaker Notebook Instance.<br/>
    Select <b>conda_pytorch_p310</b> kernel when running this notebook on Amazon SageMaker Notebook Instance. <br/><br/>
    Ensure python version for the kernel is <b>3.10.x</b> (3.11 is not supported). <br/>
</div>

---

## Prerequisites



<div style="border: 1px solid #f00; border-radius: 5px; padding: 10px; background-color: #fee;">
Before using this notebook please ensure you have access to an active access token from HuggingFace and have accepted the license agreement from Meta.

- **Step 1:** Create user access token in HuggingFace (HF). Refer [here](https://huggingface.co/docs/hub/security-tokens) on how to create HF tokens.
- **Step 2:** Log in to [HuggingFace](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main) and navigate to the **Meta-Llama-3-8B-Instruct** home page.
- **Step 3:** Accept the META LLAMA 3 COMMUNITY LICENSE AGREEMENT by following the instructions [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main).
- **Step 4:** Wait for the approval email from Meta (approval may take anywhere between 1 and 3 hours).
</div>

Install packages using uv, an extremely fast Python package installer.\
Read more about uv at <https://astral.sh/blog/uv>.

In [1]:
# ensure python version of the selected kernel is not greater than 3.10
!python --version

Python 3.10.14


In [2]:
!pip install uv && uv pip install -U ipywidgets
!uv pip install -r requirements.txt

Collecting uv
  Using cached uv-0.2.30-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (35 kB)
Using cached uv-0.2.30-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.9 MB)
Installing collected packages: uv
Successfully installed uv-0.2.30
Resolved 22 packages in 30ms
Uninstalled 13 packages in 102ms
Installed 13 packages in 30ms
 - exceptiongroup==1.2.0 (from file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work)
 + exceptiongroup==1.2.2
 - ipython==8.22.2 (from file:///home/conda/feedstock_root/build_artifacts/ipython_1709559745751/work)
 + ipython==8.26.0
 - ipywidgets==8.1.2 (from file:///home/conda/feedstock_root/build_artifacts/ipywidgets_170

In [3]:
# restart kernel
from IPython.display import HTML

HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [4]:
# load rich extension
%load_ext rich

In [5]:
import glob
import json
import os
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from getpass import getpass
from pathlib import Path
from statistics import mean
from uuid import uuid4

import boto3
import botocore
import sagemaker
from rich import box, print
from rich.console import Console
from rich.progress import Progress, SpinnerColumn, TimeElapsedColumn
from rich.table import Table
from sagemaker.deserializers import JSONDeserializer
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

from utils.autoscaling import (
    monitor_scaling_events,
    print_scaling_times,
    test_concurrency_level,
)

from utils.llmperf import (
    print_llmperf_results,
    trigger_auto_scaling,
    monitor_process,
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Initialize SageMaker session

In [6]:
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = sess.boto_region_name

boto_session = boto3.Session(region_name=region)

sagemaker_client = sess.sagemaker_client
sagemaker_runtime_client = sess.sagemaker_runtime_client
cloudwatch_client = boto3.client("cloudwatch", region_name=region)

hf_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# retrieve the llm image uri
# tgi_dlc = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1-tgi2.0-gpu-py310-cu121-ubuntu22.04"
tgi_dlc = get_huggingface_llm_image_uri("huggingface", version="2.0.0")

print(f"TGI DLC: \n[b i green]{tgi_dlc}")
print(f"Region: [b blue]{region}")
print(f"Role: [b red]{role}")

## Deploy model

Create and deploy the model using the Amazon SageMaker HuggingFace TGI DLC.

<https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy>

<div class="alert alert-block alert-warning">
<b>NOTE:</b> Remember to copy your Hugging Face Access Token from <a href="https://hf.co/">https://hf.co/</a> before running the below cell.<br/><br/>
Refer <a href="https://huggingface.co/docs/hub/security-tokens">here</a> to learn about creating HF tokens.
</div>

In [7]:
# sagemaker config
instance_type = "ml.g5.2xlarge"
suffix = f"{str(uuid4())[:5]}-{datetime.now().strftime('%d%b%Y')}"
model_name = f"Llama3-8B-fas-{suffix}"
endpoint_name = model_name
health_check_timeout = 900

HF_TOKEN = os.getenv("HUGGING_FACE_HUB_TOKEN") or getpass("Enter HUGGINGFACE Access Token: ")
# Define Model and Endpoint configuration parameter
config = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",  # model_id from hf.co/models
    "SM_NUM_GPUS": "1",  # Number of GPU used per replica
    "MAX_INPUT_LENGTH": "2048",  # Max length of input text
    "MAX_TOTAL_TOKENS": "4096",  # Max length of the generation (including input text)
    "MAX_BATCH_TOTAL_TOKENS": "8192",  # Limits the number of tokens that can be processed in parallel during the generation
    "MESSAGES_API_ENABLED": "true",  # Enable the messages API
    "HUGGING_FACE_HUB_TOKEN": HF_TOKEN,
}

# create HuggingFaceModel with the image uri
print(f"Creating model: [b green]{model_name}...")
llm_model = HuggingFaceModel(name=model_name, role=role, image_uri=tgi_dlc, env=config)

# Deploy model to Amazon SageMaker endpoint
print(f"Deploying model to endpoint: [b magenta]{endpoint_name}...")
predictor = llm_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # 15 minutes to be able to load the model
)

Enter HUGGINGFACE Access Token:  ········


-------------!

## Inference

Invoke and test endpoint using messages API. Refer to HF [Messages API](https://huggingface.co/docs/text-generation-inference/messages_api) for more info.
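
For callers that do not use the SageMaker Python SDK `Predictor`, the same Messages API request can also be sent through the low-level runtime client. The following is a minimal sketch under that assumption: `build_chat_payload` and `invoke_chat` are hypothetical helper names introduced here for illustration and are not part of this notebook's `utils` package.

```python
import json


def build_chat_payload(messages, model_id, **generation_args):
    """Return the JSON request body for a Messages API call (hypothetical helper)."""
    return json.dumps({"model": model_id, "messages": messages, **generation_args})


def invoke_chat(runtime_client, endpoint_name, body):
    """Send the body with a boto3 `sagemaker-runtime` client and parse the JSON reply."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=body,
    )
    return json.loads(response["Body"].read())


body = build_chat_payload(
    [{"role": "user", "content": "What is deep learning?"}],
    "meta-llama/Meta-Llama-3-8B-Instruct",
    max_tokens=64,
)

# With a deployed endpoint (not executed in this sketch):
# reply = invoke_chat(sagemaker_runtime_client, endpoint_name, body)
# print(reply["choices"][0]["message"]["content"])
```

Keeping payload construction separate from the network call makes the request body easy to inspect before any traffic is sent.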

In [8]:
# Prepare prompt in messages API format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"},
]

# Generation arguments
parameters = {
    "model": hf_model_id,  # model id is required
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 512,
    "stop": ["<|eot_id|>"],
}

chat = predictor.predict({"messages": messages, **parameters})

# Unpack and print response
print(chat["choices"][0]["message"]["content"].strip())

## Baseline average latency at various concurrency levels (Optional)

By capturing the average latency at various concurrency levels, we can get a fair idea of how many concurrent requests the endpoint can handle before its performance degrades significantly.

Having this information can help define values for scaling policy accordingly.
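
To make the idea concrete, the sketch below measures average latency at a given concurrency level with a thread pool. It is an illustration only and not the implementation of `test_concurrency_level` in `utils/autoscaling.py`; `average_latency` and `fake_invoke` are hypothetical names, and `fake_invoke` stands in for a real endpoint call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def average_latency(invoke, payloads, concurrency):
    """Call `invoke(payload)` for every payload on `concurrency` threads.

    Returns the mean wall-clock time of a single call, in seconds.
    """

    def timed_call(payload):
        start = time.perf_counter()
        invoke(payload)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return mean(pool.map(timed_call, payloads))


def fake_invoke(payload):
    """Stand-in for an endpoint call; each request takes about 10 ms."""
    time.sleep(0.01)


for level in (2, 4):
    latency = average_latency(fake_invoke, ["prompt"] * level, level)
    print(f"concurrency={level} average latency={latency:.3f}s")
```

Against a real endpoint, the concurrency level at which this average starts to climb sharply is a reasonable candidate for the scaling threshold.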

<div class="alert alert-block alert-info">
<b>Running below cell is optional</b><br/><br/>
<b>INFO: ℹ️</b> The signal to look for is the concurrency level at which the average latency starts to increase significantly.<br/>
At this concurrency level the endpoint is overloaded and cannot serve requests in a timely fashion.<br/>
We use these values as thresholds for autoscaling.
<br/><br/>
<b>NOTE: ⚠️</b> As concurrent requests to the endpoint increase you might observe <b>ThrottlingException</b> errors as we haven't incorporated exponential backoff and retry mechanisms.
</div>
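
The note above mentions that exponential backoff and retries are not built into the test. A minimal sketch of such a wrapper is shown here for reference; `call_with_backoff` and its parameters are hypothetical and are not used elsewhere in this notebook.

```python
import random
import time


def call_with_backoff(
    func,
    is_retryable,
    max_attempts=5,
    base_delay=0.5,
    max_delay=8.0,
    sleep=time.sleep,
):
    """Call `func()` and retry retryable errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as error:
            # Give up on the last attempt or on errors that should not be retried.
            if attempt == max_attempts - 1 or not is_retryable(error):
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2**attempt)
            sleep(random.uniform(0, delay))


# Example predicate for throttling errors raised by boto3 clients (assumption:
# the error is a botocore ClientError with the code "ThrottlingException"):
# def is_throttling(error):
#     return (
#         isinstance(error, botocore.exceptions.ClientError)
#         and error.response["Error"]["Code"] == "ThrottlingException"
#     )
```

As an alternative, botocore ships built-in retry modes that can be set on a client through `botocore.config.Config(retries={"max_attempts": 10, "mode": "adaptive"})`; whether that is sufficient depends on the traffic pattern.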

In [9]:
# Define list of prompts
prompts = [
    "what is deep learning?",
    "what are various inference modes in Amazon SageMaker?",
    "Can I host Large language models on Amazon SageMaker?",
    "Does Amazon SageMaker support TensorRT-LLM?",
    "what is step scaling policy in the context of autoscaling ec2 instances on AWS?",
    "Why is the sky blue?",
    "List 5 benefits of incorporating limes into the diet.",
]

# Test different concurrency levels and measure average latency
concurrency_levels = [10, 50, 75, 100]  # Adjust these values as needed

for concurrency_level in concurrency_levels:
    try:
        avg_latency = test_concurrency_level(
            concurrency_level,
            prompts,
            messages,
            parameters,
            endpoint_name,
            sagemaker_runtime_client,
        )
        print(
            f"[b]Concurrency:[/b] {concurrency_level} requests,"
            f" [b]Average latency:[/b] {avg_latency:.2f} seconds"
        )
    except Exception as e:
        print(f"[b]At concurrency[/b] {concurrency_level} requests, [b]Exception:[/b] \n{e}")
        continue

---

## Apply Autoscaling policies to the endpoint

Apply Application Autoscaling Policy to endpoint

1. Register Scalable Target

In [10]:
variant_name = "AllTraffic"
as_min_capacity = 1
as_max_capacity = 2

resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

autoscaling_client = boto3.client("application-autoscaling", region_name=region)

# Register scalable target
scalable_target = autoscaling_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=as_min_capacity,
    MaxCapacity=as_max_capacity,  # Replace with your desired maximum instances
)

scalable_target_arn = scalable_target["ScalableTargetARN"]
print(f"Resource ID: [b blue]{resource_id}")
print(f"Scalable_target_arn:\n[b green]{scalable_target_arn}")

## Use the latest high-resolution Metrics to trigger auto-scaling

- The feature introduces a new <span style='color:green'><b>PredefinedMetricType</b></span> for scaling policy configuration, <span style='color:green'><b>SageMakerVariantConcurrentRequestsPerModelHighResolution</b></span>, to trigger scaling actions.
- Creating a scaling policy with this metric type creates CloudWatch alarms that track a new metric called <span style='color:green'><b>ConcurrentRequestsPerModel</b></span>.
- These high-resolution metrics are published to CloudWatch at sub-minute intervals (every 10 seconds, plus any additional jitter and delays).
- We should observe a significant improvement in scale-out times with this new metric.
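
To inspect the metric directly, one option is to query CloudWatch at a 10-second period. The sketch below only builds the request arguments. It assumes that the metric lives in the `AWS/SageMaker` namespace with `EndpointName` and `VariantName` dimensions, and `concurrent_requests_query` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def concurrent_requests_query(endpoint_name, variant_name="AllTraffic", minutes=15):
    """Build kwargs for CloudWatch `get_metric_statistics` (hypothetical helper).

    Assumption: namespace and dimensions follow the usual SageMaker endpoint metrics.
    """
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ConcurrentRequestsPerModel",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 10,  # seconds, matching the 10 s publishing interval
        "Statistics": ["Maximum"],
    }


query = concurrent_requests_query("my-endpoint")  # placeholder endpoint name

# With AWS credentials and a live endpoint (not executed in this sketch):
# datapoints = cloudwatch_client.get_metric_statistics(**query)["Datapoints"]
# for point in sorted(datapoints, key=lambda p: p["Timestamp"]):
#     print(point["Timestamp"], point["Maximum"])
```

CloudWatch serves sub-minute periods only for recent data, so the query window is kept short here.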


### Steps to create Application autoscaling policy

- Create scaling policy
  - Set `PolicyType` to `TargetTrackingScaling`
  - Set `TargetValue` to `5.0`, so that a scale-out is triggered when `ConcurrentRequestsPerModel` stays above 5
  - Set `PredefinedMetricType` to `SageMakerVariantConcurrentRequestsPerModelHighResolution`
  - Set `ScaleInCooldown` and `ScaleOutCooldown` values to `300` seconds
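
As a rough mental model of how a target value translates into instance counts, the sketch below scales capacity in proportion to the load and clamps it to the registered limits. This is an assumption made for illustration: Application Auto Scaling does not document its exact algorithm in this notebook, and `estimate_desired_instances` is a hypothetical function.

```python
import math


def estimate_desired_instances(
    current_instances, metric_per_model, target_value, min_capacity, max_capacity
):
    """Estimate the instance count that brings the per-model metric back to target."""
    total_load = current_instances * metric_per_model
    wanted = math.ceil(total_load / target_value)
    # The result never leaves the [min_capacity, max_capacity] range.
    return max(min_capacity, min(max_capacity, wanted))


# One instance seeing 20 concurrent requests, target 5, limits 1..2 as in this notebook.
print(estimate_desired_instances(1, 20.0, 5.0, 1, 2))  # 2
# Two instances seeing 1 concurrent request each would fit on a single instance.
print(estimate_desired_instances(2, 1.0, 5.0, 1, 2))  # 1
```

In the first case the proportional estimate would be 4 instances, but `MaxCapacity=2` caps the result.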

In [11]:
# Create Target Tracking Scaling Policy
target_tracking_policy_response = autoscaling_client.put_scaling_policy(
    PolicyName="SageMakerEndpointScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # Scale out when ConcurrentRequestsPerModel stays above 5
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution"
        },
        "ScaleInCooldown": 300,  # Cooldown period after scale-in activity
        "ScaleOutCooldown": 300,  # Cooldown period after scale-out activity
    },
)

# print(target_tracking_policy_response)
print(f"[b]Policy ARN:[/b] [i blue]{target_tracking_policy_response['PolicyARN']}")

# print Cloudwatch Alarms
alarms = target_tracking_policy_response["Alarms"]

for alarm in alarms:
    print(f"[b]Alarm Name:[/b] [b magenta]{alarm['AlarmName']}")
    # print(f"[b]Alarm ARN:[/b] [i green]{alarm['AlarmARN']}[/i green]")
    print("===" * 15)

## Trigger autoscaling action

### LLMPerf to generate traffic to the endpoint

Refer to <https://github.com/philschmid/llmperf> for more details on LLMPerf.

Run the LLMPerf traffic generation script in the background using `subprocess.Popen`

<div class="alert alert-block alert-info">
<b>INFO: ℹ️</b> Refer to <a href="utils/llmperf.py"><b>utils/llmperf.py</b></a> for the <span style='color:red'>trigger_auto_scaling</span> function implementation.
</div>
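
The general pattern of starting a background process with extra environment variables can be sketched as follows. This is not the code in `utils/llmperf.py`; `start_background` is a hypothetical helper, and the child process here is a placeholder Python one-liner rather than the LLMPerf script.

```python
import os
import subprocess
import sys


def start_background(command, extra_env):
    """Start `command` without blocking; the child inherits the env plus `extra_env`."""
    env = {**os.environ, **extra_env}
    return subprocess.Popen(
        command,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )


# Placeholder child process that echoes one of the injected variables.
demo = start_background(
    [sys.executable, "-c", "import os; print(os.environ['AWS_REGION'])"],
    {"AWS_REGION": "us-west-2"},
)
output, _ = demo.communicate(timeout=60)
print(output.strip())  # us-west-2

# Session credentials from boto3 could be mapped in the same way (assumption):
# frozen = creds.get_frozen_credentials()
# extra_env = {
#     "AWS_ACCESS_KEY_ID": frozen.access_key,
#     "AWS_SECRET_ACCESS_KEY": frozen.secret_key,
#     "AWS_SESSION_TOKEN": frozen.token or "",
# }
```

Because `Popen` returns immediately, the notebook can keep monitoring alarms while the traffic generator runs.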

### Monitor Scale-Out Alarm Trigger times and scaling event times

As LLMPerf continuously generates traffic to the endpoint, it triggers auto scaling.

The `monitor_scaling_events` function does the following:
- Calculates the time taken for the alarm to enter the `ALARM` state.
- Checks whether the alarm is in the `ALARM` state. If so, it starts the scaling timer.
- Continuously monitors the `DesiredInstanceCount` property of the endpoint.
  - Waits until `CurrentInstanceCount == DesiredInstanceCount` and `EndpointStatus` is `InService`.
- Calculates the time taken to scale out instances and prints the times in a table.
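
The steps above can be sketched as a simple polling loop. This is an illustration and not the code in `utils/autoscaling.py`: `wait_for_scaling` and its callables are hypothetical, and the extra check against `initial_count` is an assumption added so the loop does not exit before the instance count has actually changed.

```python
import time


def wait_for_scaling(
    get_alarm_state,
    get_variant,
    initial_count,
    poll_seconds=5,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Return (seconds until ALARM, seconds from ALARM until scaling completes).

    `get_alarm_state()` returns the CloudWatch alarm state as a string.
    `get_variant()` returns (endpoint_status, current_count, desired_count).
    """
    start = clock()
    while get_alarm_state() != "ALARM":
        sleep(poll_seconds)
    alarm_at = clock()

    while True:
        status, current, desired = get_variant()
        if status == "InService" and current == desired and current != initial_count:
            break
        sleep(poll_seconds)
    return alarm_at - start, clock() - alarm_at


# Possible wiring with boto3 clients (not executed in this sketch):
# def get_alarm_state():
#     alarms = cloudwatch_client.describe_alarms(AlarmNames=[alarm_name])
#     return alarms["MetricAlarms"][0]["StateValue"]
#
# def get_variant():
#     endpoint = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
#     variant = endpoint["ProductionVariants"][0]
#     return (
#         endpoint["EndpointStatus"],
#         variant["CurrentInstanceCount"],
#         variant["DesiredInstanceCount"],
#     )
```

Passing the alarm and endpoint lookups in as callables keeps the timing logic testable without AWS access.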

The cell below triggers the auto scaling action and immediately calls `monitor_scaling_events` on the AlarmHigh alarm.

<div class="alert alert-block alert-info">
<b>INFO: ℹ️</b> Refer to <a href="utils/autoscaling.py"><b>utils/autoscaling.py</b></a> for <span style='color:red'>monitor_scaling_events</span> function implementation
</div>

<div class="alert alert-block alert-info">
<b>NOTE: ⚠️</b> The <b>AlarmHigh</b> alarm triggers a scale-out action only after the threshold of <b>ConcurrentRequestsPerModel > 5</b> has been breached for 3 datapoints within <b>30 seconds</b>.
</div>

In [12]:
# Trigger LLMPerf script to generate traffic to endpoint
num_concurrent_requests = 100
# LLMperf requires session credentials be passed in via environment variables.
# We'll use the current session to get these credentials.
creds = boto_session.get_credentials()
process = trigger_auto_scaling(creds, region, endpoint_name, num_concurrent_requests)
print(f"[b green]Process ID for LLMPerf: {process.pid}")

# get AlarmHigh alarm name
scaleout_alarm_name = [
    alarm["AlarmName"] for alarm in alarms if "AlarmHigh" in alarm["AlarmName"]
][0]

# Start monitoring scaling events
SLEEP_TIME = 5  # time to sleep
scaling_times = monitor_scaling_events(
    endpoint_name, scaleout_alarm_name, SLEEP_TIME, cloudwatch_client, sagemaker_client
)

# Print scaling times
console = Console()
table = print_scaling_times(scaling_times)
console.print(table)

Output()

### Monitor if the background process (llmperf) is completed.

In [13]:
monitor_process(process)

## Print LLMPerf results

LLMPerf writes its results to the **"results/"** directory. The `summary.json` file contains the endpoint benchmarking data.
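
For a quick look at the raw data, a summary file can be loaded with the standard library. The sketch below is not the implementation of `print_llmperf_results`; `load_latest_summary` is a hypothetical helper, the `*summary.json` file pattern is an assumption, and the demo file written here contains a placeholder key rather than real LLMPerf fields.

```python
import json
import tempfile
from pathlib import Path


def load_latest_summary(results_dir="results"):
    """Return the most recently modified `*summary.json` in `results_dir`, or None."""
    candidates = sorted(
        Path(results_dir).glob("*summary.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text())


# Demo with a temporary directory and a placeholder summary file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_summary.json").write_text(json.dumps({"example_key": 1}))
    summary = load_latest_summary(tmp)

print(summary)  # {'example_key': 1}
```

Calling `load_latest_summary()` with no arguments reads from the local `results` directory.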

In [14]:
print_llmperf_results(num_concurrent_requests)

### Monitor Scale-in Alarm Trigger times and scaling event times

<div class="alert alert-block alert-warning">
<b>NOTE: ⚠️</b> The <b>AlarmLow</b> alarm triggers a scale-in action only after the threshold of <b>ConcurrentRequestsPerModel < 4.5</b> has been breached for 90 datapoints within <b>15 minutes</b>.
<br/>Running the cell below will take approximately 15 minutes to complete.<br/>
</div>
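
The evaluation windows quoted for the two alarms follow directly from the 10-second publishing interval of the high-resolution metric. A small worked check, with `evaluation_window_seconds` as a hypothetical helper:

```python
PUBLISH_INTERVAL_SECONDS = 10  # high-resolution metric interval stated earlier


def evaluation_window_seconds(datapoints, period_seconds=PUBLISH_INTERVAL_SECONDS):
    """Minimum time an alarm needs to collect `datapoints` breaching datapoints."""
    return datapoints * period_seconds


print(evaluation_window_seconds(3))  # 30 (AlarmHigh: 3 datapoints)
print(evaluation_window_seconds(90) / 60)  # 15.0 (AlarmLow: 90 datapoints, in minutes)
```

This asymmetry is why scale-out reacts within about half a minute while scale-in waits for a sustained quiet period.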

In [15]:
# get AlarmLow alarm name
scalein_alarm_name = [alarm["AlarmName"] for alarm in alarms if "AlarmLow" in alarm["AlarmName"]][0]

# Start monitoring scaling events
SLEEP_TIME = 5  # time to sleep
scaling_times = monitor_scaling_events(
    endpoint_name, scalein_alarm_name, SLEEP_TIME, cloudwatch_client, sagemaker_client
)

# Print scaling times
console = Console()
table = print_scaling_times(scaling_times)
console.print(table)

Output()

## Cleanup

- Deregister the scalable target. This automatically deletes the associated CloudWatch alarms.
- Delete model
- Delete endpoint

In [16]:
# Deregister the scalable target for AAS
try:
    autoscaling_client.deregister_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    )
    print(f"Scalable target for [b]{resource_id}[/b] deregistered. ✅")
except autoscaling_client.exceptions.ObjectNotFoundException:
    print(f"Scalable target for [b]{resource_id}[/b] not found.")

print("---" * 10)
# Delete model and endpoint
try:
    print(f"Deleting model: [b green]{model_name} ✅")
    predictor.delete_model()
except Exception as e:
    print(f"{e}")

try:
    print(f"Deleting endpoint: [b magenta]{predictor.endpoint_name} ✅")
    predictor.delete_endpoint()
except Exception as e:
    print(f"{e}")

print("---" * 10)
print("Done")

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference|generativeai|huggingfacetgi|meta-llama|llama3-8b|faster-autoscaling|realtime-endpoints|FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb)