# Deploy Multiple SOTA LLMs on a single endpoint with Scale-to-Zero

In this notebook, you will learn how to use Fast Model Loader to dramatically reduce loading times for SOTA models and how to configure model-level autoscaling. This includes scaling your SageMaker endpoint in to zero instances during idle periods, which removes the previous requirement of keeping at least one instance running.

## Set up

- Fetch and import dependencies
- Initialize SageMaker environment and required clients to access AWS services

In [None]:
%pip install sagemaker --upgrade --quiet --no-warn-conflicts

In [None]:
import sagemaker
import sys
import boto3
import logging
import time
from sagemaker.session import Session
from sagemaker.s3 import S3Uploader

print(sagemaker.__version__)

In [None]:
# Create the session and clients first so they exist however the role is resolved
boto_region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session(boto_session=boto3.Session(region_name=boto_region))
sagemaker_client = boto3.client("sagemaker", region_name=boto_region)
model_bucket = sagemaker_session.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker (for example on a local machine), look the role up by name
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

## Set up your SageMaker Real-time Endpoint

### Create a SageMaker endpoint configuration

There are a few parameters we want to set up for our endpoint. We begin by creating the endpoint configuration and setting `MinInstanceCount` to 0, which allows the endpoint to scale in all the way to zero instances when not in use (see the [blog](https://aws.amazon.com/blogs/machine-learning/unlock-cost-savings-with-the-new-scale-down-to-zero-feature-in-amazon-sagemaker-inference/)). In addition, we use Managed Instance Scaling, which lets SageMaker scale the number of instances based on the scaling requirements of your inference components. Lastly, we set *RoutingStrategy* to control how the endpoint routes requests to instances and inference components; here we use `LEAST_OUTSTANDING_REQUESTS`.

In [None]:
# Set a unique name for our endpoint config
endpoint_config_name = sagemaker.utils.name_from_base("workshop-lab-3")
print(f"Endpoint config name: {endpoint_config_name}")

In [None]:
# Configure variant name and instance type for hosting
variant_name = "AllTraffic"
gpu_instance_type = "ml.g5.2xlarge"

model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

min_instance_count = 0  # Must be 0 so the endpoint can scale in to zero instances
max_instance_count = 2

sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": gpu_instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": min_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

### Create the SageMaker endpoint
Next, we create our endpoint using the endpoint config above.

In [None]:
# Set a unique endpoint name
endpoint_name = sagemaker.utils.name_from_base("workshop-lab-3")
print(f"Endpoint name: {endpoint_name}")

sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

#### We wait for our endpoint to go InService. This step can take ~4 mins

In [None]:
sagemaker_session.wait_for_endpoint(endpoint_name)

## Create Model Builder

We'll make use of the ModelBuilder class to prepare and package the model inference components. In this example, we're using the Meta Llama 3.1 8B Instruct model from SageMaker JumpStart.

Key configurations:
- Model: Meta Llama 3.1 8B Instruct (`meta-textgeneration-llama-3-1-8b-instruct`)
- Schema Builder: Defines input/output format
- Set LMI image with fast model loader support

#### Select the container image to use

In [None]:
CONTAINER_VERSION = "0.32.0-lmi14.0.0-cu126"
inference_image = "763104351884.dkr.ecr.{}.amazonaws.com/djl-inference:{}".format(sagemaker_session.boto_session.region_name, CONTAINER_VERSION)

gpu_instance_type="ml.g5.2xlarge"


print(f"Inference Image --> {inference_image}")
print(f"GPU instance --> {gpu_instance_type}")

In [None]:
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
import logging

prompt = "Falcons are"
response = "Falcons are small to medium-sized birds of prey related to hawks and eagles."

model_id = "meta-textgeneration-llama-3-1-8b-instruct"
model_name = sagemaker.utils.name_from_base("workshop-lab-3-llama")

llama_model_builder = ModelBuilder(
    model=model_id,
    name=model_name,
    role_arn=role,
    sagemaker_session=sagemaker_session,
    schema_builder=SchemaBuilder(sample_input=prompt, sample_output=response),
    log_level=logging.WARN
)

output_path = f"s3://{model_bucket}/ws-llama3-1-8b-instruct/sharding"

## Model Optimization with Fast Model Loader

Fast Model Loader streams model weights directly from S3 to GPU, bypassing traditional loading steps.
In internal testing, this approach has shown up to 15x faster model loading compared to standard deployment.

> **Note**: The optimization process may take a while to complete. The optimized model will be stored in the specified S3 output path.

In [None]:
llama_model_builder.optimize(
    instance_type=gpu_instance_type, 
    accept_eula=True, 
    output_path=output_path,
    env_vars={
        "OPTION_MAX_MODEL_LEN": "12384",
    },
    sharding_config={
            "Image": inference_image,
            "OverrideEnvironment": {
                "OPTION_TENSOR_PARALLEL_DEGREE": "1" # Number of GPU available on the instance
            }
    }
)

## Build the optimized model
After optimization, we'll build the final model artifacts and get them ready to deploy to a SageMaker endpoint. 

In [None]:
llama_model = llama_model_builder.build()

## Deploy model using Inference Components

In [None]:
from sagemaker.enums import EndpointType
from sagemaker.enums import RoutingStrategy
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements


resources = ResourceRequirements(
    requests = {
        "num_accelerators": 1, # Number of accelerators required
        "memory": 10409,  # Minimum memory required in MB (required)
        "copies": 1,
    }
)

llama_model.deploy(
    endpoint_name=endpoint_name,
    accept_eula=True,
    initial_instance_count=1,
    instance_type=gpu_instance_type,
    endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED,
    resources=resources,
)

#### Test the model with a sample prompt
Now we can invoke our endpoint with sample input to test its functionality and see the model's output.

In [None]:
import json
from sagemaker.predictor import retrieve_default 

endpoint_name = llama_model.endpoint_name 
llama_inference_component_name = llama_model.inference_component_name

llama_predictor = retrieve_default(endpoint_name, inference_component_name=llama_inference_component_name) 

payload = {
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is deep learning?"}
      ]
    }
  ],
    "max_tokens": 100,
    "temperature": 0.6,
    "top_p": 0.9,
}


response = llama_predictor.predict(payload)
print(response['choices'][0]['message']['content'])

# Print usage statistics
print("=== Token Usage ===")
usage = response['usage']
print(f"Prompt Tokens: {usage['prompt_tokens']}")
print(f"Completion Tokens: {usage['completion_tokens']}")
print(f"Total Tokens: {usage['total_tokens']}")

## Configure per model Autoscaling and Enable Cost-Saving Autoscaling with Scale-to-Zero

### Scaling policies 

Once our models are deployed and InService, we can then add the necessary scaling policies:

* A [target tracking](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking.html) policy that can scale the copy count of our inference component in to zero, and between 1 and n copies.
* A [step scaling policy](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html) that allows the endpoint to scale out from zero.

These policies work together to provide cost-effective scaling - the endpoint can scale to zero when idle and automatically scale out as needed to handle incoming requests.

### Autoscaling Helper Function

> **Note**: For the target tracking policy, Application Auto Scaling creates two CloudWatch alarms per scaling target. The first triggers scale-out after 30 seconds (3 sub-minute data points), and the second triggers scale-in after 15 minutes (90 sub-minute data points). In practice, the scaling action usually starts 1 to 2 minutes later than these windows, because the endpoint needs time to publish metrics to CloudWatch and Application Auto Scaling needs time to react.
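
The arithmetic behind the note above can be checked with a short sketch. It assumes a 10-second metric period, which is what "3 data points in 30 seconds" implies; the helper names are hypothetical and are not part of any AWS SDK.

```python
# Assumption: high-resolution metrics are evaluated on a 10-second period
# (inferred from "3 data points in 30 seconds" in the note above).
ASSUMED_PERIOD_SECONDS = 10

# Extra delay for metric publication and Auto Scaling reaction (1 to 2 minutes per the note).
EXTRA_DELAY_RANGE_SECONDS = (60, 120)


def alarm_window_seconds(datapoints, period_seconds=ASSUMED_PERIOD_SECONDS):
    """Time the metric must stay in breach before the alarm can fire."""
    return datapoints * period_seconds


def expected_trigger_range(window_seconds, extra=EXTRA_DELAY_RANGE_SECONDS):
    """Rough (earliest, latest) time until the scaling action starts."""
    return (window_seconds + extra[0], window_seconds + extra[1])


scale_out_window = alarm_window_seconds(3)   # 3 data points
scale_in_window = alarm_window_seconds(90)   # 90 data points

print(f"Scale-out alarm window: {scale_out_window} s")
print(f"Scale-in alarm window: {scale_in_window} s ({scale_in_window / 60:.0f} min)")
print(f"Expected scale-in trigger range: {expected_trigger_range(scale_in_window)} s")
```

Under these assumptions, scale-in should start roughly 16 to 17 minutes after the last request, which is worth keeping in mind when choosing the sleep durations later in this notebook.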


In [None]:
aas_client = sagemaker_session.boto_session.client("application-autoscaling")
cloudwatch_client = sagemaker_session.boto_session.client("cloudwatch")

In [None]:
def setup_autoscaling(
    inference_component_name, 
    min_capacity=0, 
    max_capacity=2, 
    target_requests_per_copy=3
):
    """
    Configure complete autoscaling: scale-to-zero + scale-out
    
    Args:
        inference_component_name (str): Name of the inference component to configure
        min_capacity (int): Minimum number of model copies (set to 0 for scale-to-zero)
        max_capacity (int): Maximum number of model copies this component can scale to
        target_requests_per_copy (int): Concurrent requests threshold per copy before scaling out
    """
    
    resource_id = f"inference-component/{inference_component_name}"
    
    # 1. Register autoscaling target
    aas_client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        MinCapacity=min_capacity,  # Configurable minimum (0 enables scale-to-zero)
        MaxCapacity=max_capacity,  # Configurable maximum
    )
    
    # 2. Target tracking policy (scales min_capacity+1 → max_capacity)
    aas_client.put_scaling_policy(
        PolicyName=f"target-tracking-{inference_component_name}",
        PolicyType="TargetTrackingScaling",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        TargetTrackingScalingPolicyConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution",
            },
            "TargetValue": target_requests_per_copy,
        },
    )
    
    # 3. Step scaling policy (only needed if min_capacity = 0)
    step_policy_response = aas_client.put_scaling_policy(
        PolicyName=f"step-scaling-{inference_component_name}",
        PolicyType="StepScaling",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Maximum",
            "Cooldown": 60,
            "StepAdjustments": [{"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}]
        },
    )
    
    # CloudWatch alarm for scale-out from zero
    cloudwatch_client.put_metric_alarm(
        AlarmName=f"scale-from-zero-{inference_component_name}",
        AlarmActions=[step_policy_response['PolicyARN']],
        MetricName='NoCapacityInvocationFailures',
        Namespace='AWS/SageMaker',
        Statistic='Maximum',
        Dimensions=[{'Name': 'InferenceComponentName', 'Value': inference_component_name}],
        Period=30,
        EvaluationPeriods=1,
        DatapointsToAlarm=1,
        Threshold=1,
        ComparisonOperator='GreaterThanOrEqualToThreshold',
    )
    
    print(f"✅ Autoscaling configured for {inference_component_name}")

### Set up autoscaling for the Llama model

In [None]:
# Retrieve the IC name
llama_inference_component_name = llama_model.inference_component_name
print(f"Llama inference component: {llama_inference_component_name}")

In [None]:
# Apply autoscaling configuration
setup_autoscaling(llama_inference_component_name)

## Testing scale-to-zero behavior

### IC copy count scales in to zero
We'll pause for about 15 minutes without invoking our model. Based on our target tracking policy, when the SageMaker endpoint receives no requests for about 10 to 15 minutes, the number of model copies automatically scales in to zero.

In [None]:
time.sleep(900)
start_time = time.time()
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=llama_inference_component_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

desc = sagemaker_client.describe_inference_component(InferenceComponentName=llama_inference_component_name)
print(desc)

### Endpoint's instances scale in to zero

After 10 additional minutes of inactivity, SageMaker automatically terminates all underlying instances of the endpoint, eliminating all associated costs.

In [None]:
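# After about 10 more idle minutes, the instances scale in to 0
time.sleep(600)
sagemaker_session.wait_for_endpoint(endpoint_name)

# Verify that CurrentInstanceCount is zero
endpoint_desc = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
for variant in endpoint_desc["ProductionVariants"]:
    print(f"{variant['VariantName']}: CurrentInstanceCount = {variant.get('CurrentInstanceCount')}")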

#### Invoke llama model with a sample prompt

If we try to invoke our endpoint while instances are scaled down to zero, we get a validation error: `An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.`

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "What is deep learning?"}
    ],
    "max_tokens": 100,
}
try:
    response = llama_predictor.predict(payload)
    print(response['choices'][0]['message']['content'])

except Exception as e:
    print(f"   Reason: {str(e)}")  
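
A client that prefers to wait for capacity rather than fail immediately could wrap the call in a retry loop. The following is a minimal sketch, not part of the SageMaker SDK: `invoke_with_retry` is a hypothetical helper, and matching on the substring "no capacity" is an assumption based on the error message quoted above.

```python
import time


def invoke_with_retry(invoke, payload, max_attempts=10, wait_seconds=30, sleep=time.sleep):
    """Call invoke(payload), retrying while the component reports no capacity."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke(payload)
        except Exception as error:
            # Retry only the "no capacity" error; re-raise anything else unchanged.
            if "no capacity" not in str(error):
                raise
            last_error = error
            if attempt < max_attempts:
                sleep(wait_seconds)
    raise TimeoutError(f"No capacity after {max_attempts} attempts") from last_error


# Usage with the predictor from this notebook (commented out because it calls AWS):
# response = invoke_with_retry(llama_predictor.predict, payload)
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting. The attempt count and wait time are illustrative and should be tuned to how long your instance and model take to start.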

### Scale out from zero kicks in
However, after about 1 minute our step scaling policy should kick in. SageMaker will then start provisioning a new instance and deploy our inference component model copy to handle requests. This demonstrates the endpoint's ability to automatically scale out from zero when needed.

In [None]:
start_time = time.time()
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=llama_inference_component_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")
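
The polling loop above runs until the component reaches a terminal status. A bounded variant that gives up after a timeout could be sketched as follows; `wait_for_status` is a hypothetical helper, and the default timeout and poll interval are assumptions rather than recommended values.

```python
import time


def wait_for_status(
    get_status,
    terminal=("InService", "Failed"),
    timeout_seconds=1800,
    poll_seconds=30,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll get_status() until it returns a terminal status or the timeout expires.

    Returns a (status, elapsed_seconds) tuple.
    """
    start = clock()
    while True:
        status = get_status()
        if status in terminal:
            return status, clock() - start
        if clock() - start >= timeout_seconds:
            raise TimeoutError(f"Still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Usage with this notebook's client (commented out because it calls AWS):
# status, elapsed = wait_for_status(
#     lambda: sagemaker_client.describe_inference_component(
#         InferenceComponentName=llama_inference_component_name
#     )["InferenceComponentStatus"]
# )
```

Passing the status lookup in as a callable keeps the helper independent of any particular AWS call, so the same loop can be reused for endpoints or other inference components.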

#### Verify that our endpoint has successfully scaled out from zero

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "What is deep learning?"}
    ],
    "max_tokens": 100,
}
try:
    response = llama_predictor.predict(payload)
    print(response['choices'][0]['message']['content'])

except Exception as e:
    print(f"   Reason: {str(e)}")  

## Testing scaling behavior 1->n

### Helper functions

In [None]:
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def make_prediction(predictor, payload, request_id):
    """Make a single prediction and return results with timing"""
    start_time = time.time()
    try:
        response = predictor.predict(payload)
        duration = time.time() - start_time
        return {
            'request_id': request_id,
            'success': True,
            'duration': duration,
            'response': response['choices'][0]['message']['content'][:50] + "..." if 'choices' in response else response[0]['generated_text'][:50] + "..."
        }
    except Exception as e:
        duration = time.time() - start_time
        return {
            'request_id': request_id,
            'success': False,
            'duration': duration,
            'error': str(e)
        }

def get_current_copy_count(inference_component_name):
    """Get the current number of inference component copies"""
    try:
        response = sagemaker_client.describe_inference_component(
            InferenceComponentName=inference_component_name
        )
        return response['RuntimeConfig']['CurrentCopyCount']
    except Exception as e:
        return "Error"

### Autoscaling demonstration

In the following code, we will run 6 iterations with 6 concurrent requests each.

What we will observe:
- Initial low component count
- Concurrent load being applied
- Component count increasing as autoscaling kicks in
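
To get a feel for why this load should end with 2 copies, the sketch below estimates the copy count from concurrency. It is a simplified model, not the actual target tracking algorithm: it assumes the desired count is the concurrent load divided by the per-copy target, rounded up and clamped to the registered capacity limits. The defaults mirror the `setup_autoscaling` defaults used earlier, and the function name is hypothetical.

```python
import math


def estimate_desired_copies(concurrent_requests, target_per_copy=3, min_copies=0, max_copies=2):
    """Rough estimate of the copy count that keeps per-copy concurrency at the target."""
    if target_per_copy <= 0:
        raise ValueError("target_per_copy must be positive")
    needed = math.ceil(concurrent_requests / target_per_copy)
    # Clamp to the MinCapacity / MaxCapacity registered for the scalable target.
    return max(min_copies, min(max_copies, needed))


for load in (0, 3, 6, 12):
    print(f"{load} concurrent requests -> about {estimate_desired_copies(load)} copies")
```

With 6 concurrent requests and a target of 3 per copy, the estimate is 2 copies, and any higher load is capped at `max_copies`. Real scaling also depends on the alarm windows and cooldowns, so the count will not change instantly.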

In [None]:
num_iterations = 6
requests_per_iteration = 6
wait_between_iterations = 10

print("Starting Inference load")
print("="*50)

for iteration in range(1, num_iterations + 1):
    print(f"\nIteration {iteration}/{num_iterations}")
    
    # Show current inference component count
    current_copies = get_current_copy_count(llama_predictor.component_name)
    print(f"Current inference component copies: {current_copies}")
    
    # Make concurrent requests
    print(f"Sending {requests_per_iteration} concurrent requests...")
    start_time = time.time()
    
    with ThreadPoolExecutor(max_workers=requests_per_iteration) as executor:
        futures = [
            executor.submit(make_prediction, llama_predictor, payload, f"{iteration}-{i+1}") 
            for i in range(requests_per_iteration)
        ]
        
        results = [future.result() for future in as_completed(futures)]
    
    # Summary
    successful = sum(1 for r in results if r['success'])
    total_time = time.time() - start_time
    
    if successful > 0:
        avg_response_time = sum(r['duration'] for r in results if r['success']) / successful
        print(f"✅ Results: {successful}/{requests_per_iteration} successful")
        print(f"⏱️ Total: {total_time:.1f}s, Average response: {avg_response_time:.1f}s")
    else:
        print(f"❌ All requests failed")
    
    # Show any failures
    failures = [r for r in results if not r['success']]
    if failures:
        print(f"⚠️  {len(failures)} requests failed")
    
    # Wait between iterations (except last one)
    if iteration < num_iterations:
        print(f"⏳ Waiting {wait_between_iterations}s before next iteration...")
        time.sleep(wait_between_iterations)

print(f"\n Inference load complete!")
print("Watch how the inference component count changed during the load test.")

#### Check that our model has scaled to 2 copies

> **Note**: You can rerun the cell above and increase the number of iterations.

In [None]:
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=llama_inference_component_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

current_copies = get_current_copy_count(llama_predictor.component_name)
print(f"Current Inference Components: {current_copies}")

In [None]:
payload={
    "messages": [
        {"role": "user", "content": "What is deep learning?"}
    ],
    "max_tokens": 100,
}
try:
    response = llama_predictor.predict(payload)
    print(response['choices'][0]['message']['content'])

except Exception as e:
    print(f"   Reason: {str(e)}")  

### Note
**If you do not have a multi-GPU instance, you can skip to the cleanup section.**

## (Optional, multi-GPU instance) Deploy the Mistral 7B Instruct model on the same endpoint

Now we will deploy the Mistral 7B Instruct model on the same endpoint that we created earlier for the Llama model.
> **Please note that this requires a multi-GPU instance.**

In [None]:
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
import logging

prompt = "Falcons are"
response = "Falcons are small to medium-sized birds of prey related to hawks and eagles."

model_id = "huggingface-llm-mistral-7b-instruct"

model_name = sagemaker.utils.name_from_base("workshop-lab-3-mistral")

mistral_model_builder = ModelBuilder(
    model=model_id,
    name=model_name,
    role_arn=role,
    sagemaker_session=sagemaker_session,
    schema_builder=SchemaBuilder(sample_input=prompt, sample_output=response),
    log_level=logging.WARN
)

output_path = f"s3://{model_bucket}/ws-mistral-7b-instruct/sharding"

In [None]:
import types

# Set the mistral model image to LMI
mistral_model_builder._get_default_vllm_image = types.MethodType(
    lambda self, image: inference_image, 
    mistral_model_builder
)

In [None]:
mistral_model_builder.optimize(
    instance_type=gpu_instance_type,
    accept_eula=True, 
    output_path=output_path,
    env_vars={
        "OPTION_MAX_MODEL_LEN": "12384",
        "OPTION_GPU_MEMORY_UTILIZATION": "0.87",
    },
    sharding_config={
            "Image": inference_image,
            "OverrideEnvironment": {
                "OPTION_TENSOR_PARALLEL_DEGREE": "1" # Number of GPU available on the instance
            }
    }
)

In [None]:
mistral_model = mistral_model_builder.build()

In [None]:
from sagemaker.enums import EndpointType
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements


resources = ResourceRequirements(
    requests = {
        "num_accelerators": 1, # Number of accelerators required
        "memory": 10409,  # Minimum memory required in MB (required)
        "copies": 1,
    }
)

mistral_model.deploy(
    endpoint_name=llama_model.endpoint_name,
    accept_eula=True,
    initial_instance_count=1,
    instance_type=gpu_instance_type,
    endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED,
    resources=resources,
)

#### Invoke the mistral model with a sample prompt

In [None]:
import boto3
import json

from sagemaker.predictor import retrieve_default 

endpoint_name = mistral_model.endpoint_name 
mistral_inference_component_name = mistral_model.inference_component_name

mistral_predictor = retrieve_default(endpoint_name, inference_component_name=mistral_inference_component_name) 

# Define the prompt and other parameters
prompt = """
<s>[INST] Below is the question based on the context. 
Question: Given a reference text about Lollapalooza, where does it take place, who started it and what is it?
Below is the given context: Lollapalooza /ˌlɒləpəˈluːzə/ (Lolla) is an annual American four-day music festival held in Grant Park in Chicago. 
It originally started as a touring event in 1991, but several years later, Chicago became its permanent location. Music genres include but are not limited to alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. Lollapalooza has also featured visual arts, nonprofit organizations, and political organizations. 
The festival, held in Grant Park, hosts an estimated 400,000 people each July and sells out annually. Lollapalooza is one of the largest and most iconic music festivals in the world and one of the longest-running in the United States. Lollapalooza was conceived and created in 1991 as a farewell tour by Perry Farrell, singer of the group Jane's Addiction.
Write a response that appropriately completes the request.[/INST]
"""
 
max_tokens_to_sample = 200

# Generation parameters for the LLM
# This is a non-streaming request: the cell reads the complete response body below
parameters = {
    "max_new_tokens": max_tokens_to_sample,
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.5,
}

contentType = 'application/json'

payload = {
    'inputs': prompt,
    "parameters": parameters
}

response = mistral_predictor.predict(payload)
print(response["generated_text"])

### Set up autoscaling for the Mistral model

In [None]:
mistral_inference_component_name = mistral_model.inference_component_name
print(f"Mistral inference component: {mistral_inference_component_name}")

In [None]:
# Apply autoscaling configuration
setup_autoscaling(mistral_inference_component_name)

In [None]:
current_copies = get_current_copy_count(mistral_predictor.component_name)
print(f"Current Inference Components: {current_copies}")

### Understanding Model Copies and scale-to-zero

**Important**: To scale to zero on an endpoint with multiple inference components, every component must either be scaled in to 0 copies or deleted.

**Model Copy**: One loaded instance of your model in GPU memory
- Copies can share GPU instances (depending on model size and available accelerators)

**Instance**: The underlying compute resource (e.g., ml.g5.2xlarge)
- SageMaker automatically manages instances based on copy demands
- Multiple small model copies can share one instance
- Large model copies might need dedicated instances
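
The relationship between copies and instances can be sketched with simple arithmetic. The helpers below are hypothetical and deliberately simplified: they only consider accelerator count and memory, and they treat all instance memory as usable, whereas in practice some of it is reserved. The instance sizes are illustrative assumptions.

```python
import math


def copies_per_instance(instance_accelerators, instance_memory_mb, copy_accelerators, copy_memory_mb):
    """How many model copies fit on one instance, limited by accelerators and memory."""
    if copy_accelerators < 1 or copy_memory_mb < 1:
        raise ValueError("each copy must request at least 1 accelerator and 1 MB")
    by_accelerators = instance_accelerators // copy_accelerators
    by_memory = instance_memory_mb // copy_memory_mb
    return min(by_accelerators, by_memory)


def instances_needed(total_copies, per_instance):
    """Instances required to host total_copies; zero copies need zero instances."""
    if total_copies == 0:
        return 0
    if per_instance == 0:
        raise ValueError("a single copy does not fit on this instance type")
    return math.ceil(total_copies / per_instance)


# Each copy in this notebook requests 1 accelerator and 10409 MB of memory.
# Assumed instance sizes: a single-GPU instance with 32 GiB, and a 4-GPU instance with 192 GiB.
single_gpu = copies_per_instance(1, 32 * 1024, 1, 10409)
multi_gpu = copies_per_instance(4, 192 * 1024, 1, 10409)

print(f"Single-GPU instance: {single_gpu} copy per instance, "
      f"{instances_needed(2, single_gpu)} instances for 2 copies")
print(f"4-GPU instance: {multi_gpu} copies per instance, "
      f"{instances_needed(2, multi_gpu)} instance for 2 copies")
print(f"All components at 0 copies: {instances_needed(0, single_gpu)} instances")
```

On a single-GPU instance the accelerator is the limiting resource, so every extra copy needs another instance, while a multi-GPU instance can host the Llama and Mistral copies side by side.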

## Clean up the environment

- Deregister scalable target
- Delete cloudwatch alarms
- Delete scaling policies

In [None]:
from utils.cleanup import cleanup_autoscaling, cleanup_workshop_resources

# Clean up autoscaling
workshop_components = []
if 'llama_inference_component_name' in locals():
    workshop_components.append(llama_inference_component_name)
if 'mistral_inference_component_name' in locals():
    workshop_components.append(mistral_inference_component_name)

if workshop_components:
    cleanup_autoscaling(workshop_components, aas_client, cloudwatch_client)


- Delete inference component
- Delete endpoint
- Delete endpoint-config
- Delete model

In [None]:
# Clean up other resources

cleanup_workshop_resources(llama_model.name, endpoint_config_name, endpoint_name, sagemaker_client)

print("✅ Workshop cleanup complete!")