# Enhancing deployment guardrails with inference component rolling update for Amazon SageMaker AI Inference

## Introduction

In today's machine learning landscape, deploying models efficiently, reliably, and cost-effectively is a critical challenge for organizations of all sizes. As organizations increasingly deploy foundation models (FMs) and other ML models to production, they face challenges related to resource utilization, cost efficiency, and maintaining high availability during updates. Amazon SageMaker AI introduced Inference Component (IC) functionality that can help organizations reduce model deployment costs by optimizing resource utilization through intelligent model packing and scaling. Inference components abstract ML models and enable assigning dedicated resources and specific scaling policies per model.

However, updating these models—especially in production environments with strict latency SLAs—has historically risked downtime or resource bottlenecks. Traditional blue/green (B/G) deployments often struggle with capacity constraints, making updates unpredictable for GPU-heavy models. To address this, we're excited to announce another powerful enhancement to Amazon SageMaker AI: rolling update for inference components, a feature designed to streamline updates for models of all sizes while minimizing operational overhead. 

In this post, we first discuss the challenges organizations face when updating models in production. We then dive deep into the new rolling update feature for inference components and provide practical examples using DeepSeek distilled models. Finally, we explore how to set up rolling updates in different scenarios.

## Challenges with Blue/Green Deployment

Traditionally, SageMaker AI inference has supported the blue/green deployment pattern for updating ICs in production. While effective for many scenarios, this approach comes with specific challenges:

* Resource Inefficiency: Blue/Green deployment requires provisioning resources for both the current (blue) and new (green) environments simultaneously. For inference components running on expensive GPU instances like P4d or G5, this means potentially doubling the resource requirements during deployments. Consider an example where a customer has 10 copies of an inference component spread across 5 x ml.p4d.24xlarge instances, all operating at full capacity. With Blue/Green deployment, SageMaker would need to provision 5 additional ml.p4d.24xlarge instances to host the new version of the inference component before switching traffic and decommissioning the old instances.
* Limited Computing Resources: For customers using powerful GPU instances like P and/or G series, the required capacity might simply not be available in a given Availability Zone or Region. This often results in Instance Capacity Exceptions (ICE) during deployments, causing update failures and rollbacks.
* All-or-Nothing Transitions: Traditional blue/green deployments shift all traffic at once or based on a configured schedule. This leaves limited room for gradual validation and increases the blast radius if issues arise with the new deployment.

While blue/green deployment has been a reliable strategy for zero-downtime updates, its limitations become glaring when deploying large-scale LLMs or high-throughput models on premium GPU instances. These challenges demand a more nuanced approach—one that incrementally validates updates while optimizing resource usage. Enter rolling updates for inference components, a paradigm designed to eliminate the rigidity of blue/green. By updating models in controlled batches, dynamically scaling infrastructure, and integrating real-time safety checks, this strategy ensures deployments remain cost-effective, reliable, and adaptable—even for GPU-heavy workloads. Let’s explore how it works.
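
To make the capacity difference concrete, the arithmetic for the 10-copy example above can be sketched as follows. This is a simplified model, not a description of how SageMaker AI actually schedules capacity: it assumes the fleet is fully packed, every copy needs the same number of accelerators, and a rolling batch is fully provisioned before the copies it replaces are removed. The function names and the batch size of 2 are illustrative choices, not values from the service.

```python
import math


def extra_instances_blue_green(copies: int, copies_per_instance: int) -> int:
    """Blue/green stands up a complete second fleet before shifting traffic."""
    return math.ceil(copies / copies_per_instance)


def extra_instances_rolling(batch_size: int, copies_per_instance: int) -> int:
    """Rolling only needs headroom for one batch of new copies (assumed model)."""
    return math.ceil(batch_size / copies_per_instance)


# Example from the text: 10 copies packed onto 5 instances, so 2 copies per instance.
copies, instances = 10, 5
copies_per_instance = copies // instances

blue_green_extra = extra_instances_blue_green(copies, copies_per_instance)
rolling_extra = extra_instances_rolling(batch_size=2, copies_per_instance=copies_per_instance)

print(f"blue/green: {blue_green_extra} extra instances")
print(f"rolling, batch of 2 copies: {rolling_extra} extra instance(s)")
```

Under these assumptions, the peak extra capacity depends on the batch size rather than on the total size of the fleet.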


## Introducing Rolling Deployment for Inference Component Update


Before diving deeper, let's briefly recap what Inference Components are. Introduced as a SageMaker feature to optimize costs, Inference Components (ICs) allow you to define and deploy the specific resources needed for your model inference workload. By right-sizing compute resources to match your model's requirements, you can achieve significant cost savings compared to traditional deployment approaches.


In this notebook, we demonstrate the process of a rolling upgrade for an Inference Component by updating it from running Meta's Llama 3.1 8B Instruct to DeepSeek's DeepSeek R1 Distill Llama 8B.

In [None]:
!pip install --upgrade boto3 botocore

In [None]:
import sagemaker
import boto3

sagemaker_client = boto3.client('sagemaker')
sagemaker_runtime_client = boto3.client('sagemaker-runtime')

session=sagemaker.session.Session()
role = sagemaker.get_execution_role()
prefix = sagemaker.utils.unique_name_from_base("rolling-upgrade")

Inference Components allow us to create a SageMaker AI Endpoint that does not initially have a model running when deployed.

Below we create an endpoint using the `ml.g5.2xlarge` instance with 1 GPU per instance. We configure Managed Instance Scaling from a minimum of 1 to a maximum of 4. SageMaker AI will automatically scale the number of instances behind the endpoint to ensure we can fit the requested inference component count on it.

In [None]:
endpoint_config_name = f"{prefix}-endpoint-config"

variant_name="AllTraffic"
instance_type="ml.g5.2xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

initial_instance_count = 1
max_instance_count = 4

sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 2,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": initial_instance_count,
                "MaxInstanceCount": max_instance_count
            },
        },
    ],
)

In [None]:
endpoint_name = f"{prefix}-endpoint"

sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)

In [None]:
session.wait_for_endpoint(endpoint_name)

Once the endpoint is created, we create our inference component and request that it run with 2 copies, each using 1 accelerator device. We use Hugging Face's Text Generation Inference (TGI) container, which can download our model from the Hugging Face Hub and load it onto the GPU. We also set the `MESSAGES_API_ENABLED` environment variable so that the inference interface accepts the Messages API format.

We set the `HF_MODEL_ID` environment variable to `'meta-llama/Llama-3.1-8B-Instruct'` and configure the `HF_TOKEN` environment variable to a valid token. This will allow the endpoint to download the gated model.

In [None]:
from sagemaker import image_uris
region = session.boto_region_name  # public attribute; avoids relying on the private _region_name
image_uri = image_uris.retrieve(framework='huggingface-llm', region=region, version='2.2.0')

In [None]:
inference_component_name=f"llama-3-1-8b-{prefix}"
# variant_name="AllTraffic"
number_of_accelerator_devices_required=1
min_memory_required_in_mb=1024

hf_token = "<INSERT_TOKEN_HERE>"

assert hf_token != "<INSERT_TOKEN_HERE>"

sagemaker_client.create_inference_component(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        'Container': {
            'Image': image_uri,
            'Environment': {
                'SM_NUM_GPUS': str(number_of_accelerator_devices_required),
                'HF_MODEL_ID': 'meta-llama/Llama-3.1-8B-Instruct',
                'HF_TOKEN': hf_token,
                "MESSAGES_API_ENABLED": "true",
            }
        },
        'ComputeResourceRequirements': {
            'NumberOfAcceleratorDevicesRequired': number_of_accelerator_devices_required,
            'MinMemoryRequiredInMb': min_memory_required_in_mb
        }
    },
    RuntimeConfig={
        'CopyCount': 2
    }
)

In [None]:
import time
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)
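
The polling loop above runs until the component reaches a terminal status, with no upper bound on the wait. A minimal sketch of a reusable variant with a timeout is shown below. The helper name `wait_for_inference_component` and its parameters are hypothetical, and `client` is assumed to be the `sagemaker_client` created earlier (any object with a compatible `describe_inference_component` method works). `FailureReason` is read with `.get` so the helper does not depend on that field being present.

```python
import time


def wait_for_inference_component(client, name, poll_seconds=30, timeout_seconds=3600,
                                 sleep=time.sleep, clock=time.monotonic):
    """Poll DescribeInferenceComponent until InService, Failed, or timeout."""
    deadline = clock() + timeout_seconds
    while True:
        desc = client.describe_inference_component(InferenceComponentName=name)
        status = desc["InferenceComponentStatus"]
        if status == "InService":
            return desc
        if status == "Failed":
            reason = desc.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Inference component {name} failed: {reason}")
        if clock() >= deadline:
            raise TimeoutError(f"{name} still {status} after {timeout_seconds}s")
        sleep(poll_seconds)


# Usage (with the client and name defined earlier in the notebook):
# desc = wait_for_inference_component(sagemaker_client, inference_component_name)
```

Passing `sleep` and `clock` as parameters is a design choice that lets the helper be exercised without real waiting.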


In [None]:
desc

Once our Inference Component is `InService`, we can invoke it with the InvokeEndpoint API as shown below.

In [None]:
import json


messages = [
        {"role": "system", "content": "You are a helpful assistant that thinks and reasons before answering."},
        {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"}
    ]

payload = {
        "messages": messages,
        "max_tokens": 512,
        "temperature": 0.6
}

response_model = sagemaker_runtime_client.invoke_endpoint( 
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name, 
    Body=json.dumps(payload), 
    ContentType="application/json", Accept="application/json")

print(json.loads(response_model['Body'].read().decode('utf-8'))['choices'][0]['message']['content'])
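
The same response-parsing expression is repeated in the later cells of this notebook. A small helper can make it easier to read and gives a clearer error when the body does not have the expected shape. This is only a sketch: the name `extract_message_content` is hypothetical, it assumes the response shape used in the cell above (`choices[0].message.content`), and the sample body at the bottom is made up for illustration.

```python
import json


def extract_message_content(body_bytes: bytes) -> str:
    """Return the assistant text from a Messages API style response body."""
    response = json.loads(body_bytes.decode("utf-8"))
    choices = response.get("choices") or []
    if not choices:
        raise ValueError(f"response has no choices: {response}")
    return choices[0]["message"]["content"]


# Made-up body, shaped like the response parsed in the cell above.
sample_body = json.dumps(
    {"choices": [{"message": {"role": "assistant", "content": "There are 3."}}]}
).encode("utf-8")

print(extract_message_content(sample_body))
```

In the notebook, this would be called as `extract_message_content(response_model['Body'].read())`.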

We can configure our Inference Component update to track a CloudWatch Alarm during the update period. If the alarm is triggered then the deployment rolls back to the previous state.

We can configure an Amazon CloudWatch alarm that fires when we see more than 5 4xx errors from our Inference Component within a 5-minute period. When using TGI, you can see 4xx errors when there's an input mismatch (i.e., if the container isn't configured to support the Messages API).

You can configure the alarm to track any CloudWatch metric. Make sure the metrics you choose accurately reflect potential deployment errors, so that the alarm fires when an update goes wrong.

In [None]:
cloudwatch = boto3.client('cloudwatch')

# Create alarm
cloudwatch.put_metric_alarm(
    AlarmName=f'SageMaker-{endpoint_name}-4xx-errors',
    ComparisonOperator='GreaterThanThreshold',
    EvaluationPeriods=1,
    MetricName='Invocation4XXErrors',
    Namespace='AWS/SageMaker',
    Period=300,
    Statistic='Sum',
    Threshold=5.0,
    ActionsEnabled=True,
    AlarmDescription='Alarm when greater than 5 4xx errors',
    Dimensions=[
        {
          'Name': 'InferenceComponentName',
          'Value': inference_component_name
        },
    ],
)
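
To illustrate how the settings above combine (`Statistic='Sum'`, `Period=300`, `GreaterThanThreshold`, `Threshold=5.0`, `EvaluationPeriods=1`), here is a simplified, offline model of the evaluation. It is an assumption-laden sketch, not CloudWatch's actual algorithm: it ignores missing-data handling, the `INSUFFICIENT_DATA` state, and evaluation timing, and the function name `evaluate_alarm` is hypothetical.

```python
from collections import Counter


def evaluate_alarm(error_timestamps, period=300, threshold=5.0, evaluation_periods=1):
    """Sum errors per period and compare the most recent periods to the threshold."""
    if not error_timestamps:
        return "OK"
    # Bucket each error (timestamp in seconds) into its period and count per bucket.
    buckets = Counter(int(ts // period) for ts in error_timestamps)
    latest = max(buckets)
    recent = [buckets.get(latest - i, 0) for i in range(evaluation_periods)]
    # GreaterThanThreshold is a strict comparison: exactly 5 errors does not alarm.
    return "ALARM" if all(total > threshold for total in recent) else "OK"


five_errors = [10, 20, 30, 40, 50]
six_errors = five_errors + [60]

print(evaluate_alarm(five_errors))
print(evaluate_alarm(six_errors))
```

The point of the sketch is the strict comparison: with a threshold of 5, it is the sixth 4xx error within a single period that moves the alarm into the alarm state.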

Below we update our inference component to use `'deepseek-ai/DeepSeek-R1-Distill-Llama-8B'` from the Hugging Face Hub. 

We do not set the `MESSAGES_API_ENABLED` environment variable, which means invocations that use the previous payload format will fail with 4xx errors on the new copies because the payload is not compatible.

When updating an inference component, the new copies must expose backwards-compatible APIs, because the service can route a request to any copy, old or new.
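
A rough way to reason about the impact of an incompatible copy is to estimate the share of requests that reach it. The sketch below assumes requests are spread uniformly across all running copies, which is an assumption made for illustration and not a documented routing guarantee. The mix of 2 old copies and 1 new copy is also illustrative, and the helper names are hypothetical.

```python
import math
from fractions import Fraction


def failure_fraction(old_copies: int, incompatible_new_copies: int) -> Fraction:
    """Share of requests expected to hit an incompatible copy (uniform routing assumed)."""
    total = old_copies + incompatible_new_copies
    if total <= 0:
        raise ValueError("at least one copy is required")
    return Fraction(incompatible_new_copies, total)


def requests_until_alarm(fraction: Fraction, threshold: int = 5):
    """Expected number of requests before more than `threshold` of them fail."""
    if fraction == 0:
        return None  # compatible copies never produce these failures
    return math.ceil((threshold + 1) / fraction)


mix = failure_fraction(old_copies=2, incompatible_new_copies=1)
print(mix)                        # 1/3
print(requests_until_alarm(mix))  # 18
```

Under these assumptions, roughly one request in three fails while a single incompatible copy is in rotation, so a threshold of 5 errors is crossed after about 18 requests.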

In [None]:
sagemaker_client.update_inference_component(
    InferenceComponentName=inference_component_name,
    Specification={
        'Container': {
            'Image': image_uri,
            'Environment': {
                'SM_NUM_GPUS': str(number_of_accelerator_devices_required),
                'HF_MODEL_ID': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B',
                # "MESSAGES_API_ENABLED": "true",
            }
        },
        'ComputeResourceRequirements': {
            'NumberOfAcceleratorDevicesRequired': number_of_accelerator_devices_required,
            'MinMemoryRequiredInMb': min_memory_required_in_mb
        }
    },
    DeploymentConfig={
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            },
            "WaitIntervalInSeconds": 120,
            "RollbackMaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            }
        },
        'AutoRollbackConfiguration': {
            "Alarms": [
                {"AlarmName": f'SageMaker-{endpoint_name}-4xx-errors'}
            ]
        }
    }
)
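
To see what the `RollingUpdatePolicy` above implies for our 2-copy component, the batch plan can be sketched offline. This is a simplified model under stated assumptions: each batch updates at most `MaximumBatchSize` copies, and `WaitIntervalInSeconds` is applied once per batch. It does not account for model download or container startup time, and the helper name `rolling_batches` is hypothetical.

```python
def rolling_batches(copy_count: int, max_batch_size: int) -> list:
    """Split `copy_count` copies into batches of at most `max_batch_size` copies."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batches = []
    remaining = copy_count
    while remaining > 0:
        size = min(max_batch_size, remaining)
        batches.append(size)
        remaining -= size
    return batches


# Values from the update call above: 2 copies, batches of 1 copy, 120 s wait.
batches = rolling_batches(copy_count=2, max_batch_size=1)
min_wait_seconds = len(batches) * 120

print(batches)           # [1, 1]
print(min_wait_seconds)  # 240
```

With these settings, a successful update spends at least 240 seconds in wait intervals in this model, on top of the time each new copy needs to start.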

The cell below invokes the endpoint in a loop and reports any failures. Due to the incompatible API format, invocations routed to the new copies of the Inference Component will fail and print `"invocation failed"`. Successful invocations routed to the original copies will print the result.

We can see the progress in the deployment with the `"RuntimeConfig"` output from the `DescribeInferenceComponent` API. The new versions are deployed as additional copies that the service will route a percentage of traffic to. 

When the CloudWatch alarm fires, the Inference Component rolls back and returns to `InService` status without any changes, allowing our invocations to continue successfully.

In [None]:
desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)

import json
import time

while desc['InferenceComponentStatus'] == 'Updating':
    messages = [
        {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"}
    ]

    payload = {
            "messages": messages,
            "max_tokens": 512,
            "temperature": 0.6
    }
    try:
        response_model = sagemaker_runtime_client.invoke_endpoint( 
            InferenceComponentName=inference_component_name,
            EndpointName=endpoint_name, 
            Body=json.dumps(payload), 
            ContentType="application/json", Accept="application/json")
        print(json.loads(response_model['Body'].read().decode('utf-8'))['choices'][0]['message']['content'])
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print('invocation failed')
    alarm_info = cloudwatch.describe_alarms(
        AlarmNames=[
            f'SageMaker-{endpoint_name}-4xx-errors'
        ],
    )
    print(f"Alarm state:{alarm_info['MetricAlarms'][0]['StateValue']}")
    
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
    print(desc['InferenceComponentStatus'])
    print(desc['RuntimeConfig'])
    time.sleep(10)

response_model = sagemaker_runtime_client.invoke_endpoint( 
        InferenceComponentName=inference_component_name,
        EndpointName=endpoint_name, 
        Body=json.dumps(payload), 
        ContentType="application/json", Accept="application/json")
    
print(json.loads(response_model['Body'].read().decode('utf-8'))['choices'][0]['message']['content'])
print(desc['InferenceComponentStatus'])

Below we perform the same update again while also correctly setting the `"MESSAGES_API_ENABLED"` environment variable. Once again, we loop through the invocations and print the result. 

DeepSeek R1 is a reasoning model that thinks before it responds, and its thinking is visible in the model output. As our Inference Component is updated, we can see which invocations are routed to the DeepSeek copies by the thinking shown in the response. Once again, we can also track the deployment progress with the `CurrentCopyCount` field for the Inference Component.

In [None]:
sagemaker_client.update_inference_component(
    InferenceComponentName=inference_component_name,
    Specification={
        'Container': {
            'Image': image_uri,
            'Environment': {
                'SM_NUM_GPUS': str(number_of_accelerator_devices_required),
                'HF_MODEL_ID': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B',
                "MESSAGES_API_ENABLED": "true",
            }
        },
        'ComputeResourceRequirements': {
            'NumberOfAcceleratorDevicesRequired': number_of_accelerator_devices_required,
            'MinMemoryRequiredInMb': min_memory_required_in_mb
        }
    },
    DeploymentConfig={
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            },
            "WaitIntervalInSeconds": 120,
            "RollbackMaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            }
        },
        'AutoRollbackConfiguration': {
            "Alarms": [
                {"AlarmName": f'SageMaker-{endpoint_name}-4xx-errors'}
            ]
        }
    }
)
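
Both update calls in this notebook pass the same `DeploymentConfig`. One way to avoid repeating the nested dictionary is a small builder function. This is a sketch only: the name `build_rolling_deployment_config` and its defaults are hypothetical, and it produces exactly the structure used in the calls above.

```python
def build_rolling_deployment_config(alarm_name, batch_size=1, wait_interval_seconds=120,
                                    rollback_batch_size=None):
    """Return a DeploymentConfig dict for a rolling update with alarm-based rollback."""
    if rollback_batch_size is None:
        rollback_batch_size = batch_size
    return {
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {"Type": "COPY_COUNT", "Value": batch_size},
            "WaitIntervalInSeconds": wait_interval_seconds,
            "RollbackMaximumBatchSize": {"Type": "COPY_COUNT", "Value": rollback_batch_size},
        },
        "AutoRollbackConfiguration": {"Alarms": [{"AlarmName": alarm_name}]},
    }


# Placeholder alarm name for illustration; the notebook uses
# f'SageMaker-{endpoint_name}-4xx-errors'.
config = build_rolling_deployment_config("SageMaker-example-endpoint-4xx-errors")
print(config["RollingUpdatePolicy"]["MaximumBatchSize"])
```

The result can then be passed as `DeploymentConfig=config` to `update_inference_component`.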

In [None]:
desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)

import json
import time

while desc['InferenceComponentStatus'] == 'Updating':
    messages = [
        {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"}
    ]

    payload = {
            "messages": messages,
            "max_tokens": 512,
            "temperature": 0.6
    }
    try:
        response_model = sagemaker_runtime_client.invoke_endpoint( 
            InferenceComponentName=inference_component_name,
            EndpointName=endpoint_name, 
            Body=json.dumps(payload), 
            ContentType="application/json", Accept="application/json")
        print(json.loads(response_model['Body'].read().decode('utf-8'))['choices'][0]['message']['content'])
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print('invocation failed')
    alarm_info = cloudwatch.describe_alarms(
        AlarmNames=[
            f'SageMaker-{endpoint_name}-4xx-errors'
        ],
    )
    print(f"Alarm state:{alarm_info['MetricAlarms'][0]['StateValue']}")
    
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
    print(desc['InferenceComponentStatus'])
    print(desc['RuntimeConfig'])
    time.sleep(10)

response_model = sagemaker_runtime_client.invoke_endpoint( 
        InferenceComponentName=inference_component_name,
        EndpointName=endpoint_name, 
        Body=json.dumps(payload), 
        ContentType="application/json", Accept="application/json")
    
print(json.loads(response_model['Body'].read().decode('utf-8'))['choices'][0]['message']['content'])
print(desc['InferenceComponentStatus'])
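
Instead of reading each response by eye, the loop output can be tallied with a simple heuristic. The sketch below assumes that responses from the DeepSeek R1 distilled model contain a `<think>` or `</think>` tag around the reasoning. Depending on the chat template, the opening tag may not appear in the output, so both are checked. The helper names are hypothetical and the sample strings are made up for illustration.

```python
from collections import Counter


def looks_like_reasoning_output(text: str) -> bool:
    """Heuristic: a think tag suggests the response came from a DeepSeek copy."""
    return "<think>" in text or "</think>" in text


def tally_by_model(responses):
    """Count how many responses appear to come from each model version."""
    return Counter(
        "deepseek" if looks_like_reasoning_output(r) else "llama" for r in responses
    )


# Made-up responses standing in for the text printed by the loop above.
sample = [
    "There are 3 R's in STRAWBERRY.",
    "<think>\nLet me count the letters one by one.\n</think>\n\nThere are 3 R's.",
]

print(tally_by_model(sample))
```

Collecting the responses in a list inside the loop and tallying them afterwards gives a rough view of how traffic shifted between the old and new copies during the update.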

Once the Inference Component has successfully updated, we can call `DescribeInferenceComponent` and see that the parameters now reflect our most recent update.

In [None]:
desc

## Cleanup
Finally, we delete our resources.

In [None]:
sagemaker_client.delete_inference_component(InferenceComponentName=inference_component_name)
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
cloudwatch.delete_alarms(AlarmNames=[f'SageMaker-{endpoint_name}-4xx-errors'])