
# Deploying Swiss LLM Apertus on SageMaker AI with LMI v16 powered by vLLM

This notebook demonstrates how to deploy the Apertus model and run inference against it. We will cover:

1. Installing the SageMaker Python SDK and setting up SageMaker resources and permissions
2. Deploying the model using the SageMaker LMI (Large Model Inference) container powered by vLLM
3. Invoking the model with and without response streaming

Apertus is also deployable through Amazon SageMaker JumpStart. You can learn more about Apertus on Amazon SageMaker AI in our blog post: [Switzerland’s Open-Source Apertus LLMs now available on Amazon SageMaker AI](https://aws.amazon.com/blogs/alps/switzerlands-open-source-apertus-llms-now-available-on-amazon-sagemaker-ai/)

## Prerequisites

1. Accept the license for the model on HuggingFace Hub. The model comes in two different sizes (8B and 70B) and as Instruct and Non-Instruct variants. Go to the HuggingFace Hub page for the model that you want to deploy, accept the license agreement, and copy the respective model id.
    * [HuggingFace Hub Apertus 8B Instruct 2509](https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509)
    * [HuggingFace Hub Apertus 70B Instruct 2509](https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509)
    * [HuggingFace Hub Apertus 8B 2509](https://huggingface.co/swiss-ai/Apertus-8B-2509)
    * [HuggingFace Hub Apertus 70B 2509](https://huggingface.co/swiss-ai/Apertus-70B-2509)
2. Make sure that you have sufficient Service Quotas for SageMaker on-demand endpoint usage on G or P instances. The table below gives general guidance on which instance types to use.


**SageMaker Instance Types for Model Deployment**

| Model Size | Environment | Recommended Instance Types |
|------------|-------------|---------------------------|
| **70B** | Production | ml.p4d.24xlarge, ml.p5.48xlarge |
| **70B** | Testing | ml.g5.48xlarge, ml.g6.48xlarge, ml.g6e.48xlarge |
| **8B** | Production | ml.g5.4xlarge, ml.g6.4xlarge, ml.g5.48xlarge, ml.g6.48xlarge |
| **8B** | Testing | ml.g5.xlarge, ml.g6.xlarge |

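As a rough sanity check on the table above, the sketch below estimates the GPU memory needed for the model weights alone. It assumes 16-bit weights (2 bytes per parameter) and nominal parameter counts of 8 and 70 billion. The helper name `weight_memory_gb` is illustrative, and the estimate ignores the KV cache, activations, and framework overhead, so real requirements are higher.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bytes_per_param


# Total GPU memory per instance in GB (number of GPUs x memory per GPU).
GPU_MEMORY_GB = {
    "ml.g5.xlarge": 1 * 24,     # 1x A10G
    "ml.g6.xlarge": 1 * 24,     # 1x L4
    "ml.g6.48xlarge": 8 * 24,   # 8x L4
    "ml.p4d.24xlarge": 8 * 40,  # 8x A100 (40 GB each)
}

for size in (8, 70):
    need = weight_memory_gb(size)
    fits = [name for name, gb in GPU_MEMORY_GB.items() if gb > need]
    print(f"{size}B: ~{need:.0f} GB of weights, fits on: {', '.join(fits)}")
```

Under these assumptions the 8B weights fit on a single 24 GB GPU with limited room left for the KV cache, while the 70B weights need a multi-GPU instance.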

## Environment Setup

First, we'll install the SageMaker SDK to ensure compatibility with the latest features, particularly those needed for large language model deployment and streaming inference.



In [None]:
%pip install sagemaker --upgrade --quiet --no-warn-conflicts

In [None]:
local_mode = False  # if you have a local GPU you can also run the model locally using SageMaker SDK, e.g. for debugging

Replace the model id below with the one you want to deploy. 

In [None]:
MODEL_ID = "swiss-ai/Apertus-8B-Instruct-2509" # PICK which model to deploy
# MODEL_ID = "swiss-ai/Apertus-70B-2509"
# MODEL_ID = "swiss-ai/Apertus-70B-Instruct-2509"
# MODEL_ID = "swiss-ai/Apertus-8B-2509"

In [None]:
if local_mode:
    %pip install sagemaker[local] --upgrade --quiet --no-warn-conflicts

In [None]:
from sagemaker import Model, Session, get_execution_role 
from sagemaker.utils import name_from_base
from botocore.exceptions import ClientError

role = get_execution_role()  # execution role for the endpoint

if local_mode:
    from sagemaker.local import entities, LocalSession

    # Extend LocalMode's health-check timeout to 15 minutes
    entities.HEALTH_CHECK_TIMEOUT_LIMIT = 15 * 60  # seconds

    sess = LocalSession()
    sess.config = {"local": {"local_code": True}}
else:
    sess = Session() # sagemaker session for interacting with different AWS APIs

## Configure Model Container and Instance

For deploying Apertus, we'll use:
- **LMI (Large Model Inference) container with vLLM**: A container built on DJL (Deep Java Library) Serving and optimized for large language model inference
- **G or P instance**: AWS GPU instance families suited to large model inference

Key configurations:
- The container URI points to the DJL inference container in ECR (Elastic Container Registry)
- We use `ml.g6.48xlarge` instances which offer:
  - 8 NVIDIA L4 GPUs with 192 GB GPU memory
  - 768 GB of memory
  - High network bandwidth for optimal inference performance

> **Note**: REPLACE `eu-central-2` with your region if different.

In [None]:
# Define region where you have capacity
REGION = "eu-central-2"
INSTANCE_TYPE = (
    "local_gpu" if local_mode else "ml.g6.48xlarge"
)  # Review the instance type. Find the one most suitable for your need with the guidance provided in the prerequisites section.

# Select the latest container. Check the link for the latest available version https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers
CONTAINER_VERSION = "0.34.0-lmi16.0.0-cu128"

# Construct container URI
if REGION == "eu-central-2":
    container_account = 380420809688
else:
    container_account = 763104351884

container_uri = f"{container_account}.dkr.ecr.{REGION}.amazonaws.com/djl-inference:{CONTAINER_VERSION}"


# Validate region and print configuration
if REGION != sess.boto_region_name:
    print(
        f"⚠️ Warning: Container region ({REGION}) differs from session region ({sess.boto_region_name})"
    )
else:
    print(f"✅ Region validation passed: {REGION}")

print(f"📦 Container URI: {container_uri}")
print(f"🖥️ Instance Type: {INSTANCE_TYPE}")

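The account lookup above can be wrapped in a small function so that it is easier to reuse and check. This is a minimal sketch: `lmi_image_uri` and `REGISTRY_ACCOUNTS` are hypothetical names, and the mapping only contains the two registry accounts that appear in this notebook. Other regions may use different accounts, so verify against the available-images list linked above.

```python
DEFAULT_ACCOUNT = "763104351884"
# Only the regional override used in this notebook; extend as needed.
REGISTRY_ACCOUNTS = {"eu-central-2": "380420809688"}


def lmi_image_uri(region: str, version: str) -> str:
    """Build the ECR URI of the djl-inference image for a region and tag."""
    account = REGISTRY_ACCOUNTS.get(region, DEFAULT_ACCOUNT)
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{version}"


print(lmi_image_uri("eu-central-2", "0.34.0-lmi16.0.0-cu128"))
```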
## Create SageMaker Model

Now we'll create a SageMaker Model object. This step defines the model configuration but doesn't deploy it yet. The Model object combines:

1. **Container image** (`image_uri`): the LMI (DJL Inference) container optimized for LLMs
2. **Environment variables** (`env`): vLLM and model server settings
3. **Model artifacts** (`model_data`): the archive with `serving.properties` and `requirements.txt`
4. **IAM role** (`role`): permissions for model execution


> **Note**: LMI v16 comes with `transformers` version 4.55.2. The `transformers` implementation of Apertus is only available starting with version 4.56.0, so we need to update the `transformers` version in the inference container.

In [None]:
requirements = "transformers==4.57.1"

In [None]:
%store requirements >requirements.txt

In [None]:
serving_properties = f"""engine=Python
option.async_mode=true
option.rolling_batch=disable

# Load model from HuggingFace Hub
option.model_id={MODEL_ID}

# vLLM configuration
# Update based on your needs.
# Also view: https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/vllm_user_guide.html#quick-start-configurations
option.max_model_len=4096
option.model_loading_timeout=1500
"""

In [None]:
%store serving_properties >serving.properties

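Before packaging, it can help to confirm that the properties text parses as expected. The sketch below is a minimal `key=value` parser, not the parser the container uses. `parse_properties` is a hypothetical helper and the `example` string is a shortened stand-in for `serving_properties`.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


# Shortened stand-in; in the notebook you would pass serving_properties instead.
example = """engine=Python
option.async_mode=true

# vLLM configuration
option.max_model_len=4096
"""
props = parse_properties(example)
assert int(props["option.max_model_len"]) > 0  # context length must be positive
print(props)
```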
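We combine the requirements file and `serving.properties` into a single archive and upload it to an Amazon S3 bucket. The model weights are not part of the archive; the container loads them from the Hugging Face Hub using the configured model id. The Amazon SageMaker inference endpoint downloads the archive from the Amazon S3 bucket and extracts it into the inference container.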

In [None]:
%%sh
mkdir -p apertus-code
mv requirements.txt apertus-code/
mv serving.properties apertus-code/
tar czvf apertus.tar.gz -C apertus-code .

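The same packaging step can be done in Python, which also makes it easy to confirm that both files sit at the root of the archive. This is a sketch using the standard-library `tarfile` module. `build_archive` is a hypothetical helper, and the demo writes placeholder files to a temporary directory rather than touching the notebook's own files.

```python
import tarfile
import tempfile
from pathlib import Path


def build_archive(src_dir: Path, archive_path: Path) -> list[str]:
    """Pack the files in src_dir at the archive root and return the member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for file in sorted(src_dir.iterdir()):
            tar.add(file, arcname=file.name)  # arcname drops the directory prefix
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demo in a temporary directory with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "apertus-code"
    code_dir.mkdir()
    (code_dir / "requirements.txt").write_text("transformers==4.57.1\n")
    (code_dir / "serving.properties").write_text("engine=Python\n")
    members = build_archive(code_dir, Path(tmp) / "apertus.tar.gz")
    print(members)
```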
Replace `<your-bucket-name>` with the name of your own Amazon S3 bucket. The bucket must be in the same region in which you plan to deploy the endpoint.

In [None]:
# Upload model artifacts to S3
bucket = "<your-bucket-name>" # REPLACE with the name of your Amazon S3 bucket

s3_code_prefix = "apertus-lmi"
s3_object_name = "apertus.tar.gz"

In [None]:
if not bucket or bucket == "<your-bucket-name>": # DO NOT replace this string
    raise ValueError("❌ Please set a valid S3 bucket name. Replace bucket='<your-bucket-name>'.")

# Upload the local archive; upload_data returns the S3 URI of the uploaded object
code_artifact = sess.upload_data(s3_object_name, bucket, s3_code_prefix)

In [None]:
# Environment variables for the LMI container (the model itself is loaded from the Hugging Face Hub)
vllm_config = {
    "SERVING_FAIL_FAST": "true",
    "HF_HUB_ENABLE_HF_TRANSFER": "1", # Faster downloads
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service"
}

The Model object combines all the information on how to deploy the model to an endpoint.

In [None]:
model = Model(
    image_uri=container_uri,
    role=role,
    model_data=code_artifact,
    sagemaker_session=sess,
    env=vllm_config,
)

## Deploy Model to SageMaker Endpoint

Now we'll deploy our model to a SageMaker endpoint for real-time inference. This is a significant step that:
1. Provisions the specified compute resources (a G6 instance in this notebook)
2. Deploys the model container
3. Sets up the endpoint for API access

### Deployment Configuration
- **Instance Count**: 1 instance for single-node deployment
- **Instance Type**: `ml.g6.48xlarge` for high-performance inference
- **Health Check Timeout**: 1800 seconds
  - Extended timeout needed for large model loading
  - Includes time for container setup and model initialization

> ⚠️ **Important**:
> - Deployment can take up to 15 minutes
> - Monitor the endpoint status in the SageMaker console and the CloudWatch logs for progress

In [None]:
if local_mode:
    # To see progress
    !docker pull $container_uri

In [None]:
endpoint_name = name_from_base(MODEL_ID.replace("/", "-"))

print(endpoint_name)

try:
    model.deploy(
        initial_instance_count=1,
        instance_type=INSTANCE_TYPE,
        endpoint_name=endpoint_name,
        container_startup_health_check_timeout=1800,
    )
    print(f"\n✅ Endpoint '{endpoint_name}' deployed successfully")
except ClientError as e:
    error_code = e.response['Error']['Code']
    if error_code == 'ResourceLimitExceeded':
        print(
            "❌ Resource limit exceeded. "
            + f"Did you request the necessary Service Quotas for {INSTANCE_TYPE} in {REGION}? "
            + "See also https://repost.aws/knowledge-center/sagemaker-resource-limit-exceeded-error"
        )
    elif error_code == 'InsufficientInstanceCapacity':
        print(
            "❌ Insufficient instance capacity. Try a different AZ or instance type. "
            + "See also https://repost.aws/knowledge-center/sagemaker-insufficient-capacity-error"
        )
    else:
        print(f"❌ Deployment failed: {e}")
    raise  # re-raise with the original traceback
except Exception as e:
    print(f"❌ Unexpected deployment error: {e}")
    print("💡 Check CloudWatch logs for detailed error information")
    raise

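To follow the endpoint status from code rather than the console, for example from a second session while the deployment runs, you can poll `describe_endpoint`. The sketch below assumes a boto3 SageMaker client is passed in. `wait_for_endpoint` is a hypothetical helper, and the polling interval and limit are arbitrary defaults.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll describe_endpoint until the endpoint is InService, fails, or polls run out."""
    for _ in range(max_polls):
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        print(f"Endpoint status: {status}")
        if status == "InService":
            return desc
        if status == "Failed":
            raise RuntimeError(desc.get("FailureReason", "no failure reason reported"))
        time.sleep(poll_seconds)
    raise TimeoutError(f"{endpoint_name} not in service after {max_polls} polls")


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)
```

boto3 also ships a built-in `endpoint_in_service` waiter on the SageMaker client. The explicit loop is shown here because it prints each intermediate status.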
## Running Inference Requests Against the Model

Once you have deployed the model to the Amazon SageMaker inference endpoint you can invoke it. Replace `<your_endpoint_name>` below with the name of your SageMaker inference endpoint.

### Option 1: Invoke model with response streaming

In [None]:
from json import dumps as json_dumps, loads as json_loads, JSONDecodeError
from boto3 import client
from botocore.exceptions import ClientError  # needed when this cell runs in a fresh kernel
from time import time

# Create SageMaker Runtime client
smr_client = client("sagemaker-runtime")

endpoint_name = "<your_endpoint_name>" # REPLACE with your endpoint

print(f"Endpoint name: {endpoint_name}")
if endpoint_name == "<your_endpoint_name>": # DO NOT replace this string
    raise ValueError("❌ Please set a valid endpoint name")

# Invoke with messages format
body = {
    "messages": [
        {"role": "user", "content": "Explain the Swiss national sport Schwingen."}
    ],
    "temperature": 0.9,
    "max_tokens": 256,
    "stream": True,
}

start_time = time()
first_token_received = False
ttft = None
token_count = 0
full_response = ""

print(f"Prompt: {body['messages'][0]['content']}\n")
print("Response:", end=" ", flush=True)

# Invoke endpoint with streaming

try:
    resp = smr_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json_dumps(body),
        ContentType="application/json",
    )
except ClientError as e:
    error_code = e.response['Error']['Code']
    if error_code == 'ValidationException':
        print("❌ Validation Exception. Invalid request format or parameters")
    elif error_code == 'ModelError':
        print("❌ Model error. Check model logs")
    else:
        print(f"❌ Inference failed: {e}")
    raise e
except Exception as e:
    print(f"❌ Unexpected inference error: {e}")
    raise e

# Process streaming response
for event in resp["Body"]:
    if "PayloadPart" in event:
        payload = event["PayloadPart"]["Bytes"].decode()

        try:

            if payload.startswith("data: "):
                data = json_loads(payload[6:])  # Skip "data: " prefix
            else:
                data = json_loads(payload)

            token_count += 1
            if not first_token_received:
                ttft = time() - start_time
                first_token_received = True

            # Handle different streaming response formats
            if "choices" in data and len(data["choices"]) > 0:
                # Messages-compatible format
                if (
                    "delta" in data["choices"][0]
                    and "content" in data["choices"][0]["delta"]
                ):
                    token_text = data["choices"][0]["delta"]["content"]
                    full_response += token_text
                    print(token_text, end="", flush=True)
            elif "token" in data and "text" in data["token"]:
                # TGI format
                token_text = data["token"]["text"]
                full_response += token_text
                print(token_text, end="", flush=True)

        except JSONDecodeError:
            # Skip invalid JSON
            continue

end_time = time()
total_latency = end_time - start_time

print("\n\nMetrics:")
if ttft is not None:
    print(f"Time to First Token (TTFT): {ttft:.2f} seconds")
else:
    print('No tokens received')
# token_count counts parsed stream chunks, which only approximates the number of tokens
print(f"Total Chunks Received: {token_count}")
print(f"Total Latency: {total_latency:.2f} seconds")
# print(f"\nFull Response:\n{full_response}")

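The streaming loop above decodes each `PayloadPart` on its own, so a JSON object that happens to be split across two events is silently skipped. A more defensive variant buffers the bytes and only parses complete lines. This is a sketch that assumes every JSON object ends with a newline (as in JSON Lines or server-sent events) and that chunks follow the `choices[0].delta.content` shape handled above. `iter_stream_text` and the simulated events are illustrative.

```python
import json


def _parse_line(line: bytes):
    """Yield the text deltas contained in one complete line of the stream."""
    text = line.decode("utf-8").strip()
    if text.startswith("data:"):
        text = text[len("data:"):].strip()
    if not text or text == "[DONE]":
        return
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return  # ignore lines that are not JSON
    if not isinstance(data, dict):
        return
    for choice in data.get("choices", []):
        content = choice.get("delta", {}).get("content")
        if content:
            yield content


def iter_stream_text(events):
    """Yield text deltas from PayloadPart events, buffering partial lines."""
    buffer = b""
    for event in events:
        part = event.get("PayloadPart")
        if not part:
            continue
        buffer += part["Bytes"]
        # Only parse complete lines; keep the unfinished tail in the buffer.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            yield from _parse_line(line)
    yield from _parse_line(buffer)  # flush whatever is left at end of stream


# Simulated events: one JSON object split across two parts.
fake_events = [
    {"PayloadPart": {"Bytes": b'data: {"choices": [{"delta": {"content": "Hel'}},
    {"PayloadPart": {"Bytes": b'lo"}}]}\n\ndata: [DONE]\n'}},
]
print("".join(iter_stream_text(fake_events)))
```

With a real endpoint you would pass `resp["Body"]` instead of `fake_events`.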
### Option 2: Invoke without Streaming

In [None]:
from json import dumps as json_dumps, loads as json_loads, JSONDecodeError
from boto3 import client
from botocore.exceptions import ClientError  # needed when this cell runs in a fresh kernel


# Create SageMaker Runtime client for invocation
smr_client = client('sagemaker-runtime')

endpoint_name = "<your_endpoint_name>" # REPLACE with your endpoint

print(f"Endpoint name: {endpoint_name}")
if endpoint_name == "<your_endpoint_name>": # DO NOT replace this string
    raise ValueError("❌ Please set a valid endpoint name")



# Invoke with messages format
body = {
    "messages": [
        {"role": "user", "content": "Explain the Swiss national sport Schwingen."}
    ],
    "temperature": 0.9,
    "max_tokens": 256,
    "stream": False,
}

print(f"Prompt: {body['messages'][0]['content']}\n")

try:
    # Non-streaming invocation
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json_dumps(body)
    )
except ClientError as e:
    error_code = e.response['Error']['Code']
    if error_code == 'ValidationException':
        print("❌ Validation Exception. Invalid request format or parameters")
    elif error_code == 'ModelError':
        print("❌ Model error. Check model logs")
    else:
        print(f"❌ Inference failed: {e}")
    raise e
except Exception as e:
    print(f"❌ Unexpected inference error: {e}")
    raise e


result = json_loads(response['Body'].read().decode())
print(result["choices"][0]["message"]["content"])
print(f"\nFull Response:\n{result}")

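Beyond the generated text, the response may carry metadata worth inspecting. The sketch below assumes the response follows the common chat-completion layout with `finish_reason` and `usage` fields. Only `choices[0].message.content` is confirmed by the cell above, so the other fields are read defensively. `summarize_chat_response` and the sample dictionary are hypothetical.

```python
def summarize_chat_response(result: dict) -> dict:
    """Pull the text, finish reason, and token usage out of a chat-completion style response."""
    choice = result["choices"][0]
    usage = result.get("usage") or {}
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hand-written sample in the assumed response shape (not real model output).
sample = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Schwingen is ..."},
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 256},
}
print(summarize_chat_response(sample))
```

If the endpoint reports it, a `finish_reason` of `length` would indicate that generation stopped at `max_tokens` rather than at a natural end.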
## Cleanup: Delete Endpoint

In [None]:
endpoint_name = "<your_endpoint_name>" # REPLACE with your endpoint

In [None]:
model_name = "<your_model_name>" # REPLACE with your model name
# Reuse the name from this session if the Model object still exists
if "model" in globals() and model.name:
    model_name = model.name

In [None]:
print(f"The next cell deletes the following SageMaker resources: {endpoint_name} (endpoint & endpoint config) & {model_name} (model).")

In [None]:
# from sagemaker import Session

# # Initialize session
# sess = Session()

sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
sess.delete_model(model_name)

Remove the local artifacts (the configuration files and the archive built from them):

In [None]:
!rm -rf apertus-code
!rm apertus.tar.gz

In [None]:
from IPython.display import Markdown
from os import path

Markdown(f"The next cell deletes the code artifact at s3://{bucket}/{s3_code_prefix}/{s3_object_name}.")

In [None]:
s3_client = sess.boto_session.client("s3")
s3_client.delete_object(
    Bucket=bucket,
    Key=f"{s3_code_prefix}/{s3_object_name}",  # S3 keys always use forward slashes
)