# Voxtral vLLM BYOC Deployment on SageMaker

This notebook demonstrates how to deploy Mistral AI's Voxtral models using a custom vLLM container (BYOC - Bring Your Own Container) on Amazon SageMaker.

## Overview
- **Models**: Voxtral-Mini-3B-2507 and Voxtral-Small-24B-2507
- **Engine**: vLLM v0.10.0+ (required for Voxtral support)
- **Deployment**: Custom Docker container with BYOC approach
- **Features**: Multimodal audio+text processing, function calling (Small model), transcription
- **Context**: 32k-token context length; up to 30 minutes of audio for transcription and up to 40 minutes for audio understanding

## Key Advantages of BYOC Approach
1. **Latest vLLM version** - Control over vLLM version (v0.10.0+) with Voxtral support
2. **Official configurations** - Uses official Voxtral server parameters
3. **Full control** - Complete control over container environment and dependencies
4. **Future-proof** - Easy updates to new vLLM versions
5. **Flexible architecture** - Separate container image from model code for faster iterations

## Supported Models
- **Voxtral-Mini-3B-2507**: Text + audio processing, ml.g6.4xlarge instance
- **Voxtral-Small-24B-2507**: Text + audio + function calling, ml.g6.12xlarge instance

In [1]:
# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

## 1. Setup and Dependencies

Install required packages and initialize SageMaker session.

**Prerequisites**: Docker image should be built and pushed before running this notebook. See README for build instructions.

In [2]:
# Import required libraries for SageMaker BYOC deployment
import boto3
import sagemaker
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
import json
import base64
import requests
import time
import subprocess
import os
from typing import Dict, List, Any, Optional

# Initialize SageMaker session and get execution role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role for SageMaker operations
# Outside SageMaker, get_execution_role() may fail; set the role ARN explicitly instead:
# role = "arn:aws:iam::<account-id>:role/<role-name>"
bucket = "3p-projects"  # S3 bucket for storing model artifacts

# Get AWS account and region information
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()['Account']
region = boto3.Session().region_name

print(f"AWS Account ID: {account_id}")
print(f"AWS Region: {region}")
print(f"SageMaker role: {role}")
print(f"S3 bucket: {bucket}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
AWS Account ID: 459006231907
AWS Region: us-west-2
SageMaker role: arn:aws:iam::459006231907:role/mistral-workshop-SageMakerExecutionRole
S3 bucket: 3p-projects


## 2. Container Configuration

Configure the custom container image URI. The Docker image should be pre-built with vLLM v0.10.0+ and base dependencies.

In [3]:
# Configuration for custom container
repository_name = "voxtral-vllm-byoc"
image_tag = "latest"
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository_name}:{image_tag}"

print("Custom container image (must already be built and pushed to ECR):")
print(f"Repository: {repository_name}")
print(f"Image URI: {image_uri}")

Custom container image (must already be built and pushed to ECR):
Repository: voxtral-vllm-byoc
Image URI: 459006231907.dkr.ecr.us-west-2.amazonaws.com/voxtral-vllm-byoc:latest
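
Because the notebook assumes the image has already been pushed, it can help to confirm that the tag exists before creating the model. The following is a minimal sketch, not part of the original notebook: `ecr_image_exists` is a hypothetical helper name, and it takes an ECR client (for example `boto3.client("ecr")`) as an argument so it can be exercised without AWS access.

```python
def ecr_image_exists(ecr_client, repository_name: str, image_tag: str) -> bool:
    """Return True if the tagged image is present in the ECR repository."""
    try:
        response = ecr_client.describe_images(
            repositoryName=repository_name,
            imageIds=[{"imageTag": image_tag}],
        )
    except (
        ecr_client.exceptions.ImageNotFoundException,
        ecr_client.exceptions.RepositoryNotFoundException,
    ):
        # A missing repository or tag both mean there is nothing to deploy yet.
        return False
    return bool(response.get("imageDetails"))


# Example usage inside the notebook (requires AWS credentials):
# import boto3
# if not ecr_image_exists(boto3.client("ecr"), repository_name, image_tag):
#     raise RuntimeError(f"Image not found in ECR: {image_uri}")
```

Failing early here is cheaper than waiting for an endpoint deployment to time out on an image pull error.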


In [4]:
# Prepare BYOC model artifacts from code directory
# Files are already organized in the code/ directory structure
byoc_code_dir = "./code"

# Verify required files exist in code directory
required_files = ["model.py", "serving.properties", "requirements.txt"]
missing_files = []

for file in required_files:
    file_path = os.path.join(byoc_code_dir, file)
    if not os.path.exists(file_path):
        missing_files.append(file_path)

if missing_files:
    print(f"❌ Missing required files: {missing_files}")
    print("Please ensure all files are in the code/ directory.")
else:
    print("📁 BYOC model artifacts ready:")
    for file in required_files:
        file_path = os.path.join(byoc_code_dir, file)
        if os.path.exists(file_path):
            file_size = os.path.getsize(file_path)
            print(f"  ✅ {file} ({file_size} bytes)")

📁 BYOC model artifacts ready:
  ✅ model.py (23287 bytes)
  ✅ serving.properties (2581 bytes)
  ✅ requirements.txt (693 bytes)


In [5]:
# Upload BYOC configuration to S3
prefix = "voxtral-vllm-byoc"  # S3 key prefix (folder) inside the bucket

byoc_config_uri = sagemaker_session.upload_data(
    path=byoc_code_dir, 
    bucket=bucket, 
    key_prefix=f"{prefix}/code"
)

print(f"📤 BYOC configuration uploaded to: {byoc_config_uri}")

📤 BYOC configuration uploaded to: s3://3p-projects/voxtral-vllm-byoc/code


## 3. Deploy Custom vLLM Model on SageMaker

Create a real-time inference endpoint using the custom vLLM container.

**Note**: Docker build and push should be completed before running this notebook. See README for build instructions.

In [6]:
# Configuration for SageMaker BYOC deployment
timestamp = int(time.time())
model_name = f'voxtral-vllm-byoc-model-{timestamp}'
endpoint_name = f'voxtral-vllm-byoc-endpoint-{timestamp}'

print(f"Model name: {model_name}")
print(f"Endpoint name: {endpoint_name}")
print(f"Custom container image: {image_uri}")
print(f"\n📋 Instance recommendations:")
print(f"  - Voxtral-Mini-3B-2507: ml.g6.4xlarge (tensor_parallel_degree=1)")
print(f"  - Voxtral-Small-24B-2507: ml.g6.12xlarge (tensor_parallel_degree=4)")

Model name: voxtral-vllm-byoc-model-1755863967
Endpoint name: voxtral-vllm-byoc-endpoint-1755863967
Custom container image: 459006231907.dkr.ecr.us-west-2.amazonaws.com/voxtral-vllm-byoc:latest

📋 Instance recommendations:
  - Voxtral-Mini-3B-2507: ml.g6.4xlarge (tensor_parallel_degree=1)
  - Voxtral-Small-24B-2507: ml.g6.12xlarge (tensor_parallel_degree=4)
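
The instance recommendations above can also be kept in a small lookup table so the deploy cell does not need a hand-edited instance type. This is an illustrative sketch only: `INSTANCE_RECOMMENDATIONS` and `recommended_deployment` are hypothetical names, and the values are copied from the recommendations printed above.

```python
# Values taken from the recommendations printed by the notebook.
INSTANCE_RECOMMENDATIONS = {
    "Voxtral-Mini-3B-2507": {
        "instance_type": "ml.g6.4xlarge",
        "tensor_parallel_degree": 1,
    },
    "Voxtral-Small-24B-2507": {
        "instance_type": "ml.g6.12xlarge",
        "tensor_parallel_degree": 4,
    },
}


def recommended_deployment(model_id: str) -> dict:
    """Look up the recommended settings, ignoring any 'org/' prefix."""
    name = model_id.split("/")[-1]
    if name not in INSTANCE_RECOMMENDATIONS:
        known = ", ".join(sorted(INSTANCE_RECOMMENDATIONS))
        raise ValueError(f"Unknown model '{model_id}'. Known models: {known}")
    # Return a copy so callers cannot mutate the shared table.
    return dict(INSTANCE_RECOMMENDATIONS[name])


settings = recommended_deployment("mistralai/Voxtral-Small-24B-2507")
print(settings["instance_type"], settings["tensor_parallel_degree"])
```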


In [7]:
# Prepare model data configuration for BYOC
model_data = {
    "S3DataSource": {
        "S3Uri": f"{byoc_config_uri}/",
        "S3DataType": "S3Prefix",
        "CompressionType": "None"
    }
}

print(f"Model data configuration: {json.dumps(model_data, indent=2)}")

Model data configuration: {
  "S3DataSource": {
    "S3Uri": "s3://3p-projects/voxtral-vllm-byoc/code/",
    "S3DataType": "S3Prefix",
    "CompressionType": "None"
  }
}
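
With `S3DataType` set to `S3Prefix`, the `S3Uri` is treated as a key prefix and is expected to end with a forward slash, which is why the cell above appends `/`. A small helper can make that normalization explicit. This is a sketch under that assumption; `build_uncompressed_model_data` is a hypothetical name, not a SageMaker SDK function.

```python
def build_uncompressed_model_data(s3_prefix_uri: str) -> dict:
    """Build a model_data dict for an uncompressed S3 prefix."""
    if not s3_prefix_uri.startswith("s3://"):
        raise ValueError(f"Expected an s3:// URI, got: {s3_prefix_uri}")
    return {
        "S3DataSource": {
            # Normalize to exactly one trailing slash.
            "S3Uri": s3_prefix_uri.rstrip("/") + "/",
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        }
    }


model_data_example = build_uncompressed_model_data(
    "s3://3p-projects/voxtral-vllm-byoc/code"
)
print(model_data_example["S3DataSource"]["S3Uri"])
```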


In [8]:
# Create SageMaker model configuration for custom vLLM container
voxtral_byoc_model = Model(
    image_uri=image_uri,  # Use our custom container
    model_data=model_data,  # Contains model.py, serving.properties, requirements.txt
    role=role,
    name=model_name,
    env={
        # Environment variables for our custom container
        'MODEL_CACHE_DIR': '/opt/ml/model',
        'TRANSFORMERS_CACHE': '/tmp/transformers_cache',
        'HF_HOME': '/tmp/hf_home',
        'VLLM_WORKER_MULTIPROC_METHOD': 'spawn',
        'SAGEMAKER_BIND_TO_PORT': '8080',
        'SAGEMAKER_BIND_TO_HOST': '0.0.0.0'
    }
)

In [9]:
%%time
# Deploy the custom vLLM model to a real-time inference endpoint
print(f"🚀 Deploying BYOC vLLM endpoint: {endpoint_name}")
print("This will take approximately 8-10 minutes...")

try:
    predictor = voxtral_byoc_model.deploy(
        initial_instance_count=1,
        instance_type="ml.g6.12xlarge",  # Voxtral-Small; use ml.g6.4xlarge for Voxtral-Mini
        endpoint_name=endpoint_name,
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
        container_startup_health_check_timeout=1200,  # Extended timeout for model loading
        model_data_download_timeout=1800,
        wait=True
    )
    
    print(f"✅ BYOC vLLM Endpoint deployed successfully: {endpoint_name}")
    
except Exception as e:
    print(f"❌ Deployment failed: {str(e)}")
    raise  # Re-raise so later cells do not run against a missing endpoint

🚀 Deploying BYOC vLLM endpoint: voxtral-vllm-byoc-endpoint-1755863967
This will take approximately 8-10 minutes...
----------------!
✅ BYOC vLLM Endpoint deployed successfully: voxtral-vllm-byoc-endpoint-1755863967
CPU times: user 181 ms, sys: 24.5 ms, total: 205 ms
Wall time: 8min 32s
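
If a deployment does not succeed, the endpoint description usually carries the reason. The sketch below reads the status and failure reason through the SageMaker `describe_endpoint` API. `endpoint_status` is a hypothetical helper name, and the client is passed in (for example `boto3.client("sagemaker")`) so the function can be tested without AWS access.

```python
def endpoint_status(sm_client, endpoint_name: str) -> tuple:
    """Return (status, failure_reason) for a SageMaker endpoint."""
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    # FailureReason is only present when the endpoint failed to create or update.
    return response["EndpointStatus"], response.get("FailureReason")


# Example usage inside the notebook (requires AWS credentials):
# import boto3
# status, reason = endpoint_status(boto3.client("sagemaker"), endpoint_name)
# print(f"Status: {status}")
# if reason:
#     print(f"Failure reason: {reason}")
```

Container logs for the endpoint are written to the CloudWatch log group `/aws/sagemaker/Endpoints/<endpoint_name>`, which is typically the next place to look when the failure reason mentions a health check.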


## 4. Test Custom vLLM Deployment

Test the deployed model with various input types. 

### 4.1 Test Health Check

In [10]:
# Test endpoint health

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)
try:
    # Simple health check payload
    health_payload = {
        "messages": [
            {
                "role": "user",
                "content": "Hello, are you working?"
            }
        ],
        "max_tokens": 50,
        "temperature": 0.1
    }
    
    print("🔍 Testing endpoint health...")
    response = predictor.predict(health_payload)
    print(f"Response: {response['choices'][0]['message']['content']}")
    
except Exception as e:
    print(f"❌ Health check failed: {str(e)}")

🔍 Testing endpoint health...
Response: Hello! Yes, I'm here and ready to assist you. How can I help you today?


### 4.2 Text-Only Conversation

In [11]:
# Test text-only conversation using OpenAI format
payload = {
    "messages": [
        {
            "role": "user",
            "content": "Hello! Can you tell me about the advantages of using vLLM for model inference?"
        }
    ],
    "max_tokens": 200,
    "temperature": 0.2,
    "top_p": 0.95
}

print("🔤 Testing text-only conversation with custom vLLM...")
try:
    response = predictor.predict(payload)
    print("Response:", response["choices"][0]["message"]["content"])
    
    # Print usage statistics if available
    if "usage" in response:
        usage = response["usage"]
        print(f"\n📊 Token Usage:")
        print(f"  Prompt tokens: {usage.get('prompt_tokens', 'N/A')}")
        print(f"  Completion tokens: {usage.get('completion_tokens', 'N/A')}")
        print(f"  Total tokens: {usage.get('total_tokens', 'N/A')}")
        
except Exception as e:
    print(f"❌ Text conversation test failed: {str(e)}")

🔤 Testing text-only conversation with custom vLLM...
Response: Hello! I'd be happy to explain the advantages of using vLLM (Virtual Large Language Model) for model inference.

1. **Scalability**: vLLM is designed to handle large language models efficiently. It can scale to models with billions of parameters, making it suitable for tasks that require high computational resources.

2. **Efficiency**: vLLM uses a technique called "virtual memory" to manage the model's parameters. This allows it to use less physical memory than traditional methods, making it more efficient and reducing the cost of inference.

3. **Speed**: vLLM can perform inference faster than many other methods. It achieves this by using a combination of techniques, including model parallelism and pipelining.

4. **Flexibility**: vLLM supports a wide range of models and can be used for various tasks, such as text generation, translation, and summarization.

5. **Ease of Use**: vLLM is designed to be easy to use.

📊 Token

### 4.3 Audio Understanding Test

In [12]:
# Test transcription of an audio file referenced by URL
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Transcribe this audio file"
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3"
                }
            ]
        }
    ],
    "max_tokens": 300,
    "temperature": 0.0,  # Use 0.0 for transcription tasks
    "top_p": 0.95
}

print("🎵 Testing audio file url for transcription...")
try:
    response = predictor.predict(payload)
    print("Response:", response["choices"][0]["message"]["content"])
    
    # Print usage statistics if available
    if "usage" in response:
        usage = response["usage"]
        print(f"\n📊 Token Usage:")
        print(f"  Prompt tokens: {usage.get('prompt_tokens', 'N/A')}")
        print(f"  Completion tokens: {usage.get('completion_tokens', 'N/A')}")
        print(f"  Total tokens: {usage.get('total_tokens', 'N/A')}")
        
except Exception as e:
    print(f"❌ Audio file url test failed: {str(e)}")


🎵 Testing audio file url for transcription...
Response: And the 0-1 pitch on the way to Edgar Martinez, swung on and lined down the left field line for a base hit. Here comes Joey. Here is Junior to third base. They're going to wave him in. The throw to the plate will be late. The Mariners are going to play for the American League Championship. I don't believe it. It just continues. My, oh my.

📊 Token Usage:
  Prompt tokens: 384
  Completion tokens: 85
  Total tokens: 469


In [13]:
# Test audio file base64 for audio understanding
import base64
import requests
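import json
import os

# Download the sample audio file if it is not already available locally
audio_url = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3"
audio_path = "winning_call.mp3"
if not os.path.exists(audio_path):
    audio_response = requests.get(audio_url, timeout=60)
    audio_response.raise_for_status()
    with open(audio_path, "wb") as f:
        f.write(audio_response.content)

# Load audio file and encode as base64
with open(audio_path, "rb") as audio_file:
    audio_data = base64.b64encode(audio_file.read()).decode('utf-8')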

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What do you hear in this audio?"
                },
                {
                    "type": "audio",
                    "data": f"data:audio/mp3;base64,{audio_data}"  # Base64 encoded audio
                }
            ]
        }
    ],
    "max_tokens": 300,
    "temperature": 0.2,
    "top_p": 0.95
}

print("🎵 Testing audio file base64 for audio understanding...")

try:
    response = predictor.predict(payload)
    print("Response:", response["choices"][0]["message"]["content"])
    
    # Print usage statistics if available
    if "usage" in response:
        usage = response["usage"]
        print(f"\n📊 Token Usage:")
        print(f"  Prompt tokens: {usage.get('prompt_tokens', 'N/A')}")
        print(f"  Completion tokens: {usage.get('completion_tokens', 'N/A')}")
        print(f"  Total tokens: {usage.get('total_tokens', 'N/A')}")
        
except Exception as e:
    print(f"❌ Audio file base64 test failed: {str(e)}")


🎵 Testing audio file base64 for audio understanding...
Response: The audio describes a dramatic moment in a baseball game. Here's a breakdown:

- The pitcher throws a pitch to Edgar Martinez.
- Martinez hits the ball, which goes down the left field line for a base hit.
- Jay Buhner (referred to as "Joy" in the audio) and Ken Griffey Jr. (referred to as "Junior") advance to third base and home plate, respectively.
- The throw to the plate is late, allowing Griffey Jr. to score.
- This play allows the Seattle Mariners to advance to the American League Championship.

The commentator expresses disbelief and excitement about the Mariners' advancement.

📊 Token Usage:
  Prompt tokens: 387
  Completion tokens: 133
  Total tokens: 520
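
The two audio cells above build nearly identical payloads by hand. The helpers below are a minimal sketch that produces the same structures, using the content-part keys shown in this notebook (`"path"` for URLs, `"data"` for base64 data URIs). The function names are hypothetical and not part of the container's API.

```python
import base64


def audio_part_from_url(url: str) -> dict:
    """Content part that points the model at a remote audio file."""
    return {"type": "audio", "path": url}


def audio_part_from_bytes(audio_bytes: bytes, mime_type: str = "audio/mp3") -> dict:
    """Content part that embeds the audio as a base64 data URI."""
    encoded = base64.b64encode(audio_bytes).decode("utf-8")
    return {"type": "audio", "data": f"data:{mime_type};base64,{encoded}"}


def text_part(text: str) -> dict:
    """Content part carrying a text instruction."""
    return {"type": "text", "text": text}


def build_payload(parts, max_tokens=300, temperature=0.2, top_p=0.95) -> dict:
    """Wrap content parts in a single-turn user message."""
    return {
        "messages": [{"role": "user", "content": list(parts)}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


example = build_payload(
    [text_part("What do you hear in this audio?"), audio_part_from_bytes(b"abc")]
)
print(example["messages"][0]["content"][1]["data"])
```

The resulting dictionary can be passed to `predictor.predict(...)` in the same way as the hand-written payloads.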


### 4.4 Multiple Audio Files Test

In [14]:
# Test with multiple audio files 
multi_audio_payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3"
                },
                {
                    "type": "audio", 
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3"
                },
                {
                    "type": "text",
                    "text": "Compare these two audio files. What similarities and differences do you notice?"
                }
            ]
        }
    ],
    "max_tokens": 400,
    "temperature": 0.2,
    "top_p": 0.95
}

print("🎵 Testing multiple audio files support...")
try:
    response = predictor.predict(multi_audio_payload)
    print("Response:", response["choices"][0]["message"]["content"])
    
except Exception as e:
    print(f"❌ Multiple audio test failed: {str(e)}")
    print("💡 Check that the endpoint can reach both audio URLs and that the combined audio fits within the model's context length.")

🎵 Testing multiple audio files support...
Response: The first audio is a historical speech by Thomas Edison, where he recites a nursery rhyme, "Mary Had a Little Lamb," which was the first words he spoke into the phonograph. The second audio is a sports commentary, specifically a baseball game, where the commentator describes a play that leads to a significant win for the Mariners. The similarities are that both audios involve a historical moment being described, but the differences lie in the content and context, with one being a technological milestone and the other a sports achievement.


### 4.5 Transcribe-Only Mode

In [15]:
# Transcribe-only mode: send the audio with no accompanying text instruction
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3"
                }
            ]
        }
    ],
    "max_tokens": 300,
    "temperature": 0.0,  # Use 0.0 for deterministic transcription
    "top_p": 0.95,
}

print("🎵 Testing transcribe-only mode...")
try:
    response = predictor.predict(payload)
    print("Response:", response["choices"][0]["message"]["content"])
    
except Exception as e:
    print(f"❌ Transcribe-only test failed: {str(e)}")

🎵 Testing transcribe-only mode...
Response: This week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye to eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners and on distant military outposts, all these conversations are what have kept me honest, kept me inspired, and kept me going. Every day, I learned from you. You made me a better president, and you made me a better man. Over the course of these eight years, I've seen the goodness, the resilience, and the hope of the American people. I've seen neighbors looking out for each other as we rescued our economy from the worst crisis of our lifetimes. I've hugged cancer survivors who finally know the security of affordable health care. I've seen communities like Joplin rebuild from disaster and cities like B
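
The transcription above stops mid-word, which is consistent with the response hitting the `max_tokens` limit of 300. In the OpenAI chat-completion format, a truncated generation is reported with `finish_reason` set to `"length"`. The sketch below assumes the custom container passes that field through unchanged, which this notebook does not confirm; `is_truncated` is a hypothetical helper name.

```python
def is_truncated(response: dict) -> bool:
    """Return True if the first choice stopped because of the token limit."""
    choices = response.get("choices") or []
    if not choices:
        return False
    # "length" means generation stopped at max_tokens (OpenAI-style responses).
    return choices[0].get("finish_reason") == "length"


# Example usage inside the notebook:
# response = predictor.predict(payload)
# if is_truncated(response):
#     print("Output was cut off; consider raising max_tokens.")
```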

### 4.6 Function Calling (Voxtral-Small Only)

In [16]:
import json

# Define weather tool configuration
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a specific location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use."
                }
            },
            "required": ["location", "format"]
        }
    }
}

# Mock weather function
def mock_weather(location, format="celsius"):
    """Always returns sunny weather at 25°C/77°F"""
    temp = 77 if format.lower() == "fahrenheit" else 25
    unit = "°F" if format.lower() == "fahrenheit" else "°C"
    return f"It's sunny in {location} with {temp}{unit}"

# Test payload with audio
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                "type": "audio",
                "path": "https://huggingface.co/datasets/patrickvonplaten/audio_samples/resolve/main/fn_calling.wav"
                }
            ]
        }
    ],
    "temperature": 0.2,
    "top_p": 0.95,
    "tools": [WEATHER_TOOL]
}


print("🧪 Testing Function Calling...")
try:
    # Send the audio request; the spoken question should trigger a tool call
    response = predictor.predict(payload)

    message = response["choices"][0]["message"]
    print(f"Response: {message['content']}")

    # Check for tool calls (the key may be missing, None, or an empty list)
    tool_calls = message.get("tool_calls")
    if tool_calls:
        print("🔧 Tool calls found:")
        for tool_call in tool_calls:
            func_name = tool_call["function"]["name"]
            func_args = json.loads(tool_call["function"]["arguments"])
            print(f"  Function: {func_name}")
            print(f"  Args: {func_args}")

            # Execute mock function
            if func_name == "get_current_weather":
                result = mock_weather(**func_args)
                print(f"  Result: {result}")
    else:
        print("❌ No tool calls found")

except Exception as e:
    print(f"❌ Test failed: {e}")

🧪 Testing Function Calling...
Response: I'll help you with that.
🔧 Tool calls found:
  Function: get_current_weather
  Args: {'location': 'Madrid', 'format': 'celsius'}
  Result: It's sunny in Madrid with 25°C
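
The cell above stops once the tool has been executed. To let the model phrase a final answer, the tool result is normally sent back in a follow-up request. The sketch below builds that follow-up message list in the OpenAI tool-calling format. It assumes each tool call carries an `id` and that the container accepts `"tool"` role messages, neither of which is confirmed by this notebook; `build_tool_followup` is a hypothetical helper name.

```python
def build_tool_followup(messages: list, assistant_message: dict, tool_results: dict) -> list:
    """Append the assistant turn and one tool message per tool call.

    tool_results maps each tool call id to the string returned by the tool.
    """
    followup = list(messages)  # Copy so the original history is not mutated.
    followup.append(assistant_message)
    for tool_call in assistant_message.get("tool_calls") or []:
        followup.append(
            {
                "role": "tool",
                "tool_call_id": tool_call["id"],
                "name": tool_call["function"]["name"],
                "content": tool_results[tool_call["id"]],
            }
        )
    return followup


# Example usage inside the notebook (assumes tool calls include an "id"):
# results = {tc["id"]: mock_weather(**json.loads(tc["function"]["arguments"]))
#            for tc in message["tool_calls"]}
# followup_payload = dict(payload)
# followup_payload["messages"] = build_tool_followup(payload["messages"], message, results)
# final = predictor.predict(followup_payload)
```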


## 5. Cleanup Resources

**Important**: Delete the endpoint when you are done. A real-time endpoint accrues charges for as long as it is running, even when idle.

In [17]:
# Delete SageMaker endpoint
print(f"🗑️ Deleting endpoint: {endpoint_name}")
predictor.delete_endpoint(delete_endpoint_config=True)
print("✅ Endpoint deleted successfully")


# # Delete ECR repository (optional)
# ecr_client = boto3.client('ecr')
# ecr_client.delete_repository(
#     repositoryName='voxtral-vllm-byoc',
#     force=True
# )
# print("✅ ECR repository deleted successfully")

🗑️ Deleting endpoint: voxtral-vllm-byoc-endpoint-1755863967
✅ Endpoint deleted successfully
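
The cell above removes the endpoint and its endpoint configuration but leaves the SageMaker model object in place. As an alternative to that cell, the sketch below removes all three using the low-level SageMaker API. `delete_endpoint_resources` is a hypothetical helper name, and the client is passed in (for example `boto3.client("sagemaker")`). It must run while the endpoint still exists, because the model names are looked up through the endpoint configuration.

```python
def delete_endpoint_resources(sm_client, endpoint_name: str) -> dict:
    """Delete an endpoint, its endpoint config, and the models it references."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config.get("ProductionVariants", [])
        if "ModelName" in variant
    ]

    # Delete the billable endpoint first, then the metadata it depended on.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return {
        "endpoint": endpoint_name,
        "endpoint_config": config_name,
        "models": model_names,
    }


# Example usage inside the notebook (requires AWS credentials):
# import boto3
# deleted = delete_endpoint_resources(boto3.client("sagemaker"), endpoint_name)
# print(deleted)
```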


## 6. Summary

### What We've Accomplished

1. **✅ Flexible BYOC Architecture** - Separated container image from model code
2. **✅ Dynamic Code Deployment** - Model artifacts provided via S3 model_data
3. **✅ Multi-Model Support** - Both Voxtral Mini and Small models supported
4. **✅ SageMaker Integration** - Custom model and endpoint creation
5. **✅ Multimodal Processing** - Audio + text processing with proper formatting
6. **✅ Function Calling** - Tool calling support for Voxtral-Small model