# Voxtral Mini Audio-Language Model Deployment on SageMaker

This notebook demonstrates how to deploy and use Mistral AI's Voxtral Mini model on Amazon SageMaker for real-time audio and text processing.

## Overview
- **Model**: Voxtral-Mini-3B-2507 - A multimodal model that can process both audio and text
- **Capabilities**: Audio transcription, audio understanding, text generation, and multimodal conversations
- **Deployment**: Real-time inference endpoint on SageMaker with GPU acceleration

## Key Features Demonstrated
1. **Text-only conversations** - Standard language model interactions
2. **Audio-only processing** - Direct audio analysis and description
3. **Audio + text combinations** - Guided audio analysis with specific questions
4. **Multi-audio processing** - Analyze multiple audio files together
5. **Multi-turn conversations** - Complex dialogues with mixed media
6. **Transcription-only mode** - Pure speech-to-text conversion


## Tested Environment
1. This solution was tested in the **us-west-2** region. Change the region to suit your requirements.
2. SageMaker Domain JupyterLab space on an **ml.g6.8xlarge** instance with **100 GB** of storage.
3. Make sure your account has enough quota for **ml.g6.8xlarge** to host this model on a SageMaker endpoint.

In [1]:
# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

In [2]:
!pip install -r code/requirements.txt



## 1. Setup and Dependencies

Install required packages and initialize SageMaker session.

In [3]:
# Import required libraries for SageMaker deployment and audio processing
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.pytorch import PyTorchModel
import json
import base64
import numpy as np
import librosa
from typing import Dict, List, Any, Optional
import time
import io

# Initialize SageMaker session and get execution role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role for SageMaker operations
bucket = "<UPDATE_WITH_YOUR_BUCKET>"  # S3 bucket for storing model artifacts

print(f"SageMaker role: {role}")
print(f"S3 bucket: {bucket}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
SageMaker role: arn:aws:iam::459006231907:role/service-role/AmazonSageMaker-ExecutionRole-20250606T154579
S3 bucket: 3p-projects
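
The `bucket` value above is a placeholder that must be replaced before the upload step. As a minimal sketch, a small guard can catch the case where it was left unchanged. `resolve_bucket` is a hypothetical helper, not part of the SageMaker SDK, and the fallback used here is a stand-in so the example runs without AWS credentials.

```python
from typing import Callable

PLACEHOLDER = "<UPDATE_WITH_YOUR_BUCKET>"  # placeholder string used in the setup cell


def resolve_bucket(configured: str, fallback: Callable[[], str]) -> str:
    """Return the configured bucket, or call `fallback` if the placeholder was left in.

    Hypothetical helper for illustration only.
    """
    name = configured.strip()
    if not name or name == PLACEHOLDER:
        return fallback()
    return name


# Stand-in fallback so the sketch runs offline.
print(resolve_bucket(PLACEHOLDER, lambda: "my-default-bucket"))
print(resolve_bucket("3p-projects", lambda: "my-default-bucket"))
```

In the notebook, one option is to pass `sagemaker_session.default_bucket` as the fallback, which returns the session's default bucket.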


## 2. Download and Prepare Model

Download the Voxtral Mini model from Hugging Face and prepare it for SageMaker deployment.

In [6]:
# Configure the Voxtral Mini model repository on Hugging Face
import torch

# Configuration
device = "cuda"  # Use GPU for faster model loading
repo_id = "mistralai/Voxtral-Mini-3B-2507"  # Official Mistral AI model


In [7]:
# Download the model repository to a local directory for packaging
local_path = "./model"

# snapshot_download fetches the repository files (weights, config, processor) into local_path

from huggingface_hub import snapshot_download
import os

snapshot_download(
    repo_id=repo_id, 
    local_dir=local_path,
    local_dir_use_symlinks=False,
    ignore_patterns=["consolidated.safetensors"]  # Skip this file
)

print(f"Model saved to: {local_path}")

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

Model saved to: ./model


In [8]:
# Install pigz using the package manager appropriate for your environment
!sudo apt install pigz

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  pigz
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 63.6 kB of archives.
After this operation, 162 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 pigz amd64 2.6-1 [63.6 kB]
Fetched 63.6 kB in 0s (169 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package pigz.
(Reading database ... 14572 files and directories currently installed.)
Preparing to unpack .../archives/pigz_2.6-1_amd64.deb ...
Unpacking pigz (2.6-1) ...

In [None]:
%%time
# Package model as tar.gz and upload to S3
# Using pigz for faster compression (parallel gzip)
!tar --use-compress-program=pigz -cf model.tar.gz -C model/ .

# Upload compressed model to S3
sess = sagemaker.session.Session()
prefix = "voxtral"  # S3 prefix for organization
model_uri = sess.upload_data(
    'model.tar.gz', 
    bucket=bucket, 
    key_prefix=f"{prefix}/huggingface/mini"
)

# Clean up local files to save disk space
!rm model.tar.gz
!rm -rf model

print(f"Model uploaded to: {model_uri}")
model_uri
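
If `pigz` is not available in your environment, the archive can also be built with Python's standard `tarfile` module. This is slower because compression runs on a single thread. The sketch below assumes a hypothetical helper named `package_model`. Like the `tar -C model/ .` command above, it places the files at the root of the archive. The demo uses a temporary directory with a dummy file rather than the real model.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import List


def package_model(model_dir: str, archive_path: str) -> List[str]:
    """Write the files under `model_dir` to a .tar.gz with paths relative to the archive root.

    Hypothetical helper. Keep `archive_path` outside `model_dir` so the archive
    does not try to include itself.
    """
    root = Path(model_dir)
    members = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                arcname = path.relative_to(root).as_posix()
                tar.add(str(path), arcname=arcname)
                members.append(arcname)
    return members


# Demo with a dummy directory instead of the downloaded model.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "model"
    demo_dir.mkdir()
    (demo_dir / "config.json").write_text("{}")
    print(package_model(str(demo_dir), str(Path(tmp) / "model.tar.gz")))
```

For the notebook, the equivalent call would be `package_model("model", "model.tar.gz")` before the upload.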

## 3. Deploy Model on SageMaker

Create a real-time inference endpoint for the Voxtral Mini model.

In [10]:
# Configuration for SageMaker deployment
id = int(time.time())  # Unique identifier for resources
model_name = f'voxtral-mini-model-{id}'

# SageMaker container image with PyTorch and Transformers
# NOTE: Update this URI for your specific AWS region
image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:2.6.0-transformers4.49.0-gpu-py312-cu124-ubuntu22.04"

print(f"Model name: {model_name}")
print(f"Container image: {image}")

Model name: voxtral-mini-model-1755007773
Container image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:2.6.0-transformers4.49.0-gpu-py312-cu124-ubuntu22.04
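
The note about updating the image URI for your region can be handled with a small string substitution. The sketch below uses a hypothetical helper, `image_for_region`, that swaps only the region segment. It assumes the registry account ID and image tag are the same in the target region, which is not true everywhere, so check the AWS Deep Learning Containers image list before relying on the result.

```python
import re

BASE_IMAGE = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "huggingface-pytorch-inference:2.6.0-transformers4.49.0-gpu-py312-cu124-ubuntu22.04"
)

# Matches the region segment of an ECR registry host name.
_REGION_PATTERN = re.compile(r"\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com")


def image_for_region(image_uri: str, region: str) -> str:
    """Return `image_uri` with its ECR region replaced by `region` (hypothetical helper)."""
    match = _REGION_PATTERN.search(image_uri)
    if match is None:
        raise ValueError(f"Not an ECR image URI: {image_uri}")
    return image_uri[: match.start(1)] + region + image_uri[match.end(1):]


print(image_for_region(BASE_IMAGE, "us-east-1"))
```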


In [12]:
# Create SageMaker model configuration
from sagemaker.huggingface.model import HuggingFaceModel

voxtral_mini_model = HuggingFaceModel(
    model_data=model_uri,              # S3 location of model artifacts
    role=role,                         # IAM execution role
    image_uri=image,                   # Container image
    entry_point="inference.py",        # Custom inference script
    source_dir='code',                 # Directory containing inference code
    name=model_name,
    env={
        # Increase timeouts and payload limits for audio processing
        'MMS_MAX_REQUEST_SIZE': '2000000000',    
        'MMS_MAX_RESPONSE_SIZE': '2000000000',    
        'MMS_DEFAULT_RESPONSE_TIMEOUT': '900'    
    }
)

print("SageMaker model configuration created")

SageMaker model configuration created


In [13]:
%%time
# Deploy the model to a real-time inference endpoint
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

endpoint_name = f'voxtral-mini-real-time-endpoint-{id}'

print(f"Deploying endpoint: {endpoint_name}")
print("This will take approximately 20 minutes...")

predictor = voxtral_mini_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.8xlarge",              # GPU instance for fast inference
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),                # Handle JSON requests
    deserializer=JSONDeserializer(),            # Handle JSON responses
    container_startup_health_check_timeout=900, # Extended startup timeout
    model_data_download_timeout=900,            # Extended download timeout
)

print(f"✅ Endpoint deployed successfully: {endpoint_name}")

Deploying endpoint: voxtral-mini-real-time-endpoint-1755007773
This will take approximately 20 minutes...
-----------!✅ Endpoint deployed successfully: voxtral-mini-real-time-endpoint-1755007773
CPU times: user 13min 56s, sys: 30.8 s, total: 14min 27s
Wall time: 20min 23s


## 4. Test Different Use Cases

Now let's test the deployed model with various input types and scenarios.

### 4.1 Text-Only Conversation

Test the model's text generation capabilities without any audio input.

In [14]:
# Simple text-only conversation
payload = {
    "text": "Hello, how are you today?",
    "max_new_tokens": 100,      # Limit response length
    "temperature": 0.3          # Lower temperature for more focused responses
}

print("🔤 Testing text-only conversation...")
response = predictor.predict(payload)
print("Response:", response["response"])

🔤 Testing text-only conversation...
Response: Hello! I'm functioning as intended, thank you. How about you? How's your day going?
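
Applications that do not use the SageMaker Python SDK can call the same endpoint through the `sagemaker-runtime` client's `invoke_endpoint` operation. The following is a minimal sketch under that assumption. `invoke_json` is a hypothetical helper, and `FakeRuntime` is a stand-in that echoes the request so the example runs without an endpoint. The `response` key mirrors the payloads shown in this notebook.

```python
import io
import json
from typing import Any, Dict


def invoke_json(runtime_client: Any, endpoint_name: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Send a JSON payload to an endpoint and parse the JSON reply (hypothetical helper)."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The reply body is a stream, so read it before decoding.
    return json.loads(response["Body"].read())


class FakeRuntime:
    """Offline stand-in for boto3.client("sagemaker-runtime")."""

    def invoke_endpoint(self, **kwargs: Any) -> Dict[str, Any]:
        request = json.loads(kwargs["Body"])
        body = json.dumps({"response": f"echo: {request['text']}"}).encode("utf-8")
        return {"Body": io.BytesIO(body)}


result = invoke_json(FakeRuntime(), "my-endpoint", {"text": "Hello", "max_new_tokens": 100})
print(result["response"])
```

Against a real deployment, `FakeRuntime()` would be replaced by `boto3.client("sagemaker-runtime")` and `"my-endpoint"` by the notebook's `endpoint_name`.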


### 4.2 Audio-Only Analysis

Test the model's ability to understand and describe audio content without additional text prompts.

In [15]:
# Audio-only analysis - model describes what it hears
payload = {
    "audio": [
        "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3"
    ],
    "max_new_tokens": 800,     # Allow longer response for detailed analysis
    "temperature": 0.3
}

print("🎵 Testing audio-only analysis...")
response = predictor.predict(payload)
print("Response:", response["response"])

🎵 Testing audio-only analysis...
Response: The audio describes a dramatic moment in a baseball game, likely a playoff or championship game, where a home run is hit, and the team is celebrating their potential to win the American League Championship. Here's a breakdown:

1. **Pitcher and Batter**: The pitcher is 0-1, meaning he has not yet struck out the batter. The batter is Edgar Martinez, a known power hitter.

2. **Swing and Hit**: The pitcher swings and misses, and the batter hits a home run. The ball goes over the left field line, resulting in a home run.

3. **Base Running**: The runner, likely named Joy, is at first base. Junior is at third base, and the team is trying to wave him in to score.

4. **Potential Run**: If Junior scores, it would be a potential run for the team, which is already celebrating the home run.

5. **American League Championship**: The team is aiming to play for the American League Championship, which is a significant achievement in baseball.

6. **Celebra

### 4.3 Local Audio File (Base64 Encoding)

Test processing local audio files by encoding them as base64 strings.

In [16]:
# Process local audio file via base64 encoding
audio_file_path = "winning_call.mp3" 

# Read the audio file as binary data (preserves original format)
with open(audio_file_path, 'rb') as audio_file:
    audio_bytes = audio_file.read()

# Encode the audio file to base64 for transmission
audio_b64 = base64.b64encode(audio_bytes).decode('utf-8')

payload = {
    "audio": [audio_b64],           # Base64 encoded audio
    "text": "What is being said in this audio?",  # Specific question
    "max_new_tokens": 300,
    "temperature": 0.3
}

print("📁 Testing base64 encoded local audio file...")
response = predictor.predict(payload)
print("Response:", response["response"])

📁 Testing base64 encoded local audio file...
Response: In this audio, the commentator is describing a dramatic moment in a baseball game where a home run is hit, and the team is celebrating their potential to play for the American League Championship. The commentator expresses disbelief and excitement, repeatedly saying "I don't believe it" and "my, oh my."
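
Base64 encoding makes the audio roughly one third larger than the original file, so it can help to check the request size before sending it. The sketch below assumes a hypothetical helper, `build_audio_payload`, and a caller-supplied `max_request_bytes` limit. The limit in the demo is an arbitrary illustration, not a documented quota.

```python
import base64
import json
from typing import Any, Dict


def build_audio_payload(audio_bytes: bytes, question: str, max_request_bytes: int) -> Dict[str, Any]:
    """Build the base64 audio payload and reject it if the JSON body is too large.

    Hypothetical helper. `max_request_bytes` is chosen by the caller.
    """
    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    payload = {
        "audio": [audio_b64],
        "text": question,
        "max_new_tokens": 300,
        "temperature": 0.3,
    }
    size = len(json.dumps(payload).encode("utf-8"))
    if size > max_request_bytes:
        raise ValueError(f"Request is {size} bytes, above the limit of {max_request_bytes}")
    return payload


# 3000 raw bytes become 4000 base64 characters.
demo = build_audio_payload(b"\x00" * 3000, "What is being said in this audio?", 5_000_000)
print(len(demo["audio"][0]))
```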


### 4.4 Multi-Audio Analysis

Test the model's ability to analyze multiple audio files together and answer specific questions.

In [17]:
# Analyze multiple audio files with a specific question
payload = {
    "audio": [
        "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
        "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3"
    ],
    "text": "What sport and what nursery rhyme are referenced in these audio clips?",
    "max_new_tokens": 500,
    "do_sample": False,         # Use deterministic generation
    "temperature": 1.0
}

print("🎵🎵 Testing multiple audio analysis...")
response = predictor.predict(payload)
print("Response:", response["response"])

🎵🎵 Testing multiple audio analysis...
Response: The first audio clip references a nursery rhyme, specifically "Mary Had a Little Lamb." The second audio clip references baseball, specifically the 1969 American League Championship Series between the Baltimore Orioles and the New York Mets.


### 4.5 Multi-Turn Conversation Format

Test complex multi-turn conversations with mixed audio and text content using the structured conversation format.

In [22]:
# Multi-turn conversation with audio and text across multiple exchanges
payload = {
    "conversation": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
                },
                {
                    "type": "audio", 
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
                },
                {
                    "type": "text", 
                    "text": "Describe briefly what you can hear."},
            ],
        },
        {
            # Assistant's previous response (simulating conversation history)
            "role": "assistant",
            "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
                },
                {
                    "type": "text", 
                    "text": "Ok, now compare this new audio with the previous one."},
            ],
        },
    ],
    "max_new_tokens": 400,
    "temperature": 0.3
}

print("💬 Testing multi-turn conversation format...")
response = predictor.predict(payload)
print("Response:", response["response"])

💬 Testing multi-turn conversation format...
Response: The previous audio was a political speech, while this audio is a sports commentary. The speaker is describing a baseball game, specifically a home run hit by Edgar Martinez, and the excitement of the moment. The speaker expresses disbelief and joy, contrasting with the previous audio's serious tone and focus on political themes.
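
The nested conversation structure is easy to get wrong when written by hand. As a sketch, the hypothetical helpers below (`user_turn`, `assistant_turn`, and `build_conversation_payload`) build the same shape as the payload above: user turns carry a list of audio and text items, and assistant turns carry plain text.

```python
from typing import Any, Dict, List, Optional


def user_turn(text: str, audio_paths: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build a user turn with the audio items first, followed by the text item."""
    content: List[Dict[str, str]] = [{"type": "audio", "path": p} for p in (audio_paths or [])]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def assistant_turn(text: str) -> Dict[str, Any]:
    """Build an assistant turn carrying earlier model output as conversation history."""
    return {"role": "assistant", "content": text}


def build_conversation_payload(
    turns: List[Dict[str, Any]], max_new_tokens: int = 400, temperature: float = 0.3
) -> Dict[str, Any]:
    """Wrap the turns in the request format used in this notebook."""
    return {"conversation": list(turns), "max_new_tokens": max_new_tokens, "temperature": temperature}


base = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/"
conversation_payload = build_conversation_payload([
    user_turn("Describe briefly what you can hear.", [base + "obama.mp3", base + "bcn_weather.mp3"]),
    assistant_turn("A farewell address followed by a weather report."),
    user_turn("Ok, now compare this new audio with the previous one.", [base + "winning_call.mp3"]),
])
print(len(conversation_payload["conversation"]))
```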


### 4.6 Transcription-Only Mode

Test the model's pure speech-to-text transcription capabilities without additional analysis or commentary.

In [23]:
# Pure transcription mode - convert speech to text only
payload = {
    "audio": ["https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3"],
    "transcribe_only": True,    # Enable transcription-only mode
    "language": "en",           # Specify language for better accuracy
    "max_new_tokens": 500       # Allow sufficient tokens for full transcription
}

print("📝 Testing transcription-only mode...")
response = predictor.predict(payload)
print("Response:", response["response"])

📝 Testing transcription-only mode...
Response: This week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye-to-eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners and on distant military outposts, All these conversations are what have kept me honest, kept me inspired, and kept me going. Every day, I learned from you. You made me a better president, and you made me a better man. Over the course of these eight years, I've seen the goodness, the resilience, and the hope of the American people. I've seen neighbors looking out for each other as we rescued our economy from the worst crisis of our lifetimes. I've hugged cancer survivors who finally know the security of affordable health care. I've seen communities like Joplin rebuild from disaster, and cities li

## 5. Summary and Next Steps

### What We've Demonstrated

This notebook showed how to successfully deploy and use Mistral AI's Voxtral Mini model on SageMaker for various audio and text processing tasks:

1. **✅ Text-only conversations** - Standard language model interactions
2. **✅ Audio-only analysis** - Automatic audio content description and understanding  
3. **✅ Guided audio analysis** - Audio processing with specific text instructions
4. **✅ Multi-audio processing** - Simultaneous analysis of multiple audio files
5. **✅ Multi-turn conversations** - Complex dialogues with mixed media content
6. **✅ Pure transcription** - Speech-to-text conversion without additional analysis

### Key Features of Our Implementation

- **Flexible input formats**: Supports URLs, base64 encoded files, and conversation structures
- **Memory optimized**: Handles large audio files with automatic cleanup
- **Production ready**: Comprehensive error handling and logging
- **Scalable**: Real-time inference endpoint with GPU acceleration

### Supported Audio Formats
- MP3, WAV, M4A, and other common audio formats
- Both remote URLs and local files (via base64 encoding)
- Multiple audio files in a single request

### Next Steps
1. **Scale the deployment** - Add auto-scaling policies for production workloads
2. **Add monitoring** - Set up CloudWatch metrics and alarms
3. **Implement caching** - Cache frequent requests to reduce costs
4. **Add security** - Implement authentication and authorization
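5. **Optimize costs** - Consider scheduled scaling or asynchronous inference for workloads that do not need an always-on endpoint (Spot instances are not available for real-time endpoints)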
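
The auto-scaling step above can be sketched with the Application Auto Scaling API, which exposes `register_scalable_target` and `put_scaling_policy` for SageMaker endpoint variants. The helper name `configure_autoscaling`, the capacity bounds, the target value, and the cooldowns are illustrative assumptions. `AllTraffic` is assumed to be the variant name, which is the SDK default for a single-variant deployment. `RecordingClient` is an offline stand-in that only records the calls.

```python
from typing import Any, Dict, List, Tuple


def configure_autoscaling(
    autoscaling_client: Any,
    endpoint_name: str,
    min_capacity: int = 1,
    max_capacity: int = 2,
    target_invocations: float = 10.0,
    variant_name: str = "AllTraffic",
) -> str:
    """Register the endpoint variant and attach a target-tracking policy (hypothetical helper)."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    autoscaling_client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
    autoscaling_client.put_scaling_policy(
        PolicyName=f"{endpoint_name}-invocations-target",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_invocations,  # illustrative value, tune for your workload
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    )
    return resource_id


class RecordingClient:
    """Offline stand-in for boto3.client("application-autoscaling")."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def register_scalable_target(self, **kwargs: Any) -> None:
        self.calls.append(("register_scalable_target", kwargs))

    def put_scaling_policy(self, **kwargs: Any) -> None:
        self.calls.append(("put_scaling_policy", kwargs))


client = RecordingClient()
print(configure_autoscaling(client, "voxtral-mini-real-time-endpoint-demo"))
```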

## 6. Cleanup

**⚠️ Important**: Remember to delete your SageMaker endpoint when you're done testing to avoid ongoing charges.

Run the cell below to delete the endpoint:

In [None]:
# Cleanup: Delete the SageMaker model and endpoint to avoid charges

# Delete the model first if it is no longer needed: delete_model() looks up the
# model name through the endpoint configuration, which is removed with the endpoint below.
predictor.delete_model()
print("✅ Model deleted successfully")

print(f"Deleting endpoint: {endpoint_name}")
predictor.delete_endpoint(delete_endpoint_config=True)
print("✅ Endpoint deleted successfully")
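
To confirm that nothing was left running, the endpoints whose names contain the notebook's prefix can be listed with the SageMaker `list_endpoints` operation. This is a sketch only. `find_endpoints` is a hypothetical helper, and `FakeSageMaker` is a stand-in that returns two canned pages so the pagination logic can be run offline.

```python
from typing import Any, Dict, List


def find_endpoints(sagemaker_client: Any, name_contains: str) -> List[str]:
    """Return the names of all endpoints whose name contains `name_contains`."""
    names: List[str] = []
    kwargs: Dict[str, Any] = {"NameContains": name_contains}
    while True:
        page = sagemaker_client.list_endpoints(**kwargs)
        names.extend(item["EndpointName"] for item in page.get("Endpoints", []))
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token  # request the next page of results


class FakeSageMaker:
    """Offline stand-in for boto3.client("sagemaker") that serves two canned pages."""

    def list_endpoints(self, **kwargs: Any) -> Dict[str, Any]:
        if "NextToken" not in kwargs:
            return {"Endpoints": [{"EndpointName": "voxtral-mini-a"}], "NextToken": "page-2"}
        return {"Endpoints": [{"EndpointName": "voxtral-mini-b"}]}


print(find_endpoints(FakeSageMaker(), "voxtral-mini"))
```

An empty list from a real client would indicate that no matching endpoints remain in the current region.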