# Voxtral Audio Models on Amazon Bedrock

---

This notebook demonstrates how to use the **Voxtral Mini** (3B) and **Voxtral Small** (24B) audio-language models on Amazon Bedrock. Both models accept audio input alongside text, which enables transcription, question answering over audio, and multi-turn discussion of audio content.

## Voxtral Models Overview

| Model | Parameters | Context | Best For |
|-------|------------|---------|----------|
| **Voxtral Mini** | 3B | 32k tokens | Fast transcription, edge/local deployment |
| **Voxtral Small** | 24B | 32k tokens | Complex audio analysis, detailed understanding |

## Key Capabilities

- **Audio Transcription**: Convert speech to text in multiple languages
- **Audio Understanding**: Answer questions about audio content
- **Multi-Turn Conversations**: Discuss audio content with context
- **Multiple Audio Files**: Process multiple audio inputs in one request
- **Streaming**: Real-time response generation

## Supported Languages

English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian

## Audio Specifications

- **Max Duration**: ~45 seconds per request (due to ~2MB payload limit)
- **Format**: WAV (16-bit PCM) - convert from OGG/MP3/FLAC
- **Sample Rate**: 16kHz recommended
- **Longer Audio**: Chunk into 30-second segments
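
The ~45-second figure follows from the size of uncompressed audio once it is base64-encoded. The sketch below works through that arithmetic. `estimate_payload_bytes` is a hypothetical helper, and it assumes a 44-byte PCM WAV header and ignores the small amount of space taken by the JSON wrapper and the text prompt.

```python
import math


def estimate_payload_bytes(duration_s, sample_rate=16000, channels=1, sample_width=2):
    """Estimate WAV and base64 sizes (in bytes) for uncompressed PCM audio."""
    # Assumption: a canonical 44-byte WAV header precedes the sample data
    wav_bytes = 44 + int(duration_s * sample_rate) * channels * sample_width
    # Base64 encodes every 3 input bytes as 4 output characters (with padding)
    base64_bytes = 4 * math.ceil(wav_bytes / 3)
    return wav_bytes, base64_bytes


for seconds in (30, 45, 60):
    wav_size, b64_size = estimate_payload_bytes(seconds)
    print(f"{seconds:>3}s -> WAV {wav_size / 1e6:.2f} MB, base64 {b64_size / 1e6:.2f} MB")
```

At 16kHz mono 16-bit, 45 seconds comes to roughly 1.92 MB of base64 text, which is why longer clips need to be chunked.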

---

## 1. Setup

---

In [None]:
# Ensure required packages are installed
%pip install --upgrade --quiet boto3 soundfile

In [None]:
import boto3
import json
import base64
import wave
import struct
import io
import os
import time
import numpy as np
import soundfile as sf
from typing import Optional, Tuple

# Voxtral Model IDs on Amazon Bedrock
MODELS = {
    "mini": "mistral.voxtral-mini-3b-2507",
    "small": "mistral.voxtral-small-24b-2507"
}

REGION = "us-west-2"

# Audio files directory: the notebook's working directory
# (__file__ is not defined inside a notebook, so use the current directory)
AUDIO_DIR = os.getcwd()

# Initialize Bedrock client
bedrock = boto3.client('bedrock-runtime', region_name=REGION)

print("Voxtral Models Available:")
for name, model_id in MODELS.items():
    print(f"  {name}: {model_id}")
print(f"\nConnected to Bedrock in {REGION}")

## 2. Helper Functions

These utilities help load and convert audio files for use with Voxtral models.

---

In [None]:
def load_and_convert_audio(
    file_path: str,
    target_sample_rate: int = 16000,
    max_duration: Optional[float] = None
) -> Tuple[bytes, dict]:
    """
    Load an audio file and convert it to WAV format suitable for Voxtral.
    
    Handles any input format soundfile can read (WAV, FLAC, OGG; MP3
    depends on the installed libsndfile version) and converts to:
    - 16-bit PCM WAV
    - Mono channel
    - Target sample rate (default 16kHz)
    
    Args:
        file_path: Path to the audio file
        target_sample_rate: Output sample rate (16kHz recommended)
        max_duration: Maximum duration in seconds (None for full file)
    
    Returns:
        Tuple of (wav_bytes, info_dict)
    """
    # Load audio with soundfile
    data, sample_rate = sf.read(file_path)
    
    original_duration = len(data) / sample_rate
    
    # Trim to max duration if specified
    if max_duration and original_duration > max_duration:
        samples = int(max_duration * sample_rate)
        data = data[:samples]
    
    # Convert stereo to mono
    if len(data.shape) > 1:
        data = data.mean(axis=1)
    
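    # Resample if needed (simple sample picking with no anti-aliasing filter;
    # adequate for a demo, but a dedicated resampler gives cleaner audio)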
    if sample_rate != target_sample_rate:
        ratio = target_sample_rate / sample_rate
        new_length = int(len(data) * ratio)
        indices = np.linspace(0, len(data) - 1, new_length).astype(int)
        data = data[indices]
    
    # Convert to 16-bit PCM (clip first so out-of-range floats cannot wrap around)
    if np.issubdtype(data.dtype, np.floating):
        data = (np.clip(data, -1.0, 1.0) * 32767).astype(np.int16)
    elif data.dtype != np.int16:
        data = data.astype(np.int16)
    
    # Create WAV in memory
    wav_buffer = io.BytesIO()
    with wave.open(wav_buffer, 'wb') as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(target_sample_rate)
        wav_file.writeframes(data.tobytes())
    
    wav_bytes = wav_buffer.getvalue()
    
    info = {
        "original_sample_rate": sample_rate,
        "original_duration": original_duration,
        "converted_sample_rate": target_sample_rate,
        "converted_duration": len(data) / target_sample_rate,
        "size_bytes": len(wav_bytes)
    }
    
    return wav_bytes, info


def audio_to_base64(audio_bytes: bytes) -> str:
    """Encode audio bytes to base64 string."""
    return base64.b64encode(audio_bytes).decode('utf-8')


def call_voxtral(
    audio_base64: str,
    text_prompt: str,
    model: str = "small",
    max_tokens: int = 500,
    temperature: float = 0.7
) -> dict:
    """
    Call Voxtral model with audio and text input.
    
    Args:
        audio_base64: Base64 encoded audio (WAV format)
        text_prompt: Text question or instruction
        model: 'mini' or 'small'
        max_tokens: Maximum tokens in response
        temperature: Sampling temperature (0.0-1.0)
    
    Returns:
        Response dictionary from Bedrock
    """
    payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": audio_base64,
                            "format": "wav"
                        }
                    },
                    {
                        "type": "text",
                        "text": text_prompt
                    }
                ]
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature
    }
    
    response = bedrock.invoke_model(
        modelId=MODELS[model],
        body=json.dumps(payload),
        contentType="application/json"
    )
    
    return json.loads(response['body'].read())


def call_voxtral_text_only(
    text_prompt: str,
    model: str = "small",
    max_tokens: int = 500
) -> dict:
    """
    Call Voxtral model with text-only input (no audio).
    """
    payload = {
        "messages": [{"role": "user", "content": text_prompt}],
        "max_tokens": max_tokens
    }
    
    response = bedrock.invoke_model(
        modelId=MODELS[model],
        body=json.dumps(payload),
        contentType="application/json"
    )
    
    return json.loads(response['body'].read())


print("Helper functions loaded!")
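
The resampling step in `load_and_convert_audio` picks existing samples by index, which is quick but can introduce aliasing when downsampling. As a minimal sketch of one alternative, the hypothetical helper below interpolates linearly between neighbouring samples using `np.interp`. It still applies no low-pass filter, so a polyphase resampler such as `scipy.signal.resample_poly` would be a better fit when audio quality matters.

```python
import numpy as np


def resample_linear(samples: np.ndarray, source_rate: int, target_rate: int) -> np.ndarray:
    """Resample a mono signal by linear interpolation (illustrative helper)."""
    if source_rate == target_rate or len(samples) == 0:
        return samples
    new_length = int(len(samples) * target_rate / source_rate)
    # Positions of the original samples and of the samples we want to produce
    old_positions = np.arange(len(samples))
    new_positions = np.linspace(0, len(samples) - 1, new_length)
    return np.interp(new_positions, old_positions, samples)


# One second of a 440 Hz tone at 44.1kHz, resampled to 16kHz
t = np.arange(44100) / 44100
tone = np.sin(2 * np.pi * 440 * t)
resampled = resample_linear(tone, 44100, 16000)
print(f"{len(tone)} samples -> {len(resampled)} samples")
```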

## 3. Load Sample Audio Files

We'll use the audio files included in this directory for our examples:

| File | Description | Duration |
|------|-------------|----------|
| `jfk.wav` | JFK inaugural speech excerpt | ~11s |
| `librispeech_1.wav` | LibriSpeech audiobook sample | ~3.5s |
| `VOA_News_Headlines_(May_5,_2009).ogg` | Voice of America news broadcast | ~5 min |

---

In [None]:
# Define audio file paths
AUDIO_FILES = {
    "jfk": "jfk.wav",                    # JFK speech excerpt (~11s)
    "librispeech": "librispeech_1.wav",  # LibriSpeech sample (~3.5s)
    "voa_news": "VOA_News_Headlines_(May_5,_2009).ogg",  # VOA News broadcast (~5 min)
}

# Load and convert audio files
print("Loading audio files...")
print("=" * 60)

audio_data = {}

for name, filename in AUDIO_FILES.items():
    file_path = os.path.join(AUDIO_DIR, filename)  # Files are in same directory as notebook
    
    if os.path.exists(file_path):
        # Set max duration based on file type
        # JFK: limit to 5s (large file at 44.1kHz stereo)
        # VOA News: limit to 30s (API has ~2MB payload limit)
        if name == "jfk":
            max_dur = 5.0
        elif name == "voa_news":
            max_dur = 30.0  # Use first 30 seconds (fits within 2MB limit)
        else:
            max_dur = None
            
        wav_bytes, info = load_and_convert_audio(file_path, max_duration=max_dur)
        
        audio_data[name] = {
            "bytes": wav_bytes,
            "base64": audio_to_base64(wav_bytes),
            "info": info
        }
        
        print(f"\n{name.upper()} ({filename})")
        print(f"  Original: {info['original_duration']:.2f}s @ {info['original_sample_rate']}Hz")
        print(f"  Converted: {info['converted_duration']:.2f}s @ {info['converted_sample_rate']}Hz")
        print(f"  Size: {info['size_bytes']:,} bytes ({info['size_bytes']/1024:.1f} KB)")
    else:
        print(f"\n{name.upper()}: File not found - {file_path}")

print("\n" + "=" * 60)
print(f"Loaded {len(audio_data)} audio file(s)")

---

## 4. Audio Transcription

Let's transcribe speech from our audio files.

---

In [None]:
# Transcribe JFK speech
print("Transcribing JFK Speech...")
print("=" * 60)

if "jfk" in audio_data:
    result = call_voxtral(
        audio_base64=audio_data["jfk"]["base64"],
        text_prompt="Transcribe this audio exactly as spoken.",
        model="small",
        max_tokens=500
    )
    
    print(f"\nTranscription:")
    print(result['choices'][0]['message']['content'])
    print(f"\nTokens used: {result['usage']}")
else:
    print("JFK audio file not loaded")

In [None]:
# Transcribe LibriSpeech sample
print("Transcribing LibriSpeech Sample...")
print("=" * 60)

if "librispeech" in audio_data:
    result = call_voxtral(
        audio_base64=audio_data["librispeech"]["base64"],
        text_prompt="Transcribe this audio exactly as spoken.",
        model="small",
        max_tokens=500
    )
    
    print(f"\nTranscription:")
    print(result['choices'][0]['message']['content'])
    print(f"\nTokens used: {result['usage']}")
else:
    print("LibriSpeech audio file not loaded")

### Longer Audio: VOA News Broadcast

The VOA News file demonstrates transcription of longer, more complex audio with multiple topics.

In [None]:
# Transcribe VOA News broadcast (first 30 seconds)
print("Transcribing VOA News Broadcast...")
print("=" * 60)

if "voa_news" in audio_data:
    info = audio_data["voa_news"]["info"]
    print(f"Audio: {info['converted_duration']:.1f} seconds of news broadcast")
    print(f"Original file duration: {info['original_duration']:.1f} seconds (~5 minutes)")
    print("-" * 60)
    
    result = call_voxtral(
        audio_base64=audio_data["voa_news"]["base64"],
        text_prompt="Transcribe this news broadcast completely and accurately.",
        model="small",
        max_tokens=1000
    )
    
    print(f"\nTranscription:")
    print(result['choices'][0]['message']['content'])
    print(f"\nTokens used: {result['usage']}")
else:
    print("VOA News audio file not loaded")

In [None]:
# Q&A about VOA News content
print("Question & Answer: VOA News Analysis")
print("=" * 60)

if "voa_news" in audio_data:
    questions = [
        "What news stories are covered in this broadcast? List them briefly.",
        "What is the main headline or lead story?",
        "What geographic regions or countries are mentioned?",
    ]
    
    for q in questions:
        print(f"\nQ: {q}")
        print("-" * 50)
        
        result = call_voxtral(
            audio_base64=audio_data["voa_news"]["base64"],
            text_prompt=q,
            model="small",
            max_tokens=300
        )
        
        print(f"A: {result['choices'][0]['message']['content']}")
else:
    print("VOA News audio file not loaded")

### Transcribing Longer Audio: Chunking Strategy

For audio longer than ~45 seconds, you must chunk the audio and make sequential API calls. Here's a practical approach:

In [None]:
def transcribe_long_audio(
    file_path: str,
    chunk_duration: float = 30.0,
    model: str = "small",
    overlap: float = 2.0
) -> str:
    """
    Transcribe audio files longer than the API limit by chunking.
    
    Args:
        file_path: Path to audio file (any format soundfile supports)
        chunk_duration: Duration of each chunk in seconds (default 30s)
        model: 'mini' or 'small'
        overlap: Overlap between chunks to avoid cutting words (default 2s)
    
    Returns:
        Full transcription as a single string
    """
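    # Load the full audio file
    data, sample_rate = sf.read(file_path)

    # Downmix to mono up front; each chunk is written with a single-channel WAV header
    if data.ndim > 1:
        data = data.mean(axis=1)

    total_duration = len(data) / sample_rate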
    
    print(f"Total audio duration: {total_duration:.1f}s")
    print(f"Chunk size: {chunk_duration}s with {overlap}s overlap")
    
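    # Calculate number of chunks. Subtracting the overlap avoids a trailing
    # chunk that contains only audio already covered by the previous chunk.
    effective_chunk = chunk_duration - overlap
    num_chunks = max(1, int(np.ceil((total_duration - overlap) / effective_chunk)))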
    print(f"Processing {num_chunks} chunks...")
    print("-" * 50)
    
    transcriptions = []
    
    for i in range(num_chunks):
        # Calculate start and end for this chunk
        start_time = i * effective_chunk
        end_time = min(start_time + chunk_duration, total_duration)
        
        # Extract chunk
        start_sample = int(start_time * sample_rate)
        end_sample = int(end_time * sample_rate)
        chunk_data = data[start_sample:end_sample]
        
        # Convert chunk to WAV format
        # Resample to 16kHz if needed
        target_sr = 16000
        if sample_rate != target_sr:
            ratio = target_sr / sample_rate
            new_length = int(len(chunk_data) * ratio)
            indices = np.linspace(0, len(chunk_data) - 1, new_length).astype(int)
            chunk_data = chunk_data[indices]
        
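        # Convert to 16-bit PCM (clip first so out-of-range floats cannot wrap around)
        if np.issubdtype(chunk_data.dtype, np.floating):
            chunk_data = (np.clip(chunk_data, -1.0, 1.0) * 32767).astype(np.int16)
        elif chunk_data.dtype != np.int16:
            chunk_data = chunk_data.astype(np.int16)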
        
        # Create WAV bytes
        wav_buffer = io.BytesIO()
        with wave.open(wav_buffer, 'wb') as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)
            wav_file.setframerate(target_sr)
            wav_file.writeframes(chunk_data.tobytes())
        
        chunk_b64 = base64.b64encode(wav_buffer.getvalue()).decode('utf-8')
        
        # Call Voxtral for this chunk
        print(f"Chunk {i+1}/{num_chunks}: {start_time:.1f}s - {end_time:.1f}s", end=" ")
        
        result = call_voxtral(
            audio_base64=chunk_b64,
            text_prompt="Transcribe this audio exactly. Do not add any commentary.",
            model=model,
            max_tokens=500
        )
        
        chunk_text = result['choices'][0]['message']['content'].strip()
        transcriptions.append(chunk_text)
        print(f"({len(chunk_text)} chars)")
    
    # Combine transcriptions
    # Note: With overlap, there may be some repeated words at chunk boundaries
    full_transcription = " ".join(transcriptions)
    
    print("-" * 50)
    print(f"Total transcription: {len(full_transcription)} characters")
    
    return full_transcription


# Transcribe first 2 minutes of VOA News (30s chunks with 2s overlap, so 5 chunks)
print("Transcribing 2 minutes of VOA News using chunking...")
print("=" * 60)

# We'll transcribe 120 seconds (2 minutes) to demonstrate
# Load file info first
voa_data, voa_sr = sf.read("VOA_News_Headlines_(May_5,_2009).ogg")
voa_duration = len(voa_data) / voa_sr

# Create a temporary trimmed version for the demo (2 minutes)
demo_duration = min(120.0, voa_duration)  # 2 minutes or less

# Save trimmed version temporarily
trimmed_samples = int(demo_duration * voa_sr)
trimmed_data = voa_data[:trimmed_samples]
sf.write("_temp_voa_2min.wav", trimmed_data, voa_sr)

# Transcribe using chunking
full_transcript = transcribe_long_audio(
    "_temp_voa_2min.wav",
    chunk_duration=30.0,
    model="small",
    overlap=2.0
)

print(f"\n{'='*60}")
print("FULL TRANSCRIPTION:")
print("=" * 60)
print(full_transcript)

# Clean up temp file
os.remove("_temp_voa_2min.wav")
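
Chunk boundaries can be previewed without calling the API, which makes it easier to see how the overlap affects the number of requests. The sketch below is a standalone illustration. `chunk_boundaries` is a hypothetical helper, not part of the notebook's functions, and it stops as soon as a chunk reaches the end of the audio.

```python
def chunk_boundaries(total_duration, chunk_duration=30.0, overlap=2.0):
    """Return (start, end) times in seconds for overlapping chunks."""
    if not 0 <= overlap < chunk_duration:
        raise ValueError("overlap must be non-negative and smaller than chunk_duration")
    if total_duration <= 0:
        return []

    step = chunk_duration - overlap  # how far each new chunk advances
    boundaries = []
    start = 0.0
    while True:
        end = min(start + chunk_duration, total_duration)
        boundaries.append((start, end))
        if end >= total_duration:
            break
        start += step
    return boundaries


for index, (start, end) in enumerate(chunk_boundaries(120.0), start=1):
    print(f"Chunk {index}: {start:.1f}s - {end:.1f}s")
```

With 30-second chunks and a 2-second overlap, each chunk advances by 28 seconds, so two minutes of audio takes five requests rather than four.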

---

## 5. Model Comparison

Compare transcription quality and latency between Voxtral Mini and Small.

---

In [None]:
# Compare Mini vs Small on the same audio
print("Model Comparison: Voxtral Mini vs Small")
print("=" * 70)

if "librispeech" in audio_data:
    test_audio = audio_data["librispeech"]["base64"]
    test_prompt = "Describe the speaker's tone, then transcribe the audio."
    
    print(f"Audio: LibriSpeech ({audio_data['librispeech']['info']['converted_duration']:.2f}s)")
    print(f"Prompt: {test_prompt}")
    print("=" * 70)
    
    for model_name in ["mini", "small"]:
        start = time.time()
        result = call_voxtral(
            audio_base64=test_audio,
            text_prompt=test_prompt,
            model=model_name,
            max_tokens=300
        )
        latency = time.time() - start
        
        print(f"\n--- Voxtral {model_name.title()} ({latency:.2f}s) ---")
        print(result['choices'][0]['message']['content'])
        print(f"\nTokens: {result['usage']}")
else:
    print("No audio loaded for comparison")
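
A single timed request is a noisy basis for comparing latency, since network conditions and service load vary between calls. As a rough sketch, the hypothetical helper below repeats a call several times and reports the median. The `time.sleep` stand-in exists only so the example runs without Bedrock access.

```python
import statistics
import time


def measure_latency(call, repeats=3):
    """Run `call` several times and summarize wall-clock latency in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


# Stand-in workload; in the notebook this could be
# lambda: call_voxtral(audio_base64=test_audio, text_prompt=test_prompt, model="mini")
stats = measure_latency(lambda: time.sleep(0.01), repeats=5)
print(f"median {stats['median_s']:.3f}s (min {stats['min_s']:.3f}s, max {stats['max_s']:.3f}s)")
```

Each repeat is a billed request, so keep `repeats` small when timing real model calls.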

---

## 6. Streaming Responses

Stream responses as they are generated so that output appears before the full response is complete.

---

In [None]:
def stream_voxtral(
    audio_base64: str,
    text_prompt: str,
    model: str = "small",
    max_tokens: int = 500
):
    """
    Stream response from Voxtral model.
    
    Yields:
        Text chunks as they're generated
    """
    payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": audio_base64,
                            "format": "wav"
                        }
                    },
                    {"type": "text", "text": text_prompt}
                ]
            }
        ],
        "max_tokens": max_tokens,
        "stream": True
    }
    
    response = bedrock.invoke_model_with_response_stream(
        modelId=MODELS[model],
        body=json.dumps(payload),
        contentType="application/json"
    )
    
    for event in response['body']:
        chunk = event.get('chunk')
        if chunk:
            data = json.loads(chunk['bytes'].decode())
            if 'choices' in data and data['choices']:
                delta = data['choices'][0].get('delta', {})
                content = delta.get('content', '')
                if content:
                    yield content


# Test streaming with JFK audio
print("Streaming Transcription:")
print("-" * 50)

if "jfk" in audio_data:
    full_response = ""
    for chunk in stream_voxtral(
        audio_base64=audio_data["jfk"]["base64"],
        text_prompt="Transcribe",
        model="small"
    ):
        print(chunk, end="", flush=True)
        full_response += chunk
    
    print(f"\n\nTotal characters streamed: {len(full_response)}")
else:
    print("JFK audio not loaded")

---

## 7. Audio Understanding

Ask questions about the audio content beyond simple transcription.

---

In [None]:
# Ask analytical questions about the JFK speech
print("Audio Understanding - JFK Speech Analysis")
print("=" * 60)

if "jfk" in audio_data:
    questions = [
        "What is the main message of this speech?",
        "Describe the speaker's tone and delivery style.",
        "What historical context might this speech be from?"
    ]
    
    for q in questions:
        print(f"\nQ: {q}")
        print("-" * 50)
        
        result = call_voxtral(
            audio_base64=audio_data["jfk"]["base64"],
            text_prompt=q,
            model="small",
            max_tokens=200
        )
        
        print(f"A: {result['choices'][0]['message']['content']}")
else:
    print("JFK audio not loaded")

---

## 8. Multi-Turn Conversations

Have follow-up conversations about audio content.

---

In [None]:
class VoxtralConversation:
    """Manage multi-turn conversations with Voxtral."""
    
    def __init__(self, model: str = "small"):
        self.model = model
        self.messages = []
    
    def send_with_audio(self, audio_base64: str, text: str, max_tokens: int = 500) -> str:
        """Send a message with audio."""
        user_message = {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_base64, "format": "wav"}},
                {"type": "text", "text": text}
            ]
        }
        self.messages.append(user_message)
        return self._get_response(max_tokens)
    
    def send_text(self, text: str, max_tokens: int = 500) -> str:
        """Send a text-only follow-up message."""
        self.messages.append({"role": "user", "content": text})
        return self._get_response(max_tokens)
    
    def _get_response(self, max_tokens: int) -> str:
        """Get response from model."""
        payload = {"messages": self.messages, "max_tokens": max_tokens}
        
        response = bedrock.invoke_model(
            modelId=MODELS[self.model],
            body=json.dumps(payload),
            contentType="application/json"
        )
        
        result = json.loads(response['body'].read())
        assistant_content = result['choices'][0]['message']['content']
        
        self.messages.append({"role": "assistant", "content": assistant_content})
        return assistant_content
    
    def clear(self):
        """Clear conversation history."""
        self.messages = []


# Multi-turn conversation about JFK speech
print("Multi-Turn Conversation")
print("=" * 60)

if "jfk" in audio_data:
    conv = VoxtralConversation(model="small")
    
    # Turn 1: Initial question with audio
    print("\nTurn 1 - With Audio:")
    print("-" * 40)
    response1 = conv.send_with_audio(
        audio_base64=audio_data["jfk"]["base64"],
        text="Who is speaking and what are they talking about?"
    )
    print(f"User: Who is speaking and what are they talking about?")
    print(f"Assistant: {response1}")
    
    # Turn 2: Follow-up (text only)
    print("\nTurn 2 - Text Only:")
    print("-" * 40)
    response2 = conv.send_text("When was this speech given and why was it significant?")
    print(f"User: When was this speech given and why was it significant?")
    print(f"Assistant: {response2}")
    
    # Turn 3: Another follow-up
    print("\nTurn 3 - Text Only:")
    print("-" * 40)
    response3 = conv.send_text("What was the full quote that includes 'ask not'?")
    print(f"User: What was the full quote that includes 'ask not'?")
    print(f"Assistant: {response3}")
    
    print(f"\nConversation history: {len(conv.messages)} messages")
else:
    print("JFK audio not loaded")
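
Because `VoxtralConversation` resends the whole message history on every turn, the base64 audio from the first turn is included in each follow-up request. The sketch below estimates the serialized request size so it can be compared with the payload limit. `request_size_bytes` and `PAYLOAD_LIMIT_BYTES` are hypothetical names, the 2,000,000-byte figure is an assumption based on the approximate limit quoted in this notebook, and the audio is a placeholder string.

```python
import json

# Assumption: approximate request limit described in this notebook (~2MB)
PAYLOAD_LIMIT_BYTES = 2_000_000


def request_size_bytes(messages, max_tokens=500):
    """Size of the JSON request body that would be sent for these messages."""
    payload = {"messages": messages, "max_tokens": max_tokens}
    return len(json.dumps(payload).encode("utf-8"))


# Placeholder standing in for ~1 MB of base64 audio
fake_audio = "A" * 1_000_000
messages = [
    {
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": fake_audio, "format": "wav"}},
            {"type": "text", "text": "Who is speaking and what are they talking about?"},
        ],
    },
    {"role": "assistant", "content": "A placeholder answer."},
    {"role": "user", "content": "When was this speech given?"},
]

size = request_size_bytes(messages)
print(f"Request size: {size:,} bytes ({size / PAYLOAD_LIMIT_BYTES:.0%} of the assumed limit)")
```

In the notebook, `request_size_bytes(conv.messages)` would give the size of the next request. Text turns add little, so the audio dominates the total.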

---

## 9. Multiple Audio Files

Process multiple audio inputs in a single request for comparison.

---

In [None]:
def call_voxtral_multi_audio(
    audio_list: list,
    text_prompt: str,
    model: str = "small",
    max_tokens: int = 500
) -> dict:
    """
    Call Voxtral with multiple audio files.
    
    Args:
        audio_list: List of base64-encoded audio strings
        text_prompt: Text question about the audio files
        model: 'mini' or 'small'
        max_tokens: Maximum tokens in response
    """
    content = []
    
    for audio_b64 in audio_list:
        content.append({
            "type": "input_audio",
            "input_audio": {"data": audio_b64, "format": "wav"}
        })
    
    content.append({"type": "text", "text": text_prompt})
    
    payload = {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens
    }
    
    response = bedrock.invoke_model(
        modelId=MODELS[model],
        body=json.dumps(payload),
        contentType="application/json"
    )
    
    return json.loads(response['body'].read())


# Compare two audio files
print("Multiple Audio Comparison")
print("=" * 60)

if "jfk" in audio_data and "librispeech" in audio_data:
    print("Audio 1: JFK Speech")
    print("Audio 2: LibriSpeech Sample")
    print("=" * 60)
    
    result = call_voxtral_multi_audio(
        audio_list=[
            audio_data["jfk"]["base64"],
            audio_data["librispeech"]["base64"]
        ],
        text_prompt="Compare these two audio samples. Describe the differences in content, speaker style, and audio quality.",
        model="small",
        max_tokens=400
    )
    
    print(f"\nComparison:")
    print(result['choices'][0]['message']['content'])
    print(f"\nTokens: {result['usage']}")
else:
    print("Need both audio files loaded for comparison")

---

## 10. Best Practices & Limitations

---

### Model Selection

| Use Case | Recommended Model | Reasoning |
|----------|-------------------|------------|
| Quick classification | **Voxtral Mini** | Fast, low latency |
| Simple transcription | **Voxtral Mini** | Cost-effective for basic tasks |
| Detailed analysis | **Voxtral Small** | Better comprehension |
| Complex conversations | **Voxtral Small** | Stronger reasoning |

### Audio Format Requirements

| Specification | Recommendation |
|---------------|----------------|
| **Format** | WAV (16-bit PCM) - convert from MP3/OGG/FLAC |
| **Sample Rate** | 16kHz for speech |
| **Channels** | Mono recommended |
| **Max Duration** | ~45 seconds per request (see payload limits below) |

### InvokeModel API Limitations

| Limitation | Value | Notes |
|------------|-------|-------|
| **Request payload** | ~2 MB | Voxtral-specific limit on Bedrock |
| **Max audio per request** | ~45 seconds | At 16kHz mono 16-bit WAV |
| **Recommended chunk size** | 30 seconds | With 2s overlap for word boundaries |

**Why ~2MB?** While Bedrock's InvokeModel API allows up to 25MB payloads, Voxtral models on Bedrock have a stricter ~2MB limit. This restricts audio to ~45 seconds per request.

### Transcribing Long Audio (Over ~45 Seconds)

For audio longer than ~45 seconds, you **must chunk the audio** and make sequential API calls:

1. **Multiple audio blocks do not raise the limit** - All `input_audio` blocks in a request count against the same ~2MB limit
2. **Must chunk sequentially** - Process 30-second segments one at a time
3. **Use overlap** - 2-second overlap prevents cutting words at boundaries
4. **Combine results** - Concatenate transcriptions (may have minor repetition at boundaries)

See the `transcribe_long_audio()` function above for a working implementation.
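
The repetition at chunk boundaries can be reduced by looking for words that end one chunk and begin the next. The sketch below shows one simple heuristic and is not part of `transcribe_long_audio()`. `merge_transcripts` is a hypothetical helper that drops the longest run of words shared between consecutive chunks, ignoring case and trailing punctuation. It only helps when the model transcribes the overlapping audio identically in both chunks.

```python
def _normalize(word):
    """Lowercase a word and strip surrounding punctuation for comparison."""
    return word.lower().strip(".,!?;:\"'")


def merge_transcripts(chunks, max_overlap_words=12):
    """Join chunk transcripts, dropping words repeated across a boundary."""
    merged = []
    for text in chunks:
        words = text.split()
        limit = min(max_overlap_words, len(merged), len(words))
        overlap = 0
        # Try the longest candidate overlap first, then shrink
        for size in range(limit, 0, -1):
            tail = [_normalize(w) for w in merged[-size:]]
            head = [_normalize(w) for w in words[:size]]
            if tail == head:
                overlap = size
                break
        merged.extend(words[overlap:])
    return " ".join(merged)


print(merge_transcripts([
    "ask not what your country",
    "your country can do for you",
]))
```

When no shared words are found, the chunks are joined with a space, which matches the plain concatenation described in step 4.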

### Supported Input Formats

| Format | Support | Notes |
|--------|---------|-------|
| **WAV** | Direct | 16-bit PCM recommended |
| **OGG** | Convert | Use soundfile to convert to WAV |
| **MP3** | Convert | Not accepted directly; convert to WAV first |
| **FLAC** | Convert | Use soundfile to convert to WAV |

### Current Limitations

1. **Payload Size**: ~2MB limit restricts audio to ~45 seconds per request
2. **MP3 Format**: Not directly supported - must convert to WAV first
3. **Audio URLs**: Only base64-encoded audio is supported (no URL references)
4. **Converse API**: Does not support audio content blocks
5. **Tool Calling**: Not available for Voxtral models on Bedrock

### Tips for Best Results

1. **Convert audio**: Always convert to 16-bit PCM WAV at 16kHz mono
2. **Chunk long audio**: Use 30-second chunks with 2-second overlap
3. **Calculate size**: At 16kHz mono: 1 second ≈ 32KB WAV ≈ 43KB base64
4. **Use clear prompts**: Be specific about transcription vs. analysis
5. **Use streaming**: For interactive applications, stream for better UX
6. **Multi-turn for context**: Use conversations for follow-up questions

---

## Clean Up

This notebook uses Amazon Bedrock's serverless inference, so there are no persistent resources to clean up. You are charged based on token usage.

In [None]:
# Clear any conversation history
if 'conv' in dir():
    conv.clear()

print("Notebook complete!")
print("\nVoxtral models tested:")
for name, model_id in MODELS.items():
    print(f"  - {name}: {model_id}")
print("\nAudio files used:")
for name in audio_data.keys():
    print(f"  - {name}")