# Speech Understanding Examples

This notebook demonstrates how to use Amazon Nova 2 Omni for speech understanding tasks. Nova 2 Omni can transcribe, summarize, analyze, answer questions about, and translate speech content in audio files.

**Supported audio formats:** mp3, opus, wav, aac, flac, mp4, ogg, mkv
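
The examples in this notebook pass the audio format by hand. As a minimal sketch, the format could instead be derived from the file extension and checked against the list above. The helper name `infer_audio_type` and the constant `SUPPORTED_AUDIO_FORMATS` are hypothetical and are not part of `nova_utils`. The sketch also assumes that the extension matches the actual encoding of the file, which is not guaranteed.

```python
from pathlib import Path

# Formats listed as supported in this notebook.
SUPPORTED_AUDIO_FORMATS = {"mp3", "opus", "wav", "aac", "flac", "mp4", "ogg", "mkv"}


def infer_audio_type(audio_path):
    """Return the lowercase file extension if it is a supported audio format.

    Hypothetical helper: it only inspects the file name, not the file contents.
    """
    suffix = Path(audio_path).suffix.lower().lstrip(".")
    if suffix not in SUPPORTED_AUDIO_FORMATS:
        raise ValueError(
            f"Unsupported audio format '{suffix}' for {audio_path}; "
            f"expected one of {sorted(SUPPORTED_AUDIO_FORMATS)}"
        )
    return suffix
```

Failing early on an unsupported extension avoids sending a request that the service would likely reject.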

---

## Setup

### Helper Functions

Run the cell below to define the helper functions used by the examples in this notebook.

In [None]:
import boto3
from botocore.config import Config

import nova_utils

MODEL_ID = "us.amazon.nova-2-omni-v1:0"
REGION_ID = "us-west-2"

def get_bedrock_runtime():
    """Returns a properly configured Bedrock Runtime client."""
    config = Config(
        read_timeout=2 * 60,
    )
    bedrock = boto3.client(
        service_name="bedrock-runtime",
        region_name=REGION_ID,
        config=config,
    )
    return bedrock


def speech_to_text(
    audio_path,
    audio_type,
    text_prompt,
    temperature=None,
    max_tokens=10000,
    reasoning_effort=None,
):
    """
    Generates a text output from a text prompt and a single input audio.

    Args:
        audio_path: The path to the input audio.
        audio_type: Type of the audio file (mp3, opus, wav, aac, flac, mp4, ogg, mkv)
        text_prompt: The text prompt to use for speech understanding.
        temperature: Optional temperature parameter (0.0-1.0). If None, uses model default.
        max_tokens: Maximum number of tokens to generate (default: 10000).
        reasoning_effort: Optional reasoning effort level ("low", "medium", "high"). If None, reasoning is disabled.

    Returns:
        (Dict) A dictionary with "text" and "request_id" keys
    """
    audio_bytes = nova_utils.load_audio_as_bytes(audio_path)

    # Build inference config
    inference_config = {"maxTokens": max_tokens}
    if temperature is not None:
        inference_config["temperature"] = temperature

    # Build request
    request = {
        "modelId": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"audio": {"format": audio_type, "source": {"bytes": audio_bytes}}},
                    {"text": text_prompt},
                ],
            }
        ],
        "inferenceConfig": inference_config,
    }

    # Add reasoning config if specified
    if reasoning_effort is not None:
        if reasoning_effort.lower() not in ["low", "medium", "high"]:
            raise ValueError("reasoning_effort must be 'low', 'medium', or 'high'")

        request["additionalModelRequestFields"] = {
            "reasoningConfig": {
                "type": "enabled",
                "maxReasoningEffort": reasoning_effort.lower(),
            }
        }

    bedrock_runtime = get_bedrock_runtime()

    response = bedrock_runtime.converse(**request)

    return {
        "text": nova_utils.extract_response_text(response),
        "request_id": response.get("ResponseMetadata", {}).get("RequestId", "N/A"),
    }

---

## Use Case 1: Transcribing Speech from Audio Files

Nova 2 Omni can transcribe speech in audio files and can annotate who is speaking, a task known as diarization.

**Recommended inference parameters:**
* `temperature`: 0 (greedy decoding)
* Reasoning should not be used

---

### Example 1a: Speech Transcription (Without Diarization)

**Recommended prompt template:**
```
Transcribe the audio.
```

In [None]:
audio_path = "media/call_1763087723216.wav"
audio_type = "wav"
temperature = 0.0

user_prompt = "Transcribe the audio."

try:
    result = speech_to_text(audio_path, audio_type, user_prompt, temperature)

    if result["text"]:
        print(f"Request ID: {result['request_id']}\n")
        print("== Transcription Output ==\n")
        print(result["text"])

        # Store for later use in Q&A examples
        transcription = result["text"]

except Exception as e:
    print(f"Error occurred: {e}")

---

### Example 1b: Speech Transcription (With Diarization)

**Recommended prompt template:**
```
For each speaker turn segment, transcribe, assign a speaker label, start and end timestamps.
You must follow the exact XML format shown in the example below:
'<segment><transcription speaker="speaker_id" start="start_time" end="end_time">transcription_text</transcription></segment>'
```

In [None]:
audio_path = "media/call_1763087723216.wav"
audio_type = "wav"
temperature = 0.0

user_prompt = """For each speaker turn segment, transcribe, assign a speaker label, start and end timestamps. You must follow the exact XML format shown in the example below: '<segment><transcription speaker="speaker_id" start="start_time" end="end_time">transcription_text</transcription></segment>'"""

try:
    result = speech_to_text(audio_path, audio_type, user_prompt, temperature)

    if result["text"]:
        print(f"Request ID: {result['request_id']}\n")
        print("== Transcription with Diarization Output ==\n")
        print(result["text"])

except Exception as e:
    print(f"Error occurred: {e}")
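
The diarization prompt asks for one `<segment>` element per speaker turn, which makes the output convenient to post-process. The following is a minimal sketch of turning that text into a list of dictionaries. The function name `parse_diarization` and the sample string are hypothetical; the sample is made up for illustration and is not real model output. A regular expression is used instead of an XML parser on the assumption that the model output may contain several top-level segments or be slightly malformed. Timestamps are kept as strings because their exact format is not specified in this notebook.

```python
import re
from html import unescape

# Matches one <transcription ...>...</transcription> element in the requested format.
SEGMENT_PATTERN = re.compile(
    r'<transcription\s+speaker="(?P<speaker>[^"]*)"\s+'
    r'start="(?P<start>[^"]*)"\s+end="(?P<end>[^"]*)"\s*>'
    r"(?P<text>.*?)</transcription>",
    re.DOTALL,
)


def parse_diarization(output_text):
    """Extract speaker turns from diarization output as a list of dicts."""
    segments = []
    for match in SEGMENT_PATTERN.finditer(output_text):
        segments.append(
            {
                "speaker": match.group("speaker"),
                "start": match.group("start"),  # kept as a string
                "end": match.group("end"),  # kept as a string
                "text": unescape(match.group("text").strip()),
            }
        )
    return segments


# Hypothetical sample in the requested format (not real model output).
sample_output = (
    '<segment><transcription speaker="speaker_1" start="0.0" end="2.5">'
    "Hello, thanks for calling.</transcription></segment>\n"
    '<segment><transcription speaker="speaker_2" start="2.5" end="4.0">'
    "Hi &amp; good morning.</transcription></segment>"
)

for segment in parse_diarization(sample_output):
    print(segment["speaker"], segment["start"], segment["end"], segment["text"])
```

In the notebook, `parse_diarization(result["text"])` would be applied to the output of the cell above.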

---

## Use Case 2: Summarizing Speech in Audio Files

Nova 2 Omni can understand speech in audio files and generate concise summaries.

**Recommended inference parameters:**
* `temperature`: use the default for text tasks (the example below leaves it unset)
* `topP`: use the default for text tasks
* Some use cases may benefit from model reasoning, but we recommend starting without it

---

### Example 2: Summarize Audio Content

**Recommended prompt template:**
```
Extract and summarize the essential details from the audio.
```

In [None]:
audio_path = "media/call_1763087723216.wav"
audio_type = "wav"

user_prompt = "Extract and summarize the essential details from the audio."

try:
    result = speech_to_text(audio_path, audio_type, user_prompt)

    if result["text"]:
        print(f"Request ID: {result['request_id']}\n")
        print("== Summary Output ==\n")
        print(result["text"])

except Exception as e:
    print(f"Error occurred: {e}")

---

## Use Case 3: Analyzing Audio Calls

Nova 2 Omni can understand speech in audio files and generate structured call analytics tailored to your business needs.

**Recommended inference parameters:**
* `temperature`: use the default for text tasks (the example below leaves it unset)
* `topP`: use the default for text tasks
* Some use cases may benefit from model reasoning, but we recommend starting without it

---

### Example 3: Call Analytics with Structured JSON Output

**Example prompt:**
```
Analyze the call and return JSON:
{
  "call_summary": "Summarize the call",
  "customer_intent": "What the customer wanted",
  "resolution_status": "resolved/pending/escalated",
  "key_topics": ["topic1", "topic2"],
  "action_items": [
    {"task": "description", "owner": "agent/customer", "priority": "high/medium/low"}
  ],
  "sentiment_analysis": {
    "overall": "positive/neutral/negative"
  },
  "follow_up_required": true/false
}
```

**Note:** You can customize the JSON structure based on your specific business needs.

In [None]:
audio_path = "media/call_1763087723216.wav"
audio_type = "wav"

user_prompt = """Analyze the call and return JSON:
{
  "call_summary": "Summarize the call",
  "customer_intent": "What the customer wanted",
  "resolution_status": "resolved/pending/escalated",
  "key_topics": ["topic1", "topic2"],
  "action_items": [
    {"task": "description", "owner": "agent/customer", "priority": "high/medium/low"}
  ],
  "sentiment_analysis": {
    "overall": "positive/neutral/negative"
  },
  "follow_up_required": true/false
}"""

try:
    result = speech_to_text(audio_path, audio_type, user_prompt)

    if result["text"]:
        print(f"Request ID: {result['request_id']}\n")
        print("== Call Analytics Output ==\n")
        print(result["text"])

except Exception as e:
    print(f"Error occurred: {e}")
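
The call analytics prompt requests JSON, but the response arrives as plain text and may be wrapped in a code fence or surrounded by a sentence of explanation. The following is a minimal sketch of recovering the first JSON object from such text using only the standard library. The function name `extract_json` is hypothetical, and the sketch assumes the response contains at most one object of interest.

```python
import json


def extract_json(output_text):
    """Return the first JSON object found in the text, or None if there is none."""
    decoder = json.JSONDecoder()
    pos = output_text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and ignores trailing text.
            obj, _ = decoder.raw_decode(output_text, pos)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = output_text.find("{", pos + 1)
    return None
```

In the notebook, `analytics = extract_json(result["text"])` would give a dictionary whose fields, such as `analytics["resolution_status"]`, can be read directly. A `None` result indicates that no parseable object was found.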

---

## Use Case 4: Asking Questions About Audio File Content

You can leverage Nova 2 Omni's speech understanding capabilities for question and answer use cases.

**Recommended inference parameters:**
* `temperature`: use the default for text tasks (the examples below leave it unset)
* `topP`: use the default for text tasks
* Some use cases may benefit from model reasoning, but we recommend starting without it

**Note:** No specific prompting template is required. Simply ask your question naturally.

---

### Example 4a: Count Speakers in Audio

In [None]:
audio_path = "media/call_1763087723216.wav"
audio_type = "wav"

user_prompt = "How many speakers are in the audio?"

try:
    result = speech_to_text(audio_path, audio_type, user_prompt)

    if result["text"]:
        print(f"Request ID: {result['request_id']}\n")
        print("== Q&A Output ==\n")
        print(f"Question: {user_prompt}")
        print(f"Answer: {result['text']}")

except Exception as e:
    print(f"Error occurred: {e}")

### Example 4b: Analyze Emotional Tone

In [None]:
audio_path = "media/call_1763087723216.wav"
audio_type = "wav"

user_prompt = "What was the overall emotional tone of the speaker (e.g., frustrated, calm, excited)?"

try:
    result = speech_to_text(audio_path, audio_type, user_prompt)

    if result["text"]:
        print(f"Request ID: {result['request_id']}\n")
        print("== Q&A Output ==\n")
        print(f"Question: {user_prompt}")
        print(f"Answer: {result['text']}")

except Exception as e:
    print(f"Error occurred: {e}")

### Example 4c: List People Mentioned

In [None]:
audio_path = "media/call_1763087723216.wav"
audio_type = "wav"

user_prompt = "List the people mentioned in the audio."

try:
    result = speech_to_text(audio_path, audio_type, user_prompt)

    if result["text"]:
        print(f"Request ID: {result['request_id']}\n")
        print("== Q&A Output ==\n")
        print(f"Question: {user_prompt}")
        print(f"Answer: {result['text']}")

except Exception as e:
    print(f"Error occurred: {e}")

### Example 4d: Detect Background Noise

In [None]:
audio_path = "media/call_1763087723216.wav"
audio_type = "wav"

user_prompt = "Is there background noise in the audio?"

try:
    result = speech_to_text(audio_path, audio_type, user_prompt)

    if result["text"]:
        print(f"Request ID: {result['request_id']}\n")
        print("== Q&A Output ==\n")
        print(f"Question: {user_prompt}")
        print(f"Answer: {result['text']}")

except Exception as e:
    print(f"Error occurred: {e}")

### Example 4e: Describe Speakers

In [None]:
audio_path = "media/call_1763087723216.wav"
audio_type = "wav"

user_prompt = "Describe the speakers in the audio."

try:
    result = speech_to_text(audio_path, audio_type, user_prompt)

    if result["text"]:
        print(f"Request ID: {result['request_id']}\n")
        print("== Q&A Output ==\n")
        print(f"Question: {user_prompt}")
        print(f"Answer: {result['text']}")

except Exception as e:
    print(f"Error occurred: {e}")
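
---

The Q&A cells above differ only in the prompt, so they could be collapsed into one loop. The following is a minimal sketch. The function name `ask_questions` is hypothetical, and the `ask_fn` argument is assumed to be a callable with the same signature and return shape as `speech_to_text` in this notebook. Each question is sent as a separate request, so the audio is uploaded once per question.

```python
def ask_questions(ask_fn, audio_path, audio_type, questions):
    """Ask each question about the same audio file and collect the answers.

    ask_fn is assumed to behave like speech_to_text: it takes
    (audio_path, audio_type, text_prompt) and returns a dict with a "text" key.
    """
    answers = {}
    for question in questions:
        try:
            result = ask_fn(audio_path, audio_type, question)
            answers[question] = result["text"]
        except Exception as e:
            # Record the failure and continue so one error does not stop the batch.
            answers[question] = f"Error occurred: {e}"
    return answers
```

In the notebook, this could be called as `ask_questions(speech_to_text, "media/call_1763087723216.wav", "wav", ["How many speakers are in the audio?", "Is there background noise in the audio?"])`.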