# Video Understanding Examples

This notebook demonstrates how to use Amazon Nova 2 Omni for understanding video content. Nova 2 Omni can analyze videos, understand actions, extract insights, and classify video types.

**Supported video formats:** mp4, mov, avi, mkv, webm

**Note:** For audio understanding examples, see **01_speech_understanding_examples.ipynb**. For image generation examples, see **02_image_generation_examples.ipynb**.

---

## Setup

### Helper Functions

Run the cell below to establish helper functions used by the examples in this notebook.

In [None]:
import json
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError
from IPython.display import Image, display

import nova_utils

MODEL_ID = "us.amazon.nova-2-omni-v1:0"
REGION_ID = "us-west-2"

def get_bedrock_runtime():
    """Returns a properly configured Bedrock Runtime client."""
    config = Config(read_timeout=2 * 60)
    bedrock = boto3.client(
        service_name="bedrock-runtime",
        region_name=REGION_ID,
        config=config,
    )
    return bedrock
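
Every example below builds the same request dictionary and then pulls the first text block out of the response. As a minimal sketch, that repeated logic could be factored into two small functions. `build_video_request` and `extract_text` are hypothetical names introduced here for illustration; they are not part of `nova_utils` or boto3. The request shape simply mirrors the examples in this notebook.

```python
def build_video_request(model_id, video_bytes, prompt, video_format="mp4", max_tokens=10000):
    """Build keyword arguments for bedrock_runtime.converse() (hypothetical helper)."""
    return {
        "modelId": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"video": {"format": video_format, "source": {"bytes": video_bytes}}},
                    {"text": prompt},
                ],
            }
        ],
        # Same defaults as the examples in this notebook.
        "inferenceConfig": {"temperature": 0, "topP": 1, "maxTokens": max_tokens},
    }


def extract_text(response):
    """Return the first text block in a Converse response, or None if there is none."""
    content = response.get("output", {}).get("message", {}).get("content", [])
    return next((item["text"] for item in content if "text" in item), None)
```

With these in place, a call would look like `response = get_bedrock_runtime().converse(**build_video_request(MODEL_ID, video_bytes, user_prompt))` followed by `print(extract_text(response))`.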

---

## Video Understanding

Every example in this section sends the same sample video, `media/Cheesecake.mp4`, to the Converse API as raw bytes alongside a text prompt. Only the prompt and the inference parameters change between examples.

---

### Example 1a: Video Summarization

Create executive summaries of video content.

**Recommended inference parameters:**
* `temperature`: 0
* `topP`: 1
* Some use cases may benefit from enabling model reasoning

In [None]:
INPUT_VIDEO_PATH = "media/Cheesecake.mp4"
user_prompt = "Can you create an executive summary of this video's content?"

with open(INPUT_VIDEO_PATH, "rb") as video_file:
    video_bytes = video_file.read()

request = {
    "modelId": MODEL_ID,
    "messages": [
        {
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": user_prompt},
            ],
        }
    ],
    "inferenceConfig": {"temperature": 0, "topP": 1, "maxTokens": 10000},
}

bedrock_runtime = get_bedrock_runtime()

try:
    response = bedrock_runtime.converse(**request)
    text_content = next((item for item in response["output"]["message"]["content"] if "text" in item), None)
    
    if text_content:
        print("== Video Summary ==")
        print(text_content["text"])

except ClientError as err:
    print(f"Error occurred: {err}")
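
The request above hardcodes `"format": "mp4"`. For other input files, the format string could be derived from the file name. The sketch below assumes that the format value is simply the lowercase file extension and that the supported set is the one listed at the top of this notebook; both are assumptions to check against the service documentation. `video_format_from_path` is a hypothetical helper name.

```python
from pathlib import Path

# Formats listed at the top of this notebook; adjust if the service documentation differs.
SUPPORTED_VIDEO_FORMATS = {"mp4", "mov", "avi", "mkv", "webm"}


def video_format_from_path(path):
    """Return the lowercase file extension if it is a listed format, else raise ValueError."""
    extension = Path(path).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_VIDEO_FORMATS:
        raise ValueError(f"Unsupported video format: {extension or '(none)'}")
    return extension
```

The result can replace the literal `"mp4"` in the `video` block, for example `"format": video_format_from_path(INPUT_VIDEO_PATH)`.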

---

### Example 1b: Step-by-Step Recipe Extraction

Extract structured recipe information from cooking videos.

**Recommended inference parameters:**
* `temperature`: 0
* `topP`: 1
* Some use cases may benefit from enabling model reasoning

In [None]:
INPUT_VIDEO_PATH = "media/Cheesecake.mp4"
user_prompt = """Extract the recipe from this video. Provide:
1. Recipe name
2. Ingredients list with measurements
3. Step-by-step instructions

Format as a clear, structured recipe."""

with open(INPUT_VIDEO_PATH, "rb") as video_file:
    video_bytes = video_file.read()

request = {
    "modelId": MODEL_ID,
    "messages": [
        {
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": user_prompt},
            ],
        }
    ],
    "inferenceConfig": {"temperature": 0, "topP": 1, "maxTokens": 10000},
}

bedrock_runtime = get_bedrock_runtime()

try:
    response = bedrock_runtime.converse(**request)
    text_content = next((item for item in response["output"]["message"]["content"] if "text" in item), None)
    
    if text_content:
        print("== Extracted Recipe ==")
        print(text_content["text"])

except ClientError as err:
    print(f"Error occurred: {err}")


---

### Example 1c: Rich Visual Description

Generate detailed descriptions focusing on visual elements, colors, composition, and cinematography.

**Recommended inference parameters:**
* `temperature`: 0
* `topP`: 1
* Some use cases may benefit from enabling model reasoning

In [None]:
INPUT_VIDEO_PATH = "media/Cheesecake.mp4"
user_prompt = """Provide a rich visual description of this video. Focus on:
- Camera angles and framing (top-down, close-up, etc.)
- Color palette and lighting
- Visual composition and layout
- Movement and transitions
- Text overlays and their styling
- Overall aesthetic and production quality

Describe what makes this video visually engaging."""

with open(INPUT_VIDEO_PATH, "rb") as video_file:
    video_bytes = video_file.read()

request = {
    "modelId": MODEL_ID,
    "messages": [
        {
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": user_prompt},
            ],
        }
    ],
    "inferenceConfig": {"temperature": 0, "topP": 1, "maxTokens": 10000},
}

bedrock_runtime = get_bedrock_runtime()

try:
    response = bedrock_runtime.converse(**request)
    text_content = next((item for item in response["output"]["message"]["content"] if "text" in item), None)
    
    if text_content:
        print("== Rich Visual Description ==")
        print(text_content["text"])

except ClientError as err:
    print(f"Error occurred: {err}")

---

### Example 1d: Event Timestamp Extraction

Extract timestamps for specific events in videos.

**Recommended inference parameters:**
* `temperature`: 0
* `topP`: 1
* Reasoning should not be used

In [None]:
INPUT_VIDEO_PATH = "media/Cheesecake.mp4"
event_query = "mixing ingredients"
user_prompt = f"Please localize the moment that the event '{event_query}' happens in the video. Answer with the starting and ending time of the event in seconds. e.g. [[72, 82]]. If the event happens multiple times, list all of them. e.g. [[40, 50], [72, 82]]"

with open(INPUT_VIDEO_PATH, "rb") as video_file:
    video_bytes = video_file.read()

request = {
    "modelId": MODEL_ID,
    "messages": [
        {
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": user_prompt},
            ],
        }
    ],
    "inferenceConfig": {"temperature": 0, "topP": 1, "maxTokens": 10000},
}

bedrock_runtime = get_bedrock_runtime()

try:
    response = bedrock_runtime.converse(**request)
    text_content = next((item for item in response["output"]["message"]["content"] if "text" in item), None)
    
    if text_content:
        print(f"== Event Timestamps for '{event_query}' ==")
        print(text_content["text"])

except ClientError as err:
    print(f"Error occurred: {err}")
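
The prompt asks for spans in the form `[[40, 50], [72, 82]]`, which makes the reply easy to turn into numbers. The following is a minimal sketch, assuming the model follows that format; `parse_event_spans` is a hypothetical helper, and the regular expression tolerates extra text around the list.

```python
import re

# Matches one "[start, end]" pair of non-negative numbers, e.g. "[72, 82]" or "[3.5, 10]".
_SPAN_PATTERN = re.compile(r"\[\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]")


def parse_event_spans(text):
    """Return (start, end) pairs in seconds found in the model's reply."""
    return [(float(start), float(end)) for start, end in _SPAN_PATTERN.findall(text)]
```

Calling `parse_event_spans(text_content["text"])` yields an empty list when the reply contains no span in the requested format, so downstream code can treat that case as "event not found".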

---

### Example 1e: Video Segmentation with Timestamps

Generate a log of video segments with timestamps and captions.

**Recommended inference parameters:**
* `temperature`: 0
* `topP`: 1
* Reasoning should not be used

In [None]:
INPUT_VIDEO_PATH = "media/Cheesecake.mp4"
user_prompt = "Segment the video into different scenes and generate a caption for each scene. The output should be in the format: [START TIME-END TIME] CAPTION. Use MM:SS format for timestamps."

with open(INPUT_VIDEO_PATH, "rb") as video_file:
    video_bytes = video_file.read()

request = {
    "modelId": MODEL_ID,
    "messages": [
        {
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": user_prompt},
            ],
        }
    ],
    "inferenceConfig": {"temperature": 0, "topP": 1, "maxTokens": 10000},
}

bedrock_runtime = get_bedrock_runtime()

try:
    response = bedrock_runtime.converse(**request)
    text_content = next((item for item in response["output"]["message"]["content"] if "text" in item), None)
    
    if text_content:
        print("== Video Segmentation ==")
        print(text_content["text"])

except ClientError as err:
    print(f"Error occurred: {err}")
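
Because the prompt requests one `[MM:SS-MM:SS] CAPTION` entry per scene, the reply can be converted into structured records. This is a sketch under the assumption that each segment appears on its own line in that format; `parse_segments` is a hypothetical helper, and lines that do not match are skipped rather than treated as errors.

```python
import re

# One segment per line: "[MM:SS-MM:SS] caption text".
_SEGMENT_PATTERN = re.compile(r"\[\s*(\d+):(\d{2})\s*-\s*(\d+):(\d{2})\s*\]\s*(.+)")


def parse_segments(text):
    """Return a list of dicts with start/end in seconds and the caption text."""
    segments = []
    for line in text.splitlines():
        match = _SEGMENT_PATTERN.search(line)
        if not match:
            continue  # Skip lines that do not follow the requested format.
        start_min, start_sec, end_min, end_sec, caption = match.groups()
        segments.append(
            {
                "start": int(start_min) * 60 + int(start_sec),
                "end": int(end_min) * 60 + int(end_sec),
                "caption": caption.strip(),
            }
        )
    return segments
```

Converting to seconds keeps the output comparable with the event spans from Example 1d.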

---

### Example 1f: Video Classification

Classify videos based on predefined categories.

**Recommended inference parameters:**
* `temperature`: 0
* `topP`: 1
* Reasoning should not be used

In [None]:
INPUT_VIDEO_PATH = "media/Cheesecake.mp4"
user_prompt = """What is the most appropriate category for this video? Select your answer from the options provided:
Cooking Tutorial
Home Repair
Makeup Tutorial
Other"""

with open(INPUT_VIDEO_PATH, "rb") as video_file:
    video_bytes = video_file.read()

request = {
    "modelId": MODEL_ID,
    "messages": [
        {
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": user_prompt},
            ],
        }
    ],
    "inferenceConfig": {"temperature": 0, "topP": 1, "maxTokens": 100},
}

bedrock_runtime = get_bedrock_runtime()

try:
    response = bedrock_runtime.converse(**request)
    text_content = next((item for item in response["output"]["message"]["content"] if "text" in item), None)
    
    if text_content:
        print("== Video Classification ==")
        print(text_content["text"])

except ClientError as err:
    print(f"Error occurred: {err}")
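
The model may wrap its answer in extra words or change the casing, so it can help to map the reply back onto the predefined list. A minimal sketch, assuming a simple case-insensitive substring match is good enough; `match_category` is a hypothetical helper, and the category list is copied from the prompt above.

```python
CATEGORIES = ["Cooking Tutorial", "Home Repair", "Makeup Tutorial", "Other"]


def match_category(reply, categories=CATEGORIES, fallback="Other"):
    """Return the first listed category mentioned in the reply (case-insensitive)."""
    lowered = reply.lower()
    for category in categories:
        if category.lower() in lowered:
            return category
    # Nothing recognizable in the reply: use the catch-all category.
    return fallback
```

Listing the specific categories before `Other` matters here, because the loop returns on the first match.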

---

## Key Takeaways

- **Video Summarization**: Create executive summaries of video content
- **Recipe Extraction**: Extract structured recipe information from cooking videos
- **Visual Description**: Analyze cinematography, colors, and composition
- **Timestamp Extraction**: Locate specific events within videos
- **Video Segmentation**: Break videos into timestamped segments with captions
- **Video Classification**: Categorize videos based on content
- **Inference Parameters**: Use `temperature` 0 and `topP` 1 for factual, consistent responses; leave reasoning off for timestamp extraction, segmentation, and classification

## Next Steps

- Explore **01_speech_understanding_examples.ipynb** for comprehensive audio processing examples
- Check out **02_image_generation_examples.ipynb** to learn about image generation capabilities
- Experiment with different prompts and inference parameters to optimize for your use case