# Video Analysis with Amazon Nova 2 Lite

This notebook demonstrates video understanding with Amazon Nova 2 Lite through the Amazon Bedrock Converse API.

## Use Case Description

The [Nova 2 multimodal prompting guide](https://docs.aws.amazon.com/nova/latest/nova2-userguide/prompting-multimodal.html) documents the video understanding capabilities listed below. This notebook works through summarization, dense captioning, and timestamp extraction on a single video:

- **Video Summarization:** Extracting a concise summary of the video content
- **Dense Captioning:** Generating detailed, scene-by-scene descriptions
- **Security Footage Analysis:** Detecting events in camera footage
- **Timestamp Extraction:** Localizing events with start/end times as structured JSON
- **Video Classification:** Categorizing the video from a predefined class list

## Setup

In [None]:
import boto3
import os
import json
import re
import uuid
import base64
import urllib.request
from IPython.display import Video, Markdown, display, HTML

In [None]:
%store -r MODEL_ID
%store -r region_name

In [None]:
bedrock = boto3.client("bedrock-runtime", region_name=region_name)

In [None]:
# Download the video and load bytes — reused across all exercises
video_url = "https://ws-assets-prod-iad-r-iad-ed304a55c2ca1aee.s3.us-east-1.amazonaws.com/8082573f-f39e-4e39-a48f-f3562cc6e597/aws-ads-rainy-day.mp4"
video_path = "video/aws-ads-rainy-day.mp4"

os.makedirs("video", exist_ok=True)
if not os.path.exists(video_path):
    print("Downloading video...")
    urllib.request.urlretrieve(video_url, video_path)
    print("Download complete.")

with open(video_path, "rb") as f:
    video_bytes = f.read()

print(f"Video loaded: {len(video_bytes) / (1024*1024):.1f} MB")
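
Sending the file as inline bytes is convenient for a short clip, but inline payloads are subject to request size limits. The sketch below chooses between inline bytes and an `s3Location` source, which the Converse API video block also accepts. The helper name `make_video_block`, the 25 MB threshold, and the example bucket URI are assumptions for illustration; check the current Bedrock quotas for the actual limit.

```python
from typing import Optional

# Assumed inline-size threshold; not taken from this notebook. Verify against current quotas.
MAX_INLINE_BYTES = 25 * 1024 * 1024


def make_video_block(video_bytes: Optional[bytes] = None,
                     s3_uri: Optional[str] = None,
                     video_format: str = "mp4") -> dict:
    """Build a Converse API video content block (hypothetical helper)."""
    if s3_uri is not None:
        if not s3_uri.startswith("s3://"):
            raise ValueError("s3_uri must start with 's3://'")
        source = {"s3Location": {"uri": s3_uri}}
    elif video_bytes is not None:
        if len(video_bytes) > MAX_INLINE_BYTES:
            raise ValueError("Video exceeds the assumed inline limit; upload it to S3 instead")
        source = {"bytes": video_bytes}
    else:
        raise ValueError("Provide either video_bytes or s3_uri")
    return {"video": {"format": video_format, "source": source}}


# Small inline payload (placeholder bytes, not a real video)
inline_block = make_video_block(video_bytes=b"\x00" * 16)
# Hypothetical bucket and key
s3_block = make_video_block(s3_uri="s3://example-bucket/video/aws-ads-rainy-day.mp4")
print(list(inline_block["video"]["source"]), list(s3_block["video"]["source"]))
```

Either block can take the place of the `{"video": ...}` entry in the `content` list of the requests that follow.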

In [None]:
display(Video(video_path, embed=True, width=800, height=450,
              html_attributes="controls autoplay loop"))

## 1. Video Summarization

Nova 2 can generate summaries of video content. Per the best practices guide:
- Set `temperature: 0`
- Place the user text prompt after the video content
- Clearly specify the aspects of the video you care about in the prompt
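
These three recommendations can be captured once in a small request builder so that every call applies them consistently. This is only a sketch: `build_video_request` is a hypothetical helper, not part of boto3, and the model ID and video bytes below are placeholders. Its return value is meant to be passed as `bedrock.converse(**request)`.

```python
def build_video_request(model_id: str, video_bytes: bytes, prompt: str,
                        max_tokens: int = 512) -> dict:
    """Return keyword arguments for bedrock.converse() (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must describe what to extract from the video")
    return {
        "modelId": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Video content comes first ...
                    {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                    # ... followed by the text prompt
                    {"text": prompt},
                ],
            }
        ],
        # temperature 0, as recommended above
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": 0},
    }


request = build_video_request(
    model_id="example-model-id",   # placeholder; the notebook uses MODEL_ID
    video_bytes=b"placeholder",    # placeholder; the notebook uses video_bytes
    prompt="Summarize the video, focusing on the main events and the setting.",
)
content = request["messages"][0]["content"]
print([next(iter(block)) for block in content])  # ['video', 'text']
```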

In [None]:
response = bedrock.converse(
    modelId=MODEL_ID,
    messages=[
        {
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": "Could you provide a summary of the video, focusing on its key points?"}
            ]
        }
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0}
)

display(Markdown(response["output"]["message"]["content"][0]["text"]))
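
Indexing `content[0]["text"]` assumes that the first content block is text and that the reply was not cut off by `maxTokens`. A slightly more defensive sketch joins every text block and checks the `stopReason` field of the Converse response. The helper name `extract_text` and the sample response dictionary are made up for illustration.

```python
def extract_text(response: dict) -> str:
    """Join all text blocks of a Converse response (hypothetical helper)."""
    blocks = response["output"]["message"]["content"]
    # Skip any block that does not carry a "text" key
    text = "\n".join(block["text"] for block in blocks if "text" in block)
    if response.get("stopReason") == "max_tokens":
        # The reply hit the maxTokens limit, so the text is likely incomplete
        text += "\n\n[truncated: increase maxTokens]"
    return text


# Hand-written sample shaped like a Converse response, not real model output
sample_response = {
    "output": {"message": {"role": "assistant", "content": [
        {"text": "First part."},
        {"text": "Second part."},
    ]}},
    "stopReason": "max_tokens",
}
print(extract_text(sample_response))
```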

## 2. Dense Captioning

Dense captioning generates detailed, scene-by-scene descriptions of the video. This is useful for creating searchable metadata or accessibility descriptions.

In [None]:
response = bedrock.converse(
    modelId=MODEL_ID,
    messages=[
        {
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": "Describe the video scene-by-scene, including details about characters, actions and settings."}
            ]
        }
    ],
    inferenceConfig={"maxTokens": 1024, "temperature": 0}
)

display(Markdown(response["output"]["message"]["content"][0]["text"]))

## 3. Timestamp Extraction

Nova 2 can identify timestamps related to events in a video. Following the best practices for structured extraction, we pass a JSON schema in the user prompt and set `temperature: 0`. The first cell extracts key moments as structured JSON, and the second localizes a specific event using the recommended prompt template.

In [None]:
json_schema = {
    "type": "object",
    "properties": {
        "description": {
            "type": "string",
            "description": "A high-level summary of the entire video content."
        },
        "key_moments": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {
                        "type": "string",
                        "description": "A detailed description of what happens in this moment."
                    },
                    "start_time": {
                        "type": "number",
                        "description": "Start time of the moment in seconds."
                    },
                    "end_time": {
                        "type": "number",
                        "description": "End time of the moment in seconds."
                    }
                },
                "required": ["description", "start_time", "end_time"]
            }
        }
    },
    "required": ["description", "key_moments"]
}

prompt = f"""Segment this video into key moments with timestamps.

Extract information in JSON format according to the given schema.

Follow these guidelines:
- Ensure that every field is populated.
- Provide start_time and end_time in seconds.
- The description field should be a detailed account of what happens in each moment.
- Do not make up events not present in the video.

JSON Schema:
{json.dumps(json_schema, indent=2)}"""

response = bedrock.converse(
    modelId=MODEL_ID,
    messages=[
        {
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": prompt}
            ]
        }
    ],
    inferenceConfig={"maxTokens": 2048, "temperature": 0}
)

result_text = response["output"]["message"]["content"][0]["text"]

# Strip markdown code fences (```json ... ```) if present
cleaned = result_text.strip()
if cleaned.startswith("```"):
    cleaned = cleaned.split("\n", 1)[1]
if cleaned.endswith("```"):
    cleaned = cleaned.rsplit("```", 1)[0]

result_json = json.loads(cleaned.strip())
print(json.dumps(result_json, indent=2))
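
The fence-stripping above works when the reply is either bare JSON or a single fenced block. As a sketch of a more tolerant approach, the helper below takes the outermost `{...}` span from the reply and then checks that each key moment has numeric, ordered timestamps. `parse_key_moments` is a hypothetical name, and the sample reply is hand-written rather than real model output.

```python
import json


def parse_key_moments(reply: str) -> dict:
    """Extract and sanity-check the JSON object in a model reply (hypothetical helper)."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in reply")
    data = json.loads(reply[start:end + 1])

    for key in ("description", "key_moments"):
        if key not in data:
            raise ValueError(f"Missing required field: {key}")

    for i, moment in enumerate(data["key_moments"]):
        for key in ("description", "start_time", "end_time"):
            if key not in moment:
                raise ValueError(f"key_moments[{i}] is missing {key}")
        s, e = moment["start_time"], moment["end_time"]
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in (s, e)):
            raise ValueError(f"key_moments[{i}] has non-numeric timestamps")
        if s < 0 or e < s:
            raise ValueError(f"key_moments[{i}] has an invalid range: {s}-{e}")
    return data


# Hand-written sample reply with leading prose and a markdown code fence
fence = "`" * 3
sample_reply = (
    "Here is the result:\n" + fence + "json\n"
    '{"description": "Sample", "key_moments": ['
    '{"description": "Opening shot", "start_time": 0, "end_time": 4.5}]}\n'
    + fence
)
parsed = parse_key_moments(sample_reply)
print(len(parsed["key_moments"]))  # 1
```

Raising on a reversed or negative range surfaces a bad extraction early, before the timestamps are used to seek in a player.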

### 3.a. Event Localization with Timestamp Extraction

In [None]:
# Localize a specific event and display interactive video clips
event_description = "Watching tablet"

response = bedrock.converse(
    modelId=MODEL_ID,
    messages=[
        {
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": f'Please localize the moment that the event "{event_description}" happens in the video. Answer with the starting and ending time of the event in seconds, such as [[72, 82]]. If the event happens multiple times, list all of them, such as [[40, 50], [72, 82]].'}
            ]
        }
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0}
)

result_text = response["output"]["message"]["content"][0]["text"]
print(f"Raw response: {result_text}")

# Parse [[start, end], ...] timestamps from the response.
# Only numeric pairs are matched, so stray bracketed text cannot raise ValueError.
number = r'\d+(?:\.\d+)?'
timestamps = [
    (float(start), float(end))
    for start, end in re.findall(rf'\[\s*({number})\s*,\s*({number})\s*\]', result_text)
]

print(f"Detected {len(timestamps)} clip(s) for: {event_description}")

# Build an inline video player with jump-to buttons using base64 embedded video
video_b64 = base64.b64encode(video_bytes).decode('utf-8')
video_id = f"videoPlayer{uuid.uuid4().hex[:6]}"

buttons_html = ''.join([
    f'<button onclick="jumpTo{video_id}({start})" style="margin:4px;padding:6px 12px;cursor:pointer;">'
    f'{start}s - {end}s</button>'
    for start, end in timestamps
])

html = f"""
<video id="{video_id}" width="640" controls muted>
  <source src="data:video/mp4;base64,{video_b64}" type="video/mp4">
</video>
<div style="margin-top:10px;">
  <strong>Jump to \"{event_description}\":</strong><br>
  {buttons_html}
</div>
<script>
  function jumpTo{video_id}(time) {{
    var video = document.getElementById('{video_id}');
    video.currentTime = time;
    video.play();
  }}
</script>
"""

display(HTML(html))
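
Before building buttons, it can help to normalize the parsed intervals: drop pairs where the end precedes the start, sort them, merge overlaps, and show labels as `MM:SS`. The sketch below does this with plain Python. `normalize_intervals` and `format_mmss` are hypothetical helpers, and the input list is made up rather than taken from a model response.

```python
def normalize_intervals(intervals):
    """Drop invalid pairs, sort by start time, and merge overlapping intervals."""
    valid = sorted((s, e) for s, e in intervals if 0 <= s <= e)
    merged = []
    for s, e in valid:
        if merged and s <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def format_mmss(seconds: float) -> str:
    """Format a time in seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


# Made-up intervals: one reversed pair and two overlapping pairs
raw = [(72.0, 82.0), (40.0, 50.0), (90.0, 85.0), (48.0, 55.0)]
clean = normalize_intervals(raw)
labels = [f"{format_mmss(s)} - {format_mmss(e)}" for s, e in clean]
print(clean)   # [(40.0, 55.0), (72.0, 82.0)]
print(labels)  # ['00:40 - 00:55', '01:12 - 01:22']
```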

## Conclusion

In this notebook, we demonstrated Amazon Nova 2 Lite's video understanding capabilities using the Converse API:

1. **Summarization** — concise overview of video content
2. **Dense Captioning** — detailed scene-by-scene descriptions
3. **Timestamp Extraction** — structured JSON output with key moments and event localization

All exercises follow the [Nova 2 multimodal prompting best practices](https://docs.aws.amazon.com/nova/latest/nova2-userguide/prompting-multimodal.html): `temperature: 0`, text prompt after video content, and task instructions in the user prompt.