# TwelveLabs Marengo on Amazon Bedrock Workshop

TwelveLabs is a leading provider of multimodal AI models specializing in video understanding and analysis. TwelveLabs' advanced models enable sophisticated video search, analysis, and content generation capabilities through state-of-the-art computer vision and natural language processing technologies. [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html) now offers two TwelveLabs models: [TwelveLabs Pegasus 1.2](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-pegasus.html), which provides comprehensive video understanding and analysis, and [TwelveLabs Marengo Embed 2.7](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-marengo.html), which generates high-quality embeddings for video, text, audio, and image content. These models empower developers to build applications that can intelligently process, analyze, and derive insights from video data at scale.

In this notebook, we'll use the TwelveLabs Marengo model to generate embeddings for text, images, and videos, enabling multimodal search and analysis across media types.

### TwelveLabs Video Understanding Models
TwelveLabs’ video understanding models consist of a family of deep neural networks built on our multimodal foundation model for video understanding that you can use for the following downstream tasks:
- Search using natural language queries
- Analyze videos to generate text

Videos contain multiple types of information, including visuals, sounds, spoken words, and text. The human brain combines all of this information, and the relationships between its parts, to comprehend the overall meaning of a scene. For example, imagine you’re watching a video of a person jumping and clapping, both visual cues, with the sound muted. You might realize they’re happy, but you can’t understand why without the sound. Once the sound is unmuted, you realize they’re cheering for a soccer team that just scored a goal.

Thus, an application that analyzes a single type of information can’t provide a comprehensive understanding of a video. TwelveLabs’ video understanding models, however, analyze and combine information from all the modalities to accurately interpret the meaning of a video holistically, similar to how humans watch, listen, and read simultaneously to understand videos.

Our video understanding models can identify, analyze, and interpret a variety of elements, including but not limited to the following:

| Element | Modality | Example |
|---------|----------|---------|
| People, including famous individuals | Visual | Michael Jordan, Steve Jobs |
| Actions | Visual | Running, dancing, kickboxing |
| Objects | Visual | Cars, computers, stadiums |
| Animals or pets | Visual | Monkeys, cats, horses |
| Nature | Visual | Mountains, lakes, forests |
| Text displayed on the screen (OCR) | Visual | License plates, handwritten words, number on a player's jersey |
| Brand logos | Visual | Nike, Starbucks, Mercedes |
| Shot techniques and effects | Visual | Aerial shots, slow motion, time-lapse |
| Counting objects | Visual | Number of people in a crowd, items on a shelf, vehicles in traffic |
| Sounds | Audio | Chirping (birds), applause, fireworks popping or exploding |
| Human speech | Audio | "Good morning. How may I help you?" |
| Music | Audio | Ominous music, whistling, lyrics |

### Modalities
Modalities represent the types of information that the models process and analyze in a video. These modalities are central to both indexing and searching video content.

The models support the following modalities: 

- **Visual**: Analyzes visual content in a video, including actions, objects, events, text (through Optical Character Recognition, or OCR), and brand logos.
- **Audio**: Analyzes audio content in a video, including ambient sounds, music, and human speech.

## Part 0: Setup

In [None]:
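# Make sure you download the latest botocore and boto3 libraries.
import shutil
import subprocess
import sys

def ensure_uv_installed():
    """Install the 'uv' package manager with pip if it is not already on PATH."""
    if shutil.which("uv") is None:
        print("🔧 'uv' not found. Installing with pip...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "uv"])
    else:
        print("✅ 'uv' is already installed.")

def uv_install(*packages):
    """Install the given packages into the current environment using uv."""
    ensure_uv_installed()
    # Fall back to `python -m uv` if the uv executable is still not on PATH.
    uv_path = shutil.which("uv")
    uv_command = [uv_path] if uv_path else [sys.executable, "-m", "uv"]
    print(f"📦 Installing {', '.join(packages)} using uv...")
    subprocess.check_call([*uv_command, "pip", "install", *packages])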

### Dependencies

In [None]:
%uv pip install -r requirements.txt

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display, Image
from sklearn.metrics.pairwise import cosine_similarity

## Part 1: Multimodal Embeddings with Marengo

### Part 1a: What is an embedding?

Use TwelveLabs Marengo to create multimodal embeddings for videos, texts, images, and audio files. These embeddings are contextual vector representations (a series of numbers) that capture interactions between modalities, such as visual expressions, body language, spoken words, and video context. You can apply these embeddings to downstream tasks like training custom multimodal models for anomaly detection, diversity sorting, sentiment analysis, recommendations, or building Retrieval-Augmented Generation (RAG) systems.

Key features:
- **Native multimodal support**: Process all modalities natively without separate models or frame conversion.
- **State-of-the-art performance**: Captures motion and temporal information for accurate video interpretation.
- **Unified vector space**: Combines embeddings from different modalities for holistic understanding.
- **Fast and reliable**: Reduces processing time for large video sets.
- **Flexible segmentation**: Generate embeddings for video segments or the entire video.

Use cases:
- **Anomaly detection**: Identify unusual patterns, such as corrupt videos with black backgrounds, to improve data set quality.
- **Diversity sorting**: Organize data for broad representation, reducing bias and improving AI model training.
- **Sentiment analysis**: Combine vocal tone, facial expressions, and spoken language for accurate insights, which is particularly useful for customer service.
- **Recommendations**: Use embeddings in similarity-based retrieval and ranking systems for recommendations.

To learn more about embeddings, check out [The Multimodal Evolution of Vector Embeddings](https://www.twelvelabs.io/blog/multimodal-embeddings) on the TwelveLabs Blog!

In [None]:
# Sample embeddings
sample_embedding_1 = np.random.rand(1, 1024)
sample_embedding_2 = np.random.rand(1, 1024)

df_embedding_1 = pd.DataFrame(sample_embedding_1)
df_embedding_2 = pd.DataFrame(sample_embedding_2)

df_embedding_1


In [None]:
# Sample video embedding
sample_video_embedding = np.random.rand(5, 1024)
df_video_embedding = pd.DataFrame(sample_video_embedding)
df_video_embedding

### Part 1b: Calculating cosine similarity

Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them in high-dimensional space. Unlike distance metrics that consider magnitude, cosine similarity focuses purely on the orientation or direction of vectors, making it particularly useful for comparing text embeddings, documents, and other high-dimensional data.

The multimodal vector embeddings from TwelveLabs Marengo can be used to calculate the similarity across text, image, audio, and video.

***Formula***

The cosine similarity between two vectors **A** and **B** is calculated as:

```
cos(θ) = (A · B) / (||A|| × ||B||)
```

Where:
- **A · B** is the dot product of vectors A and B
- **||A||** and **||B||** are the magnitudes (norms) of vectors A and B respectively
- **θ** is the angle between the two vectors

***Key Characteristics***
- **Range**: Values range from -1 to 1
  - **1**: Identical direction (perfect similarity)
  - **0**: Orthogonal vectors (no similarity)
  - **-1**: Opposite directions (perfect dissimilarity)
- **Magnitude Independence**: Only considers vector direction, not size
- **Symmetric**: cos(A,B) = cos(B,A)
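
The formula and the characteristics listed above can be checked directly with NumPy. This is a minimal sketch; the helper name `cosine_sim` and the example vectors are illustrative and are not part of the workshop's `utils` module.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors: (a · b) / (||a|| × ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a, twice the magnitude
c = np.array([-1.0, -2.0, -3.0])  # opposite direction to a
d = np.array([3.0, 0.0, -1.0])    # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

print(f"same direction:     {cosine_sim(a, b):.4f}")
print(f"opposite direction: {cosine_sim(a, c):.4f}")
print(f"orthogonal:         {cosine_sim(a, d):.4f}")
print(f"symmetric:          {cosine_sim(a, d) == cosine_sim(d, a)}")
```

For real embeddings the scores fall somewhere between these extremes, and what counts as a "good" score depends on the model and the data.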

***Benefits***
- **Scale Invariant**: Ideal for comparing vectors of different magnitudes
- **Computationally Efficient**: Fast calculation, especially with sparse vectors
- **Robust for Text Analysis**: Perfect for document similarity and text embeddings
- **Handles High Dimensions**: Remains a practical and widely used measure for comparing vectors in high-dimensional embedding spaces
- **Intuitive Results**: Easy to interpret similarity scores between 0 and 1 for most applications

***Drawbacks***
- **Ignores Magnitude**: Completely disregards vector size, which may contain important information
- **Limited with Negative Values**: Can be less meaningful when dealing with vectors containing negative components
- **Not Always Intuitive**: May not align with human perception of similarity in certain domains
- **Loses Information**: Discarding magnitude means losing potentially valuable signal strength data
- **Poor for Sparse Positive Data**: May not distinguish well between vectors with very few non-zero elements
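
To make the magnitude trade-off concrete, the sketch below compares cosine similarity with Euclidean distance for a vector and a scaled copy of itself. The vectors are made up for illustration. Cosine similarity treats the two as identical, while Euclidean distance reports them as far apart.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([0.2, 0.5, 0.1])
scaled = 10.0 * v  # same direction, ten times the magnitude

cos = cosine_sim(v, scaled)
euclid = float(np.linalg.norm(v - scaled))

print(f"cosine similarity:  {cos:.4f}")     # direction only
print(f"euclidean distance: {euclid:.4f}")  # sensitive to magnitude
```

If magnitude carries meaning in your data, one option is to keep the vector norm as a separate feature alongside the cosine score.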

In [None]:
# Cosine similarity between two single segment embeddings
similarity = cosine_similarity(df_embedding_1, df_embedding_2)
pd.DataFrame(similarity)

In [None]:
# Cosine similarity with a multi-segment embedding
similarities = cosine_similarity(df_video_embedding, df_embedding_1)
pd.DataFrame(similarities)

In [None]:
# Getting the max similarity and the index of the max similarity
max_similarity = np.max(similarities)
max_similarity_index = np.argmax(similarities)

print(f"Max similarity: {max_similarity}")
print(f"Index of max similarity: {max_similarity_index}")
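
Taking the single best match extends naturally to a top-k ranking, which is what a search result list needs. The sketch below uses random stand-in embeddings with the same shapes as the samples above. It normalizes each row to unit length so that a plain dot product equals the cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
segments = rng.random((5, 1024))  # stand-in for 5 video segment embeddings
query = rng.random((1, 1024))     # stand-in for one query embedding

# Normalize rows to unit length; the dot product of unit vectors is the cosine similarity.
segments_unit = segments / np.linalg.norm(segments, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query, axis=1, keepdims=True)
scores = (segments_unit @ query_unit.T).ravel()  # shape (5,)

top_k = 3
top_indices = np.argsort(scores)[::-1][:top_k]  # best match first
for rank, idx in enumerate(top_indices, start=1):
    print(f"{rank}. segment {idx}: {scores[idx]:.4f}")
```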

---
## Part 2: Building Multimodal Video Search


In [None]:
from utils import BedrockTwelvelabsHelper, play_video, delete_s3_bucket_objects
import boto3

In [None]:
bedrock_client = boto3.client("bedrock-runtime")
s3_client = boto3.client("s3")
aws_account_id = boto3.client('sts').get_caller_identity()["Account"]
model_id = "twelvelabs.marengo-embed-2-7-v1:0"
cris_model_id = "us.twelvelabs.marengo-embed-2-7-v1:0"
s3_bucket_name = '<an S3 bucket for storing the outputs>'

bedrock_twelvelabs_helper = BedrockTwelvelabsHelper(bedrock_client=bedrock_client, 
                s3_client=s3_client, 
                aws_account_id=aws_account_id, 
                model_id=model_id, 
                cris_model_id=cris_model_id, 
                s3_bucket_name=s3_bucket_name)

### Part 2a: Storing videos in S3

#### Netflix Open Content

[Netflix Open Content](https://opencontent.netflix.com/) is a collection of open source content available under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode).

The assets are available for download at: http://download.opencontent.netflix.com/

We will use a subset of the videos to demonstrate how to use the TwelveLabs models on Amazon Bedrock.

#### Download a Sample Video and Upload to S3 as Input
We'll use the TwelveLabs Marengo model to generate embeddings from this video and perform content-based search.

![Meridian](./assets/images/sample-video-meridian.png)
We will use an open-source sample video, [Meridian](https://en.wikipedia.org/wiki/Meridian_(film)), as input to generate embeddings.

In [None]:
# Download a sample video to local disk
sample_name = 'NetflixMeridian.mp4'
source_url = 'https://ws-assets-prod-iad-r-pdx-f3b3f9f1a7d6a3d0.s3.us-west-2.amazonaws.com/335119c4-e170-43ad-b55c-76fa6bc33719/NetflixMeridian.mp4'
!curl {source_url} --output {sample_name}

In [None]:
# Upload to S3
s3_video_output_path = bedrock_twelvelabs_helper.upload_video(sample_name)

### Part 2b: Creating vector embeddings with Marengo on Bedrock

#### TwelveLabs Marengo

Marengo is an embedding model for comprehensive video understanding. Marengo analyzes multiple modalities in video content, including visuals, audio, and text, to provide a holistic understanding similar to human comprehension.

***Key features***
- **Multimodal processing:** Combines visual, audio, and text elements for comprehensive understanding
- **Fine-grained search:** Detects brand logos, text, and small objects (as small as 10% of the video frame)
- **Motion search:** Identifies and analyzes movement within videos
- **Counting capabilities:** Accurately counts objects in video frames
- **Audio comprehension:** Analyzes music, lyrics, sound, and silence

***Use cases***
- **Search:** Use natural language queries to find specific content within videos
- **Embeddings:** Create video embeddings for various downstream applications

#### Marengo Embed 2.7 on Bedrock

Marengo Embed 2.7 is a multimodal embedding model that generates high-quality vector representations of video, text, audio, and image content for similarity search, clustering, and other machine learning tasks. The model supports multiple input modalities and provides specialized embeddings optimized for different use cases.

The model supports asynchronous inference through the [StartAsyncInvoke API](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_StartAsyncInvoke.html).

- Provider: TwelveLabs
- Categories: Embeddings, multimodal
- Model ID: `twelvelabs.marengo-embed-2-7-v1:0`
- Input modality: Video, Text, Audio, Image
- Output modality: Embeddings
- Max video size: up to 2 hours long (file size under 2 GB)

**Resources:**
- [AWS Docs: TwelveLabs Marengo Embed 2.7](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-marengo.html)
- [TwelveLabs Docs: Marengo](https://docs.twelvelabs.io/v1.3/docs/concepts/models/marengo)
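
To give a sense of what an asynchronous invocation looks like underneath the helper class, here is a rough sketch using the boto3 `bedrock-runtime` operations `start_async_invoke` and `get_async_invoke`. The function names `build_video_request` and `run_async_embedding` are hypothetical. The `modelInput` field names follow the AWS documentation for Marengo Embed 2.7 as best understood here, so verify them against the linked docs before relying on them. The workshop's `BedrockTwelvelabsHelper` may be implemented differently.

```python
import time

def build_video_request(model_id, video_s3_uri, bucket_owner, output_s3_uri):
    """Build keyword arguments for bedrock-runtime start_async_invoke (assumed schema)."""
    return {
        "modelId": model_id,
        "modelInput": {
            "inputType": "video",
            "mediaSource": {
                "s3Location": {"uri": video_s3_uri, "bucketOwner": bucket_owner}
            },
        },
        # Results are written under this S3 location when the job completes.
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    }

def run_async_embedding(client, request, poll_seconds=10):
    """Start the job, then poll until it leaves the InProgress state."""
    invocation_arn = client.start_async_invoke(**request)["invocationArn"]
    while True:
        job = client.get_async_invoke(invocationArn=invocation_arn)
        if job["status"] != "InProgress":
            return job  # status is expected to be Completed or Failed
        time.sleep(poll_seconds)
```

With the variables defined earlier in this notebook, a call could look like `run_async_embedding(bedrock_client, build_video_request(model_id, s3_video_output_path, aws_account_id, f"s3://{s3_bucket_name}/embeddings/"))`, assuming `s3_video_output_path` is an `s3://` URI.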


In [None]:
# Example: Create text embedding
text_query = "two people having a conversation in a car"
text_embedding_data = bedrock_twelvelabs_helper.create_text_embedding(text_query)
print(f"✅ Text embedding created successfully with {len(text_embedding_data)} segment(s) and {len(text_embedding_data[0]['embedding'])} dimensions.")

Creating an image embedding with Marengo

In [None]:
# Choose image
image_path = "images/meridian-image-search.jpg"
image_embedding_data = bedrock_twelvelabs_helper.create_image_embedding(image_path)
print(f"✅ Image embedding created successfully with {len(image_embedding_data)} segment(s)")

Creating video embeddings with Marengo

In [None]:
video_embedding_data, video_id = bedrock_twelvelabs_helper.create_video_embedding(s3_video_output_path)
print(f"✅ Video embedding created successfully with {len(video_embedding_data)} segment(s)")

In [None]:
[x for x in video_embedding_data if x["embeddingOption"] == "visual-image"][0]
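
Each video segment carries an `embeddingOption`, so a search can be restricted to one option before ranking. The sketch below does this in memory with brute-force cosine similarity over synthetic data. The `embedding` and `embeddingOption` keys and the `"visual-image"` value appear in this notebook. The `startSec` and `endSec` keys and the other option values are assumptions about the model output, and `rank_segments` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic stand-in for video_embedding_data (4 dimensions for readability).
segments = [
    {"embeddingOption": option, "startSec": start, "endSec": start + 6.0,
     "embedding": rng.random(4).tolist()}
    for option in ("visual-text", "visual-image", "audio")
    for start in (0.0, 6.0, 12.0)
]

def rank_segments(segments, query, option, top_k=2):
    """Return the top_k (score, segment) pairs for one embedding option."""
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)
    scored = []
    for seg in segments:
        if seg["embeddingOption"] != option:
            continue
        v = np.asarray(seg["embedding"], dtype=float)
        scored.append((float(v @ q / np.linalg.norm(v)), seg))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return scored[:top_k]

query = segments[4]["embedding"]  # reuse a visual-image segment as the query
for score, seg in rank_segments(segments, query, "visual-image"):
    print(f"{seg['startSec']:>5.1f}s to {seg['endSec']:>5.1f}s  score={score:.4f}")
```

Brute force is fine for a single video. A vector index becomes worthwhile once the number of segments grows.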

### Part 2c: Creating a vector index in OpenSearch Serverless
[OpenSearch Serverless](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html) makes it easy to create vector indexes for storing and searching embeddings from text, images, audio and videos. 

As a vector database, OpenSearch Serverless lets you quickly find similar content using semantic search without managing servers or infrastructure. In the following section, we'll use the Python API to set up an OpenSearch Serverless collection and index, load embeddings from the TwelveLabs Marengo model into our vector store, and run semantic queries to find the most relevant content in our dataset.
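
For orientation, the sketch below shows the general shape of a k-NN index definition and a k-NN query in OpenSearch. It only builds the request bodies as Python dictionaries. The field names (`embedding`, `video_id`, `start_time`, `end_time`) and function names are assumptions for illustration, and the helper class may use a different mapping. Check which `engine` and `space_type` combinations your OpenSearch Serverless collection supports before using it.

```python
import json

def build_knn_index_body(dimension=1024, vector_field="embedding"):
    """Index settings and mapping for a vector field (hypothetical field names)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    # HNSW graph with cosine similarity; engine left at its default.
                    "method": {"name": "hnsw", "space_type": "cosinesimil"},
                },
                "video_id": {"type": "keyword"},
                "start_time": {"type": "float"},
                "end_time": {"type": "float"},
            }
        },
    }

def build_knn_query(vector, top_k=3, vector_field="embedding"):
    """Query body that returns the top_k nearest neighbors of `vector`."""
    return {
        "size": top_k,
        "query": {"knn": {vector_field: {"vector": list(vector), "k": top_k}}},
    }

print(json.dumps(build_knn_index_body(dimension=4), indent=2))
print(json.dumps(build_knn_query([0.1, 0.2, 0.3, 0.4], top_k=2), indent=2))
```

With the `opensearch-py` client, bodies like these would typically be passed to `client.indices.create(index=..., body=...)` and `client.search(index=..., body=...)`.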

Create an OpenSearch Serverless collection

In [None]:
collection_name_prefix = "twelvelabs-marengo-blog"
bedrock_twelvelabs_helper.create_opensearch_serverless_collection(collection_name_prefix)

Set up an index for the OpenSearch Serverless collection

In [None]:
index_name_prefix="twelvelabs-marengo-blog-index"
index_name = bedrock_twelvelabs_helper.create_opensearch_index(index_name_prefix)

Ingest the video embeddings created by the Marengo model into the index

In [None]:
bedrock_twelvelabs_helper.index_video_embeddings(video_embedding_data, video_id)

Perform text semantic search on the video embeddings

In [None]:
text_query = "a person smoking in a room"
text_search_results = bedrock_twelvelabs_helper.search_videos_by_text(text_query, top_k=3)


Visualize the top results

In [None]:
# View top result
top_text_result = text_search_results[0]
video_url = bedrock_twelvelabs_helper.find_video_from_embedding(top_text_result)


In [None]:
top_text_result

In [None]:
start_time = top_text_result["start_time"]
play_video(video_url, start_time)

Perform image semantic search on the video embeddings

In [None]:
image_query = "images/meridian-image-search.jpg"
display(Image(filename=image_query, width=700))

In [None]:
image_search_results = bedrock_twelvelabs_helper.search_videos_by_image(image_path=image_query, top_k=3)

Visualize the top results

In [None]:
top_image_result = image_search_results[0]
video_url = bedrock_twelvelabs_helper.find_video_from_embedding(top_image_result)

In [None]:
top_image_result

In [None]:
# Play the video
play_video(video_url, top_image_result["start_time"])

---
## Cleanup


#### Delete OpenSearch Serverless Index

In [None]:
# Delete OpenSearch index
# Read the index name before the try block so it is always defined in the error message.
index_name = bedrock_twelvelabs_helper.index_name
try:
    response = bedrock_twelvelabs_helper.opensearch_client.indices.delete(
        index=index_name
    )
    print(f"Index '{index_name}' deletion response: {response}")
except Exception as e:
    print(f"Error deleting index '{index_name}': {e}")

In [None]:
# Delete the OpenSearch Serverless collection
aoss_client = boto3.client('opensearchserverless')
collection_id = bedrock_twelvelabs_helper.opensearch_serverless_collection_name["createCollectionDetail"]["id"]
response = aoss_client.delete_collection(id=collection_id)


#### Empty S3 bucket content

In [None]:
# Remove the objects this notebook created under each prefix
bucket_name = bedrock_twelvelabs_helper.s3_bucket_name
delete_s3_bucket_objects(s3_client, bucket_name, "images")
delete_s3_bucket_objects(s3_client, bucket_name, "videos")
delete_s3_bucket_objects(s3_client, bucket_name, "embeddings")
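
The `delete_s3_bucket_objects` helper comes from the workshop's `utils` module and its implementation is not shown here. As a rough idea of what deleting everything under a prefix involves, the sketch below uses the boto3 S3 `list_objects_v2` paginator and `delete_objects`. The function name `delete_prefix` is hypothetical, and the sketch does not handle versioned buckets.

```python
def delete_prefix(s3_client, bucket, prefix):
    """Delete every object under a prefix; returns the number of keys deleted."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    # Each page holds at most 1,000 keys, which matches the delete_objects limit.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted
```

Calling `delete_prefix(s3_client, bucket_name, "videos")` would remove the uploaded videos and report how many objects were deleted.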