## Video and Audio Content Analysis with Amazon Bedrock and Amazon Aurora PostgreSQL pgvector

This notebook demonstrates how to process video and audio content using Amazon Bedrock with the [Amazon Titan Multimodal Embeddings G1 model](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-multiemb-models.html) to generate embeddings and store them in an [Amazon Aurora PostgreSQL](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/data-api.html) database with pgvector for similarity search capabilities.

> Create the Amazon Aurora PostgreSQL cluster with this [AWS CDK stack](../create-audio-video-embeddings/02-aurora-pg-vector/README.md)


![Diagram](data/video-embedding.png)

## Use Cases

This notebook is useful for:
- Video and audio content analysis
- Semantic search in multimedia content
- Implementing multimodal RAG systems
- Processing and analyzing transcriptions



## System Architecture

The system uses:
- **Data Storage**: Amazon S3 for video/audio files and extracted frames
- **Database**: Amazon Aurora PostgreSQL with pgvector extension for storing embeddings
- **AI Services**: Amazon Bedrock with [Amazon Titan Multimodal Embeddings G1 model](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-multiemb-models.html) for generating embeddings
- **Configuration**: AWS Systems Manager Parameter Store for secure configuration

![Diagram](data/diagram_video.png)

## 1. Initial Setup

### Library Imports
The notebook begins by importing the necessary libraries:
- `boto3`: [AWS SDK for Python](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingTheBotoAPI.html)
- Standard libraries: `json`, `os`, `base64`, `datetime`, `uuid`, etc.
- Custom utilities for transcript processing

### AWS Client Configuration
AWS service clients are configured:
- [Configure AWS credentials](https://docs.aws.amazon.com/braket/latest/developerguide/braket-using-boto3.html)
- `s3`: Amazon S3 client for storage operations
- `ssm`: [AWS Systems Manager Parameter Store](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html) for configuration
- `bedrock_runtime`: Amazon Bedrock Runtime client for embedding generation
- `transcribe_client`: Amazon Transcribe client for audio transcription


## Processing Flow

The pipeline processes videos through these steps:

1. A video file is processed using Python code in a Jupyter notebook, utilizing the boto3 SDK to interact with AWS services.

2. The audio stream is extracted and sent to Amazon Transcribe for speech-to-text conversion.

3. Simultaneously, the video is processed to extract key frames, which are stored in an Amazon S3 bucket.

4. The extracted frames are processed through Amazon Bedrock's Titan embedding model to generate multimodal vectors that represent the visual content.

5. Finally, all the processed data (transcriptions, frame data, and vectors) is stored in Amazon Aurora Serverless PostgreSQL with the pgvector extension, which enables vector-based searches through RDS Data API calls.

In [None]:
#!pip install boto3 pillow langchain-core
# json, base64, and uuid are part of the Python standard library and do not need to be installed
# or install requirements.txt

In [1]:
import boto3
import json
import os
from PIL import Image as PILImage
import random

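_region_name = "us-west-2"
# Note: the Parameter Store client targets us-east-1, while the other services use _region_name
ssm = boto3.client(service_name="ssm", region_name="us-east-1")
# Bedrock Runtime client, used later by answer() to call the Converse API
bedrock_runtime = boto3.client(service_name="bedrock-runtime", region_name=_region_name)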

# Default model settings
default_model_id = os.environ.get("DEFAULT_MODEL_ID", "amazon.titan-embed-image-v1")
default_embedding_dimmesion = os.environ.get("DEFAULT_EMBEDDING_DIMENSION", "1024")

## 2. Database Interface (AuroraPostgres Class)

An `AuroraPostgres` class is defined that includes:
- `execute_statement()`: Handles database connections and queries [using RDS Data API](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/data-api.html)
- `insert()`: Implements functions to insert embedding data into the database
- `similarity_search()`: Provides similarity search using cosine similarity and L2 distance methods

Code: [aurora_service.py](create_audio_video_helper/aurora_service.py)
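
As a rough illustration of how a similarity search over pgvector could be issued through the RDS Data API, here is a minimal sketch. It is not the code in `aurora_service.py`: the function names, the `SELECT *` projection, and the `:query` parameter name are assumptions made for this example. The `embedding` column name is taken from the rows this notebook prints, and `<=>` (cosine distance) and `<->` (L2 distance) are the standard pgvector operators.

```python
from typing import Any, Dict, List

# pgvector operators: <=> is cosine distance, <-> is Euclidean (L2) distance
OPERATORS = {"cosine": "<=>", "l2": "<->"}


def build_similarity_query(table_name: str, how: str = "cosine", k: int = 5) -> str:
    """Build a nearest-neighbor query (hypothetical helper, not the notebook's)."""
    if how not in OPERATORS:
        raise ValueError(f"Unsupported method: {how}")
    distance = f"embedding {OPERATORS[how]} CAST(:query AS vector)"
    # Cosine similarity is 1 minus cosine distance; L2 is reported as a raw distance
    score = f"1 - ({distance}) AS similarity" if how == "cosine" else f"{distance} AS distance"
    # table_name is interpolated directly, so it must come from a trusted source
    return f"SELECT *, {score} FROM {table_name} ORDER BY {distance} LIMIT {int(k)}"


def vector_parameter(vector: List[float]) -> List[Dict[str, Any]]:
    """Encode the query vector as a pgvector text literal, e.g. '[0.1,0.2]'."""
    literal = "[" + ",".join(str(value) for value in vector) + "]"
    return [{"name": "query", "value": {"stringValue": literal}}]


def similarity_search(client, cluster_arn, secret_arn, database, table_name,
                      vector, how="cosine", k=5):
    """Run the query; `client` is assumed to be boto3.client("rds-data")."""
    return client.execute_statement(
        resourceArn=cluster_arn,
        secretArn=secret_arn,
        database=database,
        sql=build_similarity_query(table_name, how, k),
        parameters=vector_parameter(vector),
        formatRecordsAs="JSON",  # results arrive as a JSON string in "formattedRecords"
    )
```

Passing the vector as a named parameter rather than formatting it into the SQL string keeps the statement short and avoids quoting issues.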

In [2]:
from create_audio_video_helper import AuroraPostgres

## 3. Video Content Processing

A `VideoProcessor` class that uses [ffmpeg](https://ffmpeg.org/) is implemented with these methods:
- `extract_sec_number()`: Extracts second numbers from file paths for frame ordering
- `ffmpeg_check()`: Checks if ffmpeg is installed on the system
- `run_ffmpeg_command()`: Runs ffmpeg commands with proper error handling
- `extract_frames()`: Extracts video frames at regular intervals using ffmpeg

Code: [video_processor.py](create_audio_video_helper/video_processor.py)
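
To make the frame extraction step more concrete, the sketch below builds an ffmpeg command that samples one frame every `every` seconds and recovers the second number from a frame file name. It is an illustration under assumptions, not the contents of `video_processor.py`: the helper name `build_extract_command` is hypothetical, and the `sec_%05d.jpg` naming pattern is inferred from file names such as `sec_00823.jpg` that appear in this notebook's output.

```python
import re
from typing import List


def build_extract_command(video_path: str, output_dir: str, every: int = 1) -> List[str]:
    """Build an ffmpeg command that writes one JPEG every `every` seconds."""
    if every < 1:
        raise ValueError("every must be a positive number of seconds")
    # The fps filter takes a rate, so one frame every N seconds is fps=1/N
    return ["ffmpeg", "-i", video_path, "-vf", f"fps=1/{every}", f"{output_dir}/sec_%05d.jpg"]


def extract_sec_number(file_path: str) -> int:
    """Return the numeric part of a frame file name such as sec_00823.jpg."""
    match = re.search(r"sec_(\d+)\.jpg$", file_path)
    if match is None:
        raise ValueError(f"Not a frame file: {file_path}")
    return int(match.group(1))


# The command can be executed with, for example:
#   subprocess.run(build_extract_command(...), check=True, capture_output=True)
```

Sorting the extracted files with `key=extract_sec_number` keeps them in chronological order even if the directory listing is not.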

In [3]:
from create_audio_video_helper import VideoProcessor

## 4. Video Download and Processing

A `VideoManager` class is defined that includes:
- `parse_location()`: Parses S3 URIs into bucket, prefix, filename components
- `read_json_from_s3()`: Reads JSON files from S3 buckets
- `download_file()`: Downloads files from S3 to local storage
- `upload_file()`: Uploads local files to S3
- `read_image_from_s3()`: Reads image files directly from S3
- `read_image_from_local()`: Reads image files from local storage

Code: [video_manager.py](create_audio_video_helper/video_manager.py)
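
For readers who want to see what splitting an S3 URI into its components can look like, here is a small sketch. The function name `parse_s3_uri` is hypothetical, and the order of the returned values (bucket, prefix, file name without extension, extension, full file name) is an assumption based on how `parse_location()` is unpacked later in this notebook. Whether the real method returns the extension with or without the leading dot is not shown in the source, so this sketch omits the dot.

```python
from pathlib import PurePosixPath
from typing import Tuple
from urllib.parse import urlparse


def parse_s3_uri(s3_uri: str) -> Tuple[str, str, str, str, str]:
    """Split s3://bucket/prefix/name.ext into (bucket, prefix, name, ext, name.ext)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri}")
    key = PurePosixPath(parsed.path.lstrip("/"))
    # An object at the bucket root has no prefix; PurePosixPath reports "." in that case
    prefix = "" if str(key.parent) == "." else str(key.parent)
    return parsed.netloc, prefix, key.stem, key.suffix.lstrip("."), key.name
```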

In [4]:
from create_audio_video_helper import VideoManager


## 5. Audio Processing with Amazon Transcribe

Implements functions for:
- `transcribe()`: Starts transcription jobs with Amazon Transcribe
- `wait_transcription_complete()`: Waits for job completion with status polling
- `get_transcribe_result_data()`: Gets transcription job status and results
- `process_part()`: Processes individual parts of the transcript
- `process_segments()`: Processes transcript segments with timing information
- `combine_by_seconds()`: Combines transcript segments by time
- `combine_transcrip_segments_by_speaker()`: Combines segments by speaker for better readability
- `process_transcript()`: Main function to process the complete transcript

Code: [audio_processing.py](create_audio_video_helper/audio_processing.py)
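
The status polling mentioned above can be sketched as follows. This is an assumption-laden illustration rather than the helper's actual code: the function name `wait_for_transcript` and its `poll_seconds` and `timeout` parameters are hypothetical, and `client` is assumed to be `boto3.client("transcribe")`. The response fields used (`TranscriptionJobStatus`, `Transcript.TranscriptFileUri`, `FailureReason`) are the ones returned by the Amazon Transcribe `get_transcription_job` call.

```python
import time


def wait_for_transcript(client, job_name: str, poll_seconds: float = 5.0,
                        timeout: float = 3600.0) -> str:
    """Poll a transcription job until it finishes and return the transcript URI."""
    deadline = time.monotonic() + timeout
    while True:
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job["Transcript"]["TranscriptFileUri"]
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "Transcription job failed"))
        # Status is QUEUED or IN_PROGRESS: give up only after the deadline passes
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_name} did not finish within {timeout} seconds")
        time.sleep(poll_seconds)
```

Because the job runs asynchronously, the notebook can start it first and extract frames while Amazon Transcribe works, then wait for the result only when the text is needed.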

In [5]:
from create_audio_video_helper import AudioProcessing

## 6. Embedding Generation

Includes functions for:
- `get_image_embeddings()`: Generates embeddings for images using Amazon Bedrock
- `get_images_embeddings()`: Processes multiple images with progress bars
- `get_embeddings()`: Generic function that handles both text and image embedding generation
- `get_text_embeddings()`: Generates embeddings for text using Amazon Bedrock
- `create_text_embeddings()`: Creates structured embedding records for transcribed text
- `create_frames_embeddings()`: Creates structured embedding records for video frames
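
As a hedged sketch of what a request to the Titan Multimodal Embeddings G1 model can look like, the code below builds the JSON body and invokes the model. The helper names `build_titan_request` and `embed` are hypothetical and are not the methods of `EmbeddingGeneration`; `client` is assumed to be `boto3.client("bedrock-runtime")`. The body fields (`inputText`, `inputImage`, `embeddingConfig.outputEmbeddingLength`) follow the model's documented request format.

```python
import base64
import json
from typing import List, Optional

VALID_DIMENSIONS = (256, 384, 1024)  # output lengths supported by the model


def build_titan_request(text: Optional[str] = None, image_bytes: Optional[bytes] = None,
                        dimension: int = 1024) -> str:
    """Build the JSON request body for text, an image, or both."""
    if text is None and image_bytes is None:
        raise ValueError("Provide text, image_bytes, or both")
    if dimension not in VALID_DIMENSIONS:
        raise ValueError(f"dimension must be one of {VALID_DIMENSIONS}")
    body = {"embeddingConfig": {"outputEmbeddingLength": dimension}}
    if text is not None:
        body["inputText"] = text
    if image_bytes is not None:
        # Images are sent as base64-encoded strings
        body["inputImage"] = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps(body)


def embed(client, body: str, model_id: str = "amazon.titan-embed-image-v1") -> List[float]:
    """Invoke the model and return the embedding vector from the response."""
    response = client.invoke_model(
        modelId=model_id,
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]
```

Because text and images are embedded by the same model into the same vector space, a text query can retrieve frames and an image query can retrieve transcript segments.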


In [6]:
from create_audio_video_helper import EmbeddingGeneration

## 7. Select Key Frames

We select the frames where the similarity to the preceding frame drops below a threshold, which indicates a visual change.

Code: [compare_frames.py](create_audio_video_helper/compare_frames.py)
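
The selection idea can be sketched in a few lines of plain Python. This is one possible reading, not the implementation in `compare_frames.py`: the function names are hypothetical, and the sketch assumes each frame is compared with the most recently kept frame and that the first frame is always kept. It returns frame indices, which matches how the selected frames are later used to index the list of extracted files.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_key_frames(embeddings: Sequence[Sequence[float]], threshold: float = 0.8) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame."""
    if not embeddings:
        return []
    selected = [0]  # assumption: the first frame is always kept
    for index in range(1, len(embeddings)):
        last_kept = embeddings[selected[-1]]
        # A similarity below the threshold is treated as a visual change
        if cosine_similarity(last_kept, embeddings[index]) < threshold:
            selected.append(index)
    return selected
```

A lower threshold keeps fewer frames, since only larger visual changes fall below it.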


In [7]:
from create_audio_video_helper import CompareFrames

### Main Processing Flow

The complete workflow includes:
1. Downloading the video file from S3 using `download_file()`
2. Verifying ffmpeg installation with `ffmpeg_check()`
3. Starting a transcription job with `transcribe()`
4. Extracting frames from the video using `extract_frames()`
5. Generating embeddings for the extracted frames with `get_images_embeddings()`
6. Filtering relevant frames based on similarity using `filter_relevant_frames()`
7. Processing the transcription results with `process_transcript()`
8. Creating embeddings for the transcribed text and selected frames using `create_text_embeddings()` and `create_frames_embeddings()`
9. Inserting the embeddings into the Aurora PostgreSQL database with `aurora.insert()`


### Configuration
The system uses environment variables and AWS Systems Manager Parameter Store for configuration:

**DEFAULT_MODEL_ID:** Bedrock model ID (default: "amazon.titan-embed-image-v1")

**DEFAULT_EMBEDDING_DIMENSION:** Embedding dimension (default: "1024")

### SSM Parameters:

```
/videopgvector/cluster_arn
```

```
/videopgvector/secret_arn
```

```
/videopgvector/video_table_name
```

In [8]:

def get_ssm_parameter(name):
    response = ssm.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]



In [9]:
# Get configuration values from AWS Systems Manager Parameter Store. Never share secrets!

cluster_arn = get_ssm_parameter("/videopgvector/cluster_arn")
credentials_arn = get_ssm_parameter("/videopgvector/secret_arn")
table_name = get_ssm_parameter("/videopgvector/video_table_name")
default_database_name = "kbdata"

In [None]:
# Initialize Aurora PostgreSQL client
aurora = AuroraPostgres(cluster_arn, default_database_name, credentials_arn,_region_name)

In [14]:
# Verify Aurora cluster connectivity:
aurora.execute_statement("select count(*) from bedrock_integration.knowledge_bases")

{'numberOfRecordsUpdated': 0, 'formattedRecords': '[{"count":125}]'}

## Upload the Video to an Amazon S3 Bucket and Obtain the S3 URI

This code shows how to upload a video from the `tmp` folder to an S3 bucket and obtain the S3 URI needed for further processing.

In [14]:
from pathlib import Path

def upload_video_to_s3(video_path, bucket_name, s3_key=None):
    """
    Upload a video file to an S3 bucket and return the S3 URI.
    
    Parameters:
    -----------
    video_path : str
        Local path to the video file
    bucket_name : str
        Name of the S3 bucket
    s3_key : str, optional
        The S3 key (path) where the video will be stored. If not provided,
        the filename from video_path will be used.
        
    Returns:
    --------
    str
        The S3 URI of the uploaded video (s3://bucket-name/key)
    """
    # Check if the file exists
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video file not found: {video_path}")
    
    # If s3_key is not provided, use the filename from video_path
    if s3_key is None:
        s3_key = Path(video_path).name
    
    # Initialize S3 client
    s3_client = boto3.client('s3')
    
    try:
        # Upload the file
        print(f"Uploading {video_path} to s3://{bucket_name}/{s3_key}...")
        s3_client.upload_file(video_path, bucket_name, s3_key)
        print("Upload successful!")
        
        # Construct and return the S3 URI
        s3_uri = f"s3://{bucket_name}/{s3_key}"
        return s3_uri
    
    except Exception as e:
        print(f"Error uploading file to S3: {e}")
        raise

In [None]:
# Configure the parameters
video_path = "tmp/video.mp4"  # Path to the video in the tmp folder
bucket_name = "your-bucket-1234"  # Name of your S3 bucket


# You can also specify a custom path in S3 (optional)
s3_key = "videos/sample_video.mp4"

# Upload the video and get the S3 URI
s3_uri = upload_video_to_s3(video_path, bucket_name, s3_key)
print(f"S3 URI: {s3_uri}")

FileNotFoundError: Video file not found: tmp/video.mp4

In [15]:
# Download the file
# Create directory if it doesn't exist

tmp_path                    = "./tmp"

#s3_uri = "s3://your-bucket-1234/videos/your-video.mp4"

s3_uri = "s3://embeddings-demo-1234/videos/DEV315.mp4"


In [16]:
videomanager = VideoManager(s3_uri,_region_name)

bucket, prefix, fileName, extension, file  = videomanager.parse_location(s3_uri)

local_path              = f"{tmp_path}/{file}"
location                = f"{prefix}/{file}"
output_dir              = f"{tmp_path}/{fileName}"


os.makedirs(os.path.dirname(local_path), exist_ok=True)
print(f"downloading s3://{bucket}/{prefix}/{file} to {local_path}")
result = videomanager.download_file(bucket, location, local_path)

downloading s3://embeddings-demo-1234/videos/DEV315.mp4 to ./tmp/DEV315.mp4
File already exists


In [17]:
# Verify ffmpeg is installed 
videoprocessor = VideoProcessor()
videoprocessor.ffmpeg_check() ## Check if ffmpeg is installed

ffmpeg: ffmpeg version 7.1.1 Copyright (c) 2000-2025 the FFmpeg developers
built with Apple clang version 16.0.0 (clang-1600.0.26.6)
configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/7.1.1_1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libs

True

## Process the media file
This part involves:
1. Transcribing the audio to text using Amazon Transcribe
2. Extracting frames using ffmpeg

This notebook assumes you have a valid media file in Amazon S3, referenced by `s3_uri` (for example, s3://path/to/video).

#### ✅ Start Amazon Transcribe Job

In [18]:
audio_processing = AudioProcessing(_region_name,videomanager)

In [19]:
job_name = audio_processing.transcribe(s3_uri)

Transcription job DEV315.mp4-250409-115704 started...


#### ✅  Extract Key Frames with ffmpeg and Amazon Bedrock with Titan Multimodal Embeddings Model

In [20]:
files = videoprocessor.extract_frames(local_path, output_dir, every=1) # 1 frame per second

processing frames...
code: 0 stdout:  stderr: frame=  253 fps=0.0 q=7.1 size=N/A time=00:04:13.00 bitrate=N/A speed= 502x    
frame=  512 fps=510 q=8.9 size=N/A time=00:08:32.00 bitrate=N/A speed= 510x    
frame=  749 fps=497 q=4.7 size=N/A time=00:12:29.00 bitrate=N/A speed= 497x    
frame=  951 fps=473 q=8.5 size=N/A time=00:15:51.00 bitrate=N/A speed= 473x    
frame= 1215 fps=483 q=3.8 size=N/A time=00:20:15.00 bitrate=N/A speed= 483x    
frame= 1490 fps=494 q=10.4 size=N/A time=00:24:50.00 bitrate=N/A speed= 494x    
frame= 1734 fps=492 q=7.6 size=N/A time=00:28:54.00 bitrate=N/A speed= 492x    
frame= 2015 fps=501 q=9.0 size=N/A time=00:33:35.00 bitrate=N/A speed= 501x    
frame= 2097 fps=499 q=7.2 Lsize=N/A time=00:34:57.00 bitrate=N/A speed= 499x    

done processing frames => 2097


## Create Text and Image embeddings 

In [21]:
embedding_generation = EmbeddingGeneration(videomanager,_region_name,default_model_id,default_embedding_dimmesion)

In [None]:
# calculate embeddings for all extracted frames (1 per second)
embed_1024 = embedding_generation.get_images_embeddings(files)

Creating embeddings:  39%|███▉      | 823/2097 [04:20<06:26,  3.29img/s, file=sec_00823.jpg]

In [None]:
compareframes = CompareFrames()

In [None]:
# Keep only frames that differ from the previous one, comparing cosine similarity sequentially
selected_frames = compareframes.filter_relevant_frames(embed_1024, difference_threshold=0.8)  # a frame is skipped if it is similar to the previous one

print(f"from {len(embed_1024)} frames to {len(selected_frames)} relevant frames:")


#### ✅  Check the transcription Job and process text results

In [None]:
#job_name = "XXXX"  # For existing jobs, put the job name here
transcript_url = audio_processing.wait_transcription_complete(job_name)

In [None]:
transcripts, duration = audio_processing.process_transcript(transcript_url, max_chars_per_segment=1000)
print (f"Duration:{duration}s")
for seg, speaker, text in transcripts[:2]:
    print (f"sec: {seg}\n{speaker}:\n   {text}\n\n")

In [None]:
selected_frames_files = [(sf, files[sf]) for sf in selected_frames]
selected_frames_files

In [None]:
text_embeddings = embedding_generation.create_text_embeddings(transcripts, transcript_url)
frames_embeddings = embedding_generation.create_frames_embeddings(selected_frames_files, s3_uri)

In [None]:
print ("Text Embeddings:\n")
for te in text_embeddings:
    print(f"Chunk:{te.get('chunks')[:50]}, embedding(3): {te.get('embedding')[:3]}, metadata: {te.get('metadata')} ")


print ("\nImage Embeddings:\n")
for fe in frames_embeddings:
    print(f"Source:{fe.get('source')}, embedding(3): {fe.get('embedding')[:3]}, metadata: {fe.get('metadata')} ")


## Insert to Vector Database Aurora PostgreSQL (pgvector)

In [None]:
aurora.execute_statement("select count(*) from bedrock_integration.knowledge_bases")

In [None]:
# Optionally clean the table
#aurora.execute_statement("delete from bedrock_integration.knowledge_bases")
aurora.execute_statement("select count(*) from bedrock_integration.knowledge_bases")

In [None]:
# Insert text embeddings into Aurora PostgreSQL
if text_embeddings:
    aurora.insert(text_embeddings)
    print(f"Inserted {len(text_embeddings)} text embeddings")



In [None]:
# Insert frame embeddings into Aurora PostgreSQL
if frames_embeddings:
    aurora.insert(frames_embeddings)
    print(f"Inserted {len(frames_embeddings)} frame embeddings")


aurora.execute_statement("select count(*) from bedrock_integration.knowledge_bases")

## Similarity Search

Implements functions for:
- `retrieve()`: Performs similarity searches in the database and displays results
- `aurora.similarity_search()`: Executes the vector similarity search in the database
- `get_embeddings()`: Generates embeddings for the search query

In [None]:
from IPython.display import display

def retrieve(search_query, how="cosine", k=5):
    search_vector = embedding_generation.get_embeddings(search_query)
    
    result = aurora.similarity_search(search_vector,how=how, k=k)
    rows = json.loads(result.get("formattedRecords"))
    for row in rows:
        metric = "similarity" if how == "cosine" else "distance"
        metric_value = row.get(metric)
        if row.get("content_type") == "text":
            print (f"text:\n{row.get('chunks')}\n{metric}:{metric_value}\nmetadata:{row.get('metadata')}\n")
        if row.get("content_type") == "image":
            img = PILImage.open(row.get('source'))            
            print (f"Image:\n{row.get('source')}\n{metric}:{metric_value}\nmetadata:{row.get('metadata')}\n")
            display(img)
        # Drop bulky fields before returning; pop() avoids a KeyError if a column is absent
        row.pop("embedding", None)
        row.pop("id", None)

    return rows

In [None]:
search_query = "elizabeth fuentes guillermo ruiz"
docs = retrieve(search_query, how="cosine", k=10)

### Search using images

In [None]:
one_image = random.choice(files)
print(one_image)
display(PILImage.open(one_image))

In [None]:
docs = retrieve(videomanager.read_image_from_local(one_image), how="cosine", k=3)

## RAG Implementation

Finally, the notebook implements a complete RAG system:
- `CustomMultimodalRetriever`: A custom retriever class that extends BaseRetriever
- `_get_relevant_documents()`: Core retrieval method that finds similar content
- `image_content_block()`: Formats image content for LLM consumption
- `text_content_block()`: Formats text content for LLM consumption
- `parse_docs_for_context()`: Processes documents for context (text and images)
- `answer()`: Uses an LLM to answer questions based on retrieved content

> Based on https://github.com/langchain-ai/langchain/blob/master/docs/docs/how_to/custom_retriever.ipynb



In [None]:
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

class CustomMultimodalRetriever(BaseRetriever):
    """A retriever that returns the top k documents most similar to the user query.
    The query can be text or image bytes.
    """
    k: int
    """Number of top results to return"""
    how: str
    """How to calculate the similarity between the query and the documents."""

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Sync implementations for retriever."""
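        # get_embeddings() is a method of the EmbeddingGeneration instance created earlier
        search_vector = embedding_generation.get_embeddings(query)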
        result = aurora.similarity_search(search_vector, how=self.how, k=self.k)
        rows = json.loads(result.get("formattedRecords"))

        matching_documents = []

        for row in rows:
            document_kwargs = dict(
                metadata=dict(**json.loads(row.get("metadata")), content_type = row.get("content_type"), source=row.get("sourceurl")))
            
            if self.how == "cosine":
                document_kwargs["similarity"] = row.get("similarity")
            elif self.how == "l2":
                document_kwargs["distance"] = row.get("distance")

            if row.get("content_type") == "text":
                matching_documents.append( Document( page_content=row.get("chunks"), **document_kwargs ))
            if row.get("content_type") == "image":
                matching_documents.append( Document( page_content=row.get("source"),**document_kwargs ))

        return matching_documents

In [None]:
retriever = CustomMultimodalRetriever(how="cosine", k=4)

In [None]:
query = "elizabeth?"
docs = retriever.invoke(query)
docs

## Building the RAG

In [None]:
from typing import List, Dict

budget_tokens = 0
max_tokens = 1024
conversation: List[Dict] = []
reasoning_config = {"thinking": {"type": "enabled", "budget_tokens": budget_tokens}}

In [None]:
def image_content_block(image_file):
    # read_image_from_local() is a method of the VideoManager instance created earlier
    image_bytes = videomanager.read_image_from_local(image_file)
    extension = image_file.split('.')[-1]
    print (f"Including Image :{image_file}")
    if extension == 'jpg':
        extension = 'jpeg'
    
    block = { "image": { "format": extension, "source": { "bytes": image_bytes}}}
    return block

def text_content_block(text):
    return { "text": text }

def parse_docs_for_context(docs):
    blocks = []
    for doc in docs:
        if doc.metadata.get('content_type') == "image":
            blocks.append(image_content_block(doc.page_content))
        else:
            blocks.append(text_content_block(doc.page_content))
    return blocks

In [None]:
def answer(model_id, system_prompt, content) -> list:
    """Get a completion from a Bedrock model through the Converse API.

    Returns:
        list: Content blocks of the model's reply (text blocks carry a "text" key)
    """

    # Invoke model

    kwargs = dict(
        modelId=model_id,
        inferenceConfig=dict(maxTokens=max_tokens),
        messages=[
            {
                "role": "user",
                "content": content,
            }
        ],


    )

    kwargs["system"] = [{"text": system_prompt}]

    response = bedrock_runtime.converse(**kwargs)
    
    return response.get("output",{}).get("message",{}).get("content", [])
    


In [None]:
parsed_docs = parse_docs_for_context(docs)

In [None]:

system_prompt = """Answer the user's questions based on the below context. If the context has an image, indicate that it can be reviewed for further feedback.
If the context doesn't contain any relevant information to the question, don't make something up and just say "I don't know". (IF YOU MAKE SOMETHING UP BY YOUR OWN YOU WILL BE FIRED). For each statement in your response provide a [n] where n is the document number that provides the response. """
model_id = "us.amazon.nova-pro-v1:0"


In [None]:
query = "que dicen sobre los agentes de bedrock?"  # Spanish for "What do they say about Bedrock agents?"
docs = retriever.invoke(query)
parsed_docs = parse_docs_for_context(docs)

In [None]:
llm_response = answer(model_id,system_prompt,[text_content_block(f"question:{query}\n\nDocs:\n"), *parsed_docs])

In [None]:
print(llm_response[0].get("text"))

In [None]:
query = "donde se dice elizabeth?"  # Spanish for "Where is Elizabeth mentioned?"
docs = retriever.invoke(query)
parsed_docs = parse_docs_for_context(docs)
llm_response = answer(model_id,system_prompt,[text_content_block(f"question:{query}\n\nDocs:\n"), *parsed_docs])
print(llm_response[0].get("text"))