## Video and Audio Content Analysis with Amazon Bedrock and Amazon Aurora PostgreSQL pgvector

This notebook demonstrates how to process video and audio content using [Amazon Bedrock](https://aws.amazon.com/bedrock/) to invoke the [Amazon Titan Multimodal Embeddings G1 model](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-multiemb-models.html) for generating multimodal embeddings, [Amazon Transcribe](https://aws.amazon.com/transcribe/) for converting speech to text, and [Amazon Aurora PostgreSQL](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/data-api.html) with pgvector for efficient vector storage and similarity search. You will build an app that understands both visual and audio content, enabling natural language queries to find specific moments in videos.

> Create the Amazon Aurora PostgreSQL cluster with this [AWS CDK stack](../create-audio-video-embeddings/02-aurora-pg-vector/README.md)

![Diagram](data/video-embedding.png)


### Processing Flow

The pipeline processes videos through these steps:

1. A video file is processed using Python code in a Jupyter notebook, utilizing the boto3 SDK to interact with AWS services.

2. The audio stream is extracted and sent to Amazon Transcribe for speech-to-text conversion.

3. Simultaneously, the video is processed to extract key frames, which are stored in an Amazon S3 bucket.

4. The extracted frames are processed through Amazon Bedrock's Titan embedding model to generate multimodal vectors that represent the visual content.

5. Finally, all the processed data (transcriptions, frame data, and vectors) is stored in Amazon Aurora Serverless PostgreSQL with the pgvector extension, enabling vector-based searches through RDS Data API calls.

![Diagram](data/diagram_video.png)

### 💰 Cost to complete: 
- [Amazon Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/)
- [Amazon S3 Pricing](https://aws.amazon.com/s3/pricing/)
- [Amazon Aurora Pricing](https://aws.amazon.com/rds/aurora/pricing/)
- [Amazon Transcribe Pricing](https://aws.amazon.com/transcribe/pricing/)

### Configuration
- [AWS SDK for Python ](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingTheBotoAPI.html)
- [Configure AWS credentials](https://docs.aws.amazon.com/braket/latest/developerguide/braket-using-boto3.html) 



In [None]:
#!pip install boto3 pillow langchain-core
# json, base64, and uuid are part of the Python standard library and need no install
# or install requirements.txt

In [None]:
import boto3
import json
import os
from PIL import Image as PILImage
import random

_region_name = "us-west-2"
ssm = boto3.client(service_name="ssm", region_name=_region_name)

# Default model settings
default_model_id = os.environ.get("DEFAULT_MODEL_ID", "amazon.titan-embed-image-v1")
default_embedding_dimmesion = os.environ.get("DEFAULT_EMBEDDING_DIMENSION", "1024")

## 2. Database Interface (AuroraPostgres Class)

The `AuroraPostgres` class interacts with Amazon Aurora PostgreSQL through the [RDS Data API](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/data-api.html).

Code: [aurora_service.py](create_audio_video_helper/aurora_service.py)

In [None]:
from create_audio_video_helper import AuroraPostgres

## 3. Video Content Processing

The `VideoProcessor` class uses [ffmpeg](https://ffmpeg.org/) and its libavcodec library to process the audio and create frames.

The class extracts one frame per second by default; you can change this by modifying the FPS value in the ffmpeg command.

Code: [video_processor.py](create_audio_video_helper/video_processor.py)
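
To make the FPS setting concrete, here is a minimal sketch of how an ffmpeg command for frame extraction could be assembled. The function name and the `frame_%04d.jpg` output pattern are assumptions for illustration; the actual `VideoProcessor` may build its command differently.

```python
import shlex
from pathlib import Path


def build_frame_extraction_command(video_path: str, output_dir: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg command that writes `fps` frames per second as JPEG files."""
    # Hypothetical naming scheme: frame_0001.jpg, frame_0002.jpg, ...
    output_pattern = str(Path(output_dir) / "frame_%04d.jpg")
    return [
        "ffmpeg",
        "-i", video_path,        # input video file
        "-vf", f"fps={fps:g}",   # video filter: sample this many frames per second
        output_pattern,
    ]


command = build_frame_extraction_command("tmp/video.mp4", "tmp/video", fps=1)
# Print the command; pass the list to subprocess.run(command, check=True) to execute it.
print(shlex.join(command))
```

Raising `fps` captures faster scene changes but produces more frames to embed, which increases Amazon Bedrock usage.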

In [None]:
from create_audio_video_helper import VideoProcessor

## 4. Video Download and Processing

Code: [video_manager.py](create_audio_video_helper/video_manager.py)

In [None]:
from create_audio_video_helper import VideoManager


## 5. Audio Processing with Amazon Transcribe

The `AudioProcessing` class transcribes the audio track of the video file using the [Amazon Transcribe StartTranscriptionJob API](https://docs.aws.amazon.com/transcribe/latest/APIReference/API_StartTranscriptionJob.html), converting speech into text transcripts.
With `IdentifyMultipleLanguages` set to `True`, Transcribe uses [Amazon Comprehend](https://aws.amazon.com/comprehend/) to identify the languages in the audio. If you know the language of your media file, specify it using the `LanguageCode` parameter instead.

Setting `ShowSpeakerLabels` to `True` enables speaker partitioning (diarization) in the transcription output, which labels the speech from individual speakers in the media file. `MaxSpeakerLabels` specifies the maximum number of speakers, in this case 10.

Code: [audio_processing.py](create_audio_video_helper/audio_processing.py)
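
As a rough illustration of the parameters described above, the sketch below assembles a `StartTranscriptionJob` request. The helper name, job name, and bucket are hypothetical; the real `AudioProcessing` class may set additional options.

```python
from typing import Optional


def build_transcription_request(job_name: str, media_uri: str,
                                language_code: Optional[str] = None,
                                max_speakers: int = 10) -> dict:
    """Assemble keyword arguments for Amazon Transcribe's StartTranscriptionJob."""
    request = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "Settings": {
            "ShowSpeakerLabels": True,         # enable speaker diarization
            "MaxSpeakerLabels": max_speakers,  # upper bound on distinct speakers
        },
    }
    if language_code:
        # Known language: pass it explicitly
        request["LanguageCode"] = language_code
    else:
        # Unknown or mixed languages: let Transcribe identify them
        request["IdentifyMultipleLanguages"] = True
    return request


request = build_transcription_request("demo-job", "s3://your-bucket/videos/sample_video.mp4")
print(request)

# To submit the job (requires boto3 and AWS credentials):
# import boto3
# transcribe = boto3.client("transcribe", region_name="us-west-2")
# transcribe.start_transcription_job(**request)
```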

In [None]:
from create_audio_video_helper import AudioProcessing

## 6. Embedding Generation

Generate embeddings for each extracted frame. Embeddings are created with the Amazon Titan Multimodal Embeddings G1 model using the Amazon Bedrock InvokeModel API.

Code: [embedding_generation.py](create_audio_video_helper/embedding_generation.py)
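
The following sketch shows what a request to the Titan Multimodal Embeddings G1 model could look like. The helper names are assumptions, not the `EmbeddingGeneration` API; the request fields (`inputImage`, `inputText`, `embeddingConfig.outputEmbeddingLength`) follow the model's documented request format.

```python
import base64
import json
from typing import Optional


def build_titan_multimodal_body(image_bytes: Optional[bytes] = None,
                                text: Optional[str] = None,
                                dimension: int = 1024) -> str:
    """Build the JSON request body for an image, a text, or both."""
    if image_bytes is None and text is None:
        raise ValueError("Provide image_bytes, text, or both")
    body = {"embeddingConfig": {"outputEmbeddingLength": dimension}}
    if image_bytes is not None:
        # Images are sent as base64-encoded strings
        body["inputImage"] = base64.b64encode(image_bytes).decode("utf-8")
    if text is not None:
        body["inputText"] = text
    return json.dumps(body)


def embed(bedrock_runtime, body: str, model_id: str = "amazon.titan-embed-image-v1") -> list:
    """Invoke the model with a boto3 "bedrock-runtime" client and return the vector."""
    response = bedrock_runtime.invoke_model(
        modelId=model_id,
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]


# Building a body needs no AWS access; calling embed() does.
print(build_titan_multimodal_body(text="a slide about vector databases"))
```

Because text and images share one embedding space, the same function can serve both the frame embeddings and the search query embeddings.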


In [None]:
from create_audio_video_helper import EmbeddingGeneration

## 7. Select Key Frames

The app uses the `CompareFrames` class to identify significant visual changes by detecting when frame similarity falls below a defined threshold, in this case 0.8. This comparison uses cosine similarity, which is the cosine of the angle between two frame vectors. The similarity score ranges from -1 to 1, with higher values indicating greater visual similarity between frames.

Code: [compare_frames.py](create_audio_video_helper/compare_frames.py)
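
The selection logic can be sketched in plain Python as follows. This is an illustration under one assumption: each frame is compared with the last frame that was kept. The `CompareFrames` implementation may instead compare with the immediately preceding frame.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_key_frames(embeddings: Sequence[Sequence[float]], threshold: float = 0.8) -> List[int]:
    """Return the indices of frames that differ enough from the last kept frame."""
    if not embeddings:
        return []
    selected = [0]  # always keep the first frame
    for index in range(1, len(embeddings)):
        similarity = cosine_similarity(embeddings[selected[-1]], embeddings[index])
        if similarity < threshold:  # visual change detected: keep this frame
            selected.append(index)
    return selected


frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
print(select_key_frames(frames, threshold=0.8))
```

A higher threshold keeps more frames (smaller changes count as new scenes); a lower threshold keeps fewer.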


In [None]:
from create_audio_video_helper import CompareFrames

### Configuration
The system uses environment variables and AWS Systems Manager Parameter Store for configuration:

**DEFAULT_MODEL_ID:** Bedrock model ID (default: "amazon.titan-embed-image-v1")

**DEFAULT_EMBEDDING_DIMENSION:** Embedding dimension (default: "1024")

In [None]:

def get_ssm_parameter(name):
    response = ssm.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]



In [None]:
# Get configuration from AWS Systems Manager Parameter Store; never share secrets!

cluster_arn = get_ssm_parameter("/videopgvector/cluster_arn")
credentials_arn = get_ssm_parameter("/videopgvector/secret_arn")
table_name = get_ssm_parameter("/videopgvector/video_table_name")
default_database_name = "kbdata"

In [None]:
# Initialize Aurora PostgreSQL client
aurora = AuroraPostgres(cluster_arn, default_database_name, credentials_arn,_region_name)

In [None]:
# Verify Aurora cluster connectivity:
aurora.execute_statement("select count(*) from bedrock_integration.knowledge_bases")

# Upload Video to Amazon S3 bucket and Obtain s3_uri

This code shows how to upload a video from the `tmp` folder to an S3 bucket and obtain the S3 URI needed for further processing.
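
For reference, an S3 URI has the form `s3://<bucket>/<key>`. The sketch below builds and parses such URIs; the function names are hypothetical stand-ins for what `UploadVideoS3` and `VideoManager.parse_location` do, and the tuple order mirrors how `parse_location` is unpacked later in this notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def build_s3_uri(bucket: str, key: str) -> str:
    """Combine a bucket name and an object key into an S3 URI."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def parse_s3_uri(s3_uri: str):
    """Split an S3 URI into (bucket, prefix, file_name, extension, file)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    key = PurePosixPath(parsed.path.lstrip("/"))
    prefix = "" if str(key.parent) == "." else str(key.parent)
    # Assumption: the extension is returned without its leading dot
    return parsed.netloc, prefix, key.stem, key.suffix.lstrip("."), key.name


uri = build_s3_uri("your-bucket-1234", "videos/sample_video.mp4")
print(uri)
print(parse_s3_uri(uri))

# Uploading itself needs boto3 and AWS credentials, for example:
# import boto3
# boto3.client("s3").upload_file("tmp/video.mp4", "your-bucket-1234", "videos/sample_video.mp4")
```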

In [None]:
from create_audio_video_helper.video_s3_uploader import UploadVideoS3
# Configure the parameters
video_path = "tmp/video.mp4"  # Path to the video in the tmp folder
bucket_name = "your-bucket-1234"    # Name of your S3 bucket
uploadvideo = UploadVideoS3(bucket_name)


In [None]:

# You can also specify a custom path in S3 (optional)
s3_key = "videos/sample_video.mp4"

# Upload the video and get the S3 URI
s3_uri = uploadvideo.upload_video_to_s3(video_path, s3_key)
print(f"S3 URI: {s3_uri}")

In [None]:
# Download the file
# Create directory if it doesn't exist

tmp_path                    = "./tmp"

#s3_uri = "s3://your-bucket-1234/videos/your-video.mp4"


In [None]:
videomanager = VideoManager(s3_uri,_region_name)

bucket, prefix, fileName, extension, file  = videomanager.parse_location(s3_uri)

local_path              = f"{tmp_path}/{file}"
location                = f"{prefix}/{file}"
output_dir              = f"{tmp_path}/{fileName}"


os.makedirs(os.path.dirname(local_path), exist_ok=True)
print(f"downloading s3://{bucket}/{prefix}/{file} to {local_path}")
result = videomanager.download_file(bucket,location, local_path)

In [None]:
# Verify ffmpeg is installed 
videoprocessor = VideoProcessor()
videoprocessor.ffmpeg_check() ## Check if ffmpeg is installed

## Process the media file
This part involves two steps:
1. Extracting frames and generating embeddings for the visual content

![Diagram](data/frames_processing.png)

2. Transcribing the audio to text using Amazon Transcribe

![Diagram](data/audio_processing.png)

This notebook assumes you have a valid media file in s3://path/to/video

#### ✅ Start Amazon Transcribe Job

In [None]:
audio_processing = AudioProcessing(_region_name,videomanager)

In [None]:
job_name = audio_processing.transcribe(s3_uri)

#### ✅  Extract Key Frames with ffmpeg and Amazon Bedrock with Titan Multimodal Embeddings Model

![Diagram](data/extract_frames.png)

In [None]:
files = videoprocessor.extract_frames(local_path, output_dir, every=1) # 1 frame per second

## Create Text and Image embeddings 

![Diagram](data/get_images_embeddings.png)

In [None]:
embedding_generation = EmbeddingGeneration(videomanager,_region_name,default_model_id,default_embedding_dimmesion)

In [None]:
# calculate embeddings for all extracted frames (1 per second)
embed_1024 = embedding_generation.get_images_embeddings(files)

In [None]:
compareframes = CompareFrames()

In [None]:
# Get only different frames by calculating cosine similarity sequentially
selected_frames = compareframes.filter_relevant_frames(embed_1024, difference_threshold=0.8) # a frame is skipped if it is similar to the previous one

print (f"from {len(embed_1024)} frames to {len(selected_frames)} relevant frames:")


#### ✅  Check the transcription Job and process text results
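
To show the idea behind `max_chars_per_segment`, here is a simplified sketch that groups transcribed words into segments, starting a new segment when the speaker changes or the character limit would be exceeded. The input format (a list of `(start_second, speaker, word)` tuples) and the function name are assumptions for illustration; the real `process_transcript` works on the Amazon Transcribe output file.

```python
from typing import List, Tuple

Word = Tuple[int, str, str]     # (start_second, speaker, word)
Segment = Tuple[int, str, str]  # (start_second, speaker, text)


def group_words_into_segments(words: List[Word], max_chars_per_segment: int = 1000) -> List[Segment]:
    """Group consecutive words by speaker, capping each segment's text length."""
    segments: List[Segment] = []
    start, current_speaker, current_words = 0, "", []
    for second, speaker, word in words:
        speaker_changed = bool(current_words) and speaker != current_speaker
        candidate = " ".join(current_words + [word])
        too_long = bool(current_words) and len(candidate) > max_chars_per_segment
        if speaker_changed or too_long:
            # Close the current segment before starting a new one
            segments.append((start, current_speaker, " ".join(current_words)))
            current_words = []
        if not current_words:
            start, current_speaker = second, speaker
        current_words.append(word)
    if current_words:
        segments.append((start, current_speaker, " ".join(current_words)))
    return segments


words = [(0, "spk_0", "hello"), (1, "spk_0", "world"), (2, "spk_1", "hi"), (3, "spk_1", "there")]
for seg, speaker, text in group_words_into_segments(words):
    print(f"sec: {seg} {speaker}: {text}")
```

Keeping the start second with each segment is what later lets a search result point to a specific moment in the video.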

In [None]:
#job_name = "XXXX" # For existing jobs put the job name here
transcript_url = audio_processing.wait_transcription_complete(job_name)

In [None]:
transcripts, duration = audio_processing.process_transcript(transcript_url, max_chars_per_segment=1000)
print (f"Duration:{duration}s")
for seg, speaker, text in transcripts[:2]:
    print (f"sec: {seg}\n{speaker}:\n   {text}\n\n")

In [None]:
selected_frames_files = [(sf, files[sf])for sf in selected_frames]
selected_frames_files

In [None]:
text_embeddings = embedding_generation.create_text_embeddings(transcripts, transcript_url)


The list of embeddings for text should look like this: 

![Diagram](data/images_embeddings.png)

In [None]:
print ("Text Embeddings:\n")
for te in text_embeddings:
    print(f"Chunk:{te.get('chunks')[:50]}, embedding(3): {te.get('embedding')[:3]}, metadata: {te.get('metadata')} ")


In [None]:
frames_embeddings = embedding_generation.create_frames_embeddings(selected_frames_files, s3_uri)

The list of embeddings for image should look like this:

![Diagram](data/text_embedding.png)


In [None]:
print ("\nImage Embeddings:\n")
for fe in frames_embeddings:
    print(f"Source:{fe.get('source')}, embedding(3): {fe.get('embedding')[:3]}, metadata: {fe.get('metadata')} ")


## Insert to Vector Database Aurora PostgreSQL (pgvector)

In [None]:
aurora.execute_statement("select count(*) from bedrock_integration.knowledge_bases")

In [None]:
# Optionally clean the table
aurora.execute_statement("delete from bedrock_integration.knowledge_bases")
aurora.execute_statement("select count(*) from bedrock_integration.knowledge_bases")

In [None]:
# Insert text embeddings into Aurora PostgreSQL
if text_embeddings:
    aurora.insert(text_embeddings)
    print(f"Inserted {len(text_embeddings)} text embeddings")



In [None]:
# Insert frame embeddings into Aurora PostgreSQL
if frames_embeddings:
    aurora.insert(frames_embeddings)
    print(f"Inserted {len(frames_embeddings)} frame embeddings")


aurora.execute_statement("select count(*) from bedrock_integration.knowledge_bases")

## Similarity Search

This section uses the following functions:
- `retrieve()`: Performs similarity searches in the database and displays results
- `aurora.similarity_search()`: Executes the vector similarity search in the database
- `get_embeddings()`: Generates embeddings for the search query
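
As a sketch of what a vector search against pgvector could look like, the code below builds the SQL and the RDS Data API parameters. pgvector provides `<=>` for cosine distance and `<->` for L2 distance. The function names and the selected column list are assumptions based on the fields this notebook reads; `aurora.similarity_search()` may be implemented differently.

```python
from typing import Dict, List, Sequence


def build_similarity_query(table: str = "bedrock_integration.knowledge_bases",
                           how: str = "cosine") -> str:
    """Return a parameterized SQL query for a top-k vector search."""
    vector = "CAST(:query_vector AS vector)"
    if how == "cosine":
        # <=> is cosine distance, so similarity = 1 - distance
        score = f"1 - (embedding <=> {vector}) AS similarity"
        order = f"embedding <=> {vector}"
    elif how == "l2":
        # <-> is Euclidean (L2) distance; smaller means closer
        score = f"embedding <-> {vector} AS distance"
        order = f"embedding <-> {vector}"
    else:
        raise ValueError(f"Unsupported metric: {how}")
    return (
        f"SELECT id, content_type, chunks, source, metadata, {score} "
        f"FROM {table} ORDER BY {order} LIMIT :k"
    )


def build_data_api_parameters(query_vector: Sequence[float], k: int) -> List[Dict]:
    """Format named parameters for the RDS Data API execute_statement call."""
    vector_text = "[" + ",".join(str(value) for value in query_vector) + "]"
    return [
        {"name": "query_vector", "value": {"stringValue": vector_text}},
        {"name": "k", "value": {"longValue": k}},
    ]


print(build_similarity_query(how="cosine"))
print(build_data_api_parameters([0.1, 0.2, 0.3], k=5))

# Running the query needs boto3, AWS credentials, and the cluster and secret ARNs:
# rds_data = boto3.client("rds-data")
# rds_data.execute_statement(resourceArn=cluster_arn, secretArn=credentials_arn,
#                            database="kbdata", sql=sql, parameters=params,
#                            formatRecordsAs="JSON")
```

With `formatRecordsAs="JSON"`, the response carries the rows as a JSON string in `formattedRecords`, which is why `retrieve()` calls `json.loads` on it.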

In [None]:
from IPython.display import display

def retrieve(search_query, how="cosine", k=5):
    search_vector = embedding_generation.get_embeddings(search_query)
    
    result = aurora.similarity_search(search_vector,how=how, k=k)
    rows = json.loads(result.get("formattedRecords"))
    for row in rows:
        metric = "similarity" if how == "cosine" else "distance"
        metric_value = row.get(metric)
        if row.get("content_type") == "text":
            print (f"text:\n{row.get('chunks')}\n{metric}:{metric_value}\nmetadata:{row.get('metadata')}\n")
        if row.get("content_type") == "image":
            img = PILImage.open(row.get('source'))            
            print (f"Image:\n{row.get('source')}\n{metric}:{metric_value}\nmetadata:{row.get('metadata')}\n")
            display(img)
        del row["embedding"]
        del row["id"]

    return rows

I tested the notebook with my AWS re:Invent 2024 session [AI self-service support with knowledge retrieval using PostgreSQL](https://www.youtube.com/watch?v=fpi3awGakyg?trk=fccf147c-636d-45bf-bf0a-7ab087d5691a&sc_channel=video). 

When I search for "Aurora", it returns the images and text segments where Aurora is mentioned:

![Diagram](data/cosine.png)

```bash
text:
memory . A place where all the information is stored and can easily be retrievable , and that's where the vector database comes in . This is the the first building block . And a vector database stores and retrieves data in the form of vector embeddeds or mathematical representations . This allows us to find similarities between data rather than relying on the exact keyword match that is what usually happens up to today . This is essential for systems like retrieval ofmented generation or RAC , which combines external knowledge with the AI response to deliver those accurate and context aware response . And by the way , I think yesterday we announced the re-rank API for RAC . So now your rack applications , you can score and it will prioritize those documents that have the most accurate information . So at the end will be even faster and cheaper building rack . We're gonna use Amazon Aurora postgrade SQL with vector support that will give us a scalable and fully managed solution for our AI tasks .
similarity:0.5754164493071239
metadata:{"speaker":"spk_0","second":321}
```


In [None]:
search_query = "bedrock"
docs = retrieve(search_query, how="cosine", k=10)

In [None]:
search_query = "elizabeth"
docs = retrieve(search_query, how="l2", k=10)

### Search using images

In [None]:
one_image = random.choice(files)
print(one_image)
display(PILImage.open(one_image))

In [None]:
docs = retrieve(videomanager.read_image_from_local(one_image), how="cosine", k=3)

## RAG Implementation

Finally, the notebook implements a complete RAG system:
- `CustomMultimodalRetriever`: A custom retriever class that extends BaseRetriever
- `_get_relevant_documents()`: Core retrieval method that finds similar content
- `image_content_block()`: Formats image content for LLM consumption
- `text_content_block()`: Formats text content for LLM consumption
- `parse_docs_for_context()`: Processes documents for context (text and images)
- `answer()`: Uses an LLM to answer questions based on retrieved content

> Based on https://github.com/langchain-ai/langchain/blob/master/docs/docs/how_to/custom_retriever.ipynb
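
The system prompt used later asks the model to cite a document number `[n]` for each statement. One way to make those numbers unambiguous is to place a short text label before each retrieved content block. The helper below is a hypothetical sketch, not part of the notebook's helper package; it works on Converse API content blocks such as `{"text": ...}` and `{"image": ...}`.

```python
from typing import Dict, List


def label_content_blocks(blocks: List[Dict]) -> List[Dict]:
    """Insert a 'Document [n]:' text block before each retrieved content block."""
    labelled: List[Dict] = []
    for number, block in enumerate(blocks, start=1):
        labelled.append({"text": f"Document [{number}]:"})
        labelled.append(block)
    return labelled


blocks = [
    {"text": "A vector database stores embeddings."},
    {"image": {"format": "jpeg", "source": {"bytes": b"..."}}},  # placeholder bytes
]
for block in label_content_blocks(blocks):
    print(list(block.keys()), block.get("text", ""))
```

The labelled list can be passed as the `content` of the user message in place of the unlabelled blocks.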



In [None]:
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

class CustomMultimodalRetriever(BaseRetriever):
    """A retriever that returns the top k documents most similar to the user query.
    The query can be text or image bytes.
    """
    k: int
    """Number of top results to return"""
    how: str
    """How to calculate the similarity between the query and the documents."""

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Sync implementations for retriever."""
        search_vector = embedding_generation.get_embeddings(query)
        result = aurora.similarity_search(search_vector, how=self.how, k=self.k)
        rows = json.loads(result.get("formattedRecords"))

        matching_documents = []

        for row in rows:
            document_kwargs = dict(
                metadata=dict(**json.loads(row.get("metadata")), content_type = row.get("content_type"), source=row.get("sourceurl")))
            
            if self.how == "cosine":
                document_kwargs["similarity"] = row.get("similarity")
            elif self.how == "l2":
                document_kwargs["distance"] = row.get("distance")

            if row.get("content_type") == "text":
                matching_documents.append( Document( page_content=row.get("chunks"), **document_kwargs ))
            if row.get("content_type") == "image":
                matching_documents.append( Document( page_content=row.get("source"),**document_kwargs ))

        return matching_documents

In [None]:
retriever = CustomMultimodalRetriever(how="cosine", k=4)

In [None]:
list(docs)

In [None]:
query = "elizabeth"
docs = retriever.invoke(query)


# Building the RAG 

In [None]:
from typing import List, Dict
bedrock_runtime = boto3.client(service_name="bedrock-runtime", region_name=_region_name)


budget_tokens = 0
max_tokens = 1024
conversation: List[Dict] = []
reasoning_config = {"thinking": {"type": "enabled", "budget_tokens": budget_tokens}}

In [None]:
def image_content_block(image_file):
    image_bytes = videomanager.read_image_from_local(image_file)
    extension = image_file.split('.')[-1]
    print (f"Including Image :{image_file}")
    if extension == 'jpg':
        extension = 'jpeg'
    
    block = { "image": { "format": extension, "source": { "bytes": image_bytes}}}
    return block

def text_content_block(text):
    return { "text": text }

def parse_docs_for_context(docs):
    blocks = []
    for doc in docs:
        if doc.metadata.get('content_type') == "image":
            blocks.append(image_content_block(doc.page_content))
        else:
            blocks.append(text_content_block(doc.page_content))
    return blocks

In [None]:
def answer(model_id, system_prompt, content) -> list:
    """Get a completion from a Bedrock model through the Converse API.

    Returns:
        list: Content blocks of the model's reply; the text is under each block's "text" key
    """

    # Invoke model
    kwargs = dict(
        modelId=model_id,
        inferenceConfig=dict(maxTokens=max_tokens),
        messages=[
            {
                "role": "user",
                "content": content,
            }
        ],

    )

    kwargs["system"] = [{"text": system_prompt}]

    response = bedrock_runtime.converse(**kwargs)
    
    return response.get("output",{}).get("message",{}).get("content", [])
    


In [None]:
parsed_docs = parse_docs_for_context(docs)

In [None]:

system_prompt = """Answer the user's questions based on the below context. If the context has an image, indicate that it can be reviewed for further feedback.
If the context doesn't contain any relevant information to the question, don't make something up and just say "I don't know". (IF YOU MAKE SOMETHING UP BY YOUR OWN YOU WILL BE FIRED). For each statement in your response provide a [n] where n is the document number that provides the response. """
model_id = "us.amazon.nova-pro-v1:0"


In [None]:
query = "What is the session about?"
docs = retriever.invoke(query)
parsed_docs = parse_docs_for_context(docs)

In [None]:
docs

In [None]:
llm_response = answer(model_id,system_prompt,[text_content_block(f"question:{query}\n\nDocs:\n"), *parsed_docs])

In [None]:
print(llm_response[0].get("text"))

In [None]:
# Query in Spanish ("Where is Elizabeth mentioned?") to try a non-English question
query = "donde se dice elizabeth?"
docs = retriever.invoke(query)
parsed_docs = parse_docs_for_context(docs)
llm_response = answer(model_id,system_prompt,[text_content_block(f"question:{query}\n\nDocs:\n"), *parsed_docs])
print(llm_response[0].get("text"))