# Building a Multimodal Retrieval-Augmented Generation (RAG) Application with Amazon Bedrock Data Automation

# Introduction

This notebook demonstrates how to build a Multimodal Retrieval-Augmented Generation (RAG) application using Amazon Bedrock Data Automation (BDA) and Bedrock Knowledge Bases (KB). With the latest integration between BDA and Amazon Bedrock Knowledge Bases, you can specify BDA as the parser for the data source of a Bedrock Knowledge Base.

## Key Features

- Amazon Bedrock Data Automation (BDA): A managed service that automatically extracts content from multimodal data. BDA streamlines the generation of valuable insights from unstructured multimodal content such as documents, images, audio, and videos through a unified multi-modal inference API.
  
- Bedrock KB to build a RAG solution with BDA: Amazon Bedrock KB extracts multimodal content using BDA, generates semantic embeddings with the selected embedding model, and stores them in the chosen vector store. This enables users to retrieve and generate answers to questions derived not only from text but also from image data. Retrieved results also include source attribution for visual data, which improves transparency and builds trust in the generated outputs.

## Prerequisites
Please make sure to enable model access for `Anthropic Claude 3 Sonnet`, `Amazon Nova Micro`, and `Titan Text Embeddings V2` in the Amazon Bedrock console.

You need suitable IAM role permissions to run this notebook. For the IAM role, choose either an existing IAM role in your account or create a new role. The role must have the necessary permissions to call the BDA, Bedrock KB, IAM (role creation), SageMaker, and S3 APIs.

Note: The AdministratorAccess IAM policy can be used, if allowed by security policies at your organization.

<div class="alert alert-block alert-info">
<b>Note:</b> Please run the notebook cells one at a time instead of using the "Run All Cells" option.
</div>


# Setup notebook and boto3 clients

In this step, we import the libraries used throughout this notebook. To use Amazon Bedrock Data Automation (BDA) with boto3, make sure a recent version of the AWS SDK for Python (boto3) is installed. Boto3 version 1.35.96 or later is required.

Note: At the time of the Public Preview launch, BDA is available in us-west-2 only.
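
The version requirement above can be checked before any BDA client is created. The following is a minimal sketch; the helper names `version_tuple` and `meets_minimum` are hypothetical and are not part of boto3 or of this notebook's `utils` folder.

```python
import re

MIN_BOTO3 = "1.35.96"  # minimum boto3 version stated in the text above


def version_tuple(version: str) -> tuple:
    """Turn '1.35.96' into (1, 35, 96), ignoring any non-numeric suffix."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def meets_minimum(installed: str, required: str = MIN_BOTO3) -> bool:
    """Return True when the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)
```

After the install cell and kernel restart, `meets_minimum(boto3.__version__)` should return `True`; if it does not, the BDA clients may not be available in the installed SDK.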

In [None]:
%pip install --force-reinstall -q -r ./requirements.txt

In [None]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

<div class="alert alert-block alert-info">
<b>Note:</b> In this workshop, a pre-created S3 bucket is used, and the input and output are saved under a folder called "bda" in the default bucket. To use your own bucket, replace the bucket name in the following cell.
</div>

In [None]:
import boto3
from botocore.exceptions import ClientError
import os
import json, uuid
from datetime import datetime
import time
from time import sleep
import pprint
import random
from retrying import retry
from PyPDF2 import PdfReader
from tqdm import tqdm
from pathlib import Path
import tempfile
import io
import base64
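# Import display as a function; a later "import IPython.display as display"
# would shadow it with the module and break display(...) calls.
from IPython.display import JSON, IFrame, Audio, display, clear_output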
import sagemaker
import logging
from utils.knowledge_base import BedrockKnowledgeBase


suffix = random.randrange(200, 900)

sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]

session = sagemaker.Session()
bucket_name = session.default_bucket()

bucket_name_kb = f'kb-bda-multimodal-datasource-{account_id}'


region_name = "us-west-2" # can be removed once BDA is GA and available in other regions.
region = region_name

s3_client = boto3.client('s3', region_name=region_name)

bda_client = boto3.client('bedrock-data-automation', region_name=region_name)
bda_runtime_client = boto3.client('bedrock-data-automation-runtime', region_name=region_name)

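# Use the same region as the BDA and S3 clients so that all resources live together
bedrock_agent_client = boto3.client('bedrock-agent', region_name=region_name)
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime', region_name=region_name)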

logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
region, account_id

# Create S3 bucket for the KB
if region == "us-east-1":
    s3_client.create_bucket(Bucket=bucket_name_kb)
else:
    s3_client.create_bucket(
        Bucket=bucket_name_kb,
        CreateBucketConfiguration={'LocationConstraint': region}
    )

bucket_name_input = f's3://{bucket_name}/bda/input'      # BDA input path
bucket_name_output = f's3://{bucket_name}/bda/output'    # BDA output path


## Create a BDA project
To start a BDA job, you need a BDA project, which organizes both standard and custom output configurations. This project is reusable, allowing you to apply the same configuration to process multiple video/audio files that share the same settings.

In [5]:
project_name= f'bda-workshop-kb-project-{str(uuid.uuid4())[0:4]}'

# delete project if it already exists
projects_existing = [project for project in bda_client.list_data_automation_projects()["projects"] if project["projectName"] == project_name]
if len(projects_existing) >0:
    print(f"Deleting existing project: {projects_existing[0]}")
    bda_client.delete_data_automation_project(projectArn=projects_existing[0]["projectArn"])

In [6]:
response = bda_client.create_data_automation_project(
    projectName=project_name,
    projectDescription='BDA workshop audio sample project',
    projectStage='DEVELOPMENT',
    standardOutputConfiguration={
        "video": {
            "extraction": {
                "category": {
                    "state": "ENABLED",
                    "types": ["CONTENT_MODERATION", "TEXT_DETECTION", "TRANSCRIPT"]
                },
                "boundingBox": {"state": "ENABLED"}
            },
            "generativeField": {
                "state": "ENABLED",
                "types": ["VIDEO_SUMMARY", "SCENE_SUMMARY", "IAB"]
            }
        },
        "audio": {
            "extraction": {
                "category": {
                    "state": "ENABLED", 
                    "types": ["AUDIO_CONTENT_MODERATION", "CHAPTER_CONTENT_MODERATION", "TRANSCRIPT"]
                }
            },
            "generativeField": {
                "state": "ENABLED",
                "types": ["AUDIO_SUMMARY", "CHAPTER_SUMMARY", "IAB"]
            }
        }
    }
)


In [None]:
kb_project_arn = response.get("projectArn")
print("BDA kb project ARN:", kb_project_arn)

In [None]:
# Download sample audio
file_name_audio = 'podcastdemo.mp3'
source_url = f'https://d1xvhy22zmw77y.cloudfront.net/tmp/{file_name_audio}'

!curl {source_url} --output {file_name_audio}

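# Download sample video
# NOTE: this reuses the audio file name, so the "video" steps below would process the MP3 again.
# Set this to the file name of the MP4 sample before running the video cells.
file_name_video = 'podcastdemo.mp3'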
source_url = f'https://d1xvhy22zmw77y.cloudfront.net/tmp/{file_name_video}'

!curl {source_url} --output {file_name_video}

In [None]:
# Upload the audio and video sample files to S3 for BDA processing
from IPython.display import Audio,Video, display

object_name_audio = f'bda/input/{file_name_audio}'

s3_client.upload_file(file_name_audio, bucket_name, object_name_audio)

object_name_video = f'bda/input/{file_name_video}'

s3_client.upload_file(file_name_video, bucket_name, object_name_video)

In [None]:
# Load and play an MP3 file
display(Audio(file_name_audio, autoplay=True))

In [None]:
# Load and play an MP4 file

from IPython.display import HTML
from base64 import b64encode

def play(filename):
    # Embed the video as a base64 data URI so it plays inline in the notebook
    with open(filename, 'rb') as f:
        video = f.read()
    src = 'data:video/mp4;base64,' + b64encode(video).decode()
    html = '<video width=1000 controls autoplay loop><source src="%s" type="video/mp4"></video>' % src
    return HTML(html)

play(file_name_video)

### Start BDA tasks
We will now invoke the BDA API to process the uploaded audio file. You need to provide the BDA project ARN that we created at the beginning of the lab and specify an S3 location where BDA will store the output results.

For a complete API reference for invoking a BDA async task, refer to this [document](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-data-automation-runtime/client/invoke_data_automation_async.html).
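
The audio and video invocations in this section differ only in the input key, so the request can be assembled once. This is a minimal sketch that mirrors the parameters used in the cells of this section; `build_bda_request` is a hypothetical helper, not part of boto3.

```python
def build_bda_request(bucket: str, input_key: str, output_prefix: str,
                      project_arn: str, stage: str = "DEVELOPMENT") -> dict:
    """Assemble keyword arguments for invoke_data_automation_async.

    The structure mirrors the calls made in this notebook; it only builds
    a dictionary and does not contact AWS.
    """
    return {
        "inputConfiguration": {"s3Uri": f"s3://{bucket}/{input_key}"},
        "outputConfiguration": {"s3Uri": f"s3://{bucket}/{output_prefix}"},
        "dataAutomationConfiguration": {
            "dataAutomationArn": project_arn,
            "stage": stage,
        },
    }
```

With the clients from the setup cell, a call would then look like `bda_runtime_client.invoke_data_automation_async(**build_bda_request(bucket_name, object_name_audio, 'bda/output/', kb_project_arn))`.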

In [None]:
# Start BDA task audio

input_name = object_name_audio
output_name = f'bda/output/' 

response_aud = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri':  f"s3://{bucket_name}/{input_name}"},
    outputConfiguration={'s3Uri': f"s3://{bucket_name}/{output_name}"},
    dataAutomationConfiguration={
        'dataAutomationArn': kb_project_arn,
        'stage': 'DEVELOPMENT'
    })
response_aud

In [None]:
invocation_audio_arn = response_aud.get("invocationArn")

print("BDA audio task started:", invocation_audio_arn)


We will repeat the process for the uploaded video file. 

In [None]:
# Start BDA task video

input_name = object_name_video
output_name = f'bda/output/' 

response_vid = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri':  f"s3://{bucket_name}/{object_name_video}"},
    outputConfiguration={'s3Uri': f"s3://{bucket_name}/{output_name}"},
    dataAutomationConfiguration={
        'dataAutomationArn': kb_project_arn,
        'stage': 'DEVELOPMENT'
    })
response_vid

In [None]:
invocation_video_arn = response_vid.get("invocationArn")

print("BDA video task started:", invocation_video_arn)

We can monitor the progress of both BDA tasks by running the code cell below.
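
Polling several tasks until all of them finish can also be written as a small reusable helper with a timeout. The sketch below assumes the terminal states used in this notebook (`Success`, `ServiceError`, `ClientError`); `wait_for_all` and its parameters are hypothetical names, and the status lookup is passed in as a function so the helper does not depend on a specific client.

```python
import time
from typing import Callable, Dict, Iterable

TERMINAL_STATES = ("Success", "ServiceError", "ClientError")


def wait_for_all(get_status: Callable[[str], str], arns: Iterable[str],
                 interval: float = 5.0, timeout: float = 1800.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Poll every ARN until each one reports a terminal state."""
    pending = list(arns)
    statuses: Dict[str, str] = {}
    deadline = time.monotonic() + timeout
    while pending:
        for arn in list(pending):
            statuses[arn] = get_status(arn)
            if statuses[arn] in TERMINAL_STATES:
                pending.remove(arn)
        if pending:
            # Give up instead of looping forever if a task never finishes
            if time.monotonic() >= deadline:
                raise TimeoutError(f"Still running after {timeout}s: {pending}")
            sleep(interval)
    return statuses
```

In this notebook, `get_status` could be `lambda arn: bda_runtime_client.get_data_automation_status(invocationArn=arn)["status"]`, called with the audio and video invocation ARNs.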

In [None]:

terminal_states = ["Success", "ServiceError", "ClientError"]
statusAudio, statusVideo, status_aud_response, status_vid_response = None, None, None, None

# Keep polling until BOTH tasks have reached a terminal state
while (statusAudio not in terminal_states) or (statusVideo not in terminal_states):
    status_aud_response = bda_runtime_client.get_data_automation_status(
        invocationArn=invocation_audio_arn
    )
    statusAudio = status_aud_response.get("status")

    status_vid_response = bda_runtime_client.get_data_automation_status(
        invocationArn=invocation_video_arn
    )
    statusVideo = status_vid_response.get("status")

    clear_output(wait=True)
    print(f"{datetime.now().strftime('%H:%M:%S')} : "\
          f"BDA kb video task: {statusVideo} "\
          f"BDA kb audio task: {statusAudio}")
    time.sleep(5)

output_aud_config = status_aud_response.get("outputConfiguration",{}).get("s3Uri")
print("Output configuration file (audio):", output_aud_config)

output_vid_config = status_vid_response.get("outputConfiguration",{}).get("s3Uri")
print("Output configuration file (video):", output_vid_config)

# Examine the BDA output for the processed audio file
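
The status response returns the S3 URI of a `job_metadata.json` file, and the cells in this notebook read the standard output from `0/standard_output/0/result.json` next to it. A minimal sketch of that path derivation, using only the standard library, is shown here; `split_s3_uri` and `standard_result_key` are hypothetical helper names, and the output layout is taken from the cells in this notebook rather than from a documented contract.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def standard_result_key(job_metadata_uri: str) -> tuple:
    """Return (bucket, key) of result.json for a job_metadata.json URI."""
    bucket, key = split_s3_uri(job_metadata_uri)
    prefix = key.rsplit("/job_metadata.json", 1)[0]
    # Layout assumed from this notebook: <job prefix>/0/standard_output/0/result.json
    return bucket, f"{prefix}/0/standard_output/0/result.json"
```

The returned pair can be passed directly to `s3_client.download_file(bucket, key, local_name)`.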

In [None]:
out_aud_loc = status_aud_response['outputConfiguration']['s3Uri'].split("/job_metadata.json", 1)[0].split(bucket_name+"/")[1]
out_aud_loc += "/0/standard_output/0/result.json"
print(out_aud_loc)
s3_client.download_file(bucket_name, out_aud_loc, 'result_aud.json')

In [None]:
with open('result_aud.json') as f:
    data_aud = json.load(f)
print(data_aud["audio"]["summary"])

# Examine the BDA output for the processed video file

In [None]:
out_vid_loc = status_vid_response['outputConfiguration']['s3Uri'].split("/job_metadata.json", 1)[0].split(bucket_name+"/")[1]
out_vid_loc += "/0/standard_output/0/result.json"
print(out_vid_loc)
s3_client.download_file(bucket_name, out_vid_loc, 'result_vid.json')

In [None]:
with open('result_vid.json') as f:
    data_vid = json.load(f)
print(data_vid["video"]["summary"])

# Integrate with Amazon Bedrock Knowledge Bases
After processing the audio and video files with a BDA project, the next step is to integrate with Bedrock KB.
## Steps involved in this integration
- Set up a knowledge base to parse documents using Amazon Bedrock Data Automation as the parser.
- Ingest the processed data into the knowledge base for retrieval and response generation.

In [None]:
# Download sample doc
file_name_doc = 'bedrock-ug.pdf'
source_url = f'https://d1xvhy22zmw77y.cloudfront.net/tmp/{file_name_doc}'

!curl {source_url} --output {file_name_doc}

# Download sample image
file_name_img = 'bda-idp.png'
source_url = f'https://d1xvhy22zmw77y.cloudfront.net/tmp/{file_name_img}'

!curl {source_url} --output {file_name_img}

In [None]:
# Copy the BDA project output to the KB bucket (bucket_name_kb)

obj_audio = 'bda/dataset/result_aud.json'  
s3_client.upload_file('result_aud.json', bucket_name_kb, obj_audio )

obj_video = 'bda/dataset/result_vid.json'  
s3_client.upload_file('result_vid.json', bucket_name_kb, obj_video )

# Copy the PDF file and the image file to the KB bucket (bucket_name_kb)
obj_doc = f"bda/dataset/{file_name_doc}"

obj_img = f"bda/dataset/{file_name_img}"

s3_client.upload_file(file_name_doc, bucket_name_kb, obj_doc )
s3_client.upload_file(file_name_img, bucket_name_kb, obj_img )


In [20]:
# Get the current timestamp
current_time = time.time()

# Format the timestamp as a string
timestamp_str = time.strftime("%Y%m%d%H%M%S", time.localtime(current_time))[-7:]
# Create the suffix using the timestamp
suffix = f"{timestamp_str}"

knowledge_base_name = f"bedrock-multi-modal-kb-{suffix}"
knowledge_base_description = "Multi-modal RAG knowledge base."

foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"

#### You can add multiple data sources (S3, SharePoint) to a multimodal Knowledge Base.

In this notebook, creating a KB is simplified by a wrapper class from the knowledge_base.py file in the "utils" folder of this notebook. The wrapper handles creating the data source, creating the KB, creating the embedding index, and saving the index in a vector data store.


In [21]:
## Uncomment the data sources that you want to add and update the placeholder values accordingly.

data = [
    {"type": "S3", "bucket_name": bucket_name_kb},
    # {"type": "SHAREPOINT", "tenantId": "888d0b57-69f1-4fb8-957f-e1f0bedf64de", "domain": "yourdomain",
    #  "authType": "OAUTH2_CLIENT_CREDENTIALS",
    #  "credentialsSecretArn": f"arn:aws:secretsmanager:{region_name}:{account_id}:secret:<<your_secret_name>>",
    #  "siteUrls": ["https://yourdomain.sharepoint.com/sites/mysite"]
    # },
]

pp = pprint.PrettyPrinter(indent=2)

### Step 1 - Create Knowledge Base with Multi modality

In [None]:
# For multimodal RAG, pass multi_modal=True when instantiating BedrockKnowledgeBase and choose the parser you want to use

knowledge_base = BedrockKnowledgeBase(
    kb_name=f'{knowledge_base_name}',
    kb_description=knowledge_base_description,
    data_sources=data,
    multi_modal= True,
    parser= 'BEDROCK_DATA_AUTOMATION', #'BEDROCK_FOUNDATION_MODEL'
    chunking_strategy = "FIXED_SIZE", 
    suffix = f'{suffix}-f'
)

### Step 2 - Start data ingestion job to KB

Once the KB and data source(s) are created, we can start the ingestion job for each data source. During the ingestion job, the KB fetches the documents from the data source, parses each document to extract its content, chunks it based on the configured chunk size, creates an embedding for each chunk, and writes the embeddings to the vector database, in this case Amazon OpenSearch Serverless (OSS).

NOTE: Currently, you can only start one ingestion job at a time.
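
To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only and not the Bedrock implementation: whitespace-separated words stand in for tokens, and `fixed_size_chunks` with its default sizes is a hypothetical helper.

```python
def fixed_size_chunks(text: str, max_tokens: int = 300, overlap: int = 60) -> list:
    """Split text into chunks of at most max_tokens words.

    Consecutive chunks share `overlap` words so that content near a
    boundary appears in both neighbours.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk produced this way would then be embedded and stored as a separate vector; larger chunks keep more context together, while smaller chunks make retrieval more precise.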

In [None]:
# ensure that the kb is available
time.sleep(30)
# sync knowledge base
knowledge_base.start_ingestion_job()

In [None]:
# keep the kb_id for invocation later in the invoke request
kb_id = knowledge_base.get_knowledge_base_id()
%store kb_id

### Step 3 -  Test the Knowledge Base
Now that the Knowledge Base is available, we can test it using the [**retrieve**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve.html) and [**retrieve_and_generate**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html) functions.
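
The `retrieve` function returns the matching chunks without generating an answer, which is useful for inspecting what the KB actually found. The following is a minimal sketch; `retrieve_chunks` is a hypothetical helper that takes the client as a parameter, so it can be used with the `bedrock_agent_runtime_client` created earlier.

```python
def retrieve_chunks(client, kb_id: str, query: str, top_k: int = 5) -> list:
    """Return score, text, and location for the top_k chunks matching the query."""
    response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [
        {
            "score": result.get("score"),
            "text": result.get("content", {}).get("text"),
            "location": result.get("location"),
        }
        for result in response.get("retrievalResults", [])
    ]
```

For example, `retrieve_chunks(bedrock_agent_runtime_client, kb_id, query)` lists the chunks that a subsequent generation step would be grounded on.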

#### Testing Knowledge Base with Retrieve and Generate API

Let's first test the knowledge base using the retrieve and generate API. With this API, Bedrock takes care of retrieving the necessary references from the knowledge base and generating the final answer using a foundation model from Bedrock.
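
Because the same configuration is used for every query in this notebook, it can be built by a small function. This is a minimal sketch that mirrors the structure used in the query cells; `build_rag_config` is a hypothetical helper name.

```python
def build_rag_config(kb_id: str, region: str, model_id: str,
                     number_of_results: int = 5) -> dict:
    """Build the retrieveAndGenerateConfiguration for a knowledge base query."""
    return {
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id,
            # Foundation-model ARNs have an empty account field
            "modelArn": f"arn:aws:bedrock:{region}::foundation-model/{model_id}",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {"numberOfResults": number_of_results}
            },
        },
    }
```

A query then becomes `bedrock_agent_runtime_client.retrieve_and_generate(input={"text": query}, retrieveAndGenerateConfiguration=build_rag_config(kb_id, region, foundation_model))`.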

query = Give me the summary of the AWS Rethink podcast hosted by Nolan Chen and Malini Chatterjee?

The answer to this query is expected to come from the processed audio (podcast) content rather than from the PDF document.

### Step 4: View Query Output

In [38]:
query = "Give me the summary of the AWS Rethink podcast hosted by Nolan Chen and Malini Chatterjee?"

In [None]:
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"
# foundation_model = "amazon.nova-micro-v1:0"

response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5
                } 
            }
        }
    }
)

print(response['output']['text'],end='\n'*2)

## Query image
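
For image queries, the citations in the response carry the source attribution. A minimal sketch for collecting the cited sources is shown here; it assumes the metadata keys used later in this notebook, and `cited_sources` is a hypothetical helper that only reads the response dictionary.

```python
SOURCE_KEY = "x-amz-bedrock-kb-source-uri"
IMAGE_KEY = "x-amz-bedrock-kb-byte-content-source"


def cited_sources(response: dict) -> list:
    """Return unique (source_uri, image_uri) pairs from a retrieve_and_generate response.

    image_uri is None for references that carry no image content.
    """
    seen = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            metadata = ref.get("metadata", {})
            entry = (metadata.get(SOURCE_KEY), metadata.get(IMAGE_KEY))
            if entry != (None, None) and entry not in seen:
                seen.append(entry)
    return seen
```

Calling `cited_sources(response)` after an image query gives a quick view of which files and extracted images the answer relied on.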

In [None]:
query = "Give me an architecture design of an IDP workflow using Bedrock Data Automation"

In [None]:
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"
# foundation_model = "amazon.nova-micro-v1:0"

response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5
                } 
            }
        }
    }
)

print(response['output']['text'],end='\n'*2)

In [None]:
from PIL import Image
import s3fs

fs = s3fs.S3FileSystem()

## Function to print retrieved response

def print_response(response):
    # Response keys: ['ResponseMetadata', 'citations', 'output', 'sessionId']
    print(f'OUTPUT: {response["output"]["text"]} \n')

    print('CITATION DETAILS: \n')

    for num, chunk in enumerate(response['citations']):
        print(f'CHUNK {num}')
        print("========")
        print('\t Generated Response Text: ')
        print('\t ------------------------- ')
        print('\t', chunk['generatedResponsePart']['textResponsePart']['text'], end='\n'*2)
        for ref in chunk['retrievedReferences']:
            metadata = ref.get('metadata', {})
            print('\t Retrieved References: ')
            print('\t ---------------------')
            print('\n\t\t --> Location:', ref['location'])
            print('\t\t --> Metadata: \n\t\t\t ---> Source', metadata.get('x-amz-bedrock-kb-source-uri'))
            # The byte-content source may be absent for chunks without image content
            image_uri = metadata.get('x-amz-bedrock-kb-byte-content-source')
            if image_uri:
                print('\t\t\t ---> x-amz-bedrock-kb-byte-content-source', image_uri)
                with fs.open(image_uri) as f:
                    display(Image.open(f).resize((400, 400)))
            print("")

In [None]:
print_response(response)

## Clean Up
Let's delete the sample files that were uploaded to S3 and the Bedrock Knowledge Base that was created with BDA as the parser.

In [None]:
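## Delete S3 files
# delete_object removes only a single key, so delete every object under the 'bda/' prefix instead
boto3.resource('s3', region_name=region_name).Bucket(bucket_name).objects.filter(Prefix='bda/').delete()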

# Delete Knowledge Base

knowledge_base.delete_kb(delete_s3_bucket=True, delete_iam_roles_and_policies=False)


## Conclusion

By following this guide, you can effectively harness the power of Amazon Bedrock’s features to build a robust Multimodal RAG application tailored to your specific needs.