# Multi-modal data processing - End-to-end example using Amazon Bedrock Knowledge Bases for text & images

Multi-modal RAG can analyze and leverage insights from both textual and visual data, such as images, charts, diagrams, and tables. Bedrock Knowledge Bases offers an end-to-end managed Retrieval-Augmented Generation (RAG) workflow that enables customers to create highly accurate, low-latency, secure, and custom generative AI applications by incorporating contextual information from their own data sources.

Bedrock Knowledge Bases extracts content from both text and visual data, generates semantic embeddings using the selected embedding model, and stores them in the chosen vector store. This enables users to retrieve and generate answers to questions derived not only from text but also from visual data. Additionally, retrieved results now include source attribution for visual data, enhancing transparency and building trust in the generated outputs.

To parse the documents, you can choose between Amazon Bedrock Data Automation, a managed service that automatically extracts content from multimodal data (currently in Preview), and foundation models (FMs) such as Claude 3.5 Sonnet or Claude 3 Haiku, with the flexibility to customize the default prompt.
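For orientation, the parser choice corresponds to the `parsingConfiguration` fragment that the `bedrock-agent` `create_data_source` API accepts inside `vectorIngestionConfiguration`. The sketch below is illustrative only: `build_parsing_configuration` is a hypothetical helper name, and how the `BedrockKnowledgeBase` utility used in this notebook assembles its request is an assumption, not something shown here.

```python
from typing import Optional


def build_parsing_configuration(parser: str,
                                model_arn: Optional[str] = None,
                                parsing_prompt: Optional[str] = None) -> dict:
    """Return a parsingConfiguration dict for the chosen parser (hypothetical helper)."""
    if parser == "BEDROCK_DATA_AUTOMATION":
        # Managed extraction: no model or prompt to configure in this sketch.
        return {"parsingStrategy": "BEDROCK_DATA_AUTOMATION"}
    if parser == "BEDROCK_FOUNDATION_MODEL":
        if model_arn is None:
            raise ValueError("model_arn is required for the foundation model parser")
        fm_config = {"modelArn": model_arn}
        if parsing_prompt is not None:
            # Optional override of the default parsing prompt.
            fm_config["parsingPrompt"] = {"parsingPromptText": parsing_prompt}
        return {
            "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
            "bedrockFoundationModelConfiguration": fm_config,
        }
    raise ValueError(f"Unknown parser: {parser}")
```

Keeping the two branches in one function makes it easy to switch parsers without touching the rest of the data source definition.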

This notebook provides sample code for building a Multimodal RAG using Amazon Bedrock Knowledge Bases.

#### Steps:
- Create a Knowledge Base execution role with the necessary policies for reading data from and writing data to S3, and for invoking the required foundation models.
- Create a knowledge base with rich content documents
- Create data source(s) within the knowledge base
- Start ingestion jobs using the KB APIs, which read data from the data source, parse the documents (images, charts, tables, etc.) using Bedrock Data Automation or a foundation model, chunk them, convert the chunks into embeddings using the Amazon Titan Embeddings model, and store these embeddings in Amazon OpenSearch Serverless (AOSS). All of this happens without having to build, deploy, and manage the data pipeline.
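To make the chunking step in the list above concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: the managed service performs this step for you, whitespace-separated words stand in for model tokens, and the function name and default values are assumptions rather than the service's actual settings.

```python
def fixed_size_chunks(text, max_tokens=300, overlap_percentage=20):
    """Split text into chunks of at most max_tokens words, overlapping by a percentage."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")

    tokens = text.split()  # crude stand-in for a real tokenizer
    overlap = max_tokens * overlap_percentage // 100
    step = max_tokens - overlap  # always >= 1 because overlap < max_tokens

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means neighbouring chunks share some words, so a sentence that straddles a chunk boundary is still retrievable as a whole from at least one chunk.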

Once the data is available in the Bedrock Knowledge Base, a question-answering application can be built using the Knowledge Base APIs provided by Amazon Bedrock.



#### Pre-requisites:

Please make sure to enable access to the `Anthropic Claude 3 Sonnet`, `Amazon Nova Micro`, and `Titan Text Embeddings V2` models in the Amazon Bedrock console.
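As a quick sanity check, you can confirm that these models are offered in your region. The sketch below is a hedged example: `find_missing_models` is a hypothetical helper, the Titan model ID is an assumption based on the model name above, and `list_foundation_models` only reports which models exist in the region, not whether access has been granted, so the console check is still required.

```python
# Model IDs assumed to correspond to the three models named above.
REQUIRED_MODEL_IDS = (
    "anthropic.claude-3-sonnet-20240229-v1:0",
    "amazon.nova-micro-v1:0",
    "amazon.titan-embed-text-v2:0",
)


def find_missing_models(bedrock_client, required_ids=REQUIRED_MODEL_IDS):
    """Return the required model IDs that the region does not list."""
    summaries = bedrock_client.list_foundation_models()["modelSummaries"]
    offered = {summary["modelId"] for summary in summaries}
    return [model_id for model_id in required_ids if model_id not in offered]


# Usage in the notebook (requires AWS credentials):
# import boto3
# print(find_missing_models(boto3.client("bedrock")))
```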

<div class="alert alert-block alert-info">
<b>Note:</b> Please run the notebook cells one at a time instead of using the "Run All Cells" option.
</div>


### 0 - Setup
Before running the rest of this notebook, run the cells below to ensure the necessary libraries are installed and to connect to Bedrock.

You can ignore any pip dependency errors that appear while the libraries are installed.

In [None]:
%pip install --upgrade pip --quiet
%pip install -r ../../requirements.txt --no-deps --quiet
%pip install -r ../../requirements.txt --upgrade --quiet

In [None]:
# %pip install --upgrade boto3
import boto3
print(boto3.__version__)

In [None]:
# restart kernel
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
import sys
import time
import boto3
import logging
import pprint
import json

# Set the path to import module
from pathlib import Path
current_path = Path().resolve()
current_path = current_path.parent.parent
if str(current_path) not in sys.path:
    sys.path.append(str(current_path))
# Print sys.path to verify
# print(sys.path)

from utils.knowledge_base import BedrockKnowledgeBase

In [None]:
#Clients
s3_client = boto3.client('s3')
sts_client = boto3.client('sts')
session = boto3.session.Session()
region =  session.region_name
account_id = sts_client.get_caller_identity()["Account"]
bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime') 
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
region, account_id

In [None]:
import time

# Get the current timestamp
current_time = time.time()

# Format the timestamp as a string
timestamp_str = time.strftime("%Y%m%d%H%M%S", time.localtime(current_time))[-7:]
# Create the suffix using the timestamp
suffix = f"{timestamp_str}"

knowledge_base_name = f"bedrock-multi-modal-kb-{suffix}"
knowledge_base_description = "Multi-modal RAG knowledge base."

bucket_name = f'{knowledge_base_name}-{account_id}'
# intermediate_bucket_name = f'{knowledge_base_name}-mm-storage-{account_id}'
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"

#### You can add multiple data sources (S3, SharePoint) to a multimodal Knowledge Base. For this notebook, we'll test Knowledge Base creation with an S3 bucket.

Each data source may have different prerequisites; please refer to the AWS documentation for more information.

In [None]:
## Please uncomment the data sources that you want to add and update the placeholder values accordingly.

data_sources=[
                {"type": "S3", "bucket_name": bucket_name}, 

                # {"type": "SHAREPOINT", "tenantId": "888d0b57-69f1-4fb8-957f-e1f0bedf64de", "domain": "yourdomain",
                #   "authType": "OAUTH2_CLIENT_CREDENTIALS",
                #  "credentialsSecretArn": f"arn:aws:secretsmanager:{region}:{account_id}:secret:<<your_secret_name>>",
                #  "siteUrls": ["https://yourdomain.sharepoint.com/sites/mysite"]
                # },
            ]
                
pp = pprint.PrettyPrinter(indent=2)

### 1 - Create Knowledge Base with Multi modality

In [None]:
# For multi-modal RAG, pass multi_modal=True when instantiating BedrockKnowledgeBase and choose the parser you want to use

knowledge_base = BedrockKnowledgeBase(
    kb_name=f'{knowledge_base_name}',
    kb_description=knowledge_base_description,
    data_sources=data_sources,
    multi_modal=True,
    parser='BEDROCK_FOUNDATION_MODEL',  # or 'BEDROCK_DATA_AUTOMATION'
    chunking_strategy="FIXED_SIZE",
    suffix=f'{suffix}-f'
)

### 2 - Data Ingestion
We'll download a publicly available rich-content PDF and upload it to an S3 bucket.

In [None]:
import os

def create_directory(directory_name):    
    if not os.path.exists(directory_name):
        os.makedirs(directory_name)
        print(f"Directory '{directory_name}' created successfully.")
    else:
        print(f"Directory '{directory_name}' already exists.")

# Call the function to create the directory
create_directory("mm-data")

In [None]:
import requests

def download_file(url, filename):
    # Send a GET request to the URL (the timeout avoids hanging indefinitely)
    response = requests.get(url, timeout=30)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Open the file in write-binary mode
        with open(filename, 'wb') as file:
            # Write the content of the response to the file
            file.write(response.content)
        print(f"File downloaded successfully: {filename}")
    else:
        print(f"Failed to download file. Status code: {response.status_code}")

# URL of the file to download
url = "https://sgp.fas.org/crs/misc/IF12695.pdf"

# Name for the downloaded file
filename = "./mm-data/tornadoes_report.pdf"

# Call the function to download the file
download_file(url, filename)

##### Upload data to S3 Bucket data source

In [None]:
def upload_directory(path, bucket_name):
    # Upload every file under `path`; the object key is the bare file name
    for root, dirs, files in os.walk(path):
        for file in files:
            file_to_upload = os.path.join(root, file)
            print(f"uploading file {file_to_upload} to {bucket_name}")
            s3_client.upload_file(file_to_upload, bucket_name, file)

upload_directory("./mm-data", bucket_name)

### 3 - Start ingestion job
Once the KB and data source(s) are created, we can start the ingestion job for each data source.
During the ingestion job, the KB fetches the documents from the data source, parses them to extract text and visual content, chunks them based on the configured chunk size, creates an embedding for each chunk, and writes the embeddings to the vector database, in this case Amazon OpenSearch Serverless (AOSS).

NOTE: Currently, you can only kick off one ingestion job at a time.
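If you start ingestion jobs through the `bedrock-agent` API directly rather than through the helper class, you typically need to poll until the job reaches a terminal state. The following is a minimal sketch under that assumption: `wait_for_ingestion_job` is a hypothetical helper, the polling interval and timeout are arbitrary, and whether the notebook's `knowledge_base.start_ingestion_job()` already waits internally is not shown here.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def wait_for_ingestion_job(agent_client, knowledge_base_id, data_source_id,
                           ingestion_job_id, poll_seconds=10, timeout_seconds=1800):
    """Poll get_ingestion_job until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = agent_client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=ingestion_job_id,
        )["ingestionJob"]
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Ingestion job {ingestion_job_id} did not finish in time")
        time.sleep(poll_seconds)
```

Callers should still inspect the returned `status`, since a job that ends in `FAILED` or `STOPPED` is returned rather than raised.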

In [None]:
# ensure that the kb is available
time.sleep(30)
# sync knowledge base
knowledge_base.start_ingestion_job()

In [None]:
# keep the kb_id for invocation later in the invoke request
kb_id = knowledge_base.get_knowledge_base_id()
%store kb_id

### 4 - Test the Knowledge Base
Now that the Knowledge Base is available, we can test it using the [**retrieve**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve.html) and [**retrieve_and_generate**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html) functions.

#### Testing Knowledge Base with Retrieve and Generate API

Let's first test the knowledge base using the retrieve and generate API. With this API, Bedrock takes care of retrieving the necessary references from the knowledge base and generating the final answer using a foundation model from Bedrock.

query = `Summarize annual trends of tornado reports and how it varies year over year.`

The correct response to this query is expected to be drawn from a chart/graph in the PDF document.
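Because the generated answer carries source attribution, the cited documents can also be listed programmatically. The sketch below assumes the response follows the `retrieve_and_generate` shape used in this notebook (`citations`, then `retrievedReferences`, then `metadata`); `cited_sources` is a hypothetical helper name.

```python
def cited_sources(rag_response):
    """Return the distinct source URIs cited in a retrieve_and_generate response."""
    sources = []
    for citation in rag_response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("metadata", {}).get("x-amz-bedrock-kb-source-uri")
            # Keep first-seen order and skip references without a source URI.
            if uri and uri not in sources:
                sources.append(uri)
    return sources
```

Once a `response` from `retrieve_and_generate` is available, `cited_sources(response)` gives a compact list of the documents the answer relied on.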

In [None]:
query = "Summarize annual trends of tornado reports and how it varies year over year."

In [None]:
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"
# foundation_model = "amazon.nova-micro-v1:0"

response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5
                } 
            }
        }
    }
)

print(response['output']['text'],end='\n'*2)

In [None]:
from PIL import Image
import s3fs

fs = s3fs.S3FileSystem()

## Function to print the retrieve_and_generate response

def print_response(response):
    # The response has the keys 'ResponseMetadata', 'citations', 'output' and 'sessionId'
    print(f'OUTPUT: {response["output"]["text"]} \n')
    print('CITATION DETAILS: \n')

    for num, chunk in enumerate(response['citations']):
        print(f'CHUNK {num}')
        print("========")
        print('\t Generated Response Text: ')
        print('\t ------------------------- ')
        print('\t', chunk['generatedResponsePart']['textResponsePart']['text'], end='\n'*2)
        for ref in chunk['retrievedReferences']:
            print('\t Retrieved References: ')
            print('\t ---------------------')
            print('\n\t\t --> Location:', ref['location'])
            print('\t\n\t\t --> Metadata: \n\t\t\t ---> Source', ref['metadata']['x-amz-bedrock-kb-source-uri'])
            # print(f'\t\n\t\t\n\t\t\t ---> x-amz-bedrock-kb-description', ref['metadata']['x-amz-bedrock-kb-description'])
            # The byte-content source (S3 URI of the extracted image) may be absent for text-only chunks
            image_uri = ref['metadata'].get('x-amz-bedrock-kb-byte-content-source')
            if image_uri is None:
                continue
            print('\t\n\t\t\n\t\t\t ---> x-amz-bedrock-kb-byte-content-source', image_uri)
            print("")
            with fs.open(image_uri) as f:
                display(Image.open(f).resize((400, 400)))

In [None]:
print_response(response)

#### Testing Knowledge Base with Retrieve API
If you need an extra layer of control, you can retrieve the chunks that best match your query using the retrieve API. In this setup, you can configure the desired number of results and control the final answer with your own application logic. The API then provides you with the matching content, its S3 location, the similarity score, and the chunk metadata.
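As an example of such application logic, the sketch below keeps only text chunks above a similarity threshold and joins the best ones into a context string for a prompt. The helper name `build_context` and the threshold and chunk-count defaults are assumptions chosen for illustration, not recommended values.

```python
def build_context(retrieval_results, min_score=0.4, max_chunks=3):
    """Join the highest-scoring text chunks from a retrieve() result list."""
    text_chunks = [
        result for result in retrieval_results
        if "text" in result.get("content", {})
        and result.get("score", 0.0) >= min_score
    ]
    # Highest similarity first, then keep at most max_chunks of them.
    text_chunks.sort(key=lambda result: result.get("score", 0.0), reverse=True)
    return "\n\n".join(result["content"]["text"] for result in text_chunks[:max_chunks])
```

The function expects the `retrievalResults` list from a `retrieve` response, so it can be called as `build_context(response_ret['retrievalResults'])` once that response exists.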

In [None]:
response_ret = bedrock_agent_runtime_client.retrieve(
    knowledgeBaseId=kb_id, 
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults":5,
        } 
    },
    retrievalQuery={
        "text": query  # reuse the tornado question defined above
    }
)

def response_print(retrieve_resp):
    # 'retrievalResults' is a list of chunks; each chunk has content, location, score and metadata
    for num, chunk in enumerate(retrieve_resp['retrievalResults'], 1):
        if 'text' in chunk['content']:
            print(f'Chunk {num}: ', chunk['content']['text'], end='\n'*2)
        if 'byteContent' in chunk['content']:
            print(f'Chunk {num}: ', chunk['content']['byteContent'], end='\n'*2)
        print(f'Chunk {num} Location: ', chunk['location'], end='\n'*2)
        print(f'Chunk {num} Score: ', chunk['score'], end='\n'*2)
        print(f'Chunk {num} Metadata: ', chunk['metadata'], end='\n'*2)
        print("--------------------------------")

response_print(response_ret)

### Clean up
Please make sure to run the cells below to delete all the resources created by this notebook.

In [None]:
# delete local directory
import shutil

dir_path = "mm-data" # Replace with the actual path

try:
    shutil.rmtree(dir_path)
    print(f"Directory '{dir_path}' and its contents have been deleted successfully.")
except FileNotFoundError:
    print(f"Directory '{dir_path}' not found.")
except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
# Delete the knowledge base, its S3 bucket, and the IAM roles and policies
print("=============================== Deleting resources ==============================\n")
knowledge_base.delete_kb(delete_s3_bucket=True, delete_iam_roles_and_policies=True)
