# Metadata extraction and Amazon Bedrock Knowledge Bases creation

This notebook demonstrates the first part of implementing an agentic RAG system using Amazon Bedrock. You will:

- Extract metadata from PDF documents to enhance retrieval capabilities
- Create two strategic Amazon Bedrock Knowledge Bases:
  1. A summary KB containing document overviews and metadata for initial document filtering
  2. A detailed KB with document chunks and associated metadata for precise content retrieval
- Set up the foundation for an agentic RAG system that makes intelligent decisions about document relevance






Let's do this step by step in the following sections:

- **Prerequisites**: Prerequisites to run the two notebooks in this repository successfully.
- **Base Infrastructure Deployment**: You will deploy an AWS CloudFormation template that creates and configures some of the services used by the solution.
- **Metadata extraction**: You will extract the text of each PDF and use Anthropic Claude on Amazon Bedrock to generate a title and summary, saved as metadata files associated with each document.
- **Upload to Amazon S3**: You will upload the PDF documents and the generated metadata files to the Amazon S3 bucket.
- **Create the Amazon Bedrock Knowledge Bases**: You will create, sync, and test the two Knowledge Bases with the documents and associated metadata.

## Prerequisites
- Access to Amazon Bedrock models. [Amazon Bedrock](https://aws.amazon.com/bedrock/) is a fully managed service that makes base models from Amazon and third-party model providers accessible through an API.
- Use an **IAM role** with access to the following services: **Amazon S3, AWS STS, AWS CloudFormation, Amazon Bedrock, and Amazon OpenSearch Serverless**.
- PDF documents to process
- Required Python packages installed


<div class="alert alert-block alert-warning">
<b>Note:</b> Amazon Bedrock users need to request access to models before they are available for use. If you want to add additional models for text, chat, and image generation, you need to request access to models in Amazon Bedrock. To request access to additional models, select the Model access link in the left side navigation panel in the Amazon Bedrock console. For more information see: <a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html">https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html</a>
</div>

For this, you will need to request access to:

- Embeddings model: **Amazon Titan Embeddings V2**
- Text generation model: **Anthropic Claude 3 Sonnet**

## Base Infrastructure Deployment 
We have created two AWS CloudFormation templates that automatically set up some of the services needed for this notebook.

The first CloudFormation template creates the Amazon S3 bucket and the Amazon OpenSearch Serverless collection. Both are required to create the Amazon Bedrock Knowledge Bases.




### Setting up the environment

In this step, we'll import the necessary libraries and configure our AWS credentials. This setup is crucial for interacting with Amazon Bedrock and other AWS services.
  

In [None]:
%pip install opensearch-py boto3 botocore PyPDF2

Let's import necessary Python modules and libraries, and initialize AWS service clients required for the notebook.

In [49]:
import os
import time
import uuid
import boto3
from botocore.exceptions import ClientError

# First, create a session that uses the default credential chain and region
session = boto3.Session()

# Now, create all clients using this session
s3_client = session.client('s3')
sts_client = session.client('sts')
cloudformation = session.client('cloudformation')
bedrock_agent_client = session.client('bedrock-agent')
bedrock = session.client("bedrock")
bedrock_agent_runtime_client = session.client('bedrock-agent-runtime')

# Get the region from the session
region = session.region_name

# Get the account ID using the sts client created from the session
account_id = sts_client.get_caller_identity()["Account"]

# Get the identity ARN
identity_arn = sts_client.get_caller_identity()['Arn']

We will define a solution ID that is used as a prefix for the names of the resources we create.

In [50]:
def short_uuid():
    uuid_str = str(uuid.uuid4())
    return uuid_str[:8]

solution_id = 'KBS{}'.format(short_uuid()).lower()

Next, we define the wrapper function to create the stack in CloudFormation

In [51]:
import re
import json
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

opensearch_client = session.client('opensearchserverless')

def create_base_infrastructure(solution_id):
    # Read the YAML template file
    with open('templates/1-base-infra.yaml', 'r') as f:
        template_body = f.read()

    # Define the stack parameters
    stack_parameters = [
        {
            'ParameterKey': 'SolutionId',
            'ParameterValue': solution_id
        }
    ]

    # Create the CloudFormation stack
    stack_name = "KB-E2E-Base-{}".format(solution_id)
    response = cloudformation.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=stack_parameters,
        Capabilities=['CAPABILITY_NAMED_IAM']  # Required if your template creates IAM resources
    )

    stack_id = response['StackId']
    print(f'Creating stack {stack_name} ({stack_id})')

    # Wait for the stack to be created
    waiter = cloudformation.get_waiter('stack_create_complete')
    waiter.wait(StackName=stack_id)

    # Get the stack outputs
    stack_outputs = cloudformation.describe_stacks(StackName=stack_id)['Stacks'][0]['Outputs']

    # Extract the output values into variables
    s3_bucket = next((output['OutputValue'] for output in stack_outputs if output['OutputKey'] == 's3bucket'), None)
    collection_id = next((output['OutputValue'] for output in stack_outputs if output['OutputKey'] == 'OpenSearchCollectionId'), None)

    print('Stack outputs:')
    print(f'S3 Bucket: {s3_bucket}')
    print(f'OpenSearchCollectionId: {collection_id}')

    return s3_bucket, collection_id

Ok, now we are ready to launch the CloudFormation template

In [52]:
s3_bucket, collection_id = create_base_infrastructure(solution_id)

Creating stack KB-E2E-Base-kbs948efd22 (arn:aws:cloudformation:us-east-1:776299153297:stack/KB-E2E-Base-kbs948efd22/237fcd60-9189-11ef-82ab-0affcbf29b41)
Stack outputs:
S3 Bucket: kbs948efd22-bucket
OpenSearchCollectionId: xlhcvyck6rthnso7caq6


<div class="alert alert-block alert-warning">
The deployment of the AWS CloudFormation template should take around <b>1-2 minutes</b>.
    
You can also follow the deployment status in the AWS CloudFormation console. 
</div>

## Metadata extraction


<div class="alert alert-block alert-warning">
<b>Warning:</b> Make sure you have enabled Anthropic Claude 3 Sonnet access in the Amazon Bedrock console (model access). 
</div>

The following code implements a PDF processing pipeline that extracts text from each PDF file and generates a summary and metadata using Anthropic Claude on Amazon Bedrock. 

The code first defines utility functions for cleaning text, and a function to extract text from PDFs using PyPDF2. The main function **query_bedrock** sends the extracted text to Amazon Bedrock, requesting a JSON response containing the document's filename, title, and a comprehensive summary. 

The script then processes all PDF files in a specified folder, generating two JSON output files for each PDF: one containing basic metadata (filename and title) stored in the original folder, and another with the full metadata including the summary stored in a separate 'Summary-PDFs' subdirectory. 

To use metadata filtering, we need to create a separate metadata JSON file for each document. The metadata file must have the same name as the corresponding PDF file (including the extension), followed by *.metadata.json*. For instance, if the file is named *file_001.pdf*, the metadata file must be named *file_001.pdf.metadata.json*. This naming convention is what allows the Knowledge Base to match metadata to specific files during ingestion. 

The metadata JSON file will contain key-value pairs representing the relevant metadata fields associated with the file.
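As a minimal sketch of the naming convention described above, the snippet below writes a sidecar metadata file next to a document. The helper names `metadata_sidecar_path` and `write_metadata_sidecar` and the attribute values are hypothetical and used only for illustration; the `metadataAttributes` key and the `.metadata.json` suffix come from this notebook.

```python
import json
import os
import tempfile

def metadata_sidecar_path(document_path):
    """Return the sidecar path: the full document name plus '.metadata.json'."""
    return document_path + ".metadata.json"

def write_metadata_sidecar(document_path, attributes):
    """Write the attributes under 'metadataAttributes' next to the document."""
    sidecar_path = metadata_sidecar_path(document_path)
    with open(sidecar_path, "w", encoding="utf-8") as f:
        json.dump({"metadataAttributes": attributes}, f, ensure_ascii=False, indent=4)
    return sidecar_path

# Illustrative values only; in the notebook they come from the model response.
with tempfile.TemporaryDirectory() as tmp_dir:
    pdf_path = os.path.join(tmp_dir, "file_001.pdf")
    sidecar = write_metadata_sidecar(
        pdf_path, {"filename": "file_001.pdf", "title": "Example title"}
    )
    print(os.path.basename(sidecar))
    with open(sidecar, encoding="utf-8") as f:
        print(json.load(f))
```

Note that the suffix is appended to the complete file name, so the `.pdf` extension is kept in the sidecar name.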

In [53]:
# Path of the folder where the PDF files are located
folder_path = './documents/PDFs/'
json_folder_path = './documents/Summary-PDFs/'

# Create the 'Summary-PDFs' directory if it doesn't exist
os.makedirs(json_folder_path, exist_ok=True)
# Create the 'PDFs' directory if it doesn't exist
os.makedirs(folder_path, exist_ok=True)

In [54]:
import json
import PyPDF2
import re
import traceback

# Setting up the Bedrock runtime client
bedrock_runtime = session.client(service_name='bedrock-runtime')

def clean_text(text):
    # Remove non-printable characters and control characters
    text = ''.join(char for char in text if ord(char) >= 32 or char in '\n\r\t')

    # Remove any remaining problematic Unicode characters (unpaired surrogates)
    text = re.sub(r'[\ud800-\udfff]', '', text)

    # Replace multiple newlines with a single newline
    text = re.sub(r'\n+', '\n', text)

    # Replace multiple spaces with a single space
    text = re.sub(r' +', ' ', text)
    
    return text.strip()

def extract_text_from_pdf(file_path):
    try:
        with open(file_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            text = ''
            for page in pdf_reader.pages:
                # Guard against pages with no extractable text
                text += (page.extract_text() or '') + ' '
        return clean_text(text)
    except Exception as e:
        print(f"Error extracting text from PDF {file_path}: {str(e)}")
        return None

def clean_json_string(json_string):
    # Remove any leading/trailing whitespace
    json_string = json_string.strip()

    # Ensure the JSON string starts and ends with curly braces
    if not json_string.startswith('{'):
        json_string = '{' + json_string
    if not json_string.endswith('}'):
        json_string = json_string + '}'

    # Remove any control characters
    json_string = ''.join(char for char in json_string if ord(char) >= 32 or char in '\n\r\t')
    
    return json_string

def query_bedrock(text, filename):
    messages = [
        {
            "role": "user",
            "content": f"""Based on the following document content and filename, please extract the title and generate a comprehensive summary of the document. 
            Return the information in JSON format as shown below. Ensure the JSON is complete and valid, starting with an opening curly brace and ending with a closing curly brace:
            {{
                "metadataAttributes": {{ 
                    "filename": string,
                    "title": string,
                    "summary": string
                }}
            }}

            Filename: {filename}
            Document content:
            {text}"""
        },
        {
            "role": "assistant", "content": "{"
        }  # Prefill here
    ]

    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages,
        "temperature": 0.1,
        "top_p": 0.9,
    })

    cleaned_json_string = None  # Defined up front so the error handler can always print it
    try:
        response = bedrock_runtime.invoke_model(
            body=body,
            modelId="anthropic.claude-3-sonnet-20240229-v1:0",
            accept="application/json",
            contentType="application/json"
        )

        response_body = json.loads(response.get('body').read())
        print(response_body)

        json_string = response_body['content'][0]['text']
        print("Raw JSON string:", json_string)

        cleaned_json_string = clean_json_string(json_string)
        print("Cleaned JSON string:", cleaned_json_string)
        
        return json.loads(cleaned_json_string)
    except json.JSONDecodeError as e:
        print(f"JSON Decode Error for file {filename}: {str(e)}")
        print("Problematic JSON string:", cleaned_json_string)
        return None
    except Exception as e:
        print(f"Error processing file {filename} in query_bedrock: {str(e)}")
        traceback.print_exc()
        return None

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith('.pdf'):
        file_path = os.path.join(folder_path, filename)

        try:
            # Extract text from PDF
            document_text = extract_text_from_pdf(file_path)

            if document_text is None:
                print(f"Skipping file {filename} due to text extraction error")
                continue

            # Query Bedrock for metadata
            json_response = query_bedrock(document_text, filename)
            
            if json_response:
                # Create JSON output filenames
                output_filename_meta = os.path.splitext(filename)[0] + '.pdf.metadata.json'
                output_filename_meta_w_summary = os.path.splitext(filename)[0] + '.json'

                # Prepare metadata without summary
                metadata_without_summary = {
                    "metadataAttributes": {
                        "filename": json_response["metadataAttributes"]["filename"],
                        "title": json_response["metadataAttributes"]["title"]
                    }
                }

                # Save the metadata without summary in the original folder
                with open(os.path.join(folder_path, output_filename_meta), 'w', encoding='utf-8') as jsonfile:
                    json.dump(metadata_without_summary, jsonfile, ensure_ascii=False, indent=4)

                # Save the metadata including summary in the 'Summary-PDFs' subdirectory
                with open(os.path.join(json_folder_path, output_filename_meta_w_summary), 'w', encoding='utf-8') as jsonfile:
                    json.dump(json_response, jsonfile, ensure_ascii=False, indent=4)

                print(f'Files {output_filename_meta} and {output_filename_meta_w_summary} successfully generated and saved.')
            else:
                print(f'Failed to generate metadata for {filename}')
        except Exception as e:
            print(f"Error processing file {filename}: {str(e)}")
            traceback.print_exc()

print("Process completed.")

{'id': 'msg_bdrk_01PQ6KwJrdLUV6G77RozxneN', 'type': 'message', 'role': 'assistant', 'model': 'claude-3-sonnet-20240229', 'content': [{'type': 'text', 'text': '\n    "metadataAttributes": {\n        "filename": "metagpt.pdf",\n        "title": "MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework",\n        "summary": "MetaGPT is a novel meta-programming framework that leverages Standardized Operating Procedures (SOPs) to enhance the problem-solving capabilities of multi-agent systems based on Large Language Models (LLMs). It models a group of agents as a simulated software company, with specialized roles like Product Manager, Architect, Engineer, and QA Engineer following a streamlined workflow. MetaGPT uses structured communication interfaces, a publish-subscribe mechanism, and an executable feedback mechanism to improve code generation quality. On benchmarks like HumanEval and MBPP, MetaGPT achieves state-of-the-art performance, outperforming existing approaches. The s

To better understand what we have generated, review the output files in the **documents** folder  
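The raw model output printed above starts without an opening curly brace. This is a consequence of prefilling the assistant turn with `{`: the model continues from the prefill, so the returned text does not repeat it. The sketch below illustrates how the brace can be restored before parsing. The function name `parse_prefilled_json` and the sample completion are hypothetical; in the notebook, `clean_json_string` plays this role.

```python
import json

def parse_prefilled_json(completion_text, prefill="{"):
    """Re-attach the prefilled prefix (if missing) and parse the result as JSON."""
    candidate = completion_text.strip()
    if not candidate.startswith(prefill):
        candidate = prefill + candidate
    return json.loads(candidate)

# Made-up completion shaped like the model output: no leading brace.
sample_completion = (
    '\n    "metadataAttributes": {"filename": "example.pdf", '
    '"title": "Example", "summary": "Short."}\n}'
)
parsed = parse_prefilled_json(sample_completion)
print(parsed["metadataAttributes"]["title"])
```

If the output is cut off by the `max_tokens` limit, parsing still fails, which is why `query_bedrock` handles `json.JSONDecodeError` separately.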

## Upload to Amazon S3

The Knowledge Bases in this notebook use Amazon S3 as their data source, so we will upload all the generated files to the S3 bucket.

In [55]:
# S3 upload functionality
def upload_directory_to_s3(local_directory, s3_bucket, s3_prefix):
    for root, dirs, files in os.walk(local_directory):
        for filename in files:
            local_path = os.path.join(root, filename)
            relative_path = os.path.relpath(local_path, local_directory)
            # S3 keys always use forward slashes, regardless of the local OS
            s3_path = f"{s3_prefix}/{relative_path.replace(os.sep, '/')}"
            s3_client.upload_file(local_path, s3_bucket, s3_path)
            print(f"Uploaded {local_path} to s3://{s3_bucket}/{s3_path}")

# Create the S3 bucket if it doesn't exist
try:
    s3_client.create_bucket(Bucket=s3_bucket)
    print(f"Bucket '{s3_bucket}' created successfully.")
except ClientError as e:
    error_code = e.response['Error']['Code']
    if error_code == 'BucketAlreadyOwnedByYou':
        print(f"Bucket '{s3_bucket}' already exists and is owned by you. Proceeding with file upload.")
    elif error_code == 'BucketAlreadyExists':
        print(f"Bucket '{s3_bucket}' already exists but is owned by another account. Please choose a different bucket name.")
        raise
    else:
        print(f"An error occurred while creating the bucket: {e}")
        raise

# Upload PDFs directory
upload_directory_to_s3(folder_path, s3_bucket, "PDFs")

# Upload Summary-PDFs directory
upload_directory_to_s3(json_folder_path, s3_bucket, "Summary-PDFs")

print("S3 upload completed.")

Bucket 'kbs948efd22-bucket' created successfully.
Uploaded ./documents/PDFs/vr_mcl.pdf.metadata.json to s3://kbs948efd22-bucket/PDFs/vr_mcl.pdf.metadata.json
Uploaded ./documents/PDFs/finetune_fair_diffusion.pdf.metadata.json to s3://kbs948efd22-bucket/PDFs/finetune_fair_diffusion.pdf.metadata.json
Uploaded ./documents/PDFs/longlora.pdf.metadata.json to s3://kbs948efd22-bucket/PDFs/longlora.pdf.metadata.json
Uploaded ./documents/PDFs/metagpt.pdf to s3://kbs948efd22-bucket/PDFs/metagpt.pdf
Uploaded ./documents/PDFs/selfrag.pdf.metadata.json to s3://kbs948efd22-bucket/PDFs/selfrag.pdf.metadata.json
Uploaded ./documents/PDFs/zipformer.pdf.metadata.json to s3://kbs948efd22-bucket/PDFs/zipformer.pdf.metadata.json
Uploaded ./documents/PDFs/knowledge_card.pdf to s3://kbs948efd22-bucket/PDFs/knowledge_card.pdf
Uploaded ./documents/PDFs/zipformer.pdf to s3://kbs948efd22-bucket/PDFs/zipformer.pdf
Uploaded ./documents/PDFs/selfrag.pdf to s3://kbs948efd22-bucket/PDFs/selfrag.pdf
Uploaded ./documen

Now that the files have been uploaded to S3, let's create the necessary Knowledge Bases.

## Create the Amazon Bedrock Knowledge Bases

In this section we will go through all the steps to create and test the two Knowledge Bases.

We will first prepare the Amazon OpenSearch Serverless collection.

In [56]:
def updateDataAccessPolicy(solution_id):
    data_access_policy_name = "{}-kbcollection-access".format(solution_id)
    current_role_arn = sts_client.get_caller_identity()['Arn']
    response = opensearch_client.get_access_policy(
        name=data_access_policy_name,
        type='data'
    )
    policy_version = response["accessPolicyDetail"]["policyVersion"]
    existing_policy = response['accessPolicyDetail']['policy']
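    updated_policy = existing_policy.copy()
    # Add the caller's ARN to the first statement's principals (skip if already present)
    if current_role_arn not in updated_policy[0]['Principal']:
        updated_policy[0]['Principal'].append(current_role_arn)
    # Serialize with json.dumps so quoting and escaping are always valid JSON
    updated_policy = json.dumps(updated_policy)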

    response = opensearch_client.update_access_policy(
        description='dataAccessPolicy',
        name=data_access_policy_name,
        policy=updated_policy,
        policyVersion=policy_version,
        type='data'
    )
    print(response)

def createAOSSIndex(indexName, region, collection_id):
    # Set up AWS authentication
    service = 'aoss'
    credentials = session.get_credentials()
    awsauth = AWSV4SignerAuth(credentials, region, service)

    # Define index settings and mappings
    index_settings = {
        "settings": {
            "index.knn": "true"
        },
        "mappings": {
            "properties": {
                "vector": {
                    "type": "knn_vector",
                    "dimension": 1024,
                     "method": {
                         "name": "hnsw",
                         "engine": "faiss",
                         "space_type": "innerproduct",
                         "parameters": {
                             "ef_construction": 512,
                             "m": 16
                         },
                     },
                 },
                "text": {
                    "type": "text"
                },
                "text-metadata": {
                    "type": "text"
                }
            }
        }
    }

    # Build the OpenSearch client
    host = f"{collection_id}.{region}.aoss.amazonaws.com"
    oss_client = OpenSearch(
        hosts=[{'host': host, 'port': 443}],
        http_auth=awsauth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
        timeout=300
    )

    # Create index
    response = oss_client.indices.create(index=indexName, body=json.dumps(index_settings))
    print(response)


Let's now update the OpenSearch Serverless data access policy.

In [57]:
updateDataAccessPolicy(solution_id) # Adding the current role to the collection's data access policy
time.sleep(60) # Changes to the data access policy might take a bit to update

{'accessPolicyDetail': {'createdDate': 1729720373787, 'description': 'dataAccessPolicy', 'lastModifiedDate': 1729720652206, 'name': 'kbs948efd22-kbcollection-access', 'policy': [{'Rules': [{'Resource': ['collection/kbs948efd22-kbcollection'], 'Permission': ['aoss:CreateCollectionItems', 'aoss:UpdateCollectionItems', 'aoss:DescribeCollectionItems'], 'ResourceType': 'collection'}, {'Resource': ['index/kbs948efd22-kbcollection/*'], 'Permission': ['aoss:CreateIndex', 'aoss:DescribeIndex', 'aoss:ReadDocument', 'aoss:WriteDocument', 'aoss:UpdateIndex', 'aoss:DeleteIndex'], 'ResourceType': 'index'}], 'Principal': ['arn:aws:iam::776299153297:role/kbs948efd22-kbrole', 'arn:aws:sts::776299153297:assumed-role/AmazonSageMaker-ExecutionRole-20240702T000932/SageMaker']}], 'policyVersion': 'MTcyOTcyMDY1MjIwNl8y', 'type': 'data'}, 'ResponseMetadata': {'RequestId': 'b21af790-2ffa-4c35-bf34-5b21a0d1c6c9', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'b21af790-2ffa-4c35-bf34-5b21a0d1c6c9', 

As the final step to prepare our OpenSearch Serverless collection, we will create two indexes, one for each Knowledge Base.

In [58]:
indexName = "kb-index-" + solution_id
print("Index name:",indexName)
indexNameSummaries = "kb-index-summaries-" + solution_id
print("Index name for summaries:",indexNameSummaries)

Index name: kb-index-kbs948efd22
Index name for summaries: kb-index-summaries-kbs948efd22


In [59]:
createAOSSIndex(indexName, region, collection_id) # Create the AOSS index
createAOSSIndex(indexNameSummaries, region, collection_id) # Create the AOSS index for summaries

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'kb-index-kbs948efd22'}
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'kb-index-summaries-kbs948efd22'}


#### Create the Knowledge Base
In this section you will create the Knowledge Bases. Before creating a new KB, we need to define which embeddings model it will use. In this case we will use Amazon Titan Embeddings V2. 

<div class="alert alert-block alert-warning">
<b>Warning:</b> Make sure you have enabled Amazon Titan Embeddings V2 access in the Amazon Bedrock Console (model access). 
</div>

In [60]:
embeddingModelArn = "arn:aws:bedrock:{}::foundation-model/amazon.titan-embed-text-v2:0".format(region)

Now we can create our Amazon Bedrock Knowledge Bases. We have created an AWS CloudFormation template that takes care of the required configuration.

**Note that both KBs use the same OpenSearch Serverless collection but different indexes: *indexName* and *indexNameSummaries*.**

<div class="alert alert-block alert-warning">
The deployment of the AWS CloudFormation template should take around <b>1-2 minutes</b>.
    
You can also follow the deployment status in the AWS CloudFormation console. 
</div>

In [61]:
import botocore

def create_or_update_kb_infrastructure(solution_id, s3_bucket, embeddingModelArn, indexName, SummaryIndexName, region, account_id, collection_id):
    # Define the template parameters
    template_parameters = [
        {'ParameterKey': 'SolutionId', 'ParameterValue': solution_id},
        {'ParameterKey': 'InputBucketName', 'ParameterValue': s3_bucket},
        {'ParameterKey': 'EmbeddingModel', 'ParameterValue': embeddingModelArn},
        {'ParameterKey': 'IndexName', 'ParameterValue': indexName},
        {'ParameterKey': 'SummaryIndexName', 'ParameterValue': SummaryIndexName},
        {'ParameterKey': 'VectorFieldName', 'ParameterValue': 'vector'},
        {'ParameterKey': 'MetaDataFieldName', 'ParameterValue': 'text-metadata'},
        {'ParameterKey': 'TextFieldName', 'ParameterValue': 'text'},
        {'ParameterKey': 'CollectionArn', 'ParameterValue': f"arn:aws:aoss:{region}:{account_id}:collection/{collection_id}"},
    ]

    # Read the CloudFormation template from a file
    with open('templates/2-knowledgebase-infra.yaml', 'r') as template_file:
        template_body = template_file.read()

    stack_name = f"KB-E2E-KB-{solution_id}"

    try:
        # Check if the stack exists
        cloudformation.describe_stacks(StackName=stack_name)
        stack_exists = True
    except botocore.exceptions.ClientError as e:
        if "does not exist" in str(e):
            stack_exists = False
        else:
            raise

    if stack_exists:
        # Update the existing stack
        try:
            response = cloudformation.update_stack(
                StackName=stack_name,
                TemplateBody=template_body,
                Parameters=template_parameters,
                Capabilities=['CAPABILITY_IAM', 'CAPABILITY_AUTO_EXPAND', 'CAPABILITY_NAMED_IAM']
            )
            print(f'Stack update initiated: {response["StackId"]}')
            waiter = cloudformation.get_waiter('stack_update_complete')
        except botocore.exceptions.ClientError as e:
            if "No updates are to be performed" in str(e):
                print("No updates are necessary. Stack is already up to date.")
                stack_id = cloudformation.describe_stacks(StackName=stack_name)['Stacks'][0]['StackId']
            else:
                raise
        else:
            stack_id = response['StackId']
            waiter.wait(StackName=stack_id)
    else:
        # Create a new stack
        response = cloudformation.create_stack(
            StackName=stack_name,
            TemplateBody=template_body,
            Parameters=template_parameters,
            Capabilities=['CAPABILITY_IAM', 'CAPABILITY_AUTO_EXPAND', 'CAPABILITY_NAMED_IAM']
        )
        stack_id = response['StackId']
        print(f'Stack creation initiated: {stack_id}')
        waiter = cloudformation.get_waiter('stack_create_complete')
        waiter.wait(StackName=stack_id)

    # Retrieve the stack outputs
    stack_description = cloudformation.describe_stacks(StackName=stack_id)['Stacks'][0]
    outputs = stack_description['Outputs']
    kb_id = next((output['OutputValue'] for output in outputs if output['OutputKey'] == 'KBID'), None)
    datasource_id = next((output['OutputValue'].split('|')[1] for output in outputs if output['OutputKey'] == 'DS'), None)
    summaries_kb_id = next((output['OutputValue'] for output in outputs if output['OutputKey'] == 'SummaryKBID'), None)
    summaries_datasource_id = next((output['OutputValue'].split('|')[1] for output in outputs if output['OutputKey'] == 'SummaryDS'), None)

    # Print the output values
    for output in outputs:
        print(f"{output['OutputKey']}: {output['OutputValue']}")

    return kb_id, datasource_id, summaries_kb_id, summaries_datasource_id

In [62]:
kb_id, datasource_id, summaries_kb_id, summaries_datasource_id = create_or_update_kb_infrastructure(solution_id, s3_bucket, embeddingModelArn, indexName, indexNameSummaries, region, account_id, collection_id)

Stack creation initiated: arn:aws:cloudformation:us-east-1:776299153297:stack/KB-E2E-KB-kbs948efd22/f25e2d20-9189-11ef-9c68-12dd996c8721
KBID: U168IJOXJX
SummaryKBID: CIDQQ7USCL
SummaryDS: CIDQQ7USCL|0ICTQBG7SD
DS: U168IJOXJX|VO37NBGNVE


#### Sync the Knowledge Base
Now that we have created the data sources and associated them with the two Knowledge Bases, we can sync the data. 


Each time you add, modify, or remove files from the S3 bucket for a data source, you must sync the data source so that it is re-indexed to the knowledge base. Syncing is incremental, so Amazon Bedrock only processes the objects in your S3 bucket that have been added, modified, or deleted since the last sync.
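Since the same wait-and-poll logic is needed for both Knowledge Bases, it can be factored into a small helper. The following is a minimal sketch, not part of the original notebook: the function name `wait_for_ingestion_job`, its parameters, and the simulated status sequence are assumptions made for illustration. The terminal statuses are the ones checked in this notebook.

```python
import time

TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}

def wait_for_ingestion_job(get_status, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Call get_status() until it returns a terminal status, or give up after max_polls."""
    for _ in range(max_polls):
        status = get_status()
        print(status)
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("Ingestion job did not reach a terminal status in time")

# Simulated status sequence so the sketch runs without AWS access.
fake_statuses = iter(["STARTING", "IN_PROGRESS", "COMPLETE"])
final_status = wait_for_ingestion_job(lambda: next(fake_statuses), poll_seconds=0)

# With the notebook's client, the callable could look like this (not executed here):
# job = ingestion_job_response["ingestionJob"]
# wait_for_ingestion_job(
#     lambda: bedrock_agent_client.get_ingestion_job(
#         knowledgeBaseId=job["knowledgeBaseId"],
#         dataSourceId=job["dataSourceId"],
#         ingestionJobId=job["ingestionJobId"],
#     )["ingestionJob"]["status"]
# )
```

Passing the status lookup as a callable keeps the helper independent of the AWS client and adds an upper bound on the number of polls, so a stuck job cannot block the notebook indefinitely.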

In [63]:
ingestion_job_response = bedrock_agent_client.start_ingestion_job(
    knowledgeBaseId=kb_id,
    dataSourceId=datasource_id,
    description='Initial Ingestion'
)

In [64]:
status = bedrock_agent_client.get_ingestion_job(
    knowledgeBaseId=ingestion_job_response["ingestionJob"]["knowledgeBaseId"],
    dataSourceId=ingestion_job_response["ingestionJob"]["dataSourceId"],
    ingestionJobId=ingestion_job_response["ingestionJob"]["ingestionJobId"]
)["ingestionJob"]["status"]
print(status)
while status not in ["COMPLETE", "FAILED", "STOPPED"]:
    status = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId=ingestion_job_response["ingestionJob"]["knowledgeBaseId"],
        dataSourceId=ingestion_job_response["ingestionJob"]["dataSourceId"],
        ingestionJobId=ingestion_job_response["ingestionJob"]["ingestionJobId"]
    )["ingestionJob"]["status"]
    print(status)
    time.sleep(30)
print("Waiting for changes to take place in the vector database")
time.sleep(30) # Wait for all changes to take place

STARTING
STARTING
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
COMPLETE
Waiting for changes to take place in the vector database


In [65]:
summaries_ingestion_job_response = bedrock_agent_client.start_ingestion_job(
    knowledgeBaseId=summaries_kb_id,
    dataSourceId=summaries_datasource_id,
    description='Initial Ingestion'
)

In [None]:
status = bedrock_agent_client.get_ingestion_job(
    knowledgeBaseId=summaries_ingestion_job_response["ingestionJob"]["knowledgeBaseId"],
    dataSourceId=summaries_ingestion_job_response["ingestionJob"]["dataSourceId"],
    ingestionJobId=summaries_ingestion_job_response["ingestionJob"]["ingestionJobId"]
)["ingestionJob"]["status"]
print(status)
while status not in ["COMPLETE", "FAILED", "STOPPED"]:
    status = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId=summaries_ingestion_job_response["ingestionJob"]["knowledgeBaseId"],
        dataSourceId=summaries_ingestion_job_response["ingestionJob"]["dataSourceId"],
        ingestionJobId=summaries_ingestion_job_response["ingestionJob"]["ingestionJobId"]
    )["ingestionJob"]["status"]
    print(status)
    time.sleep(30)
print("Waiting for changes to take place in the vector database")
time.sleep(30) # Wait for all changes to take place

STARTING
STARTING
COMPLETE


#### Test the Knowledge Base

Now that the Knowledge Bases are available, we can test them using the **retrieve** API.

To test both Knowledge Bases, we have created three helper functions:

- **get_filename:**
    - Performs a vector search query against the summaries Knowledge Base containing document summaries
    - Takes a text query as input
    - Retrieves up to 5 most relevant results based on semantic similarity
- **construct_metadata_filter:**
    - Creates a metadata filter structure for filename-based filtering
    - Generates an "equals" filter condition when a valid filename is present
- **process_query:**
    - Executes a filtered search against the documents Knowledge Base
    - Takes both query text and a filename as parameters
    - Creates the metadata filter using the *construct_metadata_filter*
    - Applies metadata filtering to limit results to specific files using the *filename* metadata
    - Returns up to 5 most relevant results that match both the semantic query and filename filter
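The filter built by *construct_metadata_filter* targets a single file. As a hedged sketch of how the same idea could extend to several candidate files, the helper below switches to the `in` operator of the retrieval filter when more than one filename is given. The function name `build_filename_filter` is hypothetical and not part of this notebook, and the multi-file case is an assumption about how you might use the filter, not something the notebook does.

```python
def build_filename_filter(filenames):
    """Build a retrieval filter on the 'filename' metadata attribute.

    Returns None when no usable filename is given, an 'equals' filter for a
    single filename, and an 'in' filter for several filenames.
    """
    valid = [name for name in filenames if name and name != "unknown"]
    if not valid:
        return None
    if len(valid) == 1:
        return {"equals": {"key": "filename", "value": valid[0]}}
    return {"in": {"key": "filename", "value": valid}}

print(build_filename_filter(["longlora.pdf"]))
print(build_filename_filter(["longlora.pdf", "selfrag.pdf"]))
```

When the helper returns `None`, the `filter` key should be left out of the `vectorSearchConfiguration` rather than passed as `None`.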

In [None]:
def get_filename(text):
    response = bedrock_agent_runtime_client.retrieve(
        knowledgeBaseId=summaries_kb_id,
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": 5
            }
        },
        retrievalQuery={
            'text': text
        }
    )
    return response

def process_query(text, filename):
    metadata_filter = construct_metadata_filter(filename)
    print('Here is the prepared metadata filter:')
    print(metadata_filter)

    vector_search_configuration = {"numberOfResults": 5}
    # Only pass a filter when one was built; boto3 parameter validation rejects None
    if metadata_filter:
        vector_search_configuration["filter"] = metadata_filter

    response = bedrock_agent_runtime_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalConfiguration={
            "vectorSearchConfiguration": vector_search_configuration
        },
        retrievalQuery={
            'text': text
        }
    )
    return response

def construct_metadata_filter(filename):
    # No usable filename means no filter
    if not filename or filename == 'unknown':
        return None

    # Match only chunks whose "filename" metadata attribute equals the given name
    return {
        "equals": {
            "key": "filename",
            "value": filename
        }
    }

Let's now test the summaries Knowledge Base

In [None]:
text = "Which are the Retrieval-based Evaluation results for LongLora"
get_filename(text)

And finally test the documents Knowledge Base using longlora.pdf filename as a filter

In [None]:
process_query("Which are the Retrieval-based Evaluation results for LongLora", "longlora.pdf")

As you can see, we are performing semantic search only on the *longlora.pdf* file chunks.
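To connect the two steps, the result of the summaries search has to be turned into a filename for the filtered search. The sketch below shows one possible way to do this, under the assumption that each summary result carries the S3 URI of its source file in `location.s3Location.uri` and that summary files are named after the PDF with a `.json` extension, as generated earlier in this notebook. The function names and the sample response are hypothetical.

```python
import posixpath

def filename_from_summary_result(result):
    """Map a summary result's S3 URI (.../<name>.json) back to '<name>.pdf'."""
    uri = result.get("location", {}).get("s3Location", {}).get("uri")
    if not uri:
        return None
    stem, extension = posixpath.splitext(posixpath.basename(uri))
    return stem + ".pdf" if extension == ".json" else None

def candidate_filenames(retrieve_response):
    """Return the distinct PDF filenames referenced by the results, in ranked order."""
    names = []
    for result in retrieve_response.get("retrievalResults", []):
        name = filename_from_summary_result(result)
        if name and name not in names:
            names.append(name)
    return names

# Made-up response fragment with the assumed shape; not real output.
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "..."},
            "location": {
                "type": "S3",
                "s3Location": {"uri": "s3://example-bucket/Summary-PDFs/longlora.json"},
            },
        }
    ]
}
print(candidate_filenames(sample_response))
```

The first entry of the returned list could then be passed as the `filename` argument of *process_query*.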

### Store variables to use in Notebook 2

As a final step, let's store the variables that we need in the second notebook.

In [None]:
%store kb_id
%store summaries_kb_id
%store s3_bucket
%store solution_id

Great, we are all set!

## Next Steps
After completing this notebook, proceed to [02-agentic-rag-converse-api.ipynb](./02-agentic-rag-converse-api.ipynb) to implement the agentic RAG system that will leverage these Knowledge Bases.