# Accuracy improvement for Knowledge Bases for Amazon Bedrock

## Setup 
Before running the rest of this notebook, run the cells below to make sure the required libraries are installed (uncomment the `%pip` lines if needed) and to connect to Bedrock.

In [1]:
# %pip install -U opensearch-py==2.3.1 --quiet
# %pip install -U boto3 --quiet
# %pip install -U retrying==1.3.4 --quiet

In [2]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [3]:
import warnings
warnings.filterwarnings('ignore')

## Create a vector store - OpenSearch Serverless index

### Step 1 - Create OSS policies and collection
First, we need to create a vector store. In this section we will use *Amazon OpenSearch Serverless*.

Amazon OpenSearch Serverless is a serverless option in Amazon OpenSearch Service. As a developer, you can use OpenSearch Serverless to run petabyte-scale workloads without configuring, managing, and scaling OpenSearch clusters. You get the same interactive millisecond response times as OpenSearch Service with the simplicity of a serverless environment. Pay only for what you use by automatically scaling resources to provide the right amount of capacity for your application—without impacting data ingestion.
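
A collection needs an encryption policy, a network policy, and a data access policy before it can be used. The `create_policies_in_oss` helper used below lives in `utility.py`, which is not shown here, so the following is only a sketch of what such policy documents could look like. The function name `build_oss_policies` is hypothetical, and public network access is an assumption that only suits a demo.

```python
import json


def build_oss_policies(collection_name, principal_arns):
    """Build the three OpenSearch Serverless policy documents as JSON strings.

    Hypothetical helper: the real policies created by utility.py may differ.
    """
    resource = f"collection/{collection_name}"

    # Encryption policy: use an AWS-owned KMS key for the collection.
    encryption = {
        "Rules": [{"Resource": [resource], "ResourceType": "collection"}],
        "AWSOwnedKey": True,
    }

    # Network policy: public endpoint (assumption, acceptable for a demo only).
    network = [{
        "Rules": [{"Resource": [resource], "ResourceType": "collection"}],
        "AllowFromPublic": True,
    }]

    # Data access policy: let the given principals manage the collection and its indexes.
    data_access = [{
        "Rules": [
            {
                "Resource": [resource],
                "Permission": [
                    "aoss:CreateCollectionItems",
                    "aoss:DescribeCollectionItems",
                    "aoss:UpdateCollectionItems",
                    "aoss:DeleteCollectionItems",
                ],
                "ResourceType": "collection",
            },
            {
                "Resource": [f"index/{collection_name}/*"],
                "Permission": [
                    "aoss:CreateIndex",
                    "aoss:DescribeIndex",
                    "aoss:UpdateIndex",
                    "aoss:DeleteIndex",
                    "aoss:ReadDocument",
                    "aoss:WriteDocument",
                ],
                "ResourceType": "index",
            },
        ],
        "Principal": list(principal_arns),
    }]

    return {
        "encryption": json.dumps(encryption),
        "network": json.dumps(network),
        "data": json.dumps(data_access),
    }
```

Each JSON string would be passed as the `policy` argument of `create_security_policy` (types `encryption` and `network`) or `create_access_policy` (type `data`) on the `opensearchserverless` client.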

In [4]:
import json
import os
import boto3
from utility import create_bedrock_execution_role, create_oss_policy_attach_bedrock_execution_role, create_policies_in_oss
import random
from retrying import retry
import time
from pprint import pprint as pp 

suffix = random.randrange(200, 900)

sts_client = boto3.client('sts')
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name

bedrock_agent_client = boto3_session.client('bedrock-agent', region_name=region_name)
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region_name)

service = 'aoss'
s3_client = boto3.client('s3')
account_id = sts_client.get_caller_identity()["Account"]
s3_suffix = f"{region_name}-{account_id}"

In [5]:
print(boto3.__version__)

1.34.143


In [6]:
bucket_name = 'my-kb-dataset-test-bucket-2024'  # Replace with the name of an existing S3 bucket that holds your documents

vector_store_name = f'bedrock-vectordb-rag-{suffix}'
index_name = f"bedrock-vectordb-rag-index-{suffix}"
aoss_client = boto3_session.client('opensearchserverless')
bedrock_kb_execution_role = create_bedrock_execution_role(bucket_name)
bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']

In [7]:
# create security, network and data access policies within OSS
encryption_policy, network_policy, access_policy = create_policies_in_oss( vector_store_name=vector_store_name,
                                                                           aoss_client=aoss_client,
                                                                           bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn)
collection = aoss_client.create_collection(name=vector_store_name,type='VECTORSEARCH')

In [8]:
pp(collection)
time.sleep(10)

{'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
                                      'content-length': '316',
                                      'content-type': 'application/x-amz-json-1.0',
                                      'date': 'Thu, 18 Jul 2024 19:48:58 GMT',
                                      'x-amzn-requestid': 'dd54881b-4048-4a3c-adbe-4bb73580c094'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'dd54881b-4048-4a3c-adbe-4bb73580c094',
                      'RetryAttempts': 0},
 'createCollectionDetail': {'arn': 'arn:aws:aoss:us-east-1:507922848584:collection/94qa58u59f50h2eiuwp3',
                            'createdDate': 1721332138246,
                            'id': '94qa58u59f50h2eiuwp3',
                            'kmsKeyArn': 'auto',
                            'lastModifiedDate': 1721332138246,
                            'name': 'bedrock-vectordb-rag-383',
                            'standbyReplicas': '

In [9]:
collection_id = collection['createCollectionDetail']['id']
host = collection_id + '.' + region_name + '.aoss.amazonaws.com'
pp(host)

'94qa58u59f50h2eiuwp3.us-east-1.aoss.amazonaws.com'


In [10]:
# create oss policy and attach it to Bedrock execution role
create_oss_policy_attach_bedrock_execution_role(collection_id=collection_id,
                                                bedrock_kb_execution_role=bedrock_kb_execution_role)

Opensearch serverless arn:  arn:aws:iam::507922848584:policy/AmazonBedrockOSSPolicyForKnowledgeBase_600
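
The fixed `time.sleep(10)` above may not be long enough, because a collection has to reach the `ACTIVE` status before an index can be created in it. A minimal sketch of a polling alternative follows, assuming the `batch_get_collection` call of the `opensearchserverless` client. The helper name `wait_for_collection_active` and its timeout defaults are hypothetical.

```python
import time


def wait_for_collection_active(aoss_client, collection_id,
                               poll_seconds=10, timeout_seconds=600,
                               sleep=time.sleep):
    """Poll the collection until it is ACTIVE; raise on FAILED or timeout."""
    waited = 0
    while True:
        response = aoss_client.batch_get_collection(ids=[collection_id])
        details = response.get("collectionDetails", [])
        status = details[0]["status"] if details else "UNKNOWN"

        if status == "ACTIVE":
            return details[0]
        if status == "FAILED":
            raise RuntimeError(f"Collection {collection_id} failed to create")
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Collection {collection_id} still {status} after {waited}s"
            )

        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook it could be called as `wait_for_collection_active(aoss_client, collection_id)` before the index is created.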


### Step 2 - Create vector index

In [11]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
credentials = boto3.Session().get_credentials()
awsauth = auth = AWSV4SignerAuth(credentials, region_name, service)

# Note: this overrides the index_name set earlier; this is the name used from here on
index_name = f"bedrock-sample-index-{suffix}"
body_json = {
   "settings": {
      "index.knn": "true"
   },
   "mappings": {
      "properties": {
         "vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "engine": "faiss",  
                "space_type": "l2",
                "parameters": {
                    "ef_construction": 200,
                    "m": 16
                }
            }
        },
         "text": {
            "type": "text"
         },
         "text-metadata": {
            "type": "text"
         }
      }
   }
}
# Build the OpenSearch client
oss_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)
# It can take up to a minute for data access rules to be enforced
time.sleep(60)

In [12]:
# Create index
response = oss_client.indices.create(index=index_name, body=json.dumps(body_json))
print('\nCreating index:')
pp(response)


Creating index:
{'acknowledged': True,
 'index': 'bedrock-sample-index-383',
 'shards_acknowledged': True}
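
The `dimension` in the index mapping has to match the length of the vectors produced by the embedding model (1536 for `amazon.titan-embed-text-v1`, the model used in this notebook), and the field names have to match the `fieldMapping` given to the knowledge base. A small sketch that keeps these values in one place is shown below. The helper `build_knn_index_body` and the `EMBEDDING_DIMENSIONS` table are hypothetical names, not part of any library.

```python
# Vector length per embedding model; only the model used in this notebook is listed.
EMBEDDING_DIMENSIONS = {"amazon.titan-embed-text-v1": 1536}


def build_knn_index_body(dimension, vector_field="vector",
                         text_field="text", metadata_field="text-metadata"):
    """Return an index body equivalent to body_json for the given dimension."""
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {"index.knn": "true"},
        "mappings": {
            "properties": {
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        "space_type": "l2",
                        "parameters": {"ef_construction": 200, "m": 16},
                    },
                },
                text_field: {"type": "text"},
                metadata_field: {"type": "text"},
            }
        },
    }


body = build_knn_index_body(EMBEDDING_DIMENSIONS["amazon.titan-embed-text-v1"])
```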


## Create Knowledge Base
Steps:
- Initialize the OpenSearch Serverless configuration, which includes the collection ARN, index name, vector field, text field, and metadata field.
- Initialize the chunking strategy, which determines how the KB splits documents into chunks, using the sizes given in `chunkingStrategyConfiguration`.
- Initialize the data source configuration (S3 in this notebook; a web URL configuration is shown commented out), which will be used to create the data source object later.
- Initialize the Titan embeddings model ARN, which will be used to create the embeddings for each text chunk.
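
To make the chunking step more concrete, here is a rough sketch of fixed-size chunking with a percentage overlap, mirroring the `maxTokens` and `overlapPercentage` settings. It is an illustration only: it treats whitespace-separated words as tokens, which is an assumption, and it does not reproduce the tokenizer or boundary handling that Bedrock actually uses. The function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(tokens, max_tokens=512, overlap_percentage=20):
    """Split a token list into windows of max_tokens that share an overlap."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")

    overlap = max_tokens * overlap_percentage // 100
    step = max_tokens - overlap  # how far each new window advances

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


words = "the quick brown fox jumps over the lazy dog again".split()
chunks = fixed_size_chunks(words, max_tokens=4, overlap_percentage=50)
```

With 10 words, a window of 4, and 50% overlap, each chunk starts 2 words after the previous one, so neighbouring chunks share half of their content.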

In [13]:
opensearchServerlessConfiguration = {
            "collectionArn": collection["createCollectionDetail"]['arn'],
            "vectorIndexName": index_name,
            "fieldMapping": {
                "vectorField": "vector",
                "textField": "text",
                "metadataField": "text-metadata"
            }
        }

# # Different chunking strategies
# # FIXED_SIZE Chunking
# chunkingStrategyConfiguration = {
#                                     "chunkingStrategy": "FIXED_SIZE",
#                                     "fixedSizeChunkingConfiguration": {
#                                                                             "maxTokens": 512,
#                                                                             "overlapPercentage": 20
#                                                                         }
#                                 }

# HIERARCHICAL Chunking 
chunkingStrategyConfiguration = {
                                    "chunkingStrategy": "HIERARCHICAL",      
                                    "hierarchicalChunkingConfiguration": {  
                                                                            'levelConfigurations': [
                                                                                {
                                                                                    'maxTokens': 1500
                                                                                },
                                                                                {
                                                                                    'maxTokens': 300
                                                                                }
                                                                            ],
                                                                            'overlapTokens': 60
                                                                        }
                                }

# # SEMANTIC Chunking 
# chunkingStrategyConfiguration = {
#                                     "chunkingStrategy": "SEMANTIC",
#                                     "semanticChunkingConfiguration": {          
#                                                                          'breakpointPercentileThreshold': 95,
#                                                                          'bufferSize': 1,
#                                                                          'maxTokens': 300
#                                                                      }
#                                 }


## Different data sources
# Web URL 
# my_url = <ENTER YOUR WEB URL> 
# webConfiguration = {"sourceConfiguration": {
#                           "urlConfiguration": {
#                            "seedUrls": [{
#                                     "url": my_url                  
#                                 }]
#                             }
#                         },
#                      "crawlerConfiguration": {
#                             "crawlerLimits": {
#                                 "rateLimit": 50
#                             },
#                             "scope": "HOST_ONLY"
#                         }
#                    }         

# S3 
s3Configuration = {
    "bucketArn": f"arn:aws:s3:::{bucket_name}",
}
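
The hierarchical configuration above defines two layers: large parent chunks (up to 1500 tokens) and smaller child chunks (up to 300 tokens) with 60 overlapping tokens. The sketch below illustrates the idea of keeping a child-to-parent link. It is a simplification under stated assumptions: list items stand in for tokens, overlap is applied only between child chunks, and the function name `hierarchical_chunks` is hypothetical. Bedrock's own splitting logic is not shown in this notebook and may differ.

```python
def hierarchical_chunks(tokens, parent_max=1500, child_max=300, overlap_tokens=60):
    """Return (parents, children); each child is (parent_index, child_tokens)."""
    if not 0 <= overlap_tokens < child_max <= parent_max:
        raise ValueError("expected 0 <= overlap_tokens < child_max <= parent_max")

    parents, children = [], []
    step = child_max - overlap_tokens  # how far each child window advances

    for p_start in range(0, len(tokens), parent_max):
        parent = tokens[p_start:p_start + parent_max]
        parent_index = len(parents)
        parents.append(parent)

        # Cut the parent into overlapping child windows.
        for c_start in range(0, len(parent), step):
            children.append((parent_index, parent[c_start:c_start + child_max]))
            if c_start + child_max >= len(parent):
                break
    return parents, children


parents, children = hierarchical_chunks(
    list(range(20)), parent_max=10, child_max=4, overlap_tokens=1
)
```

The small child chunks are the units that get embedded and matched, while the stored parent index makes it possible to hand the larger surrounding context to the model.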

In [14]:
embeddingModelArn = f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v1"

name = f"bedrock-sample-knowledge-base-{suffix}"
description = "Bedrock Knowledge Bases for Web URL and S3 Connector"
roleArn = bedrock_kb_execution_role_arn

Provide the above configurations as input to the `create_knowledge_base` method, which will create the Knowledge base.

In [15]:
# Create a KnowledgeBase
from retrying import retry

@retry(wait_random_min=1000, wait_random_max=2000,stop_max_attempt_number=7)
def create_knowledge_base_func():
    create_kb_response = bedrock_agent_client.create_knowledge_base(
                                                                        name = name,
                                                                        description = description,
                                                                        roleArn = roleArn,
                                                                        knowledgeBaseConfiguration = {
                                                                                                        "type": "VECTOR",
                                                                                                        "vectorKnowledgeBaseConfiguration": {
                                                                                                            "embeddingModelArn": embeddingModelArn
                                                                                                        }
                                                                                                    },
                                                                        storageConfiguration = {
                                                                                                    "type": "OPENSEARCH_SERVERLESS",
                                                                                                    "opensearchServerlessConfiguration":opensearchServerlessConfiguration
                                                                                                }
                                                                    )
    return create_kb_response["knowledgeBase"]
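
The `@retry` decorator above retries the call up to 7 times, waiting a random 1 to 2 seconds between attempts, and re-raises the last error if every attempt fails. For readers unfamiliar with the `retrying` package, the same behaviour can be sketched with the standard library alone. The name `retry_call` and its parameters are hypothetical and only illustrate the idea.

```python
import random
import time


def retry_call(func, max_attempts=7, wait_min_s=1.0, wait_max_s=2.0,
               sleep=time.sleep):
    """Call func(); on any exception wait a random interval and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(wait_min_s, wait_max_s))
```

Retrying on every exception is a blunt tool: it also repeats calls that can never succeed, such as a validation error, so narrowing the retried exception types is worth considering in production code.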

In [16]:
try:
    kb = create_knowledge_base_func()
except Exception as err:
    print(f"{err=}, {type(err)=}")

In [17]:
pp(kb)

{'createdAt': datetime.datetime(2024, 7, 18, 19, 50, 9, 937932, tzinfo=tzutc()),
 'description': 'Bedrock Knowledge Bases for Web URL and S3 Connector',
 'knowledgeBaseArn': 'arn:aws:bedrock:us-east-1:507922848584:knowledge-base/DARMEMCMOS',
 'knowledgeBaseConfiguration': {'type': 'VECTOR',
                                'vectorKnowledgeBaseConfiguration': {'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1'}},
 'knowledgeBaseId': 'DARMEMCMOS',
 'name': 'bedrock-sample-knowledge-base-383',
 'roleArn': 'arn:aws:iam::507922848584:role/AmazonBedrockExecutionRoleForKnowledgeBase_600',
 'status': 'CREATING',
 'storageConfiguration': {'opensearchServerlessConfiguration': {'collectionArn': 'arn:aws:aoss:us-east-1:507922848584:collection/94qa58u59f50h2eiuwp3',
                                                                'fieldMapping': {'metadataField': 'text-metadata',
                                                                               

In [18]:
# Get KnowledgeBase 
get_kb_response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId = kb['knowledgeBaseId'])

Next we need to create a data source, which will be associated with the knowledge base created above. Once the data source is ready, we can then start to ingest the documents.

In [19]:
# # Just for documentation :) 
# help(bedrock_agent_client.create_data_source)

In [20]:
# Create an S3 DataSource in KnowledgeBase 
create_ds_response = bedrock_agent_client.create_data_source(
                                                                name = name,
                                                                description = description,
                                                                knowledgeBaseId = kb['knowledgeBaseId'],
                                                                dataDeletionPolicy = 'DELETE',
                                                                dataSourceConfiguration = {
                                                                    # # For S3 
                                                                    "type": "S3",
                                                                    "s3Configuration" : s3Configuration
                                                                    # # For Web URL 
                                                                    # "type": "WEB",
                                                                    # "webConfiguration":webConfiguration                                                                    
                                                                },
                                                                vectorIngestionConfiguration = {
                                                                    "chunkingConfiguration": chunkingStrategyConfiguration
                                                                }
                                                            )
ds = create_ds_response["dataSource"]
pp(ds)

{'createdAt': datetime.datetime(2024, 7, 18, 19, 50, 18, 982893, tzinfo=tzutc()),
 'dataDeletionPolicy': 'DELETE',
 'dataSourceConfiguration': {'s3Configuration': {'bucketArn': 'arn:aws:s3:::my-kb-dataset-test-bucket-2024'},
                             'type': 'S3'},
 'dataSourceId': 'Y6SFR9NSQB',
 'description': 'Bedrock Knowledge Bases for Web URL and S3 Connector',
 'knowledgeBaseId': 'DARMEMCMOS',
 'name': 'bedrock-sample-knowledge-base-383',
 'status': 'AVAILABLE',
 'updatedAt': datetime.datetime(2024, 7, 18, 19, 50, 18, 982893, tzinfo=tzutc()),
 'vectorIngestionConfiguration': {'chunkingConfiguration': {'chunkingStrategy': 'HIERARCHICAL',
                                                            'hierarchicalChunkingConfiguration': {'levelConfigurations': [{'maxTokens': 1500},
                                                                                                                          {'maxTokens': 300}],
                                                            

In [21]:
# Get S3 DataSource 
bedrock_agent_client.get_data_source(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])

{'ResponseMetadata': {'RequestId': '35b665b2-b12a-4e68-b5f3-4424cbf2f83b',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Thu, 18 Jul 2024 19:50:19 GMT',
   'content-type': 'application/json',
   'content-length': '657',
   'connection': 'keep-alive',
   'x-amzn-requestid': '35b665b2-b12a-4e68-b5f3-4424cbf2f83b',
   'x-amz-apigw-id': 'bH6_RGJcoAMEuow=',
   'x-amzn-trace-id': 'Root=1-669971fb-64809b30540a52cf4963a86e'},
  'RetryAttempts': 0},
 'dataSource': {'createdAt': datetime.datetime(2024, 7, 18, 19, 50, 18, 982893, tzinfo=tzutc()),
  'dataDeletionPolicy': 'DELETE',
  'dataSourceConfiguration': {'s3Configuration': {'bucketArn': 'arn:aws:s3:::my-kb-dataset-test-bucket-2024'},
   'type': 'S3'},
  'dataSourceId': 'Y6SFR9NSQB',
  'description': 'Bedrock Knowledge Bases for Web URL and S3 Connector',
  'knowledgeBaseId': 'DARMEMCMOS',
  'name': 'bedrock-sample-knowledge-base-383',
  'status': 'AVAILABLE',
  'updatedAt': datetime.datetime(2024, 7, 18, 19, 50, 18, 982893, tzinfo=tzut

### Start ingestion job
Once the KB and data source are created, we can start the ingestion job.
During the ingestion job, the KB fetches the documents from the data source, pre-processes them to extract text, splits the text into chunks according to the configured chunking strategy, creates an embedding for each chunk, and writes the embeddings to the vector database (OSS in this case).

In [22]:
time.sleep(10)

In [23]:
# Start an ingestion job
start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])

In [24]:
job = start_job_response["ingestionJob"]
pp(job)

{'dataSourceId': 'Y6SFR9NSQB',
 'ingestionJobId': 'K83WPI3TED',
 'knowledgeBaseId': 'DARMEMCMOS',
 'startedAt': datetime.datetime(2024, 7, 18, 19, 50, 29, 695810, tzinfo=tzutc()),
 'statistics': {'numberOfDocumentsDeleted': 0,
                'numberOfDocumentsFailed': 0,
                'numberOfDocumentsScanned': 0,
                'numberOfMetadataDocumentsModified': 0,
                'numberOfMetadataDocumentsScanned': 0,
                'numberOfModifiedDocumentsIndexed': 0,
                'numberOfNewDocumentsIndexed': 0},
 'status': 'STARTING',
 'updatedAt': datetime.datetime(2024, 7, 18, 19, 50, 29, 695810, tzinfo=tzutc())}


In [25]:
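# Poll the ingestion job until it reaches a terminal state
while job['status'] not in ('COMPLETE', 'FAILED'):
    time.sleep(10)  # wait between polls instead of calling the API in a tight loop
    get_job_response = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId = kb['knowledgeBaseId'],
        dataSourceId = ds["dataSourceId"],
        ingestionJobId = job["ingestionJobId"]
    )
    job = get_job_response["ingestionJob"]
pp(job)
time.sleep(40)  # extra pause before querying the knowledge base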

{'dataSourceId': 'Y6SFR9NSQB',
 'ingestionJobId': 'K83WPI3TED',
 'knowledgeBaseId': 'DARMEMCMOS',
 'startedAt': datetime.datetime(2024, 7, 18, 19, 50, 29, 695810, tzinfo=tzutc()),
 'statistics': {'numberOfDocumentsDeleted': 0,
                'numberOfDocumentsFailed': 0,
                'numberOfDocumentsScanned': 4,
                'numberOfMetadataDocumentsModified': 0,
                'numberOfMetadataDocumentsScanned': 0,
                'numberOfModifiedDocumentsIndexed': 0,
                'numberOfNewDocumentsIndexed': 4},
 'status': 'COMPLETE',
 'updatedAt': datetime.datetime(2024, 7, 18, 19, 50, 48, 228320, tzinfo=tzutc())}
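
The `statistics` block in the job output is a quick way to confirm that every scanned document made it into the index. A small sketch that condenses it into a few numbers follows. The helper name `summarize_ingestion` is hypothetical, and the sample dictionary simply repeats the values printed above.

```python
def summarize_ingestion(job):
    """Condense an ingestion job dict into status, counts, and an ok flag."""
    stats = job.get("statistics", {})
    indexed = (stats.get("numberOfNewDocumentsIndexed", 0)
               + stats.get("numberOfModifiedDocumentsIndexed", 0))
    failed = stats.get("numberOfDocumentsFailed", 0)
    return {
        "status": job.get("status"),
        "scanned": stats.get("numberOfDocumentsScanned", 0),
        "indexed": indexed,
        "failed": failed,
        "ok": job.get("status") == "COMPLETE" and failed == 0,
    }


sample_job = {
    "status": "COMPLETE",
    "statistics": {
        "numberOfDocumentsDeleted": 0,
        "numberOfDocumentsFailed": 0,
        "numberOfDocumentsScanned": 4,
        "numberOfModifiedDocumentsIndexed": 0,
        "numberOfNewDocumentsIndexed": 4,
    },
}
summary = summarize_ingestion(sample_job)
```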


In [26]:
kb_id = kb["knowledgeBaseId"]
print(kb_id)

DARMEMCMOS


In [27]:
%store kb_id

Stored 'kb_id' (str)


## Test the knowledge base
### Using RetrieveAndGenerate API
Behind the scenes, the RetrieveAndGenerate API converts the query into an embedding, searches the knowledge base, augments the foundation model prompt with the search results as context, and returns the FM-generated response to the question. For multi-turn conversations, Knowledge Bases manages the short-term memory of the conversation to provide more contextual results.

The output of the RetrieveAndGenerate API includes the generated response, source attribution as well as the retrieved text chunks.
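
For the multi-turn case, the response carries a `sessionId` that can be passed back on the next call so that the service continues the same conversation. A minimal sketch of that pattern is shown below. The wrapper name `ask_kb` is hypothetical, and `runtime_client` stands for a `bedrock-agent-runtime` client such as the one created earlier.

```python
def ask_kb(runtime_client, kb_id, model_arn, question, session_id=None):
    """Send one question; return (answer_text, session_id) for the next turn."""
    kwargs = {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }
    # Only send sessionId on follow-up turns; the first call creates the session.
    if session_id:
        kwargs["sessionId"] = session_id

    response = runtime_client.retrieve_and_generate(**kwargs)
    return response["output"]["text"], response.get("sessionId")
```

A follow-up question would then reuse the returned value, for example `answer, sid = ask_kb(client, kb_id, model_arn, "First question")` followed by `ask_kb(client, kb_id, model_arn, "And the year before?", session_id=sid)`.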

In [28]:
# Try out the KB using the RetrieveAndGenerate API
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"  # Change this to any model supported by Knowledge Bases
model_arn = f'arn:aws:bedrock:{region_name}::foundation-model/{model_id}'  # use the session region instead of a hard-coded one

In [29]:
query = "What is Amazon doing towards generative AI ? and how is Amazon's performance in last couple of years ?"
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        'text': query
    },
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': kb_id,
            'modelArn': model_arn
        }
    },
)

generated_text = response['output']['text']

print(generated_text)

Amazon has been investing heavily in Large Language Models (LLMs) and Generative AI, which it believes will transform and improve virtually every customer experience. Amazon has been working on its own LLMs for a while and plans to continue investing substantially in these models across all of its consumer, seller, brand, and creator experiences. Additionally, Amazon is democratizing this technology through AWS so that companies of all sizes can leverage Generative AI by offering price-performant machine learning chips and enabling companies to choose from various LLMs and build applications with AWS features. Regarding Amazon's performance in the last couple of years, the letter mentions that despite 2022 being a challenging macroeconomic year, Amazon still found a way to grow demand and innovate in its largest businesses. However, it also faced some operating challenges and had to make adjustments in investment decisions. Overall, the CEO expresses optimism that Amazon will emerge st

In [30]:
## print out the source attribution/citations from the original documents to see if the response generated belongs to the context.
citations = response["citations"]
contexts = []
for citation in citations:
    retrievedReferences = citation["retrievedReferences"]
    for reference in retrievedReferences:
        contexts.append(reference["content"]["text"])

print(contexts)

['Grocery is a big growth opportunity for Amazon.   Amazon Business is another example of an investment where our ecommerce and logistics capabilities position us well to pursue this large market segment. Amazon Business allows businesses, municipalities, and organizations to procure products like office supplies and other bulk items easily and at great savings. While some areas of the economy have struggled over the past few years, Amazon Business has thrived. Why? Because the team has translated what it means to deliver selection, value, and convenience into a business procurement setting, constantly listening to and learning from customers, and innovating on their behalf. Some people have never heard of Amazon Business, but, our business customers love it. Amazon Business launched in 2015 and today drives roughly $35B in annualized gross sales. More than six million active customers, including 96 of the global Fortune 100 companies, are enjoying Amazon Business’ one-stop shopping, r
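
Printing the raw context list makes it hard to see which part of the answer each source supports. As a sketch, each citation can be reduced to the generated text span and the S3 URIs of the chunks behind it. The helper name `citations_with_sources` is hypothetical, and the sample response is a made-up stand-in with the same shape as the API output, not real data from this knowledge base.

```python
def citations_with_sources(response):
    """Pair each generated text part with the source URIs that back it."""
    rows = []
    for citation in response.get("citations", []):
        part = citation.get("generatedResponsePart", {}).get("textResponsePart", {})
        uris = []
        for reference in citation.get("retrievedReferences", []):
            uri = reference.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in uris:
                uris.append(uri)  # keep order, drop duplicates
        rows.append({"text": part.get("text", ""), "sources": uris})
    return rows


# Illustrative response only; bucket and file names are placeholders.
sample_response = {
    "citations": [
        {
            "generatedResponsePart": {"textResponsePart": {"text": "Part one."}},
            "retrievedReferences": [
                {"content": {"text": "chunk a"},
                 "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/a.pdf"}}},
                {"content": {"text": "chunk b"},
                 "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/a.pdf"}}},
            ],
        },
        {
            "generatedResponsePart": {"textResponsePart": {"text": "Part two."}},
            "retrievedReferences": [],
        },
    ]
}
rows = citations_with_sources(sample_response)
```

A citation with an empty `sources` list is a useful signal that a part of the answer has no retrieved chunk behind it.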

### Retrieve API
The Retrieve API converts user queries into embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the Retrieve API includes the retrieved text chunks, the location type and URI of the source data, and the relevance scores of the retrievals.

In [31]:
# Retrieve API: fetch only the relevant context
query = "How many new positions were opened across Amazon's fulfillment and delivery network?" 

relevant_documents = bedrock_agent_runtime_client.retrieve(
    retrievalQuery= {
        'text': query
    },
    knowledgeBaseId=kb_id,
    retrievalConfiguration= {
        'vectorSearchConfiguration': {
            'numberOfResults': 3 # fetch the top 3 chunks that most closely match the query
        }
    }
)

In [32]:
print(relevant_documents["retrievalResults"])

[{'content': {'text': 'Dear shareholders:   Over the past 25 years at Amazon, I’ve had the opportunity to write many narratives, emails, letters, and keynotes for employees, customers, and partners. But, this is the first time I’ve had the honor of writing our annual shareholder letter as CEO of Amazon. Jeff set the bar high on these letters, and I will try to keep them worth reading.   When the pandemic started in early 2020, few people thought it would be as expansive or long-running as it’s been. Whatever role Amazon played in the world up to that point became further magnified as most physical venues shut down for long periods of time and people spent their days at home. This meant that hundreds of millions of people relied on Amazon for PPE, food, clothing, and various other items that helped them navigate this unprecedented time. Businesses and governments also had to shift, practically overnight, from working with colleagues and technology on-premises to working remotely. AWS pl
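
The raw list is hard to read, so a short sketch of pulling out the fields described above (relevance score, location type, source URI, and a text preview) may help. The helper name `top_results` and the `min_score` threshold are hypothetical, and the sample response is a made-up stand-in with the same shape as the Retrieve output.

```python
def top_results(retrieve_response, min_score=0.0, preview_chars=80):
    """Return score, location, and a short text preview per result, best first."""
    rows = []
    for result in retrieve_response.get("retrievalResults", []):
        score = result.get("score", 0.0)
        if score < min_score:
            continue  # drop weak matches
        location = result.get("location", {})
        rows.append({
            "score": score,
            "type": location.get("type"),
            "uri": location.get("s3Location", {}).get("uri"),
            "preview": result["content"]["text"][:preview_chars],
        })
    return sorted(rows, key=lambda row: row["score"], reverse=True)


# Illustrative response only; scores and URIs are placeholders.
sample = {
    "retrievalResults": [
        {"content": {"text": "low relevance chunk"}, "score": 0.31,
         "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/b.pdf"}}},
        {"content": {"text": "high relevance chunk"}, "score": 0.72,
         "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/a.pdf"}}},
    ]
}
best = top_results(sample, min_score=0.5)
```

Filtering on a score threshold is one simple way to keep weak matches out of a downstream prompt; a suitable cutoff depends on the data and has to be found by experiment.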

## Clean up
Keep the section below commented out if you plan to use the Knowledge Base created above to build your RAG application.
If you only wanted to try creating the KB with the SDK, delete all the resources that were created, because you will incur costs for storing documents in the OSS index.
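
When the delete calls are run one after another, a single failure (for example a resource that was already removed) stops the cell and leaves the remaining resources behind. One possible approach, sketched here with a hypothetical helper named `run_cleanup`, is to run every step and collect the failures for review afterwards.

```python
def run_cleanup(steps):
    """Run each (label, zero-argument callable) in order; return the failures."""
    failures = []
    for label, action in steps:
        try:
            action()
        except Exception as err:
            # Record the problem and continue so later resources still get deleted.
            failures.append((label, err))
    return failures


# Usage sketch with the objects defined in this notebook (left commented out on purpose):
# failures = run_cleanup([
#     ("data source", lambda: bedrock_agent_client.delete_data_source(
#         dataSourceId=ds["dataSourceId"], knowledgeBaseId=kb["knowledgeBaseId"])),
#     ("knowledge base", lambda: bedrock_agent_client.delete_knowledge_base(
#         knowledgeBaseId=kb["knowledgeBaseId"])),
#     ("index", lambda: oss_client.indices.delete(index=index_name)),
#     ("collection", lambda: aoss_client.delete_collection(id=collection_id)),
# ])
# for label, err in failures:
#     print(f"Could not delete {label}: {err}")
```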

In [33]:
# # Delete KnowledgeBase
# bedrock_agent_client.delete_data_source(dataSourceId = ds["dataSourceId"], knowledgeBaseId=kb['knowledgeBaseId'])
# bedrock_agent_client.delete_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])
# oss_client.indices.delete(index=index_name)
# aoss_client.delete_collection(id=collection_id)
# aoss_client.delete_access_policy(type="data", name=access_policy['accessPolicyDetail']['name'])
# aoss_client.delete_security_policy(type="network", name=network_policy['securityPolicyDetail']['name'])
# aoss_client.delete_security_policy(type="encryption", name=encryption_policy['securityPolicyDetail']['name'])

In [34]:
# # delete role and policies
# from utility import delete_iam_role_and_policies
# delete_iam_role_and_policies()