# Using Knowledge Bases for Amazon Bedrock, Amazon OpenSearch Serverless, and Cohere Embed
---
# Introduction
This notebook builds on the content in the [Text Embeddings using Cohere LLM stored in Amazon OpenSearch Serverless](https://github.com/aws-samples/Cohere-on-AWS/blob/main/cohere-cookbooks/Embeddings/Cohere_Embeddings_Search.ipynb) notebook. Amazon OpenSearch Serverless allows developers to run petabyte-scale workloads without configuring, managing, and scaling OpenSearch clusters. OpenSearch Serverless delivers millisecond response times with the simplicity of a serverless environment, making it ideal as a vector store for Retrieval-Augmented Generation (RAG). Using OpenSearch also allows developers to take advantage of visualization and monitoring through OpenSearch Dashboard features that developers may already be familiar with. In addition, OpenSearch Serverless is cost-effective because you only pay for the resources you consume. There is no need for upfront provisioning and overprovisioning for peak workloads. OpenSearch Serverless also automatically updates your collections to consume the latest bug fixes, features, and performance improvements.

Within the AWS ecosystem, there are different ways to use Amazon OpenSearch Serverless. One way is to use Amazon OpenSearch Serverless as a vector store within a [Knowledge Base for Amazon Bedrock](https://aws.amazon.com/bedrock/knowledge-bases/). A Knowledge Base is a fully managed Retrieval-Augmented Generation (RAG) capability that allows you to connect foundation models (FMs) to your data sources to deliver more relevant, context-specific, and accurate responses.

This notebook focuses on using the Cohere Embed Multilingual v3 embedding model to create embeddings stored in a Knowledge Base for Amazon Bedrock powered by an Amazon OpenSearch Serverless vector store. The goal is to help developers generate accurate responses without performing undifferentiated steps to implement RAG.

---

# Prerequisites
1. Ensure you have requested access to the models provided by Cohere in the Bedrock console by clicking "model access." Instructions can be found here: https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html
2. Make sure you have permission to access Bedrock and that your administrator has granted you the correct IAM permissions.
3. Run the following cell to install boto3 and the other necessary packages.

---

In [1]:
import boto3
import random
import time
import json
import os
from botocore.exceptions import ClientError
import pprint
from retrying import retry
import warnings
warnings.filterwarnings('ignore')

# Install dependencies
%pip install -U opensearch-py==2.3.1
%pip install -U boto3==1.33.2
%pip install -U retrying==1.3.4

Note: you may need to restart the kernel to use updated packages.


## Step 0: Configure permissions by creating helper functions 
One benefit of using an Amazon Bedrock Knowledge Base is that users have centralized control over permissions through AWS Identity and Access Management (IAM). The helper functions below allow Amazon Bedrock to access resources such as Amazon Simple Storage Service (S3) and Amazon OpenSearch Serverless (AOSS). With Knowledge Bases, users also avoid undifferentiated work such as configuring an orchestration framework like LangChain.

Run the cell below to configure the permissions required to create a Knowledge Base from this Python notebook.

In [3]:
# The create_bedrock_execution_role function creates an IAM role that Bedrock can assume to access resources like S3 and OpenSearch.
# This role has policies that allow access to a specific S3 bucket and to the Cohere Embed Multilingual v3 embedding model.
def create_bedrock_execution_role(bucket_name):
    foundation_model_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                ],
                "Resource": [
                    # f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v1",
                    # f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v2:0",
                    f"arn:aws:bedrock:{region_name}::foundation-model/cohere.embed-multilingual-v3",
                ]
            }
        ]
    }

    s3_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*"
                ],
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceAccount": f"{account_number}"
                    }
                }
            }
        ]
    }

    assume_role_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "bedrock.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
    
    # create policies based on the policy documents
    fm_policy = iam_client.create_policy(
        PolicyName=fm_policy_name,
        PolicyDocument=json.dumps(foundation_model_policy_document),
        Description='Policy for accessing foundation model',
    )

    s3_policy = iam_client.create_policy(
        PolicyName=s3_policy_name,
        PolicyDocument=json.dumps(s3_policy_document),
        Description='Policy for reading documents from s3')

    # create bedrock execution role
    bedrock_kb_execution_role = iam_client.create_role(
        RoleName=bedrock_execution_role_name,
        AssumeRolePolicyDocument=json.dumps(assume_role_policy_document),
        Description='Amazon Bedrock Knowledge Base Execution Role for accessing OSS and S3',
        MaxSessionDuration=3600
    )

    # fetch arn of the policies and role created above
    bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']
    s3_policy_arn = s3_policy["Policy"]["Arn"]
    fm_policy_arn = fm_policy["Policy"]["Arn"]
    

    # attach policies to Amazon Bedrock execution role
    iam_client.attach_role_policy(
        RoleName=bedrock_kb_execution_role["Role"]["RoleName"],
        PolicyArn=fm_policy_arn
    )
    iam_client.attach_role_policy(
        RoleName=bedrock_kb_execution_role["Role"]["RoleName"],
        PolicyArn=s3_policy_arn
    )
    return bedrock_kb_execution_role


# The create_oss_policy_attach_bedrock_execution_role creates an IAM policy granting access to a specific OpenSearch collection
# The IAM policy created is attached to the Bedrock execution role.
def create_oss_policy_attach_bedrock_execution_role(collection_id, bedrock_kb_execution_role):
    # define oss policy document
    oss_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "aoss:APIAccessAll"
                ],
                "Resource": [
                    f"arn:aws:aoss:{region_name}:{account_number}:collection/{collection_id}"
                ]
            }
        ]
    }
    oss_policy = iam_client.create_policy(
        PolicyName=oss_policy_name,
        PolicyDocument=json.dumps(oss_policy_document),
        Description='Policy for accessing opensearch serverless',
    )
    oss_policy_arn = oss_policy["Policy"]["Arn"]
    print("Opensearch serverless arn: ", oss_policy_arn)

    iam_client.attach_role_policy(
        RoleName=bedrock_kb_execution_role["Role"]["RoleName"],
        PolicyArn=oss_policy_arn
    )
    return None


# The create_policies_in_oss function creates OpenSearch policies for encryption, network access, and data access for a vector store.
def create_policies_in_oss(vector_store_name, aoss_client, bedrock_kb_execution_role_arn):
    encryption_policy = aoss_client.create_security_policy(
        name=encryption_policy_name,
        policy=json.dumps(
            {
                'Rules': [{'Resource': ['collection/' + vector_store_name],
                           'ResourceType': 'collection'}],
                'AWSOwnedKey': True
            }),
        type='encryption'
    )

    network_policy = aoss_client.create_security_policy(
        name=network_policy_name,
        policy=json.dumps(
            [
                {'Rules': [{'Resource': ['collection/' + vector_store_name],
                            'ResourceType': 'collection'}],
                 'AllowFromPublic': True}
            ]),
        type='network'
    )
    access_policy = aoss_client.create_access_policy(
        name=access_policy_name,
        policy=json.dumps(
            [
                {
                    'Rules': [
                        {
                            'Resource': ['collection/' + vector_store_name],
                            'Permission': [
                                'aoss:CreateCollectionItems',
                                'aoss:DeleteCollectionItems',
                                'aoss:UpdateCollectionItems',
                                'aoss:DescribeCollectionItems'],
                            'ResourceType': 'collection'
                        },
                        {
                            'Resource': ['index/' + vector_store_name + '/*'],
                            'Permission': [
                                'aoss:CreateIndex',
                                'aoss:DeleteIndex',
                                'aoss:UpdateIndex',
                                'aoss:DescribeIndex',
                                'aoss:ReadDocument',
                                'aoss:WriteDocument'],
                            'ResourceType': 'index'
                        }],
                    'Principal': [identity, bedrock_kb_execution_role_arn],
                    'Description': 'Easy data policy'}
            ]),
        type='data'
    )
    return encryption_policy, network_policy, access_policy


# The delete_iam_role_and_policies function removes the execution role and the policies created by this notebook.
def delete_iam_role_and_policies():
    policy_arns = [
        f"arn:aws:iam::{account_number}:policy/{policy_name}"
        for policy_name in (s3_policy_name, fm_policy_name, oss_policy_name, sm_policy_name)
    ]

    # Detach every policy before deleting the role. Skip policies that were never
    # created, such as the Secrets Manager policy when create_bedrock_execution_role was used.
    for policy_arn in policy_arns:
        try:
            iam_client.detach_role_policy(
                RoleName=bedrock_execution_role_name,
                PolicyArn=policy_arn
            )
        except iam_client.exceptions.NoSuchEntityException:
            pass

    iam_client.delete_role(RoleName=bedrock_execution_role_name)

    for policy_arn in policy_arns:
        try:
            iam_client.delete_policy(PolicyArn=policy_arn)
        except iam_client.exceptions.NoSuchEntityException:
            pass
    return 0


def interactive_sleep(seconds: int):
    dots = ''
    for i in range(seconds):
        dots += '.'
        print(dots, end='\r')
        time.sleep(1)

def create_bedrock_execution_role_multi_ds(bucket_names = None, secrets_arns = None):
    
    # 0. Create bedrock execution role

    assume_role_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "bedrock.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
    
    # create bedrock execution role
    bedrock_kb_execution_role = iam_client.create_role(
        RoleName=bedrock_execution_role_name,
        AssumeRolePolicyDocument=json.dumps(assume_role_policy_document),
        Description='Amazon Bedrock Knowledge Base Execution Role for accessing OSS, secrets manager and S3',
        MaxSessionDuration=3600
    )

    # fetch arn of the role created above
    bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']

    # 1. Create and attach policy for foundation models
    foundation_model_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                ],
                "Resource": [
                    f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v1",
                    f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v2:0",
                    f"arn:aws:bedrock:{region_name}::foundation-model/cohere.embed-multilingual-v3",
                ]
            }
        ]
    }
    
    fm_policy = iam_client.create_policy(
        PolicyName=fm_policy_name,
        PolicyDocument=json.dumps(foundation_model_policy_document),
        Description='Policy for accessing foundation model',
    )
  
    # fetch arn of this policy 
    fm_policy_arn = fm_policy["Policy"]["Arn"]
    
    # attach this policy to Amazon Bedrock execution role
    iam_client.attach_role_policy(
        RoleName=bedrock_kb_execution_role["Role"]["RoleName"],
        PolicyArn=fm_policy_arn
    )

    # 2. Create and attach policy for s3 bucket
    if bucket_names:
        s3_policy_document = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [
                        "s3:GetObject",
                        "s3:ListBucket"
                    ],
                    "Resource": [item for sublist in [[f'arn:aws:s3:::{bucket}', f'arn:aws:s3:::{bucket}/*'] for bucket in bucket_names] for item in sublist], 
                    "Condition": {
                        "StringEquals": {
                            "aws:ResourceAccount": f"{account_number}"
                        }
                    }
                }
            ]
        }
        # create policies based on the policy documents
        s3_policy = iam_client.create_policy(
            PolicyName=s3_policy_name,
            PolicyDocument=json.dumps(s3_policy_document),
            Description='Policy for reading documents from s3')

        # fetch arn of this policy 
        s3_policy_arn = s3_policy["Policy"]["Arn"]
        
        # attach this policy to Amazon Bedrock execution role
        iam_client.attach_role_policy(
            RoleName=bedrock_kb_execution_role["Role"]["RoleName"],
            PolicyArn=s3_policy_arn
        )

    # 3. Create and attach policy for secrets manager
    if secrets_arns:
        secrets_manager_policy_document = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [
                        "secretsmanager:GetSecretValue",
                        "secretsmanager:PutSecretValue"
                    ],
                    "Resource": secrets_arns
                }
            ]
        }
        # create policies based on the policy documents
        
        secrets_manager_policy = iam_client.create_policy(
            PolicyName=sm_policy_name,
            PolicyDocument=json.dumps(secrets_manager_policy_document),
            Description='Policy for accessing secret manager',
        )

        # fetch arn of this policy
        sm_policy_arn = secrets_manager_policy["Policy"]["Arn"]

        # attach policy to Amazon Bedrock execution role
        iam_client.attach_role_policy(
            RoleName=bedrock_kb_execution_role["Role"]["RoleName"],
            PolicyArn=sm_policy_arn
        )
    
    return bedrock_kb_execution_role
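
The S3 policy documents in the helper functions above pair each bucket ARN with an object-level ARN. As a minimal sketch of that structure, the function below builds the same kind of document without calling AWS. The name `build_s3_read_policy` and the example bucket names and account ID are hypothetical and are not part of the notebook.

```python
import json


def build_s3_read_policy(bucket_names, account_id):
    """Return an IAM policy document granting read access to the given buckets.

    Hypothetical helper for illustration only. It mirrors the S3 policy
    document used by the helper functions above and makes no AWS calls.
    """
    resources = []
    for bucket in bucket_names:
        # s3:ListBucket applies to the bucket ARN, s3:GetObject to the objects in it.
        resources.append(f"arn:aws:s3:::{bucket}")
        resources.append(f"arn:aws:s3:::{bucket}/*")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": resources,
                "Condition": {
                    "StringEquals": {"aws:ResourceAccount": str(account_id)}
                },
            }
        ],
    }


# Placeholder bucket names and account ID, not real resources.
policy = build_s3_read_policy(["docs-bucket", "reports-bucket"], "123456789012")
print(json.dumps(policy, indent=2))
```

Printing the document before passing it to `iam_client.create_policy` can make it easier to review exactly which resources the execution role will be able to read.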

In [5]:
# Initialize a numeric suffix to keep the IAM role, policy, and collection names from colliding
suffix = random.randrange(200, 900)

# Initialize a boto3 session.
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name

# Initialize the AWS Identity and Access Management (IAM) client
iam_client = boto3_session.client('iam')
account_number = boto3.client('sts').get_caller_identity().get('Account')
identity = boto3.client('sts').get_caller_identity()['Arn']

# Build the resource names used throughout the notebook
encryption_policy_name = f"bedrock-sample-rag-sp-{suffix}"
network_policy_name = f"bedrock-sample-rag-np-{suffix}"
access_policy_name = f'bedrock-sample-rag-ap-{suffix}'
bedrock_execution_role_name = f'AmazonBedrockExecutionRoleForKnowledgeBase_{suffix}'
fm_policy_name = f'AmazonBedrockFoundationModelPolicyForKnowledgeBase_{suffix}'
s3_policy_name = f'AmazonBedrockS3PolicyForKnowledgeBase_{suffix}'
sm_policy_name = f'AmazonBedrockSecretPolicyForKnowledgeBase_{suffix}'
oss_policy_name = f'AmazonBedrockOSSPolicyForKnowledgeBase_{suffix}'

# Initialize an AWS Security Token Service (STS) client to look up the account ID
sts_client = boto3.client('sts')
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
bedrock_agent_client = boto3_session.client('bedrock-agent', region_name=region_name)
service = 'aoss'
s3_client = boto3.client('s3')
account_id = sts_client.get_caller_identity()["Account"]
s3_suffix = f"{region_name}-{account_id}"
bucket_name = f'bedrock-kb-{s3_suffix}' # replace it with your bucket name.
pp = pprint.PrettyPrinter(indent=2)
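
The cell above draws the suffix from `random.randrange(200, 900)`, which yields only 700 possible values, so repeated runs in the same account can occasionally collide with names left over from earlier runs. One possible alternative, sketched here under the assumption that a short lowercase hexadecimal string is acceptable for every resource name in this notebook, is to derive the suffix from a UUID. The helper name `make_suffix` is hypothetical.

```python
import uuid


def make_suffix(length=8):
    """Return a short lowercase hex string (hypothetical helper, not in the notebook)."""
    return uuid.uuid4().hex[:length]


# Illustrative only: the notebook itself keeps using its numeric `suffix`.
example_suffix = make_suffix()
print(f"bedrock-sample-rag-{example_suffix}")
```

Eight hex characters give roughly four billion possible values while keeping the generated names short.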

In [6]:
# Check if bucket exists, and if not create S3 bucket for knowledge base data source
try:
    s3_client.head_bucket(Bucket=bucket_name)
    print(f'Bucket {bucket_name} Exists')
except ClientError as e:
    print(f'Creating bucket {bucket_name}')
    if region_name == "us-east-1":
        s3bucket = s3_client.create_bucket(
            Bucket=bucket_name)
    else:
        s3bucket = s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': region_name}
        )

Bucket bedrock-kb-us-east-1-809719347864 Exists
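
The bucket check above treats every `ClientError` as "bucket does not exist". A slightly more defensive variant could inspect the error code first. The sketch below assumes that `head_bucket` reports a missing bucket with the code `"404"` and a permission problem with `"403"` in `e.response["Error"]["Code"]`; verify this against your boto3 version. The function name `bucket_action` and the returned labels are hypothetical.

```python
def bucket_action(error_code):
    """Map a head_bucket error code to a next step (hypothetical helper).

    Assumes "404" means the bucket is missing and "403" means it exists
    but is not accessible with the current credentials.
    """
    if error_code in ("404", "NoSuchBucket"):
        return "create"
    if error_code in ("403", "AccessDenied"):
        return "choose-another-name"
    return "raise"


# Inside the notebook's except block this could be used as:
#     action = bucket_action(e.response["Error"]["Code"])
for code in ("404", "403", "500"):
    print(code, "->", bucket_action(code))
```

Because S3 bucket names are shared across all accounts, a `403` often indicates that someone else already owns the name, in which case creating the bucket would fail anyway.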


In [7]:
# Store the bucket name
%store bucket_name

Stored 'bucket_name' (str)


## Step 1: Create the Amazon OpenSearch Vector Store
OpenSearch Serverless continually adjusts to get millisecond response times during changing usage patterns and demand. This makes it an ideal solution for a RAG-based workflow. Run the cell below to use the f-strings initialized in Step 0 to create the vector store. An Amazon OpenSearch Serverless collection is a logical grouping of one or more indexes that represent an analytics workload. 

In [8]:
vector_store_name = f'bedrock-sample-rag-{suffix}'
index_name = f"bedrock-sample-rag-index-{suffix}"
aoss_client = boto3_session.client('opensearchserverless')
bedrock_kb_execution_role = create_bedrock_execution_role(bucket_name=bucket_name)
bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']

# Create security, network and data access policies within OSS
encryption_policy, network_policy, access_policy = create_policies_in_oss(vector_store_name=vector_store_name,
                       aoss_client=aoss_client,
                       bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn)
collection = aoss_client.create_collection(name=vector_store_name,type='VECTORSEARCH')

# Print the OpenSearch vector search collection
pp.pprint(collection)

{ 'ResponseMetadata': { 'HTTPHeaders': { 'connection': 'keep-alive',
                                         'content-length': '314',
                                         'content-type': 'application/x-amz-json-1.0',
                                         'date': 'Fri, 30 Aug 2024 05:23:04 '
                                                 'GMT',
                                         'x-amzn-requestid': 'cf2ac17e-bc6b-4a44-8813-f0c5a59be626'},
                        'HTTPStatusCode': 200,
                        'RequestId': 'cf2ac17e-bc6b-4a44-8813-f0c5a59be626',
                        'RetryAttempts': 0},
  'createCollectionDetail': { 'arn': 'arn:aws:aoss:us-east-1:809719347864:collection/8gtddkrsd44cx7fqnl4k',
                              'createdDate': 1724995384485,
                              'id': '8gtddkrsd44cx7fqnl4k',
                              'kmsKeyArn': 'auto',
                              'lastModifiedDate': 1724995384485,
                             

In [9]:
# Store the encryption policy, network policy, access policy, and collection variables
%store encryption_policy network_policy access_policy collection

Stored 'encryption_policy' (dict)
Stored 'network_policy' (dict)
Stored 'access_policy' (dict)
Stored 'collection' (dict)


In [10]:
# Get the OpenSearch serverless collection URL
collection_id = collection['createCollectionDetail']['id']
host = collection_id + '.' + region_name + '.aoss.amazonaws.com'
print(host)

8gtddkrsd44cx7fqnl4k.us-east-1.aoss.amazonaws.com


In [11]:
# Wait for collection creation. This can take a couple of minutes to finish
response = aoss_client.batch_get_collection(names=[vector_store_name])
# Periodically check collection status
while (response['collectionDetails'][0]['status']) == 'CREATING':
    print('Creating collection...')
    interactive_sleep(30)
    response = aoss_client.batch_get_collection(names=[vector_store_name])
print('\nCollection successfully created:')
pp.pprint(response["collectionDetails"])

Creating collection...
..............................
Collection successfully created:
[ { 'arn': 'arn:aws:aoss:us-east-1:809719347864:collection/8gtddkrsd44cx7fqnl4k',
    'collectionEndpoint': 'https://8gtddkrsd44cx7fqnl4k.us-east-1.aoss.amazonaws.com',
    'createdDate': 1724995384485,
    'dashboardEndpoint': 'https://8gtddkrsd44cx7fqnl4k.us-east-1.aoss.amazonaws.com/_dashboards',
    'id': '8gtddkrsd44cx7fqnl4k',
    'kmsKeyArn': 'auto',
    'lastModifiedDate': 1724995408124,
    'name': 'bedrock-sample-rag-749',
    'standbyReplicas': 'ENABLED',
    'status': 'ACTIVE',
    'type': 'VECTORSEARCH'}]
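
The polling loop above runs for as long as the status is `CREATING` and has no upper bound on the wait. As a rough sketch, a generic polling helper with a timeout might look like the following. The name `wait_until`, its parameters, and the fake status sequence are assumptions made for illustration. In the notebook, `get_status` would wrap the `batch_get_collection` call.

```python
import time


def wait_until(get_status, done_states, timeout_s=600, poll_s=30, sleep=time.sleep):
    """Poll get_status() until it returns a value in done_states or time runs out."""
    waited = 0
    while True:
        status = get_status()
        if status in done_states:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"still {status!r} after {waited} seconds")
        sleep(poll_s)
        waited += poll_s


# Stand-in for batch_get_collection: reports CREATING twice, then ACTIVE.
fake_statuses = iter(["CREATING", "CREATING", "ACTIVE"])
result = wait_until(lambda: next(fake_statuses), {"ACTIVE", "FAILED"}, sleep=lambda s: None)
print(result)  # ACTIVE
```

Treating `FAILED` as a terminal state lets the caller inspect the result and stop early rather than continuing as if the collection were ready.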


In [12]:
# create opensearch serverless access policy and attach it to Bedrock execution role
try:
    create_oss_policy_attach_bedrock_execution_role(collection_id=collection_id,
                                                    bedrock_kb_execution_role=bedrock_kb_execution_role)
    # It can take up to a minute for data access rules to be enforced
    interactive_sleep(60)
except Exception as e:
    print("Policy already exists")
    pp.pprint(e)

Opensearch serverless arn:  arn:aws:iam::809719347864:policy/AmazonBedrockOSSPolicyForKnowledgeBase_749
............................................................

## Step 2: Create the Vector Index
In Step 1, we created a vector store. This is where we will be storing the index which powers the Knowledge Base for Amazon Bedrock. An index is a way to structure your data. In this case, the data we need to store and structure are the embeddings.

In [13]:
# Create the vector index in Opensearch serverless, with the knn_vector field index mapping, specifying the dimension size, name and engine.
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth, RequestError
# Use the credentials from the boto3 session.
credentials = boto3.Session().get_credentials()
awsauth = auth = AWSV4SignerAuth(credentials, region_name, service)

index_name = f"bedrock-sample-index-{suffix}"
body_json = {
   "settings": {
      "index.knn": "true",
       "number_of_shards": 1,
       "knn.algo_param.ef_search": 512,
       "number_of_replicas": 0,
   },
   "mappings": {
      "properties": {
         "vector": {
            "type": "knn_vector",
            # "dimension": 1536,
            "dimension": 1024,
             "method": {
                 "name": "hnsw",
                 "engine": "faiss",
                 "space_type": "l2"
             },
         },
         "text": {
            "type": "text"
         },
         "text-metadata": {
            "type": "text"
         }
      }
   }
}

# Build the OpenSearch client
# An OpenSearch client interfaces with OpenSearch to perform actions such as indexing, searching, or updating data. 
oss_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)

# Create index
try:
    response = oss_client.indices.create(index=index_name, body=json.dumps(body_json))
    print('\nCreating index:')
    pp.pprint(response)

    # index creation can take up to a minute
    interactive_sleep(60)
except RequestError as e:
    # you can delete the index if it already exists
    # oss_client.indices.delete(index=index_name)
    print(f'Error while trying to create the index, with error {e.error}\nyou may uncomment the delete call above to delete and recreate the index')


Creating index:
{ 'acknowledged': True,
  'index': 'bedrock-sample-index-749',
  'shards_acknowledged': True}
............................................................
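
The index mapping above sets `space_type` to `l2`, meaning that vectors are compared by Euclidean distance. To make that concrete, here is a brute-force nearest-neighbor search over three toy vectors. It is only an illustration: the real index uses the approximate HNSW algorithm over 1024-dimensional embeddings, and the document IDs and vectors below are made up.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, vectors, k=2):
    """Return the k (distance, doc_id) pairs closest to the query, nearest first."""
    scored = sorted((l2_distance(query, vec), doc_id) for doc_id, vec in vectors.items())
    return scored[:k]


# Made-up 3-dimensional "embeddings" for illustration.
toy_vectors = {
    "doc-a": [0.0, 0.0, 1.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [1.0, 0.0, 0.0],
}
hits = nearest([0.1, 0.0, 0.9], toy_vectors, k=2)
print([doc_id for _, doc_id in hits])  # ['doc-a', 'doc-c']
```

An exact search like this compares the query against every stored vector, which is why large collections rely on approximate structures such as HNSW instead.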

## Step 3: Download data to ingest into our Knowledge Base
Now that we have created the vector index, it is time to prepare data for testing. Note that you can also test with your own documents by uploading them directly to the S3 bucket created in Step 0.

In [17]:
# Download and prepare dataset
!mkdir -p ./data

from urllib.request import urlretrieve
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf'
]

filenames = [
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2019-Shareholder-Letter.pdf'
]

data_root = "./data/"

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)

Run the cell below to upload the data to the S3 bucket created.

In [19]:
# Upload data to s3 to the bucket that was configured as a data source to the knowledge base
s3_client = boto3.client("s3")
def uploadDirectory(path, bucket_name):
    for root, dirs, files in os.walk(path):
        for file in files:
            # The object key is the bare file name, so subfolders are flattened
            s3_client.upload_file(os.path.join(root, file), bucket_name, file)

uploadDirectory(data_root, bucket_name)
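
The upload cell above uses the bare file name as the S3 object key, so two files with the same name in different subfolders would overwrite each other. If your own documents are organized in subfolders, one option is to derive the key from the path relative to the data root. The sketch below only plans the uploads and makes no AWS calls. The function name `plan_uploads` and the temporary folder layout are hypothetical.

```python
import os
import tempfile


def plan_uploads(path):
    """Return (local_path, s3_key) pairs, keeping subfolders in the key."""
    plan = []
    for root, _dirs, files in os.walk(path):
        for file in sorted(files):
            local_path = os.path.join(root, file)
            # S3 keys use forward slashes regardless of the local OS.
            key = os.path.relpath(local_path, path).replace(os.sep, "/")
            plan.append((local_path, key))
    return plan


# Build a throwaway folder tree with two files that share a name.
with tempfile.TemporaryDirectory() as tmp:
    for year in ("2022", "2021"):
        os.makedirs(os.path.join(tmp, year))
        with open(os.path.join(tmp, year, "letter.pdf"), "w") as fh:
            fh.write("placeholder")
    upload_keys = sorted(key for _local, key in plan_uploads(tmp))

print(upload_keys)  # ['2021/letter.pdf', '2022/letter.pdf']
```

Each planned pair could then be passed to `s3_client.upload_file(local_path, bucket_name, key)`.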

## Step 4: Create a Knowledge Base
Amazon Bedrock splits your documents or content into manageable chunks for efficient data retrieval. Chunks are converted to embeddings and written to a vector index while maintaining a mapping to the original document. If a single document or piece of content contains fewer tokens than the specified maximum chunk size, it is not split further. With the fixed-size strategy used here, the overlap percentage controls how many tokens each chunk shares with the chunk next to it. You can experiment with these parameters to find the values that provide the best quality of responses.
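
To build intuition for how a maximum chunk size and an overlap percentage interact, here is a simplified sketch of fixed-size chunking over a list of tokens. It is not the algorithm Amazon Bedrock runs internally: the real service uses its own tokenizer and boundary rules, and the function name `fixed_size_chunks` and the rounding choice are assumptions made for this illustration.

```python
def fixed_size_chunks(tokens, max_tokens, overlap_percentage):
    """Split tokens into windows of max_tokens that overlap by the given percentage."""
    overlap = max_tokens * overlap_percentage // 100
    step = max_tokens - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the chunk size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window has reached the end of the token list.
        if start + max_tokens >= len(tokens):
            break
    return chunks


# Twelve placeholder tokens, chunk size 5, 20% overlap (one shared token).
words = [f"w{i}" for i in range(12)]
chunks = fixed_size_chunks(words, max_tokens=5, overlap_percentage=20)
for chunk in chunks:
    print(chunk)
```

In this toy run, each chunk begins with the last token of the previous one. Under the same rounding, a 512-token chunk with 20% overlap would share about 102 tokens with its neighbor.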

In [22]:
opensearchServerlessConfiguration = {
            "collectionArn": collection["createCollectionDetail"]['arn'],
            "vectorIndexName": index_name,
            "fieldMapping": {
                "vectorField": "vector",
                "textField": "text",
                "metadataField": "text-metadata"
            }
        }

# Ingest strategy - How to ingest data from the data source
chunkingStrategyConfiguration = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
        "maxTokens": 512,
        "overlapPercentage": 20
    }
}

# The data source to ingest documents from, into the OpenSearch serverless knowledge base index
s3Configuration = {
    "bucketArn": f"arn:aws:s3:::{bucket_name}",
    # "inclusionPrefixes":["*.*"] # you can use this if you want to create a KB using data within s3 prefixes.
}

# The embedding model used by Bedrock to embed ingested documents, and realtime prompts
# embeddingModelArn = f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v1"
embeddingModelArn = f"arn:aws:bedrock:{region_name}::foundation-model/cohere.embed-multilingual-v3"

name = f"bedrock-sample-knowledge-base-{suffix}"
description = "Amazon shareholder letter knowledge base."
roleArn = bedrock_kb_execution_role_arn

# Create a KnowledgeBase
from retrying import retry

@retry(wait_random_min=1000, wait_random_max=2000,stop_max_attempt_number=7)
def create_knowledge_base_func():
    create_kb_response = bedrock_agent_client.create_knowledge_base(
        name = name,
        description = description,
        roleArn = roleArn,
        knowledgeBaseConfiguration = {
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                "embeddingModelArn": embeddingModelArn
            }
        },
        storageConfiguration = {
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration":opensearchServerlessConfiguration
        }
    )
    return create_kb_response["knowledgeBase"]
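
The `@retry` decorator above retries the call up to seven times with a random wait of one to two seconds between attempts, presumably because newly created roles and access policies can take a moment to become visible. As a rough standard-library approximation of that behavior (not the `retrying` package's actual implementation), the logic could be sketched as follows. The names `retry_call` and `flaky` are hypothetical.

```python
import random
import time


def retry_call(func, attempts=7, wait_min_s=1.0, wait_max_s=2.0, sleep=time.sleep):
    """Call func until it succeeds or the attempts run out, then re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            # Wait a random interval before the next attempt.
            sleep(random.uniform(wait_min_s, wait_max_s))


calls = {"count": 0}


def flaky():
    """Fail twice, then succeed, to imitate a resource that is not ready yet."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("role not visible yet")
    return "created"


outcome = retry_call(flaky, sleep=lambda s: None)
print(outcome, calls["count"])  # created 3
```

Note that retrying on every exception also repeats calls that can never succeed, such as a name conflict, so a production version might retry only on specific error types.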

Run the cell below to create the Knowledge Base.

In [25]:
try:
    kb = create_knowledge_base_func()
except Exception as err:
    print(f"{err=}, {type(err)=}")
    
pp.pprint(kb)

err=ConflictException('An error occurred (ConflictException) when calling the CreateKnowledgeBase operation: KnowledgeBase with name bedrock-sample-knowledge-base-201 already exists.'), type(err)=<class 'botocore.errorfactory.ConflictException'>
{ 'createdAt': datetime.datetime(2024, 8, 30, 4, 12, 31, 775542, tzinfo=tzlocal()),
  'description': 'Amazon shareholder letter knowledge base.',
  'knowledgeBaseArn': 'arn:aws:bedrock:us-east-1:809719347864:knowledge-base/ODK0AD66TQ',
  'knowledgeBaseConfiguration': { 'type': 'VECTOR',
                                  'vectorKnowledgeBaseConfiguration': { 'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/cohere.embed-multilingual-v3'}},
  'knowledgeBaseId': 'ODK0AD66TQ',
  'name': 'bedrock-sample-knowledge-base-201',
  'roleArn': 'arn:aws:iam::809719347864:role/AmazonBedrockExecutionRoleForKnowledgeBase_385',
  'status': 'CREATING',
  'storageConfiguration': { 'opensearchServerlessConfiguration': { 'collectionArn': 'arn:aws:ao

In [30]:
# Get KnowledgeBase
get_kb_response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId = kb['knowledgeBaseId'])

In [31]:
# Create a DataSource in KnowledgeBase 
create_ds_response = bedrock_agent_client.create_data_source(
    name = name,
    description = description,
    knowledgeBaseId = kb['knowledgeBaseId'],
    dataSourceConfiguration = {
        "type": "S3",
        "s3Configuration":s3Configuration
    },
    vectorIngestionConfiguration = {
        "chunkingConfiguration": chunkingStrategyConfiguration
    }
)
ds = create_ds_response["dataSource"]
pp.pprint(ds)

{ 'createdAt': datetime.datetime(2024, 8, 30, 4, 28, 38, 922964, tzinfo=tzlocal()),
  'dataSourceConfiguration': { 's3Configuration': { 'bucketArn': 'arn:aws:s3:::bedrock-kb-us-east-1-809719347864'},
                               'type': 'S3'},
  'dataSourceId': 'MHMQOKD4CN',
  'description': 'Amazon shareholder letter knowledge base.',
  'knowledgeBaseId': 'ODK0AD66TQ',
  'name': 'bedrock-sample-knowledge-base-201',
  'status': 'AVAILABLE',
  'updatedAt': datetime.datetime(2024, 8, 30, 4, 28, 38, 922964, tzinfo=tzlocal()),
  'vectorIngestionConfiguration': { 'chunkingConfiguration': { 'chunkingStrategy': 'FIXED_SIZE',
                                                               'fixedSizeChunkingConfiguration': { 'maxTokens': 512,
                                                                                                   'overlapPercentage': 20}}}}
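
The `FIXED_SIZE` strategy in the output above caps each chunk at 512 tokens and lets consecutive chunks share 20% of their content. The sketch below illustrates that idea on a plain list of tokens. It is not how Bedrock implements chunking: the function name `chunk_with_overlap` is hypothetical, and whatever you pass in as `tokens` (for example, whitespace-split words) only stands in for the service's real tokenizer.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence[str], max_tokens: int = 512,
                       overlap_percentage: int = 20) -> List[List[str]]:
    """Split tokens into windows of at most max_tokens that overlap their neighbor."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in the range [0, 100)")

    # Number of tokens repeated at the start of the next chunk (assumed rounding: floor).
    overlap = max_tokens * overlap_percentage // 100
    step = max_tokens - overlap  # how far the window advances each time

    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks
```

With the defaults, each window advances by 410 tokens, so the last 102 tokens of one chunk reappear at the start of the next. The overlap helps a passage that straddles a chunk boundary stay retrievable from at least one chunk.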


In [32]:
# Get DataSource
bedrock_agent_client.get_data_source(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])

{'ResponseMetadata': {'RequestId': 'e5b4219e-739a-418b-bf74-b60f091ac083',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Fri, 30 Aug 2024 04:28:44 GMT',
   'content-type': 'application/json',
   'content-length': '603',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'e5b4219e-739a-418b-bf74-b60f091ac083',
   'x-amz-apigw-id': 'dTiTbHKFoAMEY1g=',
   'x-amzn-trace-id': 'Root=1-66d14a7c-4a64679f1abc86f46a88cf7c'},
  'RetryAttempts': 0},
 'dataSource': {'knowledgeBaseId': 'ODK0AD66TQ',
  'dataSourceId': 'MHMQOKD4CN',
  'name': 'bedrock-sample-knowledge-base-201',
  'status': 'AVAILABLE',
  'description': 'Amazon shareholder letter knowledge base.',
  'dataSourceConfiguration': {'type': 'S3',
   's3Configuration': {'bucketArn': 'arn:aws:s3:::bedrock-kb-us-east-1-809719347864'}},
  'vectorIngestionConfiguration': {'chunkingConfiguration': {'chunkingStrategy': 'FIXED_SIZE',
    'fixedSizeChunkingConfiguration': {'maxTokens': 512,
     'overlapPercentage': 20}}},
  'createdAt': da

## Step 5: Start data ingestion 
After you create your Knowledge Base, ingest your data source into it so that the content can be queried. Data ingestion converts the raw data in your data source into vector embeddings. You must sync the data source each time you add, modify, or remove files so that the changes are re-indexed into the Knowledge Base. Syncing is incremental, so only documents added, modified, or deleted since the last sync are processed.

OpenSearch Serverless uses a cloud-native architecture that separates the indexing (ingest) components from the search (query) components, with Amazon S3 as the primary data storage for indexes. Refer to [this document](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-overview.html#serverless-process) to understand more about Amazon OpenSearch Serverless indexing and search compute units.

In [33]:
# Start an ingestion job
start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])

In [34]:
job = start_job_response["ingestionJob"]
pp.pprint(job)

{ 'dataSourceId': 'MHMQOKD4CN',
  'ingestionJobId': '4OL04V2CD1',
  'knowledgeBaseId': 'ODK0AD66TQ',
  'startedAt': datetime.datetime(2024, 8, 30, 4, 28, 52, 474536, tzinfo=tzlocal()),
  'statistics': { 'numberOfDocumentsDeleted': 0,
                  'numberOfDocumentsFailed': 0,
                  'numberOfDocumentsScanned': 0,
                  'numberOfModifiedDocumentsIndexed': 0,
                  'numberOfNewDocumentsIndexed': 0},
  'status': 'STARTING',
  'updatedAt': datetime.datetime(2024, 8, 30, 4, 28, 52, 474536, tzinfo=tzlocal())}


In [28]:
# Poll the ingestion job until it reaches a terminal state.
# Stopping on FAILED as well avoids looping forever if ingestion fails.
while job['status'] not in ('COMPLETE', 'FAILED'):
    interactive_sleep(30)
    get_job_response = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId = kb['knowledgeBaseId'],
        dataSourceId = ds["dataSourceId"],
        ingestionJobId = job["ingestionJobId"]
    )
    job = get_job_response["ingestionJob"]

pp.pprint(job)

{ 'dataSourceId': 'EA1ZIPGJEY',
  'ingestionJobId': 'FYFWKD7YZJ',
  'knowledgeBaseId': 'GKEHKIWESJ',
  'startedAt': datetime.datetime(2024, 8, 28, 12, 58, 1, 652880, tzinfo=tzlocal()),
  'statistics': { 'numberOfDocumentsDeleted': 0,
                  'numberOfDocumentsFailed': 0,
                  'numberOfDocumentsScanned': 4,
                  'numberOfModifiedDocumentsIndexed': 0,
                  'numberOfNewDocumentsIndexed': 4},
  'status': 'COMPLETE',
  'updatedAt': datetime.datetime(2024, 8, 28, 12, 58, 17, 148888, tzinfo=tzlocal())}
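
A polling loop like the one above can wait indefinitely if a job never finishes. The following is a minimal sketch of a bounded wait, assuming the job reports its state in a `status` field as in the output above. The helper name `wait_for_job`, the `fetch_job` callable, and the timeout values are hypothetical, and the set of terminal statuses is an assumption that you may need to extend.

```python
import time
from typing import Callable, Dict

# Assumption: these are the states in which a job stops changing.
TERMINAL_STATUSES = {"COMPLETE", "FAILED"}


def wait_for_job(fetch_job: Callable[[], Dict], poll_seconds: float = 30.0,
                 timeout_seconds: float = 3600.0) -> Dict:
    """Call fetch_job until the job is terminal or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job is still {job['status']} after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)
```

In this notebook, `fetch_job` could be a lambda that calls `bedrock_agent_client.get_ingestion_job(...)` with the same three IDs as the cell above and returns its `"ingestionJob"` entry. Check the returned `status` and the `numberOfDocumentsFailed` statistic before querying the Knowledge Base.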


In [35]:
# Print the Knowledge Base ID. It corresponds to the OpenSearch index in the collection created earlier and is used for invocation later.
kb_id = kb["knowledgeBaseId"]
pp.pprint(kb_id)

'ODK0AD66TQ'


In [36]:
# Store the kb_id for invocation later in the invoke request
%store kb_id

Stored 'kb_id' (str)


## Step 6: Test the Knowledge Base
Now that the data is indexed in the Knowledge Base, it is time to test it with the Retrieve and Generate API. [Here are the models supported by the Retrieve and Generate API](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-supported.html).

In [37]:
# Initialize the Bedrock client
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region_name)
# Let's see how different Anthropic Claude 3 models respond to the input text we provide
claude_model_ids = [ ["Claude 3 Sonnet", "anthropic.claude-3-sonnet-20240229-v1:0"], ["Claude 3 Haiku", "anthropic.claude-3-haiku-20240307-v1:0"]]


In [38]:
# Define a function to send a query to an FM using a Knowledge Base.
# It returns the full API response (a dict), not just the generated text.
def ask_bedrock_llm_with_knowledge_base(query: str, model_arn: str, kb_id: str) -> dict:
    response = bedrock_agent_runtime_client.retrieve_and_generate(
        input={
            'text': query
        },
        retrieveAndGenerateConfiguration={
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kb_id,
                'modelArn': model_arn
            }
        },
    )

    return response

In [39]:
# Modify this query if you customized the documents uploaded to Amazon S3.
query = "What is Amazon doing in the field of generative AI?"

for model_id in claude_model_ids:
    model_arn = f'arn:aws:bedrock:{region_name}::foundation-model/{model_id[1]}'
    response = ask_bedrock_llm_with_knowledge_base(query, model_arn, kb_id)
    generated_text = response['output']['text']
    citations = response["citations"]
    contexts = []
    for citation in citations:
        retrievedReferences = citation["retrievedReferences"]
        for reference in retrievedReferences:
            contexts.append(reference["content"]["text"])
    print(f"---------- Generated using {model_id[0]}:")
    pp.pprint(generated_text)
    print(f'---------- The citations for the response generated by {model_id[0]}:')
    pp.pprint(contexts)
    print()

---------- Generated using Claude 3 Sonnet:
('Amazon has been working on developing its own large language models (LLMs) '
 'for generative AI applications. The company believes generative AI will '
 'transform and improve virtually every customer experience across its '
 'consumer, seller, brand, and creator offerings. Amazon is investing '
 'substantially in LLMs and plans to continue doing so. Through its cloud '
 'computing arm AWS, Amazon is democratizing generative AI technology by '
 'offering price-performant machine learning chips like Trainium and '
 'Inferentia. This allows companies of all sizes to afford training and '
 'running LLMs in production. AWS also enables companies to choose from '
 "various pre-trained LLMs and build applications using AWS's security, "
 "privacy and other features. One example is AWS's CodeWhisperer, which uses "
 'generative AI to provide real-time code suggestions to improve developer '
 'productivity.')
---------- The citations for the respo
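
The loop above collects only the cited passage text. To trace an answer back to its source document, it can help to keep each passage together with its location. The sketch below is one way to do that, assuming each retrieved reference carries a `location.s3Location.uri` field for S3 data sources. The helper name `extract_references` is hypothetical, and every field is treated as optional so that a missing key yields an empty string rather than an error.

```python
from typing import Dict, List, Tuple


def extract_references(response: Dict) -> List[Tuple[str, str]]:
    """Return (source_uri, passage_text) pairs from a retrieve_and_generate response."""
    pairs: List[Tuple[str, str]] = []
    for citation in response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            text = reference.get("content", {}).get("text", "")
            # Assumed field layout for S3-backed data sources.
            uri = reference.get("location", {}).get("s3Location", {}).get("uri", "")
            pairs.append((uri, text))
    return pairs
```

Calling `extract_references(response)` on the value returned by `ask_bedrock_llm_with_knowledge_base` yields a list that can be printed next to the generated text or grouped by source URI.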

## Step 7: Clean up

When you finish this exercise, remove your resources with the following steps to prevent incurring costs:

1.  Delete the vector index.
2.  Delete the data, network, and encryption access policies.
3.  Delete the OpenSearch collection.
4.  Delete the SageMaker Studio user profile and domain.
5.  Optionally, empty and delete the S3 bucket, or keep any objects you still need.

In [41]:
# delete vector index
oss_client.indices.delete(index=index_name)

# delete data, network, and encryption access policies
aoss_client.delete_access_policy(type="data", name=access_policy['accessPolicyDetail']['name'])
aoss_client.delete_security_policy(type="network", name=network_policy['securityPolicyDetail']['name'])
aoss_client.delete_security_policy(type="encryption", name=encryption_policy['securityPolicyDetail']['name'])

# delete collection
collection_id = collection['createCollectionDetail']['id']
aoss_client.delete_collection(id=collection_id)

NotFoundError: NotFoundError(404, 'resource_not_found_exception', 'Collection with ID [o2gjsxvnvzk88aolis03] cannot be found.')
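
The `NotFoundError` above is what the cleanup cell raises when it runs after the collection is already gone. The sketch below shows one way to make cleanup safe to re-run by treating "not found" as success. The helper names `is_not_found` and `delete_if_exists` are hypothetical, and the error detection is a best-effort assumption: it looks for a `status_code` of 404 (as exposed by opensearch-py transport errors) or a `ResourceNotFoundException` error code in a botocore-style `response` dict.

```python
from typing import Any, Callable


def is_not_found(exc: Exception) -> bool:
    """Best-effort check for 'resource not found' errors across client libraries."""
    # opensearch-py transport errors expose the HTTP status as `status_code`.
    if getattr(exc, "status_code", None) == 404:
        return True
    # botocore ClientError carries the service error code in `response`.
    response = getattr(exc, "response", None)
    if isinstance(response, dict):
        return response.get("Error", {}).get("Code", "") == "ResourceNotFoundException"
    return False


def delete_if_exists(delete_fn: Callable[..., Any], *args: Any, **kwargs: Any) -> bool:
    """Call delete_fn; return True if it ran, False if the resource was already gone."""
    try:
        delete_fn(*args, **kwargs)
        return True
    except Exception as exc:
        if is_not_found(exc):
            return False
        raise  # anything else is a real failure and should surface
```

For example, `delete_if_exists(oss_client.indices.delete, index=index_name)` would skip an index that no longer exists, while permission errors and other failures are still raised.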

# Conclusion
In this notebook, we discussed the benefit of using a Knowledge Base for Amazon Bedrock. 