# Lab 1 - Set up Knowledge Base

In this notebook, we create an Amazon Bedrock Knowledge Base that holds the information the agent needs to decide how to handle ticket-resolution scenarios unique to the organization. We will perform the following steps:

1. Notebook setup
2. Create Amazon Bedrock Knowledge Base
3. Ingest Documents into the knowledge base
4. Test the knowledge base functionality with a few queries

![data_ingestion](images/data_ingestion.png)

## 1. Notebook setup

In [None]:
!pip install --upgrade -q -r requirements.txt

In [None]:
# restart kernel for packages to take effect
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import json
import os
import pprint
import random
from retrying import retry

from utility.knowledgebase import create_bedrock_execution_role, create_oss_policy_attach_bedrock_execution_role, create_policies_in_oss, interactive_sleep

import boto3
from botocore.exceptions import ClientError

from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth, RequestError

In [None]:
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name

sts_client = boto3.client('sts')
bedrock_agent_client = boto3_session.client('bedrock-agent', region_name=region_name)
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region_name)
s3_client = boto3.client('s3')
aoss_client = boto3_session.client('opensearchserverless')
service = 'aoss'

account_id = sts_client.get_caller_identity()["Account"]

credentials = boto3.Session().get_credentials()
awsauth = auth = AWSV4SignerAuth(credentials, region_name, service)

In [None]:
suffix = random.randrange(200, 900)
s3_suffix = f"{region_name}-{account_id}"

bucket_name = f'ticket-kb-{s3_suffix}' # replace it with your bucket name.
vector_store_name = f'ticket-sample-rag-{suffix}'
index_name = f"ticket-sample-rag-index-{suffix}"

kb_name = f"tickets-sample-knowledge-base-{suffix}"
kb_description = "Company policies on granting environment access to employees"
kb_files_path = "kb_documents"
kb_key = 'kb_documents'
data_source_name = f'tickets-docs-kb-docs-{suffix}'

pp = pprint.PrettyPrinter(indent=2)

In [None]:
model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"
embedding_model_arn = f'arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v2:0'

In [None]:
%store suffix model_id embedding_model_arn

## 2. Create Knowledge Base

### 2.1 Create the Amazon S3 bucket and upload files

Amazon Bedrock Knowledge Bases support various data sources, including Amazon S3, as documented in the [Data Source Connectors guide](https://docs.aws.amazon.com/bedrock/latest/userguide/data-source-connectors.html). In this section, we will create an Amazon S3 bucket and upload files containing the company's policy on ticket resolution.

In [None]:
# Check whether the bucket exists; if not, create the S3 bucket for the knowledge base data source
try:
    s3_client.head_bucket(Bucket=bucket_name)
    print(f'Bucket {bucket_name} exists')
except ClientError as e:
    # Only create the bucket when it is missing; re-raise other errors (e.g. 403 access denied)
    if e.response['Error']['Code'] not in ('404', 'NoSuchBucket'):
        raise
    print(f'Creating bucket {bucket_name}')
    if region_name == "us-east-1":
        # us-east-1 does not accept a LocationConstraint
        s3bucket = s3_client.create_bucket(
            Bucket=bucket_name)
    else:
        s3bucket = s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={ 'LocationConstraint': region_name }
        )

In [None]:
%store bucket_name

In [None]:
for f in os.listdir(kb_files_path):
    if f.endswith(".pdf") or f.endswith(".txt"):
        s3_client.upload_file(kb_files_path+'/'+f, bucket_name, kb_key+'/'+f)

### 2.2 Create Knowledge Base

In this section we will go through all the steps to create and test a Knowledge Base.

These are the steps to complete:

1. Create Knowledge Base Role and OpenSearch Collection Policies
2. Create an OpenSearch collection
3. Create vector index
4. Create a Knowledge Base
5. Create a data source and attach to the recently created Knowledge Base
6. Ingest data into your Knowledge Base

First, we have to create a vector store. In this section we will use Amazon OpenSearch Serverless. Knowledge Bases also support other vector databases, as documented [here](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-setup.html).

Amazon OpenSearch Serverless is a serverless option in Amazon OpenSearch Service. As a developer, you can use OpenSearch Serverless to run petabyte-scale workloads without configuring, managing, and scaling OpenSearch clusters. You get the same interactive millisecond response times as OpenSearch Service with the simplicity of a serverless environment. Pay only for what you use by automatically scaling resources to provide the right amount of capacity for your application—without impacting data ingestion.

#### Step 1 Create Knowledge Base Role and OpenSearch Collection Policies

In [None]:
bedrock_kb_execution_role = create_bedrock_execution_role(bucket_name=bucket_name, embedding_model_arn=embedding_model_arn, suffix=suffix)
bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']

In [None]:
# create security, network and data access policies within OSS
encryption_policy, network_policy, access_policy = create_policies_in_oss(vector_store_name=vector_store_name,
                       aoss_client=aoss_client,
                       bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn,
                       suffix=suffix
                       )

In [None]:
%store encryption_policy network_policy access_policy

#### Step 2 Create an OpenSearch collection

In [None]:
collection = aoss_client.create_collection(name=vector_store_name,type='VECTORSEARCH')

In [None]:
pp.pprint(collection)
%store collection

In [None]:
# Get the OpenSearch serverless collection URL
collection_id = collection['createCollectionDetail']['id']
host = collection_id + '.' + region_name + '.aoss.amazonaws.com'
print(host)

In [None]:
# Wait for collection creation
# This can take a couple of minutes to finish
response = aoss_client.batch_get_collection(names=[vector_store_name])
# Periodically check collection status
while (response['collectionDetails'][0]['status']) == 'CREATING':
    print('Creating collection...')
    interactive_sleep(30)
    response = aoss_client.batch_get_collection(names=[vector_store_name])
print('\nCollection successfully created:')
pp.pprint(response["collectionDetails"])

In [None]:
# create opensearch serverless access policy and attach it to Bedrock execution role
try:
    create_oss_policy_attach_bedrock_execution_role(collection_id=collection_id,
                                                    bedrock_kb_execution_role=bedrock_kb_execution_role,
                                                    suffix=suffix)
    # It can take up to a minute for data access rules to be enforced
    interactive_sleep(60)
except Exception as e:
    print("Policy already exists")
    pp.pprint(e)

#### Step 3 Create a Vector Index

Let's now create a vector index to index our data.


In [None]:
body_json = {
   "settings": {
      "index.knn": "true",
       "number_of_shards": 1,
       "knn.algo_param.ef_search": 512,
       "number_of_replicas": 0,
   },
   "mappings": {
      "properties": {
         "vector": {
            "type": "knn_vector",
            "dimension": 1024,
             "method": {
                 "name": "hnsw",
                 "engine": "faiss",
                 "space_type": "l2"
             },
         },
         "text": {
            "type": "text"
         },
         "text-metadata": {
            "type": "text"
         }
      }
   }
}

# Build the OpenSearch client
oss_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)

In [None]:
# Create index
try:
    response = oss_client.indices.create(index=index_name, body=json.dumps(body_json))
    print('\nCreating index:')
    pp.pprint(response)

    # index creation can take up to a minute
    interactive_sleep(60)
except RequestError as e:
    # you can delete the index if it already exists
    # oss_client.indices.delete(index=index_name)
    print(f'Error while trying to create the index, with error {e.error}\nyou may uncomment the delete call above to delete and recreate the index')
    

#### Step 4 Create a Knowledge Base

Now that the vector index is available in OpenSearch Serverless, let's create a Knowledge Base and associate it with the OpenSearch Serverless collection.

- Initialize the OpenSearch Serverless configuration, which includes the collection ARN, index name, vector field, text field and metadata field.
- Initialize the chunking strategy, which determines how the Knowledge Base splits documents into chunks no larger than the size set in `chunkingStrategyConfiguration`.
- Initialize the S3 configuration, which will be used to create the data source object later.
- Initialize the Titan embeddings model ARN, which will be used to create the embeddings for each text chunk.

In [None]:
opensearchServerlessConfiguration = {
            "collectionArn": collection["createCollectionDetail"]['arn'],
            "vectorIndexName": index_name,
            "fieldMapping": {
                "vectorField": "vector",
                "textField": "text",
                "metadataField": "text-metadata"
            }
        }

# Ingest strategy - How to ingest data from the data source
chunkingStrategyConfiguration = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
        "maxTokens": 512,
        "overlapPercentage": 20
    }
}

# The data source to ingest documents from, into the OpenSearch serverless knowledge base index
s3Configuration = {
    "bucketArn": f"arn:aws:s3:::{bucket_name}",
    # "inclusionPrefixes":["*.*"] # you can use this if you want to create a KB using data within s3 prefixes.
}

Provide the above configurations as input to the `create_knowledge_base` method, which will create the Knowledge Base.



In [None]:
@retry(wait_random_min=1000, wait_random_max=2000,stop_max_attempt_number=7)
def create_knowledge_base_func():
    create_kb_response = bedrock_agent_client.create_knowledge_base(
        name = kb_name,
        description = kb_description,
        roleArn = bedrock_kb_execution_role_arn,
        knowledgeBaseConfiguration = {
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                "embeddingModelArn": embedding_model_arn
            }
        },
        storageConfiguration = {
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration":opensearchServerlessConfiguration
        }
    )
    return create_kb_response["knowledgeBase"]

In [None]:
try:
    kb = create_knowledge_base_func()
except Exception as err:
    print(f"{err=}, {type(err)=}")

In [None]:
pp.pprint(kb)

In [None]:
# Get KnowledgeBase 
get_kb_response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId = kb['knowledgeBaseId'])

#### Step 5 Create a data source and attach to the recently created Knowledge Base

Next we need to create a data source, which will be associated with the knowledge base created above. Once the data source is ready, we can then start to ingest the documents.

In [None]:
# Create a DataSource in KnowledgeBase 
create_ds_response = bedrock_agent_client.create_data_source(
    name = data_source_name,
    knowledgeBaseId = kb['knowledgeBaseId'],
    dataSourceConfiguration = {
        "type": "S3",
        "s3Configuration":s3Configuration
    },
    vectorIngestionConfiguration = {
        "chunkingConfiguration": chunkingStrategyConfiguration
    }
)
ds = create_ds_response["dataSource"]
pp.pprint(ds)

In [None]:
# Get DataSource 
bedrock_agent_client.get_data_source(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])

## 3. Ingest Documents into the knowledge base

Once the Knowledge Base and Data Source are created, we can start the ingestion job. During the ingestion job, the Knowledge Base fetches the documents from the data source, pre-processes them to extract text, splits the text into chunks based on the configured chunk size, creates an embedding for each chunk, and writes the embeddings to the vector database, in this case Amazon OpenSearch Serverless.
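
To build intuition for the chunking step, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not the service's implementation: the function name `fixed_size_chunks` is hypothetical, and it counts whitespace-separated words as "tokens", whereas the service uses its own tokenization. The defaults mirror the `maxTokens` and `overlapPercentage` values used in this lab.

```python
from typing import List


def fixed_size_chunks(text: str, max_tokens: int = 512,
                      overlap_percentage: int = 20) -> List[str]:
    """Split text into chunks of at most max_tokens words.

    Assumption: a "token" is a whitespace-separated word. Consecutive
    chunks share roughly overlap_percentage percent of their tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")

    tokens = text.split()
    overlap = max_tokens * overlap_percentage // 100
    step = max_tokens - overlap  # always >= 1 because overlap < max_tokens

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once the current chunk reaches the end of the text
        if start + max_tokens >= len(tokens):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in fixed_size_chunks(sample, max_tokens=5, overlap_percentage=20):
    print(chunk)
```

With `max_tokens=5` and a 20% overlap, each chunk starts with the last word of the previous one, so text that straddles a chunk boundary still appears intact in at least one chunk.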

<div class="alert alert-block alert-info">
Take a moment to review the contents of the documents.

- **permissionManual.txt**: contains information on how to resolve environment access tickets raised by users. It outlines three key points:
```
If the employee already has access to the environment, the ticket can be auto-resolved.
```

```
Access can be auto-assigned to the employee if all the following conditions are met:

1. The environment is owned by the employee's manager.
2. The requested access duration is less than 30 days.
3. The requested access type is not Admin.
```

```
If any of the above conditions are not met, the ticket should be assigned to the environment owner.
```
- **ticketResolution.txt**: contains organization-wide best practices for efficient ticket resolution.
</div>

<div class="alert alert-block alert-info">
The objective is to auto-resolve the ticket when all the predefined conditions are satisfied. If the conditions are not met, the ticket should be assigned to the environment owner (whom you can look up in the <b>Environment table</b>), along with relevant diagnostic information to facilitate ticket resolution.
</div>
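
The rules quoted from **permissionManual.txt** can be written down as a small decision function. This is only a sketch to make the policy concrete: the `AccessRequest` fields and the returned labels are hypothetical names chosen for illustration, not part of the lab's code or data model.

```python
from dataclasses import dataclass


@dataclass
class AccessRequest:
    # All field names are hypothetical, chosen to mirror the policy text
    employee_has_access: bool
    environment_owned_by_manager: bool
    duration_days: int
    access_type: str


def resolve_access_ticket(request: AccessRequest) -> str:
    """Return the action suggested by the policy for an access ticket."""
    # Rule 1: the employee already has access, so nothing needs to change
    if request.employee_has_access:
        return "AUTO_RESOLVE"
    # Rule 2: access is auto-assigned only if all three conditions hold
    if (request.environment_owned_by_manager
            and request.duration_days < 30
            and request.access_type.lower() != "admin"):
        return "AUTO_ASSIGN_ACCESS"
    # Rule 3: otherwise the environment owner has to decide
    return "ASSIGN_TO_ENVIRONMENT_OWNER"


print(resolve_access_ticket(AccessRequest(False, True, 14, "ReadOnly")))
print(resolve_access_ticket(AccessRequest(False, True, 45, "ReadOnly")))
```

In the labs, the agent is expected to reach the same conclusions by retrieving the policy text from the knowledge base rather than from hard-coded logic like this.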

In [None]:
interactive_sleep(30)
start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])

In [None]:
job = start_job_response["ingestionJob"]
pp.pprint(job)

In [None]:
# Poll the ingestion job until it reaches a terminal state
# (checking only for 'COMPLETE' would loop forever if the job fails)
while job['status'] not in ('COMPLETE', 'FAILED', 'STOPPED'):
    interactive_sleep(30)
    get_job_response = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId = kb['knowledgeBaseId'],
        dataSourceId = ds["dataSourceId"],
        ingestionJobId = job["ingestionJobId"]
    )
    job = get_job_response["ingestionJob"]

pp.pprint(job)

In [None]:
# Print the Bedrock knowledge base ID, which corresponds to the OpenSearch index in the collection created earlier; we will use it for invocation later
kb_id = kb["knowledgeBaseId"]
pp.pprint(kb_id)

In [None]:
# keep the kb_id for invocation later in the invoke request
%store kb_id

## 4. Test the knowledge base functionality with a few queries

### Using RetrieveAndGenerate API

Behind the scenes, the RetrieveAndGenerate API converts queries into embeddings, searches the knowledge base, augments the foundation model prompt with the search results as context, and returns the FM-generated response to the question. For multi-turn conversations, Knowledge Bases manage short-term memory of the conversation to provide more contextual results.

The output of the RetrieveAndGenerate API includes the generated response, source attribution as well as the retrieved text chunks.
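
For multi-turn conversations, the `sessionId` returned by one `retrieve_and_generate` call can be passed to the next call so that the service continues the same conversation. The helper below is a minimal sketch of building the request arguments; `build_rag_request` is a hypothetical name introduced here, and the commented-out calls assume the `bedrock_agent_runtime_client`, `model_id` and `kb_id` variables defined in this notebook.

```python
from typing import Optional


def build_rag_request(query: str, model_id: str, kb_id: str,
                      session_id: Optional[str] = None) -> dict:
    """Build keyword arguments for retrieve_and_generate (hypothetical helper)."""
    request = {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_id,
            },
        },
    }
    # Only include sessionId on follow-up turns; omit it to start a new session
    if session_id is not None:
        request["sessionId"] = session_id
    return request


# Usage sketch (requires AWS credentials, so it is left commented out):
# first = bedrock_agent_runtime_client.retrieve_and_generate(
#     **build_rag_request("Who can get access automatically?", model_id, kb_id))
# follow_up = bedrock_agent_runtime_client.retrieve_and_generate(
#     **build_rag_request("And what if the request is for Admin access?",
#                         model_id, kb_id, session_id=first["sessionId"]))
```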

![retrieveAndGenerate](images/retrieveAndGenerate.png)

In [None]:
def ask_bedrock_llm_with_knowledge_base(query: str, model_id: str, kb_id: str) -> dict:
    response = bedrock_agent_runtime_client.retrieve_and_generate(
        input={
            'text': query
        },
        retrieveAndGenerateConfiguration={
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kb_id,
                'modelArn': model_id
            }
        },
    )

    return response

In [None]:
query = "Company policies on granting environment access to employees"

response = ask_bedrock_llm_with_knowledge_base(query, model_id, kb_id)
generated_text = response['output']['text']
citations = response["citations"]
contexts = []
for citation in citations:
    retrievedReferences = citation["retrievedReferences"]
    for reference in retrievedReferences:
        contexts.append(reference["content"]["text"])
print(f"---------- Generated using {model_id}:")
pp.pprint(generated_text)
print(f'---------- The citations for the response generated by {model_id}:')
pp.pprint(contexts)
print()

In [None]:
print(response["output"]["text"])

### Optional: Retrieve API


The Retrieve API converts user queries into embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the Retrieve API includes the retrieved text chunks, the location type and URI of the source data, and the relevance scores of the retrievals.
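
As one example of a custom workflow on top of the search results, the sketch below keeps only the chunks whose relevance score clears a threshold and returns them best-first. The function name `filter_by_score` and the `0.5` threshold are assumptions made for illustration; the sample data only imitates the shape of the `retrievalResults` list and is not real output.

```python
from typing import Any, Dict, List


def filter_by_score(retrieval_results: List[Dict[str, Any]],
                    min_score: float = 0.5) -> List[Dict[str, Any]]:
    """Keep results scoring at least min_score, sorted best-first."""
    kept = []
    for result in retrieval_results:
        score = result.get("score", 0.0)
        if score >= min_score:
            kept.append({
                "text": result["content"]["text"],
                # The URI is only present for S3-backed sources
                "uri": result.get("location", {}).get("s3Location", {}).get("uri"),
                "score": score,
            })
    return sorted(kept, key=lambda item: item["score"], reverse=True)


# Made-up sample in the shape of response["retrievalResults"]
sample_results = [
    {"content": {"text": "chunk A"}, "score": 0.42,
     "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/a.txt"}}},
    {"content": {"text": "chunk B"}, "score": 0.81,
     "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/b.txt"}}},
]
for item in filter_by_score(sample_results):
    print(item["score"], item["uri"])
```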



![retrieveAPI](images/retrieveAPI.png)

In [None]:
# retrieve api for fetching only the relevant context.
relevant_documents = bedrock_agent_runtime_client.retrieve(
    retrievalQuery= {
        'text': query
    },
    knowledgeBaseId=kb_id,
    retrievalConfiguration= {
        'vectorSearchConfiguration': {
            'numberOfResults': 3 # will fetch top 3 documents which matches closely with the query.
        }
    }
)

In [None]:
pp.pprint(relevant_documents["retrievalResults"])

<div class="alert alert-block alert-warning">
<b>Next steps:</b> Proceed to the next labs to learn how to associate Bedrock Knowledge Bases with Bedrock Agents. Remember to run the CLEANUP notebook at the end of your session.
</div>