## Near real-time ingestion using Document Level API (DLA) - End-to-end example

With Document Level API (DLA), customers can efficiently and cost-effectively ingest, update, or delete data directly in Amazon Bedrock Knowledge Bases using a single API call, without performing a full sync with the data source periodically or after every change.

To read more about DLA, see the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-direct-ingestion-add.html).


#### Pre-requisites: 

- You have already created an Amazon Bedrock knowledge base by running [01_create_ingest_documents_test_kb_multi_ds.ipynb](/knowledge-bases/01-rag-concepts/01_create_ingest_documents_test_kb_multi_ds.ipynb).
- Note down the knowledge base (KB) ID.

#### Test Knowledge base: 
- Ingest a document into the knowledge base using DLA.
- Start querying the knowledge base for information.



<div class="alert alert-block alert-info">
<b>Note:</b> Please make sure to enable access to the `amazon.nova-micro-v1:0`, `Anthropic Claude 3 Sonnet`, `amazon.titan-text-express-v1`, `anthropic.claude-3-haiku-20240307-v1:0`, and `Titan Text Embeddings V2` models in the Amazon Bedrock console.
<hr>

Please run the notebook cell by cell instead of using the "Run All Cells" option.
</div>


### 0 - Setup
Before running the rest of this notebook, run the cells below to install the necessary libraries and connect to Bedrock.

Please ignore any pip dependency errors that appear while the libraries are installing.

In [None]:
%pip install --upgrade pip --quiet
%pip install -r ../requirements.txt --no-deps --quiet
%pip install -r ../requirements.txt --upgrade --quiet

In [None]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import boto3
print(boto3.__version__)

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
import sys
import time
import boto3
import logging
import pprint
import json
import uuid

# Set the path to import module
from pathlib import Path
current_path = Path().resolve()
current_path = current_path.parent
if str(current_path) not in sys.path:
    sys.path.append(str(current_path))
# Print sys.path to verify
# print(sys.path)

from utils.knowledge_base_operators import create_document_config, ingest_documents_dla

In [None]:
#Clients
s3_client = boto3.client('s3')
sts_client = boto3.client('sts')
session = boto3.session.Session()
region =  session.region_name
account_id = sts_client.get_caller_identity()["Account"]
bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime') 
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
region, account_id

### Ingest document directly into Knowledge base using Document Level API (INLINE)

To ingest documents directly into a knowledge base, send an [IngestKnowledgeBaseDocuments](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_IngestKnowledgeBaseDocuments.html) request that specifies the knowledge base ID and the data source ID.


In [None]:
%store -r kb_id

# kb_id = "<<knowledge_base_id>>" # Replace with your knowledge base id here.
ds_id_list = bedrock_agent_client.list_data_sources( knowledgeBaseId=kb_id, maxResults=100)['dataSourceSummaries']
ds_id = ds_id_list[0]['dataSourceId']

kb_id, ds_id


Currently, you can use DLA only if your knowledge base is connected to one of the following data source types:

- Amazon S3
- Custom

Depending on the configuration, several ingest patterns are possible, as shown below. To read more about these patterns, refer to the API documentation [here](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent/client/ingest_knowledge_base_documents.html).

In [None]:
# You can choose between different ingest patterns (based on the data source type, i.e. S3 or CUSTOM) using DLA

print("Different DLA ingest patterns:")
# For S3 data source type
print("1. Data Source Type: S3 - Metadata: INLINE")
print("2. Data Source Type: S3 - Metadata: S3_Location")

# For CUSTOM data source type
print("3. Data Source Type: CUSTOM -  Document source type: INLINE -  content type: TEXT - Metadata: INLINE")
print("4. Data Source Type: CUSTOM -  Document source type: INLINE -  content type: TEXT - Metadata: S3_Location")
print("5. Data Source Type: CUSTOM -  Document source type: INLINE -  content type: BYTE - Metadata: INLINE")
print("6. Data Source Type: CUSTOM -  Document source type: INLINE -  content type: BYTE - Metadata: S3_Location")
print("7. Data Source Type: CUSTOM -  Document source type: S3_LOCATION - Metadata: INLINE")
print("8. Data Source Type: CUSTOM -  Document source type: S3_LOCATION - Metadata: S3_Location")


To use DLA, you must define the `knowledgeBaseId`, the `dataSourceId`, and the `documents` (a list of objects, each of which contains information about a document to add).

- We have created a custom function named `create_document_config`, which builds the document configuration based on the ingest pattern you choose. This function accepts the following arguments:

    - data_source_type: Either 'CUSTOM' or 'S3'.
    - document_id: The ID for a custom document.
    - s3_uri: The S3 URI for an S3 data source.
    - inline_content: The inline content configuration for a custom data source.
    - metadata: Metadata information, which can be a list of inline attributes or an S3 location.

This notebook implements only four ingest patterns: patterns 1, 2, 3, and 4. You can extend it to patterns 5, 6, 7, and 8.
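For reference, the sketch below shows the shape of one entry in the `documents` list for patterns 1 and 3, following the request syntax in the boto3 documentation linked above. The helper names (`inline_metadata`, `s3_document`, `custom_text_document`) and the bucket name are hypothetical. This is not the implementation of `create_document_config` in `utils`; it only illustrates the payload that such a helper would need to produce.

```python
def inline_metadata(attributes):
    """Wrap {key: string value} pairs as inline metadata attributes."""
    return {
        "type": "IN_LINE_ATTRIBUTE",
        "inlineAttributes": [
            {"key": key, "value": {"type": "STRING", "stringValue": value}}
            for key, value in attributes.items()
        ],
    }


def s3_document(s3_uri, metadata=None):
    """Pattern 1: S3 data source, document referenced by its S3 URI."""
    doc = {"content": {"dataSourceType": "S3", "s3": {"s3Location": {"uri": s3_uri}}}}
    if metadata is not None:
        doc["metadata"] = metadata
    return doc


def custom_text_document(document_id, text, metadata=None):
    """Pattern 3: CUSTOM data source with inline TEXT content."""
    doc = {
        "content": {
            "dataSourceType": "CUSTOM",
            "custom": {
                "customDocumentIdentifier": {"id": document_id},
                "sourceType": "IN_LINE",
                "inlineContent": {"type": "TEXT", "textContent": {"data": text}},
            },
        }
    }
    if metadata is not None:
        doc["metadata"] = metadata
    return doc


# Hypothetical bucket and key, used only to show the resulting structure.
example = s3_document("s3://example-bucket/report.pdf", inline_metadata({"company": "octank"}))
print(example["content"]["dataSourceType"])
```

Patterns that keep metadata in S3 would replace the inline attributes with a metadata object of type `S3_LOCATION` that points at the metadata file.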




<div class="alert alert-block alert-info">
<b>Note:</b> When using DLA, the <code>dataSourceType</code> specified in the content of each document must match the type of the data source that you specify; otherwise, ingestion throws an error.
<ul>
<li>If your KB data source is S3, choose S3 as the data source type when calling the DLA API.</li>
<li>If your KB data source is CUSTOM, choose CUSTOM as the data source type when calling the DLA API.</li>
</ul>
</div>
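
One way to catch a mismatch before calling the API is to look up the data source type and compare it with each document. The sketch below assumes each document is a dict in the `IngestKnowledgeBaseDocuments` request shape; the helper names `data_source_type` and `check_documents_match` are hypothetical and are not part of `utils`.

```python
def data_source_type(client, knowledge_base_id, data_source_id):
    """Return the configured type of a data source, e.g. 'S3' or 'CUSTOM'."""
    response = client.get_data_source(
        knowledgeBaseId=knowledge_base_id, dataSourceId=data_source_id
    )
    return response["dataSource"]["dataSourceConfiguration"]["type"]


def check_documents_match(documents, expected_type):
    """Raise ValueError if any document's dataSourceType differs from expected_type."""
    for index, doc in enumerate(documents):
        found = doc["content"]["dataSourceType"]
        if found != expected_type:
            raise ValueError(
                f"documents[{index}] has dataSourceType {found!r}, "
                f"but the data source is {expected_type!r}"
            )
```

In this notebook, `client` would be `bedrock_agent_client`, and the IDs would be `kb_id` and `ds_id`.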

In [None]:
# Provide the information below based on your ingest pattern
#---------------------------------------------------------------------------------------
# FOR INGEST PATTERN CHOICE = 1, i.e. Data Source Type: S3 - Metadata: INLINE
# **************************************************************************************
# S3 URI of the data to be ingested
document_s3_uri = 's3://semantic-kb-9194117/octank_financial_10K.pdf'

# INLINE metadata details
metadata_1 = {'key': 'company', 'value': { 'stringValue': 'octank', 'type': 'STRING'}}
metadata_2 = {'key': 'document', 'value': { 'stringValue': '10k', 'type': 'STRING'}}
metadata_list = [metadata_1, metadata_2]

inline_metadata = {'inlineAttributes': metadata_list}

# Create document configuration for this ingest pattern
s3_doc_inline_metadata = create_document_config(
    data_source_type='S3',
    s3_uri=document_s3_uri,
    metadata= inline_metadata
)

# #---------------------------------------------------------------------------------------
# # FOR INGEST PATTERN CHOICE = 2, i.e. Data Source Type: S3 - Metadata: S3_Location
# # **************************************************************************************
# # S3 URI of the data to be ingested
# document_s3_uri = '<Insert S3 URI here>'

# # If your metadata is stored at an S3 location
# metadata_s3_uri = '<Insert S3 URI here>'
# metadata_s3_accountid = '<Insert S3 URI accountid here>'
# s3_metadata = {'uri': metadata_s3_uri, 'bucketOwnerAccountId': metadata_s3_accountid }

# s3_doc_s3_metadata = create_document_config(
#             data_source_type='S3',
#             s3_uri=document_s3_uri,
#             metadata= s3_metadata
#         )


## ---------------------------------------------------------------------------------------
## FOR INGEST PATTERN CHOICE = 3, i.e. Data Source Type: CUSTOM -  Document source type: INLINE -  content type: TEXT - Metadata: INLINE
## **************************************************************************************

## Example :  USE DLA to ingest a custom document with TEXT inline content and inline metadata

# document_content = '''This is sample document content'''
# document_id = '<insert document id here>'

# # If your metadata is INLINE
# metadata_1 = {'key': 'company', 'value': { 'stringValue': 'octank', 'type': 'STRING'}}
# metadata_2 = {'key': 'document', 'value': { 'stringValue': '10k', 'type': 'STRING'}}
# metadata_list = [metadata_1, metadata_2]

# inline_metadata = {'inlineAttributes': metadata_list}

# custom_inline_text_inline_metadata = create_document_config(
#     data_source_type='CUSTOM',
#     document_id=document_id,
#     inline_content={
#         'type': 'TEXT',
#         'data': document_content
#     },
#     metadata= inline_metadata
# )

##---------------------------------------------------------------------------------------
## FOR INGEST PATTERN CHOICE = 4, i.e. Data Source Type: CUSTOM -  Document source type: INLINE -  content type: TEXT - Metadata: S3_Location
## **************************************************************************************

## Example : USE DLA to ingest a custom document with TEXT inline content and S3 metadata

# document_content = '''This is sample document content'''
# document_id = '<insert document id here>'

# # If your metadata is stored at an S3 location
# metadata_s3_uri = '<Insert S3 URI here>' 
# metadata_s3_accountid = '<Insert S3 URI accountid here>' 
# s3_metadata = {'uri': metadata_s3_uri, 'bucketOwnerAccountId': metadata_s3_accountid }

# custom_inline_text_s3_metadata = create_document_config(
#     data_source_type='CUSTOM',
#     document_id=document_id,
#     inline_content={
#         'type': 'TEXT',
#         'data': document_content
#     },
#     metadata=s3_metadata
# )


After the document list has been configured, call the custom function `ingest_documents_dla` to ingest the documents into the knowledge base. This function calls the [ingest_knowledge_base_documents](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent/client/ingest_knowledge_base_documents.html) API.

- This function accepts the following arguments:

    - knowledge_base_id: The ID of the knowledge base.
    - data_source_id: The ID of the data source.
    - documents: A list of document configurations to ingest.
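
As a rough illustration of the underlying call, a thin wrapper around `ingest_knowledge_base_documents` could look like the sketch below. The function name `ingest_documents` is hypothetical, and the actual `ingest_documents_dla` in `utils` may differ. The client is passed in explicitly so the wrapper can be exercised with a stub.

```python
def ingest_documents(client, knowledge_base_id, data_source_id, documents):
    """Send one IngestKnowledgeBaseDocuments request and return the reported statuses."""
    response = client.ingest_knowledge_base_documents(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
        documents=documents,
    )
    # Each entry in documentDetails reports the status of one submitted document.
    return [detail.get("status") for detail in response.get("documentDetails", [])]
```

Ingestion is asynchronous, so the statuses returned here describe the request as accepted, not the final indexing result.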

In [None]:
# Ingest the documents using DLA
response = ingest_documents_dla(
    knowledge_base_id=kb_id,
    data_source_id=ds_id,
    documents=[ s3_doc_inline_metadata] # Based on the ingest pattern, this can be changed to [s3_doc_s3_metadata], [custom_inline_text_inline_metadata] or [custom_inline_text_s3_metadata]
)

print(response)


Check the status of the documents ingested via DLA.

In [None]:
## To fetch the status of documents
# response = bedrock_agent_client.list_knowledge_base_documents(
#     dataSourceId=ds_id,
#     knowledgeBaseId=kb_id,
# )
# print(response)
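
Because ingestion runs asynchronously, it can help to poll until no document is still being processed before querying. The sketch below is one way to do that with `list_knowledge_base_documents`. The helper name `wait_for_ingestion`, the polling defaults, and the set of statuses treated as "still running" are assumptions; the sketch also reads only the first page of results.

```python
import time

# Assumption: these status values mean ingestion is still running.
IN_FLIGHT = {"PENDING", "STARTING", "IN_PROGRESS"}


def wait_for_ingestion(client, knowledge_base_id, data_source_id,
                       poll_seconds=5, max_polls=60):
    """Poll list_knowledge_base_documents until no document is in flight."""
    statuses = []
    for _ in range(max_polls):
        response = client.list_knowledge_base_documents(
            knowledgeBaseId=knowledge_base_id, dataSourceId=data_source_id
        )
        # Only the first page is inspected; follow nextToken for larger data sources.
        statuses = [doc["status"] for doc in response.get("documentDetails", [])]
        if not any(status in IN_FLIGHT for status in statuses):
            return statuses
        time.sleep(poll_seconds)
    raise TimeoutError(f"Documents still in flight after {max_polls} polls: {statuses}")
```

In this notebook the call would be `wait_for_ingestion(bedrock_agent_client, kb_id, ds_id)`.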

### 2.2 Test the Knowledge Base
Now that the knowledge base is available, we can test it using the [**retrieve**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve.html) and [**retrieve_and_generate**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html) functions.

#### Testing Knowledge Base with Retrieve and Generate API

Let's first test the knowledge base using the retrieve and generate API. With this API, Bedrock takes care of retrieving the necessary references from the knowledge base and generating the final answer using a foundation model from Bedrock.

query = `Provide a summary of consolidated statements of cash flows of Octank Financial for the fiscal years ended December 31, 2019.`

According to the ground-truth QA pair, the expected response for this query is:
```
The cash flow statement for Octank Financial in the year ended December 31, 2019 reveals the following:
- Cash generated from operating activities amounted to $710 million, which can be attributed to a $700 million profit and non-cash charges such as depreciation and amortization.
- Cash outflow from investing activities totaled $240 million, with major expenditures being the acquisition of property, plant, and equipment ($200 million) and marketable securities ($60 million), partially offset by the sale of property, plant, and equipment ($40 million) and maturing marketable securities ($20 million).
- Financing activities resulted in a cash inflow of $350 million, stemming from the issuance of common stock ($200 million) and long-term debt ($300 million), while common stock repurchases ($50 million) and long-term debt payments ($100 million) reduced the cash inflow.
Overall, Octank Financial experienced a net cash enhancement of $120 million in 2019, bringing their total cash and cash equivalents to $210 million.
```

In [None]:
query = "Provide a summary of consolidated statements of cash flows of Octank Financial for the fiscal years ended December 31, 2019."

In [None]:
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"

response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5
                } 
            }
        }
    }
)

print(response['output']['text'],end='\n'*2)
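
The `retrieve_and_generate` response also carries a `citations` list alongside the generated text. The sketch below collects the distinct source locations from it. The helper name `cited_sources` is hypothetical, and the fallback for non-S3 locations simply prints the raw location dict.

```python
def cited_sources(rag_response):
    """Collect the distinct source locations cited in a retrieve_and_generate response."""
    sources = []
    for citation in rag_response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            location = reference.get("location", {})
            # S3-backed documents expose a URI; otherwise fall back to the raw location.
            source = location.get("s3Location", {}).get("uri") or str(location)
            if source not in sources:
                sources.append(source)
    return sources
```

Calling `cited_sources(response)` on the response above would list the documents that the answer drew on.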

In [None]:
import boto3
import time

def delete_all_knowledge_bases():
    try:
        # Create the Bedrock Agent client
        bedrock_agent = boto3.client('bedrock-agent')

        # List all knowledge bases
        kb_response = bedrock_agent.list_knowledge_bases()
        knowledge_bases = kb_response.get('knowledgeBaseSummaries', [])

        if not knowledge_bases:
            print("No knowledge bases found.")
            return

        # Iterate through each knowledge base
        for kb in knowledge_bases:
            kb_id = kb['knowledgeBaseId']
            print(f"Processing Knowledge Base: {kb_id}")

            # List data sources for the knowledge base
            data_sources = bedrock_agent.list_data_sources(
                knowledgeBaseId=kb_id
            )

            # Delete each data source
            for ds in data_sources.get('dataSourceSummaries', []):
                ds_id = ds['dataSourceId']
                print(f"Deleting Data Source: {ds_id}")
                
                try:
                    bedrock_agent.delete_data_source(
                        knowledgeBaseId=kb_id,
                        dataSourceId=ds_id
                    )
                    print(f"Data Source {ds_id} deleted successfully")
                except Exception as e:
                    print(f"Error deleting Data Source {ds_id}: {str(e)}")

            # Delete the knowledge base
            try:
                print(f"Deleting Knowledge Base: {kb_id}")
                bedrock_agent.delete_knowledge_base(
                    knowledgeBaseId=kb_id
                )
                print(f"Knowledge Base {kb_id} deleted successfully")
            except Exception as e:
                print(f"Error deleting Knowledge Base {kb_id}: {str(e)}")

            # Wait for a few seconds between deletions
            time.sleep(2)

        print("Deletion process completed")

    except Exception as e:
        print(f"An error occurred: {str(e)}")
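
The function above removes every knowledge base in the account and Region. If only the knowledge base used in this notebook should go, a narrower variant along the following lines may be safer. The name `delete_single_knowledge_base` is hypothetical. Data source deletion is asynchronous, so the final call may need a retry if the service reports that the data sources are still being deleted.

```python
def delete_single_knowledge_base(client, knowledge_base_id):
    """Delete one knowledge base and its data sources; return the deleted data source IDs."""
    deleted = []
    summaries = client.list_data_sources(knowledgeBaseId=knowledge_base_id)
    for ds in summaries.get("dataSourceSummaries", []):
        client.delete_data_source(
            knowledgeBaseId=knowledge_base_id, dataSourceId=ds["dataSourceId"]
        )
        deleted.append(ds["dataSourceId"])
    # Assumption: the data sources are gone (or deletable) by the time this call runs.
    client.delete_knowledge_base(knowledgeBaseId=knowledge_base_id)
    return deleted
```

Here `client` would be `bedrock_agent_client` and `knowledge_base_id` would be `kb_id`.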

In [None]:
import boto3
import time
from botocore.exceptions import ClientError

def delete_all_opensearch_serverless():
    # Create OpenSearch Serverless client
    client = boto3.client('opensearchserverless')
    
    def delete_security_policies():
        print("\nDeleting security policies...")
        for policy_type in ['encryption', 'network']:
            try:
                # List policies of this type; summaries are returned under 'securityPolicySummaries'
                response = client.list_security_policies(type=policy_type)
                policies = response.get('securityPolicySummaries', [])
                
                if not policies:
                    print(f"No {policy_type} policies found.")
                    continue

                for policy in policies:
                    try:
                        print(f"Deleting {policy_type} policy: {policy['name']}")
                        client.delete_security_policy(
                            name=policy['name'],
                            type=policy_type
                        )
                        print(f"Successfully deleted {policy_type} policy: {policy['name']}")
                        # Add small delay to avoid throttling
                        time.sleep(1)
                    except ClientError as e:
                        print(f"Error deleting {policy_type} policy {policy['name']}: {str(e)}")
            except ClientError as e:
                print(f"Error listing {policy_type} policies: {str(e)}")

    def delete_access_policies():
        print("\nDeleting access policies...")
        # 'data' is the only access policy type accepted by this API
        for policy_type in ['data']:
            try:
                response = client.list_access_policies(type=policy_type)
                policies = response.get('accessPolicySummaries', [])
                
                if not policies:
                    print(f"No {policy_type} access policies found.")
                    continue

                for policy in policies:
                    try:
                        print(f"Deleting {policy_type} access policy: {policy['name']}")
                        client.delete_access_policy(
                            name=policy['name'],
                            type=policy_type
                        )
                        print(f"Successfully deleted {policy_type} access policy: {policy['name']}")
                        # Add small delay to avoid throttling
                        time.sleep(1)
                    except ClientError as e:
                        print(f"Error deleting {policy_type} access policy {policy['name']}: {str(e)}")
            except ClientError as e:
                print(f"Error listing {policy_type} access policies: {str(e)}")

    def delete_collections():
        print("\nDeleting collections...")
        try:
            response = client.list_collections()
            collections = response.get('collectionSummaries', [])

            if not collections:
                print("No collections found.")
                return

            for collection in collections:
                collection_name = collection['name']
                collection_id = collection['id']  # Use the actual ID provided by AWS
                try:
                    print(f"Deleting collection: {collection_name} (ID: {collection_id})")
                    if collection['status'] == 'ACTIVE':
                        client.delete_collection(id=collection_id)
                        print(f"Deletion initiated for collection: {collection_name}")
                        
                        # Wait for collection deletion
                        while True:
                            try:
                                status_response = client.batch_get_collection(ids=[collection_id])
                                status_collections = status_response.get('collectionDetails', [])
                                
                                if not status_collections:
                                    print(f"Collection {collection_name} deleted successfully")
                                    break
                                    
                                status = status_collections[0]['status']
                                print(f"Waiting for collection {collection_name} to be deleted... Current status: {status}")
                                
                                if status == 'DELETED':
                                    print(f"Collection {collection_name} deleted successfully")
                                    break
                                
                                time.sleep(30)
                                
                            except ClientError as e:
                                if 'ResourceNotFoundException' in str(e):
                                    print(f"Collection {collection_name} deleted successfully")
                                    break
                                raise e
                except ClientError as e:
                    print(f"Error deleting collection {collection_name}: {str(e)}")
                    
        except ClientError as e:
            print(f"Error listing collections: {str(e)}")

    try:
        print("Starting OpenSearch Serverless cleanup...")
        
        # Delete in order: collections first, then policies
        delete_collections()
        delete_access_policies()
        delete_security_policies()
        
        print("\nCleanup completed successfully!")
        
    except Exception as e:
        print(f"An error occurred during cleanup: {str(e)}")




As the output shows, printing `response['output']['text']` gives only the final answer, not the sources used to generate it. Let's now retrieve the source information from the knowledge base with the retrieve API.

#### Testing Knowledge Base with Retrieve API
If you need an extra layer of control, you can retrieve the chunks that best match your query using the retrieve API. In this setup, you can configure the desired number of results and control the final answer with your own application logic. The API then provides you with the matching content, its S3 location, the similarity score, and the chunk metadata.

In [None]:
response_ret = bedrock_agent_runtime_client.retrieve(
    knowledgeBaseId=kb_id,
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
        }
    },
    retrievalQuery={
        "text": "How many new positions were opened across Amazon's fulfillment and delivery network?"
    }
)

def response_print(retrieve_resp):
    # Structure of 'retrievalResults': a list of chunks, each with content, location, score, and metadata
    for num, chunk in enumerate(retrieve_resp['retrievalResults'], 1):
        print(f'Chunk {num}: ', chunk['content']['text'], end='\n'*2)
        print(f'Chunk {num} Location: ', chunk['location'], end='\n'*2)
        print(f'Chunk {num} Score: ', chunk['score'], end='\n'*2)
        print(f'Chunk {num} Metadata: ', chunk['metadata'], end='\n'*2)

response_print(response_ret)
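
Since the document was ingested with inline metadata such as `company = octank`, retrieval can also be restricted to chunks carrying that metadata. The sketch below builds a `retrievalConfiguration` with an `equals` filter; the helper name `retrieval_config_with_filter` is hypothetical.

```python
def retrieval_config_with_filter(key, value, number_of_results=5):
    """Build a retrievalConfiguration that keeps only chunks whose metadata key equals value."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": number_of_results,
            "filter": {"equals": {"key": key, "value": value}},
        }
    }


config = retrieval_config_with_filter("company", "octank")
print(config)
```

The resulting dict can be passed as `retrievalConfiguration=config` to `bedrock_agent_runtime_client.retrieve`, together with `knowledgeBaseId` and `retrievalQuery`.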

<div class="alert alert-block alert-warning">
<b>Note:</b> Remember to delete KB, OSS index and related IAM roles and policies to avoid incurring any charges.
</div>