# Chapter 5 - Knowledge Bases: Knowledge Bases for Amazon Bedrock

## Overview
This notebook demonstrates how to create and configure Amazon Bedrock Knowledge Bases for retrieval-augmented generation (RAG) applications. We'll explore how to ingest documents, create vector embeddings, and build intelligent search capabilities.

## Introduction
The Knowledge Base built here enables semantic search and question answering over your documents. It uses Amazon OpenSearch Serverless as the vector store, Amazon Titan Text Embeddings (`amazon.titan-embed-text-v1`) to generate embeddings, and Claude 3 Sonnet to answer queries from the retrieved passages.

## Prerequisites
- AWS account with Amazon Bedrock access
- S3 bucket with document data
- Access to Amazon Titan Embeddings and Claude 3 models
- Required IAM permissions for OpenSearch Serverless and knowledge base creation

## Setup

### Install Required Dependencies

In [None]:
# Install the packages needed for Amazon Bedrock and OpenSearch integration

%pip install --upgrade boto3      # AWS SDK for Python
%pip install --upgrade botocore   # Low-level AWS service access
%pip install --upgrade opensearch-py  # OpenSearch Python client

### Import Libraries

In [None]:
# Standard library imports
import json
import os
import time
import random

# AWS SDK imports
import boto3
from botocore.exceptions import ClientError

# Utility imports
import pprint
from utility import (
    create_bedrock_execution_role, 
    create_oss_policy_attach_bedrock_execution_role, 
    create_policies_in_oss, 
    interactive_sleep
)

### Client and Configuration Setup

In [None]:
# Initialize AWS session and get region information
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name

# Create AWS service clients
bedrock_agent_client = boto3_session.client('bedrock-agent', region_name=region_name)

# Configuration parameters
service = 'aoss'  # Amazon OpenSearch Serverless
bucket_name = ""  # ⚠️ **IMPORTANT**: Replace with your S3 bucket name

# Pretty printer for formatted output
pp = pprint.PrettyPrinter(indent=2)

print(f"🌍 **Region**: {region_name}")
print(f"📦 **Service**: {service}")

## OpenSearch Serverless Setup

### IAM Role Creation

In [None]:
# Define vector store configuration
vector_store_name = 'bedrock-sample-rag'  # Name for the OpenSearch collection
index_name = 'bedrock-sample-rag-index'   # Name for the vector index

# Create OpenSearch Serverless client
aoss_client = boto3_session.client('opensearchserverless', region_name=region_name)

# Create IAM execution role for Bedrock Knowledge Base
print("🔐 Creating Bedrock execution role...")
bedrock_kb_execution_role = create_bedrock_execution_role(bucket_name=bucket_name)
bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']

print(f"✅ **Execution Role ARN**: {bedrock_kb_execution_role_arn}")

### OpenSearch Policies and Collection

In [None]:
# Create security, network and data access policies within OpenSearch Serverless
print("🛡️ Creating OpenSearch Serverless policies...")
encryption_policy, network_policy, access_policy = create_policies_in_oss(
    vector_store_name=vector_store_name,
    aoss_client=aoss_client,
    bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn
)

# Create the vector search collection
print(f"📊 Creating vector collection: {vector_store_name}...")
collection = aoss_client.create_collection(
    name=vector_store_name,
    type='VECTORSEARCH'
)

print("✅ Collection creation initiated!")
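`create_collection` returns while the collection is still being provisioned, and the index creation step later requires the collection to be `ACTIVE`. One way to wait is to poll `batch_get_collection` until the status changes. The sketch below assumes the `collectionDetails[0]['status']` field of that call's response; the helper name `wait_for_collection_active`, the poll interval, and the attempt limit are illustrative choices, not part of the original notebook.

```python
import time


def wait_for_collection_active(aoss_client, name, poll_seconds=30,
                               max_attempts=40, sleep=time.sleep):
    """Poll OpenSearch Serverless until the named collection is ACTIVE.

    `aoss_client` is a boto3 'opensearchserverless' client, or any object
    with a compatible `batch_get_collection(names=[...])` method.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for _ in range(max_attempts):
        response = aoss_client.batch_get_collection(names=[name])
        details = response.get("collectionDetails", [])
        status = details[0]["status"] if details else "UNKNOWN"
        if status == "ACTIVE":
            return details[0]
        if status == "FAILED":
            raise RuntimeError(f"Collection {name} failed to create")
        sleep(poll_seconds)
    raise TimeoutError(f"Collection {name} was not ACTIVE after {max_attempts} checks")
```

In this notebook the call would look like `wait_for_collection_active(aoss_client, vector_store_name)`, placed before the index is created.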

### Collection Details

In [None]:
# Display collection details
print("📋 **Collection Details**:")
pp.pprint(collection)

### Collection Endpoint

In [None]:
# Extract collection ID and construct the endpoint URL
collection_id = collection['createCollectionDetail']['id']
host = f"{collection_id}.{region_name}.aoss.amazonaws.com"

print(f"🌐 **Collection Endpoint**: {host}")
print(f"🆔 **Collection ID**: {collection_id}")

### Role Policy Attachment

In [None]:
# Create and attach OpenSearch Serverless access policy to Bedrock execution role
print("🔗 Attaching OpenSearch access policy to Bedrock role...")
create_oss_policy_attach_bedrock_execution_role(
    collection_id=collection_id,
    bedrock_kb_execution_role=bedrock_kb_execution_role
)

print("✅ Policy attachment completed!")

## Vector Index Creation

### OpenSearch Client Setup

In [None]:
# Import OpenSearch Python client
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# Set up AWS SigV4 authentication for OpenSearch Serverless, reusing the session created earlier
credentials = boto3_session.get_credentials()
awsauth = AWSV4SignerAuth(credentials, region_name, service)

# index_name ('bedrock-sample-rag-index') was defined alongside the collection name above; reuse it here

# Vector index schema configuration
body_json = {
    "settings": {
        "index.knn": "true"  # Enable k-nearest neighbor search
    },
    "mappings": {
        "properties": {
            # Vector field for embeddings
            "vector": {
                "type": "knn_vector",
                "dimension": 1536,  # Titan embedding dimensions
                "method": {
                    "name": "hnsw",      # Hierarchical Navigable Small World
                    "engine": "faiss",    # Facebook AI Similarity Search
                    "space_type": "l2"    # L2/Euclidean distance
                }
            },
            # Text content field
            "text": {
                "type": "text"
            },
            # Metadata field
            "text-metadata": {
                "type": "text"
            }
        }
    }
}

# Create OpenSearch client with AWS authentication
print("🔧 Setting up OpenSearch client...")
oss_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)

# Wait for data access rules to be enforced
print("⏳ Waiting for data access rules to be enforced (120 seconds)...")
time.sleep(120)
print("✅ Ready to create index!")
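The mapping above sets `space_type` to `l2`, so nearest neighbors are ranked by Euclidean distance between embedding vectors. As a small plain-Python illustration of that arithmetic: the helper names are hypothetical, and the score conversion `1 / (1 + d)` with `d` the squared distance is how the OpenSearch k-NN documentation describes the `l2` space, so treat it as an assumption to verify for your engine and version.

```python
import math


def squared_l2(a, b):
    """Sum of squared component differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean (L2) distance: the square root of the squared distance."""
    return math.sqrt(squared_l2(a, b))


def knn_l2_score(a, b):
    """Assumed k-NN relevance score for the l2 space: closer vectors score higher."""
    return 1.0 / (1.0 + squared_l2(a, b))
```

Identical vectors have distance 0 and therefore the maximum score of 1 under this formula; the score shrinks toward 0 as vectors move apart.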

### Optional: Index Cleanup

In [None]:
# 🗑️ **Optional**: Delete existing index if needed
# Uncomment the line below if you encounter errors or need to recreate the index
# oss_client.indices.delete(index=index_name)
# print(f"🗑️ Deleted existing index: {index_name}")

### Index Creation

In [None]:
# Create the vector index
# ⚠️ **Important**: Collection must be in ACTIVE state before creating an index
print(f"✨ Creating vector index: {index_name}...")

try:
    response = oss_client.indices.create(index=index_name, body=json.dumps(body_json))
    print("✅ **Index Creation Response**:")
    pp.pprint(response)
except Exception as e:
    print(f"❌ **Error creating index**: {str(e)}")
    print("💡 **Tip**: Ensure the collection is in ACTIVE state and try again")
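Once documents are ingested, Bedrock queries this index on your behalf, but the same index can in principle be searched directly with an OpenSearch k-NN query. The sketch below only builds the request body. `build_knn_query` is a hypothetical helper, the field names `vector`, `text`, and `text-metadata` come from the mapping defined earlier, and the query embedding would have to be produced by the same Titan embedding model (not shown here).

```python
def build_knn_query(embedding, k=5, vector_field="vector", expected_dim=1536):
    """Build an OpenSearch k-NN search body for a single query embedding.

    Raises ValueError if the embedding length does not match the index
    dimension, since such a query could not match the stored vectors.
    """
    if len(embedding) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(embedding)}"
        )
    return {
        "size": k,                              # number of hits to return
        "_source": ["text", "text-metadata"],   # skip returning the raw vectors
        "query": {
            "knn": {
                vector_field: {"vector": list(embedding), "k": k}
            }
        },
    }
```

With a real embedding in hand, the body could be passed to the client created earlier, for example `oss_client.search(index=index_name, body=build_knn_query(query_embedding))`.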

## Data Upload to S3

### Upload Documents

In [None]:
# Define data source directory
data_root = "data/"

# Create S3 client
s3_client = boto3.client("s3")

def uploadDirectory(path, bucket_name):
    """
    Upload all files from a local directory to an S3 bucket.
    
    Args:
        path (str): Local directory path containing files to upload
        bucket_name (str): Target S3 bucket name

    Returns:
        list[str]: File names that were uploaded successfully
    """
    uploaded_files = []
    
    for root, dirs, files in os.walk(path):
        for file in files:
            local_path = os.path.join(root, file)
            try:
                s3_client.upload_file(local_path, bucket_name, file)
                uploaded_files.append(file)
                print(f"📤 Uploaded: {file}")
            except Exception as e:
                print(f"❌ Failed to upload {file}: {str(e)}")
    
    return uploaded_files

# Upload documents to S3
print(f"📤 **Uploading files from {data_root} to bucket: {bucket_name}**")

if bucket_name:
    uploaded_files = uploadDirectory(data_root, bucket_name)
    print(f"✅ **Upload completed!** {len(uploaded_files)} files uploaded.")
else:
    print("⚠️ **Warning**: Please set the bucket_name variable before running this cell")
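`uploadDirectory` uses the bare file name as the S3 object key, so two files with the same name in different subdirectories of `data/` would overwrite each other in the bucket. A minimal sketch of one alternative is to derive the key from the path relative to the upload root. `s3_key_for` is a hypothetical helper, not part of the original notebook.

```python
import os


def s3_key_for(local_path, root):
    """Return an S3 object key that preserves the path relative to `root`.

    S3 keys conventionally use forward slashes, so the platform-specific
    separator is replaced.
    """
    relative = os.path.relpath(local_path, root)
    return relative.replace(os.sep, "/")
```

Inside the upload loop this would replace the key argument, for example `s3_client.upload_file(local_path, bucket_name, s3_key_for(local_path, path))`.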

## Knowledge Base Creation

### Configure Knowledge Base

In [None]:
# OpenSearch Serverless storage configuration: collection ARN, index name, and field mappings
opensearchServerlessConfiguration = {
    "collectionArn": collection["createCollectionDetail"]["arn"],
    "vectorIndexName": index_name,
    "fieldMapping": {
        "vectorField": "vector",
        "textField": "text",
        "metadataField": "text-metadata"
    }
}

# Chunking strategy: how documents from the data source are split before embedding
chunkingStrategyConfiguration = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
        "maxTokens": 512,
        "overlapPercentage": 20
    }
}

# The S3 data source whose documents are ingested into the OpenSearch Serverless index
s3Configuration = {
    "bucketArn": f"arn:aws:s3:::{bucket_name}",
}

# The embedding model Bedrock uses for both ingested documents and incoming queries
embeddingModelArn = f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v1"

name = "bedrock-sample-knowledge-base"
description = "FAQs"
roleArn = bedrock_kb_execution_role_arn

print(roleArn)
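With `maxTokens` set to 512 and `overlapPercentage` set to 20, consecutive chunks share roughly 102 tokens, so each new chunk advances by about 410 tokens. The back-of-the-envelope estimate below assumes a simple sliding window. Bedrock's actual chunker and tokenizer may behave differently, and `estimate_chunk_count` is a hypothetical helper used only to illustrate the arithmetic.

```python
import math


def estimate_chunk_count(total_tokens, max_tokens=512, overlap_percentage=20):
    """Estimate how many fixed-size chunks a document of `total_tokens` yields.

    Assumes a sliding window: each chunk holds up to `max_tokens` tokens and
    overlaps the previous chunk by `overlap_percentage` percent.
    """
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")
    if total_tokens <= 0:
        return 0
    if total_tokens <= max_tokens:
        return 1
    overlap = max_tokens * overlap_percentage // 100  # tokens shared with the previous chunk
    stride = max_tokens - overlap                     # new tokens covered by each extra chunk
    return 1 + math.ceil((total_tokens - max_tokens) / stride)
```

Under these assumptions a 1,000-token document produces three chunks: tokens 0 to 512, 410 to 922, and 820 to 1,000.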

### Create Knowledge Base

In [None]:
# Create the knowledge base

def create_knowledge_base():
    create_kb_response = bedrock_agent_client.create_knowledge_base(
        name = name,
        description = description,
        roleArn = roleArn,
        knowledgeBaseConfiguration = {
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                "embeddingModelArn": embeddingModelArn
            }
        },
        storageConfiguration = {
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration":opensearchServerlessConfiguration
        }
    )
    return create_kb_response["knowledgeBase"]

In [None]:
try:
    kb = create_knowledge_base()
    pp.pprint(kb)
except Exception as err:
    # Re-raise so later cells do not run with an undefined `kb`
    print(f"{err=}, {type(err)=}")
    raise
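Newly created IAM roles and OpenSearch Serverless access policies can take some time to propagate, so the first `create_knowledge_base` call may fail even when the configuration is correct. One common mitigation is to retry with exponential backoff. The sketch below is generic: `retry_with_backoff`, its defaults, and the choice to retry on any exception are assumptions for illustration, and in practice you would likely narrow the `except` clause to `ClientError`.

```python
import time


def retry_with_backoff(fn, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call `fn()` until it succeeds, doubling the wait after each failure.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last error once `attempts` calls have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as err:  # illustrative; narrow this in real code
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Applied to this notebook, the call would be `kb = retry_with_backoff(create_knowledge_base)`.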

In [None]:
# Get Knowledge Base info
get_kb_response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId = kb['knowledgeBaseId'])
pp.pprint(get_kb_response)

### Create Data Source

In [None]:
# Create a data source in knowledge base 
create_ds_response = bedrock_agent_client.create_data_source(
    name = name,
    description = description,
    knowledgeBaseId = kb['knowledgeBaseId'],
    dataSourceConfiguration = {
        "type": "S3",
        "s3Configuration":s3Configuration
    },
    vectorIngestionConfiguration = {
        "chunkingConfiguration": chunkingStrategyConfiguration
    }
)
ds = create_ds_response["dataSource"]
pp.pprint(ds)

In [None]:
# Get data source info
bedrock_agent_client.get_data_source(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])

## Document Ingestion

### Start Ingestion Job

In [None]:
# Start an ingestion job
start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId = kb['knowledgeBaseId'], dataSourceId = ds["dataSourceId"])
job = start_job_response["ingestionJob"]
pp.pprint(job)

In [None]:
# Check the ingestion job status (re-run this cell until the status is COMPLETE)
get_job_response = bedrock_agent_client.get_ingestion_job(
    knowledgeBaseId = kb['knowledgeBaseId'],
    dataSourceId = ds["dataSourceId"],
    ingestionJobId = job["ingestionJobId"]
)
job = get_job_response["ingestionJob"]

pp.pprint(job)
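`start_ingestion_job` returns immediately and ingestion runs asynchronously, so queries issued before the job finishes may see an empty or partial index. Instead of re-running the status cell by hand, the job can be polled until it reaches a terminal state. The sketch below assumes the job statuses `COMPLETE`, `FAILED`, and `STOPPED` are terminal; `wait_for_ingestion_job` and its timing defaults are hypothetical.

```python
import time

# Assumed terminal states for a Bedrock ingestion job
TERMINAL_STATES = {"COMPLETE", "FAILED", "STOPPED"}


def wait_for_ingestion_job(client, kb_id, ds_id, job_id, poll_seconds=10,
                           max_attempts=90, sleep=time.sleep):
    """Poll `get_ingestion_job` until the job reaches a terminal state.

    `client` is a boto3 'bedrock-agent' client or a compatible object.
    Returns the final `ingestionJob` dict so the caller can inspect it.
    """
    for _ in range(max_attempts):
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job_id,
        )["ingestionJob"]
        if job["status"] in TERMINAL_STATES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Ingestion job {job_id} did not finish in time")
```

A call such as `job = wait_for_ingestion_job(bedrock_agent_client, kb['knowledgeBaseId'], ds['dataSourceId'], job['ingestionJobId'])` would then return the finished job, whose `status` should be checked before querying.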

In [None]:
# Keep the knowledge base ID for the query calls below.
# It identifies the knowledge base backed by the OpenSearch index created earlier.
kb_id = kb["knowledgeBaseId"]
pp.pprint(kb_id)

## Testing the Knowledge Base

### Query the Knowledge Base

In [None]:
# Try out the knowledge base using the RetrieveAndGenerate API
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region_name)
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

In [None]:
def ask_bedrock_llm_with_knowledge_base(query: str, model_arn: str, kb_id: str) -> dict:
    response = bedrock_agent_runtime_client.retrieve_and_generate(
        input={
            'text': query
        },
        retrieveAndGenerateConfiguration={
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kb_id,
                'modelArn': model_arn
            }
        },
    )

    return response
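RetrieveAndGenerate performs retrieval and answer generation in a single call. To inspect which chunks the knowledge base returns for a query, without generating an answer, the `bedrock-agent-runtime` client also offers a `retrieve` operation. The sketch below wraps it; `retrieve_chunks` and the `top_k` default are hypothetical, and the client is passed in as a parameter.

```python
def retrieve_chunks(runtime_client, kb_id, query, top_k=5):
    """Return (score, text) pairs for the chunks most similar to `query`.

    `runtime_client` is a boto3 'bedrock-agent-runtime' client or a
    compatible object exposing `retrieve`.
    """
    response = runtime_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [
        (result.get("score"), result["content"]["text"])
        for result in response["retrievalResults"]
    ]
```

For example, `retrieve_chunks(bedrock_agent_runtime_client, kb_id, query)` would list the passages that a generated answer could draw on, which helps when debugging chunking or embedding choices.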

In [None]:
query = "How can I find out if you have a product in stock?"

model_arn = f'arn:aws:bedrock:{region_name}::foundation-model/{model_id}'
response = ask_bedrock_llm_with_knowledge_base(query, model_arn, kb_id)
generated_text = response['output']['text']
citations = response["citations"]
contexts = []
for citation in citations:
    retrievedReferences = citation["retrievedReferences"]
    for reference in retrievedReferences:
        contexts.append(reference["content"]["text"])
print(f"---------- Generated using {model_id}:")
pp.pprint(generated_text)
print(f"---------- The citations for the response generated by {model_id}:")
pp.pprint(contexts)
print()
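The loop above collects the cited text passages. For source attribution it can also be useful to list which documents those passages came from. The sketch below assumes each retrieved reference carries an S3 location under `location.s3Location.uri`, as is typical for S3 data sources; `extract_sources` is a hypothetical helper.

```python
def extract_sources(response):
    """Return the distinct S3 URIs cited in a RetrieveAndGenerate response.

    Order of first appearance is preserved; references without an S3
    location are skipped.
    """
    sources = []
    for citation in response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            uri = reference.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in sources:
                sources.append(uri)
    return sources
```

Calling `extract_sources(response)` on the response from the query above would give the list of source documents behind the answer.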

## Conclusion

In this notebook, we built a working Knowledge Base using Amazon Bedrock and OpenSearch Serverless. The result is an end-to-end Retrieval Augmented Generation (RAG) pipeline that answers questions from your own document collection.

Key components we established:

1. **Vector Store**: We set up OpenSearch Serverless as our vector database, configuring the proper schema for storing embeddings and text.

2. **Access Control**: We created the necessary IAM roles and policies to allow secure interaction between Bedrock and OpenSearch Serverless.

3. **Document Processing**: We implemented a chunking strategy that breaks documents into fixed-size chunks with overlap for better context preservation.

4. **Embedding Generation**: We configured the system to use Amazon Titan Text Embeddings to convert document chunks into vector representations.

5. **Query Processing**: We used the RetrieveAndGenerate API to retrieve relevant chunks and have Claude 3 Sonnet generate an answer grounded in them, with citations.

This Knowledge Base architecture provides several advantages:

- **Scalability**: OpenSearch Serverless automatically scales to accommodate growing document collections
- **Semantic Understanding**: Vector search finds contextually relevant information, not just keyword matches
- **Source Attribution**: Citations provide transparency about which documents informed the response
- **Up-to-date Knowledge**: Answers draw on your own current documents rather than only on what the foundation model saw during training

Potential applications include:
- Customer support knowledge bases
- Employee portals for company policies and procedures
- Research assistants for analyzing large document collections
- Product documentation and FAQ systems

This approach pairs the general language capabilities of a foundation model with the specificity of your proprietary documents, and the citations let you check each answer against its sources.