## Create a Knowledge Base with fixed chunking strategy

Chunking data is essential. If you add large documents with hundreds of pages to your knowledge base, you need to split them up and return only the relevant sections as context for inference. Returning too much context increases cost (models charge by input token count) and latency, and it can also harm output quality. Shorter chunks tend to match a query more precisely but may lack the context needed to answer the question.
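
To make the cost point concrete, here is a minimal sketch that estimates an upper bound on the context tokens added to a prompt per query. The function names and the price per 1,000 input tokens are hypothetical placeholders, not real pricing.

```python
def max_context_tokens(num_results: int, max_tokens_per_chunk: int) -> int:
    """Upper bound on the retrieved-context tokens added to one prompt."""
    return num_results * max_tokens_per_chunk


def context_cost(num_results: int, max_tokens_per_chunk: int,
                 price_per_1k_input_tokens: float) -> float:
    """Estimated input cost of the retrieved context for a single query."""
    tokens = max_context_tokens(num_results, max_tokens_per_chunk)
    return tokens / 1000 * price_per_1k_input_tokens


# Example: 3 retrieved chunks of up to 1,024 tokens each
tokens = max_context_tokens(num_results=3, max_tokens_per_chunk=1024)
print(tokens)

# 0.003 is an assumed, illustrative price per 1,000 input tokens
print(context_cost(3, 1024, price_per_1k_input_tokens=0.003))
```

Doubling either the chunk size or the number of results doubles this bound, which is why both settings are worth tuning together.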

Bedrock Knowledge Bases offers several chunking strategies. They range from fixed-size splitting to splitting at semantic boundaries such as paragraphs, and to hierarchical structures. However, some document types benefit from custom chunking. For example, a custom chunking approach can use any markup in the document to decide where to split.

You can also create your own custom chunking approach using a Lambda function. If you want to add any custom metadata then you will need to add a Lambda function. You can either handle the chunking yourself, edit an existing chunk or just add metadata. Metadata can then be used for filtering.
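
As a rough illustration of metadata filtering, the sketch below builds a `retrievalConfiguration` with an `equals` filter for the `retrieve` call. The helper name and the metadata key `year` are assumptions for illustration; your documents would need to carry that metadata for the filter to match anything.

```python
def build_filtered_retrieval_config(key: str, value, number_of_results: int = 3) -> dict:
    """Build a retrievalConfiguration that only returns chunks whose
    metadata attribute `key` equals `value`."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": number_of_results,
            "filter": {"equals": {"key": key, "value": value}},
        }
    }


# "year" is a hypothetical metadata attribute
config = build_filtered_retrieval_config("year", 2024)
print(config)

# Usage (requires AWS credentials and an existing knowledge base):
# bedrock_agent_runtime.retrieve(
#     retrievalQuery={"text": "..."},
#     knowledgeBaseId="<your-knowledge-base-id>",
#     retrievalConfiguration=config,
# )
```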

It is important to tune your chunking to the type of documents being ingested. A poorly chosen chunk size affects accuracy and response times, and it increases costs in both the vector storage and inference steps. The defaults supplied in Bedrock are a good starting point, but they may need to be tailored to your specific circumstances. Longer and more technical documents may need larger chunk sizes so that each chunk includes more context. Speech (like a chat transcript) can benefit from shorter chunks.

![Chunking Strategies](./chunking-strategies.png)


## Overview

In this notebook, we will implement a knowledge base using a fixed chunking strategy. Here are the key steps we'll perform:

1. **Create a Knowledge Base**: Set up an Amazon Bedrock Knowledge Base with fixed-size chunking configuration that will store and retrieve our vector embeddings.

2. **Create a Data Source**: Connect our Knowledge Base to the documents we uploaded to S3 in the previous notebook.

3. **Start Ingestion Job**: Begin the process of transforming our documents into chunks, creating embeddings, and storing them in our vector database.

4. **Retrieve and Generate**: Test our Knowledge Base by retrieving relevant information based on a sample query.

#### Concept

**Fixed Chunking**: Divides your documents into fixed-size chunks, regardless of the content within them. Each chunk contains up to a predefined number of tokens or characters, which gives a more uniform data organization.

Fixed chunking is useful when you want to ensure that your chunks are of a consistent size, making them easier to process and retrieve in a predictable manner. The document is split into sections of equal length, and each section becomes a separate chunk. This method works well when the content is relatively homogeneous, and the chunk boundaries are not as crucial to understanding the underlying context.
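
The idea can be sketched in a few lines of Python. This is a simplified illustration, not the Bedrock implementation: it assumes whitespace-separated words stand in for model tokens, and the function name is hypothetical.

```python
def fixed_size_chunks(text: str, max_tokens: int, overlap_percentage: int) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, where adjacent
    chunks share roughly overlap_percentage percent of max_tokens."""
    tokens = text.split()  # assumption: whitespace tokens approximate model tokens
    overlap = max_tokens * overlap_percentage // 100
    step = max(1, max_tokens - overlap)  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = fixed_size_chunks(sample, max_tokens=5, overlap_percentage=20)
for chunk in chunks:
    print(chunk)
# w0 w1 w2 w3 w4
# w4 w5 w6 w7 w8
# w8 w9
```

Note how each chunk starts with the last token of the previous one: the overlap is what carries some context across the otherwise arbitrary boundaries.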

#### Benefits

- **Uniformity**: Each chunk has the same size, making the system more predictable. This helps with processing efficiency since you know that each chunk is of a consistent size, making batch operations and parallel processing easier.
- **Simplified Retrieval**: Since the chunk sizes are uniform, searching through the data becomes straightforward. You can quickly determine the length of chunks, which can be useful for performance optimization and scalability in large datasets.
- **Performance Optimization**: Fixed chunks are ideal when you want to control the computational cost of document retrieval and chunking. Having equal-sized chunks reduces the chance of computational bottlenecks in scenarios requiring large-scale document processing.

> **Note:** While fixed chunking can be efficient for certain use cases, it may not preserve the natural semantic boundaries of the content, such as paragraphs or sections. This may lead to chunks that start or end at arbitrary places, potentially cutting off context in the middle of a sentence or idea.

#### Best Use Cases
Fixed chunking is suitable for cases where:
- **Homogeneous content**: The content is consistent, and boundaries are not as important.
- **Performance**: You need uniform-sized chunks for predictable processing or optimization of large-scale systems.
- **Simplified text processing**: When chunk boundaries do not need to match natural semantic structures like paragraphs or sentences.

Examples include:
- **General document indexing**: When large datasets are involved, and uniform chunk sizes optimize retrieval.
- **Text summarization**: Fixed chunking is helpful when generating summaries from uniformly sized data pieces.


In [1]:
import json
with open("variables.json", "r") as f:
    variables = json.load(f)

variables

{'accountNumber': '307297743176',
 'regionName': 'us-west-2',
 'collectionArn': 'arn:aws:aoss:us-west-2:307297743176:collection/h7cmj732p9d3v91spkhd',
 'collectionId': 'h7cmj732p9d3v91spkhd',
 'vectorIndexName': 'ws-index-',
 'bedrockExecutionRoleArn': 'arn:aws:iam::307297743176:role/advanced-rag-workshop-bedrock_execution_role-us-west-2',
 's3Bucket': '307297743176-us-west-2-advanced-rag-workshop'}

### 1. Create a Knowledge Base

In [2]:
from retrying import retry
import boto3

# Initialize the Bedrock Agent client using the provided AWS region
bedrock_agent = boto3.client("bedrock-agent", region_name=variables["regionName"])

# Retry decorator: If the function fails, it will retry up to 3 times with a random wait time between 1-2 seconds
@retry(wait_random_min=1000, wait_random_max=2000, stop_max_attempt_number=3)
def create_knowledge_base_func(name, description, chunking_type):
    """
    Creates a knowledge base in Amazon Bedrock with OpenSearch Serverless as the vector store.
    
    Parameters:
        name (str): The name of the knowledge base.
        description (str): A description of the knowledge base.
        chunking_type (str): The type of chunking strategy applied to vector indexing.

    Returns:
        dict: The response containing details of the created knowledge base.
    """
    
    # Define the ARN of the embedding model used for vectorization
    embedding_model_arn = f"arn:aws:bedrock:{variables['regionName']}::foundation-model/amazon.titan-embed-text-v2:0"

    # Configure OpenSearch Serverless for vector storage
    opensearch_serverless_configuration = {
        "collectionArn": variables["collectionArn"],  # ARN of the OpenSearch collection
        "vectorIndexName": variables["vectorIndexName"] + chunking_type,  # Index name based on chunking strategy
        "fieldMapping": {  # Define field mappings for vectors, text, and metadata
            "vectorField": "vector",
            "textField": "text",
            "metadataField": "text-metadata"
        }
    }

    print(opensearch_serverless_configuration)  # Print configuration for debugging

    # Create the knowledge base in Amazon Bedrock
    create_kb_response = bedrock_agent.create_knowledge_base(
        name=name,
        description=description,
        roleArn=variables["bedrockExecutionRoleArn"],  # IAM Role ARN for Bedrock execution
        knowledgeBaseConfiguration={
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                "embeddingModelArn": embedding_model_arn  # Reference to the embedding model
            }
        },
        storageConfiguration={
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration": opensearch_serverless_configuration
        }
    )

    return create_kb_response["knowledgeBase"]  # Return the created knowledge base details

In [3]:
import boto3
import json

try:
    # Create a knowledge base using the predefined function
    kb = create_knowledge_base_func(
        name="advanced-rag-workshop-fixed-chunking",
        description="Knowledge base using Amazon OpenSearch Service as a vector store",
        chunking_type="fixed"
    )

    # Retrieve details of the newly created knowledge base
    get_kb_response = bedrock_agent.get_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])

    # Update the variables dictionary with the new knowledge base ID
    variables["kbFixedChunk"] = kb['knowledgeBaseId']

    # Save updated variables to a JSON file, handling datetime serialization
    with open("variables.json", "w") as f:
        json.dump(variables, f, indent=4, default=str)  # Convert datetime to string

    # Print the retrieved knowledge base response in a readable format
    print(f'OpenSearch Knowledge Response: {json.dumps(get_kb_response, indent=4, default=str)}')
    
except Exception as e:
    # Check if error message indicates the knowledge base already exists
    error_message = str(e).lower()
    if any(phrase in error_message for phrase in ["already exist", "duplicate", "already been created"]):
        print("Knowledge Base already exists. Retrieving its ID...")
        
        # List all knowledge bases to find the one that already exists
        list_kb_response = bedrock_agent.list_knowledge_bases()
        
        # Look for a knowledge base with the desired name
        for kb in list_kb_response.get('knowledgeBaseSummaries', []):
            if kb['name'] == "advanced-rag-workshop-fixed-chunking":
                kb_id = kb['knowledgeBaseId']
                print(f"Found existing knowledge base with ID: {kb_id}")
                
                # Get the details of the existing knowledge base
                get_kb_response = bedrock_agent.get_knowledge_base(knowledgeBaseId=kb_id)
                
                # Read the existing values first so that other keys are preserved
                try:
                    # Read existing variables
                    with open("variables.json", "r") as f:
                        existing_variables = json.load(f)
                except (FileNotFoundError, json.JSONDecodeError):
                    # If file doesn't exist or is invalid JSON
                    existing_variables = {}
                
                # Update only the fixed chunking value
                existing_variables["kbFixedChunk"] = kb_id
                                
                # Write back all variables
                with open("variables.json", "w") as f:
                    json.dump(existing_variables, f, indent=4, default=str)
                
                # Print the retrieved knowledge base response
                print(f'OpenSearch Knowledge Response: {json.dumps(get_kb_response, indent=4, default=str)}')
                break
        else:
            print("Could not find a knowledge base with the specified name.")
    else:
        # If it's a different error, re-raise it
        raise e

{'collectionArn': 'arn:aws:aoss:us-west-2:307297743176:collection/h7cmj732p9d3v91spkhd', 'vectorIndexName': 'ws-index-fixed', 'fieldMapping': {'vectorField': 'vector', 'textField': 'text', 'metadataField': 'text-metadata'}}
OpenSearch Knowledge Response: {
    "ResponseMetadata": {
        "RequestId": "e65d6118-0bc9-41c9-b4ed-06bc60b14d79",
        "HTTPStatusCode": 200,
        "HTTPHeaders": {
            "date": "Mon, 07 Apr 2025 15:25:29 GMT",
            "content-type": "application/json",
            "content-length": "956",
            "connection": "keep-alive",
            "x-amzn-requestid": "e65d6118-0bc9-41c9-b4ed-06bc60b14d79",
            "x-amz-apigw-id": "IqIwiEgJPHcEoHQ=",
            "x-amzn-trace-id": "Root=1-67f3ee69-352184af3ece6a544ce8f3bb"
        },
        "RetryAttempts": 0
    },
    "knowledgeBase": {
        "createdAt": "2025-04-07 15:25:28.518211+00:00",
        "description": "Knowledge base using Amazon OpenSearch Service as a vector store",
        "k

### 2. Create a Data Source for the Knowledge Base

In [4]:
import time
import boto3
from botocore.exceptions import ClientError

# Define the chunking strategy for data ingestion
chunking_strategy_configuration = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
        "maxTokens": 1024,
        "overlapPercentage": 20
    }
}

# Define the S3 bucket configuration for the data source
s3_configuration = {
    "bucketArn": f"arn:aws:s3:::{variables['s3Bucket']}",
    "inclusionPrefixes": ["data"]  # Only include objects with the "data" prefix
}

data_source_name = "advanced-rag-example"

# First, check if a data source with this name already exists in Bedrock (not just locally)
try:
    # List all data sources for the knowledge base
    list_ds_response = bedrock_agent.list_data_sources(
        knowledgeBaseId=kb['knowledgeBaseId']
    )
    
    # Check if our named data source exists
    existing_ds = None
    for ds in list_ds_response.get('dataSourceSummaries', []):
        if ds['name'] == data_source_name:
            existing_ds = ds
            break
    
    # If it exists, delete it
    if existing_ds:
        print(f"Found existing data source '{data_source_name}'. Deleting it...")
        bedrock_agent.delete_data_source(
            knowledgeBaseId=kb['knowledgeBaseId'],
            dataSourceId=existing_ds["dataSourceId"]
        )
        print("Waiting for data source deletion to complete...")
        time.sleep(10)
        print("Data source deleted successfully.")
        
except Exception as e:
    print(f"Error while checking or deleting data source: {e}")

# Now create a new data source
try:
    print(f"Creating new data source '{data_source_name}'...")
    create_ds_response = bedrock_agent.create_data_source(
        name=data_source_name,
        description="A data source for Advanced RAG workshop",
        knowledgeBaseId=kb['knowledgeBaseId'],
        dataSourceConfiguration={
            "type": "S3",
            "s3Configuration": s3_configuration
        },
        vectorIngestionConfiguration={
            "chunkingConfiguration": chunking_strategy_configuration
        }
    )
    
    # Store the created data source object
    ds_fixed_chunk = create_ds_response["dataSource"]
    print("Data source created successfully.")
    
except ClientError as e:
    if e.response['Error']['Code'] == 'ConflictException':
        print(f"Data source '{data_source_name}' still exists. Retrieving it...")
        # Get the existing data source
        list_ds_response = bedrock_agent.list_data_sources(
            knowledgeBaseId=kb['knowledgeBaseId']
        )
        for ds in list_ds_response.get('dataSourceSummaries', []):
            if ds['name'] == data_source_name:
                ds_fixed_chunk = ds
                print(f"Retrieved existing data source: {ds['dataSourceId']}")
                break
    else:
        raise e

# Print the data source information
print(ds_fixed_chunk)

Creating new data source 'advanced-rag-example'...
Data source created successfully.
{'createdAt': datetime.datetime(2025, 4, 7, 15, 25, 32, 958519, tzinfo=tzlocal()), 'dataDeletionPolicy': 'DELETE', 'dataSourceConfiguration': {'s3Configuration': {'bucketArn': 'arn:aws:s3:::307297743176-us-west-2-advanced-rag-workshop', 'inclusionPrefixes': ['data']}, 'type': 'S3'}, 'dataSourceId': 'KBFRA7RZBF', 'description': 'A data source for Advanced RAG workshop', 'knowledgeBaseId': '4P6PBDDEGL', 'name': 'advanced-rag-example', 'status': 'AVAILABLE', 'updatedAt': datetime.datetime(2025, 4, 7, 15, 25, 32, 958519, tzinfo=tzlocal()), 'vectorIngestionConfiguration': {'chunkingConfiguration': {'chunkingStrategy': 'FIXED_SIZE', 'fixedSizeChunkingConfiguration': {'maxTokens': 1024, 'overlapPercentage': 20}}}}
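
To get a feel for what `maxTokens: 1024` and `overlapPercentage: 20` mean in practice, the sketch below estimates how many chunks a document of a given token length would produce. It assumes the overlap is applied as a percentage of `maxTokens` between adjacent chunks; the exact behavior of the service may differ, and the function name is hypothetical.

```python
import math


def estimate_chunk_count(document_tokens: int, max_tokens: int = 1024,
                         overlap_percentage: int = 20) -> int:
    """Rough estimate of the number of fixed-size chunks for one document."""
    if document_tokens <= 0:
        return 0
    if document_tokens <= max_tokens:
        return 1
    overlap = max_tokens * overlap_percentage // 100  # 204 tokens for 1024 / 20%
    stride = max(1, max_tokens - overlap)             # 820 new tokens per chunk
    return 1 + math.ceil((document_tokens - max_tokens) / stride)


print(estimate_chunk_count(10_000))  # 12
```

A larger overlap reduces the stride, so the same document yields more chunks and therefore more embeddings to store.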


### 3. Start an Ingestion Job for the Amazon Bedrock Knowledge Base Backed by Amazon OpenSearch

> **Note**: The ingestion process will take approximately 2-3 minutes to complete. During this time, the system is processing your documents by:
> 1. Extracting text from the source files
> 2. Chunking the content according to the defined strategy (Fixed / Semantic / Hierarchical / Custom)
> 3. Generating embeddings for each chunk
> 4. Storing the embeddings and associated metadata in the OpenSearch vector database
>
> You'll see status updates as the process progresses. Please wait for the "Ingestion job completed successfully" message before proceeding to the next step.

In [5]:
import time

# List to keep track of all ingestion jobs
ingest_jobs = []

# Start an ingestion job for the data source
try:
    start_job_response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=kb['knowledgeBaseId'],  # ID of the Knowledge Base
        dataSourceId=ds_fixed_chunk["dataSourceId"]  # ID of the associated data source
    )
    
    # Extract job details
    job = start_job_response["ingestionJob"]
    print("Ingestion job started successfully.")

    # Poll until the job reaches a terminal state
    while job['status'] not in ('COMPLETE', 'FAILED', 'STOPPED'):
        # Wait between status checks
        print("running...")
        time.sleep(10)
        get_job_response = bedrock_agent.get_ingestion_job(
            knowledgeBaseId=kb['knowledgeBaseId'],  # ID of the Knowledge Base
            dataSourceId=ds_fixed_chunk["dataSourceId"],  # ID of the data source
            ingestionJobId=job["ingestionJobId"]  # ID of the running ingestion job
        )
        
        # Update job status
        job = get_job_response["ingestionJob"]
        print(f"Job status: {job['status']}")  # Log the current job status

    if job['status'] == 'COMPLETE':
        print("Ingestion job completed successfully.")
    else:
        # FAILED or STOPPED: surface the reasons reported by the service, if any
        print(f"Ingestion job ended with status {job['status']}: {job.get('failureReasons')}")

except Exception as e:
    print("Error: Couldn't start ingestion job.")
    print(e)  # Print the exact error message for debugging

Ingestion job started successfully.
running...
Job status: IN_PROGRESS
running...
Job status: IN_PROGRESS
running...
Job status: COMPLETE
Ingestion job completed successfully.
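
If you run ingestion in several notebooks, the polling logic can be wrapped in a small helper with a timeout. This is a sketch under assumptions: the helper name, the default timeout, and the poll interval are illustrative choices, and `client` is expected to be a `bedrock-agent` client such as `bedrock_agent` above.

```python
import time

TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def wait_for_ingestion(client, knowledge_base_id, data_source_id, ingestion_job_id,
                       poll_seconds=10, timeout_seconds=900):
    """Poll get_ingestion_job until the job reaches a terminal status.

    Returns the final ingestionJob dict, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=ingestion_job_id,
        )["ingestionJob"]
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Ingestion job still {job['status']} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Usage (requires AWS credentials and a started ingestion job):
# final_job = wait_for_ingestion(
#     bedrock_agent, kb["knowledgeBaseId"],
#     ds_fixed_chunk["dataSourceId"], job["ingestionJobId"],
# )
# print(final_job["status"], final_job.get("statistics"))
```

Returning the final job object lets the caller decide how to react to `FAILED` or `STOPPED` instead of assuming success.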


### 4. Retrieve

In [7]:
import boto3

# Initialize the Bedrock Agent Runtime client
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name=variables["regionName"])

# Define the query for retrieving relevant documents
query = "What were net incomes of Amazon in 2022, 2023 and 2024?" 

try:
    # Retrieve the top 3 most relevant documents from the knowledge base
    relevant_documents_os = bedrock_agent_runtime.retrieve(
        retrievalQuery={
            'text': query  # Query text for document retrieval
        },
        knowledgeBaseId=kb['knowledgeBaseId'],  # ID of the Knowledge Base to search in
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': 3  # Fetch the top 3 most relevant documents
            }
        }
    )

    # Confirm that the retrieval call succeeded
    print("Successfully retrieved relevant documents.")

except Exception as e:
    print("Error: Unable to retrieve relevant documents.")
    print(e)  # Print the error details for debugging

# Output the text of the top-ranked result
print(json.dumps(relevant_documents_os["retrievalResults"][0]["content"]["text"], indent=2))

Successfully retrieved relevant documents.
"CONSOLIDATED STATEMENTS OF OPERATIONS     (in millions, except per share data) Year Ended December 31,      2022 2023 2024     Net product sales $ 242,901 $ 255,887 $ 272,311 Net service sales 271,082 318,898 365,648     Total net sales 513,983 574,785 637,959 Operating expenses:     Cost of sales 288,831 304,739 326,288 Fulfillment 84,299 90,619 98,505 Technology and infrastructure 73,213 85,622 88,544 Sales and marketing 42,238 44,370 43,907 General and administrative 11,891 11,816 11,359 Other operating expense (income), net 1,263 767 763     Total operating expenses 501,735 537,933 569,366 Operating income 12,248 36,852 68,593 Interest income 989 2,949 4,677 Interest expense (2,367) (3,182) (2,406) Other income (expense), net (16,806) 938 (2,250)     Total non-operating income (expense) (18,184) 705 21 Income (loss) before income taxes (5,936) 37,557 68,614 Benefit (provision) for income taxes 3,217 (7,120) (9,265) Equity-method investmen

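The cell above prints only the first result. To compare all returned chunks, it can help to list each result's relevance score and source location next to a short preview. The helper below is a hypothetical sketch that works on the response dictionary returned by `retrieve`; the sample response is made-up data used only to show the shape.

```python
def summarize_results(response: dict, preview_chars: int = 80) -> list[dict]:
    """Return rank, score, S3 source URI and a text preview for each result."""
    summary = []
    for rank, result in enumerate(response.get("retrievalResults", []), start=1):
        text = result["content"]["text"]
        summary.append({
            "rank": rank,
            "score": result.get("score"),
            "source": result.get("location", {}).get("s3Location", {}).get("uri"),
            "preview": text[:preview_chars],
        })
    return summary


# Made-up response with the same structure as a retrieve() result
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "Example chunk text about net income."},
            "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/data/report.pdf"}},
            "score": 0.61,
        }
    ]
}
rows = summarize_results(sample_response)
for row in rows:
    print(row)
```

In the notebook, you would call `summarize_results(relevant_documents_os)` instead of passing the sample data.
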
> **Note**: After creating the knowledge base, you can explore its details and settings in the Amazon Bedrock console. This gives you a more visual interface to understand how the knowledge base is structured.
> 
> **[➡️ View your Knowledge Bases in the AWS Console](https://us-west-2.console.aws.amazon.com/bedrock/home?region=us-west-2#/knowledge-bases)**
>
> In the console, you can:
> - See all your knowledge bases in one place
> - View ingestion status and statistics
> - Test queries through the built-in chat interface
> - Modify settings and configurations
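
The overview mentions generation as well as retrieval. A minimal sketch of the request for the `retrieve_and_generate` API is shown below. The helper name is hypothetical, and the model ARN is a placeholder: you would substitute a text model that is enabled in your account and Region.

```python
def build_rag_request(query: str, knowledge_base_id: str, model_arn: str) -> dict:
    """Build keyword arguments for bedrock-agent-runtime retrieve_and_generate."""
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }


# Placeholder values: replace with your knowledge base ID and a model ARN you can use
request = build_rag_request(
    query="What were net incomes of Amazon in 2022, 2023 and 2024?",
    knowledge_base_id="<your-knowledge-base-id>",
    model_arn="<your-model-arn>",
)
print(request)

# Usage (requires AWS credentials and model access):
# response = bedrock_agent_runtime.retrieve_and_generate(**request)
# print(response["output"]["text"])
```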