## RAG Pre-Retrieval Optimization - Advanced chunking strategies

Chunking breaks text data into segments before embedding, with the aim of improving retrieval efficiency and making better use of the context window of the downstream foundation model. Amazon Bedrock Knowledge Bases natively support a variety of chunking strategies to reduce your operational burden.

In this lab, we will use Amazon's SEC 10-K statements, which were prepared and uploaded to S3 during workshop setup, to create three knowledge bases using three natively supported chunking strategies. We will then compare the pros and cons of each. For more details, please refer to [How content chunking and parsing works for knowledge bases](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-chunking-parsing.html).

| None | Fixed Chunking | Hierarchical Chunking | Semantic Chunking |
|------|----------------|-----------------------|-------------------|
| Each file is treated as a single chunk. This approach is useful when you want to maintain the integrity of each document or product description. | Fixed chunking is a basic strategy that splits large text documents into smaller, uniform segments. It optimizes the retrieval process for Retrieval-Augmented Generation (RAG) systems by breaking documents down into manageable chunks. While easy to implement and understand, fixed chunking may sometimes split sentences or concepts across chunk boundaries. However, you can adjust the chunk size and overlap parameters to tune the results. In general, fixed chunking is most suitable for simple, structured documents. | Hierarchical chunking organizes your data into a hierarchical structure, allowing for more granular and efficient retrieval based on the inherent relationships within your data. When the documents are parsed, they are first chunked according to the parent and child chunk sizes: parent chunks (higher level) represent larger units (e.g., documents or sections), and child chunks (lower level) represent smaller units (e.g., paragraphs or sentences). The semantic search is done on the child chunks, but the parent chunks are returned during retrieval, which gives the foundation model more comprehensive context. Hierarchical chunking is best suited for complex documents with nested or hierarchical structures, such as technical manuals, legal documents, or academic papers with complex formatting and nested tables. | Semantic chunking is the most computationally intensive strategy because it uses an embedding model to compare segments and combine those that are semantically similar. This approach helps preserve the integrity of the information during retrieval, which supports accurate and contextually appropriate results. By focusing on the text's meaning and context, semantic chunking can improve retrieval quality and should be used in scenarios where maintaining the semantic integrity of the text is crucial. |
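
To make the table concrete, here is a minimal sketch of the fixed-size and hierarchical ideas in plain Python. It is a toy illustration and not how Bedrock implements chunking internally: it treats list items as "tokens", and the function names `fixed_size_chunks` and `hierarchical_chunks` are hypothetical.

```python
def fixed_size_chunks(tokens, max_tokens, overlap_percentage=0):
    """Split a token list into windows of at most max_tokens with overlap."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")
    # number of tokens shared between consecutive chunks
    overlap = max_tokens * overlap_percentage // 100
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # stop once the current window already reaches the end of the input
        if start + max_tokens >= len(tokens):
            break
    return chunks


def hierarchical_chunks(tokens, parent_max_tokens, child_max_tokens):
    """Build large parent chunks, then split each parent into child chunks."""
    hierarchy = []
    for parent in fixed_size_chunks(tokens, parent_max_tokens):
        children = fixed_size_chunks(parent, child_max_tokens)
        hierarchy.append({"parent": parent, "children": children})
    return hierarchy


words = "one two three four five six seven eight nine ten".split()
print(fixed_size_chunks(words, max_tokens=4, overlap_percentage=25))
print(hierarchical_chunks(words, parent_max_tokens=6, child_max_tokens=3))
```

In the hierarchical case, a search would match against the small `children` lists, while the corresponding `parent` is what gets handed to the foundation model.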

## Pre-req
You must run the [`workshop_setup.ipynb`](../lab00-setup/workshop_setup.ipynb) notebook in `lab00-setup` before starting this lab.

In [None]:
import warnings
warnings.warn("Warning: if you did not run lab00-setup, please go back and run the lab00 notebook") 

### Load the parameters

In [None]:
print("Lab parameters....\n")
%store -r amzn10k_prefix
%store -r amzn10k_s3_path
%store -r bucket
print(amzn10k_prefix)
print(amzn10k_s3_path)
print(bucket)

print("\nload the vector db parameters....\n")
# vector parameters stored from Initial setup lab02
%store -r vector_collection_arn
%store -r vector_collection_id
%store -r vector_host
%store -r bedrock_kb_execution_role_arn
## check all 4 values are printed and do not fail
print(vector_collection_arn)
print(vector_collection_id)
print(vector_host)
print(bedrock_kb_execution_role_arn)

### Initialize other parameters

In [None]:
import random
import time
import boto3
import sys

sys.path.append('../lab00-setup')
from knowledge_base import BedrockKnowledgeBase

# boto3 session, used to resolve the current region
boto3_session = boto3.Session()
region_name = boto3_session.region_name
# runtime client for the Retrieve and RetrieveAndGenerate APIs
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region_name)
# Claude 3 Sonnet is used for generation; swap in another Bedrock model ID to compare
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
model_arn = f'arn:aws:bedrock:{region_name}::foundation-model/{model_id}'

### Create the Knowledge Bases w/ different chunking strategy

Let's create the Knowledge Bases for Amazon Bedrock to store Amazon's SEC 10-K statements. Knowledge Bases allow you to integrate with different vector databases, including Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, Redis Enterprise and MongoDB Atlas. For this example, we will integrate the knowledge base with Amazon OpenSearch Serverless. The embedding model we use is `amazon.titan-embed-text-v2:0`.

The possible values for the `chunkingStrategy` attribute are `NONE`, `FIXED_SIZE`, `HIERARCHICAL` and `SEMANTIC`. `NONE` was used in the previous Naive RAG labs. Now we are going to try the other three.
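
For reference, the sketch below shows roughly what the `chunkingConfiguration` block of a Bedrock `CreateDataSource` request looks like for each strategy. The field names follow the `bedrock-agent` API; the numeric values are illustrative assumptions, and `build_chunking_configuration` is a hypothetical helper. Whether the workshop's `BedrockKnowledgeBase` class builds exactly this structure is not confirmed here.

```python
def build_chunking_configuration(strategy):
    """Return a chunkingConfiguration dict for the given strategy name."""
    if strategy == "NONE":
        return {"chunkingStrategy": "NONE"}
    if strategy == "FIXED_SIZE":
        return {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,         # illustrative chunk size
                "overlapPercentage": 20,  # illustrative overlap between chunks
            },
        }
    if strategy == "HIERARCHICAL":
        return {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # first level is the parent size, second level the child size
                "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
                "overlapTokens": 60,      # illustrative overlap in tokens
            },
        }
    if strategy == "SEMANTIC":
        return {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "maxTokens": 300,                     # illustrative chunk size
                "bufferSize": 1,                      # neighbouring sentences to include
                "breakpointPercentileThreshold": 95,  # dissimilarity cut-off
            },
        }
    raise ValueError(f"Unknown chunking strategy: {strategy}")


for name in ["FIXED_SIZE", "HIERARCHICAL", "SEMANTIC"]:
    print(name, build_chunking_configuration(name))
```

A dict like this would be passed as `vectorIngestionConfiguration={"chunkingConfiguration": ...}` when creating the data source.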

In [None]:
%%time

kb_mapping = dict()

for chunking_strategy in ["FIXED_SIZE", "HIERARCHICAL", "SEMANTIC"]:

    # create an entry for each chunking strategy
    kb_mapping[chunking_strategy] = dict()
    
    # Create knowledge base
    suffix = random.randrange(200, 900)
    # replace underscores so that FIXED_SIZE becomes "fixed-size" in the name
    kb_name = f"bedrock-{chunking_strategy.lower().replace('_', '-')}-{suffix}"
    index_name = f"bedrock-{chunking_strategy.lower().replace('_', '')}-{suffix}"
    description = "This knowledge base contains Amazon 10-K financial documents from 2022 and 2023"
    
    knowledge_base = BedrockKnowledgeBase(
        kb_name=kb_name,
        kb_description=description,
        data_bucket_name=bucket,
        data_prefix=[amzn10k_prefix],
        vector_collection_arn=vector_collection_arn,
        vector_collection_id=vector_collection_id,
        vector_host=vector_host,
        bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn,
        index_name=index_name,
        suffix=suffix,
        chunking_strategy=chunking_strategy
    )

    kb_mapping[chunking_strategy]["KnowledgeBase"] = knowledge_base
    # give the knowledge base time to become available
    time.sleep(30)
    # Start the data ingestion
    knowledge_base.start_ingestion_job()
    kb_mapping[chunking_strategy]["KbId"] = knowledge_base.get_knowledge_base_id()
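
The fixed `time.sleep(30)` above is a simple way to wait for the knowledge base to become available. As an alternative, the sketch below polls the knowledge base status through the `bedrock-agent` `get_knowledge_base` call until it reports `ACTIVE`. The helper name `wait_until_kb_active` and its parameters are hypothetical, and the workshop's `BedrockKnowledgeBase` class may already handle this internally.

```python
import time


def wait_until_kb_active(client, knowledge_base_id, poll_seconds=5,
                         timeout_seconds=300, sleep=time.sleep):
    """Poll get_knowledge_base until the status is ACTIVE or a failure occurs."""
    waited = 0
    while True:
        response = client.get_knowledge_base(knowledgeBaseId=knowledge_base_id)
        status = response["knowledgeBase"]["status"]
        if status == "ACTIVE":
            return status
        # anything other than a transitional state is treated as an error
        if status not in ("CREATING", "UPDATING"):
            raise RuntimeError(f"Knowledge base {knowledge_base_id} is in state {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Knowledge base {knowledge_base_id} not ACTIVE after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

With real AWS credentials this could be called as `wait_until_kb_active(boto3.client("bedrock-agent"), kb_id)` in place of the fixed sleep.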
    

### Prompt to test

We are going to run the same prompt against each knowledge base so that we can compare the chunking strategies. Two prompts you can try:

"What is Amazon doing in the field of entertainment, movies and cinema?"
"Key challenges faced by Amazon in year 2022 and 2023"

In [None]:
prompt = "Key challenges faced by Amazon in year 2022 and 2023"

### Generate and render the response

In [None]:
for chunking_strategy in ["FIXED_SIZE", "HIERARCHICAL", "SEMANTIC"]:

    print("========================================================================================")
    print(f"Generate a response using the ({chunking_strategy}) chunking knowledge base")
    
    # retrieve relevant chunks and generate an answer in one call
    generation_response = bedrock_agent_runtime_client.retrieve_and_generate(
        input={
            "text": prompt
        },
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_mapping[chunking_strategy]["KbId"],
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": 3
                    }
                }
            }
        }
    )
    # generated text output
    kb_mapping[chunking_strategy]["Response"] = generation_response['output']['text']
    
    # retrieve only, to inspect the chunks returned for the same prompt
    retrieval_response = bedrock_agent_runtime_client.retrieve(
        knowledgeBaseId=kb_mapping[chunking_strategy]["KbId"],
        retrievalQuery={
            'text': prompt
        },
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                "numberOfResults": 3
            }
        }
    )

    # retrieved chunks (search results)
    kb_mapping[chunking_strategy]["SearchResults"] = retrieval_response['retrievalResults']
    
    print("========================================================================================")
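
Besides reading the generated answers, it can help to compare the retrieved chunks numerically. The sketch below summarizes a list of `retrievalResults` by average relevance score and average chunk length; it assumes each result carries a `score` and a `content.text` field, as returned by the `Retrieve` API. `summarize_retrieval` is a hypothetical helper, and the sample data is made up for illustration.

```python
def summarize_retrieval(search_results):
    """Summarize retrieval results by count, average score and average length."""
    scores = [r["score"] for r in search_results if "score" in r]
    lengths = [len(r["content"]["text"]) for r in search_results]
    return {
        "num_results": len(search_results),
        "avg_score": sum(scores) / len(scores) if scores else None,
        "avg_chars": sum(lengths) / len(lengths) if lengths else 0,
    }


# made-up results in the same shape as the Retrieve API output
sample_results = [
    {"content": {"text": "abcd"}, "score": 0.5},
    {"content": {"text": "ab"}, "score": 0.25},
]
print(summarize_retrieval(sample_results))
```

In this notebook, the helper could be applied to each strategy with `summarize_retrieval(kb_mapping[key]["SearchResults"])`. Hierarchical chunking would be expected to show longer chunks, since parent chunks are returned.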

In [None]:
import pandas as pd
from IPython.display import display, HTML

# First, determine the maximum length needed
max_length = 0
for key in kb_mapping:
    # Count response + separator + search results
    current_length = 2 + len(kb_mapping[key]["SearchResults"])  # 2 for response and separator
    max_length = max(max_length, current_length)

display_map = dict()
        
# reformat results
for key in kb_mapping:
    display_map[key] = []
    
    # Add response
    response = kb_mapping[key]["Response"]
    display_map[key].append(response)
    
    # Add separator
    display_map[key].append("======[Search Results]======")
    
    # Add search results
    for result in kb_mapping[key]["SearchResults"]:
        display_map[key].append(f'{result["content"]["text"][:1000]}...')
    
    # Pad with empty strings if needed
    while len(display_map[key]) < max_length:
        display_map[key].append("")

# Create DataFrame
df = pd.DataFrame(display_map)

# render the table without the index column, using the public Styler API
output = df.style.hide().to_html() + "&nbsp;"

display(HTML(output))

In a sample run, the three responses about Amazon's challenges (one per chunking strategy: FIXED_SIZE, HIERARCHICAL, and SEMANTIC) cover similar core challenges but present them slightly differently. Your own results may vary from run to run.

Common Themes Across All Three:
1. Foreign exchange rate fluctuations impact
2. Economic conditions and geopolitical changes
3. Supply chain constraints
4. Labor market challenges
5. COVID-19 pandemic effects
6. Interest rate concerns

Key Differences:

FIXED_SIZE:
- Most concise presentation
- Focuses on specific financial impacts (e.g., 210 basis points impact on net sales)
- Provides specific forecasts for Q1 2023 and Q1 2024

HIERARCHICAL:
- More detailed organization of challenges
- Emphasizes operational aspects like product mix and third-party sellers
- Includes broader strategic concerns like world events and new technologies

SEMANTIC:
- Most comprehensive coverage
- Includes additional challenges like:
  - Tax obligations
  - Competition
  - Managing growth
  - Inventory management
  - Payment risks
  - Fulfillment optimization

All three perspectives provide valuable insights, with SEMANTIC offering the most detailed view, HIERARCHICAL providing good structural organization, and FIXED_SIZE giving the most concise financial impact assessment.

### Clean up

In [None]:
for key in kb_mapping:
    kb_mapping[key]["KnowledgeBase"].delete_kb()