# Bedrock Knowledge Base Retrieval and Generation with Reranking

The Rerank API in Amazon Bedrock improves the relevance of responses in Retrieval-Augmented Generation (RAG) applications. It uses reranker models to score a set of retrieved documents against a user's query and reorder them, so that the most relevant content is prioritized for response generation.

## Key features and use cases

1. **Enhancing RAG applications**: The Rerank API addresses challenges in semantic search, particularly with complex or ambiguous queries. For example, it can help a customer service chatbot focus on return policies rather than shipping guidelines when asked about returning an online purchase.

2. **Improving search relevance**: Developers can apply it to the results of an existing search system to improve relevance and content ranking, without building their own ranking models.

3. **Optimizing context window usage**: By ensuring the most useful information is sent to the foundation model, it potentially reduces costs and improves response accuracy.

4. **Flexible integration**: The Rerank API can be used independently to rerank documents even if you're not using Amazon Bedrock Knowledge Bases.

5. **Multiple model support**: At launch, it supports Amazon Rerank 1.0 and Cohere Rerank 3.5 models.

6. **Customizable configurations**: Developers can specify additional model configurations as key-value pairs for more tailored reranking.

The Rerank API is available in select AWS Regions, including US West (Oregon), Canada (Central), Europe (Frankfurt), and Asia Pacific (Tokyo). It can be integrated into existing systems at scale, whether keyword-based or semantic, through a single API call in Amazon Bedrock.
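As a minimal sketch of the request behind that single API call, the helper below assembles the keyword arguments for the `rerank` operation of the boto3 `bedrock-agent-runtime` client from a query and a list of inline text documents. The function name `build_rerank_request` and its `extra_fields` parameter are illustrative and not part of any SDK. Which key-value pairs a given reranker model accepts under `additionalModelRequestFields` depends on the model and is not covered here.

```python
from typing import Any, Dict, List, Optional


def build_rerank_request(
    query: str,
    documents: List[str],
    model_arn: str,
    top_n: int,
    extra_fields: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for the bedrock-agent-runtime rerank operation."""
    model_configuration: Dict[str, Any] = {"modelArn": model_arn}
    if extra_fields:
        # Optional model-specific key-value pairs; valid keys depend on the model
        model_configuration["additionalModelRequestFields"] = extra_fields

    return {
        "queries": [{"type": "TEXT", "textQuery": {"text": query}}],
        "sources": [
            {
                "type": "INLINE",
                "inlineDocumentSource": {"type": "TEXT", "textDocument": {"text": doc}},
            }
            for doc in documents
        ],
        "rerankingConfiguration": {
            "type": "BEDROCK_RERANKING_MODEL",
            "bedrockRerankingConfiguration": {
                "modelConfiguration": model_configuration,
                # Never ask for more results than there are documents
                "numberOfResults": min(top_n, len(documents)),
            },
        },
    }


request = build_rerank_request(
    query="How do I return an online purchase?",
    documents=["Shipping takes 3 to 5 business days.", "Returns are accepted within 30 days."],
    model_arn="arn:aws:bedrock:us-west-2::foundation-model/cohere.rerank-v3-5:0",
    top_n=5,
)
print(request["rerankingConfiguration"]["bedrockRerankingConfiguration"]["numberOfResults"])
```

The resulting dictionary could then be passed to the client as `client.rerank(**request)`. Keeping request construction separate from the network call makes the payload easy to inspect without AWS credentials.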


![Reranking](./reranking.png)

## 1: Import and Load Variables

In [1]:
import json

# Load the configuration variables from a JSON file
with open("../Lab 1/variables.json", "r") as f:
    variables = json.load(f)

variables


{'accountNumber': '307297743176',
 'regionName': 'us-west-2',
 'collectionArn': 'arn:aws:aoss:us-west-2:307297743176:collection/h7cmj732p9d3v91spkhd',
 'collectionId': 'h7cmj732p9d3v91spkhd',
 'vectorIndexName': 'ws-index-',
 'bedrockExecutionRoleArn': 'arn:aws:iam::307297743176:role/advanced-rag-workshop-bedrock_execution_role-us-west-2',
 's3Bucket': '307297743176-us-west-2-advanced-rag-workshop',
 'kbFixedChunk': '4P6PBDDEGL',
 'kbSemanticChunk': 'IC3ZCBORXT',
 'kbCustomChunk': 'Q2T9CZ5VFA',
 'kbHierarchicalChunk': '1YIFVW0Z5E',
 'sagemakerLLMEndpoint': 'endpoint-llama-3-2-3b-instruct-2025-04-07-16-05-17',
 'guardrail_id': 'fe7ryshi7i7b',
 'guardrail_version': '1'}

## 2: Define ARN and Configuration Details

In [2]:
# Set up configuration for Bedrock
regionName = variables['regionName']
accountNumber = variables['accountNumber']
knowledge_base_id = variables['kbSemanticChunk']
model_id = 'us.amazon.nova-pro-v1:0'

# Define ARNs (Amazon Resource Names) for the generation model and the reranker model.
# Use the configured Region instead of hard-coding it, so the ARNs match the clients created later.
model_arn = f"arn:aws:bedrock:{regionName}:{accountNumber}:inference-profile/{model_id}"
rerank_model_arn = f"arn:aws:bedrock:{regionName}::foundation-model/cohere.rerank-v3-5:0"
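The two ARNs above follow different patterns: the inference profile ARN includes the account number, while the foundation model ARN leaves the account field empty. A small sketch of helpers that make this difference explicit follows. The function names are hypothetical, and the account number in the usage lines is a placeholder.

```python
def foundation_model_arn(region: str, model_id: str) -> str:
    """ARN of a Bedrock foundation model; the account field is left empty."""
    return f"arn:aws:bedrock:{region}::foundation-model/{model_id}"


def inference_profile_arn(region: str, account: str, profile_id: str) -> str:
    """ARN of a Bedrock inference profile, which is scoped to an account."""
    return f"arn:aws:bedrock:{region}:{account}:inference-profile/{profile_id}"


# Placeholder account number, for illustration only
print(foundation_model_arn("us-west-2", "cohere.rerank-v3-5:0"))
print(inference_profile_arn("us-west-2", "123456789012", "us.amazon.nova-pro-v1:0"))
```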


## 3: Set Up Bedrock Client

In [3]:
import boto3
import json
from typing import Any, Dict, Optional, Tuple

# Configure the Bedrock client
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name=regionName)


## 4: Function to send prompts to the LLM with the Bedrock Converse API

In [4]:

def get_value_by_key_path(d, path):
    """
    Retrieve a value from a nested dictionary using a key path.

    Args:
        d (dict): The dictionary to search.
        path (list): List of keys forming the path to the desired value.

    Returns:
        The value at the specified path, or None if not found.
    """
    current = d
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return None  # Return None if the path is invalid (key not found, wrong type, etc.)
    return current

def invoke_converse(
    system_prompt: str,
    user_prompt: str,
    model_id: str,
    temperature: float = 0.1,
    max_tokens: int = 4000
) -> Tuple[Optional[str], Optional[dict]]:
    """
    Chat with a Bedrock model using the Converse API.
    
    Args:
        system_prompt (str): System instructions/context
        user_prompt (str): User's input/question
        model_id (str): Bedrock model ID
        temperature (float): Controls randomness (0.0 to 1.0)
        max_tokens (int): Maximum tokens in response
        
    Returns:
        Tuple[Optional[str], Optional[dict]]: The model's answer and the raw
        Converse response, or (None, None) if the call fails
    """
    try:
        # Initialize Bedrock Runtime client with configuration
        client = boto3.client('bedrock-runtime', region_name=regionName)
        
        # Wrap the system prompt in the list-of-text-blocks format used by the Converse API
        system_prompt = [{'text': system_prompt}]
        messages = []

        # Format the user's question as a message
        message = {
            "role": "user", 
            "content": [            
                {
                    "text": f"{user_prompt}"
                }
            ]
        }

        messages.append(message)

        # Set inference configuration, honoring the max_tokens argument
        inferenceConfig = {
            "maxTokens": max_tokens,
            "temperature": temperature
        }
        
        # Invoke the API. boto3 raises an exception for unsuccessful requests,
        # which the except block below handles.
        response = client.converse(modelId=model_id,
                                   messages=messages,
                                   system=system_prompt,
                                   inferenceConfig=inferenceConfig)

        # Extract and concatenate the text blocks from the response.
        # Fall back to an empty list if the expected path is missing.
        content_list = get_value_by_key_path(response, ['output', 'message', 'content']) or []
        answer = ""
        for content in content_list:
            text = content.get('text')
            if text:  # Only concatenate if text is not None
                answer += text
        return answer, response

    except Exception as e:
        print(f"Error in invoke_converse: {str(e)}")
        return None, None
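To see what the extraction step does without calling Bedrock, the sketch below runs the same logic against a hand-written dictionary shaped like a Converse response. The helper name `extract_answer_text` and all values in `mock_response` are made up for illustration. The `usage` and `metrics` fields are where token counts and latency appear in a real response.

```python
from typing import Any, Dict, Optional


def extract_answer_text(response: Dict[str, Any]) -> Optional[str]:
    """Join every text block in a Converse-style response; None if the path is missing."""
    try:
        blocks = response["output"]["message"]["content"]
    except (KeyError, TypeError):
        return None
    return "".join(block.get("text") or "" for block in blocks if isinstance(block, dict))


# Hand-written stand-in for a Converse response (values are invented)
mock_response = {
    "output": {
        "message": {
            "role": "assistant",
            "content": [{"text": "Red teaming is "}, {"text": "a security practice."}],
        }
    },
    "usage": {"inputTokens": 1200, "outputTokens": 85, "totalTokens": 1285},
    "metrics": {"latencyMs": 2150},
}

answer_text = extract_answer_text(mock_response)
usage = mock_response.get("usage", {})
latency_ms = mock_response.get("metrics", {}).get("latencyMs")

print(answer_text)
print(usage.get("inputTokens"), usage.get("outputTokens"), latency_ms)
```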


## 5: Function to search Knowledge Base (vector database)

In [5]:
import boto3
import json

def search_kb(query, kb_id, numberOfResults):
    """Search the knowledge base with the given kb_id and return the text of up to numberOfResults chunks."""

    # Create the client
    client = boto3.client("bedrock-agent-runtime", region_name=regionName)
    
    # Retrieve from knowledge base
    kb_response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": numberOfResults}}
    )
    
    # Extract the text of each retrieved chunk, in the order returned by the knowledge base
    documents = []
    for result in kb_response.get("retrievalResults", []):
        # Each result carries its chunk text under content.text
        text = (result.get("content") or {}).get("text") or ""
        documents.append(text)
    
    return documents
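`search_kb` returns only the chunk text. When the retrieval scores are also of interest, a sketch like the following can pull both out of the response. It assumes each item in `retrievalResults` carries `content.text` and `score`; the helper name `extract_chunks` and the mock response values are invented for illustration.

```python
from typing import Any, Dict, List, Tuple


def extract_chunks(kb_response: Dict[str, Any]) -> List[Tuple[float, str]]:
    """Return (score, text) pairs from a Retrieve-style response, best score first."""
    chunks = []
    for result in kb_response.get("retrievalResults", []):
        text = (result.get("content") or {}).get("text") or ""
        score = result.get("score", 0.0)
        chunks.append((score, text))
    return sorted(chunks, key=lambda pair: pair[0], reverse=True)


# Hand-written stand-in for a Retrieve response (values are invented)
mock_kb_response = {
    "retrievalResults": [
        {"content": {"text": "Text-to-SQL converts questions into SQL."}, "score": 0.42},
        {"content": {"text": "Red teaming simulates adversarial attacks."}, "score": 0.57},
        {"content": {}, "score": 0.10},
    ]
}

chunks = extract_chunks(mock_kb_response)
for score, text in chunks:
    print(f"{score:.2f}  {text}")
```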

## 5: Get results from KB

In [6]:
query = "What is red teaming? How can it be used with text to SQL?"

numberOfResults = 5
original_kb_results = search_kb(query, knowledge_base_id, numberOfResults)
kb_context = '\n\n'.join(original_kb_results)
# print(json.dumps(kb_results, indent=2))

## 6: Get response from the LLM
We will use the results we receive from the knowledge base (KB) as-is, without any reranking yet.
We will send the context from the KB and the user query to the LLM.

In [7]:
model_id = 'us.amazon.nova-pro-v1:0' 

# Create a system prompt that tells the LLM to answer only from the supplied context and to say so when the context is insufficient.
system_prompt = f"""
Please use the context below to respond to the question. 
If you have enough information to answer the question, please explain the reasoning behind the response.
If you do not have enough information to answer the question, please don't guess. Instead, just say I don't know with the reason.
CONTEXT:
{kb_context}
"""
# We will send this query to the LLM. Red teaming and text-to-SQL are covered in two different documents.
# With only 5 chunks from the KB and no reranking, the context may not cover both topics, so we should expect an "I don't know" answer.
query = "What is red teaming? How can it be used with text to SQL?"

#Send the system prompt, context from KB, user query to the LLM.
answer, response = invoke_converse(system_prompt, query, model_id)
print(answer)

I don't know the specific details about how red teaming can be used with text-to-SQL systems, as the provided context does not mention red teaming or its application in this context. Red teaming typically involves simulating adversarial attacks to test the robustness and security of systems, but the context provided does not offer information on this specific application.


## 7: Function to Rerank Chunks Received from KB
Reranking works by sending the original query and the retrieved results to the Rerank API, which scores each result against the query.
Amazon Bedrock offers two reranker models: Amazon Rerank and Cohere Rerank.
In this exercise, we will use the Cohere Rerank model.

In [8]:
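def rerank_results(query, kb_results, rerank_model_arn, reranked_result_count=5):
    """Rerank kb_results against query and return the top reranked_result_count results with their relevance scores."""

    # Create the client
    client = boto3.client("bedrock-agent-runtime", region_name=regionName)

    # Invoke the Rerank API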
    reranked = client.rerank(
        queries=[{"textQuery": {"text": query}, "type": "TEXT"}],
        rerankingConfiguration={
            "bedrockRerankingConfiguration": {
                "modelConfiguration": {"modelArn": rerank_model_arn},
                "numberOfResults": reranked_result_count
            },
            "type": "BEDROCK_RERANKING_MODEL"
        },
        sources=[{
            "inlineDocumentSource": {"textDocument": {"text": doc}, "type": "TEXT"},
            "type": "INLINE"
        } for doc in kb_results]
    )
    
    # Process reranked results
    reranked_results = []
    for result in reranked.get("results", []):
        idx = result.get("index", 0)
        reranked_results.append({
            "original_position": idx + 1,
            "new_position": len(reranked_results) + 1,
            "relevance_score": result.get("relevanceScore", 0),  # Full precision score
            "text": kb_results[idx]
        })
    return {"original_results": kb_results, "reranked_results": reranked_results}
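The key step in `rerank_results` is mapping each returned `index` back to the original document list. The sketch below isolates that step and runs it on a hand-written dictionary shaped like a Rerank response, so it can be tried without AWS access. The helper name `apply_rerank_order`, the sample documents, and the scores are invented for illustration.

```python
from typing import Any, Dict, List


def apply_rerank_order(documents: List[str], rerank_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Reorder documents according to the index values in a Rerank-style response."""
    ordered = []
    for new_position, result in enumerate(rerank_response.get("results", []), start=1):
        idx = result["index"]  # position of the document in the original input list
        ordered.append({
            "original_position": idx + 1,
            "new_position": new_position,
            "relevance_score": result["relevanceScore"],
            "text": documents[idx],
        })
    return ordered


documents = ["Shipping guidelines", "Return policy", "Warranty terms"]

# Hand-written stand-in for a Rerank response (scores are invented)
mock_rerank_response = {
    "results": [
        {"index": 1, "relevanceScore": 0.91},
        {"index": 2, "relevanceScore": 0.35},
    ]
}

reranked = apply_rerank_order(documents, mock_rerank_response)
for item in reranked:
    print(item["new_position"], item["original_position"], item["text"])
```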

## 8: Use Rerank API

In [9]:
# We get more results from KB and then rerank and reduce the context before sending to LLM.
numberOfResults = 20
original_kb_results = search_kb(query, knowledge_base_id, numberOfResults)
# print(json.dumps(kb_results, indent=2))

In [10]:
# Send the results from the KB search to the Rerank API and keep only the top 5 results instead of all 20.
# This keeps lower-ranked results out of the LLM context, which reduces input tokens and therefore cost and latency.
reranked_result_count = 5
reranked_json = rerank_results(query, original_kb_results, rerank_model_arn, reranked_result_count)
# print(json.dumps(reranked_json, indent=2))
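The pattern used here, retrieving a wide candidate set and then keeping only the best few, can be illustrated offline. In the sketch below a toy word-overlap score stands in for the reranker model; it is not how Cohere Rerank or Amazon Rerank score documents, and the function names and sample texts are invented.

```python
from typing import List, Tuple


def overlap_score(query: str, document: str) -> float:
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & document_words) / len(query_words)


def keep_top_k(query: str, candidates: List[str], k: int) -> List[Tuple[float, str]]:
    """Score every candidate against the query and keep the k highest-scoring ones."""
    scored = [(overlap_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# A wide candidate set, as a first-stage retriever might return it
candidates = [
    "shipping times vary by region",
    "red teaming probes a system for weaknesses",
    "text to sql turns questions into queries",
    "red teaming a text to sql system",
]

top = keep_top_k("red teaming text to sql", candidates, k=2)
for score, doc in top:
    print(f"{score:.2f}  {doc}")
```

The document that covers both topics was last in the candidate list but ends up first after scoring, which is the effect the reranking step is meant to have on the 20 retrieved chunks.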

In [11]:
#Let's concatenate the reranked results into a string so that we can send the string to the LLM as context.
kb_context = ""
for result in reranked_json['reranked_results'] :
    kb_context += result['text'] + "\n\n"
# kb_context

In [12]:
model_id = 'us.amazon.nova-pro-v1:0' 

# We will send the same system prompt. The only change is the reranked context.
system_prompt = f"""
Please use the context below to respond to the question. 
If you have enough information to answer the question, please explain the reasoning behind the response.
If you do not have enough information to answer the question, please don't guess. Instead, just say I don't know with the reason.
CONTEXT:
{kb_context}
"""

# We will send the same query to the LLM.
query = "What is red teaming? How can it be used with text to SQL?"

# Send the system prompt, context from KB, and user query to the LLM. We should see a better response because we reranked the context from the KB.
answer, result = invoke_converse(system_prompt, query, model_id)
print(answer)

Red teaming is a security practice where a group of experts, known as red teamers, simulate attacks on a system to identify vulnerabilities and improve its defenses. In the context of text-to-SQL systems, red teaming can be used to:

1. **Identify Vulnerabilities**: Red teamers can craft malicious or ambiguous queries to test the system's robustness and identify weaknesses in handling such inputs.

2. **Improve Clarification Mechanisms**: By simulating user queries that are ambiguous or unanswerable, red teamers can help developers refine the system's ability to seek clarification and provide appropriate responses.

3. **Enhance Security Measures**: Red teaming can reveal potential security risks, such as SQL injection vulnerabilities, which can then be addressed through improved input validation and sanitization.

4. **Validate Mitigation Strategies**: After implementing security measures, red teamers can test these strategies to ensure they effectively protect the system against vari

In [13]:
# Are you curious how to find input tokens, output tokens, and latency?
print(json.dumps(result, indent=2))

{
  "ResponseMetadata": {
    "RequestId": "c02dcbc1-89fb-4c36-8f49-bc85517a1f71",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 07 Apr 2025 17:00:35 GMT",
      "content-type": "application/json",
      "content-length": "1412",
      "connection": "keep-alive",
      "x-amzn-requestid": "c02dcbc1-89fb-4c36-8f49-bc85517a1f71"
    },
    "RetryAttempts": 0
  },
  "output": {
    "message": {
      "role": "assistant",
      "content": [
        {
          "text": "Red teaming is a security practice where a group of experts, known as red teamers, simulate attacks on a system to identify vulnerabilities and improve its defenses. In the context of text-to-SQL systems, red teaming can be used to:\n\n1. **Identify Vulnerabilities**: Red teamers can craft malicious or ambiguous queries to test the system's robustness and identify weaknesses in handling such inputs.\n\n2. **Improve Clarification Mechanisms**: By simulating user queries that are ambiguous or unanswerable