# Bedrock Knowledge Base Retrieval and Generation with Metadata Filtering

### Description:
This notebook demonstrates how to query and retrieve data from an Amazon Bedrock-powered knowledge base using different configurations, filters, and citation extraction. The steps include creating a query, retrieving responses, and printing the citations used for generating the results.


![Metadata Filtering](./metadata_filtering.png)

## 1. Load Configuration Variables

In [1]:
# Load configuration variables from a JSON file to access knowledge base ID, account number, and guardrail info.
import json

with open("../Lab 1/variables.json", "r") as f:
    variables = json.load(f)

variables  # Display the loaded variables for confirmation

{'accountNumber': '307297743176',
 'regionName': 'us-west-2',
 'collectionArn': 'arn:aws:aoss:us-west-2:307297743176:collection/h7cmj732p9d3v91spkhd',
 'collectionId': 'h7cmj732p9d3v91spkhd',
 'vectorIndexName': 'ws-index-',
 'bedrockExecutionRoleArn': 'arn:aws:iam::307297743176:role/advanced-rag-workshop-bedrock_execution_role-us-west-2',
 's3Bucket': '307297743176-us-west-2-advanced-rag-workshop',
 'kbFixedChunk': '4P6PBDDEGL',
 'kbSemanticChunk': 'IC3ZCBORXT',
 'kbCustomChunk': 'Q2T9CZ5VFA',
 'kbHierarchicalChunk': '1YIFVW0Z5E',
 'sagemakerLLMEndpoint': 'endpoint-llama-3-2-3b-instruct-2025-04-07-16-05-17',
 'guardrail_id': 'fe7ryshi7i7b',
 'guardrail_version': '1'}

## 2. Set Up Required IDs and Model ARNs

In [2]:
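# Account number and knowledge base ID come from the variables file loaded above
accountNumber = variables['accountNumber']
knowledge_base_id = variables['kbSemanticChunk']  # knowledge base that uses semantic chunking

# Amazon Nova Pro is addressed through an inference profile, so the ARN includes the account number
model_id = 'us.amazon.nova-pro-v1:0'
model_arn = f"arn:aws:bedrock:us-west-2:{accountNumber}:inference-profile/{model_id}"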
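
The ARN above follows the inference-profile format. As a minimal sketch, the two ARN shapes can be wrapped in a small helper. The function name `build_model_arn` and the placeholder account number are illustrative assumptions and are not part of the workshop code.

```python
from typing import Optional


def build_model_arn(region: str, model_id: str, account: Optional[str] = None) -> str:
    """Return a Bedrock ARN string (hypothetical helper, for illustration only).

    If an account number is given, an inference-profile ARN is built,
    otherwise a foundation-model ARN (which has an empty account field).
    """
    if account:
        return f"arn:aws:bedrock:{region}:{account}:inference-profile/{model_id}"
    return f"arn:aws:bedrock:{region}::foundation-model/{model_id}"


# "123456789012" is a placeholder account number, not a real one.
print(build_model_arn("us-west-2", "us.amazon.nova-pro-v1:0", account="123456789012"))
print(build_model_arn("us-west-2", "amazon.nova-pro-v1:0"))
```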


## 3. Configure Bedrock Client

In [3]:
import boto3
import json
from typing import Optional

# Configure the Bedrock client
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name="us-west-2")


## 4. Define Function to Retrieve and Generate Without Filters

In [4]:
def retrieve_and_generate_without_filter(query, knowledge_base_id, model_arn):
    """
    Retrieves and generates a response based on the given query.

    Parameters:
    - query (str): The input query.
    - knowledge_base_id (str): The ID of the knowledge base.
    - model_arn (str): The ARN of the model.

    Returns:
    - response: The response from the retrieve_and_generate method.
    """
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={
            "text": query
        },
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                'knowledgeBaseId': knowledge_base_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": 5
                    }
                }
            }
        }
    )
    return response


## 5. Define Function to Retrieve and Generate With Filters

In [5]:
def retrieve_and_generate_with_filter(query, knowledge_base_id, model_arn, metadata_filter):
    """
    Retrieves and generates a response for the given query, restricting retrieval with a metadata filter.

    Parameters:
    - query (str): The input query.
    - knowledge_base_id (str): The ID of the knowledge base.
    - model_arn (str): The ARN of the model.
    - metadata_filter (dict): The metadata filter for the vector search configuration.

    Returns:
    - response: The response from the retrieve_and_generate method.
    """
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={
            "text": query
        },
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                'knowledgeBaseId': knowledge_base_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": 5,
                        "filter": metadata_filter
                    }
                }
            }
        }
    )
    return response
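
The two functions above differ only in the optional `filter` key. A minimal sketch of building the configuration once and adding the filter only when it is provided could look like this. The helper name `build_rag_config` and the `number_of_results` parameter are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def build_rag_config(
    knowledge_base_id: str,
    model_arn: str,
    metadata_filter: Optional[Dict[str, Any]] = None,
    number_of_results: int = 5,
) -> Dict[str, Any]:
    """Build the retrieveAndGenerateConfiguration dict (hypothetical helper)."""
    vector_search: Dict[str, Any] = {"numberOfResults": number_of_results}
    # Only add the "filter" key when a filter was actually supplied.
    if metadata_filter is not None:
        vector_search["filter"] = metadata_filter
    return {
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": knowledge_base_id,
            "modelArn": model_arn,
            "retrievalConfiguration": {"vectorSearchConfiguration": vector_search},
        },
    }


# Placeholder IDs for illustration only.
config = build_rag_config("KBID123456", "arn:aws:bedrock:us-west-2::foundation-model/example")
print(config["knowledgeBaseConfiguration"]["retrievalConfiguration"])
```

The resulting dictionary could then be passed as `retrieveAndGenerateConfiguration` in a single `retrieve_and_generate` call, with or without a filter.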



## 6. Define Query

In [6]:
query = "what was the % increase in sales?"

## 7. Retrieve Response Without Metadata Filter

In [7]:
response_withoutMetadata=retrieve_and_generate_without_filter(query, knowledge_base_id, model_arn)
print(response_withoutMetadata['output']['text'])


The sales increased by 11% in 2024 compared to the prior year.


## 8. Retrieve and Print Citations Without Metadata Filter

In [8]:
# Extract citations used to generate the response
response_without_MD = response_withoutMetadata['citations'][0]['retrievedReferences']
print("# of citations or chunks used to generate the response: ", len(response_without_MD))

# Function to print citations or chunks of text retrieved
def citations_rag_print(response_ret):
    for num, chunk in enumerate(response_ret, 1):
        print(f'Chunk {num}: ', chunk['content']['text'], end='\n'*2)
        print(f'Chunk {num} Location: ', chunk['location'], end='\n'*2)
        print(f'Chunk {num} Metadata: ', chunk['metadata'], end='\n'*2)

# Print citations
citations_rag_print(response_without_MD)


# of citations or chunks used to generate the response:  1
Chunk 1:  Service sales primarily represent third-party seller fees, which includes commissions and any related fulfillment and shipping fees, AWS sales, advertising services, Amazon Prime membership fees, and certain digital media content subscriptions. Net sales information is as follows (in millions): Year Ended December 31, 
 
  2023 2024 
 
 Net Sales: North America $ 352,828 $ 387,497 International 131,200 142,906 AWS 90,757 107,556 
 
 Consolidated $ 574,785 $ 637,959 Year-over-year Percentage Growth: 
 
 North America 12 % 10 % International 11 9 AWS 13 19 
 
 Consolidated 12 11 Year-over-year Percentage Growth, excluding the effect of foreign exchange rates: 
 
 North America 12 % 10 % International 11 10 AWS 13 19 
 
 Consolidated 12 11 Net Sales Mix: 
 
 North America 61 % 61 % International 23 22 AWS 16 17 
 
 Consolidated 100 % 100 % 
 
 Sales increased 11% in 2024, compared to the prior year. Changes in foreign ex
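
The cell above reads only the first entry of `citations`. A response can contain several citation entries, each with its own `retrievedReferences`, so a sketch that gathers all of them may be useful. The helper name `collect_references` and the sample dictionary are illustrative assumptions.

```python
from typing import Any, Dict, List


def collect_references(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten retrievedReferences across every citation in a response."""
    references: List[Dict[str, Any]] = []
    for citation in response.get("citations", []):
        references.extend(citation.get("retrievedReferences", []))
    return references


# Hand-written sample shaped like a retrieve_and_generate response (not real output).
sample_response = {
    "citations": [
        {"retrievedReferences": [{"content": {"text": "chunk a"}}]},
        {"retrievedReferences": [{"content": {"text": "chunk b"}}, {"content": {"text": "chunk c"}}]},
    ]
}
print(len(collect_references(sample_response)))  # 3
```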

## 9. Define Metadata Filter

The code below defines a metadata filter to narrow down the knowledge base search:
- Creates a complex filter using logical operators (andAll)
- The filter has two conditions that must both be true:
  1. docType must equal '10K Report'
  2. year must equal 2023
- This filter will limit retrieval to only chunks from 2023 10K reports
- The structure demonstrates how to build more complex queries with multiple conditions

This filter will be used to demonstrate selective retrieval from specific documents.

In [9]:
# Define a metadata filter for advanced filtering based on specific conditions
one_group_filter= {
    "andAll": [
        {
            "equals": {
                "key": "docType",
                "value": '10K Report'
            }
        },
        {
            "equals": {
                "key": "year",
                "value": 2023
            }
        }
    ]
}
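
To make the `andAll` and `equals` semantics concrete, the sketch below evaluates a filter against a plain metadata dictionary locally. It is only an illustration of the logic and makes no claim about how the service evaluates filters internally. The function name `matches_filter` is an assumption, and only the operators used in this notebook are covered.

```python
def matches_filter(metadata, flt):
    """Evaluate a filter dict against a metadata dict (local illustration only)."""
    # Each filter node has exactly one operator key.
    operator, operand = next(iter(flt.items()))
    if operator == "andAll":
        return all(matches_filter(metadata, f) for f in operand)
    if operator == "orAll":
        return any(matches_filter(metadata, f) for f in operand)
    value = metadata.get(operand["key"])
    if operator == "equals":
        return value == operand["value"]
    if operator == "greaterThanOrEquals":
        return value is not None and value >= operand["value"]
    if operator == "lessThanOrEquals":
        return value is not None and value <= operand["value"]
    if operator == "startsWith":
        return isinstance(value, str) and value.startswith(operand["value"])
    raise ValueError(f"Unsupported operator in this sketch: {operator}")


example_filter = {
    "andAll": [
        {"equals": {"key": "docType", "value": "10K Report"}},
        {"equals": {"key": "year", "value": 2023}},
    ]
}
print(matches_filter({"docType": "10K Report", "year": 2023}, example_filter))
print(matches_filter({"docType": "10K Report", "year": 2022}, example_filter))
```

In this local sketch, a string value such as `"2023"` does not equal the number `2023`, which is a reminder to keep metadata value types consistent with the filter.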


## 10. Retrieve Response With Metadata Filter

In [10]:
# Use the function to retrieve a response with metadata filtering
response_with_Metadata = retrieve_and_generate_with_filter(query, knowledge_base_id, model_arn, one_group_filter)

# Print the response text
print(response_with_Metadata['output']['text'])


The percentage increase in sales was 9% in 2022 compared to the prior year.


## 11. Retrieve and Print Citations With Metadata Filter

In [11]:
# Extract citations used to generate the response with metadata filter
response_with_MD = response_with_Metadata['citations'][0]['retrievedReferences']
print("# of citations or chunks used to generate the response: ", len(response_with_MD))

# Print citations for the filtered response
citations_rag_print(response_with_MD)


# of citations or chunks used to generate the response:  1
Chunk 1:  Net sales information is as follows (in millions): Year Ended December 31, 
 
  2021 2022 
 
 Net Sales: North America $ 279,833 $ 315,880 International 127,787 118,007 AWS 62,202 80,096 
 
 Consolidated $ 469,822 $ 513,983 Year-over-year Percentage Growth (Decline): 
 
 North America 18 % 13 % International 22 (8) AWS 37 29 
 
 Consolidated 22 9 Year-over-year Percentage Growth, excluding the effect of foreign exchange rates: 
 
 North America 18 % 13 % International 20 4 AWS 37 29 
 
 Consolidated 21 13 Net sales mix: 
 
 North America 60 % 61 % International 27 23 AWS 13 16 
 
 Consolidated 100 % 100 % 
 
 Sales increased 9% in 2022, compared to the prior year. Changes in foreign currency exchange rates reduced net sales by $15.5 billion in 2022. For a discussion of the effect of foreign exchange rates on sales growth, see “Effect of Foreign Exchange Rates” below. 
 
 North America sales increased 13% in 2022, comp

## 12. Advanced Metadata Filtering

Creating metadata filters dynamically lets you build query-specific filters programmatically rather than hardcoding them.

This cell defines a function that creates metadata filters programmatically based on the following parameters:
- company: Filter by company name (a single name or a list of names)
- year: Filter by year (can be a single year or list of years)
- docType: Filter by document type
- min_page/max_page: Filter by page number ranges
- s3_prefix: Filter by S3 URI prefix

The function builds a filter configuration from the provided parameters, combining them with the appropriate operators (equals, greaterThanOrEquals, lessThanOrEquals, startsWith, andAll, orAll).

In [12]:
def create_dynamic_filter(company=None, year=None, docType=None, min_page=None, max_page=None, s3_prefix=None):
    """
    Creates a dynamic metadata filter for Amazon Bedrock Knowledge Base queries.
    
    Parameters:
    - company (str or list): Filter by company name (e.g., 'Amazon')
    - year (int or list): Filter by year or list of years
    - docType (str): Filter by document type (e.g., '10K Report')
    - min_page (int): Filter for pages greater than or equal to this number
    - max_page (int): Filter for pages less than or equal to this number
    - s3_prefix (str): Filter by S3 URI prefix
    
    Returns:
    - dict: A metadata filter configuration or None if no valid filters
    """
    filter_conditions = []
    
    # Add company filter if specified and not empty
    if company:
        if isinstance(company, list):
            # Filter out empty strings and check if we have any values left
            company_list = [c for c in company if c]
            if len(company_list) >= 2:
                # If we have at least 2 valid values, use orAll
                company_conditions = []
                for c in company_list:
                    company_conditions.append({
                        "equals": {
                            "key": "company",
                            "value": c
                        }
                    })
                filter_conditions.append({"orAll": company_conditions})
            elif len(company_list) == 1:
                # If only one valid company, use a direct equals condition
                filter_conditions.append({
                    "equals": {
                        "key": "company",
                        "value": company_list[0]
                    }
                })
        elif isinstance(company, str) and company.strip():  # Check if string is not just whitespace
            filter_conditions.append({
                "equals": {
                    "key": "company",
                    "value": company
                }
            })
    
    # Add year filter (single year or multiple years)
    if year:
        if isinstance(year, list):
            # Filter out empty values and check if we have any values left
            year_list = [y for y in year if y]
            if len(year_list) >= 2:
                # If we have at least 2 valid values, use orAll
                year_conditions = []
                for y in year_list:
                    year_conditions.append({
                        "equals": {
                            "key": "year",
                            "value": y
                        }
                    })
                filter_conditions.append({"orAll": year_conditions})
            elif len(year_list) == 1:
                # If only one valid year, use a direct equals condition
                filter_conditions.append({
                    "equals": {
                        "key": "year",
                        "value": year_list[0]
                    }
                })
        elif str(year).strip():  # Convert to string and check if not just whitespace
            filter_conditions.append({
                "equals": {
                    "key": "year",
                    "value": year
                }
            })
    
    # Add document type filter if specified and not empty
    if docType and (not isinstance(docType, str) or docType.strip()):
        filter_conditions.append({
            "equals": {
                "key": "docType",
                "value": docType
            }
        })
    
    # Add minimum page filter if specified
    if min_page is not None:
        filter_conditions.append({
            "greaterThanOrEquals": {
                "key": "x-amz-bedrock-kb-document-page-number",
                "value": min_page
            }
        })
    
    # Add maximum page filter if specified
    if max_page is not None:
        filter_conditions.append({
            "lessThanOrEquals": {
                "key": "x-amz-bedrock-kb-document-page-number",
                "value": max_page
            }
        })

    # Add S3 prefix filter if specified and not empty
    if s3_prefix and (not isinstance(s3_prefix, str) or s3_prefix.strip()):
        filter_conditions.append({
            "startsWith": {
                "key": "x-amz-bedrock-kb-source-uri",
                "value": s3_prefix
            }
        })
    
    # Return the complete filter only if we have TWO OR MORE conditions
    # The API requires at least 2 conditions for andAll
    if len(filter_conditions) >= 2:
        return {"andAll": filter_conditions}
    # If we have exactly ONE condition, return it directly without andAll
    elif len(filter_conditions) == 1:
        return filter_conditions[0]
    else:
        # Return None if no valid conditions
        return None
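
The company and year branches above repeat the same pattern: one value becomes an `equals` condition, and several values become an `orAll` of `equals` conditions. A minimal sketch of that pattern as a reusable helper follows. The name `any_of` is hypothetical and the helper is not part of the notebook.

```python
def any_of(key, values):
    """Build an equals / orAll condition for one metadata key (hypothetical helper).

    Returns None when no usable value is left after cleaning.
    """
    if not isinstance(values, (list, tuple)):
        values = [values]
    cleaned = []
    for v in values:
        if v is None:
            continue
        if isinstance(v, str) and not v.strip():
            continue  # skip empty or whitespace-only strings
        cleaned.append(v)
    conditions = [{"equals": {"key": key, "value": v}} for v in cleaned]
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]  # a single condition does not need orAll
    return {"orAll": conditions}


print(any_of("year", [2023, 2024]))
print(any_of("company", "Amazon"))
```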

## Query Financial Data Function
This cell creates a higher-level function that uses the dynamic filter:
- Takes a query text and various filter parameters
- Creates a filter using the create_dynamic_filter function
- Prints the filter configuration for debugging
- Calls retrieve_and_generate_with_filter with the created filter
- Returns the complete response

In [13]:
def query_financial_data(query_text, kb_id, model_arn, **filter_params):
    """
    Perform a query against financial data with dynamic filtering.
    
    Parameters:
    - query_text (str): The natural language query
    - kb_id (str): Knowledge base ID
    - model_arn (str): Model ARN
    - **filter_params: Parameters to pass to create_dynamic_filter
    
    Returns:
    - dict: Response from Bedrock
    """
    # Create the filter
    filter_config = create_dynamic_filter(**filter_params)
    
    # Log the filter for debugging
    print("Using filter configuration:")
    print(json.dumps(filter_config, indent=2) if filter_config else "No filter applied")
    
    # Run the query with or without filter based on whether we have a valid filter
    if filter_config is not None:
        response = retrieve_and_generate_with_filter(
            query_text, kb_id, model_arn, filter_config
        )
    else:
        # If no filter conditions, call the function without filter
        response = retrieve_and_generate_without_filter(
            query_text, kb_id, model_arn
        )
    
    return response

In [14]:
# Compare growth rates across all Amazon business segments
from utils import print_citations
response = query_financial_data(
    "Compare the year-over-year growth rates for AWS, North America, and International segments, including factors that influenced performance differences",
    knowledge_base_id,
    model_arn,
    company="Amazon",
    year=[2023, 2024]
)
print_citations(response)
#print(response['output']['text'])

Using filter configuration:
{
  "andAll": [
    {
      "equals": {
        "key": "company",
        "value": "Amazon"
      }
    },
    {
      "orAll": [
        {
          "equals": {
            "key": "year",
            "value": 2023
          }
        },
        {
          "equals": {
            "key": "year",
            "value": 2024
          }
        }
      ]
    }
  ]
}

Generated Response:
The year-over-year growth rates for the segments are as follows:

- AWS: 13% in 2023 compared to 2022 - North America: Not explicitly stated, but operating income changed from a loss of $2.847 billion in 2022 to a profit of $14.877 billion in 2023 - International: 11% in 2023 compared to 2022 Factors influencing performance differences:

- AWS: Growth was driven by increased customer usage, partially offset by pricing changes due to long-term customer contracts - North America: The significant change from loss to profit suggests a major improvement in operational efficiency or ma

In [15]:
# Filter for 2023 and 2024 documents stored under a specific S3 prefix
s3_prefix_2023 = f"s3://{variables['s3Bucket']}/pdf_documents/"

response_2023 = query_financial_data(
    "What was the AWS revenue growth in 2023?",
    knowledge_base_id,
    model_arn,
    year=[2023, 2024],
    s3_prefix=s3_prefix_2023
)

#print(response_2023['output']['text'])
print_citations(response_2023)

Using filter configuration:
{
  "andAll": [
    {
      "orAll": [
        {
          "equals": {
            "key": "year",
            "value": 2023
          }
        },
        {
          "equals": {
            "key": "year",
            "value": 2024
          }
        }
      ]
    },
    {
      "startsWith": {
        "key": "x-amz-bedrock-kb-source-uri",
        "value": "s3://307297743176-us-west-2-advanced-rag-workshop/pdf_documents/"
      }
    }
  ]
}


### Dynamic Entity extraction to create filters on the fly
In the examples so far, you knew in advance which filters to apply. Perhaps your application requires the user to pick a year or department name when asking a question. In those situations, the approach above works.
However, there may be no way for a user to specify filters. The application then has to infer the filters from the question at run time.
For example, assume that your documents are stored in department folders such as HR, Finance, Legal, Science, Engineering, and Customer Support.
Assume that your user asks an HR-related question. You have two options.
### Option 1:
You create a vector embedding for the HR-related question and search the vector space of the entire knowledge base. This gives you some context and might even pick up HR-related content from customer support documents.
### Option 2:
You ask an LLM to determine the topic to which the question most likely belongs. Then you use the topic as a filter to query the knowledge base. This limits the search to fewer topics and hence reduces the noise from unrelated topics.
While this might improve accuracy because the context is richer and less noisy, it also adds cost because of the extra LLM call needed to determine the topic of the question.
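
Option 2 can be sketched as follows, under several assumptions: the metadata key `department`, the helper `build_topic_filter`, and the keyword-based `keyword_classifier` are all hypothetical. In practice the classifier would be an LLM call, and it is stubbed here so the example runs without AWS access.

```python
from typing import Callable, Dict, Optional, Sequence


def build_topic_filter(
    question: str,
    classify: Callable[[str], str],
    allowed_topics: Sequence[str],
    key: str = "department",  # assumed metadata key, adjust to your own schema
) -> Optional[Dict]:
    """Turn a classifier's topic guess into an equals filter, or None if unknown."""
    topic = classify(question).strip()
    lookup = {t.lower(): t for t in allowed_topics}
    matched = lookup.get(topic.lower())
    if matched is None:
        return None  # unknown topic: caller can fall back to an unfiltered search
    return {"equals": {"key": key, "value": matched}}


def keyword_classifier(question: str) -> str:
    """Stand-in for an LLM call that names the most likely topic."""
    return "hr" if "leave" in question.lower() else "unknown"


topics = ["HR", "Finance", "Legal"]
print(build_topic_filter("How many days of parental leave do I get?", keyword_classifier, topics))
print(build_topic_filter("What is the capital of France?", keyword_classifier, topics))
```

Returning `None` for an unrecognized topic keeps the fallback decision with the caller, which mirrors how `query_financial_data` handles a missing filter.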

In [16]:
def get_value_by_key_path(d, path):
    """
    Retrieve a value from a nested dictionary using a key path.

    Args:
        d (dict): The dictionary to search.
        path (list): List of keys forming the path to the desired value.

    Returns:
        The value at the specified path, or None if not found.
    """
    current = d
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return None  # Return None if the path is invalid (key not found, wrong type, etc.)
    return current
    
def invoke_converse(
    system_prompt: str,
    user_prompt: str,
    model_id: str,
    temperature: float = 0.2,
    max_tokens: int = 4000
) -> Optional[str]:
    """
    Chat with a Bedrock model using the Converse API.
    
    Args:
        system_prompt (str): System instructions/context
        user_prompt (str): User's input/question
        model_id (str): Bedrock model ID
        temperature (float): Controls randomness (0.0 to 1.0)
        max_tokens (int): Maximum tokens in response
        
    Returns:
        Optional[str]: Model's response or None if error
    """
    try:
        # Initialize Bedrock Runtime client with configuration
        client = boto3.client('bedrock-runtime', region_name=variables['regionName'] )
        
        # Wrap the system prompt in the list-of-text-blocks format expected by the Converse API
        system_prompt = [{'text': system_prompt}]
        messages = []

        # Format the user's question as a message
        message = {
            "role": "user",
            "content": [
                {
                    "text": f"{user_prompt}"
                }
            ]
        }
        messages.append(message)

        # Set inference configuration, honoring the max_tokens argument
        inferenceConfig = {
            "maxTokens": max_tokens,
            "temperature": temperature
        }
        
        #invoke the API
        answer = ""
        response = client.converse(modelId=model_id, 
                                messages=messages,
                                system=system_prompt,
                                inferenceConfig = inferenceConfig)
        
        # Process the response
        if response['ResponseMetadata']['HTTPStatusCode'] == 200:
            # Extract and concatenate the text blocks from the response
            content_list = get_value_by_key_path(response, ['output', 'message', 'content']) or []
            for content in content_list:
                text = content.get('text')
                if text:  # Skip content blocks that carry no text
                    answer += text
        else:
            # boto3 normally raises ClientError on failures, so this branch is only a safeguard
            answer = f"Error: HTTP status {response['ResponseMetadata']['HTTPStatusCode']}"
        return answer

    except Exception as e:
        print(f"Error in invoke_converse: {str(e)}")
        return None

In [17]:
# A prompt template with Model Instructions:
system_prompt = """
You are an expert in extracting entity from queries so that those entities can be used as filters.

Model Instructions:
- If company name and year is mentioned in the Query, extract them as entities.
- Return the information strictly in JSON format where company and year are keys and their corresponding values are an array of strings.
- If you are not sure if company name and year is mentioned in the query, please return an empty list for the corresponding entity.
- Please do not return any explanation.

$Query$
"""

# user prompt (an alternative query is kept commented out for experimentation)
# user_prompt = "Compare the year-over-year growth rates for AWS, North America, and International segments, including factors that influenced performance differences in years 2023 and 2024."
user_prompt = "In Amazon's cash flow statement, what was the net income in years 2023 and 2024?"

# Send the system prompt and user prompt to an LLM and get the response.
model_id = "anthropic.claude-3-5-haiku-20241022-v1:0"
response = invoke_converse(system_prompt, user_prompt, model_id)

# load the response into a json format.
data = json.loads(response)
print(json.dumps(data, indent=2))

{
  "company": [
    "Amazon"
  ],
  "year": [
    "2023",
    "2024"
  ]
}
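
`json.loads(response)` fails if `invoke_converse` returns `None` or if the model wraps the JSON in extra text. A defensive sketch is shown below. The helper name `parse_llm_json` is an assumption, and the approach simply parses the span between the first `{` and the last `}`.

```python
import json


def parse_llm_json(raw):
    """Best-effort parse of a JSON object embedded in model output.

    Returns an empty dict when nothing parseable is found.
    """
    if not raw:
        return {}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Hand-written sample strings, not real model output.
print(parse_llm_json('Here is the result: {"company": ["Amazon"], "year": ["2023"]}'))
print(parse_llm_json(None))
```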


In [18]:
# extract years as a list of integers.
if 'year' in data :
    years = data['year']
    years = [int(x) for x in years]
else :
    years = []
years

[2023, 2024]

In [19]:
# extract company names as a list of strings.
if 'company' in data :
    company = data['company']
else :
    company = []
company


['Amazon']
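
The two extraction cells above assume that every year string converts cleanly to an integer. A sketch that tolerates unexpected values is shown below. The helper name `normalize_entities` is hypothetical, and it assumes `company` and `year` are lists as requested in the system prompt.

```python
def normalize_entities(data):
    """Return (companies, years) with blank names and non-numeric years dropped."""
    years = []
    for y in data.get("year") or []:
        try:
            years.append(int(str(y).strip()))
        except ValueError:
            continue  # ignore values such as "FY24" that are not plain integers
    companies = [
        c.strip()
        for c in (data.get("company") or [])
        if isinstance(c, str) and c.strip()
    ]
    return companies, years


# Hand-written sample input for illustration.
print(normalize_entities({"company": ["Amazon", " "], "year": ["2023", "FY24", 2024]}))
```

Because `create_dynamic_filter` already treats empty lists as "no filter", the cleaned lists can be passed to `query_financial_data` directly.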

In [20]:
response = query_financial_data(
    user_prompt,
    knowledge_base_id,
    model_arn,
    company=company,
    year=years
)
print_citations(response)

Using filter configuration:
{
  "andAll": [
    {
      "equals": {
        "key": "company",
        "value": "Amazon"
      }
    },
    {
      "orAll": [
        {
          "equals": {
            "key": "year",
            "value": 2023
          }
        },
        {
          "equals": {
            "key": "year",
            "value": 2024
          }
        }
      ]
    }
  ]
}

Generated Response:
The net income for Amazon in 2023 was $30,425 million. The model cannot find sufficient information to provide the net income for 2024.

Number of citations: 1

Citation 1:
----------------------------------------
Content: CONSOLIDATED STATEMENTS OF COMPREHENSIVE INCOME (LOSS) 
 
 (in millions) Year Ended December 31, 
 
  2021 2022 2023 
 
 Net income (loss) $ 33,364 $ (2,722) $ 30,425 Other comprehensive income (loss): 
 
 Foreign currency translation adjustments, net of tax of $47, $100, and $(55) (819) (2,586) 1,0...
Source: {'s3Location': {'uri': 's3://307297743176-us-west-2