# Autogenerated filters

## Overview

This notebook demonstrates the **Autogenerated Filters** feature of Amazon Bedrock Knowledge Bases, which enhances search capabilities through automatic metadata filtering. By leveraging Amazon Bedrock's foundation models, the Autogenerated Filters feature dynamically interprets user queries and generates appropriate metadata filters based on the defined metadata schema for the Knowledge Base. This improves the quality and relevance of search results without requiring explicit filter specifications.

The Autogenerated Filters feature is enabled through the `implicitFilterConfiguration` parameter within the `vectorSearchConfiguration` of the `retrievalConfiguration` in the `Retrieve` and `RetrieveAndGenerate` API calls. These APIs analyze the user's query, identify relevant metadata attributes based on the specified `implicitFilterConfiguration`, and apply these filters to narrow down the search results.

For example, if a user searches for "marketing reports from last year," the Autogenerated Filters feature can automatically recognize that "last year" refers to a specific time period and apply a filter based on the date metadata field. Similarly, if the query mentions a specific product or department, the system can apply filters based on the corresponding metadata fields.
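
To make this concrete, the sketch below builds by hand the kind of filter that the feature would otherwise derive from the query. It uses the operator names of the explicit `filter` parameter (`andAll`, `equals`). The `department` and `year` metadata keys and the helper name `last_year_marketing_filter` are assumptions for illustration and are not part of this notebook's dataset.

```python
from datetime import date


def last_year_marketing_filter(today: date) -> dict:
    """Build an explicit metadata filter for "marketing reports from last year".

    Assumes hypothetical metadata keys `department` (STRING) and `year` (NUMBER).
    """
    return {
        "andAll": [
            {"equals": {"key": "department", "value": "marketing"}},
            {"equals": {"key": "year", "value": today.year - 1}},
        ]
    }


example_filter = last_year_marketing_filter(date(2024, 6, 1))
print(example_filter)
```

With Autogenerated Filters, you describe the metadata fields once and the model produces an equivalent expression for each query, so a helper like this is not needed.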

Let's explore how to implement and utilize Autogenerated Filters with Amazon Bedrock Knowledge Bases for an example use case.

## 1. Setup
Before running the rest of this notebook, run the cells below to install the required libraries and connect to Amazon Bedrock.

You can ignore any pip dependency errors that appear while the libraries install.

In [None]:
%pip install --upgrade pip --quiet
%pip install -r ../requirements.txt --no-deps --quiet
%pip install -r ../requirements.txt --upgrade --quiet

In [None]:
%pip install --upgrade boto3

In [None]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import boto3
print(boto3.__version__)

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
import sys
import time
import boto3
import logging
import pprint
import json

# Set the path to import module
from pathlib import Path
current_path = Path().resolve()
current_path = current_path.parent
if str(current_path) not in sys.path:
    sys.path.append(str(current_path))
# Print sys.path to verify
# print(sys.path)

from utils.knowledge_base import BedrockKnowledgeBase

In [None]:
#Clients
s3_client = boto3.client('s3')
sts_client = boto3.client('sts')
session = boto3.session.Session()
region =  session.region_name
account_id = sts_client.get_caller_identity()["Account"]
bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime') 
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
region, account_id

In [None]:
import time

# Get the current timestamp
current_time = time.time()

# Format the timestamp as a string
timestamp_str = time.strftime("%Y%m%d%H%M%S", time.localtime(current_time))[-7:]
# Create the suffix using the timestamp
suffix = f"{timestamp_str}"
knowledge_base_name = 'autogenerated-filters-kb'
knowledge_base_description = "Knowledge Base autogenerated metadata filtering."
bucket_name = f'{knowledge_base_name}-{suffix}'
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"

# Define data sources
data_source=[{"type": "S3", "bucket_name": bucket_name}]

## 2 - Create knowledge bases with fixed chunking strategy
Let's start by creating an [Amazon Bedrock Knowledge Base](https://aws.amazon.com/bedrock/knowledge-bases/) to store Amazon's annual reports in PDF format. Knowledge Bases can integrate with different vector databases, including [Amazon OpenSearch Serverless](https://aws.amazon.com/opensearch-service/features/serverless/), [Amazon Aurora](https://aws.amazon.com/rds/aurora/), [Pinecone](http://app.pinecone.io/bedrock-integration), Redis Enterprise, and MongoDB Atlas. For this example, we will integrate the knowledge base with Amazon OpenSearch Serverless. To do so, we will use the helper class `BedrockKnowledgeBase`, which creates the knowledge base and all of its prerequisites:
1. IAM roles and policies
2. S3 bucket
3. Amazon OpenSearch Serverless encryption, network and data access policies
4. Amazon OpenSearch Serverless collection
5. Amazon OpenSearch Serverless vector index
6. Knowledge base
7. Knowledge base data source

We will create a knowledge base using the fixed-size chunking strategy.

You can choose a different chunking strategy by changing the parameter value below:
```
"chunkingStrategy": "FIXED_SIZE | NONE | HIERARCHICAL | SEMANTIC"
```
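
As a rough sketch of what each strategy implies, the function below returns a `chunkingConfiguration` dictionary in the shape used by the `CreateDataSource` API. How the `BedrockKnowledgeBase` helper builds its own request is not shown in this notebook, so the function name `chunking_configuration` and all numeric values are assumptions for illustration. Check the field names against the API reference before relying on them.

```python
def chunking_configuration(strategy: str) -> dict:
    """Return a `chunkingConfiguration` dict for the given strategy.

    The numeric values are illustrative assumptions, not tuned recommendations.
    """
    per_strategy = {
        "NONE": {},
        "FIXED_SIZE": {
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,
                "overlapPercentage": 20,
            }
        },
        "HIERARCHICAL": {
            "hierarchicalChunkingConfiguration": {
                # Parent chunks first, then child chunks
                "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
                "overlapTokens": 60,
            }
        },
        "SEMANTIC": {
            "semanticChunkingConfiguration": {
                "maxTokens": 300,
                "bufferSize": 1,
                "breakpointPercentileThreshold": 95,
            }
        },
    }
    if strategy not in per_strategy:
        raise ValueError(f"Unsupported chunking strategy: {strategy}")
    return {"chunkingStrategy": strategy, **per_strategy[strategy]}


print(chunking_configuration("FIXED_SIZE"))
```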

In [None]:
knowledge_base_metadata = BedrockKnowledgeBase(
    kb_name=f'{knowledge_base_name}-{suffix}',
    kb_description=knowledge_base_description,
    data_sources=data_source, 
    chunking_strategy = "FIXED_SIZE", 
    suffix = suffix
)

### 2.1 Download Amazon 2019, 2020, 2021, 2022, and 2023 annual reports and upload them to Amazon S3

Now that we have created the knowledge base, let's populate it with Amazon's annual reports, which we store locally in a `sec-10-k` folder. The reports are downloaded from Amazon's [annual reports, proxies and shareholder letters](https://ir.aboutamazon.com/annual-reports-proxies-and-shareholder-letters/default.aspx) page.

In [None]:
import os

def create_directory(directory_name):    
    if not os.path.exists(directory_name):
        os.makedirs(directory_name)
        print(f"Directory '{directory_name}' created successfully.")
    else:
        print(f"Directory '{directory_name}' already exists.")

# Call the function to create the directory
create_directory("sec-10-k")

In [None]:
import requests

def download_file(url, filename):
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Open the file in write-binary mode
        with open(filename, 'wb') as file:
            # Write the content of the response to the file
            file.write(response.content)
        print(f"File downloaded successfully: {filename}")
    else:
        print(f"Failed to download file. Status code: {response.status_code}")

# URL of the files to download
urls = ["https://s2.q4cdn.com/299287126/files/doc_financials/2024/ar/Amazon-com-Inc-2023-Annual-Report.pdf",
        "https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/Amazon-2022-Annual-Report.pdf",
        "https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/Amazon-2021-Annual-Report.pdf",
        "https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Annual-Report.pdf",
        "https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Annual-Report.pdf"]


for url in urls:
    # Name for the downloaded file
    filename = url.split('/')[-1]

    # Path to save the downloaded file
    filepath = f"./sec-10-k/{filename}"

    # Call the function to download the file
    download_file(url, filepath)

Let's upload the annual reports in the `sec-10-k` folder to Amazon S3.

In [None]:
def upload_directory(path, bucket_name):
    for root, dirs, files in os.walk(path):
        for file in files:
            if not file.startswith('.DS_Store'):
                file_to_upload = os.path.join(root, file)
                print(f"uploading file {file_to_upload} to {bucket_name}")
                s3_client.upload_file(file_to_upload, bucket_name, file)

upload_directory("sec-10-k", bucket_name)

### 2.3 Prepare metadata for ingestion


In [None]:
import json
import re

def generate_matadata(data_dir):
    
    # Loop through all PDF files in the directory
    for filename in os.listdir(data_dir):
        # Skip anything that is not a PDF (e.g. .DS_Store or existing metadata files)
        if not filename.endswith('.pdf'):
            continue

        filepath = f'{data_dir}/{filename}'
        print(filepath)

        # Create metadata; the year is the first number in the file name and is
        # stored as a number so that it matches the NUMBER type used for filtering
        metadata = {
            "company": "Amazon",
            "ticker": "AMZN",
            "year": int(re.search(r'\d+', filename).group(0)),
        }

        # Bedrock Knowledge Bases expects the attributes under "metadataAttributes"
        json_data = {"metadataAttributes": metadata}

        # Write the metadata file next to the PDF as <name>.pdf.metadata.json
        with open(f"{filepath}.metadata.json", "w") as f:
            json.dump(json_data, f)


In [None]:
data_dir = './sec-10-k'
generate_matadata(data_dir)

In [None]:
# upload metadata file to S3
upload_directory("sec-10-k", bucket_name)

Now start the ingestion job. The annual reports and their metadata files are already in the S3 bucket, so no further upload is needed.
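
The helper's `start_ingestion_job()` may already wait for the job to finish, and its internals are not shown here. If you start a job directly with the `bedrock-agent` client instead, a polling loop along the following lines can be used. The function name `wait_for_ingestion` and its parameters are assumptions for illustration. The client is passed in so the function can be tried with any object that exposes `get_ingestion_job`.

```python
import time


def wait_for_ingestion(client, kb_id, ds_id, job_id, poll_seconds=10, max_polls=60):
    """Poll an ingestion job until it leaves the STARTING/IN_PROGRESS states.

    `client` is expected to behave like boto3.client('bedrock-agent').
    Returns the final status string (for example COMPLETE or FAILED).
    """
    for _ in range(max_polls):
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job_id,
        )["ingestionJob"]
        if job["status"] not in ("STARTING", "IN_PROGRESS"):
            return job["status"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"Ingestion job {job_id} did not finish in time")
```

A call would look like `wait_for_ingestion(bedrock_agent_client, kb_id, ds_id, job_id)`, where `ds_id` and `job_id` come from the data source and the `start_ingestion_job` response.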

In [None]:
# ensure that the kb is available
time.sleep(30)
# sync knowledge base
knowledge_base_metadata.start_ingestion_job()

Finally, we save the knowledge base ID so that we can test the solution at a later stage.

In [None]:
kb_id_metadata = knowledge_base_metadata.get_knowledge_base_id()

### 2.4 Enable CloudWatch logs for debugging our autogenerated filters

Below, we create some helper functions that enable CloudWatch logs for debugging.

#### Helper functions for CloudWatch Logs

The helper functions facilitate the analysis of autogenerated filter generation by querying and processing CloudWatch logs. They provide insights into how user queries are leveraged to generate metadata filters. 

These functions allow you to examine the actual filters being generated, verify their consistency across different user query variations, and ensure they align with your expectations and query logic. If any filters are inconsistent or generated unexpectedly, you can use these functions to troubleshoot and understand the reasons behind it.
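
As a minimal sketch of the parsing step, the function below pulls the generated filter text out of a single log record. The record layout assumed here is the one that the notebook's `print_filter_generation_output` helper reads. The sample record is made up for illustration and carries a placeholder instead of a real filter.

```python
import json


def extract_generated_filter(log_message: str) -> str:
    """Return the model's filter text from one invocation-log record (a JSON string)."""
    record = json.loads(log_message)
    content = record["output"]["outputBodyJson"]["output"]["message"]["content"]
    return content[0]["text"]


# Made-up record for illustration only; real records carry many more fields.
sample_record = json.dumps(
    {
        "output": {
            "outputBodyJson": {
                "output": {
                    "message": {"content": [{"text": "<generated filter text>"}]}
                }
            }
        }
    }
)
print(extract_generated_filter(sample_record))
```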

In [None]:
import boto3
from botocore.exceptions import ClientError

def create_cloudwatch_log_group_for_bedrock(log_group_name):
    logs_client = boto3.client('logs')
    try:
        logs_client.create_log_group(logGroupName=log_group_name)
        print(f"Successfully created CloudWatch log group: {log_group_name}")
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ResourceAlreadyExistsException':
            print(f"Log group {log_group_name} already exists.")
            return True
        else:
            print(f"Error creating log group: {e}")
            return False

def enable_bedrock_invokemodel_logs(cw_log_group_name):
    bedrock_client = boto3.client('bedrock')
    log_group_name = cw_log_group_name
    
    if not create_cloudwatch_log_group_for_bedrock(log_group_name):
        print("Failed to create or confirm log group. Aborting log enablement.")
        return False

    try:
        get_kb_response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId = kb_id_metadata)
        role_arn = get_kb_response['knowledgeBase']['roleArn']
        kb_role_name = role_arn.split('/')[-1]

        sts_client = boto3.client('sts')
        account_id = sts_client.get_caller_identity()["Account"]
        role_name = kb_role_name  
        role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"

        # Get all available model types
        response = bedrock_client.list_foundation_models()
        all_model_types = list(set([model['modelName'] for model in response['modelSummaries']]))

        response = bedrock_client.put_model_invocation_logging_configuration(
            loggingConfig={
                'cloudWatchConfig': {
                    'logGroupName': log_group_name,
                    'roleArn': role_arn
                },
                'textDataDeliveryEnabled': True,
                'imageDataDeliveryEnabled': True,
                'embeddingDataDeliveryEnabled': True,
                'videoDataDeliveryEnabled': True
            }
        )
        
        print(f"Successfully enabled InvokeModel logging for Bedrock with log group: {log_group_name}")
        return True
    except ClientError as e:
        print(f"Error enabling InvokeModel logging: {e}")
        return False

def delete_bedrock_invokemodel_log_group(log_group_name):
    logs_client = boto3.client('logs')
    try:
        # First, disable the logging configuration in Bedrock
        bedrock_client = boto3.client('bedrock')
        bedrock_client.delete_model_invocation_logging_configuration()
        print("Successfully disabled InvokeModel logging for Bedrock")

        # Then, delete the log group
        logs_client.delete_log_group(logGroupName=log_group_name)
        print(f"Successfully deleted CloudWatch log group: {log_group_name}")
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ResourceNotFoundException':
            print(f"Log group {log_group_name} does not exist.")
            return True
        else:
            print(f"Error deleting log group or disabling logging: {e}")
            return False


# Call the function
cw_log_group_name = "/aws/bedrock/invokemodel"
# Set the name up front so later cells can use it even if enabling logging fails
log_group_name = cw_log_group_name

if enable_bedrock_invokemodel_logs(cw_log_group_name):
    print("Bedrock InvokeModel logging has been set up successfully.")
else:
    print("Failed to set up Bedrock InvokeModel logging.")

print("Log group name: ", log_group_name)

In [None]:
import time
from datetime import datetime, timezone
from datetime import timedelta
import boto3
import json

cw_client = boto3.client('logs', region_name=region)

def query_model_invocation_log(query_string):
    # Query the last 60 minutes of logs
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(minutes=60)
    start_query_response = cw_client.start_query(
        logGroupName=log_group_name,
        startTime=int(start_time.timestamp()),
        endTime=int(end_time.timestamp()),
        queryString=query_string,
    )
    query_id = start_query_response['queryId']
    # Wait for the query to complete (it starts as 'Scheduled', then 'Running')
    response = None
    while response is None or response['status'] in ('Scheduled', 'Running'):
        print('Waiting for query to complete ...')
        time.sleep(1)
        response = cw_client.get_query_results(
            queryId=query_id
        )

    # Print the final status
    print(f"Query status: {response['status']}")
    return response

def print_filter_generation_output(user_query):
    print(user_query)
    # Construct CloudWatch Logs Insights query
    # This query:
    # 1. Selects timestamp and message fields
    # 2. Filters for filter generation task messages
    # 3. Matches the user's query
    # 4. Orders results by most recent first
    response = query_model_invocation_log(
        f"""
        fields @timestamp, @message
        | filter @message like /Your task is to structure the user's query to match the request schema provided below./
        | filter input.inputBodyJson.messages.0.content.0.text like /{user_query}/
        | sort @timestamp desc
        """)
    print(response)
    results = response['results']
    if (len(results) == 0):
        print("No results found")
        return
    
    result= results[0][1]['value']
    result_dict = json.loads(result)
    print(f"Generated filter:")
    filter_gen_output = result_dict['output']['outputBodyJson']['output']['message']['content'][0]['text']
    print(filter_gen_output)

### 2.5 Update Knowledge Bases execution role

In [None]:
# Before using autogenerated filters - update the knowledge base execution IAM role with right permissions

iam = boto3.resource('iam')
client = boto3.client('iam')

def get_attached_policies(role_name):
    response = client.list_attached_role_policies(RoleName=role_name)
    attached_policies = response['AttachedPolicies']
    return attached_policies

# get the knowledge base IAM role name
get_kb_response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId = kb_id_metadata)
role_arn = get_kb_response['knowledgeBase']['roleArn']
role_name = role_arn.split('/')[-1]

# get attached policies
attached_policies = get_attached_policies(role_name)
attached_policies

def update_kb_execution_role(attached_policies, region_name):

    for policy in attached_policies:

        print(policy['PolicyArn'])
        policy_name = policy['PolicyName']
        policy_arn = policy['PolicyArn']

        if 'FoundationModel' in policy_arn:
            print('Updating FoundationModel policy: ', policy_arn)
            policy = iam.Policy(policy_arn)
            version = policy.default_version
            policyJson = version.document
            # Allow the models used for answer generation and filter generation
            policyJson['Statement'][0]['Resource'].append('arn:aws:bedrock:{}::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0'.format(region_name))
            policyJson['Statement'][0]['Resource'].append('arn:aws:bedrock:{}::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0'.format(region_name))

            # Replace the policy: detach, delete, then recreate it under the same name
            client.detach_role_policy(RoleName=role_name,
                PolicyArn=policy_arn)

            response = client.delete_policy(
                PolicyArn=policy_arn
            )
            print(response)

            response = client.create_policy(
                PolicyName=policy_name,
                PolicyDocument=json.dumps(policyJson)
            )
            print(response)

            # Attach the recreated policy using the ARN returned by IAM
            client.attach_role_policy(
                RoleName=role_name,
                PolicyArn=response['Policy']['Arn']
            )

In [None]:
update_kb_execution_role(attached_policies, region)
time.sleep(30)

### 2.6 Query the Knowledge Base with Retrieve and Generate API - with metadata (using Autogenerated filters)

Rather than creating filters manually, we'll use the filters that Amazon Bedrock Knowledge Bases generates automatically.

In [None]:
query = "How many prime members does Amazon have after 2021?"

#### 2.6.1 Test - RetrieveAndGenerate API with Autogenerated filters

Let's first test how the `RetrieveAndGenerate` API processes autogenerated filters with a sample user query. As a refresher, the `RetrieveAndGenerate` API is one of the APIs provided by Amazon Bedrock Knowledge Bases. This API queries the knowledge base to retrieve the desired number of document chunks based on a similarity search. It then passes the retrieved chunks to an LLM to generate an answer to the user's question.



In the code below, we are configuring the `RetrieveAndGenerate` API to use autogenerated filters based on specific metadata attributes. 

The `implicitFilterConfiguration` is a configuration object that defines the metadata attributes that can be used as filters for the similarity search during the retrieval process of the `RetrieveAndGenerate` API. 

This configuration specifies which metadata fields can potentially be used for filtering the retrieved document chunks. In this case, we included three fields:

1. `year`: A number representing the year the document is about.
2. `company`: A string representing the company name the document describes.
3. `ticker`: A string representing the stock ticker symbol of the company.

Each metadata attribute is defined as an object with the following properties:

- `key`: The name of the metadata attribute.
- `type`: The data type of the metadata attribute (e.g., NUMBER, STRING).
- `description`: A human-readable description of the metadata attribute, including any potential values it may take.

The `implicitFilterConfiguration` also specifies the model ARN for the foundation model used to generate filter expressions relevant to the user's query. In this case, it's set to `anthropic.claude-3-5-sonnet-20240620-v1:0`.
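
Because the same configuration is passed to both `RetrieveAndGenerate` and `Retrieve`, it can be convenient to build it once. The helper below is a minimal sketch: the name `build_implicit_filter_configuration`, the shortened description strings, and the `us-east-1` region are assumptions for illustration.

```python
def build_implicit_filter_configuration(attributes, region, model_id):
    """Build an `implicitFilterConfiguration` dict.

    `attributes` is a list of (key, type, description) tuples, one per
    metadata field that the model may use when generating filters.
    """
    return {
        "metadataAttributes": [
            {"key": key, "type": attr_type, "description": description}
            for key, attr_type, description in attributes
        ],
        "modelArn": f"arn:aws:bedrock:{region}::foundation-model/{model_id}",
    }


implicit_filter_config = build_implicit_filter_configuration(
    [
        ("year", "NUMBER", "The year the document is about."),
        ("company", "STRING", "The company the document describes. Possible values include ['Amazon']"),
        ("ticker", "STRING", "The stock ticker of the company. Possible values include ['AMZN']"),
    ],
    region="us-east-1",  # assumed region for illustration
    model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",
)
print(implicit_filter_config["modelArn"])
```

The resulting dictionary can be placed under `"implicitFilterConfiguration"` inside `vectorSearchConfiguration`.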

In [None]:
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"

response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id_metadata,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 10,
                    # "filter": { "equals": { "key": "x-amz-bedrock-kb-data-source-id", "value": ds_id }},
                    "implicitFilterConfiguration": {
                        "metadataAttributes": [
                            {
                                "key": "year",
                                "type": "NUMBER",
                                "description": "The year in which the document is about."
                            },
                            {
                                "key": "company",
                                "type": "STRING",
                                "description": "The company name the document is describing. Possible values include ['Amazon']"
                            },
                            {
                                "key": "ticker",
                                "type": "STRING",
                                "description": "The ticker name of the company. Possible values include ['AMZN']"
                            }
                        ],
                        "modelArn": "arn:aws:bedrock:{}::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0".format(region)
                    },
                }
            }
        }
    }
)

print(response['output']['text'],end='\n'*2)

Now, let's check the CloudWatch logs for the autogenerated filter generated for the given user query.

The `RetrieveAndGenerate` API, with the autogenerated filter configuration, processes the user query and generates a structured filter expression based on the metadata attributes. This filter expression is represented as a JSON object containing logical operations (`and`, `or`) and comparison statements (`eq`, `gt`, `lt`, etc.). The comparison statements specify the metadata attributes and their corresponding values to filter the documents.

Behind the scenes, the model determines the logical operations and values that constitute the filter expressions. It does this by analyzing the user query and the configured metadata attributes. The model identifies the relevant metadata attributes and their values mentioned in the query, and then constructs the filter expression accordingly, using appropriate logical operations and comparison statements.
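
To illustrate how such an expression narrows the result set, the sketch below evaluates a filter against a chunk's metadata locally. It uses the operator names of the explicit `filter` parameter (`andAll`, `orAll`, `equals`, `greaterThan`, and so on), which may differ from the shorthand that appears in the logged model output. The example filter is an assumption about what a query like ours might produce, and the function `matches` is not part of any AWS SDK.

```python
import operator

# Comparison operators covered by this sketch
_COMPARISONS = {
    "equals": operator.eq,
    "notEquals": operator.ne,
    "greaterThan": operator.gt,
    "greaterThanOrEquals": operator.ge,
    "lessThan": operator.lt,
    "lessThanOrEquals": operator.le,
}


def matches(filter_expr: dict, metadata: dict) -> bool:
    """Return True if `metadata` satisfies a single-operator filter expression."""
    ((op, arg),) = filter_expr.items()
    if op == "andAll":
        return all(matches(sub, metadata) for sub in arg)
    if op == "orAll":
        return any(matches(sub, metadata) for sub in arg)
    if op in _COMPARISONS:
        if arg["key"] not in metadata:
            return False
        return _COMPARISONS[op](metadata[arg["key"]], arg["value"])
    raise ValueError(f"Unsupported operator: {op}")


# Assumed filter for "How many prime members does Amazon have after 2021?"
example_filter = {
    "andAll": [
        {"equals": {"key": "company", "value": "Amazon"}},
        {"greaterThan": {"key": "year", "value": 2021}},
    ]
}

print(matches(example_filter, {"company": "Amazon", "ticker": "AMZN", "year": 2022}))
print(matches(example_filter, {"company": "Amazon", "ticker": "AMZN", "year": 2020}))
```

In the service itself, the filter is applied by the vector store during retrieval. This local evaluator only mirrors the logic for inspection.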

In [None]:
# check CloudWatch logs for the implicit filter generated
time.sleep(60)
print_filter_generation_output(query)

#### 2.6.2 Test - Retrieve API with autogenerated filters

Next, we will test the autogenerated filter configuration with the `Retrieve` API, another API provided by Amazon Bedrock Knowledge Bases. It converts user queries into embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the `Retrieve` API includes the retrieved text chunks, the location type and URI of the source data, and the relevance scores of the retrievals.

In the output of the cell below, the chunks returned by the `Retrieve` API include metadata that reflects our autogenerated filters:

1. Relevant Year: Notice that the chunks have a `year` metadata field greater than 2021, matching our query about "Amazon's prime members after 2021".

2. Company Information: The `company` field in the metadata consistently shows "Amazon", which aligns with our query about a specific company.

3. Stock Ticker Information: The `ticker` field consistently shows "AMZN", which aligns with our query about a specific stock.

This demonstrates how the autogenerated filters narrow down the search results based on the query's implied criteria, even without explicit filter specifications. The system has interpreted the user query about Amazon's Prime members after 2021 and automatically applied filters to return the most relevant information from the knowledge base.
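
One quick way to check this is to count the retrieved chunks per `year`. The sketch below does so on a hand-written stand-in for a `Retrieve` response that contains only the fields used here. The function name `years_in_results` and the sample data are assumptions for illustration.

```python
from collections import Counter


def years_in_results(retrieve_resp: dict) -> Counter:
    """Count retrieved chunks per `year` metadata value."""
    return Counter(
        chunk.get("metadata", {}).get("year")
        for chunk in retrieve_resp.get("retrievalResults", [])
    )


# Hand-written stand-in for a Retrieve response, for illustration only
sample_response = {
    "retrievalResults": [
        {"content": {"text": "..."}, "metadata": {"year": 2022, "company": "Amazon"}},
        {"content": {"text": "..."}, "metadata": {"year": 2023, "company": "Amazon"}},
        {"content": {"text": "..."}, "metadata": {"year": 2023, "company": "Amazon"}},
    ]
}

year_counts = years_in_results(sample_response)
print(year_counts)
```

With a live response, call `years_in_results(response_ret_with_implicit_fiters)` and confirm that no year of 2021 or earlier appears.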

In [None]:
response_ret_with_implicit_fiters = bedrock_agent_runtime_client.retrieve(
    knowledgeBaseId=kb_id_metadata,
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 10,
            "implicitFilterConfiguration": {
                "metadataAttributes": [
                    {
                        "key": "year",
                        "type": "NUMBER",
                        "description": "The year in which the document is about."
                    },
                    {
                        "key": "company",
                        "type": "STRING",
                        "description": "The company name the document is describing. Possible values include ['Amazon']"
                    },
                    {
                        "key": "ticker",
                        "type": "STRING",
                        "description": "The ticker name of the company. Possible values include ['AMZN']"
                    }
                ],
                "modelArn": "arn:aws:bedrock:{}::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0".format(region)
            },
        }
    },
    retrievalQuery={
        "text": query
    }
)

def response_print(retrieve_resp):
#structure 'retrievalResults': list of contents. Each list has content, location, score, metadata
    for num,chunk in enumerate(retrieve_resp['retrievalResults'],1):
        print(f'Chunk {num}: ',chunk['content']['text'],end='\n'*2)
        print(f'Chunk {num} Location: ',chunk['location'],end='\n'*2)
        print(f'Chunk {num} Score: ',chunk['score'],end='\n'*2)
        print(f'Chunk {num} Metadata: ',chunk['metadata'],end='\n'*2)

response_print(response_ret_with_implicit_fiters)

In [None]:
# check CloudWatch logs for the implicit filter generated
time.sleep(20)
print_filter_generation_output(query)

### 2.7 Clean up
Run the cells below to delete the resources created in this notebook.

In [None]:
# delete local directory
import shutil

dir_path = "sec-10-k" # Replace with the actual path

try:
    shutil.rmtree(dir_path)
    print(f"Directory '{dir_path}' and its contents have been deleted successfully.")
except FileNotFoundError:
    print(f"Directory '{dir_path}' not found.")
except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
# Delete log group name
delete_bedrock_invokemodel_log_group(log_group_name)

In [None]:
## Empty and delete S3 Bucket

objects = s3_client.list_objects(Bucket=bucket_name)  
if 'Contents' in objects:
    for obj in objects['Contents']:
        s3_client.delete_object(Bucket=bucket_name, Key=obj['Key']) 
s3_client.delete_bucket(Bucket=bucket_name)

In [None]:
print("===============================Knowledge base==============================")
knowledge_base_metadata.delete_kb(delete_s3_bucket=True, delete_iam_roles_and_policies=True)