# Metadata filtering using Amazon Bedrock Knowledge Bases
This notebook walks through sample code for the metadata filtering feature of Amazon Bedrock Knowledge Bases.

Metadata filtering improves search results by pre-filtering your retrievals from the vector store, so that only documents whose metadata matches your conditions are considered.
For more details on this feature, please read this [blog](https://aws.amazon.com/blogs/machine-learning/amazon-bedrock-knowledge-bases-now-supports-metadata-filtering-to-improve-retrieval-accuracy/).

## 1. Import the needed libraries
The first step is to install the prerequisite packages.

In [None]:
%pip install --upgrade pip --quiet
%pip install -r ../requirements.txt --no-deps --quiet
%pip install -r ../requirements.txt --upgrade --quiet

In [None]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import botocore
botocore.__version__

In [None]:
import os
import sys
import time
import boto3
import logging
import pprint
import json

# Set the path to import module
from pathlib import Path
current_path = Path().resolve()
current_path = current_path.parent
if str(current_path) not in sys.path:
    sys.path.append(str(current_path))
# Print sys.path to verify
# print(sys.path)

from utils.knowledge_base import BedrockKnowledgeBase

In [None]:
#Clients
s3_client = boto3.client('s3')
sts_client = boto3.client('sts')
session = boto3.session.Session()
region =  session.region_name
account_id = sts_client.get_caller_identity()["Account"]
bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime') 
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
region, account_id

In [None]:
import time

# Get the current timestamp
current_time = time.time()

# Format the timestamp as a string
timestamp_str = time.strftime("%Y%m%d%H%M%S", time.localtime(current_time))[-7:]
# Create the suffix using the timestamp
suffix = f"{timestamp_str}"
knowledge_base_name = 'metadata-filtering-kb'
knowledge_base_description = "Knowledge Base metadata filtering."
bucket_name = f'{knowledge_base_name}-{suffix}'
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"

# Define data sources
data_source=[{"type": "S3", "bucket_name": bucket_name}]

## 2. Create a knowledge base with fixed chunking strategy
Let's start by creating a knowledge base with [Amazon Bedrock Knowledge Bases](https://aws.amazon.com/bedrock/knowledge-bases/) to store video game data in CSV format. Knowledge Bases can integrate with different vector databases, including [Amazon OpenSearch Serverless](https://aws.amazon.com/opensearch-service/features/serverless/), [Amazon Aurora](https://aws.amazon.com/rds/aurora/), [Pinecone](http://app.pinecone.io/bedrock-integration), Redis Enterprise and MongoDB Atlas. For this example, we will integrate the knowledge base with Amazon OpenSearch Serverless. To do so, we will use the helper class `BedrockKnowledgeBase`, which creates the knowledge base and all of its prerequisites:
1. IAM roles and policies
2. S3 bucket
3. Amazon OpenSearch Serverless encryption, network and data access policies
4. Amazon OpenSearch Serverless collection
5. Amazon OpenSearch Serverless vector index
6. Knowledge base
7. Knowledge base data source

We will create a knowledge base using the fixed chunking strategy.

You can choose a different chunking strategy by changing the parameter value below:
```
"chunkingStrategy": "FIXED_SIZE | NONE | HIERARCHICAL | SEMANTIC"
```
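
For reference, the strategy name corresponds to the `chunkingConfiguration` structure accepted by the Bedrock `create_data_source` API. The sketch below builds that structure for fixed-size chunking. The helper name `fixed_size_chunking` and the values 300 and 20 are illustrative assumptions, not settings read from `BedrockKnowledgeBase`.

```python
def fixed_size_chunking(max_tokens=300, overlap_percentage=20):
    """Build a chunkingConfiguration dict for FIXED_SIZE chunking.

    Hypothetical helper for illustration; the default values are assumptions.
    """
    # Basic sanity checks only, not the service's documented limits
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    if not 0 < overlap_percentage < 100:
        raise ValueError("overlap_percentage must be between 0 and 100")
    return {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": max_tokens,
            "overlapPercentage": overlap_percentage,
        },
    }


chunking_config = fixed_size_chunking()
print(chunking_config)
```

Smaller chunks tend to give more precise matches, while larger chunks carry more surrounding context, so the right values depend on the documents being ingested.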

In [None]:
knowledge_base_metadata = BedrockKnowledgeBase(
    kb_name=f'{knowledge_base_name}-{suffix}',
    kb_description=knowledge_base_description,
    data_sources=data_source, 
    chunking_strategy = "FIXED_SIZE", 
    suffix = suffix
)

### 2.1 Download video game dataset and upload it to Amazon S3

Now that we have created the knowledge base, let's populate it with the `video_game` dataset. The data is downloaded from [here](https://aws-blogs-artifacts-public.s3.amazonaws.com/ML-16482/30_generated_video_game_records.zip). It describes fictional video games, with a title, description, genre, year, publisher, and score for each game.

In [None]:
import os
import zipfile

# Download the zip file
!wget https://aws-blogs-artifacts-public.s3.amazonaws.com/ML-16482/30_generated_video_game_records.zip

# Extract the CSV files - they are unzipped into a folder named 'video_game'
with zipfile.ZipFile('./30_generated_video_game_records.zip', 'r') as zipf:
    csv_files = [x for x in zipf.infolist() if not x.filename.startswith('__MACOSX/') and x.filename.endswith('.csv')]
    for csv_file in csv_files:
        zipf.extract(csv_file, './')

#remove original zip file
os.remove('./30_generated_video_game_records.zip')

Let's upload the video game data in the `video_game` folder to S3.

In [None]:
def upload_directory(path, bucket_name):
    # Walk the directory tree and upload every file, using the file name as the S3 key
    for root, dirs, files in os.walk(path):
        for file in files:
            # Skip macOS Finder metadata files
            if not file.startswith('.DS_Store'):
                file_to_upload = os.path.join(root, file)
                print(f"uploading file {file_to_upload} to {bucket_name}")
                s3_client.upload_file(file_to_upload, bucket_name, file)

upload_directory("video_game", bucket_name)

Now we start the ingestion job.
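
Ingestion runs asynchronously, so it can be useful to poll a job until it reaches a terminal state. The sketch below assumes the `get_ingestion_job` call of the `bedrock-agent` client. The function name `wait_for_ingestion` and its parameters are hypothetical, and the `BedrockKnowledgeBase` helper may already wait internally.

```python
import time

# Statuses after which the job will not change any further
TERMINAL_STATES = {"COMPLETE", "FAILED", "STOPPED"}


def wait_for_ingestion(client, kb_id, data_source_id, job_id,
                       poll_seconds=10, max_polls=60):
    """Poll an ingestion job until it finishes (hypothetical helper).

    `client` is expected to behave like boto3.client('bedrock-agent').
    Returns the final status string, or raises TimeoutError.
    """
    for _ in range(max_polls):
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=data_source_id,
            ingestionJobId=job_id,
        )["ingestionJob"]
        if job["status"] in TERMINAL_STATES:
            return job["status"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"ingestion job {job_id} did not finish in time")
```

Checking for `FAILED` explicitly avoids querying a knowledge base whose documents were never indexed.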

In [None]:
# ensure that the kb is available
time.sleep(30)
# sync knowledge base
knowledge_base_metadata.start_ingestion_job()

Finally, we save the knowledge base ID to test the solution at a later stage.

In [None]:
kb_id_metadata = knowledge_base_metadata.get_knowledge_base_id()

### 2.2 Query the Knowledge Base with Retrieve and Generate API - without metadata

Let's test the knowledge base using the [**retrieve_and_generate**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html) API. With this API, Bedrock takes care of retrieving the necessary references from the knowledge base and generating the final answer using a foundation model from Bedrock.
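
The request sent to this API is a nested dictionary, and the only part that changes between the unfiltered and filtered queries in this notebook is the `vectorSearchConfiguration`. A minimal sketch of a builder for that request follows. The function name `build_rag_request` and the placeholder knowledge base ID and region are assumptions for illustration.

```python
def build_rag_request(query, kb_id, region, model_id,
                      number_of_results=5, metadata_filter=None):
    """Build keyword arguments for retrieve_and_generate (hypothetical helper)."""
    vector_search = {"numberOfResults": number_of_results}
    # Only include the filter key when a filter is supplied
    if metadata_filter is not None:
        vector_search["filter"] = metadata_filter
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": f"arn:aws:bedrock:{region}::foundation-model/{model_id}",
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": vector_search
                },
            },
        },
    }


# Placeholder values; substitute your own knowledge base ID and region
request = build_rag_request(
    "A strategy game with cool graphic with score of 9.0",
    kb_id="KB_ID_PLACEHOLDER",
    region="us-east-1",
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
)
# response = bedrock_agent_runtime_client.retrieve_and_generate(**request)
```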

```
query = "A strategy game with cool graphic with score of 9.0"
```

Expected result:
* Fantasy Kingdoms: Chronicles of Eldoria is a strategy RPG game with a score of 9.0.


In [None]:
query = "A strategy game with cool graphic with score of 9.0"

In [None]:
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id_metadata,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5
                } 
            }
        }
    }
)

pprint.pp(response['output']['text'])

### 2.3 Prepare metadata for ingestion
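
Metadata is supplied as a sidecar file stored next to each document: it takes the document's file name plus a `.metadata.json` suffix and holds a `metadataAttributes` object, which is the layout the cells in this section produce. A minimal sketch of that layout follows. The file name `example_game.csv`, the helper name `write_metadata_sidecar`, and all attribute values are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_metadata_sidecar(document_path, attributes):
    """Write '<document>.metadata.json' next to the document (hypothetical helper)."""
    sidecar = Path(f"{document_path}.metadata.json")
    sidecar.write_text(json.dumps({"metadataAttributes": attributes}))
    return sidecar


with tempfile.TemporaryDirectory() as tmp:
    # Made-up document and attribute values, for illustration only
    doc = Path(tmp) / "example_game.csv"
    doc.write_text("title,genres,year,publisher,score\n")
    sidecar = write_metadata_sidecar(
        doc,
        {"Id": doc.name, "genres": "Strategy", "year": 2020,
         "publisher": "Example Publisher", "score": 9.0},
    )
    sidecar_name = sidecar.name
    sidecar_content = json.loads(sidecar.read_text())

print(sidecar_name)
print(sidecar_content)
```

Keeping numeric attributes such as `score` as JSON numbers rather than strings matters later, because comparison operators like `greaterThanOrEquals` are applied to them.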


In [None]:
import csv
import json
import pandas as pd

def generate_metadata(data_dir, metadata_fields):
    """Write a '<name>.csv.metadata.json' sidecar file for every CSV in data_dir."""
    # Loop through all CSV files in the directory
    for name in os.listdir(data_dir):
        if not name.endswith(".csv"):
            continue
        filename = os.path.join(data_dir, name)

        # Read the CSV file (assumed to hold a single video game record)
        df = pd.read_csv(filename)
        # Use the file name as the document Id
        df["Id"] = os.path.basename(filename)

        # Take the first row's value for each attribute, in the requested order
        metadata = {k: v[0] for k, v in df[metadata_fields].to_dict(orient='list').items()}

        # Wrap the attributes in the structure expected in a metadata file
        json_data = {"metadataAttributes": metadata}

        # Write the JSON object next to the source file
        with open(f"{filename}.metadata.json", "w") as f:
            json.dump(json_data, f)

In [None]:
data_dir = './video_game'
metadata_fields = ["Id", "genres", "year", "publisher", "score"]

generate_metadata(data_dir, metadata_fields)

In [None]:
# upload metadata file to S3
upload_directory("video_game", bucket_name)

In [None]:
# delete metadata files from local
data_dir = './video_game'
for filename in os.listdir(data_dir):
    filename= f'{data_dir}/{filename}'
    if filename.endswith(".csv.metadata.json"):
        os.remove(filename)

Now start the ingestion job again. The documents themselves are unchanged from the first sync; this run picks up the metadata files that were just uploaded to the S3 bucket.

In [None]:
# ensure that the kb is available
time.sleep(30)
# sync knowledge base
knowledge_base_metadata.start_ingestion_job()

### 2.4 Query the Knowledge Base with Retrieve and Generate API - with metadata

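Create a filter that keeps only documents whose `genres` attribute equals `Strategy` and whose `score` is at least 9.0.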

In [None]:
one_group_filter= {
    "andAll": [
        {
            "equals": {
                "key": "genres",
                "value": "Strategy"
            }
        },
        {
            "greaterThanOrEquals": {
                "key": "score",
                "value": 9.0
            }
        }
    ]
}

Pass the filter to the `retrievalConfiguration` of the [**retrieve_and_generate**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html) API.
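
Filters are plain dictionaries, so they can also be assembled programmatically. The sketch below rebuilds the filter above and shows one nested group that combines `orAll` inside `andAll`. The helper names `condition`, `and_all` and `or_all`, the second genre value `Puzzle`, and the year 2020 are hypothetical.

```python
def condition(operator, key, value):
    """Build a single comparison, e.g. condition('equals', 'genres', 'Strategy')."""
    return {operator: {"key": key, "value": value}}


def _group(name, conditions):
    # A group only makes sense with at least two conditions to combine
    if len(conditions) < 2:
        raise ValueError(f"{name} needs at least two conditions")
    return {name: list(conditions)}


def and_all(*conditions):
    """All conditions must match."""
    return _group("andAll", conditions)


def or_all(*conditions):
    """At least one condition must match."""
    return _group("orAll", conditions)


# Same filter as one_group_filter above
strategy_top_rated = and_all(
    condition("equals", "genres", "Strategy"),
    condition("greaterThanOrEquals", "score", 9.0),
)

# Nested group: (Strategy OR Puzzle) AND year >= 2020 (made-up values)
nested_filter = and_all(
    or_all(
        condition("equals", "genres", "Strategy"),
        condition("equals", "genres", "Puzzle"),
    ),
    condition("greaterThanOrEquals", "year", 2020),
)
```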

In [None]:
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id_metadata,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5,
                    "filter": one_group_filter
                } 
            }
        }
    }
)

print(response['output']['text'])

As you can see, the retrieve and generate API returns the final response directly. Now let's look at the citations returned by the `RetrieveAndGenerate` API, which include the retrieved chunks the model used while generating the response. When we provide relevant context to the foundation model along with the query, it is more likely to generate a high-quality response.
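
A response can contain several citations, each with its own list of retrieved references. The sketch below flattens them into simple records. The function name `iter_citation_chunks` is hypothetical, and `sample_response` is a hand-written stand-in with made-up values that mirrors only the fields used here, not real output.

```python
def iter_citation_chunks(response):
    """Yield text, S3 URI and metadata for every retrieved reference."""
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            yield {
                "text": ref.get("content", {}).get("text", ""),
                "uri": ref.get("location", {}).get("s3Location", {}).get("uri"),
                "metadata": ref.get("metadata", {}),
            }


# Hand-written stand-in for a retrieve_and_generate response (made-up values)
sample_response = {
    "output": {"text": "Example answer."},
    "citations": [
        {
            "retrievedReferences": [
                {
                    "content": {"text": "Example chunk text."},
                    "location": {
                        "type": "S3",
                        "s3Location": {"uri": "s3://example-bucket/example_game.csv"},
                    },
                    "metadata": {"genres": "Strategy", "score": 9.0},
                }
            ]
        },
        {"retrievedReferences": []},  # a citation may carry no references
    ],
}

chunks = list(iter_citation_chunks(sample_response))
for num, chunk in enumerate(chunks, 1):
    print(f"Chunk {num}: {chunk['text']} ({chunk['uri']}) {chunk['metadata']}")
```

Printing the metadata alongside each chunk makes it easy to confirm that every returned document satisfies the filter.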

In [None]:
# response_metadata = response['citations'][0]['retrievedReferences']
# print("# of citations or chunks used to generate the response: ", len(response_metadata))
# def citations_rag_print(response_ret):
# #structure 'retrievalResults': list of contents. Each list has content, location, score, metadata
#     for num,chunk in enumerate(response_ret,1):
#         print(f'Chunk {num}: ',chunk['content']['text'],end='\n'*2)
#         print(f'Chunk {num} Location: ',chunk['location'],end='\n'*2)
#         print(f'Chunk {num} Metadata: ',chunk['metadata'],end='\n'*2)

# citations_rag_print(response_metadata)

### Clean up
Please make sure to uncomment and run the cells below to delete the resources created in this notebook. If you are planning to run the `dynamic-metadata-filtering` notebook under the `03-advanced-concepts` section, run it first and then come back here to delete the resources.
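
A single listing call returns a limited number of keys, so a bucket with many objects needs a paginated delete. A minimal sketch using the `list_objects_v2` paginator and batched `delete_objects` calls follows. The function name `empty_bucket` is hypothetical, and the sketch does not handle versioned buckets.

```python
def empty_bucket(s3, bucket):
    """Delete every object in the bucket page by page (hypothetical helper).

    `s3` is expected to behave like boto3.client('s3').
    Returns the number of keys submitted for deletion.
    """
    deleted = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # An empty bucket yields a page without a 'Contents' entry
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted


# Example usage (uncomment to run against your bucket):
# empty_bucket(s3_client, bucket_name)
# s3_client.delete_bucket(Bucket=bucket_name)
```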

In [None]:
# # Empty and delete S3 Bucket

# objects = s3_client.list_objects(Bucket=bucket_name)  
# if 'Contents' in objects:
#     for obj in objects['Contents']:
#         s3_client.delete_object(Bucket=bucket_name, Key=obj['Key']) 
# s3_client.delete_bucket(Bucket=bucket_name)

In [None]:
# # print("===============================Knowledge base==============================")
# knowledge_base_metadata.delete_kb(delete_s3_bucket=True, delete_iam_roles_and_policies=True)