# Optimize RAG retrieval using metadata filtering
For more details, refer to the AWS Machine Learning blog post at https://aws.amazon.com/blogs/machine-learning/knowledge-bases-for-amazon-bedrock-now-supports-metadata-filtering-to-improve-retrieval-accuracy/


### Context
For RAG-based applications, the accuracy of the responses generated by foundation models (FMs) depends on the context provided to the model. Context is retrieved from a vector store based on the user query. However, in many situations you may need to retrieve only documents created in a defined period or tagged with certain categories. Filtering on document metadata refines the search results and improves retrieval accuracy, which in turn leads to more relevant and accurate responses.

With metadata filters, you retrieve not just semantically relevant chunks but a well-defined subset of those chunks, determined by the metadata filters and values you apply. Metadata filtering gives you more control over the retrieved documents, especially when your queries are ambiguous.
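To make the idea concrete, here is a minimal, self-contained sketch of pre-filtering followed by similarity ranking. It uses a toy in-memory list in place of a real vector store. The `Chunk` class, the two-dimensional embeddings, the game names, and the `retrieve` helper are all illustrative assumptions and are not part of Knowledge Bases for Amazon Bedrock.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    embedding: list
    metadata: dict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(chunks, query_embedding, predicate=None, top_k=2):
    """Optionally pre-filter on metadata, then rank the survivors by similarity."""
    candidates = [c for c in chunks if predicate is None or predicate(c.metadata)]
    ranked = sorted(candidates, key=lambda c: cosine(c.embedding, query_embedding), reverse=True)
    return ranked[:top_k]


# Made-up chunks: the embeddings are chosen by hand, not produced by a model.
chunks = [
    Chunk("Game A", [1.0, 0.0], {"genres": "Strategy", "year": 2023}),
    Chunk("Game B", [0.9, 0.1], {"genres": "Shooter", "year": 2024}),
    Chunk("Game C", [0.6, 0.8], {"genres": "Strategy", "year": 2024}),
    Chunk("Game D", [0.95, 0.05], {"genres": "Strategy", "year": 2019}),
]
query_embedding = [1.0, 0.0]

unfiltered = retrieve(chunks, query_embedding)
filtered = retrieve(
    chunks,
    query_embedding,
    predicate=lambda m: m["genres"] == "Strategy" and m["year"] >= 2023,
)
print([c.text for c in unfiltered])
print([c.text for c in filtered])
```

Without the predicate, the closest chunks win regardless of year or genre. With it, only chunks that satisfy the metadata condition are ranked at all.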



<img src="images/metadata-filter.png" width="800"/>

To apply metadata filters:

- Provide a custom metadata file (up to 10 KB each) for each document in the knowledge base (KB).
- Apply filters to your retrievals, which instructs the vector store to pre-filter on document metadata and then search for relevant documents.
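As a rough illustration of the first point, the sketch below writes a metadata file next to a document. It follows the conventions visible later in this notebook: the file is named `<document>.metadata.json` and stores key/value pairs under `"metadataAttributes"`. The helper `write_metadata_file`, the sample attribute values, and the reading of "10 KB" as 10 × 1024 bytes are assumptions made for this example.

```python
import json
import os
import tempfile

# Assumption: "10 KB" is interpreted here as 10 * 1024 bytes.
MAX_METADATA_BYTES = 10 * 1024


def write_metadata_file(document_path, attributes):
    """Write <document_path>.metadata.json and return the path of the new file."""
    payload = json.dumps({"metadataAttributes": attributes}, indent=2)
    size = len(payload.encode("utf-8"))
    if size > MAX_METADATA_BYTES:
        raise ValueError(f"metadata is {size} bytes; limit is {MAX_METADATA_BYTES}")
    metadata_path = f"{document_path}.metadata.json"
    with open(metadata_path, "w", encoding="utf-8") as f:
        f.write(payload)
    return metadata_path


# Hypothetical document and attribute values, for illustration only.
workdir = tempfile.mkdtemp()
doc_path = os.path.join(workdir, "1.csv")
with open(doc_path, "w", encoding="utf-8") as f:
    f.write("placeholder document content\n")

metadata_path = write_metadata_file(doc_path, {"genres": "Strategy", "year": 2023})
print(os.path.basename(metadata_path))
```

Checking the size before writing catches oversized metadata locally, before the file reaches the data source.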

### Prerequisites

Before you can answer questions, the documents must be processed and stored in the knowledge base.

1. Load the documents into the knowledge base by connecting your S3 bucket (data source).
2. Ingestion: the knowledge base splits the documents into smaller chunks (based on the chunking strategy selected), generates embeddings, and stores them in the associated vector store. The notebook [0_create_ingest_documents_test_kb.ipynb](../1a_create_ingest_documents_test_kb.ipynb) takes care of this for you.

#### Notebook Walkthrough

This notebook uses Knowledge Bases for Amazon Bedrock, which converts user queries into
embeddings, searches the knowledge base, gets the relevant results, augments the prompt, and then invokes an LLM to generate the response.
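The "augment the prompt" step can be pictured with a small sketch. It takes a list shaped like the `retrievalResults` entries used later in this notebook (each with `content.text`) and builds a single prompt string. The `build_prompt` helper, the prompt wording, and the sample texts are hypothetical and only illustrate the idea. The service's actual prompt template is not shown here.

```python
def build_prompt(query, retrieval_results, max_chunks=3):
    """Combine retrieved chunk texts and the user query into one prompt string."""
    chunks = [r["content"]["text"] for r in retrieval_results[:max_chunks]]
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hand-written stand-in for a retrieval response; the texts are made up.
sample_results = [
    {"content": {"text": "Game A is a strategy game."}, "metadata": {"year": 2023}},
    {"content": {"text": "Game C is a strategy game."}, "metadata": {"year": 2024}},
]
prompt = build_prompt("Which strategy games came out in 2023 or later?", sample_results)
print(prompt)
```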

## Steps
1. Download the sample data with custom metadata and upload it to the S3 data source of the existing knowledge base (created previously).
2. Ingest the custom metadata into the knowledge base with an ingestion job.
3. Run a query with and without metadata filtering to observe the difference.

In [None]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
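kb_id = "<knowledge base id>"  # Provide the knowledge base ID here.
ds_id = "<data source id>"  # Provide the data source ID here (used by the ingestion job below).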

In [None]:
import json
import os
import boto3
from botocore.exceptions import ClientError
import pprint
from utility import create_bedrock_execution_role, create_oss_policy_attach_bedrock_execution_role, create_policies_in_oss, interactive_sleep
import random
import zipfile
from retrying import retry
suffix = random.randrange(200, 900)

sts_client = boto3.client('sts')
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
bedrock_agent_client = boto3_session.client('bedrock-agent', region_name=region_name)

pp = pprint.PrettyPrinter(indent=2)

## Download data with custom metadata and ingest into current knowledge base

In [None]:
from urllib.request import urlretrieve
# Download and prepare dataset
!mkdir -p ./data

url = "https://aws-blogs-artifacts-public.s3.amazonaws.com/ML-16482/30_generated_video_game_records.zip"
data_root = "data"

zip_path, _ = urlretrieve(url)

with zipfile.ZipFile(zip_path, "r") as zf:
    for zip_info in zf.infolist():
        # Skip directory entries (e.g. __MACOSX/)
        if zip_info.is_dir():
            continue
        # Flatten the archive so every file is extracted directly into data_root
        zip_info.filename = os.path.basename(zip_info.filename)
        # Keep only JSON files and ignore macOS resource-fork files ("._*")
        if not zip_info.filename.startswith("._") and zip_info.filename.endswith(".json"):
            zf.extract(zip_info, data_root)
            # print(zip_info.filename)

#### Upload custom metadata to S3 Bucket data source

In [None]:
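# Upload data to the S3 bucket that was configured as a data source for the knowledge base
bucket_name = "<bucket name>"  # Provide the data source bucket name here.
s3_client = boto3.client("s3")

def upload_directory(path, bucket_name):
    # Walk the directory tree and upload each file, using its base name as the object key
    for root, dirs, files in os.walk(path):
        for file in files:
            s3_client.upload_file(os.path.join(root, file), bucket_name, file)

upload_directory(data_root, bucket_name)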

In [None]:
# Print a sample s3 object (metadata file)
# A JSON file with metadata key/value under "metadataAttributes"
filename = '1.csv.metadata.json'

obj = s3_client.get_object(Bucket=bucket_name, Key=filename)
print(obj['Body'].read().decode('utf-8'))

### Start ingestion job to ingest metadata
Once the KB and data source are created, start the ingestion job, which will incrementally ingest the metadata.

In [None]:
# Start an ingestion job
start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId = kb_id, dataSourceId = ds_id)

In [None]:
job = start_job_response["ingestionJob"]
pp.pprint(job)

In [None]:
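# Poll the ingestion job until it reaches a terminal state
while job["status"] not in ("COMPLETE", "FAILED"):
    interactive_sleep(5)  # pause between polls instead of calling the API in a tight loop
    get_job_response = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=ds_id,
        ingestionJobId=job["ingestionJobId"],
    )
    job = get_job_response["ingestionJob"]
pp.pprint(job)
# Wait before querying so the ingested data has time to become searchable
interactive_sleep(40)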
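The polling pattern above can also be wrapped in a small helper that adds a timeout and an explicit failure path. The sketch below is one possible shape, not part of the notebook's `utility` module: `wait_for_job` is a hypothetical name, and the status lookup is passed in as a callable so the example runs without AWS access.

```python
import time


def wait_for_job(get_status, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_status() until it returns COMPLETE; raise on FAILED or on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "COMPLETE":
            return status
        if status == "FAILED":
            raise RuntimeError("ingestion job failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds}s")
        sleep(poll_seconds)


# Stand-in for a real status lookup: replays a scripted sequence of statuses.
scripted = iter(["STARTING", "IN_PROGRESS", "IN_PROGRESS", "COMPLETE"])
final_status = wait_for_job(lambda: next(scripted), sleep=lambda seconds: None)
print(final_status)
```

In this notebook, `get_status` could be a lambda that calls `bedrock_agent_client.get_ingestion_job(...)` with the same arguments as above and returns `["ingestionJob"]["status"]`.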

### Query knowledge base using boto3

In [None]:
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

#### Query without metadata filtering
See the boto3 `retrieve` API reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve.html

In [None]:
query = 'A strategy game with cool graphic released after 2023'

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId=kb_id,
    retrievalQuery={
        'text': query
    }
)

for game in response.get('retrievalResults'):
    # print(f"Title: {game.get('content').get('text').split('\n')[0].split(',')[0]}")
    print(f"Title: {game.get('content').get('text').split(':')[0].split(',')[-1].replace('score ','')}")
    print(f"Year: {game.get('metadata').get('year')}")
    print(f"Genre: {game.get('metadata').get('genres')}")
 

If you don't see any results, wait about 10 seconds and try again. Note that some of the returned video games have the wrong genre and/or year.

#### Query with metadata filtering

In [None]:
query = 'A strategy game with cool graphic released after 2023'
# genres = Strategy AND year >= 2023
metadata_filter = {
    "andAll": [
        {
            "equals": {
                "key": "genres",
                "value": "Strategy"
            }
        },
        {
            "greaterThanOrEquals": {
                "key": "year",
                "value": 2023
            }
        }
    ]
}

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId=kb_id,
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "filter": metadata_filter
        }
    },
    retrievalQuery={
        'text': query
    }
)

for game in response.get('retrievalResults'):
    # print(f"Title: {game.get('content').get(
    #     'text').split('\n')[1].split(',')[0]}")
    print(f"Title: {game.get('content').get('text').split(':')[0].split(',')[-1].replace('score ','')}")
    print(f"Year: {game.get('metadata').get('year')}")
    print(f"Genre: {game.get('metadata').get('genres')}")

With pre-filtering, all of the retrieved results now have the correct genre and year.
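One way to check this claim programmatically is to evaluate the same filter dictionary against each result's metadata on the client side. The sketch below handles only the operators used in this notebook plus `orAll`. The `matches` helper is hypothetical and only approximates the filter semantics locally. It is not how the service evaluates filters, and the sample metadata is made up.

```python
def matches(metadata, flt):
    """Return True if a metadata dict satisfies the filter (small operator subset)."""
    op, arg = next(iter(flt.items()))  # each filter node has exactly one operator key
    if op == "andAll":
        return all(matches(metadata, sub) for sub in arg)
    if op == "orAll":
        return any(matches(metadata, sub) for sub in arg)
    value = metadata.get(arg["key"])
    if op == "equals":
        return value == arg["value"]
    if op == "greaterThanOrEquals":
        return value is not None and value >= arg["value"]
    raise ValueError(f"unsupported operator: {op}")


# Same structure as the filter used in the query above.
metadata_filter = {
    "andAll": [
        {"equals": {"key": "genres", "value": "Strategy"}},
        {"greaterThanOrEquals": {"key": "year", "value": 2023}},
    ]
}

# Made-up metadata records standing in for retrieval results.
sample_metadata = [
    {"genres": "Strategy", "year": 2023},
    {"genres": "Strategy", "year": 2019},
    {"genres": "Shooter", "year": 2024},
]
flags = [matches(m, metadata_filter) for m in sample_metadata]
print(flags)
```

Applied to a real response, `all(matches(r["metadata"], metadata_filter) for r in response["retrievalResults"])` would confirm that every returned chunk satisfies the filter.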

<div class="alert alert-block alert-warning">
<b>Note:</b> Remember to delete the knowledge base, the OpenSearch Serverless (OSS) index, and the related IAM roles and policies to avoid incurring charges.
</div>