# Evaluating Knowledge Bases for Amazon Bedrock with Ragas

> *This notebook has been tested in the Python 3 kernel of SageMaker Studio JupyterLab (Distribution v1.9)*

In this notebook, we'll explore how the open-source library [Ragas](https://docs.ragas.io/en/latest/) can be applied to evaluate the quality of [Retrieval-Augmented Generation (RAG)](https://aws.amazon.com/what-is/retrieval-augmented-generation/) flows managed by [Amazon Bedrock Knowledge Bases](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html).

## Pre-requisites

### Additional permissions for Amazon OpenSearch

To complete the manual Bedrock Knowledge Base setup steps in this notebook, your **AWS Console user/role** will need:

- [Permissions to work with Amazon OpenSearch vector collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html)
- Permission to **create IAM roles** and attach policies to them, including: `iam:AttachRolePolicy`, `iam:CreateRole`, `iam:DetachRolePolicy`, `iam:GetRole`, `iam:PassRole`, `iam:CreatePolicy`, `iam:CreatePolicyVersion`, and `iam:DeletePolicyVersion`.

> ℹ️ **Note:** In testing, we saw `NetworkError` issues when attempting to create Bedrock KBs using only the above-linked `aoss` policy statements. This was resolved by granting `aoss:*` on `*` instead, but you should consider reducing these permissions before using in production environments.

If you're in an instructor-led workshop using temporary accounts provided by AWS, this setup should already have been completed for you. If not, refer to the [AWS Console for Identity and Access Management (IAM)](https://console.aws.amazon.com/iam/home?#/home) to grant permissions to your user or role.
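
For reference, the IAM actions listed above could be expressed as a policy document along the lines below. This is only a minimal sketch: the `Sid` value and the wildcard `Resource` are illustrative assumptions (not taken from any AWS-provided policy), and it does not cover the separate OpenSearch (`aoss`) permissions.

```python
import json

# IAM actions listed in the pre-requisites above
IAM_ACTIONS = [
    "iam:AttachRolePolicy",
    "iam:CreateRole",
    "iam:DetachRolePolicy",
    "iam:GetRole",
    "iam:PassRole",
    "iam:CreatePolicy",
    "iam:CreatePolicyVersion",
    "iam:DeletePolicyVersion",
]


def build_kb_setup_policy(resource: str = "*") -> dict:
    """Build an identity policy document granting the IAM actions listed above.

    Assumption: resource="*" is for experimentation only; narrow it for real use.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowKbExecutionRoleSetup",  # Hypothetical statement ID
                "Effect": "Allow",
                "Action": IAM_ACTIONS,
                "Resource": resource,
            }
        ],
    }


policy_json = json.dumps(build_kb_setup_policy(), indent=2)
print(policy_json)
```

The printed JSON could be pasted into the IAM console's policy editor and then scoped down to specific role and policy ARNs.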

## Imports and setup

First, we'll need to install [Ragas](https://docs.ragas.io/en/latest/) as it's not included by default on the SageMaker Studio base notebook kernel:

In [None]:
# We also force Pydantic version to avoid: https://github.com/explodinggradients/ragas/issues/867
%pip install "langchain-aws>=0.1,<0.2" "ragas==0.1.8" "pydantic>=2.8,<3"

Next, let's import the libraries that'll be used in the rest of the notebook - and set some **configurations** that we'll use later:

- An [Amazon S3 `bucket_name`](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is required to store our document corpus. By default, we'll use the *default bucket for Amazon SageMaker* - but you could change this to any bucket that this notebook's [execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) (if running in SageMaker) or your IAM user/role (if running locally) has access to
- A folder prefix under the bucket where artifacts will be stored (to keep things tidy in case the bucket is used for other projects also)

In [None]:
# Python Built-Ins:
from concurrent.futures import ThreadPoolExecutor
import json

# External Dependencies:
import boto3  # General Python SDK for AWS (including Bedrock)
from datasets import Dataset  # For use with Ragas
from langchain_community.embeddings import BedrockEmbeddings as LangChainBedrockEmbed
from langchain_aws import ChatBedrock as LangChainBedrock
import pandas as pd  # For working with tabular data
import ragas
import sagemaker  # Just used for looking up default bucket
from tqdm.notebook import tqdm  # Progress bars

bucket_name = sagemaker.Session().default_bucket()
s3_prefix = "bedrock-rag-eval"

botosess = boto3.Session()
region = botosess.region_name
br_agents_runtime = botosess.client("bedrock-agent-runtime")

## Create the knowledge base

> ⚠️ ***Watch out:** This section includes steps you'll need to take manually, not just running the code cells!*

First, we'll need to upload the sample documents to Amazon S3 - for which you can just run the code cell below:

In [None]:
corpus_s3uri = f"s3://{bucket_name}/{s3_prefix}/corpus"

print(f"Syncing corpus to:\n{corpus_s3uri}/")

!aws s3 sync --quiet ./datasets/question-answering/corpus {corpus_s3uri}/
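
If the AWS CLI is not available in your environment, a similar upload could be sketched with boto3 instead. The `upload_corpus` helper below is hypothetical (not part of this repository). Unlike `aws s3 sync`, it simply re-uploads every file rather than skipping unchanged ones.

```python
from pathlib import Path


def upload_corpus(s3_client, local_dir: str, bucket: str, prefix: str) -> list:
    """Upload every file under `local_dir` to s3://bucket/prefix/, returning the keys."""
    uploaded_keys = []
    root = Path(local_dir)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # Skip directories
        # Build the object key from the prefix plus the file's relative path
        key = f"{prefix.strip('/')}/{path.relative_to(root).as_posix()}"
        s3_client.upload_file(str(path), bucket, key)
        uploaded_keys.append(key)
    return uploaded_keys


# Example usage (requires AWS credentials and the variables defined earlier):
# upload_corpus(
#     botosess.client("s3"),
#     "./datasets/question-answering/corpus",
#     bucket_name,
#     f"{s3_prefix}/corpus",
# )
```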

The simplest way to set up the actual Bedrock Knowledge Base for testing will be **manually through the AWS Console**:

▶️ First, **open** the [AWS Console for Amazon Bedrock](https://console.aws.amazon.com/bedrock/home?#/knowledge-bases) and select *Orchestration > Knowledge bases* from the left sidebar menu, as shown in the screenshot below:

> ⚠️ **Check** you're working in the correct *AWS Region* in the top right corner of the UI

![](images/bedrock-kbs/01-bedrock-kb-console.png "Screenshot of AWS Console for Amazon Bedrock Knowledge Bases, showing 'Create knowledge base' action button")

▶️ **Click** the *Create knowledge base* button to start the workflow. In the screen that opens:

- For **knowledge base name**, enter `example-squad-kb`
- For **knowledge base description**, you can provide (something like) `Demo knowledge base for question answering evaluation`
- Leave the other settings as default (allow creating a new execution role, and no tags)

Your configuration should be as shown below:

![](images/bedrock-kbs/02a-create-kb-basics.png "Screenshot of step 1 in Bedrock Knowledge Base creation workflow: with KB name, description, (create new) execution role, and (empty) tags configured. At the end of the form, a 'Next' button is visible.")

▶️ In the **Next** screen, you'll configure the S3 data source:

- For **data source name**, enter `example-squad-corpus`
- For **S3 URI**, refer to the previous cell of this notebook where we uploaded the data and output `Syncing corpus to: ...`

> ⚠️ **Be sure to include the trailing slash** in your `s3://.../.../` URI. If you omit it, you may find that creating the KB succeeds but it fails to sync any documents later (because the auto-created execution role for Amazon Bedrock will be granted IAM `s3:GetObject` permissions to `.../corpus` instead of to `.../corpus/*`).
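
To see why the trailing slash matters, the sketch below approximates IAM resource matching with Python's shell-style wildcards. This is a simplification of how IAM evaluates resource patterns, and the bucket name is a hypothetical placeholder.

```python
from fnmatch import fnmatchcase


def resource_allows(resource_pattern: str, object_arn: str) -> bool:
    """Rough approximation of IAM resource matching using shell-style wildcards."""
    return fnmatchcase(object_arn, resource_pattern)


# Hypothetical bucket and prefix, for illustration only:
object_arn = "arn:aws:s3:::example-bucket/bedrock-rag-eval/corpus/Normans.txt"

# S3 URI entered WITHOUT the trailing slash: role scoped to the bare prefix
without_slash = resource_allows(
    "arn:aws:s3:::example-bucket/bedrock-rag-eval/corpus", object_arn
)
# S3 URI entered WITH the trailing slash: role scoped to everything under it
with_slash = resource_allows(
    "arn:aws:s3:::example-bucket/bedrock-rag-eval/corpus/*", object_arn
)
print(without_slash, with_slash)  # False True
```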

You should leave all *Advanced settings* as default, but feel free to expand out this section to explore the options available.

![](images/bedrock-kbs/02b-create-kb-data-source.png "Screenshot of S3 data source configuration, with name and S3 URI configured and 'next' button visible")

▶️ In the **Next** screen, you'll configure the vector index:

- For **embeddings model**, select `Cohere Embed Multilingual`

> ⚠️ **Check** in the [Amazon Bedrock Model Access console](https://console.aws.amazon.com/bedrock/home?#/modelaccess) that you've enabled access to this model in the current region.
>
> If needed, you should be able to select an alternative embedding model instead... But we haven't tested all options for this walkthrough.

- For **Vector database**, select `Quick create a new vector store`

You can find more information on this screen or in the [Amazon Bedrock Developer Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-setup.html) about the different vector stores Bedrock Knowledge Bases support. This default option will create a new [Amazon OpenSearch Serverless](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-overview.html) collection.

Leave other settings at their defaults as shown below, and you should be ready to proceed:

![](images/bedrock-kbs/02c-create-kb-index.png "Screenshot of Knowledge Base vector index settings including Cohere Embed Multilingual embedding model, and quick-create vector store. 'Next' button is visible.")

▶️ Click **Next** to review your configuration, and then **Create knowledge base** to complete the process.

> ⏰ It might take **a few minutes** for the creation to complete. A progress indicator banner should be visible if you scroll up. Alternatively in a separate tab, you could check the [Amazon OpenSearch Serverless Collections console](https://console.aws.amazon.com/aos/home?#opensearch/collections) - where you should see the underlying vector collection being created.

Once your Knowledge Base has been created successfully, you'll be directed to its detail screen as shown below:

![](images/bedrock-kbs/03-kb-detail-page.png "Detail screen for the created Amazon Bedrock Knowledge Base, showing creation success banner. Includes sections 'Knowledge base overview' (containing the KB ID, name, and other details); 'Tags' (empty); 'Data source' (one Amazon S3 data source listed); 'Embeddings model' (Cohere Embed); and an interactive 'Test knowledge base' chat sidebar on the right with a warning that some data sources have not been synced.")

As mentioned in the alert box shown above, your new knowledge base will not contain your documents until we **sync** the data source:

▶️ **Select** your S3 data source using the radio button to the left of its name in the data sources list, and **click the Sync button** above to start the sync.

The sync should only take a few seconds, after which your data source's *Status* will return to `Available`.

![](images/bedrock-kbs/04a-kb-data-source-after-sync.png "Screenshot of KB 'data source' section after running sync, with the data source selected and status showing as 'available'")

With the sync completed, your Knowledge Base should be ready to use.

Optionally, you can click through to your data source name to check the sync `Added` the 20 files as expected:

![](images/bedrock-kbs/04b-kb-data-sync-details.png "Data source details screen showing sync completed successfully with 20 files detected and added to the index, and 0 files failed")
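
The same sync can also be triggered from code. Below is a minimal sketch using the boto3 `bedrock-agent` client's `start_ingestion_job` and `get_ingestion_job` calls. The `sync_data_source` helper, its polling defaults, and the placeholder IDs are assumptions for illustration, and the set of terminal statuses is our reading of the API rather than an exhaustive list.

```python
import time


def sync_data_source(agent_client, kb_id: str, data_source_id: str,
                     poll_seconds: float = 5.0, max_polls: int = 60) -> dict:
    """Start an ingestion job and poll until it finishes, returning the final job."""
    job = agent_client.start_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=data_source_id
    )["ingestionJob"]
    for _ in range(max_polls):
        # Assumed terminal statuses; anything else is treated as still running
        if job["status"] in ("COMPLETE", "FAILED", "STOPPED"):
            return job
        time.sleep(poll_seconds)
        job = agent_client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=data_source_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]
    raise TimeoutError(f"Ingestion job {job['ingestionJobId']} still {job['status']}")


# Example usage (requires AWS credentials; both IDs are placeholders):
# import boto3
# job = sync_data_source(boto3.client("bedrock-agent"), "YOUR_KB_ID", "YOUR_DATA_SOURCE_ID")
# print(job["status"])
```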

## Try out your Knowledge Base

Before we discuss evaluation at scale, let's run a couple of test queries to check the KB is working properly.

### ...From the AWS Console

Your Knowledge Base's detail screen in the [AWS Console for Amazon Bedrock](https://console.aws.amazon.com/bedrock/home?#/knowledge-bases) includes an interactive widget on the right sidebar, for trying out queries.

▶️ **Click** the orange *Select model* button to get started, and select `Claude 3 Sonnet` with on-demand throughput as shown below:

![](images/bedrock-kbs/05-kb-select-model-claude-3-sonnet.png "Screenshot of model selection interface with Claude 3 Sonnet selected on on-demand throughput")

▶️ **Type** an example question into the chat panel on the right of the screen, and click **Run** to try it out. You can ask for example:

```
In what country is Normandy located?
```

You should find the system is able to respond, based on the [Normans.txt](datasets/question-answering/corpus/Normans.txt) file we ingested, as shown below:

![](images/bedrock-kbs/06a-kb-test.png "example-squad-kb detail page with the sample question already asked in the interactive try it out sidebar. The model's response, about a paragraph long, correctly identifies Normandy as being in France and includes a link to 'show source details'")

▶️ **Explore** the source details, and the configuration menu available in the top left of the 'Test knowledge base' widget. You'll see:

1. An extensive range of configuration options are available, covering both source document retrieval and final answer generation
2. In this case, the `Normans` source article has been automatically split into two **chunks** during ingestion to the knowledge base - to improve answer relevancy.

### ...From Python code

To use our Knowledge Base programmatically, we'll need to look up the **knowledge base ID**. This automatically-generated, alphanumeric string is different from the *name* we gave our KB during creation. You can find it [from the AWS Console](https://console.aws.amazon.com/bedrock/home?#/knowledge-bases), in the "Knowledge base overview" section of your KB's detail screen.

▶️ **Replace** the below placeholder with your knowledge base's unique ID, and run the cells below to continue:

In [None]:
knowledge_base_id = "TODO"  # Something like "55GUAMQYUT"
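
As an alternative to copying the ID from the console, it could be looked up by name. The sketch below assumes the boto3 `bedrock-agent` client's `list_knowledge_bases` call, that your role is allowed to call it, and that the KB name is unique in the current region. The `find_knowledge_base_id` helper is hypothetical.

```python
def find_knowledge_base_id(agent_client, kb_name: str) -> str:
    """Return the ID of the knowledge base called `kb_name`, paging through results."""
    kwargs = {}
    while True:
        page = agent_client.list_knowledge_bases(**kwargs)
        for summary in page.get("knowledgeBaseSummaries", []):
            if summary["name"] == kb_name:
                return summary["knowledgeBaseId"]
        if not page.get("nextToken"):
            raise LookupError(f"No knowledge base named {kb_name!r} found")
        # Request the next page of results
        kwargs["nextToken"] = page["nextToken"]


# Example usage (requires AWS credentials):
# knowledge_base_id = find_knowledge_base_id(
#     botosess.client("bedrock-agent"), "example-squad-kb"
# )
```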

With the ID identified, you can use the Bedrock runtime [RetrieveAndGenerate API](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_RetrieveAndGenerate.html) (see [corresponding boto3 doc page for Python](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html)) to query your knowledge base.

As in the manual example, you'll also need to select which text generation model to use - which we've pre-populated below for Claude 3 Sonnet:

In [None]:
rag_resp = br_agents_runtime.retrieve_and_generate(
    input={"text": "In what country is Normandy located?"},
    retrieveAndGenerateConfiguration={
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": knowledge_base_id,
            "modelArn": f"arn:aws:bedrock:{region}::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
        "type": "KNOWLEDGE_BASE",
    },
    # Optional session ID can help improve results for follow-up questions:
    # sessionId='string'
)

print("Plain text response:")
print("--------------------")
print(rag_resp["output"]["text"], end="\n\n\n")

print("Full API output:")
print("----------------")
rag_resp

As shown in the full API response from the above cell, the `RetrieveAndGenerate` action provides:

- The final text answer
- The `retrievedReferences` from the search engine
- Specific `citations` localizing which references should be cited by different parts of the text answer

...Similarly to the information we saw when trying the KB out in the AWS Console.
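
To make the nesting of `citations` and `retrievedReferences` more concrete, the sketch below pairs each cited part of the answer with the S3 URIs supporting it. The `summarize_citations` helper is hypothetical, and `sample_response` is a hand-written, abbreviated stand-in for a real API response (not actual output).

```python
def summarize_citations(rag_response: dict) -> list:
    """Pair each cited part of the answer with the S3 URIs of its references."""
    summary = []
    for citation in rag_response.get("citations", []):
        part = citation.get("generatedResponsePart", {}).get("textResponsePart", {})
        uris = [
            ref.get("location", {}).get("s3Location", {}).get("uri")
            for ref in citation.get("retrievedReferences", [])
        ]
        summary.append({"text": part.get("text", ""), "sources": uris})
    return summary


# Hand-written example mimicking the response structure (not real output):
sample_response = {
    "output": {"text": "Normandy is located in France."},
    "citations": [
        {
            "generatedResponsePart": {
                "textResponsePart": {"text": "Normandy is located in France."}
            },
            "retrievedReferences": [
                {
                    "content": {"text": "Normandy, a region in France..."},
                    "location": {
                        "type": "S3",
                        "s3Location": {"uri": "s3://example-bucket/corpus/Normans.txt"},
                    },
                }
            ],
        }
    ],
}

for item in summarize_citations(sample_response):
    print(item["text"], "<-", item["sources"])
```

With a live response, you would pass `rag_resp` from the cell above instead of `sample_response`.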

It's also possible to run **only the retrieval** through the API, and skip the generative answer synthesis step - as shown below:

In [None]:
retrieve_resp = br_agents_runtime.retrieve(
    knowledgeBaseId=knowledge_base_id,
    retrievalQuery={"text": "In what country is Normandy located?"},
)
print(json.dumps(retrieve_resp["retrievalResults"], indent=2))
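
The `Retrieve` API also accepts an optional `retrievalConfiguration`, which can be used to control how many chunks are returned. The sketch below is a hypothetical helper that requests the top `top_k` results and summarizes each one's relevance score and source. The field names reflect our understanding of the response shape, so check them against your own output.

```python
def retrieve_top_k(runtime_client, kb_id: str, query: str, top_k: int = 3) -> list:
    """Retrieve up to `top_k` chunks as (score, source URI, text preview) tuples."""
    resp = runtime_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [
        (
            result.get("score"),
            result.get("location", {}).get("s3Location", {}).get("uri"),
            result["content"]["text"][:80],  # Short preview of the chunk text
        )
        for result in resp["retrievalResults"]
    ]


# Example usage (requires AWS credentials and the variables defined earlier):
# for score, uri, preview in retrieve_top_k(
#     br_agents_runtime, knowledge_base_id, "In what country is Normandy located?"
# ):
#     print(score, uri, preview)
```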

## Evaluate Bedrock Knowledge Bases with Ragas

Now that we have our Bedrock Knowledge Base set up, we'd like to **evaluate** the quality of its results against a test dataset - to help us **optimize** the configuration for high quality and low cost.

First, let's load the sample dataset of questions, reference answers, and their source documents, that we [prepared earlier](datasets/Prepare-SQuAD.ipynb):

In [None]:
dataset_df = pd.read_json("datasets/question-answering/qa.manifest.jsonl", lines=True)
dataset_df.head(10)

Records in this dataset include:

- (`doc`) The full text of the source document for this example
- (`doc_id`) A unique identifier for the source document
- (`question`) The user question to be asked
- (`question_id`) A unique identifier for the question
- (`answers`) A list of (possibly multiple) reference 'correct' answers, supported by the document

### Run the knowledge base against our test set

As shown in [Ragas' API Reference](https://docs.ragas.io/en/latest/references/evaluation.html), records in Ragas evaluation datasets typically include:

- The `question` that was asked
- The `answer` the system generated
- The actual text `contexts` the answer was based on (i.e. snippets of document text retrieved by the search engine)
- The `ground_truth` answer(s)

We can run our example questions through the Bedrock KB RAG as shown below, to fetch the outputs ready to calculate metrics:

In [None]:
from concurrent.futures import ThreadPoolExecutor

def retrieve_and_generate(
    question: str,
    kb_id: str,
    generate_model_arn: str = f"arn:aws:bedrock:{region}::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
):
    rag_resp = br_agents_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": generate_model_arn,
            },
            "type": "KNOWLEDGE_BASE",
        },
    )
    # Fetch flat list of references from the nested citations->retrievedReferences:
    all_refs = [r for cite in rag_resp["citations"] for r in cite["retrievedReferences"]]
    ref_s3uris = [r["location"]["s3Location"]["uri"] for r in all_refs]
    # Map e.g. 's3://.../doc_id.txt' to 'doc_id':
    ref_ids = [uri.rpartition("/")[2].rpartition(".")[0] for uri in ref_s3uris]
    return {
        "answer": rag_resp["output"]["text"],
        "retrieved_doc_ids": ref_ids,
        "retrieved_doc_texts": [r["content"]["text"] for r in all_refs]
    }

with ThreadPoolExecutor(max_workers=2) as pool:
    rag_futures = [
        pool.submit(retrieve_and_generate, question=rec.question, kb_id=knowledge_base_id)
        for _, rec in dataset_df.iterrows()
    ]
    outputs_df = pd.DataFrame([f.result() for f in tqdm(rag_futures, desc="Running RAG...")])
    # Combine & clarify the column names for a nice tabular representation:
    results_df = pd.concat((dataset_df, outputs_df), axis=1).rename(
        columns={
            "answer": "model_answer",
            "answers": "gt_answers",
            "doc_id": "gt_doc_id",
            "doc": "gt_doc_text",
        }
    )
results_df.head(10)
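
Note that in the cell above, the first exception raised by `f.result()` (for example due to API throttling) would interrupt the whole run. On larger datasets, you might prefer to collect failures separately and retry them later. The `run_all` helper below is a hypothetical sketch of that idea using only the standard library.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(fn, items, max_workers: int = 2):
    """Call `fn(item)` for each item in parallel, keeping results and errors apart.

    Returns (results, errors): `results[i]` stays None wherever `errors` has key i.
    """
    results = [None] * len(items)
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, item) for item in items]
        for ix, future in enumerate(futures):
            try:
                results[ix] = future.result()
            except Exception as err:  # Record the failure instead of aborting
                errors[ix] = err
    return results, errors


# Example usage with the function defined above:
# results, errors = run_all(
#     lambda q: retrieve_and_generate(question=q, kb_id=knowledge_base_id),
#     list(dataset_df["question"]),
# )
# print(f"{len(errors)} questions failed")
```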

Ragas supports a broad [range of metrics](https://docs.ragas.io/en/latest/concepts/metrics/index.html) with the option to configure which ones you calculate - but many of these depend on providing an **evaluator LLM** (or evaluator embedding model) which will be used in scoring.

We'll set up Claude 3 Sonnet as the evaluator LLM and Cohere Embed Multilingual as the evaluator embedding model, to be able to demonstrate the full suite of available metrics.

Although Ragas defines its own base classes (see [BaseRagasLLM](https://github.com/explodinggradients/ragas/blob/2d793651f778b6c0da07a834e9ce2765be13cc9f/src/ragas/llms/base.py#L46), [BaseRagasEmbeddings](https://github.com/explodinggradients/ragas/blob/2d793651f778b6c0da07a834e9ce2765be13cc9f/src/ragas/embeddings/base.py#L19)) for these interfaces, integration is typically via LangChain for simplicity - so that's the pattern we'll follow here:

In [None]:
ragas_result = ragas.evaluation.evaluate(
    Dataset.from_pandas(results_df),
    metrics=[
        # A looot of metrics to give a general overview:
        ragas.metrics.answer_relevancy,
        ragas.metrics.faithfulness,
        ragas.metrics.context_precision,
        ragas.metrics.ContextRelevancy(),
        ragas.metrics.context_recall,
        ragas.metrics.answer_similarity,
        ragas.metrics.answer_correctness,
        ragas.metrics.critique.conciseness,
    ],
    llm=LangChainBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0"),
    embeddings=LangChainBedrockEmbed(model_id="cohere.embed-multilingual-v3"),
    is_async=False,
    column_map={
        "answer": "model_answer",
        "contexts": "retrieved_doc_texts",
        "ground_truths": "gt_answers",
        "question": "question",
    },
)

print("Overall scores")
print("--------------")
print(ragas_result, end="\n\n")
print("Details")
print("-------")
scores_df = ragas_result.to_pandas()
scores_df.head(10)

Although your figures may differ due to variability of generation, in our test run on the sample dataset we observed:

```python
{
    'answer_relevancy': 0.8377,
    'faithfulness': 0.9700,
    'context_precision': 1.0000,
    'context_relevancy': 0.3052,
    'context_recall': 0.9950,
    'answer_similarity': 0.5733,
    'answer_correctness': 0.6611,
    'conciseness': 1.0000,
}
```

Generally, the system appears to perform well at retrieving the correct document and generating appropriate answers on this dataset. Many of the **lower** scores reflect the significantly different styles of the reference answers (which just extract specific words or phrases from the source document, without providing any explanation) and the RAG system responses (which typically provide a bit more contextual background).

For full discussion of the various metrics and their interpretation, refer to the [Ragas documentation](https://docs.ragas.io/en/latest/concepts/metrics/index.html).

## Evaluation beyond Ragas

Some particular points to note about the Ragas evaluation results include:

1. Nearly all the metrics (even those like 'conciseness') are based on either **LLM self-critique, or embedding-based scores**
    - As a consequence, we should be careful to monitor for potential **bias** between different evaluator LLMs and candidate LLMs - since [LLM evaluators have been shown to recognise and prefer their own generations](https://arxiv.org/abs/2404.13076).
    - We should aim to have **humans review** a subset of the same data and rate the same metrics, allowing us to *measure* how well these automated evaluations align with real user preferences - and thus quantify how much trust we should place in the automated metrics when run over bigger datasets that it wouldn't be practical for humans to label.
2. Not all the available information is being utilized in this case
    - Although metrics like `context_relevancy` and `context_recall` aim to rate how well the retrieved context relates to the question and the ground-truth answer, in this case we have labelled examples of which document each answer should come from (our `gt_doc_text` and `gt_doc_id` fields), which the metrics are not using

If you have labelled data for the correct source document per question in your use-case, you might also be interested to calculate more traditional search engine performance metrics, like:

- **Recall at K:** In what percentage of cases did the target document appear in the first *K* snippets returned by the search engine?
- **Precision:** What percentage of the snippets returned by the search engine to use in generation, came from 'correct'/relevant documents?
- **Normalized Discounted Cumulative Gain (nDCG) at K:** Summarizing how early 'relevant' documents are ranked, in cases where many of your example questions have multiple relevant documents that should be returned.

Precision and recall are fairly straightforward to calculate from the available data as shown below:

In [None]:
sample_recall_at_1 = [
    # Recall@1 is % of the time the first retrieved doc was the target
    # (questions where nothing was retrieved count as misses)
    1.0 if rec.retrieved_doc_ids and rec.gt_doc_id == rec.retrieved_doc_ids[0] else 0.0
    for _, rec in results_df.iterrows()
]
print(f"Recall@1: {sum(sample_recall_at_1) / len(sample_recall_at_1):.2%}")

sample_precisions = [
    # Precision is count(retrieved_doc == target_doc) / number_of_retrieved_docs
    # (treated as 0 when nothing was retrieved, to avoid dividing by zero)
    sum(doc_id == rec.gt_doc_id for doc_id in rec.retrieved_doc_ids) / max(len(rec.retrieved_doc_ids), 1)
    for _, rec in results_df.iterrows()
]
print(f"Mean precision: {sum(sample_precisions) / len(sample_precisions):.2%}")

In our tests, we recorded `100.00%` for both these metrics on the sample dataset.

In this dataset with only 20 documents each covering very distinct topics (see for yourself in [datasets/question-answering/corpus](datasets/question-answering/corpus)), it's easy for the retriever to fetch the correct document 100% of the time. In more typical enterprise contexts with large knowledge bases of heavily overlapping content, this retrieval is much more likely to be a challenging bottleneck to overall RAG performance.
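
For datasets where a question may have more than one relevant document, Recall at K and nDCG at K from the list above could be sketched as follows. This is a minimal version assuming binary relevance (a document is either relevant or not) and counting each relevant document once even if several retrieved chunks come from it. The function names are our own, and `Other_doc` is a hypothetical document ID.

```python
import math


def recall_at_k(retrieved_ids, relevant_ids, k: int) -> float:
    """Fraction of the relevant documents that appear in the first k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)


def ndcg_at_k(retrieved_ids, relevant_ids, k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the actual ranking over the ideal DCG."""
    relevant = set(relevant_ids)
    seen = set()
    dcg = 0.0
    for rank, doc_id in enumerate(retrieved_ids[:k], start=1):
        # Count each relevant document once, even if several chunks match it
        if doc_id in relevant and doc_id not in seen:
            dcg += 1.0 / math.log2(rank + 1)
            seen.add(doc_id)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Target document ranked second, behind a hypothetical unrelated document:
retrieved = ["Other_doc", "Normans"]
print(recall_at_k(retrieved, ["Normans"], k=1))  # 0.0
print(recall_at_k(retrieved, ["Normans"], k=2))  # 1.0
print(round(ndcg_at_k(retrieved, ["Normans"], k=2), 4))  # 0.6309
```

These could be applied per row of `results_df` (using `retrieved_doc_ids` and `[gt_doc_id]`) and then averaged, in the same way as the precision calculation above.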

## Clean-Up

Bedrock Knowledge Bases store and index data backed by an underlying vector store, so once you're done experimenting you should **delete your knowledge base** to avoid unnecessary ongoing charges from the vector store itself.

> ⚠️ **Before you clean up:** The [conversational tests example notebook](conversational-tests/Conversational%20Tests.ipynb) re-uses the Knowledge Base we created here... So if you're going to run through that example, **do it first** before cleaning up the resources below!

1. Find your KB in the [Orchestration > Knowledge bases section of the Amazon Bedrock console](https://console.aws.amazon.com/bedrock/home?#/knowledge-bases)
2. Check the underlying vector store from the KB details: In this example it should be an Amazon OpenSearch Serverless collection
3. **Delete** your KB from the Amazon Bedrock console
4. Check in the [Serverless Collections section of the Amazon OpenSearch console](https://console.aws.amazon.com/aos/home?region=#opensearch/collections) (or whichever other service is relevant, if your KB was backed by Aurora Postgres or a different store), that your underlying collection has also been deleted - and delete it manually if not.
5. Consider also removing the source data from [Amazon S3](https://console.aws.amazon.com/s3/buckets), to avoid any potential S3 charges.

For more information, refer to the pricing pages for [Amazon OpenSearch Serverless](https://aws.amazon.com/opensearch-service/pricing/), [Amazon Bedrock](https://aws.amazon.com/bedrock/pricing/), and [Amazon S3](https://aws.amazon.com/s3/pricing/).

## Summary

In this notebook we saw how to create a [Knowledge Base for Amazon Bedrock](https://aws.amazon.com/bedrock/knowledge-bases/) to deploy a fully-managed RAG pipeline on AWS, and then ran some basic result quality evaluations using the [RetrieveAndGenerate API](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_RetrieveAndGenerate.html) and the open-source library [Ragas](https://docs.ragas.io/en/latest/) for RAG evaluations.

While Ragas provides a broad range of pre-implemented metrics that can help summarize performance and pinpoint limitations of RAG systems on validation datasets, it's important to remember that the included metrics are largely LLM-evaluated - and therefore potentially subject to bias, especially if comparing between different candidate and evaluator LLMs. Two practices can strengthen your evaluation:

- **Collecting** source attribution data up-front in your validation datasets (i.e. source document, not just question and reference answer), will provide additional options for you to evaluate the performance of the retrieval component itself separately from the answer generation/synthesis - helping to identify which component is the bottleneck for overall result quality
- **Comparing** Ragas-reported metrics against human evaluations for the same datasets and metrics, can help to quantify how representative the automated metrics are of real user perceptions, and therefore how much trust should be placed in them.