# RAG Evaluation

Retrieval-Augmented Generation (RAG) enhances generative models by integrating external information retrieval, and many real-world applications now rely on it.
However, evaluating RAG systems is not straightforward: their hybrid structure and reliance on dynamic knowledge sources pose unique challenges.
To address these challenges, we explore a framework for evaluating and benchmarking RAG systems called [RAGAS](https://docs.ragas.io/).
Using this framework, we'll evaluate the effectiveness of the Retrieval and Generation components on dimensions such as relevance, accuracy, and faithfulness, by comparing the generated outputs against ground truth pairs.

To demonstrate RAG application evaluation capabilities, we'll build a RAG application using Knowledge Bases for Bedrock. This RAG application serves as a book assistant, using the Q&As created in [00-qa_generator.ipynb](00-qa_generator.ipynb) as the basis for the evaluation.

Import library dependencies for the lab:

In [None]:
import boto3
import json
import uuid
import urllib.request
import sagemaker
from datasets import Dataset
from langchain_aws import ChatBedrock
from langchain_community.embeddings import BedrockEmbeddings
import re
from utils import bedrock_helper

Restore variables from the setup for this lab.

In [None]:
%store -r

## Upload Source Data to S3
In this example, we'll use a book titled [The Adventures of Sherlock Holmes](https://www.gutenberg.org/cache/epub/1661/pg1661.txt) as the source of knowledge. This book is made available for free by [Project Gutenberg](https://www.gutenberg.org) and has a copyright status of public domain. For more information, please refer to the details [here](https://www.gutenberg.org/ebooks/1661).

In the previous notebook, we extracted the book content into chapters. In this notebook, we'll upload each chapter to an S3 bucket so that it can be used by Knowledge Bases for Bedrock as the source for the RAG application.

In [None]:
# Upload each chapter as a separate text object to the default SageMaker S3 bucket.
sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
s3_prefix = "bedrock/knowledgebase/datasources/sherlock_holmes"

s3 = boto3.client("s3")
# chapter_contents comes from the previous notebook; the first two entries are skipped.
for idx, chapter_content in enumerate(chapter_contents[2:]):
    s3.put_object(
        Body=chapter_content,
        Bucket=default_bucket,
        Key=f"{s3_prefix}/chapter_{idx+1}.txt"
    )

## Setup
In our setup, we'll use Bedrock Claude models as the LLMs. In addition, we'll use Amazon Titan Text Embedding v2 to convert the documents into embeddings and store the vectors in an OpenSearch Serverless collection.

In [None]:
execution_role = sagemaker.get_execution_role()
bedrock = boto3.client("bedrock")
bedrock_runtime = boto3.client("bedrock-runtime")
agent_runtime = boto3.client('bedrock-agent-runtime')
bedrock_agent = boto3.client("bedrock-agent")
evaluation_model_id = "anthropic.claude-3-sonnet-20240229-v1:0" # model ID to be used with RAGAS for evaluation.
embedding_dim = 1024
region = bedrock.meta.region_name
model_arn = f"arn:aws:bedrock:{region}::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"
embedding_model_id = "amazon.titan-embed-text-v2:0" # model ID for the embedding model to be used with RAGAS for evaluation.
embedding_model_arn = f"arn:aws:bedrock:{region}::foundation-model/amazon.titan-embed-text-v2:0" # model ARN for the embedding model to be used by Knowledge Bases for Bedrock.
boto3_credentials = boto3.Session().get_credentials() # credentials for the OpenSearch Serverless collection

# Create a Knowledge Base using Amazon Bedrock

The following section describes the steps required to create a knowledge base in Bedrock. We are going to use the Amazon Bedrock Agent SDK and the OpenSearch SDK to create the required components.

## How it works
Knowledge Bases for Amazon Bedrock helps you take advantage of Retrieval Augmented Generation (RAG), a popular technique that draws information from a data store to augment the responses generated by Large Language Models (LLMs). With this approach, your application can query the knowledge base and return the most relevant information it contains, answering the query either with direct quotations from the sources or with natural language responses generated from the query results.

There are 2 main processes involved in carrying out RAG functionality via Knowledge Bases for Bedrock:

1. Pre-processing - Ingest source data, create embeddings for the data and populate the embeddings into a vector database.
2. Runtime Execution - Query the vector database for documents similar to the user query and return the top-k documents as the basis for the LLM to provide a response.

The following diagrams illustrate schematically how RAG is carried out. Knowledge Bases for Bedrock simplifies the setup and implementation of RAG by automating several steps in this process.

### Preprocessing Stage
![RAG preprocessing](img/br-kb-preprod-diagram.png)

### Execution Stage
![RAG execution](img/br-kb-runtime-diagram.png)
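
To make the two stages concrete, here is a minimal, self-contained sketch in plain Python. It is only an illustration: the `embed` function is a toy bag-of-words stand-in for a real embedding model such as Titan, the in-memory list stands in for the vector database, and all function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks):
    """Pre-processing stage: embed every chunk and store (vector, chunk) pairs."""
    return [(embed(chunk), chunk) for chunk in chunks]

def retrieve(index, query, top_k=2):
    """Runtime stage: embed the query and return the top_k most similar chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

# Made-up chunks for illustration only.
chunks = [
    "Sherlock Holmes lives at Baker Street in London.",
    "Dr. Watson narrates most of the stories.",
    "The weather was foggy that morning.",
]
question = "Where does Sherlock Holmes live?"
index = build_index(chunks)
top_chunks = retrieve(index, question, top_k=1)

# The retrieved chunk is then placed into the prompt sent to the LLM.
prompt = f"Answer using only this context:\n{top_chunks[0]}\n\nQuestion: {question}"
```

Knowledge Bases for Bedrock performs the equivalent of `build_index` during data ingestion and the equivalent of `retrieve` plus prompt construction at query time.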


Define the variables used to create a knowledge base in Bedrock.

In [None]:
random_id = str(uuid.uuid4().hex)[:5]
index_name = f"bedrock-kb-{random_id}"
knowledge_base_name = f"bedrock-kb-{random_id}"

## Steps for creating a Knowledge Base for Bedrock application
Creating a knowledge base involves the following steps:

* Create an OpenSearch Serverless collection as the vector database.
* Create an index in the collection to hold all the documents.
* Create the required IAM service roles for Bedrock to integrate with the collection.
* Create a Knowledge Bases for Bedrock application.
* Create a data ingestion job to write the embeddings into the OpenSearch Serverless collection.

Luckily, all the steps outlined above are provided as a helper function so you don't have to do this yourself!

**Note:** The knowledge base creation step below takes about 5 minutes. Please be patient and let it finish before stopping any processes.

Create the knowledge base using the helper function and the variables restored from the lab setup.

In [None]:
knowledge_base_id = bedrock_helper.create_knowledge_base(knowledge_base_name=knowledge_base_name, 
                               bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn, 
                               embedding_model_arn=embedding_model_arn, 
                               embedding_dim=embedding_dim, 
                               s3_bucket=default_bucket, 
                               s3_prefix=s3_prefix, 
                               oss_host=vector_host, 
                               oss_collection_id=vector_collection_id, 
                               oss_collection_arn=vector_collection_arn, 
                               index_name=index_name, 
                               region=region, 
                               credentials=boto3_credentials)

## Auto Generate Questions From the Documents Using LLM
We have prepared a list of questions and answers from the book, which we'll use as the basis for the evaluation. These questions and answers were generated by an LLM in Bedrock.

Important: As with many LLM applications, it's important to keep a human in the loop to validate the Q&A pairs generated by the LLM and ensure they are correct and accurate. For our experiment, all the questions and answers have been validated by a human, so we can use them as the ground truth for a fair model and RAG evaluation process.

The Q&A data serves as the foundation for the RAG evaluation based on the approaches that we are going to implement. We'll define the generated answers from this step as ground truth data.

Next, for each generated question, we'll use the Bedrock Agent SDK to retrieve the contexts that are most relevant to the question and generate an answer. This data serves as the source data for the standard RAG approach.

We also share a notebook that walks through the process of using an LLM to generate questions and answers [here](00-qa_generator.ipynb).

The QA dataset is formatted as JSON, defined as follows:

```
{
  "question" : [ ... ],
  "ground_truth" : [ ... ]
}
```

In [None]:
with open("data/qa_samples.json", "r") as f:
    data = f.read()
    data_samples = json.loads(data)

## Invoke Knowledge Bases For Bedrock 
In the following step, we'll iterate over the generated questions from the Q&A data and use each one as a query to invoke Knowledge Bases for Bedrock through the relevant [boto3 SDK](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html) API call. The knowledge base returns the generated response along with the corresponding contexts.

In [None]:
contexts = []
model_responses = []
for q in data_samples['question']:
    local_contexts = []
    response = agent_runtime.retrieve_and_generate(
        input={
            'text': q
        },
        retrieveAndGenerateConfiguration={
            'knowledgeBaseConfiguration': {
                'generationConfiguration': {
                    'inferenceConfig':  {
                        'textInferenceConfig': {
                            'stopSequences': [
                                'Human:',
                            ],
                            'temperature': 0.1,
                            'topP': 0.9
                        }
                    }
                },
                'knowledgeBaseId': knowledge_base_id,
                'modelArn':  model_arn
            },
            'type': 'KNOWLEDGE_BASE'
        }
    )
    # Use the full generated answer; each citation only covers one part of it.
    model_response = response['output']['text']
    model_responses.append(model_response)
    # Collect the retrieved passages across all citations for this question.
    for citation in response['citations']:
        for retrievedReference in citation['retrievedReferences']:
            context = retrievedReference['content']['text']
            local_contexts.append(context)
    contexts.append(local_contexts)
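
The response parsing can also be factored into a small helper that tolerates missing keys and empty citation lists. The following is a sketch only: the helper name `extract_answer_and_contexts` is hypothetical, and the sample response is hand-written with made-up values to mimic the shape of a `retrieve_and_generate` response, so it can be exercised without calling Bedrock.

```python
def extract_answer_and_contexts(response):
    """Return (answer_text, contexts) from a retrieve_and_generate-style response dict."""
    answer = response.get("output", {}).get("text", "")
    contexts = []
    for citation in response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            text = reference.get("content", {}).get("text")
            if text and text not in contexts:  # skip empty and duplicate passages
                contexts.append(text)
    return answer, contexts

# Hand-written sample shaped like the API response; the values are made up.
sample_response = {
    "output": {"text": "Holmes lives at Baker Street."},
    "citations": [
        {
            "generatedResponsePart": {"textResponsePart": {"text": "Holmes lives at Baker Street."}},
            "retrievedReferences": [{"content": {"text": "...rooms in Baker Street..."}}],
        },
        {
            "generatedResponsePart": {"textResponsePart": {"text": ""}},
            "retrievedReferences": [],
        },
    ],
}
sample_answer, sample_contexts = extract_answer_and_contexts(sample_response)
```

Dropping duplicate passages is a design choice of this sketch; keep them if you want the contexts to reflect exactly what each citation returned.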

## Combining Dataset For RAGAS Evaluation
Now that we have the model responses and the contexts, we'll combine this information with the QA dataset to build the dataset required by the RAGAS evaluation framework:

1. Questions 
2. Context for the RAG
3. Response from the model
4. Ground truths data

In [None]:
data_samples['contexts'] = contexts
data_samples['answer'] = model_responses

In [None]:
ds = Dataset.from_dict(data_samples)
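
Before building the dataset, it can help to confirm that the four columns line up row by row. The check below is a minimal sketch: the function name `validate_samples` and the toy sample are hypothetical, while the column names are the ones used in this notebook.

```python
REQUIRED_COLUMNS = ("question", "ground_truth", "contexts", "answer")

def validate_samples(samples):
    """Raise ValueError if a required column is missing or the columns are misaligned."""
    missing = [name for name in REQUIRED_COLUMNS if name not in samples]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {name: len(samples[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    for i, ctx in enumerate(samples["contexts"]):
        if not isinstance(ctx, list):
            raise ValueError(f"contexts[{i}] should be a list of strings")
    return lengths["question"]

# Toy data for illustration only.
toy_samples = {
    "question": ["Who narrates the stories?", "Where does Holmes live?"],
    "ground_truth": ["Dr. Watson.", "Baker Street."],
    "contexts": [["Watson narrates."], ["Rooms in Baker Street."]],
    "answer": ["Dr. Watson narrates them.", "He lives at Baker Street."],
}
n_rows = validate_samples(toy_samples)
```

In the notebook, the same check could be run as `validate_samples(data_samples)` just before calling `Dataset.from_dict`.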

## RAG Evaluation using RAGAS Framework
To evaluate the effectiveness of RAG, we'll use a framework called [RAGAS](https://github.com/explodinggradients/ragas).
The framework provides a suite of metrics which can be used to evaluate different dimensions. 

At a high level, RAGAS evaluation focuses on the following key components:

![RAGAS evaluation](img/ragas-evaluations.png)

Here is a summary of some of the evaluation components supported in RAGAS:

### Faithfulness
This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the range 0 to 1, and higher is better. The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims is first identified from the generated answer. Each of these claims is then cross-checked against the given context to determine whether it can be inferred from that context.
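
The final scoring step described above reduces to a simple ratio. The sketch below assumes the claim extraction and the per-claim verdicts have already been produced by an LLM; the function name and the NaN return for an answer with no claims are choices of this sketch, not necessarily what RAGAS does internally.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims that can be inferred from the retrieved context.

    claim_supported: list of booleans, one verdict per claim in the generated answer.
    """
    if not claim_supported:
        return float("nan")  # assumption: no claims means the score is undefined
    return sum(claim_supported) / len(claim_supported)

# Example: three claims were extracted, two of them are supported by the context.
verdicts = [True, True, False]
score = faithfulness_score(verdicts)
```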

### Answer Correctness
The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score. Users also have the option to employ a ‘threshold’ value to round the resulting score to binary, if desired.
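
As a rough sketch of how the two aspects can be combined, the snippet below computes a factual F1 score from counts of true positive, false positive and false negative statements, and blends it with a semantic similarity value. The function names are hypothetical, the default weights of 0.75 and 0.25 are an assumption about the library default, and the threshold handling simply follows the description above. Check the RAGAS documentation for the exact behavior of your version.

```python
def factual_f1(tp, fp, fn):
    """F1 over statements: tp = in both answer and ground truth,
    fp = only in the answer, fn = only in the ground truth."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0

def answer_correctness_score(tp, fp, fn, semantic_similarity,
                             weights=(0.75, 0.25), threshold=None):
    """Weighted average of factual F1 and semantic similarity (weights are assumed)."""
    factual_weight, semantic_weight = weights
    score = (
        factual_weight * factual_f1(tp, fp, fn)
        + semantic_weight * semantic_similarity
    ) / (factual_weight + semantic_weight)
    if threshold is not None:
        return 1.0 if score >= threshold else 0.0  # optional rounding to binary
    return score

# Example: 2 matching statements, 1 extra, 1 missing, similarity of 0.9.
correctness = answer_correctness_score(tp=2, fp=1, fn=1, semantic_similarity=0.9)
```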

### Answer Relevancy
The evaluation metric, Answer Relevancy, focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information. This metric is computed using the question and the answer, with values ranging between 0 and 1, where higher scores indicate better relevancy.

### Answer Similarity
The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.
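
Semantic resemblance of this kind is typically computed as the cosine similarity between the embedding of the answer and the embedding of the ground truth. The sketch below shows the calculation on short made-up vectors; real embeddings from the Titan model used in this lab would have 1024 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings" for illustration only.
answer_vec = [0.2, 0.8, 0.1]
truth_vec = [0.25, 0.75, 0.0]
similarity = cosine_similarity(answer_vec, truth_vec)
```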

### Context Precision
Context Precision is a metric that evaluates whether the ground-truth relevant items present in the contexts are ranked near the top. Ideally, all the relevant chunks appear at the top ranks. This metric is computed using the question, ground_truth and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
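
One common way to express this rank sensitivity is to average precision@k over the ranks that hold a relevant chunk. The sketch below assumes the per-chunk relevance verdicts (1 for relevant, 0 for not) have already been produced by an LLM judge; the function name is hypothetical, and the exact formula in your RAGAS version may differ.

```python
def context_precision_score(relevance):
    """Mean of precision@k taken at each rank k that holds a relevant chunk.

    relevance: list of 0/1 verdicts for the retrieved chunks, in rank order.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted_sum / total_relevant

relevant_first = context_precision_score([1, 1, 0])
relevant_last = context_precision_score([0, 1, 1])
```

The same two relevant chunks score lower when they appear below an irrelevant one, which is the behavior the metric is meant to capture.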


### Context Recall 
Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.

To estimate context recall from the ground truth answer, each claim in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the ground truth answer should be attributable to the retrieved context.
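
Written as a formula, the description above amounts to the following ratio (a sketch of the idea; the notation is ours, not taken from the RAGAS documentation):

```latex
\text{context recall} =
  \frac{\left|\text{ground truth claims attributable to the retrieved context}\right|}
       {\left|\text{claims in the ground truth}\right|}
```

For example, if the ground truth answer contains 4 claims and 3 of them can be attributed to the retrieved context, the context recall is 3/4 = 0.75.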


In our example, we'll explore the following evaluation components:

* Faithfulness
* Answer Correctness
* Answer Similarity
* Answer Relevancy
* Context Precision
* Context Recall

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_correctness,
    answer_similarity,
    answer_relevancy,
    context_precision, 
    context_recall
)

In [None]:
import nest_asyncio 
nest_asyncio.apply() # Based on the Ragas documentation, this is only needed when running in a Jupyter notebook.

In [None]:
metrics = [
    faithfulness,
    answer_correctness,
    answer_similarity,
    answer_relevancy,
    context_precision, 
    context_recall
]

## Setup Bedrock model configurations
RAGAS integrates with Bedrock foundation models and embedding models. In this lab, we'll use a more powerful foundation model, [Claude 3 Sonnet](https://aws.amazon.com/bedrock/claude/), to help evaluate the RAG application. Since RAGAS works directly with [langchain](https://www.langchain.com/), an open source framework for building LLM applications, we'll define the Bedrock models via the langchain framework and pass these model objects to RAGAS for evaluation.

In [None]:
config = {
    "region_name": region,  # E.g. "us-east-1"
    "model_id": evaluation_model_id,  # E.g "anthropic.claude-3-haiku-20240307-v1:0"
    "model_kwargs": {"temperature": 0.9},
}

bedrock_model = ChatBedrock(
    region_name=config["region_name"],
    model_id=config["model_id"],
    model_kwargs=config["model_kwargs"],
)

# init the embeddings
bedrock_embeddings = BedrockEmbeddings(
    region_name=config["region_name"],
    model_id = embedding_model_id
)

Invoke RAGAS evaluation

In [None]:
evaluation_result = evaluate(
    ds,
    metrics=metrics,
    llm=bedrock_model,
    embeddings=bedrock_embeddings,
    raise_exceptions=False
)

Show the evaluation results for each entry as a pandas DataFrame.

In [None]:
df = evaluation_result.to_pandas()
df
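
When reading the per-entry scores, keep in mind that some rows may have no score for a metric. The sketch below assumes that, with `raise_exceptions=False`, a failed evaluation shows up as NaN in that row (an assumption worth verifying against your RAGAS version). It averages a column of scores while skipping such rows and reports how many rows were actually scored. The function name and the sample values are hypothetical.

```python
import math

def mean_ignoring_nan(scores):
    """Return (mean, count) over the scores that are neither None nor NaN."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    if not valid:
        return float("nan"), 0
    return sum(valid) / len(valid), len(valid)

# Made-up per-row scores; the second row failed to evaluate.
row_scores = [0.9, float("nan"), 0.7]
average, n_scored = mean_ignoring_nan(row_scores)
```

With the DataFrame above, a column could be passed as `mean_ignoring_nan(df["faithfulness"].tolist())`. Reporting the count next to the mean makes it visible when an average rests on only a few rows.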

## RAG Evaluation Summary
Finally, we'll use RAGAS evaluation to help provide an average score for each evaluation metric.

In [None]:
print(f"""A summarized report for standard RAG approach based on RAGAS evaluation: 
faithfulness: {evaluation_result['faithfulness']}
answer_correctness: {evaluation_result['answer_correctness']}
answer_similarity: {evaluation_result['answer_similarity']}
answer_relevancy: {evaluation_result['answer_relevancy']}
context_precision: {evaluation_result['context_precision']}
context_recall: {evaluation_result['context_recall']}
""")

# RAG Evaluation Analysis and Recommendations
Proper RAG evaluation helps identify potential weaknesses in both the retrieval and generation components, allowing for targeted improvements. It also helps in assessing the model's ability to provide up-to-date, factual information, which is especially important in rapidly evolving fields or for time-sensitive applications. Furthermore, RAG evaluation plays a vital role in mitigating hallucinations or false information generation, a common concern with large language models. By rigorously testing and evaluating RAG systems, developers can enhance the reliability, accuracy, and trustworthiness of the generative AI application.

As shown in the diagram above, the RAGAS framework focuses on evaluating the Retrieval and Generation workflows. Let's look at each in more detail, based on the metrics captured in the evaluation process:

## Retrieval Evaluation Metrics
1. **Context Precision** is a metric that evaluates whether the ground-truth relevant items present in the contexts are ranked near the top. A high context precision suggests that the retrieval step returns relevant contexts and ranks them well. A low context precision could indicate that the quality of the indexed data is not good enough to provide relevant context. One recommendation is to revisit the chunking strategy and use a different chunking approach for the documents. Another suggestion is to consider a more effective embedding model to better capture the semantics of the documents. Please refer to the [documentation](https://docs.ragas.io/en/latest/concepts/metrics/context_precision.html) for more information about context precision.
   
2. **Context Recall** measures the extent to which the retrieved context aligns with the ground truth. A high context recall suggests the retrieved contexts accurately reflect the ground truth data. A low context recall indicates the retrieved context might not accurately reflect the ground truth information. One recommendation is to revisit the chunking strategy for parsing the documents. Another suggestion is to revisit the documents to ensure the information in the ground truth is well represented within them. If these do not address the issue, consider a more effective embedding model to better capture the semantics of the documents. Please refer to the [documentation](https://docs.ragas.io/en/latest/concepts/metrics/context_recall.html) for more information about context recall.

## Generation Evaluation Metrics
1. **Faithfulness** measures the factual consistency of the generated answer against the given context. In other words, it indicates how much the model response hallucinates beyond the retrieved context. There are two considerations when analyzing the faithfulness metric. First, determine whether the retrieved contexts are good enough to provide the evidence the LLM needs to generate an answer, using the retrieval metrics such as context recall or context precision. Second, if the retrieval metrics are high, a low faithfulness score indicates a weakness in the LLM's ability to generate answers from the given context. Please refer to the [documentation](https://docs.ragas.io/en/latest/concepts/metrics/faithfulness.html) for more information about the faithfulness metric.

2. **Answer Correctness** measures the accuracy of the generated answer against the ground truth data. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. A low answer correctness score could indicate that the LLM is not capable of providing an answer that completely satisfies the question. This metric can be influenced by the quality of the retrieved contexts, so it should be considered along with the retrieval metrics to provide a complete evaluation. Please refer to this [documentation](https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html) for more information about answer correctness.

3. **Answer Similarity** measures the semantic resemblance between the generated answer and the ground truth. It uses cosine similarity to calculate the similarity score between the ground truth and the answer. A high similarity score indicates a strong alignment between the generated answer and the corresponding ground truth data. A low similarity score could indicate incompleteness in the generated answer relative to the ground truth. This metric can be influenced by the quality of the retrieved contexts. If the context precision or context recall is low, there is a high chance the answer similarity will also be low. In that case, the problem probably lies in the retrieval process. Additionally, if the faithfulness score is low, the answer similarity will also be affected. In this situation, addressing the faithfulness score should help improve answer similarity. Please refer to this [documentation](https://docs.ragas.io/en/latest/concepts/metrics/semantic_similarity.html) for more information about answer similarity.
   
4. **Answer Relevancy** focuses on assessing how relevant the generated answer is to the question. A lower score is assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. Similar to answer similarity, this metric should be analyzed together with the retrieval metrics. Keep in mind that since this metric does not rely on the ground truth data, it can produce a high score even when the generated answer is not completely accurate. The recommendation is to use it together with the **answer similarity** score to gain better insight into the effectiveness of the RAG workflow. Please refer to this [documentation](https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html) for more information about answer relevancy.
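
The reasoning in the two lists above can be summarized as a small triage rule: look at the retrieval metrics first, and only attribute a problem to generation when retrieval looks healthy. The helper below is a heuristic sketch. The function name, the 0.7 cutoff and the sample scores are all hypothetical, and a sensible cutoff depends on your data and requirements.

```python
RETRIEVAL_METRICS = ("context_precision", "context_recall")
GENERATION_METRICS = (
    "faithfulness",
    "answer_correctness",
    "answer_similarity",
    "answer_relevancy",
)

def triage(scores, threshold=0.7):
    """Return (stage, low_metrics): which stage to investigate first and why.

    scores: dict mapping metric name to its average score.
    threshold: hypothetical cutoff below which a score counts as low.
    """
    low_retrieval = [m for m in RETRIEVAL_METRICS if scores[m] < threshold]
    if low_retrieval:
        # Weak retrieval also drags down the answer metrics, so fix it first.
        return "retrieval", low_retrieval
    low_generation = [m for m in GENERATION_METRICS if scores[m] < threshold]
    if low_generation:
        return "generation", low_generation
    return "ok", []

# Made-up averages: retrieval looks healthy, but answers are poorly grounded.
example_scores = {
    "context_precision": 0.9,
    "context_recall": 0.85,
    "faithfulness": 0.5,
    "answer_correctness": 0.8,
    "answer_similarity": 0.8,
    "answer_relevancy": 0.9,
}
stage, low_metrics = triage(example_scores)
```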

## Conclusion
In this notebook, we demonstrated how to evaluate a RAG application using Amazon Bedrock and the RAGAS framework. We started by uploading the sample texts to an S3 bucket as the source for the vector embeddings. After the data was uploaded to S3, we created a Knowledge Bases for Bedrock application and integrated it with an OpenSearch Serverless collection. We then ran a data ingestion job using the Bedrock Agent SDK to create the vector embeddings for the data on S3 and persist the vectors into the OpenSearch Serverless collection.

To perform the RAG evaluation, we used the open source framework RAGAS, focusing on faithfulness, answer correctness, answer similarity, answer relevancy, context precision and context recall.

## Clean up
If you are done with the experiment, you can delete the resources used in this notebook by running the following cell.

In [None]:
response = bedrock_agent.delete_knowledge_base(
    knowledgeBaseId=knowledge_base_id
)