# Overview
The purpose of this notebook is to demonstrate how to evaluate a basic RAG system. Overall there are many factors that can impact its performance. That is why, it is crucial to evaluate each component indvidually as well as to validate the entire system.
Typically, a RAG system evaluation covers a wide range of componenents and configurations such as chunk size & strategy, embedding model, retrieval strategy, rerank model, LLM-As-A-Judge prompt, RAG prompt, and the LLM used to generate the final response.

Given that this is an abridged version, we focus on how to evaluate the information retrieval task as well as how to validate the entire system end-to-end. For additional evaluations that cover embeddings, reranking, and LLM-As-A-Judge and RAG prompt engineering, please refer to the deep dive workshop covered in the [GenAI System Evaluation repository](https://github.com/aws-samples/genai-system-evaluation/tree/main).


**Why do we need RAG evaluations?**

Unlike for LLMs or embedding models, there are no leaderboards for an entire RAG system or its components. This can make it difficult to assess with which components and configuration to start with, or what to optimize when working with an existing RAG system. And to make matters worse, there are many factors that can impact the performance of a RAG system and its components. Therefore, it is crucial to have a systematic evaluation approach.

Without it, a change in one part of the system, such as the chunk size that determines how a source text is stored in a knowledge base, could have an unintended impact on other parts of the system that could go unnoticed.

And even in cases where we have benchmarks or leaderboards like for LLMs or embedding models, it is still important to understand “What” we’re using the model for. For example for embedding models, the most popular public benchmark is the Massive Text Embedding Benchmark or MTEB. HuggingFace maintains a leaderboard to compare general purpose embedding models against each other to see how they stack up against a wide range of tasks. 
This is a decent starting place, but you have to ask yourself, how well does this dataset compliment the task you really care about. If you are creating a RAG solution for Lawyers, then you are much more interested in how well the embedding model works for comparing legal text vs. how well it works for medical text. This is why it’s important to build out your own evaluation. A model that might not rank high on a general-purpose benchmark, could rank very high on your specific use case. If none of them work very well, then you can make a case for fine tuning an existing model on your data.

**What metrics should you care about?**

Which metrics you care about, depends on which part of the RAG system you are evaluating. 
For the information retrieval task you typically look at metrics like recall@k and precision@k.
Whereas for the end-to-end evaluation besides human evaluation, you typically use an LLM-As-A-Judge technique with evaluation criteria such as context utilization, completeness, conciseness, context relevancy, and clarity.

**How to evaluate**

To perform the information retrieval evaluation, you need to set up a retrieval task. Generate vector representations of items (documents or chunks) in a shared semantic space and perform a K-nearest-neighbor search on them using a similarity measure (e.g. cosine-similarity). This gives you the top-k retrieved item for each query.
Then you need a set of relevance judgments that indicate which documents are relevant to each query. These are typically created by human annotators or derived from click data in production systems.
For each query, you then count the number of relevant items in the top-k retrieved results. Calculate the precision using (number of relevant documents / k). Average the precision values across all queries. And then you apply these same techniques to other metrics like recall, NDCG, or MAP for a more comprehensive information retrieval evaluation.

To performe an end-to-end evaluation, you start with defining a evaluation prompt that incorporates your evaluation criteria, and then once you calibrated and aligned the LLM-As-A-Judge prompt with human preferences, you can use the LLM-As-A-Judge technique to create numerical scores for each of your evaluation criteria.

## How do you create relevance judgements?
This is a pretty manual process. For this example, we curated a dataset by taking large chunks of the Opensearch documentation into an LLM and asked it to come up with a couple example questions about the context. 
To enable experimentation, we corelate answers to 1 to 3 pages. By doing this, you can tweak the chunking strategy, but the relative file paths will stay constant so you don't have to redo your validation dataset every time you make a change to the chunks. 


# What will you do in this notebook? 
* You start with a basic sentence splitting chunking strategy, create embeddings for them, and then store them in a vector store (ChromaDB).
* Then you use a pre-created, curated evaluation dataset to run a information retrieval task experiment and review the results based on metrics such as recall@k and precision@k.
* Assuming that you've validated chunking strategy, embedding models, rerank models, LLM-As-A-Judge prompt, and the RAG prompt itself, you then move forward with validating the entire system through an end-to-end evaluation with the LLM-As-A-Judge technique.

**Lets get started!**

# Initialize clients and libraries

In [1]:
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, BotoCoreError

from pydantic import BaseModel
from typing import List, Dict
from abc import ABC, abstractmethod

import chromadb
from chromadb.config import Settings
from chromadb import Documents, EmbeddingFunction, Embeddings

from typing import cast, Dict, Any, List
from concurrent.futures import ThreadPoolExecutor, as_completed
import time
import random
import os
import re
import json
from functools import wraps
import pandas as pd
import numpy as np

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import Node
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline

from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type

# Initialize Chroma client from our persisted store
chroma_client = chromadb.PersistentClient(path="../data/chroma")

# Also initialize the bedrock client so we can call models
config = Config(
   retries = {
      'max_attempts': 10,
      'mode': 'standard'
   }
)

session = boto3.Session(profile_name='team')
# Get the region from the session, default to us-east-1 if not set
region = session.region_name or 'us-east-1'

bedrock = session.client(
        service_name="bedrock-runtime",
        region_name=region,
        config=config
)

print("Chroma and Bedrock clients initialized for region:", region)

Chroma and Bedrock clients initialized for region: us-east-1


# Start running experiments!

## Experiment 1
In this first experiment we're going to set up a retrieval task using ChromaDB as vector store, Titan Text V2 as embedding model, and use a big chunk size.

## 1.1 Create chunks
In this step we first define a custom wrapper class around LlamaIndex to decouple the Chroma collection from LlamaIndex.
And then we use this wrapper to split up documents from the OpenSearch documentation in the input dir into ~2046 chunk sizes with the overlap (or smaller if the file isn't that big). We should get around ~10k chunks.

In [2]:
# Create a class to use instead of LlamaIndex Nodes. This way we decouple our chroma collections from LlamaIndexes
class RAGChunk(BaseModel):
    id_: str
    text: str
    metadata: Dict[str, Any] = {}


class SentenceSplitterChunkingStrategy:
    def __init__(self, input_dir: str, chunk_size: int = 256, chunk_overlap: int = 128):
        self.input_dir = input_dir
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.pipeline = self._create_pipeline()

        # Helper to get regex pattern for normalizing relative file paths.
        self.relative_path_pattern = rf"{re.escape(input_dir)}(/.*)"

    def _extract_relative_path(self, full_path):
        # Get Regex pattern
        pattern = self.relative_path_pattern
        match = re.search(pattern, full_path)
        if match:
            return match.group(1).lstrip('/')
        return None

    def _create_pipeline(self) -> IngestionPipeline:
        transformations = [
            SentenceSplitter(chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap),
        ]
        return IngestionPipeline(transformations=transformations)

    def load_documents(self) -> List[Document]:
        # If you're using a different type of file besides md, you'll want to change this. 
        return SimpleDirectoryReader(
            input_dir=self.input_dir, 
            recursive=True,
            required_exts=['.md']
        ).load_data()

    def to_ragchunks(self, nodes: List[Node]) -> List[RAGChunk]:
        return [
            RAGChunk(
                id_=node.node_id,
                text=node.text,
                metadata={
                    **node.metadata,
                    'relative_path': self._extract_relative_path(node.metadata['file_path'])
                }
            )
            for node in nodes
        ]

    def process(self) -> List[RAGChunk]:
        documents = self.load_documents()
        nodes = self.pipeline.run(documents=documents)
        rag_chunks = self.to_ragchunks(nodes)
        
        print(f"Processing complete. Created {len(rag_chunks)} chunks.")
        return rag_chunks
    

chunking_strategy = SentenceSplitterChunkingStrategy(
    input_dir="../data/opensearch-docs/documentation-website",
    chunk_size=1024,
    chunk_overlap=128
)

# Get the nodes from the chunker.
chunks: RAGChunk = chunking_strategy.process()

Processing complete. Created 10678 chunks.


### 1.2 Setup retrieval task
The next step is to set up a retrieval task. Here we use ChromaDB as the vector database and create a wrapper class for the retrieval task. 

The embedding function is used during the creation of the collection and is also automatically applied to queries during the actual retrieval. 

In [3]:

class RetrievalResult(BaseModel):
    id: str
    document: str
    embedding: List[float]
    distance: float
    metadata: Dict = {}

# Base retrieval class. Can be reused if you decide to implement a different retrieval class.
class BaseRetrievalTask(ABC):
    @abstractmethod
    def retrieve(self, query_text: str, n_results: int) -> List[RetrievalResult]:
        """
        Retrieve documents based on the given query.

        Args:
            query (str): The query string to search for.

        Returns:
            List[RetrievalResult]: A list of RetrievalResult objects that are relevant to the query.
        """
        pass



# Example of a concrete implementation
class ChromaDBRetrievalTask(BaseRetrievalTask):

    def __init__(self, chroma_client, collection_name: str, embedding_function, chunks: List[RAGChunk]):
        self.client = chroma_client
        self.collection_name = collection_name
        self.embedding_function = embedding_function
        self.chunks = chunks

        # Create the collection
        self.collection = self._create_collection()

    def _create_collection(self):
        return self.client.get_or_create_collection(
            name=self.collection_name,
            embedding_function=self.embedding_function
        )

    def add_chunks_to_collection(self, batch_size: int = 20, num_workers: int = 10):
        batches = [self.chunks[i:i + batch_size] for i in range(0, len(self.chunks), batch_size)]
        
        with ThreadPoolExecutor(max_workers=num_workers) as executor:
            futures = [executor.submit(self._add_batch, batch) for batch in batches]
            for future in as_completed(futures):
                future.result()  # This will raise an exception if one occurred during the execution
        print('Finished Ingesting Chunks Into Collection')

    def _add_batch(self, batch: List[RAGChunk]):
        self.collection.add(
            ids=[chunk.id_ for chunk in batch],
            documents=[chunk.text for chunk in batch],
            metadatas=[chunk.metadata for chunk in batch]
        )

    def retrieve(self, query_text: str, n_results: int = 5) -> List[RetrievalResult]:
        # Query the collection
        results = self.collection.query(
            query_texts=[query_text],
            n_results=n_results,
            include=['embeddings', 'documents', 'metadatas', 'distances']
        )

        # Transform the results into RetrievalResult objects
        retrieval_results = []
        for i in range(len(results['ids'][0])):
            retrieval_results.append(RetrievalResult(
                id=results['ids'][0][i],
                document=results['documents'][0][i],
                embedding=results['embeddings'][0][i],
                distance=results['distances'][0][i],
                metadata=results['metadatas'][0][i] if results['metadatas'][0] else {}
            ))

        return retrieval_results

In [4]:
# example of a custom embedding function with retry logic for Bedrock
class CustomBedrockEmbeddingFunction(EmbeddingFunction[Documents]):
    """
    A custom ChromaDB embedding function for Amazon Bedrock with robust throttling handling.

    This function is designed to replace the standard AmazonBedrockEmbeddingFunction
    by implementing a retry mechanism with exponential backoff and jitter using
    the 'tenacity' library, ensuring stability during high-volume requests.
    """
    def __init__(
        self,
        model_id: str = "amazon.titan-embed-text-v1",
        region_name: str = "us-east-1",
        client: Any = None,
    ):
        """
        Initializes the embedding function.

        Args:
            model_id: The identifier for the Amazon Bedrock embedding model.
            region_name: The AWS region to use.
            client: An optional pre-configured boto3 bedrock-runtime client.
        """
        self.model_id = model_id
        if client:
            self.client = client
        else:
            print("Missing Bedrock client - please provide one!")

    @retry(
        wait=wait_exponential(multiplier=1, min=1, max=10),
        stop=stop_after_attempt(10),
        retry=retry_if_exception_type((ClientError)),
        before_sleep=lambda retry_state: print(
            f"ThrottlingException encountered. Retrying attempt "
            f"{retry_state.attempt_number} in {retry_state.seconds_since_start:.2f}s..."
        ),
    )
    def _get_embeddings_with_retry(self, input_text: str) -> List[float]:
        """
        Internal method to get an embedding with retries on throttling errors.
        """
        body = json.dumps({"inputText": input_text})
        
        response = self.client.invoke_model(
            body=body,
            modelId=self.model_id,
            accept="*/*",
            contentType="application/json",
        )
        response_body = json.loads(response.get("body").read())
        
        return cast(List[float], response_body.get("embedding"))

    def __call__(self, input: Documents) -> Embeddings:
        """
        Processes a list of documents and returns their embeddings.

        Args:
            input: A list of text documents to embed.

        Returns:
            A list of embeddings for the input documents.
        """
        embeddings = []
        for text in input:
            embedding = self._get_embeddings_with_retry(text)
            embeddings.append(embedding)
        return embeddings

### 1.3 Populate the vector store
Next we define the embedding function and populate the vector database with vectors.

In [5]:

# Define some experiment variables
EMBEDDING_MODEL_ID: str = 'amazon.titan-embed-text-v2:0'
EXPERIMENT_1_COLLECTION_NAME: str = 'experiment_1_collection'

# Initialize the custom embedding function
bedrock_ef = CustomBedrockEmbeddingFunction(model_id=EMBEDDING_MODEL_ID, client=bedrock)

# Create our retrieval task. All retrieval tasks in this tutorial implement BaseRetrievalTask which has the method retrieve()
# If you'd like to extend this to a different retrieval configuration, all you have to do is create a class that that implements
# this abstract class and the rest is the same
experiment_1_retrieval_task: BaseRetrievalTask = ChromaDBRetrievalTask(
    chroma_client = chroma_client, 
    collection_name = EXPERIMENT_1_COLLECTION_NAME,
    embedding_function = bedrock_ef,
    chunks = chunks
)

# This takes a while to run, therefore we already created the collection for the purpose of this workshop and commented out this line
# experiment_1_retrieval_task.add_chunks_to_collection()

In [6]:
# Lets verify it works!
print(len(experiment_1_retrieval_task.retrieve('What does * do?', n_results=1)) == 1)

True


### 1.4 Define validation dataset
The pre-created validation dataset contains a set of 24 questions users might ask a RAG system designed to answer questions from the OpenSearch documentation.

It has the following structure:

query_text: I'm using version 2.1 of open search and trying to use zstd compression. Why isn't it working?

relevant_doc_ids: "[""_im-plugin/index-codecs.md"", ""_tuning-your-cluster/performance.md""]"


In [7]:
# Load and clean the eval dataset for the information retrieval task
def get_clean_eval_dataset():
    EVAL_PATH = '../data/eval-datasets/1_embeddings_validation.csv'
    eval_df = pd.read_csv(EVAL_PATH)

    # Clean up the DataFrame
    eval_df = eval_df.rename(columns=lambda x: x.strip())  # Remove any leading/trailing whitespace from column names
    eval_df = eval_df.drop(columns=[col for col in eval_df.columns if col.startswith('Unnamed')])  # Remove unnamed columns
    eval_df = eval_df.dropna(how='all')  # Remove rows that are all NaN
    
    # Strip whitespace from string columns
    for col in eval_df.select_dtypes(['object']):
        eval_df[col] = eval_df[col].str.strip()
    
    # Ensure 'relevant_doc_ids' is a string column
    eval_df['relevant_doc_ids'] = eval_df['relevant_doc_ids'].astype(str)

    return eval_df

eval_df = get_clean_eval_dataset()

print(f"Validation dataset loaded with {len(eval_df)} entries.")

Validation dataset loaded with 24 entries.


### 1.5 Define information retrieval metrics
The IRMetricsCalculator class below calculates a series of metrics that are useful for evaluating the information retrieval task in a RAG system. Remember, we are only evaluating the retrieval at this stage, and not yet the sytems/models ability to create an answer from the information retrieval results.

#### Metrics
* precision@k - are all the found chunks correct chunks?
* recall@k - did it find all the correct chunks?
* ndcg@k - how high are all of the relevant items ranked in the returned list?

These individual metrics are the basis for creating an aggregate view of our validation dataset to get a sense for how well it's performing.

In [8]:
#  Helper class for calculating metrics.
class IRMetricsCalculator:
    def __init__(self, df):
        self.df = df

    @staticmethod
    def precision_at_k(relevant, retrieved, k):
        retrieved_k = retrieved[:k]
        return len(set(relevant) & set(retrieved_k)) / k if k > 0 else 0

    @staticmethod
    def recall_at_k(relevant, retrieved, k):
        retrieved_k = retrieved[:k]
        return len(set(relevant) & set(retrieved_k)) / len(relevant) if len(relevant) > 0 else 0

    @staticmethod
    def dcg_at_k(relevant, retrieved, k):
        retrieved_k = retrieved[:k]
        dcg = 0
        for i, item in enumerate(retrieved_k):
            if item in relevant:
                dcg += 1 / np.log2(i + 2)
        return dcg

    @staticmethod
    def ndcg_at_k(relevant, retrieved, k):
        dcg = IRMetricsCalculator.dcg_at_k(relevant, retrieved, k)
        idcg = IRMetricsCalculator.dcg_at_k(relevant, relevant, k)
        return dcg / idcg if idcg > 0 else 0

    @staticmethod
    def parse_json_list(json_string):
        try:
            return json.loads(json_string)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {json_string} with error {e}")
            return []

    def calculate_metrics(self, k_values=[1, 3, 5, 10]):
        for k in k_values:
            self.df[f'precision@{k}'] = self.df.apply(lambda row: self.precision_at_k(
                self.parse_json_list(row['relevant_doc_ids']),
                self.parse_json_list(row['retrieved_doc_ids']), k), axis=1)
            self.df[f'recall@{k}'] = self.df.apply(lambda row: self.recall_at_k(
                self.parse_json_list(row['relevant_doc_ids']),
                self.parse_json_list(row['retrieved_doc_ids']), k), axis=1)
            self.df[f'ndcg@{k}'] = self.df.apply(lambda row: self.ndcg_at_k(
                self.parse_json_list(row['relevant_doc_ids']),
                self.parse_json_list(row['retrieved_doc_ids']), k), axis=1)
        return self.df

### 1.6 Setup task runner
In the step below we setup a task runner that iterates through our dataframe, runs a retrieval task on the input and uses the IRCalculator to generate metrics on the results.

In [9]:
class RetrievalTaskRunner:
    def __init__(self, eval_df: pd.DataFrame, retrieval_task: BaseRetrievalTask):
        self.eval_df = eval_df
        self.retrieval_task = retrieval_task

    def _get_unique_file_paths(self, results: List[RetrievalResult]) -> List[str]:
        # Since Python 3.7, dicts retain insertion order.
        return list(dict.fromkeys(r.metadata['relative_path'] for r in results))
        

    def run(self) -> pd.DataFrame:
        # Make a copy of the dataframe so we don't modify the original.
        df = pd.DataFrame(self.eval_df)
        
        results = []
        for index, row in df.iterrows():
            query: str = row['query_text']
            
            # Run retrieval task
            retrieval_results: List[RetrievalResult] = self.retrieval_task.retrieve(query)
            
            # Extract unique page numbers for comparison with validation dataset.
            ordered_filepaths: List[str] = self._get_unique_file_paths(retrieval_results)

            retrieved_chunks = [ {'relative_path': r.metadata['relative_path'], 'chunk': r.document} for r in retrieval_results ]

            # Create new record
            result = {
                'query_text': query,
                'relevant_doc_ids': row['relevant_doc_ids'],
                'retrieved_doc_ids': json.dumps(ordered_filepaths),
                'retrieved_chunks': json.dumps(retrieved_chunks), # Best way to preserve the chunks
            }
            results.append(result)

        new_dataframe = pd.DataFrame(results)
        # return new_dataframe

        ir_calc: IRMetricsCalculator = IRMetricsCalculator(new_dataframe)
        return ir_calc.calculate_metrics()

### 1.7 Run first experiment
The command below triggers the first experiment and stores the results in a dataframe.

In [10]:
experiment_1_results: pd.DataFrame = RetrievalTaskRunner(eval_df, experiment_1_retrieval_task).run()

### 1.8 Create the information retrieval experiment summary
You can review the results for each individual query, but it doesn't quite tell the whole story. That's why we create a summary view showing Mean Average Precision, Mean Reciprocal Rank (MRR), as well as general averages across all the individual metrics we calculated. This tells us know how well the retrieval task is performing.

In [11]:
# helper class to summarize the experiment results
import pandas as pd
import numpy as np
from typing import List

class ExperimentSummarizer:
    def __init__(self, df):
        self.df = pd.DataFrame(df)
        self.summary_df = None

    @staticmethod
    def calculate_ap(relevant_docs, retrieved_docs):
        relevant_set = set(relevant_docs.split(','))
        retrieved_list = retrieved_docs.split(',')
        relevant_count = 0
        total_precision = 0
        
        for i, doc in enumerate(retrieved_list, 1):
            if doc in relevant_set:
                relevant_count += 1
                total_precision += relevant_count / i
        
        return total_precision / len(relevant_set) if relevant_set else 0

    @staticmethod
    def calculate_reciprocal_rank(relevant_docs, retrieved_docs):
        relevant_set = set(relevant_docs.split(','))
        retrieved_list = retrieved_docs.split(',')
        
        for i, doc in enumerate(retrieved_list, 1):
            if doc in relevant_set:
                return 1 / i
        
        return 0

    def calculate_map(self):
        self.df['AP'] = self.df.apply(lambda row: self.calculate_ap(row['relevant_doc_ids'], row['retrieved_doc_ids']), axis=1)
        return self.df['AP'].mean()

    def calculate_mrr(self):
        self.df['RR'] = self.df.apply(lambda row: self.calculate_reciprocal_rank(row['relevant_doc_ids'], row['retrieved_doc_ids']), axis=1)
        return self.df['RR'].mean()

    def calculate_mean_metrics(self):
        return self.df[[
            'precision@1', 'recall@1', 'ndcg@1',
            'precision@3', 'recall@3', 'ndcg@3',
            'precision@5', 'recall@5', 'ndcg@5'
        ]].mean()

    def calculate_top_k_percentages(self):
        top_1 = (self.df['precision@1'] > 0).mean() * 100
        top_3 = (self.df['precision@3'] > 0).mean() * 100
        top_5 = (self.df['precision@5'] > 0).mean() * 100
        return top_1, top_3, top_5

    def analyze(self):
        map_score = self.calculate_map()
        mrr_score = self.calculate_mrr()
        mean_metrics = self.calculate_mean_metrics()
        top_1, top_3, top_5 = self.calculate_top_k_percentages()

        self.summary_df = pd.DataFrame({
            'Metric': [
                'MAP (Mean Average Precision)',
                'MRR (Mean Reciprocal Rank)',
                'Mean Precision@1', 'Mean Recall@1', 'Mean NDCG@1',
                'Mean Precision@3', 'Mean Recall@3', 'Mean NDCG@3',
                'Mean Precision@5', 'Mean Recall@5', 'Mean NDCG@5',
                '% Queries with Relevant Doc in Top 1',
                '% Queries with Relevant Doc in Top 3',
                '% Queries with Relevant Doc in Top 5'
            ],
            'Value': [
                map_score,
                mrr_score,
                mean_metrics['precision@1'], mean_metrics['recall@1'], mean_metrics['ndcg@1'],
                mean_metrics['precision@3'], mean_metrics['recall@3'], mean_metrics['ndcg@3'],
                mean_metrics['precision@5'], mean_metrics['recall@5'], mean_metrics['ndcg@5'],
                top_1, top_3, top_5
            ]
        })
        return self.summary_df

    def get_summary(self):
        if self.summary_df is None:
            self.analyze()
        return self.summary_df
    
# Lets use the class above to create aggregate metrics to see how well the system performs.
experiment_1_summary = ExperimentSummarizer(experiment_1_results).analyze()
experiment_1_summary

Unnamed: 0,Metric,Value
0,MAP (Mean Average Precision),0.060185
1,MRR (Mean Reciprocal Rank),0.097222
2,Mean Precision@1,0.541667
3,Mean Recall@1,0.423611
4,Mean NDCG@1,0.541667
5,Mean Precision@3,0.291667
6,Mean Recall@3,0.638889
7,Mean NDCG@3,0.584949
8,Mean Precision@5,0.191667
9,Mean Recall@5,0.673611


### 1.9 Takeaways from the information retrieval experiment
The results could be a lot better. 
At this step, you typically assess whether the top k results have relevant data in them. This makes recall@5 arguably the most important metric for this step.

However, you cannot ignore the other metrics, as they inform you about other critical aspects of the information retrieval task's performance. For example, in general you want to limit the amount of context you pass back to the model to save on input token cost and minimize latency. Therefore knowing metrics such as precision@1 and precision@5 are useful metrics, as they give you an idea of how well the embeddings are working on their own at ranking the results. 

**Optional: Experiment with different chunk sizes, embedding models, or rerankers to see if it performs better**

It is important to remember that this is just the starting point, typically you would perform many more experiments with different chunk sizes, chunking strategies, embedding models, or rerankers to see if the information retrieval task results improve as demonstrated in the [deep dive for basic RAG evaluation](https://github.com/aws-samples/genai-system-evaluation/tree/main).

Once the information retrieval task is delivering good results, it is time to evaluate the answer generation as well as the overall end to end performance of your RAG system.

## 2. End-to-end RAG system evaluation


**What metrics should I care about?**

For an E2E system, we care about metrics such as context utilization, completeness, conciseness, context relevancy, and clarity.


**What will you do?**

* Curate a dataset of questions and ground truth answers (we've created one already)
* Review/Create a RAG prompt and grading rubric
* Setup RAG system
* Run the RAG process and the subsequent evaluation process for all of the sets in the validation dataset

### 2.1 Import validation dataset


In [12]:
# Load the eval dataset for the end-to-end evaluation
eval_df = pd.read_csv('../data/eval-datasets/5_e2e_validation.csv')

### 2.2 Setup retrieval task

We will reuse the retrieval task from step 1.3.

In [13]:
# Define some experiment variables
EMBEDDING_MODEL_ID: str = 'amazon.titan-embed-text-v2:0'
EXPERIMENT_1_COLLECTION_NAME: str = 'experiment_1_collection'

# Initialize the custom embedding function
bedrock_ef = CustomBedrockEmbeddingFunction(model_id=EMBEDDING_MODEL_ID, client=bedrock)

# Create our retrieval task. All retrieval tasks in this tutorial implement BaseRetrievalTask which has the method retrieve()
# If you'd like to extend this to a different retrieval configuration, all you have to do is create a class that that implements
# this abstract class and the rest is the same
experiment_1_retrieval_task: BaseRetrievalTask = ChromaDBRetrievalTask(
    chroma_client = chroma_client, 
    collection_name = EXPERIMENT_1_COLLECTION_NAME,
    embedding_function = bedrock_ef,
    chunks = chunks
)

### 2.3 Setup RAG with Bedrock
In the RAGClient, we're making a retrieval call to populate the context. We store it in context for the LLM-As-A-Judge evaluation.

In [14]:
class BaseBedrockClient:
    def __init__(self, bedrock_client, user_prompt: str, system_prompt: str, model_id: str, hyper_params: dict):
        self.client = bedrock_client
        self.user_prompt = user_prompt
        self.system_prompt = system_prompt
        self.model_id = model_id
        self.hyper_params = hyper_params

    def create_chat_payload(self, inputs: dict) -> list[dict]:
        prompt = self.user_prompt.format(**inputs)
        return [{"role": "user", "content": [{"text": prompt}]}]

    def call(self, messages: list[dict]) -> str:
        response = self.client.converse(
            modelId=self.model_id,
            messages=messages,
            inferenceConfig=self.hyper_params,
            system=[{"text": self.system_prompt}]
        )
        return response['output']['message']['content'][0]['text']

    def call_threaded(self, message_lists: List[List[Dict[str, Any]]]) -> List[str]:
        future_to_position = {}
        with ThreadPoolExecutor(max_workers=5) as executor:
            for i, request in enumerate(message_lists):
                future = executor.submit(self.call, request)
                future_to_position[future] = i
            
            responses = [None] * len(message_lists)
            for future in as_completed(future_to_position):
                position = future_to_position[future]
                try:
                    response: str = future.result()
                    responses[position] = response
                except Exception as exc:
                    print(f"Request at position {position} generated an exception: {exc}")
                    responses[position] = None
        return responses

class RAGClient(BaseBedrockClient):
    def __init__(self, bedrock_client, user_prompt: str, system_prompt: str, model_id: str, hyper_params: dict, retrieval_task: BaseRetrievalTask):
        super().__init__(bedrock_client, user_prompt, system_prompt, model_id, hyper_params)
        self.retrieval_task = retrieval_task

    def extract_response(self, llm_output: str) -> str:
        response_match = re.search(r'<response>(.*?)</response>', llm_output, re.DOTALL)
        return response_match.group(1).strip() if response_match else "No response found"

    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()

        message_lists = []
        contexts = []  # Store context as it's passed into the prompt
        # context_lists = [] # Store context for RAGAS evaluation
        for _, row in df.iterrows():
            # # Get passages for context
            passages: List[RetrievalResult] = self.retrieval_task.retrieve(row["query_text"])
            # Combine into single context
            context = "\n\n".join(f"###File name:\n{p.metadata}\n###Passage:\n{p.document}" for p in passages)

            # Store contexts for downstream dependencies
            contexts.append(context)
            # context_lists.append(json.dumps([p.chunk for p in passages]))
            
            # Construct message list using the query text and relevant passages retrieved.
            message_lists.append(self.create_chat_payload({
                "query_text": row["query_text"],
                "context": context
            }))
        
        responses = self.call_threaded(message_lists)

        df['context'] = contexts
        # df['context_chunks'] = context_lists
        df['llm_response'] = [self.extract_response(r) for r in responses]
        return df

class EvaluationClient(BaseBedrockClient):
    def __init__(self, bedrock_client, user_prompt: str, system_prompt: str, model_id: str, hyper_params: dict):
        super().__init__(bedrock_client, user_prompt, system_prompt, model_id, hyper_params)

    def extract_score_and_thinking(self, llm_output: str) -> tuple:
        thinking_match = re.search(r'<thinking>(.*?)</thinking>', llm_output, re.DOTALL)
        score_match = re.search(r'<score>(.*?)</score>', llm_output, re.DOTALL)

        thinking = thinking_match.group(1).strip() if thinking_match else "No thinking found"
        score = float(score_match.group(1)) if score_match else None
        
        return score, thinking

    def evaluate(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        message_lists = [self.create_chat_payload({
            "query_text": row["query_text"],
            "context": row["context"],
            "llm_response": row["llm_response"],
            "ground_truth": row["ground_truth"]
        }) for _, row in df.iterrows()]
        
        responses = self.call_threaded(message_lists)

        llm_scores = []
        llm_thinking = []

        for response in responses:
            if response is not None:
                score, thinking = self.extract_score_and_thinking(response)
                llm_scores.append(score)
                llm_thinking.append(thinking)
            else:
                llm_scores.append(None)
                llm_thinking.append("Error occurred during processing")

        df['grade'] = llm_scores
        df['reasoning'] = llm_thinking
        
        return df

### 2.4 Define RAG prompt
Before we evaluate anything, we need to construct a prompt that can take in context and generate answers.

In [15]:
# System Prompt
RAG_SYSTEM_PROMPT = """You are an advanced AI assistant specialized in Retrieval Augmented Generation (RAG).
Your primary function is to provide accurate, concise, and relevant answers based solely on the given context.
Follow these guidelines strictly:

1. Use only information from the provided context. Do not introduce external knowledge or make assumptions.
2. Ensure your answers are complete, addressing all aspects of the question using available information.
3. Be extremely concise. Use as few words as possible while maintaining clarity and completeness.
4. Maintain 100% accuracy based on the given context. If the context doesn't contain enough information to answer fully, state this clearly.
5. Structure your responses for maximum clarity. Use bullet points or numbered lists when appropriate.
6. If the context contains technical information, explain it in simple terms as if speaking to a non-technical person.
7. Do not apologize or use phrases like "Based on the context provided" or "According to the information given".
8. If asked about something not in the context, simply state "The provided context does not contain information about [topic]."

Your goal is to achieve the highest possible score on context utilization, completeness, conciseness, accuracy, and clarity."""

# User Prompt
RAG_USER_PROMPT = """Answer the following question using only the provided context:

<query>
{query_text}
</query>

<context>
{context}
</context>

Instructions:
1. Read the question and context carefully.
2. Formulate a concise and accurate answer based solely on the given context.
3. Ensure your response is clear and easily understandable to a non-technical person.
4. Do not include any information not present in the context.
5. If the context doesn't contain relevant information, state this clearly and concisely.
6. Place your response in <response></response> tags."""

### 2.5 Define evaluation rubric 
We generate a score from 0-5 (arbitrary) for each of the defined evaluation criteria and we incorporate "ground truth" in the evaluation to assess context relevancy.

In [16]:
# System Prompt
RUBRIC_SYSTEM_PROMPT = """You are an expert judge evaluating Retrieval Augmented Generation (RAG) applications.
Your task is to evaluate given answers based on context and questions using the criteria provided.
Evaluation Criteria (Score either 0 or 1 for each, total score is the sum):
1. Context Utilization: Does the answer use only information provided in the context, without introducing external or fabricated details?
2. Completeness: Does the answer thoroughly address all key elements of the question based on the available context, without significant omissions?
3. Conciseness: Does the answer efficiently use words to address the question and avoid unnecessary redundancy?
4. Context Relevancy: Is the context returned sufficient to provide an answer like the gold standard answer.
5. Clarity: Is the answer easy to understand and follow?
Your role is to provide a fair and thorough evaluation for each criterion, explaining your reasoning clearly."""

# User Prompt
RUBRIC_USER_PROMPT = """Please evaluate the following RAG response:

Question:
<query_text>
{query_text}
</query_text>

Ground Truth Answer
<llm_response>
{ground_truth}
</llm_response>


Generated answer:
<llm_response>
{llm_response}
</llm_response>

Context:
<context>
{context}
</context>

Evaluation Steps:
1. Carefully read the provided context, question, and answer.
2. For each evaluation criterion, assign a score of either 0 or 1:
   - Context Utilization
   - Completeness
   - Conciseness
   - Context Relevancy
   - Clarity
3. Provide a clear explanation for each score, referencing specific aspects of the response.
4. Calculate the total score by adding up the points awarded (minimum 0, maximum 5).
5. Present your evaluation inside <thinking></thinking> tags.
6. Include individual criterion scores (0 or 1) in the thinking tags and the total score inside <score></score> tags.
7. Ensure your response is valid XML and provides a comprehensive evaluation.
8. Use the ground truth to evaluate whether the information returned was not relevant to answer the question fully. If not, 

Example Output Format:
<thinking>
Context Utilization: 1 - The answer strictly uses information from the context without introducing external details.
Completeness: 1 - The response covers all key elements of the question based on the available context.
Conciseness: 1 - The answer is helpful and doesn't repeat the same information more than once.
Context Relevancy: 0 - The context was not relevant to the question.
Clarity: 1 - The response is clear and easy to follow.
</thinking>
<score>4</score>

Please provide your detailed evaluation."""

In [17]:
# Test different LLMs by changing the model ID here
HAIKU_ID = "us.anthropic.claude-3-5-haiku-20241022-v1:0"

# Initialize RAG Client
rag_client: RAGClient = RAGClient(
    bedrock,
    RAG_USER_PROMPT, 
    RAG_SYSTEM_PROMPT, 
    HAIKU_ID,
    {"temperature": 0.5, "maxTokens": 2096},
    experiment_1_retrieval_task
)

# Initialize Eval Client
eval_client = EvaluationClient(
    bedrock,
    RUBRIC_USER_PROMPT, 
    RUBRIC_SYSTEM_PROMPT, 
    HAIKU_ID, 
    {"temperature": 0.7, "maxTokens": 4096}
)

In [18]:
# Generate RAG responses
rag_df = rag_client.process(eval_df)

In [19]:
# Evaluate RAG Responses
llm_as_a_judge_results_df = eval_client.evaluate(rag_df)

In [20]:
# Create Summary View of Results
import pandas as pd
import numpy as np
from textwrap import fill

class E2EEvaluator:
    def __init__(self, df):
        self.df = df
        self.grades = df['grade'].astype(float)
    
    def calculate_metrics(self):
        return {
            'Mean': np.mean(self.grades),
            'Median': np.median(self.grades),
            'Standard Deviation': np.std(self.grades),
            'Minimum Grade': np.min(self.grades),
            'Maximum Grade': np.max(self.grades)
        }
    
    def generate_report(self):
        metrics = self.calculate_metrics()
        report = "E2E Validation Result\n"
        report += "========================\n\n"
        
        for metric, value in metrics.items():
            report += f"{metric}: {value:.2f}\n"
        
        return report
    
    def analyze_grade_distribution(self):
        return self.df['grade'].value_counts().sort_index()

    def pretty_print_lowest_results(self, n=3, width=80):
        lowest_results = self.df.nsmallest(n, 'grade')
        for index, row in lowest_results.iterrows():
            print(f"{'='*width}\n")
            print(f"Grade: {row['grade']}\n")
            print("Query Text:")
            print(fill(row['query_text'], width=width))
            print("\nLLM Response:")
            print(fill(row['llm_response'], width=width))
            print("\nReasoning:")
            print(fill(row['reasoning'], width=width))
            print(f"\n{'='*width}\n")

In [21]:
# Assuming your dataframe is named 'df'
evaluator = E2EEvaluator(llm_as_a_judge_results_df)

# Generate and print the report
print(evaluator.generate_report())

# Analyze grade distribution
print(evaluator.analyze_grade_distribution())

E2E Validation Result

Mean: 4.46
Median: 5.00
Standard Deviation: 0.96
Minimum Grade: 2.00
Maximum Grade: 5.00

grade
2.0     2
3.0     2
4.0     3
5.0    17
Name: count, dtype: int64


In [22]:
# Look at the results and spot check them
llm_as_a_judge_results_df

Unnamed: 0,query_text,ground_truth,context,llm_response,grade,reasoning
0,I'm using version 2.1 of open search and tryin...,The `zstd` and `zstd_no_dict` codecs were intr...,"###File name:\n{'file_type': 'text/markdown', ...",The context indicates that zstd compression is...,5.0,Context Utilization: 1 - The answer directly u...
1,I'm trying to set up this new aggregate view t...,To ensure you don't break your existing multi-...,###File name:\n{'last_modified_date': '2025-03...,Important points about the aggregate view for ...,5.0,Context Utilization: 1 - The generated answer ...
2,What software license does Opensearch use?,"According to the context provided, the OpenSea...","###File name:\n{'file_type': 'text/markdown', ...","Apache License, Version 2.0. All components of...",5.0,Context Utilization: 1 - The answer directly u...
3,Does GPU accelerated nodes support Pytorch?,"Yes, according to the documentation, GPU-accel...",###File name:\n{'relative_path': '_ml-commons-...,"Yes, GPU-accelerated nodes support PyTorch. Sp...",5.0,Context Utilization: 1 - The answer directly u...
4,Does opensearch support hugging face models? I...,"Yes, OpenSearch supports various Hugging Face ...",###File name:\n{'relative_path': '_vector-sear...,"Yes, OpenSearch supports Hugging Face models, ...",3.0,Context Utilization: 0 - While the answer uses...
5,"I have a custom model, can I run it in Opensea...","Yes, OpenSearch supports running custom local ...",###File name:\n{'file_path': '/Users/huthmac/D...,"Yes, you can run a custom model in OpenSearch....",5.0,Context Utilization: 1 - The generated answer ...
6,"I have a model and some ML nodes, how do I boo...","Based on the context provided, to boost the pe...","###File name:\n{'file_size': 12788, 'last_modi...",To boost your ML model's performance:\n\n1. Us...,5.0,Context Utilization: 1 - The generated answer ...
7,Can you show me an example of how to use lat/l...,"Yes, the context provides several examples of ...","###File name:\n{'creation_date': '2025-09-22',...",Here are several ways to use latitude/longitud...,5.0,Context Utilization: 1 - The answer directly u...
8,How do I use vector search?,"According to the provided context, there are t...","###File name:\n{'file_size': 3509, 'last_modif...",To use vector search:\n\n1. Understand the bas...,2.0,Context Utilization: 0 - The generated answer ...
9,How do I understand the memory requirements fo...,The memory requirements for using HNSW (Hierar...,"###File name:\n{'creation_date': '2025-09-22',...","To understand HNSW memory requirements, use th...",5.0,Context Utilization: 1 \n- The answer directly...


### 2.6 E2E test results
The E2E test results are pretty good! However, it doesn't account for scenarios where you simply don't have the correct context. 

In this notebook we combined our embeddings and prompt together to run E2E tests on our entire RAG system. 
Based on our findings, document level chunking worked very well for this use case. 

# Takeaways

By adding validation at each touchpoint in the RAG system, we can get a comprehensive view of what's happening and where the bottlenecks to better performance are. 

Another important takeaway is that how you chunk your data matters (aguably) more than what model you choose to vend the RAG results. We only explored a very basic chunking strategy and did not cover more advanced retrieval strategies or other important components such as different embedding models, rerankers that could unlock greater performance. If you want to dive deeper into any of these other evaluation touch points, please refer to the [GenAI System Evaluation repository](https://github.com/aws-samples/genai-system-evaluation/tree/main).

Lastly, no dataset or information retrieval problem is exactly the same. It's important to evaluate it to understand how your RAG system is performing and update your chunking and validation sets over time.