# Evaluating a RAG Pipeline: Answer Accuracy, Context Relevance, and Response Groundedness with Ragas

In this notebook, we will evaluate our RAG system on three metrics using the [Ragas](https://docs.ragas.io/en/stable/) library.

Ragas provides a set of evaluation metrics designed to help you objectively measure the performance of your LLM application.

In this notebook, we will use the following three metrics, introduced to Ragas by NVIDIA:
1. Answer Accuracy measures the agreement between a model’s response and a reference ground truth for a given question.
2. Context Relevance evaluates whether the `retrieved_contexts` (chunks or passages) are pertinent to the `user_input`.
3. Response Groundedness measures how well a response is supported or "grounded" by the retrieved contexts. It assesses whether each claim in the response can be found, either wholly or partially, in the provided contexts.
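As a rough illustration of how these metrics relate to the data we will collect, the sketch below lists which sample fields each metric reads and checks a sample for missing ones. The field names match those used later in this notebook; the `REQUIRED_FIELDS` mapping, the `missing_fields` helper, and the sample values are illustrative assumptions, not part of Ragas.

```python
# Fields each metric reads from a sample, following the descriptions above.
# This mapping is an illustrative helper, not a Ragas API.
REQUIRED_FIELDS = {
    "AnswerAccuracy": ("user_input", "response", "reference"),
    "ContextRelevance": ("user_input", "retrieved_contexts"),
    "ResponseGroundedness": ("response", "retrieved_contexts"),
}


def missing_fields(sample: dict) -> dict:
    """Return, per metric, the required fields that are absent or empty in `sample`."""
    return {
        metric: [field for field in fields if not sample.get(field)]
        for metric, fields in REQUIRED_FIELDS.items()
    }


# A made-up sample, used only to show the expected shape of one record.
sample = {
    "user_input": "What is the capital of France?",
    "retrieved_contexts": ["Paris is the capital and largest city of France."],
    "response": "The capital of France is Paris.",
    "reference": "Paris",
}

print(missing_fields(sample))
```

A metric with a non-empty list here would have nothing meaningful to score for that sample.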

## 1. Download Evaluation Documents

First, let's download a dataset to evaluate our RAG system on. We will use the FinanceBench dataset, which includes PDF files with information and reports about publicly traded companies, as well as ground truth question and answer pairs.

Let's start by cloning the repo into our data directory, in a subdirectory called `financebench`. Inside `financebench`, you can find the PDFs in a subdirectory called `pdfs`.


In [None]:
! git clone https://github.com/patronus-ai/financebench.git ../data/financebench
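Before moving on, it can help to confirm that the clone produced the expected PDF directory. The following is a minimal sketch that assumes the directory layout described above; the `list_pdfs` helper is a hypothetical name introduced here for illustration.

```python
import glob
import os


def list_pdfs(pdf_dir: str) -> list:
    """Return the sorted paths of all .pdf files directly inside `pdf_dir`."""
    return sorted(glob.glob(os.path.join(pdf_dir, "*.pdf")))


# Path assumed from the clone command above
pdf_paths = list_pdfs("../data/financebench/pdfs")
print(f"Found {len(pdf_paths)} PDF files")
```

If the count is zero, the clone most likely failed or was placed in a different directory.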

## 2. Ingest Evaluation Documents

For evaluation, we will use the FinanceBench dataset downloaded in the previous step. The cloned repository contains the PDF files, as well as a JSONL file with ground truth question and answer pairs.
Let's start by creating a collection called `financebench` and uploading the relevant documents.

This is similar to the `ingestion_api_usage` notebook. 

In [None]:
import aiohttp
import os
import json
import glob

In [None]:
IPADDRESS = "ingestor-server" if os.environ.get("AI_WORKBENCH", "false") == "true" else "localhost" # Replace this with the correct IP address
INGESTOR_SERVER_PORT = "8082"
INGESTOR_BASE_URL = f"http://{IPADDRESS}:{INGESTOR_SERVER_PORT}"  # Replace with your server URL

async def print_response(response):
    """Helper to print API response."""
    try:
        response_json = await response.json()
        print(json.dumps(response_json, indent=2))
    except aiohttp.ClientResponseError:
        print(await response.text())


In [None]:
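async def create_collection(
    collection_name: str = "",
    embedding_dimension: int = 2048,
    metadata_schema: list = None  # None avoids sharing one mutable default list across calls
):
    """Create a collection on the ingestor server and print the API response."""

    data = {
        "collection_name": collection_name,
        "embedding_dimension": embedding_dimension,
        "metadata_schema": metadata_schema or []
    }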

    HEADERS = {"Content-Type": "application/json"}

    async with aiohttp.ClientSession() as session:
        try:
            async with session.post(f"{INGESTOR_BASE_URL}/v1/collection", json=data, headers=HEADERS) as response:
                await print_response(response)
        except aiohttp.ClientError as e:
            return 500, {"error": str(e)}



# Call create collection method
await create_collection(
    collection_name="financebench",
)

In [None]:
FILEPATHS = glob.glob(os.path.join("../data/financebench/pdfs", "*.pdf"))

async def upload_documents(collection_name: str = ""):

    data = {
        "collection_name": collection_name,
        "blocking": False, # If True, upload is blocking; else async. Status API not needed when blocking
        "split_options": {
            "chunk_size": 512,
            "chunk_overlap": 150
        },
        "generate_summary": False # Set to True to optionally generate summaries for all documents after ingestion
    }

    form_data = aiohttp.FormData()
    for file_path in FILEPATHS:
        form_data.add_field("documents", open(file_path, "rb"), filename=os.path.basename(file_path), content_type="application/pdf")

    form_data.add_field("data", json.dumps(data), content_type="application/json")

    async with aiohttp.ClientSession() as session:
        try:
            async with session.post(f"{INGESTOR_BASE_URL}/v1/documents", data=form_data) as response: # Replace with session.patch for reingesting
                await print_response(response)
        except aiohttp.ClientError as e:
            print(f"Error: {e}")

await upload_documents(collection_name="financebench")


In [None]:
# This might take a few minutes to complete depending on the number of documents uploaded
async def get_task_status(
    task_id: str
):

    params = {
        "task_id": task_id,
    }

    HEADERS = {"Content-Type": "application/json"}

    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(f"{INGESTOR_BASE_URL}/v1/status", params=params, headers=HEADERS) as response:
                await print_response(response)
        except aiohttp.ClientError as e:
            return 500, {"error": str(e)}

await get_task_status(task_id="*****************************")  # Enter the task_id returned by the upload documents API
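Because ingestion can take a few minutes, you may prefer to poll the status rather than re-run the cell by hand. The sketch below is a generic polling loop and not part of the ingestor API: `poll_until`, `fetch_status`, and `is_done` are hypothetical names, and the loop assumes you supply a coroutine that returns the parsed status (the helper above only prints it) and a predicate that recognizes a finished task.

```python
import asyncio


async def poll_until(fetch_status, is_done, interval_s: float = 5.0, timeout_s: float = 600.0):
    """Call `fetch_status()` repeatedly until `is_done(status)` is true.

    Raises TimeoutError if the task is still not done once `timeout_s` would be exceeded.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while True:
        status = await fetch_status()
        if is_done(status):
            return status
        # Stop if another sleep would take us past the deadline
        if loop.time() + interval_s > deadline:
            raise TimeoutError("Task did not finish before the timeout")
        await asyncio.sleep(interval_s)


# Example call from a notebook cell. Both arguments are placeholders that
# depend on the response format of your server:
# status = await poll_until(my_fetch_status, my_is_done)
```

Which field of the status response signals completion depends on the server, so the predicate is left to the caller.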

## 3. Create Dataset for Ragas Evaluation

In `data/financebench/data`, there is a file called `financebench_open_source.jsonl`. This file contains questions about the PDFs, as well as the corresponding ground truth answers.
For each ground-truth question and answer pair, we will generate an answer from our RAG system and retrieve the relevant documents.

The answer generation and context retrieval from the RAG system are similar to the `retriever_api_usage` notebook.


In [None]:
IPADDRESS = "rag-server" if os.environ.get("AI_WORKBENCH", "false") == "true" else "localhost" #Replace this with the correct IP address
RAG_SERVER_PORT = "8081"
RAG_BASE_URL = f"http://{IPADDRESS}:{RAG_SERVER_PORT}"  # Replace with your server URL

async def print_response(response):
    """Helper to print API response."""
    try:
        response_json = await response.json()
        print(json.dumps(response_json, indent=2))
    except aiohttp.ClientResponseError:
        print(await response.text())



In [None]:
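generate_url = f"{RAG_BASE_URL}/v1/generate"
async def generate_answer(payload):
    """POST the payload to /v1/generate, print the response, and return the raw body text.

    Callers are responsible for extracting the fields they need from the body.
    """
    async with aiohttp.ClientSession() as session:
        try:
            async with session.post(url=generate_url, json=payload) as response:
                await print_response(response)
                # Return the body so the caller can store it; without a return
                # value, the result assigned by the caller would always be None.
                return await response.text()
        except aiohttp.ClientError as e:
            print(f"Error: {e}")
            return None

search_url = f"{RAG_BASE_URL}/v1/search"
async def document_seach(payload):
    """POST the payload to /v1/search, print the response, and return the raw body text.

    Callers are responsible for extracting the fields they need from the body.
    """
    async with aiohttp.ClientSession() as session:
        try:
            async with session.post(url=search_url, json=payload) as response:
                await print_response(response)
                return await response.text()
        except aiohttp.ClientError as e:
            print(f"Error: {e}")
            return None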



In [None]:
# Load the question and ground-truth answer pairs, which are provided as part of the FinanceBench dataset.

# Open and read the JSONL file (one JSON object per line)
with open('../data/financebench/data/financebench_open_source.jsonl', 'r') as file:
    gt_qa_pairs = [json.loads(line) for line in file]


dataset = []

# To keep this demo brief, we evaluate on only the first 50 questions.
# Increase n (up to len(gt_qa_pairs)) for more comprehensive results.
n = 50
for i in gt_qa_pairs[:n]:
    question = i['question']

    print(question)

    generate_payload = {
        "messages": [
            {
            "role": "user",
            "content": question
            }
        ],
        "use_knowledge_base": True,
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 1024,
        "reranker_top_k": 2,
        "vdb_top_k": 10,
        "vdb_endpoint": "http://milvus:19530",
        "collection_names": ["financebench"],
        "enable_query_rewriting": True,
        "enable_reranker": True,
        "enable_citations": True,
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "reranker_model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
        "embedding_model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        # Provide url of the model endpoints if deployed elsewhere
        # "llm_endpoint": "",
        #"embedding_endpoint": "",
        #"reranker_endpoint": "",
        "stop": [],
        "filter_expr": ''
        }
    
    search_payload={
        "query": question,
        "reranker_top_k": 2,
        "vdb_top_k": 10,
        "vdb_endpoint": "http://milvus:19530",
        "collection_names": ["financebench"],
        "messages": [],
        "enable_query_rewriting": True,
        "enable_reranker": True,
        "embedding_model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        # Provide url of the model endpoints if deployed elsewhere
        #"embedding_endpoint": "",
        #"reranker_endpoint": "",
        "reranker_model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",

        }
    
    rag_answer = await generate_answer(generate_payload)
    rag_search = await document_seach(search_payload)

    dataset.append({
        "user_input": question,
        "retrieved_contexts": rag_search,
        "response": rag_answer,
        "reference": i['answer'],
    })
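Ragas expects `response` to be a string and `retrieved_contexts` to be a list of strings, so the raw server responses need to be reduced to those shapes before evaluation. The response schema of the RAG server is not shown in this notebook, so the sketch below rests on two loudly stated assumptions: that `/v1/search` returns JSON with a `results` list whose items carry a `content` string, and that `/v1/generate` streams lines of the form `data: {...}` whose `choices[0].message.content` holds a piece of the answer. The helper names `extract_contexts` and `extract_answer` are hypothetical; check the printed responses and adjust the field names to match your deployment.

```python
import json


def extract_contexts(search_body: str) -> list:
    """Return passage texts from a search response body.

    ASSUMED schema: {"results": [{"content": "..."}, ...]}
    """
    parsed = json.loads(search_body)
    return [item["content"] for item in parsed.get("results", []) if item.get("content")]


def extract_answer(generate_body: str) -> str:
    """Join the answer chunks from a streamed generate response body.

    ASSUMED schema: lines of 'data: {json}' where
    json["choices"][0]["message"]["content"] is a chunk of the answer.
    """
    chunks = []
    for line in generate_body.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        event = json.loads(payload)
        for choice in event.get("choices", []):
            content = (choice.get("message") or {}).get("content")
            if content:
                chunks.append(content)
    return "".join(chunks)


# Hand-written bodies in the assumed format, for illustration only
demo_search_body = json.dumps({"results": [{"content": "Passage one."}, {"content": "Passage two."}]})
demo_generate_body = "\n\n".join([
    "data: " + json.dumps({"choices": [{"message": {"content": "Hello"}}]}),
    "data: " + json.dumps({"choices": [{"message": {"content": " world"}}]}),
    "data: [DONE]",
])

print(extract_contexts(demo_search_body))
print(extract_answer(demo_generate_body))
```

With helpers like these, each dataset record would hold the extracted answer and passages rather than the raw response bodies.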



## 4. Evaluate with Ragas

In this example, we will use the NVIDIA-hosted endpoint for our judge model. To use this endpoint, paste your NVIDIA API key in the cell below to save it as an environment variable.

To generate an API key, go to [build.nvidia.com](https://build.nvidia.com/) and click the green "Get API Key" button in the top right corner.

When using the public endpoint for the judge LLM, you will likely encounter rate limit errors. We can try to reduce the number of errors by adjusting the run configuration, which we do below.

Alternatively, you can use self-hosted NIM microservice endpoints to avoid these errors altogether. If you're using a self-hosted NIM, you do not need to provide your API key.

To deploy the judge LLM as a NIM on your own, follow the instructions [here](https://build.nvidia.com/mistralai/mixtral-8x22b-instruct/deploy).


In [None]:

import os
from langchain_nvidia_ai_endpoints.chat_models import ChatNVIDIA

os.environ["NVIDIA_API_KEY"] = "nvapi-***"

llm = ChatNVIDIA(model="nvidia/llama-3.3-nemotron-super-49b-v1")

In [None]:
from ragas import EvaluationDataset
evaluation_dataset = EvaluationDataset.from_list(dataset)

In [None]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AnswerAccuracy, ContextRelevance, ResponseGroundedness

# Wrap the LangChain chat model so Ragas can use it as the judge LLM
evaluator_llm = LangchainLLMWrapper(llm)

In [None]:
from ragas.run_config import RunConfig

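# Use a single worker and allow up to 120 seconds of wait between retries
# to reduce rate limit errors on the public endpoint
custom_config = RunConfig(max_workers=1, max_wait=120)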
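To build intuition for why capping the wait between retries matters under rate limiting, here is a generic capped exponential backoff schedule. It is only an illustration of the general technique and is not Ragas's internal retry logic; the `backoff_delays` helper and its `base_s` parameter are hypothetical.

```python
def backoff_delays(max_retries: int, base_s: float = 1.0, max_wait_s: float = 120.0) -> list:
    """Return the wait before each retry: doubling from `base_s`, capped at `max_wait_s`."""
    return [min(base_s * 2 ** attempt, max_wait_s) for attempt in range(max_retries)]


print(backoff_delays(10))
```

The delays double from one second until they reach the 120 second cap, which gives a rate-limited endpoint progressively more time to recover.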

In [None]:
results = evaluate(
    dataset=evaluation_dataset,
    metrics=[AnswerAccuracy(), ContextRelevance(), ResponseGroundedness()],
    llm=evaluator_llm,
    run_config=custom_config,
)


Finally, let's take a look at our results.

In [None]:
results
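Judge calls that fail (for example, after repeated rate limit errors) can leave missing scores, so it may be useful to average each metric over only the samples that were actually scored. The sketch below assumes the per-sample scores have been exported as a list of dictionaries, for instance via `results.to_pandas().to_dict("records")`. The `mean_scores` helper, the column names, and the numbers are hypothetical and do not come from a real run.

```python
import math


def mean_scores(rows: list, metric_names: list) -> dict:
    """Average each metric over the rows that have a numeric, non-NaN score."""
    summary = {}
    for name in metric_names:
        values = [
            row[name]
            for row in rows
            if isinstance(row.get(name), (int, float)) and not math.isnan(row[name])
        ]
        summary[name] = sum(values) / len(values) if values else float("nan")
    return summary


# Made-up rows for illustration; real column names depend on the metrics used
demo_rows = [
    {"nv_accuracy": 1.0, "nv_context_relevance": 0.5},
    {"nv_accuracy": 0.5, "nv_context_relevance": float("nan")},
    {"nv_accuracy": float("nan"), "nv_context_relevance": 1.0},
]

print(mean_scores(demo_rows, ["nv_accuracy", "nv_context_relevance"]))
```

Comparing the number of scored rows per metric against the dataset size also shows how many judge calls failed.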