# How to evaluate a RAG application

This example uses [LangChain](https://www.langchain.com) and [Giskard](https://github.com/Giskard-AI/giskard) to evaluate the quality of a RAG application.

## Get the Content

In [1]:
from src.euprojectsrag.configurations import get_project_conf
from src.euprojectsrag.file_reader import FileReader

reader = FileReader()

project_name = "SPECTRO"
project_conf = get_project_conf(project_name)
split_docs = reader.read_project_files(project_conf)

print(f"Number of documents in {project_name}: {len(split_docs)}")
print(f"Document at index 10 in {project_name}: {split_docs[10]}")

Token indices sequence length is longer than the specified maximum sequence length for this model (3683 > 512). Running this sequence through the model will result in indexing errors


Number of documents in SPECTRO: 1460
Document at index 10 in SPECTRO: page_content='budget.......................................................................................................................13, 1 = . 4. Timetable and deadlines ............................................................................................................14, 1 = . 5. Admissibility and documents, 1' metadata={'source': 'call-fiche_digital-2022-skills-03-specialised-edu_en.pdf', 'doc_type': 'Call', 'project_name': 'SPECTRO', 'page_numbers': '3', 'title': 'TABLE OF CONTENTS'}


## Create a Knowledge Base

Let's start by loading the content in a pandas DataFrame.

In [2]:
import pandas as pd

df = pd.DataFrame([d.page_content for d in split_docs], columns=["page_content"])
df.head(10)

| | page_content |
|---|---|
| 0 | Advanced Digital Skills (DIGITAL-2022-SKILLS-0... |
| 1 | 1.0, Publication Date = 15.09.2022. 1.0, Chang... |
| 2 | HADEA. B - Digital, Industry and Space HaDEA.B... |
| 3 | 0. Introduction ................................. |
| 4 | Background....................................... |
| 5 | ................................................. |
| 6 | Scope............................................ |
| 7 | ................................................. |
| 8 | deliverables..................................... |
| 9 | ................................................. |


We can now create a Giskard `KnowledgeBase` from this DataFrame, after setting the embedding model it should use.

In [11]:
import giskard
from giskard.rag import KnowledgeBase

giskard.llm.set_embedding_model("text-embedding-ada-002")
knowledge_base = KnowledgeBase(df)
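
The DataFrame preview shows several chunks that consist almost entirely of dot leaders from a table of contents. Such chunks carry little information for question generation. The following is a minimal sketch of one way to drop them before building the knowledge base; the helper `is_informative` and its 0.5 threshold are assumptions for illustration, not part of the original pipeline or of Giskard.

```python
import re


def is_informative(text: str, min_alnum_ratio: float = 0.5) -> bool:
    """Return True when enough of the non-whitespace characters are letters or digits.

    Hypothetical helper: the threshold is an arbitrary starting point.
    """
    stripped = re.sub(r"\s", "", text)
    if not stripped:
        return False
    alnum_count = sum(ch.isalnum() for ch in stripped)
    return alnum_count / len(stripped) >= min_alnum_ratio


# Illustrative chunks modelled on the preview above.
chunks = [
    "Advanced Digital Skills call document",
    "Background.......................................",
    ".................................................",
]

kept = [chunk for chunk in chunks if is_informative(chunk)]
print(kept)
```

With the DataFrame from this notebook, the same filter could be applied as `df[df["page_content"].map(is_informative)]` before passing the result to `KnowledgeBase`.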

## Generate the Test Set

In [12]:
from giskard.rag import generate_testset

testset = generate_testset(
    knowledge_base,
    num_questions=60,
    agent_description=f"A chatbot answering questions about the {project_name} project files.",
)

2025-06-23 07:55:28,046 pid:18220 MainThread giskard.rag  INFO     Finding topics in the knowledge base.




2025-06-23 07:56:39,604 pid:18220 MainThread giskard.rag  INFO     Found 49 topics in the knowledge base.


Generating questions: 100%|██████████| 60/60 [03:12<00:00,  3.21s/it]


Let's display a few samples from the test set.

In [13]:
test_set_df = testset.to_pandas()

for index, (_, row) in enumerate(test_set_df.head(3).iterrows(), start=1):
    print(f"Question {index}: {row['question']}")
    print(f"Reference answer: {row['reference_answer']}")
    print("Reference context:")
    print(row['reference_context'])
    print("******************", end="\n\n")


Question 1: What information is included in the context about the project files?
Reference answer: The context includes columns for Nº, Name, WP nº, Lead beneficiary, Type, Dissemination level, Due date, and Description.
Reference context:
Document 673: Nº
Name
WP
nº
Lead beneficiary
Type
Dissemin ation level
Due date
Description
******************

Question 2: How many master's programmes on Cybersecurity and Robotics are developed by SPECTRO?
Reference answer: SPECTRO developed 1 master's programme on Cybersecurity and 1 master's programme on Robotics.
Reference context:
Document 604: Design and delivery of education programmes responding to the needs of the of the labour markets and increasing the capacity of the education offer for advanced technologies and competences related to Cybersecurity and Robotics., SPECTRO strategy and impact = WP1 and WP2 will contribute to this expected outcome and to the achievement of SO1 - Addressing skill needs by delivering education programmes in 

Let's now save the test set to a file:

In [14]:
testset.save("test-set.jsonl")
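
The saved file can be inspected with the standard library alone, for example to check how the questions are spread across question types. The sketch below assumes that each line of `test-set.jsonl` is a JSON object with a `metadata` field containing `question_type`, as the `metadata` column of the test set suggests; verify the field names against your own file. `count_question_types` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Iterable


def count_question_types(lines: Iterable[str]) -> Counter:
    """Count records per question type in JSONL content (assumed schema)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        metadata = record.get("metadata") or {}
        counts[metadata.get("question_type", "unknown")] += 1
    return counts


# Illustrative lines; a real file would hold the generated questions.
sample_lines = [
    '{"question": "Q1", "metadata": {"question_type": "simple"}}',
    '{"question": "Q2", "metadata": {"question_type": "complex"}}',
    '{"question": "Q3", "metadata": {"question_type": "simple"}}',
]

type_counts = count_question_types(sample_lines)
print(type_counts)
```

To run it on the saved file, pass an open file handle: `count_question_types(open("test-set.jsonl", encoding="utf-8"))`.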

## Prepare the Prompt Template

In [15]:
from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))


Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: Here is some context

Question: Here is a question



## Evaluating the Model on the Test Set

We need to create a function that invokes the chain with a specific question and returns the answer.

In [18]:
from src.euprojectsrag.rag_chain import RAGChain

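# Giskard calls this function once per test question; `history` holds the
# earlier conversation turns for conversational questions.
def answer_fn(question, history=None):
    rag_chain = RAGChain()

    # Copy the history so the caller's list is not mutated.
    messages = list(history) if history else []
    messages.append({"role": "user", "content": question})

    response = rag_chain.query_project(messages, project_name)
    return response.answer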
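
Because `answer_fn` may receive a `history` list for conversational questions, it is safer to build the message list without modifying the list that was passed in. The sketch below isolates that step in a hypothetical helper, `build_messages`, which can be checked without calling the RAG chain; the message contents are placeholders.

```python
from typing import Dict, List, Optional

Message = Dict[str, str]


def build_messages(question: str, history: Optional[List[Message]] = None) -> List[Message]:
    """Return a new message list: prior turns followed by the new user question."""
    messages = list(history) if history else []  # shallow copy, caller's list untouched
    messages.append({"role": "user", "content": question})
    return messages


# Placeholder conversation turns for illustration.
history = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
]

messages = build_messages("Follow-up question", history)
print(len(history), len(messages))
```

Appending directly to `history` would instead grow the caller's list on every call, which can leak turns from one evaluation question into the next if the same list is reused.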

We can now use the `evaluate()` function to evaluate the model on the test set. This function will compare the answers from the chain with the reference answers in the test set.

In [20]:
from giskard.rag import evaluate

report = evaluate(answer_fn, testset=testset, knowledge_base=knowledge_base)

Asking questions to the agent: 100%|██████████| 60/60 [3:04:23<00:00, 184.39s/it]
CorrectnessMetric evaluation: 100%|██████████| 60/60 [01:23<00:00,  1.39s/it]


Let's now display the report.

Here are the five components of our RAG application:

* **Generator**: This is the LLM used in the chain to generate the answers.
* **Retriever**: This is the retriever that fetches relevant documents from the knowledge base according to a query.
* **Rewriter**: This is a component that rewrites the user query to make it more relevant to the knowledge base or to account for chat history.
* **Router**: This is a component that filters the user's query based on their intent.
* **Knowledge Base**: This is the set of documents given to the RAG to generate the answers.

In [21]:
display(report)

In [22]:
report.to_html("report.html")

We can display the correctness results organized by question type.

In [23]:
report.correctness_by_question_type()

| question_type | correctness |
|---|---|
| complex | 0.3 |
| conversational | 0.1 |
| distracting element | 0.4 |
| double | 0.2 |
| simple | 0.3 |
| situational | 0.3 |
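
Each score can be read as the fraction of questions of that type that were judged correct. For instance, 0.3 would correspond to 3 correct answers out of 10 if the 60 questions are split evenly across the six types; the even split is an assumption, since the output does not show the counts. The sketch below reproduces that arithmetic on made-up records. `correctness_by_type` is a hypothetical helper, not Giskard's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def correctness_by_type(records: Iterable[Mapping]) -> Dict[str, float]:
    """Compute the share of correct answers per question type."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for record in records:
        question_type = record["question_type"]
        totals[question_type] += 1
        if record["correctness"]:
            correct[question_type] += 1
    return {qt: correct[qt] / totals[qt] for qt in totals}


# Made-up records for illustration only.
records = [
    {"question_type": "simple", "correctness": True},
    {"question_type": "simple", "correctness": False},
    {"question_type": "complex", "correctness": False},
    {"question_type": "complex", "correctness": False},
]

scores = correctness_by_type(records)
print(scores)
```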


We can also display the specific failures.

In [24]:
report.get_failures()

| id | question | reference_answer | reference_context | conversation_history | metadata | agent_answer | correctness | correctness_reason |
|---|---|---|---|---|---|---|---|---|
| b493aeac-41bc-4942-8f14-d60af55d8297 | What information is included in the context ab... | The context includes columns for Nº, Name, WP ... | Document 673: Nº\nName\nWP\nnº\nLead beneficia... | [] | {'question_type': 'simple', 'seed_document_id'... | The project files for SPECTRO include three ma... | False | The agent provided detailed information about ... |
| cb968df2-36ff-48c9-96d0-77e5681b7389 | What is the maximum EU contribution to costs f... | The maximum EU contribution to costs for Unive... | Document 309: EU contribution to costs/ = 199 ... | [] | {'question_type': 'simple', 'seed_document_id'... | The maximum EU contribution to costs for Unive... | False | The agent stated that the maximum EU contribut... |
| 3b494dbe-fe0c-427b-9ced-8e6fed264a1c | What role does Prof. Simona Motogna currently ... | Since September 2020, Prof. Simona Motogna rep... | Document 1295: Curriculum Developer, and Lectu... | [] | {'question_type': 'simple', 'seed_document_id'... | I am sorry, but the provided context does not ... | False | The agent stated that there is no information ... |
| 0691a043-057e-4ab6-8184-5c60e2e5fb6b | Which actions from the European Skills Agenda ... | The SPECTRO project contributes to Action 1: A... | Document 426: The activities and education pro... | [] | {'question_type': 'simple', 'seed_document_id'... | While the provided context does not explicitly... | False | The agent did not specifically mention the act... |
| 56f494e5-1260-4a37-a298-e8b3ddecada3 | What resources and expertise will UNIBO bring ... | UNIBO will bring its unique history of excelle... | Document 1181: Readiness to start implementati... | [] | {'question_type': 'simple', 'seed_document_id'... | ALMA MATER STUDIORUM - UNIVERSITA DI BOLOGNA (... | False | The agent provided detailed information about ... |
| 6ad0c90f-84f3-4151-82cf-c18af4e9eda7 | What is the maximum amount allowed for financi... | The maximum amount allowed for financial suppo... | Document 1443: -Standard supplementary payment... | [] | {'question_type': 'simple', 'seed_document_id'... | Based on the provided context, the maximum amo... | False | The agent stated that the maximum amount is no... |
| 59b0be13-b4f0-47e4-8c03-eb9ac4b558d7 | What is one of the main targets set by the Dig... | The Digital Compass sets the target of having ... | Document 371: The Digital Compass sets a serie... | [] | {'question_type': 'simple', 'seed_document_id'... | I am sorry, but the provided context does not ... | False | The agent stated that the context does not con... |
| 762fbe59-b35a-484d-9629-ff7a3b044cd8 | Could you identify which universities explicit... | The universities with a confirmed entry year f... | Document 482: Entry Year, ELTE = ✓. Entry Year... | [] | {'question_type': 'complex', 'seed_document_id... | Based on the provided information, the univers... | False | The agent stated that the universities are ELT... |
| 47b10898-7de1-4cf8-98cb-8290780a1455 | Under what specific title or role does Federic... | Interim CEO | Document 149: The name and e-mail of contact p... | [] | {'question_type': 'complex', 'seed_document_id... | I apologize, but the provided context does not... | False | The agent stated that there is no information ... |
| 37a27832-13df-4b0d-842b-9c425fbaa443 | Considering the strategic focus of SPECTRO on ... | Finland, France, Italy, Hungary, the Netherlan... | Document 1334: SPECTRO will deliver educati... | [] | {'question_type': 'complex', 'seed_document_id... | The SPECTRO project, focused on reducing the d... | False | The agent's answer includes Estonia instead of... |
