## 2.4 Automated evaluation of the Q&A bot's performance

### 🚄 Preface

The new Q&A bot may encounter some issues during real-world use, especially when users ask specific questions that require detailed knowledge from internal documents. For example, when a new employee asks, "How do I request leave?" the bot might provide a generic response instead of consulting the company’s official policy documents for accurate guidance.

Just as conventional software development requires testing and validation, it is equally important to establish an **evaluation system** for your Q&A bot project. This ensures that similar issues can be quickly identified and resolved. Moreover, after implementing any optimization or improvement, you should run a batch of test questions to confirm that the changes positively impact the overall performance of the Q&A bot.

In this chapter, you will learn how to **automate evaluation processes** using LLMs and specialized frameworks like **Ragas**, enabling you to measure both the quality of answers and the effectiveness of retrieval.

## 🍁 Goals
After completing this chapter, you will be able to:

- Understand how to automate evaluations for LLM applications.
- Evaluate RAG chatbots using automated tools such as Ragas.
- Identify and solve problems in your Q&A bot by analyzing evaluation scores.

<!-- ## 📖 Course Outline
In this chapter, we will first understand some current issues with RAG chatbot through a specific problem. Then, we will attempt to discover the issue by implementing a simple automated test ourselves. Finally, we will learn how to use the more mature RAG application testing framework, Ragas, to assess the performance of RAG chatbot.

- 1.&nbsp;Evaluating RAG Application Performance
    - 1.1 Issues with the Q&A robot
    - 1.2 Checking RAG retrieval results to troubleshoot issues
    - 1.3 Attempting to establish an automated testing mechanism

- 2.&nbsp;Using Ragas to Evaluate Application Performance
     - 2.1 Evaluating the quality of responses from the RAG chatbot
       - 2.1.1 Quick start
       - 2.1.2 Understanding the calculation process of answer correctness
     - 2.2 Evaluating the recall effectiveness of retrieval
       - 2.2.1 Quick start
       - 2.2.2 Understanding the calculation process of context recall and context precision -->

## 1. Evaluating RAG application performance

### 1.1 Issues with the Q&A Bot

In the previous section, you completed the development of a Q&A bot and began exploring how to evaluate its performance.

In [None]:
import os
from config.load_key import load_key
load_key()
print(f'Your configured API Key is: {os.environ["DASHSCOPE_API_KEY"][:5]+"*"*5}')

Your configured API Key is: sk-98*****


In [3]:
from chatbot import rag
rag.indexing()
query_engine = rag.create_query_engine(rag.load_index())
print('Question: Which department is Michael Johnson in?')
response = query_engine.query('Which department is Michael Johnson in?')
print('Answer: ', end='')
response.print_response_stream()

Question: Which department is Michael Johnson in?
Answer: Michael Johnson is in the IT Infrastructure department. He holds the position of System Administrator and works under the supervision of Michael Chen at the 456 Tech Hub.

As part of this exploration, you asked the question:

> **"Which department is Michael Johnson in?"**

And received the following answer from the bot:

> **"Michael Johnson is in the IT Infrastructure department. He holds the position of System Administrator and works under the supervision of Michael Chen at the 456 Tech Hub."**

The original document contains multiple individuals named Michael Johnson, none of whom are associated with the **IT Infrastructure department**. However, the LLM generated a confident and specific response that combined elements from different contexts, creating the illusion of accuracy without being grounded in factual data.

<a href="https://img.alicdn.com/imgextra/i1/O1CN01C6ZkQG1uGdQbJVw19_!!6000000006010-2-tps-1478-732.png" target="_blank">
<img src="https://img.alicdn.com/imgextra/i1/O1CN01C6ZkQG1uGdQbJVw19_!!6000000006010-2-tps-1478-732.png" width="800">
</a>

This highlights a critical issue: the answer was not based on accurate or unambiguous context, but rather on the model's aggregation or assumptions—potentially resulting in misleading conclusions.

Therefore, the next step is to examine the retrieval results used by the RAG system before generating the final answer, to ensure the context provided is accurate, relevant, and aligned with the user's question.

### 1.2 Checking RAG retrieval results for problem diagnosis

To validate the reasoning behind the answer, we inspect the context chunks retrieved by the RAG system before generating the response.

Here is the retrieved context:

In [4]:
contexts = [node.get_content() for node in response.source_nodes]
contexts

['Employee Key Contact Information.md 2025-07-10\n3 / 6Employee\nIDName SupervisorOffice\nLocationPosition\nTitlePhone\nNumberEmail AddressKey\nResponsibilities\nEID-210Jennifer\nLeeMichael\nChen456\nTech\nHub\n#210Data Analyst(555)\n234-\n5687jennifer.l@educompany.comPerformance\nmetrics tracking\nEID-211Christopher\nLeeMichael\nChen456\nTech\nHub\n#211Security\nAnalyst(555)\n234-\n5688christopher.l@educompany.comSystem\nvulnerability\nassessments\nEID-212Olivia\nTaylorMichael\nChen456\nTech\nHub\n#212Project\nCoordinator(555)\n234-\n5689olivia.t@educompany.comTimeline\nmanagement\nEID-213Daniel\nSmithMichael\nChen456\nTech\nHub\n#213Frontend\nDeveloper(555)\n234-\n5690daniel.s@educompany.comUI\nimplementation\nEID-214Rachel KimMichael\nChen456\nTech\nHub\n#214Mobile App\nDeveloper(555)\n234-\n5691rachel.k@educompany.comCross-platform\ndevelopment\nEID-215Thomas\nNguyenMichael\nChen456\nTech\nHub\n#215Cloud\nEngineer(555)\n234-\n5692thomas.nguyen@educompany.comInfrastructure\nmanageme

From this context, we can see that Michael Johnson is indeed listed as a System Administrator in the Course Development Department, which supports the generated answer.

✅ Conclusion: The retrieval performed well—relevant and sufficient information was retrieved to support the correct answer.

While this example shows good retrieval performance, not all queries will yield such clear results. Therefore, it's essential to build an automated evaluation framework to consistently measure retrieval quality and answer accuracy across multiple test cases.

### 1.3 Building an automated testing mechanism

Although manual inspection helps understand individual cases, it becomes impractical when dealing with hundreds or thousands of questions. Hence, we aim to build an automated testing mechanism to streamline the evaluation process.

#### 1.3.1 Validating answer quality using LLMs

LLMs can be used not only to generate answers but also to evaluate them. By providing both the question and the generated answer, we can prompt the LLM to judge whether the answer is a valid or invalid response to the question.

Here is a function that checks if the answer effectively addresses the question:

In [5]:
from chatbot import llm

def test_answer(question, answer):
    # Ask the LLM to act as a judge. Every prompt section ends with "\n"
    # so that the separators do not run into the surrounding text.
    prompt = (
        "You are a tester.\n"
        "You need to check whether the following answer effectively responds to the user's question.\n"
        "The reply can only be: Valid response or Invalid response. Do not provide any other information.\n"
        "------\n"
        f"The answer is: {answer}\n"
        "------\n"
        f"The question is: {question}\n"
    )
    return llm.invoke(prompt, model_name="qwen-max")


test_answer(
    "Which department is Michael Johnson in?",
    "Michael Johnson is in the IT Infrastructure department. He holds the position of System Administrator and works under the supervision of Michael Chen at the 456 Tech Hub.",
)

'Valid response'

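The LLM judges the answer to be a valid response because it directly addresses the question. Note that this check only confirms that the answer responds to the question; it does not verify that the answer is factually correct.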



#### 1.3.2 Validating context relevance

Equally important is ensuring that the retrieved context is relevant and useful for answering the question. To do this, we define another function to evaluate whether the context provided aligns with the user's query. This step helps ensure that the model is not only generating a plausible response but also basing it on accurate and contextually appropriate information.

This validation process strengthens the reliability of the Q&A bot by filtering out irrelevant or misleading content before the final answer is generated.

In [7]:
def test_contexts(question, answer, contexts):
    # `contexts` must be a list of retrieved text chunks; they are joined into one string below.
    prompt = (
        "You are a tester. Your task is to determine whether the provided reference materials directly support the given answer to the question.\n"
        "If the answer can be clearly found or derived from the reference materials, respond with: The reference information is useful.\n"
        "Otherwise, respond with: The reference information is not useful.\n"
        "Do not provide any other explanation or information.\n"
        "------\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Reference materials: {' '.join(contexts)}\n"
        "------"
    )
    return llm.invoke(prompt, model_name="qwen-max")


test_contexts(
    "Which department is Michael Johnson in?",
    "Michael Johnson is in the IT Infrastructure department. He holds the position of System Administrator and works under the supervision of Michael Chen at the 456 Tech Hub.",
    # Pass a list of chunks: passing a single concatenated string would make
    # ' '.join() insert a space between every character.
    contexts[:2],
)

'The reference information is useful.'

#### 1.3.3 Summary of Evaluation Logic

| Component         | Method                           | Result                |
|------------------|----------------------------------|------------------------|
| Question          | "Which department is Michael Johnson in?" |                        |
| Generated Answer  | Evaluated using `test_answer`     | Valid response         |
| Retrieved Context | Evaluated using `test_contexts`   | Reference info is useful |
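
To move from a single case to a batch, the two checks can be wrapped in a small loop. The following is only a minimal sketch: `run_batch`, `fake_judge_answer`, and `fake_judge_contexts` are hypothetical names introduced here for illustration, and the fake judges stand in for `test_answer` and `test_contexts` so that the example runs without an API key. It also assumes the judges return plain strings in the formats requested by the prompts above.

```python
from typing import Callable, Dict, List


def run_batch(
    cases: List[Dict],
    judge_answer: Callable[[str, str], str],
    judge_contexts: Callable[[str, str, List[str]], str],
) -> List[Dict]:
    """Apply both judges to every test case and collect pass/fail flags."""
    results = []
    for case in cases:
        answer_verdict = judge_answer(case["question"], case["answer"])
        context_verdict = judge_contexts(
            case["question"], case["answer"], case["contexts"]
        )
        results.append({
            "question": case["question"],
            # Normalize the free-text verdicts into booleans.
            "answer_valid": answer_verdict.strip().lower().startswith("valid"),
            "context_useful": "not useful" not in context_verdict.lower(),
        })
    return results


# Hypothetical stand-in judges so the sketch runs offline.
def fake_judge_answer(question: str, answer: str) -> str:
    return "Valid response" if answer else "Invalid response"


def fake_judge_contexts(question: str, answer: str, contexts: List[str]) -> str:
    if contexts:
        return "The reference information is useful."
    return "The reference information is not useful."


cases = [
    {
        "question": "Which department is Michael Johnson in?",
        "answer": "Michael Johnson is in the IT Infrastructure department.",
        "contexts": ["(placeholder for a retrieved chunk)"],
    },
    {"question": "How do I request leave?", "answer": "", "contexts": []},
]

report = run_batch(cases, fake_judge_answer, fake_judge_contexts)
pass_rate = sum(r["answer_valid"] and r["context_useful"] for r in report) / len(report)
print(report)
print(f"pass rate: {pass_rate:.0%}")
```

In a real run you would pass `test_answer` and `test_contexts` instead of the fake judges, and load `cases` from your own set of test questions.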

With the two methods above, you've already taken the first steps in setting up a prototype for an LLM testing project. However, the current implementation is still incomplete and has several limitations:

- Hallucination detection: LLMs can generate responses that sound confident and plausible but are not factually accurate. The `test_answer` method, as currently implemented, may not be able to effectively detect such hallucinations, leading to false positives in evaluation.
- Relevance of retrieved context: The quality of a RAG-based system heavily depends on the relevance and accuracy of the retrieved context. A higher signal-to-noise ratio—meaning more relevant information and less irrelevant or misleading content—leads to better answers. However, the current testing approach is relatively simplistic and does not account for this critical factor.


To address these issues and improve the robustness of your testing framework, it’s highly recommended to integrate mature evaluation tools such as [Ragas](https://docs.ragas.io/en/stable), a specialized framework designed for evaluating the performance of RAG-based chatbots.

## 2. Using Ragas to Evaluate Application Performance

Ragas offers a comprehensive set of metrics to assess the quality of question-answering across the entire application pipeline. These metrics help ensure that both the retrieval and generation phases of a RAG (Retrieval-Augmented Generation) system perform effectively.

Here are the key evaluation metrics provided by Ragas:

* Overall response quality
    * Answer correctness: Measures how accurate the generated answers are in relation to the actual knowledge in the dataset.
* Generation phase evaluation
    * Answer relevance: Evaluates whether the generated answer is relevant to the user’s question.
    * Faithfulness: Checks if the answer is factually consistent with the retrieved reference materials, ensuring it doesn’t introduce incorrect or fabricated information.
* Retrieval phase evaluation
    * Context precision: Assesses whether the retrieved context contains a high proportion of relevant information related to the correct answer.
    * Context recall: Measures how many of the relevant reference materials are successfully retrieved; a higher score indicates fewer relevant documents are missed.

These metrics provide a structured way to evaluate and improve the performance of your Q&A bot, ensuring that it delivers accurate, relevant, and well-supported responses.

<a href="https://img.alicdn.com/imgextra/i4/O1CN01b2lVQp21JZCJy6Nfe_!!6000000006964-0-tps-739-420.jpg" target="_blank">
<img src="https://img.alicdn.com/imgextra/i4/O1CN01b2lVQp21JZCJy6Nfe_!!6000000006964-0-tps-739-420.jpg" width="500">
</a>  



### 2.1 Evaluating the Response Quality of RAG Applications

#### 2.1.1 Quick Start  


When evaluating the overall response quality of a RAG chatbot, Ragas' Answer Correctness is an excellent metric. To calculate it, in addition to the answer generated by the RAG chatbot, you need to prepare the following two types of data:

1. question (The question input to the RAG chatbot)
2. ground_truth (The correct answer you already know)

To illustrate the differences in evaluation metrics for different responses, we have prepared three sets of RAG chatbot responses to the question:

**Question:**  
"Which department does Michael Johnson belong to?"

We will compare each model-generated **answer** against the known **ground truth**.

Three sample answers are provided below, each representing a different level of correctness:

- **Answer 1:** Based on the provided information, there is no mention of the department Michael Johnson belongs to. If you can provide more information about Michael Johnson, I may be able to help you find the answer.  
  ➤ This is considered an **invalid answer**, as it fails to provide the correct response even when context may have been available.

- **Answer 2:** Michael Johnson belongs to the Human Resources Department.  
  ➤ This is a **hallucinated answer**, as it provides a confident but incorrect response.

- **Answer 3:** Michael Johnson belongs to the Course Development Department.  
  ➤ This is the **correct answer**, matching the ground truth exactly.

We can then run the following code to calculate the score for response accuracy (i.e., Answer Correctness) using Ragas:

In [8]:
from langchain_community.llms.tongyi import Tongyi
from langchain_community.embeddings import DashScopeEmbeddings
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

data_samples = {
    'question': [
        'Which department is Michael Johnson in?',
        'Which department is Michael Johnson in?',
        'Which department is Michael Johnson in?'
    ],
    'answer': [
        'According to the provided information, there is no mention of the department where Michael Johnson works. If you can provide more information about Michael Johnson, I may be able to help you find the answer.',
        'Michael Johnson is in the HR department',
        'Michael Johnson is in the Course Development Department'
    ],
    'ground_truth':[
        'Michael Johnson is a member of the Course Development Department',
        'Michael Johnson is a member of the Course Development Department',
        'Michael Johnson is a member of the Course Development Department'
    ]
}

dataset = Dataset.from_dict(data_samples)
score = evaluate(
    dataset = dataset,
    metrics=[answer_correctness],
    llm=Tongyi(model_name="qwen-plus-0919"),
    embeddings=DashScopeEmbeddings(model="text-embedding-v3")
)
score.to_pandas()


| | question | answer | ground_truth | answer_correctness |
|---|---|---|---|---|
| 0 | Which department is Michael Johnson in? | According to the provided information, there i... | Michael Johnson is a member of the Course Deve... | 0.168191 |
| 1 | Which department is Michael Johnson in? | Michael Johnson is in the HR department | Michael Johnson is a member of the Course Deve... | 0.496046 |
| 2 | Which department is Michael Johnson in? | Michael Johnson is in the Course Development D... | Michael Johnson is a member of the Course Deve... | 0.998264 |


This code will generate a score that reflects how well each model-generated answer aligns with the known correct answer. By comparing the scores across different responses, you can identify which answers are accurate, which are incorrect, and which may be hallucinated.






As you can see, Ragas' Answer Correctness metric accurately reflects the performance of the three responses, with the more factually accurate answers receiving higher scores.
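
Once scores are available, a simple filter can surface the samples that deserve a closer look. The sketch below copies the three scores from the output above into plain Python dictionaries so it runs without Ragas or pandas; `flag_low_scores` is a hypothetical helper, and the 0.5 threshold is an arbitrary assumption that you would tune for your own data.

```python
from typing import Dict, List

# Scores copied from the evaluation output above, keyed by row index.
rows = [
    {"row": 0, "answer_correctness": 0.168191},
    {"row": 1, "answer_correctness": 0.496046},
    {"row": 2, "answer_correctness": 0.998264},
]


def flag_low_scores(rows: List[Dict], metric: str, threshold: float) -> List[Dict]:
    """Return the rows whose score for `metric` is below `threshold`."""
    return [row for row in rows if row[metric] < threshold]


needs_review = flag_low_scores(rows, "answer_correctness", threshold=0.5)
print([row["row"] for row in needs_review])
```

With this threshold, the refusal (row 0) and the hallucinated answer (row 1) are flagged for review, while the correct answer (row 2) passes.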

#### 2.1.2 Understanding the Calculation Process of Answer Correctness

Intuitively, the scoring of Answer Correctness aligns with your expectations. The scoring process uses an LLM (in the code, `llm=Tongyi(model_name="qwen-plus-0919")`) and an embedding model (in the code, `embeddings=DashScopeEmbeddings(model="text-embedding-v3")`) to calculate the result based on the **semantic similarity** and **factual accuracy** between the answer and the ground truth.

##### Semantic Similarity
Semantic similarity is determined by generating text vectors for both the answer and the ground truth using the embedding model and then comparing the two vectors. Cosine similarity is the comparison method most commonly used in Ragas; Euclidean distance and Manhattan distance are alternatives.
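
As an illustration of the comparison step, here is a minimal cosine similarity function in plain Python. The two vectors are made-up toy values, not real embeddings; in practice they would come from the embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the embeddings of the answer and the ground truth.
answer_vec = [0.2, 0.7, 0.1]
truth_vec = [0.25, 0.65, 0.1]
print(cosine_similarity(answer_vec, truth_vec))
```

Vectors pointing in the same direction score 1, and orthogonal vectors score 0, regardless of their lengths.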

##### Factual Accuracy

Factual accuracy measures the consistency of factual information between the answer and the ground truth. For example:

* answer: Michael Johnson is a colleague in the Course Development Department responsible for big data direction.
* ground_truth: Michael Johnson is a colleague in the Course Development Department responsible for technical writer tasks.

While both statements agree on the department (Course Development), they differ in the job role (big data direction versus technical writer tasks). Such differences are not easily captured through simple LLM or embedding model calls.

To address this, Ragas uses an LLM to generate lists of assertions for both the answer and the ground truth. It then compares these lists to identify matching and conflicting facts, allowing for a more nuanced evaluation of factual accuracy.

The following diagram illustrates how Ragas evaluates factual accuracy:

<a href="https://img.alicdn.com/imgextra/i1/O1CN01OKrBYc21eB610hjF3_!!6000000007009-2-tps-1967-347.png" target="_blank">
<img src="https://img.alicdn.com/imgextra/i1/O1CN01OKrBYc21eB610hjF3_!!6000000007009-2-tps-1967-347.png" width="1000">
</a>

1. Generate respective lists of assertions for the answer and the ground truth using an LLM. For example:
    - **Generate the assertion list for the answer**: Michael Johnson is a colleague in the Course Development Department responsible for big data direction. ---> ["*Michael Johnson is in the Course Development Department*", "*Michael Johnson is responsible for big data direction*"]
    - **Generate the assertion list for ground_truth**: Michael Johnson is a colleague in the Course Development Department responsible for technical writer tasks. ---> ["*Michael Johnson is in the Course Development Department*", "*Michael Johnson is responsible for technical writer tasks*"]

2. Traverse the assertion lists for the answer and ground_truth, initializing three lists: TP, FP, and FN.
    - For the assertions generated from the **answer**:
      - If the assertion matches one from the ground_truth, add it to the TP list. For example: "*Michael Johnson is in the Course Development Department*".
      - If the assertion cannot be found in the ground_truth list, add it to the FP list. For example: "*Michael Johnson is responsible for big data direction*".
    - For the assertions generated from the **ground_truth**:
      - If the assertion cannot be found in the answer list, add it to the FN list. For example: "*Michael Johnson is responsible for technical writer tasks*".
      > The judgment process in this step is performed entirely by the LLM.

3. Count the number of elements in the TP, FP, and FN lists, and calculate the F1 score as follows:



```
f1 score = tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0
```

Taking the above text as an example: f1 score = 1/(1+0.5*(1+1)) = 0.5
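
The three steps above can be sketched in a few lines of Python. This is a simplification, not the Ragas implementation: exact string matching stands in for the LLM's judgment of whether two assertions agree, and `classify_assertions` and `f1_score` are hypothetical names used only here.

```python
from typing import List, Tuple


def classify_assertions(
    answer_assertions: List[str], truth_assertions: List[str]
) -> Tuple[List[str], List[str], List[str]]:
    """Split assertions into TP, FP, and FN lists (exact match replaces the LLM judge)."""
    truth_set = set(truth_assertions)
    answer_set = set(answer_assertions)
    tp = [a for a in answer_assertions if a in truth_set]
    fp = [a for a in answer_assertions if a not in truth_set]
    fn = [t for t in truth_assertions if t not in answer_set]
    return tp, fp, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 score from the TP, FP, and FN counts."""
    return tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0.0


answer_assertions = [
    "Michael Johnson is in the Course Development Department",
    "Michael Johnson is responsible for big data direction",
]
truth_assertions = [
    "Michael Johnson is in the Course Development Department",
    "Michael Johnson is responsible for technical writer tasks",
]

tp, fp, fn = classify_assertions(answer_assertions, truth_assertions)
score = f1_score(len(tp), len(fp), len(fn))
print(score)  # 0.5
```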

##### Score Summary
After obtaining the scores for semantic similarity and factual accuracy, a weighted sum of the two can be calculated to obtain the final Answer Correctness score.  



```
Answer Correctness score = 0.25 * Semantic Similarity score + 0.75 * Factual Accuracy score
```
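
As a worked example of the weighted sum, the sketch below combines the factual accuracy score of 0.5 from the example above with a semantic similarity of 0.9. The 0.9 value is a made-up assumption for illustration, and the function assumes the two weights sum to 1.

```python
def answer_correctness_score(
    semantic_similarity: float,
    factual_accuracy: float,
    weights: tuple = (0.25, 0.75),
) -> float:
    """Weighted sum of the two component scores; weights are (semantic, factual)."""
    w_semantic, w_factual = weights
    return w_semantic * semantic_similarity + w_factual * factual_accuracy


# 0.25 * 0.9 + 0.75 * 0.5 = 0.225 + 0.375 = 0.6 (up to floating-point rounding)
print(answer_correctness_score(semantic_similarity=0.9, factual_accuracy=0.5))
```

Because factual accuracy carries three times the weight of semantic similarity, an answer that sounds similar to the ground truth but states wrong facts still receives a low overall score.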

### 2.2 Evaluating the recall effectiveness of retrieval

#### 2.2.1 Quick Start

The context precision and context recall metrics in Ragas are used to evaluate the effectiveness of retrieval recall in RAG (Retrieval-Augmented Generation) applications.

* Context Precision: Measures whether the relevant information from the retrieved context is ranked highly and makes up a large proportion (signal-to-noise ratio). It focuses on relevance.
* Context Recall: Assesses how well the retrieved context aligns with the ground truth, ensuring that important factual information is not missed. It focuses on factual accuracy.

In practical applications, these two metrics are often used together to provide a more comprehensive evaluation of the retrieval process.

To calculate these metrics, your dataset should include the following:
* question: The question input to the RAG application.
* contexts: The retrieved reference information.
* ground_truth: The correct answer you already know.

You can continue using the question "Which department is Michael Johnson in?" and prepare three sets of data for testing. Run the following code to calculate both context precision and context recall scores simultaneously.



In [10]:
from langchain_community.llms.tongyi import Tongyi
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall, context_precision

data_samples = {
    'question': [
        'Which department is Michael Johnson in?',
        'Which department is Michael Johnson in?',
        'Which department is Michael Johnson in?'
    ],
    'answer': [
        'Based on the provided information, there is no mention of the department where Michael Johnson works. If you can provide more information about Michael Johnson, I may be able to help you find the answer.',
        'Michael Johnson is in the HR department',
        'Michael Johnson is in the Course Development Department'
    ],
    'ground_truth': [
        'Michael Johnson is a member of the Course Development Department',
        'Michael Johnson is a member of the Course Development Department',
        'Michael Johnson is a member of the Course Development Department'
    ],
    'contexts': [
        ['Provides administrative management and coordination support, optimizing administrative workflows.', 'Performance Management Department Robert Carter EID-701 Course Development Department'],
        ['Michael Chen, Director of the Course Development Department', 'Newton discovered the law of universal gravitation'],
        ['Newton discovered the law of universal gravitation', 'Michael Johnson, engineer in the Course Development Department, has recently been responsible for technical writer tasks.'],
    ],
}

dataset = Dataset.from_dict(data_samples)
score = evaluate(
    dataset=dataset,
    metrics=[context_recall, context_precision],
    llm=Tongyi(model_name="qwen-plus-0919"))
score.to_pandas()


| | question | answer | ground_truth | contexts | context_recall | context_precision |
|---|---|---|---|---|---|---|
| 0 | Which department is Michael Johnson in? | Based on the provided information, there is no... | Michael Johnson is a member of the Course Deve... | [Provides administrative management and coordi... | 1.0 | 0.0 |
| 1 | Which department is Michael Johnson in? | Michael Johnson is in the HR department | Michael Johnson is a member of the Course Deve... | [Michael Chen, Director of the Course Developm... | 1.0 | 0.0 |
| 2 | Which department is Michael Johnson in? | Michael Johnson is in the Course Development D... | Michael Johnson is a member of the Course Deve... | [Newton discovered the law of universal gravit... | 1.0 | 0.5 |



From the data above, we can see that:
- The answer in the last row of data is accurate.
- The reference materials (contexts) retrieved during the process also contain the correct viewpoint, i.e., "Michael Johnson belongs to the Course Development Department." This situation is reflected in the context recall score being 1.
- However, not every piece of information in the contexts is relevant to the question and answer, for example, "Newton discovered the law of universal gravitation." This situation is reflected in the context precision score being 0.5.

#### 2.2.2 Understanding the calculation process of context recall and context precision

##### Context Recall

You have already learned from the previous text that context recall is a metric used to measure whether the retrieved contexts are consistent with the ground truth.
In Ragas, context recall evaluates what proportion of the viewpoints in the ground truth can be supported by the retrieved contexts. The calculation process is as follows:
1. An LLM breaks down the ground truth into a list of statements.
   > For example, from the ground truth "Michael Johnson is a member of the Course Development Department," the LLM might generate a list of statements such as ["Michael Johnson belongs to the Course Development Department"].
2. The LLM determines whether each statement is supported by evidence in the retrieved contexts.
   > For instance, this statement is supported by the contexts in the third row of data: "Michael Johnson, engineer in the Course Development Department, has recently been responsible for technical writer tasks."
3. The context recall score is calculated as the proportion of statements in the ground truth list that are supported by the contexts.
   > In this case, the score is 1/1 = 1, meaning all statements are supported.
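
The recall calculation can be sketched as follows. This is an illustration only: in Ragas the support decision is made by an LLM, whereas here a hypothetical keyword check (`keyword_judge`, driven by the hand-written `KEY_PHRASES` table) stands in for it so the example runs offline.

```python
from typing import Callable, List


def context_recall(
    statements: List[str],
    contexts: List[str],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    """Proportion of ground-truth statements supported by the retrieved contexts."""
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if is_supported(s, contexts))
    return supported / len(statements)


# Hypothetical stand-in for the LLM judge: a statement counts as supported
# if a single context contains all of its hand-picked key phrases.
KEY_PHRASES = {
    "Michael Johnson belongs to the Course Development Department": [
        "Michael Johnson",
        "Course Development Department",
    ],
}


def keyword_judge(statement: str, contexts: List[str]) -> bool:
    phrases = KEY_PHRASES[statement]
    return any(all(p in ctx for p in phrases) for ctx in contexts)


statements = ["Michael Johnson belongs to the Course Development Department"]
contexts_row_3 = [
    "Newton discovered the law of universal gravitation",
    "Michael Johnson, engineer in the Course Development Department, "
    "has recently been responsible for technical writer tasks.",
]
recall = context_recall(statements, contexts_row_3, keyword_judge)
print(recall)  # 1.0
```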

##### Context Precision

Context precision in Ragas not only measures what proportion of the retrieved contexts are relevant to the ground truth but also evaluates the ranking of those contexts. The calculation process is more complex:
1. Each context is evaluated sequentially based on whether it is relevant to the question and ground truth.
   * If relevant, it scores 1; otherwise, it scores 0.
   * For example, in the third row of data:
      * Context 1: "Newton discovered the law of universal gravitation" → irrelevant → score 0
      * Context 2: "Michael Johnson, engineer in the Course Development Department, has recently been responsible for technical writer tasks." → relevant → score 1


2. For each context, the precision score is calculated by dividing the cumulative sum of the scores of the current context and all preceding contexts by its position in the sequence.
    * For the third row of data:
        * Context 1 : 0/1 = 0
        * Context 2 : 1/2 = 0.5


3. The context precision score is obtained by summing the precision scores at the positions of relevant contexts (each precision score is multiplied by that context's 0/1 relevance score) and dividing by the number of relevant contexts.
    * For the third row of data: context_precision = (0 × 0 + 0.5 × 1) / 1 = 0.5
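
The calculation can be sketched as a short function that takes the 0/1 relevance scores in retrieval order. It is a simplified illustration rather than the Ragas source: the relevance scores are given by hand instead of being judged by an LLM, and only positions that hold a relevant context contribute to the sum, as in an average-precision calculation.

```python
from typing import List


def context_precision(relevance: List[int]) -> float:
    """Rank-aware precision over 0/1 relevance scores given in retrieval order."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted_sum = 0.0
    for position, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            # Precision at this position counts only when the context is relevant.
            weighted_sum += hits / position
    return weighted_sum / total_relevant


print(context_precision([0, 1]))  # third row: relevant context ranked second -> 0.5
print(context_precision([1, 0]))  # same contexts, relevant one ranked first -> 1.0
```

The second call shows why ranking matters: the same two contexts yield a higher score when the relevant one is retrieved first.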

> If you're still unclear about the calculation process, don’t worry—the key takeaway is that context precision evaluates how well the retrieval system ranks relevant contexts higher than irrelevant ones.
If you’re interested, we encourage you to explore [Ragas's source code](https://github.com/explodinggradients/ragas/blob/cc31f65d4b7c7cd6bbf686b9073a0dfaacfbcbc5/src/ragas/metrics/_context_precision.py#L250) for a deeper understanding of how these metrics are implemented.

### 2.3 Other Recommended Metrics to Explore

Ragas provides many other evaluation metrics, which are not introduced one by one here. You can visit the Ragas documentation to learn more about their use cases and working principles.

The list of metrics supported by Ragas can be found at: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/  



## 3. How to optimize based on Ragas metrics
The ultimate goal of evaluation is not just to obtain scores, but to identify areas for improvement based on these scores. You have now learned the concepts and calculation methods of three key metrics: answer correctness, context recall, and context precision.

When you observe that certain metrics have low scores, you should take action to improve them. Below are some optimization strategies based on specific metrics:

### 3.1 Context recall
This metric evaluates the **retrieval** phase of your RAG application. A low context recall score indicates that the retrieved contexts do not cover enough relevant information from the ground truth. If this metric has a low score, you can try the following optimizations:

- **Check the Knowledge Base**

    <img src="https://wanx.alicdn.com/wanx/1937257750879544/text_to_image_lite_v2/6e4ca1055a0c467992b6b719b443a33e_0.png?x-oss-process=image/watermark,image_aW1nL3dhdGVybWFyazIwMjQxMTEyLnBuZz94LW9zcy1wcm9jZXNzPWltYWdlL3Jlc2l6ZSxtX2ZpeGVkLHdfMzAzLGhfNTI=,t_80,g_se,x_10,y_10/format,webp" width="300">

    The knowledge base is the source of a RAG application. If the content within it is incomplete or insufficient, it can lead to inadequate reference information being retrieved, which directly impacts context recall—the ability of the system to retrieve relevant information that supports accurate and meaningful answers.

    To ensure the knowledge base is comprehensive and effective, you can:

    * Compare the knowledge base content with test samples to verify whether the information in the knowledge base is sufficient to support each query.
    * Use LLMs to assist in this process: By prompting the model to evaluate whether the knowledge base contains the necessary information to answer a given question, you can identify gaps or inconsistencies.

    If you find that some test samples lack relevant information in the knowledge base, it’s essential to supplement the knowledge base with additional data. This ensures that the RAG system has access to the most accurate and complete information, improving both retrieval quality and answer reliability.

- **Replace the Embedding Model**

    <img src="https://img.alicdn.com/imgextra/i4/O1CN01MMsV3b1U2GzviZv6y_!!6000000002459-2-tps-991-320.png" width="750">

    If your knowledge base content is already comprehensive and well-structured, it may be beneficial to replace the embedding model used for vectorization. A high-quality embedding model can better capture the deep semantic meaning of text, allowing for more accurate and meaningful similarity comparisons between queries and retrieved content.

    For example, consider the following:

    * Question: "Who is responsible for curriculum development?"
    * Relevant Knowledge Base Text: "Jimmy Peterson is a member of the Course Development Department."

    Even though the two sentences don’t share many surface-level words, a strong embedding model can recognize the semantic relationship between them, understanding that "responsible for curriculum development" and "member of the Course Development Department" are closely related in meaning.

- **Rewrite the Query**

    <img src="https://img.alicdn.com/imgextra/i1/O1CN01RpktVQ1FEtg8r4QCX_!!6000000000456-2-tps-1704-1322.png" width="800">

    As a developer, it's unrealistic to expect users to phrase their questions in a specific or detailed way. Therefore, you might receive vague or ambiguous queries such as: "Course Development Department," "Leave Request," or "Project Management." If these questions are directly input into a RAG application, they are unlikely to retrieve relevant and effective text segments. To address this, you can design a prompt template by organizing common employee questions and use LLMs to rewrite the queries, improving the accuracy of the retrieval process.
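Two of the LLM-assisted steps in the list above, checking knowledge base coverage and rewriting vague queries, can be sketched as follows. This is only a rough sketch under stated assumptions: the prompt wording, the function names `find_knowledge_gaps` and `rewrite_query`, and the `ask_llm` parameter (any callable that takes a prompt string and returns the model's text) are all hypothetical and should be adapted to your own data.

```python
from typing import Callable, Dict, List

# Hypothetical prompt wording; adapt it to your own knowledge base and domain.
COVERAGE_PROMPT = (
    "You are checking a knowledge base for completeness.\n"
    "Knowledge base excerpt:\n{knowledge_base}\n\n"
    "Question: {question}\n"
    "Expected answer: {ground_truth}\n\n"
    "Does the excerpt contain enough information to produce the expected answer? "
    "Reply with YES or NO only."
)

# Hypothetical rewriting template with one example built from a common employee question.
REWRITE_PROMPT = (
    "Rewrite the short or vague employee query below as one complete, specific "
    "question suitable for searching internal company documents.\n\n"
    "Query: Leave Request\n"
    "Rewritten: How do I request leave?\n\n"
    "Query: {query}\n"
    "Rewritten:"
)


def find_knowledge_gaps(
    samples: List[Dict[str, str]],
    knowledge_base: str,
    ask_llm: Callable[[str], str],
) -> List[str]:
    """Return the questions whose expected answer the LLM cannot find in the knowledge base."""
    gaps = []
    for sample in samples:
        prompt = COVERAGE_PROMPT.format(
            knowledge_base=knowledge_base,
            question=sample["question"],
            ground_truth=sample["ground_truth"],
        )
        verdict = ask_llm(prompt).strip().upper()
        if not verdict.startswith("YES"):
            gaps.append(sample["question"])
    return gaps


def rewrite_query(query: str, ask_llm: Callable[[str], str]) -> str:
    """Ask the LLM for a more specific query; fall back to the original if the reply is empty."""
    rewritten = ask_llm(REWRITE_PROMPT.format(query=query)).strip()
    return rewritten or query
```

With the LangChain `Tongyi` wrapper used in this chapter, `ask_llm` could be `Tongyi(model_name="qwen-plus-0919").invoke`, which accepts a prompt string and returns the completion text. Because the LLM's YES/NO verdicts can be wrong, it is worth spot-checking the reported gaps manually.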


### 3.2 Context Precision
Similar to context recall, the context precision metric evaluates the performance of a RAG application during the **retrieval** phase, but it focuses on whether the relevant text segments are ranked highly. A low context precision score suggests that while some relevant information may be retrieved, it is mixed with irrelevant or less important content, reducing the effectiveness of the retrieval.

If this metric has a low score, you can apply the same optimization measures used for context recall, such as improving the knowledge base or refining the retrieval algorithm. Additionally, you can implement **reranking** during the retrieval phase to improve the ranking of related text segments and ensure that the most relevant information appears first.
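To make the idea of reranking concrete, here is a minimal sketch. It assumes a pluggable scoring function; the `keyword_overlap` scorer below is a deliberately simple stand-in (hypothetical, not a real reranking model) so that the example runs without any external service. In practice you would replace it with a dedicated reranking model.

```python
import re
from typing import Callable, List, Optional


def keyword_overlap(query: str, context: str) -> float:
    """Toy relevance score: number of distinct query words that also appear in the context."""
    query_words = set(re.findall(r"\w+", query.lower()))
    context_words = set(re.findall(r"\w+", context.lower()))
    return float(len(query_words & context_words))


def rerank(
    query: str,
    contexts: List[str],
    score_fn: Callable[[str, str], float] = keyword_overlap,
    top_k: Optional[int] = None,
) -> List[str]:
    """Sort retrieved contexts by descending relevance score and optionally keep the top_k."""
    ranked = sorted(contexts, key=lambda context: score_fn(query, context), reverse=True)
    return ranked if top_k is None else ranked[:top_k]


retrieved = [
    "Newton discovered the law of universal gravitation",
    "Michael Johnson, engineer in the Course Development Department, "
    "has recently been responsible for curriculum development.",
]
reranked = rerank("Which department is Michael Johnson in?", retrieved)
print(reranked[0])  # the Michael Johnson context is now ranked first
```

After reranking, the relevance verdicts for this sample change from `[0, 1]` to `[1, 0]`: the relevant context is ranked first, which is exactly what context precision rewards.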

### 3.3 Answer Correctness
The answer correctness metric evaluates the end-to-end performance of a RAG system. If this metric has a low score while the previous two metrics have high scores, it indicates that the RAG system performs well in the **retrieval** phase but encounters issues in the **generation** phase. To improve the accuracy of generated answers, you can try the methods learned in previous tutorials, such as optimizing prompts, adjusting generation hyperparameters (such as temperature), switching to a more powerful LLM, or even fine-tuning the LLM (which will be introduced in later tutorials).
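The decision logic of this section can be summarized in a small helper. This is a sketch under assumptions: the function name `diagnose` and the `0.7` threshold are hypothetical, and you should choose a threshold that fits your own test set.

```python
from typing import Dict, List


def diagnose(scores: Dict[str, float], threshold: float = 0.7) -> List[str]:
    """Map low metric scores to the pipeline phase that most likely needs work.

    The default threshold is an arbitrary illustration, not a Ragas recommendation.
    """
    suggestions = []
    if scores["context_recall"] < threshold:
        suggestions.append(
            "retrieval: check the knowledge base, replace the embedding model, or rewrite queries"
        )
    if scores["context_precision"] < threshold:
        suggestions.append("retrieval: add reranking so relevant segments are ranked first")
    # Only blame generation when both retrieval metrics look healthy.
    if not suggestions and scores["answer_correctness"] < threshold:
        suggestions.append(
            "generation: optimize the prompt, adjust temperature, or use a stronger or fine-tuned LLM"
        )
    return suggestions


# Example scores for one evaluated sample.
row = {"answer_correctness": 0.998264, "context_recall": 1.0, "context_precision": 0.5}
print(diagnose(row))
```

For this sample only context precision falls below the threshold, so the helper points to reranking. Applying such a function to every row of `score.to_pandas()` gives a quick overview of which phase fails most often.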

## ✅ Summary

In this section, you learned how to set up automated testing for RAG chatbots.

Automated testing is a crucial tool for engineering optimization. With quantified metrics, it helps you move from intuition-based improvements to data-driven evaluation, so you can confirm that your RAG chatbot actually performs better after each change. This allows you to evaluate question-answering quality more efficiently, identify areas for improvement, and quantify the results of your optimizations.

While automated testing is powerful, it does not eliminate the need for human evaluation entirely. In practical applications, it is recommended to involve domain experts who can help build a test set that reflects real-world scenarios. These experts can ensure the test set covers a wide range of typical queries and edge cases, and it should be continuously updated to reflect evolving needs.

Additionally, since LLMs are not always 100% accurate, it’s advisable to regularly sample and review the results of automated testing in real-world use. Avoid frequently changing the LLM or embedding models unless necessary, as this can introduce instability. For Ragas, you can improve its performance by adjusting the default prompts, for example by adding domain-specific reference examples that align with your business context. (For more details, see the Further Reading section below.)



## Further Reading

### Changing the Prompt Template in Ragas
Many of Ragas’ evaluation metrics rely on large language models to compute scores. Like LlamaIndex, Ragas provides default prompt templates, but it also supports custom prompts that you can modify to suit your specific use case.
Example prompt templates are included in the `ragas_prompt` folder to help you customize the prompts used by Ragas for different evaluation metrics. You can refer to the following code to integrate these updated prompts into your workflow.

> Note: Ragas includes example cases in its prompts to guide the model on how to make judgments or generate lists of statements. You can replace or modify these examples to better align with your specific domain or use case.

```python
# These imports are assumed to match the ones used in earlier cells;
# they are repeated here so that this cell can run on its own.
from datasets import Dataset
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_community.llms.tongyi import Tongyi
from ragas import evaluate
from ragas.metrics import answer_correctness, context_precision, context_recall

# Import prompt templates
from ragas_prompt.ragas_test_prompt import ContextRecall, ContextPrecision, AnswerCorrectness

# Customize prompt settings for each metric
context_recall.context_recall_prompt.instruction = ContextRecall.context_recall_prompt["instruction"]
context_recall.context_recall_prompt.output_format_instruction = ContextRecall.context_recall_prompt["output_format_instruction"]
context_recall.context_recall_prompt.examples = ContextRecall.context_recall_prompt["examples"]

context_precision.context_precision_prompt.instruction = ContextPrecision.context_precision_prompt["instruction"]
context_precision.context_precision_prompt.output_format_instruction = ContextPrecision.context_precision_prompt["output_format_instruction"]
context_precision.context_precision_prompt.examples = ContextPrecision.context_precision_prompt["examples"]

answer_correctness.correctness_prompt.instruction = AnswerCorrectness.correctness_prompt["instruction"]
answer_correctness.correctness_prompt.output_format_instruction = AnswerCorrectness.correctness_prompt["output_format_instruction"]
answer_correctness.correctness_prompt.examples = AnswerCorrectness.correctness_prompt["examples"]

# Three test samples: same question and ground truth, different answers and retrieved contexts
data_samples = {
    'question': [
        'Which department is Michael Johnson in?',
        'Which department is Michael Johnson in?',
        'Which department is Michael Johnson in?'
    ],
    'answer': [
        'Based on the provided information, there is no mention of the department where Michael Johnson works. If you can provide more information about Michael Johnson, I may be able to help you find the answer.',
        'Michael Johnson is in the HR department',
        'Michael Johnson is in the Course Development Department'
    ],
    'ground_truth': [
        'Michael Johnson is a member of the Course Development Department',
        'Michael Johnson is a member of the Course Development Department',
        'Michael Johnson is a member of the Course Development Department'
    ],
    'contexts': [
        ['Provides administrative management and coordination support, optimizing administrative workflows.', 'Performance Management Department Han Shan Li Fei I902 041 Human Resources'],
        ['Li Kai, Director of the Course Development Department', 'Newton discovered the law of universal gravitation'],
        ['Newton discovered the law of universal gravitation', 'Michael Johnson, engineer in the Course Development Department, has recently been responsible for curriculum development.'],
    ],
}

dataset = Dataset.from_dict(data_samples)

# Run the evaluation with the customized prompts
score = evaluate(
    dataset=dataset,
    metrics=[answer_correctness, context_recall, context_precision],
    llm=Tongyi(model_name="qwen-plus-0919"),
    embeddings=DashScopeEmbeddings(model="text-embedding-v3"),
)

score.to_pandas()
```




<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>question</th>
      <th>answer</th>
      <th>ground_truth</th>
      <th>contexts</th>
      <th>answer_correctness</th>
      <th>context_recall</th>
      <th>context_precision</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Which department is Michael Johnson in?</td>
      <td>Based on the provided information, there is no mention of the department where Michael Johnson works. If you can provide more information about Michael Johnson, I may be able to help you find the answer.</td>
      <td>Michael Johnson is a member of the Course Development Department.</td>
      <td>[Providing administrative management and coordination support, optimizing administrative workflows. , Performance Management Department Han Shan Li Fei I902 041 ...</td>
      <td>0.166901</td>
      <td>0.0</td>
      <td>0.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Which department is Michael Johnson in?</td>
      <td>Michael Johnson is in the Human Resources Department.</td>
      <td>Michael Johnson is a member of the Course Development Department.</td>
      <td>[Li Kai, Director of the Course Development Department , Newton discovered the law of universal gravitation]</td>
      <td>0.196046</td>
      <td>0.0</td>
      <td>0.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Which department is Michael Johnson in?</td>
      <td>Michael Johnson is in the Course Development Department.</td>
      <td>Michael Johnson is a member of the Course Development Department.</td>
      <td>[Newton discovered the law of universal gravitation, Michael Johnson, an engineer in the Course Development Department, has recently been responsible for curriculum development]</td>
      <td>0.998264</td>
      <td>1.0</td>
      <td>0.5</td>
    </tr>
  </tbody>
</table>
</div>  



### More Evaluation Metrics
In addition to RAG, LLMs and natural language processing (NLP) are applied in a wide range of tasks, such as agents, natural language to SQL conversion, machine translation, text summarization, and question answering. Ragas provides a variety of metrics that can be used to evaluate the performance of these tasks, ensuring that the outputs of LLMs are accurate, relevant, and aligned with the expected results.


| Evaluation Metric            | Use Case | Metric Meaning                                                                 |
|---------------------|----------|--------------------------------------------------------------------------|
| [ToolCallAccuracy](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/agents/#example)    | Agent    | Evaluates the LLM's performance in identifying and invoking tools required to complete specific tasks. This metric is obtained by comparing reference tool calls with tool calls made by the LLM, with a value range of 0-1. |
| [DataCompyScore](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/sql/)      | Natural language to SQL   | Evaluates the difference between the results obtained from database queries using SQL statements generated by the LLM and the correct results. The value ranges from 0 to 1.                     |
| [LLMSQLEquivalence](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/sql/#non-execution-based-metrics)   | Natural language to SQL   | Unlike the previous metric, this does not require actual database retrieval; it only evaluates the differences between SQL statements generated by the LLM and the correct SQL statements. The value ranges from 0 to 1.   |
| [BleuScore](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/traditional/#bleu-score)           | General     | Evaluates the similarity between responses and correct answers based on n-grams. Initially designed for evaluating machine translation systems, this metric does not require the use of an LLM during evaluation, and its value ranges from 0 to 1. In the [2.7 tutorial](2_7_Improve_Model_Accuracy_and_Efficiency_via_Fine_Tuning.ipynb), you will learn how to fine-tune LLMs, and BleuScore can be used to measure the benefits brought by fine-tuning. |



## 🔥 Post-class Quiz

### 🔍 Single-choice Question
<details>
<summary style="cursor: pointer; padding: 12px;  border: 1px solid #dee2e6; border-radius: 6px;">
<b>What does the Context Precision metric measure? ❓</b>

- A. Overall response quality
- B. Whether retrieved text segments relevant to the question are ranked higher
- C. Whether the generated answer is relevant to the retrieved text segments
- D. Whether the generated answer is relevant to the question

**[Click to view the answer]**
</summary>

<div style="margin-top: 10px; padding: 15px; border: 1px solid #dee2e6; border-radius: 0 0 6px 6px;">

✅ **Reference Answer: B**  
📝 **Explanation**:  
- Context Precision directly evaluates the ranking quality of retrieval results, not the relevance of the content itself or the quality of the final answer.

</div>
</details>  

