# 2.4 Automated Evaluation of the Q&A Robot's Performance

## 🚄 Preface

The new Q&A robot may run into problems in actual use. For example, when a newcomer asks "how to request leave," the robot might give a generic response instead of answering from the content of the policy documents.

Just as conventional software development requires testing, your Q&A robot project needs an evaluation system so that issues like this can be diagnosed quickly. In addition, after optimizing for a specific issue, you should test a batch of questions to confirm that the change improves the overall performance of the Q&A robot.

## 🍁 Course Objectives

After completing this course, you will know:

* How to automate evaluation for large language model (LLM) applications.
* How to evaluate a RAG chatbot using Ragas.
* How to identify and solve problems through Ragas scores.

<!-- ## 📖 Course Outline

In this chapter, we will first understand some current issues with the RAG chatbot through a specific problem. Then, we will attempt to discover the issue by implementing a simple automated test ourselves. Finally, we will learn how to use the more mature RAG application testing framework, Ragas, to assess the performance of the RAG chatbot.

- 1.&nbsp;Evaluating RAG Application Performance
    - 1.1 Issues with the Q&A robot
    - 1.2 Checking RAG retrieval results to troubleshoot issues
    - 1.3 Attempting to establish an automated testing mechanism
- 2.&nbsp;Using Ragas to Evaluate Application Performance
    - 2.1 Evaluating the quality of responses from the RAG chatbot
        - 2.1.1 Quick start
        - 2.1.2 Understanding the calculation process of answer correctness
    - 2.2 Evaluating the recall effectiveness of retrieval
        - 2.2.1 Quick start
        - 2.2.2 Understanding the calculation process of context recall and context precision
-->

## 1. Evaluating RAG Application Performance

### 1.1 Issues with the Q&A Bot

In the previous chapter, you completed the development of a Q&A bot, but you noticed that it currently performs poorly on questions about employee information.

For example, Zhang Wei is the first employee in the employee information table, but your Q&A bot cannot answer the question "Which department does Zhang Wei belong to?"

<a href="https://img.alicdn.com/imgextra/i1/O1CN01MsuZGI1E0rrkVNNnO_!!6000000000290-0-tps-1626-278.jpg" target="_blank"><img src="https://img.alicdn.com/imgextra/i1/O1CN01MsuZGI1E0rrkVNNnO_!!6000000000290-0-tps-1626-278.jpg" width="600"></a>

In [None]:
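import os
from config.load_key import load_key

# Load the DashScope API key into the environment.
load_key()
# Show only the first five characters so the full key is never printed.
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"Your configured API Key is: {os.environ['DASHSCOPE_API_KEY'][:5] + '*' * 5}")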

In [None]:
from chatbot import rag

# Build a query engine on top of the index created in the previous chapter.
query_engine = rag.create_query_engine(rag.load_index())

print('Question: Which department is Zhang Wei in?')
response = query_engine.query('Which department is Zhang Wei in?')
print('Answer: ', end='')
response.print_response_stream()

### 1.2 Check RAG Retrieval Results for Problem Diagnosis

To resolve this issue, you need to confirm whether the reference materials that your RAG chatbot retrieves before answering the question contain relevant information about Zhang Wei.

Using the following code, you can obtain the reference information retrieved by the RAG chatbot when answering this question.

In [None]:
# Collect the text of every source node retrieved for this query.
contexts = [node.get_content() for node in response.source_nodes]
contexts

The output shows that the problem is caused by poor retrieval performance: the retrieved contexts do not contain the relevant information about Zhang Wei.

> This chapter focuses on building automated testing. The retrieval performance issue will be addressed in subsequent chapters.

### 1.3 Attempt to Establish an Automated Testing Mechanism

Although you can always find a way to locate the problem, it would be very time-consuming to confirm each time whether the cause was a retrieval error or a correct retrieval followed by an incorrect model-generated answer. You should establish a testing mechanism that can automatically test a batch of questions you have prepared.

From previous chapters, you already know that large language models (LLMs) can be used to answer questions and check for errors. Similarly, an LLM can be used to detect whether a response from the Q&A bot accurately answers the question, as long as reference information is provided in the prompt and the response format is restricted.

The `test_answer` function below checks whether the Q&A bot's response effectively answers the question. You pass the question and the Q&A bot's response into the prompt and restrict the reply to either "Valid response" or "Invalid response".

In [None]:
from chatbot import llm

def test_answer(question, answer):
    """Ask an LLM judge whether `answer` effectively responds to `question`."""
    prompt = (
        "You are a tester.\n"
        "You need to check whether the following answer effectively responds to the user's question.\n"
        "The reply can only be: Valid response or Invalid response. Do not provide any other information.\n"
        # Newlines keep the delimiters separate from the surrounding text.
        "------\n"
        f"The answer is: {answer}\n"
        "------\n"
        f"The question is: {question}"
    )
    return llm.invoke(prompt, model_name="qwen-max")

test_answer("Which department is Zhang Wei in?", "According to the provided information, there is no mention of the department where Zhang Wei works. If you can provide more information about Zhang Wei, I may be able to help you find the answer.")

The answer passed in above does not effectively respond to the question "Which department is Zhang Wei in?", so the LLM's reply of "Invalid response" is consistent with expectations.

In a RAG chatbot, apart from the effectiveness of responses, you also need to ensure that the retrieved reference information is useful. The `test_contexts` function below checks whether the retrieved reference information is useful. You pass the question and the retrieved reference information into the prompt and restrict the reply to either "The reference information is useful" or "The reference information is not useful".

In [None]:
def test_contexts(question, contexts):
    """Ask an LLM judge whether `contexts` can help answer `question`."""
    prompt = (
        "You are a tester.\n"
        "You need to check whether the following reference materials can help answer the question.\n"
        "The response can only be: The reference information is useful or The reference information is not useful. Do not provide any other information.\n"
        # Newlines keep the delimiters separate from the surrounding text.
        "------\n"
        f"The reference material is: {contexts}\n"
        "------\n"
        f"The question is: {question}"
    )
    return llm.invoke(prompt, model_name="qwen-max")

test_contexts("Which department is Zhang Wei in?", "Core, providing administrative management and coordination support, optimizing administrative workflows. \nAdministrative Department Qin Fei Cai Jing G705 034 Administration Administrative Specialist 13800000034 qinf@educompany.com Maintaining company archives and information systems, responsible for issuing company notices and announcements,\n\nSupport. \nPerformance Management Department Han Shan Li Fei I902 041 Human Resources Performance Specialist 13800000041 hanshan@educompany.com Establishing and maintaining employee performance records, regularly organizing performance review meetings, coordinating feedback from various departments, formulating evaluation processes and standards, ensuring performance")
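
The two checks above can be combined into a small batch runner, which is the kind of automated mechanism this section is working toward. The following is a minimal sketch and not part of the course code: `run_batch`, `test_cases`, and the two `fake_*_judge` stand-ins are hypothetical names introduced for illustration. The stand-ins avoid any LLM call so the sketch runs offline; in the notebook you would pass `test_answer` and `test_contexts` instead.

```python
from typing import Callable, Dict, List


def run_batch(
    test_cases: List[Dict[str, str]],
    answer_judge: Callable[[str, str], str],
    contexts_judge: Callable[[str, str], str],
) -> List[Dict[str, object]]:
    """Run both judges over every test case and record pass/fail flags."""
    results = []
    for case in test_cases:
        answer_verdict = answer_judge(case["question"], case["answer"])
        contexts_verdict = contexts_judge(case["question"], case["contexts"])
        results.append({
            "question": case["question"],
            # The judge prompts restrict replies to fixed phrases, so a
            # case-insensitive prefix check is enough to parse them.
            "answer_ok": answer_verdict.strip().lower().startswith("valid"),
            "contexts_ok": contexts_verdict.strip().lower().startswith(
                "the reference information is useful"
            ),
        })
    return results


def fake_answer_judge(question: str, answer: str) -> str:
    """Hypothetical offline stand-in for test_answer (no LLM call)."""
    return "Invalid response" if "no mention" in answer else "Valid response"


def fake_contexts_judge(question: str, contexts: str) -> str:
    """Hypothetical offline stand-in for test_contexts (no LLM call)."""
    if "Zhang Wei" in contexts:
        return "The reference information is useful"
    return "The reference information is not useful"


test_cases = [
    {
        "question": "Which department is Zhang Wei in?",
        "answer": "According to the provided information, there is no mention "
                  "of the department where Zhang Wei works.",
        "contexts": "Administrative Department Qin Fei Cai Jing G705 034 "
                    "Administration Administrative Specialist",
    },
]

results = run_batch(test_cases, fake_answer_judge, fake_contexts_judge)
print(results)
```

When both flags are false for a case, the retrieval step is the first place to look; when only `answer_ok` is false, the generation step is the more likely cause.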

With the two functions above, you have set up a preliminary prototype of an LLM testing project. However, the current implementation is still incomplete. For instance:

- Because LLMs sometimes hallucinate, their answers may look convincingly real even when they are wrong. The `test_answer` function cannot detect this, because it has no correct answer to compare against.
- The higher the proportion of relevant information in the retrieved references (the signal-to-noise ratio), the better. However, the current testing method is simple and does not take this factor into account.

You might consider using a mature testing framework to improve your testing project further, for example [Ragas](https://docs.ragas.io/en/stable), a testing framework specifically designed to evaluate the performance of RAG chatbots.

## 2. Using Ragas to Evaluate Application Performance

Ragas provides multiple metrics that can be used to evaluate the quality of question-answering across the entire chain of an application. For example:

- Evaluation of overall response quality:
  - Answer Correctness, used to assess the accuracy of answers generated by the RAG application.
- Evaluation of the generation phase:
  - Answer Relevancy, used to assess whether the answers generated by the RAG application are relevant to the question.
  - Faithfulness, used to evaluate the factual consistency between the answers generated by the RAG application and the retrieved reference materials.
- Evaluation of the recall phase:
  - Context Precision, used to evaluate whether entries in contexts that relate to the correct answer are ranked high and make up a high proportion (signal-to-noise ratio).
  - Context Recall, used to evaluate how many relevant reference materials are retrieved; a higher score means fewer relevant references are missed.

<a href="https://img.alicdn.com/imgextra/i4/O1CN01b2lVQp21JZCJy6Nfe_!!6000000006964-0-tps-739-420.jpg" target="_blank"><img src="https://img.alicdn.com/imgextra/i4/O1CN01b2lVQp21JZCJy6Nfe_!!6000000006964-0-tps-739-420.jpg" width="500"></a>

### 2.1 Evaluating the Response Quality of RAG Applications

#### 2.1.1 Quick Start

Ragas' Answer Correctness is an excellent metric for evaluating the overall response quality of a RAG chatbot. To calculate this metric, you need to prepare the following two types of data to evaluate the quality of the answer generated by the RAG chatbot:

1. question (the question input to the RAG chatbot)
2. ground_truth (the correct answer you already know)

To illustrate how the metric differs across responses, we have prepared three RAG chatbot responses to the question "*Which department does Zhang Wei belong to?*":

| question | ground_truth | answer |
|----------|--------------|--------|
| Which department does Zhang Wei belong to? | Zhang Wei belongs to the Teaching and Research Department. | Based on the provided information, there is no mention of the department Zhang Wei belongs to. If you can provide more information about Zhang Wei, I may be able to help you find the answer. (Invalid answer) |
| Which department does Zhang Wei belong to? | Zhang Wei belongs to the Teaching and Research Department. | Zhang Wei belongs to the Human Resources Department. (Hallucination) |
| Which department does Zhang Wei belong to? | Zhang Wei belongs to the Teaching and Research Department. | Zhang Wei belongs to the Teaching and Research Department. (Correct) |

You can then run the following code to calculate the Answer Correctness score for each response.

In [None]:
from langchain_community.llms.tongyi import Tongyi
from langchain_community.embeddings import DashScopeEmbeddings
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

# Three candidate answers to the same question, each scored against the same ground truth.
data_samples = {
    'question': [
        'Which department is Zhang Wei in?',
        'Which department is Zhang Wei in?',
        'Which department is Zhang Wei in?'
    ],
    'answer': [
        'According to the provided information, there is no mention of the department where Zhang Wei works. If you can provide more information about Zhang Wei, I may be able to help you find the answer.',
        'Zhang Wei is in the HR department',
        'Zhang Wei is in the Teaching and Research Department'
    ],
    'ground_truth': [
        'Zhang Wei is a member of the Teaching and Research Department',
        'Zhang Wei is a member of the Teaching and Research Department',
        'Zhang Wei is a member of the Teaching and Research Department'
    ]
}

dataset = Dataset.from_dict(data_samples)
# The LLM judges factual accuracy; the embedding model measures semantic similarity.
score = evaluate(
    dataset=dataset,
    metrics=[answer_correctness],
    llm=Tongyi(model_name="qwen-plus-0919"),
    embeddings=DashScopeEmbeddings(model="text-embedding-v3")
)
score.to_pandas()

<div><style scoped>    .dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }</style><table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>question</th>      <th>answer</th>      <th>ground_truth</th>      <th>answer_correctness</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>Which department does Zhang Wei belong to?</td>      <td>Based on the provided information, there is no mention of the department Zhang Wei belongs to. If you can provide more information about Zhang Wei, I may be able to help you find the answer.</td>      <td>Zhang Wei is a member of the Teaching and Research Department.</td>      <td>0.175227</td>    </tr>    <tr>      <th>1</th>      <td>Which department does Zhang Wei belong to?</td>      <td>Zhang Wei belongs to the Human Resources Department.</td>      <td>Zhang Wei is a member of the Teaching and Research Department.</td>      <td>0.193980</td>    </tr>    <tr>      <th>2</th>      <td>Which department does Zhang Wei belong to?</td>      <td>Zhang Wei belongs to the Teaching and Research Department.</td>      <td>Zhang Wei is a member of the Teaching and Research Department.</td>      <td>0.994619</td>    </tr>  </tbody></table></div>  

As you can see, Ragas's Answer Correctness metric accurately reflects the performance of the three responses, with the more factually accurate answers receiving higher scores.

#### 2.1.2 Understanding the Calculation Process of Answer Correctness

Intuitively, the Answer Correctness scores align with your expectations. The scoring process uses an LLM (in the code, `llm=Tongyi(model_name="qwen-plus-0919")`) and an embedding model (in the code, `embeddings=DashScopeEmbeddings(model="text-embedding-v3")`), and calculates the result from the **semantic similarity** and **factual accuracy** between the answer and ground_truth.

##### Semantic Similarity

Semantic similarity is obtained by generating text vectors for the answer and ground_truth with the embedding model, then computing the similarity between the two vectors. There are various ways to compare vectors, such as cosine similarity, Euclidean distance, and Manhattan distance. Ragas uses the most common method, cosine similarity.

##### Factual Accuracy

Factual accuracy measures the differences in factual descriptions between the answer and ground_truth. For example, consider the following two descriptions:

- answer: Zhang Wei is a colleague in the teaching and research department responsible for LLM courses.
- ground_truth: Zhang Wei is a colleague in the teaching and research department responsible for big data direction.

The answer and ground_truth differ on one fact (work direction) but agree on another (work department). Such differences are difficult to quantify through a single call to an LLM or an embedding model. Ragas instead uses an LLM to generate a list of assertions for the answer and another for ground_truth, then compares the elements of the two lists.

The following diagram can help you understand how Ragas measures factual accuracy:

<a href="https://img.alicdn.com/imgextra/i3/O1CN01NXmu4B1xpOGyZDDdB_!!6000000006492-0-tps-2382-1186.jpg" target="_blank"><img src="https://img.alicdn.com/imgextra/i3/O1CN01NXmu4B1xpOGyZDDdB_!!6000000006492-0-tps-2382-1186.jpg" width="600"></a>

1. Use an LLM to generate a list of assertions for the answer and for ground_truth. For example:
    - **Generate the assertion list for the answer**: Zhang Wei is a colleague in the teaching and research department responsible for LLM courses. ---> ["*Zhang Wei is in the teaching and research department*", "*Zhang Wei is responsible for LLM courses*"]
    - **Generate the assertion list for ground_truth**: Zhang Wei is a colleague in the teaching and research department responsible for big data direction. ---> ["*Zhang Wei is in the teaching and research department*", "*Zhang Wei is responsible for big data direction*"]
2. Traverse the two assertion lists and sort the assertions into three lists: TP, FP, and FN.
    - For the assertions generated from the **answer**:
      - If the assertion matches one from ground_truth, add it to the TP list. For example: "*Zhang Wei is in the teaching and research department*".
      - If the assertion cannot be found in the ground_truth list, add it to the FP list. For example: "*Zhang Wei is responsible for LLM courses*".
    - For the assertions generated from **ground_truth**:
      - If the assertion cannot be found in the answer list, add it to the FN list. For example: "*Zhang Wei is responsible for big data direction*".

<!-- ![image](https://alidocs.oss-cn-zhangjiakou.aliyuncs.com/res/ybEnBBXZ6LoPnP13/img/86c4839b-bd48-4faf-9842-c42599181339.png) -->

    > The judgments in this step are made entirely by the LLM.

3. Count the number of elements in the TP, FP, and FN lists, and calculate the F1 score as follows:

```shell
f1 score = tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0
```

Taking the example above: f1 score = 1 / (1 + 0.5 * (1 + 1)) = 0.5

##### Score Summary

After obtaining the scores for semantic similarity and factual accuracy, the final Answer Correctness score is their weighted sum:

```
Answer Correctness score = 0.25 * Semantic Similarity score + 0.75 * Factual Accuracy score
```
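
The calculation described in this section can be traced by hand with a short script. This is a minimal sketch, not the Ragas implementation: exact string matching stands in for the LLM's judgment of whether two assertions agree, the two-dimensional vectors are toy values rather than real embeddings, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def factual_f1(answer_assertions: List[str], truth_assertions: List[str]) -> float:
    """F1 over assertion lists, using the TP/FP/FN definitions from the text."""
    # Assumption: exact string equality replaces the LLM-based matching.
    tp = sum(1 for a in answer_assertions if a in truth_assertions)
    fp = sum(1 for a in answer_assertions if a not in truth_assertions)
    fn = sum(1 for t in truth_assertions if t not in answer_assertions)
    return tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0.0


def answer_correctness_score(similarity: float, f1: float) -> float:
    """Weighted sum with the 0.25 / 0.75 weights given in the text."""
    return 0.25 * similarity + 0.75 * f1


answer_assertions = [
    "Zhang Wei is in the teaching and research department",
    "Zhang Wei is responsible for LLM courses",
]
truth_assertions = [
    "Zhang Wei is in the teaching and research department",
    "Zhang Wei is responsible for big data direction",
]

f1 = factual_f1(answer_assertions, truth_assertions)
print(f1)  # 0.5, matching the worked example

# Toy vectors (hypothetical): identical directions give a similarity of 1.0.
similarity = cosine_similarity([1.0, 0.0], [1.0, 0.0])
print(answer_correctness_score(similarity, f1))  # 0.25 * 1.0 + 0.75 * 0.5 = 0.625
```

Because factual accuracy carries three times the weight of semantic similarity, an answer that is phrased like the ground truth but states a wrong fact still receives a low score, as the hallucinated answer in the quick start shows.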

### 2.2 Evaluating the Effectiveness of Retrieval Recall

#### 2.2.1 Quick Start

The context precision and context recall metrics in Ragas can be used to evaluate the effectiveness of retrieval recall in RAG applications.

- Context precision evaluates whether the entries in the retrieved reference information (contexts) that are relevant to the correct answer are ranked higher and make up a high proportion (signal-to-noise ratio), **focusing on relevance**.
- Context recall evaluates the factual consistency between contexts and ground_truth, **focusing on factual accuracy**.

In practical applications, these two metrics can be used together.

To calculate these metrics, your dataset should include the following information:

- **question**, the question input to the RAG application.
- **contexts**, the retrieved reference information.
- **ground_truth**, the correct answer you already know.

You can continue using the question "*Which department is Zhang Wei from?*" and prepare three sets of data. Run the code below to calculate the context precision and context recall scores together.

In [None]:
from langchain_community.llms.tongyi import Tongyi
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall, context_precision

# Each sample pairs the question and ground truth with the contexts retrieved for it.
data_samples = {
    'question': [
        'Which department is Zhang Wei in?',
        'Which department is Zhang Wei in?',
        'Which department is Zhang Wei in?'
    ],
    'answer': [
        'Based on the provided information, there is no mention of the department where Zhang Wei works. If you can provide more information about Zhang Wei, I may be able to help you find the answer.',
        'Zhang Wei is in the HR department',
        'Zhang Wei is in the Teaching and Research Department'
    ],
    'ground_truth': [
        'Zhang Wei is a member of the Teaching and Research Department',
        'Zhang Wei is a member of the Teaching and Research Department',
        'Zhang Wei is a member of the Teaching and Research Department'
    ],
    'contexts': [
        ['Provides administrative management and coordination support, optimizing administrative workflows.', 'Performance Management Department Han Shan Li Fei I902 041 Human Resources'],
        ['Li Kai, Director of the Teaching and Research Department', 'Newton discovered the law of universal gravitation'],
        ['Newton discovered the law of universal gravitation', 'Zhang Wei, engineer in the Teaching and Research Department, has recently been responsible for curriculum development'],
    ],
}

dataset = Dataset.from_dict(data_samples)
# Both metrics rely only on LLM judgments, so no embedding model is passed here.
score = evaluate(
    dataset=dataset,
    metrics=[context_recall, context_precision],
    llm=Tongyi(model_name="qwen-plus-0919")
)
score.to_pandas()

<div><style scoped>    .dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }</style><table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>question</th>      <th>answer</th>      <th>ground_truth</th>      <th>contexts</th>      <th>context_recall</th>      <th>context_precision</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>Which department is Zhang Wei in?</td>      <td>Based on the information provided, there is no mention of the department where Zhang Wei works. If you can provide more information about Zhang Wei, I may be able to help you find the answer.</td>      <td>Zhang Wei is a member of the Teaching and Research Department.</td>      <td>[Providing administrative management and coordination support, optimizing administrative workflows. , Performance Management Department Han Shan Li Fei I902 041 ...</td>      <td>0.0</td>      <td>0.0</td>    </tr>    <tr>      <th>1</th>      <td>Which department is Zhang Wei in?</td>      <td>Zhang Wei is in the Human Resources Department.</td>      <td>Zhang Wei is a member of the Teaching and Research Department.</td>      <td>[Li Kai, Director of the Teaching and Research Department , Newton discovered the law of universal gravitation]</td>      <td>0.0</td>      <td>0.0</td>    </tr>    <tr>      <th>2</th>      <td>Which department is Zhang Wei in?</td>      <td>Zhang Wei is in the Teaching and Research Department.</td>      <td>Zhang Wei is a member of the Teaching and Research Department.</td>      <td>[Newton discovered the law of universal gravitation, Zhang Wei, an engineer in the Teaching and Research Department, has recently been responsible for curriculum development]</td>      <td>1.0</td>      <td>0.5</td>    </tr>  </tbody></table></div>  

From the data above, we can see that:

- The answer in the last row of data is accurate.
- The reference materials (contexts) retrieved during the process also contain the correct viewpoint, i.e., "Zhang Wei belongs to the teaching and research department." This is reflected in the context recall score of 1.
- However, not every piece of information in the contexts is relevant to the question and answer, for example "Newton discovered the law of universal gravitation." This is reflected in the context precision score of 0.5.

#### 2.2.2 Understanding the Calculation Process of Context Recall and Context Precision

##### Context Recall

As described above, context recall measures whether contexts are consistent with ground_truth.

In Ragas, context recall describes what proportion of the viewpoints in ground_truth can be supported by contexts. The calculation process is as follows:

1. An LLM breaks down the ground_truth into n statements.

   For example, from the ground_truth "Zhang Wei is a member of the teaching and research department," a list of statements such as ["Zhang Wei belongs to the teaching and research department"] can be generated.
2. The LLM determines whether each statement is supported by evidence in the retrieved reference materials (contexts).

   For instance, this statement is supported by the contexts in the third row of data: "*Zhang Wei, an engineer in the teaching and research department, has been responsible for course development recently.*"
3. The proportion of statements in the ground_truth list that are supported by the contexts is the context_recall score.

   Here, the score is 1/1 = 1.

##### Context Precision

In Ragas, context precision not only measures what proportion of contexts are related to the ground_truth but also evaluates the ranking of contexts. The calculation process is more complex:

1. Read through the contexts in order and, based on the question and ground_truth, determine whether context<sub>i</sub> is relevant. A relevant context scores 1; an irrelevant one scores 0.

   For example, in the third row of data, context<sub>1</sub> ("Newton discovered the law of universal gravitation") is irrelevant, while context<sub>2</sub> is relevant.
2. For each context, calculate the precision score by dividing the cumulative sum of scores of the current context and all preceding contexts (numerator) by the position of the context in the sequence (denominator).

   For the third row of data, the precision score for context<sub>1</sub> is 0/1 = 0, and for context<sub>2</sub>, it is 1/2 = 0.5.
3. Sum up the precision scores at the positions of the relevant contexts (irrelevant positions contribute 0) and divide by the number of relevant contexts to obtain the context_precision.

   For the third row of data, context_precision = (0 + 0.5) / 1 = 0.5.

> If you cannot fully understand the calculation process above, it doesn't matter. You only need to know that this metric evaluates the ranking of contexts. If you're interested, we encourage you to read [Ragas's source code](https://github.com/explodinggradients/ragas/blob/cc31f65d4b7c7cd6bbf686b9073a0dfaacfbcbc5/src/ragas/metrics/_context_precision.py#L250).
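
The arithmetic behind both metrics can be reproduced once the LLM's judgments are known. The sketch below assumes those judgments are already available as plain lists (the hypothetical inputs `statement_supported` and `verdicts`); it does not call Ragas or an LLM. In the precision function, each precision value is multiplied by that position's verdict, so irrelevant positions contribute 0.

```python
from typing import List


def context_recall(statement_supported: List[bool]) -> float:
    """Share of ground_truth statements that the contexts support."""
    if not statement_supported:
        return 0.0
    return sum(statement_supported) / len(statement_supported)


def context_precision(verdicts: List[int]) -> float:
    """Rank-aware precision over relevance verdicts (1 = relevant, 0 = not)."""
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    weighted_sum = 0.0
    for k, verdict in enumerate(verdicts, start=1):
        # Precision among the first k contexts.
        precision_at_k = sum(verdicts[:k]) / k
        # Only positions holding a relevant context add to the sum.
        weighted_sum += precision_at_k * verdict
    return weighted_sum / relevant_total


# Third row of the quick start: one statement, supported by the contexts.
print(context_recall([True]))     # 1.0
# Third row: the irrelevant Newton context comes first, the relevant one second.
print(context_precision([0, 1]))  # 0.5
# Same two contexts with the relevant one ranked first.
print(context_precision([1, 0]))  # 1.0
```

The last two calls use the same set of contexts in a different order, which shows that context precision rewards placing relevant contexts near the top.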

### 2.3 Other Recommended Metrics to Explore

Ragas also provides many other metrics, which will not be introduced one by one here. You can visit the Ragas documentation to learn more about the applicable scenarios and working principles of these metrics.

The metrics supported by Ragas are listed at: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/

## 3. How to Optimize Based on Ragas Metrics

The ultimate goal of evaluation is not to obtain scores, but to determine the direction of optimization based on these scores. You have already learned the concepts and calculation methods of three metrics: answer correctness, context recall, and context precision. When a metric scores low, you should take the corresponding optimization measures.

### 3.1 Context Recall

The context recall metric evaluates the performance of a RAG application during the **retrieval** phase. If this metric has a low score, you can try optimizing from the following aspects:

- **Check the Knowledge Base**

    <img src="https://wanx.alicdn.com/wanx/1937257750879544/text_to_image_lite_v2/6e4ca1055a0c467992b6b719b443a33e_0.png?x-oss-process=image/watermark,image_aW1nL3dhdGVybWFyazIwMjQxMTEyLnBuZz94LW9zcy1wcm9jZXNzPWltYWdlL3Jlc2l6ZSxtX2ZpeGVkLHdfMzAzLGhfNTI=,t_80,g_se,x_10,y_10/format,webp" width="300">

    The knowledge base is the source of a RAG application. If its content is incomplete, too little reference information is recalled, which lowers context recall. You can compare the content of the knowledge base with the test samples and check whether the knowledge base can support each test sample (an LLM can assist with this process). If you find that some test samples lack relevant knowledge, you need to supplement the knowledge base.

- **Replace the Embedding Model**

    <img src="https://img.alicdn.com/imgextra/i1/O1CN01WjZDdd1FlPjxo2B44_!!6000000000527-0-tps-2476-1120.jpg" width="750">

    If your knowledge base content is already very complete, consider replacing the embedding model. A good embedding model can understand the deep semantic meaning of text. If two sentences are deeply related, they can still receive a high similarity score even when they do not look related on the surface. For example, if the question is "Who is responsible for curriculum development?" and the corresponding text segment in the knowledge base is "Zhang Wei is a member of the teaching and research department," an excellent embedding model can still assign a high similarity score to these two sentences despite the few overlapping words, thereby recalling the text segment "Zhang Wei is a member of the teaching and research department."

- **Rewrite the Query**

    <img src="https://img.alicdn.com/imgextra/i4/O1CN01Z28xyO1KXjzp6JNaq_!!6000000001174-0-tps-2352-1154.jpg" width="750">

    As a developer, you cannot realistically impose many requirements on how users ask questions. You might therefore receive vague questions such as "Teaching and Research Department," "Leave Request," or "Project Management." If such questions are fed directly into a RAG application, they are unlikely to recall effective text segments. You can collect common employee questions, design a prompt template from them, and use an LLM to rewrite queries, improving the accuracy of recall.

### 3.2 Context Precision

Like context recall, the context precision metric evaluates the performance of a RAG application during the **retrieval** phase, but it focuses more on whether relevant text segments are ranked higher. If this metric has a low score, you can try the optimization measures listed under context recall, and you can also add **reranking** to the retrieval phase to improve the ranking of relevant text segments.

### 3.3 Answer Correctness

The answer correctness metric evaluates the overall performance of a RAG system. If this metric has a low score while the previous two metrics have high scores, the RAG system performs well in the **retrieval** phase but has problems in the **generation** phase. You can try the methods learned in previous tutorials, such as optimizing prompts, adjusting generation hyperparameters of the LLM (such as temperature), replacing the LLM with a more powerful one, or even fine-tuning the LLM (which will be introduced in later tutorials) to improve the accuracy of generated answers.
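
The decision logic in this section can be summarized as a small helper that maps scores to the measures described above. This is a minimal sketch: the function name `suggest_optimizations` and the `0.6` threshold are hypothetical choices for illustration, and neither Ragas nor this course prescribes a specific cutoff.

```python
from typing import Dict, List


def suggest_optimizations(scores: Dict[str, float], threshold: float = 0.6) -> List[str]:
    """Map low metric scores to the optimization measures from this section."""
    suggestions = []
    if scores["context_recall"] < threshold:
        suggestions.append(
            "retrieval: check the knowledge base, replace the embedding model, "
            "or rewrite queries"
        )
    if scores["context_precision"] < threshold:
        suggestions.append(
            "retrieval: try the context recall measures and add reranking"
        )
    # Blame the generation phase only when retrieval looks healthy.
    retrieval_ok = not suggestions
    if scores["answer_correctness"] < threshold and retrieval_ok:
        suggestions.append(
            "generation: optimize prompts, adjust temperature, use a stronger "
            "LLM, or fine-tune"
        )
    return suggestions


# Scores similar to the third row of the earlier examples.
row = {"answer_correctness": 0.99, "context_recall": 1.0, "context_precision": 0.5}
for suggestion in suggest_optimizations(row):
    print(suggestion)
```

Checking retrieval first reflects the order used in this section: a generation fix is unlikely to help while the retrieved contexts are missing or poorly ranked.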

## ✅ Summary of this section

In this section, you learned how to establish automated testing for RAG chatbots.

Automated testing is an important means of engineering optimization. With quantified automated tests, you can move from **feeling** that the application performs better to **quantifying** the improvement with metrics. This not only helps you evaluate the question-answering quality of the RAG chatbot faster and find optimization directions, but also quantifies the optimization results you have achieved.

Of course, having automated testing does not mean that you no longer need human evaluation. In practical applications, it is recommended that you invite experts in the RAG chatbot's domain to jointly build a test set that reflects the distribution of real-world questions, and that you update the test set continuously.

At the same time, since LLMs cannot always achieve 100% accuracy, it is also recommended that you regularly sample the automated testing results and check their accuracy during actual use, and that you avoid frequently changing the LLM and embedding model. For Ragas, you can improve its performance by adjusting the prompts in the default evaluation method, for example by adding reference examples related to your business domain (for more details, see the further reading below).
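
The recommendation to sample automated results for human checking can be sketched as follows. This is an illustrative example only: the record fields `auto_verdict` and `human_verdict`, the function names, and the sample data are hypothetical, and in practice the human verdicts would come from your domain experts.

```python
import random
from typing import Dict, List


def sample_for_review(results: List[Dict], k: int, seed: int = 0) -> List[Dict]:
    """Draw up to k automated results for human review, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(results, min(k, len(results)))


def agreement_rate(reviewed: List[Dict]) -> float:
    """Share of reviewed items where the human agrees with the automated verdict."""
    if not reviewed:
        return 0.0
    agree = sum(1 for r in reviewed if r["auto_verdict"] == r["human_verdict"])
    return agree / len(reviewed)


# Hypothetical records: automated verdicts paired with later human verdicts.
records = [
    {"id": 1, "auto_verdict": "valid", "human_verdict": "valid"},
    {"id": 2, "auto_verdict": "invalid", "human_verdict": "invalid"},
    {"id": 3, "auto_verdict": "valid", "human_verdict": "invalid"},
    {"id": 4, "auto_verdict": "invalid", "human_verdict": "invalid"},
]

reviewed = sample_for_review(records, k=4)
print(agreement_rate(reviewed))  # 0.75: three of the four verdicts agree
```

A falling agreement rate over time is a signal to revisit the judge prompts or the test set before trusting new scores.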

## Further Reading

### Changing the Prompt Template in Ragas

Many evaluation metrics in Ragas are implemented based on large language models (LLMs). Similar to LlamaIndex, Ragas' default prompt template is in English but allows for customization. You can translate Ragas' default prompts for various metrics into Chinese, making the evaluation results more suitable for Chinese question-answering scenarios.

We provide a Chinese prompt template in the ragas_prompt folder. You can refer to the following code to adapt the Chinese prompts to different metrics in Ragas.

> Ragas includes examples in its prompts to help the large model understand how to make judgments or generate lists of opinions, etc. Therefore, you can also modify these examples to fit your business scenario.

| Ragas' default prompt template |
|------------------------------------|
| <a href="https://img.alicdn.com/imgextra/i4/O1CN016BngIW1nRtnidE0hK_!!6000000005087-0-tps-2710-1334.jpg" target="_blank"><img src="https://img.alicdn.com/imgextra/i4/O1CN016BngIW1nRtnidE0hK_!!6000000005087-0-tps-2710-1334.jpg" width="900"></a> |

| After modifying the prompt template |
|------------------------------------|
| <a href="https://img.alicdn.com/imgextra/i2/O1CN01m7wDt21fhSKQ7d8CT_!!6000000004038-0-tps-2548-1278.jpg" target="_blank"><img src="https://img.alicdn.com/imgextra/i2/O1CN01m7wDt21fhSKQ7d8CT_!!6000000004038-0-tps-2548-1278.jpg" width="900"></a> |

```python
# Imports this cell relies on (repeat them if you run the cell on its own)
from datasets import Dataset
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_community.llms.tongyi import Tongyi
from ragas import evaluate
from ragas.metrics import answer_correctness, context_precision, context_recall

# Import Chinese prompt templates
from ragas_prompt.chinese_prompt import ContextRecall, ContextPrecision, AnswerCorrectness

# Customize prompt settings for each metric
context_recall.context_recall_prompt.instruction = ContextRecall.context_recall_prompt["instruction"]
context_recall.context_recall_prompt.output_format_instruction = ContextRecall.context_recall_prompt["output_format_instruction"]
context_recall.context_recall_prompt.examples = ContextRecall.context_recall_prompt["examples"]

context_precision.context_precision_prompt.instruction = ContextPrecision.context_precision_prompt["instruction"]
context_precision.context_precision_prompt.output_format_instruction = ContextPrecision.context_precision_prompt["output_format_instruction"]
context_precision.context_precision_prompt.examples = ContextPrecision.context_precision_prompt["examples"]

answer_correctness.correctness_prompt.instruction = AnswerCorrectness.correctness_prompt["instruction"]
answer_correctness.correctness_prompt.output_format_instruction = AnswerCorrectness.correctness_prompt["output_format_instruction"]
answer_correctness.correctness_prompt.examples = AnswerCorrectness.correctness_prompt["examples"]

# Three samples for the same question: no answer, a wrong answer, and a correct answer
data_samples = {
    'question': [
        'Which department is Zhang Wei in?',
        'Which department is Zhang Wei in?',
        'Which department is Zhang Wei in?'
    ],
    'answer': [
        'Based on the provided information, there is no mention of the department where Zhang Wei works. If you can provide more information about Zhang Wei, I may be able to help you find the answer.',
        'Zhang Wei is in the HR department',
        'Zhang Wei is in the Teaching and Research Department'
    ],
    'ground_truth': [
        'Zhang Wei is a member of the Teaching and Research Department',
        'Zhang Wei is a member of the Teaching and Research Department',
        'Zhang Wei is a member of the Teaching and Research Department'
    ],
    'contexts': [
        ['Provides administrative management and coordination support, optimizing administrative workflows.', 'Performance Management Department Han Shan Li Fei I902 041 Human Resources'],
        ['Li Kai, Director of the Teaching and Research Department', 'Newton discovered the law of universal gravitation'],
        ['Newton discovered the law of universal gravitation', 'Zhang Wei, engineer in the Teaching and Research Department, has recently been responsible for curriculum development.'],
    ],
}

dataset = Dataset.from_dict(data_samples)

# Evaluate all three metrics with the customized prompts
score = evaluate(
    dataset=dataset,
    metrics=[answer_correctness, context_recall, context_precision],
    llm=Tongyi(model_name="qwen-plus-0919"),
    embeddings=DashScopeEmbeddings(model="text-embedding-v3")
)
score.to_pandas()
```

|   | question | answer | ground_truth | contexts | answer_correctness | context_recall | context_precision |
|---|----------|--------|--------------|----------|--------------------|----------------|-------------------|
| 0 | Which department is Zhang Wei in? | Based on the provided information, there is no mention of the department where Zhang Wei works. If you can provide more information about Zhang Wei, I may be able to help you find the answer. | Zhang Wei is a member of the Teaching and Research Department. | [Providing administrative management and coordination support, optimizing administrative workflows. , Performance Management Department Han Shan Li Fei I902 041 ... | 0.175227 | 0.0 | 0.0 |
| 1 | Which department is Zhang Wei in? | Zhang Wei is in the Human Resources Department. | Zhang Wei is a member of the Teaching and Research Department. | [Li Kai, Director of the Teaching and Research Department , Newton discovered the law of universal gravitation] | 0.193980 | 0.0 | 0.0 |
| 2 | Which department is Zhang Wei in? | Zhang Wei is in the Teaching and Research Department. | Zhang Wei is a member of the Teaching and Research Department. | [Newton discovered the law of universal gravitation, Zhang Wei, an engineer in the Teaching and Research Department, has recently been responsible for curriculum development] | 0.994619 | 1.0 | 0.5 |
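To see why the third sample scores 0.5 on `context_precision` even though the relevant segment was recalled, it helps to compute the rank-weighted precision by hand. The sketch below assumes the relevance verdict for each retrieved segment is already known (Ragas obtains these verdicts from an LLM); the helper name `context_precision_at_k` is hypothetical and this is not the library's implementation.

```python
from typing import Sequence


def context_precision_at_k(relevant: Sequence[bool]) -> float:
    """Average of precision@k over the ranks that hold a relevant context.

    relevant[i] is True when the context at rank i + 1 is judged relevant.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted_sum / hits if hits else 0.0


# Third sample: the irrelevant "Newton" segment is ranked first, the relevant one second.
print(context_precision_at_k([False, True]))  # 0.5
# If the relevant segment were ranked first, which is what reranking aims for:
print(context_precision_at_k([True, False]))  # 1.0
```

The relevant segment contributes precision@2 = 1/2 because one of the top two results is relevant, which is why pushing it to the first position would raise the score to 1.0.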

### More Evaluation Metrics

In addition to RAG, there are many applications or tasks of large language models (LLMs) or natural language processing (NLP), such as Agents, natural language to SQL, machine translation, summarization, etc. Ragas provides many metrics that can be used to evaluate these tasks.

| Evaluation Metric | Use Case | Metric Meaning |
|---------------------|----------|--------------------------------------------------------------------------|
| [ToolCallAccuracy](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/agents/#example) | Agent | Evaluates the LLM's performance in identifying and invoking tools required to complete specific tasks. This metric is obtained by comparing reference tool calls with tool calls made by the LLM, with a value range of 0-1. |
| [DataCompyScore](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/sql/) | natural language to SQL | Evaluates the difference between the results obtained from database queries using SQL statements generated by the LLM and the correct results. The value ranges from 0 to 1. |
| [LLMSQLEquivalence](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/sql/#non-execution-based-metrics) | natural language to SQL | Unlike the previous metric, this does not require actual database retrieval; it only evaluates the differences between SQL statements generated by the LLM and the correct SQL statements. The value ranges from 0 to 1. |
| [BleuScore](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/traditional/#bleu-score) | General | Evaluates the similarity between responses and correct answers based on n-grams. Initially designed for evaluating machine translation systems, this metric does not require the use of an LLM during evaluation, and its value ranges from 0 to 1. In the [2.7 tutorial](2_7_Fine_Tuning_LLMs_for_Improved_Accuracy_and_Efficiency.ipynb), you will learn how to fine-tune LLMs, and BleuScore can be used to measure the benefits brought by fine-tuning. |
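To make "based on n-grams" concrete, here is a minimal sketch of clipped n-gram precision, the building block of BLEU. It is not the Ragas `BleuScore` implementation: full BLEU combines several n-gram orders and applies a brevity penalty, and the whitespace tokenization and function names here are simplifying assumptions.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Share of candidate n-grams that also appear in the reference (clipped counts)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


answer = "Zhang Wei is in the Teaching and Research Department"
truth = "Zhang Wei is a member of the Teaching and Research Department"

print(round(ngram_precision(answer, truth, 1), 3))  # 0.889 (8 of 9 words match)
print(round(ngram_precision(answer, truth, 2), 3))  # 0.75 (6 of 8 word pairs match)
```

Because the computation only counts overlapping word sequences, no LLM call is needed, which is what makes this family of metrics cheap to run on large test sets.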

## 🔥 Post-class Quiz

### 🔍 Single-choice Question

<details>
<summary style="cursor: pointer; padding: 12px;  border: 1px solid #dee2e6; border-radius: 6px;">
<b>What does the Context Precision metric measure? ❓</b>

- A. Evaluation of overall response quality
- B. Evaluation of whether retrieved text segments relevant to the question are ranked higher
- C. Whether the generated answer is related to the retrieved text segments
- D. Whether the generated answer is relevant to the question

**[Click to view the answer]**
</summary>

<div style="margin-top: 10px; padding: 15px; border: 1px solid #dee2e6; border-radius: 0 0 6px 6px;">

✅ **Reference Answer: B**  
📝 **Explanation**:  

- Context Precision directly evaluates the ranking quality of retrieval results, not the relevance of the content itself or the quality of the final answer.

</div>
</details>

## ✉️ Evaluation and Feedback

Thank you for studying the Alibaba Cloud Large Model ACP Certification course. If you think there are parts of the course that are well-written or need improvement, we look forward to your [evaluation and feedback through this questionnaire](https://survey.aliyun.com/apps/zhiliao/Mo5O9vuie).

Your criticism and encouragement are both driving forces for our progress.