# Chapter 3: Evaluation Metrics for RAG Applications

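## 1. Introduction
This chapter covers three metrics commonly used to evaluate RAG applications:
1. Answer relevance: whether the RAG system's output is relevant to the question;
2. Context relevance: whether the context retrieved by the RAG system is relevant to the question;
3. Groundedness: whether the RAG system's output is grounded in the retrieved context.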

<img src="./images/ch03_traid.jpg" width="500">

First, install the evaluation framework used in this course. Skip this step if it is already installed.

In [1]:
# requirements
# pip install trulens_eval

To keep the output clean and easy to read, we suppress warning messages here.

In [2]:
# Ignore warnings so they do not clutter the output
import warnings
warnings.filterwarnings('ignore')

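Next, import the course's utility package `utils`, then set the OpenAI API key.
There are three ways to set the API key:
1. Set `OPENAI_API_KEY` as an environment variable, then read it through `utils`;
2. Set the key explicitly by assigning it to `openai.api_key`;
3. If you do not have an OpenAI key, use a third-party service by changing `openai.api_base`.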

In [3]:
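# Import the course's custom utility package
import utils

import os
import openai

# Option 1: read the OpenAI API key from the environment variable
# openai.api_key = utils.get_openai_api_key()

# Option 2: fill in your OpenAI API key here
# openai.api_key = ""

# Option 3: set a custom API key and API base URL, e.g. for a third-party API service
# openai.api_key = "sk- "
# openai.api_base = " "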


✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
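The first two key-setting options can be combined into one lookup. The following is a minimal sketch, not part of the course's `utils` package: the helper name `resolve_api_key` is hypothetical, and it only assumes that the key lives in the `OPENAI_API_KEY` environment variable as described above.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None,
                    env_var: str = "OPENAI_API_KEY") -> str:
    """Return the explicitly passed key if given, else the environment variable.

    Hypothetical helper for illustration; raises if no key can be found.
    """
    key = (explicit_key or os.environ.get(env_var, "")).strip()
    if not key:
        raise ValueError(
            f"No API key found; set {env_var} or pass one explicitly."
        )
    return key
```

The result could then be assigned to `openai.api_key`. Failing early with a clear message is a design choice here: a missing key otherwise only surfaces later as an authentication error during the first API call.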


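Next comes the setup for the tutorial.
First, reset the database. It will later store the questions, retrieval results, and answers, which makes them easier to manage and evaluate.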

In [4]:
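# Import the Tru class
from trulens_eval import Tru


# Instantiate the Tru class
tru = Tru()

# Reset the database
# The database will later store the questions, intermediate retrieval results, answers, and evaluation results
tru.reset_database()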


🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


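Next, import `SimpleDirectoryReader`, which is used to read PDFs, and load the specified PDF files.
Note that the default parameters are suited to English documents. For Chinese documents, the full-width sentence-ending punctuation needs to be converted to half-width characters in a later step.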

In [5]:
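# Set up the LlamaIndex reader
from llama_index import SimpleDirectoryReader

# Read the PDF documents and load them into document objects
# The Chinese document is the Wikipedia page for the entry "人工智能" (artificial intelligence)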
documents = SimpleDirectoryReader(
    input_files=["./data/人工智能.pdf"]
).load_data()

documents_en = SimpleDirectoryReader(
    input_files=["./data/eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()


For convenience, merge the pages of each PDF into a single `Document` object, separated by `"\n\n"`.

In [6]:
from llama_index import Document

# Merge the contents of documents into one large document, rather than one document per page
document = Document(text="\n\n".join([doc.text for doc in documents]))

document_en = Document(text="\n\n".join([doc.text for doc in documents_en]))


In [7]:
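# Replace Chinese sentence-ending punctuation with English punctuation for later processing
# This step can be skipped for English documents
# Without it, Chinese sentences cannot be split correctly, which inflates the sentence window
# and can make the input exceed gpt-3.5-turbo's maximum length
document.text = document.text.replace('。', '. ')
document.text = document.text.replace('！', '! ')
document.text = document.text.replace('？', '? ')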
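The three replacements above can also be done in a single pass. This is a minimal sketch using only the standard library; the names `FULLWIDTH_TO_HALFWIDTH` and `normalize_sentence_punctuation` are hypothetical, and the mapping covers only the three characters handled in the cell above.

```python
# Map each full-width sentence-ending mark to its half-width form plus a space,
# mirroring the three str.replace calls in the notebook cell.
FULLWIDTH_TO_HALFWIDTH = str.maketrans({"。": ". ", "！": "! ", "？": "? "})


def normalize_sentence_punctuation(text: str) -> str:
    """Return text with 。！？ replaced so an English sentence splitter can find boundaries."""
    return text.translate(FULLWIDTH_TO_HALFWIDTH)
```

Keeping the mapping in one table makes it easy to extend if other full-width marks turn out to matter for sentence splitting.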



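Now set up the index. First, choose gpt-3.5-turbo as the LLM; note that the version used here has a context window of 4096 tokens, so the input length needs attention.
Then choose the embedding model. We use BAAI/bge-small-zh-v1.5 for the Chinese document; the model size and language can be chosen according to the use case and the trade-off against available compute.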

In [8]:
# Set up the sentence index
from utils import build_sentence_window_index

from llama_index.llms import OpenAI

# Set the LLM to use
# "gpt-3.5-turbo" is the model name
# temperature controls the diversity of the generated text
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# Set the embedding model
# Here BAAI/bge-small-zh-v1.5 is run locally
# All of the document's content is indexed into the sentence index object
# Users in mainland China can switch to a Hugging Face mirror
sentence_index = build_sentence_window_index(
    document,
    llm,
    embed_model="local:BAAI/bge-small-zh-v1.5",
    save_dir="sentence_index"
)

sentence_index_en = build_sentence_window_index(
    document_en,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="sentence_index_en"
)


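Using a helper function from the utility package, build a query engine on top of the index created in the previous step. It is used for retrieval later.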

In [9]:
from utils import get_sentence_window_query_engine

# Create a query engine from the sentence_index object
# It is later used for retrieval in the RAG application
sentence_window_engine = \
get_sentence_window_query_engine(sentence_index)

In [10]:
sentence_window_engine_en = \
get_sentence_window_query_engine(sentence_index_en)

Before going further, test a single question to debug the pipeline and inspect the output.

In [11]:
output = sentence_window_engine.query(
    "AI的核心问题和长远目标是什么？")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [12]:
output_en = sentence_window_engine_en.query(
    "How do you create your AI portfolio?")
# Example: use the query engine for retrieval

In [13]:
# In real development, inspecting the metadata is a useful way to debug
output.metadata

{'7e8484a0-f7d2-4b64-b683-fa76b1dec6fe': {'window': '⼈⼯智能的研究是⾼度技术性和专业的，各分⽀领域都是深⼊且各不相通的，因⽽涉及范围极⼴[9].  ⼈⼯智能的\n研究可以分为⼏个技术问题.  其分⽀领域主要集中在解决具体问题，其中之⼀是，如何使⽤各种不同的⼯具完成\n特定的应⽤程序. \n AI的核⼼问题包括建构能够跟⼈类似甚⾄超卓的推理、知识、计划、学习、交流、感知、移动 、移物、使⽤⼯\n具和操控机械的能⼒等[10].  通⽤⼈⼯智能（GAI）⽬前仍然是该领域的长远⽬标[11].  ⽬前弱⼈⼯智能已经有初\n步成果，甚⾄在⼀些影像识别、语⾔分析、棋类游戏等等单⽅⾯的能⼒达到了超越⼈类的⽔平，⽽且⼈⼯智能的\n通⽤性代表着，能解决上述的问题的是⼀样的AI程序，⽆须重新开发算法就可以直接使⽤现有的AI完成任务，与\n⼈类的处理能⼒相同，但达到具备思考能⼒的统合强⼈⼯智能还需要时间研究，⽐较流⾏的⽅法包括统计⽅法，\n计算智能和传统意义的AI. ',
  'original_text': 'AI的核⼼问题包括建构能够跟⼈类似甚⾄超卓的推理、知识、计划、学习、交流、感知、移动 、移物、使⽤⼯\n具和操控机械的能⼒等[10]. '},
 '4e0b1c3d-5ba5-4c6b-b5c6-29df66a75281': {'window': '⼈⼯智能的\n研究可以分为⼏个技术问题.  其分⽀领域主要集中在解决具体问题，其中之⼀是，如何使⽤各种不同的⼯具完成\n特定的应⽤程序. \n AI的核⼼问题包括建构能够跟⼈类似甚⾄超卓的推理、知识、计划、学习、交流、感知、移动 、移物、使⽤⼯\n具和操控机械的能⼒等[10].  通⽤⼈⼯智能（GAI）⽬前仍然是该领域的长远⽬标[11].  ⽬前弱⼈⼯智能已经有初\n步成果，甚⾄在⼀些影像识别、语⾔分析、棋类游戏等等单⽅⾯的能⼒达到了超越⼈类的⽔平，⽽且⼈⼯智能的\n通⽤性代表着，能解决上述的问题的是⼀样的AI程序，⽆须重新开发算法就可以直接使⽤现有的AI完成任务，与\n⼈类的处理能⼒相同，但达到具备思考能⼒的统合强⼈⼯智能还需要时间研究，⽐较流⾏的⽅法包括统计⽅法，\n计算智能和传统意义的AI.  ⽬前有⼤量的⼯具应⽤了⼈⼯智能，其中包括搜索和数学优化、逻辑推演. ',
  'or

In [14]:
output_en.metadata

{'e0d0633c-9a38-4330-8980-2e4ceea28d30': {'window': 'Chapter 4: Scoping Successful AI Projects.\n Chapter 5: Finding Projects that Complement \nYour Career Goals.\n Chapter 6: Building a Portfolio of Projects that \nShows Skill Progression.\n Chapter 7: A Simple Framework for Starting Your AI \nJob Search.\n Chapter 8: Using Informational Interviews to Find \nthe Right Job.\n Chapter 9: Finding the Right AI Job for You.\n',
  'original_text': 'Chapter 7: A Simple Framework for Starting Your AI \nJob Search.\n'},
 '600b5970-e3cd-47cd-a2e0-ea2ffb2c5e79': {'window': 'Chapter 6: Building a Portfolio of Projects that \nShows Skill Progression.\n Chapter 7: A Simple Framework for Starting Your AI \nJob Search.\n Chapter 8: Using Informational Interviews to Find \nthe Right Job.\n Chapter 9: Finding the Right AI Job for You.\n Chapter 10: Keys to Building a Career in AI.\n Chapter 11: Overcoming Imposter Syndrome.\n',
  'original_text': 'Chapter 9: Finding the Right AI Job for You.\n'}}
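Each retrieved node in the metadata above carries an `original_text` (the matched sentence) and a `window` (that sentence plus its neighbours). The idea can be illustrated with a pure-Python sketch. This is not the implementation behind `build_sentence_window_index` or LlamaIndex; the function name, the symmetric window, and the default `window_size` are assumptions made for illustration.

```python
from typing import Dict, List


def build_sentence_windows(sentences: List[str],
                           window_size: int = 3) -> List[Dict[str, str]]:
    """For each sentence, keep it alongside up to window_size neighbours on each side."""
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({
            "original_text": sentence,                  # what gets embedded and matched
            "window": " ".join(sentences[start:end]),   # what gets passed to the LLM
        })
    return nodes
```

This also shows why sentence splitting matters: if the splitter finds no boundaries, each "sentence" and therefore each window becomes very long.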

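## 2. Feedback functions
A feedback function measures the relationship between two of the three elements of a RAG system: the question, the retrieved context, and the answer. In practice, a feedback function is a metric that scores one aspect of the RAG system's performance. This tutorial uses three of them: answer relevance, context relevance, and groundedness.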

<img src="./images/ch03_feedback.jpg" width="500">
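Conceptually, a feedback function takes two pieces of text and returns a score between 0 and 1. The toy scorer below makes that shape concrete. It is a sketch under stated assumptions: it uses word overlap rather than an LLM, the names `FeedbackFn` and `token_overlap_score` are hypothetical, and it is not how TruLens computes any of its metrics.

```python
import re
from typing import Callable

# A feedback function maps (text_a, text_b) to a score in [0, 1].
FeedbackFn = Callable[[str, str], float]


def token_overlap_score(reference: str, candidate: str) -> float:
    """Fraction of the reference's distinct word tokens that also appear in the candidate."""
    ref_tokens = set(re.findall(r"\w+", reference.lower()))
    cand_tokens = set(re.findall(r"\w+", candidate.lower()))
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & cand_tokens) / len(ref_tokens)
```

Because `\w+` treats a run of Chinese characters as one token, this toy scorer is only meaningful for whitespace-separated languages. That limitation is one reason an LLM-based provider is used in the rest of this chapter.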


In [15]:
import nest_asyncio

# Ensures that streamlit can be used later to manage and visualize the evaluation results
nest_asyncio.apply()


In [16]:
from trulens_eval import OpenAI as fOpenAI

# Initialize the OpenAI gpt-3.5-turbo model as the provider
# The provider is later used to help evaluate the RAG application's metrics: answer relevance, context relevance, groundedness.
provider = fOpenAI()

### 2.1 Answer Relevance
Answer relevance evaluates whether the RAG system's output is relevant to the question.

<img src="./images/ch03_answer_rele.jpg" width="500">


The feedback function for answer relevance has the following structure:

<img src="./images/ch03_structure_answer.jpg" width="500">

The ready-made `Feedback` class does the work here. We only need to specify the evaluation method, a name, and what to evaluate.

In [17]:
from trulens_eval import Feedback


# Set up the feedback for answer relevance
# provider.relevance_with_cot_reasons is the evaluation function, i.e. the LLM is called and evaluates using chain of thought
# on_input_output() means the evaluation is run on the input and the output
f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance"
).on_input_output()

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .


### 2.2 Context Relevance
Context relevance evaluates whether the context retrieved by the RAG system is relevant to the question.

<img src="./images/ch03_context_rele.jpg" width=500>

Its feedback function has the following structure:

<img src="./images/ch03_structure_context.jpg" width=500>

In [18]:
from trulens_eval import TruLlama

# Select the retrieved context
context_selection = TruLlama.select_source_nodes().node.text

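The setup is similar to the previous step; only the object being evaluated changes.
You can also change the evaluation method and compare the results.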

In [19]:
import numpy as np


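# Use provider.qs_relevance as the evaluation function
# on_input() means the evaluation uses the input
# on(context_selection) means the evaluation uses context_selection
# aggregate(np.mean) means np.mean is the aggregation function
# In effect: each retrieved context chunk in context_selection is scored, and the mean is taken as the final result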
f_qs_relevance = (
    Feedback(provider.qs_relevance,
             name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)

✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to __record__.app.query.rets.source_nodes[:].node.text .


In [20]:
import numpy as np

# As above: score each retrieved context chunk in context_selection and take the mean as the result
f_qs_relevance = (
    Feedback(provider.qs_relevance_with_cot_reasons,
             name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)


✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to __record__.app.query.rets.source_nodes[:].node.text .
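The per-chunk scoring and aggregation used for context relevance can be sketched without any evaluation framework. In this sketch the function name `aggregate_context_relevance` is hypothetical, the scoring function is supplied by the caller, and `statistics.fmean` stands in for `np.mean`.

```python
from statistics import fmean
from typing import Callable, Sequence


def aggregate_context_relevance(
    question: str,
    contexts: Sequence[str],
    score_fn: Callable[[str, str], float],
    aggregate: Callable[[Sequence[float]], float] = fmean,
) -> float:
    """Score each retrieved chunk against the question, then aggregate the scores."""
    if not contexts:
        raise ValueError("No retrieved contexts to score.")
    scores = [score_fn(question, context) for context in contexts]
    return aggregate(scores)
```

The choice of aggregator changes what the metric rewards. With chunk scores of 1.0, 0.5 and 0.0, the mean is 0.5, while passing `min` as `aggregate` gives 0.0 and so penalizes any single irrelevant chunk.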


### 2.3 Groundedness

In [21]:
from trulens_eval.feedback import Groundedness

grounded = Groundedness(groundedness_provider=provider)

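The last metric is groundedness, which evaluates whether the RAG system's output is grounded in the retrieved context.
The setup is similar to the previous ones.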

In [22]:
# Groundedness evaluation; see the explanations for answer relevance and context relevance
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons,
             name="Groundedness"
            )
    .on(context_selection)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
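One way to picture groundedness is as a statement-level check: split the answer into statements, test each against the retrieved context, and report the supported fraction. The sketch below follows that idea with a crude word-overlap test. It is an illustration under stated assumptions, not the TruLens `Groundedness` implementation (which relies on the LLM provider); the function names and the `threshold` value are hypothetical.

```python
import re
from typing import List, Sequence


def split_statements(answer: str) -> List[str]:
    """Split an answer into statements at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [part for part in parts if part]


def toy_groundedness(contexts: Sequence[str], answer: str,
                     threshold: float = 0.5) -> float:
    """Fraction of statements whose word tokens mostly appear in the contexts."""
    context_tokens = set(re.findall(r"\w+", " ".join(contexts).lower()))
    statements = split_statements(answer)
    if not statements:
        return 0.0
    supported = 0
    for statement in statements:
        tokens = set(re.findall(r"\w+", statement.lower()))
        if tokens and len(tokens & context_tokens) / len(tokens) >= threshold:
            supported += 1
    return supported / len(statements)
```

A score below 1.0 means at least one statement in the answer could not be matched to the retrieved context, which is the signal a groundedness metric is meant to surface.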


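## 3. Evaluation of the RAG application

When evaluating a RAG system, feedback functions can be implemented in several ways.
Human scoring gives the most accurate results, but it is expensive, so automatic evaluation is usually used in practice.
This tutorial uses gpt-3.5-turbo to evaluate the RAG system.
The advantage is that evaluation is fast and cheap; the drawback is that the results may be less accurate than human scoring.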

<img src="./images/ch03_bench.jpg" width="500">

The full evaluation pipeline for the RAG system is implemented below.

In [23]:
from trulens_eval import TruLlama
from trulens_eval import FeedbackMode


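# Instantiate the TruLlama class, which records the evaluation results
# sentence_window_engine is the query engine created earlier
# app_id is the application's ID, used to identify the application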
tru_recorder = TruLlama(
    sentence_window_engine,
    app_id="App_1",
    feedbacks=[
        f_qa_relevance,
        f_qs_relevance,
        f_groundedness
    ]
)

tru_recorder_en = TruLlama(
    sentence_window_engine_en,
    app_id="App_2",
    feedbacks=[
        f_qa_relevance,
        f_qs_relevance,
        f_groundedness
    ]
)

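Load the evaluation questions. To save time and reduce API costs, we use only 6 questions for the Chinese document.
In a real scenario, you can write more questions by hand or generate them from seed prompts to cover more cases.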

In [24]:
eval_questions = []
# Read the evaluation questions from ./data/eval_questions.txt; the file can be customized
with open('./data/eval_questions.txt', 'r', encoding='utf-8') as file:
    for line in file:
        item = line.strip()
        eval_questions.append(item)


In [25]:
eval_questions_en = []
with open('./data/eval_questions_en.txt', 'r', encoding='utf-8') as file:
    for line in file:
        item = line.strip()
        eval_questions_en.append(item)
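The two loading cells above share the same logic, so they could be folded into one helper. This is a minimal sketch; the name `load_questions` is hypothetical, and skipping blank lines is an added assumption about the desired behaviour (the original cells keep them as empty strings).

```python
from typing import List


def load_questions(path: str) -> List[str]:
    """Read one question per line, stripping whitespace and skipping blank lines."""
    questions = []
    # An explicit encoding avoids platform-dependent defaults when the file contains Chinese text.
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            item = line.strip()
            if item:
                questions.append(item)
    return questions
```

With this helper, the two lists would be built as `load_questions('./data/eval_questions.txt')` and `load_questions('./data/eval_questions_en.txt')`.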

In [26]:
eval_questions

['人工智能中的先验知识是如何被存储的？',
 '人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？',
 '管理者如何管理AI？',
 '强人工智能是什么？',
 '人工智能被滥用带来的危害？']

In [27]:
eval_questions_en

['What are the keys to building a career in AI?',
 "How can teamwork contribute to success in AI?'",
 "What is the importance of networking in AI?'",
 "What are some good habits to develop for a successful career?'",
 "How can altruism be beneficial in building a career?'",
 "What is imposter syndrome and how does it relate to AI?'",
 "Who are some accomplished individuals who have experienced imposter syndrome?'",
 "What is the first step to becoming good at AI?'",
 "What are some common challenges in AI?'",
 'Is it normal to find parts of AI challenging?']

In [28]:
eval_questions.append("如何在人工智能领域获得成功？")

In [29]:
eval_questions

['人工智能中的先验知识是如何被存储的？',
 '人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？',
 '管理者如何管理AI？',
 '强人工智能是什么？',
 '人工智能被滥用带来的危害？',
 '如何在人工智能领域获得成功？']

Now run the evaluation: query the RAG system for each question, then score its output with the feedback functions.

In [30]:
# Run and record each evaluation question
# Note: this can take a while, so please be patient
for question in eval_questions:
    with tru_recorder as recording:
        sentence_window_engine.query(question)

In [31]:
for question in eval_questions_en:
    with tru_recorder_en as recording:
        sentence_window_engine_en.query(question)

Afterwards, the recorded strings need to be decoded so that the Chinese text in the results is readable for analysis.

In [32]:
records, feedback = tru.get_records_and_feedback(app_ids=[])

# Convert the unicode escape sequences in the records back to Chinese characters for readability
def decode_unicode(s):
    return s.encode('ascii').decode('unicode-escape')

records['input'] = records['input'].apply(decode_unicode)
records['output'] = records['output'].apply(decode_unicode)

records.head()

Unnamed: 0,app_id,app_json,type,record_id,input,output,tags,record_json,cost_json,perf_json,ts,Answer Relevance,Context Relevance,Groundedness,Answer Relevance_calls,Context Relevance_calls,Groundedness_calls,latency,total_tokens,total_cost
0,App_1,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_e5a8b3d540d5ceefbfbcfc7ab8b66530,"""人工智能中的先验知识是如何被存储的？""","""人工智能中的先验知识是通过某种方式告知机器的知识，可以描述目标、特征、种类及对象之间的关系...",-,"{""record_id"": ""record_hash_e5a8b3d540d5ceefbfb...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-03-09T20:29:42.283509"", ""...",2024-03-09T20:29:44.818854,0.8,0.85,1.0,"[{'args': {'prompt': '人工智能中的先验知识是如何被存储的？', 're...","[{'args': {'question': '人工智能中的先验知识是如何被存储的？', '...",[{'args': {'source': '知识表⽰是⼈⼯智能领域的核⼼研究问题之⼀，它的⽬...,2,0,0.0
1,App_1,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_1606ea1a9ed8d5bbc0eb20da128e0433,"""人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？""","""人工智能的自我更新和自我提升可能导致其脱离人类的控制。""",-,"{""record_id"": ""record_hash_1606ea1a9ed8d5bbc0e...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-03-09T20:29:44.919348"", ""...",2024-03-09T20:29:46.465240,1.0,0.85,0.666667,[{'args': {'prompt': '人工智能的自我更新和自我提升是否可能导致其脱离人...,[{'args': {'question': '人工智能的自我更新和自我提升是否可能导致其脱...,[{'args': {'source': '⾄少，它本⾝应该有正常的情绪. ⼀个⼈⼯智能...,1,0,0.0
2,App_1,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_c4b95535dc5bf0b50a931bbeb7332f16,"""管理者如何管理AI？""","""Management should consider adjusting their ro...",-,"{""record_id"": ""record_hash_c4b95535dc5bf0b50a9...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-03-09T20:29:46.557848"", ""...",2024-03-09T20:29:48.281784,0.9,0.6,0.95,"[{'args': {'prompt': '管理者如何管理AI？', 'response':...","[{'args': {'question': '管理者如何管理AI？', 'statemen...",[{'args': {'source': 'AI逐渐普及后，将会在企业管理中扮演很重要的⾓⾊...,1,0,0.0
3,App_1,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_2b91f2a57d165175d58926fd1a2dd22c,"""强人工智能是什么？""","""强人工智能是一种观点，认为计算机本身具有思维，而不仅仅是用来模拟人类思维的工具。根据这个观...",-,"{""record_id"": ""record_hash_2b91f2a57d165175d58...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-03-09T20:29:48.371499"", ""...",2024-03-09T20:29:50.379916,1.0,,0.5,"[{'args': {'prompt': '强人工智能是什么？', 'response': ...",,[{'args': {'source': '强⼈⼯智能可以有两 类： ⼈类的⼈⼯智能，即机器...,2,0,0.0
4,App_1,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_2fb0adc005659d26ec770abae7e96329,"""人工智能被滥用带来的危害？""","""The misuse of artificial intelligence can pot...",-,"{""record_id"": ""record_hash_2fb0adc005659d26ec7...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-03-09T20:29:50.474544"", ""...",2024-03-09T20:29:53.133885,1.0,,,"[{'args': {'prompt': '人工智能被滥用带来的危害？', 'respons...",,,2,0,0.0
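The `decode_unicode` helper above turns escape sequences such as `\u4eba` back into characters, but `s.encode('ascii')` raises `UnicodeEncodeError` if a string already contains non-ASCII text. A more forgiving variant is sketched below; the name `decode_unicode_safe` is hypothetical, and returning the input unchanged on failure is an assumed design choice.

```python
def decode_unicode_safe(s: str) -> str:
    """Decode \\uXXXX escape sequences; return s unchanged if it cannot be decoded."""
    try:
        return s.encode("ascii").decode("unicode-escape")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already-decoded text (non-ASCII) or malformed escapes: leave as-is.
        return s
```

This makes the `apply` step safe to re-run on a column that has already been decoded.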


In [33]:
import pandas as pd

# Display the evaluation results
pd.set_option("display.max_colwidth", None)
display(records[["input", "output"] + feedback])

Unnamed: 0,input,output,Groundedness,Answer Relevance,Context Relevance
0,"""人工智能中的先验知识是如何被存储的？""","""人工智能中的先验知识是通过某种方式告知机器的知识，可以描述目标、特征、种类及对象之间的关系，也可以描述事件、时间、状态、原因和结果，以及任何需要机器存储的知识。""",1.0,0.8,0.85
1,"""人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？""","""人工智能的自我更新和自我提升可能导致其脱离人类的控制。""",0.666667,1.0,0.85
2,"""管理者如何管理AI？""","""Management should consider adjusting their roles by relinquishing administrative tasks, focusing on enhancing their comprehensive judgment and analytical prediction capabilities, treating AI as a colleague to form collaborative teams, and acknowledging that AI technologies also have limitations and bottlenecks.""",0.95,0.9,0.6
3,"""强人工智能是什么？""","""强人工智能是一种观点，认为计算机本身具有思维，而不仅仅是用来模拟人类思维的工具。根据这个观点，只要计算机运行适当的程序，它就具有自己的思维能力。""",0.5,1.0,
4,"""人工智能被滥用带来的危害？""","""The misuse of artificial intelligence can potentially lead to violations of laws such as copyright infringement. There have been cases where artificial intelligence technology has been used to remove mosaic from explicit videos or alter the appearance of individuals in videos. Additionally, there are concerns that the development of artificial intelligence could lead to uncontrollable situations, where AI may manipulate human emotions, influence financial markets, and even develop weapons that are beyond human comprehension. Furthermore, there are predictions that certain professions may be replaced by machines and AI in the future, potentially leading to significant job losses and societal disruptions.""",,1.0,
5,"""如何在人工智能领域获得成功？""","""通过利用概率和经济学上的概念，发展出能够处理不确定或不完整的信息的方法，寻找更有效的算法，并强调感知运动的重要性，可以在人工智能领域获得成功。""",,0.8,
6,"""What are the keys to building a career in AI?""","""Learning foundational technical skills, working on projects, finding a job, and being part of a supportive community are the keys to building a career in AI.""",,0.8,
7,"""How can teamwork contribute to success in AI?'""","""Teamwork can contribute to success in AI by allowing individuals to lead projects effectively, even without a formal leadership position. Working on larger AI projects often requires collaboration and the ability to steer projects by applying deep technical insights. This teamwork can help improve projects significantly and allow individuals to grow as leaders within the field.""",,0.9,
8,"""What is the importance of networking in AI?'""","""Networking in AI is crucial as it can provide valuable insights, guidance, and opportunities for individuals looking to advance in the field. By connecting with professionals who have experience in AI, individuals can gain knowledge about the industry, potential career paths, and current trends. Networking also allows for the exchange of information, which can help individuals stay updated on the latest developments in AI and build relationships that may lead to job opportunities or collaborations in the future. Additionally, networking can help individuals establish a support system within the AI community, enabling them to seek advice, mentorship, and guidance as they navigate their careers in this rapidly evolving field.""",,,
9,"""What are some good habits to develop for a successful career?'""","""Developing good habits in areas such as eating, exercise, sleep, personal relationships, work, learning, and self-care can help individuals move forward in their careers while maintaining their health. Additionally, aiming to lift others during each step of one's own journey can lead to better outcomes in the long run.""",,,


In [34]:
# Get the leaderboard
tru.get_leaderboard(app_ids=[])

app_id,Groundedness,Answer Relevance,Context Relevance,latency,total_cost
App_1,0.779167,0.916667,0.766667,1.666667,0.0
App_2,,0.85,,1.7,0.0
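The leaderboard values are consistent with a per-app mean that ignores missing scores. For example, App_1's six Answer Relevance scores in the results table (0.8, 1.0, 0.9, 1.0, 1.0, 0.8) average to about 0.9167. The sketch below reproduces that calculation in plain Python; the function name `leaderboard_mean` is hypothetical, and it is not the code behind `tru.get_leaderboard`.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def leaderboard_mean(
    rows: List[Tuple[str, Optional[float]]]
) -> Dict[str, Optional[float]]:
    """Average the scores per app_id, skipping rows whose score is missing (None)."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for app_id, score in rows:
        bucket = grouped[app_id]  # creates an entry even if every score is missing
        if score is not None:
            bucket.append(score)
    return {
        app_id: (sum(scores) / len(scores) if scores else None)
        for app_id, scores in grouped.items()
    }
```

Because missing scores are skipped rather than counted as zero, an app with only a few evaluated records can still show a high average, so the number of scored records is worth checking alongside the mean.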


In [35]:
# Run the dashboard
# Note: check whether the port is already in use; if it is, change the port number
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.





Dashboard started at http://10.31.153.170:8501 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>
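The port check mentioned in the dashboard cell can be scripted with the standard library. In this sketch the helper names are hypothetical, and the default of 8501 is taken from the dashboard URL printed above. Whether `run_dashboard` accepts a port argument in your version of `trulens_eval` is an assumption to verify against its documentation.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int = 8501, attempts: int = 20) -> int:
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if is_port_free(port):
            return port
    raise RuntimeError(f"No free port found in {start}..{start + attempts - 1}")
```

There is a small window between the check and the dashboard actually binding the port, so this is a convenience for picking a starting point rather than a guarantee.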