<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/callbacks/UpTrainCallback.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


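# UpTrain Callback Handler

UpTrain ([github](https://github.com/uptrain-ai/uptrain) || [website](https://github.com/uptrain-ai/uptrain/) || [docs](https://docs.uptrain.ai/)) is an open-source platform for evaluating and improving GenAI applications. It scores 20+ preconfigured checks (covering language, code, and embedding use cases), performs root cause analysis on failure cases, and gives insights on how to resolve them.

This notebook shows how to use the UpTrain callback handler to evaluate the different components of a RAG pipeline.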

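## 1. **RAG Query Engine Evaluations**:
The RAG query engine plays a crucial role in retrieving context and generating responses. To check its performance and response quality, we run the following evaluations:

- **[Context Relevance](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-relevance)**: Determines whether the retrieved context has sufficient information to answer the user query.
- **[Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy)**: Assesses whether the LLM's response can be verified against the retrieved context.
- **[Response Completeness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness)**: Checks whether the response contains all the information needed to answer the user query.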

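## 2. **Sub-Question Query Generation Evaluation**:
The SubQuestionQueryGeneration operator decomposes a question into sub-questions and uses a RAG query engine to generate a response for each one. To measure its accuracy, we use:

- **[Sub Query Completeness](https://docs.uptrain.ai/predefined-evaluations/query-quality/sub-query-completeness)**: Ensures that the sub-questions accurately and comprehensively cover the original query.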

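## 3. **Re-Ranking Evaluations**:
Re-ranking reorders nodes by their relevance to the query and selects the top nodes. Different evaluations are performed depending on the number of nodes returned after re-ranking.

a. Same number of nodes:
- **[Context Reranking](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-reranking)**: Checks whether the order of the re-ranked nodes is more relevant to the query than the original order.

b. Different number of nodes:
- **[Context Conciseness](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-conciseness)**: Checks whether the reduced number of nodes still provides all the required information.

Together, these evaluations check the robustness and effectiveness of the RAG query engine, the SubQuestionQueryGeneration operator, and the re-ranking process in the LlamaIndex pipeline.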


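#### **Note:**
- We ran these evaluations with a basic RAG query engine; the same evaluations can be run with an advanced RAG query engine as well.
- The same applies to the re-ranking evaluations: we used SentenceTransformerRerank, but the same evaluations can be run with other re-rankers.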


## Install Dependencies and Import Libraries

Install the notebook dependencies.


In [None]:
%pip install llama-index-readers-web
%pip install llama-index-callbacks-uptrain
%pip install -q html2text llama-index pandas tqdm uptrain torch sentence-transformers

Import libraries.


In [None]:
from getpass import getpass

from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.web import SimpleWebPageReader
from llama_index.core.callbacks import CallbackManager
from llama_index.callbacks.uptrain.base import UpTrainCallbackHandler
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.postprocessor import SentenceTransformerRerank

import os

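## Setup

UpTrain provides you with:
1. Dashboards with advanced drill-down and filtering options
1. Insights and common themes among failing cases
1. Observability and real-time monitoring of production data
1. Regression testing through integration with your CI/CD pipelines

You can choose between the following options for evaluating with UpTrain:
### 1. **UpTrain's Open-Source Software (OSS)**:
You can use the open-source evaluation service to evaluate your model. In this case, you need to provide an OpenAI API key. You can get yours [here](https://platform.openai.com/account/api-keys).

To view your evaluations in the UpTrain dashboard, set it up by running the following commands in your terminal: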

```bash
git clone https://github.com/uptrain-ai/uptrain
cd uptrain
bash run_uptrain.sh
```

This starts the UpTrain dashboard on your local machine. You can access it at `http://localhost:3000/dashboard`.

Parameters:
- key_type="openai"
- api_key="OPENAI_API_KEY"
- project_name="PROJECT_NAME"


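### 2. **UpTrain Managed Service and Dashboards**:
Alternatively, you can use UpTrain's managed service to evaluate your model. You can create a free UpTrain account [here](https://uptrain.ai/) and get free trial credits. If you want more trial credits, [book a call with the maintainers of UpTrain here](https://calendly.com/uptrain-sourabh/30min).

The benefits of using the managed service are:
1. No need to set up the UpTrain dashboard on your local machine.
1. Access to many LLMs without needing their API keys.

Once you have run the evaluations, you can view them in the UpTrain dashboard at `https://dashboard.uptrain.ai/dashboard`

Parameters:
- key_type="uptrain"
- api_key="UPTRAIN_API_KEY"
- project_name="PROJECT_NAME"


**Note:** `project_name` is the name of the project under which the evaluations are shown in the UpTrain dashboard.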
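
The two parameter sets above differ only in `key_type` and in which API key is passed. As a minimal sketch, the choice can be captured in a small helper that builds the keyword arguments for `UpTrainCallbackHandler`. The helper name `uptrain_handler_kwargs` is hypothetical, and restricting `key_type` to the two values listed above is an assumption of this sketch, not a statement about what the library accepts.

```python
def uptrain_handler_kwargs(key_type: str, api_key: str, project_name: str) -> dict:
    """Hypothetical helper: build keyword arguments for UpTrainCallbackHandler."""
    # Assumption: only the two options described in this notebook are allowed here.
    allowed = {"openai", "uptrain"}
    if key_type not in allowed:
        raise ValueError(f"key_type must be one of {sorted(allowed)}, got {key_type!r}")
    return {"key_type": key_type, "api_key": api_key, "project_name": project_name}


# Option 1: open-source evaluation service (placeholder values from the text above).
oss_kwargs = uptrain_handler_kwargs("openai", "OPENAI_API_KEY", "PROJECT_NAME")

# Option 2: UpTrain managed service (placeholder values from the text above).
managed_kwargs = uptrain_handler_kwargs("uptrain", "UPTRAIN_API_KEY", "PROJECT_NAME")
```

The resulting dictionary could then be unpacked into the handler, for example `UpTrainCallbackHandler(**oss_kwargs)`, once real keys replace the placeholders.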


## Create the UpTrain Callback Handler


In [None]:
os.environ["OPENAI_API_KEY"] = getpass()

callback_handler = UpTrainCallbackHandler(
    key_type="openai",
    api_key=os.environ["OPENAI_API_KEY"],
    project_name="uptrain_llamaindex",
)

Settings.callback_manager = CallbackManager([callback_handler])

## Load and Parse Documents

Load documents from Paul Graham's essay "What I Worked On".


In [None]:
documents = SimpleWebPageReader().load_data(
    [
        "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
    ]
)

Parse the document into nodes.


In [None]:
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

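# 1. RAG Query Engine Evaluation


The UpTrain callback handler automatically captures the query, the context, and the generated response, and runs the following three evaluations on the response (each scored from 0 to 1):
- **[Context Relevance](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-relevance)**: Determines whether the retrieved context has sufficient information to answer the user query.
- **[Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy)**: Assesses whether the LLM's response can be verified against the retrieved context.
- **[Response Completeness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness)**: Checks whether the response contains all the information needed to answer the user query.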


In [None]:
index = VectorStoreIndex.from_documents(
    documents,
)
query_engine = index.as_query_engine()

max_characters_per_line = 80
queries = [
    "What did Paul Graham do growing up?",
    "When and how did Paul Graham's mother die?",
    "What, in Paul Graham's opinion, is the most distinctive thing about YC?",
    "When and how did Paul Graham meet Jessica Livingston?",
    "What is Bel, and when and where was it written?",
]
for query in queries:
    response = query_engine.query(query)

100%|██████████| 1/1 [00:01<00:00,  1.33s/it]
100%|██████████| 1/1 [00:01<00:00,  1.36s/it]
100%|██████████| 1/1 [00:03<00:00,  3.50s/it]
100%|██████████| 1/1 [00:01<00:00,  1.32s/it]



Question: What did Paul Graham do growing up?
Response: Growing up, Paul Graham worked on writing short stories and programming. He started programming on an IBM 1401 in 9th grade using an early version of Fortran. Later, he got a TRS-80 computer and wrote simple games, a rocket prediction program, and a word processor. Despite his interest in programming, he initially planned to study philosophy in college before eventually switching to AI.

Context Relevance Score: 0.0
Factual Accuracy Score: 1.0
Response Completeness Score: 1.0



100%|██████████| 1/1 [00:01<00:00,  1.59s/it]
100%|██████████| 1/1 [00:00<00:00,  1.01it/s]
100%|██████████| 1/1 [00:01<00:00,  1.76s/it]
100%|██████████| 1/1 [00:01<00:00,  1.28s/it]



Question: When and how did Paul Graham's mother die?
Response: Paul Graham's mother died when he was 18 years old, from a brain tumor.

Context Relevance Score: 0.0
Factual Accuracy Score: 0.0
Response Completeness Score: 0.5



100%|██████████| 1/1 [00:01<00:00,  1.75s/it]
100%|██████████| 1/1 [00:01<00:00,  1.55s/it]
100%|██████████| 1/1 [00:03<00:00,  3.39s/it]
100%|██████████| 1/1 [00:01<00:00,  1.48s/it]



Question: What, in Paul Graham's opinion, is the most distinctive thing about YC?
Response: The most distinctive thing about Y Combinator, according to Paul Graham, is that instead of deciding for himself what to work on, the problems come to him. Every 6 months, a new batch of startups brings their problems, which then become the focus of YC. This engagement with a variety of startup problems and the direct involvement in solving them is what Graham finds most unique about Y Combinator.

Context Relevance Score: 1.0
Factual Accuracy Score: 0.3333333333333333
Response Completeness Score: 1.0



100%|██████████| 1/1 [00:01<00:00,  1.92s/it]
100%|██████████| 1/1 [00:00<00:00,  1.20it/s]
100%|██████████| 1/1 [00:02<00:00,  2.15s/it]
100%|██████████| 1/1 [00:01<00:00,  1.08s/it]



Question: When and how did Paul Graham meet Jessica Livingston?
Response: Paul Graham met Jessica Livingston at a big party at his house in October 2003.

Context Relevance Score: 1.0
Factual Accuracy Score: 0.5
Response Completeness Score: 1.0



100%|██████████| 1/1 [00:01<00:00,  1.82s/it]
100%|██████████| 1/1 [00:01<00:00,  1.14s/it]
100%|██████████| 1/1 [00:03<00:00,  3.19s/it]
100%|██████████| 1/1 [00:01<00:00,  1.50s/it]


Question: What is Bel, and when and where was it written?
Response: Bel is a new Lisp that was written in Arc. It was developed over a period of 4 years, from March 26, 2015 to October 12, 2019. The majority of Bel was written in England.

Context Relevance Score: 1.0
Factual Accuracy Score: 1.0
Response Completeness Score: 1.0
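
To compare the three metrics at a glance, the per-query scores printed above can be averaged. The sketch below copies the values by hand from the five outputs. It does not read anything from UpTrain, and the dictionary layout is only an illustration.

```python
from statistics import mean

# Scores copied from the five outputs above, in query order.
scores = {
    "Context Relevance": [0.0, 0.0, 1.0, 1.0, 1.0],
    "Factual Accuracy": [1.0, 0.0, 0.3333333333333333, 0.5, 1.0],
    "Response Completeness": [1.0, 0.5, 1.0, 1.0, 1.0],
}

# Average each metric across the five queries.
averages = {name: mean(values) for name, values in scores.items()}

for name, value in averages.items():
    print(f"{name}: {value:.2f}")
# Context Relevance: 0.60
# Factual Accuracy: 0.57
# Response Completeness: 0.90
```

On this small sample, response completeness is the strongest metric, while context relevance and factual accuracy leave more room for improvement.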






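# 2. Sub-Question Query Engine Evaluation

The **SubQuestionQueryEngine** tackles the problem of answering a complex query using multiple data sources. It first breaks the complex query down into sub-questions for each relevant data source, then gathers all the intermediate responses and synthesizes a final response.

The UpTrain callback handler automatically captures each sub-question and its response once generated, and runs the following three evaluations on the response (each scored from 0 to 1):
- **[Context Relevance](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-relevance)**: Determines whether the retrieved context has sufficient information to answer the user query.
- **[Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy)**: Assesses whether the LLM's response can be verified against the retrieved context.
- **[Response Completeness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness)**: Checks whether the response contains all the information needed to answer the user query.

In addition to the above evaluations, the callback handler also runs the following evaluation:
- **[Sub Query Completeness](https://docs.uptrain.ai/predefined-evaluations/query-quality/sub-query-completeness)**: Ensures that the sub-questions accurately and comprehensively cover the original query.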
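
The decompose, answer, and synthesize flow described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not how `SubQuestionQueryEngine` is implemented: the function `answer_with_sub_questions` and the three `toy_*` stand-ins are hypothetical and replace the LLM and retrieval calls with simple string operations.

```python
from typing import Callable, List


def answer_with_sub_questions(
    query: str,
    decompose: Callable[[str], List[str]],
    answer: Callable[[str], str],
    synthesize: Callable[[str, List[str]], str],
) -> str:
    """Hypothetical sketch of the sub-question flow."""
    sub_questions = decompose(query)  # 1. break the query into sub-questions
    intermediate = [answer(q) for q in sub_questions]  # 2. answer each one
    return synthesize(query, intermediate)  # 3. combine into a final response


# Toy stand-ins for the LLM and retrieval steps (assumptions for illustration only).
def toy_decompose(query: str) -> List[str]:
    return [f"{query} [{phase}]" for phase in ("before", "during", "after")]


def toy_answer(sub_question: str) -> str:
    return f"answer to: {sub_question}"


def toy_synthesize(query: str, answers: List[str]) -> str:
    return " | ".join(answers)


final = answer_with_sub_questions(
    "life around YC", toy_decompose, toy_answer, toy_synthesize
)
```

In this picture, Sub Query Completeness looks at the output of the decomposition step, while the other three evaluations look at each intermediate answer.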


In [None]:
# build index and query engine
vector_query_engine = VectorStoreIndex.from_documents(
    documents=documents,
    use_async=True,
).as_query_engine()

query_engine_tools = [
    QueryEngineTool(
        query_engine=vector_query_engine,
        metadata=ToolMetadata(
            name="documents",
            description="Paul Graham essay on What I Worked On",
        ),
    ),
]

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    use_async=True,
)

response = query_engine.query(
    "How was Paul Grahams life different before, during, and after YC?"
)

Generated 3 sub questions.
[documents] Q: What did Paul Graham work on before YC?
[documents] Q: What did Paul Graham work on during YC?
[documents] Q: What did Paul Graham work on after YC?
[documents] A: After Y Combinator, Paul Graham decided to focus on painting as his next endeavor.
[documents] A: Paul Graham worked on writing essays and working on Y Combinator during YC.
[documents] A: Before Y Combinator, Paul Graham worked on projects with his colleagues Robert and Trevor.

100%|██████████| 3/3 [00:02<00:00,  1.47it/s]
100%|██████████| 3/3 [00:00<00:00,  3.28it/s]
100%|██████████| 3/3 [00:01<00:00,  1.68it/s]
100%|██████████| 3/3 [00:01<00:00,  2.28it/s]



Question: What did Paul Graham work on after YC?
Response: After Y Combinator, Paul Graham decided to focus on painting as his next endeavor.

Context Relevance Score: 0.0
Factual Accuracy Score: 0.0
Response Completeness Score: 0.5


Question: What did Paul Graham work on during YC?
Response: Paul Graham worked on writing essays and working on Y Combinator during YC.

Context Relevance Score: 0.0
Factual Accuracy Score: 1.0
Response Completeness Score: 0.5


Question: What did Paul Graham work on before YC?
Response: Before Y Combinator, Paul Graham worked on projects with his colleagues Robert and Trevor.

Context Relevance Score: 0.0
Factual Accuracy Score: 0.0
Response Completeness Score: 0.5



100%|██████████| 1/1 [00:01<00:00,  1.24s/it]


Question: How was Paul Grahams life different before, during, and after YC?
Sub Query Completeness Score: 1.0






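# 3. Re-ranking

Re-ranking is the process of reordering nodes by their relevance to the query. LlamaIndex offers several re-ranking algorithms. This example uses SentenceTransformerRerank.

The re-ranker lets you set the number of top nodes (top n) to return after re-ranking. If this value equals the original number of nodes, the re-ranker only reorders the nodes without changing their count. Otherwise, it reorders the nodes and returns the top n.

We run different evaluations depending on the number of nodes returned after re-ranking.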
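
The rule described above can be written as a small function. This is a sketch for illustration only: `choose_rerank_evaluation` and the returned labels are hypothetical names, not part of the UpTrain or LlamaIndex APIs, and the callback handler makes this choice internally.

```python
def choose_rerank_evaluation(nodes_before: int, nodes_after: int) -> str:
    """Hypothetical helper mirroring the rule described in this section."""
    if nodes_after == nodes_before:
        # Same count: only the order changed, so check the new ordering.
        return "context_reranking"
    if nodes_after < nodes_before:
        # Fewer nodes: check that the reduced context is still sufficient.
        return "context_conciseness"
    raise ValueError("re-ranking cannot return more nodes than it received")


# The two configurations used in sections 3a and 3b of this notebook.
same_count = choose_rerank_evaluation(nodes_before=3, nodes_after=3)
fewer_nodes = choose_rerank_evaluation(nodes_before=5, nodes_after=2)
```

Here `nodes_before` corresponds to `similarity_top_k` on the query engine and `nodes_after` to `top_n` on the re-ranker.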


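## 3a. Re-ranking (Same Number of Nodes)

If the number of nodes returned after re-ranking is the same as the original number of nodes, the following evaluation is performed:

- **[Context Reranking](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-reranking)**: Checks whether the order of the re-ranked nodes is more relevant to the query than the original order.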


In [None]:
callback_handler = UpTrainCallbackHandler(
    key_type="openai",
    api_key=os.environ["OPENAI_API_KEY"],
    project_name="uptrain_llamaindex",
)
Settings.callback_manager = CallbackManager([callback_handler])

rerank_postprocessor = SentenceTransformerRerank(
    top_n=3,  # number of nodes after re-ranking
    keep_retrieval_score=True,
)

index = VectorStoreIndex.from_documents(
    documents=documents,
)

query_engine = index.as_query_engine(
    similarity_top_k=3,  # number of nodes before re-ranking
    node_postprocessors=[rerank_postprocessor],
)

response = query_engine.query(
    "What did Sam Altman do in this essay?",
)

100%|██████████| 1/1 [00:01<00:00,  1.89s/it]



Question: What did Sam Altman do in this essay?
Context Reranking Score: 1.0



100%|██████████| 1/1 [00:01<00:00,  1.88s/it]
100%|██████████| 1/1 [00:01<00:00,  1.44s/it]
100%|██████████| 1/1 [00:02<00:00,  2.77s/it]
100%|██████████| 1/1 [00:01<00:00,  1.45s/it]


Question: What did Sam Altman do in this essay?
Response: Sam Altman was asked to become the president of Y Combinator after the original founders decided to step down and reorganize the company for long-term sustainability.

Context Relevance Score: 1.0
Factual Accuracy Score: 1.0
Response Completeness Score: 0.5






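## 3b. Re-ranking (Different Number of Nodes)

If the number of nodes returned after re-ranking is smaller than the original number of nodes, the following evaluation is performed:

- **[Context Conciseness](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-conciseness)**: Checks whether the reduced number of nodes still provides all the required information.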


In [None]:
callback_handler = UpTrainCallbackHandler(
    key_type="openai",
    api_key=os.environ["OPENAI_API_KEY"],
    project_name="uptrain_llamaindex",
)
Settings.callback_manager = CallbackManager([callback_handler])

rerank_postprocessor = SentenceTransformerRerank(
    top_n=2,  # number of nodes after re-ranking
    keep_retrieval_score=True,
)

index = VectorStoreIndex.from_documents(
    documents=documents,
)

query_engine = index.as_query_engine(
    similarity_top_k=5,  # number of nodes before re-ranking
    node_postprocessors=[rerank_postprocessor],
)

# Use your advanced RAG
response = query_engine.query(
    "What did Sam Altman do in this essay?",
)

100%|██████████| 1/1 [00:02<00:00,  2.22s/it]



Question: What did Sam Altman do in this essay?
Context Conciseness Score: 0.0



100%|██████████| 1/1 [00:01<00:00,  1.58s/it]
100%|██████████| 1/1 [00:00<00:00,  1.19it/s]
100%|██████████| 1/1 [00:01<00:00,  1.62s/it]
100%|██████████| 1/1 [00:01<00:00,  1.42s/it]


Question: What did Sam Altman do in this essay?
Response: Sam Altman offered unsolicited advice to the author during a visit to California for interviews.

Context Relevance Score: 0.0
Factual Accuracy Score: 1.0
Response Completeness Score: 0.5






# UpTrain's Dashboard and Insights

Here is a short video showcasing the dashboard and the insights:

![llamaindex_uptrain.gif](https://uptrain-assets.s3.ap-south-1.amazonaws.com/images/llamaindex/llamaindex_uptrain.gif)
