<a href="https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/docs/examples/retrievers/auto_merging_retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Auto Merging Retriever

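In this notebook, we showcase our `AutoMergingRetriever`, which looks at a set of leaf nodes and recursively "merges" subsets of leaf nodes whose share of a common parent node exceeds a given threshold. This lets us consolidate potentially disparate, smaller contexts into a single larger context, which may help synthesis.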
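
The merge rule described above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the function names, the dictionary-based hierarchy, and the `ratio_thresh` value of 0.5 are all assumptions made for the example.

```python
from collections import defaultdict


def merge_once(retrieved_ids, parent_of, children_of, ratio_thresh=0.5):
    """One merge pass: swap retrieved siblings for their parent when the
    fraction of that parent's children that were retrieved exceeds ratio_thresh."""
    by_parent = defaultdict(list)
    for node_id in retrieved_ids:
        parent = parent_of.get(node_id)
        if parent is not None:
            by_parent[parent].append(node_id)

    merged, replaced = [], set()
    for parent, kids in by_parent.items():
        if len(kids) / len(children_of[parent]) > ratio_thresh:
            merged.append(parent)
            replaced.update(kids)

    kept = [n for n in retrieved_ids if n not in replaced]
    return kept + merged, bool(replaced)


def auto_merge(retrieved_ids, parent_of, children_of, ratio_thresh=0.5):
    """Repeat merge passes until nothing changes (the 'recursive' part)."""
    current, changed = list(retrieved_ids), True
    while changed:
        current, changed = merge_once(
            current, parent_of, children_of, ratio_thresh
        )
    return current


# Hypothetical two-level hierarchy: R -> {A, B}, A -> {a1..a4}, B -> {b1..b4}
children_of = {
    "R": ["A", "B"],
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3", "b4"],
}
parent_of = {c: p for p, kids in children_of.items() for c in kids}

# 3 of A's 4 children were retrieved (0.75 > 0.5), so they merge into A;
# only 1 of B's 4 children was retrieved, so b1 stays as-is.
print(auto_merge(["a1", "a2", "a3", "b1"], parent_of, children_of))
# → ['b1', 'A']
```

With a larger retrieved set, merged parents can themselves be merged into their own parent on the next pass, which is how several small chunks end up replaced by one large context.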

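You can define this hierarchy yourself over a set of documents, or you can make use of our brand-new text parser: a `HierarchicalNodeParser` that takes in a candidate set of documents and outputs an entire hierarchy of nodes, from "coarse-to-fine".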


In [None]:
%pip install llama-index-llms-openai
%pip install llama-index-readers-file pymupdf

In [None]:
%load_ext autoreload
%autoreload 2

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.


In [None]:
!pip install llama-index

## Load Data

Let's first load the Llama 2 paper: https://arxiv.org/pdf/2307.09288.pdf. This will be our test data.


In [None]:
!mkdir -p 'data/'
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

In [None]:
from pathlib import Path

from llama_index.readers.file import PDFReader
from llama_index.readers.file import PyMuPDFReader

In [None]:
loader = PyMuPDFReader()
# docs0 = loader.load_data(file=Path("./data/llama2.pdf"))
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))

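By default, the PDF reader creates a separate doc for each page.
For the sake of this notebook, we stitch the docs together into a single doc.
This will help us better highlight the auto-merging capabilities that "stitch" chunks together later on.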


In [None]:
from llama_index.core import Document

doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]

## Parse Chunk Hierarchy from Text, Load into Storage

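In this section we make use of the `HierarchicalNodeParser`. This outputs a hierarchy of nodes, from top-level nodes with larger chunk sizes to child nodes with smaller chunk sizes, where each child node has a parent node with a larger chunk size.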

By default, the hierarchy is:
- 1st level: chunk size 2048
- 2nd level: chunk size 512
- 3rd level: chunk size 128
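
To make the three-level structure concrete, here is a rough sketch of coarse-to-fine chunking. It is only an illustration under simplifying assumptions: it splits on raw character counts rather than tokens or sentences, and `Chunk`, `split_fixed`, and `build_hierarchy` are hypothetical names, not part of LlamaIndex.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Chunk:
    text: str
    level: int  # 0 = coarsest level
    parent: Optional["Chunk"] = field(default=None, repr=False)
    children: List["Chunk"] = field(default_factory=list, repr=False)


def split_fixed(text: str, size: int) -> List[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def build_hierarchy(text, chunk_sizes=(2048, 512, 128), level=0, parent=None):
    """Return every chunk at every level, each linked to its parent and children."""
    if level == len(chunk_sizes):
        return []
    all_chunks = []
    for piece in split_fixed(text, chunk_sizes[level]):
        chunk = Chunk(text=piece, level=level, parent=parent)
        if parent is not None:
            parent.children.append(chunk)
        all_chunks.append(chunk)
        # Recurse: split this piece again using the next, smaller chunk size.
        all_chunks.extend(build_hierarchy(piece, chunk_sizes, level + 1, chunk))
    return all_chunks


chunks = build_hierarchy("x" * 4096)
leaves = [c for c in chunks if not c.children]  # finest-level chunks
roots = [c for c in chunks if c.parent is None]  # coarsest-level chunks
print(len(chunks), len(roots), len(leaves))
# → 42 2 32
```

A 4096-character input yields 2 top-level chunks, 8 mid-level chunks, and 32 leaf chunks, so every leaf has exactly one parent at each coarser level.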

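We then load these nodes into storage. The leaf nodes are indexed and retrieved via a vector store; these are the nodes that are first retrieved directly via similarity search. The other nodes are retrieved from a docstore.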
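
The split between the two stores can be sketched as follows. This is a toy model, assuming a bag-of-words "embedding" and plain dictionaries in place of a real vector store and docstore; every name and every sample text here is hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (an assumption; real indexes use a learned model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Docstore: holds *every* node, keyed by id, with a pointer to its parent.
docstore = {
    "p1": {
        "text": "safety data scaling improves safety and preserves helpfulness",
        "parent": None,
    },
    "l1": {"text": "safety data scaling", "parent": "p1"},
    "l2": {"text": "preserves helpfulness", "parent": "p1"},
    "l3": {"text": "pretraining data mix", "parent": None},
}

# Vector index: holds embeddings for the leaf nodes only.
leaf_ids = ["l1", "l2", "l3"]
leaf_index = {i: embed(docstore[i]["text"]) for i in leaf_ids}


def retrieve_leaves(query, top_k=2):
    """Similarity search runs over leaves only."""
    q = embed(query)
    ranked = sorted(
        leaf_index, key=lambda i: cosine(q, leaf_index[i]), reverse=True
    )
    return ranked[:top_k]


def fetch_parent(node_id):
    """Parents are never embedded; they are looked up by id in the docstore."""
    parent_id = docstore[node_id]["parent"]
    return docstore[parent_id] if parent_id is not None else None


top = retrieve_leaves("safety data", top_k=1)
print(top, fetch_parent(top[0])["text"])
```

Keeping parents out of the vector index means similarity search stays precise on small chunks, while the larger context is still one id lookup away when a merge is warranted.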


In [None]:
from llama_index.core.node_parser import (
    HierarchicalNodeParser,
    SentenceSplitter,
)

In [None]:
node_parser = HierarchicalNodeParser.from_defaults()

In [None]:
nodes = node_parser.get_nodes_from_documents(docs)

In [None]:
len(nodes)

1029

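Here we import simple helper functions for fetching the "leaf" nodes and the root nodes within a node list. Leaf nodes are nodes that have no children of their own.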


In [None]:
from llama_index.core.node_parser import get_leaf_nodes, get_root_nodes

In [None]:
leaf_nodes = get_leaf_nodes(nodes)

In [None]:
len(leaf_nodes)

795

In [None]:
root_nodes = get_root_nodes(nodes)

### Load into Storage

We define a docstore and load all nodes into it.

We then define a `VectorStoreIndex` containing just the leaf-level nodes.


In [None]:
# define storage context
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core import StorageContext
from llama_index.llms.openai import OpenAI

docstore = SimpleDocumentStore()

# insert nodes into docstore
docstore.add_documents(nodes)

# define storage context (will include vector store by default too)
storage_context = StorageContext.from_defaults(docstore=docstore)

llm = OpenAI(model="gpt-3.5-turbo")

In [None]:
## Load index into vector index
from llama_index.core import VectorStoreIndex

base_index = VectorStoreIndex(
    leaf_nodes,
    storage_context=storage_context,
)

## Define Retriever


In [None]:
from llama_index.core.retrievers import AutoMergingRetriever

In [None]:
base_retriever = base_index.as_retriever(similarity_top_k=6)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)

In [None]:
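# query_str = "What were some lessons learned from red-teaming?"
# query_str = "Can you tell me about the key concepts for safety finetuning"
query_str = (
    "What could be the potential outcomes of adjusting the amount of safety"
    " data used in the RLHF stage?"
)

nodes = retriever.retrieve(query_str)
base_nodes = base_retriever.retrieve(query_str)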

> Merging 4 nodes into parent node.
> Parent node id: caf5f81c-842f-46a4-b679-6be584bd6aff.
> Parent node text: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: an...


In [None]:
len(nodes)

3

In [None]:
len(base_nodes)

6

In [None]:
from llama_index.core.response.notebook_utils import display_source_node

for node in nodes:
    display_source_node(node, source_length=10000)

**Node ID:** d4d67180-71c8-4328-b3f1-1e98fa42ab69<br>**Similarity:** 0.8694979150607424<br>**Text:** We also list two
qualitative examples where safety and helpfulness reward models don’t agree with each other in Table 35.
A.4.2
Qualitative Results on Safety Data Scaling
In Section 4.2.3, we study the impact of adding more safety data into model RLHF in a quantitative manner.
Here we showcase a few samples to qualitatively examine the evolution of model behavior when we scale
safety data in Tables 36, 37, and 38. In general, we are observing that Llama 2-Chat becomes safer responding
to unsafe prompts with more safety data used.<br>

**Node ID:** caf5f81c-842f-46a4-b679-6be584bd6aff<br>**Similarity:** 0.86168727941324<br>**Text:** We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators
write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to
the prompts, selecting the response that is safest according to a set of guidelines. We then use the human
preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to
sample from the model during the RLHF stage.
Better Long-Tail Safety Robustness without Hurting Helpfulness
Safety is inherently a long-tail problem,
where the challenge comes from a small number of very specific cases. We investigate the impact of Safety
RLHF by taking two intermediate Llama 2-Chat checkpoints—one without adversarial prompts in the RLHF
stage and one with them—and score their responses on our test sets using our safety and helpfulness reward
models. In Figure 14, we plot the score distribution shift of the safety RM on the safety test set (left) and that
of the helpfulness RM on the helpfulness test set (right). In the left hand side of the figure, we observe that
the distribution of safety RM scores on the safety set shifts to higher reward scores after safety tuning with
RLHF, and that the long tail of the distribution near zero thins out. A clear cluster appears on the top-left
corner suggesting the improvements of model safety. On the right side, we do not observe any gathering
pattern below the y = x line on the right hand side of Figure 14, which indicates that the helpfulness score
distribution is preserved after safety tuning with RLHF. Put another way, given sufficient helpfulness training
data, the addition of an additional stage of safety mitigation does not negatively impact model performance
on helpfulness to any notable degradation. A qualitative example is shown in Table 12.
Impact of Safety Data Scaling.
A tension between helpfulness and safety of LLMs has been observed in
previous studies (Bai et al., 2022a). To better understand how the addition of safety training data affects
general model performance, especially helpfulness, we investigate the trends in safety data scaling by
adjusting the amount of safety data used in the RLHF stage.<br>

**Node ID:** d9893bef-a5a7-4248-a0a1-d7c28800ae59<br>**Similarity:** 0.8546977459150967<br>**Text:** 0
0.2
0.4
0.6
0.8
1.0
Helpfulness RM Score before Safety RLHF
0.0
0.2
0.4
0.6
0.8
1.0
Helpfulness RM Score after Safety RLHF
0
1000
0
1000
Figure 14: Impact of safety RLHF measured by reward model score distributions. Left: safety reward
model scores of generations on the Meta Safety test set. The clustering of samples in the top left corner
suggests the improvements of model safety.<br>

In [None]:
for node in base_nodes:
    display_source_node(node, source_length=10000)

**Node ID:** 16328561-9ff7-4307-8d31-adf6bb74b71b<br>**Similarity:** 0.8770715326726375<br>**Text:** A qualitative example is shown in Table 12.
Impact of Safety Data Scaling.
A tension between helpfulness and safety of LLMs has been observed in
previous studies (Bai et al., 2022a). To better understand how the addition of safety training data affects
general model performance, especially helpfulness, we investigate the trends in safety data scaling by
adjusting the amount of safety data used in the RLHF stage.<br>

**Node ID:** e756d327-1a28-4228-ac38-f8a831b1bf77<br>**Similarity:** 0.8728111844788112<br>**Text:** A clear cluster appears on the top-left
corner suggesting the improvements of model safety. On the right side, we do not observe any gathering
pattern below the y = x line on the right hand side of Figure 14, which indicates that the helpfulness score
distribution is preserved after safety tuning with RLHF. Put another way, given sufficient helpfulness training
data, the addition of an additional stage of safety mitigation does not negatively impact model performance
on helpfulness to any notable degradation. A qualitative example is shown in Table 12.
Impact of Safety Data Scaling.<br>

**Node ID:** d4d67180-71c8-4328-b3f1-1e98fa42ab69<br>**Similarity:** 0.8697379697028405<br>**Text:** We also list two
qualitative examples where safety and helpfulness reward models don’t agree with each other in Table 35.
A.4.2
Qualitative Results on Safety Data Scaling
In Section 4.2.3, we study the impact of adding more safety data into model RLHF in a quantitative manner.
Here we showcase a few samples to qualitatively examine the evolution of model behavior when we scale
safety data in Tables 36, 37, and 38. In general, we are observing that Llama 2-Chat becomes safer responding
to unsafe prompts with more safety data used.<br>

**Node ID:** d9893bef-a5a7-4248-a0a1-d7c28800ae59<br>**Similarity:** 0.855087365309258<br>**Text:** 0
0.2
0.4
0.6
0.8
1.0
Helpfulness RM Score before Safety RLHF
0.0
0.2
0.4
0.6
0.8
1.0
Helpfulness RM Score after Safety RLHF
0
1000
0
1000
Figure 14: Impact of safety RLHF measured by reward model score distributions. Left: safety reward
model scores of generations on the Meta Safety test set. The clustering of samples in the top left corner
suggests the improvements of model safety.<br>

**Node ID:** d62ee107-9841-44b5-8b70-bc6487ad6315<br>**Similarity:** 0.8492541852986794<br>**Text:** Better Long-Tail Safety Robustness without Hurting Helpfulness
Safety is inherently a long-tail problem,
where the challenge comes from a small number of very specific cases. We investigate the impact of Safety
RLHF by taking two intermediate Llama 2-Chat checkpoints—one without adversarial prompts in the RLHF
stage and one with them—and score their responses on our test sets using our safety and helpfulness reward
models.<br>

**Node ID:** 312a63b3-5e28-4fbf-a3e1-4e8dc0c026ea<br>**Similarity:** 0.8488371951811564<br>**Text:** We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators
write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to
the prompts, selecting the response that is safest according to a set of guidelines. We then use the human
preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to
sample from the model during the RLHF stage.<br>

## Plug it into Query Engine




In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine

In [None]:
query_engine = RetrieverQueryEngine.from_args(retriever)
base_query_engine = RetrieverQueryEngine.from_args(base_retriever)

In [None]:
response = query_engine.query(query_str)

> Merging 4 nodes into parent node.
> Parent node id: 3671b20d-ea5e-4afc-983e-02be6ee8302d.
> Parent node text: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: an...


In [None]:
print(str(response))

Adjusting the amount of safety data used in the RLHF stage could potentially have the following outcomes:
1. Improved model safety: Increasing the amount of safety data used in RLHF may lead to improvements in model safety. This means that the model becomes better at responding to unsafe prompts and avoids generating unsafe or harmful outputs.
2. Thinning out of the long tail of safety RM scores: Increasing the amount of safety data may result in a shift in the distribution of safety reward model (RM) scores towards higher reward scores. This means that the model becomes more consistent in generating safe responses and reduces the occurrence of low safety scores.
3. Preservation of helpfulness performance: Adjusting the amount of safety data used in RLHF is not expected to negatively impact model performance on helpfulness. This means that the model's ability to generate helpful responses is maintained even after incorporating additional safety training.
4. Gathering pattern in helpful

In [None]:
base_response = base_query_engine.query(query_str)

In [None]:
print(str(base_response))

Adjusting the amount of safety data used in the RLHF stage could potentially lead to improvements in model safety. This can be observed by a clear cluster appearing on the top-left corner, suggesting enhanced model safety. Additionally, it is indicated that the helpfulness score distribution is preserved after safety tuning with RLHF, indicating that the addition of safety data does not negatively impact model performance on helpfulness.


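## Evaluation

We evaluate, more quantitatively, how well the hierarchical retriever works compared to the baseline retriever.

**WARNING**: This can be *expensive*, especially with GPT-4. Use caution and tune the sample size to fit your budget.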


In [None]:
from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset
from llama_index.llms.openai import OpenAI
import nest_asyncio

nest_asyncio.apply()

In [None]:
# NOTE: run this if the dataset isn't already saved
# Note: we only generate from the first 20 nodes, since the rest are references
eval_llm = OpenAI(model="gpt-4")
dataset_generator = DatasetGenerator(
    root_nodes[:20],
    llm=eval_llm,
    show_progress=True,
    num_questions_per_chunk=3,
)

In [None]:
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=60)

In [None]:
eval_dataset.save_json("data/llama2_eval_qr_dataset.json")

In [None]:
# optional
eval_dataset = QueryResponseDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)

### Compare Results

We run evaluations on each of the retrievers: correctness, semantic similarity, relevance, and faithfulness.


In [None]:
import asyncio
import nest_asyncio

nest_asyncio.apply()

In [None]:
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)

from collections import defaultdict
import pandas as pd

# NOTE: can uncomment other evaluators
evaluator_c = CorrectnessEvaluator(llm=eval_llm)
evaluator_s = SemanticSimilarityEvaluator(llm=eval_llm)
evaluator_r = RelevancyEvaluator(llm=eval_llm)
evaluator_f = FaithfulnessEvaluator(llm=eval_llm)
# defined here because the pairwise comparison later in this notebook uses it
pairwise_evaluator = PairwiseComparisonEvaluator(llm=eval_llm)

In [None]:
from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner

In [None]:
eval_qs = eval_dataset.questions
qr_pairs = eval_dataset.qr_pairs
ref_response_strs = [r for (_, r) in qr_pairs]

In [None]:
pred_responses = get_responses(eval_qs, query_engine, show_progress=True)

In [None]:
base_pred_responses = get_responses(
    eval_qs, base_query_engine, show_progress=True
)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:07<00:00,  8.17it/s]


In [None]:
import numpy as np

pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]

In [None]:
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "relevancy": evaluator_r,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)

In [None]:
eval_results = await batch_runner.aevaluate_responses(
    eval_qs, responses=pred_responses, reference=ref_response_strs
)

In [None]:
base_eval_results = await batch_runner.aevaluate_responses(
    eval_qs, responses=base_pred_responses, reference=ref_response_strs
)

In [None]:
results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Auto Merging Retriever", "Base Retriever"],
    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],
)
display(results_df)

| | names | correctness | relevancy | faithfulness | semantic_similarity |
|---|---|---|---|---|---|
| 0 | Auto Merging Retriever | 4.266667 | 0.916667 | 0.95 | 0.962196 |
| 1 | Base Retriever | 4.208333 | 0.916667 | 0.95 | 0.960602 |


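**Analysis**: The results are roughly the same. Correctness (4.27 vs. 4.21) and semantic similarity (0.962 vs. 0.961) differ only marginally, while relevancy and faithfulness are identical.

Let's also try our pairwise evaluation to see which answer GPT-4 prefers.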


In [None]:
batch_runner = BatchEvalRunner(
    {"pairwise": pairwise_evaluator}, workers=10, show_progress=True
)

In [None]:
pairwise_eval_results = await batch_runner.aevaluate_response_strs(
    eval_qs,
    response_strs=pred_response_strs,
    reference=base_pred_response_strs,
)
pairwise_score = np.array(
    [r.score for r in pairwise_eval_results["pairwise"]]
).mean()

In [None]:
pairwise_score

0.525

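**Analysis**: The pairwise comparison score measures how often the candidate answer (using the auto-merging retriever) is preferred over the base answer (using the base retriever). At 0.525, the two are roughly even.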
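
As a rough illustration of how such a score can be read, the sketch below averages per-question preferences. It assumes a convention of 1.0 when the candidate is preferred, 0.0 when the baseline is preferred, and 0.5 for a tie; the function name and the four sample scores are hypothetical and are not this notebook's actual results.

```python
def pairwise_preference_rate(scores):
    """Mean of per-question preference scores.

    Assumed convention: 1.0 = candidate preferred, 0.0 = baseline preferred,
    0.5 = tie. A mean near 0.5 means neither side is clearly preferred.
    """
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)


# Hypothetical scores for four questions: two wins, one loss, one tie.
sample_scores = [1.0, 0.0, 0.5, 1.0]
print(pairwise_preference_rate(sample_scores))
# → 0.625
```

Because ties count as half a win under this convention, a score close to 0.5 can come either from many ties or from an even split of wins and losses; inspecting the individual scores distinguishes the two.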
