# Stress-Testing Long-Context LLMs with a Recall Task

<a href="https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/docs/examples/agent/openai_retrieval_benchmark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

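In this section we stress-test the long-context recall capabilities of GPT-4 and Claude v2. This is inspired by [Greg Kamradt's tweet](https://x.com/GregKamradt/status/1722386725635580292?s=20).

Similarly, we analyze the "needle in a haystack" recall capabilities of long-context LLMs. We extend that analysis in two ways: 1) adding Claude, and 2) testing recall where the context **exceeds** the context window, which triggers response synthesis strategies.

We use a fixed document, the 2021 Uber 10-K, which contains approximately 290,000 tokens.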


In [None]:
%pip install llama-index-llms-openai
%pip install llama-index-llms-anthropic

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
from llama_index.core import SimpleDirectoryReader, Document
from llama_index.core import SummaryIndex
from llama_index.llms.openai import OpenAI
from llama_index.llms.anthropic import Anthropic
from llama_index.core.evaluation import CorrectnessEvaluator

## Setup Data / Indexes

We load the Uber 10-K.


In [None]:
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'

--2023-11-09 00:35:55--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1880483 (1.8M) [application/octet-stream]
Saving to: ‘data/10k/uber_2021.pdf’


2023-11-09 00:36:04 (18.2 MB/s) - ‘data/10k/uber_2021.pdf’ saved [1880483/1880483]


In [None]:
## load data
uber_docs0 = SimpleDirectoryReader(
    input_files=["./data/10k/uber_2021.pdf"]
).load_data()
uber_doc = Document(text="\n\n".join([d.get_content() for d in uber_docs0]))

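We print the number of tokens below. Note that this exceeds the context window of existing LLMs, so answering a query requires a response synthesis strategy.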


In [None]:
# count the number of tokens
from llama_index.core.utils import globals_helper

num_tokens = len(globals_helper.tokenizer(uber_doc.get_content()))
print(f"NUM TOKENS: {num_tokens}")

NUM TOKENS: 291129
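
As a rough sanity check on why a single LLM call cannot cover the document, the sketch below estimates a lower bound on the number of calls needed to read every token once. The context-window sizes (128,000 tokens for `gpt-4-1106-preview`, 100,000 for `claude-2`) are assumed from the publicly advertised figures, and `reserved_tokens` is an arbitrary budget for the prompt and the answer. The actual chunking performed by LlamaIndex will differ, and Claude uses a different tokenizer, so its true token count is not exactly the one printed above.

```python
import math

NUM_TOKENS = 291_129  # token count printed above

# Assumed context-window sizes (advertised figures, not read from any API).
CONTEXT_WINDOWS = {
    "gpt-4-1106-preview": 128_000,
    "claude-2": 100_000,
}


def min_llm_calls(num_tokens: int, context_window: int, reserved_tokens: int = 2_000) -> int:
    """Lower bound on the number of calls needed to read every token once."""
    usable = context_window - reserved_tokens
    if usable <= 0:
        raise ValueError("reserved_tokens must be smaller than the context window")
    return math.ceil(num_tokens / usable)


for model, window in CONTEXT_WINDOWS.items():
    print(f"{model}: at least {min_llm_calls(NUM_TOKENS, window)} calls")
```

Under these assumptions both models need at least three calls, which is why the query engine has to combine partial answers rather than answer in one shot.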


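In this notebook we run several experiments, varying the position of the inserted context, the model (GPT-4 or Claude v2), and the response synthesis mode, and we observe how each combination affects recall.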


### Define Context String

Here we define a single-sentence context that we will "hide" at different positions throughout the document.


In [None]:
context_str = "Jerry's favorite snack is Hot Cheetos."
query_str = "What is Jerry's favorite snack?"

In [None]:
def augment_doc(doc_str, context, position):
    """Augment the document with additional context at the given position."""
    doc_str1 = doc_str[:position]
    doc_str2 = doc_str[position:]

    return f"{doc_str1}...\n\n{context}\n\n...{doc_str2}"

In [None]:
test_str = augment_doc(
    uber_doc.get_content(), context_str, int(0.5 * len(uber_doc.get_content()))
)
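
To see what `augment_doc` produces, here is a small self-contained sketch on a toy string. The helper is repeated so the block runs on its own; `toy_doc` is a made-up stand-in for the 10-K text.

```python
def augment_doc(doc_str: str, context: str, position: int) -> str:
    """Same logic as the notebook's helper, repeated so this block runs alone."""
    return f"{doc_str[:position]}...\n\n{context}\n\n...{doc_str[position:]}"


toy_doc = "AAAAABBBBB"  # hypothetical stand-in for the 10-K text
needle = "Jerry's favorite snack is Hot Cheetos."

augmented = augment_doc(toy_doc, needle, int(0.5 * len(toy_doc)))
print(repr(augmented))

# The needle starts right after the 5-character prefix and the "...\n\n" separator.
needle_start = augmented.index(needle)
print(needle_start)
```

Note that `position` is a character offset, so on real text the split can land in the middle of a word or sentence; the surrounding `...` markers make the break visible.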

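### Define Experiment Loop

The experiment loop is as follows:
1. Iterate over the set of positions (expressed as a fraction of the document length).
2. For each position, inject the context string at that position.
3. Load the entire document into our `SummaryIndex` and get the corresponding query engine.
4. When a question is asked, response synthesis runs over the entire document (create-and-refine, or tree summarize).
5. Compare the predicted response against the expected response with our `CorrectnessEvaluator`.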
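
The steps above can be sketched offline without any API calls. In this toy version the query engine and the `CorrectnessEvaluator` are replaced by hypothetical stubs (`fake_answer` and `fake_score`), so the scores it prints illustrate the control flow only and say nothing about any real model.

```python
def augment_doc(doc_str: str, context: str, position: int) -> str:
    """Same logic as the notebook's helper, repeated so this block runs alone."""
    return f"{doc_str[:position]}...\n\n{context}\n\n...{doc_str[position:]}"


def fake_answer(document: str, query: str) -> str:
    """Hypothetical stand-in for the query engine: only 'reads' the first 60% of the text."""
    visible = document[: int(0.6 * len(document))]
    return "Hot Cheetos." if "Hot Cheetos" in visible else "Not specified."


def fake_score(response: str, reference: str) -> float:
    """Hypothetical stand-in for the evaluator: 5.0 on a keyword hit, else 1.0."""
    hit = "Hot Cheetos" in response and "Hot Cheetos" in reference
    return 5.0 if hit else 1.0


def run_toy_experiments(doc_str, positions, context, query, answer_fn, score_fn):
    scores = {}
    for p in positions:                                   # step 1: iterate positions
        new_doc = augment_doc(doc_str, context, int(p * len(doc_str)))  # step 2: inject
        response = answer_fn(new_doc, query)              # steps 3-4: query the document
        scores[p] = score_fn(response, context)           # step 5: score the response
    return scores


toy_scores = run_toy_experiments(
    "x" * 1000,
    [0.0, 0.25, 0.5, 0.75, 1.0],
    "Jerry's favorite snack is Hot Cheetos.",
    "What is Jerry's favorite snack?",
    fake_answer,
    fake_score,
)
print(toy_scores)
```

Passing the answer and scoring functions as arguments keeps the loop itself independent of the LLM, which makes it easy to dry-run before spending tokens.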


In [None]:
async def run_experiments(
    docs, position_percentiles, context_str, query, llm, response_mode="compact"
):
    # Callers pass a list of documents; join them into a single string
    # instead of relying on the global `uber_doc`.
    doc_str = "\n\n".join(d.get_content() for d in docs)

    eval_llm = OpenAI(model="gpt-4-1106-preview")
    correctness_evaluator = CorrectnessEvaluator(llm=eval_llm)
    eval_scores = {}
    for position_percentile in position_percentiles:
        print(f"Position percentile: {position_percentile}")
        position_idx = int(position_percentile * len(doc_str))
        new_doc_str = augment_doc(doc_str, context_str, position_idx)
        new_doc = Document(text=new_doc_str)
        index = SummaryIndex.from_documents(
            [new_doc],
        )
        query_engine = index.as_query_engine(
            response_mode=response_mode, llm=llm
        )
        print(f"Query: {query}")

        # uncomment for async
        # response = await query_engine.aquery(query)
        response = query_engine.query(query)
        print(f"Response: {str(response)}")
        eval_result = correctness_evaluator.evaluate(
            query=query, response=str(response), reference=context_str
        )
        eval_score = eval_result.score
        print(f"Eval score: {eval_score}")
        eval_scores[position_percentile] = eval_score
    return eval_scores

In [None]:
position_percentiles = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
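
The positions are fractions of the document's character length, not of its token count. As a small sketch, the helpers below (both hypothetical names, not part of the notebook) generate an evenly spaced grid and convert it to character offsets the same way `run_experiments` does, by truncating with `int`.

```python
def evenly_spaced_percentiles(num_steps: int) -> list:
    """Return num_steps + 1 fractions from 0.0 to 1.0 inclusive."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [i / num_steps for i in range(num_steps + 1)]


def to_char_offsets(percentiles, doc_len: int) -> list:
    """Map each fraction to a character offset, truncating like the notebook does."""
    return [int(p * doc_len) for p in percentiles]


percentiles = evenly_spaced_percentiles(10)
offsets = to_char_offsets(percentiles, doc_len=1_000)
print(percentiles)
print(offsets)
```

A finer grid (for example `evenly_spaced_percentiles(20)`) gives a more detailed picture at the cost of one full pass over the document per position.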

In [None]:
llm = OpenAI(model="gpt-4-1106-preview")

eval_scores_gpt4 = await run_experiments(
    [uber_doc],
    position_percentiles,
    context_str,
    query_str,
    llm,
    response_mode="compact",
)

Position percentile: 0.0
Query: What is Jerry's favorite snack?
Response: Hot Cheetos.
Eval score: 5.0
Position percentile: 0.1
Query: What is Jerry's favorite snack?
Response: Hot Cheetos.
Eval score: 5.0
Position percentile: 0.2
Query: What is Jerry's favorite snack?
Response: Hot Cheetos.
Eval score: 5.0
Position percentile: 0.3
Query: What is Jerry's favorite snack?
Response: Hot Cheetos.
Eval score: 5.0
Position percentile: 0.4
Query: What is Jerry's favorite snack?
Response: Hot Cheetos.
Eval score: 5.0
Position percentile: 0.5
Query: What is Jerry's favorite snack?
Response: Jerry's favorite snack is not specified in the provided information.
Eval score: 2.0
Position percentile: 0.6
Query: What is Jerry's favorite snack?
Response: Repeat the original answer.
Eval score: 1.0
Position percentile: 0.7
Query: What is Jerry's favorite snack?
Response: Repeat the original answer.
Eval score: 1.0
Position percentile: 0.8
Query: What is Jerry's favorite snack?
Response: Jerry's favorite
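
The per-position scores can be condensed into a short summary. The sketch below uses only the complete entries from the output above (positions 0.0 through 0.7); the 4.0 pass threshold is an assumption chosen for illustration.

```python
# Scores copied from the complete entries of the GPT-4 "compact" output above.
eval_scores_gpt4_partial = {
    0.0: 5.0, 0.1: 5.0, 0.2: 5.0, 0.3: 5.0,
    0.4: 5.0, 0.5: 2.0, 0.6: 1.0, 0.7: 1.0,
}

PASS_THRESHOLD = 4.0  # assumed cutoff for counting a response as correct


def summarize(scores: dict, threshold: float = PASS_THRESHOLD) -> dict:
    """Return the mean score, pass rate, and the positions that fell below the threshold."""
    failed = [pos for pos, score in scores.items() if score < threshold]
    return {
        "mean_score": sum(scores.values()) / len(scores),
        "pass_rate": (len(scores) - len(failed)) / len(scores),
        "failed_positions": failed,
    }


summary = summarize(eval_scores_gpt4_partial)
print(summary)
```

On these entries the failures are all at position 0.5 and later, which matches the pattern visible in the raw output.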

In [None]:
llm = OpenAI(model="gpt-4-1106-preview")
eval_scores_gpt4_ts = await run_experiments(
    [uber_doc],
    position_percentiles,
    context_str,
    query_str,
    llm,
    response_mode="tree_summarize",
)

Position percentile: 0.0
Query: What is Jerry's favorite snack?
Response: Jerry's favorite snack is Hot Cheetos.
Eval score: 5.0
Position percentile: 0.1
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack from the information provided.
Eval score: 1.0
Position percentile: 0.2
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack as there is no information provided about Jerry or his snack preferences.
Eval score: 2.0
Position percentile: 0.3
Query: What is Jerry's favorite snack?
Response: Jerry's favorite snack is Hot Cheetos.
Eval score: 5.0
Position percentile: 0.4
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack from the information provided.
Eval score: 1.0
Position percentile: 0.5
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack from the information available.
Eval score
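
To make the difference between the two response modes concrete, here is a heavily simplified sketch of a sequential refine loop and a tree-style reduction over chunks. It is not LlamaIndex's implementation: the real modes use prompt templates and token-based packing, and the LLM is replaced here by the hypothetical stubs `toy_answer` and `toy_refine`.

```python
def chunk_text(text: str, chunk_size: int) -> list:
    """Split text into fixed-size character chunks (a stand-in for token-based chunking)."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def refine(chunks, query, answer_fn, refine_fn):
    """Answer from the first chunk, then revise the answer once per remaining chunk."""
    answer = answer_fn(chunks[0], query)
    for chunk in chunks[1:]:
        answer = refine_fn(answer, chunk, query)
    return answer


def tree_summarize(chunks, query, answer_fn, fan_in: int = 2):
    """Answer each chunk independently, then merge answers in groups until one remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    answers = [answer_fn(chunk, query) for chunk in chunks]
    while len(answers) > 1:
        groups = [answers[i : i + fan_in] for i in range(0, len(answers), fan_in)]
        answers = [answer_fn("\n".join(group), query) for group in groups]
    return answers[0]


def toy_answer(context: str, query: str) -> str:
    """Hypothetical LLM stub: finds the needle only if it appears verbatim."""
    return "Hot Cheetos." if "Hot Cheetos" in context else "Unknown."


def toy_refine(existing: str, chunk: str, query: str) -> str:
    """Hypothetical LLM stub: keeps the existing answer unless the chunk has the needle."""
    return "Hot Cheetos." if "Hot Cheetos" in chunk else existing


text = "a" * 30 + "Jerry's favorite snack is Hot Cheetos." + "b" * 30
chunks = chunk_text(text, chunk_size=40)

refine_result = refine(chunks, "What is Jerry's favorite snack?", toy_answer, toy_refine)
tree_result = tree_summarize(chunks, "What is Jerry's favorite snack?", toy_answer)
print(refine_result, tree_result)
```

With a real LLM, each strategy can lose the needle in a different way: a refine step may overwrite a correct earlier answer, while a tree reduction may drop it if a chunk-level answer reports that nothing was found.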

In [None]:
llm = Anthropic(model="claude-2")

eval_scores_anthropic = await run_experiments(
    [uber_doc], position_percentiles, context_str, query_str, llm
)

Position percentile: 0.0
Query: What is Jerry's favorite snack?
Response:  Unfortunately I do not have enough context to determine what Jerry's favorite snack is, as the new context provided does not contain any information about his preferences or favorite snacks. Without more details about Jerry as an individual, I cannot refine my original answer about his favorite snack. I would need additional information about his tastes, habits, or direct statements from him about his snack preferences in order to update my response. The new context alone does not give me any clues to determine his favorite snack.
Eval score: 2.0
Position percentile: 0.1
Query: What is Jerry's favorite snack?
Response:  I apologize, but the new context you provided does not contain any information about someone named Jerry or what his favorite snack is. The new context discusses an intercreditor agreement, secured obligations, liens and other legal/financial details related to Uber Technologies. It does not ment

In [None]:
# NOTE: incomplete, running into timeout errors
llm = Anthropic(model="claude-2")
eval_scores_anthropic = await run_experiments(
    [uber_doc],
    position_percentiles,
    context_str,
    query_str,
    llm,
    response_mode="tree_summarize",
)