<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/retrievers/reciprocal_rerank_fusion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Reciprocal Rerank Fusion Retriever

In this example, we walk through how to combine retrieval results from multiple queries and multiple indexes.

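The retrieved nodes are reranked with the `Reciprocal Rerank Fusion` algorithm demonstrated in this [paper](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf). It provides an efficient way to rerank retrieval results without excessive computation or reliance on external models.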
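
For reference, the fusion score defined in the paper can be written as below. Here $R$ is the set of rankings being combined, $r(d)$ is the 1-based rank of document $d$ in ranking $r$, and $k$ is a smoothing constant (the paper uses $k = 60$). This only restates the paper's formula; a given library may count ranks or choose $k$ differently.

$$
\mathrm{RRFscore}(d) = \sum_{r \in R} \frac{1}{k + r(d)}
$$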

Full credit goes to @Raudaschl on GitHub for their [example implementation](https://github.com/Raudaschl/rag-fusion).


In [None]:
%pip install llama-index-llms-openai
%pip install llama-index-retrievers-bm25

In [None]:
import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

## Setup


If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.


## Download Data


In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-02-12 17:59:58--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-02-12 17:59:59 (327 KB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

Next, we set up a vector index over the documents.


In [None]:
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=256)

index = VectorStoreIndex.from_documents(documents, transformations=[splitter])

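## Create a Hybrid Fusion Retriever

In this step, we fuse our index with a BM25-based retriever. This lets us capture both the semantic relations and the keywords in the input queries.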

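Since both of these retrievers calculate a score, we can use the reciprocal rerank algorithm to re-sort our nodes without an additional model or excessive computation.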
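
As a rough illustration of that re-sorting step, here is a minimal sketch of reciprocal rank fusion over two ranked lists. It is not the library's implementation: the function name, the node IDs, and the choice of `k=60` with ranks counted from 1 (as in the paper) are all assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60.0):
    """Fuse several ranked lists of node IDs into (node_id, score) pairs.

    Illustrative only: k=60 follows the paper, and ranks are counted from 1.
    """
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, node_id in enumerate(ranking, start=1):
            # A node earns more credit the nearer it is to the top of a list.
            fused[node_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical node IDs returned by each retriever, best match first.
vector_results = ["node_a", "node_b"]
bm25_results = ["node_b", "node_c"]

fused = reciprocal_rank_fusion([vector_results, bm25_results])
for node_id, score in fused:
    print(f"{node_id}: {score:.4f}")
```

In this toy case `node_b` comes out on top because both retrievers return it. Only the rank positions matter, so the raw similarity and BM25 scores never need to be put on a common scale, and the fused scores are small numbers of a few hundredths.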

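This setup also queries 4 times: once with your original query, and 3 more times with generated queries.

By default, it uses the following prompt to generate the extra queries: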

```python
QUERY_GEN_PROMPT = (
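    "You are a helpful assistant that generates multiple search queries based on a "
    "single input query. Generate {num_queries} search queries, one on each line, "
    "related to the following input query:\n"
    "Query: {query}\n"
    "Queries:\n"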
)
```
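
To make the template concrete, the sketch below fills in its two placeholders and splits a reply into one query per line. No LLM is called: `parse_queries` and `fake_reply` are hypothetical stand-ins invented for this example, and the real retriever may parse replies differently.

```python
# The template from above, repeated so this cell runs on its own.
QUERY_GEN_PROMPT = (
    "You are a helpful assistant that generates multiple search queries based on a "
    "single input query. Generate {num_queries} search queries, one on each line, "
    "related to the following input query:\n"
    "Query: {query}\n"
    "Queries:\n"
)


def parse_queries(reply: str) -> list[str]:
    """Hypothetical helper: keep one query per non-empty line of the reply."""
    return [line.strip() for line in reply.splitlines() if line.strip()]


original_query = "What happened at Interleafe and Viaweb?"

# Ask for 3 extra queries; together with the original that makes 4 in total.
prompt = QUERY_GEN_PROMPT.format(num_queries=3, query=original_query)
print(prompt)

# Stand-in for a model reply, so that no API call is needed here.
fake_reply = "first generated query\nsecond generated query\n\nthird generated query\n"
all_queries = [original_query] + parse_queries(fake_reply)
print(all_queries)
```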


First, we create our retrievers. Each one retrieves the top-2 most similar nodes:


In [None]:
from llama_index.retrievers.bm25 import BM25Retriever

vector_retriever = index.as_retriever(similarity_top_k=2)

bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=2
)
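
To see what the keyword side contributes, here is a minimal sketch of textbook BM25 scoring in plain Python. It is an illustration under stated assumptions, not what `BM25Retriever` does internally: the whitespace tokenizer, the `k1` and `b` values, and the three-sentence corpus are all made up for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a textbook BM25 formula.

    Illustrative only: whitespace tokenization, and k1/b are common defaults,
    not necessarily the values a given library uses.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at least once.
            doc_freq = sum(1 for other in tokenized if term in other)
            idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
            tf = counts[term]
            # Length normalization: long documents need more hits to score well.
            norm = tf + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Hypothetical mini-corpus.
docs = [
    "we started a company called viaweb",
    "release engineering at interleaf was a big group",
    "painting is a good way to learn to see",
]
scores = bm25_scores("viaweb company", docs)
print(scores)
```

Only the first sentence shares words with the query, so it is the only one with a non-zero score. Exact term matching of this kind is what the BM25 retriever adds alongside the embedding-based retriever.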

Next, we create our fusion retriever, which returns the top-2 nodes from everything the two retrievers return (each returns 2 nodes per query):


In [None]:
from llama_index.core.retrievers import QueryFusionRetriever

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=2,
    num_queries=4,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
    # query_gen_prompt="...",  # we could override the query generation prompt here
)

In [None]:
# apply nested async so this can run inside a notebook
import nest_asyncio

nest_asyncio.apply()

In [None]:
nodes_with_scores = retriever.retrieve(
    "What happened at Interleafe and Viaweb?"
)

Generated queries:
1. What were the major events or milestones in the history of Interleafe and Viaweb?
2. Can you provide a timeline of the key developments and achievements of Interleafe and Viaweb?
3. What were the successes and failures of Interleafe and Viaweb as companies?


In [None]:
for node in nodes_with_scores:
    print(f"Score: {node.score:.2f} - {node.text}...\n-----\n")

Score: 0.03 - The UI was horrible, but it proved you could build a whole store through the browser, without any client software or typing anything into the command line on the server.

Now we felt like we were really onto something. I had visions of a whole new generation of software working this way. You wouldn't need versions, or ports, or any of that crap. At Interleaf there had been a whole group called Release Engineering that seemed to be at least as big as the group that actually wrote the software. Now you could just update the software right on the server.

We started a new company we called Viaweb, after the fact that our software worked via the web, and we got $10,000 in seed funding from Idelle's husband Julian. In return for that and doing the initial legal work and giving us business advice, we gave him 10% of the company. Ten years later this deal became the model for Y Combinator's. We knew founders needed something like this, because we'd needed it ourselves....
-----


As we can see, both returned nodes correctly mention Viaweb and Interleaf!


## Use in a Query Engine!

Now, we can plug our retriever into a query engine to synthesize natural language responses.


In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(retriever)

In [None]:
response = query_engine.query("What happened at Interleafe and Viaweb?")

Generated queries:
1. What were the major events or milestones in the history of Interleafe and Viaweb?
2. Can you provide a timeline of the key developments and achievements of Interleafe and Viaweb?
3. What were the outcomes or impacts of Interleafe and Viaweb on the respective industries they operated in?


In [None]:
from llama_index.core.response.notebook_utils import display_response

display_response(response)

**`Final Response:`** At Interleaf, there was a group called Release Engineering that was as big as the group that actually wrote the software. This suggests that there was a significant focus on managing versions and ports of the software. However, at Viaweb, the founders realized that they could update the software directly on the server, eliminating the need for versions and ports. They started Viaweb, a company that built software that worked via the web. They received $10,000 in seed funding and gave 10% of the company to Julian, who provided the funding and business advice. This deal later became the model for Y Combinator's.