### Long-Context Reorder

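**Regardless of your model's architecture, performance degrades substantially once you include more than 10 retrieved documents.**

**In short: when models must access relevant information in the middle of a long context, they tend to ignore the provided documents. See: https://arxiv.org/abs/2307.03172**

To avoid this problem, you can reorder the documents after retrieving them from the vector database.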

In [5]:
import os
os.environ["OPENAI_API_KEY"] = "sk-xxx"
os.environ["OPENAI_API_BASE"] = "https://api.chatanywhere.tech/v1"
os.environ["OPENAI_API_MODEL"] = "gpt-4-turbo"

In [1]:
! pip install --upgrade --quiet sentence-transformers langchain-chroma langchain langchain-community langchain-openai

In [2]:
from langchain.chains import LLMChain, StuffDocumentsChain
from langchain_chroma import Chroma
from langchain_community.document_transformers import (
    LongContextReorder,
)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI

# Load the embedding model from Hugging Face
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

texts = [
    "Basquetball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player.",
]

# Create a retriever backed by a vector database
retriever = Chroma.from_texts(texts, embedding=embeddings).as_retriever(
    search_kwargs={"k": 10}
)
# The query is written in Chinese ("What do you know about the Celtics?")
query = "关于凯尔特人，你知道些什么？"

# Retrieve the documents, ordered by relevance score
docs = retriever.invoke(query)
docs

[Document(page_content='This is just a random text.'),
 Document(page_content='Fly me to the moon is one of my favourite songs.'),
 Document(page_content='This is a document about the Boston Celtics'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='The Celtics are my favourite team.'),
 Document(page_content='I simply love going to the movies'),
 Document(page_content='Basquetball is a great sport.'),
 Document(page_content='The Boston Celtics won the game by 20 points'),
 Document(page_content='Elden Ring is one of the best games in the last 15 years.'),
 Document(page_content='Larry Bird was an iconic NBA player.')]

In [3]:
# Reorder the documents:
# less relevant documents end up in the middle of the list,
# more relevant ones at the beginning/end.
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)

reordered_docs

[Document(page_content='Fly me to the moon is one of my favourite songs.'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='I simply love going to the movies'),
 Document(page_content='The Boston Celtics won the game by 20 points'),
 Document(page_content='Larry Bird was an iconic NBA player.'),
 Document(page_content='Elden Ring is one of the best games in the last 15 years.'),
 Document(page_content='Basquetball is a great sport.'),
 Document(page_content='The Celtics are my favourite team.'),
 Document(page_content='This is a document about the Boston Celtics'),
 Document(page_content='This is just a random text.')]
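
The ordering above can be reproduced with a few lines of plain Python. The following is a minimal sketch, assuming the reordering walks the list from least to most relevant and alternately places items at the front and at the back of the result. `reorder_for_long_context` is a hypothetical helper name, not part of LangChain, and the library's actual implementation may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reorder_for_long_context(items: Sequence[T]) -> List[T]:
    """Place the most relevant items at both ends, the least relevant in the middle.

    `items` is assumed to be sorted from most to least relevant.
    (Hypothetical helper for illustration only.)
    """
    reordered: List[T] = []
    # Walk from the least relevant item to the most relevant one.
    for i, item in enumerate(reversed(items)):
        if i % 2 == 1:
            reordered.append(item)      # odd steps go to the back
        else:
            reordered.insert(0, item)   # even steps go to the front
    return reordered


# Use relevance ranks instead of documents: 0 is the most relevant.
ranks = list(range(10))
reordered_ranks = reorder_for_long_context(ranks)
print(reordered_ranks)  # [1, 3, 5, 7, 9, 8, 6, 4, 2, 0]
```

With rank 0 as the most relevant document, the two best-ranked items land at the end and the start of the list, and the weakest ones sit in the middle, which matches the positions in the output above.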

In [6]:
# Prepare and run a custom Stuff chain that uses the reordered documents as context.

# Override the prompts
document_prompt = PromptTemplate(
    input_variables=["page_content"], template="{page_content}"
)
document_variable_name = "context"
llm = OpenAI()
stuff_prompt_override = """Given these text extracts:
-----
{context}
-----
Please answer the following question:
{query}"""
prompt = PromptTemplate(
    template=stuff_prompt_override, input_variables=["context", "query"]
)

# Instantiate the chain
llm_chain = LLMChain(llm=llm, prompt=prompt)
chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_prompt=document_prompt,
    document_variable_name=document_variable_name,
)
chain.run(input_documents=reordered_docs, query=query)

'\n我知道凯尔特人是一支NBA的篮球队，总部位于美国马萨诸塞州的波士顿市。他们是联盟中最成功的球队之一，曾经赢得过17次总冠军。他们的队徽是一只绿色的三叶草，代表着凯尔特人的爱尔兰血统。一些著名的球员包括拉里·伯德、凯文·麦克海尔和保罗·皮尔斯。凯尔特人也被认为是NBA历史上最伟大的球队之一。'
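
To make the "stuff" step concrete, here is a minimal sketch of how the reordered documents might be combined into a single prompt before it is sent to the model. It assumes each document is rendered as its `page_content` and that documents are joined with a blank line between them. `PROMPT_TEMPLATE` is modeled on the prompt in the cell above, and `build_stuffed_prompt` is a hypothetical helper, not a LangChain function.

```python
from typing import List

# Modeled on the prompt used in the chain above (illustrative copy).
PROMPT_TEMPLATE = """Given these text extracts:
-----
{context}
-----
Please answer the following question:
{query}"""


def build_stuffed_prompt(
    page_contents: List[str], query: str, separator: str = "\n\n"
) -> str:
    """Join all document texts into one context block and fill the template.

    The separator is an assumption made for this sketch.
    """
    context = separator.join(page_contents)
    return PROMPT_TEMPLATE.format(context=context, query=query)


example_prompt = build_stuffed_prompt(
    [
        "The Celtics are my favourite team.",
        "Larry Bird was an iconic NBA player.",
    ],
    "What do you know about the Celtics?",
)
print(example_prompt)
```

Because every document is pasted into one context block, the order of the list directly determines where each document sits in the prompt, which is why the reordering step comes before this one.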