# Retrievers
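A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.

Retrievers accept a string query as input and return a list of `Document`s as output.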

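## Advanced Retrieval Types
LangChain provides several advanced retrieval types. The full list is below, along with the following information for each:

- Name: the name of the retrieval algorithm.
- Index Type: which index type (if any) it relies on.
- Uses an LLM: whether the retrieval method uses an LLM.
- When to Use: our commentary on when you should consider using the retrieval method.
- Description: what the retrieval algorithm does.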



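| Name | Index Type | Uses an LLM | When to Use | Description |
| --- | --- | --- | --- | --- |
| Vectorstore | Vectorstore | No | If you are just getting started and looking for something quick and easy. | This is the simplest method and the easiest one to get started with. It involves creating embeddings for each piece of text. |
| ParentDocument | Vectorstore + Document Store | No | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together. | This involves indexing multiple chunks for each document. You then find the chunks that are most similar in embedding space, but retrieve and return the whole parent document (rather than individual chunks). |
| Multi Vector | Vectorstore + Document Store | Sometimes during indexing | If you are able to extract information from documents that you think is more relevant to index than the text itself. | This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways, for example as summaries of the text or hypothetical questions. |
| Self Query | Vectorstore | Yes | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, and (2) a metadata filter to go along with it. This is useful because questions are often about the metadata of documents (not the content itself). |
| Contextual Compression | Any | Sometimes | If you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM. | This puts a post-processing step on top of another retriever and extracts only the most relevant information from the retrieved documents. This can be done with embeddings or an LLM. |
| Time-Weighted Vectorstore | Vectorstore | No | If you have timestamps associated with your documents and you want to retrieve the most recent ones. | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at the timestamps of indexed documents). |
| Multi-Query Retriever | Any | Yes | If users are asking questions that are complex and require multiple pieces of distinct information to answer. | This uses an LLM to generate multiple queries from the original one. This is useful when the original query needs pieces of information about multiple topics to be properly answered. By generating multiple queries, we can then fetch documents for each of them. |
| Ensemble | Any | No | If you have multiple retrieval methods and want to try combining them. | This fetches documents from multiple retrievers and then combines them. |
| Long-Context Reorder | Any | No | If you are working with a long-context model and noticing that it is not paying attention to information in the middle of the retrieved documents. | This fetches documents from an underlying retriever and then reorders them so that the most similar ones are near the beginning and the end. This is useful because longer-context models have been shown to sometimes ignore information in the middle of the context window. |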


In [3]:
from dotenv import load_dotenv, find_dotenv
from langchain.globals import set_debug
import os
load_dotenv(find_dotenv())
set_debug(False)

### [Third-party integrations](https://python.langchain.com/docs/integrations/retrievers/)
LangChain also integrates with many third-party retrieval services. For the full list, see the integrations page linked above.

### Using retrievers in LCEL
Since retrievers are Runnables, we can easily compose them with other Runnable objects:

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI()


def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])


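# NOTE: `retriever` must already be defined, for example via `db.as_retriever()`
# as in the vector store retriever section of this notebook.
chain = (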
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

chain.invoke("What did the president say about technology?")


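## Vector store-backed retriever
A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class that makes it conform to the retriever interface. It uses the search methods implemented by the vector store, such as similarity search and MMR, to query the texts in the vector store.

Once you have constructed a vector store, it is easy to construct a retriever. Let's walk through an example.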

In [4]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/state_of_the_union.txt")

In [5]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(texts, embeddings)

In [6]:
retriever = db.as_retriever()

In [7]:
docs = retriever.invoke("what did he say about ketanji brown jackson")

In [8]:
docs

[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'data/state_of_the_union.txt'}),
 Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of publ

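### Maximum marginal relevance retrieval
By default, the vector store retriever uses similarity search. If the underlying vector store supports maximum marginal relevance (MMR) search, you can specify that as the search type.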
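
As a rough illustration of what maximum marginal relevance does, the sketch below greedily picks documents that are similar to the query but dissimilar to the ones already picked. It is a plain-Python approximation and not the code that FAISS or LangChain runs; the names `cosine` and `mmr_select`, the `lambda_mult` trade-off parameter, and the toy 2-D vectors are assumptions made for this example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors: indices 0 and 1 are near-duplicates, index 2 points elsewhere.
query_vec = [1.0, 0.3]
doc_vecs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]

diverse = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking, so the two near-duplicates are both chosen; with `lambda_mult=0.5` the second pick moves to the more distinct vector.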


In [9]:
retriever = db.as_retriever(search_type="mmr")
docs = retriever.invoke("what did he say about ketanji brown jackson")
docs

[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'data/state_of_the_union.txt'}),
 Document(page_content='One was stationed at bases and breathing in toxic smoke from “burn pits” that incinerated wastes of war

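### Similarity score threshold retrieval
You can also set a retrieval method that applies a similarity score threshold and returns only documents with a score above that threshold.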
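
Conceptually, thresholded retrieval is a filter over scored results. The sketch below assumes relevance scores normalized to [0, 1] and uses a hypothetical helper, `filter_by_score`; both the helper and the sample data are illustrative and are not part of LangChain.

```python
from typing import List, Tuple


def filter_by_score(
    scored_docs: List[Tuple[str, float]],
    score_threshold: float,
    k: int = 4,
) -> List[str]:
    """Keep at most k texts whose relevance score meets the threshold, best first."""
    kept = [(text, score) for text, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:k]]


# Hypothetical (text, relevance score) pairs.
scored = [("chunk A", 0.82), ("chunk B", 0.41), ("chunk C", 0.57)]
result = filter_by_score(scored, score_threshold=0.5)
```

Here `chunk B` falls below the threshold and is dropped, while the remaining chunks come back ordered by score.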



In [10]:
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}
)
docs = retriever.invoke("what did he say about ketanji brown jackson")

In [11]:
docs

[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'data/state_of_the_union.txt'}),
 Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of publ

### Specifying top k

You can also specify search kwargs such as `k` to use when doing retrieval.

In [12]:
retriever = db.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke("what did he say about ketanji brown jackson")
len(docs)

1

## MultiQueryRetriever
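Distance-based vector database retrieval embeds (represents) queries in a high-dimensional space and finds similar embedded documents based on "distance". However, retrieval may produce different results with subtle changes in query wording, or if the embeddings do not capture the semantics of the data well. Prompt engineering or tuning is sometimes done to address these problems manually, but it can be tedious.

The `MultiQueryRetriever` automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents. By generating multiple perspectives on the same question, the `MultiQueryRetriever` may be able to overcome some of the limitations of distance-based retrieval and get a richer set of results.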
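
The "unique union" step can be pictured as follows. This is a minimal sketch under stated assumptions: the query variants are hard-coded instead of being generated by an LLM, and `unique_union` and `fake_index` are hypothetical names that stand in for the real retriever.

```python
from typing import Callable, Dict, List


def unique_union(queries: List[str], retrieve: Callable[[str], List[str]]) -> List[str]:
    """Run every query and merge the results, keeping first-seen order without duplicates."""
    seen = set()
    merged: List[str] = []
    for q in queries:
        for doc in retrieve(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical lookup table standing in for a real retriever.
fake_index: Dict[str, List[str]] = {
    "How can tasks be decomposed?": ["doc-1", "doc-2"],
    "What strategies break down tasks?": ["doc-2", "doc-3"],
}
docs = unique_union(list(fake_index), lambda q: fake_index.get(q, []))
```

`doc-2` is returned by both query variants but appears only once in the merged result.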

In [24]:
# Build a sample vectorDB
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load blog post
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# VectorDB
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

Simple usage: specify the LLM to use for query generation, and the retriever will do the rest.

In [25]:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

question = "What are the approaches to Task Decomposition?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)

In [26]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [27]:
unique_docs = retriever_from_llm.invoke(question)
len(unique_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. How can Task Decomposition be achieved through different methods?', '2. What strategies are commonly used for Task Decomposition?', '3. What are the various techniques for breaking down tasks in Task Decomposition?']


2

You can also supply a prompt along with an output parser to split the results into a list of queries.

In [28]:
from typing import List

from langchain_core.output_parsers import BaseOutputParser
from langchain_core.prompts import PromptTemplate


# Output parser will split the LLM result into a list of queries.
# Subclassing BaseOutputParser (rather than PydanticOutputParser) avoids the
# JSON parsing step, which fails on plain newline-separated text.
class LineListOutputParser(BaseOutputParser[List[str]]):
    """Parse newline-separated LLM output into a list of lines."""

    def parse(self, text: str) -> List[str]:
        lines = text.strip().split("\n")
        return [line for line in lines if line.strip()]  # drop empty lines


output_parser = LineListOutputParser()

QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five 
    different versions of the given user question to retrieve relevant documents from a vector 
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search. 
    Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
llm = ChatOpenAI(temperature=0)

# Chain (LCEL). Assumption: the installed LangChain release lets
# MultiQueryRetriever take any Runnable as `llm_chain` (0.2 or later).
llm_chain = QUERY_PROMPT | llm | output_parser

# Other inputs
question = "What are the approaches to Task Decomposition?"

In [29]:
# Run
retriever = MultiQueryRetriever(
    retriever=vectordb.as_retriever(),
    llm_chain=llm_chain,
)

# Results
unique_docs = retriever.invoke(question)
len(unique_docs)

### Contextual compression
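One challenge with retrieval is that you usually don't know the specific queries your document storage system will face when you ingest data into it. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you compress them using the context of the given query, so that only the relevant information is returned. "Compressing" here refers both to compressing the contents of an individual document and to filtering out documents wholesale.

To use the Contextual Compression Retriever, you'll need:

- a base retriever
- a Document Compressor

The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents, and passes them through the Document Compressor. The Document Compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.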
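
The two-stage flow described above (base retriever first, then a document compressor) can be sketched in plain Python. Everything here is an illustrative assumption: documents are plain strings, and `keyword_sentence_compressor` is a crude keyword-based stand-in for an LLM- or embedding-based compressor.

```python
from typing import Callable, List

Retriever = Callable[[str], List[str]]
Compressor = Callable[[List[str], str], List[str]]


def compressed_retrieve(query: str, base_retriever: Retriever, compressor: Compressor) -> List[str]:
    """Fetch documents with the base retriever, then compress them for this query."""
    docs = base_retriever(query)
    return compressor(docs, query)


def keyword_sentence_compressor(docs: List[str], query: str) -> List[str]:
    """Keep only sentences that mention a query keyword; drop documents left empty."""
    words = [w.strip("?.,!").lower() for w in query.split()]
    keywords = {w for w in words if len(w) > 3}
    compressed = []
    for doc in docs:
        sentences = [s.strip() for s in doc.split(".") if s.strip()]
        kept = [s for s in sentences if any(k in s.lower() for k in keywords)]
        if kept:
            compressed.append(". ".join(kept) + ".")
    return compressed


# Hypothetical corpus; the base retriever here simply returns everything.
corpus = [
    "The budget passed. Judge Jackson was nominated four days ago.",
    "Roads and bridges need repair.",
]
result = compressed_retrieve("Who nominated Jackson?", lambda q: corpus, keyword_sentence_compressor)
```

The first document is shortened to its one relevant sentence, and the second is dropped entirely, which mirrors the two meanings of "compression" given above.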

### Get started


In [1]:
# Helper function for printing docs


def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

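### Using a vanilla vector store retriever
Let's start by initializing a simple vector store retriever and storing the 2022 State of the Union speech (in chunks). We can see that, given an example question, our retriever returns one or two relevant docs and a few irrelevant docs. Even the relevant docs contain a lot of irrelevant information.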

In [4]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

documents = TextLoader("data/state_of_the_union.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()

docs = retriever.invoke("What did the president say about Ketanji Brown Jackson")
pretty_print_docs(docs)

Document 1:

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:

A former top litigator in private practice. A former federal public defender. And fro

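### Adding contextual compression with an `LLMChainExtractor`

Now let's wrap our base retriever with a `ContextualCompressionRetriever`. We'll add an `LLMChainExtractor`, which iterates over the initially returned documents and extracts from each only the content that is relevant to the query.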

In [5]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import OpenAI

llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
pretty_print_docs(compressed_docs)



Document 1:

I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson.


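### More built-in compressors: filters

The `LLMChainFilter` is a slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents.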

In [6]:
from langchain.retrievers.document_compressors import LLMChainFilter

_filter = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
pretty_print_docs(compressed_docs)



ValueError: {'message': 'null', 'type': 'invalid_request_error'}

### EmbeddingsFilter
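Making an extra LLM call for each retrieved document is expensive and slow. The `EmbeddingsFilter` provides a cheaper and faster option by embedding the documents and the query and returning only those documents whose embeddings are sufficiently similar to the query.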
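
The filtering idea amounts to a similarity check per document. The sketch below uses hand-written 2-D vectors in place of real embeddings; `cosine_similarity`, `embeddings_filter`, and the sample data are assumptions for illustration, not the library's implementation.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embeddings_filter(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    similarity_threshold: float,
) -> List[str]:
    """Return the texts whose embedding is similar enough to the query embedding."""
    return [
        text
        for text, vec in doc_vecs.items()
        if cosine_similarity(query_vec, vec) > similarity_threshold
    ]


# Hypothetical 2-D "embeddings"; a real embedding model returns far longer vectors.
query_embedding = [1.0, 0.0]
doc_embeddings = {
    "on-topic chunk": [0.9, 0.1],
    "off-topic chunk": [0.1, 0.9],
}
kept = embeddings_filter(query_embedding, doc_embeddings, similarity_threshold=0.76)
```

No LLM call is involved: one embedding per document and one per query is enough to decide what to keep.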

In [7]:
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
pretty_print_docs(compressed_docs)

Document 1:

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:

A former top litigator in private practice. A former federal public defender. And fro

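### Stringing compressors and document transformers together
Using the `DocumentCompressorPipeline`, we can also easily combine multiple compressors in sequence. Along with compressors, we can add `BaseDocumentTransformer`s to our pipeline, which don't perform any contextual compression but simply apply some transformation to a set of documents. For example, `TextSplitter`s can be used as document transformers to split documents into smaller pieces, and the `EmbeddingsRedundantFilter` can be used to filter out redundant documents based on the embedding similarity between documents.

Below we create a compressor pipeline that first splits our docs into smaller chunks, then removes redundant documents, and then filters based on relevance to the query.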
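
The split, de-duplicate, filter sequence can be mimicked with ordinary functions applied in order. This is only a sketch of the control flow: the three stages below use string operations as stand-ins for the embedding-based components, and all function names are hypothetical.

```python
from typing import Callable, List

# A transformer takes (documents, query) and returns a new list of documents.
Transformer = Callable[[List[str], str], List[str]]


def split_sentences(docs: List[str], query: str) -> List[str]:
    """Split every document into sentence-sized pieces."""
    return [s.strip() for d in docs for s in d.split(". ") if s.strip()]


def drop_duplicates(docs: List[str], query: str) -> List[str]:
    """Remove exact (case-insensitive) duplicates, a stand-in for embedding redundancy."""
    seen = set()
    unique = []
    for d in docs:
        key = d.lower()
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique


def keep_relevant(docs: List[str], query: str) -> List[str]:
    """Keep pieces sharing at least one word with the query, a stand-in for similarity."""
    query_words = set(query.lower().split())
    return [d for d in docs if query_words & set(d.lower().split())]


def run_pipeline(docs: List[str], query: str, transformers: List[Transformer]) -> List[str]:
    """Apply each transformer in order, feeding its output to the next one."""
    for transform in transformers:
        docs = transform(docs, query)
    return docs


raw_docs = ["Jackson was nominated. Roads need repair", "Jackson was nominated"]
result = run_pipeline(raw_docs, "jackson", [split_sentences, drop_duplicates, keep_relevant])
```

Order matters: splitting first lets the later stages work on small pieces, so a long document with one relevant sentence is not kept or dropped as a whole.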

In [8]:
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0, separator=". ")
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)

In [9]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline_compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
pretty_print_docs(compressed_docs)

Document 1:

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson
----------------------------------------------------------------------------------------------------
Document 2:

As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. 

While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year
----------------------------------------------------------------------------------------------------
Document 3:

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder
---------------------------------------------------------------------

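### Custom retriever
Many LLM applications involve retrieving information from external data sources using a retriever. A retriever is responsible for retrieving a list of relevant `Document`s for a given user `query`. The retrieved documents are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in them to generate an appropriate response (for example, answering a user question based on a knowledge base).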

### Interface
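To create your own retriever, you need to extend the `BaseRetriever` class and implement the following methods:

| Method | Description | Required/Optional |
| ------------ | ------------ | ------------ |
| `_get_relevant_documents` | Get documents relevant to a query. | Required |
| `_aget_relevant_documents` | Implement to provide async native support. | Optional |

The logic inside `_get_relevant_documents` can involve arbitrary calls to a database or to the web using requests.

By inheriting from `BaseRetriever`, your retriever automatically becomes a LangChain `Runnable` and gains the standard `Runnable` functionality out of the box.

You can use a `RunnableLambda` or `RunnableGenerator` to implement a retriever.


The main benefit of implementing a retriever as a `BaseRetriever` rather than a `RunnableLambda` (a custom runnable function) is that a `BaseRetriever` is a well-known LangChain entity, so some monitoring tools may implement specialized behavior for retrievers. Another difference is that a `BaseRetriever` behaves slightly differently from a `RunnableLambda` in some APIs; for example, the start event in the `astream_events` API will be `on_retriever_start` instead of `on_chain_start`.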

#### Example
Let's implement a toy retriever that returns all documents whose text contains the text in the user query.



In [10]:
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever


class ToyRetriever(BaseRetriever):
    """A toy retriever that returns up to k documents whose text contains the user query.

    This retriever only implements the sync method _get_relevant_documents.

    If the retriever were to involve file access or network access, it could benefit
    from a native async implementation of `_aget_relevant_documents`.

    As usual, with Runnables, there's a default async implementation that's provided
    that delegates to the sync implementation running on another thread.
    """

    documents: List[Document]
    """List of documents to retrieve from."""
    k: int
    """Number of top results to return"""

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Sync implementations for retriever."""
        matching_documents = []
        for document in self.documents:
            if len(matching_documents) >= self.k:
                return matching_documents

            if query.lower() in document.page_content.lower():
                matching_documents.append(document)
        return matching_documents

    # Optional: Provide a more efficient native implementation by overriding
    # _aget_relevant_documents
    # async def _aget_relevant_documents(
    #     self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun
    # ) -> List[Document]:
    #     """Asynchronously get documents relevant to a query.

    #     Args:
    #         query: String to find relevant documents for
    #         run_manager: The callbacks handler to use

    #     Returns:
    #         List of relevant documents
    #     """

In [11]:
documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"type": "dog", "trait": "loyalty"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"type": "cat", "trait": "independence"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"type": "fish", "trait": "low maintenance"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"type": "bird", "trait": "intelligence"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"type": "rabbit", "trait": "social"},
    ),
]
retriever = ToyRetriever(documents=documents, k=3)

In [12]:
retriever.invoke("that")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]

It's a Runnable, so it benefits from the standard Runnable interface.

In [13]:
await retriever.ainvoke("that")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]

In [14]:
retriever.batch(["dog", "cat"])

[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],
 [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]

In [15]:
async for event in retriever.astream_events("bar", version="v1"):
    print(event)

{'event': 'on_retriever_start', 'run_id': '79131a20-2720-4a9d-b7c2-c663eef65542', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}
{'event': 'on_retriever_stream', 'run_id': '79131a20-2720-4a9d-b7c2-c663eef65542', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}
{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': '79131a20-2720-4a9d-b7c2-c663eef65542', 'tags': [], 'metadata': {}, 'data': {'output': []}}




### Ensemble Retriever
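The `EnsembleRetriever` takes a list of retrievers as input, ensembles the results of their `get_relevant_documents()` methods, and reranks the results using the Reciprocal Rank Fusion algorithm.

By leveraging the strengths of different algorithms, the `EnsembleRetriever` can achieve better performance than any single algorithm.

The most common pattern is to combine a sparse retriever (such as BM25) with a dense retriever (such as embedding similarity), because their strengths are complementary. This is also known as "hybrid search". The sparse retriever is good at finding relevant documents based on keywords, while the dense retriever is good at finding relevant documents based on semantic similarity.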
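
For intuition, weighted Reciprocal Rank Fusion can be written in a few lines. The sketch below is a generic version of the algorithm rather than LangChain's exact code; the function name, the smoothing constant `c = 60`, and the sample rankings are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(
    rankings: List[List[str]],
    weights: List[float],
    c: int = 60,
) -> List[str]:
    """Fuse several ranked lists: each document scores sum(weight / (rank + c))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)


# Hypothetical rankings from a sparse and a dense retriever.
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking], weights=[0.5, 0.5])
```

Document `b` is ranked by both retrievers, so its two contributions add up and it moves to the top even though it is not first in the sparse ranking.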

In [16]:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

In [17]:
doc_list_1 = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]

# initialize the bm25 retriever and faiss retriever
bm25_retriever = BM25Retriever.from_texts(
    doc_list_1, metadatas=[{"source": 1}] * len(doc_list_1)
)
bm25_retriever.k = 2

doc_list_2 = [
    "You like apples",
    "You like oranges",
]

embedding = OpenAIEmbeddings()
faiss_vectorstore = FAISS.from_texts(
    doc_list_2, embedding, metadatas=[{"source": 2}] * len(doc_list_2)
)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)
docs = ensemble_retriever.invoke("apples")
docs

[Document(page_content='I like apples', metadata={'source': 1}),
 Document(page_content='You like apples', metadata={'source': 2}),
 Document(page_content='Apples and oranges are fruits', metadata={'source': 1}),
 Document(page_content='You like oranges', metadata={'source': 2})]

#### Runtime Configuration
We can also configure the retrievers at runtime. To do this, we need to mark the fields as configurable.

In [18]:
from langchain_core.runnables import ConfigurableField

In [19]:
faiss_retriever = faiss_vectorstore.as_retriever(
    search_kwargs={"k": 2}
).configurable_fields(
    search_kwargs=ConfigurableField(
        id="search_kwargs_faiss",
        name="Search Kwargs",
        description="The search kwargs to use",
    )
)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)
config = {"configurable": {"search_kwargs_faiss": {"k": 1}}}
docs = ensemble_retriever.invoke("apples", config=config)
docs

[Document(page_content='I like apples', metadata={'source': 1}),
 Document(page_content='You like apples', metadata={'source': 2}),
 Document(page_content='Apples and oranges are fruits', metadata={'source': 1})]

### Long-Context Reorder
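No matter the architecture of your model, there is a substantial performance degradation when you include 10 or more retrieved documents. In brief: when models must access relevant information in the middle of long contexts, they tend to ignore the provided documents. See: https://arxiv.org/abs/2307.03172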
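
One way to picture the reordering remedy is to alternate documents between the front and the back of the list, so that the least relevant ones end up in the middle. This is a minimal sketch of the idea and may not match the exact order produced by `LongContextReorder`; the function name is hypothetical, and the input is assumed to be sorted from most to least relevant.

```python
from typing import List


def reorder_for_long_context(docs_by_relevance: List[str]) -> List[str]:
    """Place the most relevant documents at both ends and the least relevant in the middle."""
    front: List[str] = []
    back: List[str] = []
    for i, doc in enumerate(docs_by_relevance):
        # Even positions fill the front, odd positions fill the back.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


# "d1" is the most relevant document, "d5" the least relevant.
ranked = ["d1", "d2", "d3", "d4", "d5"]
reordered = reorder_for_long_context(ranked)
```

The two best documents (`d1` and `d2`) land at the start and the end, and the weakest (`d5`) sits in the middle where a model is most likely to overlook it.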

In [None]:
from langchain.chains import LLMChain, StuffDocumentsChain
from langchain_chroma import Chroma
from langchain_community.document_transformers import (
    LongContextReorder,
)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI

# Get embeddings.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

texts = [
    "Basketball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player.",
]

# Create a retriever
retriever = Chroma.from_texts(texts, embedding=embeddings).as_retriever(
    search_kwargs={"k": 10}
)
query = "What can you tell me about the Celtics?"

# Get relevant documents ordered by relevance score
docs = retriever.invoke(query)
docs

In [None]:
# Reorder the documents:
# Less relevant document will be at the middle of the list and more
# relevant elements at beginning / end.
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)

# Confirm that the 4 relevant documents are at beginning and end.
reordered_docs

In [None]:
# We prepare and run a custom Stuff chain with reordered docs as context.

# Override prompts
document_prompt = PromptTemplate(
    input_variables=["page_content"], template="{page_content}"
)
document_variable_name = "context"
llm = OpenAI()
stuff_prompt_override = """Given these text extracts:
-----
{context}
-----
Please answer the following question:
{query}"""
prompt = PromptTemplate(
    template=stuff_prompt_override, input_variables=["context", "query"]
)

# Instantiate the chain
llm_chain = LLMChain(llm=llm, prompt=prompt)
chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_prompt=document_prompt,
    document_variable_name=document_variable_name,
)
chain.run(input_documents=reordered_docs, query=query)

### MultiVector Retriever
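It can often be beneficial to store multiple vectors per document, and there are several use cases for doing so. LangChain has a base `MultiVectorRetriever` that makes querying this type of setup easy. A lot of the complexity lies in how to create the multiple vectors per document. This notebook covers some of the common ways to create those vectors and use the `MultiVectorRetriever`.

The methods for creating multiple vectors per document include:

- Smaller chunks: split a document into smaller chunks and embed those (this is what `ParentDocumentRetriever` does).
- Summary: create a summary for each document and embed it along with (or instead of) the document.
- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed them along with (or instead of) the document.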

In [20]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loaders = [
    TextLoader("data/whatsapp_chat.txt"),
    TextLoader("data/state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

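#### Smaller chunks
It is often useful to retrieve larger chunks of information while embedding smaller chunks. This lets the embeddings capture the semantic meaning as closely as possible, while passing as much context as possible downstream. Note that this is what the `ParentDocumentRetriever` does. Here we show what is going on under the hood.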
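
The "embed small, return large" pattern boils down to tagging each child chunk with its parent's ID and following that ID back at query time. The sketch below is dependency-free: substring matching stands in for embedding similarity, and `parents`, `child_index`, and `retrieve_parents` are hypothetical names.

```python
from typing import Dict, List, Tuple

# Hypothetical parent documents keyed by ID.
parents: Dict[str, str] = {
    "p1": "Justice Breyer is retiring. The Senate will vote soon.",
    "p2": "Infrastructure spending will rise. Bridges come first.",
}

# Index small child chunks, each tagged with its parent's ID.
child_index: List[Tuple[str, str]] = [
    (chunk, parent_id)
    for parent_id, text in parents.items()
    for chunk in text.split(". ")
]


def retrieve_parents(query: str) -> List[str]:
    """Match against the small chunks, then return each matching parent once."""
    matched_ids: List[str] = []
    for chunk, parent_id in child_index:
        if query.lower() in chunk.lower() and parent_id not in matched_ids:
            matched_ids.append(parent_id)
    return [parents[pid] for pid in matched_ids]


results = retrieve_parents("breyer")
```

The match happens on a single short chunk, but the caller receives the whole parent document with its surrounding context.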

In [21]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", 
    embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
import uuid

doc_ids = [str(uuid.uuid4()) for _ in docs]

In [22]:
doc_ids

['54e70ef2-49d8-449b-aca6-45664d78609e',
 '84b41567-b962-44c5-9603-c98b87f876fd',
 '889acd4a-34ca-40f0-a9a6-8d8162523f7c',
 'f04f27ba-ec0d-42a6-8e60-111b12746713',
 'd9c728d1-9194-44b5-b498-b1a78e8541d9']

In [23]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

In [24]:
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

In [25]:
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
# Vectorstore alone retrieves the small chunks
retriever.vectorstore.similarity_search("justice breyer")[0]

Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': 'f04f27ba-ec0d-42a6-8e60-111b12746713', 'source': 'data/state_of_the_union.txt'})

In [26]:
# Retriever returns larger chunks
len(retriever.invoke("justice breyer")[0].page_content)

9875

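The default search type the retriever performs on the vector database is a similarity search. LangChain vector stores also support searching via maximum marginal relevance, so if you want that instead you can set the `search_type` property as follows: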

In [27]:
from langchain.retrievers.multi_vector import SearchType

retriever.search_type = SearchType.mmr

len(retriever.invoke("justice breyer")[0].page_content)

9875

#### Summary
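Oftentimes a summary can distill more accurately what a chunk is about, leading to better retrieval. Here we show how to create summaries and then embed them.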

In [28]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

In [29]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)

In [30]:
summaries = chain.batch(docs, {"max_concurrency": 5})

In [31]:
summaries

['User 1 inquires about a bag being sold by User 2 for $50, but User 2 says it is too low. User 2 then offers a different bag for $129, but User 1 clarifies they are only interested in the blue one, which is not for sale. The conversation ends with User 1 saying goodbye and expressing interest in future updates.',
 "President Biden addresses Congress and the American people, discussing the recent Russian invasion of Ukraine and the global response. He outlines the economic sanctions and military support being provided to Ukraine and emphasizes the unity of the international community in standing against aggression. The President also highlights the progress made in the American economy, particularly through the American Rescue Plan, and emphasizes the administration's commitment to investing in infrastructure and supporting the middle class. He concludes with a message of hope and unity, emphasizing the strength and resilience of the Ukrainian people and the importance of democracy in 

In [32]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [33]:
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

In [34]:
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [35]:
# # We can also add the original chunks to the vectorstore if we so want
# for i, doc in enumerate(docs):
#     doc.metadata[id_key] = doc_ids[i]
# retriever.vectorstore.add_documents(docs)

In [36]:
sub_docs = vectorstore.similarity_search("justice breyer")

In [38]:
sub_docs[0]

Document(page_content="The document outlines President Biden's recent nomination of Judge Ketanji Brown Jackson to the Supreme Court, as well as his plans for immigration reform, protecting women's rights, passing the Equality Act, and supporting veterans. Biden also discusses his Unity Agenda, which includes addressing the opioid epidemic, mental health, and ending cancer. The President expresses optimism about America's future and emphasizes unity and strength in the face of challenges.", metadata={'doc_id': '39fab56b-12b6-40cc-bc9e-f02454f04bb0'})

In [39]:
retrieved_docs = retriever.invoke("justice breyer")

In [40]:
len(retrieved_docs[0].page_content)

9194

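#### Hypothetical queries
An LLM can also be used to generate a list of hypothetical questions that a particular document could answer. These questions can then be embedded.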

In [41]:
functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]

In [42]:
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)

In [43]:
chain.invoke(docs[0])

['What was the initial offer made by User 1 for the bag?',
 'Was User 1 interested in the bag that User 2 had originally offered?',
 'Was the blue bag that User 1 was interested in available for sale?']

In [44]:
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})

In [45]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [46]:
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )

In [47]:
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [48]:
sub_docs = vectorstore.similarity_search("justice breyer")

In [49]:
sub_docs

[Document(page_content='What steps has the President taken to secure the border and fix the immigration system?', metadata={'doc_id': 'e0fffad5-d5c0-4559-ae84-74e1f3b2e9e8'}),
 Document(page_content='What measures are being proposed to protect the rights of women and LGBTQ+ Americans?', metadata={'doc_id': 'e0fffad5-d5c0-4559-ae84-74e1f3b2e9e8'}),
 Document(page_content='How does the Bipartisan Infrastructure Law aim to rebuild America?', metadata={'doc_id': 'f6c71388-bdc1-46cd-93a6-f6fc2af4c8bf'}),
 Document(page_content='What are the expected outcomes of the Bipartisan Innovation Act in terms of technological advancements and job creation?', metadata={'doc_id': 'f6c71388-bdc1-46cd-93a6-f6fc2af4c8bf'})]

In [50]:
retrieved_docs = retriever.invoke("justice breyer")

In [51]:
len(retrieved_docs[0].page_content)

9194
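
The four question hits above carry only two distinct `doc_id` values, so several sub-documents can resolve to the same parent. The sketch below shows that resolution step in plain Python. It is an assumption-laden approximation, not `MultiVectorRetriever`'s actual code: the `resolve_parents` helper, the dict-based hits, and the dict standing in for the docstore are all hypothetical.

```python
def resolve_parents(sub_doc_hits, docstore, id_key="doc_id"):
    """Map ranked sub-document hits to unique parent documents, keeping rank order."""
    seen_ids = []
    for hit in sub_doc_hits:
        parent_id = hit["metadata"].get(id_key)
        if parent_id is not None and parent_id not in seen_ids:
            seen_ids.append(parent_id)
    # Look each parent up once; skip ids the docstore does not know about
    return [docstore[pid] for pid in seen_ids if pid in docstore]


# Hypothetical data: four question hits that point at only two parents.
hits = [
    {"page_content": "question 1", "metadata": {"doc_id": "a"}},
    {"page_content": "question 2", "metadata": {"doc_id": "a"}},
    {"page_content": "question 3", "metadata": {"doc_id": "b"}},
    {"page_content": "question 4", "metadata": {"doc_id": "b"}},
]
toy_docstore = {"a": "full text of parent A", "b": "full text of parent B"}

parents = resolve_parents(hits, toy_docstore)
print(parents)
```

Keeping the first-seen order means the parent of the best-matching question comes back first.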

### Parent Document Retriever
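When splitting documents for retrieval, there are often conflicting needs:
- You may want small documents, so that their embeddings most accurately reflect their meaning. If a document is too long, its embedding can lose meaning.
- You also want documents long enough that the context of each chunk is retained.

The `ParentDocumentRetriever` strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks, then looks up the parent IDs of those chunks and returns the larger documents.

Note that "parent document" refers to the document that a small chunk originated from. This can be either the whole raw document or a larger chunk.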
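
The index-small, return-large idea can be sketched without any LangChain classes. The example below is a toy under stated assumptions: `split_words`, `build_index`, and `retrieve_parent` are hypothetical helpers, chunks are measured in words, and word overlap stands in for embedding similarity.

```python
import uuid


def split_words(text, max_words):
    """Naive splitter: consecutive groups of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(documents, child_words, parent_words=None):
    """Return (child_index, parent_store); every child remembers its parent's id."""
    parent_store, child_index = {}, []
    for doc in documents:
        # Without a parent size, the whole document is the parent
        parents = split_words(doc, parent_words) if parent_words else [doc]
        for parent in parents:
            parent_id = str(uuid.uuid4())
            parent_store[parent_id] = parent
            for child in split_words(parent, child_words):
                child_index.append((child, parent_id))
    return child_index, parent_store


def retrieve_parent(query, child_index, parent_store):
    """Find the best child by word overlap, then return it with its parent."""
    query_words = set(query.lower().split())
    best_child, parent_id = max(
        child_index,
        key=lambda item: len(query_words & set(item[0].lower().split())),
    )
    return best_child, parent_store[parent_id]


documents = [
    "the cat sat on the mat while the dog slept by the door",
    "stock prices rose sharply after the earnings report was published today",
]
child_index, parent_store = build_index(documents, child_words=4)
child, parent = retrieve_parent("dog slept", child_index, parent_store)
print(repr(child), "->", repr(parent))
```

Passing `parent_words` makes the parents larger chunks instead of whole documents, which corresponds to the two modes shown in the rest of this section.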

In [53]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loaders = [
    TextLoader("data/whatsapp_chat.txt"),
    TextLoader("data/state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

#### Retrieving full documents
In this mode, we want to retrieve the full documents. Therefore, we only specify a child splitter.

In [54]:
# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

In [55]:
retriever.add_documents(docs, ids=None)

In [56]:
list(store.yield_keys())

['8dce3a2b-7bfe-44dd-a185-790343f5810d',
 'd30cd4c8-9dbc-4306-9be8-75450fe26ea8']

Let's now call the vector store's search functionality directly. We should see that it returns small chunks, since those are what we stored.

In [57]:
sub_docs = vectorstore.similarity_search("justice breyer")

In [58]:
print(sub_docs[0].page_content)

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.


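Let's now retrieve from the overall retriever. This should return large documents, since it returns the documents that the smaller chunks belong to.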

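#### Retrieving larger chunks
Sometimes the full documents are too big to retrieve as is. In that case, what we really want is to first split the raw documents into larger chunks, and then split those into smaller chunks. We index the smaller chunks, but at retrieval time we return the larger chunks (still not the full documents).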


In [59]:
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()

In [60]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [61]:
retriever.add_documents(docs)

In [62]:
len(list(store.yield_keys()))

23

In [63]:
sub_docs = vectorstore.similarity_search("justice breyer")

In [64]:
print(sub_docs[0].page_content)

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.


In [65]:
retrieved_docs = retriever.invoke("justice breyer")

In [66]:
len(retrieved_docs[0].page_content)

1849

In [67]:
print(retrieved_docs[0].page_content)

In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. 

We cannot let this happen. 

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. 

A former top litigator in private practice. A former federal publi

### Self-querying
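Head to Integrations for documentation on vector stores with built-in support for self-querying.

A self-querying retriever is one that, as the name suggests, has the ability to query itself. Specifically, given any natural language query, the retriever uses a query-constructing LLM chain to write a structured query and then applies that structured query to its underlying VectorStore. This allows the retriever not only to compare the user's query semantically against the contents of stored documents, but also to extract filters on document metadata from the user's query and execute those filters.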

<img src="https://python.langchain.com/assets/images/self_querying-26ac0fc8692e85bc3cd9b8640509404f.jpg">
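
To make the "filter on metadata" half concrete, the following sketch evaluates a structured filter against a few metadata dicts in plain Python. It only illustrates the idea under stated assumptions: the nested-tuple filter format and the `matches` helper are hypothetical and are not LangChain's `StructuredQuery` representation.

```python
import operator

# Comparator names mirror the ones listed in the query-constructor prompt
COMPARATORS = {
    "eq": operator.eq, "ne": operator.ne,
    "gt": operator.gt, "gte": operator.ge,
    "lt": operator.lt, "lte": operator.le,
}


def matches(metadata, flt):
    """Evaluate a nested filter such as ("and", [("gt", "year", 1990), ...])."""
    op = flt[0]
    if op == "and":
        return all(matches(metadata, sub) for sub in flt[1])
    if op == "or":
        return any(matches(metadata, sub) for sub in flt[1])
    _, attr, value = flt
    # A document that lacks the attribute never satisfies a comparison on it
    return attr in metadata and COMPARATORS[op](metadata[attr], value)


movies = [
    {"summary": "dinosaurs", "year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"summary": "dreams within dreams", "year": 2006, "rating": 8.6},
    {"summary": "toys come alive", "year": 1995, "genre": "animated"},
    {"summary": "the Zone", "year": 1979, "rating": 9.9, "genre": "thriller"},
]

high_rated = [m["summary"] for m in movies if matches(m, ("gt", "rating", 8.5))]
nineties_animated = [
    m["summary"]
    for m in movies
    if matches(m, ("and", [("gt", "year", 1990), ("lt", "year", 2005),
                           ("eq", "genre", "animated")]))
]
print(high_rated, nineties_animated)
```

In the real retriever the LLM produces the filter from the user's wording, and the vector store applies it alongside the semantic search rather than in a Python loop.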

#### Get started
For demonstration purposes we'll use a Chroma vector store. We've created a small demo set of documents that contain summaries of movies.



In [68]:
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

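#### Creating our self-querying retriever
Now we can instantiate our retriever. To do this, we need to provide some information upfront about the metadata fields that our documents support, plus a short description of the document contents.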

In [69]:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
document_content_description = "Brief summary of a movie"
llm = ChatOpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

ValueError: Self query retriever with Vector Store type <class 'langchain_chroma.vectorstores.Chroma'> not supported.

In [None]:
# This example only specifies a filter
retriever.invoke("I want to watch a movie rated higher than 8.5")
# This example specifies a query and a filter
retriever.invoke("Has Greta Gerwig directed any movies about women")
# This example specifies a composite filter
retriever.invoke("What's a highly rated (above 8.5) science fiction film?")
# This example specifies a query and composite filter
retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
)

#### Filter k

We can also use the self-query retriever to specify `k`, the number of documents to fetch.
We do this by passing `enable_limit=True` to the constructor.

In [70]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
)

# This example only specifies a relevant query
retriever.invoke("What are two movies about dinosaurs")

ValueError: Self query retriever with Vector Store type <class 'langchain_chroma.vectorstores.Chroma'> not supported.

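#### Constructing from scratch with LCEL
To see what's going on under the hood, and to have more custom control, we can reconstruct our retriever from scratch.

First, we need to create a query-construction chain. This chain takes a user query and generates a `StructuredQuery` object that captures the filters specified by the user. We provide some helper functions for creating the prompt and the output parser. They have a number of tunable parameters that we'll ignore here for simplicity.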


In [None]:
from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)

prompt = get_query_constructor_prompt(
    document_content_description,
    metadata_field_info,
)
output_parser = StructuredQueryOutputParser.from_components()
query_constructor = prompt | llm | output_parser

Let's look at the prompt:

In [None]:
print(prompt.format(query="dummy question"))

Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{
    "query": string \ text string to compare to document contents
    "filter": string \ logical condition statement for filtering documents
}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte | contain | like | in | nin): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or | not): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling date data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.

<< Example 1. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre

Structured Request:
```json
{
    "query": "teenager love",
    "filter": "and(or(eq(\"artist\", \"Taylor Swift\"), eq(\"artist\", \"Katy Perry\")), lt(\"length\", 180), eq(\"genre\", \"pop\"))"
}
```


<< Example 2. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs that were not published on Spotify

Structured Request:
```json
{
    "query": "",
    "filter": "NO_FILTER"
}
```


<< Example 3. >>
Data Source:
```json
{
    "content": "Brief summary of a movie",
    "attributes": {
    "genre": {
        "description": "The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        "type": "string"
    },
    "year": {
        "description": "The year the movie was released",
        "type": "integer"
    },
    "director": {
        "description": "The name of the movie director",
        "type": "string"
    },
    "rating": {
        "description": "A 1-10 rating for the movie",
        "type": "float"
    }
}
}
```

User Query:
dummy question

Structured Request:

In [None]:
query_constructor.invoke(
    {
        "query": "What are some sci-fi movies from the 90's directed by Luc Besson about taxi drivers"
    }
)

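The query constructor is the key element of the self-query retriever. To build a great retrieval system, you need to make sure your query constructor works well. This often requires adjusting the prompt, the examples in the prompt, the attribute descriptions, and so on. For an example that walks through refining a query constructor on some hotel inventory data, see the LangChain cookbook.

The next key element is the structured query translator. This object is responsible for translating the generic `StructuredQuery` object into a metadata filter in the syntax of the vector store you're using. LangChain comes with a number of built-in translators. To see them all, head to the Integrations section.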
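
As a rough picture of what a translator does, the sketch below converts a tiny filter tree into a `$`-operator dictionary of the kind Chroma accepts in its `where` filters. `ToyComparison`, `ToyOperation`, and `to_where` are hypothetical stand-ins written for this example. They are not LangChain's classes and not the `ChromaTranslator` implementation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ToyComparison:
    comparator: str  # e.g. "gt", "lt", "eq"
    attribute: str
    value: object


@dataclass
class ToyOperation:
    operator: str  # "and" or "or"
    arguments: List[Union["ToyOperation", ToyComparison]]


def to_where(node):
    """Recursively translate the toy filter tree into a `$`-operator dict."""
    if isinstance(node, ToyComparison):
        return {node.attribute: {f"${node.comparator}": node.value}}
    return {f"${node.operator}": [to_where(arg) for arg in node.arguments]}


toy_filter = ToyOperation(
    "and",
    [
        ToyComparison("gt", "year", 1990),
        ToyComparison("lt", "year", 2005),
        ToyComparison("eq", "genre", "animated"),
    ],
)
where = to_where(toy_filter)
print(where)
```

Each vector store has its own filter syntax, which is why the translator is a separate, swappable component.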

In [None]:
from langchain.retrievers.self_query.chroma import ChromaTranslator

retriever = SelfQueryRetriever(
    query_constructor=query_constructor,
    vectorstore=vectorstore,
    structured_query_translator=ChromaTranslator(),
)

In [None]:
retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
)

### Time-weighted vector store retriever
This retriever uses a combination of semantic similarity and time decay.

The algorithm for scoring is:

semantic_similarity + (1.0 - decay_rate) ^ hours_passed

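Notably, `hours_passed` refers to the hours elapsed since the object in the retriever was last accessed, not since it was created. This means that frequently accessed objects remain "fresh".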
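
The scoring formula can be evaluated directly to see how the decay rate shifts the ranking. This is a minimal sketch: `combined_score` is a hypothetical helper, and the similarity values (1.0 for an exact match, 0.8 for a near match) are assumed for illustration rather than measured from any embedding model.

```python
def combined_score(semantic_similarity, decay_rate, hours_passed):
    """Similarity plus an exponentially decaying recency term, per the formula above."""
    recency = (1.0 - decay_rate) ** hours_passed
    return semantic_similarity + recency


# An exact match last accessed a day ago vs. a near match accessed just now
old_exact = {"similarity": 1.0, "hours": 24.0}
new_near = {"similarity": 0.8, "hours": 0.0}

for decay_rate in (1e-25, 0.999):
    old = combined_score(old_exact["similarity"], decay_rate, old_exact["hours"])
    new = combined_score(new_near["similarity"], decay_rate, new_near["hours"])
    winner = "old exact match" if old > new else "new near match"
    print(f"decay_rate={decay_rate}: old={old:.3f}, new={new:.3f} -> {winner}")
```

With a decay rate near 0 the recency term stays at about 1 for both documents, so similarity decides the ranking. With a decay rate of 0.999 the day-old document's recency term collapses toward 0 and the freshly accessed document wins.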

In [71]:
from datetime import datetime, timedelta

import faiss
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain_community.docstore import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

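#### Low decay rate
A low decay rate (here, to be extreme, we set it close to 0) means memories will be "remembered" for longer. A decay rate of 0 means memories are never forgotten, which makes this retriever equivalent to a plain vector lookup.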


In [None]:
# Define your embedding model
embeddings_model = OpenAIEmbeddings()
# Initialize the vectorstore as empty
embedding_size = 1536
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(embeddings_model, index, InMemoryDocstore({}), {})
retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore, decay_rate=0.0000000000000000000000001, k=1
)

yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents(
    [Document(page_content="hello world", metadata={"last_accessed_at": yesterday})]
)
retriever.add_documents([Document(page_content="hello foo")])
# "Hello World" is returned first because it is most salient, and the decay rate is close to 0., meaning it's still recent enough
retriever.invoke("hello world")

#### High decay rate
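With a high decay rate (e.g., several 9's), the recency score quickly goes to 0! If you set this all the way to 1, recency is 0 for all objects, once again making this equivalent to a vector lookup.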

In [None]:
# Define your embedding model
embeddings_model = OpenAIEmbeddings()
# Initialize the vectorstore as empty
embedding_size = 1536
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(embeddings_model, index, InMemoryDocstore({}), {})
retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore, decay_rate=0.999, k=1
)

yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents(
    [Document(page_content="hello world", metadata={"last_accessed_at": yesterday})]
)
retriever.add_documents([Document(page_content="hello foo")])

# "Hello Foo" is returned first because "hello world" is mostly forgotten
retriever.invoke("hello world")

#### Virtual time
Using some utilities in LangChain, you can mock out the time component.

In [None]:
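# Import the class rather than the module, so the earlier
# `from datetime import datetime, timedelta` name is not shadowed
from datetime import datetime

from langchain.utils import mock_now

# Notice that the last access time is the mocked date time
with mock_now(datetime(2024, 2, 3, 10, 11)):
    print(retriever.invoke("hello world"))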