# Retrievers
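A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.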

Retrievers accept a string query as input and return a list of `Document` objects as output.
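
As a rough illustration of this contract, the sketch below uses only the Python standard library. The `Document` dataclass, the `Retriever` protocol, and `KeywordRetriever` are hypothetical stand-ins, not LangChain classes. The point is only that a retriever maps a string query to a list of documents and can leave storage to something else.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's `Document`."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class Retriever(Protocol):
    """Anything that maps a string query to a list of documents."""

    def invoke(self, query: str) -> List[Document]: ...


class KeywordRetriever:
    """Toy retriever that stores nothing; candidates come from a callable source."""

    def __init__(self, source: Callable[[], List[Document]]) -> None:
        self.source = source

    def invoke(self, query: str) -> List[Document]:
        terms = set(query.lower().split())
        # Keep documents that share at least one word with the query.
        return [
            doc
            for doc in self.source()
            if terms & set(doc.page_content.lower().split())
        ]


corpus = [
    Document("Retrievers return documents for a query"),
    Document("Vector stores index embeddings"),
]
retriever: Retriever = KeywordRetriever(lambda: corpus)
results = retriever.invoke("what do vector stores index")
print([doc.page_content for doc in results])
```

Word overlap is used here purely to keep the example dependency-free; any ranking strategy fits behind the same `invoke` signature.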

## Advanced Retrieval Types
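LangChain provides several advanced retrieval types. The full list is below, along with the following information for each:

Name: the name of the retrieval algorithm.

Index Type: which index type (if any) it relies on.

Uses an LLM: whether the retrieval method uses an LLM.

When to Use: our commentary on when you should consider using this retrieval method.

Description: a description of what the retrieval algorithm does.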



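| Name | Index Type | Uses an LLM | When to Use | Description |
| --- | --- | --- | --- | --- |
| Vectorstore | Vectorstore | No | If you are just getting started and want something quick and easy. | This is the simplest method and the easiest to get started with. It involves creating an embedding for each piece of text. |
| ParentDocument | Vectorstore + Document Store | No | If your pages contain many small, distinct pieces of information that are best indexed separately but best retrieved together. | This indexes multiple chunks for each document. It finds the chunks that are most similar in embedding space, but retrieves and returns the whole parent document rather than the individual chunks. |
| Multi Vector | Vectorstore + Document Store | Sometimes during indexing | If you can extract information from documents that you think is more relevant to index than the text itself. | This creates multiple vectors for each document. Each vector can be created in many ways; examples include summaries of the text and hypothetical questions. |
| Self Query | Vectorstore | Yes | If users ask questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically and (2) a metadata filter to go with it. This is useful because questions are often about the metadata of documents rather than the content itself. |
| Contextual Compression | Any | Sometimes | If you find that retrieved documents contain too much irrelevant information and are distracting the LLM. | This puts a post-processing step on top of another retriever and extracts only the most relevant information from the retrieved documents. This can be done with embeddings or an LLM. |
| Time-Weighted Vectorstore | Vectorstore | No | If you have timestamps associated with your documents and want to retrieve the most recent ones. | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at the timestamps of indexed documents). |
| Multi-Query Retriever | Any | Yes | If users ask complex questions that require multiple distinct pieces of information to answer. | This uses an LLM to generate multiple queries from the original one. This is useful when the original query needs pieces of information about multiple topics to be answered properly. By generating multiple queries, we can fetch documents for each of them. |
| Ensemble | Any | No | If you have multiple retrieval methods and want to try combining them. | This fetches documents from multiple retrievers and then combines them. |
| Long-Context Reorder | Any | No | If you are working with a long-context model and notice that it is not paying attention to information in the middle of the retrieved documents. | This fetches documents from an underlying retriever and then reorders them so that the most similar ones are near the beginning and the end. This is useful because longer-context models have been shown to sometimes ignore information in the middle of the context window. |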
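
The reordering idea in the Long-Context Reorder row can be sketched in a few lines. This is a minimal illustration, assuming the input is already sorted from most to least relevant; `reorder_for_long_context` is a hypothetical helper name, not a LangChain API.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reorder_for_long_context(ranked: Sequence[T]) -> List[T]:
    """Put the most relevant items at both ends and the least relevant in the middle.

    `ranked` is assumed to be sorted from most to least relevant.
    """
    reordered: List[T] = []
    # Walk from least to most relevant, alternating between front and back,
    # so the items placed last (the most relevant) end up at the edges.
    for i, item in enumerate(reversed(ranked)):
        if i % 2 == 1:
            reordered.append(item)
        else:
            reordered.insert(0, item)
    return reordered


ranked_docs = ["doc1", "doc2", "doc3", "doc4", "doc5"]  # doc1 is the most relevant
print(reorder_for_long_context(ranked_docs))
# ['doc1', 'doc3', 'doc5', 'doc4', 'doc2']
```

The least relevant item (`doc5`) lands in the middle of the list, where a long-context model is most likely to overlook it.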


In [3]:
from dotenv import load_dotenv, find_dotenv
from langchain.globals import set_debug
import os
load_dotenv(find_dotenv())
set_debug(False)

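### [Third-Party Integrations](https://python.langchain.com/docs/integrations/retrievers/)
LangChain also integrates with many third-party retrieval services. For the full list, see the integrations page linked in the heading above.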

### Using Retrievers in LCEL
Since retrievers are `Runnable` objects, we can easily compose them with other `Runnable` objects.

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI()


def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])


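# NOTE: `retriever` must already exist before this cell runs; one is built with
# `db.as_retriever()` in the vector store retriever section of this notebook.
chain = (
    # The retriever fills {context}; the raw question is passed through unchanged.
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)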

chain.invoke("What did the president say about technology?")


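## Vector Store Retriever
A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class that makes it conform to the retriever interface. It uses the search methods implemented by the vector store, such as similarity search and MMR, to query the texts in the vector store.

Once you have constructed a vector store, constructing a retriever is easy. Let's walk through an example.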

In [4]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/state_of_the_union.txt")

In [5]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(texts, embeddings)

In [6]:
retriever = db.as_retriever()

In [7]:
docs = retriever.invoke("what did he say about ketanji brown jackson")

In [8]:
docs

[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'data/state_of_the_union.txt'}),
 Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of publ

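### Maximal Marginal Relevance Retrieval
By default, the vector store retriever uses similarity search. If the underlying vector store supports maximal marginal relevance (MMR) search, you can specify that as the search type.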


In [9]:
retriever = db.as_retriever(search_type="mmr")
docs = retriever.invoke("what did he say about ketanji brown jackson")
docs

[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'data/state_of_the_union.txt'}),
 Document(page_content='One was stationed at bases and breathing in toxic smoke from “burn pits” that incinerated wastes of war
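
To make the difference from plain similarity search concrete, here is a minimal standard-library sketch of MMR selection. It assumes the common formulation in which each candidate is scored as `lambda_mult * relevance - (1 - lambda_mult) * redundancy`. The function names and toy vectors are illustrative only and are not taken from the FAISS or LangChain implementations.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Vector, docs: Sequence[Vector], k: int = 2, lambda_mult: float = 0.5
) -> List[int]:
    """Return indices of up to k documents, trading relevance against redundancy."""
    selected: List[int] = []
    candidates = list(range(len(docs)))

    def score(i: int) -> float:
        relevance = cosine(query, docs[i])
        # Redundancy: highest similarity to any document already selected.
        redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [[0.8, 0.6], [0.8, 0.6], [0.6, -0.8]]  # documents 0 and 1 are duplicates
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to ordinary similarity ranking, so the duplicate document is returned. With `0.5` the duplicate is skipped in favor of a less similar but more diverse document.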

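### Similarity Score Threshold Retrieval
You can also set a retrieval method that applies a similarity score threshold and returns only documents with a score above that threshold.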



In [10]:
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}
)
docs = retriever.invoke("what did he say about ketanji brown jackson")

In [11]:
docs

[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'data/state_of_the_union.txt'}),
 Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of publ
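
Conceptually, the threshold acts as a filter over scored results. The sketch below assumes relevance scores normalized to the range 0 to 1 and an inclusive comparison; `filter_by_score` and the sample scores are hypothetical, not values produced by the cell above.

```python
from typing import List, Tuple


def filter_by_score(
    scored_docs: List[Tuple[str, float]], score_threshold: float
) -> List[str]:
    """Keep documents scoring at or above the threshold, best first."""
    # Assumption: an inclusive (>=) comparison on normalized relevance scores.
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept]


scored = [("chunk A", 0.57), ("chunk B", 0.41), ("chunk C", 0.82)]
print(filter_by_score(scored, score_threshold=0.5))  # ['chunk C', 'chunk A']
```

Unlike a fixed `k`, a threshold can return any number of documents, including none when nothing is relevant enough.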

### Specifying top k

You can also specify search keyword arguments such as `k` to use when doing retrieval.

In [12]:
retriever = db.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke("what did he say about ketanji brown jackson")
len(docs)

1

## MultiQueryRetriever
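Distance-based vector database retrieval embeds (represents) queries in high-dimensional space and finds similar embedded documents based on "distance". However, retrieval may produce different results with subtle changes in query wording, or when the embeddings do not capture the semantics of the data well. Prompt engineering or tuning is sometimes done to address these problems manually, but it can be tedious.

The `MultiQueryRetriever` automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents. By generating multiple perspectives on the same question, the `MultiQueryRetriever` may be able to overcome some of the limitations of distance-based retrieval and get a richer set of results.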
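
The "unique union" step can be sketched without an LLM or a vector store. In the snippet below, `fake_llm_queries` and `fake_index` are hypothetical stand-ins for the query-generating LLM and the underlying retriever; only the de-duplication logic is the point.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Run every generated query and return the de-duplicated union of results."""
    unique: Dict[str, None] = {}  # a dict keeps insertion order, so output is stable
    for query in generate_queries(question):
        for doc in retrieve(query):
            unique.setdefault(doc, None)
    return list(unique)


def fake_llm_queries(question: str) -> List[str]:
    """Stand-in for the LLM: returns the question plus two rephrasings."""
    return [question, f"{question} methods", f"{question} strategies"]


# Stand-in for the underlying retriever: a fixed query -> documents mapping.
fake_index = {
    "task decomposition": ["chunk A", "chunk B"],
    "task decomposition methods": ["chunk B", "chunk C"],
    "task decomposition strategies": ["chunk A"],
}

merged = multi_query_retrieve(
    "task decomposition", fake_llm_queries, lambda q: fake_index.get(q, [])
)
print(merged)  # ['chunk A', 'chunk B', 'chunk C']
```

Each chunk appears once even though two of the queries return overlapping results.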

In [24]:
# Build a sample vectorDB
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load blog post
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# VectorDB
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

Simple usage: specify the LLM to use for query generation, and the retriever does the rest.

In [25]:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

question = "What are the approaches to Task Decomposition?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)

In [26]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [27]:
unique_docs = retriever_from_llm.invoke(question)
len(unique_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. How can Task Decomposition be achieved through different methods?', '2. What strategies are commonly used for Task Decomposition?', '3. What are the various techniques for breaking down tasks in Task Decomposition?']


2

You can also supply a prompt along with an output parser that splits the results into a list of queries.

In [28]:
from typing import List

from langchain_core.output_parsers import BaseOutputParser
from langchain_core.prompts import PromptTemplate


# Output parser that splits the LLM result into a list of queries.
# BaseOutputParser is used rather than PydanticOutputParser because the latter
# expects JSON, while the model returns plain newline-separated text.
class LineListOutputParser(BaseOutputParser[List[str]]):
    """Parse newline-separated LLM output into a list of non-empty lines."""

    def parse(self, text: str) -> List[str]:
        lines = text.strip().split("\n")
        return [line.strip() for line in lines if line.strip()]


output_parser = LineListOutputParser()

QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five 
    different versions of the given user question to retrieve relevant documents from a vector 
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search. 
    Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
llm = ChatOpenAI(temperature=0)

# Chain: prompt -> LLM -> list of query strings.
# Assumes a LangChain release in which MultiQueryRetriever accepts an LCEL
# runnable as `llm_chain`.
llm_chain = QUERY_PROMPT | llm | output_parser

# Other inputs
question = "What are the approaches to Task Decomposition?"

In [29]:
# Run
retriever = MultiQueryRetriever(
    retriever=vectordb.as_retriever(), llm_chain=llm_chain
)

# Results
unique_docs = retriever.invoke(question)
len(unique_docs)

### Contextual compression
