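# Structured Hierarchical Retrieval

Doing RAG well over multiple documents is hard. A general framework is, given a user query, to first select the relevant documents and then select the content inside them.

But selecting the documents can be tough - how can we dynamically select documents based on different properties, depending on the user query?

In this notebook we show you our multi-document RAG architecture:

- Represent each document as a concise **metadata** dictionary containing different properties: an extracted summary along with structured metadata.
- Store this metadata dictionary as filters within a vector database.
- Given a user query, first do **auto-retrieval** - infer the relevant semantic query and the set of filters to query this data (effectively combining text-to-SQL and semantic search).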
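
To make the two-stage idea concrete before bringing in any libraries, here is a minimal, self-contained sketch. It is not the LlamaIndex implementation: `DocSummary` and `select_documents` are hypothetical names, the sample summaries are invented, and word overlap stands in for embedding similarity.

```python
from dataclasses import dataclass, field


@dataclass
class DocSummary:
    """Hypothetical stand-in for one document's metadata entry."""

    doc_id: int
    summary: str
    metadata: dict = field(default_factory=dict)


def select_documents(summaries, filters, query, top_k=2):
    """Apply exact-match filters, then rank the survivors by word overlap."""
    query_words = set(query.lower().split())
    candidates = [
        s
        for s in summaries
        if all(s.metadata.get(k) == v for k, v in filters.items())
    ]
    # Word overlap is a crude placeholder for embedding similarity.
    ranked = sorted(
        candidates,
        key=lambda s: len(query_words & set(s.summary.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


# Invented sample data, for illustration only.
summaries = [
    DocSummary(1, "agent crashes when tools are missing", {"state": "open"}),
    DocSummary(2, "update docs for vector stores", {"state": "open"}),
    DocSummary(3, "agent memory leak fixed", {"state": "closed"}),
]

selected = select_documents(summaries, {"state": "open"}, "agent issues", top_k=1)
print([s.doc_id for s in selected])
```

The structured filter narrows the candidate set first, so the semantic ranking only has to discriminate among documents that already satisfy the hard constraints.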


In [None]:
%pip install llama-index-readers-github
%pip install llama-index-vector-stores-weaviate
%pip install llama-index-llms-openai

In [None]:
!pip install llama-index llama-hub

## Setup and Download Data

In this section, we'll load in LlamaIndex GitHub issues.


In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
import os

os.environ["GITHUB_TOKEN"] = "ghp_..."
os.environ["OPENAI_API_KEY"] = "sk-..."

In [None]:
import os

from llama_index.readers.github import (
    GitHubRepositoryIssuesReader,
    GitHubIssuesClient,
)

github_client = GitHubIssuesClient()
loader = GitHubRepositoryIssuesReader(
    github_client,
    owner="run-llama",
    repo="llama_index",
    verbose=True,
)

orig_docs = loader.load_data()

limit = 100

docs = []
for idx, doc in enumerate(orig_docs):
    doc.metadata["index_id"] = int(doc.id_)
    if idx >= limit:
        break
    docs.append(doc)

Found 100 issues in the repo page 1
Resulted in 100 documents
Found 100 issues in the repo page 2
Resulted in 200 documents
Found 100 issues in the repo page 3
Resulted in 300 documents
Found 64 issues in the repo page 4
Resulted in 364 documents
No more issues found, stopping


## Setup the Vector Store and Index


In [None]:
import weaviate

# cloud
auth_config = weaviate.AuthApiKey(
    api_key="XRa15cDIkYRT7AkrpqT6jLfE4wropK1c1TGk"
)
client = weaviate.Client(
    "https://llama-index-test-v0oggsoz.weaviate.network",
    auth_client_secret=auth_config,
)

class_name = "LlamaIndex_docs"

In [None]:
# optional: delete schema
client.schema.delete_class(class_name)

In [None]:
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name=class_name
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [None]:
doc_index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context
)

## Create IndexNodes for Retrieval and Filtering


In [None]:
from llama_index.core import SummaryIndex
from llama_index.core.async_utils import run_jobs
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import IndexNode
from llama_index.core.vector_stores import (
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)


async def aprocess_doc(doc, include_summary: bool = True):
    """Process doc."""
    metadata = doc.metadata

    # split the ISO timestamp (e.g. 2024-01-11T20:37:34Z) into date parts
    date_tokens = metadata["created_at"].split("T")[0].split("-")
    year = int(date_tokens[0])
    month = int(date_tokens[1])
    day = int(date_tokens[2])

    assignee = (
        "" if "assignee" not in doc.metadata else doc.metadata["assignee"]
    )
    # the size comes from a label such as "size:XL", if one is present
    size = ""
    if len(doc.metadata["labels"]) > 0:
        size_arr = [l for l in doc.metadata["labels"] if "size:" in l]
        size = size_arr[0].split(":")[1] if len(size_arr) > 0 else ""
    new_metadata = {
        "state": metadata["state"],
        "year": year,
        "month": month,
        "day": day,
        "assignee": assignee,
        "size": size,
    }

    # now extract out summary
    summary_index = SummaryIndex.from_documents([doc])
    query_str = "Give a one-sentence concise summary of this issue."
    query_engine = summary_index.as_query_engine(
        llm=OpenAI(model="gpt-3.5-turbo")
    )
    summary_txt = await query_engine.aquery(query_str)
    summary_txt = str(summary_txt)

    index_id = doc.metadata["index_id"]
    # filter for the specific doc id
    filters = MetadataFilters(
        filters=[
            MetadataFilter(
                key="index_id", operator=FilterOperator.EQ, value=int(index_id)
            ),
        ]
    )

    # create an index node using the summary text
    index_node = IndexNode(
        text=summary_txt,
        metadata=new_metadata,
        obj=doc_index.as_retriever(filters=filters),
        index_id=doc.id_,
    )

    return index_node


async def aprocess_docs(docs):
    """Process metadata on docs."""
    index_nodes = []
    tasks = []
    for doc in docs:
        task = aprocess_doc(doc)
        tasks.append(task)

    index_nodes = await run_jobs(tasks, show_progress=True, workers=3)

    return index_nodes

In [None]:
index_nodes = await aprocess_docs(docs)

100%|██████████| 100/100 [00:36<00:00,  2.71it/s]


In [None]:
index_nodes[5].metadata

{'state': 'open',
 'year': 2024,
 'month': 1,
 'day': 13,
 'assignee': '',
 'size': 'XL'}
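
The metadata derivation can be checked in isolation, without calling an LLM or a vector store. The sketch below mirrors the parsing logic of `aprocess_doc` as a plain function; `extract_issue_metadata` is a hypothetical helper name, and the sample input reuses field values that appear in this notebook's output.

```python
def extract_issue_metadata(metadata: dict) -> dict:
    """Derive filterable fields from a raw GitHub-issue metadata dict."""
    # "2024-01-11T20:37:34Z" -> "2024-01-11" -> (2024, 1, 11)
    date_part = metadata["created_at"].split("T")[0]
    year, month, day = (int(token) for token in date_part.split("-"))

    # Labels such as "size:XL" carry the size; default to "" when absent.
    size_labels = [
        label for label in metadata.get("labels", []) if "size:" in label
    ]
    size = size_labels[0].split(":")[1] if size_labels else ""

    return {
        "state": metadata["state"],
        "year": year,
        "month": month,
        "day": day,
        "assignee": metadata.get("assignee", ""),
        "size": size,
    }


raw = {
    "state": "open",
    "created_at": "2024-01-11T20:37:34Z",
    "labels": ["size:XXL"],
}
print(extract_issue_metadata(raw))
```

Storing `year`, `month`, and `day` as separate integers is what lets a later filter such as "issues on 01/11" become simple equality checks.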

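## Create the Top-Level AutoRetriever

We load both the summarized metadata and the original docs into the vector database.
1. **Summarized metadata**: goes into the `LlamaIndex_auto` collection.
2. **Original docs**: go into the `LlamaIndex_docs` collection.

Storing both the summarized metadata and the original docs lets us execute our structured, hierarchical retrieval strategy.

We load into a vector database that supports auto-retrieval.


### Load Summarized Metadata

This goes into `LlamaIndex_auto`.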


In [None]:
import weaviate

# cloud
auth_config = weaviate.AuthApiKey(
    api_key="XRa15cDIkYRT7AkrpqT6jLfE4wropK1c1TGk"
)
client = weaviate.Client(
    "https://llama-index-test-v0oggsoz.weaviate.network",
    auth_client_secret=auth_config,
)

class_name = "LlamaIndex_auto"

In [None]:
# optional: delete schema
client.schema.delete_class(class_name)

In [None]:
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

vector_store_auto = WeaviateVectorStore(
    weaviate_client=client, index_name=class_name
)
storage_context_auto = StorageContext.from_defaults(
    vector_store=vector_store_auto
)

In [None]:
# Since "index_nodes" are concise summaries, we can directly feed them as objects into VectorStoreIndex
index = VectorStoreIndex(
    objects=index_nodes, storage_context=storage_context_auto
)

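## Setup the Composable Auto-Retriever

In this section we set up our auto-retriever. There are a few steps that we need to perform.

1. **Define the schema**: Define the vector database schema (e.g. the metadata fields). This is put into the LLM input prompt and is used to decide which metadata filters to infer.
2. **Instantiate the VectorIndexAutoRetriever class**: This creates a retriever on top of our summarized metadata index and takes the defined schema as input.
3. **Define a wrapper retriever**: This lets us post-process each node into an `IndexNode`, with an index ID linking back to the source document. This enables recursive retrieval in the next section, which depends on `IndexNode` objects linking to downstream retrievers, query engines, or other nodes. **NOTE**: We are working on improving this abstraction.

Running this retriever retrieves based on our text summaries and the metadata of our top-level `IndexNode` objects. Their underlying retrievers are then used to retrieve content from the specific GitHub issue.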
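
The recursive step described above can be pictured with a small pure-Python sketch. This is an illustration of the linking idea only, not LlamaIndex's internals: `make_issue_retriever`, `top_level`, and `recursive_retrieve` are hypothetical names, the issue IDs and text are invented placeholders, and substring matching stands in for vector search.

```python
from typing import Callable, Dict, List


def make_issue_retriever(chunks: List[str]) -> Callable[[str], List[str]]:
    """Return a hypothetical per-issue retriever over that issue's chunks."""

    def retrieve(query: str) -> List[str]:
        words = query.lower().split()
        return [c for c in chunks if any(w in c.lower() for w in words)]

    return retrieve


# Top level: one summary entry per issue, each linked to its own retriever.
# IDs and text below are invented placeholders.
top_level: Dict[int, dict] = {
    1: {
        "summary": "client compatibility problem",
        "retriever": make_issue_retriever(
            ["New client version breaks imports", "Added a version check"]
        ),
    },
    2: {
        "summary": "agent feature request",
        "retriever": make_issue_retriever(
            ["Implement a tree search agent", "Needs shared utilities"]
        ),
    },
}


def recursive_retrieve(query: str, selected_ids: List[int]) -> List[str]:
    """Follow each selected summary entry down to its linked retriever."""
    results: List[str] = []
    for issue_id in selected_ids:
        results.extend(top_level[issue_id]["retriever"](query))
    return results


print(recursive_retrieve("agent", [1, 2]))
```

In the notebook, the role of the `retriever` entry is played by the `obj` attached to each `IndexNode`, which is a retriever over `doc_index` restricted to a single `index_id`.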


### 1. Define the Schema


In [None]:
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo


vector_store_info = VectorStoreInfo(
    content_info="Github Issues",
    metadata_info=[
        MetadataInfo(
            name="state",
            description="Whether the issue is `open` or `closed`",
            type="string",
        ),
        MetadataInfo(
            name="year",
            description="The year issue was created",
            type="integer",
        ),
        MetadataInfo(
            name="month",
            description="The month issue was created",
            type="integer",
        ),
        MetadataInfo(
            name="day",
            description="The day issue was created",
            type="integer",
        ),
        MetadataInfo(
            name="assignee",
            description="The assignee of the ticket",
            type="string",
        ),
        MetadataInfo(
            name="size",
            description="How big the issue is (XS, S, M, L, XL, XXL)",
            type="string",
        ),
    ],
)
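
To give a feel for how a schema like this might reach the LLM, here is a rough sketch. It is an assumption for illustration, not the prompt template that `VectorIndexAutoRetriever` actually uses: `build_filter_prompt` is a hypothetical function, and the plain-dict `schema` mirrors only a few of the fields defined above.

```python
import json

# Hypothetical plain-dict mirror of a few fields from the schema above.
schema = {
    "content_info": "Github Issues",
    "metadata_info": [
        {"name": "state", "type": "string"},
        {"name": "month", "type": "integer"},
        {"name": "day", "type": "integer"},
    ],
}


def build_filter_prompt(schema: dict, user_query: str) -> str:
    """Embed the schema in an instruction asking for a query plus filters."""
    return (
        "Data source schema:\n"
        f"{json.dumps(schema, indent=2)}\n\n"
        f"User query: {user_query}\n"
        "Return a semantic query string and a list of metadata filters."
    )


prompt = build_filter_prompt(schema, "Tell me about some issues on 01/11")
print(prompt)
```

The field descriptions and types matter because they are the only information the LLM has about which filters are valid and what values they accept.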

### 2. Instantiate VectorIndexAutoRetriever


In [None]:
from llama_index.core.retrievers import VectorIndexAutoRetriever

retriever = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    similarity_top_k=2,
    empty_query_top_k=10,  # if only metadata filters are specified, this is the limit
    verbose=True,
)

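## Try It Out

Now we can start retrieving relevant context over GitHub issues!

To complete the RAG pipeline setup, we'll combine our recursive retriever with our `RetrieverQueryEngine` to generate a response in addition to the retrieved nodes.


### Try Out Retrieval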


In [None]:
from llama_index.core import QueryBundle

nodes = retriever.retrieve(QueryBundle("Tell me about some issues on 01/11"))

Using query str: issues
Using filters: [('day', '==', '11'), ('month', '==', '01')]
Retrieval entering 9995: VectorIndexRetriever
Retrieving from object VectorIndexRetriever with query issues
Retrieval entering 9985: VectorIndexRetriever
Retrieving from object VectorIndexRetriever with query issues

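The result is the source chunks from the relevant docs.

Let's look at the date attached to the source chunk (it was present in the original metadata).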


In [None]:
print(f"Number of source nodes: {len(nodes)}")
nodes[0].node.metadata

Number of source nodes: 2


{'state': 'open',
 'created_at': '2024-01-11T20:37:34Z',
 'url': 'https://api.github.com/repos/run-llama/llama_index/issues/9995',
 'source': 'https://github.com/run-llama/llama_index/pull/9995',
 'labels': ['size:XXL'],
 'index_id': 9995}
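
The `created_at` value confirms that the date filter matched. One detail worth noting: the retrieval log printed the inferred filter values in quotes (`'11'`, `'01'`), while the schema declares `day` and `month` as integers. Whether that matters depends on how the vector store compares values. If you need to normalize inferred filters yourself, a minimal sketch could look like the following; `FIELD_TYPES` and `coerce_filters` are hypothetical names, not part of LlamaIndex.

```python
# Declared types, copied from the schema defined earlier in this notebook.
FIELD_TYPES = {
    "state": "string",
    "year": "integer",
    "month": "integer",
    "day": "integer",
    "assignee": "string",
    "size": "string",
}


def coerce_filters(filters):
    """Cast each (key, operator, value) triple to its declared type."""
    coerced = []
    for key, operator, value in filters:
        if FIELD_TYPES.get(key) == "integer":
            value = int(value)  # "01" becomes 1
        else:
            value = str(value)
        coerced.append((key, operator, value))
    return coerced


inferred = [("day", "==", "11"), ("month", "==", "01")]
print(coerce_filters(inferred))
```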

### Plug into `RetrieverQueryEngine`

We plug the retriever into `RetrieverQueryEngine` to synthesize a result.


In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)

In [None]:
response = query_engine.query("Tell me about some issues on 01/11")

Using query str: issues
Using filters: [('day', '==', '11'), ('month', '==', '01')]
Retrieval entering 9995: VectorIndexRetriever
Retrieving from object VectorIndexRetriever with query issues
Retrieval entering 9985: VectorIndexRetriever
Retrieving from object VectorIndexRetriever with query issues

In [None]:
print(str(response))

There are two issues that were created on 01/11. The first issue is related to ensuring backwards compatibility with the new Pinecone client version bifurcation. The second issue is a feature request to implement the Language Agent Tree Search (LATS) agent in llama-index.


In [None]:
response = query_engine.query(
    "Tell me about some open issues related to agents"
)

Using query str: agents
Using filters: [('state', '==', 'open')]
Retrieval entering 10058: VectorIndexRetriever
Retrieving from object VectorIndexRetriever with query agents
Retrieval entering 9899: VectorIndexRetriever
Retrieving from object VectorIndexRetriever with query agents

In [None]:
print(str(response))

There are two open issues related to agents. One issue is about adding context for agents, updating a stale link, and adding a notebook to demo a react agent with context. The other issue is a feature request for parallelism when using the top agent from a multi-document agent while comparing multiple documents.


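## Concluding Thoughts

This shows you how to create a structured retrieval layer over your document summaries, allowing you to dynamically pull in the relevant documents based on the user query.

You may notice similarities between this and our [multi-document agents](https://docs.llamaindex.ai/en/stable/examples/agent/multi_document_agents.html). Both architectures aim for powerful multi-document retrieval.

The goal of this notebook is to show you how to apply structured querying in a multi-document setting. You can also apply this auto-retrieval algorithm to our multi-agent setup. The multi-agent setup focuses primarily on adding agentic reasoning across documents and within each document, allowing multi-part queries using chain-of-thought.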
