## Introduction to the LangChain Parent Document Retriever
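When splitting documents for retrieval, choosing a chunk size means balancing two competing goals:

- You may want small chunks, so that their embeddings reflect their meaning more precisely. If a chunk is too long, its embedding can lose specificity.
- You may also want chunks long enough to preserve more complete context.

ParentDocumentRetriever balances the two by splitting and storing documents at two levels. At retrieval time it first fetches the small chunks, then looks up the parent document each chunk belongs to, and returns those larger pieces.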
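The mechanism described above can be illustrated without any library. The following is a minimal sketch, not LangChain's implementation: `split_fixed`, `overlap_score`, `build_index`, and `retrieve_parents` are hypothetical helpers, the word-overlap score is a toy stand-in for embedding similarity, and the sample texts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    parent_id: int  # index of the parent document this chunk was cut from


def split_fixed(text: str, size: int) -> list[str]:
    """Naive fixed-width splitter, a stand-in for a real text splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: how many distinct query words occur in the text."""
    words = set(text.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in words)


def build_index(parents: list[str], child_size: int) -> list[Chunk]:
    """Cut every parent into small chunks and remember where each came from."""
    return [
        Chunk(piece, pid)
        for pid, parent in enumerate(parents)
        for piece in split_fixed(parent, child_size)
    ]


def retrieve_parents(query: str, parents: list[str], index: list[Chunk], k: int = 4) -> list[str]:
    """Rank the small chunks, then return their parents without duplicates."""
    top = sorted(index, key=lambda c: overlap_score(query, c.text), reverse=True)[:k]
    parent_ids: list[int] = []
    for chunk in top:
        if chunk.parent_id not in parent_ids:
            parent_ids.append(chunk.parent_id)
    return [parents[pid] for pid in parent_ids]


# Made-up sample data: two short "documents".
parents = [
    "CarPlay navigation apps use map templates. " * 3,
    "Audio apps use the now playing template. " * 3,
]
index = build_index(parents, child_size=40)
hits = retrieve_parents("navigation map", parents, index, k=2)
print(len(hits), len(hits[0]))
```

Matching happens on the 40-character chunks, but the caller receives the full parent text, which is the trade-off the retriever is designed around.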

In [1]:
# Basic configuration
from langchain_openai import ChatOpenAI
import os
from dotenv import load_dotenv

load_dotenv(override=True)

qw_llm_openai = ChatOpenAI(
    openai_api_base=os.getenv('DASHSCOPE_API_BASE'),
    openai_api_key=os.getenv('DASHSCOPE_API_KEY'),
    model_name="qwen2-1.5b-instruct",
    temperature=0,
    streaming=True,
)

ms_llm_openai = ChatOpenAI(
    openai_api_base=os.getenv('MOONSHOT_API_BASE'),
    openai_api_key=os.getenv('MOONSHOT_API_KEY'),
    model_name="moonshot-v1-8k",
    temperature=0,
    streaming=True,
)

cf_llm_openai = ChatOpenAI(
    openai_api_base=os.getenv('CF_API_BASE'),
    openai_api_key=os.getenv('CF_API_TOKEN'),
    model_name="@cf/meta/llama-3-8b-instruct",
    temperature=0,
    streaming=True,
)

groq_llm_openai = ChatOpenAI(
    openai_api_base=os.getenv('GROQ_API_BASE'),
    openai_api_key=os.getenv('GROQ_API_KEY'),
    model_name="llama3-8b-8192",
    temperature=0,
    streaming=True,
)

# Embedding model served by Cloudflare Workers AI
# (os and load_dotenv are already imported and called above)
from langchain_community.embeddings.cloudflare_workersai import CloudflareWorkersAIEmbeddings

embedding = CloudflareWorkersAIEmbeddings(
    account_id=os.getenv('CF_ACCOUNT_ID'),
    api_token=os.getenv('CF_API_TOKEN'),
    model_name="@cf/baai/bge-small-en-v1.5",
)

In [1]:
# !wget https://developer.apple.com/carplay/documentation/CarPlay-App-Programming-Guide.pdf

--2024-07-11 11:47:32--  https://developer.apple.com/carplay/documentation/CarPlay-App-Programming-Guide.pdf
Resolving developer.apple.com (developer.apple.com)... 17.253.87.195, 17.253.87.201
Connecting to developer.apple.com (developer.apple.com)|17.253.87.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8772250 (8.4M) [application/pdf]
Saving to: ‘CarPlay-App-Programming-Guide.pdf’


2024-07-11 11:47:34 (4.71 MB/s) - ‘CarPlay-App-Programming-Guide.pdf’ saved [8772250/8772250]



In [2]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
# Integrations and splitters now live in their own packages; the old
# langchain.* import paths for them are deprecated.
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [3]:
loader = PyMuPDFLoader('./CarPlay-App-Programming-Guide.pdf')
docs = loader.load()

In [4]:
len(docs)

66

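Parameters supported by ParentDocumentRetriever
Among the supported parameters, the two worth noting are parent_splitter and child_splitter, which specify the splitter for parent documents and the splitter for child documents respectively.

Without parent_splitter
In this case the documents are not split at two levels. Each original document is itself a parent document; here, that means each of the 66 documents loaded above.

The parent documents are stored in an InMemoryStore, and the embeddings of the child chunks are stored in a vector store. This example uses Chroma.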

In [5]:
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=embedding
)
# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

In [8]:
# Adds documents to the docstore and vectorstores
retriever.add_documents(docs, ids=None)

In [9]:
list(store.yield_keys())

['ecea4cd6-6d0c-4363-b0cb-a18495d7623d',
 '49ce29e7-b534-4c47-a171-fd5d2be34795',
 '1a77a8c2-62a7-4e96-b191-bf0f034c9244',
 'dbc974fa-f925-41fd-8fa5-4ba5f6ad1b85',
 'b184ee92-9292-4f8c-97aa-06ca6958b200',
 '1d4b8a92-ef2f-48f9-95ee-d386c04fcce1',
 'ef8b237e-d966-4d78-8022-c4cbc3721082',
 '141c4389-c1ef-49c2-bbd8-a0086303687d',
 '111e9aa6-8c3a-403f-8376-a69e0e020379',
 '689c1426-b3a6-4166-9dcd-2b107997b4ff',
 '8d9d76e3-56dd-4fa1-8b23-46d8e39fcf35',
 '829469ec-b180-42ea-bff4-89025498deff',
 '26325b33-82a2-49d8-82ca-25545fd2489c',
 'cf753ba0-268a-4cc6-be95-dfa8793471f2',
 '9ab979e9-922f-40e3-9d47-6a6fa54c5446',
 '7649a360-a750-44ca-93ca-8eff630d23bd',
 '637557f1-d8e4-4f4b-8ce9-c605d82903f4',
 'fdfb1aea-ab12-4950-8966-6e361b10acf4',
 '08a76ba5-6833-4ce6-90cb-89484994fddd',
 '19d210bd-1eac-46ee-9ecc-f31b04797440',
 '8428f1d5-440f-46df-8b2c-7800c7ecda50',
 '6848d052-3d19-48f4-bbc0-c337d3238875',
 'f158b1ed-00be-4546-a89f-26084a30d6a0',
 'd3dab67b-4fe0-4fff-9889-469f2e53daa6',
 '2c8c66af-6b3c-

In [10]:
# Similarity search directly against the vector store (returns child chunks)
sub_docs = vectorstore.similarity_search("How to build a CarPlay navigation app?")
len(sub_docs)

4

In [11]:
for sub_doc in sub_docs:
    print(len(sub_doc.page_content))

399
57
345
312


In [13]:
# Relevance query through the retriever (returns parent documents)
retrieved_docs = retriever.invoke("How to build a CarPlay navigation app?")

In [14]:
len(retrieved_docs)

3

In [15]:
for retrieved_doc in retrieved_docs:
    print(len(retrieved_doc.page_content))

4588
58
1271
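The counts above (four child chunks from the vector store, three documents from the retriever) are consistent with two of the matched chunks sharing a parent. A minimal sketch of that lookup step follows, assuming each child chunk carries its parent's ID in metadata under the key `doc_id` (the retriever's default `id_key`, as far as I know). The helper names, the dict used as a docstore, and the sample IDs are hypothetical.

```python
def parent_ids_in_rank_order(child_metadatas, id_key="doc_id"):
    """Collect parent IDs from ranked child chunks, keeping the first occurrence."""
    ids = []
    for meta in child_metadatas:
        pid = meta.get(id_key)
        if pid is not None and pid not in ids:
            ids.append(pid)
    return ids


def fetch_parents(child_metadatas, docstore, id_key="doc_id"):
    """Look up each distinct parent once; skip IDs missing from the store."""
    ids = parent_ids_in_rank_order(child_metadatas, id_key)
    return [docstore[i] for i in ids if i in docstore]


# Made-up example: four ranked child hits, two of which point at parent "p1".
child_hits = [{"doc_id": "p1"}, {"doc_id": "p2"}, {"doc_id": "p1"}, {"doc_id": "p3"}]
docstore = {"p1": "page one", "p2": "page two", "p3": "page three"}
parents_found = fetch_parents(child_hits, docstore)
print(len(child_hits), len(parents_found))
```

Because duplicates are dropped, the number of returned parents can be smaller than the number of child chunks the vector store matched.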


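With parent_splitter
In this case the documents are split at two levels. The original documents are first split into larger chunks by parent_splitter, and each of those chunks is then split into smaller chunks by child_splitter.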
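The two-level split can be sketched in plain Python. This is an illustration only: `split_fixed` is a hypothetical fixed-width splitter standing in for RecursiveCharacterTextSplitter (which prefers to break on separators such as newlines and spaces), and the 1200-character input is made up.

```python
def split_fixed(text, size):
    """Naive fixed-width splitter, a stand-in for a real text splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def two_level_split(document, parent_size=500, child_size=200):
    """Split into parent chunks, then split each parent into child chunks.

    Returns the parent chunks and a list of (parent_index, child_text) pairs.
    """
    parents = split_fixed(document, parent_size)
    children = [
        (pid, piece)
        for pid, parent in enumerate(parents)
        for piece in split_fixed(parent, child_size)
    ]
    return parents, children


document = "x" * 1200  # made-up input
parent_chunks, child_chunks = two_level_split(document)
print(len(parent_chunks), len(child_chunks))
```

With a parent splitter in place, the docstore holds these parent chunks rather than the original documents, so it contains more entries than the number of documents loaded.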

In [16]:
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=500)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
vectorstore = Chroma(collection_name="carplay_collection", embedding_function=embedding)
store = InMemoryStore()

In [17]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
     

In [18]:
retriever.add_documents(documents=docs, ids=None)

In [19]:
len(list(store.yield_keys()))

277

In [20]:
sub_docs = vectorstore.similarity_search("How to build a CarPlay navigation app?")

len(sub_docs)

4

In [21]:
for sub_doc in sub_docs:
    print(len(sub_doc.page_content))

103
148
173
173


In [22]:
retrieved_docs = retriever.invoke("How to build a CarPlay navigation app?")

In [23]:
len(retrieved_docs)

3

In [24]:
for retrieved_doc in retrieved_docs:
    print(len(retrieved_doc.page_content))

484
455
444


In [25]:
print(retrieved_docs[2].page_content)

Create a now playing template 
 
..................................................................................................30
Work while iPhone is locked 
 
.......................................................................................................31
Launch other apps 
 
.......................................................................................................................31
Build a CarPlay navigation app
