# How to use the Parent Document Retriever


The ParentDocumentRetriever in LangChain helps balance the need for precise embeddings from smaller document chunks and the context provided by larger documents.

# Why Use Parent Document Retriever?
* The Parent Document Retriever balances two competing goals in document retrieval:

   * Small Chunks for Accurate Embeddings: Smaller chunks yield embeddings that more precisely represent the content of a specific portion of the document.

   * Larger Chunks for Context: Larger chunks preserve the context that might get lost when breaking documents into smaller pieces.

# Key Workflow
* Document Loading: Load raw text documents.

* Splitting:

   * Parent Splitting: Split into medium-sized chunks for context.

   * Child Splitting: Further divide the parent chunks into smaller chunks for embedding storage.

* Storage:

   * Use a vector store (e.g., Chroma) to store embeddings of smaller chunks.

   * Use a document store (e.g., InMemoryStore) to maintain parent chunks.

* Retrieval:

   * Use embeddings to locate relevant small chunks.

   * Retrieve the corresponding parent documents or larger chunks based on the match.
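
The workflow above can be sketched without any libraries. This is a toy illustration, not LangChain's implementation: `split_words`, `build_index`, `overlap`, and `retrieve_parent` are hypothetical helpers, chunks are measured in words rather than characters, and a simple word-overlap score stands in for embedding similarity.

```python
import uuid


def split_words(text, max_words):
    """Cut text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(documents, parent_words, child_words):
    """Return (children, parents); every child records the id of its parent."""
    children = []  # stand-in for the vector store
    parents = {}   # stand-in for the document store
    for doc in documents:
        for parent_text in split_words(doc, parent_words):
            parent_id = str(uuid.uuid4())
            parents[parent_id] = parent_text
            for child_text in split_words(parent_text, child_words):
                children.append({"text": child_text, "parent_id": parent_id})
    return children, parents


def overlap(query, text):
    """Toy relevance score: how many query words occur in the text."""
    words = {w.strip(".,").lower() for w in text.split()}
    return sum(1 for w in query.lower().split() if w in words)


def retrieve_parent(query, children, parents):
    """Find the best-matching child, then return it with its whole parent."""
    best = max(children, key=lambda c: overlap(query, c["text"]))
    return best["text"], parents[best["parent_id"]]


documents = [
    "Cats sleep most of the day. They hunt at night. "
    "Dogs enjoy long walks. They also like to fetch sticks."
]
children, parents = build_index(documents, parent_words=10, child_words=5)
child, parent = retrieve_parent("fetch sticks", children, parents)
print("matched child:  ", child)
print("returned parent:", parent)
```

The match is made on the short child, but the caller receives the longer parent, which also carries the sentence about walks that the child alone would have dropped.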

In [3]:
pip install langchain_community


# Implementation Using Hugging Face Embeddings
Here’s how to use the Parent Document Retriever with a Hugging Face embedding model:

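# 1. Set Up the Environment
Make sure you have the following dependencies installed. The sentence-transformers package is what HuggingFaceEmbeddings uses to run the embedding model locally:

In [11]:
pip install langchain langchain-chroma langchain-community chromadb sentence-transformers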




# 2. Load Documents
Load your text documents using the TextLoader:

In [10]:
from langchain_community.document_loaders import TextLoader

# Load text documents
loaders = [
    TextLoader("paul_graham_essay.txt"),
    TextLoader("state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())


# 3. Create Text Splitters
Define two splitters:

* Parent Splitter: For larger chunks.
* Child Splitter: For smaller, more granular chunks.

In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Define the splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
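
To get a feel for how the two chunk sizes relate, here is a rough sketch with a naive fixed-width splitter. It is a simplification: RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line breaks, so real chunk boundaries and counts will differ. `fixed_width_chunks` is a hypothetical helper, and the 5,000-character text is a placeholder.

```python
def fixed_width_chunks(text: str, chunk_size: int) -> list:
    """Naive splitter: consecutive slices of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


text = "x" * 5000  # placeholder standing in for a 5,000-character document

# First level: parents of up to 2000 characters
parent_chunks = fixed_width_chunks(text, 2000)

# Second level: each parent is cut again into children of up to 400 characters
child_chunks_per_parent = [fixed_width_chunks(p, 400) for p in parent_chunks]

print(len(parent_chunks))                         # 3
print([len(c) for c in child_chunks_per_parent])  # [5, 5, 3]
```

With these sizes, each full parent yields about five children, so the vector store holds several times more entries than the document store.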


# 4. Initialize the Vector Store and Document Store
Create the Hugging Face embedding model and set up the two storage layers. The model used here is public, so a Hugging Face token is optional.

In [14]:

from langchain.storage import InMemoryStore
from langchain_chroma import Chroma

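# Initialize Hugging Face embeddings. The langchain.embeddings.huggingface
# path is a deprecated alias, so import the class from langchain_community
# (the separate langchain-huggingface package is its newer home).
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load a public model; it runs locally, so a token is optional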
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # Hugging Face model
    cache_folder="./models",  # Optional: Local cache for models
)

# Setup vector and document stores
vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings)
store = InMemoryStore()
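
For intuition about the document store's role, here is a dict-backed stand-in that mirrors the batch-style method names InMemoryStore exposes (`mset`, `mget`, `yield_keys`). `TinyStore` and `store_demo` are hypothetical names used only for illustration; this is not the LangChain implementation.

```python
from typing import Iterator, List, Optional, Sequence, Tuple


class TinyStore:
    """Dict-backed key-value store with batch get/set, for illustration only."""

    def __init__(self) -> None:
        self._data = {}

    def mset(self, pairs: Sequence[Tuple[str, str]]) -> None:
        """Store several (key, value) pairs at once."""
        for key, value in pairs:
            self._data[key] = value

    def mget(self, keys: Sequence[str]) -> List[Optional[str]]:
        """Look up several keys at once; missing keys come back as None."""
        return [self._data.get(key) for key in keys]

    def yield_keys(self) -> Iterator[str]:
        """Iterate over every stored key."""
        yield from self._data


store_demo = TinyStore()
store_demo.mset([("parent-1", "first parent chunk"), ("parent-2", "second parent chunk")])
print(store_demo.mget(["parent-2", "missing"]))  # ['second parent chunk', None]
print(len(list(store_demo.yield_keys())))        # 2
```

The vector store only needs to remember each parent's key; the full parent text lives here and is fetched by key at query time.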



# 5. Initialize the Retriever
Combine all components in the ParentDocumentRetriever:

In [16]:
from langchain.retrievers import ParentDocumentRetriever

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents
retriever.add_documents(docs)


# 6. Test Retrieval
Perform retrieval tasks with your query:

In [17]:
# Search small chunks in the vector store
sub_docs = vectorstore.similarity_search("justice breyer")
print(sub_docs[0].page_content)

# Retrieve larger parent documents
retrieved_docs = retriever.invoke("justice breyer")
print(retrieved_docs[0].page_content)


One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. 

We cannot let this happen. 

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President 
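
Several child chunks can point to the same parent, so the step from child hits to parents has to skip repeats. Here is a minimal sketch of that step, assuming each child stores its parent's id in its metadata under a `doc_id` key. The names `parents_for_hits`, `hits`, and `docstore` are illustrative, and a plain dict stands in for the document store.

```python
def parents_for_hits(hits, docstore, id_key="doc_id"):
    """Map ranked child hits to their parents, keeping each parent once."""
    ordered_ids = []
    for hit in hits:
        parent_id = hit["metadata"].get(id_key)
        # Keep the first occurrence so parents stay in ranking order
        if parent_id is not None and parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    # Ignore ids that are missing from the store
    return [docstore[i] for i in ordered_ids if i in docstore]


docstore = {"p1": "parent chunk one", "p2": "parent chunk two"}
hits = [
    {"text": "child a", "metadata": {"doc_id": "p2"}},
    {"text": "child b", "metadata": {"doc_id": "p1"}},
    {"text": "child c", "metadata": {"doc_id": "p2"}},  # same parent as child a
]
print(parents_for_hits(hits, docstore))
# ['parent chunk two', 'parent chunk one']
```

One consequence is that a query can return fewer parent documents than the number of child chunks matched in the vector store.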

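# Key Features to Test
Small Chunk Retrieval: Confirm that a direct similarity search on the vector store returns short child chunks (up to 400 characters here) that closely match the query.

Parent Retrieval: Confirm that the retriever returns the larger parent chunks (up to 2000 characters here) that contain those matches.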
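
These two checks can be written as plain comparisons on the returned text. A hedged sketch: `check_chunks` is a hypothetical helper, the size limits mirror the chunk_size values used earlier, and the containment test assumes the child text appears verbatim in its parent apart from whitespace differences, which may not hold for every splitter configuration.

```python
def check_chunks(child_text: str, parent_text: str,
                 child_size: int = 400, parent_size: int = 2000) -> dict:
    """Report whether a child/parent pair looks like a valid match."""
    def normalize(s: str) -> str:
        return " ".join(s.split())  # collapse runs of whitespace

    return {
        "child_within_limit": len(child_text) <= child_size,
        "parent_within_limit": len(parent_text) <= parent_size,
        "parent_is_longer": len(parent_text) > len(child_text),
        "child_inside_parent": normalize(child_text) in normalize(parent_text),
    }


# Placeholder strings; in the notebook these would be
# sub_docs[0].page_content and retrieved_docs[0].page_content.
child_text = "Justice Breyer, thank you for your service."
parent_text = (
    "Tonight, I'd like to honor someone.  "
    "Justice Breyer, thank you for your service. One more line."
)
report = check_chunks(child_text, parent_text)
print(report)
```

If any value in the report is False, the splitter sizes or the link between child and parent are worth inspecting before trusting the retrieval results.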

# Benefits of Using Hugging Face Embeddings
By using an open Hugging Face model (a token is optional for public models), you can:

* Avoid dependency on commercial API keys like OpenAI.

* Leverage pre-trained transformer-based embeddings.

* Maintain full control of local or cloud-based document processing.