# Comparing LangChain Retrievers: Basic, Parent, and MMR

When building a RAG (Retrieval-Augmented Generation) pipeline, most people obsess over embeddings and LLM choice.  
But **retrievers are just as important**: they decide what information the model sees.  

This notebook shows how different retrievers (Basic, Parent Document, MMR, and a Parent + MMR hybrid) behave in practice, and when to use each.


## 0. Setup and Imports

First, let's import all the necessary libraries and set up our environment.


In [1]:
# Core LangChain imports
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.docstore.document import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
import os
import sys
import warnings
import random
warnings.filterwarnings('ignore')

# Add project root to path
sys.path.append('..')
import config

print("✅ All imports successful!")




✅ All imports successful!


### Download and Load Embedding Model

In [None]:
# Initialize embedding model
print("🔤 Initializing embedding model...")
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': config.DEVICE},
    cache_folder=config.EMBEDDING_MODEL_CACHE
)
print("✅ Embedding model ready!")


🔤 Initializing embedding model...


No sentence-transformers model found with name sentence-transformers/all-MiniLM-L6-v2. Creating a new one with mean pooling.
Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 13] Permission denied: '/path'


OSError: PermissionError at /path when downloading sentence-transformers/all-MiniLM-L6-v2. Check cache directory permissions. Common causes: 1) another user is downloading the same model (please wait); 2) a previous download was canceled and the lock file needs manual removal.

### Optional: Use EmbeddingGemma from the 🤗 Hub

Uncomment the code below to download `google/embeddinggemma-300m` from the 🤗 Hub instead. It also selects the "query" and "document" prompts for encoding queries and documents, respectively.


In [None]:

# embeddings = HuggingFaceEmbeddings(
#     model_name="google/embeddinggemma-300m",
#     query_encode_kwargs={"prompt_name": "query"},
#     encode_kwargs={"prompt_name": "document"},
#     cache_folder=config.EMBEDDING_MODEL_CACHE
# )

# 1. Basic vs Parent Document Retrieval

## 1.1 Data Preparation

Let's load and prepare our document for both retrieval methods.


### Data Used for Basic vs Parent Document Retrieval

**Source**: Business Conduct Policy PDF (20 pages)

**Content**: Apple corporate policies, trademarks, intellectual property guidelines

**Chunking Strategy**: 400-character chunks with 150-character overlap for basic retrieval

**Why This Data**: Corporate documents where context preservation is crucial for accurate policy interpretation
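
To make the chunking numbers concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification under stated assumptions: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, so its real chunk boundaries will differ. The function name `sliding_chunks` is hypothetical and not part of LangChain.

```python
def sliding_chunks(text: str, chunk_size: int = 400, chunk_overlap: int = 150) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters.

    Hypothetical helper for illustration only (not a LangChain API).
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # with 400/150, each window starts 250 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 1000
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])  # → [400, 400, 400, 250]
```

A larger overlap lowers the chance that a sentence is cut in half at a chunk boundary, at the cost of more chunks to embed and store.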


In [3]:
# Load the PDF document
print("📄 Loading PDF document...")
loader = PyPDFLoader(os.path.join(config.DATA_DIR, "Business-Conduct-Policy.pdf"))
documents = loader.load()
# Drop the first 108 characters of every page (presumably a repeated page header)
for doc in documents:
    doc.page_content = doc.page_content[108:]
print(f"✅ Loaded {len(documents)} pages from PDF")
# randrange excludes the upper bound; randint(0, len(documents)) could index past the end
rnd_index = random.randrange(len(documents))
print(f"📊 Document preview (Index {rnd_index+1}) : \n {documents[rnd_index].page_content[:50]}...")


📄 Loading PDF document...
✅ Loaded 20 pages from PDF
📊 Document preview (Index 14) : 
 
For more information about restrictions on tradin...


In [5]:
print(documents[15].page_content[:400])


Gifts to Public Officials
Apple permits providing gifts to public officials only when permissible under applicable laws and policies. A public official 
is any person who is paid with government funds or performs a public function. This includes individuals who are elected 
or appointed to public office, as well as individuals who work for local, state/provincial or national government, public 
i


## 1.2 Basic Retrieval Implementation

### What is Basic Retrieval?
Basic retrieval splits documents into chunks and stores them directly in a vector database. When you query, it retrieves the most similar chunks.

**Pros:**
- Simple to implement
- Fast retrieval
- Good for short, focused queries

**Cons:**
- May lose context from surrounding content
- Chunks might be too small for complex questions

**When to use Basic Retrieval:**  
Use for short, focused queries when speed and simplicity are more important than deep context.  
Examples: FAQs, small documents, quick lookups.
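
The idea of "retrieve the most similar chunks" can be sketched without any vector database. The example below is a toy under loud assumptions: word counts stand in for the embedding model, and the names `embed`, `cosine`, and `similarity_search` are hypothetical helpers, not the LangChain or Chroma implementations.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_search(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank every chunk by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Review the trademark guidelines before using the Apple logo.",
    "Gifts to public officials must comply with applicable laws.",
    "Trademark and logo usage requires approval from Legal.",
]
top = similarity_search("trademark logo usage", chunks, k=2)
print(top)
```

Each chunk is scored on its own, which is why basic retrieval is fast but blind to whatever text surrounded the chunk in the source document.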



In [6]:
# Create text splitter for basic retrieval
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=150,
)

# Split documents into chunks
print("✂️ Splitting documents for basic retrieval...")
basic_chunks = text_splitter.split_documents(documents)
print(f"✅ Created {len(basic_chunks)} chunks for basic retrieval")


✂️ Splitting documents for basic retrieval...
✅ Created 218 chunks for basic retrieval


In [7]:
# Create ChromaDB vector store for basic retrieval
print("🗄️ Creating ChromaDB for basic retrieval...")
basic_vectorstore = Chroma.from_documents(
    documents=basic_chunks,
    embedding=embeddings,
    persist_directory=os.path.join(config.CHROMA_PERSIST_DIRECTORY, "basic")
)
print("✅ Basic retrieval vectorstore created!")


🗄️ Creating ChromaDB for basic retrieval...
✅ Basic retrieval vectorstore created!


In [8]:
# Test basic retrieval
print("🔍 Testing basic retrieval...")
test_query = "The Apple Identity and Trademarks"

basic_results = basic_vectorstore.similarity_search(
    test_query, 
    k=5
)
print(f"📋 Retrieved {len(basic_results)} results:")
for i, doc in enumerate(basic_results):  # Show all retrieved results
    print(f"\n--- Result {i+1} ---")
    print(f"Content: {doc.page_content}...")
    # print(f"Metadata: {doc.metadata}")


🔍 Testing basic retrieval...
📋 Retrieved 5 results:

--- Result 1 ---
Content: contracting process.
The Apple Identity and Trademarks
The Apple name, names of products (such as iPhone), names of services (such as AppleCare), taglines (such as ”Think 
Different”), and logos collectively create the Apple identity. Before publicly using any of these assets, review the Trademark...

--- Result 2 ---
Content: Different”), and logos collectively create the Apple identity. Before publicly using any of these assets, review the Trademark 
List, Trademark and Copyright Guidelines, and Corporate Identity Guidelines for how to properly do so. You should also 
check with Legal before using the product names, service names, taglines, or logos of any third parties.
Third-Party Intellectual Property...

--- Result 3 ---
Content: check with Legal before using the product names, service names, taglines, or logos of any third parties.
Third-Party Intellectual Property
Apple respects third-party intellect

## 1.3 Parent Document Retrieval Implementation

### What is Parent Document Retrieval?
Parent document retrieval uses a two-step process:
1. **Child chunks**: Small chunks for precise similarity search
2. **Parent documents**: Larger documents that contain the child chunks

When you query, it finds relevant child chunks, then returns their parent documents for better context.
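
The two-step process above can be sketched in plain Python. This is an illustrative toy, not how `ParentDocumentRetriever` is implemented: sentences play the role of child chunks, a word-overlap count replaces embedding similarity, and `build_index` and `parent_search` are hypothetical names.

```python
def build_index(parents: list[str]) -> list[tuple[str, int]]:
    """Split each parent into sentence-sized children that remember their parent's index."""
    return [
        (sentence.strip(), parent_id)
        for parent_id, parent in enumerate(parents)
        for sentence in parent.split(".")
        if sentence.strip()
    ]


def parent_search(query: str, parents: list[str], children: list[tuple[str, int]], k: int = 2) -> list[str]:
    """Score the small child chunks, then return their de-duplicated parents."""
    words = query.lower().split()
    # Toy relevance score: number of query words that appear in the child text.
    ranked = sorted(children, key=lambda c: sum(w in c[0].lower() for w in words), reverse=True)
    parent_ids: list[int] = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:  # several children can share one parent
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]


parents = [
    "Side deals need formal approval. Trademark rules apply to all logos. "
    "Ask Legal before using third-party names.",
    "Gifts to public officials are limited. Report conflicts of interest early.",
]
children = build_index(parents)
results = parent_search("trademark logos", parents, children, k=2)
print(results)
```

The match happens on one short sentence, but the caller receives the whole parent, including the neighbouring sentences about side deals and third-party names. Because top children often share a parent, the number of parents returned can be smaller than `k`.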

**Pros:**
- Better context preservation
- More comprehensive answers
- Can reduce hallucination, because the model sees fuller context

**Cons:**
- More complex to implement
- Slightly slower retrieval
- Uses more storage

**When to use Parent Document Retrieval:**  
Use when you need larger context preserved, and want fewer hallucinations.  
Examples: policy documents, legal contracts, technical manuals, research papers.  


In [3]:
# Create text splitters for parent document retrieval
# Child splitter: small chunks for similarity search
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,  # Smaller chunks for better similarity
    chunk_overlap=150
)

# Parent splitter: larger chunks for context
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Larger chunks for context
    chunk_overlap=400
)

print("Creating child and parent chunks...")
child_chunks = child_splitter.split_documents(documents)
parent_chunks = parent_splitter.split_documents(documents)

print(f"✅ Created {len(child_chunks)} child chunks and {len(parent_chunks)} parent chunks")


Creating child and parent chunks...



In [11]:
# Create an empty vector store for child chunks.
# The retriever fills it in add_documents(); pre-loading child_chunks here would
# index every child twice, and the pre-loaded copies would not carry a parent id.
print("🗄️ Creating vector store for child chunks...")
child_vectorstore = Chroma(
    collection_name="child",
    embedding_function=embeddings,
    persist_directory=os.path.join(config.CHROMA_PERSIST_DIRECTORY, "child")
)

# Create store for parent documents (also populated by the retriever,
# which assigns its own parent IDs)
print("📦 Creating parent document store...")
parent_store = InMemoryStore()

print("✅ Parent document store created!")


🗄️ Creating vector store for child chunks...
📦 Creating parent document store...
✅ Parent document store created!


In [13]:
# Create parent document retriever
print("Setting up parent document retriever...")
parent_retriever = ParentDocumentRetriever(
    vectorstore=child_vectorstore,
    docstore=parent_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
    
)

# Add documents to the retriever
parent_retriever.add_documents(documents)
print("✅ Parent document retriever ready!")


Setting up parent document retriever...
✅ Parent document retriever ready!


In [14]:
# Test parent document retrieval
print("🔍 Testing parent document retrieval...")

parent_results = parent_retriever.get_relevant_documents(
    test_query,
)

print(f"📋 Retrieved {len(parent_results)} parent documents:")
for i, doc in enumerate(parent_results):  # Show all retrieved parent documents
    print(f"\n--- Parent Document {i+1} ---")
    print(f"Content: {doc.page_content}...")
    # print(f"Metadata: {doc.metadata}")


🔍 Testing parent document retrieval...
📋 Retrieved 2 parent documents:

--- Parent Document 1 ---
Content: commitments that create a new agreement or modify an existing agreement without securing approval through the formal 
contracting process.
The Apple Identity and Trademarks
The Apple name, names of products (such as iPhone), names of services (such as AppleCare), taglines (such as ”Think 
Different”), and logos collectively create the Apple identity. Before publicly using any of these assets, review the Trademark 
List, Trademark and Copyright Guidelines, and Corporate Identity Guidelines for how to properly do so. You should also 
check with Legal before using the product names, service names, taglines, or logos of any third parties.
Third-Party Intellectual Property
Apple respects third-party intellectual property. Never use the intellectual property of any third party without permission 
or legal right. If you are told or suspect that Apple may be infringing on third-party inte

**Sample Query**: "The Apple Identity and Trademarks"

**Basic Retrieval**: Returns small chunks that may miss surrounding policy context

**Parent Document Retrieval**: Returns larger parent chunks (up to 1000 characters) containing:
  - Complete trademark guidelines
  - Related policies (side deals, intellectual property)
  - Full context for comprehensive understanding

# 2. Basic vs MMR

## 2.1 Data Preparation


**Source**: 8 Wikipedia articles focused on "Technology"

**Content**: MIT, creative technology, ON Technology Corporation, general technology concepts

**Why This Data**:
  - Multiple articles mention similar concepts (e.g., MIT appears in several articles)
  - Rich semantic diversity across technology domains
  - Perfect for demonstrating MMR's diversity vs basic retrieval's redundancy
  - Real-world scenario where users need comprehensive, non-repetitive information


In [17]:


from langchain_community.document_loaders import WikipediaLoader

print("🌐 Loading Wikipedia documents...")
# You can change the topic if you want to showcase another domain
wiki_loader = WikipediaLoader(query="Technology", load_max_docs=8)
wiki_docs = wiki_loader.load()

print(f"✅ Loaded {len(wiki_docs)} Wikipedia articles")


🌐 Loading Wikipedia documents...
✅ Loaded 8 Wikipedia articles


In [18]:

# Split into chunks
wiki_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=100
)
wiki_chunks = wiki_splitter.split_documents(wiki_docs)
print(f"✅ Created {len(wiki_chunks)} chunks from Wikipedia articles")

# Create vectorstore
wiki_vectorstore = Chroma.from_documents(
    documents=wiki_chunks,
    embedding=embeddings,
    persist_directory=os.path.join(config.CHROMA_PERSIST_DIRECTORY, "wiki")
)


✅ Created 113 chunks from Wikipedia articles


## 2.2 Basic Retrieval

In [27]:

# Test query
wiki_query = "History of deep learning and neural networks"

# Basic Retrieval
print(" Basic Retrieval (Wikipedia)...")
wiki_basic_results = wiki_vectorstore.similarity_search(wiki_query, k=5)
for i, doc in enumerate(wiki_basic_results):
    print(f"\n--- Basic Result {i+1} ---")
    print(f"Content: {doc.page_content}...")
    print(doc.metadata['title'])


 Basic Retrieval (Wikipedia)...

--- Basic Result 1 ---
Content: topics such as artificial intelligence began to be brought up as Turing was beginning to question such technology of the time period....
Information technology

--- Basic Result 2 ---
Content: Ideas of computer science were first mentioned before the 1950s under the Massachusetts Institute of Technology (MIT) and Harvard University, where they had discussed and began thinking of computer circuits and numerical calculations.  As time went on, the field of information technology and computer science became more complex and was able to handle the processing of more data. Scholarly...
Information technology

--- Basic Result 3 ---
Content: During the early computing, Alan Turing, J. Presper Eckert, and John Mauchly were considered some of the major pioneers of computer technology in the mid-1900s. Giving them such credit for their developments, most of their efforts were focused on designing the first digital computer. Along 

## 2.3 MMR Retrieval

In [26]:

# MMR Retrieval
print("MMR Retrieval (Wikipedia)...")
wiki_mmr_results = wiki_vectorstore.max_marginal_relevance_search(
    wiki_query,
    k=5,          # number of final results
    fetch_k=20,   # larger candidate pool for diversity
    lambda_mult=0.5  # balance between relevance (1.0) and diversity (0.0)
)
for i, doc in enumerate(wiki_mmr_results):
    print(f"\n--- MMR Result {i+1} ---")
    print(f"Content: {doc.page_content}...")
    print(doc.metadata['title'])



MMR Retrieval (Wikipedia)...

--- MMR Result 1 ---
Content: topics such as artificial intelligence began to be brought up as Turing was beginning to question such technology of the time period....
Information technology

--- MMR Result 2 ---
Content: engineering. MIT moved from Boston to Cambridge in 1916 and grew rapidly through collaboration with private industry, military branches, and new federal basic research agencies, the formation of which was influenced by MIT faculty like Vannevar Bush. In the late twentieth century, MIT became a leading center for research in computer science, digital technology, artificial intelligence and big...
Massachusetts Institute of Technology

--- MMR Result 3 ---
Content: mechanism. Comparable geared devices did not emerge in Europe until the 16th century, and it was not until 1645 that the first mechanical calculator capable of performing the four basic arithmetical operations was developed....
Information technology

--- MMR Result 4 ---
Content:

**Query**: "History of deep learning and neural networks"

**Basic Retrieval**: Returns overlapping, near-redundant chunks; the results shown above all come from the "Information technology" article

**MMR Retrieval**: Returns diverse results from various sources:
  - Academic perspectives (MIT)
  - Corporate technology (ON Technology)
  - Creative applications (creative technology)
  - Historical evolution (general technology)

**Key Benefit**: MMR provides broader coverage with less redundancy than basic retrieval
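
For intuition about what `lambda_mult` does, here is a minimal sketch of greedy MMR selection over toy 2-D vectors. It follows the standard MMR formulation (relevance to the query minus similarity to already-selected items); it is not necessarily identical to what the vector store does internally. The `candidates` list plays the role of the `fetch_k` pool, and the function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def mmr(query: list[float], candidates: list[list[float]], k: int, lambda_mult: float = 0.5) -> list[int]:
    """Greedy MMR: return indices of up to k candidates, trading relevance against redundancy."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Redundancy: similarity to the closest already-selected candidate.
        redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],    # 0: very relevant
    [1.0, 0.12],   # 1: near-duplicate of 0
    [0.6, -0.8],   # 2: less relevant, but points in a different direction
]
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # → [0, 1]  (pure relevance)
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # → [0, 2]  (near-duplicate is penalized)
```

With `lambda_mult=1.0` the selection degenerates to plain similarity search; lowering it makes the second pick skip the near-duplicate in favour of a less similar candidate.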

# Bonus: Parent Document + MMR Retrieval

In [None]:

from utils import MMRParentDocumentRetriever, create_mmr_parent_retriever

print("🔗 Setting up MMR + Parent Document Retriever...")

# Create MMR Parent Document Retriever using our custom implementation
mmr_parent_retriever = create_mmr_parent_retriever(
    documents=wiki_docs,
    embeddings=embeddings,
    child_chunk_size=400,
    child_chunk_overlap=150,
    parent_chunk_size=1000,
    parent_chunk_overlap=400,
    persist_directory=os.path.join(config.CHROMA_PERSIST_DIRECTORY, "mmr_parent")
)

print("✅ MMR + Parent Document Retriever ready!")

🔗 Setting up MMR + Parent Document Retriever...
✅ MMR + Parent Document Retriever ready!


In [25]:
print(" Testing MMR + Parent Document Retrieval...")

test_queries = [
    "Masachuset Institute ofTechnology Role in Technology"]
for query in test_queries:
    print(f"\n Query: '{query}'")
    print("=" * 50)
    
    # Test with different lambda_mult values
    for lambda_mult in [0.5]:
        print(f"\n MMR with lambda_mult={lambda_mult} (diversity vs relevance):")
        
        mmr_parent_results = mmr_parent_retriever.get_relevant_documents(
            query,
            k=5,
            fetch_k=10,
            lambda_mult=lambda_mult
        )
        
        print(f"Retrieved {len(mmr_parent_results)} parent documents:")
        for i, doc in enumerate(mmr_parent_results):
            print(f"\n--- MMR Parent Document {i+1} ---")
            print(f"Content: {doc.page_content}...")
            # print(f"Length: {len(doc.page_content)} characters")
            print(f"Metadata: {doc.metadata['title']}")

 Testing MMR + Parent Document Retrieval...

 Query: 'Masachuset Institute ofTechnology Role in Technology'

 MMR with lambda_mult=0.5 (diversity vs relevance):
Retrieved 5 parent documents:

--- MMR Parent Document 1 ---
Content: == History ==


=== Foundation and vision ===
[...] a school of industrial science aiding the advancement, development and practical application of science in connection with arts, agriculture, manufactures, and commerce [...]
In 1859, a proposal was submitted to the Massachusetts General Court to use newly filled lands in Back Bay, Boston for a "Conservatory of Art and Science", but the proposal failed. A charter for the incorporation of the Massachusetts Institute of Technology, proposed by William Barton Rogers, was signed by John Albion Andrew, the governor of Massachusetts, on April 10, 1861.
Rogers, a geologist who had recently arrived in Boston from the University of Virginia, wanted to establish an institution to address rapid scientific and technolog

## 3. Conclusion: Why Retrievers Matter

Retrievers are not just plumbing: they shape the quality of answers in RAG.  

- **Basic** → speed & simplicity  
- **Parent** → context preservation  
- **MMR** → diverse perspectives  

**Key Message**: Choosing the right retriever can be just as impactful as picking the right LLM or embedding model.


| Retriever  | Pros                       | Cons                  | Best Use Case           |
| ---------- | -------------------------- | --------------------- | ----------------------- |
| Basic      | Fast, simple               | Loses context         | FAQs, simple Q&A        |
| Parent     | Context-rich               | Slower, storage-heavy | Policies, long docs     |
| MMR        | Diverse, avoids duplicates | Needs tuning          | Research, exploration   |
| Parent+MMR | Context + diversity        | Most complex          | Broad, critical queries |
