# Advanced Retrieval Techniques for RAG Systems

## Overview

This notebook demonstrates **advanced retrieval strategies** to improve the quality and relevance of document retrieval in RAG systems. We'll explore techniques beyond simple similarity search.

## Retrieval Techniques Covered

### 1. **Similarity Search** (Baseline)
- Standard vector similarity (cosine or L2 distance, depending on how the vector store is configured)
- Returns k most similar documents
- Fast but may return redundant results
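
As a rough illustration of the baseline, the sketch below ranks documents by cosine similarity in plain Python. The 2-D vectors are made-up stand-ins for real embeddings, and `top_k` is a hypothetical helper, not part of LangChain or Chroma.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-D "embeddings" (hypothetical values, for illustration only)
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query_vec = [1.0, 0.05]
nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)
```

Documents 0 and 1 point in almost the same direction, so both are returned even though the second adds little. That is the redundancy MMR tries to reduce.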

### 2. **Maximum Marginal Relevance (MMR)**
- Balances relevance with diversity
- Reduces redundancy in results
- Formula: λ × (relevance) - (1 - λ) × (similarity to already selected docs)

### 3. **Metadata Filtering**
- Filter documents by structured attributes (page, source, date, etc.)
- Combines semantic search with traditional database queries
- Useful for narrowing search scope
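
To make the filter semantics concrete, here is a minimal pure-Python sketch that evaluates a small subset of Chroma-style filter operators (`$and`, `$or`, `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`) against a metadata dict. The `matches` function is a hypothetical illustration of how such filters behave, not Chroma's implementation.

```python
def matches(metadata, where):
    """Evaluate a small subset of Chroma-style filters against one metadata dict."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches(metadata, clause) for clause in where["$or"])
    # Single-field clause, e.g. {"page": {"$gte": 15}} or the shorthand {"page": 4}
    (field, condition), = where.items()
    if field not in metadata:
        return False
    if not isinstance(condition, dict):
        condition = {"$eq": condition}
    value = metadata[field]
    ops = {
        "$eq": lambda v, t: v == t,
        "$ne": lambda v, t: v != t,
        "$gt": lambda v, t: v > t,
        "$gte": lambda v, t: v >= t,
        "$lt": lambda v, t: v < t,
        "$lte": lambda v, t: v <= t,
    }
    return all(ops[op](value, target) for op, target in condition.items())

# Toy metadata records (page values are illustrative)
chunks = [
    {"page": 4, "source": "./99-DPDPA.pdf"},
    {"page": 16, "source": "./99-DPDPA.pdf"},
    {"page": 20, "source": "./99-DPDPA.pdf"},
]
where = {"$and": [{"source": {"$eq": "./99-DPDPA.pdf"}}, {"page": {"$eq": 4}}]}
selected = [c for c in chunks if matches(c, where)]
print(selected)
```

In a real vector store the filter narrows the candidate set and similarity ranking then applies only to the chunks that pass.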

### 4. **Self-Query Retrieval** (⭐ Advanced)
- **Natural language to structured queries**
- LLM automatically extracts:
  - Search terms (for semantic search)
  - Metadata filters (for structured filtering)
- Example: "What does page 5 say about consent?" →
  - Search: "consent"
  - Filter: `{"page": 5}`

## Why These Matter

| Technique | Problem It Solves | Use Case |
|-----------|------------------|----------|
| **Similarity Search** | Basic retrieval | General Q&A |
| **MMR** | Result redundancy | When you need diverse perspectives |
| **Metadata Filtering** | Too broad results | Search within specific document sections |
| **Self-Query** | Complex user queries | Production chatbots, intuitive interfaces |

## Document Used: DPDPA

**Digital Personal Data Protection Act (India, 2023)**
- 21 pages of legal text
- Structured by sections and chapters
- Metadata: page numbers, source path
- Perfect for demonstrating filtered retrieval

## Environment Setup

Standard imports and API key configuration.

In [30]:
import os
import openai
import tiktoken  # OpenAI's tokenizer for accurate token counting
from dotenv import load_dotenv, find_dotenv

# Load environment variables from .env file (contains OPENAI_API_KEY)
_ = load_dotenv(find_dotenv())

openai.api_key = os.environ['OPENAI_API_KEY']

In [31]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

## Load Vector Database

Loading the pre-built vector database from the previous notebook (3-VectorsAndEmbeddings.ipynb).
This contains 55 chunks from the DPDPA document, already embedded and indexed.

In [32]:
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [33]:
print(vectordb._collection.count())

55


## Verify Database

The count above confirms that 55 document chunks are loaded and ready for retrieval experiments.

In [34]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [35]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

---

## Technique 1: Similarity Search vs MMR

### Demo with Small Database

The small test database created above (three short texts about *Amanita phalloides*) makes the difference between similarity search and MMR easy to see.

In [36]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [37]:
smalldb.similarity_search(question, k=2)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.')]

In [38]:
smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.')]

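### Similarity Search Result

Similarity search (In [37]) returns the 2 most similar documents. Both are about large fruiting bodies, so the second adds little new information. In the recorded output the two results are in fact the same text, and the MMR call (In [38]) also returns that text twice, which suggests the collection holds duplicate entries.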
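
The recorded outputs above list the same text twice. One possible cause, not confirmed by the notebook, is that the in-memory collection was populated more than once. Whatever the cause, a small post-processing step can drop exact duplicates. The sketch below uses a hypothetical `Doc` stand-in exposing `page_content`, the same attribute as the LangChain `Document` objects shown above.

```python
from typing import NamedTuple

class Doc(NamedTuple):
    """Minimal stand-in for a retrieved document (hypothetical)."""
    page_content: str

def dedupe_by_content(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique

results = [
    Doc("A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white."),
    Doc("A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white."),
    Doc("A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms."),
]
unique_results = dedupe_by_content(results)
print(len(unique_results))
```

Deduplicating after retrieval hides the symptom only. If the store really holds duplicates, rebuilding it from a clean state is the more reliable fix.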

In [39]:
question = "what did they say about penalties for non compliance and data breach for one user?"
docs_ss = vectordb.similarity_search(question,k=3)
docs_ss[0].page_content[:100]

'SEC. 1] THE GAZETTE OF INDIA EXTRAORDINARY 21\nBreach of provisions of  this Act or rules made thereu'

In [40]:
docs_ss[1].page_content[:100]

'SEC. 1] THE GAZETTE OF INDIA EXTRAORDINARY 17\nperson an opportunity of being heard, impose such mone'

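### MMR Search Result

MMR is designed to return diverse results. On the mushroom demo (In [38]) the intended outcome is:
1. First result: Most relevant (all-white, large fruiting body)
2. Second result: Different aspect (poisonous) - adds new information

The recorded output for In [38] shows the same text twice instead, so this contrast is not visible in that run.

**Key Insight**: MMR sacrifices some relevance for diversity, giving users broader context.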
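
The trade-off can be seen in a minimal greedy implementation of the MMR formula from the overview. This is a sketch on made-up 2-D vectors, not LangChain's code; `mmr` and `cosine` are hypothetical helpers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: pick k indices, trading query relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Documents 0 and 1 are near-duplicates; document 2 covers a different direction
doc_vecs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
query_vec = [0.8, 0.6]
diverse = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking. At 0.5 the near-duplicate is skipped in favour of the less similar but novel document.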

In [41]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [42]:
docs_mmr[0].page_content[:100]

'SEC. 1] THE GAZETTE OF INDIA EXTRAORDINARY 21\nBreach of provisions of  this Act or rules made thereu'

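---

## Technique 2: Similarity Search on DPDPA

Cells In [39] and In [40] above search the DPDPA document using standard similarity search; both top results are penalty-related chunks. The next cell shows the second MMR result for the same query.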

In [43]:
docs_mmr[1].page_content[:100]

'accessing her mobile phone contact list, and X signifies her consent to both. Since phone\ncontact li'

In [46]:
# Demonstrating metadata filtering with DPDPA document
# Let's find information about consent from specific pages
# ChromaDB requires explicit operators: use $and for multiple conditions, $eq for equality
question = "what are the requirements for consent?"   
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"$and": [{"source": {"$eq": "./99-DPDPA.pdf"}}, {"page": {"$eq": 4}}]}
)

print(f"Found {len(docs)} documents matching the filter\n")
for i, d in enumerate(docs, 1):
    print(f"Document {i} metadata:", d.metadata)
    print(f"Content preview: {d.page_content[:150]}...\n")

Found 3 documents matching the filter

Document 1 metadata: {'creator': 'PyPDF', 'producer': 'iTextSharp™ 5.5.13.1 ©2000-2019 iText Group NV (AGPL-version)', 'creationdate': '2023-08-12T02:13:03+05:30', 'total_pages': 21, 'page_label': '5', 'page': 4, 'moddate': '2023-08-12T02:14:35+05:30', 'source': './99-DPDPA.pdf'}
Content preview: processing without her consent is required or authorised under the provisions of this Act
or the rules made thereunder or any other law for the time b...

Document 2 metadata: {'total_pages': 21, 'producer': 'iTextSharp™ 5.5.13.1 ©2000-2019 iText Group NV (AGPL-version)', 'page': 4, 'source': './99-DPDPA.pdf', 'moddate': '2023-08-12T02:14:35+05:30', 'page_label': '5', 'creationdate': '2023-08-12T02:13:03+05:30', 'creator': 'PyPDF'}
Content preview: accessing her mobile phone contact list, and X signifies her consent to both. Since phone
contact list is not necessary for making available telemedic...

Document 3 metadata: {'producer': 'iTextSharp™ 

In [52]:
from langchain_openai import OpenAI

# Import AttributeInfo and SelfQueryRetriever from langchain_classic
# Note: in LangChain 1.x these legacy classes live in the langchain_classic package
from langchain_classic.chains.query_constructor.schema import AttributeInfo
from langchain_classic.retrievers.self_query.base import SelfQueryRetriever

In [53]:
# Define metadata fields for DPDPA document
# This tells the LLM what metadata fields are available for filtering
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The source document file path, specifically './99-DPDPA.pdf' for the Digital Personal Data Protection Act",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page number from the DPDPA document (0-indexed, so page 0 is the first page)",
        type="integer",
    ),
    AttributeInfo(
        name="page_label",
        description="The page label as it appears in the document",
        type="string",
    ),
]

In [54]:
# Configure self-query retriever for DPDPA document
document_content_description = "Legal document about Digital Personal Data Protection Act of India, covering data protection rights, obligations of data fiduciaries, consent requirements, and penalties"

# Initialize LLM for query construction
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)

# Create self-query retriever
# This will automatically extract filters from natural language queries
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [None]:
# Example 1: Query that should filter by page number
# The LLM will automatically extract "page 5" and create appropriate filter
question = "What does page 5 say about data principal rights?"

## Technique 3: MMR on DPDPA

Cells In [41] to In [43] above run the same penalties query with MMR to get diverse results. The next cell continues the self-query example.

In [56]:
# Execute self-query retrieval
# The retriever will:
# 1. Parse the natural language query
# 2. Extract semantic content: "data principal rights"
# 3. Extract a metadata filter for page 5 (the results below carry page_label '5', i.e. 0-indexed page 4)
# 4. Apply both to find relevant documents
docs = retriever.invoke(question)

print(f"\nRetrieved {len(docs)} documents")


Retrieved 4 documents


In [57]:
# Display results with metadata and content preview
for i, d in enumerate(docs, 1):
    print(f"\n{'='*60}")
    print(f"Document {i}")
    print(f"{'='*60}")
    print(f"Metadata: {d.metadata}")
    print(f"\nContent preview:")
    print(d.page_content[:300] + "..." if len(d.page_content) > 300 else d.page_content)


Document 1
Metadata: {'total_pages': 21, 'creationdate': '2023-08-12T02:13:03+05:30', 'source': './99-DPDPA.pdf', 'creator': 'PyPDF', 'producer': 'iTextSharp™ 5.5.13.1 ©2000-2019 iText Group NV (AGPL-version)', 'moddate': '2023-08-12T02:14:35+05:30', 'page_label': '5', 'page': 4}

Content preview:
processing without her consent is required or authorised under the provisions of this Act
or the rules made thereunder or any other law for the time being in force in India.
Consent.

Document 2
Metadata: {'moddate': '2023-08-12T02:14:35+05:30', 'creator': 'PyPDF', 'page_label': '5', 'creationdate': '2023-08-12T02:13:03+05:30', 'source': './99-DPDPA.pdf', 'producer': 'iTextSharp™ 5.5.13.1 ©2000-2019 iText Group NV (AGPL-version)', 'total_pages': 21, 'page': 4}

Content preview:
accessing her mobile phone contact list, and X signifies her consent to both. Since phone
contact list is not necessary for making available telemedicine services, her consent shall be
limited to the processing 

In [58]:
# Example 2: Query about penalties without page specification
# The LLM should search semantically without metadata filter
question2 = "What are the penalties for data breach?"

print(f"\nQuery: {question2}\n")
docs2 = retriever.invoke(question2)

print(f"\nRetrieved {len(docs2)} documents")
for i, d in enumerate(docs2, 1):
    print(f"\n--- Document {i} ---")
    print(f"Page: {d.metadata.get('page', 'N/A')}")
    print(f"Content: {d.page_content[:200]}...")


Query: What are the penalties for data breach?


Retrieved 4 documents

--- Document 1 ---
Page: 20
Content: SEC. 1] THE GAZETTE OF INDIA EXTRAORDINARY 21
Breach of provisions of  this Act or rules made thereunder
(2)
Breach in observing the obligation of Data Fiduciary to
take reasonable security safeguards...

--- Document 2 ---
Page: 16
Content: SEC. 1] THE GAZETTE OF INDIA EXTRAORDINARY 17
person an opportunity of being heard, impose such monetary penalty specified in the
Schedule.
(2) While determining the amount of monetary penalty to be i...

--- Document 3 ---
Page: 16
Content: Act or the rules made thereunder.
36. The Central Government may, for the purposes of this Act, require the Board and
any Data Fiduciary or intermediary to furnish such information as it may call for....

--- Document 4 ---
Page: 13
Content: provided in this Act;
(c) on a complaint made by a Data Principal in respect of a breach in observance
by a Consent Manager of its obligations in relation to her pe

In [61]:
# Example 3: Complex query with implicit filtering
# Testing if LLM can extract page range or specific section references
question3 = "What information is provided in the early pages about consent?"

print(f"\nQuery: {question3}\n")
docs3 = retriever.invoke(question3)

print(f"\nRetrieved {len(docs3)} documents")
for i, d in enumerate(docs3, 1):
    print(f"\n--- Document {i} ---")
    print(f"Metadata: {d.metadata}")
    print(f"Content: {d.page_content[:250]}...")


Query: What information is provided in the early pages about consent?


Retrieved 4 documents

--- Document 1 ---
Metadata: {'creationdate': '2023-08-12T02:13:03+05:30', 'source': './99-DPDPA.pdf', 'page': 4, 'page_label': '5', 'producer': 'iTextSharp™ 5.5.13.1 ©2000-2019 iText Group NV (AGPL-version)', 'moddate': '2023-08-12T02:14:35+05:30', 'creator': 'PyPDF', 'total_pages': 21}
Content: processing without her consent is required or authorised under the provisions of this Act
or the rules made thereunder or any other law for the time being in force in India.
Consent....

--- Document 2 ---
Metadata: {'creator': 'PyPDF', 'moddate': '2023-08-12T02:14:35+05:30', 'creationdate': '2023-08-12T02:13:03+05:30', 'producer': 'iTextSharp™ 5.5.13.1 ©2000-2019 iText Group NV (AGPL-version)', 'page': 4, 'total_pages': 21, 'page_label': '5', 'source': './99-DPDPA.pdf'}
Content: accessing her mobile phone contact list, and X signifies her consent to both. Since phone
contact list is not neces

## Comparison: Manual Filter vs Self-Query Retriever

### Manual Filtering (Traditional Approach)
```python
docs = vectordb.similarity_search(
    "consent requirements",
    k=3,
    filter={"page": 4}  # You manually specify the filter (0-indexed: the page labelled 5)
)
```

### Self-Query Retriever (Intelligent Approach)
```python
docs = retriever.invoke(
    "What does page 5 say about consent?"  # LLM extracts filter automatically
)
```

**Key Benefits:**
1. **Natural Language Interface**: Users don't need to know metadata structure
2. **Automatic Filter Extraction**: LLM parses queries and applies appropriate filters
3. **Hybrid Search**: Combines semantic search with structured filtering
4. **Flexible Queries**: Handles both filtered and unfiltered queries seamlessly

**How It Works:**
1. Query goes to LLM (gpt-3.5-turbo-instruct)
2. LLM analyzes query and metadata field definitions
3. LLM extracts: 
   - Search content: "consent"
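   - Metadata filter: a condition on the page, such as `{"page": 5}` (in the runs above, "page 5" queries returned chunks with `page_label` '5', i.e. 0-indexed `page` 4)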
4. Vector database performs filtered similarity search
5. Returns most relevant documents from specified page
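
The steps above can be sketched without an LLM. In the toy version below, a regular expression stands in for the LLM's query-construction step: it separates a "page N" reference from the rest of the question. `StructuredQuery` and `parse_query` are hypothetical names, and mapping "page N" to `page_label` is an assumption chosen to match the printed page numbers users usually mean.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class StructuredQuery:
    query: str                      # text to embed for semantic search
    filter: Optional[dict] = None   # metadata condition, if one was found

PAGE_PATTERN = re.compile(r"\bpage (\d+)\b", flags=re.IGNORECASE)

def parse_query(question: str) -> StructuredQuery:
    """Toy rule-based stand-in for the LLM step: pull out 'page N' if present."""
    match = PAGE_PATTERN.search(question)
    if match is None:
        return StructuredQuery(query=question)
    # Crude cleanup: cut the page reference out and collapse leftover whitespace
    remainder = question[:match.start()] + question[match.end():]
    topic = " ".join(remainder.split())
    return StructuredQuery(query=topic, filter={"page_label": {"$eq": match.group(1)}})

with_page = parse_query("What does page 5 say about consent?")
without_page = parse_query("What are the penalties for data breach?")
print(with_page)
print(without_page)
```

A real self-query retriever delegates this parsing to the LLM, which is what lets it handle phrasings a fixed pattern would miss, at the cost of an extra model call and occasional mis-parses.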

**MMR Diversity**: In the MMR results for the penalties query (In [42] and In [43]), the second result differs from the similarity search result (In [40]): MMR selected content about consent rather than another penalty-focused chunk, providing broader context.

In [60]:
# Example 4: Testing different query patterns
test_queries = [
    "Tell me about Data Fiduciary obligations",
    "What does the first page of the document cover?",
    "Are there any requirements for children's data on page 10?",
    "Explain consent withdrawal process"
]

print("Testing various query patterns with Self-Query Retriever\n")
print("="*80)

for query in test_queries:
    print(f"\nQuery: '{query}'")
    print("-"*80)
    try:
        results = retriever.invoke(query)
        print(f"✓ Retrieved {len(results)} document(s)")
        if results:
            print(f"  First result from page: {results[0].metadata.get('page', 'N/A')}")
            print(f"  Preview: {results[0].page_content[:100]}...")
    except Exception as e:
        print(f"✗ Error: {str(e)}")

Testing various query patterns with Self-Query Retriever


Query: 'Tell me about Data Fiduciary obligations'
--------------------------------------------------------------------------------
✓ Retrieved 4 document(s)
  First result from page: 6
  Preview: to ensure effective observance of  the provisions of this Act and the rules made thereunder.
(5) A D...

Query: 'What does the first page of the document cover?'
--------------------------------------------------------------------------------
✓ Retrieved 2 document(s)
  First result from page: 0
  Preview: xxxGIDHxxx
xxxGIDExxx
jftLV¬™h la√± Mh√± ,y√± ‚Äî(,u)04@0007@2003‚Äî23 REGISTERED NO. DL‚Äî(N)04/0007/2003‚Äî23
...

Query: 'Are there any requirements for children's data on page 10?'
--------------------------------------------------------------------------------
✓ Retrieved 0 document(s)

Query: 'Explain consent withdrawal process'
--------------------------------------------------------------------------------
✓ Retrieved

In [63]:
## Setup Contextual Compression

from langchain_classic.retrievers import ContextualCompressionRetriever
from langchain_classic.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import OpenAI

# Initialize the LLM for compression
llm = OpenAI(temperature=0, model='gpt-3.5-turbo-instruct')

# Create the compressor that will extract relevant portions
compressor = LLMChainExtractor.from_llm(llm)


In [64]:
# Helper function to display documents nicely
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [65]:
# Create the compression retriever
# This combines standard retrieval with LLM-based compression
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)
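
Conceptually, the compression retriever runs two stages: retrieve chunks, then shorten each chunk to the parts relevant to the question. The sketch below mimics that flow in plain Python. A keyword-overlap rule stands in for the LLM-based extractor, and the sample sentences are invented for illustration (they are not quotations from the DPDPA).

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "for", "and", "are", "what", "is", "to", "in"}

def extract_relevant(text, question, min_overlap=1):
    """Keep only sentences sharing at least min_overlap content words with the question."""
    q_words = {w for w in re.findall(r"[a-z]+", question.lower()) if w not in STOP_WORDS}
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        s_words = set(re.findall(r"[a-z]+", sentence.lower()))
        if len(q_words & s_words) >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)

def compressed_retrieve(question, retrieve, compress):
    """Retrieve chunks, compress each one, and drop chunks that end up empty."""
    results = []
    for chunk in retrieve(question):
        shortened = compress(chunk, question)
        if shortened:
            results.append(shortened)
    return results

# Hypothetical chunks standing in for vector-store results
toy_chunks = [
    "The Board may impose a monetary penalty. The office is in New Delhi.",
    "Members are appointed by the Central Government.",
]
compressed = compressed_retrieve(
    "What is the penalty for a breach?",
    retrieve=lambda q: toy_chunks,
    compress=extract_relevant,
)
print(compressed)
```

Unlike this keyword rule, an LLM extractor can recognise paraphrases, but it also costs one model call per retrieved chunk.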

In [66]:
## Example: Compare Standard vs Compressed Retrieval

# Question about DPDPA penalties
question = "What are the penalties for data breach and non-compliance?"

print("="*100)
print("STANDARD RETRIEVAL (Full Documents)")
print("="*100)

# Standard retrieval - returns full document chunks
standard_docs = vectordb.similarity_search(question, k=3)
pretty_print_docs(standard_docs)

STANDARD RETRIEVAL (Full Documents)
Document 1:

SEC. 1] THE GAZETTE OF INDIA EXTRAORDINARY 21
Breach of provisions of  this Act or rules made thereunder
(2)
Breach in observing the obligation of Data Fiduciary to
take reasonable security safeguards to prevent personal
data breach under sub-section (5) of section 8.
Breach in observing the obligation to give the Board or
affected Data Principal notice of a personal data breach
under sub-section (6) of section 8.
Breach in observance of additional obligations in relation
to children under section 9.
Breach in observance of additional obligations of
Significant Data Fiduciary under section 10.
Breach in observance of the duties under section 15.
Breach of any term of voluntary undertaking accepted by
the Board under section 32.
Breach of any other provision of this Act or the rules
made thereunder.
Sl. No.
(1)
1.
2.
3.
4.
5.
6.
7.
Penalty
(3)
May extend to two
hundred and fifty
crore rupees.
May extend to two
hundred crore
rupees.
May ex

In [68]:
print("\n" + "="*100)
print("COMPRESSED RETRIEVAL (Extracted Relevant Portions Only)")
print("="*100)

# Compressed retrieval - extracts only relevant portions
compressed_docs = compression_retriever.invoke(question)
pretty_print_docs(compressed_docs)


COMPRESSED RETRIEVAL (Extracted Relevant Portions Only)
Document 1:

Breach of provisions of this Act or rules made thereunder
Breach in observing the obligation of Data Fiduciary to take reasonable security safeguards to prevent personal data breach under sub-section (5) of section 8.
Breach in observing the obligation to give the Board or affected Data Principal notice of a personal data breach under sub-section (6) of section 8.
Breach in observance of additional obligations in relation to children under section 9.
Breach in observance of additional obligations of Significant Data Fiduciary under section 10.
Breach in observance of the duties under section 15.
Breach of any term of voluntary undertaking accepted by the Board under section 32.
Breach of any other provision of this Act or the rules made thereunder.
Sl. No.
(1)
1.
2.
3.
4.
5.
6.
7.
Penalty
(3)
May extend to two hundred and fifty crore rupees.
May extend to two hundred crore rupees.
May extend to two hundred crore rupe

In [69]:
## Compare Token Usage

# Count tokens in both approaches
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

standard_tokens = sum([len(encoding.encode(doc.page_content)) for doc in standard_docs])
compressed_tokens = sum([len(encoding.encode(doc.page_content)) for doc in compressed_docs])

print(f"\n{'='*100}")
print("TOKEN USAGE COMPARISON")
print(f"{'='*100}")
print(f"Standard Retrieval: {standard_tokens} tokens")
print(f"Compressed Retrieval: {compressed_tokens} tokens")
print(f"Reduction: {standard_tokens - compressed_tokens} tokens ({((standard_tokens - compressed_tokens) / standard_tokens * 100):.1f}% savings)")
print(f"\nEstimated cost savings per query: ${(standard_tokens - compressed_tokens) * 0.0000015:.6f}")
print("(Based on GPT-3.5-turbo input pricing: $0.0015 per 1K tokens)")


TOKEN USAGE COMPARISON
Standard Retrieval: 1112 tokens
Compressed Retrieval: 485 tokens
Reduction: 627 tokens (56.4% savings)

Estimated cost savings per query: $0.000941
(Based on GPT-3.5-turbo input pricing: $0.0015 per 1K tokens)
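
The figures above follow from simple arithmetic, reproduced below using the token counts and price from this run. The `net_savings` helper is an added assumption: it subtracts a rough estimate of what the compression call itself costs, since the extractor must read the full chunks. It ignores the compressor's prompt overhead and output tokens.

```python
PRICE_PER_1K_INPUT = 0.0015  # USD, the GPT-3.5-turbo input price quoted above

def savings(standard_tokens, compressed_tokens, price_per_1k=PRICE_PER_1K_INPUT):
    """Return (tokens saved, percent saved, dollars saved on downstream input)."""
    saved_tokens = standard_tokens - compressed_tokens
    percent = saved_tokens / standard_tokens * 100
    dollars = saved_tokens * price_per_1k / 1000
    return saved_tokens, percent, dollars

def net_savings(standard_tokens, compressed_tokens,
                downstream_price_per_1k, compressor_price_per_1k):
    """Downstream savings minus the cost of sending the full chunks to the compressor."""
    saved = (standard_tokens - compressed_tokens) * downstream_price_per_1k / 1000
    spent = standard_tokens * compressor_price_per_1k / 1000
    return saved - spent

saved_tokens, percent, dollars = savings(1112, 485)
same_price_net = net_savings(1112, 485, 0.0015, 0.0015)
print(f"{saved_tokens} tokens, {percent:.1f}%, ${dollars:.6f}, net ${same_price_net:.6f}")
```

Under this rough model, compression pays off only when the downstream model is noticeably more expensive per token than the compressor, or when the shorter context improves answer quality.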


In [71]:
## Another Example: Consent Requirements

question2 = "What are the requirements for obtaining valid consent from a data principal?"

print("="*100)
print(f"QUESTION: {question2}")
print("="*100)

compressed_docs2 = compression_retriever.invoke(question2)
pretty_print_docs(compressed_docs2)

QUESTION: What are the requirements for obtaining valid consent from a data principal?
Document 1:

- Where consent given by the Data Principal is the basis of processing of personal data, such Data Principal shall have the right to withdraw her consent at any time, with the ease of doing so being comparable to the ease with which such consent was given.
- The consequences of the withdrawal referred to in sub-section (4) shall be borne by the Data Principal, and such withdrawal shall not affect the legality of processing of the personal data based on consent before its withdrawal.
- If a Data Principal withdraws her consent to the processing of personal data under sub-section (5), the Data Fiduciary shall, within a reasonable time, cease and cause its Data Processors to cease processing the personal data of such Data Principal unless such processing without her consent is required or authorised under the provisions of this Act.
----------------------------------------------------------

---

## Combining Techniques: Compression + Metadata Filtering

You can combine contextual compression with other retrieval techniques for even better results.

For example, combine compression with metadata filtering to get compressed, focused results from specific pages:

In [77]:
# Create a filtered retriever for specific pages
# (uses the vectordb loaded earlier, so no extra import is needed)

# Retriever that only searches 0-indexed pages 15 and up, i.e. page labels 16-21
# (penalty and enforcement sections)
# Note: ChromaDB requires $and operator to combine multiple conditions
filtered_retriever = vectordb.as_retriever(
    search_kwargs={
        "k": 3,
        "filter": {"$and": [{"page": {"$gte": 15}}, {"page": {"$lte": 21}}]}
    }
)

# Combine with compression
compression_filtered_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=filtered_retriever
)

# Query about penalties (should only search in penalty-related pages)
question3 = "What are the maximum monetary penalties specified?"
print(f"Question: {question3}")
print(f"Searching only pages 16-21 (penalties and enforcement sections)\n")

compressed_filtered_docs = compression_filtered_retriever.invoke(question3)
pretty_print_docs(compressed_filtered_docs)

Question: What are the maximum monetary penalties specified?
Searching only pages 16-21 (penalties and enforcement sections)

Document 1:

- monetary penalty specified in the Schedule
- determining the amount of monetary penalty to be imposed under sub-section (1)
- the Board shall have regard to the following matters, namely:‚Äî
- the nature, gravity and duration of the breach;
- the type and nature of the personal data affected by the breach;
- repetitive nature of the breach;
- whether the person, as a result of the breach, has realised a gain or avoided any loss;
- whether the person took any action to mitigate the effects and consequences of the breach, and the timeliness and effectiveness of such action;
- whether the monetary penalty to be imposed is proportionate and effective, having regard to the need to secure observance of and deter breach of the provisions of this Act; and
- the likely impact of the imposition of the monetary penalty on the person.
- All sums realised by w

---

## Advanced Combination: Compression + MMR

Combine compression with MMR to get diverse AND focused results:

In [78]:
# Create MMR retriever
mmr_retriever = vectordb.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 3,
        "fetch_k": 10,  # Fetch 10 candidates, select 3 diverse ones
        "lambda_mult": 0.5  # Balance between relevance (1.0) and diversity (0.0)
    }
)

# Combine with compression
compression_mmr_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=mmr_retriever
)

# Query about data principal rights
question4 = "What rights does a data principal have under DPDPA?"
print(f"Question: {question4}\n")

compressed_mmr_docs = compression_mmr_retriever.invoke(question4)
pretty_print_docs(compressed_mmr_docs)

print(f"\n{'='*100}")
print(f"Retrieved {len(compressed_mmr_docs)} diverse, compressed documents about data principal rights")
print(f"{'='*100}")

Question: What rights does a data principal have under DPDPA?

Document 1:

- The Data Principal shall have the right to obtain from the Data Fiduciary to whom she has previously given consent
- upon making to it a request in such manner as may be prescribed
- a summary of personal data which is being processed by such Data Fiduciary and the processing activities undertaken by that Data Fiduciary with respect to such personal data
- the identities of all other Data Fiduciaries and Data Processors with whom the personal data has been shared by such Data Fiduciary, along with a description of the personal data so shared
- any other information related to the personal data of such Data Principal and its processing, as may be prescribed
----------------------------------------------------------------------------------------------------
Document 2:

- "rights of Data Principals" (mentioned in the context of "periodic Data Protection Impact Assessment")
- "risk to the rights of Data Principa

---

## Summary and Best Practices

### Retrieval Technique Selection Guide

| Scenario | Recommended Technique | Why |
|----------|----------------------|-----|
| **General Q&A** | Similarity Search | Fast, simple, good baseline |
| **Need diverse results** | MMR | Avoids redundant information |
| **Search specific sections** | Metadata Filtering | Precise control over scope |
| **User-friendly interface** | Self-Query Retriever | Natural language, no manual filters |
| **Cost-sensitive production** | Compression | Reduces retrieved-context tokens (56.4% in the run above) |
| **Best quality answers** | Compression + MMR | Diverse, focused, efficient |
| **Production chatbot** | Self-Query + Compression | Best UX with cost efficiency |

### Production Considerations

#### 1. Cost Optimization
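- **Contextual Compression**: Cut retrieved-context tokens by 56.4% in the run above, lowering downstream LLM input cost; the compression step makes its own LLM calls, so net that cost out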
- **Self-Query adds 1 LLM call per query** (~$0.0001-0.0002)
- Consider caching common query patterns
- Use cheaper models (gpt-3.5-turbo-instruct) for query construction and compression
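
One way to cache common query patterns is to memoise the query-construction step on a normalised form of the question. The sketch below uses `functools.lru_cache`; `construct_query` is a hypothetical placeholder for the LLM call, and the normalisation rule (lowercase, collapsed whitespace) is an assumption that may be too aggressive or too weak for a given application.

```python
from functools import lru_cache

calls = {"n": 0}  # counts how often the expensive step actually runs

def construct_query(normalized_question: str) -> str:
    """Placeholder for the expensive LLM query-construction call (hypothetical)."""
    calls["n"] += 1
    return f"structured({normalized_question})"

def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivially different inputs share a cache key."""
    return " ".join(question.lower().split())

@lru_cache(maxsize=1024)
def _cached_construct(normalized_question: str) -> str:
    return construct_query(normalized_question)

def cached_construct_query(question: str) -> str:
    return _cached_construct(normalize(question))

first = cached_construct_query("What does page 5 say about consent?")
second = cached_construct_query("  what does PAGE 5 say about consent? ")
print(first == second, calls["n"])
```

An in-process cache like this is lost on restart and is not shared across workers; a shared store would be needed for that.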

#### 2. Latency
Rough, illustrative figures (not measured in this notebook; actual values depend on hardware, index size, and model):
- Similarity Search: ~50-100ms
- MMR: ~100-200ms (fetch_k candidates, then select diverse k)
- Self-Query: +200-500ms (LLM query parsing)
- Metadata Filter: ~50-100ms (when properly indexed)
- **Compression: +200-500ms** (LLM extraction), often worth it when downstream token cost dominates
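
Because these figures vary widely between setups, it is worth measuring them directly. The helper below is a minimal sketch using `time.perf_counter`; `fake_retriever` is a hypothetical stand-in to be replaced with a real call such as a retriever's `invoke`.

```python
import statistics
import time

def time_call(fn, *args, repeats=5):
    """Return the median wall-clock latency of fn(*args) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

def fake_retriever(query):
    """Hypothetical stand-in that simulates roughly 10 ms of work."""
    time.sleep(0.01)
    return []

latency_ms = time_call(fake_retriever, "penalties for data breach", repeats=3)
print(f"median latency: {latency_ms:.1f} ms")
```

The median is used rather than the mean so that a single slow call (for example a cold start) does not dominate the result.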

#### 3. Error Handling
```python
try:
    docs = compression_retriever.invoke(query)
except Exception as e:
    # Fallback to standard similarity search
    docs = vectordb.similarity_search(query, k=3)
```
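
The same fallback pattern can be wrapped in a reusable function that also logs the failure, so that silent degradation is visible in production. This is a sketch: `retrieve_with_fallback` and the two toy retrievers are hypothetical, and the callables would be replaced by real retriever calls.

```python
import logging

logger = logging.getLogger("retrieval")

def retrieve_with_fallback(query, primary, fallback):
    """Try the primary retriever; on any failure, log it and use the fallback."""
    try:
        return primary(query)
    except Exception:
        logger.warning("Primary retriever failed; using fallback", exc_info=True)
        return fallback(query)

def failing_primary(query):
    """Simulates a compression retriever whose LLM call times out."""
    raise RuntimeError("simulated LLM timeout")

def baseline(query):
    """Simulates plain similarity search."""
    return ["baseline chunk"]

docs = retrieve_with_fallback("penalties", failing_primary, baseline)
print(docs)
```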

#### 4. Hybrid Approaches
Combine multiple techniques for best results:
```python
# Example 1: MMR + Filtering
# Note: ChromaDB requires $and operator to combine multiple conditions
docs = vectordb.max_marginal_relevance_search(
    query,
    k=5,
    fetch_k=20,
    filter={"$and": [{"page": {"$gte": 5}}, {"page": {"$lte": 10}}]}
)

# Example 2: Compression + MMR (demonstrated above)
compression_mmr_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)
```

### Key Takeaways

✅ **Similarity Search**: Fast baseline, but may return redundant results  
✅ **MMR**: Balances relevance with diversity  
✅ **Metadata Filtering**: Narrows search scope to specific document sections  
✅ **Self-Query Retriever**: Natural language interface with automatic filter extraction  
✅ **Contextual Compression**: Extracts only relevant portions, reducing tokens and cost  
✅ **Hybrid Approaches**: Combine techniques for optimal results (e.g., Compression + MMR)

### Recommended Production Stack

For a production RAG system on DPDPA or similar legal documents:

```python
# 1. Base retriever with MMR for diversity
base_retriever = vectordb.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5}
)

# 2. Add compression to reduce cost
production_retriever = ContextualCompressionRetriever(
    base_compressor=LLMChainExtractor.from_llm(llm),
    base_retriever=base_retriever
)

# 3. Optional: Wrap with Self-Query for natural language filtering
# (adds latency but greatly improves UX)
```

### Next Steps

1. **Experiment with parameters**: 
   - Adjust `k` (number of results)
   - Tune MMR `lambda_mult` (0 = max diversity, 1 = max relevance)
   - Test different metadata filters
   - Compare compression savings on your specific use case

2. **Build RAG chain**: Combine retrieval with LLM for Q&A
3. **Add conversation history**: Implement chat memory
4. **Evaluate quality**: Measure retrieval precision/recall and answer quality
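
For the evaluation step, precision@k and recall@k are a common starting point. The sketch below assumes you have a hand-labelled set of relevant chunk IDs per question; the IDs shown are hypothetical.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@k and recall@k for one query.

    retrieved_ids: ranked list of chunk IDs returned by the retriever
    relevant_ids: set of chunk IDs a human judged relevant
    """
    top = retrieved_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Hypothetical labels for a penalties question
retrieved = ["p20", "p16", "p13", "p4"]
relevant = {"p20", "p16", "p17"}
precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
print(f"precision@3={precision:.2f} recall@3={recall:.2f}")
```

Averaging these scores over a set of test questions gives a simple way to compare similarity search, MMR, and filtered retrieval on the same data.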

---

## Additional Resources

- [LangChain Retrievers Documentation](https://python.langchain.com/docs/modules/data_connection/retrievers/)
- [Contextual Compression](https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression/)
- [Chroma Filtering Guide](https://docs.trychroma.com/usage-guide#filtering-by-metadata)
- [Self-Query Retriever Deep Dive](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/)

**Ready to build a complete RAG Q&A system in the next notebook!** 🚀