# Multi-Query Retriever vs Self-Querying Retriever Comparison

## Overview

This notebook compares two advanced retrieval techniques in LangChain:
- **Multi-Query Retriever**: Generates multiple query variations to capture different perspectives
- **Self-Querying Retriever**: Uses an LLM to construct structured queries with metadata filters



## Setup and Imports

Import all necessary libraries for document processing, embeddings, vector stores, and retrievers.

In [1]:

# Core LangChain imports

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_core.documents import Document
from langchain_community.llms import Ollama

import logging
import sys
import warnings

sys.path.append('..')  # make the project-level config module importable
import config
from langchain_core.prompts import PromptTemplate

warnings.filterwarnings('ignore')

print("✅ All imports successful!")




✅ All imports successful!


## Initialize Embedding Model

Set up logging for debugging and initialize the HuggingFace embedding model for document vectorization.

In [2]:

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# Initialize embedding model
print("🔤 Initializing embedding model...")
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cuda'},
)
print("✅ Embedding model ready!")

🔤 Initializing embedding model...
✅ Embedding model ready!
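
The cell above hard-codes `'device': 'cuda'`, which fails on a machine without a GPU. A minimal sketch of a fallback is shown here, assuming PyTorch is the backend behind the embedding model; `pick_device` is a hypothetical helper, not part of the notebook.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: PyTorch backs the embedding model
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The result could then be passed as `model_kwargs={'device': device}` when constructing `HuggingFaceEmbeddings`.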


## Initialize Large Language Models

Configure two LLMs: OpenAI's `gpt-4o-mini` (via `ChatOpenAI`) for constructing self-queries, and an Ollama-served `qwen2.5:3b` for generating multi-query variations.

In [None]:
from langchain_openai import ChatOpenAI
import os
from dotenv import load_dotenv

# Load from .env file (never commit .env to git!)
load_dotenv()

# Validate API key exists
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("❌ OPENAI_API_KEY not found in environment variables. Please set it in .env file")

llm_self_query = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.0,
    max_tokens=512
)

llm_multi_query = Ollama(model="qwen2.5:3b", temperature=0.5, num_predict=512) 

## 🎯 Self-Querying Retriever

### What is Self-Querying Retriever?

The Self-Querying Retriever translates a natural language query into a structured query. It uses a query-constructing LLM chain to separate the semantic search text from any metadata filters, then applies both to the vector store.

**How it works:**
1. Takes a natural language query
2. Uses an LLM to extract query text and metadata filters
3. Applies both semantic search and metadata filtering
4. Returns filtered results
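
The steps above can be sketched in plain Python. This is a toy illustration, not LangChain's implementation: `StructuredQuery` and `toy_search` are hypothetical names, the filters are simple equality checks, and keyword overlap stands in for embedding similarity.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredQuery:
    """Hypothetical stand-in for what a query constructor produces."""
    query: str                                   # text used for semantic search
    filters: dict = field(default_factory=dict)  # metadata equality filters


def toy_search(docs, sq):
    """Apply metadata filters, then rank the survivors by keyword overlap."""
    terms = set(sq.query.lower().split())
    matches = [
        d for d in docs
        if all(d["metadata"].get(k) == v for k, v in sq.filters.items())
    ]
    # Keyword overlap is only a placeholder for vector similarity.
    return sorted(
        matches,
        key=lambda d: len(terms & set(d["text"].lower().split())),
        reverse=True,
    )


docs = [
    {"text": "retrieval augmented generation survey", "metadata": {"subject": "RAG"}},
    {"text": "scaling laws for language models", "metadata": {"subject": "LLM"}},
    {"text": "retrieval for long context language models", "metadata": {"subject": "LLM"}},
]

# "LLM papers about retrieval" becomes query text plus a subject filter.
sq = StructuredQuery(query="retrieval", filters={"subject": "LLM"})
results = toy_search(docs, sq)
print([d["text"] for d in results])
```

The RAG document is excluded by the filter even though it mentions retrieval, which is the behavior that distinguishes self-querying from plain similarity search.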

**Pros:**
- Combines semantic search with metadata filtering
- Handles complex queries with multiple criteria
- More precise results through filtering

**Cons:**
- Requires well-structured metadata
- More complex setup
- May miss results if metadata is incomplete

## 📊 Dataset for Self-Query Retriever: HuggingFace Daily Papers

### Why This Dataset?
- **Rich structured metadata**: Papers have natural categorical and temporal attributes (subjects, publication dates)
- **Filtering requirements**: Research paper searches often need specific criteria ("LLM papers from October 2025")
- **Demonstrates Self-Query strength**: Shows how the retriever combines semantic search with metadata filtering
- **Real-world use case**: Mirrors actual research paper discovery workflows

### Data Creation Pipeline

**Step 1: Fetch Raw Data**
- Source: HuggingFace Daily Papers API (`https://huggingface.co/api/daily_papers`)
- Method: Paginated API calls (10 pages × 50 papers = 500 papers)
- Fields: title, summary, authors, publishedAt, ai_keywords

**Step 2: Transform and Enrich with LLM**
- Used Gemma 3:1b model to generate enhanced metadata:
  - `three_sentence_summary`: Concise summary of the abstract
  - `innovations`: Top 3 key innovations in bullet points
  - `subjects`: AI topic classification (LLM, RAG, Agentic AI, etc.)
- Date conversion: ISO format → Unix timestamp for filtering

**Step 3: Create Vector Store Documents**
- **page_content**: LLM-generated three-sentence summary (semantic search target)
- **metadata**: 
  - `title` (string): Paper title
  - `subject` (string): Primary subject classification
  - `date` (integer): Unix timestamp for date-range queries

**Result**: 500 semantically-searchable papers with filterable metadata stored in Chroma (`sq_hf` directory)
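
Storing dates as Unix timestamps lets a request such as "papers from October 2025" become a numeric range check. The sketch below shows one way to compute the bounds; `month_bounds` and `in_range` are hypothetical helpers, and UTC is assumed here for determinism, whereas the notebook's own conversion uses the machine's local time zone.

```python
from datetime import datetime, timezone


def month_bounds(year, month):
    """Return (start, end) Unix timestamps for a calendar month, end exclusive."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())


start_ts, end_ts = month_bounds(2025, 10)


def in_range(doc_ts):
    """True when a document timestamp falls inside October 2025 (UTC)."""
    return start_ts <= doc_ts < end_ts


print(start_ts, end_ts, in_range(1759874400))
```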

In [3]:
import json

with open(config.TRANSFORMED_PAPERS_PATH, 'r') as file:
    transformed_papers = json.load(file)

In [None]:
from datetime import datetime

sq_documents = []
for paper in transformed_papers:
    # Convert date string to timestamp
    date_obj = datetime.strptime(paper['date'], '%Y-%m-%d')
    timestamp = int(date_obj.timestamp())
    
    document = Document(
        page_content=paper["three_sentence_summary"],
        metadata={
            "title": paper['title'],
            "subject": paper['subjects'].split(',')[0],
            "date": timestamp,  
        }
    )
    sq_documents.append(document)

In [9]:
# Create vector store
print(" Creating vector store for self-query retriever")
vectorstore = Chroma.from_documents(
    documents=sq_documents,
    embedding=embeddings,
    persist_directory=os.path.join(config.CHROMA_PERSIST_DIRECTORY, "sq_hf")
)
print(" Vector store created!")


 Creating vector store for self-query retriever
 Vector store created!


In [11]:
# Define metadata field information for Self-Querying Retriever
metadata_field_info = [
    AttributeInfo(
        name="title",
        description="The title of the paper related to AI from Huggingface Daily Papers",
        type="string",
    ),

    AttributeInfo(
        name="subject",  # must match the metadata key stored on each Document
        description="The primary subject of the paper. Options: LLM, SLM, VLM, Agentic RAG, AI agents, Agentic AI, RAG, multi-agent, RLHF, prompt engineering, context engineering, Agent, Reinforcement learning, prompt, GPT and other related topics",
        type="string",
    ),
    AttributeInfo(
        name="date",
        description="The publication date as Unix timestamp (integer). Use this for date comparisons.",
        type="integer",
    ),
]

document_content_description = "Research papers from huggingface daily papers."

In [12]:
# Create Self-Querying Retriever
print("🎯 Setting up Self-Querying Retriever...")
self_query_retriever = SelfQueryRetriever.from_llm(
    llm_self_query,
    vectorstore,
    document_content_description,
    metadata_field_info,
    verbose=True
)
print("✅ Self-Querying Retriever ready!")


🎯 Setting up Self-Querying Retriever...


✅ Self-Querying Retriever ready!


In [None]:
query="papers on Generative Models?"

In [34]:
print("🎯 Testing Self-Querying Retriever...")



self_query_results1 = self_query_retriever.invoke(query)
print(f"Query: '{query}'")
print(f"Results: {len(self_query_results1)} documents")


🎯 Testing Self-Querying Retriever...
Query: 'papers on Generative Models?'
Results: 4 documents


In [35]:
for result in self_query_results1:
    print("Title: ", result.metadata['title'])
    print("Subject: ", result.metadata['subject'])
    print("Date: ", result.metadata['date'])
    print("Page content: ", result.page_content)
    print("\n ---------------------------------------------------------")
    

Title:  From Data to Rewards: a Bilevel Optimization Perspective on Maximum
  Likelihood Estimation
Subject:  Generative models
Date:  1759874400
Page content:  This research explores a novel optimization framework for generative models to improve generalization and robustness. Utilizing Bilevel Optimization, the work addresses the challenge of reward signal scarcity by incorporating a policy gradient objective within an outer-level optimization problem.  The theoretical analysis and code release demonstrate the framework’s applicability to tabular classification and model-based reinforcement learning, offering a potential solution to alignment challenges in generative AI.

 ---------------------------------------------------------
Title:  GRACE: Generative Representation Learning via Contrastive Policy
  Optimization
Subject:  Large Language Models
Date:  1759701600
Page content:  GRACE is a new framework that trains LLMs to generate explicit rationales alongside embeddings, transform


## 🔍 Multi-Query Retriever

### What is Multi-Query Retriever?

The Multi-Query Retriever addresses the limitation that retrieval results may vary with subtle changes in query wording. It uses an LLM to generate multiple queries from different perspectives for a given user input query.

**How it works:**
1. Takes a user query
2. Uses an LLM to generate multiple query variations
3. Retrieves documents for each query variation
4. Returns the unique union of all results
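
The merge in step 4 can be sketched as an order-preserving, de-duplicated union. This is a simplified illustration using document IDs; `unique_union` and the sample IDs are hypothetical and do not reproduce LangChain's internal code.

```python
def unique_union(result_lists):
    """Merge ranked result lists, keeping the first occurrence of each ID."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical results, one list per generated query variation.
per_query_results = [
    ["faq-1", "faq-7"],
    ["faq-7", "faq-3"],
    ["faq-1"],
]
merged = unique_union(per_query_results)
print(merged)
```

Because duplicates collapse, the final result count can be smaller than the number of variations multiplied by `k`.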

**Pros:**
- Captures different perspectives on the same question
- Reduces sensitivity to query wording
- Provides more comprehensive results

**Cons:**
- More computationally expensive
- May return redundant results
- Requires LLM for query generation

## 📚 Dataset for Multi-Query Retriever: Course FAQ Data

### Why This Dataset?
- **Natural query variation**: Course questions can be asked in many different ways
- **Simple metadata structure**: Focus is on semantic similarity, not complex filtering
- **Demonstrates Multi-Query strength**: Shows how query expansion captures different phrasings
- **Common use case**: FAQ retrieval benefits from understanding multiple question formulations

### Data Source and Structure

**Source**: DataTalks Club LLM Zoomcamp FAQ
- URL: `https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/search_evaluation/documents-with-ids.json`
- Pre-structured JSON with validated Q&A pairs
- Multiple courses and modules included

**Data Creation Process**:
- **Direct loading**: No transformation needed - data already in FAQ format
- **Document structure**:
  - **page_content**: The FAQ question (what users search for)
  - **metadata**:
    - `answer`: Full answer text
    - `section`: Course module (e.g., "Module 1: Docker and Terraform")
    - `course`: Course identifier (e.g., "data-engineering-zoomcamp")
    - `id`: Unique identifier

**Why Questions as page_content?**
- Users search with questions, so embedding questions enables better semantic matching
- Answers stored in metadata for retrieval after matching
- Query expansion helps match different ways of asking the same question

**Result**: Comprehensive FAQ knowledge base stored in Chroma (`mq_faq` directory) optimized for multi-perspective retrieval
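
With questions embedded and answers kept in metadata, the answer is read back from each matched document after retrieval. A minimal sketch follows, using a hypothetical `Doc` dataclass as a stand-in for a LangChain `Document` and placeholder answer text.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a Document: page_content plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def answers_for(results):
    """Pair each matched question with the answer stored in its metadata."""
    return [(d.page_content, d.metadata["answer"]) for d in results]


# Placeholder record; the real answer text comes from the FAQ JSON.
retrieved = [
    Doc("Course - When will the course start?", {"answer": "<answer text>"}),
]
pairs = answers_for(retrieved)
print(pairs)
```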

In [4]:
import requests 

docs_url = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/refs/heads/main/03-evaluation/search_evaluation/documents-with-ids.json'

documents = requests.get(docs_url).json()

In [5]:
documents[50]

{'text': 'For those who wish to use the backslash as an escape character in Git Bash for Windows (as Alexey normally does), type in the terminal: bash.escapeChar=\\ (no need to include in .bashrc)',
 'section': 'Module 1: Docker and Terraform',
 'question': 'Git Bash - Backslash as an escape character in Git Bash for Windows',
 'course': 'data-engineering-zoomcamp',
 'id': '2f83dbe7'}

## Create LangChain Documents


In [None]:


faq_documents = []
for doc in documents:
    # Embed the question; keep the answer and course info as metadata
    document = Document(
        page_content=doc["question"],
        metadata={
            "answer": doc['text'],
            "section": doc['section'],
            "course": doc['course'],
            "id": doc['id']
        }
    )
    faq_documents.append(document)

    

## Build Vector Store - Multi-Query Retrieval

Create a Chroma vector database from our documents using the embedding model.

In [34]:
# Create vector store
print(" Creating vector store for multi-query retriever")
vectorstore = Chroma.from_documents(
    documents=faq_documents,
    embedding=embeddings,
    persist_directory=os.path.join(config.CHROMA_PERSIST_DIRECTORY, "mq_faq")
)
print(" Vector store created!")


 Creating vector store for multi-query retriever
 Vector store created!


### Custom Prompt Template

1. Import the `PromptTemplate` class to create a custom prompt for query generation.
2. Create a specialized prompt that instructs the LLM to generate 3 alternative versions of the user's query.

In [10]:
from langchain_core.prompts.prompt import PromptTemplate

In [31]:
prompt=PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is
    to generate 3 different versions of the given user
    query to retrieve relevant documents from a vector database.
    The database consists of FAQ questions and answers from a course.
    The query is expected to be a general question about the course.
    Your goal is to help the user overcome some of the limitations of
    distance-based similarity search. Provide these alternative
    queries separated by newlines. Original query: {question}""",
)

In [None]:
test_query = "when does the course start?"
formatted_prompt = prompt.format(question=test_query)
generated_text = llm_multi_query.invoke(formatted_prompt)

print("Raw LLM output (generated queries):")
print(generated_text)

Raw LLM output (generated queries):
Okay, here are 3 different versions of the query, designed to help retrieve relevant documents from the vector database, focusing on broader understanding and addressing potential limitations of a distance-based search:

**Version 1:**  “What is the official start date for the [Course Name] course?”

**Version 2:** “Can you provide a timeline for the [Course Name] course, including the enrollment date and initial class start?”

**Version 3:** “What are the key dates associated with the [Course Name] course, such as the first session, and what’s the general timeframe for the course?” 

---

Let me know if you’d like me to generate more variations or want me to focus on a specific aspect of the query (e.g., focusing on specific topics within the course).
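
The raw output above contains a preamble, bold labels, a separator, and a closing remark in addition to the three queries. The queries logged later in this notebook suggest that each output line is used as a query, preamble included. A hedged sketch of a clean-up step is shown below; `clean_queries` is a hypothetical helper, and keeping only lines that end with a question mark is a heuristic that fits this output rather than a general rule.

```python
import re

# Matches labels such as "**Version 1:**" at the start of a line.
LABEL = re.compile(r"^\**\s*version\s*\d+\s*:\s*\**\s*", re.IGNORECASE)
QUOTES = "\"'“”‘’ "


def clean_queries(raw_output):
    """Keep question-like lines and strip labels and surrounding quotes."""
    queries = []
    for line in raw_output.splitlines():
        text = LABEL.sub("", line.strip()).strip(QUOTES)
        if text.endswith("?"):
            queries.append(text)
    return queries


raw = """Okay, here are 3 different versions of the query:

**Version 1:**  “What is the official start date for the course?”

**Version 2:** “Can you provide a timeline for the course?”

---

Let me know if you’d like more variations."""

cleaned = clean_queries(raw)
print(cleaned)
```

Tightening the prompt so the model returns only the queries is an alternative that avoids post-processing.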


In [37]:
# Create Multi-Query Retriever
print("🔍 Setting up Multi-Query Retriever...")
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 1}),
    prompt=prompt,
    llm=llm_multi_query  # the Ollama model configured earlier
)
print("✅ Multi-Query Retriever ready!")


🔍 Setting up Multi-Query Retriever...
✅ Multi-Query Retriever ready!


In [58]:
documents[0]['section']

'General course-related questions'

In [61]:
# Test Multi-Query Retriever
print("🔍 Testing Multi-Query Retriever...")
test_query = "when does the data engineering course start?"
multi_query_results = multi_query_retriever.invoke(test_query)

print(f"📋 Multi-Query Results ({len(multi_query_results)} documents):")
for i, doc in enumerate(multi_query_results):
    print(f"\n--- Result {i+1} ---")
    print(f"Content: \n {doc.page_content}")
    print(f"Metadata: \n {doc.metadata['section']}")



🔍 Testing Multi-Query Retriever...


INFO:langchain.retrievers.multi_query:Generated queries: ['Okay, here are 3 different versions of the query, designed to help retrieve relevant documents from a vector database, focusing on a general understanding of the course’s start date and addressing potential limitations of a purely distance-based search:', '**Version 1:**  “What is the official start date for the Data Engineering course?”', '**Version 2:** “Can you provide the enrollment date for the Data Engineering course?”', '**Version 3:** “What’s the timeline for the Data Engineering course, including the beginning date?”']


📋 Multi-Query Results (3 documents):

--- Result 1 ---
Content: 
 Course - When will the course start?
Metadata: 
 General course-related questions

--- Result 2 ---
Content: 
 Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?
Metadata: 
 General course-related questions

--- Result 3 ---
Content: 
 What are the deadlines in this course?
Metadata: 
 General course-related questions


## 📋 Comparison Summary

| Aspect | Multi-Query Retriever | Self-Querying Retriever |
|--------|----------------------|------------------------|
| **Primary Use Case** | Query variation and comprehensive results | Metadata filtering with semantic search |
| **Query Processing** | Generates multiple query variations | Extracts query text + metadata filters |
| **Result Quality** | Broader, more comprehensive | More precise, filtered results |
| **Metadata Usage** | Limited | Extensive |
| **Best For** | Exploratory search, research | Filtered search, specific criteria |


## 🎯 When to Use Each Retriever

### Use Multi-Query Retriever when:
- ✅ You want comprehensive results from different perspectives
- ✅ Query wording might vary significantly
- ✅ You're doing exploratory research
- ✅ You need to capture different ways of asking the same question
- ✅ You have limited metadata or don't need filtering

### Use Self-Querying Retriever when:
- ✅ You have rich, structured metadata
- ✅ You need precise filtering capabilities
- ✅ You want to combine semantic search with metadata filters
- ✅ You have specific criteria (date ranges, categories, etc.)
- ✅ You need more targeted results

### Use Both Together when:
- ✅ You need both comprehensive coverage AND precise filtering
- ✅ You're building a complex search system
- ✅ You want to combine the benefits of both approaches


## 🎓 Key Takeaways

### Multi-Query Retriever:
- **Best for**: Exploratory research, comprehensive coverage
- **Strengths**: Captures different perspectives, reduces query sensitivity
- **Trade-offs**: Higher computational cost, potential redundancy

### Self-Querying Retriever:
- **Best for**: Precise filtering, structured data
- **Strengths**: Combines semantic search with metadata filtering
- **Trade-offs**: Requires rich metadata, more complex setup

### Hybrid Approach:
- **Best for**: Complex applications needing both coverage and precision
- **Strengths**: Combines benefits of both approaches
- **Trade-offs**: Highest computational cost, most complex setup

## 🏁 Conclusion

The two retrievers serve different purposes in the retrieval ecosystem:
- Use **Multi-Query** when you need comprehensive, diverse results
- Use **Self-Querying** when you need precise, filtered results  
- Use **Hybrid** when you need the best of both worlds

The choice depends on your specific use case, data structure, and performance requirements.
