## Query Refinement Techniques in RAG Pipelines

Retrieval quality is often the single biggest factor in overall RAG performance.


* Baseline retrieval embeds the query as-is and returns the top-K nearest documents, but this often fails for multi-topic or poorly phrased queries.


* Some queries differ significantly from the language or structure of the source documents, leading to weak matches.


* Query refinement techniques help bridge these gaps and improve accuracy on more complex or ambiguous queries.

    * Query Decomposition
    * Query Expansion
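
As a rough illustration of the baseline described above, the sketch below ranks documents by cosine similarity to a query vector and keeps the top K. The three-dimensional vectors and document names are made up for illustration; real embeddings have far more dimensions, and a vector store would normally do this ranking for you.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: List[float], doc_vecs: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Rank documents by similarity to the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; the axes loosely mean (CO2, ice ages, models).
toy_docs = {
    "co2_history": [1.0, 0.0, 0.0],
    "ice_age_cycles": [0.0, 1.0, 0.0],
    "model_projections": [0.0, 0.0, 1.0],
}

# A multi-topic query sits between the topics instead of close to any one of them.
multi_topic_query = [1.0, 0.9, 0.8]
print(top_k(multi_topic_query, toy_docs, k=1))
```

With `k=1` only `co2_history` comes back even though the query touches all three topics, which is the kind of miss that query refinement tries to avoid.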



In [1]:
import json
import numpy as np
from typing import List, Tuple

from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI



In [2]:
from dotenv import load_dotenv
# Load environment variables from .env
load_dotenv()

True

In [3]:
embeddings = OpenAIEmbeddings()



llm = ChatOpenAI(
    model="gpt-4o-mini"
)

## Corpus: Load and Process Climate Change PDF

In [4]:
import os
# directory setup
# ------------------------------
try:
    current_dir = os.path.dirname(os.path.abspath(__file__))
except NameError:
    current_dir = os.getcwd()

pdf_path = current_dir + "/data/Climate Change.pdf"    


print(pdf_path)
if not os.path.exists(pdf_path):
    raise FileNotFoundError(f"❌ PDF not found: {pdf_path}")

C:\Users\vyanktesh.l\Documents\FDE\FullStack\FDE_first_web_App\LABS-From-01December\Vector_Rag/data/Climate Change.pdf


In [6]:
import PyPDF2
def extract_text_from_pdf(pdf_path: str) -> List[Document]:
    """Extract text from PDF and convert to LangChain Documents"""
    
    documents = []
    
    try:
        with open(pdf_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            
            for page_num, page in enumerate(pdf_reader.pages):
                text = page.extract_text()
                if text.strip():  # Only add non-empty pages
                    # Create metadata with page number
                    metadata = {
                        "source": pdf_path,
                        "page": page_num + 1,
                        "document": "Understanding Climate Change"
                    }
                    documents.append(Document(page_content=text, metadata=metadata))
                    
        print(f"✅ Extracted {len(documents)} pages from PDF")
        return documents

    except Exception as e:
        print(f"❌ Error reading PDF: {e}")
        return []

# Load the Climate Change PDF

climate_documents = extract_text_from_pdf(pdf_path)

✅ Extracted 35 pages from PDF


In [7]:
# The pages are already LangChain Documents; copy each one into a fresh
# Document, keeping content and metadata unchanged

simple_documents = []
for doc in climate_documents:
    simple_documents.append(Document(
        page_content=doc.page_content,
        metadata=doc.metadata  # Preserve metadata if needed
    ))

print(f"📄 Converted {len(simple_documents)} documents")
langchain_documents = simple_documents

📄 Converted 35 documents


## Chunking

In [8]:
# Split the documents into chunks
text_splitter = CharacterTextSplitter(
    chunk_size=10000,
    chunk_overlap=100,
    separator="\n"
)
docs = text_splitter.split_documents(climate_documents)

print(f"🔪 Split into {len(docs)} chunks")
for i, doc in enumerate(docs):
    print(f"  Chunk {i+1}: {doc.page_content[:100]}...")
    print(f"     Metadata: {doc.metadata}\n")

🔪 Split into 35 chunks
  Chunk 1: By Nicholas SchneiderUnderstanding 
Climate Change...
     Metadata: {'source': 'C:\\Users\\vyanktesh.l\\Documents\\FDE\\FullStack\\FDE_first_web_App\\LABS-From-01December\\Vector_Rag/data/Climate Change.pdf', 'page': 1, 'document': 'Understanding Climate Change'}

  Chunk 2: U n d e r s t a n d i n g  C l i m a t e  C h a n g e Preface   . . . . . . . . . . . . . . . . . . ...
     Metadata: {'source': 'C:\\Users\\vyanktesh.l\\Documents\\FDE\\FullStack\\FDE_first_web_App\\LABS-From-01December\\Vector_Rag/data/Climate Change.pdf', 'page': 3, 'document': 'Understanding Climate Change'}

  Chunk 3: w w w . f r a s e r i n s t i t u t e . o r gCopyright © 2008 by The Fraser Institute. All rights 
r...
     Metadata: {'source': 'C:\\Users\\vyanktesh.l\\Documents\\FDE\\FullStack\\FDE_first_web_App\\LABS-From-01December\\Vector_Rag/data/Climate Change.pdf', 'page': 4, 'document': 'Understanding Climate Change'}

  Chunk 4: 1 U n d e r s t a n d i n g  C 

## Create Vector Store with OpenAI Embeddings

In [9]:
# Create the vector store
print("🔄 Creating vector store with OpenAI embeddings...")
vector_store = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./db/chroma_db_climate"
)

print("✅ Vector store created successfully!")

🔄 Creating vector store with OpenAI embeddings...
✅ Vector store created successfully!


## Base Retriever 

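User: "What is the relationship between historical CO2 levels, temperature changes during ice age cycles, and the current projections for future warming based on climate models?"

This question spans several topics, so answering it well would take one targeted retrieval per topic:

* Step 1: Retrieve → "historical CO2 levels and patterns"
* Step 2: Retrieve → "temperature changes during ice age cycles"
* Step 3: Retrieve → "current CO2 levels and trends"
* Step 4: Retrieve → "climate model projections for future warming"

The base retriever does none of this: it embeds the full question once and returns the 5 nearest chunks.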

In [10]:
# Create base retriever
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# Test base retrieval
question = 'What is the relationship between historical CO2 levels, temperature changes during ice age cycles, and the current projections for future warming based on climate models?'
base_results = retriever.invoke(question)


In [11]:
base_results

[Document(metadata={'document': 'Understanding Climate Change', 'page': 24, 'source': 'C:\\Users\\vyanktesh.l\\Documents\\FDE\\FullStack\\FDE_first_web_App\\LABS-From-01December\\Vector_Rag/data/Climate Change.pdf'}, page_content='20 w w w . f r a s e r i n s t i t u t e . o r gThe earliest records of weather and climate derived \nfrom modern instruments such as thermometers \ngo back only to about 1750. Palaeoclimate science \ntherefore looks to long-term natural records to  \nestimate past climate changes. Many organisms, \nsuch as trees, coral, and plankton, exhibit variations \nin growth due to changes in the local climate,  \nproviding a possible means of inferring local \nchanges in the distant past. Non-biological natural \nrecords, such as ice-cores, can also help in the  \nestimation of past climate change. \nMany million years ago \nFor most of the last 500 million years, the earth \nwas probably much warmer than it is today  \nand had no ice sheets. Palaeoclimate records of 

## Custom Query Variants: Multiple Retrieval Rounds with Dependency Between Steps

Goal: Improve retrieval quality by enhancing the original query.

How it works:

* Generate multiple query variations from the original question.
* Search with all variations simultaneously.
* Treat the searches as a single retrieval process with enriched context.
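
The steps above can be sketched as a small helper that runs every variant through a retriever and merges the hits into one de-duplicated context. This is a minimal sketch, not the notebook's own code: `retrieve_with_variants`, `TOY_CORPUS`, and `toy_retrieve` are hypothetical names, and the keyword retriever only stands in for a real vector search.

```python
from typing import Callable, List, Tuple

# A retriever here is any function mapping a query string to (doc_id, text) pairs.
Retriever = Callable[[str], List[Tuple[str, str]]]


def retrieve_with_variants(variants: List[str], retrieve: Retriever) -> List[Tuple[str, str]]:
    """Run every query variant and return the union of results (first hit wins)."""
    seen_ids = set()
    merged: List[Tuple[str, str]] = []
    for variant in variants:
        for doc_id, text in retrieve(variant):
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                merged.append((doc_id, text))
    return merged


# Hypothetical corpus and keyword retriever, used only to make the sketch runnable.
TOY_CORPUS = {
    "p20": "ice cores record past CO2 levels",
    "p24": "temperature changes during ice age cycles",
    "p31": "climate models project future warming",
}


def toy_retrieve(query: str) -> List[Tuple[str, str]]:
    """Return every toy document sharing at least one word with the query."""
    words = set(query.lower().split())
    return [
        (doc_id, text)
        for doc_id, text in TOY_CORPUS.items()
        if words & set(text.lower().split())
    ]


variants = ["past CO2 levels", "ice age temperature", "future warming models"]
for doc_id, text in retrieve_with_variants(variants, toy_retrieve):
    print(doc_id, text)
```

To use it with the notebook's retriever, you would wrap `retriever.invoke` in a function that returns an identifier and the page content for each document; which metadata field makes a good identifier is a choice left to you.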

In [12]:
def break_into_subquestions(complex_question: str) -> List[str]:
    """Break down complex climate questions into dependent sub-questions"""
    
    prompt_template = """Analyze this complex climate science question and break it into 3-4 logical sub-questions that need to be answered in sequence.
    Each sub-question should build upon the previous one, creating a step-by-step reasoning chain.

    Complex Question: {question}

    Return ONLY a numbered list of sub-questions without any additional text. Make each sub-question specific and answerable.

    Sub-questions:"""
    
    prompt = ChatPromptTemplate.from_template(prompt_template)
    chain = prompt | llm
    
    try:
        response = chain.invoke({"question": complex_question})
        subquestions_text = response.content
        
        # Parse numbered sub-questions
        subquestions = []
        for line in subquestions_text.split('\n'):
            line = line.strip()
            # Match lines that start with numbers (1., 2., etc.)
            if line and (line[0].isdigit() and '.' in line.split()[0]):
                # Remove the number and period
                subq = line.split('.', 1)[1].strip()
                if subq and len(subq) > 10:  # Ensure meaningful questions
                    subquestions.append(subq)
        
        print("🔍 Multi-hop sub-questions generated:")
        for i, subq in enumerate(subquestions):
            print(f"  Step {i+1}: {subq}")
            
        return subquestions
        
    except Exception as e:
        print(f"Error generating sub-questions: {e}")
        # Fall back to the original question so retrieval can still proceed
        return [complex_question]


In [13]:
question = "What is the relationship between historical CO2 levels, temperature changes during ice age cycles, and the current projections for future warming based on climate models?"
# Generate dependent sub-questions
query_variants = break_into_subquestions(question)

🔍 Multi-hop sub-questions generated:
  Step 1: What were the historical CO2 levels during past ice age cycles, and how did they correlate with temperature changes?
  Step 2: How did the changes in temperature during these ice age cycles influence atmospheric CO2 levels?
  Step 3: What are the key mechanisms by which current climate models project future warming based on historical CO2 and temperature data?
  Step 4: How do current projections for future warming compare to the temperature changes observed during past ice age cycles in light of historical CO2 levels?
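
The heading promises retrieval rounds that depend on each other, but the cells above stop at generating the sub-questions. One possible way to chain them is sketched below: each round's answer is prepended to the next search query. The `multi_hop` helper and both toy callables are hypothetical; in practice `retrieve` would wrap the vector store and `answer` would call the LLM.

```python
from typing import Callable, List, Tuple


def multi_hop(
    subquestions: List[str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
) -> List[Tuple[str, str]]:
    """Answer sub-questions in order, carrying earlier answers into later retrievals."""
    history: List[Tuple[str, str]] = []
    for subq in subquestions:
        # Prepend what is known so far, so this step's search depends on earlier steps.
        known = " ".join(ans for _, ans in history)
        search_query = f"{known} {subq}".strip()
        passages = retrieve(search_query)
        history.append((subq, answer(subq, passages)))
    return history


# Toy stand-ins (assumptions) that record what was searched for.
seen_queries: List[str] = []


def toy_retrieve(query: str) -> List[str]:
    seen_queries.append(query)
    return [f"passage for: {query}"]


def toy_answer(subq: str, passages: List[str]) -> str:
    return f"[answer {len(seen_queries)}]"


steps = multi_hop(
    ["How did CO2 vary?", "How did temperature respond?"], toy_retrieve, toy_answer
)
for subq, ans in steps:
    print(subq, "->", ans)
print(seen_queries)
```

Carrying the full answer text forward is the simplest option; summarizing it first would keep later search queries shorter, at the cost of an extra model call.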


## Query Decomposition
Complex queries can be improved by breaking them into smaller sub-questions, each with its own targeted retrieval step.

Query decomposition returns the original question unchanged when it is already simple, or a set of simpler, focused questions when it is not.

The model drives this process using a system prompt that defines the task and guides behavior.

Providing few-shot examples strengthens the model's generalization, while removing them leads to less reliable decompositions.

In [14]:
from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticToolsParser
from langchain_core.runnables import (
    RunnableLambda,
    RunnableParallel,
    RunnablePassthrough,
    RunnableBranch,
)

In [15]:
decomp_system_prompt = """You are an expert assistant that prepares queries that will be sent to a search component.
These queries may be very complex. Your job is to simplify complex queries into multiple queries that can be answered in isolation from each other.

If the query is simple, then keep it as it is.

If there are acronyms or words you are not familiar with, do not try to rephrase them.
Here is an example of how to respond in a standard interaction:
<example>
- Query: Did Meta or Nvidia make more money last year?
Decomposed Questions: [SubQuery(sub_query='How much profit did Meta make last year?'), SubQuery(sub_query='How much profit did Nvidia make last year?')]
</example>
<example>
- Query: What is the capital of France?
Decomposed Questions: [SubQuery(sub_query='What is the capital of France?')]
</example>"""

In [16]:
class SubQuery(BaseModel):
    """You have performed query decomposition to generate a subquery of a question"""

    sub_query: str = Field(description="A unique subquery of the original question.")

In [17]:
query_decomposition_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", decomp_system_prompt),
        ("human", "Here is the customer's question: <question>{question}</question> How do you answer to the instructions?"),
    ]
)

llm_with_tools = llm.bind_tools([SubQuery])
decomp_query_analyzer = query_decomposition_prompt | llm_with_tools | PydanticToolsParser(tools=[SubQuery])

In [18]:
queries = decomp_query_analyzer.invoke({"question": question})
queries

[SubQuery(sub_query='What is the relationship between historical CO2 levels and temperature changes during ice age cycles?'),
 SubQuery(sub_query='What are the current projections for future warming based on climate models?')]
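
Before sending the parsed sub-queries to the retriever, it can help to normalize them into plain strings and to fall back to the original question when nothing usable comes back, for example if the model answers without calling the tool. The helper below is a sketch under that assumption; `SubQueryStub` is a hypothetical stand-in for the `SubQuery` model so the example runs without any dependencies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubQueryStub:
    """Hypothetical stand-in for the SubQuery model used in this notebook."""

    sub_query: str


def sub_queries_or_original(parsed: List[SubQueryStub], original: str) -> List[str]:
    """Return de-duplicated sub-query strings, or the original question if none exist."""
    unique: List[str] = []
    for item in parsed:
        text = item.sub_query.strip()
        if text and text not in unique:
            unique.append(text)
    # An empty result means decomposition produced nothing usable.
    return unique or [original]


parsed = [
    SubQueryStub("How are CO2 levels and ice age temperatures related?"),
    SubQueryStub("What do climate models project for future warming?"),
    SubQueryStub("What do climate models project for future warming?"),
]
print(sub_queries_or_original(parsed, "original question"))
print(sub_queries_or_original([], "original question"))
```

Each returned string can then be passed to the retriever on its own, which gives every sub-question its own targeted retrieval step.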

## Query Expansion

Query expansion improves retrieval by generating multiple alternative versions of the original query.

Unlike decomposition, expansion keeps the same intent but varies the wording to increase match likelihood.

The model uses a system prompt that instructs it to produce multiple rewrites of the query.

In our setup, the prompt asks for at least three versions of the question, and the example run returns three distinct reformulations.

In [19]:
class ParaphrasedQuery(BaseModel):
    """You have performed query expansion to generate a paraphrasing of a question."""

    paraphrased_query: str = Field(description="A unique paraphrasing of the original question.")

In [20]:
paraphrase_system_prompt = """You are an expert at converting user questions into database queries. 
You have access to a database of documents about climate change.

Perform query expansion. If there are multiple common ways of phrasing a user question 
or common synonyms for key words in the question, make sure to return multiple versions 
of the query with the different phrasings.

If there are acronyms or words you are not familiar with, do not try to rephrase them.

Always return at least 3 versions of the question."""


In [21]:
query_expansion_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", paraphrase_system_prompt),
        ("human", "Here is the customer's question: <question>{question}</question> How do you answer to the instructions?"),
    ]
)
llm_with_tools = llm.bind_tools([ParaphrasedQuery])
query_expansion = query_expansion_prompt | llm_with_tools | PydanticToolsParser(tools=[ParaphrasedQuery])


In [22]:
short_question="co2 levels"

In [23]:
query_expansion.invoke({"question": short_question})

[ParaphrasedQuery(paraphrased_query='What are the levels of CO2?'),
 ParaphrasedQuery(paraphrased_query='Can you provide information on CO2 concentrations?'),
 ParaphrasedQuery(paraphrased_query='What is the current status of carbon dioxide levels?')]
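
Each paraphrase produces its own ranked list of documents, so the lists need to be merged before they reach the LLM. The notebook does not show this step. One common option is reciprocal rank fusion (RRF), sketched below on made-up document ids; the `k=60` smoothing constant is a conventional default and not something tuned for this corpus.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document ids; each appearance adds 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for three paraphrases of "co2 levels".
results_per_paraphrase = [
    ["p24", "p20", "p31"],
    ["p20", "p24"],
    ["p20", "p31"],
]
print(reciprocal_rank_fusion(results_per_paraphrase))
```

Documents that rank well for several paraphrases rise to the top, while a document that only one paraphrase finds still stays in the merged list.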