# Query Transformations for Improved Retrieval in RAG Systems

## Overview

This code implements three query transformation techniques to enhance the retrieval process in Retrieval-Augmented Generation (RAG) systems:

1. Query Rewriting
2. Step-back Prompting
3. Sub-query Decomposition

Each technique aims to improve the relevance and comprehensiveness of retrieved information by modifying or expanding the original query.

## Motivation

RAG systems often face challenges in retrieving the most relevant information, especially when dealing with complex or ambiguous queries. These query transformation techniques address this issue by reformulating queries to better match relevant documents or to retrieve more comprehensive information.

## Key Components

1. Query Rewriting: Reformulates queries to be more specific and detailed.
2. Step-back Prompting: Generates broader queries for better context retrieval.
3. Sub-query Decomposition: Breaks down complex queries into simpler sub-queries.

## Method Details

### 1. Query Rewriting

- **Purpose**: To make queries more specific and detailed, improving the likelihood of retrieving relevant information.
- **Implementation**:
  - Uses a chat LLM (accessed through `ChatOpenAI`) with a custom prompt template.
  - Takes the original query and reformulates it to be more specific and detailed.

### 2. Step-back Prompting

- **Purpose**: To generate broader, more general queries that can help retrieve relevant background information.
- **Implementation**:
  - Uses a chat LLM (accessed through `ChatOpenAI`) with a custom prompt template.
  - Takes the original query and generates a more general "step-back" query.

### 3. Sub-query Decomposition

- **Purpose**: To break down complex queries into simpler sub-queries for more comprehensive information retrieval.
- **Implementation**:
  - Uses a chat LLM (accessed through `ChatOpenAI`) with a custom prompt template.
  - Decomposes the original query into 2-4 simpler sub-queries.

## Benefits of these Approaches

1. **Improved Relevance**: Query rewriting helps in retrieving more specific and relevant information.
2. **Better Context**: Step-back prompting allows for retrieval of broader context and background information.
3. **Comprehensive Results**: Sub-query decomposition enables retrieval of information that covers different aspects of a complex query.
4. **Flexibility**: Each technique can be used independently or in combination, depending on the specific use case.

## Implementation Details

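- All techniques use a single chat LLM for query transformation. The code in this notebook accesses `meta-llama/Llama-3.2-3B-Instruct` through LangChain's `ChatOpenAI` client, pointed at a vLLM server.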
- Custom prompt templates are used to guide the model in generating appropriate transformations.
- The code provides separate functions for each transformation technique, allowing for easy integration into existing RAG systems.
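
As a rough illustration of how separate transformation functions might be combined in a retrieval step, the sketch below runs retrieval for the original query plus every transformed query and merges the results without duplicates. The names `retrieve_with_transformations`, `fake_step_back`, `fake_decompose`, and `fake_retrieve` are hypothetical stand-ins, not part of this notebook; in practice the transformations would wrap the LLM-backed functions defined later and `retrieve` would call a vector store.

```python
from typing import Callable, Iterable


def retrieve_with_transformations(
    query: str,
    transformations: Iterable[Callable[[str], list[str]]],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Retrieve for the original query and each transformed query,
    returning de-duplicated documents in first-seen order."""
    queries = [query]
    for transform in transformations:
        queries.extend(transform(query))

    seen: set[str] = set()
    merged: list[str] = []
    for q in queries:
        for doc in retrieve(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Toy stand-ins (hypothetical) so the sketch runs without an LLM or vector store.
def fake_step_back(query: str) -> list[str]:
    return ["What are the general effects of climate change?"]


def fake_decompose(query: str) -> list[str]:
    return [
        "How does climate change affect the oceans?",
        "What are the impacts of climate change on biodiversity?",
    ]


def fake_retrieve(query: str) -> list[str]:
    # Keyword lookup standing in for a similarity search.
    corpus = {
        "oceans": "doc-oceans",
        "biodiversity": "doc-biodiversity",
        "climate change": "doc-overview",
    }
    return [doc for keyword, doc in corpus.items() if keyword in query.lower()]


docs = retrieve_with_transformations(
    "What are the impacts of climate change on the environment?",
    [fake_step_back, fake_decompose],
    fake_retrieve,
)
print(docs)
```

Each transformation returns a list so that single-query techniques (rewriting, step-back) and multi-query techniques (decomposition) share one interface.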

## Example Use Case

The code demonstrates each technique using the example query:
"What are the impacts of climate change on the environment?"

- **Query Rewriting** expands this to include specific aspects like temperature changes and biodiversity.
- **Step-back Prompting** generalizes it to a broader question such as "What are the general effects of climate change?" (the exact wording depends on the model and the run).
- **Sub-query Decomposition** breaks it down into narrower questions, for example about biodiversity, oceans, weather patterns, and terrestrial environments.

## Conclusion

These query transformation techniques can enhance the retrieval capabilities of RAG systems. By reformulating queries in different ways, they can improve the relevance, context, and comprehensiveness of retrieved information. They are particularly valuable in domains where queries are complex or multifaceted, such as scientific research, legal analysis, or comprehensive fact-finding tasks.

# Package Installation and Imports

The cell below imports the packages required to run this notebook.


In [9]:
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate

import os
from dotenv import load_dotenv
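
The `os` and `load_dotenv` imports above are not used in the cells shown here. One plausible use is keeping the model endpoint out of the notebook itself. The sketch below is an assumption, not the notebook's actual configuration: the function `llm_settings` and the environment variable names `LLM_MODEL`, `LLM_BASE_URL`, and `LLM_API_KEY` are hypothetical, and the fallback URL assumes a locally running OpenAI-compatible server.

```python
import os


def llm_settings(env=os.environ) -> dict:
    """Collect LLM connection settings from environment variables.

    The variable names and the localhost fallback are illustrative assumptions.
    """
    return {
        "model": env.get("LLM_MODEL", "meta-llama/Llama-3.2-3B-Instruct"),
        "base_url": env.get("LLM_BASE_URL", "http://localhost:8000/v1"),
        "api_key": env.get("LLM_API_KEY", "EMPTY"),
    }


settings = llm_settings({"LLM_BASE_URL": "http://example.internal:8000/v1"})
print(settings)

# Possible usage once python-dotenv has loaded a .env file via load_dotenv():
#   llm = ChatOpenAI(**llm_settings(), temperature=0, max_tokens=4000)
```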

In [11]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="meta-llama/Llama-3.2-3B-Instruct",
    base_url="http://10.0.64.77:36363/v1",  # vLLM server URL
    api_key="EMPTY",  # can be any string; vLLM does not check the key
    temperature=0,
    max_tokens=4000,
)

query_rewrite_template = """You are an AI assistant tasked with reformulating user queries to improve retrieval in a RAG system.
Given the original query, rewrite it to be more specific, detailed, and likely to retrieve relevant information.

Original query: {original_query}

Rewritten query:"""

query_rewrite_prompt = PromptTemplate(
    input_variables=["original_query"],
    template=query_rewrite_template
)

query_rewriter = query_rewrite_prompt | llm

def rewrite_query(original_query):
    response = query_rewriter.invoke({"original_query": original_query})
    return response.content

rewrite_query("Explain the theory of relativity.")

'Here are a few rewritten queries that are more specific, detailed, and likely to retrieve relevant information:\n\n1. "Explain the fundamental principles of special relativity, including time dilation, length contraction, and the speed of light limit, and how they differ from general relativity."\n2. "Describe the mathematical formulation of special relativity, including the Lorentz transformation and the concept of spacetime, and provide examples of its application in particle physics and astrophysics."\n3. "Discuss the key differences between special and general relativity, including the role of gravity, the curvature of spacetime, and the implications for our understanding of space and time."\n4. "Explain the concept of spacetime in special relativity, including its geometry, topology, and the relationship between space and time, and provide examples of its application in modern physics and engineering."\n5. "Provide a detailed overview of Albert Einstein\'s theory of special relat

### 1 - Query Rewriting: Reformulating queries to improve retrieval.

In [15]:
# Create a prompt template for query rewriting
query_rewrite_template = """
You are an AI assistant tasked with reformulating user queries to improve retrieval in a RAG system. 
Given the original query, rewrite it to be more specific, detailed, and likely to retrieve relevant information.
Original query: {original_query}
Rewritten query:
"""

query_rewrite_prompt = PromptTemplate(
    input_variables=["original_query"],
    template=query_rewrite_template
)

query_rewriter = query_rewrite_prompt | llm

def rewrite_query(original_query):
    """
    Rewrite the original query to improve retrieval.
    Args:
    original_query (str): The original user query
    Returns:
    str: The rewritten query
    """
    response = query_rewriter.invoke({"original_query": original_query})
    return response.content

print(rewrite_query("Explain the theory of relativity."))

Here are a few rewritten queries that are more specific, detailed, and likely to retrieve relevant information:

1. "Explain the fundamental principles of special relativity, including time dilation, length contraction, and the speed of light limit, in the context of classical mechanics and electromagnetism."
2. "Describe the key concepts of general relativity, including curvature of spacetime, gravitational redshift, and the equivalence principle, and their implications for our understanding of gravity and the behavior of celestial objects."
3. "Discuss the historical development of the theory of relativity, from Einstein's early work on special relativity to the refinement of general relativity, and highlight the major contributions of other physicists, such as Henri Poincaré and Hermann Minkowski."
4. "Compare and contrast the special and general theories of relativity, highlighting their differences in approach, scope, and application, and explain how they have been used to make pr

### Demonstrate on a use case

In [14]:
# example query over the understanding climate change dataset
original_query = "What are the impacts of climate change on the environment?"
rewritten_query = rewrite_query(original_query)
print("Original query:", original_query)
print("\nRewritten query:", rewritten_query)

Original query: What are the impacts of climate change on the environment?

Rewritten query: Here are a few rewritten query options:

1. "What are the most significant environmental impacts of climate change, including rising sea levels, melting glaciers, and altered ecosystems, and how can they be mitigated or adapted to?"
2. "What are the direct and indirect effects of climate change on global biodiversity, including changes in species distribution, extinction rates, and ecosystem resilience, and what are the potential consequences for human well-being?"
3. "What are the primary environmental consequences of climate change, such as ocean acidification, water scarcity, and increased frequency of natural disasters, and how can they be addressed through policy, technology, and individual actions?"
4. "What are the specific environmental impacts of climate change on different ecosystems, including coral reefs, forests, and polar regions, and how can conservation efforts and sustainable p
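
In the outputs above, the model returns several numbered options with commentary, while `rewrite_query` is documented as returning a single rewritten query. One possible way to narrow this gap, sketched here under assumptions, is to tighten the prompt and add a small clean-up step. The template wording and the helper `clean_single_query` are hypothetical and have not been tested against the model used in this notebook.

```python
# Hypothetical stricter template: asks for exactly one query and no commentary.
strict_rewrite_template = """You are an AI assistant tasked with reformulating user queries to improve retrieval in a RAG system.
Rewrite the original query to be more specific and detailed.
Return exactly one rewritten query on a single line, with no numbering, quotes, or explanation.

Original query: {original_query}

Rewritten query:"""


def clean_single_query(llm_output: str) -> str:
    """Keep the first non-empty line and drop surrounding quotes, as a safety net."""
    for line in llm_output.splitlines():
        line = line.strip()
        if line:
            return line.strip('"').strip()
    return ""


# str.format stands in for PromptTemplate here so the sketch runs without LangChain.
prompt_text = strict_rewrite_template.format(
    original_query="Explain the theory of relativity."
)
print(prompt_text)
print(clean_single_query('\n"How do special and general relativity differ?"\nExtra commentary.'))
```

The same template string could be passed to `PromptTemplate` in place of `query_rewrite_template`, since it uses the same `{original_query}` variable.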

### 2 - Step-back Prompting: Generating broader queries for better context retrieval.



In [None]:
# Plain (non-f) string: PromptTemplate fills in {original_query} when the chain is invoked
step_back_template = """
    You are an AI assistant tasked with generating broader, more general queries to improve context retrieval in a RAG (Retrieval-Augmented Generation) system.
    Given the original query, generate a step-back query that is more general and can help retrieve relevant background information.
    Original query: {original_query}
    Step-back query:
"""

step_back_prompt = PromptTemplate(
    input_variables=["original_query"],
    template=step_back_template
)

step_back_chain = step_back_prompt | llm

def generate_step_back_query(original_query):
    """
    Generate a step-back query to retrieve broader context.
    Args:
    original_query (str): The original user query
    Returns:
    str: The step-back query
    """
    response = step_back_chain.invoke({"original_query": original_query})
    return response.content

### Demonstrate on a use case

In [21]:
# example query over the understanding climate change dataset
original_query = "What are the impacts of climate change on the environment?"
step_back_query = generate_step_back_query(original_query)
print("Original query:", original_query)
print(step_back_query)

Original query: What are the impacts of climate change on the environment?
To generate a step-back query, I'll try to identify the core concepts and themes in the original query and then ask a more general question that can help retrieve relevant background information.

Step-back query: What are the effects of environmental degradation on ecosystems and human societies?

This step-back query is more general because it:

1. Removes the specific topic of climate change, focusing on a broader concept of environmental degradation.
2. Emphasizes the impact on ecosystems, which can help retrieve information on the natural world and the interconnectedness of species.
3. Includes human societies, which can lead to information on the social and economic implications of environmental degradation.

By asking this more general question, the RAG system can retrieve a wider range of relevant information, including background knowledge on environmental science, ecology, and human impact on the envir
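
The output above wraps the step-back query in explanatory text, so passing it to a retriever as-is would also embed the commentary. A minimal sketch of one way to pull out just the query follows, assuming the model keeps the `Step-back query:` label from the prompt. The helper `extract_labeled_query` is hypothetical, and the sample text is abbreviated from the output above.

```python
import re


def extract_labeled_query(llm_output: str, label: str = "Step-back query:") -> str:
    """Return the text following `label` on its line, or the whole output if the label is absent."""
    pattern = re.compile(
        r"^[ \t]*" + re.escape(label) + r"[ \t]*(.+)$",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(llm_output)
    return match.group(1).strip() if match else llm_output.strip()


sample_output = (
    "To generate a step-back query, I'll try to identify the core concepts.\n"
    "\n"
    "Step-back query: What are the effects of environmental degradation "
    "on ecosystems and human societies?\n"
    "\n"
    "This step-back query is more general because it:"
)
print(extract_labeled_query(sample_output))
```

Falling back to the full output when the label is missing keeps the helper safe to use with models that already answer with a bare query.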

### 3 - Sub-query Decomposition: Breaking complex queries into simpler sub-queries.

In [None]:
# Plain (non-f) string: PromptTemplate fills in {original_query} when the chain is invoked
subquery_decomposition_template = """
    You are an AI assistant tasked with breaking down complex queries into simpler sub-queries for a RAG system.
    Given the original query, decompose it into 2-4 simpler sub-queries that, when answered together, would provide a comprehensive response to the original query.

    Original query: {original_query}

    example: What are the impacts of climate change on the environment?

    Sub-queries:
    1. What are the impacts of climate change on biodiversity?
    2. How does climate change affect the oceans?
    3. What are the effects of climate change on agriculture?
    4. What are the impacts of climate change on human health?
"""

subquery_decomposition_prompt = PromptTemplate(
    input_variables=["original_query"],
    template=subquery_decomposition_template
)

subquery_decomposer_chain = subquery_decomposition_prompt | llm

def decompose_query(original_query: str):
    """
    Decompose the original query into simpler sub-queries.
    
    Args:
    original_query (str): The original complex query
    
    Returns:
    List[str]: A list of simpler sub-queries
    """
    response = subquery_decomposer_chain.invoke({"original_query":original_query}).content
    sub_queries = [q.strip() for q in response.split('\n') if q.strip() and not q.strip().startswith('Sub-queries:')]
    return sub_queries
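
The line filter in `decompose_query` keeps every non-empty line, so introductory sentences and trailing commentary from the model end up in the returned list. A stricter alternative, sketched here as an assumption rather than the notebook's method, keeps only numbered lines and caps the count. The helper `parse_sub_queries` is hypothetical, and the sample text is abbreviated from this notebook's own output.

```python
import re


def parse_sub_queries(llm_output: str, max_queries: int = 4) -> list[str]:
    """Keep only lines shaped like '1. ...' or '1) ...', without the numbering."""
    sub_queries: list[str] = []
    for line in llm_output.splitlines():
        match = re.match(r"^\s*\d+[.)]\s+(.+?)\s*$", line)
        if match:
            sub_queries.append(match.group(1))
        if len(sub_queries) == max_queries:
            break
    return sub_queries


sample_response = """Here's a breakdown of the original query into 4 simpler sub-queries:
1. What are the main causes of climate change?
2. What are the impacts of climate change on biodiversity?
3. How does climate change affect the oceans and water resources?
4. What are the effects of climate change on human health and well-being?
These sub-queries can be answered together to provide a comprehensive response.
"""
for sub_query in parse_sub_queries(sample_response):
    print(sub_query)
```

Stopping at `max_queries` matches the prompt's request for 2-4 sub-queries and avoids picking up numbered lists that appear later in a long response.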

### Demonstrate on a use case

In [26]:
# example query over the understanding climate change dataset
original_query = "What are the impacts of climate change on the environment?"
sub_queries = decompose_query(original_query)
print("Sub-queries:\n")
for sub_query in sub_queries:
    print(sub_query)

Sub-queries:

Here's a breakdown of the original query into 4 simpler sub-queries:
1. What are the main causes of climate change?
2. What are the impacts of climate change on biodiversity?
3. How does climate change affect the oceans and water resources?
4. What are the effects of climate change on human health and well-being?
These sub-queries can be answered together to provide a comprehensive response to the original query, "What are the impacts of climate change on the environment?"
Here's a possible response:
"Climate change has numerous impacts on the environment, including:
* Loss of biodiversity due to rising temperatures, changing precipitation patterns, and increased frequency of extreme weather events (sub-query 1).
* Rising sea levels, ocean acidification, and changes in ocean circulation patterns affecting marine ecosystems and water resources (sub-query 2).
* Impacts on agriculture, including changes in growing seasons, crop yields, and water availability, which can lead 
