
<a href="https://colab.research.google.com/github/edumunozsala/llamaindex-RAG-techniques/blob/main/multi-sub-queries-engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sub Question Query Engine for RAG

Query transformations are a family of techniques using an LLM as a reasoning engine to modify user input in order to improve retrieval quality.

In this notebook, we showcase how to use a sub question query engine to tackle the problem of answering a complex query using multiple data sources.
It first breaks down the complex query into sub questions for each relevant data source, then gathers all the intermediate responses and synthesizes a final response.

Original code from LlamaIndex: https://docs.llamaindex.ai/en/stable/examples/query_engine/sub_question_query_engine.html
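
To make the decompose, answer, synthesize flow concrete before we touch the library, here is a minimal plain-Python sketch. It is not LlamaIndex's implementation: `SubQuestion`, `answer_complex_query` and the three `fake_*` helpers are hypothetical names, and the helpers are stand-ins for the LLM calls and the query engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubQuestion:
    tool_name: str  # which data source should answer this question
    question: str


def answer_complex_query(
    query: str,
    decompose: Callable[[str], List[SubQuestion]],
    tools: Dict[str, Callable[[str], str]],
    synthesize: Callable[[str, List[str]], str],
) -> str:
    # 1. Break the complex query into sub questions, one per relevant source.
    sub_questions = decompose(query)
    # 2. Gather the intermediate responses from the matching data source.
    intermediate = [
        f"Q: {sq.question}\nA: {tools[sq.tool_name](sq.question)}"
        for sq in sub_questions
    ]
    # 3. Synthesize a final response from the intermediate ones.
    return synthesize(query, intermediate)


# Hypothetical stand-ins: a real engine would call an LLM in each of these.
def fake_decompose(query: str) -> List[SubQuestion]:
    return [
        SubQuestion("paper", "What is a transformer?"),
        SubQuestion("paper", "What is a convolutional neural network?"),
    ]


def fake_paper_engine(question: str) -> str:
    return f"(answer to '{question}')"


def fake_synthesize(query: str, intermediate: List[str]) -> str:
    return query + "\n" + "\n".join(intermediate)


final = answer_complex_query(
    "How are transformers related to convolutional neural networks?",
    decompose=fake_decompose,
    tools={"paper": fake_paper_engine},
    synthesize=fake_synthesize,
)
print(final)
```

The `tools` dictionary plays the role of the list of data sources: each sub question names the source that should answer it.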

In [1]:
import os
import openai

from llama_index import VectorStoreIndex
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.callbacks import CallbackManager, LlamaDebugHandler
from llama_index import ServiceContext

Load the API Keys:

In [2]:
from dotenv import load_dotenv

# Load the environment variables
load_dotenv()

True

In [3]:
# NOTE: This is ONLY necessary in jupyter notebook.
# Details: Jupyter runs an event-loop behind the scenes.
#          This results in nested event-loops when we start an event-loop to make async queries.
#          This is normally not allowed, we use nest_asyncio to allow it for convenience.
import nest_asyncio

nest_asyncio.apply()
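
The nested event-loop restriction mentioned in the comments above can be reproduced with the standard library alone. This is a small sketch meant to be run as a standalone script, outside Jupyter and without `nest_asyncio` applied; `inner` and `outer` are hypothetical names.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, which is the
    # situation a notebook cell is in.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested event loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without patching, the inner `asyncio.run` call is rejected with a `RuntimeError`. `nest_asyncio.apply()` patches asyncio so that this kind of nested call is allowed.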

## Setup


If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

### Load the data

In [4]:
from pathlib import Path
from llama_index import download_loader

PDFReader = download_loader("PDFReader")

loader = PDFReader()
documents = loader.load_data(file=Path('./data/Attention is all you need.pdf'))

First, we set up a DebugHandler to keep track of the sub questions:

In [5]:
llama_debug = LlamaDebugHandler(print_trace_on_end=True)

callback_manager = CallbackManager([llama_debug])
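
As a rough mental model of what a debug handler does, the sketch below records start and end events and pairs them up by id. It is a simplified, hypothetical recorder (`ToyEvent` and `ToyDebugHandler` are made-up names), not the actual `LlamaDebugHandler` code.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ToyEvent:
    event_type: str
    event_id: str
    phase: str  # "start" or "end"
    payload: Optional[Dict[str, Any]] = None


class ToyDebugHandler:
    """Hypothetical, simplified event recorder for illustration only."""

    def __init__(self) -> None:
        self.events: List[ToyEvent] = []

    def on_event_start(self, event_type: str, event_id: str, payload=None) -> None:
        self.events.append(ToyEvent(event_type, event_id, "start", payload))

    def on_event_end(self, event_type: str, event_id: str, payload=None) -> None:
        self.events.append(ToyEvent(event_type, event_id, "end", payload))

    def get_event_pairs(self, event_type: str) -> List[Tuple[ToyEvent, ToyEvent]]:
        # Match each end event with the start event that shares its id.
        open_events: Dict[str, ToyEvent] = {}
        pairs: List[Tuple[ToyEvent, ToyEvent]] = []
        for event in self.events:
            if event.event_type != event_type:
                continue
            if event.phase == "start":
                open_events[event.event_id] = event
            elif event.event_id in open_events:
                pairs.append((open_events.pop(event.event_id), event))
        return pairs


handler = ToyDebugHandler()
handler.on_event_start("sub_question", "q0", {"question": "What is a transformer?"})
handler.on_event_start("sub_question", "q1", {"question": "What is a CNN?"})
handler.on_event_end("sub_question", "q1", {"answer": "A network built from convolutions."})
handler.on_event_end("sub_question", "q0", {"answer": "A model built on self-attention."})

pairs = handler.get_event_pairs("sub_question")
for start, end in pairs:
    print(start.payload["question"], "->", end.payload["answer"])
```

Pairing by id matters because events of the same type can overlap: in this example `q1` finishes before `q0`.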

Next, we set up a vector index over the paper.

In [6]:
# Service context with the debug callback manager and a chunk size of 512
service_context = ServiceContext.from_defaults(
    callback_manager=callback_manager, chunk_size=512
)

#service_context = ServiceContext.from_defaults(chunk_size=512)

# build index and query engine
vector_query_engine = VectorStoreIndex.from_documents(
    documents, use_async=True, service_context=service_context
).as_query_engine()

**********
Trace: index_construction
    |_CBEventType.NODE_PARSING ->  0.054173 seconds
      |_CBEventType.CHUNKING ->  0.002804 seconds
      |_CBEventType.CHUNKING ->  0.003553 seconds
      |_CBEventType.CHUNKING ->  0.0 seconds
      |_CBEventType.CHUNKING ->  0.0 seconds
      |_CBEventType.CHUNKING ->  0.013239 seconds
      |_CBEventType.CHUNKING ->  0.0 seconds
      |_CBEventType.CHUNKING ->  0.0 seconds
      |_CBEventType.CHUNKING ->  0.0 seconds
      |_CBEventType.CHUNKING ->  0.005479 seconds
      |_CBEventType.CHUNKING ->  0.0 seconds
      |_CBEventType.CHUNKING ->  0.012442 seconds
      |_CBEventType.CHUNKING ->  0.0 seconds
      |_CBEventType.CHUNKING ->  0.0 seconds
      |_CBEventType.CHUNKING ->  0.0 seconds
      |_CBEventType.CHUNKING ->  0.0 seconds
    |_CBEventType.EMBEDDING ->  1.215931 seconds
    |_CBEventType.EMBEDDING ->  1.215931 seconds
    |_CBEventType.EMBEDDING ->  1.215931 seconds
    |_CBEventType.EMBEDDING ->  1.215931 seconds
**********
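
The trace above shows the two stages of index construction: chunking and embedding. The toy sketch below mimics both stages and adds retrieval, under loud assumptions: chunks are split by word count rather than tokens, and the "embedding" is a plain word-count vector instead of a learned model. `ToyVectorIndex`, `chunk_words`, `embed` and `cosine` are hypothetical names, not LlamaIndex APIs.

```python
import math
from collections import Counter
from typing import List, Tuple


def chunk_words(text: str, chunk_size: int) -> List[str]:
    # Split the text into consecutive chunks of at most `chunk_size` words.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    # Toy embedding: lower-cased word counts (a real index uses a learned model).
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    def __init__(self, text: str, chunk_size: int) -> None:
        self.chunks = chunk_words(text, chunk_size)
        self.vectors = [embed(chunk) for chunk in self.chunks]

    def retrieve(self, query: str, top_k: int = 1) -> List[Tuple[str, float]]:
        # Score every chunk against the query and keep the best `top_k`.
        query_vec = embed(query)
        scored = [(chunk, cosine(query_vec, vec)) for chunk, vec in zip(self.chunks, self.vectors)]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


text = (
    "The Transformer relies entirely on self-attention. "
    "Convolutional networks apply filters to images. "
    "Recurrent networks process tokens one at a time."
)
index = ToyVectorIndex(text, chunk_size=6)
top = index.retrieve("convolutional filters", top_k=1)
print(top[0][0])
```

A smaller chunk size gives more, shorter chunks, which makes retrieval more precise but gives each retrieved chunk less surrounding context.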


## Setup sub question query engine



Now, we create our sub question query engine based on a newly defined tool:

In [7]:
# setup base query engine as tool
query_engine_tools = [
    QueryEngineTool(
        query_engine=vector_query_engine,
        metadata=ToolMetadata(
            name="Attention_paper",
            description="Attention Is All You Need paper",
        ),
    ),
]

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context,
    use_async=True,
)
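
With `use_async=True`, the sub questions can be answered concurrently rather than one after another. The sketch below illustrates that idea with plain `asyncio`. It is an assumption about the general pattern, not the engine's actual code; `answer_sub_question` and `answer_all` are hypothetical names, and the `sleep` call stands in for a network request to the LLM.

```python
import asyncio
import time
from typing import List


async def answer_sub_question(question: str) -> str:
    # Stand-in for an LLM or retrieval call that spends its time waiting on I/O.
    await asyncio.sleep(0.1)
    return f"(answer to '{question}')"


async def answer_all(questions: List[str]) -> List[str]:
    # Schedule all sub questions at once; results come back in input order.
    return await asyncio.gather(*(answer_sub_question(q) for q in questions))


questions = [
    "What is a transformer?",
    "What is a convolutional neural network?",
    "How do transformers work?",
]

start = time.perf_counter()
answers = asyncio.run(answer_all(questions))
elapsed = time.perf_counter() - start

for answer in answers:
    print(answer)
print(f"elapsed: {elapsed:.2f}s")
```

Because the three waits overlap, the total time should be close to one call rather than three. Inside this notebook, the top-level `asyncio.run` only works because `nest_asyncio.apply()` was called earlier.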

## Use the Sub Query Engine!

Now, we can invoke the sub query engine to synthesize natural language responses.

In [8]:
response = query_engine.query("How are transformers related to convolutional neural networks?")

Generated 6 sub questions.
[Attention_paper] Q: What is a transformer?
[Attention_paper] Q: What is a convolutional neural network?
[Attention_paper] Q: How do transformers work?
[Attention_paper] Q: How do convolutional neural networks work?
[Attention_paper] Q: What are the similarities between transformers and convolutional neural networks?
[Attention_paper] Q: What are the differences between transformers and convolutional neural networks?
[Attention_paper] A: A convolutional neural network (CNN) is a type of neural network that is commonly used for image recognition and computer vision tasks. It is designed to automatically learn and extract features from input images through a series of convolutional layers. These layers apply filters to the input image, which helps to identify patterns and features at d

In [9]:
from llama_index.response.notebook_utils import display_response

display_response(response)

**`Final Response:`** Transformers and convolutional neural networks (CNNs) are both types of neural networks used in different domains. While CNNs are commonly used for image recognition and computer vision tasks, transformers are often used in natural language processing and sequence-to-sequence tasks.

Both transformers and CNNs aim to compute hidden representations in parallel for all input and output positions. However, there are differences in the way they handle dependencies between distant positions. Transformers reduce the number of operations required to relate signals from two arbitrary input or output positions to a constant number, while CNNs have a linear or logarithmic growth in the number of operations with the distance between positions. This difference in handling dependencies makes it more difficult for CNNs to learn dependencies between distant positions compared to transformers.

In terms of architecture, transformers use stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, while CNNs use convolutional layers as their basic building block.

Overall, while transformers and CNNs have some similarities in terms of parallel computation, they have different approaches to computing representations and handling dependencies between positions. Transformers rely on self-attention, while CNNs use convolutional layers.

#### Iterate over the sub queries

In [10]:
# iterate through sub_question items captured in SUB_QUESTION event
from llama_index.callbacks.schema import CBEventType, EventPayload

for i, (start_event, end_event) in enumerate(
    llama_debug.get_event_pairs(CBEventType.SUB_QUESTION)
):
    qa_pair = end_event.payload[EventPayload.SUB_QUESTION]
    print("Sub Question " + str(i) + ": " + qa_pair.sub_q.sub_question.strip())
    print("Answer: " + qa_pair.answer.strip())
    print("====================================")

Sub Question 0: What is a transformer?
Answer: The Transformer is a transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned recurrent neural networks (RNNs) or convolution. It consists of an encoder-decoder structure, where the encoder maps an input sequence of symbol representations to a sequence of continuous representations, and the decoder generates an output sequence of symbols based on the encoded representations. The Transformer utilizes stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.
Sub Question 1: What is a convolutional neural network?
Answer: A convolutional neural network (CNN) is a type of neural network that is commonly used for image recognition and computer vision tasks. It is designed to automatically learn and extract features from input images through a series of convolutional layers. These layers apply filters to the input image, whic