# SubQuestionQueryEngine

Queries often span multiple documents. In this notebook, we address such complex queries by breaking them down into simpler sub-questions, answering each against the relevant document, and combining the results with the `SubQuestionQueryEngine`.

### Installation

In [None]:
!pip install llama-index
!pip install llama-index-llms-anthropic
!pip install llama-index-embeddings-huggingface

### Setup API Key

In [None]:
import os
os.environ['ANTHROPIC_API_KEY'] = 'YOUR ANTHROPIC API KEY'
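
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the key may already be set in the environment, the hypothetical helper `load_api_key` below (not part of llama-index) reads it from there and only prompts when it is missing.

```python
import os
from getpass import getpass


def load_api_key(var_name: str = "ANTHROPIC_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        # getpass hides the typed value so it does not appear in the cell output.
        key = getpass(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `load_api_key()` before constructing the LLM would leave `os.environ['ANTHROPIC_API_KEY']` populated without the key ever being written into the notebook.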

### Setup LLM and Embedding model

In [None]:
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [None]:
llm = Anthropic(temperature=0.0, model='claude-2.1')
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")


In [None]:
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512
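
Roughly speaking, `chunk_size` caps how much text goes into each chunk that is embedded and retrieved. The sketch below illustrates the idea only: it counts whitespace-separated words, whereas the library uses its own tokenizer and splitter, and the function name `chunk_tokens` and its `overlap` parameter are assumptions made for this example.

```python
def chunk_tokens(text: str, chunk_size: int = 512, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

Smaller chunks tend to give more precise retrieval hits, while larger ones carry more surrounding context per hit.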

### Setup logging

In [None]:
# NOTE: This is ONLY necessary in a Jupyter notebook.
# Details: Jupyter runs an event loop behind the scenes.
#          This results in nested event loops when we start an event loop to make async queries.
#          This is normally not allowed, so we use nest_asyncio to allow it for convenience.
import nest_asyncio

nest_asyncio.apply()

import logging
import sys

# Set up the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # Set logger level to INFO

# Clear out any existing handlers
logger.handlers = []

# Set up the StreamHandler to output to sys.stdout (Colab's output)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)  # Set handler level to INFO

# Add the handler to the logger
logger.addHandler(handler)

from IPython.display import display, HTML
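
The nested event-loop restriction described in the comments above can be reproduced with the standard library alone. This sketch is meant to be run as a plain Python script, outside Jupyter and without `nest_asyncio.apply()`; `fetch_answer` is a hypothetical stand-in for an async query, not a llama-index call.

```python
import asyncio


async def fetch_answer() -> str:
    """Stand-in for an async query (hypothetical)."""
    await asyncio.sleep(0)
    return "answer"


async def nested_call() -> str:
    """Try to start a second event loop while one is already running."""
    coro = fetch_answer()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


def demo() -> str:
    # The outer asyncio.run starts the loop; the inner one is the nested call.
    return asyncio.run(nested_call())
```

In a plain script, `demo()` returns a string starting with `RuntimeError`, because `asyncio.run` refuses to start inside a running loop. `nest_asyncio` patches asyncio so that this kind of nested call is permitted.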

### Download Data

We will use the 2021 10-K SEC filings of Uber and Lyft.

In [None]:
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O './uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/lyft_2021.pdf' -O './lyft_2021.pdf'

--2024-02-29 13:12:32--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1880483 (1.8M) [application/octet-stream]
Saving to: ‘./uber_2021.pdf’


2024-02-29 13:12:33 (33.2 MB/s) - ‘./uber_2021.pdf’ saved [1880483/1880483]

--2024-02-29 13:12:33--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/lyft_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1440303 (1.4M) [application/octet-stream]
Sa

### Load Data

In [None]:
from llama_index.core import SimpleDirectoryReader
lyft_docs = SimpleDirectoryReader(input_files=["lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["uber_2021.pdf"]).load_data()

In [None]:
print(f'Loaded Lyft 10-K with {len(lyft_docs)} pages')
print(f'Loaded Uber 10-K with {len(uber_docs)} pages')

Loaded Lyft 10-K with 238 pages
Loaded Uber 10-K with 307 pages


### Index Data

In [None]:
from llama_index.core import VectorStoreIndex
# Note: only the first 100 pages of each filing are indexed here.
lyft_index = VectorStoreIndex.from_documents(lyft_docs[:100])
uber_index = VectorStoreIndex.from_documents(uber_docs[:100])

### Create Query Engines

In [None]:
lyft_engine = lyft_index.as_query_engine(similarity_top_k=5)

In [None]:
uber_engine = uber_index.as_query_engine(similarity_top_k=5)
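
`similarity_top_k=5` asks each engine to retrieve the five chunks whose embeddings are most similar to the query embedding. The sketch below shows the idea with cosine similarity over plain Python lists. It is a conceptual illustration, not the library's retriever, and the names `cosine_similarity` and `top_k` are assumptions made for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Return the k chunk ids with the highest similarity to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

A larger `k` gives the LLM more context to answer from, at the cost of a longer prompt.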


### Querying

In [None]:
response = await lyft_engine.aquery('What is the revenue of Lyft in 2021? Answer in millions with page reference')
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))

HTTP Request: POST https://api.anthropic.com/v1/complete "HTTP/1.1 200 OK"


In [None]:
response = await uber_engine.aquery('What is the revenue of Uber in 2021? Answer in millions, with page reference')
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))

HTTP Request: POST https://api.anthropic.com/v1/complete "HTTP/1.1 200 OK"
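
The cells above interpolate the response text directly into an HTML string, so characters such as `<` or `&` in an answer could be interpreted as markup. A minimal sketch of a safer variant, using the standard library's `html.escape`, follows; `render_answer` is a hypothetical helper name.

```python
import html


def render_answer(text: str, font_px: int = 20) -> str:
    """Build the HTML snippet, escaping characters such as < and &."""
    return f'<p style="font-size:{font_px}px">{html.escape(text)}</p>'
```

It would be used as `display(HTML(render_answer(response.response)))`.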


### Create Tools

In [None]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(name='lyft_10k', description='Provides information about Lyft financials for year 2021')
    ),
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(name='uber_10k', description='Provides information about Uber financials for year 2021')
    ),
]

### Create `SubQuestionQueryEngine`

In [None]:
sub_question_query_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
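
Conceptually, the engine generates sub-questions, sends each one to the tool it names, and then merges the answers. The sketch below mimics that flow with stub tools and hard-coded sub-questions. It is an illustration under those assumptions, not the library's implementation; all function names and the canned answers are hypothetical.

```python
from typing import Callable

# Hypothetical stand-in for a query engine: maps a question to an answer.
Tool = Callable[[str], str]


def answer_sub_questions(
    sub_questions: list[tuple[str, str]],
    tools: dict[str, Tool],
) -> list[str]:
    """Send each (tool_name, question) pair to its tool and label the answer."""
    answers = []
    for tool_name, question in sub_questions:
        if tool_name not in tools:
            raise KeyError(f"Unknown tool: {tool_name}")
        answers.append(f"[{tool_name}] {question} -> {tools[tool_name](question)}")
    return answers


def synthesize(answers: list[str]) -> str:
    """Placeholder for the final LLM call that merges the sub-answers."""
    return "\n".join(answers)


# Toy tools returning canned strings instead of querying an index.
toy_tools = {
    "lyft_10k": lambda q: "stub Lyft answer",
    "uber_10k": lambda q: "stub Uber answer",
}
# In the real engine an LLM generates these pairs; here they are hard-coded.
toy_sub_questions = [
    ("uber_10k", "What was Uber's revenue in 2021?"),
    ("lyft_10k", "What was Lyft's revenue in 2021?"),
]
combined = synthesize(answer_sub_questions(toy_sub_questions, toy_tools))
```

The tool names and descriptions in `ToolMetadata` are what allow the LLM to decide which tool each sub-question should go to, so descriptive wording there matters.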

### Querying

In [None]:
response = await sub_question_query_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))

HTTP Request: POST https://api.anthropic.com/v1/complete "HTTP/1.1 200 OK"
Generated 4 sub questions.
[uber_10k] Q: What was Uber's revenue in 2020?
[uber_10k] Q: What was Uber's revenue in 2021?
[lyft_10k] Q: What was Lyft's revenue in 2020?
[lyft_10k] Q: What was Lyft's revenue in 2021?
HTTP Request: POST https://api.anthropic.com/v1/complete "HTTP/1.1 200 OK"
[uber_10k] A:  Unfortunately, I should not directly reference the given context in my answer. However, based on the information provided, Uber's revenue in 2020 was $11.139 billion.
HTTP Request: POST https://api.anthropic.com/v1/complete "HTTP/1.1 200 OK"
[lyft_10k] A:  Unfortunately, I cannot directly reference the given context in my answer. However, the information provided shows that Lyft's total revenue in 2021 was $3,208,323,000.
HTTP Request: POST https://api.anthr