<a href="https://colab.research.google.com/github/girijesh-ai/llamaIndex-projects/blob/main/SubQuestion_QueryEngine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Compare Documents

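This tutorial shows how to answer a complex query that spans several documents by breaking it into simpler sub-questions with `SubQuestionQueryEngine`. Each sub-question is routed to the query engine for the relevant document, and the individual answers are then combined into a single response.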

[Documentation](https://gpt-index.readthedocs.io/en/stable/examples/usecases/10k_sub_question.html)
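
To make the decomposition idea concrete before any LlamaIndex code, here is a minimal, library-free sketch of the pattern. It is not how `SubQuestionQueryEngine` is implemented: `decompose`, `make_stub_tool` and `run` are hypothetical helpers, the decomposition is hard-coded where the real engine asks an LLM, and the tools only echo the question instead of querying an index. The tool names `uber_10k` and `lyft_10k` match the ones defined later in this notebook.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SubQuestion:
    tool_name: str  # name of the tool that should answer
    question: str   # the narrower question sent to that tool


def decompose(companies: List[str], years: List[int]) -> List[SubQuestion]:
    """Stand-in for the LLM step: one sub-question per (company, year) pair."""
    return [
        SubQuestion(f"{company.lower()}_10k", f"What is the revenue of {company} in {year}?")
        for company in companies
        for year in years
    ]


def make_stub_tool(name: str) -> Callable[[str], str]:
    """Hypothetical tool that echoes the question instead of querying an index."""
    return lambda question: f"[{name}] stub answer to: {question}"


def run(
    sub_questions: List[SubQuestion], tools: Dict[str, Callable[[str], str]]
) -> List[Tuple[SubQuestion, str]]:
    """Route each sub-question to the tool it names and collect the answers."""
    return [(sq, tools[sq.tool_name](sq.question)) for sq in sub_questions]


tools = {name: make_stub_tool(name) for name in ("uber_10k", "lyft_10k")}
sub_questions = decompose(["Uber", "Lyft"], [2020, 2021])
answers = run(sub_questions, tools)

print(f"Generated {len(sub_questions)} sub questions.")
for _, answer in answers:
    print(answer)
```

The real engine adds one more step that this sketch leaves out: it passes the collected sub-answers back to an LLM to synthesize the final response.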

In [None]:
!pip install llama-index pypdf

Collecting llama-index
  Downloading llama_index-0.8.53.post3-py3-none-any.whl (794 kB)
Collecting pypdf
  Downloading pypdf-3.16.4-py3-none-any.whl (276 kB)
Collecting aiostream<0.6.0,>=0.5.2 (from llama-index)
  Downloading aiostream-0.5.2-py3-none-any.whl (39 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from llama-index)
  Downloading dataclasses_json-0.5.14-py3-none-any.whl (26 kB)
Collecting deprecated>=1.2.9.3 (from llama-index)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting langchain>=0.0.303 (from llama-index)
  Downloading langchain-0.0.325-py3-none-any.whl (1.9 MB)
Collecting openai>=0

In [None]:
# NOTE: This is only necessary in a Jupyter notebook.
# Jupyter runs an event loop behind the scenes, so starting another
# event loop to make async queries would nest one loop inside another.
# asyncio normally forbids that; nest_asyncio patches it to allow nesting.
import nest_asyncio

nest_asyncio.apply()
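
The restriction that `nest_asyncio` works around can be reproduced with the standard library alone. The sketch below is meant to be run as a plain Python script, not inside the notebook, and `inner` and `outer` are hypothetical names used only for illustration. It starts one event loop and then tries to start a second one from inside it.

```python
import asyncio


async def inner() -> int:
    return 42


async def outer() -> str:
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop inside the first
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return type(exc).__name__
    return "no error"


outcome = asyncio.run(outer())
print(outcome)
```

Without `nest_asyncio`, the nested `asyncio.run` call raises `RuntimeError`. In the notebook itself, the cells below sidestep the problem for user code by using top-level `await`, while `nest_asyncio.apply()` covers any nested loop started inside the library.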

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().handlers = []
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

import os

import openai

# Read the key from the environment instead of hardcoding a secret in the notebook.
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index import set_global_service_context
from llama_index.response.pprint_utils import pprint_response
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.llms import OpenAI
from IPython.display import display, HTML

# Download Documents

In [None]:
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O './uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/10k/lyft_2021.pdf' -O './lyft_2021.pdf'

--2023-10-28 07:00:59--  https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/10k/uber_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1880483 (1.8M) [application/octet-stream]
Saving to: ‘./uber_2021.pdf’


2023-10-28 07:00:59 (45.0 MB/s) - ‘./uber_2021.pdf’ saved [1880483/1880483]

--2023-10-28 07:00:59--  https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/10k/lyft_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1440303 (1.4M) [application/octet-stream]
Sa

# Load Uber and Lyft documents

In [None]:
lyft_docs = SimpleDirectoryReader(input_files=["lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["uber_2021.pdf"]).load_data()

In [None]:
print(f'Loaded Lyft 10-K with {len(lyft_docs)} pages')
print(f'Loaded Uber 10-K with {len(uber_docs)} pages')

Loaded Lyft 10-K with 238 pages
Loaded Uber 10-K with 307 pages


In [None]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Build indices

In [None]:
lyft_index = VectorStoreIndex.from_documents(lyft_docs, service_context=service_context)
uber_index = VectorStoreIndex.from_documents(uber_docs, service_context=service_context)

# Basic QA

In [None]:
lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)


In [None]:
uber_engine = uber_index.as_query_engine(similarity_top_k=3)


In [None]:
response = await lyft_engine.aquery('What is the revenue of Lyft in 2021? Answer in millions, with page reference')


In [None]:
# print the response
display(HTML(f'<p style="font-size:20px">{(response.response)}</p>'))

In [None]:
response = await uber_engine.aquery('What is the revenue of Uber in 2021? Answer in millions, with page reference')


In [None]:
# print the response
display(HTML(f'<p style="font-size:20px">{(response.response)}</p>'))

# Compare Uber and Lyft

In [None]:
query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(name='lyft_10k', description='Provides information about Lyft financials for year 2021')
    ),
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(name='uber_10k', description='Provides information about Uber financials for year 2021')
    ),
]

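# Pass the service context so sub-question generation and synthesis use the LLM configured above.
s_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context,
)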

In [None]:
response = await s_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')

Generated 4 sub questions.
[uber_10k] Q: What is the revenue of Uber in 2020?
[uber_10k] Q: What is the revenue of Uber in 2021?
[lyft_10k] Q: What is the revenue of Lyft in 2020?
[lyft_10k] Q: What is the revenue of Lyft in 2021?
[uber_10k] A: The revenue of Uber in 2020 was $11,139 million.
[lyft_10k] A: The revenue of Lyft in 2021 is $3,208,323,000.
[lyft_10k] A: The revenue of Lyft in 2020 was $2,364,681,000.
[uber_10k] A: The revenue of Uber in 2021 is $17,455 million.

In [None]:
# print the response
display(HTML(f'<p style="font-size:20px">{(response.response)}</p>'))
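
The sub-answers logged above are enough to sanity-check the synthesized comparison by hand. The sketch below is an independent check, not part of the query engine: `growth_pct` is a hypothetical helper, and the figures are copied from the log. The Lyft answers are stated in dollars and the Uber answers in millions, so both are normalized to millions of USD first.

```python
def growth_pct(previous: float, current: float) -> float:
    """Year-over-year growth as a percentage of the earlier value."""
    return (current - previous) / previous * 100.0


# Figures copied from the sub-answers in the log, in millions of USD.
uber_2020, uber_2021 = 11_139.0, 17_455.0
lyft_2020, lyft_2021 = 2_364.681, 3_208.323

uber_growth = growth_pct(uber_2020, uber_2021)
lyft_growth = growth_pct(lyft_2020, lyft_2021)

print(f"Uber revenue growth 2020 to 2021: {uber_growth:.1f}%")  # 56.7%
print(f"Lyft revenue growth 2020 to 2021: {lyft_growth:.1f}%")  # 35.7%
```

If the synthesized response reports growth rates far from these values, the likely cause is a retrieval miss or a unit mix-up in one of the sub-answers.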