In [1]:
! python --version

Python 3.12.2


## 10K Analysis using Sub Question Retrieval

In this demo, we explore answering complex queries by decomposing them into simpler sub-queries. 

A 10-K form is an annual report filed by publicly traded companies to the U.S. Securities and Exchange Commission (SEC). It provides a comprehensive overview of the company's business and financial condition.
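The decomposition idea can be illustrated without any LLM. The sketch below is a toy, rule-based stand-in: `decompose`, `answer_all`, and the lambda "tools" are hypothetical names invented for illustration, whereas in this notebook an LLM writes the sub-questions and per-company query engines answer them.

```python
from itertools import product


def decompose(entities, aspects, year):
    """Build one simple sub-question per (entity, aspect) pair.

    Hypothetical rule-based stand-in for the LLM that writes sub-questions.
    """
    return [
        (entity, f"What were {entity}'s major {aspect} in {year}?")
        for entity, aspect in product(entities, aspects)
    ]


def answer_all(sub_questions, tools):
    """Route each sub-question to the tool (a plain callable) for its entity."""
    return [(question, tools[entity](question)) for entity, question in sub_questions]


# Hypothetical tools: placeholders for real per-company query engines.
tools = {
    "Uber": lambda q: f"<answer from the Uber filing to: {q}>",
    "Lyft": lambda q: f"<answer from the Lyft filing to: {q}>",
}

sub_questions = decompose(["Uber", "Lyft"], ["assets", "liabilities"], 2021)
for question, answer in answer_all(sub_questions, tools):
    print(question, "->", answer)
```

A final synthesis step would then combine the individual answers into one response to the original question.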

## Install the required packages
- `%%capture` is used to suppress the output of the installation commands.

In [2]:
%%capture
%pip install llama-index-readers-file pymupdf
%pip install llama-index-vector-stores-postgres
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-bedrock
%pip install llama-index-llms-lmstudio
%pip install llama-index-embeddings-bedrock
%pip install psycopg2-binary
%pip install ipywidgets
%pip install SQLAlchemy
%pip install python-dotenv

In [3]:
import nest_asyncio

nest_asyncio.apply()

## Import the required libraries
- The `load_dotenv` function loads environment variables from the `.env` file. It is needed here to reach a more capable generator model on Bedrock, whose AWS credentials are read from the environment.
- This notebook is based on this [notebook from LlamaIndex - Sub Question querying](https://docs.llamaindex.ai/en/stable/examples/usecases/10k_sub_question/).
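Since the remote setup later reads several AWS variables from the environment, it may help to check for them right after `load_dotenv`. This is a minimal sketch: `missing_env_vars` is a hypothetical helper, and the variable names are the ones this notebook reads in `mode="remote"`.

```python
import os

# Names read later in this notebook when mode="remote"
REQUIRED_AWS_VARS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_DEFAULT_REGION",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(REQUIRED_AWS_VARS)
if missing:
    print(f"Missing from the environment or .env: {', '.join(missing)}")
```

Failing early with a readable message is usually easier to debug than a `KeyError` raised in the middle of model setup.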

In [4]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine
from dotenv import load_dotenv
load_dotenv(verbose=True, dotenv_path=".env")

True

## Downloading the dataset
- To download the dataset, uncomment the commands below. They are commented out because the files are large.

In [5]:
# ! mkdir -p 'data/10-K'
# ! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10-K/uber_2021.pdf'
# ! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10-K/lyft_2021.pdf'

## Setup the Retriever and Generator models
- Pass `mode="local"` to the `setup_models` function to use local LM Studio models.
- Pass `mode="remote"` to the `setup_models` function to use AWS Bedrock.

In [6]:
from llama_index.core import Settings
from llama_index.llms.bedrock import Bedrock
from llama_index.llms.lmstudio import LMStudio
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import os

def setup_models(mode="local"):
    if mode == "local":
        # Setup Retriever model
        embedding_model = "BAAI/bge-base-en-v1.5"
        print(f"Setting up local Retriever model (embedding: {embedding_model})...")
        Settings.embed_model = HuggingFaceEmbedding(model_name=embedding_model)
        Settings.chunk_size = 1024
        Settings.chunk_overlap = 20
        
        # Setup Generator model
        llm_model = "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF"
        print(f"Setting up local Generator model (main LLM: {llm_model})...")
        Settings.llm = LMStudio(
            model_name=llm_model,
            base_url="http://localhost:1234/v1",
            temperature=0,
            request_timeout=120,
        )
    elif mode == "remote":
        # Setup Retriever model
        embedding_model = "cohere.embed-multilingual-v3"
        print(f"Setting up remote Retriever model (embedding: {embedding_model})...")
        Settings.embed_model = BedrockEmbedding(
            model_name=embedding_model,
            region_name=os.environ["AWS_DEFAULT_REGION"],
        )
        Settings.chunk_size = 1024
        Settings.chunk_overlap = 20
                
        # Setup Generator model
        llm_model = "anthropic.claude-3-sonnet-20240229-v1:0"
        print(f"Setting up remote Generator model (main LLM: {llm_model})...")
        Settings.llm = Bedrock(
            model=llm_model,
            aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
            aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
            aws_session_token=os.environ["AWS_SESSION_TOKEN"],
            region_name=os.environ["AWS_DEFAULT_REGION"],
            request_timeout=120,
        )

    else:
        raise ValueError(f"Unknown mode: {mode}")
    
setup_models(mode="remote")

text_embedding = Settings.embed_model.get_text_embedding("Once upon a time, there was a cat.")
print(text_embedding[:5])
print(f"Embedding length: {len(text_embedding)}")
vector_size = len(text_embedding)


Setting up remote Retriever model (embedding: cohere.embed-multilingual-v3)...
Setting up remote Generator model (main LLM: anthropic.claude-3-sonnet-20240229-v1:0)...
[-0.043518066, -0.010955811, -0.00032567978, 0.0057792664, -0.016540527]
Embedding length: 1024


## Setup the PgVector database in PostgreSQL
- The code below drops and recreates the database on every run so that we always start from scratch. Do not do this in production.
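Because the database name is interpolated directly into the SQL text, one cautious option is to validate it before building the statements. The sketch below is only an illustration: `reset_statements` is a hypothetical helper, and it assumes PostgreSQL 13 or later, where `WITH (FORCE)` is available. `IF EXISTS` keeps the very first run from failing when the database does not exist yet.

```python
import re

# Conservative pattern for an unquoted SQL identifier
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def reset_statements(db_name):
    """Return the DROP/CREATE statements for `db_name`, rejecting unsafe names."""
    if not _IDENTIFIER.fullmatch(db_name):
        raise ValueError(f"Unsafe database name: {db_name!r}")
    return [
        f"DROP DATABASE IF EXISTS {db_name} WITH (FORCE);",
        f"CREATE DATABASE {db_name};",
    ]


for statement in reset_statements("vector_store"):
    print(statement)
```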

In [7]:
import psycopg2

# Defined outside the try block so later cells can always use them
pg_pw = "mysecretpassword"
pg_db = "vector_store"
connection_string = f"postgresql://postgres:{pg_pw}@localhost:5432"
db_name = pg_db

try:
    conn = psycopg2.connect(connection_string)
    # DROP/CREATE DATABASE cannot run inside a transaction block
    conn.autocommit = True

    with conn.cursor() as c:
        # IF EXISTS keeps the first run from failing when the database is absent
        c.execute(f"DROP DATABASE IF EXISTS {db_name} WITH (FORCE);")
        c.execute(f"CREATE DATABASE {db_name};")

    conn.close()

except Exception as e:
    print(e)
    

In [8]:
from IPython.display import Markdown
from llama_index.core import SimpleDirectoryReader
from sqlalchemy import make_url
from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext
from llama_index.vector_stores.postgres import PGVectorStore

def simple_RAG(vector_size):
    """
    Simple Retrieval Augmented Generation (RAG) using Llama Index.
    """
    BASE_DIR = "./data/10-K"

    url = make_url(connection_string)
    print(f"Url {url}")
    
    vector_store = PGVectorStore.from_params(
        database=db_name,
        host=url.host,
        password=url.password,
        port=url.port,
        user=url.username,
        table_name="basic_rag",
        embed_dim=vector_size
    )

    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    
    nodes = ingest_directory(BASE_DIR)
    
    print(f"Number of nodes: {len(nodes)}")

    index = VectorStoreIndex.from_documents(nodes, storage_context=storage_context, show_progress=True)
    return index

def advanced_RAG(vector_size, input_file):
    """
    Build a dedicated vector index for a single document, to be used as one
    tool of the sub-question query engine.
    """

    print(f"Ingesting document: {input_file}...")
    url = make_url(connection_string)
    print(f"Url {url}")

    # One table per document (e.g. "advanced_rag_uber_2021"). With a single
    # shared table, each index would retrieve chunks from both filings.
    # `os` is imported in the model setup cell above.
    doc_name = os.path.splitext(os.path.basename(input_file))[0]

    vector_store = PGVectorStore.from_params(
        database=db_name,
        host=url.host,
        password=url.password,
        port=url.port,
        user=url.username,
        table_name=f"advanced_rag_{doc_name}",
        embed_dim=vector_size
    )

    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    nodes = ingest_document(input_file)

    print(f"Number of nodes: {len(nodes)}")

    index = VectorStoreIndex.from_documents(nodes, storage_context=storage_context, show_progress=True)
    return index

def ingest_document(input_file):
    """
    Ingest a document into the vector store. 
    """
    reader = SimpleDirectoryReader(input_files=[input_file])
    return reader.load_data(show_progress=True)

def ingest_directory(directory):
    """
    Ingest documents from a directory into the vector store. 
    """
    reader = SimpleDirectoryReader(input_dir=directory)
    return reader.load_data(show_progress=True)

def display_markdown(question, response):
    """
    Display a question and response in markdown format.
    """
    return Markdown(
f"""
## Question:
{question}

## Answer:
{response}
""")

## Using naive LlamaIndex RAG
- Build a single index containing both companies' 10-K filings, then set up the query engine with `similarity_top_k=3`.
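To make the top-`k` setting concrete, here is a minimal sketch of similarity retrieval over a mixed index. It assumes cosine similarity and uses made-up three-dimensional vectors; `cosine_similarity`, `top_k`, and the chunk labels are hypothetical, and the real vector store works on 1024-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=3):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up embeddings for chunks from both filings in one shared index
chunks = [
    ("Uber balance sheet", [0.9, 0.1, 0.0]),
    ("Lyft balance sheet", [0.8, 0.2, 0.1]),
    ("Uber risk factors", [0.1, 0.9, 0.2]),
    ("Lyft legal proceedings", [0.0, 0.3, 0.9]),
]
query = [1.0, 0.0, 0.0]

for score, text in top_k(query, chunks, k=3):
    print(f"{score:.3f}  {text}")
```

With one shared index, the three retrieved chunks have to cover both companies at once, which may be one reason a comparison question gets uneven coverage.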

In [9]:
index = simple_RAG(vector_size=vector_size)
query_engine = index.as_query_engine(similarity_top_k=3, verbose=True)

Url postgresql://postgres:***@localhost:5432


Loading files: 100%|██████████| 2/2 [00:13<00:00,  6.55s/file]

Number of nodes: 545





Parsing nodes:   0%|          | 0/545 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/757 [00:00<?, ?it/s]

In [10]:
question = "Compare and contrast the their major assets and liabilities in 2021"
response = query_engine.query(question)

display_markdown(question, response.response)


## Question:
Compare and contrast the their major assets and liabilities in 2021

## Answer:
Based on the financial information provided, here is a comparison of Uber and Lyft's major assets and liabilities in 2021:

Assets:
- Uber had significant holdings of marketable equity securities worth $11.4 billion, while Lyft did not report holding any marketable equity securities.
- Both companies had non-marketable debt and equity investments, with Uber's non-marketable debt securities valued at $2.3 billion and non-marketable equity securities at $32 million. Lyft did not provide specific values for these assets.
- Lyft had a large deferred tax asset of $2.1 billion related to net operating loss carryforwards, while Uber did not report deferred tax assets of this magnitude.
- Both companies had operating lease right-of-use assets on their balance sheets, with Lyft reporting $223 million and Uber not providing a specific value.

Liabilities:  
- Uber reported a $193 million liability related to a call option held by MLU B.V., while Lyft did not have a comparable liability.
- Lyft had $264 million in operating lease liabilities and $28 million in finance lease liabilities. Uber did not disclose detailed lease liability amounts.
- Lyft had $2.4 billion in valuation allowances against its deferred tax assets, indicating uncertainty about realizing those assets. Uber did not report a valuation allowance of this scale.

Overall, the information shows some similarities like non-marketable investments and lease assets/liabilities, but also key differences such as Uber's large marketable securities holdings and Lyft's significant deferred tax assets/allowances related to net operating losses.


## Using Sub Question decomposition method
- Given the same question, `Compare and contrast the their major assets and liabilities in 2021`, the sub-question engine produces a clearer answer than the naive RAG method: it reports comparable figures for both companies instead of noting that one of them "did not report" a value. Also notice the sub-questions generated by the system.
- Below, we are building two indices, one for each company, so that they can be compared and contrasted more effectively.

In [11]:
uber_index = advanced_RAG(vector_size=vector_size, input_file="./data/10-K/uber_2021.pdf")
lyft_index = advanced_RAG(vector_size=vector_size, input_file="./data/10-K/lyft_2021.pdf")

Ingesting document: ./data/10-K/uber_2021.pdf...
Url postgresql://postgres:***@localhost:5432


Loading files: 100%|██████████| 1/1 [00:07<00:00,  7.51s/file]

Number of nodes: 307





Parsing nodes:   0%|          | 0/307 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/410 [00:00<?, ?it/s]

Ingesting document: ./data/10-K/lyft_2021.pdf...
Url postgresql://postgres:***@localhost:5432


Loading files: 100%|██████████| 1/1 [00:05<00:00,  5.29s/file]

Number of nodes: 238





Parsing nodes:   0%|          | 0/238 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/344 [00:00<?, ?it/s]

In [12]:
lyft_engine = lyft_index.as_query_engine(similarity_top_k=3, verbose=True)
uber_engine = uber_index.as_query_engine(similarity_top_k=3, verbose=True)

query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(
            name="lyft_10k",
            description=(
                "Provides information about Lyft financials for year 2021"
            ),
        ),
    ),
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(
            name="uber_10k",
            description=(
                "Provides information about Uber financials for year 2021"
            ),
        ),
    ),
]

s_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    verbose=True
    
)

In [13]:
question = "Compare and contrast the their major assets and liabilities in 2021"
response = s_engine.query(question)

display_markdown(question, response.response)

Generated 4 sub questions.
[uber_10k] Q: What were Uber's major assets in 2021?
[uber_10k] Q: What were Uber's major liabilities in 2021?
[lyft_10k] Q: What were Lyft's major assets in 2021?
[lyft_10k] Q: What were Lyft's major liabilities in 2021?
[uber_10k] A: Based on the consolidated balance sheet information provided, Uber's major liabilities in 2021 were:

1. Accrued and other current liabilities - $6.537 billion
2. Long-term debt, net of current portion - $9.276 billion  
3. Long-term insurance reserves - $2.546 billion
4. Operating lease liabilities, non-current - $1.644 billion

The accrued and other current liabilities included significant amounts for accrued legal, regulatory and non-income taxes ($2.187 billion), accrued driver and merchant liabilities ($1.187 billion), and other accruals ($2.549 billion). The long-term debt was the largest liabilit


## Question:
Compare and contrast the their major assets and liabilities in 2021

## Answer:
In 2021, both Uber and Lyft had significant cash and investment holdings as major assets, reflecting their need for liquidity to fund operations. However, Uber's cash and investments were substantially larger at $16.1 billion compared to Lyft's $3.3 billion, indicating Uber's greater financial resources.

A key difference was Uber's much higher goodwill and intangible assets of $10.8 billion versus Lyft's $180.5 million, likely due to Uber's larger acquisition activities over the years. Uber also had higher property and equipment assets of $1.9 billion compared to Lyft's $298.2 million, suggesting a more extensive physical operational footprint.

On the liabilities side, both companies had substantial accrued liabilities, insurance reserves, and lease obligations. However, Uber's long-term debt of $9.3 billion dwarfed Lyft's $655.2 million, reflecting Uber's higher leverage and borrowing needs.

Overall, while both companies had similar liability components, Uber's assets were significantly larger across multiple categories, highlighting its greater scale, acquisitions, and capital-intensive nature compared to Lyft's relatively smaller operations and asset base.


In [14]:
question = "Compare revenue growth of Uber and Lyft from 2020 to 2021"
response = s_engine.query(question)

display_markdown(question, response.response)

Generated 4 sub questions.
[uber_10k] Q: What was Uber's revenue in 2020?
[uber_10k] Q: What was Uber's revenue in 2021?
[lyft_10k] Q: What was Lyft's revenue in 2020?
[lyft_10k] Q: What was Lyft's revenue in 2021?
[uber_10k] A: According to the financial information provided, Uber's revenue in 2020 was $11,139 million.
[uber_10k] A: According to the financial information provided, Uber's total revenue for the year 2021 was $17.455 billion.
[lyft_10k] A: According to the consolidated statements of operations provided in the context, Lyft's revenue in 2020 was $2,364,681,000.
[lyft_10k] A: According to the financial information provided, Lyft's revenue for the year ended December 31, 2021 was $3,208,323,000.


## Question:
Compare revenue growth of Uber and Lyft from 2020 to 2021

## Answer:
Uber's revenue grew at a higher rate compared to Lyft from 2020 to 2021. Uber's revenue increased by around 56% from $11.139 billion in 2020 to $17.455 billion in 2021. On the other hand, Lyft's revenue grew by approximately 36% from $2.364 billion in 2020 to $3.208 billion in 2021. Therefore, Uber experienced a more substantial revenue growth compared to its competitor Lyft during the same period.
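The growth rates in the answer can be checked directly from the four sub-answers. A minimal sketch, with `growth_pct` as a hypothetical helper and all figures in millions of USD as reported above:

```python
def growth_pct(old, new):
    """Year-over-year growth as a percentage of the earlier value."""
    return (new - old) / old * 100


# Revenue in millions of USD, copied from the sub-answers above
uber_2020, uber_2021 = 11_139, 17_455
lyft_2020, lyft_2021 = 2_364.681, 3_208.323

print(f"Uber: {growth_pct(uber_2020, uber_2021):.1f}%")
print(f"Lyft: {growth_pct(lyft_2020, lyft_2021):.1f}%")
```

This gives roughly 56.7% for Uber and 35.7% for Lyft, in line with the rounded figures in the synthesized answer.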


## Handling missing data
- This question asks about 2017 and 2018, but the indexed filings only contain data from 2019 to 2021, so the engine should say that the data is missing rather than invent figures.
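One inexpensive guard, sketched below, is to compare the years mentioned in a question with the years the filings cover before calling the engine at all. `COVERED_YEARS` and `uncovered_years` are hypothetical names, and the covered range is taken from the note above rather than read from the index.

```python
import re

# Assumption: the 2021 filings report figures for these years only
COVERED_YEARS = {2019, 2020, 2021}


def uncovered_years(question, covered=COVERED_YEARS):
    """Return the four-digit years in `question` that the filings do not cover."""
    years = {int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", question)}
    return sorted(years - covered)


question = "Compare revenue growth of Uber and Lyft from 2017 to 2018"
gaps = uncovered_years(question)
if gaps:
    print(f"No data indexed for: {gaps}")
```

This only catches explicit years; the engine's own "not in the context" answers remain the real safety net.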

In [15]:
question = "Compare revenue growth of Uber and Lyft from 2017 to 2018"
response = s_engine.query(question)

display_markdown(question, response.response)

Generated 4 sub questions.
[uber_10k] Q: What was Uber's revenue in 2017?
[uber_10k] Q: What was Uber's revenue in 2018?
[lyft_10k] Q: What was Lyft's revenue in 2017?
[lyft_10k] Q: What was Lyft's revenue in 2018?
[lyft_10k] A: Unfortunately, the provided context does not contain any information about Lyft's revenue in 2018. The context only discusses Lyft's revenue, costs, and other financial metrics for the years 2021, 2020, and 2019. There is no mention of the company's revenue or financial performance in 2018.
[uber_10k] A: Unfortunately, the provided context information does not mention Uber's revenue for the year 2017. The context only provides details about Uber's revenue for the years 2019, 2020, and 2021. Without any information about 2017 revenue in the given context, I cannot provide a specific figure for that year.


## Question:
Compare revenue growth of Uber and Lyft from 2017 to 2018

## Answer:
Unfortunately, I do not have enough information from the provided context to compare the revenue growth of Uber and Lyft from 2017 to 2018. The context does not mention any revenue figures for either company in those specific years. Without access to their actual revenue numbers for 2017 and 2018, it is impossible for me to analyze or compare their revenue growth during that period based solely on the given context information.


In [17]:
question = "What are these companies' revenue, profit, and expenses for year 2021?"
response = s_engine.query(question)

display_markdown(question, response.response)

Generated 6 sub questions.
[uber_10k] Q: What is Uber's revenue for 2021?
[uber_10k] Q: What is Uber's profit for 2021?
[uber_10k] Q: What are Uber's expenses for 2021?
[lyft_10k] Q: What is Lyft's revenue for 2021?
[lyft_10k] Q: What is Lyft's profit for 2021?
[lyft_10k] Q: What are Lyft's expenses for 2021?
[uber_10k] A: According to the financial information provided, Uber's major expenses for 2021 were:

Cost of revenue, exclusive of depreciation and amortization: $9,351 million
Operations and support: $1,877 million  
Sales and marketing: $4,789 million
Research and development: $2,054 million
General and administrative: $2,316 million
Depreciation and amortization: $902 million
Interest expense: $483 million

The total costs and expenses for Uber in 2021 amounted to $21,289 million. The largest expense w


## Question:
What are these companies' revenue, profit, and expenses for year 2021?

## Answer:
Based on the financial information provided, here are the key details for Uber and Lyft's revenue, profit, and expenses for the year 2021:

Uber:
Revenue: $17.455 billion
Profit/Loss: Net loss of $496 million
Major Expenses: Cost of revenue ($9.351 billion), Sales and marketing ($4.789 billion), Operations and support ($1.877 billion), Research and development ($2.054 billion), General and administrative ($2.316 billion)

Lyft: 
Revenue: $3.208 billion  
Profit/Loss: Net loss of $1.009 billion
Major Expenses: Cost of revenue ($1.650 billion), Research and development ($911.9 million), General and administrative ($915.6 million), Operations and support ($402.2 million), Sales and marketing ($411.4 million)

In summary, while both companies generated significant revenue in 2021, they also incurred substantial expenses across various areas such as cost of revenue, marketing, research and development, and operations, resulting in net losses for the year. Uber's revenue was notably higher than Lyft's, but it also had higher expenses, leading to a lower net loss compared to Lyft.
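Generated figures are worth a quick arithmetic check. The sketch below sums the Uber expense lines from the sub-answer above and compares them with the reported total of $21,289 million. The variable names are illustrative and the figures are copied from the output, not from the filing itself.

```python
# Uber 2021 expense lines in millions of USD, copied from the sub-answer above
uber_expenses = {
    "Cost of revenue, exclusive of depreciation and amortization": 9_351,
    "Operations and support": 1_877,
    "Sales and marketing": 4_789,
    "Research and development": 2_054,
    "General and administrative": 2_316,
    "Depreciation and amortization": 902,
    "Interest expense": 483,
}
reported_total = 21_289

sum_all = sum(uber_expenses.values())
sum_without_interest = sum_all - uber_expenses["Interest expense"]

print(f"All listed lines: {sum_all:,}")
print(f"Without interest: {sum_without_interest:,}")
print(f"Reported total:   {reported_total:,}")
```

The listed lines match the reported total only when interest expense is left out, which suggests the model listed interest expense next to the operating cost lines even though it is not part of that total.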
