<a href="https://colab.research.google.com/github/dhnanjay/HuggingFace/blob/main/10k_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install "llama-index<0.6.0"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting llama-index
  Downloading llama_index-0.4.29.tar.gz (140 kB)
  Preparing metadata (setup.py) ... done
Collecting dataclasses_json
  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)
Collecting langchain
  Downloading langchain-0.0.115-py3-none-any.whl (404 kB)
Collecting openai>=0.26.4
  Downloading openai-0.27.2-py3-none-any.whl (70 kB)
Collecting tiktoken
  Downloading tiktoken-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)

In [1]:
# download files
!mkdir -p data
!wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O data/UBER.zip
!unzip data/UBER.zip -d data

--2023-06-02 20:07:12--  https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/948jr9cfs7fgj99/UBER.zip [following]
--2023-06-02 20:07:12--  https://www.dropbox.com/s/dl/948jr9cfs7fgj99/UBER.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc33d76341c5f554ba8a75d40a35.dl.dropboxusercontent.com/cd/0/get/B9OaU0YETvUJ5HefhJCfACRBXSnN7rwOTcLtNpv1HkVEfxjsMIJufsGqFQ2CxpaBAUaQ3yW085Wr1EVAW5X_EmPI4QC3kV-EPd0xbL25msL6CZQsc1pp6-ypw4606alrJbFoO6JJwGNX33HV_q9JfRc3wsl7vKH7rEVgsN1m39ijUQ/file?dl=1# [following]
--2023-06-02 20:07:13--  https://uc33d76341c5f554ba8a75d40a35.dl.dropboxusercontent.com/cd/0/get/B9OaU0YETvUJ5HefhJCfACRBXSnN7rwOTcLtNpv1HkVEfxjsMIJufsGqFQ2CxpaBAUaQ3yW085Wr1EVAW5X_EmPI4Q

In [None]:
# set text wrapping
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
import os
os.environ['OPENAI_API_KEY'] = ""

In [None]:
from llama_index import download_loader, GPTSimpleVectorIndex
from pathlib import Path

### Ingest Unstructured Data Through the Unstructured.io Reader

We parse the HTML filings with the Unstructured.io reader, which is downloaded through LlamaHub.

In [None]:
UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)

In [None]:
loader = UnstructuredReader()
doc_set = {}
all_docs = []
years = [2022, 2021, 2020, 2019]
for year in years:
    year_docs = loader.load_data(file=Path(f'./data/UBER/UBER_{year}.html'), split_documents=False)
    # attach the filing year as metadata to each document
    for d in year_docs:
        d.extra_info = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


### Set Up the Service Context

In [None]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(chunk_size_limit=512)

### Set Up a Vector Index for Each SEC Filing

We set up a separate vector index for each SEC filing from 2019-2022.

We also optionally initialize a "global" index by dumping all files into the vector store.
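
To make the per-filing index idea concrete, here is a minimal sketch of what a simple vector index does at query time: embed the query, score every stored chunk by cosine similarity, and return the top k. The `embed` function is a hypothetical bag-of-words stand-in for a real embedding model, and `ToyVectorIndex` and the sample chunks are illustrative names and data, not the LlamaIndex implementation or text from the filings.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Stores (chunk, vector) pairs and retrieves the most similar chunks."""

    def __init__(self, chunks):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query, similarity_top_k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:similarity_top_k]]


# One toy index per year, mirroring the per-filing layout (made-up chunks).
toy_index_set = {
    2019: ToyVectorIndex(["revenue grew this year", "competition is a risk"]),
    2020: ToyVectorIndex(["the pandemic is a major risk", "revenue fell this year"]),
}
top = toy_index_set[2020].retrieve("what is the biggest risk", similarity_top_k=1)
print(top)
```

Keeping one index per year means a query against a given year can only ever retrieve chunks from that year's filing.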

In [None]:
# initialize simple vector indices + global vector index
# NOTE: don't run this cell if the indices are already loaded! 
index_set = {}
for year in years:
    cur_index = GPTSimpleVectorIndex.from_documents(doc_set[year], service_context=service_context)
    index_set[year] = cur_index
    cur_index.save_to_disk(f'index_{year}.json')
    

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 156882 tokens
INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 162641 tokens
INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 173288 tokens
INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 169130 tokens


In [None]:
# Load indices from disk
index_set = {}
for year in years:
    cur_index = GPTSimpleVectorIndex.load_from_disk(f'index_{year}.json', service_context=service_context)
    index_set[year] = cur_index

In [None]:
# NOTE: this global index is a single vector store containing all documents
# Only relevant for the section below: "Can a single vector index answer questions across years?"
global_index = GPTSimpleVectorIndex.from_documents(all_docs, service_context=service_context)
global_index.save_to_disk('index_global.json')

In [None]:
global_index = GPTSimpleVectorIndex.load_from_disk('index_global.json', service_context=service_context)

### Ask Initial Questions over a Given Year (2020)

Let's first ask some questions over the UBER 10-K for 2020!

In [None]:
response = index_set[2020].query("What were some of the biggest risk factors in 2020?", similarity_top_k=3)

In [None]:
print(response)

In [None]:
response = index_set[2020].query("What were some of the significant acquisitions?", similarity_top_k=3)

In [None]:
print(response)

### Can a single vector index answer questions across years?

Let's test whether a single vector store containing all documents can answer questions across years!
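
One plausible reason a pooled index can struggle with cross-year questions is that top-k retrieval only ranks chunks by similarity, with no notion of covering every year. The sketch below uses made-up similarity scores (an assumption, not measured values) to show that with `similarity_top_k=3` over four years of pooled chunks, at most three years can appear in the context, and often fewer.

```python
# Hypothetical (year, similarity score) pairs for chunks pooled into one index.
pooled = [
    (2019, 0.81), (2019, 0.79),
    (2020, 0.90), (2020, 0.88),
    (2021, 0.70),
    (2022, 0.69),
]


def years_covered(scored_chunks, similarity_top_k):
    """Return the distinct years represented among the top-k scoring chunks."""
    top = sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:similarity_top_k]
    return sorted({year for year, _ in top})


covered = years_covered(pooled, similarity_top_k=3)
print(covered)  # years that would reach the LLM in this toy example
```

In this toy example two of the four years never reach the language model, so it cannot describe them no matter how the question is phrased.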

In [None]:
# # Option 2
# risk_query_str = (
#     "Describe the current risk factors. If the year is provided in the information, "
#     "provide that as well. If the context contains risk factors for multiple years, "
#     "explicitly provide the following:\n"
#     "- A description of the risk factors for each year\n"
#     "- A summary of how these risk factors are changing across years"
# )

In [None]:
# Option 1
risk_query_str = "What are some of the biggest risk factors in each year?"

In [None]:
response = global_index.query(risk_query_str, similarity_top_k=3)

In [None]:
print(str(response))



The biggest risk factors in 2019 include economic downturns, natural disasters, geopolitical tensions, changes in government policies, risk-free interest rates, and the potential for rising inflation. Economic downturns can be caused by a variety of factors, such as a decrease in consumer spending, a decrease in business investment, or a decrease in exports. Natural disasters can cause significant damage to infrastructure and disrupt economic activity. Geopolitical tensions can lead to increased uncertainty and volatility in financial markets. Changes in government policies can have a significant impact on the economy, such as changes in taxation or regulations. Risk-free interest rates can affect the cost of borrowing and the availability of credit, while rising inflation can erode the purchasing power of consumers.


### Composing a Graph to synthesize answers across 10-K filings (2019-2022)

We want our queries to aggregate/synthesize information across *all* 10-K filings. To do this, we define a List index
on top of the 4 vector indices.
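
As a rough mental model of this composition, the sketch below treats each per-year index as a callable and has a list-style parent visit every child in order, collecting one answer per year. `make_child` and `ToyListIndex` are hypothetical names, the word-overlap retrieval stands in for embedding search, and the chunks are made up; the actual `ComposableGraph` behavior may differ in its details.

```python
def make_child(chunks):
    """Hypothetical child index: returns the chunk sharing the most words with the query."""
    def query(question):
        q_words = set(question.lower().split())
        return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))
    return query


class ToyListIndex:
    """Parent that routes a query to every child, in order, and collects the answers."""

    def __init__(self, children, summaries):
        self.nodes = list(zip(summaries, children))

    def query(self, question):
        return [(summary, child(question)) for summary, child in self.nodes]


toy_years = [2021, 2022]
toy_children = {
    2021: make_child(["driver classification is a risk", "revenue increased"]),
    2022: make_child(["competition is a risk", "bookings increased"]),
}
toy_graph = ToyListIndex(
    [toy_children[y] for y in toy_years],
    [f"UBER 10-k Filing for {y} fiscal year" for y in toy_years],
)
per_year = toy_graph.query("what is the main risk")
for summary, answer in per_year:
    print(summary, "->", answer)
```

Unlike a pooled top-k search, every year contributes an answer here, which is what makes a cross-year synthesis step possible.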

In [None]:
from llama_index import GPTListIndex, LLMPredictor
from langchain import OpenAI
from llama_index.composability import ComposableGraph

In [None]:
# set summary text for each doc
summaries = {}
for year in years:
    summaries[year] = f"UBER 10-k Filing for {year} fiscal year"

In [None]:
# set number of output tokens
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, max_tokens=512))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

In [None]:
graph = ComposableGraph.from_indices(
    GPTListIndex,
    [index_set[y] for y in years],
    [summaries[y] for y in years],
    service_context=service_context
)

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 0 tokens


In [None]:
graph.save_to_disk('10k_graph.json')

In [None]:
graph = ComposableGraph.load_from_disk('10k_graph.json', service_context=service_context)

### Setting Up the Query

We ask about the risk factors and want the graph to synthesize information across all four years.
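
The list index in the query configuration uses the `tree_summarize` response mode. A minimal sketch of that idea, assuming a bottom-up merge in fixed-size groups, is shown here: per-year answers are combined repeatedly until a single summary remains. `fake_llm_combine` is a hypothetical placeholder for an LLM call and `fan_in` is an illustrative parameter; the library's actual implementation may pack text into prompts differently.

```python
def tree_summarize(texts, combine, fan_in=2):
    """Merge texts bottom-up in groups of `fan_in` until a single text remains."""
    if not texts:
        raise ValueError("need at least one text to summarize")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(texts)
    while len(level) > 1:
        level = [combine(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


def fake_llm_combine(group):
    """Hypothetical stand-in for an LLM summarization call: just shows the grouping."""
    return "(" + " + ".join(group) + ")"


per_year_answers = ["2022", "2021", "2020", "2019"]
summary = tree_summarize(per_year_answers, fake_llm_combine)
print(summary)
```

Each merge step sees only a few inputs at a time, which keeps individual prompts small even when the total amount of retrieved text is large.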

In [None]:
risk_query_str = (
    "Describe the current risk factors. If the year is provided in the information, "
    "provide that as well. If the context contains risk factors for multiple years, "
    "explicitly provide the following:\n"
    "- A description of the risk factors for each year\n"
    "- A summary of how these risk factors are changing across years"
)

In [None]:
query_configs = [
    {
        "index_struct_type": "dict",
        "query_mode": "default",
        "query_kwargs": {
            "similarity_top_k": 1,
            # "include_summary": True
        }
    },
    {
        "index_struct_type": "list",
        "query_mode": "default",
        "query_kwargs": {
            "response_mode": "tree_summarize",
        }
    },
]

In [None]:
response_summary = graph.query(risk_query_str, query_configs=query_configs)

In [None]:
print(response_summary)


For 2017, 2018, and 2019, the risk factors included economic uncertainty, geopolitical tensions, and the potential for natural disasters.

For 2020, the risk factors include the COVID-19 pandemic and the impact of actions to mitigate the pandemic, which has adversely affected and continues to adversely affect business, financial condition, operating results, and prospects. This includes the potential for reduced demand for products and services, disruption of the supply chain, reduced liquidity, increased costs, and reduced revenue.

For 2021 and 2022, the risk factors include the potential for Drivers to be classified as employees, workers or quasi-employees instead of independent contractors, the highly competitive nature of the mobility, delivery, and logistics industries, and the need to lower fares or service fees and offer Driver incentives and consumer discounts and promotions in order to remain competitive in certain markets.

Overall, the risk factors have remained relatively

In [None]:
print(response_summary.get_formatted_sources())

> Source (Doc id: 9b8c571e-b813-4906-9020-208cc57ac12a): UBER 10-k Filing for 2022 fiscal year...

> Source (Doc id: 8752ce30-d2c3-4ea6-9a5f-ca890105664b): year: 2022

announcements regarding our financial performance, including SEC filings, investor ev...

> Source (Doc id: 54649d63-ca22-4e40-a715-304606638bea): UBER 10-k Filing for 2021 fiscal year...

> Source (Doc id: 1a7abef5-2939-4d6d-aaa1-f579b57a65ea): year: 2021

our earnings calls and certain events we participate in or host with members of the i...

> Source (Doc id: a869db0a-254e-465d-836b-1e37335c2561): UBER 10-k Filing for 2020 fiscal year...

> Source (Doc id: 3cc53231-5406-4af3-8ab9-81fd61ffedc6): year: 2020

on Form 8-K and amendments to reports filed or furnished pursuant to Sections 13(a) a...

> Source (Doc id: b1538dd1-8270-45db-a603-ab6d34e9a6b3): UBER 10-k Filing for 2019 fiscal year...

> Source (Doc id: 4b64862e-b4bf-4013-86ab-ecaf6d4d0157): year: 2019

31,

2017

2018

2019

Risk-free interest...


In [None]:
# query a specific year
response_tmp = index_set[2022].query(risk_query_str)

INFO:root:> [query] Total LLM token usage: 576 tokens
INFO:root:> [query] Total embedding token usage: 41 tokens


In [None]:
str(response_tmp)

'\nIn 2022, the risk factors for our business include Drivers being classified as employees, workers or quasi-employees instead of independent contractors, the mobility, delivery, and logistics industries being highly competitive, and the need to lower fares or service fees and offer Driver incentives and consumer discounts and promotions in order to remain competitive in certain markets. Since our inception, we have incurred significant losses.'

In [None]:
# query a global index
response = global_index.query(risk_query_str, similarity_top_k=4)

INFO:root:> [query] Total LLM token usage: 2926 tokens
INFO:root:> [query] Total embedding token usage: 41 tokens


In [None]:
str(response)

"\n\nThe current risk factors include the COVID-19 pandemic and the impact of actions to mitigate the pandemic, the potential for Drivers to be classified as employees, workers or quasi-employees instead of independent contractors, and the highly competitive mobility, delivery, and logistics industries with well-established and low-cost alternatives, low barriers to entry, low switching costs, and well-capitalized competitors in nearly every major geographic region. To remain competitive, the company has in the past lowered, and may continue to lower, fares or service fees, and has in the past offered, and may continue to offer, significant Driver incentives and consumer discounts and promotions. Additionally, the risk-free interest rate has decreased from 2017 to 2019, from 2.4% to 2.1%, which could have an impact on the company's ability to access capital. Furthermore, the company has incurred significant losses since inception, including in the United States and other major markets,