<a href="https://colab.research.google.com/github/edquestofficial/Gen-AI-Cohort/blob/main/2024/april/Level_2/LLaMA_Index/LLaMA_Index_Multiple_Financial_Document_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiple Financial Document Analysis

## Reference

* [YouTube](https://www.youtube.com/watch?v=GT_Lsj3xj1o&list=PLTZkGHtR085ZjK1srrSZIrkeEzQiMjO9W)
* [Documentation](https://github.com/openai/openai-cookbook/blob/main/examples/third_party/financial_document_analysis_with_llamaindex.ipynb)

## Install Required Libraries

In [None]:
!pip install llama-index pypdf

In [None]:
! pip install langchain

In [None]:
! pip install -q llama-index-llms-gemini

In [None]:
! pip install llama-index-embeddings-huggingface

## Mount Google Drive

In [6]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [7]:
import os

base_path = "/content/drive/MyDrive/Gen AI Course/RAG_For_HDFC_Policy"
filepath = f"{base_path}/gemini_api_key.txt"
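with open(filepath, "r") as f:
    # Read the whole file and strip surrounding whitespace (including the
    # trailing newline), so the key is passed to the API exactly as stored.
    api_key = f.read().strip()
os.environ["GOOGLE_API_KEY"] = api_key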

## Required Imports

In [9]:
# Core LlamaIndex building blocks used in this notebook
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

## Basic LLM Setup

In [11]:
from llama_index.llms.gemini import Gemini

llm = Gemini()

In [12]:
Settings.llm = llm

## Data Loading

In [13]:
data_path = "/content/drive/MyDrive/Gen AI Course/data"
lyft_path = f"{data_path}/lyft_2021.pdf"
uber_path = f"{data_path}/uber_2021.pdf"

In [14]:
lyft_docs = SimpleDirectoryReader(input_files=[lyft_path]).load_data()

In [16]:
len(lyft_docs)

238

In [17]:
uber_docs = SimpleDirectoryReader(input_files=[uber_path]).load_data()

In [18]:
len(uber_docs)

307

In [19]:
print(f'Loaded Lyft with {len(lyft_docs)} pages')
print(f'Loaded Uber with {len(uber_docs)} pages')

Loaded Lyft with 238 pages
Loaded Uber with 307 pages


## Indexing

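By default, LlamaIndex embeds documents with an OpenAI embedding model, which requires an OpenAI API key and is billed per token. To save costs, you may want to set a local embedding model before building the indices.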

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

In [26]:
lyft_index = VectorStoreIndex.from_documents(lyft_docs)

In [27]:
uber_index = VectorStoreIndex.from_documents(uber_docs)

## Simple QA

Now we are ready to run some queries against our indices!  
To do so, we first configure a `QueryEngine`, which just captures a set of configurations for how we want to query the underlying index.

For a `VectorStoreIndex`, the most common configuration to adjust is `similarity_top_k`, which controls how many document chunks (which we call `Node` objects) are retrieved as context for answering our question.
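
To make the role of `similarity_top_k` concrete, here is a minimal sketch of top-k retrieval over toy vectors. It is an illustration only, not LlamaIndex's implementation: the chunk names, the three-dimensional "embeddings", and the helper functions `cosine_similarity` and `top_k_chunks` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return the ids of the k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


# Hypothetical chunk embeddings (real embeddings have hundreds of dimensions).
toy_chunks = {
    "revenue_table": [0.9, 0.1, 0.0],
    "risk_factors": [0.1, 0.9, 0.1],
    "revenue_discussion": [0.8, 0.2, 0.1],
    "legal_proceedings": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]  # stands in for an embedded revenue question

top_two = top_k_chunks(toy_query, toy_chunks, k=2)
print(top_two)
```

A larger `k` gives the LLM more context at the cost of a longer prompt; a smaller `k` is cheaper but can miss the chunk that holds the answer.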

In [30]:
lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)

In [31]:
uber_engine = uber_index.as_query_engine(similarity_top_k=3)

### Let's see some queries in action!

In [32]:
response = await lyft_engine.aquery('What is the revenue of Lyft in 2021? Answer in millions with page reference')

In [33]:
print(response)

$3,208,323 million. (page 79)
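
The Lyft answer above looks off by a factor of 1,000. A later answer in this notebook gives the same revenue as $3,208,323,000, which suggests the filing states the figure in thousands and the model copied the number without converting it. A quick unit check, assuming the figure is in thousands of US dollars (the helper `thousands_to_millions` is hypothetical):

```python
def thousands_to_millions(value_in_thousands):
    """Convert an amount stated in thousands into millions."""
    return value_in_thousands / 1_000


# Figure taken from the answer above; assumed to be in thousands of USD.
lyft_revenue_thousands = 3_208_323
lyft_revenue_millions = thousands_to_millions(lyft_revenue_thousands)

print(f"${lyft_revenue_millions:,.1f} million")  # prints "$3,208.3 million"
```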


In [34]:
response = await uber_engine.aquery('What is the revenue of Uber in 2021? Answer in millions, with page reference')

In [35]:
print(response)

$17,455 million (page 55)


## Advanced QA - Compare and Contrast

For more complex financial analysis, one often needs to reference multiple documents.  

As an example, let's take a look at how to run compare-and-contrast queries over both Lyft and Uber financials.  
For this, we build a `SubQuestionQueryEngine`, which breaks a complex compare-and-contrast query down into simpler sub-questions and runs each one on the sub query engine backed by the relevant index.
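
The routing idea can be sketched in plain Python. This is a simplified illustration, not the library's implementation: in the real engine an LLM generates the sub-questions and synthesizes the final answer, whereas here the sub-questions are hard-coded and the engines are stubs. The names `answer_with_sub_questions`, `fake_lyft_engine`, and `fake_uber_engine` are hypothetical.

```python
def answer_with_sub_questions(sub_questions, tools):
    """Send each (tool_name, question) pair to its tool and collect the answers."""
    answers = []
    for tool_name, question in sub_questions:
        engine = tools[tool_name]  # look up the engine registered under this name
        answers.append((tool_name, question, engine(question)))
    return answers


# Stub engines standing in for the per-document query engines.
def fake_lyft_engine(question):
    return f"[lyft stub] {question}"


def fake_uber_engine(question):
    return f"[uber stub] {question}"


toy_tools = {"lyft_10k": fake_lyft_engine, "uber_10k": fake_uber_engine}

# In the real engine, an LLM would produce these from the user's query.
toy_sub_questions = [
    ("uber_10k", "What is the revenue growth of Uber from 2020 to 2021"),
    ("lyft_10k", "What is the revenue growth of Lyft from 2020 to 2021"),
]

results = answer_with_sub_questions(toy_sub_questions, toy_tools)
for tool_name, question, answer in results:
    print(f"[{tool_name}] Q: {question}")
    print(f"[{tool_name}] A: {answer}")
```

The tool name and description matter because they are what the LLM sees when deciding which engine should receive each sub-question.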

In [36]:
query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(name='lyft_10k',
                              description='Provides information about Lyft financials for year 2021')
    ),
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(name='uber_10k',
                              description='Provides information about Uber financials for year 2021')
    ),
]

s_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)

In [37]:
response = await s_engine.aquery('Compare and contrast the customer segments and geographies that grew the fastest')

Generated 4 sub questions.
[uber_10k] Q: What are the fastest growing customer segments for Uber
[uber_10k] A: The provided context does not mention anything about the fastest growing customer segments for Uber, so I cannot answer this question from the provided context.
[uber_10k] Q: What are the fastest growing geographies for Uber
[uber_10k] A: This question cannot be answered from the given context.
[lyft_10k] Q: What are the fastest growing customer segments for Lyft
[lyft_10k] A: This question cannot be answered from the given context.
[lyft_10k] Q: What are the fastest growing geographies for Lyft
[lyft_10k] A: This question cannot be answered from the given context because it does not mention the fastest growing geographies for Lyft.

In [38]:
print(response)

The provided context does not mention the fastest growing customer segments or geographies for either Uber or Lyft, so I cannot compare and contrast them from the provided context.


In [39]:
response = await s_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')

Generated 2 sub questions.
[uber_10k] Q: What is the revenue growth of Uber from 2020 to 2021
[uber_10k] A: Uber's revenue increased by 57% from 2020 to 2021, from $11,139 million to $17,455 million.
[lyft_10k] Q: What is the revenue growth of Lyft from 2020 to 2021
[lyft_10k] A: Lyft's revenue grew by 35.7% from $2,364,681,000 in 2020 to $3,208,323,000 in 2021.

In [40]:
print(response)

Uber's revenue growth from 2020 to 2021 was 57%, while Lyft's revenue growth was 35.7%. Therefore, Uber's revenue growth was higher than Lyft's revenue growth during this period.
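
Because LLM answers can misstate numbers, it is worth recomputing the percentages from the revenue figures quoted in the sub-answers. A minimal check (the helper `growth_pct` is hypothetical; the inputs are the values printed above):

```python
def growth_pct(previous, current):
    """Year-over-year growth as a percentage of the previous value."""
    return (current - previous) / previous * 100


# Revenue figures as quoted in the sub-answers above.
uber_growth = growth_pct(11_139, 17_455)        # USD millions
lyft_growth = growth_pct(2_364_681, 3_208_323)  # USD thousands

print(f"Uber: {uber_growth:.1f}%  Lyft: {lyft_growth:.1f}%")
```

Both results agree with the reported 57% and 35.7% after rounding. Growth is a ratio, so the difference in units between the two companies' figures does not affect the comparison.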
