# Financial Document Analysis using LlamaIndex

## Introduction

### LlamaIndex
[LlamaIndex](https://gpt-index.readthedocs.io/en/latest/) is a data framework for LLM applications.
You can get started with just a few lines of code and build a retrieval-augmented generation (RAG) system in minutes.
For more advanced users, LlamaIndex offers a rich toolkit for ingesting and indexing your data, modules for retrieval and re-ranking, and composable components for building custom query engines.


### Financial Analysis of Annual Reports
A key part of a financial analyst's job is to extract information and synthesize insight from long financial documents.
Here we use the annual reports of Zomato and Tata Motors and answer a few queries about them.


We showcase how LlamaIndex can support a financial analyst in quickly extracting information and synthesizing insights **across multiple documents** with very little coding.

## To begin, we need to install the llama-index library

In [1]:
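# Pinned below 0.10 because this tutorial uses the older top-level `llama_index` imports and `ServiceContext`.
!pip install "llama-index<0.10" pypdf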



Now, we import all modules used in this tutorial.

In [2]:
# llama-index is already installed above; langchain-community provides the OpenAI LLM wrapper.
!pip install langchain langchain-community

from langchain.llms import OpenAI

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index import set_global_service_context
from llama_index.response.pprint_utils import pprint_response
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine




Before we start, we can configure the LLM provider and model that will power our RAG system.  
Here, we pick `gpt-3.5-turbo-instruct` from OpenAI.  

In [None]:
!pip install langchain openai
from langchain.llms import OpenAI

llm = OpenAI(temperature=0, model_name="gpt-3.5-turbo-instruct", max_tokens=-1)

We construct a `ServiceContext` and set it as the global default, so all subsequent operations that depend on LLM calls will use the model we configured here.

In [None]:
service_context = ServiceContext.from_defaults(llm=llm)
set_global_service_context(service_context=service_context)

## Data Loading and Indexing

Now, we load and parse 2 PDFs (Annual Reports of Zomato and Tata Motors for FY 2023-2024).    
Under the hood, the PDFs are converted to plain-text `Document` objects, separated by page.  

> Note: this operation might take a while to run, since each document is more than 100 pages.

In [None]:
tata_docs = SimpleDirectoryReader(input_files=["tata-motor-IAR-2023-24.pdf"]).load_data()
zomato_docs = SimpleDirectoryReader(input_files=["Zomato_Annual_Report_2023-24.pdf"]).load_data()

In [None]:
print(f'Loaded Tata Motors report with {len(tata_docs)} pages')
print(f'Loaded Zomato report with {len(zomato_docs)} pages')

Now, we can build an (in-memory) `VectorStoreIndex` over the documents that we've loaded.  

> Note: this operation might take a while to run, since it calls the OpenAI API to compute vector embeddings over document chunks.

In [None]:
tata_index = VectorStoreIndex.from_documents(tata_docs)
zomato_index = VectorStoreIndex.from_documents(zomato_docs)
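
To make the "document chunks" mentioned above more concrete, here is a minimal sketch of splitting page text into overlapping chunks before embedding. It is an illustration only: `chunk_text`, `chunk_size`, and `overlap` are hypothetical names, the split is on whitespace-separated words rather than tokens, and this is not LlamaIndex's actual node parser.

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Toy "page" of 120 words standing in for one page of an annual report.
page_text = " ".join(f"w{i}" for i in range(120))
chunks = chunk_text(page_text, chunk_size=50, overlap=10)
print(len(chunks))
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of embedding some text twice.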

## Simple QA

Now we are ready to run some queries against our indices!  
To do so, we first configure a `QueryEngine`, which just captures a set of configurations for how we want to query the underlying index.

For a `VectorStoreIndex`, the most common configuration to adjust is `similarity_top_k`, which controls how many document chunks (which we call `Node` objects) are retrieved and used as context for answering our question.

In [None]:
tata_engine = tata_index.as_query_engine(similarity_top_k=3)

In [None]:
zomato_engine = zomato_index.as_query_engine(similarity_top_k=3)
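
As a rough illustration of what a top-k setting does, the sketch below ranks a handful of toy vectors by cosine similarity to a query vector and keeps the best `k`. The vectors, node names, and the helper functions `cosine_similarity` and `top_k` are all made up for this example; real embeddings have far more dimensions, and the library's own retrieval code may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=3):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional "embeddings" for four chunks.
toy_nodes = {
    "page_12": [0.9, 0.1, 0.0],
    "page_40": [0.1, 0.9, 0.0],
    "page_77": [0.7, 0.3, 0.1],
    "page_90": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
results = top_k(query, toy_nodes, k=2)
print([node_id for node_id, _ in results])
```

A larger `k` gives the LLM more context to work with but also increases prompt size and the chance of including irrelevant chunks.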

Let's see some queries in action!

In [None]:
response = await tata_engine.aquery('What is the Profit After Tax of Tata Motors in 2023? Answer in millions with page reference')

In [None]:
print(response)

In [None]:
response = await zomato_engine.aquery('What is the Profit After Tax of Zomato in 2023? Answer in millions, with page reference')

In [None]:
print(response)
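
The cells above use top-level `await`, which works inside a notebook. In a plain Python script, one option is to wrap the calls in a coroutine and start it with `asyncio.run`. The sketch below assumes this and uses a hypothetical `FakeEngine` stand-in (not a LlamaIndex class) so that it runs without an API key.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for a query engine exposing an async `aquery`."""

    def __init__(self, name):
        self.name = name

    async def aquery(self, question):
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"[{self.name}] answer to: {question}"


async def main():
    tata = FakeEngine("tata")
    zomato = FakeEngine("zomato")
    # Run both queries concurrently; results come back in argument order.
    return await asyncio.gather(
        tata.aquery("What is the Profit After Tax?"),
        zomato.aquery("What is the Profit After Tax?"),
    )


# Intended for a script; a notebook would use `await main()` instead.
answers = asyncio.run(main())
for line in answers:
    print(line)
```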

## Advanced QA - Compare and Contrast

For more complex financial analysis, one often needs to reference multiple documents.  

As an example, let's take a look at how to do compare-and-contrast queries over both Tata Motors and Zomato financials.  
For this, we build a `SubQuestionQueryEngine`, which breaks down a complex compare-and-contrast query into simpler sub-questions, each executed on the sub query engine backed by the corresponding index.

In [None]:
query_engine_tools = [
    QueryEngineTool(
        query_engine=tata_engine, 
        metadata=ToolMetadata(name='tata_annual_report', description='Provides information about Tata Motors financials for FY 2023-24')
    ),
    QueryEngineTool(
        query_engine=zomato_engine,
        metadata=ToolMetadata(name='zomato_annual_report', description='Provides information about Zomato financials for FY 2023-24')
    ),
]

s_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
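
The decompose-then-combine flow described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: `SubQuestion`, `decompose`, and `answer` are hypothetical names, the decomposition is a fixed stub (the real engine asks the LLM to generate sub-questions), and the final step simply joins the partial answers instead of synthesizing them with an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubQuestion:
    tool_name: str  # which per-document engine should answer
    question: str   # the narrower question sent to that engine


def decompose(question: str, tool_names: List[str]) -> List[SubQuestion]:
    """Stub decomposition: one sub-question per available tool."""
    return [SubQuestion(name, f"{question} (for {name})") for name in tool_names]


def answer(question: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """Route each sub-question to its tool and join the partial answers."""
    sub_questions = decompose(question, list(tools))
    partial = [
        f"{sq.tool_name}: {tools[sq.tool_name](sq.question)}"
        for sq in sub_questions
    ]
    return "\n".join(partial)


# Stub tools standing in for the two per-report query engines.
tools = {
    "tata": lambda q: "stub answer from the Tata Motors index",
    "zomato": lambda q: "stub answer from the Zomato index",
}
combined = answer("How did revenue change?", tools)
print(combined)
```

Each tool's description matters in the real setup, since it is the main signal available for deciding which engine should receive which sub-question.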

Let's see these queries in action!

In [None]:
response = await s_engine.aquery('Compare and contrast the customer segments and geographies that grew the fastest')

In [None]:
print(response)

In [None]:
response = await s_engine.aquery('Compare revenue growth of Tata Motors and Zomato from 2023 to 2024')

In [None]:
print(response)