### Example Case: Financial Statements Analysis Using LLMs Powered by RAG (Retrieval-Augmented Generation)

1. **Scenario:**  
   Assume you are a financial analyst tasked with analyzing financial statements. The data is stored in diverse formats like PDFs, Excel sheets, or other document types.

2. **Challenge:**  
   Traditional methods involve manual extraction, spreadsheet analysis, and complex formula-driven insights. A natural language-driven approach is preferred to make the process more intuitive and efficient.

3. **Solution: LLMs with Retrieval-Augmented Generation (RAG):**  
   - **Data Ingestion:**  
     Use OCR tools or specialized parsers to extract relevant data from documents like PDFs or scanned images.  
   - **Knowledge Base Creation:**  
     Split the extracted text into chunks, embed each chunk with a model such as a sentence transformer, and store the embeddings in a vector database.  
   - **Query and Analysis:**  
     Use an LLM, optionally fine-tuned on financial terminology and concepts. A retriever queries the vector database and passes only the most relevant snippets to the model for each user query.  
   - **Natural Language Interaction:**  
     Users can ask questions like:  
       - "What are the key financial ratios for Company X over the past 3 years?"  
       - "Summarize the cash flow trends in this document."  
     - The LLM retrieves the relevant context, processes it, and provides a concise, accurate response.  
   - **Enhanced Insights:**  
     The system can generate charts, comparisons, and forecasts based on the extracted data, making the analysis process even more actionable.

4. **Advantages:**  
   - **Time Efficiency:** Reduces manual data processing significantly.  
   - **Intuitive Interaction:** Natural language queries reduce the need for technical expertise in financial modeling.  
   - **Contextual Relevance:** Retrieval grounds each response in specific passages taken from the source documents.  
   - **Scalability:** Can handle a vast volume of documents across multiple formats.  
   - **Enhanced Visualization:** Integrates with tools to generate charts, reports, and trend analyses.  

5. **Disadvantages:**  
   - **Data Extraction Challenges:** OCR and parser tools may fail with poorly scanned or non-standardized documents.  
   - **Model Accuracy:** LLMs may occasionally misinterpret queries or generate incorrect insights without adequate fine-tuning.  
   - **Dependency on Pre-existing Data:** Results are limited to the quality and comprehensiveness of the knowledge base.  
   - **Cost:** Setting up and maintaining RAG systems, including vector databases and fine-tuned LLMs, can be expensive.  
   - **Security Risks:** Sensitive financial data may require robust encryption and access controls to prevent breaches.  
   - **Limited Reasoning:** While LLMs excel in summarizing and retrieving information, they may lack advanced reasoning for complex financial scenarios.  
   - **Continuous Maintenance:** Regular updates and monitoring are needed to ensure the model remains accurate with new regulations and data changes.  


In [None]:
!pip install langchain[all]

In [None]:

!pip install langchain==0.3.0 --quiet
!pip install langchain_core==0.3.15 --quiet
!pip install langchain_community==0.3.0 --quiet
!pip install langchain_text_splitters==0.3.0 --quiet
!pip install langchain_experimental==0.3.0 --quiet
!pip install langchain_openai==0.2.0 --quiet
!pip install httpx==0.27.2 --quiet
!pip install faiss-cpu==1.8.0 --quiet
!pip install pdfplumber==0.11.0 --quiet


# Section A: Initial Setup - Load Libraries, Keys...


In [None]:
!pip install --upgrade langchain langchain-openai

In [None]:
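import os
from langchain_openai import OpenAIEmbeddings
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # read the key from the environment instead of hard-coding it
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=OPENAI_API_KEY)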



In [None]:
review_text = "The product is amazing!"
review_embedding = embeddings.embed_query(review_text)
review_embedding
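
The value returned by `embed_query` is a list of floats. Texts with similar meaning should map to vectors that point in similar directions, which is commonly measured with cosine similarity. The sketch below illustrates the idea with small made-up vectors; the three vectors and their names are hypothetical and are not real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real embeddings have far more dimensions
positive_review = [0.9, 0.1, 0.2]
similar_review = [0.8, 0.2, 0.1]
unrelated_text = [0.1, 0.9, 0.7]

# The similar pair should score higher than the unrelated pair
print(cosine_similarity(positive_review, similar_review))
print(cosine_similarity(positive_review, unrelated_text))
```

With real embeddings, the same function could be applied to two `embed_query` results to compare two pieces of text.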

In [None]:
from langchain_openai import OpenAI, OpenAIEmbeddings

In [None]:
EMBEDDING_MODEL  = "text-embedding-3-small"
GENERATION_MODEL = "gpt-3.5-turbo-instruct"

llm = OpenAI(model=GENERATION_MODEL)
embed_model = OpenAIEmbeddings(model=EMBEDDING_MODEL)

# Section B: Testing Whether GPT-3.5 Works


In [None]:
llm.invoke("What is the Capital of India")

'\n\nThe capital of India is New Delhi.'

In [None]:
llm.invoke("What is a Quarterly Revenue of Infosys in USD for recent quarter?")

'\n\nThe quarterly revenue of Infosys in USD for the most recent quarter (Q4 2020) was $3.3 billion.'

# Section C: Augment with 10-K/10-Q (PDF)

In [None]:
pdfFile = "/media/kavi/EFLabs/RAG_MODEL/InfosysPressReleaseQ32024.pdf"

from langchain_community.document_loaders import PDFPlumberLoader
loader = PDFPlumberLoader(pdfFile)
docs = loader.load()

# Check the number of pages
print("Number of pages in the PDF:",len(docs))

# Print the content of the third page (index 2) as a sample
print(docs[2].page_content)  # Sample content

Number of pages in the PDF: 8
IFRS – USD
Press Release
user experience (UX), and cloud-powered digital services. Sven Bauer, Head of Software at
Polestar, said, “Polestar is starting a new chapter in the company’s global setup with our
partner Infosys in Bengaluru. We look forward to building automotive competence in the
Polestar Tech Hub to support our growing vehicle portfolio and new model launches.”
• Infosys announced a successful collaboration with the Life Insurance Corporation of India (LIC)
to spearhead its digital transformation initiative called DIVE. Shri Siddhartha Mohanty, CEO
& MD, LIC, said, “Our collaboration with Infosys marks a significant milestone in our digital
transformation journey. It will not only enhance our operational capabilities, but also enable us
to cater to our vast customer, agent and employee base with newer, more personalized
experiences. We are committed to leveraging the latest technologies that Infosys has to offer,
including Cloud and Enterprise

# Section D: Pre-processing of Data

## Step D1. Split the document into Chunks

- The SemanticChunker splits text into chunks based on semantic similarity, ensuring that related content stays together in the same chunk.
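
To make the idea concrete, here is a minimal sketch of similarity-based chunking. It is not the `SemanticChunker` implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the fixed `threshold` is an assumed simplification chosen for this example.

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(text, threshold=0.3):
    """Start a new chunk wherever adjacent sentences are less similar than `threshold`."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])      # topic shift: open a new chunk
        else:
            chunks[-1].append(curr)    # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]

# Hypothetical sample text with two topics
sample = (
    "Revenue grew in the quarter. Revenue growth was broad based in the quarter. "
    "The board declared a dividend. The dividend will be paid by the board in November."
)
for chunk in semantic_chunks(sample):
    print(chunk)
```

The two revenue sentences end up in one chunk and the two dividend sentences in another, because the similarity between the second and third sentence falls below the threshold.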

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

text_splitter = SemanticChunker(HuggingFaceEmbeddings())
documents = text_splitter.split_documents(docs)


In [None]:
print(len(documents))  # Number of Chunks

13


In [None]:
# Inspect the third chunk (index 2); after chunking, its content differs from the third page printed earlier
print(documents[2].page_content)

IFRS – USD
Press Release
1. Client wins & Testimonials
• Infosys announced that it has entered into a long-term collaboration with Metro Bank to
enhance some of its IT and support functions, while digitally transforming the bank’s business
operations. Daniel Frumkin, Metro Bank Chief Executive Officer, said, “This collaboration
with a world class provider like Infosys builds on the solid foundations we have already laid,
unleashing our true potential, and creating a sustainably profitable and scalable organization
that is fit for the future. At the end of this transformation, we will be a very different business,
but the true essence of Metro Bank will remain the same – a high-quality service organization
putting customers centre-stage. Metro Bank expects to deliver £80m of annualized cost
savings this year across multiple initiatives, as it progresses towards the target of reaching mid-
to-high teen Return on Tangible Equity by 2027. Our vision for Metro Bank in 2025 and beyond,
place

## Step D2. Create embeddings for each text chunk
- Convert each text chunk (unstructured data) into a numeric vector representation
- Store the vectors in a purpose-built database, a vector database
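
Conceptually, a vector database stores each vector alongside its text and returns the entries nearest to a query vector. The sketch below shows that idea with a brute-force search in plain Python. `TinyVectorStore`, its methods, and the two-dimensional vectors are hypothetical names and values made up for illustration; libraries such as FAISS perform this kind of nearest-neighbour search far more efficiently.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs and searches by brute force."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        """Store one vector together with the text it represents."""
        self._items.append((list(vector), text))

    def search(self, query_vector, k=3):
        """Return the k texts whose vectors have the smallest Euclidean distance to the query."""
        scored = sorted(
            (math.dist(query_vector, vector), text) for vector, text in self._items
        )
        return [text for _, text in scored[:k]]

store = TinyVectorStore()
store.add([0.9, 0.1], "Chunk about revenue growth")
store.add([0.8, 0.3], "Chunk about operating margin")
store.add([0.1, 0.9], "Chunk about client testimonials")

# The two chunks closest to the query vector are returned, nearest first
print(store.search([1.0, 0.0], k=2))
```

The `k` argument plays the same role as the number of chunks a retriever is asked to return.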

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Instantiate the embedding model
embedder = HuggingFaceEmbeddings()

# Create the vector store
vector = FAISS.from_documents(documents, embedder)



## Step D3. Test with Sample Retrieval of Data from the vector database

In [None]:
# Build a retriever that returns the 3 most similar chunks, then run a sample query
retriever = vector.as_retriever(search_type="similarity", search_kwargs={"k": 3})
retrieved_docs = retriever.invoke("What is name of the CFO?")


In [None]:
# The CFO's name appears near the end of this chunk of text
print(retrieved_docs[0].page_content)

Free cash flow for Q2 was at $839 million, growing 25.2% year
on year. TCV of large deal wins was $2.4 billion, 41% being net new. H1 revenues grew at 2.9% year over year in constant currency. Operating margin for H1 was at 21.1%. “We had strong growth of 3.1% quarter-on-quarter in constant currency in Q2. The growth was broad
based with good momentum in financial services. This stems from our strength in industry expertise,
market leading capabilities in cloud with Cobalt and generative AI with Topaz, resulting in growing client
preference to partner with us”, said Salil Parekh, CEO and MD. “Our large deals at $2.4 billion in Q2
reflect our differentiated position. I am grateful to our employees for their unwavering commitment to our
client as we further strengthen our market leadership” he added. 3.1% QoQ 21.1% 4.7% YoY $2.4 Bn $839 Mn
3.3% YoY Operating EPS Increase Large Deal Free
CC Growth Margin (₹ terms) TCV Cash Flow
Guidance for FY25:
• Revenue growth of 3.75%-4.50% in constan

# Section E: Augmentation

In [None]:
from langchain.chains import RetrievalQA
from langchain.chains.llm import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.prompts import PromptTemplate

In [None]:
prompt = """
1. Use the following pieces of context to answer the question at the end.
2. If you don't know the answer, just say "I don't know"; don't make up an answer on your own.
3. Keep the answer crisp and limited to 3-4 sentences.

Context: {context}

Question: {question}

Helpful Answer:"""
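
The template has two placeholders, `{context}` and `{question}`. A "stuff"-style approach formats each retrieved chunk, joins the results into one block of text, and substitutes that block for `{context}`. The plain-Python sketch below illustrates the substitution only. The shortened templates, the `stuff_prompt` helper, the `press_release.pdf` source name, and the `"\n\n"` separator are assumptions made for this example.

```python
# Hypothetical, shortened templates for illustration
template = "Context: {context}\n\nQuestion: {question}\n\nHelpful Answer:"
document_template = "content:{page_content}\nsource:{source}"

# Two sentences taken from the press release, with a made-up source name
retrieved = [
    {"page_content": "Operating margin for H1 was at 21.1%.", "source": "press_release.pdf"},
    {"page_content": "Free cash flow for Q2 was at $839 million.", "source": "press_release.pdf"},
]

def stuff_prompt(template, document_template, docs, question):
    """Format each chunk, join them, and fill the template's two placeholders."""
    context = "\n\n".join(document_template.format(**doc) for doc in docs)
    return template.format(context=context, question=question)

final_prompt = stuff_prompt(
    template, document_template, retrieved, "What was the H1 operating margin?"
)
print(final_prompt)
```

The printed text is what the LLM would receive: the retrieved passages first, then the question, then the cue for the answer.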

In [None]:
QA_CHAIN_PROMPT = PromptTemplate.from_template(prompt)

llm_chain = LLMChain(
                  llm=llm,
                  prompt=QA_CHAIN_PROMPT,
                  callbacks=None,
                  verbose=False)

document_prompt = PromptTemplate(
    input_variables=["page_content", "source"],
    template="Context:\ncontent:{page_content}\nsource:{source}",
)

combine_documents_chain = StuffDocumentsChain(
                  llm_chain=llm_chain,
                  document_variable_name="context",
                  document_prompt=document_prompt,
                  callbacks=None,
              )



In [None]:
qa = RetrievalQA(
                  combine_documents_chain=combine_documents_chain,
                  verbose=False,
                  retriever=retriever,
                  return_source_documents=False,
              )



# Section F: Testing

In [None]:
# Ask a question whose answer is in the PDF
# Note: deprecation warnings can be ignored as long as a response is returned
print(qa.invoke({"query": "What is name of the CFO?"})["result"])



 The name of the CFO mentioned in the context is Jayesh Sanghrajka.


In [None]:
from pprint import pprint

pprint(qa.invoke({"query": "What is a Quarterly Revenue of Infosys in USD for recent quarter?"}))

{'query': 'What is a Quarterly Revenue of Infosys in USD for recent quarter?',
 'result': ' Infosys reported a quarterly revenue of $4,894 million in USD for '
           'the recent quarter, with a sequential growth of 3.1% and a '
           'year-on-year growth of 3.3% in constant currency. This information '
           'was extracted from the audited condensed consolidated Balance '
           'sheet and Statement of Comprehensive Income for the quarter and '
           'six months ended September 30, 2024, which have been taken on '
           'record at the Board meeting held on October 17, 2024.'}
