# RAG for Finance - Introduction



### Mounting Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Installing Necessary Libraries

In [2]:
!pip install langchain langchain-core langchain-groq langchain-community chromadb pypdf langchain-huggingface xformers
!pip uninstall transformers -y
!pip install transformers
!pip uninstall torch torchvision -y
!pip install torch torchvision

Collecting langchain-groq
  Downloading langchain_groq-0.2.4-py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.16-py3-none-any.whl.metadata (2.9 kB)
Collecting chromadb
  Downloading chromadb-0.6.3-py3-none-any.whl.metadata (6.8 kB)
Collecting pypdf
  Downloading pypdf-5.2.0-py3-none-any.whl.metadata (7.2 kB)
Collecting langchain-huggingface
  Downloading langchain_huggingface-0.1.2-py3-none-any.whl.metadata (1.3 kB)
Collecting xformers
  Downloading xformers-0.0.29.post2-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting groq<1,>=0.4.1 (from langchain-groq)
  Downloading groq-0.16.0-py3-none-any.whl.metadata (14 kB)
Collecting langchain-core
  Downloading langchain_core-0.3.33-py3-none-any.whl.metadata (6.3 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Downlo

Found existing installation: transformers 4.47.1
Uninstalling transformers-4.47.1:
  Successfully uninstalled transformers-4.47.1
Collecting transformers
  Downloading transformers-4.48.2-py3-none-any.whl.metadata (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.4/44.4 kB 2.4 MB/s eta 0:00:00
Downloading transformers-4.48.2-py3-none-any.whl (9.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.7/9.7 MB 70.8 MB/s eta 0:00:00
Installing collected packages: transformers
Successfully installed transformers-4.48.2
Found existing installation: torch 2.6.0
Uninstalling torch-2.6.0:
  Successfully uninstalled torch-2.6.0
Found existing installation: torchvision 0.20.1+cu124
Uninstalling torchvision-0.20.1+cu124:
  Successfully uninstalled torchvision-0.20.1+cu124
Collecting torch
  Using cached torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchvision
  Downloading torchvi

### Importing Necessary Libraries

In [3]:
# LLM Service and Embeddings
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings

# Document Processing
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Chain Management
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain import hub

### Defining Constants

In [4]:
from google.colab import userdata
# import chromadb

GROQ_API_KEY = userdata.get('GROQ_API_KEY')
MODEL_ID = "deepseek-r1-distill-llama-70b"
PDF_PATH = "/content/drive/MyDrive/2025_Projects/interviewclub_rag_for_finance/workshop_exercise/10-Q4-2024-As-Filed.pdf"

# # Initialize ChromaDB client
# db = chromadb.PersistentClient(path="./chroma_db")

### Setting Up Chat Model Inference

In [5]:
llm = ChatGroq(model=MODEL_ID, api_key=GROQ_API_KEY)

# Example usage: single invocation (returns the full response at once)
# result = llm.invoke("Explain the Fibonacci sequence.")
# print(result.content)

# Example usage: streaming (prints tokens as they arrive)
# for chunk in llm.stream("Explain the Fibonacci sequence."):
#     print(chunk.content, end="")

### Loading PDF documents

In [6]:
pdf_loader = PyPDFLoader(PDF_PATH)
documents = pdf_loader.load()

In [7]:
len(documents)

121

### Splitting the Document into Manageable Chunks

In [8]:
# Split text into chunks
def split_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100
    )
    return text_splitter.split_documents(documents)

In [9]:
docs = split_documents(documents)

To learn more about text splitters, see: https://medium.com/@harsh.vardhan7695/mastering-text-splitting-in-langchain-735313216e01
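
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sketch. The helper `split_with_overlap` below is hypothetical and is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraph and line breaks). It only shows fixed-size windows in which neighbouring chunks share some characters, so that a sentence cut at a boundary still appears whole in at least one chunk.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into fixed-size windows that share `chunk_overlap` characters.

    Simplified illustration only; not the LangChain implementation.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=1000` and `chunk_overlap=100`, as used above, each window would advance by 900 characters in this simplified model.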

### Loading Embeddings from HuggingFace model hub

In [13]:
model_name = "NovaSearch/stella_en_400M_v5"
model_kwargs = {'device': 'cuda:0', 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': False}
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

Some weights of the model checkpoint at NovaSearch/stella_en_400M_v5 were not used when initializing NewModel: {'new.pooler.dense.weight', 'new.pooler.dense.bias'}
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model selected based on: https://huggingface.co/spaces/mteb/leaderboard
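
The `normalize_embeddings` flag in `encode_kwargs` controls whether each embedding vector is scaled to unit length. The sketch below, which uses made-up two-dimensional vectors rather than real embeddings, shows why that can matter: the raw dot product depends on vector length, while the dot product of normalized vectors equals the cosine similarity. How much this affects retrieval depends on the distance metric the vector store is configured with, which this notebook does not set explicitly.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (assumed for illustration): b points the same way as a but is twice as long.
a = [3.0, 4.0]
b = [6.0, 8.0]

raw_score = dot(a, b)                                     # grows with vector length
cosine_score = cosine_similarity(a, b)                    # depends on direction only
normalized_score = dot(l2_normalize(a), l2_normalize(b))  # matches the cosine score

print(raw_score, cosine_score, normalized_score)
```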

### Creating and Storing Embeddings in ChromaDB

In [11]:
# Create embeddings and store them in ChromaDB
def create_vectorstore(documents):
    # Uses the HuggingFace embedding model `hf` defined in the previous section
    persist_directory = "chroma_db"
    vectorstore = Chroma.from_documents(
        documents,
        hf,
        persist_directory=persist_directory,
        collection_name="financial_docs",
    )
    return vectorstore

In [14]:
vectorstore = create_vectorstore(docs)

To confirm that our embedding model works correctly, we run a quick similarity search:


In [15]:
query = "What is Apple’s current ratio over the last 3 years, and how has it changed?"
# Use a separate name so the list of chunks in `docs` is not overwritten
retrieved_docs = vectorstore.similarity_search(query, k=5)
for doc in retrieved_docs:
    print(doc.page_content, "----sep-----")

2023 spanned 53 weeks.
Macroeconomic Conditions
Macroeconomic conditions, including inflation, interest rates and currency fluctuations, have directly and indirectly impacted, and 
could in the future materially impact, the Company’s results of operations and financial condition.
Apple Inc. | 2024 Form 10-K | 21 ----sep-----
Company Stock Performance
The following graph shows a comparison of five-year cumulative total shareholder return, calculated on a dividend-reinvested 
basis, for the Company, the S&P 500 Index and the Dow Jones U.S. Technology Supersector Index. The graph assumes $100 
was invested in each of the Company’s common stock, the S&P 500 Index and the Dow Jones U.S. Technology Supersector 
Index as of the market close on September 27, 2019. Past stock price performance is not necessarily indicative of future stock 
price performance.
COMPARISON OF 5-YEAR CUMULATIVE TOTAL RETURN
Among Apple Inc., the S&P 500 Index and the Dow Jones U.S. Technology Supersector Index
Apple

### Setting Up Retrieval Chain

In [16]:
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
combine_docs_chain = create_stuff_documents_chain(
    llm, retrieval_qa_chat_prompt
)

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10})
retrieval_chain = create_retrieval_chain(retriever, combine_docs_chain)



Source of the prompt: https://smith.langchain.com/hub/langchain-ai/retrieval-qa-chat
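
Conceptually, the chain built above retrieves the top `k` chunks for a question and "stuffs" them into a single prompt for the LLM. The following is a minimal sketch of that flow in plain Python. Everything in it is an assumption made for illustration: `toy_retrieve` ranks chunks by shared words as a stand-in for embedding similarity, and the prompt wording in `stuff_documents` is invented, not the text of the `langchain-ai/retrieval-qa-chat` prompt.

```python
def toy_retrieve(question: str, chunks: list[str], k: int) -> list[str]:
    """Rank chunks by the number of words shared with the question.

    Hypothetical stand-in for the vector-store retriever.
    """
    question_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def stuff_documents(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate all retrieved chunks into one prompt (the "stuff" strategy)."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )


# Toy corpus (assumed for illustration only)
corpus = [
    "Total current assets were reported on the balance sheet.",
    "The company sells phones.",
    "Total current liabilities were also reported.",
]
question = "What are total current assets?"

top_chunks = toy_retrieve(question, corpus, k=2)
prompt = stuff_documents(question, top_chunks)
print(prompt)
```

Because every retrieved chunk is placed in the prompt, a larger `k` gives the model more context but also a longer prompt.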

Regular Output

In [18]:
results = retrieval_chain.invoke({"input": "What is Apple’s current ratio over the last 3 years, and how has it changed?"})
print(results['answer'])

<think>
Okay, I need to figure out Apple's current ratio over the last three years and how it has changed. I remember that the current ratio is calculated by dividing current assets by current liabilities. So, first, I should find the relevant data from the context provided.

Looking through the context, I see the consolidated balance sheets for September 28, 2024, and September 30, 2023. The current assets and current liabilities are listed there. For 2024, current assets are $152,987 million and current liabilities are $176,392 million. For 2023, current assets were $143,566 million and current liabilities were $145,308 million. 

Wait, I also need data for 2022 to cover the last three years. Let me check if the 2022 figures are provided. Hmm, in the Notes section, I see a mention of fiscal years 2022, 2023, and 2024, but I don't immediately see the specific current assets and liabilities for 2022. Maybe I need to look more carefully.

Wait, the context includes a section labeled "CO
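
The arithmetic the model describes can be checked by hand. The sketch below uses only the figures quoted in the model's reasoning above (in millions of USD); they are taken from that output and have not been verified independently against the filing here. No fiscal 2022 figures appear in the visible output, so that year is left out.

```python
def current_ratio(current_assets: float, current_liabilities: float) -> float:
    """Current ratio = current assets / current liabilities."""
    return current_assets / current_liabilities


# Figures as quoted in the model output above, in millions of USD
figures = {
    "2024": {"current_assets": 152_987, "current_liabilities": 176_392},
    "2023": {"current_assets": 143_566, "current_liabilities": 145_308},
}

ratios = {year: current_ratio(**values) for year, values in figures.items()}

for year, ratio in sorted(ratios.items()):
    print(f"FY{year}: current ratio = {ratio:.2f}")
```

A ratio below 1 means current liabilities exceed current assets for that year.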

Streaming Output

In [19]:
for item in retrieval_chain.stream({"input": "What is Apple’s current ratio over the last 3 years, and how has it changed?"}):
    # Only the chunks carrying part of the answer are printed
    if item.get('answer'):
        print(item.get('answer'), end="")

<think>
Alright, so I need to figure out Apple's current ratio over the last three years and see how it's changed. I remember that the current ratio is a liquidity measure, calculated by dividing current assets by current liabilities. It gives an idea of a company's ability to cover its short-term debts with its short-term assets.

Looking at the context provided, I see that Apple's Form 10-K for 2024 is included. The context mentions the consolidated balance sheets for September 28, 2024, and September 30, 2023. I also notice that the fiscal year 2022 is mentioned, so I can probably find data for 2022 as well.

First, I'll extract the current assets and current liabilities for each year. From the balance sheet section:

- For September 28, 2024:
  - Total current assets: $152,987 million
  - Total current liabilities: $176,392 million

- For September 30, 2023:
  - Total current assets: $143,566 million
  - Total current liabilities: $145,308 million

- For September 24, 2022 (since 2

In [20]:
for item in retrieval_chain.stream({"input": "How does Apple’s ratio analysis reflect its financial health compared to industry standards?"}):
    # Only the chunks carrying part of the answer are printed
    if item.get('answer'):
        print(item.get('answer'), end="")

<think>
Okay, I need to figure out how Apple's ratio analysis reflects its financial health compared to industry standards. Let me start by recalling what I know about financial ratios and what they indicate. There are several key ratios like profit margin, ROE, ROA, debt-to-equity, current ratio, etc., each giving insight into different aspects of a company's financial health.

Looking at the context provided, I don't see explicit mentions of these ratios. However, I can infer some information based on the data given. For example, the stock performance graph shows that Apple's stock has grown from $100 to $430 over five years, which is a 330% increase. Comparing this to the S&P 500 and the Dow Jones Technology index would give a relative performance, but without their exact figures, it's hard to quantify. Still, a 330% increase is quite significant and suggests strong performance.

The context mentions that Apple's net sales through direct and indirect channels were 38% and 62% respec
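
The answers above begin with a `<think>` section containing the model's reasoning. If only the final answer is wanted, that section can be removed after the fact. The helper `strip_reasoning` below is a hypothetical sketch; it assumes the reasoning is closed by a matching `</think>` tag, which the truncated outputs above do not show.

```python
import re

# Matches a <think>...</think> block, including newlines inside it
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(answer: str) -> str:
    """Remove <think>...</think> sections and surrounding whitespace."""
    return THINK_BLOCK.sub("", answer).strip()


# Made-up example text, not actual model output
example = "<think>\nFirst, find the balance sheet figures.\n</think>\nThe current ratio declined."
print(strip_reasoning(example))
```

In the notebook this could be applied to `results['answer']` from the regular (non-streaming) call, since the full text is available there at once.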