### RAG Bot for Q&A from a PDF Document

`PyPDFLoader` loads the file so that the bot can analyse it. Here we use `Sample_Financial_Statement.pdf`; the web application built later will let the user upload a document instead. Note that `PyPDFLoader` returns the PDF as a list of pages rather than as a single document, so an individual page can be accessed by indexing the list with `[]`. The list is zero-indexed: for example, `print(data[5])` prints the sixth page.

In [1]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('Sample_Financial_Statement.pdf')
data =  loader.load()
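
Page indexing is easy to get wrong by one, so here is a minimal sketch with stand-in objects rather than a real PDF. The `FakePage` class and its contents are hypothetical: it only imitates the `page_content` and `metadata` attributes of the loaded pages, on the assumption that the loader records a zero-based `page` number in `metadata`.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Hypothetical stand-in for one loaded page (not a LangChain class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Imitate a loaded 8-page PDF: one list entry per page, numbered from 0.
fake_data = [
    FakePage(page_content=f"Text of page {i + 1}", metadata={"page": i})
    for i in range(8)
]

sixth_page = fake_data[5]    # index 5 is the sixth page
seventh_page = fake_data[6]  # index 6 is the seventh page
print(len(fake_data), sixth_page.page_content)
```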


We need to split the pages into smaller chunks so that each chunk can be embedded and retrieved on its own. For this we use the `RecursiveCharacterTextSplitter` class from the `langchain.text_splitter` module. It tries a list of separators in turn (paragraph breaks first, then line breaks, then spaces) so that chunks stay within `chunk_size` characters while keeping related text together where possible, which makes it a reasonable general-purpose choice here.

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs =  text_splitter.split_documents(data)
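
To make the idea of recursive splitting concrete, the following is a simplified sketch written from scratch. It is not the library's implementation: the function `recursive_split` is hypothetical, it ignores chunk overlap, and it drops the separator between chunks. It only shows the principle of trying coarse separators first and falling back to finer ones when a piece is still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters (toy version)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
        if len(part) <= chunk_size:
            current = part
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Revenue grew.\n\nCosts fell.\n\n" + "word " * 50
sample_chunks = recursive_split(sample, chunk_size=40)
print(len(sample_chunks), max(len(c) for c in sample_chunks))
```

The two short paragraphs fit together in one chunk, while the long run of words is broken at spaces so that no chunk exceeds the limit.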

Next we import `GoogleGenerativeAIEmbeddings` to convert each chunk into an embedding vector, which is what the retriever compares against a query to find relevant chunks. We also need a database to store the vectors, so we import the `Chroma` vector store. Finally, calling the Gemini API requires an API key: the key is stored in a `.env` file, and `load_dotenv` loads it into the environment.

In [49]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma
from dotenv import load_dotenv

load_dotenv()

embeddings  = GoogleGenerativeAIEmbeddings(model='models/embedding-001')
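
Before creating the embeddings object it can help to confirm that the key was actually loaded. The sketch below assumes the key is stored under the name `GOOGLE_API_KEY`, which is the environment variable `langchain_google_genai` is understood to read by default. The helper `require_api_key` is hypothetical and not part of any library.

```python
import os


def require_api_key(name="GOOGLE_API_KEY", env=None):
    """Return the key stored under `name`, or fail with a clear message."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to the .env file")
    return key


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"GOOGLE_API_KEY": "dummy-key-for-illustration"}
print(len(require_api_key(env=demo_env)))
```

In the notebook itself, calling `require_api_key()` right after `load_dotenv()` would check the real environment.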

The following step embeds the chunks and stores the resulting vectors in the Chroma database (persisted under `./data`) so that the bot can query them later.

In [50]:
# Reuse the embeddings object created above instead of building a second one.
store_vector = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./data",
)

With the vectors stored, we turn the store into a retriever and call its `invoke` method to fetch the chunks most similar to a query. The number of chunks returned is set by the `search_kwargs = {"k":10}` parameter.

In [40]:
retriver = store_vector.as_retriever(search_type =  "similarity", search_kwargs = {"k":10})
retrived_docs = retriver.invoke("What is  investment?")
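
The following toy sketch illustrates what "most similar" and `k` mean. It is an assumption-laden stand-in, not what Chroma or the Gemini embedding model do internally: `toy_embed` is a hypothetical word-count "embedding", and cosine similarity stands in for whatever metric the vector store actually uses.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k):
    """Return the k chunks whose vectors are most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)),
                    reverse=True)
    return ranked[:k]


toy_chunks = [
    "investments in government bonds",
    "payment of lease liabilities",
    "investments in equity securities",
]
toy_hits = top_k("investments in bonds", toy_chunks, k=2)
print(toy_hits)
```

A real embedding model captures meaning rather than exact word overlap, but the ranking-and-truncating step is the same idea.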

Here the query `"What is  investment?"` retrieves the 10 most similar excerpts from the document and stores them in `retrived_docs`. Because the list is zero-indexed, `print(retrived_docs[8].page_content)` prints the content of the ninth excerpt.

In [51]:
print(retrived_docs[8].page_content)

Investments in government bonds                            28                         28                                —                                 — 
Investments in non convertible debentures                       3,868                    1,793                       2,075                                 — 
Investment in government securities                       7,632                       7,549                                83                                 — 
Investments in equity securities                              3                            —                                —                                   3 
Investments in preference securities                          193                            —                                —                               193 
Investments in commercial papers                          742                            —                              742                             —


As the output above shows, the retriever matched the query about investment to a passage of the document that lists investments, and that passage is one of the excerpts stored in the variable.

We now import `ChatGoogleGenerativeAI` from `langchain_google_genai`. This is the large language model that will answer questions about the document.

In [52]:
from langchain_google_genai import ChatGoogleGenerativeAI

We use the `"gemini-1.5-pro"` model as our LLM. Setting the temperature to 0.3 makes the outputs less random, and setting `max_tokens` to 500 caps each response at 500 tokens (not characters).

In [54]:
llm =  ChatGoogleGenerativeAI(model =  "gemini-1.5-pro", temperature=0.3,max_tokens=500)
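
As a rough intuition for what the temperature does, the sketch below applies a temperature-scaled softmax to a few made-up scores. This is the textbook picture of temperature in sampling, shown under the assumption that it carries over to Gemini; the scores and the function name are hypothetical, and the model's actual internals are not visible from the API.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_scores = [2.0, 1.0, 0.5]                 # made-up scores for three candidate tokens
probs_low = softmax_with_temperature(toy_scores, temperature=0.3)
probs_high = softmax_with_temperature(toy_scores, temperature=1.0)
print([round(p, 3) for p in probs_low])
print([round(p, 3) for p in probs_high])
```

At 0.3 almost all of the probability sits on the top-scoring candidate, so sampled outputs vary less than at 1.0, but they can still vary.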

Next we need a pipeline that tells the LLM the order in which things happen. We assemble a chain using the `langchain.chains` module and define the chat prompt with the `ChatPromptTemplate` class from the `langchain.prompts` module.

In [55]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import ChatPromptTemplate

Now we write a system prompt that tells the bot how to respond to user queries. The `ChatPromptTemplate` defines the format in which messages are passed to the LLM: the system message carries the instructions and the retrieved `{context}`, and the human message carries the user's `{input}`.

In [56]:
system_prompt = (
    # Adjacent string literals are joined with no separator,
    # so each piece ends with an explicit space.
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system",system_prompt),
        ("human", "{input}"),
    ]
)
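
To show how the two placeholders get filled, here is a minimal sketch using plain `str.format` on a shortened prompt. It only mimics the substitution step: `fill_messages` and the `toy_` names are hypothetical, and the real `ChatPromptTemplate` produces message objects rather than tuples.

```python
# Adjacent string literals are concatenated with nothing in between,
# so the first piece ends with a space to keep the sentences apart.
toy_system_prompt = (
    "Use the retrieved context to answer the question. "
    "Keep the answer concise."
    "\n\n"
    "{context}"
)

toy_messages = [
    ("system", toy_system_prompt),
    ("human", "{input}"),
]


def fill_messages(messages, **values):
    """Substitute the named placeholders in every message template."""
    return [(role, template.format(**values)) for role, template in messages]


filled = fill_messages(
    toy_messages,
    context="Investments in government bonds: 28",
    input="What is investment?",
)
for role, text in filled:
    print(f"[{role}] {text}")
```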

We now create the chain, which fixes the order of the steps. `create_stuff_documents_chain` inserts the retrieved excerpts into the prompt's `{context}` slot and sends the prompt to the LLM; `create_retrieval_chain` runs the retriever on the user's query first and passes the results to that step. The prompt defined earlier also tells the LLM what form the response should take.

In [46]:
question_answer_chain = create_stuff_documents_chain(llm,prompt)
rag_chain =  create_retrieval_chain(retriver,question_answer_chain)
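
The order of operations inside the chain can be sketched without any LLM at all. The version below is a hypothetical imitation, not LangChain's code: `make_toy_rag_chain`, `fake_retrieve`, and `fake_generate` are made-up names, and the "model" simply reports what it was given. It assumes, as the real chain is understood to do, that the excerpts are joined with blank lines and that the result is a dictionary with `input`, `context`, and `answer` keys.

```python
def make_toy_rag_chain(retrieve, generate):
    """Compose a retriever and a generator into one callable (toy version)."""
    def invoke(inputs):
        question = inputs["input"]
        excerpts = retrieve(question)               # step 1: fetch relevant text
        context = "\n\n".join(excerpts)             # step 2: "stuff" it into one block
        answer = generate(context=context, question=question)  # step 3: ask the model
        return {"input": question, "context": excerpts, "answer": answer}
    return invoke


def fake_retrieve(question):
    """Stand-in retriever that always returns the same two excerpts."""
    return ["Investments in government bonds: 28",
            "Investments in equity securities: 3"]


def fake_generate(context, question):
    """Stand-in model that reports how much context it received."""
    return f"Answering '{question}' from {len(context.splitlines())} context lines."


toy_chain = make_toy_rag_chain(fake_retrieve, fake_generate)
toy_response = toy_chain({"input": "What is investment?"})
print(toy_response["answer"])
```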

We can now use the bot: calling `invoke` on the chain sends a query about the loaded file and returns the response.

In [47]:
response = rag_chain.invoke({"input": "Show me the summary of the file"})
print(response["answer"])

Infosys Americas Inc. was liquidated on July 14, 2023, and several oddity GmbH subsidiaries merged on September 29, 2023. Infosys Ltd. acquired Danske IT (renamed Idunn Information Technology) on September 1, 2023.  Financial statements show asset balances, additions, deletions, and depreciation through March 31, 2024, along with revenue, expenses, and profit figures. Investment details in various securities are also provided.


In [48]:
response = rag_chain.invoke({"input": "Show me the summary of the file"})
print(response["answer"])

Infosys acquired Danske IT (renamed Idunn Information Technology) on September 1, 2023.  Several oddity GmbH subsidiaries merged into WongDoody GmbH on September 29, 2023. Infosys Americas was liquidated on July 14, 2023.


As the two runs above show, the bot's answers are drawn from the document. Because the temperature is 0.3 rather than 0, sampling retains some randomness, so the same query can produce a differently worded answer on each run.
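
One way to check which parts of the document an answer was drawn from is to look at the excerpts returned alongside it. The helper below is a hypothetical sketch: it assumes the response dictionary has a `context` list whose items expose `page_content` and a `metadata` dictionary with a zero-based `page` entry, and it is demonstrated here on a stand-in object rather than a real response.

```python
from types import SimpleNamespace


def summarize_sources(response, preview_chars=60):
    """List the page number and a short preview for each retrieved excerpt."""
    lines = []
    for doc in response.get("context", []):
        page = doc.metadata.get("page", "?")
        # Collapse runs of whitespace so table rows stay readable.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"page {page}: {preview}")
    return lines


# Stand-in for a chain response; the values are made up for illustration.
fake_response = {
    "answer": "...",
    "context": [
        SimpleNamespace(
            page_content="Investments in government bonds      28      28",
            metadata={"page": 4},
        ),
    ],
}
source_lines = summarize_sources(fake_response)
print("\n".join(source_lines))
```

Applied to a real response, `summarize_sources(response)` would give a quick view of the pages behind an answer.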

In [57]:
response = rag_chain.invoke({"input": "What is the gross profit for Q3 2024?"})
print(response["answer"])

The gross profit for Q3 2024 is ₹11,175 crore.  This is calculated as revenue from operations less cost of sales.  The information is found in the provided condensed consolidated statement of profit and loss.


In [59]:
response = rag_chain.invoke({"input": "How do the comprehensive income and operating expenses compare for Q1 2024?"})
print(response["answer"])

The provided information is a consolidated statement of profit and loss, it does not contain comprehensive income data.  It does show operating expenses totaling ₹3,554 crore in Q1 2024. Therefore, I cannot provide a comparison without the comprehensive income figure.


In [60]:
response = rag_chain.invoke({"input": "Show me the table for Cash flows from financing activities"})
print(response["answer"])

Cash flows from financing activities:

| Item                                            | Current Year | Prior Year |
|-------------------------------------------------|-------------:|-----------:|
| Payment of lease liabilities                    |     (2,024) |    (1,231) |
| Payment of dividends                            |    (14,692) |   (13,631) |
| Payment of dividend to non-controlling interest |        (39) |       (22) |
| Buyback of shares (non-controlling interest)    |        (18) |         — |
| Shares issued (employee stock options)          |          5 |        35 |
| Other receipts                                  |          - |       132 |
| Other payments                                   |       (736) |      (479) |


In [62]:
response = rag_chain.invoke({"input": "What are the acquisitions during the year ended March 31, 2023"})
print(response["answer"])

Net assets were ₹103 crore. Intangible assets included customer contracts and relationships (₹274 crore), vendor relationships (₹30 crore), and brand (₹24 crore), with deferred tax liabilities on intangible assets of ₹(80) crore.  The total purchase price allocated was ₹351 crore.


In [64]:
response = rag_chain.invoke({"input": "What are the proposed acquisitions during the year 2024?"})
print(response["answer"])

Infosys proposed two acquisitions in 2024.  They planned to acquire in-tech Holding GmbH, a German Engineering R&D services provider, for up to €450 million.  They also proposed acquiring InSemi Technology Services Private Limited, an Indian semiconductor design services company, for up to ₹280 crore.


In [65]:
response = rag_chain.invoke({"input": "What is the accounting policy of the company? Answer in 4 sentences"})
print(response["answer"])

The company assesses contracts at inception to determine if they contain a lease based on control of an identified asset.  Lease liabilities are initially measured at amortized cost, using the present value of future lease payments.  Claims against the group are categorized as contingent liabilities, particularly those related to tax matters.  The company also recognizes various expenses such as rates and taxes, consumables, insurance, and contributions to corporate social responsibility.
