

# 📑 GenAI-Powered Financial Policy Assistant: End-to-End Solution


## 💼 1. Business Problem

In banks, insurance firms, and other financial institutions, professionals face **thousands of pages of compliance manuals, regulatory filings, and internal policy documents**.

- Searching manually wastes time and increases compliance risk.
- Misinterpretation can lead to **multi-million dollar fines**.
- Employees and auditors need **instant access to precise, policy-based answers**.

This project addresses that problem with an AI assistant that ingests **financial policy PDFs**, converts them into embeddings, and answers questions strictly from those documents, which supports compliance and reduces risk.

---

## 🌍 2. Why This Problem Is Important

- **Financial regulations run to thousands of pages** (Basel III, IFRS, SEC filings, RBI circulars, etc.)
- Manual searching is **slow and error-prone**
- **Time-critical decisions** in audits, reporting, and client onboarding
- **Accuracy is non-negotiable**: hallucinations can mean fines, lawsuits, and reputational damage
- Staff are often non-technical → need **natural language Q&A**

👉 “This AI tool democratizes access to financial compliance knowledge, reduces manual effort, and minimizes regulatory risk.”

---

## 🧩 3. Challenges Solved

**Challenge 1: Long, unstructured financial documents**
- PDFs contain dense legal text, tables of thresholds, irregular formatting, and no metadata.
- ✅ Solution: **OpenAI embedding models** convert text chunks into vector representations stored in a vector store.

**Challenge 2: Accuracy in the compliance domain**
- LLMs must not hallucinate.
- ✅ Solution: Retrieval-Augmented Generation (RAG) grounds answers in the uploaded policies only.

**Challenge 3: Strict document-only answers**
- No external knowledge allowed.
- ✅ Solution: The query pipeline restricts the LLM to retrieved chunks and returns citations from the source PDFs.
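
One way to back up the citation requirement is a post-check on the generated answer. The sketch below is only illustrative: it assumes answers cite pages in a hypothetical `[p. N]` format and that the set of retrieved page numbers is available, neither of which is specified by this project.

```python
import re

# Hypothetical citation format "[p. 3]"; the real prompt may use another one.
CITATION_PATTERN = re.compile(r"\[p\.\s*(\d+)\]")


def cited_pages(answer: str) -> set[int]:
    """Extract the page numbers cited in an answer."""
    return {int(page) for page in CITATION_PATTERN.findall(answer)}


def is_grounded(answer: str, retrieved_pages: set[int]) -> bool:
    """True only if the answer cites at least one page and every cited
    page is among the pages that were actually retrieved."""
    pages = cited_pages(answer)
    return bool(pages) and pages <= retrieved_pages


print(is_grounded("Two signatories are required [p. 3].", {2, 3}))
print(is_grounded("Two signatories are required.", {2, 3}))
```

An answer that fails this check could be rejected or regenerated rather than shown to the user.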

**Challenge 4: Fast inference at scale**
- Compliance queries often happen under audit pressure.
- ✅ Solution: Hosted OpenAI LLMs return high-quality responses with low latency, with no model serving to manage.

---

## 🏗️ 4. Technical Architecture

### 🔹 Flow Overview
1. **Frontend (Streamlit)**
   - Chatbot UI for document upload + Q&A.
   - Displays answers with citations.

2. **Backend RAG Pipeline**
   - Handles preprocessing, embeddings, retrieval, and LLM orchestration.

3. **Embedding & Retrieval Layer**
   - **OpenAI Embedding Models** (`text-embedding-ada-002` or newer, such as `text-embedding-3-large`).
   - Vector DB: FAISS/Chroma (self-hosted) or Azure AI Search (formerly Azure Cognitive Search).

4. **LLM Layer**
   - **OpenAI GPT-4 / GPT-4 Turbo** for grounded answers.
   - The RAG pipeline restricts answers to the retrieved document chunks.

5. **Deployment (Azure Web App)**
   - Streamlit frontend deployed separately.
   - Azure services:
     - **Blob Storage** → raw PDFs
     - **Key Vault** → secure API keys
     - **Monitor** → logging & performance

---

### 🔹 Architecture Diagram (Conceptual)

```
[User Query] → [Streamlit Frontend]
       ↓
   [OpenAI LLM] ← [Relevant Chunks] ← [FAISS Vector DB]
       ↑                               ↑
   [Embedding API] ← [Chunked PDFs] ← [Document Loader]
```
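
The retrieval step in the diagram (query vector in, relevant chunks out) can be sketched in plain Python. This is a minimal illustration, not the project's code: the three-dimensional vectors are made-up stand-ins for real embeddings, and cosine similarity is one common scoring choice that a given vector store may or may not use by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    # Highest similarity first; ties broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Toy "embeddings" (hypothetical values; real ones have far more dimensions).
chunk_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query_vec = [1.0, 0.0, 0.0]
print(top_k(query_vec, chunk_vecs, k=2))
```

The indices returned by `top_k` identify the chunks whose text would be passed to the LLM as context.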

---

## 📂 5. Document Loaders

- **PyPDFLoader** → standard PDFs (the notebook below uses **PyMuPDFLoader**)
- **UnstructuredPDFLoader** → messy PDFs with tables
- **Docx2txtLoader** → Word-based compliance docs
- **CSVLoader** → tabular compliance thresholds
- **UnstructuredHTMLLoader** → regulatory websites

---

## ✂️ 6. Chunking Strategies

- **Fixed-size chunking** (512–1024 tokens)
- **Table-aware chunking** → preserve rows/columns
- **Semantic chunking** → split by headings/clauses
- **Overlap strategy** (50–100 tokens overlap) → prevent context loss
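
Fixed-size chunking with overlap can be sketched as a sliding window. The example below is an illustration under a simplifying assumption: whitespace-separated words stand in for model tokens, and the tiny `chunk_size` and `overlap` values are chosen only to keep the output readable.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share
    `overlap` tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Whitespace "tokens" are an assumption; a real pipeline would use a tokenizer.
tokens = "All bank accounts are reconciled to the accounting system at least monthly".split()
for chunk in chunk_tokens(tokens, chunk_size=5, overlap=2):
    print(" ".join(chunk))
```

Each window repeats the last two words of the previous one, which is the mechanism that keeps a clause from being cut in half at a chunk boundary.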

---

## 🎨 7. Frontend (Streamlit)

- Upload financial policy documents  
- Ask compliance questions in natural language  
- Display answers with citations  

---

## 🌍 8. Impact

- **Compliance Risk Reduction** → grounded answers prevent misinterpretation
- **Efficiency Gains** → instant retrieval vs hours of manual searching
- **Audit Readiness** → citation-backed responses for regulators
- **Scalability** → supports multiple document types + Azure scaling
- **User Empowerment** → non-technical staff can query complex policies in plain English

---

## ✨ 9. Example Use Cases

- “Summarize RBI circulars on KYC requirements in the last 2 years.”
- “What is the minimum liquidity coverage ratio under Basel III?”
- “Compare IFRS vs GAAP treatment of deferred tax assets.”
- “Highlight conflicting clauses between internal compliance manual and RBI guidelines.”

---



## ✅ One-line Summary (Elevator Pitch)

Built a GenAI-powered **Financial Policy Assistant** using OpenAI embeddings + LLMs, where financial policy PDFs are converted into embeddings, stored in a vector store, and queried via a FastAPI backend with a Streamlit frontend, deployed on Azure Web App for fast, accurate, compliance-focused responses.

---













# 🎤 Interview Q&A on Financial Policy Assistant

---

### **Q1. Can you summarize your project in one line?**
**A:**  
I built a GenAI-powered **Financial Policy Assistant** using OpenAI embeddings + LLMs, where financial policy PDFs are converted into embeddings, stored in FAISS, and queried via a FastAPI backend with a Streamlit frontend, deployed on Azure Web App for fast, accurate, compliance-focused responses.

---

### **Q2. What business problem does this solve?**
**A:**  
Financial institutions deal with thousands of pages of compliance manuals and regulatory filings. Searching them manually is slow and error-prone, and misinterpretation can lead to multi-million dollar fines. My assistant solves this by enabling instant, accurate, document-grounded answers, reducing compliance risk and saving time.

---

### **Q3. Why is this problem important?**
**A:**  
- Regulations like Basel III or IFRS run into hundreds of pages.  
- Manual searching wastes time during audits or reporting.  
- Accuracy is critical: hallucinations can cause penalties or reputational damage.
- Staff are often non-technical, so natural language Q&A democratizes access to compliance knowledge.  

---

### **Q4. What were the main technical challenges you faced?**
**A:**  
1. **Unstructured documents**: PDFs with tables, irregular formatting, no metadata.
2. **Accuracy requirement**: LLMs must not hallucinate in the compliance domain.
3. **Strict grounding**: answers must come only from uploaded documents.
4. **Latency**: queries must return quickly under audit pressure.

---

### **Q5. How did you solve these challenges?**
**A:**  
- Used **different document loaders** (PyPDFLoader, UnstructuredPDFLoader, Docx2txtLoader, CSVLoader) to handle diverse formats.
- Applied **chunking strategies**: fixed-size, semantic (by headings), table-aware, with overlap to preserve context.
- Built a **RAG pipeline**: OpenAI embeddings stored in FAISS, retrieved chunks passed into GPT-4 for grounded answers.
- Deployed on **Azure Web App** with FastAPI backend and Streamlit frontend for scalability and low latency.

---

### **Q6. Can you walk me through the technical architecture?**
**A:**  
- **Frontend (Streamlit)** → chatbot UI for uploads + queries.
- **Embedding Layer** → OpenAI embedding models (`text-embedding-ada-002` or newer).
- **Vector Store** → FAISS or Chroma for similarity search.
- **LLM Layer** → GPT-4 generates answers strictly from retrieved chunks.
- **Deployment** → Azure Web App, Blob Storage for PDFs, Key Vault for secrets, Monitor for logging.

---

### **Q7. What impact does this solution have?**
**A:**  
- **Compliance risk reduction** → grounded answers prevent misinterpretation.
- **Efficiency gains** → instant retrieval vs hours of manual searching.
- **Audit readiness** → citation-backed answers for regulators.
- **Scalability** → supports multiple document types and chunking strategies.
- **User empowerment** → non-technical staff can query complex policies in plain English.

---

### **Q8. Can you give an example of how it’s used?**
**A:**
- “Summarize RBI circulars on KYC requirements in the last 2 years.”
- “What is the minimum liquidity coverage ratio under Basel III?”
- “Compare IFRS vs GAAP treatment of deferred tax assets.”
- “Highlight conflicting clauses between internal compliance manual and RBI guidelines.”

---

### **Q9. How would you improve this solution further?**
**A:**  
- Add **multi-document cross-referencing** to detect conflicts across policies.  
- Integrate **role-based access control** for sensitive compliance documents.  
- Use **Azure AI Search** (formerly Azure Cognitive Search) for hybrid semantic + keyword retrieval.
- Add **explainability features**: highlight exact clauses used in answers.  

---


In [None]:
# Document Understanding
# Types: PDF,DOCX,TXT,Confluence
# Structure
# Explore

In [None]:
# Loaders:
# pypdf
# pdfplumber
# pdfminer
# pymupdf   or pymupdf4llm


In [1]:
! pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.26.7-cp310-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.7-cp310-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.7


In [2]:
! pip install langchain langchain-community

Collecting langchain-community
  Downloading langchain_community-0.4.1-py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-classic<2.0.0,>=1.0.0 (from langchain-community)
  Downloading langchain_classic-1.0.0-py3-none-any.whl.metadata (3.9 kB)
Collecting requests<3.0.0,>=2.32.5 (from langchain-community)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7.0,>=0.6.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7.0,>=0.6.7->langchain-community)
  Downloading marshmallow-3.26.2-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7.0,>=0.6.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting langchain-text-splitters<2.0.0,>=1.0.0 (from langchain-classic<2.0.0,>=1.0.0->langchain-community)
  Downloading langchain_text_splitters-1.1.0

In [3]:
! pip show langchain

Name: langchain
Version: 1.2.0
Summary: Building applications with LLMs through composability
Home-page: https://docs.langchain.com/
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.12/dist-packages
Requires: langchain-core, langgraph, pydantic
Required-by: 


In [4]:
from langchain_community.document_loaders import PyMuPDFLoader



In [5]:
loader = PyMuPDFLoader("/content/6_financial-policies-and-procedures_2022.pdf")

In [6]:
documents = loader.load()

In [7]:
documents

[Document(metadata={'producer': 'Microsoft: Print To PDF', 'creator': '', 'creationdate': '2022-07-19T09:16:20+01:00', 'source': '/content/6_financial-policies-and-procedures_2022.pdf', 'file_path': '/content/6_financial-policies-and-procedures_2022.pdf', 'total_pages': 20, 'format': 'PDF 1.7', 'title': 'Microsoft Word - 6_Financial policies and procedures_2022', 'author': 'IA-Admin', 'subject': '', 'keywords': '', 'moddate': '2022-07-19T09:16:20+01:00', 'trapped': '', 'modDate': "D:20220719091620+01'00'", 'creationDate': "D:20220719091620+01'00'", 'page': 0}, page_content='FINANCIAL POLICY AND  \nPROCEDURES MANUAL  \n \n \nApproved by Audit Committee June 2018, updated Jul 2019, updated Feb 2020, \nupdated Jan 2021, updated Mar 22 \n \n \n1 \n \nContents \n1. \nIntroduction \n2 \n2. \nBank accounts \n3 \n3. \nPetty cash \n3 \n4. \nFixed assets \n4 \n5. \nDebtors, creditors, accruals and prepayments \n4 \n6. \nIncome \n4 \n7. \nProcurement policy \n4 \n8. \nExpenditure processing \n6 \

In [8]:
# Document Object

# Document(
#     metadata={'producer': 'Microsoft: Print To PDF', 'creator': '', 'creationdate': '2022-07-19T09:16:20+01:00', 'source': '/content/6_financial-policies-and-procedures_2022.pdf', 'file_path': '/content/6_financial-policies-and-procedures_2022.pdf', 'total_pages': 20, 'format': 'PDF 1.7', 'title': 'Microsoft Word - 6_Financial policies and procedures_2022', 'author': 'IA-Admin', 'subject': '', 'keywords': '', 'moddate': '2022-07-19T09:16:20+01:00', 'trapped': '', 'modDate': "D:20220719091620+01'00'", 'creationDate': "D:20220719091620+01'00'", 'page': 0},
#     page_content='FINANCIAL POLICY AND  \nPROCEDURES MANUAL  \n \n \nApproved by Audit Committee June 2018, updated Jul 2019, updated Feb 2020, \nupdated Jan 2021, updated Mar 22 \n \n \n1 \n \nContents \n1. \nIntroduction \n2 \n2. \nBank accounts \n3 \n3. \nPetty cash \n3 \n4. \nFixed assets \n4 \n5. \nDebtors, creditors, accruals and prepayments \n4 \n6. \nIncome \n4 \n7. \nProcurement policy \n4 \n8. \nExpenditure processing \n6 \n9. \nStaff reward \n8 \n10. \nReserves \n8 \n11.



In [9]:
documents[0]

Document(metadata={'producer': 'Microsoft: Print To PDF', 'creator': '', 'creationdate': '2022-07-19T09:16:20+01:00', 'source': '/content/6_financial-policies-and-procedures_2022.pdf', 'file_path': '/content/6_financial-policies-and-procedures_2022.pdf', 'total_pages': 20, 'format': 'PDF 1.7', 'title': 'Microsoft Word - 6_Financial policies and procedures_2022', 'author': 'IA-Admin', 'subject': '', 'keywords': '', 'moddate': '2022-07-19T09:16:20+01:00', 'trapped': '', 'modDate': "D:20220719091620+01'00'", 'creationDate': "D:20220719091620+01'00'", 'page': 0}, page_content='FINANCIAL POLICY AND  \nPROCEDURES MANUAL  \n \n \nApproved by Audit Committee June 2018, updated Jul 2019, updated Feb 2020, \nupdated Jan 2021, updated Mar 22 \n \n \n1 \n \nContents \n1. \nIntroduction \n2 \n2. \nBank accounts \n3 \n3. \nPetty cash \n3 \n4. \nFixed assets \n4 \n5. \nDebtors, creditors, accruals and prepayments \n4 \n6. \nIncome \n4 \n7. \nProcurement policy \n4 \n8. \nExpenditure processing \n6 \n

In [10]:
documents[2]

Document(metadata={'producer': 'Microsoft: Print To PDF', 'creator': '', 'creationdate': '2022-07-19T09:16:20+01:00', 'source': '/content/6_financial-policies-and-procedures_2022.pdf', 'file_path': '/content/6_financial-policies-and-procedures_2022.pdf', 'total_pages': 20, 'format': 'PDF 1.7', 'title': 'Microsoft Word - 6_Financial policies and procedures_2022', 'author': 'IA-Admin', 'subject': '', 'keywords': '', 'moddate': '2022-07-19T09:16:20+01:00', 'trapped': '', 'modDate': "D:20220719091620+01'00'", 'creationDate': "D:20220719091620+01'00'", 'page': 2}, page_content='FINANCIAL POLICY AND  \nPROCEDURES MANUAL  \n \n \nApproved by Audit Committee June 2018, updated Jul 2019, updated Feb 2020, \nupdated Jan 2021, updated Mar 22 \n \n \n3 \n \n2. Bank accounts  \nIntegrity Action has the following bank accounts:  \nAccount name\nCurrency \nSort code\nAccount \nnumber\nHSBC Main\nGBP\n40 11 60\n50149810\nHSBC \nReserves* \nGBP\n40 02 90\n80680656\nHSBC Norad \nGBP\n40 11 60\n70151580\

In [11]:
documents[3]

Document(metadata={'producer': 'Microsoft: Print To PDF', 'creator': '', 'creationdate': '2022-07-19T09:16:20+01:00', 'source': '/content/6_financial-policies-and-procedures_2022.pdf', 'file_path': '/content/6_financial-policies-and-procedures_2022.pdf', 'total_pages': 20, 'format': 'PDF 1.7', 'title': 'Microsoft Word - 6_Financial policies and procedures_2022', 'author': 'IA-Admin', 'subject': '', 'keywords': '', 'moddate': '2022-07-19T09:16:20+01:00', 'trapped': '', 'modDate': "D:20220719091620+01'00'", 'creationDate': "D:20220719091620+01'00'", 'page': 3}, page_content='FINANCIAL POLICY AND  \nPROCEDURES MANUAL  \n \n \nApproved by Audit Committee June 2018, updated Jul 2019, updated Feb 2020, \nupdated Jan 2021, updated Mar 22 \n \n \n4 \n \n4. Fixed assets  \nAll assets costing more than £3,000 and with an expected useful life exceeding one year are \ncapitalised.  \nDepreciation is charged annually at the following rates in order to write off assets over their \nuseful economic 

In [12]:
len(documents)

20

In [None]:
# Document Splitting/Chunking

In [None]:
# recursive char splitter

In [13]:
! pip install langchain_text_splitters



In [14]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


#### Paragraph -- \n\n
#### New Line -- \n
#### sentence level -- .

In [16]:

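text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # maximum characters per chunk
    chunk_overlap=200,   # characters shared between consecutive chunks
    # Separators are tried in order: paragraph, line, sentence, word, character
    separators=["\n\n", "\n", ".", " ", ""],
)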

In [None]:
## chunk 1:
"""
Bank accounts may be accessed on line via hsbc.net. The following people have access:
1. Britto Bernadet (Finance and Compliance Manager) – System administrator: prepare and
approve payments
2. Jasmina Haynes (CEO) – System administrator: prepare and approve payments
3. Derek Thorne (Head of Programme Development) – End user: prepare and approve
payments
4. Annalisa Renna (Head of Operations) – End user: prepare and approve payments
5. Baduwah MacArthur
"""

## chunk 2:
"""
(Finance and Governance Assistant) – System administrator:
prepare payments only
All bank accounts are reconciled to the accounting system at least monthly and the month-end
reconciliation is reviewed by an independent person, typically the Finance and Compliance
Manager.
3. Petty cash
Since the organisation no longer has an office, we have discontinued the petty cash facility.
"""


## overlap of chunking
overlap_size = 100
overlap_size = 200

# Chunk
"""
4. Annalisa Renna (Head of Operations) – End user: prepare and approve payments
5. Baduwah MacArthur
(Finance and Governance Assistant) – System administrator:
prepare payments only
All bank accounts are reconciled to the accounting system at least monthly and the month-end
reconciliation is reviewed by an independent person, typically the Finance and Compliance
Manager.
3. Petty cash
Since the organisation no longer has an office, we have discontinued the petty cash facility.
"""
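
To see how much text two consecutive chunks actually share, a small helper can compare the end of one chunk with the start of the next. This is a minimal sketch; the function name `shared_overlap` and the sample strings are illustrative, not part of the notebook.

```python
def shared_overlap(prev_chunk: str, next_chunk: str) -> str:
    """Return the longest suffix of prev_chunk that is also a prefix of next_chunk."""
    max_len = min(len(prev_chunk), len(next_chunk))
    for size in range(max_len, 0, -1):
        if prev_chunk.endswith(next_chunk[:size]):
            return next_chunk[:size]
    return ""


# Hypothetical chunk boundaries built from the text shown above.
prev_chunk = "prepare and approve payments\n5. Baduwah MacArthur"
next_chunk = "5. Baduwah MacArthur\n(Finance and Governance Assistant)"
print(repr(shared_overlap(prev_chunk, next_chunk)))
```

A non-empty result confirms that the name and the role it belongs to stay together in at least one chunk.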


In [17]:
chunks = text_splitter.split_documents(documents)
len(chunks)

46

In [18]:
type(chunks)

list

In [19]:
chunks[0]

Document(metadata={'producer': 'Microsoft: Print To PDF', 'creator': '', 'creationdate': '2022-07-19T09:16:20+01:00', 'source': '/content/6_financial-policies-and-procedures_2022.pdf', 'file_path': '/content/6_financial-policies-and-procedures_2022.pdf', 'total_pages': 20, 'format': 'PDF 1.7', 'title': 'Microsoft Word - 6_Financial policies and procedures_2022', 'author': 'IA-Admin', 'subject': '', 'keywords': '', 'moddate': '2022-07-19T09:16:20+01:00', 'trapped': '', 'modDate': "D:20220719091620+01'00'", 'creationDate': "D:20220719091620+01'00'", 'page': 0}, page_content='FINANCIAL POLICY AND  \nPROCEDURES MANUAL  \n \n \nApproved by Audit Committee June 2018, updated Jul 2019, updated Feb 2020, \nupdated Jan 2021, updated Mar 22 \n \n \n1 \n \nContents \n1. \nIntroduction \n2 \n2. \nBank accounts \n3 \n3. \nPetty cash \n3 \n4. \nFixed assets \n4 \n5. \nDebtors, creditors, accruals and prepayments \n4 \n6. \nIncome \n4 \n7. \nProcurement policy \n4 \n8. \nExpenditure processing \n6 \n

In [20]:
chunks[1]

Document(metadata={'producer': 'Microsoft: Print To PDF', 'creator': '', 'creationdate': '2022-07-19T09:16:20+01:00', 'source': '/content/6_financial-policies-and-procedures_2022.pdf', 'file_path': '/content/6_financial-policies-and-procedures_2022.pdf', 'total_pages': 20, 'format': 'PDF 1.7', 'title': 'Microsoft Word - 6_Financial policies and procedures_2022', 'author': 'IA-Admin', 'subject': '', 'keywords': '', 'moddate': '2022-07-19T09:16:20+01:00', 'trapped': '', 'modDate': "D:20220719091620+01'00'", 'creationDate': "D:20220719091620+01'00'", 'page': 1}, page_content='FINANCIAL POLICY AND  \nPROCEDURES MANUAL  \n \n \nApproved by Audit Committee June 2018, updated Jul 2019, updated Feb 2020, \nupdated Jan 2021, updated Mar 22 \n \n \n2 \n \n1. Introduction  \nThe purpose of this manual is to document the finance related policies and procedures which \nunderpin the financial management system in place at Integrity Action and to ensure that the \nfinancial statements conform to gene

In [21]:
chunks[2]

Document(metadata={'producer': 'Microsoft: Print To PDF', 'creator': '', 'creationdate': '2022-07-19T09:16:20+01:00', 'source': '/content/6_financial-policies-and-procedures_2022.pdf', 'file_path': '/content/6_financial-policies-and-procedures_2022.pdf', 'total_pages': 20, 'format': 'PDF 1.7', 'title': 'Microsoft Word - 6_Financial policies and procedures_2022', 'author': 'IA-Admin', 'subject': '', 'keywords': '', 'moddate': '2022-07-19T09:16:20+01:00', 'trapped': '', 'modDate': "D:20220719091620+01'00'", 'creationDate': "D:20220719091620+01'00'", 'page': 1}, page_content='This manual must be reviewed by the Finance and Compliance Manager and approved by the \nAudit Committee periodically.  \nBackground information \n● Integrity Action’s year end is 30 September.  \n● The accounting system used since 1 October 2017 is Aqilla, a cloud based system (see \nwww.aqilla.com). \n● All income and expenditure transactions are allocated to both a nominal ledger code \n(see Appendix 4) an

In [None]:
## Vector representation of text

## Embedding Model -
# 1. Open-source embeddings - sentence-transformers all-MiniLM-L6-v2

# 2. OpenAI embeddings - ada, small, large
# 3. Google embeddings - text embedding models

In [24]:
! pip install langchain_openai

Collecting langchain_openai
  Downloading langchain_openai-1.1.6-py3-none-any.whl.metadata (2.6 kB)
Collecting langchain-core<2.0.0,>=1.2.2 (from langchain_openai)
  Downloading langchain_core-1.2.5-py3-none-any.whl.metadata (3.7 kB)
Downloading langchain_openai-1.1.6-py3-none-any.whl (84 kB)
Downloading langchain_core-1.2.5-py3-none-any.whl (484 kB)
Installing collected packages: langchain-core, langchain_openai
  Attempting uninstall: langchain-core
    Found existing installation: langchain-core 1.2.1
    Uninstalling langchain-core-1.2.1:
      Successfully uninstalled langchain

In [25]:
from langchain_openai import OpenAIEmbeddings

In [27]:
from google.colab import userdata
openai_api_key = userdata.get('openai_key')

In [28]:
import os
os.environ["OPENAI_API_KEY"] = openai_api_key

In [29]:
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large"
)

In [30]:
! pip install chromadb

Collecting chromadb
  Downloading chromadb-1.3.7-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.2 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.3.0-py3-none-any.whl.metadata (5.6 kB)
Collecting pybase64>=1.4.1 (from chromadb)
  Downloading pybase64-1.4.3-cp312-cp312-manylinux1_x86_64.manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_5_x86_64.whl.metadata (8.7 kB)
Collecting posthog<6.0.0,>=2.4.0 (from chromadb)
  Downloading posthog-5.4.0-py3-none-any.whl.metadata (5.7 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.23.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.39.1-py3-none-any.whl.metadata (2.5 kB)
Collecting pypika>=0.48.9 (from chromadb)
  Downloading PyPika-0.48.9.tar.gz (67 kB)

In [31]:
from langchain_community.vectorstores import Chroma

In [32]:
vectorestore = Chroma.from_documents(documents=chunks,embedding=embeddings,persist_directory="./db")

In [34]:
retriever = vectorestore.as_retriever(
    search_type="similarity",
    search_kwargs={"k":10}
)

In [None]:
# vectorestore.as_retriever(
#     search_type="mmr",
#     search_kwargs={"k":3}
# )
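
The commented-out `"mmr"` option refers to maximal marginal relevance, which trades relevance against redundancy among the selected chunks. The sketch below illustrates the idea on toy vectors; it is not the library's implementation, and the vectors and the `lambda_mult` values are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick k documents, scoring each candidate as
    lambda_mult * relevance - (1 - lambda_mult) * similarity to what is already picked."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy vectors (hypothetical): documents 0 and 1 are duplicates, document 2 covers another aspect.
doc_vecs = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query_vec = [1.0, 1.0, 0.0]
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # diversity-aware selection
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking, so the duplicate is picked; lowering it penalizes chunks that repeat what was already selected.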

In [None]:
## LLM integration

In [37]:
! pip install langchain_openai



In [39]:
from langchain_openai import ChatOpenAI

In [43]:
llm = ChatOpenAI(
    model="gpt-5-nano-2025-08-07",    # gpt-4o-2024-08-06
    temperature=0.0,
    top_p=1,
    max_tokens=1024,
)

In [40]:
# ChatOpenAI(
#     model="gpt-5.2-2025-12-11"
# )

In [36]:
! pip install langchain_classic



In [35]:
from langchain_classic.chains import RetrievalQA

In [44]:
qa_chain = RetrievalQA.from_llm(llm,retriever=retriever)

In [45]:
qa_chain.invoke({"query":"What is main purpose of the financial policy and Procedures manual?"})

{'query': 'What is main purpose of the financial policy and Procedures manual?',
 'result': 'The main purpose is to document the finance-related policies and procedures that underpin Integrity Action’s financial management, ensuring statements comply with GAAP, assets are safeguarded, donor guidelines are followed, and finances are managed accurately, efficiently, completely, and transparently.'}

In [46]:
response = qa_chain.invoke({"query":"What is main purpose of the financial policy and Procedures manual?"})
print(response["result"])

The main purpose is to document the organization’s finance-related policies and procedures that underpin the financial management system, ensuring financial statements comply with generally accepted accounting principles, assets are safeguarded, donor guidelines are followed, and finances are managed with accuracy, efficiency, completeness and transparency.


In [47]:
response = qa_chain.invoke({"query":"What types of bank accounts does Integrity Action maintain?"})
print(response["result"])

Integrity Action maintains five bank accounts with HSBC:

- HSBC Main – GBP (current operating account)
- HSBC Reserves – GBP (3-month fixed deposit, auto-reinvested until changed)
- HSBC Norad – GBP
- HSBC IEN – USD
- HSBC SIDA – SEK

Notes: all accounts are in the organisation’s name; two signatories are required to approve transactions.


In [48]:
response = qa_chain.invoke({"query":"Who are the current authorized bank signatories?"})
print(response["result"])

- Jasmina Haynes (CEO)
- Siobhan Turner (Trustee)
- Gail Klintworth (Board chair)

Note: The bank mandate requires two signatories to approve all transactions.


In [49]:
response = qa_chain.invoke({"query":"What are the quotation requirements based on purchase value?"})
print(response["result"])

Here are the quotation requirements by purchased value (value includes VAT):

- £2,500 and under: minimum of 1 quote
- £2,501 – £7,500: minimum of 2 written quotes
- £7,501 – £15,000: minimum of 3 written quotes
- Greater than £15,000: formal tender required (request for tenders published and advertised appropriately)

Notes:
- These thresholds are based on the value including VAT.
- Quotations should be obtained in line with delegated authority and documented rationale.


In [50]:
from langchain_core.prompts import PromptTemplate

In [57]:
rag_prompt_template = PromptTemplate(
    input_variables=["context", "input"],
    template="""You are a helpful Financial Policy Assistant. Answer strictly from the given context and cite the source for each claim.
    context: {context}
    question: {input}
    """
)
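
For the model to cite anything, the context passed into a prompt like this needs source labels. The sketch below shows one way to do that in plain Python, assuming chunks are available as dictionaries with hypothetical `page` and `text` keys, where `page` is zero-based as in the loader metadata shown earlier. The `PROMPT` string and helper names are illustrative.

```python
# Illustrative prompt text; {context} and {input} mirror the template variables.
PROMPT = (
    "You are a helpful Financial Policy Assistant. Answer strictly from the "
    "context below and cite the page for every claim.\n"
    "context: {context}\n"
    "question: {input}\n"
)


def format_context(chunks):
    """Prefix each chunk with a 1-based page label so the model can cite it."""
    return "\n\n".join(f"[page {c['page'] + 1}] {c['text']}" for c in chunks)


def build_prompt(question, chunks):
    """Fill the prompt with the labeled context and the user question."""
    return PROMPT.format(context=format_context(chunks), input=question)


# Hypothetical retrieved chunks; the page values are zero-based placeholders.
retrieved = [
    {"page": 2, "text": "The bank mandate requires two signatories to approve all transactions."},
    {"page": 3, "text": "Assets above the capitalisation threshold are depreciated annually."},
]
print(build_prompt("How many signatories are required?", retrieved))
```

Keeping the page label next to each chunk lets the answer's citations be traced back to a specific page of the source PDF.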