# Finance Domain RAG System

Interactive notebook for exploring RAG with finance domain documents:
- **LOS** (Loan Origination System)
- **LMS** (Loan Management System)  
- **Credit Reports Analysis**
- **Underwriting Guidelines**

## Workflow
1. **Setup** (Run once)
2. **Ingest Documents** (Run once - creates embeddings)
3. **Query the System** (Run multiple times with different questions)

This saves embedding costs while you experiment! 💰

## Cell 1: Setup & Imports

In [None]:
import os
from openai import OpenAI
from dotenv import load_dotenv
import chromadb
from pathlib import Path
import re
from typing import List, Dict

# Load environment variables
load_dotenv()
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

print("✅ Setup complete!")

## Cell 2: Helper Functions

In [None]:
def get_embedding(text: str) -> list[float]:
    """Convert text to embedding vector"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding


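def chunk_document(text: str, sentences_per_chunk: int = 4, overlap_sentences: int = 1) -> List[str]:
    """Chunk text by sentences with overlap"""
    # The window must advance on every iteration, otherwise the loop never ends
    step = sentences_per_chunk - overlap_sentences
    if step <= 0:
        raise ValueError("overlap_sentences must be smaller than sentences_per_chunk")

    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    i = 0

    while i < len(sentences):
        chunk_sentences = sentences[i:i + sentences_per_chunk]
        chunk_text = ' '.join(chunk_sentences)
        if chunk_text.strip():
            chunks.append(chunk_text)
        i += step

    return chunks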


def print_retrieved_chunks(chunks: List[Dict]):
    """Pretty print retrieved chunks"""
    print("\n📊 RETRIEVED CHUNKS:")
    print("=" * 100)
    for i, chunk in enumerate(chunks, 1):
        print(f"\n#{i} - [{chunk['title']}] (Similarity: {chunk['similarity']:.3f})")
        print(f"Document Type: {chunk['document_type']}")
        print("-" * 100)
        print(f"{chunk['text'][:300]}..." if len(chunk['text']) > 300 else chunk['text'])


print("✅ Helper functions defined!")
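
To see how the overlap parameter behaves, the sketch below applies the same sliding-window idea to a list of sentences that are already split. `window_sentences` is a hypothetical helper for illustration, not part of the notebook; it uses the same stepping rule as `chunk_document` (advance by `sentences_per_chunk - overlap_sentences`).

```python
from typing import List


def window_sentences(sentences: List[str], size: int = 4, overlap: int = 1) -> List[str]:
    """Group sentences into windows of `size`, sharing `overlap` sentences between neighbours."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(" ".join(sentences[start:start + size]))
    return windows


sentences = ["S1.", "S2.", "S3.", "S4.", "S5.", "S6.", "S7."]
for window in window_sentences(sentences, size=4, overlap=1):
    print(window)
```

With seven sentences this produces three windows: `S1. S2. S3. S4.`, `S4. S5. S6. S7.`, and a short tail `S7.`. The shared sentence `S4.` is the overlap, and the short tail shows that the last chunk can be much smaller than the others.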

## Cell 3: Initialize ChromaDB & Ingest Documents

**RUN THIS ONCE** - Creates embeddings for all finance documents.

⚠️ This costs tokens (~$0.001 for all documents)
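
A back-of-the-envelope calculation is an easy way to sanity-check a figure like this. The sketch below is an estimate only: it assumes roughly 4 characters per token (a common rule of thumb for English text, not a tokenizer count) and a price of $0.02 per 1M tokens for `text-embedding-3-small`, which should be verified against current OpenAI pricing. `estimate_embedding_cost` and the document sizes are hypothetical.

```python
def estimate_embedding_cost(texts, usd_per_million_tokens=0.02, chars_per_token=4.0):
    """Return (approx_tokens, approx_cost_usd) for embedding the given texts once."""
    total_chars = sum(len(t) for t in texts)
    approx_tokens = total_chars / chars_per_token
    approx_cost = approx_tokens / 1_000_000 * usd_per_million_tokens
    return approx_tokens, approx_cost


# Example: four documents of about 50,000 characters each (made-up sizes)
docs = ["x" * 50_000 for _ in range(4)]
tokens, cost = estimate_embedding_cost(docs)
print(f"~{tokens:,.0f} tokens, ~${cost:.6f}")
```

Overlapping chunks embed some sentences twice, so the real token count will be somewhat higher than a plain character count of the files suggests.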

In [None]:
# Initialize ChromaDB
print("Initializing Finance RAG system...")
chroma_client = chromadb.PersistentClient(path="./finance_notebook_db")

# Delete existing collection (for clean start)
try:
    chroma_client.delete_collection(name="finance_kb")
    print("🗑️  Cleared existing knowledge base")
except Exception:
    # Nothing to delete on the first run, when the collection does not exist yet
    pass

# Create new collection
collection = chroma_client.create_collection(
    name="finance_kb",
    metadata={"description": "Finance domain knowledge base"}
)

print("✅ ChromaDB initialized\n")

# Ingest finance documents
doc_directory = "./sample_docs/finance"
doc_path = Path(doc_directory)
doc_files = list(doc_path.glob("*.txt"))

print(f"📚 Found {len(doc_files)} finance documents:\n")
for f in doc_files:
    print(f"  📄 {f.stem.replace('_', ' ').title()}")

print(f"\n{'='*100}")
print("INGESTING DOCUMENTS (Creating embeddings - this costs tokens!)")
print(f"{'='*100}\n")

all_ids = []
all_chunks = []
all_embeddings = []
all_metadata = []
chunk_counter = 0

for doc_file in doc_files:
    title = doc_file.stem.replace('_', ' ').title()
    content = doc_file.read_text(encoding="utf-8")  # explicit encoding avoids platform-dependent decoding

    # Chunk the document
    chunks = chunk_document(content, sentences_per_chunk=4, overlap_sentences=1)

    print(f"📄 {title}")
    print(f"   → {len(chunks)} chunks created")

    for i, chunk in enumerate(chunks):
        chunk_id = f"{doc_file.stem}_chunk_{i}"
        embedding = get_embedding(chunk)

        all_ids.append(chunk_id)
        all_chunks.append(chunk)
        all_embeddings.append(embedding)
        all_metadata.append({
            "title": title,
            "category": "finance",
            "document_type": doc_file.stem,
            "chunk_index": i,
            "total_chunks": len(chunks)
        })

        chunk_counter += 1

# Batch add to collection
collection.add(
    ids=all_ids,
    documents=all_chunks,
    embeddings=all_embeddings,
    metadatas=all_metadata
)

print(f"\n{'='*100}")
print("✅ INGESTION COMPLETE!")
print(f"{'='*100}")
print(f"📊 Total chunks: {chunk_counter}")
# Rough estimate only; the actual cost depends on the number of tokens per chunk
print(f"💰 Embedding cost: ~${chunk_counter * 0.00002:.6f}")
print("💾 Saved to: ./finance_notebook_db")
print("\n🎯 Now you can query as often as you like in Cell 4 without re-embedding the documents!")
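
Because the vectors are persisted on disk, a later session does not have to repeat the ingestion. Here is a minimal sketch, assuming the `chroma_client` created in this cell and Chroma's `get_or_create_collection` method. `open_knowledge_base` is a hypothetical helper, and checking `count()` is just one way to decide whether ingestion is still needed.

```python
def open_knowledge_base(chroma_client, name: str = "finance_kb"):
    """Return (collection, needs_ingest) without deleting stored embeddings."""
    collection = chroma_client.get_or_create_collection(name=name)
    # An empty collection means nothing has been embedded yet
    needs_ingest = collection.count() == 0
    return collection, needs_ingest


# Usage in the notebook (requires chromadb and the persisted ./finance_notebook_db):
# collection, needs_ingest = open_knowledge_base(chroma_client)
# if needs_ingest:
#     print("Run the ingestion loop above first")
```

Unlike the delete-and-recreate step in this cell, this keeps existing embeddings, so restarting the kernel does not force a paid re-ingestion.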

## Cell 4: Query the System (Run Multiple Times!)

**Change the `question` variable and re-run** to explore different queries.

The documents are not re-embedded: each run only embeds the question itself and pays for the LLM call. 🎉

### Try These Questions:
- "How does a Loan Origination System handle income verification?"
- "What is debt-to-credit ratio and how does it affect credit scores?"
- "How does an LMS handle delinquent loans?"
- "What are the DTI requirements for conventional mortgages?"
- "How do credit scores impact loan approval?"
- "What is the difference between hard and soft credit inquiries?"
- "How are escrow accounts managed in loan servicing?"

In [None]:
# 🔧 CHANGE THIS QUESTION AND RE-RUN!
question = "How does a Loan Origination System handle income verification?"

# Number of chunks to retrieve (adjust based on question complexity)
n_results = 3

print(f"{'='*100}")
print(f"❓ QUESTION: {question}")
print(f"{'='*100}")

# Step 1: Retrieve relevant chunks
print(f"\n🔍 Retrieving top {n_results} relevant chunks...\n")

query_embedding = get_embedding(question)

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=n_results,
    include=["documents", "metadatas", "distances"]
)

# Format results
retrieved_chunks = []
for doc, meta, dist in zip(results["documents"][0], results["metadatas"][0], results["distances"][0]):
    similarity = 1 / (1 + dist)
    retrieved_chunks.append({
        "text": doc,
        "title": meta["title"],
        "document_type": meta["document_type"],
        "similarity": similarity
    })

# Display retrieved chunks
print_retrieved_chunks(retrieved_chunks)

# Step 2: Generate answer with context
print(f"\n{'='*100}")
print("🤖 GENERATING ANSWER...")
print(f"{'='*100}\n")

# Build context from retrieved chunks
context_parts = []
sources = set()

for i, chunk in enumerate(retrieved_chunks, 1):
    context_parts.append(f"[Source {i}: {chunk['title']}]\n{chunk['text']}\n")
    sources.add(chunk['title'])

context = "\n".join(context_parts)

# Construct prompt
prompt = f"""You are a finance domain expert assistant specializing in lending systems, credit analysis, and loan processing.

CONTEXT FROM KNOWLEDGE BASE:
{context}

INSTRUCTIONS:
- Answer the question using ONLY the information in the context above
- If the context doesn't contain enough information, say so clearly
- Cite your sources by mentioning the document titles
- Be precise with financial terms and regulatory requirements
- Use industry terminology appropriately

QUESTION: {question}

ANSWER:"""

# Call LLM
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,  # Low temperature for factual finance answers
)

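answer = response.choices[0].message.content
usage = response.usage
tokens_used = usage.total_tokens
# Approximate gpt-4o-mini pricing: input and output tokens are billed at different rates
# (assumed here: $0.15 and $0.60 per 1M tokens; check current pricing)
cost = usage.prompt_tokens * 0.15e-6 + usage.completion_tokens * 0.60e-6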

# Display answer
print(f"{'='*100}")
print("💡 ANSWER:")
print(f"{'='*100}")
print(answer)
print(f"\n{'='*100}")
print(f"📚 Sources: {', '.join(sorted(sources))}")  # sorted for a stable display order
print(f"📊 Tokens used: {tokens_used} | Cost: ${cost:.6f}")
print(f"{'='*100}")

print("\n✨ Try changing the question above and re-running this cell!")
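
The `1 / (1 + dist)` conversion used in this cell is a heuristic: it maps a distance of 0 to a similarity of 1 and pushes larger distances toward 0. The sketch below shows the values it yields for a few distances. It also shows the relation to cosine similarity under two assumptions that the notebook does not state: that the collection uses Chroma's default squared-L2 distance, and that the embedding vectors are unit length.

```python
def distance_to_similarity(dist: float) -> float:
    """Heuristic used in the notebook: distance 0 -> 1.0, large distance -> near 0."""
    return 1 / (1 + dist)


def cosine_from_squared_l2(dist: float) -> float:
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - dist / 2."""
    return 1 - dist / 2


for dist in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"dist={dist:.1f}  similarity={distance_to_similarity(dist):.3f}  "
          f"cosine={cosine_from_squared_l2(dist):.2f}")
```

The printed similarity is therefore a display score rather than a cosine similarity. Under the stated assumptions, a score of 0.4 corresponds to a distance of 1.5.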

## Cell 5: Explore the Knowledge Base

See what documents and chunks are stored

In [None]:
# Collection stats
total_chunks = collection.count()

print(f"{'='*100}")
print("📊 KNOWLEDGE BASE STATISTICS")
print(f"{'='*100}")
print(f"Total chunks: {total_chunks}")
print(f"Storage location: ./finance_notebook_db")

# Get sample chunks from each document type
print(f"\n{'='*100}")
print("📄 SAMPLE CHUNKS BY DOCUMENT TYPE")
print(f"{'='*100}\n")

doc_types = ["loan_origination_system", "loan_management_system", "credit_reports", "underwriting_guidelines"]

for doc_type in doc_types:
    results = collection.get(
        where={"document_type": doc_type},
        limit=1,
        include=["documents", "metadatas"]
    )
    
    if results['documents']:
        meta = results['metadatas'][0]
        doc = results['documents'][0]
        
        print(f"📄 {meta['title']}")
        print(f"   Total chunks: {meta['total_chunks']}")
        print(f"   Sample: {doc[:150]}...")
        print()

## Cell 6: Compare Different Retrieval Strategies

Experiment with different numbers of chunks retrieved

In [None]:
test_question = "What are the key components of underwriting?"

print(f"Testing question: '{test_question}'\n")
print(f"{'='*100}\n")

# Embed the question once and reuse the vector for every value of n
query_embedding = get_embedding(test_question)

for n in [1, 3, 5]:
    print(f"📊 Retrieving TOP {n} chunks:")
    print("-" * 100)
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n,
        include=["metadatas", "distances"]
    )
    
    for i, (meta, dist) in enumerate(zip(results["metadatas"][0], results["distances"][0]), 1):
        similarity = 1 / (1 + dist)
        print(f"  {i}. [{meta['title']}] - Similarity: {similarity:.3f}")
    
    print()

print(f"{'='*100}")
print("💡 OBSERVATION:")
print("More chunks = more context but also more noise")
print("Sweet spot is usually 3-5 chunks for most questions")

## Cell 7: Test Edge Cases

What happens when we ask questions NOT in the knowledge base?

In [None]:
# Question NOT in knowledge base
out_of_scope_question = "How do I deploy a React application to AWS?"

print(f"❓ Out-of-scope question: '{out_of_scope_question}'\n")

query_embedding = get_embedding(out_of_scope_question)

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["documents", "metadatas", "distances"]
)

print("📊 Best matches (even though irrelevant):")
print("=" * 100)

for i, (doc, meta, dist) in enumerate(zip(results["documents"][0], results["metadatas"][0], results["distances"][0]), 1):
    similarity = 1 / (1 + dist)
    print(f"\n{i}. [{meta['title']}] - Similarity: {similarity:.3f}")
    print(f"   {doc[:150]}...")

print(f"\n{'='*100}")
print("💡 IMPORTANT: Notice the similarity scores are much lower!")
print("In production, set a similarity threshold (e.g., 0.4)")
print("If all results are below threshold, return 'I don't have information about that'")
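
The threshold idea from the printed note can be sketched as a small filter applied before the LLM call. `filter_by_similarity` and the fallback text are hypothetical, and the 0.4 cutoff is only the example value mentioned above; it would need tuning on your own documents. The input is assumed to be a list of dicts with a `similarity` key, like `retrieved_chunks` in Cell 4.

```python
from typing import Dict, List

FALLBACK_ANSWER = "I don't have information about that."


def filter_by_similarity(chunks: List[Dict], threshold: float = 0.4) -> List[Dict]:
    """Keep only chunks whose similarity meets the threshold."""
    return [c for c in chunks if c["similarity"] >= threshold]


# Made-up scores for illustration
in_scope = [{"text": "a", "similarity": 0.52}, {"text": "b", "similarity": 0.38}]
out_of_scope = [{"text": "c", "similarity": 0.35}, {"text": "d", "similarity": 0.33}]

for name, chunks in [("in scope", in_scope), ("out of scope", out_of_scope)]:
    kept = filter_by_similarity(chunks)
    if kept:
        print(f"{name}: send {len(kept)} chunk(s) to the LLM")
    else:
        print(f"{name}: {FALLBACK_ANSWER}")
```

Returning the fallback without calling the LLM also avoids paying for a completion that has no usable context.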

## 🎓 Key Takeaways

### What You've Learned:

1. **Domain-Specific RAG**: Finance documents (LOS, LMS, Credit Reports, Underwriting) can be queried semantically

2. **Cost Efficiency**: Embed once (Cell 3), query unlimited times (Cell 4) - saves money!

3. **Semantic Search**: Finds relevant information even with different terminology
   - Ask about "vacation" → finds "PTO"
   - Ask about "DTI" → finds "debt-to-income ratio"

4. **Source Attribution**: Every answer cites sources for auditability (critical in finance!)

5. **Retrieval Strategies**: 
   - More chunks = more context but potential noise
   - Sweet spot: 3-5 chunks
   - Use similarity thresholds to detect out-of-scope questions

### Production Considerations for Finance:

- **Compliance**: Audit logs for all queries (who asked what, when)
- **Version Control**: Track which version of regulations/policies was used
- **Metadata Filtering**: Filter by effective date, regulation type, jurisdiction
- **Confidence Scores**: Flag low-confidence answers for human review
- **Access Control**: Role-based access to different document types
- **Regulatory Updates**: Easy document refresh without system changes
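
As one illustration of the compliance point, the sketch below appends one JSON line per query to a local file. The file name, the field names, and the `log_query` helper are all hypothetical; a real deployment would more likely write to a central, access-controlled store.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def log_query(log_path: Path, user_id: str, question: str, sources: List[str]) -> dict:
    """Append one audit record: who asked what, when, and which sources were used."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "question": question,
        "sources": sorted(sources),
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


# Written to the system temp directory so the example does not clutter the project
log_file = Path(tempfile.gettempdir()) / "rag_audit_log.jsonl"
record = log_query(
    log_file,
    user_id="analyst_01",
    question="What are the DTI requirements for conventional mortgages?",
    sources=["Underwriting Guidelines"],
)
print(record["timestamp"], record["user_id"])
```

An append-only JSON Lines file keeps each query as an independent record, which makes it straightforward to filter by user or time range later.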

### Next Steps:

1. Add your own finance documents (replace sample_docs/finance)
2. Experiment with chunk sizes (Cell 3: `sentences_per_chunk`)
3. Try different retrieval amounts (Cell 4: `n_results`)
4. Implement metadata filtering for your use case
5. Build a production API with FastAPI
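
For the metadata-filtering step, Chroma's `query` and `get` accept a `where` dictionary, as Cell 5 already does with `document_type`. The sketch below builds such a filter. `build_where` is a hypothetical helper, `jurisdiction` is a hypothetical metadata field that the ingestion cell does not currently store, and the `$and` / `$in` operator spelling should be checked against the Chroma version you use.

```python
from typing import Dict, List, Optional


def build_where(document_types: Optional[List[str]] = None,
                jurisdiction: Optional[str] = None) -> Optional[Dict]:
    """Build a Chroma-style `where` filter from optional criteria."""
    clauses = []
    if document_types:
        clauses.append({"document_type": {"$in": document_types}})
    if jurisdiction:
        clauses.append({"jurisdiction": jurisdiction})
    if not clauses:
        return None          # no filtering
    if len(clauses) == 1:
        return clauses[0]    # a single condition needs no $and wrapper
    return {"$and": clauses}


print(build_where(["underwriting_guidelines", "credit_reports"]))
print(build_where(["underwriting_guidelines"], jurisdiction="US"))

# Usage in Cell 4 (assumes the notebook's `collection` and `query_embedding`):
# results = collection.query(query_embeddings=[query_embedding], n_results=3,
#                            where=build_where(["underwriting_guidelines"]))
```

Filtering on a field such as `jurisdiction` only works if that field is written into the metadata dictionary during ingestion in Cell 3.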

🎯 **You now understand RAG for the finance domain!**