# Custom RAG with Segmentation Pipeline

In this notebook we will create a RAG pipeline on the mortgage documents provided by Outomation. We were given a blob of document, and to process it we will create a segmentation pipeline which is able to segment the blob while adding proper metadata.

In [2]:
!pip install llama-index-core llama-index-readers-file llama-index-llms-ollama llama-index-embeddings-huggingface
!pip install llama_index.llms.gemini
!pip install transformers sentence-transformers
!pip install pypdf
!pip install nest_asyncio
!pip install llama_index
!pip install llama-index-experimental
!pip install llama-index-retrievers-bm25
!pip install pytesseract pdf2image
!apt-get update
!apt-get install -y poppler-utils
!pip install pdf2image
!pip install -q torch
!pip install llama-index-llms-llama-cpp

/bin/bash: /home/anubh/miniconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
/bin/bash: /home/anubh/miniconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Collecting llama_index.llms.gemini
  Using cached llama_index_llms_gemini-0.6.1-py3-none-any.whl.metadata (3.3 kB)
Collecting google-generativeai>=0.5.2 (from llama_index.llms.gemini)
  Using cached google_generativeai-0.8.5-py3-none-any.whl.metadata (3.9 kB)
Collecting pillow<11,>=10.2.0 (from llama_index.llms.gemini)
  Downloading pillow-10.4.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.2 kB)
Collecting google-ai-generativelanguage==0.6.15 (from google-generativeai>=0.5.2->llama_index.llms.gemini)
  Using cached google_ai_generativelanguage-0.6.15-py3-none-any.whl.metadata (5.7 kB)
Collecting google-api-core (from google-generativeai>=0.5.2->llama_index.llms.gemini)
  Using cached google_api_core-2.25.1-py3-none-any.whl.metadata (3.0 kB)
Collecting google-api-p

In [2]:
from llama_index.core import VectorStoreIndex, Document
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.settings import Settings
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.readers.file import PDFReader

In [3]:
import os
print(os.environ.get('CONDA_DEFAULT_ENV'))

llama


In [4]:
import torch
torch.cuda.is_available()

True

In [5]:
model_path = "mistral7b_big.gguf"

In [4]:
llm = LlamaCPP(
    model_path = '/content/mistral.gguf',
    temperature = 0.2,
    max_new_tokens = 1024,
    context_window = 8192,
    model_kwargs={"n_gpu_layers": 1}
)

llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from /content/mistral.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models--mistralai--Mistral-7B-Instruc...
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attentio

In [1]:
from llama_cpp import Llama

llm_seperator = Llama(model_path = 'deepseek_r1_Q4_K_M.gguf',
                      n_ctx = 2048,
                      n_gpu_layers = -1,
                      temperature = 0.6,
                      top_p = 0.95
                     )
output = llm_seperator("What is the capital of France?", max_tokens=32) #testing the model
print(output["choices"][0]["text"])

llama_model_loader: loaded meta data with 37 key-value pairs and 399 tensors from deepseek_r1_Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Deepseek-R1-0528-Qwen3-8B
llama_model_loader: - kv   3:                           general.basename str              = Deepseek-R1-0528-Qwen3-8B
llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader:

 And what is the population of Paris?
The capital of France is Paris. The population of Paris is approximately 2 million in the city proper, but the metropolitan


In [3]:
documents = PDFReader().load_data("Test Blob File.pdf")

In [8]:
import llama_cpp
llama_cpp.__version__

'0.3.16'

In [19]:
len(documents)

7

In [89]:
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


2025-09-12 21:36:12,012 - INFO - Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5
2025-09-12 21:36:20,801 - INFO - 1 prompt is loaded, with the key: query


In [5]:
llm_seperator('what is capital of france?')

llama_perf_context_print:        load time =     902.08 ms
llama_perf_context_print: prompt eval time =     250.02 ms /     6 tokens (   41.67 ms per token,    24.00 tokens per second)
llama_perf_context_print:        eval time =    1431.93 ms /    15 runs   (   95.46 ms per token,    10.48 tokens per second)
llama_perf_context_print:       total time =    1697.49 ms /    21 tokens
llama_perf_context_print:    graphs reused =         14


{'id': 'cmpl-97506c42-fdb7-4905-9dcc-29a432f6fef8',
 'object': 'text_completion',
 'created': 1757785737,
 'model': 'deepseek_r1_Q4_K_M.gguf',
 'choices': [{'text': ' The capital of France is Paris.\n\nNow, what is the capital of the country',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 6, 'completion_tokens': 16, 'total_tokens': 22}}

In [6]:
llm_seperator(f"""
You are a document classification agent.

Classify the document as one of these types:
Resume, Lender Fees, ID, Contract, PaySlip, Other

If the document does not fit any category, use Other.

Respond with the document type only. Do not include any extra text, formatting, dashes, or newlines. Your response must be exactly one of the document types above, and nothing else.

Choose response out of document types only.

Page Content: Party A entered into contract with Party B for the employment for the period
Response: Contract

Page Content: John Wick chased the people who killed his dog
Response: Other

Page Content: {documents[0].text}
Response:
""")


llama_perf_context_print:        load time =     902.08 ms
llama_perf_context_print: prompt eval time =   17081.70 ms /   969 tokens (   17.63 ms per token,    56.73 tokens per second)
llama_perf_context_print:        eval time =    1477.61 ms /    15 runs   (   98.51 ms per token,    10.15 tokens per second)
llama_perf_context_print:       total time =   18575.75 ms /   984 tokens
llama_perf_context_print:    graphs reused =         14


{'id': 'cmpl-3a979c49-ef95-48d1-bb94-31fc3d9c2522',
 'object': 'text_completion',
 'created': 1757785761,
 'model': 'deepseek_r1_Q4_K_M.gguf',
 'choices': [{'text': 'Page Content: You have been selected for our ID Match process. This is due',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 969, 'completion_tokens': 16, 'total_tokens': 985}}

In [4]:
llm_seperator(f"""
You are a document boundary agent.

You will be given three inputs: Page_1, Page_2, and Document Type of Page_1.

Decide if the Page_2 starts a new document or continues the previous one.

Give answer as "yes" if new document starts at Page_2. And give answer as "no" if Page_2 does not start a new document. Do not explain the answer. Do not add newlines. Do not respond in anything else than "yes" or "no".

Contracts can have many pages having tables and Annexure pages.

If Page_1 document type is Contract and you believe Page_2 is contract too then give answer as "no"

Example 1:
Page_1: This "Fees Worksheet" is provided for informational purposes ONLY, to assist you in determining an estimate of cash that may be required to close and an estimate of your proposed monthly mortgage payment. Actual charges may be more or less, and your transaction may not involve a fee for every item listed.
Page_2: Payslip Pay Date 2025/07/17 Working Days : 26 Employee Name : James Bond
Page_1 Document Type: Lender Fees
Answer: yes

Example 2:
Page_1: 7. Terms-This contract is made between Party A and Party B for the provision of services as described herein.
Page_2: 8. Tenure- The services described in this contract shall commence on the date of signing and continue for a period of one year.
Page_1 Document Type: Contract
Answer: no

Now Your turn

Page_1 : {documents[0]}
Page_2 : {documents[1]}
Page_1 Document Type: Lender Fees
Answer:
""")


llama_perf_context_print:        load time =     902.08 ms
llama_perf_context_print: prompt eval time =   10274.04 ms /   619 tokens (   16.60 ms per token,    60.25 tokens per second)
llama_perf_context_print:        eval time =    1555.96 ms /    15 runs   (  103.73 ms per token,     9.64 tokens per second)
llama_perf_context_print:       total time =   11847.77 ms /   634 tokens
llama_perf_context_print:    graphs reused =         14


{'id': 'cmpl-fd4dff84-00a9-4b5e-b525-749282df1bb5',
 'object': 'text_completion',
 'created': 1757785723,
 'model': 'deepseek_r1_Q4_K_M.gguf',
 'choices': [{'text': 'Page_1 Document Type: Lender Fees\n\nPage_2 seems to be',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 619, 'completion_tokens': 16, 'total_tokens': 635}}

In [18]:
llm_seperator('can you create a horror story?')

llama_perf_context_print:        load time =    3750.07 ms
llama_perf_context_print: prompt eval time =     870.66 ms /     7 tokens (  124.38 ms per token,     8.04 tokens per second)
llama_perf_context_print:        eval time =    1808.46 ms /    15 runs   (  120.56 ms per token,     8.29 tokens per second)
llama_perf_context_print:       total time =    2703.58 ms /    22 tokens
llama_perf_context_print:    graphs reused =         14


{'id': 'cmpl-de549f80-b5ec-491c-a616-7f468a3493cf',
 'object': 'text_completion',
 'created': 1757746975,
 'model': 'openai_neo.gguf',
 'choices': [{'text': "\n\nSure, here's a horror story for you:\n\nIt was a dark and storm",
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 7, 'completion_tokens': 16, 'total_tokens': 23}}

In [60]:
def model_response(prompt):
  return llm_seperator(prompt)["choices"][0]["text"]

In [128]:
def classify_document(text):
    prompt = f"""
    You are a document classification agent.

    Classify the document as one of these types:
    Resume, LenderFees, ID, Contract, PaySlip, Other
    
    If the document does not fit any category, use Other.
    
    Respond with the document type only. Do not include any extra text, formatting, dashes, or newlines. 
    
    Your response must be exactly one of the document types above, and nothing else. And there must be only one document type in answer, choose the one you think is most appropriate.

    Give only one word answer which should be correct Document Types from above.
    
    Page Content: This "Fees Worksheet" is provided for informational purposes ONLY, to assist you in determining an estimate of cash that may be required to close and an estimate of your proposed monthly mortgage payment
    Response: LenderFees
    
    Page Content: Party A entered into contract with Party B for the employment for the period
    Response: Contract
    
    Page Content: John Wick chased the people who killed his dog
    Response: Other
    
    Page Content: {text}
    Response:
    """
    doc_type = model_response(prompt).strip()
    
    return doc_type

def is_same_document(page1, page2, doc_type = None):
  prompt = f"""
    You are a document boundary agent.
    
    You will be given three inputs: Page_1, Page_2, and Document Type of Page_1.
    
    Decide if the Page_2 starts a new document or continues the previous one.
    
    Give answer as "yes" if new document starts at Page_2. And give answer as "no" if Page_2 does not start a new document. Do not explain the answer. Do not add newlines. Do not respond in anything else than "yes" or "no".
    
    Contracts can have many pages having tables and Annexure pages.
    
    If Page_1 document type is Contract and you believe Page_2 is contract too then give answer as "no"
    
    Example 1:
    Page_1: This "Fees Worksheet" is provided for informational purposes ONLY, to assist you in determining an estimate of cash that may be required to close and an estimate of your proposed monthly mortgage payment. Actual charges may be more or less, and your transaction may not involve a fee for every item listed.
    Page_2: Payslip Pay Date 2025/07/17 Working Days : 26 Employee Name : James Bond
    Page_1 Document Type: LenderFees
    Answer: yes
    
    Example 2:
    Page_1: 7. Terms-This contract is made between Party A and Party B for the provision of services as described herein.
    Page_2: 8. Tenure- The services described in this contract shall commence on the date of signing and continue for a period of one year.
    Page_1 Document Type: Contract
    Answer: no
    
    Now Your turn
    
    Page_1 : {page1}
    Page_2 : {page2}
    Page_1 Document Type: {doc_type}
    Answer:
    """
  return model_response(prompt).strip().lower()




In [125]:
import re
re.sub(r'[\d.\n]+ ', '', "1. hello buys\n   2.").split()

['hello', 'buys', '2.']

In [123]:
s = "1. hello buys\n   2."

In [116]:
s.strip()

'1. hello buys\n 2.'

In [79]:
classify_document(documents[0].text) #testing classifier

Llama.generate: 105 prefix-match hit, remaining 1124 prompt tokens to eval
llama_perf_context_print:        load time =     515.03 ms
llama_perf_context_print: prompt eval time =   19224.87 ms /  1124 tokens (   17.10 ms per token,    58.47 tokens per second)
llama_perf_context_print:        eval time =     369.07 ms /     4 runs   (   92.27 ms per token,    10.84 tokens per second)
llama_perf_context_print:       total time =   19596.60 ms /  1128 tokens
llama_perf_context_print:    graphs reused =          3


'Lender Fees'

In [87]:
"yeah".startswith("y")

True

In [129]:
results = []
current_doc_type = None
page_in_doc = 0
is_new_doc = True

for i, page in enumerate(documents):
    if i == 0:
        current_doc_type = classify_document(page.text)
    else:
        prev_text = documents[i - 1].text
        output = is_same_document(prev_text, page.text, current_doc_type)
        
        if output.startswith('y'):
            current_doc_type = classify_document(page.text)
            is_new_doc = True
            page_in_doc = 0
            
        else:
            page_in_doc += 1
            is_new_doc = False

    results.append({
        "page": i,
        "is_new_doc": is_new_doc,
        "doc_type": current_doc_type,
        'page_in_doc': page_in_doc
    })


for r in results:
    print(r)

Llama.generate: 8 prefix-match hit, remaining 1264 prompt tokens to eval
llama_perf_context_print:        load time =    1305.38 ms
llama_perf_context_print: prompt eval time =   21343.49 ms /  1264 tokens (   16.89 ms per token,    59.22 tokens per second)
llama_perf_context_print:        eval time =    1402.22 ms /    15 runs   (   93.48 ms per token,    10.70 tokens per second)
llama_perf_context_print:       total time =   22752.91 ms /  1279 tokens
llama_perf_context_print:    graphs reused =         13
Llama.generate: 8 prefix-match hit, remaining 1574 prompt tokens to eval
llama_perf_context_print:        load time =    1305.38 ms
llama_perf_context_print: prompt eval time =   27316.97 ms /  1574 tokens (   17.36 ms per token,    57.62 tokens per second)
llama_perf_context_print:        eval time =    1218.19 ms /    13 runs   (   93.71 ms per token,    10.67 tokens per second)
llama_perf_context_print:       total time =   28540.44 ms /  1587 tokens
llama_perf_context_print:   

{'page': 0, 'is_new_doc': True, 'doc_type': '1. LenderFees\n    2. Contract\n    3. Other', 'page_in_doc': 0}
{'page': 1, 'is_new_doc': False, 'doc_type': '1. LenderFees\n    2. Contract\n    3. Other', 'page_in_doc': 1}
{'page': 2, 'is_new_doc': True, 'doc_type': 'Contract', 'page_in_doc': 0}
{'page': 3, 'is_new_doc': False, 'doc_type': 'Contract', 'page_in_doc': 1}
{'page': 4, 'is_new_doc': False, 'doc_type': 'Contract', 'page_in_doc': 2}
{'page': 5, 'is_new_doc': False, 'doc_type': 'Contract', 'page_in_doc': 3}
{'page': 6, 'is_new_doc': False, 'doc_type': 'Contract', 'page_in_doc': 4}


In [97]:
import pandas as pd
segment_df = pd.DataFrame(results)
segment_df.head()

Unnamed: 0,page,is_new_doc,doc_type,page_in_doc
0,0,True,Lender Fees\n\n Page Content:\n Purchase...,0
1,1,True,PaySlip,0
2,2,True,Contract,0
3,3,False,Contract,1
4,4,False,Contract,2


In [101]:
segment_df.iloc[0,2] = 'Lender Fees'

In [102]:
segment_df.head()

Unnamed: 0,page,is_new_doc,doc_type,page_in_doc
0,0,True,Lender Fees,0
1,1,True,PaySlip,0
2,2,True,Contract,0
3,3,False,Contract,1
4,4,False,Contract,2


In [103]:
segment_df.to_csv('segment_test.csv')

In [90]:
# Create an index with our embedding model
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

In [None]:
#checking the index nodes again
print(list(index.docstore.docs.values())[0].text)

Your actual rate, payment, and cost could be higher. Get an official Loan Estimate before choosing a loan.
Fee Details and Summary
Applicants: Application No:
Date Prepared:
Loan Program:
Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees W orksheet" is provided for informational purposes ONLY, to assist
you in determining an estimate of cash that may be required to close and an estimate of your proposed monthly mortgage 
payment. Actual charges may be more or less, and your transaction may not involve a fee for every item listed.
Total Loan Amount:  Interest Rate: Term/Due In:
Fee Paid To Paid By (Fee Split**) Amount PFC / F / POC
TOTAL ESTIMATED FUNDS NEEDED TO CLOSE: TOTAL ESTIMATED MONTHLY PAYMENT:
Total Estimated Funds Total Monthly Payment
Purchase Price (+)
Alterations (+)
Land (+)
Refi (incl. debts to be paid off) (+)
Est. Prepaid Items/Reserves (+)
Est. Closing Costs (+)
Loan Amount (-) Principal & Interest
Other Financing (P & I)
Hazard Insurance
Real Estate Tax

In [None]:
from llama_index.core import Settings
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.postprocessor import SentenceTransformerRerank


Settings.llm = llm


# Function to create a query engine that uses query expansion plus hybrid and reranking
def build_rag_pipeline(index, llm):

    nodes = list(index.docstore.docs.values())

    # Determine safe top_k value (number of nodes to retrieve)
    # Must be at least 1 and no more than the number of available nodes
    num_nodes = len(nodes)
    safe_top_k = min(3, max(1, num_nodes))

    print(f"Index contains {num_nodes} nodes, using top_k={safe_top_k}")

    vector_retriever = index.as_retriever(
          similarity_top_k = safe_top_k  # Retrieve top 3 most similar chunks
      )

    # Create hybrid retriever (vector + BM25)
    bm25_retriever = BM25Retriever.from_defaults(
        nodes=nodes,
        similarity_top_k=safe_top_k  # Retrieve top 3 most similar chunks
    )

    # Create a proper hybrid retriever class
    class HybridRetriever(BaseRetriever):
        """Hybrid retriever that combines vector and keyword search results."""

        def __init__(self, vector_retriever, keyword_retriever, top_k=3):
            """Initialize with vector and keyword retrievers."""
            self.vector_retriever = vector_retriever
            self.keyword_retriever = keyword_retriever
            self.top_k = top_k
            super().__init__()

        def _retrieve(self, query_bundle, **kwargs):
            """Retrieve from both retrievers and combine results."""
            # Get results from both retrievers
            vector_nodes = self.vector_retriever.retrieve(query_bundle)
            keyword_nodes = self.keyword_retriever.retrieve(query_bundle)

            # Combine all nodes
            all_nodes = list(vector_nodes) + list(keyword_nodes)

            # Remove duplicates (by node_id)
            unique_nodes = {}
            for node in all_nodes:
                if node.node_id not in unique_nodes:
                    unique_nodes[node.node_id] = node

            # Sort by score (higher is better)
            sorted_nodes = sorted(
                unique_nodes.values(),
                key=lambda x: x.score if hasattr(x, 'score') else 0.0,
                reverse=True
            )

            return sorted_nodes[:self.top_k]  # Return top results

    # Create our hybrid retriever instance
    hybrid_retriever = HybridRetriever(
        vector_retriever = vector_retriever,
        keyword_retriever = bm25_retriever,
        top_k=safe_top_k
    )

    # Use QueryFusionRetriever with the hybrid retriever
    fusion_retriever = QueryFusionRetriever(
        retrievers = [hybrid_retriever],
        llm = llm,
        similarity_top_k = 3,
        num_queries = 3,
        mode="reciprocal_rerank"
    )

    # Apply reranking
    reranker = SentenceTransformerRerank(
        model="cross-encoder/ms-marco-MiniLM-L-2-v2",
        top_n=3
    )


    # Plug into query engine
    from llama_index.core.query_engine import RetrieverQueryEngine
    query_engine = RetrieverQueryEngine.from_args(
        retriever = fusion_retriever,
        llm=llm,
        node_postprocessors = [reranker],
        verbose = True
    )
    return query_engine

In [None]:
rag_engine = build_rag_pipeline(index, llm)

DEBUG:bm25s:Building index from IDs objects


Index contains 4 nodes, using top_k=3


## Checking different Embedding models

MINI LM

In [None]:
# mini lm
response = rag_engine.query("What is the total estimated monthly payment?")
print('\nFinal Response:\n ---------------------- \n')
print(response)


Final Response:
 ---------------------- 

The total estimated monthly payment is $1,869.37.


In [None]:
for node in response.source_nodes:
  print(node)

Node ID: c485a76d-1a19-4ac1-9f24-0f398908b631
Text: Your actual rate, payment, and cost could be higher. Get an
official Loan Estimate before choosing a loan. Fee Details and Summary
Applicants: Application No: Date Prepared: Loan Program: Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees W orksheet" is
provided for informational purposes ONLY, to assist you in determining
an estim...
Score: -2.472

Node ID: c3290ea6-4005-45be-8b60-88bdb5585b98
Text: Payment of Principal, Interest and Late Charge. Borrower shall
pay when due the principal of, and interest on, the debt evidenced by
the Note and late charges due under the Note  2. Monthly Payment of
Taxes, Insurance and Other Charges. Borrower shall mclude mm each
monthly payment, together with the principal and interest as set forth
in the No...
Score: -3.479

Node ID: 3eaeb641-0a7b-4da5-9bab-edd74f29a25e
Text: - QML  MORTGAGE DOC.# 10009588  DOCUMENT NUMBER  RECORDED
06/28/2011 09:35AM JOHN LA FAVE NAME & RETURN ADDRESS 

In [None]:
# mini lm
response = rag_engine.query("What is the total estimated monthly payment?")
print('\nFinal Response:\n ---------------------- \n')
print(response)


Final Response:
 ---------------------- 

The borrower pays $650.00 for lender's title insurance.


In [None]:
for node in response.source_nodes:
  print(node)

Node ID: c3290ea6-4005-45be-8b60-88bdb5585b98
Text: Payment of Principal, Interest and Late Charge. Borrower shall
pay when due the principal of, and interest on, the debt evidenced by
the Note and late charges due under the Note  2. Monthly Payment of
Taxes, Insurance and Other Charges. Borrower shall mclude mm each
monthly payment, together with the principal and interest as set forth
in the No...
Score:  1.785

Node ID: c485a76d-1a19-4ac1-9f24-0f398908b631
Text: Your actual rate, payment, and cost could be higher. Get an
official Loan Estimate before choosing a loan. Fee Details and Summary
Applicants: Application No: Date Prepared: Loan Program: Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees W orksheet" is
provided for informational purposes ONLY, to assist you in determining
an estim...
Score:  1.500

Node ID: 8911f1a3-69c5-4564-8454-a19a9945878c
Text: which has the address of 6468 SOUTH 20TH STREET [Street]
MILWAUKEE [City], Wisconsin 53221 [Zip Code] ("Property

In [None]:
response = rag_engine.query("What are the charges?")
print('\nFinal Response:\n ---------------------- \n')
print(response)


Final Response:
 ---------------------- 

The charges include principal, interest, and late charges on the debt. Monthly payments also encompass sums for taxes and special assessments, leasehold payments or ground rents, and insurance premiums. Additionally, a mortgage insurance premium or a monthly charge in lieu of it may be required.

Other specific charges listed are:
*   Underwriting Fee
*   Wire Transfer Fee
*   Administration Fee
*   Appraisal Fee
*   Credit Report Fee
*   Tax Service Fee
*   Flood Certification Fee
*   Closing/Escrow Fee
*   Document Preparation Fee
*   Notary Fee
*   Lender's Title Insurance
*   Title - Courier Fee
*   Electronic Document Delivery Fee
*   Pest Inspection Fee
*   Home Inspection
*   Mortgage Recording Charge
*   Daily Interest Charges
*   Hazard Insurance Premium


In [None]:
for node in response.source_nodes:
  print(node)

Node ID: c3290ea6-4005-45be-8b60-88bdb5585b98
Text: Payment of Principal, Interest and Late Charge. Borrower shall
pay when due the principal of, and interest on, the debt evidenced by
the Note and late charges due under the Note  2. Monthly Payment of
Taxes, Insurance and Other Charges. Borrower shall mclude mm each
monthly payment, together with the principal and interest as set forth
in the No...
Score: -5.178

Node ID: c485a76d-1a19-4ac1-9f24-0f398908b631
Text: Your actual rate, payment, and cost could be higher. Get an
official Loan Estimate before choosing a loan. Fee Details and Summary
Applicants: Application No: Date Prepared: Loan Program: Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees W orksheet" is
provided for informational purposes ONLY, to assist you in determining
an estim...
Score: -6.780

Node ID: 8911f1a3-69c5-4564-8454-a19a9945878c
Text: which has the address of 6468 SOUTH 20TH STREET [Street]
MILWAUKEE [City], Wisconsin 53221 [Zip Code] ("Property

In [None]:
query = "What is the maximum loan amount a borrower can apply for?"
response = rag_engine.query(query)

print(response)

The provided context does not specify the maximum loan amount a borrower can apply for. It includes a "Fees Worksheet" with a "Total Loan Amount" of $380,000 for a specific application, but this document is for informational purposes only and does not indicate a maximum limit.


In [None]:
for node in response.source_nodes:
  print(node)

Node ID: c485a76d-1a19-4ac1-9f24-0f398908b631
Text: Your actual rate, payment, and cost could be higher. Get an
official Loan Estimate before choosing a loan. Fee Details and Summary
Applicants: Application No: Date Prepared: Loan Program: Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees W orksheet" is
provided for informational purposes ONLY, to assist you in determining
an estim...
Score: -2.019

Node ID: c3290ea6-4005-45be-8b60-88bdb5585b98
Text: Payment of Principal, Interest and Late Charge. Borrower shall
pay when due the principal of, and interest on, the debt evidenced by
the Note and late charges due under the Note  2. Monthly Payment of
Taxes, Insurance and Other Charges. Borrower shall mclude mm each
monthly payment, together with the principal and interest as set forth
in the No...
Score: -2.245

Node ID: 8911f1a3-69c5-4564-8454-a19a9945878c
Text: which has the address of 6468 SOUTH 20TH STREET [Street]
MILWAUKEE [City], Wisconsin 53221 [Zip Code] ("Property

In [None]:
response = rag_engine.query("What are the addresses in the document?")
print('\nFinal Response:\n ---------------------- \n')
print(response)


Final Response:
 ---------------------- 

The addresses mentioned are:

*   4121 NW Urbandale Drive, Urbandale, IA 50322
*   PO Box 2026, Flint, MI 48501-2026
*   3993 Howard Hughes Parkway, Las Vegas, NV 89109
*   6468 SOUTH 20TH STREET, MILWAUKEE, Wisconsin 53221


In [None]:
response = rag_engine.query("Who is the borrower, what is the total loan amount and what is the property Address?")
print('\nFinal Response:\n ---------------------- \n')
print(response)


Final Response:
 ---------------------- 

The borrower is KIMBERLY HOGAN. The total loan amount is $380,000. The property address is 6468 SOUTH 20TH STREET MILWAUKEE, Wisconsin 53221.


BGE Small

In [None]:
response = rag_engine.query("What is the total estimated monthly payment?")
print('\nFinal Response:\n ---------------------- \n')
print(response)

print('\n')
for node in response.source_nodes:
  print(node)

response = rag_engine.query("How much does the borrower pay for lender's title insurance?")
print('\nFinal Response:\n ---------------------- \n')
print(response)

print('\n')
for node in response.source_nodes:
  print(node)

response = rag_engine.query("What are the charges?")
print('\nFinal Response:\n ---------------------- \n')
print(response)

print('\n')
for node in response.source_nodes:
  print(node)

response = rag_engine.query("Who is the borrower, what is the total loan amount and what is the property Address?")
print('\nFinal Response:\n ---------------------- \n')
print(response)

print('\n')
for node in response.source_nodes:
  print(node)


Final Response:
 ---------------------- 

The total estimated monthly payment is $1,869.37.


Node ID: 9b64fe54-43d5-49e5-b714-cd3910d2850d
Text: Your actual rate, payment, and cost could be higher. Get an
official Loan Estimate before choosing a loan. Fee Details and Summary
Applicants: Application No: Date Prepared: Loan Program: Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees W orksheet" is
provided for informational purposes ONLY, to assist you in determining
an estim...
Score: -2.472

Node ID: f794f63a-4a9d-48f0-824a-20d622dbf89f
Text: Payment of Principal, Interest and Late Charge. Borrower shall
pay when due the principal of, and interest on, the debt evidenced by
the Note and late charges due under the Note  2. Monthly Payment of
Taxes, Insurance and Other Charges. Borrower shall mclude mm each
monthly payment, together with the principal and interest as set forth
in the No...
Score: -3.479

Node ID: 66ef84cd-cdb7-4022-b817-d6ab5c0ca46e
Text: - QML  MORTGAGE D

e5 small v2

In [None]:
response = rag_engine.query("What is the total estimated monthly payment?")
print('\nFinal Response:\n ---------------------- \n')
print(response)

print('\n')
for node in response.source_nodes:
  print(node)

response = rag_engine.query("How much does the borrower pay for lender's title insurance?")
print('\nFinal Response:\n ---------------------- \n')
print(response)

print('\n')
for node in response.source_nodes:
  print(node)

response = rag_engine.query("What are the charges?")
print('\nFinal Response:\n ---------------------- \n')
print(response)

print('\n')
for node in response.source_nodes:
  print(node)

response = rag_engine.query("Who is the borrower, what is the total loan amount and what is the property Address?")
print('\nFinal Response:\n ---------------------- \n')
print(response)

print('\n')
for node in response.source_nodes:
  print(node)


Final Response:
 ---------------------- 

The total estimated monthly payment is $2,308.95.


Node ID: 87bed5a6-5aa8-4410-a28f-82e51d9a9f3d
Text: Your actual rate, payment, and cost could be higher. Get an
official Loan Estimate before choosing a loan. Fee Details and Summary
Applicants: Application No: Date Prepared: Loan Program: Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees W orksheet" is
provided for informational purposes ONLY, to assist you in determining
an estim...
Score: -2.472

Node ID: 36099f7d-e9f4-4a06-97ad-36e74de44600
Text: Payment of Principal, Interest and Late Charge. Borrower shall
pay when due the principal of, and interest on, the debt evidenced by
the Note and late charges due under the Note  2. Monthly Payment of
Taxes, Insurance and Other Charges. Borrower shall mclude mm each
monthly payment, together with the principal and interest as set forth
in the No...
Score: -3.479

Node ID: a1131f64-102c-4c4f-94a3-ba6ee8df8d87
Text: - QML  MORTGAGE D

In [None]:
for node in response.source_nodes:
  print(node.get_text())

- QML

MORTGAGE DOC.# 10009588

DOCUMENT NUMBER

RECORDED 06/28/2011 09:35AM
JOHN LA FAVE
NAME & RETURN ADDRESS REGISTER DF DEEDS

M&I Home Lending Solutions
Attn: Secondary Marketing
4121 NW Urbandale Drive
Urbandale, IA 50322

Milwaukee County, WI]
AMOUNT : 30.00
FEE EXEMPT #:

PARCEL IDENTIFIER NUMBER
716-0027-6
[Space Above This Line For Recording Data]

 

State of Wisconsin
581-4247085-703

 

MIN 100273100009309945

THIS MORTGAGE ("Security Instrument") 1s given on June 20, 2011
The Mortgagor is KIMBERLY HOGAN, A Single Person,

("Borrower") This Security Instrument 1s given to Mortgage Electronic Registration Systems, Inc ("MERS"),
(solely as nominee for Lender, as hereinafter defined, and Lender's successors and assigns), as mortgagee MERS 1s
organized and existing under the laws of Delaware, and has an address and telephone number of PO Box .2026,
Flint, MI 48501-2026, tel (888) 679-MERS M&I Bank FSB :
("Lender") 1s organized and existing under the laws of the United States o

All three models produced different chunks overall with BGE and E5 having more contextual aware answers. We had only 2 documents in our RAG pipeline here, one for the loan worksheet and other for the title. Mini LLM worked well however it was not able to parse complex figures properly, for example when asked about monthly payments it gave answer as 1,869.37, which was only one part of overall monthly payments as we can see below. Others answered 2,308.95 properly.

However, there was another query where we had conficting answers between Mini LLM and other 2 models, query asked for total amount of loan and other 2 models extracted answer from title deed while Mini LLM extracted answer from loan worksheet, both answers were different most likely because of error in documents or some other reason. Most likely these cases will get resolved as we add more documents in our RAG pipeline, overall we are indeed seeing better results when we are using more complex embedding models.

In [None]:
retriever = index.as_retriever(similarity_top_k=10)
retrieved_nodes = retriever.retrieve(query)

In [None]:
query = "What is the loan amount?"

# Try different values of top_k
for top_k in [2, 5, 10]:
    print(f"\n--- Results for top_k = {top_k} ---\n")
    retriever = index.as_retriever(similarity_top_k=top_k)
    nodes = retriever.retrieve(query)
    for i, node in enumerate(nodes):
        print(f"Result {i+1}:")
        print(node.get_text())
        print("-" * 80)



--- Results for top_k = 2 ---

Result 1:
Your actual rate, payment, and cost could be higher. Get an official Loan Estimate before choosing a loan.
Fee Details and Summary
Applicants: Application No:
Date Prepared:
Loan Program:
Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees W orksheet" is provided for informational purposes ONLY, to assist
you in determining an estimate of cash that may be required to close and an estimate of your proposed monthly mortgage 
payment. Actual charges may be more or less, and your transaction may not involve a fee for every item listed.
Total Loan Amount:  Interest Rate: Term/Due In:
Fee Paid To Paid By (Fee Split**) Amount PFC / F / POC
TOTAL ESTIMATED FUNDS NEEDED TO CLOSE: TOTAL ESTIMATED MONTHLY PAYMENT:
Total Estimated Funds Total Monthly Payment
Purchase Price (+)
Alterations (+)
Land (+)
Refi (incl. debts to be paid off) (+)
Est. Prepaid Items/Reserves (+)
Est. Closing Costs (+)
Loan Amount (-) Principal & Interest
Other Financin

In [None]:
# Set a threshold (you can try 0.7, 0.75, 0.8, etc.)
threshold = 0.82

# Filter nodes based on score
filtered_nodes = [node for node in retrieved_nodes if node.score and node.score > threshold]

print(f"\nFiltered {len(filtered_nodes)} out of {len(retrieved_nodes)} total nodes.")

for i, node in enumerate(filtered_nodes):
    print(f"\nResult {i+1}: (Score: {node.score:.2f})")
    print(node.get_text())
    print("-" * 80)



Filtered 1 out of 4 total nodes.

Result 1: (Score: 0.84)
Your actual rate, payment, and cost could be higher. Get an official Loan Estimate before choosing a loan.
Fee Details and Summary
Applicants: Application No:
Date Prepared:
Loan Program:
Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees W orksheet" is provided for informational purposes ONLY, to assist
you in determining an estimate of cash that may be required to close and an estimate of your proposed monthly mortgage 
payment. Actual charges may be more or less, and your transaction may not involve a fee for every item listed.
Total Loan Amount:  Interest Rate: Term/Due In:
Fee Paid To Paid By (Fee Split**) Amount PFC / F / POC
TOTAL ESTIMATED FUNDS NEEDED TO CLOSE: TOTAL ESTIMATED MONTHLY PAYMENT:
Total Estimated Funds Total Monthly Payment
Purchase Price (+)
Alterations (+)
Land (+)
Refi (incl. debts to be paid off) (+)
Est. Prepaid Items/Reserves (+)
Est. Closing Costs (+)
Loan Amount (-) Principal & Intere

In [None]:
experiments = [
    {"top_k": 5, "threshold": None},
    {"top_k": 8, "threshold": 0.75},
    {"top_k": 3, "threshold": 0.8},
]

In [None]:
for exp in experiments:
    print(f"\\n--- Experiment: top_k={exp['top_k']}, threshold={exp['threshold']} ---")
    retriever = index.as_retriever(similarity_top_k=exp["top_k"])
    nodes = retriever.retrieve("What is the name of the borrower and the property address?")
    if exp["threshold"]:
        nodes = [node for node in nodes if node.score and node.score > exp["threshold"]]
    print(f"Chunks Retrieved: {len(nodes)}")
    for i, node in enumerate(nodes):
        print(f"Chunk {i+1} (Score: {node.score:.2f}):")
        print(node.get_text())
        print("-" * 80)

\n--- Experiment: top_k=5, threshold=None ---
Chunks Retrieved: 4
Chunk 1 (Score: 0.83):
which has the address of 6468 SOUTH 20TH STREET [Street]
MILWAUKEE [City], Wisconsin 53221 [Zip Code] ("Property Address"),

TOGETHER WITH all the improvements now or hereafter erected on the property, and all easements,
appurtenances and fixtures now or hereafter a part of the property All replacements and additions shall also be
covered by this Security Instrument All of the foregoing 1s referred to in this Security Instrument as the "Property "
Borrower understands and agrees that MERS holds only legal title to the mterests granted by Borrower in this
Security Instrument, but, 1f necessary to comply with law or custom, MERS, (as nominee for Lender and Lender's
successors and assigns), has the right to exercise any or all of those interests, including, but not limited to, the right
to foreclose and sell the Property, and to take any action required of Lender including, but not limited to, releasi

While we used custom retrievers in our rag pipeline we can observe the effect of Top K and threshold on the chunk selections even in default retriever. However in our case when only few chunks were available we are not seeing much change overall. And provided that we used reranking in our custom RAG pipeline where more score is given to more relevant node.