# RAG Pipeline

In this notebook we will create a RAG pipeline on the mortgage documents provided by Outomation. We are using one standard PDF and one scanned PDF as our corpus for RAG.

We will use query expansion and hybrid retrieval to improve retrieval, then a cross-encoder model to rerank the retrieved nodes.

For the chunking stage of the RAG pipeline we tested default chunking and semantic chunking, and settled on the default word-limit-based chunking because only a few documents are available to us. We chose Gemini 2.5 as the LLM because it worked well and is available for free with a Google account, so others can test the notebook further.
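
To make the idea of word-limit-based chunking concrete, here is a minimal pure-Python sketch. It is a simplified stand-in, not the splitter the notebook actually runs: in this notebook the splitting happens inside `VectorStoreIndex.from_documents`, which uses LlamaIndex's default node parser. The function name `chunk_by_words` and the `chunk_size` and `overlap` values are assumptions chosen for illustration.

```python
def chunk_by_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input: ten placeholder words, windows of 4 words with 1 word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
print(chunk_by_words(sample, chunk_size=4, overlap=1))
```

The chunk count here depends only on the text length and the two parameters, so even a two-document corpus is split predictably.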

In [None]:
!pip install llama-index-core llama-index-readers-file llama-index-llms-ollama llama-index-embeddings-huggingface
!pip install llama-index-llms-gemini
!pip install transformers sentence-transformers
!pip install pypdf
!pip install nest_asyncio
!pip install llama_index
!pip install llama-index-experimental
!pip install llama-index-retrievers-bm25
!pip install pytesseract pdf2image



In [None]:
import os
from getpass import getpass

import nest_asyncio
nest_asyncio.apply()

# Never hardcode the key in the notebook; read it from the environment or prompt for it.
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")

In [None]:
cd /content/drive/MyDrive/Extern_Outomation

/content/drive/MyDrive/Extern_Outomation


In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_dir="./basic_rag", required_exts=[".pdf"]).load_data()
print(f"Loaded {len(documents)} document(s).")

Loaded 5 document(s).


In [None]:
print(documents[0].text[:1000])  # Print the first 1000 characters of the first document

Your actual rate, payment, and cost could be higher. Get an official Loan Estimate before choosing a loan.
Fee Details and Summary
Applicants: Application No:
Date Prepared:
Loan Program:
Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees W orksheet" is provided for informational purposes ONLY, to assist
you in determining an estimate of cash that may be required to close and an estimate of your proposed monthly mortgage 
payment. Actual charges may be more or less, and your transaction may not involve a fee for every item listed.
Total Loan Amount:  Interest Rate: Term/Due In:
Fee Paid To Paid By (Fee Split**) Amount PFC / F / POC
TOTAL ESTIMATED FUNDS NEEDED TO CLOSE: TOTAL ESTIMATED MONTHLY PAYMENT:
Total Estimated Funds Total Monthly Payment
Purchase Price (+)
Alterations (+)
Land (+)
Refi (incl. debts to be paid off) (+)
Est. Prepaid Items/Reserves (+)
Est. Closing Costs (+)
Loan Amount (-) Principal & Interest
Other Financing (P & I)
Hazard Insurance
Real Estate Tax

In [None]:
# One PDF in the selected folder is scanned, so we run OCR on it to extract the text
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path('/content/drive/MyDrive/Extern_Outomation/basic_rag/MTG_10009588.pdf')
text = ""
for page in pages:
    text += pytesseract.image_to_string(page)
print(text[:1000])  # Preview extracted text


- QML

MORTGAGE DOC.# 10009588

DOCUMENT NUMBER

RECORDED 06/28/2011 09:35AM
JOHN LA FAVE
NAME & RETURN ADDRESS REGISTER DF DEEDS

M&I Home Lending Solutions
Attn: Secondary Marketing
4121 NW Urbandale Drive
Urbandale, IA 50322

Milwaukee County, WI]
AMOUNT : 30.00
FEE EXEMPT #:

PARCEL IDENTIFIER NUMBER
716-0027-6
[Space Above This Line For Recording Data]

 

State of Wisconsin
581-4247085-703

 

MIN 100273100009309945

THIS MORTGAGE ("Security Instrument") 1s given on June 20, 2011
The Mortgagor is KIMBERLY HOGAN, A Single Person,

("Borrower") This Security Instrument 1s given to Mortgage Electronic Registration Systems, Inc ("MERS"),
(solely as nominee for Lender, as hereinafter defined, and Lender's successors and assigns), as mortgagee MERS 1s
organized and existing under the laws of Delaware, and has an address and telephone number of PO Box .2026,
Flint, MI 48501-2026, tel (888) 679-MERS M&I Bank FSB :
("Lender") 1s organized and existing under the laws of the United States o

In [None]:
from llama_index.core import Document
doc = Document(text=text, metadata={"file_name": "MTG_10009588.pdf"}) #appending scanned document
documents.append(doc)

In [None]:
# Drop documents with empty text (the scanned PDF's pages load with no text)
documents = [doc for doc in documents if doc.text != ""]

In [None]:
len(documents)

2

In [None]:
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
semantic_splitter = SemanticSplitterNodeParser(embed_model=embed_model)



In [None]:
semantic_nodes = semantic_splitter.get_nodes_from_documents(documents)

In [None]:
print(f'Total number of semantic nodes: {len(semantic_nodes)}')
print(semantic_nodes[-1])

Total number of semantic nodes: 4
Node ID: 25f78036-ae99-49a2-8225-86c14aeaf0be
Text: Borrower shall pay when due the principal of, and interest on,
the debt evidenced by the Note and late charges due under the Note  2.
Monthly Payment of Taxes, Insurance and Other Charges. Borrower shall
mclude mm each monthly payment, together with the principal and
interest as set forth in the Note and any late charges, a sum for (a)
taxes and...


We used semantic nodes initially. After we improved other parts of the RAG pipeline, however, semantic chunking no longer answered all of our test questions correctly, because there are too few documents to process. We therefore switched to the default chunk-splitting method.
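
One plausible reason semantic chunking yields so few nodes on a small corpus is that it only cuts where the distance between adjacent sentences is unusually large compared with the rest of the document. The sketch below illustrates that breakpoint idea with toy bag-of-words vectors instead of a real embedding model. It is not the `SemanticSplitterNodeParser` implementation; the function names, the made-up sentences, and the percentile values are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences: list[str], percentile: float = 95.0) -> list[str]:
    """Start a new chunk wherever the adjacent-sentence distance exceeds the percentile cutoff."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]

    # Toy nearest-rank percentile over the observed distances.
    ranked = sorted(distances)
    cutoff = ranked[min(len(ranked) - 1, int(len(ranked) * percentile / 100.0))]

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Made-up sentences with one obvious topic shift in the middle.
sentences = [
    "Borrower shall pay the principal.",
    "Borrower shall pay the interest.",
    "The property is in Milwaukee.",
    "The property is in Wisconsin.",
]
print(len(semantic_chunks(sentences, percentile=95.0)))  # high cutoff on a short text
print(semantic_chunks(sentences, percentile=50.0))       # lower cutoff finds the shift
```

With only a handful of sentences, a high percentile cutoff equals the largest observed distance, so nothing exceeds it and the text stays in one chunk. Lowering the cutoff, or adding more text, produces more breakpoints.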

In [None]:
from llama_index.core import VectorStoreIndex
from llama_index.llms.gemini import Gemini

# Create an index with our embeddings
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

llm = Gemini(model="models/gemini-2.5-flash")



In [None]:
#checking the index nodes again
print(list(index.docstore.docs.values())[0].text)

Your actual rate, payment, and cost could be higher. Get an official Loan Estimate before choosing a loan.
Fee Details and Summary
Applicants: Application No:
Date Prepared:
Loan Program:
Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees W orksheet" is provided for informational purposes ONLY, to assist
you in determining an estimate of cash that may be required to close and an estimate of your proposed monthly mortgage 
payment. Actual charges may be more or less, and your transaction may not involve a fee for every item listed.
Total Loan Amount:  Interest Rate: Term/Due In:
Fee Paid To Paid By (Fee Split**) Amount PFC / F / POC
TOTAL ESTIMATED FUNDS NEEDED TO CLOSE: TOTAL ESTIMATED MONTHLY PAYMENT:
Total Estimated Funds Total Monthly Payment
Purchase Price (+)
Alterations (+)
Land (+)
Refi (incl. debts to be paid off) (+)
Est. Prepaid Items/Reserves (+)
Est. Closing Costs (+)
Loan Amount (-) Principal & Interest
Other Financing (P & I)
Hazard Insurance
Real Estate Tax

In [None]:
from llama_index.core import Settings
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.postprocessor import SentenceTransformerRerank


Settings.llm = llm


# Function to create a query engine that uses query expansion plus hybrid and reranking
def build_rag_pipeline(index, llm):

    nodes = list(index.docstore.docs.values())

    # Determine safe top_k value (number of nodes to retrieve)
    # Must be at least 1 and no more than the number of available nodes
    num_nodes = len(nodes)
    safe_top_k = min(3, max(1, num_nodes))

    print(f"Index contains {num_nodes} nodes, using top_k={safe_top_k}")

    vector_retriever = index.as_retriever(
        similarity_top_k=safe_top_k  # Retrieve up to safe_top_k most similar chunks
    )

    # Keyword (BM25) retriever over the same nodes, used for the hybrid search
    bm25_retriever = BM25Retriever.from_defaults(
        nodes=nodes,
        similarity_top_k=safe_top_k  # Retrieve up to safe_top_k best keyword matches
    )

    # Create a proper hybrid retriever class
    class HybridRetriever(BaseRetriever):
        """Hybrid retriever that combines vector and keyword search results."""

        def __init__(self, vector_retriever, keyword_retriever, top_k=3):
            """Initialize with vector and keyword retrievers."""
            self.vector_retriever = vector_retriever
            self.keyword_retriever = keyword_retriever
            self.top_k = top_k
            super().__init__()

        def _retrieve(self, query_bundle, **kwargs):
            """Retrieve from both retrievers and combine results."""
            # Get results from both retrievers
            vector_nodes = self.vector_retriever.retrieve(query_bundle)
            keyword_nodes = self.keyword_retriever.retrieve(query_bundle)

            # Combine all nodes
            all_nodes = list(vector_nodes) + list(keyword_nodes)

            # Remove duplicates (by node_id)
            unique_nodes = {}
            for node in all_nodes:
                if node.node_id not in unique_nodes:
                    unique_nodes[node.node_id] = node

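            # Sort by score (higher is better); a missing or None score counts as 0.0.
            # Note: vector and BM25 scores are on different scales, so this
            # ordering is only a rough heuristic before fusion and reranking.
            sorted_nodes = sorted(
                unique_nodes.values(),
                key=lambda x: getattr(x, "score", None) or 0.0,
                reverse=True
            )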

            return sorted_nodes[:self.top_k]  # Return top results

    # Create our hybrid retriever instance
    hybrid_retriever = HybridRetriever(
        vector_retriever = vector_retriever,
        keyword_retriever = bm25_retriever,
        top_k=safe_top_k
    )

    # Use QueryFusionRetriever with the hybrid retriever for query expansion
    fusion_retriever = QueryFusionRetriever(
        retrievers=[hybrid_retriever],
        llm=llm,
        similarity_top_k=safe_top_k,  # Stay within the number of available nodes
        num_queries=3,  # Total queries, including the original one
        mode="reciprocal_rerank"
    )

    # Rerank the fused results with a cross-encoder
    reranker = SentenceTransformerRerank(
        model="cross-encoder/ms-marco-MiniLM-L-2-v2",
        top_n=safe_top_k
    )


    # Plug into query engine (RetrieverQueryEngine is already imported above)
    query_engine = RetrieverQueryEngine.from_args(
        retriever = fusion_retriever,
        llm=llm,
        node_postprocessors = [reranker],
        verbose = True
    )
    return query_engine
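
The `reciprocal_rerank` mode above merges several ranked result lists into one. As a rough illustration, here is a minimal pure-Python sketch of reciprocal rank fusion (RRF) in its commonly cited form, where each list contributes `1 / (k + rank)` to a node's score. The node IDs are hypothetical, and the constant `k = 60` and the 1-based ranks are assumptions; LlamaIndex's internal details may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: float = 60.0) -> list[tuple[str, float]]:
    """Fuse ranked lists of node IDs; each list adds 1 / (k + rank) to a node's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["n2", "n1", "n3"]  # hypothetical node IDs ranked by embedding similarity
bm25_hits = ["n1", "n4", "n2"]    # hypothetical node IDs ranked by BM25
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print([node_id for node_id, _ in fused])
```

Because RRF uses only ranks, it avoids comparing BM25 scores with cosine similarities directly. In this notebook the fusion retriever receives a single hybrid retriever, so the lists being fused come from the different query variants. Passing the vector and BM25 retrievers to `QueryFusionRetriever` as two separate entries would be an alternative design worth testing.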

In [None]:
rag_engine = build_rag_pipeline(index, llm)
response = rag_engine.query("What is the total estimated monthly payment?")
print('\nFinal Response:\n ---------------------- \n')
print(response)

DEBUG:bm25s:Building index from IDs objects


Index contains 4 nodes, using top_k=3

Final Response:
 ---------------------- 

The total estimated monthly payment is $2,308.95.


In [None]:
response = rag_engine.query("How much does the borrower pay for lender's title insurance?")
print('\nFinal Response:\n ---------------------- \n')
print(response)


Final Response:
 ---------------------- 

The borrower pays $650.00 for lender's title insurance.


In [None]:
response = rag_engine.query("What are the charges?")
print('\nFinal Response:\n ---------------------- \n')
print(response)


Final Response:
 ---------------------- 

The charges include principal, interest, and late charges due under the Note. Monthly payments also include sums for taxes, special assessments, leasehold payments or ground rents, and insurance premiums. Additionally, a sum for the annual mortgage insurance premium or a monthly charge in place of a mortgage insurance premium may be required.

Other specific charges listed are:
*   Underwriting Fee
*   Wire Transfer Fee
*   Administration Fee
*   Appraisal Fee
*   Credit Report Fee
*   Tax Service Fee
*   Flood Certification Fee
*   Closing/Escrow Fee
*   Document Preparation Fee
*   Notary Fee
*   Lender's Title Insurance
*   Title - Courier Fee
*   Electronic Document Delivery Fee
*   Pest Inspection Fee
*   Home Inspection
*   Mortgage Recording Charge
*   Daily Interest Charges
*   Hazard Insurance Premium


In [None]:
response = rag_engine.query("What are the addresses in the document?")
print('\nFinal Response:\n ---------------------- \n')
print(response)


Final Response:
 ---------------------- 

The addresses mentioned in the document are:

*   4121 NW Urbandale Drive, Urbandale, IA 50322
*   PO Box 2026, Flint, MI 48501-2026
*   3993 Howard Hughes Parkway, Las Vegas, NV 89109
*   6468 SOUTH 20TH STREET, MILWAUKEE, Wisconsin 53221


In [None]:
response = rag_engine.query("Who is the borrower, what is the total loan amount and what is the property Address?")
print('\nFinal Response:\n ---------------------- \n')
print(response)


Final Response:
 ---------------------- 

The borrower is KIMBERLY HOGAN. The total loan amount is $112,084.00. The property is described as LOT 27, IN BLOCK 1, IN MILWAUKEE COLLEGE HEIGHTS, BEING A SUBDIVISION OF A PART OF THE EAST 1/2 OF SECTION 6, IN TOWNSHIP 5 NORTH, RANGE 22 EAST, IN THE CITY OF MILWAUKEE, COUNTY OF MILWAUKEE, STATE OF WISCONSIN.


We now have a basic RAG pipeline that uses advanced retrieval methods, namely query expansion and hybrid retrieval, together with a cross-encoder reranker that selects better chunks for the final prompt. With more data, semantic chunking would likely be preferable, but this is already a competent RAG pipeline to build on.
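
When we change the chunking or retrieval settings later, it helps to re-run the same test questions and check that the answers still hold. The sketch below is one minimal way to do that. The helper `check_answers` and the stand-in `fake_query` are hypothetical; the expected snippets are copied from the responses printed earlier in this notebook and are not independently verified ground truth.

```python
from typing import Callable


def check_answers(query_fn: Callable[[str], str], cases: list[tuple[str, str]]) -> list[tuple[str, bool]]:
    """Return (question, passed) pairs; a case passes if the expected text appears in the answer."""
    results = []
    for question, expected in cases:
        answer = query_fn(question)
        results.append((question, expected.lower() in answer.lower()))
    return results


# Expected snippets taken from the responses shown above in this notebook.
cases = [
    ("What is the total estimated monthly payment?", "$2,308.95"),
    ("How much does the borrower pay for lender's title insurance?", "$650.00"),
    ("Who is the borrower, what is the total loan amount and what is the property Address?", "KIMBERLY HOGAN"),
]

# Stand-in for the real engine so this sketch runs on its own.
# In the notebook this would be: lambda q: str(rag_engine.query(q))
canned = {
    cases[0][0]: "The total estimated monthly payment is $2,308.95.",
    cases[1][0]: "The borrower pays $650.00 for lender's title insurance.",
}


def fake_query(question: str) -> str:
    return canned.get(question, "")


for question, passed in check_answers(fake_query, cases):
    print("PASS" if passed else "FAIL", "-", question)
```

Substring matching is deliberately crude. It is enough to notice when a change, such as switching back to semantic chunking, causes a previously correct answer to disappear.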