# Outomation RAG Pipeline

This notebook contains the final RAG pipeline made for Outomation AI Workfow Automation Externship. Goal of this RAG pipeline is to provide a robust way to handle Mortgage documents obtained from clients and assist the user in processing the home mortgage applications while answering another relevant queries from the documents. We will also create a Gradio chat interface allowing users to access the RAG pipeline running on a local machine.

For this project we will be using offline open source LLM. We decided on llama3.1-8b-Q8 which performed better than many others we could use on our local hardware, we also tested quantized version of minstral7b, openai_neo, deepseek_r1, and few others before deciding. But even then, our testing of pipeline with Gemini API gave us much better and coherent result. Better LLM in backend will improve the quality of response and performance of the RAG Pipeline overall.

To design the structure of the RAG Pipeline we used Mortgage Documents provided by Outomation, documents contained scanned documents and blob documents, but all files will be in PDF form. In this notebook, we will apply the RAG pipeline on the test documents which were not referred while making the pipeline. We will also perform field accuracy analysis, recall analysis, and latency analysis. 

Following steps were taken to create this RAG Pipeline:

1. First we added ability to read all PDF files in a directory and extract text.
2. Added ability to handle Scanned PDFs by having tesseract fallback if no text is extracted from the page.
3. Added ability to handle blob files, multiple PDF documents packaged as one.
4. We also make use of Logical Document splitting, allowing RAG pipeline to decide the document boundaries. Better LLM in backend would give much better document classification and boundary. We recommend using AWS or Azure APIs to do these tasks and run other parts of RAG pipeline on local LLM. 
5. Used Recursive Splitting to create document chunks and then use Semantic Chunking to create semantic nodes or chunk for Vector Store Index.
6. Used Fiass Vector store for better Nearest Neighbor searching of chunks or nodes.
7. Then we designed a custom retriever which is using Hybrid Retriever and Query expansion while using Metadata filtering so that only doc types related to user query appear as source nodes, this was done while adding a graceful fallback in case LLM is not able to suggest the document type properly.
8. Finally we Cross Encoder Reranker and Strict Output Controls to finetune the final node retrievel and response.

Then on the basis of this RAG Pipeline we created a Gradio App which can streamline whole process through an online app with application running on local machine. 

This project will be divided into two parts:
1. RAG Pipeline
2. Gradio App

<b> Note:</b> Gradio App was made on different notebook, but we were needed to send one single notebook so we combined them into one. In gradio app section a lot of code used in RAG pipeline has been reused, this could allow users to take the code in that section and use is as a python script as its indpendent to overall RAG pipeline section.

## RAG Pipeline

In this section we will create and test a RAG pipeline which follows industry‑standard RAG pattern for grounding LLM outputs in enterprise data, and follows best‑practice retrieval steps (hybrid search, reranking, and prompt discipline) that improve accuracy and trust.

We will use FAISS vector store which is highly scalable compared to default vector store so that this RAG pipeline could be scaled to 1000s of docuemnts or even more with slight modifications. We will apply our RAG architecure on text documents which contain more type of documents than we designed this RAG pipeline for.

In [1]:
#importing essential llama index libraries and gradio, we will import others later when they are used
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex, Document
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.settings import Settings
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.readers.file import PDFReader
from llama_cpp import Llama
import gradio

In [2]:
import os
print(os.environ.get('CONDA_DEFAULT_ENV'))

llama


In [3]:
import torch
torch.cuda.is_available()

True

In [4]:
#loading llama 3.1-8b Q8 to be used for llm related tasks

llm = Llama(
    model_path = 'llama_3.1_8b_q8.gguf',
    n_ctx = 4096,
    n_gpu_layers = -1,          
    n_batch = 256,
    verbose = False,
    temperature = 0.2
)
output = llm("What is the capital of France?", max_tokens=32) #testing the model
print(output["choices"][0]["text"])

llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


 What is the capital of France?
Paris is the capital of France. It is located in the northern part of the country, on the Seine River. Paris


In [8]:
#testing a scanned document text extraction, this document is from training dataset
from PIL import Image, ImageOps
import pytesseract
import os, fitz, re

doc = fitz.open('MTG_10009588.pdf')
for page in doc:
    if page.get_text:
        img = page.get_pixmap(dpi = 300)
        img_b = img.tobytes('png')
        img = Image.open(io.BytesIO(img_b))
        text = pytesseract.image_to_string(img)
        print(text[:500])

, HRN
MORTGAGE DOC.# 10009588

DOCUMENT NUMBER

RECORDED 06/28/2011 09:35AM
JOHN LA FAVE
NAME & RETURN ADDRESS REGISTER OF DEEDS
M&I Home Lending Solutions Milwaukee County, WI|
Attn: Secondary Marketing AMOUNT: on 00

4121 NW Urbandale Drive

Urbandale, IA 50322 FEE EXEMPT #:

PARCEL IDENTIFIER NUMBER
716-0027-6

[Space Above This Line For Recording Data]

FHA Case No
State of Wisconsin
581-4247085-703 :

MIN 100273100009309945

THIS MORTGAGE ("Security Instrument") is given on June 20, 2011
Th
assigns) and to the successors and assigns of MERS, with power of sale, the following described property located in
MILWAUKEE County, Wisconsin
LOT 27, IN BLOCK 1, IN MILWAUKEE COLLEGE HEIGHTS, BEING A SUBDIVISION OF A PART

OF THE EAST 1/2 OF SECTION 6, IN TOWNSHIP 5 NORTH, RANGE 22 EAST, IN THE CITY OF
MILWAUKEE, COUNTY OF MILWAUKEE, STATE OF WISCONSIN.

which has the address of 6468 SOUTH 20TH STREET [Strect]
MILWAUKEE [City], Wisconsin 53221 [Zip Code] ("Property Address"),

TOGETHER WITH a

In [22]:
from typing import List, Dict, Any, Optional
import pytesseract
import camelot

from llama_index.core import Document

CURRENCY_RE = re.compile(r'[$€₹£]\s?\d')
NUMBER_RE   = re.compile(r'\d')
DATE_RE     = re.compile(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b', re.I)
STOP_PHRASES = [
    "thank you", "we appreciate your business", "page", "fax", "phone", "email",
    "powered by", "this is not a", "confidential", "copyright"
]

def _has_ruled_lines(page: fitz.Page, min_lines: int = 8) -> bool:
    """Detect presence of many straight lines (rulings) on the page."""
    try:
        drawings = page.get_drawings()
        lines = 0
        for d in drawings:
            for p in d.get("items", []):
                if p[0] == "l":  # line
                    lines += 1
        return lines >= min_lines
    except Exception:
        return False

def _row_informative(cells: List[str], min_chars: int = 10) -> bool:
    s = " ".join(cells).strip()
    if len(s) < min_chars:
        return False
    low = s.lower()
    if any(ph in low for ph in STOP_PHRASES):
        return False
    # require some signal: number, currency, or date
    if CURRENCY_RE.search(s) or DATE_RE.search(s) or NUMBER_RE.search(s):
        return True
    # allow longer textual rows (e.g., purpose sentences)
    return len(s) >= (min_chars * 2)

class MixedPDFReader:
    def __init__(
        self,
        dpi: int = 300,
        lang: str = "eng",
        min_chars: int = 40,
        ocr_psm: int = 6,
        extract_tables: bool = True,
        allow_stream: bool = False,           # default off to avoid false positives
        min_table_rows: int = 3,
        min_table_cols: int = 3,
        max_rows_per_table: int = 60,
        max_nodes_per_page: int = 120,        # hard cap per page
        emit_table_rows: bool = True,         # False => emit whole-table docs only
    ):
        self.dpi = dpi
        self.lang = lang
        self.min_chars = min_chars
        self.ocr_psm = ocr_psm
        self.extract_tables = extract_tables
        self.allow_stream = allow_stream
        self.min_table_rows = min_table_rows
        self.min_table_cols = min_table_cols
        self.max_rows_per_table = max_rows_per_table
        self.max_nodes_per_page = max_nodes_per_page
        self.emit_table_rows = emit_table_rows

    def _has_meaningful_text(self, page: fitz.Page) -> bool:
        txt = page.get_text("text") or ""
        return len(txt.strip()) >= self.min_chars

    def _page_to_image(self, page: fitz.Page) -> Image.Image:
        scale = self.dpi / 72.0
        pm = page.get_pixmap(matrix=fitz.Matrix(scale, scale))
        mode = "RGBA" if pm.alpha else "RGB"
        img = Image.frombytes(mode, (pm.width, pm.height), pm.samples)
        if pm.alpha:
            img = img.convert("RGB")
        return ImageOps.autocontrast(img)

    def _read_tables(self, pdf_path: str, page_no: int, try_stream: bool) -> List:
        tables = []
        try:
            # lattice first (needs ruling lines; fewer false positives)
            t_lat = camelot.read_pdf(pdf_path, pages=str(page_no), flavor="lattice")
            if t_lat and t_lat.n > 0:
                tables = t_lat
            elif try_stream:
                t_str = camelot.read_pdf(pdf_path, pages=str(page_no), flavor="stream")
                if t_str and t_str.n > 0:
                    tables = t_str
        except Exception:
            tables = []
        return tables

    def _table_is_valid(self, table) -> bool:
        df = table.df
        n_rows, n_cols = df.shape
        if n_rows < self.min_table_rows or n_cols < self.min_table_cols:
            return False
        bad = 0
        for r in range(n_rows):
            cells = [str(v).strip() for v in list(df.iloc[r].values)]
            if not _row_informative(cells):
                bad += 1
        return (bad / max(1, n_rows)) < 0.7

    def _emit_table_docs(
        self, table, base_meta: Dict[str, Any], table_index: int
    ) -> List[Document]:
        out: List[Document] = []
        df = table.df
        n_rows, n_cols = df.shape

        # Optional: detect header row only if it looks like a header (few digits)
        def looks_like_header(cells: List[str]) -> bool:
            cell_txt = " ".join(cells)
            # header rows typically have fewer numbers
            return NUMBER_RE.search(cell_txt) is None or len(cell_txt) < 40

        headers = None
        start_row = 0
        if n_rows > 0:
            first_row = [str(v).strip() for v in list(df.iloc[0].values)]
            if looks_like_header(first_row):
                headers = first_row
                start_row = 1

        # Whole-table doc (useful when rows are too many)
        if not self.emit_table_rows or n_rows > self.max_rows_per_table:
            table_text = "\n".join(
                ["\t".join(str(v).strip() for v in list(df.iloc[r].values)) for r in range(start_row, n_rows)]
            )
            out.append(
                Document(
                    text=table_text[:8000],
                    metadata={**base_meta, "block_type": "table", "table_index": table_index,
                              "n_rows": int(n_rows), "n_cols": int(n_cols), "headers": headers},
                )
            )
            return out

        # Row-level docs with caps and filters
        emitted = 0
        for r in range(start_row, n_rows):
            if emitted >= self.max_rows_per_table:
                break
            cells = [str(v).strip() for v in list(df.iloc[r].values)]
            if not _row_informative(cells):
                continue
            row_text = "\t".join(c for c in cells if c)
            if not row_text:
                continue
            out.append(
                Document(
                    text=row_text,
                    metadata={
                        **base_meta,
                        "block_type": "table_row",
                        "table_index": table_index,
                        "row_index": r,
                        "headers": headers,
                        "n_cols": int(n_cols),
                    },
                )
            )
            emitted += 1
        return out

    def load_data(self, file: str, extra_info: Optional[Dict[str, Any]] = None, **kwargs):
        doc = fitz.open(file)
        out_docs: List[Document] = []
        total_nodes = 0

        for i, page in enumerate(doc):
            page_no = i + 1
            page_meta = {
                "source": os.path.abspath(file),
                "page": page_no,
                **(extra_info or {}),
            }

            # Base: page-level text or OCR
            if self._has_meaningful_text(page):
                text = (page.get_text("text") or "").strip()
                ocr_applied = False
                page_doc = Document(text=text, metadata={**page_meta, "ocr_applied": ocr_applied, "block_type": "page_text"})
                out_docs.append(page_doc)
                total_nodes += 1
            else:
                img = self._page_to_image(page)
                text = pytesseract.image_to_string(img, lang=self.lang, config=f"--psm {self.ocr_psm}").strip()
                ocr_applied = True
                page_doc = Document(text=text, metadata={**page_meta, "ocr_applied": ocr_applied, "block_type": "ocr_page_text"})
                out_docs.append(page_doc)
                total_nodes += 1
                # Don’t try Camelot on OCR pages
                continue

            # Early stop if page is already too noisy
            if total_nodes >= self.max_nodes_per_page * (i + 1):
                continue

            # Tables: only if enabled and page likely has rulings
            if self.extract_tables:
                ruled = _has_ruled_lines(page, min_lines=8)
                tables = []
                if ruled:
                    tables = self._read_tables(file, page_no, try_stream=self.allow_stream)
                # If not ruled and stream is allowed, still try stream but guarded
                elif self.allow_stream:
                    tables = self._read_tables(file, page_no, try_stream=True)

                # Emit valid tables only
                for t_idx, t in enumerate(tables or []):
                    if not self._table_is_valid(t):
                        continue
                    t_docs = self._emit_table_docs(t, {**page_meta, "ocr_applied": ocr_applied}, t_idx)
                    out_docs.extend(t_docs)
                    total_nodes += len(t_docs)
                    # Page-level cap
                    if total_nodes >= self.max_nodes_per_page * (i + 1):
                        break

        return out_docs


In [23]:
#loading test files
reader = MixedPDFReader(dpi=300, lang="eng", min_chars=40, ocr_psm=6)
page_reader = SimpleDirectoryReader(
    input_dir="test/",
    file_extractor={".pdf": reader}
)
documents = page_reader.load_data()

In [24]:
print('Pages Loaded:',len(documents))
print(documents[0].text[:300])

Pages Loaded: 59
Loan Estimate
Save this Loan Estimate to compare with your Closing Disclosure.
DATE ISSUED
APPLICANTS
PROPERTY
SALE PRICE
LOAN TERM
PURPOSE
PRODUCT
LOAN TYPE
LOAN ID #
RATE LOCK
Conventional
FHA
VA
NO
YES, until
Before closing, your interest rate, points, and lender credits can
change unless you loc


In [21]:
from llama_cpp import LlamaGrammar

def categories_gbnf():
    alts = " | ".join(f'"{c}"' for c in CATEGORIES)
    return LlamaGrammar.from_string(f"root ::= {alts}\n")

YESNO_GBNF = "root ::= \"yes\" | \"no\"\n"

DECODE = dict(
    temperature=0.1,
    top_p=0.9,
    top_k=30,
    repeat_penalty=1.15,
)

CATEGORIES = [
    "Resume", "Contract", "LoanAgreement", "Invoice", "PaySlip",
    "LenderFee", "LandDeed", "BankStatement", "TaxDocument",
    "Insurance", "Report", "Letter", "Form", "ID", "Medical", "Other"
]

def classify_document(text, max_chars=2000):
    snippet = (text or "")[:max_chars]

    system = (
        "You are a strict document classifier. "
        "Return exactly one label from the allowed set. "
        "No explanations. No extra words. No punctuation."
    )
    user = (
        "Allowed labels:\n"
        + ", ".join(CATEGORIES) + "\n"
        +  """Label Guide:
        - Resume: CV, resume, and documents containing work history, usually one or two pages.
        - Contract: general legal agreement not specific to mortgages, insurance, and property.
        - LoanAgreement: loan agreement for home loan.
        - Invoice: a bill requesting payment for goods/services.
        - PaySlip: salary/wage statement for an employee.
        - LenderFee: fee worksheet/closing cost breakdown, usually a single worksheet.
        - LandDeed: Title document and property deed and documents for land ownership.
        - BankStatement: transaction history of an account issued by a bank.
        - TaxDocument: Tax return and tax form
        - Insurance: Insurance policy documents
        - Report: documents containing data analysis and findings
        - Letter: correspondence communications
        - Form: applications and other Form documents requiring user to enter data
        - ID: documents used for identity checking and verficiation
        - Medical: medical reports and health prescriptions
        If unsure, choose Other
        """
        + """\n\nTask: Classify the following content into exactly ONE allowed label.\n\n"""
        f"Content:\n{snippet}\n\n"
        "Answer with ONLY the label."
    )

    out = llm.create_chat_completion(
        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        max_tokens = 4,
        grammar = categories_gbnf(),
        **DECODE
    )
    label = out["choices"][0]["message"]["content"].strip()

    # Optional guard: map minor variations back to canonical labels
    return label

def is_same_document(page1, page2, doc_type=None, max_chars=1200):
    p1 = (page1 or "")[:max_chars]
    p2 = (page2 or "")[:max_chars]
    dtype = doc_type or "Unknown"

    system = (
        "You decide if Page_2 starts a NEW document, given Page_1 and its document type. "
        "Output exactly 'yes' or 'no'. No other text."
    )
    user = (
        f"Page_1 Document Type: {dtype}\n"
        f"Page_1:\n{p1}\n\n"
        f"Page_2:\n{p2}\n\n"
        "Question: Does Page_2 start a NEW document? Answer only 'yes' or 'no'."
    )

    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        max_tokens=2,
        grammar = LlamaGrammar.from_string(YESNO_GBNF),
        **DECODE
    )
    return out["choices"][0]["message"]["content"].strip().lower()


In [14]:
documents[5].metadata

{'source': '/mnt/d/ex_outo/test/appraisal_report.pdf',
 'page': 2,
 'file_path': '/mnt/d/ex_outo/test/appraisal_report.pdf',
 'file_name': 'appraisal_report.pdf',
 'file_type': 'application/pdf',
 'file_size': 462164,
 'creation_date': '2025-09-25',
 'last_modified_date': '2025-09-25',
 'ocr_applied': False,
 'block_type': 'page_text'}

In [25]:
#testing doctype function on test file
classify_document(documents[5].text)

'Report'

In [26]:
#testing document boundary function for test files
is_same_document(documents[5].text, documents[6].text, doc_type = 'Report')

'no'

In [27]:
#creating metadata and logical documents
#this is most time consuming process, and better and faster LLM with better hardware can reduce the latency

metadata = []
current_doc_type = None
page_in_doc = 0
is_new_doc = True
logical_documents = []

for i, page in enumerate(documents):
    if i == 0:
        current_doc_type = classify_document(page.text)
        text = page.text
    else:
        prev_text = documents[i - 1].text
        output = is_same_document(prev_text, page.text, current_doc_type)
        if output.startswith('y'):
            current_doc_type = classify_document(page.text)
            page_in_doc = 0
            text = page.text
            is_new_doc = True
            
        else:
            page_in_doc += 1
            text = text + "\n\n" + page.text
            is_new_doc = False

    metadata.append({
        "page": i,
        'is_new_doc': is_new_doc,
        "doc_type": current_doc_type,
        'page_in_doc': page_in_doc,
        'source_file': page.metadata['file_name'],
    })

    if is_new_doc:
        logical_documents.append({
            'text': text,
            'doc_type': current_doc_type,
            'page_start': i,
            'page_end': i
        })
    else:
        logical_documents[-1]['page_end'] = i
        logical_documents[-1]['text'] = text

In [31]:
metadata[:2]

[{'page': 0,
  'is_new_doc': True,
  'doc_type': 'LoanAgreement',
  'page_in_doc': 0,
  'source_file': 'LoanEstimate.pdf'},
 {'page': 1,
  'is_new_doc': False,
  'doc_type': 'LoanAgreement',
  'page_in_doc': 1,
  'source_file': 'LoanEstimate.pdf'}]

In [32]:
# uncomment it to save the metadata as it is base of logical documents and will take a lot of time to make it in local limited hardware
import pickle
with open('metadata_test.pkl', 'wb') as file:
    pickle.dump(metadata, file)

In [33]:
#using a better LLM will help in decision boundaries and faster and even better processing of metadata creation
len(logical_documents)

15

In [95]:
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.llms.llama_cpp import LlamaCPP
#using a CPP wrapper based llama 3.1 for rag
llm_rag = LlamaCPP(
    model_path = 'llama_3.1_8b_q8.gguf',
    temperature = 0.1,
    context_window = 4096,
    model_kwargs = {"n_gpu_layers": -1, 'n_batch': 256},
    max_new_tokens = 64,
    generate_kwargs={"stop": ["\n", "\n\n", "Reasoning", "Explanation:", 'However']},
    verbose = False
)
Settings.llm = llm_rag  
Settings.embed_model = HuggingFaceEmbedding(model_name = 'BAAI/bge-base-en-v1.5')

llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
2025-09-26 12:03:40,960 - INFO - Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5
2025-09-26 12:03:44,984 - INFO - 1 prompt is loaded, with the key: query


In [103]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size = 1600, chunk_overlap = 100)

chunked_documents = []

for idx, doc in enumerate(logical_documents):
    chunks = splitter.split_text(doc["text"])
    for chunk_idx, chunk in enumerate(chunks):
        chunked_documents.append(
            Document(
                text=chunk,
                metadata={
                    "doc_type": doc["doc_type"],
                    "chunk_index": chunk_idx,
                    "page_start": doc["page_start"],
                    "page_end": doc["page_end"], 
                }
            )
        )

In [104]:
len(chunked_documents)

121

In [105]:
#semantic splitting the logical documents chunked in recursive manner

semantic_splitter = SemanticSplitterNodeParser(
    buffer_size = 15,                      # keeps neighboring sentences in context when deciding
    breakpoint_percentile_threshold = 90,  # higher = fewer, stronger splits
    embed_model = Settings.embed_model
)

# Convert coarse nodes back to Documents for the semantic splitter
semantic_chunks = semantic_splitter.get_nodes_from_documents(chunked_documents)

In [106]:
len(semantic_chunks)

177

In [39]:
#creating a FAISS powered vector store
from llama_index.vector_stores.faiss import FaissVectorStore
import faiss

2025-09-26 11:40:02,697 - INFO - Loading faiss with AVX512 support.
2025-09-26 11:40:02,756 - INFO - Successfully loaded faiss with AVX512 support.


In [225]:
dim

768

In [107]:
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.core import StorageContext

probe = Settings.embed_model.get_text_embedding("dimension probe")
dim = len(probe)

faiss_index = faiss.IndexFlatL2(dim)
vector_store = FaissVectorStore(faiss_index = faiss_index)  # FAISS as the backend

#Build a VectorStoreIndex over semantic chunks
storage_context = StorageContext.from_defaults(vector_store = vector_store)
index = VectorStoreIndex(semantic_chunks, storage_context=storage_context)

In [41]:
#this will be used for metadata filtering for queries
def predict_doc_type_for_query(query):
    """
    Return a predicted query type so that appropriate chunks could be recalled 
    """
    system = (
    f"""
    You are an intelligent assistant that routes user queries to the most relevant document.
    Choose ONLY ONE from:  {", ".join(CATEGORIES)}
    
    """
    )
    user = (f'Which document type most likely contain answer for my query: {query}. Give only one document type as answer.')

    out = llm.create_chat_completion(
        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        max_tokens = 4,
        grammar = categories_gbnf(),
        **DECODE
    )
    label = out["choices"][0]["message"]["content"].strip()
    return label

In [202]:
from typing import Optional

from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.postprocessor.types import BaseNodePostprocessor

from llama_index.core.prompts import PromptTemplate
from llama_index.core.response_synthesizers import get_response_synthesizer

QA_PROMPT = PromptTemplate(
    "Use only the provided context. If the answer is not present, output exactly: Not found\n"
    "Return only the answer text, one short line. Do not add explanations, notes, or extra words.\n\n"
    "Context:\n{context_str}\n\n"
    "Question: {query_str}\n"
    "Answer:"
)


resp_synth = get_response_synthesizer(
    response_mode="compact",
    text_qa_template=QA_PROMPT,
)

FIELD_BOOSTS = {
    "loan amount": ["loan amount", "amount financed", "base loan", "principal"],
    "interest rate": ["interest rate", "annual percentage rate", "apr", "rate"],
    "down payment": ["down payment", "cash to close", "funds due from borrower"],
    "property_address": ["property address", "subject property", "address"],
    "applicants": ["applicant", "borrower", "co-borrower", "name"]
}

from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import QueryBundle


MONEY_Q = re.compile(r'\b(loan\s*amount|amount\s*financed|base\s*loan|principal|origination|funding)\b', re.I)
GEN_MONEY_Q = re.compile(r'\b(amount|loan|principal|balance|financed)\b', re.I)

def expand_query_for_bm25(query: str) -> str:
    q = (query or "").strip()
    if not q:
        return q

    boosts = []
    # Strong rule: exact field intents
    if MONEY_Q.search(q):
        boosts += [
            "loan amount",
            "amount financed",
            "base loan",
            "principal",
        ]
    # Weak rule: generic money wording; add softer anchors
    elif GEN_MONEY_Q.search(q):
        boosts += [
            "loan amount",
            "amount financed",
        ]

    if not boosts:
        return q

    # BM25 is bag-of-words; adding tokens increases matches (no need for boolean OR)
    return f"{q} " + " ".join(boosts)


class ExpandedBM25Retriever(BaseRetriever):
    def __init__(self, base_bm25_retriever, expander=expand_query_for_bm25):
        super().__init__()
        self.base = base_bm25_retriever
        self.expander = expander

    def _retrieve(self, query_bundle, **kwargs):
        q = query_bundle.query_str or ""
        expanded = self.expander(q)
        return self.base.retrieve(QueryBundle(query_str=expanded))

# Pydantic-friendly client-side metadata filter with graceful fallback
class DocTypeFilterPostprocessor(BaseNodePostprocessor):
    doc_type: Optional[str] = None

    # implement required abstract method for your LlamaIndex version
    def _postprocess_nodes(self, nodes, query_bundle=None):
        if not self.doc_type:
            return nodes
        filtered = [n for n in nodes if (n.node.metadata or {}).get("doc_type") == self.doc_type]
        return filtered if filtered else nodes  # fallback to unfiltered if empty


def build_rag_pipeline(index, llm, query = None, doc_type_filtering = False,
                       k_per_retriever = 8, final_top_n = 3, num_queries = 3):
    all_nodes = list(index.docstore.docs.values())

    # Doc-type routing (soft, client-side)
    predicted_doc_type = None
    if doc_type_filtering and query:
        label = predict_doc_type_for_query(query)
        if label and label.strip().lower() != "other":
            predicted_doc_type = label.strip()

    k_per = max(2, min(int(k_per_retriever), len(all_nodes)))
    final_k = max(1, min(int(final_top_n), k_per))

    # Vector retriever (no server-side filters; FAISS-safe)
    vec = index.as_retriever(similarity_top_k=k_per)

    # BM25 over all (or bias to doc_type by prefiltering node list)
    bm25_nodes = (
        [n for n in all_nodes if n.metadata.get("doc_type") == predicted_doc_type] or all_nodes
    ) if predicted_doc_type else all_nodes

    bm25_base = BM25Retriever.from_defaults(nodes=bm25_nodes, similarity_top_k=k_per)

    # Wrap with query expander
    bm25_boosted = ExpandedBM25Retriever(bm25_base, expander=expand_query_for_bm25)

    # Fuse vector + boosted BM25; QueryFusionRetriever will RRF over both
    fusion = QueryFusionRetriever(
        retrievers=[vec, bm25_boosted],
        llm=llm,
        similarity_top_k=k_per,
        num_queries=num_queries,   
        mode="reciprocal_rerank",
    )

    # Postprocess: soft doc_type filter then cross-encoder rerank
    doc_filter = DocTypeFilterPostprocessor(doc_type=predicted_doc_type)
    reranker = SentenceTransformerRerank(
        model="cross-encoder/ms-marco-MiniLM-L-2-v2",
        top_n=final_k
    )

    qe = RetrieverQueryEngine.from_args(
        retriever=fusion,
        llm=llm,
        response_synthesizer=resp_synth,
        node_postprocessors=[doc_filter, reranker],
        verbose=False
    )
    return qe



In [203]:
import re

PLACEHOLDER_LINE = re.compile(r'^\s*[_\-\.\$€¥\s]{3,}\s*$', re.I)
NOISY_FOOTERS = [
    "Search results", "Source [", "Explanation:", "Note:"
]

def finalize_answer_minimal(raw_text: str) -> str:
    """
    Keep exactly one, clean line:
    - Take the substring before the first 'Not found' (case-insensitive).
    - Remove obvious footers/parentheticals.
    - Return the first meaningful sentence/line; otherwise 'Not found'.
    """
    s = (raw_text or "").strip()
    if not s:
        return "Not found"

    # Cut everything from the first 'Not found' onwards
    m = re.search(r'\bnot\s*found\b', s, flags=re.I)
    if m:
        s = s[:m.start()].strip()

    # Remove parentheticals like "(Note: ...)"
    s = re.sub(r'\([^)]*\)', '', s).strip()

    # Cut noisy footers if present
    for marker in NOISY_FOOTERS:
        if marker in s:
            s = s.split(marker, 1)[0].strip()

    # First sentence or first non-empty line
    parts = re.split(r'(?<=[.!?])\s+|\n+', s)
    first = next((p.strip() for p in parts if p and p.strip()), "")

    # Guard against placeholders or empty fragments
    if not first or PLACEHOLDER_LINE.match(first):
        return "Not found"

    # De-duplicate blunt repeats: keep up to the first clause
    first = re.split(r'[.;]\s*', first)[0].strip()

    return first[:200] if first else "Not found"


In [198]:
#testing rag initialization
rag_engine_base = build_rag_pipeline(index, llm_rag)

2025-09-26 13:33:28,698 - DEBUG - Building index from IDs objects


config.json:   0%|          | 0.00/799 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/279 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

In [61]:
import numpy as np

In [249]:
#test query

start_time = time.time()

query = "Summarize Neighborhood Description in the appraisal report."
rag_engine = build_rag_pipeline(index, llm_rag)
response = rag_engine.query(query)

elapsed_time = time.time() - start_time

raw = str(response)
final = finalize_answer_minimal(raw)

print('\nFinal Response:\n ---------------------- \n')
print(final)  # print the cleaned, single-line answer
print(f"\nQuery execution time: {elapsed_time:.3f} seconds")

2025-09-26 18:46:11,256 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

A residential neighborhood comprised predominantly of 2-3 story, wood frame, row style, and detached SFRs

Query execution time: 46.602 seconds


In [260]:
response.source_nodes[2]

NodeWithScore(node=TextNode(id_='3a2394b7-01f8-4776-ba9c-574e77f917d2', embedding=None, metadata={'doc_type': 'Report', 'chunk_index': 8, 'page_start': 3, 'page_end': 6}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='9175d30f-95d5-47b0-a79c-e443657e7f5d', node_type='4', metadata={'doc_type': 'Report', 'chunk_index': 8, 'page_start': 3, 'page_end': 6}, hash='31562245cdfb091889959d6efe7b3e821f83cac27e1eb4acd2e882b9aff48d90'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='48372990-af0b-4fa4-8139-28c8edd59f0c', node_type='1', metadata={}, hash='323d00d998670a8d302334a1e6f7d30747fb4a748ead54a96ffb283a7e38c894')}, metadata_template='{key}: {value}', metadata_separator='\n', text='ClickFORMS Appraisal Software 800-622-8727\nNet Adj: -5%\nGross Adj : 6%\nNet Adj: 16%\nGross Adj: 16%\nNet Adj: -1%\nGross Adj: 8%\nFile No.\nUniform Residential Appraisal Report\nThere are\ncomparable properties curr

2025-09-27 00:36:26,577 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 02:13:31,864 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 02:58:21,034 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 03:06:13,650 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [215]:
from collections import defaultdict

# 1) Build gold_ids_by_query (yours is fine)
queries = [
    "What is the name of person in the Driver's License ID?",
    "What is the gross adjusted sale price of Comparable Sale Number 1 in the appraisal report?",
    "What is the purpose of Appraisal Report?",
    "What is the address for the lot in parcel number 1?",
    "What is the total Loan estimate?",
    "What is the property address for which loan is being taken?",
    "What are the names of applicants in loan estimate sheet?",
    "What is the closing balance in the account statement?",
    "What are the services you cant shop for? Mention them and give their total",
    "What is the name of employee in payslip and what is the Net Pay?",
    "When was the last deposit made as per the the Bank Statement",
    "What are the Total Closing Cost as per Loan Estimate?",
    "Sumamrize Neighborhood Description in the appraisal report",
    "Give breakdown of the total livable area floor wise for the property concerned in appraisal report.",
    "What are the wages for Tina in W-2 form",
]

node_ids_in_order = [
    "c72acb7b-4666-4f18-a363-3f8812c7cc5f",
    "1b64170b-2970-459d-b831-2fed7d30acf7",
    "2f44a985-ba5e-431d-baf6-7ca72ed61d26",
    "64f89ec0-9d23-4784-af24-e28201ca692d",
    "f47a18ec-65d6-4685-9a89-66810594c4ac",
    "00bb7870-6f91-46bb-b952-2fe8930d714d",
    "00bb7870-6f91-46bb-b952-2fe8930d714d",
    "c17473d5-96d7-45f5-9ac7-4475c1cf209b",
    "180bffca-5f4a-4bb1-a428-e52dfa896ff0",
    "c72acb7b-4666-4f18-a363-3f8812c7cc5f",
    "c17473d5-96d7-45f5-9ac7-4475c1cf209b",
    "180bffca-5f4a-4bb1-a428-e52dfa896ff0",
    "e0633fdf-82ac-4f3a-85d3-29791a3a3567",
    "ca60a252-5a24-485c-9f91-245f365cdbf4",
    "248969cd-9932-4972-ba3a-248172195cbb",
]

assert len(queries) == len(node_ids_in_order), "Queries and node_id list must have same length"

gold_ids_by_query = {q: {node_ids_in_order[i]} for i, q in enumerate(queries)}
gold_ids_by_query[queries[13]].add("3821d3b2-2702-4959-a2f4-0dc34c445c44")  # extra gold for 14th query

# Optional: verify gold ids exist in index (won’t crash if index structure differs)
try:
    all_node_ids = set(getattr(index.docstore, "docs", {}).keys())
    missing = [nid for nid in set().union(*gold_ids_by_query.values()) if nid not in all_node_ids]
    if missing:
        print("Warning: some gold node_ids not present in the index:", missing)
except Exception:
    pass

# 2) Robust node_id extraction (handles NodeWithScore variants)
def _node_id_from_hit(hit):
    nid = getattr(hit, "node_id", None)
    if nid:
        return nid
    node = getattr(hit, "node", None)
    if node is not None:
        return getattr(node, "node_id", None)
    return None

# 3) Evaluator: pass retriever in
def eval_retrieval(retriever, eval_queries, gold_ids_by_query, Ks=(1,3,5,10)):
    Ks = sorted(set(Ks))
    maxK = max(Ks)
    # Ensure your retriever is configured to return at least maxK hits.
    # If not, rebuild it or set similarity_top_k accordingly before calling this.

    recall_at = defaultdict(float)
    hit_at = defaultdict(float)
    mrr_sum = 0.0
    n = 0

    for q in eval_queries:
        gold = set(gold_ids_by_query.get(q, []))
        if not gold:
            continue
        n += 1

        hits = retriever.retrieve(q) or []
        ranked_ids = []
        for h in hits:
            nid = _node_id_from_hit(h)
            if nid:
                ranked_ids.append(nid)

        # MRR
        rr = 0.0
        for rank, nid in enumerate(ranked_ids, start=1):
            if nid in gold:
                rr = 1.0 / rank
                break
        mrr_sum += rr

        # Recall@K + HitRate@K
        for K in Ks:
            topK = ranked_ids[:K]
            retrieved_rel = sum(1 for nid in topK if nid in gold)
            recall_at[K] += retrieved_rel / max(1, len(gold))
            hit_at[K]    += 1.0 if retrieved_rel > 0 else 0.0

    if n == 0:
        return {"note": "no labeled queries"}

    out = {"MRR": round(mrr_sum / n, 4)}
    for K in Ks:
        out[f"Recall@{K}"] = round(recall_at[K] / n, 4)
        out[f"HitRate@{K}"] = round(hit_at[K] / n, 4)
    return out

# 4) Build once, reuse retriever (ensure top_k >= 10 if you want Recall@10 to be meaningful)
qe = build_rag_pipeline(index, llm_rag, k_per_retriever=10, final_top_n=10)
retriever = qe._retriever

metrics = eval_retrieval(retriever, queries, gold_ids_by_query, Ks=(1,3,5,8,10))
print(metrics)

2025-09-26 14:29:00,454 - DEBUG - Building index from IDs objects


{'MRR': 0.3679, 'Recall@1': 0.1, 'HitRate@1': 0.1333, 'Recall@3': 0.5, 'HitRate@3': 0.5333, 'Recall@5': 0.7333, 'HitRate@5': 0.7333, 'Recall@8': 0.8667, 'HitRate@8': 0.8667, 'Recall@10': 0.8667, 'HitRate@10': 0.8667}


In [206]:
def eval_retrieval_filter(eval_queries, gold_ids_by_query, Ks=(1,3,5,10)):   
    
    Ks = sorted(set(Ks))
    maxK = max(Ks)
    # Ensure your retriever is configured to return at least maxK hits.
    # If not, rebuild it or set similarity_top_k accordingly before calling this.

    recall_at = defaultdict(float)
    hit_at = defaultdict(float)
    mrr_sum = 0.0
    n = 0

    for q in eval_queries:
        qe = build_rag_pipeline(index, llm_rag, query = q, doc_type_filtering= True, k_per_retriever = 10)
        retriever = qe._retriever
        gold = set(gold_ids_by_query.get(q, []))
        if not gold:
            continue
        n += 1

        hits = retriever.retrieve(q) or []
        ranked_ids = []
        for h in hits:
            nid = _node_id_from_hit(h)
            if nid:
                ranked_ids.append(nid)

        # MRR
        rr = 0.0
        for rank, nid in enumerate(ranked_ids, start=1):
            if nid in gold:
                rr = 1.0 / rank
                break
        mrr_sum += rr

        # Recall@K + HitRate@K
        for K in Ks:
            topK = ranked_ids[:K]
            retrieved_rel = sum(1 for nid in topK if nid in gold)
            recall_at[K] += retrieved_rel / max(1, len(gold))
            hit_at[K]    += 1.0 if retrieved_rel > 0 else 0.0

    if n == 0:
        return {"note": "no labeled queries"}

    out = {"MRR": round(mrr_sum / n, 4)}
    for K in Ks:
        out[f"Recall@{K}"] = round(recall_at[K] / n, 4)
        out[f"HitRate@{K}"] = round(hit_at[K] / n, 4)
    return out

metrics = eval_retrieval_filter(queries, gold_ids_by_query, Ks=(1,3,5,8,10))
print(metrics)

2025-09-26 13:57:13,402 - DEBUG - Building index from IDs objects
2025-09-26 13:57:20,000 - DEBUG - Building index from IDs objects
2025-09-26 13:57:27,996 - DEBUG - Building index from IDs objects
2025-09-26 13:57:35,064 - DEBUG - Building index from IDs objects
2025-09-26 13:57:41,089 - DEBUG - Building index from IDs objects
2025-09-26 13:57:47,509 - DEBUG - Building index from IDs objects
2025-09-26 13:57:53,529 - DEBUG - Building index from IDs objects
2025-09-26 13:57:59,616 - DEBUG - Building index from IDs objects
2025-09-26 13:58:05,975 - DEBUG - Building index from IDs objects
2025-09-26 13:58:12,772 - DEBUG - Building index from IDs objects
2025-09-26 13:58:19,283 - DEBUG - Building index from IDs objects
2025-09-26 13:58:25,541 - DEBUG - Building index from IDs objects
2025-09-26 13:58:32,309 - DEBUG - Building index from IDs objects
2025-09-26 13:58:39,009 - DEBUG - Building index from IDs objects
2025-09-26 13:58:45,822 - DEBUG - Building index from IDs objects


{'MRR': 0.2978, 'Recall@1': 0.0667, 'HitRate@1': 0.0667, 'Recall@3': 0.4667, 'HitRate@3': 0.4667, 'Recall@5': 0.6667, 'HitRate@5': 0.6667, 'Recall@8': 0.6667, 'HitRate@8': 0.6667, 'Recall@10': 0.7333, 'HitRate@10': 0.7333}


Even if we added soft metadata routing only and it worked fine in training set, in test files it is giving worse result than default routing. This happens because our Llama 3.1 8B itself is limited and is not able to decide on the document type properly. We got near perfect result with Gemini 2.5 flash and other similar commercial grade LLMs, but they are not as secure and cheap compared to our local LLM, we can run a better offline LLM like Deepseek R1 and Grok 2 on much more capable machine but for our project we will not be getting much improvements until LLM itself is not able to route metadata properly.

In [210]:
import time

def process_queries(queries):
    rag_engine = build_rag_pipeline(index, llm_rag, final_top_n = 3, k_per_retriever = 8) 
    time_list = []
    for i in range(len(queries)):
        print(f"Query No. {i+1}:\n"
        f"{queries[i]}\n"
        "----------------------\n"
        )
        start_time = time.time()
        
        response = rag_engine.query(queries[i])
        raw  = str(response)
        final = finalize_answer_minimal(raw)
        
        end_time = time.time()  
        elapsed_time = end_time - start_time  # Calculate elapsed time in seconds
        time_list.append(elapsed_time)
        print('\nFinal Response:\n ---------------------- \n')
        print(final)
        print("==============================\n")
        
    print(f"""\n Average time taken for processing queries: {np.mean(time_list):.3f} seconds.\n
    Maximum time taken for processing a single query: {np.max(time_list):.3f} seconds.
    """)

In [211]:
process_queries(queries)

2025-09-26 14:12:21,139 - DEBUG - Building index from IDs objects


Query No. 1:
What is the name of person in the Driver's License ID?
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

123 -567 sss avin | Reece 68 eye wer 1401br> O111/66 fh Conon miO(/o

Query No. 2:
What is the gross adjusted sale price of Comparable Sale Number 1 in the appraisal report?
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

$1,073,000

Query No. 3:
What is the purpose of Appraisal Report?
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

To provide the lender/client with an accurate, and adequately supported, opinion of the market value of the subject property

Query No. 4:
What is the address for the lot in parcel number 1?
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

58 of Villa Barbados North Unit No

Query No. 5:
What is the total Loan estimate?
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

$117,339

Query No. 6:
What is the property address for which loan is being taken?
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

Not found

Query No. 7:
What are the names of applicants in loan estimate sheet?
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

John Q

Query No. 8:
What is the closing balance in the account statement?
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

51,176

Query No. 9:
What are the services you cant shop for? Mention them and give their total
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

B

Query No. 10:
What is the name of employee in payslip and what is the Net Pay?
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

Sally Harley, 9500

Query No. 11:
When was the last deposit made as per the the Bank Statement
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

07/31/2018

Query No. 12:
What are the Total Closing Cost as per Loan Estimate?
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

$5,802

Query No. 13:
Sumamrize Neighborhood Description in the appraisal report
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

A residential neighborhood comprised predominantly of 2-3 story, wood frame, row style, and detached SFRs

Query No. 14:
Give breakdown of the total livable area floor wise for the property concerned in appraisal report.
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

First Floor: 948

Query No. 15:
What are the wages for Tina in W-2 form
----------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Final Response:
 ---------------------- 

8489


 Average time taken for processing queries: 42.492 seconds.

    Maximum time taken for processing a single query: 57.680 seconds.
    


In [247]:
print(len(response.source_nodes))

3


We got 66.67% accuracy with 10 queries having valid answers. However, relative accuracy of some of them have issues, for example Wages of Tina are 8489.27, due to strict answer formatting we were made, still relative accuracy is at 83%.

Overall, both accuracy, relative numerical accuracy, and latency could be improved a lot by using a better LLM. And also, retriever could be improved by an actual working metadata filtering and query routing which every Local Instruct LLM we tested failed to do as we intended. We could swap a better LLM here or use count based method where we determine the doctype on the basis of most common words.

## Gradio APP

In this part of the notebook we will wrap the essential codes for the Gradio application which would allow users to access the RAG Pipeline running on local machine. Most of the code would be repeated and used for gradio specifically, we can create a different notebook for this but since this is one project we will combine them both. nevertheless code of section below could be used seperately for the app.

App highlights:
1. Upload PDFs → Build/refresh index with a visible spinner overlay and progress updates.
2. Ask a question in a large input box; Send/Clear are compact and embedded beside it.
3. Immediate chat experience: user message appears first; a temporary “Working…” bubble is shown; then the final answer replaces it.
4. Top 3 Chunks table with score and source details.
5. PDF Viewer to see uploaded files.
6. One simple control that matters: the Retrieve top‑k slider, which balances recall vs. speed.
7. Strict QA prompt + short generation cap to keep answers crisp.
8. Output finalizer to avoid repeated text (numbers or sentences).

And while most of the pipeline is same we had to remove Logical Documents chunking as even index of 3-4 documents was taking 5-8 minutes to build the index when without it, we needed only few seconds to make the index. If we could improve the quality and latency Document Segmentation and Query Routing we could add them back for gradio app, but most immediate method to improve them is use API based LLM and for this project we are restricted from them.

In [42]:
# Complete Gradio RAG app
# - No metadata filtering in chat
# - Table-safe ingestion (optional Camelot) + OCR fallback
# - Top-3 retrieved chunks with brief summaries (no raw node text)
# - PDF viewing via base64 iframe (works regardless of local path)
# - Compatible with older Gradio (removed unsupported args like height/wrap/scale)

import os
import re
import time
import base64
import shutil
import pandas as pd
import gradio as gr
import fitz  # PyMuPDF
import faiss
from PIL import Image, ImageOps
import pytesseract

# ---------- Optional Camelot (guarded table extraction) ----------
USE_CAMELOT = True
try:
    import camelot  # pip install camelot-py; requires Ghostscript
except Exception:
    USE_CAMELOT = False

# ---------- LlamaIndex / Embeddings / LLMs ----------
from llama_index.core import Document, Settings, VectorStoreIndex, StorageContext
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.prompts import PromptTemplate
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.llms.llama_cpp import LlamaCPP

# ---------- App constants ----------
UPLOADS_DIR = "uploads"
os.makedirs(UPLOADS_DIR, exist_ok=True)

STOP_PHRASES = [
    "thank you", "we appreciate your business", "page", "fax", "phone", "email",
    "powered by", "confidential", "copyright"
]
CURRENCY_RE = re.compile(r'[$€₹£]\s?\d')
NUMBER_RE   = re.compile(r'\d')
DATE_RE     = re.compile(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b', re.I)

def _has_ruled_lines(page, min_lines=8):
    try:
        drawings = page.get_drawings()
        lines = 0
        for d in drawings:
            for it in d.get("items", []):
                if it[0] == "l":
                    lines += 1
        return lines >= min_lines
    except Exception:
        return False

def _row_informative(cells, min_chars=10):
    s = " ".join(cells).strip()
    if len(s) < min_chars:
        return False
    low = s.lower()
    if any(ph in low for ph in STOP_PHRASES):
        return False
    if CURRENCY_RE.search(s) or DATE_RE.search(s) or NUMBER_RE.search(s) or len(s) >= (min_chars * 2):
        return True
    return False



# ---------- Mixed PDF Reader (OCR + guarded tables) ----------
class MixedPDFReader:
    """
    Page-level text + optional table row nodes (Camelot lattice by default, guarded).
    OCR for image pages (no Camelot on OCR pages).
    """
    def __init__(self, dpi=300, lang="eng", min_chars=40, ocr_psm=6,
                 extract_tables=True, allow_stream=False,
                 min_table_rows=3, min_table_cols=3,
                 max_rows_per_table=60, max_nodes_per_page=120):
        self.dpi = dpi
        self.lang = lang
        self.min_chars = min_chars
        self.ocr_psm = ocr_psm
        self.extract_tables = extract_tables and USE_CAMELOT
        self.allow_stream = allow_stream
        self.min_table_rows = min_table_rows
        self.min_table_cols = min_table_cols
        self.max_rows_per_table = max_rows_per_table
        self.max_nodes_per_page = max_nodes_per_page

    def _has_meaningful_text(self, page, min_chars=None):
        if min_chars is None:
            min_chars = self.min_chars
        txt = page.get_text("text") or ""
        return len(txt.strip()) >= min_chars

    def _page_to_image(self, page):
        scale = self.dpi / 72.0
        pm = page.get_pixmap(matrix=fitz.Matrix(scale, scale))
        mode = "RGBA" if pm.alpha else "RGB"
        img = Image.frombytes(mode, (pm.width, pm.height), pm.samples)
        if pm.alpha:
            img = img.convert("RGB")
        img = ImageOps.autocontrast(img)
        return img

    def _read_tables(self, pdf_path, page_no, try_stream):
        if not self.extract_tables:
            return []
        tables = []
        try:
            t_lat = camelot.read_pdf(pdf_path, pages=str(page_no), flavor="lattice")
            if t_lat and t_lat.n > 0:
                tables = t_lat
            elif try_stream:
                t_str = camelot.read_pdf(pdf_path, pages=str(page_no), flavor="stream")
                if t_str and t_str.n > 0:
                    tables = t_str
        except Exception:
            return []
        return tables

    def _table_is_valid(self, table):
        df = table.df
        n_rows, n_cols = df.shape
        if n_rows < self.min_table_rows or n_cols < self.min_table_cols:
            return False
        bad = 0
        for r in range(n_rows):
            cells = [str(v).strip() for v in list(df.iloc[r].values)]
            if not _row_informative(cells):
                bad += 1
        return (bad / max(1, n_rows)) < 0.7

    def _emit_table_rows(self, table, base_meta, table_index):
        out = []
        df = table.df
        n_rows, n_cols = df.shape

        def looks_like_header(cells):
            txt = " ".join(cells)
            return NUMBER_RE.search(txt) is None or len(txt) < 40

        headers = None
        start_row = 0
        if n_rows > 0:
            first = [str(v).strip() for v in list(df.iloc[0].values)]
            if looks_like_header(first):
                headers = first
                start_row = 1

        emitted = 0
        for r in range(start_row, n_rows):
            if emitted >= self.max_rows_per_table:
                break
            cells = [str(v).strip() for v in list(df.iloc[r].values)]
            if not _row_informative(cells):
                continue
            row_text = "\t".join(c for c in cells if c)
            if not row_text:
                continue
            md = {
                **base_meta,
                "block_type": "table_row",
                "table_index": table_index,
                "row_index": r,
                "headers": headers,
                "n_cols": int(n_cols),
            }
            out.append(Document(text=row_text, metadata=md))
            emitted += 1
        return out

    def load_data(self, file, extra_info=None, **kwargs):
        doc = fitz.open(file)
        docs = []
        total_nodes = 0

        for i, page in enumerate(doc):
            page_no = i + 1
            base_meta = {
                "source": os.path.abspath(file),
                "file_name": os.path.basename(file),
                "page": page_no,
                **(extra_info or {}),
            }

            if self._has_meaningful_text(page):
                text = (page.get_text("text") or "").strip()
                ocr_applied = False
                docs.append(Document(text=text, metadata={**base_meta, "ocr_applied": ocr_applied, "block_type": "page_text"}))
                total_nodes += 1

                if self.extract_tables:
                    ruled = _has_ruled_lines(page, min_lines=8)
                    tables = self._read_tables(file, page_no, try_stream=(self.allow_stream and not ruled)) if (ruled or self.allow_stream) else []
                    for t_idx, t in enumerate(tables or []):
                        if not self._table_is_valid(t):
                            continue
                        rows = self._emit_table_rows(t, {**base_meta, "ocr_applied": ocr_applied}, t_idx)
                        docs.extend(rows)
                        total_nodes += len(rows)
                        if total_nodes >= self.max_nodes_per_page * page_no:
                            break
            else:
                img = self._page_to_image(page)
                text = pytesseract.image_to_string(img, lang=self.lang, config=f"--psm {self.ocr_psm}").strip()
                ocr_applied = True
                docs.append(Document(text=text, metadata={**base_meta, "ocr_applied": ocr_applied, "block_type": "ocr_page_text"}))
                total_nodes += 1

        return docs

# ---------- LLMs / Embeddings ----------
llm_rag = LlamaCPP(
    model_path="llama_3.1_8b_q8.gguf",   # <-- change to your GGUF path
    temperature=0.1,
    context_window=4096,
    model_kwargs={"n_gpu_layers": -1, "n_batch": 256},
    max_new_tokens=64,
    generate_kwargs={"stop": ["\n", "\n\n", "Reasoning", "Explanation:"]},
    verbose=False,
)
Settings.llm = llm_rag
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# ---------- QA prompt + synthesizer ----------
QA_PROMPT = PromptTemplate(
    "Use only the provided context. If the answer is not present, output exactly: Not found\n"
    "Return only the answer text, one short line. Do not add explanations, notes, or extra words.\n\n"
    "Context:\n{context_str}\n\n"
    "Question: {query_str}\n"
    "Answer:"
)
resp_synth = get_response_synthesizer(response_mode="compact", text_qa_template=QA_PROMPT)

# ---------- Robust uploader ----------
def save_uploads_chat(files):
    saved = []
    stamp_dir = os.path.join(UPLOADS_DIR, str(int(time.time())))
    os.makedirs(stamp_dir, exist_ok=True)

    def _src_path(f):
        if isinstance(f, str) and os.path.exists(f):
            return f
        for attr in ("name", "path"):
            p = getattr(f, attr, None)
            if p and os.path.exists(p):
                return p
        if isinstance(f, dict):
            for k in ("name", "path"):
                p = f.get(k)
                if p and os.path.exists(p):
                    return p
        raise ValueError(f"Could not resolve path for uploaded file: {f}")

    for f in files:
        src = _src_path(f)
        dst = os.path.join(stamp_dir, os.path.basename(src))
        shutil.copy2(src, dst)
        saved.append(dst)
    return saved

# ---------- Build logical docs (fast: one per file) ----------
def build_logical_documents_fast(page_docs):
    by_file = {}
    for d in page_docs:
        key = (d.metadata.get("source"), d.metadata.get("file_name"))
        by_file.setdefault(key, []).append(d)
    logical_docs = []
    for (src, fname), pages in by_file.items():
        pages.sort(key=lambda x: x.metadata.get("page", 0))
        text = "\n\n".join((p.text or "").strip() for p in pages if (p.text or "").strip())
        if not text:
            continue
        logical_docs.append({
            "text": text,
            "doc_type": "Other",
            "page_start": pages[0].metadata.get("page", 1),
            "page_end": pages[-1].metadata.get("page", 1),
            "source_file": src, "file_name": fname,
        })
    return logical_docs

# ---------- Build index (chunk + semantic + FAISS) ----------
def build_index_from_logical_docs_chat(logical_docs):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    splitter = RecursiveCharacterTextSplitter(chunk_size=900, chunk_overlap=120)
    coarse_docs = []
    for d in logical_docs:
        chunks = splitter.split_text(d["text"])
        for cidx, chunk in enumerate(chunks):
            coarse_docs.append(Document(
                text=chunk,
                metadata={
                    "doc_type": d["doc_type"],
                    "chunk_index": cidx,
                    "page_start": d["page_start"],
                    "page_end": d["page_end"],
                    "source_file": d.get("source_file"),
                    "file_name": d.get("file_name"),
                }
            ))

    semantic_splitter = SemanticSplitterNodeParser(
        buffer_size=20, breakpoint_percentile_threshold=90, embed_model=Settings.embed_model
    )
    semantic_nodes = semantic_splitter.get_nodes_from_documents(coarse_docs)

    dim = len(Settings.embed_model.get_text_embedding("probe"))
    faiss_index = faiss.IndexFlatL2(dim)
    vector_store = FaissVectorStore(faiss_index=faiss_index)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    index = VectorStoreIndex(semantic_nodes, storage_context=storage_context)
    return index

# ---------- RAG pipeline (no metadata filtering) ----------
def build_rag_pipeline_chat(index, k_per_retriever=8, final_top_n=3, num_queries=1):
    all_nodes = list(index.docstore.docs.values())
    k_per = max(2, min(int(k_per_retriever), len(all_nodes)))
    final_k = max(1, min(int(final_top_n), k_per))

    vec = index.as_retriever(similarity_top_k=k_per)
    bm25 = BM25Retriever.from_defaults(nodes=all_nodes, similarity_top_k=k_per)

    fusion = QueryFusionRetriever(
        retrievers=[vec, bm25],
        llm=llm_rag,
        similarity_top_k=k_per,
        num_queries=num_queries,  # keep 1 for latency
        mode="reciprocal_rerank",
    )

    reranker = SentenceTransformerRerank(
        model="cross-encoder/ms-marco-MiniLM-L-2-v2",
        top_n=final_k,
    )

    qe = RetrieverQueryEngine.from_args(
        retriever=fusion,
        llm=llm_rag,
        response_synthesizer=resp_synth,
        node_postprocessors=[reranker],
        verbose=False,
    )
    return qe

# ---------- Final answer formatter ----------
PLACEHOLDER_LINE = re.compile(r'^\s*[_\-\.\$€¥\s]{3,}\s*$', re.I)
NOISY_FOOTERS = ["Search results", "Source [", "Explanation:", "Note:"]

def finalize_answer_minimal(raw_text: str) -> str:
    s = (raw_text or "").strip()
    if not s:
        return "Not found"
    m = re.search(r'\bnot\s*found\b', s, flags=re.I)
    if m:
        s = s[:m.start()].strip()
    s = re.sub(r'\([^)]*\)', '', s).strip()
    for marker in NOISY_FOOTERS:
        if marker in s:
            s = s.split(marker, 1)[0].strip()
    parts = re.split(r'(?<=[.!?])\s+|\n+', s)
    first = next((p.strip() for p in parts if p and p.strip()), "")
    if not first or PLACEHOLDER_LINE.match(first):
        return "Not found"
    first = re.split(r'[.;]\s*', first)[0].strip()
    return first[:200] if first else "Not found"

# ---------- Brief summary (no raw node text) ----------
def brief_from_text(text: str, max_words: int = 14) -> str:
    t = (text or "").strip()
    t = re.sub(r"\s+", " ", t)
    t = re.sub(r"[\(\)\[\]\{\}]", "", t)
    words = t.split()
    brief = " ".join(words[:max_words])
    return (brief + "…") if len(words) > max_words else brief

# ---------- Top-N nodes table (brief summaries) ----------
def top_nodes_table_from_response(resp, k=3) -> pd.DataFrame:
    rows = []
    try:
        sns = getattr(resp, "source_nodes", []) or []
        for rank, sn in enumerate(sns[:k], start=1):
            node = sn.node
            score = getattr(sn, "score", None)
            txt = (node.get_text() or "")
            md = node.metadata or {}
            rows.append({
                "rank": rank,
                "score": round(float(score), 4) if score is not None else None,
                "source_file": md.get("file_name") or os.path.basename(md.get("source") or md.get("source_file") or ""),
                "doc_type": md.get("doc_type", ""),
                "page_hint": md.get("page") or md.get("page_start") or "",
                "brief": brief_from_text(txt, max_words=14),
            })
    except Exception as e:
        rows.append({"rank": None, "score": None, "source_file": "", "doc_type": "", "page_hint": "", "brief": f"(error building table: {e})"})
    return pd.DataFrame(rows, columns=["rank","score","source_file","doc_type","page_hint","brief"])

# ---------- Build index action ----------
def build_index_action_chat(files, progress):
    try:
        progress(0, desc="Saving uploads…")
        paths = save_uploads_chat(files)
        if not paths:
            return None, [], [], "Please upload at least one PDF."

        progress(0.1, desc="Reading PDFs (OCR / tables guarded)…")
        reader = MixedPDFReader(dpi=300, lang="eng", min_chars=40, ocr_psm=6,
                                extract_tables=True, allow_stream=False)
        page_docs = []
        for p in paths:
            page_docs.extend(reader.load_data(p, extra_info={}))

        if not page_docs:
            return None, [], [], "No pages were read from the PDFs."

        progress(0.35, desc="Forming logical documents…")
        logical_docs = build_logical_documents_fast(page_docs)
        if not logical_docs:
            return None, [], [], "No logical documents were formed."

        progress(0.55, desc="Chunking + semantic refinement + FAISS…")
        index = build_index_from_logical_docs_chat(logical_docs)

        msg = f"✅ Indexed {len(logical_docs)} logical docs, {len(index.docstore.docs)} chunks."
        progress(0.95, desc="Finalizing…")
        return index, logical_docs, paths, msg
    except Exception as e:
        return None, [], [], f"❌ Build failed: {type(e).__name__}: {e}"

# ---------- Chat handlers ----------
def chat_answer_chat(query, index, chat_history, k_slider):
    chat_history = chat_history or []
    if index is None:
        return chat_history + [
            {"role": "user", "content": query},
            {"role": "assistant", "content": "Please upload files and build the index first."}
        ], pd.DataFrame()
    qe = build_rag_pipeline_chat(
        index,
        k_per_retriever=int(k_slider),
        final_top_n=min(3, int(k_slider)),
        num_queries=1,
    )
    resp = qe.query(query)
    answer = finalize_answer_minimal(str(resp).strip())
    top_df = top_nodes_table_from_response(resp, k=3)
    new_hist = chat_history + [{"role": "user", "content": query}, {"role": "assistant", "content": answer}]
    return new_hist, top_df

def pre_send_chat(q, hist):
    hist = hist or []
    q = (q or "").strip()
    if not q:
        return hist, ""
    new_hist = hist + [{"role": "user", "content": q}, {"role": "assistant", "content": "⏳ Working…"}]
    return new_hist, ""

def run_answer_chat(idx, hist, k):
    hist = hist or []
    q = next((m["content"] for m in reversed(hist) if m.get("role") == "user"), "").strip()
    if idx is None:
        if hist and hist[-1].get("role") == "assistant":
            hist[-1] = {"role": "assistant", "content": "Please upload files and build the index first."}
        return hist, pd.DataFrame()
    qe = build_rag_pipeline_chat(
        idx,
        k_per_retriever=int(k),
        final_top_n=min(3, int(k)),
        num_queries=1,
    )
    resp = qe.query(q)
    final_text = finalize_answer_minimal(str(resp))
    top_df = top_nodes_table_from_response(resp, k=3)
    if hist and hist[-1].get("role") == "assistant":
        hist[-1] = {"role": "assistant", "content": final_text}
    else:
        hist += [{"role": "assistant", "content": final_text}]
    return hist, top_df

# ---------- PDF viewer (base64 iframe) ----------
def _pdf_data_uri(path: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return (
        f'<iframe src="data:application/pdf;base64,{b64}#view=FitH&toolbar=1" '
        f'width="100%" height="720" style="border:none;"></iframe>'
    )

def pdf_choices_from_paths(paths):
    return [os.path.basename(p) for p in paths]

def on_pdf_select(choice, paths):
    if not choice or not paths:
        return gr.update(value="", visible=False)
    p = next((p for p in paths if os.path.basename(p) == choice), None)
    if not p or not os.path.exists(p):
        return gr.update(value="", visible=False)
    html = _pdf_data_uri(p)
    return gr.update(value=html, visible=True)

llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
2025-09-27 15:50:12,811 - INFO - Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5
2025-09-27 15:50:17,478 - INFO - 1 prompt is loaded, with the key: query


In [44]:
CSS = """
.btn-sm button { padding: 6px 12px !important; min-height: 36px !important; }
.query-box textarea { min-height: 120px !important; font-size: 16px; }
"""

with gr.Blocks(title="PDF RAG APP", css=CSS) as demo:
    gr.Markdown("### Local PDF RAG\nUpload PDFs → Build index → Ask questions. Shows top‑3 retrieved chunks with brief summaries and an inline PDF viewer.")

    with gr.Row():
        uploader = gr.Files(label="Upload PDFs", file_types=[".pdf"], file_count="multiple")

    with gr.Row():
        build_btn = gr.Button("Build / Refresh Index", variant="primary")
        clear_idx_btn = gr.Button("Clear Index")
    status = gr.Markdown()

    # States
    state_index = gr.State(None)
    state_logical = gr.State([])
    state_pdf_paths = gr.State([])

    with gr.Row():
        k_slider = gr.Slider(minimum=2, maximum=12, value=8, step=1, label="Retrieve top‑k", interactive=True)

    with gr.Row():
        chat = gr.Chatbot(label="Chat", height=360, type="messages")

    with gr.Row():
        query_box = gr.Textbox(label="Ask a question", placeholder="Ask about your documents…",
                               lines=5, elem_classes=["query-box"])
        with gr.Column(min_width=140):
            send_btn = gr.Button("Send", variant="primary", elem_classes=["btn-sm"])
            clear_chat_btn = gr.Button("Clear Chat", variant="secondary", elem_classes=["btn-sm"])

    top_nodes_df = gr.Dataframe(
        headers=["rank","score","source_file","doc_type","page_hint","brief"],
        label="Top retrieved (k=3)",
        interactive=False
    )

    with gr.Row():
        pdf_dropdown = gr.Dropdown(choices=[], label="Open a PDF", interactive=True)
        pdf_view = gr.HTML(label="PDF Viewer", visible=False)

    # Build actions
    def on_build(files):
        if not files:
            return None, [], [], "Please upload at least one PDF."
        return build_index_action_chat(files, progress=gr.Progress())

    def on_clear_index():
        return None, [], [], "Index cleared."

    build_btn.click(
        on_build,
        inputs=[uploader],
        outputs=[state_index, state_logical, state_pdf_paths, status]
    ).then(
        lambda paths: gr.update(choices=pdf_choices_from_paths(paths)),
        inputs=[state_pdf_paths],
        outputs=[pdf_dropdown],
    ).then(
        on_pdf_select,
        inputs=[pdf_dropdown, state_pdf_paths],
        outputs=[pdf_view],
    )

    clear_idx_btn.click(
        on_clear_index,
        outputs=[state_index, state_logical, state_pdf_paths, status]
    ).then(
        lambda: gr.update(choices=[]),
        outputs=[pdf_dropdown]
    ).then(
        lambda: gr.update(visible=False, value=""),
        outputs=[pdf_view]
    )

    # Chat flow (returns brief summaries table)
    send_btn.click(
        pre_send_chat,
        inputs=[query_box, chat],
        outputs=[chat, query_box],
    ).then(
        run_answer_chat,
        inputs=[state_index, chat, k_slider],
        outputs=[chat, top_nodes_df],
    )

    query_box.submit(
        pre_send_chat,
        inputs=[query_box, chat],
        outputs=[chat, query_box],
    ).then(
        run_answer_chat,
        inputs=[state_index, chat, k_slider],
        outputs=[chat, top_nodes_df],
    )

    clear_chat_btn.click(lambda: [], outputs=[chat])

    # PDF selection change → show iframe
    pdf_dropdown.change(
        on_pdf_select,
        inputs=[pdf_dropdown, state_pdf_paths],
        outputs=[pdf_view]
    )

demo.queue().launch()


2025-09-27 15:52:55,393 - INFO - HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
2025-09-27 15:52:55,543 - INFO - HTTP Request: GET http://127.0.0.1:7873/gradio_api/startup-events "HTTP/1.1 200 OK"
2025-09-27 15:52:55,554 - INFO - HTTP Request: HEAD http://127.0.0.1:7873/ "HTTP/1.1 200 OK"


* Running on local URL:  http://127.0.0.1:7873
* To create a public link, set `share=True` in `launch()`.




2025-09-27 15:57:49,891 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 16:00:25,864 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "/home/anubh/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/queueing.py", line 745, in process_events
    response = await route_utils.call_process_api(
  File "/home/anubh/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/route_utils.py", line 354, in call_process_api
    output = await app.get_blocks().process_api(
  File "/home/anubh/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/blocks.py", line 2112, in process_api
    inputs = await self.preprocess_data(
  File "/home/anubh/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/blocks.py", line 1774, in preprocess_data
    processed_input.append(block.preprocess(inputs_cached))
  File "/home/anubh/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/components/dropdown.py", line 206, in preprocess
    raise Error(
gradio.exceptions.Error: "Value: blob_scanned_id_pay_return.pdf is not in the list of choices: ['LenderFeesWorksheetNew.pdf']"
Traceback (m

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 16:18:08,996 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 16:22:52,567 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 16:25:03,837 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "/home/anubh/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/queueing.py", line 745, in process_events
    response = await route_utils.call_process_api(
  File "/home/anubh/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/route_utils.py", line 354, in call_process_api
    output = await app.get_blocks().process_api(
  File "/home/anubh/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/blocks.py", line 2112, in process_api
    inputs = await self.preprocess_data(
  File "/home/anubh/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/blocks.py", line 1774, in preprocess_data
    processed_input.append(block.preprocess(inputs_cached))
  File "/home/anubh/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/components/dropdown.py", line 206, in preprocess
    raise Error(
gradio.exceptions.Error: "Value: blob_scanned_id_pay_return.pdf is not in the list of choices: ['LenderFeesWorksheetNew.pdf', 'LenderFeesWo

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 16:34:26,127 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 16:36:43,486 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 16:40:42,614 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 16:47:29,333 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 16:49:29,419 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 16:54:08,292 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-09-27 16:57:43,182 - DEBUG - Building index from IDs objects


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [37]:
import math

## Conclusion

We delivered a local, privacy‑preserving PDF question‑answering assistant powered by a robust RAG pipeline and a minimal Gradio chat UI. The system ingests user‑uploaded PDFs (including scanned pages via OCR), classifies and stitches logical documents, semantically chunks them, indexes with an interchangeable vector store (FAISS), and answers questions with hybrid retrieval (vector + BM25), query expansion, cross‑encoder reranking, and strict output controls. 

The Gradio chat app stays simple: upload → build index → ask → get a concise answer with clickable source‑page previews, designed with the focus on needs of end users who would prefer simple direct answers than a complex app which has high level control over RAG. This aligns with the industry‑standard RAG pattern for grounding LLM outputs in enterprise data, and follows best‑practice retrieval steps (hybrid search, reranking, and prompt discipline) that improve accuracy and trust.

It is to be noted that we could improve the performance of RAG and reduce the latency of retrievel in Gradio APP and in overall RAG pipeline by using a better LLM like gemini or running LLMs locally on machine with better resources. 