<a href="https://colab.research.google.com/github/dishitasood/workflow/blob/master/build_rag_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Section 1: Loading PDF

In [3]:
# install necessary libraries
!pip install -q llama-index llama-index-llms-gemini pymupdf
!pip install -q llama-index-embeddings-huggingface
!pip install nest_asyncio
!pip install --upgrade transformers
!pip install -U sentence_transformers



In [4]:
import os
import fitz  # PyMuPDF
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
import nest_asyncio

In [5]:
import os
from google.colab import userdata

# Load the key from Colab secrets instead of hard-coding it in the notebook
GOOGLE_API_KEY = userdata.get('gemini_key')
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

In [6]:
nest_asyncio.apply()

In [7]:
!mkdir -p sample_docs

In [8]:
from google.colab import files
import os

def upload_pdf():
    """Upload a PDF file and return its path."""
    print("Please select a PDF file to upload:")
    uploaded = files.upload()

    for filename in uploaded.keys():
        if filename.endswith('.pdf'):
            # Save to the sample_docs directory
            pdf_path = os.path.join("sample_docs", filename)

            # Create directory if it doesn't exist
            os.makedirs("sample_docs", exist_ok=True)

            # Save the file
            with open(pdf_path, 'wb') as f:
                f.write(uploaded[filename])

            print(f"PDF saved to {pdf_path}")
            return pdf_path
        else:
            print(f"File {filename} is not a PDF. Please upload a PDF file.")

    return None


In [15]:
pdf_path = upload_pdf()

Please select a PDF file to upload:


Saving LenderFeesWorksheetNew.pdf to LenderFeesWorksheetNew.pdf
PDF saved to sample_docs/LenderFeesWorksheetNew.pdf
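
The check in `upload_pdf` relies on the file extension alone, and `endswith('.pdf')` is case-sensitive, so a file named `REPORT.PDF` would be rejected. A minimal sketch of a slightly stricter check follows. It assumes the `%PDF-` header sits at the very start of the file, which holds for most but not all PDFs. The helper name `looks_like_pdf` is hypothetical and not part of this notebook.

```python
def looks_like_pdf(filename: str, data: bytes) -> bool:
    """Return True if the name ends in .pdf (any case) and the bytes start with the PDF header."""
    has_pdf_extension = filename.lower().endswith(".pdf")
    # PDF files normally begin with the marker "%PDF-" followed by a version, e.g. "%PDF-1.7".
    has_pdf_header = data[:5] == b"%PDF-"
    return has_pdf_extension and has_pdf_header


print(looks_like_pdf("Report.PDF", b"%PDF-1.7\n"))  # upper-case extension, valid header
print(looks_like_pdf("notes.pdf", b"plain text"))    # right extension, wrong content
```

Inside `upload_pdf`, the call could look like `looks_like_pdf(filename, uploaded[filename])`, since `files.upload()` returns the file contents as bytes.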


In [16]:
def extract_text_from_pdf(pdf_path):
  """Extract the text of every page in the PDF and return it as one string."""
  doc = fitz.open(pdf_path)

  # extract text from all pages
  text = "\n".join([page.get_text() for page in doc])

  print("PDF: ", pdf_path)
  print("Number of pages: ", len(doc))
  print(f"Extracted {len(text.split())} words from the PDF")

  doc.close()

  return text

In [17]:
if pdf_path:
  text = extract_text_from_pdf(pdf_path)
  print(text[:500])

PDF:  sample_docs/LenderFeesWorksheetNew.pdf
Number of pages:  1
Extracted 404 words from the PDF
Your actual rate, payment, and cost could be higher. Get an official Loan Estimate before choosing a loan.
Fee Details and Summary
Applicants:
Application No:
Date Prepared:
Loan Program:
Prepared By:
THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fees Worksheet" is provided for informational purposes ONLY, to assist
you in determining an estimate of cash that may be required to close and an estimate of your proposed monthly mortgage 
payment. Actual charges may be more or less, and your transac


### Integrating PyMuPDF with LlamaIndex

In [18]:
from llama_index.core import Document
from typing import List

def load_pdf_with_pymupdf(pdf_path: str) -> List[Document]:

  # open the pdf
  doc = fitz.open(pdf_path)

  documents = []

  for i, page in enumerate(doc):
    text = page.get_text()

    if not text.strip():
      continue

    documents.append(
        Document(
            text=text,
            metadata={
                "file_name": os.path.basename(pdf_path),
                "page_number": i + 1,
                "total_pages": len(doc)
            }
        )
    )

  doc.close()

  print(f"Processed {pdf_path}:")
  print(f"Extracted {len(documents)} pages with content")

  return documents



In [19]:
# example usage
pdf_docs = load_pdf_with_pymupdf(pdf_path)

Processed sample_docs/LenderFeesWorksheetNew.pdf:
Extracted 1 pages with content


In [20]:
import os
from google.colab import userdata # Import userdata
GOOGLE_API_KEY = userdata.get('gemini_key')
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

### Section 2: Indexing and Processing PDFs

In [21]:
!pip install -q llama-index-llms-google-genai



To add your API key to Colab secrets:

1. Click on the "🔑" icon in the left sidebar of your Colab notebook.
2. Click on "Add new secret".
3. In the "Name" field, enter a name for your secret (e.g., `gemini_key`, the name this notebook reads with `userdata.get`).
4. In the "Value" field, paste your API key.
5. Make sure the "Notebook access" toggle is turned on for the current notebook.
6. Click "Done".

Now you can access your API key in your code using `userdata.get('YOUR_SECRET_NAME')`, replacing `YOUR_SECRET_NAME` with the name you gave your secret.
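
The steps above can be wrapped in a small helper so the same notebook also runs outside Colab. This is a sketch under stated assumptions: `load_api_key` is a hypothetical name, the default secret name `gemini_key` matches the one read earlier in this notebook, and the fallback environment variable is `GOOGLE_API_KEY`.

```python
import os


def load_api_key(secret_name: str = "gemini_key", env_var: str = "GOOGLE_API_KEY") -> str:
    """Read the key from Colab secrets when available, else from an environment variable."""
    try:
        # google.colab is importable only inside a Colab runtime.
        from google.colab import userdata
        key = userdata.get(secret_name)
    except Exception:
        # Outside Colab, or the secret is missing or not shared with this notebook.
        key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key found in Colab secret '{secret_name}' or in ${env_var}")
    return key
```

A typical call would be `os.environ["GOOGLE_API_KEY"] = load_api_key()`, which fails early with a clear message if no key is configured.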

In [22]:
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.google_genai import GoogleGenAI


# initialize the Gemini LLM
llm = GoogleGenAI(
    model="gemini-2.5-flash"
)
Settings.llm = llm

# initialize the embedding model (a sentence-transformers model loaded through HuggingFaceEmbedding)
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embed_model = HuggingFaceEmbedding(model_name=embed_model_name)
Settings.embed_model = embed_model

def process_index_pdf(pdf_path):
  """Load a PDF page by page and build a vector index over its text."""

  # load documents
  documents = load_pdf_with_pymupdf(pdf_path)

  # create vector index
  index = VectorStoreIndex.from_documents(documents)

  print(f"Indexed {len(documents)} document chunks")

  return index

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [23]:
if pdf_path:
  index = process_index_pdf(pdf_path)
else:
  print("No PDF file uploaded. Please upload a PDF file using the previous cell.")

Processed sample_docs/LenderFeesWorksheetNew.pdf:
Extracted 1 pages with content
Indexed 1 document chunks
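
At query time, the vector index embeds the query and returns the chunks whose embeddings are most similar to it. The sketch below illustrates that idea with cosine similarity over toy 3-dimensional vectors in place of the 384-dimensional `all-MiniLM-L6-v2` embeddings. The helpers `cosine_similarity` and `top_k` and the chunk names are hypothetical illustrations, not LlamaIndex API, and the library's internals may differ in detail.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], chunk_vecs: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every chunk against the query and keep the k most similar ones."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings, made up for illustration only.
toy_chunks = {
    "lender_fees": [0.9, 0.1, 0.0],
    "title_charges": [0.2, 0.8, 0.1],
    "disclaimer": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.2, 0.0], toy_chunks, k=2)
print([name for name, _ in results])
```

The `similarity_top_k=2` argument used later in this notebook plays the role of `k` here.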


### Section 3: Implementing Query Expansion and Rewriting

In [24]:
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core import Settings

# Initialize Gemini LLM
llm = GoogleGenAI(
    model="gemini-2.5-flash"
)
Settings.llm = llm

# simple query expansion function using Gemini
def expand_query(query: str, num_expansions: int = 3) -> list:
  prompt = f"""
    I need to search a legal contract with this query: "{query}"

    Please help me expand this query by generating {num_expansions} alternative versions that:
    1. Use different but related terminology
    2. Include relevant legal terms that might appear in a contract
    3. Cover similar concepts but phrased differently

    Format your response as a list of alternative queries only, with no additional text.
    """

  response = llm.complete(prompt)

  #extract the expanded queries
  expanded_queries = [line.strip() for line in response.text.split('\n') if line.strip()]

  #add the original query if needed
  if query not in expanded_queries:
    expanded_queries = [query] + expanded_queries

  return expanded_queries

In [25]:
#Example Usage
expanded = expand_query("What are the penalties for late payments?")
for i, q in enumerate(expanded):
  print(f"{i+1}.{q}")


1.What are the penalties for late payments?
2.*   What are the remedies or consequences for payment default?
3.*   What are the charges, including default interest or administrative fees, for overdue amounts?
4.*   What are the liabilities or liquidated damages for a breach of payment terms?
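
The output above shows that the model returned a Markdown bullet list, so each expanded query still carries a leading `*   ` marker. A minimal sketch of a cleanup step follows, assuming the response uses common list markers (`*`, `-`, `•`, `1.`, `2)`). The helper name `clean_expansions` is hypothetical, and the second line of the sample text is made up for illustration.

```python
import re
from typing import List

# Leading list markers such as "*", "-", "•", "1." or "2)" followed by whitespace.
_LIST_MARKER = re.compile(r"^\s*(?:[*\-•]+|\d+[.)])\s+")


def clean_expansions(raw_text: str) -> List[str]:
    """Split an LLM response into lines, dropping blank lines and list markers."""
    cleaned = []
    for line in raw_text.splitlines():
        line = _LIST_MARKER.sub("", line).strip()
        if line:
            cleaned.append(line)
    return cleaned


sample = (
    "*   What are the remedies or consequences for payment default?\n"
    "\n"
    "2. What fees apply to overdue amounts?"
)
print(clean_expansions(sample))
```

Inside `expand_query`, the list comprehension over `response.text.split('\n')` could be replaced with `clean_expansions(response.text)`.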


### Creating a Query Expansion Engine

In [26]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever


def create_query_expansion_engine(index):
  """Create a query engine that uses query expansion."""
  # first create the base retriever
  base_retriever = index.as_retriever(similarity_top_k=2)

  # create a query fusion retriever
  fusion_retriever = QueryFusionRetriever(
      retrievers = [base_retriever],
      llm=llm,
      similarity_top_k=2,
      num_queries=3,
      mode="reciprocal_rerank"
      )

  # create query engine with the query fusion retriever
  query_engine = RetrieverQueryEngine.from_args(
      retriever=fusion_retriever,
      llm=llm,
      verbose=True
  )

  return query_engine

In [27]:
#Example usage
expanded_query_engine = create_query_expansion_engine(index)
response = expanded_query_engine.query("What is lender fees")
print(response)

The lender fees include an Underwriting Fee of $550.00, a Wire Transfer Fee of $75.00, and an Administration Fee of $445.00. Other charges paid to the lender are an Appraisal Fee of $525.00, a Credit Report Fee of $25.00, a Tax Service Fee of $80.00, a Flood Certification Fee of $20.00, and Daily Interest Charges totaling $1,121.53.
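
The `mode="reciprocal_rerank"` setting merges the result lists retrieved for each query variant into a single ranking. The sketch below shows the general reciprocal rank fusion idea: each document scores the sum of `1 / (k + rank)` over the lists it appears in. It uses the commonly cited constant `k = 60` and 1-based ranks. This illustrates the technique only. LlamaIndex's own implementation may differ in details, and the page identifiers are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)), with rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retrieval results, one ranked list per query variant.
rankings = [
    ["page_1", "page_3", "page_2"],  # original query
    ["page_3", "page_1"],            # generated query 1
    ["page_3", "page_4"],            # generated query 2
]
print(reciprocal_rank_fusion(rankings))
```

Documents that rank well across several query variants rise to the top, which is why fusion can help when one phrasing of a question misses the relevant passage.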


In [None]:
!pip install llama-index-retrievers-bm25