# Project 2: Abalone Age
### Objective C: Retrieval Augmented Generation Implementation in Gradio
#### DS6306: Doing Data Science
##### Aayush Dalal & Jacqueline Vu
##### December 10th, 2025


In [1]:
%pip install PyMuPDF transformers faiss-cpu

Collecting PyMuPDF
  Downloading pymupdf-1.26.7-cp310-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.13.1-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Downloading pymupdf-1.26.7-cp310-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Downloading faiss_cpu-1.13.1-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.7 MB)
Installing collected packages: PyMuPDF, faiss-cpu
Successfully installed PyMuPDF-1.26.7 faiss-cpu-1.13.1


In [2]:
%pip install torch



In [3]:
%pip install nltk



In [4]:
import os
import fitz  # PyMuPDF
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
import torch
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import faiss
import numpy as np
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## RAG Implementation

In [5]:
import os
import fitz  # PyMuPDF
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
import torch
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import faiss
import numpy as np
nltk.download('punkt_tab')

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Folder path in Google Drive that needs to retrieve the PDF file from
folder_path = '/content/drive/My Drive/PDFs/'

# Step 1: Read PDF Files
def read_pdfs(folder_path):
    pdf_texts = []
    for file_name in os.listdir(folder_path):
        if file_name.endswith('.pdf'):
            file_path = os.path.join(folder_path, file_name)
            try:
                doc = fitz.open(file_path)
                text = ""
                for page in doc:
                    text += page.get_text()
                pdf_texts.append((file_name, text))
            except Exception as e:
                print(f"Error reading {file_name}: {e}")
    return pdf_texts

# Step 2: Chunk Text
def chunk_text(text, chunk_size=100, overlap_sentences=2):
    sentences = sent_tokenize(text)
    chunks = []
    current_sentences_for_chunk = []
    current_word_count = 0

    for i, sentence in enumerate(sentences):
        sentence_word_count = len(sentence.split())

        # If adding the current sentence would push the chunk past chunk_size words
        # and the chunk already holds at least one sentence, finalize it.
        if current_word_count > 0 and (current_word_count + sentence_word_count > chunk_size):
            chunks.append(' '.join(current_sentences_for_chunk))

            # Start a new chunk by taking 'overlap_sentences' from the end of the previous sentences
            # Ensure we don't go out of bounds for the sentences list.
            current_sentences_for_chunk = sentences[max(0, i - overlap_sentences) : i]
            current_word_count = len(' '.join(current_sentences_for_chunk).split())

        current_sentences_for_chunk.append(sentence)
        current_word_count += sentence_word_count

    # Add the last chunk if it's not empty
    if current_sentences_for_chunk:
        chunks.append(' '.join(current_sentences_for_chunk))

    return chunks

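# Step 3: Create Embeddings
def create_embeddings(text_chunks, tokenizer, model):
    embeddings = []
    for chunk in text_chunks:
        # Chunks longer than 512 tokens are truncated before encoding
        inputs = tokenizer(chunk, return_tensors='pt', truncation=True, padding=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        # Mean-pool the token embeddings into a single vector per chunk
        embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze().numpy())
    # FAISS expects a 2-D float32 array of shape (num_chunks, embedding_dim)
    return np.array(embeddings, dtype=np.float32)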

# Step 4: Index Embeddings
def index_embeddings(embeddings):
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index

# Step 5: Answer Questions
def answer_question(question, pdf_texts, index, embeddings, tokenizer, model, llm_tokenizer, llm_model, temperature, max_new_tokens, top_k=3):
    # Create embedding for the question
    inputs = tokenizer(question, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        question_embedding = model(**inputs).last_hidden_state.mean(dim=1).squeeze().numpy()

    # Search for the nearest text chunks
    _, indices = index.search(np.array([question_embedding]), k=top_k)
    indices = indices[0]

    # Collect top-k chunks
    retrieved_chunks = []
    sources = []

    # pdf_texts is a list of (pdf_name, chunks) pairs; the FAISS index numbers
    # chunks consecutively across PDFs, so walk the list to locate each hit.
    for idx in indices:
        if idx < 0:
            # FAISS pads results with -1 when the index holds fewer than top_k vectors
            continue
        current_chunk_idx = 0
        for pdf_name, chunks_in_pdf in pdf_texts:
            if idx < current_chunk_idx + len(chunks_in_pdf):
                chunk_in_pdf_idx = idx - current_chunk_idx
                retrieved_chunks.append(chunks_in_pdf[chunk_in_pdf_idx])
                sources.append(f"{pdf_name}, Chunk {chunk_in_pdf_idx}")
                break
            current_chunk_idx += len(chunks_in_pdf)

    # Combine retrieved chunks
    combined_text = ' '.join(retrieved_chunks)

    # Refine the answer using a language model with a more structured prompt
    prompt_template = (
        "Based on the following context, please answer the question. "
        "If the answer is not available in the context, please state that you don't have enough information."
        "\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:"
    )
    formatted_prompt = prompt_template.format(context=combined_text, question=question)

    llm_inputs = llm_tokenizer(formatted_prompt, return_tensors='pt', truncation=True, padding=True, max_length=1024)
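    # temperature only takes effect when sampling is enabled and must be strictly positive;
    # at 0 (or None) fall back to deterministic beam search.
    generation_kwargs = {
        "max_new_tokens": int(max_new_tokens),
        "num_beams": 5,
        "early_stopping": True,
    }
    if temperature is not None and temperature > 0:
        generation_kwargs.update(do_sample=True, temperature=float(temperature))
    llm_outputs = llm_model.generate(**llm_inputs, **generation_kwargs)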
    refined_answer = llm_tokenizer.decode(llm_outputs[0], skip_special_tokens=True)

    return f"Answer: {refined_answer}\nSources: {sources}"

# Main function to tie everything together
def main(folder_path, question, model_name, llm_model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    llm_tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
    llm_model = AutoModelForSeq2SeqLM.from_pretrained(llm_model_name)

    # Read and chunk PDFs
    pdf_texts = read_pdfs(folder_path)
    all_chunks = []
    chunk_mapping = []

    for pdf_name, text in pdf_texts:
        # Split each PDF's text into chunks of roughly 100 words that overlap by 2 sentences
        chunks = chunk_text(text, overlap_sentences=2)
        all_chunks.extend(chunks)
        chunk_mapping.append((pdf_name, chunks))

    # Create and index embeddings
    embeddings = create_embeddings(all_chunks, tokenizer, model)
    index = index_embeddings(embeddings)

    # Answer question
    # Default temperature and max_new_tokens for main function if called directly
    answer = answer_question(question, chunk_mapping, index, embeddings, tokenizer, model, llm_tokenizer, llm_model, temperature=0.5, max_new_tokens=150)
    print(answer)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Mounted at /content/drive
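
To see what the sentence overlap in `chunk_text` does, the minimal sketch below applies the same logic to a toy passage. It is an illustration under stated assumptions, not the notebook's actual code path: `split_sentences` is a hypothetical regex splitter standing in for NLTK's `sent_tokenize` so the example runs without downloads, `chunk_sentences` is a hypothetical variant that takes a pre-split sentence list, and `chunk_size=12` with `overlap_sentences=1` is chosen only to force several small chunks.

```python
import re

def split_sentences(text):
    # Hypothetical stand-in for nltk.tokenize.sent_tokenize (assumption):
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def chunk_sentences(sentences, chunk_size=12, overlap_sentences=1):
    # Same control flow as chunk_text, applied to an already-split sentence list.
    chunks, current, word_count = [], [], 0
    for i, sentence in enumerate(sentences):
        n_words = len(sentence.split())
        # Close the current chunk once the next sentence would exceed chunk_size words.
        if word_count > 0 and word_count + n_words > chunk_size:
            chunks.append(' '.join(current))
            # Seed the next chunk with the last `overlap_sentences` sentences.
            current = sentences[max(0, i - overlap_sentences):i]
            word_count = sum(len(s.split()) for s in current)
        current.append(sentence)
        word_count += n_words
    if current:
        chunks.append(' '.join(current))
    return chunks

toy_text = (
    "Abalone are marine snails. Their age is estimated from shell rings. "
    "Counting rings requires a microscope. Models predict age from measurements."
)
toy_chunks = chunk_sentences(split_sentences(toy_text))
for n, chunk in enumerate(toy_chunks):
    print(n, chunk)
```

Each chunk after the first begins with the sentence that closed the previous one, so a fact that straddles a chunk boundary still appears whole in at least one chunk. The overlap also means a chunk can end up slightly longer than `chunk_size` words.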


## Gradio Implementation

In [6]:
%pip install --upgrade gradio

Collecting gradio
  Downloading gradio-6.1.0-py3-none-any.whl.metadata (16 kB)
Collecting gradio-client==2.0.1 (from gradio)
  Downloading gradio_client-2.0.1-py3-none-any.whl.metadata (7.1 kB)
Downloading gradio-6.1.0-py3-none-any.whl (23.0 MB)
Downloading gradio_client-2.0.1-py3-none-any.whl (55 kB)
Installing collected packages: gradio-client, gradio
  Attempting uninstall: gradio-client
    Found existing installation: gradio_client 1.14.0
    Uninstalling gradio_client-1.14.0:
      Successfully uninstalled gradio_client-1.14.0
  Attempting uninstall: gradio
    Found existing installation: gradio 5.50.0
    Uninstalling gradio-5.50.0:
      Successfully uninstalled gradio-5.50.0
Successfully installed gradio-6.1.0 gradio-client-2.0.1


In [8]:
import gradio as gr
import os
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
import torch
import numpy as np

# --- Global Variables for Models and Indexed Data ---
# Populated once by initialize_rag_system() and reused for every question
embedding_tokenizer = None
embedding_model = None
llm_tokenizer = None
llm_model = None
faiss_index = None
all_chunks_mapping = None  # list of (pdf_name, list_of_chunks) pairs

# read_pdfs, chunk_text, create_embeddings, index_embeddings, and answer_question come from the RAG Implementation cell above.

# Function to initialize models and preprocess PDFs
def initialize_rag_system(folder_path, embedding_model_name, llm_model_name):
    global embedding_tokenizer, embedding_model, llm_tokenizer, llm_model, faiss_index, all_chunks_mapping

    print("Initializing RAG system...")

    # Load embedding model
    embedding_tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
    embedding_model = AutoModel.from_pretrained(embedding_model_name)

    # Load LLM
    llm_tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
    llm_model = AutoModelForSeq2SeqLM.from_pretrained(llm_model_name)

    # Read and chunk PDFs
    pdf_texts = read_pdfs(folder_path)
    all_chunks = []
    all_chunks_mapping = [] # This will hold the (pdf_name, list_of_chunks) for retrieval

    for pdf_name, text in pdf_texts:
        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        all_chunks_mapping.append((pdf_name, chunks))

    # Create and index embeddings
    # Only create embeddings once for all chunks
    embeddings = create_embeddings(all_chunks, embedding_tokenizer, embedding_model)
    faiss_index = index_embeddings(embeddings)

    print("RAG system initialized.")

# Gradio Interface Functions
def process_pdfs_and_answer_question(folder_path, question, embedding_model_name, llm_model_name, temperature, max_new_tokens):
    global embedding_tokenizer, embedding_model, llm_tokenizer, llm_model, faiss_index, all_chunks_mapping

    # Check if system is initialized, if not, initialize it
    if embedding_tokenizer is None or embedding_model is None or llm_tokenizer is None or llm_model is None or faiss_index is None or all_chunks_mapping is None:
        initialize_rag_system(folder_path, embedding_model_name, llm_model_name)

    # Answer question using the globally loaded models and indexed data
    # The temperature and max_new_tokens parameters are passed to the answer_question function
    answer = answer_question(
        question,
        all_chunks_mapping, # Use the globally preprocessed chunks
        faiss_index,
        None, # Embeddings are not needed directly here, only the index is
        embedding_tokenizer,
        embedding_model,
        llm_tokenizer,
        llm_model,
        temperature=temperature,
        max_new_tokens=max_new_tokens
    )
    return answer

# Create Gradio Interface
# Note: The folder_path, embedding_model_name, and llm_model_name inputs will still be present
# but the initialization will only happen once upon the first call or when the app starts if called before launch.
iface = gr.Interface(
    fn = process_pdfs_and_answer_question,
    inputs = [
        gr.Textbox(label = "Path", value = "/content/drive/My Drive/PDFs/", interactive = False), # Made non-interactive so user doesn't modify it
        gr.Textbox(lines = 2, placeholder = "Enter your question here...", label = "Question"),
        gr.Textbox(label = "Embedding Model Name", value = 'sentence-transformers/all-mpnet-base-v2', interactive = False), # Made non-interactive
        gr.Textbox(label = "LLM Model Name", value = 'microsoft/GODEL-v1_1-large-seq2seq', interactive = False), # Made non-interactive
        gr.Slider(minimum = 0.0, maximum = 1.0, value = 0.5, label = "Temperature"),
        # Temperature controls the randomness of the output. It rescales the probability distribution over predicted tokens,
        # which influences the diversity and creativity of the generated text.
        # A low temperature is useful when you want the model to generate precise and factual text.
        # A high temperature leads to more creative and varied responses, but is more prone to errors.
        gr.Slider(minimum = 1, maximum = 256, value = 150, label = "Max Tokens")
        # Max tokens caps the number of new tokens the LLM may generate, which bounds response length, latency, and compute.
        # A higher limit is needed for detailed responses and complex tasks.
        # A lower limit suits cases where speed is critical, short and specific answers are sufficient, or compute must be conserved.
    ],
    outputs = [
        gr.Textbox(lines = 10, label = "Answer")
    ],
    title = "PDF Question Answering System",
    description = "NOTE: Make sure to upload a PDF file into the specified folder path shown below. \n Then, ask a question to get answers based on the content of the PDF file."
)

# Call initialization function once before launching Gradio, or ensure it's called on first invocation.
# For a Colab environment, it might be better to call it explicitly here if you know the path upfront.
initial_folder_path = "/content/drive/My Drive/PDFs/"
initial_embedding_model = "sentence-transformers/all-mpnet-base-v2"
# Responses were less accurate with sentence-transformers/all-MiniLM-L6-v2, so I opted for
# sentence-transformers/all-mpnet-base-v2, which gave more accurate responses.
initial_llm_model = 'microsoft/GODEL-v1_1-large-seq2seq'
# microsoft/GODEL-v1_1-large-seq2seq is used as the LLM (Large Language Model) that takes the user's question and the
# retrieved chunks of text (context) and synthesizes a coherent and relevant answer based on that information.
# Its grounding capabilities are particularly beneficial for ensuring that the answers provided are directly supported by
# your PDF documents.

# This call ensures models and PDFs are preloaded before the Gradio interface is even interacted with.
# It assumes the folder_path, embedding_model_name, and llm_model_name are fixed once Gradio is launched.
initialize_rag_system(initial_folder_path, initial_embedding_model, initial_llm_model)

iface.launch(share = True)

Initializing RAG system...
RAG system initialized.
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://63b0628112b9c28983.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
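
The temperature comments in the interface cell describe how the setting reshapes the token distribution. As a rough illustration (not the internals of `generate`), the sketch below divides a set of made-up logits by the temperature before applying softmax. The logit values and the helper name `softmax_with_temperature` are assumptions for demonstration only.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Temperature must be strictly positive; smaller values sharpen the distribution.
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
low_t = softmax_with_temperature(toy_logits, 0.2)
high_t = softmax_with_temperature(toy_logits, 1.0)
print([round(p, 3) for p in low_t])
print([round(p, 3) for p in high_t])
```

At the lower temperature almost all of the probability mass sits on the top-scoring token, while at 1.0 the alternatives keep a meaningful share. In Hugging Face `generate`, temperature only influences the output when sampling is enabled (`do_sample=True`).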


