<a href="https://colab.research.google.com/github/amoakoh22/rag-chat-with-multiple-pdfs/blob/main/rag_chat_with_multiple_pdfs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Environment Setup

In [1]:
!pip install langchain chromadb gradio PyPDF2 sentence-transformers


This setup installs all components of our retrieval-augmented generation (RAG) pipeline at no cost, since every package is open source.
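
Before moving on, it can help to confirm that the installation succeeded. The following is a minimal sketch that uses only the standard library. The helper name `report_versions` is an illustrative choice rather than part of the notebook, and the package list simply mirrors the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from the pip install command above.
REQUIRED = ["langchain", "chromadb", "gradio", "PyPDF2", "sentence-transformers"]

def report_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in report_versions(REQUIRED).items():
    print(f"{name}: {ver if ver else 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` means the install cell should be re-run before continuing.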




# 2. PDF Upload and Processing

In [1]:
import gradio as gr
import PyPDF2

# Global variable to store aggregated text
all_text = ""

def extract_pdf_text(files):
    global all_text
    all_text = ""  # Reset for fresh extraction

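    # Gradio passes None when nothing is uploaded, so reject that case early,
    # then make sure that files is always a list.
    if not files:
        return "Error: Please upload at least one PDF file."
    if not isinstance(files, list):
        files = [files]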

    MAX_FILES = 5
    if len(files) > MAX_FILES:
        return f"Error: Please upload no more than {MAX_FILES} files."

    for file_path in files:
        try:
            with open(file_path, "rb") as f:
                pdf_reader = PyPDF2.PdfReader(f)
                text = ""
                for page in pdf_reader.pages:
                    extracted = page.extract_text()
                    if extracted:
                        text += extracted + "\n"
                all_text += f"--- Text from {file_path} ---\n" + text + "\n"
        except Exception as e:
            return f"Error processing file {file_path}: {str(e)}"

    preview = all_text[:500] + "..." if len(all_text) > 500 else all_text
    return f"Processed {len(files)} file(s) successfully. Preview of extracted text:\n\n{preview}"

iface_extract = gr.Interface(
    fn=extract_pdf_text,
    inputs=gr.File(label="Upload PDF files", type="filepath", file_count="multiple"),
    outputs="text",
    title="PDF Text Extraction Interface",
    description="Upload multiple PDFs (maximum allowed files is 5) to extract and store their text content."
)

iface_extract.launch()






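In this code, we open each uploaded PDF from its temporary file path, extract its text page by page, and concatenate the results into a single string (`all_text`) for subsequent processing. Pages that yield no text (for example, scanned images) are skipped.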
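
The aggregation logic can also be written without a global variable, which makes it easier to test. The sketch below is an illustration, not the notebook's code: it assumes the per-page text has already been extracted (so it does not need PyPDF2), and the names `join_page_texts` and `aggregate` are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Concatenate per-page text, skipping pages that yielded nothing."""
    return "".join(text + "\n" for text in page_texts if text)

def aggregate(documents: Dict[str, List[Optional[str]]]) -> str:
    """Build one string with a header line per document, as the cell above does."""
    parts = []
    for name, pages in documents.items():
        parts.append(f"--- Text from {name} ---\n{join_page_texts(pages)}\n")
    return "".join(parts)

# Example: the second page produced no text (e.g. a scanned image).
sample = {"a.pdf": ["Hello", None, "World"]}
print(aggregate(sample))
```

Returning the string instead of mutating `all_text` means a failed upload cannot leave partially extracted text behind.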

# 3. Chunking

In [2]:
from langchain.text_splitter import CharacterTextSplitter

# Ensure that all_text is defined and non-empty
if not all_text.strip():
    print("Error: No text available for chunking. Please run the PDF extraction step first.")
else:
    # Define chunking parameters
    chunk_size = 500       # Maximum characters per chunk
    chunk_overlap = 100    # Overlap between consecutive chunks to preserve context

    # Initialize the text splitter
    text_splitter = CharacterTextSplitter(separator="\n", chunk_size=chunk_size, chunk_overlap=chunk_overlap)

    # Split the aggregated text into chunks
    chunks = text_splitter.split_text(all_text)

    print(f"Successfully split text into {len(chunks)} chunks.")
    # Display a preview of the first chunk
    print("Preview of the first chunk:")
    print(chunks[0])


Successfully split text into 1074 chunks.
Preview of the first chunk:
--- Text from /tmp/gradio/80c42d7531ac3cc2a252cde9ad7b3ef60255e17acb77f2981c71847b8b187ddc/Otawa UNI master-computer-science-specialization-artificialintelligence- Personal Project.pdf ---
This is a copy of the 2024-2025 catalog.
MASTER OF COMPUTER
SCIENCE AND
CONCENTRATION APPLIED
ARTIFICIAL INTELLIGENCE
Summary
•Degree offered: Master of Computer Science (MCS)
•Registration status options: Full-time; Part-time
•Language of instruction: English
•Primary program: Computer Science
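
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification for illustration only: `CharacterTextSplitter` first splits on the separator and then merges the pieces, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With a window of 4 and an overlap of 1, each chunk starts with the last character of the previous one, which is how overlap preserves context across boundaries.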


# 4. Creating Embeddings

In [3]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained Sentence Transformer model.
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# 'chunks' is the list of text chunks generated in the previous step.
# Encode all chunks in one batched call, which is faster than encoding them one at a time.
embeddings = embedding_model.encode(chunks)

# Output the number of embeddings generated for confirmation.
print(f"Generated embeddings for {len(embeddings)} text chunks.")



Generated embeddings for 1074 text chunks.
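
Each embedding is a fixed-length vector (384 dimensions for `all-MiniLM-L6-v2`), and texts with similar meaning should map to vectors that point in similar directions. A common way to measure this is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings would come from `embedding_model.encode`.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not real model output).
query = [0.9, 0.1, 0.0]
chunk_about_same_topic = [0.8, 0.2, 0.1]
chunk_about_other_topic = [0.0, 0.1, 0.9]

print(cosine_similarity(query, chunk_about_same_topic))
print(cosine_similarity(query, chunk_about_other_topic))
```

The first score should be much closer to 1 than the second, which is the signal retrieval relies on.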


# 5. Storing Embeddings in a Vector Database

In [4]:
import chromadb

# Initialize a ChromaDB client using the updated API (in-memory).
client = chromadb.Client()

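# Create (or recreate) a collection for storing PDF text chunks and their embeddings.
# Deleting any existing collection first lets this cell be re-run without a name clash.
try:
    client.delete_collection(name="pdf_chunks")
except Exception:
    pass  # The collection did not exist yet.
collection = client.create_collection(name="pdf_chunks")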

# Add each text chunk and its corresponding embedding to the collection.
for i, chunk in enumerate(chunks):
    collection.add(
        documents=[chunk],
        embeddings=[embeddings[i]],
        ids=[f"chunk_{i}"]
    )

print("Successfully stored embeddings in the vector database.")


Successfully stored embeddings in the vector database.
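
Conceptually, a vector database stores (id, document, embedding) triples and returns the entries whose embeddings lie closest to a query vector. The class below is a toy, brute-force sketch of that idea, assuming squared Euclidean distance as the metric. `TinyVectorStore` is a hypothetical name, and this is not how Chroma is implemented: a real vector database uses an index instead of scanning every entry.

```python
from typing import Dict, List, Sequence

class TinyVectorStore:
    """Illustrative in-memory store with exhaustive nearest-neighbour search."""

    def __init__(self) -> None:
        self._ids: List[str] = []
        self._docs: List[str] = []
        self._vecs: List[Sequence[float]] = []

    def add(self, ids: List[str], documents: List[str], embeddings: List[Sequence[float]]) -> None:
        if not (len(ids) == len(documents) == len(embeddings)):
            raise ValueError("ids, documents and embeddings must have the same length")
        self._ids.extend(ids)
        self._docs.extend(documents)
        self._vecs.extend(embeddings)

    def query(self, query_embedding: Sequence[float], n_results: int = 3) -> Dict[str, list]:
        def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
            return sum((x - y) ** 2 for x, y in zip(a, b))

        distances = [squared_l2(query_embedding, vec) for vec in self._vecs]
        # Rank every stored entry by distance, smallest first.
        ranked = sorted(range(len(self._ids)), key=lambda i: distances[i])[:n_results]
        return {
            "ids": [self._ids[i] for i in ranked],
            "documents": [self._docs[i] for i in ranked],
            "distances": [distances[i] for i in ranked],
        }

store = TinyVectorStore()
store.add(["a", "b", "c"], ["doc a", "doc b", "doc c"], [[0, 0], [1, 0], [5, 5]])
print(store.query([0.9, 0.0], n_results=2)["ids"])
```

The exhaustive scan costs time proportional to the number of stored chunks per query, which is fine for a demo but is the reason real systems rely on an index.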


# 6. Chat with PDFs Using a Basic Pretrained LLM and User-Uploaded Files

In [8]:
from transformers import pipeline

# Initialize a Hugging Face pipeline for text-to-text generation.
# Here, we use "google/flan-t5-small" as a lightweight model for demonstration.
qa_pipeline = pipeline("text2text-generation", model="google/flan-t5-small", tokenizer="google/flan-t5-small")

def answer_question(query: str) -> str:
    """
    Given a user query, this function performs the following steps:
    1. Computes the embedding for the query using the same SentenceTransformer.
    2. Retrieves the top 3 most similar text chunks from the Chroma collection.
    3. Constructs a prompt combining the retrieved context with the user query.
    4. Generates and returns an answer using the text generation pipeline.
    """
    # Step 1: Compute the query embedding.
    query_embedding = embedding_model.encode(query)

    # Step 2: Retrieve the top 3 most relevant text chunks.
    results = collection.query(query_embeddings=[query_embedding], n_results=3)

    # The results dictionary typically includes a "documents" key.
    # We combine the retrieved documents into a single context string.
    retrieved_context = " ".join(results["documents"][0])

    # Step 3: Construct the prompt by merging context and query.
    prompt = (
        f"Based on the context provided below, answer the following question:\n\n"
        f"Context: {retrieved_context}\n\n"
        f"Question: {query}\n\n"
        f"Answer:"
    )

    # Step 4: Generate the answer using the language model.
    answer = qa_pipeline(prompt, max_length=150)[0]['generated_text']
    return answer

# Testing the QA pipeline with a sample query.
test_query = "Who is Ottawa?"
print("Answer:", answer_question(test_query))


Device set to use cpu


Answer: University of Ottawa
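
Small sequence-to-sequence models accept only a limited amount of input, so it can be worth capping how much retrieved context goes into the prompt. The sketch below isolates the prompt-construction step and adds a character budget. The budget is a crude stand-in for a token limit, and both the `build_prompt` name and the default of 1500 characters are assumptions made for illustration.

```python
from typing import List

def build_prompt(query: str, retrieved_chunks: List[str], max_context_chars: int = 1500) -> str:
    """Join retrieved chunks (most relevant first) until the character budget is used up."""
    context_parts = []
    used = 0
    for chunk in retrieved_chunks:
        remaining = max_context_chars - used
        if remaining <= 0:
            break
        piece = chunk[:remaining]        # truncate the chunk that crosses the budget
        context_parts.append(piece)
        used += len(piece) + 1           # +1 accounts for the joining space
    context = " ".join(context_parts)
    return (
        "Based on the context provided below, answer the following question:\n\n"
        f"Context: {context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )

print(build_prompt("Which degree is offered?", ["Degree offered: Master of Computer Science (MCS)"]))
```

Because the chunks arrive ordered by relevance, truncating from the end drops the least relevant text first.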


In [12]:
# Step 0: Install the required packages and NLTK tokenizer data.
# Recent NLTK releases need the 'punkt_tab' resource for word_tokenize;
# downloading 'punkt' alone leads to a LookupError.
!pip install nltk rouge_score
!python -m nltk.downloader punkt
!python -m nltk.downloader punkt_tab


# Step 1: Import required libraries.
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# Step 2: Define a function to compute the BLEU score.
def compute_bleu(reference: str, candidate: str) -> float:
    """
    Compute the BLEU score between a reference answer and a candidate (generated) answer.
    Uses NLTK's sentence_bleu with smoothing.
    """
    # Tokenize the input sentences.
    ref_tokens = nltk.word_tokenize(reference)
    cand_tokens = nltk.word_tokenize(candidate)

    # BLEU expects a list of reference token lists.
    smooth_fn = SmoothingFunction().method1
    bleu = sentence_bleu([ref_tokens], cand_tokens, smoothing_function=smooth_fn)
    return bleu

# Step 3: Define a function to compute the ROUGE-L F1 score.
def compute_rouge(reference: str, candidate: str) -> float:
    """
    Compute the ROUGE-L F1 score between a reference answer and a candidate answer.
    Uses the rouge_scorer package.
    """
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    score = scorer.score(reference, candidate)
    return score['rougeL'].fmeasure

# Step 4: Define an evaluation function for a list of generated and reference answers.
def evaluate_generation(generated_answers: list, reference_answers: list) -> dict:
    """
    Evaluate generation quality over a set of generated and reference answers.
    Returns average BLEU and ROUGE-L F1 scores.
    """
    assert len(generated_answers) == len(reference_answers), "The number of generated answers must match the number of reference answers."

    bleu_scores = []
    rouge_scores = []

    # Compute scores for each pair.
    for gen, ref in zip(generated_answers, reference_answers):
        bleu_scores.append(compute_bleu(ref, gen))
        rouge_scores.append(compute_rouge(ref, gen))

    avg_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0
    avg_rouge = sum(rouge_scores) / len(rouge_scores) if rouge_scores else 0

    return {"avg_bleu": avg_bleu, "avg_rouge": avg_rouge}

# Step 5: Example usage with sample data.
if __name__ == "__main__":
    # Sample generated answers and corresponding reference answers.
    generated_answers = [
        "Ottawa is the capital of Canada, known for its government institutions.",
        "The document discusses climate change and renewable energy sources."
    ]

    reference_answers = [
        "Ottawa is the capital city of Canada and hosts many government offices.",
        "This document covers topics related to climate change and renewable energy."
    ]

    # Evaluate the generation.
    evaluation_results = evaluate_generation(generated_answers, reference_answers)
    print("Generation Evaluation Metrics:")
    print(f"Average BLEU Score: {evaluation_results['avg_bleu']:.3f}")
    print(f"Average ROUGE-L F1 Score: {evaluation_results['avg_rouge']:.3f}")

Generation Evaluation Metrics:
Average BLEU Score: 0.298
Average ROUGE-L F1 Score: 0.604
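
To see what ROUGE-L measures, it helps to compute it by hand. The metric is based on the longest common subsequence (LCS) of tokens shared by the reference and the candidate: precision is LCS length over candidate length, recall is LCS length over reference length, and the F1 score is their harmonic mean. The sketch below is a simplified illustration that lowercases and splits on whitespace. The `rouge_score` package applies its own tokenization and optional stemming, so its numbers may differ.

```python
from typing import List

def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L F1 over lowercase, whitespace-separated tokens."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(f"{rouge_l_f1('the cat sat on the mat', 'the cat is on the mat'):.3f}")
```

In this example the LCS is "the cat on the mat" (5 tokens) and both sentences have 6 tokens, so precision, recall, and F1 all equal 5/6.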
