# PDF RAG Chatbot Advanced Model
This notebook implements a Retrieval-Augmented Generation (RAG) approach to query PDFs. It extracts text (both structured and raw), splits the text into meaningful chunks, creates vector embeddings using a Sentence Transformer, builds a FAISS index for fast search, applies BM25 filtering, and finally generates answers using OpenAI's ChatCompletion API.



In [1]:
# Install required packages and download the spaCy model.
!pip install -r advanced_rag_requirements.txt
!python -m spacy download en_core_web_sm


Collecting en_core_web_sm@ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl#sha256=1932429db727d4bff3deed6b34cfc05df17794f4a52eeb26cf8928f7c1a0fb85 (from -r advanced_rag_requirements.txt (line 53))
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     --------------------------------------- 0.1/12.8 MB 299.4 kB/s eta 0:00:43
     --------------------------------------- 0.1/12.8 MB 438.1 kB/s eta 0:00:30
     --------------------------------------- 0.1/12.8 MB 438.1 kB/s eta 0:00:30
     --------------------------------------- 0.1/12.8 MB 438.

ERROR: Could not find a version that satisfies the requirement react-icons (from versions: none)
ERROR: No matching distribution found for react-icons

[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')



[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


## Imports and Environment Setup
This cell imports all necessary libraries, configures logging, and loads environment variables (like the OpenAI API key). It also initializes the spaCy NLP model and the Sentence Transformer model.


In [4]:
import os
import re
import time
import logging
import fitz  # PyMuPDF for raw PDF text extraction
import spacy  # For NLP tasks (tokenization, sentence segmentation)
import numpy as np
import faiss  # For fast vector search
from sentence_transformers import SentenceTransformer  # To convert text into embeddings
import openai  # For generating answers using OpenAI's API
from dotenv import load_dotenv  # To load environment variables
from unstructured.partition.pdf import partition_pdf  # For structured PDF extraction
from rank_bm25 import BM25Okapi  # For keyword-based ranking (BM25)

# Configure logging to record performance and errors
logging.basicConfig(
    filename="benchmark.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

# Load environment variables from .env file
load_dotenv()
print("Loaded API Key:", os.getenv("OPENAI_API_KEY"))
openai.api_key = os.getenv("OPENAI_API_KEY")

# Load the spaCy model and set max_length for long documents
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2000000

# Load the Sentence Transformer model for creating embeddings
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

# Create an in-memory dictionary (if needed later)
documents = {}


Loaded API Key: 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


## PDF Text Extraction Functions
This cell defines functions to extract text from PDFs. The function `extract_structured_content` uses Unstructured to partition the PDF into structured elements, while `extract_text_from_pdf` is a fallback that extracts raw text using PyMuPDF.


In [5]:
def extract_text_from_pdf(file_path: str) -> str:
    """
    Fallback extraction using PyMuPDF.
    Opens the PDF and extracts text from all pages.
    """
    doc = fitz.open(file_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

def extract_structured_content(file_path: str):
    """
    Uses Unstructured to partition a PDF into structured elements.
    Returns a list of dictionaries with keys "type" and "text".
    """
    elements = partition_pdf(filename=file_path)
    structured_data = []
    for element in elements:
        structured_data.append({
            "type": element.type,  # e.g., "Title", "Heading", "Text"
            "text": element.text.strip(),
        })
    return structured_data


## Section Tagging Function
The following function groups extracted elements into technical sections (like Abstract, Introduction, etc.) using regex. This helps organize the document content.


In [6]:
def tag_sections_technical(structured_elements):
    """
    Groups extracted elements into sections for technical papers.
    Uses regex to capture common section headers.
    """
    section_pattern = re.compile(
        r"(Abstract|Introduction|Related Work|Background|Methodology|Approach|Experiments|Results|Discussion|Conclusion|Encoding|CLIP|Text Encoder|Embedding)",
        re.IGNORECASE
    )
    tagged_sections = {}
    current_section = None

    for element in structured_elements:
        element_type = element.get("type", "").lower()
        text = element.get("text", "")
        if element_type in ["heading", "title"] or section_pattern.search(text):
            match = section_pattern.search(text)
            new_section = match.group(0).strip() if match else text.strip()
            current_section = new_section
            if current_section not in tagged_sections:
                tagged_sections[current_section] = []
        elif current_section:
            tagged_sections[current_section].append(text)
        else:
            tagged_sections.setdefault("Body", []).append(text)
    
    for section in tagged_sections:
        tagged_sections[section] = "\n".join(tagged_sections[section]).strip()
    return tagged_sections


## Robust Text Extraction Function
This function attempts structured extraction and section tagging first; if that fails, it falls back to raw extraction.


In [7]:
def robust_extract_text(file_path: str) -> (str, dict):
    """
    Extracts text via structured partitioning and tags technical sections.
    Falls back to basic extraction if necessary.
    Returns a tuple (combined_text, tagged_sections).
    """
    try:
        structured_elements = extract_structured_content(file_path)
        tagged_sections = tag_sections_technical(structured_elements)
        combined_text = "\n\n".join([f"{section}: {content}" for section, content in tagged_sections.items()])
        if combined_text.strip():
            return combined_text, tagged_sections
        else:
            raise Exception("No structured content extracted.")
    except Exception as e:
        logging.info("Structured extraction failed; using fallback extraction. Error: " + str(e))
        fallback_text = extract_text_from_pdf(file_path)
        return fallback_text, {}


## Adaptive Chunking Function
This function splits the extracted text into semantically coherent chunks using sentence segmentation and linguistic cues (transition words). It dynamically determines the chunk size based on the average token count.


In [8]:
def adaptive_chunk_text_dynamic(text: str, min_threshold: int = None, factor: float = 1.5, transition_words=None) -> list:
    """
    Splits text into semantically coherent chunks using a dynamic token threshold.
    Uses linguistic cues (transition words) to determine natural boundaries.
    """
    import re
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize whitespace
    
    # Use spaCy for sentence segmentation
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    
    token_counts = [len(sent.split()) for sent in sentences]
    if not token_counts:
        return [text]
    avg_tokens = sum(token_counts) / len(token_counts)
    
    if min_threshold is None:
        min_threshold = int(avg_tokens)
    threshold = int(max(min_threshold, avg_tokens * factor))
    
    if transition_words is None:
        transition_words = ["however", "moreover", "furthermore", "in conclusion", "finally", "additionally"]
    
    chunks = []
    current_chunk = ""
    current_token_count = 0
    
    for sent in sentences:
        sent_tokens = len(sent.split())
        if current_chunk:
            sent_lower = sent.lower()
            starts_with_transition = any(sent_lower.startswith(word) for word in transition_words)
        else:
            starts_with_transition = False
        
        if (current_token_count + sent_tokens > threshold) or (starts_with_transition and current_token_count > int(threshold * 0.7)):
            chunks.append(current_chunk.strip())
            current_chunk = sent
            current_token_count = sent_tokens
        else:
            current_chunk += " " + sent
            current_token_count += sent_tokens
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks


## Embedding and Indexing Functions
These functions:
- Compute embeddings for each chunk.
- Build a FAISS HNSW index for fast vector search.
- Search the index for candidate chunks based on a query.
- Optionally filter results using BM25.


In [9]:
def get_embeddings(chunks: list) -> np.ndarray:
    """
    Computes embeddings for each text chunk using the Sentence Transformer.
    """
    embeddings = embed_model.encode(chunks, convert_to_numpy=True)
    return embeddings.astype("float32")

def build_hnsw_index(embeddings: np.ndarray, M: int = 32, efConstruction: int = 40):
    """
    Builds a FAISS HNSW index from the computed embeddings.
    """
    d = embeddings.shape[1]
    index = faiss.IndexHNSWFlat(d, M)
    index.hnsw.efConstruction = efConstruction
    index.add(embeddings)
    return index

def search_index(query: str, index, chunks: list, k: int = 5) -> list:
    """
    Uses FAISS vector search to retrieve the top-k candidate chunks.
    Logs the search duration.
    """
    start_time = time.time()
    query_embedding = embed_model.encode([query], convert_to_numpy=True).astype("float32")
    distances, indices = index.search(query_embedding, k)
    results = [chunks[i] for i in indices[0] if i < len(chunks)]
    search_duration = time.time() - start_time
    logging.info(f"HNSW Search Time: {search_duration:.4f} seconds")
    return results

def bm25_filter(query, candidate_chunks, threshold=1.0):
    """
    Filters candidate chunks using BM25.
    Returns the top candidate if its score meets the threshold.
    """
    if not candidate_chunks:
        print("BM25 received an empty candidate chunk list.")
        return []
    
    tokenized_corpus = [doc.lower().split() for doc in candidate_chunks if doc.strip()]
    if not tokenized_corpus:
        print("BM25 corpus is empty after processing. Returning no results.")
        return []
    
    bm25 = BM25Okapi(tokenized_corpus)
    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)
    
    best_idx = np.argmax(scores)
    if scores[best_idx] >= threshold:
        return [candidate_chunks[best_idx]]
    return []


## Answer Generation Function
This function creates a prompt by combining the retrieved context with the user's query, then uses the OpenAI ChatCompletion API to generate an answer.


In [10]:
def generate_answer(query: str, context: str) -> str:
    """
    Generates an answer using OpenAI's ChatCompletion API.
    """
    prompt = f"""You are given the following context extracted from a legal document:
    
{context}

Based on the above context, answer the following question:
{query}

Answer:"""
    try:
        response = openai.ChatCompletion.create(
            model="gpt-4o-mini",  # or "gpt-3.5-turbo" as appropriate
            messages=[{"role": "user", "content": prompt}],
            max_tokens=150,
        )
        answer = response.choices[0].message.content.strip()
    except Exception as e:
        answer = f"Error generating answer: {e}"
    return answer


## Example Execution
The following cells demonstrate processing a sample PDF:
1. Extract text and tag sections.
2. Chunk the text adaptively.
3. Compute embeddings and build a FAISS index.
4. Retrieve candidate chunks for a sample query.
5. Optionally filter using BM25.
6. Generate and display the answer.

In [15]:
# Define the path to a sample PDF (adjust as needed)
test_pdf_path = r"C:\Users\amaan\Desktop\RAG Chatbot\data\poa.pdf"

# Step 1: Extract and tag text from the PDF.
extracted_text, tagged_sections = robust_extract_text(test_pdf_path)
print("Tagged Sections:")
for section, content in tagged_sections.items():
    print(f"--- {section} ---\n{content[:300]}...\n")  # Print first 300 characters for brevity

# Step 2: Adaptive Chunking.
chunks = adaptive_chunk_text_dynamic(extracted_text)
print(f"Total Chunks: {len(chunks)}")
print("Sample Chunk (first 300 characters):")
print(chunks[0][:300], "...\n")

# Step 3: Generate embeddings and build the FAISS index.
embeddings = get_embeddings(chunks)
index = build_hnsw_index(embeddings)

# Step 4: Retrieve candidate chunks with similarity scores.
query = "What is the permanent address of the property?"
print(f"Retrieving candidate chunks for query: '{query}'")
start_time = time.time()
query_embedding = embed_model.encode([query], convert_to_numpy=True).astype("float32")
distances, indices = index.search(query_embedding, k=4)
search_duration = time.time() - start_time
print(f"HNSW Search Time: {search_duration:.4f} seconds\n")

candidate_chunks = []
print("Candidate Chunks and Similarity Scores:")
for i, (idx, score) in enumerate(zip(indices[0], distances[0])):
    if idx < len(chunks):
         candidate_chunks.append(chunks[idx])
         print(f"Chunk {i+1}:")
         print(f"Similarity Score: {score}")
         print(f"Text (first 300 characters): {chunks[idx][:300]}...\n")

# Step 5: Optionally filter candidate chunks using BM25.
filtered_chunks = bm25_filter(query, candidate_chunks, threshold=1.0)
if filtered_chunks:
    final_context = filtered_chunks[0]  # Use the top BM25 candidate.
    print("BM25 Filtered Candidate Found.\n")
else:
    final_context = "\n\n".join(candidate_chunks)
    print("No BM25 candidates passed the threshold. Using all candidate chunks.\n")

print("Final Context for LLM Prompt:")
print(final_context, "\n")

# Step 6: Generate an answer using the final context.
answer = generate_answer(query, final_context)
print("Generated Answer:")
print(answer)


Tagged Sections:
Total Chunks: 33
Sample Chunk (first 300 characters):
1 POWER OF ATTORNEY TO ALL TO WHOM THESE PRESENTS SHALL COME I, MR. ISMAIL MOHIDEEN ALI HASHIM MOHAMMED ...



Batches: 100%|██████████| 2/2 [00:01<00:00,  1.70it/s]


Retrieving candidate chunks for query: 'What is the permanent address of the property?'


Batches: 100%|██████████| 1/1 [00:00<00:00, 63.53it/s]


HNSW Search Time: 0.0237 seconds

Candidate Chunks and Similarity Scores:
Chunk 1:
Similarity Score: 0.978529691696167
Text (first 300 characters): To receive and recover the vacant possession of the premises in said property at the time the respective tenant, lessee, licensee, occupants thereof surrenders the same relinquishing all his/her/their rights, title claims and interest thereto and then either to retain the vacant possession of the pr...

Chunk 2:
Similarity Score: 0.9805436134338379
Text (first 300 characters): as the case may be in respect of the said property, as my said attorney think fit and proper. 7....

Chunk 3:
Similarity Score: 1.0202440023422241
Text (first 300 characters): To answer any communications, letters, documents, notices etc. which relate to the said property and carry on and conduct all the correspondence that may be necessary and discharge all lawful claims obligations and liabilities arising from any contract. Statutory laws, rules and regulations etc.