# Gemini RAG Chatbot Advanced Model
This notebook implements a Retrieval-Augmented Generation (RAG) approach to query PDFs using Google's Gemini API. It extracts text (both structured and raw), splits the text into meaningful chunks, creates vector embeddings using a Sentence Transformer, builds a FAISS index for fast search, applies BM25 filtering, and finally generates answers using Gemini's API.

## Features
- Structured PDF text extraction with fallback to raw extraction
- Technical section tagging and organization
- Adaptive text chunking based on semantic coherence
- Vector search using FAISS HNSW index
- BM25 filtering for improved relevance
- Answer generation using Gemini 2.0 Flash

In [1]:
# Install required packages
!pip install -r gemini_requirements.txt
!python -m spacy download en_core_web_sm

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'gemini_rag_requirements.txt'

[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     --------------------------------------- 0.0/12.8 MB 100.9 kB/s eta 0:02:07
     --------------------------------------- 0.0/12.8 MB 131.3 kB/s eta 0:01:38
     --------------------------------------- 0.1/12.8 MB 262.6 kB/s eta 0:00:49
      -------------------------------------- 0.3/12.8 MB 884.2 kB/s eta 0:00:15
     --- ------------------------------------ 1.0/12.8 MB 2.5 MB/s eta 0:00:05
     ----- ---------------------------------- 1.6/12.8 MB 3.5 MB/


[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


## Imports and Environment Setup
This cell imports all necessary libraries, configures logging, and loads environment variables (like the Gemini API key). It also initializes the spaCy NLP model and the Sentence Transformer model.

In [36]:
import os
import re
import time
import logging
import pymupdf  # For raw PDF text extraction
import spacy  # For NLP tasks
import numpy as np
import faiss  # For fast vector search
from sentence_transformers import SentenceTransformer
import google.generativeai as genai
from dotenv import load_dotenv
from unstructured.partition.pdf import partition_pdf
from rank_bm25 import BM25Okapi

# Configure logging
logging.basicConfig(
    filename="gemini_benchmark.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

# Load environment variables
load_dotenv()
gemini_api_key = os.getenv("GEMINI_API_KEY")
genai.configure(api_key=gemini_api_key)

# Initialize models
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2000000
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

# Global storage for processed PDF data
pdf_data = {
    "filename": None,
    "raw_text": None,
    "tagged_sections": None,
    "chunks": None,
    "index": None,
}

## PDF Text Extraction Functions
These functions handle different aspects of PDF text extraction:
- `extract_text_from_pdf`: Basic text extraction using PyMuPDF
- `extract_structured_content`: Structured extraction using Unstructured
- `tag_sections_technical`: Section tagging for technical documents
- `robust_extract_text`: Combined approach with fallback

In [37]:
def extract_text_from_pdf(file_path: str) -> str:
    """Basic text extraction using PyMuPDF."""
    doc = pymupdf.open(file_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

def extract_structured_content(file_path: str):
    """Structured extraction using Unstructured."""
    elements = partition_pdf(filename=file_path)
    structured_data = []
    for element in elements:
        structured_data.append({
            "type": element.type,
            "text": element.text.strip(),
        })
    return structured_data

def tag_sections_technical(structured_elements):
    """Tags technical sections in the document."""
    section_pattern = re.compile(
        r"(Abstract|Introduction|Related Work|Background|Methodology|Approach|Experiments|Results|Discussion|Conclusion|Encoding|CLIP|Text Encoder|Embedding)",
        re.IGNORECASE
    )
    tagged_sections = {}
    current_section = None
    
    for element in structured_elements:
        element_type = element.get("type", "").lower()
        text = element.get("text", "")
        if element_type in ["heading", "title"] or section_pattern.search(text):
            match = section_pattern.search(text)
            new_section = match.group(0).strip() if match else text.strip()
            current_section = new_section
            if current_section not in tagged_sections:
                tagged_sections[current_section] = []
        elif current_section:
            tagged_sections[current_section].append(text)
        else:
            tagged_sections.setdefault("Body", []).append(text)
    
    for section in tagged_sections:
        tagged_sections[section] = "\n".join(tagged_sections[section]).strip()
    return tagged_sections

def robust_extract_text(file_path: str):
    """Combined extraction with fallback."""
    try:
        structured_elements = extract_structured_content(file_path)
        tagged_sections = tag_sections_technical(structured_elements)
        combined_text = "\n\n".join([f"{section}: {content}" for section, content in tagged_sections.items()])
        if combined_text.strip():
            return combined_text, tagged_sections
        else:
            raise Exception("No structured content extracted.")
    except Exception as e:
        logging.info("Structured extraction failed; using fallback extraction. Error: " + str(e))
        fallback_text = extract_text_from_pdf(file_path)
        return fallback_text, {}

## Text Processing Functions
These functions handle text chunking, embedding generation, and vector search:
- `adaptive_chunk_text_dynamic`: Semantic chunking with dynamic thresholds
- `get_embeddings`: Text embedding generation
- `build_hnsw_index`: FAISS index construction
- `search_index`: Vector similarity search

In [38]:
def adaptive_chunk_text_dynamic(text: str, min_threshold: int = None, factor: float = 1.5, transition_words=None):
    """Splits text into semantically coherent chunks."""
    text = re.sub(r'\s+', ' ', text).strip()
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    
    token_counts = [len(sent.split()) for sent in sentences]
    if not token_counts:
        return [text]
    avg_tokens = sum(token_counts) / len(token_counts)
    
    if min_threshold is None:
        min_threshold = int(avg_tokens)
    threshold = int(max(min_threshold, avg_tokens * factor))
    
    if transition_words is None:
        transition_words = ["however", "moreover", "furthermore", "in conclusion", "finally", "additionally"]
    
    chunks = []
    current_chunk = ""
    current_token_count = 0
    
    for sent in sentences:
        sent_tokens = len(sent.split())
        if current_chunk:
            starts_with_transition = any(sent.lower().startswith(word) for word in transition_words)
        else:
            starts_with_transition = False
        
        if (current_token_count + sent_tokens > threshold) or (starts_with_transition and current_token_count > int(threshold * 0.7)):
            chunks.append(current_chunk.strip())
            current_chunk = sent
            current_token_count = sent_tokens
        else:
            current_chunk += " " + sent
            current_token_count += sent_tokens
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

def get_embeddings(chunks: list) -> np.ndarray:
    """Generates embeddings for text chunks."""
    embeddings = embed_model.encode(chunks, convert_to_numpy=True)
    return embeddings.astype("float32")

def build_hnsw_index(embeddings: np.ndarray, M: int = 32, efConstruction: int = 40):
    """Builds a FAISS HNSW index."""
    d = embeddings.shape[1]
    index = faiss.IndexHNSWFlat(d, M)
    index.hnsw.efConstruction = efConstruction
    index.add(embeddings)
    return index

def search_index(query: str, index, chunks: list, k: int = 5) -> list:
    """Performs vector similarity search."""
    start_time = time.time()
    query_embedding = embed_model.encode([query], convert_to_numpy=True).astype("float32")
    distances, indices = index.search(query_embedding, k)
    results = [chunks[i] for i in indices[0] if i < len(chunks)]
    search_duration = time.time() - start_time
    logging.info(f"HNSW Search Time: {search_duration:.4f} seconds")
    return results

## Answer Generation Functions
These functions handle answer generation using Gemini and BM25 filtering:

In [46]:
def bm25_filter(query, candidate_chunks, threshold=1.0, top_n=2):
    """
    Use BM25 to select the top N most relevant chunks instead of just one.
    
    Args:
        query: User query
        candidate_chunks: List of candidate chunks from vector search
        threshold: Minimum score to consider
        top_n: Number of top chunks to return
        
    Returns:
        List of most relevant chunks
    """
    if not candidate_chunks:
        return []
    
    tokenized_corpus = [doc.lower().split() for doc in candidate_chunks if doc.strip()]
    if not tokenized_corpus:
        return []
    
    bm25 = BM25Okapi(tokenized_corpus)
    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)
    
    # Get indices sorted by score in descending order
    sorted_indices = np.argsort(scores)[::-1]
    
    # Select top N chunks that meet the threshold
    selected_chunks = [
        candidate_chunks[idx] for idx in sorted_indices[:top_n] 
        if scores[idx] >= threshold and idx < len(candidate_chunks)
    ]
    
    return selected_chunks

def generate_answer_with_gemini(query: str, context: str) -> str:
    """Generates a well-reasoned answer using Gemini API with enhanced prompt formatting."""
    
    prompt = f"""
    You are an advanced Retrieval-Augmented Generation (RAG) AI assistant specializing in legal document analysis. 
    Your task is to accurately answer the given query based on the provided legal context.

    **Context:** 
    {context}

    **Query:** 
    {query}

    **Instructions:**
    - Provide a **clear and direct answer first** (Yes/No with a brief explanation).
    - Then, **explain your reasoning** in structured points.
    - If the correct context is not provided,say so. 
    - Ensure proper formatting with **clear paragraph separation**.

    **Final Answer:** 
    """
    
    try:
        model = genai.GenerativeModel('gemini-1.5-pro')
        response = model.generate_content(prompt)
        answer = response.text.strip()
    except Exception as e:
        answer = f"Error generating answer: {e}"
    
    return answer

## PDF Processing Function
This function combines all the steps to process a PDF document:

In [44]:
def process_pdf_advanced(pdf_path):
    """Processes a PDF document through all steps."""
    extracted_text, tagged_sections = robust_extract_text(pdf_path)
    chunks = adaptive_chunk_text_dynamic(extracted_text)
    embeddings = get_embeddings(chunks)
    faiss_index = build_hnsw_index(embeddings)
    return {
         "raw_text": extracted_text,
         "tagged_sections": tagged_sections,
         "chunks": chunks,
         "index": faiss_index,
    }

## Example Usage
Below is an example of how to use the RAG system with a sample PDF:

In [47]:
# Define the path to a sample PDF (adjust as needed)
test_pdf_path = r"C:\Users\amaan\Downloads\Lease Agreement PDF.pdf"

# Step 1: Extract and tag text from the PDF.
extracted_text, tagged_sections = robust_extract_text(test_pdf_path)
print("Tagged Sections:")
for section, content in tagged_sections.items():
    print(f"--- {section} ---\n{content[:300]}...\n")  # Print first 300 characters for brevity

# Step 2: Adaptive Chunking.
chunks = adaptive_chunk_text_dynamic(extracted_text)
print(f"Total Chunks: {len(chunks)}")
print("Sample Chunk (first 300 characters):")
print(chunks[0][:300], "...\n")

# Step 3: Generate embeddings and build the FAISS index.
embeddings = get_embeddings(chunks)
index = build_hnsw_index(embeddings)

# Step 4: Retrieve candidate chunks with similarity scores.
query = "Will the landlord be liable for any damages to the tenant or his family?"
print(f"Retrieving candidate chunks for query: '{query}'")
start_time = time.time()
query_embedding = embed_model.encode([query], convert_to_numpy=True).astype("float32")
distances, indices = index.search(query_embedding, k=4)
search_duration = time.time() - start_time
print(f"HNSW Search Time: {search_duration:.4f} seconds\n")

candidate_chunks = []
print("Candidate Chunks and Similarity Scores:")
for i, (idx, score) in enumerate(zip(indices[0], distances[0])):
    if idx < len(chunks):
         candidate_chunks.append(chunks[idx])
         print(f"Chunk {i+1}:")
         print(f"Similarity Score: {score}")
         print(f"Text (first 300 characters): {chunks[idx][:300]}...\n")

# Step 5: Optionally filter candidate chunks using BM25.
filtered_chunks = bm25_filter(query, candidate_chunks, threshold=1.0)
if filtered_chunks:
    final_context = filtered_chunks[0]  # Use the top BM25 candidate.
    print("BM25 Filtered Candidate Found.\n")
else:
    final_context = "\n\n".join(candidate_chunks)
    print("No BM25 candidates passed the threshold. Using all candidate chunks.\n")

print("Final Context for LLM Prompt:")
print(final_context, "\n")

# Step 6: Generate an answer using the final context.
answer = generate_answer_with_gemini(query, final_context)
print("Generated Answer:")
print(answer)

Tagged Sections:
Total Chunks: 91
Sample Chunk (first 300 characters):
STANDARD LEASE AGREEMENT THIS LEASE AGREEMENT hereinafter known as the "Lease" is made and entered into this ____ ...



Batches: 100%|██████████| 3/3 [00:02<00:00,  1.16it/s]


Retrieving candidate chunks for query: 'Will the landlord be liable for any damages to the tenant or his family?'


Batches: 100%|██████████| 1/1 [00:00<00:00, 52.61it/s]


HNSW Search Time: 0.0300 seconds

Candidate Chunks and Similarity Scores:
Chunk 1:
Similarity Score: 0.613007128238678
Text (first 300 characters): If said damage was due to the negligence of the Tenant(s), the Tenant(s) shall be liable to the Landlord for all repairs and for the loss of income due to restoring the property back to a livable condition in addition to any other losses that can be proved by the Landlord....

Chunk 2:
Similarity Score: 0.7047845125198364
Text (first 300 characters): INDEMNIFICATION.​ Landlord shall not be liable for any injury to the tenant, tenant’s family, guests, or employees or to any person entering the Property and shall not be liable for any damage to the building in which the Property is located or to goods or equipment, or to the structure or equipment...

Chunk 3:
Similarity Score: 0.7810131907463074
Text (first 300 characters): DEFAULT.​ If Landlord breaches this Lease, Tenant may seek any relief provided by law....

Chunk 4:
Similarity Score: 0