# Simple PDF RAG System

A streamlined PDF Question-Answering system using:
- PyPDF2 for PDF text extraction
- Sentence Transformers for embeddings
- FAISS for vector search
- Gemini API for answer generation

## Requirements
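Place your PDF file in the same directory as this notebook, or note its full path for step 4. A Gemini API key is optional: without one, the system returns the retrieved text instead of a generated answer.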

## 1. Install Required Libraries

In [1]:
!pip install PyPDF2 sentence-transformers faiss-cpu google-generativeai numpy python-dotenv

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: PyPDF2, faiss-cpu
Successfully installed PyPDF2-3.0.1 faiss-cpu-1.12.0
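
Before moving on, it can help to confirm that every package the notebook imports is actually available. The following is a minimal sketch; the helper name `find_missing` and the `REQUIRED_MODULES` list are illustrative assumptions, not part of the original notebook. Note that import names differ from pip names (for example, `python-dotenv` is imported as `dotenv`).

```python
import importlib.util

# Import names (not pip names) of the packages this notebook relies on.
REQUIRED_MODULES = [
    "PyPDF2",
    "sentence_transformers",
    "faiss",
    "google.generativeai",
    "numpy",
    "dotenv",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            if importlib.util.find_spec(name) is None:
                missing.append(name)
        except (ImportError, ValueError):
            # Dotted names raise when the parent package itself is absent.
            missing.append(name)
    return missing


print("Missing modules:", find_missing(REQUIRED_MODULES) or "none")
```

Any name reported as missing can be installed with `pip` and the check rerun.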


## 2. Import Libraries and Create RAG Class

In [2]:
import PyPDF2
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
import google.generativeai as genai
from dotenv import load_dotenv
import os


class SimplePDFRAG:
    def __init__(self, gemini_api_key=None):
        print("Loading embedding model...")
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.chunks = []
        self.embeddings = None
        self.index = None
        self.gemini_api_key = gemini_api_key
        self.llm = None  # Replaced below only when an API key is provided

        if gemini_api_key:
            genai.configure(api_key=gemini_api_key)
            self.llm = genai.GenerativeModel('gemini-1.5-flash')
            print("Gemini API configured")
        else:
            print("No Gemini API key provided - will use basic responses")

    def load_pdf(self, pdf_path):
        """Extract text from PDF and create chunks"""
        print(f"Loading PDF: {pdf_path}")
        text = ""

        try:
            with open(pdf_path, 'rb') as file:
                reader = PyPDF2.PdfReader(file)
                print(f"PDF has {len(reader.pages)} pages")

                for page in reader.pages:
                    # Guard against pages that yield no extractable text
                    text += (page.extract_text() or "") + "\n"
        except FileNotFoundError:
            print(f"Error: File '{pdf_path}' not found!")
            return False
        except Exception as e:
            print(f"Error reading PDF: {e}")
            return False

        # Simple chunking - split by sentences
        sentences = text.replace('\n', ' ').split('.')

        # Create chunks of ~3 sentences each
        self.chunks = []
        for i in range(0, len(sentences), 3):
            chunk = '. '.join(sentences[i:i+3]).strip()
            if len(chunk) > 50:  # Only keep meaningful chunks
                self.chunks.append(chunk)

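        print(f"Created {len(self.chunks)} text chunks")

        # Stop early if nothing usable was extracted (e.g. a scanned, image-only PDF);
        # otherwise the embedding array below would have no second dimension
        if not self.chunks:
            print("Error: no text chunks could be created from this PDF.")
            return False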

        # Create embeddings
        print("Generating embeddings...")
        self.embeddings = self.model.encode(self.chunks)
        self.embeddings = np.array(self.embeddings).astype('float32')

        # Create FAISS index
        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)  # Inner product for similarity
        faiss.normalize_L2(self.embeddings)
        self.index.add(self.embeddings)

        print("PDF loaded and indexed successfully!")
        return True

    def search(self, query, k=3):
        """Search for relevant chunks"""
        if self.index is None:  # No PDF has been indexed yet
            return []

        # Encode query
        query_embedding = self.model.encode([query])
        query_embedding = np.array(query_embedding).astype('float32')
        faiss.normalize_L2(query_embedding)

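        # Search (never request more neighbors than there are chunks)
        k = min(k, len(self.chunks))
        scores, indices = self.index.search(query_embedding, k)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx < 0:  # FAISS pads missing results with -1
                continue
            results.append({
                'text': self.chunks[idx],
                'score': float(score)
            })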

        return results

    def answer(self, query):
        """Generate answer using retrieved chunks"""
        # Get relevant chunks
        relevant_chunks = self.search(query, k=3)

        if not relevant_chunks:
            return "No relevant information found."

        print(f"Found {len(relevant_chunks)} relevant chunks")

        # Create context
        context = "\n\n".join([chunk['text'] for chunk in relevant_chunks])

        # Generate answer with Gemini if available
        if self.gemini_api_key:
            prompt = f"""Based on the following context, answer the question:

Context:
{context}

Question: {query}

Answer:"""

            try:
                response = self.llm.generate_content(prompt)
                return response.text
            except Exception as e:
                return f"Error with Gemini: {str(e)}"
        else:
            # Simple fallback
            return f"Based on the document:\n\n{context[:500]}..."

print("SimplePDFRAG class created!")

SimplePDFRAG class created!
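
The chunking step inside `load_pdf` can be tried in isolation. The sketch below is a standalone variant, not the class method itself: the function name `chunk_sentences` and its parameters are assumptions for illustration, and unlike the class it strips each sentence and drops empty fragments before grouping, so chunk counts may differ slightly from the notebook's output.

```python
def chunk_sentences(text, sentences_per_chunk=3, min_chars=50):
    """Group period-delimited sentences into chunks, dropping very short ones."""
    # Split on periods, trimming whitespace and discarding empty fragments.
    sentences = [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]

    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ". ".join(sentences[i:i + sentences_per_chunk]) + "."
        if len(chunk) > min_chars:  # Same length filter as the notebook
            chunks.append(chunk)
    return chunks


sample = (
    "Retrieval augmented generation pairs search with a language model. "
    "Documents are split into chunks. Each chunk is embedded as a vector. "
    "A query is embedded the same way. The closest chunks become context."
)
for chunk in chunk_sentences(sample):
    print(repr(chunk))
```

Splitting on `.` is deliberately crude: abbreviations and decimal numbers are also treated as sentence boundaries, which is acceptable for a simple demo but worth keeping in mind when retrieval quality matters.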


## 3. Initialize the RAG System

**Important:** Set `GEMINI_API_KEY` in your environment or in a `.env` file next to this notebook. If the variable is not set, `api_key` is `None` and the system falls back to basic responses.

In [3]:
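# Initialize the RAG system
load_dotenv()  # Reads GEMINI_API_KEY from a local .env file, if one exists
api_key = os.getenv('GEMINI_API_KEY')
rag = SimplePDFRAG(gemini_api_key=api_key)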

Loading embedding model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Gemini API configured
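
If `python-dotenv` is not available, the key can be read straight from the process environment. This is a minimal sketch; the helper name `resolve_api_key` is a hypothetical addition, and treating a blank value as "no key" is an assumption made here so that the class falls back to basic responses instead of sending an empty key.

```python
import os


def resolve_api_key(env_var="GEMINI_API_KEY"):
    """Return the key from the environment, or None if it is unset or blank."""
    value = os.environ.get(env_var, "").strip()
    return value or None


api_key = resolve_api_key()
# Report only whether a key was found; never print the key itself.
print("Gemini key found" if api_key else "No Gemini key set, basic responses only")
```

The result can then be passed to the constructor as `SimplePDFRAG(gemini_api_key=api_key)`.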


## 4. Load Your PDF

Set `pdf_path` in the cell below to the path of your PDF file.

In [4]:
# Load PDF file
pdf_path = "/content/10 day challenge with AI Crafters.pdf"  # Replace with your PDF filename
success = rag.load_pdf(pdf_path)

if success:
    print("\nReady to answer questions!")
else:
    print("\nFailed to load PDF. Please check the file path.")

Loading PDF: /content/10 day challenge with AI Crafters.pdf
PDF has 18 pages
Created 20 text chunks
Generating embeddings...
PDF loaded and indexed successfully!

Ready to answer questions!
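
The index built by `load_pdf` pairs `faiss.normalize_L2` with `faiss.IndexFlatIP`. Once vectors are scaled to unit length, their inner product equals their cosine similarity, so the scores returned by `search` can be read as cosine similarities. The sketch below checks that identity on two toy vectors in plain Python; the helper names are illustrative and no FAISS call is involved.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


doc = [3.0, 4.0]
query = [4.0, 3.0]

# Inner product of the normalized vectors versus the direct cosine formula.
score = inner_product(l2_normalize(doc), l2_normalize(query))
print(round(score, 4), round(cosine(doc, query), 4))
```

Both printed values agree, which is why the query vector is also normalized in `search` before it is compared against the index.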


## 5. Ask Questions About Your PDF

In [10]:
# Ask a question
question = "What is the main topic of this document?"
print(f"Question: {question}")
print("\nAnswer:")
answer = rag.answer(question)
print(answer)

Question: What is the main topic of this document?

Answer:
Found 3 relevant chunks


ERROR:tornado.access:503 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 3942.10ms


The main topic of the document is a series of challenges or assignments related to artificial intelligence (AI), specifically focusing on virtual try-ons, AI agents, MCP servers, and advanced RAG techniques.  The document outlines the objectives and submission requirements for each challenge.
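
The log above records a 503 response from the Gemini endpoint before an answer was returned. For transient failures like this, a retry with exponential backoff is a common remedy. The following is a minimal sketch under stated assumptions: `with_retries` is a hypothetical helper, not part of `google.generativeai`, and it retries on any exception rather than inspecting the error type.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call() until it succeeds or the attempts are exhausted."""
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:  # Broad on purpose: this sketch does not inspect error types
            last_error = exc
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, then 4x, ... between attempts.
                sleep(base_delay * (2 ** attempt))
    raise last_error


# Demo with a stand-in function that fails twice before succeeding.
state = {"calls": 0}


def flaky_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("503 Service Unavailable")
    return "answer text"


print(with_retries(flaky_call, base_delay=0.01))
```

Because `rag.answer` catches exceptions itself and returns an error string, a helper like this would need to wrap the `generate_content` call directly, for example `with_retries(lambda: rag.llm.generate_content(prompt))`.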

