
# Retrieval-Augmented Generation (RAG) with PDFs

In [None]:
!pip install openai PyPDF2 pdfplumber nltk --quiet

In [None]:
from openai import OpenAI
from getpass import getpass
import os
import pdfplumber
from PyPDF2 import PdfReader

# 🔑 Enter your OpenAI API key
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
client = OpenAI()

# 2️⃣ Read a PDF Using Two Libraries
Let's compare how PyPDF2 and pdfplumber extract text from the same document.

In [None]:
from google.colab import files

uploaded = files.upload()

for filename in uploaded.keys():
    simple_pdf = filename
    print(f"User uploaded file '{simple_pdf}'")

In [None]:
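# --- PyPDF2 ---
reader = PdfReader(simple_pdf)
text_pypdf = ""
for page in reader.pages:
    # extract_text() can return None or "" for pages without a text layer;
    # the newline keeps words at page boundaries from running together
    text_pypdf += (page.extract_text() or "") + "\n"
print("PyPDF2 Text Sample:\n", text_pypdf[:800])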

In [None]:
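# --- pdfplumber ---
text_plumber = ""
with pdfplumber.open(simple_pdf) as pdf:
    for page in pdf.pages:
        # extract_text() can return None or "" for pages without a text layer;
        # the newline keeps words at page boundaries from running together
        text_plumber += (page.extract_text() or "") + "\n"
print("pdfplumber Text Sample:\n", text_plumber[:800])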
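
To make the comparison between the two extractions more concrete than eyeballing the samples, one option is to compare sizes and word-level similarity. The sketch below is only an illustration: `compare_extractions` is a hypothetical helper, and the two sample strings are made up to stand in for `text_pypdf` and `text_plumber`.

```python
from difflib import SequenceMatcher

def compare_extractions(text_a, text_b):
    """Summarize how two extracted versions of the same text differ."""
    words_a, words_b = text_a.split(), text_b.split()
    # ratio() is 1.0 for identical word sequences and drops toward 0.0 as they diverge
    similarity = SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()
    return {
        "chars": (len(text_a), len(text_b)),
        "words": (len(words_a), len(words_b)),
        "word_similarity": round(similarity, 3),
    }

# Made-up samples: identical words, but the second has an extra space
sample_a = "Retrieval augmented generation combines search with a language model."
sample_b = "Retrieval augmented generation combines search with a  language model."
report = compare_extractions(sample_a, sample_b)
print(report)
```

In the notebook you could call `compare_extractions(text_pypdf[:5000], text_plumber[:5000])`. Slicing is suggested because `SequenceMatcher` can become slow on very long inputs.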

# ✂️ 3️⃣ Chunking Strategies

Let's turn the text into chunks suitable for embedding.
We'll try three chunking approaches and compare them.

## A. Fixed-Size Chunking (Simple)

In [None]:
def chunk_text_fixed(text, max_words=200):
    words = text.split()
    chunks = [" ".join(words[i:i+max_words]) for i in range(0, len(words), max_words)]
    return chunks

chunks_fixed = chunk_text_fixed(text_plumber, max_words=10)
print("Fixed-size chunks:", len(chunks_fixed))
for chunk in chunks_fixed[:3]:
    print(chunk[:300])
    print("-" * 80)


In [None]:
chunks_fixed = chunk_text_fixed(text_plumber, max_words=20)
print("Fixed-size chunks:", len(chunks_fixed))
for chunk in chunks_fixed[:3]:
    print(chunk[:300])
    print("-" * 80)

✅ Pros: simple, consistent sizes

⚠️ Cons: breaks sentences or paragraphs
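
A rough way to see how often fixed-size chunking cuts a sentence is to count the chunks that do not end in sentence punctuation. This is a minimal sketch on a made-up sample: `count_mid_sentence_cuts` is a hypothetical helper, and the punctuation check is a heuristic that abbreviations or missing punctuation will fool.

```python
def count_mid_sentence_cuts(chunks):
    """Count chunks whose last word does not end with sentence punctuation."""
    enders = (".", "!", "?")
    return sum(1 for chunk in chunks if not chunk.rstrip().endswith(enders))

sample = ("RAG retrieves relevant passages first. The model then answers "
          "using them. Chunk size matters a lot.")
words = sample.split()
# Same slicing idea as chunk_text_fixed, here with 4 words per chunk
sample_chunks = [" ".join(words[i:i + 4]) for i in range(0, len(words), 4)]
cuts = count_mid_sentence_cuts(sample_chunks)
print(sample_chunks)
print(f"{cuts} of {len(sample_chunks)} chunks end mid-sentence")
```

Running `count_mid_sentence_cuts(chunks_fixed)` on the real chunks gives a quick feel for how much a given `max_words` fragments the text.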

## B. Overlapping Chunking

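Preserves continuity between chunks: each chunk repeats the last few words of the previous one, so a passage that straddles a boundary still appears intact in at least one chunk.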

In [None]:
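def chunk_text_overlap(text, max_words=200, overlap=50):
    # The window advances by (max_words - overlap) words, so that step must be positive
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + max_words]))
        if i + max_words >= len(words):
            break  # this window already reached the end; avoid a redundant tail chunk
    return chunks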

chunks_overlap = chunk_text_overlap(text_plumber, 5, 2)
print("Overlapping chunks:", len(chunks_overlap))
for chunk in chunks_overlap[:3]:
    print(chunk[:300])
    print("-" * 80)

In [None]:
chunks_overlap = chunk_text_overlap(text_plumber, 10, 3)
print("Overlapping chunks:", len(chunks_overlap))
for chunk in chunks_overlap[:3]:
    print(chunk[:300])
    print("-" * 80)

✅ Pros: smoother retrieval, less context loss

⚠️ Cons: more total chunks (more embeddings = higher cost)
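
To put a number on that cost, the sketch below counts how many words end up being embedded for different overlap values. It assumes a synthetic 1,000-word document and uses a simplified, hypothetical helper (`overlap_windows`) with the same step rule of `max_words - overlap`. Treat the figures as an illustration of the trend, not as a pricing estimate.

```python
def overlap_windows(words, max_words, overlap):
    """Slide a window of max_words over words, stepping by max_words - overlap."""
    step = max_words - overlap
    return [words[i:i + max_words] for i in range(0, len(words), step)]

words = [f"w{i}" for i in range(1000)]  # stand-in for a 1,000-word document
overlap_stats = {}
for overlap in (0, 50, 100):
    windows = overlap_windows(words, max_words=200, overlap=overlap)
    embedded = sum(len(w) for w in windows)  # words sent to the embedding model
    overlap_stats[overlap] = (len(windows), embedded)
    print(f"overlap={overlap:>3}: {len(windows)} chunks, {embedded} words embedded "
          f"({embedded / len(words):.2f}x the original)")
```

The ratio in the last column is the useful part: it approximates how much more text you pay to embed compared with no overlap.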

## C. Sentence-Based Chunking

Groups whole sentences instead of counting raw words, which produces more semantically coherent pieces.

In [None]:
import nltk
from nltk import sent_tokenize

# Download the Punkt tokenizer data that sent_tokenize needs
nltk.download('punkt_tab')

def chunk_by_sentences(text, max_sentences=5):
    sentences = sent_tokenize(text)
    chunks = [" ".join(sentences[i:i+max_sentences]) for i in range(0, len(sentences), max_sentences)]
    return chunks

chunks_sentence = chunk_by_sentences(text_plumber, max_sentences=3)
print("Sentence-based chunks:", len(chunks_sentence))
for chunk in chunks_sentence[:3]:
    print("###########CHUNK###########")
    print(chunk)
    print("-" * 80)

In [None]:
chunks_sentence = chunk_by_sentences(text_plumber, max_sentences=5)
print("Sentence-based chunks:", len(chunks_sentence))
for chunk in chunks_sentence[:3]:
    print(chunk)
    print("-" * 80)

✅ Pros: natural breaks in meaning

⚠️ Cons: variable chunk lengths; a few very long sentences can push a chunk past an embedding model's input limit
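
One way to check how variable the chunk lengths really are is to look at the minimum, maximum, and mean word count per chunk. The sketch below is self-contained, so it uses a naive regex split as a stand-in for `sent_tokenize` (an assumption made only to avoid the NLTK download here). `chunk_length_stats` is a hypothetical helper and the sample text is made up.

```python
import re
from statistics import mean

def chunk_length_stats(chunks):
    """Return (min, max, mean) word counts across chunks."""
    lengths = [len(chunk.split()) for chunk in chunks]
    return min(lengths), max(lengths), round(mean(lengths), 1)

sample = ("Short one. Tiny. "
          "This sentence is quite a bit longer than the first one was. "
          "Another fairly long sentence appears here to skew the totals.")
# Naive split on end punctuation followed by whitespace; a stand-in for sent_tokenize
sentences = re.split(r"(?<=[.!?])\s+", sample)
sentence_chunks = [" ".join(sentences[i:i + 2]) for i in range(0, len(sentences), 2)]
length_stats = chunk_length_stats(sentence_chunks)
print(length_stats)
```

In the notebook, `chunk_length_stats(chunks_sentence)` should work the same way on the real chunks. A large gap between the minimum and maximum suggests capping chunks by word count as well as by sentence count.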

# 🧮 4️⃣ Choose the Best Chunking Strategy

For most documents:
👉 Overlapping chunking usually gives the best tradeoff between preserving context and keeping chunk sizes uniform.

Let's proceed using that.

In [None]:
chunks = chunks_overlap  # swap in chunks_fixed or chunks_sentence to compare
for chunk in chunks[:3]:
    print(chunk[:300])
    print("-" * 80)

# 🧪 Reflection: Experiment

Try:

- Different chunk sizes (100, 300, 500 words).
- Different overlap values (0, 50, 100 words), keeping the overlap smaller than the chunk size.
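
The experiment can be run as a small grid. The sketch below is one possible setup, not a prescribed method: `sliding_chunks` is a hypothetical helper that mirrors the overlapping approach, and the placeholder text stands in for the extracted PDF text. Combinations where the overlap is not smaller than the chunk size are skipped, because the window would never advance.

```python
from itertools import product

def sliding_chunks(words, size, overlap):
    """Word windows of length size, advancing by size - overlap each step."""
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

# Placeholder 1,200-word text; in the notebook use text_plumber.split() instead
words = ("lorem ipsum " * 600).split()
results = {}
for size, overlap in product((100, 300, 500), (0, 50, 100)):
    if overlap >= size:
        continue  # the window would not advance, so skip this combination
    results[(size, overlap)] = len(sliding_chunks(words, size, overlap))

for (size, overlap), n_chunks in results.items():
    print(f"size={size:>3} overlap={overlap:>3} -> {n_chunks} chunks")
```

Chunk counts are only a first signal. Reading a few chunks from each setting is still the best check on whether they hold enough context to answer a question.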