
# README: Retrieval-Augmented Generation (RAG) Pipeline for Carroll_SG.pdf

## Overview

This notebook implements a Retrieval-Augmented Generation (RAG) pipeline that extracts text from a PDF, embeds it, retrieves the passages most relevant to a query, and generates a response with GPT-2. The primary steps are:

1. **Text Extraction:**  
   - **Tool:** PyPDF2  
   - **Challenge:** Some pages in the original PDF yielded no text because of non-standard internal formatting.  
   - **Resolution:** The PDF was re-saved from a browser to "flatten" its internal structure, which made text extractable from some, but not all, pages.

2. **Text Preprocessing and Chunking:**  
   - The extracted text is split into overlapping word-based chunks (500 words with a 50-word overlap by default), so text near a chunk boundary appears with its surrounding context in at least one chunk. This matters for accurate retrieval later.

3. **Embedding Generation & Indexing:**  
   - **Tool:** Sentence Transformers (`all-mpnet-base-v2`)  
   - Each text chunk is encoded into a dense vector, and the vectors are added to a FAISS `IndexFlatL2` index for similarity search.

4. **Query Retrieval:**  
   - A user query is encoded and used to search the FAISS index to find the most similar text chunks.

5. **Response Generation:**  
   - **Tool:** GPT-2 via Hugging Face’s Transformers pipeline  
   - The retrieved context is combined with the query to form a prompt, and GPT-2 generates a continuation of that prompt as the response.

## Challenges and Solutions

- **Text Extraction Variability:**  
  Some pages in the original PDF did not yield extractable text due to how the text was stored internally. After re-saving the PDF from the browser, some pages still had zero-length text, indicating they might be scanned images or use a non-standard layout. In such cases, fallback methods like OCR or alternative libraries (e.g., pdfplumber, PyMuPDF) were considered.

- **Parameter Tuning:**  
  The chunk size was set to 500 words with a 50-word overlap to give each chunk enough context. Note that `all-mpnet-base-v2` truncates inputs longer than 384 word pieces by default, so the tail of a 500-word chunk may not be reflected in its embedding.  
  The number of retrieved chunks (`k`, set to 2 here) trades more context against a longer, less focused prompt.

- **Generation Configuration:**  
  The generation call enables truncation and works around an unwanted configuration key (a progress bar setting) reaching GPT-2's generation method. With two 500-word chunks as context, the prompt can still exceed GPT-2's 1024-token limit, as the warning in the final cell's output shows.
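
To make the chunking parameters concrete, the sketch below works out how many chunks a document of a given length produces. It assumes the sliding-window scheme used in this notebook, where each chunk starts `chunk_size - overlap` words after the previous one. `count_chunks` is a hypothetical helper for illustration, not part of the notebook.

```python
import math

def count_chunks(num_words: int, chunk_size: int = 500, overlap: int = 50) -> int:
    """Number of chunks a sliding window with the given size and overlap produces."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    return math.ceil(num_words / step)

for n in (400, 1000, 10_000):
    print(n, "words ->", count_chunks(n), "chunks")
```

A larger overlap shrinks the step, so the chunk count (and the number of embeddings to compute) grows for the same document.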

## Key Decisions

- **Model Selection:**  
  - **Sentence Transformer:** `all-mpnet-base-v2` was chosen for its strong performance on semantic similarity tasks.
  - **Generative Model:** GPT-2 was used as a baseline generative model, which can be replaced or fine-tuned for more specific domains if needed.

- **Indexing with FAISS:**  
  FAISS was chosen for fast similarity search over high-dimensional vectors. The flat L2 index used here performs an exact search, which is sufficient for a small number of chunks; FAISS also offers approximate indexes should the corpus grow.

- **Pipeline Modularity:**  
  The process is modular, allowing easy substitution of different components (e.g., text extraction libraries, embedding models, or generative models) to suit future needs.
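
The FAISS index in this pipeline ranks chunks by L2 distance, while embedding models are often compared by cosine similarity. For unit-length vectors the two agree: squared L2 distance equals `2 - 2 * cosine`, so both produce the same ranking. The sketch below checks this on toy vectors in plain Python. Whether a given embedding model returns unit-length vectors is an assumption to verify, and all names here are illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # For unit vectors the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

query_vec = normalize([1.0, 2.0, 0.5])
doc_vecs = [normalize(v) for v in ([1.0, 1.9, 0.4], [-1.0, 0.2, 3.0], [0.5, 0.5, 0.5])]

# Identity check: ||a - b||^2 == 2 - 2 * (a . b) for unit vectors
for d in doc_vecs:
    assert math.isclose(squared_l2(query_vec, d), 2 - 2 * cosine(query_vec, d), abs_tol=1e-12)

# Smallest distance first vs. largest cosine first
by_l2 = sorted(range(len(doc_vecs)), key=lambda i: squared_l2(query_vec, doc_vecs[i]))
by_cos = sorted(range(len(doc_vecs)), key=lambda i: -cosine(query_vec, doc_vecs[i]))
print(by_l2 == by_cos)
```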

## Conclusion

This notebook demonstrates a complete end-to-end RAG pipeline. It not only retrieves and generates responses based on a PDF’s content but also documents the approach, challenges, and design decisions. Future improvements could involve fine-tuning the generative model or integrating a user-friendly interface (e.g., using Streamlit or Gradio).

In [None]:
import os
import PyPDF2
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# 2. Extract Text from PDF using PyPDF2


In [None]:
def extract_text_from_pdf(pdf_path):
    """Extracts text from each page of the PDF."""
    text = ""
    with open(pdf_path, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(pdf_reader.pages):
            page_text = page.extract_text()
            if page_text:
                text += f"\n--- Page {i+1} ---\n" + page_text
    return text

pdf_path = "/Users/abharathkumar/Downloads/Carroll_SG.pdf"
document_text = extract_text_from_pdf(pdf_path)
print("Preview of extracted text:")
print(document_text[:500])

Preview of extracted text:

--- Page 1 ---
An Introduction  to General Relativity
E SPACETIM
and
GEOMETRY
Sean M. Carroll
--- Page 2 ---
SPACETIME  AND  GEOMETRY
An  Introduction  to General  Relativity
Sean  Carroll
University  of Chicago
Addison
Wesley
CapetownSan Francisco Boston New York
Mexico  City
Sydney Tokyo TorontoHong Kong London Madrid
Montreal Munich Paris Singapore
--- Page 3 ---
Acquisitions  Editor:  Adam Black
Project  Editor: Nancy  Benton
Text Designer:  Leslie  Galen
Cover  Designer:  Blakeley  Kim
Mar
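
Because the extraction function silently skips pages that yield no text, it can be useful to report which pages came back empty before deciding on OCR or another library. The sketch below is a minimal, hypothetical helper (`extraction_report` is not part of the notebook) that works on a plain list of per-page strings, so it runs without a PDF.

```python
def extraction_report(page_texts):
    """Summarize which pages produced text.

    page_texts: one entry per page, as returned by an extractor
    (None or an empty/whitespace-only string means nothing was extracted).
    """
    empty_pages = [
        page_number
        for page_number, text in enumerate(page_texts, start=1)
        if not (text and text.strip())
    ]
    total = len(page_texts)
    return {
        "total_pages": total,
        "pages_with_text": total - len(empty_pages),
        "empty_pages": empty_pages,  # candidates for OCR or another library
    }

# Toy input standing in for [page.extract_text() for page in pdf_reader.pages]
report = extraction_report(["Title page", "", None, "  \n", "Chapter 1 ..."])
print(report)
```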


# 3. Preprocess & Chunk the Text


In [None]:
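def chunk_text(text, chunk_size=500, overlap=50):
    """
    Splits text into chunks of 'chunk_size' words; consecutive chunks share 'overlap' words.
    """
    if not 0 <= overlap < chunk_size:
        # A step of zero or less would make the loop below run forever.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        start += step
    return chunks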

chunks = chunk_text(document_text, chunk_size=500, overlap=50)
print("Total chunks created:", len(chunks))
print("Preview of first chunk:")
print(chunks[0][:500])

Total chunks created: 3
Preview of first chunk:
--- Page 1 --- An Introduction to General Relativity E SPACETIM and GEOMETRY Sean M. Carroll --- Page 2 --- SPACETIME AND GEOMETRY An Introduction to General Relativity Sean Carroll University of Chicago Addison Wesley CapetownSan Francisco Boston New York Mexico City Sydney Tokyo TorontoHong Kong London Madrid Montreal Munich Paris Singapore --- Page 3 --- Acquisitions Editor: Adam Black Project Editor: Nancy Benton Text Designer: Leslie Galen Cover Designer: Blakeley Kim Marketing Manager: Chr
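
A quick way to confirm the overlap is to compare the tail of each chunk with the head of the next one. The sketch below applies the same sliding-window scheme to a toy word list with small, made-up parameters. `sliding_chunks` and `chunks_demo` are illustrative names, not part of the notebook.

```python
def sliding_chunks(words, chunk_size, overlap):
    """Same sliding-window scheme as chunk_text, but returning lists of words."""
    step = chunk_size - overlap
    return [words[start:start + chunk_size] for start in range(0, len(words), step)]

toy_words = [f"w{i}" for i in range(25)]
chunks_demo = sliding_chunks(toy_words, chunk_size=10, overlap=3)

# The last 3 words of each chunk should reappear as the first 3 of the next.
for left, right in zip(chunks_demo, chunks_demo[1:]):
    print(left[-3:], right[:3])
```

Note that the final chunk can be shorter than `chunk_size`; here it holds only the last four words.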


# 4. Generate Embeddings & Build FAISS Index


In [None]:
embedder = SentenceTransformer("all-mpnet-base-v2")

print("Generating embeddings for chunks...")
chunk_embeddings = embedder.encode(chunks, show_progress_bar=False)

# FAISS expects a 2-D float32 array; build an exact (brute-force) L2 index over it
embeddings_np = np.asarray(chunk_embeddings, dtype="float32")
embedding_dim = embeddings_np.shape[1]
index = faiss.IndexFlatL2(embedding_dim)
index.add(embeddings_np)
print("FAISS index built with", index.ntotal, "vectors.")

Generating embeddings for chunks...
FAISS index built with 3 vectors.
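
Conceptually, a flat L2 index compares the query against every stored vector and returns the `k` entries with the smallest squared L2 distance. The plain-Python sketch below illustrates that exhaustive search on toy 2-D vectors. It is not FAISS itself, and `flat_l2_search` is a hypothetical name.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search by squared L2 distance."""
    scored = [
        (sum((x - q) ** 2 for x, q in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, query=[1.0, 0.5], k=2)
print(toy_indices, toy_distances)
```

The cost grows linearly with the number of stored vectors, which is negligible for a handful of chunks.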


# 5. Retrieve Relevant Chunks for a Query


In [None]:
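def retrieve_relevant_chunks(query, embedder, index, chunks, k=2):
    """
    Encodes the query, searches the FAISS index, and returns up to 'k' matching chunks,
    closest first.
    """
    query_embedding = np.asarray(embedder.encode([query]), dtype="float32")
    distances, indices = index.search(query_embedding, k)
    # FAISS pads missing results with -1 (e.g., when k exceeds the number of stored
    # vectors); skip those so they do not silently select chunks[-1].
    retrieved_chunks = [chunks[i] for i in indices[0] if i >= 0]
    return retrieved_chunks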

# Define your query
query = "What is the main idea behind General Relativity?"

# Retrieve the most relevant chunks
relevant_chunks = retrieve_relevant_chunks(query, embedder, index, chunks, k=2)
print("Retrieved Chunks:")
for idx, chunk in enumerate(relevant_chunks, 1):
    print(f"\n--- Chunk {idx} ---")
    print(chunk[:300])

Retrieved Chunks:

--- Chunk 1 ---
In ad- dition to being an active research area in its own right, GR is part of the standard syllabus for anyone interested in astrophysics, cosmology, string theory, and even particle physics. This is not to slight the more pragmatic uses of GR, including the workings of the Global Positioning Syste

--- Chunk 2 ---
detailed formalism, but have also attempted to include concrete examples and informal discussion of the concepts under consideration. Much of the most mathematical material has been relegated to the Appendices. Some of the material in the Appendices is actually an integral part of the course (for ex


# 6. Generate a Response Using a GPT-based Model


In [None]:
from transformers import pipeline

# Initialize the text generation pipeline with GPT-2 and disable the progress bar.
generator = pipeline("text-generation", model="gpt2", progress_bar=False)

# Directly set the progress_bar attribute in the generation_config to None.
try:
    generator.generation_config.progress_bar = None
except Exception as e:
    print("Could not set progress_bar to None:", e)

# Combine the retrieved chunks as context (assuming relevant_chunks and query are defined)
context = "\n".join(relevant_chunks)
prompt = f"Context: {context}\n\nQuery: {query}\nAnswer:"

# Generate a response with truncation enabled.
generated_response = generator(prompt, max_new_tokens=50, do_sample=True, truncation=True)[0]['generated_text']
print("Generated Response:\n")
print(generated_response)


Device set to use mps:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (1024). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


Generated Response:

Context: In ad- dition to being an active research area in its own right, GR is part of the standard syllabus for anyone interested in astrophysics, cosmology, string theory, and even particle physics. This is not to slight the more pragmatic uses of GR, including the workings of the Global Positioning System (GPS) satellite network. There is no shortage of books on GR, and many of them are excellent. Indeed, approximately thirty years ago witnessed the appearance of no fewer than three books in the subject, each of which has become a classic in its own right: those by Weinberg (1972), Misner, Thorne, and Wheeler (1973), and Hawking and Ellis (1975). Each of these books is suffused with a strongly-held point of view advo- cated by the authors. This has led to a love-hate relationship between these works and their readers; in each case, it takes little effort to find students who will de- clare them to be the best textbook ever written, or other students who find th
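
The warning in the output above indicates that the prompt plus the requested new tokens exceeds GPT-2's 1024-token limit. One possible remedy is to budget the context before building the prompt. The sketch below operates on plain lists of token ids so that it runs without a model; in the notebook the ids would come from the GPT-2 tokenizer. `fit_context` and its parameters are hypothetical names, and this is a sketch of the idea rather than the notebook's method.

```python
def fit_context(context_ids, other_ids_count, max_length=1024, max_new_tokens=50):
    """Trim context token ids so that prompt + generated tokens fit the model.

    context_ids: token ids of the retrieved context
    other_ids_count: tokens used by the rest of the prompt (labels and query)
    """
    budget = max_length - max_new_tokens - other_ids_count
    if budget <= 0:
        raise ValueError("No room left for context; shorten the query or max_new_tokens")
    return context_ids[:budget]  # keep the start of the context, drop the tail

# Toy example: 1500 context tokens, 20 tokens for the labels and the query
trimmed = fit_context(list(range(1500)), other_ids_count=20)
print(len(trimmed))
```

Since the retrieved chunks are joined in rank order, trimming from the end drops text from the lower-ranked chunk first.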