# **Title: Chat with PDF Using a RAG Pipeline**

**Overview**

Imagine having a large PDF document and getting answers to your questions without searching through its pages by hand. In this project, I've built a retrieval-augmented generation (RAG) pipeline that fetches a PDF from a URL, takes a question about its content, and generates a response from the most relevant passages. The system combines Sentence Transformers for embeddings, FAISS for similarity search, and a Hugging Face T5 model for response generation.

In [None]:
!pip install transformers
!pip install sentence-transformers
!pip install faiss-cpu
!pip install requests
!pip install PyMuPDF

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0.post1
Collecting PyMuPDF
  Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.1


In [8]:
import requests
import fitz  # PyMuPDF
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import T5ForConditionalGeneration, T5Tokenizer
import faiss

# Initialize the Hugging Face model for response generation
# Note: switching to a BART checkpoint such as 'facebook/bart-large' also means
# replacing the T5 classes below with their BART counterparts.
model_name = 't5-small'
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Initialize the sentence transformer model for embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Explanation:
# 1. The SentenceTransformer model (all-MiniLM-L6-v2) generates embeddings for textual data.
# 2. It is a lightweight transformer-based model optimized for semantic similarity tasks.

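# Function to download the PDF from a URL
def download_pdf(pdf_url, download_path='/content/temp_pdf.pdf'):
    # The timeout keeps the call from hanging on an unresponsive server
    try:
        response = requests.get(pdf_url, timeout=30)
    except requests.RequestException:
        return None
    if response.status_code == 200:
        with open(download_path, 'wb') as f:
            f.write(response.content)
        return download_path
    return None
# Explanation:
# The requests library downloads the PDF from the provided URL and saves it to download_path.
# The function returns None if the request fails or the server does not answer with status 200.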

# Function to extract text from a PDF
def extract_text_from_pdf(pdf_path):
    text = ""
    # Open the document in a context manager so the file is closed afterwards
    with fitz.open(pdf_path) as pdf_document:
        for page_num in range(pdf_document.page_count):
            page = pdf_document.load_page(page_num)
            text += page.get_text("text")
    return text

# PyMuPDF extracts the text of every page in the PDF and returns it as a single plain-text string.

# Function to chunk the text
def chunk_text(text, chunk_size=500):
    chunks = []
    words = text.split()
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i+chunk_size])
        chunks.append(chunk)
    return chunks
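# Explanation:
# 1. Splits the extracted text into smaller chunks (default size: 500 words).
# 2. Smaller chunks keep each embedding focused on one passage, which makes semantic search more relevant.
# Note: all-MiniLM-L6-v2 truncates inputs longer than 256 word pieces by default,
# so only the beginning of a 500-word chunk contributes to its embedding.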

# Function to create embeddings from text chunks
def create_embeddings(chunks):
    return embedding_model.encode(chunks)

# Generates vector embeddings for a list of text chunks using the SentenceTransformer model.
# The model converts each text chunk into a dense vector of fixed dimension (384 for all-MiniLM-L6-v2).

# Function to store embeddings in FAISS (indexing and search)
def store_embeddings_in_faiss(embeddings):
    dimension = embeddings.shape[1]  # Dimension of embeddings
    index = faiss.IndexFlatL2(dimension)  # Using L2 distance for similarity search
    index.add(embeddings)  # Add embeddings to the index
    return index

# Stores the embeddings in a FAISS index to enable fast similarity search.
# How it works:
# 1. Reads the dimensionality of the embeddings.
# 2. Creates a flat FAISS index that compares vectors by L2 (Euclidean) distance.
# 3. Adds all the embeddings to the index for retrieval.

# Function to handle user query and get the most relevant chunks
def handle_query(query, index, chunks, top_k=3):
    query_embedding = embedding_model.encode([query])

    # Search for the top K most relevant chunks based on the query embedding
    D, I = index.search(query_embedding, top_k)  # D: distances, I: indices of the retrieved chunks

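    # Retrieve the corresponding chunks; FAISS pads the result with -1
    # when fewer than top_k vectors are stored, so skip those entries
    relevant_chunks = [chunks[i] for i in I[0] if i >= 0]
    return relevant_chunks
# Explanation:
# When the user submits a query, we embed it and search the FAISS index for the top_k closest chunks.
# The retrieved chunks are then passed to generate_response.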

# Function to generate response using Hugging Face T5
def generate_response(query, relevant_chunks):
    context = "\n".join(relevant_chunks)  # Concatenate the retrieved chunks

    # Prepare the input prompt for the model
    input_text = f"Question: {query}\nContext: {context}\nAnswer:"

    # Tokenize the input and generate the output
    inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs, max_length=150, num_beams=4, early_stopping=True)

    # Decode the output sequence
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
# Explanation:
# Combines the most relevant text chunks into a context and builds a prompt from it.
# The prompt includes:
# 1. The user's query.
# 2. The most relevant chunks from the PDF content.
# The T5 model takes this prompt as input and generates the final answer.
# Because the input is truncated to 512 tokens, any context beyond that limit is dropped.

# Complete pipeline: from PDF processing to query handling
def run_pipeline(pdf_url, user_query):
    # Download and process the PDF
    pdf_path = download_pdf(pdf_url)
    if pdf_path:
        # Extract text and chunk it
        text = extract_text_from_pdf(pdf_path)
        chunks = chunk_text(text)
        if not chunks:
            return "No extractable text was found in the PDF."

        # Generate embeddings and store in FAISS (which expects float32 arrays)
        embeddings = create_embeddings(chunks)
        embeddings = np.asarray(embeddings, dtype='float32')
        index = store_embeddings_in_faiss(embeddings)

        # Handle the user query
        relevant_chunks = handle_query(user_query, index, chunks)

        # Generate the final response using Hugging Face model
        response = generate_response(user_query, relevant_chunks)
        return response
    else:
        return "Failed to download the PDF from the provided URL."
# Explanation:
# Function: run_pipeline(pdf_url, user_query)
# Combines all steps into a single pipeline:
# 1. Download: Fetch the PDF from the given URL.
# 2. Text Extraction: Extract text from the downloaded PDF.
# 3. Text Chunking: Split the text into manageable pieces.
# 4. Embedding Creation: Generate embeddings for the chunks.
# 5. Index Storage: Store embeddings in FAISS for efficient retrieval.
# 6. Similarity Search: Retrieve the most relevant chunks for the user query.
# 7. Response Generation: Generate a response based on the relevant chunks.
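
The text chunking step cuts on a fixed word count, so a sentence that straddles a boundary is split between two chunks. One common mitigation is to let consecutive chunks share a few words. The sketch below is an illustrative variant, not the notebook's `chunk_text`; the name `chunk_text_with_overlap` and the `overlap` parameter are assumptions made for this example.

```python
def chunk_text_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into word chunks where neighbours share `overlap` words.

    Hypothetical variant of chunk_text, shown for illustration only.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_text_with_overlap(sample, chunk_size=4, overlap=1):
    print(chunk)
# one two three four
# four five six seven
# seven eight nine ten
```

A larger overlap preserves more context across boundaries but produces more chunks to embed and index, so the value is a trade-off rather than a fixed rule.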

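To make the index storage and similarity search steps concrete, here is a minimal pure-Python sketch of what a flat L2 index computes: the squared Euclidean distance from the query to every stored vector, followed by a sort. The function name `brute_force_l2_search` and the toy 2-D vectors are illustrative assumptions, not part of the notebook; FAISS performs the same exhaustive comparison far more efficiently on the real embeddings.

```python
def brute_force_l2_search(vectors, query, top_k=3):
    """Return (distances, indices) of the top_k vectors closest to query.

    Hypothetical helper for illustration only; distances are squared L2.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    best = scored[:top_k]
    return [d for d, _ in best], [i for _, i in best]


# Toy "embeddings" standing in for three text chunks
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = brute_force_l2_search(toy_vectors, [1.0, 2.0], top_k=2)
print(indices)    # [1, 0]
print(distances)  # [1.0, 5.0]
```

The returned indices play the same role as `I[0]` in `handle_query`: they are positions in the list of chunks, ordered from most to least similar.
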

In [10]:
# Example usage:
pdf_url = "https://www.hunter.cuny.edu/dolciani/pdf_files/workshop-materials/mmc-presentations/tables-charts-and-graphs-with-examples-from.pdf"
user_query = "what is overview of this pdf?"

response = run_pipeline(pdf_url, user_query)
print(response)


Tables, Charts, and Graphs with Examples from History, Economics, Education, Psychology, Urban Affairs and Everyday Life REVISED: MICHAEL LOLKUS 2018 Tables, Charts, and Graphs Basics We use charts and graphs to visualize data. This data can either be generated data, data gathered from an experiment, or data collected from some source. Notice the pie chart below is not very intuitive. Example from Everyday Life 19% 10% 10% 15% 5% 26%
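
The printed response appears to be text lifted from the document rather than a summary of it. One factor that may contribute is that `generate_response` truncates its input to 512 tokens, while three 500-word chunks are much longer than that, so most of the retrieved context never reaches the model. A minimal sketch of trimming the retrieved chunks to a word budget before building the prompt follows; `build_prompt` and `max_context_words` are hypothetical names, and word counts only approximate token counts.

```python
def build_prompt(query, relevant_chunks, max_context_words=300):
    """Build the question/context prompt while capping the context length.

    Hypothetical helper: chunks are added in ranked order until the
    word budget is used up, so the best-ranked text is kept first.
    """
    kept_words = []
    for chunk in relevant_chunks:
        remaining = max_context_words - len(kept_words)
        if remaining <= 0:
            break
        kept_words.extend(chunk.split()[:remaining])
    context = " ".join(kept_words)
    return f"Question: {query}\nContext: {context}\nAnswer:"


prompt = build_prompt("What is shown?", ["a b c d", "e f g"], max_context_words=5)
print(prompt)
# Question: What is shown?
# Context: a b c d e
# Answer:
```

Capping the context this way keeps the trailing `Answer:` marker inside the model's input instead of losing it to truncation. It does not by itself make `t5-small` a stronger question-answering model.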
