### Retrieval Augmented Generation - PDF Data Extraction


- Creating Vector Embeddings
- Indexing PDF
- Storing Vectors in Database (Chroma)
- Querying PDF
- Using LangChain for Orchestration

In [1]:
!pip install -r requirements.txt


In [2]:
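# The loaders live in langchain_community; the old langchain.document_loaders path is a deprecated alias
from langchain_community.document_loaders import PyPDFDirectoryLoader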

def load_documents():
    document_loader = PyPDFDirectoryLoader("data")
    return document_loader.load()

In [3]:
documents = load_documents()

In [4]:
documents

[Document(metadata={'producer': 'PyPDF', 'creator': 'Microsoft Word', 'creationdate': '2025-05-27T07:13:25+00:00', 'author': 'Anaya', 'moddate': '2025-05-27T07:13:25+00:00', 'source': 'data/CNN.pdf', 'total_pages': 7, 'page': 0, 'page_label': '1'}, page_content='Convolutional Neural Network: A Quick Overview \nIn the world of AI, Machine Learning, Deep Learning, and Computer Vision, we have come \nacross and heard about various tasks like Image Classification, Object Detection, Image Pattern \nDetection, Text Classification, and Face Recognition. In this article, I have written a quick \noverview of Convolutional Neural Networks. \nConvolutional Neural Network (CNN or ConvNets) is a Deep Learning technique which is \ngenerally used to perform the tasks mentioned above. Here, the input is an image (simply a \nmatrix of pixels) which is fed into a CNN model that assigns some learnable weights and biases \nto various aspects of an image to analyze input images for recognition and classifi

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document


# Split the document into smaller chunks using LangChain's Recursive Character Text Splitter
def split_documents(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=80,
        length_function=len,
        is_separator_regex=False,
    )
    return text_splitter.split_documents(documents)
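
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a deliberately simplified sketch of overlapping fixed-width windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraphs and newlines); the helper name `chunk_with_overlap` and the sample string are hypothetical and exist only for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each window starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=2)
for piece in pieces:
    print(piece)
```

Each piece repeats the last two characters of the previous one, which is the role the 80-character overlap plays in the cell above: a sentence that straddles a chunk boundary is still fully visible in at least one chunk.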

In [6]:
chunks = split_documents(documents)
chunks

[Document(metadata={'producer': 'PyPDF', 'creator': 'Microsoft Word', 'creationdate': '2025-05-27T07:13:25+00:00', 'author': 'Anaya', 'moddate': '2025-05-27T07:13:25+00:00', 'source': 'data/CNN.pdf', 'total_pages': 7, 'page': 0, 'page_label': '1'}, page_content='Convolutional Neural Network: A Quick Overview \nIn the world of AI, Machine Learning, Deep Learning, and Computer Vision, we have come \nacross and heard about various tasks like Image Classification, Object Detection, Image Pattern \nDetection, Text Classification, and Face Recognition. In this article, I have written a quick \noverview of Convolutional Neural Networks. \nConvolutional Neural Network (CNN or ConvNets) is a Deep Learning technique which is \ngenerally used to perform the tasks mentioned above. Here, the input is an image (simply a \nmatrix of pixels) which is fed into a CNN model that assigns some learnable weights and biases \nto various aspects of an image to analyze input images for recognition and classifi

In [7]:
# Embedding for each chunk
from langchain_community.embeddings import HuggingFaceEmbeddings

# Used for:
# A. create database
# B. query database
def get_embedding_function():
    return HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
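
The same embedding function is used for both indexing and querying because similarity scores are only meaningful when the chunks and the question live in the same vector space. The sketch below illustrates the idea with cosine similarity over made-up three-dimensional vectors; the vectors, labels, and the `cosine_similarity` helper are hypothetical, and the metric the vector store actually uses depends on how it is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, not produced by any real model).
toy_index = {
    "chunk about CNN layers": [0.9, 0.1, 0.0],
    "chunk about pooling": [0.7, 0.3, 0.1],
    "chunk about invoices": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.0]

# Rank chunks from most to least similar to the query vector.
ranked = sorted(
    toy_index,
    key=lambda name: cosine_similarity(toy_query, toy_index[name]),
    reverse=True,
)
print(ranked)
```

If the query were embedded with a different model than the chunks, the coordinates would no longer be comparable and a ranking like this would be arbitrary.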

In [8]:
# clear database
import shutil
import os

if os.path.exists("chroma"):
    shutil.rmtree("chroma")

In [None]:
from langchain_community.vectorstores import Chroma
from langchain.schema import Document

def chroma_database(chunks: list[Document]):
    # Add non-duplicate chunks to Chroma vector DB
    
    # Connect to existing Chroma DB or create a new one
    db = Chroma(
        persist_directory="chroma",
        embedding_function=get_embedding_function()
    )

    # Add unique IDs to each chunk
    chunk_with_ids = assign_ids(chunks)

    # Get existing document IDs from the DB
    existing_ids = set(db.get(include=[])["ids"])
    print("Existing documents in database:", len(existing_ids))

    # Filter only new (non-duplicate) chunks
    new_chunks = [
        chunk for chunk in chunk_with_ids
        if chunk.metadata["id"] not in existing_ids
    ]

    # Add only new chunks
    if new_chunks:
        print("Adding new documents:", len(new_chunks))
        new_ids = [chunk.metadata["id"] for chunk in new_chunks]
        db.add_documents(new_chunks, ids=new_ids)
        db.persist()
    else:
        print("No new documents to add.")

def assign_ids(chunks: list[Document]) -> list[Document]:
    """Assign each chunk a unique ID of the form file:page_number:chunk_index."""

    last_page = None
    chunk_index = 0

    for chunk in chunks:
        src = chunk.metadata.get("source")
        page = chunk.metadata.get("page")
        page_id = f"{src}:{page}"

        if page_id == last_page:
            chunk_index += 1
        else:
            chunk_index = 0

        chunk.metadata["id"] = f"{page_id}:{chunk_index}"
        last_page = page_id

    return chunks
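
The ID scheme and the duplicate filter can be tried without a database. The sketch below mirrors the logic of `assign_ids` and the `existing_ids` check on plain dictionaries; `assign_ids_plain`, the toy chunks, and the pre-filled `existing_ids` set are hypothetical stand-ins for the `Document` objects and the Chroma lookup.

```python
def assign_ids_plain(chunks: list[dict]) -> list[dict]:
    """Tag each chunk dict with an ID of the form source:page:chunk_index."""
    last_page = None
    chunk_index = 0
    for chunk in chunks:
        page_id = f"{chunk['source']}:{chunk['page']}"
        # Count up within a page, restart at 0 when the page changes.
        chunk_index = chunk_index + 1 if page_id == last_page else 0
        chunk["id"] = f"{page_id}:{chunk_index}"
        last_page = page_id
    return chunks


toy_chunks = [
    {"source": "data/CNN.pdf", "page": 0},
    {"source": "data/CNN.pdf", "page": 0},
    {"source": "data/CNN.pdf", "page": 1},
]
ids = [chunk["id"] for chunk in assign_ids_plain(toy_chunks)]
print(ids)

# Pretend the first chunk was stored by an earlier run.
existing_ids = {"data/CNN.pdf:0:0"}
new_ids = [chunk_id for chunk_id in ids if chunk_id not in existing_ids]
print(new_ids)
```

Because the IDs are derived from the file, page, and position, re-running the indexing cell on unchanged PDFs adds nothing. The counter relies on chunks from the same page being adjacent in the list, and a chunk whose text changed but whose position did not would also be skipped.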


In [10]:
chroma_database(chunks)

Existing documents in database: 0
Adding new documents: 26




In [11]:
from transformers import pipeline
from langchain_community.vectorstores import Chroma

# Load a local language model
generator = pipeline("text2text-generation", model="google/flan-t5-base")

def query_rag(query_text):
    # Load the embedding function (e.g., HuggingFaceEmbeddings)
    embedding_function = get_embedding_function()

    # Load vector store
    db = Chroma(
        persist_directory="chroma",
        embedding_function=embedding_function
    )

    # Retrieve top-k most relevant documents
    result = db.similarity_search_with_score(query_text, k=5)
    context_text = "\n\n---\n\n".join([doc.page_content for doc, _ in result])

    # Format prompt
    prompt = f"Answer the question based on the context.\nContext: {context_text}\n\nQuestion: {query_text}"

    # Generate response using Hugging Face pipeline
    response = generator(prompt, max_new_tokens=256)[0]["generated_text"]

    # Format and print the response
    formatted_response = f"Question:\n{query_text}\n\nResponse:\n{response.strip()}"
    print(formatted_response)

Device set to use mps:0


In [12]:
query_rag("What are the documents about?")

Token indices sequence length is longer than the specified maximum sequence length for this model (671 > 512). Running this sequence through the model will result in indexing errors


Question:
What are the documents about?

Response:
In the world of AI, Machine Learning, Deep Learning, and Computer Vision, we have come across and heard about various tasks like Image Classification, Object Detection, Image Pattern Detection, Text Classification, and Face Recognition. In this article, I have written a quick overview of Convolutional Neural Networks.
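
The warning printed above (671 > 512) means the prompt built from five chunks is longer than the sequence length the model was specified for. One possible mitigation, sketched here under stated assumptions, is to keep retrieved chunks in rank order only while they fit a token budget. `fit_context`, `approx_token_count`, the `reserved` margin, and the toy chunks are all hypothetical, and whitespace word counts are only a crude proxy for the model's real tokenizer.

```python
def approx_token_count(text: str) -> int:
    """Crude proxy for token count: number of whitespace-separated words."""
    return len(text.split())


def fit_context(ranked_chunks, question, max_tokens=512, reserved=32,
                count_tokens=approx_token_count):
    """Keep the best-ranked chunks that fit within the remaining token budget.

    `reserved` leaves room for the instruction text and separators in the prompt.
    """
    budget = max_tokens - reserved - count_tokens(question)
    kept = []
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if cost > budget:
            break  # stop at the first chunk that no longer fits
        kept.append(chunk)
        budget -= cost
    return kept


# Three toy chunks of 200 "words" each, already sorted by relevance.
toy_ranked = ["alpha " * 200, "beta " * 200, "gamma " * 200]
kept = fit_context(toy_ranked, "What are the documents about?")
print(len(kept))
```

In `query_rag`, the `count_tokens` argument could be swapped for a function that measures length with the pipeline's own tokenizer, and a smaller `k` or `chunk_size` would reduce the prompt length as well.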


In [13]:
query_rag("Explain Convolutional Neural Networks")

Question:
Explain Convolutional Neural Networks

Response:
The relevant information is: Convolutional Neural Network (CNN or ConvNets) is a Deep Learning technique which is generally used to perform the tasks mentioned above. Here, the input is an image (simply a matrix of pixels) which is fed into a CNN model that assigns some learnable weights and biases to various aspects of an image to analyze input images for recognition and classification. Here, the input is an image (simply a matrix of pixels) which is fed into a CNN model that assigns some learnable weights and biases to various aspects of an image to analyze input images for recognition and classification. Here, the input is an image (simply a matrix of pixels) which is fed into a CNN model that assigns some learnable weights and biases to various aspects of an image to analyze input images for recognition and classification. Here, the input is an image (simply a matrix of pixels) which is fed into a CNN model that assigns so