### Loading PDF files

In [1]:
from langchain_community.document_loaders import (
    PyPDFLoader,
    PyMuPDFLoader,
    UnstructuredPDFLoader
)

In [10]:
# 1. PyPDFLoader
"""
✅ Simple and lightweight (pure-Python pypdf backend)

✅ Good for extracting plain text from PDFs

⚠️ Layout such as tables and multi-column text is often not preserved
"""

print("PyPDFLoader")

try:
    pypdf_loader = PyPDFLoader("data/Attention-is-all-you-need-Paper.pdf")
    pypdf_docs = pypdf_loader.load()
    print(f"No of pages loaded: {len(pypdf_docs)}")
    print(f"First document content preview: {pypdf_docs[0].page_content[:100]}...") 
    print(f"Metadata: {pypdf_docs[0].metadata}")

except Exception as e:
    print(f"Error loading PDF: {e}")

PyPDFLoader
No of pages loaded: 11
First document content preview: Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brai...
Metadata: {'producer': 'PyPDF2', 'creator': 'PyPDF', 'creationdate': '', 'subject': 'Neural Information Processing Systems http://nips.cc/', 'publisher': 'Curran Associates, Inc.', 'language': 'en-US', 'created': '2017', 'eventtype': 'Poster', 'description-abstract': 'The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attentionm echanisms.  We propose a novel, simple network architecture based solely onan attention mechanism, dispensing with recurrence and convolutions entirely.Experiments on two machine translation tasks show these models to be superiorin quality while being more parallelizable and requiring significantly less timeto train. Our 

In [11]:
# 2. PyMuPDFLoader
"""
✅ Typically faster than pypdf (built on the MuPDF C library)

✅ Returns richer metadata; the underlying PyMuPDF library can also report text positions

✅ Can handle scanned PDFs, but only with an added OCR step

⚠️ Slightly heavier dependency
"""

print("PyMuPDFLoader")

try:
    pymupdf_loader = PyMuPDFLoader("data/Attention-is-all-you-need-Paper.pdf")
    pymupdf_docs = pymupdf_loader.load()
    print(f"No of pages loaded: {len(pymupdf_docs)}")
    print(f"First document content preview: {pymupdf_docs[0].page_content[:100]}...")
    print(f"Metadata: {pymupdf_docs[0].metadata}")

except Exception as e:
    print(f"Error loading PDF: {e}")

PyMuPDFLoader
No of pages loaded: 11
First document content preview: Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brai...
Metadata: {'producer': 'PyPDF2', 'creator': '', 'creationdate': '', 'source': 'data/Attention-is-all-you-need-Paper.pdf', 'file_path': 'data/Attention-is-all-you-need-Paper.pdf', 'total_pages': 11, 'format': 'PDF 1.3', 'title': 'Attention is All you Need', 'author': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin', 'subject': 'Neural Information Processing Systems http://nips.cc/', 'keywords': '', 'moddate': '2018-02-12T21:22:10-08:00', 'trapped': '', 'modDate': "D:20180212212210-08'00'", 'creationDate': '', 'page': 0}


### Handling PDF Challenges ✅

* Text extraction → Fails on scanned or oddly encoded PDFs → Try pdfplumber or PyMuPDF for encoding issues; fall back to OCR (Tesseract) when there is no text layer.

* Layout loss → Broken lines/tables → Use Camelot, Tabula, or layout parsers.

* Mixed content → Text + images + tables → Handle separately (OCR + table extractor + text parser).

* Large files → Slow/heavy → Process page by page, chunk, or save extracted text and tables to CSV/Parquet.

* Scanned PDFs → No text → OCR with preprocessing (deskew, denoise).

* Encoding/language → Non-English issues → Use multilingual OCR (Google Vision, PaddleOCR).

* Password-protected → Locked PDFs → Decrypt with pikepdf or pypdf (the successor to PyPDF2).
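
The scanned-PDF and large-file points above can be combined into a cheap pre-check: stream pages one at a time and flag those with almost no extractable text as OCR candidates. The sketch below is a minimal illustration, not part of LangChain. `Page` is a hypothetical stand-in for a LangChain `Document` (it has the same `page_content` and `metadata` attributes), and `pages_needing_ocr` and its `min_chars` threshold of 20 are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class Page:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_needing_ocr(pages: Iterable[Page], min_chars: int = 20) -> Iterator[int]:
    """Yield page numbers whose extracted text is (almost) empty.

    min_chars is an assumed threshold; tune it for your documents.
    """
    for index, page in enumerate(pages):
        if len(page.page_content.strip()) < min_chars:
            # Fall back to the running index if the loader set no "page" key
            yield page.metadata.get("page", index)


sample = [
    Page("Attention Is All You Need ... plenty of extractable text here", {"page": 0}),
    Page("   \n ", {"page": 1}),  # whitespace only: likely a scanned page
    Page("", {"page": 2}),        # no text layer at all
]
flagged = list(pages_needing_ocr(sample))
print(flagged)
```

Because the function only iterates, it should also accept a generator such as `loader.lazy_load()`, so the pages need not all be held in memory at once.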

### PDF preprocessing pipeline

In [12]:
from langchain_community.document_loaders import PyPDFLoader
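# Splitters now live in langchain_text_splitters; older releases also expose langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter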

In [14]:
# 1. Load the PDF
# -----------------------------
loader = PyPDFLoader("data/Attention-is-all-you-need-Paper.pdf")  # Change file path
documents = loader.load()

print(f"✅ Loaded {len(documents)} pages from PDF")

✅ Loaded 11 pages from PDF


In [15]:
# 2. Split text into chunks
# -----------------------------
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk
    chunk_overlap=200,  # overlap to preserve context
    length_function=len
)

docs = text_splitter.split_documents(documents)
print(f"✅ Created {len(docs)} text chunks")

✅ Created 43 text chunks
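
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-width sliding window. It is only an illustration under that assumption: `RecursiveCharacterTextSplitter` itself prefers to break on separators such as paragraph and line breaks, so its chunks are usually shorter than `chunk_size` and its overlaps are not exact. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-width windows that share chunk_overlap characters.

    Simplified illustration only; not the algorithm LangChain uses.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghij" * 250  # 2,500 characters of dummy text
chunks = sliding_window_chunks(text)
print([len(c) for c in chunks])          # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbours share 200 chars
```

Each window advances by `chunk_size - chunk_overlap` characters, which is why a larger overlap produces more chunks for the same text. Note also that `split_documents` splits each page's document separately, so chunks do not span page boundaries.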


In [16]:
# 3. Clean text chunks
# -----------------------------
def clean_text(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    # str.split() with no arguments splits on any whitespace, including "\n"
    return " ".join(text.split())

for doc in docs:
    doc.page_content = clean_text(doc.page_content)

print("✅ Cleaned all text chunks")

✅ Cleaned all text chunks
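
One side effect of replacing line breaks with spaces is that words hyphenated across a line break in the PDF (for example `trans-` at the end of one line and `duction` at the start of the next) come out as `trans- duction`. A possible refinement, sketched below as an assumption rather than as part of the pipeline above, is to rejoin such words before collapsing whitespace. The name `clean_text_dehyphenated` is hypothetical.

```python
import re


def clean_text_dehyphenated(text):
    """Rejoin words split by a hyphen at a line break, then collapse whitespace.

    Caveat: a genuine compound broken at the hyphen ("self-\\nattention")
    is also joined, so treat this as a heuristic.
    """
    # "trans-\nduction" -> "transduction"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse newlines and repeated spaces into single spaces
    return " ".join(text.split())


raw = "sequence trans-\nduction  models\nare"
cleaned = clean_text_dehyphenated(raw)
print(cleaned)
```

Since the heuristic relies on the newline still being present, it works most reliably when applied to the page text before any other whitespace cleanup.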


In [17]:
# 4. Inspect a sample chunk
# -----------------------------
print("Sample chunk:\n")
print(docs[0].page_content[:500])

Sample chunk:

Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗† University of Toronto aidan@cs.toronto.edu Łukasz Kaiser ∗ Google Brain lukaszkaiser@google.com Illia Polosukhin∗‡ illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent o
