# Parsing PDFs ⚔️

+ PDF is not designed for parsing.
+ Its purpose is to ensure consistent printing across all operating systems.
+ Unfortunately, many documents are only available in PDF format, so we need to find a way to parse them.



## Types of PDF Files

### Selectable

+ May be parsed with PDF parsers.
+ Easier for OCR (Optical Character Recognition) models.

<div style="text-align: center;">
    <img src="images/selectable-pdf.png" width="40%"></img>
</div>

### Scanned/Handwritten


+ Challenging to parse, especially for the Persian language.
+ They can only be parsed using OCR.

<div style="text-align: center;">
    <img src="images/scanned.png" width="40%"></img>
</div>
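
A rough way to tell the two types apart is to check how many pages return a usable amount of text from a plain text extractor. The sketch below is only a heuristic: `looks_selectable`, `extract_page_texts`, and the `min_chars` / `min_ratio` thresholds are hypothetical names and untuned values, and PyPDF2 (covered in the next section) is assumed as the extractor.

```python
def looks_selectable(page_texts, min_chars=20, min_ratio=0.5):
    """Guess whether a PDF has a text layer from its per-page extracted text.

    min_chars and min_ratio are illustrative thresholds, not tuned values.
    """
    if not page_texts:
        return False
    # Count pages whose extracted text is long enough to be real content.
    pages_with_text = sum(
        1 for text in page_texts if len((text or "").strip()) >= min_chars
    )
    return pages_with_text / len(page_texts) >= min_ratio


def extract_page_texts(pdf_path):
    """Collect per-page text with PyPDF2 (imported lazily, so the
    heuristic above can be used without PyPDF2 installed)."""
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

For example, `looks_selectable(extract_page_texts("./data-sources/chapter02.pdf"))`. A scanned PDF that already carries an OCR text layer would also pass this check, so the result says nothing about the quality of the text.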


## PyPDF2

+ Good for (some) selectable PDFs only (no OCR).
+ Doesn't preserve document structure; it only extracts text.
+ Reverses the digit order of numbers in our Persian document.

In [1]:
from PyPDF2 import PdfReader

reader = PdfReader("./data-sources/chapter02.pdf")
number_of_pages = len(reader.pages)
number_of_pages

40

In [4]:
from PyPDF2 import PdfReader

def extract_text(pdf_path):
    with open(pdf_path, 'rb') as f:
        pdf = PdfReader(f)
        txt = ""
        for page in pdf.pages:
            txt += page.extract_text()
        return txt

texts = extract_text("./data-sources/chapter02.pdf")

In [5]:
print(texts[500:1000])

نهادها و مؤسسات عمومی غیردولتی  
مجری این وظایف . 
۲- خرید خدمات از بخش تعاونی و خصوصی و نهادها و مؤسسات عمومی غیردولتی. 
۳-  مشارکت با بخش تعاونی و خصوصی و نهادها و مؤسسات عمومی غیردولتی از طریق اجاره ، 
واگذاری امکانات و تجهیزات و منابع فیزیکی. 
۴-  واگذاری مدیریت واحد های دولتی  به بخش تعاونی و خصوصی و نهادها و مؤسسات عمومی 
غیردولتی با پرداخت تمام و یا بخشی از هزینه سرانه خدمات . 
۵- ایجاد و اداره واحد های دولتی موضوع این ماده توسط دستگاههای اجرایی. 
تبصره  ۱-  اگر انجام امور موضوع این ماده 


In [7]:
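# Write with an explicit encoding so the Persian text is saved correctly on every platform.
with open("chapter2-pypdf2.txt", "wt", encoding="utf-8") as file:
    file.write(texts)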
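
If the reversed numbers matter downstream, one possible post-processing step is to flip every run of two or more digits back. This is a sketch under a strong assumption: that each multi-digit number was reversed on its own and that nothing else was. `reverse_digit_runs` is a hypothetical helper, not part of PyPDF2, and the result should be spot-checked against the source PDF.

```python
import re

# Runs of two or more ASCII or Persian (U+06F0 to U+06F9) digits.
_DIGIT_RUN = re.compile(r"[0-9\u06F0-\u06F9]{2,}")


def reverse_digit_runs(text):
    """Reverse each multi-digit run in place; single digits are left alone.

    Assumption: the extractor reversed every multi-digit number individually.
    """
    return _DIGIT_RUN.sub(lambda match: match.group(0)[::-1], text)
```

Numbers split by separators (dates, decimals) are treated as independent runs, so they may need separate handling.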

## OCR models
+ [pytesseract](https://pypi.org/project/pytesseract/) (Good for Persian)
+ [EasyOCR](https://github.com/JaidedAI/EasyOCR)
+ Don't preserve document structure; they only extract text.
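
A minimal sketch of how pytesseract could be applied to a scanned PDF. It assumes the Tesseract binary and its Persian language data (`fas`) are installed, and it uses `pdf2image` (which needs Poppler) to render pages to images; `pdf2image` is not used elsewhere in this notebook. `ocr_pages`, `ocr_pdf`, and the `ocr_fn` parameter are hypothetical names introduced for illustration.

```python
def ocr_pages(images, lang="fas", ocr_fn=None):
    """Run OCR on a sequence of page images and return one string per page.

    lang="fas" is Tesseract's language code for Persian.
    ocr_fn lets another engine be swapped in; it defaults to pytesseract.
    """
    if ocr_fn is None:
        import pytesseract  # lazy import: requires the Tesseract binary

        ocr_fn = pytesseract.image_to_string
    return [ocr_fn(image, lang=lang) for image in images]


def ocr_pdf(pdf_path, lang="fas", dpi=300):
    """Render each PDF page to an image, then OCR it (assumes pdf2image + Poppler)."""
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(ocr_pages(images, lang=lang))
```

The `dpi=300` default is a common starting point rather than a measured optimum; higher values may help with small print at the cost of speed.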

## Docling
+ https://docling-project.github.io/docling/
+ https://docling-project.github.io/docling/installation/

Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich unified representation including document layout, tables etc., making them ready for generative AI workflows like RAG.

+ OCR capability
+ Preserves document structure
+ Chunking capability

<div style="text-align: center;">
    <img src="./images/docling.png" width="60%"></img>
</div>

<div style="text-align: center;">
    <img src="./images/docling_processing.png" width="50%"></img>
</div>

<div style="text-align: center;">
    <img src="./images/docling-pipeline.png" width="60%"></img>
</div>

Docling custom conversion: https://docling-project.github.io/docling/examples/custom_convert/

In [2]:
import json
import time
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

input_doc_path = "./data-sources/chapter02-05.pdf"

# Docling Parse without EasyOCR
# -------------------------
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

###########################################################################

start_time = time.time()
conv_result = doc_converter.convert(input_doc_path)
elapsed = time.time() - start_time

print(f"Document converted in {elapsed:.2f} seconds.")

## Export results
output_dir = Path("scratch-docling-parse")
output_dir.mkdir(parents=True, exist_ok=True)
doc_filename = conv_result.input.file.stem

# Export Docling document JSON format:
with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
    fp.write(json.dumps(conv_result.document.export_to_dict(), indent=2, ensure_ascii=False))

# Export Markdown format:
with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
    fp.write(conv_result.document.export_to_markdown())

# Export Document Tags format:
with (output_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
    fp.write(conv_result.document.export_to_doctags())

2025-10-30 20:55:05,394 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-10-30 20:55:05,614 - INFO - Going to convert document batch...
2025-10-30 20:55:05,615 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 75463f421d05cb4304e1f714cf00d35d
2025-10-30 20:55:06,734 - INFO - Loading plugin 'docling_defaults'
2025-10-30 20:55:06,744 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-30 20:55:06,821 - INFO - Loading plugin 'docling_defaults'
2025-10-30 20:55:06,841 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-10-30 20:55:06,877 - INFO - Accelerator device: 'cuda:0'
2025-10-30 20:55:11,952 - INFO - Accelerator device: 'cuda:0'
2025-10-30 20:55:13,416 - INFO - Processing document chapter02-05.pdf
2025-10-30 20:56:16,701 - INFO - Finished converting document chapter02-05.pdf in 71.32 sec.


Document converted in 71.33 seconds.


In [None]:
# del doc_converter  # (free memory after conversion)

In [3]:
import pickle

with open('chapter02-04-doclingdoc.pkl', 'wb') as file:
    pickle.dump(conv_result.document, file)


### Checkpoint

In [4]:
import pickle

with open('chapter02-04-doclingdoc.pkl', 'rb') as file:
    docling_document = pickle.load(file)
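
Since the conversion above took over a minute, the save and load cells can be folded into one "load if present, otherwise compute and save" helper. This is a generic pickle-based sketch, not a Docling feature; `load_or_compute` is a hypothetical name.

```python
import pickle
from pathlib import Path


def load_or_compute(checkpoint_path, compute_fn):
    """Return the pickled object at checkpoint_path if it exists;
    otherwise call compute_fn(), pickle its result there, and return it."""
    path = Path(checkpoint_path)
    if path.exists():
        with path.open("rb") as file:
            return pickle.load(file)
    result = compute_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as file:
        pickle.dump(result, file)
    return result
```

With the names from the cells above, usage might look like `docling_document = load_or_compute("chapter02-04-doclingdoc.pkl", lambda: doc_converter.convert(input_doc_path).document)`. Pickle files should only be loaded from trusted sources, and they may stop loading after a library upgrade changes the pickled classes.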
