---
title: "Parsing PDFs is hard"
author: "Safouane Chergui"
date: "2025-08-27"
format: html
toc: true
toc-location: body
toc-depth: 4
categories: [Python, PDF]
---

Lately, I've been working on a project with a customer where the goal is to extract specific information from PDF documents. To my surprise, this task has proven to be quite challenging.

The main challenges stem from the following elements:

- **Privacy concerns:** using an external API to parse the PDFs was a no-go, as the PDFs contained sensitive information.
- **Complexity of the documents:** the PDFs at hand contained a mix of text, complex tables, and math formulas, all of which needed to be processed and understood in context.
- **Absence of a clear structure:** the lack of consistent formatting and structure across the documents made it difficult to apply a single parsing approach across the board.

<br><br><br>

<div align="center">
<img src="./assets/pdf-icon.png" alt="PDF icon" width="25%">
</div>

## What the hell is a PDF document?

### A first glimpse at a PDF file's internal structure

Have you ever opened a PDF document with Notepad or VS Code instead of your preferred PDF reader? If you do, you'll stumble upon something that looks like this:

<div align="center">

![PDF Internal Structure](./assets/example_pdf_start.png)

*Figure 1: Internal structure of a PDF document when viewed as raw text*

</div>

If you'd like to see the full PDF internal structure, you can find the example PDF [here](https://gist.github.com/chsafouane/0079eb20531a0effb632e9aea7ddfabe?short_path=b320036).
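If you'd rather poke at those raw bytes from Python than from a text editor, here is a minimal sketch. The `peek_pdf` helper and the `SAMPLE_PDF` bytes are made up for illustration: the sample only mimics the header, a single object, and the end-of-file marker, and it is not a complete, renderable PDF. The sketch relies on two things a PDF file normally has, a `%PDF-x.y` header at the start and a `%%EOF` marker near the end.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical, hand-written PDF-like bytes used only for this illustration.
# Real PDFs also contain a cross-reference table, a trailer, and usually
# compressed binary streams.
SAMPLE_PDF = (
    b"%PDF-1.7\n"
    b"1 0 obj\n"
    b"<< /Type /Catalog >>\n"
    b"endobj\n"
    b"%%EOF\n"
)


def peek_pdf(path, max_lines=5):
    """Return the PDF version, the first raw lines, and whether %%EOF is near the end."""
    raw = Path(path).read_bytes()
    # The file is expected to start with a header such as %PDF-1.7
    match = re.match(rb"%PDF-(\d+\.\d+)", raw)
    version = match.group(1).decode("ascii") if match else None
    # latin-1 maps every byte to a character, so binary content never raises
    lines = raw.decode("latin-1").splitlines()[:max_lines]
    # Only look at the tail of the file for the end-of-file marker
    has_eof = b"%%EOF" in raw[-1024:]
    return version, lines, has_eof


# Write the sample to a temporary file and inspect it
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.pdf"
    sample_path.write_bytes(SAMPLE_PDF)
    version, first_lines, has_eof = peek_pdf(sample_path)

print("version:", version)
for line in first_lines:
    print(line)
```

Pointing `peek_pdf` at a real PDF should show the same kind of header and object markers as in Figure 1, followed by content that is much harder to read, since most streams are compressed.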

graph TD
    subgraph "Start"
        A
    end

    subgraph "Stage 1: Ingestion & Parsing"
        A --> B(PDF Backend);
        B --> C;
        B --> D;
    end

    subgraph "Stage 2: AI-Powered Structural Understanding"
        D --> E{"Layout Segmentation<br>(DocLayNet-trained Model)"};
        E --> F;
        
        subgraph "Table Processing"
            F -- Table Detected --> G;
            G --> H(Table Structure Recognition<br>TableFormer Model);
            H --> I;
            I & C --> J;
        end

        subgraph "Text & Other Element Processing"
            F -- Text/List/Title Detected --> K;
        end

        subgraph "Scanned Content Processing (Optional)"
            D --> L(OCR Engine<br>e.g., Tesseract, EasyOCR);
            L --> M;
        end
    end

    subgraph "Stage 3: Assembly & Finalization"
        J --> N(Assembly Stage);
        K --> N;
        M --> N;
        N --> O;
    end

    subgraph "Stage 4: Export & RAG Integration"
        O --> P{Export or Integrate};
        P --> Q;
        P --> R(Structure-Aware Chunking);
        R --> S;
        S --> T((LLM for RAG Application));
    end

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style O fill:#bbf,stroke:#333,stroke-width:4px
    style T fill:#9f9,stroke:#333,stroke-width:2px