---
title: "PDF parsing is hard"
author: "Safouane Chergui"
date: "2025-08-27"
format: html
toc: true
toc-location: body
toc-depth: 4
categories: [Python, PDF]
---

The goal of this blog post is to explain what a PDF is internally and why parsing PDF files is harder than it looks.

Lately, I've been working on a project with a customer where the goal is to extract specific information from PDF documents. To my surprise, this task has proven to be quite challenging.

The main challenges stem from the following elements:

- **Privacy concerns:** using an external API to parse the PDFs was a no-go as the PDFs contained sensitive information.
- **Complexity of the documents:** the PDFs at hand contained a mix of text, complex tables, and math formulas, all of which needed to be processed and understood in context.
- **Absence of a clear structure:** the lack of consistent formatting and structure across the documents made it difficult to apply standard parsing across the board.

<br><br>
<div align="center">
<img src="./assets/pdf-icon.png" alt="PDF icon" width="25%" style="display: block; margin: 0 auto;">
</div>
<br>

## What the hell is a PDF document?

### A first glimpse at a PDF file's internal structure

Have you ever opened a PDF document with Notepad or VS Code instead of your preferred PDF reader? If you do, you'll stumble upon something that looks like this:

<div align="center">

<img src="./assets/example_pdf_start.png" alt="PDF Internal Structure" style="display: block; margin: 0 auto;">

<p align="center"><em>Figure 1: Internal structure of a PDF document when viewed as raw text</em></p>

</div>

If you'd like to see the full PDF internal structure, you can find the example PDF [here](https://gist.github.com/chsafouane/0079eb20531a0effb632e9aea7ddfabe?short_path=b320036).
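If you'd rather reproduce this view from Python than from a text editor, here is a minimal sketch. The function name `preview_raw_pdf` is my own invention, and the file path you pass in is whatever PDF you have lying around. It reads the file as bytes and decodes each line with Latin-1, which maps every byte to a character, so compressed binary streams show up as gibberish instead of raising a decoding error.

```python
from pathlib import Path


def preview_raw_pdf(path: str, n_lines: int = 10) -> list[str]:
    """Return the first n_lines of a PDF file as raw text."""
    raw = Path(path).read_bytes()
    # Split on byte-level line endings first, then decode each line.
    # Latin-1 never fails: every byte value maps to exactly one character.
    return [line.decode("latin-1") for line in raw.splitlines()[:n_lines]]
```

Calling `preview_raw_pdf("example.pdf")` (a hypothetical path) and printing each returned line should give you roughly what a text editor shows at the top of the file.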

### PDF Page Description Language

To understand why PDFs are hard to parse, one must understand how a PDF file is built.

A PDF file is based on a Page Description Language (PDL), which is a language used to describe the layout and appearance of a printed page. PDF's PDL provides a standardized set of commands to reconstruct a page faithfully.

As a result, a PDF file is essentially a collection of instructions for rendering a page, rather than a linear sequence of text and images. If you look at the example PDF available in the GitHub gist, you'll see the following commands starting at line 34:

```pdf
/F1 18 Tf
100 700 Td
(This is a PDF tutorial) Tj
```

Here is what these instructions do:

- `/F1 18 Tf`: set the font to F1 with size 18
- `100 700 Td`: move the text position to (100, 700)
- `(This is a PDF tutorial) Tj`: show the text string
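To get a feel for how a parser consumes these commands, here is a toy interpreter in Python. It is only a sketch under heavy assumptions: the names `TOKEN` and `interpret` are mine, it knows just the three operators above, and it ignores everything a real content stream can contain (transformation matrices, escaped strings, hex strings, arrays, and so on). Note that operands come first and the operator comes last, so the interpreter stacks operands until it meets an operator.

```python
import re

# One alternative per token type; the named group tells us which one matched.
TOKEN = re.compile(
    r"\((?P<string>(?:\\.|[^\\()])*)\)"  # literal string: (text)
    r"|/(?P<name>[^\s/()<>\[\]]+)"       # name: /F1
    r"|(?P<number>[-+]?\d*\.?\d+)"       # number: 18, 100, 700
    r"|(?P<operator>[A-Za-z*']+)"        # operator: Tf, Td, Tj
)


def interpret(stream: str) -> list[dict]:
    """Execute a tiny subset of text operators and return the shown strings."""
    operands = []
    state = {"font": None, "size": None, "x": 0.0, "y": 0.0}
    shown = []
    for match in TOKEN.finditer(stream):
        kind = match.lastgroup
        value = match.group(kind)
        if kind == "number":
            operands.append(float(value))
        elif kind in ("string", "name"):
            operands.append(value)
        else:
            if value == "Tf":  # set font name and size
                state["font"], state["size"] = operands
            elif value == "Td":  # offset relative to the current line start
                tx, ty = operands
                state["x"] += tx
                state["y"] += ty
            elif value == "Tj":  # show a string at the current position
                shown.append({"text": operands[0], **state})
            operands = []  # operands are consumed by the operator
    return shown


stream = """
/F1 18 Tf
100 700 Td
(This is a PDF tutorial) Tj
"""
print(interpret(stream))
```

Running it on the three lines above yields a single record: the text, the font `F1` at size 18, and the position (100, 700). That is all the file tells you: a string and where to paint it.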

Every PDF looks just like this: a precise sequence of commands that specify what to draw and at exactly which coordinates. It does not contain a semantic representation of its content. It does not state, "This is a paragraph that flows through two columns" or "this is a table".

A table, for example, is just a grid of lines and text positioned at specific coordinates. There are no inherent relationships between the cells, no indication of headers or footers, and no understanding of the data contained within.

So when a parser sees what's supposed to be a table, it sees just a bunch of lines and text. Its rather difficult task is to infer the structure and the relationships between these elements.

This lack of semantic structure makes it challenging to parse complex PDF documents.
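To make the inference problem concrete, here is a naive sketch of what a parser has to do to recover table rows from positioned text. Everything in it is an assumption made for illustration: the `(x, y, text)` fragment format, the function name `group_into_rows`, the sample data, and the 2-point tolerance. It groups fragments whose vertical positions are close enough, then orders each row from left to right.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Cluster (x, y, text) fragments into rows, top to bottom, left to right."""
    rows = []
    # PDF coordinates start at the bottom-left corner, so a larger y is higher
    # on the page: sort by descending y to read from top to bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells by their horizontal position.
    return [
        [text for _, text in sorted(row["cells"], key=lambda c: c[0])]
        for row in rows
    ]


# Hypothetical fragments, in the arbitrary order a PDF might emit them.
fragments = [
    (300, 700.5, "Price"), (100, 700.0, "Item"),
    (100, 680.0, "Apple"), (300, 679.6, "1.20"),
    (300, 660.0, "0.80"), (100, 660.3, "Pear"),
]
for row in group_into_rows(fragments):
    print(row)
```

This works on a clean grid, but it has no notion of merged cells, multi-line cells, or which row is the header, and it breaks as soon as rows sit closer together than the tolerance. Real documents hit all of these cases.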

### The internal structure of a PDF

What you see in Figure 1 or in the gist file is the internal structure of a PDF document. Let us dive into the key components that make up this structure.

A PDF is composed internally of four sections:

<div align="center">

<img src="./assets/pdf_internal_structure.png" alt="PDF Internal Structure Components" style="display: block; margin: 0 auto;">

<p align="center"><em>Figure 2: Internal structure of a PDF document</em></p>
<p align="center"><small>Source: <a href="https://www.researchgate.net/figure/An-example-of-the-PDF-file-structure_fig1_360275035">ResearchGate - An example of the PDF file structure</a></small></p>

</div>

#### The header

The header of a PDF file tells you which version of the PDF specification was used to generate it. It is always the first line of the file and starts with the `%PDF-` marker. In Figure 1, it corresponds to `%PDF-1.7`.
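Reading the version out of the header takes only a few lines of Python. The sketch below assumes the header sits at the very start of the data, as described above. The names `HEADER` and `pdf_version` are my own.

```python
import re

# The header marker, followed by a major.minor version number.
HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_version(data: bytes):
    """Return (major, minor) from the PDF header, or None if it is missing."""
    match = HEADER.match(data)  # match() only looks at the start of the data
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(pdf_version(b"%PDF-1.7\n1 0 obj\n"))
```

For the file in Figure 1, this gives major version 1 and minor version 7.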

#### The body

Now, the body is where you "define" the content of the PDF.