arnabdotorg/pdfcx

pdfcx

pdf-canonical-extraction. An open specification for embedding canonical structured data directly in PDF documents. No business model. No vendor. No roadmap. Just one attachment.


The thesis

It's time for document extraction to die.

Companies spend fortunes pushing structured data into PDF documents. Other companies spend fortunes pulling that same data back out, imperfectly, through OCR and machine learning and best-effort parsing. Accuracy disappears in the middle. All because the world is still thinking in terms of paper compatibility, and nobody prints anymore.

PDFs are not the villain. They are excellent at human consumption, portable as a record, durable across decades. We are not proposing their retirement. We are proposing their completion.

Machine readability

For humans and agents alike. Accessibility tools reconstruct tables and forms from rendered glyphs. A pdfcx record hands them the truth directly. And every AI agent that will soon touch your PDFs (to file taxes, reconcile invoices, summarise lab reports, fill forms, search case law) is forced today to OCR and guess. Accessibility was always the argument. AI agents just make it urgent.

The proposal

Embed a single record of the document's structured data directly inside the PDF as an attached file. Or reference it by URL. The PDF specification has allowed file attachments since 1999. We just require a specific string in the attachment's description.

Nickname:   pdfcx
/Desc:      exactly pdf-canonical-extraction
Formats:    JSON · Parquet · SQLite
Transport:  embedded file attachment, or URL
Auth:       optional; the PDF already carries the human view

The whole spec, in one sentence

Specifications are hard to adopt. So this one is a single line:

Attach one file to your PDF whose /Desc is pdf-canonical-extraction.

That is the entire spec.

One attachment. No fee. No vendor. No roadmap.
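
As a rough illustration of how literally that one sentence can be read, here is a minimal Node.js sketch that picks the pdfcx record out of a list of attachments. The `{ name, description, bytes }` shape and the name `selectPdfcxAttachments` are assumptions made for this example; a real PDF library will expose attachments in its own way.

```javascript
'use strict';

// The one string the spec requires in the attachment's /Desc entry.
const PDFCX_DESC = 'pdf-canonical-extraction';

// `attachments` is a hypothetical array of { name, description, bytes }
// objects; this shape is an assumption, not part of the spec.
function selectPdfcxAttachments(attachments) {
  // "Exactly" is read literally here: no trimming, no case folding.
  return attachments.filter((a) => a.description === PDFCX_DESC);
}

const found = selectPdfcxAttachments([
  { name: 'logo.png', description: 'company logo', bytes: Buffer.alloc(0) },
  { name: 'invoice.json', description: 'pdf-canonical-extraction', bytes: Buffer.from('{"total":42}') },
  { name: 'draft.json', description: 'PDF-Canonical-Extraction', bytes: Buffer.from('{}') },
]);
console.log(found.map((a) => a.name)); // [ 'invoice.json' ]
```

The strict comparison is a design choice of this sketch: a description that differs in case or whitespace is treated as not being a pdfcx record.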

This repository

File                                Purpose
index.html · style.css · script.js  The manifesto, rendered, and an in-browser drag-and-drop demo.
sample.js                           A minimal Node.js reference: generate and read a PDF with a pdfcx record.
skills/pdfcx/SKILL.md               A Claude Code skill: give this to an agent and it will emit pdfcx-compliant PDFs.
promote.txt                         Public roster of adopters. Pull requests welcome.
warn.txt                            List of known misrepresenters. Evidence required.
LICENSE                             Apache 2.0.

Quickstart

Browser demo

Visit the hosted page: pdf.cx. Drop a PDF on the demo tile. If it carries a pdfcx record, you'll see the embedded JSON. If not, you'll see why the spec exists.

Or click "generate a sample" and you'll get a sample invoice PDF with a pdfcx record embedded. Then drop it back in.

Node demo

npm install
node sample.js write   # generates ./sample.pdf with an embedded pdfcx record
node sample.js read    # reads ./sample.pdf and prints the pdfcx record
node sample.js         # both, in sequence
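
To give a feel for what reading a record involves, here is a deliberately naive sketch that uses only the Node.js standard library. It is not the repository's sample.js. The function name `readPdfcxRecord` is made up for this example, and the approach assumes an unencrypted PDF whose objects are stored uncompressed, with a direct /Length and no stream filters. Many real PDFs do not meet those assumptions, so a proper PDF library is the safer route in practice.

```javascript
'use strict';

const PDFCX_DESC = /\/Desc\s*\(pdf-canonical-extraction\)/;

// Naive reader: find the file specification whose /Desc is the pdfcx string,
// follow its /EF /F reference, and return the embedded stream bytes.
function readPdfcxRecord(pdfBytes) {
  // latin1 maps one byte to one character, so string indexes are byte offsets.
  const text = pdfBytes.toString('latin1');
  const objects = new Map();
  for (const m of text.matchAll(/(\d+)\s+(\d+)\s+obj\b([\s\S]*?)\bendobj\b/g)) {
    const bodyStart = m.index + m[0].length - 'endobj'.length - m[3].length;
    objects.set(`${m[1]} ${m[2]}`, { body: m[3], bodyStart });
  }
  for (const { body } of objects.values()) {
    if (!PDFCX_DESC.test(body)) continue;
    const ref = /\/EF\s*<<[\s\S]*?\/F\s+(\d+)\s+(\d+)\s+R/.exec(body);
    const target = ref && objects.get(`${ref[1]} ${ref[2]}`);
    if (!target) continue;
    const length = /\/Length\s+(\d+)/.exec(target.body); // assumes a direct number
    const keyword = /stream\r?\n/.exec(target.body);
    if (!length || !keyword) continue;
    const dataStart = target.bodyStart + keyword.index + keyword[0].length;
    return pdfBytes.subarray(dataStart, dataStart + Number(length[1]));
  }
  return null; // no pdfcx record found by this simple scan
}

// A hand-written fragment standing in for a real file.
const sample = Buffer.from(
  '4 0 obj\n<< /Type /Filespec /F (r.json) /Desc (pdf-canonical-extraction) /EF << /F 5 0 R >> >>\nendobj\n' +
  '5 0 obj\n<< /Type /EmbeddedFile /Length 12 >>\nstream\n{"total":42}\nendstream\nendobj\n',
  'latin1'
);
console.log(JSON.parse(readPdfcxRecord(sample).toString('utf8'))); // { total: 42 }
```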

Adopt the spec

  1. Make your PDF generator attach a file whose description is exactly pdf-canonical-extraction. See skills/pdfcx/SKILL.md for recipes in Python, JavaScript, Java, Go, Rust, and .NET.
  2. Ship it.
  3. Open a pull request adding yourself to promote.txt.
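
For a sense of how little step 1 asks of a generator, the sketch below writes a one-page PDF by hand, with a single embedded JSON file whose /Desc is the required string. It is an illustration under assumptions, not one of the recipes in skills/pdfcx/SKILL.md: the function name `buildPdfcxPdf`, the file name, and the record contents are invented here, the page is blank, and the file name must not contain parentheses or backslashes because it is written into a PDF string literal unescaped.

```javascript
'use strict';

// Build a one-page PDF with one embedded file whose /Desc is the pdfcx string.
function buildPdfcxPdf(record, fileName = 'record.json') {
  const json = Buffer.from(JSON.stringify(record), 'utf8');
  const objects = [
    // 1: catalog, with the EmbeddedFiles name tree pointing at the file spec.
    Buffer.from(`<< /Type /Catalog /Pages 2 0 R /Names << /EmbeddedFiles << /Names [(${fileName}) 4 0 R] >> >> >>`),
    // 2 and 3: page tree with a single blank page.
    Buffer.from('<< /Type /Pages /Kids [3 0 R] /Count 1 >>'),
    Buffer.from('<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>'),
    // 4: file specification; /Desc is the only thing pdfcx itself asks for.
    Buffer.from(`<< /Type /Filespec /F (${fileName}) /UF (${fileName}) /Desc (pdf-canonical-extraction) /EF << /F 5 0 R >> >>`),
    // 5: embedded file stream carrying the JSON bytes, uncompressed.
    Buffer.concat([
      Buffer.from(`<< /Type /EmbeddedFile /Subtype /application#2Fjson /Length ${json.length} >>\nstream\n`),
      json,
      Buffer.from('\nendstream'),
    ]),
  ];

  const chunks = [Buffer.from('%PDF-1.7\n')];
  const offsets = [];
  let position = chunks[0].length;
  objects.forEach((body, i) => {
    const chunk = Buffer.concat([Buffer.from(`${i + 1} 0 obj\n`), body, Buffer.from('\nendobj\n')]);
    offsets.push(position); // byte offset of this object, needed by the xref table
    chunks.push(chunk);
    position += chunk.length;
  });

  // Cross-reference table: every entry is exactly 20 bytes long.
  let xref = `xref\n0 ${objects.length + 1}\n0000000000 65535 f \n`;
  for (const offset of offsets) xref += `${String(offset).padStart(10, '0')} 00000 n \n`;
  const trailer = `trailer\n<< /Size ${objects.length + 1} /Root 1 0 R >>\nstartxref\n${position}\n%%EOF\n`;
  return Buffer.concat([...chunks, Buffer.from(xref + trailer)]);
}

const pdf = buildPdfcxPdf({ invoice: 'INV-001', total: 42.5, currency: 'EUR' });
console.log(pdf.length > 0); // true
```

Writing the returned buffer to disk with `fs.writeFileSync` should give a file whose attachment panel lists the record. A production generator would normally add the attachment through its existing PDF library rather than assemble objects by hand.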

Use the Claude Code skill

Drop skills/pdfcx/SKILL.md into your project's .claude/skills/ directory (or install it globally). Claude Code will use it any time you ask the agent to produce a PDF from structured data.

The only real risk

Misrepresentation: a PDF that shows one thing in its human view and claims another in its data. We treat that as malpractice. Names appear in warn.txt with evidence.
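
The spec does not define how a mismatch is detected. As one possible sanity check, and purely as an assumption of this sketch, a consumer could confirm that every value in the record also appears somewhere in the document's visible text. The names `leafValues` and `valuesMissingFromView` are hypothetical, and `visibleText` is assumed to come from whatever text extraction the consumer already has. Verbatim matching is crude (it will flag differences in number formatting, for instance), so a hit is a prompt for review, not proof of misrepresentation.

```javascript
'use strict';

// Collect every leaf value of a JSON-like record as a string, with its path.
function leafValues(value, path = '', out = []) {
  if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value)) {
      leafValues(child, path ? `${path}.${key}` : key, out);
    }
  } else {
    out.push({ path, text: String(value) });
  }
  return out;
}

// Return the paths of record values that do not appear verbatim in the visible text.
function valuesMissingFromView(record, visibleText) {
  return leafValues(record)
    .filter(({ text }) => !visibleText.includes(text))
    .map(({ path }) => path);
}

const visibleText = 'Invoice INV-001  Total due: 42.50 EUR';
const record = { invoice: 'INV-001', total: '42.50', currency: 'USD' };
console.log(valuesMissingFromView(record, visibleText)); // [ 'currency' ]
```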

The economic reality

Some companies deliberately rasterise their PDF exports to PNG, rotate them by one to five degrees, and re-export, making their own documents harder to read back. Other companies have built entire businesses, entire investor rounds, on extracting data that was never supposed to have been lost.

We understand. But it's time to put users first.

We acknowledge that many companies and investors are counting on the document-extraction business model to continue. It is time to put that era behind us: the accuracy of business data is worth more than the revenue from approximately extracting it.

Acknowledgements

We are not the first to say this. Dittrich & Bender's Janiform Intra-Document Analytics for Reproducible Research (VLDB 2015) introduced Portable Database Files for research papers. Germany and France have mandated ZUGFeRD / Factur-X for electronic invoicing, using PDF/A-3 with embedded CII XML per EN 16931. The US SEC and the EU ESMA require Inline XBRL for financial filings. pdfcx is not invention. It is consensus: that every PDF, not just those from tax authorities and research groups, should carry its own truth.

License

Apache 2.0. See LICENSE.
