pdf-canonical-extraction. An open specification for embedding canonical structured data directly in PDF documents. No business model. No vendor. No roadmap. Just one attachment.
It's time for document extraction to die.
Companies spend fortunes pushing structured data into PDF documents. Other companies spend fortunes pulling that same data back out, imperfectly, through OCR and machine learning and best-effort parsing. Accuracy disappears in the middle. All because the world is still thinking in terms of paper compatibility, and nobody prints anymore.
PDFs are not the villain. They are excellent at human consumption, portable as a record, durable across decades. We are not proposing their retirement. We are proposing their completion.
For humans and agents alike. Accessibility tools reconstruct tables and forms from rendered glyphs. A pdfcx record hands them the truth directly. And every AI agent that will soon touch your PDFs (to file taxes, reconcile invoices, summarise lab reports, fill forms, search case law) is forced today to OCR and guess. Accessibility was always the argument. AI agents just make it urgent.
Embed a single record of the document's structured data directly inside the PDF as an attached file. Or reference it by URL. The PDF specification has allowed file attachments since 1999. We just require a specific string in the attachment's description.
| Nickname | pdfcx |
|---|---|
| /Desc | exactly pdf-canonical-extraction |
| Formats | JSON · Parquet · SQLite |
| Transport | Embedded file attachment, or URL |
| Auth | optional. the PDF already carries the human view |
Specifications are hard to adopt. So this one is a single line:
Attach one file to your PDF whose `/Desc` is `pdf-canonical-extraction`. That is the entire spec.
One attachment. No fee. No vendor. No roadmap.
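To make the one-line rule concrete, here is a minimal, dependency-free sketch that hand-assembles a one-page PDF carrying a JSON record as an embedded file. It is not the repository's `sample.js`, and real generators would normally use a PDF library instead. The function name `buildPdfcxPdf`, the default filename `record.json`, and the blank page are assumptions made for illustration. The only part taken from the spec is the `/Desc` string on the file specification.

```javascript
// Sketch only: builds a tiny PDF by hand to show where /Desc lives.
const PDFCX_DESC = 'pdf-canonical-extraction';

function buildPdfcxPdf(record, filename = 'record.json') {
  // Assumption: the filename is simple ASCII, so it needs no escaping
  // inside a PDF literal string.
  if (!/^[A-Za-z0-9._-]+$/.test(filename)) {
    throw new Error('filename must be simple ASCII for this sketch');
  }
  const payload = Buffer.from(JSON.stringify(record), 'utf8');

  const bodies = [
    // 1: catalog, whose EmbeddedFiles name tree points at the file spec
    Buffer.from(`<< /Type /Catalog /Pages 2 0 R /Names << /EmbeddedFiles << /Names [(${filename}) 4 0 R] >> >> >>`),
    // 2: page tree
    Buffer.from('<< /Type /Pages /Kids [3 0 R] /Count 1 >>'),
    // 3: one blank page (the human view would be drawn here)
    Buffer.from('<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>'),
    // 4: file specification carrying the required description
    Buffer.from(`<< /Type /Filespec /F (${filename}) /UF (${filename}) /Desc (${PDFCX_DESC}) /EF << /F 5 0 R >> >>`),
    // 5: the embedded file stream itself, stored uncompressed
    Buffer.concat([
      Buffer.from(`<< /Type /EmbeddedFile /Subtype /application#2Fjson /Length ${payload.length} >>\nstream\n`),
      payload,
      Buffer.from('\nendstream'),
    ]),
  ];

  const chunks = [Buffer.from('%PDF-1.7\n')];
  const offsets = [];
  let position = chunks[0].length;
  bodies.forEach((body, i) => {
    const obj = Buffer.concat([
      Buffer.from(`${i + 1} 0 obj\n`),
      body,
      Buffer.from('\nendobj\n'),
    ]);
    offsets.push(position); // byte offset of this object, for the xref table
    chunks.push(obj);
    position += obj.length;
  });

  // Cross-reference table: every entry is exactly 20 bytes long.
  const xref = ['xref', `0 ${bodies.length + 1}`, '0000000000 65535 f ']
    .concat(offsets.map((o) => `${String(o).padStart(10, '0')} 00000 n `))
    .join('\n') + '\n';
  const trailer =
    `trailer\n<< /Size ${bodies.length + 1} /Root 1 0 R >>\n` +
    `startxref\n${position}\n%%EOF\n`;
  chunks.push(Buffer.from(xref + trailer, 'latin1'));
  return Buffer.concat(chunks);
}
```

Calling `buildPdfcxPdf({ invoice: 'INV-1', total: 42 })` returns a `Buffer` that can be written to disk with `fs.writeFileSync`. The record's fields are hypothetical; the spec as stated here does not prescribe a schema.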
| File | Purpose |
|---|---|
| index.html · style.css · script.js | the manifesto, rendered, and an in-browser drag-drop demo. |
| sample.js | a minimal Node.js reference: generate and read a PDF with a pdfcx record. |
| skills/pdfcx/SKILL.md | a Claude Code skill: give this to an agent and it will emit pdfcx-compliant PDFs. |
| promote.txt | public roster of adopters. Pull requests welcome. |
| warn.txt | list of known misrepresenters. Evidence required. |
| LICENSE | Apache 2.0. |
Visit the hosted page: pdf.cx. Drop a PDF on the demo tile. If it carries a pdfcx record, you'll see the embedded JSON. If not, you'll see why the spec exists.
Or click "generate a sample" and you'll get a sample invoice PDF with a pdfcx record embedded. Then drop it back in.
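The check the demo performs can be approximated in a few lines of Node.js. The sketch below is a deliberately naive reader, not the demo's actual code: it scans the raw bytes for a file specification whose `/Desc` is exactly `pdf-canonical-extraction`, follows its `/EF` reference, and parses the stream as JSON. It assumes the objects are stored uncompressed (no object streams), the description is a literal string, the stream has no `/Filter`, and `/Length` is a direct integer. The name `readPdfcxRecord` is hypothetical; a production reader should use a real PDF parser.

```javascript
// Naive sketch: works only on simple, uncompressed PDFs.
const PDFCX_DESC = 'pdf-canonical-extraction';

function readPdfcxRecord(pdfBytes) {
  // latin1 maps one byte to one character, so string indices are byte offsets.
  const text = pdfBytes.toString('latin1');

  // 1. Find a file specification whose description matches exactly.
  const descAt = text.indexOf(`/Desc (${PDFCX_DESC})`);
  if (descAt === -1) return null;

  // 2. Inside that same object, read the /EF << /F n g R >> reference.
  const objStart = text.lastIndexOf(' obj', descAt);
  const objEnd = text.indexOf('endobj', descAt);
  const spec = text.slice(objStart, objEnd);
  const ref = /\/EF\s*<<\s*\/F\s+(\d+)\s+(\d+)\s+R/.exec(spec);
  if (!ref) return null;

  // 3. Locate the referenced object and its stream dictionary.
  const objAt = text.search(new RegExp(`(^|[\\r\\n])${ref[1]} ${ref[2]} obj`));
  if (objAt === -1) return null;
  const rest = text.slice(objAt);
  const streamStart = />>\s*stream\r?\n/.exec(rest);
  if (!streamStart) return null;
  const dict = rest.slice(0, streamStart.index + 2);

  // 4. Only handle unfiltered streams with a direct integer /Length.
  const length = /\/Length\s+(\d+)\b(?!\s+\d+\s+R)/.exec(dict);
  if (!length || dict.includes('/Filter')) return null;

  // 5. Slice the payload bytes and parse them as JSON.
  const dataAt = objAt + streamStart.index + streamStart[0].length;
  const data = pdfBytes.subarray(dataAt, dataAt + Number(length[1]));
  return JSON.parse(data.toString('utf8'));
}
```

The function returns `null` when no record is found or when the file uses a feature the sketch does not handle. It covers only the JSON format; Parquet and SQLite payloads would need their own decoders.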
```
npm install
node sample.js write   # generates ./sample.pdf with an embedded pdfcx record
node sample.js read    # reads ./sample.pdf and prints the pdfcx record
node sample.js         # both, in sequence
```

- Make your PDF generator attach a file whose description is exactly `pdf-canonical-extraction`. See `skills/pdfcx/SKILL.md` for recipes in Python, JavaScript, Java, Go, Rust, and .NET.
- Ship it.
- Open a pull request adding yourself to `promote.txt`.
Drop skills/pdfcx/SKILL.md into your project's .claude/skills/ directory (or install it globally). Claude Code will use it any time you ask the agent to produce a PDF from structured data.
Misrepresentation is a PDF that shows one thing in its human view and claims another in its data. We treat that as malpractice. Names appear in warn.txt with evidence.
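One rough way to screen for a mismatch between the two views is to check that every value in the record also appears somewhere in the page text. The sketch below is a hypothetical heuristic, not a procedure defined by this project: the function names, the record fields, and the idea of comparing against already-extracted text are all assumptions. It will also raise false alarms when the human view formats a value differently (for example `1,200.50` against `1200.5`), so a flagged path is a prompt for review, not evidence.

```javascript
// Collect every leaf of a nested record as [path, value] pairs.
function leaves(value, path = '') {
  if (value !== null && typeof value === 'object') {
    return Object.entries(value).flatMap(([key, child]) =>
      leaves(child, path ? `${path}.${key}` : key));
  }
  return [[path, value]];
}

// Lower-case and collapse whitespace so layout differences do not matter.
const normalise = (s) => String(s).toLowerCase().replace(/\s+/g, ' ').trim();

// Return the paths of record values that never appear in the visible text.
function findUnseenValues(record, visibleText) {
  const haystack = normalise(visibleText);
  return leaves(record)
    .filter(([, value]) => value !== null && !haystack.includes(normalise(value)))
    .map(([path]) => path);
}
```

An empty result means every value in the record was found in the text; a non-empty result lists the fields worth a closer look.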
Some companies deliberately rasterise their PDF exports to PNG, rotate them by one to five degrees, and re-export, making their own documents harder to read back. Other companies have built entire businesses, entire investor rounds, on extracting data that was never supposed to have been lost.
We understand. Many companies and investors are counting on the business model of document extraction to continue. But it's time to put this era behind us and put users first. The accuracy of business data is worth more than the revenue from approximately extracting it.
We are not the first to say this. Dittrich & Bender's Janiform Intra-Document Analytics for Reproducible Research (VLDB 2015) introduced Portable Database Files for research papers. Germany and France have mandated ZUGFeRD / Factur-X for electronic invoicing, using PDF/A-3 with embedded CII XML per EN 16931. The US SEC and the EU ESMA require Inline XBRL for financial filings. pdfcx is not invention. It is consensus: that every PDF, not just those from tax authorities and research groups, should carry its own truth.
Apache 2.0. See LICENSE.