pdfcx

pdf-canonical-extraction. An open specification for embedding canonical structured data directly in PDF documents. No business model. No vendor. No roadmap. Just one attachment.

The thesis

It's time for document extraction to die.

Companies spend fortunes pushing structured data into PDF documents. Other companies spend fortunes pulling that same data back out, imperfectly, through OCR and machine learning and best-effort parsing. Accuracy disappears in the middle. All because the world is still thinking in terms of paper compatibility, and nobody prints anymore.

PDFs are not the villain. They are excellent at human consumption, portable as a record, durable across decades. We are not proposing their retirement. We are proposing their completion.

Machine readability

For humans and agents alike. Accessibility tools reconstruct tables and forms from rendered glyphs. A pdfcx record hands them the truth directly. And every AI agent that will soon touch your PDFs (to file taxes, reconcile invoices, summarise lab reports, fill forms, search case law) is forced today to OCR and guess. Accessibility was always the argument. AI agents just make it urgent.

The proposal

Embed a single record of the document's structured data directly inside the PDF as an attached file. Or reference it by URL. The PDF specification has allowed file attachments since 1999. We just require a specific string in the attachment's description.

| | |
| --- | --- |
| Nickname | `pdfcx` |
| /Desc | exactly `pdf-canonical-extraction` |
| Formats | JSON · Parquet · SQLite |
| Transport | Embedded file attachment, or URL |
| Auth | Optional. The PDF already carries the human view. |
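
Because a record may arrive as JSON, Parquet, or SQLite, a reader has to work out which one it is holding. The sketch below guesses from the leading bytes. It is an illustration only: the function name `sniffRecordFormat` is hypothetical, and the spec as written here does not say how a reader should detect the format. The magic values themselves come from the formats: SQLite files begin with the 16-byte string `SQLite format 3` plus a NUL byte, and Parquet files begin with `PAR1`.

```javascript
// Guess the container format of a pdfcx payload from its first bytes.
// Hypothetical helper for illustration; not part of the pdfcx spec.
function sniffRecordFormat(bytes) {
  const buf = Buffer.from(bytes);

  // SQLite database files start with this 16-byte header string.
  if (buf.subarray(0, 16).toString("latin1") === "SQLite format 3\u0000") {
    return "sqlite";
  }

  // Parquet files start (and end) with the 4-byte magic "PAR1".
  if (buf.subarray(0, 4).toString("latin1") === "PAR1") {
    return "parquet";
  }

  // Otherwise, accept the payload as JSON only if it actually parses.
  const text = buf.toString("utf8").replace(/^\uFEFF/, "").trimStart();
  if (text.startsWith("{") || text.startsWith("[")) {
    try {
      JSON.parse(text);
      return "json";
    } catch {
      return "unknown";
    }
  }
  return "unknown";
}
```

A reader could branch on the result and treat `"unknown"` as a malformed record rather than guessing further.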

The whole spec, in one sentence

Specifications are hard to adopt. So this one is a single line:

Attach one file to your PDF whose /Desc is pdf-canonical-extraction.

That is the entire spec.

One attachment. No fee. No vendor. No roadmap.

This repository

File	Purpose
`index.html` · `style.css` · `script.js`	the manifesto, rendered, and an in-browser drag-drop demo.
`sample.js`	a minimal Node.js reference: generate and read a PDF with a pdfcx record.
`skills/pdfcx/SKILL.md`	a Claude Code skill: give this to an agent and it will emit pdfcx-compliant PDFs.
`promote.txt`	public roster of adopters. Pull requests welcome.
`warn.txt`	list of known misrepresenters. Evidence required.
`LICENSE`	Apache 2.0.

Quickstart

Browser demo

Visit the hosted page: pdf.cx. Drop a PDF on the demo tile. If it carries a pdfcx record, you'll see the embedded JSON. If not, you'll see why the spec exists.

Or click "generate a sample" and you'll get a sample invoice PDF with a pdfcx record embedded. Then drop it back in.

Node demo

npm install
node sample.js write   # generates ./sample.pdf with an embedded pdfcx record
node sample.js read    # reads ./sample.pdf and prints the pdfcx record
node sample.js         # both, in sequence
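
As a rough illustration of the first thing a reader has to do, the sketch below scans raw PDF bytes for the `/Desc` marker. It is not the logic in `sample.js`, and the function name `findPdfcxDesc` is hypothetical. A plain byte scan only sees uncompressed objects whose description is written as a literal string, so a real reader would more likely use a PDF library to walk the embedded-files name tree.

```javascript
// Return the byte offset of the pdfcx /Desc entry, or -1 if none is visible.
// Assumption: the file specification is stored uncompressed and /Desc is a
// literal string. Descriptions inside object streams or written as hex
// strings will be missed by this naive scan.
function findPdfcxDesc(pdfBytes) {
  // latin1 maps each byte to one character, so string indices are byte offsets.
  const text = Buffer.from(pdfBytes).toString("latin1");
  // The closing parenthesis enforces an exact match on the description.
  const match = /\/Desc\s*\(pdf-canonical-extraction\)/.exec(text);
  return match ? match.index : -1;
}
```

A result of `-1` does not prove the record is absent. It only means this shortcut could not see it.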

Adopt the spec

1. Make your PDF generator attach a file whose description is exactly `pdf-canonical-extraction`. See `skills/pdfcx/SKILL.md` for recipes in Python, JavaScript, Java, Go, Rust, and .NET.
2. Ship it.
3. Open a pull request adding yourself to `promote.txt`.
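
To make the first step concrete, here is a minimal sketch that hand-assembles a one-page PDF with a single embedded JSON file, using only Node.js built-ins. It is not the code in `sample.js` or the recipes in `SKILL.md`. The function name `buildPdfcxPdf`, the attachment name `record.json`, and the blank page are all assumptions made for illustration. In practice a PDF library's attachment API would do this work. The part that matters is the `/Desc` entry on the file specification.

```javascript
// Build a tiny PDF whose only attachment carries the pdfcx description.
// Hypothetical helper; object numbers are fixed: 1 catalog, 2 pages,
// 3 page, 4 file specification, 5 embedded file stream.
function buildPdfcxPdf(record) {
  const payload = Buffer.from(JSON.stringify(record), "utf8");
  const dictionaries = [
    "<< /Type /Catalog /Pages 2 0 R /Names << /EmbeddedFiles << /Names [(record.json) 4 0 R] >> >> >>",
    "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    // The one line the spec cares about: /Desc is exactly pdf-canonical-extraction.
    "<< /Type /Filespec /F (record.json) /UF (record.json) /Desc (pdf-canonical-extraction) /EF << /F 5 0 R >> >>",
  ];

  const chunks = [];
  const offsets = [];
  let length = 0;
  const push = (data) => {
    const buf = Buffer.isBuffer(data) ? data : Buffer.from(data, "latin1");
    chunks.push(buf);
    length += buf.length;
  };

  push("%PDF-1.7\n");
  dictionaries.forEach((body, i) => {
    offsets.push(length); // byte offset of this object, needed by the xref table
    push(`${i + 1} 0 obj\n${body}\nendobj\n`);
  });

  // Object 5: the embedded file stream holding the record itself.
  offsets.push(length);
  push(`5 0 obj\n<< /Type /EmbeddedFile /Subtype /application#2Fjson /Length ${payload.length} >>\nstream\n`);
  push(payload);
  push("\nendstream\nendobj\n");

  // Cross-reference table: each entry is exactly 20 bytes long.
  const xrefStart = length;
  let xref = "xref\n0 6\n0000000000 65535 f \n";
  for (const offset of offsets) {
    xref += `${String(offset).padStart(10, "0")} 00000 n \n`;
  }
  push(`${xref}trailer\n<< /Size 6 /Root 1 0 R >>\nstartxref\n${xrefStart}\n%%EOF\n`);

  return Buffer.concat(chunks);
}
```

Writing the returned buffer to disk, for example with `fs.writeFileSync("out.pdf", buildPdfcxPdf({ invoice: "INV-001" }))`, should produce a blank page whose attachment panel lists `record.json`.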

Use the Claude Code skill

Drop skills/pdfcx/SKILL.md into your project's .claude/skills/ directory (or install it globally). Claude Code will use it any time you ask the agent to produce a PDF from structured data.

The only real risk

Misrepresentation: a PDF that shows one thing in its human view and claims another in its data. We treat that as malpractice. Names appear in `warn.txt` with evidence.
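
One way to gather evidence of a mismatch is to compare the record against the text a human would see. The sketch below is a deliberately crude version of that check, under several assumptions: the visible text has already been extracted by some other tool, the record is a flat object, and a field counts as consistent if its value appears verbatim in the text. The function name `findMismatches` is hypothetical, and nothing here is prescribed by the spec.

```javascript
// List the record fields whose values do not appear in the visible text.
// Hypothetical helper: nested values are skipped, and matching is a
// case-insensitive substring test after collapsing whitespace, so
// differently formatted numbers or dates will be flagged as mismatches.
function findMismatches(record, visibleText) {
  const normalise = (value) =>
    String(value).replace(/\s+/g, " ").trim().toLowerCase();
  const haystack = normalise(visibleText);

  return Object.entries(record)
    .filter(([, value]) => value !== null && typeof value !== "object")
    .filter(([, value]) => !haystack.includes(normalise(value)))
    .map(([key]) => key);
}
```

For instance, calling it with `{ total: "9,999.00" }` against page text that reads `Total due: 1,250.00 EUR` returns `["total"]`, which is a prompt for a human to look closer rather than proof of bad faith.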

The economic reality

Some companies deliberately rasterise their PDF exports to PNG, rotate them by one to five degrees, and re-export, making their own documents harder to read back. Other companies have built entire businesses, entire investor rounds, on extracting data that was never supposed to have been lost.

We understand. But it's time to put users first.

We acknowledge that many companies and investors are counting on the business model of document extraction to continue. It is time to put that era behind us. The accuracy of business data is worth more than the revenue from extracting it approximately.

Acknowledgements

We are not the first to say this. Dittrich & Bender's Janiform Intra-Document Analytics for Reproducible Research (VLDB 2015) introduced Portable Database Files for research papers. Germany and France have mandated ZUGFeRD / Factur-X for electronic invoicing, using PDF/A-3 with embedded CII XML per EN 16931. The US SEC and the EU ESMA require Inline XBRL for financial filings. pdfcx is not invention. It is consensus: that every PDF, not just those from tax authorities and research groups, should carry its own truth.

License

Apache 2.0. See LICENSE.
