Reading Existing PDFs

PdfReader opens an existing (non-encrypted) PDF and exposes its object graph and page tree. It is a low-level API: it does not render or edit anything on its own, but it is the foundation on which the upcoming template-import and modify-existing-PDF features will build.

use DragonOfMercy\PhpPdf\Reader\PdfReader;

$reader = PdfReader::fromFile('invoice.pdf');     // or PdfReader::fromBytes($bytes)

$reader->version();        // "1.7" (catalog /Version overrides the header)
$reader->pageCount();      // 3
$page = $reader->page(1);  // 1-based; returns a ReadPage

What it understands

Classic cross-reference tables and cross-reference streams (PDF 1.5+), including PNG/TIFF predictor encodings.
Incremental updates: /Prev revision chains are walked and merged (the newest revision wins), including hybrid-reference files (/XRefStm).
Object streams (/ObjStm): compressed objects are extracted transparently.
Stream filters needed for document structure: FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode (with /DecodeParms). Image filters (DCTDecode, JPXDecode, CCITTFaxDecode, JBIG2Decode) are not decoded; image streams stay opaque.
Real-world quirks: junk before the %PDF- header, slightly wrong xref offsets (a recovery scan looks around the recorded position), a wrong stream /Length (fallback scan for endstream), and a missing %%EOF.
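
To make the predictor point concrete: cross-reference streams are usually Flate-compressed with a PNG row predictor, and the rows have to be un-predicted after inflating. The following is a minimal sketch of that step, not the library's actual code. The function name undoPngPredictor is hypothetical, it assumes 8-bit single-component rows (so $columns equals the row width in bytes), and it handles only the None (0) and Up (2) row filters.

```php
<?php

/**
 * Undo PNG row prediction on data that has already been inflated.
 * Hypothetical helper for illustration; not part of PdfReader.
 *
 * @param string $data    rows of ($columns + 1) bytes, each led by a filter-type byte
 * @param int    $columns number of data bytes per row (the /Columns value)
 */
function undoPngPredictor(string $data, int $columns): string
{
    if ($data === '') {
        return '';
    }
    $rowLength = $columns + 1; // one filter-type byte precedes each row
    if ($columns < 1 || strlen($data) % $rowLength !== 0) {
        throw new InvalidArgumentException('Data is not a whole number of rows');
    }

    $previous = array_fill(0, $columns, 0); // the row above the first row counts as zeros
    $output = '';

    foreach (str_split($data, $rowLength) as $row) {
        $filter = ord($row[0]);
        $current = [];
        for ($i = 0; $i < $columns; $i++) {
            $byte = ord($row[$i + 1]);
            if ($filter === 0) {
                $current[$i] = $byte;                           // None: stored as-is
            } elseif ($filter === 2) {
                $current[$i] = ($byte + $previous[$i]) & 0xFF;  // Up: add the byte above
            } else {
                throw new RuntimeException("Unsupported PNG filter type $filter");
            }
        }
        $output .= pack('C*', ...$current);
        $previous = $current;
    }

    return $output;
}
```

A complete decoder would also cover the Sub, Average and Paeth row filters and the TIFF predictor; they are left out here to keep the sketch short.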

Pages

page(int $n) returns a ReadPage with the page's inherited attributes already resolved (PDF inheritance through the /Pages tree):

$page = $reader->page(1);
$page->mediaBox;    // [llx, lly, urx, ury] in points, corner-normalized
$page->cropBox;     // same shape, or null
$page->box();       // CropBox when present, else MediaBox
$page->rotate;      // 0 / 90 / 180 / 270
$page->resources;   // the resolved /Resources dictionary, or null
$page->contents;    // list of references to the page's content stream(s)
$page->dict;        // the raw page dictionary
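
For readers curious what "inherited attributes already resolved" and "corner-normalized" involve, here is a rough sketch using plain PHP arrays. It is an illustration under assumptions, not how ReadPage is implemented: the node layout (a 'Parent' key holding the parent array directly) and the function names inheritedAttribute and normalizeBox are hypothetical.

```php
<?php

/**
 * Look up $key on a page node, climbing the Parent chain until a value is found.
 * Only the attributes PDF marks as inheritable are searched in ancestors.
 */
function inheritedAttribute(array $node, string $key, int $maxDepth = 64): mixed
{
    static $inheritable = ['Resources', 'MediaBox', 'CropBox', 'Rotate'];

    for ($depth = 0; $depth < $maxDepth; $depth++) {
        if (array_key_exists($key, $node)) {
            return $node[$key];
        }
        if (!in_array($key, $inheritable, true) || !isset($node['Parent'])) {
            return null; // not inheritable, or the root was reached
        }
        $node = $node['Parent'];
    }

    throw new RuntimeException('Page tree is too deep');
}

/** Reorder a rectangle given by any two opposite corners into [llx, lly, urx, ury]. */
function normalizeBox(array $box): array
{
    [$x1, $y1, $x2, $y2] = $box;

    return [min($x1, $x2), min($y1, $y2), max($x1, $x2), max($y1, $y2)];
}

// Usage: the page has no MediaBox of its own, so the value comes from its parent.
$root = ['Type' => 'Pages', 'MediaBox' => [612, 792, 0, 0], 'Rotate' => 90];
$page = ['Type' => 'Page', 'Parent' => $root];

$mediaBox = normalizeBox(inheritedAttribute($page, 'MediaBox')); // [0, 0, 612, 792]
```

The depth cap is a defensive choice for damaged files whose /Parent links never reach the root.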

Raw object access

$catalog = $reader->catalog();                 // the document catalog dictionary
$trailer = $reader->trailer();                 // merged trailer across revisions
$object  = $reader->object(12);                // payload of object 12 (lazy, cached)
$value   = $reader->resolve($maybeReference);  // follow reference chains
$bytes   = $reader->decodeStream($stream);     // apply a stream's /Filter chain

Objects are returned as the library's internal PDF object model (dictionaries, arrays, names, numbers, strings, streams). Resolution is lazy and cached; circular references and over-deep chains throw a PdfParseException.
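
The guard against circular and over-deep reference chains can be pictured with the sketch below. It is a simplified model, not the reader's internals: references are represented as ['ref' => objectNumber] arrays, $objects is a plain map from object number to value, and the name resolveReference and the default limit of 32 hops are assumptions made for the example.

```php
<?php

/**
 * Follow indirect references until a direct value is reached.
 * Hypothetical model: a reference is ['ref' => n]; $objects maps n to a value.
 */
function resolveReference(mixed $value, array $objects, int $maxDepth = 32): mixed
{
    $seen = [];

    while (is_array($value) && array_keys($value) === ['ref']) {
        $number = $value['ref'];
        if (isset($seen[$number])) {
            throw new RuntimeException("Circular reference at object $number");
        }
        if (count($seen) >= $maxDepth) {
            throw new RuntimeException('Reference chain is too deep');
        }
        $seen[$number] = true;
        $value = $objects[$number] ?? null; // an undefined object behaves like null
    }

    return $value;
}
```

Tracking visited object numbers catches cycles of any length, while the separate hop limit bounds the work done on long but acyclic chains.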

Limits

Encrypted PDFs are rejected at fromFile() / fromBytes() with a clear PdfException. Decrypt the file first (e.g. qpdf --decrypt).
Malformed input throws PdfParseException with the byte offset and what was expected.
LZWDecode and full reconstruction of severely broken files (rebuilding the xref by scanning) are not supported yet.
