Reading Existing PDFs

Dragon edited this page Jun 10, 2026 · 1 revision

PdfReader opens an existing (non-encrypted) PDF and exposes its object graph and page tree. It is a low-level API: it does not render or edit anything by itself, but it is the foundation the upcoming template-import and modify-existing-PDF features build on.

use DragonOfMercy\PhpPdf\Reader\PdfReader;

$reader = PdfReader::fromFile('invoice.pdf');     // or PdfReader::fromBytes($bytes)

$reader->version();        // "1.7" (catalog /Version overrides the header)
$reader->pageCount();      // 3
$page = $reader->page(1);  // 1-based index; returns a ReadPage

What it understands

  • Classic cross-reference tables and cross-reference streams (PDF 1.5+), including PNG/TIFF predictor encodings.
  • Incremental updates: /Prev revision chains are walked and merged (the newest revision wins), including hybrid-reference files (/XRefStm).
  • Object streams (/ObjStm): compressed objects are extracted transparently.
  • Stream filters needed for document structure: FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode (with /DecodeParms). Image filters (DCT, JPX, CCITT, JBIG2) are not decoded; image streams stay opaque.
  • Real-world quirks: junk before the %PDF- header, slightly wrong xref offsets (a recovery scan looks around the recorded position), a wrong stream /Length (fallback scan for endstream), and a missing %%EOF.
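
To make the predictor handling in the first bullet concrete, here is a minimal sketch of undoing PNG row prediction on already-inflated cross-reference stream data. It is not the library's code: the function names undoPngPredictor and paeth are hypothetical, and the sketch assumes 8-bit, single-component rows (so the "left" neighbour is one byte back) with $columns taken from /Columns in /DecodeParms. It needs PHP 8.0+ for match.

```php
<?php
declare(strict_types=1);

/** Paeth predictor from the PNG specification: pick the neighbour closest to a + b - c. */
function paeth(int $a, int $b, int $c): int
{
    $p = $a + $b - $c;
    $pa = abs($p - $a);
    $pb = abs($p - $b);
    $pc = abs($p - $c);
    if ($pa <= $pb && $pa <= $pc) {
        return $a;
    }
    return $pb <= $pc ? $b : $c;
}

/**
 * Undo PNG row prediction. Every row is one filter-type byte followed by
 * $columns data bytes. Assumes one byte per pixel (hypothetical simplification).
 */
function undoPngPredictor(string $data, int $columns): string
{
    $rowLength = $columns + 1;
    $previous = str_repeat("\0", $columns); // the row above the first row is all zeros
    $out = '';
    $total = strlen($data);

    // A trailing partial row, if any, is ignored in this sketch.
    for ($offset = 0; $offset + $rowLength <= $total; $offset += $rowLength) {
        $type = ord($data[$offset]);
        $decoded = '';
        for ($i = 0; $i < $columns; $i++) {
            $raw = ord($data[$offset + 1 + $i]);
            $left = $i > 0 ? ord($decoded[$i - 1]) : 0;
            $up = ord($previous[$i]);
            $upLeft = $i > 0 ? ord($previous[$i - 1]) : 0;
            $predicted = match ($type) {
                0 => 0,                              // None
                1 => $left,                          // Sub
                2 => $up,                            // Up
                3 => intdiv($left + $up, 2),         // Average
                4 => paeth($left, $up, $upLeft),     // Paeth
                default => throw new InvalidArgumentException("Unknown PNG filter type $type"),
            };
            $decoded .= chr(($raw + $predicted) & 0xFF);
        }
        $out .= $decoded;
        $previous = $decoded;
    }
    return $out;
}
```

Cross-reference streams typically use the Up filter because consecutive entries differ only slightly, so most predicted bytes compress to zero. The decoded output is then split into fixed-width fields according to the stream's /W array.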

Pages

page(int $n) returns a ReadPage with the page's inherited attributes already resolved (PDF inheritance through the /Pages tree):

$page = $reader->page(1);
$page->mediaBox;    // [llx, lly, urx, ury] in points, corner-normalized
$page->cropBox;     // same shape, or null
$page->box();       // CropBox when present, else MediaBox
$page->rotate;      // 0 / 90 / 180 / 270
$page->resources;   // the resolved /Resources dictionary, or null
$page->contents;    // list of references to the page's content stream(s)
$page->dict;        // the raw page dictionary
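
The inheritance and corner normalization described above can be sketched on plain PHP arrays. This is an illustration under stated assumptions, not the library's implementation: inheritedValue and normalizeBox are hypothetical names, page-tree nodes are modelled as nested arrays whose 'Parent' entry holds the parent node, and the depth limit of 64 is arbitrary. It needs PHP 8.0+ for the mixed type.

```php
<?php
declare(strict_types=1);

/**
 * Look up an inheritable attribute (e.g. MediaBox, CropBox, Rotate, Resources)
 * by walking from a page node up through its /Parent chain.
 * Returns null when no ancestor defines the key.
 */
function inheritedValue(array $node, string $key, int $maxDepth = 64): mixed
{
    for ($depth = 0; $depth < $maxDepth; $depth++) {
        if (array_key_exists($key, $node)) {
            return $node[$key];          // nearest definition wins
        }
        if (!isset($node['Parent'])) {
            return null;                 // reached the root without a match
        }
        $node = $node['Parent'];
    }
    throw new RuntimeException('Page tree is too deep');
}

/**
 * Reorder a rectangle so it reads [llx, lly, urx, ury], whichever two
 * opposite corners the file happened to store.
 */
function normalizeBox(array $box): array
{
    [$x1, $y1, $x2, $y2] = $box;
    return [min($x1, $x2), min($y1, $y2), max($x1, $x2), max($y1, $y2)];
}

// Usage: a page that defines nothing itself inherits from its /Pages parent.
$root = ['Type' => 'Pages', 'MediaBox' => [612, 792, 0, 0], 'Rotate' => 90];
$leaf = ['Type' => 'Page', 'Parent' => $root];

$mediaBox = normalizeBox(inheritedValue($leaf, 'MediaBox'));
$cropBox  = inheritedValue($leaf, 'CropBox');
```

Because the attributes are resolved up front, code that consumes a ReadPage never has to walk the /Pages tree itself.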

Raw object access

$catalog = $reader->catalog();                 // the document catalog dictionary
$trailer = $reader->trailer();                 // merged trailer across revisions
$object  = $reader->object(12);                // payload of object 12 (lazy, cached)
$value   = $reader->resolve($maybeReference);  // follow reference chains
$bytes   = $reader->decodeStream($stream);     // apply a stream's /Filter chain
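
As a rough picture of what applying a /Filter chain involves, the sketch below decodes three of the supported filters in the order they are listed, which is the order the PDF specification prescribes for decoding. It is not the library's code: applyFilters, asciiHexDecode and runLengthDecode are hypothetical names, /DecodeParms handling is left out, and FlateDecode relies on PHP's zlib extension (gzuncompress).

```php
<?php
declare(strict_types=1);

/** ASCIIHexDecode: hex digits with optional whitespace, terminated by '>'. */
function asciiHexDecode(string $data): string
{
    $end = strpos($data, '>');
    if ($end !== false) {
        $data = substr($data, 0, $end);
    }
    $hex = preg_replace('/\s+/', '', $data);
    if ($hex !== '' && !ctype_xdigit($hex)) {
        throw new InvalidArgumentException('Invalid character in hex data');
    }
    if (strlen($hex) % 2 === 1) {
        $hex .= '0';                       // an odd final digit is padded with 0
    }
    return hex2bin($hex);
}

/** RunLengthDecode: length byte 0-127 copies, 129-255 repeats, 128 ends the data. */
function runLengthDecode(string $data): string
{
    $out = '';
    $i = 0;
    $n = strlen($data);
    while ($i < $n) {
        $length = ord($data[$i++]);
        if ($length === 128) {
            break;                         // end-of-data marker
        }
        if ($length < 128) {
            $out .= substr($data, $i, $length + 1);   // literal run
            $i += $length + 1;
        } elseif ($i < $n) {
            $out .= str_repeat($data[$i++], 257 - $length);
        }
    }
    return $out;
}

/** Apply the filters named in $filters, first to last. */
function applyFilters(string $data, array $filters): string
{
    foreach ($filters as $name) {
        $data = match ($name) {
            'ASCIIHexDecode'  => asciiHexDecode($data),
            'RunLengthDecode' => runLengthDecode($data),
            'FlateDecode'     => flateDecode($data),
            default => throw new InvalidArgumentException("Unsupported filter $name"),
        };
    }
    return $data;
}

/** FlateDecode: the stream body is zlib-wrapped deflate data. */
function flateDecode(string $data): string
{
    $out = @gzuncompress($data);
    if ($out === false) {
        throw new InvalidArgumentException('Corrupt Flate data');
    }
    return $out;
}
```

A stream declared with /Filter [/ASCIIHexDecode /FlateDecode] would therefore be hex-decoded first and inflated second.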

Objects are returned as the library's internal PDF object model (dictionaries, arrays, names, numbers, strings, streams). Resolution is lazy and cached; circular references and over-deep chains throw a PdfParseException.
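
The lazy, cached, cycle-safe resolution described above might look roughly like this. The sketch is self-contained and makes several assumptions: Ref and LazyResolver are hypothetical stand-ins rather than the library's types, the loader callback plays the role of parsing an object on demand, the depth limit of 32 is arbitrary, and a plain RuntimeException is thrown where the reader would throw PdfParseException. It needs PHP 8.1+ for readonly properties.

```php
<?php
declare(strict_types=1);

/** Stand-in for an indirect reference such as "12 0 R" (hypothetical type). */
final class Ref
{
    public function __construct(public readonly int $number) {}
}

final class LazyResolver
{
    /** @var array<int, mixed> parsed objects, keyed by object number */
    private array $cache = [];

    /** Number of times the loader actually ran; exposed to show the caching. */
    public int $loads = 0;

    public function __construct(private Closure $loader, private int $maxDepth = 32) {}

    /** Load object $number on first access, then serve it from the cache. */
    public function object(int $number): mixed
    {
        if (!array_key_exists($number, $this->cache)) {
            $this->loads++;
            $this->cache[$number] = ($this->loader)($number);
        }
        return $this->cache[$number];
    }

    /** Follow a chain of references until a direct value is reached. */
    public function resolve(mixed $value): mixed
    {
        $seen = [];
        while ($value instanceof Ref) {
            if (isset($seen[$value->number])) {
                throw new RuntimeException("Circular reference at object {$value->number}");
            }
            if (count($seen) >= $this->maxDepth) {
                throw new RuntimeException('Reference chain is too deep');
            }
            $seen[$value->number] = true;
            $value = $this->object($value->number);
        }
        return $value;
    }
}

// Usage: object 1 points at object 2; objects 3 and 4 point at each other.
$objects = [1 => new Ref(2), 2 => 'payload', 3 => new Ref(4), 4 => new Ref(3)];
$resolver = new LazyResolver(fn (int $n) => $objects[$n] ?? null);

$first  = $resolver->resolve(new Ref(1));
$second = $resolver->resolve(new Ref(1));   // served from the cache, no new loads
```

Tracking visited object numbers per call keeps cycle detection independent of the cache, so a value that is merely shared between many references is never mistaken for a loop.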

Limits

  • Encrypted PDFs are rejected at fromFile() / fromBytes() with a clear PdfException. Decrypt the file first (e.g. qpdf --decrypt).
  • Malformed input throws PdfParseException with the byte offset and what was expected.
  • LZWDecode and full reconstruction of severely broken files (rebuilding the xref by scanning) are not supported yet.
