Internals Fonts and Text
This page covers the internal text encoding strategies and font subsetting pipeline used by phppdf.
phppdf uses two different text encodings depending on the font:
The 12 built-in fonts (Helvetica / Times / Courier × Regular / Bold / Italic / BoldItalic) use the WinAnsi encoding (CP1252). It covers:
- ASCII (0x20-0x7E).
- Latin-1 supplement (0xA0-0xFF) - accents, common Latin symbols.
- The typographic characters in the 0x80-0x9F range: €, œ, Œ, smart quotes, em-dash, en-dash, ellipsis, etc.
Characters outside WinAnsi (Greek, Cyrillic, CJK, etc.) cannot be rendered with the standard fonts. Register a custom font instead.
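To check up front whether a string fits the standard fonts, the three ranges above can be tested directly. The sketch below is not part of phppdf: `charsOutsideWinAnsi()` and `utf8CodePoint()` are hypothetical helpers, and the table lists the code points that CP1252 assigns to 0x80-0x9F.

```php
<?php
// Hypothetical helpers, not part of phppdf: report which characters of a
// UTF-8 string fall outside WinAnsi (CP1252).

/** Unicode code points that CP1252 places in the 0x80-0x9F range. */
const CP1252_EXTRAS = [
    0x20AC, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, 0x02C6,
    0x2030, 0x0160, 0x2039, 0x0152, 0x017D, 0x2018, 0x2019, 0x201C,
    0x201D, 0x2022, 0x2013, 0x2014, 0x02DC, 0x2122, 0x0161, 0x203A,
    0x0153, 0x017E, 0x0178,
];

/** Decodes one UTF-8 encoded character into its code point. */
function utf8CodePoint(string $char): int
{
    $b = array_values(unpack('C*', $char));
    switch (count($b)) {
        case 1:  return $b[0];
        case 2:  return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:  return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default: return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                      | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

/** Returns the characters of $utf8 that WinAnsi cannot represent. */
function charsOutsideWinAnsi(string $utf8): array
{
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    $missing = [];
    foreach ($chars as $char) {
        $cp = utf8CodePoint($char);
        $ok = ($cp >= 0x20 && $cp <= 0x7E)          // ASCII
            || ($cp >= 0xA0 && $cp <= 0xFF)         // Latin-1 supplement
            || in_array($cp, CP1252_EXTRAS, true);  // 0x80-0x9F typographic set
        if (!$ok) {
            $missing[] = $char;
        }
    }
    return $missing;
}
```

An empty result means the text can go through a built-in font. Note that this sketch also reports control characters such as newlines, since they lie outside the three listed ranges.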
The 12 AFM-derived metric files in src/Font/Metrics/ are generated from Adobe Type 1 AFM source files by bin/generate-font-metrics.php and contain the per-glyph widths used by stringWidth() and the cell wrapper.
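As a rough illustration of how per-glyph widths turn into a string width: AFM widths are expressed in 1/1000 of the font size, so the width of a string is the sum of its glyph widths scaled by the font size. The array layout and the function name `sketchStringWidth()` are assumptions for this sketch, not the actual format of the metric files or the real stringWidth() implementation.

```php
<?php
// Illustrative only: the array layout is an assumption, not the actual
// format of the files in src/Font/Metrics/.

/** Advance widths in 1/1000 em, keyed by character (a few Helvetica values). */
const HELVETICA_WIDTHS = ['H' => 722, 'e' => 556, 'l' => 222, 'o' => 556, ' ' => 278];

function sketchStringWidth(string $text, array $widths, float $fontSize): float
{
    $units = 0;
    foreach (str_split($text) as $char) {   // single-byte (WinAnsi) text only
        $units += $widths[$char] ?? 0;      // unknown glyphs count as zero here
    }
    return $units * $fontSize / 1000;       // result in the same unit as $fontSize
}
```

With these values, "Hello" adds up to 2278 units, which is 27.336 pt at a 12 pt font size.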
Registered TTF / OTF fonts use Identity-H encoding: each Unicode code point is mapped to its glyph index in the font, and that glyph index is what ends up in the content stream. This is what makes full Unicode coverage possible (Latin Extended, Greek, Cyrillic, supplementary planes including CJK).
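A minimal sketch of what Identity-H text looks like once encoded, assuming a code point to glyph ID map is already available from the parsed font. `encodeIdentityH()` and the glyph IDs used with it are made up for illustration.

```php
<?php
// Hypothetical sketch: $cmap maps Unicode code points to glyph IDs and would
// come from the parsed font. This is not phppdf's actual API.

/**
 * @param int[]          $codePoints text as a list of Unicode code points
 * @param array<int,int> $cmap       code point => glyph ID
 */
function encodeIdentityH(array $codePoints, array $cmap): string
{
    $hex = '';
    foreach ($codePoints as $cp) {
        $gid = $cmap[$cp] ?? 0;          // GID 0 is .notdef, the "missing glyph"
        $hex .= sprintf('%04X', $gid);   // 2 bytes per glyph, big-endian
    }
    return '<' . $hex . '>';             // PDF hex string operand
}
```

For example, with `[0x48 => 43, 0x3A9 => 300]` as the map, the code points `[0x48, 0x3A9]` become `<002B012C>`.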
The technical structure for a custom font is:
- A composite font dictionary of subtype /Type0 with /Encoding /Identity-H and a /DescendantFonts pointing at...
- A CIDFontType2 (for TTF) or CIDFontType0 (for OTF / CFF) dictionary describing the embedded font.
- An embedded /ToUnicode CMap stream that maps glyph indices back to Unicode code points - this is what makes copy-paste from the rendered PDF work correctly.
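The /ToUnicode mapping can be sketched as follows. The function names and the input shape (glyph ID => code point) are hypothetical; the CMap text follows the usual ToUnicode layout, with code points above U+FFFF written as UTF-16BE surrogate pairs.

```php
<?php
// Sketch under assumptions: names and input shape are hypothetical and do
// not mirror phppdf's internals.

/** Encodes a code point as UTF-16BE hex, using a surrogate pair above U+FFFF. */
function utf16Hex(int $cp): string
{
    if ($cp < 0x10000) {
        return sprintf('%04X', $cp);
    }
    $v = $cp - 0x10000;
    return sprintf('%04X%04X', 0xD800 | ($v >> 10), 0xDC00 | ($v & 0x3FF));
}

/** @param array<int,int> $gidToCodePoint glyph ID => Unicode code point */
function buildToUnicodeCMap(array $gidToCodePoint): string
{
    ksort($gidToCodePoint);
    $out = "/CIDInit /ProcSet findresource begin\n12 dict begin\nbegincmap\n"
         . "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n"
         . "/CMapName /Adobe-Identity-UCS def\n/CMapType 2 def\n"
         . "1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\n";
    // A single bfchar block may hold at most 100 entries.
    foreach (array_chunk($gidToCodePoint, 100, true) as $chunk) {
        $out .= count($chunk) . " beginbfchar\n";
        foreach ($chunk as $gid => $cp) {
            $out .= sprintf("<%04X> <%s>\n", $gid, utf16Hex($cp));
        }
        $out .= "endbfchar\n";
    }
    return $out . "endcmap\nCMapName currentdict /CMap defineresource pop\nend\nend\n";
}
```

A PDF viewer reads these `<glyph> <unicode>` pairs when the user selects or searches text, which is why the map only needs entries for glyphs that were actually used.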
When parsing a TTF / OTF, phppdf reads the cmap table looking for:
- Format 4 (BMP, U+0000 to U+FFFF) - the most common format for Western and basic CJK fonts.
- Format 12 (full Unicode, including supplementary planes U+10000+) - required for emoji, ancient scripts, and the full CJK extension blocks.
Other subtable formats are skipped. If neither format 4 nor format 12 is present, registration throws PdfException.
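The subtable selection described above might look roughly like this. It is a sketch, not the parser's code: `pickCmapSubtable()` is a hypothetical name, preferring format 12 over format 4 is an assumption, platform / encoding IDs are not checked, and RuntimeException stands in for PdfException.

```php
<?php
// Hypothetical sketch of choosing a cmap subtable from the raw cmap table.

/** @return array{format:int, offset:int} */
function pickCmapSubtable(string $cmap): array
{
    if (strlen($cmap) < 4) {
        throw new RuntimeException('cmap table is truncated');
    }
    $numTables = unpack('n', $cmap, 2)[1];   // skip the uint16 version field
    if (4 + 8 * $numTables > strlen($cmap)) {
        throw new RuntimeException('cmap encoding records are truncated');
    }
    $found = [];
    for ($i = 0; $i < $numTables; $i++) {
        // Encoding record: platformID, encodingID (uint16 each), offset (uint32).
        $record = unpack('nplatform/nencoding/Noffset', $cmap, 4 + 8 * $i);
        if ($record['offset'] + 2 > strlen($cmap)) {
            continue;                        // offset points outside the table
        }
        $format = unpack('n', $cmap, $record['offset'])[1];
        if (($format === 4 || $format === 12) && !isset($found[$format])) {
            $found[$format] = $record['offset'];
        }
        // Any other subtable format is skipped.
    }
    foreach ([12, 4] as $format) {           // assumption: format 12 wins if present
        if (isset($found[$format])) {
            return ['format' => $format, 'offset' => $found[$format]];
        }
    }
    throw new RuntimeException('No format 4 or format 12 cmap subtable found');
}
```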
Both TTF and OTF / CFF fonts are automatically subsetted: the embedded font in the PDF only contains the glyphs your document actually uses, not the full original file. This is what keeps PDFs small even when you use a 10 MB CJK font family.
The subsetting strategy is GID-preserving:
- The Identity-H encoding maps Unicode -> GID directly. If the subsetter renumbered glyphs, every text-showing operator in every content stream would also need rewriting.
- GID-preserving means: keep the original glyph indices, just remove the data for unused glyphs (zero-out unused glyf entries for TTF, drop unused CharStrings for CFF).
- For TTF: the glyf and loca tables are rebuilt with empty entries for unused GIDs.
- For OTF / CFF: the CFF table is parsed, the CharStrings INDEX is rewritten to keep only used glyphs (or, for CID-keyed CJK fonts, only used CIDs), the FDSelect / FDArray are pruned, and the CFF blob is re-emitted.
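For the TTF case, the GID-preserving idea can be sketched as below. This is a simplified illustration, not phppdf's subsetter: `rebuildGlyfLoca()` is a hypothetical function, it assumes `$glyphs` is a list indexed 0..n-1, it always keeps GID 0 (.notdef), and it ignores composite glyphs, which reference other GIDs that would also need to be kept.

```php
<?php
// Minimal sketch: $glyphs holds the raw glyf record of every glyph indexed
// by GID, $usedGids lists the GIDs the document actually needs.

/** @return array{glyf:string, loca:int[]} */
function rebuildGlyfLoca(array $glyphs, array $usedGids): array
{
    $keep = array_fill_keys($usedGids, true);
    $keep[0] = true;                 // assumption: always keep .notdef (GID 0)
    $glyf = '';
    $loca = [0];
    foreach ($glyphs as $gid => $data) {
        if (isset($keep[$gid])) {
            $glyf .= $data;
            // Pad so that the next glyph starts on a 4-byte boundary.
            $glyf .= str_repeat("\0", (4 - strlen($glyf) % 4) % 4);
        }
        // An unused glyph adds nothing, so its loca entry repeats the previous
        // offset: a zero-length record, but the GID slot still exists.
        $loca[] = strlen($glyf);
    }
    return ['glyf' => $glyf, 'loca' => $loca];
}
```

Because loca still has one entry per original glyph (plus the trailing end offset), glyph indices already written into content streams remain valid.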
Subsetting runs at output() time, after the full document has accumulated which glyphs are used.
Parsing a custom TrueType / OpenType file (TtfParser::parse) is a pure function of the file bytes, and the resulting ParsedTtf is immutable. To avoid re-reading and re-parsing the same font file for every Document, ParsedTtfCache memoizes parsed fonts process-wide, keyed by realpath | filesize | mtime. A cache hit skips both the read and the parse and returns the shared instance - output stays byte-identical because the ParsedTtf is the same object the parser would have produced.
The cache lives for the life of the PHP process. On shared web hosting (mod_php / php-fpm) static state is reset between requests, so the cache never accumulates across requests and adds no net memory there; it only helps long-lived CLI batch jobs and queue workers. A worker that parses many distinct fonts over its lifetime can reset the cache with:
ParsedTtfCache::clear();
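The memoization scheme itself is small. The sketch below shows the idea with a hypothetical `SketchFontCache` class and a caller-supplied parse callback; it does not mirror the real ParsedTtfCache API beyond the `realpath | filesize | mtime` key and the `clear()` method described above.

```php
<?php
// Sketch of the memoization idea only. SketchFontCache and its callable
// parameter are hypothetical.

final class SketchFontCache
{
    /** @var array<string, object> parsed fonts keyed by realpath|filesize|mtime */
    private static array $entries = [];

    public static function get(string $path, callable $parse): object
    {
        $real = realpath($path);
        if ($real === false) {
            throw new RuntimeException("Font file not found: $path");
        }
        // A changed size or mtime produces a new key, so edited files are re-parsed.
        $key = $real . '|' . filesize($real) . '|' . filemtime($real);
        if (!isset(self::$entries[$key])) {
            // Cache miss: read and parse once, then share the instance.
            self::$entries[$key] = $parse((string) file_get_contents($real));
        }
        return self::$entries[$key];
    }

    public static function clear(): void
    {
        self::$entries = [];
    }
}
```

Sharing one instance is only safe because the parsed result is immutable; a mutable result would need to be cloned on every hit.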
Known limitations of the custom font pipeline:
- TrueType Collections (.ttc) - a .ttc is a container with multiple fonts; not supported.
- Variable fonts (fvar / gvar) - not supported.
- Kerning (kern table or GPOS) - all glyph advances are based on the un-kerned widths.
- Ligatures and complex shaping (GSUB) - no Latin fi / fl ligatures, no Arabic / Indic shaping, no Devanagari reordering.
- Right-to-left direction.
- Identity-V (vertical writing).
MIT licensed. Source on GitHub.