Rendering of non-bmp characters #1553

Open
sisve opened this issue Sep 26, 2017 · 7 comments

sisve commented Sep 26, 2017

Summary:

This issue started out as a bug report about Cpdf::utf8toUtf16BE, but has morphed into a larger problem of rendering non-BMP characters.

The following issues have been found:

  • Cpdf::utf8toUtf16BE calculates the high surrogate incorrectly. Fixing this produces proper surrogate pairs in the PDF, so non-BMP characters can be copied/pasted into other applications and show up correctly.
  • php-font-lib needs to be upgraded to 0.5.1 to get support for format 12 cmaps.
  • php-font-lib's TrueType/File.php's getUnicodeCharMap() needs to be patched to prefer the Unicode UCS-4 cmap subtable (platformID=3, platformSpecificID=10).
  • We need to process all text objects to change all non-BMP characters into two-byte sequences that the bfchar maps to the proper codepoints. For example, replacing the D835 DD38 in the string object with 0080 and adding a bfchar entry that points to the proper codepoint (<0080> <D835DD38>), as in the fragment below.
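
For illustration, that last bfchar entry would sit in a ToUnicode CMap fragment roughly like this (syntax following the CMap excerpts quoted later in this thread):

1 beginbfchar
<0080> <D835DD38>
endbfchar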

The steps above produce a PDF where non-BMP characters are rendered as an empty box (a missing glyph). They can be copied/pasted into other Unicode-capable programs and are fully readable.

The remaining problem is the proper rendering of the glyphs so that they are visible within the PDF file.


Original post:

From Wikipedia: UTF-16, U+10000 to U+10FFFF

  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0x000000..0x0FFFFF.
  • The top ten bits (a number in the range 0x0000..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
  • The low ten bits (also in the range 0x0000..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.

The current implementation:

$c -= 0x10000;
$w1 = 0xD800 | ($c >> 0x10);
$w2 = 0xDC00 | ($c & 0x3FF);
$out .= chr($w1 >> 0x08) . chr($w1 & 0xFF) . chr($w2 >> 0x08) . chr($w2 & 0xFF);

The calculation of $w1 right shifts by 0x10=16 bits, which means we only keep 4 bits of the 20-bit value in $c, and those bits are not positioned correctly. The correct shift is 0x0A=10 bits.
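
For comparison, a minimal, self-contained sketch of the corrected calculation (shifting by 0x0A instead of 0x10), applied to a single codepoint; this is just the arithmetic from the report, not a dompdf patch:

<?php
// Corrected surrogate-pair calculation for one codepoint above U+FFFF.
$c = 0x1D516;                  // example: mathematical fraktur capital S
$out = '';

$c -= 0x10000;                 // leave a 20-bit value
$w1 = 0xD800 | ($c >> 0x0A);   // high surrogate: top ten bits, 0xD800..0xDBFF
$w2 = 0xDC00 | ($c & 0x3FF);   // low surrogate: low ten bits, 0xDC00..0xDFFF
$out .= chr($w1 >> 0x08) . chr($w1 & 0xFF) . chr($w2 >> 0x08) . chr($w2 & 0xFF);

echo strtoupper(bin2hex($out)); // D835DD16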

Examples:

  • 𝔖 (U+1D516 mathematical fraktur capital s) should be D835 DD16, but is D800 DD16.
  • 𝔰 (U+1D530 mathematical fraktur small s) should be D835 DD30, but is D800 DD30.
@bsweeney bsweeney added this to the dompdf-next milestone Sep 26, 2017
bsweeney (Member) commented:

This is interesting and we'll definitely take a look. Some users are reporting incorrect character representations in the rendered PDFs. I wonder if this is the cause (or at least part of the problem).

sisve (Author) commented Sep 26, 2017

I have yet to pin down all the issues I have with non-BMP characters. First, neither of the two example characters I used has glyphs in DejaVuSans.ttf. I switched over to 𝔸 (U+1D538 mathematical double-struck capital A), which exists in DejaVuSans.

I patched utf8toUtf16BE, and the correct byte sequence for 𝔸 is now generated.

  • Chrome shows it as two whitespace characters (with font subsetting). I can copy/paste both of these into a text editor and get the original character.
  • Chrome shows it as two boxes (without font subsetting). I can still copy/paste these.
  • Mac's Preview shows two boxes. I cannot copy/paste from Preview.
  • Adobe Acrobat Reader DC shows two boxes. I can copy/paste those two and get the original character.

It seems that DejaVuSans.ufm only contains glyphs for the BMP. It was generated using php-font-lib 0.5.0, where only format 4 of the cmap was supported. Upgrading to 0.5.1 adds support for format 12.

DejaVuSans.ttf's cmap has 5 subtables:

format  length  platformId  encodingId  Meaning-ish
4       1960    0           3           Unicode 2.0 or later semantics (BMP only)
12      3388    0           10          (Unknown)
6       622     1           0           Backward compatibility with something Mac-related
4       1960    3           1           Unicode BMP (UCS-2)
12      3388    3           10          Unicode UCS-4

Adding support for format 12 opens up the number of glyphs from the 1960 in the BMP to 3388 in total. The ufm file needs to be regenerated, and then you run into a loop in php-font-lib's TrueType/File.php that needs to be patched to use the subtable with platformId=3, encodingId=10.
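
The shape of that File.php patch could look roughly like the sketch below. The subtable array layout (keys platformID / platformSpecificID) and the helper name are assumptions for illustration, not php-font-lib's actual internals:

<?php
// Illustrative only: choose which cmap subtable to read, preferring the
// Unicode UCS-4 subtable (3/10) over the BMP-only one (3/1).
function pickUnicodeSubtable(array $subtables): ?array
{
    $preference = [
        [3, 10], // Unicode UCS-4, format 12, includes non-BMP codepoints
        [3, 1],  // Unicode BMP (UCS-2), format 4
        [0, 10],
        [0, 3],
    ];

    foreach ($preference as [$platformID, $specificID]) {
        foreach ($subtables as $subtable) {
            if ($subtable['platformID'] == $platformID
                && $subtable['platformSpecificID'] == $specificID) {
                return $subtable;
            }
        }
    }

    return null;
}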

With this done you get a ufm file containing "U 120120 ; WX 741 ; N u1D538 ; G 5495", the line for 𝔸 ... but still no glyph is visible.

The same loop also exists in Cpdf.php, but there no other subtables are available; they are lost when the font is written to disk a few lines above the loop.

This is where I give up ... for now.

To summarize:

  1. Upgrade php-font-lib to 0.5.1
  2. Patch File.php to prioritize the 3/10 cmap.
  3. Regenerate the ufm file.
  4. Feel the despair when the glyph is still missing.

bsweeney (Member) commented:

oh boy

sisve (Author) commented Sep 27, 2017

Some further debugging shows that the generated font for the subsetting has an odd CIDToGIDMap where the CID is 54584=0xD538. Your favorite calculator will show that 120120 & 0xFFFF = 54584, which means that the generation of the cid-to-gid map is wrong-ish.

I tracked it down to https://github.com/dompdf/dompdf/blob/master/lib/Cpdf.php#L2863-L2866 which follows the specification. From "Table 117 - Entries in a CIDFont dictionary":

(Optional; Type 2 CIDFonts only) A specification of the mapping from CIDs to glyph indices. If the value is a stream, the bytes in the stream shall contain the mapping from CIDs to glyph indices: the glyph index for a particular CID value c shall be a 2-byte value stored in bytes 2 × c and 2 × c + 1, where the first byte shall be the high-order byte. If the value of CIDToGIDMap is a name, it shall be Identity, indicating that the mapping between CIDs and glyph indices is the identity mapping. Default value: Identity.

The obvious issue is that two bytes aren't enough to represent the character id 120120=0x1D538. If I disable the generation of the CIDToGIDMap entirely, none of my glyphs show up, which is a reasonable outcome. This indicates, I believe, that the correct bytes are there (since I patched utf8toUtf16BE), but that the map is wrong. I tried uploading my minimal font to http://torinak.com/font/lsfont.html and it tells me, in a friendly red color: "Decoding error". My rendered input was "CHAR='𝔸'" and the glyphs for the ASCII characters are there, but the 𝔸 seems to have been broken down into two glyphs at D538 and FFFF. That looks like a parsing error somewhere.
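
The 16-bit wrap-around itself is easy to demonstrate in isolation; anything above 0xFFFF loses its high bits when packed into a two-byte big-endian field (a standalone illustration, not the dompdf code path):

<?php
// Packing a CID above 0xFFFF into two bytes silently drops the high bits.
$cid = 0x1D538;                    // 120120, the codepoint of 𝔸
$packed = pack('n', $cid);         // 'n' = unsigned 16-bit, big-endian
printf("%d -> 0x%04X\n", $cid, unpack('n', $packed)[1]); // 120120 -> 0xD538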

I'm stopping my debugging at this point; I'm way in over my head in the PDF standard and how non-BMP characters should be handled. It could be that the entire font generation needs to change. It could be that I didn't chant enough while debugging. I'm afraid that, by alt-tabbing back and forth between byte arrays and PHP code, I will awaken the unicode-consuming pdf-monster that resides in the dark areas of the specification. It haunts me. I can hear it breathing, hiding, waiting...

So, to summarize everything so far: the original issue is about fixing utf8toUtf16BE. Doing that is enough to have correct byte codes in the PDF, so that you can copy/paste from it and get the correct characters. The actual visibility of the glyphs is another matter, which I leave for more experienced PDF gurus.

bsweeney (Member) commented:

We shall study the tomes and confer with the fontly spirits to discern where the breakdown in reality is occurring.

I greatly appreciate your efforts to bring clarity to the situation (and your knack for storytelling).

@bsweeney bsweeney modified the milestones: dompdf-next, 0.9.0 Oct 4, 2017
sisve (Author) commented Sep 21, 2020

It has taken some time, but I come bringing more information! Quotes below are taken from PDF Reference, sixth edition, version 1.7. All testing is done in Chrome with font subsetting enabled.

Beginning with PDF 1.2, a string may be shown in a composite font that uses multiple-byte codes to select some of its glyphs. In that case, one or more consecutive bytes of the string are treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in data structure called CMap, described in Section 5.6, "Composite Fonts"

Source: Page 408, 5.3.2 Text-Showing Operators

When the current font is composite, the text-showing operators behave differently than with simple fonts. For simple fonts, each byte of a string to be shown selects one glyph, whereas for composite fonts, a sequence of one or more bytes can be decoded to select a glyph from the descendant CIDFont. [...]

Source: Page 433, 5.6 Composite Fonts

[...] CIDs range from 0 to a maximum value that is subject to an implementation limit (see Table C.1 on page 992)

Source: Page 434, 5.6.1 CID-Keyed Fonts Overview. The limit mentioned is 65535.

Those quotes paint the picture pretty well. The string object ends up containing D835 DD38, which is the correct UTF-16BE encoding of 𝔸. However, the PDF standard interprets these as two character codes, D835 and DD38. These are surrogates (and together a surrogate pair) and are invalid as individual references to glyphs.

This touches the cmap. The current one has the following hard-coded entries (in Cpdf.php):

1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfrange
<0000> <FFFF> <0000>
endbfrange
  • begincodespacerange and endcodespacerange defines codespace ranges - the valid input character code ranges - by specifying a pair of codes of some particular length giving the lower and upper bounds of each range; see "CMap Mapping" on page 453.
  • beginbfchar and endbfchar defines mappings of individual input character codes to character codes or character names in the associated font. beginbfrange and endbfrange do the same for ranges of input codes.

Source: Page 452, 5.6.4 CMaps

That cmap states that we will only allow 0000 to FFFF as input values, but that doesn't seem to be the problem here, since those also happen to be the proper values according to the ToUnicode Mapping File Tutorial. Furthermore, the cmap states that all values within the range 0000 to FFFF should map to the range starting at 0000 (which ends up ending at FFFF). This means that the surrogate pair (D835 + DD38) will be kept as-is.

There are no valid real BMP characters that can end up looking like a surrogate character, which is why copy/pasting this still works - something in the pasting process realizes that we've broken a UTF-16 codepoint into surrogates and merges them back together. (I'm guessing at this part.)

With this newfound insight, let's add a beginbfchar entry redirecting these two surrogates into two dollar signs ($, U+0024). This does indeed work; my PDF contains two dollar signs.

1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfrange
<0000> <FFFF> <0000>
endbfrange
2 beginbfchar
<D835> <0024>
<DD38> <0024>
endbfchar

This may suggest mapping one of them to the target codepoint, <D835> <1D538>, but this renders a ᵓ (U+1D53 Modifier Letter Small Open O), which indicates that only the first four hex digits of that target are used. Trying to abuse beginbfrange with <D835> <D835> <1D538> ends up with a 픸 (U+D538 Hangul Syllable Pyik), which indicates that the last four hex digits were used. Fun!

It looks like I need to figure out why beginbfchar and beginbfrange cannot point to non-BMP codepoints. It should be possible according to the ToUnicode Mapping File Tutorial. It's probably a cmap or font setting I am missing.

I think the final solution would require us to 1) preprocess all strings to build a list of all used glyphs, 2) generate a mix of beginbfrange and beginbfchar entries pointing to the used glyphs, and 3) modify all strings to output references to the cmap entries. I think we will at least keep the ASCII range untouched so that there's still some sanity in debugging the output.
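
A rough, illustrative sketch of what that remapping could look like (assuming the mbstring extension; it only covers the string rewriting and the bfchar bookkeeping, not the glyph-selection side of the font):

<?php
// Assign each non-ASCII codepoint a private two-byte code, rewrite the
// string with those codes, and collect bfchar lines mapping them back
// to the UTF-16BE form of the original codepoint.
function remapCodepoints(array $codepoints): array
{
    $nextCode = 0x0080;   // keep 0x00-0x7F untouched for debuggability
    $assigned = [];       // original codepoint => assigned two-byte code
    $bytes    = '';       // what ends up in the PDF string object
    $bfchar   = [];       // lines for the ToUnicode CMap

    foreach ($codepoints as $cp) {
        if ($cp < 0x80) {
            $bytes .= pack('n', $cp);
            continue;
        }
        if (!isset($assigned[$cp])) {
            $assigned[$cp] = $nextCode++;
            $utf16be = strtoupper(bin2hex(mb_convert_encoding(
                mb_chr($cp, 'UTF-8'), 'UTF-16BE', 'UTF-8')));
            $bfchar[] = sprintf('<%04X> <%s>', $assigned[$cp], $utf16be);
        }
        $bytes .= pack('n', $assigned[$cp]);
    }

    return [$bytes, $bfchar];
}

// "A" followed by U+1D538 yields the bfchar line <0080> <D835DD38>.
[$bytes, $bfchar] = remapCodepoints([0x41, 0x1D538]);
echo implode("\n", $bfchar), "\n";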

@sisve sisve changed the title Bad calculation of high surrogate in Cpdf::utf8toUtf16BE Rendering of non-bmp characters Sep 22, 2020
sisve (Author) commented Sep 22, 2020

The bfchar and bfrange can be used to target non-BMP codepoints. I totally missed chapter 1.5, Mapping Examples for Non-BMP Code Points, page 5 in the ToUnicode Mapping File Tutorial. The correct format is to use the UTF-16BE encoding of the target codepoint, so the entry becomes <D835> <D835DD38>. This renders as an empty square (the unmapped character glyph), so I still believe there's a cid-to-gid problem somewhere.

I have patched my filterText to transform all non-ASCII codepoints (128+) into custom byte sequences, and to add mappings from those byte sequences back to the proper codepoints in the bfchar. (I replace my character with the bytes "00 80", and have added <0080> <D835DD38> to the bfchar.) This works too; I still get a square box that can be copied/pasted into a text editor as the correct Unicode character. This looks like the way to map non-BMP characters into two-byte sequences we can use in the string objects.

I'm mostly guessing at the current problem, but I suspect the cid-to-gid map since it involves characters and glyphs. I have no proof of this accusation ... at this time.

@bsweeney bsweeney modified the milestones: 0.9.0, dompdf-next Nov 16, 2020