Rendering of non-bmp characters #1553
This is interesting and we'll definitely take a look. Some users are reporting incorrect character representations in the rendered PDFs. I wonder if this is the cause (or at least part of the problem).
I have yet to pin down all the issues I have with non-bmp characters. First, neither of the two example characters I used has a glyph in DejaVuSans.ttf. I switched over to 𝔸 (U+1D538 mathematical double-struck capital a), which exists in DejaVuSans. I patched utf8toUtf16BE, and the correct byte sequence for 𝔸 is now generated.
It seems that DejaVuSans.ufm only contains the glyphs for the bmp. This one is generated using php-font-lib 0.5.0, where only format 4 of the cmap was supported. Upgrading to 0.5.1 adds support for format 12. DejaVuSans.ttf's cmap has 5 subtables:
Adding support for format 12 raises the number of available glyphs from the 1960 in the BMP to 3388 in total. The ufm file needs to be regenerated, and then you run into a loop in php-font-lib's TrueType/File.php that needs to be patched to use the subtable with platformId=3 encodingId=10. With this done you get a ufm file that also covers glyphs outside the BMP.
The mentioned loop also exists in Cpdf.php, but there no other subtables are available. They are lost when the font is written to disk a few lines above the loop. This is where I give up ... for now. To summarize;
oh boy
Some further debugging shows that the font generated for the subsetting has an odd CIDToGIDMap where the CID is 54584=0xD538. Your favorite calculator will show that 120120-0x10000=54584 (only the low 16 bits of 0x1D538 survive), which means that the generation of the cid-to-gid map is wrong-ish. I tracked it down to https://github.com/dompdf/dompdf/blob/master/lib/Cpdf.php#L2863-L2866 which follows the specification. From "Table 117 - Entries in a CIDFont dictionary":
The obvious issue is that two bytes aren't enough to represent the character id 120120=0x1D538. If I disable the generation of the CIDToGIDMap entirely, none of my glyphs show up, which is a reasonable outcome. This indicates, I believe, that the correct bytes are there (since I patched utf8toUtf16BE), but that the map is wrong.
I tried uploading my minimal font to http://torinak.com/font/lsfont.html, and it tells me in a friendly red color: "Decoding error". My rendered input was "CHAR='𝔸'" and the glyphs for the ascii characters are there, but the 𝔸 seems to have been broken down into two glyphs at D538 and FFFF. That looks like a parsing error somewhere.
I'm stopping my debugging at this point; I'm in way over my head in the pdf standard and how non-bmp characters should be handled. It could be that the entire font generation needs to change. It could be that I didn't chant enough while debugging. I'm afraid that, by alt-tabbing back and forth between byte arrays and php code, I will awaken the unicode-consuming pdf-monster that resides in the dark areas of the specification. It haunts me. I can hear it breathing, hiding, waiting...
So, to summarize everything so far: the original issue is about fixing utf8toUtf16BE. Doing that is enough to have correct byte codes in the pdf, so that you can copy/paste from it and get the correct characters. The actual visibility of the glyphs is another matter, which I leave for more experienced pdf gurus.
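The arithmetic behind the odd CID can be checked in a few lines. This is only a Python sketch of the numbers quoted in the comment above, not the dompdf code; `pack_cid` is a hypothetical helper that assumes a two-byte big-endian field.

```python
import struct

codepoint = 0x1D538  # U+1D538, decimal 120120

# Keeping only the low 16 bits gives the CID seen in the generated font.
truncated = codepoint & 0xFFFF
print(truncated, hex(truncated))  # 54584 0xd538


def pack_cid(cid: int) -> bytes:
    """Pack a value into two big-endian bytes (hypothetical helper)."""
    return struct.pack(">H", cid)


print(pack_cid(truncated).hex())  # d538

# The full code point does not fit into two bytes at all.
try:
    pack_cid(codepoint)
except struct.error as exc:
    print("does not fit in two bytes:", exc)
```

For this code point, truncating to 16 bits and subtracting 0x10000 give the same result, which is why the calculator check in the comment works.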
We shall study the tomes and confer with the fontly spirits to discern where the breakdown in reality is occurring. I greatly appreciate your efforts to bring clarity to the situation (and your knack for storytelling).
It has taken some time, but I come bringing more information! Quotes below are taken from PDF Reference, sixth edition, version 1.7. All testing is done in Chrome with font subsetting enabled.
Source: Page 408, 5.3.2 Text-Showing Operators
Source: Page 433, 5.6 Composite Fonts
Source: Page 434, 5.6.1 CID-Keyed Fonts Overview. The limit mentioned is 65535.
Those quotes paint the picture pretty well. The string object ends up containing D835 DD38, which is the correct UTF-16BE encoding of 𝔸. However, the PDF standard interprets these as two characters, D835 and DD38. These are surrogates (and together a surrogate pair) and are invalid as individual references to glyphs.
This touches the cmap. The current one has the following hard-coded entries (in Cpdf.php):
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfrange
<0000> <FFFF> <0000>
endbfrange
Source: Page 452, 5.6.4 CMaps
That cmap states that only 0000 to FFFF are allowed as input values, but that doesn't seem to be the problem here, since these also happen to be the proper values according to the ToUnicode Mapping File Tutorial. Furthermore, the cmap states that all values within the range 0000 to FFFF should map to the glyphs for the range starting at 0000 (and therefore ending at FFFF). This means that the surrogate pair (D835 + DD38) will be kept as-is. No valid BMP character can look like a surrogate, which is why copy/pasting still works: something in the pasting process realizes that a UTF-16 codepoint has been broken into surrogates and merges them back together. (I'm guessing at this part.)
With this newfound insight, let's add a beginbfchar entry, redirecting these two surrogates into two dollar signs (U+0024):
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfrange
<0000> <FFFF> <0000>
endbfrange
2 beginbfchar
<D835> <0024>
<DD38> <0024>
endbfchar
This may open up the idea to map one of them to the target codepoint.
It looks like I need to figure out why the beginbfchar and beginbfrange cannot point to non-bmp codepoints. It should be possible according to the ToUnicode Mapping File Tutorial. It's probably a cmap or font setting I am missing.
I think the final solution would require us to 1) preprocess all strings to build a list of all used glyphs, 2) generate a mix of beginbfrange and beginbfchar entries pointing to the used glyphs, and 3) modify all strings to output references to the cmap entries. I think we will at least keep the ASCII range untouched so that there's still some sanity in debugging the output.
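To make the two-byte splitting described in this comment concrete, here is a small Python sketch. It assumes a fixed two-byte code width, as the `<0000> <FFFF>` codespace range implies; `split_into_codes` and `is_surrogate` are hypothetical names used only for illustration.

```python
def split_into_codes(data, width=2):
    """Split a byte string into fixed-width big-endian character codes."""
    return [
        int.from_bytes(data[i:i + width], "big")
        for i in range(0, len(data), width)
    ]


def is_surrogate(code):
    """True if the 16-bit code lies in the UTF-16 surrogate range."""
    return 0xD800 <= code <= 0xDFFF


encoded = "\U0001D538".encode("utf-16-be")  # the character U+1D538
codes = split_into_codes(encoded)

print([f"{c:04X}" for c in codes])          # ['D835', 'DD38']
print(all(is_surrogate(c) for c in codes))  # True
```

Both codes fall inside the codespace range, so a reader following the cmap sees two separate characters, neither of which is a real BMP character.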
The bfchar and bfrange entries can be used to target non-bmp codepoints. I totally missed chapter 1.5, Mapping Examples for Non-BMP Code Points, page 5 in the ToUnicode Mapping File Tutorial. The correct format is to use the UTF-16BE encoding of the target codepoint.
I have patched my filterText to transform all non-ascii codepoints (128+) to custom byte sequences, and I add maps from these byte sequences back to the proper codepoints in the bfchar section. (I replace my character with the bytes "00 80" and have added a matching bfchar entry.)
I'm mostly guessing at the current problem, but I suspect the cid-to-gid map since it involves characters and glyphs. I have no proof of this accusation ... at this time.
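As a rough illustration of the mapping format, the following Python sketch builds a bfchar block whose targets are written as the UTF-16BE encoding of the codepoint. The function names and the choice of 0x0080 as the first custom code are assumptions made for the example; this is not the patched filterText.

```python
def bfchar_entry(code, char):
    """Format one bfchar line: <source code> <UTF-16BE hex of target>."""
    target = char.encode("utf-16-be").hex().upper()
    return f"<{code:04X}> <{target}>"


def build_bfchar_block(mapping):
    """Build a beginbfchar/endbfchar block from {code: character}."""
    lines = [f"{len(mapping)} beginbfchar"]
    lines += [bfchar_entry(code, ch) for code, ch in sorted(mapping.items())]
    lines.append("endbfchar")
    return "\n".join(lines)


# Custom code 0x0080 stands in for U+1D538 in the page content.
print(build_bfchar_block({0x0080: "\U0001D538"}))
```

For a BMP target the same helper emits a four-digit value, so `bfchar_entry(0x0081, "$")` gives `<0081> <0024>`.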
Summary:
This issue started out as a bug report about Cpdf::utf8toUtf16BE, but has morphed into the larger problem of rendering non-BMP characters.
The following issues have been found:
<0080> <D835DD38>
).
The steps above produce a pdf where non-bmp characters are rendered as an empty box (a missing glyph). These can be copy/pasted into other unicode-capable programs and be fully readable.
The remaining problem is the proper rendering of the glyphs, so that they are visible within the pdf file.
Original post:
From Wikipedia: UTF-16, U+10000 to U+10FFFF
The current implementation:
The calculation of $w1 right-shifts by 0x10=16 bits, which means that we keep only 4 bits of the 20-bit value in $c, and those bits are not positioned correctly. The correct shift is 0x0A=10.
Examples:
𝔖 (U+1D516 mathematical fraktur capital s) should be D835 DD16, but is D800 DD16.
𝔰 (U+1D530 mathematical fraktur small s) should be D835 DD30, but is D800 DD30.
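The shift error described above can be reproduced with a short Python sketch. It follows the surrogate formula from the UTF-16 definition; the `shift` parameter exists only so the faulty 16-bit shift can be compared with the correct 10-bit one, and the function name is made up for this example rather than taken from Cpdf.php.

```python
def utf16be_surrogates(codepoint, shift=10):
    """Return the surrogate pair for a non-BMP code point as hex text."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    c = codepoint - 0x10000     # 20-bit value
    w1 = 0xD800 | (c >> shift)  # high surrogate: top 10 bits when shift=10
    w2 = 0xDC00 | (c & 0x3FF)   # low surrogate: bottom 10 bits
    return f"{w1:04X} {w2:04X}"


print(utf16be_surrogates(0x1D516))            # D835 DD16
print(utf16be_surrogates(0x1D516, shift=16))  # D800 DD16 (faulty shift)
print(utf16be_surrogates(0x1D530))            # D835 DD30
```

With a shift of 16 the high surrogate collapses to D800 for every code point below U+20000, which matches the wrong values listed in the examples.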