Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cidmap parsing problem for Source Han Sans #1582

Open
medicalwei opened this issue Jul 28, 2014 · 24 comments
Open

cidmap parsing problem for Source Han Sans #1582

medicalwei opened this issue Jul 28, 2014 · 24 comments

Comments

@medicalwei
Copy link
Contributor

This bug is separated from #1534 since opening the .otf directly isn't my case.

The cidmap parsing is incorrect for Source Han Sans. The links are cidmap of upstream and cidmap I manually converted. (which is parsed correctly)

Procedures

  1. Open Source Han Sans cidfont.ps.TWHK of any weight.
  2. FlattenByCMap with one of the test cases of below CIDMap Dropbox urls.
  3. Force Encoding to ISO 10646-1

What I see

  • Some characters are missing from the codepoints. For example 李 U+674E
  • Some of the characters are appended after codepoint 0x2F9DF

CIDMaps

Upstream (test case):
https://www.dropbox.com/s/05uj3touegoylcf/UniSourceHanSansTWHK-UTF32-H-Upstream

Converted and working reference:
https://www.dropbox.com/s/cr049gzowd9nw2n/UniSourceHanSansTWHK-UTF32-H-ShouldBe

The upstream source is from https://github.com/adobe-fonts/source-han-sans

@frank-trampe
Copy link
Contributor

@medicalwei, I don't entirely understand. Is FontForge at fault, or is the input typeface at fault?

@medicalwei
Copy link
Contributor Author

The cidmap has a shorthand which can be mapped only one character in one line rather than a range of characters. I think it is a missing compliation issue of Fontforge. But I am wondering if it is a standard that Fontforge should follow, so I don't know which is at fault.

@frank-trampe
Copy link
Contributor

So, to clarify, the upstream test case includes a map that individual character ranges, and FontForge fails to parse those since it expects to have an additional argument per line for ranges?

@frank-trampe
Copy link
Contributor

I did some digging, and FontForge supports the begincodespacerange listing type but not the begincidchar listing type in CMap files. This seems to be the root of the problem; I'll take a look.

@frank-trampe
Copy link
Contributor

@medicalwei I added support for the cidchar syntax in branch cidchar_hack_1, but opening Han Sans seems not to call ParseCMap at all. Can you provide more specific steps?

@Toufukun
Copy link

Toufukun commented May 5, 2015

Hey I'm facing similar problems too. When I'm using CID\FlattenByCMap with UniSourceHanSansCN-UTF16-H provided in Resources folder of Source Han Sans, fontforge alerted me that "Encoding Too Large".

I checked the code and found this:

    if ( cmap->groups[cmt_cid].ranges[i].last>0x100000 ) {
        ff_post_error(_("Encoding Too Large"),_("Encoding Too Large"));
        cmapfree(cmap);
return( false );
    }

It seems that the code ranges can't exceed 0x100000. As a matter of fact, those characters whose codes go out of BMP range are using 8-digit hexadecimal numbers showing their codes (seems to be UTF-16), such as <d83cdd00>. This should be fixed.

I simply deleted those codes in CMap, both manually and by regex, and tried again. Fontforge didn't pop up any message box, but after flattening was done, it crashed when I was scrolling the glyph list.

I tried to save the sfd before I scroll to avoid crash. It works. But the problem is no matter I use FlattenByCMap or simply Flatten, some codepoints have no glyph. I think this may be because in CID fonts multiple Unicode can be mapped into the same CID, for example both u+7406 and u+f9e4 are mapped into CID 26906. The latter one is in compatibility block due to some historical problems. When I flatten the font, this glyph is mapped to u+f9e4 only and u+7406 got no glyph. This is unacceptable because the former one (u+7406) is much more common and this will lead to character missing and font fallback. It will be better if it duplicates those multi-mapped glyphs when the font is flattened.
figure 2

@jeffska
Copy link

jeffska commented Dec 18, 2015

@frank-trampe Do you happen to have that branch around still? I've run into the same issue and I can verify the fix. Otherwise I'll attempt it myself and submit a pull request.

@zerng07
Copy link

zerng07 commented Jun 26, 2016

Any updates on this issue?

@frank-trampe
Copy link
Contributor

The branch is back. See #2827.

@frank-trampe
Copy link
Contributor

Hi, guys. We're going to be entering a feature embargo shortly in preparation for a new release. It would be nice to get some testing and feedback on that branch so that we can get it merged.

(Paging @zerng07, @jeffska, @medicalwei.)

@frank-trampe
Copy link
Contributor

@Toufukun, I did not quite understand your issue, and I don't have time to go digging right now. In what file do you find that limit enforced? Is that the only place? The limit is on Unicode value, not encoding size, right? Is there no obvious reason for enforcing the limit as it is? What would be an appropriate limit?

@zerng07
Copy link

zerng07 commented Sep 15, 2016

By using this branch, opening the source PS file (cidmap.ps.xx) following flatenbycmap with UTF32 upstream cidmap (such as UniSourceHanSansTW-UTF32-H), 李 U+674E and 理 u+7406 are not missing. [1] [2]

However, opening the source PS file following flatenbycmap with UTF16 upstream cidmap (such as UniSourceHanSansTW-UTF16-H), "Encoding Too Large" problem described by Toufukun was occurred.[3]

  1. https://github.com/adobe-fonts/source-han-sans/tree/master/Regular
  2. https://github.com/adobe-fonts/source-han-sans
  3. https://github.com/adobe-fonts/source-han-sans/tree/release/Resources

@frank-trampe
Copy link
Contributor

@zerng07, that's not a regression, though, right?

@zerng07
Copy link

zerng07 commented Sep 15, 2016

No, that's not. The "Encoding Too Large" problem which encountered by flatenbycmap with UTF16 upstream cidmap had existed for sometime, not introduced by this branch.

@Toufukun
Copy link

@frank-trampe I haven't been working on these things for quite a long time so maybe I can't remember very clearly. And I have Windows only. I don't know how to try the latest version. I just remember none of these files Adobe provided in the Resource would work: either the font cannot be flatten correctly, or the file cannot be recognized as a valid Cidmap or CMap file.

Some Han characters are encoded repeatly due to some compatibility reasons like 理 is encoded at u+7406 and u+f9e4 but the same CID (like CID 26906). My issue is that when you're trying to flatten a Source Han Sans font, only u+f9e4 gets the glyph but u+7406 is empty.

@jtanx
Copy link
Contributor

jtanx commented Sep 25, 2016

@Toufukun try this build: https://ci.appveyor.com/project/fontforge/fontforge/build/1.0.192/artifacts

As far as I know, the cidmap file type is a FontForge specific file type (I don't know how it's generated or made). The Adobe CMap files are used by the 'Flatten by CMap' option. Note on Windows, I think you have to manually specify the cidmap file path which should be under share/fontforge

@samizzo
Copy link

samizzo commented May 26, 2017

I'm having a similar issue. I'm using Source San Hans for a game. I can flatten using the UTF-32 cmap. Then I use a Fontforge script to remove all the unused characters, then I save the font to a ttf. However after that process, some characters are missing and no longer display. The characters that are missing all seem to be encoded as multiple Unicode characters with the same CID.

@HinTak
Copy link

HinTak commented Jun 7, 2017

I have a possibly related problem - #3080 - while the bulk of it seems to be that detached glyphs are trashing the encoding table, a few glyphs are missing from the conversion - that includes 李 U+674E being missing.

@HinTak
Copy link

HinTak commented Jun 9, 2017

Yes, it is the same as part of #3085 , and #3080 . I was using CIDFlatten, which unfortunately cannot cope with multiple encodings to the same glyph. It would be nice if MultipleEncodingsToReferences() does the right thing - it currently does not.

@HinTak
Copy link

HinTak commented Jun 9, 2017

(Sorry for repeating the cut-and-paste)

It would be nice if MultipleEncodingsToReferences() does the right thing - it currently does not.

What's happening is that the Source CJK fonts / Noto CJK fonts have about 400 glyphs having more than 2 coding points; actually about a dozen have 3. All but one is silently dropped by fontforge. So you get about 400 coding points missing when re-encoding.

The worst part of fontforge behaviour is that it uses the last one it sees as authoritative - that's often the CJK Compat variant range, rather than the lower CJK Unified region. So the lower and more often-used CJK Unified code range ended up having about 400 glyphs missing.

@HinTak
Copy link

HinTak commented Jun 9, 2017

The glyph 李 for U+674E in the CJK Unified region is also encoded as U+F9E1 in the CJK Compat-Ideograph region; and unfortunately fontforge takes the latter as authoritative...

@HinTak
Copy link

HinTak commented Jun 9, 2017

My freetype-py script to fix fontforge's encoding problem is up at

https://github.com/HinTak/freetype-py/blob/fontval-diag/examples/subfonts-script-generate.py

@rardz
Copy link

rardz commented Nov 1, 2017

I found the glyph missing on Source Han Sans still do exist right now.
Any updates on this issue?

@HinTak
Copy link

HinTak commented Dec 4, 2017

See if the "enhanced" version of Source Hans Sans/Serif Regular at
#1534 (comment)
and
#1534 (comment)
do what you want them to do?

@jtanx jtanx added the CID label May 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants