Font Unicode Coverage #3

Open
garygriswold opened this issue Oct 28, 2016 · 3 comments

@garygriswold

This is a feature request, not a bug report. We would like to know the Unicode coverage of each font that is on a device. Our very international app sometimes displays text for which the font has no glyph (it shows up as boxes). If we could test a device for the ranges of Unicode characters that each font supports, we could drop the languages that would not display on that device. I am hoping this information is available in the font file and could be provided along with the name.

@eb1
Contributor

eb1 commented Oct 28, 2016

That sounds really useful.

It looks like the infrastructure in the Android platform could be expanded to do this. The Android implementation opens up the TTF files and pulls out the font name -- information on which characters are supported is also held inside the font files themselves (see https://unix.stackexchange.com/questions/247108/how-to-find-out-which-unicode-codepoints-are-defined-in-a-ttf-file for a Python example, although our Android code is in Java).

I'd have to poke around a little more for how to do this on iOS. We don't deal directly with font files there.

I'm currently working towards a beta of our mobile project (adapt-it-mobile), but might be able to poke at this once that's out the door. Alternately, if you have the desire to go and implement it, you could just send a pull request and I can fold in the changes. You would be adding a method to the TTFAnalyzer class in src/android/Fonts.java.

@garygriswold
Author

Erik,

I have not yet done any plugin development, but I will have to learn after a while. I am hoping that you find this worth doing.

Gary


@bobh0303

bobh0303 commented Dec 2, 2016

If you can access the cmap from the fonts you want to use, this would be a first step to determining whether the font supports a given character set.

Some background you need to understand: the term "the cmap" can be slightly ambiguous. In terms of the binary format of font files (which, at the top level of organization, is a series of data structures called tables), the term refers to the table that has the tag cmap, but this table often contains multiple subtables, each subtable representing a different character-to-glyph mapping. The subtables are identified by, among other things, a platform ID, so there are Windows mappings, Apple mappings, and Unicode mappings. For any given application environment, usually only one of these mappings is in effect, so the term "the cmap" can also refer to the particular mapping that is in effect in my environment.

So if you are reading the binary font file, you first have to locate the cmap table, then figure out which of the subtables is relevant for your environment.
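A minimal Java sketch of those two steps, assuming the whole font file has been read into a big-endian `ByteBuffer` and that it is a single-font sfnt file (not a .ttc collection). The class name `CmapLocator` and the ranking in `rank` are illustrative choices, not part of the plugin: the ranking prefers a full-repertoire Unicode mapping and falls back to a BMP-only one, and a given environment may want a different preference.

```java
import java.nio.ByteBuffer;

/** Sketch: find the 'cmap' table in an sfnt (TTF/OTF) font and pick a Unicode subtable. */
final class CmapLocator {
    private CmapLocator() {}

    /** Returns the file offset of the table with the given 4-character tag, or -1 if absent. */
    static int findTable(ByteBuffer font, String tag) {
        // Offset table: sfntVersion(4), numTables(2), then 6 bytes of search hints.
        int numTables = font.getShort(4) & 0xFFFF;
        for (int i = 0; i < numTables; i++) {
            int rec = 12 + 16 * i; // table record: tag(4), checksum(4), offset(4), length(4)
            StringBuilder sb = new StringBuilder(4);
            for (int j = 0; j < 4; j++) {
                sb.append((char) (font.get(rec + j) & 0xFF));
            }
            if (sb.toString().equals(tag)) {
                return font.getInt(rec + 8);
            }
        }
        return -1;
    }

    /** Returns the file offset of the preferred Unicode cmap subtable, or -1 if none is found. */
    static int findUnicodeSubtable(ByteBuffer font) {
        int cmap = findTable(font, "cmap");
        if (cmap < 0) {
            return -1;
        }
        // cmap header: version(2), numTables(2), then 8-byte encoding records.
        int numSubtables = font.getShort(cmap + 2) & 0xFFFF;
        int best = -1;
        int bestRank = 0;
        for (int i = 0; i < numSubtables; i++) {
            int rec = cmap + 4 + 8 * i; // platformID(2), encodingID(2), offset(4)
            int platformId = font.getShort(rec) & 0xFFFF;
            int encodingId = font.getShort(rec + 2) & 0xFFFF;
            int r = rank(platformId, encodingId);
            if (r > bestRank) {
                bestRank = r;
                best = cmap + font.getInt(rec + 4); // subtable offsets are relative to the cmap table
            }
        }
        return best;
    }

    /** Assumed preference order; 0 means "not a Unicode mapping we can use". */
    static int rank(int platformId, int encodingId) {
        if (platformId == 3 && encodingId == 10) return 4; // Windows, full Unicode repertoire
        if (platformId == 0 && encodingId != 5) return 3;  // Unicode platform (5 = variation sequences)
        if (platformId == 3 && encodingId == 1) return 2;  // Windows, Unicode BMP
        return 0;
    }
}
```

On Android the buffer could come from something like `ByteBuffer.wrap(Files.readAllBytes(path))`; `ByteBuffer` is big-endian by default, which matches the font format.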

Once you have the relevant subtable, you now have a character-to-glyph mapping. Be aware that there are several different formats this mapping can take, and for the rest of this I'll assume what is the most common one, namely a mapping from Unicode character code (USV) to glyph index (i.e., the index into the glyf table). This format is represented by a kind of sparse array structure that you (or some library you are using) have to parse.

At this point you will be able to ask the question: For a given USV, does this cmap return a non-zero glyph index? The glyph with index zero is the "not defined" glyph, so if a given USV maps to glyph zero you can say that character isn't supported by that font.
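As a rough illustration of that question, here is a Java sketch of a lookup in a cmap subtable of format 4, the segment-based format commonly used for BMP mappings. It assumes the font is in a big-endian `ByteBuffer` and that `subtable` is the file offset of a format 4 subtable; other formats (for example format 12 for characters beyond the BMP) need their own parsers. The class and method names are hypothetical, and the linear scan stands in for the binary search the format is designed for.

```java
import java.nio.ByteBuffer;

/** Sketch: look up a code point in a cmap format 4 subtable. */
final class CmapFormat4 {
    private CmapFormat4() {}

    private static int u16(ByteBuffer b, int pos) {
        return b.getShort(pos) & 0xFFFF;
    }

    /** Returns the glyph index for codePoint, or 0 (the "not defined" glyph) if unmapped. */
    static int glyphIndex(ByteBuffer font, int subtable, int codePoint) {
        if (u16(font, subtable) != 4) {
            throw new IllegalArgumentException("not a format 4 subtable");
        }
        if (codePoint < 0 || codePoint > 0xFFFF) {
            return 0; // format 4 can only describe the BMP
        }
        // Header: format, length, language, segCountX2, searchRange, entrySelector, rangeShift.
        int segCountX2 = u16(font, subtable + 6);
        int endCodes = subtable + 14;
        int startCodes = endCodes + segCountX2 + 2; // skip the reservedPad field
        int idDeltas = startCodes + segCountX2;
        int idRangeOffsets = idDeltas + segCountX2;

        for (int i = 0; i < segCountX2; i += 2) {
            if (u16(font, endCodes + i) < codePoint) {
                continue; // segments are sorted by endCode; keep looking
            }
            int start = u16(font, startCodes + i);
            if (start > codePoint) {
                return 0; // falls in a gap between segments
            }
            int delta = u16(font, idDeltas + i); // arithmetic is modulo 65536
            int rangeOffset = u16(font, idRangeOffsets + i);
            if (rangeOffset == 0) {
                return (codePoint + delta) & 0xFFFF;
            }
            // rangeOffset is relative to the position of this segment's idRangeOffset entry.
            int glyph = u16(font, idRangeOffsets + i + rangeOffset + 2 * (codePoint - start));
            return glyph == 0 ? 0 : (glyph + delta) & 0xFFFF;
        }
        return 0;
    }

    /** The zero/non-zero test: true if the cmap maps codePoint to a real glyph. */
    static boolean isMapped(ByteBuffer font, int subtable, int codePoint) {
        return glyphIndex(font, subtable, codePoint) != 0;
    }
}
```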

I suspect most algorithms that try to determine character support of a font will stop at this point and use the zero/non-zero cmap result to indicate whether the character is not-supported/supported. This may be all you need to do in your situation -- certainly it would eliminate most of the "square box" issues mentioned by the OP.
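To connect this back to the original request (ranges of Unicode characters per font), here is a small Java sketch that turns a zero/non-zero test into a list of covered ranges and checks whether a sample string is fully covered. The `IntPredicate` stands in for whatever cmap lookup is available; `CoverageRanges` and its methods are hypothetical names, not existing plugin API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Sketch: summarize cmap coverage as inclusive code point ranges. */
final class CoverageRanges {
    private CoverageRanges() {}

    /** Collapses every mapped code point in [first, last] into {start, end} pairs. */
    static List<int[]> collect(IntPredicate isMapped, int first, int last) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a covered run"
        for (int cp = first; cp <= last; cp++) {
            if (isMapped.test(cp)) {
                if (start < 0) {
                    start = cp;
                }
            } else if (start >= 0) {
                ranges.add(new int[] {start, cp - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] {start, last}); // run extends to the end of the scan
        }
        return ranges;
    }

    /** True if every code point of the sample text passes the cmap test. */
    static boolean coversText(IntPredicate isMapped, String text) {
        return text.codePoints().allMatch(isMapped);
    }
}
```

An app could run `coversText` on a short sample in each language it ships and hide the languages for which no installed font passes. That only answers the "in the cmap or not" question described above.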

In reality, however, if you get a non-zero glyph index, about the best you can say is that the character is likely to be supported by the font, and getting more resolution to the question is harder to do. If the indicated glyph has no "ink", i.e., no outline, it could be that this is intended as it is representing a whitespace character, but it might be that the font author just never implemented the outline.
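For fonts with TrueType outlines, one way to check for "ink" is to compare consecutive entries in the loca table: a glyph whose start and end offsets into glyf are equal has no outline data. The Java sketch below assumes the file offsets of the head and loca tables are already known and that `glyphIndex` is less than the glyph count from maxp. It does not apply to CFF-based OpenType fonts, which store outlines differently, and the names are hypothetical.

```java
import java.nio.ByteBuffer;

/** Sketch: detect glyphs with no outline data in a TrueType-flavored font. */
final class GlyphOutlineCheck {
    private GlyphOutlineCheck() {}

    /**
     * Returns true if the glyph has any data in the glyf table.
     * head and loca are file offsets of those tables; the caller must make sure
     * glyphIndex is in range (loca holds numGlyphs + 1 entries).
     */
    static boolean hasOutline(ByteBuffer font, int head, int loca, int glyphIndex) {
        // indexToLocFormat sits at byte 50 of the head table: 0 = short offsets, 1 = long.
        int indexToLocFormat = font.getShort(head + 50);
        long start;
        long end;
        if (indexToLocFormat == 0) {
            // Short format stores offset / 2 as an unsigned 16-bit value.
            start = 2L * (font.getShort(loca + 2 * glyphIndex) & 0xFFFF);
            end = 2L * (font.getShort(loca + 2 * (glyphIndex + 1)) & 0xFFFF);
        } else {
            // Long format stores the offset as an unsigned 32-bit value.
            start = font.getInt(loca + 4 * glyphIndex) & 0xFFFFFFFFL;
            end = font.getInt(loca + 4 * (glyphIndex + 1)) & 0xFFFFFFFFL;
        }
        return end > start;
    }
}
```

An empty glyph is expected for spaces and similar characters, so a `false` result is only suspicious when the character is supposed to be visible.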

Even if it has an outline, does it really behave properly? For example, suppose the USV is that of a diacritic -- does the OpenType or Graphite logic in the font correctly position the glyph over all the base characters that it needs to for the language in question? (and for that matter, do you know what bases need to take that diacritic for that given language?) As another example, many scripts require specific shapes of a character in specific contexts (e.g., Arabic contextual shaping) -- does the font logic select the correct glyph in all the needed contexts (for the given language)?

Even if the font logic is correct, does the rendering library your app uses know to use that logic? If the character is relatively new to Unicode, it may be that the shaping logic in your support libraries hasn't been updated to know this character's properties so might not know it needs positioning or contextual shaping, etc. (This is particularly problematic with OpenType logic; less so but not entirely absent for Graphite logic.)

Like I said above, it may be that in your context "in the cmap or not" is a sufficient test, but I wanted you to understand that the full answer about what characters are supported by a font is more complex than that.
