Shaping bugs for Cham? #376

Closed
ohbendy opened this issue Dec 9, 2016 · 21 comments

@ohbendy

ohbendy commented Dec 9, 2016

  • Zero consonant ꨀ (U+AA00)
    This behaves like Burmese အ or Thai อ: it can stand alone as a glottal stop consonant, or take diacritic marks as a vowel carrier (functioning like an independent vowel). In my tests, any diacritic mark typed after it is treated as an invalid combination, and a dotted circle is inserted to carry the mark. The consonant ꨀ can also appear with medial ya ꨳ (U+AA33) or medial wa (U+AA36) plus a diacritic, but these combinations also generate dotted circles.

Unicode Cham proposal N3120 suggests "that applications permit ꨀ to bear any of the vowel signs”.

  • Independent vowels ꨄ, ꨃ
    In one text, the independent vowel ꨄ (u+AA04) is modified by a preceding ꨰ vowel (u+AA30).
    In several manuscripts, ꨃ (u+AA03) is modified with vowel lengthener ◌ꨩ (u+AA29).

N3120 mentions “Four of the other independent vowels are also attested bearing matras”, so generalising this means any combination of independent vowel plus diacritic should be allowed.

  • Combination of vowel ◌ꨭ (U+AA2D) with vowel lengthener ◌ꨩ (U+AA29)
    This generates a dotted circle, though N3120 notes that this combination is the long -uu- vowel, with U+AA29 typed/stored last.

  • Numerals with diacritics
    I’ve found evidence of this in a number of manuscripts, but in my tests numeral + mark produces shaping errors, with dotted circles inserted.

Initially I thought the four issues above were USE bugs, but Andrew Glass advised me that the cause is Firefox/HarfBuzz, since "on Windows, independent vowels, consonants and digits are all given the same base class and will permit the standard clusters to form".

  • Combined medials below
    A consonant can carry medial La (AA35) and medial Wa (AA36) together, but USE disallows this combination. Andrew Glass mentions this can be remedied, so Harfbuzz will also need to allow for this combination.

N3120 mentions "Three medial clusters occur: ◌ꨴꨶ -rwa, ◌ꨵꨳ -lya, and ◌ꨵꨶ -lwa”

@lianghai

lianghai commented Dec 10, 2016

I can't reproduce any non-USE issue; only the two already-confirmed USE issues reproduce. Ben, please clarify in what environment you're running these tests.

From a developer's point of view, I see three issues reported:

  1. Invalid <vowel letter / digit, vowel sign / medial consonant sign>: can't reproduce. The report says HarfBuzz doesn't allow a vowel letter or digit to be the base of a vowel sign, medial consonant sign, or other combining mark. But HarfBuzz does allow this, conforming to the USE spec, which classifies vowel letters and digits in the "BASE" class just like consonant letters.

  2. Invalid <base, U sign, AA sign> and other multiple vowel sign combinations: reproducible, confirmed USE issue.

  3. Invalid <base, medial LA sign, medial WA sign> and other multiple medial consonant sign combinations: reproducible, confirmed USE issue.


More about issues 2 and 3:

These issues have been discussed also on Twitter: https://twitter.com/lianghai/status/794637250361368576

Basically, the USE expects medial consonant signs and vowel signs to be stored in the standard visual order, and allows only one medial consonant sign on each side: [MPre] [MAbv] [MBlw] [MPst], (VPre)* (VAbv)* (VBlw)* (VPst)*. Both restrictions are problems.
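
As a rough illustration of that ordering rule, here is a minimal Python sketch that checks a cluster against the strict pattern. The class table `USE_CLASS` and the helper names are hypothetical, hand-written only for the characters discussed in this thread; HarfBuzz implements the real rule in a Ragel state machine, not like this.

```python
import re

# Hypothetical, hand-written class table covering only characters
# discussed in this thread; real shapers derive classes from Unicode data.
USE_CLASS = {
    "\uAA00": "B",     # A, the zero consonant
    "\uAA06": "B",     # KA, a consonant letter
    "\uAA51": "B",     # digit ONE, treated as a normal base
    "\uAA2D": "VBlw",  # U sign, a below-base vowel sign
    "\uAA29": "VAbv",  # AA sign, an above-base vowel sign
    "\uAA35": "MBlw",  # medial LA sign, below-base
    "\uAA36": "MBlw",  # medial WA sign, below-base
}

# Strict ordering: at most one medial per side, then vowel signs in
# pre / above / below / post order. Spaces are literal separators.
STRICT = re.compile(
    r"B (MPre )?(MAbv )?(MBlw )?(MPst )?(VPre )*(VAbv )*(VBlw )*(VPst )*"
)

def classes(text):
    """Encode each character as its class name followed by a space."""
    return "".join(USE_CLASS[ch] + " " for ch in text)

def is_valid_cluster(text):
    """True if the whole string forms one cluster under the strict rule."""
    return STRICT.fullmatch(classes(text)) is not None

print(is_valid_cluster("\uAA06\uAA29\uAA2D"))  # <KA, AA sign, U sign>
print(is_valid_cluster("\uAA06\uAA2D\uAA29"))  # <KA, U sign, AA sign>
print(is_valid_cluster("\uAA06\uAA35\uAA36"))  # <KA, medial LA, medial WA>
```

Under these assumptions the first call is accepted and the other two are rejected, which corresponds to the two sequences that get a dotted circle.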

The required vowel sign order is debatable (don't vowel signs need logical order at all?) but does conform to the Unicode Standard: see the core spec 9.0, section 12.3 Gurmukhi, the last paragraph of "Encoding Principles" on page 474:

> More generally, when a consonant or independent vowel is modified by multiple vowel signs, the sequence of the vowel signs in the underlying representation of the text should be: left, top, bottom, right.

However, the Unicode Standard is not self-consistent here. See table 16-15 "Cham Syllabic Structure" on page 645, which shows that AA sign (U+AA29), an above-base mark, is expected to follow any other vowel sign (including U sign, a below-base mark).

About medial consonant signs, though, the Unicode Standard clearly specifies that multiple medial consonant signs on the same side are valid (<base, medial LA sign, medial WA sign>), and implies it might be necessary to store medial consonant signs in logical order.


My test string is "ꨀꨀꨩꨀꨳꨀꨳꨩꨀꨶꨩꨄꨰꨃꨩꨆꨭꨩꨆꨩꨭ꩑ꨩꨆꨵꨶ" (<A>, <A, AA sign>, <A, medial YA sign>, <A, medial YA sign, AA sign>, <A, medial WA sign, AA sign>, <AI, AI sign>, <E, AA sign>, *<KA, U sign, AA sign>, <KA, AA sign, U sign>, <One, AA sign>, *<KA, medial LA sign, medial WA sign>). Only the 2 asterisk-marked syllables appear invalid (with a dotted circle), and they're USE issues as explained above.

Tested in Chrome 55.0.2883.87, macOS Sierra 10.12.1, with a modified Noto Sans Cham (added the correct script tag "cham" as the USE requests, mapping U+25CC to "@" so the dotted circle (missing in Noto Sans Cham) is available and visible).
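
For anyone who wants to check what is actually stored in a test string like the one above, a small Python sketch using the standard `unicodedata` module can list the code points in storage order. The helper name `describe` is made up for this example.

```python
import unicodedata

def describe(text):
    """List each character as 'U+XXXX NAME' in storage order."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]

# The two syllables reported as invalid, written with escapes so the
# storage order is unambiguous:
#   <KA, U sign, AA sign> and <KA, medial LA sign, medial WA sign>
for syllable in ("\uAA06\uAA2D\uAA29", "\uAA06\uAA35\uAA36"):
    for line in describe(syllable):
        print(line)
    print()
```

This only shows the stored sequence; it says nothing about how a given shaper will cluster it.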

[Screenshot: screen shot 2016-12-10 at 11 05 36 am]

@ohbendy
Author

ohbendy commented Dec 10, 2016

> More generally, when a consonant or independent vowel is modified by multiple vowel signs, the sequence of the vowel signs in the underlying representation of the text should be: left, top, bottom, right.

In my understanding AA29 can serve either as a long vowel mark -aa-, when alone on a consonant, or as a vowel lengthener to act with other vowel diacritics, when it's typed/stored after that diacritic. I wonder if Unicode makes provision for things that can be in more than one category.

> About medial consonant signs, though, the Unicode Standard clearly specifies that multiple medial consonant signs on the same side are valid (<base, medial LA sign, medial WA sign>), and implies it might be necessary to store medial consonant signs in logical order.

If Burmese is ever migrated to USE, we'll have the same situation there. Medial Wa and medial Ha both go underneath, and can be combined on one base. I'd be surprised if there are not other scripts that need similar arrangements. But this seems to be a USE issue.

Regarding issue 1, I'm on Mavericks, and I gather the version of USE it uses might not be as up-to-date as on other setups, so that could explain why you can't reproduce it.

@lianghai

> In my understanding AA29 can serve either as a long vowel mark -aa-, when alone on a consonant, or as a vowel lengthener to act with other vowel diacritics, when it's typed/stored after that diacritic.

I understand it as simply a pure vowel sign AA that signals a long vowel when combined with other vowel signs, so there is no significant logical order in this case.

It can also be analyzed as a pure vowel lengthener that lengthens all vowels, including consonant letters' inherent vowel A (then nasal consonant letter's inherent vowel UE can be analyzed as a higher-level conditioned variant of A).

> I wonder if Unicode makes provision for things that can be in more than one category.

The current model of Unicode's Indic_Syllabic_Category allows a single category for each character. But some categories are themselves meant for multifunctional characters (e.g., Indic_Syllabic_Category = Virama).

> But this seems to be a USE issue.

Yes, this is apparently a USE issue.

> Regarding issue 1, I'm on Mavericks, and I gather the version of USE it uses might not be as up-to-date as on other setups, so that could explain why you can't reproduce it.

Uh… OS X Mavericks has nothing to do with USE. If you test in the latest Chrome, Chrome uses its own HarfBuzz, independent of the operating system, to shape complex scripts. If you test in OS X's native text fields, OS X Mavericks' Core Text doesn't have any USE support yet.

Ben, you really, REALLY need to explain very clearly in what software you do all those tests. Saying "I'm on Mavericks" is not helpful at all. Screenshots (of the tested string in the environment, preferably with the software's About window) are helpful if you're not sure what to explain…

@ohbendy
Author

ohbendy commented Dec 12, 2016

Sorry, I thought I'd mentioned it: I do all my font testing in Firefox. I'm currently on 50.0.1. I'll get to Win10 later this week, try Chrome too, and post images of each.

@lianghai

Okay, I can confirm Firefox 50.1.0 behaves as Ben described:
[Screenshot: screen shot 2016-12-15 at 10 14 23]

So it seems to be a Firefox bug. How does Firefox behave differently if it uses the latest (?) HarfBuzz directly?

This is the patched Noto Sans Cham used in my tests:
NotoSansChamX2-Regular.ttf.zip

@jfkthame
Collaborator

Firefox 50 doesn't have the most recent harfbuzz. An update to harfbuzz 1.3.3 landed for Firefox 52 (https://bugzilla.mozilla.org/show_bug.cgi?id=1313097), so you might like to test with Firefox Developer Edition (https://www.mozilla.org/en-US/firefox/developer/) to see whether this affects the result.

@lianghai

Confirmed: Firefox Developer Edition 52.0a2 gives the same result as Chrome (that is, how a recent HarfBuzz behaves).

So those HarfBuzz-related issues (issue 1 in my first comment) are caused by the older version of HarfBuzz in the current stable release of Firefox.

@ohbendy
Author

ohbendy commented Dec 15, 2016

Thanks for checking that :-)

Re issue 1, did you try numeral + mark? In Win 10, USE is not accepting that in Notepad, but it's fine in Edge.

That still leaves the medial La + medial Wa combination and the below-vowel + AA29 combination to be fixed. These now seem to be USE issues, and I've informed Andrew G.

@lianghai

lianghai commented Dec 15, 2016

The second-to-last syllable in my test string is <One, AA sign>, a digit + mark test, which seems to work in HarfBuzz (HarfBuzz in Chrome 55.0.2883.95). I just did a quick test with "꩑ꨴꨳꨰꨩ" <One, medial RA sign, medial YA sign, AI sign, AA sign> and it works too, conforming to the USE spec's treatment of digits as normal bases.

@Richard57

Distinguishing medial consonants from Consonant_Subjoined seems a bad idea for SE Asian scripts. While I'm not sure I can get two consecutive medials in Tai Tham, I get similar rendering problems as can be seen in the word for 'iron' at http://www.wrdingham.co.uk/lanna/renderer_test.htm#surprises. The problem also shows up in MS Edge. Tai Tham also suffers from the vowel ordering issue.

@punchcutter
Collaborator

punchcutter commented Mar 13, 2017

Can we go ahead and relax the medial consonant behaviour to allow more than one? I tested with this and it looks good to me:

hb-ot-shape-complex-use-machine.rl

- medial_consonants = MPre? MAbv? MBlw? MPst?;
+ medial_consonants = MPre? MAbv? MBlw* MPst?;

@ohbendy
Author

ohbendy commented Mar 14, 2017

This is a USE issue. Andrew Glass is aware and plans to implement an override to allow two medials in the below-base slot.

@punchcutter
Collaborator

It's an issue with the USE spec, but harfbuzz and every other shaper need to implement the spec somehow. This is a harfbuzz implementation for this particular problem, which we can easily do now, just like we did an override for Chakma split vowels. Andrew can update the spec, but harfbuzz still needs to implement it at some point. Sooner is always nice, especially when we have fonts in the pipeline that need it.

@ohbendy
Author

ohbendy commented Mar 14, 2017

Awesome, I see :)

@punchcutter
Collaborator

@behdad Any way we can get these two issues fixed ASAP? We want to get this Cham delivered, but it needs correct shaping. My local build of Firefox with a patched harfbuzz has been great for my own testing, but I'd like to get harfbuzz updated for real.

Just to recap, the two issues @ohbendy originally reported that are still present:

  1. Combination of vowel ◌ꨭ (AA2D) with vowel lengthener ◌ꨩ (u+AA29)
  2. Combined medials below

My local quick fix for these was just to totally relax the strictness of the USE spec in hb-ot-shape-complex-use-machine.rl:

- medial_consonants = MPre? MAbv? MBlw? MPst?;
+ medial_consonants = MPre? MAbv? MBlw* MPst?;
- dependent_vowels = VPre* VAbv* VBlw* VPst*;
+ dependent_vowels = VPre* (VAbv* VBlw* | VBlw* VAbv*) VPst*;

Of course if we want to make these explicit exceptions it needs more, but this works as a quick fix.
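
To see what the two relaxed rules accept, here is a small Python sketch that mirrors them as a regular expression over class names. It is an illustration only: the function name `accepts` and the space-terminated encoding are my own assumptions, and the real rules live in the Ragel source and cover more classes than shown here.

```python
import re

# Mirrors the two relaxed rules from the diff above.
# Each class name is encoded as the name followed by one space.
MEDIALS = r"(MPre )?(MAbv )?(MBlw )*(MPst )?"
VOWELS = r"(VPre )*((VAbv )*(VBlw )*|(VBlw )*(VAbv )*)(VPst )*"
RELAXED = re.compile("B " + MEDIALS + VOWELS)

def accepts(class_names):
    """True if the class sequence forms one cluster under the relaxed rules."""
    encoded = "".join(name + " " for name in class_names)
    return RELAXED.fullmatch(encoded) is not None

# <base, U sign, AA sign>: below-base vowel before above-base vowel
print(accepts(["B", "VBlw", "VAbv"]))
# <base, medial LA sign, medial WA sign>: two below-base medials
print(accepts(["B", "MBlw", "MBlw"]))
# Still rejected: above, below, above again
print(accepts(["B", "VAbv", "VBlw", "VAbv"]))
```

The first two sequences are accepted and the third is not, so the vowel rule is loosened only enough to allow either order of the above and below groups, not arbitrary interleaving.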

@behdad
Member

behdad commented Apr 13, 2017

> This is a USE issue. Andrew Glass is aware and plans to implement an override to allow two medials in the below-base slot.

Which thread was that? I can review and adjust HarfBuzz.

@ohbendy
Author

ohbendy commented Apr 14, 2017 via email

@behdad
Member

behdad commented Apr 14, 2017

Ok I just emailed Andrew to get a clarification of what he will be changing, so I can match.

@behdad
Member

behdad commented Apr 14, 2017

> Of course if we want to make these explicit exceptions it needs more, but this works as a quick fix.

Yeah, don't like to loosen up the regex more than needed.

@behdad
Member

behdad commented Apr 28, 2017

Here's Andrew's response to me in private:

> It’s not my recollection that I committed USE to a particular solution for Cham. ...
>
> In principle, I think we may have been too restrictive on the medials. The real need here was to limit the prebase medials; the others aren’t such a problem. So I think that the Cham case would be fixed that way. I don’t expect any negative impact on other writing systems from changing the limit, since it is merely relaxing a restriction.
>
> Another option was to play games with the overrides. That is cheaper to implement, but isn’t really the right solution and doesn’t help other writing systems.
>
> I keep hoping to have time to think about the problems for Tai Tham. Ideally (from my perspective), there would be one set of cluster model changes to fix the Cham medials and the monosyllabic Tai Tham issues. Focusing on Tai Tham is on my spare-time calendar but about fourth on the list. I’m not keen at this point on trying to address Tai Tham polysyllabic clusters. This is more of a philosophical debate on orthographic clusters and reading order, to be hashed out at some point with Richard and others.

behdad added a commit that referenced this issue Jul 14, 2017
Part of #376
Also see https://github.com/roozbehp/unicode-data/issues/6

Test added, using NotoSansCham built from Noto Phase III sources.
@behdad
Member

behdad commented Jul 14, 2017

Ok, fixed the U+AA29 issue. Working on other issue.


6 participants