Shaping bugs for Cham? #376

ohbendy · 2016-12-09T09:48:21Z

Zero consonant ꨀ (u+AA00)
This behaves in the same way as Burmese အ or Thai อ in that it can stand alone as a glottal stop consonant, or take diacritic marks as a vowel carrier (functioning like an independent vowel). In my tests, any diacritic mark typed after it is treated as an invalid combination, and dotted circles are inserted to carry the mark. The consonant ꨀ can also appear with medial ya ꨳ (AA33) or medial wa (AA36) plus a diacritic, but these combinations also generate dotted circles.

Unicode Cham proposal N3120 suggests "that applications permit ꨀ to bear any of the vowel signs".

Independent vowels ꨄ, ꨃ
In one text, the independent vowel ꨄ (u+AA04) is modified by a preceding ꨰ vowel (u+AA30).
In several manuscripts, ꨃ (u+AA03) is modified with vowel lengthener ◌ꨩ (u+AA29).

N3120 mentions “Four of the other independent vowels are also attested bearing matras”, so generalising this means any combination of independent vowel plus diacritic should be allowed.

Combination of vowel ◌ꨭ (AA2D) with vowel lengthener ◌ꨩ (u+AA29)
This generates a dotted circle, though N3120 notes this combination is a long -uu- vowel, with AA29 typed/stored last.

Numerals with diacritics
I’ve found evidence of this in a number of manuscripts, but numeral + mark has created shaping errors in my tests with dotted circles being inserted.

Initially I thought the four issues above were USE bugs, but Andrew Glass advised me it's Firefox/Harfbuzz as "on Windows, independent vowels, consonants and digits are all given the same base class and will permit the standard clusters to form".

Combined medials below
A consonant can carry medial La (AA35) and medial Wa (AA36) together, but USE disallows this combination. Andrew Glass mentions this can be remedied, so Harfbuzz will also need to allow for this combination.

N3120 mentions "Three medial clusters occur: ◌ꨴꨶ -rwa, ◌ꨵꨳ -lya, and ◌ꨵꨶ -lwa"

lianghai · 2016-12-10T03:39:33Z

I can't reproduce any non-USE issue; only the 2 already-confirmed USE issues reproduce. Ben, please clarify in what environment you're doing these tests.

From a developer's point of view I see there're 3 issues reported:

1. Invalid <vowel letter / digit, vowel sign / medial consonant sign>: can't reproduce. This report says HarfBuzz doesn't allow a vowel letter or digit to be the base of a vowel sign, medial consonant sign, or other combining mark. But HarfBuzz does allow this, conforming to the USE spec, which classifies vowel letters and digits in the "BASE" class just like consonant letters.
2. Invalid <base, U sign, AA sign> and other multiple vowel sign combinations: reproducible, confirmed USE issue.
3. Invalid <base, medial LA sign, medial WA sign> and other multiple medial consonant sign combinations: reproducible, confirmed USE issue.

More about issue 2 and 3:

These issues have been discussed also on Twitter: https://twitter.com/lianghai/status/794637250361368576

Basically, the USE expects medial consonant signs and vowel signs to be stored in the standard visual order, and allows only 1 medial consonant sign on each side: [MPre] [MAbv] [MBlw] [MPst], (VPre)* (VAbv)* (VBlw)* (VPst)*. Both restrictions are problems.
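
To make that ordering constraint concrete, here is a minimal Python sketch that encodes only the medial and vowel part of the cluster as a regular expression over class names. The token "B" for the base, the space-separated encoding and the helper name `is_valid_strict` are assumptions made for illustration; a real USE cluster has more slots (halants, modifiers, finals) that are left out here.

```python
import re

# Simplified stand-in for the medial/vowel portion of the USE cluster model.
# Tokens are class names separated by spaces; "B" (base) is an assumption
# of this sketch, and all other cluster slots are deliberately omitted.
STRICT_CLUSTER = re.compile(
    r"B"
    r"( MPre)?( MAbv)?( MBlw)?( MPst)?"   # at most one medial per side
    r"( VPre)*( VAbv)*( VBlw)*( VPst)*"   # vowel signs in visual order
)

def is_valid_strict(classes):
    """Return True if the list of class names forms one strict cluster."""
    return STRICT_CLUSTER.fullmatch(" ".join(classes)) is not None

# Above-base vowel before below-base vowel fits the model.
print(is_valid_strict(["B", "VAbv", "VBlw"]))
# Below-base vowel followed by above-base vowel does not.
print(is_valid_strict(["B", "VBlw", "VAbv"]))
# Two below-base medials on one base do not fit either.
print(is_valid_strict(["B", "MBlw", "MBlw"]))
```

Under this sketch, the last two sequences are the ones that would pick up a dotted circle.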

The requirement of vowel sign order is debatable (don't vowel signs need logical order at all?) but does conform to the Unicode Standard. See the core spec 9.0, section 12.3 Gurmukhi, the last paragraph of "Encoding Principles" on page 474:

> More generally, when a consonant or independent vowel is modified by multiple vowel signs, the sequence of the vowel signs in the underlying representation of the text should be: left, top, bottom, right.

However, actually, the Unicode Standard is not self-consistent. See table 16-15 "Cham Syllabic Structure" on page 645, which shows AA sign (U+AA29), an above-base mark, is expected to follow any other vowel signs (including U sign, a below-base mark).

About medial consonant signs though, the Unicode Standard clearly specifies multiple medial consonant signs on the same side is valid (<base, medial LA sign, medial WA sign>), and implies it might be necessary to store medial consonant signs logically.

My test string is "ꨀꨀꨩꨀꨳꨀꨳꨩꨀꨶꨩꨄꨰꨃꨩꨆꨭꨩꨆꨩꨭ꩑ꨩꨆꨵꨶ" (<A>, <A, AA sign>, <A, medial YA sign>, <A, medial YA sign, AA sign>, <A, medial WA sign, AA sign>, <AI, AI sign>, <E, AA sign>, *<KA, U sign, AA sign>, <KA, AA sign, U sign>, <One, AA sign>, *<KA, medial LA sign, medial WA sign>). Only the 2 asterisk-marked syllables appear invalid (with a dotted circle), and they're USE issues as explained above.
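
For anyone who wants to inspect the test string code point by code point, here is a small Python sketch. The escapes are transcribed from the syllable descriptions above; U+AA06 for KA and U+AA51 for the digit One are filled in from the Unicode code chart rather than from this thread, and the helper name `describe` is just for illustration.

```python
import unicodedata

# The eleven test syllables, written as escapes so the sequence is unambiguous.
SYLLABLES = [
    "\uaa00",              # <A>
    "\uaa00\uaa29",        # <A, AA sign>
    "\uaa00\uaa33",        # <A, medial YA sign>
    "\uaa00\uaa33\uaa29",  # <A, medial YA sign, AA sign>
    "\uaa00\uaa36\uaa29",  # <A, medial WA sign, AA sign>
    "\uaa04\uaa30",        # <AI, AI sign>
    "\uaa03\uaa29",        # <E, AA sign>
    "\uaa06\uaa2d\uaa29",  # <KA, U sign, AA sign>  (reported invalid)
    "\uaa06\uaa29\uaa2d",  # <KA, AA sign, U sign>
    "\uaa51\uaa29",        # <One, AA sign>
    "\uaa06\uaa35\uaa36",  # <KA, medial LA sign, medial WA sign>  (reported invalid)
]
TEST_STRING = "".join(SYLLABLES)

def describe(text):
    """List each code point as 'U+XXXX NAME' using Python's Unicode database."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

for syllable in SYLLABLES:
    print(" + ".join(describe(syllable)))
```

Pasting `TEST_STRING` into the application under test should give the same input as the literal string above.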

Tested in Chrome 55.0.2883.87, macOS Sierra 10.12.1, with a modified Noto Sans Cham (added the correct script tag "cham" as the USE requests, mapping U+25CC to "@" so the dotted circle (missing in Noto Sans Cham) is available and visible).

ohbendy · 2016-12-10T10:19:06Z

> More generally, when a consonant or independent vowel is modified by multiple vowel signs, the sequence of the vowel signs in the underlying representation of the text should be: left, top, bottom, right.

In my understanding AA29 can serve either as a long vowel mark -aa-, when alone on a consonant, or as a vowel lengthener to act with other vowel diacritics, when it's typed/stored after that diacritic. I wonder if Unicode makes provision for things that can be in more than one category.

> About medial consonant signs though, the Unicode Standard clearly specifies multiple medial consonant signs on the same side is valid (<base, medial LA sign, medial WA sign>), and implies it might be necessary to store medial consonant signs logically.

If Burmese is ever migrated to USE, we'll have the same situation there. Medial Wa and medial Ha both go underneath, and can be combined on one base. I'd be surprised if there are not other scripts that need similar arrangements. But this seems to be a USE issue.

Regarding issue 1, I'm on Mavericks, and I gather the version of USE it uses might not be as up-to-date as on other setups, so that could explain why you can't reproduce it.

lianghai · 2016-12-12T09:24:02Z

> In my understanding AA29 can serve either as a long vowel mark -aa-, when alone on a consonant, or as a vowel lengthener to act with other vowel diacritics, when it's typed/stored after that diacritic.

I understand it as simply a pure vowel sign AA that hints at a long vowel when combined with other vowel signs, so there isn't a significant logical order in this case.

It can also be analyzed as a pure vowel lengthener that lengthens all vowels, including consonant letters' inherent vowel A (then nasal consonant letter's inherent vowel UE can be analyzed as a higher-level conditioned variant of A).

> I wonder if Unicode makes provision for things that can be in more than one category.

The current model of Unicode Indic_Syllabic_Category allows a single category for every character. But there're some categories that are themselves for multifunctional characters (eg, Indic_Syllabic_Category = Virama).

> But this seems to be a USE issue.

Yes, this is apparently a USE issue.

> Regarding issue 1, I'm on Mavericks, and I gather the version of USE it uses might not be as up-to-date as on other setups, so that could explain why you can't reproduce it.

Uh… OS X Mavericks has nothing to do with USE. If you test in the latest Chrome, then Chrome uses its own HarfBuzz, independent of the operating system, to shape complex scripts. If you test in OS X's native text fields, then OS X Mavericks' Core Text doesn't have any support for USE yet.

Ben, you really, REALLY need to explain very clearly in what software you do all those tests. Saying "I'm on Mavericks" is not helpful at all. Screenshots (of the tested string in the environment, preferably with the software's About window) are helpful if you're not sure what to explain…

ohbendy · 2016-12-12T10:39:02Z

Sorry, I thought I'd mentioned, I do all my font testing on Firefox. I'm currently on 50.0.1. I'll get to Win10 later this week, try Chrome too, and post images of each.

lianghai · 2016-12-15T02:17:20Z

Okay, I can confirm Firefox 50.1.0 behaves like what Ben described:

So it seems to be a Firefox bug. How does Firefox behave differently if it uses the latest (?) HarfBuzz directly?

This is the patched Noto Sans Cham used in my tests:
NotoSansChamX2-Regular.ttf.zip

jfkthame · 2016-12-15T09:05:06Z

Firefox 50 doesn't have the most recent harfbuzz. An update to harfbuzz 1.3.3 landed for Firefox 52 (https://bugzilla.mozilla.org/show_bug.cgi?id=1313097), so you might like to test with Firefox Developer Edition (https://www.mozilla.org/en-US/firefox/developer/) to see whether this affects the result.

lianghai · 2016-12-15T10:29:34Z

Confirmed: Firefox Developer Edition 52.0a2 has the same result as Chrome's (how recent HarfBuzz behaves).

So those HarfBuzz-related issues (issue 1 in my first comment) are because of an older version of HarfBuzz in the current stable release of Firefox.

ohbendy · 2016-12-15T12:12:07Z

Thanks for checking that :-)

Re issue 1, did you try numeral + mark? In Win 10, USE is not accepting that in Notepad, but it's fine in Edge.

That still leaves the medial La + medial Wa combination and the below-vowel + AA29 combination needing to be fixed. Both now seem to be USE issues, and I've informed Andrew G.

lianghai · 2016-12-15T12:45:46Z

The second last syllable in my test string is <One, AA sign>, a digit + mark test, which seems to work in HarfBuzz (HarfBuzz in Chrome 55.0.2883.95). Just did a quick test with "꩑ꨴꨳꨰꨩ" <One, medial RA sign, medial YA sign, AI sign, AA sign> and it works too, conforming with the USE's spec of treating digits as normal bases.

Richard57 · 2016-12-31T18:15:14Z

Distinguishing medial consonants from Consonant_Subjoined seems a bad idea for SE Asian scripts. While I'm not sure I can get two consecutive medials in Tai Tham, I get similar rendering problems as can be seen in the word for 'iron' at http://www.wrdingham.co.uk/lanna/renderer_test.htm#surprises. The problem also shows up in MS Edge. Tai Tham also suffers from the vowel ordering issue.

punchcutter · 2017-03-13T18:33:48Z

Can we go ahead and relax the medial consonant behaviour to allow more than one? I tested with this and it looks good to me:

hb-ot-shape-complex-use-machine.rl

- medial_consonants = MPre? MAbv? MBlw? MPst?;
+ medial_consonants = MPre? MAbv? MBlw* MPst?;

ohbendy · 2017-03-14T02:45:15Z

This is a USE issue. Andrew Glass is aware and plans to implement an override to allow two medials in the below-base slot.

punchcutter · 2017-03-14T03:12:07Z

It's an issue with the USE spec, but harfbuzz and every other shaper need to implement the spec somehow. This is a harfbuzz implementation for this particular problem, which we can easily do now, just like we did an override for Chakma split vowels. Andrew can update the spec, but harfbuzz still needs to implement it at some point. Sooner is always nice, especially when we have fonts in the pipeline that need it.

ohbendy · 2017-03-14T03:13:23Z

Awesome, I see :)

punchcutter · 2017-04-13T17:13:37Z

@behdad Any way we can get these two issues fixed ASAP? We want to get this Cham delivered, but it needs correct shaping. My local build of Firefox with a patched harfbuzz has been great for my own testing, but I'd like to get harfbuzz updated for real.

Just to recap, the two issues @ohbendy originally reported that are still present:

Combination of vowel ◌ꨭ (AA2D) with vowel lengthener ◌ꨩ (u+AA29)
Combined medials below

My local quick fix for these was just to totally relax the strictness of the USE spec in hb-ot-shape-complex-use-machine.rl:

- medial_consonants = MPre? MAbv? MBlw? MPst?;
+ medial_consonants = MPre? MAbv? MBlw* MPst?;
- dependent_vowels = VPre* VAbv* VBlw* VPst*;
+ dependent_vowels = VPre* (VAbv* VBlw* | VBlw* VAbv*) VPst*;

Of course if we want to make these explicit exceptions it needs more, but this works as a quick fix.
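
As a rough way to see what the two relaxed rules change, here is a Python sketch that models both versions as regular expressions over class names and runs the two failing Cham sequences through them. This is not the Ragel machine itself: the "B" token, the space-separated encoding, the `USE_CLASS` table and the `accepts` helper are assumptions for illustration, and the class assignments are limited to the characters whose positions are stated in this thread.

```python
import re

# Classes as described in this thread: U+AA29 is an above-base vowel sign,
# U+AA2D a below-base vowel sign, U+AA35/U+AA36 below-base medials.
# KA (U+AA06) and the digit One (U+AA51) are treated as plain bases.
USE_CLASS = {
    0xAA06: "B", 0xAA51: "B",
    0xAA29: "VAbv", 0xAA2D: "VBlw",
    0xAA35: "MBlw", 0xAA36: "MBlw",
}

# Before the patch: one medial per side, vowels in fixed visual order.
STRICT = re.compile(r"B( MPre)?( MAbv)?( MBlw)?( MPst)?"
                    r"( VPre)*( VAbv)*( VBlw)*( VPst)*")
# After the patch: MBlw may repeat, and VAbv/VBlw may come in either order.
RELAXED = re.compile(r"B( MPre)?( MAbv)?( MBlw)*( MPst)?"
                     r"( VPre)*(( VAbv)*( VBlw)*|( VBlw)*( VAbv)*)( VPst)*")

def accepts(pattern, text):
    """Map each character to its class and test the whole cluster."""
    classes = " ".join(USE_CLASS[ord(ch)] for ch in text)
    return pattern.fullmatch(classes) is not None

KA_U_AA = "\uaa06\uaa2d\uaa29"    # <KA, U sign, AA sign>
KA_LA_WA = "\uaa06\uaa35\uaa36"   # <KA, medial LA sign, medial WA sign>

for label, text in [("KA+U+AA", KA_U_AA), ("KA+LA+WA", KA_LA_WA)]:
    print(label, "strict:", accepts(STRICT, text), "relaxed:", accepts(RELAXED, text))
```

Note that the alternation still rejects interleaved runs such as above, below, above, so it is looser than the original rule but not a free-for-all.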

behdad · 2017-04-13T23:08:48Z

> This is a USE issue. Andrew Glass is aware and plans to implement an override to allow two medials in the below-base slot.

Which thread was that? I can review and adjust HarfBuzz.

ohbendy · 2017-04-14T11:35:56Z

I was in touch with Andrew privately by email a while ago.

behdad · 2017-04-14T15:45:21Z

Ok I just emailed Andrew to get a clarification of what he will be changing, so I can match.

behdad · 2017-04-14T15:45:42Z

> Of course if we want to make these explicit exceptions it needs more, but this works as a quick fix.

Yeah, don't like to loosen up the regex more than needed.

behdad · 2017-04-28T21:46:15Z

Here's Andrew's response to me in private:

> It’s not my recollection that I committed USE to a particular solution for Cham. ...

> In principle, I think we may have been too restrictive on the medials. The real need here was to limit the prebase medials; the others aren’t such a problem. So I think that the Cham case would be fixed that way. I don’t expect any negative impact on other writing systems from changing the limit, since it is merely relaxing a restriction.

> Another option was to play games with the overrides. That is cheaper to implement, but isn’t really the right solution and doesn’t help other writing systems.

> I keep hoping to have time to think about the problems for Tai Tham. Ideally (from my perspective), there would be one set of cluster model changes to fix the Cham medials and the monosyllabic Tai Tham issues. Focusing on Tai Tham is on my spare-time calendar but about fourth on the list. I’m not keen, at this point, on trying to address Tai Tham polysyllabic clusters. This is more of a philosophical debate on orthographic clusters and reading order, to be hashed out at some point with Richard and others.

behdad · 2017-07-14T15:40:44Z

Ok, fixed the U+AA29 issue. Working on other issue.

punchcutter pushed a commit to punchcutter/harfbuzz that referenced this issue Mar 28, 2017

Allow more than one medial consonant in USE shaper (required for Cham).

9e8a15b

harfbuzz#376

punchcutter mentioned this issue Mar 28, 2017

Allow more than one medial consonant in USE shaper (required for Cham). #447

Closed

behdad added a commit that referenced this issue Jul 14, 2017

[use] Fix shaping of U+AA29 CHAM VOWEL SIGN AA

216b003

Part of #376 Also see https://github.com/roozbehp/unicode-data/issues/6 Test added, using NotoSansCham built from Noto Phase III sources.

behdad closed this as completed in 9dd29c6 Jul 14, 2017

Richard57 mentioned this issue Jul 15, 2017

Tai Tham uses subjoined consonants after right-side vowels #170

Closed
