Towards a Cyrillic romanisation encoding model
The ALA-LC Cyrillic romanisation tables use non-spacing ligature ties in the romanisation of a range of Cyrillic characters. The MARC-8 Extended Latin (ANSEL) character repertoire uses two diacritics, one on each letter, to display a ligature tie. The Unicode Standard contains two equivalent non-spacing characters: Combining Ligature Left Half (U+FE20) and Combining Ligature Right Half (U+FE21). Unicode also has a double-width non-spacing mark, Combining Double Inverted Breve (U+0361).
The original MARC-8 to Unicode mapping mapped LIGATURE, FIRST HALF (EB) to Combining Ligature Left Half (U+FE20), and LIGATURE, SECOND HALF (EC) to Combining Ligature Right Half (U+FE21). In 2004, the mapping table was updated so that a string containing LIGATURE, FIRST HALF (EB) and LIGATURE, SECOND HALF (EC) maps to Combining Double Inverted Breve (U+0361), in line with Unicode's recommendation to use U+0361 instead of U+FE20 and U+FE21.
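The effect of the 2004 change can be approximated on text that has already been converted to Unicode. The sketch below is only an illustration, not the Library of Congress mapping table: the function name `halves_to_double` is hypothetical, and it assumes every left half mark has a matching right half mark later in the string.

```python
LEFT_HALF = "\uFE20"     # Combining Ligature Left Half
RIGHT_HALF = "\uFE21"    # Combining Ligature Right Half
DOUBLE_BREVE = "\u0361"  # Combining Double Inverted Breve


def halves_to_double(text):
    """Naively rewrite paired half marks as a single double diacritic.

    Assumption: half marks occur in matched left/right pairs.
    """
    if text.count(LEFT_HALF) != text.count(RIGHT_HALF):
        raise ValueError("unpaired ligature half mark")
    # The double diacritic takes the position of the left half;
    # the right half is simply dropped.
    return text.replace(LEFT_HALF, DOUBLE_BREVE).replace(RIGHT_HALF, "")


converted = halves_to_double("i\uFE20e\uFE21")
print(" ".join(f"U+{ord(ch):04X}" for ch in converted))
# U+0069 U+0361 U+0065
```

A replacement this simple ignores any other diacritics stacked on the same base characters, which is where the two encodings start to diverge.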
The two half marks are more common than the single double diacritic, and font support for U+FE20, U+FE21, and U+0361 is very limited. These diacritics require OpenType fonts that support both the diacritics themselves and diacritic stacking, as well as sophisticated font rendering support.
| Cyrillic glyph | Half form | Double span diacritic | Romanised glyph |
|---|---|---|---|
| ѥ | i + ◌︠ + e + ◌︡ | i + ◌͡ + e | i͡e |
| | U+0069 U+FE20 U+0065 U+FE21 | U+0069 U+0361 U+0065 | |
| ѿ | o + ◌̄ + ◌︠ + t + ◌︡ | o + ◌̄ + ◌͡ + t | ō͡t |
| | U+006F U+0304 U+FE20 U+0074 U+FE21 | U+006F U+0304 U+0361 U+0074 | |
| ѩ | i + ◌︠ + e + ◌̨ + ◌︡ | i + ◌͡ + e + ◌̨ | i͡ę |
| | U+0069 U+FE20 U+0065 U+0328 U+FE21 | U+0069 U+0361 U+0065 U+0328 | |
The following diacritics are used in the above examples:
Character | Codepoint | Combining class |
---|---|---|
◌͡◌ | U+0361 | Double Above (234) |
◌︠ | U+FE20 | Above (230) |
◌︡ | U+FE21 | Above (230) |
◌̄ | U+0304 | Above (230) |
◌̨ | U+0328 | Attached Below (202) |
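The combining classes in this table can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `combining_classes` is made up for this example.

```python
import unicodedata

# Code points of the combining marks listed in the table.
MARKS = [0x0361, 0xFE20, 0xFE21, 0x0304, 0x0328]


def combining_classes(code_points):
    """Map each code point label to its canonical combining class."""
    return {f"U+{cp:04X}": unicodedata.combining(chr(cp)) for cp in code_points}


classes = combining_classes(MARKS)
for label, ccc in classes.items():
    print(label, unicodedata.name(chr(int(label[2:], 16))), ccc)
```

`unicodedata.combining` returns only the numeric class; names such as "Above" or "Double Above" are descriptive labels from the Unicode Standard.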
ѥ illustrates the two basic patterns for encoding ligated romanised glyphs in the various Cyrillic romanisation tables:
- base_character plus Combining Ligature Left Half (U+FE20) plus base_character plus Combining Ligature Right Half (U+FE21)
- base_character plus Combining Double Inverted Breve (U+0361) plus base_character
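The two patterns can be sketched as small string-building helpers. The function names here are illustrative only, and the sketch assumes two plain base characters with no additional diacritics.

```python
LIGATURE_LEFT_HALF = "\uFE20"
LIGATURE_RIGHT_HALF = "\uFE21"
DOUBLE_INVERTED_BREVE = "\u0361"


def tie_half_form(first, second):
    """base + left half + base + right half."""
    return first + LIGATURE_LEFT_HALF + second + LIGATURE_RIGHT_HALF


def tie_double_form(first, second):
    """base + double inverted breve + base."""
    return first + DOUBLE_INVERTED_BREVE + second


def code_points(text):
    """Render a string as space-separated U+XXXX labels."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(code_points(tie_half_form("i", "e")))    # U+0069 U+FE20 U+0065 U+FE21
print(code_points(tie_double_form("i", "e")))  # U+0069 U+0361 U+0065
```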
ѿ introduces diacritic stacking, where diacritics interact typographically (when using the half forms). Combining Macron (U+0304) and Combining Ligature Left Half (U+FE20) belong to the same combining class (230). This means that the order of the diacritics is significant.
The sequence U+006F U+0304 U+FE20 U+0074 U+FE21 is not the same as U+006F U+FE20 U+0304 U+0074 U+FE21. They are not canonically equivalent, and Unicode normalisation will not reorder these diacritics. For the first sequence, a well-designed font will display the letter o with Combining Macron (U+0304) directly above the o, and Combining Ligature Left Half (U+FE20) directly above the macron.
For the second sequence, U+006F U+FE20 U+0304 U+0074 U+FE21, the order of the diacritics is reversed, giving the letter o with Combining Ligature Left Half (U+FE20) directly above the o, and Combining Macron (U+0304) stacked directly above the ligature half.
On the other hand, Combining Macron (U+0304) and Combining Double Inverted Breve (U+0361) have different combining classes (230 and 234), so the sequences U+006F U+0304 U+0361 U+0074 and U+006F U+0361 U+0304 U+0074 are canonically equivalent and should render identically. When the strings are normalised, the diacritics are canonically ordered to U+006F U+0304 U+0361 U+0074. The canonically ordered sequence is the preferred form.
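The contrast between the half forms and the double diacritic can be checked with Python's standard `unicodedata` module. This sketch compares NFD forms, because NFC would additionally compose o and the macron into ō (U+014D) and obscure the ordering; the helper name `canonically_equivalent` is an assumption of this example.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if both strings share the same NFD (canonically decomposed) form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Half forms: macron (230) and left half (230) share a class, so order is kept.
half_macron_first = "o\u0304\uFE20t\uFE21"
half_tie_first = "o\uFE20\u0304t\uFE21"

# Double diacritic: macron (230) sorts before double inverted breve (234).
double_macron_first = "o\u0304\u0361t"
double_tie_first = "o\u0361\u0304t"

print(canonically_equivalent(half_macron_first, half_tie_first))      # False
print(canonically_equivalent(double_macron_first, double_tie_first))  # True
# Normalising the tie-first string yields the macron-first order.
print(unicodedata.normalize("NFD", double_tie_first) == double_macron_first)  # True
```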
ѩ provides an example of multiple diacritics that do not interact typographically: Combining Ogonek (U+0328, class 202) attaches below the e, while Combining Ligature Right Half (U+FE21, class 230) sits above it. The sequence U+0069 U+FE20 U+0065 U+0328 U+FE21 is canonically equivalent to U+0069 U+FE20 U+0065 U+FE21 U+0328, with U+0069 U+FE20 U+0065 U+0328 U+FE21 being the canonically ordered version.
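The reordering behind this equivalence is a stable sort of each run of combining marks by combining class. The following is a simplified sketch of that step alone: the function name `canonical_reorder` is hypothetical, it assumes the input is already fully decomposed, and it is not a substitute for `unicodedata.normalize`.

```python
import unicodedata


def canonical_reorder(text):
    """Stable-sort each run of combining marks (class > 0) by combining class."""
    result = []
    run = []  # combining marks seen since the last class-0 character
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A class-0 character ends the current run of marks.
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
        else:
            run.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)


unordered = "i\uFE20e\uFE21\u0328"  # right half before ogonek
ordered = canonical_reorder(unordered)
print(" ".join(f"U+{ord(ch):04X}" for ch in ordered))
# U+0069 U+FE20 U+0065 U+0328 U+FE21
```

Because Python's `sorted` is stable, marks with equal combining classes keep their original relative order, which is why sequences such as U+0304 U+FE20 are left untouched.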
Most ligature ties in the Non-Slavic table are simple cases that do not involve multiple diacritics. The one standout example is the romanisation of the Abkhaz letter ҵ.
| Cyrillic glyph | Half form | Double span diacritic | Romanised glyph |
|---|---|---|---|
| ҵ | No valid Unicode sequence. | t + ◌͡ + CGJ + ◌̇ + s | t͡͏̇s |
| | | U+0074 U+0361 U+034F U+0307 U+0073 | |
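To apply a combining diacritic stacked above the combining double diacritic, it is necessary to use the Combining Grapheme Joiner (CGJ, U+034F). Combining Dot Above (U+0307) has combining class 230 and Combining Double Inverted Breve (U+0361) has class 234, so without CGJ, normalisation would reorder the dot ahead of the double diacritic, placing it on the t rather than above the tie. CGJ has combining class 0 and blocks that reordering.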
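The role of CGJ can be seen by normalising the sequence with and without it. This is a minimal sketch using Python's standard `unicodedata` module; the variable names are chosen for this example only.

```python
import unicodedata

CGJ = "\u034F"  # Combining Grapheme Joiner, combining class 0

with_cgj = "t\u0361" + CGJ + "\u0307s"
without_cgj = "t\u0361\u0307s"

nfd_with = unicodedata.normalize("NFD", with_cgj)
nfd_without = unicodedata.normalize("NFD", without_cgj)


def code_points(text):
    """Render a string as space-separated U+XXXX labels."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# CGJ separates the two marks, so their order survives normalisation.
print(code_points(nfd_with))     # U+0074 U+0361 U+034F U+0307 U+0073
# Without CGJ, dot above (230) is reordered before the double breve (234).
print(code_points(nfd_without))  # U+0074 U+0307 U+0361 U+0073
```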
The above examples illustrate that the combining classes of the diacritics can be critical, and that half-form diacritic combinations do not necessarily align with their double-diacritic counterparts. In the case of the Abkhaz letter ҵ, there is a double-diacritic encoding but no matching half-form equivalent.