Towards a Cyrillic romanisation encoding model

Andj edited this page Jan 11, 2023 · 3 revisions

The ALA-LC Cyrillic romanisation tables make use of non-spacing ligature ties in the romanisation of a range of Cyrillic characters. The MARC-8 Extended Latin (ANSEL) character repertoire used two diacritics, one on each letter, to display the ligature ties. The Unicode standard contains two equivalent non-spacing characters: Combining Ligature Left Half (U+FE20) and Combining Ligature Right Half (U+FE21). Unicode also has a double-width non-spacing mark, Combining Double Inverted Breve (U+0361).

The original MARC-8 to Unicode mapping mapped LIGATURE, FIRST HALF (EB) to Combining Ligature Left Half (U+FE20), and LIGATURE, SECOND HALF (EC) to Combining Ligature Right Half (U+FE21). In 2004, the mapping table was updated so that a string containing LIGATURE, FIRST HALF (EB) and LIGATURE, SECOND HALF (EC) maps to Combining Double Inverted Breve (U+0361), in line with Unicode's recommendation to use U+0361 instead of U+FE20 and U+FE21.

The two half marks are more common than the single double diacritic, and font support for U+FE20, U+FE21, and U+0361 is very limited. These diacritics require OpenType fonts that support both the diacritics themselves and diacritic stacking, as well as sophisticated font rendering support.

Examples from the Church Slavic romanisation table

| Cyrillic glyph | Half form | Double span diacritic | Romanised glyph |
| --- | --- | --- | --- |
| ѥ | i + ◌︠ + e + ◌︡ (U+0069 U+FE20 U+0065 U+FE21) | i + ◌͡ + e (U+0069 U+0361 U+0065) | i͡e |
| ѿ | o + ◌̄ + ◌︠ + t + ◌︡ (U+006F U+0304 U+FE20 U+0074 U+FE21) | o + ◌̄ + ◌͡ + t (U+006F U+0304 U+0361 U+0074) | ō͡t |
| ѩ | i + ◌︠ + e + ◌̨ + ◌︡ (U+0069 U+FE20 U+0065 U+0328 U+FE21) | i + ◌͡ + e + ◌̨ (U+0069 U+0361 U+0065 U+0328) | i͡ę |

The following diacritics are used in the above examples:

| Character | Codepoint | Combining class |
| --- | --- | --- |
| ◌͡◌ | U+0361 | Double Above (234) |
| ◌︠ | U+FE20 | Above (230) |
| ◌︡ | U+FE21 | Above (230) |
| ◌̄ | U+0304 | Above (230) |
| ◌̨ | U+0328 | Attached Below (202) |
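
The combining classes listed above can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the names `MARKS` and `combining_classes` are illustrative only and are not part of any romanisation tooling.

```python
import unicodedata

# The five combining marks used in the Church Slavic examples.
MARKS = ["\u0361", "\uFE20", "\uFE21", "\u0304", "\u0328"]


def combining_classes(chars):
    """Map each character (as a U+XXXX label) to its canonical combining class."""
    return {f"U+{ord(ch):04X}": unicodedata.combining(ch) for ch in chars}


classes = combining_classes(MARKS)
for label, ccc in classes.items():
    print(label, ccc)
```

Marks that share a combining class keep their relative order under normalisation; marks with different classes may be reordered.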

ѥ illustrates the two basic patterns for encoding ligated romanised glyphs in various Cyrillic romanisation tables:

  1. base_character plus Combining Ligature Left Half (U+FE20) plus base_character plus Combining Ligature Right Half (U+FE21)
  2. base_character plus Combining Double Inverted Breve (U+0361) plus base_character
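
The two patterns can be sketched as a pair of small string builders. This is an illustration for the simple case of two bare base letters only; the function names `tie_half_forms`, `tie_double_breve`, and `codepoints` are hypothetical helpers, not part of an existing library.

```python
def tie_half_forms(first: str, second: str) -> str:
    """Pattern 1: each base letter is followed by its own half mark."""
    return f"{first}\uFE20{second}\uFE21"


def tie_double_breve(first: str, second: str) -> str:
    """Pattern 2: a single double-width mark sits between the two base letters."""
    return f"{first}\u0361{second}"


def codepoints(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX codepoints."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# The romanisation of ѥ in both encodings.
print(codepoints(tie_half_forms("i", "e")))
print(codepoints(tie_double_breve("i", "e")))
```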

ѿ introduces diacritic stacking, where diacritics interact typographically (when using the half forms). Combining Macron (U+0304) and Combining Ligature Left Half (U+FE20) belong to the same combining class. This means that the order of the diacritics is significant.

The sequence U+006F U+0304 U+FE20 U+0074 U+FE21 is not the same as U+006F U+FE20 U+0304 U+0074 U+FE21. They are not canonically equivalent, and Unicode normalisation will not reorder these diacritics. For the first sequence, a well-designed font will display the letter o with a Combining Macron (U+0304) directly above the o, and Combining Ligature Left Half (U+FE20) directly above the macron.

For the second sequence, U+006F U+FE20 U+0304 U+0074 U+FE21, the order of the diacritics is reversed, giving the letter o with a Combining Ligature Left Half (U+FE20) directly above the o, and a Combining Macron (U+0304) stacked directly above the Combining Ligature Left Half (U+FE20).
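
That the two half-form orderings stay distinct under normalisation can be checked with a short sketch, again assuming Python's `unicodedata` module. The variable names are illustrative.

```python
import unicodedata

# Half-form encodings of the romanisation of ѿ, differing only in mark order.
macron_first = "o\u0304\uFE20t\uFE21"    # macron, then left half
ligature_first = "o\uFE20\u0304t\uFE21"  # left half, then macron

nfd_macron_first = unicodedata.normalize("NFD", macron_first)
nfd_ligature_first = unicodedata.normalize("NFD", ligature_first)

# Both marks have combining class 230, so neither string is reordered
# and the two remain different after normalisation.
print(nfd_macron_first == macron_first)
print(nfd_ligature_first == ligature_first)
print(nfd_macron_first == nfd_ligature_first)
```

The practical consequence is that data using the half forms has to be entered in the intended order, because normalisation cannot repair it afterwards.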

On the other hand, Combining Macron (U+0304) and Combining Double Inverted Breve (U+0361) have different combining classes, so the sequences U+006F U+0304 U+0361 U+0074 and U+006F U+0361 U+0304 U+0074 are canonically equivalent and should render identically. When the strings are normalised, the diacritics are canonically ordered to U+006F U+0304 U+0361 U+0074 (NFC additionally composes U+006F U+0304 into U+014D). The canonically ordered sequence is the preferred form.
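
The equivalence of the two double-diacritic orderings can be illustrated in the same way. This is a sketch assuming Python's `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

# Double-diacritic encodings of the romanisation of ѿ, in both mark orders.
macron_first = "o\u0304\u0361t"
breve_first = "o\u0361\u0304t"

nfd_a = unicodedata.normalize("NFD", macron_first)
nfd_b = unicodedata.normalize("NFD", breve_first)

# U+0304 (class 230) sorts before U+0361 (class 234), so both inputs
# normalise to the same canonically ordered string.
print(nfd_a == nfd_b)
print(" ".join(f"U+{ord(ch):04X}" for ch in nfd_b))

# NFC applies the same ordering and then composes o + macron into U+014D.
nfc_b = unicodedata.normalize("NFC", breve_first)
print(" ".join(f"U+{ord(ch):04X}" for ch in nfc_b))
```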

ѩ provides an example of multiple diacritics where the diacritics do not interact typographically. The sequence U+0069 U+FE20 U+0065 U+0328 U+FE21 is canonically equivalent to U+0069 U+FE20 U+0065 U+FE21 U+0328, with U+0069 U+FE20 U+0065 U+0328 U+FE21 being the canonically ordered version.
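
The reordering in the ѩ example follows from the canonical ordering step of normalisation: within each run of combining marks, marks are stably sorted by combining class. The sketch below implements only that step (no decomposition or composition) to show the mechanism; `canonical_order` is a hypothetical helper written for this page, and in practice `unicodedata.normalize` should be used instead.

```python
import unicodedata


def canonical_order(text: str) -> str:
    """Stable-sort each run of combining marks (class > 0) by combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A base character (class 0) ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


unordered = "i\uFE20e\uFE21\u0328"  # right half (230) before ogonek (202)
ordered = canonical_order(unordered)
print(" ".join(f"U+{ord(ch):04X}" for ch in ordered))
```

Because Python's `sorted` is stable, marks with equal combining classes keep their original order, which is why same-class sequences are never rearranged.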

Examples from the romanisation table for Non-Slavic Languages (in Cyrillic Script)

Most instances of ligature ties in the Non-Slavic table are simple cases that do not involve multiple diacritics. The one standout example is the romanisation of the Abkhaz letter ҵ.

| Cyrillic glyph | Half form | Double span diacritic | Romanised glyph |
| --- | --- | --- | --- |
| ҵ | No valid Unicode sequence. | t + ◌͡ + CGJ + ◌̇ + s (U+0074 U+0361 U+034F U+0307 U+0073) | t͡͏̇s |

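To apply a combining diacritic stacked above the combining double diacritic, it is necessary to use the Combining Grapheme Joiner (CGJ) U+034F. Combining Dot Above (U+0307) has combining class 230, which is lower than the 234 of U+0361, so without the CGJ normalisation would reorder the dot in front of the double diacritic and attach it to the t. The CGJ has combining class 0 and therefore blocks that reordering.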
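
The effect of the CGJ on the Abkhaz example can be checked with a short sketch, assuming Python's `unicodedata` module. The variable names are illustrative.

```python
import unicodedata

with_cgj = "t\u0361\u034F\u0307s"  # t + double breve + CGJ + dot above + s
without_cgj = "t\u0361\u0307s"     # same sequence with the CGJ omitted

nfd_with = unicodedata.normalize("NFD", with_cgj)
nfd_without = unicodedata.normalize("NFD", without_cgj)

# The CGJ has combining class 0, so the dot above stays after the double breve.
print(nfd_with == with_cgj)

# Without it, the dot above (230) is moved ahead of the double breve (234).
print(" ".join(f"U+{ord(ch):04X}" for ch in nfd_without))
```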

Concluding thoughts

The above examples illustrate that the combining classes of the diacritics can be critical, and that half-form diacritic combinations do not necessarily align with the double spanning diacritics. In the case of the Abkhaz letter ҵ, there is a double spanning diacritic version but no matching ligature tie equivalent.