Fix espeak-ng handling of Malayalam chillu-virama and letter dot reph. #246

rhdunn · 2017-04-29T10:36:03Z

In readclause.c, espeak converts U+0D4D (MALAYALAM SIGN VIRAMA) into the codepoint U+0D4E when it is followed by U+200D (ZERO WIDTH JOINER). The comment says "use this unofficial code for chillu-virama", but U+0D4E has since been allocated to MALAYALAM LETTER DOT REPH. As such, this behaviour is broken: espeak and espeak-ng are treating LETTER DOT REPH as a chillu-virama.

Any fix for this needs to identify the rule and list files that depend on the current U+0D4E representation and update them to use whatever representation the fix adopts.

The preferred approach would be to use the normalisation logic specified in Table 1 ("Atomic Encoding of Chillus") of the Unicode 5.1.0 documentation[2]. For example, mapping 0D23 0D4D 200D to 0D7A, and keeping unknown sequences unmodified (i.e. preserving the 0D4D 200D representation). The rule and list files should use the Unicode 5.1.0 preferred representation of the chillu letters NN, N, RR, L, LL, and K (U+0D7A - U+0D7F).
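As a rough illustration of that normalisation, here is a minimal sketch in C. It is not the espeak-ng implementation: the names `chillu_for_base` and `normalise_chillus` are hypothetical, and the text is assumed to be already decoded into an array of UTF-32 codepoints. Only the five consonant + virama + ZWJ sequences I am confident of are mapped; chillu K (U+0D7F) is deliberately left out of the table because I am not certain of a legacy sequence for it. Any other sequence, including a bare U+0D4E, is passed through unchanged.

```c
#include <stddef.h>
#include <stdint.h>

#define MALAYALAM_SIGN_VIRAMA 0x0D4D
#define ZERO_WIDTH_JOINER     0x200D

/* Hypothetical helper: return the atomic chillu letter for a base
 * consonant, or 0 if the consonant has no mapping in this sketch. */
static uint32_t chillu_for_base(uint32_t base)
{
    switch (base) {
    case 0x0D23: return 0x0D7A; /* NNA  -> CHILLU NN */
    case 0x0D28: return 0x0D7B; /* NA   -> CHILLU N  */
    case 0x0D30: return 0x0D7C; /* RA   -> CHILLU RR */
    case 0x0D32: return 0x0D7D; /* LA   -> CHILLU L  */
    case 0x0D33: return 0x0D7E; /* LLA  -> CHILLU LL */
    default:     return 0;      /* unknown: caller keeps the sequence */
    }
}

/* Hypothetical helper: rewrite <consonant, U+0D4D, U+200D> sequences
 * in place as atomic chillu letters. Returns the new length.
 * Unknown sequences are copied through unmodified. */
static size_t normalise_chillus(uint32_t *text, size_t len)
{
    size_t in = 0, out = 0;

    while (in < len) {
        if (in + 2 < len &&
            text[in + 1] == MALAYALAM_SIGN_VIRAMA &&
            text[in + 2] == ZERO_WIDTH_JOINER) {
            uint32_t chillu = chillu_for_base(text[in]);
            if (chillu != 0) {
                text[out++] = chillu;
                in += 3; /* consume consonant, virama and ZWJ */
                continue;
            }
        }
        text[out++] = text[in++]; /* copy everything else as-is */
    }
    return out;
}
```

Because the output is never longer than the input, the rewrite can be done in place. A real fix would still need to check how readclause.c buffers its input before adopting this shape.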

The new tokenizer logic should have test cases covering each of these characters to ensure that the sequences are mapped correctly.

Reference:

rhdunn added bug languages legacy:espeak labels Apr 29, 2017

rhdunn added this to the 1.49.2 milestone Apr 29, 2017

rhdunn modified the milestones: Future, 1.49.2 Jul 6, 2017

rhdunn mentioned this issue Oct 1, 2017

Emoji support produces incomplete or corrupt translations #308

Open
