In readclause.c, espeak converts U+0D4D (MALAYALAM SIGN VIRAMA), when followed by a ZERO WIDTH JOINER (U+200D), into the code point U+0D4E. The comment says "use this unofficial code for chillu-virama"; however, U+0D4E has been allocated to MALAYALAM LETTER DOT REPH. As a result, this behaviour is broken: espeak/espeak-ng treat LETTER DOT REPH as a chillu-virama.
Any fix needs to identify the rule and list files that rely on this code point and update them to use whichever representation the fix adopts.
The preferred approach would be to use the normalisation logic specified in Table 1 ("Atomic Encoding of Chillus") of the Unicode 5.1.0 documentation[2]. For example, mapping 0D23 0D4D 200D to 0D7A, and keeping unknown sequences unmodified (i.e. preserving the 0D4D 200D representation). The rule and list files should use the Unicode 5.1.0 preferred representation of the chillu letters NN, N, RR, L, LL, and K (U+0D7A - U+0D7F).
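A minimal sketch of that normalisation, operating on an array of already-decoded code points, might look like the following. The function names `chillu_for_base` and `normalise_chillus` are hypothetical and do not exist in espeak-ng. Only the NN mapping (0D23 0D4D 200D to 0D7A) is taken from the text above; the remaining base consonants are assumptions drawn from the Unicode charts and should be checked against Table 1 before use.

```c
#include <stddef.h>
#include <stdint.h>

#define MAL_VIRAMA 0x0D4D /* MALAYALAM SIGN VIRAMA */
#define ZWJ        0x200D /* ZERO WIDTH JOINER */

/* Hypothetical helper: return the atomic chillu letter for a base
 * consonant, or 0 if the consonant has no atomic chillu form.
 * Only the 0D23 -> 0D7A row is confirmed by the issue text; the other
 * rows are assumptions to verify against the Unicode 5.1.0 table. */
uint32_t chillu_for_base(uint32_t base)
{
    switch (base) {
    case 0x0D23: return 0x0D7A; /* NNA -> CHILLU NN */
    case 0x0D28: return 0x0D7B; /* NA  -> CHILLU N  */
    case 0x0D30: return 0x0D7C; /* RA  -> CHILLU RR */
    case 0x0D32: return 0x0D7D; /* LA  -> CHILLU L  */
    case 0x0D33: return 0x0D7E; /* LLA -> CHILLU LL */
    case 0x0D15: return 0x0D7F; /* KA  -> CHILLU K  */
    default:     return 0;
    }
}

/* Hypothetical normaliser: copy `len` code points from `in` to `out`,
 * replacing each <base, VIRAMA, ZWJ> sequence that has a known atomic
 * chillu with that single code point. Unknown sequences are copied
 * unmodified, so their 0D4D 200D representation is preserved.
 * `out` must have room for `len` code points. Returns the output length. */
size_t normalise_chillus(const uint32_t *in, size_t len, uint32_t *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        uint32_t chillu = 0;
        if (i + 2 < len && in[i + 1] == MAL_VIRAMA && in[i + 2] == ZWJ)
            chillu = chillu_for_base(in[i]);
        if (chillu != 0) {
            out[n++] = chillu; /* three code points collapse to one */
            i += 3;
        } else {
            out[n++] = in[i++]; /* pass through unchanged */
        }
    }
    return n;
}
```

With this shape, no input is ever rewritten to U+0D4E, so LETTER DOT REPH in the input keeps its own meaning. The same table could drive the test cases: one input per chillu letter, plus at least one sequence that must pass through unchanged.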
The new tokenizer logic should have test cases covering each of these characters to ensure that the sequences are mapped correctly.
Reference: