Fix espeak-ng handling of Malayalam chillu-virama and letter dot reph. #246

Open
rhdunn opened this issue Apr 29, 2017 · 0 comments

In readclause.c, espeak converts U+0D4D (MALAYALAM SIGN VIRAMA), when followed by U+200D (ZERO WIDTH JOINER), into the codepoint U+0D4E. The comment says "use this unofficial code for chillu-virama", but U+0D4E has been allocated to MALAYALAM LETTER DOT REPH. This behaviour is therefore broken: espeak and espeak-ng treat LETTER DOT REPH as a chillu-virama.

Any fix needs to identify the rule and list files that rely on the current representation and update them to use whatever representation the fix adopts.

The preferred approach is to apply the normalisation logic specified in Table 1 ("Atomic Encoding of Chillus") of the Unicode 5.1.0 documentation [2]: for example, map 0D23 0D4D 200D to 0D7A, and leave unknown sequences unmodified (i.e. preserve the 0D4D 200D representation). The rule and list files should use the Unicode 5.1.0 preferred representation of the chillu letters NN, N, RR, L, LL, and K (U+0D7A to U+0D7F).
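
A minimal sketch of that normalisation, assuming the text has already been decoded into an array of codepoints. The function name `normalise_chillus` and the table layout are hypothetical and do not reflect the existing readclause.c code. The five mappings are my reading of Table 1 in [2] and should be checked against it; chillu K (U+0D7F) is left out because no legacy sequence for it is given here.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRAMA 0x0D4D /* MALAYALAM SIGN VIRAMA */
#define ZWJ    0x200D /* ZERO WIDTH JOINER */

/* Base consonant -> atomic chillu, applied to <base, VIRAMA, ZWJ>.
 * Assumed from Table 1 of the Unicode 5.1.0 notes; verify before use. */
static const struct { uint32_t base, chillu; } chillu_map[] = {
    { 0x0D23, 0x0D7A }, /* NNA -> CHILLU NN */
    { 0x0D28, 0x0D7B }, /* NA  -> CHILLU N  */
    { 0x0D30, 0x0D7C }, /* RA  -> CHILLU RR */
    { 0x0D32, 0x0D7D }, /* LA  -> CHILLU L  */
    { 0x0D33, 0x0D7E }, /* LLA -> CHILLU LL */
};

/* Hypothetical helper: rewrites buf in place and returns the new length.
 * Sequences not in the table, including a bare <VIRAMA, ZWJ> after any
 * other consonant, are copied through unchanged. */
size_t normalise_chillus(uint32_t *buf, size_t len)
{
    size_t in = 0, out = 0;
    while (in < len) {
        uint32_t replacement = 0;
        if (in + 2 < len && buf[in + 1] == VIRAMA && buf[in + 2] == ZWJ) {
            for (size_t i = 0; i < sizeof chillu_map / sizeof chillu_map[0]; i++) {
                if (chillu_map[i].base == buf[in]) {
                    replacement = chillu_map[i].chillu;
                    break;
                }
            }
        }
        if (replacement) {
            buf[out++] = replacement; /* collapse three codepoints into one */
            in += 3;
        } else {
            buf[out++] = buf[in++];   /* keep everything else as-is */
        }
    }
    return out;
}
```

With this shape, U+0D4E is never synthesised, so an input U+0D4E can be handled purely as LETTER DOT REPH by the rule files.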

The new tokenizer logic should have test cases covering each of these characters to ensure that the sequences are mapped correctly.

References:

  1. https://en.wikipedia.org/wiki/Malayalam_script#Chillus
  2. http://www.unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters