Japanese Dakuten separation leads to incorrect conversion #1871
Comments
Confirmed. Any ideas for fixing this are welcome.
$ espeak-ng -v ja "るうぃは" -X
Translate 'るうぃは'
36 る [r`u]
57 るう [r`u:]
Translate 'る'
36 る [r`u]
Translate 'う'
36 う [u]
Translate 'ぃ'
Found: '_ja' [dZ'ap@ni:z]
Translate 'は'
36 は [ha]
r`'u 'u _:(en)dZ'ap@ni:z(ja)l'et@ h'a
Huh, I think that's the true extent of this bug; the dakuten itself isn't the cause. From what I can glean, espeak-ng consumes the longest possible grapheme sequence specified in the rules sequentially, i.e. a greedy algorithm.
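To make the greedy behaviour concrete, here is a minimal Python sketch of a longest-match translator. It is not espeak-ng's actual implementation: the `RULES` table, the `FALLBACK` marker, and the `greedy_translate` helper are all hypothetical, and the table contains only enough toy entries to mirror the failure reported in this issue.

```python
# Toy rule table (an assumption for illustration, NOT espeak-ng's real ja_rules).
RULES = {
    "る": "r`u",
    "るう": "r`u:",
    "う": "u",
    "は": "ha",
    "う\u3099ぃ": "vi",  # hypothetical rule for the decomposed form of ゔぃ
}
FALLBACK = "<Japanese letter>"  # stand-in for the spelled-out fallback


def greedy_translate(text, rules=RULES):
    """Consume the longest matching rule at each position, left to right."""
    out = []
    max_len = max(len(key) for key in rules)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in rules:
                out.append(rules[chunk])
                i += length
                break
        else:
            # No rule matched even a single character.
            out.append(FALLBACK)
            i += 1
    return out


alone = greedy_translate("う\u3099ぃ")
in_context = greedy_translate("る" + "う\u3099ぃ" + "は")
print(alone)
print(in_context)
```

In this sketch the decomposed ゔぃ translates cleanly on its own, but once る precedes it the longer るう rule wins first and strands the combining dakuten and the small ぃ, so both fall through to the fallback.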
Or alternatively, we could just add rules for all the smaller versions of the nouns and call it a day:
Of course, this still leaves the problem of the dakuten (and handakuten), which by definition don't have a fixed sound. I propose a mixed strategy: remove the separation of dakuten/handakuten and treat graphemes such as
Hmm, this isn't limited to small kana, either. The long vowel indicator (chōonpu) ー triggers the same fallback:
$ espeak-ng -v ja とおー -X
Translate 'とおー'
36 と [to]
57 とお [to:]
Translate 'と'
36 と [to]
Translate 'お'
36 お [o]
Translate 'ー'
Found: '_ja' [dZ'ap@ni:z]
t'o 'o _:(en)dZ'ap@ni:z(ja)l'et@
Unlike the above samples, which are admittedly pretty niche, this is a very common combination.
Consider the phrase るゔぃは, which corresponds to r'uviha. The る character is r'u, ゔぃ is vi, and は is ha.

The current mechanism for Japanese splits the ゔ (U+3094) character into two characters, う (U+3046) and U+3099 (the dakuten character).

Alone, ゔぃ is converted into vi without a hitch. However, since there exists a rule for るう, the leading るう gets converted into r'u:, leaving behind a dangling U+3099 ぃは. The leading two characters of that remainder don't have a corresponding rule, so they default to Japanese letter Japanese letter.

Thus, instead of r'uviha, the result becomes r'u: Japanese letter Japanese letter ha.

Disclaimer: I am not a native speaker of Japanese.
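The split described above matches Unicode canonical decomposition, which can be checked with Python's standard `unicodedata` module. This is only an illustration of the character-level behaviour; whether espeak-ng performs the split through Unicode normalization or through its own table is not stated in this issue.

```python
import unicodedata

vu = "\u3094"  # ゔ HIRAGANA LETTER VU

# NFD splits the precomposed letter into its base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", vu)
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(code_points)

# NFC puts the pair back together into the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == vu)
```

Recomposing to NFC before rule matching is one way to keep ゔ as a single grapheme, which is the direction the proposal in the comments points toward; it is offered here as an assumption, not as the project's chosen fix.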