Unable to compile zhy dictionary on Windows #933

Closed
feerrenrut opened this issue May 4, 2021 · 15 comments

@feerrenrut
Contributor

feerrenrut commented May 4, 2021

The NVDA project has had an issue open for a while. We'd like to ask for some assistance to identify the issue. This process works for all other languages.

Our build script for espeak dictionaries:

  • Calls espeak_Initialize
  • Constructs an espeak_VOICE struct (see struct definition below) with language set to zhy\0
  • Calls espeak_SetVoiceByProperties which returns 2 (ENS_VOICE_NOT_FOUND)

Struct:

import ctypes

# Mirrors the espeak_VOICE struct declared in speak_lib.h.
class espeak_VOICE(ctypes.Structure):
	_fields_ = [
		('name', ctypes.c_char_p),
		('languages', ctypes.c_char_p),
		('identifier', ctypes.c_char_p),
		('gender', ctypes.c_byte),
		('age', ctypes.c_byte),
		('variant', ctypes.c_byte),
		('xx1', ctypes.c_byte),
		('score', ctypes.c_int),
		('spare', ctypes.c_void_p),
	]

Thanks!
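
For readers following along, the voice-selection step described above can be sketched roughly as below. This is only an illustration, not NVDA's actual build script: the helper name set_voice_by_language is made up, the lib argument is assumed to be an already loaded and initialized espeak library object (for example one obtained via ctypes.CDLL), and the only library call used is espeak_SetVoiceByProperties as named in the report.

```python
import ctypes

EE_OK = 0  # success value of the espeak_ERROR enum


class espeak_VOICE(ctypes.Structure):
    # Same layout as the struct quoted in the report.
    _fields_ = [
        ("name", ctypes.c_char_p),
        ("languages", ctypes.c_char_p),
        ("identifier", ctypes.c_char_p),
        ("gender", ctypes.c_byte),
        ("age", ctypes.c_byte),
        ("variant", ctypes.c_byte),
        ("xx1", ctypes.c_byte),
        ("score", ctypes.c_int),
        ("spare", ctypes.c_void_p),
    ]


def set_voice_by_language(lib, language):
    """Hypothetical helper: select a voice by language code, raising on failure.

    `lib` is assumed to be a loaded espeak library on which
    espeak_Initialize has already been called.
    """
    # Only `languages` is set; ctypes zero-initializes the other fields.
    voice = espeak_VOICE(languages=language.encode("ascii"))
    result = lib.espeak_SetVoiceByProperties(ctypes.byref(voice))
    if result != EE_OK:
        raise RuntimeError(
            f"espeak_SetVoiceByProperties({language!r}) returned {result}"
        )
```

With a helper like this, the failure reported here would surface as an exception carrying the return value 2 instead of a silently skipped dictionary.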

@jaacoppi
Collaborator

jaacoppi commented May 4, 2021 via email

@rhdunn
Member

rhdunn commented May 5, 2021

Yes, that's correct. The zhy name is not a valid BCP 47 name. The IANA language subtag registry (based on ISO 639-* for language codes) lists yue for Cantonese and cmn for Mandarin. Other voices have a similar change as well.

In the voice files, the old espeak names are listed as options for compatibility reasons.

The code still refers to them by the old names because it hasn't been refactored to align with the changes. The naming of the phoneme files is not consistent either (espeak used the language names, e.g. ph_dutch, but not for Cantonese and Mandarin for some reason). I just hadn't got around to addressing it, as other things like emoji support were higher priority and I got burned out after that.

@feerrenrut
Contributor Author

feerrenrut commented May 6, 2021

Thanks for the explanations. We are currently splitting the "zhy_rules" to get the language to switch to. I'll add an exception for that language.
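
A minimal sketch of what such an exception could look like, assuming the language is taken from the part of the file name before "_rules". The table VOICE_FOR_RULES_PREFIX and the function name are hypothetical and are not taken from NVDA's build script; the zhy to yue entry follows the explanation given earlier in this thread.

```python
from pathlib import Path

# Hypothetical exception table: rules-file prefix -> voice language to select.
# Prefixes that are not listed are assumed to already be valid voice languages.
VOICE_FOR_RULES_PREFIX = {"zhy": "yue"}


def voice_language_for_rules_file(path):
    """Derive the voice language from a path such as 'dictsource/zhy_rules'."""
    name = Path(path).name
    prefix, sep, suffix = name.rpartition("_")
    if not sep or suffix != "rules":
        raise ValueError(f"not a *_rules file: {name}")
    # Fall back to the prefix itself when no exception is registered.
    return VOICE_FOR_RULES_PREFIX.get(prefix, prefix)
```

For example, voice_language_for_rules_file("dictsource/zhy_rules") gives "yue", while a file named en_rules maps to "en" unchanged.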

@jaacoppi
Collaborator

jaacoppi commented May 6, 2021 via email

@feerrenrut
Contributor Author

I misspoke in my last comment (now edited): we are deriving the language codes from the *_rules files, not from the *_dict files. But if the same offer applies to those files, that would also simplify this.

I have a potential workaround for this in nvaccess/nvda#12370; however, perhaps we are only creating one dictionary when we should be creating two?

@jaacoppi
Collaborator

jaacoppi commented May 6, 2021 via email

@feerrenrut
Contributor Author

OK, thanks @jaacoppi. We might go ahead with the workaround for now, but it would be good to be able to remove it in the future. On the other hand, perhaps we should have an explicit listing of the voices to use with language rules to produce dictionaries.

@feerrenrut
Contributor Author

So to confirm, I should compile the dictionaries using voice lang yue with zhy rules as well as using voice lang cmn with zhy rules to produce two dictionaries. Doing this seems to produce zhy_dict and zh_dict. If that is the case, it might be better for us just to have an explicit mapping rather than iterate over the files.

@jaacoppi
Collaborator

jaacoppi commented May 7, 2021 via email

@jaacoppi
Collaborator

jaacoppi commented May 8, 2021

I've refactored zhy to yue without problems. Two questions for @rhdunn before I push the changes:

  1. Are the extended dictionaries and original espeak compatibility still relevant? Do we keep them or can we simplify and just use one _list file per language like for most languages? See commit f672211 to refresh your memory.

  2. The codebase has instances of "zh" (like the switch case in tr_languages.c and the language tags in espeak-ng-data/lang/sit/cmn and espeak-ng-data/lang/sit/zhy). There's "language zh-cmn", "language zh 8" and so on. Can I get rid of everything that mentions zh so that the code explicitly uses either Mandarin or Cantonese, not generic Chinese?

@rhdunn
Member

rhdunn commented May 8, 2021

The extended dictionaries are still relevant. When enabled, they add a lot of entries generated from a dictionary (which I believe is from the Unicode Unihan database, but I'm not 100% sure on that). This allows distributions that want to save space to ignore the listx files. It also keeps the generated lists separate from the custom exception lists. Ideally, the listx files should be autogenerated from the Unihan list, but I haven't yet figured out how to do that, which version was used for the original list (to compare against when using a script to generate the list), or what changes (if any) were made to that process.

Ideally, espeak compatibility should be preserved where possible, especially for anything that already uses espeak. Keeping the old names in the lang files allows users/applications that have e.g. zh set as their TTS voice for Orca/speech-dispatcher/etc. to still work when using espeak-ng. Likewise if/when they are using the espeak API. Therefore, the language zh 8, etc. should stay, but changing the dictionary/language file names should be OK as users don't directly interact with those.

@feerrenrut
Contributor Author

Thanks for the explanation @jaacoppi and @rhdunn. Could you link the PR / change to this issue when possible? I think it would be handy to confirm that the espeak mechanism for compiling the dictionaries matches our usage of the espeak DLL in NVDA.

jaacoppi added a commit to jaacoppi/espeak-ng that referenced this issue May 12, 2021
jaacoppi added a commit to jaacoppi/espeak-ng that referenced this issue May 12, 2021
@jaacoppi
Collaborator

@feerrenrut: The renaming has now been done in #940.

I'll also point out the dictrules setting for Cantonese. It's in the file espeak-ng-data/lang/sit/yue.
dictrules 1 means Latin characters are presumed to be English
dictrules 2 means Latin characters are presumed to be Jyutping

You might want to make two versions of the voice for different use cases.
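
One way to produce such a second version is to copy the voice file's text and change only its dictrules line. The sketch below is an illustration under assumptions: the function name with_dictrules is made up, and it assumes the setting sits on a line of its own that starts with the word dictrules. It works purely on text, so the real contents of the yue voice file are not assumed.

```python
import re

# Matches a whole line that starts with the dictrules attribute.
_DICTRULES_LINE = re.compile(r"^dictrules\b.*$", flags=re.MULTILINE)


def with_dictrules(voice_text, value):
    """Return a copy of a voice file's text with dictrules set to `value`."""
    new_line = f"dictrules {value}"
    if _DICTRULES_LINE.search(voice_text):
        # Replace the existing setting in place.
        return _DICTRULES_LINE.sub(new_line, voice_text)
    # No setting present: append one as a new final line.
    return voice_text.rstrip("\n") + "\n" + new_line + "\n"
```

The result could then be written out under a different voice name, giving one voice that treats Latin characters as English and another that treats them as Jyutping.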

Mandarin doesn't have such a choice yet. There's an open issue at #347. It seems I've forgotten it.

At the moment there's no easy way to set a default language other than English.

@jaacoppi
Collaborator

@feerrenrut: Can this be closed?

@feerrenrut
Contributor Author

Yes, thanks @jaacoppi. We are updating NVDA to make use of this in nvaccess/nvda#12370

feerrenrut added a commit to nvaccess/nvda that referenced this issue May 28, 2021
…12370)

Compiling the zhy dictionary has failed for a long time; it was excluded because the cause was unknown.
It was suspected that there was an error in the format of the files.
Looking into this, I found the issue was caused by trying to set the voice to "zhy" by calling espeak_SetVoiceByProperties.
The result was 2 (ENS_VOICE_NOT_FOUND).
Compilation was based on using glob to find the *_rules files, and splitting the filename to get the language to use for the voice.

Espeak-ng has renamed zhy and zh files to match the language code that should be used: yue and cmn for Cantonese and Mandarin respectively.
See espeak-ng/espeak-ng#933

Description of how this pull request fixes the issue:
This change makes the compilation of espeak dictionaries explicit.
An explicit listing of the dictionaries NVDA expects (rather than using glob) makes us aware when languages are introduced or removed.
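
A rough sketch of how an explicit listing can surface added or removed languages, assuming the source directory holds files named <language>_rules. The listing EXPECTED_DICTIONARIES and the function name are placeholders for illustration and are not NVDA's actual list or code.

```python
from pathlib import Path

# Hypothetical listing of the dictionaries the build expects to produce.
EXPECTED_DICTIONARIES = {"en", "yue", "cmn"}

_SUFFIX = "_rules"


def compare_with_source(source_dir, expected=EXPECTED_DICTIONARIES):
    """Return (added, removed) language codes relative to the explicit listing."""
    # Glob is used only as a cross-check, never to decide what gets compiled.
    found = {p.name[: -len(_SUFFIX)] for p in Path(source_dir).glob("*" + _SUFFIX)}
    added = sorted(found - set(expected))    # present upstream, not listed
    removed = sorted(set(expected) - found)  # listed, missing upstream
    return added, removed
```

A build could fail, or at least warn, when either list is non-empty, so that a rename such as zhy to yue is noticed instead of silently breaking a dictionary.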