Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support translations for 3rd party mods #505

Merged
merged 10 commits into from
May 7, 2021

Conversation

olanti-p
Copy link
Member

@olanti-p olanti-p commented May 4, 2021

Summary

SUMMARY: Infrastructure "New translation system with support for 3rd party mods"

Purpose of change

Allow 3rd party mods to ship their own MO files (CleverRaven/Cataclysm-DDA#25566).
Fix #495.
Fix plural forms for some languages on Android.

Describe the solution

Roll out custom runtime localization system (dubbed cata_libintl) that is compatible with gettext MO files.

Key differences from currently used GNU libintl:

  1. Does not support translation domains.
    BN only uses "default" translation domain, and rewriting toolchain to support multiple domains and rewriting the code to actually use them would be a huge and bugprone undertaking.
  2. Does not support encoding conversion, operates on UTF-8 only.
    BN only uses UTF-8, so that's fine.
  3. Does not depend on locale / environment variables.
    Dependence on locale / environment variables caused numerous problems with libintl in the past, as Cata aims to support multiple platforms and compilers. The most recent issue is Scaling factor x2 and x4 resets language to English #495, which seems to be caused by SDL fiddling with locale, and is absent when using cata_libintl.
  4. Supports loading MO files from arbitrary paths.
    Or from memory. This is currently used only in tests.
  5. Supports loading multiple MO files into a single "domain".
    To enable multi-MO support without the ability to use domains, the new system has to be able to merge MO files "on the fly". It is currently possible to do so manually via gettext utilities (use msgunfmt to decompile base game MO back into PO, then msgcat to concatenate with PO from a mod, then msgfmt to compile back into MO), but the process takes some time (20 seconds on my laptop) and has to be repeated each time game updates or mod list changes.

Also add documentation on how to translate 3rd party mods.

Describe alternatives you've considered

  1. Call gettext utilities from within the game to merge MO files and use some caching algorithm to cut down on re-compilations.
    This solution has multiple potential problems:

    • It may require special logic to deal with dialect mismatch (I do know that en and en_US are treated differently by libintl, and the same probably applies to msgcat).
    • It requires code for calling other process from within the game. The code that'll have to work on all platforms/compilers...
    • It requires gettext utilities to be shipped with the game, or accessible to the game. Not a problem on Linux/OSX (may have to ask user to install from apt/brew/whatever), and a bit annoying on Windows (have to include them with the distribution), but I have no idea how it works on Android, or whether GNU gettext utilities have Android binaries.
    • We'd have to keep track of various incompatibilities that are bound to show up on some platforms, and that's a whole separate headache I'd rather not deal with.
    • On Android instead of libintl Cata uses custom fork of libintl-lite, which would be its own source of fun little inconsistencies someone would have to sort through.
  2. Drop gettext completely and use some alternative localization system (e.g. tables of string id : translation as many games and software projects do), possibly with built-in merge function.
    We have over 10 thousand strings in source code and over 30 thousand in JSONs. Moving from gettext would be a pain.

  3. Use some existing lightweight library for working with gettext files.
    The only one I've found that wouldn't be a pain to use is spirit-po, but it requires Boost.

Testing

All tests pass.

Benchmarks (Intel Core i5-3230M)
Filters: [libintl][benchmark]
N strings: 43625

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cata_test is a Catch v2.12.0 host application.
Run with -? for options

-------------------------------------------------------------------------------
bench_get_translated_string
-------------------------------------------------------------------------------
cata_libintl_test.cpp:620
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
get_all_strings                                100             1      6,2616 s 
                                        62,7425 ms    62,4491 ms    63,1855 ms 
                                        1,80806 ms    1,33141 ms    2,43433 ms 
                                                                               

File size: 9789435 bytes
N strings: 43625
-------------------------------------------------------------------------------
bench_parse_mo
-------------------------------------------------------------------------------
cata_libintl_test.cpp:651
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
parse MO                                       100             1     2,46908 s 
                                        24,7708 ms    24,7168 ms    24,9087 ms 
                                        408,696 us    196,054 us    809,159 us 
                                                                               

File size: 9789435 bytes
N strings: 43625
-------------------------------------------------------------------------------
bench_asssemble_trans_lib
-------------------------------------------------------------------------------
cata_libintl_test.cpp:669
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
parse MO + assemble lib                        100             1     6,14146 s 
                                        61,5096 ms    61,4134 ms    61,6845 ms 
                                         642,66 us    403,011 us    960,402 us 
                                                                               

===============================================================================
test cases: 3 | 3 passed
assertions: - none -

Testing / benchmarking "in the wild" didn't indicate any significant changes in cpu/memory usage, and MO parsing + string sorting always took less than 50 ms. Reading from disk took 0..500 ms depending on os / hard drive / filesystem cache.

Additional context

Low-priority stuff for future PRs:

  1. Allow translating mod name/description (either by "peeking" into the mod's MO and taking translation from there, or by adding translated name/description to modinfo.json)
  2. Once testing is finished, get rid of libintl code and remove it from build dependencies

Aivean and others added 2 commits May 6, 2021 16:57
Cherry-picked from commit
CleverRaven/Cataclysm-DDA@35a2d9e

Co-authored-by:  Zhilkin Serg <ZhilkinSerg@users.noreply.github.com>
@@ -1,14 +1,14 @@
[
{
"//": "See language.h for documentation",
"id": "en",
"id": "en_US",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll probably have to adjust the Transifex settings regarding language names after this.
From what I can tell, it won't allow en_US to count as en, even if there is no other en.

Copy link
Member Author

@olanti-p olanti-p May 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When dialect is specified in language id, the game picks up both dialect-specific and generic MO files (both old and new implementations), so it should be fine as is

//
// This test reaffirms the assumption that both Transifex's and GNU's plf expressions
// produce same values for integer numbers.
TEST_CASE( "gnu_transifex_rules_equal", "[libintl][i18n][.]" )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dot because slow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, 9-10 seconds.

@@ -0,0 +1 @@
��
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly empty. Is it supposed to be "empty MO" with magic numbers at start, truly empty file that ended up not-empty, or something else?

Copy link
Member Author

@olanti-p olanti-p May 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here is that if the file is empty, or is smaller than 4 bytes, it's guaranteed to not be able to hold a magic number, ergo be a MO file.
I should've renamed it to not_a_mo_file.mo or something similar.

@Coolthulhu Coolthulhu merged commit fbf53c7 into cataclysmbnteam:upload May 7, 2021
@Coolthulhu
Copy link
Member

Got it working with the docs, but it may be useful to have a "tl;dr" variant of the docs somewhere.
Just the commands, without the context, so that a modder/translator can paste them instead of having to understand things first.

@cainiaowu
Copy link
Contributor

For translation string collision, maybe the current extraction script need to be able to add some context like mod id.

@olanti-p
Copy link
Member Author

olanti-p commented May 8, 2021

For translation string collision, maybe the current extraction script need to be able to add some context like mod id.

I thought about that; in this case, the game would also have to know under which context to look for translations.

The main roadblock is how the game handles data loaded from JSON.

There is no such "context" right now that could allow to differentiate between whether an item has been loaded from one mod or another, or whether half of its strings come from a completely different mod (e.g. the item's name has been modified via copy-from, but the description stays the same).

Someone would have to go over every type of data loadable from JSON (items, monsters, mutations...) and make sure that every translatable string receives the proper context. This is a lot of work, and is bound to produce various obscure bugs noone would notice until much later.

The ideal solution would be to move away from gettext's "string is a translation lookup key and is also an English translation" to "JSON files contain lookup keys, localization files contain all translations including English" schema used by many games and software projects, so that mod authors could append mod id to all lookup keys in both JSON and localization data, but with 30k+ strings in JSON that would take a lot of work for us and break mod compatibility for everyone else.

@olanti-p olanti-p deleted the mod-translate-final branch May 8, 2021 10:25
@vorpal-void
Copy link
Contributor

vorpal-void commented May 9, 2021

Got it working with the docs, but it may be useful to have a "tl;dr" variant of the docs somewhere.
Just the commands, without the context, so that a modder/translator can paste them instead of having to understand things first.

@Coolthulhu I've tried to do a little translation with current docs, everything went fine! The most "complex" part is to run scripts properly. Here is simplified steps that most of the modders will be able to follow, this one is for Windows:

  1. Install PoEdit
  2. Install Python
  3. Open Windows CMD and type pip install polib
  4. Copy extract_json_strings.py and dedup_pot_file.py from the lang folder of repository to the mod folder you wish to translate
  5. In the folder mod create batch file with next content:
CD /D %~dp0
md lang
python extract_json_strings.py -i "%cd%" -o "%cd%\lang\extracted_strings.pot" --project Mod_Translation
python dedup_pot_file.py "%cd%\lang\extracted_strings.pot"
  1. After launching this .bat file there will be extracted_strings.pot in the mod's lang directory, which can be edited with Poedit
  2. [Poedit instructions here, there are simple enough already]

the .bat file could be distributed along the scripts too, if it's applicable; in that case step 4 should include that file as well, and step 5 will be obsolete.

@olanti-p
Copy link
Member Author

olanti-p commented May 9, 2021

the .bat file could be distributed along the scripts too

Yeah, that should make it easier.

Apparently full paths on Windows freak out Poedit (as : is normally used as file_name:line separator):

Screenshots

image
image

and it refuses to work with relative paths for some unknown to me reason:
More screenshots

image
image

Same with ./effects.json, .\effects.json, ..\effects.json, .\\effects.json and ..\\effects.json

So seeing as there is no difference either way I think we can use relative path .\ to keep references short and cut down on POT/PO sizes:

python extract_json_strings.py -i .\ -o lang\extracted_strings.pot --project Mod_Translation
python dedup_pot_file.py lang\extracted_strings.pot

Then tweak extract_json_strings.py to deduce project name from first encountered modinfo.json, and the final thing would look like:

@echo off
if not exist lang md lang
python extract_json_strings.py -i .\ -o lang\extracted_strings.pot
python dedup_pot_file.py lang\extracted_strings.pot
echo Done!
pause
@echo on

Alternatively, we could incorporate dedup_pot_file.py into extract_json_strings.py and have -i .\ -o lang\extracted_strings.pot be defaults if no arguments are specified. It'd be somewhat messy from code side, but then extraction would be as simple as "drag-and-drop extract_json_strings.py into mod folder -> double click to run".

@Firestorm01X2
Copy link
Collaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scaling factor x2 and x4 resets language to English
6 participants