Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate PO and Gettext into separate packages #215

Closed
kipcole9 opened this issue Sep 6, 2019 · 45 comments
Closed

Separate PO and Gettext into separate packages #215

kipcole9 opened this issue Sep 6, 2019 · 45 comments

Comments

@kipcole9
Copy link
Contributor

kipcole9 commented Sep 6, 2019

I am working on an ICU Message formatter and localisation processor as the (maybe) last part of the ex_cldr. One requirement is of course to store and manage translations and PO files are clearly the defacto globally standard.

I would like to propose splitting Gettext into as PO package that looks after storage, parsing, merging and extracting. That way different localisation packages can leverage the same infrastructure - which is supported by the format tag in a PO file. As you know, today Gettext saves all messages with the format-elixir tag.

What would need to change:

  1. API and usage for current Gettext users would need to remain identical for backwards compatibility

  2. The extractor would need to be told hold to tag messages so they aren’t all tagged as format-elixir

  3. The backend interface would be normalised. Instead of BackendModule.__gettext__(:priv) for example, it would be better to have BackendModule.__po__(:priv) in order to be more specific and not tied to the Gettext API alone

4 Compiling PO files would compile messages based upon the message format. Perhaps a behaviour would be introduced so that any message format could be compiled. Instead of use Gettext, .... it would become use PO, icu_format: MyMessageFormatter, elxiir_format: Gettext or some such. Then use Gettext, ... would also include use PO, ....

5 Lastly, this may open up the opportunity to have different storage models, like DB, or :persistent_term or even a web service that can replace PO files while still retaining the well known format.

I’m fine to work on this - its better than creating another PO file parser/merger/extracted as @tmbb as already had to do,

Thoughts?

@tmbb
Copy link

tmbb commented Sep 6, 2019

I’m fine to work on this - its better than creating another PO file parser/merger/extracted as @tmbb as already had to do

I actually think Gettext could use a new .po parser

@josevalim
Copy link
Contributor

Agreed on a new parser but that is just part of the work. We can probably re-use everything else. I think Andrea considered a nimble_parsec-based PO parser but I am not sure of the status. Contributions are definitely welcome!

I am also completely happy to go in the direction proposed here. My suggestion is to start adding the new API directly to this repository. We will introduce new modules, move the API over, and once we are happy with the results, we extract it to a separate project. By then, I think we should also create an elixir-gettext org but no need to worry about it for now.

What do you think?

@tmbb
Copy link

tmbb commented Sep 6, 2019 via email

@kipcole9
Copy link
Contributor Author

kipcole9 commented Sep 6, 2019

Thanks for the feedback, I'll start work on the functional separation first (decoupling PO and Gettext). Then aim to integrate a new PO parser. And finally package separation, possible aligned with a new elixir-gettext organization.

As José says, the parsing part is only a bit of what PO does today, albeit an important part.

@tmbb
Copy link

tmbb commented Sep 7, 2019

As promised, the gettext parser: https://github.com/tmbb/mezzofanti/blob/master/lib/mezzofanti/gettext/gettext_parser.ex

Tests here: https://github.com/tmbb/mezzofanti/blob/master/test/mezzofanti/gettext/gettext_parser_test.exs (the parser parses the file into a list of Translation structs, which can be swapped for something else that's more convenient to gettext. I believe I support all of the syntax of .po files, but it's been a couple of months since I've first written this and maybe I missed something. It's not optimized for error reporting, but it shouldn't be too hard to add.

EDIT: It turns out I had already mentioned the parser in this thread: #89 (comment)

@josevalim
Copy link
Contributor

Fantastic! @tmbb would it be an option to swap in the new parser in place of the current parser but keeping the same return API? Then we can migrate it in steps and make sure nothing is lost on the way. Then the next step would be to replace the structs from Gettext.Translation to PO.Translation and similar. Btw, what should we call the project? Is po good enough?

@tmbb
Copy link

tmbb commented Sep 7, 2019

would it be an option to swap in the new parser in place of the current parser but keeping the same return API?

Sure, it's just a question of changing this function: https://github.com/tmbb/mezzofanti/blob/master/lib/mezzofanti/gettext/gettext_parser.ex#L28-L31

Btw, what should we call the project?

Expo?

@tmbb
Copy link

tmbb commented Sep 7, 2019

OTOH, if the .po parser is meant to be reused it should return its own struct and users (such as gettext) should convert from %PoTranslation{} into whatever they want. What do you think?

@tmbb
Copy link

tmbb commented Sep 7, 2019

I would like to propose splitting Gettext into as PO package that looks after storage, parsing, merging and extracting. That way different localisation packages can leverage the same infrastructure - which is supported by the format tag in a PO file. As you know, today Gettext saves all messages with the format-elixir tag.

@kipcole9 How would you deal with the fact that ICU makes directives such as

msgid_plural "%d pages read."
msgstr[0] "Eine Seite gelesen wurde."
msgstr[1] "%d Seiten gelesen wurden."

useless? Should my po parser parse these? This might make the po parser not very reusable... Currently my parser doesn't recognize this directive (nor does it concatenates multiple strings representing multiline strings, but that's easy to add and doesn't pose many problems)

@tmbb
Copy link

tmbb commented Sep 7, 2019

Parser in its own repo: https://github.com/tmbb/ex_po

@josevalim
Copy link
Contributor

josevalim commented Sep 7, 2019 via email

@whatyouhide
Copy link
Contributor

FWIW, I have a working nimble-based parser too, here.

msgid_plural "%d pages read."
msgstr[0] "Eine Seite gelesen wurde."
msgstr[1] "%d Seiten gelesen wurden."

[...] useless?

@tmbb I'm not familiar with ICU. What is useless about this?

@tmbb
Copy link

tmbb commented Sep 7, 2019

In ICU messages, you encode the singular and the plural in te same string, as well as gammatical gender and other things. For example look at the example here: http://userguide.icu-project.org/formatparse/messages

@tmbb
Copy link

tmbb commented Sep 7, 2019

I've created some translation structs inside my ExPo project. I think that is gettext is going to be based on ExPo, it should use ExPo's structs, which AFAIK are backward compatible. There are still some things that need to be changed in ExPo (namely, I need to support more escape chars in strings and other such minor things), but it's in a pretty good shape already.

@josevalim
Copy link
Contributor

Sounds good to me! Can you please send a PR for us to see how it would look like? Btw, what do you think about naming it Expo (app: expo)? Calling it ExPo makes it sound like "I just prepended things with Ex" but calling it Expo makes it sound like a pun on "exposition", which in my humble opinion is much better. :P

@kipcole9
Copy link
Contributor Author

kipcole9 commented Sep 8, 2019

My thinking has been to separate the storage/serialization layer from the translation data layer from the message formatting and translation layer. Based on my reading of this thread that would be seem to be acceptable? It would look like:

Translation data layer

The translation data layer is the structure of translations in an Elixir data structure. In the current Gettext that would be Translation. In this proposal if I've understood the path so far we would standardise on the Expo version of the same. The structure itself needs to represent a message id, its translations, headers and flags.

Data storage/serialization layer

The layer is reponsible for taking the elixir structs, ie Translation or Expo translation and serialising them to some storage mechanism. And of course for the reverse, taking data from the storage representation to the Elixir representation. It is proposed using @tmbb's new Expo for this.

Message formatting and translation

This is the part of the stack that presented a translation API to the developer, does actual translation, message formatting and so on. It also informs, typically at compile time, the translation data layer about new messages that need to be updated, stored or deleted.

Impact on current code

Gettext would preserve its API to the developer and by default it would use the PO storage layer (Expo). So use MyApp.Gettext would, by default, preserve the same API and overall experience. No change to existing code and no expected change to compile time, runtime or performance characteristics.

Other libs, as I plan for Cldr.Messages, can use the Expo storage layer, or may use a different storage layer. It definitely has a different formatting and translation engine since it relies on CLDR underneath.

Configuration

The implication here is that for a given Gettext user there is no change required. It should be possible to configure an alternative storage engine in the future.

Mix tasks

Today the mix tasks are under the Gettext namespace. I propose the move to the PO namespace. The Gettext names could remain and be soft deprecated.

Next steps

  • @tmbb finishes up Expo as the replacement PO parser and default storage layer. This includes the @behaviour for such a storage engine and the data structure that represents the Elixir version of a translations set.

  • @kip works on the gettext code to separate gettext message formatting and translation layer from the Expo storage layer.

  • We continue the discussion to finalise an internal data format for representing translations based on either Translation or Expo or whatever else is deemed appropriate. Since the idea is to allow different message formats to leverage this mechanism it should because a public data structure with enough API to support creating/deleting/updating translations.

Have I captured this correctly?

@josevalim
Copy link
Contributor

josevalim commented Sep 8, 2019 via email

@kipcole9
Copy link
Contributor Author

kipcole9 commented Sep 8, 2019

Make sense José. Should the Expo conversation move to a separate issue to keep that focus? This would help discussion of potential resolutions to #210 and #89 as well.

@josevalim
Copy link
Contributor

josevalim commented Sep 8, 2019 via email

@tmbb
Copy link

tmbb commented Sep 8, 2019 via email

@josevalim
Copy link
Contributor

I have created an elixir-gettext organization with gettext in there. You have been invited to it. I recommend changing Gettext to use Expo and then moving Expo to the organization too. Wdyt?

@josevalim
Copy link
Contributor

Btw, it seems GitHub did not send emails for the invitation but you can accept it here: https://github.com/orgs/elixir-gettext

@whatyouhide whatyouhide changed the title [Proposal] Separate PO and Gettext into separate packages Separate PO and Gettext into separate packages Oct 29, 2019
@maennchen
Copy link
Member

I've found my muse and followed through on the expo library idea: https://github.com/jshmrtn/expo

There's a few known issues, but I think it's almost ready to be used. (Any feedback for the library or help with the remaining 3 issues would be very welcome.)

@josevalim If you're interested, I would like to move this repo into the gettext orga.

@whatyouhide
Copy link
Contributor

@maennchen and others: it's been 3 years and I lost all context. What's the benefit of Expo over the current parser? Asking from a place of not remembering anything 😄

@maennchen
Copy link
Member

@whatyouhide Sure, multiple reasons:

  1. separate packaged => detailed in the original issue message
  2. people were looking for a new nimble based parser, expo has that
  3. (optional and new) expo supports both po & mo files. therefore a more classic approach could ve implemented now
  4. (optional and new) expo supports Keeping or deleting obsolete translations #210
  5. (optional and new) expo supports previous msgids
  6. (optional and new) expo will support parsing of the plural forms header

This is by far not the only way on how to approach things. I would also provide PRs to reintegrate the improvements into gettext. (allthough I would prefer separate packages)

@josevalim
Copy link
Contributor

I have an arm injury, which makes my feedback quite limited, but 👍 from me. The code looks great and I would love to see a migration towards expo. Any remaining question from your side @whatyouhide?

@maennchen
Copy link
Member

Oh, I wish you a speedy recovery then @josevalim!

If everyone is in agreement, It would be great if I could get the permissions so that I can migrate the repository over to this orga.

Either of these will work:

  • Someone part of the gettext orga moves it
    • I can grant admin permissions on the jshmrtn repo to someone that can move it here
    • This person should also set up coveralls quickly
    • I would appreciate management rights on the target expo repo right now (at least until we actually integrate it into gettext)
  • I get permissions to move it myself
    • permission to create a new repo
    • permission to set up oauth integrations (coveralls)

@whatyouhide
Copy link
Contributor

@maennchen I won't have time to do this this week, but I'll go with option 1. early next week. In the meantime, before we integrate Expo into Gettext, I'd really love to see some benchmarks on the parsing of nimble_parsec vs yecc. Do we also have examples of error messages and if those get better, worse, or are equivalent?

@maennchen
Copy link
Member

@whatyouhide

I've invited you as an admin to the repository. Feel free to move it :)

I've created some benchmarks: elixir-gettext/expo#21
Please keep in mind, that I've spent no time at all so far improving performance.

The error messages seem fine, but not awesome to me so far: https://github.com/jshmrtn/expo/blob/main/test/expo/parser/po_test.exs
I've also not really spent a lot of time on them yet, I'm sure they can be improved with a few good placed labels.

@whatyouhide
Copy link
Contributor

Okay, if I’m interpreting the benchmarks right, it seems like the Expo implementation is about 2x slower. The error messages seem fine, right, but not an improvement over the current ones at a first glance. Which brings me to the point: let's keep the yecc-based parser currently in Gettext and port it exactly to Expo before moving Gettext to use Expo, agreed?

I think the first step in extracting a dependency from a library as widely used and field-tested as Gettext is to extract the code as is so as to not introduce the chance of anything else changes. Then, improvements can be made iteratively on the extracted library.

@maennchen
Copy link
Member

@whatyouhide Ok, I'm on it.

@maennchen
Copy link
Member

maennchen commented Apr 4, 2022

@whatyouhide I've started to look at the yecc based code a bit more closely. Based on that I do not come to the same conclusion as you:

Performance

I've spent a few minutes disabling features that are not in gettext (previous message ids and obsolete translations). I also did a few small improvements.

Since doing that, the comparison of performance is a bit more level between the two: elixir-gettext/expo#21 (comment)

While the yecc based parser is still slightly faster, it is by far not twice as fast.

I don't think that performance should be the primary deciding factor since parsing normally happens at compile-time and the difference with some improvements is negligible.

If performance is really that important, then the new .mo parser is around 4 times faster than both .po parsers.

Considering that, I propose to focus on two other factors to decide which implementation should be used: Maintainability and Developer Experience (User Side)

Developer Experience (Library User)

The only thing improving developer the experience for users of this library I can see is error messages. The error messages (which can still be improved further with some tweaking) are better IMHO. One Example:

parsed string:

msgid
msgstr "foo"

Output yecc parser: syntax error before: msgstr

Output nimble based parser: expected msgid followed by strings while processing translation (included a bit of tweaking, not yet on main)

I personally prefer the nimble based approach since it gives more detail:

  • Tells what is expected
  • Tells what the context is

Maintainability

  • The nimble-based approach is more compact. (~ -100 LOC)
  • (very subjective) The yecc based code is harder to understand. Nimble uses a more approachable syntax for elixir developers.
  • with yecc, we have the yrl (unknown syntax to most), the tokenizer which tokenizes the contents manually as the parser, which parses further content like flags

Therefore, I propose to work with the nimble-based approach.

@josevalim
Copy link
Contributor

Two more factors: How big are the parser modules in disk after compiled? Is nimble_parsec faster if we use the inline: true option?

@maennchen
Copy link
Member

maennchen commented Apr 4, 2022

@josevalim

Here you can see all the changes I did for the performance tests: https://github.com/elixir-gettext/expo/compare/performance_comparisor

The inline flag was already enabled in the last-mentioned comparison. It makes a difference of around 7% for me.

Compiled sizes:

$ du -h _build/prod/lib/expo/ebin/*.beam
# ...
9.0K	_build/prod/lib/expo/ebin/Elixir.Expo.Parser.Mo.beam
185K	_build/prod/lib/expo/ebin/Elixir.Expo.Parser.Po.beam
# ...
$ du -h _build/prod/lib/gettext/ebin/*.beam
13K	_build/prod/lib/gettext/ebin/Elixir.Gettext.PO.beam
9.0K	_build/prod/lib/gettext/ebin/Elixir.Gettext.PO.Parser.beam
#...
9.0K	_build/prod/lib/gettext/ebin/Elixir.Gettext.PO.Tokenizer.beam
#...
21K _build/prod/lib/gettext/ebin/gettext_po_parser.beam
#...

@whatyouhide
Copy link
Contributor

@maennchen the thing I am most worried about is this: Gettext's yecc parser has worked reliably in production for 6-7 years now and is thoroughly field tested. None of the reasons mentioned above are enough for me to ignore that honestly. I’m happy to eventually move to a new parser, but I would like to do that gradually (start with just MOs?). The focus of this issue (and this effort IMO) is to have a separate package for PO/MO handling so that users that don't need the whole of Gettext can depend on that. Which parser to use is not really part of that conversation, right? Which is why I would argue to just rip code straight out of Gettext as much as possible and solve this issue first.

@maennchen
Copy link
Member

@whatyouhide If we're using Yecc, I will need your help finishing this up. I can't get the full feature set to work: elixir-gettext/expo#25

@whatyouhide
Copy link
Contributor

@maennchen is Expo adding any features that don't exist today in Gettext?

@maennchen
Copy link
Member

@whatyouhide Yes, the library's goal is to support the complete PO format. That includes the following things:

Previous message id / plural

#| msgid "old message"
#| msgid_plural "old messages\n"
#| "with newline"
msgid "new message
# ...

Obsolete translations

# comment
#. extracted comment
#, flag
#~ msgctxt "context"
#~ msgid "msgid"
#~ msgstr "msgstr"

@maennchen
Copy link
Member

(I already did the data structures, tests and tokenizer, just can’t get yecc to comply. With the nimble based approach, that was already implemented.)

@whatyouhide
Copy link
Contributor

@maennchen I’m having trouble following what are the changes, because the PR you linked is adding yecc to Expo which means the diff is useless 😄 Any chance you could backport the changes for previous msgid and obsolete translations to Gettext first so that we know what to look for and know how to help out?

@whatyouhide
Copy link
Contributor

whatyouhide commented Apr 6, 2022

Btw, I realize that this experience can be frustrating and time consuming. However, as library authors, we have the responsibility to avoid breaking people's code by making rushed changes to library as widely used as Gettext. In libraries like these, we need to be careful about making as few changes at a time as possible, if that makes sense. I hope it's not causing too much trouble 😌 And to be clear: this work is welcome, necessary, and fantastic!! Thanks for all the collaboration and help 💟

@maennchen
Copy link
Member

@whatyouhide I split the PR in two:

Is that good enough or do I need to propose it to gettext directly? I would prefer to tackle #210 only after we integrated expo and not before.

@whatyouhide
Copy link
Contributor

@maennchen no doing it in Expo is alright don't worry. What's the issue that you're seeing?

Also, by looking at the tokenizer changes it looks like you're tokenizing these "special comments" differently. I wonder if that wouldn't be better done after tokenizing + parsing, in Gettext.PO.Parser? Or are you doing that in the parser so that you can make sure that, for examples, obsolete translations are properly formatted and parsed? Like you are trying to avoid malformed obsolete translations?

@maennchen
Copy link
Member

@whatyouhide Follow up here so that we don't spam everyone: elixir-gettext/expo#30 (comment)

@whatyouhide
Copy link
Contributor

@josevalim @maennchen is there anything left to do to close this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants