Gettext comments in .po files downloaded from Weblate are broken #5695

Closed
1 task
luebbe opened this issue Mar 19, 2021 · 16 comments
Labels
bug Something is broken. translate-toolkit Issues which need to be fixed in the translate-toolkit

Comments

@luebbe
Contributor

luebbe commented Mar 19, 2021

Gettext comments in .po files downloaded from Weblate are broken

I already tried

Describe the steps you tried to solve the problem yourself.

  • I've read and searched the docs and did not find the answer there.
    If you didn’t try already, try to search there what you wrote above.

To Reproduce the issue

Steps to reproduce the behavior:

  1. Upload a .po file with gettext comments to Weblate.
  2. Download the same file again via the UI or the API and compare it to the original.
  3. Some comments are broken: the comment prefix is missing. The file fails gettext syntax checks and cannot be uploaded to Weblate again.

Expected behavior

That the uploaded and downloaded files are identical apart from minor formatting differences and that the downloaded file is syntactically correct.

Screenshots

A snippet of the uploaded file:

#: FileInfos.pas
#: ThirdPartyLicenseForm.dfm
msgid "Version"
msgstr "Version"

The same snippet in the downloaded file:

#: FileInfos.pas
 ThirdPartyLicenseForm.dfm
msgid "Version"
msgstr "Version"

In the Weblate UI, the string is correctly shown as belonging to the two items in the original comments.
[screenshot]
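
A quick way to spot this kind of breakage in a downloaded file is to flag every line that fits none of the shapes a PO file normally contains. The sketch below is an illustration only: `find_broken_lines` and its regular expression are hypothetical helpers, not part of Weblate, translate-toolkit, or GNU gettext, and the check is far looser than a real gettext parser.

```python
import re

# Line shapes accepted by this simplified check (an assumption, not the
# full PO grammar): blank lines, "#" comments, a keyword followed by a
# quoted string, and bare quoted continuation strings.
_VALID_LINE = re.compile(
    r'^\s*$'
    r'|^#'
    r'|^(msgctxt|msgid|msgid_plural|msgstr(\[\d+\])?)\s+"'
    r'|^"'
)


def find_broken_lines(po_text):
    """Return (line_number, line) pairs that match none of the expected shapes."""
    broken = []
    for number, line in enumerate(po_text.splitlines(), start=1):
        if not _VALID_LINE.match(line):
            broken.append((number, line))
    return broken
```

Run against the two snippets above, the uploaded version should produce an empty list, while the downloaded version should report line 2, the location that lost its `#:` prefix.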

Exception traceback

Server configuration and status

Weblate installation: weblate.org Docker

Weblate version 4.5.1

Weblate deploy checks

Additional context

@nijel nijel added bug Something is broken. and removed bug Something is broken. labels Mar 19, 2021
@nijel
Member

nijel commented Mar 19, 2021

I can't reproduce this.

@nijel nijel added the question This is more a question for the support than an issue. label Mar 19, 2021
@github-actions

This issue looks more like a support question than an issue. We strive to answer these reasonably fast, but purchasing the support subscription is not only more responsible and faster for your business but also makes Weblate stronger. In case your question is already answered, making a donation is the right way to say thank you!

@luebbe
Contributor Author

luebbe commented Mar 19, 2021

Thanks for your reply,

The uploaded file is UTF-8 without BOM, with CRLF line breaks.
The file with the broken comments was downloaded in two ways:

  • via the API, as a .zip file together with the other translations
  • manually, as a "normal" download, in order to check whether the problem was in the API

In both cases all multi-line comments were broken and the downloaded file has mixed line endings, as in #5624 (see the right half of the screenshot below).
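
To confirm a "mixed line endings" report without a diff tool, the line breaks can be counted directly on the raw bytes. This is a minimal sketch: `count_newline_styles` and `has_mixed_newlines` are hypothetical helper names, and the file must be read in binary mode so that Python does not translate the line breaks first.

```python
import re
from collections import Counter

# CRLF is listed first so that it is counted once, not as a CR plus an LF.
_NEWLINE = re.compile(rb"\r\n|\r|\n")
_NAMES = {b"\r\n": "CRLF", b"\r": "CR", b"\n": "LF"}


def count_newline_styles(data: bytes) -> Counter:
    """Count how often each line-ending style occurs in raw file bytes."""
    return Counter(_NAMES[match.group()] for match in _NEWLINE.finditer(data))


def has_mixed_newlines(data: bytes) -> bool:
    """True if more than one line-ending style is present."""
    return len(count_newline_styles(data)) > 1
```

For a file on disk, something like `count_newline_styles(open(path, "rb").read())` gives the per-style totals.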

After your hint I tried the converted download and selected .po format.
In this file the developer comments are ok and the line breaks are all CRLF, but the comments before the first msgid, which were added by other translation tools, are missing (See left half of screenshot below).

Here is a windiff view of the two downloads:

  • left: converted
  • right: simple

[screenshot]

@nijel
Member

nijel commented Mar 19, 2021

The header comment is expected to be missing from the converted file. The original file seems to have mixed newlines (at least your diff shows different symbols on some lines). Is that something that happened inside Weblate, or was it present originally? (I guess it was present originally and is the cause of the problem we're seeing now.)

@luebbe
Contributor Author

luebbe commented Mar 19, 2021

Both files in the diff above are downloaded from Weblate. So yes, it happened inside Weblate.
The original file doesn't have mixed newlines and is in UTF-8 format without BOM.

Here's a diff of the original (left) vs. the "simple" download from Weblate (right), which is the same file as on the right side of the previous diff.

[screenshot]

@luebbe
Contributor Author

luebbe commented Mar 22, 2021

This morning I tried to investigate the issue further. It looks like we have a case of self-healing software. Today all downloaded translations, whether in .po or .islu format, have consistent line endings, and the comments in the .po files are ok.
Does weblate run any overnight maintenance tasks, which fix inconsistencies?
My next step will be to delete the component and do a "create from zip" / "download everything as .zip" again.

@luebbe
Contributor Author

luebbe commented Mar 22, 2021

As far as I can tell, the issue went away over the weekend without any changes on my part. I don't know if the docker container was restarted, in case this matters.
This leaves me with mixed feelings, because I have to expect that the problem will strike us again without knowing when and why.

@nijel
Member

nijel commented Mar 22, 2021

It might depend on changes in the PO file content, the newlines detection is not that simple here:

https://github.com/translate/translate/blob/acebf280849ddfb81ed870557da8f1d7c906a28e/translate/storage/pypo.py#L46-L75

@nijel
Member

nijel commented Mar 22, 2021

> Does weblate run any overnight maintenance tasks, which fix inconsistencies?

There is no code to deal with newlines. There are maintenance tasks to clean up the database or to fetch updates from remote repositories, but those should have no effect on this.

Do you run the server on Linux?

@luebbe
Contributor Author

luebbe commented Mar 22, 2021

We are running weblate in Docker. Don't know the details, because my admin hasn't responded yet, but I guess the short answer is: "yes it runs under Linux."

This is very interesting indeed. Years ago, we wrote a pre-processor for the .po(t) files which converted them to Unix newline style.
The reason was that Virtaal, which we used for the translations, accepted line breaks in any style, but returned unix style in msgid/msgstr and DOS style in gettext comments. Very much like the behaviour that we see in the screenshots above.

In lines 31-33 on the right hand side of the screenshots above you even see three different newline styles in just three lines of .po file.

So the files saved by Virtaal contained mixed newlines when reading in DOS style and Unix newlines when reading in Unix style. Our solution was to give Virtaal what it needed and be done with it.
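
A pre-processor of the kind described here can be very small. The following is a minimal sketch and not the tool mentioned above: `normalize_newlines` is a hypothetical name, and it assumes the file is handled as raw bytes so that the encoding is left untouched.

```python
import re

# CRLF must come first so that it is replaced as one unit.
_ANY_NEWLINE = re.compile(rb"\r\n|\r|\n")


def normalize_newlines(data: bytes, newline: bytes = b"\n") -> bytes:
    """Rewrite every CRLF, lone CR, or lone LF in data as `newline`."""
    # A function is used as the replacement so the bytes are inserted literally.
    return _ANY_NEWLINE.sub(lambda _match: newline, data)
```

Calling `normalize_newlines(data)` yields Unix-style output, and passing `b"\r\n"` as the second argument yields DOS-style output, so the same helper covers both directions.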

Looking at the author(s), the code that you pointed me to looks very much like it could be the same that was sitting at the core of Virtaal years ago. :)

But still: we are uploading consistent newlines. Either DOS or Unix, Weblate's choice. So what made Weblate break the files that were downloaded? Are they merged from the original file (style A) and translations that were made by users (style B)?

Maybe weblate needs a (per project/per component) newline style setting that it uses for downloads, no matter which style was used during upload?

VCS handle this quite well nowadays. They store files with a standard newline style on the server and the client converts to the platform specific style upon download. Since Weblate can't know the platform to which the file is downloaded, it has to be told beforehand.

Alternatively the newline style could be specified in the get request like the ?format parameter. Something like ?newline=CR

@nijel
Member

nijel commented Mar 22, 2021

The underlying library for handling the translation files is the same as in Virtaal (we both use translate-toolkit). It probably still has some issues with non-Unix newlines. AFAIK GNU gettext only parses Unix newlines, so this is not a well-tested area.

@nijel
Member

nijel commented Mar 22, 2021

I've added tests exposing this in translate/translate#4301; I will look into fixing it later.

nijel added a commit to nijel/translate that referenced this issue Mar 22, 2021
Keep the newlines during round-trip. This reduces amount of changes and
avoids producing mixed newlines files.

Fixes WeblateOrg/weblate#5695
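
The idea in the commit message, keeping the file's own newline style across a load/save cycle, can be illustrated in a few lines. This sketch is not the translate-toolkit change: `detect_newline` and `round_trip` are hypothetical names, the "most common style wins" rule is an assumption, and `edit` stands in for whatever modifies the content in between.

```python
def detect_newline(data: bytes) -> str:
    """Return the most common line ending in data, defaulting to LF."""
    crlf = data.count(b"\r\n")
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf
    counts = {"\r\n": crlf, "\r": lone_cr, "\n": lone_lf}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "\n"


def round_trip(data: bytes, edit) -> bytes:
    """Apply edit() to LF-only text, then restore the detected line ending."""
    newline = detect_newline(data)
    text = data.decode("utf-8")
    # Work on a single internal style so edit() never sees mixed newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = edit(text)
    return text.replace("\n", newline).encode("utf-8")
```

Because every line break is written back in one detected style, the output cannot contain mixed newlines, whichever style the input used.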
@github-actions

The issue you have reported is resolved now. If you don’t feel it’s right, please follow its labels to get a clue and take further steps.

  • In case you see a similar problem, please open a separate issue.
  • If you are happy with the outcome, don’t hesitate to support Weblate by making a donation.

@nijel nijel added bug Something is broken. translate-toolkit Issues which need to be fixed in the translate-toolkit and removed question This is more a question for the support than an issue. labels Mar 22, 2021
@nijel nijel self-assigned this Mar 22, 2021
@github-actions

The issue you've reported needs to be addressed in the translate-toolkit. Please file the issue there, and include links to any relevant specifications about the formats (if applicable).

@luebbe
Contributor Author

luebbe commented Mar 23, 2021

Fantastic!
If memory serves me right, I talked about this with the translate-toolkit people years ago and I also proposed/made some code changes. But I guess this was via the mailing list, since the oldest issue I can dig up on GitHub is this one: translate/virtaal#1407 (Just for Nostalgia).
Interesting to see that by switching tools the same issue(s) resurface ten years later, because the tools are relying on the same libraries.
Is it possible that #5624 suffers from the same problem, only for a different file format, and can be solved the same way?

@nijel
Member

nijel commented Mar 23, 2021

Yes, it's likely - the comments are parsed as a block, so they keep the newlines, while the rest of the file is using system newlines on serializing (as it relies on ConfigParser).
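
The mechanism described here can be reproduced with a toy model. This is a deliberately simplified sketch and not the translate-toolkit or `ConfigParser` code: `toy_serialize` is a hypothetical function that copies a comment block verbatim and writes the key/value lines with a separately chosen line separator.

```python
import os


def toy_serialize(comment_block: str, entries: dict, linesep: str = os.linesep) -> str:
    """Copy the comment block as-is and write each entry with `linesep`."""
    body = "".join(f"{key}={value}{linesep}" for key, value in entries.items())
    return comment_block + body


# Comments parsed from a CRLF file, entries written with Unix newlines:
mixed = toy_serialize("; first\r\n; second\r\n", {"Name": "Wert"}, linesep="\n")
# mixed == "; first\r\n; second\r\nName=Wert\n"
```

The result mixes CRLF (from the preserved comment block) with LF (from the serializer), which matches the pattern of a file that was consistent on upload and mixed on download.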
