&amp; in XLIFF does not get unescaped #3081

Closed
mlaggner opened this issue Oct 1, 2019 · 20 comments
Labels:
  • backlog: This is not on the Weblate roadmap for now. Can be prioritized by sponsorship.
  • enhancement: Adding or requesting a new feature.

@mlaggner
Contributor

mlaggner commented Oct 1, 2019

Describe the bug
If you have &amp; in your XLIFF files, it does not get unescaped to &.

Example:

original: Sicht "&" Eintrag kopieren als: Zieleintrag

XLIFF:

<?xml version="1.0" encoding="utf-8"?><xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
 <file datatype="plaintext" original="OBJECT" source-language="de-DE" target-language="fr-FR" date="2019-09-11T07:25:52Z" category="..">
  <body>
   <trans-unit size-unit="char" approved="no" maxwidth="60" id="A_TITLE" resname="A_TITLE">
    <source>Sicht "&amp;" Eintrag kopieren als: Zieleintrag</source>
    <target state="needs-review-translation">Copier entrée vue "&amp;" : entrée cible</target>
   </trans-unit>
  </body>
 </file>
</xliff>

In Weblate, exactly this string is displayed as:
image

To Reproduce
Steps to reproduce the behavior:

  1. Create an XLIFF export containing a source string with an escaped & (&amp;)
  2. Import it into Weblate
  3. View the string in Weblate

Expected behavior
The &amp; should be unescaped for translation (and re-escaped on commit).
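
The expected round trip can be sketched in a few lines of Python. This is only an illustration using the standard library's xml.sax.saxutils helpers, not Weblate's actual code path; the variable names are made up for the example.

```python
from xml.sax.saxutils import escape, unescape

# Text exactly as it is stored inside <source> in the XLIFF example above.
stored = 'Sicht "&amp;" Eintrag kopieren als: Zieleintrag'

# What a translator would ideally see in the editor.
shown = unescape(stored)

# What should be written back to the file on commit.
written = escape(shown)

print(shown)
print(written == stored)
```

Both helpers only touch &, < and > by default, which is enough for element content like the string above.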

Screenshots
see above

Server configuration and status

 * Weblate 3.7.1
 * Python 3.7.3
 * Django 2.2.3
 * Celery 4.3.0
 * celery-batches 0.2
 * six 1.12.0
 * social-auth-core 3.2.0
 * social-auth-app-django 3.1.0
 * django-appconf 1.0.3
 * translate-toolkit 2.4.0
 * translation-finder 1.6
 * Whoosh 2.7.4
 * defusedxml 0.6.0
 * Git 2.20.1
 * Pillow 5.4.1
 * python-dateutil 2.8.0
 * lxml 4.3.2
 * django-crispy-forms 1.7.2
 * django_compressor 2.3
 * djangorestframework 3.9.4
 * user-agents 2.0
 * jellyfish 0.7.2
 * diff-match-patch 20121119
 * pytz 2019.1
 * pyuca 1.2
 * ruamel.yaml 0.15.99
 * tesserocr 2.4.0
 * Mercurial 4.8.2
 * git-svn 2.20.1
 * Database backends: django.db.backends.postgresql
 * Cache backends: avatar:FileBasedCache, default:RedisCache
 * Email setup: django.core.mail.backends.smtp.EmailBackend: localhost
 * Celery: redis://cache:6379/1, redis://cache:6379/1, regular
 * Platform: Linux 4.4.114-94.11-default (x86_64)
WARNINGS:
?: (security.W004) You have not set a value for the SECURE_HSTS_SECONDS setting. If your entire site is served only over SSL, you may want to consider setting a value and enabling HTTP Strict Transport Security. Be sure to read the documentation first; enabling HSTS carelessly can cause serious, irreversible problems.
?: (security.W008) Your SECURE_SSL_REDIRECT setting is not set to True. Unless your site should be available over both SSL and non-SSL connections, you may want to either set this setting True or configure a load balancer or reverse-proxy server to redirect all connections to HTTPS.
?: (security.W012) SESSION_COOKIE_SECURE is not set to True. Using a secure-only session cookie makes it more difficult for network traffic sniffers to hijack user sessions.
?: (security.W018) You should not have DEBUG set to True in deployment.

INFOS:
?: (weblate.I021) Error collection is not configured, it is highly recommended for production use
        HINT: https://docs.weblate.org/en/weblate-3.7.1/admin/install.html#collecting-errors

System check identified 586 issues (1 silenced).
@mlaggner
Contributor Author

mlaggner commented Oct 1, 2019

Update: it looks like XLIFF files imported with a previous version did not have this problem; only components I've created with Weblate 3.7.1 are affected.

@nijel
Member

nijel commented Oct 1, 2019

XML markup has been handled differently for some time; see #490 and related issues.

@nijel nijel added the question This is more a question for the support than an issue. label Oct 1, 2019
@mlaggner
Contributor Author

mlaggner commented Oct 2, 2019

As far as I can tell from the issue (and PR), the XLIFF content is now handled as rich source, where no more conversion/escaping is done?

But if I remember correctly, some characters (like the ampersand) need to be escaped in an XML file.

How could we solve that?
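
A quick way to see why the ampersand has to stay escaped in the file: a bare & makes the document ill-formed. The check below is a minimal sketch using Python's standard xml.etree.ElementTree parser; is_well_formed is a helper name invented for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


escaped_ok = is_well_formed('<source>Sicht "&amp;" Eintrag</source>')
bare_ok = is_well_formed('<source>Sicht "&" Eintrag</source>')

print(escaped_ok, bare_ok)
```

So whatever the editor shows, the string written to disk still needs &amp;.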

@nijel
Member

nijel commented Oct 8, 2019

It should escape the strings when necessary (AFAIR it's when it's not valid XML).

@mlaggner
Contributor Author

But if you take the example from above and create a new test component in a Weblate 3.7.1 instance, you will see that the ampersand is not unescaped.

tbh I do not have any clue where to look for a solution

@nijel
Member

nijel commented Oct 21, 2019

With current code, it's expected to be unescaped. We can either interpret the XML and strip tags or include the tags and leave the XML as is including tags and escaped entities.
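
To make the two options concrete, here is a rough sketch of both representations of one unit. It uses the standard library's ElementTree rather than whatever Weblate does internally, and the sample string with a <g> inline tag is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical unit containing both an inline tag and an escaped ampersand.
unit = ET.fromstring('<source>Press <g id="1">Save &amp; Close</g> now</source>')

# Option 1: interpret the XML and strip the tags; entities come out unescaped.
plain = "".join(unit.itertext())

# Option 2: keep the inner XML as is, including tags and escaped entities.
# ET.tostring() on a child also serializes the text that follows it (its tail).
raw = (unit.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in unit
)

print(plain)
print(raw)
```

Option 1 is easier to read but loses the markup; option 2 keeps everything at the cost of showing &amp; to translators.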

@mlaggner
Contributor Author

Would it be possible to include a setting in the component to choose the behavior?

@nijel
Member

nijel commented Oct 22, 2019

Probably not in the settings, but a separate file format would do that.

@nijel nijel added enhancement Adding or requesting a new feature. and removed question This is more a question for the support than an issue. labels Oct 22, 2019
@nijel nijel added this to TODO in File format support via automation Oct 22, 2019
@nijel nijel added the backlog This is not on the Weblate roadmap for now. Can be prioritized by sponsorship. label Jan 6, 2021
@github-actions

github-actions bot commented Jan 6, 2021

This issue has been added to the backlog. It is not scheduled on the Weblate roadmap, but it eventually might be implemented. In case you need this feature soon, please consider helping or push it by funding the development.

@al-65

al-65 commented Jan 26, 2022

Hi there,
I also hit this, but it would already help if the check against the maxwidth attribute counted escaped characters like &amp; as just one character :)
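
What is being asked for amounts to measuring the unescaped text instead of the stored text. A minimal sketch, assuming size-unit="char" as in the example file; fits_maxwidth is a hypothetical helper, not Weblate's actual length check.

```python
from xml.sax.saxutils import unescape


def fits_maxwidth(stored_text: str, maxwidth: int) -> bool:
    """Compare the unescaped length against maxwidth, so &amp; counts once."""
    return len(unescape(stored_text)) <= maxwidth


# Target string from the XLIFF example at the top of this issue.
target = 'Copier entrée vue "&amp;" : entrée cible'

raw_length = len(target)                # counts "&amp;" as five characters
visible_length = len(unescape(target))  # counts "&" as one character

print(raw_length - visible_length)
print(fits_maxwidth(target, 60))
```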

@nijel
Member

nijel commented Jan 26, 2022

@al-65 It should be the case since 15841ba, see #6645

@al-65

al-65 commented Jan 27, 2022

Hm, my &amp; is still counted as 5 chars. I have tried to place the xml-text flag in both the header and the translation unit, like:

&

@al-65

al-65 commented Jan 27, 2022

oops xml doesn't survive inputting here :(
here is an image:
image

@nijel
Member

nijel commented Jan 27, 2022

The flag needs to be set in Weblate, but it should be set automatically for XLIFF. You might need to use loadpo --force to enforce re-reading the file by Weblate.

@nijel
Member

nijel commented Mar 2, 2022

What would be the preferred solution here?

  • Are we looking for string granularity to handle XML differently?
  • Would a file format option to escape/not escape XML be sufficient?

If the second option wins, implementing it is pretty straightforward: it's just a matter of adding another XLIFF file format to Weblate that skips the current XML handling logic.

@al-65

al-65 commented Mar 3, 2022

Hi there,
for me, option 2 would be absolutely fine. Translators would just see "&" instead of its escape sequence, and a "&" character would ideally be counted as one character (same for the other few escaped chars). The XLIFF files would still have to hold the escaped version to be valid XML.
Best regards,
Andre

@nijel
Member

nijel commented Mar 3, 2022

Okay, in that case, we can go for two file formats - XLIFF (raw XML) (current one) / XLIFF (escaped XML) (to be developed).

Looking at Transifex, they escape only parts of the strings, but that sounds too fragile to me (and that won't cover the HTML use case mentioned in #7224): https://docs.transifex.com/formats/xliff#escaping-characters

@ScionOfDesign

I just ran into this issue; not having an escaped-XML XLIFF format is horribly painful to work with for us. We do not have the option to use .csv because the format Unity exports is not compatible with the format Weblate uses.
Our use case is that we use XLIFF files exported from Unity's localization package. In Unity, strings use Rich Text to determine whether they are bold, italic, or whatnot. We also use XML tags to insert variables and character names.

This leads to a lot of confusing-looking characters in the text. Furthermore, the French translation check thinks that there should be a space in front of the ; used for escaping, like so:
image

Having this escaped XML format would be amazing. Also having the ability to edit in rich-text would be even more amazing.
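
The false positive on the ; can be reproduced with a toy rule. The regex below is a heavily simplified stand-in for a French punctuation check (it is not Weblate's implementation), and the sample string is made up; it only shows why running such a check on the escaped text trips over the entity.

```python
import re
from xml.sax.saxutils import unescape

# Hypothetical simplified rule: ; : ? ! must be preceded by some kind of space.
MISSING_SPACE = re.compile(r"[^\s][;:?!]")


def flags_punctuation(text: str) -> bool:
    """Return True if the toy rule would complain about the text."""
    return MISSING_SPACE.search(text) is not None


stored = "Sauvegarder &amp; quitter"

on_escaped = flags_punctuation(stored)              # sees the ";" of "&amp;"
on_unescaped = flags_punctuation(unescape(stored))  # sees only "&"

print(on_escaped, on_unescaped)
```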

@nijel nijel self-assigned this May 19, 2022
@nijel nijel added this to the 4.13 milestone May 19, 2022
@nijel nijel closed this as completed in 4b5e68e May 19, 2022
File format support automation moved this from TODO to Done May 19, 2022
@github-actions

Thank you for your report; the issue you have reported has just been fixed.

  • In case you see a problem with the fix, please comment on this issue.
  • In case you see a similar problem, please open a separate issue.
  • If you are happy with the outcome, don’t hesitate to support Weblate by making a donation.

@ScionOfDesign

Thank you so much for this! I was able to pull from bleeding and it works great for us!
