DOCX emitter doesn't set language for correction #1620

Closed
doortokaos opened this issue Apr 3, 2024 · 14 comments

@doortokaos

doortokaos commented Apr 3, 2024

MS Word has features that help the user write correctly in a chosen language.
Since I use a German version of MS Word, I don't know the exact name in the English UI, but I think it might be called something like "Language for correction assistance". Here is a screenshot of the menu in my Word:
[screenshot]

The language is always set to "English (US)" no matter in which language I create the report:
[screenshot]

The report I used for the example is this (remove the .txt 😉):
korrekturhilfe.rptdesign.txt

I generated the DOCX with the German locale active in the designer and no further settings on the report or in Eclipse.
[screenshot]

The designer is the 4.15 release (all-in-one Eclipse package).

It would be nice if the emitter could set the "Sprache für die Korrekturhilfe" (the language for correction assistance) to the language in which the report was created, so that the user of the DOCX report doesn't have to set it manually every time after creating the report.

@hvbtup
Contributor

hvbtup commented Apr 3, 2024

PRs are welcome!

Specifying the language of (parts of) the document is probably described in Microsoft's specification for the DOCX format. A DOCX file is basically just a ZIP with several XML files, so you should be able to reverse-engineer this by saving the same document with two different language settings and comparing the results.

But the topic is not as simple as you might think:

First of all, I think that the preview locale is definitely not the correct source for determining the locale.
Instead, the text language must be an attribute of all texts inside the generated report itself.

Sometimes, a report contains texts in more than one language.
For example, our neighbors in Switzerland often use 3 languages in the same report.

So a single metadata entry for the language of the document will not suffice.
Instead, one would need a language attribute for all the texts inside a document, or the option to specify a "main language" for the document and an optional "per-item" language for text passages written in different languages.

This applies to the generated documents (e.g. HTML or PDF).

We would also need to specify how those languages are determined from the rptdesign file.
AFAIK we have a "locale" property for individual report items (including the report itself). The language is part of the locale specification, so I think this part is clear:

  1. Take the language from the report layout item locale, if specified
  2. Take the language from the report locale, if specified
  3. Take the language from the system property user.language, if that is explicitly specified.

Otherwise, do not assume a language - I think we should not guess.
In particular, many companies in the EU run their servers on Windows with a US English locale, but their reports should come out in German, French or whatever, so guessing the language from the OS locale settings is error-prone.
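
The resolution order listed above can be sketched roughly as follows. This is only an illustration: `LanguageResolver` and its parameters are hypothetical names, not BIRT API. The JVM always populates `user.language` from the OS, so the sketch takes that value as a parameter and leaves it to the caller to pass it only when it was set explicitly.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of BIRT. */
final class LanguageResolver {

    private LanguageResolver() {
    }

    /**
     * Resolves the text language in this order: item locale, report locale,
     * explicitly specified user.language. Returns an empty Optional when
     * nothing is set, so the caller never has to guess.
     */
    static Optional<String> resolve(Locale itemLocale, Locale reportLocale, String explicitUserLanguage) {
        if (hasLanguage(itemLocale)) {
            return Optional.of(itemLocale.getLanguage());
        }
        if (hasLanguage(reportLocale)) {
            return Optional.of(reportLocale.getLanguage());
        }
        // Assumption: the caller passes null unless -Duser.language was given explicitly.
        if (explicitUserLanguage != null && !explicitUserLanguage.trim().isEmpty()) {
            return Optional.of(explicitUserLanguage.trim());
        }
        return Optional.empty();
    }

    private static boolean hasLanguage(Locale locale) {
        return locale != null && !locale.getLanguage().isEmpty();
    }
}
```

An empty result means "write no language metadata at all", which matches the "do not guess" rule.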

I don't know if and how the locale is also part of CSS (or in BIRT speak: the style sheet).

Adding metadata about the language of texts is also a precondition for creating accessible documents, BTW.

@speckyspooky speckyspooky added the Enhancement Small change to improve the current supported functionality label Apr 3, 2024
@doortokaos
Author

doortokaos commented Apr 3, 2024

@hvbtup thanks for the extensive reply.
It all makes sense and is more complicated than it seems.
I already suspected that it couldn't be that easy, because otherwise it would have been done already, but I couldn't find any information on why it hadn't been done.

I wouldn't use the locale of the OS running BIRT either, but I think using the locale in which the report is generated is a bit better than using nothing or defaulting to en_US, since the user or report creator has already set a locale for the complete report.

Since the perfect solution with a language for each text in the report seems rather complex, using a "better" locale for the complete document would be a step in the right direction in my opinion.

In my experience and user environment most documents contain only one language, so this could be an improvement, while not being perfect.

What do you think about this?

PS I'm glad that BIRT is alive again and that the issues are seen and read. @ all keep up the good work 👍

@speckyspooky
Contributor

Yes, the tickets will be read :o)

Checking whether we can set the property for the whole document sounds good to me.
The difficulty is a technical one: the DOCX implementation is a mixture of a central library and hand-written source (from the original developers).
Therefore we cannot use the library API directly, and we need some research to figure out whether there is a central property at document level.
(The latest MS Word versions support the language at document level and on paragraphs and tables.)

The other option would be to set the language value through a DOCX-emitter-specific user property.

@hvbtup
Contributor

hvbtup commented Apr 4, 2024

The language is always set to "English (US)" no matter in which language I create the report

@doortokaos Can you find out whether the en-US locale is specified somewhere explicitly in the (extracted) DOCX file structure, or whether this is just a default which Word assumes if there is no explicit entry?

... but I think using the locale in which the report is generated is a bit better ...

I'm strictly -1 on some kind of magic to determine the language if it is not explicitly defined in the rptdesign file.

... using a "better" locale for the complete document would be a step in the right direction in my opinion.

Yes.

The other option would be to set the language value through a DOCX-emitter-specific user property.

I don't think we need to extend the data model or use a UserProperty. The locale property of items should suffice.

A good starting point in the code should be the DocxWriter.java file. It's perfectly possible to write XML fragments directly into the output. I did this in our fork to support Word "Felder" (probably called "fields" in English Word?).

@speckyspooky
Contributor

speckyspooky commented Apr 4, 2024

  • Confirmed, the usage of a user-property with explicit language-code is the better way (see my comment).
  • The language can be added at document level and at paragraph/table level, so we will focus on the document level for the first steps.
  • The language is specified with the <w:lang> tag and its attributes.
  • 2 classes identified: Document (emitter.docx) and DocWriter (emitter.wpml)

According to the DOCX definition, the value is part of the "style" assigned to the respective document/paragraph/table.

I will test the user-property option with a first draft on my side to verify it a little further.

@doortokaos
Author

@doortokaos Can you find out whether the en-US locale is specified somewhere explicitly in the (extracted) DOCX file structure, or whether this is just a default which Word assumes if there is no explicit entry?

@hvbtup here you go:
In the file word/styles.xml I find an entry
<w:lang w:val="en-US" w:eastAsia="zh-CN" w:bidi="ar-SA" />
somewhere beneath the <w:docDefaults> node:
[screenshot]
When I replace "en-US" with "de-DE", save the styles.xml in the DOCX file with an archiver and open the changed DOCX with Word, "German (Germany)" is set as the language for the document.

So it seems that BIRT sets en-US as the default for the whole document.
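
The manual check described above can also be scripted. The following is a minimal sketch that reads the w:val attribute of the first w:lang element beneath w:docDefaults in word/styles.xml. `DocxDefaultLanguage` is a hypothetical name, not BIRT code. The namespace URI is the standard WordprocessingML main namespace.

```java
import java.io.InputStream;
import java.nio.file.Path;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical inspection helper for illustration; not part of BIRT. */
final class DocxDefaultLanguage {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private DocxDefaultLanguage() {
    }

    /** Opens the DOCX as a ZIP archive and inspects word/styles.xml. */
    static Optional<String> fromDocx(Path docx) {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/styles.xml");
            if (entry == null) {
                return Optional.empty();
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return fromStylesXml(in);
            }
        } catch (java.io.IOException e) {
            throw new java.io.UncheckedIOException(e);
        }
    }

    /** Returns w:val of the first w:lang found beneath w:docDefaults, if any. */
    static Optional<String> fromStylesXml(InputStream stylesXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(stylesXml);
            NodeList defaults = doc.getElementsByTagNameNS(W_NS, "docDefaults");
            if (defaults.getLength() == 0) {
                return Optional.empty();
            }
            NodeList langs = ((Element) defaults.item(0)).getElementsByTagNameNS(W_NS, "lang");
            if (langs.getLength() == 0) {
                return Optional.empty();
            }
            String val = ((Element) langs.item(0)).getAttributeNS(W_NS, "val");
            return val.isEmpty() ? Optional.empty() : Optional.of(val);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse styles.xml", e);
        }
    }
}
```

Calling `DocxDefaultLanguage.fromDocx(Path.of("korrekturhilfe.docx"))` on the attached file should report the same value that was found by hand.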

The unaltered file created by BIRT for reference
korrekturhilfe.docx

@doortokaos
Author

doortokaos commented Apr 5, 2024

  • Confirmed, the usage of a user-property with explicit language-code is the better way (see my comment).

@speckyspooky I don't quite get why you want to use a user-property.

Correct me if I'm wrong, but as far as I know there is a locale in the report context that is used to determine which translation is loaded when you have assigned a localization text key to a label and registered resource files with the translated texts for that key.
It is also used to define the formats on elements when the locale is set to "Auto".
[screenshot]

Why can't we use this locale to set a default for the whole document?

@speckyspooky
Contributor

@doortokaos
Confirmed; as described in my comment above, my test uses the style element.

I agree with @hvbtup that using the "Locale" is not a good choice in every case,
because we would change the default behavior, and that behavior would not fit every case.

My idea would be to test the following:

  1. I won't change the default behavior; the default will remain "en-US".
  2. If my testing succeeds, we could have two user-properties:
    2.A) a user-property to configure the language code explicitly
    2.B) a user-property to activate the use of the "Locale"
    2.C) if both are active, 2.A) wins because it is the explicit definition
  3. Validation will be implemented: if the language code is invalid, fall back to "en-US" again.
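
The precedence proposed in this comment could look roughly like the sketch below. It is an illustration only: the class and parameter names are hypothetical, and the validity check (language must be a known ISO 639 two-letter code) is an assumption, since the comment does not say how validation would work.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical sketch of the proposed user-property precedence; not BIRT code. */
final class ProposedLanguageSelection {

    static final String DEFAULT_TAG = "en-US";

    private ProposedLanguageSelection() {
    }

    /**
     * @param explicitCode    value of the hypothetical "language code" user-property (2.A), may be null
     * @param useReportLocale value of the hypothetical "use Locale" user-property (2.B)
     * @param reportLocale    the report locale, may be null
     */
    static String select(String explicitCode, boolean useReportLocale, Locale reportLocale) {
        // 2.C: the explicit code wins over the locale switch.
        if (explicitCode != null && !explicitCode.trim().isEmpty()) {
            return validOrDefault(explicitCode.trim());
        }
        if (useReportLocale && reportLocale != null) {
            return validOrDefault(reportLocale.toLanguageTag());
        }
        // 1: without any user-property the default stays unchanged.
        return DEFAULT_TAG;
    }

    /** 3: falls back to the default when the language part is not a known ISO 639 code. */
    static String validOrDefault(String tag) {
        Locale parsed = Locale.forLanguageTag(tag);
        boolean known = Arrays.asList(Locale.getISOLanguages()).contains(parsed.getLanguage());
        return known ? parsed.toLanguageTag() : DEFAULT_TAG;
    }
}
```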

@hvbtup
Contributor

hvbtup commented Apr 5, 2024

@speckyspooky

You misunderstood me. I'm +1 for using the locale property as defined inside the report.
[screenshot]

[screenshot]

I'm -1 on guessing the locale from the environment.

And I think that the existing default "en-US" is a minor bug (not everyone lives in the USA), so for me it's reasonable to change the default behavior like this:

If the report property locale is explicitly set as shown above, then extract the language from there and write it into the DOCX file instead of "en-US". Otherwise, don't write the w:lang value into the DOCX.
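
As a rough illustration of this rule, the sketch below builds the w:lang fragment only when a report locale with a language is present and returns an empty string otherwise. `LangFragment` is a hypothetical name, not the emitter's actual code. Writing the full language tag (for example "de-DE") rather than the bare language is an assumption, and the w:eastAsia and w:bidi attributes seen in the generated file are left out.

```java
import java.util.Locale;

/** Hypothetical sketch; the real emitter code may look different. */
final class LangFragment {

    private LangFragment() {
    }

    /**
     * Returns the w:lang element for an explicitly set report locale,
     * or an empty string so that nothing is written into the DOCX.
     */
    static String forReportLocale(Locale reportLocale) {
        if (reportLocale == null || reportLocale.getLanguage().isEmpty()) {
            return "";
        }
        // toLanguageTag() yields only letters, digits and hyphens, so no XML escaping is needed.
        return "<w:lang w:val=\"" + reportLocale.toLanguageTag() + "\"/>";
    }
}
```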

@speckyspooky
Contributor

OK, understood; that was my fault.
Yes, we can use the "Locale" directly, without user-properties.

@doortokaos
Author

Thanks for your patience with me. I'm new to the whole GitHub-thing and trying my best.

@speckyspooky After clarifying your idea, I like it

@speckyspooky
Contributor

I added PR #1627.

The enhancement only uses the "report locale", without user-properties.
If the locale is empty or invalid, the fallback is the language "en", so the behavior stays the same as it is currently.
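
The mapping described here, from a report locale string such as "fr_FR" or "it" to a Word language value with "en" as the fallback, could be sketched as follows. This is not the code from the PR: the class name is hypothetical, and treating any language that is not a known ISO 639 two-letter code as invalid is an assumption.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical sketch of the locale-to-language mapping; not the PR's code. */
final class ReportLocaleToWordLanguage {

    static final String FALLBACK = "en";

    private ReportLocaleToWordLanguage() {
    }

    /** Converts e.g. "fr_FR" to "fr-FR"; empty or invalid input yields the fallback. */
    static String toLanguageTag(String reportLocale) {
        if (reportLocale == null) {
            return FALLBACK;
        }
        // Java-style locale names use '_', language tags use '-'.
        Locale parsed = Locale.forLanguageTag(reportLocale.trim().replace('_', '-'));
        String language = parsed.getLanguage();
        if (language.isEmpty() || !Arrays.asList(Locale.getISOLanguages()).contains(language)) {
            return FALLBACK;
        }
        return parsed.toLanguageTag();
    }
}
```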

In my test cases I used "MS Word based on Office 365".

Example 01: "fr_FR" value

spell-check-fr_FR

Example 02: "it" value

spell-check-it

Demo report

docx_language.zip

@speckyspooky speckyspooky added this to the 4.16 milestone Apr 5, 2024
@speckyspooky speckyspooky self-assigned this Apr 7, 2024
@hvbtup
Contributor

hvbtup commented Apr 8, 2024

Thomas, I really like your example reports!

speckyspooky added a commit that referenced this issue Apr 8, 2024
…locale (#1620) (#1627)

Enhancement to add the document language to DOCX based on the report locale  (#1620)
@speckyspooky
Contributor

The enhancement has been merged into master with PR #1627.
