Arabic corrupted in PDFs created with DITA OT 3.7.1 and FOP 2.6 #3910

Open
matthewarnoldstern opened this issue Apr 6, 2022 · 10 comments
Labels: bug, plugin/pdf/fop (Issue related to FOP based processing with PDF)

Comments

@matthewarnoldstern

Expected Behavior

Generate correct Arabic text in PDFs using DITA OT 3.7.1 and FOP 2.6.

Actual Behavior

We had an issue with generating Arabic text in PDFs with DITA OT and FOP: words are not formed with the proper intermediate letter forms and connections. We fixed this in our custom PDF plugin for DITA OT 3.6.1 by overriding the attributes of fo:root to remove xml:lang. This introduced other issues, such as warnings about missing language attributes and no language being specified in the PDF, but it generated correct Arabic text.

We upgraded to DITA OT 3.7.1 with the provided FOP 2.6, and the Arabic text problem is happening again. The override in our plugin to remove xml:lang no longer works. The xml:lang attribute still appears in fo:root. As a workaround, we tried setting xml:lang to ar for Arabic (because a language-and-locale value such as ar-eg often prevented complex scripts from working with Arabic) and adding script=arab and language attributes to fo:root. Neither fixed the problem. We also noticed that the text direction in the PDF isn't set to right-to-left even though we specified it in fo:root.

As it stands now, we cannot create valid Arabic PDFs with DITA OT 3.7.1 and FOP 2.6.

Possible Solution

None currently available.

Steps to Reproduce

  1. Create Arabic source files and set xml:lang of the map and each topic file to ar or ar-eg.
  2. Build the PDF using our plugin for DITA OT 3.7.1 with FOP 2.6. In the dita command, set -Dclean.temp to no to preserve the temporary folder.
  3. Open the PDF and notice that the Arabic words contain initial forms and are disconnected. This is not valid Arabic.
  4. Open topic.fo from the temporary folder. Notice that fo:root contains xml:lang, even when we override it in the plugin. Adding script and language attributes to fo:root makes no difference.
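
To check whether the shaping problem comes from FOP itself rather than from the plugin, a hand-written FO file can be rendered with FOP directly. The following is a minimal sketch and is not taken from the attached topic.fo. The page geometry and sample text are placeholders, and the font family (one of the fonts mentioned later in this thread) is assumed to be available to FOP.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal XSL-FO test file (illustrative only).
     Assumption: the "Simplified Arabic" font can be found by FOP. -->
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format"
         xml:lang="ar"
         writing-mode="rl"
         font-family="Simplified Arabic">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="page"
                           page-height="297mm" page-width="210mm"
                           margin="20mm">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="page">
    <fo:flow flow-name="xsl-region-body">
      <!-- Placeholder text ("Hello, world"); letters should join if shaping works. -->
      <fo:block>مرحبا بالعالم</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
```

Saved under a name such as minimal.fo, this could be rendered with `fop -fo minimal.fo -pdf minimal.pdf`. Rendering it again after changing or removing xml:lang on fo:root would show whether that attribute alone affects letter joining.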

Copy of the error message, log file or stack trace

The attached .zip file contains the generated PDF, log.txt, topic.fo, and the common-attr.xsl file we used in our custom PDF plugin to override the attributes of fo:root.

ArabicPDFIssuesDITAOT.zip

Environment

  • DITA-OT version: 3.7.1 with the provided FOP 2.6
  • Operating system and version: Windows 10 Enterprise
  • How did you run DITA-OT? dita command at prompt
  • Transformation type: Custom PDF
@jelovirt added the bug and plugin/pdf/fop (Issue related to FOP based processing with PDF) labels Apr 7, 2022
@jelovirt
Member

jelovirt commented Apr 7, 2022

@matthewarnoldstern Are you able to attach the DITA source and make it smaller? It's enough that it has just one title and a paragraph that shows the problem.
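
A reduced test case of the kind requested here could look like the sketch below. It is illustrative only: the topic id and the sample text are placeholders, not content from the reporter's source.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<!-- Hypothetical minimal topic: one title and one paragraph in Arabic. -->
<topic id="arabic_sample" xml:lang="ar">
  <title>مرحبا</title>
  <body>
    <p>مرحبا بالعالم</p>
  </body>
</topic>
```

Building this single topic to PDF with and without the custom plugin would help separate a plugin problem from a DITA OT or FOP problem.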

@jelovirt
Member

jelovirt commented Apr 7, 2022

@matthewarnoldstern does https://issues.apache.org/jira/browse/FOP-2996 sound like the same problem?

@raducoravu
Member

raducoravu commented Apr 7, 2022

From what I remember, we had the same problem in the Oxygen Chemistry PDF processor based on FOP, and there we made the same decision as the end user: to remove the xml:lang from the fo:root.
There is this issue: https://issues.apache.org/jira/browse/FOP-2409
We made the same workaround for the XSL-FO based PDF publishing in the DITA OT bundled with Oxygen XML Editor, with an XSLT which does something like:

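<xsl:attribute-set name="__fo__root" use-attribute-sets="base-font">
    <!-- TODO: https://issues.apache.org/jira/browse/FOP-2409 -->
    <!-- For the 'ar' locale, emit 'dflt' instead of the real language code;
         for any other locale, emit the locale with '_' replaced by '-'. -->
    <xsl:attribute name="xml:lang" select="
            if ($locale != 'ar') then
                translate($locale, '_', '-')
            else
                'dflt'"/>
</xsl:attribute-set>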

So maybe as a workaround @matthewarnoldstern could focus on solving this problem:

 The override in our plugin to remove xml:lang no longer works. 

I do not know how his "commons-attr.xsl" is added to the publishing, is it via a plugin or a customization folder? Maybe the plugin was not properly installed...
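
For reference, one way a plugin can contribute an XSLT override to the PDF transformation is through the dita.xsl.xslfo extension point of org.dita.pdf2. The sketch below is an assumption about how such a plugin could be wired, not a description of the reporter's plugin. The plugin id and the stylesheet path are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- plugin.xml (illustrative). The id and file path are hypothetical names. -->
<plugin id="com.example.pdf.arabic">
  <!-- Depend on the base PDF plugin so its stylesheets are available. -->
  <require plugin="org.dita.pdf2"/>
  <!-- Import an override stylesheet into the PDF2 XSLT shell. -->
  <feature extension="dita.xsl.xslfo" file="xsl/fo/root-attr-override.xsl"/>
</plugin>
```

After adding or changing a plugin.xml, the plugin has to be integrated again (for example with `dita install`) before the override is picked up, which is one way an override can silently stop applying after an upgrade.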

@stefan-jung
Contributor

@raducoravu, if I understand this correctly, this will probably not work if the document contains a few other languages and we cannot do font-switching, which is needed when you mix western and non-western languages.

This will be a serious bug for me in ... probably 5 days.

@matthewarnoldstern
Author

matthewarnoldstern commented Apr 8, 2022

@matthewarnoldstern Are you able to attach the DITA source and make it smaller? It's enough that it has just one title and a paragraph that shows the problem.

@jelovirt Here's one of the topic files from our source. It can give you an idea about our tagging.

c_exportservice.zip

@matthewarnoldstern
Author

@matthewarnoldstern does https://issues.apache.org/jira/browse/FOP-2996 sound like the same problem?

@jelovirt Yes, it's the same problem. We're using FOP 2.6 provided with DITA OT 3.7.1. We use Simplified Arabic, which worked with DITA OT 3.6.1. I tried Arial Unicode MS and Scheherazade New, but those didn't work.

@matthewarnoldstern
Author

I do not know how his "commons-attr.xsl" is added to the publishing, is it via a plugin or a customization folder? Maybe the plugin was not properly installed...

@raducoravu This is in our custom PDF plugin. It contains all of the customizations we need to generate PDFs in our desired format. It works with all other languages and keeps most of the same formatting and standard strings. The problem is that it doesn't handle complex scripts correctly and doesn't set the PDF's orientation to right-to-left.

@matthewarnoldstern
Author

...if the document contains a few other languages and we cannot do font-switching, which is needed when you mix western and non-western languages.

@xephon2 Font switching works fine in our document. It changes to the appropriate font and left-to-right orientation for Latin alphabet text. If you would like details, please let me know.

@stefan-jung
Contributor

stefan-jung commented Jun 15, 2022

Hi @matthewarnoldstern, please try to set the language attribute to dflt, as explained in https://issues.apache.org/jira/browse/FOP-2426. This was suggested by @jlacour31. FYI @jelovirt .
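
A sketch of what this suggestion could look like in a PDF2 override stylesheet follows. It extends the exact-match test on 'ar' shown earlier in this thread so that region-qualified locales such as ar_EG or ar-eg are also caught. That extension is an assumption, as is the expectation that $locale is the global PDF2 variable holding the document locale as a string. Whether the override takes effect in DITA OT 3.7.1 is the open question of this issue.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: $locale is declared globally by the importing PDF2
       stylesheets (for example 'ar', 'ar_EG', or 'en_US'). -->
  <xsl:attribute-set name="__fo__root">
    <!-- Take the primary language subtag: normalize '_' to '-', append '-',
         and keep everything before the first '-'. If it is 'ar', emit
         'dflt'; otherwise emit the normalized locale. -->
    <xsl:attribute name="xml:lang" select="
        if (lower-case(substring-before(
              concat(translate($locale, '_', '-'), '-'), '-')) = 'ar')
        then 'dflt'
        else translate($locale, '_', '-')"/>
  </xsl:attribute-set>

</xsl:stylesheet>
```

Because attribute sets with the same name are merged, only xml:lang is redefined here and the other attributes of the default __fo__root set are left as they are.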

@matthewarnoldstern
Author

matthewarnoldstern commented Jun 22, 2022 via email

4 participants