Arabic corrupted in PDFs created with DITA OT 3.7.1 and FOP 2.6 #3910

matthewarnoldstern · 2022-04-06T18:26:40Z

Expected Behavior

Generate correct Arabic text in PDFs using DITA OT 3.7.1 and FOP 2.6.

Actual Behavior

We had an issue with generating Arabic text in PDFs with DITA OT and FOP. Words are not correctly formed with the proper intermediate letter forms and connections. We were able to fix this with our custom PDF plugin for DITA OT 3.6.1 by overriding the attributes of fo:root to remove xml:lang. This introduced other issues, such as warnings about missing language attributes and no language being specified in the PDF, but it generated correct Arabic text.

We upgraded to DITA OT 3.7.1 with the provided FOP 2.6, and the Arabic text problem is happening again. The override in our plugin to remove xml:lang no longer works. The xml:lang attribute still appears in fo:root. As a workaround, we tried setting xml:lang to ar for Arabic (because using language and locale, like ar-eg, often prevented complex scripts from working with Arabic) and adding script=arab and language attributes to fo:root. These also didn't fix the problem. We also noticed that the text direction in the PDF isn't correctly set to right-to-left even though we specified it in fo:root.

As it stands now, we cannot create valid Arabic PDFs with DITA OT 3.7.1 and FOP 2.6.

Possible Solution

None currently available.

Steps to Reproduce

1. Create Arabic source files and set xml:lang of the map and each topic file to ar or ar-eg.
2. Build the PDF using our plugin for DITA OT 3.7.1 with FOP 2.6. In the dita command, set -Dclean.temp to no to preserve the temporary folder.
3. Open the PDF and notice that the Arabic words contain initial forms and are disconnected. This is not valid Arabic.
4. Open topic.fo from the temporary folder. Notice that fo:root contains xml:lang, even when we override it in the plugin. Adding script and language attributes to fo:root does not change anything.

Copy of the error message, log file or stack trace

The attached .zip file contains the generated PDF, log.txt, topic.fo, and the common-attr.xsl file we used in our custom PDF plugin to override the attributes of fo:root.

ArabicPDFIssuesDITAOT.zip

Environment

DITA-OT version: 3.7.1 with the provided FOP 2.6
Operating system and version: Windows 10 Enterprise
How did you run DITA-OT? dita command at prompt
Transformation type: Custom PDF

jelovirt · 2022-04-07T05:09:03Z

@matthewarnoldstern Are you able to attach the DITA source and make it smaller? It's enough that it has just one title and a paragraph that shows the problem.

jelovirt · 2022-04-07T05:13:14Z

@matthewarnoldstern does https://issues.apache.org/jira/browse/FOP-2996 sound like the same problem?

raducoravu · 2022-04-07T05:59:52Z

From what I remember, we had the same problem in the Oxygen Chemistry PDF processor, which is based on FOP, and there we made the same decision as the end user: to remove xml:lang from fo:root.
There is this issue: https://issues.apache.org/jira/browse/FOP-2409
We applied the same workaround for the XSL-FO based PDF publishing in the DITA OT bundled with Oxygen XML Editor, using XSLT that does something like:

<xsl:attribute-set name="__fo__root" use-attribute-sets="base-font">
    <!-- TODO: https://issues.apache.org/jira/browse/FOP-2409 -->
    <xsl:attribute name="xml:lang" select="
            if ($locale != 'ar') then
                translate($locale, '_', '-')
            else
                'dflt'"/>
</xsl:attribute-set>

So maybe as a workaround @matthewarnoldstern could focus on solving this problem:

 The override in our plugin to remove xml:lang no longer works.

I do not know how his "commons-attr.xsl" is added to the publishing, is it via a plugin or a customization folder? Maybe the plugin was not properly installed...

stefan-jung · 2022-04-07T15:10:12Z

@raducoravu, if I understand this correctly, this will probably not work if the document contains a few other languages and we cannot do font-switching, which is needed when you mix western and non-western languages.

This will be a serious bug for me in ... probably 5 days.

matthewarnoldstern · 2022-04-08T01:35:49Z

> @matthewarnoldstern Are you able to attach the DITA source and make it smaller? It's enough that it has just one title and a paragraph that shows the problem.

@jelovirt Here's one of the topic files from our source. It can give you an idea about our tagging.

c_exportservice.zip

matthewarnoldstern · 2022-04-08T01:39:37Z

> @matthewarnoldstern does https://issues.apache.org/jira/browse/FOP-2996 sound like the same problem?

@jelovirt Yes, it's the same problem. We're using FOP 2.6 provided with DITA OT 3.7.1. We use Simplified Arabic, which worked with DITA OT 3.6.1. I tried Arial Unicode MS and Scheherazade New, but those didn't work.

matthewarnoldstern · 2022-04-08T01:43:19Z

> I do not know how his "commons-attr.xsl" is added to the publishing, is it via a plugin or a customization folder? Maybe the plugin was not properly installed...

@raducoravu This is in our custom PDF plugin. It contains all of the customizations we need to generate PDFs in our desired format. It works with all other languages and keeps most of the same formatting and standard strings. The problem is that it doesn't handle complex scripts correctly or set the PDF's orientation to right-to-left.

matthewarnoldstern · 2022-04-08T01:49:48Z

> ...if the document contains a few other languages and we cannot do font-switching, which is needed when you mix western and non-western languages.

@xephon2 Font switching works fine in our document. It changes to the appropriate font and left-to-right orientation for Latin alphabet text. If you would like details, please let me know.

stefan-jung · 2022-06-15T04:33:22Z

Hi @matthewarnoldstern, please try to set the language attribute to dflt, as explained in https://issues.apache.org/jira/browse/FOP-2426. This was suggested by @jlacour31. FYI @jelovirt .

matthewarnoldstern · 2022-06-22T20:51:04Z

Hello, Stefan. That did the trick! Thank you and @jlacour31 for the suggestion. Here's how I implemented it with our custom plugin based on PDF2. I added the following attribute to our fo:root attribute set, __fo__root:

<xsl:attribute name="language">
    <xsl:choose>
        <xsl:when test="substring-before($locale, '_') = 'ar'">dflt</xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="substring-before($locale, '_')"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:attribute>

Now, the Arabic text is generated with the correct letter forms and ligatures. Thanks again for your help.
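
A note on the snippet in the comment above: substring-before($locale, '_') returns an empty string when the locale contains no underscore (for example, a plain ar), so in that case the test would not match and the language attribute would come out empty. Below is a minimal sketch of a variant that tolerates both forms. It assumes XSLT 2.0, and it assumes that $locale is supplied by PDF2 exactly as in the snippet above (it is not declared here). The stylesheet wrapper is only there to show where the attribute set would sit in a plugin's attribute file. This has not been tested against DITA OT 3.7.1 or FOP 2.6.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: $locale is provided by PDF2 and holds values such as
       'ar', 'ar_EG', or 'en_US'. It is intentionally not declared here. -->

  <xsl:attribute-set name="__fo__root">
    <xsl:attribute name="language">
      <!-- Take the first subtag of the locale, whether the separator is
           '_' or '-', or there is no region part at all. -->
      <xsl:variable name="lang"
          select="lower-case(tokenize($locale, '[_-]')[1])"/>
      <xsl:choose>
        <!-- For Arabic, emit 'dflt' as suggested in the thread. -->
        <xsl:when test="$lang = 'ar'">dflt</xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$lang"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:attribute>
  </xsl:attribute-set>

</xsl:stylesheet>
```

With this variant, locales of ar, ar_EG, and ar-EG would all yield language="dflt", while en_US would yield language="en". Whether other complex-script languages need the same treatment is not covered in this thread.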

jelovirt added the bug and plugin/pdf/fop labels on Apr 7, 2022
