Chm TOC shows wrong characters for many languages #1271

Closed · lade opened this issue May 8, 2012 · 12 comments
Labels: bug, i18n (internationalization / localization), plugin/htmlhelp (HTML Help (CHM) plug-in)
Milestone: 3.0.1

lade commented May 8, 2012

As one example, the letter Ά displays incorrectly in the CHM TOC for Greek. By changing the HTML output encoding to 'windows-1253' instead of 'iso-8859-7' in codepages.xml, I got it to display correctly.

The HTML Help Workshop compiler uses different encodings from the ones DITA-OT currently uses. The ones that work correctly are:

  • windows-1250 for East European languages (instead of iso-8859-2)
  • windows-1252 for Western languages (instead of iso-8859-1)
  • windows-1253 for Greek (instead of iso-8859-7)
  • windows-1254 for Turkish (instead of iso-8859-9)
  • windows-1257 for Estonian (instead of iso-8859-1)

In some of these cases the differences affect only a few special characters, and windows-1252 is a superset of iso-8859-1 (apart from the control-character range 80-9F). In other cases the differences are more significant, and it would be nice if the correct codepage were used.
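
To illustrate the mismatch: Ά occupies different byte values in iso-8859-7 and windows-1253, so bytes written in one encoding but compiled as the other come out as a different character. A standalone sketch (not DITA-OT code; it only assumes the JDK's extended charsets are available) prints both values:

    import java.nio.charset.Charset;

    // Quick check of how each Greek codepage encodes the same letter.
    public class GreekCodepageCheck {
        public static void main(String[] args) {
            String alphaTonos = "Ά"; // U+0386
            for (String name : new String[] {"ISO-8859-7", "windows-1253"}) {
                byte[] bytes = alphaTonos.getBytes(Charset.forName(name));
                System.out.printf("%s encodes Ά as 0x%02X%n", name, bytes[0] & 0xFF);
            }
            // Expected: ISO-8859-7 encodes Ά as 0xB6, windows-1253 as 0xA2.
        }
    }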

jelovirt (Member) commented May 8, 2012

Are #1269 and #1270 more or less duplicates of this bug? Preferably create a single issue for a single bug.

lade (Author) commented May 8, 2012

I already reported this problem for East European languages (#1269), but noticed that wrong encodings are used for other languages as well.

jelovirt (Member) commented May 8, 2012

Ok, I will close #1269 as a duplicate of this issue.

lade (Author) commented May 8, 2012

Yep, #1269 is a duplicate, but #1270 is not, as far as I can tell.

lade (Author) commented Sep 12, 2012

Not fixed in DITA-OT 1.6.2; for example, Czech still shows 'š' as '?' and 'ž' as 'ł' in the CHM TOC.

eerohele commented

I found out yesterday that a customer has been dealing with this problem for a long time by manually copy-pasting titles from the DITA XML files into the .hhc file after generating output with DITA-OT and then recompiling the CHM file.

From what I can tell so far, @lade's suggestion of using Windows codepages seems to work fine. However, it is not quite sufficient on its own: as @lade notes in #1268, to make š and Š work for Czech, the corresponding entities (352=Š and 353=š) need to be removed from entities.properties.

That in turn becomes a problem when we want to use a special character like μ in a CHM help. If we want to use it in, say, a Polish help (pl-pl), there needs to be an entry for it in entities.properties:

956=μ

However, if we add that entry and then create a Greek (el-gr) help, that (Greek) character will not be displayed correctly in the CHM TOC of the Greek help. If we remove that entry from entities.properties once again, it works OK. So the whole thing's a bit of a catch-22 and I'm not sure there's a way around it short of somehow being able to parametrize whether entities.properties should be used for a particular transformation or not.

By the way, this also means that 7fbd675 effectively breaks Greek-language CHM helps completely because all Greek characters in the TOC will be mangled (albeit only when @xml:lang is el-gr).

EDIT: Or maybe entities.properties could be language-specific and customizable by plugins?
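
A minimal sketch of that idea, assuming hypothetical per-language resource names such as entities_el-gr.properties (nothing like this exists in DITA-OT today): a Greek table could simply omit the Greek letters, while other languages fall back to the shared file.

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Properties;

    // Hypothetical loader, illustrative only: prefer a language-specific table
    // so a Greek (el-gr) build can omit entries like 956=μ that a Polish
    // (pl-pl) build still needs, falling back to the shared entities.properties.
    final class EntityTableLoader {
        static Properties load(final String lang) throws IOException {
            InputStream in = EntityTableLoader.class
                    .getResourceAsStream("entities_" + lang + ".properties");
            if (in == null) {
                in = EntityTableLoader.class.getResourceAsStream("entities.properties");
            }
            final Properties entities = new Properties();
            if (in != null) {
                try (InputStream stream = in) {
                    entities.load(stream);
                }
            }
            return entities;
        }
    }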

robander added the i18n (internationalization / localization) label May 12, 2016
robander (Member) commented

This has just come up as an urgent issue for one of my customers.

In the past, we had a hacked version of DITA-OT that converted some characters to entities, but not all, followed by a codepage conversion. What made that work was that the entity conversion was language-aware: for characters like Ý that exist in iso-8859-2, we do not generate an entity (the character is kept as Ý) when the target codepage is iso-8859-2. The convert-lang utility in DITA-OT always does the conversion, so common letters in iso-8859-2 languages are corrupted. Similarly, I've always ignored entities for Greek characters when the input file is Greek.
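
One way to express that codepage-aware rule, as a sketch rather than the actual convert-lang code: only emit a numeric reference when the target CHM codepage cannot represent the character.

    import java.nio.charset.Charset;

    // Sketch of the codepage-aware decision described above; not the DITA-OT
    // convert-lang implementation.
    final class EntityDecision {
        static String toEntityIfNeeded(final char c, final Charset target) {
            // Keep the literal character when the codepage can encode it;
            // otherwise fall back to a numeric character reference.
            return target.newEncoder().canEncode(c)
                    ? String.valueOf(c)
                    : "&#" + (int) c + ";";
        }

        public static void main(String[] args) {
            Charset cp1250 = Charset.forName("windows-1250");
            System.out.println(toEntityIfNeeded('Ý', cp1250)); // Ý is kept as-is
            System.out.println(toEntityIfNeeded('Ω', cp1250)); // becomes &#937;
        }
    }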

If nothing else, could we just add a property that allows skipping the convert-lang task, so that when all characters are common/expected characters in the language's CHM charset, everything will work? We've got similar properties on a lot of targets, but none is available here. @jelovirt I don't know if adding such a property qualifies as a bug fix for a patch, as it doesn't quite fix a bug but would allow us to work around one...

robander (Member) commented Nov 16, 2017

FWIW, our earlier process (run outside of DITA-OT) was actually based on the target codepage rather than the language. For example, if the target codepage is windows-1250 (as with Czech), none of the common Czech characters are converted. In this case, if I'm reading right, we convert the HHP file to windows-1250 but the HHC contents file, HHK index, and HTML files to the HTML codepage. I'm not sure whether that's causing additional trouble -- it makes sense that HTML would be converted to the HTML codepage (with the charset declaration properly updated), but I would think that the CHM compiler on Windows would want the other project files (not just the HHP) to use the Windows codepage. 😕
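
For reference, rewriting a project file into the Windows codepage is a small operation; a sketch (file name and charsets are placeholders, not what DITA-OT actually writes) could look like this:

    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch only: re-encode a file from one codepage to another, e.g. so the
    // HHC/HHK use the same Windows codepage as the HHP.
    public final class TranscodeFile {
        static void transcode(final Path file, final Charset from, final Charset to)
                throws IOException {
            final String content = new String(Files.readAllBytes(file), from);
            Files.write(file, content.getBytes(to));
        }

        public static void main(String[] args) throws IOException {
            transcode(Paths.get("help.hhc"),
                    Charset.forName("ISO-8859-2"),
                    Charset.forName("windows-1250"));
        }
    }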

robander (Member) commented

Hmm, as a test I've just changed this line that converts chars:

                if (entityMap.containsKey(key)) {

To this:

                if (entityMap.containsKey(key) &&
                        !entityExceptionChar(charset,charCode)) {

With the following function (for now just testing Y-acute, Ý and ý):

    // Returns true for characters that exist in the target codepage and so
    // should be kept as literal characters rather than converted to entities.
    private boolean entityExceptionChar(final String charset, final int charCode) {
        // 221 = Ý, 253 = ý; both are present in ISO-8859-2 / windows-1250
        return charset.equals(CODEPAGE_ISO_8859_2) &&
                (charCode == 221 || charCode == 253);
    }

I appear to get the characters generated and converted properly. I've got a full list of characters that we've avoided converting in this way for Latin-1, Latin-2, and Greek -- if this approach looks good, I could turn it into a fix. @jelovirt does that look like a valid approach?

robander (Member) commented Nov 17, 2017

I'm sure there's a more efficient way of looking these values up (a Set-based sketch follows the lists below), but so I don't lose them, here's the list of characters that we've avoided converting to entities for both 1252 / ISO-8859-1 and 1250 / ISO-8859-2:

if (charCode == 193 || charCode == 225 ||    //A-acute
    charCode == 194 || charCode == 226 ||    //A-circumflex
    charCode == 196 || charCode == 228 ||    //A-umlaut
    charCode == 199 || charCode == 231 ||    //C-cedilla
    charCode == 201 || charCode == 233 ||    //E-acute
    charCode == 203 || charCode == 235 ||    //E-umlaut
    charCode == 205 || charCode == 237 ||    //I-acute
    charCode == 206 || charCode == 238 ||    //I-circumflex
    charCode == 212 || charCode == 244 ||    //O-circumflex
    charCode == 214 || charCode == 246 ||    //O-umlaut
    charCode == 211 || charCode == 243 ||    //O-acute
    charCode == 218 || charCode == 250 ||    //U-acute
    charCode == 220 || charCode == 252 ||    //U-umlaut
    charCode == 221 || charCode == 253 ||    //Y-acute
    charCode == 223 || charCode == 215) {    //szlig, times

Here are the characters we avoid converting to entities just for 1252 / ISO-8859-1:

if (charCode == 192 || charCode == 224 ||    //A-grave
    charCode == 195 || charCode == 227 ||    //A-tilde
    charCode == 197 || charCode == 229 ||    //A-ring
    charCode == 198 || charCode == 230 ||    //AElig
    charCode == 200 || charCode == 232 ||    //E-grave
    charCode == 202 || charCode == 234 ||    //E-circumflex
    charCode == 204 || charCode == 236 ||    //I-grave
    charCode == 207 || charCode == 239 ||    //I-uml
    charCode == 208 || charCode == 240 ||    //ETH
    charCode == 209 || charCode == 241 ||    //N-tilde
    charCode == 210 || charCode == 242 ||    //O-grave
    charCode == 213 || charCode == 245 ||    //O-tilde
    charCode == 216 || charCode == 248 ||    //O-slash
    charCode == 217 || charCode == 249 ||    //U-grave
    charCode == 219 || charCode == 251 ||    //U-circumflex
    charCode == 222 || charCode == 254 ||    //Thorn
    charCode == 255) {                       //y-umlaut

In addition, for Greek, we avoid converting the full Greek alphabet.

Finally, for Korean, we do convert \ to ₩.
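
A Set-based form of the same lookup, sketched here with made-up names (the next comment describes the actual switch to sets in the fix):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative sketch: characters shared by windows-1252/ISO-8859-1 and
    // windows-1250/ISO-8859-2 that should stay literal rather than become
    // entities. Per the next comment, Š (352) and š (353) also need to be
    // excepted, in a separate Latin-2-only set.
    final class EntityExceptions {
        private static final Set<Integer> COMMON_LATIN_EXCEPTIONS = new HashSet<>(Arrays.asList(
                193, 225, 194, 226, 196, 228, 199, 231, 201, 233, 203, 235,
                205, 237, 206, 238, 211, 243, 212, 244, 214, 246, 218, 250,
                220, 252, 221, 253, 223, 215));

        static boolean isCommonLatinException(final int charCode) {
            return COMMON_LATIN_EXCEPTIONS.contains(charCode);
        }
    }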

robander pushed a commit to robander/dita-ot that referenced this issue Nov 28, 2017
robander (Member) commented Nov 28, 2017

Have a suggested fix linked above. When converting the ugly test in my previous comment to use sets, I noticed just one letter was missing from the Latin-2 set: Š/š (characters 352 and 353), which were being converted to &Scaron; and &scaron;.

The fix also switches the HHC and HHK files to use the Windows codepage, just like the HHP does today. Without this change, characters that use different codepoints in ISO-8859-2 versus Windows-1250 are still corrupted (as in the originally submitted defect report here).

With the fix above, using my Windows 7 system + DITA-OT 3.0 + the fix in #2851, I'm able to generate English, Czech, and Greek CHM files that compile on my system and display the correct TOC + Index.

  • My English sample uses characters that exist in the Latin-1 codepages (like Á) that currently generate symbols (&Aacute;) that break the TOC/index. These are no longer converted, so the TOC works. Characters that do not exist in Latin-1 (like Ω) are still converted to entities that do not display, but this is a limitation of the format.
  • My Czech sample includes the full Czech character set: Aa Áá Bb Cc Čč Dd Ďď Ee Éé Ěě Ff Gg Hh Ii Íí Jj Kk Ll Mm Nn Ňň Oo Óó Pp Qq Rr Řř Ss Šš Tt Ťť Uu Úú Ůů Vv Ww Xx Yy Ýý Zz Žž. Many of these are corrupted today; with the fix, all of them display properly in the TOC and Index.
  • My Greek sample does the same with the alphabet ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ αβγδεζηθικλμνξοπρςστυφχψω. These characters are still converted to entities for English (which works in the content but fails in the TOC), but are not converted for Greek (the codepage conversion to windows-1253 handles them all, and they all work in the TOC).

robander pushed a commit to robander/dita-ot that referenced this issue Nov 28, 2017
jelovirt added a commit that referenced this issue Nov 28, 2017
robander added this to the 3.0.1 milestone Nov 28, 2017
robander (Member) commented

Fixed with #2852
