Chm TOC shows wrong characters for many languages #1271

Closed · lade opened this issue May 8, 2012 · 12 comments
Labels: bug, i18n (internationalization / localization), plugin/htmlhelp (HTML Help (CHM) plug-in)
Milestone: 3.0.1

lade commented May 8, 2012

As one example, the letter Ά displays incorrectly in the CHM TOC for Greek. By changing the HTML output encoding to 'windows-1253' instead of 'iso-8859-7' in codepages.xml, I got it to display correctly.

The HTML Help Workshop compiler uses different encodings from the ones DITA-OT currently uses. The ones that work correctly are:

  • windows-1250 for East European languages (instead of iso-8859-2)
  • windows-1252 for Western languages (instead of iso-8859-1)
  • windows-1253 for Greek (instead of iso-8859-7)
  • windows-1254 for Turkish (instead of iso-8859-9)
  • windows-1257 for Estonian (instead of iso-8859-1)

In some of these cases the differences affect only a few special characters, and windows-1252 is a superset of iso-8859-1 (apart from the control-character range 80-9F). In other cases the differences are more significant, and it would be nice if the correct codepage were used.
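
To illustrate the mismatch: Ά occupies different byte values in iso-8859-7 and windows-1253, so bytes written in one encoding but compiled as the other come out as a different character. A standalone sketch (not DITA-OT code; it only assumes the JDK's extended charsets are available) prints both values:

    import java.nio.charset.Charset;

    // Quick check of how each Greek codepage encodes the same letter.
    public class GreekCodepageCheck {
        public static void main(String[] args) {
            String alphaTonos = "Ά"; // U+0386
            for (String name : new String[] {"ISO-8859-7", "windows-1253"}) {
                byte[] bytes = alphaTonos.getBytes(Charset.forName(name));
                System.out.printf("%s encodes Ά as 0x%02X%n", name, bytes[0] & 0xFF);
            }
            // Expected: ISO-8859-7 encodes Ά as 0xB6, windows-1253 as 0xA2.
        }
    }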

jelovirt (Member) commented May 8, 2012

Are #1269 and #1270 more or less duplicates of this bug? Preferably create a single issue for a single bug.

lade (Author) commented May 8, 2012

I already reported this problem for East European languages (#1269), but noticed that wrong encodings are used for other languages as well.

jelovirt (Member) commented May 8, 2012

Ok, I will close #1269 as a duplicate of this issue.

lade (Author) commented May 8, 2012

Yep, #1269 is a duplicate, but #1270 is not, as far as I can tell.

lade (Author) commented Sep 12, 2012

Not fixed in DITA-OT 1.6.2; for example, Czech still shows 'š' as '?' and 'ž' as 'ł' in the CHM TOC.

eerohele commented

I found out yesterday that a customer has been dealing with this problem for a long time by manually copy-pasting titles from the DITA XML files into the .hhc file after generating output with DITA-OT and then recompiling the CHM file.

From what I can tell so far, @lade's suggestion of using Windows codepages seems to work fine. However, it is not quite sufficient on its own: as @lade notes in #1268, to make š and Š work for Czech, the corresponding entities (352=Š and 353=š) need to be removed from entities.properties.

That in turn becomes a problem when we want to use a special character like μ in a CHM help. If we want to use it in, say, a Polish help (pl-pl), there needs to be an entry for it in entities.properties:

956=μ

However, if we add that entry and then create a Greek (el-gr) help, that (Greek) character will not be displayed correctly in the CHM TOC of the Greek help. If we remove that entry from entities.properties once again, it works OK. So the whole thing's a bit of a catch-22 and I'm not sure there's a way around it short of somehow being able to parametrize whether entities.properties should be used for a particular transformation or not.

By the way, this also means that 7fbd675 effectively breaks Greek-language CHM helps completely because all Greek characters in the TOC will be mangled (albeit only when @xml:lang is el-gr).

EDIT: Or maybe entities.properties could be language-specific and customizable by plugins?
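
A minimal sketch of that idea, assuming hypothetical per-language resource names such as entities_el-gr.properties (nothing like this exists in DITA-OT today): a Greek table could simply omit the Greek letters, while other languages fall back to the shared file.

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Properties;

    // Hypothetical loader, illustrative only: prefer a language-specific table
    // so a Greek (el-gr) build can omit entries like 956=μ that a Polish
    // (pl-pl) build still needs, falling back to the shared entities.properties.
    final class EntityTableLoader {
        static Properties load(final String lang) throws IOException {
            InputStream in = EntityTableLoader.class
                    .getResourceAsStream("entities_" + lang + ".properties");
            if (in == null) {
                in = EntityTableLoader.class.getResourceAsStream("entities.properties");
            }
            final Properties entities = new Properties();
            if (in != null) {
                try (InputStream stream = in) {
                    entities.load(stream);
                }
            }
            return entities;
        }
    }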

robander added the i18n (internationalization / localization) label May 12, 2016
robander (Member) commented

This has just come up as an urgent issue for one of my customers.

In the past, we had a hacked version of DITA-OT that converted some characters to entities, but not all, followed by a codepage conversion. What made that work was that the entity conversion was language-aware: for characters like Ý that exist in iso-8859-2, we do not generate an entity (the character is kept as Ý) when the target codepage is iso-8859-2. The convert-lang utility in DITA-OT always does the conversion, so common letters in iso-8859-2 languages are corrupted. Similarly, I've always ignored entities for Greek characters when the input file is Greek.
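
One way to express that codepage-aware rule, as a sketch rather than the actual convert-lang code: only emit a numeric reference when the target CHM codepage cannot represent the character.

    import java.nio.charset.Charset;

    // Sketch of the codepage-aware decision described above; not the DITA-OT
    // convert-lang implementation.
    final class EntityDecision {
        static String toEntityIfNeeded(final char c, final Charset target) {
            // Keep the literal character when the codepage can encode it;
            // otherwise fall back to a numeric character reference.
            return target.newEncoder().canEncode(c)
                    ? String.valueOf(c)
                    : "&#" + (int) c + ";";
        }

        public static void main(String[] args) {
            Charset cp1250 = Charset.forName("windows-1250");
            System.out.println(toEntityIfNeeded('Ý', cp1250)); // Ý is kept as-is
            System.out.println(toEntityIfNeeded('Ω', cp1250)); // becomes &#937;
        }
    }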

If nothing else, could we just add a property that allows skipping the convert-lang task, so that when all characters are common/expected characters in the language's CHM charset, everything will work? We've got similar properties on a lot of targets, but none is available here. @jelovirt I don't know if adding such a property qualifies as a bug fix for a patch, as it doesn't quite fix a bug but would allow us to work around one...

robander (Member) commented Nov 16, 2017

FWIW, our earlier process (run outside of DITA-OT) was actually based on the target codepage rather than the language. For example, if the target codepage is windows-1250 (as with Czech), none of the common Czech characters are converted. In this case, if I'm reading right, we convert the HHP file to windows-1250 but the HHC contents file, HHK index, and HTML files to the HTML codepage. I'm not sure whether that's causing additional trouble -- it makes sense that HTML would be converted to the HTML codepage (with the charset declaration properly updated), but I would think that the CHM compiler on Windows would want the other project files (not just the HHP) to use the Windows codepage. 😕
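
For reference, rewriting a project file into the Windows codepage is a small operation; a sketch (file name and charsets are placeholders, not what DITA-OT actually writes) could look like this:

    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch only: re-encode a file from one codepage to another, e.g. so the
    // HHC/HHK use the same Windows codepage as the HHP.
    public final class TranscodeFile {
        static void transcode(final Path file, final Charset from, final Charset to)
                throws IOException {
            final String content = new String(Files.readAllBytes(file), from);
            Files.write(file, content.getBytes(to));
        }

        public static void main(String[] args) throws IOException {
            transcode(Paths.get("help.hhc"),
                    Charset.forName("ISO-8859-2"),
                    Charset.forName("windows-1250"));
        }
    }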

robander (Member) commented

Hmm, as a test I've just changed this line that converts chars:

                if (entityMap.containsKey(key)) {

To this:

                if (entityMap.containsKey(key) &&
                        !entityExceptionChar(charset,charCode)) {

With the following function (for now just testing Y-acute, Ý and ý):

    // Returns true for characters that exist in the target codepage and so
    // should be kept as literal characters rather than converted to entities.
    private boolean entityExceptionChar(final String charset, final int charCode) {
        // 221 = Ý, 253 = ý; both are present in ISO-8859-2 / windows-1250
        return charset.equals(CODEPAGE_ISO_8859_2) &&
                (charCode == 221 || charCode == 253);
    }

I appear to get the characters generated and converted properly. I've got a full list of characters that we've avoided converting in this way for Latin-1, Latin-2, and Greek -- if this approach looks good, I could turn it into a fix. @jelovirt does that look like a valid approach?

robander (Member) commented Nov 17, 2017

I'm sure there's a more efficient way of looking these values up (a Set-based sketch follows the lists below), but so I don't lose them, here's the list of characters that we've avoided converting to entities for both 1252 / ISO-8859-1 and 1250 / ISO-8859-2:

if (charCode == 193 || charCode == 225 ||    //A-acute
    charCode == 194 || charCode == 226 ||    //A-circumflex
    charCode == 196 || charCode == 228 ||    //A-umlaut
    charCode == 199 || charCode == 231 ||    //C-cedilla
    charCode == 201 || charCode == 233 ||    //E-acute
    charCode == 203 || charCode == 235 ||    //E-umlaut
    charCode == 205 || charCode == 237 ||    //I-acute
    charCode == 206 || charCode == 238 ||    //I-circumflex
    charCode == 212 || charCode == 244 ||    //O-circumflex
    charCode == 214 || charCode == 246 ||    //O-umlaut
    charCode == 211 || charCode == 243 ||    //O-acute
    charCode == 218 || charCode == 250 ||    //U-acute
    charCode == 220 || charCode == 252 ||    //U-umlaut
    charCode == 221 || charCode == 253 ||    //Y-acute
    charCode == 223 || charCode == 215) {    //szlig, times

Here are the characters we avoid converting to entities just for 1252 / ISO-8859-1:

if (charCode == 192 || charCode == 224 ||    //A-grave
    charCode == 195 || charCode == 227 ||    //A-tilde
    charCode == 197 || charCode == 229 ||    //A-ring
    charCode == 198 || charCode == 230 ||    //AElig
    charCode == 200 || charCode == 232 ||    //E-grave
    charCode == 202 || charCode == 234 ||    //E-circumflex
    charCode == 204 || charCode == 236 ||    //I-grave
    charCode == 207 || charCode == 239 ||    //I-uml
    charCode == 208 || charCode == 240 ||    //ETH
    charCode == 209 || charCode == 241 ||    //N-tilde
    charCode == 210 || charCode == 242 ||    //O-grave
    charCode == 213 || charCode == 245 ||    //O-tilde
    charCode == 216 || charCode == 248 ||    //O-slash
    charCode == 217 || charCode == 249 ||    //U-grave
    charCode == 219 || charCode == 251 ||    //U-circumflex
    charCode == 222 || charCode == 254 ||    //Thorn
    charCode == 255) {                       //y-umlaut

In addition, for Greek, we avoid converting the full Greek alphabet.

Finally, for Korean, we do convert \ to ₩.
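
A Set-based form of the same lookup, sketched here with made-up names (the next comment describes the actual switch to sets in the fix):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative sketch: characters shared by windows-1252/ISO-8859-1 and
    // windows-1250/ISO-8859-2 that should stay literal rather than become
    // entities. Per the next comment, Š (352) and š (353) also need to be
    // excepted, in a separate Latin-2-only set.
    final class EntityExceptions {
        private static final Set<Integer> COMMON_LATIN_EXCEPTIONS = new HashSet<>(Arrays.asList(
                193, 225, 194, 226, 196, 228, 199, 231, 201, 233, 203, 235,
                205, 237, 206, 238, 211, 243, 212, 244, 214, 246, 218, 250,
                220, 252, 221, 253, 223, 215));

        static boolean isCommonLatinException(final int charCode) {
            return COMMON_LATIN_EXCEPTIONS.contains(charCode);
        }
    }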

robander pushed a commit to robander/dita-ot that referenced this issue Nov 28, 2017
robander (Member) commented Nov 28, 2017

Have a suggested fix linked above. When converting the ugly test in my previous comment to use sets, I noticed just one letter was missing from the Latin-2 set: Š/š (characters 352 and 353), which were being converted to &Scaron; and &scaron;.

The fix also switches the HHC and HHK files to use the Windows codepage, just like the HHP does today. Without this change, characters that use different codepoints in ISO-8859-2 versus Windows-1250 are still corrupted (as in the originally submitted defect report here).

With the fix above, using my Windows 7 system + DITA-OT 3.0 + the fix in #2851, I'm able to generate English, Czech, and Greek CHM files that compile on my system and display the correct TOC + Index.

  • My English sample uses characters that exist in the Latin-1 codepages (like Á) that currently generate symbols (&Aacute;) that break the TOC/index. These are no longer converted, so the TOC works. Characters that do not exist in Latin-1 (like Ω) are still converted to entities that do not display, but this is a limitation of the format.
  • My Czech sample includes the full Czech character set: Aa Áá Bb Cc Čč Dd Ďď Ee Éé Ěě Ff Gg Hh Ii Íí Jj Kk Ll Mm Nn Ňň Oo Óó Pp Qq Rr Řř Ss Šš Tt Ťť Uu Úú Ůů Vv Ww Xx Yy Ýý Zz Žž. Many of these are corrupted today; with the fix, all of them display properly in the TOC and Index.
  • My Greek sample does the same with the alphabet ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ αβγδεζηθικλμνξοπρςστυφχψω. These characters are still converted to entities for English (which works in the content but fails in the TOC), but are not converted for Greek (the codepage conversion to windows-1253 handles them all, and they all work in the TOC).

robander pushed a commit to robander/dita-ot that referenced this issue Nov 28, 2017
jelovirt added a commit that referenced this issue Nov 28, 2017
robander added this to the 3.0.1 milestone Nov 28, 2017
robander (Member) commented

Fixed with #2852
