Chm TOC shows wrong characters for many languages #1271
Comments
I already reported this problem for East European languages (#1269), but noticed that wrong encodings are used for other languages as well.
OK, I will close #1269 as a duplicate of this issue.
Not fixed in DITA-OT 1.6.2; for example, Czech still shows 'š' as '?' and 'ž' as 'ł' in the CHM TOC.
I found out yesterday that a customer has been dealing with this problem for a long time by manually copy-pasting titles from DITA XML files into the From what I can tell so far, @lade's suggestion of using Windows codepages seems to work fine. However, that is not necessarily quite sufficient: as @lade notes in #1268, to make š and Š work for Czech, the corresponding entities ( That in turn becomes a problem when we want to use a special character like μ in a CHM help. If we want to use it in, say, a Polish help (pl-pl), there needs to be an entry for it in
However, if we add that entry and then create a Greek (el-gr) help, that (Greek) character will not be displayed correctly in the CHM TOC of the Greek help. If we remove that entry from By the way, this also means that 7fbd675 effectively breaks Greek-language CHM helps completely because all Greek characters in the TOC will be mangled (albeit only when EDIT: Or maybe
This has just come up as an urgent issue for one of my customers. In the past, we had a hacked version of DITA-OT that converted some characters to entities, but not all, followed by a codepage conversion. What made that work was that the entity conversion was language-aware, so for characters like Ý that exist in iso-8859-2, we do not generate If nothing else - could we just add a property that allows skipping the
FWIW our earlier process (run outside of DITA-OT) was actually based on the target codepage rather than the language. For example, if the target codepage is windows-1250 (as with Czech), none of the common Czech characters are converted. In this case, if I'm reading it right, we convert the HHP file to 1250 but the HHC contents file, HHK index, and HTML files to the HTML codepage. I'm not sure if that's causing additional trouble -- it makes sense that HTML would be converted to the HTML codepage (with charset properly updated), but I would think that the CHM compiler on Windows would want the other project files (not just HHP) to use the Windows codepage. 😕
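To illustrate the codepage-based idea described above, here is a minimal Python sketch. It is not the DITA-OT code (the name `escape_for_codepage` is hypothetical) and it assumes that Python's `cp1250` and `cp1253` codecs match the Windows codepages the CHM compiler expects. Characters that exist in the target codepage are written as plain bytes; everything else becomes a numeric character reference.

```python
def escape_for_codepage(text: str, codepage: str) -> bytes:
    """Encode text for a target codepage, escaping only what it cannot hold.

    Characters present in the codepage are emitted as single bytes; any
    other character becomes a numeric character reference such as &#956;.
    """
    return text.encode(codepage, errors="xmlcharrefreplace")


sample = "Ý μ"

# Czech target: Ý exists in windows-1250, Greek mu does not.
czech = escape_for_codepage(sample, "cp1250")

# Greek target: mu exists in windows-1253, Ý does not.
greek = escape_for_codepage(sample, "cp1253")

print(czech)
print(greek)
```

The same input produces different escaping per target, which is why a single fixed entity list cannot serve both a Polish or Czech help and a Greek help.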
Hmm, as a test I've just changed this line that converts chars:
To this:
With the following function (for now just testing Y-acute (Ý and ý)):
I appear to get the characters generated / converted properly. I've got a full list of characters that we have not been converting in this way for Latin-1, Latin-2, and Greek -- if this approach looks good I could turn this into a fix. @jelovirt does that look like a valid approach?
I'm sure there's a more efficient way of looking these values up, but so I don't lose them, here's the list of characters that we've avoided converting to entities for both 1252 / ISO-8859-1 and 1250 / ISO-8859-2:
Here are the characters we avoid converting to entities just for 1252 / ISO-8859-1:
In addition, for Greek, we avoid converting the full Greek alphabet. Finally, for Korean, we do convert \ to ₩.
Signed-off-by: Robert D Anderson <robander@us.ibm.com>
Have a suggested fix linked above. When converting from the ugly test in my previous comment to use sets, I noticed just one character was missing from the Latin-2 set, which is converted to The fix also switches the HHC and HHK files to use the Windows codepage, just like the HHP does today. Without this change, characters that use different codepoints in ISO-8859-2 versus Windows-1250 are still corrupted (as with the original submitted defect report here). With the fix above, using my Windows 7 system + DITA-OT 3.0 + the fix in #2851, I'm able to generate English, Czech, and Greek CHM files that compile on my system and display the correct TOC + Index.
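For anyone wondering which characters are affected by the ISO-8859-2 versus Windows-1250 mismatch mentioned above, a short Python sketch can list them. The helper name `differing_bytes` and the sample string are just for illustration, and this assumes Python's `iso8859_2` and `cp1250` codecs reflect the encodings involved.

```python
def differing_bytes(chars, enc_a="iso8859_2", enc_b="cp1250"):
    """Return {char: (bytes_in_a, bytes_in_b)} for chars encoded differently."""
    diffs = {}
    for ch in chars:
        a, b = ch.encode(enc_a), ch.encode(enc_b)
        if a != b:
            diffs[ch] = (a, b)
    return diffs


# A few common Czech letters; only some of them move between the two encodings.
diffs = differing_bytes("áčěšžŠŽ")
for ch, (iso, win) in diffs.items():
    print(ch, iso.hex(), win.hex())
```

Letters such as á, č, and ě share the same byte in both encodings, so they survive either way, while š, ž, Š, and Ž do not.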
Signed-off-by: Robert D Anderson <robander@us.ibm.com>
Fixed with #2852
As one example, the letter Ά shows incorrectly in TOC for chm for Greek. By changing html output to 'windows-1253' instead of 'iso-8859-7' in codepages.xml I got it to show correctly.
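A small Python sketch of why this particular letter breaks, assuming the CHM viewer interprets the TOC bytes as windows-1253 (that is my reading of the symptom, not something confirmed by the compiler documentation):

```python
# The same letter sits at different byte positions in the two Greek encodings.
iso_bytes = "Ά".encode("iso8859_7")
win_bytes = "Ά".encode("cp1253")

# What a windows-1253 reader would show for the ISO-8859-7 byte.
misread = iso_bytes.decode("cp1253")

print(iso_bytes.hex(), win_bytes.hex(), misread)
```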
The compiler of HTML Help Workshop uses different encodings from what are used by DITA-OT currently. The ones that work correctly are:
In some of these cases the differences affect only a few special characters, and windows-1252 is a superset of iso-8859-1 (except for the control characters in the range 0x80-0x9F). In others the differences are more significant, and it would be good if the correct codepage were used.
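The windows-1252 versus iso-8859-1 relationship can be checked with a short Python sketch. The helper name `same_char` is made up for this example, and it relies on Python's `latin_1` and `cp1252` codecs, where a handful of windows-1252 bytes are undefined and raise an error on decoding.

```python
def same_char(byte_value: int) -> bool:
    """True if the byte decodes to the same character in both encodings."""
    raw = bytes([byte_value])
    try:
        return raw.decode("latin_1") == raw.decode("cp1252")
    except UnicodeDecodeError:
        # Some bytes are unassigned in Python's cp1252 codec; count as a mismatch.
        return False


mismatches = [b for b in range(256) if not same_char(b)]
print([hex(b) for b in mismatches])
```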