Decoding error of some UTF8 strings in the input map #20

Chromowolf · 2021-01-16T11:04:55Z

When euddraft compiles an input SCX, it first decodes the strings in the STR or STRx section of the input map (as utf8 or by other means), and then encodes them back into the STR or STRx section (as utf8). I believe there is such a decoding/encoding process, instead of simply copying the bytes, am I right?
Everything would be OK if euddraft used UTF8 for both the decoding and the encoding step. However, for some specific strings, the decoding step uses cp949 instead of UTF8, and the wrongly decoded strings are then encoded back using UTF8, resulting in wrong characters being displayed in game.

Example:
Modify some unit names in the input SCX (using the newest version of SCMD, saved as SC:R):

"Terran Academy" is renamed to "路障"
"Terran Armory" is renamed to "大厦"
Note that the UTF8 bytes for the above 4 characters are:
路: E8 B7 AF
障: E9 9A 9C
大: E5 A4 A7
厦: E5 8E A6

Open the map in StarCraft and it displays the correct characters.

Then compile the map with the newest version of euddraft and open the output EUD map in game:
"路障" is displayed correctly, but "大厦" comes out wrong (鸚㎩렑).

I've checked the STRx section of the input map with a hex editor. The strings there are "E8 B7 AF E9 9A 9C" and "E5 A4 A7 E5 8E A6", while the strings in the output EUD map are "E8 B7 AF E9 9A 9C" and "E9 B8 9A E3 8E A9 EB A0 91".
So it looks like euddraft decodes the second string "E5 A4 A7 E5 8E A6" as cp949, which gives "鸚㎩렑", and then encodes it back as UTF8, resulting in "E9 B8 9A E3 8E A9 EB A0 91".
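
For reference, the round trip described here can be reproduced in a few lines of Python. This is only a sketch of the suspected behavior, not euddraft's actual code: the helper name recode_if_cp949 is made up for illustration, and the byte strings are the ones read from the STRx section of the input map.

```python
# UTF-8 bytes of the two renamed unit names, as found in the input map.
BARRICADE = bytes.fromhex("E8 B7 AF E9 9A 9C")  # "路障"
MANSION = bytes.fromhex("E5 A4 A7 E5 8E A6")    # "大厦"


def recode_if_cp949(raw: bytes) -> bytes:
    """Hypothetical helper: re-encode as UTF-8 whenever the bytes decode as CP949."""
    try:
        return raw.decode("cp949").encode("utf-8")
    except UnicodeDecodeError:
        # Not valid CP949, so leave the original bytes untouched.
        return raw


for name in (BARRICADE, MANSION):
    print(name.hex(" "), "->", recode_if_cp949(name).hex(" "))
```

With Python's built-in cp949 codec, the first string fails to decode (AF E9 is not an assigned CP949 sequence) and is left untouched, while the second decodes without error and gets re-encoded, which is consistent with the bytes seen in the output map.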

I don't know why this happens. Could you fix this?

armoha · 2021-01-16T11:32:00Z

An input map can contain map strings in multiple encodings, and there is no way to detect the encoding perfectly (though only a handful of encodings are used for SC maps). A majority of Korean users still use an old version of SCMDraft2 because of the many inconveniences in the latest SCMD2, e.g. classic TrigEdit is literally unusable with non-ASCII strings, and opening an old map with CP949 map strings under UTF-8 settings ends up corrupting all map strings.

Old versions of SCMDraft2 only use the system ANSI encoding (CP949 on Korean Windows) rather than UTF-8 as the map string encoding. Many Korean users tried to concatenate a unit name to a string and got erroneous results, because in SC:R only UTF-8 strings can be edited in-game by EUD.

So currently euddraft tries to convert the encoding of every unit name string to UTF-8 whenever the string is decodable as CP949. I know it's not ideal at all, but I'm not sure what the best way to handle this is. I'll add a way to opt in if I can't come up with anything better.
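
One possible alternative to that rule is to try strict UTF-8 first and fall back to CP949 only when that fails. The sketch below is an assumption for discussion, not what euddraft does; decode_unit_name and its fallback parameter are hypothetical names. It does not remove the ambiguity, since a byte string can be valid in both encodings, but it changes which encoding wins in that case.

```python
def decode_unit_name(raw: bytes, fallback: str = "cp949") -> str:
    """Hypothetical helper: prefer strict UTF-8, then fall back to an ANSI code page.

    Raises UnicodeDecodeError if the bytes are valid in neither encoding.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


# A UTF-8 name that also happens to be decodable as CP949 stays intact.
utf8_name = bytes.fromhex("E5 A4 A7 E5 8E A6")
print(decode_unit_name(utf8_name))

# A name saved by an older editor in CP949 is still recovered via the fallback.
cp949_name = "한글".encode("cp949")
print(decode_unit_name(cp949_name))
```

The trade-off is the mirror image of the current one: a CP949 string whose bytes happen to form valid UTF-8 would be misread, although that is uncommon for Korean text.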

Chromowolf · 2021-01-16T12:00:38Z

Yes, you could add an option for users to choose the encoding. For example, in main.edd:

[coding]
decode: utf8

which forces euddraft to decode all the strings using utf8.
Also:

[coding]
decode: default

to use the old default decoding method.

Chromowolf · 2021-01-16T12:17:11Z

Another suggestion:
Let euddraft check the string section of the input map. If the input map uses STRx, decode all the strings as utf8. Otherwise, use the old decoding method.

I know the old SCMD2 uses the system ANSI encoding, which makes transcoding a headache for everyone. But the STRx section has existed since the 2019-10-03 version of SCMD2 and makes everything perfect. In that or any newer version of SCMD2, as long as the map is saved as SC:R, all input strings are encoded as UTF8 into STRx by SCMD2.
So, whenever the string section is STRx, can we assume that all the strings were encoded using UTF8? (I'm not sure)

armoha · 2021-01-16T12:22:03Z

It is possible to use ANSI encoding even in newer versions of SCMD2. There is a custom locale option in the profile settings, and it is mandatory when adding new SC:R features to old maps that are still being maintained. Not all STRx maps use UTF8, so the headache still exists xd

Chromowolf · 2021-01-16T17:28:43Z

Oh, got it. Looking forward to the new euddraft version :D

armoha · 2021-01-27T17:05:32Z

@Chromowolf Sorry for the delay. You can now set the encoding used to decode unit names in the input map with the decodeUnitName option:

[main]
input: input.scx
output: output.scx
decodeUnitName : utf-8
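
A minimal sketch of how such a setting could be read and validated in Python. It assumes the file can be parsed with the standard configparser module, which may not match euddraft's real loader; the variable names and the cp949 default are illustrative assumptions. codecs.lookup is used only to check that the configured name is a codec Python knows.

```python
import codecs
import configparser

EDD_TEXT = """\
[main]
input: input.scx
output: output.scx
decodeUnitName : utf-8
"""

parser = configparser.ConfigParser()
parser.read_string(EDD_TEXT)

# With configparser's default settings, option names are matched case-insensitively.
# The "cp949" default here is an assumption, used when the option is absent.
encoding_name = parser["main"].get("decodeUnitName", "cp949")

# Raises LookupError if the name is not a codec known to Python.
codec = codecs.lookup(encoding_name)
print(codec.name)
```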

Chromowolf · 2021-03-21T02:27:27Z

> There is a custom locale option in the profile settings, and it is mandatory when adding new SC:R features to old maps that are still being maintained.

Btw, I'm not sure whether this is it: (I'm using the newest version of SCMD, the 2020-06-24 version)

I tried entering 65001 and 949, but got this:

I also read this
https://cafe.naver.com/edac/83224
And I don't know where to set the -charEncoding option.

(Btw I'm currently using your TrigEditPlus. It's perfect! It can set the codepage to 65001 automatically. I'm just curious about how to set the charEncoding without using your TrigEditPlus)

armoha · 2021-03-21T05:29:19Z

> I tried entering 65001 and 949, but got this:

Weird. IIRC there is an error with 65001, but Unicode strings are still displayed nicely.

> And I don't know where to set the -charEncoding option.

https://cafe.naver.com/edac/79812
Make a shortcut for SCMDraft2 and add the option after the path in the shortcut's Target field.
(Also mentioned in the 2019.05.20(W) patch notes at http://www.stormcoast-fortress.net/Irregularies/ )

armoha added the "bug" and "help wanted" labels Jan 16, 2021

armoha closed this as completed in armoha/eudplib@962c70e Jan 27, 2021

armoha removed the "help wanted" label Jan 27, 2021

Chromowolf mentioned this issue Mar 21, 2021

Could you add an option for users to select unit name coding? Buizz/EUD-Editor-3#40
