Decoding error of some UTF8 strings in the input map #20

Closed
Chromowolf opened this issue Jan 16, 2021 · 8 comments
Labels
bug Something isn't working


@Chromowolf
Contributor

Chromowolf commented Jan 16, 2021

When euddraft compiles an input SCX, it first decodes the strings in the STR or STRx section of the input map (as UTF-8 or by other means), and then encodes them back into the STR or STRx section (as UTF-8). I believe there is such a decoding/encoding step rather than a plain byte copy. Am I right?
Everything would be fine if euddraft used UTF-8 for both decoding and encoding. However, some specific strings are decoded as CP949 instead of UTF-8, and the wrongly decoded strings are then encoded back as UTF-8, so the wrong characters are displayed in game.

Example:
Modify some unit names in the input SCX (using the newest version of SCMD, saved as SC:R):
[image: luzhang]
[image: dasha]
"Terran Academy" is renamed to "路障"
"Terran Armory" is renamed to "大厦"
Note that the UTF8 bytes for the above 4 characters are:
路: E8 B7 AF
障: E9 9A 9C
大: E5 A4 A7
厦: E5 8E A6

Opening the map in StarCraft displays the correct characters:
[image: inlz]
[image: inds]

Then compile the map with the newest version of euddraft and open the output EUD map in game.
"路障" is displayed correctly, but "大厦" comes out wrong (鸚㎩렑):
[image: outlz]
[image: outds]

I've checked the STRx section of the input map with a hex editor; the strings are "E8 B7 AF E9 9A 9C" and "E5 A4 A7 E5 8E A6". The corresponding strings in the output EUD map are "E8 B7 AF E9 9A 9C" and "E9 B8 9A E3 8E A9 EB A0 91".
So the second string, "E5 A4 A7 E5 8E A6", is decoded by euddraft as CP949, giving "鸚㎩렑", and then encoded back as UTF-8, resulting in "E9 B8 9A E3 8E A9 EB A0 91".

I don't know why this happens. Could you fix this?
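
The round trip described in this report can be reproduced with a short Python sketch. It assumes that Python's built-in `cp949` codec behaves like the decoder euddraft applied to unit names; the variable names are illustrative only.

```python
# Reproduce the reported corruption of "大厦".
# Assumption: Python's built-in cp949 codec matches the decoder used by euddraft.
raw = bytes.fromhex("E5A4A7E58EA6")  # bytes found in the input map's STRx section

intended = raw.decode("utf-8")       # what the map author typed
misread = raw.decode("cp949")        # the same bytes misinterpreted as CP949
written = misread.encode("utf-8")    # what would be written to the output map

print(intended)                      # 大厦
print(misread)
print(" ".join(f"{b:02X}" for b in written))
```

If the assumption holds, the last line should match the bytes found in the output map above.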

@armoha
Owner

armoha commented Jan 16, 2021

An input map can contain map strings in multiple encodings, and there is no way to detect the encoding perfectly (though only a handful of encodings are used for SC maps). A majority of Korean users still use an old version of SCMDraft2 because of the many inconveniences in the latest SCMD2, e.g. classic trigedit is practically unusable with non-ASCII strings, and opening an old map with CP949 map strings under UTF-8 settings ends up corrupting all map strings.

Old versions of SCMDraft2 only use the system ANSI encoding (CP949 on Korean Windows) rather than UTF-8 as the map string encoding. Many Korean users tried to concatenate a unit name to a string and got erroneous results, because in SC:R only UTF-8 strings can be edited in-game by EUD.

So currently euddraft tries to convert the encoding of all unit name strings to UTF-8 whenever a string is decodable as CP949. I know it's not ideal at all, but I'm not sure what the best way to handle this is. I'll add a way to opt in if I can't come up with anything better.
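
A minimal sketch of the "decode as CP949 when possible" heuristic described above, applied to the two unit names from the report. The helper `decode_unit_name` is hypothetical and is not euddraft's actual code; it only illustrates why one string survives and the other does not.

```python
def decode_unit_name(raw: bytes) -> str:
    """Hypothetical helper: prefer CP949, fall back to UTF-8 (not euddraft's code)."""
    try:
        return raw.decode("cp949")
    except UnicodeDecodeError:
        return raw.decode("utf-8")


# UTF-8 bytes of the two renamed units, as listed in the report.
samples = {
    "路障": bytes.fromhex("E8B7AFE99A9C"),
    "大厦": bytes.fromhex("E5A4A7E58EA6"),
}

results = {name: decode_unit_name(raw) for name, raw in samples.items()}
for name, decoded in results.items():
    status = "preserved" if decoded == name else "corrupted"
    print(f"{name}: {status}")
```

The bytes of "路障" are not a valid CP949 sequence, so the fallback applies and the name is kept. The bytes of "大厦" happen to form valid CP949, so they are silently misread.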

armoha added the bug and help wanted labels on Jan 16, 2021
@Chromowolf
Contributor Author

Chromowolf commented Jan 16, 2021

Yes, you could add an option for users to choose. For example, in main.edd:

[coding]
decode: utf8

which forces euddraft to decode all the strings using utf8.
Also:

[coding]
decode: default

to use the old default decoding method.

@Chromowolf
Contributor Author

Another suggestion:
Let euddraft check the string section of the input map. If the input map uses STRx, decode all the strings as UTF-8. Otherwise, use the old decoding method.

I know the old SCMD2 uses the system ANSI encoding, which makes transcoding a headache for everyone. But since the 2019-10-03 version of SCMD2, STRx has been available and makes everything perfect. In this and newer versions of SCMD2, as long as the map is saved as the SC:R version, SCMD2 encodes all input strings as UTF-8 into STRx.
So whenever the string section is STRx, can we assume that all the strings were encoded as UTF-8? (I'm not sure.)
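
The check proposed here could look roughly like the sketch below. It assumes the usual CHK layout (a sequence of sections, each with a 4-byte name and a 4-byte little-endian size followed by the payload) and that the scenario data has already been extracted from the SCX archive. `chk_section_names` and `section` are hypothetical helpers, and the toy payloads are placeholders rather than real map data.

```python
import struct


def chk_section_names(chk: bytes) -> list:
    """Hypothetical helper: list section names of raw CHK data (assumed layout)."""
    names = []
    offset = 0
    while offset + 8 <= len(chk):
        name, size = struct.unpack_from("<4sI", chk, offset)
        names.append(name.decode("ascii", errors="replace"))
        offset += 8 + size  # skip the 8-byte header and the payload
    return names


def section(name: bytes, payload: bytes) -> bytes:
    """Build one toy section for the demonstration below."""
    return name + struct.pack("<I", len(payload)) + payload


# Toy data with placeholder payloads, only to exercise the scanner.
toy_chk = section(b"VER ", b"\x00\x00") + section(b"STRx", b"\x00" * 8)
uses_strx = "STRx" in chk_section_names(toy_chk)
print(uses_strx)  # True
```

Finding STRx only shows which string section is present. It does not by itself prove that the strings are UTF-8.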

@armoha
Owner

armoha commented Jan 16, 2021

It is possible to use ANSI encoding even in newer versions of SCMD2. There is a custom locale option in the profile settings, and it is mandatory for adding new SC:R features to old maps that are still being maintained. Not all STRx maps use UTF-8, so the headache still exists xd

@Chromowolf
Contributor Author

Oh, got it. Looking forward to the new euddraft version :D

armoha removed the help wanted label on Jan 27, 2021
@armoha
Owner

armoha commented Jan 27, 2021

@Chromowolf Sorry for the delay. You can now set the encoding used to decode unit names in the input map with the decodeUnitName option:

[main]
input: input.scx
output: output.scx
decodeUnitName : utf-8
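
To illustrate the effect of forcing the encoding, here is a sketch that reads the fragment above with Python's standard `configparser` and decodes the problematic bytes with the configured codec. Whether euddraft parses `.edd` files this way is an assumption, and the `cp949` fallback is arbitrary for this sketch rather than a documented default.

```python
import configparser

# The settings fragment from the comment above, embedded for the example.
EDD_TEXT = """\
[main]
input: input.scx
output: output.scx
decodeUnitName : utf-8
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep option names case-sensitive (default lowercases)
parser.read_string(EDD_TEXT)

# Arbitrary fallback for this sketch, not euddraft's documented behavior.
encoding = parser["main"].get("decodeUnitName", fallback="cp949")

raw = bytes.fromhex("E5A4A7E58EA6")  # bytes of "大厦" from the report
unit_name = raw.decode(encoding)
print(encoding, unit_name)           # utf-8 大厦
```

With the encoding fixed to UTF-8, the name is decoded as the map author intended instead of being guessed.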

@Chromowolf
Contributor Author

Chromowolf commented Mar 21, 2021

There is a custom locale option in profile settings and it is mandatory for adding new SC:R features for old maintaining maps.

Btw, I'm not sure whether it's this (I'm using the newest version of SCMD, the 2020-06-24 version):
[image]
I tried entering 65001 and 949, but got this:
[image]

I also read this
https://cafe.naver.com/edac/83224
And I don't know where to set the -charEncoding option.

(Btw I'm currently using your TrigEditPlus. It's perfect! It can set the codepage to 65001 automatically. I'm just curious about how to set the charEncoding without using your TrigEditPlus)

@armoha
Owner

armoha commented Mar 21, 2021

I tried entering 65001 and 949, but got this:

Weird. IIRC there is an error with 65001, but Unicode strings are still displayed nicely.

And I don't know where to set the -charEncoding option.

https://cafe.naver.com/edac/79812
Make a shortcut for SCMDraft2 and add the option after the target path.
(This is also mentioned in the 2019.05.20(W) patch note at http://www.stormcoast-fortress.net/Irregularies/ )
