New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding: decouples TLK and FS path encoding #47

Open
wants to merge 11 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@MarcelHB
Contributor

MarcelHB commented May 1, 2016

My setup is as follows: I have a german IWD2 (TLK encoding ISO-8859-1), located on a Linux FS running GemRB, resource listing reveals UTF-8 file paths. The Sounds directory consists of 60% of files with names containing umlauts.

When I CGen a new party, the sound files list

  1. displays visual garbage around umlaut-chars,
  2. the resource manager is unable to find the selected sound file (please review #46 first). No sound then.

In order to overcome 1, it is crucial to replace TLKEncoding by FSPathEncoding where necessary, since the global encoding makes no sense here. For 2, from the Python interface/IWD2 CSounds script, I made everything on the way aware of multi-byteness of the file path to use. Otherwise, it ends up in that beautiful strnlwrcpy that disrupts everything. I try not to break anything.

But: I don't not know the situation at large, as e.g. pl_uppercase/pl_lowercase must have been used for files as well. Looking at the name, I assume that some polish version might be broken then, but that's just a guess. This is where I need further information, we could make FSPathEncoding configurable I guess.

MarcelHB added some commits May 1, 2016

String: extends buffer in `MBCStringFromString'
If `String' aliases to `std::wstring', `::length' returns the number
of wide-chars, not bytes. For the worst case, we should allow each
`wchar_t' to be mapped into space not smaller, shrinking the space after
the result of `wcstombs'.

Otherwise, multibyte charachters make the resulting string cut-off
early.
String: moves `EncodingStruct' here
Since we are about to control string actions with custom encodings, we
better move it a little up, next to the methods.
Interface: adds distinct encoding for file system path
Consider the case: IWD2, german (ISO-8859-1), located on a mb file
system (UTF-8). For reasons such as resource listing (sound files), we
shouldn't couple the TLK encoding when file paths are subject to
something different.

While this is a quite static setting for now, it's better to prepare
decoupling anyway.
GUIScript: updates `QueryText', 'SetPlayerSound' for mb awareness
If the string results from a non-TLK encoding (such as sound folder file
names), they choose the FS encoding paths.
GUIScript: forces `ListResources' to use FS encoding
For the german IWD2 sound files list, this no longer causes artifacts of
bad encoding and allow proper lookups when returned to the game.
@MarcelHB

This comment has been minimized.

Show comment
Hide comment
@MarcelHB

MarcelHB May 2, 2016

Contributor

Damn, local Clang build looks fine here, gonna check the next days, hopefully.

Contributor

MarcelHB commented May 2, 2016

Damn, local Clang build looks fine here, gonna check the next days, hopefully.

@lynxlynxlynx

This comment has been minimized.

Show comment
Hide comment
@lynxlynxlynx

lynxlynxlynx May 2, 2016

Member

If you check the log, only the osx build is failing (setlocale).

Comments:

  • having setsoudnfolder and querytext take extra params is unwieldy, is there no way to autodetect the proper way to handle a string?
    • could the non-mb path be folded in? Or does this break iwd2 with a, let's say, English tlk?
    • the extra benefit would be, that also the other games would get fixed automatically
  • we don't have a function yet to lowercase mb strings?
    • did the original German leave them as-is too in char sound list?
  • polish and czech used a different non-utf encoding, so we do some extra translation there, as some chars are remapped (this is where the pl_ functions are from)
    • see gemrb/docs/en/default_ini.txt
Member

lynxlynxlynx commented May 2, 2016

If you check the log, only the osx build is failing (setlocale).

Comments:

  • having setsoudnfolder and querytext take extra params is unwieldy, is there no way to autodetect the proper way to handle a string?
    • could the non-mb path be folded in? Or does this break iwd2 with a, let's say, English tlk?
    • the extra benefit would be, that also the other games would get fixed automatically
  • we don't have a function yet to lowercase mb strings?
    • did the original German leave them as-is too in char sound list?
  • polish and czech used a different non-utf encoding, so we do some extra translation there, as some chars are remapped (this is where the pl_ functions are from)
    • see gemrb/docs/en/default_ini.txt
@bradallred

This comment has been minimized.

Show comment
Hide comment
@bradallred

bradallred May 2, 2016

Member

you need to include <clocale> (not <locale>) to use the locale functions as per the C++ standard.

I agree with lynx, seems unnecessarily cumbersome/hacky. The scripts shouldn't need to know anything about encodings or file systems etc.

Also, I suspect there is a better way for dealing with file system encodings then adding yet another configuration for it. This should at the very least be something detectable at runtime instead of something the user must configure, or worse, set at compile time as it seems you have it now.

Member

bradallred commented May 2, 2016

you need to include <clocale> (not <locale>) to use the locale functions as per the C++ standard.

I agree with lynx, seems unnecessarily cumbersome/hacky. The scripts shouldn't need to know anything about encodings or file systems etc.

Also, I suspect there is a better way for dealing with file system encodings then adding yet another configuration for it. This should at the very least be something detectable at runtime instead of something the user must configure, or worse, set at compile time as it seems you have it now.

@MarcelHB

This comment has been minimized.

Show comment
Hide comment
@MarcelHB

MarcelHB May 2, 2016

Contributor

@bradallred : haven't noticed your post in time. :)

Might be <locale> vs. <clocale> ...

  • Thing about SetSoundFolder/QueryText is: I expect they are used around scripts that might QueryText from some TLK-asset content. And even maybe SetSoundFolder by values not originating from reading the file system on demand? I haven't covered 100% of uses yet, so that's the proposal. But I can look at that. Auto-detecting encoding is complete no-go, and since we parse regular Python values, I don't see easy ways to inject necessary states/flags to circumvent args.
    • I don't see ASCII-only paths breaking. I'm not sure about ISO-8859x as how such paths are used in the wild. I have a feeling of endangering pl/cz as they might have ISO-like encoding while FSPathEncoding.encoding is currently set to UTF-8, that's what the wide use of strnlwrcpy makes me think.
    • Yes, at least anything equiavalent to my setup.
  • Haven't found it yet, but could check String for such a thing. Since std-things mostly rely on std::locale and no one ever included them before, I guess not. :)
    • AFAIR, for all my DE-locale games, the names were uppercased and displayed correctly (Windows, NTFS), but definetely for IWD2.
  • Yes, ack'd. And since all the string-processing went through the mentioned strnlwrcpy, I expect sound files names (as far as they were subject to non-ASCII chars, but this is something I don't know) to a) have always been broken so far or b) to be broken by now.
Contributor

MarcelHB commented May 2, 2016

@bradallred : haven't noticed your post in time. :)

Might be <locale> vs. <clocale> ...

  • Thing about SetSoundFolder/QueryText is: I expect they are used around scripts that might QueryText from some TLK-asset content. And even maybe SetSoundFolder by values not originating from reading the file system on demand? I haven't covered 100% of uses yet, so that's the proposal. But I can look at that. Auto-detecting encoding is complete no-go, and since we parse regular Python values, I don't see easy ways to inject necessary states/flags to circumvent args.
    • I don't see ASCII-only paths breaking. I'm not sure about ISO-8859x as how such paths are used in the wild. I have a feeling of endangering pl/cz as they might have ISO-like encoding while FSPathEncoding.encoding is currently set to UTF-8, that's what the wide use of strnlwrcpy makes me think.
    • Yes, at least anything equiavalent to my setup.
  • Haven't found it yet, but could check String for such a thing. Since std-things mostly rely on std::locale and no one ever included them before, I guess not. :)
    • AFAIR, for all my DE-locale games, the names were uppercased and displayed correctly (Windows, NTFS), but definetely for IWD2.
  • Yes, ack'd. And since all the string-processing went through the mentioned strnlwrcpy, I expect sound files names (as far as they were subject to non-ASCII chars, but this is something I don't know) to a) have always been broken so far or b) to be broken by now.
@lynxlynxlynx

This comment has been minimized.

Show comment
Hide comment
@lynxlynxlynx

lynxlynxlynx May 2, 2016

Member

SetSoundFolder is definitely safe to use directly with whatever the filesystem encoding is, since we generate its list by inspecting the toplevel sound folder. The soundset likely aren't even mentioned in the tlk.

QueryText is less clear. Fine for selectables (lists) like sound and emulated spinners (just numbers), but it's also used for general textedits and editable textareas. The latter are broken right now and can't be tested, but some, like the initial biographies will come straight from the tlk.

I don't know much about unicode, so this may be a silly question: can't QueryText check its result to see if it starts with a byte-order-mark? We currently do have a mask of allowable chars in savegame names, which I think applies to all textedits. Not sure if that works with the CJK locales, but I'm just thinking out loud of what kind of input (encoding) we get, because in the end, we don't want to save text written in two.

Can we treat all editables (or just editable TAs) as multibyte containers? They are few and far in between, so it wouldn't be much of a memory hit to always convert them. There are nice flags to check against.

Member

lynxlynxlynx commented May 2, 2016

SetSoundFolder is definitely safe to use directly with whatever the filesystem encoding is, since we generate its list by inspecting the toplevel sound folder. The soundset likely aren't even mentioned in the tlk.

QueryText is less clear. Fine for selectables (lists) like sound and emulated spinners (just numbers), but it's also used for general textedits and editable textareas. The latter are broken right now and can't be tested, but some, like the initial biographies will come straight from the tlk.

I don't know much about unicode, so this may be a silly question: can't QueryText check its result to see if it starts with a byte-order-mark? We currently do have a mask of allowable chars in savegame names, which I think applies to all textedits. Not sure if that works with the CJK locales, but I'm just thinking out loud of what kind of input (encoding) we get, because in the end, we don't want to save text written in two.

Can we treat all editables (or just editable TAs) as multibyte containers? They are few and far in between, so it wouldn't be much of a memory hit to always convert them. There are nice flags to check against.

@MarcelHB

This comment has been minimized.

Show comment
Hide comment
@MarcelHB

MarcelHB May 2, 2016

Contributor

SetSoundFolder was given my umlaut Kämpfer_männlich_1 (ä being 2 bytes in UTF-8) string, already returned correctly from the Py context, and made it pass strnlwrcpy. This makes the string in console debugging look like Kpfer_mnlich_1. In my proposal, I explicitly avoid this last step on the road. Then, and only then, I was able to preserve the path bytes and actually hear sound/make the RM find the file.

I haven't noticed any explicit BOMs in any input (they are not mandatory for UTF-8 or maybe strlen ignores these bytes? Haven't tested that.).

I don't know, somebody made GemRB use std::wstring for reasons, affecting things around TAs. ;)

Contributor

MarcelHB commented May 2, 2016

SetSoundFolder was given my umlaut Kämpfer_männlich_1 (ä being 2 bytes in UTF-8) string, already returned correctly from the Py context, and made it pass strnlwrcpy. This makes the string in console debugging look like Kpfer_mnlich_1. In my proposal, I explicitly avoid this last step on the road. Then, and only then, I was able to preserve the path bytes and actually hear sound/make the RM find the file.

I haven't noticed any explicit BOMs in any input (they are not mandatory for UTF-8 or maybe strlen ignores these bytes? Haven't tested that.).

I don't know, somebody made GemRB use std::wstring for reasons, affecting things around TAs. ;)

@lynxlynxlynx

This comment has been minimized.

Show comment
Hide comment
@lynxlynxlynx

lynxlynxlynx May 2, 2016

Member

For SetSoundFolder I meant that it should be fixable without needing an extra param for the gui scripts. Is there anything else but some old FAT partitions that don't use unicode paths by default?

Member

lynxlynxlynx commented May 2, 2016

For SetSoundFolder I meant that it should be fixable without needing an extra param for the gui scripts. Is there anything else but some old FAT partitions that don't use unicode paths by default?

@bradallred

This comment has been minimized.

Show comment
Hide comment
@bradallred

bradallred May 2, 2016

Member

the use of w_char is because we need 16 bit characters to properly map to the BAM fonts (for CJK versions specifically): one byte is the cycle and one byte is the frame. It also allows us to use any unicode characters that occupy a maximum of 2 bytes or the CJK TLK encodings which use 16bit chars.

The editable fields/TAs are going display based off of whatever font is set for them individually. If you are using standard fonts then only characters that exist in the standard fonts will display (basically extended ASCII, but depends on locale of the game data). Any character that doesnt have an entry in the BAM will have the little box outline glyph. If you are using TTF fonts (and have iconv) then it should be UTF16, but with the previously mentioned caveat of being incompatible with characters using more than 16bits.

as for FS that don't support unicode... FAT is the only one that comes to mind, but I think some are UTF16 instead of UTF8... anyway, I'm fine with assuming UTF8... lynx is usually more concerned about archaic device compatibility :) but sys.getfilesystemencoding() is available from the Python scripts to get an exact answer about what encoding should be used with file names.

strlen (and any other plain cstring function) is dumb and completely unaware of encodings so its not skipping anything, the BOM isn't there.

Member

bradallred commented May 2, 2016

the use of w_char is because we need 16 bit characters to properly map to the BAM fonts (for CJK versions specifically): one byte is the cycle and one byte is the frame. It also allows us to use any unicode characters that occupy a maximum of 2 bytes or the CJK TLK encodings which use 16bit chars.

The editable fields/TAs are going display based off of whatever font is set for them individually. If you are using standard fonts then only characters that exist in the standard fonts will display (basically extended ASCII, but depends on locale of the game data). Any character that doesnt have an entry in the BAM will have the little box outline glyph. If you are using TTF fonts (and have iconv) then it should be UTF16, but with the previously mentioned caveat of being incompatible with characters using more than 16bits.

as for FS that don't support unicode... FAT is the only one that comes to mind, but I think some are UTF16 instead of UTF8... anyway, I'm fine with assuming UTF8... lynx is usually more concerned about archaic device compatibility :) but sys.getfilesystemencoding() is available from the Python scripts to get an exact answer about what encoding should be used with file names.

strlen (and any other plain cstring function) is dumb and completely unaware of encodings so its not skipping anything, the BOM isn't there.

Actor: updates `GetSoundFolder' accordingly
Otherwise, there is no in-game PC sound.

Code looks dangerous around here.
@MarcelHB

This comment has been minimized.

Show comment
Hide comment
@MarcelHB

MarcelHB May 3, 2016

Contributor

Had a view on the german original of IWD2 for Windows today. The game reads and stores single-byte characters, ISO/Win-1252.

I'm not not sure how to continue. We could close this PR to keep it for the archives until someone comes up with a better approach to deal with these issues.

Contributor

MarcelHB commented May 3, 2016

Had a view on the german original of IWD2 for Windows today. The game reads and stores single-byte characters, ISO/Win-1252.

I'm not not sure how to continue. We could close this PR to keep it for the archives until someone comes up with a better approach to deal with these issues.

@lynxlynxlynx

This comment has been minimized.

Show comment
Hide comment
@lynxlynxlynx

lynxlynxlynx May 3, 2016

Member

Ok, so the problem is still just the filesystem encoding — that's good. :) I don't know if any other translation has the same problem, but I pinged Edheldil about the Czech one.

The pragmatic thing to do would be to autodetect German on startup by just checking for specific known files/folder (we do the same for GameType) and normalize the sound folder names (likely needed also for scripts/) to ascii (ae, ss ..).

Member

lynxlynxlynx commented May 3, 2016

Ok, so the problem is still just the filesystem encoding — that's good. :) I don't know if any other translation has the same problem, but I pinged Edheldil about the Czech one.

The pragmatic thing to do would be to autodetect German on startup by just checking for specific known files/folder (we do the same for GameType) and normalize the sound folder names (likely needed also for scripts/) to ascii (ae, ss ..).

@lynxlynxlynx

This comment has been minimized.

Show comment
Hide comment
@lynxlynxlynx

lynxlynxlynx May 4, 2016

Member

The Czech one doesn't touch filenames, they're still English.

Member

lynxlynxlynx commented May 4, 2016

The Czech one doesn't touch filenames, they're still English.

@MarcelHB

This comment has been minimized.

Show comment
Hide comment
@MarcelHB

MarcelHB May 4, 2016

Contributor

Good to know.

I'm not a big cross-platform fs-encoding compatibility guy, but can we assume to have std::setlocale(LC_ALL, "") (or some other LC_?) in C++ and the mentioned sys.getfilesystemencoding() in Python to reliably determine the encoding of file-paths at runtime?

I'm not sure about savegame compat. across platforms as sound file names inside them are taken-as-given. Could we ignore this for now?

Contributor

MarcelHB commented May 4, 2016

Good to know.

I'm not a big cross-platform fs-encoding compatibility guy, but can we assume to have std::setlocale(LC_ALL, "") (or some other LC_?) in C++ and the mentioned sys.getfilesystemencoding() in Python to reliably determine the encoding of file-paths at runtime?

I'm not sure about savegame compat. across platforms as sound file names inside them are taken-as-given. Could we ignore this for now?

@bradallred

This comment has been minimized.

Show comment
Hide comment
@bradallred

bradallred May 4, 2016

Member

assuming std::setlocale(LC_ALL, "") is safe enough, it may not do anything, but nothing will break (that isn't already) by using it.

I wouldn't worry about save game compatibility across platforms or localizations (for now anyway). This isn't really a new problem; map notes and other saved tidbits are going to suffer similar problems if you take the saves across platforms/localizations, but they are just annoyances rather then anything game breaking.

If loading a save shouldnt break if the sounds cant be found... if it does that is a separate issue that we can fix.

Good work so far :)

Member

bradallred commented May 4, 2016

assuming std::setlocale(LC_ALL, "") is safe enough, it may not do anything, but nothing will break (that isn't already) by using it.

I wouldn't worry about save game compatibility across platforms or localizations (for now anyway). This isn't really a new problem; map notes and other saved tidbits are going to suffer similar problems if you take the saves across platforms/localizations, but they are just annoyances rather then anything game breaking.

If loading a save shouldnt break if the sounds cant be found... if it does that is a separate issue that we can fix.

Good work so far :)

bradallred added a commit that referenced this pull request Sep 1, 2017

ActionBar: remove borders on buttons
this used to be used to "disable" the button, but now it just causes graphical issues
this closes #47
@MarcelHB

This comment has been minimized.

Show comment
Hide comment
@MarcelHB

MarcelHB Sep 19, 2017

Contributor

Any future plans here? Other approaches?

I went through some voice and sound issues, and all male PC voices for IWD2 (because of German MÄNNLICH in their sound file name) are still absent or throw exceptions when selected in CSound.

Contributor

MarcelHB commented Sep 19, 2017

Any future plans here? Other approaches?

I went through some voice and sound issues, and all male PC voices for IWD2 (because of German MÄNNLICH in their sound file name) are still absent or throw exceptions when selected in CSound.

@bradallred

This comment has been minimized.

Show comment
Hide comment
@bradallred

bradallred Sep 19, 2017

Member

I cant see a reason not to merge it per se (we certainly want that allocatedBytes fix), but it seems like this is a band aid. Is the DirectoryIterator returning the expected strings? that seems like the real place to be addressing this.

Member

bradallred commented Sep 19, 2017

I cant see a reason not to merge it per se (we certainly want that allocatedBytes fix), but it seems like this is a band aid. Is the DirectoryIterator returning the expected strings? that seems like the real place to be addressing this.

@MarcelHB

This comment has been minimized.

Show comment
Hide comment
@MarcelHB

MarcelHB Sep 19, 2017

Contributor

Unaware of this PR still being pending, I had deleted my fork in the meantime. I guess this PR is then frozen to any further modifications.

Contributor

MarcelHB commented Sep 19, 2017

Unaware of this PR still being pending, I had deleted my fork in the meantime. I guess this PR is then frozen to any further modifications.

@lynxlynxlynx

This comment has been minimized.

Show comment
Hide comment
@lynxlynxlynx

lynxlynxlynx Sep 19, 2017

Member

Can't download the patches, but the code is still viewable here.

Anyway, did a quick test:

  • touch sounds/ačč000A.wav
  • DirectoryIterator returns the mangled version before anyone else can do anything to it
  • dirent has just a normal char array to hold the name, so that's to be expected
    A bunch of things would have to be converted to use wchar_t to get that working.
Member

lynxlynxlynx commented Sep 19, 2017

Can't download the patches, but the code is still viewable here.

Anyway, did a quick test:

  • touch sounds/ačč000A.wav
  • DirectoryIterator returns the mangled version before anyone else can do anything to it
  • dirent has just a normal char array to hold the name, so that's to be expected
    A bunch of things would have to be converted to use wchar_t to get that working.
@bradallred

This comment has been minimized.

Show comment
Hide comment
@bradallred

bradallred Sep 20, 2017

Member

I assume the "mangled" version is expected and is going to be the raw utf8 bytes. I assume GDB can interpret it that way if you ask it.

What I'm saying is: the DirectoryIterator ought to be what has the EncodingStruct as a static class member, so that things that read from it can know what to pass to StringFromEncodedData et al. This way the problem is solved for everything that displays information from the FS rather then this single point.

Member

bradallred commented Sep 20, 2017

I assume the "mangled" version is expected and is going to be the raw utf8 bytes. I assume GDB can interpret it that way if you ask it.

What I'm saying is: the DirectoryIterator ought to be what has the EncodingStruct as a static class member, so that things that read from it can know what to pass to StringFromEncodedData et al. This way the problem is solved for everything that displays information from the FS rather then this single point.

bradallred added a commit that referenced this pull request Oct 25, 2017

Button: don't pule filled "borders"
we use theres for a "disabled" overlay when the button still needs to be functional (spellbook icons etc)

issue #47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment