Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use multi-byte variant of CreateProcess on Windows #809

Closed

Conversation

eht16
Copy link
Member

@eht16 eht16 commented Dec 13, 2015

This should fix problems when the command line or the working directory
of a command on Windows contains non-ASCII characters.
Those characters are first converted into a multi-byte string and then
passed to CreateProcessW().

Closes #784.

I hope this won't have any unwanted side-effects. My tests were successfully and it solves the problem reported in #784.
The error handling seems a bit weird but actually works. The error strings are not marked as translatable on purpose because a) they are very technically and b) unlikely to happen at all.

This should fix problems when the command line or the working directory
of a command on Windows contains non-ASCII characters.
Those characters are first converted into a multi-byte string and then
passed to CreateProcessW().

Closes geany#784.
@codebrainz
Copy link
Member

Title is misleading, you're actually using the "wide" variant of CreateProcess() which is what we should be using everywhere automatically by defining UNICODE. I wonder what use cases we have for using the pure ASCII versions of any Win32 API functions?

@codebrainz
Copy link
Member

For fun, I converted (most of) the explicit wide stuff to be agnostic of Win32 encoding, and defined UNICODE so all API functions use Unicode/wide:

https://gist.github.com/codebrainz/fcadf6400f5f4c7263d3

(note: not tested, except that it builds)

message = g_strdup_printf(
"Failed to convert working directory to multi-byte string: %s", err);
g_free(err);
failed = ""; /* report the message only */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I read the error handler below correctly, it would be shorter, simpler and more consistent to just do

failed = "convert working directory to multi-byte string";
goto leave;

and let the common handler build the full message.

@b4n
Copy link
Member

b4n commented Dec 14, 2015

Apart from this, my limited WINAPI knowledge suggests it's mostly OK. Though, the MSDN docs states that MultiByteToWideWhar can also return U+FFFD on invalid input:

If the input byte/char sequences are invalid, returns U+FFFD for UTF encodings.

Maybe this should be handled too?

@zhekov if you don't mind, can we get your opinion on this?

@b4n
Copy link
Member

b4n commented Dec 14, 2015

@codebrainz we probably indeed never should use non-wide versions. Though,is there any harm in specifically using the wide version? e.g. what do we gain by defining UNICODE and using the non-W names?
Though, I guess I like the MultiByteToWideChar() wrapper, although I don't understand the use case for the non-UNICODE code path, and also find it weird that it doesn't use the locale codepage rather than ASCII? (maybe it's what it's supposed to do for non-W functions, not quite sure)

@codebrainz
Copy link
Member

Though, I guess I like the MultiByteToWideChar() wrapper, although I don't understand the use case for the non-UNICODE code path

With UNICODE force-defined as it has, there's isn't much of one. But it's possible (and easy) to support both the ASCII and Wide versions with minimal difference in code, so I did in that gist.

also find it weird that it doesn't use the locale codepage rather than ASCII? (maybe it's what it's supposed to do for non-W functions, not quite sure)

IIUC all functions used to be ASCII, then in Windows NT they added support for Unicode, so they added a DoFooA() function for ASCII, DoFooW() function for Wide/Unicode, and have wrapper macros DoFoo() which is #defined to one of those based on whether UNICODE is defined (likewise for type TCHAR and related types and macros).

@b4n
Copy link
Member

b4n commented Dec 14, 2015

@codebrainz okay… though, for the non-UNICODE code path, does it really want ASCII, or rather the common Winodws-ish locale code pages? And will the ? really work, too? (can be, I just have no clue and it looks odd)

@zhekov
Copy link
Member

zhekov commented Dec 14, 2015

glib uses CreateProcessW, and because of that, or of some improper character handling, FiF with non-UTF8 text often fails with "Invalid byte sequence in conversion input" (this is documented in spawn.c). So my first suggestion for this PR would be to test FiF with texts in various encodings. The most likely reason for this failure is as this:

CreateProcessA simply passes a sequence of bytes, and grep (make, gcc, whatever) receives them as-is.

For CreateProcessW, the chars are first converted to UNICODE, and then back to 8-bit, since grep (and most console programs, especially UNIX tools) expect char *_argv, not wchar_t *_argv. Both conversions may fail, if at least 1 byte is not part of the current 8-bit code page.

So, W is not a magic bullet.

For #784: the FiF utf8 dir is "converted" to locale with Geany's utils_get_locale_from_utf8(), which is specifically defined to do nothing under Win32. Unless the directory is in UTF-8 encoding, instead of "ANSI", or "UNICODE-not-convertable-to-ANSI" (the standard encodings under Win32), I would suspect this PR will not fix the problem.

g_spawn_() specifically expects all text to be passed as UTF-8. Which seems like a good idea, but as soon as you pass any text that is convertable to UNICODE, but not to the current code page, the second W conversion (UNICODE to grep's etc. char *_argv) will fail - or maybe produce strings some replacement characters, so that grep (and the other tools) will be run after all. Which is even worse.

TL;DR:

The quick fix will be to convert the directory with g_locale_from_utf8(), detect any unconvertable directory names (should be rare), and fail early.

The full fix... That will be to implement a proper locale file name support, as much as it's possible under glib (our locale support under POSIX is badly broken). But that is not an easy thing to do, and will require a lot of work.

As long as most of the tools started by Geany use char *_argv and not wchar_t *_argv, it would be better to avoid UNICODE, and stick to 8-bit.


For fun - a nice excerpt from the g_spawn*() documentation:

"For C programs built with Microsoft's tools it is enough to make the [spawned] program have a wmain() instead of main() [1]. wmain() has a wide character argument vector as parameter. At least currently, mingw doesn't support wmain(), so if you use mingw to develop the spawned program, it should call g_win32_get_command_line() to get arguments in UTF-8."

Lovely. :) Especially considering that 99.9% of the development tools do not use glib.

[1] completely non-standard, of course

@codebrainz
Copy link
Member

does it really want ASCII, or rather the common Winodws-ish locale code pages

Naw my bad, it's "ANSI" (code pages) for "A" not specifically "ASCII".

And will the ? really work, too? (can be, I just have no clue and it looks odd)

It works if it was ASCII, it just replaces all characters > 127 with a question mark (arbitrary replacement character). The code really makes no sense for non ASCII, so I think it would be best to just drop those non-wide code paths.

When I get some time I'll try to put together a proper PR instead of hijacking this one :)

@eht16
Copy link
Member Author

eht16 commented Dec 15, 2015

Sorry for the wrong subject, yes, it should have been wide and not multi-byte.

To the rest, to my limited knowledge of this stuff, @codebrainz's Gist looks more comprehensive.

@zhekov you might be correct and I don't want to argue on that topic because of too limit knowledge on my end, still, the code of this PR worked for my tests with FIF.

Anyway, I'll close this PR and hope @codebrainz or @zhekov maybe provide a better one.

@eht16 eht16 closed this Dec 15, 2015
@zhekov
Copy link
Member

zhekov commented Dec 15, 2015

Reminder: I'm not using Geany any more.

That being said, implementing reasonable locale support for Geany, both POSIX and Win~1, may be an interesting task by itself. Though it'll require an explanation of what is wrong on the ML, then a discussion how much locale should be supported, a plan how to implement it without breaking too much plugin compatibility, and a significant amount of work and testing.

(Note: by "reasonable" I don't mean Win~1 non-ASCII names which are not convertible to "ANSI", because most of the spawned tools won't support them anyway, so the W family of calls will not be required.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

On Windows cannot execute grep tool "grep" if target path contain non ascii folder name
4 participants