Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 14474 - Internally use UTF-8 strings for Windows #4602

Closed
wants to merge 1 commit into from

Conversation

davispuh
Copy link
Contributor

This is a quick hack for Issue 14474
Decode UTF-8 to windows Wide string and then encode it to default windows ANSI code page.

Best should use Wide WinAPI functions and then wouldn't need to encode it to ANSI code page but that's quite a lot of work... Note that not all UTF-8 characters can be encoded to ANSI code page. Thus it's possible to have some paths/names that can't be accessed with ANSI functions.

@MartinNowak
Copy link
Member

What if someone wrote that file by hand not using dub? Couldn't it already be in ACP encoding? Maybe we should use a file BOM.

@davispuh
Copy link
Contributor Author

You're already writing D code in UTF-8 and thus using same editor you would also save this file in UTF-8. I think there isn't any good reason for it to be in ANSI code page, especially because it varies by windows locale and IMO would cause more problems.

@@ -134,6 +135,15 @@ bool response_expand(size_t *pargc, const char ***pargv)

buffer = (char *)f.buffer;
bufend = buffer + f.len;
#if _WIN32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need comment explaining what is happening here.

@WalterBright
Copy link
Member

Could use a test case.

@MartinNowak
Copy link
Member

You can get this in a shell script like this, we also run them on Windows.
https://github.com/D-Programming-Language/dmd/blob/a5bb5a092d8917776e3d5be5a70a3ddddee48326/test/runnable/link14198b.sh

@davispuh davispuh changed the title Issue 14474 - In response_expand, decode file as UTF-8 for Windows Issue 14474 - Internally use UTF-8 strings for Windows Apr 24, 2015
@davispuh
Copy link
Contributor Author

Created a test for this, but it would have failed if windows default locale (ANSI code page) couldn't support that charset.
So I looked into how to implemented proper support so that it would always pass no matter what's window's locale and I actually managed to do that :)
Weren't needed a lot changes like I thought at first, basically idea is that we simply always store UTF-8 strings in char * , then whenever interacting with external sources (such as WinAPI functions) need to encode/decode to/from Wide stings and use Unicode WinAPI. This way we've basically full Unicode support for path/file names and error messages while still can work with char * like it would be in ASCII.

It compiles and works fine with both MSVC and DMC on Windows. No idea about Linux. Also looks like need some extra library for ddmd linking but I'm not sure how magicport works.

In action
screenshot

Whenever interacting with external sources (eg. WinAPI functions)
then encode/decode to/from Wide strings

Fix Issue 14474
@yebblies
Copy link
Member

Also looks like need some extra library for ddmd linking but I'm not sure how magicport works.

What's the error?

@davispuh
Copy link
Contributor Author

When linking

ddmd.obj(ddmd)
 Error 42: Symbol Undefined __d_assert
ddmd.obj(ddmd)
 Error 42: Symbol Undefined __d_arraycopy
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D15TypeInfo_Struct6__vtblZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D6object10_xopEqualsFxPvxPvZb
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D14TypeInfo_Const6__vtblZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D16TypeInfo_Pointer6__vtblZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D10TypeInfo_a6__initZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D10TypeInfo_m6__initZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D10TypeInfo_i6__initZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D13TypeInfo_Enum6__vtblZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D10TypeInfo_k6__initZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined __d_framehandler
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _stderr
ddmd.obj(ddmd)
 Error 42: Symbol Undefined __d_local_unwind2
ddmd.obj(ddmd)
 Error 42: Symbol Undefined __memset128
ddmd.obj(ddmd)
 Error 42: Symbol Undefined __d_arraybounds
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _getErrno
ddmd.obj(ddmd)
 Error 42: Symbol Undefined __wmkdir@4
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D10TypeInfo_e6__initZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D10TypeInfo_l6__initZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D10TypeInfo_h6__initZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D10TypeInfo_b6__initZ
ddmd.obj(ddmd)
 Error 42: Symbol Undefined __wputenv@4
ddmd.obj(ddmd)
 Error 42: Symbol Undefined __wgetenv@4
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _stdout
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _setErrno
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _iswspace@4
ddmd.obj(ddmd)
 Error 42: Symbol Undefined __wspawnv@12
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _ShellExecuteA@24
ddmd.obj(ddmd)
 Error 42: Symbol Undefined __memset16
ddmd.obj(ddmd)
 Error 42: Symbol Undefined _D14TypeInfo_Class6__vtblZ
OPTLINK : Warning 134: No Start Address

@yebblies
Copy link
Member

Looks like it can't find druntime properly.

@@ -15,6 +15,7 @@
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <locale.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dear gawd, please, no locale.h! It's a giant bug.

@WalterBright
Copy link
Member

If we're going to support Unicode in dmd source, we should do it right, not a quick hack. I suggest doing it the same way Phobos does for D - everything is UTF-8, and conversions to/from the Windows 'W' functions is done when those Windows functions are called.

@WalterBright
Copy link
Member

Note that not all UTF-8 characters can be encoded to ANSI code page. Thus it's possible to have some paths/names that can't be accessed with ANSI functions.

I.e. we need to do this right or not bother, and we know how to do it right. Phobos does it right.

@davispuh
Copy link
Contributor Author

That was written for first implementation, since then I've implemented it much better. Also this is about Unicode paths when specifying source filenames and not about it's content. Please review PR changeset, but there are still some issues, I'll work on them a bit later.

@WalterBright
Copy link
Member

The code looks good, but it should be moved to port.h and port.c. dgetenv should be Port::getenv, for example. Port is where the reimplementations of missing or buggy platform features go.

@ibuclaw
Copy link
Member

ibuclaw commented Jan 14, 2018

What's the status of this now that we've switch from C++ to D? Does declaring extern(C) int wmain() even work on Windows?

@ibuclaw ibuclaw self-assigned this Jan 14, 2018
@dlang-bot dlang-bot removed the stalled label Jan 14, 2018
@rainers
Copy link
Member

rainers commented Jan 14, 2018

Does declaring extern(C) int wmain() even work on Windows?

Not sure about dmc, but msc should be fine. If we want utf8 command line arguments we should use D main instead IMO.

Most of the file i/o should already be fixed by the long-filename-fixes (but assumes ANSI), but sources of trouble for non-ANSI characters are sensible console output and invocation of the linker (MS link understands response files with BOM, but optlink is limited in that regard).

@ibuclaw
Copy link
Member

ibuclaw commented Jan 14, 2018

OK, so speculatively closing then. @davispuh if you think this is still a problem, please rebase and reopen.

@davispuh
Copy link
Contributor Author

Original issues are still present, several places use Windows ANSI API instead of Wide API. As well as getenv is used which doesn't support Unicode characters.

Some while ago I actually rebased this and rewrote some parts from C to D as it seemed easier. This way I changed char * to D string in a lot of places. I only need to rebase it again, split it in several commits and fully test.
For some reason, I haven't been able to run dmd tests locally on Linux. Not really sure how environment is supposed to be set up.

This is how that patch looks atm 38 files changed, 1497 insertions(+), 2044 deletions(-)

@JinShil
Copy link
Contributor

JinShil commented Jan 14, 2018

For some reason, I haven't been able to run dmd tests locally on Linux. Not really sure how environment is supposed to be set up.

I suspect this is due to Linux now requiring "-fPIC" for builds. I run into this problem also. See #7420 and #7427. I currently apply #7420, and then I can build and run all but 2 or 3 tests. My understanding is @wilzbach is working on adding hardened Linux "machines" to our CIs (#7579), but I don't clearly understand what the roadmap is.

@@ -710,7 +710,7 @@ char *file_8dot3name(const char *filename)
char *buf;
int i;

h = FindFirstFile(filename,&fileinfo);
h = FindFirstFileA(filename,&fileinfo);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*A versions are basically deprecated junk on Windows. Do not use unless absolutely needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S.: NT kernel is UTF-16 throughout.

@ibuclaw ibuclaw removed their assignment Mar 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants