curl can't open Unicode filenames in Windows #345
Can you give an example including logs? |
I know I've heard of this issue before but I can't find it. For example, a UTF-8 encoded batch file like this won't work:
65001 is the UTF-8 codepage, and the failing call shown in the trace is `"19","ntdll.dll","NtCreateFile + 0x12","0x773b0112","C:\Windows\SysWOW64\ntdll.dll"`. The CRT-specific locale (or something like it) might need to be changed, or we could use the UTF-16 encoded version of the command line and work with that and `_wfopen`, maybe. |
WinXP SP2 RUS (codepage for non-Unicode applications: 1251).
Command line used: `lflua.exe 1.lua` (lflua.exe is a Lua 5.1 interpreter from the Unicode FAR3, http://www.farmanager.com/index.php?l=en). A file named `Read Me.txt` transfers fine. |
I agree that WinMain (or perhaps GetCommandLineW) and _wfopen look like the way to address this. If it's helpful to include the filename in logging/debug output, it may also mean changing the logic here: https://github.com/bagder/curl/blob/master/lib/curl_multibyte.c#L25 to get these functions whenever we build for Windows. I imagine this is a big enough blob to bite off at once, but I am tempted to mention that if we want support for long (more than roughly 260 characters) file names on Windows, _wfopen isn't sufficient and we need to drop down to CreateFile/ReadFile/WriteFile and deal with the joy of HANDLEs. See https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx#maxpath for some of the details, and the sketch below. |
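A minimal sketch of that fallback, under stated assumptions: `open_long_path` is a made-up helper (not anything curl ships), the caller supplies an absolute path with backslashes, and error handling is reduced to the essentials.

```c
#include <windows.h>
#include <wchar.h>
#include <stdlib.h>

/* Open a file whose absolute path may exceed MAX_PATH by using the
   "\\?\" prefix with the wide-char API. The prefix disables normal
   path parsing, so the path must be absolute, use backslashes, and
   contain no "." or ".." components. */
static HANDLE open_long_path(const wchar_t *abs_path)
{
  size_t len = wcslen(abs_path);
  wchar_t *buf = malloc((len + 5) * sizeof(wchar_t));
  HANDLE h;
  if(!buf)
    return INVALID_HANDLE_VALUE;
  wcscpy(buf, L"\\\\?\\");      /* prepend the long-path prefix */
  wcscpy(buf + 4, abs_path);
  h = CreateFileW(buf, GENERIC_READ, FILE_SHARE_READ, NULL,
                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
  free(buf);
  return h;
}
```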
Another issue to tackle (if this is a concern) is how to stay compatible with Windows versions that do not natively support the UNICODE ("WIDE") API variants. These are Win95/Win98/WinME, and they will either need a special build that keeps using the non-WIDE ("ANSI") API variants, or use the unicows.dll layer to make them work transparently with the WIDE ones. Not sure how to handle this when dealing with C RTL functions. Supporting these old versions certainly needs a C compiler that is also compatible with them; MinGW is, and MS Visual Studio 2005 or older are (plus most other 3rd party C compilers). Another potentially interesting note is that the WinCE OS only supports the "WIDE" APIs (with some minor exceptions, probably not relevant in the context of curl). |
The Microsoft-recommended method of handling Unicode/ASCII compatibility is to put

```c
#define UNICODE 1
#define _UNICODE 1
```

in a header included by every source file, then use the un-suffixed versions of the functions. Other notes:
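For illustration only (this block is mine, not from the thread), the effect of those defines: the un-suffixed Win32 names resolve to their wide variants, so string literals need the `TEXT()` wrapper.

```c
#define UNICODE 1
#define _UNICODE 1
#include <windows.h>
#include <tchar.h>

int main(void)
{
  /* With UNICODE defined, CreateFile expands to CreateFileW and
     takes wide strings; TEXT() turns the literal into L"...". */
  HANDLE h = CreateFile(TEXT("example.txt"), GENERIC_READ,
                        FILE_SHARE_READ, NULL, OPEN_EXISTING,
                        FILE_ATTRIBUTE_NORMAL, NULL);
  if(h != INVALID_HANDLE_VALUE)
    CloseHandle(h);
  return 0;
}
```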
|
I'll welcome a patch from someone, one that has been tested with a fair degree of success (on Windows). |
The NUL byte is a unique separator. When using NUL, filenames don't need to be escaped and we can handle all kinds of special characters in file names. That said, there is no Windows support for these characters at the moment.
- `sort -u` thinks that 'a' equals 'ä' and therefore omits 'ä'.
- `curl` fails to open files with unicode in their name curl/curl#345
I was kind of hoping someone would pick up on this, but it's about that time it goes in the TODO. To see how this might work I wrote a draft with the idea to convert the command line arguments to UTF-8 (see the discussion in #637) so we can continue to pass the user input around as UTF-8. Things like this work:
In the first one the host is an IDN: it's converted to UTF-8, and if a WinIDN build is used it's later converted to punycode (xn--h1alffa9f.net). This has some problems though, like the whole URL now being UTF-8. Output to the screen is not UTF-8, so any command line input that is echoed to the screen shows up incorrectly:
The way I did it was sloppy, just to get it working to see how it might work. I'm using a global variable in the DLL. The draft is here: |
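A sketch of that conversion step (my illustration, not the actual draft; `get_utf8_argv` is a made-up name, and malloc error checks are trimmed): fetch the UTF-16 command line, split it, and re-encode each argument as UTF-8 so the rest of the code can keep passing `char *` around.

```c
#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW; link with shell32 */
#include <stdlib.h>

/* Convert the process command line to a UTF-8 argv[].
   Caller frees each string and the array itself. */
static char **get_utf8_argv(int *argc)
{
  wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), argc);
  char **argv;
  int i;
  if(!wargv)
    return NULL;
  argv = malloc(*argc * sizeof(char *));
  for(i = 0; i < *argc; i++) {
    /* first call sizes the buffer, second call converts */
    int n = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                NULL, 0, NULL, NULL);
    argv[i] = malloc(n);
    WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, argv[i], n,
                        NULL, NULL);
  }
  LocalFree(wargv);
  return argv;
}
```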
Since we don't have time to work on this right now it's been added to KNOWN_BUGS. 9f740d3 |
I'm willing to implement a solution for curl.
My suggestion:
Later, smarter processing can be added: autodetect text encoding for Web and other servers according to the standard (several levels of detection, in priority order: HTTP header, HTML header, direct detection from the first few bytes; see the sketch below) and automatically convert GET and POST data to the required encoding (which must be the same as the encoding of the page containing the HTML form). In any case, it must be documented how libcurl deals with encoding. |
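As a sketch of the lowest-priority level mentioned there (detection from the first few bytes), a BOM check could look like this; `guess_encoding_from_bom` is hypothetical, not a curl function.

```c
#include <string.h>

/* Guess an encoding from a byte-order mark at the start of the body.
   Returns NULL when there is no BOM, in which case the higher-priority
   sources (HTTP header, HTML header) have to decide. */
static const char *guess_encoding_from_bom(const unsigned char *p,
                                           size_t n)
{
  if(n >= 3 && !memcmp(p, "\xEF\xBB\xBF", 3))
    return "UTF-8";
  if(n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
    return "UTF-16LE";
  if(n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
    return "UTF-16BE";
  return NULL;
}
```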
Then let's take it step-by-step:
|
@bagder OK, where and how can we start documenting the current behavior? |
I'd say probably in the |
Currently, encoding in libcurl/curl is some kind of mess. If we document it in the curl documentation for end users, users can start modifying their programs and scripts to match the updated documentation. |
I'm fine with that too (and weirdly enough I don't think it is a "mess" ;-) |
@bagder Yep, curl is the best. 😃 Just need a little bit improvement. 😉 |
I am using libcurl on a new project and came up with a solution that replaces all fopen/open calls with their wchar_t versions, for anyone looking for a quick and dirty hack. Note that if using OpenSSL with curl, some file handling is done by OpenSSL itself. I created an fopen/open macro:

```c
#define fopen _wfopen_hack
#define open _wopen_hack

__declspec(dllexport) FILE *_wfopen_hack(const char *file, const char *mode);
__declspec(dllexport) int _wopen_hack(const char *file, int oflags, ...);
```

I then added the implementation to lib/file.c (error checking excluded):

```c
#include <windows.h>
#include <fcntl.h>
#include <io.h>
#include <stdarg.h>
#include <stdio.h>

FILE *
_wfopen_hack(const char *file, const char *mode)
{
  wchar_t wfile[260];
  wchar_t wmode[32];

  /* interpret the incoming narrow strings as UTF-8 */
  MultiByteToWideChar(CP_UTF8, 0, file, -1, wfile, 260);
  MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 32);

  return _wfopen(wfile, wmode); /* wmode, not the narrow mode */
}

int
_wopen_hack(const char *file, int oflags, ...)
{
  wchar_t wfile[260];
  int mode = 0;

  /* _O_CREAT means a permission-mode argument follows */
  if(oflags & _O_CREAT) {
    va_list ap;
    va_start(ap, oflags);
    mode = (int)va_arg(ap, int);
    va_end(ap);
  }

  MultiByteToWideChar(CP_UTF8, 0, file, -1, wfile, 260);
  return _wopen(wfile, oflags, mode);
}
```

I logged to a file within those functions to ensure they were being called. I also did a |
Cool! So what's the downside with this approach? Or perhaps put differently: why aren't you suggesting this as a real pull request? |
Good question. I guess, in hindsight, I didn't find this solution all that elegant, although it does solve the issue. My solution only focused on filesystem stuff, so it may not be a 100% fix for Unicode problems. In addition, it forces the API user to use the wide versions, versus giving them control to enable/disable them. Not sure if that can cause breakage for existing applications. Ideas to make this more committable:
Andrew |
Oh yeah, the version I am actually using doesn't hard-code 260 (`MAX_PATH`) for the path buffer size. |
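For reference, the usual two-call pattern for sizing the buffer dynamically instead of hard-coding 260 (my sketch; `utf8_to_wide` is a made-up helper):

```c
#include <windows.h>
#include <stdlib.h>

/* Ask MultiByteToWideChar for the required length first, then
   allocate exactly that much. Caller frees the result. */
static wchar_t *utf8_to_wide(const char *utf8)
{
  int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
  wchar_t *w;
  if(n <= 0)
    return NULL;
  w = malloc(n * sizeof(wchar_t));
  if(w)
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, w, n);
  return w;
}
```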
@andrewchernow thanks for taking a shot at it, but I don't think it can be done in the way you propose. Unicode characters are converted to a local codepage (eg "ANSI") in Windows in a way that can be lossy. In other words, if you have some Russian Unicode and it's converted to American ANSI, then the actual glyphs, I'd posit, are lost, so when you convert it back you're not guaranteed to get the same thing.
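A quick sketch showing that loss (my illustration, with codepage 1252 standing in for "American ANSI"): `WideCharToMultiByte` reports when it had to substitute a default character.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
  /* Russian text converted to the US/Western-European ANSI codepage
     (1252): every character is replaced and the loss is flagged. */
  const wchar_t *ru = L"\u043F\u0440\u0438\u0432\u0435\u0442"; /* privet */
  char buf[64];
  BOOL lossy = FALSE;
  WideCharToMultiByte(1252, 0, ru, -1, buf, sizeof(buf), NULL, &lossy);
  printf("conversion was lossy: %s\n", lossy ? "yes" : "no"); /* yes */
  return 0;
}
```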
Have you ever tried to open files in a folder in Explorer at a depth greater than max path? It is not at all a meaningless value. In my experience some Windows API W (Unicode) functions simply do not function correctly with paths longer than max path (or some number a little larger), despite what the documentation implies.
If you use ANSI functions to manage files... then yes. NTFS stores file names as UTF-16, not ANSI. Thus, if you start with UTF-8, convert it to UTF-16, and then use a wide function to access the file system, ANSI doesn't play a role. I'm not sure where you are suggesting ANSI is injected. In my project, I was solving the issue of supplying certificate and key file paths. My application acquires them using non-ANSI functions, converts them to UTF-8 and passes them to setopt, like:
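A sketch of what such a call might look like, assuming a libcurl patched as above so narrow paths are treated as UTF-8 (`cert_utf8` and `key_utf8` are hypothetical variables holding the converted paths):

```c
#include <curl/curl.h>

/* cert_utf8/key_utf8: file paths already converted from the
   application's native wide strings to UTF-8, as described above. */
static void set_client_cert(CURL *curl, const char *cert_utf8,
                            const char *key_utf8)
{
  curl_easy_setopt(curl, CURLOPT_SSLCERT, cert_utf8);
  curl_easy_setopt(curl, CURLOPT_SSLKEY, key_utf8);
}
```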
Yes, I have. It's rather hilarious. However, your assertion that this means... Anyhow, the use cases I need are working: supplying keys/certs to setopt, uploading, downloading.... Part of why I said it "may not be a 100% fix for unicode problems". I just thought it may be useful to someone else. |
For multiplatform projects, the most correct approach is to use UTF-8 internally and convert (if needed) on input/output and for filesystem access. |
I see. I was thinking of filenames provided to the curl tool on the command line, which are encoded as ANSI as described in this issue (eg argv[1], argv[2], etc). Should we try to convert them back to Unicode via UTF-8, some information would already be gone. I had proposed earlier in the thread converting from UTF-16 (GetCommandLineW) to UTF-8, but that had other problems because Microsoft's CRT doesn't work with UTF-8 as a locale.
Yes I agree, my assertion was too much. Shortly after I wrote it I had edited it, but I suspect github sent out the e-mail update before then. |
Very true, I see what you are saying. You know you can change the codepage of the command prompt via `chcp` (eg `chcp 65001` for UTF-8).
True, but I don't think this is curl's issue. If curl is given malformed UTF-8, then stuff will break. Garbage in, garbage out. Rather than |
Any progress on this? I have the same issue. Anyone know some workaround? |
WinXP SP2, cURL 7.43.
cURL can't open a file for transmission when the file name contains characters from a code page different from the OS one. Does cURL not support Unicode?