curl can't open Unicode filenames in Windows #345

z0hm · 2015-07-14T06:36:05Z

WinXP SP2, cURL 7.43.
cURL can't open a file for transmission when the file name contains characters outside the OS code page. Does cURL not support Unicode?

dfandrich · 2015-07-14T20:22:13Z

Can you give an example including logs?

jay · 2015-07-14T21:03:46Z

I know I've heard of this issue before but I can't find it. For example a UTF-8 encoded batch file like this won't work:

chcp 65001
curl -F filedata=@И.txt http://website
curl: (26) couldn't open file "?.txt"

65001 is the UTF-8 code page and the И is UTF-8 encoded there. The same thing happens with a Cyrillic code page. In Process Monitor I can see that the error is "NAME INVALID" but I don't know why that error occurs.

"19","ntdll.dll","NtCreateFile + 0x12","0x773b0112","C:\Windows\SysWOW64\ntdll.dll"
"20","KernelBase.dll","CreateFileW + 0x35e","0x758ac5fd","C:\Windows\SysWOW64\KernelBase.dll"
"21","kernel32.dll","CreateFileW + 0x4a","0x759c3f56","C:\Windows\SysWOW64\kernel32.dll"
"22","kernel32.dll","CreateFileA + 0x36","0x759c53b4","C:\Windows\SysWOW64\kernel32.dll"
"23","msvcrt.dll","clearerr_s + 0x75b","0x76a1a310","C:\Windows\SysWOW64\msvcrt.dll"
"24","msvcrt.dll","sopen_s + 0x79","0x76a1a789","C:\Windows\SysWOW64\msvcrt.dll"
"25","msvcrt.dll","sopen_s + 0x1b","0x76a1a72b","C:\Windows\SysWOW64\msvcrt.dll"
"26","msvcrt.dll","remove + 0x137","0x76a1a628","C:\Windows\SysWOW64\msvcrt.dll"
"27","msvcrt.dll","fsopen + 0x6a","0x76a1a6c1","C:\Windows\SysWOW64\msvcrt.dll"
"28","msvcrt.dll","fopen + 0x12","0x76a1b2d6","C:\Windows\SysWOW64\msvcrt.dll"

The CRT specific locale or something might need to be changed, or use the version of the command line that's UTF-16 encoded and work with that and _wfopen, maybe.

z0hm · 2015-07-15T06:21:39Z

WinXP SP2 RUS (code page for non-Unicode applications: 1251).

lua script in utf8 1.lua:
---------------------------
local function fread(f) local h,x = io.open(f,"rb"),nil if h then x=h:read("*all"); io.close(h) end return x end 
local s1=fread("1.txt") 
local s2=fread("2.txt") 
curl ... -T '"'..s1..'"' ...
curl ... -T '"'..s2..'"' ...

---------------------------

1.txt in utf8 (65001)
--------------
Read Me.txt

--------------

2.txt in utf8 (65001)
--------------
Første Pucambù.txt

--------------

Command line used to run it: lflua.exe 1.lua

lflua.exe -- Lua 5.1 interpreter from Unicode FAR3 (http://www.farmanager.com/index.php?l=en)

Read Me.txt -- transfers fine
Første Pucambù.txt -- does not transfer; curl reports that it can't open the file

dbyron0 · 2015-07-16T23:17:27Z

I agree that WinMain (or perhaps GetCommandLineW) and _wfopen look like the way to address this. If it's helpful to include the filename in logging/debug output, it may also mean changing the logic here: https://github.com/bagder/curl/blob/master/lib/curl_multibyte.c#L25 to get these functions whenever we build for windows.

I imagine this is a big enough blob to bite off at once, but I am tempted to mention that if we want support for long (~> 260 characters) file names on windows, _wfopen isn't sufficient and we need to drop down to CreateFile/ReadFile/WriteFile and deal with the joy of HANDLEs. See https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx#maxpath for some of the details.
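As a rough sketch of the long-path point above: the extended-length form prefixes an absolute path with `\\?\` before it is handed to CreateFileW. The helper names `add_long_path_prefix` and `open_long_path_for_read` are hypothetical and not curl functions. The sketch assumes a fully qualified drive-letter path that uses backslashes; UNC paths need the different `\\?\UNC\` form and are not handled here.

```c
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: return a newly allocated copy of abs_path with the
   extended-length prefix \\?\ in front, unless it is already there.
   The caller frees the result. Returns NULL if allocation fails. */
wchar_t *add_long_path_prefix(const wchar_t *abs_path)
{
  static const wchar_t prefix[] = L"\\\\?\\";
  size_t plen = wcslen(prefix);
  size_t len = wcslen(abs_path);
  wchar_t *out;

  /* Already prefixed: produce a plain copy. */
  if(wcsncmp(abs_path, prefix, plen) == 0)
    plen = 0;

  out = malloc((plen + len + 1) * sizeof(wchar_t));
  if(!out)
    return NULL;
  wmemcpy(out, prefix, plen);
  wmemcpy(out + plen, abs_path, len + 1); /* copies the terminator too */
  return out;
}

#ifdef _WIN32
/* Hypothetical helper: open an existing file for reading through the wide
   API so that the path is not limited to MAX_PATH characters. */
HANDLE open_long_path_for_read(const wchar_t *abs_path)
{
  wchar_t *p = add_long_path_prefix(abs_path);
  HANDLE h = INVALID_HANDLE_VALUE;
  if(p) {
    h = CreateFileW(p, GENERIC_READ, FILE_SHARE_READ, NULL,
                    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    free(p);
  }
  return h;
}
#endif
```

The returned HANDLE would then be read with ReadFile and released with CloseHandle, which is the extra bookkeeping the comment refers to.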

vszakats · 2015-08-13T08:17:56Z

Not only GetCommandLineW(), but all string (filename) interactions via the Windows API should be done using the "WIDE" variant of said API functions. For example, in the above case it would mean calling CreateFileW() instead of CreateFileA(). Because this one is called via the C RTL function fopen(), _wfopen() may be used in this case. I would opt to go with the direct API calls to avoid messing with C RTL codepage and compatibility issues altogether. Another question to consider here is what UNICODE encoding should be expected by libcurl API functions that accept or return strings. For portability and to retain ABI compatibility, UTF-8 would probably be best.

Another issue to tackle (if this is a concern) is how to stay compatible with Windows versions that do not natively support UNICODE ("WIDE") API variants. These Windows versions are Win95/Win98/WinME and these will either need a special build that keeps using non-WIDE ("ANSI") API variants, or use the unicows.dll layer to make them work transparently with WIDE ones. Not sure how to handle this when dealing with C RTL functions. Supporting these old versions certainly needs a C compiler that is also compatible with them; MinGW is, and MS Visual Studio 2005 or older are (plus most other 3rd party C compilers).

Another potentially interesting note is that WinCE OS only supports the "WIDE" APIs (with some minor exceptions probably not relevant in context of libcurl). Such differences should be hidden by the C RTL layer, if used.

DemiMarie · 2015-10-02T05:22:42Z

The Microsoft-recommended method of handling Unicode/ASCII compatibility is to

#define UNICODE 1
#define _UNICODE 1

in a header included by every source file. Then use the un-suffixed versions of the functions.

Other notes:

As mentioned above, never call fopen() -- use _wfopen(). This probably needs to go in a compatibility shim layer that abstracts over this.
Any write to a file not opened by libcurl must be considered to potentially be a Windows console handle. In this case, the CRT functions must not be used, as they do not support Unicode -- one must use WriteConsoleW() and do all buffering manually. This is especially serious for the command-line curl utility.
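A minimal sketch of the console point above, assuming the text to print is already UTF-8. The function name `print_utf8` is hypothetical. It uses GetConsoleMode to tell a real console apart from a redirected file or pipe, and only in the console case converts to UTF-16 and calls WriteConsoleW; everything else is passed through as bytes.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: print a NUL-terminated UTF-8 string to stdout.
   Returns 0 on success, -1 on failure. */
int print_utf8(const char *s)
{
#ifdef _WIN32
  HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
  DWORD mode;

  /* GetConsoleMode only succeeds when the handle is a real console. */
  if(GetConsoleMode(h, &mode)) {
    /* Size query: the count includes the terminating NUL. */
    int n = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
    wchar_t *w;
    DWORD written;
    BOOL ok = FALSE;

    if(n <= 0)
      return -1;
    w = malloc((size_t)n * sizeof(wchar_t));
    if(!w)
      return -1;
    if(MultiByteToWideChar(CP_UTF8, 0, s, -1, w, n)) {
      fflush(stdout); /* keep ordering with earlier CRT output */
      ok = WriteConsoleW(h, w, (DWORD)(n - 1), &written, NULL);
    }
    free(w);
    return ok ? 0 : -1;
  }
#endif
  /* Redirected output, or not Windows: write the bytes unchanged. */
  return fputs(s, stdout) < 0 ? -1 : 0;
}
```

Passing redirected output through unchanged keeps files and pipes byte-exact, which matters for a tool whose output is often a downloaded body rather than text.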

bagder · 2015-10-05T05:34:04Z

I'll welcome a patch from someone that has been tested with a fair degree of success (on Windows).

The NUL byte is a unique separator. When using NUL, filenames don't need to be escaped and we can handle all kinds of special characters in file names. That said, there is no Windows support for these characters at the moment. - `sort -u` thinks that 'a' equals 'ä' and therefore omits 'ä'. - `curl` fails to open files with unicode in their name curl/curl#345

jay · 2016-02-11T08:58:05Z

I was kind of hoping someone would pick up on this but it's about that time it goes in the TODO. To see how this might work I wrote a draft with the idea to convert the command line arguments to UTF-8 (see the discussion in #637) so we can continue to pass the user input around as char *. When files are opened or statused then convert from UTF-8 to UTF-16 encoding. There is no way to set the locale as UTF-8 so fopen will open as UTF-8, afaict.

Things like this work:

-v --output спасти.txt http://россия.net
-v -F filedata=@спасти.txt http://example.com/

In the first one the host is an IDN; it's converted to UTF-8 and, if a WinIDN build is used, it's later converted to punycode (xn--h1alffa9f.net).

It has some problems though: the whole URL is now UTF-8, but output to the screen is not UTF-8, so any command line input that is echoed to the screen is incorrect:

* Connection #0 to host Ñ?D_Ñ?Ñ?D,Ñ?.net left intact

The way I did it was sloppy, just to get it working to see how it might work. I'm using a global variable in the DLL g_curl_tool_args_are_utf8 and also I have some duplicated code like the fopen and stat wrappers. But it seems to work in the case of filenames and hostnames.

The draft is here:
https://github.com/curl/curl/compare/master...jay:win-utf8-test?expand=1

jay · 2016-04-06T06:47:08Z

Since we don't have time to work on this right now it's been added to KNOWN_BUGS. 9f740d3

Karlson2k · 2016-04-06T09:14:11Z

I'm willing to implement a solution for curl.
But first the curl devs need to choose how they want to handle all this stuff in the curl tool and libcurl.
Currently it is not well documented how curl uses the strings it is given.
It seems that libcurl treats all strings as encoded in the "locale encoding", which is definitely not the best choice:

- depending on platform and settings, the encoding can be changed per thread, per process or system-wide, so it's not thread-safe
- some applications change the locale on the fly to "C" and back (as they need a decimal point instead of a decimal comma, or they need to change the case of US-ASCII-only symbols)
- a locale encoding can be limited: you can't convert GBK/CP936/GB18030/BIG5 text to CP1251/CP866/KOI8-R and vice versa. Text will be lost.

My suggestion:

for libcurl:
- treat all given urls as UTF-8 encoded
- treat all other text (usernames, passwords) as is and don't attempt to convert it
- output all text from remote servers as is or converted to UTF-8
for curl tool:
- convert all input urls to UTF-8
- configure libcurl or convert by itself output

Later, smarter processing can be added: autodetection of text encoding for Web and other servers according to the standards (several levels of detection in priority order: HTTP header, HTML header, direct detection from the first few bytes) and automatic conversion of GET and POST data to the required encoding (which must be the same as the encoding of the page containing the HTML form).

Anyway, it must be documented how libcurl deals with encoding.

bagder · 2016-04-06T09:18:14Z

Then let's take it step-by-step:

Document how it works now. That's important since we cannot introduce behavior changes without very careful considerations.
Then work on introducing something that can make the handling consistent between platforms.

Karlson2k · 2016-04-06T09:29:36Z

@bagder OK, where and how can we start documenting the current behavior?

bagder · 2016-04-06T09:43:34Z

I'd say probably in the curl_easy_setopt.3 and the curl.1 man pages. Perhaps a new "text encoding" section would be suitable. Or what do you think?

Karlson2k · 2016-04-06T09:51:17Z

Currently encoding in libcurl/curl is some kind of mess. If we document it in the curl documentation for end users, users may start modifying their programs and scripts to match the updated documentation.
I'd prefer to create some simple internal document (can be github issue for example) to simplify taking decisions, and then, based on decision - create PRs with code and documentation updates.

bagder · 2016-04-06T09:52:55Z

I'm fine with that too (and weirdly enough I don't think it is a "mess" ;-)

Karlson2k · 2016-04-06T09:57:01Z

@bagder Yep, curl is the best. 😃 Just need a little bit improvement. 😉

andrewchernow · 2017-06-06T11:58:59Z

I am using libcurl on a new project and came up with a solution. It replaces all fopen/open calls with their wchar_t versions, for anyone looking for a quick and dirty hack.

Note that if using openssl with curl, some file handling is done by openssl: ie. SSL_CTX_use_certificate_chain_file. However, OpenSSL properly uses _wopen. I don't know about other SSL libraries that curl supports.

I created an fopen/open macro in curl_setup.h just after including io.h (also after stdio.h).

#define fopen _wfopen_hack
#define open _wopen_hack

__declspec( dllexport ) FILE *_wfopen_hack(const char *file, const char *mode);
__declspec( dllexport ) int _wopen_hack(const char *file, int oflags, ...);

I then added the implementation to lib/file.c (excluded error checking)

FILE *
_wfopen_hack(const char *file, const char *mode)
{
    wchar_t wfile[260];
    wchar_t wmode[32];

    MultiByteToWideChar(CP_UTF8, 0, file, -1, wfile, 260);
    MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 32);

    return _wfopen(wfile, wmode);
}

int
_wopen_hack(const char *file, int oflags, ...)
{
   wchar_t wfile[260];
   int mode = 0;

   if(oflags & _O_CREAT)
   {
      va_list ap;
      va_start(ap, oflags);
      mode = (int)va_arg(ap, int);
      va_end(ap);
   }

   MultiByteToWideChar(CP_UTF8, 0, file, -1, wfile, 260);

   return _wopen(wfile, oflags, mode);
}

I logged to a file within those functions to ensure they were being called. I also did a strings check for "open" and only found my _wfopen_hack and _wopen_hack symbols.

bagder · 2017-06-06T12:19:53Z

Cool! So what's the downside with this approach? Or perhaps put differently: why aren't you suggesting this as a real pull request?

andrewchernow · 2017-06-06T13:18:08Z

Good question. I guess I didn't find this solution all that elegant (in hindsight), although it does solve the issue. My solution only focused on FS stuff, and thus may not be a 100% fix for Unicode problems. In addition, this forces the API user to use wide versions, versus giving them the control to enable/disable them. Not sure if that can cause breakage for existing applications.

Ideas to make this more committable:

- Use an open/fopen option with setopt to set a callback for file opens.
- Add a flag to curl_global_init to enable wide versions of open calls. (my favorite)
- Instead of a macro, actually change the call sites for open/fopen to a Curl_open or Curl_fopen.

Andrew

andrewchernow · 2017-06-06T13:30:42Z

Oh yeah, the version I am actually using doesn't hard code 260 for path buffer size; which is MAX_PATH on windows and is essentially a meaningless value, since windows can support path lengths up to 32767. My version queries the conversion size with an additional call to MultiByteToWideChar and then allocates the buffer.
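The size-querying variant described above could look roughly like the following sketch. The names `utf8_to_wide` and `fopen_utf8` are hypothetical, and the code assumes the caller passes valid UTF-8. Calling MultiByteToWideChar with an output size of 0 returns the number of wchar_t units needed, including the terminator when the input length is -1. On non-Windows builds the sketch simply falls back to fopen.

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to a newly
   allocated UTF-16 string. Returns NULL on invalid input or allocation
   failure; the caller frees the result. */
static wchar_t *utf8_to_wide(const char *s)
{
  /* First call only asks how many wchar_t units are required. */
  int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
  wchar_t *w;

  if(n <= 0)
    return NULL;
  w = malloc((size_t)n * sizeof(wchar_t));
  if(w && !MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n)) {
    free(w);
    w = NULL;
  }
  return w;
}
#endif

/* Hypothetical wrapper: open a file whose name is given as UTF-8. */
FILE *fopen_utf8(const char *path, const char *mode)
{
#ifdef _WIN32
  wchar_t *wpath = utf8_to_wide(path);
  wchar_t *wmode = utf8_to_wide(mode);
  FILE *f = NULL;

  if(wpath && wmode)
    f = _wfopen(wpath, wmode); /* both arguments must be wide strings */
  free(wpath);
  free(wmode);
  return f;
#else
  return fopen(path, mode);
#endif
}
```

Allocating the buffer removes the 260-character cap from the conversion step, although the CRT call underneath may still enforce its own path length limit.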

jay · 2017-06-06T17:39:03Z

@andrewchernow thanks for taking a shot at it but I don't think it can be done in the way you propose. Unicode characters are converted to a local codepage (eg "ANSI") in Windows in a way that can be lossy. In other words if you have some russian unicode and it's converted to american ansi then the actual glyphs I'd posit are lost so when you convert it back you're not guaranteed the same thing.

> the version I am actually using doesn't hard code 260 for path buffer size; which is MAX_PATH on windows and is essentially a meaningless value

Have you ever tried to open files in a folder in explorer at a depth greater than max path? It is not at all a meaningless value. In my experience some Windows API functions W (Unicode) simply do not function correctly with paths larger than max path (or some number a little larger), despite what documentation implies. ~~For that reason it's better to use max path.~~ (edit: I'm going to back away from this a bit -- it's not "better" to use max path but in my opinion there's just not much advantage to supporting longer paths, though it's fine to do so and probably a good idea, maybe Windows 10 has better support. I just take issue with it being a "meaningless" value.)

andrewchernow · 2017-06-06T18:32:29Z

> Unicode characters are converted to a local codepage (eg "ANSI") in Windows in a way that can be lossy

If you use ANSI functions to manage file...then yes. NTFS stores file names as UTF-16, not ANSI. Thus, if you start with UTF-8, convert it to UTF-16 and then use a wide function to access the file system, ANSI doesn't play a role. I'm not sure where you are suggesting ANSI is injected.

In my project, I was solving the issue of supplying certificate and key file paths. My application acquires them using non-ANSI functions, converts them to UTF-8 and passes them to setopt: like CURLOPT_PINNEDPUBLICKEY which uses an fopen call within curl. Using my solution, that fopen call becomes a _wfopen call with a wide converted path. In this case, no ANSI conversion ever occurs.

> Have you ever tried to open files in a folder in explorer at a depth greater than max path

Yes, I have. It's rather hilarious. However, your assertion that this means >MAX_PATH is somehow wrong or invalid, is a bit misguided. That path most likely points to a valid object in the file system. Punting on a request to open such a path seems like a bug; what if it was a public key or a file to upload? However, having to prefix a >MAX_PATH path with \\?\ also seems like a bug/hack ;) Windows 10 has a way to enable long paths now, I don't think the prefix is needed on win10.

Anyhow, the use cases I need are working: supplying keys/certs to setopt, uploading, downloading.... Part of why I said "may not be a 100% fix for unicode problems". I just thought it may be useful to someone else.

Karlson2k · 2017-06-06T18:42:55Z

For multiplatform projects, the most correct way is to use UTF-8 internally and convert (if needed) on input/output and for filesystem access.
For W32-only project, the most correct way is to use WCHAR/wstring internally everywhere.

jay · 2017-06-06T21:50:38Z

> In my project, I was solving the issue of supplying certificate and key file paths.

I see. I was thinking of filenames provided to the curl tool on the command line, which are encoded as ANSI as described in this issue (eg argv[1], argv[2] etc). Should we try to convert them back to Unicode via UTF-8 some information would be gone. I had proposed earlier in the thread converting from UTF-16 (GetCommandLineW) to UTF-8 but that had other problems then because Microsoft's CRT doesn't work with UTF-8 as a locale.

> However, your assertion that this means >MAX_PATH is somehow wrong or invalid, is a bit misguided. That path most likely points to a valid object in the file system.

Yes I agree, my assertion was too much. Shortly after I wrote it I had edited it, but I suspect github sent out the e-mail update before then.

andrewchernow · 2017-06-07T00:08:28Z

> command line, which are encoded as ANSI as described in this issue (eg argv[1], argv[2] etc)

Very true, I see what you are saying. You know you can change the codepage of the command prompt via chcp (change codepage) and set it to UTF-8 with chcp 65001. Older Windows versions have some display issues, but underneath the data is correct.

> Should we try to convert them back to Unicode via UTF-8 some information would be gone

True, but I don't think this is curl's issue. If curl is given malformed UTF-8, then stuff will break. Garbage in, garbage out.

Rather than GetCommandLineW, I'd suggest using the CRT wmain function. It is just like main() but from process startup, arguments and the environment are UTF-16. No ANSI conversion at all. Then you can use WideCharToMultiByte(CP_UTF8, wide_argv[X]) until the cows come home.
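The wmain suggestion above might be sketched as follows. `tool_main` is a hypothetical stand-in for the program's real entry logic, which keeps receiving `char *` arguments that are now UTF-8; here it only counts arguments containing non-ASCII bytes so there is something to observe. The sketch assumes an MSVC-style CRT; MinGW-w64 needs `-municode` for wmain to be used as the entry point.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real program body. It receives UTF-8
   arguments and returns how many of them contain non-ASCII bytes. */
int tool_main(int argc, char **argv)
{
  int i, count = 0;

  for(i = 1; i < argc; i++) {
    const unsigned char *p = (const unsigned char *)argv[i];
    for(; *p; p++) {
      if(*p >= 0x80) {
        count++;
        break;
      }
    }
  }
  return count;
}

#ifdef _WIN32
#include <windows.h>

/* CRT entry point that receives the command line as UTF-16, so no ANSI
   code page conversion happens before the program sees the arguments. */
int wmain(int argc, wchar_t **wargv)
{
  char **argv = calloc((size_t)argc + 1, sizeof(char *));
  int ok = (argv != NULL);
  int rc = 1;
  int i;

  for(i = 0; ok && i < argc; i++) {
    /* Size query first, then the actual UTF-16 to UTF-8 conversion. */
    int n = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                NULL, 0, NULL, NULL);
    argv[i] = (n > 0) ? malloc((size_t)n) : NULL;
    ok = argv[i] &&
         WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                             argv[i], n, NULL, NULL);
  }
  if(ok) {
    printf("%d non-ASCII argument(s)\n", tool_main(argc, argv));
    rc = 0;
  }
  for(i = 0; argv && i < argc; i++)
    free(argv[i]);
  free(argv);
  return rc;
}
#endif
```

On other platforms a regular main() would call `tool_main` directly, since the arguments already arrive as bytes in the locale encoding.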

harvald · 2017-12-14T12:37:40Z

Any progress on this? I have the same issue. Anyone know some workaround?


z0hm changed the title ~~cURL don't open files with umlauts in names~~ Unicode. cURL don't open files with umlauts in names. Jul 14, 2015

jay changed the title ~~Unicode. cURL don't open files with umlauts in names.~~ curl can't open Unicode files in Windows Jul 14, 2015

jay added the enhancement label Jul 14, 2015

jay changed the title ~~curl can't open Unicode files in Windows~~ curl can't open Unicode filenames in Windows Jul 14, 2015

mkllnk mentioned this issue Dec 13, 2015

git ftp push - russian letters Windows system issue git-ftp/git-ftp#209

Open

jay mentioned this issue Feb 8, 2016

Better error checking for Win32 IDN functions #637

Closed

bagder added the KNOWN_BUGS material label Apr 3, 2016

jay removed the KNOWN_BUGS material label Apr 6, 2016

jay closed this as completed Apr 6, 2016

jay mentioned this issue Apr 6, 2016

curl on Windows incorrectly handle IDN-urls #731

Closed

Karlson2k referenced this issue Apr 6, 2016

KNOWN_BUGS: #95 curl in Windows can't handle Unicode arguments

9f740d3

dpprdan mentioned this issue May 30, 2017

curl cannot handle internationalized domain names jeroen/curl#101

Closed

sergeevabc mentioned this issue Oct 22, 2017

[FIXED] Filename with Unicode characters charonn0/VT-Hash#11

Closed

blattersturm added a commit to citizenfx/curl that referenced this issue Mar 16, 2018

curl: add hack to allow Unicode paths for files

3bd8b39

see curl#345 (comment)

lock bot locked as resolved and limited conversation to collaborators May 6, 2018
