Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

windows:curl -F filename error when i use chinese #10261

Closed
UnicornZhang opened this issue Jan 9, 2023 · 13 comments
Closed

windows:curl -F filename error when i use chinese #10261

UnicornZhang opened this issue Jan 9, 2023 · 13 comments
Labels
cmdline tool unicode Unicode, code page, character encoding Windows Windows-specific

Comments

@UnicornZhang
Copy link

UnicornZhang commented Jan 9, 2023

I want to upload file by curl ,but i get the file name error and missing file suffix

curl/libcurl version 7.87.0

curl  'http://10.128.6.26:3030/package/publish' --form 'projectName="electron-test"'  --form 'packageFile=@"C:\Users\Administrator\Desktop\test-upload\测试中文-0.1.3-win.exe"'

i get file name '测试中文-0.1.3-w'

In version 7.83.1

I get file name '测试中文-0.1.3-win.exe'

System:win10

@jay jay added cmdline tool Windows Windows-specific labels Jan 9, 2023
@jay
Copy link
Member

jay commented Jan 9, 2023

Please give us the version information curl -V from both versions.

@UnicornZhang
Copy link
Author

@jay error version:curl 7.87.0 (x86_64-w64-mingw32) libcurl/7.87.0 OpenSSL/1.1.1s (Schannel) zlib/1.2.13 brotli/1.0.9 zstd/1.5.2 libidn2/2.3.3 libpsl/0.21.1 (+libidn2/2.3.1) libssh2/1.10.0 nghttp2/1.51.0
Release-Date: 2022-12-21
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz MultiSSL NTLM PSL SPNEGO SSL SSPI threadsafe TLS-SRP UnixSockets zstd

another version: curl 7.83.1 (Windows) libcurl/7.83.1 Schannel
Release-Date: 2022-05-13
Protocols: dict file ftp ftps http https imap imaps pop3 pop3s smtp smtps telnet tftp
Features: AsynchDNS HSTS IPv6 Kerberos Largefile NTLM SPNEGO SSL SSPI UnixSockets

@jay
Copy link
Member

jay commented Jan 9, 2023

I can reproduce this in a Windows 7 VM when I change the region and locale (including non-Unicode locale) to traditional Chinese and use the mingw-w64 official curl builds after 7.83.1_2. With the same locale settings I have Visual Studio builds of 7.87.0, both no character set and multibyte character set, that will not reproduce in that VM with 7.87.0. I notice the Chinese characters are multibyte in the ansi codepage so maybe that has something to do with it, even though I can't reproduce with VS builds.

This is the last good version:

curl 7.83.1 (x86_64-pc-win32) libcurl/7.83.1 OpenSSL/3.0.2 (Schannel) zlib/1.2.12 brotli/1.0.9 libidn2/2.3.2 libssh2/1.10.0 nghttp2/1.47.0 libgsasl/1.10.0
Release-Date: 2022-05-11
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS brotli gsasl HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz MultiSSL NTLM SPNEGO SSL SSPI TLS-SRP UnixSockets


POST / HTTP/1.1
Host: 192.168.1.80:8000
User-Agent: curl/7.83.1
Accept: */*
Content-Length: 338
Content-Type: multipart/form-data; boundary=------------------------dc8b9491458271bb

--------------------------dc8b9491458271bb
Content-Disposition: form-data; name="projectName"

electron-test
--------------------------dc8b9491458271bb
Content-Disposition: form-data; name="packageFile"; filename="¦¦¦¦¦¦¦¦-0.1.3-win.exe"
Content-Type: application/octet-stream

test
--------------------------dc8b9491458271bb--

This is the first bad version:

curl 7.84.0 (x86_64-pc-win32) libcurl/7.84.0 OpenSSL/3.0.4 (Schannel) zlib/1.2.12 brotli/1.0.9 libidn2/2.3.2 libssh2/1.10.0 nghttp2/1.48.0 ngtcp2/0.6.0 nghttp3/0.5.0 libgsasl/1.10.0
Release-Date: 2022-06-27
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS brotli gsasl HSTS HTTP2 HTTP3 HTTPS-proxy IDN IPv6 Kerberos Largefile libz MultiSSL NTLM SPNEGO SSL SSPI threadsafe TLS-SRP UnixSockets


POST / HTTP/1.1
Host: 192.168.1.80:8000
User-Agent: curl/7.84.0
Accept: */*
Content-Length: 334
Content-Type: multipart/form-data; boundary=------------------------c33fdc604f32fdb5

--------------------------c33fdc604f32fdb5
Content-Disposition: form-data; name="projectName"

electron-test
--------------------------c33fdc604f32fdb5
Content-Disposition: form-data; name="packageFile"; filename="¦¦¦¦¦¦¦¦-0.1.3-win"
Content-Type: application/octet-stream

test
--------------------------c33fdc604f32fdb5--

The ¦ are multibyte Chinese characters that socat can't print because the server is running in English. The same number of bytes in the chinese characters both times.

@vszakats any idea?

@vszakats
Copy link
Member

vszakats commented Jan 9, 2023

I cannot spot anything between 7.83.1_1 and 7.83.1_2 that might affect character handling (skimming through the 100 commits, which are mainly HTTP/3 / LibreSSL prep work, switching libssh2 builds to autotools, "unified" packaging, and some other unrelated things):
curl/curl-for-win@ce495c3...70eb533

7.83.1_2 changed packaging to the single-zip style (curl/curl-for-win@3182733).

It also enabled -O3 with curl, though this later turned out to be a no-op and had to be redone because the upstream Makefile.m32 overrode it to -O2 anyway.

So I cannot see anything in this revision that might affect Chinese text.

7.83.1_3 replaced OpenSSL with openssl-quic, adding HTTP/3 support. That seems even less related:
curl/curl-for-win@70eb533...1a0412d

Though rereading the thread makes it unclear if this became an issue between 7.84.0 and 7.83.1, or between two distinct curl-for-win builds of 7.83.1. If the former, we need to turn to the curl source code. It's also not very likely that this is curl-for-win-specific, though if we have the regression between two revisions, I can look it up again.

@vszakats
Copy link
Member

vszakats commented Jan 9, 2023

My generic guess would be this:

curl on Windows doesn't have Unicode support by default. In these builds all strings are considered raw-bytes and passed around as-is (*). This makes things appear to work in certain practical cases. For some this may be perceived as curl having special support for certain codepages/encodings. The reality though that it is just a happy coincidence.

(*) But, curl has certain places (and/or dependencies) which may steer off the track of the optimistic "as-is" string handling, and doing string converstion/manipulation, which instantly breaks the happy coincidences by assuming certain input formats and spitting out certain output formats. It's enough to call any Win32 API function with a W ending (e.g. WriteConsoleW() in src/tool_cb_wrt.c).

This is made even more complex in curl due to using both CRT functions and the Win32 API directly, each with potentially different encoding requirement or Unicode support. (And even more complexity comes when interfacing with the dependencies curl supports.)

Speaking of curl with Unicode support enabled: This is the correct path, but as of today, the level of support is just not enough to cover all cases. (And before we could cover all cases, we'd need to clear codepage requirements for each string curl is accepting or returning). The downside is that this mode breaks all the use-cases which work "correctly" by happy coincidence in non-Unicode mode, because of the added conversions and heavier use of Unicode-enabled functions needing them.

Till these fundamental issues are solved, iterations of these non-ASCII issues will keep popping up.

The solution is complex. Even if fully solved, with Unicode mode finished and enabled by default, it will inherently result in fallouts, because the old "happy coincidence" cases will start to break and will require correction to work as expected in a Unicode-enabled environment. (even more so for libcurl API users)

@Cherish98
Copy link
Contributor

@vszakats I believe it is caused by commit 68fa9bf in which HAVE_BASENAME is set to 1. Before that, it used the curl's version (Curl_basename()) which treated strings as raw-bytes and passed around as-is. The function is called by strippath() inside curl_mime_filedata() to set filename:

curl/lib/mime.c

Lines 1437 to 1445 in 38262c9

/* As a side effect, set the filename to the current file's base name.
It is possible to withdraw this by explicitly calling
curl_mime_filename() with a NULL filename argument after the current
call. */
base = strippath(filename);
if(!base)
result = CURLE_OUT_OF_MEMORY;
else {
CURLcode res = curl_mime_filename(part, base);

With mingw-w64's basename(), you'll get the truncated name 测试中文-0.1.3-win from:

#include <libgen.h>
#include <stdio.h>
int main(){
    char s[] = "测试中文-0.1.3-win.exe";
    return puts(basename(s));
}

@vszakats
Copy link
Member

vszakats commented Feb 11, 2023

@Cherish98: Nice catch! That commit syncs the HAVE_BASENAME build setting for GNU Make/.m32/.mk (MinGW-w64) builds with CMake/autotools logic, which autodetected and enabled this setting prior to that patch. In practice, it means all CMake/autotools builds were broken before, official curl Windows builds got broken with curl/curl-for-win@1dc206c (went live with curl/curl-for-win@cb966af as binary release 7.83.1_4) and all Makefile.mk/Makefile.m32 builds broke with that patch.

If we can confirm this as the root cause, the fix would be to ignore HAVE_BASENAME for all build methods for WIN32 and always stick to the local implementation (or force-disable this setting for all build methods).

Can you make a curl build without HAVE_BASENAME and confirm that it fixes the problem?

Correction: Native MSVC builds were never affected by that patch.

@vszakats
Copy link
Member

Untested patch:

diff --git a/lib/curl_setup.h b/lib/curl_setup.h
index 2eb9697fd..fcdfe3ca2 100644
--- a/lib/curl_setup.h
+++ b/lib/curl_setup.h
@@ -838,6 +838,10 @@ int getpwuid_r(uid_t uid, struct passwd *pwd, char *buf,
 #define USE_HTTP3
 #endif
 
+#if defined(HAVE_BASENAME) && defined(WIN32)
+#undef HAVE_BASENAME
+#endif
+
 #if defined(USE_UNIX_SOCKETS) && defined(WIN32)
 #  if defined(__MINGW32__) && !defined(LUP_SECURE)
      typedef u_short ADDRESS_FAMILY; /* Classic mingw, 11y+ old mingw-w64 */

vszakats added a commit to curl/curl-for-win that referenced this issue Feb 11, 2023
@vszakats
Copy link
Member

HAVE_BASENAME is enabled is official builds since 7.83.1_4, but Jay reproduced the issue with 7.83.1_2. This suggests a different cause or something in addition to this one.

@vszakats
Copy link
Member

vszakats commented Feb 11, 2023

Test binaries with disabled HAVE_BASENAME here:
https://ci.appveyor.com/project/curlorg/curl-for-win/build/artifacts

@Cherish98
Copy link
Contributor

I tested the binaries with HAVE_BASENAME disabled, and it indeed has fixed the issue.

@vszakats
Copy link
Member

Thanks for your test @Cherish98! I'll make a PR of that patch soon.

@vszakats
Copy link
Member

vszakats commented Feb 11, 2023

The mingw-w64 basename() implementation, for reference: https://sourceforge.net/p/mingw-w64/mingw-w64/ci/master/tree/mingw-w64-crt/misc/basename.c

vszakats added a commit to vszakats/curl that referenced this issue Feb 11, 2023
The `basename()` implementation provided by mingw-w64 [1] makes
assumptions about the encoding of its input and thus may break
non-ASCII strings.

`basename()` was auto-detected with CMake, autotools and since
68fa9bf (2022-10-13), also in
`Makefile.mk` after syncing its behaviour with the mainline build
methods. A similar patch broke official Windows builds earlier in
release 7.83.1_4 (2022-06-15).

This patch forces all Windows builds to use curl's internal
`basename()` implementation to avoid such problems.

Ref: https://sourceforge.net/p/mingw-w64/mingw-w64/ci/master/tree/mingw-w64-crt/misc/basename.c

Reported-by: UnicornZhang on Github
Assisted-by: Cherish98 on Github

Fixes curl#10261
Closes #xxxxx
vszakats added a commit to vszakats/curl that referenced this issue Feb 12, 2023
The `basename()` implementation provided by mingw-w64 [1] makes
assumptions about input encoding and thus may break with non-ASCII
strings.

`basename()` was auto-detected with CMake, autotools and since
68fa9bf (2022-10-13), also in
`Makefile.mk` after syncing its behaviour with the mainline build
methods. A similar patch for curl-for-win broke official Windows
builds earlier, in release 7.83.1_4 (2022-06-15).

This patch forces all Windows builds to use curl's internal
`basename()` implementation to avoid such problems.

[1]: https://sourceforge.net/p/mingw-w64/mingw-w64/ci/master/tree/mingw-w64-crt/misc/basename.c

Reported-by: UnicornZhang on Github
Assisted-by: Cherish98 on Github

Fixes curl#10261
Closes curl#10475
vszakats added a commit to vszakats/curl that referenced this issue Feb 12, 2023
The `basename()` [1][2] implementation provided by mingw-w64 [3] makes
assumptions about input encoding and thus may break with non-ASCII
strings.

`basename()` was auto-detected with CMake, autotools and since
68fa9bf (2022-10-13), also in
`Makefile.mk` after syncing its behaviour with the mainline build
methods. A similar patch for curl-for-win broke official Windows
builds earlier, in release 7.83.1_4 (2022-06-15).

This patch forces all Windows builds to use curl's internal
`basename()` implementation to avoid such problems.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/basename.html
[2]: https://www.man7.org/linux/man-pages/man3/basename.3.html
[3]: https://sourceforge.net/p/mingw-w64/mingw-w64/ci/master/tree/mingw-w64-crt/misc/basename.c

Reported-by: UnicornZhang on Github
Assisted-by: Cherish98 on Github

Fixes curl#10261
Closes curl#10475
@vszakats vszakats added the unicode Unicode, code page, character encoding label Feb 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmdline tool unicode Unicode, code page, character encoding Windows Windows-specific
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants