
idn: fix libidn2 with windows unicode builds #7246

Closed
wants to merge 1 commit

Conversation

vszakats
Member

@vszakats vszakats commented Jun 11, 2021

Unicode Windows builds use UTF-8 strings internally in libcurl, so make sure to call the UTF-8 flavour of the libidn2 API.

Reported-by: dEajL3kA on github
Fixes #7228

@vszakats vszakats added the Windows label Jun 11, 2021
@vszakats
Member Author

@dEajL3kA: Can you give this patch a test?

@dEajL3kA

@dEajL3kA: Can you give this patch a test?

Thank you. Works fine for me.

But I would like to suggest a small change to avoid a compiler warning about pointer types:

diff --git a/lib/url.c b/lib/url.c
index edcdf54b1a07..2bfe7631c1d7 100644
--- a/lib/url.c
+++ b/lib/url.c
@@ -62,6 +62,12 @@
 #ifdef USE_LIBIDN2
 #include <idn2.h>
 
+#if defined(WIN32) && defined(UNICODE)
+#define IDN2_LOOKUP(name, host, flags) idn2_lookup_u8((const uint8_t*)name, (uint8_t**)host, flags)
+#else
+#define IDN2_LOOKUP(name, host, flags) idn2_lookup_ul((const char*)name, (char**)host, flags)
+#endif
+
 #elif defined(USE_WIN32_IDN)
 /* prototype for curl_win32_idn_to_ascii() */
 bool curl_win32_idn_to_ascii(const char *in, char **out);
@@ -1576,12 +1582,12 @@ CURLcode Curl_idnconvert_hostname(struct Curl_easy *data,
 #else
       int flags = IDN2_NFC_INPUT;
 #endif
-      int rc = idn2_lookup_ul((const char *)host->name, &ace_hostname, flags);
+      int rc = IDN2_LOOKUP(host->name, &ace_hostname, flags);
       if(rc != IDN2_OK)
         /* fallback to TR46 Transitional mode for better IDNA2003
            compatibility */
-        rc = idn2_lookup_ul((const char *)host->name, &ace_hostname,
-                            IDN2_TRANSITIONAL);
+        rc = IDN2_LOOKUP(host->name, &ace_hostname,
+                         IDN2_TRANSITIONAL);
       if(rc == IDN2_OK) {
         host->encalloc = (char *)ace_hostname;
         /* change the name pointer to point to the encoded hostname */

Regards.

vszakats added a commit to vszakats/curl that referenced this pull request Jun 12, 2021
@vszakats
Member Author

@dEajL3kA: Nice catch, I've updated the patch with your fix.

@vszakats vszakats changed the title from "idn: fix libidn2 with unicode windows" to "idn: fix libidn2 with windows unicode builds" Jun 12, 2021
Member

@jay jay left a comment


While it is true that curl tool Unicode builds use UTF-8, it is not necessarily true of libcurl, since it takes whatever the application passes. There is already a fallback for failed translation, and I wonder if that could also be done for UTF-8, for example:

diff --git a/lib/url.c b/lib/url.c
index 27ba7d6..0eb364f 100644
--- a/lib/url.c
+++ b/lib/url.c
@@ -1585,7 +1585,12 @@ CURLcode Curl_idnconvert_hostname(struct Curl_easy *data,
 #else
       int flags = IDN2_NFC_INPUT;
 #endif
-      int rc = idn2_lookup_ul((const char *)host->name, &ace_hostname, flags);
+      int rc;
+#if defined(WIN32) && defined(UNICODE)
+      rc = idn2_lookup_u8((const uint8_t *)host->name, (uint8_t **)&ace_hostname, flags);
+      if(rc != IDN2_OK)
+#endif
+        rc = idn2_lookup_ul((const char *)host->name, &ace_hostname, flags);
       if(rc != IDN2_OK)
         /* fallback to TR46 Transitional mode for better IDNA2003
            compatibility */

@dEajL3kA

dEajL3kA commented Jun 14, 2021

While it is true that curl tool Unicode builds use UTF-8, it is not necessarily true of libcurl, since it takes whatever the application passes. There is already a fallback for failed translation, and I wonder if that could also be done for UTF-8, for example

Shouldn't the API define which encoding is expected for char* string inputs, rather than leaving this "undefined" and having to make guesses afterwards? And, in (Windows) Unicode builds, anything but UTF-8 doesn't really make sense to me.

Also, a problem I see is that it is quite possible that the exact same char* string decodes "correctly" (seemingly) both as UTF-8 and as ANSI (e.g. Windows-1252), yet it won't necessarily decode to what the caller intended. In other words: if we always try idn2_lookup_u8() first and use idn2_lookup_ul() as a fallback, then idn2_lookup_u8() may succeed even if the input is actually an ANSI-encoded string. But, of course, in that case the result wouldn't be what the user intended. Trying idn2_lookup_ul() first and using idn2_lookup_u8() as a fallback would probably be even worse, though. IMO, it would be better not to support ANSI inputs at all (in the Unicode build) than to implement "unpredictable" behavior.

Either that, or make the API offer a function that allows the application to set the character encoding (UTF-8 being the default).
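The ambiguity described above can be shown concretely. The two bytes C3 A9 form a valid two-character Windows-1252 string ("Ã©") and, at the same time, the single-character UTF-8 string "é"; nothing in the bytes themselves says which was intended. A minimal sketch (the decoding helpers here are illustrative, hard-coded to these byte patterns, not libcurl code):

```c
#include <assert.h>

/* Windows-1252 (ANSI): every byte is one character; in the 0xA0-0xFF
   range the byte value equals the Unicode code point. */
unsigned cp1252_char(unsigned char b)
{
  return b;
}

/* UTF-8: a 0xC2-0xDF lead byte plus one continuation byte form a
   single code point: ((lead & 0x1F) << 6) | (cont & 0x3F). */
unsigned utf8_pair(unsigned char lead, unsigned char cont)
{
  return ((unsigned)(lead & 0x1F) << 6) | (unsigned)(cont & 0x3F);
}
```

For the input bytes 0xC3 0xA9, the ANSI reading yields U+00C3 U+00A9 ("Ã©", two characters) while the UTF-8 reading yields U+00E9 ("é", one character); both are structurally valid, so no fallback scheme can recover the caller's intent from the bytes alone.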

@jay
Member

jay commented Jun 14, 2021

There's probably a slight risk of the wrong conversion. I'm interested in supporting the user's existing code while making it possible to fix this issue. We've had a lot of discussions around Unicode in Windows and what we have now mostly works without breaking anything.

@dEajL3kA

dEajL3kA commented Jun 14, 2021

IMHO, there are only two "clean" ways to do this: Either define which encoding your API expects for char* input strings – which can only be UTF-8 for the Unicode version of the library – and strictly assume that encoding in your implementation; or offer two separate functions, one for UTF-8 strings and one for strings encoded in the "local" ANSI codepage – just like libidn2 does.

It's just not possible to know from the char* string alone which encoding is used. The exact same sequence of bytes can be a "valid" UTF-8 string and a "valid" ANSI string at the same time, but we don't know what the user intended. So I think the chance of a "wrong conversion" in a function that tries to accept both UTF-8 and ANSI strings is very real. You can try to make an educated guess. In some cases you will be lucky and things work as the user expected, but in other cases the user will see "weird" results. This makes debugging very hard, because it first appears to work but then suddenly breaks. So I think it is much preferable to consistently fail when the string is delivered in the "wrong" encoding, instead of implementing a "fix" that cannot be relied on.

Just my two cents. As long as UTF-8 is always tried first (in the Unicode version), I will probably be fine with that 😏

@jay
Member

jay commented Jun 14, 2021

Sure, you've made that point, and I just don't think it's as big a deal. We already do this for the file API and probably other stuff: we try UTF-8 and, if that fails, fall back on local encoding. In other words, we prefer UTF-8 in Unicode builds, but in order not to disrupt other clients we will try the local encoding. This seems like a pretty good compromise, and so far we don't have any reports of a wrong encoding.

@dEajL3kA

dEajL3kA commented Jun 14, 2021

Just one more thing I would like to add: this could even have security implications. I cannot give you a concrete exploit, but suppose some user relies on the ANSI fallback behavior to open a certain file (or URL). This user probably isn't even aware that the given string will be interpreted as UTF-8 first, before falling back to ANSI. Now, if an ANSI string is interpreted as UTF-8, you probably get some "garbage" file name that is unlikely to exist, normally. Still, an attacker who knows which ANSI file name the user is trying to open can easily figure out the corresponding "garbage" UTF-8 name and place a malicious file there...

(Loading the .curlrc from your "home" directory could be a possible attack vector, e.g. if the user account name contains some "foreign" chars. This may offer the possibility of redirecting the file open operation to another user's home directory.)

@jay
Member

jay commented Jun 14, 2021

Still, an attacker who knows which ANSI file name the user is trying to open can easily figure out the corresponding "garbage" UTF-8 name and then place a "malicious" file there...

IMO this is an acceptable security risk to continue supporting legacy clients. If the user has strings that are encoded locally then we try to support them, but yes, you would run that risk if such an example existed.

@vszakats
Member Author

vszakats commented Jun 14, 2021

So, if I understand correctly, in libcurl the internal representation of strings is undefined (and incidentally it is UTF-8 inside Unicode curl binary builds). If it is undefined, it is impossible to properly handle it at API boundaries that make any codepage assumptions about the strings passed around.

As a possible solution, libcurl could get a new API where the string representation can be specified by the caller environment (e.g. in case of a Windows Unicode curl.exe this would be UTF-8) and decide which libidn2 API to call, based on this information (as opposed to the UNICODE build-time flag, which this patch is using for this purpose).

This leads very far, so I suggest closing this PR before the above issue is cleared up.

@jay
Member

jay commented Jun 14, 2021

That's a lot of work, and we have a best-effort method that seems to work well enough, by overriding fopen, stat and access to try Unicode first. We could do the same here.
https://github.com/curl/curl/blob/curl-7_77_0/lib/curl_multibyte.c#L115-L130

@dEajL3kA

dEajL3kA commented Jun 14, 2021

Just for the record, curl_win32_idn_to_ascii() and curl_win32_ascii_to_idn() already assume UTF-8 encoding:
https://github.com/curl/curl/blob/master/lib/idn_win32.c#L71-L88

@vszakats
Member Author

vszakats commented Jun 14, 2021

@jay: Hm, this is interesting, though it also seems quite risky and opens the door to security problems.

Regarding this PR: is there a way to tell whether the UTF-8 conversion failed when calling the libidn2 API, and if so, retry in raw mode? That seems like quite a risky solution as well. Another approach would be to test whether the string is UTF-8 and call the libidn2 API accordingly, though UTF-8 detection is never 100% reliable, especially when someone deliberately wants to trick the detection method into misdetecting.

@vszakats
Member Author

vszakats commented Jun 14, 2021

I tend to think it'd be best to document that Windows UNICODE builds expect UTF-8 strings and require callers to provide strings in this encoding. This was an implicit (and not at all obvious) breaking change that has already been made: mostly for libcurl calls, but even "curl.exe" calls don't always behave as they did before. And "best effort" fallback to raw 8-bit mode isn't (always) feasible or secure, and any kind of auto-detection even less so.

[ Passing/retrieving UTF-8 strings isn't the most natural thing for Windows apps to do, but once implemented, it's a well defined, two-way conversion and there is a long list of other open-source projects which expect UTF-8 strings, so chances are good that this is already a familiar path. ]

Other alternatives require extreme effort, such as introducing new API variants (with well-defined encoding requirements) wherever strings are involved; adding the ability to tell curl the encoding/codepage of strings passed to (and received from) it, converting them internally to the formats requested; or adding a way to query the encoding/codepage curl expects and let the caller convert to it before passing strings to curl. None of these is trivial, and all of them require effort on both the caller and the curl side, not to mention the possible errors in implementing any of these solutions.

For the first alternative the only bit required (from libcurl) is a way for a caller to detect a UNICODE libcurl build. This can already be done with (curl_version_info(CURLVERSION_NOW)->features & CURL_VERSION_UNICODE) != 0, though it could probably be made simpler.

@MarcelRaad
Member

Note that the only supported way to get a Unicode libcurl build that expected ANSI strings in some places before version 7.71.0, when #3784 landed, was by using the Visual Studio project files or, starting with version 7.67.0, winbuild. Both support only MSVC and not e.g. MinGW. All SSPI and Schannel code in libcurl has always expected UTF-8 in Unicode builds, and e.g. using ANSI-encoded passwords never worked with them. Given all this, I don't think breaking backwards compatibility in this case and just documenting UTF-8 for Windows Unicode builds would be dramatic.

I'm not strongly opposed to a best-effort fallback like @jay proposed in #7246 (review) though.

jay added a commit to jay/curl that referenced this pull request Jun 14, 2021
- For Windows Unicode builds of libcurl that use libidn, attempt to
  convert the hostname from UTF-8 and if that fails use local encoding
  as a fallback.

Prior to this change Windows Unicode builds that use libidn attempted to
convert the hostname from the local encoding only. The curl tool passes
the command-line URL (and therefore the hostname) in Unicode's UTF-8
encoding, so the conversion would fail. Since other applications may
pass locally encoded URLs to libcurl we'll keep local encoding as a
fallback.

Reported-by: dEajL3kA@users.noreply.github.com
Assisted-by: Viktor Szakats

Fixes curl#7228
Closes curl#7246
Closes #xxxx
@jay
Member

jay commented Jun 14, 2021

I'm not strongly opposed to a best-effort fallback like @jay proposed in #7246 (review) though.

Now #7256, as an alternative to this.

@dEajL3kA

dEajL3kA commented Jun 14, 2021

If not breaking "legacy" applications which use ANSI encoding is a priority, then why not make all public API functions consistently assume ANSI (local codepage) encoding by default and add a new API function to enable UTF-8 support? Unicode-aware applications, including the curl CLI program, would call that new function early on. And, once that function has been called, all public API functions consistently assume UTF-8 encoding (instead of ANSI). This would ensure consistent behavior for both legacy and Unicode-aware applications and, at the same time, eliminate the need for any "best effort" workarounds.

In the library implementation, a pair of functions curlx_multibyte_to_wchar() and curlx_wchar_to_multibyte(), which check the global "UTF-8 mode enabled" flag and perform the proper conversions, could be used to keep things simple.
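The proposed global flag could look like the following sketch. All names here (curl_global_utf8_mode, the enum, the stub decoders) are hypothetical illustrations, not real libcurl API; on Windows the two branches would dispatch to MultiByteToWideChar(CP_UTF8, ...) versus MultiByteToWideChar(CP_ACP, ...), with stubs used so the dispatch pattern stands alone:

```c
#include <assert.h>

/* Hypothetical process-global flag, set once by Unicode-aware apps. */
static int utf8_mode = 0;            /* default: legacy ANSI behaviour */

void curl_global_utf8_mode(int enable)
{
  utf8_mode = enable;                /* called early, before other APIs */
}

enum encoding { ENC_ANSI, ENC_UTF8 };

/* Stubs standing in for MultiByteToWideChar(CP_UTF8/CP_ACP, ...);
   they return a tag so the chosen branch is observable. */
static enum encoding decode_as_utf8(const char *in) { (void)in; return ENC_UTF8; }
static enum encoding decode_as_ansi(const char *in) { (void)in; return ENC_ANSI; }

/* Every public entry point funnels through helpers like this one, so
   behaviour is consistent for the whole process: no per-string
   guessing, and no silent fallback between encodings. */
enum encoding curlx_multibyte_to_wchar(const char *in)
{
  return utf8_mode ? decode_as_utf8(in) : decode_as_ansi(in);
}
```

The design point is that the choice is made once, by the application, rather than re-guessed per string, which is exactly what eliminates the "best effort" ambiguity discussed above.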

@vszakats
Member Author

vszakats commented Jun 14, 2021

@MarcelRaad: Thanks for pointing out @jay's 'best effort' patch, I did miss that.

I'm however not convinced about such a solution and have a gut feeling that it leads to no good at all. First, it will lead to potential exploits; and when fixing/changing it, or possibly being forced to remove the workaround at a future point, it will cause a second round of breaking changes, where callers will need to finally do the right thing and send proper UTF-8. I simply do not believe that you can correctly guess the codepage/encoding by trial and error. (I'm coming from a place with accents.)

Either way, I'd prefer to commit this as-is now and leave adding compatibility band-aids to a separate commit at your own discretion.

@vszakats
Member Author

vszakats commented Jun 14, 2021

FWIW, I'm now tempted to turn off UNICODE in curl-for-win. I wasn't aware of the "best effort" measures and I see them as having massive potential to open a can of problems. Also, if UNICODE builds are still supposed to accept 8-bit ("ANSI") strings and libcurl randomly/arbitrarily decides how to interpret them, what is the purpose of UNICODE mode at all?

What if an 8-bit ("ANSI") string is a valid UTF-8 string? What if a file-operation with a valid UTF-8 filename fails for an unrelated reason then libcurl continues to try the same string as an 8-bit one and corrupts/reads/overwrites an existing but unrelated file as a result? What if this behaviour gets actively exploited by a malicious party?

[ Just to clarify "ANSI" (and "OEM") is not a codepage. It's Microsoft/Windows terminology to denote a list of 8-bit codepages. The exact interpretation will depend on the OS environment. ]

@dEajL3kA

dEajL3kA commented Jun 14, 2021

FWIW, I'm now tempted to turn off UNICODE in curl-for-win.

UNICODE support is an absolute must for Windows! Windows has supported Unicode for ~20 years now and all modern Windows applications need to support it. The old ANSI APIs only still exist in Windows for legacy compatibility. Using ANSI in a Windows application results in tons of problems; for example, command-line arguments passed to your application get converted into a string containing "?" characters for all characters that cannot be represented in the local ANSI codepage. That way you will not be able to open files that contain such characters. And such files almost always do exist, because Windows itself and its file system are fully Unicode (UTF-16) under the hood. So I think getting proper Unicode support into libcurl is essential.

Just to clarify "ANSI" (and "OEM") is not a codepage. It's Microsoft/Windows terminology to denote a list of 8-bit codepages. The exact interpretation will depend on the OS environment.

In the Win32 API all functions exist twice: as a "Unicode" (wide string) version that uses UTF-16 strings, and as an "ANSI" version that uses multi-byte strings encoded in whatever local codepage happens to be selected in the system (e.g. Windows-1252). So, when I say "ANSI", I mean the system's local codepage, as used by the Win32 API in its "ANSI" functions.

@jay
Member

jay commented Jun 14, 2021

What if a file-operation with a valid UTF-8 filename fails for an unrelated reason then libcurl continues to try the same string as an 8-bit one and corrupts/reads/overwrites an existing but unrelated file as a result?

Yes, as it's written, what could actually happen is that opening a file using the Unicode character set could fail for some reason unrelated to a failed conversion, and it would then fall back on treating the string as local encoding. In my opinion this could be improved to fall back only if the string is not valid UTF-8. The counter-argument is clearly to get rid of the fallbacks, and I just don't think we should do that because it could break existing user code. I see problems either way and I don't see a significant security issue here. I am also not convinced, but not strongly opposed either.

@vszakats
Member Author

vszakats commented Jun 14, 2021

@dEajL3kA: Agreed completely. I'm an absolute proponent of UNICODE. But if support for UNICODE in libcurl doesn't have clearly defined expectations and has dynamically changing behaviour when interpreting input, it's a slippery slope.

Back to IDN specifically: it seems that WinIDN builds don't do the "best effort" interpretation and expect UTF-8 or fail. Meaning that an alternative for curl-for-win is switching back to WinIDN.

It also raises the question: Why is it desired to implement a "best effort" strategy for libidn2 and not doing so for WinIDN?

@vszakats
Member Author

@jay: The problem is that there is no exact way of detecting UTF-8 encoding. There are several methods, but none of them 100%.
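That limitation can be sketched in code. The structural checker below is hypothetical, not libcurl code: it validates lead/continuation byte patterns (rejecting stray continuation bytes, overlong two-byte forms and lead bytes >= 0xF5, but skipping surrogate and longer overlong checks), yet it happily accepts byte strings a user produced under an 8-bit ANSI codepage, which is why validation-based detection can never be 100%:

```c
#include <assert.h>
#include <stddef.h>

/* Structural UTF-8 check: a sketch for illustration only. */
int looks_like_utf8(const unsigned char *s, size_t n)
{
  size_t i = 0;
  while(i < n) {
    unsigned char c = s[i];
    size_t extra, k;
    if(c < 0x80)      { i++; continue; }  /* plain ASCII byte */
    else if(c < 0xC2) return 0;           /* stray continuation / overlong lead */
    else if(c < 0xE0) extra = 1;          /* two-byte sequence */
    else if(c < 0xF0) extra = 2;          /* three-byte sequence */
    else if(c < 0xF5) extra = 3;          /* four-byte sequence */
    else              return 0;           /* beyond U+10FFFF */
    if(i + extra >= n)
      return 0;                           /* truncated sequence */
    for(k = 1; k <= extra; k++)
      if((s[i + k] & 0xC0) != 0x80)
        return 0;                         /* bad continuation byte */
    i += extra + 1;
  }
  return 1;
}
```

The Windows-1252 bytes C3 A9, intended as the two-character string "Ã©", pass this check and would be misread as UTF-8 "é", while a lone ANSI 0xE9 ("é") is rejected; acceptance therefore proves nothing about the caller's intent.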

jay added a commit to jay/curl that referenced this pull request Jun 15, 2021
- If the UTF-8 to UTF-16 conversion fails in Windows Unicode builds then
  no longer fall back to assuming the string is in a local encoding.

Background:

Some functions in Windows Unicode builds must convert UTF-8 to UTF-16 to
pass to the Windows CRT API wide-character functions since in Windows
UTF-8 is not a valid locale (or at least 99% of the time right now).

Prior to this change, if the Unicode encoding conversion failed then
libcurl would, for backwards compatibility with applications that may
have written their code for non-Unicode builds, attempt to convert the
string from local encoding to UTF-16.

That type of "best effort" could theoretically cause some type of
security or other problem if a string that was locally encoded was also
valid UTF-8, and therefore an unexpected UTF-8 to UTF-16 conversion
could occur.

Ref: curl#7246

Closes #xxxx
@jay
Member

jay commented Jun 15, 2021

Back to IDN specifically: it seems that WinIDN builds don't do the "best effort" interpretation and expect UTF-8 or fail. Meaning that an alternative for curl-for-win is switching back to WinIDN.

It also raises the question: Why is it desired to implement a "best effort" strategy for libidn2 and not doing so for WinIDN?

It is for anyone using libidn2 and passing in the local encoding. As I said I don't want to disrupt their application; while at the same time I'd like to fix this for the curl tool. This is not the case with WinIDN which always used UTF-8.

Somewhat related to Marcel's point, I think we have to consider how likely it is right now that anyone using a Unicode build of libcurl for their application is also using libidn2 and not WinIDN. Probably very unlikely, considering we don't support it outright in autobuild (see #7229). curl-for-win has a libcurl that uses libidn2, but is anyone using that version of libcurl for their application, or are they just using the curl tool? I don't know.

@jay: The problem is that there is no exact way of detecting UTF-8 encoding. There are several methods, but none of them 100%.

Yes, I was aware of this; it is what I meant by the risk. I'm not disputing that there's some chance a locally encoded string may mistakenly translate to Unicode as valid UTF-8.


On further consideration, reading through all the comments, I think it's sensible that we retire the fallback behavior... so how about #7257 for the next feature window? I would rather hold off on this PR until then as well. The documentation could also be modified at that time by adding this:

diff --git a/docs/libcurl/opts/CURLOPT_URL.3 b/docs/libcurl/opts/CURLOPT_URL.3
index 17e5c8b..8d705fc 100644
--- a/docs/libcurl/opts/CURLOPT_URL.3
+++ b/docs/libcurl/opts/CURLOPT_URL.3
@@ -72,7 +72,7 @@ expected to be a sequence of characters using an ASCII compatible encoding.
 
 If libcurl is built with IDN support, the server name part of the URL can use
 an "international name" by using the current encoding (according to locale) or
-UTF-8 (when winidn is used).
+UTF-8 (when winidn is used; or a Windows Unicode build using libidn).
 
 If libcurl is built without IDN support, the server name is used exactly as
 specified when passed to the name resolver functions.

@vszakats
Member Author

@jay: Agreed on the removal of fallbacks and the addition to the documentation. I've updated the PR with your patch.

Minor question: Shall we use the term 'libidn' in the docs, or rather 'libidn2'? It seems curl had fully switched to 'libidn2' a while back (9c91ec7).

@bagder
Member

bagder commented Jun 15, 2021

Shall we use the term 'libidn' in the docs, or rather 'libidn2'?

Since they are two different libraries implementing different IDNA standards I think it's a point to say libidn2.

Unicode Windows builds use UTF-8 strings internally in libcurl,
so make sure to call the UTF-8 flavour of the libidn2 API.

Reported-by: dEajL3kA on github
Assisted-by: Jay Satiro
Reviewed-by: Marcel Raad
Fixes curl#7228
@vszakats vszakats closed this in 2026124 Jun 15, 2021
jay added a commit to jay/curl that referenced this pull request Jun 16, 2021
jay added a commit that referenced this pull request Jun 21, 2021
Ref: #7246

Closes #7257
vszakats added a commit to curl/curl-for-win that referenced this pull request Jul 20, 2021
On closer inspection, the state of Unicode support in libcurl does not
seem to be ready for production. Existing support extended certain Windows
interfaces to use the Unicode flavour of the Windows API, but that also
meant that the expected encoding/codepage of strings (e.g. local filenames,
URLs) exchanged via the libcurl API became ambiguous and undefined.
Previously all strings had to be passed in the active Windows locale, using
an 8-bit codepage. In Unicode libcurl builds, the expected string encoding
became an undocumented mixture of UTF-8 and 8-bit locale, depending on the
actual API/option, certain dynamic and static "fallback" logic inside
libcurl and even in OpenSSL, while some parts of libcurl kept using 8-bit
strings internally. From the user's perspective this poses an unreasonably
difficult task in finding out how to pass a certain non-ASCII string to a
specific API without unwanted or accidental (possibly lossy) conversions or
other side-effects. Missing the correct encoding may result in unexpected
behaviour, e.g. in some cases not finding files, finding different files,
accessing the wrong URL or passing a corrupt username or password.

Note that these issues may _only_ affect strings with _non-ASCII_ content.

For now the best solution seems to be to revert back to how libcurl/curl
worked for most of its existence and only re-enable Unicode once the
remaining parts of Windows Unicode support are well-understood, ironed out
and documented.

Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully
this period has had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
vszakats added a commit to curl/curl-for-win that referenced this pull request Jul 20, 2021
On closer inspection, the state of Windows Unicode support in libcurl does
not seem to be ready for production. Existing support extended certain
Windows interfaces to use the Unicode flavour of the Windows API, but that
also meant that the expected encoding/codepage of strings (e.g. local
filenames, URLs) exchanged via the libcurl API became ambiguous and
undefined.

Previously all strings had to be passed in the active Windows locale, using
an 8-bit codepage. In Unicode libcurl builds, the expected string encoding
became an undocumented mixture of UTF-8 and 8-bit locale, depending on the
actual API, build options/dependencies, internal fallback logic based on
runtime auto-detection of the passed string, and the result of file
operations (scheduled for removal in 7.78.0). Meanwhile, some parts of
libcurl kept using 8-bit strings internally, e.g. when reading the
environment.

From the user's perspective this makes it unreasonably complex to find
out how to pass (or read) a given non-ASCII string to (from) a
specific API without unwanted or accidental conversions or other
side-effects. Missing the correct encoding may result in unexpected
behaviour, e.g. in some cases not finding files, reading/writing a
different file, accessing the wrong URL or passing a corrupt username or
password.

Note that these issues may only affect strings with _non-7-bit-ASCII_
content.

For now the least bad solution seems to be to revert to how
libcurl/curl worked for most of its existence and only re-enable Unicode
once the remaining parts of Windows Unicode support are well-understood,
ironed out and documented.

Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully
this period has had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
@vszakats vszakats deleted the libidn2uni branch February 2, 2023 11:51
@vszakats vszakats added the unicode Unicode, code page, character encoding label Feb 16, 2023
Successfully merging this pull request may close these issues.

libidn2 bug in Windows Unicode build
5 participants