idn: fix libidn2 with windows unicode builds #7246
Conversation
@dEajL3kA: Can you give this patch a test?
Thank you. Works fine for me. But I would like to suggest a small change to avoid a compiler warning about the pointer type:
diff --git a/lib/url.c b/lib/url.c
index edcdf54b1a07..2bfe7631c1d7 100644
--- a/lib/url.c
+++ b/lib/url.c
@@ -62,6 +62,12 @@
#ifdef USE_LIBIDN2
#include <idn2.h>
+#if defined(WIN32) && defined(UNICODE)
+#define IDN2_LOOKUP(name, host, flags) idn2_lookup_u8((const uint8_t*)name, (uint8_t**)host, flags)
+#else
+#define IDN2_LOOKUP(name, host, flags) idn2_lookup_ul((const char*)name, (char**)host, flags)
+#endif
+
#elif defined(USE_WIN32_IDN)
/* prototype for curl_win32_idn_to_ascii() */
bool curl_win32_idn_to_ascii(const char *in, char **out);
@@ -1576,12 +1582,12 @@ CURLcode Curl_idnconvert_hostname(struct Curl_easy *data,
#else
int flags = IDN2_NFC_INPUT;
#endif
- int rc = idn2_lookup_ul((const char *)host->name, &ace_hostname, flags);
+ int rc = IDN2_LOOKUP(host->name, &ace_hostname, flags);
if(rc != IDN2_OK)
/* fallback to TR46 Transitional mode for better IDNA2003
compatibility */
- rc = idn2_lookup_ul((const char *)host->name, &ace_hostname,
- IDN2_TRANSITIONAL);
+ rc = IDN2_LOOKUP(host->name, &ace_hostname,
+ IDN2_TRANSITIONAL);
if(rc == IDN2_OK) {
host->encalloc = (char *)ace_hostname;
/* change the name pointer to point to the encoded hostname */
Regards.
@dEajL3kA: Nice catch, I've updated the patch with your fix.
While it is true that curl tool Unicode builds use UTF-8, it is not necessarily true of libcurl, since it takes whatever the application passes. There is already a fallback for failed translation, and I wonder if that could also be done for UTF-8, for example:
diff --git a/lib/url.c b/lib/url.c
index 27ba7d6..0eb364f 100644
--- a/lib/url.c
+++ b/lib/url.c
@@ -1585,7 +1585,12 @@ CURLcode Curl_idnconvert_hostname(struct Curl_easy *data,
#else
int flags = IDN2_NFC_INPUT;
#endif
- int rc = idn2_lookup_ul((const char *)host->name, &ace_hostname, flags);
+ int rc;
+#if defined(WIN32) && defined(UNICODE)
+ rc = idn2_lookup_u8((const uint8_t *)host->name, (uint8_t **)&ace_hostname, flags);
+ if(rc != IDN2_OK)
+#endif
+ rc = idn2_lookup_ul((const char *)host->name, &ace_hostname, flags);
if(rc != IDN2_OK)
/* fallback to TR46 Transitional mode for better IDNA2003
compatibility */
Shouldn't the API define which encoding is expected? Also, a problem I see is that it is quite possible that the exact same byte string is valid in both encodings. Either that, or make the API offer a function that allows the application to set the character encoding (UTF-8 being the default).
There's probably a slight risk of the wrong conversion. I'm interested in supporting the user's existing code while making it possible to fix this issue. We've had a lot of discussions around Unicode in Windows, and what we have now mostly works without breaking anything.
IMHO, there are only two "clean" ways to do this: either define which encoding your API expects, or offer a way for the application to specify the encoding explicitly. It's just not possible to tell from the bytes alone which encoding was intended. Just my two cents. As long as UTF-8 is always tried first (in the Unicode version), I will probably be fine with that 😏
Sure, you've made that point, and I just don't think it's as big a deal. We already do this for the file API and probably other things: we try UTF-8 and, if that fails, fall back on the local encoding. In other words, we prefer UTF-8 in Unicode builds, but in order not to disrupt other clients we will try the local encoding. This seems like a pretty good compromise, and so far we don't have any reports of a wrong encoding.
Just one more thing I would like to add: this could even have security implications. I cannot give you a concrete exploit, but suppose some user relies on the ANSI fallback behavior to open a certain file (or URL). This user probably isn't even aware that the given string will be interpreted as UTF-8 first, before falling back to ANSI. Now, if an ANSI string is interpreted as UTF-8, you probably get some "garbage" file name that is unlikely to exist – normally. Still, an attacker who knows which ANSI file name the user is trying to open can easily figure out the corresponding "garbage" UTF-8 name and place a malicious file there...
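To make the ambiguity concrete, here is a minimal sketch (the helper is invented for illustration; it is not a curl or libidn2 function) showing that a two-byte CP-1252 string can simultaneously be well-formed UTF-8, so a UTF-8-first strategy silently picks the other reading:

```c
/* Decode a two-byte UTF-8 sequence into a code point, or return -1
   if the byte pair is not a well-formed two-byte sequence.
   Hypothetical helper for illustration only. */
static long utf8_decode2(unsigned char b0, unsigned char b1)
{
    /* lead byte must be 110xxxxx (and not overlong), continuation
       byte must be 10xxxxxx */
    if ((b0 & 0xE0) == 0xC0 && b0 >= 0xC2 && (b1 & 0xC0) == 0x80)
        return ((long)(b0 & 0x1F) << 6) | (b1 & 0x3F);
    return -1;
}
```

The bytes 0xC3 0x84 display as "Ã„" under CP-1252 yet decode as U+00C4 ("Ä") under UTF-8; both readings are plausible, and only the application knows which one was meant. Most other CP-1252 pairs, such as 0xC4 0x41 ("ÄA"), fail the structural check and would hit the fallback.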
IMO this is an acceptable security risk to continue supporting legacy clients. If the user has strings that are encoded locally then we try to support them, but you would run that risk if such an example existed.
So, if I understand correctly, in libcurl the internal representation of strings is undefined (and incidentally it is UTF-8 inside Unicode curl binary builds). If it is undefined, it will be impossible to properly handle it at API boundaries that make any codepage assumptions regarding the strings passed around. As a possible solution, libcurl could get a new API where the string representation can be specified by the caller environment (e.g. in case of a Windows Unicode curl.exe this would be UTF-8) and decide which libidn2 API to call based on this information. This leads very far, so I suggest closing this PR until the above issue is cleared up.
That's a lot of work, and we have a best-effort method that seems to work well enough: overriding fopen, stat and access to try Unicode first. We could do the same here.
Just for the notes,
@jay: Hm, this is interesting, though it also seems quite risky and opens the door to security problems. Regarding this PR: is there a way to tell whether the UTF-8 conversion failed when calling the libidn2 API, and if so, to retry in raw mode? That looks like quite a risky solution as well. Another approach would be to test whether the string is UTF-8 and call the libidn2 API accordingly, though detecting UTF-8 isn't a 100% matter, especially when someone specifically wants to trick the detection method into misdetecting.
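The "test whether the string is UTF-8" idea amounts to a structural validator along these lines (an illustrative sketch, not curl code; and, as the comment says, it cannot prove intent, since some locally-encoded strings will pass the check by accident):

```c
#include <stddef.h>

/* Return 1 if buf[0..len) is structurally well-formed UTF-8, else 0.
   Sketch only: checks lead/continuation byte patterns and rejects the
   common overlong and out-of-range lead bytes. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t cont; /* number of continuation bytes expected */
        if (c < 0x80)
            cont = 0;
        else if ((c & 0xE0) == 0xC0) {
            if (c < 0xC2) return 0;    /* overlong two-byte lead */
            cont = 1;
        }
        else if ((c & 0xF0) == 0xE0)
            cont = 2;
        else if ((c & 0xF8) == 0xF0) {
            if (c > 0xF4) return 0;    /* beyond U+10FFFF */
            cont = 3;
        }
        else
            return 0;                  /* stray continuation byte */
        if (i + 1 + cont > len)
            return 0;                  /* truncated sequence */
        for (size_t k = 1; k <= cont; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;
        i += 1 + cont;
    }
    return 1;
}
```

A lone CP-1252 "ä" (byte 0xE4) fails the check, while UTF-8 "ä" (0xC3 0xA4) passes, which is exactly why this is a heuristic rather than a guarantee: the check separates most, but not all, locally-encoded inputs from UTF-8.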
I tend to think it'd be best to document that Windows UNICODE builds expect UTF-8 strings and to expect callers to provide strings in this encoding. This was an implicit (and not at all obvious) breaking change that has already been made, mostly for libcurl calls, but even "curl.exe" calls aren't always behaving as they did before. And "best effort" falling back to raw 8-bit mode isn't (always) feasible or secure, and even less so any kind of auto-detection.

[ Passing/retrieving UTF-8 strings isn't the most natural thing for Windows apps to do, but once implemented, it's a well-defined, two-way conversion, and there is a long list of other open-source projects which expect UTF-8 strings, so chances are good that this is already a familiar path. ]

Other alternatives require extreme effort, such as introducing new API variants (with well-defined encoding requirements) where strings are involved, or adding the ability to tell curl the encoding/codepage strings are passed to (and received from) it and convert them internally to the formats requested, or adding a way to query the encoding/codepage curl is expecting and allow the caller to convert strings to it before passing them to curl. None of these is trivial, and there is effort to be made on both the caller and the curl side in all cases. Not to mention the possible errors in implementing any of these solutions.

For the first alternative the only bit required (from libcurl) is a way for a caller to detect a UNICODE libcurl build. This can already be done with:
Note that the only supported way to get a Unicode libcurl build that expected ANSI strings in some places before version 7.71.0, when #3784 landed, was by using the Visual Studio project files or, starting with version 7.67.0, winbuild. Both support only MSVC and not e.g. MinGW. All SSPI and Schannel code in libcurl has always expected UTF-8 in Unicode builds, and e.g. using ANSI-encoded passwords never worked with them. Given all this, I don't think breaking backwards compatibility in this case and just documenting UTF-8 for Windows Unicode builds would be dramatic. I'm not strongly opposed to a best-effort fallback like @jay proposed in #7246 (review) though.
- For Windows Unicode builds of libcurl that use libidn, attempt to convert the hostname from UTF-8 and if that fails use local encoding as a fallback. Prior to this change Windows Unicode builds that use libidn attempted to convert the hostname from the local encoding only. The curl tool passes the command-line URL (and therefore the hostname) in Unicode's UTF-8 encoding, so the conversion would fail. Since other applications may pass locally encoded URLs to libcurl we'll keep local encoding as a fallback.

Reported-by: dEajL3kA@users.noreply.github.com
Assisted-by: Viktor Szakats
Fixes curl#7228
Closes curl#7246
Closes #xxxx
Now #7256, as an alternative to this. |
If not breaking "legacy" applications which use ANSI encoding is a priority, then why not make all public API functions consistently assume ANSI (local codepage) encoding by default and add a new API function to enable UTF-8 support? Unicode-aware applications, including the curl CLI program, would call that new function early on. And, once that function has been called, all public API functions consistently assume UTF-8 encoding (instead of ANSI). This would ensure consistent behavior for both legacy and Unicode-aware applications and, at the same time, eliminate the need for any "best effort" workarounds. In the library implementation, a pair of functions
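The opt-in idea can be sketched as follows (every name here is invented for illustration; nothing like this exists in libcurl's public API). A single process-wide flag, set once at startup, would make every string-accepting entry point behave consistently instead of guessing per call; on Windows the two branches would correspond to passing CP_UTF8 versus CP_ACP to MultiByteToWideChar():

```c
/* Hypothetical opt-in switch: the default preserves the legacy
   "ANSI" (local codepage) behaviour, and Unicode-aware applications
   call the enable function once, early on. */
static int string_encoding_utf8 = 0;

static void hypothetical_global_init_utf8(void)
{
    string_encoding_utf8 = 1;
}

/* Every public entry point would consult one place to learn how to
   interpret incoming 8-bit strings; here we just return a label. */
static const char *expected_encoding(void)
{
    return string_encoding_utf8 ? "UTF-8" : "ANSI";
}
```

The design point is that the interpretation becomes a static, documented property of the process rather than a per-string guess, which is what eliminates the "best effort" ambiguity.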
@MarcelRaad: Thanks for pointing out @jay's "best effort" patch, I did miss that. I'm however not convinced about such a solution and have a gut feeling that it leads to no good at all. First, it will lead to potential exploits, and when fixing/changing it or possibly being forced to remove the workaround at a future point, it will lead to a second round of breaking changes, where callers will need to finally do the right thing and send proper UTF-8. Simply put, I do not believe that you can correctly guess the codepage/encoding by trial and error. (I'm coming from a place with accents.) Either way, I'd prefer to commit this as-is now and leave adding compatibility band-aids to a separate commit at your own discretion.
FWIW, I'm now tempted to turn off UNICODE in curl-for-win. I wasn't aware of the "best effort" measures, and I see them as having massive potential to open a can of problems. Also, if UNICODE builds are still supposed to accept 8-bit ("ANSI") strings and libcurl randomly/arbitrarily decides how to interpret them, what is the purpose of UNICODE mode at all? What if an 8-bit ("ANSI") string is a valid UTF-8 string? What if a file operation with a valid UTF-8 filename fails for an unrelated reason, then libcurl continues to try the same string as an 8-bit one and corrupts/reads/overwrites an existing but unrelated file as a result? What if this behaviour gets actively exploited by a malicious party?

[ Just to clarify: "ANSI" (and "OEM") is not a codepage. It's Microsoft/Windows terminology to denote a list of 8-bit codepages. The exact interpretation will depend on the OS environment. ]
UNICODE support is an absolute must for Windows! Windows has supported Unicode for ~20 years now, and all modern Windows applications need to support it. The old ANSI APIs only still exist in Windows for legacy compatibility. Using ANSI in your application on Windows results in tons of problems; for example, the command-line arguments that were passed to your application get converted into a string containing "?" characters for all characters that cannot be represented in the local ANSI codepage. This way you will not be able to open files that contain such characters. And such files almost always do exist, because Windows itself and its file system are fully Unicode (UTF-16) under the hood. So I think getting proper Unicode support in libcurl is essential.
In the Win32 API all functions exist twice: as a "Unicode" (wide string) version that uses UTF-16 strings, and as an "ANSI" version that uses multi-byte strings encoded in whatever local codepage happens to be selected in the system (e.g. Windows-1252). So, when I say "ANSI", I mean the system's local codepage, as used by the Win32 API in its "ANSI" functions.
Yes, as it's written, what could actually happen is that opening a file using the Unicode character set could fail for some reason unrelated to a failed conversion, but then fall back on treating the string as local encoding. In my opinion this could be improved to fall back only if the string is not UTF-8 encoded. The counter-argument is clearly to get rid of the fallbacks, and I just don't think we should do that because it could break existing user code. I see problems either way, and I don't see a significant security issue here. I am also not convinced, but not strongly opposed either.
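The suggested improvement, falling back to the local encoding only when the bytes are not well-formed UTF-8 rather than whenever the operation fails, reduces to a small decision rule. An illustrative sketch, not libcurl code:

```c
/* Possible outcomes when deciding how to interpret an incoming
   8-bit string. Names are invented for this sketch. */
enum interp { INTERP_UTF8, INTERP_LOCAL, INTERP_FAIL };

/* is_well_formed_utf8: result of a structural UTF-8 check on the
   string; utf8_op_ok: whether the operation attempted under the
   UTF-8 interpretation succeeded. The key point: a valid-UTF-8
   string is never reinterpreted as local encoding just because the
   operation failed for some unrelated reason. */
static enum interp choose_interpretation(int is_well_formed_utf8,
                                         int utf8_op_ok)
{
    if (is_well_formed_utf8)
        return utf8_op_ok ? INTERP_UTF8 : INTERP_FAIL;
    return INTERP_LOCAL;
}
```

Compared with the behaviour described above, the only change is the INTERP_FAIL cell: a failed file operation on a valid UTF-8 name reports an error instead of silently retrying the bytes under the local codepage.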
@dEajL3kA: Agreed completely. I'm an absolute proponent of UNICODE. But if support for UNICODE in libcurl doesn't have clearly defined expectations and has dynamically changing behaviour when interpreting input, it's a slippery slope. Back to IDN specifically: it seems that WinIDN builds don't do the "best effort" interpretation and expect UTF-8 or fail. Meaning that an alternative for curl-for-win is switching back to WinIDN. It also raises the question: why is it desired to implement a "best effort" strategy for libidn2 but not for WinIDN?
@jay: The problem is that there is no exact way of detecting UTF-8 encoding. There are several methods, but none of them is 100%.
- If the UTF-8 to UTF-16 conversion fails in Windows Unicode builds then no longer fall back to assuming the string is in a local encoding. Background: Some functions in Windows Unicode builds must convert UTF-8 to UTF-16 to pass to the Windows CRT API wide-character functions, since in Windows UTF-8 is not a valid locale (or at least 99% of the time right now). Prior to this change, if the Unicode encoding conversion failed then libcurl would, for backwards compatibility with applications that may have written their code for non-Unicode builds, attempt to convert the string from local encoding to UTF-16. That type of "best effort" could theoretically cause some type of security or other problem if a string that was locally encoded was also valid UTF-8, and therefore an unexpected UTF-8 to UTF-16 conversion could occur.

Ref: curl#7246
Closes #xxxx
It is for anyone using libidn2 and passing in the local encoding. As I said, I don't want to disrupt their applications, while at the same time I'd like to fix this for the curl tool. This is not the case with WinIDN, which has always used UTF-8. Somewhat related to Marcel's point, I think we have to consider how likely it is right now that anyone using a Unicode build of libcurl for their application is also using libidn2 and not WinIDN. Probably very unlikely, considering we don't support it outright in autobuild (see #7229). curl-for-win has a libcurl that uses libidn, but is anyone using that version of libcurl for their application, or are they just using the curl tool? I don't know.
Yes, I was aware of this; it is what I meant by the risk. I'm not disputing that there's some chance a locally encoded string may mistakenly translate to Unicode successfully as UTF-8. On further consideration, reading through all the comments, I think it's sensible that we retire the fallback behavior... so how about #7257 for the next feature window? I would rather wait with this PR until then as well. Also, the documentation could be modified at that time by adding this:
diff --git a/docs/libcurl/opts/CURLOPT_URL.3 b/docs/libcurl/opts/CURLOPT_URL.3
index 17e5c8b..8d705fc 100644
--- a/docs/libcurl/opts/CURLOPT_URL.3
+++ b/docs/libcurl/opts/CURLOPT_URL.3
@@ -72,7 +72,7 @@ expected to be a sequence of characters using an ASCII compatible encoding.
If libcurl is built with IDN support, the server name part of the URL can use
an "international name" by using the current encoding (according to locale) or
-UTF-8 (when winidn is used).
+UTF-8 (when winidn is used; or a Windows Unicode build using libidn).
If libcurl is built without IDN support, the server name is used exactly as
specified when passed to the name resolver functions.
Since they are two different libraries implementing different IDNA standards, I think it's a point to say
Unicode Windows builds use UTF-8 strings internally in libcurl, so make sure to call the UTF-8 flavour of the libidn2 API.

Reported-by: dEajL3kA on github
Assisted-by: Jay Satiro
Reviewed-by: Marcel Raad
Fixes curl#7228
- If the UTF-8 to UTF-16 conversion fails in Windows Unicode builds then no longer fall back to assuming the string is in a local encoding. Background: Some functions in Windows Unicode builds must convert UTF-8 to UTF-16 to pass to the Windows CRT API wide-character functions, since in Windows UTF-8 is not a valid locale (or at least 99% of the time right now). Prior to this change, if the Unicode encoding conversion failed then libcurl would, for backwards compatibility with applications that may have written their code for non-Unicode builds, attempt to convert the string from local encoding to UTF-16. That type of "best effort" could theoretically cause some type of security or other problem if a string that was locally encoded was also valid UTF-8, and therefore an unexpected UTF-8 to UTF-16 conversion could occur.

Ref: #7246
Closes #7257
On closer inspection, the state of Unicode support in libcurl does not seem to be ready for production. Existing support extended certain Windows interfaces to use the Unicode flavour of the Windows API, but that also meant that the expected encoding/codepage of strings (e.g. local filenames, URLs) exchanged via the libcurl API became ambiguous and undefined.

Previously all strings had to be passed in the active Windows locale, using an 8-bit codepage. In Unicode libcurl builds, the expected string encoding became an undocumented mixture of UTF-8 and 8-bit locale, depending on the actual API/option, certain dynamic and static "fallback" logic inside libcurl and even in OpenSSL, while some parts of libcurl kept using 8-bit strings internally.

From the user's perspective this poses an unreasonably difficult task in finding out how to pass a certain non-ASCII string to a specific API without unwanted or accidental (possibly lossy) conversions or other side-effects. Missing the correct encoding may result in unexpected behaviour, e.g. in some cases not finding files, finding different files, accessing the wrong URL or passing a corrupt username or password. Note that these issues may _only_ affect strings with _non-ASCII_ content.

For now the best solution seems to be to revert back to how libcurl/curl worked for most of its existence and only re-enable Unicode once the remaining parts of Windows Unicode support are well-understood, ironed out and documented. Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully this period has had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
On closer inspection, the state of Windows Unicode support in libcurl does not seem to be ready for production. Existing support extended certain Windows interfaces to use the Unicode flavour of the Windows API, but that also meant that the expected encoding/codepage of strings (e.g. local filenames, URLs) exchanged via the libcurl API became ambiguous and undefined.

Previously all strings had to be passed in the active Windows locale, using an 8-bit codepage. In Unicode libcurl builds, the expected string encoding became an undocumented mixture of UTF-8 and 8-bit locale, depending on the actual API, build options/dependencies, internal fallback logic based on runtime auto-detection of the passed string, and the result of file operations (scheduled for removal in 7.78.0), while some parts of libcurl kept using 8-bit strings internally, e.g. when reading the environment.

From the user's perspective this poses an unreasonably complex task in finding out how to pass (or read) a certain non-ASCII string to (from) a specific API without unwanted or accidental conversions or other side-effects. Missing the correct encoding may result in unexpected behaviour, e.g. in some cases not finding files, reading/writing a different file, accessing the wrong URL or passing a corrupt username or password. Note that these issues may only affect strings with _non-7-bit-ASCII_ content.

For now the least bad solution seems to be to revert back to how libcurl/curl worked for most of its existence and only re-enable Unicode once the remaining parts of Windows Unicode support are well-understood, ironed out and documented. Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully this period has had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5