-
-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
idn_win32: Fall back to ANSI encoding if UTF-8 fails #654
Conversation
When converting the hostname from IDN to punycode if the IDN is not valid UTF-8, or if it is and fails to convert, fall back to converting the IDN to punycode using the user's ANSI codepage.
Some very minor issues:
Otherwise this looks good to me. Thank you @jay ! |
Squash this into parent! - unsigned => unsigned int - CP_ACP and CP_UTF8 defines use actual values instead - Prevent overflow in strict codepoint count Ref: curl#654
Thanks Michael I've updated this to
I did not change the unicode spec reference from 7 to 8 because 7 was the spec I used back when I wrote that codepoint count function, so I'd rather that stay as is. |
Table 3-7 in the Unicode 8 specification has the same content as in the Unicode 7 specification. |
Ok. Even if the tables are same I'd rather point to what I used to implement it. There's a whole section on it which may differ in some way in spec 8. I could of course find out how it differs I just don't want to create more work for myself. |
An update on this, I looked into libidn source to see exactly what they're doing, and if libidn is built without libiconv then it treats a locale encoded hostname as if it's UTF-8 and attempts to convert it. I'm not sure if that's intended. Ref http://lists.gnu.org/archive/html/help-libidn/2016-03/msg00001.html So to sum up this is the current behavior: If libcurl built with libidn, and libidn built with libiconv: If libcurl built with libidn, but libidn not built with libiconv: If libcurl built with WinIDN: |
Mothballing this for now. My thoughts are we may need another solution, such as in the curl tool getting the URLs in UTF-8 encoding and passing them to libcurl. And then for consistency, an option to force IDNs to be handled in UTF-8 mode. That would only affect libidn w/libiconv which does the current locale. Then we could have some consistency like CURLOPT_URL_IS_UTF8 1L or something, and no matter which IDN curl was built with it would work. |
@pierrejoye @mkauf @gvanem based on our discussion in #637.
This will bring the WinIDN code more in line with libidn. We try treating the IDN in UTF-8 encoding and if that doesn't work we use locale encoding.
This detection isn't 100% accurate since an ANSI encoded hostname could be determined to be valid UTF-8, though that seems unlikely. How unlikely? I don't know.
Another thing is a possible privacy issue where a server may be able to determine the user's codepage by sending a hostname for redirect with bad UTF-8 to force it to be converted using the user's codepage. This is already possible with libidn, afaics.
In terms of codepage conversion I went with CP_ACP which is the user's current ANSI codepage. I did not go with CP_OEMCP which may be more appropriate because it's the console codepage. I've never seen the latter used in any codebase I've worked on. Sometimes the OEM and ANSI codepages can differ, apparently. I would be curious how this affects those users but I don't know much about the differences. Microsoft's support article on when to use ANSI and when to use OEM is lacking. I believe more people use ANSI simply because when you save a file for, say, running a script and you are not saving in Unicode then you're most likely saving in ANSI.