idn_win32: Fall back to ANSI encoding if UTF-8 fails #654

jay · 2016-02-14T22:46:07Z

When converting the hostname from IDN to punycode if the IDN is not
valid UTF-8, or if it is and fails to convert, fall back to converting
the IDN to punycode using the user's ANSI codepage.

@pierrejoye @mkauf @gvanem based on our discussion in #637.

This will bring the WinIDN code more in line with libidn. We try treating the IDN in UTF-8 encoding and if that doesn't work we use locale encoding.

This detection isn't 100% accurate since an ANSI encoded hostname could be determined to be valid UTF-8, though that seems unlikely. How unlikely? I don't know.

Another thing is a possible privacy issue where a server may be able to determine the user's codepage by sending a hostname for redirect with bad UTF-8 to force it to be converted using the user's codepage. This is already possible with libidn, afaics.

In terms of codepage conversion I went with CP_ACP which is the user's current ANSI codepage. I did not go with CP_OEMCP which may be more appropriate because it's the console codepage. I've never seen the latter used in any codebase I've worked on. Sometimes the OEM and ANSI codepages can differ, apparently. I would be curious how this affects those users but I don't know much about the differences. Microsoft's support article on when to use ANSI and when to use OEM is lacking. I believe more people use ANSI simply because when you save a file for, say, running a script and you are not saving in Unicode then you're most likely saving in ANSI.

When converting the hostname from IDN to punycode if the IDN is not valid UTF-8, or if it is and fails to convert, fall back to converting the IDN to punycode using the user's ANSI codepage.

mkauf · 2016-02-15T11:11:11Z

Some very minor issues:

I prefer to write "unsigned int" instead of "unsigned"
maybe link to the Unicode 8.0 specification instead of Unicode 7.0

Otherwise this looks good to me. Thank you @jay !

Squash this into parent! - unsigned => unsigned int - CP_ACP and CP_UTF8 defines use actual values instead - Prevent overflow in strict codepoint count Ref: curl#654

jay · 2016-02-19T07:58:51Z

Thanks Michael I've updated this to

unsigned => unsigned int
CP_ACP and CP_UTF8 defines use actual values instead
Prevent overflow in strict codepoint count

I did not change the unicode spec reference from 7 to 8 because 7 was the spec I used back when I wrote that codepoint count function, so I'd rather that stay as is.

mkauf · 2016-02-22T09:43:03Z

Table 3-7 in the Unicode 8 specification has the same content as in the Unicode 7 specification.

jay · 2016-02-22T23:32:11Z

Table 3-7 in the Unicode 8 specification has the same content as in the Unicode 7 specification.

Ok. Even if the tables are same I'd rather point to what I used to implement it. There's a whole section on it which may differ in some way in spec 8. I could of course find out how it differs I just don't want to create more work for myself.

jay · 2016-03-19T04:58:52Z

An update on this, I looked into libidn source to see exactly what they're doing, and if libidn is built without libiconv then it treats a locale encoded hostname as if it's UTF-8 and attempts to convert it. I'm not sure if that's intended. Ref http://lists.gnu.org/archive/html/help-libidn/2016-03/msg00001.html

So to sum up this is the current behavior:

If libcurl built with libidn, and libidn built with libiconv:
hostname is treated as locale encoded and converted to punycode

If libcurl built with libidn, but libidn not built with libiconv:
hostname is treated as UTF-8 encoded and converted to punycode

If libcurl built with WinIDN:
hostname is treated as UTF-8 encoded and converted to punycode

jay · 2016-04-06T06:56:12Z

Mothballing this for now. My thoughts are we may need another solution, such as in the curl tool getting the URLs in UTF-8 encoding and passing them to libcurl. And then for consistency, an option to force IDNs to be handled in UTF-8 mode. That would only affect libidn w/libiconv which does the current locale. Then we could have some consistency like CURLOPT_URL_IS_UTF8 1L or something, and no matter which IDN curl was built with it would work.

idn_win32: Fall back to ANSI encoding if UTF-8 fails

d179454

When converting the hostname from IDN to punycode if the IDN is not valid UTF-8, or if it is and fails to convert, fall back to converting the IDN to punycode using the user's ANSI codepage.

idn_win32: Fall back to ANSI encoding if UTF-8 fails. draft2

656704e

Squash this into parent! - unsigned => unsigned int - CP_ACP and CP_UTF8 defines use actual values instead - Prevent overflow in strict codepoint count Ref: curl#654

jay mentioned this pull request Feb 24, 2016

Issue #627: Checks for doomed connections when selecting a connection. #669

Closed

jay mentioned this pull request Mar 26, 2016

curl on Windows incorrectly handle IDN-urls #731

Closed

jay added the mothballed label Apr 6, 2016

jay closed this Apr 6, 2016

lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

idn_win32: Fall back to ANSI encoding if UTF-8 fails #654

idn_win32: Fall back to ANSI encoding if UTF-8 fails #654

jay commented Feb 14, 2016

mkauf commented Feb 15, 2016

jay commented Feb 19, 2016

mkauf commented Feb 22, 2016

jay commented Feb 22, 2016

jay commented Mar 19, 2016

jay commented Apr 6, 2016

idn_win32: Fall back to ANSI encoding if UTF-8 fails #654

idn_win32: Fall back to ANSI encoding if UTF-8 fails #654

Conversation

jay commented Feb 14, 2016

mkauf commented Feb 15, 2016

jay commented Feb 19, 2016

mkauf commented Feb 22, 2016

jay commented Feb 22, 2016

jay commented Mar 19, 2016

jay commented Apr 6, 2016