Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

idn_win32: Fall back to ANSI encoding if UTF-8 fails #654

Closed
wants to merge 2 commits into from

Conversation

jay
Copy link
Member

@jay jay commented Feb 14, 2016

When converting the hostname from IDN to punycode if the IDN is not
valid UTF-8, or if it is and fails to convert, fall back to converting
the IDN to punycode using the user's ANSI codepage.

@pierrejoye @mkauf @gvanem based on our discussion in #637.

This will bring the WinIDN code more in line with libidn. We try treating the IDN in UTF-8 encoding and if that doesn't work we use locale encoding.

This detection isn't 100% accurate since an ANSI encoded hostname could be determined to be valid UTF-8, though that seems unlikely. How unlikely? I don't know.

Another thing is a possible privacy issue where a server may be able to determine the user's codepage by sending a hostname for redirect with bad UTF-8 to force it to be converted using the user's codepage. This is already possible with libidn, afaics.

In terms of codepage conversion I went with CP_ACP which is the user's current ANSI codepage. I did not go with CP_OEMCP which may be more appropriate because it's the console codepage. I've never seen the latter used in any codebase I've worked on. Sometimes the OEM and ANSI codepages can differ, apparently. I would be curious how this affects those users but I don't know much about the differences. Microsoft's support article on when to use ANSI and when to use OEM is lacking. I believe more people use ANSI simply because when you save a file for, say, running a script and you are not saving in Unicode then you're most likely saving in ANSI.

When converting the hostname from IDN to punycode if the IDN is not
valid UTF-8, or if it is and fails to convert, fall back to converting
the IDN to punycode using the user's ANSI codepage.
@mkauf
Copy link
Contributor

mkauf commented Feb 15, 2016

Some very minor issues:

  • I prefer to write "unsigned int" instead of "unsigned"
  • maybe link to the Unicode 8.0 specification instead of Unicode 7.0

Otherwise this looks good to me. Thank you @jay !

Squash this into parent!

- unsigned => unsigned int
- CP_ACP and CP_UTF8 defines use actual values instead
- Prevent overflow in strict codepoint count

Ref: curl#654
@jay
Copy link
Member Author

jay commented Feb 19, 2016

Thanks Michael I've updated this to

  • unsigned => unsigned int
  • CP_ACP and CP_UTF8 defines use actual values instead
  • Prevent overflow in strict codepoint count

I did not change the unicode spec reference from 7 to 8 because 7 was the spec I used back when I wrote that codepoint count function, so I'd rather that stay as is.

@mkauf
Copy link
Contributor

mkauf commented Feb 22, 2016

Table 3-7 in the Unicode 8 specification has the same content as in the Unicode 7 specification.

@jay
Copy link
Member Author

jay commented Feb 22, 2016

Table 3-7 in the Unicode 8 specification has the same content as in the Unicode 7 specification.

Ok. Even if the tables are same I'd rather point to what I used to implement it. There's a whole section on it which may differ in some way in spec 8. I could of course find out how it differs I just don't want to create more work for myself.

@jay
Copy link
Member Author

jay commented Mar 19, 2016

An update on this, I looked into libidn source to see exactly what they're doing, and if libidn is built without libiconv then it treats a locale encoded hostname as if it's UTF-8 and attempts to convert it. I'm not sure if that's intended. Ref http://lists.gnu.org/archive/html/help-libidn/2016-03/msg00001.html

So to sum up this is the current behavior:

If libcurl built with libidn, and libidn built with libiconv:
hostname is treated as locale encoded and converted to punycode

If libcurl built with libidn, but libidn not built with libiconv:
hostname is treated as UTF-8 encoded and converted to punycode

If libcurl built with WinIDN:
hostname is treated as UTF-8 encoded and converted to punycode

@jay
Copy link
Member Author

jay commented Apr 6, 2016

Mothballing this for now. My thoughts are we may need another solution, such as in the curl tool getting the URLs in UTF-8 encoding and passing them to libcurl. And then for consistency, an option to force IDNs to be handled in UTF-8 mode. That would only affect libidn w/libiconv which does the current locale. Then we could have some consistency like CURLOPT_URL_IS_UTF8 1L or something, and no matter which IDN curl was built with it would work.

@jay jay closed this Apr 6, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Development

Successfully merging this pull request may close these issues.

None yet

2 participants