Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better error checking for Win32 IDN functions #637

Closed
wants to merge 1 commit into from

Conversation

mkauf
Copy link
Contributor

@mkauf mkauf commented Feb 5, 2016

  • Return 0 (error) if Curl_convert_UTF8_to_wchar() fails
  • Return 0 (error) if the input string is NULL
  • Only return 1 (success) if the operation has really succeeded

@mention-bot
Copy link

By analyzing the blame information on this pull request, we identified @pierrejoye, @yangtse and @bagder to be potential reviewers

@jay
Copy link
Member

jay commented Feb 5, 2016

Ref 5d7c937#commitcomment-15913422

There's no path for it to be NULL afaict. Personally I'd prefer an early-exit pattern in a function like this but your changes will be fine. Also, unrelated to your changes, that curl_win32_ascii_to_idn looks wrong I don't know why it is casting char *in to wchar_t * that doesn't look right I'll have to check on that.

@jay jay self-assigned this Feb 5, 2016
@mkauf
Copy link
Contributor Author

mkauf commented Feb 5, 2016

That cast in curl_win32_ascii_to_idn is wrong, Curl_convert_UTF8_to_wchar() should be used instead (like in curl_win32_idn_to_ascii). But the whole function is currently unused. Maybe we can remove it.

@jay
Copy link
Member

jay commented Feb 5, 2016

It might be useful in the future I'd rather fix it and then leave it be. Weird, I don't recall seeing an unused function warning when I built for WinIDN, maybe that warning is disabled somewhere.

- Fix a conversion bug in the unused function curl_win32_ascii_to_idn()
@mkauf
Copy link
Contributor Author

mkauf commented Feb 6, 2016

I have updated the pull request with a bugfix for curl_win32_ascii_to_idn.

jay pushed a commit that referenced this pull request Feb 6, 2016
.. also fix a conversion bug in the unused function
curl_win32_ascii_to_idn().

And remove wprintfs on error (Jay).

Bug: #637
@jay
Copy link
Member

jay commented Feb 6, 2016

Modified in 3 ways, otherwise it's good:

  1. Pass the length+1 of in_w to IdnToUnicode. My reasoning for that is the IdnToUnicode doc does not address the use of -1 the way IdnToAscii does (although it does seem to work). I don't like that it's undocumented. The remarks in Return value say: "The retrieved string is null-terminated only if the input string is null-terminated". I addressed this by changing the input stream character count to length+1.
  2. free(in_w) after IdnToUnicode.
  3. Remove wprintf statements. It's wrong to be sending that output to stdout. The responsibility is on the caller to show the error from a function like that via warnf or something. It already does this in url.c.

Landed in 9e7fcd4. Thanks Michael, and thanks Gisle for getting the ball rolling.

@jay jay closed this Feb 6, 2016
@gvanem
Copy link
Contributor

gvanem commented Feb 7, 2016

Jay> The responsibility is on the caller to show the error from a function like that via warnf or something.
Jay> It already does this in url.c.

The warnings specific to WinIDN are non-existent; no help in case the Active CodePage is wrong.
Maybe I cook up a Curl_idn_strerror() for USE_WIN32_IDN?

Also it seems UTF8 input-encoding is assumed throughout libcurl/WinIDN (?). But where is the call to GetACP() to check that this is true? Can you enlighten me, Michael?

@jay
Copy link
Member

jay commented Feb 8, 2016

You are right WinIDN does not work properly from URLs provided by the curl tool, that is probably related to #345. From libcurl UTF-8 is expected. In Visual Studio as far as I know this can be expressed one of two ways. If you have saved your source as UTF-8 with BOM Visual Studio will do translations to the local codepage for literal char strings but if you have saved your source as UTF-8 without BOM it will not do any translation. Therefore either of these is acceptable:

  // connect to http://россия.net
  curl_easy_setopt(curl, CURLOPT_URL,
    // "http://\xD1\x80\xD0\xBE\xD1\x81\xD1\x81\xD0\xB8\xD1\x8F.net"); // use if src is UTF-8 with BOM
    "http://россия.net"); // use if src is UTF-8 without BOM

and WinIDN will convert that UTF-8 string to xn--h1alffa9f.net and make the connection.

So one issue is whether a URL passed to CURLOPT_URL is supposed to be in UTF-8 format or the local codepage. It doesn't appear to be documented either way, actually.

Another issue is being able to access Unicode from the command prompt using the curl tool. Frankly I've been hoping someone would come along and work on that, but it's been open too long we'll probably have to put it in the TODO.

@gvanem
Copy link
Contributor

gvanem commented Feb 8, 2016

.. that is probably related to #345.

In which Viktor Szakats proposed to build curl.exe with -DUNICODE -D_UNICODE. I also tried that back in 2014, but failed. Trying this now, there were only these warnings:

tool_homedir.c(42): warning C4133: 'function': incompatible types - from 'const char *' to 'LPCWSTR'
tool_homedir.c(42): warning C4133: 'function': incompatible types - from 'char [1024]' to 'LPWSTR'
tool_homedir.c(49): warning C4133: 'function': incompatible types - from 'const char *' to 'LPCWSTR'
tool_homedir.c(49): warning C4133: 'function': incompatible types - from 'char [1024]' to 'LPWSTR'

(which are simple to fix). But I still have problems with e.g. curl.exe www.seoghør.no and ACE-conversion. The only way that works is a idn-test.bat with:

@echo url = "www.seoghør.no" | iconv -f iso8859-1 -t utf-8 > %TEMP\url-file
@curl -K %TEMP\url-file

@mkauf
Copy link
Contributor Author

mkauf commented Feb 8, 2016

@jay Thank you for improving the patch!

@gvanem You have found another bug... see @jay's comment. A quick hack (not the real bugfix) is to replace CP_UTF8 with CP_ACP in Curl_convert_UTF8_to_wchar()

@mkauf mkauf deleted the win-idn-bugfix branch February 8, 2016 12:27
@pierrejoye
Copy link
Contributor

Hi,

On Feb 8, 2016 7:30 PM, "Michael Kaufmann" notifications@github.com wrote:

@jay Thank you for improving the patch!

@gvanem You have found another bug... see @jay's comment. A quick hack
(not the real bugfix) is to replace CP_UTF8 with CP_ACP in
Curl_convert_UTF8_to_wchar()

I do not think there is a bug per se in this specific part. Eventually in
the command line arguments parsing but not here.

Please do not replace UTF8 with ACP there as any valid code using curl via
the APIs may break.

To get a UTF8 console on windows one can use:

CHCP 65001 in a batch file and create a new console with cmd.exe /k
utf8.bat file f.e. as changing it using the registry or system wide have
other side effects.


Reply to this email directly or view it on GitHub.

@jay
Copy link
Member

jay commented Feb 8, 2016

@pierrejoye The issue though is this is undocumented and the behavior appears to be different depending on the IDN library. For libidn we're calling idna_to_ascii_lz which does the current locale. One would expect that WinIDN (aka normaliz.lib) would do the same but instead as Gisle pointed out we're using UTF-8 regardless of locale. Currently UTF-8 on the command line isn't compatible with the curl tool in Windows, see my comments in #345.

Personally I'd rather WinIDN behave the same as libidn. At this point in order to preserve backwards compatibility I don't know what we can do, maybe UTF-8 detection or something.

Also: There is a mothballed function I wrote which may be useful, utf8_strict_codepoint_count

@pierrejoye
Copy link
Contributor

Yes, I saw it :)

As they seem related there different issues.

One is how curl command line manages the input argument encoding.

The 2nd issue is the usage of the various windows APIs, as far as I
remember only the non wide char apis us used.

The 3rd issue or question is how curl API should work. To me it should keep
using UTF-8 and really not rely on runtime charset setting, no matter the
platform. It makes the caller and maintainers way easier. @Badger thoughts?

In any case, I am in the process of making php on windows Unicode friendly
and curl is part of this effort. I can provide a PR and we can discuss
specifics there?
On Feb 9, 2016 5:14 AM, "Jay Satiro" notifications@github.com wrote:

@pierrejoye https://github.com/pierrejoye The issue though is this is
undocumented and the behavior appears to be different depending on the IDN
library. For libidn we're calling idna_to_ascii_lz which does the current
locale. One would expect that WinIDN (aka normaliz.lib) would do the same
but instead as Gisle pointed out we're using UTF-8 regardless of locale.
Currently UTF-8 on the command line isn't compatible with the curl tool in
Windows, see my comments in #345 #345
.

Personally I'd rather WinIDN behave the same as libidn. At this point in
order to preserve backwards compatibility I don't know what we can do,
maybe UTF-8 detection or something.

Also: There is a mothballed function I wrote which may be useful,
utf8_strict_codepoint_count
master...jay:check_utf8_before_libidn#diff-7bfe625cf96f5f724dd32ce3e5a1ff6fR343


Reply to this email directly or view it on GitHub
#637 (comment).

@jay
Copy link
Member

jay commented Feb 9, 2016

Sure. But what do you think of the idea I mentioned where for Windows we do an autodetect like

if(!is_ASCII_name(host->name)) {
  if(utf8_strict_codepoint_count(host->name) > 0)
    /* string is valid strict UTF-8 do UTF-8 => wchar => IDN */
  else
    /* string is invalid strict UTF-8 do locale => wchar => IDN */
}

IIRC libidn when not built with libiconv will internally default to UTF-8 encoding even when idna_to_ascii_lz is used (I last read it about 8 months ago, that may have changed). So how many libidn's are built with libiconv, how many are actually getting locale specific translations and not UTF-8? I really don't know.

@gvanem
Copy link
Contributor

gvanem commented Feb 9, 2016

CHCP 65001 in a batch file and create a new console with cmd.exe /k
utf8.bat file f.e. as changing it using the registry or system wide have
other side effects.

@pierrejoye Using chcp 65001 globally breaks some command-line tools here (e.g. list from Windows Debug Tools). So that is out the question for me. Or do you mean launching curl via such an utf8.bat? Out of the question too.

PS. I use 4NT 5 (Unicode) as my shell. From it's help-file:
When you use CHCP under Windows it only affects the current process, and any new
programs started from within that process; the active code page in other processes
remains unchanged.

cmd behaves the same in this regard I think.

@pierrejoye
Copy link
Contributor

@gvanem Sorry if I was not clear.
What I mean is exactly what you said. Most of the time enabling Unicode globally for the shell is not an option.
using cmd /k to create a specific console with UTF-8 (or whatever fits).
But I miss your point. You cannot (for good reasons) have UTF8 globally and you do not want to create a UTF8 console either? I do not 4NT 5, I use conemu and do the cmd calls I need in the settings. I suppose 4NT 5 is simply unicode by default, but is it UTF-8? or does it use UCS/Windows Unicode Wide chars or other?

In any case, what should be used by curl should be detected during the arguments processing. The underlying curl APIs can or should remain using UTF-8 for portability and ease of use reasons. That's at least what I can see in many other libraries and it works quite well in a portable manner. It leaves the responsibility to the caller to provide UTF-8 or ASCII. For example, in our case, the caller is the curl.exe command.

@jay I think it is nice to do yes, checking if it is ASCII or not. I am not sure about actually validating UTF8 (or other) as it should be ideally done at a later stage and fails accordingly.

@pierrejoye
Copy link
Contributor

@gvanem by the way, about building with -DUNICODE -D_UNICODE:

I would rather avoid it and uses the *W APIs accordingly. It is by far easier and cleaner to maintain and less magic happens. It also makes crystal clear what is being called and why.

@gvanem
Copy link
Contributor

gvanem commented Feb 9, 2016

@pierrejoye

.. or does it use UCS/Windows Unicode Wide chars or other?

Yes, built using -DUNICODE as Microsoft wants (or wanted?) us to do. Now the shift seems to be towards UTF-8 everywhere else. No? That doesn't make this messy internationalisation issue anything easier...

I would rather avoid it and uses the *W APIs accordingly.

On that I agree.

@pierrejoye
Copy link
Contributor

@gvanem We agree then :)

I will try to work on a proposal/PR for the command line tool(s) and underlying APIs. I need it for PHP anyway (the APIs part).

@jay
Copy link
Member

jay commented Feb 9, 2016

I envision this spiraling out of control with a whole lot of supporting wide character functions that would be more work to maintain. We have to pass char* strings around all the time and it might work better if any user input is retrieved with W APIs and then converted to UTF-8 and passed around as UTF-8 in the library. Regardless, I look forward to seeing what you write. I will get started tomorrow on a draft of UTF-8 autodetect for host->name so that we have an idea how that might work in practice.

@pierrejoye
Copy link
Contributor

@jay For example, the idea of using native windows IDN functions was to reduce the amount of dependency (which requires quite some work to keep up with :).

I think we agree or talk about the same thing. On windows there is no other solution than working with the wide chars APIs when it comes to Unicode. Even using the _UNICODE build mode is about that. They are only tricks to redefine native functions to use the A or W version of an API.

Per default on linux, UTF8 is considered as available, with some checks if a string is pure ASCII or not. It can be exactly the same for Windows. The only annoyance, inevitable, is the conversion between wide char and UTF8 before calling a windows function, but that's easy to do and not very intrusive, only utf8towidechar and widechartoutf8 being used.

Is it what you have in mind too?

@jay
Copy link
Member

jay commented Feb 10, 2016

Right. UNICODE is for the Windows API (MessageBox => MessageBoxW) and _UNICODE is for the CRT API (_tcslen => wcslen). The latter _t functions are rarely used throughout the codebase, but they are used in SSPI so it looks like someone had in mind Unicode builds for Windows. I'm going to try UTF-8 conversion of the args and see how that looks. It seems like either way whether we're wide char or UTF-8 there's still going to be conversions when we call fopen or wfopen, etc. setlocale doesn't support UTF-8 so maybe it's not as good an idea as I thought. I didn't have time today but tomorrow I'll try to get to a draft.

@jay
Copy link
Member

jay commented Feb 11, 2016

@pierrejoye I made a draft but frankly I don't really feel like taking it on for assignment. Refer to my latest comment in 345.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants