New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better error checking for Win32 IDN functions #637

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
5 participants
@mkauf
Contributor

mkauf commented Feb 5, 2016

  • Return 0 (error) if Curl_convert_UTF8_to_wchar() fails
  • Return 0 (error) if the input string is NULL
  • Only return 1 (success) if the operation has really succeeded
@mention-bot

This comment has been minimized.

Show comment
Hide comment
@mention-bot

mention-bot Feb 5, 2016

By analyzing the blame information on this pull request, we identified @pierrejoye, @yangtse and @bagder to be potential reviewers

mention-bot commented Feb 5, 2016

By analyzing the blame information on this pull request, we identified @pierrejoye, @yangtse and @bagder to be potential reviewers

mkauf referenced this pull request Feb 5, 2016

@jay

This comment has been minimized.

Show comment
Hide comment
@jay

jay Feb 5, 2016

Member

Ref 5d7c937#commitcomment-15913422

There's no path for it to be NULL afaict. Personally I'd prefer an early-exit pattern in a function like this but your changes will be fine. Also, unrelated to your changes, that curl_win32_ascii_to_idn looks wrong I don't know why it is casting char *in to wchar_t * that doesn't look right I'll have to check on that.

Member

jay commented Feb 5, 2016

Ref 5d7c937#commitcomment-15913422

There's no path for it to be NULL afaict. Personally I'd prefer an early-exit pattern in a function like this but your changes will be fine. Also, unrelated to your changes, that curl_win32_ascii_to_idn looks wrong I don't know why it is casting char *in to wchar_t * that doesn't look right I'll have to check on that.

@jay jay self-assigned this Feb 5, 2016

@mkauf

This comment has been minimized.

Show comment
Hide comment
@mkauf

mkauf Feb 5, 2016

Contributor

That cast in curl_win32_ascii_to_idn is wrong, Curl_convert_UTF8_to_wchar() should be used instead (like in curl_win32_idn_to_ascii). But the whole function is currently unused. Maybe we can remove it.

Contributor

mkauf commented Feb 5, 2016

That cast in curl_win32_ascii_to_idn is wrong, Curl_convert_UTF8_to_wchar() should be used instead (like in curl_win32_idn_to_ascii). But the whole function is currently unused. Maybe we can remove it.

@jay

This comment has been minimized.

Show comment
Hide comment
@jay

jay Feb 5, 2016

Member

It might be useful in the future I'd rather fix it and then leave it be. Weird, I don't recall seeing an unused function warning when I built for WinIDN, maybe that warning is disabled somewhere.

Member

jay commented Feb 5, 2016

It might be useful in the future I'd rather fix it and then leave it be. Weird, I don't recall seeing an unused function warning when I built for WinIDN, maybe that warning is disabled somewhere.

- Better error checking for Win32 IDN functions
- Fix a conversion bug in the unused function curl_win32_ascii_to_idn()
@mkauf

This comment has been minimized.

Show comment
Hide comment
@mkauf

mkauf Feb 6, 2016

Contributor

I have updated the pull request with a bugfix for curl_win32_ascii_to_idn.

Contributor

mkauf commented Feb 6, 2016

I have updated the pull request with a bugfix for curl_win32_ascii_to_idn.

jay added a commit that referenced this pull request Feb 6, 2016

idn_win32: Better error checking
.. also fix a conversion bug in the unused function
curl_win32_ascii_to_idn().

And remove wprintfs on error (Jay).

Bug: #637
@jay

This comment has been minimized.

Show comment
Hide comment
@jay

jay Feb 6, 2016

Member

Modified in 3 ways, otherwise it's good:

  1. Pass the length+1 of in_w to IdnToUnicode. My reasoning for that is the IdnToUnicode doc does not address the use of -1 the way IdnToAscii does (although it does seem to work). I don't like that it's undocumented. The remarks in Return value say: "The retrieved string is null-terminated only if the input string is null-terminated". I addressed this by changing the input stream character count to length+1.
  2. free(in_w) after IdnToUnicode.
  3. Remove wprintf statements. It's wrong to be sending that output to stdout. The responsibility is on the caller to show the error from a function like that via warnf or something. It already does this in url.c.

Landed in 9e7fcd4. Thanks Michael, and thanks Gisle for getting the ball rolling.

Member

jay commented Feb 6, 2016

Modified in 3 ways, otherwise it's good:

  1. Pass the length+1 of in_w to IdnToUnicode. My reasoning for that is the IdnToUnicode doc does not address the use of -1 the way IdnToAscii does (although it does seem to work). I don't like that it's undocumented. The remarks in Return value say: "The retrieved string is null-terminated only if the input string is null-terminated". I addressed this by changing the input stream character count to length+1.
  2. free(in_w) after IdnToUnicode.
  3. Remove wprintf statements. It's wrong to be sending that output to stdout. The responsibility is on the caller to show the error from a function like that via warnf or something. It already does this in url.c.

Landed in 9e7fcd4. Thanks Michael, and thanks Gisle for getting the ball rolling.

@jay jay closed this Feb 6, 2016

@gvanem

This comment has been minimized.

Show comment
Hide comment
@gvanem

gvanem Feb 7, 2016

Member

Jay> The responsibility is on the caller to show the error from a function like that via warnf or something.
Jay> It already does this in url.c.

The warnings specific to WinIDN are non-existent; no help in case the Active CodePage is wrong.
Maybe I cook up a Curl_idn_strerror() for USE_WIN32_IDN?

Also it seems UTF8 input-encoding is assumed throughout libcurl/WinIDN (?). But where is the call to GetACP() to check that this is true? Can you enlighten me, Michael?

Member

gvanem commented Feb 7, 2016

Jay> The responsibility is on the caller to show the error from a function like that via warnf or something.
Jay> It already does this in url.c.

The warnings specific to WinIDN are non-existent; no help in case the Active CodePage is wrong.
Maybe I cook up a Curl_idn_strerror() for USE_WIN32_IDN?

Also it seems UTF8 input-encoding is assumed throughout libcurl/WinIDN (?). But where is the call to GetACP() to check that this is true? Can you enlighten me, Michael?

@jay

This comment has been minimized.

Show comment
Hide comment
@jay

jay Feb 8, 2016

Member

You are right WinIDN does not work properly from URLs provided by the curl tool, that is probably related to #345. From libcurl UTF-8 is expected. In Visual Studio as far as I know this can be expressed one of two ways. If you have saved your source as UTF-8 with BOM Visual Studio will do translations to the local codepage for literal char strings but if you have saved your source as UTF-8 without BOM it will not do any translation. Therefore either of these is acceptable:

  // connect to http://россия.net
  curl_easy_setopt(curl, CURLOPT_URL,
    // "http://\xD1\x80\xD0\xBE\xD1\x81\xD1\x81\xD0\xB8\xD1\x8F.net"); // use if src is UTF-8 with BOM
    "http://россия.net"); // use if src is UTF-8 without BOM

and WinIDN will convert that UTF-8 string to xn--h1alffa9f.net and make the connection.

So one issue is whether a URL passed to CURLOPT_URL is supposed to be in UTF-8 format or the local codepage. It doesn't appear to be documented either way, actually.

Another issue is being able to access Unicode from the command prompt using the curl tool. Frankly I've been hoping someone would come along and work on that, but it's been open too long we'll probably have to put it in the TODO.

Member

jay commented Feb 8, 2016

You are right WinIDN does not work properly from URLs provided by the curl tool, that is probably related to #345. From libcurl UTF-8 is expected. In Visual Studio as far as I know this can be expressed one of two ways. If you have saved your source as UTF-8 with BOM Visual Studio will do translations to the local codepage for literal char strings but if you have saved your source as UTF-8 without BOM it will not do any translation. Therefore either of these is acceptable:

  // connect to http://россия.net
  curl_easy_setopt(curl, CURLOPT_URL,
    // "http://\xD1\x80\xD0\xBE\xD1\x81\xD1\x81\xD0\xB8\xD1\x8F.net"); // use if src is UTF-8 with BOM
    "http://россия.net"); // use if src is UTF-8 without BOM

and WinIDN will convert that UTF-8 string to xn--h1alffa9f.net and make the connection.

So one issue is whether a URL passed to CURLOPT_URL is supposed to be in UTF-8 format or the local codepage. It doesn't appear to be documented either way, actually.

Another issue is being able to access Unicode from the command prompt using the curl tool. Frankly I've been hoping someone would come along and work on that, but it's been open too long we'll probably have to put it in the TODO.

@gvanem

This comment has been minimized.

Show comment
Hide comment
@gvanem

gvanem Feb 8, 2016

Member

.. that is probably related to #345.

In which Viktor Szakats proposed to build curl.exe with -DUNICODE -D_UNICODE. I also tried that back in 2014, but failed. Trying this now, there were only these warnings:

tool_homedir.c(42): warning C4133: 'function': incompatible types - from 'const char *' to 'LPCWSTR'
tool_homedir.c(42): warning C4133: 'function': incompatible types - from 'char [1024]' to 'LPWSTR'
tool_homedir.c(49): warning C4133: 'function': incompatible types - from 'const char *' to 'LPCWSTR'
tool_homedir.c(49): warning C4133: 'function': incompatible types - from 'char [1024]' to 'LPWSTR'

(which are simple to fix). But I still have problems with e.g. curl.exe www.seoghør.no and ACE-conversion. The only way that works is a idn-test.bat with:

@echo url = "www.seoghør.no" | iconv -f iso8859-1 -t utf-8 > %TEMP\url-file
@curl -K %TEMP\url-file
Member

gvanem commented Feb 8, 2016

.. that is probably related to #345.

In which Viktor Szakats proposed to build curl.exe with -DUNICODE -D_UNICODE. I also tried that back in 2014, but failed. Trying this now, there were only these warnings:

tool_homedir.c(42): warning C4133: 'function': incompatible types - from 'const char *' to 'LPCWSTR'
tool_homedir.c(42): warning C4133: 'function': incompatible types - from 'char [1024]' to 'LPWSTR'
tool_homedir.c(49): warning C4133: 'function': incompatible types - from 'const char *' to 'LPCWSTR'
tool_homedir.c(49): warning C4133: 'function': incompatible types - from 'char [1024]' to 'LPWSTR'

(which are simple to fix). But I still have problems with e.g. curl.exe www.seoghør.no and ACE-conversion. The only way that works is a idn-test.bat with:

@echo url = "www.seoghør.no" | iconv -f iso8859-1 -t utf-8 > %TEMP\url-file
@curl -K %TEMP\url-file
@mkauf

This comment has been minimized.

Show comment
Hide comment
@mkauf

mkauf Feb 8, 2016

Contributor

@jay Thank you for improving the patch!

@gvanem You have found another bug... see @jay's comment. A quick hack (not the real bugfix) is to replace CP_UTF8 with CP_ACP in Curl_convert_UTF8_to_wchar()

Contributor

mkauf commented Feb 8, 2016

@jay Thank you for improving the patch!

@gvanem You have found another bug... see @jay's comment. A quick hack (not the real bugfix) is to replace CP_UTF8 with CP_ACP in Curl_convert_UTF8_to_wchar()

@mkauf mkauf deleted the mkauf:win-idn-bugfix branch Feb 8, 2016

@pierrejoye

This comment has been minimized.

Show comment
Hide comment
@pierrejoye

pierrejoye Feb 8, 2016

Contributor

Hi,

On Feb 8, 2016 7:30 PM, "Michael Kaufmann" notifications@github.com wrote:

@jay Thank you for improving the patch!

@gvanem You have found another bug... see @jay's comment. A quick hack
(not the real bugfix) is to replace CP_UTF8 with CP_ACP in
Curl_convert_UTF8_to_wchar()

I do not think there is a bug per se in this specific part. Eventually in
the command line arguments parsing but not here.

Please do not replace UTF8 with ACP there as any valid code using curl via
the APIs may break.

To get a UTF8 console on windows one can use:

CHCP 65001 in a batch file and create a new console with cmd.exe /k
utf8.bat file f.e. as changing it using the registry or system wide have
other side effects.


Reply to this email directly or view it on GitHub.

Contributor

pierrejoye commented Feb 8, 2016

Hi,

On Feb 8, 2016 7:30 PM, "Michael Kaufmann" notifications@github.com wrote:

@jay Thank you for improving the patch!

@gvanem You have found another bug... see @jay's comment. A quick hack
(not the real bugfix) is to replace CP_UTF8 with CP_ACP in
Curl_convert_UTF8_to_wchar()

I do not think there is a bug per se in this specific part. Eventually in
the command line arguments parsing but not here.

Please do not replace UTF8 with ACP there as any valid code using curl via
the APIs may break.

To get a UTF8 console on windows one can use:

CHCP 65001 in a batch file and create a new console with cmd.exe /k
utf8.bat file f.e. as changing it using the registry or system wide have
other side effects.


Reply to this email directly or view it on GitHub.

@jay

This comment has been minimized.

Show comment
Hide comment
@jay

jay Feb 8, 2016

Member

@pierrejoye The issue though is this is undocumented and the behavior appears to be different depending on the IDN library. For libidn we're calling idna_to_ascii_lz which does the current locale. One would expect that WinIDN (aka normaliz.lib) would do the same but instead as Gisle pointed out we're using UTF-8 regardless of locale. Currently UTF-8 on the command line isn't compatible with the curl tool in Windows, see my comments in #345.

Personally I'd rather WinIDN behave the same as libidn. At this point in order to preserve backwards compatibility I don't know what we can do, maybe UTF-8 detection or something.

Also: There is a mothballed function I wrote which may be useful, utf8_strict_codepoint_count

Member

jay commented Feb 8, 2016

@pierrejoye The issue though is this is undocumented and the behavior appears to be different depending on the IDN library. For libidn we're calling idna_to_ascii_lz which does the current locale. One would expect that WinIDN (aka normaliz.lib) would do the same but instead as Gisle pointed out we're using UTF-8 regardless of locale. Currently UTF-8 on the command line isn't compatible with the curl tool in Windows, see my comments in #345.

Personally I'd rather WinIDN behave the same as libidn. At this point in order to preserve backwards compatibility I don't know what we can do, maybe UTF-8 detection or something.

Also: There is a mothballed function I wrote which may be useful, utf8_strict_codepoint_count

@pierrejoye

This comment has been minimized.

Show comment
Hide comment
@pierrejoye

pierrejoye Feb 9, 2016

Contributor

Yes, I saw it :)

As they seem related there different issues.

One is how curl command line manages the input argument encoding.

The 2nd issue is the usage of the various windows APIs, as far as I
remember only the non wide char apis us used.

The 3rd issue or question is how curl API should work. To me it should keep
using UTF-8 and really not rely on runtime charset setting, no matter the
platform. It makes the caller and maintainers way easier. @badger thoughts?

In any case, I am in the process of making php on windows Unicode friendly
and curl is part of this effort. I can provide a PR and we can discuss
specifics there?
On Feb 9, 2016 5:14 AM, "Jay Satiro" notifications@github.com wrote:

@pierrejoye https://github.com/pierrejoye The issue though is this is
undocumented and the behavior appears to be different depending on the IDN
library. For libidn we're calling idna_to_ascii_lz which does the current
locale. One would expect that WinIDN (aka normaliz.lib) would do the same
but instead as Gisle pointed out we're using UTF-8 regardless of locale.
Currently UTF-8 on the command line isn't compatible with the curl tool in
Windows, see my comments in #345 #345
.

Personally I'd rather WinIDN behave the same as libidn. At this point in
order to preserve backwards compatibility I don't know what we can do,
maybe UTF-8 detection or something.

Also: There is a mothballed function I wrote which may be useful,
utf8_strict_codepoint_count
master...jay:check_utf8_before_libidndiff-7bfe625cf96f5f724dd32ce3e5a1ff6fR343


Reply to this email directly or view it on GitHub
#637 (comment).

Contributor

pierrejoye commented Feb 9, 2016

Yes, I saw it :)

As they seem related there different issues.

One is how curl command line manages the input argument encoding.

The 2nd issue is the usage of the various windows APIs, as far as I
remember only the non wide char apis us used.

The 3rd issue or question is how curl API should work. To me it should keep
using UTF-8 and really not rely on runtime charset setting, no matter the
platform. It makes the caller and maintainers way easier. @badger thoughts?

In any case, I am in the process of making php on windows Unicode friendly
and curl is part of this effort. I can provide a PR and we can discuss
specifics there?
On Feb 9, 2016 5:14 AM, "Jay Satiro" notifications@github.com wrote:

@pierrejoye https://github.com/pierrejoye The issue though is this is
undocumented and the behavior appears to be different depending on the IDN
library. For libidn we're calling idna_to_ascii_lz which does the current
locale. One would expect that WinIDN (aka normaliz.lib) would do the same
but instead as Gisle pointed out we're using UTF-8 regardless of locale.
Currently UTF-8 on the command line isn't compatible with the curl tool in
Windows, see my comments in #345 #345
.

Personally I'd rather WinIDN behave the same as libidn. At this point in
order to preserve backwards compatibility I don't know what we can do,
maybe UTF-8 detection or something.

Also: There is a mothballed function I wrote which may be useful,
utf8_strict_codepoint_count
master...jay:check_utf8_before_libidndiff-7bfe625cf96f5f724dd32ce3e5a1ff6fR343


Reply to this email directly or view it on GitHub
#637 (comment).

@jay

This comment has been minimized.

Show comment
Hide comment
@jay

jay Feb 9, 2016

Member

Sure. But what do you think of the idea I mentioned where for Windows we do an autodetect like

if(!is_ASCII_name(host->name)) {
  if(utf8_strict_codepoint_count(host->name) > 0)
    /* string is valid strict UTF-8 do UTF-8 => wchar => IDN */
  else
    /* string is invalid strict UTF-8 do locale => wchar => IDN */
}

IIRC libidn when not built with libiconv will internally default to UTF-8 encoding even when idna_to_ascii_lz is used (I last read it about 8 months ago, that may have changed). So how many libidn's are built with libiconv, how many are actually getting locale specific translations and not UTF-8? I really don't know.

Member

jay commented Feb 9, 2016

Sure. But what do you think of the idea I mentioned where for Windows we do an autodetect like

if(!is_ASCII_name(host->name)) {
  if(utf8_strict_codepoint_count(host->name) > 0)
    /* string is valid strict UTF-8 do UTF-8 => wchar => IDN */
  else
    /* string is invalid strict UTF-8 do locale => wchar => IDN */
}

IIRC libidn when not built with libiconv will internally default to UTF-8 encoding even when idna_to_ascii_lz is used (I last read it about 8 months ago, that may have changed). So how many libidn's are built with libiconv, how many are actually getting locale specific translations and not UTF-8? I really don't know.

@gvanem

This comment has been minimized.

Show comment
Hide comment
@gvanem

gvanem Feb 9, 2016

Member

CHCP 65001 in a batch file and create a new console with cmd.exe /k
utf8.bat file f.e. as changing it using the registry or system wide have
other side effects.

@pierrejoye Using chcp 65001 globally breaks some command-line tools here (e.g. list from Windows Debug Tools). So that is out the question for me. Or do you mean launching curl via such an utf8.bat? Out of the question too.

PS. I use 4NT 5 (Unicode) as my shell. From it's help-file:
When you use CHCP under Windows it only affects the current process, and any new
programs started from within that process; the active code page in other processes
remains unchanged.

cmd behaves the same in this regard I think.

Member

gvanem commented Feb 9, 2016

CHCP 65001 in a batch file and create a new console with cmd.exe /k
utf8.bat file f.e. as changing it using the registry or system wide have
other side effects.

@pierrejoye Using chcp 65001 globally breaks some command-line tools here (e.g. list from Windows Debug Tools). So that is out the question for me. Or do you mean launching curl via such an utf8.bat? Out of the question too.

PS. I use 4NT 5 (Unicode) as my shell. From it's help-file:
When you use CHCP under Windows it only affects the current process, and any new
programs started from within that process; the active code page in other processes
remains unchanged.

cmd behaves the same in this regard I think.

@pierrejoye

This comment has been minimized.

Show comment
Hide comment
@pierrejoye

pierrejoye Feb 9, 2016

Contributor

@gvanem Sorry if I was not clear.
What I mean is exactly what you said. Most of the time enabling Unicode globally for the shell is not an option.
using cmd /k to create a specific console with UTF-8 (or whatever fits).
But I miss your point. You cannot (for good reasons) have UTF8 globally and you do not want to create a UTF8 console either? I do not 4NT 5, I use conemu and do the cmd calls I need in the settings. I suppose 4NT 5 is simply unicode by default, but is it UTF-8? or does it use UCS/Windows Unicode Wide chars or other?

In any case, what should be used by curl should be detected during the arguments processing. The underlying curl APIs can or should remain using UTF-8 for portability and ease of use reasons. That's at least what I can see in many other libraries and it works quite well in a portable manner. It leaves the responsibility to the caller to provide UTF-8 or ASCII. For example, in our case, the caller is the curl.exe command.

@jay I think it is nice to do yes, checking if it is ASCII or not. I am not sure about actually validating UTF8 (or other) as it should be ideally done at a later stage and fails accordingly.

Contributor

pierrejoye commented Feb 9, 2016

@gvanem Sorry if I was not clear.
What I mean is exactly what you said. Most of the time enabling Unicode globally for the shell is not an option.
using cmd /k to create a specific console with UTF-8 (or whatever fits).
But I miss your point. You cannot (for good reasons) have UTF8 globally and you do not want to create a UTF8 console either? I do not 4NT 5, I use conemu and do the cmd calls I need in the settings. I suppose 4NT 5 is simply unicode by default, but is it UTF-8? or does it use UCS/Windows Unicode Wide chars or other?

In any case, what should be used by curl should be detected during the arguments processing. The underlying curl APIs can or should remain using UTF-8 for portability and ease of use reasons. That's at least what I can see in many other libraries and it works quite well in a portable manner. It leaves the responsibility to the caller to provide UTF-8 or ASCII. For example, in our case, the caller is the curl.exe command.

@jay I think it is nice to do yes, checking if it is ASCII or not. I am not sure about actually validating UTF8 (or other) as it should be ideally done at a later stage and fails accordingly.

@pierrejoye

This comment has been minimized.

Show comment
Hide comment
@pierrejoye

pierrejoye Feb 9, 2016

Contributor

@gvanem by the way, about building with -DUNICODE -D_UNICODE:

I would rather avoid it and uses the *W APIs accordingly. It is by far easier and cleaner to maintain and less magic happens. It also makes crystal clear what is being called and why.

Contributor

pierrejoye commented Feb 9, 2016

@gvanem by the way, about building with -DUNICODE -D_UNICODE:

I would rather avoid it and uses the *W APIs accordingly. It is by far easier and cleaner to maintain and less magic happens. It also makes crystal clear what is being called and why.

@gvanem

This comment has been minimized.

Show comment
Hide comment
@gvanem

gvanem Feb 9, 2016

Member

@pierrejoye

.. or does it use UCS/Windows Unicode Wide chars or other?

Yes, built using -DUNICODE as Microsoft wants (or wanted?) us to do. Now the shift seems to be towards UTF-8 everywhere else. No? That doesn't make this messy internationalisation issue anything easier...

I would rather avoid it and uses the *W APIs accordingly.

On that I agree.

Member

gvanem commented Feb 9, 2016

@pierrejoye

.. or does it use UCS/Windows Unicode Wide chars or other?

Yes, built using -DUNICODE as Microsoft wants (or wanted?) us to do. Now the shift seems to be towards UTF-8 everywhere else. No? That doesn't make this messy internationalisation issue anything easier...

I would rather avoid it and uses the *W APIs accordingly.

On that I agree.

@pierrejoye

This comment has been minimized.

Show comment
Hide comment
@pierrejoye

pierrejoye Feb 9, 2016

Contributor

@gvanem We agree then :)

I will try to work on a proposal/PR for the command line tool(s) and underlying APIs. I need it for PHP anyway (the APIs part).

Contributor

pierrejoye commented Feb 9, 2016

@gvanem We agree then :)

I will try to work on a proposal/PR for the command line tool(s) and underlying APIs. I need it for PHP anyway (the APIs part).

@jay

This comment has been minimized.

Show comment
Hide comment
@jay

jay Feb 9, 2016

Member

I envision this spiraling out of control with a whole lot of supporting wide character functions that would be more work to maintain. We have to pass char* strings around all the time and it might work better if any user input is retrieved with W APIs and then converted to UTF-8 and passed around as UTF-8 in the library. Regardless, I look forward to seeing what you write. I will get started tomorrow on a draft of UTF-8 autodetect for host->name so that we have an idea how that might work in practice.

Member

jay commented Feb 9, 2016

I envision this spiraling out of control with a whole lot of supporting wide character functions that would be more work to maintain. We have to pass char* strings around all the time and it might work better if any user input is retrieved with W APIs and then converted to UTF-8 and passed around as UTF-8 in the library. Regardless, I look forward to seeing what you write. I will get started tomorrow on a draft of UTF-8 autodetect for host->name so that we have an idea how that might work in practice.

@pierrejoye

This comment has been minimized.

Show comment
Hide comment
@pierrejoye

pierrejoye Feb 9, 2016

Contributor

@jay For example, the idea of using native windows IDN functions was to reduce the amount of dependency (which requires quite some work to keep up with :).

I think we agree or talk about the same thing. On windows there is no other solution than working with the wide chars APIs when it comes to Unicode. Even using the _UNICODE build mode is about that. They are only tricks to redefine native functions to use the A or W version of an API.

Per default on linux, UTF8 is considered as available, with some checks if a string is pure ASCII or not. It can be exactly the same for Windows. The only annoyance, inevitable, is the conversion between wide char and UTF8 before calling a windows function, but that's easy to do and not very intrusive, only utf8towidechar and widechartoutf8 being used.

Is it what you have in mind too?

Contributor

pierrejoye commented Feb 9, 2016

@jay For example, the idea of using native windows IDN functions was to reduce the amount of dependency (which requires quite some work to keep up with :).

I think we agree or talk about the same thing. On windows there is no other solution than working with the wide chars APIs when it comes to Unicode. Even using the _UNICODE build mode is about that. They are only tricks to redefine native functions to use the A or W version of an API.

Per default on linux, UTF8 is considered as available, with some checks if a string is pure ASCII or not. It can be exactly the same for Windows. The only annoyance, inevitable, is the conversion between wide char and UTF8 before calling a windows function, but that's easy to do and not very intrusive, only utf8towidechar and widechartoutf8 being used.

Is it what you have in mind too?

@jay

This comment has been minimized.

Show comment
Hide comment
@jay

jay Feb 10, 2016

Member

Right. UNICODE is for the Windows API (MessageBox => MessageBoxW) and _UNICODE is for the CRT API (_tcslen => wcslen). The latter _t functions are rarely used throughout the codebase, but they are used in SSPI so it looks like someone had in mind Unicode builds for Windows. I'm going to try UTF-8 conversion of the args and see how that looks. It seems like either way whether we're wide char or UTF-8 there's still going to be conversions when we call fopen or wfopen, etc. setlocale doesn't support UTF-8 so maybe it's not as good an idea as I thought. I didn't have time today but tomorrow I'll try to get to a draft.

Member

jay commented Feb 10, 2016

Right. UNICODE is for the Windows API (MessageBox => MessageBoxW) and _UNICODE is for the CRT API (_tcslen => wcslen). The latter _t functions are rarely used throughout the codebase, but they are used in SSPI so it looks like someone had in mind Unicode builds for Windows. I'm going to try UTF-8 conversion of the args and see how that looks. It seems like either way whether we're wide char or UTF-8 there's still going to be conversions when we call fopen or wfopen, etc. setlocale doesn't support UTF-8 so maybe it's not as good an idea as I thought. I didn't have time today but tomorrow I'll try to get to a draft.

@jay

This comment has been minimized.

Show comment
Hide comment
@jay

jay Feb 11, 2016

Member

@pierrejoye I made a draft but frankly I don't really feel like taking it on for assignment. Refer to my latest comment in 345.

Member

jay commented Feb 11, 2016

@pierrejoye I made a draft but frankly I don't really feel like taking it on for assignment. Refer to my latest comment in 345.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment