curl_multibyte: Remove local encoding fallbacks #7257

Closed
wants to merge 1 commit into from

Conversation

jay
Member

@jay jay commented Jun 15, 2021

  • If the UTF-8 to UTF-16 conversion fails in Windows Unicode builds then
    no longer fall back to assuming the string is in a local encoding.

Background:

Some functions in Windows Unicode builds must convert UTF-8 strings to
UTF-16 to pass them to the Windows CRT API wide-character functions, since
in Windows UTF-8 is not a valid locale (or at least not 99% of the time
right now).

Prior to this change, if the Unicode encoding conversion failed then
libcurl would assume, for backwards compatibility with applications that
may have been written for non-Unicode builds, that the string was in the
local encoding and attempt to convert it from that encoding to UTF-16.

That type of "best effort" could theoretically cause some type of
security or other problem if a string that was locally encoded was also
valid UTF-8, and therefore an unexpected UTF-8 to UTF-16 conversion
could occur.
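
The ambiguity described above can be sketched in portable C. This is only an illustration under stated assumptions, not libcurl's code: the hand-rolled decoder `utf8_to_utf16_strict`, the helper `latin1_to_utf16`, and `convert_with_fallback` are hypothetical names, and Latin-1 stands in for "the local encoding" so the example does not depend on the Windows API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strict decoder: UTF-8 in, UTF-16 code units out.
 * Returns the number of units written, or -1 if the input is not valid
 * UTF-8 or the output buffer is too small. No fallback is attempted. */
int utf8_to_utf16_strict(const unsigned char *in, size_t inlen,
                         unsigned short *out, size_t outcap)
{
  size_t i = 0, n = 0;
  while(i < inlen) {
    unsigned long cp;
    size_t extra, k;
    unsigned char b = in[i++];
    if(b < 0x80) { cp = b; extra = 0; }
    else if(b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; extra = 1; }
    else if(b >= 0xE0 && b <= 0xEF) { cp = b & 0x0F; extra = 2; }
    else if(b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; extra = 3; }
    else return -1; /* stray continuation byte or invalid lead byte */
    if(inlen - i < extra)
      return -1; /* truncated sequence */
    for(k = 0; k < extra; k++) {
      unsigned char c = in[i++];
      if((c & 0xC0) != 0x80)
        return -1; /* not a continuation byte */
      cp = (cp << 6) | (c & 0x3F);
    }
    /* reject overlong forms, surrogates and out-of-range values */
    if((extra == 2 && cp < 0x800) || (extra == 3 && cp < 0x10000) ||
       (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
      return -1;
    if(cp >= 0x10000) {
      if(outcap - n < 2)
        return -1;
      cp -= 0x10000;
      out[n++] = (unsigned short)(0xD800 + (cp >> 10));
      out[n++] = (unsigned short)(0xDC00 + (cp & 0x3FF));
    }
    else {
      if(outcap - n < 1)
        return -1;
      out[n++] = (unsigned short)cp;
    }
  }
  return (int)n;
}

/* Stand-in for "local encoding to UTF-16": every Latin-1 byte maps to the
 * code point with the same value. */
int latin1_to_utf16(const unsigned char *in, size_t inlen,
                    unsigned short *out, size_t outcap)
{
  size_t i;
  if(inlen > outcap)
    return -1;
  for(i = 0; i < inlen; i++)
    out[i] = in[i];
  return (int)inlen;
}

/* The "best effort" policy: try UTF-8 first, then guess local encoding. */
int convert_with_fallback(const unsigned char *in, size_t inlen,
                          unsigned short *out, size_t outcap)
{
  int n = utf8_to_utf16_strict(in, inlen, out, outcap);
  return (n >= 0) ? n : latin1_to_utf16(in, inlen, out, outcap);
}

void demo(void)
{
  unsigned short out[8];
  /* "é" in Latin-1 is the single byte 0xE9, which is not valid UTF-8. */
  const unsigned char latin1_e[] = { 0xE9 };
  /* "Ã©" in Latin-1 is 0xC3 0xA9, which is also the UTF-8 form of "é". */
  const unsigned char ambiguous[] = { 0xC3, 0xA9 };

  /* Without a fallback the invalid input is rejected outright. */
  assert(utf8_to_utf16_strict(latin1_e, 1, out, 8) == -1);
  /* With the fallback it is silently reinterpreted as Latin-1. */
  assert(convert_with_fallback(latin1_e, 1, out, 8) == 1);
  assert(out[0] == 0x00E9);
  /* The ambiguous input never reaches the fallback: it decodes as UTF-8
   * to the single unit U+00E9, not the two characters a Latin-1 caller
   * meant. */
  assert(convert_with_fallback(ambiguous, 2, out, 8) == 1);
  assert(out[0] == 0x00E9);
}
```

With a fallback in place, the caller cannot tell from the result which interpretation was used. Returning the failure instead at least makes the outcome predictable.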

Ref: #7246

Closes #xxxx

@jay jay added the feature-window, URL, Windows, and libcurl API labels Jun 15, 2021
@jay jay removed the feature-window label Jun 16, 2021
@jay jay force-pushed the win_remove_local_encoding_fallback branch from 4603a4a to cb52b96 Compare June 16, 2021 06:51
@jay jay closed this in 765e060 Jun 21, 2021
@jay jay deleted the win_remove_local_encoding_fallback branch June 21, 2021 06:09
vszakats added a commit to curl/curl-for-win that referenced this pull request Jul 20, 2021
On closer inspection, the state of Unicode support in libcurl does not
seem to be ready for production. Existing support extended certain Windows
interfaces to use the Unicode flavour of the Windows API, but that also
meant that the expected encoding/codepage of strings (e.g. local filenames,
URLs) exchanged via the libcurl API became ambiguous and undefined.
Previously all strings had to be passed in the active Windows locale, using
an 8-bit codepage. In Unicode libcurl builds, the expected string encoding
became an undocumented mixture of UTF-8 and 8-bit locale, depending on the
actual API/option, certain dynamic and static "fallback" logic inside
libcurl and even in OpenSSL, while some parts of libcurl kept using 8-bit
strings internally. From the user's perspective this poses an unreasonably
difficult task in finding out how to pass a certain non-ASCII string to a
specific API without unwanted or accidental (possibly lossy) conversions or
other side-effects. Missing the correct encoding may result in unexpected
behaviour, e.g. in some cases not finding files, finding different files,
accessing the wrong URL or passing a corrupt username or password.

Note that these issues may _only_ affect strings with _non-ASCII_ content.

For now the best solution seems to be to revert back to how libcurl/curl
worked for most of its existence and only re-enable Unicode once the
remaining parts of Windows Unicode support are well-understood, ironed out
and documented.

Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully
this period had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
vszakats added a commit to curl/curl-for-win that referenced this pull request Jul 20, 2021
On closer inspection, the state of Windows Unicode support in libcurl does
not seem to be ready for production. Existing support extended certain
Windows interfaces to use the Unicode flavour of the Windows API, but that
also meant that the expected encoding/codepage of strings (e.g. local
filenames, URLs) exchanged via the libcurl API became ambiguous and
undefined.

Previously all strings had to be passed in the active Windows locale, using
an 8-bit codepage. In Unicode libcurl builds, the expected string encoding
became an undocumented mixture of UTF-8 and 8-bit locale, depending on the
actual API, build options/dependencies, internal fallback logic based on
runtime auto-detection of passed string, and the result of file operations
(scheduled for removal in 7.78.0), while some parts of libcurl kept using
8-bit strings internally, e.g. when reading the environment.

From the user's perspective this poses an unreasonably complex task in
finding out how to pass (or read) a certain non-ASCII string to (from) a
specific API without unwanted or accidental conversions or other
side-effects. Missing the correct encoding may result in unexpected
behaviour, e.g. in some cases not finding files, reading/writing a
different file, accessing the wrong URL or passing a corrupt username or
password.

Note that these issues may only affect strings with _non-7-bit-ASCII_
content.
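
An application that wants to know whether a given string is exposed to these issues could test it for non-ASCII bytes before handing it to libcurl. The following is a minimal sketch, assuming the active code page is ASCII-compatible; `is_7bit_ascii` and `check_examples` are hypothetical helpers, not part of the libcurl API.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every byte is 7-bit ASCII (0x00..0x7F), otherwise 0.
 * Such strings have the same bytes in UTF-8 and in an ASCII-compatible
 * 8-bit code page, so the encoding question does not arise for them. */
int is_7bit_ascii(const char *s, size_t len)
{
  size_t i;
  for(i = 0; i < len; i++) {
    if((unsigned char)s[i] > 0x7F)
      return 0;
  }
  return 1;
}

void check_examples(void)
{
  /* Plain ASCII filename: unaffected. */
  assert(is_7bit_ascii("report.txt", 10) == 1);
  /* 0xE9 is outside 7-bit ASCII, so the encoding matters here. */
  assert(is_7bit_ascii("caf\xE9.txt", 8) == 0);
}
```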

For now the least bad solution seems to be to revert back to how
libcurl/curl worked for most of its existence and only re-enable Unicode
once the remaining parts of Windows Unicode support are well-understood,
ironed out and documented.

Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully
this period had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
@vszakats vszakats added the Unicode label Feb 16, 2023
Labels
libcurl API, Unicode, URL, Windows