tool_cb_wrt: fix invalid unicode for windows console #10890

Closed
wants to merge 1 commit into from

jay
Member

@jay jay commented Apr 5, 2023

  • Suppress an incomplete UTF-8 sequence at the end of the buffer.

  • Attempt to reconstruct incomplete UTF-8 sequence from prior call(s) in current call.

Prior to this change, in the Windows console, a UTF-8 sequence split across two or more calls to the write callback caused the replacement character U+FFFD to be printed instead of the actual Unicode character. This is because the Windows console only prints UTF-16 encoded characters, so we convert the UTF-8 content to UTF-16, and that conversion cannot be done on a partial UTF-8 sequence.
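
To illustrate the first bullet, here is a minimal sketch in C of how an incomplete UTF-8 sequence at the end of a buffer could be detected. The helper name `utf8_incomplete_tail` is hypothetical and this is not the code from the commit; it relies only on the standard UTF-8 lead-byte and continuation-byte bit patterns.

```c
#include <stddef.h>

/* Hypothetical helper, not curl's actual code. Returns how many bytes at
   the end of buf belong to a UTF-8 sequence that is not yet complete
   (0 to 3). Returns 0 when the buffer ends on a complete character. */
static size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
  size_t i;

  /* An incomplete sequence has its lead byte within the last 3 bytes */
  for(i = 1; i <= 3 && i <= len; i++) {
    unsigned char c = buf[len - i];
    size_t need;

    if((c & 0xC0) == 0x80)
      continue; /* continuation byte, keep looking for the lead byte */

    if((c & 0xE0) == 0xC0)
      need = 2; /* 110xxxxx starts a 2-byte sequence */
    else if((c & 0xF0) == 0xE0)
      need = 3; /* 1110xxxx starts a 3-byte sequence */
    else if((c & 0xF8) == 0xF0)
      need = 4; /* 11110xxx starts a 4-byte sequence */
    else
      return 0; /* ASCII or invalid lead byte: nothing to hold back */

    /* i bytes of the sequence are present; incomplete if fewer than need */
    return (i < need) ? i : 0;
  }
  return 0; /* empty buffer or only continuation bytes in the tail */
}
```

With a return value of `n`, a write callback could convert only the first `len - n` bytes to UTF-16 and keep the last `n` bytes to prepend to the data of the next call, which is the idea behind the second bullet.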

Reported-by: Maksim Arhipov

Fixes #9841
Closes #xxx


Untested, WIP. Also I have no automated way to test for the issue this fixes because it only happens in the Windows console.

@jay jay added the Windows Windows-specific label Apr 5, 2023
@vszakats vszakats added the unicode Unicode, code page, character encoding label Apr 5, 2023
@OSPanel

OSPanel commented Apr 26, 2023

I can't compile the binary myself, but I can test a prebuilt binary if needed. Is there a link to a compiled version?

@jay
Member Author

jay commented Apr 27, 2023

You can download artifacts from the most recent appveyor CI jobs, for example

https://ci.appveyor.com/project/curlorg/curl/builds/46705722
CMake, VS2022, Debug x64, Schannel, Static, Unicode
You'll also need vcruntime140d.dll and ucrtbased.dll from Visual C++ 2022. See curl.zip

Other builds are available for download that do not require the Visual C++ debug DLLs, but they may need other DLLs such as OpenSSL.

@OSPanel

OSPanel commented Apr 27, 2023

There are no more display artifacts, everything is fine! Thank you! (I updated this comment because I tested the build incorrectly at first.)

@OSPanel

OSPanel commented Apr 27, 2023

The bug is defeated :-) I hope we'll see the fix in the release soon, thanks!

@jay jay marked this pull request as ready for review June 6, 2023 07:36
@OSPanel

OSPanel commented Jul 24, 2023

Why was this fix not included in the latest release? Is no one able to deal with this?

@jay
Member Author

jay commented Jul 25, 2023

This was held up because I don't have a test for it. I will give it another look to see if we can find some way to test it, since our test apparatus doesn't cover situations like this. I think this is a worthwhile addition, and unfortunately I may have to add it without a test.

@jay jay closed this in af3f4e4 Aug 1, 2023
@jay jay deleted the utf8_win_console_fix branch August 1, 2023 07:36
ptitSeb pushed a commit to wasix-org/curl that referenced this pull request Sep 25, 2023
- Suppress an incomplete UTF-8 sequence at the end of the buffer.

- Attempt to reconstruct incomplete UTF-8 sequence from prior call(s)
  in current call.

Prior to this change, in Windows console UTF-8 sequences split between
two or more calls to the write callback would cause invalid "replacement
characters" U+FFFD to be printed instead of the actual Unicode
character. This is because in Windows only UTF-16 encoded characters are
printed to the console, therefore we convert the UTF-8 contents to
UTF-16, which cannot be done with partial UTF-8 sequences.

Reported-by: Maksim Arhipov

Fixes curl#9841
Closes curl#10890
Labels
cmdline tool; unicode (Unicode, code page, character encoding); Windows (Windows-specific)
Development

Successfully merging this pull request may close these issues.

broken UTF-8 encoded content terminal output on Windows
3 participants