broken UTF-8 encoded content terminal output on Windows #9841
Comments
Does this issue happen only with curl-for-win builds, or with any 7.86.0 Windows build? |
All of these versions have this bug: 7.86, 7.80 and 7.79, on Windows 10 and Windows 7 (other builds and other Windows versions not tested): x64 Other tests: curl 7.85 for Linux (Arch Linux) does not have this error (Kitty/PuTTY terminal). |
Thanks. So this is a long time issue. Can you try with an MSYS2 / Git for Windows / Windows built-in [ This may be related to the Unicode build option. curl-for-win had it enabled in 7.71.0–7.77.0. ] |
MSYS2 (Bug detected):
Git for Windows (Bug detected):
Windows 10 built-in (Bug NOT detected):
|
Thanks @OSPanel. I'm transferring this to the curl project, as the Issue affects all (recent) Windows builds. If you are prepared to make your own curl build, you may try enabling the Unicode feature to see if it resolves this issue. |
I can't compile the binary, but I can check the test binary if needed. |
@OSPanel: You can find fresh Unicode builds here: https://ci.appveyor.com/project/curlorg/curl-for-win/builds/45248995/artifacts |
Thanks. I downloaded and tested these x86+x64 builds (dated 10/26/2022), and this bug was found there :-( |
I freshly made these builds today, right before posting here. They have Unicode enabled (check [ For reproducibility the date is always the same for a given curl version number. ] |
I used this link: https://ci.appveyor.com/project/curlorg/curl-for-win/builds/45248995/artifacts
|
@OSPanel: That's the correct one indeed, and thanks for testing it! Based on your tests, this issue is present in both Unicode and non-Unicode builds. |
I don't see an explanation of exactly what the symptoms of this error are. Is it perhaps code from this commit that causes it? 5bfaa86 |
I'm attempting to revert this patch in the test build here: https://ci.appveyor.com/project/curlorg/curl-for-win/builds/45254038, even though the idea behind the patch seems correct to me, and changing the terminal codepage on the fly was fragile/broken, at least before Windows 10. Binaries here: https://ci.appveyor.com/project/curlorg/curl-for-win/builds/45254038/artifacts |
Thanks. This test build works well; the bug we are discussing did not occur.
|
It means that 5bfaa86 is causing the difference. According to past experience fixing this exact issue in a different project, and this earlier Python Issue, the |
Yes, I think the OP called it correctly. We are passing an incomplete UTF-8 encoding to MultiByteToWideChar more than one time (up to 4 times since there are up to 4 bytes in UTF-8 encoded characters). That results in an invalid character � (U+FFFD) output for each incomplete UTF-8 encoded character. I cannot reproduce with the OP's URL but I have no doubt this issue is valid. Let's take the first mangled output for example:
A caret is pointing to the I think that can be fixed by delaying the conversion and output for incomplete Unicode UTF-8 encoded characters until they are complete. I had at one point written a UTF-8 correctness function for libcurl for other reasons but I don't think we added it. I will see if maybe some of that logic applies here. Also, the comments in those python threads about WriteConsole long output failing concern me. The MS documentation currently says "If the total size of the specified number of characters exceeds the available heap, the function fails with ERROR_NOT_ENOUGH_MEMORY" but according to the python threads it used to reference a specific shared 64k heap that may have an arbitrary size available. That is something we'd need to actually take into consideration though I wonder why they changed the documentation. I will look into this as well. |
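The "delay until complete" idea above can be sketched in a few lines. This is only an illustration in Python, not curl's code: the helper names `incomplete_tail_len` and `split_complete` are made up for this example, and the check only looks at lead and continuation byte patterns, it does not validate UTF-8 in full.

```python
def incomplete_tail_len(buf: bytes) -> int:
    """Return how many trailing bytes of buf begin a UTF-8 sequence
    that is not complete yet (0 if the buffer ends on a boundary)."""
    # A UTF-8 sequence is at most 4 bytes, so an incomplete one is at most 3.
    for back in range(1, min(3, len(buf)) + 1):
        byte = buf[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte: keep looking for the lead byte
        if byte & 0xE0 == 0xC0:
            need = 2
        elif byte & 0xF0 == 0xE0:
            need = 3
        elif byte & 0xF8 == 0xF0:
            need = 4
        else:
            return 0  # ASCII or an invalid lead byte: nothing to hold back
        return back if back < need else 0
    return 0


def split_complete(buf: bytes):
    """Split buf into (bytes safe to convert now, bytes to hold for the next call)."""
    n = incomplete_tail_len(buf)
    return buf[:len(buf) - n], buf[len(buf) - n:]


# One complete box-drawing character followed by the first 2 bytes of another.
ready, held = split_complete(bytes.fromhex("e2 94 80 e2 94"))
print(ready.decode("utf-8"), held.hex())
```

The held bytes would be prepended to the next chunk before conversion, so the converter never sees a partial character.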
Any progress on this @jay? It feels like a rather complicated dilemma. |
I think it's possible. I was focused on other bugs but I will take a look at it tomorrow. |
5bfaa86: I assume that this is a bad fix. You cannot output UTF-8 to the console if the current code page is not 65001. The old code worked well: it saved the console's current code page and restored it after the output was completed. This is a normal approach. If someone has low-quality fonts installed that do not support the needed characters, and their console looks bad as a result, that is not a curl problem; it is a problem with the user's chosen font. Just restore the old code by reverting this commit. |
I've been busy and haven't had a chance to work on it. curl outputs UTF-16 to the console; it is the UTF-8 => UTF-16 conversion that is failing in this case. The user's font may not be able to display some valid Unicode characters that are being output, but IMO that is not a curl issue. |
Yes, before commit 5bfaa86 was applied, curl's console output depended only on the user's font, which is why I suggest restoring the old code. Now that the commit has been accepted, output to the console is completely broken. Could you restore the old code until you get around to changing something? |
The old code was broken in a different way, e.g. it could leave the console in an unstable state (leading to a console crash). |
Any news on fixes yet? |
That last problem doesn't seem related to me. Also, how do you know it's a curl problem and not a problem with the server, network, network stack or terminal? |
The problem is related to encoding, so it is not always reproducible; it depends on the output data. With the city of London the problem is no longer visible, but on about the tenth try I found a location where the corruption reproduces 100% of the time right now (the weather will change by the time you view it, so you may have to pick another location):
|
This is very strange. I can reproduce it in a Linux terminal in wine! I'm using
the official curl 32-bit build:
curl 7.87.0 (i686-w64-mingw32) libcurl/7.87.0 OpenSSL/3.0.8 (Schannel) zlib/1.0.
Release-Date: 2022-12-21
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps sq
Features: alt-svc AsynchDNS brotli gsasl HSTS HTTP2 HTTP3 HTTPS-proxy IDN IPv6dK
But what's strangest is that when I redirect the output to a file and cat it
outside wine, it's always fine. But when I run curl in the Linux terminal with
wine, that line is corrupted probably 80% of the time ATM. Could it be an issue
with output when curl detects a terminal and not when it's directed to a file?
|
So it is, only the output to the console is broken. Read the posts above here. |
I have another clue. By looking at --trace data at the location of the
corruption, you can see this:
```
1690: 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 ................
16a0: 80 e2 94 80 e2 94 80 e2 94 b4 e2 94 80 e2 94 80 ................
16b0: e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 ................
16c0: 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 ................
16d0: 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 ................
16e0: e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 ................
16f0: 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 ................
1700: 80 e2 94 80 e2 94 b4 e2 94 80 e2 94 80 e2 94 80 ................
1710: e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 ................
1720: 94 .
== Info: [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Supplemental data (23):
<= Recv SSL data, 5 bytes (0x5)
0000: 17 03 03 0b c3 .....
<= Recv SSL data, 1 bytes (0x1)
0000: 17 .
<= Recv data, 2994 bytes (0xbb2)
0000: 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 ................
0010: e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 ................
0020: 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 ................
0030: 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 ................
0040: e2 94 98 0a 20 20 20 20 20 20 20 20 20 20 20 20 ....
0050: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
```
The UTF-8 byte sequence for a horizontal box drawing character is e2 94 80
which you can see a lot of here. But what's most relevant here is that the e2
94 comes in at the end of a frame, which is interrupted by some TLS bytes, then
it continues in another data frame with the final 80 byte. Clearly, this
interruption is causing problems somewhere. But where? Is it Windows that
insists that all bytes in a UTF-8 triplet arrive in the same write call and
wine is faithfully emulating the corruption that occurs when it doesn't happen?
Since curl is writing the correct bytes to a file, it's probably not a curl problem per se.
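To make the observation concrete, here is a small sketch that replays the bytes around the frame boundary from the trace above. It uses Python's UTF-8 decoder as a stand-in for the Windows conversion, which is an assumption: the exact number of replacement characters MultiByteToWideChar emits may differ, but the effect is the same kind of corruption.

```python
# Last 17 bytes of the first frame (trace offsets 1710-1720): ends in "e2 94".
first = bytes.fromhex("e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94")
# First 16 bytes of the next frame (trace offset 0000): starts with the missing "80".
second = bytes.fromhex("80 e2 94 80 e2 94 80 e2 94 80 e2 94 80 e2 94 80")

# Converting each frame on its own, as a per-write conversion would do.
per_chunk = (first.decode("utf-8", errors="replace")
             + second.decode("utf-8", errors="replace"))

# Converting the concatenated bytes, which is what ends up in a file.
joined = (first + second).decode("utf-8")

print(repr(per_chunk))  # contains U+FFFD where the character was split
print(repr(joined))     # box-drawing characters only
```

The joined bytes decode cleanly, which matches the file output being fine, while the per-frame conversion loses the character that straddles the boundary.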
|
Jay's Nov. 3 comment seems to confirm that this is basically what's happening and it's the same issue. |
Maybe it would be better for curl to hold back console output until the entire response is received, since writing chunk by chunk corrupts characters that are split across chunks... |
The output could be infinitely long, so it's not possible to wait for it all in
the general case. Think about a longer or looping version of this, for
example: curl -s http://artscene.textfiles.com/vt100/movglobe.vt
|
Can anyone help with fixing this error, please? Jay hasn't had time for this in 5 months. |
I have some changes for this but I'm still working on it. |
Please tell me, if you have information, how soon will #10890 be accepted into the main branch? |
- Suppress an incomplete UTF-8 sequence at the end of the buffer. - Attempt to reconstruct incomplete UTF-8 sequence from prior call(s) in current call. Prior to this change, in Windows console UTF-8 sequences split between two or more calls to the write callback would cause invalid "replacement characters" U+FFFD to be printed instead of the actual Unicode character. This is because in Windows only UTF-16 encoded characters are printed to the console, therefore we convert the UTF-8 contents to UTF-16, which cannot be done with partial UTF-8 sequences. Reported-by: Maksim Arhipov Fixes curl#9841 Closes curl#10890
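The two steps in that commit message (hold back an incomplete sequence, then complete it on the next call) can be illustrated with a stateful writer. This is a rough Python sketch, not the actual patch: `ConsoleWriter` is a hypothetical name, Python's incremental decoder stands in for the carry-over logic, and a list of UTF-16LE byte strings stands in for what would be passed to the console.

```python
import codecs


class ConsoleWriter:
    """Hypothetical stand-in for a write callback feeding a UTF-16 console."""

    def __init__(self):
        # The incremental decoder keeps an incomplete trailing sequence
        # internally and finishes it when the next chunk arrives.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
        self.utf16_writes = []  # what each call would hand to the console

    def write(self, chunk: bytes) -> int:
        text = self._decoder.decode(chunk)  # only complete characters come out
        self.utf16_writes.append(text.encode("utf-16-le"))
        return len(chunk)  # report the whole chunk as consumed


writer = ConsoleWriter()
writer.write(b"\xe2\x94")  # first two bytes of U+2500, nothing printable yet
writer.write(b"\x80\n")    # final byte arrives in the next call
print(b"".join(writer.utf16_writes).decode("utf-16-le"))
```

Because only a few bytes are ever held back, this works for arbitrarily long or endless output, unlike buffering the whole response.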
Latest version (x86 & x64 tested) with cmd.exe (Windows 10).
When requesting data with curl.exe, the console output is corrupted if the response is UTF-8 encoded.
The failure occurs in about 1 attempt out of 5, usually at the same arbitrary position regardless of the text. Most likely the cause is incorrect joining of chunks.
Wget has no such problem (tested).
TEST (do a lot of attempts, the output will sometimes be broken):
curl.exe https://raw.githubusercontent.com/OSPanel/OpenServerPanel/main/system/lang/Russian.txt
Incorrect results: https://raw.githubusercontent.com/OSPanel/OpenServerPanel/main/resources/error.png