Fix issue 17229 - File.byChunk (ubyte) w/ stdout.lockingTextWriter corrupts utf-8 data (and is very slow) #5229
Conversation
@schveiguy If you mention the issue in the commit message, the bot is able to pick it up. This is important because the changelog is built from (1) combining the manual text entries and (2) filtering the git commit messages for Bugzilla references. Learn more: https://github.com/dlang-bots/dlang-bot#automated-references
std/stdio.d (outdated diff)

@@ -2772,6 +2772,14 @@ See $(LREF byChunk) for an example.
        return LockingTextWriter(this);
    }

    @system unittest
    {
        // ensure that lockingTextWriter cannot accidentally accept ubyte arrays
How about including a link to the issue for additional info?
(might be helpful to someone in the future)
I would have thought the preferred change would be to accept ubyte, but not decode it. Instead, write the bytes as given.

Rationale: Traditionally, the difference between binary and text mode IO is newline translation on Windows. It is also legitimate to work with utf-8 code units rather than full unicode characters, and still want newline translation. In these cases, ubyte may be the best choice.

One case is when working with fixed size buffers, as in the File.byChunk case. With fixed sized buffers, it will often be the case that the buffers do not contain valid utf-8 sequences, as the buffers will start mid-sequence. This would not work well with the 'char' data type, as many Phobos routines will try to interpret the data as valid utf-8 sequences, which will fail. However, newline translation will work correctly despite not having full utf-8 sequences in all cases, as newline has a unique byte code. This suggests using ubyte in these cases.

Considering the File.byChunk example further, it would seem that the correct mode for output is the mode that was used for input. On Windows, reading in text mode and writing in binary mode could modify the newline style and vice versa, corrupting the file. This suggests a legitimate use case for a text mode writer that takes untranslated utf-8 code units.

Similar scenarios will arise when generated data is buffered in fixed size blocks. Forcing it to be written in binary mode forces the creator to insert the correct newline sequence at the point of generation. However, if the final destination is unknown at the point of generation, or there are multiple output targets, it's problematic to pick the correct newline sequence. This is again a case where there is an advantage to leaving the translation to the last step, regardless of whether buffer boundaries are matched with valid utf-8 sequences.
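For concreteness, the pipeline being described here is essentially the "byChunk / lockingTextWriter" documentation example referenced later in this thread. A minimal sketch (not code from the patch):

```d
import std.algorithm.mutation : copy;
import std.stdio;

void main()
{
    // Fixed-size ubyte chunks will regularly start or end in the middle of a
    // multi-byte UTF-8 sequence, so they cannot be treated as validated char[].
    // Newline translation is still well defined, though, because '\n' is a
    // single byte that never appears inside a multi-byte UTF-8 sequence.
    stdin.byChunk(4096).copy(stdout.lockingTextWriter());
}
```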
@jondegenhardt I understand what you are saying; I didn't realize that. Note that using "binary mode" is really what you should be doing for high-performance copying anyway. But I can see there is a use case specifically for ubyte ranges. I will update the code to allow ubyte ranges (not ushort or others, as the current code does), and use fwrite when it's an array.
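For reference, a minimal sketch of the binary-mode approach mentioned here, using File.rawWrite so no decoding or newline translation happens (file names are made up for illustration):

```d
import std.stdio;

void main()
{
    auto source = File("input.dat", "rb");
    auto dest   = File("output.dat", "wb");

    // Copy in fixed-size chunks; rawWrite emits the bytes exactly as read,
    // so chunk boundaries that split UTF-8 sequences are harmless.
    foreach (chunk; source.byChunk(64 * 1024))
        dest.rawWrite(chunk);
}
```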
I sort of remember that, but I thought it also looked at the PR title. Will modify.
Updated.
Ping @wilzbach, shouldn't the bot recognize it now?
It seems like the RegExp doesn't like commas: https://regex101.com/r/aI0Rp6/3 (this is the "official" RegExp that is also used by the GitHub-Bugzilla integration, and has been used by Martin for quite a while, so at least on this one I am innocent).
Just amended the message to swap comma for dash. And the bot showed up in milliseconds 😄 |
Yup, literally. Yeah, Vibe.d is pretty awesome :)
LGTM |
@schveiguy what was your motivation for removing the unittest? |
@wilzbach the unit test was just a static test to make sure put would not compile with ubyte or ubyte[]. Obviously that test is invalid now. I can augment the test at line 3148, which should be sufficient to test for regression.
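For context, the removed static check was presumably of roughly this shape (a hypothetical reconstruction, not the actual Phobos test; as noted above, it no longer applies once ubyte ranges are accepted):

```d
@system unittest
{
    import std.stdio : stdout;

    // ensure that lockingTextWriter cannot accidentally accept ubyte data
    auto writer = stdout.lockingTextWriter();
    ubyte b = 65;
    ubyte[] arr = [65, 66];
    static assert(!__traits(compiles, writer.put(b)));
    static assert(!__traits(compiles, writer.put(arr)));
}
```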
@wilzbach added a unit test. Confirmed that it fails on current master.
Perhaps the byChunk / lockingTextWriter example should be removed until LDC has caught up with the update. Anecdotal, but I feel like I've seen this example used in other written contexts, and the corruption would be quite unexpected.
That's quite a bit of busy-work, and it doesn't undo those other references. Besides, it works just fine (albeit quite slowly) as long as the file is straight ASCII. I think LDC has been pretty up-to-date recently when it comes to tracking DMD's releases.
Okay, understood. And it's been this way for quite a while.
@wilzbach Not sure if this is on track for 2.074, but it really should go in that release. The corruption of multi-byte utf-8 characters is a serious issue. That it's an example in the documentation increases the priority. Not sure of the process, but if an approved review is the main thing, is there a way to generate appropriate attention?
Branching for 2.074.0 is supposed to be tomorrow IIRC.
I think this PR has introduced a regression. I can't verify currently, as I can't seem to get digger to work (https://issues.dlang.org/show_bug.cgi?id=17358). This was the only PR to touch the locking text writer, though, so I think it's a safe bet.
The fix is basically to disallow compiling the putting of ubyte data into the output range. It will now only accept arrays of a character type, or ranges of a character type. Any other type is not really a character, and should be cast if that's what you really meant.

Note that the original code did really bad things. For instance, an array of ubyte was put into the output stream by integer-promoting each ubyte into a dchar, and then encoding whatever the result of that was into utf-8 code units.

I'm not sure if we need a deprecation here, as code that is passing ubyte[] to lockingTextWriter is fundamentally broken, but maybe I'm wrong and there are some valid uses for this that need to be migrated to char[] casting?

Also updated the example to reflect the correct usage for "Efficient file copy", which was both broken and not efficient.