Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue 17229 - File.byChunk (ubyte) w/ stdout.lockingTextWriter corrupts utf-8 data (and is very slow) #5229

Merged
merged 2 commits into from
Mar 17, 2017

Conversation

schveiguy
Copy link
Member

The fix is basically to disallow compiling the putting of ubyte data into the output range. It will now only accept arrays of a character type, or ranges of a character type. Any other type is not really a character, and should be cast if that's what you really meant.

Note that the original code did really bad things. For instance, an array of ubyte was put into the output stream by integer promoting each ubyte into a dchar, and then encoding whatever the result of that was into utf-8 code units.

I'm not sure if we need a deprecation here, as code that is passing ubyte[] to lockingTextWriter is fundamentally broken, but maybe I'm wrong and there are some valid uses for this that need to be migrated to char[] casting?

Also updated the example to reflect the correct usage for "Efficient file copy", which was both broken and not efficient.

@wilzbach
Copy link
Member

wilzbach commented Mar 2, 2017

@schveiguy If you mention the issue in the commit message, the bot is able to pick it up.

This is important because the changelog is built from (1) combining the manual text entries and (2) filtering the git commit messages for Bugzilla references.

Learn more: https://github.com/dlang-bots/dlang-bot#automated-references

std/stdio.d Outdated
@@ -2772,6 +2772,14 @@ See $(LREF byChunk) for an example.
return LockingTextWriter(this);
}

@system unittest
{
// ensure that lockingTextWriter cannot accidentally accept ubyte arrays
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about including a link to the issue for additional infos?
(might be helpful to someone in the future)

@jondegenhardt
Copy link
Contributor

jondegenhardt commented Mar 2, 2017

I would have thought the preferred change would be to accept ubyte, but not decode it. Instead, write the bytes as given. Rationale: Traditionally, the difference between binary and text mode IO is newline translation on Windows. It is also legitimate to work with utf-8 code units rather than full unicode characters, and still want newline translation. In these cases, ubyte may be the best choice.

One case is when working with fixed size buffers, as in the File.byChunk case. With fixed sized buffers, it will often be the case the buffers do not contain valid utf-8 sequences, as the buffers will start mid-sequence. This would not work well with the 'char' data type, as many Phobos routines will try to interpret them as valid utf-8 sequences, which will fail. However, newline translation will work correctly despite not having full utf-8 sequences in all cases, as newline has a unique byte code. This suggests using ubyte in these cases.

Considering the File.byChunk example further, it would seem that the correct mode for output is the mode that was used for input. On windows, reading in text mode and writing in binary mode could modify the newline style and visa versa, corrupting the file. This suggests a legitimate use case for a text mode writer that takes untranslated utf-8 code points.

Similar scenarios will arise when generated data is buffered in fixed size blocks. Forcing it to be written in binary mode forces the creator to insert the correct newline sequence at the point of generation. However, if final destination is unknown at the point of generation, or there are multiple output targets, it's problematic to pick the correct newline sequence. This is again a case where there is an advantage to leaving the translation to the last step, regardless of whether buffer boundaries are matched with valid utf-8 sequences.

@schveiguy
Copy link
Member Author

@jondegenhardt I understand what you are saying. I didn't realize that lockingBinaryWriter actually temporarily sets the mode to binary first, which would make it not work properly with the example also.

Note that using "binary mode" is really what you should be doing for high-performance copying anyway.

But I can see there is a use case specifically for ubyte ranges. I will update the code to allow ubyte ranges (not ushort or others, as the current code does), and use fwrite when it's an array.

@schveiguy
Copy link
Member Author

If you mention the issue in the commit message, the bot is able to pick it up

I sort of remember that, but I thought it also looked at the PR title. Will modify.

@schveiguy
Copy link
Member Author

updated.

@schveiguy
Copy link
Member Author

Ping @wilzbach shouldn't bot recognize it now?

@wilzbach
Copy link
Member

wilzbach commented Mar 2, 2017

Ping @wilzbach shouldn't bot recognize it now?

It seems like the RegExp doesn't like commas: https://regex101.com/r/aI0Rp6/3

(this is the "official" RegExp that is also used by the Github-Bugzilla integration (and Martin since quite a while), so at least on this one I am innocent)

@dlang-bot
Copy link
Contributor

Fix Bugzilla Description
17229 File.byChunk (ubyte) w/ stdout.lockingTextWriter corrupts utf-8 data (and is very slow)

@schveiguy
Copy link
Member Author

Just amended the message to swap comma for dash. And the bot showed up in milliseconds 😄

@wilzbach
Copy link
Member

wilzbach commented Mar 2, 2017

Just amended the message to swap comma for dash. And the bot showed up in milliseconds 😄

Yup literally:

» 2 Mar 2017 17:47:43.689 290 <158>1 2017-03-02T16:47:43.331643+00:00 heroku router - - at=info method=POST path="/github_hook" host=dlang-bot.herokuapp.com request_id=83dfdf92-7481-4193-93db-98ca1cf8a1b5 fwd="192.30.252.41" dyno=web.1 connect=1ms service=3ms status=200 bytes=146

Yeah Vibe.d is pretty awesome :)

@jondegenhardt
Copy link
Contributor

LGTM

@wilzbach
Copy link
Member

wilzbach commented Mar 4, 2017

@schveiguy what was your motivation for removing the unittest?
Would be nice to have to prevent regressions in the future.

@schveiguy
Copy link
Member Author

schveiguy commented Mar 5, 2017

@wilzbach the unit test was just a static test to make sure put would not compile with ubyte or ubyte[]. Obviously that test is invalid now.

I can augment the test at line 3148, should be sufficient to test for regression.

@schveiguy
Copy link
Member Author

schveiguy commented Mar 6, 2017

@wilzbach added unit test. Confirmed that it fails on current master.

@jondegenhardt
Copy link
Contributor

Perhaps the byChunk / lockingTextWriter example should be removed until LDC has caught up with the update. Anecdotal, but I feel like I've seen this example used in other written contexts, and the corruption would be quite unexpected.

@schveiguy
Copy link
Member Author

should be removed until LDC has caught up with the update

That's quite a bit of busy-work, and it doesn't undo those other references. Besides, it works just fine (albeit quite slowly) as long as the file is straight ASCII. I think LDC is pretty up-to-date recently when it comes to tracking DMD's releases.

@jondegenhardt
Copy link
Contributor

That's quite a bit of busy-work, and it doesn't undo those other references.

Okay, understood. And, it's been this way for quite a while.

@jondegenhardt
Copy link
Contributor

@wilzbach Not sure if this is on-track for 2.074, but it really should go in that release. The corruption of multi-byte utf-8 characters is a serious issue. That it's an example in the documentation increases the priority. Not sure the process, but if an approved review is the main thing is there a way to generate appropriate attention?

@JackStouffer
Copy link
Member

JackStouffer commented Mar 17, 2017

Branching for 2.074.0 is supposed to be tomorrow IIRC

@dlang-bot dlang-bot merged commit 3568cb3 into dlang:master Mar 17, 2017
@schveiguy schveiguy deleted the fixtextwriter branch April 11, 2017 00:55
@JackStouffer
Copy link
Member

JackStouffer commented Apr 28, 2017

I think this PR has introduced a regression. I can't verify currently as I can't seem to get digger to work https://issues.dlang.org/show_bug.cgi?id=17358

This was the only PR to touch locking text writer though, so I think it's a safe bet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants