fix conversion from wchar to char in LockingTextWriter.put #6474

aG0aep6G · 2018-04-23T11:51:03Z

Spin-off from #6469. I'm hoping that this will be relatively easy to get to work on the different platforms.

Instead of throwing a UTFException on unpaired high surrogates, a replacement character could be written. But throwing is what happens currently for unpaired low surrogates, so I went with that.

Fixes the easier part of issue 18789.

dlang-bot · 2018-04-23T11:51:06Z

Thanks for your pull request and interest in making D better, @aG0aep6G! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

My PR is fully covered with tests (you can see the annotated coverage diff directly on GitHub with CodeCov's browser extension)
My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
I have provided a detailed rationale explaining my changes
New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.

If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Auto-close	Bugzilla	Severity	Description
✗	18789	normal	std.stdio messes up UTF conversions on output

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub fetch digger
dub run digger -- build "master + phobos#6474"

aG0aep6G · 2018-04-23T13:15:52Z

> I'm hoping that this will be relatively easy to get to work on the different platforms.

Yup. All green.

schveiguy

Aside from the fact that you are adding nice checks for partially written wchar surrogate pairs, when you aren't doing the same for char encodings, this looks pretty good. All of my suggestions are optional, though I would like to see the performance improvements.

schveiguy · 2018-04-23T13:23:08Z

std/stdio.d

+            assertThrown!UTFException(writer.put(dchar('y')));
+            assertThrown!UTFException(writer.put(surr));
+        } ());
+        f.close(); // No idea why this is needed.


What happens if you don't include this?

The readText below sees an empty file.

And now I see why: LockingTextWriter holds a reference counted File. When the destructor throws, the File won't be destroyed and its reference count won't be decremented. So the file only gets closed at the end of the program.

Fixed by destroying file_ explicitly before throwing.

Ah, glad you figured that out!
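For readers following the thread, here is a minimal sketch of the fix described above. `Handle` and `Writer` are hypothetical stand-ins for the reference-counted `File` and for `LockingTextWriter`; this is not the Phobos code, only an illustration of releasing a member explicitly before a destructor throws.

```d
import std.exception : assertThrown;

// Hypothetical stand-in for a reference-counted File.
struct Handle
{
    static int live; // number of handles currently open
    bool owned;

    static Handle open()
    {
        Handle h;
        h.owned = true;
        ++live;
        return h;
    }

    ~this()
    {
        if (owned)
        {
            --live;
            owned = false;
        }
    }
}

// Hypothetical writer whose destructor may throw.
struct Writer
{
    Handle handle_;
    bool pending; // e.g. a high surrogate that was never completed

    ~this()
    {
        // Release the member first. If the throw below causes the
        // member destructors to be skipped, the handle is already closed.
        destroy(handle_);
        if (pending)
            throw new Exception("unfinished output");
    }
}

void main()
{
    assertThrown!Exception(() {
        Writer w;
        w.handle_ = Handle.open();
        w.pending = true;
    } ()); // the throw comes from Writer's destructor at the end of the block
    assert(Handle.live == 0); // the handle was released despite the throw
}
```

Because `destroy` resets `handle_` to its initial state, running the member destructor again afterwards would be harmless in this sketch.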

schveiguy · 2018-04-23T13:29:07Z

std/stdio.d

+        immutable wchar surr = "\U0001F608"w[0];
+        auto f = File(deleteme, "w");
+        assertThrown!UTFException(() {
+            auto writer = f.lockingTextWriter();


Took me a few minutes to understand that the first assertThrown above is testing the throw from the lockingTextWriter dtor. Can you put in a comment to that effect?

Added a comment at the end of the block.

schveiguy · 2018-04-23T13:36:09Z

std/stdio.d

+                        char[4] wbuf;
+                        immutable size = encode(wbuf, d);
+                        foreach (i; 0 .. size)
+                            trustedFPUTC(wbuf[i], handle_);


This seems like a lot of acrobatics for this purpose. A few comments here:

1. You only need a buffer of 1 wchar.

2. You don't need a "size" for checking if there are 0 or 1 wchars in there, just store a wchar(0) when you aren't using it.

3. From my experience with iopipe, you get better performance when you avoid using members to do the decoding. This would be better off being duplicated in the function which takes whole wstrings, and you can then use a local to do all the decoding/encoding.

Addressed points 1 and 2. Hopefully, the code's clearer now.

I think I'll leave point 3 for another PR.

Good progress on the simplification. One more thing: I think it might look better if you handled the surrogate pair explicitly instead of using the ?: operator:

    else
    {
        dchar d = c; // default to just translate directly
        if(highSurrogate)
        {
            immutable wchar[2] buf = [highSurrogate, c];
            d = buf[].front; // doesn't this just work? Do we need to use decodeFront?
            highSurrogate = 0;
        }
        // .. all the rest of the stuff you have
    }

It would be cool if std.utf provided a way to decode N code units directly. For example, if decode(highSurrogate, c) just worked.

Your code misses the case where c is an unpaired low surrogate. But encode throws on that, so that works. Done.
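The approach discussed in this thread can be sketched roughly as follows. `Utf16ToUtf8` is a hypothetical helper that appends to an array instead of writing to a locked file, and it combines the surrogate pair with plain arithmetic rather than the range-based decoding suggested above, so treat it as an illustration under those assumptions and not as the merged code. The only library call it relies on is `std.utf.encode`, which throws a `UTFException` for a surrogate code point.

```d
import std.exception : assertThrown;
import std.utf : encode, UTFException;

// Hypothetical helper: converts UTF-16 code units to UTF-8 one at a time.
struct Utf16ToUtf8
{
    wchar highSurrogate = '\0'; // '\0' means no high surrogate is pending
    char[] output;              // stands in for the underlying file

    void put(wchar c)
    {
        if (c >= 0xD800 && c <= 0xDBFF) // high surrogate: remember it
        {
            if (highSurrogate != '\0')
                throw new UTFException("unpaired high surrogate");
            highSurrogate = c;
            return;
        }

        dchar d = c; // default: translate directly
        if (highSurrogate != '\0')
        {
            if (c < 0xDC00 || c > 0xDFFF)
                throw new UTFException("unpaired high surrogate");
            // Combine the pair into one code point.
            d = cast(dchar)(0x10000
                + ((highSurrogate - 0xD800) << 10)
                + (c - 0xDC00));
            highSurrogate = '\0';
        }

        char[4] buf;
        immutable size = encode(buf, d); // throws on a lone low surrogate
        output ~= buf[0 .. size];
    }
}

void main()
{
    Utf16ToUtf8 w;
    foreach (wchar c; "a\U0001F608"w)
        w.put(c);
    assert(w.output == "a\U0001F608");

    // A low surrogate with no pending high surrogate is rejected by encode.
    assertThrown!UTFException(w.put(cast(wchar) 0xDC00));
}
```

Storing `'\0'` for "nothing pending" mirrors review point 2: no separate size field is needed, because `'\0'` is never a surrogate.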

aG0aep6G · 2018-04-23T21:23:23Z

> you are adding nice checks for partially written wchar surrogate pairs, when you aren't doing the same for char encodings

If I find the spiritual strength to continue with #6469, I should add checks for UTF-8, too, yeah.

schveiguy · 2018-04-23T23:18:30Z

I suppose the difference between wchar and char is that LockingTextWriter is actually writing char, so there is a one-to-one mapping. When you are writing wchar, you can't just stop in the middle of a surrogate pair because nothing has been written. When you are writing char, you can write everything given. In any case, there is no denying that unicode is messy when it comes to streaming.

aG0aep6G · 2018-04-24T05:51:39Z

> I suppose the difference between wchar and char is that LockingTextWriter is actually writing char, so there is a one-to-one mapping. When you are writing wchar, you can't just stop in the middle of a surrogate pair because nothing has been written. When you are writing char, you can write everything given.

Yeah, I wouldn't validate in the char -> char case. Better just pass it through.

But there's also the char -> wchar_t case. There, the situation is the same as here, and I should check for unused chars in the buffer before starting a new code point.

schveiguy · 2018-04-24T13:50:59Z

OK, I think this looks good. I'll give it a bit of time, and then merge it. @wilzbach does that label do it automatically?

JackStouffer · 2018-04-24T13:55:56Z

No, it's manual.

schveiguy · 2018-04-29T17:55:42Z

Not sure why the bot isn't merging, but I'll do it via auto-tester. Ping @wilzbach

schveiguy · 2018-04-29T17:55:54Z

Auto-merge toggled on

wilzbach · 2018-04-29T18:03:20Z

Because the bot only merges if all CI pass and Jenkins failed here.

schveiguy · 2018-04-29T18:27:41Z

Ah, ok. Jenkins wasn't marked as required, so I assumed it would merge.

wilzbach · 2018-04-29T19:12:42Z

Yeah, the default behavior was changed a few months ago, as it's better if a human judges spurious CI failures, and better still if we don't have them at all.
In this case, the failure was in DCD; this PR will hopefully fix it:
dlang-community/DCD#470
(and I also temporarily removed DCD from the tester until the PR is part of the next release - dlang-community/DCD#455)

fix conversion from wchar to char in LockingTextWriter.put

d5cc4c5

Fixes the easier part of issue 18789.

aG0aep6G requested review from CyberShadow and schveiguy as code owners April 23, 2018 11:51

aG0aep6G mentioned this pull request Apr 23, 2018

fix issue 18789 - std.stdio messes up UTF conversions on output #6469

Merged

schveiguy approved these changes Apr 23, 2018

aG0aep6G added 3 commits April 23, 2018 23:02

destroy file_ before possibly throwing

8611e45

add comment about the final throw

2ab81bc

only store the high surrogate, and use \0 to indicate its empty

bf1b4db

merge imports

5fa8ba1

only decode when there's a highSurrogate

b771e17

schveiguy added the "72h no objection -> merge" label (The PR will be merged if there are no objections raised.) Apr 24, 2018

schveiguy added the "auto-merge" label and removed the "72h no objection -> merge" label Apr 29, 2018

schveiguy merged commit 85844c7 into dlang:master Apr 29, 2018

aG0aep6G deleted the LockingTextWriter.put-wchar-to-char branch April 29, 2018 21:40
