Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix conversion from wchar to char in LockingTextWriter.put #6474

Merged
merged 6 commits into from Apr 29, 2018

Conversation

aG0aep6G
Copy link
Contributor

Spin-off from #6469. I'm hoping that this will be relatively easy to get to work on the different platforms.

Instead of throwing a UTFException on unpaired high surrogates, a replacement character could be written. But throwing is what happens currently for unpaired low surrogates, so I went with that.

Fixes the easier part of issue 18789.
@dlang-bot
Copy link
Contributor

Thanks for your pull request and interest in making D better, @aG0aep6G! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the annotated coverage diff directly on GitHub with CodeCov's browser extension
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Auto-close Bugzilla Severity Description
18789 normal std.stdio messes up UTF conversions on output

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub fetch digger
dub run digger -- build "master + phobos#6474"

@aG0aep6G
Copy link
Contributor Author

I'm hoping that this will be relatively easy to get to work on the different platforms.

Yup. All green.

Copy link
Member

@schveiguy schveiguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from the fact that you are adding nice checks for partially written wchar surrogate pairs, when you aren't doing the same for char encodings, this looks pretty good. All of my suggestions are optional, though I would like to see the performance improvements.

std/stdio.d Outdated
assertThrown!UTFException(writer.put(dchar('y')));
assertThrown!UTFException(writer.put(surr));
} ());
f.close(); // No idea why this is needed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you don't include this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The readText below sees an empty file.

And now I see why: LockingTextWriter holds a reference counted File. When the destructor throws, the File won't be destroyed and its reference count won't be decremented. So the file only gets closed at the end of the program.

Fixed by destroying file_ explicitly before throwing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, glad you figured that out!

immutable wchar surr = "\U0001F608"w[0];
auto f = File(deleteme, "w");
assertThrown!UTFException(() {
auto writer = f.lockingTextWriter();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a few minutes to understand that the first assertThrown above is testing the throw from the lockingTextWriter dtor. Can you put in a comment to that effect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment at the end of the block.

std/stdio.d Outdated
char[4] wbuf;
immutable size = encode(wbuf, d);
foreach (i; 0 .. size)
trustedFPUTC(wbuf[i], handle_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a lot of acrobatics for this purpose. A few comments here:

  1. You only need a buffer of 1 wchar.
  2. You don't need a "size" for checking if there are 0 or 1 wchars in there, just store a wchar(0) when you aren't using it.
  3. From my experience with iopipe, you get better performance when you avoid using members to do the decoding. This would be better off being duplicated in the function which takes whole wstrings, and you can then use a local to do all the decoding/encoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed points 1 and 2. Hopefully, the code's clearer now.

I think I'll leave point 3 for another PR.

Copy link
Member

@schveiguy schveiguy Apr 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress on the simplification. One more thing, I thought it might look better actually if you handle the surrogate pair explicitly instead of the ?: operator

else
{
   dchar d = c; // default to just translate directly
   if(highSurrogate)
   {
      immutable wchar[2] buf = [highSurrogate, c];
      d = buf[].front; // doesn't this just work? Do we need to use decodeFront?
      highSurrogate = 0;
   }
   // .. all the rest of the stuff you have
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cool if std.utf provided a way to decode N code units directly. For example, if decode(highSurrogate, c) just worked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code misses the case where c is an unpaired low surrogate. But encode throws on that, so that works. Done.

@aG0aep6G
Copy link
Contributor Author

you are adding nice checks for partially written wchar surrogate pairs, when you aren't doing the same for char encodings

If I find the spiritual strength to continue with #6469, I should add checks for UTF-8, too, yeah.

@schveiguy
Copy link
Member

schveiguy commented Apr 23, 2018

I suppose the difference between wchar and char is that LockingTextWriter is actually writing char, so there is a one-to-one mapping. When you are writing wchar, you can't just stop in the middle of a surrogate pair because nothing has been written. When you are writing char, you can write everything given. In any case, there is no denying that unicode is messy when it comes to streaming.

@aG0aep6G
Copy link
Contributor Author

I suppose the difference between wchar and char is that LockingTextWriter is actually writing char, so there is a one-to-one mapping. When you are writing wchar, you can't just stop in the middle of a surrogate pair because nothing has been written. When you are writing char, you can write everything given.

Yeah, I wouldn't validate in the char -> char case. Better just pass it through.

But there's also the char -> wchar_t case. There, the situation is the same as here, and I should check for unused chars in the buffer before starting a new code point.

@schveiguy schveiguy added the 72h no objection -> merge The PR will be merged if there are no objections raised. label Apr 24, 2018
@schveiguy
Copy link
Member

OK, I think this looks good. I'll give it a bit of time, and then merge it. @wilzbach does that label do it automatically?

@JackStouffer
Copy link
Member

No, it's manual.

@schveiguy schveiguy added auto-merge and removed 72h no objection -> merge The PR will be merged if there are no objections raised. labels Apr 29, 2018
@schveiguy
Copy link
Member

Not sure why the bot isn't merging, but I'll do it via auto-tester. Ping @wilzbach

@schveiguy
Copy link
Member

Auto-merge toggled on

@schveiguy schveiguy merged commit 85844c7 into dlang:master Apr 29, 2018
@wilzbach
Copy link
Member

Because the bot only merges if all CI pass and Jenkins failed here.

@schveiguy
Copy link
Member

Ah, ok. Jenkins wasn't marked as required, so I assumed it would merge.

@wilzbach
Copy link
Member

Yeah the default behavior has been changed a few month ago as it's better if a human judges spurious CI failures and also it's better if we don't have them at all.
In this case, the failure was in DCD with this PR hopefully fixing this
dlang-community/DCD#470
(and I also temporarily removed DCD from the tester until the PR is part of the next release - dlang-community/DCD#455)

@aG0aep6G aG0aep6G deleted the LockingTextWriter.put-wchar-to-char branch April 29, 2018 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants