Support of byUTF for ubyte[] argument #7249

Open
wants to merge 5 commits into
base: master

Contributor

@vporton vporton commented Oct 23, 2019

For the reasons outlined in the discussion of that pull request, we concluded that byUTF needs to accept an argument of type ubyte[]. This PR implements exactly that.

As a reminder, this is necessary:

  1. to support converting a chunk of bytes read from a file;
  2. to eliminate the need to validate a string of chars twice: once when it is created and again when it is converted by byUTF (this simplifies programming and improves efficiency).
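
To illustrate the two points above, here is a minimal sketch of what a caller can do today without the proposed overload: reinterpret the bytes as `char`, validate them once, and then pass them to `byUTF`. The helper name `asValidatedUtf8` is hypothetical and is not part of Phobos or of this PR. As far as I understand, `byUTF!dchar` then decodes the same data again, which is the duplicated work that item 2 refers to.

```d
import std.algorithm.comparison : equal;
import std.utf : byUTF, validate;

/// Hypothetical helper (not in Phobos, not in this PR): view raw bytes as
/// UTF-8 code units after checking them once.
const(char)[] asValidatedUtf8(const(ubyte)[] chunk)
{
    auto text = cast(const(char)[]) chunk; // reinterpret the bytes; no copy is made
    validate(text);                        // throws UTFException on malformed UTF-8
    return text;
}

unittest
{
    // "hellö" encoded as UTF-8, as it might arrive from a file read
    const(ubyte)[] chunk = [0x68, 0x65, 0x6c, 0x6c, 0xC3, 0xB6];

    // Second pass over the same data: byUTF!dchar decodes it again.
    assert(asValidatedUtf8(chunk).byUTF!dchar.equal("hell\u00F6"d));
}
```

With an overload that accepts `ubyte[]` directly, the cast and the separate `validate` call would presumably disappear from the caller's code.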

@vporton vporton requested a review from jmdavis as a code owner October 23, 2019 20:01
@dlang-bot
Contributor

Thanks for your pull request and interest in making D better, @vporton! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment set up, you can use Digger to test this PR:

dub fetch digger
dub run digger -- build "master + phobos#7249"

@lesderid
Contributor

What happens for a range of ubytes that aren't valid UTF-8? Is this covered by a test?

@@ -4354,6 +4354,9 @@ if (isSomeChar!C)
// hellö as a range of `ubyte`s, which are UTF-8
assert((cast(ubyte[]) [0x68, 0x65, 0x6c, 0x6c, 0xC3, 0xB6]).byUTF!char().equal(['h', 'e', 'l', 'l', 0xC3, 0xB6]));

assertThrown((cast(ubyte[]) [0xC3, 0x28]).byUTF!char());
Contributor

Needs an import std.exception : assertThrown;.

std/utf.d Outdated
@@ -60,7 +60,7 @@ $(TR $(TD Miscellaneous) $(TD
+/
module std.utf;

-import std.exception; // basicExceptionCtors
+import std.exception; // basicExceptionCtors, assertThrown
Contributor

This won't work because tests are extracted. Please import it inside the unittest block.
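
A minimal sketch of the scoped-import pattern requested here. The assumption is that documented unittests are lifted out of the module and compiled on their own, so every symbol a test uses has to be imported inside its own block; the test body itself is only illustrative and is not taken from this PR.

```d
unittest
{
    // Scoped imports travel with the test when it is extracted from the module.
    import std.exception : assertThrown;
    import std.utf : UTFException, validate;

    // 0xC3 opens a two-byte sequence, but 0x28 is not a continuation byte.
    assertThrown!UTFException(validate("\xC3\x28"));
}
```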

@atilaneves
Contributor

I read the discussion on the other PR and AIUI it called for something that took ubyte[] and lazily produced validated char[]. This doesn't seem to be it. Could you please explain what you're trying to accomplish? Thanks.

assert("𐐷".byUTF!dchar().equal([0x00010437]));

// hellö as a range of `ubyte`s, which are UTF-8
assert((cast(ubyte[]) [0x68, 0x65, 0x6c, 0x6c, 0xC3, 0xB6]).byUTF!char().equal(['h', 'e', 'l', 'l', 0xC3, 0xB6]));
Contributor

Good, but I think you should also test converting a ubyte range directly to UTF-16 and/or UTF-32.
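
A sketch of what such tests could look like. Since the `ubyte[]` overload from this PR is not assumed to be available, the bytes are first reinterpreted as `char`; with the proposed overload the cast would presumably be dropped. The expected code units are the ones already used for U+10437 in the existing `byUTF` tests.

```d
import std.algorithm.comparison : equal;
import std.utf : byUTF;

unittest
{
    // U+10437 encoded as UTF-8 (four code units)
    ubyte[] bytes = [0xF0, 0x90, 0x90, 0xB7];

    // Assumption: no ubyte[] overload exists, so view the bytes as char first.
    auto text = cast(const(char)[]) bytes;

    assert(text.byUTF!wchar.equal([0xD801, 0xDC37])); // UTF-16 surrogate pair
    assert(text.byUTF!dchar.equal([0x00010437]));     // one UTF-32 code unit
}
```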

Contributor

And also handling of invalid UTF-8 in the ubyte range, both with exceptions and with replacement characters.
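
A sketch of both policies, written against the existing `char`-based `byUTF` because the `ubyte[]` overload is not assumed here. It relies on the `UseReplacementDchar` template parameter of `byUTF`. Note that `byUTF` is lazy: in this sketch the exception only surfaces once the range is consumed, so a test that merely constructs the range may not observe it.

```d
import std.array : array;
import std.exception : assertThrown;
import std.utf : byUTF, UseReplacementDchar, UTFException;

unittest
{
    // 0xC3 opens a two-byte sequence, but 0x28 is not a continuation byte.
    ubyte[] bad = [0xC3, 0x28];
    auto text = cast(const(char)[]) bad; // assumption: no ubyte[] overload

    // Lenient policy: the malformed sequence decodes to U+FFFD.
    auto lenient = text.byUTF!(dchar, UseReplacementDchar.yes);
    assert(lenient.front == '\uFFFD');

    // Strict policy: consuming the range throws UTFException.
    assertThrown!UTFException(text.byUTF!(dchar, UseReplacementDchar.no).array);
}
```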

@dukc
Contributor

dukc commented Apr 27, 2020

> I read the discussion on the other PR and AIUI it called for something that took ubyte[] and lazily produced validated char[]. This doesn't seem to be it. Could you please explain what you're trying to accomplish? Thanks.

The added unittest explains it:
assert((cast(ubyte[]) [0x68, 0x65, 0x6c, 0x6c, 0xC3, 0xB6]).byUTF!char().equal(['h', 'e', 'l', 'l', 0xC3, 0xB6]));

You pass in a range of ubyte and get a range of char. Handy, as no separate step to deal with autodecoding is needed.
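
For readers less familiar with the autodecoding point, a small sketch of the difference, assuming a default Phobos build where range primitives on narrow strings decode to `dchar`:

```d
import std.range.primitives : walkLength;
import std.utf : byUTF;

unittest
{
    string s = "\u00F6"; // 'ö': one code point, two UTF-8 code units

    // Range primitives on a string autodecode, so iteration sees one dchar.
    assert(s.walkLength == 1);

    // byUTF!char iterates code units instead, with no decoding step.
    assert(s.byUTF!char.walkLength == 2);
}
```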

@dukc
Contributor

dukc commented Apr 27, 2020

Perhaps this should also accept ushort (assumed to be UTF-16) and uint (UTF-32)?

Of course, this can be implemented later on just as well.
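
A sketch of how the UTF-16 case can be handled today, by reinterpreting `ushort` code units as `wchar` before calling `byUTF`. This is only an illustration of the suggestion; a `ushort` or `uint` overload is not part of this PR.

```d
import std.algorithm.comparison : equal;
import std.utf : byUTF;

unittest
{
    // U+10437 as a UTF-16 surrogate pair stored in ushort
    ushort[] units = [0xD801, 0xDC37];

    // Assumption: no ushort overload exists, so view the units as wchar first.
    auto text = cast(const(wchar)[]) units;

    assert(text.byUTF!char.equal([0xF0, 0x90, 0x90, 0xB7])); // UTF-8 code units
    assert(text.byUTF!dchar.equal([0x00010437]));            // UTF-32 code unit
}
```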

@RazvanN7
Collaborator

ping @vporton
