Support of byUTF for ubyte[] argument #7249

vporton · 2019-10-23T20:01:38Z

For the reasons outlined in the discussion of that pull request, we concluded that we need to be able to call byUTF on an argument of type ubyte[]. This PR implements exactly that.

As a reminder, this is necessary:

to support converting a chunk extracted from a file;
to eliminate the need to validate a string of chars twice: once when it is created and again when it is converted by byUTF (this simplifies programming and improves efficiency).
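
For comparison, here is a minimal sketch of how a byte chunk can be decoded without this PR, by reinterpreting the bytes as UTF-8 code units before calling byUTF. The helper name decodeChunk is hypothetical and not part of Phobos; the sketch assumes the chunk is meant to be UTF-8 and relies on byUTF's default policy of replacing invalid sequences with U+FFFD.

```d
import std.array : array;
import std.utf : byUTF;

/// Hypothetical helper (not part of Phobos): decodes a chunk of bytes
/// that the caller assumes to be UTF-8 into UTF-32 code points.
dchar[] decodeChunk(const(ubyte)[] chunk)
{
    // Reinterpret the bytes as UTF-8 code units: no copy, no validation.
    auto codeUnits = cast(const(char)[]) chunk;
    // byUTF decodes lazily; .array forces the whole chunk to be decoded.
    return codeUnits.byUTF!dchar.array;
}
```

The cast is the extra step that accepting ubyte[] directly would make unnecessary.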

dlang-bot · 2019-10-23T20:01:41Z

Thanks for your pull request and interest in making D better, @vporton! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
I have provided a detailed rationale explaining my changes
New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.

If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub fetch digger
dub run digger -- build "master + phobos#7249"

lesderid · 2019-10-23T20:47:01Z

What happens for a range of ubytes that aren't valid UTF-8? Is this covered by a test?
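
As a point of reference, this sketch shows how byUTF already treats invalid input when the source is a range of char. Whether the ubyte overload in this PR behaves the same way is an assumption, not something the thread confirms. The input string mirrors the one used in the existing std.utf documentation examples.

```d
import std.algorithm.searching : canFind;
import std.array : array;
import std.exception : assertThrown;
import std.utf : byChar, byUTF, replacementDchar, UseReplacementDchar, UTFException;

unittest
{
    // 0xF0 starts a four-byte sequence that is never completed.
    auto bad = "hello\xF0betty";

    // Default policy: the broken sequence is replaced with U+FFFD.
    assert(bad.byChar.byUTF!dchar.canFind(replacementDchar));

    // Opt-in policy: decoding throws once the bad sequence is reached.
    // The range is lazy, so .array is what actually triggers the decoding.
    assertThrown!UTFException(
        bad.byChar.byUTF!(dchar, UseReplacementDchar.no).array);
}
```

Because the range is lazy, merely constructing it does not have to throw; a test for the exception case needs to consume the range.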

lesderid · 2019-10-23T21:04:10Z

std/utf.d

@@ -4354,6 +4354,9 @@ if (isSomeChar!C)
    // hellö as a range of `ubyte`s, which are UTF-8
    assert((cast(ubyte[]) [0x68, 0x65, 0x6c, 0x6c, 0xC3, 0xB6]).byUTF!char().equal(['h', 'e', 'l', 'l', 0xC3, 0xB6]));

+    assertThrown((cast(ubyte[]) [0xC3, 0x28]).byUTF!char());


This needs an import: import std.exception : assertThrown;

lesderid · 2019-10-23T21:13:46Z

std/utf.d

@@ -60,7 +60,7 @@ $(TR $(TD Miscellaneous) $(TD
   +/
 module std.utf;

-import std.exception;  // basicExceptionCtors
+import std.exception;  // basicExceptionCtors, assertThrown


This won't work because tests are extracted. Please import it inside the unittest block.
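
A minimal sketch of what a unittest block with local imports could look like. It uses std.utf.validate rather than the PR's code, purely to illustrate the import placement; the byte values are the ones from the diff above.

```d
unittest
{
    // Imports are local, so the block still compiles when extracted on its own.
    import std.exception : assertThrown;
    import std.utf : UTFException, validate;

    // 0xC3 announces a two-byte sequence, but 0x28 is not a continuation byte.
    char[] bad = [cast(char) 0xC3, cast(char) 0x28];
    assertThrown!UTFException(validate(bad));
}
```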

atilaneves · 2019-10-30T10:20:41Z

I read the discussion on the other PR and AIUI it called for something that took ubyte[] and lazily produced validated char[]. This doesn't seem to be it. Could you please explain what you're trying to accomplish? Thanks.

dukc · 2020-04-27T18:47:32Z

std/utf.d

+    assert("𐐷".byUTF!dchar().equal([0x00010437]));
+
+    // hellö as a range of `ubyte`s, which are UTF-8
+    assert((cast(ubyte[]) [0x68, 0x65, 0x6c, 0x6c, 0xC3, 0xB6]).byUTF!char().equal(['h', 'e', 'l', 'l', 0xC3, 0xB6]));


Good, but I think you should also test converting a ubyte range directly to UTF-16 and/or UTF-32.

The tests should also cover invalid UTF-8 in the ubyte range, both with exceptions and with replacement characters.
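
A rough sketch of what the UTF-16 and UTF-32 checks might look like. Since the ubyte overload from this PR is not assumed to be available, the bytes are reinterpreted as const(char)[] as a stand-in; with the PR applied, the cast would presumably be dropped.

```d
import std.algorithm.comparison : equal;
import std.utf : byUTF;

unittest
{
    // "hellö" followed by U+10437, encoded as UTF-8 bytes.
    immutable ubyte[] bytes = [0x68, 0x65, 0x6c, 0x6c, 0xC3, 0xB6,
                               0xF0, 0x90, 0x90, 0xB7];
    // Stand-in for the PR's ubyte support: reinterpret as UTF-8 code units.
    auto utf8 = cast(const(char)[]) bytes;

    // UTF-16: U+10437 becomes the surrogate pair D801 DC37.
    assert(utf8.byUTF!wchar.equal([0x68, 0x65, 0x6c, 0x6c, 0xF6, 0xD801, 0xDC37]));
    // UTF-32: one code unit per code point.
    assert(utf8.byUTF!dchar.equal([0x68, 0x65, 0x6c, 0x6c, 0xF6, 0x10437]));
}
```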

dukc · 2020-04-27T18:57:16Z

> I read the discussion on the other PR and AIUI it called for something that took ubyte[] and lazily produced validated char[]. This doesn't seem to be it. Could you please explain what you're trying to accomplish? Thanks.

The added unittest explains it:
assert((cast(ubyte[]) [0x68, 0x65, 0x6c, 0x6c, 0xC3, 0xB6]).byUTF!char().equal(['h', 'e', 'l', 'l', 0xC3, 0xB6]));

You pass in a range of ubyte and get a range of char. Handy, as no separate step to deal with autodecoding is needed.

dukc · 2020-04-27T19:02:00Z

Perhaps this should also accept ushort (assumed to be UTF-16) and uint (UTF-32)?

Of course, this can be implemented later on just as well.

RazvanN7 · 2021-04-21T09:25:25Z

ping @vporton

vporton added 2 commits October 23, 2019 22:33

Added forgotten assert keyword

2a45742

Conversion from UTF encoded as ubyte[]

a7fb15f

vporton requested a review from jmdavis as a code owner October 23, 2019 20:01

Added assertThrown for an invalid UTF sequence, but this unittest failed

61abaec

lesderid reviewed Oct 23, 2019

Import comment

10b49c5

lesderid reviewed Oct 23, 2019

bug fix

4f6f636

dukc reviewed Apr 27, 2020

RazvanN7 added Needs Rebase Needs Work labels Apr 21, 2021

dlang-bot added the stalled label Jul 20, 2021
