Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17301: [C++] Implement compute function "binary_slice" #14550

Merged
merged 21 commits into from Nov 15, 2022

Conversation

kshitij12345
Copy link
Contributor

@kshitij12345 kshitij12345 commented Oct 29, 2022

Implements binary_slice_bytes similar to utf8_slice_codeunits.

Mostly based on utf8_slice_codeunits.

TODO:

  • C++ Tests
  • Python Tests

@kshitij12345 kshitij12345 marked this pull request as draft October 29, 2022 16:31
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@kshitij12345 kshitij12345 marked this pull request as ready for review October 30, 2022 13:50
@kshitij12345
Copy link
Contributor Author

cc: @AlenkaF @jorisvandenbossche

@@ -2119,6 +2119,96 @@ TYPED_TEST(TestStringKernels, SliceCodeunitsNegPos) {

#endif // ARROW_WITH_UTF8PROC

TYPED_TEST(TestBaseBinaryKernels, SliceBytesBasic) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inspired from utf8_slice_codeunits tests

TYPED_TEST(TestStringKernels, SliceCodeunitsBasic) {

@@ -536,6 +537,24 @@ def test_slice_compatibility():
start, stop, step) == result


def test_binary_slice_compatibility():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inspired from utf8_slice_codeunits test

def test_slice_compatibility():
arr = pa.array(["", "𝑓", "𝑓ö", "𝑓öõ", "𝑓öõḍ", "𝑓öõḍš"])
for start in range(-6, 6):
for stop in range(-6, 6):
for step in [-3, -2, -1, 1, 2, 3]:
expected = pa.array([k.as_py()[start:stop:step]
for k in arr])
result = pc.utf8_slice_codeunits(
arr, start=start, stop=stop, step=step)
assert expected.equals(result)
# Positional options
assert pc.utf8_slice_codeunits(arr,
start, stop, step) == result

@kshitij12345
Copy link
Contributor Author

Gentle ping @AlenkaF @jorisvandenbossche :)

@AlenkaF
Copy link
Member

AlenkaF commented Nov 7, 2022

Thank you for the PR @kshitij12345!
@pitrou @rok would you mind having a look at this PR?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @kshitij12345 . The implementation looks fine, here are some comments and suggestions.

{"strings"}, "SliceOptions", /*options_required=*/true);

void AddAsciiStringSlice(FunctionRegistry* registry) {
auto func = std::make_shared<ScalarFunction>("binary_slice_bytes", Arity::Unary(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think binary_slice is explanatory enough.

} // namespace

const FunctionDoc binary_slice_bytes_doc(
"Slice string",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Slice string",
"Slice binary string",


const FunctionDoc binary_slice_bytes_doc(
"Slice string",
("For each string in `strings`, emit the substring defined by\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
("For each string in `strings`, emit the substring defined by\n"
("For each binary string in `strings`, emit the substring defined by\n"

// continue counting from the left, we cannot start from begin_sliced because we
// don't know how many bytes are between begin and begin_sliced
end_sliced = std::min(begin + opt.stop, end);
// and therefore we also needs this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// and therefore we also needs this
// and therefore we also need this


TYPED_TEST(TestBaseBinaryKernels, SliceBytesPosNeg) {
SliceOptions options{2, -1};
this->CheckUnary("binary_slice_bytes", R"(["", "f", "fo", "foo", "food", "foods"])",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests would be better without any letter repetition in the source values. Otherwise the results might end up correct even with an incorrect implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it would be nice to put some non-ASCII bytes in there as well, to check that slicing is really byte-wise.

Comment on lines 541 to 542
arr = pa.array((el.encode('ascii')
for el in ["", "a", "ab", "abc", "abcd", "abcde"]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can write this in more idiomatic way. Also, it's nicer with some non-ASCII data:

Suggested change
arr = pa.array((el.encode('ascii')
for el in ["", "a", "ab", "abc", "abcd", "abcde"]))
arr = pa.array([b"", b"a", b"a\xff", b"a\xffc", b"a\xffcd", b"a\xffcde"])

@@ -449,7 +449,7 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
return std::make_shared<Options>(cpp11::as_cpp<std::string>(options["characters"]));
}

if (func_name == "utf8_slice_codeunits") {
if (func_name == "utf8_slice_codeunits" || func_name == "binary_slice_bytes") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should a test be created on the R side?
@thisisnic @paleolimbot Perhaps one of you can help.

@kshitij12345
Copy link
Contributor Author

@pitrou Thanks for the review. Will address them over the weekend.

void AddAsciiStringSlice(FunctionRegistry* registry) {
auto func = std::make_shared<ScalarFunction>("binary_slice_bytes", Arity::Unary(),
binary_slice_bytes_doc);
for (const auto& ty : BaseBinaryTypes()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Binary Slice should support UTF-8 strings as slicing them incorrectly will return invalid UTF-8 string

Eg. \"\xc2\xa2\" slicing this [0:1] will return ``"\xc2"` which is invalid UTF.

I think this should just support Binary types. Wdyt @pitrou ?

Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kshitij12345 I agree with that.

For the record, currently it's a bit of a mixed bag: binary_reverse doesn't support string input, but binary_replace_slice can... and can produce invalid output, for example:

>>> pc.binary_replace_slice(["hé"], 1, 2, "x")
<pyarrow.lib.StringArray object at 0x7fdbc09937c0>
[
  "hx�"
]
>>> pc.binary_replace_slice(["hé"], 1, 2, "x").validate(full=True)
Traceback (most recent call last):
  ...
ArrowInvalid: Invalid UTF8 sequence at string index 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The updated code only works with binary types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think binary_replace_slice should only work with binary types (as there is utf8_replace_slice for string types). And if user wants to actually play with byte data then they must manually cast it to that.

Probably worth an issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, please ping me when this is ready for review again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you can open an issue about binary_replace_slice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is ready for review. Have updated the tests to include non-ascii characters and updated the function names.

Except for R related testing comment, PR is ready.

(Also will open an issue on JIRA soon)

Thanks!

@@ -449,7 +449,7 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
return std::make_shared<Options>(cpp11::as_cpp<std::string>(options["characters"]));
}

if (func_name == "utf8_slice_codeunits") {
if (func_name == "utf8_slice_codeunits" || func_name == "binary_slice") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rok @AlenkaF Would one of you have to time to help @kshitij12345 write some R test here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this ping on Thursday...I'm happy to do this but it's slightly easier to do on a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok for me if done on a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made ARROW-18321 and self-assigned 🙂

@kshitij12345
Copy link
Contributor Author

@pitrou PTAL :)

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks for contributing @kshitij12345 !

@ursabot
Copy link

ursabot commented Nov 15, 2022

Benchmark runs are scheduled for baseline = 4daf945 and contender = 058d4f6. 058d4f6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.41% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.55% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 058d4f69 ec2-t3-xlarge-us-east-2
[Finished] 058d4f69 test-mac-arm
[Finished] 058d4f69 ursa-i9-9960x
[Finished] 058d4f69 ursa-thinkcentre-m75q
[Finished] 4daf9458 ec2-t3-xlarge-us-east-2
[Finished] 4daf9458 test-mac-arm
[Finished] 4daf9458 ursa-i9-9960x
[Finished] 4daf9458 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

paleolimbot added a commit that referenced this pull request Nov 16, 2022
This PR adds a test for the kernel implemented in ARROW-17301 (#14550).

Are there any R functions that map to this? (i.e., is there any `register_binding()` we should do for an R function?)

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants