ARROW-17301: [C++] Implement compute function "binary_slice" #14550
@@ -2119,6 +2119,96 @@ TYPED_TEST(TestStringKernels, SliceCodeunitsNegPos) { | |||
|
|||
#endif // ARROW_WITH_UTF8PROC | |||
|
|||
TYPED_TEST(TestBaseBinaryKernels, SliceBytesBasic) { |
Inspired by the `utf8_slice_codeunits` tests:
TYPED_TEST(TestStringKernels, SliceCodeunitsBasic) { |
@@ -536,6 +537,24 @@ def test_slice_compatibility(): | |||
start, stop, step) == result | |||
|
|||
|
|||
def test_binary_slice_compatibility(): |
Inspired by the `utf8_slice_codeunits` test:
arrow/python/pyarrow/tests/test_compute.py
Lines 525 to 537 in 884e81b
def test_slice_compatibility(): | |
arr = pa.array(["", "𝑓", "𝑓ö", "𝑓öõ", "𝑓öõḍ", "𝑓öõḍš"]) | |
for start in range(-6, 6): | |
for stop in range(-6, 6): | |
for step in [-3, -2, -1, 1, 2, 3]: | |
expected = pa.array([k.as_py()[start:stop:step] | |
for k in arr]) | |
result = pc.utf8_slice_codeunits( | |
arr, start=start, stop=stop, step=step) | |
assert expected.equals(result) | |
# Positional options | |
assert pc.utf8_slice_codeunits(arr, | |
start, stop, step) == result |
Gentle ping @AlenkaF @jorisvandenbossche :)
Thank you for the PR @kshitij12345!
Thanks for this @kshitij12345. The implementation looks fine; here are some comments and suggestions.
{"strings"}, "SliceOptions", /*options_required=*/true); | ||
|
||
void AddAsciiStringSlice(FunctionRegistry* registry) { | ||
auto func = std::make_shared<ScalarFunction>("binary_slice_bytes", Arity::Unary(), |
I think `binary_slice` is explanatory enough.
} // namespace | ||
|
||
const FunctionDoc binary_slice_bytes_doc( | ||
"Slice string", |
"Slice string", | |
"Slice binary string", |
|
||
const FunctionDoc binary_slice_bytes_doc( | ||
"Slice string", | ||
("For each string in `strings`, emit the substring defined by\n" |
("For each string in `strings`, emit the substring defined by\n" | |
("For each binary string in `strings`, emit the substring defined by\n" |
// continue counting from the left, we cannot start from begin_sliced because we | ||
// don't know how many bytes are between begin and begin_sliced | ||
end_sliced = std::min(begin + opt.stop, end); | ||
// and therefore we also needs this |
// and therefore we also needs this | |
// and therefore we also need this |
|
||
TYPED_TEST(TestBaseBinaryKernels, SliceBytesPosNeg) { | ||
SliceOptions options{2, -1}; | ||
this->CheckUnary("binary_slice_bytes", R"(["", "f", "fo", "foo", "food", "foods"])", |
These tests would be better without any letter repetition in the source values. Otherwise the results might end up correct even with an incorrect implementation.
Also, it would be nice to put some non-ASCII bytes in there as well, to check that slicing is really byte-wise.
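Both review points above can be illustrated with a minimal, standard-library-only sketch. It does not call Arrow at all; `shifted_slice` is a hypothetical buggy implementation invented here purely for illustration. With repeated letters, an off-by-one slice can return the expected value by accident, and with non-ASCII input the byte-wise result differs from the character-wise one.

```python
def correct_slice(value: bytes, start: int, stop: int) -> bytes:
    """Reference behaviour: plain Python byte slicing."""
    return value[start:stop]


def shifted_slice(value: bytes, start: int, stop: int) -> bytes:
    """Hypothetical buggy implementation: both bounds are off by one."""
    return value[start + 1:stop + 1]


# Repeated letters hide the bug: both slices of b"foo" yield b"o".
hidden = correct_slice(b"foo", 1, 2) == shifted_slice(b"foo", 1, 2)

# Distinct letters expose it: b"a" versus b"b".
exposed = correct_slice(b"fab", 1, 2) != shifted_slice(b"fab", 1, 2)

# Non-ASCII data shows whether slicing is byte-wise: "é" is two bytes in UTF-8,
# so the first two bytes of "hé" are b"h\xc3", not the encoding of "hé".
text = "hé"
bytewise = text.encode("utf-8")[0:2]
charwise = text[0:2].encode("utf-8")

print(hidden, exposed, bytewise, charwise)
```

Test inputs such as `b"a\xffc"` combine both properties: every byte is distinct, and `\xff` never appears in valid UTF-8.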
python/pyarrow/tests/test_compute.py
Outdated
arr = pa.array((el.encode('ascii') | ||
for el in ["", "a", "ab", "abc", "abcd", "abcde"])) |
This can be written in a more idiomatic way. Also, it's nicer with some non-ASCII data:
arr = pa.array((el.encode('ascii') | |
for el in ["", "a", "ab", "abc", "abcd", "abcde"])) | |
arr = pa.array([b"", b"a", b"a\xff", b"a\xffc", b"a\xffcd", b"a\xffcde"]) |
r/src/compute.cpp
Outdated
@@ -449,7 +449,7 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options( | |||
return std::make_shared<Options>(cpp11::as_cpp<std::string>(options["characters"])); | |||
} | |||
|
|||
if (func_name == "utf8_slice_codeunits") { | |||
if (func_name == "utf8_slice_codeunits" || func_name == "binary_slice_bytes") { |
Should a test be created on the R side?
@thisisnic @paleolimbot Perhaps one of you can help.
@pitrou Thanks for the review. I will address the comments over the weekend.
void AddAsciiStringSlice(FunctionRegistry* registry) { | ||
auto func = std::make_shared<ScalarFunction>("binary_slice_bytes", Arity::Unary(), | ||
binary_slice_bytes_doc); | ||
for (const auto& ty : BaseBinaryTypes()) { |
I don't think `binary_slice` should support UTF-8 strings, since slicing them at an arbitrary byte offset can return an invalid UTF-8 string.
E.g., slicing `"\xc2\xa2"` with [0:1] returns `"\xc2"`, which is invalid UTF-8.
I think this should only support binary types. Wdyt @pitrou?
Thanks!
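The concern can be checked with a short standard-library sketch. It uses plain Python `bytes` rather than Arrow, and `is_valid_utf8` is a helper name made up for this illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+00A2 (cent sign) is encoded as the two bytes 0xC2 0xA2.
encoded = "\u00a2".encode("utf-8")

# Byte-wise slicing [0:1] keeps only the lead byte of the two-byte sequence.
sliced = encoded[0:1]

print(encoded, is_valid_utf8(encoded))
print(sliced, is_valid_utf8(sliced))
```

The full sequence validates, while the one-byte slice does not, which is why a byte-wise slice that keeps the string type could hand back invalid string data.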
@kshitij12345 I agree with that.
For the record, currently it's a bit of a mixed bag: `binary_reverse` doesn't support string input, but `binary_replace_slice` can... and can produce invalid output, for example:
>>> pc.binary_replace_slice(["hé"], 1, 2, "x")
<pyarrow.lib.StringArray object at 0x7fdbc09937c0>
[
"hx�"
]
>>> pc.binary_replace_slice(["hé"], 1, 2, "x").validate(full=True)
Traceback (most recent call last):
...
ArrowInvalid: Invalid UTF8 sequence at string index 0
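The `"hx�"` output can be reproduced without Arrow. The sketch below emulates a byte-range replacement with plain Python `bytes`, under the assumption that the kernel replaces the bytes in the half-open range [1, 2); it is not the kernel's actual implementation.

```python
# "hé" is three bytes in UTF-8: b"h", then the two-byte sequence b"\xc3\xa9".
original = "hé".encode("utf-8")

# Replace the byte range [1, 2) with b"x": this overwrites only the lead byte
# 0xC3 and leaves the continuation byte 0xA9 orphaned.
replaced = original[:1] + b"x" + original[2:]

# A lenient decode substitutes U+FFFD for the orphaned byte.
lenient = replaced.decode("utf-8", errors="replace")

# A strict decode fails, mirroring the validation error shown above.
try:
    replaced.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(original, replaced, lenient, strict_ok)
```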
Thanks! The updated code only works with `binary` types.
Also, I think `binary_replace_slice` should only work with `binary` types (as there is `utf8_replace_slice` for string types). If users want to operate on the raw bytes, they should cast to a binary type explicitly.
Probably worth an issue?
Ok, please ping me when this is ready for review again.
Yes, I think you can open an issue about `binary_replace_slice`.
The PR is ready for review. I have updated the tests to include non-ASCII characters and updated the function names.
Except for the R-related testing comment, the PR is ready.
(I will also open an issue on JIRA soon.)
Thanks!
@@ -449,7 +449,7 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options( | |||
return std::make_shared<Options>(cpp11::as_cpp<std::string>(options["characters"])); | |||
} | |||
|
|||
if (func_name == "utf8_slice_codeunits") { | |||
if (func_name == "utf8_slice_codeunits" || func_name == "binary_slice") { |
@rok @AlenkaF Would one of you have the time to help @kshitij12345 write some R tests here?
Sorry I missed this ping on Thursday... I'm happy to do this, but it's slightly easier to do in a separate PR.
It's ok for me if done on a separate PR.
I made ARROW-18321 and self-assigned 🙂
@pitrou PTAL :)
+1, thanks for contributing @kshitij12345 !
Benchmark runs are scheduled for baseline = 4daf945 and contender = 058d4f6. 058d4f6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This PR adds a test for the kernel implemented in ARROW-17301 (#14550). Are there any R functions that map to this? (i.e., is there any `register_binding()` we should do for an R function?) Authored-by: Dewey Dunnington <dewey@voltrondata.com> Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
Implements `binary_slice_bytes`, similar to `utf8_slice_codeunits`.
Mostly based on `utf8_slice_codeunits`.
TODO: