-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12948: [C++][Python] Add slice_replace kernel #10494
Conversation
Needs rebasing onto #10496. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this looks good on the principle.
void AddReplaceSlice(FunctionRegistry* registry) { | ||
{ | ||
auto func = std::make_shared<ScalarFunction>("ascii_replace_slice", Arity::Unary(), | ||
&ascii_replace_slice_doc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably be called binary_replace_slice
since it works on non-Ascii input as well (it just slices in byte units, not codeunits).
using Utf8ReplaceSlice = StringTransformExecWithState<Type, Utf8ReplaceSliceTransform>; | ||
|
||
const FunctionDoc ascii_replace_slice_doc( | ||
"Replace a slice of a string with `replacement`", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"binary string"?
cpp/src/arrow/compute/api_scalar.h
Outdated
/// Index to start slicing at | ||
int64_t start = 0; | ||
/// Index to stop slicing at | ||
int64_t stop = std::numeric_limits<int64_t>::max(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I'm not sure the default values will be picked up. Is it just for documentation?
if (opts.stop >= 0) { | ||
// Count from left | ||
after_slice = | ||
std::min<int64_t>(input_string_ncodeunits, std::max(opts.start, opts.stop)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opts.start
can be negative, so you should perhaps use before_slice
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I added a test case for this too.
output = std::copy(input, input + before_slice, output); | ||
output = std::copy(opts.replacement.begin(), opts.replacement.end(), output); | ||
output = std::copy(input + after_slice, input + input_string_ncodeunits, output); | ||
return std::distance(output_start, output); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a bit pedantic. Just output - output_start
?
python/pyarrow/tests/test_compute.py
Outdated
res = pc.utf8_replace_slice(arr, start=-2, stop=3, replacement='χχ') | ||
assert res.tolist() == [None, 'χχ', 'χχ', 'χχ', 'πχχ', 'πbχχd'] | ||
res = pc.utf8_replace_slice(arr, start=-3, stop=-2, replacement='χχ') | ||
assert res.tolist() == [None, 'χχ', 'χχπ', 'χχπb', 'χχbθ', 'πχχθd'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really useful to re-write the same tests in Python. What you could do is generate slices as in test_slice_compatibility
to check Pandas compatibility.
docs/source/cpp/compute.rst
Outdated
+--------------------------+------------+-------------------------+------------------------+---------+---------------------------------------+ | ||
| utf8_lower | Unary | String-like | String-like | \(8) | | | ||
+--------------------------+------------+-------------------------+------------------------+---------+---------------------------------------+ | ||
| utf8_replace_slice | Unary | String-like | String-like | \(2) | :struct:`ReplaceSliceOptions` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the note number should be (4)?
This adds a slice_replace kernel mimicking Pandas's str.slice_replace. There are both ascii and UTF8 variants, indexing respectively with bytes and codepoints. The ascii variant also works on binary arrays. Closes apache#10494 from lidavidm/arrow-12948 Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
This adds a slice_replace kernel mimicking Pandas's str.slice_replace. There are both ascii and UTF8 variants, indexing respectively with bytes and codepoints. The ascii variant also works on binary arrays.