Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12948: [C++][Python] Add slice_replace kernel #10494

Closed
wants to merge 3 commits into from

Conversation

lidavidm
Copy link
Member

@lidavidm lidavidm commented Jun 9, 2021

This adds a slice_replace kernel mimicking Pandas's str.slice_replace. There are both ascii and UTF8 variants, indexing respectively with bytes and codepoints. The ascii variant also works on binary arrays.

@github-actions
Copy link

github-actions bot commented Jun 9, 2021

@lidavidm
Copy link
Member Author

lidavidm commented Jun 9, 2021

Needs rebasing onto #10496.

@lidavidm lidavidm marked this pull request as ready for review June 9, 2021 20:03
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this looks good on the principle.

void AddReplaceSlice(FunctionRegistry* registry) {
{
auto func = std::make_shared<ScalarFunction>("ascii_replace_slice", Arity::Unary(),
&ascii_replace_slice_doc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably be called binary_replace_slice since it works on non-Ascii input as well (it just slices in byte units, not codeunits).

using Utf8ReplaceSlice = StringTransformExecWithState<Type, Utf8ReplaceSliceTransform>;

const FunctionDoc ascii_replace_slice_doc(
"Replace a slice of a string with `replacement`",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"binary string"?

/// Index to start slicing at
int64_t start = 0;
/// Index to stop slicing at
int64_t stop = std::numeric_limits<int64_t>::max();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I'm not sure the default values will be picked up. Is it just for documentation?

if (opts.stop >= 0) {
// Count from left
after_slice =
std::min<int64_t>(input_string_ncodeunits, std::max(opts.start, opts.stop));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opts.start can be negative, so you should perhaps use before_slice instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I added a test case for this too.

output = std::copy(input, input + before_slice, output);
output = std::copy(opts.replacement.begin(), opts.replacement.end(), output);
output = std::copy(input + after_slice, input + input_string_ncodeunits, output);
return std::distance(output_start, output);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit pedantic. Just output - output_start?

res = pc.utf8_replace_slice(arr, start=-2, stop=3, replacement='χχ')
assert res.tolist() == [None, 'χχ', 'χχ', 'χχ', 'πχχ', 'πbχχd']
res = pc.utf8_replace_slice(arr, start=-3, stop=-2, replacement='χχ')
assert res.tolist() == [None, 'χχ', 'χχπ', 'χχπb', 'χχbθ', 'πχχθd']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really useful to re-write the same tests in Python. What you could do is generate slices as in test_slice_compatibility to check Pandas compatibility.

+--------------------------+------------+-------------------------+------------------------+---------+---------------------------------------+
| utf8_lower | Unary | String-like | String-like | \(8) | |
+--------------------------+------------+-------------------------+------------------------+---------+---------------------------------------+
| utf8_replace_slice | Unary | String-like | String-like | \(2) | :struct:`ReplaceSliceOptions` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the note number should be (4)?

@pitrou pitrou closed this in 49f4b18 Jun 10, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
This adds a slice_replace kernel mimicking Pandas's str.slice_replace. There are both ascii and UTF8 variants, indexing respectively with bytes and codepoints. The ascii variant also works on binary arrays.

Closes apache#10494 from lidavidm/arrow-12948

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants