ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim #8621
Conversation
I don't think the … |
Feel free to open a JIRA about that, by the way :-)
The benchmarks: [benchmark output not shown]
I've opened an issue at https://issues.apache.org/jira/browse/ARROW-10556. I guess we still need to manually add content to compute.rst?
Yes, you do :-)
Force-pushed from d200a04 to a5343ac.
@pitrou this is ready for review.
Thank you very much. Looks good mostly, just a couple comments below.
```cpp
std::vector<bool> codepoints;

explicit UTF8TrimBase(TrimOptions options) : options(options) {
  // TODO: check return / can we raise an exception here?
```
You could set the status on the KernelContext in the caller.
I've refactored this a bit: the kernel state now includes the codepoint vector, and that's where I have access to the `KernelContext*`.
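For readers following along, here is a minimal sketch of that idea. All names below (`TrimState`, `KernelContext::SetStatus`, `DecodeUtf8`, `TrimOptions`) are illustrative stand-ins rather than the actual Arrow internals: the codepoint lookup is built once in a kernel-state object, and a UTF-8 decoding error in the trim characters is reported through the context instead of being thrown from a constructor.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>

// Illustrative stand-ins, not the real Arrow types.
struct TrimOptions {
  std::string characters;  // UTF-8 encoded set of characters to trim
};

struct KernelContext {
  void SetStatus(const std::string& msg) { error = msg; }
  std::string error;
};

// Tiny UTF-8 decoder for the sketch only: advances *p, writes the decoded
// codepoint to *out, and returns false on malformed input.
static bool DecodeUtf8(const uint8_t** p, const uint8_t* end, uint32_t* out) {
  const uint8_t* s = *p;
  uint8_t b = s[0];
  int n;
  if (b < 0x80) { *out = b; n = 1; }
  else if ((b & 0xE0) == 0xC0) { *out = b & 0x1F; n = 2; }
  else if ((b & 0xF0) == 0xE0) { *out = b & 0x0F; n = 3; }
  else if ((b & 0xF8) == 0xF0) { *out = b & 0x07; n = 4; }
  else { return false; }
  if (end - s < n) return false;
  for (int i = 1; i < n; ++i) {
    if ((s[i] & 0xC0) != 0x80) return false;
    *out = (*out << 6) | (s[i] & 0x3F);
  }
  *p = s + n;
  return true;
}

// Kernel state: decode the trim characters once, up front, and report any
// decoding error via the context rather than on every Exec call.
struct TrimState {
  std::unordered_set<uint32_t> codepoints;

  TrimState(KernelContext* ctx, const TrimOptions& options) {
    const uint8_t* p = reinterpret_cast<const uint8_t*>(options.characters.data());
    const uint8_t* end = p + options.characters.size();
    uint32_t codepoint;
    while (p < end) {
      if (!DecodeUtf8(&p, end, &codepoint)) {
        ctx->SetStatus("Invalid UTF-8 in TrimOptions::characters");
        return;
      }
      codepoints.insert(codepoint);
    }
  }

  bool Contains(uint32_t codepoint) const { return codepoints.count(codepoint) != 0; }
};
```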
@pitrou this is ready for review.
Thank you for the update! I don't have much to add, except two questions.
```cpp
  return false;
}
if (predicate(codepoint)) {
  *position = current + 1;
```
This is a bit weird. It returns the position of the next codepoint? The docstring should be a bit clearer about that (the current spelling is cryptic to me).
I added some more text, I hope that clarifies it enough. See also https://stackoverflow.com/questions/14760134/why-does-removing-the-first-element-of-a-list-invalidate-rend/14760316#14760316
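To illustrate the convention being discussed, here is a simplified, ASCII-only sketch (not the actual kernel code): when scanning from the right, the reported position is one past the last codepoint that satisfies the predicate, so for right-trimming it can be used directly as the new length, much like converting a C++ reverse iterator back to a forward position with `.base()`.

```cpp
#include <string>

// Returns one past the index of the last character satisfying `predicate`,
// scanning from the right, or 0 if no character matches. This mirrors the
// `*position = current + 1` convention above: the result is a forward
// position, directly usable as the trimmed length.
static size_t FindLastMatchEnd(const std::string& s, bool (*predicate)(char)) {
  for (size_t current = s.size(); current > 0; --current) {
    if (predicate(s[current - 1])) {
      return current;
    }
  }
  return 0;
}

static bool IsNotSpace(char c) { return c != ' ' && c != '\t'; }

int main() {
  std::string s = "hello \t ";
  s.resize(FindLastMatchEnd(s, IsNotSpace));  // s is now "hello"
}
```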
```cpp
#ifdef ARROW_WITH_UTF8PROC

template <typename Type, typename Derived>
struct UTF8Transform : StringTransform<Type, Derived> {
```
I don't exactly understand this refactor. There's a `UTF8Transform` with a `Transform` method for utf8 kernels, but no corresponding class with a `Transform` method for ascii kernels, is that right?
Yeah, that was a bad choice of name; `StringTransformCodepoint` is more descriptive, it's a per-codepoint transformation.
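As a rough sketch of what "per-codepoint transformation" means here (simplified and with made-up names: the real kernels work on UTF-8 buffers directly rather than materializing a `char32_t` string):

```cpp
#include <string>

// Apply a codepoint -> codepoint function to every codepoint of the input.
// Per-codepoint kernels such as case mapping fit this shape; trimming, by
// contrast, is not a per-codepoint mapping.
template <typename CodepointTransform>
std::u32string TransformCodepoints(const std::u32string& input,
                                   CodepointTransform transform) {
  std::u32string output;
  output.reserve(input.size());
  for (char32_t codepoint : input) {
    output.push_back(transform(codepoint));
  }
  return output;
}

// Example: ASCII-only upper-casing of already-decoded codepoints.
// std::u32string upper = TransformCodepoints(decoded, [](char32_t c) {
//   return (c >= U'a' && c <= U'z') ? c - 32 : c;
// });
```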
It'd be nice to include this in the release, so ping @maartenbreddels
I agree, I'll do my best to be responsive to get this in soon!
```cpp
// Note that rounding down the 3/2 is ok, since only codepoints encoded by
// two code units (even) can grow to 3 code units.

return static_cast<int64_t>(input_ncodeunits) * 3 / 2;
```
Now that I read this again, it strikes me that this function is estimating a number of codepoints, but we're using it to allocate a number of bytes (i.e. utf-8 codeunits). Is that ok?
(and/or the function naming and comment is inconsistent)
Good catch, that was completely wrong. I also slightly modified the text; this is all about code units/bytes.
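A concrete illustration of the estimate (the example characters are mine, not from the PR): a 2-byte codepoint can upper-case to a 3-byte codepoint, so the allocation has to be measured in code units (bytes), and 3/2 of the input code units is the worst case the existing comment describes.

```cpp
#include <cstdint>

// Worst-case output size in UTF-8 code units (bytes) for case mapping.
// Example: "ɑ" (U+0251, 2 bytes: 0xC9 0x91) upper-cases to "Ɑ" (U+2C6D,
// 3 bytes: 0xE2 0xB1 0xAD). Per the comment above, only 2-byte codepoints
// can grow to 3 bytes, so the growable part of the input has even length and
// rounding 3/2 down never under-allocates.
static int64_t MaxCodeunitsAfterCaseMapping(int64_t input_ncodeunits) {
  return input_ncodeunits * 3 / 2;
}

// e.g. 4 copies of "ɑ" occupy 8 input bytes; the upper-cased "ⱭⱭⱭⱭ" occupies
// 12 bytes, exactly MaxCodeunitsAfterCaseMapping(8).
```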
OK, the PR seems fine on its own; I just added a question about some existing code.
ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim

There is one obvious loose end in this PR, which is where to generate the `std::set` based on the `TrimOptions` (now in the ctor of UTF8TrimBase). I'm not sure what the lifetime guarantees are for this object (TrimOptions), where it makes sense to initialize this set, and, when a (utf8 decoding) error occurs, how/where to report it.

Although this is not a costly operation, assuming people don't pass in a billion characters to trim, I do wonder what the best approach here is in general. It does not make much sense to create the `std::set` at each `Exec` call, but that is what happens now. (This also seems to happen in `TransformMatchSubstring` for creating the `prefix_table`, btw.)

Maybe a good place to put per-kernel pre-computed results is the `*Options` objects, but I'm not sure if that makes sense in the current architecture.

Another idea is to explore alternatives to the `std::set`. It seems (based on the TrimManyAscii benchmark) that `std::unordered_set` is a bit slower, and simply using a linear search, `std::find(options.characters.begin(), options.characters.end(), c) != options.characters.end()`, in the predicate instead of the set doesn't seem to affect performance that much. In CPython a bloom filter is used; I could explore whether that makes sense here, but the implementation in Arrow lives under the parquet namespace.

Closes #8621 from maartenbreddels/ARROW-9128

Authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
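To make that trade-off concrete, here is a sketch of the three lookup strategies mentioned for the ASCII trim predicate (illustrative code, not the PR's actual implementation):

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <unordered_set>

struct TrimOptions {
  std::string characters;  // characters to trim
};

// 1. Ordered set (the std::set discussed above): O(log n) per lookup.
static bool InSet(const std::set<char>& chars, char c) {
  return chars.count(c) != 0;
}

// 2. Hash set: O(1) average, but the per-lookup hashing overhead can make it
//    slower than std::set for the tiny sets typical of trim characters.
static bool InHashSet(const std::unordered_set<char>& chars, char c) {
  return chars.count(c) != 0;
}

// 3. Linear scan over the option string: no precomputation at all, and for a
//    handful of trim characters it benchmarks about the same as the set.
static bool InCharacters(const TrimOptions& options, char c) {
  return std::find(options.characters.begin(), options.characters.end(), c) !=
         options.characters.end();
}
```

For the UTF-8 variant, the reviewed diff above shows a `std::vector<bool>` lookup table indexed by codepoint, which is yet another point on the same spectrum.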
Needs a rebase after #8621 is merged.

I totally agree with https://github.com/python/cpython/blob/c9bc290dd6e3994a4ead2a224178bcba86f0c0e4/Objects/sliceobject.c#L252

This was tricky to get right; the main difficulty is manually dealing with reverse iterators. Therefore I put on extra guardrails by having the Python unittests cover a lot of cases. All edge cases detected by this are translated to the C++ unittest suite, so we could reduce them to reduce pytest execution cost (I added 1 second).

Slicing is based on Python: `[start, stop)` inclusive/exclusive semantics, where an index refers to a codeunit (like Python apparently, though badly documented), and negative indices count from the right. `step != 0` is supported, like Python.

The only thing we cannot support easily is things like reversing a string, since in Python one can do `s[::-1]` or `s[-1::-1]`, but we don't support empty values with the Options machinery (we model this as a C `int64`). To mimic this, we can do `pc.utf8_slice_codeunits(ar, start=-1, stop=-sys.maxsize, step=-1)` (i.e. a very large negative value). Libraries such as Pandas and Vaex can do something like that, confirmed to be working by modifying the unittest like this:

```python
import sys

@pytest.mark.parametrize('start', list(range(-6, 6)) + [None])
@pytest.mark.parametrize('stop', list(range(-6, 6)) + [None])
@pytest.mark.parametrize('step', [-3, -2, -1, 1, 2, 3])
def test_slice_compatibility(start, stop, step):
    input = pa.array(["", "𝑓", "𝑓ö", "𝑓öõ", "𝑓öõḍ", "𝑓öõḍš"])
    expected = pa.array([k.as_py()[start:stop:step] for k in input])
    if start is None:
        start = -sys.maxsize if step > 0 else sys.maxsize
    if stop is None:
        stop = sys.maxsize if step > 0 else -sys.maxsize
    result = pc.utf8_slice_codeunits(input, start=start, stop=stop, step=step)
    assert expected.equals(result)
```

So libraries using this can implement the full Python behavior with this workaround.

Closes #9000 from maartenbreddels/ARROW-10557

Lead-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>