
ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim #8621

Closed (16 commits)

Conversation

maartenbreddels (Contributor)

There is one obvious loose end in this PR, which is where to generate the `std::set` based on the `TrimOptions` (currently in the ctor of `UTF8TrimBase`). I'm not sure what the lifetime guarantees are for this object (`TrimOptions`), where it makes sense to initialize this set, and, when a (UTF-8 decoding) error occurs, how/where to report it.

Although this is not a costly operation (assuming people don't pass in a billion characters to trim), I do wonder what the best approach is in general. It does not make much sense to create the `std::set` on each `Exec` call, but that is what happens now. (This also seems to happen in `TransformMatchSubstring` when creating the `prefix_table`, by the way.)

Maybe a good place to put per-kernel precomputed results is the `*Options` objects, but I'm not sure whether that makes sense in the current architecture.

Another idea is to explore alternatives to the `std::set`. Based on the TrimManyAscii benchmark, `std::unordered_set` seemed a bit slower, and simply using a linear search in the predicate instead of the set, i.e. `std::find(options.characters.begin(), options.characters.end(), c) != options.characters.end()`, doesn't seem to affect performance much (see the sketch below).

In CPython, a bloom filter is used; I could explore whether that makes sense here, but the existing implementation in Arrow lives under the `parquet` namespace.
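
For illustration, here is a minimal standalone sketch of the two ASCII-side predicate variants being compared; the function names and the `main` driver are assumptions for the example, not code from this PR:

```cpp
// Minimal sketch of the two trim-character predicates under discussion.
#include <algorithm>
#include <set>
#include <string>

// Variant 1: membership test against a std::set (currently rebuilt on each Exec call).
bool IsTrimCharSet(const std::set<char>& trim_set, char c) {
  return trim_set.find(c) != trim_set.end();
}

// Variant 2: linear search directly over the characters string from the options;
// for the short character lists typically passed to trim, this performs similarly.
bool IsTrimCharLinear(const std::string& characters, char c) {
  return std::find(characters.begin(), characters.end(), c) != characters.end();
}

int main() {
  const std::string characters = " \t\r\n";
  const std::set<char> trim_set(characters.begin(), characters.end());
  // Both predicates agree on whether a character should be trimmed.
  return IsTrimCharSet(trim_set, ' ') == IsTrimCharLinear(characters, ' ') ? 0 : 1;
}
```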

pitrou (Member) commented Nov 10, 2020

> Maybe a good place to put per-kernel precomputed results is the `*Options` objects, but I'm not sure whether that makes sense in the current architecture.

I don't think the Options objects are the right place. Ideally the kernel state would be options-specific; otherwise we can devise a generic caching facility.

pitrou (Member) commented Nov 10, 2020

Feel free to open a JIRA about that, by the way :-)

maartenbreddels (Contributor, Author)

The `std::vector<bool>` was a good idea; because it stores bits, the memory usage for Unicode isn't that heavy (worst case: 0x10FFFF bits ≈ 140 kB for a contiguous array implementation).

Benchmarks:

set:
TrimManyAscii_median   28346892 ns   28345125 ns         25   558.956MB/s   35.2794M items/s
TrimManyUtf8_median    28302644 ns   28294883 ns         25   559.949MB/s   35.3421M items/s

unordered_set:
TrimManyAscii_median   32017530 ns   32014024 ns         22   494.898MB/s   31.2363M items/s
TrimManyUtf8_median (not run)

vector<bool>:
TrimManyAscii_median   14911543 ns   14910620 ns         47   1062.58MB/s   67.0663M items/s
TrimManyUtf8_median    16148001 ns   16146053 ns         44   981.273MB/s   61.9346M items/s

bitset<256>:
TrimManyAscii_median   14304925 ns   14304010 ns         49   1107.64MB/s   69.9105M items/s

`vector<bool>` is good enough, I think; the bitset is consistently faster (~5%), but I'd rather have similar code for both solutions.
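
For reference, a minimal sketch of the `vector<bool>` approach; the names, the sizing-to-the-largest-codepoint strategy, and the `main` driver are illustrative assumptions rather than the PR's actual kernel state:

```cpp
// Bit-per-codepoint lookup table for trim characters, built once and then
// queried while scanning from the left/right ends of each string.
#include <cstdint>
#include <vector>

std::vector<bool> MakeCodepointTable(const std::vector<uint32_t>& trim_codepoints) {
  // Size the table to the largest codepoint used; the worst case
  // (0x10FFFF + 1 bits) is roughly 140 kB.
  uint32_t max_cp = 0;
  for (uint32_t cp : trim_codepoints) {
    if (cp > max_cp) max_cp = cp;
  }
  std::vector<bool> table(static_cast<size_t>(max_cp) + 1, false);
  for (uint32_t cp : trim_codepoints) {
    table[cp] = true;
  }
  return table;
}

inline bool IsTrimCodepoint(const std::vector<bool>& table, uint32_t cp) {
  return cp < table.size() && table[cp];
}

int main() {
  const std::vector<uint32_t> trim_chars = {0x20, 0x09, 0x0A, 0x0D};  // " \t\n\r"
  const auto table = MakeCodepointTable(trim_chars);
  return IsTrimCodepoint(table, 0x20) && !IsTrimCodepoint(table, 'x') ? 0 : 1;
}
```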

maartenbreddels (Contributor, Author)

I've opened an issue at https://issues.apache.org/jira/browse/ARROW-10556

I guess we still need to manually add content to compute.rst?

pitrou (Member) commented Nov 11, 2020

> I guess we still need to manually add content to compute.rst?

Yes, you do :-)

maartenbreddels (Contributor, Author)

@pitrou this is ready for review.

pitrou (Member) left a review:

Thank you very much. Looks mostly good, just a couple of comments below.

Two review threads on cpp/src/arrow/compute/kernels/scalar_string.cc (outdated, resolved)
  std::vector<bool> codepoints;

  explicit UTF8TrimBase(TrimOptions options) : options(options) {
    // TODO: check return / can we raise an exception here?
pitrou (Member):
You could set the status on the KernelContext in the caller.

maartenbreddels (Contributor, Author):
I've refactored this a bit: the kernel state now includes the codepoint vector, and that's where I have access to the `KernelContext*`.

Review thread on cpp/src/arrow/compute/kernels/scalar_string_test.cc (outdated, resolved)
Review thread on cpp/src/arrow/util/utf8.h (outdated, resolved)
Review thread on cpp/src/arrow/util/utf8_util_test.cc (resolved)
maartenbreddels (Contributor, Author)

@pitrou this is ready for review

pitrou (Member) left a review:

Thank you for the update! I don't have much to add, except two questions.

    return false;
  }
  if (predicate(codepoint)) {
    *position = current + 1;
pitrou (Member):

This is a bit weird. It returns the position of the next codepoint? The docstring should be a bit clearer about that (the current wording is cryptic to me).


#ifdef ARROW_WITH_UTF8PROC

template <typename Type, typename Derived>
struct UTF8Transform : StringTransform<Type, Derived> {
pitrou (Member):

I don't exactly understand this refactor. There's a UTF8Transform with a Transform method for utf8 kernels but no corresponding class with a Transform method for ascii kernels, is that right?

maartenbreddels (Contributor, Author):

Yeah, that was a bad choice of name; `StringTransformCodepoint` is more descriptive, since it's a per-codepoint transformation.
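
To illustrate the idea of a per-codepoint transformation, here is a generic sketch operating on already-decoded codepoints; the names and the ASCII-only mapping are assumptions for the example, not Arrow's actual `StringTransformCodepoint`:

```cpp
// A per-codepoint transform maps each codepoint independently; the real
// kernels decode/encode UTF-8 around this and use utf8proc for case mapping.
#include <string>

template <typename CodepointFn>
std::u32string TransformCodepoints(const std::u32string& input, CodepointFn fn) {
  std::u32string output;
  output.reserve(input.size());
  for (char32_t cp : input) {
    output.push_back(fn(cp));  // apply the mapping to each codepoint
  }
  return output;
}

int main() {
  // Toy mapping: ASCII-only upper-casing.
  auto ascii_upper = [](char32_t cp) -> char32_t {
    return (cp >= U'a' && cp <= U'z') ? cp - 0x20 : cp;
  };
  return TransformCodepoints(U"arrow", ascii_upper) == U"ARROW" ? 0 : 1;
}
```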

kszucs (Member) commented Jan 12, 2021

It'd be nice to include this in the release, so ping @maartenbreddels.

maartenbreddels (Contributor, Author)

I agree, I'll do my best to be responsive to get this in soon!

maartenbreddels (Contributor, Author)

Ready for review, @kszucs or @pitrou. The CI failure is fsspec-related (already reported on JIRA by Joris).

// Note that rounding down the 3/2 is ok, since only codepoints encoded by
// two code units (even) can grow to 3 code units.

return static_cast<int64_t>(input_ncodeunits) * 3 / 2;
pitrou (Member):

Now that I read this again, it strikes me that this function is estimating a number of codepoints, but we're using it to allocate a number of bytes (i.e. utf-8 codeunits). Is that ok?

(and/or the function naming and comment is inconsistent)

maartenbreddels (Contributor, Author):

Good catch, that was completely wrong. I also slightly modified the text; this is all about code units/bytes.
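
For context, a minimal sketch of a byte-oriented (code unit) worst-case estimate; the function name, the example codepoint, and the `main` driver are illustrative assumptions, not the PR's actual helper:

```cpp
// Worst-case output size for a UTF-8 case transform, counted in code units
// (bytes). Following the reasoning quoted above: only codepoints encoded in
// two code units can grow to three (e.g. U+0250 LATIN SMALL LETTER TURNED A,
// 2 bytes, upper-cases to U+2C6F, 3 bytes), so the output grows by at most
// a factor of 3/2 in bytes, and rounding down is safe.
#include <cstdint>

int64_t MaxCodeunitsAfterCaseTransform(int64_t input_ncodeunits) {
  return input_ncodeunits * 3 / 2;
}

int main() {
  return MaxCodeunitsAfterCaseTransform(10) == 15 ? 0 : 1;
}
```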

pitrou (Member) commented Jan 19, 2021

OK, the PR seems fine on its own; I just added a question about some existing code.

pitrou closed this in 0e5d646 on Jan 19, 2021
kszucs pushed a commit that referenced this pull request Jan 25, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
pitrou added a commit that referenced this pull request Jun 7, 2021
Needs a rebase after #8621 is merged

I totally agree with https://github.com/python/cpython/blob/c9bc290dd6e3994a4ead2a224178bcba86f0c0e4/Objects/sliceobject.c#L252

This was tricky to get right; the main difficulty is in manually dealing with reverse iterators. Therefore I added extra guardrails by having the Python unit tests cover a lot of cases. All edge cases detected by this are translated to the C++ unit test suite, so we could later reduce the Python tests to cut pytest execution cost (they add about 1 second).

Slicing follows Python's `[start, stop)` inclusive/exclusive semantics, where an index refers to a code unit (like Python, apparently, though this is badly documented), and negative indices count from the right. Any `step != 0` is supported, as in Python.

The only thing we cannot support easily is reversing a string: in Python one can write `s[::-1]` or `s[-1::-1]`, but we don't support empty values in the Options machinery (we model this as a C `int64`). To mimic it, one can call `pc.utf8_slice_codeunits(ar, start=-1, stop=-sys.maxsize, step=-1)` (i.e. pass a very large negative value).

For instance, libraries such as Pandas and Vaex can do something like that; this was confirmed to work by modifying the unit test like this:
```python
import sys

import pytest
import pyarrow as pa
import pyarrow.compute as pc


@pytest.mark.parametrize('start', list(range(-6, 6)) + [None])
@pytest.mark.parametrize('stop', list(range(-6, 6)) + [None])
@pytest.mark.parametrize('step', [-3, -2, -1, 1, 2, 3])
def test_slice_compatibility(start, stop, step):
    input = pa.array(["", "𝑓", "𝑓ö", "𝑓öõ", "𝑓öõḍ", "𝑓öõḍš"])
    expected = pa.array([k.as_py()[start:stop:step] for k in input])
    # None is mapped to +/- sys.maxsize so the kernel sees concrete bounds.
    if start is None:
        start = -sys.maxsize if step > 0 else sys.maxsize
    if stop is None:
        stop = sys.maxsize if step > 0 else -sys.maxsize
    result = pc.utf8_slice_codeunits(input, start=start, stop=stop, step=step)
    assert expected.equals(result)
```

So libraries using this can implement the full Python behavior with this workaround.

Closes #9000 from maartenbreddels/ARROW-10557

Lead-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021