
ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim #8621

Closed (16 commits)

Conversation

maartenbreddels (Contributor)

There is one obvious loose end in this PR, which is where to generate the `std::set` based on the `TrimOptions` (currently in the ctor of `UTF8TrimBase`). I'm not sure what the lifetime guarantees are for this object (`TrimOptions`), where it makes sense to initialize this set, and, when a (UTF-8 decoding) error occurs, how/where to report it.

Although this is not a costly operation (assuming people don't pass in a billion characters to trim), I do wonder what the best approach is in general. It does not make much sense to create the `std::set` on each `Exec` call, but that is what happens now. (This also seems to happen in `TransformMatchSubstring` when creating the `prefix_table`, by the way.)

Maybe a good place to put per-kernel precomputed results is the `*Options` objects, but I'm not sure whether that makes sense in the current architecture.

Another idea is to explore alternatives to the `std::set`. Based on the TrimManyAscii benchmark, `std::unordered_set` seemed a bit slower, and simply using a linear search in the predicate instead of the set, i.e. `std::find(options.characters.begin(), options.characters.end(), c) != options.characters.end()`, doesn't seem to affect performance much (see the sketch below).

In CPython, a bloom filter is used; I could explore whether that makes sense here, but the existing implementation in Arrow lives under the `parquet` namespace.
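
For illustration, here is a minimal standalone sketch of the two ASCII-side predicate variants being compared; the function names and the `main` driver are assumptions for the example, not code from this PR:

```cpp
// Minimal sketch of the two trim-character predicates under discussion.
#include <algorithm>
#include <set>
#include <string>

// Variant 1: membership test against a std::set (currently rebuilt on each Exec call).
bool IsTrimCharSet(const std::set<char>& trim_set, char c) {
  return trim_set.find(c) != trim_set.end();
}

// Variant 2: linear search directly over the characters string from the options;
// for the short character lists typically passed to trim, this performs similarly.
bool IsTrimCharLinear(const std::string& characters, char c) {
  return std::find(characters.begin(), characters.end(), c) != characters.end();
}

int main() {
  const std::string characters = " \t\r\n";
  const std::set<char> trim_set(characters.begin(), characters.end());
  // Both predicates agree on whether a character should be trimmed.
  return IsTrimCharSet(trim_set, ' ') == IsTrimCharLinear(characters, ' ') ? 0 : 1;
}
```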

pitrou (Member) commented Nov 10, 2020

> Maybe a good place to put per-kernel precomputed results is the `*Options` objects, but I'm not sure whether that makes sense in the current architecture.

I don't think the Options objects are the right place. Ideally the kernel state would be options-specific; otherwise we can devise a generic caching facility.

pitrou (Member) commented Nov 10, 2020

Feel free to open a JIRA about that, by the way :-)

maartenbreddels (Contributor, Author)

The `std::vector<bool>` was a good idea; because it stores bits, the memory usage for Unicode isn't that heavy (worst case: 0x10FFFF bits ≈ 140 kB for a contiguous array implementation).

Benchmarks:

set:
TrimManyAscii_median   28346892 ns   28345125 ns         25   558.956MB/s   35.2794M items/s
TrimManyUtf8_median    28302644 ns   28294883 ns         25   559.949MB/s   35.3421M items/s

unordered_set:
TrimManyAscii_median   32017530 ns   32014024 ns         22   494.898MB/s   31.2363M items/s
TrimManyUtf8_median (not run)

vector<bool>:
TrimManyAscii_median   14911543 ns   14910620 ns         47   1062.58MB/s   67.0663M items/s
TrimManyUtf8_median    16148001 ns   16146053 ns         44   981.273MB/s   61.9346M items/s

bitset<256>:
TrimManyAscii_median   14304925 ns   14304010 ns         49   1107.64MB/s   69.9105M items/s

`vector<bool>` is good enough, I think; the bitset is consistently faster (~5%), but I'd rather have similar code for both solutions.
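
For reference, a minimal sketch of the `vector<bool>` approach; the names, the sizing-to-the-largest-codepoint strategy, and the `main` driver are illustrative assumptions rather than the PR's actual kernel state:

```cpp
// Bit-per-codepoint lookup table for trim characters, built once and then
// queried while scanning from the left/right ends of each string.
#include <cstdint>
#include <vector>

std::vector<bool> MakeCodepointTable(const std::vector<uint32_t>& trim_codepoints) {
  // Size the table to the largest codepoint used; the worst case
  // (0x10FFFF + 1 bits) is roughly 140 kB.
  uint32_t max_cp = 0;
  for (uint32_t cp : trim_codepoints) {
    if (cp > max_cp) max_cp = cp;
  }
  std::vector<bool> table(static_cast<size_t>(max_cp) + 1, false);
  for (uint32_t cp : trim_codepoints) {
    table[cp] = true;
  }
  return table;
}

inline bool IsTrimCodepoint(const std::vector<bool>& table, uint32_t cp) {
  return cp < table.size() && table[cp];
}

int main() {
  const std::vector<uint32_t> trim_chars = {0x20, 0x09, 0x0A, 0x0D};  // " \t\n\r"
  const auto table = MakeCodepointTable(trim_chars);
  return IsTrimCodepoint(table, 0x20) && !IsTrimCodepoint(table, 'x') ? 0 : 1;
}
```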

maartenbreddels (Contributor, Author)

I've opened an issue at https://issues.apache.org/jira/browse/ARROW-10556

I guess we still need to manually add content to compute.rst?

pitrou (Member) commented Nov 11, 2020

> I guess we still need to manually add content to compute.rst?

Yes, you do :-)

maartenbreddels (Contributor, Author)

@pitrou this is ready for review.

pitrou (Member) left a review:

Thank you very much. Looks mostly good, just a couple of comments below.

Two review threads on cpp/src/arrow/compute/kernels/scalar_string.cc (outdated, resolved)
  std::vector<bool> codepoints;

  explicit UTF8TrimBase(TrimOptions options) : options(options) {
    // TODO: check return / can we raise an exception here?
pitrou (Member):
You could set the status on the KernelContext in the caller.

maartenbreddels (Contributor, Author):
I've refactored this a bit: the kernel state now includes the codepoint vector, and that's where I have access to the `KernelContext*`.

Review thread on cpp/src/arrow/compute/kernels/scalar_string_test.cc (outdated, resolved)
Review thread on cpp/src/arrow/util/utf8.h (outdated, resolved)
Review thread on cpp/src/arrow/util/utf8_util_test.cc (resolved)
maartenbreddels (Contributor, Author)

@pitrou this is ready for review

pitrou (Member) left a review:

Thank you for the update! I don't have much to add, except two questions.

    return false;
  }
  if (predicate(codepoint)) {
    *position = current + 1;
pitrou (Member):

This is a bit weird. It returns the position of the next codepoint? The docstring should be a bit clearer about that (the current wording is cryptic to me).


#ifdef ARROW_WITH_UTF8PROC

template <typename Type, typename Derived>
struct UTF8Transform : StringTransform<Type, Derived> {
pitrou (Member):

I don't exactly understand this refactor. There's a UTF8Transform with a Transform method for utf8 kernels but no corresponding class with a Transform method for ascii kernels, is that right?

maartenbreddels (Contributor, Author):

Yeah, that was a bad choice of name; `StringTransformCodepoint` is more descriptive, since it's a per-codepoint transformation.
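
To illustrate the idea of a per-codepoint transformation, here is a generic sketch operating on already-decoded codepoints; the names and the ASCII-only mapping are assumptions for the example, not Arrow's actual `StringTransformCodepoint`:

```cpp
// A per-codepoint transform maps each codepoint independently; the real
// kernels decode/encode UTF-8 around this and use utf8proc for case mapping.
#include <string>

template <typename CodepointFn>
std::u32string TransformCodepoints(const std::u32string& input, CodepointFn fn) {
  std::u32string output;
  output.reserve(input.size());
  for (char32_t cp : input) {
    output.push_back(fn(cp));  // apply the mapping to each codepoint
  }
  return output;
}

int main() {
  // Toy mapping: ASCII-only upper-casing.
  auto ascii_upper = [](char32_t cp) -> char32_t {
    return (cp >= U'a' && cp <= U'z') ? cp - 0x20 : cp;
  };
  return TransformCodepoints(U"arrow", ascii_upper) == U"ARROW" ? 0 : 1;
}
```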

kszucs (Member) commented Jan 12, 2021

It'd be nice to include this in the release, so ping @maartenbreddels.

maartenbreddels (Contributor, Author)

I agree, I'll do my best to be responsive to get this in soon!

maartenbreddels (Contributor, Author)

Ready for review, @kszucs or @pitrou. The CI failure is fsspec-related (already reported on JIRA by Joris).

// Note that rounding down the 3/2 is ok, since only codepoints encoded by
// two code units (even) can grow to 3 code units.

return static_cast<int64_t>(input_ncodeunits) * 3 / 2;
pitrou (Member):

Now that I read this again, it strikes me that this function is estimating a number of codepoints, but we're using it to allocate a number of bytes (i.e. utf-8 codeunits). Is that ok?

(and/or the function naming and comment is inconsistent)

maartenbreddels (Contributor, Author):

Good catch, that was completely wrong. I also slightly modified the text; this is all about code units/bytes.
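
For context, a minimal sketch of a byte-oriented (code unit) worst-case estimate; the function name, the example codepoint, and the `main` driver are illustrative assumptions, not the PR's actual helper:

```cpp
// Worst-case output size for a UTF-8 case transform, counted in code units
// (bytes). Following the reasoning quoted above: only codepoints encoded in
// two code units can grow to three (e.g. U+0250 LATIN SMALL LETTER TURNED A,
// 2 bytes, upper-cases to U+2C6F, 3 bytes), so the output grows by at most
// a factor of 3/2 in bytes, and rounding down is safe.
#include <cstdint>

int64_t MaxCodeunitsAfterCaseTransform(int64_t input_ncodeunits) {
  return input_ncodeunits * 3 / 2;
}

int main() {
  return MaxCodeunitsAfterCaseTransform(10) == 15 ? 0 : 1;
}
```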

pitrou (Member) commented Jan 19, 2021

OK, the PR seems fine on its own; I just added a question about some existing code.

pitrou closed this in 0e5d646 on Jan 19, 2021
kszucs pushed a commit that referenced this pull request Jan 25, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
pitrou added a commit that referenced this pull request Jun 7, 2021
Needs a rebase after #8621 is merged

I totally agree with https://github.com/python/cpython/blob/c9bc290dd6e3994a4ead2a224178bcba86f0c0e4/Objects/sliceobject.c#L252

This was tricky to get right; the main difficulty is in manually dealing with reverse iterators. Therefore I added extra guardrails by having the Python unit tests cover a lot of cases. All edge cases detected by this are translated to the C++ unit test suite, so we could later reduce the Python tests to cut pytest execution cost (they add about 1 second).

Slicing follows Python's `[start, stop)` inclusive/exclusive semantics, where an index refers to a code unit (like Python, apparently, though this is badly documented), and negative indices count from the right. Any `step != 0` is supported, as in Python.

The only thing we cannot support easily is reversing a string: in Python one can write `s[::-1]` or `s[-1::-1]`, but we don't support empty values in the Options machinery (we model this as a C `int64`). To mimic it, one can call `pc.utf8_slice_codeunits(ar, start=-1, stop=-sys.maxsize, step=-1)` (i.e. pass a very large negative value).

For instance, libraries such as Pandas and Vaex can do something like that; this was confirmed to work by modifying the unit test like this:
```python
import sys

import pytest
import pyarrow as pa
import pyarrow.compute as pc


@pytest.mark.parametrize('start', list(range(-6, 6)) + [None])
@pytest.mark.parametrize('stop', list(range(-6, 6)) + [None])
@pytest.mark.parametrize('step', [-3, -2, -1, 1, 2, 3])
def test_slice_compatibility(start, stop, step):
    input = pa.array(["", "𝑓", "𝑓ö", "𝑓öõ", "𝑓öõḍ", "𝑓öõḍš"])
    expected = pa.array([k.as_py()[start:stop:step] for k in input])
    # None is mapped to +/- sys.maxsize so the kernel sees concrete bounds.
    if start is None:
        start = -sys.maxsize if step > 0 else sys.maxsize
    if stop is None:
        stop = sys.maxsize if step > 0 else -sys.maxsize
    result = pc.utf8_slice_codeunits(input, start=start, stop=stop, step=step)
    assert expected.equals(result)
```

So libraries using this can implement the full Python behavior with this workaround.

Closes #9000 from maartenbreddels/ARROW-10557

Lead-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021