
ARROW-9133: [C++] Add utf8_upper and utf8_lower #7449

Closed
wants to merge 32 commits

Conversation

maartenbreddels
Contributor

This is the initial working version, which is very slow though (>100x slower than the ASCII versions). It also requires libutf8proc to be present.

Please let me know if the general code style etc. is OK. I'm using CRTP here; judging from the metaprogramming seen in the rest of the code base, I guess that's fine.

@wesm
Member

wesm commented Jun 16, 2020

Yes, CRTP is certainly fine. We'll need to make utf8proc a proper toolchain library, @pitrou should be able to help you with that.

@xhochy
Member

xhochy commented Jun 16, 2020

We'll need to make utf8proc a proper toolchain library, @pitrou should be able to help you with that.

I can take care of that!

@maartenbreddels
Contributor Author

It's not that slow: it was at 40% of Vaex's performance (single-threaded), so I think there is a bit more to be gained still. But I have added an optimization that tries ASCII conversion first. This gives a 7x speedup compared to Vaex, and about 10x in the benchmarks below.

Before:

Utf8Lower   193873803 ns    193823124 ns            3 bytes_per_second=102.387M/s items_per_second=5.40996M/s
Utf8Upper   197154929 ns    197093083 ns            4 bytes_per_second=100.688M/s items_per_second=5.32021M/s

After:

Utf8Lower    19508443 ns     19493652 ns           36 bytes_per_second=1018.02M/s items_per_second=53.7906M/s
Utf8Upper    19846885 ns     19832066 ns           35 bytes_per_second=1000.65M/s items_per_second=52.8728M/s

There is one loose end: the growth of the string can cause a utf8 array to be promoted to a large_utf8.

Member

@wesm wesm left a comment


The implementation looks pretty streamlined to me. We might want to run perf to see what fraction of time is spent in utf8proc_tolower. If utf8proc ends up being a lot slower than unilib, you might petition the unilib author to change its license.

// codepoint. This is guaranteed by the non-overlap design of the unicode standard. (see
// section 2.5 of Unicode Standard Core Specification v13.0)

uint8_t ascii_tolower(uint8_t utf8_code_unit) {
Member


I think you will want this to be static inline (not sure all compilers will inline it otherwise)
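
For illustration, a minimal sketch of that suggestion; the body shown here is a plausible ASCII lowercase helper, not necessarily the PR's exact implementation:

// Sketch: static gives the helper internal linkage and inline allows a
// definition per translation unit, making it easy for the compiler to inline
// it into the hot per-byte loop.
static inline uint8_t ascii_tolower(uint8_t utf8_code_unit) {
  // 'A'..'Z' map to 'a'..'z'; all other bytes pass through unchanged.
  return (utf8_code_unit >= 'A' && utf8_code_unit <= 'Z')
             ? static_cast<uint8_t>(utf8_code_unit + ('a' - 'A'))
             : utf8_code_unit;
}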

@@ -159,11 +282,23 @@ void MakeUnaryStringBatchKernel(std::string name, ArrayKernelExec exec,
DCHECK_OK(registry->AddFunction(std::move(func)));
}

template <template <typename> typename Transformer>
Member


I think some compilers demand that "class" be used for the last "typename"
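
For reference, a self-contained sketch of the portable spelling (pre-C++17 compilers only accept "class" for a template template parameter; C++17 also allows "typename"):

#include <type_traits>

// Toy single-parameter transformer used as the template template argument.
template <typename T>
struct Identity {
  using type = T;
};

// Portable form: "class" instead of the trailing "typename".
template <template <typename> class Transformer>
struct Apply {
  using type = typename Transformer<int>::type;
};

static_assert(std::is_same<Apply<Identity>::type, int>::value, "sanity check");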

@wesm
Member

wesm commented Jun 16, 2020

I went ahead and asked ufal/unilib#2

@xhochy
Member

xhochy commented Jun 17, 2020

The major difference between unilib and utf8proc in uppercasing a character seems to be that unilib looks up the uppercase value directly, whereas utf8proc first gets a struct with all properties and extracts the uppercase value from it. Pre-computing an uppercase dictionary first could bring utf8proc on par in performance.
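
A rough sketch of that precomputation idea, using utf8proc's public utf8proc_toupper; the table size and BMP-only coverage are illustrative choices, not what this PR ended up doing:

#include <utf8proc.h>

#include <cstdint>
#include <vector>

// Build a codepoint -> uppercase-codepoint table once, so the hot loop does a
// plain array lookup instead of going through utf8proc's property struct.
// Covering the Basic Multilingual Plane costs 0x10000 entries * 4 bytes = 256 kB;
// codepoints above it fall back to utf8proc directly.
static std::vector<uint32_t> BuildUppercaseTable() {
  std::vector<uint32_t> table(0x10000);
  for (uint32_t cp = 0; cp < table.size(); ++cp) {
    table[cp] = static_cast<uint32_t>(
        utf8proc_toupper(static_cast<utf8proc_int32_t>(cp)));
  }
  return table;
}

static uint32_t ToUpperCodepoint(uint32_t cp) {
  // Built lazily: a one-off cost the first time an upper-casing kernel runs.
  static const std::vector<uint32_t> table = BuildUppercaseTable();
  if (cp < table.size()) return table[cp];
  return static_cast<uint32_t>(utf8proc_toupper(static_cast<utf8proc_int32_t>(cp)));
}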

@maartenbreddels
Contributor Author

I used valgrind/callgrind to see where the time was spent:
[callgrind profile screenshot]

I wanted to compare that to unilib, but all its calls get inlined directly, so this is not visible in the profile.

Using unilib, it's almost 3x faster now compared to utf8proc (disabling the fast ASCII path, so it should be compared to the items_per_second=5M/s above):

Utf8Lower    74023038 ns     74000707 ns            9 bytes_per_second=268.173M/s items_per_second=14.1698M/s
Utf8Upper    76741459 ns     76715981 ns            9 bytes_per_second=258.681M/s items_per_second=13.6683M/s

This is about 2x faster compared to Vaex (again, ignoring the fast ASCII path).

The fact that utf8proc is not inline-able (4 calls per codepoint) already explains part of the overhead. As an experiment, I made sure the calls to unilib's encode/append are not inlined, and that brings the performance back to:

Utf8Lower   131853749 ns    131822537 ns            5 bytes_per_second=150.543M/s items_per_second=7.95445M/s
Utf8Upper   134526167 ns    134487477 ns            5 bytes_per_second=147.56M/s items_per_second=7.79683M/s

This confirms that call overhead plays a role.

Also, utf8proc contains information we don't care about (such as text direction), which probably explains why utf8proc is bigger (300 kB vs 120 kB compiled).

@pitrou
Member

pitrou commented Jun 17, 2020

I don't know how important it is to get good performance on non-ASCII data. Note that the ASCII fast path could well be applied to subsets of the array.
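
For illustration, a minimal sketch of such a per-chunk check (the helper names here are made up for this example):

#include <cstddef>
#include <cstdint>

// Returns true if no byte in [data, data + length) has the high bit set,
// i.e. the chunk is pure ASCII and the cheap bytewise case-mapping can be used.
static bool ChunkIsAscii(const uint8_t* data, size_t length) {
  uint8_t acc = 0;
  for (size_t i = 0; i < length; ++i) acc |= data[i];
  return (acc & 0x80) == 0;
}

// Dispatch idea: walk the input in fixed-size chunks and pick the ASCII or the
// full UTF8 path per chunk, e.g.
//   if (ChunkIsAscii(ptr, n)) AsciiLower(ptr, n, out); else Utf8Lower(ptr, n, out);
// where AsciiLower/Utf8Lower stand in for the respective kernel inner loops.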

@xhochy
Member

xhochy commented Jun 17, 2020

Also cross-referenced this in JuliaStrings/utf8proc#12 to make the utf8proc maintainers aware of what we're doing, in case they are interested.

@wesm
Member

wesm commented Jun 17, 2020

Since the unilib developer isn't interested in changing the license, I think our effort would be better invested in optimizing utf8proc (if this can be demonstrated to be worthwhile in realistic workloads, not just benchmarks).

@maartenbreddels
Contributor Author

Would a lookup table on the order of 256 kB per case mapping (generated at runtime, not stored in the binary) be acceptable for Arrow?

@pitrou
Member

pitrou commented Jun 17, 2020

Let's step back a bit: why do we care about micro-optimizing this?

@maartenbreddels
Contributor Author

I want to move Vaex from using its own string functions to using Arrow's. If all the functions are at least as fast, I'm more than happy to scrap my own code. I don't want to see a regression from moving to Arrow kernels.

I wouldn't call a factor of 3x micro-optimization though :)

@pitrou
Member

pitrou commented Jun 17, 2020

I think it would be more acceptable to inline the relevant utf8proc functions.

@xhochy
Member

xhochy commented Jun 17, 2020

Would a lookup table on the order of 256 kB per case mapping (generated at runtime, not stored in the binary) be acceptable for Arrow?

I would find that acceptable if the mapping is only generated when needed (so there is a one-off cost the first time a UTF8 kernel is used). I would prefer, though, that utf8proc implement it just like this on their side. Can you open an issue there?

@wesm
Member

wesm commented Jun 17, 2020

I also agree with inlining the utf8proc functions until utf8proc can be patched to have better performance. I doubt that these optimizations will meaningfully impact the macro-performance of applications.

@kszucs
Member

kszucs commented Jun 17, 2020

@ursabot build

@kszucs
Member

kszucs commented Jun 17, 2020

Added the libutf8proc dependency to the ursabot builders; the same could be done for the docker-compose images. The tests are failing though.

@maartenbreddels
Contributor Author

I've added my own utf8 encode/decode for now. With lookup tables I now get:

Utf8Lower_median    18414820 ns     18408392 ns            3 bytes_per_second=1078.04M/s items_per_second=56.9618M/s
Utf8Upper_median    17004210 ns     17003407 ns            3 bytes_per_second=1.13976G/s items_per_second=61.6686M/s

which is faster than the 'ascii' version implemented previously (that got items_per_second=53M/s).

Benchmark results vary a lot, between items_per_second=55-66M/s.

Using utf8proc's encode/decode (inlined), this goes down to 18M/s. I have to look into why that is, since they do a bit more sanity checking. Ideally, some of this goes upstream.
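
For context, a minimal branchy decoder of the kind being compared here; a sketch that assumes already-validated UTF8 input, not the code in this PR:

#include <cstdint>

// Decodes one codepoint starting at p and advances p past the sequence.
// Assumes p points at the lead byte of a valid UTF8 sequence.
static inline uint32_t Utf8DecodeOne(const uint8_t*& p) {
  const uint8_t b = *p++;
  if (b < 0x80) return b;                      // 1 byte (ASCII)
  if (b < 0xE0) {                              // 2 bytes
    return ((b & 0x1Fu) << 6) | (*p++ & 0x3Fu);
  }
  if (b < 0xF0) {                              // 3 bytes
    uint32_t cp = (b & 0x0Fu) << 12;
    cp |= (*p++ & 0x3Fu) << 6;
    return cp | (*p++ & 0x3Fu);
  }
  uint32_t cp = (b & 0x07u) << 18;             // 4 bytes
  cp |= (*p++ & 0x3Fu) << 12;
  cp |= (*p++ & 0x3Fu) << 6;
  return cp | (*p++ & 0x3Fu);
}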

@wesm
Member

wesm commented Jun 18, 2020

I just merged my changes for the ASCII kernels, making those work on sliced arrays.

@maartenbreddels
Contributor Author

Note that the unit tests should fail (the PR isn't done), but the tests seem to run 👍

@wesm
Member

wesm commented Jun 22, 2020

There is one loose end: the growth of the string can cause a utf8 array to be promoted to a large_utf8.

I'd like to treat in-kernel type promotions as an anti-pattern in general. If there is the possibility of overflowing the capacity of a StringArray, then it would be better to do the type promotion (if that is really what is desired) prior to choosing and invoking a kernel (so you would promote to LARGE_STRING and then use the large_utf8 kernel variant).

A better and more efficient strategy would be to break the array into pieces with Slice (based on some size heuristic, e.g. 1MB-8MB of data per slice at most) and process the smaller chunks separately. This also means that you can execute the kernel in parallel. This is the decision that will be made by the expression execution layer once that is developed (I plan to work on it after the 1.0.0 release) because it permits both parallel execution and operator pipelining.
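
A sketch of that slicing strategy against today's Arrow C++ compute API (arrow::compute::CallFunction plus Array::Slice); slicing by row count and the chunk size parameter are simplifications of the byte-size heuristic described above:

#include <algorithm>
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Run utf8_lower slice-by-slice so no single output buffer grows too large,
// and so the slices could later be dispatched in parallel.
arrow::Result<std::shared_ptr<arrow::ChunkedArray>> LowerInSlices(
    const std::shared_ptr<arrow::Array>& input, int64_t rows_per_slice) {
  arrow::ArrayVector chunks;
  for (int64_t start = 0; start < input->length(); start += rows_per_slice) {
    const int64_t n = std::min(rows_per_slice, input->length() - start);
    ARROW_ASSIGN_OR_RAISE(
        arrow::Datum out,
        arrow::compute::CallFunction("utf8_lower", {input->Slice(start, n)}));
    chunks.push_back(out.make_array());
  }
  return std::make_shared<arrow::ChunkedArray>(std::move(chunks));
}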

@maartenbreddels
Contributor Author

I'd like to treat in-kernel type promotions as an anti-pattern in general.

There are upsides and downsides to it. The downside is that users of the Arrow library are exposed to the implementation details of how each kernel can grow the resulting array. I see this being quite a burden in Vaex: keeping track of which kernel does what, and when to promote.

Vaex does something similar, slicing the arrays into smaller chunks, but it would still need to check the sizes, no matter how small the slices are.

Maybe it's best to keep this PR simple first (so raise an error?), and discuss the behavior of string growth on the mailing list?

@wesm
Member

wesm commented Jun 22, 2020

The downside is that users of the Arrow library are exposed to the implementation details of how each kernel can grow the resulting array.

I'm not saying that. I'm proposing instead a layered implementation approach. You will still write "utf8_lower(x)" in Python but the execution layer will decide when it's appropriate to split inputs or do type promotion. So Vaex shouldn't have to deal with these details.

@pitrou
Member

pitrou commented Jun 29, 2020

The main point remaining is whether we raise an error on invalid UTF8 input. I see no reason not to (an Arrow string array has to be valid UTF8 per the spec, just like a Python unicode string cannot contain characters outside the Unicode code range).
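
For reference, a sketch of an up-front validity check, assuming arrow::util::ValidateUTF8 and InitializeUTF8 from arrow/util/utf8.h; the merged kernel may instead validate while decoding, which is not shown here:

#include <cstdint>

#include <arrow/status.h>
#include <arrow/util/utf8.h>

// Reject invalid input with an error Status instead of silently producing
// garbage output.
arrow::Status CheckInputIsUtf8(const uint8_t* data, int64_t nbytes) {
  arrow::util::InitializeUTF8();  // one-time setup of the validation tables
  if (!arrow::util::ValidateUTF8(data, nbytes)) {
    return arrow::Status::Invalid("Invalid UTF8 sequence in string");
  }
  return arrow::Status::OK();
}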

-  KERNEL_RETURN_IF_ERROR(
-      ctx,
-      output->buffers[2]->CopySlice(0, output_ncodeunits).Value(&output->buffers[2]));
+  KERNEL_RETURN_IF_ERROR(
+      ctx, values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true));
Contributor Author


Nice way to make code more readable.

Member


:-)

@maartenbreddels
Contributor Author

@pitrou your size commit made the benchmark go from 52->60 M/s 👍

Yes, too. The main point of this state-machine-based decoder is that it's branchless, so it will perform roughly as well on non-ASCII data with unpredictable branching. On pure ASCII data, a branch-based decoder may be faster since the branches will always be predicted right.

Yes, it would be interesting to see how the two methods deal with a 25/25/25/25% mix of 1-, 2-, 3-, or 4-byte encoded codepoints, vs say a few % non-ASCII.
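
For what it's worth, a sketch of how such a mixed-width benchmark input could be generated (the codepoint choices are arbitrary representatives of each byte-length class):

#include <cstddef>
#include <cstdint>
#include <random>
#include <string>

// Appends the UTF8 encoding of cp to out (cp must be a valid scalar value).
static void AppendUtf8(uint32_t cp, std::string* out) {
  if (cp < 0x80) {
    out->push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out->push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out->push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out->push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// 25/25/25/25% mix of 1-, 2-, 3- and 4-byte codepoints, to make the decoder's
// length branch unpredictable.
static std::string MakeMixedUtf8(size_t n_codepoints, unsigned seed = 42) {
  const uint32_t samples[4] = {0x61 /* 'a' */, 0xE9 /* e-acute */,
                               0x4E2D /* CJK */, 0x1F600 /* emoji */};
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int> pick(0, 3);
  std::string out;
  for (size_t i = 0; i < n_codepoints; ++i) AppendUtf8(samples[pick(rng)], &out);
  return out;
}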

@pitrou
Member

pitrou commented Jun 29, 2020

I pushed a commit that raises an error on invalid UTF8. It does not seem to make the benchmarks slower.

@maartenbreddels
Contributor Author

I just concluded the same :)

@pitrou
Member

pitrou commented Jun 29, 2020

@xhochy Could you help on the utf8proc issue on RTools 3.5?
See here: https://github.com/apache/arrow/pull/7449/checks?check_run_id=819772618#step:10:169

It seems that UTF8PROC_STATIC would need to be defined when building Arrow. But it's not set by Findutf8proc.cmake.
Also, libutf8proc.pc.in added in r-windows/rtools-packages#124 doesn't set it either.

@nealrichardson
Contributor

@xhochy Could you help on the utf8proc issue on RTools 3.5?
See here: https://github.com/apache/arrow/pull/7449/checks?check_run_id=819772618#step:10:169

This means there also needs to be a PKGBUILD submitted to r-windows/rtools-backports for the old toolchain.

It seems that UTF8PROC_STATIC would need to be defined when building Arrow. But it's not set by Findutf8proc.cmake.
Also, libutf8proc.pc.in added in r-windows/rtools-packages#124 doesn't set it either.

Just a reminder that nothing in the R bindings touches these new functions, so turning off utf8proc in the C++ build is also an option for now.

@pitrou
Member

pitrou commented Jun 29, 2020

This means there also needs to be a PKGBUILD

Why? libutf8proc is installed.

@nealrichardson
Contributor

This means there also needs to be a PKGBUILD

Why? libutf8proc is installed.

The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9. Most of our deps have to be compiled for both, and this is apparently one of those. That's what https://github.com/r-windows/rtools-backports is for.

@pitrou
Member

pitrou commented Jun 29, 2020

The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9

What difference does it make? This is plain C.

@wesm
Member

wesm commented Jun 29, 2020

Indeed, toolchain incompatibilities only affect C++ code

@nealrichardson
Contributor

The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9

What difference does it make? This is plain C.

🤷 Then I'll leave it to you to sort out, as this is beyond my knowledge. In the past, an undefined symbols error + only compiled for rtools-packages (gcc8) = you need to get it built with rtools-backports too. Maybe something's off with the lib that was built; IDK if anyone has verified that it works.

@pitrou
Member

pitrou commented Jun 30, 2020

Phew. It worked. RTools 4.0 is still broken, but there doesn't seem to be anything we can do, except perhaps disable that job. I'm gonna merge and leave the R cleanup to someone else.

@pitrou pitrou closed this in 9b162ee Jun 30, 2020
@wesm
Member

wesm commented Jun 30, 2020

thanks @maartenbreddels!

@maartenbreddels
Contributor Author

You're welcome. Thanks all for your help. I'm impressed by the project, the setup (CI/CMake), and the people, and happy with the results:

[benchmark results image]

@maartenbreddels maartenbreddels deleted the ARROW-9133 branch June 30, 2020 13:42
sgnkc pushed a commit to sgnkc/arrow that referenced this pull request Jul 2, 2020

Closes apache#7449 from maartenbreddels/ARROW-9133

Lead-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Uwe L. Korn <uwe.korn@quantco.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>