
ARROW-9133: [C++] Add utf8_upper and utf8_lower #7449

Closed
wants to merge 32 commits

Conversation

maartenbreddels
Contributor

This is the initial working version, which is very slow though (>100x slower than the ASCII versions). It also requires libutf8proc to be present.

Please let me know if the general code style etc. is OK. I'm using CRTP here; judging from the metaprogramming seen in the rest of the code base, I guess that's fine.

@wesm
Member

wesm commented Jun 16, 2020

Yes, CRTP is certainly fine. We'll need to make utf8proc a proper toolchain library, @pitrou should be able to help you with that.

@xhochy
Member

xhochy commented Jun 16, 2020

We'll need to make utf8proc a proper toolchain library, @pitrou should be able to help you with that.

I can take care of that!

@maartenbreddels
Contributor Author

It's not that slow: it was at 40% of Vaex's performance (single-threaded), so I think there is a bit more to be gained still. But I have added an optimization that tries ASCII conversion first. This gives a 7x speedup compared to Vaex, and about 10x in the benchmarks below.

Before:

Utf8Lower   193873803 ns    193823124 ns            3 bytes_per_second=102.387M/s items_per_second=5.40996M/s
Utf8Upper   197154929 ns    197093083 ns            4 bytes_per_second=100.688M/s items_per_second=5.32021M/s

After:

Utf8Lower    19508443 ns     19493652 ns           36 bytes_per_second=1018.02M/s items_per_second=53.7906M/s
Utf8Upper    19846885 ns     19832066 ns           35 bytes_per_second=1000.65M/s items_per_second=52.8728M/s

There is one loose end: the growth of the string can cause a utf8 array to be promoted to a large_utf8.

Member

@wesm wesm left a comment


The implementation looks pretty streamlined to me. We might want to run perf to see what fraction of time is spent in utf8proc_tolower. If utf8proc ends up being a lot slower than unilib, you might petition the unilib author to change its license.

// codepoint. This is guaranteed by the non-overlap design of the unicode standard. (see
// section 2.5 of Unicode Standard Core Specification v13.0)

uint8_t ascii_tolower(uint8_t utf8_code_unit) {
Member


I think you will want this to be static inline (not sure all compilers will inline it otherwise)
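
For illustration, a minimal sketch of that suggestion; the body shown here is a plausible ASCII lowercase helper, not necessarily the PR's exact implementation:

// Sketch: static gives the helper internal linkage and inline allows a
// definition per translation unit, making it easy for the compiler to inline
// it into the hot per-byte loop.
static inline uint8_t ascii_tolower(uint8_t utf8_code_unit) {
  // 'A'..'Z' map to 'a'..'z'; all other bytes pass through unchanged.
  return (utf8_code_unit >= 'A' && utf8_code_unit <= 'Z')
             ? static_cast<uint8_t>(utf8_code_unit + ('a' - 'A'))
             : utf8_code_unit;
}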

@@ -159,11 +282,23 @@ void MakeUnaryStringBatchKernel(std::string name, ArrayKernelExec exec,
DCHECK_OK(registry->AddFunction(std::move(func)));
}

template <template <typename> typename Transformer>
Member


I think some compilers demand that "class" be used for the last "typename"
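
For reference, a self-contained sketch of the portable spelling (pre-C++17 compilers only accept "class" for a template template parameter; C++17 also allows "typename"):

#include <type_traits>

// Toy single-parameter transformer used as the template template argument.
template <typename T>
struct Identity {
  using type = T;
};

// Portable form: "class" instead of the trailing "typename".
template <template <typename> class Transformer>
struct Apply {
  using type = typename Transformer<int>::type;
};

static_assert(std::is_same<Apply<Identity>::type, int>::value, "sanity check");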

@wesm
Member

wesm commented Jun 16, 2020

I went ahead and asked ufal/unilib#2

@xhochy
Member

xhochy commented Jun 17, 2020

The major difference between unilib and utf8proc in uppercasing a character seems to be that unilib looks up the uppercase value directly, whereas utf8proc first gets a struct with all properties and extracts the uppercase value from it. Pre-computing an uppercase dictionary first could bring utf8proc on par in performance.
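
A rough sketch of that precomputation idea, using utf8proc's public utf8proc_toupper; the table size and BMP-only coverage are illustrative choices, not what this PR ended up doing:

#include <utf8proc.h>

#include <cstdint>
#include <vector>

// Build a codepoint -> uppercase-codepoint table once, so the hot loop does a
// plain array lookup instead of going through utf8proc's property struct.
// Covering the Basic Multilingual Plane costs 0x10000 entries * 4 bytes = 256 kB;
// codepoints above it fall back to utf8proc directly.
static std::vector<uint32_t> BuildUppercaseTable() {
  std::vector<uint32_t> table(0x10000);
  for (uint32_t cp = 0; cp < table.size(); ++cp) {
    table[cp] = static_cast<uint32_t>(
        utf8proc_toupper(static_cast<utf8proc_int32_t>(cp)));
  }
  return table;
}

static uint32_t ToUpperCodepoint(uint32_t cp) {
  // Built lazily: a one-off cost the first time an upper-casing kernel runs.
  static const std::vector<uint32_t> table = BuildUppercaseTable();
  if (cp < table.size()) return table[cp];
  return static_cast<uint32_t>(utf8proc_toupper(static_cast<utf8proc_int32_t>(cp)));
}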

@maartenbreddels
Contributor Author

I used valgrind/callgrind to see where the time was spent:
[callgrind profile screenshot]

I wanted to compare that to unilib, but all its calls get inlined directly, so this is not visible in the profile.

Using unilib, it's almost 3x faster now compared to utf8proc (disabling the fast ASCII path, so it should be compared to the items_per_second=5M/s above):

Utf8Lower    74023038 ns     74000707 ns            9 bytes_per_second=268.173M/s items_per_second=14.1698M/s
Utf8Upper    76741459 ns     76715981 ns            9 bytes_per_second=258.681M/s items_per_second=13.6683M/s

This is about 2x faster compared to Vaex (again, ignoring the fast ASCII path).

The fact that utf8proc is not inline-able (4 calls per codepoint) already explains part of the overhead. As an experiment, I made sure the calls to unilib's encode/append are not inlined, and that brings the performance back to:

Utf8Lower   131853749 ns    131822537 ns            5 bytes_per_second=150.543M/s items_per_second=7.95445M/s
Utf8Upper   134526167 ns    134487477 ns            5 bytes_per_second=147.56M/s items_per_second=7.79683M/s

This confirms that call overhead plays a role.

Also, utf8proc contains information we don't care about (such as text direction), which probably explains why utf8proc is bigger (300 kB vs 120 kB compiled).

@pitrou
Member

pitrou commented Jun 17, 2020

I don't know how important it is to get good performance on non-ASCII data. Note that the ASCII fast path could well be applied to subsets of the array.
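
For illustration, a minimal sketch of such a per-chunk check (the helper names here are made up for this example):

#include <cstddef>
#include <cstdint>

// Returns true if no byte in [data, data + length) has the high bit set,
// i.e. the chunk is pure ASCII and the cheap bytewise case-mapping can be used.
static bool ChunkIsAscii(const uint8_t* data, size_t length) {
  uint8_t acc = 0;
  for (size_t i = 0; i < length; ++i) acc |= data[i];
  return (acc & 0x80) == 0;
}

// Dispatch idea: walk the input in fixed-size chunks and pick the ASCII or the
// full UTF8 path per chunk, e.g.
//   if (ChunkIsAscii(ptr, n)) AsciiLower(ptr, n, out); else Utf8Lower(ptr, n, out);
// where AsciiLower/Utf8Lower stand in for the respective kernel inner loops.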

@xhochy
Member

xhochy commented Jun 17, 2020

Also cross-referenced this in JuliaStrings/utf8proc#12 to make the utf8proc maintainers aware of what we're doing, in case they are interested.

@wesm
Member

wesm commented Jun 17, 2020

Since the unilib developer isn't interested in changing the license, I think our effort would be better invested in optimizing utf8proc (if this can be demonstrated to be worthwhile in realistic workloads, not just benchmarks).

@maartenbreddels
Contributor Author

Would a lookup table on the order of 256 kB per case mapping (generated at runtime, not stored in the binary) be acceptable for Arrow?

@pitrou
Member

pitrou commented Jun 17, 2020

Let's step back a bit: why do we care about micro-optimizing this?

@maartenbreddels
Contributor Author

I want to move Vaex from using its own string functions to using Arrow's. If all the functions are at least as fast, I'm more than happy to scrap my own code. I don't want to see a regression from moving to Arrow kernels.

I wouldn't call a factor of 3x micro-optimization though :)

@pitrou
Member

pitrou commented Jun 17, 2020

I think it would be more acceptable to inline the relevant utf8proc functions.

@xhochy
Member

xhochy commented Jun 17, 2020

Would a lookup table on the order of 256 kB per case mapping (generated at runtime, not stored in the binary) be acceptable for Arrow?

I would find that acceptable if the mapping is only generated when needed (so there is a one-off cost the first time a UTF8 kernel is used). I would prefer, though, that utf8proc implement it just like this on their side. Can you open an issue there?

@wesm
Member

wesm commented Jun 17, 2020

I also agree with inlining the utf8proc functions until utf8proc can be patched to have better performance. I doubt that these optimizations will meaningfully impact the macro-performance of applications.

@kszucs
Member

kszucs commented Jun 17, 2020

@ursabot build

@kszucs
Member

kszucs commented Jun 17, 2020

Added the libutf8proc dependency to the ursabot builders; the same could be done for the docker-compose images. The tests are failing though.

@maartenbreddels
Contributor Author

I've added my own utf8 encode/decode for now. With lookup tables I now get:

Utf8Lower_median    18414820 ns     18408392 ns            3 bytes_per_second=1078.04M/s items_per_second=56.9618M/s
Utf8Upper_median    17004210 ns     17003407 ns            3 bytes_per_second=1.13976G/s items_per_second=61.6686M/s

which is faster than the 'ascii' version implemented previously (that got items_per_second=53M/s).

Benchmark results vary a lot, between items_per_second=55-66M/s.

Using utf8proc's encode/decode (inlined), this goes down to 18M/s. I have to look into why that is, since they do a bit more sanity checking. Ideally, some of this goes upstream.
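
For context, a minimal branchy decoder of the kind being compared here; a sketch that assumes already-validated UTF8 input, not the code in this PR:

#include <cstdint>

// Decodes one codepoint starting at p and advances p past the sequence.
// Assumes p points at the lead byte of a valid UTF8 sequence.
static inline uint32_t Utf8DecodeOne(const uint8_t*& p) {
  const uint8_t b = *p++;
  if (b < 0x80) return b;                      // 1 byte (ASCII)
  if (b < 0xE0) {                              // 2 bytes
    return ((b & 0x1Fu) << 6) | (*p++ & 0x3Fu);
  }
  if (b < 0xF0) {                              // 3 bytes
    uint32_t cp = (b & 0x0Fu) << 12;
    cp |= (*p++ & 0x3Fu) << 6;
    return cp | (*p++ & 0x3Fu);
  }
  uint32_t cp = (b & 0x07u) << 18;             // 4 bytes
  cp |= (*p++ & 0x3Fu) << 12;
  cp |= (*p++ & 0x3Fu) << 6;
  return cp | (*p++ & 0x3Fu);
}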

@wesm
Member

wesm commented Jun 18, 2020

I just merged my changes for the ASCII kernels, making those work on sliced arrays.

@maartenbreddels
Contributor Author

Note that the unit tests should fail (the PR isn't done), but the tests seem to run 👍

@wesm
Member

wesm commented Jun 22, 2020

There is one loose end: the growth of the string can cause a utf8 array to be promoted to a large_utf8.

I'd like to treat in-kernel type promotions as an anti-pattern in general. If there is the possibility of overflowing the capacity of a StringArray, then it would be better to do the type promotion (if that is really what is desired) prior to choosing and invoking a kernel (so you would promote to LARGE_STRING and then use the large_utf8 kernel variant).

A better and more efficient strategy would be to break the array into pieces with Slice (based on some size heuristic, e.g. 1MB-8MB of data per slice at most) and process the smaller chunks separately. This also means that you can execute the kernel in parallel. This is the decision that will be made by the expression execution layer once that is developed (I plan to work on it after the 1.0.0 release) because it permits both parallel execution and operator pipelining.
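
A sketch of that slicing strategy against today's Arrow C++ compute API (arrow::compute::CallFunction plus Array::Slice); slicing by row count and the chunk size parameter are simplifications of the byte-size heuristic described above:

#include <algorithm>
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Run utf8_lower slice-by-slice so no single output buffer grows too large,
// and so the slices could later be dispatched in parallel.
arrow::Result<std::shared_ptr<arrow::ChunkedArray>> LowerInSlices(
    const std::shared_ptr<arrow::Array>& input, int64_t rows_per_slice) {
  arrow::ArrayVector chunks;
  for (int64_t start = 0; start < input->length(); start += rows_per_slice) {
    const int64_t n = std::min(rows_per_slice, input->length() - start);
    ARROW_ASSIGN_OR_RAISE(
        arrow::Datum out,
        arrow::compute::CallFunction("utf8_lower", {input->Slice(start, n)}));
    chunks.push_back(out.make_array());
  }
  return std::make_shared<arrow::ChunkedArray>(std::move(chunks));
}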

@maartenbreddels
Contributor Author

I'd like to treat in-kernel type promotions as an anti-pattern in general.

There are upsides and downsides to it. The downside is that users of the Arrow library are exposed to the implementation details of how each kernel can grow the resulting array. I see this being quite a burden in Vaex: keeping track of which kernel does what, and when to promote.

Vaex does something similar, slicing the arrays into smaller chunks, but it would still need to check the sizes, no matter how small the slices are.

Maybe it's best to keep this PR simple first (so raise an error?), and discuss the behavior of string growth on the mailing list?

@wesm
Member

wesm commented Jun 22, 2020

The downside is that users of the Arrow library are exposed to the implementation details of how each kernel can grow the resulting array.

I'm not saying that. I'm proposing instead a layered implementation approach. You will still write "utf8_lower(x)" in Python but the execution layer will decide when it's appropriate to split inputs or do type promotion. So Vaex shouldn't have to deal with these details.

@pitrou
Member

pitrou commented Jun 29, 2020

The main point remaining is whether we raise an error on invalid UTF8 input. I see no reason not to (an Arrow string array has to be valid UTF8 per the spec, just like a Python unicode string cannot contain characters outside the Unicode code range).
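
For reference, a sketch of an up-front validity check, assuming arrow::util::ValidateUTF8 and InitializeUTF8 from arrow/util/utf8.h; the merged kernel may instead validate while decoding, which is not shown here:

#include <cstdint>

#include <arrow/status.h>
#include <arrow/util/utf8.h>

// Reject invalid input with an error Status instead of silently producing
// garbage output.
arrow::Status CheckInputIsUtf8(const uint8_t* data, int64_t nbytes) {
  arrow::util::InitializeUTF8();  // one-time setup of the validation tables
  if (!arrow::util::ValidateUTF8(data, nbytes)) {
    return arrow::Status::Invalid("Invalid UTF8 sequence in string");
  }
  return arrow::Status::OK();
}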

-  KERNEL_RETURN_IF_ERROR(
-      ctx,
-      output->buffers[2]->CopySlice(0, output_ncodeunits).Value(&output->buffers[2]));
+  KERNEL_RETURN_IF_ERROR(
+      ctx, values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true));
Contributor Author


Nice way to make code more readable.

Member


:-)

@maartenbreddels
Contributor Author

@pitrou your size commit made the benchmark go from 52->60 M/s 👍

Yes, too. The main point of this state-machine-based decoder is that it's branchless, so it will perform roughly as well on non-ASCII data with unpredictable branching. On pure ASCII data, a branch-based decoder may be faster since the branches will always be predicted right.

Yes, it would be interesting to see how the two methods deal with a 25/25/25/25% mix of 1-, 2-, 3-, or 4-byte encoded codepoints, vs say a few % non-ASCII.
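
For what it's worth, a sketch of how such a mixed-width benchmark input could be generated (the codepoint choices are arbitrary representatives of each byte-length class):

#include <cstddef>
#include <cstdint>
#include <random>
#include <string>

// Appends the UTF8 encoding of cp to out (cp must be a valid scalar value).
static void AppendUtf8(uint32_t cp, std::string* out) {
  if (cp < 0x80) {
    out->push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out->push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out->push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out->push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// 25/25/25/25% mix of 1-, 2-, 3- and 4-byte codepoints, to make the decoder's
// length branch unpredictable.
static std::string MakeMixedUtf8(size_t n_codepoints, unsigned seed = 42) {
  const uint32_t samples[4] = {0x61 /* 'a' */, 0xE9 /* e-acute */,
                               0x4E2D /* CJK */, 0x1F600 /* emoji */};
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int> pick(0, 3);
  std::string out;
  for (size_t i = 0; i < n_codepoints; ++i) AppendUtf8(samples[pick(rng)], &out);
  return out;
}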

@pitrou
Member

pitrou commented Jun 29, 2020

I pushed a commit that raises an error on invalid UTF8. It does not seem to make the benchmarks slower.

@maartenbreddels
Contributor Author

I just concluded the same :)

@pitrou
Member

pitrou commented Jun 29, 2020

@xhochy Could you help on the utf8proc issue on RTools 3.5?
See here: https://github.com/apache/arrow/pull/7449/checks?check_run_id=819772618#step:10:169

It seems that UTF8PROC_STATIC would need to be defined when building Arrow. But it's not set by Findutf8proc.cmake.
Also, libutf8proc.pc.in added in r-windows/rtools-packages#124 doesn't set it either.

@nealrichardson
Contributor

@xhochy Could you help on the utf8proc issue on RTools 3.5?
See here: https://github.com/apache/arrow/pull/7449/checks?check_run_id=819772618#step:10:169

This means there also needs to be a PKGBUILD submitted to r-windows/rtools-backports for the old toolchain.

It seems that UTF8PROC_STATIC would need to be defined when building Arrow. But it's not set by Findutf8proc.cmake.
Also, libutf8proc.pc.in added in r-windows/rtools-packages#124 doesn't set it either.

Just a reminder that nothing in the R bindings touches these new functions, so turning off utf8proc in the C++ build is also an option for now.

@pitrou
Member

pitrou commented Jun 29, 2020

This means there also needs to be a PKGBUILD

Why? libutf8proc is installed.

@nealrichardson
Contributor

This means there also needs to be a PKGBUILD

Why? libutf8proc is installed.

The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9. Most of our deps have to be compiled for both, and this is apparently one of those. That's what https://github.com/r-windows/rtools-backports is for.

@pitrou
Member

pitrou commented Jun 29, 2020

The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9

What difference does it make? This is plain C.

@wesm
Member

wesm commented Jun 29, 2020

Indeed, toolchain incompatibilities only affect C++ code

@nealrichardson
Contributor

The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9

What difference does it make? This is plain C.

🤷 Then I'll leave it to you to sort out, as this is beyond my knowledge. In the past, an undefined symbols error + only compiled for rtools-packages (gcc8) = you need to get it built with rtools-backports too. Maybe something's off with the lib that was built; IDK if anyone has verified that it works.

@pitrou
Member

pitrou commented Jun 30, 2020

Phew. It worked. RTools 4.0 is still broken, but there doesn't seem to be anything we can do, except perhaps disable that job. I'm gonna merge and leave the R cleanup to someone else.

@pitrou pitrou closed this in 9b162ee Jun 30, 2020
@wesm
Member

wesm commented Jun 30, 2020

thanks @maartenbreddels!

@maartenbreddels
Contributor Author

You're welcome. Thanks all for your help. I'm impressed by the project, the setup (CI/CMake), and the people, and happy with the results:

[benchmark results image]

@maartenbreddels maartenbreddels deleted the ARROW-9133 branch June 30, 2020 13:42
sgnkc pushed a commit to sgnkc/arrow that referenced this pull request Jul 2, 2020

Closes apache#7449 from maartenbreddels/ARROW-9133

Lead-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Uwe L. Korn <uwe.korn@quantco.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>