[SPARK-47693][SQL] Add optimization for lowercase comparison of UTF8String used in UTF8_BINARY_LCASE collation #45816
Conversation
```java
private int compareLowercaseSuffixSlow(UTF8String other, int pref) {
  UTF8String suffixLeft = UTF8String.fromAddress(base, offset + pref,
    numBytes - pref);
```
Can we use 2-space indentation? See "Code style guide" at https://spark.apache.org/contributing.html
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
```
UNICODE                4520   4522    2   0.0    45201.8   7.5X
UTF8_BINARY            4524   4526    2   0.0    45243.0   7.5X
UNICODE_CI            52706  52711    7   0.0   527056.1   0.6X
UTF8_BINARY_LCASE      8006   8022   24   0.0    80056.6   1.0X
```
Can you add a benchmark for UTF8_BINARY_LCASE with non-ASCII chars? This should be a separate group in the benchmark list.
Added new benchmarks for ASCII and non-ASCII data, please check.
```
UNICODE              177636  177709  103   0.0   1776363.9   0.1X
UTF8_BINARY           11954   11956    3   0.0    119536.7   1.8X
UNICODE_CI           158014  158038   35   0.0   1580135.7   0.1X
UTF8_BINARY_LCASE     24485   24506   30   0.0    244846.2   1.0X
```
Can we do the same trick for hash as well? E.g. iterate, take a single-byte code point, convert it to lowercase and pass it to the hasher?
Maybe it's better to skip hash optimizations for now, as hashing of data blocks requires internal mixing functions:
spark/common/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java
Lines 74 to 75 in 383bb4a
```java
int k1 = mixK1(halfWord);
h1 = mixH1(h1, k1);
```
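To make the mixing concern concrete, here is a hypothetical standalone sketch (class and method names are illustrative, not Spark's; the multiplier constants are the public Murmur3 32-bit ones): even if we lowercase byte by byte, the bytes still have to be packed into 4-byte halfwords before each mix step, and a full implementation would additionally need tail and avalanche handling, which is the complexity alluded to above.

```java
// Illustrative sketch only: shows why per-byte lowercasing does not drop
// straight into a Murmur3-style hasher that mixes 4-byte blocks.
public class LcaseHashSketch {
  private static int mixK1(int k1) {
    k1 *= 0xcc9e2d51;                 // Murmur3 c1 constant
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;                 // Murmur3 c2 constant
    return k1;
  }

  private static int mixH1(int h1, int k1) {
    h1 ^= k1;
    h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  // Lowercase ASCII bytes one at a time, but buffer four of them into a
  // halfword before mixing; tail bytes and the final avalanche step are
  // deliberately omitted here.
  public static int hashLowercaseAscii(byte[] bytes, int seed) {
    int h1 = seed;
    for (int i = 0; i + 4 <= bytes.length; i += 4) {
      int halfWord = 0;
      for (int j = 0; j < 4; j++) {
        halfWord |= (Character.toLowerCase((char) bytes[i + j]) & 0xFF) << (8 * j);
      }
      h1 = mixH1(h1, mixK1(halfWord));
    }
    return h1;
  }
}
```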
```java
}
int lowerLeft = Character.toLowerCase(left);
int lowerRight = Character.toLowerCase(right);
if (lowerLeft > 127 || lowerRight > 127) {
```
I see that you are not introducing anything new here and that `numBytes != 1 && codePoint < 127` is already used in `toUpperCase`. But I don't really understand this logic. Why can't we take multibyte code points? I see that `Character.toLowerCase` accepts an integer specifying a code point, which we can decode from the UTF8 binary. What is the reason we can't use this for any code point?
The idea is that if we remain in ASCII space, we don't need to worry about locales or the length of a character's representation in UTF8 encoding; if we step outside this set of characters, we may run into issues with both. Ignoring locales (which `Character.toLowerCase` does) means we would break compliance with `UTF8String.toLowerCase`, which is locale-dependent. The varying byte length of character encodings introduces a major performance penalty, as we would then need byte buffers to store the converted data. Although such a method would have the same asymptotic complexity, it would be several times slower than simply comparing char by char. Since the current naive implementation is not that bad (~7x slower than UTF8_BINARY), we wouldn't gain much in terms of performance.
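The fast path described above can be sketched as follows. This is a hypothetical standalone version, not Spark's actual code: `compareLowercase` and `slowCompare` are illustrative names, and the fallback here simply lowercases the remaining suffixes via `String.toLowerCase`.

```java
import java.nio.charset.StandardCharsets;

// Sketch of the ASCII fast path: compare byte by byte while both sides
// stay in single-byte (ASCII) range, lowercasing each byte; bail out to
// a slow suffix comparison the moment either side is multibyte.
public class LcaseCompareSketch {
  public static int compareLowercase(byte[] a, byte[] b) {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      byte left = a[i];
      byte right = b[i];
      if (left < 0 || right < 0) {
        // high bit set: start of a multibyte UTF-8 sequence,
        // fall back on the suffixes that are left to compare
        return slowCompare(a, b, i);
      }
      int ll = Character.toLowerCase((char) left);
      int lr = Character.toLowerCase((char) right);
      if (ll != lr) return ll - lr;
    }
    return a.length - b.length;
  }

  private static int slowCompare(byte[] a, byte[] b, int from) {
    // full lowercase conversion of the remaining suffixes only
    String sa = new String(a, from, a.length - from, StandardCharsets.UTF_8).toLowerCase();
    String sb = new String(b, from, b.length - from, StandardCharsets.UTF_8).toLowerCase();
    return sa.compareTo(sb);
  }
}
```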
```java
for (curr = 0; curr < numBytes && curr < other.numBytes; curr++) {
  byte left = getByte(curr);
  byte right = other.getByte(curr);
  if (numBytesForFirstByte(left) != 1 || numBytesForFirstByte(right) != 1) {
```
This is expensive -- you don't want to know the number of bytes, you just want to know if it's more than 1. If you look at the UTF-8 spec, you see that the multibyte characters all have the high bit set, and the single-byte characters all have the high bit unset. So you could just test for the high bit. Assuming that `toLowerCase` will not go from ASCII to non-ASCII, this also gets rid of the next check in line 463, which is in essence also a test for the high bit.
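In Java the high-bit test is especially cheap because `byte` is signed, so a set high bit makes the value negative. A minimal sketch (the class and method name are illustrative):

```java
// A set high bit marks the first byte of a multibyte UTF-8 sequence;
// in Java's signed byte type that is exactly the negative values.
public class HighBitCheck {
  public static boolean isMultibyteStart(byte b) {
    return b < 0; // equivalent to (b & 0x80) != 0
  }
}
```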
Good point @bart-samwel and @stefankandic, adding further optimizations to comparison and case conversions as well since they also contain similar constructs.
```java
byte[] bytes = new byte[numBytes];
bytes[0] = (byte) Character.toTitleCase(getByte(0));
// skip allocation if we need to fallback
```
Rather than saying "skip allocation", explain why we need the fallback here.
```java
@@ -447,28 +442,50 @@ private UTF8String toUpperCaseSlow() {
    return fromString(toString().toUpperCase());
  }

  /**
   * Optimized lowercase comparison for UTF8_BINARY_LCASE collation
```
Can you describe in a comment what lowercase comparison means? Maybe a better name is `compareCaseInsensitive`?
That's not exactly equivalent to case-insensitive comparison. For example, `"ﬀ".compareLowerCase("ff") == false`, but a case-insensitive comparison should return true because uppercasing converts both strings to "FF". Added a clarification to the comment.
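The ligature behavior behind this example can be shown directly (a minimal demo; `String.toUpperCase` performs full case mapping, and the Java documentation itself lists U+FB00 → "FF" as an example):

```java
// The ligature "ﬀ" (U+FB00) is already lowercase, so lowercasing leaves
// it unchanged and lowercase comparison with "ff" fails; uppercasing
// expands it to "FF", so an uppercase-based case-insensitive check matches.
public class LigatureExample {
  public static void main(String[] args) {
    String lig = "\uFB00";                 // "ﬀ"
    System.out.println(lig.toLowerCase()); // unchanged: "ﬀ"
    System.out.println(lig.toUpperCase()); // expands to "FF"
  }
}
```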
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
```java
int upper = Character.toUpperCase(b);
if (upper > 127) {
  // fallback
  if (getByte(i) < 0) {
```
is it the same as `numBytesForFirstByte(b) != 1`?
It's equivalent but this way is faster. Please check previous discussion.
thanks, merging to master!
What changes were proposed in this pull request?
Current collation benchmarks indicate that `UTF8_BINARY_LCASE` collation comparisons are an order of magnitude slower (~7-10x) than plain binary comparisons. Improve the performance by optimizing the lowercase comparison function for `UTF8String` instances instead of performing a full lowercase conversion before binary comparison. The optimization is based on a method similar to the one used in `toLowerCase`, where we check character by character whether the conversion is valid under ASCII, and fall back to slow comparison of native strings otherwise. In the latter case, we only take into consideration the suffixes that are left to compare. Benchmarks from `CollationBenchmark` run locally show a substantial performance increase.
Why are the changes needed?
To improve the performance of string comparisons under the UTF8_BINARY_LCASE collation.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit tests to `UTF8StringSuite`.
Was this patch authored or co-authored using generative AI tooling?
No.