[SPARK-47693][SQL] Add optimization for lowercase comparison of UTF8String used in UTF8_BINARY_LCASE collation #45816

Closed

@nikolamand-db nikolamand-db commented Apr 2, 2024

What changes were proposed in this pull request?

Current collation benchmarks indicate that UTF8_BINARY_LCASE collation comparisons are roughly an order of magnitude slower (~7-10x) than plain binary comparisons. This PR improves performance by comparing UTF8String instances in lowercase directly, instead of fully converting both strings to lowercase before a binary comparison.

The optimization follows the approach already used in toLowerCase: we check character by character whether the conversion stays within ASCII, and fall back to a slow comparison of native strings otherwise. In the fallback case, only the suffixes that remain to be compared are taken into consideration.
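
The approach described above can be sketched roughly as follows. This is a minimal illustration on plain byte arrays, not the code from this PR: the class `LowercaseCompareSketch` and its methods are hypothetical names, and the slow path assumes `String.toLowerCase(Locale.ROOT)` to keep the example deterministic.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

final class LowercaseCompareSketch {
  /** Compares two UTF-8 byte arrays by their lowercase forms, ASCII bytes on a fast path. */
  static int compareLowerCase(byte[] left, byte[] right) {
    int n = Math.min(left.length, right.length);
    for (int i = 0; i < n; i++) {
      byte l = left[i];
      byte r = right[i];
      // A negative byte has its high bit set, so it belongs to a multi-byte
      // sequence: leave the fast path and compare only the remaining suffixes.
      if (l < 0 || r < 0) {
        return compareSuffixSlow(left, right, i);
      }
      int diff = Character.toLowerCase(l) - Character.toLowerCase(r);
      if (diff != 0) {
        return diff;
      }
    }
    return left.length - right.length;
  }

  /** Slow path (assumption): decode, lowercase, re-encode, then compare bytes. */
  private static int compareSuffixSlow(byte[] left, byte[] right, int from) {
    return Arrays.compareUnsigned(lowerSuffix(left, from), lowerSuffix(right, from));
  }

  private static byte[] lowerSuffix(byte[] bytes, int from) {
    String s = new String(bytes, from, bytes.length - from, StandardCharsets.UTF_8);
    return s.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8);
  }
}
```

Because every byte before the fallback position is ASCII in both inputs, that position is always a code point boundary, so decoding the suffixes on their own is safe.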

Benchmarks from CollationBenchmark, run locally, show a substantial performance increase:

[info] collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE                                    7199           7209          14          0.0       71988.8       1.0X
[info] UNICODE                                              3925           3929           5          0.0       39250.4       1.8X
[info] UTF8_BINARY                                          3935           3950          21          0.0       39351.2       1.8X
[info] UNICODE_CI                                          45248          51404        8706          0.0      452484.7       0.2X

Why are the changes needed?

To improve performance of comparisons of strings under UTF8_BINARY_LCASE collation.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests to UTF8StringSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Apr 2, 2024

private int compareLowercaseSuffixSlow(UTF8String other, int pref) {
UTF8String suffixLeft = UTF8String.fromAddress(base, offset + pref,
numBytes - pref);
Member

Can we use 2-space indentation? See "Code style guide" at https://spark.apache.org/contributing.html

                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
UNICODE                       4520           4522           2         0.0       45201.8       7.5X
UTF8_BINARY                   4524           4526           2         0.0       45243.0       7.5X
UNICODE_CI                   52706          52711           7         0.0      527056.1       0.6X
UTF8_BINARY_LCASE             8006           8022          24         0.0       80056.6       1.0X
Contributor

Can you add a benchmark for UTF8_BINARY_LCASE with non-ASCII characters?
It should be a separate group in the benchmark list.

Contributor Author

Added new benchmarks for ASCII and non-ASCII data, please check.

                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
UNICODE                     177636         177709         103         0.0     1776363.9       0.1X
UTF8_BINARY                  11954          11956           3         0.0      119536.7       1.8X
UNICODE_CI                  158014         158038          35         0.0     1580135.7       0.1X
UTF8_BINARY_LCASE            24485          24506          30         0.0      244846.2       1.0X
Contributor

Can we do the same trick for hashing as well? E.g., iterate, take each single-byte code point, convert it to lowercase, and pass it to the hasher?

Contributor Author

It may be better to skip hash optimizations for now, since hashing of data blocks relies on internal mixing functions such as

int k1 = mixK1(halfWord);
h1 = mixH1(h1, k1);

whereas here we would have to supply the data as a stream generated on the fly, because we want to lowercase char by char, and the internal hash implementation does not support that yet.
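
To illustrate the distinction, here is a rough sketch of what feeding lowercased bytes to a hasher one at a time could look like. It deliberately uses FNV-1a, a simple byte-at-a-time hash, rather than the block-mixing hasher discussed above, precisely because the latter consumes multi-byte words. The class and method names are hypothetical, and the input is assumed to be ASCII-only.

```java
final class StreamingLowercaseHashSketch {
  // Standard 32-bit FNV-1a parameters.
  private static final int FNV_OFFSET_BASIS = 0x811C9DC5;
  private static final int FNV_PRIME = 0x01000193;

  /** FNV-1a over the bytes, lowercasing each ASCII byte on the fly. */
  static int hashLowerCaseAscii(byte[] bytes) {
    int h = FNV_OFFSET_BASIS;
    for (byte b : bytes) {
      if (b < 0) {
        // Non-ASCII byte: a real implementation would need a slow path here.
        throw new IllegalArgumentException("non-ASCII input is not handled in this sketch");
      }
      h ^= Character.toLowerCase(b);
      h *= FNV_PRIME;
    }
    return h;
  }
}
```

No lowercased copy of the input is allocated, which is the property the suggestion is after; a block-oriented hash would need its own streaming interface to get the same benefit.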

}
int lowerLeft = Character.toLowerCase(left);
int lowerRight = Character.toLowerCase(right);
if (lowerLeft > 127 || lowerRight > 127) {
Contributor

I see that you are not introducing anything new here, and that numBytes != 1 && codePoint < 127 is already used in toUpperCase, but I don't really understand this logic.
Why can't we handle multi-byte code points? Character.toLowerCase accepts an integer code point, which we can decode from the UTF-8 bytes. Why can't we use this for any code point?

Contributor Author

The idea is that as long as we stay within ASCII, we don't need to worry about locales or about the length of a character's UTF-8 encoding; once we step outside this set of characters, we may run into issues with both.

Ignoring locales (as Character.toLowerCase does) would break consistency with UTF8String.toLowerCase, which is locale-dependent. The varying byte length of encoded characters introduces a major performance penalty, because we would then need byte buffers to store the converted data as it comes in. Although that approach should have the same asymptotic complexity, it would be several times slower than simply comparing char by char. Since the current naive implementation is not that bad (~7x slower than UTF8_BINARY), we wouldn't gain much in terms of performance.
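
Both issues can be seen with a single character in plain Java. The following is only an illustrative sketch using `java.lang.String` and `java.lang.Character`, not UTF8String; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class LowercaseLengthSketch {
  /** Per-code-point mapping: locale-independent, always one code point out. */
  static int lowerCodePoint(int codePoint) {
    return Character.toLowerCase(codePoint);
  }

  /** String-level mapping: may depend on the locale and may change the length. */
  static String lowerString(String s, Locale locale) {
    return s.toLowerCase(locale);
  }

  /** Number of bytes in the UTF-8 encoding of s. */
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }
}
```

For U+0130 (capital I with dot above), `lowerCodePoint` yields U+0069, while `lowerString` under `Locale.ROOT` yields "i" followed by the combining dot U+0307, growing from 2 to 3 UTF-8 bytes. Under a Turkish locale the same call yields a plain "i", and an ASCII "I" lowercases to the dotless U+0131.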

for (curr = 0; curr < numBytes && curr < other.numBytes; curr++) {
byte left = getByte(curr);
byte right = other.getByte(curr);
if (numBytesForFirstByte(left) != 1 || numBytesForFirstByte(right) != 1) {


This is expensive -- you don't want to know the number of bytes, you just want to know if it's more than 1. If you look at the UTF-8 spec, you see that the multibyte characters all have the high bit set, and the single-byte characters all have the high bit unset. So you could just test for the high bit. Assuming that toLowerCase will not go from ASCII to non-ASCII, this also gets rid of the next check in line 463, which is in essence also a test for the high bit.

Contributor

You can just check whether the byte is non-negative, as Java does when creating compact strings.
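
As a small illustration of the two suggestions above, the high-bit test and the sign test are the same check on a signed Java byte. The class and method names below are hypothetical and only meant as a sketch.

```java
final class AsciiByteCheckSketch {
  /** True if b is a complete single-byte (ASCII) UTF-8 sequence. */
  static boolean isAscii(byte b) {
    // Java bytes are signed, so "high bit unset" is the same as "non-negative".
    return b >= 0;
  }

  /** Equivalent formulation that tests the high bit explicitly. */
  static boolean isAsciiByMask(byte b) {
    return (b & 0x80) == 0;
  }

  /** Returns the index of the first non-ASCII byte, or -1 if all bytes are ASCII. */
  static int firstNonAscii(byte[] utf8) {
    for (int i = 0; i < utf8.length; i++) {
      if (utf8[i] < 0) {
        return i;
      }
    }
    return -1;
  }
}
```

Either form is a single comparison per byte and needs no lookup of the sequence length.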

Contributor Author

Good point, @bart-samwel and @stefankandic. I'm adding further optimizations to the comparison and case conversions as well, since they contain similar constructs.


byte[] bytes = new byte[numBytes];
bytes[0] = (byte) Character.toTitleCase(getByte(0));
// skip allocation if we need to fallback
Contributor

Rather than saying "skip allocation", explain why we need the fallback here.

@@ -447,28 +442,50 @@ private UTF8String toUpperCaseSlow() {
return fromString(toString().toUpperCase());
}

/**
* Optimized lowercase comparison for UTF8_BINARY_LCASE collation
Contributor

Can you describe in the comment what lowercase comparison means?
Maybe a better name would be compareCaseInsensitive?

Contributor Author

That's not exactly equivalent to case-insensitive comparison. For example, "ﬀ".compareLowerCase("ff") == false (the first operand is the single ligature character U+FB00), but a case-insensitive comparison should return true, because uppercasing converts both strings to "FF". Added a clarification to the comment.
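
Assuming the first operand in the example is the single-character ligature U+FB00 (Latin small ligature ff), the difference can be reproduced with plain Java strings. This sketch uses `java.lang.String` rather than UTF8String, and the class and method names are hypothetical.

```java
import java.util.Locale;

final class LigatureCaseSketch {
  static final String LIGATURE_FF = "\uFB00"; // single code point, Latin small ligature ff
  static final String PLAIN_FF = "ff";        // two ASCII letters

  /** Equality after lowercasing both sides. */
  static boolean equalLower(String a, String b) {
    return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
  }

  /** Equality after uppercasing both sides. */
  static boolean equalUpper(String a, String b) {
    return a.toUpperCase(Locale.ROOT).equals(b.toUpperCase(Locale.ROOT));
  }
}
```

Here `equalLower(LIGATURE_FF, PLAIN_FF)` is false because the ligature is already lowercase and stays a single code point, while `equalUpper` is true because both sides uppercase to "FF".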

int upper = Character.toUpperCase(b);
if (upper > 127) {
// fallback
if (getByte(i) < 0) {
Contributor

Is it the same as numBytesForFirstByte(b) != 1?

Contributor Author

It's equivalent but this way is faster. Please check previous discussion.

@cloud-fan
Contributor

thanks, merging to master!
