[SPARK-47693][SQL] Add optimization for lowercase comparison of UTF8String used in UTF8_BINARY_LCASE collation #45816
Conversation
```java
private int compareLowercaseSuffixSlow(UTF8String other, int pref) {
  UTF8String suffixLeft = UTF8String.fromAddress(base, offset + pref,
    numBytes - pref);
```
Can we use 2-space indentation? See "Code style guide" at https://spark.apache.org/contributing.html
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
```
UNICODE                4520   4522    2   0.0    45201.8   7.5X
UTF8_BINARY            4524   4526    2   0.0    45243.0   7.5X
UNICODE_CI            52706  52711    7   0.0   527056.1   0.6X
UTF8_BINARY_LCASE      8006   8022   24   0.0    80056.6   1.0X
```
Can you add a benchmark for UTF8_BINARY_LCASE with non-ASCII chars? This should be a separate group in the benchmark list.
Added new benchmarks for ASCII and non-ASCII data, please check.
```
UNICODE              177636  177709  103   0.0   1776363.9   0.1X
UTF8_BINARY           11954   11956    3   0.0    119536.7   1.8X
UNICODE_CI           158014  158038   35   0.0   1580135.7   0.1X
UTF8_BINARY_LCASE     24485   24506   30   0.0    244846.2   1.0X
```
Can we do the same trick for hash as well? E.g. iterate, take a single-byte code point, convert it to lowercase and pass it to the hasher?
Maybe it's better to skip hash optimizations for now, as hashing of data blocks requires internal mixing functions:
spark/common/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java
Lines 74 to 75 in 383bb4a
```java
int k1 = mixK1(halfWord);
h1 = mixH1(h1, k1);
```
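To make the mixing concern concrete, here is a hypothetical standalone sketch (class and method names are illustrative, not Spark's; the multiplier constants are the public Murmur3 32-bit ones): even if we lowercase byte by byte, the bytes still have to be packed into 4-byte halfwords before each mix step, and a full implementation would additionally need tail and avalanche handling, which is the complexity alluded to above.

```java
// Illustrative sketch only: shows why per-byte lowercasing does not drop
// straight into a Murmur3-style hasher that mixes 4-byte blocks.
public class LcaseHashSketch {
  private static int mixK1(int k1) {
    k1 *= 0xcc9e2d51;                 // Murmur3 c1 constant
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;                 // Murmur3 c2 constant
    return k1;
  }

  private static int mixH1(int h1, int k1) {
    h1 ^= k1;
    h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  // Lowercase ASCII bytes one at a time, but buffer four of them into a
  // halfword before mixing; tail bytes and the final avalanche step are
  // deliberately omitted here.
  public static int hashLowercaseAscii(byte[] bytes, int seed) {
    int h1 = seed;
    for (int i = 0; i + 4 <= bytes.length; i += 4) {
      int halfWord = 0;
      for (int j = 0; j < 4; j++) {
        halfWord |= (Character.toLowerCase((char) bytes[i + j]) & 0xFF) << (8 * j);
      }
      h1 = mixH1(h1, mixK1(halfWord));
    }
    return h1;
  }
}
```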
```java
}
int lowerLeft = Character.toLowerCase(left);
int lowerRight = Character.toLowerCase(right);
if (lowerLeft > 127 || lowerRight > 127) {
```
I see that you are not introducing anything new here and that `numBytes != 1 && codePoint < 127` is already used in `toUpperCase`. But I don't really understand this logic. Why can't we take multibyte code points? I see that `Character.toLowerCase` accepts an integer specifying a code point, which we can decode from the UTF8 binary. What is the reason we can't use this for any code point?
The idea is that if we remain in ASCII space, we don't need to worry about locales or the length of a character's representation in UTF8 encoding; if we step outside this set of characters, we may run into issues with both. Ignoring locales (which `Character.toLowerCase` does) means we would break compliance with `UTF8String.toLowerCase`, which is locale-dependent. The varying byte length of character encodings introduces a major performance penalty, as we would then need byte buffers to store the converted data. Although such a method would have the same asymptotic complexity, it would be several times slower than simply comparing char by char. Since the current naive implementation is not that bad (~7x slower than UTF8_BINARY), we wouldn't gain much in terms of performance.
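The fast path described above can be sketched as follows. This is a hypothetical standalone version, not Spark's actual code: `compareLowercase` and `slowCompare` are illustrative names, and the fallback here simply lowercases the remaining suffixes via `String.toLowerCase`.

```java
import java.nio.charset.StandardCharsets;

// Sketch of the ASCII fast path: compare byte by byte while both sides
// stay in single-byte (ASCII) range, lowercasing each byte; bail out to
// a slow suffix comparison the moment either side is multibyte.
public class LcaseCompareSketch {
  public static int compareLowercase(byte[] a, byte[] b) {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      byte left = a[i];
      byte right = b[i];
      if (left < 0 || right < 0) {
        // high bit set: start of a multibyte UTF-8 sequence,
        // fall back on the suffixes that are left to compare
        return slowCompare(a, b, i);
      }
      int ll = Character.toLowerCase((char) left);
      int lr = Character.toLowerCase((char) right);
      if (ll != lr) return ll - lr;
    }
    return a.length - b.length;
  }

  private static int slowCompare(byte[] a, byte[] b, int from) {
    // full lowercase conversion of the remaining suffixes only
    String sa = new String(a, from, a.length - from, StandardCharsets.UTF_8).toLowerCase();
    String sb = new String(b, from, b.length - from, StandardCharsets.UTF_8).toLowerCase();
    return sa.compareTo(sb);
  }
}
```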
```java
for (curr = 0; curr < numBytes && curr < other.numBytes; curr++) {
  byte left = getByte(curr);
  byte right = other.getByte(curr);
  if (numBytesForFirstByte(left) != 1 || numBytesForFirstByte(right) != 1) {
```
This is expensive -- you don't want to know the number of bytes, you just want to know if it's more than 1. If you look at the UTF-8 spec, you see that the multibyte characters all have the high bit set, and the single-byte characters all have the high bit unset. So you could just test for the high bit. Assuming that `toLowerCase` will not go from ASCII to non-ASCII, this also gets rid of the next check in line 463, which is in essence also a test for the high bit.
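In Java the high-bit test is especially cheap because `byte` is signed, so a set high bit makes the value negative. A minimal sketch (the class and method name are illustrative):

```java
// A set high bit marks the first byte of a multibyte UTF-8 sequence;
// in Java's signed byte type that is exactly the negative values.
public class HighBitCheck {
  public static boolean isMultibyteStart(byte b) {
    return b < 0; // equivalent to (b & 0x80) != 0
  }
}
```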
Good point @bart-samwel and @stefankandic, adding further optimizations to comparison and case conversions as well since they also contain similar constructs.
```java
byte[] bytes = new byte[numBytes];
bytes[0] = (byte) Character.toTitleCase(getByte(0));
// skip allocation if we need to fallback
```
Rather than saying "skip allocation", explain why we need the fallback here.
```java
@@ -447,28 +442,50 @@ private UTF8String toUpperCaseSlow() {
    return fromString(toString().toUpperCase());
  }

  /**
   * Optimized lowercase comparison for UTF8_BINARY_LCASE collation
```
Can you describe in a comment what lowercase comparison means? Maybe a better name is `compareCaseInsensitive`?
That's not exactly equivalent to case-insensitive comparison. For example, `"ﬀ".compareLowerCase("ff") == false`, but a case-insensitive comparison should return true because uppercasing converts both strings to "FF". Added a clarification to the comment.
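The ligature behavior behind this example can be shown directly (a minimal demo; `String.toUpperCase` performs full case mapping, and the Java documentation itself lists U+FB00 → "FF" as an example):

```java
// The ligature "ﬀ" (U+FB00) is already lowercase, so lowercasing leaves
// it unchanged and lowercase comparison with "ff" fails; uppercasing
// expands it to "FF", so an uppercase-based case-insensitive check matches.
public class LigatureExample {
  public static void main(String[] args) {
    String lig = "\uFB00";                 // "ﬀ"
    System.out.println(lig.toLowerCase()); // unchanged: "ﬀ"
    System.out.println(lig.toUpperCase()); // expands to "FF"
  }
}
```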
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
```java
int upper = Character.toUpperCase(b);
if (upper > 127) {
  // fallback
  if (getByte(i) < 0) {
```
is it the same as `numBytesForFirstByte(b) != 1`?
It's equivalent but this way is faster. Please check previous discussion.
thanks, merging to master!
What changes were proposed in this pull request?
Current collation benchmarks indicate that `UTF8_BINARY_LCASE` collation comparisons are an order of magnitude slower (~7-10x) than plain binary comparisons. Improve the performance by optimizing the lowercase comparison function for `UTF8String` instances instead of performing a full lowercase conversion before binary comparison. The optimization is based on a method similar to the one used in `toLowerCase`, where we check character by character whether the conversion is valid under ASCII, and fall back to slow comparison of native strings otherwise. In the latter case, we only take into consideration the suffixes that are left to compare. Benchmarks from `CollationBenchmark` run locally show a substantial performance increase.
Why are the changes needed?
To improve the performance of string comparisons under the UTF8_BINARY_LCASE collation.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit tests to `UTF8StringSuite`.
Was this patch authored or co-authored using generative AI tooling?
No.