[SPARK-47418][SQL] Add hand-crafted implementations for lowercase unicode-aware contains, startsWith and endsWith and optimize UTF8_BINARY_LCASE #46181

vladimirg-db · 2024-04-23T10:00:56Z

What changes were proposed in this pull request?

Added hand-crafted, Unicode-aware lowercase implementations of contains, startsWith, and endsWith to optimize UTF8_BINARY_LCASE for ASCII-only strings.

Why are the changes needed?

UTF8String.toLowerCase(), which is used by the aforementioned collation-aware functions, has an optimization for fully ASCII strings, but it still always allocates a new object. This PR introduces loop-based implementations that fall back to toLowerCase() when they encounter a non-ASCII character.
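The loop-with-fallback idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: it works on raw UTF-8 byte arrays instead of UTF8String, the class and method names (LowerCaseStartsWithSketch, slowStartsWith, asciiLower, utf8) are hypothetical, and the slow path uses java.lang.String with Locale.ROOT as a stand-in for UTF8String.toLowerCase().

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for the idea described above; not Spark's UTF8String code.
final class LowerCaseStartsWithSketch {

  // Slow path: decode, lowercase (allocating new strings), then compare.
  static boolean slowStartsWith(byte[] lhs, byte[] rhs) {
    String l = new String(lhs, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    String r = new String(rhs, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    return l.startsWith(r);
  }

  // Lowercase a single ASCII byte without allocating.
  static byte asciiLower(byte b) {
    return (b >= 'A' && b <= 'Z') ? (byte) (b + 32) : b;
  }

  // Fast path: compare byte by byte while both inputs stay ASCII.
  static boolean startsWithInLowerCase(byte[] lhs, byte[] rhs) {
    for (int i = 0; i < rhs.length; i++) {
      boolean lhsExhausted = i >= lhs.length;
      // A negative byte belongs to a non-ASCII character: use the slow path.
      if (rhs[i] < 0 || (!lhsExhausted && lhs[i] < 0)) {
        return slowStartsWith(lhs, rhs);
      }
      // lhs ended (all ASCII, all matched) while rhs still has bytes left.
      if (lhsExhausted) return false;
      if (asciiLower(lhs[i]) != asciiLower(rhs[i])) return false;
    }
    return true;
  }

  // Helper for the examples: encode a Java string as UTF-8 bytes.
  static byte[] utf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // ASCII-only inputs never leave the allocation-free loop.
    System.out.println(startsWithInLowerCase(utf8("Hello World"), utf8("HELLO")));
    // The non-ASCII character in the left-hand side triggers the slow path.
    System.out.println(startsWithInLowerCase(utf8("Straße"), utf8("STRASSE")));
  }
}
```

The sketch does not reject on length before scanning, because lowercasing a non-ASCII character can change the byte length of a string; only the slow path can decide those cases.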

Does this PR introduce any user-facing change?

No, these functions should behave exactly as follows:

lhs.containsInLowerCase(rhs) == lhs.toLowerCase().contains(rhs.toLowerCase())
lhs.startsWithInLowerCase(rhs) == lhs.toLowerCase().startsWith(rhs.toLowerCase())
lhs.endsWithInLowerCase(rhs) == lhs.toLowerCase().endsWith(rhs.toLowerCase())
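The right-hand sides above can serve as a reference (oracle) when testing an optimized implementation. A minimal sketch, assuming plain java.lang.String and Locale.ROOT in place of UTF8String; the class name LowerCaseReference is hypothetical and the method names merely mirror the ones listed above:

```java
import java.util.Locale;

// Hypothetical reference implementations built on java.lang.String.
// They allocate lowercased copies on every call, which is the cost
// the hand-crafted versions aim to avoid for ASCII-only input.
final class LowerCaseReference {

  static boolean containsInLowerCase(String lhs, String rhs) {
    return lhs.toLowerCase(Locale.ROOT).contains(rhs.toLowerCase(Locale.ROOT));
  }

  static boolean startsWithInLowerCase(String lhs, String rhs) {
    return lhs.toLowerCase(Locale.ROOT).startsWith(rhs.toLowerCase(Locale.ROOT));
  }

  static boolean endsWithInLowerCase(String lhs, String rhs) {
    return lhs.toLowerCase(Locale.ROOT).endsWith(rhs.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(containsInLowerCase("Hello World", "LO W"));
    System.out.println(startsWithInLowerCase("Hello World", "hELLO"));
    System.out.println(endsWithInLowerCase("Hello World", "WORLD"));
  }
}
```

An optimized implementation can then be compared against these methods on both ASCII and non-ASCII inputs, including characters whose lowercase form has a different length.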

How was this patch tested?

Added new test cases to org.apache.spark.unsafe.types.CollationSupportSuite and org.apache.spark.unsafe.types.UTF8StringSuite, including several cases specific to Unicode lowercasing. I also ran CollationBenchmark on GHA for JDK 17 and JDK 21 and updated the results.

Was this patch authored or co-authored using generative AI tooling?

No

vladimirg-db · 2024-04-23T10:10:57Z

I am re-running the benchmarks on GHA for the new implementation proposed by @cloud-fan; after that, I will merge the changes.

cloud-fan · 2024-04-24T07:59:03Z

thanks, merging to master!

Closes apache#46181 from vladimirg-db/vladimirg-db/add-hand-crafted-string-function-implementations-for-utf8-binary-lcase-collations.

Authored-by: Vladimir Golubev <vladimir.golubev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

[SPARK-47418][SQL] Add hand-crafted implementations for lowercase unicode-aware contains, startsWith and endsWith and optimize UTF8_BINARY_LCASE

d1a98e6

github-actions bot added the SQL label Apr 23, 2024

vladimirg-db marked this pull request as ready for review April 23, 2024 10:01

vladimirg-db mentioned this pull request Apr 23, 2024

[SPARK-47418][SQL] Add hand-crafted implementations for lowercase unicode-aware contains, startsWith and endsWith and optimize UTF8_BINARY_LCASE #46082

Closed

cloud-fan approved these changes Apr 23, 2024

uros-db approved these changes Apr 23, 2024

Update CollationBenchmark results

698cf99

cloud-fan closed this in 890f78d Apr 24, 2024

vladimirg-db deleted the vladimirg-db/add-hand-crafted-string-function-implementations-for-utf8-binary-lcase-collations branch April 24, 2024 08:00
