[SPARK-47418][SQL] Add hand-crafted implementations for lowercase unicode-aware contains, startsWith and endsWith and optimize UTF8_BINARY_LCASE #46181

@vladimirg-db vladimirg-db commented Apr 23, 2024

What changes were proposed in this pull request?

Added hand-crafted implementations of Unicode-aware lowercase contains, startsWith, and endsWith to optimize UTF8_BINARY_LCASE for ASCII-only strings.

Why are the changes needed?

UTF8String.toLowerCase(), which the aforementioned collation-aware functions use, has an optimization for all-ASCII strings, but it still always allocates a new object. This PR introduces loop-based implementations that fall back to toLowerCase() when they encounter a non-ASCII character.
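
For illustration only, the idea can be sketched as follows. This is not the code from this PR: it works on plain UTF-8 byte arrays rather than on UTF8String, the class and helper names (LowerCaseStartsWithSketch, utf8, slowStartsWith) are hypothetical, and the use of Locale.ROOT in the slow path is an assumption. The loop compares ASCII bytes in place without allocating, and switches to the allocating path as soon as either side yields a non-ASCII byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration class; not part of Spark.
final class LowerCaseStartsWithSketch {

  // Helper for the examples: encodes a Java string as UTF-8 bytes.
  static byte[] utf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Lowercases a single ASCII byte; every other value passes through unchanged.
  private static byte asciiToLower(byte b) {
    return (b >= 'A' && b <= 'Z') ? (byte) (b + ('a' - 'A')) : b;
  }

  // Slow path: allocates lowercased copies of both inputs and compares them.
  // Locale.ROOT is an assumption made for this sketch.
  private static boolean slowStartsWith(byte[] lhs, byte[] rhs) {
    String l = new String(lhs, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    String r = new String(rhs, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    return l.startsWith(r);
  }

  // Fast path: compares byte by byte while both sides are ASCII.
  static boolean startsWithInLowerCase(byte[] lhs, byte[] rhs) {
    for (int i = 0; i < rhs.length; i++) {
      // In UTF-8, a negative Java byte belongs to a non-ASCII code point.
      if (rhs[i] < 0) return slowStartsWith(lhs, rhs);
      // lhs was all ASCII so far and is shorter than rhs: it cannot match.
      if (i >= lhs.length) return false;
      if (lhs[i] < 0) return slowStartsWith(lhs, rhs);
      if (asciiToLower(lhs[i]) != asciiToLower(rhs[i])) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    // ASCII-only inputs never leave the allocation-free loop.
    System.out.println(startsWithInLowerCase(utf8("Hello World"), utf8("HELLO"))); // true
    // A non-ASCII byte (U+00C0) triggers the fallback.
    System.out.println(startsWithInLowerCase(utf8("\u00C0BC"), utf8("\u00E0b"))); // true
  }
}
```

Returning false on the first ASCII mismatch is safe here because, under Locale.ROOT, lowercasing an ASCII character does not depend on the characters that follow it.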

Does this PR introduce any user-facing change?

No, these functions should behave exactly as:

  • lhs.containsInLowerCase(rhs) == lhs.toLowerCase().contains(rhs.toLowerCase())
  • lhs.startsWithInLowerCase(rhs) == lhs.toLowerCase().startsWith(rhs.toLowerCase())
  • lhs.endsWithInLowerCase(rhs) == lhs.toLowerCase().endsWith(rhs.toLowerCase())
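
A minimal sketch of how this contract could be checked, under stated assumptions: the three functions below are simplified stand-ins that operate on java.lang.String rather than UTF8String, they gate on whole-string ASCII checks instead of bailing out lazily, and every name in the class (LowerCaseEquivalenceSketch, matchAt, agreesWithReference) is hypothetical. Locale.ROOT is likewise an assumption of the sketch.

```java
import java.util.Locale;

// Hypothetical illustration class; not part of Spark.
final class LowerCaseEquivalenceSketch {

  private static boolean isAscii(String s) {
    for (int i = 0; i < s.length(); i++) {
      if (s.charAt(i) >= 128) return false;
    }
    return true;
  }

  // Lowercases a single ASCII character; other characters pass through.
  private static char lower(char c) {
    return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
  }

  // True if rhs occurs in lhs at offset pos, ignoring ASCII case.
  // Callers must pass ASCII-only strings.
  private static boolean matchAt(String lhs, String rhs, int pos) {
    if (pos < 0 || pos + rhs.length() > lhs.length()) return false;
    for (int i = 0; i < rhs.length(); i++) {
      if (lower(lhs.charAt(pos + i)) != lower(rhs.charAt(i))) return false;
    }
    return true;
  }

  static boolean containsInLowerCase(String lhs, String rhs) {
    if (!isAscii(lhs) || !isAscii(rhs)) {
      // Fallback: allocate lowercased copies.
      return lhs.toLowerCase(Locale.ROOT).contains(rhs.toLowerCase(Locale.ROOT));
    }
    for (int pos = 0; pos + rhs.length() <= lhs.length(); pos++) {
      if (matchAt(lhs, rhs, pos)) return true;
    }
    return false;
  }

  static boolean startsWithInLowerCase(String lhs, String rhs) {
    if (!isAscii(lhs) || !isAscii(rhs)) {
      return lhs.toLowerCase(Locale.ROOT).startsWith(rhs.toLowerCase(Locale.ROOT));
    }
    return matchAt(lhs, rhs, 0);
  }

  static boolean endsWithInLowerCase(String lhs, String rhs) {
    if (!isAscii(lhs) || !isAscii(rhs)) {
      return lhs.toLowerCase(Locale.ROOT).endsWith(rhs.toLowerCase(Locale.ROOT));
    }
    return matchAt(lhs, rhs, lhs.length() - rhs.length());
  }

  // Checks the three equivalences against the allocate-and-compare reference.
  static boolean agreesWithReference(String lhs, String rhs) {
    String l = lhs.toLowerCase(Locale.ROOT);
    String r = rhs.toLowerCase(Locale.ROOT);
    return containsInLowerCase(lhs, rhs) == l.contains(r)
        && startsWithInLowerCase(lhs, rhs) == l.startsWith(r)
        && endsWithInLowerCase(lhs, rhs) == l.endsWith(r);
  }
}
```

Inputs worth feeding to such a check include mixed-case ASCII pairs, empty strings, patterns longer than the target, and characters whose lowercase form has a different length, such as U+0130.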

How was this patch tested?

Added new test cases to org.apache.spark.unsafe.types.CollationSupportSuite and org.apache.spark.unsafe.types.UTF8StringSuite, including several that are specific to Unicode lowercasing. I also ran CollationBenchmark on GHA for JDK 17 and JDK 21 and updated the benchmark data.

Was this patch authored or co-authored using generative AI tooling?

No

@vladimirg-db

I'm re-running the benchmarks on GHA for the new implementation proposed by @cloud-fan. Once they finish, I will merge the changes.

@cloud-fan

thanks, merging to master!

@cloud-fan cloud-fan closed this in 890f78d Apr 24, 2024
@vladimirg-db vladimirg-db deleted the vladimirg-db/add-hand-crafted-string-function-implementations-for-utf8-binary-lcase-collations branch April 24, 2024 08:00
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
Closes apache#46181 from vladimirg-db/vladimirg-db/add-hand-crafted-string-function-implementations-for-utf8-binary-lcase-collations.

Authored-by: Vladimir Golubev <vladimir.golubev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>