[SPARK-48440][SQL] Fix StringTranslate behaviour for non-UTF8_BINARY collations #46761

uros-db · 2024-05-27T20:33:33Z

What changes were proposed in this pull request?

String searching in UTF8_LCASE now works at the character level rather than at the byte level. For example, translate("İ", "i") now returns "İ", because no single character in "İ" has a lowercased form equal to "i". Note, however, that lowercasing "İ" yields "i" followed by a combining dot above (U+0307), so the lowercased UTF-8 byte sequence does contain a byte subsequence equal to "i". The new behaviour therefore differs from the old behaviour.
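The difference between the two approaches can be illustrated with a minimal Python sketch. This is not Spark's implementation: the function names and the dictionary-based interface are hypothetical, and `translate_lcase_lowered_first` is only one way a non-character-level approach can produce a partial match, not necessarily what the old code did.

```python
def translate_lcase_charwise(src: str, mapping: dict) -> str:
    """Character-level sketch (hypothetical helper, not Spark's code).

    A source character matches a key only if lowercasing that single
    character yields exactly the lowercased key.
    """
    lowered = {k.lower(): v for k, v in mapping.items()}
    # "İ".lower() is "i" + U+0307 (two code points), so it never equals "i".
    return "".join(lowered.get(ch.lower(), ch) for ch in src)


def translate_lcase_lowered_first(src: str, mapping: dict) -> str:
    """Contrast sketch: lowercase the whole input first, then match.

    With a one-to-many case mapping, part of an expanded character can
    match a key on its own, which is the kind of result the PR avoids.
    """
    lowered = {k.lower(): v for k, v in mapping.items()}
    return "".join(lowered.get(ch, ch) for ch in src.lower())


if __name__ == "__main__":
    mapping = {"i": "x"}
    # Character-level: "İ" is left untouched, plain "I" is translated.
    print(translate_lcase_charwise("\u0130", mapping) == "\u0130")
    print(translate_lcase_charwise("I", mapping) == "x")
    # Lowered-first: the "i" inside the expansion is translated and the
    # combining dot is left behind.
    print(translate_lcase_lowered_first("\u0130", mapping) == "x\u0307")
```

The character-level version compares whole lowercased characters, so a character whose lowercase form expands to several code points can only match a key that is equal to the full expansion.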

Also, translation for ICU collations works by repeatedly translating the longest possible substring that matches a key in the dictionary (under the specified collation), starting from the left side of the input string, until the entire string is translated.
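The left-to-right, longest-match strategy described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in `CollationAwareUTF8String.java`: `collation_key` is a hypothetical stand-in that uses case folding plus NFC normalization in place of a real ICU collator, and `translate_longest_match` is an illustrative name.

```python
import unicodedata


def collation_key(s: str) -> str:
    # Assumption for this sketch: case folding + NFC normalization stands in
    # for an ICU collation key, so "İ" and "i" + U+0307 compare as equal.
    return unicodedata.normalize("NFC", s.casefold())


def translate_longest_match(src: str, mapping: dict) -> str:
    # Index the dictionary by collation key, keeping the first entry per key.
    by_key = {}
    for k, v in mapping.items():
        by_key.setdefault(collation_key(k), v)

    out = []
    i = 0
    while i < len(src):
        # Try the longest candidate substring starting at position i first.
        for j in range(len(src), i, -1):
            key = collation_key(src[i:j])
            if key in by_key:
                out.append(by_key[key])
                i = j  # Skip past the whole matched substring.
                break
        else:
            # No substring starting at i matches any key: copy one character.
            out.append(src[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # "i" + U+0307 is matched as a whole against the key "İ", so the longer
    # match wins over the shorter key "i".
    print(translate_longest_match("ai\u0307b", {"i": "y", "\u0130": "x"}))
```

Trying the longest substring first means a multi-code-point sequence that is equal to a key under the collation is translated as a unit, rather than having its first code point matched separately. The sketch is quadratic in the input length, which is acceptable for illustration only.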

Why are the changes needed?

Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_LCASE, previously named UTF8_BINARY_LCASE (see the example above).

Does this PR introduce any user-facing change?

Yes, the behaviour of the translate expression changes for edge cases involving one-to-many case mapping.

How was this patch tested?

New unit tests in CollationStringExpressionsSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

### What changes were proposed in this pull request?
Renaming `UTF8_BINARY_LCASE` collation to `UTF8_LCASE`.

### Why are the changes needed?
As part of the collation effort in Spark, we've moved away from byte-by-byte logic towards character-by-character logic, so what we used to call `UTF8_BINARY_LCASE` is now more precisely `UTF8_LCASE`. For example, string searching in UTF8_LCASE now works on character-level (rather than on byte-level), which is reflected in these PRs: #46511, #46589, #46682, #46761, #46762. In addition, string comparison also works on character-level now, as per the changes introduced in this PR: #46700.

### Does this PR introduce _any_ user-facing change?
Yes, the collation previously named `UTF8_BINARY_LCASE` will from now on be named `UTF8_LCASE`.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46924 from uros-db/rename-lcase.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

mkaravel

One minor nit.
PR looks good to me. Thank you for working on this!

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java

cloud-fan · 2024-07-12T14:29:51Z

The CI failure is unrelated, thanks, merging to master!

…collations

### What changes were proposed in this pull request?
String searching in UTF8_LCASE now works on character-level, rather than on byte-level. For example: `translate("İ", "i")` now returns `"İ"`, because there exists no **single character** in `"İ"` such that lowercased version of that character equals to `"i"`. Note, however, that there _is_ a byte subsequence of `"İ"` such that lowercased version of that UTF-8 byte sequence equals to `"i"` (so the new behaviour is different than the old behaviour).

Also, translation for ICU collations works by repeatedly translating the longest possible substring that matches a key in the dictionary (under the specified collation), starting from the left side of the input string, until the entire string is translated.

### Why are the changes needed?
Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_BINARY_LCASE (see example above).

### Does this PR introduce _any_ user-facing change?
Yes, behaviour of `translate` expression is changed for edge cases with one-to-many case mapping.

### How was this patch tested?
New unit tests in `CollationStringExpressionsSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46761 from uros-db/alter-translate.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Initial commit

204e338

github-actions bot added the SQL label May 27, 2024

uros-db added 2 commits May 28, 2024 15:41

Fix StringTranslate

b366630

Fix Java lint

cee88c5

uros-db changed the title ~~[WIP][SQL] Fix StringTranslate behaviour for non-UTF8_BINARY collations~~ [SPARK-48440][SQL] Fix StringTranslate behaviour for non-UTF8_BINARY collations May 28, 2024

uros-db changed the title ~~[SPARK-48440][SQL] Fix StringTranslate behaviour for non-UTF8_BINARY collations~~ [WIP][SPARK-48440][SQL] Fix StringTranslate behaviour for non-UTF8_BINARY collations May 28, 2024

uros-db added 4 commits May 28, 2024 20:23

Merge branch 'apache:master' into alter-translate

3543c91

Merge branch 'master' into alter-translate

18efb2a

Fix LCASE implementation

f436ade

Fix lint

659b1eb

uros-db changed the title ~~[WIP][SPARK-48440][SQL] Fix StringTranslate behaviour for non-UTF8_BINARY collations~~ [SPARK-48440][SQL] Fix StringTranslate behaviour for non-UTF8_BINARY collations Jun 10, 2024

uros-db mentioned this pull request Jun 10, 2024

[SPARK-48576][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE #46924

Closed

uros-db added 2 commits June 10, 2024 16:21

Add tests

a70b651

Merge branch 'master' into alter-translate

19ed217

uros-db commented Jun 20, 2024

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java

mkaravel reviewed Jun 20, 2024

uros-db added 3 commits July 4, 2024 15:22

Merge branch 'master' into alter-translate

3076e8c

Update CollationStringExpressionsSuite.scala

542fee3

Refactor translate

be50416

uros-db commented Jul 8, 2024

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java

uros-db commented Jul 8, 2024

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java

uros-db commented Jul 8, 2024

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java

uros-db requested a review from mkaravel July 8, 2024 13:04

uros-db added 2 commits July 8, 2024 17:02

Fixes

aec0077

Update tests

4202538

mkaravel reviewed Jul 8, 2024

Fixes

b509941

uros-db requested a review from mkaravel July 9, 2024 18:24

Merge branch 'apache:master' into alter-translate

5fc0282

Update comment

2cdda66

mkaravel approved these changes Jul 12, 2024

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java

Fix comments

ae718aa

cloud-fan approved these changes Jul 12, 2024

cloud-fan closed this in 8e4bbdf Jul 12, 2024
