[SPARK-48282][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringReplace, FindInSet) #46682

uros-db · 2024-05-21T09:41:00Z

What changes were proposed in this pull request?

String searching in UTF8_BINARY_LCASE now works at the character level rather than the byte level. For example, replace("İ", "i") now returns "İ", because there exist no start and len such that lowercase(substring("İ", start, len)) == "i".
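To illustrate the example above, here is a minimal sketch in plain Java. It is not the code in CollationAwareUTF8String; the class and method names (`LcaseSearchSketch`, `naiveContains`, `charLevelIndexOf`) are hypothetical, and it uses `String.toLowerCase(Locale.ROOT)` as a stand-in for the collation's lowercasing. It contrasts a naive "lowercase everything, then search" approach with a search that only accepts a match when some whole-character substring of the target lowercases to the lowercased pattern.

```java
import java.util.Locale;

// Hypothetical sketch; not Spark's actual implementation.
final class LcaseSearchSketch {

    // Naive approach: lowercase the whole target, then search in the result.
    // "İ" (U+0130) lowercases to "i" + U+0307 (two chars), so searching for
    // "i" finds a match that covers only part of the original character.
    static boolean naiveContains(String target, String pattern) {
        return target.toLowerCase(Locale.ROOT)
                     .contains(pattern.toLowerCase(Locale.ROOT));
    }

    // Character-level approach: return the first start index for which some
    // substring of whole code points lowercases to the lowercased pattern,
    // or -1 if no such (start, len) exists. Cubic in the worst case, which
    // is acceptable for a sketch. Empty patterns are not handled specially.
    static int charLevelIndexOf(String target, String pattern) {
        String lowerPattern = pattern.toLowerCase(Locale.ROOT);
        int start = 0;
        while (start < target.length()) {
            int end = start;
            while (end < target.length()) {
                // Extend the candidate by one full code point.
                end = target.offsetByCodePoints(end, 1);
                String candidate =
                    target.substring(start, end).toLowerCase(Locale.ROOT);
                if (candidate.equals(lowerPattern)) {
                    return start;
                }
            }
            start = target.offsetByCodePoints(start, 1);
        }
        return -1;
    }

    public static void main(String[] args) {
        String dottedI = "\u0130"; // İ
        System.out.println(naiveContains(dottedI, "i"));
        System.out.println(charLevelIndexOf(dottedI, "i"));
        System.out.println(charLevelIndexOf("ab" + dottedI, "B"));
    }
}
```

Under these assumptions, the naive search reports a match for "i" inside "İ", while the character-level search reports none, which is why replace("İ", "i") leaves the input unchanged.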

Why are the changes needed?

Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_BINARY_LCASE (see example above).

Does this PR introduce any user-facing change?

Yes, the behaviour of the replace expression changes for edge cases involving one-to-many case mapping.

How was this patch tested?

New unit tests for StringReplace and FindInSet in CollationSupportSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

mkaravel · 2024-05-24T00:36:15Z

Why are the changes needed?

Fix functions that give unusable results due to one-to-many conditional case mapping when performing string search under UTF8_BINARY_LCASE (see example above).

Please remove the word "conditional" in the above section.

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java

### What changes were proposed in this pull request?

Renaming `UTF8_BINARY_LCASE` collation to `UTF8_LCASE`.

### Why are the changes needed?

As part of the collation effort in Spark, we've moved away from byte-by-byte logic towards character-by-character logic, so what we used to call `UTF8_BINARY_LCASE` is now more precisely `UTF8_LCASE`. For example, string searching in UTF8_LCASE now works at the character level (rather than the byte level), which is reflected in these PRs: #46511, #46589, #46682, #46761, #46762. In addition, string comparison also works at the character level now, as per the changes introduced in this PR: #46700.

### Does this PR introduce _any_ user-facing change?

Yes, what was previously named the `UTF8_BINARY_LCASE` collation will from now on be named `UTF8_LCASE`.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46924 from uros-db/rename-lcase.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

uros-db added 2 commits May 21, 2024 11:35

Initial commit

b0225b1

Tests

665742c

github-actions bot added the SQL label May 21, 2024

checkstyle

8d05451

uros-db changed the title ~~[WIP][SPARK-48282][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringReplace, FindInSet)~~ [SPARK-48282][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringReplace, FindInSet) May 23, 2024

mkaravel reviewed May 24, 2024

uros-db added 5 commits May 24, 2024 12:03

Add tests

d5a811a

Update new method access

f7d22da

Fix FindInSet

7ff48e4

Fix scalastyle

d6bc73a

Merge branch 'master' into alter-lcase-vol3

9c58f57

mkaravel reviewed May 28, 2024

uros-db added 3 commits May 29, 2024 07:41

More tests

9444cfb

Merge branch 'apache:master' into alter-lcase-vol3

d18de9e

Merge branch 'apache:master' into alter-lcase-vol3

8a79827

uros-db requested a review from mkaravel May 31, 2024 12:20

uros-db mentioned this pull request Jun 10, 2024

[SPARK-48576][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE #46924

Closed
