[SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex) #46589

uros-db · 2024-05-15T06:47:19Z

What changes were proposed in this pull request?

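String searching under the UTF8_BINARY_LCASE collation now works at the character level rather than the byte level. For example, instr("İ", "i") now returns 0, because there exist no start and len such that lowercase(substring("İ", start, len)) == "i". The lowercase form of "İ" (U+0130) is "i" followed by a combining dot above (U+0307), so no substring of "İ" lowercases to exactly "i".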
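The difference can be illustrated with a small, self-contained Java sketch. It is not the code from CollationAwareUTF8String; the class and method names (`LcaseSearchSketch`, `naiveInstr`, `charLevelInstr`) are hypothetical, and it uses `String.toLowerCase(Locale.ROOT)` as a stand-in for the collation's lowercasing. The first method lowercases the whole string before searching, which lets a match begin inside the expansion of a single character; the second only accepts matches that correspond to a whole substring of the original string.

```java
import java.util.Locale;

// Illustrative sketch only; all names here are hypothetical, not Spark APIs.
final class LcaseSearchSketch {

    // Lowercase both strings, then search. Returns a 1-based position, 0 if absent.
    // A match may start inside the lowercase expansion of one original character.
    static int naiveInstr(String target, String pattern) {
        return target.toLowerCase(Locale.ROOT)
                     .indexOf(pattern.toLowerCase(Locale.ROOT)) + 1;
    }

    // Character-level search: report a match only if some substring of the
    // ORIGINAL target lowercases to exactly the lowercased pattern.
    // Returns a 1-based code point position, 0 if absent.
    static int charLevelInstr(String target, String pattern) {
        String lowerPattern = pattern.toLowerCase(Locale.ROOT);
        if (lowerPattern.isEmpty()) {
            return 1; // assumption: an empty pattern matches at the first position
        }
        int pos = 1;
        for (int start = 0; start < target.length();
             start = target.offsetByCodePoints(start, 1), pos++) {
            int end = start;
            while (end < target.length()) {
                end = target.offsetByCodePoints(end, 1);
                String lowered = target.substring(start, end).toLowerCase(Locale.ROOT);
                if (lowered.equals(lowerPattern)) {
                    return pos;
                }
                if (lowered.length() > lowerPattern.length()) {
                    break; // longer candidates from this start cannot match
                }
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String dottedI = "\u0130"; // "İ", lowercases to "i" + U+0307
        System.out.println(naiveInstr(dottedI, "i"));     // 1
        System.out.println(charLevelInstr(dottedI, "i")); // 0
    }
}
```

The sketch is quadratic and ignores context-sensitive mappings such as final sigma; it is meant to show the matching rule, not an efficient implementation.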

Why are the changes needed?

Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_BINARY_LCASE (see example above).

Does this PR introduce any user-facing change?

Yes, the behaviour of the instr and substring_index expressions changes for edge cases involving one-to-many case mapping.
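For substring_index, the same rule decides where a delimiter occurs. The following Java sketch is a hedged illustration, not the patch's implementation: `SubstringIndexSketch`, `matchesAt`, and `substringIndex` are hypothetical names, `String.toLowerCase(Locale.ROOT)` stands in for the collation's lowercasing, and only a positive count is handled.

```java
import java.util.Locale;

// Illustrative sketch only; all names here are hypothetical, not Spark APIs.
final class SubstringIndexSketch {

    // True if some substring of target beginning at `start` lowercases
    // to exactly lowerDelim.
    static boolean matchesAt(String target, int start, String lowerDelim) {
        int end = start;
        while (end < target.length()) {
            end = target.offsetByCodePoints(end, 1);
            String lowered = target.substring(start, end).toLowerCase(Locale.ROOT);
            if (lowered.equals(lowerDelim)) {
                return true;
            }
            if (lowered.length() > lowerDelim.length()) {
                return false;
            }
        }
        return false;
    }

    // Returns the part of target before the count-th occurrence of delim,
    // matching case-insensitively at the character level.
    static String substringIndex(String target, String delim, int count) {
        if (count < 0) {
            throw new UnsupportedOperationException("sketch handles count >= 0 only");
        }
        String lowerDelim = delim.toLowerCase(Locale.ROOT);
        if (count == 0 || lowerDelim.isEmpty()) {
            return "";
        }
        int start = 0;
        while (start < target.length()) {
            if (matchesAt(target, start, lowerDelim) && --count == 0) {
                return target.substring(0, start);
            }
            start = target.offsetByCodePoints(start, 1);
        }
        return target; // fewer than count occurrences: return the whole string
    }

    public static void main(String[] args) {
        // "İ" (U+0130) is not treated as an occurrence of "i"; the plain "I" is.
        System.out.println(substringIndex("a\u0130bIc", "i", 1)); // aİb
    }
}
```

In this sketch the delimiter "i" is found at the plain "I" and not at "İ", so the result keeps "İ" intact instead of cutting through its lowercase expansion.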

How was this patch tested?

New unit tests in CollationSupportSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java

mkaravel

Thank you for addressing my comments.
LGTM!

mkaravel · 2024-05-20T23:26:32Z

Please update the PR description.

uros-db

updated PR description & title
@cloud-fan ready for review

mkaravel · 2024-05-22T04:03:46Z

@uros-db Please see my comment about the PR description in #46511
The same comment applies to the PR description for this PR.

cloud-fan · 2024-05-28T17:05:48Z

please fix merge conflicts.

cloud-fan · 2024-05-29T18:14:32Z

thanks, merging to master!

…llation (StringInStr, SubstringIndex)

### What changes were proposed in this pull request?
String searching in UTF8_BINARY_LCASE now works on character-level, rather than on byte-level. For example, `instr("İ", "i")` now returns 0, because there exist no `start, len` such that `lowercase(substring("İ", start, len)) == "i"`.

### Why are the changes needed?
Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_BINARY_LCASE (see example above).

### Does this PR introduce _any_ user-facing change?
Yes, behaviour of `instr` and `substring_index` expressions is changed for edge cases with one-to-many case mapping.

### How was this patch tested?
New unit tests in `CollationSupportSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46589 from uros-db/alter-lcase-vol2.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

### What changes were proposed in this pull request?
Renaming `UTF8_BINARY_LCASE` collation to `UTF8_LCASE`.

### Why are the changes needed?
As part of the collation effort in Spark, we've moved away from byte-by-byte logic towards character-by-character logic, so what we used to call `UTF8_BINARY_LCASE` is now more precisely `UTF8_LCASE`. For example, string searching in UTF8_LCASE now works on character-level (rather than on byte-level), which is reflected in these PRs: #46511, #46589, #46682, #46761, #46762. In addition, string comparison also works on character-level now, as per the changes introduced in this PR: #46700.

### Does this PR introduce _any_ user-facing change?
Yes, the collation previously named `UTF8_BINARY_LCASE` will from now on be named `UTF8_LCASE`.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46924 from uros-db/rename-lcase.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Fix and tests

f226c65

github-actions bot added the SQL label May 15, 2024

Fix tests

a2e1307

uros-db changed the title ~~[WIP][SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubStringIndex)~~ [WIP][SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex) May 15, 2024

Merge branch 'apache:master' into alter-lcase-vol2

4671b85

mkaravel reviewed May 16, 2024


Small fixes

168dde3

uros-db requested a review from mkaravel May 17, 2024 08:25

Centralize indexOf behaviour for empty substring

a26124c

mkaravel approved these changes May 20, 2024


uros-db changed the title ~~[WIP][SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex)~~ [SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex) May 21, 2024

uros-db commented May 21, 2024


Update new method access

b571084

dbatomic approved these changes May 28, 2024


Merge branch 'master' into alter-lcase-vol2

05d2ddb

cloud-fan closed this in 0461745 May 29, 2024

uros-db mentioned this pull request Jun 10, 2024

[SPARK-48576][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE #46924

Closed
