[SPARK-47566][SQL] Support SubstringIndex function to work with collated strings #45725

miland-db · 2024-03-26T15:09:46Z

What changes were proposed in this pull request?

Extend built-in string functions to support non-binary, non-lowercase collation for: substring_index.

Why are the changes needed?

Update collation support for built-in string functions in Spark.

Does this PR introduce any user-facing change?

Yes, users should now be able to use COLLATE within arguments for built-in string function SUBSTRING_INDEX in Spark SQL queries, using non-binary collations such as UNICODE_CI.

How was this patch tested?

Unit tests for queries using SubstringIndex (CollationStringExpressionsSuite.scala).

Was this patch authored or co-authored using generative AI tooling?

No

To consider:

There is no check for a collation match between the string and the delimiter; it will be introduced with Implicit Casting.

We can remove the original public UTF8String subStringIndex(UTF8String delim, int count) method, and get the existing behavior using subStringIndex(delim, count, 0).
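The overload suggestion above can be sketched on plain java.lang.String as follows. This is only an illustration under stated assumptions: the class SubstringIndexSketch and the constant BINARY_COLLATION_ID are hypothetical, the third argument 0 is assumed to denote the default binary collation, and the real code operates on UTF8String bytes rather than Java chars.

```java
// A minimal sketch, not the Spark implementation.
final class SubstringIndexSketch {
  // Hypothetical constant: assumes that 0 identifies the default binary collation.
  static final int BINARY_COLLATION_ID = 0;

  // The original two-argument form, kept only as a thin delegate.
  static String subStringIndex(String str, String delim, int count) {
    return subStringIndex(str, delim, count, BINARY_COLLATION_ID);
  }

  // The collation-aware form; this sketch handles the binary case only.
  static String subStringIndex(String str, String delim, int count, int collationId) {
    if (collationId != BINARY_COLLATION_ID) {
      throw new UnsupportedOperationException("only binary collation in this sketch");
    }
    if (delim.isEmpty() || count == 0) {
      return "";
    }
    if (count > 0) {
      // Walk forward to the count-th occurrence; keep everything to its left.
      int idx = -1;
      for (int i = 0; i < count; i++) {
        idx = str.indexOf(delim, idx + 1);
        if (idx < 0) {
          return str; // fewer than count occurrences: return the whole string
        }
      }
      return str.substring(0, idx);
    }
    // Walk backward to the |count|-th occurrence; keep everything to its right.
    int idx = str.length() - delim.length() + 1;
    for (long remaining = -(long) count; remaining > 0; remaining--) {
      idx = str.lastIndexOf(delim, idx - 1);
      if (idx < 0) {
        return str;
      }
    }
    return str.substring(idx + delim.length());
  }

  public static void main(String[] args) {
    System.out.println(subStringIndex("www.apache.org", ".", 2));
    System.out.println(subStringIndex("www.apache.org", ".", -2));
  }
}
```

With a single collation-aware entry point, the two-argument method carries no logic of its own, which is why removing it would not change the behavior for binary strings.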

dongjoon-hyun

Thank you for making a PR.

Just one preliminary question: is there any chance of a performance regression after this PR, @miland-db?

miland-db · 2024-03-26T16:11:40Z

So far, the computational complexity of this function has been O(n*m), where n = string.length and m = delimiter.length (please correct me if I'm wrong).
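To show where an O(n*m) bound of this kind comes from, here is a minimal sketch of a naive forward search on plain Java strings. The class NaiveFindSketch and its find method are hypothetical names for illustration; this is not the UTF8String code, which works on bytes.

```java
// A minimal sketch of naive substring search, assuming plain Java strings.
final class NaiveFindSketch {
  // Returns the first index >= start at which delim occurs in str, or -1.
  static int find(String str, String delim, int start) {
    int n = str.length();
    int m = delim.length();
    // Up to n candidate positions, each compared against up to m characters.
    for (int i = Math.max(start, 0); i + m <= n; i++) {
      int j = 0;
      while (j < m && str.charAt(i + j) == delim.charAt(j)) {
        j++;
      }
      if (j == m) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(find("abcabc", "bc", 0));
    System.out.println(find("abcabc", "bc", 2));
  }
}
```

The outer loop visits at most n positions and the inner loop compares at most m characters per position, which gives the O(n*m) worst case.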

Using this function without explicit collations should have the same performance as before.
For UTF8_BINARY_LCASE collation, performance should have the same asymptotic time complexity O(n*m) but it will have a greater constant factor due to conversion of strings to lowercase and some additional work.
Performance for other non-binary collations depends on StringSearch implementation, but it is widely used to do string search on collated strings. Performance of that algorithm is explained here: String Search | ICU Documentation
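The lowercase case described above can be sketched roughly as follows: lowercase both inputs, search in the lowercased copy, and slice the original string at the positions found. The class and method names are hypothetical, and the sketch assumes that lowercasing does not change the string length (it rejects inputs where it does), so it should not be read as the actual UTF8_BINARY_LCASE implementation. The ICU StringSearch path for other collations is not shown here.

```java
import java.util.Locale;

// A minimal sketch of a case-insensitive substring_index, assuming plain Java strings.
final class LowercaseSubstringIndexSketch {
  static String lowercaseSubStringIndex(String str, String delim, int count) {
    if (delim.isEmpty() || count == 0) {
      return "";
    }
    // The extra constant factor: one lowercase pass over each input.
    String lowerStr = str.toLowerCase(Locale.ROOT);
    String lowerDelim = delim.toLowerCase(Locale.ROOT);
    // Simplifying assumption: positions in lowerStr line up with positions in str.
    if (lowerStr.length() != str.length()) {
      throw new IllegalArgumentException("lowercasing changed the string length");
    }
    if (count > 0) {
      int idx = -1;
      for (int i = 0; i < count; i++) {
        idx = lowerStr.indexOf(lowerDelim, idx + 1);
        if (idx < 0) {
          return str; // fewer than count occurrences
        }
      }
      // Slice the original so the result keeps its original casing.
      return str.substring(0, idx);
    }
    int idx = lowerStr.length() - lowerDelim.length() + 1;
    for (long remaining = -(long) count; remaining > 0; remaining--) {
      idx = lowerStr.lastIndexOf(lowerDelim, idx - 1);
      if (idx < 0) {
        return str;
      }
    }
    return str.substring(idx + lowerDelim.length());
  }

  public static void main(String[] args) {
    System.out.println(lowercaseSubStringIndex("www.Apache.ORG", "APACHE", 1));
    System.out.println(lowercaseSubStringIndex("aXbxc", "x", -1));
  }
}
```

The search itself is unchanged from the binary case, so the asymptotic cost stays the same; only the two lowercase conversions and the extra copies are added on top.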

I hope this helps @dongjoon-hyun

miland-db · 2024-03-26T16:57:15Z

@uros-db @mihailom-db @MaxGekk please take a look at these changes.

MaxGekk

There is the test suite UTF8StringWithCollationSuite. Could you add/move tests there for the changes in UTF8String + collation?

miland-db · 2024-03-27T09:38:11Z

I am testing functions from stringExpressions. Existing tests for these functions on non-collated strings are written in StringExpressionsSuite. Following that logic, tests on collated strings using functions from stringExpressions should be in CollationStringExpressionsSuite. I can add unit tests for the changes introduced in UTF8String to UTF8StringWithCollationSuite.java if we want to test that part more thoroughly.

In PR #45615, I have added tests to CollationStringExpressionsSuite.

miland-db · 2024-03-29T16:57:20Z

@stefankandic can you also review this change please?

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

…se checks

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala

# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java

# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

uros-db

Just flagging that this PR will likely need a fix for the ICU implementation.

# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java # common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala # sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala

# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java # common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

uros-db

lgtm

cloud-fan · 2024-04-30T09:18:12Z

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java

+    }
+    public static UTF8String execLowercase(final UTF8String string, final UTF8String delimiter,
+        final int count) {
+      return CollationAwareUTF8String.lowercaseSubStringIndex(string, delimiter, count);


The class CollationAwareUTF8String is getting bigger. Shall we move it to an individual file?

Maybe in the next PR. We will consider this option

agreed, we should do this in #45820

cloud-fan · 2024-04-30T09:18:24Z

thanks, merging to master!

…ted strings

### What changes were proposed in this pull request?
Extend built-in string functions to support non-binary, non-lowercase collation for: substring_index.

### Why are the changes needed?
Update collation support for built-in string functions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use COLLATE within arguments for built-in string function SUBSTRING_INDEX in Spark SQL queries, using non-binary collations such as UNICODE_CI.

### How was this patch tested?
Unit tests for queries using SubstringIndex (`CollationStringExpressionsSuite.scala`).

### Was this patch authored or co-authored using generative AI tooling?
No

### To consider:
There is no check for collation match between string and delimiter, it will be introduced with Implicit Casting.

We can remove the original `public UTF8String subStringIndex(UTF8String delim, int count)` method, and get the existing behavior using `subStringIndex(delim, count, 0)`.

Closes apache#45725 from miland-db/miland-db/substringIndex-stringLocate.

Authored-by: Milan Dankovic <milan.dankovic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

miland-db added 4 commits March 25, 2024 18:09

Add find method with collation supported

d2e75fe

Add SubstringIndex support for collated strings

2097a66

Improve unit tests and fix bugs

b3bd34a

Fix bug with the rfind on collated strings

15c5491

github-actions bot added the SQL label Mar 26, 2024

miland-db added 2 commits March 26, 2024 16:19

Merge branch 'master' into miland-db/substringIndex-stringLocate

5925763

Resolve merge problems with master

34ee8af

dongjoon-hyun reviewed Mar 26, 2024

improve scala style

2f8f13d

MaxGekk requested changes Mar 27, 2024

Add tests to UTF8StringWithCollationSuite

5538b07

miland-db requested a review from MaxGekk March 29, 2024 16:55