Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47566][SQL] Support SubstringIndex function to work with collated strings #45725

Closed

Conversation

miland-db
Copy link
Contributor

@miland-db miland-db commented Mar 26, 2024

What changes were proposed in this pull request?

Extend built-in string functions to support non-binary, non-lowercase collation for: substring_index.

Why are the changes needed?

Update collation support for built-in string functions in Spark.

Does this PR introduce any user-facing change?

Yes, users should now be able to use COLLATE within arguments for built-in string function SUBSTRING_INDEX in Spark SQL queries, using non-binary collations such as UNICODE_CI.

How was this patch tested?

Unit tests for queries using SubstringIndex (CollationStringExpressionsSuite.scala).

Was this patch authored or co-authored using generative AI tooling?

No

To consider:

There is no check for collation match between string and delimiter, it will be introduced with Implicit Casting.

We can remove the original public UTF8String subStringIndex(UTF8String delim, int count) method, and get the existing behavior using subStringIndex(delim, count, 0).

@github-actions github-actions bot added the SQL label Mar 26, 2024
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR.

Just one preliminary question, is there any chance of performance regression after this PR, @miland-db ?

@miland-db
Copy link
Contributor Author

miland-db commented Mar 26, 2024

So far the computational complexity of this function was O(n*m) where n = string.length and m = delimiter.length (please correct me if I'm wrong)

  • Using this function without explicit collations should have the same performance as before.
  • For UTF8_BINARY_LCASE collation, performance should have the same asymptotic time complexity O(n*m) but it will have a greater constant factor due to conversion of strings to lowercase and some additional work.
  • Performance for other non-binary collations depends on StringSearch implementation, but it is widely used to do string search on collated strings. Performance of that algorithm is explained here: String Search | ICU Documentation

I hope this helps @dongjoon-hyun

@miland-db
Copy link
Contributor Author

@uros-db @mihailom-db @MaxGekk please take a look at this changes

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is the test suite UTF8StringWithCollationSuite. Could you add/move tests there for the changes in UTF8String + collation.

@miland-db
Copy link
Contributor Author

miland-db commented Mar 27, 2024

I am testing functions from stringExpressions. Existing tests for these functions on non-collated strings are written in StringExpressionsSuite. Following that logic, tests on collated strings using functions from stringExpressions should be in CollationStringExpressionsSuite. I can add unit tests for changes introduced in UTF8String to UTF8StringWithCollationSuite.java if that's what we want to test that part more thoroughly.

In this PR: #45615 I have added tests to CollationStringExpressionsSuite

@miland-db miland-db requested a review from MaxGekk March 29, 2024 16:55
@miland-db
Copy link
Contributor Author

@stefankandic can you also review this change please?

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts:
#	common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just flagging this PR will likely need a fix for the ICU implementation

# Conflicts:
#	common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
#	common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
#	sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
@miland-db miland-db changed the title [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [SPARK-47566][SQL] Support SubstringIndex function to work with collated strings Apr 23, 2024
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts:
#	common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
@miland-db miland-db requested a review from uros-db April 26, 2024 13:53
# Conflicts:
#	common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
#	common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

}
public static UTF8String execLowercase(final UTF8String string, final UTF8String delimiter,
final int count) {
return CollationAwareUTF8String.lowercaseSubStringIndex(string, delimiter, count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class CollationAwareUTF8String is getting bigger. Shall we move it to an individual file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in the next PR. We will consider this option

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, we should do this in #45820

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 12a5074 Apr 30, 2024
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…ted strings

### What changes were proposed in this pull request?
Extend built-in string functions to support non-binary, non-lowercase collation for: substring_index.

### Why are the changes needed?
Update collation support for built-in string functions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use COLLATE within arguments for built-in string function SUBSTRING_INDEX in Spark SQL queries, using non-binary collations such as UNICODE_CI.

### How was this patch tested?
Unit tests for queries using SubstringIndex (`CollationStringExpressionsSuite.scala`).

### Was this patch authored or co-authored using generative AI tooling?
No

### To consider:
There is no check for collation match between string and delimiter, it will be introduced with Implicit Casting.

We can remove the original `public UTF8String subStringIndex(UTF8String delim, int count)` method, and get the existing behavior using `subStringIndex(delim, count, 0)`.

Closes apache#45725 from miland-db/miland-db/substringIndex-stringLocate.

Authored-by: Milan Dankovic <milan.dankovic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
7 participants