[SPARK-47413][SQL] - add support to substr/left/right for collations #46040

Closed

GideonPotok
Contributor

https://issues.apache.org/jira/browse/SPARK-46830

What changes were proposed in this pull request?

Add collation support to the return types of substr, left, and right when they are passed arguments with an explicit, implicit, or session-specified collation. Add tests to validate the behavior.

Why are the changes needed?

We are incrementally adding collation support to built-in string functions in Spark. These functions are intended to be supported for collated types.

Does this PR introduce any user-facing change?

These SQL functions will no longer throw errors when passed collated types. Instead, they return the correct value, either in the passed-in collated type or in the default collation.
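
As a rough illustration of that behavior, here is a plain-Scala toy model with no Spark dependency. It treats a collated string as a value plus a collation name, and the result keeps the input's collation. `CollatedString`, `substr`, `left`, and `right` are hypothetical stand-ins, not Spark classes. The position rules are inferred from the expected values in the test cases quoted in this thread, and the model indexes by UTF-16 code unit, an assumption that only holds for characters in the Basic Multilingual Plane.

```scala
// Toy model only: these names are hypothetical and are not part of Spark.
object CollatedSubstringModel {
  final case class CollatedString(value: String, collation: String)

  // 1-based substr: pos 0 behaves like pos 1, a negative pos counts from the end.
  // Out-of-range windows are clamped to the string, so they yield "".
  def substr(s: CollatedString, pos: Int, len: Int = Int.MaxValue): CollatedString = {
    val n = s.value.length.toLong
    val start: Long = if (pos > 0) pos - 1L else if (pos < 0) n + pos else 0L
    val end: Long = if (len == Int.MaxValue) n else start + len
    val from = start.max(0L).min(n).toInt
    val until = end.max(0L).min(n).toInt
    val out = if (until <= from) "" else s.value.substring(from, until)
    s.copy(value = out) // the collation of the input is carried over unchanged
  }

  def left(s: CollatedString, len: Int): CollatedString = substr(s, 1, len)

  def right(s: CollatedString, len: Int): CollatedString =
    if (len <= 0) s.copy(value = "") else substr(s, -len)

  def main(args: Array[String]): Unit = {
    val in = CollatedString("example", "unicode_ci")
    println(substr(in, -3, 2))
    println(left(in, 2))
    println(right(in, 3))
  }
}
```

The point of the sketch is only that the functions operate on character positions, so the collation does not change which characters are returned; it only determines the type of the result.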

How was this patch tested?

Unit testing + ad-hoc spark shell and pyspark shell interactions.

Was this patch authored or co-authored using generative AI tooling?

No.

@GideonPotok
Contributor Author

@uros-db please review

@HyukjinKwon
Member

What's the diff w/ #46039?

@GideonPotok
Contributor Author

@uros-db I made all suggested changes. Please re-review. Thanks!

Comment on lines 191 to 220
SubstringTestCase("select left('abc' collate " + c + ", 1)", c, Row("a")),
SubstringTestCase("select right('def' collate " + c + ", 1)", c, Row("f")),
SubstringTestCase("select substr('abc' collate " + c + ", 2)", c, Row("bc")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 2)", c, Row("ex")),
SubstringTestCase("select substr('example' collate " + c + ", 1, 2)", c, Row("ex")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 7)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 1, 7)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 100)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 1, 100)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 2, 2)", c, Row("xa")),
SubstringTestCase("select substr('example' collate " + c + ", 1, 6)", c, Row("exampl")),
SubstringTestCase("select substr('example' collate " + c + ", 2, 100)", c, Row("xample")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 0)", c, Row("")),
SubstringTestCase("select substr('example' collate " + c + ", 100, 4)", c, Row("")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 100)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 1, 100)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 2, 100)", c, Row("xample")),
SubstringTestCase("select substr('example' collate " + c + ", -3, 2)", c, Row("pl")),
SubstringTestCase("select substr('example' collate " + c + ", -100, 4)", c, Row("")),
SubstringTestCase("select substr('example' collate " + c + ", -2147483648, 6)", c, Row("")),
SubstringTestCase("select substr(' a世a ' collate " + c + ", 2, 3)", c, Row("a世a")), // scalastyle:ignore
SubstringTestCase("select left(' a世a ' collate " + c + ", 3)", c, Row(" a世")), // scalastyle:ignore
SubstringTestCase("select right(' a世a ' collate " + c + ", 3)", c, Row("世a ")), // scalastyle:ignore
SubstringTestCase("select substr('AaAaAaAa000000' collate " + c + ", 2, 3)", c, Row("aAa")),
SubstringTestCase("select left('AaAaAaAa000000' collate " + c + ", 3)", c, Row("AaA")),
SubstringTestCase("select right('AaAaAaAa000000' collate " + c + ", 3)", c, Row("000")),
SubstringTestCase("select substr('' collate " + c + ", 1, 1)", c, Row("")),
SubstringTestCase("select left('' collate " + c + ", 1)", c, Row("")),
SubstringTestCase("select right('' collate " + c + ", 1)", c, Row("")),
SubstringTestCase("select left('ghi' collate " + c + ", 1)", c, Row("g"))
Contributor

I don't think we need this many test cases here. If you didn't modify the way the Substring/Left/Right expressions behave when given collated strings (i.e. you didn't introduce any collation awareness to nullSafeEval/doGenCode), then there should be no need to go this deep; a couple of test cases should do the trick just fine

also, I think these tests can be combined with the one above to make:
test("Support Left/Right/Substr with collation") {

so that we could have something like:

checks.foreach { check =>
// Result & data type (explicit collation)
...
// Result & data type (implicit collation)
...

Contributor Author

Will do (but without implicit collation at all, as per your other review comment).

@GideonPotok
Contributor Author

What's the diff w/ #46039?

@HyukjinKwon Sorry for the confusion. I closed the other one to make it clear which one to review.

(To answer your question, the other one did not have the unit test over a struct field, because in the past I have run into GHA test flakiness when using withTable expressions.)

GideonPotok force-pushed the spark_collation_47413_5 branch 2 times, most recently from 62e22c6 to f6efb32, April 16, 2024 06:35

@GideonPotok
Contributor Author

@uros-db please re-review!

@GideonPotok
Contributor Author

@uros-db please re-review this one too.

Comment on lines 219 to 247
SubstringTestCase("select substr('example' collate " + c + ", 1, 100)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 2, 2)", c, Row("xa")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 0)", c, Row("")),
SubstringTestCase("select substr('example' collate " + c + ", -3, 2)", c, Row("pl")),
SubstringTestCase("select substr(' a世a ' collate " + c + ", 2, 3)", c, Row("a世a")), // scalastyle:ignore
SubstringTestCase("select left(' a世a ' collate " + c + ", 3)", c, Row(" a世")), // scalastyle:ignore
SubstringTestCase("select right(' a世a ' collate " + c + ", 3)", c, Row("世a ")), // scalastyle:ignore
SubstringTestCase("select left('AaAaAaAa000000' collate " + c + ", 3)", c, Row("AaA")),
SubstringTestCase("select right('AaAaAaAa000000' collate " + c + ", 3)", c, Row("000")),
SubstringTestCase("select substr('' collate " + c + ", 1, 1)", c, Row("")),
SubstringTestCase("select left('' collate " + c + ", 1)", c, Row("")),
SubstringTestCase("select right('' collate " + c + ", 1)", c, Row("")),
// improper values
SubstringTestCase("select left(null collate " + c + ", 1)", c, Row(null)),
SubstringTestCase("select right(null collate " + c + ", 1)", c, Row(null)),
SubstringTestCase("select substr(null collate " + c + ", 1)", c, Row(null)),
SubstringTestCase("select substr(null collate " + c + ", 1, 1)", c, Row(null)),
SubstringTestCase("select left(null collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select right(null collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr(null collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr(null collate " + c + ", null, null)", c, Row(null)),
SubstringTestCase("select left('AaAaAaAa000000' collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select right('AaAaAaAa000000' collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr('AaAaAaAa000000' collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr('AaAaAaAa0' collate " + c + ", null, null)", c, Row(null)),
SubstringTestCase("select right('' collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr('' collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr('' collate " + c + ", null, null)", c, Row(null)),
SubstringTestCase("select left('' collate " + c + ", null)", c, Row(null))
Contributor

28 cases * 4 collations = 112 tests

I'd say we don't need that many SQL tests. There's no need to do Seq("utf8_binary_lcase", "utf8_binary", "unicode", "unicode_ci").flatMap; only 4 tests (with valid values) per function (substring/left/right) should be enough

a couple of additional tests for improper values are fine as well, but we don't need to test every possible pair of collation & function
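
To make the arithmetic above concrete, here is a small plain-Scala sketch comparing the cross product of cases and collations with one alternative that visits each case once while cycling through the collations. The round-robin assignment is only an illustrative assumption, not necessarily how the tests in this PR were reduced; `CollationCoverage`, `crossProduct`, and `roundRobin` are hypothetical names.

```scala
// Hypothetical helper names; only the four collation names come from this thread.
object CollationCoverage {
  val collations: Seq[String] =
    Seq("utf8_binary_lcase", "utf8_binary", "unicode", "unicode_ci")

  // Every case under every collation: cases.size * collations.size queries.
  def crossProduct[A](cases: Seq[A]): Seq[(A, String)] =
    collations.flatMap(c => cases.map(k => (k, c)))

  // Each case exactly once, cycling through the collations in order.
  def roundRobin[A](cases: Seq[A]): Seq[(A, String)] =
    cases.zipWithIndex.map { case (k, i) => (k, collations(i % collations.length)) }

  def main(args: Array[String]): Unit = {
    val cases = (1 to 28).map(i => s"case$i")
    println(crossProduct(cases).size) // 28 * 4 = 112
    println(roundRobin(cases).size)   // 28
  }
}
```

With 28 cases, the round-robin variant still exercises every collation seven times while running a quarter of the queries.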

Contributor

since we don't have unit tests for these functions, let's use a smaller number of tests here but have them cover more things. For example, I don't see any case/accent variation here, nor a wider variety of variable-length characters, etc.

Contributor Author

@GideonPotok, Apr 17, 2024

@uros-db done. Please re-review

Contributor

@uros-db left a comment

fix tests

Comment on lines 218 to 531
"select substr('example' collate " + "utf8_binary_lcase" + ", 1, 100)",
"utf8_binary_lcase",
Row("example")),
SubstringTestCase(
"select substr('example' collate " + "utf8_binary" + ", 2, 2)",
"utf8_binary",
Row("xa")),
SubstringTestCase(
"select right('' collate " + "utf8_binary_lcase" + ", 1)",
"utf8_binary_lcase",
Row("")),
SubstringTestCase(
"select substr('example' collate " + "unicode" + ", 0, 0)",
"unicode",
Row("")),
SubstringTestCase(
"select substr('example' collate " + "unicode_ci" + ", -3, 2)",
"unicode_ci",
Row("pl")),
SubstringTestCase(
"select substr(' a世a ' collate " + "utf8_binary_lcase" + ", 2, 3)", // scalastyle:ignore
"utf8_binary_lcase",
Row("a世a")), // scalastyle:ignore
SubstringTestCase(
"select left(' a世a ' collate " + "utf8_binary" + ", 3)", // scalastyle:ignore
"utf8_binary",
Row(" a世")), // scalastyle:ignore
SubstringTestCase(
"select right(' a世a ' collate " + "unicode" + ", 3)", // scalastyle:ignore
"unicode",
Row("世a ")), // scalastyle:ignore
SubstringTestCase(
"select left('ÀÃÂĀĂȦÄäåäáâãȻȻȻȻȻǢǼÆ' collate " + "unicode_ci" + ", 3)", // scalastyle:ignore
"unicode_ci",
Row("ÀÃÂ")), // scalastyle:ignore
SubstringTestCase(
"select right('ÀÃÂĀĂȦÄäåäáâãȻȻȻȻȻǢǼÆ' collate " + "utf8_binary_lcase" + ", 3)", // scalastyle:ignore
"utf8_binary_lcase",
Row("ǢǼÆ")), // scalastyle:ignore
SubstringTestCase(
"select substr('' collate " + "utf8_binary_lcase" + ", 1, 1)",
"utf8_binary_lcase",
Row("")),
SubstringTestCase(
"select substr('' collate " + "unicode" + ", 1, 1)",
"unicode",
Row("")),
SubstringTestCase(
"select left('' collate " + "utf8_binary" + ", 1)",
"utf8_binary",
Row("")),
// improper values
SubstringTestCase(
"select left(null collate " + "utf8_binary_lcase" + ", 1)",
"utf8_binary_lcase",
Row(null)),
SubstringTestCase(
"select right(null collate " + "unicode" + ", 1)",
"unicode",
Row(null)),
SubstringTestCase(
"select substr(null collate " + "utf8_binary" + ", 1)",
"utf8_binary",
Row(null)),
SubstringTestCase(
"select substr(null collate " + "unicode_ci" + ", 1, 1)",
"unicode_ci",
Row(null)),
SubstringTestCase(
"select left(null collate " + "utf8_binary_lcase" + ", null)",
"utf8_binary_lcase",
Row(null)),
SubstringTestCase(
"select right(null collate " + "unicode" + ", null)",
"unicode",
Row(null)),
SubstringTestCase(
"select substr(null collate " + "utf8_binary" + ", null)",
"utf8_binary",
Row(null)),
SubstringTestCase(
"select substr(null collate " + "unicode_ci" + ", null, null)",
"unicode_ci",
Row(null)),
SubstringTestCase(
"select left('ÀÃÂĀĂȦÄäåäáâãȻȻȻȻȻǢǼÆ' collate " + "utf8_binary_lcase" + ", null)", // scalastyle:ignore
"utf8_binary_lcase",
Row(null)),
SubstringTestCase(
"select right('ÀÃÂĀĂȦÄäåäáâãȻȻȻȻȻǢǼÆ' collate " + "unicode" + ", null)", // scalastyle:ignore
"unicode",
Row(null)),
SubstringTestCase(
"select substr('ÀÃÂĀĂȦÄäåäáâãȻȻȻȻȻǢǼÆ' collate " + "utf8_binary" + ", null)", // scalastyle:ignore
"utf8_binary",
Row(null)),
SubstringTestCase(
"select substr('' collate " + "unicode_ci" + ", null, null)",
"unicode_ci",
Row(null))
Contributor

please follow the format of the tests in this suite: queries should be constructed from these parameterized cases, otherwise the parameterized cases don't have a point

Contributor Author

@uros-db I will change it accordingly. Please advise: do you want three case classes, or one case class with a parameter for the function name? If the latter (one case class), how do you want me to handle the third parameter (len), which left and right do not have, and which is optional for substr? Maybe with an Option[String]?

Is the quantity of tests satisfactory? I got it down from 112 tests to 25 tests: 13 for valid values and 12 for invalid values. I can get it down to 12 valid test cases and, say, six invalid ones if you prefer.

Contributor

It's fine if there's a parameter and it's not used in some cases, I don't think that causes any error

otherwise you can always introduce LeftRightTestCase, whatever works

quantity is fine here, what's important is proper coverage without needless repetition, these e2e sql tests are pretty slow, so having hundreds of them for trivial expressions is less than ideal
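
One possible shape for the single-case-class option discussed above is sketched below in plain Scala, assuming an `Option[Int]` for the length argument so that left/right simply leave it empty. The field names and the `toQuery` helper are hypothetical and are not the code that was merged; in the real suite the generated query would be passed to Spark and both the result and the result's data type would be checked.

```scala
// Hypothetical sketch; field names and toQuery are assumptions, not the suite's code.
object SubstringCases {
  final case class SubstringTestCase(
      func: String,        // "substr", "left" or "right"
      input: String,       // string literal, without quotes
      pos: Int,            // start position for substr, length for left/right
      len: Option[Int],    // only substr may use it; None for left/right
      collation: String,
      expected: String)

  // Build the SQL text from the parameters, so every query comes from a case.
  def toQuery(t: SubstringTestCase): String = {
    val args = (Seq(t.pos) ++ t.len).mkString(", ")
    s"select ${t.func}('${t.input}' collate ${t.collation}, $args)"
  }

  def main(args: Array[String]): Unit = {
    val checks = Seq(
      SubstringTestCase("substr", "example", 2, Some(2), "utf8_binary", "xa"),
      SubstringTestCase("left", "abc", 1, None, "unicode_ci", "a"),
      SubstringTestCase("right", "def", 1, None, "unicode", "f"))
    checks.foreach(t => println(toQuery(t)))
  }
}
```

An unused parameter costs nothing here, which matches the reviewer's remark; a separate LeftRightTestCase would be the alternative if the optional field felt awkward.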

Contributor Author

@uros-db Done.

@GideonPotok GideonPotok changed the title [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [SPARK-47413][SQL] - add support to substr/left/right for collations Apr 18, 2024
Contributor

@uros-db left a comment

no need to avoid scalastyle guides

@GideonPotok
Contributor Author

@uros-db I have made the suggested changes. please re-review.

format

format

format

format

fewer test cases now

fewer test cases now

fewer test cases now

fewer test cases now

fewer test cases now

Fewer test cases and fewer tests for left/right/substr

rename t1234 to t1234<index>

rename QTestCase to SubstringTestCase

remove redundant unit test

 unify test naming

with struct test

tests pass locally

format

test impl to make more in line with refactor. next add struct test, maybe.

left impl

right impl
@GideonPotok
Contributor Author

@uros-db

Contributor

@uros-db left a comment

lgtm

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e1432ef Apr 22, 2024
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
Closes apache#46040 from GideonPotok/spark_collation_47413_5.

Authored-by: GideonPotok <g.potok4@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
4 participants