[SPARK-47413][SQL] - add support to substr/left/right for collations #46040
@uros-db please review
What's the diff w/ #46039?
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
@uros-db I made all suggested changes. Please re-review. Thanks!
SubstringTestCase("select left('abc' collate " + c + ", 1)", c, Row("a")),
SubstringTestCase("select right('def' collate " + c + ", 1)", c, Row("f")),
SubstringTestCase("select substr('abc' collate " + c + ", 2)", c, Row("bc")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 2)", c, Row("ex")),
SubstringTestCase("select substr('example' collate " + c + ", 1, 2)", c, Row("ex")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 7)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 1, 7)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 100)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 1, 100)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 2, 2)", c, Row("xa")),
SubstringTestCase("select substr('example' collate " + c + ", 1, 6)", c, Row("exampl")),
SubstringTestCase("select substr('example' collate " + c + ", 2, 100)", c, Row("xample")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 0)", c, Row("")),
SubstringTestCase("select substr('example' collate " + c + ", 100, 4)", c, Row("")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 100)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 1, 100)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 2, 100)", c, Row("xample")),
SubstringTestCase("select substr('example' collate " + c + ", -3, 2)", c, Row("pl")),
SubstringTestCase("select substr('example' collate " + c + ", -100, 4)", c, Row("")),
SubstringTestCase("select substr('example' collate " + c + ", -2147483648, 6)", c, Row("")),
SubstringTestCase("select substr(' a世a ' collate " + c + ", 2, 3)", c, Row("a世a")), // scalastyle:ignore
SubstringTestCase("select left(' a世a ' collate " + c + ", 3)", c, Row(" a世")), // scalastyle:ignore
SubstringTestCase("select right(' a世a ' collate " + c + ", 3)", c, Row("世a ")), // scalastyle:ignore
SubstringTestCase("select substr('AaAaAaAa000000' collate " + c + ", 2, 3)", c, Row("aAa")),
SubstringTestCase("select left('AaAaAaAa000000' collate " + c + ", 3)", c, Row("AaA")),
SubstringTestCase("select right('AaAaAaAa000000' collate " + c + ", 3)", c, Row("000")),
SubstringTestCase("select substr('' collate " + c + ", 1, 1)", c, Row("")),
SubstringTestCase("select left('' collate " + c + ", 1)", c, Row("")),
SubstringTestCase("select right('' collate " + c + ", 1)", c, Row("")),
SubstringTestCase("select left('ghi' collate " + c + ", 1)", c, Row("g"))
I don't think we need this many test cases here. If you didn't modify the way Substring/Left/Right expressions behave when given collated strings (i.e. you didn't introduce any collation awareness to nullSafeEval/doGenCode), then there should be no need to go this deep; a couple of test cases should do the trick just fine.
also, I think these tests can be combined with the one above to make:
test("Support Left/Right/Substr with collation") {
so that we could have something like:
checks.foreach { check =>
// Result & data type (explicit collation)
...
// Result & data type (implicit collation)
...
Will do (but without implicit collation at all, as per your other review comment).
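For readers following the positional cases quoted in this thread, here is a minimal sketch of the `substr`/`left`/`right` semantics those expectations imply: 1-based positions, `0` treated like `1`, negative positions counted from the end, and lengths measured in characters rather than bytes. It is a simplified model written only to reproduce the quoted expected values. It is not Spark's implementation, and `SubstrModelSketch` and its method names are hypothetical.

```scala
object SubstrModelSketch {
  /** Simplified model of SQL substr(str, pos, len), counting Unicode code points. */
  def substr(s: String, pos: Int, len: Int): String = {
    val cps: Array[Int] = s.codePoints().toArray
    val n: Long = cps.length.toLong
    // pos > 0 is 1-based, pos == 0 behaves like 1, pos < 0 counts back from the end.
    val start: Long = if (pos > 0) pos - 1L else if (pos < 0) n + pos else 0L
    // Clamp the half-open window [start, start + len) to the string bounds.
    val from: Long = math.max(start, 0L)
    val until: Long = math.min(start + len, n)
    if (until <= from) "" else new String(cps, from.toInt, (until - from).toInt)
  }

  /** left(str, len): the first len characters. */
  def left(s: String, len: Int): String = substr(s, 1, len)

  /** right(str, len): the last len characters (empty for len <= 0). */
  def right(s: String, len: Int): String =
    if (len <= 0) "" else substr(s, -len, len)
}
```

Because the model never looks at a collation, it also illustrates the reviewer's point: if the expressions are not collation-aware, repeating every positional case for every collation exercises the same code path.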
@HyukjinKwon Sorry for the confusion. I closed the other one to make it clear which one to review. (To answer your question, the other one did not have the unit test over a struct field, because in the past I have dealt with GHA test flakiness when using withTable expressions.)
62e22c6 to f6efb32
@uros-db please re-review!
f6efb32 to 1d91e95
@uros-db please re-review this one too.
SubstringTestCase("select substr('example' collate " + c + ", 1, 100)", c, Row("example")),
SubstringTestCase("select substr('example' collate " + c + ", 2, 2)", c, Row("xa")),
SubstringTestCase("select substr('example' collate " + c + ", 0, 0)", c, Row("")),
SubstringTestCase("select substr('example' collate " + c + ", -3, 2)", c, Row("pl")),
SubstringTestCase("select substr(' a世a ' collate " + c + ", 2, 3)", c, Row("a世a")), // scalastyle:ignore
SubstringTestCase("select left(' a世a ' collate " + c + ", 3)", c, Row(" a世")), // scalastyle:ignore
SubstringTestCase("select right(' a世a ' collate " + c + ", 3)", c, Row("世a ")), // scalastyle:ignore
SubstringTestCase("select left('AaAaAaAa000000' collate " + c + ", 3)", c, Row("AaA")),
SubstringTestCase("select right('AaAaAaAa000000' collate " + c + ", 3)", c, Row("000")),
SubstringTestCase("select substr('' collate " + c + ", 1, 1)", c, Row("")),
SubstringTestCase("select left('' collate " + c + ", 1)", c, Row("")),
SubstringTestCase("select right('' collate " + c + ", 1)", c, Row("")),
// improper values
SubstringTestCase("select left(null collate " + c + ", 1)", c, Row(null)),
SubstringTestCase("select right(null collate " + c + ", 1)", c, Row(null)),
SubstringTestCase("select substr(null collate " + c + ", 1)", c, Row(null)),
SubstringTestCase("select substr(null collate " + c + ", 1, 1)", c, Row(null)),
SubstringTestCase("select left(null collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select right(null collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr(null collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr(null collate " + c + ", null, null)", c, Row(null)),
SubstringTestCase("select left('AaAaAaAa000000' collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select right('AaAaAaAa000000' collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr('AaAaAaAa000000' collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr('AaAaAaAa0' collate " + c + ", null, null)", c, Row(null)),
SubstringTestCase("select right('' collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr('' collate " + c + ", null)", c, Row(null)),
SubstringTestCase("select substr('' collate " + c + ", null, null)", c, Row(null)),
SubstringTestCase("select left('' collate " + c + ", null)", c, Row(null))
28 cases * 4 collations = 112 tests
I'd say we don't need that many SQL tests; there's no need to do Seq("utf8_binary_lcase", "utf8_binary", "unicode", "unicode_ci").flatMap. Only 4 tests (with valid values) per function (substring/left/right) should be enough.
A couple of additional tests for improper values are fine as well, but we don't need to test every possible pair of collation & function.
since we don't have unit tests for these functions, let's make sure to use a smaller number of tests here in order to test more things - for example, I don't see any case/accent variation here, or a wider variety of variable-length characters, etc.
@uros-db done. Please re-review
fix tests
"select substr('example' collate " + "utf8_binary_lcase" + ", 1, 100)",
"utf8_binary_lcase",
Row("example")),
SubstringTestCase(
"select substr('example' collate " + "utf8_binary" + ", 2, 2)",
"utf8_binary",
Row("xa")),
SubstringTestCase(
"select right('' collate " + "utf8_binary_lcase" + ", 1)",
"utf8_binary_lcase",
Row("")),
SubstringTestCase(
"select substr('example' collate " + "unicode" + ", 0, 0)",
"unicode",
Row("")),
SubstringTestCase(
"select substr('example' collate " + "unicode_ci" + ", -3, 2)",
"unicode_ci",
Row("pl")),
SubstringTestCase(
"select substr(' a世a ' collate " + "utf8_binary_lcase" + ", 2, 3)", // scalastyle:ignore
"utf8_binary_lcase",
Row("a世a")), // scalastyle:ignore
SubstringTestCase(
"select left(' a世a ' collate " + "utf8_binary" + ", 3)", // scalastyle:ignore
"utf8_binary",
Row(" a世")), // scalastyle:ignore
SubstringTestCase(
"select right(' a世a ' collate " + "unicode" + ", 3)", // scalastyle:ignore
"unicode",
Row("世a ")), // scalastyle:ignore
SubstringTestCase(
"select left('ÀÃÂĀĂȦÄäåäáâãȻȻȻȻȻǢǼÆ' collate " + "unicode_ci" + ", 3)", // scalastyle:ignore
"unicode_ci",
Row("ÀÃÂ")), // scalastyle:ignore
SubstringTestCase(
"select right('ÀÃÂĀĂȦÄäåäáâãȻȻȻȻȻǢǼÆ' collate " + "utf8_binary_lcase" + ", 3)", // scalastyle:ignore
"utf8_binary_lcase",
Row("ǢǼÆ")), // scalastyle:ignore
SubstringTestCase(
"select substr('' collate " + "utf8_binary_lcase" + ", 1, 1)",
"utf8_binary_lcase",
Row("")),
SubstringTestCase(
"select substr('' collate " + "unicode" + ", 1, 1)",
"unicode",
Row("")),
SubstringTestCase(
"select left('' collate " + "utf8_binary" + ", 1)",
"utf8_binary",
Row("")),
// improper values
SubstringTestCase(
"select left(null collate " + "utf8_binary_lcase" + ", 1)",
"utf8_binary_lcase",
Row(null)),
SubstringTestCase(
"select right(null collate " + "unicode" + ", 1)",
"unicode",
Row(null)),
SubstringTestCase(
"select substr(null collate " + "utf8_binary" + ", 1)",
"utf8_binary",
Row(null)),
SubstringTestCase(
"select substr(null collate " + "unicode_ci" + ", 1, 1)",
"unicode_ci",
Row(null)),
SubstringTestCase(
"select left(null collate " + "utf8_binary_lcase" + ", null)",
"utf8_binary_lcase",
Row(null)),
SubstringTestCase(
"select right(null collate " + "unicode" + ", null)",
"unicode",
Row(null)),
SubstringTestCase(
"select substr(null collate " + "utf8_binary" + ", null)",
"utf8_binary",
Row(null)),
SubstringTestCase(
"select substr(null collate " + "unicode_ci" + ", null, null)",
"unicode_ci",
Row(null)),
SubstringTestCase(
"select left('ÀÃÂĀĂȦÄäåäáâãȻȻȻȻȻǢǼÆ' collate " + "utf8_binary_lcase" + ", null)", // scalastyle:ignore
"utf8_binary_lcase",
Row(null)),
SubstringTestCase(
"select right('ÀÃÂĀĂȦÄäåäáâãȻȻȻȻȻǢǼÆ' collate " + "unicode" + ", null)", // scalastyle:ignore
"unicode",
Row(null)),
SubstringTestCase(
"select substr('ÀÃÂĀĂȦÄäåäáâãȻȻȻȻȻǢǼÆ' collate " + "utf8_binary" + ", null)", // scalastyle:ignore
"utf8_binary",
Row(null)),
SubstringTestCase(
"select substr('' collate " + "unicode_ci" + ", null, null)",
"unicode_ci",
Row(null))
please follow the format of tests in this suite, queries should be constructed from these parameterized cases, otherwise they don't have a point
@uros-db I will change it accordingly. Please advise: do you want three case classes, or one case class with a parameter for the function name? If the latter (one case class), how do you want me to handle the third parameter (len), which left and right do not have, and which is optional for substr? Maybe with an Option[String]?
Is the quantity of tests satisfactory? I got it down from 112 tests to 25 tests: 13 for valid values and 12 for invalid values. I can get it down to 12 valid test cases and, say, six invalid ones if you prefer.
It's fine if there's a parameter and it's not used in some cases, I don't think that causes any error
otherwise you can always introduce LeftRightTestCase, whatever works
quantity is fine here, what's important is proper coverage without needless repetition, these e2e sql tests are pretty slow, so having hundreds of them for trivial expressions is less than ideal
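One way to read the "one case class with an unused parameter" option discussed above is sketched below: the function name and an optional length are fields, and the SQL text is derived from them instead of being written out by hand. The field names, the `query` helper, and the use of `Option[Int]` (rather than the `Option[String]` floated in the question) are assumptions for illustration; the shape that was actually merged may differ.

```scala
object SubstringCaseSketch {
  // Hypothetical parameterized case; `len` is None for left/right and for 2-arg substr.
  final case class SubstringTestCase(
      func: String,
      input: String,
      collation: String,
      pos: Int,
      len: Option[Int],
      expected: String) {

    // Build the SQL text from the parameters so every case shares one format.
    def query: String = {
      val args = (Seq(pos) ++ len.toSeq).mkString(", ")
      s"select $func('$input' collate $collation, $args)"
    }
  }

  // Inputs and expected values taken from cases quoted earlier in this thread.
  val checks: Seq[SubstringTestCase] = Seq(
    SubstringTestCase("substr", "example", "utf8_binary", 2, Some(2), "xa"),
    SubstringTestCase("substr", "example", "unicode_ci", -3, Some(2), "pl"),
    SubstringTestCase("left", "abc", "utf8_binary_lcase", 1, None, "a"),
    SubstringTestCase("right", "def", "unicode", 1, None, "f")
  )
}
```

In a real suite, each element of `checks` would be run through the test harness (for example inside a `checks.foreach { check => ... }` loop) and compared against `expected` and the collated string type; that part needs a Spark session and is omitted here.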
@uros-db Done.
a4d3c1e to 95be248
no need to avoid scalastyle guides
@uros-db I have made the suggested changes. Please re-review.
format
format
format
format
fewer test cases now
fewer test cases now
fewer test cases now
fewer test cases now
fewer test cases now
Fewer test cases and fewer tests for left/right/substr
rename t1234 to t1234<index>
rename QTestCase to SubstringTestCase
remove redundant unit test
unify test naming with struct test
tests pass locally
format
test impl to make more in line with refactor. next add struct test, maybe.
left impl
right impl
5050cf6 to b0357b7
lgtm
thanks, merging to master!
https://issues.apache.org/jira/browse/SPARK-46830
What changes were proposed in this pull request?
Add collation support to the return types of substr, left, and right when they are passed arguments with an explicit, implicit, or session-specified collation. Add tests to validate the behavior.
Why are the changes needed?
We are incrementally adding collation support to built-in string functions in Spark. These functions are intended to be supported for collated types.
Does this PR introduce any user-facing change?
Yes. These SQL functions no longer throw errors when passed collated types. Instead, they return the correct value, typed with the collation that was passed in or with the default collation.
How was this patch tested?
Unit testing + ad-hoc spark shell and pyspark shell interactions.
Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#46040 from GideonPotok/spark_collation_47413_5.
Authored-by: GideonPotok <g.potok4@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>