[SPARK-28052][SQL] Make `ArrayExists` follow the three-valued boolean logic. #24873
Thanks for working on this, @ueshin !
I like how this is making the `ArrayExists` expression more consistent with the rest of three-valued boolean logic expressions, especially with the new `some()`/`any()` aggregate functions.
But the current implementation still seems to be slightly different from the semantics of `some()`/`any()`:
scala> spark.sql("select explode(array(null, 1)) as x").selectExpr("any(x = 2)").show
+------------+
|any((x = 2))|
+------------+
|       false|
+------------+
With the current PR, it looks like `select exists(array(null, 1), x -> x = 2)` will return `null` instead of `false`.
P.S. this is definitely a behavior change, and although we're only doing this in a major release, should we still create a conf flag for it?
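To make the contrast in this comment concrete, here is a minimal standalone sketch, not Spark code. It models a nullable SQL boolean as `Option[Boolean]` (`None` = NULL). The names `anyAgg` and `existsTvl` are hypothetical. The sketch assumes the aggregate skips NULL inputs, as the output above suggests; returning NULL when every input is NULL is a further assumption made by analogy with other SQL aggregates.

```scala
object AnyVsExists {
  // Aggregate-style any(): NULL inputs (None) are skipped before combining.
  // Assumption: an all-NULL (or empty) input yields NULL.
  def anyAgg(values: Seq[Option[Boolean]]): Option[Boolean] = {
    val nonNull = values.flatten
    if (nonNull.isEmpty) None else Some(nonNull.contains(true))
  }

  // exists() as proposed in this PR: a NULL predicate result is remembered,
  // not skipped, so it can turn a would-be false into NULL.
  def existsTvl(values: Seq[Option[Boolean]]): Option[Boolean] =
    if (values.contains(Some(true))) Some(true)
    else if (values.contains(None)) None
    else Some(false)
}
```

For the inputs discussed here, `x = 2` over `(null, 1)` evaluates to `Seq(None, Some(false))`. Under this model `anyAgg` gives `Some(false)` while `existsTvl` gives `None`, which is the difference pointed out above.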
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
@rednaxelafx Thanks for taking a look at this! Actually I checked the behavior with Postgres' equivalent function,
As for the
I think aggregate functions have different semantics, so the current behavior is reasonable. P.S. Sure, I'll add a config.
Test build #106509 has finished for PR 24873 at commit
@ueshin , thanks for the explanation! Matching PostgreSQL's
It might be worth mentioning in the PR description that this behavior change is made to match PG's behavior?
@rednaxelafx I updated the description.
Test build #106514 has finished for PR 24873 at commit
Test build #106516 has finished for PR 24873 at commit
Test build #106517 has finished for PR 24873 at commit
Thanks @ueshin, LGTM!
docs/sql-migration-guide-upgrade.md
@@ -139,6 +139,8 @@ license: |

  - Since Spark 3.0, we use a new protocol for fetching shuffle blocks, for external shuffle service users, we need to upgrade the server correspondingly. Otherwise, we'll get the error message `UnsupportedOperationException: Unexpected message: FetchShuffleBlocks`. If it is hard to upgrade the shuffle service right now, you can still use the old protocol by setting `spark.shuffle.useOldFetchProtocol` to `true`.

  - Since Spark 3.0, a higher-order function `exists` follows the three-valued boolean logic. The previous behaviour can be restored by setting `spark.sql.legacy.arrayExistsFollowsThreeValuedLogic` to `false`.
May we add an example to make it clearer to users what to expect? `three-valued boolean logic` may be a bit obscure for users, IMHO.
Sure, I added a further note and an example. Could you check it again?
if (ret == null) {
  foundNull = true
} else if (ret.asInstanceOf[Boolean]) {
  return true
Can we avoid using `return` here and keep the previous way we handle `exists`? Using `return` is a pretty bad practice, and I think we can easily avoid it here with an extra flag.
+1 we don't have to use early return here. The old code works fine and conveys the loop condition well.
Sure, I switched back to using `exists`.
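The flag-based loop requested in this thread might look roughly like the sketch below. It is a standalone illustration, not the code merged in the PR. The helper `evalExists` and its input, an array of already-evaluated predicate results where a Java `null` stands for SQL NULL, are assumptions made so the example runs without Spark.

```scala
object ExistsLoop {
  // Hypothetical helper: results(i) is the predicate result for element i,
  // with null standing for SQL NULL.
  def evalExists(results: Array[java.lang.Boolean]): java.lang.Boolean = {
    var exists = false     // set once any element yields true
    var foundNull = false  // remembers whether any element yielded NULL
    var i = 0
    // The `!exists` condition stops the loop early without using `return`.
    while (i < results.length && !exists) {
      val ret = results(i)
      if (ret == null) {
        foundNull = true
      } else if (ret.booleanValue) {
        exists = true
      }
      i += 1
    }
    if (exists) java.lang.Boolean.TRUE
    else if (foundNull) null
    else java.lang.Boolean.FALSE
  }
}
```

Keeping `exists` in the loop condition preserves the short-circuit on the first `true`, while `foundNull` only affects the result when no element matched.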
LGTM, pending Jenkins
Test build #106537 has finished for PR 24873 at commit
LGTM as well, thanks @ueshin !
+1, LGTM. Thank you, all!
Merged to master.
Thanks all for the review!
What changes were proposed in this pull request?
Currently `ArrayExists` always returns boolean values (if the arguments are not `null`), but it should follow the three-valued boolean logic:
- `true` if the predicate returns `true` for at least one element
- otherwise, `null` if the predicate returns `null` for any element
- otherwise, `false`
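As a rough illustration of the rule above, the following sketch applies it to an arbitrary predicate. It models nullable values with `Option` (`None` = NULL) and is not Spark's `ArrayExists` implementation. The names `exists3` and `eqSql` are hypothetical.

```scala
object ThreeValuedExists {
  // Applies `pred` to every element and combines the results in priority
  // order: true first, then NULL, then false.
  def exists3[A](xs: Seq[A])(pred: A => Option[Boolean]): Option[Boolean] = {
    val results = xs.map(pred)
    if (results.contains(Some(true))) Some(true)  // at least one true
    else if (results.contains(None)) None         // no true, but some NULL
    else Some(false)                              // every result was false
  }

  // SQL-style equality: comparing NULL with anything yields NULL.
  def eqSql(x: Option[Int], y: Int): Option[Boolean] = x.map(_ == y)
}
```

With these definitions, `exists3(Seq(None, Some(1)))(eqSql(_, 2))` is `None`, while the same array tested against `1` gives `Some(true)`.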
This behavior change is made to match Postgres' equivalent function `ANY/SOME (array)`'s behavior: https://www.postgresql.org/docs/9.6/functions-comparisons.html#AEN21174

How was this patch tested?
Modified tests and existing tests.