Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-28052][SQL] Make ArrayExists follow the three-valued boolean logic. #24873

Closed
wants to merge 6 commits into from

Conversation

ueshin
Copy link
Member

@ueshin ueshin commented Jun 14, 2019

What changes were proposed in this pull request?

Currently ArrayExists always returns boolean values (if the arguments are not null), but it should follow the three-valued boolean logic:

  • true if the predicate holds at least one true
  • otherwise, null if the predicate holds null
  • otherwise, false

This behavior change is made to match Postgres' equivalent function ANY/SOME (array)'s behavior: https://www.postgresql.org/docs/9.6/functions-comparisons.html#AEN21174

How was this patch tested?

Modified tests and existing tests.

@ueshin
Copy link
Member Author

ueshin commented Jun 14, 2019

Copy link
Contributor

@rednaxelafx rednaxelafx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, @ueshin !
I like how this is making the ArrayExists expression more consistent with the rest of three-valued boolean logic expressions, especially with the new some()/any() aggregate functions.

But the current implementation still seems to be slightly different from the semantics of some()/any():

scala> spark.sql("select explode(array(null, 1)) as x").selectExpr("any(x = 2)").show
+------------+
|any((x = 2))|
+------------+
|       false|
+------------+

With the current PR, it looks like select exists(array(null, 1), x -> x = 2) will return null instead of false.

P.S. this is definitely a behavior change, and although we're only doing this in a major release, should we still create a conf flag for it?

@ueshin
Copy link
Member Author

ueshin commented Jun 14, 2019

@rednaxelafx Thanks for taking a look at this!

Actually I checked the behavior with Postgres' equivalent function, any:

postgres=# select 2 = any(array[null, 1]);
 ?column?
----------
 (null)
(1 row)

As for the some()/any() aggregate functions, the equivalent functions would be bool_or(), then Postgres says:

postgres=# select bool_or(c = 2) from (values (null), (1)) as t(c);
 bool_or
---------
 f
(1 row)

I think aggregate functions have a different semantics, so the current behavior is reasonable.

P.S. sure, I'll add a config.

@SparkQA
Copy link

SparkQA commented Jun 14, 2019

Test build #106509 has finished for PR 24873 at commit 0d30c66.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rednaxelafx
Copy link
Contributor

@ueshin , thanks for the explanation!

Matching PostgreSQL's any() makes sense. Quoting their doc: https://www.postgresql.org/docs/9.6/functions-comparisons.html#AEN21174

The right-hand side is a parenthesized expression, which must yield an array value. The left-hand expression is evaluated and compared to each element of the array using the given operator, which must yield a Boolean result. The result of ANY is "true" if any true result is obtained. The result is "false" if no true result is found (including the case where the array has zero elements).

If the array expression yields a null array, the result of ANY will be null. If the left-hand expression yields null, the result of ANY is ordinarily null (though a non-strict comparison operator could possibly yield a different result). Also, if the right-hand array contains any null elements and no true comparison result is obtained, the result of ANY will be null, not false (again, assuming a strict comparison operator). This is in accordance with SQL's normal rules for Boolean combinations of null values.

It might be worth mentioning in the PR description that this behavior change is made to match PG's behavior?

@ueshin
Copy link
Member Author

ueshin commented Jun 14, 2019

@rednaxelafx I updated the description.

@SparkQA
Copy link

SparkQA commented Jun 14, 2019

Test build #106514 has finished for PR 24873 at commit ef7c90a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 14, 2019

Test build #106516 has finished for PR 24873 at commit 0bf37f1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 14, 2019

Test build #106517 has finished for PR 24873 at commit 5a3c300.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rednaxelafx
Copy link
Contributor

Thanks @ueshin, LGTM!

@@ -139,6 +139,8 @@ license: |

- Since Spark 3.0, we use a new protocol for fetching shuffle blocks, for external shuffle service users, we need to upgrade the server correspondingly. Otherwise, we'll get the error message `UnsupportedOperationException: Unexpected message: FetchShuffleBlocks`. If it is hard to upgrade the shuffle service right now, you can still use the old protocol by setting `spark.shuffle.useOldFetchProtocol` to `true`.

- Since Spark 3.0, a higher-order function `exists` follows the three-valued boolean logic. The previous behaviour can be restored by setting `spark.sql.legacy.arrayExistsFollowsThreeValuedLogic` to `false`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may we add an example in order to make more clear to users what to expect? three-valued boolean logic may be a bit obscure for users, IMHO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I added some more note and an example. Could you check it again?

if (ret == null) {
foundNull = true
} else if (ret.asInstanceOf[Boolean]) {
return true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid using return here and keep the previous way we handle exists? Using return is a pretty bad practice and I think here having an extra flag we can easily avoid it..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 we don't have to use early return here. The old code works fine and conveys the loop condition well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I updated to use exists back.

@kiszk
Copy link
Member

kiszk commented Jun 15, 2019

LGTM, pending Jenkins

@SparkQA
Copy link

SparkQA commented Jun 15, 2019

Test build #106537 has finished for PR 24873 at commit d626559.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor

LGTM as well, thanks @ueshin !

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, all!
Merged to master.

@ueshin
Copy link
Member Author

ueshin commented Jun 15, 2019

Thanks all for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants