[SPARK-28052][SQL] Make `ArrayExists` follow the three-valued boolean logic. #24873

ueshin · 2019-06-14T07:11:09Z

What changes were proposed in this pull request?

Currently ArrayExists always returns boolean values (if the arguments are not null), but it should follow the three-valued boolean logic:

true if the predicate returns true for at least one element
otherwise, null if the predicate returns null for any element
otherwise, false
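The three rules above can be sketched in plain Scala, without Spark. This is only an illustration, not the code in this patch: `Option[Boolean]` stands in for a nullable SQL boolean, and the names `ThreeValuedExists`, `exists`, and `equalsTwo` are hypothetical.

```scala
object ThreeValuedExists {
  // Option[Boolean] models a SQL boolean: Some(true), Some(false), or None for NULL.
  // This helper is an illustrative assumption, not Spark's ArrayExists implementation.
  def exists[A](elements: Seq[A])(predicate: A => Option[Boolean]): Option[Boolean] = {
    val results = elements.map(predicate)
    if (results.contains(Some(true))) Some(true) // at least one true wins
    else if (results.contains(None)) None        // no true, but at least one null
    else Some(false)                             // all false, or an empty array
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical predicate mirroring `x -> x = 2` over nullable Int elements.
    val equalsTwo: Option[Int] => Option[Boolean] = _.map(_ == 2)
    println(exists(Seq[Option[Int]](Some(1), Some(2), None))(equalsTwo)) // Some(true)
    println(exists(Seq[Option[Int]](None, Some(1)))(equalsTwo))          // None
    println(exists(Seq[Option[Int]](Some(1), Some(3)))(equalsTwo))       // Some(false)
  }
}
```

Note that a single true result takes precedence over any number of nulls, so the order of the elements does not affect the outcome.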

This behavior change matches the behavior of Postgres' equivalent function, ANY/SOME (array): https://www.postgresql.org/docs/9.6/functions-comparisons.html#AEN21174

How was this patch tested?

Modified tests and existing tests.

ueshin · 2019-06-14T07:13:56Z

cc @hvanhovell @gatorsmile @rednaxelafx @nvander1

rednaxelafx

Thanks for working on this, @ueshin !
I like how this makes the ArrayExists expression more consistent with the rest of the three-valued boolean logic expressions, especially with the new some()/any() aggregate functions.

But the current implementation still seems to be slightly different from the semantics of some()/any():

scala> spark.sql("select explode(array(null, 1)) as x").selectExpr("any(x = 2)").show
+------------+
|any((x = 2))|
+------------+
|       false|
+------------+

With the current PR, it looks like select exists(array(null, 1), x -> x = 2) will return null instead of false.

P.S. this is definitely a behavior change, and although we're only doing this in a major release, should we still create a conf flag for it?

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

ueshin · 2019-06-14T08:06:07Z

@rednaxelafx Thanks for taking a look at this!

Actually I checked the behavior with Postgres' equivalent function, any:

postgres=# select 2 = any(array[null, 1]);
 ?column?
----------
 (null)
(1 row)

As for the some()/any() aggregate functions, the equivalent Postgres function would be bool_or(), which gives:

postgres=# select bool_or(c = 2) from (values (null), (1)) as t(c);
 bool_or
---------
 f
(1 row)

I think aggregate functions have different semantics, so the current behavior is reasonable.
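To make the difference concrete, here is a small plain-Scala sketch contrasting the two behaviors shown in the psql sessions above. It is a model under stated assumptions, not Postgres or Spark code: `SqlBool`, `arrayAny`, and `boolOr` are hypothetical names, and the rule that `boolOr` yields null when every input is null is an assumption based on how Postgres aggregates generally treat nulls.

```scala
object AnyVersusBoolOr {
  type SqlBool = Option[Boolean] // None stands for SQL NULL

  // Array-style ANY / exists: a null comparison result turns a would-be false into null.
  def arrayAny(results: Seq[SqlBool]): SqlBool =
    if (results.contains(Some(true))) Some(true)
    else if (results.contains(None)) None
    else Some(false)

  // Aggregate-style bool_or: null inputs are skipped before combining.
  // Returning None when no non-null input remains is an assumption (see lead-in).
  def boolOr(results: Seq[SqlBool]): SqlBool = {
    val nonNull = results.flatten
    if (nonNull.isEmpty) None else Some(nonNull.contains(true))
  }

  def main(args: Array[String]): Unit = {
    // Comparison results of `c = 2` for the values (null) and (1): null and false.
    val results: Seq[SqlBool] = Seq(None, Some(false))
    println(arrayAny(results)) // None, like `2 = any(array[null, 1])`
    println(boolOr(results))   // Some(false), like `bool_or(c = 2)`
  }
}
```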

P.S. sure, I'll add a config.

SparkQA · 2019-06-14T08:44:23Z

Test build #106509 has finished for PR 24873 at commit 0d30c66.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rednaxelafx · 2019-06-14T09:30:09Z

@ueshin , thanks for the explanation!

Matching PostgreSQL's any() makes sense. Quoting their doc: https://www.postgresql.org/docs/9.6/functions-comparisons.html#AEN21174

The right-hand side is a parenthesized expression, which must yield an array value. The left-hand expression is evaluated and compared to each element of the array using the given operator, which must yield a Boolean result. The result of ANY is "true" if any true result is obtained. The result is "false" if no true result is found (including the case where the array has zero elements).

If the array expression yields a null array, the result of ANY will be null. If the left-hand expression yields null, the result of ANY is ordinarily null (though a non-strict comparison operator could possibly yield a different result). Also, if the right-hand array contains any null elements and no true comparison result is obtained, the result of ANY will be null, not false (again, assuming a strict comparison operator). This is in accordance with SQL's normal rules for Boolean combinations of null values.

It might be worth mentioning in the PR description that this behavior change is made to match PG's behavior?

ueshin · 2019-06-14T09:40:46Z

@rednaxelafx I updated the description.

SparkQA · 2019-06-14T12:03:21Z

Test build #106514 has finished for PR 24873 at commit ef7c90a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-06-14T12:45:56Z

Test build #106516 has finished for PR 24873 at commit 0bf37f1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-06-14T17:00:58Z

Test build #106517 has finished for PR 24873 at commit 5a3c300.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rednaxelafx · 2019-06-14T20:01:55Z

Thanks @ueshin, LGTM!

mgaido91 · 2019-06-14T20:38:49Z

docs/sql-migration-guide-upgrade.md

@@ -139,6 +139,8 @@ license: |

  - Since Spark 3.0, we use a new protocol for fetching shuffle blocks, for external shuffle service users, we need to upgrade the server correspondingly. Otherwise, we'll get the error message `UnsupportedOperationException: Unexpected message: FetchShuffleBlocks`. If it is hard to upgrade the shuffle service right now, you can still use the old protocol by setting `spark.shuffle.useOldFetchProtocol` to `true`. 

+  - Since Spark 3.0, a higher-order function `exists` follows the three-valued boolean logic. The previous behaviour can be restored by setting `spark.sql.legacy.arrayExistsFollowsThreeValuedLogic` to `false`.


May we add an example to make it clearer to users what to expect? Three-valued boolean logic may be a bit obscure for users, IMHO.

ueshin · 2019-06-15

Sure, I added a further note and an example. Could you check it again?

mgaido91 · 2019-06-14T20:43:10Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+      if (ret == null) {
+        foundNull = true
+      } else if (ret.asInstanceOf[Boolean]) {
+        return true


Can we avoid using `return` here and keep the previous way we handled `exists`? Using `return` is pretty bad practice, and I think we can easily avoid it here with an extra flag.

rednaxelafx · 2019-06-14

+1, we don't have to use an early return here. The old code works fine and conveys the loop condition well.

ueshin · 2019-06-15

Sure, I switched back to using `exists`.
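As a rough illustration of the shape discussed in this thread, the sketch below scans with two flags and no early `return`, and takes a switch for the legacy behavior. It is a standalone model, not the code in `higherOrderFunctions.scala`: the object name, the `Option[Boolean]` encoding of null, and the `followThreeValuedLogic` parameter (standing in for `spark.sql.legacy.arrayExistsFollowsThreeValuedLogic`) are all assumptions.

```scala
object ExistsLoopSketch {
  // Hypothetical model: None plays the role of a null predicate result.
  def exists[A](elements: IndexedSeq[A], followThreeValuedLogic: Boolean)(
      predicate: A => Option[Boolean]): Option[Boolean] = {
    var found = false
    var foundNull = false
    var i = 0
    // The loop condition itself stops the scan once a true result is seen.
    while (i < elements.length && !found) {
      predicate(elements(i)) match {
        case None        => foundNull = true
        case Some(true)  => found = true
        case Some(false) => () // keep scanning
      }
      i += 1
    }
    if (found) Some(true)
    else if (followThreeValuedLogic && foundNull) None // three-valued result
    else Some(false)                                   // legacy: null treated as false
  }

  def main(args: Array[String]): Unit = {
    val arr = Vector[Option[Int]](None, Some(1))
    println(exists(arr, followThreeValuedLogic = true)(_.map(_ == 2)))  // None
    println(exists(arr, followThreeValuedLogic = false)(_.map(_ == 2))) // Some(false)
  }
}
```

Keeping the termination test in the `while` condition means the reader sees every exit path in one place, which is the readability point raised in the review.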

kiszk · 2019-06-15T03:45:26Z

LGTM, pending Jenkins

SparkQA · 2019-06-15T04:44:47Z

Test build #106537 has finished for PR 24873 at commit d626559.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2019-06-15T07:44:40Z

LGTM as well, thanks @ueshin !

dongjoon-hyun

+1, LGTM. Thank you, all!
Merged to master.

ueshin · 2019-06-15T23:01:45Z

Thanks all for the review!

Make ArrayExists follow the three-valued boolean logic.

0d30c66

ueshin mentioned this pull request Jun 14, 2019

[SPARK-27905] [SQL] Add higher order function 'forall' #24761

Closed

rednaxelafx reviewed Jun 14, 2019

dongjoon-hyun reviewed Jun 14, 2019

Add a conf.

276c094

Add examples.

ef7c90a

Add a migration guide.

0bf37f1

Fix.

5a3c300

dongjoon-hyun added the SQL label Jun 14, 2019

mgaido91 reviewed Jun 14, 2019

Address comments.

d626559

dongjoon-hyun approved these changes Jun 15, 2019

dongjoon-hyun closed this in 5ae1a6b Jun 15, 2019
