Skip to content

[SPARK-49794][SQL] Fix the expression ArrayContains bug where value is null#48261

Closed
panbingkun wants to merge 2 commits intoapache:masterfrom
panbingkun:SPARK-49794
Closed

[SPARK-49794][SQL] Fix the expression ArrayContains bug where value is null#48261
panbingkun wants to merge 2 commits intoapache:masterfrom
panbingkun:SPARK-49794

Conversation

@panbingkun
Copy link
Contributor

@panbingkun panbingkun commented Sep 26, 2024

What changes were proposed in this pull request?

The pr aims to fix the expression ArrayContains bug where value is null.

Why are the changes needed?

  • Give an example with the following code:
sql("CREATE TABLE t(data array<int>, value int) USING PARQUET")
sql("INSERT INTO TABLE t VALUES (array(1, 2, 3, null), null)")
sql("SELECT * FROM t").show(false)
sql("SELECT array_contains(data, value) FROM T").show(false)
  • Before
+---------------+-----+
|data           |value|
+---------------+-----+
|[1, 2, 3, NULL]|NULL |
+---------------+-----+

+---------------------------+
|array_contains(data, value)|
+---------------------------+
|NULL                       |
+---------------------------+

Note: Obviously, the result of array_contains(data, value) does not meet common sense expectations.

  • After
+---------------+-----+
|data           |value|
+---------------+-----+
|[1, 2, 3, NULL]|NULL |
+---------------+-----+

+---------------------------+
|array_contains(data, value)|
+---------------------------+
|true                       |
+---------------------------+

Does this PR introduce any user-facing change?

Yes, corrected incorrect results about ArrayContains.

How was this patch tested?

  • Pass GA.
  • Add new UT & update UT.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Sep 26, 2024
StructField("b", IntegerType)))
val data = Seq(Row(Seq[Integer](1, 2, 3, null), null))
val df1 = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
checkAnswer(df1.select(array_contains(col("a"), col("b"))), Seq(Row(true)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this belongs to break changes, but I think it should be a correct correction.

@panbingkun panbingkun marked this pull request as ready for review September 26, 2024 10:44
@panbingkun
Copy link
Contributor Author

cc @MaxGekk @cloud-fan

@cloud-fan
Copy link
Contributor

Yea I think this is common sense. Do other systems have array_contains function and how do they handle null?

@panbingkun
Copy link
Contributor Author

panbingkun commented Sep 27, 2024

array_contains

image image
  • mysql
mysql> SELECT FIND_IN_SET('b','a,b,c,d');
+----------------------------+
| FIND_IN_SET('b','a,b,c,d') |
+----------------------------+
|                          2 |
+----------------------------+
1 row in set (0.01 sec)

mysql> SELECT FIND_IN_SET(null,'a,null,c,d');
+--------------------------------+
| FIND_IN_SET(null,'a,null,c,d') |
+--------------------------------+
|                           NULL |
+--------------------------------+
1 row in set (0.01 sec)

mysql> select null in (1, 2, null);
+----------------------+
| null in (1, 2, null) |
+----------------------+
|                 NULL |
+----------------------+
1 row in set (0.00 sec)

mysql> select null=null;
+-----------+
| null=null |
+-----------+
|      NULL |
+-----------+
1 row in set (0.01 sec)

mysql> select 1=1;
+-----+
| 1=1 |
+-----+
|   1 |
+-----+
1 row in set (0.01 sec)

@agubichev
Copy link
Contributor

agubichev commented Sep 27, 2024

Yea I think this is common sense. Do other systems have array_contains function and how do they handle null?

Current behavior seems to match how we handle IN lists (and those are in the SQL standard):
select null in (1, 2, null);
-- returns NULL, not true.

Fundamentally, if array_contains() uses equality to establish whether an element belongs to an array, then NULL = NULL returns NULL, not true.

See also https://docs.databricks.com/en/sql/language-manual/functions/array_contains.html#examples

@beliefer
Copy link
Contributor

beliefer commented Sep 27, 2024

I have also noticed this phenomenon before, which is quite common in Spark's expression functions. I guess the NullIntolerant lead to the result.
NullIntolerant is a specification that most expression followed. I don't know if we need to break this constraint.

@panbingkun
Copy link
Contributor Author

Yea I think this is common sense. Do other systems have array_contains function and how do they handle null?

Current behavior seems to match how we handle IN lists (and those are in the SQL standard): select null in (1, 2, null); -- returns NULL, not true.

Fundamentally, if array_contains() uses equality to establish whether an element belongs to an array, then NULL = NULL returns NULL, not true.

See also https://docs.databricks.com/en/sql/language-manual/functions/array_contains.html#examples

This is really a rule that goes against common sense, and I believe most data processors will be troubled by this rule.

@panbingkun
Copy link
Contributor Author

I have also noticed this phenomenon before, which is quite common in Spark's expression functions. I guess the NullIntolerant lead to the result. NullIntolerant is a specification that most expression followed. I don't know if we need to break this constraint.

Yea, that's it (NullIntolerant)

@panbingkun
Copy link
Contributor Author

panbingkun commented Sep 27, 2024

  • Let me give an example of a production environment:
CREATE TABLE t(a string, b long, data array<int>, value int) USING PARQUET;
INSERT INTO TABLE t VALUES ('a', 1, array(1, 2, 3, null), null);
INSERT INTO TABLE t VALUES ('b', 2, array(1, 2, 3, 4), 4);
INSERT INTO TABLE t VALUES ('c', 3, array(1, 2, 3, 5), 4);
INSERT INTO TABLE t VALUES ('c', 3, array(1, 2, 3, 4), null);
SELECT * FROM t;
  • A data processor that writes the following SQL statement:
SELECT * FROM t WHERE array_contains(data, value)
  • It is expected to obtain the following data:
+---+---+------------------+-----+
|a  |b  |data              |value|
+---+---+------------------+-----+
|a  |1  |[1, 2, 3,  null]  |null |
|b  |2  |[1, 2, 3, 4]      |4    |
+---+---+------------------+-----+
  • But got:
+---+---+------------+-----+
|a  |b  |data        |value|
+---+---+------------+-----+
|b  |2  |[1, 2, 3, 4]|4    |
+---+---+------------+-----+

@panbingkun
Copy link
Contributor Author

panbingkun commented Sep 27, 2024

Give another counter example:

spark-sql (default)> select array_distinct(array(1, 2, 3, null, 3,null));
[1,2,3,null]
Time taken: 0.055 seconds, Fetched 1 row(s)

spark-sql (default)> select array_union(array(1, 2, 3, null), array(1, 3, 5, null));
[1,2,3,null,5]
Time taken: 0.067 seconds, Fetched 1 row(s)

@github-actions
Copy link

github-actions bot commented Jan 6, 2025

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 6, 2025
@github-actions github-actions bot closed this Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants