[SPARK-49794][SQL] Fix the expression ArrayContains bug where value is null #48261
panbingkun wants to merge 2 commits into apache:master
StructField("b", IntegerType)))
val data = Seq(Row(Seq[Integer](1, 2, 3, null), null))
val df1 = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
checkAnswer(df1.select(array_contains(col("a"), col("b"))), Seq(Row(true)))
I'm not sure whether this counts as a breaking change, but I think it is the correct fix.
Yea I think this is common sense. Do other systems have
Current behavior seems to match how we handle IN lists (and those are in the SQL standard): Fundamentally, if array_contains() uses equality to establish whether an element belongs to an array, then NULL = NULL returns NULL, not true. See also https://docs.databricks.com/en/sql/language-manual/functions/array_contains.html#examples
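To make the three-valued reasoning above concrete, here is a minimal plain-Scala sketch (no Spark dependency) of how the current array_contains result can be modelled. The names ArrayContainsModel, sqlEquals, and arrayContains are hypothetical and exist only for illustration; None stands in for SQL NULL, and a NULL array argument is not modelled.

```scala
// Illustrative model only: this is not Spark's implementation.
// None represents SQL NULL; Some(b) represents a known boolean.
object ArrayContainsModel {
  // SQL equality: if either operand is NULL, the result is NULL.
  def sqlEquals(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Models the existing behavior discussed in this thread:
  //   - a NULL search value yields NULL,
  //   - a matching element yields true,
  //   - no match but a NULL element in the array yields NULL,
  //   - otherwise false.
  def arrayContains(arr: Seq[Option[Int]], value: Option[Int]): Option[Boolean] =
    if (value.isEmpty) None
    else {
      val results = arr.map(sqlEquals(_, value))
      if (results.contains(Some(true))) Some(true)
      else if (results.contains(None)) None
      else Some(false)
    }

  def main(args: Array[String]): Unit = {
    // array(1, 2, 3, null) searched for NULL: the comparison is unknown.
    println(arrayContains(Seq(Some(1), Some(2), Some(3), None), None))       // None
    // array(1, 2, 3, 4) searched for 4: a definite match.
    println(arrayContains(Seq(Some(1), Some(2), Some(3), Some(4)), Some(4))) // Some(true)
    // array(1, 2, 3, 5) searched for 4: a definite miss.
    println(arrayContains(Seq(Some(1), Some(2), Some(3), Some(5)), Some(4))) // Some(false)
  }
}
```

Under this model a WHERE clause keeps only rows whose result is Some(true), so a row with a NULL search value is filtered out even when the array itself contains a NULL element.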
I have also noticed this phenomenon before, which is quite common in Spark's expression functions. I guess the
This is really
Yea, that's it (
CREATE TABLE t(a string, b long, data array<int>, value int) USING PARQUET;
INSERT INTO TABLE t VALUES ('a', 1, array(1, 2, 3, null), null);
INSERT INTO TABLE t VALUES ('b', 2, array(1, 2, 3, 4), 4);
INSERT INTO TABLE t VALUES ('c', 3, array(1, 2, 3, 5), 4);
INSERT INTO TABLE t VALUES ('c', 3, array(1, 2, 3, 4), null);
SELECT * FROM t;
SELECT * FROM t WHERE array_contains(data, value);
Here is another counterexample:
spark-sql (default)> select array_distinct(array(1, 2, 3, null, 3, null));
[1,2,3,null]
Time taken: 0.055 seconds, Fetched 1 row(s)
spark-sql (default)> select array_union(array(1, 2, 3, null), array(1, 3, 5, null));
[1,2,3,null,5]
Time taken: 0.067 seconds, Fetched 1 row(s)
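The two results above are what one would get if NULL elements were compared with null-safe equality, where NULL matches NULL. A minimal plain-Scala sketch of that reading follows; NullSafeSetOps, arrayDistinct, and arrayUnion are hypothetical names used only for illustration, with None standing in for SQL NULL. It models the observed output and makes no claim about how Spark implements these functions.

```scala
// Illustrative model only: None represents SQL NULL.
// Option equality in Scala is null-safe in the SQL sense: None == None is true.
object NullSafeSetOps {
  // Keep the first occurrence of each element, treating NULLs as equal.
  def arrayDistinct(arr: Seq[Option[Int]]): Seq[Option[Int]] = arr.distinct

  // Concatenate, then de-duplicate with the same null-safe comparison.
  def arrayUnion(a: Seq[Option[Int]], b: Seq[Option[Int]]): Seq[Option[Int]] =
    (a ++ b).distinct

  def main(args: Array[String]): Unit = {
    // Mirrors array_distinct(array(1, 2, 3, null, 3, null)) -> [1,2,3,null]
    println(arrayDistinct(Seq(Some(1), Some(2), Some(3), None, Some(3), None)))
    // Mirrors array_union(array(1, 2, 3, null), array(1, 3, 5, null)) -> [1,2,3,null,5]
    println(arrayUnion(
      Seq(Some(1), Some(2), Some(3), None),
      Seq(Some(1), Some(3), Some(5), None)))
  }
}
```

The contrast with array_contains is that here two NULL elements collapse into one, which is the inconsistency the counterexample points at.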
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.


What changes were proposed in this pull request?
The PR aims to fix the expression ArrayContains bug where value is null.

Why are the changes needed?
Note: Obviously, the result of array_contains(data, value) does not meet common sense expectations.

Does this PR introduce any user-facing change?
Yes, it corrects incorrect results from ArrayContains.

How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No.