[Incompatibility] Document arrays_overlap null handling differences#3364
Draft
csbiy wants to merge 1 commit intoapache:mainfrom
Draft
[Incompatibility] Document arrays_overlap null handling differences#3364csbiy wants to merge 1 commit intoapache:mainfrom
csbiy wants to merge 1 commit intoapache:mainfrom
Conversation
This commit addresses issue apache#3175 by documenting the specific null handling incompatibility in the arrays_overlap function between Spark and Comet. Changes: 1. Updated CometArraysOverlap.getSupportLevel() in arrays.scala to provide a detailed explanation of the incompatibility with a concrete example. 2. Added comprehensive test cases in CometArrayExpressionSuite.scala to verify and document the null handling behavior, including: - Common element exists (returns true) - No common elements, no nulls (returns false) - No common elements with null present (Spark: null, Comet: false) - Common element with null present (returns true) - Both arrays with null but no common elements - Empty array cases 3. Updated expressions.md documentation to inform users of the specific three-valued logic difference with a clear example. Root Cause: Comet uses DataFusion's array_has_any function, which returns false when no common elements are found, regardless of null presence. Spark follows SQL's three-valued logic and returns null when the result is indeterminate. Example: - arrays_overlap(array(1, null, 3), array(4, 5)) * Spark: null (indeterminate due to null) * Comet: false Related to apache#3175
Contributor
|
Thank you very much for documenting these @csbiy . |
Contributor
|
Please take a look at the comet contributor guide which should help you with PR : I generally run the following commands to clean up the code before raising the PR: |
coderfender
reviewed
Feb 3, 2026
| "Null handling differs from Spark: DataFusion's array_has_any returns false when no " + | ||
| "common elements are found, even if null elements exist. Spark returns null in such " + | ||
| "cases following three-valued logic (SQL standard). Example: " + | ||
| "arrays_overlap(array(1, null, 3), array(4, 5)) returns null in Spark but false in Comet.")) |
Contributor
There was a problem hiding this comment.
Thank you for writing this. I would probably keep the message short and route to the documentation to help declutter the logs :)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR addresses issue #3175 by documenting the specific null handling incompatibility between Spark and Comet for the
arrays_overlapfunction.Changes Made
1. Code Documentation (
arrays.scala)Updated
CometArraysOverlap.getSupportLevel()to returnIncompatible(Some(...))with detailed explanation and concrete example.Before:
After:
2. Test Coverage (
CometArrayExpressionSuite.scala)Added comprehensive test
arrays_overlap - null handling behavior verificationwith 6 test cases:truefalsenull, Comet:false(documented incompatibility)true3. User Documentation (
expressions.md)Updated the Array Expressions table with specific explanation and example showing the three-valued logic difference.
Root Cause Analysis
Spark Behavior (Three-Valued Logic)
Spark follows SQL's three-valued logic (true, false, null):
trueif common elements foundfalseif no common elements AND no nullsnullif no common elements BUT nulls exist (indeterminate)Comet Behavior
Comet uses DataFusion's
array_has_anyfunction:trueif common elements foundfalsein all other cases (no three-valued logic support)Example Demonstrating Incompatibility
nullfalseWhy This Matters
Users who enable
arrays_overlapwithspark.comet.expression.ArraysOverlap.allowIncompatible=trueneed to understand:Testing Notes
Local test execution encountered environment Java version compatibility issues (unrelated to code changes). Test code is syntactically correct and follows existing patterns. CI environment should run tests successfully with proper Java configuration.
Files Modified
Closes
Closes #3175
Checklist