fix: array_position nested lists handling & check index for nulls#21791
fix: array_position nested lists handling & check index for nulls#21791Jefffrey wants to merge 4 commits intoapache:mainfrom
Conversation
| } | ||
| Some(ColumnarValue::Scalar(s)) if s.is_null() => { | ||
| exec_err!("array_position index cannot contain nulls") | ||
| } |
There was a problem hiding this comment.
Technically this was checked by the below arm, but I added this as a more specific error message
| /// `needle_element_index`, return a `BooleanArray` based on whether the elements | ||
| /// in `haystack` match the `needle` value using `IS NOT DISTINCT FROM` semantics. | ||
| /// - Allows NULL = NULL to be considered true | ||
| pub(crate) fn compare_element_to_list_fixed<const IS_LIST: bool>( |
There was a problem hiding this comment.
I added a new version as I didn't want to affect array_remove/array_replace yet; I'll tackle them in followups
(this is why I also omitted the eq parameter, as for now this is only used by position anyway)
| needle: &dyn Array, | ||
| needle_element_index: usize, | ||
| ) -> Result<BooleanArray> { | ||
| if IS_LIST { |
There was a problem hiding this comment.
I figured it would be a better idea to not do this datatype check inside, as this function is in the hotloop (called for every row), so pulled it into a const generic
| let res = (0..haystack.len()) | ||
| .map(|i| cmp(i, needle_element_index).is_eq()) | ||
| .collect::<BooleanArray>(); | ||
| Ok(res) |
There was a problem hiding this comment.
Main fix here, using comparator now which handles nulls on both sides properly (null = null is true)
| [1] [] | ||
|
|
||
|
|
||
| query II? rowsort |
There was a problem hiding this comment.
Some of these might overlap, but I wanted to get a comprehensive overview of the behaviours in a single test case; also this ensures we run the array path instead of just the scalar path
| #NULL | ||
| # expected to return null | ||
| query error DataFusion error: Execution error: array_positions does not support type 'Null' | ||
| select array_positions(null, 1); |
There was a problem hiding this comment.
Unrelated to this fix, but enabling it now so when we resolve #7142 in the future we'll know to also fix this test as well
Which issue does this PR close?
array_positiondoesn't check nulls in array index & fails to handle nulls properly #21792Rationale for this change
Index nulls
When
indexis a scalar, we ensure it can't be null:datafusion/datafusion/functions-nested/src/position.rs
Lines 209 to 214 in 766dff1
However we don't guard nulls when it is an array:
datafusion/datafusion/functions-nested/src/position.rs
Lines 215 to 217 in 766dff1
datafusion/datafusion/functions-nested/src/position.rs
Lines 308 to 314 in 766dff1
values()directly, ignoring parent null bufferNull handling
array_positionrelies oncompare_element_to_list:datafusion/datafusion/functions-nested/src/utils.rs
Lines 150 to 221 in 766dff1
When the type to be checked is not a list, we pass through to arrow
not_distinctkernel which handles null accordingly. However, we roll our own verification for list types, which critically does not account for null in the needle:datafusion/datafusion/functions-nested/src/utils.rs
Lines 164 to 187 in 766dff1
element_array_row_innercan be null, but we never check this and assume it is validWhat changes are included in this PR?
Fix handling of index when it is an array to error fast if we detect any nulls.
Revamp
compare_element_to_listinto a new version which handles nulls properly by usingmake_comparatorfrom arrow.Are these changes tested?
Yes.
Are there any user-facing changes?
Corrected behaviour for null handling of array_position[s]