fix: array_position nested lists handling & check index for nulls #21791

Open
Jefffrey wants to merge 4 commits into apache:main from Jefffrey:array-position-fix

@Jefffrey Jefffrey commented Apr 22, 2026

Which issue does this PR close?

Rationale for this change

Index nulls

When index is a scalar, we ensure it can't be null:

Some(ColumnarValue::Scalar(ScalarValue::Int64(Some(v)))) => {
    Ok(vec![v - 1; num_rows])
}
Some(ColumnarValue::Scalar(s)) => {
    exec_err!("array_position expected Int64 for start_from, got {s}")
}

  • This matches Postgres, which also errors if the index is null:
postgres=# select array_position(ARRAY['sun', 'mon', 'tue', 'wed', 'thu', 'fri', 'sat'], 'mon', null);
ERROR:  initial position must not be null

However, we don't guard against nulls when the index is an array:

Some(ColumnarValue::Array(a)) => {
    Ok(as_int64_array(a)?.values().iter().map(|&x| x - 1).collect())
}

let arr_from = if args.len() == 3 {
    as_int64_array(&args[2])?
        .values()
        .iter()
        .map(|&x| x - 1)
        .collect::<Vec<_>>()
} else {

  • Both snippets access values() directly, ignoring the array's null buffer
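
To illustrate why reading the raw values is unsafe, here is a minimal sketch in plain Rust rather than arrow's API. `Int64Column`, `start_offsets_unchecked`, and `start_offsets_checked` are hypothetical names; the struct only models the idea that a nullable array keeps a dense values buffer plus a separate validity mask, so the slot under a null still holds some arbitrary number. The error text is borrowed from the scalar arm added in this PR.

```rust
/// Hypothetical stand-in for a nullable Int64 array: a dense values buffer
/// plus a validity mask (`true` = valid). This is not arrow's actual type.
struct Int64Column {
    values: Vec<i64>,
    validity: Vec<bool>,
}

/// Mirrors the unguarded pattern: converts 1-based indexes to 0-based
/// offsets from the raw values and never consults the validity mask.
fn start_offsets_unchecked(col: &Int64Column) -> Vec<i64> {
    col.values.iter().map(|&x| x - 1).collect()
}

/// Guarded variant: fails fast if any slot is null, then converts.
fn start_offsets_checked(col: &Int64Column) -> Result<Vec<i64>, String> {
    if col.validity.iter().any(|&valid| !valid) {
        return Err("array_position index cannot contain nulls".to_string());
    }
    Ok(col.values.iter().map(|&x| x - 1).collect())
}
```

With a column whose second slot is null, the unchecked variant silently turns whatever sits in that slot into an offset, while the checked variant returns an error before any offset is produced.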

Null handling

array_position relies on compare_element_to_list:

pub(crate) fn compare_element_to_list(
    list_array_row: &dyn Array,
    element_array: &dyn Array,
    row_index: usize,
    eq: bool,
) -> Result<BooleanArray> {
    if list_array_row.data_type() != element_array.data_type() {
        return exec_err!(
            "compare_element_to_list received incompatible types: '{:?}' and '{:?}'.",
            list_array_row.data_type(),
            element_array.data_type()
        );
    }
    let element_array_row = element_array.slice(row_index, 1);
    // Compute all positions in list_row_array (that is itself an
    // array) that are equal to `from_array_row`
    let res = match element_array_row.data_type() {
        // arrow_ord::cmp::eq does not support ListArray, so we need to compare it by loop
        DataType::List(_) => {
            // compare each element of the from array
            let element_array_row_inner = as_list_array(&element_array_row)?.value(0);
            let list_array_row_inner = as_list_array(list_array_row)?;
            list_array_row_inner
                .iter()
                // compare element by element the current row of list_array
                .map(|row| {
                    row.map(|row| {
                        if eq {
                            row.eq(&element_array_row_inner)
                        } else {
                            row.ne(&element_array_row_inner)
                        }
                    })
                })
                .collect::<BooleanArray>()
        }
        DataType::LargeList(_) => {
            // compare each element of the from array
            let element_array_row_inner =
                as_large_list_array(&element_array_row)?.value(0);
            let list_array_row_inner = as_large_list_array(list_array_row)?;
            list_array_row_inner
                .iter()
                // compare element by element the current row of list_array
                .map(|row| {
                    row.map(|row| {
                        if eq {
                            row.eq(&element_array_row_inner)
                        } else {
                            row.ne(&element_array_row_inner)
                        }
                    })
                })
                .collect::<BooleanArray>()
        }
        _ => {
            let element_arr = Scalar::new(element_array_row);
            // use not_distinct so we can compare NULL
            if eq {
                arrow_ord::cmp::not_distinct(&list_array_row, &element_arr)?
            } else {
                arrow_ord::cmp::distinct(&list_array_row, &element_arr)?
            }
        }
    };
    Ok(res)
}

When the element type is not a list, we delegate to arrow's not_distinct kernel, which handles nulls correctly. For list types, however, we roll our own comparison, which critically does not account for a null needle:

    let element_array_row = element_array.slice(row_index, 1);
    // Compute all positions in list_row_array (that is itself an
    // array) that are equal to `from_array_row`
    let res = match element_array_row.data_type() {
        // arrow_ord::cmp::eq does not support ListArray, so we need to compare it by loop
        DataType::List(_) => {
            // compare each element of the from array
            let element_array_row_inner = as_list_array(&element_array_row)?.value(0);
            let list_array_row_inner = as_list_array(list_array_row)?;
            list_array_row_inner
                .iter()
                // compare element by element the current row of list_array
                .map(|row| {
                    row.map(|row| {
                        if eq {
                            row.eq(&element_array_row_inner)
                        } else {
                            row.ne(&element_array_row_inner)
                        }
                    })
                })
                .collect::<BooleanArray>()

  • The needle row that element_array_row_inner is read from can be null, but we never check this and assume it is valid
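
The difference between the hand-rolled comparison and `IS NOT DISTINCT FROM` semantics can be sketched in plain Rust, without arrow. `ListValue`, `eq_ignoring_needle_validity`, and `not_distinct` are hypothetical names, with `None` standing for a NULL list. The sketch assumes, purely for illustration, that a NULL needle exposes an empty inner slice when its validity is ignored; the PR does not state what the inner values actually are.

```rust
/// A list value modelled as an optional vector; `None` stands for a NULL list.
type ListValue = Option<Vec<i64>>;

/// Models the problematic shape: the needle's inner values are used without
/// checking whether the needle itself is NULL, and NULL haystack rows
/// propagate as `None` instead of being compared.
fn eq_ignoring_needle_validity(
    haystack: &[ListValue],
    needle_inner: &[i64],
) -> Vec<Option<bool>> {
    haystack
        .iter()
        .map(|row| row.as_ref().map(|r| r.as_slice() == needle_inner))
        .collect()
}

/// `IS NOT DISTINCT FROM` semantics: NULL matches NULL, NULL never matches a
/// non-NULL value, and the result itself is never NULL.
fn not_distinct(haystack: &[ListValue], needle: &ListValue) -> Vec<bool> {
    haystack.iter().map(|row| row == needle).collect()
}
```

Under that assumption, searching `[[1, 2], NULL, []]` for a NULL needle with the first function reports a match on the empty list and a NULL result for the NULL row, whereas the second function matches only the NULL row.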

What changes are included in this PR?

Fix handling of the index when it is an array so that we fail fast if it contains any nulls.

Revamp compare_element_to_list into a new version which handles nulls properly by using make_comparator from arrow.

Are these changes tested?

Yes.

Are there any user-facing changes?

Corrected null handling in array_position and array_positions.

@github-actions github-actions bot added the documentation, sqllogictest, and functions labels Apr 22, 2026
}
Some(ColumnarValue::Scalar(s)) if s.is_null() => {
    exec_err!("array_position index cannot contain nulls")
}

Contributor Author

Technically the arm below already caught this case, but I added this arm to give a more specific error message.

/// `needle_element_index`, return a `BooleanArray` based on whether the elements
/// in `haystack` match the `needle` value using `IS NOT DISTINCT FROM` semantics.
/// - Allows NULL = NULL to be considered true
pub(crate) fn compare_element_to_list_fixed<const IS_LIST: bool>(

Contributor Author

I added a new version as I didn't want to affect array_remove/array_replace yet; I'll tackle them in followups

(this is why I also omitted the eq parameter, as for now this is only used by position anyway)

    needle: &dyn Array,
    needle_element_index: usize,
) -> Result<BooleanArray> {
    if IS_LIST {

Contributor Author

I figured it would be better not to do this datatype check inside the function, since it sits in the hot loop (it is called for every row), so I pulled the check out into a const generic.
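
As a rough illustration of that design choice, the sketch below hoists a type check out of a per-row loop with a const generic. `describe_path` and `run` are hypothetical names and the bodies are placeholders; the point is only that the `IS_LIST` branch is resolved at compile time, so the caller inspects the type once and each instantiation contains a single path.

```rust
/// Hypothetical per-row kernel. `IS_LIST` is a compile-time constant, so each
/// monomorphized copy of this function keeps only one of the two branches.
fn describe_path<const IS_LIST: bool>() -> &'static str {
    if IS_LIST {
        "list comparator path"
    } else {
        "primitive kernel path"
    }
}

/// Inspects the type once, outside the per-row loop, and then calls the
/// matching instantiation for every row.
fn run(is_list_type: bool, num_rows: usize) -> Vec<&'static str> {
    if is_list_type {
        (0..num_rows).map(|_| describe_path::<true>()).collect()
    } else {
        (0..num_rows).map(|_| describe_path::<false>()).collect()
    }
}
```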

let res = (0..haystack.len())
    .map(|i| cmp(i, needle_element_index).is_eq())
    .collect::<BooleanArray>();
Ok(res)

Contributor Author

This is the main fix: we now use a comparator, which handles nulls on both sides properly (null = null is true).
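
The comparator pattern can be sketched in plain Rust as follows. `build_comparator` is a hypothetical stand-in for arrow's make_comparator, whose real signature is not reproduced here, and `positions_equal` is likewise an illustrative name. The sketch relies on the standard ordering of `Option`, where `None` compares equal to `None` and less than any `Some`.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a comparator factory: built once from both
/// columns, it returns a closure that compares `left[i]` with `right[j]`.
/// `Option`'s ordering treats `None` as equal to `None`, which gives
/// null = null the "equal" outcome wanted here.
fn build_comparator<'a>(
    left: &'a [Option<Vec<i64>>],
    right: &'a [Option<Vec<i64>>],
) -> impl Fn(usize, usize) -> Ordering + 'a {
    move |i, j| left[i].cmp(&right[j])
}

/// Marks every haystack position that is not distinct from the needle row.
fn positions_equal(
    haystack: &[Option<Vec<i64>>],
    needle: &[Option<Vec<i64>>],
    needle_index: usize,
) -> Vec<bool> {
    let cmp = build_comparator(haystack, needle);
    (0..haystack.len())
        .map(|i| cmp(i, needle_index).is_eq())
        .collect()
}
```

Because the comparator is built once per call and only indexed inside the loop, the per-row work reduces to a single closure call followed by `is_eq()`.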

[1] []


query II? rowsort

Contributor Author

Some of these might overlap, but I wanted to get a comprehensive overview of the behaviours in a single test case; also this ensures we run the array path instead of just the scalar path

@Jefffrey Jefffrey marked this pull request as ready for review April 22, 2026 18:14
#NULL
# expected to return null
query error DataFusion error: Execution error: array_positions does not support type 'Null'
select array_positions(null, 1);

Contributor Author

Unrelated to this fix, but I'm enabling it now so that when we resolve #7142 in the future we'll know to fix this test as well.

Labels

documentation, functions, sqllogictest

Development

Successfully merging this pull request may close these issues.

array_position doesn't check nulls in array index & fails to handle nulls properly

1 participant