feat: add cleanup_non_empty_nulls kernel#9970
Conversation
This is useful in: 1. lambda function on lists - since we need to remove the values that are not null 2. `explode` sql function 3. have kernels that use the list values without need to check if the value should be processed or not - for example implementing `array_distinct` which is keeping in each list the unique items
| assert_eq!(cleaned.nulls(), Some(&input_nulls)); | ||
| } | ||
|
|
||
| // ===== Underlying child array is sliced ===== |
There was a problem hiding this comment.
Should the children arrays below be sliced? The first test below looks very similiar as list_cleanup_nulls_with_null_pointing_to_non_empty_list_and_have_empty_list in line 464
| unsafe { Buffer::from_trusted_len_iter(iter) } | ||
| }; | ||
|
|
||
| let cleanup_array = crate::take::take( |
There was a problem hiding this comment.
non-blocking: for lists/maps with big chunks of valid or empty sublists, is possible that using MutableArrayData directly be faster, since we can copy the given chunk in a single memcpy for some data types, and perform less dynamic dispatches compared to take_list?
Now for string/binary I'm not sure since is static
| return Ok(Arc::new(self.clone())); | ||
| }; | ||
|
|
||
| // Find an empty value so we can use the `take` kernel |
comphead
left a comment
There was a problem hiding this comment.
Thanks @rluvaton @gstvg before expanding arrow-rs lets try to scope the concern.
Ideally to include in the PR description the explanation why this problem is happening, what is the reason of ArrayRef require cleaning, because under usual circumstances it should not. When we discussed the change in apache/datafusion#22158 (comment) I feel the scope was to explore if BooleanBuffer can calculate has_false without adding another BooleanBuilder wrapper on top of it
Which issue does this PR close?
N/A
Rationale for this change
when working with lists or variable size arrays you cant operate on the underlying values/bytes of variable length array as is as nulls might point to non empty values
Cases when this is useful:
are not null
explodesql function - list values behind nulls cannot be keptvalue should be processed or not - for example implementing
array_distinctwhich is keeping in each list the unique itemsWhat changes are included in this PR?
added to
arrow-selectcleanup_non_empty_nullsmodule which include 2 functionscleanup_non_empty_nullswhich is the logic for removing non empty nulls valueshas_non_empty_nullswhich can be called before calling thecleanup_non_empty_nullsfunction to check if the expensive work is even neededOriginally I wanted to add the function on
ListArrayandStringArrayand so on, but because the use of take and interleave we cannot do thatAre these changes tested?
Yes
Are there any user-facing changes?
yes, new kernel