Skip to content

feat: add cleanup_non_empty_nulls kernel#9970

Open
rluvaton wants to merge 7 commits into
apache:mainfrom
rluvaton:add-cleanup-non-empty-nulls-kernel
Open

feat: add cleanup_non_empty_nulls kernel#9970
rluvaton wants to merge 7 commits into
apache:mainfrom
rluvaton:add-cleanup-non-empty-nulls-kernel

Conversation

@rluvaton
Copy link
Copy Markdown
Member

Which issue does this PR close?

N/A

Rationale for this change

when working with lists or variable size arrays you cant operate on the underlying values/bytes of variable length array as is as nulls might point to non empty values

Cases when this is useful:

  1. lambda function on lists - since we need to remove the values that
    are not null
  2. explode sql function - list values behind nulls cannot be kept
  3. have kernels that use the list values without need to check if the
    value should be processed or not - for example implementing
    array_distinct which is keeping in each list the unique items

What changes are included in this PR?

added to arrow-select cleanup_non_empty_nulls module which include 2 functions

  1. cleanup_non_empty_nulls which is the logic for removing non empty nulls values
  2. has_non_empty_nulls which can be called before calling the cleanup_non_empty_nulls function to check if the expensive work is even needed
  3. Added benchmarks for cleanup

Originally I wanted to add the function on ListArray and StringArray and so on, but because the use of take and interleave we cannot do that

Are these changes tested?

Yes

Are there any user-facing changes?

yes, new kernel

rluvaton added 5 commits May 13, 2026 21:55
This is useful in:
1. lambda function on lists - since we need to remove the values that
are not null
2. `explode` sql function
   3. have kernels that use the list values without need to check if the
   value should be processed or not - for example implementing
   `array_distinct` which is keeping in each list the unique items
@github-actions github-actions Bot added the arrow Changes to the arrow crate label May 13, 2026
@rluvaton rluvaton marked this pull request as ready for review May 13, 2026 19:49
@rluvaton
Copy link
Copy Markdown
Member Author

Cc @gstvg @comphead (for lambda support)

assert_eq!(cleaned.nulls(), Some(&input_nulls));
}

// ===== Underlying child array is sliced =====
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the children arrays below be sliced? The first test below looks very similiar as list_cleanup_nulls_with_null_pointing_to_non_empty_list_and_have_empty_list in line 464

unsafe { Buffer::from_trusted_len_iter(iter) }
};

let cleanup_array = crate::take::take(
Copy link
Copy Markdown
Contributor

@gstvg gstvg May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: for lists/maps with big chunks of valid or empty sublists, is possible that using MutableArrayData directly be faster, since we can copy the given chunk in a single memcpy for some data types, and perform less dynamic dispatches compared to take_list?

Now for string/binary I'm not sure since is static

return Ok(Arc::new(self.clone()));
};

// Find an empty value so we can use the `take` kernel
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we can take that as granted, but take should already clean up nulls for list/maps and bytes

Copy link
Copy Markdown
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rluvaton @gstvg before expanding arrow-rs lets try to scope the concern.

Ideally to include in the PR description the explanation why this problem is happening, what is the reason of ArrayRef require cleaning, because under usual circumstances it should not. When we discussed the change in apache/datafusion#22158 (comment) I feel the scope was to explore if BooleanBuffer can calculate has_false without adding another BooleanBuilder wrapper on top of it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants