Skip to content

Conversation

@brancz
Copy link
Contributor

@brancz brancz commented Nov 28, 2025

Which issue does this PR close?

Closes #16011

What changes are included in this PR?

Support for hashing of REE arrays

Are these changes tested?

Unit tests are added.

Are there any user-facing changes?

No, strictly additive behavior/new feature.

@alamb @vegarsti

@github-actions github-actions bot added the common Related to common crate label Nov 28, 2025
@brancz brancz mentioned this pull request Nov 28, 2025
4 tasks
@brancz brancz changed the title Add hashing of REE arrays and add REE aggregation example common: Add hashing support for REE arrays Nov 28, 2025
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're aiming to close #16011 with this PR, perhaps we should add an end-to-end test showcasing REE arrays can be aggregated on?

let mut values_hashes = vec![0u64; values_len];
create_hashes(&[Arc::clone(values)], random_state, &mut values_hashes)?;

// Get run ends buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove these verbose, LLM-like comments throughout the PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do the same for some of the test comments as well?

@brancz
Copy link
Contributor Author

brancz commented Nov 28, 2025

If we're aiming to close #16011 with this PR, perhaps we should add an end-to-end test showcasing REE arrays can be aggregated on?

Can do! Where do you think is the right place for a test like that?

@brancz
Copy link
Contributor Author

brancz commented Nov 28, 2025

I have an example, but maybe in the interest of keeping PRs small and easily reviewable, I'll contribute the end-to-end example in a follow up?

@brancz
Copy link
Contributor Author

brancz commented Nov 28, 2025

Let me know how you feel about the clippy lints, I personally find the suggestions from clippy in this case far harder to read since we're dealing with indexes, so I'd be inclined to add ignores, but let me know what you think!

@martin-g
Copy link
Member

Let me know how you feel about the clippy lints,

I prefer the clippy suggested way. Using indices is more C-style.
But this is just my personal preference !

@brancz
Copy link
Contributor Author

brancz commented Nov 28, 2025

Adjusted to using mutable iterators.

}

// Downcast ArrayRef to RunArray
pub fn as_run_array<R: RunEndIndexType>(array: &dyn Array) -> Result<&RunArray<R>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function still needed ?
Its usage has been replaced with downcast_run_array!() and now it is not used anywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be useful to have anyway, following suit with similar functions for other types above

let mut prev_run_end = 0;

for (i, value_hash) in values_hashes.iter().enumerate().take(values_len) {
let run_end = run_ends.values()[i].as_usize();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may fail for a sliced RunArray.
Sliced RunArray has the same values but sliced run_ends, so their indices do not match 1:1.
https://docs.rs/arrow-array/latest/src/arrow_array/array/run_array.rs.html#247-253

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could zip the run_ends.values().iter() as part of the for clause?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a very good point, let me see what I can do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

for (i, value_hash) in values_hashes.iter().enumerate().take(values_len) {
let run_end = run_ends.values()[i].as_usize();

if rehash {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other functions check whether the current value is valid/non-null before hashing it when rehash == true. Is this needed here too ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you're right, we need to. Combining hashes modifies the hash even on 0 hashes (which makes sense). So we do need to care.

*hash = combine_hashes(*value_hash, *hash);
}
} else {
for hash in hashes_buffer.iter_mut().take(run_end).skip(prev_run_end) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be optimized to hashes_buffer[prev_run_end..run_end].fill(value_hash) (SIMD friendly!).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@Jefffrey
Copy link
Contributor

I have an example, but maybe in the interest of keeping PRs small and easily reviewable, I'll contribute the end-to-end example in a follow up?

I think it's fine to add the tests in this PR. Perhaps we can add the test as an SLT, assuming arrow_cast supports REE arrays?

let run_ends = array.run_ends();
let mut prev_run_end = 0;

for (i, value_hash) in values_hashes.iter().enumerate().take(values_len) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (i, value_hash) in values_hashes.iter().enumerate().take(values_len) {
for (i, value_hash) in values_hashes.iter().enumerate() {

@brancz
Copy link
Contributor Author

brancz commented Nov 29, 2025

In addition to the comments, I also realized that we were also always hashing all values, which is unnecessary, so I changed the implementation to first figure out via the runs which values are relevant and only hash those and then process them accordingly.

@brancz
Copy link
Contributor Author

brancz commented Nov 29, 2025

Ok I went down a bit of a rabbithole, and I have a working SLT, but I think it's better as a follow up, as we need arrow 57.1.0 as we need apache/arrow-rs#8765 and another patch in datafusion for the arrow cast to work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Aggregating by RunArrays

4 participants