@tustvold
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Storing hashes in GroupOrdering was causing merge conflicts for #7016 and is not actually necessary.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the core (Core DataFusion crate) label on Jul 19, 2023
Contributor

@alamb left a comment

Looks great to me

FYI @mustafasrepo and @ozankabak -- this effectively should improve the speed of streamed / bounded group by

for (idx, &hash) in hashes.iter().enumerate() {
self.map.insert(hash, (hash, idx), |(hash, _)| *hash);
self.group_ordering.remove_groups(n);
// SAFETY: self.map outlives iterator and is not modified concurrently

Comment on lines 628 to 634
unsafe {
    for bucket in self.map.iter() {
        match bucket.as_ref().1.checked_sub(n) {
            None => self.map.erase(bucket),
            Some(sub) => bucket.as_mut().1 = sub,
        }
    }

I think this is both wonderfully elegant and cryptic. How about some comments, so I don't have to figure this out again the next time I see this code:

Suggested change
- unsafe {
-     for bucket in self.map.iter() {
-         match bucket.as_ref().1.checked_sub(n) {
-             None => self.map.erase(bucket),
-             Some(sub) => bucket.as_mut().1 = sub,
-         }
-     }
+ unsafe {
+     for bucket in self.map.iter() {
+         // decrement group index by n
+         match bucket.as_ref().1.checked_sub(n) {
+             // group index was < n, so remove from table
+             None => self.map.erase(bucket),
+             // group index was >= n, shift value down
+             Some(sub) => bucket.as_mut().1 = sub,
+         }
+     }

I double checked https://docs.rs/hashbrown/latest/hashbrown/raw/struct.RawIter.html

You must not free the hash table while iterating (including via growing/shrinking).
It is fine to erase a bucket that has been yielded by the iterator.
Erasing a bucket that has not yet been yielded by the iterator may still result in the iterator yielding that bucket (unless reflect_remove is called).
It is unspecified whether an element inserted after the iterator was created will be yielded by that iterator (unless reflect_insert is called).
The order in which the iterator yields bucket is unspecified and may change in the future.

Which seems to be followed 👍
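
For anyone reading this thread later, here is a minimal safe sketch of the same "shift down by n, drop what underflows" logic. It is not the code in this PR: it swaps hashbrown's raw table and `unsafe` iteration for `std::collections::HashMap::retain`. The function name `remove_first_n_groups` and the `String` key type are hypothetical, chosen only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the table in this PR: maps a group key to its
/// group index. After the first `n` groups are emitted, every remaining
/// index must shift down by `n`, and the emitted entries must be removed.
fn remove_first_n_groups(map: &mut HashMap<String, usize>, n: usize) {
    map.retain(|_key, group_idx| match group_idx.checked_sub(n) {
        // group index was >= n: shift it down by n and keep the entry
        Some(shifted) => {
            *group_idx = shifted;
            true
        }
        // group index was < n: the subtraction underflows, so the group
        // was among the first n and its entry is dropped
        None => false,
    });
}
```

With four groups indexed 0 through 3 and `n = 2`, the entries for indices 0 and 1 are removed and the remaining two are renumbered 0 and 1. `checked_sub` does both jobs at once, since it returns `None` exactly when the index is below `n`.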

@tustvold merged commit a3db191 into apache:main on Jul 19, 2023