Emit aggregation groups in chunks to avoid blocking async runtime #18906
Conversation
Force-pushed from cd69062 to 3dc9515, then from 3dc9515 to cc01921.
```rust
self.emission_offset = end;
// Once every stored group has been emitted, release the storage and
// reset the cursor.
if self.emission_offset == group_values.num_rows() {
    group_values.clear();
    self.emission_offset = 0;
}
```
I'm not sure I fully understand this. Looking at the previous code, it does look like it was already emitting groups incrementally.

The difference I see is that this new code tracks the offset with `self.emission_offset`, whereas the other side of the `if` statement mutates `self.group_values` in place to trim the results that have already been emitted.
The `emission_offset` optimization just avoids the expensive "copy remaining rows" operation that the old `EmitTo::First` path performed during input processing. The real fix is replacing `emit(EmitTo::All)` (which blocks for seconds at large group counts) with an incremental drain: https://github.com/apache/datafusion/pull/18906/files#diff-69c8ecaca5e2c7005f2ed1facaa41f80b45bfd006f2357e53ff3072f535c287dR1196
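To illustrate the difference, here is a minimal, self-contained sketch (the `Row` type and both function names are hypothetical, not the DataFusion code): the copy-based path pays for a shift of all remaining rows on every emission, while the offset-based path only advances a cursor.

```rust
type Row = Vec<u8>;

// Copy-based emission (EmitTo::First style): emit the first n rows and
// shift the remainder to the front, O(remaining rows) per emission.
fn emit_first_by_copy(rows: &mut Vec<Row>, n: usize) -> Vec<Row> {
    let n = n.min(rows.len());
    rows.drain(..n).collect()
}

// Offset-based emission: slice from a cursor and advance it; no data
// movement. The storage is cleared once after the last chunk.
fn emit_with_offset(rows: &[Row], offset: &mut usize, n: usize) -> Vec<Row> {
    let end = (*offset + n).min(rows.len());
    let chunk = rows[*offset..end].to_vec();
    *offset = end;
    chunk
}
```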
Doesn't that mean then that the `else` part (when `drain_mode == false`) is now dead code? FWIW, I can put a `panic!()` here:

```diff
     output
 } else {
+    panic!("foo");
     let groups_rows = group_values.iter().take(n);
     let output = self.row_converter.convert_rows(groups_rows)?;
```

and all the tests in the crate pass.
I think it's needed for `emit_early_if_necessary()` (Partial mode under memory pressure) and for `group_ordering.emit_to()` (when the input is sorted or partially sorted). Those two cases can trigger incremental / early emission.

I added a test case for sorted input in c4903b6 that triggers that path.
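To make those two triggers concrete, here is a hedged, self-contained sketch with stand-in types (`should_emit_early` and its parameters are hypothetical; the real `EmitTo` lives in DataFusion but is simplified here):

```rust
// Stand-in for DataFusion's EmitTo, simplified for this sketch.
enum EmitTo {
    All,
    First(usize),
}

#[derive(Clone, Copy)]
enum AggregateMode {
    Partial,
    Final,
}

// Hypothetical decision helper combining the two early-emission cases:
// 1. Partial mode under memory pressure flushes the accumulated groups.
// 2. Sorted / partially sorted input allows emitting completed groups,
//    analogous to what group_ordering.emit_to() reports.
fn should_emit_early(
    mode: AggregateMode,
    under_memory_pressure: bool,
    completed_sorted_groups: Option<usize>,
) -> Option<EmitTo> {
    if matches!(mode, AggregateMode::Partial) && under_memory_pressure {
        return Some(EmitTo::All);
    }
    completed_sorted_groups
        .filter(|&n| n > 0)
        .map(EmitTo::First)
}
```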
I'll reframe: if during the benchmarks you see that the total execution time is the same with the old code versus this new one that executes with `drain_mode == true`, how about just leaving `group_values/row.rs` as it was?

If there's actually no noticeable performance improvement, we might save some lines of code and complexity by keeping the old path here in `group_values/row.rs`, since the actual improvement is happening in `row_hash.rs`.
I can give that a try, but it will be less efficient for sure. There is no value in retaining the state of the hash map when draining, and you can see that the `drain_mode == false` path does much heavier work than `drain_mode == true`.
alamb left a comment:
Thanks @ahmed-mez -- do you have any measurements / benchmarks that illustrate how this change improves performance?
Your initial PR description in #18907 (comment) sounds like maybe a separate thread pool may be more appropriate?
I augmented the reproducer test case with some stats to clarify the benefit:
Note: The first poll in chunked (2.21s) includes input processing (building the hash table). This is unavoidable and the same in both approaches.
```diff
-    let batch = self.emit(EmitTo::All, false)?;
-    batch.map_or(ExecutionState::Done, ExecutionState::ProducingOutput)
+    ExecutionState::DrainingGroups
```
Ok, I think I understand better what's happening now; correct me if I'm wrong:

- The previous implementation, upon finishing accumulating all groups, bundled everything into one big `RecordBatch` and then proceeded to yield slices of it respecting the configured `batch_size`.
- The current implementation, upon finishing accumulating all groups, bundles nothing; instead each `RecordBatch` of size `batch_size` is bundled on-demand as the stream gets polled.
If that's the case, I imagine there's no reason to keep both modes of emitting output, right? Would there still be any situation where we want the previous behavior?
> Ok, I think I understand better what's happening now; correct me if I'm wrong:
>
> - The previous implementation, upon finishing accumulating all groups, bundled everything into one big `RecordBatch` and then proceeded to yield slices of it respecting the configured `batch_size`.
> - The current implementation, upon finishing accumulating all groups, bundles nothing; instead each `RecordBatch` of size `batch_size` is bundled on-demand as the stream gets polled.
Yes, that's pretty much it.
> If that's the case, I imagine there's no reason to keep both modes of emitting output, right? Would there still be any situation where we want the previous behavior?
Are you referring to `EmitTo::All` vs `EmitTo::First(n)`? Consolidating them can make sense; however, I also see `EmitTo::All` being used in many other paths (e.g. when spilling), so I'm not confident enough to make the call. Happy to update the code accordingly if it's recommended.
I was more referring to `ExecutionState::ProducingOutput` and `ExecutionState::DrainingGroups`. If we are capable of draining the output by generating small chunks on-demand, I imagine there's no value in also keeping the previous full-batch production.
The way I see it, `ExecutionState::ProducingOutput` is not doing anything wrong per se; it's still useful for producing results while still reading input (e.g. in the sorted-input scenario), which requires different (more involved) handling compared to `ExecutionState::DrainingGroups`.

The fundamental problem was in the way `set_input_done_and_produce_output` emitted results, not in the `ExecutionState::ProducingOutput` implementation.

I do acknowledge though that the current PR added some complexity to the code flow, and I'm happy to explore ways to simplify it.
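For reference, a rough sketch of how the states relate (variant names from this PR; the doc comments and the `RecordBatch` payload are a paraphrase of the behavior, not the exact `row_hash.rs` definition):

```rust
use arrow::record_batch::RecordBatch;

enum ExecutionState {
    /// Consuming input batches and updating group state.
    ReadingInput,
    /// Emitting an already-materialized batch while input may still be
    /// arriving (sorted input, or Partial mode under memory pressure).
    ProducingOutput(RecordBatch),
    /// Input is exhausted: drain the remaining groups one batch_size
    /// chunk per poll instead of materializing everything up front.
    DrainingGroups,
    /// All output has been emitted.
    Done,
}
```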
Why is this unavoidable? Can't we insert yield points in the hash table build as well?
One problem at a time; I'd like to keep the scope limited and easy to review 😅 Also, to be fair, I did see this particular overhead in the test but not in practice. In practice / prod-like environments, the majority of the stalls happen because of the emission.
Which issue does this PR close?
Rationale for this change
When a `GroupedHashAggregateStream` finishes processing its input, it emits the accumulated groups. The current implementation materializes all groups into a single `RecordBatch` (via `emit(EmitTo::All)`) before slicing it into smaller batches.

For queries with high-cardinality grouping keys (e.g. >500k groups) or complex types (Strings, Lists), this single emission step becomes a blocking operation that can stall the async runtime for seconds. This "long poll" prevents other tasks from running, leading to latency spikes and reduced concurrency.
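As a toy illustration of the failure mode (assuming a tokio runtime; neither function is DataFusion code), compare a drain that does all of its work in a single step with one that yields between bounded chunks:

```rust
// All-at-once: the executor thread is tied up for the entire conversion,
// so no other task on this worker can run until the whole batch exists.
fn drain_all_at_once(groups: Vec<u64>) -> Vec<Vec<u64>> {
    vec![groups] // one huge "batch"
}

// Chunked: the longest single poll is proportional to batch_size rather
// than to the total group count, and other tasks run in between.
async fn drain_chunked(groups: Vec<u64>, batch_size: usize) -> Vec<Vec<u64>> {
    let mut out = Vec::new();
    for chunk in groups.chunks(batch_size) {
        out.push(chunk.to_vec());
        tokio::task::yield_now().await; // hand control back to the runtime
    }
    out
}
```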
This PR changes the emission strategy to respect the configured `batch_size` during the drain phase, emitting groups incrementally instead of all at once.

What changes are included in this PR?
Incremental Emission Logic:

- Added a new `ExecutionState::DrainingGroups` in `GroupedHashAggregateStream`.
- Updated `poll_next` to handle this state by emitting `batch_size` groups at a time (see the sketch after this list).
- Added `input_done()` to the `GroupValues` trait to signal the transition to drain mode.
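A condensed, self-contained model of that control flow (`ToyAggStream` and its fields are hypothetical; the actual `poll_next` differs in details such as error handling and the `emit(..)` plumbing):

```rust
enum State {
    DrainingGroups,
    Done,
}

struct ToyAggStream {
    state: State,
    groups: Vec<i64>,
    offset: usize,
    batch_size: usize,
}

impl ToyAggStream {
    // Each call yields at most batch_size groups, so no single poll
    // blocks for long; None signals the stream is exhausted.
    fn next_output(&mut self) -> Option<Vec<i64>> {
        loop {
            match self.state {
                State::DrainingGroups => {
                    let end = (self.offset + self.batch_size).min(self.groups.len());
                    if self.offset == end {
                        self.state = State::Done; // all groups drained
                        continue;
                    }
                    let chunk = self.groups[self.offset..end].to_vec();
                    self.offset = end;
                    return Some(chunk);
                }
                State::Done => return None,
            }
        }
    }
}
```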
GroupValues Updates:

- Implemented `input_done()` in the `GroupValues` implementations (e.g. `GroupValuesRows`).
- In `GroupValuesRows`, `input_done()` clears the hash map (to free memory immediately) and sets a `drain_mode` flag to optimize sequential emission (see the sketch after this list).
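A minimal sketch of that idea (`ToyGroupValues` and its fields are stand-ins; the real `GroupValuesRows` stores arrow rows and uses a raw hash table):

```rust
use std::collections::HashMap;

struct ToyGroupValues {
    map: HashMap<u64, usize>, // key hash -> group index; only needed for lookups
    group_rows: Vec<Vec<u8>>, // stored group keys; still needed for emission
    drain_mode: bool,
}

impl ToyGroupValues {
    // Called once all input has been consumed: no more lookups will
    // happen, so the map's allocation can be released right away, and
    // emission becomes a cheap sequential scan over group_rows.
    fn input_done(&mut self) {
        self.map = HashMap::new(); // drops the old map, freeing memory
        self.drain_mode = true;
    }
}
```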
Test Case:

- Added `test_chunked_group_emission` to verify that the chunked emission behavior works correctly: batches respect the `batch_size` limit.
- Added `test_long_poll_reproducer` to `aggregates/mod.rs`, which demonstrates the latency improvement (from ~2.8s to ~2.1s in local tests) and verifies that results are emitted in multiple small batches rather than one huge batch. This test case doesn't have to be committed; its purpose is mainly to demo the issue and the fix.
Yes, two new test cases have been added.
Are there any user-facing changes?