-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fix: eliminate unnecessary repartitioning for small datasets #18776
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix: eliminate unnecessary repartitioning for small datasets #18776
Conversation
| ) -> Result<bool> { | ||
| let stats = input.partition_statistics(None)?; | ||
|
|
||
| if let Precision::Exact(num_rows) = stats.num_rows { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
| if let Precision::Exact(num_rows) = stats.num_rows { | |
| if let Some(num_rows) = stats.num_rows.get_value() { |
This will cover both Exact and Inexact.
|
|
||
| if let Precision::Exact(num_rows) = stats.num_rows { | ||
| let batch_size = session_state.config().batch_size(); | ||
| return Ok(num_rows > batch_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return Ok(num_rows > batch_size); | |
| return Ok(num_rows >= batch_size); |
| input: &Arc<dyn ExecutionPlan>, | ||
| session_state: &SessionState, | ||
| ) -> Result<bool> { | ||
| let stats = input.partition_statistics(None)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it have to fail the aggregation if the statistics fail for any reason ?
IMO it would be better to return Ok(true) instead.
…ds, remove precision enum usage, update test expectations
dc03871 to
7b58acd
Compare
|
@martin-g, Thanks for the review! I made these changes:
I set the threshold to |
Is this really needed ? |
Why 10 times ? |
6bb86d1 to
8494f01
Compare
Which issue does this PR close?
Rationale for this change
Small datasets were undergoing unnecessary repartitioning, causing overhead without performance benefit. This change ensures that small Parquet datasets use single-partition aggregation.
What changes are included in this PR?
batch_size, the planner selectsAggregateMode::Single.aggregate_repartition.slttest file to expectAggregateExec: mode=Singlefor small Parquet datasets.Are these changes tested?
Yes
Are there any user-facing changes?
Yes