@ShashidharM0118
Contributor

Which issue does this PR close?

Rationale for this change

Small datasets were undergoing unnecessary repartitioning, causing overhead without performance benefit. This change ensures that small Parquet datasets use single-partition aggregation.

What changes are included in this PR?

  • Added a condition in the physical planner to check Parquet row count statistics before repartitioning. If the dataset is smaller than the batch_size, the planner selects AggregateMode::Single.
  • Updated the aggregate_repartition.slt test file to expect AggregateExec: mode=Single for small Parquet datasets.
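The decision described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual planner code: `AggMode` and `choose_mode` are hypothetical names, and the row count is modeled as a plain `Option<usize>` rather than DataFusion's statistics types.

```rust
/// Simplified stand-in for the planner's aggregation modes
/// (hypothetical; not DataFusion's `AggregateMode`).
#[derive(Debug, PartialEq)]
enum AggMode {
    /// One aggregation over the input, with no repartitioning step.
    Single,
    /// Partial aggregation, then a repartition, then a final aggregation.
    PartialThenFinal,
}

/// Picks a mode from an optional row count.
/// `None` means no statistics are available, so the multi-stage plan is kept.
fn choose_mode(num_rows: Option<usize>, batch_size: usize) -> AggMode {
    match num_rows {
        // Fewer rows than one batch: repartitioning would only add overhead.
        Some(rows) if rows < batch_size => AggMode::Single,
        // Large input or unknown size: keep the parallel plan.
        _ => AggMode::PartialThenFinal,
    }
}
```

With a `batch_size` of 8192 (an illustrative value), `choose_mode(Some(100), 8192)` selects `AggMode::Single`, while an unknown row count keeps the multi-stage plan.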

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

@github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels on Nov 17, 2025
@ShashidharM0118 changed the title from "Fix/eliminate repartition small datasets" to "Fix: eliminate repartition small datasets" on Nov 17, 2025
@ShashidharM0118 changed the title from "Fix: eliminate repartition small datasets" to "Fix: eliminate unnecessary repartitioning for small datasets" on Nov 17, 2025
) -> Result<bool> {
let stats = input.partition_statistics(None)?;

if let Precision::Exact(num_rows) = stats.num_rows {
Member

How about:

Suggested change
- if let Precision::Exact(num_rows) = stats.num_rows {
+ if let Some(num_rows) = stats.num_rows.get_value() {

This will cover both Exact and Inexact.
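To illustrate the difference between the two checks, here is a small self-contained sketch. The `Precision` enum below is a simplified local stand-in written for this example (it mirrors the `Exact`/`Inexact`/absent distinction discussed here but is not the DataFusion type), and `rows_exact_only` and `rows_any` are hypothetical helper names.

```rust
/// Simplified stand-in for a statistic with a precision tag (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T> Precision<T> {
    /// Returns the value for both `Exact` and `Inexact`, and `None` for `Absent`.
    fn get_value(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(v),
            Precision::Absent => None,
        }
    }
}

/// Original check: only an exact row count is used.
fn rows_exact_only(p: &Precision<usize>) -> Option<usize> {
    if let Precision::Exact(n) = p {
        Some(*n)
    } else {
        None
    }
}

/// Suggested check: exact and estimated row counts are both used.
fn rows_any(p: &Precision<usize>) -> Option<usize> {
    p.get_value().copied()
}
```

In this sketch an estimated count such as `Precision::Inexact(500)` is ignored by `rows_exact_only` but picked up by `rows_any`, which is the extra coverage the suggestion is after.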


if let Precision::Exact(num_rows) = stats.num_rows {
let batch_size = session_state.config().batch_size();
return Ok(num_rows > batch_size);
Member

Suggested change
- return Ok(num_rows > batch_size);
+ return Ok(num_rows >= batch_size);
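The two comparisons differ only at the boundary, where the dataset has exactly `batch_size` rows. A minimal sketch, with hypothetical function names and plain `usize` inputs standing in for the planner's values:

```rust
/// Original comparison: exactly `batch_size` rows is NOT repartitioned.
fn should_repartition_gt(num_rows: usize, batch_size: usize) -> bool {
    num_rows > batch_size
}

/// Suggested comparison: exactly `batch_size` rows IS repartitioned,
/// so only inputs strictly smaller than one batch skip repartitioning.
fn should_repartition_ge(num_rows: usize, batch_size: usize) -> bool {
    num_rows >= batch_size
}
```

The `>=` form lines up with the PR description's wording that only a dataset "smaller than the batch_size" uses single-partition aggregation.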

input: &Arc<dyn ExecutionPlan>,
session_state: &SessionState,
) -> Result<bool> {
let stats = input.partition_statistics(None)?;
Member

Does the aggregation have to fail if fetching the statistics fails for any reason?
IMO it would be better to return Ok(true) in that case instead of propagating the error.
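One way to express that fallback, sketched with plain standard-library types: the statistics lookup is modeled as a `Result<Option<usize>, String>`, which is an assumption made for this example and not the signature used in the PR, and `should_repartition` is a hypothetical name.

```rust
/// Decides whether to repartition, treating any statistics problem as
/// "unknown size" instead of failing the whole query.
fn should_repartition(stats: Result<Option<usize>, String>, batch_size: usize) -> bool {
    match stats {
        // Row count known: repartition only when there is at least one full batch.
        Ok(Some(num_rows)) => num_rows >= batch_size,
        // No row count, or the lookup failed: keep the default (repartition).
        Ok(None) | Err(_) => true,
    }
}
```

Compared with using `?` on the lookup, this keeps an error in statistics collection from turning into a planning error; the cost is that the failure is silently ignored.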

…ds, remove precision enum usage, update test expectations
@ShashidharM0118 force-pushed the fix/eliminate-repartition-small-datasets branch from dc03871 to 7b58acd on November 18, 2025 19:35
@ShashidharM0118
Contributor Author

@martin-g, thanks for the review!

I made these changes:

  • Switched to stats.num_rows.get_value() instead of Precision::Exact(num_rows)
  • Added check for round_robin_repartition() to respect when users want extra parallelism
  • Added logic to get statistics, defaulting to repartitioning when stats aren't available

I set the threshold to 10 * batch_size. IMO, if the dataset is only around a single batch in size, distributing it across partitions creates "micro-batches" and adds unnecessary overhead. I am not entirely sure this is the best value, so let me know your thoughts.


@martin-g
Member

  • Added check for round_robin_repartition() to respect when users want extra parallelism

Is this really needed?
If there are not enough rows to partition, then there is no need for extra parallelism.

@martin-g
Member

I set the threshold to 10 * batch_size

Why 10 times?
When I suggested adding support for Inexact, I was thinking of a margin of about 10% over batch_size, not 1000%.
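To make the gap between the two proposals concrete, here is a short arithmetic sketch. Reading "adding 10%" as `batch_size` plus one tenth of `batch_size` is an interpretation made for this example, and both function names are hypothetical.

```rust
/// Threshold with a 10% margin over one batch (integer arithmetic, rounds down).
fn threshold_with_margin(batch_size: usize) -> usize {
    batch_size + batch_size / 10
}

/// Threshold of ten full batches, as proposed in the PR author's comment.
fn threshold_times_ten(batch_size: usize) -> usize {
    batch_size * 10
}
```

For an illustrative `batch_size` of 8192, the 10% margin gives a threshold of 9011 rows, while the tenfold rule gives 81920 rows, so inputs between those sizes would skip repartitioning under one rule and not the other.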

@ShashidharM0118 force-pushed the fix/eliminate-repartition-small-datasets branch from 6bb86d1 to 8494f01 on November 19, 2025 18:47

Development

Successfully merging this pull request may close these issues.

Eliminate Repartitioning for Small Datasets

2 participants