Fix: eliminate unnecessary repartitioning for small datasets #18776

ShashidharM0118 · 2025-11-17T16:12:42Z

Which issue does this PR close?

Closes #18595 (Eliminate Repartitioning for Small Datasets).

Rationale for this change

Small datasets were undergoing unnecessary repartitioning, causing overhead without performance benefit. This change ensures that small Parquet datasets use single-partition aggregation.

What changes are included in this PR?

Added a condition in the physical planner to check Parquet row count statistics before repartitioning. If the dataset is smaller than the batch_size, the planner selects AggregateMode::Single.
Updated the aggregate_repartition.slt test file to expect AggregateExec: mode=Single for small Parquet datasets.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

martin-g · 2025-11-17T21:28:29Z

datafusion/core/src/physical_planner.rs

+) -> Result<bool> {
+    let stats = input.partition_statistics(None)?;
+
+    if let Precision::Exact(num_rows) = stats.num_rows {


How about:

Suggested change

-    if let Precision::Exact(num_rows) = stats.num_rows {
+    if let Some(num_rows) = stats.num_rows.get_value() {

This will cover both Exact and Inexact.
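
To make the difference between the two patterns concrete, here is a minimal, self-contained sketch. The `Precision` enum below is a simplified stand-in written for this example, not DataFusion's actual type, and the helper names `rows_if_exact` and `rows_if_known` are hypothetical.

```rust
// Simplified stand-in for DataFusion's `Precision` type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T> Precision<T> {
    // Returns the value for both `Exact` and `Inexact`, and `None` for `Absent`.
    fn get_value(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(v),
            Precision::Absent => None,
        }
    }
}

// Hypothetical helper: matches only exact row counts, as in the original diff.
fn rows_if_exact(num_rows: &Precision<usize>) -> Option<usize> {
    if let Precision::Exact(n) = num_rows {
        Some(*n)
    } else {
        None
    }
}

// Hypothetical helper: accepts exact and estimated row counts, as suggested.
fn rows_if_known(num_rows: &Precision<usize>) -> Option<usize> {
    num_rows.get_value().copied()
}
```

With the first helper an estimated row count is ignored, so the planner would fall through to whatever the default path is. With the second, an estimate is treated the same as an exact count.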

martin-g · 2025-11-17T21:28:53Z

datafusion/core/src/physical_planner.rs

+
+    if let Precision::Exact(num_rows) = stats.num_rows {
+        let batch_size = session_state.config().batch_size();
+        return Ok(num_rows > batch_size);


Suggested change

-        return Ok(num_rows > batch_size);
+        return Ok(num_rows >= batch_size);
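
The two comparisons differ only at the boundary. The sketch below assumes the function returns `true` when the planner should repartition (the function name is not visible in the hunk, so both helper names are hypothetical). Under that assumption, `>=` matches the PR description, which skips repartitioning only when the dataset is smaller than `batch_size`.

```rust
// Hypothetical helper mirroring the original comparison.
// Returns `true` when the planner should repartition (an assumption).
fn should_repartition_gt(num_rows: usize, batch_size: usize) -> bool {
    num_rows > batch_size
}

// Hypothetical helper mirroring the suggested comparison.
fn should_repartition_ge(num_rows: usize, batch_size: usize) -> bool {
    num_rows >= batch_size
}
```

For a dataset with exactly `batch_size` rows, the first helper returns `false` and the second returns `true`. For every other row count the two agree.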

martin-g · 2025-11-17T21:31:43Z

datafusion/core/src/physical_planner.rs

+    input: &Arc<dyn ExecutionPlan>,
+    session_state: &SessionState,
+) -> Result<bool> {
+    let stats = input.partition_statistics(None)?;


Does the aggregation have to fail if fetching the statistics fails for any reason?
IMO it would be better to return Ok(true) instead.
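
A minimal sketch of that fallback, assuming `true` means "repartition as before". The `Result<Option<usize>, String>` parameter is a simplified stand-in for the outcome of the statistics lookup, and `should_repartition` is a hypothetical name; neither is taken from the PR.

```rust
// `stats` models the outcome of the statistics lookup:
// an error, an unknown row count, or a known row count.
fn should_repartition(stats: Result<Option<usize>, String>, batch_size: usize) -> bool {
    match stats {
        // Known row count: repartition only when there is at least one full batch.
        Ok(Some(num_rows)) => num_rows >= batch_size,
        // Unknown row count or failed lookup: keep the existing behaviour
        // and repartition instead of failing the whole query.
        Ok(None) | Err(_) => true,
    }
}
```

The design choice here is that missing statistics degrade to the previous plan shape rather than surfacing as a planning error.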

ShashidharM0118 · 2025-11-18T21:10:59Z

@martin-g, Thanks for the review!

I made these changes:

Switched to stats.num_rows.get_value() instead of Precision::Exact(num_rows)
Added check for round_robin_repartition() to respect when users want extra parallelism
Added logic to get statistics, defaulting to repartitioning when stats aren't available

I set the threshold to 10 * batch_size. IMO, if the dataset size is only "in and around" a single batch size, distributing it creates "micro-batches" and causes unnecessary overhead. I am not entirely sure if this is the best value, so let me know your thoughts.

martin-g · 2025-11-18T21:22:28Z

Added check for round_robin_repartition() to respect when users want extra parallelism

Is this really needed?
If there are not enough rows to partition, then there is no need for extra parallelism.

martin-g · 2025-11-18T21:26:27Z

I set the threshold to 10 * batch_size

Why 10 times?
When I suggested adding support for Inexact, I also thought about adding 10%, not 1000%.
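
To put numbers on the two thresholds under discussion, here is a small sketch. Both function names are hypothetical, and reading "adding 10%" as `batch_size` plus a 10% margin is one possible interpretation of the comment, not something the thread confirms.

```rust
// Threshold proposed in the PR update: ten batches' worth of rows.
fn threshold_ten_batches(batch_size: usize) -> usize {
    batch_size * 10
}

// One possible reading of "adding 10%": the batch size plus a 10% margin,
// computed with integer arithmetic (the fractional part is truncated).
fn threshold_ten_percent_margin(batch_size: usize) -> usize {
    batch_size + batch_size / 10
}
```

With an illustrative batch size of 8192, the first threshold is 81920 rows and the second is 9011 rows, which shows how far apart the two proposals are.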

ShashidharM0118 added 3 commits November 17, 2025 14:55

fix: skip repartitioning for small Parquet datasets in aggregates

1e9a9cf

test: expect Single mode for small Parquet aggregates without repartitioning

c06b2c0

chore: format physical_planner.rs with cargo fmt

693ec23

github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels Nov 17, 2025

ShashidharM0118 changed the title ~~Fix/eliminate repartition small datasets~~ Fix: eliminate repartition small datasets Nov 17, 2025

ShashidharM0118 changed the title ~~Fix: eliminate repartition small datasets~~ Fix: eliminate unnecessary repartitioning for small datasets Nov 17, 2025

martin-g reviewed Nov 17, 2025

refactor: improve aggregate repartitioning with conservative thresholds, remove precision enum usage, update test expectations

7b58acd

ShashidharM0118 force-pushed the fix/eliminate-repartition-small-datasets branch from dc03871 to 7b58acd Compare November 18, 2025 19:35

Revert changes in aggregate_repartition.slt

ed10465

ShashidharM0118 requested a review from martin-g November 18, 2025 21:11

remove extra parallelism

8494f01

ShashidharM0118 force-pushed the fix/eliminate-repartition-small-datasets branch from 6bb86d1 to 8494f01 Compare November 19, 2025 18:47
