@Tamar-Posen Tamar-Posen commented Nov 22, 2025

Previously, AggregateExec dropped the total_byte_size statistic (reporting Precision::Absent) across aggregation operations, which prevented the optimizer from making informed decisions about memory allocation and execution strategies (join side selection and, in turn, dynamic filters).

This commit implements proportional byte-size scaling based on row count ratios:

  • Added calculate_scaled_byte_size helper with inline optimization
  • Scales byte size for Final/FinalPartitioned without GROUP BY
  • Scales byte size proportionally for all other aggregation modes
  • Always returns Precision::Inexact for estimates (semantically correct)
  • Returns Precision::Absent when input statistics are insufficient

Added test coverage for edge cases (absent statistics, zero rows).
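
A minimal sketch of the scaling helper described above, not the code from this PR. The `Precision` enum below is a local stand-in for DataFusion's type so the snippet compiles on its own. The signature is hypothetical. Returning Precision::Absent for zero input rows is an assumption based on the "guards against division by zero" note.

```rust
/// Local stand-in for DataFusion's `Precision<usize>` (assumption: simplified,
/// defined here only so the sketch is self-contained).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Returns the wrapped value for Exact/Inexact, None for Absent.
    fn get_value(&self) -> Option<usize> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(*v),
            Precision::Absent => None,
        }
    }
}

/// Hypothetical signature: scales the input byte size by the ratio of
/// output rows to input rows.
#[inline]
fn calculate_scaled_byte_size(
    input_rows: Precision,
    input_bytes: Precision,
    output_rows: usize,
) -> Precision {
    match (input_rows.get_value(), input_bytes.get_value()) {
        // Both statistics known and the row count is non-zero: safe to divide.
        (Some(rows), Some(bytes)) if rows > 0 => {
            let bytes_per_row = bytes as f64 / rows as f64;
            // The result is an estimate, so it is always Inexact.
            Precision::Inexact((bytes_per_row * output_rows as f64) as usize)
        }
        // Missing statistics or zero input rows: no estimate is possible.
        _ => Precision::Absent,
    }
}
```

The result is Inexact even when both inputs are Exact, because the average bytes per row of the input only approximates the width of the aggregated output rows.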

Which issue does this PR close?

#18850

Rationale for this change

Without byte-size statistics, the optimizer cannot estimate memory requirements for join-side selection, dynamic filter generation, and memory allocation decisions. This preserves statistics using proportional scaling (bytes_per_row × output_rows).
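
As a worked illustration of the proportional scaling, with made-up numbers and symbols that do not come from the PR: let $B_{\text{in}}$ and $R_{\text{in}}$ be the input byte size and row count, and $R_{\text{out}}$ the estimated output row count.

```latex
B_{\text{out}} \approx \frac{B_{\text{in}}}{R_{\text{in}}} \times R_{\text{out}},
\qquad R_{\text{in}} > 0
% Hypothetical example: 80{,}000 bytes over 1{,}000 input rows, 50 output groups
% B_out \approx (80000 / 1000) \times 50 = 80 \times 50 = 4000 bytes
```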

What changes are included in this PR?

  1. Modified statistics_inner to calculate proportional byte size instead of returning Precision::Absent
  2. Added calculate_scaled_byte_size helper (inline optimized, guards against division by zero)
  3. Updated test assertions and added edge case coverage

Are these changes tested?

Yes:

  • New test_aggregate_statistics_edge_cases covers edge-case scenarios
  • Existing tests confirm stats propagate correctly through the aggregation pipeline

Are there any user-facing changes?

No breaking changes.
This is an internal optimization that may improve query planning and provide more accurate memory estimates in EXPLAIN output.

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Nov 22, 2025
@Tamar-Posen Tamar-Posen force-pushed the fix-aggregate-bytes-stats branch from cc9f192 to 5f2fb92 Compare November 22, 2025 23:19
@github-actions github-actions bot added the core Core DataFusion crate label Nov 22, 2025
@Dandandan
Contributor

Thanks for opening the PR! I fired the tests

@Tamar-Posen Tamar-Posen force-pushed the fix-aggregate-bytes-stats branch 2 times, most recently from e9bc9fe to 9ef133b Compare November 23, 2025 12:28
@Tamar-Posen
Author

Hey @Dandandan, could you retrigger the tests?
Pushed an additional fix
THNX

assert_snapshot!(
pretty_format_batches(&sql_results).unwrap(),
@r"
+---------------+-------------------------------------------------------------------------------------------------------------------------+
Author

The physical plan changed because join selection now uses the available byte stats.
This leads to a better join order, with the smaller side chosen as the build side.
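
To illustrate why the plan can flip, here is a rough sketch of a build-side choice that prefers byte sizes and falls back to row counts. It is a simplification under stated assumptions, not DataFusion's actual join-selection code. `SideStats`, `BuildSide`, and `choose_build_side` are hypothetical names, and `Option<usize>` stands in for the real statistics types.

```rust
/// Hypothetical: which join input is used to build the hash table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BuildSide {
    Left,
    Right,
}

/// Hypothetical, simplified per-input statistics (None means "absent").
#[derive(Debug, Clone, Copy)]
struct SideStats {
    total_byte_size: Option<usize>,
    num_rows: Option<usize>,
}

/// Picks the smaller input as the build side.
/// Assumption: byte sizes win when both are known; otherwise row counts
/// are compared; with no usable statistics the left side is kept.
fn choose_build_side(left: SideStats, right: SideStats) -> BuildSide {
    let right_is_smaller = match (left.total_byte_size, right.total_byte_size) {
        (Some(l), Some(r)) => r < l,
        _ => match (left.num_rows, right.num_rows) {
            (Some(l), Some(r)) => r < l,
            _ => false,
        },
    };
    if right_is_smaller {
        BuildSide::Right
    } else {
        BuildSide::Left
    }
}
```

In this sketch, an aggregate output with an absent byte size forces the row-count fallback, which can pick a side with few but very wide rows. Once the aggregate reports an estimated byte size, the comparison is made on bytes instead.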

@Dandandan
Contributor

Hey @Dandandan, could you retrigger the tests? Pushed an additional fix THNX

Done

@Tamar-Posen Tamar-Posen force-pushed the fix-aggregate-bytes-stats branch from 9ef133b to 14dc862 Compare November 23, 2025 19:57
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 23, 2025
@Tamar-Posen Tamar-Posen force-pushed the fix-aggregate-bytes-stats branch from 14dc862 to a6be210 Compare November 23, 2025 20:07
@Tamar-Posen
Author

Same here @Dandandan, @alamb - could one of you please retrigger the CI? (Unless there’s a way for me to do it myself?)
Also, I’d appreciate a review when you have a moment.
Thanks!

@Tamar-Posen Tamar-Posen force-pushed the fix-aggregate-bytes-stats branch from a6be210 to 5916ebb Compare November 23, 2025 21:23
@Tamar-Posen
Author

@Dandandan Thanks for the review!
Whenever convenient, could someone merge it, pls?

Labels

  • core: Core DataFusion crate
  • physical-plan: Changes to the physical-plan crate
  • sqllogictest: SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

AggregateExec drops byte-size statistics causing incorrect join build-side selection and broken dynamic filtering

2 participants