-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix: preserve byte-size statistics in AggregateExec #18885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cc9f192 to
5f2fb92
Compare
|
Thanks for opening the PR! I fired the tests |
e9bc9fe to
9ef133b
Compare
|
Hey @Dandandan, could you retrigger the tests? |
| assert_snapshot!( | ||
| pretty_format_batches(&sql_results).unwrap(), | ||
| @r" | ||
| +---------------+-------------------------------------------------------------------------------------------------------------------------+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The physical plan changed bc join selection now uses available byte stats.
Which leads to a more optimal join order - ensuring the smaller side is chosen as the build side.
Done |
9ef133b to
14dc862
Compare
14dc862 to
a6be210
Compare
|
Same here @Dandandan, @alamb - could one of you please retrigger the CI? (Unless there’s a way for me to do it myself?) |
Previously, AggregateExec dropped total_byte_size statistics (Precision::Absent) through aggregation operations, preventing the optimizer from making informed decisions about memory allocation and execution strategies(join side selection -> dynamic filters). This commit implements proportional byte-size scaling based on row count ratios: - Added calculate_scaled_byte_size helper with inline optimization - Scales byte size for Final/FinalPartitioned without GROUP BY - Scales byte size proportionally for all other aggregation modes - Always returns Precision::Inexact for estimates (semantically correct) - Returns Precision::Absent when insufficient input statistics Added test coverage for edge cases (absent statistics, zero rows).
a6be210 to
5916ebb
Compare
|
@Dandandan Thanks for the review! |
Previously, AggregateExec dropped total_byte_size statistics (Precision::Absent) through aggregation operations, preventing the optimizer from making informed decisions about memory allocation and execution strategies(join side selection -> dynamic filters).
This commit implements proportional byte-size scaling based on row count ratios:
Added test coverage for edge cases (absent statistics, zero rows).
Which issue does this PR close?
#18850
Rationale for this change
Without byte-size statistics, the optimizer cannot estimate memory requirements for join-side selection, dynamic filter generation, and memory allocation decisions. This preserves statistics using proportional scaling (bytes_per_row × output_rows).
What changes are included in this PR?
statistics_innerto calculate proportional byte size instead of returningPrecision::Absentcalculate_scaled_byte_sizehelper (inline optimized, guards against division by zero)Are these changes tested?
Yes:
test_aggregate_statistics_edge_casescovers edge cases scenariosAre there any user-facing changes?
No breaking changes.
Internal optimization that may improve query planning and provide more accurate memory estimates in EXPLAIN output.