-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Describe the bug
In DataFusion 51.1.0, joins involving a GROUP BY may choose the wrong build/probe side.
This happens because AggregateExec discards total_byte_size statistics, leaving only row-count stats.
The join-selection rule then falls back to comparing num_rows, which producing incorrect decisions for tables with uneven byte-size distributions
As a result, dynamic filters may be:
- Built on the large table, or
- Not generated at all.
To Reproduce
This occurs when joining tables with extreme size asymmetry:
- large_bytes: ~50 rows, ~1 GB total
- many_rows: ~1,000,000 rows, ~50 KB total
Run:
SELECT *
FROM large_bytes
JOIN (
SELECT id, join_key
FROM many_rows
GROUP BY ALL
) AS many_rows
ON large_bytes.id = many_rows.join_key;
Inspect the optimized plan (EXPLAIN or df.explain(false, true)).
The optimizer may select large_bytes as the build side because the aggregated many_rows loses its byte-size stats.
Minimal Rust test reproducing the issue: repro_aggregate_bytes.rs
Expected behavior
- AggregateExec should preserve or approximate total_byte_size.
- Join selection should use byte-size when available.
- Dynamic filters should consistently build on the smaller table.
Additional context
Relevant areas:
- physical-plan/aggregates/mod.rs – statistics propagation
- physical-optimizer/join_selection.rs
- physical-optimizer/filter_pushdown.rs
- physical-plan/joins/hash_join/exec.rs
This behavior appears to be a limitation (design choice) rather than an incidental bug. Preserving or estimating byte-size stats after aggregation would improve join selection for skewed row-size distributions
I’m happy to submit a PR after discussing the desired approach for propagating or estimating byte-size stats through AggregateExec.