feat(aqe): dynamic switching from streaming agg to hash aggregation#1722
feat(aqe): dynamic switching from streaming agg to hash aggregation#1722wirybeaver wants to merge 1 commit into
Conversation
Add `DynamicAggregateAlgorithmRule`, an AQE physical optimizer rule that re-derives `InputOrderMode` for each `AggregateExec` after a shuffle stage resolves and rewrites the operator when the derived mode differs from the cached one. DataFusion freezes `InputOrderMode` (Linear/Sorted/PartiallySorted) at plan time. In Ballista's AQE, this means a downstream aggregate stays in Linear (hash table) mode even when an upstream stage completes and grants ordering on the group-by columns — wasting memory that streaming aggregation would avoid. The rule also corrects the reverse: a stale Sorted claim after a subtree rewrite removes the assumed ordering. The rewrite relies on `AggregateExec::with_new_children` which internally calls `try_new_with_schema`, re-running the derivation against the current input `EquivalenceProperties`. No upstream DataFusion changes are required. Changes: - `ballista/core/src/config.rs`: add `ballista.aqe.dynamic_aggregate.enabled` config key (default: false) with getter `aqe_dynamic_aggregate_enabled()` - `optimizer_rule/dynamic_aggregate_algorithm.rs`: new rule with 9 unit tests (gate, no-exchange/unresolved-exchange skip, mode transitions, idempotence, schema preservation) - `optimizer_rule/mod.rs`: expose new module - `planner.rs`: register rule before `DistributedExchangeRule` Closes part of apache#1359
| /// want to know if *any* upstream stage has finished. | ||
| fn subtree_has_resolved_exchange(plan: &Arc<dyn ExecutionPlan>) -> bool { | ||
| if let Some(exchange) = plan.as_any().downcast_ref::<ExchangeExec>() { | ||
| exchange.shuffle_created() && !exchange.inactive_stage |
There was a problem hiding this comment.
The doc above says Recurses through all nodes, including through exchange boundaries but the implementation stops recursion on the first ExchangeExec
There was a problem hiding this comment.
You're right — the doc contradicts the implementation. The code stops at the first ExchangeExec and returns based on its resolved state, which is what we actually want (the nearest exchange gates the agg's input; a resolved exchange further down behind an unresolved one doesn't help). The doc should describe the actual behavior.
That said, please see my reply to @milenkovicm on the main thread — the broader design has problems that may make this PR moot.
| let result_agg = result.as_any().downcast_ref::<AggregateExec>().unwrap(); | ||
|
|
||
| // After the rule the mode must match what re-derivation produces. | ||
| assert_eq!(result_agg.input_order_mode(), &original_mode); |
There was a problem hiding this comment.
assert that the order mode is Sorted
There was a problem hiding this comment.
Right — the assertion is missing, but more importantly the test as written can't exercise a Linear→Sorted transition. aggregate_on_k(exchange) calls AggregateExec::try_new, which derives input_order_mode from the exchange's eq_properties at construction time. Since the exchange wraps a SortExec(k) and propagates its sort ordering, the agg is derived as Sorted immediately. The rule then re-derives the same mode, producing no transition — so the test passes trivially and doesn't validate what its name claims. Full root-cause analysis in my reply to @milenkovicm.
| #[test] | ||
| fn schema_check_is_false() { | ||
| assert!(!DynamicAggregateAlgorithmRule::default().schema_check()); | ||
| } |
There was a problem hiding this comment.
None of the tests validates order mode change from Linear to Sorted.
There was a problem hiding this comment.
Confirmed — and this gap isn't just test coverage. with_new_children re-derivation produces the same mode the constructor would, because ExchangeExec.equivalence_properties() doesn't change between construction and resolution. So the rule can't, by design, produce a Linear↔Sorted transition. Details in my reply to @milenkovicm.
|
Sorry but I do not understand this PR, can you please give a bit more explanation what are you trying to do in this PR? In which case can we assume sorting is going to be preserved after an exchange ? |
In which cases will this happen? When can w guarantee preserving sorting across shuffles ?
Can you please give an example when would this claim be true? |
|
@milenkovicm Thanks for pushing back — your questions exposed that my reasoning was wrong. Let me walk through what I now see. The flawed premise. I assumed let eq_properties = input.properties().eq_properties.clone();The eq_properties are captured at construction time from the pre-shuffle input and never updated when Why the rule is a no-op (or worse).
What the roadmap line actually requires. For AQE to re-derive useful information after a stage resolves, the resolved exchange (or whatever node represents the post-shuffle reader) needs to advertise post-shuffle properties — driven by shuffle mode:
Once those properties reflect reality, no dedicated re-derivation rule is needed — Proposal. Close this PR. I'd like to open a separate one (or RFC discussion) to update Apologies for the misfire — should have validated the premise with a runnable example before opening. |
|
Closing per my analysis above — the design is wrong at the foundation. Will reopen with a different approach (likely updating |
Which issue does this PR close?
Part of #1359 (AQE epic) — roadmap line: "switch from streaming aggregation to hash aggregation (extended rules)"
What does this PR do?
Adds
DynamicAggregateAlgorithmRule, an AQE physical optimizer rule that re-derivesInputOrderModefor eachAggregateExecafter a shuffle stage resolves and rewrites the operator when the derived mode differs from the cached one.Problem
DataFusion freezes
InputOrderMode(Linear/Sorted/PartiallySorted) at plan time. In Ballista's AQE this creates two issues:Linear(hash table, O(distinct-groups) memory) even when an upstream stage completes and grants ordering on the group-by columns — whereSorted(streaming, O(1) memory) would suffice.Sortedclaim when its input is no longer ordered.Approach
AggregateExec::with_new_childrenalready callstry_new_with_schemawhich re-derivesinput_order_modefrom the current inputEquivalenceProperties. The rule:transform_up.AggregateExec, skips if no resolvedExchangeExecexists in the subtree (idempotence guard).with_new_children([same_input])to force re-derivation.Transformed::yes(rebuilt)only when the derived mode differs from the cached one.No upstream DataFusion changes are required.
Configuration
Disabled by default (
false) pending benchmarking.Checklist
default_optimizers()beforeDistributedExchangeRuleballista.aqe.dynamic_aggregate.enabledadded toBallistaConfig