Hi everyone, I'm very new to DataFusion (and databases in general — I come from a healthcare background, learned SQL from Excel, and only started touching Rust recently). I have an idea for a SQL feature, and I've been banging my head against it for a while. I'm hoping someone can tell me where I'm fundamentally wrong, or point me to a simpler path.
The idea: EXBY
I often want to compute aggregates at multiple granularities in one query. For example:
-- I want SUM(c) by (a,b), and also SUM(c) by (b) only
-- Currently I have to self-join, which scans twice
So I thought of this syntax:
SELECT a, b, SUM(c), EXBY(SUM(c), a) FROM t GROUP BY a, b;
EXBY(SUM(c), a) means: ignore column a when grouping, only group by the remaining columns (b).
What I've tried (and where I got stuck)
I've tried several approaches. All of them either don't work, or work but defeat the purpose. The core problem seems to be output alignment — how to combine two aggregation results with different granularities into a single row.
Approach 1: SQL rewrite (the only one that "works")
I rewrote the EXBY query into an equivalent JOIN:
SELECT main.a, main.b, main.sum_c, exby.exby_sum
FROM (SELECT a, b, SUM(c) FROM t GROUP BY a, b) main
JOIN (SELECT a, SUM(c) FROM t GROUP BY a) exby
ON main.a = exby.a
This is functionally correct, and I put it into my CLI tool. But it scans the table twice, which defeats the original purpose (avoiding redundant scans). So this is just a functional hack, not a real solution.
Approach 2: Two accumulators inside one AggregateExec
My thinking: the main aggregation and the EXBY aggregation share the same input stream. Why not just run two accumulators in one AggregateExec?
Input batches arrive
├─ Main accumulator: update_batch() → groups by (a, b)
└─ EXBY accumulator: update_batch() → groups by (b) only
One scan, no JOIN. Seemed obvious.
But I got stuck on the emit side. When the main accumulator emits a batch, the EXBY accumulator also needs to emit matching values. The problem: one EXBY group maps to multiple main groups (e.g., b=1 maps to both (1,1) and (2,1)). The two accumulators don't have a one-to-one correspondence in their output indices.
I dug into row_hash.rs and GroupOrdering, trying to figure out how to align the two outputs, but got completely lost in the state machine. I suspect the current emit path assumes one set of groups, not two sets with different cardinalities that need to be aligned, but I might be wrong.
Approach 3: Synchronized EmitTo
I also thought about using EmitTo::First(n) to synchronize the emissions: when the main group (a=1,b=1) is complete, the EXBY group (b=1) might also be complete (if the input is sorted by (a,b), then (b) is a prefix and should also be complete). So both could emit together, and the results could be zipped.
But I couldn't figure out how to make this work in practice. The EXBY accumulator has fewer groups than the main accumulator, so EmitTo::First(n) on the main side doesn't map cleanly to EmitTo::First(m) on the EXBY side. And when the input is not sorted, this whole idea falls apart.
Again, I got lost in the GroupedHashAggregateStream internals and couldn't make it work.
Approach 4: The GROUPING SETS confusion
What really confuses me is that GROUPING SETS can compute multiple aggregation levels in one scan:
SELECT a, b, SUM(c) FROM t GROUP BY GROUPING SETS ((a,b), (b));
This does almost exactly what I want — it computes both the (a,b) and (b) aggregations in a single pass. So the engine can handle multiple grouping granularities.
But GROUPING SETS outputs them as separate rows, whereas I need them combined into the same row. It feels like the data is there, but I can't figure out how to stitch it back together. Is the "combine into the same row" requirement the fundamental blocker?
My question (or plea for guidance)
I've spent a lot of time on this and have to admit I'm out of my depth. I don't have a clean proposal or PR. I'm just really curious:
- Is the "two accumulators" idea fundamentally flawed, or is it possible but just beyond my current ability?
- Is the output alignment problem (one EXBY group → multiple main rows) something that can be solved within the current
AggregateExec architecture, or does it require a new operator?
- Are there any similar features or past discussions I can learn from? I've looked at
retract_batch and EmitTo but they seem designed for different use cases.
Even a "this won't work because X, go read Y" would be incredibly helpful. I'm here to learn.
Sorry for the long post, and thanks for reading.
Hi everyone, I'm very new to DataFusion (and databases in general — I come from a healthcare background, learned SQL from Excel, and only started touching Rust recently). I have an idea for a SQL feature, and I've been banging my head against it for a while. I'm hoping someone can tell me where I'm fundamentally wrong, or point me to a simpler path.
The idea: EXBY
I often want to compute aggregates at multiple granularities in one query. For example:
So I thought of this syntax:
EXBY(SUM(c), a)means: ignore columnawhen grouping, only group by the remaining columns(b).What I've tried (and where I got stuck)
I've tried several approaches. All of them either don't work, or work but defeat the purpose. The core problem seems to be output alignment — how to combine two aggregation results with different granularities into a single row.
Approach 1: SQL rewrite (the only one that "works")
I rewrote the EXBY query into an equivalent JOIN:
This is functionally correct, and I put it into my CLI tool. But it scans the table twice, which defeats the original purpose (avoiding redundant scans). So this is just a functional hack, not a real solution.
Approach 2: Two accumulators inside one AggregateExec
My thinking: the main aggregation and the EXBY aggregation share the same input stream. Why not just run two accumulators in one
AggregateExec?One scan, no JOIN. Seemed obvious.
But I got stuck on the emit side. When the main accumulator emits a batch, the EXBY accumulator also needs to emit matching values. The problem: one EXBY group maps to multiple main groups (e.g.,
b=1maps to both(1,1)and(2,1)). The two accumulators don't have a one-to-one correspondence in their output indices.I dug into
row_hash.rsandGroupOrdering, trying to figure out how to align the two outputs, but got completely lost in the state machine. I suspect the current emit path assumes one set of groups, not two sets with different cardinalities that need to be aligned, but I might be wrong.Approach 3: Synchronized EmitTo
I also thought about using
EmitTo::First(n)to synchronize the emissions: when the main group(a=1,b=1)is complete, the EXBY group(b=1)might also be complete (if the input is sorted by(a,b), then(b)is a prefix and should also be complete). So both could emit together, and the results could be zipped.But I couldn't figure out how to make this work in practice. The EXBY accumulator has fewer groups than the main accumulator, so
EmitTo::First(n)on the main side doesn't map cleanly toEmitTo::First(m)on the EXBY side. And when the input is not sorted, this whole idea falls apart.Again, I got lost in the
GroupedHashAggregateStreaminternals and couldn't make it work.Approach 4: The GROUPING SETS confusion
What really confuses me is that
GROUPING SETScan compute multiple aggregation levels in one scan:This does almost exactly what I want — it computes both the
(a,b)and(b)aggregations in a single pass. So the engine can handle multiple grouping granularities.But GROUPING SETS outputs them as separate rows, whereas I need them combined into the same row. It feels like the data is there, but I can't figure out how to stitch it back together. Is the "combine into the same row" requirement the fundamental blocker?
My question (or plea for guidance)
I've spent a lot of time on this and have to admit I'm out of my depth. I don't have a clean proposal or PR. I'm just really curious:
AggregateExecarchitecture, or does it require a new operator?retract_batchandEmitTobut they seem designed for different use cases.Even a "this won't work because X, go read Y" would be incredibly helpful. I'm here to learn.
Sorry for the long post, and thanks for reading.