The current design of aggregates makes the calculation of the average incorrect.
Namely, if there are multiple input partitions, the result is the average of the averages. For example, if the input arrives in two batches, [1,2] and [3,4,5], datafusion will say "average=3.25" rather than "average=3".
It also makes it impossible to compute the geometric mean, distinct sum, and other operations.
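To make the geometric-mean case concrete: the correct partial state is the pair (sum of ln(x), count), which no single partial scalar can encode. A standalone sketch (the function name and batch layout here are illustrative, not DataFusion code):

```rust
/// Geometric mean across batches. The partial state carried between
/// batches must be (sum of ln(x), count); a partial geometric mean
/// alone would not be enough information to merge correctly.
fn geometric_mean(batches: &[Vec<f64>]) -> f64 {
    let (log_sum, count) = batches.iter().fold((0.0, 0usize), |(s, c), b| {
        (s + b.iter().map(|x| x.ln()).sum::<f64>(), c + b.len())
    });
    (log_sum / count as f64).exp()
}

fn main() {
    // Geometric mean of {2, 8}, split across two partitions: ≈ 4.
    println!("{}", geometric_mean(&[vec![2.0], vec![8.0]]));
}
```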
The central issue is that Accumulator returns a ScalarValue during partial aggregations via get_value, but very often a ScalarValue is not sufficient information to perform the full aggregation.
A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then reduces them using another average, i.e. ((x1+x2)/2 + (x3+x4)/2 + x5)/3, which is not equal to (x1 + x2 + x3 + x4 + x5)/5.
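With concrete numbers 1..5 split into the batches of 2 above, the mismatch is easy to check. A standalone sketch (not DataFusion code) contrasting the broken reduction with one that carries (sum, count):

```rust
/// Mean of per-batch means: the current (incorrect) reduction.
fn mean_of_means(batches: &[Vec<f64>]) -> f64 {
    let partials: Vec<f64> = batches
        .iter()
        .map(|b| b.iter().sum::<f64>() / b.len() as f64)
        .collect();
    partials.iter().sum::<f64>() / partials.len() as f64
}

/// Correct mean: carry (sum, count) through the partial step, divide once.
fn true_mean(batches: &[Vec<f64>]) -> f64 {
    let (sum, count) = batches.iter().fold((0.0, 0usize), |(s, c), b| {
        (s + b.iter().sum::<f64>(), c + b.len())
    });
    sum / count as f64
}

fn main() {
    let batches = vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0]];
    println!("mean of means = {}", mean_of_means(&batches)); // ≈ 3.33, wrong
    println!("true mean     = {}", true_mean(&batches)); // 3, correct
}
```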
I believe that our Accumulators need to pass more information from the partial aggregations to the final aggregation.
We could consider adopting an API equivalent to Spark's, i.e. have an update, a merge and an evaluate.
Code with a failing test (src/execution/context.rs)
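A sketch of what such a Spark-style API could look like. The trait and type names below are hypothetical, not DataFusion's actual types; the point is that the partial state for an average is (sum, count), not the partial average itself, so merging loses no information:

```rust
/// Hypothetical accumulator API with separate partial and final phases,
/// along the lines of Spark's update / merge / evaluate.
trait Accumulator {
    /// Intermediate state passed from partial to final aggregation.
    type State;

    /// Fold one input value into the partial state.
    fn update(&mut self, value: f64);

    /// Export the partial state (e.g. (sum, count) for an average).
    fn state(&self) -> Self::State;

    /// Combine a partial state coming from another partition.
    fn merge(&mut self, other: Self::State);

    /// Produce the final result from the merged state.
    fn evaluate(&self) -> f64;
}

#[derive(Default)]
struct Avg {
    sum: f64,
    count: u64,
}

impl Accumulator for Avg {
    type State = (f64, u64);

    fn update(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }

    fn state(&self) -> (f64, u64) {
        (self.sum, self.count)
    }

    fn merge(&mut self, (sum, count): (f64, u64)) {
        self.sum += sum;
        self.count += count;
    }

    fn evaluate(&self) -> f64 {
        self.sum / self.count as f64
    }
}

fn main() {
    // Two partitions: [1, 2] and [3, 4, 5].
    let mut a = Avg::default();
    for v in [1.0, 2.0] {
        a.update(v);
    }
    let mut b = Avg::default();
    for v in [3.0, 4.0, 5.0] {
        b.update(v);
    }
    // Final aggregation merges the (sum, count) states, then evaluates.
    a.merge(b.state());
    println!("average = {}", a.evaluate()); // 3
}
```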
Reporter: Jorge Leitão / @jorgecarleitao
Assignee: Jorge Leitão / @jorgecarleitao
PRs and other links:
Note: This issue was originally created as ARROW-9937. Please see the migration documentation for further details.