
ARROW-9779: [Rust] [DataFusion] Increase stability of average accumulator#7984

Closed
jorgecarleitao wants to merge 2 commits into apache:master from jorgecarleitao:avg

Conversation

@jorgecarleitao
Member

I benchmarked this on my computer and there is no performance change. However, my computer is probably not a good representative (MacBook Pro from 2017).

This method is recommended in Knuth, The Art of Computer Programming Vol 2, section 4.2.2.

FYI @andygrove @alamb @houqp

The formula sum / count can cause sum to overflow.

This commit changes our implementation to use an online algorithm
instead, which tracks only avg and count.
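
The online update from Knuth (TAOCP Vol 2, section 4.2.2) can be sketched as follows. This is a minimal illustration of the technique, not the actual DataFusion accumulator code; the type and method names are hypothetical:

```rust
// Sketch of the online (running) mean from Knuth, TAOCP Vol 2, §4.2.2.
// Only `avg` and `count` are tracked, so no unbounded sum can overflow.
// Illustrative names only -- not the DataFusion implementation.
struct OnlineAvg {
    count: u64,
    avg: f64,
}

impl OnlineAvg {
    fn new() -> Self {
        OnlineAvg { count: 0, avg: 0.0 }
    }

    // avg_n = avg_{n-1} + (x - avg_{n-1}) / n
    fn update(&mut self, x: f64) {
        self.count += 1;
        self.avg += (x - self.avg) / self.count as f64;
    }
}

fn main() {
    let mut acc = OnlineAvg::new();
    for x in [1.0_f64, 2.0, 3.0, 4.0] {
        acc.update(x);
    }
    println!("{}", acc.avg); // 2.5
}
```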
Contributor

@alamb alamb left a comment


One potential concern with this approach is that errors accumulate: each step introduces some floating point error, so the more numbers you accumulate, the larger the accumulated error. I don't have a reference handy right now that quantifies the error build-up.

A different approach, which is not subject to the same error accumulation, is to use a larger data type to accumulate the sum.

So for example, to calculate the average of an f32, we would store the running sum as an f64. To calculate the average of an f64, we would use a 128-bit floating point number -- such as https://alkis.github.io/decimal/decimal/struct.d128.html. Similarly for i32 and i64.
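
The widening-accumulator alternative can be sketched like this (illustrative only; the function name is hypothetical, and a genuine 128-bit type for f64 inputs would come from an external crate such as the d128 type linked above):

```rust
// Sketch of the wider-accumulator approach: f32 values are summed in an
// f64, giving the running sum far more headroom against overflow and
// rounding than an f32 sum would have. For f64 inputs one would
// analogously accumulate in a 128-bit type. Illustrative code only.
fn avg_f32_with_f64_sum(values: &[f32]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    // Widen each value to f64 before summing.
    let sum: f64 = values.iter().map(|&v| v as f64).sum();
    Some(sum / values.len() as f64)
}

fn main() {
    let vals = [1.0_f32, 2.0, 3.0, 4.0];
    println!("{:?}", avg_f32_with_f64_sum(&vals)); // Some(2.5)
}
```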

I don't have any data to prefer one way over the other; I just remember error accumulation being a concern in discussions in past lives.

($SELF:ident, $ARRAY:expr, $ARRAY_TYPE:ident) => {{
// details here: https://stackoverflow.com/a/23493727/931303
let m = $ARRAY.len();
let sum = compute::sum($ARRAY.as_any().downcast_ref::<$ARRAY_TYPE>().unwrap());
Contributor


isn't this calculation still subject to overflow?

Member Author

@jorgecarleitao jorgecarleitao Aug 18, 2020


It is less likely to overflow over a single record batch than over the set of all record batches in a single partition.

I am making a textbook-authority argument here, as I do not have sufficient knowledge to take this debate into its details. I can see that Spark uses (sum, count), so that is another argument (for what it's worth).

No specific preference here. I implemented it as an example of an aggregate UDF in one of my other PRs, and thought that it could be useful here.

@jorgecarleitao
Member Author

Closing as won't-fix.



3 participants