
ARROW-9779: [Rust] [DataFusion] Increase stability of average accumulator#7984

Closed
jorgecarleitao wants to merge 2 commits into apache:master from jorgecarleitao:avg

Conversation

@jorgecarleitao
Member

I benchmarked this on my computer and there is no performance change. However, my computer is probably not a good representative (MacBook Pro from 2017).

This method is recommended in Knuth, The Art of Computer Programming Vol 2, section 4.2.2.

FYI @andygrove @alamb @houqp

The formula sum / count can cause sum to overflow.

This commit changes our implementation to use an online algorithm
instead, which tracks only avg and count.
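
The online update from Knuth (TAOCP Vol 2, section 4.2.2) can be sketched as follows. This is a minimal illustration of the technique, not the actual DataFusion accumulator code; the type and method names are hypothetical:

```rust
// Sketch of the online (running) mean from Knuth, TAOCP Vol 2, §4.2.2.
// Only `avg` and `count` are tracked, so no unbounded sum can overflow.
// Illustrative names only -- not the DataFusion implementation.
struct OnlineAvg {
    count: u64,
    avg: f64,
}

impl OnlineAvg {
    fn new() -> Self {
        OnlineAvg { count: 0, avg: 0.0 }
    }

    // avg_n = avg_{n-1} + (x - avg_{n-1}) / n
    fn update(&mut self, x: f64) {
        self.count += 1;
        self.avg += (x - self.avg) / self.count as f64;
    }
}

fn main() {
    let mut acc = OnlineAvg::new();
    for x in [1.0_f64, 2.0, 3.0, 4.0] {
        acc.update(x);
    }
    println!("{}", acc.avg); // 2.5
}
```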
Contributor

@alamb alamb left a comment


One potential concern with this approach is that errors accumulate: each step introduces some floating point error, so the more numbers you accumulate, the larger the accumulated error. I don't have a reference handy right now that quantifies the error build-up.

A different approach, which is not subject to the same error accumulation, is to use a larger data type to accumulate the sum.

So for example, to calculate the average of an f32, we would store the running sum as an f64. To calculate the average of an f64, we would use a 128-bit floating point number -- such as https://alkis.github.io/decimal/decimal/struct.d128.html. Similarly for i32 and i64.
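
The widening-accumulator alternative can be sketched like this (illustrative only; the function name is hypothetical, and a genuine 128-bit type for f64 inputs would come from an external crate such as the d128 type linked above):

```rust
// Sketch of the wider-accumulator approach: f32 values are summed in an
// f64, giving the running sum far more headroom against overflow and
// rounding than an f32 sum would have. For f64 inputs one would
// analogously accumulate in a 128-bit type. Illustrative code only.
fn avg_f32_with_f64_sum(values: &[f32]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    // Widen each value to f64 before summing.
    let sum: f64 = values.iter().map(|&v| v as f64).sum();
    Some(sum / values.len() as f64)
}

fn main() {
    let vals = [1.0_f32, 2.0, 3.0, 4.0];
    println!("{:?}", avg_f32_with_f64_sum(&vals)); // Some(2.5)
}
```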

I don't have any data to prefer one way over the other; I just remember error accumulation being a concern in discussions in past lives.

($SELF:ident, $ARRAY:expr, $ARRAY_TYPE:ident) => {{
// details here: https://stackoverflow.com/a/23493727/931303
let m = $ARRAY.len();
let sum = compute::sum($ARRAY.as_any().downcast_ref::<$ARRAY_TYPE>().unwrap());
Contributor


isn't this calculation still subject to overflow?

Member Author

@jorgecarleitao jorgecarleitao Aug 18, 2020


It is less likely to overflow over a single record batch than over the set of all record batches in a single partition.

I am making a textbook-authority argument here, as I do not have sufficient knowledge to take this debate into its details. I can see that Spark uses (sum, count), so that is another argument (for what it's worth).

No specific preference here. I implemented it as an example of an aggregate UDF in one of my other PRs, and thought that it could be useful here.

@jorgecarleitao
Member Author

Closing as won't-fix.



3 participants