ARROW-9779: [Rust] [DataFusion] Increase stability of average accumulator#7984
ARROW-9779: [Rust] [DataFusion] Increase stability of average accumulator#7984jorgecarleitao wants to merge 2 commits intoapache:masterfrom jorgecarleitao:avg
Conversation
The formulat sum / count can cause sum to overflow. This commit changes our implementation to use an online algorithm instead, which tracks only avg and count.
alamb
left a comment
There was a problem hiding this comment.
One potential concern with this approach is that errors accumulate (specifically each step introduces some floating point error. So the more numbers you accumulate the larger the accumulated error). I don't have a reference handy now that quantifies the error build up.
A different approach, which is not subject to the same error accumulation, is to use a larger data type to accumulate the sum.
So for example, to calculate the average of a f32, we would store the running sum as f64. To calculate f64 we would use a 128bit floating point number -- such as https://alkis.github.io/decimal/decimal/struct.d128.html. Similarly for i32 and i64.
I don't have any data to prefer one way over the other, I just remember discussions about the error accumulation in past lives being a concern
| ($SELF:ident, $ARRAY:expr, $ARRAY_TYPE:ident) => {{ | ||
| // details here: https://stackoverflow.com/a/23493727/931303 | ||
| let m = $ARRAY.len(); | ||
| let sum = compute::sum($ARRAY.as_any().downcast_ref::<$ARRAY_TYPE>().unwrap()); |
There was a problem hiding this comment.
isn't this calculation still subject to overflow?
There was a problem hiding this comment.
It is less likely to overflow over a single record batch than over a the set of all record batches in a single partition.
I am taking a text book authoritative argument here, as I do not have sufficient knowledge to take this debate to its details. I can see that spark uses sum,count, so that is another argument (for what is worth).
No specific preference here. I implemented it as an example of an aggregate UDF in one of my other PRs, and thought that it could be useful here.
|
Closing as wont fix. |
I benchmarked this on my computer and there is no performance change. However, my computer is prob not a good representative (MacBook Pro from 2017).
This method is recommended in Knuth, The Art of Computer Programming Vol 2, section 4.2.2.
FYI @andygrove @alamb @houqp