-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Describe the bug
Hi!
We're trying to support an unnamed BI tool on top of datafusion, which is fond of generating symmetric aggregates that look like this:
SELECT (
COALESCE(
(
SUM(DISTINCT
(CAST(FLOOR(COALESCE( o_totalprice , 0) * 1000000) AS DECIMAL(65,0)))
+ _hex_to_bigint('x' || MD5( l_orderkey::varchar ))::DECIMAL(65,0)
* 18446744073709551616
+ _hex_to_bigint('x' || SUBSTR(MD5( l_orderkey::varchar ) , 17))::DECIMAL(65,0)
)
- SUM(DISTINCT
_hex_to_bigint('x' || MD5( l_orderkey::varchar ))::DECIMAL(65,0)
* 18446744073709551616
+ _hex_to_bigint('x' || SUBSTR(MD5( l_orderkey::varchar ), 17))::DECIMAL(65,0)
)
) / 1000000, 0) -- coalesce
) / COUNT(DISTINCT l_orderkey) as average_total
FROM orders left join lineitem on (l_orderkey = o_orderkey)Here is some documentation from a totally different, 100% unrelated BI tool that explains why the above SQL is so cursed: https://cloud.google.com/looker/docs/best-practices/understanding-symmetric-aggregates
Roughly what it's doing is computing a really big sum twice, and subtracting it from itself. The numbers it's summing are hashes of a primary key. To generate big unique numbers, it's doing two md5 hashes and bitshifting the first to the left.
Unfortunately, the results from DF seem to be nondeterministic, and often negative. The issue seems to be with floating point math, related to the literal 18446744073709551616 (2^64). The BI tool seems to be expecting that the literal will be typed as a DECIMAL(65, 0), since the left hand side of the multiplication is also DECIMAL(65, 0), but DF is picking that up as a float:
<snip> AS Decimal256(65, 0)) * Float64(18446744073709552000) <snip>
This is especially weird because the literal in the plan is a completely different number. 😕
To aid in debugging, I've included a small reproducer here: https://github.com/colinmarc/df-symmetric-aggregate
Note that if you change the literal in the generated SQL to 18446744073709551615 (2^64 - 1), it gets picked up as a UInt64, which is still odd, but works, produces the correct value, and is deterministic.
Thanks in advance!
To Reproduce
git clone https://github.com/colinmarc/df-symmetric-aggregate
cd df-symmetric-aggregate
cargo test