Skip to content

Nondeterministic, incorrect result for symmetric aggregates #17970

@colinmarc

Description

@colinmarc

Describe the bug

Hi!

We're trying to support an unnamed BI tool on top of datafusion, which is fond of generating symmetric aggregates that look like this:

SELECT (
    COALESCE(
        (
                SUM(DISTINCT
                        (CAST(FLOOR(COALESCE( o_totalprice , 0) * 1000000) AS DECIMAL(65,0)))
                    +   _hex_to_bigint('x' || MD5( l_orderkey::varchar ))::DECIMAL(65,0)
                    *   18446744073709551616
                    +   _hex_to_bigint('x' || SUBSTR(MD5( l_orderkey::varchar ) , 17))::DECIMAL(65,0)
                )
            -   SUM(DISTINCT
                        _hex_to_bigint('x' || MD5( l_orderkey::varchar ))::DECIMAL(65,0)
                    *   18446744073709551616
                    +   _hex_to_bigint('x' || SUBSTR(MD5( l_orderkey::varchar ), 17))::DECIMAL(65,0)
                )
        ) / 1000000, 0) -- coalesce
) / COUNT(DISTINCT l_orderkey) as average_total
FROM orders left join lineitem on (l_orderkey = o_orderkey)

Here is some documentation from a totally different, 100% unrelated BI tool that explains why the above SQL is so cursed: https://cloud.google.com/looker/docs/best-practices/understanding-symmetric-aggregates

Roughly what it's doing is computing a really big sum twice, and subtracting it from itself. The numbers it's summing are hashes of a primary key. To generate big unique numbers, it's doing two md5 hashes and bitshifting the first to the left.

Unfortunately, the results from DF seem to be nondeterministic, and often negative. The issue seems to be with floating point math, related to the literal 18446744073709551616 (2^64). The BI tool seems to be expecting that the literal will be typed as a DECIMAL(65, 0), since the left hand side of the multiplication is also DECIMAL(65, 0), but DF is picking that up as a float:

<snip> AS Decimal256(65, 0)) * Float64(18446744073709552000) <snip>

This is especially weird because the literal in the plan is a completely different number. 😕

To aid in debugging, I've included a small reproducer here: https://github.com/colinmarc/df-symmetric-aggregate

Note that if you change the literal in the generated SQL to 18446744073709551615 (2^64 - 1), it gets picked up as a UInt64, which is still odd, but works, produces the correct value, and is deterministic.

Thanks in advance!

To Reproduce

git clone https://github.com/colinmarc/df-symmetric-aggregate
cd df-symmetric-aggregate
cargo test

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions