different result on fasthll and distinctcounthll #5153

Closed
xuanzih opened this issue Mar 14, 2020 · 3 comments

@xuanzih

xuanzih commented Mar 14, 2020

Hi all, we are trying to switch from fasthll to distinctcounthll.
Our code uses com.clearspring.analytics.stream.cardinality.HyperLogLog, and org.apache.pinot.core.startree.hll.HllUtil to serialize the HLL to a string.
Under the same conditions, the two functions return results that differ by roughly 1000x.
Example:

SELECT fasthll(my_hll), distinctcounthll(my_hll)
FROM counts_table WHERE timestamp >= 1500768000

I get results:

"aggregationResults": [
    {
        "function": "fastHLL_my_hll",
        "value": "68685244"
    }, {
        "function": "distinctCountHLL_my_hll",
        "value": "50535"
    }]

Could anyone explain why the two results differ so much?

@xiangfu0
Contributor

FastHll converts each string into a HyperLogLog object, which may represent thousands of unique values. DistinctCountHLL treats the string as a plain value, not as a HyperLogLog object, so it returns an approximation of the number of distinct serialized HyperLogLog strings. That value should be close to the total number of rows scanned.
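
To make the distinction concrete, here is a minimal Java sketch of the two behaviours. It does not use Pinot or stream-lib: exact `HashSet`s stand in for HyperLogLog sketches, and the class name, helper names, and row counts are all made up for illustration. `mergeThenCount` mimics deserializing every row and merging the sketches, while `countDistinctStrings` mimics treating each serialized string as an ordinary value.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;

// Illustration only: exact sets replace real HyperLogLog sketches.
class HllSemanticsSketch {

    // Hypothetical serialization: ids joined by commas.
    static String serialize(Set<Integer> ids) {
        StringJoiner joiner = new StringJoiner(",");
        for (int id : ids) {
            joiner.add(Integer.toString(id));
        }
        return joiner.toString();
    }

    // Inverse of serialize(); an empty string is an empty sketch.
    static Set<Integer> deserialize(String row) {
        Set<Integer> ids = new HashSet<>();
        if (row.isEmpty()) {
            return ids;
        }
        for (String part : row.split(",")) {
            ids.add(Integer.parseInt(part));
        }
        return ids;
    }

    // Deserialize every row, merge the sketches, then count the merged ids.
    static long mergeThenCount(List<String> rows) {
        Set<Integer> merged = new HashSet<>();
        for (String row : rows) {
            merged.addAll(deserialize(row));
        }
        return merged.size();
    }

    // Treat each serialized string as a plain value and count distinct strings.
    static long countDistinctStrings(List<String> rows) {
        return new HashSet<>(rows).size();
    }

    // Build rowCount rows, each holding idsPerRow ids that appear in no other row.
    static List<String> buildRows(int rowCount, int idsPerRow) {
        List<String> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            Set<Integer> ids = new HashSet<>();
            for (int i = 0; i < idsPerRow; i++) {
                ids.add(r * idsPerRow + i);
            }
            rows.add(serialize(ids));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> rows = buildRows(50, 1000);
        System.out.println("merge then count:       " + mergeThenCount(rows));       // 50000
        System.out.println("count distinct strings: " + countDistinctStrings(rows)); // 50
    }
}
```

With 50 rows of 1000 ids each, merging yields 50000 while counting distinct strings yields 50, the row count. The gap is the average number of values per sketch, which is the same kind of gap the query above shows.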

@xiangfu0
Contributor

DistinctCountHLL only interprets a BYTES column as serialized HyperLogLog objects.

@Jackie-Jiang
Contributor

fasthll is deprecated because its deserialization is slow. You can generate a BYTES column holding the serialized HyperLogLog with org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hyperLogLog) and query it with distinctcounthll.
