Better Hyperloglog cardinality estimation algorithm #2073

Closed
bocharov opened this issue Mar 18, 2018 · 2 comments

@bocharov
Contributor

Otmar Ertl (@oertl) did a great job improving HyperLogLog in his paper "New cardinality estimation algorithms for HyperLogLog sketches" (https://arxiv.org/abs/1702.01284). He later proposed changing the Redis implementation of HyperLogLog (redis/redis#4749), and @antirez found that the new implementation is around 20% faster, simpler, and has lower theoretical error.

We could improve the ClickHouse uniqHLL12 and uniqCombined aggregate functions, and maybe introduce a new function uniqHLL(q, p)(x) using the same algorithm.
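
For readers who have not opened the paper, here is a minimal Python sketch of the improved raw estimator as I understand it from Ertl's paper and the Redis change. It is an illustration, not ClickHouse code: the precision `p = 12`, the 64-bit BLAKE2b hash, and all function names (`sigma`, `tau`, `build_histogram`, `estimate`) are my own assumptions for the example.

```python
import hashlib
import math

ALPHA_INF = 1.0 / (2.0 * math.log(2.0))  # limit of the HLL bias constant, about 0.7213
HASH_BITS = 64                           # assumed hash width for this sketch


def sigma(x):
    """Correction term for the share of registers that are still zero."""
    if x == 1.0:
        return math.inf
    y, z = 1.0, x
    while True:
        x *= x
        z_prev = z
        z += x * y
        y += y
        if z == z_prev:  # series has converged in floating point
            return z


def tau(x):
    """Correction term for the share of registers that are not saturated."""
    if x == 0.0 or x == 1.0:
        return 0.0
    y, z = 1.0, 1.0 - x
    while True:
        x = math.sqrt(x)
        z_prev = z
        y *= 0.5
        z -= (1.0 - x) ** 2 * y
        if z == z_prev:
            return z / 3.0


def build_histogram(items, p=12):
    """Fill 2**p registers from byte strings and count how often each value occurs."""
    q = HASH_BITS - p
    registers = [0] * (1 << p)
    for item in items:
        h = int.from_bytes(hashlib.blake2b(item, digest_size=8).digest(), "big")
        index = h >> q                    # top p bits select the register
        rest = h & ((1 << q) - 1)         # remaining q bits
        rank = q - rest.bit_length() + 1  # leading zeros + 1, equals q + 1 if rest == 0
        registers[index] = max(registers[index], rank)
    histogram = [0] * (q + 2)             # register values range over 0..q+1
    for value in registers:
        histogram[value] += 1
    return histogram


def estimate(histogram, p=12):
    """Cardinality estimate computed from the register-value histogram only."""
    q = HASH_BITS - p
    m = float(1 << p)
    z = m * tau((m - histogram[q + 1]) / m)
    for k in range(q, 0, -1):
        z = 0.5 * (z + histogram[k])
    z += m * sigma(histogram[0] / m)
    if z == 0.0:                          # every register saturated
        return math.inf
    return ALPHA_INF * m * m / z


if __name__ == "__main__":
    items = (str(i).encode() for i in range(50000))
    print(estimate(build_histogram(items)))
```

Note that the estimate needs only the histogram of register values, with no empirical bias tables and no switching between small-range and large-range formulas, which is where the simplification comes from.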

@alexey-milovidov
Member

We have added a parameter to the uniqCombined function to tune the HLL quality/space trade-off in #3406.
But we still have to:

  • dig into the links you shared;
  • change the hash function and introduce a new aggregate function like uniqEnhanced.

azat added a commit to azat/ClickHouse that referenced this issue Oct 7, 2019
…d overflow)

uniqCombined() return type is UInt64, but uniqCombined() uses
CombinedCardinalityEstimator, and CombinedCardinalityEstimator::size()
return type is UInt32, while the underlying HyperLogLog::size() is
UInt64.

So after this patch uniqCombined() can be used for >UINT_MAX values, the
outcome is not ideal (ClickHouse#2073) but at least sane.
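
To illustrate the kind of truncation the commit message describes, here is a small Python sketch. It is not the ClickHouse code; `narrow_to_uint32` is a hypothetical helper that only mimics what happens when a 64-bit count is returned through a 32-bit unsigned type.

```python
UINT32_MAX = 0xFFFFFFFF


def narrow_to_uint32(value):
    """Mimic an implicit UInt64 -> UInt32 conversion: keep only the low 32 bits."""
    return value & UINT32_MAX


true_size = 5_000_000_000              # a cardinality above UINT_MAX
print(narrow_to_uint32(true_size))     # 705032704, the count wraps around
print(narrow_to_uint32(1_000_000))     # 1000000, small counts are unaffected
```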
@stale stale bot added the stale label Oct 20, 2019
@ClickHouse ClickHouse deleted a comment from stale bot Dec 13, 2019
@alexey-milovidov
Member

Is there any evidence that the proposed implementations are going to be better than the uniqCombined function?

@alexey-milovidov alexey-milovidov self-assigned this Oct 11, 2022
@alexey-milovidov alexey-milovidov added the st-need-info We need extra data to continue (waiting for response) label Oct 11, 2022
3 participants