
Better Hyperloglog cardinality estimation algorithm #4749

Merged
merged 6 commits into redis:unstable from oertl:hyperloglog-improvement on Mar 16, 2018

Conversation

oertl
Contributor

oertl commented Mar 10, 2018

The current implementation uses the LogLog-Beta approach to estimate the cardinality from the HyperLogLog registers. Unfortunately, that method relies on magic constants that have been determined empirically. The formula presented in "New cardinality estimation algorithms for HyperLogLog sketches" (https://arxiv.org/abs/1702.01284) has a better theoretical foundation and has already been independently verified in https://www.biorxiv.org/content/biorxiv/suppl/2018/02/09/262956.DC1/262956-1.pdf. The new implementation first computes the histogram of register values, which is then fed into the new cardinality estimation formula.
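For reference, the paper's estimator, written in terms of the register histogram $C_k$ (the number of registers with value $k$), with $m = 2^p$ registers and $q = 64 - p$:

$$\hat{n} = \frac{\alpha_\infty m^2}{m\,\sigma(C_0/m) + \sum_{k=1}^{q} C_k\, 2^{-k} + m\,\tau(1 - C_{q+1}/m)\, 2^{-q}}, \qquad \alpha_\infty = \frac{1}{2\ln 2},$$

$$\sigma(x) = x + \sum_{k=1}^{\infty} x^{2^k}\, 2^{k-1}, \qquad \tau(x) = \frac{1}{3}\left(1 - x - \sum_{k=1}^{\infty} \left(1 - x^{2^{-k}}\right)^2 2^{-k}\right).$$

Both series converge quickly, so $\sigma$ and $\tau$ can be evaluated with a few iterations of simple loops.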

@antirez
Contributor

antirez commented Mar 14, 2018

Hello @oertl, thank you, this looks great!

I have a few questions. First: are there any performance differences? About computing the histogram of the registers first: this is interesting, as I was doing tests with deep neural networks as cardinality estimators (which may sound terrible to you if you are complaining about magic constants in LogLog-Beta 😄) in order to implement the last stage of HLL, and I quickly realized that what matters is indeed just how many registers we have with a given value; all the other information is completely meaningless. That totally makes sense to me, and it may even speed up the computation AFAIK, because the grouping stage is pure integer work and we later have fewer inputs for the floating point function.

Another thing: using the full 64 bits instead of 63 is only of theoretical interest if I understand correctly, since a very long run of the same bit value is extremely unlikely in practice, so this should not change the output.

Finally: does the PFCOUNT implementation show a smaller average error after this change? Thanks.

@andrescorrada

@oertl, I just came across your work on HyperLogLog. I am very impressed. I have a question about your dense representation. Am I correct in thinking that it leads to an enormous space saving in the HLL data structure? Why is this not more widely known? Or am I misunderstanding the dense representation?

@antirez
Contributor

antirez commented Mar 14, 2018

P.S. note that we are going to move HLLs in Redis to their own native data type in the future, no longer string-backed, so it will be possible to change the format to a better one.

@antirez
Contributor

antirez commented Mar 14, 2018

In the meantime I read the paper, so I think I have the answer about performance: it should definitely be faster AFAIK. I'm still not sure whether, now that the error curve is much smoother without any additional correction, the average error is still in the range we had before. Please note how crucial it is in Redis that we have a small error for small cardinalities: sometimes app developers will use this as a way to count user-visible interactions. The fact that the currently used algorithm is so good at almost giving the precise count for small cardinalities makes it a lot less surprising for users who still remember the exact count of certain events.

```diff
@@ -1009,7 +1009,7 @@ uint64_t hllCount(struct hllhdr *hdr, int *invalid) {
     double m = HLL_REGISTERS;
     double E;
     int j;
-    double alphaInf = 0.5 / log(2.);
+    static double alphaInf = 0.5 / log(2.);
```

This will compile on GCC but not clang:

```
hyperloglog.c:1015:34: error: initializer element is not a compile-time constant
    static double alphaInf = 0.5 / log(2.);
```
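One portable fix, as a sketch (the literal below is simply 0.5 / ln 2 precomputed, since ISO C requires static initializers to be compile-time constants):

```c
/* Portable alternative: use the numeric constant directly so the
 * static initializer is a compile-time constant, which clang enforces. */
static const double alphaInf = 0.72134752044448170368; /* = 0.5 / log(2.) */
```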

@oertl
Contributor Author

oertl commented Mar 14, 2018

@antirez, it can be shown that the histogram of register values is a sufficient statistic (https://en.wikipedia.org/wiki/Sufficient_statistic) for estimating the cardinality. This means the information loss when mapping the register values to a histogram is irrelevant for the cardinality estimate. Therefore, any good estimator should be expressible as a function of the histogram. In fact, the original cardinality estimation approach as well as LogLog-Beta can also be expressed as functions of this histogram, if equal terms in the formulas are collected. Calculating the histogram first and then calculating the estimate in a second step (regardless of the estimation method used) reduces the number of floating-point operations and hence numerical errors.
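A minimal sketch of the histogram pass, assuming the dense encoding and the HLL_DENSE_GET_REGISTER macro from hyperloglog.c (the actual implementation must also handle the sparse encoding; `registers` here stands for the dense register array):

```c
/* reghisto[k] counts how many of the HLL_REGISTERS registers hold
 * value k; register values never exceed HLL_Q+1, so 64 slots suffice. */
int reghisto[64] = {0};
for (int j = 0; j < HLL_REGISTERS; j++) {
    unsigned long reg;
    HLL_DENSE_GET_REGISTER(reg, registers, j);
    reghisto[reg]++;
}
```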

There will probably not be any significant difference in the estimation error between LogLog-Beta and the new approach. LogLog-Beta assumes a function of a certain shape and relies on some fitting parameters that have been empirically determined for each p (the number of registers being m = 2^p). This means that you have to trust that the empirical fitting and verification have been conducted thoroughly. LogLog-Beta also requires a call to the log function, which is probably more expensive than calling hllSigma. Note that hllTau is usually irrelevant for 64-bit hashes, because it is very unlikely that any register reaches the maximum register value, which is equal to (q+1). Therefore, hllTau is almost always left via the first return statement.
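For context, a sketch of the estimator evaluation in the shape used by the new hllCount() (reghisto, hllSigma, hllTau, and HLL_Q are the names in hyperloglog.c; Ck below denotes reghisto[k]):

```c
/* Evaluates E = alphaInf * m^2 / ( m*sigma(C0/m)
 *                                 + sum_{k=1..q} Ck*2^-k
 *                                 + m*tau((m - C_{q+1})/m)*2^-q ).
 * The loop accumulates the middle sum Horner-style: each pass adds
 * one histogram bucket and halves the running total. */
double z = m * hllTau((m - reghisto[HLL_Q + 1]) / (double)m);
for (j = HLL_Q; j >= 1; --j) {
    z += reghisto[j];
    z *= 0.5;
}
z += m * hllSigma(reghisto[0] / (double)m);
E = llroundl(alphaInf * m * m / z);
```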

It is true that using 64 instead of 63 bits of the hash value is more of a cosmetic issue.

@andrescorrada, the new approach does not change the representation of individual register values. The individual registers are necessary for the add and merge operations. The transformation to a histogram is only lossless with regard to cardinality estimation.

@antirez antirez merged commit 15d7e61 into redis:unstable Mar 16, 2018
antirez added a commit that referenced this pull request Mar 16, 2018
@antirez
Contributor

antirez commented Mar 16, 2018

Hello @oertl, I ran several tests against the old and new implementation. If it were not for the fact that the new implementation is around 20% faster, I could not tell the difference: the output is consistently identical, even though it uses a completely different estimator. This is a very good sign, because it means that both functions are doing the best they can based on the frequency histograms (implicit or explicit) they have. However, your implementation is faster, in certain ways simpler, and has sounder scientific motivations, so I decided to merge the PR. Thank you for your contribution! I hope you'll think of Redis again in case you publish new results.

@michael-grunder
Contributor

michael-grunder commented Mar 16, 2018

@antirez That's interesting. I was able to see differences but perhaps I was doing something wrong 😄

Where I saw differences the PR estimations were usually more accurate.

My quick and dirty script used a list of about 300,000 English words and did a PFADD on an incrementing number of them. Adding 1000 words at a time yielded this result:

```
15d7e617 was better 299 times
d8207d09 was better  64 times
They were identical   8 times

# Literally just sum(error%)/entries
d8207d09 overall error: 0.583769
15d7e617 overall error: 0.573327
```
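A hypothetical reconstruction of that kind of test in C with hiredis (the original script isn't shown; the synthetic "word-N" strings, the key name, and the step size are stand-ins):

```c
/* Hypothetical sketch: PFADD items one at a time and print the PFCOUNT
 * error every 1000 insertions. Build with -lhiredis -lm. */
#include <stdio.h>
#include <math.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) return 1;
    char word[64];
    for (long n = 1; n <= 300000; n++) {
        snprintf(word, sizeof(word), "word-%ld", n); /* stand-in word list */
        freeReplyObject(redisCommand(c, "PFADD hlltest %s", word));
        if (n % 1000 == 0) {
            redisReply *r = redisCommand(c, "PFCOUNT hlltest");
            double errpct = 100.0 * fabs((double)r->integer - (double)n) / (double)n;
            printf("n=%ld estimate=%lld error=%.4f%%\n", n, r->integer, errpct);
            freeReplyObject(r);
        }
    }
    redisFree(c);
    return 0;
}
```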

@oertl oertl deleted the hyperloglog-improvement branch March 16, 2018 18:43
@antirez
Contributor

antirez commented Mar 16, 2018

Michael, that's odd! I would say it's worth investigating, but since your results show improved precision... not sure. BTW, I used the scripts that are shipped with Redis. The results looked a bit too similar to be honest, but if we claim that both approaches are quasi-optimal, then the biggest error source is how far the observed register histogram is from the expected one, and any good enough function, under the stated premises, will lead to the same error curve for the same sequence of elements.
