Better Hyperloglog cardinality estimation algorithm #4749
Conversation
Based on the method described in https://arxiv.org/abs/1702.01284, which does not rely on any magic constants.
Hello @oertl, thank you, this looks great! I have a few questions. Are there any performance differences? About computing the histogram of the registers first: this is interesting, as I was doing tests with deep neural networks as cardinality estimators (which may sound terrible to you if you are complaining about magic constants in LogLog-Beta 😄 ) in order to implement the last stage of HLL, and I quickly realized that what matters is indeed just how many registers we have with a given value, all the other info being completely meaningless. Totally makes sense to me, and this may even speed up the computation AFAIK, because the grouping stage is integer-only and we later have fewer inputs for the floating-point function. Another thing: using the full 64 bits instead of 63 is only of theoretical interest if I understand correctly, since a very long run of the same bit value is very unlikely to happen in practice, so this should not change the output. Finally: does the PFCOUNT implementation show a smaller average error after this change? Thanks.
@oertl, I just came across your work on HyperLogLog. I am very impressed. I have a question about your dense representation. Am I correct in understanding that it leads to an enormous space saving in the HLL data structure? Why is this not more widely known? Or am I misunderstanding the dense representation?
P.S. note that we're going to move HLLs in Redis to their own native data type in the future, no longer string-backed, so it is possible to change the format to a better one.
In the meantime I read the paper, so I think I have the answers about the performance (it should definitely be faster AFAIK). I'm still not sure whether, even though the error curve is now much smoother without any additional correction, the average error is still in the range we had before. Please note how crucial it is in Redis that we have a small error for small cardinalities: sometimes app developers will use this as a way to count user-visible interactions. The fact that the currently used algorithm is so good at almost giving the precise count for small cardinalities makes it a lot less surprising for users who still remember the exact count of certain events.
src/hyperloglog.c
Outdated
@@ -1009,7 +1009,7 @@ uint64_t hllCount(struct hllhdr *hdr, int *invalid) {
     double m = HLL_REGISTERS;
     double E;
     int j;
-    double alphaInf = 0.5 / log(2.);
+    static double alphaInf = 0.5 / log(2.);
This will compile on GCC but not clang:
hyperloglog.c:1015:34: error: initializer element is not a compile-time constant
static double alphaInf = 0.5 / log(2.);
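The root cause is that C requires static-storage initializers to be compile-time constant expressions; GCC folds `log(2.)` through its builtin, while clang rejects it. A minimal sketch of a portable workaround (the helper name `getAlphaInf` is hypothetical, not a Redis symbol) is to compute the value lazily on first use:

```c
#include <math.h>

/* Hypothetical sketch: C forbids non-constant static initializers, so
 * compute 1/(2 ln 2) on first call instead of at load time. Using 0.0
 * as the "not yet computed" sentinel is safe because the real value is
 * strictly positive. */
static double getAlphaInf(void) {
    static double alphaInf = 0.0;
    if (alphaInf == 0.0) alphaInf = 0.5 / log(2.);
    return alphaInf;
}
```

Alternatively, the value could simply be written as a numeric literal, at the cost of reintroducing one (well-defined) constant into the source.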
@antirez, it can be shown that the histogram of register values is a sufficient statistic (https://en.wikipedia.org/wiki/Sufficient_statistic) for estimating the cardinality. This means the corresponding information loss when mapping the register values to a histogram is irrelevant for the cardinality estimate. Therefore, any good estimator should be expressible as a function of the histogram. In fact, the original cardinality estimation approach as well as LogLog-Beta can also be expressed as a function of this histogram, if equal terms in the formulas are collected. Calculating the histogram first and then calculating the estimate in a second step (regardless of the estimation method used) reduces the number of floating-point operations and hence numerical errors. Probably there will not be any significant difference in the estimation error between LogLog-Beta and the new approach. LogLog-Beta assumes a function of a certain shape and relies on some fitting parameters that have been empirically determined for each p (the number of registers). This means that you have to trust that the empirical fitting and verification has been conducted thoroughly. LogLog-Beta also requires a call to the log function, which is probably more expensive than calling hllSigma. Note that hllTau is usually irrelevant for 64-bit hashes, because it is very unlikely that any register reaches the maximum register value, which is equal to (q+1). Therefore, hllTau is almost always left via the first return statement. It is true that using 64 instead of 63 bits of the hash value is more a cosmetic issue. @andrescorrada, the new approach does not change the representation of individual register values. The individual registers are necessary for the add and merge operations. The transformation to a histogram is only lossless with regard to cardinality estimation.
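The integer-only grouping step described above can be sketched as follows (an illustrative fragment, not the Redis code; the constants and the name `hllRegHisto` are assumptions for this example):

```c
#include <string.h>

#define HLL_P 14                      /* assumed precision, as in Redis */
#define HLL_REGISTERS (1 << HLL_P)    /* 16384 registers */
#define HLL_Q (64 - HLL_P)            /* with 64-bit hashes the maximum register value is q+1 */

/* Collapse the registers into a histogram: reghisto[k] counts how many
 * registers currently hold the value k. This pass uses only integer
 * arithmetic and loses nothing relevant to cardinality estimation (the
 * histogram is a sufficient statistic), leaving just q+2 inputs for the
 * floating-point estimation stage. */
static void hllRegHisto(const unsigned char *registers, int *reghisto) {
    memset(reghisto, 0, sizeof(int) * (HLL_Q + 2));
    for (int i = 0; i < HLL_REGISTERS; i++)
        reghisto[registers[i]]++;
}
```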
Hello @oertl, I ran several tests against the old and new implementation. If it were not for the fact that the new implementation is around 20% faster, I could not tell the difference: the output is consistently identical, even though it uses a completely different estimator. This is a very good sign, because it means that both functions are doing the best they can based on the frequency histograms (implicit or explicit) they have. However, your implementation is faster, in certain ways simpler, and has a sounder scientific motivation, so I decided to merge the PR. Thank you for your contribution! I hope you'll think of Redis again in case you publish new results.
@antirez That's interesting. I was able to see differences, but perhaps I was doing something wrong 😄 Where I saw differences, the PR estimations were usually more accurate. My quick and dirty script used a list of about 300,000 English words and compared the two commits:

15d7e617 was better 299 times
d8207d09 was better 64 times
They were identical 8 times

# Literally just sum(error%)/entries
d8207d09 overall error: 0.583769
15d7e617 overall error: 0.573327
Michael, that's odd! I would say that's worth investigating, but since your results show improved precision... not sure. BTW I used the scripts that are shipped with Redis. The results looked a bit too similar to be honest, but if we claim that both approaches are quasi-optimal, then the biggest error source is how far the histogram of the distribution is from the expected one, and any good enough function under the stated premises will lead to the same error curve for the same sequence of elements.
The current implementation uses the LogLog-Beta approach to estimate the cardinality from the HyperLogLog registers. Unfortunately, the method relies on magic constants that have been empirically determined. The formula presented in "New cardinality estimation algorithms for HyperLogLog sketches", https://arxiv.org/abs/1702.01284, has a better theoretical foundation and has already been independently verified in https://www.biorxiv.org/content/biorxiv/suppl/2018/02/09/262956.DC1/262956-1.pdf. The new implementation computes the histogram of register values first, which is then fed into the new cardinality estimation formula.