
Better Hyperloglog cardinality estimation algorithm #4749

Merged
merged 6 commits into redis:unstable from oertl:hyperloglog-improvement on Mar 16, 2018

Conversation

oertl
Contributor

oertl commented Mar 10, 2018

The current implementation uses the LogLog-Beta approach to estimate the cardinality from the HyperLogLog registers. Unfortunately, that method relies on magic constants that have been determined empirically. The formula presented in "New cardinality estimation algorithms for HyperLogLog sketches" (https://arxiv.org/abs/1702.01284) has a better theoretical foundation and has already been independently verified in https://www.biorxiv.org/content/biorxiv/suppl/2018/02/09/262956.DC1/262956-1.pdf. The new implementation first computes the histogram of register values, which is then fed into the new cardinality estimation formula.
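For reference, the paper's estimator, written in terms of the register histogram $C_k$ (the number of registers with value $k$), with $m = 2^p$ registers and $q = 64 - p$:

$$\hat{n} = \frac{\alpha_\infty m^2}{m\,\sigma(C_0/m) + \sum_{k=1}^{q} C_k\, 2^{-k} + m\,\tau(1 - C_{q+1}/m)\, 2^{-q}}, \qquad \alpha_\infty = \frac{1}{2\ln 2},$$

$$\sigma(x) = x + \sum_{k=1}^{\infty} x^{2^k}\, 2^{k-1}, \qquad \tau(x) = \frac{1}{3}\left(1 - x - \sum_{k=1}^{\infty} \left(1 - x^{2^{-k}}\right)^2 2^{-k}\right).$$

Both series converge quickly, so $\sigma$ and $\tau$ can be evaluated with a few iterations of simple loops.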

@antirez
Contributor

antirez commented Mar 14, 2018

Hello @oertl, thank you, this looks great!

I have a few questions. First: are there any performance differences? About computing the histogram of the registers first: this is interesting, as I was doing tests with deep neural networks as cardinality estimators (which may sound terrible to you if you are complaining about magic constants in LogLog-Beta 😄) in order to implement the last stage of HLL, and I quickly realized that what matters is indeed just how many registers we have with a given value; all the other information is completely meaningless. That totally makes sense to me, and it may even speed up the computation AFAIK, because the grouping stage is pure integer work and we later have fewer inputs for the floating point function.

Another thing: using the full 64 bits instead of 63 is only of theoretical interest if I understand correctly, since a very long run of the same bit value is extremely unlikely in practice, so this should not change the output.

Finally: does the PFCOUNT implementation show a smaller average error after this change? Thanks.

@andrescorrada

@oertl, I just came across your work on HyperLogLog. I am very impressed. I have a question about your dense representation. Am I correct in thinking that it leads to an enormous space saving in the HLL data structure? Why is this not more widely known? Or am I misunderstanding the dense representation?

@antirez
Contributor

antirez commented Mar 14, 2018

P.S. note that we are going to move HLLs in Redis to their own native data type in the future, no longer string-backed, so it will be possible to change the format to a better one.

@antirez
Contributor

antirez commented Mar 14, 2018

In the meantime I read the paper, so I think I have the answer about performance: it should definitely be faster AFAIK. I'm still not sure whether, now that the error curve is much smoother without any additional correction, the average error is still in the range we had before. Please note how crucial it is in Redis that we have a small error for small cardinalities: sometimes app developers will use this as a way to count user-visible interactions. The fact that the currently used algorithm is so good at almost giving the precise count for small cardinalities makes it a lot less surprising for users who still remember the exact count of certain events.

```diff
@@ -1009,7 +1009,7 @@ uint64_t hllCount(struct hllhdr *hdr, int *invalid) {
     double m = HLL_REGISTERS;
     double E;
     int j;
-    double alphaInf = 0.5 / log(2.);
+    static double alphaInf = 0.5 / log(2.);
```

This will compile on GCC but not clang:

```
hyperloglog.c:1015:34: error: initializer element is not a compile-time constant
    static double alphaInf = 0.5 / log(2.);
```
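One portable fix, as a sketch (the literal below is simply 0.5 / ln 2 precomputed, since ISO C requires static initializers to be compile-time constants):

```c
/* Portable alternative: use the numeric constant directly so the
 * static initializer is a compile-time constant, which clang enforces. */
static const double alphaInf = 0.72134752044448170368; /* = 0.5 / log(2.) */
```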

@oertl
Contributor Author

oertl commented Mar 14, 2018

@antirez, it can be shown that the histogram of register values is a sufficient statistic (https://en.wikipedia.org/wiki/Sufficient_statistic) for estimating the cardinality. This means the information loss when mapping the register values to a histogram is irrelevant for the cardinality estimate. Therefore, any good estimator should be expressible as a function of the histogram. In fact, the original cardinality estimation approach as well as LogLog-Beta can also be expressed as functions of this histogram, if equal terms in the formulas are collected. Calculating the histogram first and then calculating the estimate in a second step (regardless of the estimation method used) reduces the number of floating-point operations and hence numerical errors.
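A minimal sketch of the histogram pass, assuming the dense encoding and the HLL_DENSE_GET_REGISTER macro from hyperloglog.c (the actual implementation must also handle the sparse encoding; `registers` here stands for the dense register array):

```c
/* reghisto[k] counts how many of the HLL_REGISTERS registers hold
 * value k; register values never exceed HLL_Q+1, so 64 slots suffice. */
int reghisto[64] = {0};
for (int j = 0; j < HLL_REGISTERS; j++) {
    unsigned long reg;
    HLL_DENSE_GET_REGISTER(reg, registers, j);
    reghisto[reg]++;
}
```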

There will probably not be any significant difference in the estimation error between LogLog-Beta and the new approach. LogLog-Beta assumes a function of a certain shape and relies on some fitting parameters that have been empirically determined for each p (the number of registers being m = 2^p). This means that you have to trust that the empirical fitting and verification have been conducted thoroughly. LogLog-Beta also requires a call to the log function, which is probably more expensive than calling hllSigma. Note that hllTau is usually irrelevant for 64-bit hashes, because it is very unlikely that any register reaches the maximum register value, which is equal to (q+1). Therefore, hllTau is almost always left via the first return statement.
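For context, a sketch of the estimator evaluation in the shape used by the new hllCount() (reghisto, hllSigma, hllTau, and HLL_Q are the names in hyperloglog.c; Ck below denotes reghisto[k]):

```c
/* Evaluates E = alphaInf * m^2 / ( m*sigma(C0/m)
 *                                 + sum_{k=1..q} Ck*2^-k
 *                                 + m*tau((m - C_{q+1})/m)*2^-q ).
 * The loop accumulates the middle sum Horner-style: each pass adds
 * one histogram bucket and halves the running total. */
double z = m * hllTau((m - reghisto[HLL_Q + 1]) / (double)m);
for (j = HLL_Q; j >= 1; --j) {
    z += reghisto[j];
    z *= 0.5;
}
z += m * hllSigma(reghisto[0] / (double)m);
E = llroundl(alphaInf * m * m / z);
```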

It is true that using 64 instead of 63 bits of the hash value is more of a cosmetic issue.

@andrescorrada, the new approach does not change the representation of individual register values. The individual registers are necessary for the add and merge operations. The transformation to a histogram is only lossless with regard to cardinality estimation.

@antirez antirez merged commit 15d7e61 into redis:unstable Mar 16, 2018
antirez added a commit that referenced this pull request Mar 16, 2018
@antirez
Contributor

antirez commented Mar 16, 2018

Hello @oertl, I ran several tests against the old and new implementation. If it were not for the fact that the new implementation is around 20% faster, I could not tell the difference: the output is consistently identical, even though it uses a completely different estimator. This is a very good sign, because it means that both functions are doing the best they can based on the frequency histograms (implicit or explicit) they have. However, your implementation is faster, in certain ways simpler, and has sounder scientific motivations, so I decided to merge the PR. Thank you for your contribution! I hope you'll think of Redis again in case you publish new results.

@michael-grunder
Contributor

michael-grunder commented Mar 16, 2018

@antirez That's interesting. I was able to see differences but perhaps I was doing something wrong 😄

Where I saw differences the PR estimations were usually more accurate.

My quick and dirty script used a list of about 300,000 English words and did a PFADD on an incrementing number of them. Adding 1000 words at a time yielded this result:

```
15d7e617 was better 299 times
d8207d09 was better  64 times
They were identical   8 times

# Literally just sum(error%)/entries
d8207d09 overall error: 0.583769
15d7e617 overall error: 0.573327
```
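A hypothetical reconstruction of that kind of test in C with hiredis (the original script isn't shown; the synthetic "word-N" strings, the key name, and the step size are stand-ins):

```c
/* Hypothetical sketch: PFADD items one at a time and print the PFCOUNT
 * error every 1000 insertions. Build with -lhiredis -lm. */
#include <stdio.h>
#include <math.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) return 1;
    char word[64];
    for (long n = 1; n <= 300000; n++) {
        snprintf(word, sizeof(word), "word-%ld", n); /* stand-in word list */
        freeReplyObject(redisCommand(c, "PFADD hlltest %s", word));
        if (n % 1000 == 0) {
            redisReply *r = redisCommand(c, "PFCOUNT hlltest");
            double errpct = 100.0 * fabs((double)r->integer - (double)n) / (double)n;
            printf("n=%ld estimate=%lld error=%.4f%%\n", n, r->integer, errpct);
            freeReplyObject(r);
        }
    }
    redisFree(c);
    return 0;
}
```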

@oertl oertl deleted the hyperloglog-improvement branch March 16, 2018 18:43
@antirez
Contributor

antirez commented Mar 16, 2018

Michael, that's odd! I would say it's worth investigating, but since your results show improved precision... not sure. BTW, I used the scripts that are shipped with Redis. The results looked a bit too similar to be honest, but if we claim that both approaches are quasi-optimal, then the biggest error source is how far the observed register histogram is from the expected one, and any good enough function, under the stated premises, will lead to the same error curve for the same sequence of elements.
