Large errors hll++ at high cardinality #10
Comments
Nice find, I'll take a look. Thanks for the repro code!
When debugging this, I found a strangely large number of similar hashes. If you change the hashing line from `istr := fmt.Sprintf("%s", i)` to `istr := fmt.Sprintf("%s", i*384174)`, the results are much better (within a couple hundred). I'm wondering if the fnv hash doesn't handle similar values very well for HLL purposes.
On another note, the reason it explodes after a certain number is that it switches from the sparse representation to the normal representation. The sparse representation doesn't require the hash values to be as well distributed, so the problem doesn't show up until the normal representation is used. Looks like the crossover point is around 59040.
I think you're right. It's the hashing function. I switched to fnv64a:
and results look fine:
Apparently that variant has better avalanche properties: http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function#FNV-1a_hash
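To make the avalanche point concrete, here is a minimal sketch that uses only the Go standard library (`hash/fnv`, `math/bits`) and not the hyperloglog package itself. The helper names `fnv1`, `fnv1a` and `diffBits` are made up for illustration, and the two inputs are arbitrary neighbouring decimal strings rather than the values from the repro. It counts how many bits change in each hash when only the last input byte changes.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
)

// fnv1 hashes s with 64-bit FNV-1 (multiply by the prime, then XOR each byte).
func fnv1(s string) uint64 {
	h := fnv.New64()
	h.Write([]byte(s))
	return h.Sum64()
}

// fnv1a hashes s with 64-bit FNV-1a (XOR each byte, then multiply by the prime).
func fnv1a(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// diffBits counts the bit positions in which a and b differ.
func diffBits(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	// Two inputs that differ only in their final byte (illustrative values).
	a, b := "59040", "59041"

	fmt.Printf("FNV-1  differing bits: %d\n", diffBits(fnv1(a), fnv1(b)))
	fmt.Printf("FNV-1a differing bits: %d\n", diffBits(fnv1a(a), fnv1a(b)))

	// With FNV-1 the last byte is XORed in after the final multiply,
	// so everything above the low 8 bits is unchanged.
	fmt.Printf("FNV-1  upper 56 bits equal: %v\n", fnv1(a)>>8 == fnv1(b)>>8)
}
```

With FNV-1 the final step XORs the last input byte into the state, so two inputs that differ only in that byte produce hashes that differ only in the low 8 bits. FNV-1a multiplies after the XOR, which spreads the change to higher bits. HyperLogLog-style sketches usually take the register index from a fixed slice of the hash bits, so hashes that agree in most positions would plausibly pile into few registers, which is consistent with the behaviour described in this thread.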
I think this can be closed. It may be worthwhile to update the examples/benchmarks to use the
I opened a PR updating the benchmark so that nobody else makes the same mistake I did:
Once I set my cardinality to 100k using hll++, the error rate becomes massive:
My sample code is pretty much cribbed from your test/bench code. Set the `truth` variable to 1000 and you get an exact result. Things look good in the 10k range too. But at 100k... boom.
Complete repro code: