Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large errors hll++ at high cardinality #10

Closed
nieksand opened this issue Mar 24, 2015 · 5 comments
Closed

Large errors hll++ at high cardinality #10

nieksand opened this issue Mar 24, 2015 · 5 comments

Comments

@nieksand
Copy link
Contributor

Once I set my cardinality to 100k using hll++, the error rate becomes massive:

approx:  10506
actual:  100000
delta:   89494

My sample code is pretty much cribbed from your test/bench code. Set the truth variable to 1000 and you get exact result. Things look good in the 10k range too. But at 100k... boom.

Complete repro code:

package main

import (
    "github.com/clarkduvall/hyperloglog"
    "fmt"
    "hash/fnv"
    "hash"
)

func hash64(s string) hash.Hash64 {
    h := fnv.New64()
    h.Write([]byte(s))
    return h
}

func main() {
    truth := 100000

    h, err := hyperloglog.NewPlus(16);
    if err != nil {
        fmt.Println(err)
        return
    }

    for i := 0; i < truth; i++ {
            istr := fmt.Sprintf("%s", i)
            h.Add(hash64(istr))
    }
    epp := h.Count()

    fmt.Println("approx: ", epp)
    fmt.Println("actual: ", truth)
    fmt.Println("delta:  ", int64(truth)-int64(h.Count()))
}
@clarkduvall
Copy link
Owner

Nice find, I'll take a look. Thanks for the repro code!

@clarkduvall
Copy link
Owner

When debugging this, I found a strangely large number of similar hashes. If you change the hashing line from:

istr := fmt.Sprintf("%s", i)

to

istr := fmt.Sprintf("%s", i*384174)

the results are much better (within a couple hundred). I'm wondering if the fnv hash doesn't handle similar values very well for HLL purposes.

On another note, the reason it explodes after a certain number is because it switches from the sparse representation to the normal representation. The sparse representation doesn't require the hash values to be as well distributed, so the problem doesn't show up until the normal representation is used. Looks like the crossover point is around 59040.

@nieksand
Copy link
Contributor Author

I think you're right. It's the hashing function.

I switched to fnv64a:

    h := fnv.New64a()

and results look fine:

approx:  103611
actual:  100000
delta:   -3611

Apparently that variant has better avalanche properties: http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function#FNV-1a_hash

@nieksand
Copy link
Contributor Author

I think this can be closed. It may be worth-while to update the examples/benchmarks to use the a variant of fnv instead.

@nieksand
Copy link
Contributor Author

I popped a PR updating benchmark so that nobody else makes the same mistake I did:

#11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants