Cardinality Estimation: Bloom Filter vs. HyperLogLog

Edwin Chen edited this page Dec 17, 2017 · 2 revisions

How do Bloom filters and HyperLogLogs of the same size compare for cardinality estimation?

Run 1

The red line depicts the error (= [estimated cardinality - true cardinality] / true cardinality) of a HyperLogLog with 9 bucketing bits (= size 2^9 bytes = 4096 bits). The blue and green lines depict the error of two Bloom filters with width 4096 (= size 4096 bits = 2^9 bytes), one using 1 hash function and the other using 2 hash functions. (For the cardinality estimation algorithm I'm using, 1 hash function is easily seen to be optimal -- the other one is there just to compare.) I cut off the Bloom filters when their densities reach 100%.

You can see that the Bloom filter initially beats out HyperLogLog, but that HyperLogLog keeps going long after the Bloom filter dies out. All of this is expected from the math.

Here are two more runs on different data sequences:

Run 2

Run 3