Improved intersection error note in README. #2

ghost · 2013-01-09T22:35:54Z

Just wrote a blog post on intersection error in HLLs and then came across your implementation. Thought it might help to have a slightly more elaborate description of the factors contributing to the intersection error.

One really important thing to emphasize is that intersection error grows explosively when intersecting more than three sets. The reason is fairly straightforward: the intersection cardinality can only shrink (or stay the same) as more sets are added, while the number of terms (and hence propagated error) in the inclusion-exclusion formula grows exponentially in the number of sets: 2^n - 1 terms for n sets.
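To make the term count concrete, here is a minimal sketch in Python that applies inclusion-exclusion to exact sets (not HyperLogLog counters) and counts how many union terms it needs. The function name `intersection_size_via_unions` and the sample sets are made up for illustration; with real counters, each `len(...)` would be replaced by a union estimate carrying its own error.

```python
from itertools import combinations

def intersection_size_via_unions(sets):
    """Compute |S_1 ∩ ... ∩ S_n| from union sizes only, via inclusion-exclusion.

    Returns (intersection_size, number_of_union_terms).
    """
    n = len(sets)
    total = 0
    terms = 0
    for k in range(1, n + 1):
        # Odd-sized subfamilies are added, even-sized ones are subtracted.
        sign = 1 if k % 2 == 1 else -1
        for subset in combinations(sets, k):
            total += sign * len(set().union(*subset))
            terms += 1
    return total, terms

# Hypothetical example data: four overlapping integer ranges.
sets = [set(range(0, 60)), set(range(30, 90)), set(range(45, 120)), set(range(50, 200))]
size, terms = intersection_size_via_unions(sets)
print(size, terms)
```

For four sets the formula already sums 15 union sizes, so a small intersection is recovered as the difference of many much larger numbers.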

Cheers!

aaw · 2013-01-10T17:32:27Z

Hi Timon,

Thanks for linking to your blog post, it was an interesting read!

Here's my conjecture about the true error bounds on intersections, stated a little more rigorously than I did in the README:

Let epsilon(b) = 1.04/sqrt(2^b), the relative error that you get with high probability via HyperLogLog counters with parameter b. Given a family of sets S_i with union U and intersection N that are observed via HyperLogLog counters with parameter b, the estimate of |N| obtained by using the inclusion/exclusion formula to combine union estimates has an absolute error of |U|*epsilon(b), with high probability.
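Written out in notation, the conjecture above reads roughly as follows. The hat notation for estimates and the index set J are my own shorthand, not something from the README, and the last line is the conjectured bound, not a proven result.

```latex
\epsilon(b) = \frac{1.04}{\sqrt{2^{b}}}

% Inclusion/exclusion estimate of the intersection from union estimates:
\widehat{|N|} = \sum_{\emptyset \neq J \subseteq \{1,\dots,n\}}
    (-1)^{|J|+1} \, \widehat{\Bigl|\bigcup_{j \in J} S_j\Bigr|}

% Conjectured absolute error, with high probability:
\Bigl|\, \widehat{|N|} - |N| \,\Bigr| \;\lesssim\; \epsilon(b) \, |U|
```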

So if you have three sets A, B, and C with a union that has 1 billion elements, I'm claiming that the inclusion/exclusion estimate that you get for the size of the intersection by combining HyperLogLog counters for A, B, and C will have absolute error epsilon(b) * 1 billion. Since epsilon(11) is about 2.3%, the absolute error of the intersection estimate can be around 23 million.
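The arithmetic in that example can be checked with a few lines of Python. This is only a sketch of the conjectured bound stated above; the function names `hll_relative_error` and `conjectured_intersection_abs_error` are hypothetical and not part of any library.

```python
import math

def hll_relative_error(b):
    """epsilon(b) = 1.04 / sqrt(2^b), the usual HyperLogLog relative error."""
    return 1.04 / math.sqrt(2 ** b)

def conjectured_intersection_abs_error(union_size, b):
    """Conjectured absolute error of the intersection estimate: |U| * epsilon(b)."""
    return union_size * hll_relative_error(b)

eps = hll_relative_error(11)
err = conjectured_intersection_abs_error(1_000_000_000, 11)
print(f"epsilon(11) = {eps:.4f}")
print(f"conjectured absolute error = {err:,.0f}")
```

With b = 11 this gives a relative error of roughly 2.3%, so for a union of 1 billion elements the conjectured absolute error is on the order of 23 million, regardless of how small the true intersection is.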

This is a little different than what you've concluded, since it doesn't look like any of your error measurements involve the size of the union. In particular, I don't believe that the error of HyperLogLog intersections computed via inclusion/exclusion is a function of the size of the intersection, I think it's entirely a function of the size of the union. So saying something like "the intersection error grows explosively when intersecting more than three sets" isn't necessarily true - it's true in the case that the size of the intersection is much smaller than the size of the union, but it's a little misleading when compared to the bounds I'm proposing.

Timon Karnezos added 2 commits January 9, 2013 14:22

Updated README with more accurate intersection error note.

a253acd

Added example intersection error calculation.

25609ec

aaw closed this Jun 7, 2014
