@ghost ghost commented Jan 9, 2013

Just wrote a blog post on intersection error in HLLs and then came across your implementation. Thought it might help to have a slightly more elaborate description of the factors contributing to the intersection error.

One really important thing to emphasize is that intersection error grows explosively when intersecting more than three sets. The reason is fairly straightforward: the intersection cardinality can only shrink as more sets are added, while the number of terms in the inclusion-exclusion formula (and hence the propagated error) grows exponentially in the number of sets.
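To make the term count concrete, here is a minimal sketch of the inclusion-exclusion identity that recovers an intersection size from union sizes. It uses exact Python sets as a stand-in for HyperLogLog union estimates, and the function name `intersection_from_unions` is hypothetical, not part of this repository. With n sets the sum has 2^n - 1 terms, each of which would carry its own estimation error in the HLL setting.

```python
from itertools import combinations

def intersection_from_unions(sets):
    """Return (intersection size, number of terms) via inclusion-exclusion.

    Uses |A1 n ... n An| = sum over non-empty subsets T of
    (-1)^(|T|+1) * |union of the sets in T|.
    Exact set sizes stand in for HyperLogLog union estimates (assumption).
    """
    n = len(sets)
    total = 0
    terms = 0
    for k in range(1, n + 1):
        sign = 1 if k % 2 == 1 else -1
        for subset in combinations(sets, k):
            total += sign * len(set().union(*subset))
            terms += 1
    return total, terms

a, b, c = {1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6}
size, terms = intersection_from_unions([a, b, c])
print(size, terms)  # 1 7
```

For three sets that is 7 union terms; for five sets it would already be 31.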

Cheers!

aaw (Owner) commented Jan 10, 2013

Hi Timon,

Thanks for linking to your blog post, it was an interesting read!

Here's my conjecture about the true error bounds on intersections, stated a little more rigorously than I did in the README:

Let epsilon(b) = 1.04/sqrt(2^b), the relative error that you get with high probability via HyperLogLog counters with parameter b. Given a family of sets S_i with union U and intersection N that are observed via HyperLogLog counters with parameter b, the estimate of |N| obtained by using the inclusion/exclusion formula to combine union estimates has an absolute error of |U|*epsilon(b), with high probability.

So if you have three sets A, B, and C with a union that has 1 billion elements, I'm claiming that the inclusion/exclusion estimate that you get for the size of the intersection by combining HyperLogLog counters for A, B, and C will have absolute error epsilon(b) * 1 billion. Since epsilon(11) is about 2.3%, the absolute error of the intersection estimate can be around 23 million.
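The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the conjectured bound as stated above; the function names `epsilon` and `conjectured_abs_error` are made up for illustration.

```python
import math

def epsilon(b):
    """Relative error 1.04 / sqrt(2^b) for HyperLogLog parameter b."""
    return 1.04 / math.sqrt(2 ** b)

def conjectured_abs_error(union_size, b):
    """Conjectured absolute error of the intersection estimate: |U| * epsilon(b)."""
    return union_size * epsilon(b)

print(round(epsilon(11), 4))                    # 0.023
print(round(conjectured_abs_error(10 ** 9, 11)))
```

With b = 11 and a union of 1 billion elements, the second value comes out in the low tens of millions, in line with the estimate above.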

This is a little different than what you've concluded, since it doesn't look like any of your error measurements involve the size of the union. In particular, I don't believe that the error of HyperLogLog intersections computed via inclusion/exclusion is a function of the size of the intersection, I think it's entirely a function of the size of the union. So saying something like "the intersection error grows explosively when intersecting more than three sets" isn't necessarily true - it's true in the case that the size of the intersection is much smaller than the size of the union, but it's a little misleading when compared to the bounds I'm proposing.
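A small sketch of why the two views can look so different: if the absolute error is roughly |U|*epsilon(b), as conjectured, then the error relative to the intersection scales with the ratio |U|/|N|. The function name `relative_error_bound` and the example sizes are assumptions chosen for illustration.

```python
import math

def relative_error_bound(union_size, intersection_size, b):
    """Conjectured absolute error |U| * epsilon(b), expressed relative to |N|."""
    eps = 1.04 / math.sqrt(2 ** b)
    return union_size * eps / intersection_size

# Same union (1 billion elements), same b, very different intersections.
large_overlap = relative_error_bound(10 ** 9, 5 * 10 ** 8, 11)
small_overlap = relative_error_bound(10 ** 9, 10 ** 6, 11)
print(f"{large_overlap:.3f} {small_overlap:.1f}")
```

When the intersection is half the union, the relative error stays at a few percent; when it is a thousandth of the union, the conjectured error is many times larger than the intersection itself.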

@aaw aaw closed this Jun 7, 2014