Improved intersection error note in README. #2

Open
wants to merge 2 commits into from

2 participants

Timon Karnezos Aaron Windsor
Timon Karnezos

Just wrote a blog post on intersection error in HLLs and then came across your implementation. Thought it might help to have a slightly more elaborate description of the factors contributing to the intersection error.

One really important thing to emphasize is that intersection error grows explosively when intersecting more than three sets. The reason is fairly straightforward: the intersection cardinality can only shrink as more sets are added, while the number of terms (and hence propagated error) in the inclusion-exclusion formula grows exponentially in the number of sets (2^n - 1 terms for n sets).
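To make the term count concrete, here is a minimal sketch that computes an intersection size purely from union sizes, which is the only operation HyperLogLog counters support directly. It assumes exact Python sets in place of counters, so there is no estimation error here; the function name `intersection_via_unions` is made up for illustration and is not part of this library.

```python
from itertools import combinations

def intersection_via_unions(sets):
    """Return (intersection size, number of union terms used).

    Uses inclusion-exclusion over unions: for every non-empty subset S of
    the input sets, add (-1)^(|S|+1) * |union of the sets in S|.
    """
    total = 0
    terms = 0
    for k in range(1, len(sets) + 1):
        sign = 1 if k % 2 == 1 else -1
        for subset in combinations(sets, k):
            total += sign * len(set().union(*subset))
            terms += 1
    return total, terms

# Exact sets standing in for HyperLogLog counters (an assumption of this sketch).
a = set(range(0, 60))
b = set(range(30, 90))
c = set(range(45, 120))
d = set(range(50, 200))

size3, terms3 = intersection_via_unions([a, b, c])
size4, terms4 = intersection_via_unions([a, b, c, d])
print(size3, terms3)
print(size4, terms4)
```

With real counters every one of those union terms carries its own estimation error, so going from three sets (7 terms) to four sets (15 terms) roughly doubles the number of error sources while the quantity being estimated gets no larger.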

Cheers!

Aaron Windsor
Owner
aaw commented January 10, 2013

Hi Timon,

Thanks for linking to your blog post, it was an interesting read!

Here's my conjecture about the true error bounds on intersections, stated a little more rigorously than I did in the README:

Let epsilon(b) = 1.04/sqrt(2^b), the relative error that you get with high probability via HyperLogLog counters with parameter b. Given a family of sets S_i with union U and intersection N that are observed via HyperLogLog counters with parameter b, the estimate of |N| obtained by using the inclusion/exclusion formula to combine union estimates has an absolute error of |U|*epsilon(b), with high probability.

So if you have three sets A, B, and C with a union that has 1 billion elements, I'm claiming that the inclusion/exclusion estimate that you get for the size of the intersection by combining HyperLogLog counters for A, B, and C will have absolute error epsilon(b) * 1 billion. Since epsilon(11) is around 2%, the absolute error of the intersection estimate can be around 20 million.
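The arithmetic in this example can be checked with a short sketch. It only evaluates the formula stated above; the function names are hypothetical, and the bound itself is a conjecture rather than a proven result.

```python
import math

def epsilon(b):
    # Relative error of a HyperLogLog counter with parameter b,
    # as defined in the comment above: 1.04 / sqrt(2^b).
    return 1.04 / math.sqrt(2 ** b)

def conjectured_abs_error(union_size, b):
    # Conjectured absolute error of the intersection estimate: |U| * epsilon(b).
    return union_size * epsilon(b)

print(f"epsilon(11) = {epsilon(11):.4f}")
print(f"absolute error for |U| = 1 billion: {conjectured_abs_error(1_000_000_000, 11):,.0f}")
```

For b=11 this gives a relative error of roughly 2.3% and an absolute error of roughly 23 million, in line with the "around 2%" and "around 20 million" figures quoted above.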

This is a little different than what you've concluded, since it doesn't look like any of your error measurements involve the size of the union. In particular, I don't believe that the error of HyperLogLog intersections computed via inclusion/exclusion is a function of the size of the intersection, I think it's entirely a function of the size of the union. So saying something like "the intersection error grows explosively when intersecting more than three sets" isn't necessarily true - it's true in the case that the size of the intersection is much smaller than the size of the union, but it's a little misleading when compared to the bounds I'm proposing.
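One way to see how the two views relate is to divide the conjectured absolute error by the true intersection size. The sketch below does that for a fixed union of 1 billion elements and b=11. It assumes the conjectured bound |U|*epsilon(b) holds; the function names and the intersection sizes are illustrative, not measurements.

```python
import math

def epsilon(b):
    # Relative error from the comment above: 1.04 / sqrt(2^b).
    return 1.04 / math.sqrt(2 ** b)

def relative_error_vs_intersection(union_size, intersection_size, b):
    # Conjectured absolute error |U| * epsilon(b), expressed as a fraction
    # of the true intersection size |N|.
    return union_size * epsilon(b) / intersection_size

union_size = 1_000_000_000
for intersection_size in (500_000_000, 100_000_000, 10_000_000):
    rel = relative_error_vs_intersection(union_size, intersection_size, 11)
    print(f"|N| = {intersection_size:>11,}: error is about {rel:.0%} of |N|")
```

Under this assumption the absolute error stays the same in all three cases, while the error relative to |N| climbs from about 5% to about 23% to about 230% as the intersection shrinks from half of the union to one hundredth of it.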


Showing 1 changed file with 8 additions and 6 deletions.

README.md
@@ -116,12 +116,14 @@ the union of counters is lossless in the sense that you end up with the same cou
 you would have arrived at had you observed the union of all of the individual events.
 
 * For an intersection of counters, there's no good theoretical bound on the relative
-error. In practice, and especially for intersections involving a small number of sets,
-the relative error you obtain tends to be in relation to the size of the union of the
-sets involved. For example, if you have two sets, each of cardinality 5000 and observe
-both sets through HyperLogLog counters with parameter b=10 (3% relative error), you can
-expect the intersection estimate to be within 10000 * 0.03 = 300 of the actual intersection
-size.
+error. In practice, the relative error is largely a function of the relative size of
+the sets, the amount they overlap, and the number of sets being intersected. If the
+error of any term in the inclusion-exclusion formula is as large as the intersection
+cardinality, then the estimate will be useless. For the best results, intersect only
+two or three sets of roughly the same size. For instance, given two sets whose
+cardinalities are within one order of magnitude and whose intersection is roughly 10%
+of the smaller set, the error (relative to the true intersection cardinality) would be
+about 10-30%.
 
 * For time queries, the relative error applies to the size of the set within the time
 range you've queried. For example, given a set of cardinality 1,000,000 that has had