# aaw/hyperloglog-redis

# Improved intersection error note in README. #2

Open: 2 commits, +8 −6

Just wrote a blog post on intersection error in HLLs and then came across your implementation. Thought it might help to have a slightly more elaborate description of the factors contributing to the intersection error.

One really important thing to emphasize is that intersection error grows explosively when intersecting more than three sets. The reason is fairly straightforward: the intersection cardinality can only shrink as more sets are added, while the number of terms (and hence propagated error) in the inclusion-exclusion formula grows exponentially in the number of sets (2^n - 1 terms for n sets).
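To make the term count concrete, here is a minimal Python sketch of the inclusion-exclusion identity that recovers an intersection size from union sizes only. The function names are hypothetical and are not part of hyperloglog-redis; exact Python sets stand in for HyperLogLog counters, so `union_size` is the place where a counter's union estimate (and its error) would enter.

```python
from itertools import combinations


def exact_union_size(group):
    """Stand-in for a HyperLogLog union estimate: exact size of the union."""
    return len(set().union(*group))


def intersection_from_unions(sets, union_size=exact_union_size):
    """Return (intersection size, number of union terms used).

    Uses the identity
        |S_1 & ... & S_n| = sum over non-empty groups G of
                            (-1)^(|G|+1) * |union of G|
    so every one of the 2^n - 1 groups contributes one union term.
    """
    total = 0
    terms = 0
    for k in range(1, len(sets) + 1):
        sign = 1 if k % 2 == 1 else -1
        for group in combinations(sets, k):
            total += sign * union_size(group)
            terms += 1
    return total, terms


a = set(range(0, 60))
b = set(range(30, 90))
c = set(range(50, 100))
print(intersection_from_unions([a, b, c]))  # (10, 7)
```

With exact sizes the result is exact. With estimated sizes, each of the 7 terms (15 for four sets, 31 for five) carries its own error into the sum.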

Cheers!

added some commits January 09, 2013

* timonk `Updated README with more accurate intersection error note.` `a253acd`
* timonk `Added example intersection error calculation.` `25609ec`

Owner commented January 10, 2013

Hi Timon,

Here's my conjecture about the true error bounds on intersections, stated a little more rigorously than I did in the README:

Let epsilon(b) = 1.04/sqrt(2^b), the relative error that you get with high probability via HyperLogLog counters with parameter b. Given a family of sets S_i with union U and intersection N that are observed via HyperLogLog counters with parameter b, the estimate of |N| obtained by using the inclusion/exclusion formula to combine union estimates has an absolute error of |U|*epsilon(b), with high probability.

So if you have three sets A, B, and C with a union that has 1 billion elements, I'm claiming that the inclusion/exclusion estimate that you get for the size of the intersection by combining HyperLogLog counters for A, B, and C will have absolute error epsilon(b) * 1 billion. Since epsilon(11) is about 2.3%, the absolute error of the intersection estimate can be around 23 million.
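The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the conjectured bound as stated above; the function names are hypothetical and do not come from the library.

```python
import math


def epsilon(b):
    """Relative error of a HyperLogLog counter with parameter b: 1.04 / sqrt(2^b)."""
    return 1.04 / math.sqrt(2 ** b)


def conjectured_abs_error(union_size, b):
    """Conjectured absolute error of the intersection estimate: |U| * epsilon(b)."""
    return union_size * epsilon(b)


print(epsilon(11))                                  # roughly 0.023
print(conjectured_abs_error(1_000_000_000, 11))     # roughly 23 million
```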

This is a little different from what you've concluded, since it doesn't look like any of your error measurements involve the size of the union. In particular, I don't believe that the error of HyperLogLog intersections computed via inclusion/exclusion is a function of the size of the intersection; I think it's entirely a function of the size of the union. So saying something like "the intersection error grows explosively when intersecting more than three sets" isn't necessarily true. It's true when the size of the intersection is much smaller than the size of the union, but it's a little misleading when compared to the bounds I'm proposing.
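One way to see how the two views relate: if the absolute error really is about |U| * epsilon(b), then the error relative to the true intersection is that quantity divided by |N|, which blows up as the intersection shrinks relative to the union. The sketch below assumes the conjecture holds; the function name and the sample sizes are illustrative only.

```python
import math


def relative_intersection_error(union_size, intersection_size, b):
    """Error relative to |N|, assuming absolute error |U| * epsilon(b)."""
    eps = 1.04 / math.sqrt(2 ** b)
    return union_size * eps / intersection_size


union = 1_000_000_000
for intersection in (500_000_000, 50_000_000, 1_000_000):
    rel = relative_intersection_error(union, intersection, b=11)
    print(f"|N| = {intersection:>11,}: relative error about {rel:.1%}")
```

For a fixed union, the absolute error stays the same in all three cases; only the ratio |U| / |N| changes the relative error.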

Showing 2 unique commits by 1 author.


    @@ -116,12 +116,14 @@ the union of counters is lossless in the sense that you end up with the same cou
     you would have arrived at had you observed the union of all of the individual events.
     
     * For an intersection of counters, there's no good theoretical bound on the relative
    -error. In practice, and especially for intersections involving a small number of sets,
    -the relative error you obtain tends to be in relation to the size of the union of the
    -sets involved. For example, if you have two sets, each of cardinality 5000 and observe
    -both sets through HyperLogLog counters with parameter b=10 (3% relative error), you can
    -expect the intersection estimate to be within 10000 * 0.03 = 300 of the actual intersection
    -size.
    +error. In practice, the relative error is largely a function of the relative size of
    +the sets, the amount they overlap, and the number of sets being intersected. If the
    +error of any term in the inclusion-exclusion formula is as large as the intersection
    +cardinality, then the estimate will be useless. For the best results, intersect only
    +two or three sets of roughly the same size. For instance, given two sets whose
    +cardinalities are within one order of magnitude and whose intersection is roughly 10%
    +of the smaller set, the error (relative to the true intersection cardinality) would be
    +about 10-30%.
     
     * For time queries, the relative error applies to the size of the set within the time
     range you've queried. For example, given a set of cardinality 1,000,000 that has had