
HyperLogSandwich

A probabilistic data structure for frequency/k-occurrence cardinality estimation of multisets. Sample implementation

[Figure: group by uuid → count → group by count]

"How many colours appear n-times?" In the figure above, we first group nodes by colour, then count each occurrence, then regroup on the frequency of those occurrences, finally counting the number of coloured nodes which appeared more than n times. Our HyperLogSandwich is an estimator on the size of each grouped bucket.

Application

This structure aims to solve the general counting problem behind the question:

Given multiset A, how many items have appeared k times?

Unlike the HyperLogLog[1], which estimates the cardinality of unique items in a set, and unlike a CountMinSketch, which estimates the frequency of a specific item, the HyperLogSandwich estimates the number of unique items that have occurred with a given frequency. While these counts could easily be computed exactly, we aim to support arbitrary composition of these structures, stream computation, and space efficiency, at the cost of (ideally low) error.
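Where exact counting is affordable, the answer is just a two-level group-by. Below is a minimal exact baseline in Python (the helper name is ours, for illustration), phrased as "k or more times" to match the estimates the structure provides:

 from collections import Counter

 def exact_k_occurrence(items, k):
     # group by item -> frequency, then count items whose frequency >= k
     freqs = Counter(items)
     return sum(1 for count in freqs.values() if count >= k)

 print(exact_k_occurrence(["a", "b", "a", "c", "a", "b"], 2))  # -> 2 ("a" and "b")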

Real world examples:

  • How many people viewed my Tweets 4+ times this month?
  • How many users watched this video 2 times this week?
  • How many users visited my website 3 times during an arbitrary time range?
  • How many log exceptions happened 10+ times last week?
  • How many people bought 4+ items this week?

Requirements / Desires

  • be significantly more space efficient than naive counting
  • support an associative binary (merge) operation

HyperLogSandwich: HyperLogLog + CountMinSketch

(k+ occurrence approximation)

[Figure: CountMinSketch index]

We use the CountMinSketch[2] to approximate the frequency of each item, then use those approximations as indices into a table of HyperLogLogs, each of which approximates the cardinality of unique items that reached that frequency in the multiset. We can arbitrarily choose how many levels of depth/frequency we care to store by simply adding more HyperLogLogs.

 // using 1-indexed arrays
 depth = d
 initialize collection of d HyperLogLogs, HLL[1]..HLL[d]
 initialize cms CountMinSketch

 // iterate through each item in the set or stream
 for each item in stream:

    // insert item into the CMS
    cms.add(item, 1)

    // fetch the current frequency approximation
    frequency = cms.query(item)

    // record this item in every level up to its approximate
    // frequency, capped at depth d
    for i in 1 .. min(d, frequency):
       // each HLL by design ignores/corrects for duplicate inserts
       HLL[i].insert(item)
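Below is a minimal runnable sketch of the same loop in Python. The HLLs come from the datasketch library (an assumption of this sketch, not part of the original design); the CountMinSketch is a toy implementation using Python's salted built-in hash in place of proper pairwise-independent hashes.

 from datasketch import HyperLogLog

 class CountMinSketch:
     # toy CMS: depth rows of width counters; query takes the min over rows
     def __init__(self, width=2048, depth=5):
         self.width, self.depth = width, depth
         self.table = [[0] * width for _ in range(depth)]

     def _cells(self, item):
         # built-in hash is salted per process, which is fine for a sketch
         return ((row, hash((row, item)) % self.width) for row in range(self.depth))

     def add(self, item, count=1):
         for row, col in self._cells(item):
             self.table[row][col] += count

     def query(self, item):
         return min(self.table[row][col] for row, col in self._cells(item))

 class HyperLogSandwich:
     def __init__(self, depth=8, hll_p=12):
         self.depth = depth
         self.cms = CountMinSketch()
         # hlls[i] estimates the number of distinct items seen (i + 1) or more times
         self.hlls = [HyperLogLog(p=hll_p) for _ in range(depth)]

     def add(self, item):
         self.cms.add(item, 1)
         frequency = self.cms.query(item)
         for i in range(min(self.depth, frequency)):
             self.hlls[i].update(item.encode("utf8"))  # duplicates are absorbed

     def count(self, k):
         # estimate of distinct items appearing k or more times
         return self.hlls[k - 1].count()

 hls = HyperLogSandwich()
 for word in ["a", "b", "a", "c", "a", "b"]:
     hls.add(word)
 print(hls.count(2))  # ~= 2 ("a" and "b" appeared 2+ times)

Since each level stores a "k or more" estimate, an "exactly k" estimate follows as hls.count(k) - hls.count(k + 1).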

Example usages:

Take n unique items and artificially introduce duplicates with the frequency distribution described below:

[Figure: HyperLogSandwich with CountMinSketch, cardinality vs. error]

Relative error vs. true cardinality. Given this data distribution, we expect error to increase as we estimate larger frequencies: the probability that an item occurs with high frequency decreases under our distribution, while the probability of mis-estimating an item's frequency increases, especially as the CountMinSketch becomes saturated.

[Figure: error vs. depth]

Tuning the CMS means choosing an appropriate ε such that 99.9% of estimations fall within ε · M of the true frequency, where M is the total stream size [3].
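Concretely, the standard sizing rule from the CMS literature [3] (a known result, not specific to this structure) derives the table dimensions from ε and the failure probability δ:

 import math

 def cms_dimensions(epsilon, delta):
     # guarantees estimate <= true + epsilon * M with probability >= 1 - delta
     width = math.ceil(math.e / epsilon)
     depth = math.ceil(math.log(1.0 / delta))
     return width, depth

 print(cms_dimensions(0.001, 0.001))  # -> (2719, 7) for the 99.9% target above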

Plotting different ε values for our CountMinSketch: as the likelihood of a high frequency decreases (and the actual cardinality at that frequency becomes smaller), our relative errors become much higher.

Associative Binary Operations

The merge or addition operation can be achieved by merging the underlying individual HyperLogLogs. This also means we can use differently sized/configured HyperLogLogs at different depths while still supporting binary operations, provided the ith HLLs of both operands match (e.g. for space efficiency when we have prior knowledge of the dataset's distribution).

 Define HLS1, HLS2
 Define HLS1.hll_i to be the ith internal HyperLogLog
 Define |HLS1.hll_i| as the cardinality estimate for items appearing i or more times
 For all i, the configuration of HLS1.hll_i matches that of HLS2.hll_i

 Then:
 Union
 |(HLS1 ∪ HLS2).hll_k| = |HLS1.hll_k ∪ HLS2.hll_k|

 Intersection
 |(HLS1 ∩ HLS2).hll_k| = |HLS1.hll_k| + |HLS2.hll_k| - |HLS1.hll_k ∪ HLS2.hll_k|

HLL Unions & Intersections (http://research.neustar.biz/2012/12/17/hll-intersections-2/)

Example 1: Storing an HLS per week, each recording occurrences of log events. HLS1 ∪ HLS2 ∪ HLS3 combines the approximations across 3 weeks and lets us query, for example, "how many log events occurred k times during this 3-week window?".

Example 2: Storing an HLS per user, each recording every phone number that user has called. HLS1 ∩ HLS2 ∩ HLS3 intersects the approximations across 3 users and lets us query, for example, "how many numbers did all 3 people call 5 times?".

Tuning

We can configure each HLL separately whilst still supporting binary operations (provided the internal configurations of each nth HLL match). This is especially important if there is some known information about the distribution of the data being stored.
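As a sketch of that idea, again reusing the Python classes above (the precision list here is hypothetical): low levels, which hold many distinct items, get more registers than deep levels.

 # hypothetical per-level precisions: more registers where cardinality is large
 level_precisions = [14, 14, 12, 12, 10, 10, 8, 8]
 hls = HyperLogSandwich(depth=len(level_precisions))
 hls.hlls = [HyperLogLog(p=p) for p in level_precisions]
 # two instances remain mergeable as long as their precision lists match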

Summary

The HyperLogSandwich is an efficient method for approximating k-occurrence cardinalities, and also a tasty snack. Further research would analyze specific space efficiencies, error propagation, and much more.


HyperLogSandwich with Bloomfilter (alternative approaches)

Bloomfilter Chain

 depth = d
 initialize collection of d HyperLogLogs, HLL[0]..HLL[d - 1]
 initialize collection of d Bloomfilters, BF[0]..BF[d - 1]

 for each item in stream:
    for i in 0 .. depth - 1:
       // every level up to the first unseen one counts this occurrence
       HLL[i].insert(item)
       if not BF[i].query(item):
          BF[i].set(item)
          break
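A minimal runnable sketch of the chain in Python, with a toy Bloomfilter (again using Python's salted built-in hash) and the datasketch HLLs assumed earlier:

 from datasketch import HyperLogLog

 class Bloomfilter:
     # toy Bloomfilter: m bits, k salted hash positions
     def __init__(self, m=1 << 16, k=4):
         self.m, self.k = m, k
         self.bits = bytearray(m // 8)

     def _positions(self, item):
         return (hash((seed, item)) % self.m for seed in range(self.k))

     def set(self, item):
         for pos in self._positions(item):
             self.bits[pos // 8] |= 1 << (pos % 8)

     def query(self, item):
         return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

 class BloomChainSandwich:
     def __init__(self, depth=8, hll_p=12):
         self.hlls = [HyperLogLog(p=hll_p) for _ in range(depth)]
         self.bfs = [Bloomfilter() for _ in range(depth)]

     def add(self, item):
         for hll, bf in zip(self.hlls, self.bfs):
             hll.update(item.encode("utf8"))
             if not bf.query(item):
                 bf.set(item)   # first arrival at this level
                 break          # deeper levels wait for later occurrences

     def count(self, k):
         # estimate of distinct items appearing k or more times
         return self.hlls[k - 1].count()

Note that a Bloomfilter false positive lets an item advance a level early, so deep frequencies can be slightly over-counted; in exchange, the chain avoids the saturation behaviour of the CMS.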