A probabilistic data structure for frequency/k-occurrence cardinality estimation of multisets. Sample implementation
"How many colours appear n times?" In the figure above, we first group nodes by colour, then count each colour's occurrences, then regroup on the frequency of those occurrences, finally counting the number of colours which appeared more than n times. Our HyperLogSandwich is an estimator of the size of each grouped bucket.
This aims to solve the general counting problem for the question:
Given multiset A, how many items have appeared k times?
Unlike the HyperLogLog, which estimates the cardinality of unique items in a set, and unlike a CountMinSketch, which estimates the frequency of a specific item, the HyperLogSandwich estimates the number of unique items that have occurred at any given frequency. While these exact counts could easily be computed, we aim to support arbitrary composition of these structures, streaming computation, and space efficiency, at the cost of (ideally low) error rates.
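For reference, the exact version of the question is easy to state with a naive counter. This sketch (the function name and sample stream are illustrative, not from the original) computes the ground truth the HyperLogSandwich approximates:

```python
from collections import Counter

def exact_k_plus_counts(stream, max_k):
    """Exact baseline: for each k, count the distinct items that
    appeared k or more times in the stream."""
    freqs = Counter(stream)
    return {k: sum(1 for c in freqs.values() if c >= k)
            for k in range(1, max_k + 1)}

# "a" appears 3 times, "b" twice, "c" once
stream = ["a", "b", "a", "c", "a", "b"]
counts = exact_k_plus_counts(stream, 3)
```

The cost of this baseline is a full hash table over all distinct items, which is exactly the space the probabilistic version avoids.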
Real world examples:
- How many people viewed my Tweets 4+ times this month?
- How many users watched this video 2 times this week?
- How many users visited my website 3 times during an arbitrary time range?
- How many log exceptions happened 10+ times last week?
- How many people bought 4+ items this week?
Requirements / Desires
- significantly more space efficient than naive counting
- support associative binary operation
HyperLogSandwich: HyperLogLog + CountMinSketch
(k+ occurrence approximation)
Using the CountMinSketch to approximate the frequency of each item, we then use those approximations as indices into a table of HyperLogLogs, each of which approximates the cardinality of unique items that reached that frequency in the multiset. We can arbitrarily choose how many levels of depth/frequency we care to store by simply adding more HyperLogLogs.
```
// using 1-indexed arrays
depth = d
initialize collection of d HyperLogLogs, HLL[1]..HLL[d]
initialize cms, a CountMinSketch

// iterate through each item in the set or stream
for each item in stream:
    // insert the item into the CMS
    cms.count(item, 1)
    // fetch the current frequency approximation
    frequency = cms.query(item)
    // record this item in every "i or more occurrences" level it has reached
    for i in 1 .. min(d, frequency):
        // each HLL by design will ignore/correct for duplicate inserts
        HLL[i].insert(item)
```
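The pseudocode above can be sketched as runnable Python. This is an illustrative sketch, not the original implementation: the CountMinSketch is a minimal hash-row version, and plain Python sets stand in for the HyperLogLogs (both structures share the property that duplicate inserts are no-ops), so the example stays exact and self-contained:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: depth hash rows of width counters;
    query returns the minimum counter, an over-estimate of frequency."""
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def count(self, item, n=1):
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += n

    def query(self, item):
        return min(self.rows[row][self._index(item, row)]
                   for row in range(self.depth))

class HyperLogSandwich:
    """Sketch of the structure above; sets stand in for HyperLogLogs."""
    def __init__(self, depth):
        self.depth = depth
        self.cms = CountMinSketch()
        self.hlls = [set() for _ in range(depth)]  # hlls[i] ~ "i+1 or more"

    def insert(self, item):
        self.cms.count(item, 1)
        frequency = self.cms.query(item)
        for i in range(min(self.depth, frequency)):
            self.hlls[i].add(item)  # duplicate adds are no-ops, like an HLL

    def k_plus_cardinality(self, k):
        return len(self.hlls[k - 1])

hls = HyperLogSandwich(depth=4)
for item in ["a", "b", "a", "c", "a", "b"]:
    hls.insert(item)
```

With a real HyperLogLog in place of each set, `k_plus_cardinality` becomes an estimate rather than an exact count, and the memory footprint becomes independent of the number of distinct items.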
Take n unique items - and artificially introduce duplicates with the frequency distribution as described below:
Relative error vs. True Cardinality. Given this data distribution, we expect to see error increase as we start estimating larger frequencies. The probability an item has occurred with high frequency decreases given our distribution, however the probability for a mis-estimation of item frequency increases, especially as the CountMinSketch becomes saturated.
The CMS must be tuned by choosing a suitable ε such that 99.9% of estimations fall within ε * M of the true frequency, where M is the total stream size.
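The standard Count-Min Sketch sizing rule connects ε and the failure probability δ to concrete dimensions: width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉ guarantee that a query overshoots the true frequency by more than ε * M with probability at most δ. A small sketch of that calculation (the function name is illustrative):

```python
import math

def cms_dimensions(epsilon, delta):
    """Standard Count-Min sizing: width w = ceil(e / epsilon) and
    depth d = ceil(ln(1 / delta)) bound the over-estimate by
    epsilon * M with probability at least 1 - delta."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth

# e.g. 99.9% of estimates within 0.001 * M of the true frequency
w, d = cms_dimensions(epsilon=0.001, delta=0.001)
```

Note the trade-off this makes explicit: halving ε doubles the width, so tight frequency guarantees on large streams get expensive quickly.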
Plotting different epsilon values for our CountMinSketch: as the likelihood of a high-frequency item decreases (and the actual cardinality at that level becomes smaller), our relative errors become much higher.
Associative Binary Operations
The merge or addition operation can be achieved by merging the underlying individual HyperLogLogs level by level. This also means we can have differently sized/configured internal HyperLogLogs if desirable while still supporting binary operations (e.g. for space efficiency if we have prior knowledge of the distribution of the dataset).
Define HLS1 and HLS2.
Define HLS1.hll_i to be the ith internal HyperLogLog.
Define |HLS1.hll_i| as the cardinality estimate for items appearing i or more times.
Require that for all i, the configuration of HLS1.hll_i matches that of HLS2.hll_i.

Union:

|(HLS1 ∪ HLS2).hll_k| = |HLS1.hll_k ∪ HLS2.hll_k|

Intersection:

|(HLS1 ∩ HLS2).hll_k| = |HLS1.hll_k| + |HLS2.hll_k| - |HLS1.hll_k ∪ HLS2.hll_k|
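The level-wise merge can be sketched as follows, again with sets standing in for the internal HyperLogLogs (function names and sample data are illustrative). HLLs support lossless union natively, so intersection comes from inclusion-exclusion as in the formula above:

```python
def merge_union(hlls_a, hlls_b):
    """Level-wise union of two HyperLogSandwiches; sets stand in for
    HyperLogLogs, whose union merge is also lossless."""
    return [a | b for a, b in zip(hlls_a, hlls_b)]

def intersection_cardinality(hlls_a, hlls_b, k):
    """Inclusion-exclusion estimate at level k (1-indexed), since HLLs
    only support union natively: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    a, b = hlls_a[k - 1], hlls_b[k - 1]
    return len(a) + len(b) - len(a | b)

# two toy sandwiches: level 0 holds "1+ occurrences", level 1 "2+ occurrences"
week1 = [{"err_a", "err_b"}, {"err_a"}]
week2 = [{"err_b", "err_c"}, {"err_b"}]
merged = merge_union(week1, week2)
```

With real HyperLogLogs the inclusion-exclusion estimate inherits the error of each union term, which is why intersection estimates degrade faster than unions.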
HLL Unions & Intersections (http://research.neustar.biz/2012/12/17/hll-intersections-2/)
Example 1: Storing a HLS per week, each recording occurrences of specific log events. HLS1 ∪ HLS2 ∪ HLS3 combines the approximations across 3 weeks, and would let us query, for example, "how many log events occurred k times during this 3 week window?".
Example 2: Storing a HLS per user, each recording every phone number that user has ever called. HLS1 ∩ HLS2 ∩ HLS3 intersects the approximations across 3 users, and would let us query, for example, "how many numbers did all 3 people call 5+ times?".
We can configure each HLL separately whilst still supporting binary operations (provided the internal configuration of each nth HLL matches). This is especially important if there is some known information about the distribution of the data being stored.
The HyperLogSandwich is an efficient method for approximating k-occurrence cardinalities, and also a tasty snack. Further research would analyze specific space efficiencies, error propagation, and much more.
HyperLogSandwich with Bloomfilter (alternative approaches)
```
depth = d
initialize collection of d HyperLogLogs, HLL[0]..HLL[d - 1]
initialize d Bloomfilters, BF[0]..BF[d - 1]

for each item in stream:
    for i in 0 .. d - 1:
        HLL[i].insert(item)
        // the first level whose Bloomfilter has not yet seen this item
        // absorbs this occurrence; deeper levels wait for the next one
        if !BF[i].query(item):
            BF[i].set(item)
            break
```
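The Bloomfilter variant above trades the CountMinSketch for a per-level membership test: each occurrence of an item climbs one level deeper. A runnable sketch, with sets standing in for both the Bloomfilters and the HyperLogLogs (a real implementation would trade this exactness for space, and Bloomfilter false positives would occasionally skip a level):

```python
class BloomHLS:
    """Sketch of the Bloomfilter variant; sets stand in for both the
    Bloomfilters and the HyperLogLogs to keep the example exact."""
    def __init__(self, depth):
        self.hlls = [set() for _ in range(depth)]  # hlls[i] ~ "i+1 or more"
        self.bfs = [set() for _ in range(depth)]

    def insert(self, item):
        for hll, bf in zip(self.hlls, self.bfs):
            hll.add(item)
            if item not in bf:  # first level that hasn't seen this occurrence
                bf.add(item)
                break           # deeper levels wait for the next occurrence

    def k_plus_cardinality(self, k):
        return len(self.hlls[k - 1])

b = BloomHLS(depth=4)
for item in ["a", "b", "a", "c", "a", "b"]:
    b.insert(item)
```

The kth occurrence of an item stops at the kth Bloomfilter, so after k occurrences the item sits in levels 1..k, matching the "k or more" semantics of the CMS-based design without any frequency counters.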