NOTE: This implementation fully implements reading all formats in the v1.0.0 storage specification, but only writes to the
FULL format. Also, the in-RAM storage is always
HyperLogLog (HLL) is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes HLL can estimate the count of tens of billions of distinct values with only a few percent error.
log2m
The log-base-2 of the number of registers used in the HyperLogLog algorithm. It must be at least 4 and at most 24 (but it is recommended to be no more than 17). This parameter tunes the accuracy of the HyperLogLog structure. The relative error is given by the expression ±1.04/√(2^log2m). Note that increasing log2m by 1 doubles the required storage for the HLL.
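As a quick illustration of the error expression above, the following sketch computes the expected relative error for a few values of log2m. The helper name `relativeError` is made up for this example and is not part of the library.

```javascript
// Hypothetical helper (not part of js-hll): expected relative error for a
// given log2m, using the expression 1.04 / sqrt(2^log2m) from the text.
function relativeError(log2m) {
    if (log2m < 4 || log2m > 24) {
        throw new RangeError("log2m must be at least 4 and at most 24");
    }
    return 1.04 / Math.sqrt(Math.pow(2, log2m));
}

// Print the error as a percentage for a few register counts.
[11, 13, 17].forEach(function (log2m) {
    var percent = (relativeError(log2m) * 100).toFixed(2);
    console.log("log2m=" + log2m + ": ±" + percent + "%");
});
```

Because the error shrinks with the square root of the register count, adding 2 to log2m halves the error while quadrupling the storage.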
registerWidth
The number of bits used per register in the HyperLogLog algorithm. It must be at least 1 and at most 5. This parameter, in conjunction with log2m, tunes the maximum cardinality of the set whose cardinality can be estimated. For clarity, the table below lists, for each log2m (rows) and each register width from 1 to 5 (columns), the approximate maximum cardinality that can be estimated and, in parentheses, the size of the resulting structure.
| log2m | registerWidth = 1 | registerWidth = 2 | registerWidth = 3 | registerWidth = 4 | registerWidth = 5 |
|---|---|---|---|---|---|
| 10 | 7.4e+02 (128B) | 3.0e+03 (256B) | 4.7e+04 (384B) | 1.2e+07 (512B) | 7.9e+11 (640B) |
| 11 | 1.5e+03 (256B) | 5.9e+03 (512B) | 9.5e+04 (768B) | 2.4e+07 (1.0KB) | 1.6e+12 (1.2KB) |
| 12 | 3.0e+03 (512B) | 1.2e+04 (1.0KB) | 1.9e+05 (1.5KB) | 4.8e+07 (2.0KB) | 3.2e+12 (2.5KB) |
| 13 | 5.9e+03 (1.0KB) | 2.4e+04 (2.0KB) | 3.8e+05 (3KB) | 9.7e+07 (4KB) | 6.3e+12 (5KB) |
| 14 | 1.2e+04 (2.0KB) | 4.7e+04 (4KB) | 7.6e+05 (6KB) | 1.9e+08 (8KB) | 1.3e+13 (10KB) |
| 15 | 2.4e+04 (4KB) | 9.5e+04 (8KB) | 1.5e+06 (12KB) | 3.9e+08 (16KB) | 2.5e+13 (20KB) |
| 16 | 4.7e+04 (8KB) | 1.9e+05 (16KB) | 3.0e+06 (24KB) | 7.7e+08 (32KB) | 5.1e+13 (40KB) |
| 17 | 9.5e+04 (16KB) | 3.8e+05 (32KB) | 6.0e+06 (48KB) | 1.5e+09 (64KB) | 1.0e+14 (80KB) |
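The sizes in the table are consistent with 2^log2m registers of registerWidth bits each. A minimal sketch of that arithmetic follows; `registerBytes` is a hypothetical helper, and it counts register data only, not any header a serialized form might add.

```javascript
// Hypothetical helper (not part of js-hll): bytes needed to hold 2^log2m
// registers of registerWidth bits each when the registers are bit-packed.
function registerBytes(log2m, registerWidth) {
    var registerCount = Math.pow(2, log2m);
    return registerCount * registerWidth / 8;
}

console.log(registerBytes(10, 1)); // 128, the first cell of the table
console.log(registerBytes(11, 5)); // 1280, the size quoted in the introduction
console.log(registerBytes(17, 5)); // 81920, i.e. 80KB, the last cell of the table
```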
The Importance of Hashing
The seed to the hash call must remain constant for all inputs to a given HLL. Similarly, if one plans to compute the union of two HLLs, the input values must have been hashed using the same seed.
For a good overview of the importance of hashing and hash functions when using probabilistic algorithms as well as an analysis of MurmurHash 3, refer to these blog posts:
- K-Minimum Values: Sketching Error, Hash Functions, and You
- Choosing a Good Hash Function, Part 1
- Choosing a Good Hash Function, Part 2
- Choosing a Good Hash Function, Part 3
On Unions and Intersections
HLLs have the useful property that the union of any number of HLLs is equal to the HLL that would have been populated by playing back all inputs to those 'n' HLLs into a single HLL. Colloquially, one can say that HLLs have "lossless" unions because the same cardinality error guarantees that apply to a single HLL apply to a union of HLLs. See the
Using the inclusion-exclusion principle and the union() function, one can also estimate the intersection of sets represented by HLLs. Note, however, that the error is proportional to the cardinality of the union of the two HLLs, while the intersection can be significantly smaller than the union, leading to disproportionately large error relative to the actual intersection cardinality. For instance, if one HLL has a cardinality of 1 billion, while the other has a cardinality of 10 million, with an overlap of 5 million, the intersection cardinality can easily be dwarfed by even a 1% error in the estimate of the larger HLL's cardinality.
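The inclusion-exclusion estimate described above can be sketched as follows. The function `estimateIntersection` and the `ExactSet` stand-in are hypothetical and exist only so the example runs without the library; the sketch assumes nothing more than the clone(), union() and cardinality() calls that appear in this document's usage examples.

```javascript
// Estimate |A ∩ B| = |A| + |B| - |A ∪ B| for any two objects that expose
// clone(), union(other) and cardinality().
function estimateIntersection(a, b) {
    var unionOfBoth = a.clone();
    unionOfBoth.union(b); // modifies only the clone
    return a.cardinality() + b.cardinality() - unionOfBoth.cardinality();
}

// Hypothetical exact stand-in for an HLL (not part of js-hll).
function ExactSet(values) {
    this.values = {};
    (values || []).forEach(function (v) { this.values[v] = true; }, this);
}
ExactSet.prototype.clone = function () {
    return new ExactSet(Object.keys(this.values));
};
ExactSet.prototype.union = function (other) {
    Object.keys(other.values).forEach(function (v) { this.values[v] = true; }, this);
};
ExactSet.prototype.cardinality = function () {
    return Object.keys(this.values).length;
};

var a = new ExactSet([1, 2, 3, 4]), b = new ExactSet([3, 4, 5]);
console.log(estimateIntersection(a, b)); // 2, since the stand-in counts exactly
```

With real HLLs each of the three cardinalities carries its own error. In the example from the text, a 1% error on 1 billion is 10 million, which is twice the true overlap of 5 million.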
For more information on HLL intersections, see this blog post.
Refer to the unit tests or USAGE.markdown for many more usage examples.
Hashing and adding a value to a new HLL:

    var seed = 0x123456;

    // build an 8-byte raw key and fill it one byte at a time
    var rawKey = new ArrayBuffer(8);
    var byteView = new Int8Array(rawKey);
    byteView[0] = 0xDE; byteView[1] = 0xAD; byteView[2] = 0xBE; byteView[3] = 0xEF;
    byteView[4] = 0xFE; byteView[5] = 0xED; byteView[6] = 0xFA; byteView[7] = 0xCE;

    // hash the key with the fixed seed and add the hashed value to a new HLL
    var hllSet = new hll.HLL(13/*log2m*/, 5/*registerWidth*/);
    hllSet.addRaw(murmur3.hash128(rawKey, seed));

Retrieving the cardinality of an HLL:

    console.log(hllSet.cardinality());

Unioning two HLLs together (and retrieving the resulting cardinality):

    var hllSet1 = new hll.HLL(13/*log2m*/, 5/*registerWidth*/),
        hllSet2 = new hll.HLL(13/*log2m*/, 5/*registerWidth*/);

    // ... (add values to both sets) ...

    hllSet1.union(hllSet2)/*modifies hllSet1 to contain the union*/;
    console.log(hllSet1.cardinality());

Cloning an HLL:

    var hllSet1 = new hll.HLL(13/*log2m*/, 5/*registerWidth*/),
        hllSet2 = new hll.HLL(13/*log2m*/, 5/*registerWidth*/);

    // ... (add values to both sets) ...

    var hllUnion = hllSet1.clone();
    hllUnion.union(hllSet2)/*modifies hllUnion to contain the union*/;

    // both 'hllSet1' and 'hllSet2' are unmodified
    console.log(hllUnion.cardinality());

Reading an HLL from its hex string:

    var hllSet = hll.fromHexString(hllHexString).hllSet;
    console.log(hllSet.cardinality());

Writing an HLL to its hex string:

    ...
    var hllHexString = hllSet.toHexString();
    ...

- Requires Maven 2.0
- mvn clean package in the base directory
- target directory will be created (which is the combination of
js-hll-x.x.x.min.js will be created in the base directory.
- Build the library
- Point a browser at:
  - qunit-test.html to run the unit tests
  - stress-test.html to run the performance tests (view the browser console (ctrl-shift-c) for results)
- For this initial implementation, readability has been chosen over in-RAM size. Specifically, the register values (regardless of the register width) are stored as an array of numbers (each of which is 64 bits) rather than being bit-packed to achieve minimum storage. This will be rectified in future versions.
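
To make the trade-off in the note above concrete, here is a minimal sketch of what bit-packing register values could look like. It is not the library's implementation and does not follow the storage specification's layout; the function names and the most-significant-bit-first ordering are assumptions chosen for the example.

```javascript
// Hypothetical sketch: pack an array of register values, each registerWidth
// bits wide, into a Uint8Array (most significant bit first).
function packRegisters(registers, registerWidth) {
    var totalBits = registers.length * registerWidth;
    var packed = new Uint8Array(Math.ceil(totalBits / 8));
    for (var i = 0; i < registers.length; i++) {
        for (var b = 0; b < registerWidth; b++) {
            var bit = (registers[i] >> (registerWidth - 1 - b)) & 1;
            var pos = i * registerWidth + b; // absolute bit position
            if (bit) {
                packed[pos >> 3] |= 0x80 >> (pos & 7);
            }
        }
    }
    return packed;
}

// Read the register at 'index' back out of the packed bytes.
function readRegister(packed, registerWidth, index) {
    var value = 0;
    for (var b = 0; b < registerWidth; b++) {
        var pos = index * registerWidth + b;
        var bit = (packed[pos >> 3] >> (7 - (pos & 7))) & 1;
        value = (value << 1) | bit;
    }
    return value;
}

var registers = [1, 31, 0, 17];
var packed = packRegisters(registers, 5);
console.log(packed.length);              // 3 bytes for 4 registers of 5 bits
console.log(readRegister(packed, 5, 3)); // 17
```

For log2m = 13 and registerWidth = 5, packing of this kind would need 5KB for the registers, against 64KB for 8192 numbers of 64 bits each.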