JavaScript library for the HyperLogLog algorithm
JavaScript CSS
Latest commit cf1f6ac Dec 23, 2013 @timonk timonk Added references to new storage specification document and cleaned up…
… documentation. No logic changed.

README.markdown

js-hll

A JavaScript implementation of HyperLogLog whose goal is to be storage-compatible with other similar offerings from Aggregate Knowledge. A demo of this library is available.

NOTE: This implementation fully implements reading all formats in the v1.0.0 storage specification, but only writes to the FULL format. Also, the in-RAM storage is always FULL.

Latest Version

Overview

HyperLogLog (HLL) is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes HLL can estimate the count of tens of billions of distinct values with only a few percent error.

log2m

The log-base-2 of the number of registers used in the HyperLogLog algorithm. It must be at least 4 and at most 24 (but it is recommended to be no more than 17). This parameter tunes the accuracy of the HyperLogLog structure. The relative error is given by the expression ±1.04/√(2^log2m). Note that increasing log2m by 1 doubles the required storage for the HLL.

registerWidth

The number of bits used per register in the HyperLogLog algorithm. It must be at least 1 and at most 5. This parameter, in conjunction with log2m, tunes the maximum cardinality of the set whose cardinality can be estimated. For clarity, a table of registerWidths and log2ms, the approximate maximum cardinality and the size of the resulting structure that can be estimated with those parameters is provided below.

logm2regwidth=1regwidth=2regwidth=3regwidth=4regwidth=5
107.4e+02 (128B)3.0e+03 (256B)4.7e+04 (384B)1.2e+07 (512B)7.9e+11 (640B)
111.5e+03 (256B)5.9e+03 (512B)9.5e+04 (768B)2.4e+07 (1.0KB)1.6e+12 (1.2KB)
123.0e+03 (512B)1.2e+04 (1.0KB)1.9e+05 (1.5KB)4.8e+07 (2.0KB)3.2e+12 (2.5KB)
135.9e+03 (1.0KB)2.4e+04 (2.0KB)3.8e+05 (3KB)9.7e+07 (4KB)6.3e+12 (5KB)
141.2e+04 (2.0KB)4.7e+04 (4KB)7.6e+05 (6KB)1.9e+08 (8KB)1.3e+13 (10KB)
152.4e+04 (4KB)9.5e+04 (8KB)1.5e+06 (12KB)3.9e+08 (16KB)2.5e+13 (20KB)
164.7e+04 (8KB)1.9e+05 (16KB)3.0e+06 (24KB)7.7e+08 (32KB)5.1e+13 (40KB)
179.5e+04 (16KB)3.8e+05 (32KB)6.0e+06 (48KB)1.5e+09 (64KB)1.0e+14 (80KB)

The Importance of Hashing

In brief, it is absolutely crucial to hash inputs to an HLL. A close approximation of uniform randomness in the inputs ensures that the error guarantees laid out in the original paper hold. In fact, a JavaScript implementation of MurmurHash 3 (128bit) was made available to facilitate this input requirement. We've empirically determined that MurmurHash 3 is an excellent and fast hash function to use in conjunction with js-hll module.

The seed to the hash call must remain constant for all inputs to a given HLL. Similarly, if one plans to compute the union of two HLLs, the input values must have been hashed using the same seed.

For a good overview of the importance of hashing and hash functions when using probabilistic algorithms as well as an analysis of MurmurHash 3, refer to these blog posts:

On Unions and Intersections

HLLs have the useful property that the union of any number of HLLs is equal to the HLL that would have been populated by playing back all inputs to those 'n' HLLs into a single HLL. Colloquially, one can say that HLLs have "lossless" unions because the same cardinality error guarantees that apply to a single HLL apply to a union of HLLs. See the union() function.

Using the inclusion-exclusion principle and the union() function, one can also estimate the intersection of sets represented by HLLs. Note, however, that error is proportional to the union of the two HLLs, while the result can be significantly smaller than the union, leading to disproportionately large error relative to the actual intersection cardinality. For instance, if one HLL has a cardinality of 1 billion, while the other has a cardinality of 10 million, with an overlap of 5 million, the intersection cardinality can easily be dwarfed by even a 1% error estimate in the larger HLLs cardinality.

For more information on HLL intersections, see this blog post.

Usage

Refer to the unit tests (hll-test.js) or USAGE.markdown for many more usage examples.

Hashing and adding a value to a new HLL:

var seed = 0x123456;
var rawKey = new ArrayBuffer(8);
var byteView = new Int8Array(rawKey);
    byteView[0] = 0xDE; byteView[1] = 0xAD; byteView[2] = 0xBE; byteView[3] = 0xEF;
    byteView[4] = 0xFE; byteView[5] = 0xED; byteView[6] = 0xFA; byteView[7] = 0xCE;
var hllSet = new hll.HLL(13/*log2m*/, 5/*registerWidth*/);
    hllSet.addRaw(murmur3.hash128(rawKey, seed));

Retrieving the cardinality of an HLL:

console.log(hllSet.cardinality());

Unioning two HLLs together (and retrieving the resulting cardinality):

var hllSet1 = new hll.HLL(13/*log2m*/, 5/*registerWidth*/),
    hllSet2 = new hll.HLL(13/*log2m*/, 5/*registerWidth*/);

// ... (add values to both sets) ...

hllSet1.union(hllSet2)/*modifies hllSet1 to contain the union*/;
console.log(hllSet1.cardinality());

Cloning an HLL:

var hllSet1 = new hll.HLL(13/*log2m*/, 5/*registerWidth*/),
    hllSet2 = new hll.HLL(13/*log2m*/, 5/*registerWidth*/);

// ... (add values to both sets) ...

var hllUnion = hllSet1.clone();
hllUnion.union(hllSet2)/*modifies hllUnion to contain the union*/;
// both 'hllSet1' and 'hllSet2' are unmodified
console.log(hllUnion.cardinality());

Reading an HLL from a hex representation of storage specification, v1.0.0 (for example, retrieved from a PostgreSQL database):

var hllSet = hll.fromHexString(hllHexString).hllSet;
console.log(hllSet.cardinality());

Writing an HLL to its hex representation of storage specification, v1.0.0 (for example, to be inserted into a PostgreSQL database):

...
var hllHexString = hllSet.toHexString();
...

Projects using js-hll


Building

  • Requires Maven 2.0
  • mvn clean package in the base directory

    A target directory will be created (which is the combination of src/main and src/test/).

    Library files js-hll-x.x.x.js and js-hll-x.x.x.min.js will be created in the base directory.

Testing

  • Build the library
  • Point a browser at:
    1. qunit-test.html to run the unit tests
    2. stress-test.html to run the performance tests (view browser console (ctrl-shift-c) for results)

Notes

  • For this initial implementation, readability has been chosen over in-RAM size. Specifically, the register values (regardless of the register width) are stored as an array of numbers (each of which are 64bits) rather than bit-packing them to achieve minimum storage. This will be rectified in future versions.
  • The register width is limited to at most 5 bits to allow for using JavaScript's native bit operations (which support up to 32bits). If you have the need to support 6 bits of precision please let us know (file an issue)!