Storage Specification
Version: 1.0.0
This document is a specification for the storage format of the Aggregate Knowledge HyperLogLog implementations.
Algorithms & Types
A hll is a combination of different set/distinct-value-counting algorithms that can be thought of as a hierarchy, along with rules for moving up that hierarchy. In order to distinguish between said algorithms, we have given them names:
EMPTY
A constant value that denotes the empty set.
EXPLICIT
An explicit, unique, sorted list of integers in the set, which is maintained up to a fixed cardinality.
SPARSE
A 'lazy', map-based implementation of HyperLogLog, a probabilistic set data structure. Only stores the indices and values of non-zero registers in a map, until the number of non-zero registers exceeds a fixed cardinality.
FULL
A fully-materialized, list-based implementation of HyperLogLog. Explicitly stores the value of every register in a list ordered by register index.
For brevity's sake, we'll refer to these different algorithms as 'types'.
Storage
These data structures are stored as zero-indexed array of bytes with a schema version specified in the top nibble of the zero-th byte. The schema version decides both the specific storage layout as well as the rules for the interaction of the different types mentioned above.
The storage specification can trivially be converted into a serialization format by simply encoding the arrays of bytes. The storage specification includes how to represent the various set-like structures as arrays of bytes, as well as how to represent the results of operations on multiple sets (such as an undefined result).
NOTE: all examples and diagrams will use a hex format with the high nibble first (on the left) in the octet, and the zero-th octet on the left.
In a pseudo-regex format, the storage layout of the bytes is as follows: ^VS*$
^symbolizes the beginning of the array of bytes, right before the zero-indexed byteVis the 'version' byte, indicating the schema version in the top nibble. The bottom nibble is schema version-dependent.Sis a byte specified by the particular schema version. See below.$symbolizes the end of the array of bytes, right after the highest-indexed byte
Each schema version will elaborate on the above layout.
Schema Version
Schema Version 1
-
Size/Layout
- Size/Layout
-
Minimum length: 3 bytes
-
Maximum length: 17,179,869,187 bytes ( = (231 * 8) + 3)
-
The schema-specific bytes
S*have the following structure:PCB*Pis the required 'parameter' byte. See 'Type-agnostic' subsection.Cis the required 'cutoff' byte. See 'Type-agnostic' subsection.Bis a 'data' byte. The 'data' bytes layout is specified per type. See 'Type-specific' subsection.
-
Type-agnostic
-
The bottom nibble of the 'version' byte
Vindicates the type, as defined by these ordinals:0- undefined1-EMPTY2-EXPLICIT3-SPARSE4-FULL
-
The 'parameter' byte
Pencodes the bit-width and number of registers used bySPARSEandFULLtypes.- the highest 3 bits are used to encode the integer value
registerWidth - 1, and - the remaining 5 bits encode the integer value
log2(numberOfRegisters).
- the highest 3 bits are used to encode the integer value
registerWidthmay take values from 1 to 8, inclusive, andlog2(numberOfRegisters)may take on 4 to 31, inclusive.For example:
P = xA6 = 1010 0110 = 101 00110thus
registerWidth - 1 = 5 registerWidth = 6and
log2(numberOfRegisters) = 6 numberOfRegisters = 2^6 = 64-
The 'cutoff' byte
Cencodes parameters defining theEXPLICITtoSPARSE, andSPARSEtoFULLpromotions.- 1 bit (the top bit) of padding,
- 1 bit (second highest bit) indicating the boolean value
sparseEnabled, and - 6 bits (lowest six bits) as an unsigned big-endian integer
explicitCutoffthat can take on the values0,63, or1to31inclusive.
-
-
Type-specific
-
undefined/invalid result
Uses no data bytes.
-
EMPTYUses no data bytes.
-
EXPLICITThe layout of the 'data' bytes is:
(B{8}){0,256}, that is 0-256 blocks of 8 bytes. Each block of 8 bytes represent a signed 64-bit integer (sign bit + 63 value bits). These integers are encoded as big-endian (with sign-bit at highest position), and are the "contents" of the set. (The blocks together represent an array of such integers, stored in ascending order, without duplicates.)data bytes = [-5451491901947305642, 1] # as ascending, signed, decimal-encoded 64-bit values in array = [0xCBA79700677CDEAA, 0x0000000000000001] # as hex-encoded 64-bit values in array = 0xCBA79700677CDEAA0000000000000001 # as hex -
SPARSEThe data bytes encode the register indices and values in
(log2(numberOfRegisters) + registerWidth)-bit-wide "short-words". Each short-word is a bit-packed registerindex/registervaluepair.- The
indexis encoded in the highestlog2(numberOfRegisters)bits of the short-word. - The
valueis encoded in the lowestregisterWidthbits of the short-word.
The short-words are packed into bytes from the top of the zero-th data byte to the bottom of the last data byte, with the high bits of a short-word toward the high bits of a byte.
- If
BITS = (registerWidth + log2(numberOfRegisters)) * numberOfRegistersis not divisible by 8, thenBITS % 8padding bits are added to the bottom of the last byte of the array. - The short-words are stored in ascending
indexorder.
For example, if
log2(numberOfRegisters) = 11andregisterWidth = 6, and if the register index/value pairs are(11, 6)and(1099, 19):= [(11, 6), (1099, 19), padding] # as unsigned decimal-encoded pairs = [(0b00000001011, 0b000110), (0b10001001011, 0b010011), 0b000000] # as two binary-encoded pairs and 6 bits of padding = [(0b00000001011000110), (0b10001001011010011), 0b000000] # as binary-encoded 17-bit short words and 6 bits of padding = [0b00000001, 0b01100011, 0b01000100, 0b10110100, 0b11000000] # as binary-encoded octets in array = [0x01, 0x63, 0x44, 0x5B, 0xC0] # as byte array 0x0163445BC0 # as hexThis encoding was chosen so that when reading bytes as octets in the typical first-octet-is-the-high-4-bits fashion, a octet-to-binary conversion as demonstrated above would yield a high-to-low, left-to-right view of the "short words" and their components.
- The
-
FULLThe data bytes encode the register values in
registerWidth-bit-wide, big-endian "short-words". The short words are written from the top of the zero-th byte of the array to the bottom of the last byte of the array, with the high bits of a short-word toward the high bits of a byte.- If
BITS = registerWidth * numberOfRegistersis not divisible by 8, thenBITS % 8padding bits are added to the bottom of the last byte of the array. - The short-words are stored in ascending index order.
For example, if
registerWidth = 5andnumberOfRegisters = 4, and if the register index/value pairs are(0, 0), (1,1), (2,2), (3,3):[0, 1, 2, 3, padding] # as unsigned decimal-encoded register values = [0b00000, 0b00001, 0b00010, 0b00011, 0b0000] # as four 5-bit "short words" + 4 bits padding = [0b00000000, 0b01000100, 0b00110000] # as binary-encoded octets in array = [0x00, 0x44, 0x30] # as hex-encoded byte array = 0x004430 # as hex - If
-
-
Hierarchy
-
The hierarchy is dependent on the 'cutoff' byte
C. When a set is promoted from one 'type'/algorithm to another, the top nibble of the 'version' byteV, the 'parameter' byteP, and the 'cutoff' byteCall remain the same. The bottom nibble ofVand the 'data' bytesBmay change. -
When any value is added to an
EMPTYset,- if
explicitCutoff = 0andsparseEnabled = 0, then it is promoted to aFULLset containing that one value. - if
explicitCutoff = 0andsparseEnabled = 1, then it is promoted to aSPARSEset containing that one value. - if
explicitCutoff > 0but< 63, then it is promoted to anEXPLICITset containing that one value.
- if
-
When inserting an element into an
EXPLICITset,- if
sparseEnabled = 0andexplicitCutoff = 0, then it is promoted to aFULLset. - if
sparseEnabled = 1andexplicitCutoff = 0, then it is promoted to aSPARSEset. - if
sparseEnabled = 0andexplicitCutoff > 0and< 63, and if inserting the element would cause the cardinality to exceed2 ^ (explicitCutoff - 1), then it is promoted to aFULL. - if
sparseEnabled = 1andexplicitCutoff > 0and< 63, and if inserting the element would cause the cardinality to exceed2 ^ (explicitCutoff - 1), then it is promoted to aSPARSE. - if
sparseEnabled = 0andexplicitCutoff = 63, then the criteria for promotion is implementation-dependent, as this value ofexplicitCutoffindicates an 'auto' promotion mode. SincesparseEnabled = 0the set can only be promoted to aFULLset. - if
sparseEnabled = 1andexplicitCutoff = 63, then the promotion is implementation-dependent, as this value ofexplicitCutoffindicates an 'auto' promotion mode. SincesparseEnabled = 1the set can only be promoted to aSPARSEset.
- if
-
When inserting an element into a
SPARSEset, if that element would cause the storage size of theSPARSEset to be greater than that of aFULLset, then it is promoted to aFULLset.
-