Dictionary

Overview

The Dictionary compression scheme is intended for data with support size that's small relative to its domain - but not necessarily concentrated in a contiguous subset of that domain (those cases usually fit better with [FOR](Frame of Refenrece), [DZB/NS](Discard Zero Bytes) or their combination). With this scheme we 'squeeze' the support set Σ together by mapping it to 0..|Σ|-1. The inverse of this bijection - a map from indices in 0..|Σ|-1 into the compressed data's domain - is a dictionary for the compressed data.

Compressing data constitutes the construction of such a bijection, then replacing the large elements - of type τ_D - with their corresponding "dictionary indices" using the bijection; decompression is the simpler action of simply looking up τ_D in the dictionary: decompressed[i] = dictionary[compressed[i]]. One is immediately reminded of the Gather operation, and, indeed - dictionary decompression is nothing but a Gather.

This scheme's shorthand designator is DICT.

Scheme parameters

Parameter	Aliases	Symbol used
Domain	Uncompressed domain	D
Domain element type	Uncompressed type, uncompressed data type	τ_D
Support set	Effective domain,	Σ
Dictionary size	Support set size	σ
Dictionary index type		τ_di
Column length	Single bitmap length, uncompressed data length	n

Variants

Dictionary index size resolution

At what quanta, or what resolution, do we set sizes for dictionary indices?

The simpler option is to make dictionary indices take up power-of-2 bytes: 1, 2, 4 or 8 on typical hardware. This is the easiest to implement and it aligns nicely - but one loses a lot in terms of compression ratio.
The next, slightly more complex option is to allow indices to take up any integral number of bytes. Now one must allow for reading non-GPU-word-aligned index values of sizes 3 and 5 from memory. Still, this doesn't cover the entire range of possibilities. This option is not yet implemented (see issue #15).
The most flexible option is allowing bit-size-resolution dictionary index sizes. This requires the most hassle when reading - one has to make sure the right bytes end up in the right places. This option is not yet implemented (see issue #10).

Dictionary for variable-length data (unsupported)

Other than the usual benefit of using a dictionary, using it also makes the (now-compressed) data have a fixed-width type - which is quite beneficial (e.g. much easier lookup and filtering) - if the data width was not yet fixed. Thus for data which may benefit from some uniformity, we might consider using this scheme even if there's no significant compression benefit.

Dictionary-per-segment (unimplemented)

Compressed data might be amenable to dictionary compression locally, but not locally. i.e. limited-length-but-not-too-short contiguous sequences of uncompressed elements may have a limited support size, but over the entire input - the support size is a very large fraction of the entire domain (and it's not artificially large due to outliers, a case which could be done away by patching). In this case it makes sense to occasionally switch dictionaries, to reflect the effective current support set per segment (and possibly use patching to take care of some inter-dictionary overlap).

Memory layout

The following scalars are used: column length
The dictionary is a contiguous array of σ elements of type τ_D
The compressed data is a contiguous array of n indices into the dictionary array
For per-segment dictionaries the layout would be different - but that's not supported yet

Provide feedback

Saved searches

Use saved searches to filter your results more quickly