Skip to content

Dictionary

Eyal Rozenberg edited this page May 3, 2017 · 1 revision

Overview

The Dictionary compression scheme is intended for data with support size that's small relative to its domain - but not necessarily concentrated in a contiguous subset of that domain (those cases usually fit better with [FOR](Frame of Refenrece), [DZB/NS](Discard Zero Bytes) or their combination). With this scheme we 'squeeze' the support set Σ together by mapping it to 0..|Σ|-1. The inverse of this bijection - a map from indices in 0..|Σ|-1 into the compressed data's domain - is a dictionary for the compressed data.

Compressing data constitutes the construction of such a bijection, then replacing the large elements - of type τ_D - with their corresponding "dictionary indices" using the bijection; decompression is the simpler action of simply looking up τ_D in the dictionary: decompressed[i] = dictionary[compressed[i]]. One is immediately reminded of the Gather operation, and, indeed - dictionary decompression is nothing but a Gather.

This scheme's shorthand designator is DICT.

Scheme parameters

Parameter Aliases Symbol used
Domain Uncompressed domain D
Domain element type Uncompressed type, uncompressed data type τ_D
Support set Effective domain, Σ
Dictionary size Support set size σ
Dictionary index type τ_di
Column length Single bitmap length, uncompressed data length n

Variants

Dictionary index size resolution

At what quanta, or what resolution, do we set sizes for dictionary indices?

  • The simpler option is to make dictionary indices take up power-of-2 bytes: 1, 2, 4 or 8 on typical hardware. This is the easiest to implement and it aligns nicely - but one loses a lot in terms of compression ratio.
  • The next, slightly more complex option is to allow indices to take up any integral number of bytes. Now one must allow for reading non-GPU-word-aligned index values of sizes 3 and 5 from memory. Still, this doesn't cover the entire range of possibilities. This option is not yet implemented (see issue #15).
  • The most flexible option is allowing bit-size-resolution dictionary index sizes. This requires the most hassle when reading - one has to make sure the right bytes end up in the right places. This option is not yet implemented (see issue #10).

Dictionary for variable-length data (unsupported)

Other than the usual benefit of using a dictionary, using it also makes the (now-compressed) data have a fixed-width type - which is quite beneficial (e.g. much easier lookup and filtering) - if the data width was not yet fixed. Thus for data which may benefit from some uniformity, we might consider using this scheme even if there's no significant compression benefit.

Dictionary-per-segment (unimplemented)

Compressed data might be amenable to dictionary compression locally, but not locally. i.e. limited-length-but-not-too-short contiguous sequences of uncompressed elements may have a limited support size, but over the entire input - the support size is a very large fraction of the entire domain (and it's not artificially large due to outliers, a case which could be done away by patching). In this case it makes sense to occasionally switch dictionaries, to reflect the effective current support set per segment (and possibly use patching to take care of some inter-dictionary overlap).

Memory layout

  • The following scalars are used: column length
  • The dictionary is a contiguous array of σ elements of type τ_D
  • The compressed data is a contiguous array of n indices into the dictionary array
  • For per-segment dictionaries the layout would be different - but that's not supported yet

Links to the code