
Discard Zero Bytes


Overview

When your data's domain type (τ_D) is a relatively large unsigned integer, but your actual data are all small - all fitting into fewer bytes than |τ_D| - it makes sense to represent them using only those bytes which may actually take a non-zero value for any of the data. This compression scheme is often referred to as "Null Suppression", even though it is unrelated to the SQL NULL value. In the Giddy code it is more commonly referred to as "Discard Zero Bytes".
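For concreteness, here is a minimal host-side sketch of the idea in C++ (the helper names are made up for illustration and are not libGiddy's actual API), assuming τ_D is a 4-byte unsigned integer and every datum fits in |τ_D'| = 2 bytes:

```cpp
// Minimal sketch of the Discard Zero Bytes idea (illustration only, not
// libGiddy's API): keep only the 2 low-order bytes of each 4-byte element
// and restore the zero high bytes on decompression.
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<uint16_t> compress_dzb(const std::vector<uint32_t>& column) {
    std::vector<uint16_t> compressed(column.size());
    for (size_t i = 0; i < column.size(); i++) {
        assert(column[i] <= UINT16_MAX);  // every datum must fit in |τ_D'| = 2 bytes
        compressed[i] = static_cast<uint16_t>(column[i]);
    }
    return compressed;
}

std::vector<uint32_t> decompress_dzb(const std::vector<uint16_t>& compressed) {
    std::vector<uint32_t> column(compressed.size());
    for (size_t i = 0; i < compressed.size(); i++)
        column[i] = compressed[i];        // the two high bytes come back as zero
    return column;
}
```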

This scheme is not to be confused with [Discard Zero Bytes - Variable](Discard Zero Bytes - Variable), another of libGiddy's supported schemes, in which the number of non-zero bytes is maintained separately for every element.

The scheme's shorthand designation is NS or DZB, depending on which version of the name you prefer.

Scheme parameters

| Parameter | Aliases | Symbol used |
|---|---|---|
| Domain | Uncompressed domain | D |
| Domain element type | Uncompressed type, uncompressed data type | τ_D |
| Column length | Single bitmap length, uncompressed data length | n |
| Subdomain | Compressed domain | D' |
| Subdomain element type | Compressed type, compressed data type | τ_D' |

Variants

Endianness / direction from which to drop bytes (unsupported)

NVIDIA GPUs are little-endian, and Discard Zero Bytes discards the "big end" of the number; thus, it is the bytes with higher memory addresses that get discarded. It might be useful to support discarding the bytes with lower memory addresses instead, in the following cases (see the sketch after this list):

  • Support for big-endian data
  • Lossy compression of little-endian data by discarding the less-significant bytes
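The two directions can be illustrated on a little-endian host; only the first of the two truncations below corresponds to what the scheme currently does (illustration only, not library code):

```cpp
// Sketch of the two possible "directions" on a little-endian machine.
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    uint32_t value = 0x0000ABCD;    // the high bytes are zero
    unsigned char bytes[4];
    std::memcpy(bytes, &value, 4);  // little-endian: bytes = CD AB 00 00

    uint16_t keep_low_addresses;    // supported: drop the high-address ("big end") bytes
    std::memcpy(&keep_low_addresses, bytes, 2);        // -> 0xABCD

    uint16_t keep_high_addresses;   // unsupported variant: drop the low-address bytes
    std::memcpy(&keep_high_addresses, bytes + 2, 2);   // -> 0x0000 (lossy for this value)

    std::printf("%04X %04X\n", keep_low_addresses, keep_high_addresses);
}
```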

Bit resolution size for the discarded part

At the moment, one can only drop entire bytes. It stands to reason that only some bits within the highest non-zero byte are actually used, and one could exploit this by keeping only the lower b = ceil(log_2(|D|)) bits of each element, rather than rounding up to a whole number of bytes.
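A possible bit-granularity packing is sketched below purely for illustration (this is an assumed design, not an existing libGiddy feature); every element is assumed to fit in b bits:

```cpp
// Sketch (assumed design): pack each element into b bits instead of a whole
// number of bytes, into a contiguous buffer of 64-bit words.
#include <cstdint>
#include <vector>

std::vector<uint64_t> pack_bits(const std::vector<uint32_t>& column, unsigned b) {
    std::vector<uint64_t> packed((column.size() * b + 63) / 64, 0);
    for (size_t i = 0; i < column.size(); i++) {
        // column[i] is assumed to fit in b bits
        size_t bit_offset = i * b;
        size_t word = bit_offset / 64, shift = bit_offset % 64;
        packed[word] |= static_cast<uint64_t>(column[i]) << shift;
        if (shift + b > 64)  // element straddles a word boundary
            packed[word + 1] |= static_cast<uint64_t>(column[i]) >> (64 - shift);
    }
    return packed;
}

uint32_t unpack_bits(const std::vector<uint64_t>& packed, unsigned b, size_t i) {
    size_t bit_offset = i * b;
    size_t word = bit_offset / 64, shift = bit_offset % 64;
    uint64_t bits = packed[word] >> shift;
    if (shift + b > 64)
        bits |= packed[word + 1] << (64 - shift);
    uint64_t mask = (b == 64) ? ~0ULL : ((1ULL << b) - 1);
    return static_cast<uint32_t>(bits & mask);
}
```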

Non-zero replacement pattern

Discard Zero Bytes' name suggests that the only data discarded are zeros. Instead, one could discard the higher bytes whenever they fit a certain pattern; or, in other words, decompress by filling the higher bytes with a pattern other than all-zeros. Note that this would be equivalent to applying [Frame-of-Reference](Frame-of-Reference) (FOR) decompression, with the (left-shifted) fill pattern as the reference value, after DZB decompression.
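The equivalence can be checked with a small example (illustrative only, with made-up values), assuming τ_D is 4 bytes and τ_D' is 2 bytes:

```cpp
// Check of the observation above: splicing a fill pattern into the dropped
// high bytes equals zero-filling (plain DZB) and then adding the left-shifted
// pattern as a Frame-of-Reference value.
#include <cassert>
#include <cstdint>

int main() {
    uint16_t compressed   = 0xBEEF;  // the kept low |τ_D'| = 2 bytes
    uint16_t fill_pattern = 0x00AB;  // pattern for the discarded high bytes

    // Variant A: decompress by splicing the pattern into the high bytes.
    uint32_t spliced = (static_cast<uint32_t>(fill_pattern) << 16) | compressed;

    // Variant B: zero-fill, then FOR with the left-shifted pattern as reference.
    uint32_t zero_filled   = compressed;
    uint32_t for_reference = static_cast<uint32_t>(fill_pattern) << 16;
    uint32_t via_for       = zero_filled + for_reference;

    assert(spliced == via_for);      // the two decompressions agree
}
```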

Arbitrary mask of bits/bytes to discard

Sometimes the fixed, known bytes in the uncompressed input are not the highest ones - for example, when the input is a byte concatenation of different fields, or a fixed-length string with known structure (say, "A,B,C,D", with A, B, C and D varying but the commas always being there). In such cases one may want to keep only the varying bytes and discard the fixed ones. This is essentially what this scheme is about: one would use a replacement pattern as per the above, together with a mask indicating which parts of the uncompressed data are fixed (and may be discarded, then later reconstructed from the replacement pattern) and which parts vary (and must be maintained in the compressed data).
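A sketch of what such a mask-based variant could look like (an assumed design, not something libGiddy implements), using the "A,B,C,D" example with single-byte fields:

```cpp
// Sketch of the mask variant: a fixed 7-byte record "A,B,C,D" where the
// commas at offsets 1, 3 and 5 are fixed and only the other four bytes vary.
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::array<char, 7> fill_template = {'?', ',', '?', ',', '?', ',', '?'};
constexpr std::array<bool, 7> is_fixed      = {false, true, false, true, false, true, false};

// Keep only the varying bytes (4 of 7).
std::array<char, 4> compress_record(const std::array<char, 7>& record) {
    std::array<char, 4> kept{};
    size_t j = 0;
    for (size_t i = 0; i < record.size(); i++)
        if (!is_fixed[i]) kept[j++] = record[i];
    return kept;
}

// Reconstruct by merging the kept bytes back into the fixed replacement pattern.
std::array<char, 7> decompress_record(const std::array<char, 4>& kept) {
    std::array<char, 7> record = fill_template;
    size_t j = 0;
    for (size_t i = 0; i < record.size(); i++)
        if (!is_fixed[i]) record[i] = kept[j++];
    return record;
}

int main() {
    std::array<char, 7> record = {'x', ',', 'y', ',', 'z', ',', 'w'};
    assert(decompress_record(compress_record(record)) == record);
}
```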

Memory layout

  • The following scalars (not known at compile time) are passed: the column length n
  • The compressed data is a contiguous array with elements of type τ_D' (... well, not exactly: we actually only treat the elements as fixed-number-of-bytes data, i.e. an uncompressed datum is a |τ_D|-byte unsigned integer and in compressed form it is a |τ_D'|-byte unsigned integer)
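A sketch of how a decompressor might read this layout (the function name and the choice of an 8-byte τ_D are made up for illustration; a little-endian host is assumed, matching the scheme's direction):

```cpp
// Illustration of reading the layout described above: the compressed column is
// a contiguous buffer of |τ_D'|-byte unsigned integers, and decompression
// widens each one to a |τ_D|-byte integer; the column length n is the only
// scalar passed at run time.
#include <cstdint>
#include <cstring>
#include <vector>

std::vector<uint64_t> widen_column(const uint8_t* compressed,
                                   size_t compressed_element_size,  // |τ_D'| in bytes (<= 8 here)
                                   size_t column_length)            // n, not known at compile time
{
    std::vector<uint64_t> uncompressed(column_length, 0);
    for (size_t i = 0; i < column_length; i++) {
        // Copy the |τ_D'| stored bytes into the low end of a zeroed
        // |τ_D|-byte integer (little-endian host assumed).
        std::memcpy(&uncompressed[i],
                    compressed + i * compressed_element_size,
                    compressed_element_size);
    }
    return uncompressed;
}
```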

Links to the code