Skip to content

Run Length Encoding

Eyal Rozenberg edited this page Jun 7, 2017 · 2 revisions

Overview

The [Run-Length Encoding] compression scheme caters to the case of data with long runs of the same value. Instead of, say, 5 consecutive elements each with value 123, their run-length encoding is the pair (123,5). Unlike most other schemes such as DZB, FOR, DELTA etc. - for which the input data must meet a strict criterion to be compressed using a scheme - anything can be compressed with RLE, but the shorter the runs are, the worse the compression ration becomes. It could be n/(τ_c + τ_rl) in the extreme case of a single run, or it could be 1/(τ_c + τ_rl) in the unfortunate extreme of no two identical consecutive elements.

Another aspect of complexity of RLE is that given an index of the decompressed output, one cannot know apriori where in the compressed input its value is represented. One must compute a sum of run lengths up to the target index, using the value reached when the run lengths add up to the position of interest. This is also the reason why we use RLE with anchors at each consecutive segment: The anchors indicate which run, and which index within that run, corresponds to the first element which segment s; this limits the dependence on the data or previous runs to within a given segment (but doesn't remove it altogether).

This scheme's shorthand designator is RLE.

Scheme parameters

Parameter Aliases Symbol used
Domain element size domain element width, uncompressed element sizee w
Column length uncompressed data length n
Run length type τ_rl
Number of runs R
Run index type τ_r
Segment length s

Memory layout

As with most schemes targetted at a GPU, RLE also uses an SoA (structure-of-arrays) layout. So while a run is characterized by a tuple (value, length) and a segment anchor by a tuple (run index, offset within that run) - the kernel takes 4 arrays:

  • Run data, of the uncompressed data's type (although it's interpreted as an unsigned integer of the same size - w)
  • Run lengths, of the unsigned integeral type τ_rl ; elements of the same index in this array and the run data array correpond
  • Anchor run indices, of a type τ_r
  • Anchor intra-run offsets of type τ_rl

Scalars passed are:

  • The segment length s
  • The total length n
  • The number of runs R
  • The number of segments - redundant, but precludes an integer division by the kernel itself

Notes

This is a "sister scheme" to Run-Position Encoding compression scheme. That one keeps the run start positions instead of the lengths, i.e. the summed-up previous lengths, for each run.

Links to the code

Clone this wiki locally