Skip to content
cifkao edited this page Sep 16, 2014 · 2 revisions

Bis is a file compression tool written in C. It compresses files using a combination of the Burrows-Wheeler transform and Huffman coding.

Usage

bis is run from the terminal with the name of the file to be compressed (or decompressed) passed as a parameter. Use the -d option for decompressing. By default, bis will append the .bis file extension when compressing and strip it when decompressing (if possible).

The following options can be used:

Option Effect
-b size[k] Set the block size to size in bytes (or kilobytes). Default is 200k. Has no effect when decompressing.
-c Force writing to the standard output.
-d Decompress instead of compressing.
-h Display help.
-o outfile Use outfile as the output file.

Compression algorithm

bis splits the input into blocks and uses a number of compression techniques on each block:

  • First, the block is encoded using a run-length encoding (RLE) such that every sequence of 2 to 257 identical bytes is replaced by the first two bytes and a repeat length between 0 and 255.
  • Then, the Burrows–Wheeler transform (BWT) is applied. To make this phase of the compression reversible, a special EOF symbol (256) is appended to the block before applying the transform.
  • Subsequently, the data is encoded using the move-to-front (MTF) transform.
  • The run-length encoding is applied again, with the difference that every sequence of 2 to 258 identical symbols is replaced, so the repeat length lies between 0 and 256.
  • Finally, the data is encoded using Huffman coding. Another special EOF symbol (257) is added to mark the end of the block, so the coding uses (at most) 258 different codes.

File format

A bis compressed file consists of a three byte signature 42 49 53 (“BIS”) and a sequence of Huffman-encoded blocks. Each block is a sequence of bits, packed MSB-first.

A block begins with a representation of the binary Huffman tree used for compressing this block. This representation is generated using a pre-order traversal of the Huffman tree, writing a 0-bit for every non-leaf node, and a 1-bit, followed by the corresponding symbol value, for every leaf node. The encoded data follows immediately. The last byte of each block is padded with zeroes, so that the next block starts with the next byte.

Clone this wiki locally