Skip to content

eloj/lzw-eddy

Repository files navigation

Simple LZW (Lempel-Ziv-Welch) Library, Compressor & Decompressor

A single-header library and basic headerless compressor and decompressor. Supports variable length codes between 9 and 12 bits per default, but the upper bound is a compile-time constant that can be adjusted between 9 and 16 bits.

The algorithm implemented by this code was widely distributed in the old MS-DOS days in places like Dr.Dobbs and a popular book on compression, probably due to its use in GIF. This resulted in it being used in all sorts of places.

Specifically the code in this repository was written to be bit-compatible with Puzznic (MS-DOS), and as such does not represent an effort to write "the best" LZW codec.

Code developed using the note "LZW and GIF explained" by Steve Blackstock as a reference.

All code is provided under the MIT License.

Build status

Features

  • Single-Header Library.
  • Fixed memory requirements:
    • Uses ~16KiB for state/string table by default.
    • At least ~4KiB output buffer recommended, but can go much lower in practice.
    • Low stack usage.
  • Compressor can be 'short-stroked' to limit decompression buffer size requirement.
  • Fast decompression. Very slow compression.
  • Releases are:

C interface

ssize_t lzw_decompress(struct lzw_state *state, uint8_t *src, size_t slen, uint8_t *dest, size_t dlen);
ssize_t lzw_compress(struct lzw_state *state, uint8_t *src, size_t slen, uint8_t *dest, size_t dlen);
const char *lzw_strerror(enum lzw_errors errnum);
  • State must be zero-initialized.
  • The return value is the number of bytes compressed or decompressed into dest. Once all input has been processed, 0 is returned. See example.
  • On error, a negative integer is returned.

All input is assumed to be available at src; e.g it is NOT allowed to switch src during encoding/decoding. A function to 'hand over' state to new input could be added, but I don't have the need.

Security

I would not recommend using this code in a security-sensitive context. If you expose this code to possibly adversarial data, beware that you do so at your own risk.

I make no guarantees, provide no warranties, and may not respond to security issues in a timely manner.

This code was written simply to allow interoperability with an old MS-DOS game, nothing more.

Thank you and enjoy.

Usage

In your code, define LZW_EDDY_IMPLEMENTATION and then #include "lzw.h". This will give you a decoder/encoder specific for 9-12 bit codes, giving a string table of 4096 entries.

You can optionally define LZW_MAX_CODE_WIDTH to a value between 9 and 16 before including the header to change this compile-time default. Due to the way the dictionary is reconstructed during decompression, a decoder is only compatible with data generated for the exact same size string table.

12-bit codes are probably the sweet spot for most applications. Larger codes means more bits are needed to encode newer strings, and because the string table is larger, the dictionary doesn't adapt as fast as it would if it was smaller. This combination means that a larger table can result in worse compression ratio.

The encoder could theoretically be improved to flush or prune the existing string table if few long matches are made over some window, but no such adaptability is present.

CLI compressor

lzw-eddy is a simple command-line compressor built using the library.

lzw-eddy 1.1.0-dev <45bf69f1>
Usage: ./lzw-eddy -c file|-d file -o outfile
Compiled Configuration:
 LZW_MIN_CODE_WIDTH=9, LZW_MAX_CODE_WIDTH=12, LZW_MAX_CODES=4096, sizeof(lzw_state)=16440

You can pass BITWIDTH=<num> to build it with a non-default string table size.

$ make -B BITWIDTH=14 && ./lzw-eddy -c lzw.h -o /dev/null
lzw-eddy 1.1.0-dev <45bf69f1>
Compressing 'lzw.h', 14566 bytes.
6947 bytes written to output, reduction=52.31% (longest prefix=15).

Example

	#define LZW_EDDY_IMPLEMENTATION
	#include "lzw.h"

	struct lzw_state state = { 0 };

	size_t slen = <length of compressed data>
	uint8_t *src = <compressed data>;
	uint8_t dest[4096];

	ssize_t res, written = 0;
	while ((res = lzw_decompress(&state, src, slen, dest, sizeof(dest))) > 0) {
		// Process `res` bytes of output in `dest`, e.g:
		// fwrite(dest, res, 1, outfile);
		written += res;
	}
	if (res == 0) {
		printf("%zd bytes successfully decompressed.\n", written);
	} else if (res < 0) {
		fprintf(stderr, "Decompression error: %s (err:%zd)\n", lzw_strerror(res), res);
	}

Unlikely To Do

  • Add Google Benchmark.
  • Use hashing for lookups in lzw_string_table_lookup.
  • Support changing inputs during processing.
  • Gather/Scatter alternative interface.