@Cyan4973 Cyan4973 released this Oct 9, 2017 · 1814 commits to master since this release

Assets 4

Zstandard Long Range Match Finder

Zstandard has a new long range match finder written by Facebook's intern Stella Lau (@stellamplau), which specializes on finding long matches in the distant past. It integrates seamlessly with the regular compressor, and the output can be decompressed just like any other Zstandard compressed data.

The long range match finder adds minimal overhead to the compressor, works with any compression level, and maintains Zstandard's blazingly fast decompression speed. However, since the window size is larger, it requires more memory for compression and decompression.

To go along with the long range match finder, we've increased the maximum window size to 2 GB. The decompressor only accepts window sizes up to 128 MB by default, but zstd -d --memory=2GB will decompress window sizes up to 2 GB.

Example usage

# 128 MB window size
zstd -1 --long file
zstd -d file.zst

# 2 GB window size (window log = 31)
zstd -6 --long=31 file
zstd -d --long=31 file.zst
# OR
zstd -d --memory=2GB file.zst
ZSTD_CCtx *cctx = ZSTD_createCCtx();
ZSTD_CCtx_setParameter(cctx, ZSTD_p_compressionLevel, 19);
ZSTD_CCtx_setParameter(cctx, ZSTD_p_enableLongDistanceMatching, 1); // Sets windowLog=27
ZSTD_CCtx_setParameter(cctx, ZSTD_p_windowLog, 30); // Optionally increase the window log
ZSTD_compress_generic(cctx, &out, &in, ZSTD_e_end);

ZSTD_DCtx *dctx = ZSTD_createDCtx();
ZSTD_DCtx_setMaxWindowSize(dctx, 1 << 30);
ZSTD_decompress_generic(dctx, &out, &in);

Benchmarks

We compared the zstd long range matcher to zstd and lrzip. The benchmarks were run on an AMD Ryzen 1800X (8 cores with 16 threads at 3.6 GHz).

Compressors

  • zstd — The regular Zstandard compressor.
  • zstd 128 MB — The Zstandard compressor with a 128 MB window size.
  • zstd 2 GB — The Zstandard compressor with a 2 GB window size.
  • lrzip xz — The lrzip compressor with default options, which uses the xz backend at level 7 with 16 threads.
  • lrzip xz single — The lrzip compressor with a single-threaded xz backend at level 7.
  • lrzip zstd — The lrzip compressor without a backend, then its output is compressed by zstd (not multithreaded).

Files

  • Linux 4.7 - 4.12 — This file consists of the uncompressed tarballs of the six Linux kernel release from 4.7 to 4.12 concatenated together in order. This file is extremely compressible if the compressor can match against the previous versions well.
  • Linux git — This file is a tarball of the linux repo, created by git clone https://github.com/torvalds/linux && tar -cf linux-git.tar linux/. This file gets a small benefit from long range matching. This file shows how the long range matcher performs when there isn't too many matches to find.

Results

Both zstd and zstd 128 MB don't have large enough of a window size to compress Linux 4.7 - 4.12 well. zstd 2 GB compresses the fastest, and slightly better than lrzip-zstd. lrzip-xz compresses the best, and at a reasonable speed with multithreading enabled. The place where zstd shines is decompression ease and speed. Since it is just regular Zstandard compressed data, it is decompressed by the highly optimized decompressor.

The Linux git file shows that the long range matcher maintains good compression and decompression speed, even when there are far less long range matches. The decompression speed takes a small hit because it has to look further back to reconstruct the matches.

Compression Ratio vs Speed Decompression Speed
Linux 4.7 - 12 compression ratio vs speed Linux 4.7 - 12 decompression speed
Linux git compression ratio vs speed Linux git decompression speed

Implementation details

The long distance match finder was inspired by great work from Con Kolivas' lrzip, which in turn was inspired by Andrew Tridgell's rzip. Also, let's mention Bulat Ziganshin's srep, which we have not been able to test unfortunately (site down), but the discussions on encode.ru proved great sources of inspiration.

Therefore, many similar mechanisms are adopted, such as using a rolling hash, and filling a hash table divided into buckets of entries.

That being said, we also made different choices, with the goal to favor speed, as can be observed in benchmark. The rolling hash formula is selected for computing efficiency. There is a restrictive insertion policy, which only inserts candidates that respect a mask condition. The insertion policy allows us to skip the hash table in the common case that a match isn't present. Confirmation bits are saved, to only check for matches when there is a strong presumption of success. These and a few more details add up to make zstd's long range matcher a speed-oriented implementation.

The biggest difference though is that the long range matcher is blended into the regular compressor, producing a single valid zstd frame, undistinguishable from normal operation (except obviously for the larger window size). This makes decompression a single pass process, preserving its speed property.

More details are available directly in source code, at lib/compress/zstd_ldm.c.

Future work

This is a first implementation, and it still has a few limitations, that we plan to lift in the future.

The long range matcher doesn't interact well with multithreading. Due to the way zstd multithreading is currently implemented, memory usage will scale with the window size times the number of threads, which is a problem for large window sizes. We plan on supporting multithreaded long range matching with reasonable memory usage in a future version.

Secondly, Zstandard is currently limited to a 2 GB window size because of indexer's design. While this is a significant update compared to previous 128 MB limit, we believe this limitation can be lifted altogether, with some structural changes in the indexer. However, it also means that window size would become really big, with knock-off consequences on memory usage. So, to reduce this load, we will have to consider memory map as a complementary way to reference past content in the uncompressed file.

Detailed list of changes

  • new : long range mode, using --long command, by Stella Lau (@stellamplau)
  • new : ability to generate and decode magicless frames (#591)
  • changed : maximum nb of threads reduced to 200, to avoid address space exhaustion in 32-bits mode
  • fix : multi-threading compression works with custom allocators, by @terrelln
  • fix : a rare compression bug when compression generates very large distances and bunch of other conditions (only possible at --ultra -22)
  • fix : 32-bits build can now decode large offsets (levels 21+)
  • cli : added LZ4 frame support by default, by Felix Handte (@felixhandte)
  • cli : improved --list output
  • cli : new : can split input file for dictionary training, using command -B#
  • cli : new : clean operation artefact on Ctrl-C interruption (#854)
  • cli : fix : do not change /dev/null permissions when using command -t with root access, reported by @mike155 (#851)
  • cli : fix : write file size in header in multiple-files mode
  • api : added macro ZSTD_COMPRESSBOUND() for static allocation
  • api : experimental : new advanced decompression API
  • api : fix : sizeof_CCtx() used to over-estimate
  • build: fix : compilation works with -mbmi (#868)
  • build: fix : no-multithread variant compiles without pool.c dependency, reported by Mitchell Blank Jr (@mitchblank) (#819)
  • build: better compatibility with reproducible builds, by Bernhard M. Wiedemann (@bmwiedemann) (#818)
  • example : added streaming_memory_usage
  • license : changed /examples license to BSD + GPLv2
  • license : fix a few header files to reflect new license (#825)

Warning

bug #944 : v1.3.2 is known to produce corrupted data in the following scenario, requiring all these conditions simultaneously :

  • compression using multi-threading
  • with a dictionary
  • on "large enough" files (several MB, exact threshold depends on compression level)

Note that dictionary is meant to help compression of small files (a few KB), while multi-threading is only useful for large files, so it's pretty rare to need both at the same time. Nonetheless, if your application happens to trigger this situation, it's recommended to skip v1.3.2 for a newer version. At the time of this warning, the dev branch is known to work properly for the same scenario.