This repo is a workspace for experimenting toward a future Hutter Prize submission in Rust.
It is not competitive yet. The current repo contains several small experimental codecs:
- adaptive block Huffman
- order-1 adaptive block Huffman
- naive LZ77 with literal and back-reference tokens
- byte-level PPM with arithmetic coding
- bit-level PPM with arithmetic coding
There is also an additive WikiMix-5 design skeleton for a future mixed-model codec:
src/wikimix.rsdocs/wikimix5.md
Codec names in the CLI:
huffman: original block-model Huffman codec, faster and the defaulthuffman-o1: order-1 Huffman codec, usually slowerlz77: naive LZ77 codecppm: order-3 byte-level PPM-style arithmetic coderppm-bit: order-3 bit-level PPM-style arithmetic coderppm-byte-mix: mixed-order byte-level PPM with adaptive per-order weightsppm-bit-mix: mixed-order bit-level PPM candidates with adaptive per-order weightsppm-mix: hybrid bit coder that mixes byte-level and bit-level predictionsmatch: standalone byte-level match predictor with arithmetic codingppm-match-mix: byte+match hybrid that mixes byte-level PPM and match-model predictions
Family note:
ppm-o1throughppm-o6are byte-level PPM codecsppm-bit,ppm-bit-o16,ppm-bit-o32, andppm-bit-o64are bit-level PPM codecsppm-byte-mixis byte-level mixed-order PPMppm-mixuses both byte-level and bit-level modelsmatchisolates the hashed match model on its own and codes whole bytes directlyppm-match-mixcombines the byte-level mixed model with the hashed match model
Their purpose is to give you a clean, testable loop for:
- compress
- decompress
- profile
ppm-match-mixinternals - verify correctness
- inspect file sizes quickly
As of the latest official record on the Hutter Prize site, the enwik9 record to beat is:
110,793,128total bytes by Kaido Orav and Byron Knoll withfx2-cmixon September 3, 2024
For a prize-eligible submission, you need at least a 1% improvement over the previous record, which means:
- your total
S = compressor + archivemust be below109,685,197bytes
Recent enwik9 records listed by the official site:
110,793,128bytes,fx2-cmix, Kaido Orav and Byron Knoll, September 3, 2024112,578,322bytes,fx-cmix, Kaido Orav, February 2, 2024114,156,155bytes,fast cmix, Saurabh Kumar, July 16, 2023115,352,938bytes,starlit, Artemiy Margaritov, May 31, 2021116,673,681bytes,phda9v1.8, Alexander Rhatushnyak, July 4, 2019
The official task is to create a Linux or Windows compressor plus a self-extracting archive that reproduces enwik9 exactly, with these practical constraints:
- lossless reconstruction of the exact
enwik9 - total size counts both the compressor executable and the produced archive
- the archive must be self-contained, so yes, you effectively need decompression support
- documented source code must be published under an OSI-approved license before prize payment
- practical limits are roughly single-core, under 10 GB RAM, and under about 50 hours on the prize machine
For this starter repo, the CLI uses separate compress and decompress commands because that is easier to iterate on. Later, if you approach a serious submission, you would likely switch to a self-extracting archive flow.
Recent winning submissions are mostly C++-centric cmix derivatives with shell scripts and, in some cases, Python or notebooks for preprocessing.
Rust is not the common language in public Hutter Prize winners, but it is still a reasonable choice for a clean experimental codebase.
src/main.rs: top-level entrypointsrc/cli.rs: command-line parsing and usage textsrc/codec.rs: codec selection and archive-format dispatchsrc/hybrid_ppm.rs: hybrid byte+bit mixed PPM codecsrc/ppm_match_mix.rs: standalone byte-level match codec, legacy match-only decoder, and match-augmented hybrid codecsrc/block_huffman.rs: original adaptive block-Huffman codecsrc/adaptive_huffman.rs: slower order-1 adaptive block-Huffman codecsrc/lz77.rs: simple LZ77 codecsrc/byte_ppm.rs: byte-level PPM-style arithmetic codecsrc/ppm.rs: bit-level PPM-style arithmetic codecdata/sample.txt: tiny sample corpus for smoke testsdata/enwik8: optional 100 MB Hutter corpus for faster iterationdata/enwik9: optional 1 GB Hutter Prize corpus for full-scale benchmarking
Build:
make buildRun a round-trip test on the sample file:
make roundtripSelect a codec explicitly:
make roundtrip CODEC=huffman
make roundtrip CODEC=huffman-o1
make roundtrip CODEC=lz77
make roundtrip CODEC=ppm-byte-mix
make roundtrip CODEC=ppm-bit
make roundtrip CODEC=ppm-bit-mix
make roundtrip CODEC=ppm-mix
make roundtrip CODEC=match
make roundtrip CODEC=ppm-match-mix
make roundtrip CODEC=wikimix5Run a round-trip with an explicit codec:
cargo run --release -- compress --codec huffman data/sample.txt build/sample.huf
cargo run --release -- decompress build/sample.huf build/sample.restored
cmp data/sample.txt build/sample.restored
cargo run --release -- compress --codec huffman-o1 data/sample.txt build/sample-o1.huf
cargo run --release -- decompress build/sample-o1.huf build/sample.restored
cmp data/sample.txt build/sample.restored
cargo run --release -- compress --codec lz77 data/sample.txt build/sample.lz77
cargo run --release -- decompress build/sample.lz77 build/sample.restored
cmp data/sample.txt build/sample.restored
cargo run --release -- compress --codec ppm-bit data/sample.txt build/sample.ppm
cargo run --release -- decompress build/sample.ppm build/sample.restored
cmp data/sample.txt build/sample.restored
cargo run --release -- compress --codec ppm-byte-mix data/sample.txt build/sample.pbmx
cargo run --release -- decompress build/sample.pbmx build/sample.restored
cmp data/sample.txt build/sample.restored
cargo run --release -- compress --codec ppm-bit-mix data/sample.txt build/sample.pmix
cargo run --release -- decompress build/sample.pmix build/sample.restored
cmp data/sample.txt build/sample.restored
cargo run --release -- compress --codec ppm-mix data/sample.txt build/sample.mix
cargo run --release -- decompress build/sample.mix build/sample.restored
cmp data/sample.txt build/sample.restored
cargo run --release -- compress --codec match data/sample.txt build/sample.match
cargo run --release -- decompress build/sample.match build/sample.restored
cmp data/sample.txt build/sample.restored
cargo run --release -- compress --codec ppm-match-mix data/sample.txt build/sample.pmm2
cargo run --release -- decompress build/sample.pmm2 build/sample.restored
cmp data/sample.txt build/sample.restored
cargo run --release -- compress --codec wikimix5 data/sample.txt build/sample.wmx5
cargo run --release -- decompress build/sample.wmx5 build/sample.restored
cmp data/sample.txt build/sample.restoredIf --codec is omitted, compress defaults to huffman. decompress auto-detects the archive format from the file header, so you do not need to specify the codec when restoring.
The match codec writes new PMAT archives with a byte-oriented payload marker. The decoder still accepts older PMAT archives produced by the previous bitwise match-only implementation.
The ppm-match-mix codec now writes PMM3 archives for the byte+match design. The decoder still accepts older PMM2 archives produced by the previous bit+byte+match implementation.
Progress bars are shown by default on terminal runs of compress, decompress, and stats. Use --no-progress to disable them:
cargo run --release -- compress --no-progress --codec ppm-match-mix data/sample.txt build/sample.pmm2
cargo run --release -- decompress --no-progress build/sample.pmm2 build/sample.restored
cargo run --release -- stats --no-progress data/sample.txt build/sample.pmm2Profile the internal ppm-match-mix model combinations on one input:
cargo run --release -- profile data/sample.txt
cargo run --release -- profile --no-progress data/enwik8profile prints timings and archive sizes for:
bit-onlybyte-onlymatch-onlybyte+matchlegacy bit+byte+matchppm-match-mix
It also prints the percentage of time saved versus the current ppm-match-mix implementation.
Run the unit tests:
cargo testDownload the official Hutter corpora into data/ and verify them:
make enwik8
make enwik9
make corporamake corpora downloads and verifies both enwik8 and enwik9. make data-files is an alias for the same target.
Run on a larger input:
make roundtrip INPUT=data/enwik8
make bench INPUT=data/enwik8
make bench INPUT=data/enwik8 CODEC=huffman
make bench INPUT=data/enwik8 CODEC=huffman-o1
make bench INPUT=data/enwik8 CODEC=ppm-byte-mix
make bench INPUT=data/enwik8 CODEC=ppm-bit
make bench INPUT=data/enwik8 CODEC=ppm-bit-mix
make bench INPUT=data/enwik8 CODEC=ppm-mix
make bench INPUT=data/enwik8 CODEC=match
make bench INPUT=data/enwik8 CODEC=ppm-match-mix
make bench INPUT=data/enwik8 CODEC=wikimix5
make roundtrip INPUT=data/enwik8 CODEC=huffman ARCHIVE=build/enwik8.ahf1 RESTORED=build/enwik8.ahf1.restored
make roundtrip INPUT=data/enwik8 CODEC=huffman-o1 ARCHIVE=build/enwik8.ahf1 RESTORED=build/enwik8.ahf1.restored
make roundtrip INPUT=data/enwik8 CODEC=lz77 ARCHIVE=build/enwik8.lz77 RESTORED=build/enwik8.lz77.restored
make roundtrip INPUT=data/enwik8 CODEC=ppm-byte-mix ARCHIVE=build/enwik8.pbmx RESTORED=build/enwik8.pbmx.restored
make roundtrip INPUT=data/enwik8 CODEC=ppm-bit ARCHIVE=build/enwik8.ppm RESTORED=build/enwik8.ppm.restored
make roundtrip INPUT=data/enwik8 CODEC=ppm-bit-mix ARCHIVE=build/enwik8.pmix RESTORED=build/enwik8.pmix.restored
make roundtrip INPUT=data/enwik8 CODEC=ppm-mix ARCHIVE=build/enwik8.mix RESTORED=build/enwik8.mix.restored
make roundtrip INPUT=data/enwik8 CODEC=match ARCHIVE=build/enwik8.match RESTORED=build/enwik8.match.restored
make roundtrip INPUT=data/enwik8 CODEC=ppm-match-mix ARCHIVE=build/enwik8.pmm2 RESTORED=build/enwik8.pmm2.restored
make roundtrip INPUT=data/enwik8 CODEC=wikimix5 ARCHIVE=build/enwik8.wmx5 RESTORED=build/enwik8.wmx5.restored
make roundtrip INPUT=data/enwik9
make bench INPUT=data/enwik9
make bench INPUT=data/enwik9 CODEC=huffman
make bench INPUT=data/enwik9 CODEC=huffman-o1
make bench INPUT=data/enwik9 CODEC=ppm-byte-mix
make bench INPUT=data/enwik9 CODEC=ppm-bit
make bench INPUT=data/enwik9 CODEC=ppm-bit-mix
make bench INPUT=data/enwik9 CODEC=ppm-mix
make bench INPUT=data/enwik9 CODEC=match
make bench INPUT=data/enwik9 CODEC=ppm-match-mix
make bench INPUT=data/enwik9 CODEC=wikimix5
make roundtrip INPUT=data/enwik9 CODEC=huffman ARCHIVE=build/enwik9.ahf1 RESTORED=build/enwik9.ahf1.restored
make roundtrip INPUT=data/enwik9 CODEC=huffman-o1 ARCHIVE=build/enwik9.ahf1 RESTORED=build/enwik9.ahf1.restored
make roundtrip INPUT=data/enwik9 CODEC=lz77 ARCHIVE=build/enwik9.lz77 RESTORED=build/enwik9.lz77.restored
make roundtrip INPUT=data/enwik9 CODEC=ppm-byte-mix ARCHIVE=build/enwik9.pbmx RESTORED=build/enwik9.pbmx.restored
make roundtrip INPUT=data/enwik9 CODEC=ppm-bit ARCHIVE=build/enwik9.ppm RESTORED=build/enwik9.ppm.restored
make roundtrip INPUT=data/enwik9 CODEC=ppm-bit-mix ARCHIVE=build/enwik9.pmix RESTORED=build/enwik9.pmix.restored
make roundtrip INPUT=data/enwik9 CODEC=ppm-mix ARCHIVE=build/enwik9.mix RESTORED=build/enwik9.mix.restored
make roundtrip INPUT=data/enwik9 CODEC=match ARCHIVE=build/enwik9.match RESTORED=build/enwik9.match.restored
make roundtrip INPUT=data/enwik9 CODEC=ppm-match-mix ARCHIVE=build/enwik9.pmm2 RESTORED=build/enwik9.pmm2.restored
make roundtrip INPUT=data/enwik9 CODEC=wikimix5 ARCHIVE=build/enwik9.wmx5 RESTORED=build/enwik9.wmx5.restored- Official prize page: https://prize.hutter1.net/
- Official rules: https://prize.hutter1.net/hrules.htm
- Official FAQ: https://www.hutter1.net/prize/hfaq.htm
- 2024 winning repo: https://github.com/kaitz/fx2-cmix
- 2021 winning repo: https://github.com/amargaritov/starlit