Skip to content

bbondy/hutter

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Hutter Prize

This repo is a workspace for experimenting toward a future Hutter Prize submission in Rust.

It is not competitive yet. The current repo contains several small experimental codecs:

  • adaptive block Huffman
  • order-1 adaptive block Huffman
  • naive LZ77 with literal and back-reference tokens
  • byte-level PPM with arithmetic coding
  • bit-level PPM with arithmetic coding

There is also an additive WikiMix-5 design skeleton for a future mixed-model codec:

  • src/wikimix.rs
  • docs/wikimix5.md

Codec names in the CLI:

  • huffman: original block-model Huffman codec, faster and the default
  • huffman-o1: order-1 Huffman codec, usually slower
  • lz77: naive LZ77 codec
  • ppm: order-3 byte-level PPM-style arithmetic coder
  • ppm-bit: order-3 bit-level PPM-style arithmetic coder
  • ppm-byte-mix: mixed-order byte-level PPM with adaptive per-order weights
  • ppm-bit-mix: mixed-order bit-level PPM candidates with adaptive per-order weights
  • ppm-mix: hybrid bit coder that mixes byte-level and bit-level predictions
  • match: standalone byte-level match predictor with arithmetic coding
  • ppm-match-mix: byte+match hybrid that mixes byte-level PPM and match-model predictions

Family note:

  • ppm-o1 through ppm-o6 are byte-level PPM codecs
  • ppm-bit, ppm-bit-o16, ppm-bit-o32, and ppm-bit-o64 are bit-level PPM codecs
  • ppm-byte-mix is byte-level mixed-order PPM
  • ppm-mix uses both byte-level and bit-level models
  • match isolates the hashed match model on its own and codes whole bytes directly
  • ppm-match-mix combines the byte-level mixed model with the hashed match model

Their purpose is to give you a clean, testable loop for:

  • compress
  • decompress
  • profile ppm-match-mix internals
  • verify correctness
  • inspect file sizes quickly

Current target

As of the latest official record on the Hutter Prize site, the enwik9 record to beat is:

  • 110,793,128 total bytes by Kaido Orav and Byron Knoll with fx2-cmix on September 3, 2024

For a prize-eligible submission, you need at least a 1% improvement over the previous record, which means:

  • your total S = compressor + archive must be below 109,685,197 bytes

Recent enwik9 records listed by the official site:

  • 110,793,128 bytes, fx2-cmix, Kaido Orav and Byron Knoll, September 3, 2024
  • 112,578,322 bytes, fx-cmix, Kaido Orav, February 2, 2024
  • 114,156,155 bytes, fast cmix, Saurabh Kumar, July 16, 2023
  • 115,352,938 bytes, starlit, Artemiy Margaritov, May 31, 2021
  • 116,673,681 bytes, phda9v1.8, Alexander Rhatushnyak, July 4, 2019

What the contest actually requires

The official task is to create a Linux or Windows compressor plus a self-extracting archive that reproduces enwik9 exactly, with these practical constraints:

  • lossless reconstruction of the exact enwik9
  • total size counts both the compressor executable and the produced archive
  • the archive must be self-contained, so yes, you effectively need decompression support
  • documented source code must be published under an OSI-approved license before prize payment
  • practical limits are roughly single-core, under 10 GB RAM, and under about 50 hours on the prize machine

For this starter repo, the CLI uses separate compress and decompress commands because that is easier to iterate on. Later, if you approach a serious submission, you would likely switch to a self-extracting archive flow.

Language notes

Recent winning submissions are mostly C++-centric cmix derivatives with shell scripts and, in some cases, Python or notebooks for preprocessing. Rust is not the common language in public Hutter Prize winners, but it is still a reasonable choice for a clean experimental codebase.

Workspace layout

  • src/main.rs: top-level entrypoint
  • src/cli.rs: command-line parsing and usage text
  • src/codec.rs: codec selection and archive-format dispatch
  • src/hybrid_ppm.rs: hybrid byte+bit mixed PPM codec
  • src/ppm_match_mix.rs: standalone byte-level match codec, legacy match-only decoder, and match-augmented hybrid codec
  • src/block_huffman.rs: original adaptive block-Huffman codec
  • src/adaptive_huffman.rs: slower order-1 adaptive block-Huffman codec
  • src/lz77.rs: simple LZ77 codec
  • src/byte_ppm.rs: byte-level PPM-style arithmetic codec
  • src/ppm.rs: bit-level PPM-style arithmetic codec
  • data/sample.txt: tiny sample corpus for smoke tests
  • data/enwik8: optional 100 MB Hutter corpus for faster iteration
  • data/enwik9: optional 1 GB Hutter Prize corpus for full-scale benchmarking

Commands

Build:

make build

Run a round-trip test on the sample file:

make roundtrip

Select a codec explicitly:

make roundtrip CODEC=huffman
make roundtrip CODEC=huffman-o1
make roundtrip CODEC=lz77
make roundtrip CODEC=ppm-byte-mix
make roundtrip CODEC=ppm-bit
make roundtrip CODEC=ppm-bit-mix
make roundtrip CODEC=ppm-mix
make roundtrip CODEC=match
make roundtrip CODEC=ppm-match-mix
make roundtrip CODEC=wikimix5

Run a round-trip with an explicit codec:

cargo run --release -- compress --codec huffman data/sample.txt build/sample.huf
cargo run --release -- decompress build/sample.huf build/sample.restored
cmp data/sample.txt build/sample.restored

cargo run --release -- compress --codec huffman-o1 data/sample.txt build/sample-o1.huf
cargo run --release -- decompress build/sample-o1.huf build/sample.restored
cmp data/sample.txt build/sample.restored

cargo run --release -- compress --codec lz77 data/sample.txt build/sample.lz77
cargo run --release -- decompress build/sample.lz77 build/sample.restored
cmp data/sample.txt build/sample.restored

cargo run --release -- compress --codec ppm-bit data/sample.txt build/sample.ppm
cargo run --release -- decompress build/sample.ppm build/sample.restored
cmp data/sample.txt build/sample.restored

cargo run --release -- compress --codec ppm-byte-mix data/sample.txt build/sample.pbmx
cargo run --release -- decompress build/sample.pbmx build/sample.restored
cmp data/sample.txt build/sample.restored

cargo run --release -- compress --codec ppm-bit-mix data/sample.txt build/sample.pmix
cargo run --release -- decompress build/sample.pmix build/sample.restored
cmp data/sample.txt build/sample.restored

cargo run --release -- compress --codec ppm-mix data/sample.txt build/sample.mix
cargo run --release -- decompress build/sample.mix build/sample.restored
cmp data/sample.txt build/sample.restored

cargo run --release -- compress --codec match data/sample.txt build/sample.match
cargo run --release -- decompress build/sample.match build/sample.restored
cmp data/sample.txt build/sample.restored

cargo run --release -- compress --codec ppm-match-mix data/sample.txt build/sample.pmm2
cargo run --release -- decompress build/sample.pmm2 build/sample.restored
cmp data/sample.txt build/sample.restored

cargo run --release -- compress --codec wikimix5 data/sample.txt build/sample.wmx5
cargo run --release -- decompress build/sample.wmx5 build/sample.restored
cmp data/sample.txt build/sample.restored

If --codec is omitted, compress defaults to huffman. decompress auto-detects the archive format from the file header, so you do not need to specify the codec when restoring.

The match codec writes new PMAT archives with a byte-oriented payload marker. The decoder still accepts older PMAT archives produced by the previous bitwise match-only implementation. The ppm-match-mix codec now writes PMM3 archives for the byte+match design. The decoder still accepts older PMM2 archives produced by the previous bit+byte+match implementation.

Progress bars are shown by default on terminal runs of compress, decompress, and stats. Use --no-progress to disable them:

cargo run --release -- compress --no-progress --codec ppm-match-mix data/sample.txt build/sample.pmm2
cargo run --release -- decompress --no-progress build/sample.pmm2 build/sample.restored
cargo run --release -- stats --no-progress data/sample.txt build/sample.pmm2

Profile the internal ppm-match-mix model combinations on one input:

cargo run --release -- profile data/sample.txt
cargo run --release -- profile --no-progress data/enwik8

profile prints timings and archive sizes for:

  • bit-only
  • byte-only
  • match-only
  • byte+match
  • legacy bit+byte+match
  • ppm-match-mix

It also prints the percentage of time saved versus the current ppm-match-mix implementation.

Run the unit tests:

cargo test

Download the official Hutter corpora into data/ and verify them:

make enwik8
make enwik9
make corpora

make corpora downloads and verifies both enwik8 and enwik9. make data-files is an alias for the same target.

Run on a larger input:

make roundtrip INPUT=data/enwik8
make bench INPUT=data/enwik8
make bench INPUT=data/enwik8 CODEC=huffman
make bench INPUT=data/enwik8 CODEC=huffman-o1
make bench INPUT=data/enwik8 CODEC=ppm-byte-mix
make bench INPUT=data/enwik8 CODEC=ppm-bit
make bench INPUT=data/enwik8 CODEC=ppm-bit-mix
make bench INPUT=data/enwik8 CODEC=ppm-mix
make bench INPUT=data/enwik8 CODEC=match
make bench INPUT=data/enwik8 CODEC=ppm-match-mix
make bench INPUT=data/enwik8 CODEC=wikimix5
make roundtrip INPUT=data/enwik8 CODEC=huffman ARCHIVE=build/enwik8.ahf1 RESTORED=build/enwik8.ahf1.restored
make roundtrip INPUT=data/enwik8 CODEC=huffman-o1 ARCHIVE=build/enwik8.ahf1 RESTORED=build/enwik8.ahf1.restored
make roundtrip INPUT=data/enwik8 CODEC=lz77 ARCHIVE=build/enwik8.lz77 RESTORED=build/enwik8.lz77.restored
make roundtrip INPUT=data/enwik8 CODEC=ppm-byte-mix ARCHIVE=build/enwik8.pbmx RESTORED=build/enwik8.pbmx.restored
make roundtrip INPUT=data/enwik8 CODEC=ppm-bit ARCHIVE=build/enwik8.ppm RESTORED=build/enwik8.ppm.restored
make roundtrip INPUT=data/enwik8 CODEC=ppm-bit-mix ARCHIVE=build/enwik8.pmix RESTORED=build/enwik8.pmix.restored
make roundtrip INPUT=data/enwik8 CODEC=ppm-mix ARCHIVE=build/enwik8.mix RESTORED=build/enwik8.mix.restored
make roundtrip INPUT=data/enwik8 CODEC=match ARCHIVE=build/enwik8.match RESTORED=build/enwik8.match.restored
make roundtrip INPUT=data/enwik8 CODEC=ppm-match-mix ARCHIVE=build/enwik8.pmm2 RESTORED=build/enwik8.pmm2.restored
make roundtrip INPUT=data/enwik8 CODEC=wikimix5 ARCHIVE=build/enwik8.wmx5 RESTORED=build/enwik8.wmx5.restored

make roundtrip INPUT=data/enwik9
make bench INPUT=data/enwik9
make bench INPUT=data/enwik9 CODEC=huffman
make bench INPUT=data/enwik9 CODEC=huffman-o1
make bench INPUT=data/enwik9 CODEC=ppm-byte-mix
make bench INPUT=data/enwik9 CODEC=ppm-bit
make bench INPUT=data/enwik9 CODEC=ppm-bit-mix
make bench INPUT=data/enwik9 CODEC=ppm-mix
make bench INPUT=data/enwik9 CODEC=match
make bench INPUT=data/enwik9 CODEC=ppm-match-mix
make bench INPUT=data/enwik9 CODEC=wikimix5
make roundtrip INPUT=data/enwik9 CODEC=huffman ARCHIVE=build/enwik9.ahf1 RESTORED=build/enwik9.ahf1.restored
make roundtrip INPUT=data/enwik9 CODEC=huffman-o1 ARCHIVE=build/enwik9.ahf1 RESTORED=build/enwik9.ahf1.restored
make roundtrip INPUT=data/enwik9 CODEC=lz77 ARCHIVE=build/enwik9.lz77 RESTORED=build/enwik9.lz77.restored
make roundtrip INPUT=data/enwik9 CODEC=ppm-byte-mix ARCHIVE=build/enwik9.pbmx RESTORED=build/enwik9.pbmx.restored
make roundtrip INPUT=data/enwik9 CODEC=ppm-bit ARCHIVE=build/enwik9.ppm RESTORED=build/enwik9.ppm.restored
make roundtrip INPUT=data/enwik9 CODEC=ppm-bit-mix ARCHIVE=build/enwik9.pmix RESTORED=build/enwik9.pmix.restored
make roundtrip INPUT=data/enwik9 CODEC=ppm-mix ARCHIVE=build/enwik9.mix RESTORED=build/enwik9.mix.restored
make roundtrip INPUT=data/enwik9 CODEC=match ARCHIVE=build/enwik9.match RESTORED=build/enwik9.match.restored
make roundtrip INPUT=data/enwik9 CODEC=ppm-match-mix ARCHIVE=build/enwik9.pmm2 RESTORED=build/enwik9.pmm2.restored
make roundtrip INPUT=data/enwik9 CODEC=wikimix5 ARCHIVE=build/enwik9.wmx5 RESTORED=build/enwik9.wmx5.restored

Sources

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors