We propose a method for lossless compression based on a two-fold strategy: (i) decomposition of a window of arbitray bytes to a series of reduced bitsets; (ii) compression of bitsets using colexicographic ranking.
The algorithm is presently incapable of outperforming standard compression utilities like gzip and xz. However, it works. We see it, therefore, as only a proof of concept - a starting point for bigger things.
Consider, in a universe of size 3, a window of five elements, say
In practice, we deal with byte windows which have a universe size of 256, but the methodology is similar to the synthetic example above. If each byte present in a byte window can be represented by a sparse (and diminishing) bitset, then these bitsets can be compressed and stored as a concatentation, potentially with fewer bytes than the original byte window.
One way to compress a bitset is to consider its length,
Bitset
We seek a function to uniquely index each
where
We can recover
where
Note: ranking/unranking heavily depend on binomial coefficients. Rather than compute these quantities on demand (a potentially expensive proposition), they can be efficiently pre-computed using Pascal's Rule. For small window size, gmpy2.bincoef is sufficiently fast.
The file format is fairly rudimentary. The beginning of the file contains magic bytes and the number of uncompressed bytes for each window. Then a succession of compressed windows. The end of the file contains the MD5 digest of the original.
usage: main.py [-h] [-d] [-k] [-s SIZE] [-v] file
Compress/decompress a file
positional arguments:
file the file to process
optional arguments:
-h, --help show this help message and exit
-d, --decompress run in decompression mode
-k, --keep retain files
-s SIZE, --size SIZE number of bytes per processing window (default 1024, max 4096)
-v, --verbose run verbosely
(Requires Python 3.9.)
- Variable/optimal length byte windows
- Port from Python to C++
- Parallelisation
- Somehow use Gosper's Hack for ranking/unranking?
I am more or less retired from full-time development. But I'm happy to talk to potential employers about niche R&D roles. rasmusaj at mac dot com
- Ranking and Unranking of Combinations and Permutations, Derrick Stolee (2012).
- Binary Image Compression Based on Binomial Numbers, O. Borysenko, I. Kulyk, S. Kostel, O. Skordina (2010).