Computes a compact delta between two files so the new file can be reconstructed from the old file plus the (small) delta. Supports in-place reconstruction — the new version can be rebuilt directly in the buffer holding the old version, with no scratch space.
Six implementations — Python, Rust, C++, C, Java, and Go — producing byte-identical binary deltas. Encode with any one, decode with any other.
Implements the algorithms from two papers:
-
M. Ajtai, R. Burns, R. Fagin, D.D.E. Long, L. Stockmeyer. Compactly Encoding Unstructured Inputs with Differential Compression. Journal of the ACM, 49(3):318-367, May 2002.
-
R.C. Burns, D.D.E. Long, L. Stockmeyer. In-Place Reconstruction of Version Differences. IEEE Transactions on Knowledge and Data Engineering, 15(4):973-984, Jul/Aug 2003.
Both papers are in pubs/.
cd src/python
python3 delta.py encode onepass old.bin new.bin delta.bin
python3 delta.py decode old.bin delta.bin recovered.bincd src/rust/delta
cargo build --release
target/release/delta encode onepass old.bin new.bin delta.bin
target/release/delta decode old.bin delta.bin recovered.bincd src/cpp
cmake -B build && cmake --build build
build/delta encode onepass old.bin new.bin delta.bin
build/delta decode old.bin delta.bin recovered.bincd src/c
make
./delta encode onepass old.bin new.bin delta.bin
./delta decode old.bin delta.bin recovered.bincd src/java
javac -d out delta/*.java
java -cp out delta.Delta encode onepass old.bin new.bin delta.bin
java -cp out delta.Delta decode old.bin delta.bin recovered.bincd src/go
go build ./cmd/delta
./delta encode onepass old.bin new.bin delta.bin
./delta decode old.bin delta.bin recovered.bin| Algorithm | Time | Space | Best for |
|---|---|---|---|
onepass |
O(n) | O(1) | General use — fast, good compression |
correcting |
~O(n) | O(q) | Data with rearranged/moved blocks |
greedy |
O(n^2) | O(n) | Smallest possible delta (optimal) |
All three use Karp-Rabin rolling hashes (Karp & Rabin 1987) with a
Mersenne prime (2^61-1) for fingerprinting and a polynomial base of 263
for good bit mixing. Hash tables are auto-sized based on input length
(--table-size acts as a floor). An optional --splay flag replaces
the hash table with a Sleator-Tarjan splay tree. Frequently accessed
fingerprints are splayed to the root and stay there, giving O(log(n/f))
access for a fingerprint with frequency f — the same reason LRU caching
works in practice. For correcting, --splay also improves compression
slightly: the hash table discards fingerprints that collide to the same
slot, while the splay tree stores every checkpoint-passing seed. See
HOWTO.md for benchmark data. Use --verbose to see hash table sizing
and match statistics on stderr.
See HOWTO.md for tuning parameters, library API examples,
checkpointing internals, and benchmark results.
Standard deltas require a separate output buffer. In-place mode produces a delta that reconstructs the new version directly in the buffer holding the reference, with no scratch space. Useful for space-constrained devices (embedded systems, IoT) where you want to apply a software update without needing 2x the file size in memory.
# Produce an in-place delta
delta encode onepass old.bin new.bin patch.delta --inplace
# Decode works the same (auto-detects format)
delta decode old.bin patch.delta recovered.binThe in-place converter builds a CRWI (Conflicting Read-Write Interval) digraph where an edge from command i to j means "i must execute before j" (because i reads from a region j will overwrite). Kahn's algorithm (Kahn 1962) gives a topological sort with deterministic ordering via a min-heap keyed on (copy length, index). Cycles are broken by converting copy commands to literal add commands.
Cycle-breaking policies:
localmin(default) — converts the smallest copy in each cycle. Minimizes compression loss.constant— converts any vertex in the cycle. Slightly faster, marginally worse compression.
Unified format used by all six implementations:
Header (25 bytes):
DLT\x03 4-byte magic
flags 1 byte (bit 0 = in-place)
version_size uint32 big-endian
src_crc 8 bytes (CRC-64/XZ of reference file)
dst_crc 8 bytes (CRC-64/XZ of version file)
Commands (repeated):
type 0 END (1 byte)
type 1 COPY: src(u32) dst(u32) len(u32) — 13 bytes
type 2 ADD: dst(u32) len(u32) data — 9 + len bytes
All multi-byte integers are big-endian. Commands carry explicit
source and destination offsets. The flags byte distinguishes
standard deltas (flag 0x00) from in-place deltas (flag 0x01).
The two 8-byte CRC-64/XZ checksums are verified on decode: src_crc
is checked against the supplied reference file before reconstruction
begins; dst_crc is checked against the reconstructed output
afterwards. Pass --ignore-hash to decode to skip both checks and
attempt recovery from a corrupted or mismatched delta.
To run all unit and cross-language compatibility tests in one shot:
./tests/correctness.shIndividual suites:
| Language | Tests | Command |
|---|---|---|
| Python | 215 | cd src/python && python3 -m unittest test_delta -v |
| Rust | 85 | cd src/rust/delta && cargo test |
| C++ | 75 | cd src/cpp && cmake -B build && cmake --build build && ctest --test-dir build |
| C | 91 | cd src/c && make && bash test_delta.sh |
| Java | 57 | cd src/java && make test |
| Go | 63 | cd src/go && go test ./delta/... |
Tests cover all three algorithms, binary round-trips, paper examples,
edge cases (empty/identical/completely different files), in-place
reconstruction with both cycle-breaking policies, variable-length block
transpositions (8–64 blocks with controlled transpositions),
checkpointing correctness, and cross-language compatibility.
A kernel tarball benchmark (tests/kernel-delta-test.sh) exercises
onepass and correcting on ~871 MB inputs. On linux-5.1 → 5.1.1, all
six implementations are byte-compatible. See ANALYSIS.md for
tables and details.
src/
python/ Single-file library + CLI + 215-test suite
rust/delta/ Cargo crate — library + clap CLI + 85 tests
cpp/ CMake project — static library + CLI11 CLI + Catch2 tests (75)
c/ Makefile project — single-header API + CLI + 91 tests
java/ Makefile project — library + CLI + 57 tests
go/ Go module — library + CLI + 63 tests
tests/
correctness.sh Run all unit + cross-language tests (all 6 implementations)
kernel-delta-test.sh Kernel tarball benchmark
transposition-benchmark.sh Synthetic permutation benchmark
pubs/ Ajtai et al. 2002, Burns et al. 2003 (PDFs)
Each implementation has the same architecture: rolling hash, three
algorithm modules, binary encoding, command placement, and CRWI-based
in-place conversion. See HOWTO.md for detailed file
listings and library API examples.
The same pipeline runs in all six implementations:
R, V (byte arrays)
│
▼
[Algorithm] greedy / onepass / correcting
│ Karp-Rabin rolling hash (Mersenne prime 2^61-1, base 263)
│ Hash table or splay tree maps fingerprint → source offset(s)
│ Scan V; on match extend backward/forward to find longest copy
│
▼ Vec<Command> (Copy {offset, length} | Add {data})
│
▼
[place_commands] (trivial: assigns sequential dst offsets, no reordering)
│
▼ Vec<PlacedCommand> (Copy {src, dst, length} | Add {dst, data})
│
├──────────────────────────────────────────────────────────┐
│ (standard path) (--inplace) │
│ [make_inplace] │
│ CRWI digraph: edge i→j │
│ means i reads a region │
│ j will overwrite. │
│ Kahn topo-sort; cycles │
│ broken by converting │
│ smallest copy to ADD. │
│ │ │
▼ ▼ │
[encode_delta] ←──────────────────────────────┘
│ Header: DLT\x03 + flags(1) + version_size(u32 BE)
│ + src_crc(8) + dst_crc(8)
│ COPY: 0x01 + src(u32) + dst(u32) + len(u32)
│ ADD: 0x02 + dst(u32) + len(u32) + payload
│ END: 0x00
▼
binary delta file
Key types
| Type | Description |
|---|---|
Command |
Algorithm output: offset-relative Copy or raw Add |
PlacedCommand |
Absolute src/dst offsets — encodable and validatable |
DiffOptions |
p (seed length, default 16), q (table floor, default 1M+), max_table, verbose, use_splay |
CyclePolicy |
Localmin (convert smallest copy in cycle) · Constant (convert any) |
Module responsibilities (Rust names; other languages mirror this)
| Module | Role |
|---|---|
hash |
Karp-Rabin rolling hash; Mersenne modulo; next_prime for table sizing |
splay |
Sleator-Tarjan self-adjusting BST — optional alternative to hash table |
algorithm/greedy |
O(n²) optimal: stores every R offset per fingerprint, scans all candidates |
algorithm/onepass |
O(n) linear: concurrent R+V scan, two hash tables, version-based eviction (§4.3) |
algorithm/correcting |
~O(n): checkpoint filter (§8) limits table entries; lookback buffer corrects missed tail matches |
apply |
place_commands, apply_placed_to, apply_delta_inplace, bounds validation |
inplace |
CRWI graph construction, Kahn topological sort, cycle-breaking (Burns et al. 2003) |
encoding |
Binary encode/decode; CRC-64/XZ verification of reference and output |
The Karp-Rabin paper introduces the rolling hash / fingerprinting technique used by all three differencing algorithms. Wagner and Fischer formalized string-to-string correction (edit distance). Tichy extended it to block moves — the model solved by the algorithms here. Reichenberger and Miller-Myers are the prior O(n^2) optimal algorithms that Ajtai et al. improve upon. Rabin's paper describes the deterministic Miller-Rabin primality test (fixed witness set, Jaeschke 1993) used for hash table auto-sizing. Kahn's algorithm is used for topological sorting of the CRWI digraph during in-place conversion. Sleator and Tarjan's splay tree provides an alternative to hash tables for fingerprint lookup; frequent fingerprints self-promote to the root, giving sub-logarithmic amortised access on Zipfian distributions.
@article{ajtai2002differential,
author = {Ajtai, Mikl\'{o}s and Burns, Randal and Fagin, Ronald
and Long, Darrell D. E. and Stockmeyer, Larry},
title = {Compactly Encoding Unstructured Inputs with
Differential Compression},
journal = {Journal of the ACM},
volume = {49},
number = {3},
pages = {318--367},
year = {2002},
month = may,
publisher = {ACM},
doi = {10.1145/567112.567114},
}
@article{burns2003inplace,
author = {Burns, Randal C. and Long, Darrell D. E.
and Stockmeyer, Larry},
title = {In-Place Reconstruction of Version Differences},
journal = {IEEE Transactions on Knowledge and Data Engineering},
volume = {15},
number = {4},
pages = {973--984},
year = {2003},
month = jul # {/} # aug,
publisher = {IEEE},
doi = {10.1109/TKDE.2003.1209013},
}
@article{kahn1962topological,
author = {Kahn, Arthur B.},
title = {Topological Sorting of Large Networks},
journal = {Communications of the ACM},
volume = {5},
number = {11},
pages = {558--562},
year = {1962},
month = nov,
publisher = {ACM},
doi = {10.1145/368996.369025},
}
@article{karp1987efficient,
author = {Karp, Richard M. and Rabin, Michael O.},
title = {Efficient Randomized Pattern-Matching Algorithms},
journal = {IBM Journal of Research and Development},
volume = {31},
number = {2},
pages = {249--260},
year = {1987},
month = mar,
publisher = {IBM},
doi = {10.1147/rd.312.0249},
}
@article{rabin1980probabilistic,
author = {Rabin, Michael O.},
title = {Probabilistic Algorithm for Testing Primality},
journal = {Journal of Number Theory},
volume = {12},
number = {1},
pages = {128--138},
year = {1980},
month = feb,
publisher = {Elsevier},
doi = {10.1016/0022-314X(80)90084-0},
}
@article{levenshtein1966binary,
author = {Levenshtein, Vladimir I.},
title = {Binary Codes Capable of Correcting Deletions,
Insertions, and Reversals},
journal = {Soviet Physics Doklady},
volume = {10},
number = {8},
pages = {707--710},
year = {1966},
}
@article{miller1985file,
author = {Miller, Webb and Myers, Eugene W.},
title = {A File Comparison Program},
journal = {Software---Practice and Experience},
volume = {15},
number = {11},
pages = {1025--1040},
year = {1985},
publisher = {Wiley},
doi = {10.1002/spe.4380151102},
}
@article{sleator1985self,
author = {Sleator, Daniel D. and Tarjan, Robert E.},
title = {Self-Adjusting Binary Search Trees},
journal = {Journal of the ACM},
volume = {32},
number = {3},
pages = {652--686},
year = {1985},
month = jul,
publisher = {ACM},
doi = {10.1145/3828.3835},
}
@inproceedings{reichenberger1991delta,
author = {Reichenberger, Christoph},
title = {Delta Storage for Arbitrary Non-Text Files},
booktitle = {Proceedings of the 3rd International Workshop on
Software Configuration Management},
pages = {144--152},
year = {1991},
publisher = {ACM},
doi = {10.1145/111062.111080},
}
@article{tichy1984string,
author = {Tichy, Walter F.},
title = {The String-to-String Correction Problem with Block Moves},
journal = {ACM Transactions on Computer Systems},
volume = {2},
number = {4},
pages = {309--321},
year = {1984},
month = nov,
publisher = {ACM},
doi = {10.1145/357401.357404},
}
@article{wagner1974string,
author = {Wagner, Robert A. and Fischer, Michael J.},
title = {The String-to-String Correction Problem},
journal = {Journal of the ACM},
volume = {21},
number = {1},
pages = {168--173},
year = {1974},
month = jan,
publisher = {ACM},
doi = {10.1145/321796.321811},
}BSD 2-Clause. See LICENSE.