Release v0.3.0 · feichai0017/holt

The v0.3 milestone ships the API split, walker hot-path
optimizations, WAL format cleanup, batch-WAL encoding, recursive
cross-blob latch coupling, journal group commit, candidate-driven
online maintenance, Linux fixed-buffer io_uring checkpoint I/O,
and release benchmark coverage focused on metadata workloads.

The three breaking-but-surgical wins below land first; the
extreme metadata-engine performance track builds on them.

Breaking — WAL format v3 (drops dead audit fields)

TxnOp::Insert.prev_value and TxnOp::Erase.value are gone.
Both were documented as "for replay reversibility" but replay is
an idempotent forward redo that only consumes (key, value) for
Insert and key for Erase — the prior-value slots were dead
weight on every returning Tree::insert / Tree::remove (the
blind variants already wrote None).

File format version bumped 2 → 3. A binary built from the
earlier v0.3 draft opening a v3 WAL fails with
Error::ReplaySanityFailed /
"WAL file format version unsupported" rather than mis-parsing
the absent optional-bytes slot as a length prefix. Upgrade
path for local data written by the earlier v0.3 draft:
Tree::checkpoint() before swapping in the new binary —
checkpoint truncates the WAL to header-only, so the next open
writes a v0.3 (= v3-format) header.

Concretely:

WAL Insert record body shrinks by one optional_bytes slot
(u8 presence tag + u32 length + prev_value.len() bytes
on the returning path).
WAL Erase record body shrinks by one optional_bytes slot
(same shape as Insert).
Returning Tree::insert / Tree::remove no longer clone or
serialise the prior value into the WAL buffer; the caller still
receives it through the return path (walker leaf_extent read,
same as the earlier v0.3 API split).

Breaking — public lookup surface stays owned

The draft Tree::get_with closure API was removed before v0.3.
The engine still uses an internal zero-copy lookup walker
for existence probes and rename preflight, but the public API stays
small:

pub fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>>;

Reason: a closure borrowing live cache bytes is a lifetime and
contention contract, not just a convenience method. For the
metadata-engine surface we want to stabilize now, owned Vec<u8>
lookups plus range iteration are the right external boundary.

Performance — `txn()` batch bypasses `TxnOp` enum

Tree::txn (the batched-multi-op API) now uses a streaming
BatchEncoder that writes inner-op bytes directly from &[u8]
refs into the WAL pending buffer. The previous v0.3 draft path
constructed TxnOp::Insert { key: Vec<u8>, value: Vec<u8>, .. }
(and similarly for Erase / RenameObject) per inner op, then
encoded the enum-of-Vecs — two extra clones per op the WAL never
needed.

Mid-batch error handling is preserved: if a walker call returns
Err partway through, the encoder's Drop rolls the partial
record bytes off the WAL pending buffer (truncate to where
begin started). On Ok, the encoder backpatches the inner
count + body length and appends the CRC — byte-identical to
what the old encode_record(&TxnOp::Batch { .. }) path would
have produced (verified by batch_encoder_wire_matches_encode_record
in src/journal/codec.rs::tests).

Performance — rename's wasted `wants_prev`

Tree::rename's two walker calls (erase_multi on src,
insert_multi on dst) were passing wants_prev=true even
though the rename already read the src value in the pre-walker
lookup and the dst-existence check (or force=true) gates the
insert side. The walker-materialised previous values were
dropped on the floor. Flipped to wants_prev=false on both;
same change applied inside the batch-path rename arm of
apply_batch_walker_inline.

Internal — codec cleanup

Dropped write_optional_bytes / read_optional_bytes helpers
(no callers after Insert/Erase shed their optional slots).
WalWriter::append (the generic &TxnOp entry) is now
test-only. Foreground mutation paths encode owned WAL records
with the per-variant codec helpers and hand those bytes to the
journal worker.
apply_put_inner / apply_delete_inner / apply_rename_inner
Tree helpers folded into the new apply_batch_walker_inline.

Breaking — BlobNode format + cross-blob latch coupling

Implements the first real concurrency cut for split metadata
trees and makes the on-disk BlobNode contract smaller. put /
insert and delete / remove no longer hold a root or
ancestor blob's exclusive latch while mutating a descendant blob
on the key path. The walker reads the BlobNode, pins the child
blob, releases the parent, and repeats that handoff recursively
for every deeper BlobNode.

The correctness boundary also changes: parent
BlobNode.child_entry_ptr is removed. The child blob's own
header.root_slot is the only cross-blob entry. Child-local
splits, collapses, and post-compact root-slot changes therefore
do not require re-acquiring and rewriting the parent BlobNode.

Concretely:

BlobNode stays 128 bytes, but the removed child-entry field
is reclaimed for inline prefix payload. BLOB_MAX_INLINE
grows from 96 to 104 bytes.
This is an on-disk format break for multi-blob trees written by
earlier v0.3 draft binaries. Rebuild from checkpointed data or
export/import before moving existing draft data onto this
layout.
BlobWriteGuard::frame() remains the writer-side convenience
for in-place mutation, but it now returns a plain
BlobFrame::wrap(...); there is no per-slot version sidecar.
walker::insert::insert_multi first tries the recursive
lock-coupled path. If the root blob is full, spillover_blob
- compact_blob run as before, then the walker re-checks the
  key path; if the target subtree moved behind a new BlobNode,
  the retry immediately hands off to the child instead of holding
  the root latch across child mutation.
walker::erase::erase_multi mirrors the same recursive
handoff. The lock-coupled delete path leaves an emptied child
blob reachable with an EmptyRoot sentinel; structural pruning
remains a maintenance concern rather than a foreground delete
latch chain.
walker::lookup, range, and merge paths follow
child.header.root_slot when crossing blobs, so parent
BlobNodes never need child-entry repair after child compaction.
The old recursive insert_at_blob_node /
erase_at_blob_node parent-held fallback arms are removed.

Performance / Correctness — online maintenance gate + shape counters

The remaining v0.3 concurrency cleanup is now in place:

Foreground mutation paths (put / insert / delete /
remove / rename / txn) enter the shared side of a narrow
atomic maintenance_gate while they may cross BlobNode
boundaries.
Tree::compact() runs blob-local compaction on the shared side,
skips clean stale candidates after a shared-latch header check,
and both manual compact plus the background checkpointer's merge
pass enter the exclusive side only around one parent
merge/delete window at a time.
Online maintenance is now candidate-driven. Deletes and
leaf-slot churn enqueue blob-local compaction candidates;
spillovers enqueue parent-merge candidates. Tree::compact()
cold-seeds the queues only when no hints exist, then drains a
bounded batch instead of sweeping every blob on every call.
Background auto-merge drains the same queued parent candidates,
and too-large parent candidates are consumed until fresh shape
debt requeues them, so idle checkpoint rounds avoid a large-tree
merge scan.
Point reads (get) also take the shared
maintenance gate so a merge cannot delete a child after a reader
observes the parent BlobNode. Blob-local reads still use
per-blob optimistic validation; ordinary readers and writers
remain mutually concurrent.
Tree::compact() is no longer documented as quiescent-only.
It is safe against active point reads and foreground writers;
range iterators remain best-effort snapshots because they keep a
raw (blob_guid, slot) stack across calls.
Tree::stats() and holt::metrics::render_prometheus now
expose walker/shape counters: mutation walker ops, total and
average blob hops, max blob hops, max cross-blob boundary depth,
foreground spillovers, and child-blob merges.
Cross-blob put no longer takes the root's exclusive latch just
to read a BlobNode. The root is held shared while the child
write latch is acquired, then the mutation proceeds from the
child down. Cross-blob updates also return a precise
root_dirty bit so child-only writes do not mark the root dirty
or take the dirty-map mutex for an unchanged blob.
Checkpoint dirty snapshots now move drained blobs into an
in-flight flushing protection set until write_through
completes. Eviction skips both live dirty and flushing entries,
closing the pressure-window where a background sweep could drop
the only cached image after snapshot_dirty() drained the dirty
map but before the checkpoint planner copied the bytes.
Dirty / flushing / pending-delete bookkeeping is sharded by
BlobGuid (64 shards). mark_dirty and mark_for_delete now
take one per-guid shard lock instead of one global dirty mutex,
which removes the next persistent-write contention point after
CommitGate.
Fresh spillover blobs keep a local Arc pin alive until their
dirty entry is published. This closes the complementary I1
window where a background eviction sweep could see a just-created
child blob as clean, remove its cache image, and leave checkpoint
with a dirty entry but no bytes to flush, without introducing a
mutation-shard / cache-shard lock-order inversion.

Performance / Correctness — journal group commit

Persistent trees now own a dedicated Journal worker instead of
sharing Arc<Mutex<WalWriter>> directly.
Foreground writers encode a complete WAL record into owned bytes,
enter the writer-shared CommitGate only for walker mutation +
dirty publish + journal submission, then wait for the journal
acknowledgement outside that gate.
wal_sync_on_commit=true writers are batched by a short group
window; the journal worker appends every queued record and calls
sync_data once for all durable waiters in the batch.
Manual and background checkpoint rounds use the same
CommitGate on its exclusive side while draining dirty/pending
sets, flushing the journal, and cloning snapshotted bytes. This
prevents checkpoint I/O from copying bytes from a writer whose
WAL record was not in the durable snapshot without serialising
ordinary writers against each other.
Tree::stats() / Prometheus metrics expose journal appends,
append batches, and sync counts.
Short-key padding now uses the 256-byte inline path without
clearing the full stack buffer on every operation, and tree
sequence allocation uses relaxed atomics because ordering is
provided by WAL/dirty/latch synchronization rather than by the
counter itself.

Breaking — API redesign (split returning from blind)

The v0.2 put / delete returned Option<Vec<u8>> by default,
forcing every caller to pay the read-old-leaf + value-clone cost
even when the prior value wasn't needed. This worked but
contradicted the "metadata hot path" design goal — for a storage
engine, the HashMap-style "give me the old value for free" contract
is anything but free. Aligned with RocksDB / LevelDB / FoundationDB
convention: blind by default, returning by explicit opt-in.

The new surface:

// blind hot paths — no leaf-extent value read
put(&self, k: &[u8], v: &[u8]) -> Result<()>
delete(&self, k: &[u8]) -> Result<bool>

// returning variants — pay the read + clone explicitly
insert(&self, k: &[u8], v: &[u8]) -> Result<Option<Vec<u8>>>
remove(&self, k: &[u8]) -> Result<Option<Vec<u8>>>

Migrating from v0.2.x:

tree.put(k, v).unwrap() → unchanged (returns () now; .unwrap() works the same).
let prev = tree.put(k, v).unwrap(); → let prev = tree.insert(k, v).unwrap();
tree.delete(k).unwrap().is_some() → tree.delete(k).unwrap() (already a bool).
let prev = tree.delete(k).unwrap(); → let prev = tree.remove(k).unwrap();

Breaking — WAL format

TxnOp::Erase.value changed from Vec<u8> (always present) to
Option<Vec<u8>>: Some(prev) on the returning Tree::remove
path, None on the blind Tree::delete path. Wire shape: the
trailing bytes(value) became optional_bytes(value).

File format version bumped 1 → 2. A v0.3 binary opening a
v0.2 WAL fails with Error::ReplaySanityFailed /
"WAL file format version unsupported" rather than mis-decoding.
Upgrade path: run Tree::checkpoint() on the v0.2 tree
before swapping in the v0.3 binary — checkpoint truncates the
WAL to header-only, so the next open writes a v0.3 header.

Performance — walker hot-path optimizations

The walker now threads a wants_prev: bool flag through
insert_at / erase_at and all their arms. Concrete savings on
the blind path:

read_leaf_key_only (new helper): same-key check reads
only the leaf's key bytes, not value. Saves a per-op
value_size-byte clone on every same-key put / delete.
insert_into_prefix + erase_at_prefix borrow-only
descent: Prefix is Copy so let p = read_prefix(...)
is an owned stack value; the inline prefix bytes can be held
via &p.bytes[..plen] across the subsequent frame.*
mutations without the previous .to_vec() allocation. Hot on
path-shaped workloads (objstore / fs) where prefix chains are
long.
WAL Insert.prev_value encoded as None on blind put;
WAL Erase.value encoded as None on blind delete. Both
skip the Vec clone + bytes copy that the v0.2.x always-encoded
path paid.

Linux v0.3 release run after the cross-blob latch-coupling and
BlobNode format break:

kv put @ 2 M: 1 866 ns (vs RocksDB 2 001 ns, SQLite 2 336 ns).
objstore put @ 2 M: 1 707 ns (vs RocksDB 1 994 ns, SQLite 2 222 ns).
fs put @ 2 M: 1 796 ns (vs RocksDB 1 969 ns, SQLite 2 199 ns).

At 2 M vs RocksDB: kv is 1.07× ahead, objstore is 1.17×
ahead, and fs is 1.10× ahead. Full table in
benches/RESULTS.md. Point writes are now
competitive at large scale; the release headline remains
metadata-native mixes and delimiter directory rollup.

Changed — internal types

EraseOutcome and the walker-internal EraseReturn gain a
mutated: bool field separate from previous: Option<Vec<u8>>.
mutated is the authoritative "did the walker tombstone a
leaf" signal regardless of whether the caller asked for the
prior value; previously this was inferred from
previous.is_some(), which conflated "no mutation" with
"blind erase".

Internal

BufferManager and other crate-private types unchanged in
shape; only the walker entry-point signatures and WAL codec
changed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.3.0

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Breaking — WAL format v3 (drops dead audit fields)

Breaking — public lookup surface stays owned

Performance — `txn()` batch bypasses `TxnOp` enum

Performance — rename's wasted `wants_prev`

Internal — codec cleanup

Breaking — BlobNode format + cross-blob latch coupling

Performance / Correctness — online maintenance gate + shape counters

Performance / Correctness — journal group commit

Breaking — API redesign (split returning from blind)

Breaking — WAL format

Performance — walker hot-path optimizations

Changed — internal types

Internal

Uh oh!

v0.3.0

Breaking — WAL format v3 (drops dead audit fields)

Breaking — public lookup surface stays owned

Performance — txn() batch bypasses TxnOp enum

Performance — rename's wasted wants_prev

Internal — codec cleanup

Breaking — BlobNode format + cross-blob latch coupling

Performance / Correctness — online maintenance gate + shape counters

Performance / Correctness — journal group commit

Breaking — API redesign (split returning from blind)

Breaking — WAL format

Performance — walker hot-path optimizations

Changed — internal types

Internal

Uh oh!

Performance — `txn()` batch bypasses `TxnOp` enum

Performance — rename's wasted `wants_prev`