v0.3.0
The v0.3 milestone ships the API split, walker hot-path
optimizations, WAL format cleanup, batch-WAL encoding, recursive
cross-blob latch coupling, journal group commit, candidate-driven
online maintenance, Linux fixed-buffer io_uring checkpoint I/O,
and release benchmark coverage focused on metadata workloads.
The three breaking-but-surgical wins below land first; the
extreme metadata-engine performance track builds on them.
Breaking — WAL format v3 (drops dead audit fields)
TxnOp::Insert.prev_value and TxnOp::Erase.value are gone.
Both were documented as "for replay reversibility" but replay is
an idempotent forward redo that only consumes (key, value) for
Insert and key for Erase — the prior-value slots were dead
weight on every returning Tree::insert / Tree::remove (the
blind variants already wrote None).
File format version bumped 2 → 3. A binary built from the
earlier v0.3 draft opening a v3 WAL fails with
Error::ReplaySanityFailed /
"WAL file format version unsupported" rather than mis-parsing
the absent optional-bytes slot as a length prefix. Upgrade
path for local data written by the earlier v0.3 draft:
Tree::checkpoint() before swapping in the new binary —
checkpoint truncates the WAL to header-only, so the next open
writes a v0.3 (= v3-format) header.
Concretely:
- WAL
Insertrecord body shrinks by oneoptional_bytesslot
(u8presence tag +u32length +prev_value.len()bytes
on the returning path). - WAL
Eraserecord body shrinks by oneoptional_bytesslot
(same shape as Insert). - Returning
Tree::insert/Tree::removeno longer clone or
serialise the prior value into the WAL buffer; the caller still
receives it through the return path (walkerleaf_extentread,
same as the earlier v0.3 API split).
Breaking — public lookup surface stays owned
The draft Tree::get_with closure API was removed before v0.3.
The engine still uses an internal zero-copy lookup walker
for existence probes and rename preflight, but the public API stays
small:
pub fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>>;Reason: a closure borrowing live cache bytes is a lifetime and
contention contract, not just a convenience method. For the
metadata-engine surface we want to stabilize now, owned Vec<u8>
lookups plus range iteration are the right external boundary.
Performance — txn() batch bypasses TxnOp enum
Tree::txn (the batched-multi-op API) now uses a streaming
BatchEncoder that writes inner-op bytes directly from &[u8]
refs into the WAL pending buffer. The previous v0.3 draft path
constructed TxnOp::Insert { key: Vec<u8>, value: Vec<u8>, .. }
(and similarly for Erase / RenameObject) per inner op, then
encoded the enum-of-Vecs — two extra clones per op the WAL never
needed.
Mid-batch error handling is preserved: if a walker call returns
Err partway through, the encoder's Drop rolls the partial
record bytes off the WAL pending buffer (truncate to where
begin started). On Ok, the encoder backpatches the inner
count + body length and appends the CRC — byte-identical to
what the old encode_record(&TxnOp::Batch { .. }) path would
have produced (verified by batch_encoder_wire_matches_encode_record
in src/journal/codec.rs::tests).
Performance — rename's wasted wants_prev
Tree::rename's two walker calls (erase_multi on src,
insert_multi on dst) were passing wants_prev=true even
though the rename already read the src value in the pre-walker
lookup and the dst-existence check (or force=true) gates the
insert side. The walker-materialised previous values were
dropped on the floor. Flipped to wants_prev=false on both;
same change applied inside the batch-path rename arm of
apply_batch_walker_inline.
Internal — codec cleanup
- Dropped
write_optional_bytes/read_optional_byteshelpers
(no callers after Insert/Erase shed their optional slots). WalWriter::append(the generic&TxnOpentry) is now
test-only. Foreground mutation paths encode owned WAL records
with the per-variant codec helpers and hand those bytes to the
journal worker.apply_put_inner/apply_delete_inner/apply_rename_inner
Tree helpers folded into the newapply_batch_walker_inline.
Breaking — BlobNode format + cross-blob latch coupling
Implements the first real concurrency cut for split metadata
trees and makes the on-disk BlobNode contract smaller. put /
insert and delete / remove no longer hold a root or
ancestor blob's exclusive latch while mutating a descendant blob
on the key path. The walker reads the BlobNode, pins the child
blob, releases the parent, and repeats that handoff recursively
for every deeper BlobNode.
The correctness boundary also changes: parent
BlobNode.child_entry_ptr is removed. The child blob's own
header.root_slot is the only cross-blob entry. Child-local
splits, collapses, and post-compact root-slot changes therefore
do not require re-acquiring and rewriting the parent BlobNode.
Concretely:
BlobNodestays 128 bytes, but the removed child-entry field
is reclaimed for inline prefix payload.BLOB_MAX_INLINE
grows from 96 to 104 bytes.- This is an on-disk format break for multi-blob trees written by
earlier v0.3 draft binaries. Rebuild from checkpointed data or
export/import before moving existing draft data onto this
layout. BlobWriteGuard::frame()remains the writer-side convenience
for in-place mutation, but it now returns a plain
BlobFrame::wrap(...); there is no per-slot version sidecar.walker::insert::insert_multifirst tries the recursive
lock-coupled path. If the root blob is full,spillover_blobcompact_blobrun as before, then the walker re-checks the
key path; if the target subtree moved behind a newBlobNode,
the retry immediately hands off to the child instead of holding
the root latch across child mutation.
walker::erase::erase_multimirrors the same recursive
handoff. The lock-coupled delete path leaves an emptied child
blob reachable with anEmptyRootsentinel; structural pruning
remains a maintenance concern rather than a foreground delete
latch chain.walker::lookup,range, and merge paths follow
child.header.root_slotwhen crossing blobs, so parent
BlobNodes never need child-entry repair after child compaction.- The old recursive
insert_at_blob_node/
erase_at_blob_nodeparent-held fallback arms are removed.
Performance / Correctness — online maintenance gate + shape counters
The remaining v0.3 concurrency cleanup is now in place:
- Foreground mutation paths (
put/insert/delete/
remove/rename/txn) enter the shared side of a narrow
atomicmaintenance_gatewhile they may crossBlobNode
boundaries.
Tree::compact()runs blob-local compaction on the shared side,
skips clean stale candidates after a shared-latch header check,
and both manual compact plus the background checkpointer's merge
pass enter the exclusive side only around one parent
merge/delete window at a time. - Online maintenance is now candidate-driven. Deletes and
leaf-slot churn enqueue blob-local compaction candidates;
spillovers enqueue parent-merge candidates.Tree::compact()
cold-seeds the queues only when no hints exist, then drains a
bounded batch instead of sweeping every blob on every call.
Background auto-merge drains the same queued parent candidates,
and too-large parent candidates are consumed until fresh shape
debt requeues them, so idle checkpoint rounds avoid a large-tree
merge scan. - Point reads (
get) also take the shared
maintenance gate so a merge cannot delete a child after a reader
observes the parentBlobNode. Blob-local reads still use
per-blob optimistic validation; ordinary readers and writers
remain mutually concurrent. Tree::compact()is no longer documented as quiescent-only.
It is safe against active point reads and foreground writers;
range iterators remain best-effort snapshots because they keep a
raw(blob_guid, slot)stack across calls.Tree::stats()andholt::metrics::render_prometheusnow
expose walker/shape counters: mutation walker ops, total and
average blob hops, max blob hops, max cross-blob boundary depth,
foreground spillovers, and child-blob merges.- Cross-blob
putno longer takes the root's exclusive latch just
to read aBlobNode. The root is held shared while the child
write latch is acquired, then the mutation proceeds from the
child down. Cross-blob updates also return a precise
root_dirtybit so child-only writes do not mark the root dirty
or take the dirty-map mutex for an unchanged blob. - Checkpoint dirty snapshots now move drained blobs into an
in-flightflushingprotection set untilwrite_through
completes. Eviction skips both live dirty and flushing entries,
closing the pressure-window where a background sweep could drop
the only cached image aftersnapshot_dirty()drained the dirty
map but before the checkpoint planner copied the bytes. - Dirty / flushing / pending-delete bookkeeping is sharded by
BlobGuid(64 shards).mark_dirtyandmark_for_deletenow
take one per-guid shard lock instead of one global dirty mutex,
which removes the next persistent-write contention point after
CommitGate. - Fresh spillover blobs keep a local
Arcpin alive until their
dirty entry is published. This closes the complementary I1
window where a background eviction sweep could see a just-created
child blob as clean, remove its cache image, and leave checkpoint
with a dirty entry but no bytes to flush, without introducing a
mutation-shard / cache-shard lock-order inversion.
Performance / Correctness — journal group commit
- Persistent trees now own a dedicated
Journalworker instead of
sharingArc<Mutex<WalWriter>>directly. - Foreground writers encode a complete WAL record into owned bytes,
enter the writer-sharedCommitGateonly for walker mutation +
dirty publish + journal submission, then wait for the journal
acknowledgement outside that gate. wal_sync_on_commit=truewriters are batched by a short group
window; the journal worker appends every queued record and calls
sync_dataonce for all durable waiters in the batch.- Manual and background checkpoint rounds use the same
CommitGateon its exclusive side while draining dirty/pending
sets, flushing the journal, and cloning snapshotted bytes. This
prevents checkpoint I/O from copying bytes from a writer whose
WAL record was not in the durable snapshot without serialising
ordinary writers against each other. Tree::stats()/ Prometheus metrics expose journal appends,
append batches, and sync counts.- Short-key padding now uses the 256-byte inline path without
clearing the full stack buffer on every operation, and tree
sequence allocation uses relaxed atomics because ordering is
provided by WAL/dirty/latch synchronization rather than by the
counter itself.
Breaking — API redesign (split returning from blind)
The v0.2 put / delete returned Option<Vec<u8>> by default,
forcing every caller to pay the read-old-leaf + value-clone cost
even when the prior value wasn't needed. This worked but
contradicted the "metadata hot path" design goal — for a storage
engine, the HashMap-style "give me the old value for free" contract
is anything but free. Aligned with RocksDB / LevelDB / FoundationDB
convention: blind by default, returning by explicit opt-in.
The new surface:
// blind hot paths — no leaf-extent value read
put(&self, k: &[u8], v: &[u8]) -> Result<()>
delete(&self, k: &[u8]) -> Result<bool>
// returning variants — pay the read + clone explicitly
insert(&self, k: &[u8], v: &[u8]) -> Result<Option<Vec<u8>>>
remove(&self, k: &[u8]) -> Result<Option<Vec<u8>>>Migrating from v0.2.x:
tree.put(k, v).unwrap()→ unchanged (returns()now;.unwrap()works the same).let prev = tree.put(k, v).unwrap();→let prev = tree.insert(k, v).unwrap();tree.delete(k).unwrap().is_some()→tree.delete(k).unwrap()(already abool).let prev = tree.delete(k).unwrap();→let prev = tree.remove(k).unwrap();
Breaking — WAL format
TxnOp::Erase.value changed from Vec<u8> (always present) to
Option<Vec<u8>>: Some(prev) on the returning Tree::remove
path, None on the blind Tree::delete path. Wire shape: the
trailing bytes(value) became optional_bytes(value).
File format version bumped 1 → 2. A v0.3 binary opening a
v0.2 WAL fails with Error::ReplaySanityFailed /
"WAL file format version unsupported" rather than mis-decoding.
Upgrade path: run Tree::checkpoint() on the v0.2 tree
before swapping in the v0.3 binary — checkpoint truncates the
WAL to header-only, so the next open writes a v0.3 header.
Performance — walker hot-path optimizations
The walker now threads a wants_prev: bool flag through
insert_at / erase_at and all their arms. Concrete savings on
the blind path:
read_leaf_key_only(new helper): same-key check reads
only the leaf's key bytes, not value. Saves a per-op
value_size-byte clone on every same-keyput/delete.insert_into_prefix+erase_at_prefixborrow-only
descent:PrefixisCopysolet p = read_prefix(...)
is an owned stack value; the inline prefix bytes can be held
via&p.bytes[..plen]across the subsequentframe.*
mutations without the previous.to_vec()allocation. Hot on
path-shaped workloads (objstore / fs) where prefix chains are
long.- WAL
Insert.prev_valueencoded asNoneon blind put;
WALErase.valueencoded asNoneon blind delete. Both
skip theVecclone + bytes copy that the v0.2.x always-encoded
path paid.
Linux v0.3 release run after the cross-blob latch-coupling and
BlobNode format break:
- kv put @ 2 M: 1 866 ns (vs RocksDB 2 001 ns, SQLite 2 336 ns).
- objstore put @ 2 M: 1 707 ns (vs RocksDB 1 994 ns, SQLite 2 222 ns).
- fs put @ 2 M: 1 796 ns (vs RocksDB 1 969 ns, SQLite 2 199 ns).
At 2 M vs RocksDB: kv is 1.07× ahead, objstore is 1.17×
ahead, and fs is 1.10× ahead. Full table in
benches/RESULTS.md. Point writes are now
competitive at large scale; the release headline remains
metadata-native mixes and delimiter directory rollup.
Changed — internal types
EraseOutcomeand the walker-internalEraseReturngain a
mutated: boolfield separate fromprevious: Option<Vec<u8>>.
mutatedis the authoritative "did the walker tombstone a
leaf" signal regardless of whether the caller asked for the
prior value; previously this was inferred from
previous.is_some(), which conflated "no mutation" with
"blind erase".
Internal
BufferManagerand other crate-private types unchanged in
shape; only the walker entry-point signatures and WAL codec
changed.