Skip to content

v0.3.0

Choose a tag to compare

@feichai0017 feichai0017 released this 21 May 06:26
· 200 commits to main since this release

The v0.3 milestone ships the API split, walker hot-path
optimizations, WAL format cleanup, batch-WAL encoding, recursive
cross-blob latch coupling, journal group commit, candidate-driven
online maintenance, Linux fixed-buffer io_uring checkpoint I/O,
and release benchmark coverage focused on metadata workloads.

The three breaking-but-surgical wins below land first; the
extreme metadata-engine performance track builds on them.

Breaking — WAL format v3 (drops dead audit fields)

TxnOp::Insert.prev_value and TxnOp::Erase.value are gone.
Both were documented as "for replay reversibility" but replay is
an idempotent forward redo that only consumes (key, value) for
Insert and key for Erase — the prior-value slots were dead
weight on every returning Tree::insert / Tree::remove (the
blind variants already wrote None).

File format version bumped 2 → 3. A binary built from the
earlier v0.3 draft opening a v3 WAL fails with
Error::ReplaySanityFailed /
"WAL file format version unsupported" rather than mis-parsing
the absent optional-bytes slot as a length prefix. Upgrade
path for local data written by the earlier v0.3 draft:
Tree::checkpoint() before swapping in the new binary

checkpoint truncates the WAL to header-only, so the next open
writes a v0.3 (= v3-format) header.

Concretely:

  • WAL Insert record body shrinks by one optional_bytes slot
    (u8 presence tag + u32 length + prev_value.len() bytes
    on the returning path).
  • WAL Erase record body shrinks by one optional_bytes slot
    (same shape as Insert).
  • Returning Tree::insert / Tree::remove no longer clone or
    serialise the prior value into the WAL buffer; the caller still
    receives it through the return path (walker leaf_extent read,
    same as the earlier v0.3 API split).

Breaking — public lookup surface stays owned

The draft Tree::get_with closure API was removed before v0.3.
The engine still uses an internal zero-copy lookup walker
for existence probes and rename preflight, but the public API stays
small:

pub fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>>;

Reason: a closure borrowing live cache bytes is a lifetime and
contention contract, not just a convenience method. For the
metadata-engine surface we want to stabilize now, owned Vec<u8>
lookups plus range iteration are the right external boundary.

Performance — txn() batch bypasses TxnOp enum

Tree::txn (the batched-multi-op API) now uses a streaming
BatchEncoder that writes inner-op bytes directly from &[u8]
refs into the WAL pending buffer. The previous v0.3 draft path
constructed TxnOp::Insert { key: Vec<u8>, value: Vec<u8>, .. }
(and similarly for Erase / RenameObject) per inner op, then
encoded the enum-of-Vecs — two extra clones per op the WAL never
needed.

Mid-batch error handling is preserved: if a walker call returns
Err partway through, the encoder's Drop rolls the partial
record bytes off the WAL pending buffer (truncate to where
begin started). On Ok, the encoder backpatches the inner
count + body length and appends the CRC — byte-identical to
what the old encode_record(&TxnOp::Batch { .. }) path would
have produced (verified by batch_encoder_wire_matches_encode_record
in src/journal/codec.rs::tests).

Performance — rename's wasted wants_prev

Tree::rename's two walker calls (erase_multi on src,
insert_multi on dst) were passing wants_prev=true even
though the rename already read the src value in the pre-walker
lookup and the dst-existence check (or force=true) gates the
insert side. The walker-materialised previous values were
dropped on the floor. Flipped to wants_prev=false on both;
same change applied inside the batch-path rename arm of
apply_batch_walker_inline.

Internal — codec cleanup

  • Dropped write_optional_bytes / read_optional_bytes helpers
    (no callers after Insert/Erase shed their optional slots).
  • WalWriter::append (the generic &TxnOp entry) is now
    test-only. Foreground mutation paths encode owned WAL records
    with the per-variant codec helpers and hand those bytes to the
    journal worker.
  • apply_put_inner / apply_delete_inner / apply_rename_inner
    Tree helpers folded into the new apply_batch_walker_inline.

Breaking — BlobNode format + cross-blob latch coupling

Implements the first real concurrency cut for split metadata
trees and makes the on-disk BlobNode contract smaller. put /
insert and delete / remove no longer hold a root or
ancestor blob's exclusive latch while mutating a descendant blob
on the key path. The walker reads the BlobNode, pins the child
blob, releases the parent, and repeats that handoff recursively
for every deeper BlobNode.

The correctness boundary also changes: parent
BlobNode.child_entry_ptr is removed. The child blob's own
header.root_slot is the only cross-blob entry. Child-local
splits, collapses, and post-compact root-slot changes therefore
do not require re-acquiring and rewriting the parent BlobNode.

Concretely:

  • BlobNode stays 128 bytes, but the removed child-entry field
    is reclaimed for inline prefix payload. BLOB_MAX_INLINE
    grows from 96 to 104 bytes.
  • This is an on-disk format break for multi-blob trees written by
    earlier v0.3 draft binaries. Rebuild from checkpointed data or
    export/import before moving existing draft data onto this
    layout.
  • BlobWriteGuard::frame() remains the writer-side convenience
    for in-place mutation, but it now returns a plain
    BlobFrame::wrap(...); there is no per-slot version sidecar.
  • walker::insert::insert_multi first tries the recursive
    lock-coupled path. If the root blob is full, spillover_blob
    • compact_blob run as before, then the walker re-checks the
      key path; if the target subtree moved behind a new BlobNode,
      the retry immediately hands off to the child instead of holding
      the root latch across child mutation.
  • walker::erase::erase_multi mirrors the same recursive
    handoff. The lock-coupled delete path leaves an emptied child
    blob reachable with an EmptyRoot sentinel; structural pruning
    remains a maintenance concern rather than a foreground delete
    latch chain.
  • walker::lookup, range, and merge paths follow
    child.header.root_slot when crossing blobs, so parent
    BlobNodes never need child-entry repair after child compaction.
  • The old recursive insert_at_blob_node /
    erase_at_blob_node parent-held fallback arms are removed.

Performance / Correctness — online maintenance gate + shape counters

The remaining v0.3 concurrency cleanup is now in place:

  • Foreground mutation paths (put / insert / delete /
    remove / rename / txn) enter the shared side of a narrow
    atomic maintenance_gate while they may cross BlobNode
    boundaries.
    Tree::compact() runs blob-local compaction on the shared side,
    skips clean stale candidates after a shared-latch header check,
    and both manual compact plus the background checkpointer's merge
    pass enter the exclusive side only around one parent
    merge/delete window at a time.
  • Online maintenance is now candidate-driven. Deletes and
    leaf-slot churn enqueue blob-local compaction candidates;
    spillovers enqueue parent-merge candidates. Tree::compact()
    cold-seeds the queues only when no hints exist, then drains a
    bounded batch instead of sweeping every blob on every call.
    Background auto-merge drains the same queued parent candidates,
    and too-large parent candidates are consumed until fresh shape
    debt requeues them, so idle checkpoint rounds avoid a large-tree
    merge scan.
  • Point reads (get) also take the shared
    maintenance gate so a merge cannot delete a child after a reader
    observes the parent BlobNode. Blob-local reads still use
    per-blob optimistic validation; ordinary readers and writers
    remain mutually concurrent.
  • Tree::compact() is no longer documented as quiescent-only.
    It is safe against active point reads and foreground writers;
    range iterators remain best-effort snapshots because they keep a
    raw (blob_guid, slot) stack across calls.
  • Tree::stats() and holt::metrics::render_prometheus now
    expose walker/shape counters: mutation walker ops, total and
    average blob hops, max blob hops, max cross-blob boundary depth,
    foreground spillovers, and child-blob merges.
  • Cross-blob put no longer takes the root's exclusive latch just
    to read a BlobNode. The root is held shared while the child
    write latch is acquired, then the mutation proceeds from the
    child down. Cross-blob updates also return a precise
    root_dirty bit so child-only writes do not mark the root dirty
    or take the dirty-map mutex for an unchanged blob.
  • Checkpoint dirty snapshots now move drained blobs into an
    in-flight flushing protection set until write_through
    completes. Eviction skips both live dirty and flushing entries,
    closing the pressure-window where a background sweep could drop
    the only cached image after snapshot_dirty() drained the dirty
    map but before the checkpoint planner copied the bytes.
  • Dirty / flushing / pending-delete bookkeeping is sharded by
    BlobGuid (64 shards). mark_dirty and mark_for_delete now
    take one per-guid shard lock instead of one global dirty mutex,
    which removes the next persistent-write contention point after
    CommitGate.
  • Fresh spillover blobs keep a local Arc pin alive until their
    dirty entry is published. This closes the complementary I1
    window where a background eviction sweep could see a just-created
    child blob as clean, remove its cache image, and leave checkpoint
    with a dirty entry but no bytes to flush, without introducing a
    mutation-shard / cache-shard lock-order inversion.

Performance / Correctness — journal group commit

  • Persistent trees now own a dedicated Journal worker instead of
    sharing Arc<Mutex<WalWriter>> directly.
  • Foreground writers encode a complete WAL record into owned bytes,
    enter the writer-shared CommitGate only for walker mutation +
    dirty publish + journal submission, then wait for the journal
    acknowledgement outside that gate.
  • wal_sync_on_commit=true writers are batched by a short group
    window; the journal worker appends every queued record and calls
    sync_data once for all durable waiters in the batch.
  • Manual and background checkpoint rounds use the same
    CommitGate on its exclusive side while draining dirty/pending
    sets, flushing the journal, and cloning snapshotted bytes. This
    prevents checkpoint I/O from copying bytes from a writer whose
    WAL record was not in the durable snapshot without serialising
    ordinary writers against each other.
  • Tree::stats() / Prometheus metrics expose journal appends,
    append batches, and sync counts.
  • Short-key padding now uses the 256-byte inline path without
    clearing the full stack buffer on every operation, and tree
    sequence allocation uses relaxed atomics because ordering is
    provided by WAL/dirty/latch synchronization rather than by the
    counter itself.

Breaking — API redesign (split returning from blind)

The v0.2 put / delete returned Option<Vec<u8>> by default,
forcing every caller to pay the read-old-leaf + value-clone cost
even when the prior value wasn't needed. This worked but
contradicted the "metadata hot path" design goal — for a storage
engine, the HashMap-style "give me the old value for free" contract
is anything but free. Aligned with RocksDB / LevelDB / FoundationDB
convention: blind by default, returning by explicit opt-in.

The new surface:

// blind hot paths — no leaf-extent value read
put(&self, k: &[u8], v: &[u8]) -> Result<()>
delete(&self, k: &[u8]) -> Result<bool>

// returning variants — pay the read + clone explicitly
insert(&self, k: &[u8], v: &[u8]) -> Result<Option<Vec<u8>>>
remove(&self, k: &[u8]) -> Result<Option<Vec<u8>>>

Migrating from v0.2.x:

  • tree.put(k, v).unwrap() → unchanged (returns () now; .unwrap() works the same).
  • let prev = tree.put(k, v).unwrap();let prev = tree.insert(k, v).unwrap();
  • tree.delete(k).unwrap().is_some()tree.delete(k).unwrap() (already a bool).
  • let prev = tree.delete(k).unwrap();let prev = tree.remove(k).unwrap();

Breaking — WAL format

TxnOp::Erase.value changed from Vec<u8> (always present) to
Option<Vec<u8>>: Some(prev) on the returning Tree::remove
path, None on the blind Tree::delete path. Wire shape: the
trailing bytes(value) became optional_bytes(value).

File format version bumped 1 → 2. A v0.3 binary opening a
v0.2 WAL fails with Error::ReplaySanityFailed /
"WAL file format version unsupported" rather than mis-decoding.
Upgrade path: run Tree::checkpoint() on the v0.2 tree
before swapping in the v0.3 binary
— checkpoint truncates the
WAL to header-only, so the next open writes a v0.3 header.

Performance — walker hot-path optimizations

The walker now threads a wants_prev: bool flag through
insert_at / erase_at and all their arms. Concrete savings on
the blind path:

  • read_leaf_key_only (new helper): same-key check reads
    only the leaf's key bytes, not value. Saves a per-op
    value_size-byte clone on every same-key put / delete.
  • insert_into_prefix + erase_at_prefix borrow-only
    descent
    : Prefix is Copy so let p = read_prefix(...)
    is an owned stack value; the inline prefix bytes can be held
    via &p.bytes[..plen] across the subsequent frame.*
    mutations without the previous .to_vec() allocation. Hot on
    path-shaped workloads (objstore / fs) where prefix chains are
    long.
  • WAL Insert.prev_value encoded as None on blind put;
    WAL Erase.value encoded as None on blind delete. Both
    skip the Vec clone + bytes copy that the v0.2.x always-encoded
    path paid.

Linux v0.3 release run after the cross-blob latch-coupling and
BlobNode format break:

  • kv put @ 2 M: 1 866 ns (vs RocksDB 2 001 ns, SQLite 2 336 ns).
  • objstore put @ 2 M: 1 707 ns (vs RocksDB 1 994 ns, SQLite 2 222 ns).
  • fs put @ 2 M: 1 796 ns (vs RocksDB 1 969 ns, SQLite 2 199 ns).

At 2 M vs RocksDB: kv is 1.07× ahead, objstore is 1.17×
ahead, and fs is 1.10× ahead. Full table in
benches/RESULTS.md. Point writes are now
competitive at large scale; the release headline remains
metadata-native mixes and delimiter directory rollup.

Changed — internal types

  • EraseOutcome and the walker-internal EraseReturn gain a
    mutated: bool field separate from previous: Option<Vec<u8>>.
    mutated is the authoritative "did the walker tombstone a
    leaf" signal regardless of whether the caller asked for the
    prior value; previously this was inferred from
    previous.is_some(), which conflated "no mutation" with
    "blind erase".

Internal

  • BufferManager and other crate-private types unchanged in
    shape; only the walker entry-point signatures and WAL codec
    changed.