v0.2.0

@feichai0017 feichai0017 released this 19 May 17:23

Breaking

  • Public API surface closure. holt::layout, holt::journal,
    holt::store are now pub(crate). The supported holt::*
    surface is Tree, TreeBuilder, TreeConfig, Storage,
    Error, Result, RangeBuilder, RangeEntry, RangeIter,
    BlobStats, TreeStats, CheckpointerStats, TxnBatch,
    CheckpointConfig, Backend, MemoryBackend,
    PersistentBackend, AlignedBlobBuf, BlobGuid. The
    metrics::render_prometheus renderer is part of the
    metrics-feature surface.
  • pub use holt::BufferManager removed; BufferManager is
    internal.
  • BlobGuid now re-exported at the crate root for custom
    Backend implementations.
  • RangeBuilder::new is pub(crate) — use Tree::range() /
    Tree::scan_prefix().
  • TreeConfig::checkpoint_byte_interval field +
    TreeBuilder::checkpoint_byte_interval method removed. The
    field was reserved and never read.
  • AllocOutcome shrunk to { slot }; ExtentAllocOutcome
    shrunk to { byte_offset }. The other fields were dead.
  • encode_record returns () instead of Result<()> — no
    fallible step.
  • BufferManager::capacity() / clear() removed. Dead code.
  • TreeConfig::flush_on_write renamed to
    memory_flush_on_write. The field had no effect on
    persistent trees; the v0.1 name suggested per-write fsync,
    which it never was.
  • Error::NodeCorrupt is a struct variant with optional
    blob_guid + slot fields. Construct via
    Error::node_corrupt(ctx) and enrich via .with_blob_guid(g)
    / .with_slot(s). Existing pattern matches must skip the new
    fields with a rest pattern (NodeCorrupt { context, .. }).
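
A minimal sketch of how calling code might adapt to the struct-variant form of Error::NodeCorrupt. The enum below is a local stand-in written for illustration: the field types (a &'static str context, a 16-byte BlobGuid, a u32 slot) and the describe helper are assumptions, not holt's actual definitions.

```rust
// Stand-in types for illustration only; holt's real definitions may differ.
type BlobGuid = [u8; 16]; // assumed representation

#[derive(Debug)]
#[non_exhaustive]
enum Error {
    NodeCorrupt {
        context: &'static str,
        blob_guid: Option<BlobGuid>,
        slot: Option<u32>,
    },
    Internal(&'static str),
}

impl Error {
    /// Build the variant with only the context set.
    fn node_corrupt(context: &'static str) -> Self {
        Error::NodeCorrupt { context, blob_guid: None, slot: None }
    }

    /// Attach a blob GUID; other variants pass through unchanged.
    fn with_blob_guid(self, g: BlobGuid) -> Self {
        match self {
            Error::NodeCorrupt { context, slot, .. } => {
                Error::NodeCorrupt { context, blob_guid: Some(g), slot }
            }
            other => other,
        }
    }

    /// Attach a slot index; other variants pass through unchanged.
    fn with_slot(self, s: u32) -> Self {
        match self {
            Error::NodeCorrupt { context, blob_guid, .. } => {
                Error::NodeCorrupt { context, blob_guid, slot: Some(s) }
            }
            other => other,
        }
    }
}

/// Hypothetical caller: the `..` rest pattern ignores the optional fields,
/// so this match keeps compiling if more fields are added later.
fn describe(err: &Error) -> String {
    match err {
        Error::NodeCorrupt { context, .. } => format!("corrupt node: {}", context),
        Error::Internal(msg) => format!("internal: {}", msg),
    }
}
```

Code that matched the v0.1 form of the variant would need the same `{ context, .. }` shape; callers that want the extra detail can bind blob_guid and slot explicitly instead of using the rest pattern.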

Fixed — durability (W2D-strict)

  • Checkpoint error paths no longer drop drained state. Manual
    Tree::checkpoint and the background round now restore every
    snapshot they drained on every error return — WAL flush
    failure, I/O worker channel-closed, and pre-delete Sync
    failure paths previously left dirty / pending partially
    drained, allowing the next round to truncate the WAL with cache
    state still pending. See ARCHITECTURE.md §6 for the seven-phase
    protocol.
  • Abort-on-dirty-failure gate before pending-delete. A failed
    parent write_through no longer propagates to the dependent
    child's manifest delete (which would have left the on-disk
    parent referencing a slot the manifest no longer had). The pre-
    delete sync still runs to fsync the writes that did succeed;
    the pending set is restored and the next round retries the
    parent + child together.
  • Writer ↔ background-checkpoint W2D race. Pending-delete
    snapshot now drains inside the same wal.lock critical section
    as snapshot_dirty + wal.flush, closing the inversion window
    where a writer could land a fresh blob between the two drains.
  • scan.rs::refresh_blob_node_pointers inline bm.commit
    replaced with bm.mark_dirty(parent_guid, STRUCTURAL_SEQ) so
    the post-compact pointer repair stages through the unified
    dirty-set protocol instead of pushing cache state straight to
    backend.
  • Tree::compact is now documented as NOT online-safe: running
    it concurrently with reads or writes can produce torn reads
    across BlobNode crossings. The v0.3 maintenance latch will
    lift this.
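
The restore-on-error rule from the first bullet can be sketched as below. This is a simplified illustration, not holt's checkpoint code: DirtySet, its u64 keys, and the flush callback are hypothetical, and the real protocol also covers the pending-delete set and runs under wal.lock.

```rust
use std::collections::BTreeSet;
use std::mem;

/// Hypothetical stand-in for the buffer manager's dirty set.
struct DirtySet {
    dirty: BTreeSet<u64>,
}

impl DirtySet {
    /// Drain the dirty set, attempt the flush, and put the snapshot back
    /// if the flush fails so the next round retries the same entries.
    fn checkpoint<F>(&mut self, mut flush: F) -> Result<usize, String>
    where
        F: FnMut(&BTreeSet<u64>) -> Result<(), String>,
    {
        // Take the whole set, leaving an empty one in its place.
        let snapshot = mem::take(&mut self.dirty);
        match flush(&snapshot) {
            Ok(()) => Ok(snapshot.len()),
            Err(e) => {
                // Merge the drained entries back before returning the error.
                self.dirty.extend(snapshot);
                Err(e)
            }
        }
    }
}
```

Without the merge in the error arm, a failed round would leave the set empty and a later round could treat the state as fully persisted, which is the failure mode the first bullet describes.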

Added

  • io-uring feature flag (Linux only). PersistentBackend
    reads/writes route through a per-backend io_uring (depth 8)
    instead of pread/pwrite.
  • tracing feature flag (off by default). Structured
    tracing events on checkpoint round complete, spillover,
    merge, compact, WAL truncate, and eviction sweeps. Zero-
    cost when the feature is off.
  • metrics feature flag (off by default). Renders
    TreeStats into Prometheus text format. Gauges
    (holt_slots, holt_tombstones, holt_compactions) follow
    the convention of dropping the _total suffix.
  • 3-thread background checkpointer — planner + dedicated I/O
    worker + cold-blob eviction sweep, parked between rounds via
    park_timeout(idle_interval). Default disabled; opt in via
    TreeBuilder::checkpoint(CheckpointConfig::default().enabled(true)).
    Drop runs one final synchronous round on the calling thread.
  • Tree::scan_prefix(p) — one-line wrapper for
    tree.range().prefix(p).
  • Tree::stats extended with bm_dirty_count,
    bm_pending_delete_count, bm_cache_hits / bm_cache_misses,
    bm_optimistic_restarts, and an Option<CheckpointerStats>.
  • Silent observability reads: pin_silent /
    get_cached_silent / collect_blob_guids_silent don't bump
    cache counters or refresh the LRU tick, so Tree::stats and
    metrics scrapes don't pollute the counters they report.
  • Error::Internal(&'static str) variant for invariant-
    violation paths (previously Error::NotYetImplemented, now
    reserved for genuine walker-arm feature gaps). Non-breaking
    thanks to Error's #[non_exhaustive] marker.
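
For readers unfamiliar with the Prometheus text format mentioned under the metrics flag, here is a rough sketch of rendering the three named gauges. StatsSnapshot and render_gauges are hypothetical names for this illustration; the signature and full output of holt's metrics::render_prometheus are not shown in these notes.

```rust
use std::fmt::Write;

/// Hypothetical snapshot type; holt's TreeStats has more fields than this.
struct StatsSnapshot {
    slots: u64,
    tombstones: u64,
    compactions: u64,
}

/// Render three gauges in the Prometheus text exposition format:
/// a `# TYPE` line followed by a `name value` sample line per metric.
fn render_gauges(s: &StatsSnapshot) -> String {
    let metrics = [
        ("holt_slots", s.slots),
        ("holt_tombstones", s.tombstones),
        ("holt_compactions", s.compactions),
    ];
    let mut out = String::new();
    for (name, value) in metrics.iter() {
        // Writing into a String cannot fail, so unwrap is safe here.
        writeln!(out, "# TYPE {} gauge", name).unwrap();
        writeln!(out, "{} {}", name, value).unwrap();
    }
    out
}
```

Each metric is declared as a gauge, which matches the names above carrying no _total suffix.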

Changed

  • Sharded BufferManager cache — v0.1's
    Mutex<HashMap<BlobGuid, _>> + VecDeque<BlobGuid> LRU
    replaced by DashMap<BlobGuid, Arc<CachedBlob>> with
    clock_tick / last_touched eviction; concurrent pins on
    different blobs hit different shards instead of contending on
    a single mutex.
  • Cached Tree.root_pin — every get / put / delete
    keeps the root pinned via Arc<CachedBlob> and skips the BM
    hash lookup on the root hop (~300 ns/op on the hot path).
  • RangeIter delimiter fast-forward — after emitting a
    CommonPrefix(C), ascend the descent stack past C's subtree
    instead of scanning every leaf. *_list_dir is now
    O(distinct_rollups).
  • Hardware-accelerated CRC32 via crc32fast — auto-detects
    PCLMULQDQ on x86_64 and ARM-CRC32 on AArch64. Drops per-record
    WAL cost from ~110 ns to ~20 ns on supported hardware.
  • SIMD Node48 / Node256 range-iter scans: vpcmpeqb / NEON
    byte search for Node48::index[256], slot-index scan for
    Node256::children[256]. Worth ~80-120 ns per next() on
    wide branch nodes; matters most for *_list_dir.
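
The delimiter fast-forward idea can be illustrated on an ordinary sorted set. The sketch below uses a BTreeSet as a stand-in for the tree and seeks past each rolled-up prefix instead of visiting every key under it; Entry, prefix_successor, and list_dir are hypothetical names, and holt's RangeIter does this by ascending its descent stack rather than re-seeking.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

#[derive(Debug, PartialEq)]
enum Entry {
    Key(Vec<u8>),
    CommonPrefix(Vec<u8>),
}

/// Smallest byte string that sorts after every string starting with `p`,
/// or None when `p` is empty or consists only of 0xFF bytes.
fn prefix_successor(p: &[u8]) -> Option<Vec<u8>> {
    let mut s = p.to_vec();
    while let Some(last) = s.pop() {
        if last < 0xFF {
            s.push(last + 1);
            return Some(s);
        }
    }
    None
}

/// List keys under `prefix`, rolling up anything past the next `delim`
/// into a single CommonPrefix entry and skipping the rest of that subtree.
fn list_dir(keys: &BTreeSet<Vec<u8>>, prefix: &[u8], delim: u8) -> Vec<Entry> {
    let mut out = Vec::new();
    let mut lower: Bound<Vec<u8>> = Bound::Included(prefix.to_vec());
    loop {
        let next = keys
            .range::<Vec<u8>, _>((lower.clone(), Bound::Unbounded))
            .next();
        let key = match next {
            Some(k) if k.starts_with(prefix) => k,
            _ => break,
        };
        let rest = &key[prefix.len()..];
        match rest.iter().position(|&b| b == delim) {
            Some(i) => {
                // Emit one rollup, then jump past everything sharing it.
                let rollup = key[..prefix.len() + i + 1].to_vec();
                let succ = prefix_successor(&rollup);
                out.push(Entry::CommonPrefix(rollup));
                match succ {
                    Some(s) => lower = Bound::Included(s),
                    None => break,
                }
            }
            None => {
                out.push(Entry::Key(key.clone()));
                lower = Bound::Excluded(key.clone());
            }
        }
    }
    out
}
```

The loop performs one seek per emitted entry, so the work is proportional to the number of distinct rollups and direct keys rather than to the total number of keys beneath them.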

Benchmarks

  • Group B — scale curve across kv / objstore / fs × four
    dataset sizes ({ 20 k, 100 k, 500 k, 2 M }). The 500 k tier
    already exceeds the default 32 MB buffer pool; the 2 M tier
    (~192 MB payload) forces full eviction churn. Get scales
    well on all three workloads: holt wins every cell, and its
    lead over RocksDB widens to 5.4× / 2.8× / 2.2× at 2 M.
    Put wins at 20 k / 100 k / 500 k, ties RocksDB at 2 M kv,
    but loses 8-22 % to RocksDB / SQLite at 2 M on objstore / fs
    — the regime where LSM-style write amortization is the right
    choice and ART-over-blobs isn't competitive; cross-blob lock-
    coupling is queued for v0.3 to close the gap.
  • Group C — p95/p99 under maintenance interference
    (tests/bench_contention_p95.rs, #[ignore]). 4 writer
    threads + 5 ms-cadence background checkpointer + concurrent
    Tree::compact(); tracks every put latency via
    hdrhistogram. M3 Pro: 307k ops/s sustained, p50 = 2 µs,
    p99 = 108 µs.
  • PGO build profile docs in PGO.md.

Full numbers in benches/RESULTS.md.