v0.2.0
Breaking
- Public API surface closure. holt::layout, holt::journal,
holt::store are now pub(crate). The supported holt::*
surface is Tree, TreeBuilder, TreeConfig, Storage,
Error, Result, RangeBuilder, RangeEntry, RangeIter,
BlobStats, TreeStats, CheckpointerStats, TxnBatch,
CheckpointConfig, Backend, MemoryBackend,
PersistentBackend, AlignedBlobBuf, BlobGuid. The
metrics::render_prometheus renderer is part of the
metrics-feature surface.
- pub use holt::BufferManager removed; BufferManager is
internal.
- BlobGuid now re-exported at the crate root for custom
Backend implementations.
- RangeBuilder::new is pub(crate); use Tree::range() /
Tree::scan_prefix().
- TreeConfig::checkpoint_byte_interval field +
TreeBuilder::checkpoint_byte_interval method removed. The
field was reserved and never read.
- AllocOutcome shrunk to { slot }; ExtentAllocOutcome
shrunk to { byte_offset }. The other fields were dead.
- encode_record returns () instead of Result<()>: there is no
fallible step.
- BufferManager::capacity() / clear() removed. Dead code.
- TreeConfig::flush_on_write renamed to
memory_flush_on_write. The field had no effect on
persistent trees; the v0.1 name suggested per-write fsync, which
it never was.
- Error::NodeCorrupt is a struct variant with optional
blob_guid + slot fields. Construct via
Error::node_corrupt(ctx) and enrich via .with_blob_guid(g)
/ .with_slot(s). Pattern-matchers must spread the new fields
(NodeCorrupt { context, .. }).
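The NodeCorrupt migration can be illustrated with a minimal, self-contained sketch. It does not use holt itself: the Error enum below is a stand-in, and the field types (a 16-byte BlobGuid, a u16 slot, a &'static str context) and the describe helper are assumptions chosen for illustration. Only the names NodeCorrupt, node_corrupt, with_blob_guid, with_slot, and Internal come from the notes above.

```rust
// Stand-in types. The field types are assumptions for illustration,
// not holt's actual definitions.
type BlobGuid = [u8; 16];

#[derive(Debug)]
#[non_exhaustive]
pub enum Error {
    NodeCorrupt {
        context: &'static str,
        blob_guid: Option<BlobGuid>,
        slot: Option<u16>,
    },
    Internal(&'static str),
}

impl Error {
    /// Start with only the context; location details stay optional.
    pub fn node_corrupt(context: &'static str) -> Self {
        Error::NodeCorrupt { context, blob_guid: None, slot: None }
    }

    /// Attach the blob GUID when this is a NodeCorrupt; other variants pass through.
    pub fn with_blob_guid(mut self, guid: BlobGuid) -> Self {
        if let Error::NodeCorrupt { blob_guid, .. } = &mut self {
            *blob_guid = Some(guid);
        }
        self
    }

    /// Attach the slot when this is a NodeCorrupt; other variants pass through.
    pub fn with_slot(mut self, s: u16) -> Self {
        if let Error::NodeCorrupt { slot, .. } = &mut self {
            *slot = Some(s);
        }
        self
    }
}

/// Hypothetical caller that only needs the context. The `..` keeps the
/// match compiling when fields are added to the variant.
pub fn describe(err: &Error) -> String {
    match err {
        Error::NodeCorrupt { context, .. } => format!("corrupt node: {context}"),
        Error::Internal(msg) => format!("internal: {msg}"),
    }
}
```

Code in a downstream crate also needs a wildcard arm when matching on Error, because of the #[non_exhaustive] marker; the sketch omits it only because the enum is defined locally.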
Fixed — durability (W2D-strict)
- Checkpoint error paths no longer drop drained state. Manual
Tree::checkpoint and the background round now restore every
snapshot they drained on every error return. The WAL flush
failure, I/O worker channel-closed, and pre-delete Sync
failure paths previously left dirty / pending partially
drained, allowing the next round to truncate the WAL with cache
state still pending. See ARCHITECTURE.md §6 for the seven-phase
protocol.
- Abort-on-dirty-failure gate before pending-delete. A failed
parent write_through no longer propagates to the dependent
child's manifest delete (which would have left the on-disk
parent referencing a slot the manifest no longer had). The
pre-delete sync still runs to fsync the writes that did succeed;
the pending set is restored and the next round retries the
parent + child together.
- Writer ↔ background-checkpoint W2D race. Pending-delete
snapshot now drains inside the same wal.lock critical section
as snapshot_dirty + wal.flush, closing the inversion window
where a writer could land a fresh blob between the two drains.
- scan.rs::refresh_blob_node_pointers inline bm.commit
replaced with bm.mark_dirty(parent_guid, STRUCTURAL_SEQ) so
the post-compact pointer repair stages through the unified
dirty-set protocol instead of pushing cache state straight to
the backend.
- Tree::compact documented NOT online-safe: running it
concurrently with reads or writes can produce torn reads across
BlobNode crossings. The v0.3 maintenance latch will lift this.
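The restore-on-error rule and the abort gate described above can be sketched in a few lines. This is a simplified model under stated assumptions, not holt's implementation: CheckpointState, checkpoint_round, the u64 blob ids, and the write / delete closures are all hypothetical stand-ins, and the WAL flush, the pre-delete sync, and the locking are left out.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the checkpoint bookkeeping: ids of blobs
/// that still need a write, and ids whose manifest delete is pending.
#[derive(Default)]
pub struct CheckpointState {
    pub dirty: BTreeSet<u64>,
    pub pending_delete: BTreeSet<u64>,
}

/// One checkpoint round. `write` and `delete` stand in for backend I/O.
/// On any error, everything drained but not durably handled is put back,
/// so the next round retries it.
pub fn checkpoint_round<W, D>(
    state: &mut CheckpointState,
    mut write: W,
    mut delete: D,
) -> Result<(), String>
where
    W: FnMut(u64) -> Result<(), String>,
    D: FnMut(u64) -> Result<(), String>,
{
    // Drain both sets up front (the real code does this under one lock).
    let dirty = std::mem::take(&mut state.dirty);
    let pending = std::mem::take(&mut state.pending_delete);

    // Attempt every dirty write, remembering the ones that failed.
    let mut failed = BTreeSet::new();
    let mut first_err: Option<String> = None;
    for id in dirty {
        if let Err(e) = write(id) {
            failed.insert(id);
            if first_err.is_none() {
                first_err = Some(e);
            }
        }
    }

    // Abort gate: a failed write must not be followed by any delete.
    if let Some(e) = first_err {
        state.dirty.extend(failed);
        state.pending_delete.extend(pending);
        return Err(e);
    }

    // Writes succeeded; process deletes, restoring the remainder on error.
    let mut iter = pending.into_iter();
    while let Some(id) = iter.next() {
        if let Err(e) = delete(id) {
            state.pending_delete.insert(id);
            state.pending_delete.extend(iter.by_ref());
            return Err(e);
        }
    }
    Ok(())
}
```

The property to check is that after a failed round the union of `dirty` and `pending_delete` still covers everything that was not made durable, so a later WAL truncate cannot outrun the cache state.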
Added
- io-uring feature flag (Linux only). PersistentBackend
reads/writes route through a per-backend io_uring (depth 8)
instead of pread / pwrite.
- tracing feature flag (off by default). Structured
tracing events on checkpoint round complete, spillover,
merge, compact, WAL truncate, and eviction sweeps.
Zero-cost when the feature is off.
- metrics feature flag (off by default). Renders
TreeStats into Prometheus text format. Gauges
(holt_slots, holt_tombstones, holt_compactions) follow
the convention of dropping the _total suffix.
- 3-thread background checkpointer: planner + dedicated I/O
worker + cold-blob eviction sweep, parked between rounds via
park_timeout(idle_interval). Default disabled; opt in via
TreeBuilder::checkpoint(CheckpointConfig::default().enabled(true)).
Drop runs one final synchronous round on
the calling thread.
- Tree::scan_prefix(p): one-line wrapper for
tree.range().prefix(p).
- Tree::stats extended with bm_dirty_count,
bm_pending_delete_count, bm_cache_hits / bm_cache_misses,
bm_optimistic_restarts, and an Option<CheckpointerStats>.
- Silent observability reads: pin_silent /
get_cached_silent / collect_blob_guids_silent don't bump
cache counters or refresh the LRU tick, so Tree::stats and
metrics scrapes don't pollute the counters they report.
- Error::Internal(&'static str) variant for
invariant-violation paths (previously Error::NotYetImplemented, now
reserved for genuine walker-arm feature gaps). Non-breaking
thanks to Error's #[non_exhaustive] marker.
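For readers unfamiliar with the Prometheus text format mentioned under the metrics flag, here is a rough sketch of what rendering the three named gauges could look like. It is not the crate's metrics::render_prometheus: the StatsSnapshot struct, its field names, the render_gauges function, and the HELP strings are placeholders invented for this example. Only the metric names come from the notes above.

```rust
use std::fmt::Write;

/// Hypothetical subset of the tree statistics; field names are illustrative.
pub struct StatsSnapshot {
    pub slots: u64,
    pub tombstones: u64,
    pub compactions: u64,
}

/// Render the snapshot as Prometheus text exposition format.
/// Each metric gets a HELP line, a TYPE line, and one sample line.
pub fn render_gauges(s: &StatsSnapshot) -> String {
    let metrics = [
        ("holt_slots", "Placeholder help text for slots.", s.slots),
        ("holt_tombstones", "Placeholder help text for tombstones.", s.tombstones),
        ("holt_compactions", "Placeholder help text for compactions.", s.compactions),
    ];
    let mut out = String::new();
    for (name, help, value) in metrics {
        // Writing into a String cannot fail, so unwrap is safe here.
        writeln!(out, "# HELP {name} {help}").unwrap();
        writeln!(out, "# TYPE {name} gauge").unwrap();
        writeln!(out, "{name} {value}").unwrap();
    }
    out
}
```

In the Prometheus naming convention the _total suffix marks counters, which is why gauge names such as holt_compactions carry no suffix.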
Changed
- Sharded BufferManager cache: v0.1's
Mutex<HashMap<BlobGuid, _>> + VecDeque<BlobGuid> LRU
replaced by DashMap<BlobGuid, Arc<CachedBlob>> with
clock_tick / last_touched eviction; concurrent pins on
different blobs hit different shards instead of contending on
a single mutex.
- Cached Tree.root_pin: every get / put / delete
keeps the root pinned via Arc<CachedBlob> and skips the BM
hash lookup on the root hop (~300 ns/op on the hot path).
- RangeIter delimiter fast-forward: after emitting a
CommonPrefix(C), ascend the descent stack past C's subtree
instead of scanning every leaf. *_list_dir is now
O(distinct_rollups).
- Hardware-accelerated CRC32 via crc32fast: auto-detects
PCLMULQDQ on x86_64 and ARM-CRC32 on AArch64. Drops per-record
WAL cost from ~110 ns to ~20 ns on supported hardware.
- SIMD Node48 / Node256 range-iter scans: vpcmpeqb / NEON
byte search for Node48::index[256], slot-index scan for
Node256::children[256]. Worth ~80-120 ns per next() on
wide branch nodes; matters most for *_list_dir.
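The sharded cache with clock_tick / last_touched eviction can be approximated with the standard library alone. The sketch below swaps DashMap for a fixed vector of Mutex<HashMap> shards so that it stays dependency-free; ShardedCache, Cached, the u64 keys, and evict_older_than are hypothetical names, and the eviction policy shown (drop everything last touched before a cutoff tick) is an assumption about how such a sweep might work.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// A cached blob plus the clock tick at which it was last pinned.
pub struct Cached {
    pub data: Vec<u8>,
    last_touched: AtomicU64,
}

/// Stand-in for a DashMap-style cache: keys hash to independent shards,
/// so pins on different keys usually take different locks.
pub struct ShardedCache {
    shards: Vec<Mutex<HashMap<u64, Arc<Cached>>>>,
    clock_tick: AtomicU64,
}

impl ShardedCache {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0);
        Self {
            shards: (0..shard_count).map(|_| Mutex::new(HashMap::new())).collect(),
            clock_tick: AtomicU64::new(0),
        }
    }

    fn shard(&self, key: u64) -> &Mutex<HashMap<u64, Arc<Cached>>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    pub fn insert(&self, key: u64, data: Vec<u8>) {
        let tick = self.clock_tick.fetch_add(1, Ordering::Relaxed);
        let entry = Arc::new(Cached { data, last_touched: AtomicU64::new(tick) });
        self.shard(key).lock().unwrap().insert(key, entry);
    }

    /// Pin an entry and refresh its tick. The shard lock is released
    /// before the tick is updated.
    pub fn pin(&self, key: u64) -> Option<Arc<Cached>> {
        let entry = self.shard(key).lock().unwrap().get(&key).cloned()?;
        let tick = self.clock_tick.fetch_add(1, Ordering::Relaxed);
        entry.last_touched.store(tick, Ordering::Relaxed);
        Some(entry)
    }

    /// Evict entries last touched before `cutoff`; returns how many were dropped.
    pub fn evict_older_than(&self, cutoff: u64) -> usize {
        let mut evicted = 0;
        for shard in &self.shards {
            let mut map = shard.lock().unwrap();
            let before = map.len();
            map.retain(|_, e| e.last_touched.load(Ordering::Relaxed) >= cutoff);
            evicted += before - map.len();
        }
        evicted
    }
}
```

Compared with a VecDeque-based LRU, a pin here only stores one atomic tick instead of reordering a shared queue, which is what removes the single global lock from the hot path.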
Benchmarks
- Group B: scale curve across kv / objstore / fs × four
dataset sizes ({ 20 k, 100 k, 500 k, 2 M }). The 500 k tier
already exceeds the default 32 MB buffer pool; the 2 M tier
(~192 MB payload) forces full eviction churn. Get scales
well on all three workloads (holt wins every cell, with
the lead vs RocksDB widening to 5.4× / 2.8× / 2.2× at 2 M).
Put wins at 20 k / 100 k / 500 k, ties RocksDB at 2 M kv,
but loses 8-22 % to RocksDB / SQLite at 2 M on objstore / fs.
That is the regime where LSM-style write amortization is the right
choice and ART-over-blobs isn't competitive; cross-blob
lock-coupling is queued for v0.3 to close the gap.
- Group C: p95/p99 under maintenance interference
(tests/bench_contention_p95.rs, #[ignore]). 4 writer
threads + 5 ms-cadence background checkpointer + concurrent
Tree::compact(); tracks every put latency via
hdrhistogram. M3 Pro: 307k ops/s sustained, p50 = 2 µs,
p99 = 108 µs.
- PGO build profile docs in PGO.md.
Full numbers in benches/RESULTS.md.