feat: pluggable rocksdb/lmdb backend (#782)#783
Conversation
Snapshot of the four liblmdb source files from the OpenLDAP mdb.master branch, pinned at commit 40d3741b7d40ba4c75cb91dd9987ce692d376d71 (2026-01-13). License is OpenLDAP Public License v2.8 (see LICENSE). Subsequent commits on this branch add the Zig build glue, bindings, and a tagged-union Db dispatcher so rocksdb and lmdb can be selected at startup via --db-backend. See #782.
Thin wrapper over the vendored liblmdb exposing just enough primitives for pkgs/database to back a RocksDB-shaped key/value API: - Env: open / close / sync, configurable map size / max dbs / readers. - Txn: begin (ro or rw) / commit / abort, plus openDbi. - Get / put / delete. - Cursor: first / last / next / prev / seekRange. - Error translation for every MDB error code. The module is compiled as a Zig module with the C sources attached; it is imported by @zeam/database so the vendored C is built once per target. Two smoke tests (env lifecycle + cursor iteration) exercise the binding end-to-end. See #782.
Refactor `database.Db` from a bare alias over the RocksDB wrapper into
a tagged union that dispatches to either the existing RocksDB engine
or the new LMDB engine at runtime. Every method the node uses
(`initWriteBatch`, `commit`, `loadLatestFinalizedState`, `loadBlock`,
etc.) now `switch`es on the active variant.
Backend selection is surfaced to operators through a new CLI flag:
--db-backend <rocksdb|lmdb> (default: rocksdb)
The flag is threaded from `NodeCommand`/`BeamCmd` through
`node.NodeOptions` into `database.Db.openBackend(...)`. The existing
`Db.open(...)` helper is kept as a rocksdb-defaulted shim so the
existing tests and the 16+ call sites in `pkgs/node` stay unchanged.
Database stale-wipe logic now computes the on-disk directory name per
backend (`rocksdb/` vs `lmdb/`) so a genesis-time mismatch cleans up
whichever engine is in use.
Parity tests are added to `pkgs/database/src/database.zig` that run
the save/load block, batched commit, and `loadLatestFinalizedState`
flows against both backends through the tagged union.
Relates to #782.
…anic `zig build` randomizes its dependency-graph traversal to surface non-determinism, seeded from the build-runner's auto-picked seed. On Zig 0.15.0 `std.Random.shuffleWithIndex` has an off-by-one that can return `buf.len` for certain seed/length combinations, causing `constructGraphAndCheckForDependencyLoop` to panic with `index out of bounds: index 18, len 18`. Our build graph landed on 18 nodes on this branch (new `lmdb` module plus its test target) and the multi-arch Docker build drew the unlucky seed `0xf02b800c`, so every image build was crashing during graph prepare. Pinning `--seed 0` sidesteps the bug and also makes the image reproducible, which is what we want inside a Dockerfile regardless.
Moves the vendored liblmdb C sources and the Zig wrapper out of zeam and into a standalone repo: https://github.com/blockblaz/lmdb-zig (v0.1.0, commit 8e1f544) zeam now depends on it through build.zig.zon like any other Zig dep. No behaviour change: the wrapper API and vendored liblmdb commit are identical to what lived in pkgs/lmdb. The standalone repo lets other Zig projects reuse the bindings without pulling in zeam. - build.zig.zon: add lmdb dependency pinned to 8e1f544. - build.zig: replace the in-tree module + C source wiring with b.dependency("lmdb", ...).module("lmdb"). - pkgs/lmdb: deleted; now lives upstream.
zclawz
left a comment
There was a problem hiding this comment.
Adversarial Review
Solid approach overall — tagged-union dispatch via inline else, SWMR model matches zeam's writer-serialised threading, and the API surface mirrors RocksDB cleanly. Three blocking issues and several non-blocking notes.
🔴 Silent write-batch failure — caller never knows
initWriteBatch is infallible (to match the RocksDB signature), so when env.beginTxn(false) fails it returns a batch with txn = null. Every subsequent put/delete silently no-ops (self.txn orelse return). When the caller calls db.commit(&batch), the null-txn path also no-ops. Result: the node believes the batch committed but no data was written. This is worse than an error — the caller (e.g. processFinalizationAdvancement) cannot detect the failure and continues as if state was persisted.
One approach: make initWriteBatch fallible, or store a failed: bool in WriteBatch and have Db.commit return an error if batch.failed.
🔴 commit swallows errors silently
pub fn commit(self: *Self, batch: *WriteBatch) void {
if (batch.txn) |*txn| {
txn.commit() catch |err| {
batch.logger.err("commit: txn.commit failed: {any}", .{err});
};
batch.txn = null;
}
}Db.commit is void (matching RocksDB). A commit failure at the LMDB level is logged and then discarded. The chain continues believing the write committed. For a consensus-critical write (finalization advancement, state persistence) this is unacceptable. Either propagate the error (requires changing the Db.commit signature everywhere — probably correct) or at minimum set batch.txn = null only on success and let the batch linger in a failed state that the caller can detect.
🔴 realpathAlloc in tests — fails on Zig 0.16.0
const data_dir = try tmp_dir.dir.realpathAlloc(allocator, ".");realpathAlloc was removed in Zig 0.16.0 (just fixed in several upstream deps in other PRs this week). Since zeam is currently on 0.15.2 this compiles, but zeam is actively upgrading. Use a fixed-size buffer instead:
var path_buf: [std.fs.max_path_bytes]u8 = undefined;
const data_dir = try tmp_dir.dir.realpath(".", &path_buf);⚠️ Backend mismatch at restart goes undetected
If a node runs with --db-backend rocksdb (default), accumulates chain data in {data_dir}/rocksdb, then is restarted with --db-backend lmdb, wipeAndReopenDb deletes {data_dir}/lmdb (may not exist → ignore_not_found swallows the miss) and opens a fresh LMDB env. The node starts from genesis while the old RocksDB data sits intact at {data_dir}/rocksdb. There is no "you switched backends; your existing data is in the other engine" warning. At minimum, log a warning when the non-selected backend directory exists.
⚠️ DEFAULT_MAP_SIZE = 1 TiB is too large for multi-node setups
LMDB reserves the full 1 TiB of virtual address space at env.open time. Each Db.openBackend call (the beam mock network opens 3) consumes 3 TiB of VA space. On Linux this is purely virtual and does not consume RAM or disk, but it can exhaust the 128 TiB of user-space VA on machines running many processes, and may fail outright on systems with vm.overcommit_memory = 2. A more conservative default (say 256 GiB) or — better — implement grow-on-MDB_MAP_FULL retry would be safer.
⚠️ deleteFilesInRange allocates a key buffer unnecessarily
The comment says "mutating during cursor iteration is less obviously correct" but LMDB explicitly supports deleting the current cursor position via MDB_CURRENT flag. The current approach collects all keys into an ArrayList, then deletes them in a second pass — doubling allocations. This is not a correctness issue but costs memory for large ranges (e.g., pruning old unfinalized slot indices).
✅ Good things
inline elsedispatch is clean and DRY ✅WriteBatch.deinitcorrectly aborts the live txn ✅assert(batch.* == .rocksdb/.lmdb)guards cross-backend mismatches in debug builds ✅Iteratorcopies key/value out of the mmap before returning — correct (LMDB pointers are only valid until the next operation) ✅getcopies the value out of the mmap before closing the txn ✅flushcomment correctly notes it is a no-op most of the time ✅- Tagged union tests exercise both backends with the same helper functions ✅
--seed 0in Dockerfile for reproducible builds ✅
Addresses review feedback on #783: - initWriteBatch / Db.commit are now fallible. RocksDB and LMDB both return !void from commit, and Db.commit propagates so callers (processFinalizationAdvancement, updateBlockDb, etc.) can detect a failed persistence rather than silently continuing. - LMDB WriteBatch carries a sticky `failed` flag. Any mid-batch put/delete error sets it; commit refuses to commit a poisoned batch, aborts the txn, and surfaces error.LmdbBatch. - DEFAULT_MAP_SIZE dropped from 1 TiB to 256 GiB. 1 TiB x N parallel env opens exhausts user-space VA on crowded hosts and trips vm.overcommit_memory=2. - On open, warn if the non-selected backend's data directory already contains entries. Silently starting from genesis on a newly chosen engine was an invisible footgun after a --db-backend switch. - Parity tests in database.zig swap realpathAlloc for a fixed-buffer realpath so they keep compiling on Zig 0.16 (realpathAlloc was removed upstream). Propagation updates: - pkgs/node/src/chain.zig: `try` at the two existing batch/commit sites (updateBlockDb, processFinalizationAdvancement). - pkgs/database/src/rocksdb.zig tests that exercise Db (the tagged union) get `try` on initWriteBatch/commit too. All 151 unit tests pass; simtest (15/15) green.
|
Thanks for the careful review. Pushed 🔴 #1 + #2 — silent batch failure / swallowed commit errorBoth
Propagation at the call sites:
🔴 #3 —
|
zclawz
left a comment
There was a problem hiding this comment.
Re-review: all blocking issues resolved ✅
Checked the latest commit (4ae901e5).
🔴 → ✅ Silent write-batch failure
initWriteBatch is now fallible (!WriteBatch) for the LMDB backend, propagated through Db.initWriteBatch with try. The failed: bool sticky flag on WriteBatch poisons mid-batch and causes commit to abort + return error.LmdbBatch. No silent data loss path remains.
🔴 → ✅ commit swallows errors
Db.commit is now !void. Both backends propagate:
- LMDB:
commitchecksbatch.failed, aborts, returnserror.LmdbBatchif poisoned, otherwisetry txn.commit() - RocksDB: changed from
catch { log; return }totry callRocksDB(...)— errors now surface to the caller
Call sites in chain.zig (updateBlockDb, processFinalizationAdvancement) now use try, so a persistence failure surfaces immediately rather than letting the chain advance on phantom writes.
🔴 → ✅ realpathAlloc in tests
All three test helpers (testSaveAndLoadBlock, testBatchWriteAndCommit, testLoadLatestFinalizedStateHappyPath) now use a fixed-size buffer + dir.realpath(".", &path_buf). Compiles on Zig 0.16.0.
⚠️ → ✅ Backend mismatch at restart
warnIfOtherBackendPopulated walks the other engine's directory; if it contains any entries it logs a warn-level message naming both paths and advising the operator to delete or switch back. Called from openBackend. Diagnostic-only (no correctness barrier) which is the right trade-off — forcibly blocking would break dev workflows.
⚠️ → ✅ Default map size
DEFAULT_MAP_SIZE reduced from 1 TiB (1 << 40) to 256 GiB (256 * (1 << 30)). Comment updated to explain the vm.overcommit_memory = 2 and multi-node VA exhaustion concern.
No new issues found. LGTM.
Resolve build.zig.zon: keep lmdb (PR) and thread_pool (main).
g11tech
left a comment
There was a problem hiding this comment.
looks good can you ask for @anshalshukla 's review on this as well
Summary
Adds LMDB as an alternative database engine alongside the existing
RocksDB backend, selectable at runtime via a new CLI flag:
Closes #782.
Design
pkgs/lmdb— vendored liblmdb C sources (pinned snapshot ofmdb.master) plus a minimal Zig wrapper (Env,Txn,Dbi,Cursor) that translatesMDB_*errors into aErrorset andexposes addressable put/get/delete/iterate.
pkgs/database/src/lmdb.zig— an LMDB-flavoured backend thatmirrors the shape of the existing RocksDB wrapper (same
ColumnNamespace, sameWriteBatch/IteratorAPI, same SSZconvenience helpers). Reads open short-lived read-only txns; writes
and batches use an
MDB_RDWRtxn that thecommitarms. Valuesreturned from
getare copied out of LMDB's mmap into a heap-ownedDataso they live beyond the transaction.pkgs/database/src/database.zig—Dbbecomes aunion(Backend) { rocksdb, lmdb }. All methods used by the node(
initWriteBatch,commit,loadBlock,loadLatestFinalizedState,etc.) dispatch through
switch. The tagged-union shape isdeliberate: every method takes
comptime cn: ColumnNamespace, andcomptime args survive
switchbut not pointer indirection.BeamCmdandNodeCommandboth grow a--db-backendflag that feeds
database.Db.openBackend(...). The existing3-argument
Db.open(...)is kept (defaulting to rocksdb) so the16+ test call sites stay untouched.
name (
rocksdb/vslmdb/) so the genesis-time mismatch pathcleans up whichever engine is in use.
Why the #782 assumption needed adjusting
The issue motivated LMDB on the basis that zeam already exhibits a
single-writer / multiple-reader access pattern. On audit, all DB
reads and writes today are serialized behind
BeamNode.mutex; theAPI and metrics servers don't touch the DB at all. The SWMR win is
therefore latent, not current — but switching to LMDB is still
worth it for the other reasons (minimal C surface, no compaction,
mmap reads, ~40 kB of vendor code). The issue body was rewritten to
reflect this, and a follow-up can move DB reads off
BeamNode.mutexto realise the SWMR advantage once both backends have been exercised
in CI.
Test plan
zig build test --summary all— 151/151 pass (includes the newLMDB parity tests in
pkgs/database/src/database.zigthat runthe save/load block, batched commit and
loadLatestFinalizedStateflows through both engines).zig build simtest --summary all— 15/15 pass.zig fmt --check .cargo fmt --manifest-path rust/Cargo.toml --all -- --checkcargo clippy --manifest-path rust/Cargo.toml --workspace -- -D warnings--db-backend lmdbon devnet (follow-up).lean-quickstartdocs to mention the new flag (follow-up).