feat: pluggable rocksdb/lmdb backend (#782) by ch4r10t33r · Pull Request #783 · blockblaz/zeam

ch4r10t33r · 2026-04-24T13:33:39Z

Summary

Adds LMDB as an alternative database engine alongside the existing
RocksDB backend, selectable at runtime via a new CLI flag:

--db-backend <rocksdb|lmdb>   (default: rocksdb)

Closes #782.

Design

pkgs/lmdb — vendored liblmdb C sources (pinned snapshot of
mdb.master) plus a minimal Zig wrapper (Env, Txn, Dbi,
Cursor) that translates MDB_* errors into a Error set and
exposes addressable put/get/delete/iterate.
pkgs/database/src/lmdb.zig — an LMDB-flavoured backend that
mirrors the shape of the existing RocksDB wrapper (same
ColumnNamespace, same WriteBatch/Iterator API, same SSZ
convenience helpers). Reads open short-lived read-only txns; writes
and batches use an MDB_RDWR txn that the commit arms. Values
returned from get are copied out of LMDB's mmap into a heap-owned
Data so they live beyond the transaction.
pkgs/database/src/database.zig — Db becomes a
union(Backend) { rocksdb, lmdb }. All methods used by the node
(initWriteBatch, commit, loadBlock, loadLatestFinalizedState,
etc.) dispatch through switch. The tagged-union shape is
deliberate: every method takes comptime cn: ColumnNamespace, and
comptime args survive switch but not pointer indirection.
CLI — BeamCmd and NodeCommand both grow a --db-backend
flag that feeds database.Db.openBackend(...). The existing
3-argument Db.open(...) is kept (defaulting to rocksdb) so the
16+ test call sites stay untouched.
Stale-DB wipe now knows about the per-backend on-disk directory
name (rocksdb/ vs lmdb/) so the genesis-time mismatch path
cleans up whichever engine is in use.

Why the #782 assumption needed adjusting

The issue motivated LMDB on the basis that zeam already exhibits a
single-writer / multiple-reader access pattern. On audit, all DB
reads and writes today are serialized behind BeamNode.mutex; the
API and metrics servers don't touch the DB at all. The SWMR win is
therefore latent, not current — but switching to LMDB is still
worth it for the other reasons (minimal C surface, no compaction,
mmap reads, ~40 kB of vendor code). The issue body was rewritten to
reflect this, and a follow-up can move DB reads off BeamNode.mutex
to realise the SWMR advantage once both backends have been exercised
in CI.

Test plan

zig build test --summary all — 151/151 pass (includes the new
LMDB parity tests in pkgs/database/src/database.zig that run
the save/load block, batched commit and
loadLatestFinalizedState flows through both engines).
zig build simtest --summary all — 15/15 pass.
zig fmt --check .
cargo fmt --manifest-path rust/Cargo.toml --all -- --check
cargo clippy --manifest-path rust/Cargo.toml --workspace -- -D warnings
Soak --db-backend lmdb on devnet (follow-up).
Update lean-quickstart docs to mention the new flag (follow-up).

Snapshot of the four liblmdb source files from the OpenLDAP mdb.master branch, pinned at commit 40d3741b7d40ba4c75cb91dd9987ce692d376d71 (2026-01-13). License is OpenLDAP Public License v2.8 (see LICENSE). Subsequent commits on this branch add the Zig build glue, bindings, and a tagged-union Db dispatcher so rocksdb and lmdb can be selected at startup via --db-backend. See #782.

Thin wrapper over the vendored liblmdb exposing just enough primitives for pkgs/database to back a RocksDB-shaped key/value API: - Env: open / close / sync, configurable map size / max dbs / readers. - Txn: begin (ro or rw) / commit / abort, plus openDbi. - Get / put / delete. - Cursor: first / last / next / prev / seekRange. - Error translation for every MDB error code. The module is compiled as a Zig module with the C sources attached; it is imported by @zeam/database so the vendored C is built once per target. Two smoke tests (env lifecycle + cursor iteration) exercise the binding end-to-end. See #782.

Refactor `database.Db` from a bare alias over the RocksDB wrapper into a tagged union that dispatches to either the existing RocksDB engine or the new LMDB engine at runtime. Every method the node uses (`initWriteBatch`, `commit`, `loadLatestFinalizedState`, `loadBlock`, etc.) now `switch`es on the active variant. Backend selection is surfaced to operators through a new CLI flag: --db-backend <rocksdb|lmdb> (default: rocksdb) The flag is threaded from `NodeCommand`/`BeamCmd` through `node.NodeOptions` into `database.Db.openBackend(...)`. The existing `Db.open(...)` helper is kept as a rocksdb-defaulted shim so the existing tests and the 16+ call sites in `pkgs/node` stay unchanged. Database stale-wipe logic now computes the on-disk directory name per backend (`rocksdb/` vs `lmdb/`) so a genesis-time mismatch cleans up whichever engine is in use. Parity tests are added to `pkgs/database/src/database.zig` that run the save/load block, batched commit, and `loadLatestFinalizedState` flows against both backends through the tagged union. Relates to #782.

…anic `zig build` randomizes its dependency-graph traversal to surface non-determinism, seeded from the build-runner's auto-picked seed. On Zig 0.15.0 `std.Random.shuffleWithIndex` has an off-by-one that can return `buf.len` for certain seed/length combinations, causing `constructGraphAndCheckForDependencyLoop` to panic with `index out of bounds: index 18, len 18`. Our build graph landed on 18 nodes on this branch (new `lmdb` module plus its test target) and the multi-arch Docker build drew the unlucky seed `0xf02b800c`, so every image build was crashing during graph prepare. Pinning `--seed 0` sidesteps the bug and also makes the image reproducible, which is what we want inside a Dockerfile regardless.

Moves the vendored liblmdb C sources and the Zig wrapper out of zeam and into a standalone repo: https://github.com/blockblaz/lmdb-zig (v0.1.0, commit 8e1f544) zeam now depends on it through build.zig.zon like any other Zig dep. No behaviour change: the wrapper API and vendored liblmdb commit are identical to what lived in pkgs/lmdb. The standalone repo lets other Zig projects reuse the bindings without pulling in zeam. - build.zig.zon: add lmdb dependency pinned to 8e1f544. - build.zig: replace the in-tree module + C source wiring with b.dependency("lmdb", ...).module("lmdb"). - pkgs/lmdb: deleted; now lives upstream.

zclawz

Adversarial Review

Solid approach overall — tagged-union dispatch via inline else, SWMR model matches zeam's writer-serialised threading, and the API surface mirrors RocksDB cleanly. Three blocking issues and several non-blocking notes.

🔴 Silent write-batch failure — caller never knows

initWriteBatch is infallible (to match the RocksDB signature), so when env.beginTxn(false) fails it returns a batch with txn = null. Every subsequent put/delete silently no-ops (self.txn orelse return). When the caller calls db.commit(&batch), the null-txn path also no-ops. Result: the node believes the batch committed but no data was written. This is worse than an error — the caller (e.g. processFinalizationAdvancement) cannot detect the failure and continues as if state was persisted.

One approach: make initWriteBatch fallible, or store a failed: bool in WriteBatch and have Db.commit return an error if batch.failed.

🔴 `commit` swallows errors silently

pub fn commit(self: *Self, batch: *WriteBatch) void {
    if (batch.txn) |*txn| {
        txn.commit() catch |err| {
            batch.logger.err("commit: txn.commit failed: {any}", .{err});
        };
        batch.txn = null;
    }
}

Db.commit is void (matching RocksDB). A commit failure at the LMDB level is logged and then discarded. The chain continues believing the write committed. For a consensus-critical write (finalization advancement, state persistence) this is unacceptable. Either propagate the error (requires changing the Db.commit signature everywhere — probably correct) or at minimum set batch.txn = null only on success and let the batch linger in a failed state that the caller can detect.

🔴 `realpathAlloc` in tests — fails on Zig 0.16.0

const data_dir = try tmp_dir.dir.realpathAlloc(allocator, ".");

realpathAlloc was removed in Zig 0.16.0 (just fixed in several upstream deps in other PRs this week). Since zeam is currently on 0.15.2 this compiles, but zeam is actively upgrading. Use a fixed-size buffer instead:

var path_buf: [std.fs.max_path_bytes]u8 = undefined;
const data_dir = try tmp_dir.dir.realpath(".", &path_buf);

⚠️ Backend mismatch at restart goes undetected

If a node runs with --db-backend rocksdb (default), accumulates chain data in {data_dir}/rocksdb, then is restarted with --db-backend lmdb, wipeAndReopenDb deletes {data_dir}/lmdb (may not exist → ignore_not_found swallows the miss) and opens a fresh LMDB env. The node starts from genesis while the old RocksDB data sits intact at {data_dir}/rocksdb. There is no "you switched backends; your existing data is in the other engine" warning. At minimum, log a warning when the non-selected backend directory exists.

⚠️ `DEFAULT_MAP_SIZE = 1 TiB` is too large for multi-node setups

LMDB reserves the full 1 TiB of virtual address space at env.open time. Each Db.openBackend call (the beam mock network opens 3) consumes 3 TiB of VA space. On Linux this is purely virtual and does not consume RAM or disk, but it can exhaust the 128 TiB of user-space VA on machines running many processes, and may fail outright on systems with vm.overcommit_memory = 2. A more conservative default (say 256 GiB) or — better — implement grow-on-MDB_MAP_FULL retry would be safer.

⚠️ `deleteFilesInRange` allocates a key buffer unnecessarily

The comment says "mutating during cursor iteration is less obviously correct" but LMDB explicitly supports deleting the current cursor position via MDB_CURRENT flag. The current approach collects all keys into an ArrayList, then deletes them in a second pass — doubling allocations. This is not a correctness issue but costs memory for large ranges (e.g., pruning old unfinalized slot indices).

✅ Good things

inline else dispatch is clean and DRY ✅
WriteBatch.deinit correctly aborts the live txn ✅
assert(batch.* == .rocksdb/.lmdb) guards cross-backend mismatches in debug builds ✅
Iterator copies key/value out of the mmap before returning — correct (LMDB pointers are only valid until the next operation) ✅
get copies the value out of the mmap before closing the txn ✅
flush comment correctly notes it is a no-op most of the time ✅
Tagged union tests exercise both backends with the same helper functions ✅
--seed 0 in Dockerfile for reproducible builds ✅

Addresses review feedback on #783: - initWriteBatch / Db.commit are now fallible. RocksDB and LMDB both return !void from commit, and Db.commit propagates so callers (processFinalizationAdvancement, updateBlockDb, etc.) can detect a failed persistence rather than silently continuing. - LMDB WriteBatch carries a sticky `failed` flag. Any mid-batch put/delete error sets it; commit refuses to commit a poisoned batch, aborts the txn, and surfaces error.LmdbBatch. - DEFAULT_MAP_SIZE dropped from 1 TiB to 256 GiB. 1 TiB x N parallel env opens exhausts user-space VA on crowded hosts and trips vm.overcommit_memory=2. - On open, warn if the non-selected backend's data directory already contains entries. Silently starting from genesis on a newly chosen engine was an invisible footgun after a --db-backend switch. - Parity tests in database.zig swap realpathAlloc for a fixed-buffer realpath so they keep compiling on Zig 0.16 (realpathAlloc was removed upstream). Propagation updates: - pkgs/node/src/chain.zig: `try` at the two existing batch/commit sites (updateBlockDb, processFinalizationAdvancement). - pkgs/database/src/rocksdb.zig tests that exercise Db (the tagged union) get `try` on initWriteBatch/commit too. All 151 unit tests pass; simtest (15/15) green.

ch4r10t33r · 2026-04-24T17:05:21Z

Thanks for the careful review. Pushed 4ae901e addressing everything except #6 (tracked as a follow-up).

🔴 #1 + #2 — silent batch failure / swallowed commit error

Both initWriteBatch and Db.commit are now fallible end-to-end:

RocksDbBackend.commit, LmdbBackend.commit, and Db.commit all return !void now. The RocksDB arm just trys callRocksDB (no more catch |err| { log; return }); the LMDB arm propagates either error.LmdbBatch (batch poisoned — see below) or error.LmdbCommit (mdb_txn_commit failed).
LmdbBackend.initWriteBatch is fallible too, returning error.LmdbBatch if beginTxn(false) fails. This removes the whole "txn: null and silently no-op" footgun.
LmdbBackend.WriteBatch gained a sticky failed: bool flag. Every queued put / delete / deleteRange that fails sets it. commit then aborts the txn instead of committing a partial batch and surfaces error.LmdbBatch. That covers the deeper version of the concern — even if initWriteBatch succeeds, a later MDB_MAP_FULL on put can't sneak through as a "successful" commit.

Propagation at the call sites:

pkgs/node/src/chain.zig — updateBlockDb and processFinalizationAdvancement both already return !void, so try self.db.initWriteBatch() / try self.db.commit(&batch).
pkgs/database/src/rocksdb.zig tests that exercise Db (the tagged union) get try on both calls; the direct-RocksDB tests stay as-is since RocksDB.initWriteBatch is still infallible (rocksdb.WriteBatch.init() has no error path).

🔴 #3 — `realpathAlloc` removed in Zig 0.16

Fixed in the new parity tests in database.zig:

var path_buf: [std.fs.max_path_bytes]u8 = undefined;
const data_dir = try tmp_dir.dir.realpath(".", &path_buf);

(Pre-existing realpathAlloc call sites in rocksdb.zig are left alone to keep this PR's diff focused; happy to sweep them in a separate PR alongside the 0.16 upgrade if you'd like.)

⚠️ #4 — backend-switch warning

Added warnIfOtherBackendPopulated next to Db.openBackend. On open it checks {data_dir}/{other_backend} and, if that directory exists and has entries, logs:

detected existing lmdb data at .../lmdb while starting with --db-backend rocksdb;
the previous engine's data is NOT migrated. Delete .../lmdb or switch back to
--db-backend lmdb to re-use it.

Best-effort — any filesystem error is silently ignored so a permission quirk can't block startup.

⚠️ #5 — 1 TiB map size

Dropped to 256 GiB (256 * (1 << 30)). Explained in the updated doc-comment: comfortably beyond a lean node's working set, leaves room for mock-network multi-env tests, avoids exhausting user-space VA on crowded hosts and vm.overcommit_memory=2 boxes. Grow-on-MDB_MAP_FULL is the right long-term answer but out of scope here.

⚠️ #6 — cursor-delete via `MDB_CURRENT`

Acknowledged — tracking as a follow-up. The lmdb-zig wrapper (blockblaz/lmdb-zig) doesn't expose mdb_cursor_del yet; once I add Cursor.deleteCurrent upstream and bump the dep here, deleteFilesInRange / batch.deleteRange can delete in-place and drop the two-pass ArrayList(keys) buffer.

Verification

zig fmt --check . ✅
cargo fmt --manifest-path rust/Cargo.toml --all -- --check ✅
cargo clippy --manifest-path rust/Cargo.toml --workspace -- -D warnings ✅
zig build test --summary all → 151/151 pass (parity tests through the tagged union for both backends) ✅
zig build simtest --summary all → 15/15 pass ✅

zclawz

Re-review: all blocking issues resolved ✅

Checked the latest commit (4ae901e5).

🔴 → ✅ Silent write-batch failure

initWriteBatch is now fallible (!WriteBatch) for the LMDB backend, propagated through Db.initWriteBatch with try. The failed: bool sticky flag on WriteBatch poisons mid-batch and causes commit to abort + return error.LmdbBatch. No silent data loss path remains.

🔴 → ✅ `commit` swallows errors

Db.commit is now !void. Both backends propagate:

LMDB: commit checks batch.failed, aborts, returns error.LmdbBatch if poisoned, otherwise try txn.commit()
RocksDB: changed from catch { log; return } to try callRocksDB(...) — errors now surface to the caller

Call sites in chain.zig (updateBlockDb, processFinalizationAdvancement) now use try, so a persistence failure surfaces immediately rather than letting the chain advance on phantom writes.

🔴 → ✅ `realpathAlloc` in tests

All three test helpers (testSaveAndLoadBlock, testBatchWriteAndCommit, testLoadLatestFinalizedStateHappyPath) now use a fixed-size buffer + dir.realpath(".", &path_buf). Compiles on Zig 0.16.0.

⚠️ → ✅ Backend mismatch at restart

warnIfOtherBackendPopulated walks the other engine's directory; if it contains any entries it logs a warn-level message naming both paths and advising the operator to delete or switch back. Called from openBackend. Diagnostic-only (no correctness barrier) which is the right trade-off — forcibly blocking would break dev workflows.

⚠️ → ✅ Default map size

DEFAULT_MAP_SIZE reduced from 1 TiB (1 << 40) to 256 GiB (256 * (1 << 30)). Comment updated to explain the vm.overcommit_memory = 2 and multi-node VA exhaustion concern.

No new issues found. LGTM.

Resolve build.zig.zon: keep lmdb (PR) and thread_pool (main).

g11tech

looks good can you ask for @anshalshukla 's review on this as well

ch4r10t33r added 3 commits April 24, 2026 13:49

ch4r10t33r marked this pull request as ready for review April 24, 2026 14:56

ch4r10t33r changed the title ~~database, cli: pluggable rocksdb/lmdb backend (#782)~~ feat: pluggable rocksdb/lmdb backend (#782) Apr 24, 2026

ch4r10t33r requested a review from anshalshukla April 24, 2026 16:34

zclawz requested changes Apr 24, 2026

View reviewed changes

ch4r10t33r requested a review from zclawz April 24, 2026 18:19

zclawz previously approved these changes Apr 26, 2026

View reviewed changes

merge main into pr-783

a402a64

Resolve build.zig.zon: keep lmdb (PR) and thread_pool (main).

ch4r10t33r dismissed zclawz’s stale review via a402a64 April 30, 2026 13:15

ch4r10t33r requested review from g11tech and zclawz April 30, 2026 13:27

g11tech approved these changes Apr 30, 2026

View reviewed changes

ch4r10t33r merged commit 33ea6e4 into main Apr 30, 2026
13 checks passed

ch4r10t33r deleted the feat/lmdb-backend branch April 30, 2026 14:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: pluggable rocksdb/lmdb backend (#782)#783

feat: pluggable rocksdb/lmdb backend (#782)#783
ch4r10t33r merged 7 commits into
mainfrom
feat/lmdb-backend

ch4r10t33r commented Apr 24, 2026

Uh oh!

zclawz left a comment

Uh oh!

ch4r10t33r commented Apr 24, 2026

Uh oh!

zclawz left a comment

Uh oh!

g11tech left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ch4r10t33r commented Apr 24, 2026

Summary

Design

Why the #782 assumption needed adjusting

Test plan

Uh oh!

zclawz left a comment

Choose a reason for hiding this comment

Adversarial Review

🔴 Silent write-batch failure — caller never knows

🔴 commit swallows errors silently

🔴 realpathAlloc in tests — fails on Zig 0.16.0

⚠️ Backend mismatch at restart goes undetected

⚠️ DEFAULT_MAP_SIZE = 1 TiB is too large for multi-node setups

⚠️ deleteFilesInRange allocates a key buffer unnecessarily

✅ Good things

Uh oh!

ch4r10t33r commented Apr 24, 2026

🔴 #1 + #2 — silent batch failure / swallowed commit error

🔴 #3 — realpathAlloc removed in Zig 0.16

⚠️ #4 — backend-switch warning

⚠️ #5 — 1 TiB map size

⚠️ #6 — cursor-delete via MDB_CURRENT

Verification

Uh oh!

zclawz left a comment

Choose a reason for hiding this comment

Re-review: all blocking issues resolved ✅

🔴 → ✅ Silent write-batch failure

🔴 → ✅ commit swallows errors

🔴 → ✅ realpathAlloc in tests

⚠️ → ✅ Backend mismatch at restart

⚠️ → ✅ Default map size

Uh oh!

g11tech left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

🔴 `commit` swallows errors silently

🔴 `realpathAlloc` in tests — fails on Zig 0.16.0

⚠️ `DEFAULT_MAP_SIZE = 1 TiB` is too large for multi-node setups

⚠️ `deleteFilesInRange` allocates a key buffer unnecessarily

🔴 #3 — `realpathAlloc` removed in Zig 0.16

⚠️ #6 — cursor-delete via `MDB_CURRENT`

🔴 → ✅ `commit` swallows errors

🔴 → ✅ `realpathAlloc` in tests