Skip to content

feat: pluggable rocksdb/lmdb backend (#782)#783

Merged
ch4r10t33r merged 7 commits into
mainfrom
feat/lmdb-backend
Apr 30, 2026
Merged

feat: pluggable rocksdb/lmdb backend (#782)#783
ch4r10t33r merged 7 commits into
mainfrom
feat/lmdb-backend

Conversation

@ch4r10t33r
Copy link
Copy Markdown
Contributor

Summary

Adds LMDB as an alternative database engine alongside the existing
RocksDB backend, selectable at runtime via a new CLI flag:

--db-backend <rocksdb|lmdb>   (default: rocksdb)

Closes #782.

Design

  • pkgs/lmdb — vendored liblmdb C sources (pinned snapshot of
    mdb.master) plus a minimal Zig wrapper (Env, Txn, Dbi,
    Cursor) that translates MDB_* errors into a Error set and
    exposes addressable put/get/delete/iterate.
  • pkgs/database/src/lmdb.zig — an LMDB-flavoured backend that
    mirrors the shape of the existing RocksDB wrapper (same
    ColumnNamespace, same WriteBatch/Iterator API, same SSZ
    convenience helpers). Reads open short-lived read-only txns; writes
    and batches use an MDB_RDWR txn that the commit arms. Values
    returned from get are copied out of LMDB's mmap into a heap-owned
    Data so they live beyond the transaction.
  • pkgs/database/src/database.zigDb becomes a
    union(Backend) { rocksdb, lmdb }. All methods used by the node
    (initWriteBatch, commit, loadBlock, loadLatestFinalizedState,
    etc.) dispatch through switch. The tagged-union shape is
    deliberate: every method takes comptime cn: ColumnNamespace, and
    comptime args survive switch but not pointer indirection.
  • CLIBeamCmd and NodeCommand both grow a --db-backend
    flag that feeds database.Db.openBackend(...). The existing
    3-argument Db.open(...) is kept (defaulting to rocksdb) so the
    16+ test call sites stay untouched.
  • Stale-DB wipe now knows about the per-backend on-disk directory
    name (rocksdb/ vs lmdb/) so the genesis-time mismatch path
    cleans up whichever engine is in use.

Why the #782 assumption needed adjusting

The issue motivated LMDB on the basis that zeam already exhibits a
single-writer / multiple-reader access pattern. On audit, all DB
reads and writes today are serialized behind BeamNode.mutex; the
API and metrics servers don't touch the DB at all. The SWMR win is
therefore latent, not current — but switching to LMDB is still
worth it for the other reasons (minimal C surface, no compaction,
mmap reads, ~40 kB of vendor code). The issue body was rewritten to
reflect this, and a follow-up can move DB reads off BeamNode.mutex
to realise the SWMR advantage once both backends have been exercised
in CI.

Test plan

  • zig build test --summary all — 151/151 pass (includes the new
    LMDB parity tests in pkgs/database/src/database.zig that run
    the save/load block, batched commit and
    loadLatestFinalizedState flows through both engines).
  • zig build simtest --summary all — 15/15 pass.
  • zig fmt --check .
  • cargo fmt --manifest-path rust/Cargo.toml --all -- --check
  • cargo clippy --manifest-path rust/Cargo.toml --workspace -- -D warnings
  • Soak --db-backend lmdb on devnet (follow-up).
  • Update lean-quickstart docs to mention the new flag (follow-up).

Snapshot of the four liblmdb source files from the OpenLDAP mdb.master
branch, pinned at commit 40d3741b7d40ba4c75cb91dd9987ce692d376d71
(2026-01-13). License is OpenLDAP Public License v2.8 (see LICENSE).

Subsequent commits on this branch add the Zig build glue, bindings, and
a tagged-union Db dispatcher so rocksdb and lmdb can be selected at
startup via --db-backend. See #782.
Thin wrapper over the vendored liblmdb exposing just enough primitives
for pkgs/database to back a RocksDB-shaped key/value API:

- Env: open / close / sync, configurable map size / max dbs / readers.
- Txn: begin (ro or rw) / commit / abort, plus openDbi.
- Get / put / delete.
- Cursor: first / last / next / prev / seekRange.
- Error translation for every MDB error code.

The module is compiled as a Zig module with the C sources attached; it
is imported by @zeam/database so the vendored C is built once per
target. Two smoke tests (env lifecycle + cursor iteration) exercise the
binding end-to-end.

See #782.
Refactor `database.Db` from a bare alias over the RocksDB wrapper into
a tagged union that dispatches to either the existing RocksDB engine
or the new LMDB engine at runtime. Every method the node uses
(`initWriteBatch`, `commit`, `loadLatestFinalizedState`, `loadBlock`,
etc.) now `switch`es on the active variant.

Backend selection is surfaced to operators through a new CLI flag:

    --db-backend <rocksdb|lmdb>   (default: rocksdb)

The flag is threaded from `NodeCommand`/`BeamCmd` through
`node.NodeOptions` into `database.Db.openBackend(...)`. The existing
`Db.open(...)` helper is kept as a rocksdb-defaulted shim so the
existing tests and the 16+ call sites in `pkgs/node` stay unchanged.

Database stale-wipe logic now computes the on-disk directory name per
backend (`rocksdb/` vs `lmdb/`) so a genesis-time mismatch cleans up
whichever engine is in use.

Parity tests are added to `pkgs/database/src/database.zig` that run
the save/load block, batched commit, and `loadLatestFinalizedState`
flows against both backends through the tagged union.

Relates to #782.
@ch4r10t33r ch4r10t33r marked this pull request as ready for review April 24, 2026 14:56
…anic

`zig build` randomizes its dependency-graph traversal to surface
non-determinism, seeded from the build-runner's auto-picked seed.
On Zig 0.15.0 `std.Random.shuffleWithIndex` has an off-by-one that
can return `buf.len` for certain seed/length combinations, causing
`constructGraphAndCheckForDependencyLoop` to panic with
`index out of bounds: index 18, len 18`.

Our build graph landed on 18 nodes on this branch (new `lmdb` module
plus its test target) and the multi-arch Docker build drew the unlucky
seed `0xf02b800c`, so every image build was crashing during graph
prepare.

Pinning `--seed 0` sidesteps the bug and also makes the image
reproducible, which is what we want inside a Dockerfile regardless.
@ch4r10t33r ch4r10t33r changed the title database, cli: pluggable rocksdb/lmdb backend (#782) feat: pluggable rocksdb/lmdb backend (#782) Apr 24, 2026
Moves the vendored liblmdb C sources and the Zig wrapper out of
zeam and into a standalone repo:

  https://github.com/blockblaz/lmdb-zig (v0.1.0, commit 8e1f544)

zeam now depends on it through build.zig.zon like any other Zig
dep. No behaviour change: the wrapper API and vendored liblmdb
commit are identical to what lived in pkgs/lmdb. The standalone
repo lets other Zig projects reuse the bindings without pulling
in zeam.

- build.zig.zon: add lmdb dependency pinned to 8e1f544.
- build.zig: replace the in-tree module + C source wiring with
  b.dependency("lmdb", ...).module("lmdb").
- pkgs/lmdb: deleted; now lives upstream.
@ch4r10t33r ch4r10t33r requested a review from anshalshukla April 24, 2026 16:34
Copy link
Copy Markdown
Contributor

@zclawz zclawz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review

Solid approach overall — tagged-union dispatch via inline else, SWMR model matches zeam's writer-serialised threading, and the API surface mirrors RocksDB cleanly. Three blocking issues and several non-blocking notes.


🔴 Silent write-batch failure — caller never knows

initWriteBatch is infallible (to match the RocksDB signature), so when env.beginTxn(false) fails it returns a batch with txn = null. Every subsequent put/delete silently no-ops (self.txn orelse return). When the caller calls db.commit(&batch), the null-txn path also no-ops. Result: the node believes the batch committed but no data was written. This is worse than an error — the caller (e.g. processFinalizationAdvancement) cannot detect the failure and continues as if state was persisted.

One approach: make initWriteBatch fallible, or store a failed: bool in WriteBatch and have Db.commit return an error if batch.failed.


🔴 commit swallows errors silently

pub fn commit(self: *Self, batch: *WriteBatch) void {
    if (batch.txn) |*txn| {
        txn.commit() catch |err| {
            batch.logger.err("commit: txn.commit failed: {any}", .{err});
        };
        batch.txn = null;
    }
}

Db.commit is void (matching RocksDB). A commit failure at the LMDB level is logged and then discarded. The chain continues believing the write committed. For a consensus-critical write (finalization advancement, state persistence) this is unacceptable. Either propagate the error (requires changing the Db.commit signature everywhere — probably correct) or at minimum set batch.txn = null only on success and let the batch linger in a failed state that the caller can detect.


🔴 realpathAlloc in tests — fails on Zig 0.16.0

const data_dir = try tmp_dir.dir.realpathAlloc(allocator, ".");

realpathAlloc was removed in Zig 0.16.0 (just fixed in several upstream deps in other PRs this week). Since zeam is currently on 0.15.2 this compiles, but zeam is actively upgrading. Use a fixed-size buffer instead:

var path_buf: [std.fs.max_path_bytes]u8 = undefined;
const data_dir = try tmp_dir.dir.realpath(".", &path_buf);

⚠️ Backend mismatch at restart goes undetected

If a node runs with --db-backend rocksdb (default), accumulates chain data in {data_dir}/rocksdb, then is restarted with --db-backend lmdb, wipeAndReopenDb deletes {data_dir}/lmdb (may not exist → ignore_not_found swallows the miss) and opens a fresh LMDB env. The node starts from genesis while the old RocksDB data sits intact at {data_dir}/rocksdb. There is no "you switched backends; your existing data is in the other engine" warning. At minimum, log a warning when the non-selected backend directory exists.


⚠️ DEFAULT_MAP_SIZE = 1 TiB is too large for multi-node setups

LMDB reserves the full 1 TiB of virtual address space at env.open time. Each Db.openBackend call (the beam mock network opens 3) consumes 3 TiB of VA space. On Linux this is purely virtual and does not consume RAM or disk, but it can exhaust the 128 TiB of user-space VA on machines running many processes, and may fail outright on systems with vm.overcommit_memory = 2. A more conservative default (say 256 GiB) or — better — implement grow-on-MDB_MAP_FULL retry would be safer.


⚠️ deleteFilesInRange allocates a key buffer unnecessarily

The comment says "mutating during cursor iteration is less obviously correct" but LMDB explicitly supports deleting the current cursor position via MDB_CURRENT flag. The current approach collects all keys into an ArrayList, then deletes them in a second pass — doubling allocations. This is not a correctness issue but costs memory for large ranges (e.g., pruning old unfinalized slot indices).


✅ Good things

  • inline else dispatch is clean and DRY ✅
  • WriteBatch.deinit correctly aborts the live txn ✅
  • assert(batch.* == .rocksdb/.lmdb) guards cross-backend mismatches in debug builds ✅
  • Iterator copies key/value out of the mmap before returning — correct (LMDB pointers are only valid until the next operation) ✅
  • get copies the value out of the mmap before closing the txn ✅
  • flush comment correctly notes it is a no-op most of the time ✅
  • Tagged union tests exercise both backends with the same helper functions ✅
  • --seed 0 in Dockerfile for reproducible builds ✅

Addresses review feedback on #783:

- initWriteBatch / Db.commit are now fallible. RocksDB and LMDB both
  return !void from commit, and Db.commit propagates so callers
  (processFinalizationAdvancement, updateBlockDb, etc.) can detect a
  failed persistence rather than silently continuing.
- LMDB WriteBatch carries a sticky `failed` flag. Any mid-batch
  put/delete error sets it; commit refuses to commit a poisoned batch,
  aborts the txn, and surfaces error.LmdbBatch.
- DEFAULT_MAP_SIZE dropped from 1 TiB to 256 GiB. 1 TiB x N parallel
  env opens exhausts user-space VA on crowded hosts and trips
  vm.overcommit_memory=2.
- On open, warn if the non-selected backend's data directory already
  contains entries. Silently starting from genesis on a newly chosen
  engine was an invisible footgun after a --db-backend switch.
- Parity tests in database.zig swap realpathAlloc for a fixed-buffer
  realpath so they keep compiling on Zig 0.16 (realpathAlloc was
  removed upstream).

Propagation updates:
- pkgs/node/src/chain.zig: `try` at the two existing batch/commit
  sites (updateBlockDb, processFinalizationAdvancement).
- pkgs/database/src/rocksdb.zig tests that exercise Db (the tagged
  union) get `try` on initWriteBatch/commit too.

All 151 unit tests pass; simtest (15/15) green.
@ch4r10t33r
Copy link
Copy Markdown
Contributor Author

Thanks for the careful review. Pushed 4ae901e addressing everything except #6 (tracked as a follow-up).

🔴 #1 + #2 — silent batch failure / swallowed commit error

Both initWriteBatch and Db.commit are now fallible end-to-end:

  • RocksDbBackend.commit, LmdbBackend.commit, and Db.commit all return !void now. The RocksDB arm just trys callRocksDB (no more catch |err| { log; return }); the LMDB arm propagates either error.LmdbBatch (batch poisoned — see below) or error.LmdbCommit (mdb_txn_commit failed).
  • LmdbBackend.initWriteBatch is fallible too, returning error.LmdbBatch if beginTxn(false) fails. This removes the whole "txn: null and silently no-op" footgun.
  • LmdbBackend.WriteBatch gained a sticky failed: bool flag. Every queued put / delete / deleteRange that fails sets it. commit then aborts the txn instead of committing a partial batch and surfaces error.LmdbBatch. That covers the deeper version of the concern — even if initWriteBatch succeeds, a later MDB_MAP_FULL on put can't sneak through as a "successful" commit.

Propagation at the call sites:

  • pkgs/node/src/chain.zigupdateBlockDb and processFinalizationAdvancement both already return !void, so try self.db.initWriteBatch() / try self.db.commit(&batch).
  • pkgs/database/src/rocksdb.zig tests that exercise Db (the tagged union) get try on both calls; the direct-RocksDB tests stay as-is since RocksDB.initWriteBatch is still infallible (rocksdb.WriteBatch.init() has no error path).

🔴 #3realpathAlloc removed in Zig 0.16

Fixed in the new parity tests in database.zig:

var path_buf: [std.fs.max_path_bytes]u8 = undefined;
const data_dir = try tmp_dir.dir.realpath(".", &path_buf);

(Pre-existing realpathAlloc call sites in rocksdb.zig are left alone to keep this PR's diff focused; happy to sweep them in a separate PR alongside the 0.16 upgrade if you'd like.)

⚠️ #4 — backend-switch warning

Added warnIfOtherBackendPopulated next to Db.openBackend. On open it checks {data_dir}/{other_backend} and, if that directory exists and has entries, logs:

detected existing lmdb data at .../lmdb while starting with --db-backend rocksdb;
the previous engine's data is NOT migrated. Delete .../lmdb or switch back to
--db-backend lmdb to re-use it.

Best-effort — any filesystem error is silently ignored so a permission quirk can't block startup.

⚠️ #5 — 1 TiB map size

Dropped to 256 GiB (256 * (1 << 30)). Explained in the updated doc-comment: comfortably beyond a lean node's working set, leaves room for mock-network multi-env tests, avoids exhausting user-space VA on crowded hosts and vm.overcommit_memory=2 boxes. Grow-on-MDB_MAP_FULL is the right long-term answer but out of scope here.

⚠️ #6 — cursor-delete via MDB_CURRENT

Acknowledged — tracking as a follow-up. The lmdb-zig wrapper (blockblaz/lmdb-zig) doesn't expose mdb_cursor_del yet; once I add Cursor.deleteCurrent upstream and bump the dep here, deleteFilesInRange / batch.deleteRange can delete in-place and drop the two-pass ArrayList(keys) buffer.

Verification

  • zig fmt --check .
  • cargo fmt --manifest-path rust/Cargo.toml --all -- --check
  • cargo clippy --manifest-path rust/Cargo.toml --workspace -- -D warnings
  • zig build test --summary all → 151/151 pass (parity tests through the tagged union for both backends) ✅
  • zig build simtest --summary all → 15/15 pass ✅

@ch4r10t33r ch4r10t33r requested a review from zclawz April 24, 2026 18:19
zclawz
zclawz previously approved these changes Apr 26, 2026
Copy link
Copy Markdown
Contributor

@zclawz zclawz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: all blocking issues resolved ✅

Checked the latest commit (4ae901e5).

🔴 → ✅ Silent write-batch failure

initWriteBatch is now fallible (!WriteBatch) for the LMDB backend, propagated through Db.initWriteBatch with try. The failed: bool sticky flag on WriteBatch poisons mid-batch and causes commit to abort + return error.LmdbBatch. No silent data loss path remains.

🔴 → ✅ commit swallows errors

Db.commit is now !void. Both backends propagate:

  • LMDB: commit checks batch.failed, aborts, returns error.LmdbBatch if poisoned, otherwise try txn.commit()
  • RocksDB: changed from catch { log; return } to try callRocksDB(...) — errors now surface to the caller

Call sites in chain.zig (updateBlockDb, processFinalizationAdvancement) now use try, so a persistence failure surfaces immediately rather than letting the chain advance on phantom writes.

🔴 → ✅ realpathAlloc in tests

All three test helpers (testSaveAndLoadBlock, testBatchWriteAndCommit, testLoadLatestFinalizedStateHappyPath) now use a fixed-size buffer + dir.realpath(".", &path_buf). Compiles on Zig 0.16.0.

⚠️ → ✅ Backend mismatch at restart

warnIfOtherBackendPopulated walks the other engine's directory; if it contains any entries it logs a warn-level message naming both paths and advising the operator to delete or switch back. Called from openBackend. Diagnostic-only (no correctness barrier) which is the right trade-off — forcibly blocking would break dev workflows.

⚠️ → ✅ Default map size

DEFAULT_MAP_SIZE reduced from 1 TiB (1 << 40) to 256 GiB (256 * (1 << 30)). Comment updated to explain the vm.overcommit_memory = 2 and multi-node VA exhaustion concern.


No new issues found. LGTM.

Resolve build.zig.zon: keep lmdb (PR) and thread_pool (main).
Copy link
Copy Markdown
Member

@g11tech g11tech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good can you ask for @anshalshukla 's review on this as well

@ch4r10t33r ch4r10t33r merged commit 33ea6e4 into main Apr 30, 2026
13 checks passed
@ch4r10t33r ch4r10t33r deleted the feat/lmdb-backend branch April 30, 2026 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add LMDB backend with pluggable database backend selection

3 participants