Skip to content

Release v0.2.0#206

Merged
JoshuaChi merged 65 commits intomainfrom
develop
Dec 31, 2025
Merged

Release v0.2.0#206
JoshuaChi merged 65 commits intomainfrom
develop

Conversation

@JoshuaChi
Copy link
Copy Markdown
Contributor

@JoshuaChi JoshuaChi commented Dec 13, 2025

Type

  • New feature
  • Bug Fix

Description

Release v0.2.0

Related Issues

This release includes work from the following issues:

Checklist

  • The code has been tested locally (unit test or integration test)
  • Squash down commits to one or two logical commits which clearly describe the work you've done.

Summary by CodeRabbit

  • New Features

    • Watch API for real-time key-change streams; TTL/lease-based automatic key expiry; embedded mode and local in-process KV client; KV client adds TTL writes and watch support plus leader discovery.
  • Documentation

    • v0.2.0 changelog, comprehensive migration guide, rewritten README with new modes, examples and quickstart.
  • Chores

    • Workspace/workflow reorganization, weekly benchmark workflow and reports, expanded Makefile targets, CI and lint/config updates, new ignore patterns.
  • Tests

    • Large suite of new unit/integration tests and mock utilities covering client, raft, storage and replication.

✏️ Tip: You can customize this high-level summary in your review settings.

…y log entry. feat #66: Refactor the Errors to distinguish between protocol logic errors and system-level errors.

* feat #43: When a Raft leader is elected, it should first send an empty log entry. 
1. implement verify_leadership_in_new_term 
2. refactor verify_leadership_in_new_term and enforce_quorum_consensus

* feat #66: Refactor the Errors to distinguish between protocol logic errors and system-level errors.
* feat #71: add client usage example

* feat #59: Benchmarking against etcd v3.5, Enhance examples/three-nodes-cluster with Makefile

* fix: #70 Node Logs Exiting Program After Stress Test — Unexpected or Simulated Shutdown
…r-friendly StorageEngine and StateMachine

* feat #59: Refactor state_machine_commit_listener into a separate function to improve visibility across all newly spawned threads.

* feat #79: Add snapshot feature

* fix #90: Refactor: Decouple Client Command Protocol from Raft Internal Log Payload Type

* feat #89: Add auto-discovery support for new learner nodes

* feat #101: Revised learner join process with promotion/fail semantics

* feat #45: Implemented first/last index for term to resolve replication conflicts

* fix #106: Retry leadership noop confirmation until timeout

* feat #102: Automatic node removal 

* feat #109: Enable RPC connection cache

* fix #107: 1. Add back rpc connection params 2. Change parking lot sync lock to Tokio async lock in RaftMembership 3. optimize MembershipGuard for performance. Remove read locker.

* feat #110: 1. add new events SnapshotCreated and LogPurgeCompleted 2. decouple Leader snapshot create to several events 3. refactor raft log index generation from pre_allocate_raft_logs_next_index to pre_allocate_id_range 4. optimize LocalLogBatch structure - based on performance analysis

* feat #119: Make the StorageEngine  trait more developer-friendly

* feat #120: Make the StateMachine trait more developer-friendly

* fix #121: Election: higher term vote request should be updated immediately and reset voted_for

* feat #122: Refactor compress and decompress snapshot logic from StateMachine to StateMachineHandler

* fix #123: Replace global tracing log with trace_test crate

* feat #125: Make rocksdb adaptor as feature. Move sled implementation as example
…tart when path is already locked by another node
* refs #138: optimize to use long lived peer tasks for append entries request only

* refs #138: try to fix race issue - Multiple threads can pass the initial check and all create new appenders

* feat #138: 1. introduce crossbeam 2. optimize buffered_raft_log by creating a flush worker pool with configurable number of workers to handle persistence operations

* feat #138: optimize flush_workers as configure

* feat #138: optimize channel_capacity into configure

* feat #139: work on a reliable monitoring metric for resource utilization

* feat #140: Optimize proto bytes fields to use bytes::Bytes

* feat #140: switch Into<Bytes> to AsRef<[u8]>

* fix #141: Optimize RocksDB write path and Raft log loop for lower latency

* feat 142: Add Read Consistency Policy Support and Lease-Based Read Optimization

* feat #142: update v0.1.4 bench report

* feat #142: add client doc guide

* feat #143: Refactor gRPC compression configuration for Raft transport (performance optimization) - step1: disable server side client rpc compress response

* feat #143: Refactor gRPC compression configuration for Raft transport (performance optimization)

* feat #143: add performance diagram

* feat #138: ready to publish v0.1.4

* feat #138: add ut

* fix #146: buffered_raft_log.rs Code Reviews

* fix #146: update bench report

* fix #145: Bug: Undefined Behavior from Incorrect Lifetime Conversion in Mmap Zero-Copy Path

* fix #148: fix rocksdb storage engine flush bug

* fix #147: fix file state machine bug
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Dec 13, 2025

Important

Review skipped

Too many files!

48 files out of 198 files are above the max files limit of 150.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Workspace-wide v0.2.0 migration: adds workspace Cargo layout and new crates, introduces a unified client crate and KvClient trait with GrpcKvClient (put_with_ttl, watch), moves protobuf imports to d_engine_proto, adds lease/watch/TTL plumbing and Raft/replication/raft-log/lease traits, broadens many visibilities, adds tests, benchmarks, CI/workflow and tooling updates, and adds .claudeignore. No breaking runtime protocol changes beyond documented API/signature adjustments.

Changes

Cohort / File(s) Summary
Ignore & Git config
\.claudeignore, \.gitignore
Add .claudeignore with common ephemeral/build patterns; add *.bak* and *.claudeignore to .gitignore.
CI / Workflows
\.github/workflows/benchmark.yml, \.github/workflows/ci.yml, \.github/workflows/commit-message-check.yml, \.github/workflows/dependency-audit.yml
Add weekly benchmark workflow; refactor CI toolchain/caching to dtolnay/rust-toolchain + rust-cache; broaden commit-type whitelist; update dependency-audit step.
Tooling & Make
Makefile, clippy.toml, cliff.toml
Replace Makefile with comprehensive workspace targets; add workspace clippy config; extend commit-parser rule.
Workspace manifests & benches
Cargo.toml, d-engine-core/Cargo.toml, d-engine-client/Cargo.toml, benches/**/Cargo.toml, benches/**/rustfmt.toml
Convert to workspace Cargo.toml; add d-engine-core and d-engine-client manifests; update bench crate to v0.2.0 and edition 2024.
Docs & Release
README.md, CHANGELOG.md, MIGRATION_GUIDE.md
Add v0.2.0 changelog, migration guide and large README updates describing new features and migration notes.
Bench reports & artifacts
benches/d-engine-bench/reports/*
Add new and remeasured benchmark reports for v0.2.0 and v0.1.4.
Client crate (d-engine-client)
d-engine-client/Cargo.toml, d-engine-client/src/**
New client crate exposing Client, ClientBuilder; add KvClient trait; rename KvClient→GrpcKvClient with new methods put_with_ttl and watch; refactor error types to use ErrorCode in ClientApiError; update connection pool to use ClusterMembership and current_leader_id; add mocks and many tests.
Client tests & bench binary
d-engine-client/src/*_test.rs, benches/d-engine-bench/src/main.rs
Add/adjust tests (pool, proto ext, kv, utils), add mock RPC services and MockNode helpers; bench binary adds key_space CLI and client pool changes.
Core crate (d-engine-core)
d-engine-core/Cargo.toml, d-engine-core/src/**
New core crate manifest; systematically migrate proto imports to d_engine_proto; add Lease and RaftLog traits; broaden many visibilities to public; add WatchConfig/LeaseConfig, Membership trait, leader-change notifications, ClusterMetadata cache; large raft/election/replication/state-machine/storage refactors and many new/updated tests.
State machine & watch integration
d-engine-core/src/state_machine_handler/*, d-engine-core/src/storage/*
Add optional WatchManager wiring and notify_watchers; thread snapshot_dir_prefix; add lease hooks and Lease trait; snapshot persistence hooks and updated state-machine handler APIs.
Raft / Election / Replication
d-engine-core/src/election/*, d-engine-core/src/replication/*, d-engine-core/src/raft*.rs, d-engine-core/src/raft_role/*
Single-node early-win election, VotedFor committed semantics, replication topology-aware fast path (single-node), SharedState hot-path current_leader atomic APIs, many exported role/leader APIs and visibility expansions.
Storage & RaftLog
d-engine-core/src/storage/raft_log.rs, d-engine-core/src/storage/*
Add public RaftLog trait surface; reorganize storage module (snapshot path manager, state_machine, storage_engine), re-export lease/state-machine/storage_engine.
Test infra & test-utils
d-engine-core/src/test_utils/*, many *_test.rs, client mocks
Add/expand tests and test helpers, broaden automock gating to any(test, feature = "test-utils"), expose test helpers and mocks, add many new test modules.
Proto import migrations
many d-engine-core and d-engine-client files
Systematic replacement of crate::proto::* imports with d_engine_proto::* and corresponding type/adaptor updates.
New repo files & reports
clippy.toml, .claudeignore, cliff.toml, bench reports
Add workspace lint config, ignore file, commit-parser rule, and benchmark artifacts/reports.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant App as Application
    participant Client as d-engine Client
    participant Pool as ConnectionPool
    participant CM as ClusterManagementService (gRPC)
    participant Leader as Leader Node (gRPC)

    App->>Client: Client::builder(endpoints)
    Client->>Pool: build_connections()
    Pool->>CM: get_cluster_metadata()
    CM-->>Pool: ClusterMembership{nodes, current_leader_id}
    Pool->>Pool: parse_cluster_metadata() -> leader addr
    Pool-->>Client: client ready (kv + cluster clients)

    App->>Client: kv().put(key, val) / put_with_ttl(...)
    Client->>Leader: ClientWriteRequest (gRPC)
    Leader-->>Client: ClientResponse (success/error)
    Client->>App: return result (mapped to KvClientError/KvResult)

    App->>Client: kv().watch(key)
    Client->>Leader: WatchRequest (stream)
    Leader-->>Client: streaming WatchResponse events
    Client->>App: stream events to caller
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Areas to focus during review:

  • Workspace & proto import migration: check for mixed/incorrect proto types and missing conversions.
  • Public API expansions (visibility promotions): verify intended exposure and backward-compat guarantees.
  • Client API changes: KvClient trait, GrpcKvClient rename, put_with_ttl, watch, and error mapping flow.
  • Watch integration and gRPC streaming: WatchManager wiring, notify_watchers correctness, stream lifecycle and shutdown behavior.
  • Lease/TTL semantics and persistence: Lease trait, to_snapshot/reload semantics and interactions with state machine purge logic.
  • Concurrency and atomic hot-paths: SharedState current_leader, ConnectionPool refresh/thread-safety, leader-change listeners.
  • Large test additions and automock gating: ensure test-utils feature does not leak behavior into production builds.
  • CI/Makefile/workflow changes: verify caching, toolchain install, and artifact upload semantics.

Possibly related PRs

Poem

🐇 I hopped through code and found a stream,

Watches whisper, TTLs softly gleam,
Clients call leaders, pools align,
Workspaces tidy, tests all shine —
v0.2.0: carrots in the codebase dream!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Release v0.2.0' is a clear, concise single sentence that directly describes the main objective of the pull request.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This release (v0.2.0) represents a major refactoring of the d-engine-core codebase, focusing on improved modularity, visibility control, and integration with the extracted d-engine-proto package. The changes establish clearer boundaries between internal implementation and public API while maintaining backward compatibility for test infrastructure.

Key changes:

  • Extracted protocol buffer definitions to separate d-engine-proto package
  • Enhanced visibility controls with pub/pub(crate) annotations for better API boundaries
  • Improved test infrastructure with test-utils feature flag
  • Added watch functionality for client notifications
  • Refactored storage layer with explicit safety contracts

Reviewed changes

Copilot reviewed 106 out of 373 changed files in this pull request and generated no comments.

Show a summary per file
File Description
d-engine-core/src/test_utils/mock/* Updated mock utilities to use extracted proto definitions and improved visibility
d-engine-core/src/storage/* Added comprehensive documentation and safety contracts for storage traits
d-engine-core/src/state_machine_handler/* Refactored to use d-engine-proto and added watch notification support
d-engine-core/src/raft_role/* Enhanced leader change notifications and atomic leader tracking
d-engine-core/src/replication/* Added single-node cluster optimization and improved error handling
d-engine-core/src/network/* Refactored transport layer with extracted proto definitions
d-engine-core/src/config/* Updated configuration to use d-engine-proto types
d-engine-core/src/lib.rs Restructured exports for clearer public API surface

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (13)
d-engine-core/src/storage/storage_test.rs (1)

39-60: Update constant references to use imported NodeRole variants.

The code references FOLLOWER and LEARNER constants that are no longer imported. Based on the imports at lines 6-7, these should be updated to use the NodeRole enum variants.

Apply this diff to fix the role assignments:

             NodeMeta {
                 id: 2,
                 address: "127.0.0.1:10000".to_string(),
-                role: FOLLOWER,
+                role: Follower as i32,
                 status: NodeStatus::Active.into(),
             },
             NodeMeta {
                 id: 3,
                 address: "127.0.0.1:10000".to_string(),
-                role: FOLLOWER,
+                role: Follower as i32,
                 status: NodeStatus::Active.into(),
             },
             NodeMeta {
                 id: 4,
                 address: "127.0.0.1:10000".to_string(),
-                role: LEARNER,
+                role: Learner as i32,
                 status: NodeStatus::Active.into(),
             },
             NodeMeta {
                 id: 5,
                 address: "127.0.0.1:10000".to_string(),
-                role: LEARNER,
+                role: Learner as i32,
                 status: NodeStatus::Active.into(),
             },
README.md (3)

57-66: Fix markdown: stray Rust use ... line is outside the fenced code block.
use d-engine::{RaftCore, MemoryStorage, Config}; appears outside a ```rust fence, which will render incorrectly (and the crate path also looks suspicious with the hyphen in code).


208-211: Typos/links: “d-eninge” and Apache URL.
Fix the project name typo and consider using https for the Apache license link.


91-100: Consider standardizing import paths in README examples for consistency.

The RocksDB example (line 96) uses use d_engine::{...} while earlier examples use use d_engine_server::{...}. Both import paths work correctly since d_engine re-exports all types from d_engine_server under the server and rocksdb features. However, standardizing to one approach throughout the documentation would improve clarity and reduce potential confusion for users.

d-engine-core/src/test_utils/mock/mock_rpc.rs (1)

141-151: Fix copy/paste error message in discover_leader (misleading debugging).
The None case returns "No mock get_cluster_metadata response set" inside discover_leader; this should reference discover_leader.

-            None => Err(tonic::Status::unknown(
-                "No mock get_cluster_metadata response set",
-            )),
+            None => Err(tonic::Status::unknown(
+                "No mock discover_leader response set",
+            )),
d-engine-core/src/state_machine_handler/default_state_machine_handler.rs (1)

1197-1209: Hardcoded offset in parse_snapshot_dirname doesn't match dynamic prefix length.

Line 1209 uses a hardcoded offset of 9 ("snapshot-".len() = 9), but the function now accepts a dynamic snapshot_dir_prefix parameter. If the prefix differs from "snapshot-", the parsing will be incorrect.

-    // Remove fixed parts
-    let core = &name[9..name.len()]; // "snapshot-".len() = 9,
+    // Remove the prefix (e.g., "snapshot-")
+    let core = &name[snapshot_dir_prefix.len()..];
d-engine-core/src/maybe_clone_oneshot.rs (2)

151-161: Potential panic in Clone implementation when test_inner is None.

Line 158 uses .unwrap() on self.test_inner.as_ref(), which will panic if test_inner is None. Although this code is gated by test-utils, the panic could occur if someone constructs a MaybeCloneOneshotReceiver with test_inner: None and then tries to clone it.

 #[cfg(any(test, feature = "test-utils"))]
 impl<T: Send + Clone> Clone for MaybeCloneOneshotReceiver<T> {
     fn clone(&self) -> Self {
         let (_, receiver) = oneshot::channel();
 
         Self {
             inner: receiver,
-            test_inner: Some(self.test_inner.as_ref().unwrap().resubscribe()),
+            test_inner: self.test_inner.as_ref().map(|rx| rx.resubscribe()),
         }
     }
 }

119-138: Busy-wait polling pattern causes excessive CPU usage.

The Future::poll implementation at line 124 calls cx.waker().wake_by_ref() when the channel is empty, creating a busy-wait loop that will continuously poll without yielding to the executor properly. This can cause 100% CPU usage.

The proper pattern would be to use tokio::pin! with the async recv() method, or implement a proper waker registration. For a quick fix in test-only code:

                 Err(broadcast::error::TryRecvError::Empty) => {
-                    // Register a Waker to wake up the task when data arrives
-                    cx.waker().wake_by_ref();
-                    Poll::Pending
+                    // Broadcast channels don't support proper waker registration via try_recv.
+                    // This is a limitation in test-only code; consider using recv().await pattern
+                    // in async contexts instead.
+                    cx.waker().wake_by_ref();
+                    Poll::Pending
                 }

Consider adding a doc comment warning about this behavior, or restructuring to avoid Future impl entirely in test mode.

d-engine-core/src/network/backgroup_snapshot_transfer_test.rs (1)

185-189: Comment is inconsistent with code.

The comment states "2 chunks * 5KB = 10KB" but line 156 creates 3 chunks: create_snapshot_stream(3, 5 * 1024). Update the comment to reflect the actual test data.

-        // Calculate expected minimum time:
-        // Total data = 2 chunks * 5KB = 10KB = 80,000 bits
-        // Bandwidth = 1 Mbps = 1,000,000 bps
-        // Minimum time = 80,000 / 1,000,000 = 0.08 seconds
+        // Calculate expected minimum time:
+        // Total data = 3 chunks * 5KB = 15KB = 120,000 bits
+        // Bandwidth = 1 Mbps = 1,000,000 bps
+        // Minimum time = 120,000 / 1,000,000 = 0.12 seconds
d-engine-core/src/raft_role/candidate_state.rs (1)

483-493: Incorrect role in error message: "Learner" should be "Candidate".

This appears to be a copy-paste error. The current_role and context message incorrectly identify the node as a "Learner" when it's actually a "Candidate".

             RaftEvent::LogPurgeCompleted(_purged_id) => {
                 return Err(ConsensusError::RoleViolation {
-                    current_role: "Learner",
+                    current_role: "Candidate",
                     required_role: "Leader",
                     context: format!(
-                        "Learner node {} should not receive LogPurgeCompleted event.",
+                        "Candidate node {} should not receive LogPurgeCompleted event.",
                         ctx.node_id
                     ),
                 }
                 .into());
             }
d-engine-core/src/raft_role/mod.rs (1)

133-136: Minor: fix log typo (“wtih”).

-        debug!(
-            "New Shared State wtih, hard_state_from_db:{:?}, last_applied_index_option:{:?} ",
+        debug!(
+            "New Shared State with, hard_state_from_db:{:?}, last_applied_index_option:{:?} ",
             &hard_state_from_db, &last_applied_index_option
         );
d-engine-client/src/pool.rs (1)

153-187: Normalize bootstrap endpoint addresses before dialing (avoid scheme-related hard failures).
You already normalize NodeMeta.address via address_str, but endpoints go straight into Endpoint::try_from. If configs commonly use 127.0.0.1:50051, pool creation will fail immediately.

 pub(super) async fn load_cluster_metadata(
     endpoints: &[String],
     config: &ClientConfig,
 ) -> std::result::Result<ClusterMembership, ClientApiError> {
     for addr in endpoints {
-        match Self::create_channel(addr.clone(), config).await {
+        let addr = address_str(addr);
+        match Self::create_channel(addr.clone(), config).await {
             Ok(channel) => {
d-engine-client/src/error.rs (1)

150-204: Preserve server error messages in the default Status mapping.
Right now the _ arm drops status.message() and replaces it with “Unhandled status code…”, which makes debugging harder.

-            _ => Self::Business {
-                code: ErrorCode::Uncategorized,
-                message: format!("Unhandled status code: {code:?}"),
-                required_action: None,
-            },
+            _ => Self::Business {
+                code: ErrorCode::Uncategorized,
+                message: format!("Unhandled status code: {code:?}: {message}"),
+                required_action: None,
+            },
         }
🧹 Nitpick comments (46)
d-engine-core/src/maybe_clone_oneshot_test.rs (2)

1-227: Consider adding error case tests for completeness.

While the happy path coverage is excellent, consider adding tests for error/failure scenarios to ensure robust behavior:

  • Sending after all receivers are dropped
  • Receiving after sender is dropped
  • Double-send on the oneshot channel (if applicable)
  • Timeout scenarios or cancellation

These tests would validate that MaybeCloneOneshot handles edge cases and failures gracefully.


49-50: Optional: Simplify variable handling.

The intermediate rx1_mut and rx2_mut variables are unnecessary. You could declare rx1 and rx2 as mutable directly or use them inline:

-    let mut rx1_mut = rx1;
-    let mut rx2_mut = rx2;
-    let received1 = rx1_mut.recv().await.expect("Failed to receive rx1");
-    let received2 = rx2_mut.recv().await.expect("Failed to receive rx2");
+    let received1 = rx1.recv().await.expect("Failed to receive rx1");
+    let received2 = rx2.recv().await.expect("Failed to receive rx2");
d-engine-core/src/config/config_test.rs (1)

9-12: LGTM! Unsafe block correctly guards environment variable removal.

The unsafe block is properly added as required by Rust's safety model, and the safety justification is reasonable for test code where the #[serial] attribute ensures sequential execution.

Optionally, consider refining the safety comment to be more precise:

-            // SAFETY: Test-only cleanup in single-threaded test context
+            // SAFETY: Test-only cleanup; all callers are marked #[serial] to prevent
+            // concurrent execution, and no other threads access env vars during tests
             unsafe {
                 std::env::remove_var(&key);
             }
MIGRATION_GUIDE.md (1)

20-22: Add language specifier to fenced code block.

The code fence is missing a language identifier, which improves syntax highlighting and follows Markdown best practices.

Apply this diff:

-````
+```text
 Entry fields: ..., ttl_secs: u32 (4 bytes, relative TTL)

</blockquote></details>
<details>
<summary>d-engine-client/src/utils.rs (1)</summary><blockquote>

`28-40`: **Optional: Simplify by reducing duplication.**

The function is correct and clear, but you could reduce duplication by combining the scheme-preservation cases.



Apply this diff to consolidate the logic:

```diff
 pub(crate) fn address_str(addr: &str) -> String {
-    // Detect if the address already has a scheme
-    if addr.starts_with("https://") {
-        // Preserve HTTPS scheme
-        addr.to_string()
-    } else if addr.starts_with("http://") {
-        // Preserve HTTP scheme
+    if addr.starts_with("https://") || addr.starts_with("http://") {
+        // Preserve existing scheme
         addr.to_string()
     } else {
         // No scheme present, default to HTTP
-        format!("http://{addr}",)
+        format!("http://{addr}")
     }
 }
README.md (1)

200-202: Docs claim is very strong—ensure it’s accurate or soften it.
“1000+ integration tests” and “Jepsen validation” are credibility-critical statements; if they’re not continuously true for v0.2.0, consider qualifying (e.g., “has CI integration tests and Jepsen coverage”) or linking to evidence.

d-engine-core/src/test_utils/mock/mock_rpc.rs (1)

155-194: watch() returns an immediately-closed stream—verify that’s what tests expect.
Because _tx is dropped right away, the client will observe EOF immediately. If you need a “quiet but open” stream for watch tests, store the sender on MockRpcService (or return a stream from a test-controlled receiver).

d-engine-core/src/state_machine_handler/default_state_machine_handler.rs (2)

717-737: Silent failures in notify_watchers may hide decoding issues.

When WriteCommand::decode fails at line 721, the error is silently ignored. While this is acceptable for non-command payloads, decoding failures for valid command payloads could indicate data corruption or protocol mismatches that should be logged.

Consider adding debug-level logging for decode failures:

                     // Decode the WriteCommand
-                    if let Ok(write_cmd) = WriteCommand::decode(bytes.as_ref()) {
+                    match WriteCommand::decode(bytes.as_ref()) {
+                        Ok(write_cmd) => {
                         match write_cmd.operation {
                             Some(Operation::Insert(insert)) => {
                                 watch_mgr.notify_put(insert.key, insert.value);
                             }
                             Some(Operation::Delete(delete)) => {
                                 watch_mgr.notify_delete(delete.key);
                             }
                             None => {
                                 // No operation, skip
                             }
                         }
+                        }
+                        Err(e) => {
+                            tracing::debug!("Failed to decode WriteCommand: {}", e);
+                        }
                     }

1185-1194: Function name zero_copy_bytes_from_mmap is misleading since it copies data.

The function uses Bytes::copy_from_slice(slice) which performs a full memory copy, not zero-copy. The documentation and naming suggest zero-copy behavior that isn't actually implemented.

Either rename the function to reflect actual behavior, or implement true zero-copy using Bytes::from_owner:

 fn zero_copy_bytes_from_mmap(
     mmap_arc: Arc<Mmap>,
     start: usize,
     end: usize,
 ) -> Bytes {
-    // Get a slice of the memory map
-    let slice = &mmap_arc[start..end];
-    Bytes::copy_from_slice(slice)
+    // True zero-copy: use Bytes::from_owner to share the Arc<Mmap>
+    let len = end - start;
+    Bytes::from_owner(mmap_arc).slice(start..end)
 }

Alternatively, rename to bytes_from_mmap to avoid the misleading "zero_copy" prefix.

d-engine-core/src/storage/raft_log.rs (3)

253-260: Default wait_durable implementation may mask bugs in MemFirst implementations.

The default no-op implementation is documented for DiskFirst strategy, but implementers using MemFirst might accidentally rely on the default and violate durability guarantees. Consider making this method required (no default) to force explicit implementation.

If keeping the default, consider adding a compile-time or runtime check:

     async fn wait_durable(
         &self,
         index: u64,
     ) -> Result<()> {
-        // Default implementation for DiskFirst: no-op (already durable)
+        // Default implementation for DiskFirst: no-op (already durable)
+        // WARNING: MemFirst implementations MUST override this method
+        #[cfg(debug_assertions)]
+        tracing::trace!(index, "wait_durable: using default no-op (DiskFirst assumed)");
         let _ = index;
         Ok(())
     }

46-55: Inconsistency between first_entry_id return value and is_empty check.

The doc states first_entry_id() returns 0 if log is empty, but is_empty() at line 91 states it's equivalent to (first_entry_id() == 0). However, after a purge operation, first_entry_id() could be non-zero while the log is technically not empty. The documentation should clarify the distinction between "empty log" and "log with entries purged".

Consider clarifying the documentation:

     /// Returns the smallest log index (inclusive).
     ///
     /// # Returns
-    /// - `0` if log is empty
+    /// - `0` if log is empty (never written to)
     /// - First index otherwise (typically 1 after initialization)
+    /// - After purge: first non-purged index (NOT 0)

374-377: purge_logs_up_to takes LogId but only needs index.

The method accepts a full LogId but the documentation at line 357 says "LogId of last entry to purge" while the actual need is just the index. Consider whether term validation is needed here or if just u64 would suffice.

If term validation isn't required:

-    async fn purge_logs_up_to(
-        &self,
-        cutoff_index: LogId,
-    ) -> Result<()>;
+    async fn purge_logs_up_to(
+        &self,
+        cutoff_index: u64,
+    ) -> Result<()>;

Or add documentation explaining why LogId is needed.

d-engine-core/src/membership.rs (1)

17-32: ConnectionType::all() allocates a new Vec on each call.

For a frequently-called method returning static data, consider using a const array or lazy_static.

 impl ConnectionType {
-    pub fn all() -> Vec<ConnectionType> {
-        vec![
-            ConnectionType::Control,
-            ConnectionType::Data,
-            ConnectionType::Bulk,
-        ]
-    }
+    pub const ALL: &'static [ConnectionType] = &[
+        ConnectionType::Control,
+        ConnectionType::Data,
+        ConnectionType::Bulk,
+    ];
+    
+    pub fn all() -> &'static [ConnectionType] {
+        Self::ALL
+    }
 }
d-engine-core/src/raft_test.rs (1)

7-62: Channel tests verify basic mpsc behavior rather than Raft logic.

These tests (test_leader_change_listener_registration, test_multiple_listeners, test_no_leader_notification, test_channel_closed) primarily validate tokio::sync::mpsc channel semantics rather than Raft-specific code. While they serve as documentation of expected behavior, consider whether they add sufficient value given they test standard library functionality.

d-engine-client/Cargo.toml (1)

29-32: Consider migrating remaining dependencies to workspace.

The dependencies arc-swap, tokio-stream, and async-trait use explicit versions while others use workspace = true. For consistency and centralized version management, consider adding these to the workspace dependencies.

d-engine-core/src/state_machine_handler/default_state_machine_handler_test.rs (2)

312-380: Significant commented-out code block.

This ~70-line commented-out test (test_install_snapshot_case1) should either be removed if no longer needed, or restored and fixed if still valuable. Leaving large commented code blocks can cause confusion about test coverage.

Consider either:

  1. Removing this commented block if the test is obsolete
  2. Restoring and updating the test if it provides valuable coverage
  3. Converting to a tracking issue if the test needs significant rework

1527-1611: Additional large commented-out code block.

Another ~85-line commented-out helper function (mock_node_with_rpc_service) that appears to be unused. This should be cleaned up for the same reasons as the previous comment.

d-engine-core/src/storage/state_machine_test.rs (1)

346-374: Consider resetting state machine between measurements for cleaner benchmarks.

The scalability test runs the large batch (1000 entries) on a state machine that already contains 100 entries from the small batch. While this still detects O(N²) complexity, the measurements may be influenced by the existing state. For a purer comparison, consider using a fresh state machine instance for each measurement.

That said, the current approach tests realistic incremental workloads, which is also valuable.

d-engine-core/src/storage/lease.rs (2)

31-31: Consider standardizing to English in documentation.

The doc comment contains Chinese characters "(租约)". For consistency with the rest of the codebase documentation, consider using English only or adding an English translation.


114-129: get_expired_keys has non-obvious mutation semantics.

The method name uses get_ prefix but performs mutation ("Removes returned keys from internal lease indexes"). While documented, this could be surprising. Consider renaming to drain_expired_keys or take_expired_keys to make the side effect clearer in the method name.

d-engine-client/src/kv_client.rs (1)

137-172: Consider documenting the get_multi signature difference.

The get_multi method takes &[Bytes] while other methods use impl AsRef<[u8]> + Send. This is intentional (to avoid allocation issues with iterators), but a brief doc comment explaining why Bytes is required here could help implementors understand the design choice.

d-engine-client/src/pool_test.rs (1)

103-121: Test name doesn't match behavior.

The test is named test_create_channel_success but it actually tests the failure case with an invalid address. Consider renaming to test_create_channel_invalid_address_fails or similar for clarity.

 #[tokio::test]
 #[traced_test]
-async fn test_create_channel_success() {
+async fn test_create_channel_invalid_address_fails() {
     let config = ClientConfig {
         id: get_now_as_u32(),
         connect_timeout: Duration::from_millis(1000),
d-engine-core/src/storage/mod.rs (1)

12-17: Consider gating test modules behind a feature flag.

The test modules (state_machine_test, storage_engine_test) are pub mod with #[doc(hidden)]. If these are only needed for testing, consider using #[cfg(any(test, feature = "test-utils"))] to avoid including test code in production builds.

 #[doc(hidden)]
 pub use raft_log::*;
-#[doc(hidden)]
-pub mod state_machine_test;
-#[doc(hidden)]
-pub mod storage_engine_test;
+#[cfg(any(test, feature = "test-utils"))]
+#[doc(hidden)]
+pub mod state_machine_test;
+#[cfg(any(test, feature = "test-utils"))]
+#[doc(hidden)]
+pub mod storage_engine_test;
Cargo.toml (1)

50-125: Remove commented-out code.

Large blocks of commented-out configuration (old package settings, dependencies, dev-dependencies) clutter the manifest and should be removed. If historical reference is needed, git history preserves it.

d-engine-client/src/kv_error.rs (1)

56-63: Consider adding a dedicated NotLeader variant for better client handling.

Currently, ErrorCode::NotLeader maps to KvClientError::ServerError(message), which loses the semantic distinction. Clients often need to handle "not leader" specially (e.g., for retry with redirection). A dedicated variant would enable:

KvClientError::NotLeader { leader_id: Option<String>, leader_address: Option<String> }

This would allow clients to automatically redirect to the leader without parsing error messages.

d-engine-client/src/proto/client_ext.rs (1)

49-62: Consider avoiding unnecessary clone in into_read_results.

The current implementation clones the results vector via .clone().into_iter(). Since self is consumed (moved), you can use into_iter() directly on the owned data.

     fn into_read_results(self) -> std::result::Result<Vec<Option<ClientResult>>, ClientApiError> {
         self.validate_error()?;
-        match &self.success_result {
+        match self.success_result {
             Some(SuccessResult::ReadData(data)) => data
                 .results
-                .clone()
                 .into_iter()
                 .map(|item| {
                     Ok(Some(ClientResult {
                         key: item.key,
                         value: item.value,
                     }))
                 })
                 .collect(),
-            _ => {
-                let found = match &self.success_result {
+            other => {
+                let found = match &other {
                     Some(SuccessResult::WriteAck(_)) => "WriteAck",
                     None => "None",
                     _ => "Unknown",
                 };
benches/d-engine-bench/src/main.rs (1)

120-140: Consider documenting the key generation behavior when key_space is 0.

The current implementation will panic on division by zero if key_space is Some(0). Consider adding validation or documenting this edge case.

 fn generate_prefixed_key(
     sequential: bool,
     key_size: usize,
     index: u64,
     key_space: Option<u64>,
 ) -> String {
     // Apply key space limit if specified
-    let effective_index = key_space.map_or(index, |space| index % space);
+    let effective_index = key_space.map_or(index, |space| {
+        debug_assert!(space > 0, "key_space must be > 0");
+        if space == 0 { index } else { index % space }
+    });
 
     if sequential {
d-engine-core/src/event.rs (1)

191-195: Placeholder mapping for StepDownSelfRemoved should use a dedicated variant.

Mapping StepDownSelfRemoved to TestEvent::CreateSnapshotEvent as a placeholder is confusing and could cause issues in test assertions. Consider adding a dedicated TestEvent::StepDownSelfRemoved variant.

 pub enum TestEvent {
     // ... existing variants ...
     PromoteReadyLearners,
+    StepDownSelfRemoved,
 }

 // In raft_event_to_test_event:
 RaftEvent::StepDownSelfRemoved => {
-    // StepDownSelfRemoved is handled at Raft level, not converted to TestEvent
-    // This is a control flow event, not a user-facing event
-    TestEvent::CreateSnapshotEvent // Placeholder - this event won't be emitted to tests
+    TestEvent::StepDownSelfRemoved
 }
d-engine-client/src/lib.rs (2)

57-65: Consider explicit re-exports instead of wildcard exports.

Using pub use module::* for multiple modules can lead to unintended public API exposure and potential naming conflicts. Consider explicitly re-exporting only the intended public types.

For example:

pub use builder::ClientBuilder;
pub use cluster::ClusterClient;
pub use config::ClientConfig;
pub use error::ClientApiError;
// ... etc

171-191: The refresh method safely updates connection state.

The implementation correctly:

  1. Loads current state
  2. Creates a new pool with updated endpoints
  3. Atomically swaps the inner state

However, the &mut self requirement may be overly restrictive since ArcSwap::store only needs &self.

-    pub async fn refresh(
-        &mut self,
+    pub async fn refresh(
+        &self,
         new_endpoints: Option<Vec<String>>,
     ) -> std::result::Result<(), ClientApiError> {

This would allow refreshing connections without requiring mutable access to the Client, which is beneficial for shared usage patterns.

d-engine-core/src/replication/replication_handler.rs (1)

96-106: Partition logic runs unnecessarily for single-node clusters.

When is_single_node is true, replication_targets is empty, making the partition and learner trace redundant. Consider moving this block inside the multi-node path.

+    // Multi-node cluster: perform replication to peers
+    if is_single_node {
+        debug!(
+            "Single-node cluster (leader={}): logs persisted, quorum automatically achieved",
+            self.my_id
+        );
+        return Ok(AppendResults {
+            commit_quorum_achieved: true,
+            peer_updates: HashMap::new(),
+            learner_progress: HashMap::new(),
+        });
+    }
+
     // Separate Voters and Learners
     let (voters, learners): (Vec<_>, Vec<_>) = replication_targets
         .iter()
         .partition(|node| node.status == NodeStatus::Active as i32);

This avoids the empty partition operation for single-node clusters.

d-engine-core/src/raft_role/candidate_state.rs (1)

571-574: Consider defensive error handling instead of unreachable!().

While the comment indicates this should never be reached, using unreachable!() will cause a panic if the event is ever received due to a bug in the event routing logic. Consider returning an error instead for more graceful handling.

             RaftEvent::StepDownSelfRemoved => {
-                // Unreachable: handled at Raft level before reaching RoleState
-                unreachable!("StepDownSelfRemoved should be handled in Raft::run()");
+                // Should be handled at Raft level before reaching RoleState
+                return Err(ConsensusError::RoleViolation {
+                    current_role: "Candidate",
+                    required_role: "Raft",
+                    context: format!(
+                        "StepDownSelfRemoved should be handled in Raft::run(), not in CandidateState for node {}",
+                        ctx.node_id
+                    ),
+                }
+                .into());
             }
d-engine-client/src/grpc_kv_client.rs (1)

18-19: Consider removing #[allow(unused_imports)] if imports are used.

CoreKvClient and KvResult are used in the trait implementation. If KvClientError is intentionally imported for re-export or documentation purposes, consider using a more targeted allow attribute or removing the unused import.

d-engine-core/src/raft_role/mod.rs (1)

7-9: Make raft_role_test cfg consistent with test-utils-gated helpers.
Right now RaftRole::{voted_for,commit_index,match_index,next_index,follower_role_i32} can be built with feature="test-utils", but mod raft_role_test; is #[cfg(test)] only—so any feature-gated internal tests/helpers won’t compile/run under that feature setup.

-#[cfg(test)]
+#[cfg(any(test, feature = "test-utils"))]
 mod raft_role_test;

Also applies to: 306-316, 322-328, 363-366

d-engine-client/src/mock_rpc_service.rs (1)

84-90: Return Result<Channel, _> from mock_channel_with_port instead of expect.
This is a test utility, but it’s used as infrastructure—better to bubble failures up than abort the whole test process.

-    pub(crate) async fn mock_channel_with_port(port: u16) -> Channel {
-        Channel::from_shared(format!("http://127.0.0.1:{port}"))
-            .expect("valid address")
-            .connect()
-            .await
-            .expect("connection failed")
+    pub(crate) async fn mock_channel_with_port(
+        port: u16,
+    ) -> std::result::Result<Channel, tonic::transport::Error> {
+        Channel::from_shared(format!("http://127.0.0.1:{port}"))
+            .expect("valid address")
+            .connect()
+            .await
     }
d-engine-core/src/config/raft.rs (2)

538-620: Prefer an enum for LeaseConfig.cleanup_strategy (avoid stringly-typed config).
Validation is good, but this still makes downstream logic more error-prone and less discoverable (docs/IDE).


1384-1416: Gate watch-size warnings behind enabled (avoid noisy configs when feature is off).
Right now huge values warn even when watch is disabled.

-        if self.event_queue_size > 100_000 {
+        if self.enabled && self.event_queue_size > 100_000 {
             warn!( ... );
         }
@@
-        if self.watcher_buffer_size > 1000 {
+        if self.enabled && self.watcher_buffer_size > 1000 {
             warn!( ... );
         }
d-engine-client/src/error.rs (1)

55-98: Remove large blocks of commented-out code (old error enums / categories).
This adds noise and makes the new API harder to read.

Also applies to: 207-238

d-engine-client/src/mock_rpc.rs (2)

71-76: Redundant clone in error mapping.

The map_err(|e| e.clone()) is redundant since e is already moved into the closure and tonic::Status implements Clone. You can simplify this.

-            (Some(f), Some(port)) => f(port).map(tonic::Response::new).map_err(|e| e.clone()),
+            (Some(f), Some(port)) => f(port).map(tonic::Response::new),

134-142: Inconsistent watch() implementation across mock files.

This implementation returns Err(tonic::Status::unimplemented(...)), while the similar mock in d-engine-core/src/test_utils/mock/mock_rpc.rs (lines 185-193) returns an empty stream via Ok(tonic::Response::new(...)). Consider aligning these implementations for consistent test behavior.

d-engine-core/src/raft.rs (2)

163-171: Silent error handling when notifying leader change listeners.

The let _ = tx.send(...) silently ignores failures when the channel is closed. Consider logging a warning or debug message when a listener's channel is unavailable, similar to how notify_new_commit (line 420-421) logs errors.

     fn notify_leader_change(
         &self,
         leader_id: Option<u32>,
         term: u64,
     ) {
         for tx in &self.leader_change_listener {
-            let _ = tx.send((leader_id, term));
+            if let Err(e) = tx.send((leader_id, term)) {
+                tracing::debug!("Leader change listener channel closed: {:?}", e);
+            }
         }
     }

434-440: Inconsistent error handling in test notification methods.

notify_role_transition uses expect("should succeed") which will panic if the channel is closed, while notify_leader_change silently ignores errors. Consider using consistent error handling patterns across notification methods.

     pub fn notify_role_transition(&self) {
         let new_role_i32 = self.role.as_i32();
         for tx in &self.test_role_transition_listener {
-            tx.send(new_role_i32).expect("should succeed");
+            if let Err(e) = tx.send(new_role_i32) {
+                tracing::warn!("Role transition listener send failed: {:?}", e);
+            }
         }
     }
d-engine-core/src/raft_role/leader_state.rs (3)

1538-1548: Use tracing macros instead of println! in library code.

Direct println! statements in library code bypass structured logging and can interfere with applications that control stdout. Consider using tracing::info! consistently.

-            for learner_id in &ready_learners {
-                let match_index = learner_progress.get(learner_id).and_then(|mi| *mi).unwrap_or(0);
-                crate::utils::cluster_printer::print_leader_promoting_learner(
-                    self.node_id(),
-                    *learner_id,
-                    match_index,
-                    leader_commit_index,
-                );
-            }
+            for learner_id in &ready_learners {
+                let match_index = learner_progress.get(learner_id).and_then(|mi| *mi).unwrap_or(0);
+                info!(
+                    leader_id = self.node_id(),
+                    learner_id = *learner_id,
+                    match_index,
+                    leader_commit_index,
+                    "Promoting learner to voter"
+                );
+            }

2092-2093: Replace println! with structured logging.

Similar to the previous comment, this println! should use tracing macros for consistent logging behavior.

-            println!("============== Promotion successful ================");
-            println!("Now cluster members: {:?}", membership.voters().await);
+            info!(
+                members = ?membership.voters().await,
+                "Promotion successful"
+            );

2305-2311: Replace println! with tracing for stale learner notification.

This println! should use structured logging instead.

-        println!(
-            "
-            =====================
-            Learner {node_id} is stalled
-            =====================
-            ",
-        );
         info!("Learner {} is stalled", node_id);
d-engine-core/src/lib.rs (1)

21-36: Extensive wildcard re-exports expand public API surface.

Multiple pub use module::* statements significantly expand the crate's public API. This can lead to:

  1. Namespace pollution with internal types unintentionally exposed
  2. Potential naming conflicts with downstream crates
  3. Difficulty tracking API changes

Consider using explicit re-exports for the intended public API surface, or document which types from each module are part of the stable public API.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 230b095 and 9123484.

⛔ Files ignored due to path filters (11)
  • Cargo.lock is excluded by !**/*.lock
  • benches/d-engine-bench/Cargo.lock is excluded by !**/*.lock
  • d-engine-client/Cargo.lock is excluded by !**/*.lock
  • d-engine-proto/src/generated/d_engine.client.rs is excluded by !**/generated/**
  • d-engine-proto/src/generated/d_engine.common.error.rs is excluded by !**/generated/**
  • d-engine-proto/src/generated/d_engine.common.rs is excluded by !**/generated/**
  • d-engine-proto/src/generated/d_engine.error.rs is excluded by !**/generated/**
  • d-engine-proto/src/generated/d_engine.server.cluster.rs is excluded by !**/generated/**
  • d-engine-proto/src/generated/d_engine.server.election.rs is excluded by !**/generated/**
  • d-engine-proto/src/generated/d_engine.server.replication.rs is excluded by !**/generated/**
  • d-engine-proto/src/generated/d_engine.server.storage.rs is excluded by !**/generated/**
📒 Files selected for processing (103)
  • .claudeignore (1 hunks)
  • .github/workflows/benchmark.yml (1 hunks)
  • .github/workflows/ci.yml (1 hunks)
  • .github/workflows/commit-message-check.yml (1 hunks)
  • .github/workflows/dependency-audit.yml (2 hunks)
  • .gitignore (1 hunks)
  • CHANGELOG.md (1 hunks)
  • Cargo.toml (1 hunks)
  • MIGRATION_GUIDE.md (1 hunks)
  • Makefile (1 hunks)
  • README.md (5 hunks)
  • benches/d-engine-bench/Cargo.toml (1 hunks)
  • benches/d-engine-bench/reports/v0.1.4/report_20251205_remeasured.md (1 hunks)
  • benches/d-engine-bench/reports/v0.2.0/report_20251205.md (1 hunks)
  • benches/d-engine-bench/reports/v0.2.0/report_20251209.md (1 hunks)
  • benches/d-engine-bench/reports/v0.2.0/report_20251210.md (1 hunks)
  • benches/d-engine-bench/src/main.rs (9 hunks)
  • clippy.toml (1 hunks)
  • config/base/raft.toml (2 hunks)
  • d-engine-client/Cargo.toml (1 hunks)
  • d-engine-client/src/builder.rs (3 hunks)
  • d-engine-client/src/cluster.rs (2 hunks)
  • d-engine-client/src/cluster_test.rs (5 hunks)
  • d-engine-client/src/config.rs (1 hunks)
  • d-engine-client/src/error.rs (8 hunks)
  • d-engine-client/src/grpc_kv_client.rs (6 hunks)
  • d-engine-client/src/kv_client.rs (1 hunks)
  • d-engine-client/src/kv_error.rs (1 hunks)
  • d-engine-client/src/kv_test.rs (29 hunks)
  • d-engine-client/src/lib.rs (1 hunks)
  • d-engine-client/src/mock_rpc.rs (1 hunks)
  • d-engine-client/src/mock_rpc_service.rs (1 hunks)
  • d-engine-client/src/pool.rs (10 hunks)
  • d-engine-client/src/pool_test.rs (1 hunks)
  • d-engine-client/src/proto/client_ext.rs (1 hunks)
  • d-engine-client/src/proto/client_ext_test.rs (1 hunks)
  • d-engine-client/src/proto/mod.rs (1 hunks)
  • d-engine-client/src/scoped_timer.rs (2 hunks)
  • d-engine-client/src/utils.rs (1 hunks)
  • d-engine-client/src/utils_test.rs (1 hunks)
  • d-engine-core/Cargo.toml (1 hunks)
  • d-engine-core/src/commit_handler/default_commit_handler.rs (6 hunks)
  • d-engine-core/src/commit_handler/default_commit_handler_test.rs (31 hunks)
  • d-engine-core/src/commit_handler/mod.rs (1 hunks)
  • d-engine-core/src/config/cluster.rs (2 hunks)
  • d-engine-core/src/config/config_test.rs (2 hunks)
  • d-engine-core/src/config/mod.rs (1 hunks)
  • d-engine-core/src/config/network_test.rs (1 hunks)
  • d-engine-core/src/config/raft.rs (13 hunks)
  • d-engine-core/src/config/tls_test.rs (2 hunks)
  • d-engine-core/src/election/election_handler.rs (4 hunks)
  • d-engine-core/src/election/election_handler_test.rs (1 hunks)
  • d-engine-core/src/election/mod.rs (2 hunks)
  • d-engine-core/src/errors.rs (1 hunks)
  • d-engine-core/src/errors_test.rs (1 hunks)
  • d-engine-core/src/event.rs (4 hunks)
  • d-engine-core/src/lib.rs (2 hunks)
  • d-engine-core/src/maybe_clone_oneshot.rs (15 hunks)
  • d-engine-core/src/maybe_clone_oneshot_test.rs (1 hunks)
  • d-engine-core/src/membership.rs (1 hunks)
  • d-engine-core/src/network/backgroup_snapshot_transfer.rs (1 hunks)
  • d-engine-core/src/network/backgroup_snapshot_transfer_test.rs (2 hunks)
  • d-engine-core/src/network/mod.rs (7 hunks)
  • d-engine-core/src/purge/default_executor.rs (2 hunks)
  • d-engine-core/src/purge/default_executor_test.rs (1 hunks)
  • d-engine-core/src/purge/mod.rs (2 hunks)
  • d-engine-core/src/raft.rs (10 hunks)
  • d-engine-core/src/raft_context.rs (3 hunks)
  • d-engine-core/src/raft_role/candidate_state.rs (16 hunks)
  • d-engine-core/src/raft_role/follower_state.rs (15 hunks)
  • d-engine-core/src/raft_role/leader_state.rs (34 hunks)
  • d-engine-core/src/raft_role/learner_state.rs (16 hunks)
  • d-engine-core/src/raft_role/mod.rs (9 hunks)
  • d-engine-core/src/raft_role/raft_role_test.rs (1 hunks)
  • d-engine-core/src/raft_role/role_state.rs (6 hunks)
  • d-engine-core/src/raft_test.rs (1 hunks)
  • d-engine-core/src/replication/mod.rs (2 hunks)
  • d-engine-core/src/replication/replication_handler.rs (9 hunks)
  • d-engine-core/src/state_machine_handler/default_state_machine_handler.rs (12 hunks)
  • d-engine-core/src/state_machine_handler/default_state_machine_handler_test.rs (46 hunks)
  • d-engine-core/src/state_machine_handler/mod.rs (4 hunks)
  • d-engine-core/src/state_machine_handler/snapshot_assembler.rs (2 hunks)
  • d-engine-core/src/state_machine_handler/snapshot_assembler_test.rs (1 hunks)
  • d-engine-core/src/state_machine_handler/snapshot_policy/composite_test.rs (2 hunks)
  • d-engine-core/src/state_machine_handler/snapshot_policy/log_size.rs (1 hunks)
  • d-engine-core/src/state_machine_handler/snapshot_policy/log_size_test.rs (7 hunks)
  • d-engine-core/src/state_machine_handler/snapshot_policy/mod.rs (2 hunks)
  • d-engine-core/src/state_machine_handler/snapshot_policy/time_based_test.rs (4 hunks)
  • d-engine-core/src/storage/lease.rs (1 hunks)
  • d-engine-core/src/storage/mod.rs (1 hunks)
  • d-engine-core/src/storage/raft_log.rs (1 hunks)
  • d-engine-core/src/storage/snapshot_path_manager.rs (1 hunks)
  • d-engine-core/src/storage/state_machine.rs (2 hunks)
  • d-engine-core/src/storage/state_machine_test.rs (4 hunks)
  • d-engine-core/src/storage/storage_engine.rs (3 hunks)
  • d-engine-core/src/storage/storage_engine_test.rs (3 hunks)
  • d-engine-core/src/storage/storage_test.rs (1 hunks)
  • d-engine-core/src/test_utils/common.rs (4 hunks)
  • d-engine-core/src/test_utils/common_test.rs (1 hunks)
  • d-engine-core/src/test_utils/entry_builder.rs (1 hunks)
  • d-engine-core/src/test_utils/entry_builder_test.rs (1 hunks)
  • d-engine-core/src/test_utils/mock/mock_raft_builder.rs (6 hunks)
  • d-engine-core/src/test_utils/mock/mock_rpc.rs (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (22)
d-engine-client/src/proto/client_ext_test.rs (1)
d-engine-client/src/error.rs (2)
  • code (401-409)
  • message (412-420)
d-engine-client/src/builder.rs (1)
d-engine-client/src/grpc_kv_client.rs (1)
  • new (40-42)
d-engine-client/src/pool_test.rs (1)
d-engine-client/src/pool.rs (5)
  • parse_cluster_metadata (190-212)
  • load_cluster_metadata (154-187)
  • create_channel (116-130)
  • create (39-54)
  • get_all_channels (138-142)
d-engine-client/src/cluster.rs (1)
d-engine-client/src/lib.rs (1)
  • cluster (151-153)
d-engine-core/src/raft_test.rs (1)
d-engine-core/src/raft.rs (2)
  • new (79-93)
  • new (101-143)
d-engine-core/src/storage/state_machine_test.rs (2)
d-engine-server/src/storage/lease_integration_test.rs (2)
  • create_insert_entry (129-153)
  • create_insert_entry (700-724)
d-engine-proto/src/exts/client_ext.rs (1)
  • insert (53-65)
d-engine-core/src/storage/mod.rs (1)
d-engine-server/src/node/builder.rs (2)
  • state_machine (197-203)
  • storage_engine (188-194)
d-engine-client/src/kv_test.rs (4)
d-engine-client/src/lib.rs (1)
  • cluster (151-153)
d-engine-client/src/grpc_kv_client.rs (1)
  • new (40-42)
d-engine-client/src/mock_rpc_service.rs (1)
  • simulate_client_write_mock_server (154-183)
d-engine-proto/src/exts/client_ext.rs (2)
  • write_success (105-111)
  • client_error (139-145)
d-engine-core/src/storage/storage_engine_test.rs (3)
d-engine-server/src/storage/adaptors/rocksdb/rocksdb_engine_test.rs (2)
  • create_test_command_payload (100-120)
  • new (26-33)
examples/sled-cluster/src/sled_engine_test.rs (2)
  • create_test_command_payload (97-118)
  • new (23-26)
d-engine-proto/src/exts/client_ext.rs (1)
  • insert (53-65)
d-engine-core/src/raft_role/raft_role_test.rs (1)
d-engine-core/src/raft_role/mod.rs (1)
  • current_leader (147-152)
d-engine-core/src/raft_role/follower_state.rs (2)
d-engine-core/src/utils/cluster.rs (1)
  • error (20-25)
d-engine-core/src/raft_role/leader_state.rs (1)
  • can_purge_logs (1798-1813)
d-engine-core/src/commit_handler/default_commit_handler_test.rs (1)
d-engine-core/src/commit_handler/default_commit_handler.rs (1)
  • is_self_removal_config (231-240)
d-engine-client/src/lib.rs (3)
d-engine-client/src/grpc_kv_client.rs (1)
  • new (40-42)
d-engine-client/src/builder.rs (1)
  • new (35-40)
d-engine-client/src/pool.rs (2)
  • refresh (63-80)
  • create (39-54)
d-engine-core/src/election/election_handler.rs (2)
d-engine-core/src/lib.rs (2)
  • if_higher_term_found (74-86)
  • is_target_log_more_recent (95-103)
d-engine-core/src/membership.rs (1)
  • members (42-42)
d-engine-core/src/raft_role/candidate_state.rs (2)
d-engine-core/src/raft_role/mod.rs (1)
  • can_serve_read_locally (429-466)
d-engine-core/src/raft_role/leader_state.rs (2)
  • send_become_follower_event (1749-1765)
  • send_replay_raft_event (2441-2450)
d-engine-core/src/replication/replication_handler.rs (2)
d-engine-core/src/utils/cluster.rs (1)
  • is_majority (8-13)
d-engine-server/src/membership/raft_membership.rs (1)
  • voters (106-119)
d-engine-core/src/storage/lease.rs (1)
d-engine-core/src/storage/state_machine.rs (2)
  • len (62-62)
  • is_empty (66-68)
d-engine-client/src/kv_client.rs (2)
d-engine-client/src/grpc_kv_client.rs (10)
  • put (50-85)
  • put (404-410)
  • put_with_ttl (100-137)
  • put_with_ttl (412-419)
  • get (217-222)
  • get (421-430)
  • get_multi (249-254)
  • get_multi (432-440)
  • delete (151-180)
  • delete (442-447)
d-engine-server/src/node/client/local_kv.rs (7)
  • put (153-189)
  • put (451-457)
  • put_with_ttl (459-499)
  • get (501-506)
  • get_multi (508-513)
  • delete (385-417)
  • delete (515-520)
d-engine-client/src/grpc_kv_client.rs (3)
d-engine-server/src/node/client/local_kv.rs (7)
  • put_with_ttl (459-499)
  • put (153-189)
  • put (451-457)
  • get (501-506)
  • get_multi (508-513)
  • delete (385-417)
  • delete (515-520)
d-engine-client/src/kv_client.rs (5)
  • put_with_ttl (101-106)
  • put (75-79)
  • get (137-140)
  • get_multi (169-172)
  • delete (194-197)
d-engine-server/src/storage/adaptors/file/file_state_machine.rs (2)
  • new (193-211)
  • get (987-1006)
d-engine-core/src/state_machine_handler/default_state_machine_handler.rs (2)
d-engine-proto/src/exts/client_ext.rs (2)
  • insert (53-65)
  • delete (92-97)
d-engine-core/src/watch/manager.rs (1)
  • new (276-297)
d-engine-core/src/test_utils/mock/mock_raft_builder.rs (1)
d-engine-server/src/test_utils/mock/mock_node_builder.rs (2)
  • build_context (164-201)
  • new (138-159)
d-engine-core/src/state_machine_handler/default_state_machine_handler_test.rs (1)
d-engine-core/src/state_machine_handler/default_state_machine_handler.rs (2)
  • new (744-770)
  • new_without_watch (777-794)
🪛 checkmake (0.2.2)
Makefile

[warning] 54-54: Target body for "help" exceeds allowed length of 5 (35).

(maxbodylength)


[warning] 403-403: Target body for "audit" exceeds allowed length of 5 (6).

(maxbodylength)


[warning] 412-412: Target body for "deny" exceeds allowed length of 5 (6).

(maxbodylength)


[warning] 444-444: Target body for "troubleshoot" exceeds allowed length of 5 (34).

(maxbodylength)


[warning] 18-18: Missing required phony target "clean"

(minphony)


[warning] 115-115: Target "install" should be declared PHONY.

(phonydeclared)


[warning] 393-393: Target "doc" should be declared PHONY.

(phonydeclared)

🪛 GitHub Check: CodeQL
.github/workflows/benchmark.yml

[warning] 27-89: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}

🪛 LanguageTool
benches/d-engine-bench/reports/v0.1.4/report_20251205_remeasured.md

[style] ~3-~3: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ...Benchmark Report Date: December 5, 2025 Hardware: Apple Mac mini M2 (8-co...

(MISSING_COMMA_AFTER_YEAR)

benches/d-engine-bench/reports/v0.2.0/report_20251210.md

[style] ~3-~3: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ...enchmark Report Date: December 10, 2025 Hardware: Apple Mac mini M2 (8-co...

(MISSING_COMMA_AFTER_YEAR)

benches/d-engine-bench/reports/v0.2.0/report_20251209.md

[style] ~3-~3: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ...Benchmark Report Date: December 9, 2025 Hardware: Apple Mac mini M2 (8-co...

(MISSING_COMMA_AFTER_YEAR)

benches/d-engine-bench/reports/v0.2.0/report_20251205.md

[style] ~3-~3: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ...Benchmark Report Date: December 5, 2025 Hardware: Apple Mac mini M2 (8-co...

(MISSING_COMMA_AFTER_YEAR)

🪛 markdownlint-cli2 (0.18.1)
benches/d-engine-bench/reports/v0.2.0/report_20251210.md

24-24: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)


33-33: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

benches/d-engine-bench/reports/v0.2.0/report_20251209.md

121-121: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


139-139: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


155-155: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


171-171: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


187-187: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


203-203: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

MIGRATION_GUIDE.md

20-20: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint and Format Check
  • GitHub Check: audit
  • GitHub Check: Analyze (rust)

Comment on lines +284 to +309
#[test]
fn test_shared_state_concurrent_updates() {
use std::sync::Arc;
use std::thread;

let shared = Arc::new(SharedState::new(1, None, None));

// Simulate concurrent leader updates
let handles: Vec<_> = (0..10)
.map(|i| {
let shared = Arc::clone(&shared);
thread::spawn(move || {
shared.set_current_leader(i as u32);
})
})
.collect();

for h in handles {
h.join().unwrap();
}

// Final value is one of the concurrent updates
let final_leader = shared.current_leader();
assert!(final_leader.is_some());
assert!(final_leader.unwrap() < 10);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential test flakiness: thread setting leader to 0 can cause assertion failure.

The concurrent test spawns threads setting leader IDs from 0 to 9. If thread 0 wins the race, set_current_leader(0) is called, which per the current_leader() implementation returns None for value 0. The assertion assert!(final_leader.is_some()) would then fail.

Consider starting the range from 1:

     let handles: Vec<_> = (0..10)
         .map(|i| {
             let shared = Arc::clone(&shared);
             thread::spawn(move || {
-                shared.set_current_leader(i as u32);
+                shared.set_current_leader((i + 1) as u32);
             })
         })
         .collect();

Or adjust the assertion:

-    assert!(final_leader.is_some());
-    assert!(final_leader.unwrap() < 10);
+    // final_leader could be None if thread 0 won (set_current_leader(0) clears leader)
+    match final_leader {
+        Some(id) => assert!(id < 10),
+        None => {} // thread 0 won the race
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#[test]
fn test_shared_state_concurrent_updates() {
use std::sync::Arc;
use std::thread;
let shared = Arc::new(SharedState::new(1, None, None));
// Simulate concurrent leader updates
let handles: Vec<_> = (0..10)
.map(|i| {
let shared = Arc::clone(&shared);
thread::spawn(move || {
shared.set_current_leader(i as u32);
})
})
.collect();
for h in handles {
h.join().unwrap();
}
// Final value is one of the concurrent updates
let final_leader = shared.current_leader();
assert!(final_leader.is_some());
assert!(final_leader.unwrap() < 10);
}
#[test]
fn test_shared_state_concurrent_updates() {
use std::sync::Arc;
use std::thread;
let shared = Arc::new(SharedState::new(1, None, None));
// Simulate concurrent leader updates
let handles: Vec<_> = (0..10)
.map(|i| {
let shared = Arc::clone(&shared);
thread::spawn(move || {
shared.set_current_leader((i + 1) as u32);
})
})
.collect();
for h in handles {
h.join().unwrap();
}
// Final value is one of the concurrent updates
let final_leader = shared.current_leader();
assert!(final_leader.is_some());
assert!(final_leader.unwrap() < 10);
}
🤖 Prompt for AI Agents
d-engine-core/src/raft_role/raft_role_test.rs around lines 284 to 309: the
concurrent test spawns threads that set leader IDs 0..9 but
SharedState::current_leader() treats 0 as None, so if thread 0 wins the race the
test flakes; fix by making threads set non-zero IDs (e.g., iterate 1..=10 or
1..11) or alternatively relax the assertions to accept None/0 (for example
assert that final_leader.unwrap_or(0) < 10) — pick one approach and update the
test so it never relies on 0 being treated as Some.

Comment on lines +154 to +156
## clippy Run Clippy linter (fail on warnings with -D flag)
## clippy Run Clippy linter on all crates
clippy: check-workspace clippy-excluded
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Duplicate comment for clippy target.

Lines 154-155 have two ## comments for the same target, which may cause issues with the help parser.

-## clippy               Run Clippy linter (fail on warnings with -D flag)
-## clippy                Run Clippy linter on all crates
+## clippy               Run Clippy linter on all crates (fail on warnings)
 clippy: check-workspace clippy-excluded
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## clippy Run Clippy linter (fail on warnings with -D flag)
## clippy Run Clippy linter on all crates
clippy: check-workspace clippy-excluded
## clippy Run Clippy linter on all crates (fail on warnings)
clippy: check-workspace clippy-excluded
🤖 Prompt for AI Agents
In Makefile around lines 154 to 156, there are duplicate comment lines
describing the clippy target which can confuse the help parser; remove the
redundant comment so only one descriptive comment remains (e.g., keep a single
"## clippy    Run Clippy linter on all crates" line) ensuring the comments are
unique and accurately describe the target.

Changes:
- Remove commented legacy package config from workspace Cargo.toml
- Update README benchmark reference to v0.2.0 (dengine_comparison_v0.2.0.png)
- Upgrade rustfmt edition from 2021 to 2024
- Clean up outdated v0.2.0 benchmark reports (report_20251205/09/10.md)
- Add final v0.2.0 benchmark report and comparison chart
- Sync version references across all example Cargo.toml files
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
Makefile (1)

154-155: Fix duplicate comment for clippy target.

Lines 154-155 contain duplicate comment lines describing the clippy target, which may confuse the help parser. Keep a single, descriptive comment.

Apply this diff:

-## clippy               Run Clippy linter (fail on warnings with -D flag)
-## clippy                Run Clippy linter on all crates
+## clippy               Run Clippy linter on all crates (fail on warnings)
🧹 Nitpick comments (3)
benches/d-engine-bench/reports/v0.2.0/report_v0.2.0_final.md (1)

174-174: Minor grammar refinement: "a small number of keys" → "10 keys".

At line 174, the phrase "a small number of keys (10)" can be tightened to simply "10 keys" or "few keys" for clearer, more direct prose.

README.md (1)

80-80: Clarify benchmark methodology and source attribution.

The footnote "*Benchmarks show 51% higher write throughput vs etcd..." references specific numbers but doesn't clearly link to the detailed benchmark report. Consider explicitly directing readers to benches/d-engine-bench/reports/v0.2.0/ for the comprehensive methodology and environment details.

Consider updating to: "Benchmarks show 51% higher write throughput vs etcd in test environments (see benches/d-engine-bench/reports/v0.2.0/ for methodology and environment details). Production results vary by hardware."

Makefile (1)

1-576: Approve comprehensive Makefile structure.

The Makefile provides excellent developer experience with:

  • Clear organization across logical sections (quality, build, test, docs, security, cleanup, troubleshooting)
  • Detailed help output with color-coded messages and workflow examples
  • Environment validation and workspace structure checks
  • Comprehensive test targets (unit, integration, doc, examples, detailed output, verbose, all)
  • Documentation generation with browser automation
  • Security audit targets
  • Composite workflows for common tasks (pre-release, check-all-projects)
  • Color-coded output for improved readability

The overall structure and functionality are well-designed and should significantly improve the development experience.

Consider addressing the static analysis warnings for consistency: while body length warnings are acceptable and don't affect functionality, ensuring all targets are declared .PHONY is a Makefile best practice that improves portability and clarity.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9123484 and f7b0418.

⛔ Files ignored due to path filters (2)
  • benches/d-engine-bench/Cargo.lock is excluded by !**/*.lock
  • benches/d-engine-bench/reports/v0.2.0/dengine_comparison_v0.2.0.png is excluded by !**/*.png
📒 Files selected for processing (5)
  • Cargo.toml (2 hunks)
  • Makefile (1 hunks)
  • README.md (4 hunks)
  • benches/d-engine-bench/reports/v0.2.0/report_v0.2.0_final.md (1 hunks)
  • benches/d-engine-bench/rustfmt.toml (1 hunks)
🧰 Additional context used
🪛 checkmake (0.2.2)
Makefile

[warning] 54-54: Target body for "help" exceeds allowed length of 5 (35).

(maxbodylength)


[warning] 407-407: Target body for "audit" exceeds allowed length of 5 (6).

(maxbodylength)


[warning] 416-416: Target body for "deny" exceeds allowed length of 5 (6).

(maxbodylength)


[warning] 448-448: Target body for "troubleshoot" exceeds allowed length of 5 (34).

(maxbodylength)


[warning] 18-18: Missing required phony target "clean"

(minphony)


[warning] 115-115: Target "install" should be declared PHONY.

(phonydeclared)


[warning] 397-397: Target "doc" should be declared PHONY.

(phonydeclared)

🪛 LanguageTool
benches/d-engine-bench/reports/v0.2.0/report_v0.2.0_final.md

[style] ~174-~174: Specify a number, remove phrase, use “a few”, or use “some”
Context: ...st simulates contention scenarios where a small number of keys (10) receive high request volume. ...

(SMALL_NUMBER_OF)

🪛 markdownlint-cli2 (0.18.1)
benches/d-engine-bench/reports/v0.2.0/report_v0.2.0_final.md

178-178: Multiple headings with the same content

(MD024, no-duplicate-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: audit
  • GitHub Check: Analyze (rust)
  • GitHub Check: audit
🔇 Additional comments (4)
Cargo.toml (2)

50-76: Approve profile configuration tuning.

The profile configurations are well-tuned for different build scenarios:

  • dev: Low optimization (opt-level=1) for fast iteration, incremental compilation enabled
  • release: Aggressive optimization (opt-level='z' for size, LTO, single codegen unit) for production
  • test: Balanced approach with incremental compilation and multi-core codegen

This is a good setup for a Rust workspace.


2-10: All workspace members are correctly declared and paths are valid.

Verification confirms all 6 members exist with proper Cargo.toml files: d-engine, d-engine-proto, d-engine-core, d-engine-client, d-engine-docs, and d-engine-server. Each crate name matches its path reference.

README.md (1)

10-11: Verify new public APIs and features are properly exported.

The README introduces several new public APIs and features that must exist and be properly exported:

  1. EmbeddedEngine (lines 22, 107, 116): Public struct in d-engine-server at src/embedded/mod.rs, re-exported via d-engine::prelude
  2. LocalKvClient (line 23): Public struct in d-engine-server, re-exported via d-engine::prelude
  3. prelude module (line 101): Public module at d-engine/src/lib.rs:203 that exports EmbeddedEngine, LocalKvClient, Node, NodeBuilder, and related types (conditional on features)
  4. Feature flags (lines 48, 51): client and server features properly defined in d-engine/Cargo.toml as optional dependencies
  5. TTL/Lease (lines 20-21): Lease trait re-exported from d-engine-core, with DefaultLease implementation in d-engine-server/src/storage/lease.rs
  6. Watch API (lines 20-21): Implemented via watch() method in EmbeddedEngine at src/embedded/mod.rs:322 and leader_notifier() for leader change notifications

All mentioned example directories exist: examples/quick-start/, examples/single-node-expansion/, examples/sled-cluster/, examples/three-nodes-cluster/.

Makefile (1)

18-22: Add missing .PHONY declarations.

Several targets are missing from the .PHONY declaration at lines 18-22, including install (line 115) and doc/doc-check (lines 397-400). Update the .PHONY list to include all non-file targets.

Apply this diff to add missing phony targets:

 .PHONY: help all check fmt fmt-check clippy clippy-fix test test-unit \
         test-integration test-doc bench clean doc build build-release \
         pre-release install check-env audit deny troubleshoot \
         test-crate test-examples test-detailed \
-        docs docs-all docs-check docs-check-all
+        docs docs-all docs-check docs-check-all \
+        install-tools doc doc-check check-workspace check-examples \
+        check-benches check-all-projects build-examples fmt-fix \
+        clippy-excluded test-verbose fix clean-deps

Likely an incorrect or invalid review comment.

- Fix typo: rename backgroup → background in snapshot transfer
- Remove obsolete storage_test.rs
- Update client error handling and KV operations
- Fix network module exports
- Update candidate state and state machine handler
- Fix single-node cluster examples documentation
- Update benchmark workflow configuration
Cache cluster metadata in LeaderState to eliminate 3 async calls per write.

Performance: +2-3% throughput, -13-21% p99 latency vs v0.2.0

Changes:
- Add ClusterMetadata { is_single_node, total_voters } cache
- Initialize on leader election, update on membership changes
- Pass to replication handler to avoid repeated queries
- Fix 6 test failures due to missing metadata initialization
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
d-engine-client/src/grpc_kv_client.rs (1)

376-395: Potential panic if channel pool is empty.

rng.gen_range(0..channels.len()) will panic with an empty range if get_all_channels() returns an empty vector (e.g., during initialization or when all nodes are disconnected).

Add a guard for empty channels:

     pub(super) async fn make_client(
         &self
     ) -> std::result::Result<RaftClientServiceClient<Channel>, ClientApiError> {
         let client_inner = self.client_inner.load();
 
         // Balance from read clients
         let mut rng = StdRng::from_entropy();
         let channels = client_inner.pool.get_all_channels();
+        if channels.is_empty() {
+            return Err(ClientApiError::Network("No available channels".into()));
+        }
         let i = rng.gen_range(0..channels.len());
 
         let mut client = RaftClientServiceClient::new(channels[i].clone());
d-engine-core/src/replication/mod.rs (1)

189-220: Docs reference AppendEntriesResponse::{success,higher_term,conflict}—likely stale after helper removal.

check_append_entries_request_is_legal’s docs still point at constructor-style APIs that (per this PR) were removed from AppendEntriesResponse. Update doc return bullets to match the new construction path (e.g., “returns an AppendEntriesResponse with success_result/conflict_result/term semantics…”), so downstream implementors don’t chase non-existent helpers.

d-engine-core/src/state_machine_handler/default_state_machine_handler.rs (1)

1199-1224: Critical: Hardcoded offset breaks configurable prefix.

Line 1211 uses a hardcoded offset of 9 bytes, assuming the prefix is always "snapshot-". However, the function now accepts a configurable snapshot_dir_prefix parameter. If a different prefix is used (e.g., "snap-", "backup-", or any non-9-character prefix), the parsing will:

  • Slice at the wrong position, producing incorrect index/term values
  • Potentially panic if the string is shorter than 9 characters
  • Cause snapshot cleanup and management to fail

Apply this diff to use the dynamic prefix length:

-    // Remove fixed parts
-    let core = &name[9..name.len()]; // "snapshot-".len() = 9,
+    // Remove prefix (now configurable)
+    let core = &name[snapshot_dir_prefix.len()..];
d-engine-client/src/error.rs (1)

150-205: Preserve server message in fallback case for better diagnostics.

The fallback handler (_ => Self::Business) drops status.message() when mapping unhandled status codes, replacing it only with a code debug representation. This makes diagnosing server-side failures harder. Include the original message:

            _ => Self::Business {
                code: ErrorCode::Uncategorized,
-                message: format!("Unhandled status code: {code:?}"),
+                message: format!("Unhandled status code: {code:?}: {message}"),
                required_action: None,
            },
d-engine-core/src/raft_role/candidate_state.rs (1)

243-284: Avoid self.voted_for().unwrap() in vote-request handling (panic risk).
If voted_for() can ever error (storage, state init), Candidate will crash instead of rejecting / stepping down cleanly.

-                if ctx.election_handler().check_vote_request_is_legal(
+                if ctx.election_handler().check_vote_request_is_legal(
                     &vote_request,
                     my_term,
                     last_log_index,
                     last_log_term,
-                    self.voted_for().unwrap(),
+                    self.voted_for()?,
                 ) {
d-engine-core/src/replication/replication_handler.rs (2)

564-614: New public API: avoid assert! panics in generate_new_entries.
assert!(!id_range.is_empty()) (Line 579) will abort the process if allocation ever fails. For a pub async fn returning Result, prefer non-panicking handling (at minimum debug_assert! + a structured error path).


686-706: New public helper must not panic: remove .unwrap() on protobuf encoding.
cmd.encode(&mut buf).unwrap() (Line 699) can panic; for a public API, use the infallible encode_to_vec() method instead.

 pub fn client_command_to_entry_payloads(commands: Vec<WriteCommand>) -> Vec<EntryPayload> {
     commands
         .into_iter()
         .map(|cmd| {
-            let mut buf = BytesMut::with_capacity(cmd.encoded_len());
-            cmd.encode(&mut buf).unwrap();
-
             EntryPayload {
-                payload: Some(Payload::Command(buf.freeze())),
+                payload: Some(Payload::Command(cmd.encode_to_vec().into())),
             }
         })
         .collect()
 }
♻️ Duplicate comments (2)
d-engine-core/src/network/mod.rs (1)

9-13: Typo fix from past review has been addressed.

The module has been correctly renamed from backgroup_snapshot_transfer to background_snapshot_transfer.

d-engine-client/src/error.rs (1)

104-147: Default transport failure: make invalid-URI detection robust + align retry hint.

  • err.to_string().contains("invalid uri") is case-sensitive; tonic errors often say “invalid URI”, so this can silently misclassify.
  • The fallback now correctly uses Uncategorized (good), but retry_after_ms: Some(5000) implies retryability despite being “unclassified”; past guidance suggested None.
@@
-        // Check for invalid address errors
-        if err.to_string().contains("invalid uri") {
+        // Check for invalid address errors (be case-insensitive)
+        let err_s = err.to_string();
+        if err_s.to_ascii_lowercase().contains("invalid uri") {
             return Self::Network {
                 code: ErrorCode::InvalidAddress,
-                message: format!("Invalid address: {err}"),
+                message: format!("Invalid address: {err}"),
                 retry_after_ms: None, // Not retryable - needs address correction
                 leader_hint: None,
             };
         }
@@
         Self::Network {
             code: ErrorCode::Uncategorized,
             message: format!("Transport error: {err}"),
-            retry_after_ms: Some(5000),
+            retry_after_ms: None,
             leader_hint: None,
         }
🧹 Nitpick comments (14)
d-engine-core/src/storage/lease.rs (3)

114-141: Clarify caller responsibilities for expired keys.

Both get_expired_keys and on_apply return Vec<Bytes> of expired keys, but the documentation doesn't explicitly state what the caller should do with them. While it's implied that these keys should be deleted from the main storage, making this contract explicit would improve API clarity.

Consider adding to the documentation:

 /// # Returns
-/// List of expired keys (keys are removed from lease tracking)
+/// List of expired keys (keys are removed from lease tracking).
+/// The caller is responsible for deleting these keys from the main storage.

Similar clarification for on_apply would be helpful.


182-188: Consider returning Result from to_snapshot for error transparency.

The method returns Vec<u8> rather than Result<Vec<u8>>, preventing serialization errors from being propagated. The implementation in d-engine-server uses .unwrap_or_default(), which silently returns an empty snapshot on failure. This could mask snapshot creation issues.

While reload properly returns Result<()> for deserialization errors, the asymmetry means snapshot creation failures go unnoticed until restoration fails.

Consider changing the signature to:

-fn to_snapshot(&self) -> Vec<u8>;
+fn to_snapshot(&self) -> Result<Vec<u8>>;

This allows callers to handle serialization failures explicitly.


156-169: Documentation is slightly imprecise about performance.

The documentation states "Should be O(1) or O(log N) at most" (line 165), but the implementation in d-engine-server samples the first 10 entries, which is O(1). The "or O(log N)" part might be unnecessary given the heuristic nature of this check.

Consider tightening the performance documentation:

 /// # Performance
-/// Should be O(1) or O(log N) at most.
+/// Should be O(1) - typically a fast heuristic check (e.g., sampling a few entries).
d-engine-client/src/grpc_kv_client.rs (1)

99-136: Consider extracting shared write logic to reduce duplication.

The put_with_ttl implementation mirrors put except for the WriteCommand construction. The request building, leader client acquisition, and response handling are duplicated.

Consider extracting a helper method:

+    async fn execute_write(&self, commands: Vec<WriteCommand>, op_name: &str) -> std::result::Result<(), ClientApiError> {
+        let client_inner = self.client_inner.load();
+        let request = ClientWriteRequest {
+            client_id: client_inner.client_id,
+            commands,
+        };
+        let mut client = self.make_leader_client().await?;
+        match client.handle_client_write(request).await {
+            Ok(response) => {
+                debug!("[:KvClient:{}] response: {:?}", op_name, response);
+                response.get_ref().validate_error()
+            }
+            Err(status) => {
+                error!("[:KvClient:{}] status: {:?}", op_name, status);
+                Err(status.into())
+            }
+        }
+    }

This would simplify put, put_with_ttl, and delete to single command construction + helper call.

d-engine-client/src/error.rs (3)

55-98: Delete commented-out public enums instead of keeping them in-tree.
Large commented blocks (NetworkErrorType/ProtocolErrorType/etc.) add noise and tend to rot; rely on git history or reintroduce with a proper deprecation cycle if needed.


239-275: Avoid manual “JSON-like” parsing for x-raft-leader metadata.
The current split-by-, / split-by-: parser will break if values contain commas/colons, or if encoding changes slightly. If the header is JSON, parse it as JSON; if it’s not JSON, standardize it (e.g., base64 JSON) or use a structured delimiter format.

@@
 fn parse_leader_from_metadata(status: &Status) -> Option<LeaderInfo> {
     status
         .metadata()
         .get("x-raft-leader")
         .and_then(|v| v.to_str().ok())
-        .and_then(|s| {
-            // Manually parse JSON-like string
-            let mut id = None;
-            let mut address = None;
-            let mut last_contact = None;
-
-            // Remove whitespace and outer braces
-            let s = s.trim().trim_start_matches('{').trim_end_matches('}');
-
-            // Split into key-value pairs
-            for pair in s.split(',') {
-                let pair = pair.trim();
-                if let Some((key, value)) = pair.split_once(':') {
-                    let key = key.trim().trim_matches('"');
-                    let value = value.trim().trim_matches('"');
-
-                    match key {
-                        "id" => id = Some(value.to_string()),
-                        "address" => address = Some(value.to_string()),
-                        "last_contact" => last_contact = value.parse().ok(),
-                        _ => continue,
-                    }
-                }
-            }
-
-            Some(LeaderInfo {
-                id: id?,
-                address: address?,
-                last_contact: last_contact?,
-            })
-        })
+        .and_then(|s| serde_json::from_str::<LeaderInfo>(s).ok())
 }

277-397: Consider mapping ErrorCode::Uncategorized to General (not Business).
Right now ErrorCode::Uncategorized becomes ClientApiError::Business, which can drive callers toward “client must change behavior” semantics for unknown failures. General seems more appropriate unless you have a specific reason.

d-engine-client/src/pool.rs (1)

153-187: Improve final failure signal from load_cluster_metadata by including last error.
Currently you log each failure but return a generic ClusterUnavailable. Consider capturing the last error string and embedding it in ClientApiError (or attach as context) to help operators.

d-engine-core/src/raft_role/role_state.rs (1)

322-353: create_not_leader_response() likely does an avoidable O(n) scan; prefer a direct leader lookup if Membership supports it.
Right now it awaits members() then iter().find(...). If there’s a retrieve_node_meta(id)/map-like accessor, it’ll be cheaper and clearer.

d-engine-core/src/raft_role/candidate_state.rs (1)

571-575: StepDownSelfRemoved as unreachable!() is OK, but consider a defensive fallback in release builds.
If routing ever regresses and this becomes reachable, this will crash the node. A warn! + BecomeFollower(None) fallback can make the system more resilient.

d-engine-core/src/raft_role/leader_state.rs (2)

1235-1253: Step-down on self-removal is good; consider also clearing current_leader_id to avoid stale redirection.
After stepping down, clients hitting this node could still be redirected to the old leader id until the next heartbeat updates state elsewhere.

             RaftEvent::StepDownSelfRemoved => {
+                self.shared_state().clear_current_leader();
                 warn!(
                     "[Leader-{}] Removed from cluster membership, stepping down to Follower",
                     self.node_id()
                 );

1008-1032: snapshot_in_progress uses mixed orderings (Acquire/Release then SeqCst); consider standardizing.
Not wrong, but it’s harder to reason about than sticking to Acquire loads + Release stores (or SeqCst consistently).

d-engine-core/src/replication/replication_handler.rs (2)

543-549: New public struct: add #[non_exhaustive] (or keep fields private) to reduce semver breakage.
Making ReplicationData and all fields pub locks you into this shape; #[non_exhaustive] is a low-cost guardrail.

 #[derive(Debug)]
+#[non_exhaustive]
 pub struct ReplicationData {
     pub leader_last_index_before: u64,
     pub current_term: u64,
     pub commit_index: u64,
     pub peer_next_indices: HashMap<u32, u64>,
 }

617-631: Public prepare_peer_entries returns DashMap + clones input—document ownership/perf expectations.
Now that this is pub, callers may assume it’s cheap; but it clones new_entries and later clones per peer. A short doc note (or a future API taking shared buffers) will prevent misuse.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f7b0418 and 1b55a6f.

📒 Files selected for processing (18)
  • .github/workflows/benchmark.yml (1 hunks)
  • d-engine-client/src/error.rs (8 hunks)
  • d-engine-client/src/grpc_kv_client.rs (6 hunks)
  • d-engine-client/src/kv_client.rs (1 hunks)
  • d-engine-client/src/pool.rs (10 hunks)
  • d-engine-client/src/pool_test.rs (1 hunks)
  • d-engine-core/src/network/background_snapshot_transfer.rs (1 hunks)
  • d-engine-core/src/network/background_snapshot_transfer_test.rs (2 hunks)
  • d-engine-core/src/network/mod.rs (7 hunks)
  • d-engine-core/src/raft.rs (10 hunks)
  • d-engine-core/src/raft_role/candidate_state.rs (16 hunks)
  • d-engine-core/src/raft_role/leader_state.rs (40 hunks)
  • d-engine-core/src/raft_role/mod.rs (9 hunks)
  • d-engine-core/src/raft_role/role_state.rs (7 hunks)
  • d-engine-core/src/replication/mod.rs (3 hunks)
  • d-engine-core/src/replication/replication_handler.rs (10 hunks)
  • d-engine-core/src/state_machine_handler/default_state_machine_handler.rs (12 hunks)
  • d-engine-core/src/storage/lease.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • d-engine-client/src/pool_test.rs
  • d-engine-core/src/raft.rs
🧰 Additional context used
🧬 Code graph analysis (4)
d-engine-core/src/network/mod.rs (1)
d-engine-core/src/network/background_snapshot_transfer.rs (1)
  • mpsc (83-83)
d-engine-core/src/replication/replication_handler.rs (1)
d-engine-core/src/utils/cluster.rs (1)
  • is_majority (8-13)
d-engine-core/src/storage/lease.rs (2)
d-engine-server/src/storage/lease.rs (10)
  • register (200-214)
  • unregister (221-227)
  • is_expired (234-243)
  • get_expired_keys (252-269)
  • on_apply (280-293)
  • has_lease_keys (300-302)
  • may_have_expired_keys (312-329)
  • len (336-338)
  • to_snapshot (345-354)
  • reload (363-395)
d-engine-core/src/storage/state_machine.rs (2)
  • len (62-62)
  • is_empty (66-68)
d-engine-core/src/raft_role/mod.rs (2)
d-engine-core/src/raft_role/leader_state.rs (6)
  • new (105-113)
  • new (2008-2049)
  • update_voted_for (290-295)
  • voted_for (284-286)
  • match_index (327-332)
  • next_index (297-306)
d-engine-core/src/raft_role/role_state.rs (7)
  • update_voted_for (295-300)
  • current_term (230-232)
  • voted_for (289-291)
  • commit_index (242-244)
  • match_index (65-71)
  • node_id (42-44)
  • next_index (49-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: audit
  • GitHub Check: Lint and Format Check
  • GitHub Check: Analyze (rust)
  • GitHub Check: Lint and Format Check
  • GitHub Check: audit
🔇 Additional comments (36)
.github/workflows/benchmark.yml (2)

3-4: Permissions block properly implemented, addressing previous CodeQL feedback.

The explicit permissions: contents: read block correctly restricts the GITHUB_TOKEN to read-only access, which aligns with the principle of least privilege. This addresses the earlier security concern flagged by CodeQL.


1-83: Well-structured benchmark workflow with sound caching and artifact strategies.

The workflow is well-designed for performance testing:

  • Scheduling & triggers (lines 7–13): Weekly runs with manual override provide flexibility for on-demand benchmarking.
  • Rust setup (lines 30–36): Stable toolchain + protobuf-compiler installation is correct for the workspace.
  • Caching (lines 38–42): Custom cache key benchmark-* appropriately isolates benchmark cache from test runs.
  • Full history checkout (line 28): fetch-depth: 0 enables meaningful performance comparisons across commits.
  • Output & artifacts (lines 62–83): Criterion results + log tee + 30-day retention and clear summary messaging support post-hoc analysis and iteration.
d-engine-core/src/network/background_snapshot_transfer.rs (1)

30-33: LGTM! Import migration to external proto crate is consistent.

The migration from crate::proto::server::storage::* to d_engine_proto::server::storage::* aligns with the PR's architectural goal of using the external proto crate.

d-engine-core/src/network/background_snapshot_transfer_test.rs (2)

24-46: LGTM! Well-structured test helper for creating snapshot streams.

The helper function properly:

  1. Creates test chunks with metadata on seq 0
  2. Maps tonic Status errors to NetworkError appropriately
  3. Returns a properly boxed stream

19-22: Import migration to d_engine_proto is consistent with main module.

The test file properly imports SnapshotChunk and SnapshotResponse from d_engine_proto::server::storage, matching the production code's import paths.

d-engine-core/src/network/mod.rs (4)

36-44: Verify public API exposure is intentional for AppendResults.

The visibility has been widened from pub(crate) to pub. Ensure this struct is properly documented and re-exported from the crate's public API surface if intended for external use.


93-98: LGTM! Extended automock availability for integration testing.

The #[cfg_attr(any(test, feature = "test-utils"), automock)] pattern correctly enables MockTransport both for internal tests and for downstream crates using the test-utils feature.


24-32: Import migration to d_engine_proto is consistent and complete.

All protobuf types used in the Transport trait have been properly migrated from internal paths to d_engine_proto::server::*.


305-309: Confirm grpc_task_with_timeout_and_exponential_backoff should be public API.

This utility function is now pub instead of pub(crate). While useful for consumers implementing custom gRPC operations, verify this is intentionally part of the v0.2.0 public API and document its usage accordingly.

d-engine-core/src/storage/lease.rs (2)

1-24: LGTM! Excellent module documentation.

The module documentation provides clear design philosophy, architecture overview, and concurrency guidelines. This sets appropriate expectations for trait implementers.


73-87: Parameter type choice is appropriate.

The register method takes ownership of key: Bytes while other methods (like unregister, is_expired) use &[u8]. This asymmetry is intentional and efficient: register needs to store the key, while lookups only need a reference.

d-engine-client/src/kv_client.rs (2)

1-51: Well-documented trait interface with clear consistency guarantees.

The module documentation clearly explains the design principles, implementations, and performance characteristics. The trait's thread safety requirements (Send + Sync) and consistency guarantees are well-specified.


74-197: LGTM!

Method signatures are consistent with idiomatic Rust async trait patterns. Using impl AsRef<[u8]> + Send provides flexibility for callers while maintaining type safety. The idempotent delete semantics are correctly documented.

d-engine-client/src/grpc_kv_client.rs (3)

14-36: LGTM!

Imports are well-organized with clear aliasing (KvClient as CoreKvClient). The struct definition with Arc<ArcSwap<ClientInner>> enables efficient concurrent access with runtime-updatable configuration.


310-358: LGTM!

The watch implementation correctly uses load-balanced client selection since followers can serve watch streams. Documentation clearly notes the v1 limitation (no prefix/range watch support).


398-447: LGTM!

The CoreKvClient trait implementation correctly delegates to the inherent methods and properly maps ClientResult to Bytes (extracting the .value field). Error propagation via Into::into is clean and type-safe.

d-engine-core/src/replication/mod.rs (2)

105-113: No action required; parameter change is complete and properly integrated.

All implementors and call sites have been updated correctly. The single ReplicationCore implementation in replication_handler.rs includes the new cluster_metadata parameter (line 65) and actively uses it on line 76 (let is_single_node = cluster_metadata.is_single_node;). The production call site in leader_state.rs (lines 1383-1389) properly passes &self.cluster_metadata, and all 30+ test cases pass the parameter correctly.


21-26: No issues found. The test-utils feature properly gates mockall as an optional dependency in Cargo.toml, and all #[cfg(any(test, feature = "test-utils"))] guards correctly restrict mockall imports to test/test-utils builds only. The async-trait crate is a non-optional dependency (needed for all async trait definitions), so tonic::async_trait is unconditionally available—this is correct. Tonic v0.12.3 is declared with proper features. Test-only dependencies do not leak into production.

d-engine-core/src/state_machine_handler/default_state_machine_handler.rs (5)

111-113: Good design for optional feature integration.

The use of Option<Arc<WatchManager>> provides zero-overhead when watch functionality is disabled, with clear documentation explaining the behavior.


186-198: LGTM: Watcher notification correctly guarded by success check.

The previous concern about notifying watchers even on error has been properly addressed. The notification now only occurs when apply_result.is_ok() is true, ensuring watchers are not notified about failed operations.


709-739: LGTM: Clean watcher notification implementation.

The method correctly:

  • Gracefully handles decode failures with if let Ok
  • Uses pattern matching for operation variants
  • Maintains performance with inline attribute and early returns

The silent skipping of decode errors is appropriate here, as not all entries may contain decodable WriteCommands.


746-796: LGTM: Constructor properly updated with test helper.

The changes correctly:

  • Add the watch_manager parameter to the main constructor
  • Initialize all fields including the new watch_manager
  • Provide a test-only new_without_watch helper for backward compatibility in tests

The #[cfg(test)] attribute appropriately limits the helper to test builds.


400-476: LGTM: Snapshot cleanup properly parameterized.

The cleanup method now correctly accepts a configurable snapshot_dir_prefix parameter and threads it through to parse_snapshot_dirname calls, both for direct filenames and .tar.gz suffixed variants.

d-engine-client/src/pool.rs (2)

189-214: Good fix: “no leader” now maps to retryable ClusterUnavailable.
This matches typical “election in progress / temporarily unavailable” semantics and should improve client retry behavior.


39-54: Returning/storing current_leader_id looks correct; ensure callers treat it as advisory.

current_leader_id is snapshotted at pool build/refresh time and used only during initialization to map node ID to leader address. It's exposed via get_leader_id() for metadata queries, not for per-request routing. Staleness is handled naturally by subsequent RPC redirects when the leader changes.

d-engine-core/src/raft_role/role_state.rs (3)

394-424: Leader discovery notification gating looks good; double-check correctness of “leader committed” semantics.
You now (1) atomically store leader_id, (2) persist/record VotedFor{committed:true}, and (3) emit LeaderDiscovered only when update_voted_for() reports a new committed transition. That’s a solid de-dupe strategy—just ensure SharedState::update_voted_for() is the single source of truth for “committed leader” and is persisted where required.


35-37: No action required — all callers of update_voted_for already properly handle the Result<bool> return type. The API migration is complete with no compilation breaks or missing updates in implementations, tests, or call sites.


89-96: This concern is already properly addressed in the code.

init_cluster_metadata() is correctly called in the RoleEvent::BecomeLeader handler (raft.rs:343) immediately after the Leader role is created and before any hot paths execute. The call sequence ensures cluster topology is cached before process_batch and other leader operations access cluster_metadata for quorum calculations. Error handling with the ? operator propagates any failures, and the Leader implementation correctly populates is_single_node and total_voters based on the membership configuration.

d-engine-core/src/raft_role/candidate_state.rs (1)

390-398: Returning NOT_LEADER with leader metadata on Candidate proposals is the right UX.
This aligns Candidate behavior with redirection and avoids clients waiting on election completion.

d-engine-core/src/raft_role/mod.rs (3)

79-88: Atomic leader-id ordering upgraded to Acquire/Release: good.
This matches the intended “publish leader info / observe leader info” pattern for cross-thread visibility.

Also applies to: 146-167


190-214: SharedState::update_voted_for() semantics look consistent with event-driven leader discovery.
The “only true on committed transition/leader-term change” contract is clear and matches how roles consume it.


48-61: HardState: Copy is valid; VotedFor implements Copy and serde traits.
The proto-generated VotedFor struct explicitly derives Clone, Copy, PartialEq, ::prost::Message, serde::Serialize, serde::Deserialize. All its fields (u32, u64, bool) are primitives, making Copy safe. Option<VotedFor> is therefore also Copy. The current derives on HardState are correct and the code compiles without issues.

d-engine-core/src/raft_role/leader_state.rs (1)

500-520: Verify comment: ensure voters() excludes self and remove the off-by-one concern.

The voters.len() + 1 calculation is correct—membership.voters() explicitly excludes the leader node (filtered by node.id != self.node_id), so adding 1 accounts for self. This is not an off-by-one error.

However, a maintainability concern remains: update_cluster_metadata() only updates total_voters and does not recompute is_single_node, while init_cluster_metadata() computes both. Since is_single_node_cluster() derives from the immutable initial_cluster_size, desynchronization is impossible at runtime. Consider adding a comment in update_cluster_metadata() explaining that is_single_node is immutable configuration and does not need recomputation, to prevent future confusion or refactoring mistakes.

d-engine-core/src/replication/replication_handler.rs (3)

634-671: build_append_request logic looks consistent with follower “virtual log” handling.
Defaulting to (prev_log_index, prev_log_term) = (0, 0) aligns with check_append_entries_request_is_legal’s special-case accept path.


60-67: Verify is_single_node consistency with actual cluster membership (lines 74-96).
cluster_metadata.is_single_node is used to determine quorum behavior; if this field diverges from the authoritative membership state, replication can be incorrectly skipped. Ensure either: (1) is_single_node is derived from Membership at call time, or (2) there is an assertion verifying consistency with Membership when cluster_metadata is instantiated.


149-164: No action needed. Single-node commit index advancement is correctly implemented with a dedicated code path that does not depend on peer match indices.

Comment on lines +59 to 60
#[cfg_attr(any(test, feature = "test-utils"), automock)]
#[async_trait]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

test-utils automock is enabled, but tests modules are still #[cfg(test)] only.

If the goal is to compile/use mocks under --features test-utils outside cfg(test) (e.g., integration tests in other crates), consider also widening local test module gating (e.g., batch_buffer_test) or documenting that test-utils only affects mock generation, not unit tests in this crate.

Also applies to: 13-15

🤖 Prompt for AI Agents
In d-engine-core/src/replication/mod.rs around lines 59-60 (and also apply same
change for lines 13-15), the automock attribute is enabled for the "test-utils"
feature but the test modules themselves remain gated with #[cfg(test)],
preventing mocks from being available when building with --features test-utils
outside of unit-test compilation; either change the test module gating to
#[cfg(any(test, feature = "test-utils"))] (or equivalent per-module cfg) so
mocks are compiled when the feature is enabled, or update crate documentation to
state that "test-utils" only controls mock generation and does not enable the
internal test modules — implement the cfg change for relevant test modules if
you want mocks usable by integration tests in other crates.

…ale cache

Bug: update_cluster_metadata() only updated total_voters, missing is_single_node.
Caused test_embedded_node_rejoin to hang intermittently.

Fix:
- Update BOTH is_single_node and total_voters in update_cluster_metadata()
- Call update_cluster_metadata() after learner promotion
- Add unit test to catch this bug
Race condition: gRPC servers start async, election RPCs timeout before ready.

Solution: Increase test election timeout 100ms→2000ms, retries 3→5.

Only affects tests/common/mod.rs test configuration.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
d-engine-core/src/raft_role/leader_state.rs (1)

1599-1608: Consider replacing println! with structured logging.

While the "Plan B" print statements serve immediate visibility needs, consider using the existing tracing infrastructure (info!/warn!) for consistency and production deployability.

Also applies to: 1985-1992

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b55a6f and bcf2239.

📒 Files selected for processing (3)
  • d-engine-core/src/raft_role/leader_state.rs (40 hunks)
  • d-engine-core/src/raft_role/leader_state_test.rs (1 hunks)
  • d-engine-core/src/raft_role/mod.rs (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
d-engine-core/src/raft_role/mod.rs (2)
d-engine-core/src/raft_role/leader_state.rs (6)
  • new (105-113)
  • new (2014-2055)
  • update_voted_for (290-295)
  • voted_for (284-286)
  • match_index (327-332)
  • next_index (297-306)
d-engine-core/src/raft_role/role_state.rs (5)
  • update_voted_for (295-300)
  • voted_for (289-291)
  • match_index (65-71)
  • node_id (42-44)
  • next_index (49-55)
🪛 GitHub Actions: CI
d-engine-core/src/raft_role/leader_state_test.rs

[error] 11-11: Rustfmt would modify this file: formatting changes detected (diffs shown at lines 11, 18, 27, 34, 41). Run 'cargo fmt' to apply formatting.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: audit
  • GitHub Check: Analyze (rust)
  • GitHub Check: audit
🔇 Additional comments (9)
d-engine-core/src/raft_role/mod.rs (4)

77-170: LGTM: Hot-path leadership tracking with correct memory ordering.

The atomic current_leader_id field with Acquire/Release semantics properly synchronizes with persistent state updates, addressing the previous review concern about memory ordering.


193-217: LGTM: Leader commitment detection logic is sound.

The boolean return value correctly identifies transitions to a committed leader vote, enabling event-driven leader discovery without hot-path overhead.


310-368: LGTM: Test-only accessor methods properly gated.

The cfg(any(test, feature = "test-utils")) guards correctly restrict these internal accessors to test contexts.


15-15: LGTM: Public API expansion aligns with v0.2.0 objectives.

The newly public exports (ClusterMetadata, as_i32, current_term, QuorumVerificationResult) support the broader API surface documented in the release.

Also applies to: 267-267, 305-305, 421-421

d-engine-core/src/raft_role/leader_state.rs (5)

116-127: LGTM: ClusterMetadata struct design is sound.

The cached topology metadata approach avoids repeated async membership queries in hot paths, with clear documentation of immutability constraints.


1416-1427: LGTM: Single-node commit optimization is correct.

Single-node clusters can commit immediately (quorum of 1), while multi-node clusters use the existing majority-based calculation. The cached metadata avoids async calls in this hot path.


1930-1933: LGTM: Cluster metadata updated after successful join.

Updating the cached metadata after membership changes ensures the hot-path commit logic uses current topology information.


2440-2442: LGTM: Leader ID set immediately in role transition.

Setting current_leader_id atomically during the CandidateStateLeaderState conversion ensures the hot-path current_leader() reads are correct from the moment the new leader is created.


500-520: > Likely an incorrect or invalid review comment.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
cliff.toml (1)

66-67: Consider removing the redundant commit_parser rule.

Line 67 with the docs? regex pattern matches both "doc" and "docs" followed by an issue number, making line 66's ^doc #\\d+: rule redundant. Both rules assign to the same "Documentation" group.

If line 66 is intended only as a comment or historical reference, consider removing it to adhere to the DRY principle and simplify the configuration:

  { message = "^feat #\\d+:", group = "Features" },
  { message = "^fix #\\d+:", group = "Bug Fixes" },
  { message = "^refs #\\d+:", group = "Work in progress" },
- { message = "^doc #\\d+:", group = "Documentation" },
  { message = "^docs? #\\d+:", group = "Documentation" },
  { message = "^perf #\\d+:", group = "Performance" },
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bcf2239 and f601479.

📒 Files selected for processing (2)
  • .github/workflows/commit-message-check.yml (1 hunks)
  • cliff.toml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/commit-message-check.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: audit
  • GitHub Check: Analyze (rust)
  • GitHub Check: audit

…tations

- Add Project Status section to README (pre-1.0 + compatibility promise)
- Simplify MIGRATION_GUIDE (remove NodeBuilder API, keep WAL format only)
- Streamline CHANGELOG v0.2.0 highlights (code examples + concise features)
- Update benchmark reports with actual test data (Dec 13, 2025)
- Restructure performance report (TL;DR first, detailed results collapsed)

Focus: Lower cognitive load for new users while maintaining transparency
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
README.md (1)

259-259: Fix typo in license section.

Line 259 contains "d-eninge" which should be "d-engine".

Apply this diff:

-d-eninge is licensed under the terms of the [MIT License](https://en.wikipedia.org/wiki/MIT_License#License_terms)
+d-engine is licensed under the terms of the [MIT License](https://en.wikipedia.org/wiki/MIT_License#License_terms)
🧹 Nitpick comments (5)
benches/d-engine-bench/reports/v0.1.4/report_20251205_remeasured.md (1)

3-3: Optional: Apply consistent date formatting per style guide.

LanguageTool suggests that some style guides recommend commas to set off the year in month-day-year date format. If your project follows a formal style guide, consider updating both date mentions.

Apply this diff to add commas after the year:

-**Date:** December 13, 2025  
+**Date:** December 13, 2025,
-**Test Date:** December 13, 2025  
+**Test Date:** December 13, 2025,

Also applies to: 185-185

d-engine-client/src/kv_client.rs (3)

20-25: Example return type inconsistent with actual API.

The example uses Result<()> but trait methods return KvResult<()>. Consider updating for consistency:

-//! async fn store_config<C: KvClient>(client: &C) -> Result<()> {
+//! async fn store_config<C: KvClient>(client: &C) -> KvResult<()> {

80-105: Consider documenting TTL edge case behavior.

The behavior of ttl_secs = 0 is not specified. Does it mean "no expiration" or "immediate expiration"? Clarifying this edge case in the documentation would help implementers maintain consistent semantics.


168-171: API signature inconsistency with other methods.

get_multi requires &[Bytes] while other methods accept impl AsRef<[u8]> + Send. This forces callers to pre-convert keys to Bytes for batch operations, whereas single-key operations accept byte slices directly.

If this is intentional for performance (avoiding per-key allocations), consider documenting the rationale. Otherwise, a more flexible signature could improve ergonomics:

async fn get_multi<K: AsRef<[u8]> + Send>(
    &self,
    keys: &[K],
) -> KvResult<Vec<Option<Bytes>>>;
benches/d-engine-bench/reports/v0.2.0/report_v0.2.0_final.md (1)

260-260: Minor style suggestion: Add comma after year in date.

Some style guides suggest commas after the year in month-day-year format. Currently "December 13, 2025" could be "December 13, 2025," (comma before or after depending on guide).

This is purely stylistic and can be deferred.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f601479 and 8204c5c.

⛔ Files ignored due to path filters (1)
  • benches/d-engine-bench/reports/v0.2.0/dengine_comparison_v0.2.0.png is excluded by !**/*.png
📒 Files selected for processing (6)
  • CHANGELOG.md (1 hunks)
  • MIGRATION_GUIDE.md (1 hunks)
  • README.md (4 hunks)
  • benches/d-engine-bench/reports/v0.1.4/report_20251205_remeasured.md (1 hunks)
  • benches/d-engine-bench/reports/v0.2.0/report_v0.2.0_final.md (1 hunks)
  • d-engine-client/src/kv_client.rs (1 hunks)
🧰 Additional context used
🪛 LanguageTool
benches/d-engine-bench/reports/v0.2.0/report_v0.2.0_final.md

[style] ~260-~260: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ... v0.2.0 Report Date: December 13, 2025 Test Environment: Apple M2 Mac mi...

(MISSING_COMMA_AFTER_YEAR)

benches/d-engine-bench/reports/v0.1.4/report_20251205_remeasured.md

[style] ~3-~3: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ...enchmark Report Date: December 13, 2025 Hardware: Apple Mac mini M2 (8-co...

(MISSING_COMMA_AFTER_YEAR)


[style] ~185-~185: Some style guides suggest that commas should set off the year in a month-day-year date.
Context: ...ne v0.1.4 Test Date: December 13, 2025 Benchmark Runs: Single run per co...

(MISSING_COMMA_AFTER_YEAR)

🪛 markdownlint-cli2 (0.18.1)
MIGRATION_GUIDE.md

71-71: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: audit
  • GitHub Check: Analyze (rust)
  • GitHub Check: audit
🔇 Additional comments (12)
benches/d-engine-bench/reports/v0.1.4/report_20251205_remeasured.md (1)

1-187: Excellent benchmark documentation structure.

This remeasured baseline report is well-organized and provides clear, reproducible performance data for v0.1.4. The sections are logically arranged, test commands are fully specified, and the comparison with the original report includes reasonable context for differences. This provides a solid foundation for comparing v0.2.0 performance.

d-engine-client/src/kv_client.rs (4)

32-52: Well-designed trait definition.

Good use of async_trait with explicit Send + Sync bounds. The documentation clearly explains consistency guarantees and performance characteristics for each implementation.


74-78: LGTM!

The generic bounds impl AsRef<[u8]> + Send provide good flexibility for callers while maintaining type safety.


136-139: LGTM!

Consistent API design with clear semantics for missing/expired keys returning None.


173-197: LGTM!

Idempotent delete semantics are appropriate for a distributed KV store, avoiding race conditions and simplifying retry logic.

MIGRATION_GUIDE.md (1)

42-90: Migration strategies are clear and well-documented.

The three migration options (Clean Start, Rolling Upgrade, Snapshot-based) are practical, actionable, and appropriately scoped for different deployment scenarios. The verification section and timeline are helpful for operators. No concerns.

CHANGELOG.md (1)

7-86: v0.2.0 release entry is comprehensive and well-structured.

The new release notes clearly articulate feature value, provide actionable examples, explicitly call out breaking changes with migration guidance, and back performance claims with benchmarks. Structure and tone are professional. No concerns.

benches/d-engine-bench/reports/v0.2.0/report_v0.2.0_final.md (2)

3-20: TL;DR is clear and provides honest hardware context.

The executive summary effectively communicates v0.2.0 advantages while explicitly noting hardware differences between test setups. This transparency is important for readers evaluating the results. Good practice.


86-134: Reproduction steps are detailed and executable.

The benchmark commands include all necessary flags (endpoints, connections, clients, key/value sizes, consistency levels) for reproducibility. Commands span single-client, high-concurrency, and hot-key scenarios. Clear structure. No concerns.

README.md (3)

37-53: v0.2.0 features and core capabilities are well-articulated.

The distinction between new features (Modular Workspace, TTL/Lease, Watch API, etc.) and core capabilities (Single-Node Start, Strong Consistency, Pluggable Storage) is clear and helpful. Feature descriptions are concise with concrete examples (e.g., "<0.1ms latency" for LocalKvClient).


15-31: Project Status section sets proper expectations for pre-1.0 users.

Clearly states that breaking changes may occur between minor versions, explicitly references MIGRATION_GUIDE.md for compatibility promise, and provides timeline context (v1.0 expected Q2 2026). This is valuable transparency for production users. Approach is professional and encourages adoption while managing expectations.


57-116: Quick Start reorganization is significantly improved.

Splitting installation guidance into three distinct modes (Embedded/Standalone/Custom Storage) with use cases, TOML examples, and "when to use" guidance resolves the ambiguity that likely existed in v0.1.x. Each option includes:

  • Clear TOML syntax for each mode
  • Use case examples (e.g., "Go/Python/Java apps for Standalone")
  • Performance note for context
  • Link to relevant documentation

This structure is intuitive for both new users and experienced operators choosing deployment architecture.

…ments

This PR consolidates architecture review improvements and critical bug fixes for v0.2.0 release.

## 🎯 Critical Fixes

- **#212**: Fix learner promotion stuck (voter count + role transition bugs)
- **#218**: Fix leader next_index initialization for new learners
- **#222**: Return NOT_LEADER with leader metadata for client redirection
- **#209**: Fix node restart wait_ready() timeout (leader notification race)

## ✨ Feature Additions

- **#213**: Implement READ_ONLY Learner nodes for permanent analytics
- **#219**: Pre-generate Go protobuf code for zero-config experience

## 🔧 Refactoring & Performance

- **#223**: Optimize check_learner_progress() lock contention
  - Extract 5 helper methods (SRP), add 11 unit tests
- **#211**: Remove Arc::get_mut anti-pattern from lease injection
- **#210**: Simplify watch architecture (tokio::broadcast, 90% code sharing)
- **#217**: Refactor Node::run() with strategy pattern
- **#209**: Consolidate Raft unit tests (migrate 27 tests from server to core)

## 📚 Documentation

- Restructure quick-start docs (embedded + standalone examples)
- Add integration-modes.md and use-cases.md
- Delete 911 lines of internal architecture docs (20/80 principle)
- Fix Go client example with pre-generated protobufs

## 🧪 Testing & Validation

- 14+ new unit tests across components
- Fix flaky tests and timing issues (#204, #209)
- Optimize test suite with nextest
- All 430 core + 305 server + 292 integration tests passing

## 📊 Performance

- Benchmarks stable (±3% variance)
- CI optimized: removed benchmark compile check (saves 2-3 min/run)

---

**Closes:** #209, #210, #211, #212, #213, #217, #218, #219, #222, #223

**Migration Notes:**
- Config: `raft.watch.enabled` removed (breaking)
- API: `StateMachine::start()` changed to async (breaking)
- NodeStatus enum refactored: PROMOTABLE/READ_ONLY/ACTIVE (non-breaking)

**Files Changed:** 100+ files, ~5000 insertions, ~1500 deletions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants