Skip to content

feat #236: Implement ReadIndex batching mechanism to improve LinearizableRead throughput#240

Merged
JoshuaChi merged 15 commits intodevelopfrom
feature/236_readindex
Jan 12, 2026
Merged

feat #236: Implement ReadIndex batching mechanism to improve LinearizableRead throughput#240
JoshuaChi merged 15 commits intodevelopfrom
feature/236_readindex

Conversation

@JoshuaChi
Copy link
Copy Markdown
Contributor

@JoshuaChi JoshuaChi commented Jan 9, 2026

Type

  • New feature
  • Bug Fix

Related Issues

Checklist

  • The code has been tested locally (unit test or integration test)
  • Squash down commits to one or two logical commits which clearly describe the work you've done.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added linearizable read batching optimization for improved performance with configurable thresholds.
    • Introduced read batching configuration for tuning linearizable read behavior.
  • Performance Improvements

    • Published v0.2.2 benchmark report with enhanced performance metrics.
    • Optimized leadership verification and state machine sync pathways.
  • Updates

    • Refreshed example and benchmark naming conventions for clarity.
    • Enhanced contributor documentation and guidelines.

✏️ Tip: You can customize this high-level summary in your review settings.

…state machine apply

Changes:
- Skip commit_index update wait when current commit >= readIndex
- Add tracking for last_state_machine_applied_index
- Improve ReadIndex state management in leader
- Add comprehensive test coverage for optimization scenarios

This resolves the redundant wait issue where readers would unnecessarily
block on state machine apply even when the commit_index already satisfies
the read requirement.
Changes:
- Remove unstable latency ratio assertion in integration test
- Add unit tests for slow state machine apply scenario
- Fix CI disk space cleanup to preserve build toolchain
- Remove unused tracing imports

The integration test now focuses on functional correctness rather than
performance timing, which was unreliable in CI environments. Unit tests
provide precise coverage of the optimization logic using mocks.

CI fix: Preserve GCC/Clang toolchain by disabling large-packages cleanup,
resolving "stdbool.h not found" error during RocksDB compilation.
…tion

- Add read_buffer to LeaderState for batching LinearizableRead requests
- Implement dual-threshold flush mechanism (size: 50 requests, time: 10ms)
- Remove redundant needs_verification check in flush_read_buffer
- Update lease timestamp after successful verification for LeaseRead optimization
- Add drain_read_buffer error handling during role transitions
- Add comprehensive unit tests for batching behavior and edge cases

This change reduces leadership verification overhead by batching multiple
LinearizableRead requests and performing single quorum check per batch,
significantly improving read throughput under high concurrency.
…functions

- Remove verify_leadership_for_read() and verify_leadership_limited_retry()
- Introduce verify_leadership_and_refresh_lease() for unified Read path
- Update verify_leadership_persistent() to auto-refresh lease after Config changes
- Align with Raft best practices: fast-fail for reads, persistent retry for Config

Impact: Fixes linearizable read batching bugs (case6_2, case6_3)
Architecture: 4-layer → 3-layer call chain, net -145 lines
…contribution guidelines

Changes:
- Test Migration: Move 9 LinearizableRead unit tests from server to core package
  * Created leader_state_linearizable_read_test.rs (430 lines)
  * Tests for calculate_read_index, wait_until_applied, on_noop_committed
  * All 566 core tests passing

- Contribution Guidelines: Clarify project philosophy to protect focus
  * CONTRIBUTING.md: Add 20/80 rule, Issue-first workflow, out-of-scope boundaries
  * README.md: Add Maintainer Philosophy and Contribution Guide sections
  * .github: Update PR/Issue templates to enforce quality standards
  * Removed blank issue template to require structured reports

Rationale:
- Core unit tests belong in core package (single responsibility)
- Clear guidelines prevent feature bloat and maintain project direction
Fixed test_batch_promotion_failure hang (28.3s) by syncing node_config timeout in TestFixture.
Added LeaderState benchmarks + optimized 11 existing benchmarks by moving StateMachine/Engine creation outside iteration loops.
Benchmark suite now runs in <2min (was 15min+) with <10% measurement overhead (was 90%+).
…d_test.rs

## Summary
Migrated 29 client read tests from multiple files into a unified test suite,
renamed unclear test names, enhanced documentation, and added 3 P0 missing tests.

## Changes

### 1. File Reorganization
- Renamed: leader_state_linearizable_read_test.rs → leader_state_client_read_test.rs
- Migrated 17 tests from server layer (d-engine-server/leader_state_test.rs)
- Migrated 12 batching tests from core layer (d-engine-core/leader_state_test.rs)
- Deleted duplicate tests from original locations

### 2. Test Renaming (Clarity Improvement)
- case6_1 → test_linearizable_read_quorum_failure
- case6_2 → test_linearizable_read_quorum_success
- case6_3 → test_linearizable_read_encounters_higher_term

### 3. Documentation Enhancement
All tests now include:
- Test Scenario description
- Given/When/Then structure
- Raft Protocol Context (where applicable)
- Clear English comments following project standards

### 4. New P0 Tests Added
1. test_mixed_read_policies_in_batch
   - Verifies batch processing of multiple LinearizableRead requests
   - Validates batch optimization (single quorum check)

2. test_lease_reuse_after_linearizable_read_refresh
   - Validates lease timestamp refresh after LinearizableRead
   - Verifies LeaseRead can reuse refreshed lease (no redundant quorum)

3. test_eventual_consistency_ignores_stale_lease
   - Confirms EventualConsistency skips lease validation
   - Documents trade-off: performance vs linearizability guarantee

## Test Coverage
- Total tests: 32 (29 migrated + 3 new)
- All tests passing: ✅
- Test categories:
  * LinearizableRead core functions (9 tests)
  * Read policy handling (8 tests)
  * Read batching (15 tests)

## Rationale
Consolidating client read tests improves:
- Maintainability (single source of truth)
- Discoverability (unified test location)
- Documentation (consistent comment style)
- Coverage (filled critical gaps)

Follows d-engine principle: Simple, clean, maintainable code with narrow impact scope.
Copilot AI review requested due to automatic review settings January 9, 2026 14:58
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 9, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This PR implements ReadIndex batching for linearizable reads to improve throughput, reorganizes the benchmark suite (renaming d-engine-bench to standalone-bench and examples/three-nodes-cluster to three-nodes-standalone), updates Raft configuration with new snapshot, membership, watch, and RPC compression settings, refactors leadership verification from limited-retry to persistent checks, and adds comprehensive integration tests for linearizable read optimization and batching semantics.

Changes

Cohort / File(s) Summary
GitHub Templates & PR/Issue Configuration
.github/ISSUE_TEMPLATE/blank.md, .github/ISSUE_TEMPLATE/feature_request.md, .github/PULL_REQUEST_TEMPLATE.md, .github/workflows/ci.yml
Deleted blank issue template; redesigned feature_request template with production-focused context, impact/complexity assessment, and pre-submission checklist; restructured PR template for clearer acceptance criteria and reviewer guidance; added free-disk-space CI step before Rust toolchain installation.
Contributor & Project Documentation
CONTRIBUTING.md, README.md, d-engine/src/docs/...
Replaced prescriptive contributing guide with philosophy-driven version emphasizing principles over procedures; updated performance benchmarks in README to reference v0.2.2 reports with embedded-mode metrics (4.6x write, 2x read); added Maintainer Philosophy section; updated all doc path references from three-nodes-cluster to three-nodes-standalone.
Benchmark Infrastructure Reorganization
Cargo.toml, benches/standalone-bench/*, benches/embedded-bench/*
Renamed benches/d-engine-benchbenches/standalone-bench and examples/three-nodes-clusterexamples/three-nodes-standalone throughout codebase; updated Cargo.toml workspace excludes and dependencies to use local path instead of versioned; updated all benchmark reports (v0.1.2–v0.2.2) and Makefiles to reflect new binary names and paths.
New Benchmark Reports
benches/embedded-bench/reports/v0.2.2/embedded_bench_report_v0.2.2.md, benches/standalone-bench/reports/v0.2.2/report_v0.2.2.md
Added comprehensive v0.2.2 embedded and standalone benchmark reports with detailed performance metrics (throughput, latency, p50/p99), consistency trade-offs, reproduce instructions, and key insights on ReadIndex batching gains.
Raft Configuration Expansion
config/base/raft.toml
Significantly expanded configuration with new sections: learner_catchup_threshold (renamed from learner_raft_log_gap), learner_check_throttle_ms; new raft.membership section (healthcheck, leadership verification, maintenance intervals, zombie/promotion thresholds); expanded raft.snapshot with storage paths, chunk size, bandwidth limits, transfer timeouts, and retry policies; added raft.read_consistency.read_batching (size/time thresholds); expanded raft.rpc_compression for per-traffic-type tuning; introduced raft.watch section for event queue and watcher buffer configuration.
ReadIndex Batching Core Implementation
d-engine-core/src/config/raft.rs, d-engine-core/src/errors.rs, d-engine-core/src/event.rs
Added ReadBatchingConfig struct with size_threshold (50) and time_threshold_ms (10); extended ReadConsistencyConfig with read_batching field; introduced ReadIndexError enum (NotLeader, Timeout, VerificationFailed) with From conversion to Consensus/Replication errors; added FlushReadBuffer event variant for timeout-driven read buffer flushes.
Raft Role & Leadership State Refactoring
d-engine-core/src/raft.rs, d-engine-core/src/raft_role/role_state.rs, d-engine-core/src/raft_role/mod.rs
Added eager read-buffer draining on role transitions (BecomeFollower, BecomeCandidate, BecomeLearner); added on_noop_committed hook for Leader to track no-op index; replaced verify_leadership_limited_retry with persistent leadership verification; added trait methods on_noop_committed and drain_read_buffer to RaftRoleState; added corresponding delegation methods to RaftRole.
Leader State Read Batching Logic
d-engine-core/src/raft_role/leader_state.rs
Added public read_buffer (vector of pending ClientReadRequest/response pairs) and read_buffer_start_time fields; introduced calculate_read_index, wait_until_applied, drain_read_buffer, verify_leadership_and_refresh_lease, and process_linearizable_read_batch methods; changed noop_log_id visibility to pub with #[doc(hidden)]; replaced multiple iterative verify_leadership_limited_retry calls with single-shot persistent verification in config/membership/read paths.
Non-Leader Role FlushReadBuffer Handling
d-engine-core/src/raft_role/candidate_state.rs, d-engine-core/src/raft_role/follower_state.rs, d-engine-core/src/raft_role/learner_state.rs
Added FlushReadBuffer event handlers in Candidate, Follower, and Learner states, each returning RoleViolation errors indicating Leader is required; tightened error handling for InstallSnapshotChunk in Candidate.
Comprehensive Read Batching & Linearizable Read Tests
d-engine-core/src/raft_role/leader_state_client_read_test.rs, d-engine-server/tests/linearizable_read/...
Added 2,377+ lines of unit/integration tests for LeaderState read paths covering read index calculation, state-machine waits, no-op tracking, linearizable vs. eventual consistency policies, read batching (size/time thresholds, role-change drains), lease reuse, and policy override behavior; added integration tests for linearizable-read consistency with concurrent writes, fixed read_index tracking, and 3-node cluster scenarios with rocksdb storage.
Non-Leader Role Test Coverage
d-engine-core/src/raft_role/candidate_state_test.rs, d-engine-core/src/raft_role/follower_state_test.rs, d-engine-core/src/raft_role/learner_state_test.rs
Added test modules verifying Candidate, Follower, and Learner reject FlushReadBuffer events with RoleViolation errors and that drain_read_buffer returns NotLeader.
Benchmark Optimization & New Benchmarks
d-engine-server/benches/leader_state_bench.rs, d-engine-server/benches/lease_performance.rs, d-engine-server/benches/state_machine.rs, d-engine-server/benches/ttl.rs, d-engine-server/benches/watch_overhead.rs
Added new leader_state_bench module (LeaderState creation, promotion processing); refactored existing benches to move resource setup (engine, state machine) outside iteration loops, reducing per-iteration overhead and improving measurement accuracy.
Leader State Integration Test Updates
d-engine-server/tests/components/raft_role/leader_state_test.rs
Added per-test RaftNodeConfig injections; adjusted read_batching.size_threshold to 1 for immediate flush in many tests; removed state_machine_upto_commit_index tests; added test_state_size memory footprint validation with warnings and cache-line guidance.
Integration Test Module Addition
d-engine-server/tests/integration_test.rs
Added linearizable_read integration test module declaration.
Example & Documentation Path Updates
d-engine-server/README.md, d-engine/src/docs/examples/..., examples/README.md, examples/quick-start-*.md, examples/service-discovery-*.md
Updated all example references and documentation links from three-nodes-cluster to three-nodes-standalone; updated Cargo.toml in quick-start-embedded to use local d-engine path with rocksdb feature.
Jepsen Test Infrastructure Cleanup
examples/three-nodes-cluster/docker/jepsen/sshkeys/...
Removed private SSH key (id_rsa) and related authorized_keys/known_hosts entries from repository.
Standalone Example Documentation
examples/three-nodes-standalone/docker/jepsen/README.md
Updated example cluster name reference from three-nodes-cluster to three-nodes-standalone.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Leader
    participant ReadBuffer as Read Buffer
    participant QuorumVerify as Quorum Verification
    participant StateMachine as State Machine

    Client->>Leader: LinearizableRead Request (arrives at t=0)
    activate Leader
    Leader->>ReadBuffer: Add request to buffer
    ReadBuffer->>ReadBuffer: Start timeout (time_threshold_ms)
    
    alt Buffer reaches size_threshold OR timeout fires
        ReadBuffer->>Leader: Flush signal (size/time trigger)
    else
        Client->>Leader: More LinearizableRead Requests
        Leader->>ReadBuffer: Add to same batch
    end
    
    Leader->>QuorumVerify: Single verify_leadership call for entire batch
    activate QuorumVerify
    QuorumVerify-->>Leader: Verified at commit_index C
    deactivate QuorumVerify
    
    Leader->>StateMachine: wait_until_applied(C)
    activate StateMachine
    StateMachine-->>Leader: Applied up to index C
    deactivate StateMachine
    
    Leader->>ReadBuffer: Respond all buffered requests with read_index=C
    ReadBuffer-->>Client: Linearizable responses (all from same snapshot)
    deactivate Leader
Loading
sequenceDiagram
    participant Node as Node<br/>(Any Role)
    participant RaftRole as RaftRole<br/>State Machine
    participant RoleState as Current Role<br/>(Candidate/Follower/Learner)

    Node->>RaftRole: Transition triggered (e.g., BecomeFollower)
    activate RaftRole
    RaftRole->>RoleState: drain_read_buffer()
    activate RoleState
    RoleState-->>RaftRole: Returns error (NotLeader)
    deactivate RoleState
    
    alt Role is Leader
        RaftRole->>RoleState: drain_read_buffer() succeeds
        RoleState->>RoleState: Respond all pending reads with error
    else Role is Non-Leader
        RaftRole->>RaftRole: Log drain failure warning
    end
    
    RaftRole->>RaftRole: Complete role transition
    deactivate RaftRole
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

The PR introduces significant architectural changes to the read path (ReadIndex batching with buffering logic, timeout-driven flushes, state-machine synchronization), spans multiple interconnected components (core config, errors, events, leader/non-leader role states), adds 2,377+ lines of dense test logic covering edge cases and concurrency scenarios, reorganizes the entire benchmark infrastructure with renamed directories/binaries, and updates raft configuration with 133 new/modified lines covering snapshot, membership, watch, and RPC compression tuning. While many files involve documentation or mechanical renaming (lower effort), the core read-batching implementation and extensive test coverage require careful logic review to ensure correctness under concurrent load, role transitions, and timeout conditions.

Possibly related issues

Possibly related PRs

Poem

🐰 A hop through batching lanes so bright,
Where reads all flush at once—oh what delight!
No-ops tracked, buffers drained with care,
Linearizable whispers float through the air,
Leadership verified, state machines align—
Read batching optimization, how you shine!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main feature: implementing a ReadIndex batching mechanism to improve LinearizableRead throughput, which matches the primary changes in the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@JoshuaChi
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 9, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements ReadIndex batching to significantly improve LinearizableRead throughput by sharing a single leadership verification across multiple concurrent read requests, addressing issue #236.

Changes:

  • Implemented ReadIndex batching mechanism for linearizable reads with configurable size (50) and time (10ms) thresholds
  • Added comprehensive integration tests validating batching behavior and linearizability guarantees
  • Renamed three-nodes-cluster to three-nodes-standalone across examples and documentation for clarity

Reviewed changes

Copilot reviewed 69 out of 118 changed files in this pull request and generated no comments.

Show a summary per file
File Description
d-engine-core/src/raft_role/leader_state.rs Core batching logic: read buffer, flush triggers, batch processing with shared quorum verification
d-engine-server/tests/linearizable_read/read_batching_integration.rs Integration tests for batching (linearizability, timeout, size threshold scenarios)
d-engine-core/src/config/raft.rs Added ReadBatchingConfig with size_threshold and time_threshold_ms
config/base/raft.toml Configuration documentation for read batching parameters
d-engine-core/src/event.rs Added FlushReadBuffer event for timeout-triggered batch processing
benches/standalone-bench/reports/v0.2.2/report_v0.2.2.md Performance report showing 428% linearizable read improvement
examples/* Renamed three-nodes-cluster to three-nodes-standalone across all examples
Comments suppressed due to low confidence (3)

d-engine-core/src/raft_role/learner_state.rs:1

  • Error message incorrectly states 'attempted to create snapshot' when the actual operation is FlushReadBuffer (read buffer flushing), not snapshot creation.
    d-engine-core/src/raft_role/follower_state.rs:1
  • Error message incorrectly states 'attempted to create snapshot' when the actual operation is FlushReadBuffer (read buffer flushing), not snapshot creation.
    d-engine-core/src/raft_role/candidate_state.rs:1
  • Error message incorrectly states 'attempted to create snapshot' when the actual operation is FlushReadBuffer (read buffer flushing), not snapshot creation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
d-engine-server/benches/state_machine.rs (1)

166-177: State machine reuse causes measurement inconsistency across benchmark iterations.

The state machine is created once before benchmarking and reused across all iterations. Each benchmark iteration applies entries with fixed indices (always starting at index 1), so the same keys are overwritten repeatedly:

  • First iteration: inserts fresh keys
  • Subsequent iterations: overwrites existing keys

Since insert and update paths may have different performance characteristics (e.g., hash table lookups, state machine traversal), the measurements conflate insert and update overhead. The benchmark should either:

  • Use unique keys per iteration to measure consistent insert performance, or
  • Document the intended behavior as measuring update/overwrite performance

This pattern affects:

  • bench_apply_without_ttl and bench_apply_with_ttl (lines 166–195)
  • bench_batch_apply and bench_batch_apply_with_ttl (lines 273–305) with fixed start_index=1
  • bench_apply_without_watch, bench_apply_with_1_watcher, and bench_apply_with_100_watchers (lines 315+)

The comment at line 316 acknowledges the design but doesn't address the measurement validity concern.

d-engine-server/benches/ttl.rs (1)

172-186: State machine reuse causes key overwrites and measurements of update operations instead of inserts.

This benchmark creates the state machine once and reuses it across iterations. Each iteration calls create_entries_with_ttl(size, 1, 3600) with the same start_index=1, generating identical keys (key_ttl_1, key_ttl_2, etc.) across all three benchmark sizes. This means:

  • Iteration 1 (size=10): Inserts keys 1–10
  • Iteration 2 (size=100): Overwrites keys 1–10, inserts keys 11–100
  • Iteration 3 (size=1000): Overwrites keys 1–100, inserts keys 101–1000

The benchmark measures update/overwrite performance, not fresh insert performance. Use unique indices per iteration (e.g., create_entries_with_ttl(size, iteration * size, 3600)) or recreate the state machine for each iteration to measure consistent insert behavior.

The same issue applies to bench_varying_ttl_durations (line 278).

benches/standalone-bench/reports/v0.2.0/report_v0.2.0_final.md (1)

90-134: Repro commands likely use the wrong relative path for the bench binary.

You cd examples/three-nodes-standalone (Line 91) but then execute ./target/release/standalone-bench (e.g., Line 99), which likely won’t exist from that directory.

Proposed fix (pick one approach)
 ### Start Cluster

 ```bash
-cd examples/three-nodes-standalone
+cd examples/three-nodes-standalone
 make start-cluster

Run Benchmarks

# Single client write (10K requests)
-./target/release/standalone-bench \
+../../target/release/standalone-bench \
    --endpoints http://127.0.0.1:9081 --endpoints http://127.0.0.1:9082 --endpoints http://127.0.0.1:9083 \
    --conns 1 --clients 1 --sequential-keys --total 10000 \
    --key-size 8 --value-size 256 put

(Apply the same ../../target/release/standalone-bench change to the other benchmark blocks.)

d-engine-core/src/config/raft.rs (2)

960-995: Config deserialization bug risk: partial read_batching overrides likely fail without per-field serde defaults.

Because ReadBatchingConfig fields lack #[serde(default)] (or default = ...), specifying only one of the fields in TOML can make deserialization error out.

Proposed fix
 /// Read batching configuration
 #[derive(Clone, Debug, Serialize, Deserialize)]
 pub struct ReadBatchingConfig {
     /// Flush when buffer reaches this size
+    #[serde(default = "default_read_batching_size_threshold")]
     pub size_threshold: usize,

     /// Flush when first request ages beyond this (milliseconds)
+    #[serde(default = "default_read_batching_time_threshold_ms")]
     pub time_threshold_ms: u64,
 }

+const fn default_read_batching_size_threshold() -> usize {
+    50
+}
+
+const fn default_read_batching_time_threshold_ms() -> u64 {
+    10
+}
+
 impl Default for ReadBatchingConfig {
     fn default() -> Self {
         Self {
-            size_threshold: 50,
-            time_threshold_ms: 10,
+            size_threshold: default_read_batching_size_threshold(),
+            time_threshold_ms: default_read_batching_time_threshold_ms(),
         }
     }
 }

Also applies to: 966-983


1011-1020: Add validation for read_consistency.read_batching thresholds.

Guard against 0 (and potentially very large values) to avoid pathological behavior at runtime.

Proposed fix
 impl ReadConsistencyConfig {
     fn validate(&self) -> Result<()> {
         // Validate read consistency configuration
         if self.lease_duration_ms == 0 {
             return Err(Error::Config(ConfigError::Message(
                 "read_consistency.lease_duration_ms must be greater than 0".into(),
             )));
         }
+        if self.read_batching.size_threshold == 0 {
+            return Err(Error::Config(ConfigError::Message(
+                "read_consistency.read_batching.size_threshold must be greater than 0".into(),
+            )));
+        }
+        if self.read_batching.time_threshold_ms == 0 {
+            return Err(Error::Config(ConfigError::Message(
+                "read_consistency.read_batching.time_threshold_ms must be greater than 0".into(),
+            )));
+        }
         Ok(())
     }
 }
🤖 Fix all issues with AI agents
In @.github/workflows/ci.yml:
- Around line 54-63: The workflow step using the reusable action currently
references jlumbroso/free-disk-space@main which is unstable; change the uses
reference to a specific release tag (e.g., jlumbroso/free-disk-space@vX.Y.Z) to
pin the dependency and ensure reproducible CI; locate the step with name "Free
Disk Space (Ubuntu)" or the uses string "jlumbroso/free-disk-space@main", query
the action repo releases to find the latest stable tag, and replace "@main" with
that tag in the uses field.

In @benches/standalone-bench/reports/v0.2.2/report_v0.2.2.md:
- Around line 7-10: The TL;DR and conclusion contain conflicting performance
claims; reconcile them to exactly match the numerical results shown in the
tables referenced (update the Write percent speedup, Read speedup ratios, and
Unique/LeaseRead multipliers so they are numerically identical to the table
values), remove or correct any blanket statements like “all metrics improved” if
any table shows regressions, and ensure the same corrected wording appears in
the TL;DR and conclusion sections (also apply these fixes to the same phrasing
in the other occurrences noted around lines 37-43 and 270-276).
- Around line 23-24: Choose a single canonical date and make it consistent
across the report: update the "Test Date" header (currently "Test Date: January
9, 2026"), the footer "Report Date: December 13, 2025", and any inline
references such as "Core Optimizations (January 9, 2026)" so they all use the
same chosen date; specifically replace the inconsistent strings at the "Test
Date" line, the footer "Report Date" line, and the occurrences around lines
289-292 to the selected date.

In @d-engine-server/benches/lease_performance.rs:
- Around line 176-177: The benchmark bench_batch_ttl_operations reuses the same
StateMachine (created by create_sm_background and assigned to sm_bg) across
different batch sizes so apply_chunk reuses overlapping log indices and keys;
fix by instantiating a fresh StateMachine for each batch size (call
create_sm_background/construct a new StateMachine per batch size before
preparing/applying that size's entries) or by resetting the StateMachine state
(clearing last_applied_index and stored entries) between sizes and between
b.iter() iterations so apply_chunk sees monotonic indices and each benchmark run
starts from a clean state.

In @d-engine-server/benches/ttl.rs:
- Around line 248-257: The benchmark currently seeds 100 expired entries once
(outside the per-iteration closure) then runs ten noop cleanup triggers per
iteration so the first iteration removes most expired entries and subsequent
iterations measure the noop path; fix by repopulating or resetting state inside
each iteration: move the loop that creates the 100 expired entries (the code
that calls create_noop_entry and sm.apply_chunk) into the async closure passed
to group.bench_function (or alternatively recreate/reset the StateMachine
instance `sm` at the start of each iteration) so that each iteration starts with
the same 100 expired entries before running the ten cleanup noops.
- Around line 214-221: The benchmark currently creates the 50 expired and 50
active entries once and then measures cleanup by repeatedly applying a noop (in
the group.bench_function closure using create_noop_entry and sm.apply_chunk),
which means only the first iteration measures the intended "cleanup 50 expired
entries" case; subsequent iterations see no expired entries. Fix by repopulating
the expired entries inside each measured iteration (move the setup that creates
the 50 expired/50 active entries into the async closure executed per iteration)
or switch to iter_custom and manually start/stop timing so you can recreate
expired entries between timing intervals; ensure the code that calls
create_noop_entry and sm.apply_chunk remains the measured operation while setup
is redone per iteration.

In @d-engine-server/tests/linearizable_read/linearizable_read_optimization.rs:
- Around line 391-419: Replace the fixed paths
node1_config_path/node2_config_path/node3_config_path with TempDir-scoped paths:
create a tempfile::TempDir at the start of the test, join filenames like
"linear_read_n1.toml" etc. onto that tempdir.path() and use those paths in the
tokio::fs::write calls for node1_config, node2_config and node3_config; ensure
the TempDir is held for the test lifetime so files are cleaned up automatically
to avoid cross-test collisions.
- Around line 43-63: ClusterConfig uses a randomized listen_address but
initial_cluster[0].address is hardcoded to "127.0.0.1:9081", causing mismatch;
update the test to compute the bind address once and reuse it for both
ClusterConfig.listen_address and NodeMeta.address (i.e., generate the port/addr
into a local variable and assign that same value to listen_address and
initial_cluster[0].address) so the node advertises the actual bound address for
membership and RPCs.

In @d-engine-server/tests/linearizable_read/read_batching_integration.rs:
- Around line 390-462: The assertion in test_batching_throughput_improvement is
flaky because it requires throughput2 >= throughput1; relax the check to allow a
small regression budget (e.g., require throughput2 >= throughput1 * 0.9) or
alternatively run multiple iterations of the benchmark (using NUM_REQUESTS
loops) and compare medians of throughput1/throughput2 before asserting; update
the regression check near throughput1/throughput2 and reference
test_batching_throughput_improvement, create_single_node_with_batching, and the
throughput1/throughput2 variables when making the change.
- Around line 48-68: ClusterConfig currently sets listen_address to a generated
ephemeral port but hardcodes initial_cluster[0].address to "127.0.0.1:9081",
causing mismatched advertised membership; change the code so the same address
string is used for both listen_address and NodeMeta.address (e.g., compute let
addr = format!("127.0.0.1:{}", port) / parse it once and assign addr.clone() to
initial_cluster[0].address) so ClusterConfig.listen_address and
initial_cluster[].address match; update references to ClusterConfig,
listen_address, initial_cluster, and NodeMeta accordingly.

In @d-engine/src/docs/examples/three-nodes-standalone.md:
- Around line 5-6: Update the broken relative links that point to the benches
directory: replace occurrences of "../../benches/..." with
"../../../../benches/..." for the two hrefs that currently use
"../../benches/..." and also update the link text that currently displays
"../../benches/..." to "../../../../benches/..." so the displayed text matches
the correct href (adjust the three instances of the "../../benches/..." string
in the markdown).
🟡 Minor comments (13)
d-engine-server/benches/watch_overhead.rs-203-206 (1)

203-206: Engine cleanup is skipped if benchmark panics before completion.

The cleanup blocks execute after all benchmark iterations, but if the benchmark panics (e.g., from an .expect() call), the engine.stop() code is never reached. While EmbeddedEngine's Drop implementation warns of improper shutdown, it does not automatically stop the engine, potentially leaving resources (ports, threads) unreleased.

For benchmarks this may be acceptable since resources are cleaned up on process exit, but consider whether Criterion provides panic-safe teardown hooks or a Drop guard pattern to improve CI/local development hygiene.

Also applies to: 253-256, 305-308

benches/embedded-bench/Makefile-2-2 (1)

2-2: Fix stale report references (v0.2.0 → v0.2.2) to avoid confusion.
Line 2 still points at the old report naming/version while the rest of the Makefile points to v0.2.2.

Also applies to: 211-211

benches/embedded-bench/README.md-27-29 (1)

27-29: Simplify the relative path reference to standalone-bench reports.

The path ../../benches/standalone-bench/reports/v0.2.2/report_v0.2.2.md is unnecessarily convoluted; from benches/embedded-bench/, use ../standalone-bench/reports/v0.2.2/report_v0.2.2.md instead. The filename report_v0.2.2.md is correct and confirmed to exist.

benches/embedded-bench/Makefile-78-78 (1)

78-78: Update "Comparable to" comments to use current embedded-bench CLI syntax.

The Makefile comments reference non-existent CLI flags (--conns) and subcommands (range) that embedded-bench does not support. Update lines 78, 97, 120, 138, 156, and 174 to reflect the actual current CLI: embedded-bench --mode local --total <N> --clients <N> <put|get> [--consistency <policy>] instead of the deprecated --conns and range references.

d-engine-core/src/raft_role/learner_state.rs-302-309 (1)

302-309: Fix misleading RoleViolation context for FlushReadBuffer.

This arm handles RaftEvent::FlushReadBuffer but the message says “attempted to create snapshot.” Suggest aligning the message with the actual operation.

Proposed fix
 RaftEvent::FlushReadBuffer => {
     return Err(ConsensusError::RoleViolation {
         current_role: "Learner",
         required_role: "Leader",
-        context: format!("Learner node {} attempted to create snapshot.", ctx.node_id),
+        context: format!("Learner node {} attempted to flush read buffer.", ctx.node_id),
     }
     .into());
 }
d-engine-core/src/raft_role/follower_state.rs-357-367 (1)

357-367: Fix misleading RoleViolation context for FlushReadBuffer.

The error says “attempted to create snapshot”, but this arm is for RaftEvent::FlushReadBuffer (ReadIndex batching). Also consider using self.node_id() for consistency.

Proposed fix
 RaftEvent::FlushReadBuffer => {
     return Err(ConsensusError::RoleViolation {
         current_role: "Follower",
         required_role: "Leader",
         context: format!(
-            "Follower node {} attempted to create snapshot.",
-            ctx.node_id
+            "Follower node {} attempted to flush read buffer.",
+            self.node_id()
         ),
     }
     .into());
 }
d-engine-core/src/raft_role/candidate_state.rs-400-410 (1)

400-410: FlushReadBuffer RoleViolation message is incorrect/misleading (mentions snapshot).

This looks copy/paste from snapshot handling; for RaftEvent::FlushReadBuffer the context should mention flushing/draining the read buffer (or batching flush), not snapshot creation. Also consider using self.node_id() for consistency with the rest of the role-state code.

Proposed fix
 RaftEvent::FlushReadBuffer => {
     return Err(ConsensusError::RoleViolation {
         current_role: "Candidate",
         required_role: "Leader",
         context: format!(
-            "Candidate node {} attempted to create snapshot.",
-            ctx.node_id
+            "Candidate node {} attempted to flush read buffer.",
+            self.node_id()
         ),
     }
     .into());
 }
d-engine-core/src/raft_role/follower_state_test.rs-15-49 (1)

15-49: Replace hard-coded /tmp paths with per-test temporary directories and use structured error matching.

  • Hard-coded /tmp paths can collide under parallel test runs and fail on non-Unix systems. Use tempfile::tempdir() or tempfile::TempDir to create isolated, platform-safe temporary directories per test.
  • String-based error matching (error_str.contains("...")) is brittle and breaks when debug formatting changes. Use structured matching via matches! on error enum variants (e.g., matches!(e, MembershipError::NotLeader)) instead.

This pattern is already established in the codebase (e.g., d-engine-server/tests/components/raft_role/follower_state_test.rs uses tempfile::tempdir()).

Also applies to: 55-80

d-engine-core/src/raft_role/mod.rs-10-19 (1)

10-19: Update drain_read_buffer documentation to accurately reflect behavior.

The doc comment "other roles are no-op" is misleading. The default implementation in role_state.rs returns Err(MembershipError::NotLeader), not a no-op. Update the comment to reflect this: "Only Leader drains the buffer; other roles return NotLeader error."

Good news: Call sites in raft.rs (lines 309, 330, 399) are already correctly placed in the handle_role_event() role transition handlers and properly handle the error with if let Err(e) = ... logging.

The on_noop_committed method is documented correctly and behaves as documented (true no-op for non-leaders).

benches/embedded-bench/reports/v0.2.2/embedded_bench_report_v0.2.2.md-72-80 (1)

72-80: “Performance Ladder” ordering is misleading when the multipliers are equal/near-equal.

If Eventual and LeaseRead are both ~1.8x vs Linearizable, prefer a tie notation and/or more precise ratios derived from the table values.

benches/standalone-bench/reports/v0.2.2/report_v0.2.2.md-168-170 (1)

168-170: Update the Go version in the environment note to Go 1.16 or newer.

The benchmark report lists Go 1.8.3 for etcd 3.5.0, but etcd 3.5.0 requires Go 1.16+ (per release notes and build documentation). Correct this at lines 168-170.

d-engine-server/tests/linearizable_read/read_batching_integration.rs-242-303 (1)

242-303: Timing-based assertions: widen bounds or use mocked time to avoid flaky failures.
E.g., “10–20ms” expectations can fail under noisy CI scheduling even if functionality is correct.

Also applies to: 305-371

d-engine-core/src/raft_role/role_state.rs-471-474 (1)

471-474: Fix copy/paste log message in drain_read_buffer default impl.
Currently logs "update_match_index NotLeader error" which is misleading and makes operational debugging harder.

Proposed fix
 fn drain_read_buffer(&mut self) -> Result<()> {
-    warn!("update_match_index NotLeader error");
+    warn!("drain_read_buffer NotLeader error");
     Err(MembershipError::NotLeader.into())
 }
🧹 Nitpick comments (20)
d-engine-server/tests/components/raft_role/leader_state_test.rs (1)

1831-1857: Appreciate the memory footprint documentation approach.

The soft guideline with historical tracking is a good practice for awareness without creating brittle tests. The cache-line calculation provides useful context for performance-conscious development.

Minor observation: The comment on line 1855 appears incomplete ("LeaderState spans {} cache lines" - missing the actual value reference), but this is benign since div_ceil(64) is printed on line 1856.

Optional: Complete the comment for consistency
-    // For reference: Modern CPU cache line is 64 bytes
-    // LeaderState spans {} cache lines
+    // For reference: Modern CPU cache line is 64 bytes
     println!("Cache lines occupied: {}", size.div_ceil(64));
d-engine-server/benches/state_machine.rs (2)

315-327: Clarify whether the benchmark measures insert or update performance.

The comment acknowledges that state machine initialization occurs outside the loop, but doesn't address that the same 100 keys (starting at index 1) are applied in every iteration. This means after the first iteration, all subsequent iterations measure update performance rather than insert performance.

If measuring insert performance is intended, use unique indices per iteration or recreate the state machine.


334-384: State machine reuse impacts watch benchmark measurements.

While the watch system is recreated per iteration (lines 338, 395, 452), the state machine is reused across iterations. Each iteration applies 100 entries with indices starting at 1, accumulating state:

  • Iteration 1: 100 fresh inserts
  • Iteration 2: 100 updates to existing keys
  • Iteration N: 100 updates plus accumulated overhead

This accumulation could affect the measured overhead of watch notifications. Consider using unique indices per iteration to ensure consistent measurement conditions.

Also applies to: 391-440, 448-497

d-engine-server/benches/lease_performance.rs (1)

145-148: Verify async iteration methodology change is intentional.

The benchmarks have been updated from synchronous iteration to b.to_async(&rt).iter() for async iteration. This changes what's being measured, as async iteration includes async runtime overhead in the benchmark timings.

This may provide more realistic results if the actual workload is async, but it's worth confirming:

  1. Is this methodology change intentional?
  2. Should historical benchmark results be considered incomparable with new results?
  3. Does this align with how the code is used in production?

Also applies to: 158-161, 184-197

examples/quick-start-embedded/Cargo.toml (1)

7-8: Path dependency is fine for in-repo examples, but consider a more “portable” pattern.

If users copy examples/quick-start-embedded out of the repo, path = "../../d-engine" will break. If portability matters, consider a workspace dependency (or document that this example is intended to be built from the repo root workspace).

d-engine-core/src/raft.rs (1)

308-312: Good safety improvement: drain read buffer on step-down; consider deduping the transition boilerplate.
The same “drain + warn” block appears in multiple transitions; a small helper (best-effort) would reduce drift risk.

Also applies to: 329-333, 398-402

README.md (1)

175-175: Fix redundant phrasing.

LanguageTool flagged line 175 as containing redundant phrasing. Review and simplify if needed.

✏️ Suggested edit

Current line 175:

- **Breaking changes are OK pre-1.0** - we prioritize getting it right over
  backward compatibility

Consider rewording for clarity (if redundancy exists in the actual whitespace/formatting):

- **Breaking changes are OK pre-1.0** - we prioritize correctness over backward compatibility
benches/standalone-bench/docker/Dockerfile (1)

24-24: Rename looks consistent; consider --locked for reproducible image builds.

Optional tweak
-RUN cargo build --release --bin standalone-bench
+RUN cargo build --release --locked --bin standalone-bench

Also applies to: 42-42

d-engine-core/src/raft_role/candidate_state_test.rs (1)

15-49: Replace hard-coded /tmp/ paths with tempfile::tempdir() and match error variants directly instead of debug-format strings.

Both tests use hard-coded /tmp/test_candidate_* paths that collide when tests run in parallel, and rely on brittle format!("{e:?}").contains(...) assertions that break if error debug formatting changes. The tempfile crate is already used elsewhere in the codebase (config_test.rs, state_machine_handler_test.rs); use it here. Replace debug-string matching with pattern matching on the concrete Error::Consensus(ConsensusError::RoleViolation { ... }) and ReplicationError::NotLeader { ... } variants.

Also applies to: 55-78

d-engine-core/src/raft_role/learner_state_test.rs (3)

17-49: Avoid brittle string-matching for error assertions (prefer type/variant matching).

format!("{e:?}") + contains(...) will break on error display/debug changes; prefer matches! against crate::Error/crate::ConsensusError variants if available.

Proposed refactor (illustrative)
+use crate::{ConsensusError, Error};
 ...
-    if let Err(e) = result {
-        let error_str = format!("{e:?}");
-        assert!(
-            error_str.contains("RoleViolation"),
-            "Error should be RoleViolation, got: {error_str}"
-        );
-        assert!(
-            error_str.contains("Learner"),
-            "Error should mention Learner role"
-        );
-        assert!(
-            error_str.contains("Leader"),
-            "Error should mention Leader as required role"
-        );
-    }
+    assert!(
+        matches!(
+            result,
+            Err(Error::Consensus(ConsensusError::RoleViolation {
+                current_role: "Learner",
+                required_role: "Leader",
+                ..
+            }))
+        ),
+        "Learner should reject FlushReadBuffer with RoleViolation(Learner -> Leader), got: {result:?}"
+    );

18-19: Hard-coded /tmp/... paths can make tests flaky under parallelism/leftovers.

Consider using a per-test temp dir (or at least unique paths) to avoid collisions and make cleanup deterministic.

Also applies to: 58-59


48-49: Redundant drop(shutdown_tx).

Letting it fall out of scope is sufficient; keeping it is harmless but noise.

benches/standalone-bench/reports/v0.2.2/report_v0.2.2.md (1)

67-76: Table labeling is confusing: “Before (v0.2.2)” vs “After (optimized)” within a v0.2.2 report.

Consider renaming to “Baseline” vs “Optimized” and, ideally, include the commit SHAs/config diffs used for each run.

d-engine-server/benches/leader_state_bench.rs (2)

134-146: Benchmark likely measures Arc::clone/allocation noise more than LeaderState::new.
TempDir + config setup outside the iter is good, but consider also making the LeaderState::new bench avoid any hidden filesystem/config overhead inside new (if any).


156-216: Move per-iteration fixture setup outside the measured region using iter_batched on AsyncBencher.

The current benchmarks time TempDir creation, mock setup, and config cloning alongside process_pending_promotions. Use b.to_async(&rt).iter_batched(setup, routine) with BatchSize::SmallInput to separate fixture creation (setup) from the function being measured (routine).

This applies to lines 225–242 and 251–278.

d-engine-core/src/raft_role/leader_state_client_read_test.rs (1)

1237-1526: Avoid reloading config in tests when you can assert against state.node_config.
E.g., test_read_buffer_size_trigger/test_read_buffer_time_trigger call node_config(...) again, which can drift from the config actually used by the LeaderState.

config/base/raft.toml (2)

43-69: Config structure looks aligned with batching feature; ensure deserialization + defaults are covered.
Given new nested tables ([raft.read_consistency.read_batching], [raft.membership]), I’d strongly prefer a config-load unit test that parses this base file (or an override) and asserts the fields are populated.

Also applies to: 188-196


37-39: Consider fixing the config key typo (if it’s not locked for backward-compat).
rpc_peer_connectinon_monitor_interval_in_sec looks misspelled; if you can’t rename due to compatibility, consider supporting both keys with clear precedence.

d-engine-core/src/raft_role/leader_state.rs (2)

157-162: Avoid widening API surface just for tests (noop_log_id as pub).
Since you already added test_get_noop_log_id, prefer keeping the field non-public (or pub(super)) to reduce accidental coupling.

Also applies to: 251-259


2666-2732: Refactor: reuse wait_until_applied() inside process_linearizable_read_batch().
Right now the “wait for apply” logic is duplicated; calling the helper would keep semantics aligned and reduce future drift.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef0c131 and fbf63da.

⛔ Files ignored due to path filters (7)
  • benches/d-engine-bench/Cargo.lock is excluded by !**/*.lock
  • benches/embedded-bench/reports/v0.2.2/dengine_comparison_v0.2.2.png is excluded by !**/*.png
  • benches/standalone-bench/reports/v0.1.3/dengine_comparison_v0.1.3.png is excluded by !**/*.png
  • benches/standalone-bench/reports/v0.1.4/dengine_comparison_v0.1.4.png is excluded by !**/*.png
  • benches/standalone-bench/reports/v0.2.0/dengine_comparison_v0.2.0.png is excluded by !**/*.png
  • benches/standalone-bench/reports/v0.2.2/dengine_comparison_v0.2.2.png is excluded by !**/*.png
  • examples/three-nodes-cluster/docker/jepsen/sshkeys/id_rsa.pub is excluded by !**/*.pub
📒 Files selected for processing (111)
  • .github/ISSUE_TEMPLATE/blank.md
  • .github/ISSUE_TEMPLATE/feature_request.md
  • .github/PULL_REQUEST_TEMPLATE.md
  • .github/workflows/ci.yml
  • CONTRIBUTING.md
  • Cargo.toml
  • README.md
  • benches/embedded-bench/Makefile
  • benches/embedded-bench/README.md
  • benches/embedded-bench/reports/v0.2.0/embedded_bench_report_v0.2.0.md
  • benches/embedded-bench/reports/v0.2.2/embedded_bench_report_v0.2.2.md
  • benches/embedded-bench/src/main.rs
  • benches/standalone-bench/.gitignore
  • benches/standalone-bench/Cargo.toml
  • benches/standalone-bench/Makefile
  • benches/standalone-bench/README.md
  • benches/standalone-bench/docker/Dockerfile
  • benches/standalone-bench/docker/README.md
  • benches/standalone-bench/docker/docker-compose.yml
  • benches/standalone-bench/docker/soak-test.sh
  • benches/standalone-bench/reports/v0.1.2/report_20250419.md
  • benches/standalone-bench/reports/v0.1.3/report_20250901.md
  • benches/standalone-bench/reports/v0.1.4/report_20251011.md
  • benches/standalone-bench/reports/v0.1.4/report_20251205_remeasured.md
  • benches/standalone-bench/reports/v0.2.0/report_v0.2.0_final.md
  • benches/standalone-bench/reports/v0.2.2/report_v0.2.2.md
  • benches/standalone-bench/rust-toolchain.toml
  • benches/standalone-bench/rustfmt.toml
  • benches/standalone-bench/src/main.rs
  • config/base/raft.toml
  • d-engine-core/src/config/raft.rs
  • d-engine-core/src/errors.rs
  • d-engine-core/src/event.rs
  • d-engine-core/src/raft.rs
  • d-engine-core/src/raft_role/candidate_state.rs
  • d-engine-core/src/raft_role/candidate_state_test.rs
  • d-engine-core/src/raft_role/follower_state.rs
  • d-engine-core/src/raft_role/follower_state_test.rs
  • d-engine-core/src/raft_role/leader_state.rs
  • d-engine-core/src/raft_role/leader_state_client_read_test.rs
  • d-engine-core/src/raft_role/learner_state.rs
  • d-engine-core/src/raft_role/learner_state_test.rs
  • d-engine-core/src/raft_role/mod.rs
  • d-engine-core/src/raft_role/role_state.rs
  • d-engine-server/Cargo.toml
  • d-engine-server/README.md
  • d-engine-server/benches/leader_state_bench.rs
  • d-engine-server/benches/lease_performance.rs
  • d-engine-server/benches/state_machine.rs
  • d-engine-server/benches/ttl.rs
  • d-engine-server/benches/watch_overhead.rs
  • d-engine-server/tests/components/raft_role/leader_state_test.rs
  • d-engine-server/tests/integration_test.rs
  • d-engine-server/tests/linearizable_read/linearizable_read_optimization.rs
  • d-engine-server/tests/linearizable_read/mod.rs
  • d-engine-server/tests/linearizable_read/read_batching_integration.rs
  • d-engine/src/docs/examples/mod.rs
  • d-engine/src/docs/examples/single-node-expansion.md
  • d-engine/src/docs/examples/three-nodes-standalone.md
  • d-engine/src/docs/overview.md
  • d-engine/src/docs/performance/throughput-optimization-guide.md
  • d-engine/src/docs/quick-start-standalone.md
  • d-engine/src/docs/server_guide/customize-state-machine.md
  • examples/README.md
  • examples/quick-start-embedded/Cargo.toml
  • examples/quick-start-standalone/Makefile
  • examples/quick-start-standalone/README.md
  • examples/service-discovery-standalone/Makefile
  • examples/service-discovery-standalone/README.md
  • examples/three-nodes-cluster/docker/jepsen/sshkeys/authorized_keys
  • examples/three-nodes-cluster/docker/jepsen/sshkeys/id_rsa
  • examples/three-nodes-cluster/docker/jepsen/sshkeys/known_hosts
  • examples/three-nodes-standalone/.gitignore
  • examples/three-nodes-standalone/Cargo.toml
  • examples/three-nodes-standalone/LICENSE
  • examples/three-nodes-standalone/Makefile
  • examples/three-nodes-standalone/README.md
  • examples/three-nodes-standalone/config/n1.toml
  • examples/three-nodes-standalone/config/n2.toml
  • examples/three-nodes-standalone/config/n3.toml
  • examples/three-nodes-standalone/docker/.env
  • examples/three-nodes-standalone/docker/Dockerfile
  • examples/three-nodes-standalone/docker/README.md
  • examples/three-nodes-standalone/docker/config/n1.toml
  • examples/three-nodes-standalone/docker/config/n2.toml
  • examples/three-nodes-standalone/docker/config/n3.toml
  • examples/three-nodes-standalone/docker/config/n4.toml
  • examples/three-nodes-standalone/docker/config/n5.toml
  • examples/three-nodes-standalone/docker/docker-compose.yml
  • examples/three-nodes-standalone/docker/jepsen/Dockerfile
  • examples/three-nodes-standalone/docker/jepsen/Makefile
  • examples/three-nodes-standalone/docker/jepsen/README.md
  • examples/three-nodes-standalone/docker/jepsen/entrypoint.sh
  • examples/three-nodes-standalone/docker/jepsen/project.clj
  • examples/three-nodes-standalone/docker/jepsen/src/jepsen/d_engine.clj
  • examples/three-nodes-standalone/docker/jepsen/vendor/dengine_ctl/Cargo.toml
  • examples/three-nodes-standalone/docker/jepsen/vendor/dengine_ctl/README.md
  • examples/three-nodes-standalone/docker/jepsen/vendor/dengine_ctl/src/main.rs
  • examples/three-nodes-standalone/docker/monitoring/.env
  • examples/three-nodes-standalone/docker/monitoring/docker-compose.yml
  • examples/three-nodes-standalone/docker/monitoring/grafana/dashboards/dashboard.json
  • examples/three-nodes-standalone/docker/monitoring/grafana/dashboards/node_exporter_dashboard.json
  • examples/three-nodes-standalone/docker/monitoring/grafana/datasources/loki-datasource.yml
  • examples/three-nodes-standalone/docker/monitoring/grafana/datasources/prometheus-datasource.yml
  • examples/three-nodes-standalone/docker/monitoring/prometheus/Dockerfile
  • examples/three-nodes-standalone/docker/monitoring/prometheus/README.md
  • examples/three-nodes-standalone/docker/monitoring/prometheus/prometheus.yml
  • examples/three-nodes-standalone/docker/monitoring/promtail/config.yml
  • examples/three-nodes-standalone/samply_reports/.gitignore
  • examples/three-nodes-standalone/scripts/reset.sh
  • examples/three-nodes-standalone/src/main.rs
💤 Files with no reviewable changes (5)
  • examples/three-nodes-cluster/docker/jepsen/sshkeys/known_hosts
  • examples/three-nodes-cluster/docker/jepsen/sshkeys/authorized_keys
  • examples/three-nodes-cluster/docker/jepsen/sshkeys/id_rsa
  • benches/embedded-bench/reports/v0.2.0/embedded_bench_report_v0.2.0.md
  • .github/ISSUE_TEMPLATE/blank.md
🧰 Additional context used
🧬 Code graph analysis (13)
d-engine-server/benches/watch_overhead.rs (1)
d-engine-core/src/watch/manager.rs (2)
  • watcher_count (185-190)
  • key (58-60)
d-engine-core/src/raft_role/follower_state.rs (3)
d-engine-core/src/raft_role/role_state.rs (1)
  • node_id (42-44)
d-engine-server/src/node/client/local_kv.rs (1)
  • node_id (434-436)
d-engine-server/src/node/mod.rs (1)
  • node_id (342-344)
d-engine-core/src/raft_role/learner_state_test.rs (2)
d-engine-core/src/raft_role/mod.rs (1)
  • state (239-246)
d-engine-client/src/error.rs (1)
  • e (114-114)
d-engine-server/benches/leader_state_bench.rs (1)
d-engine-server/tests/components/raft_role/leader_state_test.rs (1)
  • new (3756-3803)
d-engine-core/src/raft_role/learner_state.rs (3)
d-engine-core/src/raft_role/role_state.rs (1)
  • node_id (42-44)
d-engine-server/src/node/client/local_kv.rs (1)
  • node_id (434-436)
d-engine-server/src/node/mod.rs (1)
  • node_id (342-344)
d-engine-core/src/raft_role/candidate_state_test.rs (2)
d-engine-core/src/raft_role/mod.rs (1)
  • state (239-246)
d-engine-client/src/error.rs (1)
  • e (114-114)
d-engine-server/tests/linearizable_read/read_batching_integration.rs (1)
d-engine-server/src/node/builder.rs (4)
  • storage_engine (183-189)
  • state_machine (192-198)
  • raft_config (210-216)
  • from_cluster_config (154-161)
d-engine-core/src/raft_role/candidate_state.rs (3)
d-engine-core/src/raft_role/role_state.rs (1)
  • node_id (42-44)
d-engine-server/src/node/client/local_kv.rs (1)
  • node_id (434-436)
d-engine-server/src/node/mod.rs (1)
  • node_id (342-344)
d-engine-server/benches/state_machine.rs (1)
d-engine-server/benches/ttl.rs (1)
  • create_test_state_machine (33-54)
d-engine-core/src/raft.rs (1)
d-engine-client/src/error.rs (1)
  • e (114-114)
d-engine-server/tests/linearizable_read/linearizable_read_optimization.rs (3)
d-engine-server/src/node/builder.rs (4)
  • storage_engine (183-189)
  • state_machine (192-198)
  • raft_config (210-216)
  • from_cluster_config (154-161)
d-engine-server/tests/components/raft_role/leader_state_test.rs (1)
  • new (3756-3803)
d-engine-server/src/api/embedded.rs (2)
  • client (501-503)
  • start_custom (277-335)
d-engine-server/tests/components/raft_role/leader_state_test.rs (4)
d-engine-server/src/node/builder.rs (1)
  • node_config (201-207)
d-engine-core/src/config/raft.rs (14)
  • default (107-125)
  • default (197-204)
  • default (266-273)
  • default (325-333)
  • default (378-384)
  • default (518-540)
  • default (646-650)
  • default (709-714)
  • default (734-739)
  • default (886-894)
  • default (977-982)
  • default (986-994)
  • default (1109-1117)
  • default (1219-1225)
d-engine-core/src/raft_role/leader_state.rs (2)
  • new (108-116)
  • new (2173-2217)
d-engine-core/src/raft_role/role_state.rs (1)
  • commit_index (220-222)
d-engine-server/benches/ttl.rs (2)
d-engine-server/benches/state_machine.rs (2)
  • create_test_state_machine (41-57)
  • create_entries_with_ttl (130-159)
d-engine-proto/src/exts/common_ext.rs (1)
  • noop (15-19)
🪛 LanguageTool
README.md

[style] ~175-~175: This phrase is redundant. Consider using “over”.
Context: ...OK pre-1.0** - we prioritize getting it right over backward compatibility This approach...

(RIGHT_OVER)

🪛 markdownlint-cli2 (0.18.1)
examples/service-discovery-standalone/README.md

33-33: Hard tabs
Column: 6

(MD010, no-hard-tabs)


33-33: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)


34-34: Hard tabs
Column: 6

(MD010, no-hard-tabs)


35-35: Hard tabs
Column: 6

(MD010, no-hard-tabs)


36-36: Hard tabs
Column: 6

(MD010, no-hard-tabs)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Lint and Format Check
  • GitHub Check: Lint and Format Check

Major changes:
- Makefile: Reduce command count from ~45 to ~33 (delete 12 obsolete targets)
- Rename test-all → test with integrated examples compilation check
- Add bench-save and automatic regression detection to bench target
- Add version consistency check and docs.rs simulation to pre-release
- Replace manual feature enumeration with --all-features

Documentation improvements:
- Add three-nodes-embedded example README with HA deployment guide
- Add ha-deployment-load-balancing.md for production HA patterns
- Add graceful shutdown handling to three-nodes-embedded example
- Fix #241: Data persistence bug (missing engine.stop() call)
- Add "Recommended Entry Point" section to overview.md

Breaking changes:
- Removed commands: test-all, test-verbose, test-doc, test-examples,
  docs-all, docs-private, docs-crate, doc, doc-check, bench-compile,
  bench-gate, build-examples
- Updated all references: test-all → test in README, CONTRIBUTING,
  PULL_REQUEST_TEMPLATE
… fix data loss bugs

Add production-ready storage layer with full crash recovery test coverage
and fix critical durability violations in both RocksDB and File storage engines.

Test Infrastructure (+590 lines):
- Add StateMachineTestSuite with crash recovery tests
  * test_drop_persists_last_applied
  * test_ungraceful_shutdown_recovery
  * test_data_survives_reopen
- Add StorageEngineTestSuite with durability tests
  * test_persist_entries_durability
  * test_hard_state_crash_recovery
  * test_truncate_durability
  * test_purge_durability
- Add test_purge_empty_range edge case coverage

Critical Fixes:
- RocksDB: Fix iterator bug causing data loss on restart
  * Iterator.last() with IteratorMode::End returned smallest key instead of largest
  * Caused last_index=1 when should be last_index=N
  * Fix: Use iterator.next() to get first (largest) key from End
- RocksDB: Add Drop implementations to flush on drop
- File: Implement physical deletion in purge() for durability
  * Previously only deleted from memory, entries reappeared after restart
  * Now rewrites file to physically remove purged entries
- File: Add Drop implementation for FileLogStore

Module Structure:
- Expose storage module as public API (d-engine-core/src/lib.rs)
- Rename rocksdb_engine_test.rs -> rocksdb_storage_engine_test.rs
- Add rocksdb_state_machine_test.rs
- Add integration test structure
## Problem

State machine read operations were not protected during snapshot
restoration, potentially causing race conditions. During PHASE 2.5
(temporary DB swap to release lock), `get()` could access the
temporary empty database, returning incorrect empty results instead
of proper errors.
Add integration test validating Learner can safely receive AppendEntries
during snapshot application, specifically testing PHASE 2.5 temporary DB
window safety.

Test scenario:
- Start 3-node cluster, write 150 entries (triggers snapshot at 100)
- Add readonly Learner (Node 4) which begins receiving snapshot
- Concurrently write 50 more entries while Learner applies snapshot
- Verify Learner catches up to all 200 entries with correct values

Key validations:
- Learner successfully joins cluster (role=3 Learner, status=1 readonly)
- Snapshot generation at entry 100 completes successfully
- Concurrent writes during snapshot application succeed
- Learner receives and applies snapshot without data loss
- PHASE 2.5 temporary DB swap does not cause corruption

Technical notes:
- Single Learner requires readonly mode to avoid being kicked out
- Uses get_eventual() for Learner reads (not get_linearizable())
- All nodes' initial_cluster includes Learner configuration upfront

Location: tests/snapshot/concurrent_replication_embedded.rs
@JoshuaChi JoshuaChi added this to the v0.2.2 milestone Jan 12, 2026
@JoshuaChi JoshuaChi moved this to In testing in d-engine Jan 12, 2026
Critical fixes:
- Fix partial leader init when on_noop_committed fails
- Fix address mismatch in test cluster configs (3 files)
- Fix benchmark state pollution (TTL + lease benchmarks)

Documentation:
- Add performance-overview.md as top-level entry point
- Refactor standalone/embedded reports with mode context
- Fix date inconsistency (Dec 13 -> Jan 9, 2026)
- Fix relative paths to benches directory
- Fix data contradictions in conclusion section
@JoshuaChi JoshuaChi merged commit 83be905 into develop Jan 12, 2026
4 checks passed
@github-project-automation github-project-automation bot moved this from In testing to Done in d-engine Jan 12, 2026
JoshuaChi added a commit that referenced this pull request Jan 12, 2026
- Recreate engine per iteration in watch_overhead benchmark to avoid state pollution
- Add bincode 1.3.3 to advisory ignore list (team considers it complete)

Fixes benchmark timeout caused by engine state accumulation across iterations.
Addresses cargo-deny advisory warning for unmaintained but stable dependency.
JoshuaChi added a commit that referenced this pull request Jan 12, 2026
Version Updates:
- Bump workspace version from 0.2.1 to 0.2.2
- Update CHANGELOG with v0.2.2 highlights
@coderabbitai coderabbitai bot mentioned this pull request Jan 14, 2026
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants