feat #174: Implement lock-free Watch mechanism for real-time key monitoring#177
Conversation
…toring

- Add WatchManager with crossbeam-channel for lock-free event queue
- Integrate watch notifications into StateMachine apply path (<0.01% overhead)
- Implement gRPC streaming Watch RPC with automatic cleanup
- Add WatchDispatcher for centralized stream lifecycle management
- Fix WatcherHandle lifetime issue using std::mem::forget pattern
- Add watch configuration to config/base/raft.toml with tuning guidelines
- Add comprehensive test coverage: 10 unit + 9 integration + 5 benchmarks
- Add developer documentation in d-engine-docs
- Fix flaky timer test: improve reset() deadline assertion logic

Performance targets achieved:

- Apply path overhead: <10ns per notification
- End-to-end latency: <100μs for event delivery
- Memory: ~2.4KB per watcher with default buffer size

Closes #174
> **Important**
> Review skipped: auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI.

**Walkthrough**

Introduces a watch feature for monitoring key changes in the state machine. Adds a WatchManager for event notification, extends the gRPC API with a Watch RPC method, integrates with the state machine to emit events on writes, includes configuration options, and implements a dispatcher for managing watch streams.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    actor Client
    participant gRPC Service
    participant WatchDispatcher
    participant WatchManager
    participant StateMachine

    Client->>gRPC Service: Watch(WatchRequest: key)
    gRPC Service->>WatchManager: register(key)
    WatchManager->>WatchManager: create watcher, store in registry
    WatchManager-->>gRPC Service: WatcherHandle + Receiver
    gRPC Service->>WatchDispatcher: register(WatchRegistration)
    WatchDispatcher->>WatchDispatcher: spawn WatchStreamHandler task
    WatchDispatcher-->>gRPC Service: ready
    gRPC Service-->>Client: stream opened

    rect rgb(200, 220, 240)
        Note over StateMachine,WatchManager: Write Path
        Client->>StateMachine: Write(key, value)
        StateMachine->>WatchManager: notify_put(key, value)
        WatchManager->>WatchManager: queue event (non-blocking)
        WatchManager-->>StateMachine: return
    end

    rect rgb(240, 220, 200)
        Note over WatchManager,Client: Dispatch Path (background thread)
        WatchManager->>WatchManager: read from event queue
        WatchManager->>WatchManager: lookup key watchers
        WatchManager->>WatchManager: send event to watcher channels
        WatchDispatcher->>Client: WatchResponse(Put event)
    end
```
**Estimated code review effort:** 🎯 4 (Complex) | ⏱️ ~50 minutes
**Pre-merge checks and finishing touches:** ✅ Passed checks (3 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
d-engine-core/src/state_machine_handler/default_state_machine_handler.rs (1)
**188-237: Guard watcher notifications behind successful apply**

`notify_watchers` fires even when `sm.apply_chunk(...)` fails, so clients will observe PUT/DELETE events that never actually made it into the state machine. That breaks watch semantics and can leave replicas inconsistent from the client's perspective. Move the notify call into the `Ok(_)` branch (after the state machine commit succeeds) so only successful applies publish events.

```diff
-        let apply_result = sm.apply_chunk(chunk.clone()).await;
-
-        // Notify watchers of changes (if watch manager is enabled)
-        if let Some(ref watch_mgr) = self.watch_manager {
-            self.notify_watchers(&chunk, watch_mgr);
-        }
+        let apply_result = sm.apply_chunk(chunk.clone()).await;
```

…and inside the `Ok(_)` arm:

```diff
 Ok(_) => {
+    if let Some(ref watch_mgr) = self.watch_manager {
+        self.notify_watchers(&chunk, watch_mgr);
+    }
```
🧹 Nitpick comments (9)
d-engine-server/src/test_utils/mock/mock_node_builder.rs (1)
**342-342: Consider simplifying the Arc dereference pattern.**

Line 342 creates an intermediate `Arc<RaftNodeConfig>` variable that's only used once on line 345 to dereference and clone. For clarity, consider either:

Option 1 (inline): Remove line 342 and inline the dereference:

```diff
-        let node_config = node_config_arc.clone();
         tokio::spawn(async move {
             if let Err(e) =
-                grpc::start_rpc_server(node_clone, listen_address, (*node_config).clone(), shutdown)
+                grpc::start_rpc_server(node_clone, listen_address, (*node_config_arc).clone(), shutdown)
                     .await
```

Option 2 (rename): Use a more descriptive variable name:

```diff
-        let node_config = node_config_arc.clone();
+        let node_config_for_rpc = node_config_arc.clone();
         tokio::spawn(async move {
             if let Err(e) =
-                grpc::start_rpc_server(node_clone, listen_address, (*node_config).clone(), shutdown)
+                grpc::start_rpc_server(node_clone, listen_address, (*node_config_for_rpc).clone(), shutdown)
                     .await
```

Also applies to: 345-345
d-engine-docs/src/docs/server_guide/watch-feature.md (2)
**21-48: Consider using `rust` syntax highlighting for code examples.**

The code block uses `text` as the language identifier, but since this is Rust code, using `rust` would provide proper syntax highlighting for readers.

Apply this change:

````diff
-```text
+```rust
 use d_engine_client::RaftClient;
````
**120-124: Consider specific language identifiers for code blocks.**

Several code blocks use `text` as the language identifier. Consider using more specific identifiers where applicable (e.g., `rust` for Rust code, `toml` for configuration) to enable syntax highlighting.

Also applies to: 145-162, 192-212
d-engine-client/src/mock_rpc.rs (1)
4-5: Mock watch implementation comment vs behavior (and cross-mock consistency)The
watchmock currently always returnsStatus::unimplemented, but the comment says “return empty stream”, and the core mock (d-engine-core/src/test_utils/mock/mock_rpc.rs) actually returns an empty stream. For clarity and consistency, consider either:
- Implementing this mock like the core mock (returning an empty stream), or
- Updating the comment (and any expectations in tests) to reflect that this mock reports
UNIMPLEMENTED.This keeps behavior predictable for callers that may use either mock.
Also applies to: 16-17, 106-142
d-engine-server/benches/state_machine.rs (1)
39-53: Consider explicit WatchManager shutdown in benches (if Drop is not sufficient)
`create_watch_manager` calls `WatchManager::start()` and returns an `Arc<WatchManager>`, but none of the benchmarks ever call `stop()`. If `WatchManager` does not auto-stop on `Drop`, this can leave background dispatcher threads running across iterations and potentially skew measurements or leak resources.

If `Drop` does not already handle shutdown, consider:

- Adding a small helper that wraps `WatchManager` in a RAII type which calls `stop()` on drop, or
- Calling `watch_manager.stop()` at the end of each async iteration in `bench_watch_e2e_latency` and the apply-with-watchers benches.

If `Drop` already performs a clean shutdown, this can be ignored.

Also applies to: 421-445
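The RAII option can be sketched as follows. `WatchManager` here is a minimal stand-in (only the `stop()` behavior matters), not the real d-engine type:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Minimal stand-in for the real WatchManager; only stop() matters here.
struct WatchManager {
    stopped: AtomicBool,
}

impl WatchManager {
    fn new() -> Arc<Self> {
        Arc::new(WatchManager { stopped: AtomicBool::new(false) })
    }

    fn stop(&self) {
        self.stopped.store(true, Ordering::SeqCst);
    }
}

/// RAII wrapper: stop() runs when the benchmark scope ends, even on
/// early return or panic, so no iteration can leak dispatcher threads.
struct WatchManagerGuard(Arc<WatchManager>);

impl Drop for WatchManagerGuard {
    fn drop(&mut self) {
        self.0.stop();
    }
}

fn main() {
    let manager = WatchManager::new();
    {
        let _guard = WatchManagerGuard(manager.clone());
        // ... run benchmark iterations here ...
    } // guard dropped here, stopping the dispatcher deterministically
    assert!(manager.stopped.load(Ordering::SeqCst));
}
```

Each benchmark would then create the guard at the top of its scope instead of calling `stop()` manually at every exit path.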
d-engine-server/tests/watch_integration_test.rs (1)
**1-363: Integration tests are valuable; consider explicit WatchManager shutdown for consistency**

These integration tests give good coverage of WatchManager behavior in the server context (dispatch, isolation, cleanup, overflow, sequencing, and disabled path). One small lifecycle detail:

- Each test calls `WatchManager::start()` on an `Arc<WatchManager>`, but none call `stop()`.

If `WatchManager` does not already shut itself down in `Drop`, consider calling `manager.stop()` at the end of each test (as is done in `d-engine-core/src/watch/manager_test.rs`) to ensure dispatcher threads are cleaned up deterministically and to keep behavior consistent across test suites.

d-engine-server/src/network/grpc/grpc_raft_service.rs (1)
**311-314: Align `watch` RPC with other client RPCs and avoid panic on misconfig**

The overall wiring of `WatchStream` and the `watch()` implementation looks good, but two points are worth tightening up:

1. **Readiness gating.** Other client RPCs (`handle_client_write`, `handle_client_read`, etc.) all short-circuit if the node is not ready. `watch()` currently doesn't, so a client could register watches before the node is operational. To keep behavior consistent:

```diff
@@
-    async fn watch(
-        &self,
-        request: tonic::Request<WatchRequest>,
-    ) -> std::result::Result<tonic::Response<Self::WatchStream>, tonic::Status> {
-        let watch_request = request.into_inner();
-        let key = watch_request.key;
+    async fn watch(
+        &self,
+        request: tonic::Request<WatchRequest>,
+    ) -> std::result::Result<tonic::Response<Self::WatchStream>, tonic::Status> {
+        if !self.server_is_ready() {
+            warn!("[watch] Node-{} is not ready!", self.node_id);
+            return Err(Status::unavailable("Service is not ready"));
+        }
+
+        let watch_request = request.into_inner();
+        let key = watch_request.key;
```

2. **Avoid panicking via `expect` on `watch_manager`.** `watch_manager` is accessed with `expect("Watch manager must exist if dispatcher handle exists")`. If builder wiring ever regresses, this would panic in the hot RPC path instead of returning a structured error. A safer pattern with logging:

```diff
-        let watch_manager = self
-            .watch_manager
-            .as_ref()
-            .expect("Watch manager must exist if dispatcher handle exists");
+        let watch_manager = self
+            .watch_manager
+            .as_ref()
+            .ok_or_else(|| {
+                error!(
+                    node_id = self.node_id,
+                    "watch_dispatcher_handle is set but watch_manager is None"
+                );
+                Status::internal("Watch manager missing while watch feature is enabled")
+            })?;
```
watch_manageris accessed withexpect("Watch manager must exist if dispatcher handle exists"). If builder wiring ever regresses, this would panic in the hot RPC path instead of returning a structured error. A safer pattern with logging:- let watch_manager = self - .watch_manager - .as_ref() - .expect("Watch manager must exist if dispatcher handle exists"); + let watch_manager = self + .watch_manager + .as_ref() + .ok_or_else(|| { + error!( + node_id = self.node_id, + "watch_dispatcher_handle is set but watch_manager is None" + ); + Status::internal("Watch manager missing while watch feature is enabled") + })?;With these tweaks, the watch path matches the robustness and readiness semantics of the existing RPCs.
Also applies to: 387-454
d-engine-server/src/node/mod.rs (1)
44-53: Optional watch fields onNodeare well-scoped; consider documenting invariantAdding
watch_manager: Option<Arc<WatchManager>>andwatch_dispatcher_handle: Option<WatchDispatcherHandle>withfrom_raftdefaulting both toNonecleanly preserves existing behavior when watch is disabled. It does, however, rely on builders ensuring “both-or-none” is maintained.A brief comment near the struct or in the builder enforcing/documenting the invariant “
watch_manager.is_some() == watch_dispatcher_handle.is_some()” would make it clearer that RPC code can safely assume they’re coupled (and avoid surprises if future code sets only one of them).Also applies to: 85-92, 201-203
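One cheap way to both document and enforce the invariant is a small check in the builder. A minimal sketch with placeholder unit structs standing in for the real `Arc<WatchManager>` and `WatchDispatcherHandle` types:

```rust
// Hypothetical placeholders for the real watch components.
struct WatchManager;
struct WatchDispatcherHandle;

struct Node {
    watch_manager: Option<WatchManager>,
    watch_dispatcher_handle: Option<WatchDispatcherHandle>,
}

impl Node {
    /// Invariant: the watch manager and dispatcher handle are set together.
    fn check_watch_invariant(&self) -> bool {
        self.watch_manager.is_some() == self.watch_dispatcher_handle.is_some()
    }
}

fn main() {
    // Both-or-none holds when watch is disabled...
    let node = Node { watch_manager: None, watch_dispatcher_handle: None };
    assert!(node.check_watch_invariant());

    // ...and is violated when only one side is wired.
    let bad = Node { watch_manager: Some(WatchManager), watch_dispatcher_handle: None };
    assert!(!bad.check_watch_invariant());
}
```

In the real builder this could be a `debug_assert!` right before the `Node` is returned, so a regression fails fast in tests instead of panicking later in the RPC path.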
d-engine-core/src/watch/manager.rs (1)
355-389: Notify functions match the “non-blocking, best-effort” design
notify_putandnotify_deletecorrectly construct aWatchEventand usetry_sendon the global queue, which aligns with the documented choice to drop events under backpressure to protect the write path.If you later need better observability around drops, consider incrementing a lightweight counter or emitting a debug-level trace on
try_sendfailure, guarded by a cheap feature flag, but the current behavior is consistent with the stated design.
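The drop-counting idea can be sketched with std's bounded channel standing in for crossbeam-channel (`try_send` has the same non-blocking shape in both); the event type and counter are illustrative, not the real API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

// Illustrative event; the real WatchEvent carries Bytes key/value.
#[derive(Clone)]
struct Event(String);

/// Best-effort notify: never blocks the apply path; counts drops
/// so backpressure is observable without slowing the hot path.
fn notify(queue: &SyncSender<Event>, event: Event, dropped: &AtomicU64) {
    if let Err(TrySendError::Full(_)) = queue.try_send(event) {
        dropped.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let (tx, _rx) = sync_channel(1); // bounded queue of size 1
    let dropped = AtomicU64::new(0);
    notify(&tx, Event("put:k1".into()), &dropped); // enqueued
    notify(&tx, Event("put:k2".into()), &dropped); // queue full -> counted as dropped
    assert_eq!(dropped.load(Ordering::Relaxed), 1);
}
```

A relaxed atomic increment costs a few nanoseconds at most, so it stays within the stated apply-path overhead budget.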
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- `Cargo.lock` is excluded by `!**/*.lock`
- `d-engine-proto/src/generated/d_engine.client.rs` is excluded by `!**/generated/**`
- `examples/rocksdb-cluster/Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (25)
- `config/base/raft.toml` (1 hunks)
- `d-engine-client/src/mock_rpc.rs` (4 hunks)
- `d-engine-core/Cargo.toml` (1 hunks)
- `d-engine-core/src/config/raft.rs` (4 hunks)
- `d-engine-core/src/lib.rs` (2 hunks)
- `d-engine-core/src/state_machine_handler/default_state_machine_handler.rs` (7 hunks)
- `d-engine-core/src/state_machine_handler/default_state_machine_handler_test.rs` (28 hunks)
- `d-engine-core/src/test_utils/mock/mock_rpc.rs` (3 hunks)
- `d-engine-core/src/timer/timer_test.rs` (1 hunks)
- `d-engine-core/src/watch/manager.rs` (1 hunks)
- `d-engine-core/src/watch/manager_test.rs` (1 hunks)
- `d-engine-core/src/watch/mod.rs` (1 hunks)
- `d-engine-docs/src/docs/server_guide/mod.rs` (2 hunks)
- `d-engine-docs/src/docs/server_guide/watch-feature.md` (1 hunks)
- `d-engine-proto/proto/client/client_api.proto` (1 hunks)
- `d-engine-server/benches/state_machine.rs` (4 hunks)
- `d-engine-server/src/network/grpc/grpc_raft_service.rs` (4 hunks)
- `d-engine-server/src/network/grpc/mod.rs` (1 hunks)
- `d-engine-server/src/network/grpc/watch_dispatcher.rs` (1 hunks)
- `d-engine-server/src/network/grpc/watch_handler.rs` (1 hunks)
- `d-engine-server/src/node/builder.rs` (6 hunks)
- `d-engine-server/src/node/mod.rs` (3 hunks)
- `d-engine-server/src/test_utils/integration/mod.rs` (1 hunks)
- `d-engine-server/src/test_utils/mock/mock_node_builder.rs` (2 hunks)
- `d-engine-server/tests/watch_integration_test.rs` (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Lint and Format Check
- GitHub Check: Lint and Format Check
🔇 Additional comments (20)
d-engine-server/src/test_utils/mock/mock_node_builder.rs (2)
**306-307: LGTM: Watch fields correctly initialized for mock nodes.**

The watch subsystem fields are appropriately set to `None` in the mock builder, as test fixtures don't require watch functionality.
**329-329: LGTM: Arc-based config sharing implemented correctly.**

The refactoring to use `Arc<RaftNodeConfig>` properly enables shared ownership between the Node struct and the RPC server task, which is necessary for crossing async boundaries.

Also applies to: 336-341
d-engine-core/src/timer/timer_test.rs (1)
**124-137: LGTM! Improved test reliability.**

The revised assertion logic fixes the flaky test by ensuring the new deadline is measured from the reset moment rather than comparing against a potentially stale old deadline. This approach is more robust for timing-sensitive tests.
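The described fix can be illustrated with a toy timer; the `Timer` struct below is a hypothetical stand-in for the real one, showing why anchoring the assertion at the reset moment avoids flakiness:

```rust
use std::time::{Duration, Instant};

// Hypothetical minimal timer mirroring the reset() semantics under test.
struct Timer {
    deadline: Instant,
    interval: Duration,
}

impl Timer {
    fn reset(&mut self) {
        // New deadline is measured from the moment of reset.
        self.deadline = Instant::now() + self.interval;
    }
}

fn main() {
    let interval = Duration::from_millis(100);
    let mut timer = Timer { deadline: Instant::now() + interval, interval };

    let reset_at = Instant::now();
    timer.reset();

    // Robust assertion: compare against the reset moment, not the old
    // (possibly stale) deadline, which may drift under scheduler delay.
    assert!(timer.deadline >= reset_at + interval);
}
```

Comparing new vs old deadlines directly can fail when the test thread is descheduled between the two reads; the reset-moment anchor holds regardless of scheduling jitter.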
config/base/raft.toml (1)
**105-131: LGTM! Well-documented configuration.**

The watch configuration section is comprehensive with clear tuning guidelines, traffic-based recommendations, and memory impact estimates. This will help operators configure the system appropriately for their workload.
d-engine-core/src/lib.rs (1)
**19-19: LGTM! Consistent module structure.**

The watch module is added and re-exported following the existing crate organization pattern.
Also applies to: 28-28
d-engine-server/src/test_utils/integration/mod.rs (1)
**238-238: LGTM! Appropriate test setup.**

Passing `None` for the watch manager in test utilities is correct for tests that don't exercise the watch feature.

d-engine-docs/src/docs/server_guide/mod.rs (1)
**10-10: LGTM! Documentation properly integrated.**

The watch feature documentation is correctly added to the server guide following the established pattern.
Also applies to: 20-22
d-engine-docs/src/docs/server_guide/watch-feature.md (1)
**1-271: Excellent comprehensive documentation!**

The Watch feature documentation is thorough and well-organized. It effectively covers:
- Clear quick start guide
- Detailed configuration with tuning guidelines
- Behavior guarantees and limitations
- Performance characteristics with concrete numbers
- Practical use cases and examples
- Troubleshooting guide
- Architecture overview
This will be very helpful for users implementing the Watch feature.
d-engine-server/src/network/grpc/mod.rs (1)
**12-16: LGTM! Proper module organization.**

The watch-related modules and re-exports are correctly integrated following the existing crate structure. Using `pub(crate)` visibility is appropriate for internal server components.

d-engine-core/Cargo.toml (1)
**51-52: No action required: version specification is valid.**

The latest stable version of crossbeam-channel is 0.5.15 (released April 8, 2025). The "0.5" specification in Cargo.toml uses semantic versioning and will automatically resolve to the latest 0.5.x release, including 0.5.15. The dependency is correctly specified.
d-engine-core/src/watch/manager_test.rs (1)
**1-320: Comprehensive WatchManager unit coverage looks solid**

The tests thoroughly exercise core behaviors (dispatch, multiple watchers, cleanup, isolation, overflow, sequencing, and lifecycle) and explicitly start/stop the manager, which should help catch regressions in the dispatcher logic.
d-engine-core/src/test_utils/mock/mock_rpc.rs (1)
**11-12: Watch streaming stub in core mock is appropriate**

The `WatchStream` alias and `watch` implementation (empty `ReceiverStream`) fit the gRPC streaming shape and provide a safe, no-op stub for tests that just need the RPC surface to exist.

Also applies to: 155-194
d-engine-core/src/state_machine_handler/default_state_machine_handler_test.rs (2)
**47-55: Tests correctly opt out of watch via `new_without_watch`**

Using `DefaultStateMachineHandler::<MockTypeConfig>::new_without_watch` in these tests keeps behavior identical to the pre-watch constructor while avoiding any dependency on the watch pipeline. This is a clean way to adapt tests to the new optional watch manager without changing expectations.

**382-399: Shared snapshot test helper updated consistently to `new_without_watch`**

Updating `create_test_handler` to construct `DefaultStateMachineHandler` via `new_without_watch` centralizes the "no watch in tests" behavior and keeps all snapshot-related tests off the watch path. This also means future changes to the main constructor won't require touching each test.

d-engine-core/src/watch/mod.rs (1)
**1-140: Watch module surface and documentation look solid**

The module docs clearly describe the architecture, performance characteristics, and error semantics, and the re-exports (`WatchManager`, `WatchEvent*`, `WatcherHandle*`, `WatchConfig`) present a tidy public surface for consumers. No issues spotted in how the module is structured or exposed.

d-engine-proto/proto/client/client_api.proto (1)
**90-123: Watch proto surface matches server design and is well-documented**

The new `WatchEventType`, `WatchRequest`, `WatchResponse`, and `Watch` RPC are coherently modeled and align with the server-side watch implementation and documented semantics (single-key watch, PUT/DELETE events, possible drops under backpressure). Field choices and numbering are straightforward, so the addition looks safe from a compatibility and readability perspective.

Also applies to: 130-142
d-engine-core/src/watch/manager.rs (4)
**1-62: Well-documented architecture and event model**

The top-level docs and the `WatchEventType`/`WatchEvent` definitions are clear and match the stated performance and delivery guarantees. Using `Bytes` for keys/values and an explicit event type enum is a solid foundation for the rest of the design.
**245-266: `WatchManager::new` construction path looks good**

The split between bounded vs unbounded global event queues based on `config.event_queue_size` and the initialization of `WatchManagerInner` (DashMap + atomics + config) is straightforward and matches the documented behavior. No issues here from a correctness standpoint.
**391-429: Dispatch and monitoring helpers are straightforward and correct**

`dispatch_event` looks up watchers by key and does a per-watcher `try_send` with cloning of the event, which is cheap enough given `Bytes`-backed payloads and consistent with non-blocking dispatch. `watcher_count` and `watched_key_count` are simple introspection helpers that don't affect the hot path and are useful for tests/metrics.

No functional issues spotted here.
**317-344: The review comment is incorrect. Config-level validation already prevents this panic.**

The `WatchConfig` struct in `d-engine-core/src/config/raft.rs` includes explicit validation that returns an error if `watcher_buffer_size == 0` when watch is enabled. The validation occurs at configuration time (before `WatchManager` initialization), not at runtime in `register()`. This means any attempt to set `watcher_buffer_size` to 0 fails with a clear error message during config parsing, making it impossible for the `register()` method to be called with an invalid buffer size.

The review overlooks this existing safety mechanism and proposes adding a redundant guard in `register()`. No code changes are needed.

Likely an incorrect or invalid review comment.
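For illustration, the config-time guard being referred to has roughly this shape; this is a sketch with assumed field and message names, not the actual `d-engine-core/src/config/raft.rs` code:

```rust
// Assumed shape of the config-time validation; names are illustrative.
struct WatchConfig {
    enabled: bool,
    watcher_buffer_size: usize,
}

impl WatchConfig {
    /// Rejects invalid settings at parse time, before WatchManager exists.
    fn validate(&self) -> Result<(), String> {
        if self.enabled && self.watcher_buffer_size == 0 {
            return Err("watch.watcher_buffer_size must be > 0 when watch is enabled".into());
        }
        Ok(())
    }
}

fn main() {
    // Rejected before WatchManager is ever constructed.
    let bad = WatchConfig { enabled: true, watcher_buffer_size: 0 };
    assert!(bad.validate().is_err());

    let ok = WatchConfig { enabled: true, watcher_buffer_size: 128 };
    assert!(ok.validate().is_ok());
}
```

Because the error surfaces during config parsing, `register()` can rely on `watcher_buffer_size > 0` without re-checking on the hot path.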
```rust
/// # Panics
///
/// Panics if the receiver has already been taken.
pub fn into_receiver(mut self) -> (u64, Bytes, mpsc::Receiver<WatchEvent>, WatcherHandleGuard) {
    let id = self.id;
    let key = self.key.clone();
    let receiver = self.receiver.take().expect("receiver already taken");

    let guard = WatcherHandleGuard {
        id,
        key: key.clone(),
        manager: self.manager.clone(),
    };

    // Prevent Drop from being called on self, since guard will handle cleanup
    std::mem::forget(self);

    (id, key, receiver, guard)
}
```
**`WatcherHandle::into_receiver` leaks key and manager via `mem::forget(self)`**

`std::mem::forget(self)` prevents `Drop` from running on `WatcherHandle`, so the embedded `Bytes` key and `Arc<WatchManagerInner>` never have their refcounts decremented. Because you clone both (`key.clone()`, `self.manager.clone()`), every `into_receiver` call permanently leaks at least one `Bytes` allocation and one `Arc<WatchManagerInner>` that will never be dropped, even after all handles, guards, and managers are gone.
Consider restructuring so you can move, not clone, the cleanup state into the guard without mem::forget, for example:
- Wrap cleanup fields (`id`, `key`, `manager`) in an `Option`/`ManuallyDrop` inside `WatcherHandle` and `take()` them in `into_receiver`, so `Drop` becomes a no-op when the inner state has been moved; or
- Introduce a small shared `WatcherRegistration` inner struct held in an `Arc` that owns the cleanup logic, and let both `WatcherHandle` and `WatcherHandleGuard` reference that instead of cloning and forgetting.
That way you avoid double-cleanup while still allowing all heap-owned resources to be reclaimed.
🤖 Prompt for AI Agents

```text
d-engine-core/src/watch/manager.rs around lines 96 to 113: into_receiver
currently calls std::mem::forget(self) after cloning key and manager which leaks
the cloned Bytes and Arc; instead, make the cleanup state movable into the guard
so you don't need mem::forget — wrap id, key and manager in Option or
ManuallyDrop inside WatcherHandle (or introduce a shared WatcherRegistration
behind an Arc) and call take() to move them into the WatcherHandleGuard returned
from into_receiver, implement Drop to be a no-op when the inner Option is None
(or let the shared Arc handle cleanup), and remove mem::forget and redundant
clones so the Bytes and Arc get properly dropped when no longer used.
```
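The `Option`/`take()` restructuring can be sketched as follows, with simplified placeholder types (`String` stands in for `Bytes` and the `mpsc::Receiver`, and the registry unregistration is elided):

```rust
// Simplified placeholders: String stands in for Bytes / mpsc::Receiver.
struct Cleanup {
    id: u64,
    key: String,
}

struct WatcherHandle {
    cleanup: Option<Cleanup>, // moved out by into_receiver
    receiver: Option<String>,
}

struct WatcherHandleGuard {
    cleanup: Cleanup,
}

impl WatcherHandle {
    /// Moves (not clones) the cleanup state into the guard; no mem::forget needed.
    fn into_receiver(mut self) -> (String, WatcherHandleGuard) {
        let receiver = self.receiver.take().expect("receiver already taken");
        let cleanup = self.cleanup.take().expect("cleanup state already moved");
        // self drops here; Drop below sees None and does nothing.
        (receiver, WatcherHandleGuard { cleanup })
    }
}

impl Drop for WatcherHandle {
    fn drop(&mut self) {
        // No-op when the cleanup state was moved into the guard.
        if let Some(c) = self.cleanup.take() {
            // Unregister watcher c.id for c.key in the real implementation.
            let _ = (c.id, c.key);
        }
    }
}

fn main() {
    let handle = WatcherHandle {
        cleanup: Some(Cleanup { id: 1, key: "k".into() }),
        receiver: Some("rx".into()),
    };
    let (rx, guard) = handle.into_receiver();
    assert_eq!(rx, "rx");
    assert_eq!(guard.cleanup.id, 1);
}
```

Because the state is moved rather than cloned and forgotten, every allocation is dropped exactly once: either by the handle's `Drop` (if `into_receiver` was never called) or by the guard.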
```rust
impl Drop for WatcherHandleGuard {
    fn drop(&mut self) {
        // Guard cleanup: remove watcher from registry
        if let Some(mut watchers) = self.manager.watchers.get_mut(&self.key) {
            watchers.retain(|w| w.id != self.id);

            if watchers.is_empty() {
                drop(watchers);
                self.manager.watchers.remove(&self.key);
            }
        }

        trace!(
            watcher_id = self.id,
            key = ?self.key,
            "Watcher unregistered via guard"
        );
    }
}

impl Drop for WatcherHandle {
    fn drop(&mut self) {
        // Automatic cleanup when client disconnects
        if let Some(mut watchers) = self.manager.watchers.get_mut(&self.key) {
            watchers.retain(|w| w.id != self.id);

            if watchers.is_empty() {
                drop(watchers);
                self.manager.watchers.remove(&self.key);
            }
        }

        trace!(
            watcher_id = self.id,
            key = ?self.key,
            "Watcher unregistered"
        );
    }
}
```
🧩 Analysis chain
Race condition in Drop impls can remove newly-registered watchers for the same key
Both `WatcherHandleGuard::drop` and `WatcherHandle::drop` do:

```rust
if let Some(mut watchers) = self.manager.watchers.get_mut(&self.key) {
    watchers.retain(|w| w.id != self.id);
    if watchers.is_empty() {
        drop(watchers);
        self.manager.watchers.remove(&self.key);
    }
}
```

Between `drop(watchers)` and `self.manager.watchers.remove(&self.key)`, another thread can successfully register a new watcher for the same key (via `entry(key).or_default().push(...)`). In that case:
- The dropping thread empties the `Vec` and releases the bucket lock.
- A registering thread pushes a new watcher into that `Vec`.
- The dropping thread then calls `remove(&self.key)` and deletes the entire entry, including the newly-registered watcher.
Result: watchers for a key can disappear unexpectedly under concurrent register/drop.
A simple, safe fix is to keep the per-key entry even when the `Vec` becomes empty and avoid the separate `remove` call:

```diff
-        if let Some(mut watchers) = self.manager.watchers.get_mut(&self.key) {
-            watchers.retain(|w| w.id != self.id);
-
-            if watchers.is_empty() {
-                drop(watchers);
-                self.manager.watchers.remove(&self.key);
-            }
-        }
+        if let Some(mut watchers) = self.manager.watchers.get_mut(&self.key) {
+            watchers.retain(|w| w.id != self.id);
+            // Intentionally keep an empty Vec to avoid racing with concurrent registrations.
+        }
```

Apply the same pattern to both `Drop` impls. You trade a negligible number of empty `Vec`s for correct concurrent behavior.
I'll start by examining the actual code structure to verify the race condition claim.
🏁 Script executed:

```bash
#!/bin/bash
# First, find and examine the manager.rs file to understand data structures
find . -name "manager.rs" -path "*/watch/*" -type f | head -20
```

Length of output: 36

🏁 Script executed:

```bash
#!/bin/bash
# Look at the context around lines 134-172 to see the struct definitions and imports
fd "manager\.rs" --path "*/watch/*" -x cat {} | head -200
```

Length of output: 215

🏁 Script executed:

```bash
#!/bin/bash
# Search for the WatcherHandle and WatcherHandleGuard struct definitions
rg -n "struct WatcherHandle|struct WatcherHandleGuard" --type rs -A 5 -B 2
```

Length of output: 26

🏁 Script executed:

```bash
#!/bin/bash
# Check the manager struct definition to understand what 'watchers' is
rg -n "struct.*Manager|watchers\s*:" --type rs -C 3 | head -100
```

Length of output: 26

🏁 Script executed:

```bash
#!/bin/bash
# Look for DashMap imports and usage
rg -n "use.*DashMap|dashmap" --type rs -C 2
```

Length of output: 26

🏁 Script executed:

```bash
#!/bin/bash
# Examine the full Drop impl to understand the exact code
sed -n '130,180p' d-engine-core/src/watch/manager.rs
```

Length of output: 1290

🏁 Script executed:

```bash
#!/bin/bash
# Get broader context of the file structure
wc -l d-engine-core/src/watch/manager.rs
rg -n "impl Drop for WatcherHandleGuard|impl Drop for WatcherHandle" --type rs -A 20
```

Length of output: 65

🏁 Script executed:

```bash
#!/bin/bash
# Check if there are registration methods using entry API
rg -n "\.entry\(|or_default\(\)\.push" --type rs -B 2 -A 2
```

Length of output: 26

🏁 Script executed:

```bash
#!/bin/bash
# Look for synchronization primitives around watchers
rg -n "Mutex|RwLock|DashMap" d-engine-core/src/watch/manager.rs -C 2
```

Length of output: 1022

🏁 Script executed:

```bash
#!/bin/bash
# Get the exact lines 134-172 from the file
head -n 172 d-engine-core/src/watch/manager.rs | tail -n 39
```

Length of output: 1061

🏁 Script executed:

```bash
#!/bin/bash
# Check complete file to understand full context
cat d-engine-core/src/watch/manager.rs
```

Length of output: 12554
I need to verify DashMap's locking semantics to confirm the race condition claim.
Now let me also check if there are any existing tests or documentation in the codebase about concurrent behavior:
🏁 Script executed:

```bash
#!/bin/bash
# Search for tests related to Drop or concurrent operations
rg -n "test|#\[" d-engine-core/src/watch/manager.rs | grep -i "drop\|concurrent\|race\|thread"
```

Length of output: 0

🏁 Script executed:

```bash
#!/bin/bash
# Check if there are test files for manager
find d-engine-core -name "*test*" -o -name "*tests*" | xargs grep -l "watch\|manager" 2>/dev/null || echo "No test files found"
```

Length of output: 201

🏁 Script executed:

```bash
#!/bin/bash
# Check Cargo.toml for dashmap version
grep -A 5 -B 5 "dashmap" d-engine-core/Cargo.toml || grep -r "dashmap" . --include="Cargo.toml" | head -5
```

Length of output: 300

🏁 Script executed:

```bash
#!/bin/bash
# Look for any synchronization or thread safety comments in manager.rs
rg -n "thread|concurrent|race|safe|lock" d-engine-core/src/watch/manager.rs
```

Length of output: 1696

🏁 Script executed:

```bash
#!/bin/bash
# Check if there are other usages of DashMap in codebase to see patterns
rg -n "DashMap" --type rust -B 2 -A 2 | head -40
```

Length of output: 2999

🏁 Script executed:

```bash
#!/bin/bash
# Check for entry() usage patterns in the codebase to understand registration flow
rg "\.entry\(" d-engine-core/src/watch/manager.rs -B 3 -A 3
```

Length of output: 210
🌐 Web query:
DashMap get_mut RefMut lock release behavior when dropped
💡 Result:
Short answer: RefMut returned by DashMap::get_mut holds the shard's write lock for the duration of the RefMut value; when the RefMut is dropped (goes out of scope), its lock is released and other threads can access that shard. The docs explicitly show get_mut returns a mapref::one::RefMut and the RefMut type provides the mutable access (and is the RAII guard). [1][2]
Notes/warnings:
- While the RefMut is alive the shard is locked; calling other DashMap methods that require the same shard may deadlock. Use scoped access (drop the RefMut as soon as possible) or try_get_mut/try_get to avoid blocking. [2]
Sources:
- dashmap::mapref::one::RefMut (struct docs). [1]
- dashmap::DashMap::get_mut (method docs + locking behaviour). [2]
🌐 Web query:
DashMap entry or_default concurrent behavior between get_mut and remove
💡 Result:
Short answer: Entry::or_default inserts (if needed) and returns a RefMut that holds the map's mutable reference (a shard write lock) for that entry. While that RefMut is held, operations that need a mutable lock on the same shard (e.g., remove, get_mut, insert, iter_mut, remove_if_mut, etc.) will block or, if attempted from the same thread, may deadlock. In practice: calling entry(key).or_default() -> RefMut keeps the lock until the RefMut is dropped; a concurrent thread that calls remove(key) will wait for that RefMut to be dropped, and calling remove/get_mut while still holding the RefMut in the same thread can deadlock. [1][2][3]
References
- DashMap docs: remove() and get_mut() locking behaviour. [1]
- DashMap Entry::or_default returning RefMut (holds mutable reference). [2]
- Community explanation / deadlock examples (StackOverflow). [3]
Sources
[1] DashMap API docs — remove(), get_mut() locking behaviour.
[2] dashmap::mapref::entry::Entry::or_default docs (returns RefMut).
[3] StackOverflow discussion on DashMap deadlocks.
🌐 Web query:
DashMap bucket locking semantics lock-free concurrent HashMap
💡 Result:
Short answer: DashMap is sharded and uses per-shard RwLocks (reader/writer locks) — not a lock‑free hash map. Each shard contains a hashbrown raw table; reads take a shard read-lock, writes take a shard write-lock, and many APIs return guard objects that hold those locks. Because it uses locks (sharding + RwLocks) it is fast and concurrent, but not lock‑free/wait‑free and can deadlock if you call locking APIs while holding a map reference. [1][2][3]
Sources:
- DashMap source (shards + RwLock + RawTable implementation). [1]
- DashMap docs (locking behaviour on get/insert/iter and deadlock warnings). [2]
- DashMap API reference / docs site (summary: sharded RwLock approach). [3]
🌐 Web query:
Rust DashMap empty Vec accumulation impact performance memory
💡 Result:
Short answer
- An empty Vec (Vec::new()) does not allocate heap memory, but the Vec value itself still takes space as the Vec struct (typically 24 bytes on 64-bit). [1][2]
- Storing many empty Vecs inside a DashMap therefore won't cause lots of heap allocations immediately, but each map entry still has per-entry overhead (map bucket/metadata, key storage, and the Vec struct). That overhead can be significant if you have millions of keys. [3]
- Performance cost shows up when you mutate those Vecs (pushes) because each push may allocate, and because DashMap iteration/lookup has the usual hashmap overhead vs a compact Vec-based layout. Use appropriate data layout if you care about hot-path throughput or memory density. [3][4]
Practical guidance
- If you only need to represent “no items” vs “some items” and want to avoid storing the Vec struct for many keys, consider Option<Vec> — but note Option<Vec> usually does not save space over Vec thanks to Rust’s null-pointer optimization for Vec, so it often doesn’t help memory. [2]
- If per-entry size/density matters: consider alternatives — e.g. store indices into a shared arena/Vec, use smallvec (inlined small buffer) for very small lists, or allocate the Vec lazily (only insert a value when you first need to push). These reduce per-key overhead and heap churn when lists remain empty. [4]
- Benchmark your real workload (insertion/mutation/iteration patterns). Hashmap-like structures (including DashMap) have higher per-entry and iteration overhead than a compact Vec for dense data; but they give concurrent access and O(1) keyed lookup which may be required. [3][4]
Sources
- DashMap docs (DashMap behavior / creation). [1]
- Vec allocation / empty-vec behavior discussion (Vec::new() does not allocate; Option notes). [2]
- Benchmarks and discussions comparing HashMap/DashMap overhead vs Vec and why structure choice matters. [3][4]
References
[1] DashMap documentation.
[2] StackOverflow / Rust answers on empty Vec allocation and Option.
[3] Rust community posts / benchmarks showing HashMap vs Vec performance differences.
[4] Blog posts and discussions on allocation patterns (smallvec, arena, lazy allocation).
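The "empty Vec is cheap" points above can be checked with the standard library alone; this sketch assumes a 64-bit target:

```rust
fn main() {
    let v: Vec<u64> = Vec::new();

    // Vec::new() performs no heap allocation.
    assert_eq!(v.capacity(), 0);

    // The Vec struct itself is three machine words (ptr, len, cap) = 24 bytes on 64-bit.
    assert_eq!(std::mem::size_of::<Vec<u64>>(), 3 * std::mem::size_of::<usize>());

    // Niche optimization: Option<Vec<T>> is no larger than Vec<T>,
    // so wrapping in Option does not save memory.
    assert_eq!(
        std::mem::size_of::<Option<Vec<u64>>>(),
        std::mem::size_of::<Vec<u64>>()
    );
    println!("ok");
}
```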
🌐 Web query:
DashMap compare swap pattern for atomic remove operations
💡 Result:
Use DashMap's conditional remove APIs (remove_if / remove_if_mut) so the check-and-remove happens atomically under the shard lock instead of doing a separate load + remove (which races). Example:
- Read-check-remove in one operation: `dashmap.remove_if(&key, |k, v| /* return true to remove */)`

Example snippet:

```rust
let map = dashmap::DashMap::new();
map.insert("x", 1);
// remove only if value == 1
map.remove_if(&"x", |_, &v| v == 1);
```
This is the intended compare-and-swap style pattern for DashMap: evaluate the predicate and remove under the same lock/shard. See the DashMap docs for remove_if/remove_if_mut. [1][2]
Sources:
[1] DashMap docs — remove_if / remove_if_mut. [2] DashMap crate source (implementation).
Let me check the test file to see if there are any existing tests that catch this race:
🏁 Script executed:
#!/bin/bash
# Examine the manager test file to see what's tested
cat d-engine-core/src/watch/manager_test.rs
Length of output: 10147
🏁 Script executed:
#!/bin/bash
# Search for any code that checks if a key exists (depends on entry removal)
rg "contains_key|is_empty\(\)" d-engine-core/src/watch/ -n
Length of output: 151
🏁 Script executed:
#!/bin/bash
# Check if watcher_count or watched_key_count are used in tests (which would be affected)
rg "watcher_count|watched_key_count" d-engine-core -n
Length of output: 831
Fix race condition in Drop impls, but address test breakage from proposed solution
The race condition is confirmed. DashMap's get_mut() releases its shard lock when the RefMut is dropped, creating a window where another thread can successfully insert a new watcher via entry().or_default().push() before this thread calls remove(&self.key), causing newly-registered watchers to be deleted.
The proposed fix (keeping empty Vecs) correctly eliminates the race window. However, it breaks test_watcher_auto_cleanup() (manager_test.rs:126-138), which asserts watched_key_count() == 0 after all watchers drop. With empty Vecs remaining in the DashMap, watched_key_count() will not reach zero.
Two options:
- **Accept empty Vecs + update test**: Update `test_watcher_auto_cleanup()` to remove the assertions that `watched_key_count() == 0`, since entries will remain (with negligible overhead: ~24 bytes per `Vec` struct). Apply the proposed fix to both `WatcherHandleGuard::drop` and `WatcherHandle::drop`.
- **Atomic removal**: Use DashMap's `remove_if_mut` to make the check-and-remove atomic, avoiding the race without keeping empty entries. This is more complex but preserves current semantics.
Choose which approach aligns with your design intent, then apply the fix and update the corresponding test.
🤖 Prompt for AI Agents
d-engine-core/src/watch/manager.rs around lines 134 to 172: the current Drop
impls use get_mut() then remove(&self.key) which drops the shard lock before
removal and allows a race where a concurrent insert can re-create a watcher
vector that this Drop will then remove; instead, replace the get_mut() + retain
+ conditional remove logic with an atomic remove_if_mut (or equivalent DashMap
API) that performs the "if vec is empty then remove key" check and removal under
the same shard lock so no race window exists; apply this change to both
WatcherHandleGuard::drop and WatcherHandle::drop and keep the existing test
expectations (do not change test_watcher_auto_cleanup()).
```rust
/// Start the background event dispatcher thread
///
/// This spawns a dedicated thread that reads from the event queue and
/// dispatches events to registered watchers.
pub fn start(&self) {
    if self.inner.running.swap(true, Ordering::SeqCst) {
        // Already running
        return;
    }

    let inner = self.inner.clone();
    let receiver = Arc::clone(&self.event_receiver);

    std::thread::spawn(move || {
        debug!("Watch dispatcher thread started");

        while inner.running.load(Ordering::Acquire) {
            match receiver.recv() {
                Ok(event) => {
                    Self::dispatch_event(&inner, event);
                }
                Err(_) => {
                    // Channel closed, exit thread
                    break;
                }
            }
        }

        debug!("Watch dispatcher thread stopped");
    });
}

/// Stop the background dispatcher thread
pub fn stop(&self) {
    self.inner.running.store(false, Ordering::Release);
}
```
🧩 Analysis chain
start/stop lifecycle can leak dispatcher threads and has surprising semantics
Current behavior:
- `start()` spawns a dispatcher thread the first time `running` transitions from `false` to `true`.
- The dispatcher loops on `receiver.recv()` while `running` is `true`.
- `stop()` just sets `running = false` and returns.
Two issues:
- If `stop()` is called when there are no further events and the channel stays open, the dispatcher thread will remain blocked in `recv()` indefinitely (it only observes `running` again after `recv()` returns).
- After `stop()`, `running` is `false`, so calling `start()` again spawns an additional dispatcher thread. The original thread may still be blocked on `recv()`, so repeated start/stop cycles can accumulate sleeping threads competing for the same queue.
Depending on how tests and server lifecycles are structured, this can cause thread leaks and harder-to-reason-about dispatch behavior.
Consider either:
- Making `WatchManager` "start once, no-op stop" (and adjusting the docs accordingly), or
- Introducing an explicit shutdown mechanism (e.g., send a sentinel event or use a control channel / `recv_timeout`) and storing a `JoinHandle` so `stop()` can reliably terminate and optionally join the dispatcher.
I'll verify the review comment by examining the WatchManager implementation and understanding the actual threading behavior.
---
🏁 Script executed:
```shell
#!/bin/bash
# Find and examine the WatchManager struct definition and inner structure
rg -n "struct WatchManager" -A 15 d-engine-core/src/watch/
```
Length of output: 2240
🏁 Script executed:
#!/bin/bash
# Look for the inner struct or running field definition
rg -n "running.*Atomic\|Channel.*receiver" -B2 -A2 d-engine-core/src/watch/manager.rs
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Search for test cases involving start/stop cycles
rg -n "\.start\(\)|\.stop\(\)" -B3 -A3 d-engine-core/src/watch/ --type rust
Length of output: 11273
🏁 Script executed:
#!/bin/bash
# Look for how event_receiver is initialized and if there's cleanup
rg -n "event_receiver\|event_sender" -B2 -A2 d-engine-core/src/watch/manager.rs
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Check if there's a shutdown or drop implementation
rg -n "impl.*Drop\|fn drop\|fn shutdown" -A 10 d-engine-core/src/watch/manager.rs
Length of output: 0
---
🏁 Script executed:
```shell
#!/bin/bash
# Get the full new() method to see channel creation
rg -n "pub fn new" -A 30 d-engine-core/src/watch/manager.rs | head -50
```
Length of output: 1016
🏁 Script executed:
#!/bin/bash
# Look for any JoinHandle storage
rg -n "JoinHandle" d-engine-core/src/watch/
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Search for the channel type and any capacity/bounded info
rg -n "mpsc\|channel\|Sender\|Receiver" d-engine-core/src/watch/manager.rs | head -20
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Check the test_multiple_start_stop test more carefully
sed -n '275,288p' d-engine-core/src/watch/manager_test.rs
Length of output: 380
🏁 Script executed:
#!/bin/bash
# Search for Drop implementation or cleanup in the file
cat d-engine-core/src/watch/manager.rs | tail -100
Length of output: 2698
Thread lifecycle has genuine restart and blocking issues
The review correctly identifies two real problems:
- **Thread accumulation on restart cycles**: Each `start()` after `stop()` spawns a new dispatcher thread, because `swap(true)` returns `false` once `stop()` has cleared the flag. The previous dispatcher thread remains blocked in `receiver.recv()` indefinitely, since the channel is never closed between cycles. Repeated start/stop/start patterns therefore accumulate sleeping threads.
- **Indefinite blocking on stop()**: The dispatcher loop only checks the `running` flag after `recv()` returns. If `stop()` is called when no events are pending and the channel remains open, the thread blocks indefinitely waiting for the next event, never observing that `running` is now `false`.

The channel (`bounded()` or `unbounded()`) is created once and never recreated, persisting across cycles with no closure mechanism. There is no `JoinHandle` storage, so threads cannot be tracked or joined. The test suite includes `test_start_stop_idempotent`, which calls `start()` multiple times then `stop()` multiple times, but it does not exercise the problematic `start()` → `stop()` → `start()` restart scenario that triggers the accumulation issue.
For use cases requiring manager restart (e.g., graceful server restart, test fixtures), this creates observable thread leaks and unpredictable dispatch behavior.
🤖 Prompt for AI Agents
In d-engine-core/src/watch/manager.rs around lines 268 to 304, the dispatcher
thread accumulates and can block forever because start() spawns a new thread
each restart while stop() merely flips the running flag and never unblocks or
joins the existing thread or closes/recreates the channel; fix this by storing
the spawned thread's JoinHandle (Option<JoinHandle<()>>) on the manager, update
start() to be a no-op if already running and to recreate the channel/receiver if
it was closed, and change the loop to use recv_timeout (or try_recv with a short
sleep) so it regularly checks the running flag; implement stop() to set
running=false, send a shutdown sentinel or close the sender to wake the
receiver, and join the stored handle (clearing it) to ensure the previous thread
exits before allowing a new start.
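A std-only sketch of that shape (no crossbeam; struct and field names are hypothetical, not d-engine's): `recv_timeout` bounds how long the loop can go without re-checking the flag, and the stored `JoinHandle` lets `stop()` join the old thread so a later `start()` cannot stack dispatchers:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::thread::JoinHandle;
use std::time::Duration;

struct Dispatcher {
    running: Arc<AtomicBool>,
    receiver: Arc<Mutex<mpsc::Receiver<String>>>,
    handle: Mutex<Option<JoinHandle<()>>>,
}

impl Dispatcher {
    fn start(&self) {
        if self.running.swap(true, Ordering::SeqCst) {
            return; // already running
        }
        let running = Arc::clone(&self.running);
        let receiver = Arc::clone(&self.receiver);
        let h = std::thread::spawn(move || {
            let rx = receiver.lock().unwrap();
            while running.load(Ordering::Acquire) {
                match rx.recv_timeout(Duration::from_millis(50)) {
                    Ok(_event) => { /* dispatch_event(...) would go here */ }
                    Err(RecvTimeoutError::Timeout) => continue, // re-check the flag
                    Err(RecvTimeoutError::Disconnected) => break,
                }
            }
        });
        *self.handle.lock().unwrap() = Some(h);
    }

    fn stop(&self) {
        self.running.store(false, Ordering::Release);
        if let Some(h) = self.handle.lock().unwrap().take() {
            let _ = h.join(); // thread exits within one timeout tick
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<String>();
    let d = Dispatcher {
        running: Arc::new(AtomicBool::new(false)),
        receiver: Arc::new(Mutex::new(rx)),
        handle: Mutex::new(None),
    };
    d.start();
    tx.send("event".into()).unwrap();
    d.stop();  // joins: no thread leaked
    d.start(); // restart is now safe: exactly one dispatcher again
    d.stop();
    println!("ok");
}
```

A sentinel event on the data channel would avoid the periodic wakeups, at the cost of threading a shutdown variant through the event type.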
```rust
/// Helper to register multiple watchers
async fn register_watchers(
    manager: &WatchManager,
    count: usize,
    key_prefix: &str,
) {
    for i in 0..count {
        let key = format!("{key_prefix}{i}");
        let _ = manager.register(key.into()).await;
    }
}
```
Apply-with-watchers benchmarks currently run with zero active watchers
`register_watchers` drops each `WatcherHandle` immediately:

```rust
for i in 0..count {
    let key = format!("{key_prefix}{i}");
    let _ = manager.register(key.into()).await; // handle dropped here
}
```

Since `WatcherHandle` cleanup runs on drop, by the time you call `notify_put` / `notify_delete` in the benchmarks there are no watchers registered. That means:
- `bench_apply_with_1_watcher`
- `bench_apply_with_10_watchers`
- `bench_apply_with_100_watchers`

are effectively measuring the same "no-watchers" overhead as `bench_apply_without_watch`.
To actually benchmark apply-path cost with active watchers, keep the handles alive for the duration of each iteration, e.g.:
```diff
-use d_engine_core::watch::WatchManager;
+use d_engine_core::watch::{WatchManager, WatcherHandle};

-/// Helper to register multiple watchers
-async fn register_watchers(
-    manager: &WatchManager,
-    count: usize,
-    key_prefix: &str,
-) {
-    for i in 0..count {
-        let key = format!("{key_prefix}{i}");
-        let _ = manager.register(key.into()).await;
-    }
-}
+/// Helper to register multiple watchers and keep them alive
+async fn register_watchers(
+    manager: &WatchManager,
+    count: usize,
+    key_prefix: &str,
+) -> Vec<WatcherHandle> {
+    let mut handles = Vec::with_capacity(count);
+    for i in 0..count {
+        let key = format!("{key_prefix}{i}");
+        handles.push(manager.register(key.into()).await);
+    }
+    handles
+}
```

And in the benches:

```diff
-    // Register 1 watcher
-    register_watchers(&watch_manager, 1, "key_").await;
+    // Register 1 watcher and keep it alive
+    let _watchers = register_watchers(&watch_manager, 1, "key_").await;
```

(similarly for 10 and 100 watchers). Keeping `_watchers` in scope preserves the intended watcher counts while still avoiding unused-variable warnings.
Also applies to: 294-332, 337-373, 379-415
🤖 Prompt for AI Agents
In d-engine-server/benches/state_machine.rs around lines 55-65 (and also apply
same change to ranges 294-332, 337-373, 379-415), the helper currently drops
each WatcherHandle immediately so benchmarks run with zero active watchers;
change register_watchers to collect and return (or keep) the WatcherHandle
instances in a Vec (e.g. Vec<WatcherHandle>) and keep that Vec in scope for the
duration of the benchmark iteration (name it with a leading underscore like
_watchers to avoid unused-variable warnings) so the handles are not dropped
before notify_put/notify_delete are called. Ensure callers use or bind the
returned Vec (or hold the local Vec) for the lifetime of each bench iteration.
Codecov Report
❌ Patch coverage is …
📢 Thoughts on this report? Let us know!
Add explicit `option go_package` to all Protocol Buffer definitions to enable
proper Go code generation with correct import paths. This change establishes
d-engine-proto as a language-agnostic protobuf distribution that supports
Go clients and future language implementations.
Changes:
- Add `option go_package = "github.com/deventlab/d-engine/proto/{package}"`
to all 7 proto files (common, error, client, server/*)
- Use organization-scoped package path (deventlab) instead of personal account
to ensure stability and transferability of d-engine as open source project
- Add .gitignore to d-engine-proto to prevent accidental commit of generated code
Benefits:
- Enables Go gRPC client generation with correct import paths
- Establishes clear, organization-owned package naming for open source
- Supports future multi-language client support (Python, Rust, Java, etc.)
- Prevents generated code from polluting the source repository
Documentation:
- Added d-engine-docs/src/docs/client_guide/go-client.md with comprehensive
Go client integration guide covering proto generation, best practices,
error handling, and common patterns
Related: #170
Move ClientResponseExt tests from client_ext.rs to client_ext_test.rs following project convention (howto.md #22) that requires unit tests in separate files rather than inline with source code.
- Fix memory leak in WatcherHandle::into_receiver by using Option pattern instead of mem::forget, eliminating cloned resource leaks
- Fix race condition in Drop implementations using DashMap's remove_if_mut for atomic unregistration operations
- Fix thread lifecycle issues by adding JoinHandle tracking and shutdown channel, enabling clean start/stop/start cycles
- Fix benchmark watcher leaks by returning Vec<WatcherHandle> to keep watchers registered during measurements
- Add comprehensive unit tests covering all fixes with concurrent scenarios
Type
Description
Implement lock-free Watch mechanism for real-time key monitoring
Performance targets achieved:
Related Issues
Checklist
Summary by CodeRabbit
Release Notes
New Features
Tests