
feat #174: Implement lock-free Watch mechanism for real-time key monitoring #177

Merged
JoshuaChi merged 5 commits into develop from feature/174_watch on Nov 25, 2025

Conversation

Contributor

@JoshuaChi commented Nov 14, 2025

Type

  • New feature
  • Bug Fix

Description

Implement lock-free Watch mechanism for real-time key monitoring

  • Add WatchManager with crossbeam-channel for lock-free event queue
  • Integrate watch notifications into StateMachine apply path (<0.01% overhead)
  • Implement gRPC streaming Watch RPC with automatic cleanup
  • Add WatchDispatcher for centralized stream lifecycle management
  • Fix WatcherHandle lifetime issue using std::mem::forget pattern
  • Add watch configuration to config/base/raft.toml with tuning guidelines
  • Add comprehensive test coverage: 10 unit + 9 integration + 5 benchmarks
  • Add developer documentation in d-engine-docs
  • Fix flaky timer test: improve reset() deadline assertion logic

Performance targets achieved:

  • Apply path overhead: <10ns per notification
  • End-to-end latency: <100μs for event delivery
  • Memory: ~2.4KB per watcher with default buffer size

Related Issues

Checklist

  • The code has been tested locally (unit test or integration test)
  • Squash down commits to one or two logical commits which clearly describe the work you've done.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced lock-free Watch system for real-time monitoring of key changes with PUT and DELETE events.
    • Added gRPC streaming support to receive watch notifications with at-most-once delivery and FIFO ordering per key.
    • Configurable watch parameters: enablement toggle, event queue size, watcher buffer size, and detailed metrics.
  • Tests

    • Added comprehensive integration and unit tests for Watch functionality including multi-watcher scenarios and overflow handling.

Closes #174
Copilot AI review requested due to automatic review settings on November 14, 2025 at 07:57

coderabbitai bot commented Nov 14, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Introduces a watch feature for monitoring key changes in the state machine. Adds a WatchManager for event notification, extends the gRPC API with a Watch RPC method, integrates with the state machine to emit events on writes, includes configuration options, and implements a dispatcher for managing watch streams.

Changes

  • Configuration (config/base/raft.toml, d-engine-core/src/config/raft.rs): Adds [raft.watch] configuration section with enabled, event_queue_size, watcher_buffer_size, and enable_metrics fields; introduces WatchConfig struct with validation and default helpers integrated into RaftConfig.
  • Core Watch Implementation (d-engine-core/src/watch/mod.rs, d-engine-core/src/watch/manager.rs, d-engine-core/src/watch/manager_test.rs): Implements lock-free WatchManager with event types, watcher lifecycle, background dispatcher thread, non-blocking event notification, and per-key watcher registry; includes comprehensive unit tests covering event delivery, cleanup, overflow, and isolation.
  • Dependencies (d-engine-core/Cargo.toml): Adds crossbeam-channel = "0.5" dependency for lock-free MPMC channels supporting the watch event queue.
  • Library Exports (d-engine-core/src/lib.rs): Exposes new watch module and re-exports the watch public API from the crate root.
  • State Machine Integration (d-engine-core/src/state_machine_handler/default_state_machine_handler.rs, d-engine-core/src/state_machine_handler/default_state_machine_handler_test.rs): Adds optional watch_manager field to DefaultStateMachineHandler; implements notify_watchers method to parse applied entries and emit PUT/DELETE events; updates constructor and test fixtures to use the new_without_watch helper.
  • gRPC Protocol (d-engine-proto/proto/client/client_api.proto): Adds WatchEventType enum (PUT, DELETE), WatchRequest and WatchResponse messages, and extends RaftClientService with a server-streaming Watch RPC method.
  • gRPC Service Implementation (d-engine-server/src/network/grpc/grpc_raft_service.rs, d-engine-server/src/network/grpc/watch_dispatcher.rs, d-engine-server/src/network/grpc/watch_handler.rs, d-engine-server/src/network/grpc/mod.rs): Implements the watch RPC method in RaftClientService; adds WatchDispatcher for managing watcher registrations and spawning stream handlers; adds WatchStreamHandler to forward events to gRPC clients; re-exports public types from the module.
  • Node Integration (d-engine-server/src/node/builder.rs, d-engine-server/src/node/mod.rs): Initializes WatchManager and WatchDispatcher when enabled; passes the watch manager to the state machine handler; adds optional fields and background task spawning.
  • Mock Implementations (d-engine-client/src/mock_rpc.rs, d-engine-core/src/test_utils/mock/mock_rpc.rs): Adds WatchStream associated type and watch method stub to mock RPC service implementations.
  • Test Utilities & Integration (d-engine-server/src/test_utils/integration/mod.rs, d-engine-server/src/test_utils/mock/mock_node_builder.rs, d-engine-server/tests/watch_integration_test.rs, d-engine-server/benches/state_machine.rs): Updates test helpers to pass None for the watch manager; adds watch_manager and watch_dispatcher_handle fields to mock nodes; introduces integration tests validating watch event delivery, isolation, overflow, and cleanup; adds benchmarks comparing apply performance with varying watcher counts.
  • Timer Test (d-engine-core/src/timer/timer_test.rs): Refactors test_election_timer_reset to validate minimum elapsed time after reset instead of comparing deadline progression.
  • Documentation (d-engine-docs/src/docs/server_guide/mod.rs, d-engine-docs/src/docs/server_guide/watch-feature.md): Adds watch_feature module to the documentation index; introduces a comprehensive guide covering configuration, client usage, event semantics, performance characteristics, and architectural overview.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant gRPC Service
    participant WatchDispatcher
    participant WatchManager
    participant StateMachine

    Client->>gRPC Service: Watch(WatchRequest: key)
    gRPC Service->>WatchManager: register(key)
    WatchManager->>WatchManager: create watcher, store in registry
    WatchManager-->>gRPC Service: WatcherHandle + Receiver
    
    gRPC Service->>WatchDispatcher: register(WatchRegistration)
    WatchDispatcher->>WatchDispatcher: spawn WatchStreamHandler task
    WatchDispatcher-->>gRPC Service: ready
    
    gRPC Service-->>Client: stream opened
    
    rect rgb(200, 220, 240)
        Note over StateMachine,WatchManager: Write Path
        Client->>StateMachine: Write(key, value)
        StateMachine->>WatchManager: notify_put(key, value)
        WatchManager->>WatchManager: queue event (non-blocking)
        WatchManager-->>StateMachine: return
    end
    
    rect rgb(240, 220, 200)
        Note over WatchManager,Client: Dispatch Path (background thread)
        WatchManager->>WatchManager: read from event queue
        WatchManager->>WatchManager: lookup key watchers
        WatchManager->>WatchManager: send event to watcher channels
        WatchDispatcher->>Client: WatchResponse(Put event)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • WatchManager lock-free design and concurrency guarantees: Verify correctness of DashMap usage, atomic operations, and non-blocking dispatch semantics under concurrent readers/writers.
  • State machine handler integration: Ensure notify_watchers correctly decodes WriteCommand, handles errors gracefully, and doesn't impact non-watch code paths.
  • Watch dispatcher lifecycle: Verify proper task spawning, shutdown coordination, and cleanup of watcher registrations when clients disconnect.
  • Configuration validation: Check that queue size constraints are enforced and buffer overflow behavior is correct.
  • gRPC streaming implementation: Confirm event ordering guarantees (FIFO per key) and proper error propagation to clients.
  • Test coverage: Verify integration tests comprehensively cover event delivery, key isolation, overflow scenarios, and edge cases like watcher drop and disabled watch config.

Poem

🐰 Whiskers twitch with watch delight,
Keys now notify day and night,
Lock-free queues dance swift and free,
Events bloom upon each key,
From state machine hearts take flight! 🌿✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main feature: a lock-free Watch mechanism for real-time key monitoring, which is the primary objective of this PR.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


@JoshuaChi
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Nov 14, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
d-engine-core/src/state_machine_handler/default_state_machine_handler.rs (1)

188-237: Guard watcher notifications behind successful apply
notify_watchers fires even when sm.apply_chunk(...) fails, so clients will observe PUT/DELETE events that never actually made it into the state machine. That breaks watch semantics and can leave replicas inconsistent from the client’s perspective. Move the notify call into the Ok(_) branch (after the state machine commit succeeds) so only successful applies publish events.

-        let apply_result = sm.apply_chunk(chunk.clone()).await;
-
-        // Notify watchers of changes (if watch manager is enabled)
-        if let Some(ref watch_mgr) = self.watch_manager {
-            self.notify_watchers(&chunk, watch_mgr);
-        }
+        let apply_result = sm.apply_chunk(chunk.clone()).await;

…and inside the Ok(_) arm:

             Ok(_) => {
+                if let Some(ref watch_mgr) = self.watch_manager {
+                    self.notify_watchers(&chunk, watch_mgr);
+                }
🧹 Nitpick comments (9)
d-engine-server/src/test_utils/mock/mock_node_builder.rs (1)

342-342: Consider simplifying the Arc dereference pattern.

Line 342 creates an intermediate Arc<RaftNodeConfig> variable that's only used once on line 345 to dereference and clone. For clarity, consider either:

Option 1 (inline): Remove line 342 and inline the dereference:

-        let node_config = node_config_arc.clone();
         tokio::spawn(async move {
             if let Err(e) =
-                grpc::start_rpc_server(node_clone, listen_address, (*node_config).clone(), shutdown)
+                grpc::start_rpc_server(node_clone, listen_address, (*node_config_arc).clone(), shutdown)
                     .await

Option 2 (rename): Use a more descriptive variable name:

-        let node_config = node_config_arc.clone();
+        let node_config_for_rpc = node_config_arc.clone();
         tokio::spawn(async move {
             if let Err(e) =
-                grpc::start_rpc_server(node_clone, listen_address, (*node_config).clone(), shutdown)
+                grpc::start_rpc_server(node_clone, listen_address, (*node_config_for_rpc).clone(), shutdown)
                     .await

Also applies to: 345-345

d-engine-docs/src/docs/server_guide/watch-feature.md (2)

21-48: Consider using rust syntax highlighting for code examples.

The code block uses text as the language identifier, but since this is Rust code, using rust would provide proper syntax highlighting for readers.

Apply this change:

-```text
+```rust
 use d_engine_client::RaftClient;

120-124: Consider specific language identifiers for code blocks.

Several code blocks use text as the language identifier. Consider using more specific identifiers where applicable (e.g., rust for Rust code, toml for configuration) to enable syntax highlighting.

Also applies to: 145-162, 192-212

d-engine-client/src/mock_rpc.rs (1)

4-5: Mock watch implementation comment vs behavior (and cross-mock consistency)

The watch mock currently always returns Status::unimplemented, but the comment says “return empty stream”, and the core mock (d-engine-core/src/test_utils/mock/mock_rpc.rs) actually returns an empty stream. For clarity and consistency, consider either:

  • Implementing this mock like the core mock (returning an empty stream), or
  • Updating the comment (and any expectations in tests) to reflect that this mock reports UNIMPLEMENTED.

This keeps behavior predictable for callers that may use either mock.

Also applies to: 16-17, 106-142

d-engine-server/benches/state_machine.rs (1)

39-53: Consider explicit WatchManager shutdown in benches (if Drop is not sufficient)

create_watch_manager calls WatchManager::start() and returns an Arc<WatchManager>, but none of the benchmarks ever call stop(). If WatchManager does not auto-stop on Drop, this can leave background dispatcher threads running across iterations and potentially skew measurements or leak resources.

If Drop does not already handle shutdown, consider:

  • Adding a small helper that wraps WatchManager in a RAII type which calls stop() on drop, or
  • Calling watch_manager.stop() at the end of each async iteration in bench_watch_e2e_latency and the apply-with-watchers benches.

If Drop already performs a clean shutdown, this can be ignored.

Also applies to: 421-445
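If WatchManager's Drop turns out not to stop the dispatcher, the RAII helper suggested above could look roughly like the sketch below. WatchManager here is a minimal stand-in stub with only a stop() method; the real type and its shutdown signature may differ:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in stub for the real WatchManager; only `stop()` matters here.
pub struct WatchManager {
    stopped: AtomicBool,
}

impl WatchManager {
    pub fn stop(&self) {
        self.stopped.store(true, Ordering::SeqCst);
    }
    pub fn is_stopped(&self) -> bool {
        self.stopped.load(Ordering::SeqCst)
    }
}

/// RAII wrapper: guarantees `stop()` runs when the bench scope ends,
/// even on early return or panic.
pub struct WatchManagerGuard(pub Arc<WatchManager>);

impl Drop for WatchManagerGuard {
    fn drop(&mut self) {
        self.0.stop();
    }
}

pub fn demo() -> Arc<WatchManager> {
    let manager = Arc::new(WatchManager { stopped: AtomicBool::new(false) });
    {
        let _guard = WatchManagerGuard(manager.clone());
        // ... run a benchmark iteration here ...
    } // guard dropped here -> stop() is called exactly once
    manager
}
```

Each bench iteration can then create the guard at the top of its scope and rely on drop order for deterministic cleanup, instead of remembering to call stop() on every exit path.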

d-engine-server/tests/watch_integration_test.rs (1)

1-363: Integration tests are valuable; consider explicit WatchManager shutdown for consistency

These integration tests give good coverage of WatchManager behavior in the server context (dispatch, isolation, cleanup, overflow, sequencing, and disabled path). One small lifecycle detail:

  • Each test calls WatchManager::start() on an Arc<WatchManager>, but none call stop().

If WatchManager does not already shut itself down in Drop, consider calling manager.stop() at the end of each test (as is done in d-engine-core/src/watch/manager_test.rs) to ensure dispatcher threads are cleaned up deterministically and to keep behavior consistent across test suites.

d-engine-server/src/network/grpc/grpc_raft_service.rs (1)

311-314: Align watch RPC with other client RPCs and avoid panic on misconfig

The overall wiring of WatchStream and the watch() implementation looks good, but two points are worth tightening up:

  1. Readiness gating
    Other client RPCs (handle_client_write, handle_client_read, etc.) all short-circuit if the node is not ready. watch() currently doesn’t, so a client could register watches before the node is operational. To keep behavior consistent:
@@
-    async fn watch(
-        &self,
-        request: tonic::Request<WatchRequest>,
-    ) -> std::result::Result<tonic::Response<Self::WatchStream>, tonic::Status> {
-        let watch_request = request.into_inner();
-        let key = watch_request.key;
+    async fn watch(
+        &self,
+        request: tonic::Request<WatchRequest>,
+    ) -> std::result::Result<tonic::Response<Self::WatchStream>, tonic::Status> {
+        if !self.server_is_ready() {
+            warn!("[watch] Node-{} is not ready!", self.node_id);
+            return Err(Status::unavailable("Service is not ready"));
+        }
+
+        let watch_request = request.into_inner();
+        let key = watch_request.key;
  2. Avoid panicking via expect on watch_manager
    watch_manager is accessed with expect("Watch manager must exist if dispatcher handle exists"). If builder wiring ever regresses, this would panic in the hot RPC path instead of returning a structured error. A safer pattern with logging:
-        let watch_manager = self
-            .watch_manager
-            .as_ref()
-            .expect("Watch manager must exist if dispatcher handle exists");
+        let watch_manager = self
+            .watch_manager
+            .as_ref()
+            .ok_or_else(|| {
+                error!(
+                    node_id = self.node_id,
+                    "watch_dispatcher_handle is set but watch_manager is None"
+                );
+                Status::internal("Watch manager missing while watch feature is enabled")
+            })?;

With these tweaks, the watch path matches the robustness and readiness semantics of the existing RPCs.

Also applies to: 387-454

d-engine-server/src/node/mod.rs (1)

44-53: Optional watch fields on Node are well-scoped; consider documenting invariant

Adding watch_manager: Option<Arc<WatchManager>> and watch_dispatcher_handle: Option<WatchDispatcherHandle> with from_raft defaulting both to None cleanly preserves existing behavior when watch is disabled. It does, however, rely on builders ensuring “both-or-none” is maintained.

A brief comment near the struct or in the builder enforcing/documenting the invariant “watch_manager.is_some() == watch_dispatcher_handle.is_some()” would make it clearer that RPC code can safely assume they’re coupled (and avoid surprises if future code sets only one of them).

Also applies to: 85-92, 201-203
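Beyond documenting the invariant, one way to make "both-or-none" unrepresentable is to store the pair in a single Option. The types below are stubs for illustration; the real Node has many more fields:

```rust
use std::sync::Arc;

// Stub stand-ins for the real types.
pub struct WatchManager;
pub struct WatchDispatcherHandle;

/// Coupling the two fields in one Option makes
/// `watch_manager.is_some() == watch_dispatcher_handle.is_some()`
/// impossible to violate by construction.
pub struct Node {
    watch: Option<(Arc<WatchManager>, WatchDispatcherHandle)>,
}

impl Node {
    pub fn without_watch() -> Self {
        Node { watch: None }
    }

    pub fn with_watch(manager: Arc<WatchManager>, handle: WatchDispatcherHandle) -> Self {
        Node { watch: Some((manager, handle)) }
    }

    /// RPC code now gets both halves or neither, never a half-wired node.
    pub fn watch_parts(&self) -> Option<(&Arc<WatchManager>, &WatchDispatcherHandle)> {
        self.watch.as_ref().map(|(m, h)| (m, h))
    }
}
```

This trades two fields for one, at the cost of a slightly more verbose accessor, but it removes the class of bug where a future builder change sets only one of them.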

d-engine-core/src/watch/manager.rs (1)

355-389: Notify functions match the “non-blocking, best-effort” design

notify_put and notify_delete correctly construct a WatchEvent and use try_send on the global queue, which aligns with the documented choice to drop events under backpressure to protect the write path.

If you later need better observability around drops, consider incrementing a lightweight counter or emitting a debug-level trace on try_send failure, guarded by a cheap feature flag, but the current behavior is consistent with the stated design.
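The drop-on-backpressure behavior described above can be illustrated with std's sync_channel. The real code uses crossbeam-channel, but try_send has the same non-blocking shape in both APIs; the function and counter below are hypothetical, not the project's actual names:

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Best-effort notify: never blocks the write path. Returns whether the
/// event was enqueued (true) or dropped under backpressure (false).
pub fn notify_best_effort(tx: &SyncSender<String>, event: String) -> bool {
    match tx.try_send(event) {
        Ok(()) => true,
        // Queue full or receiver gone: drop the event. A drop counter
        // could be incremented here for observability.
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}

pub fn demo() -> (usize, usize) {
    // Capacity 2 stands in for a small bounded global event queue.
    let (tx, _rx) = sync_channel::<String>(2);
    let (mut sent, mut dropped) = (0, 0);
    for i in 0..5 {
        if notify_best_effort(&tx, format!("event-{i}")) {
            sent += 1;
        } else {
            dropped += 1;
        }
    }
    (sent, dropped) // with no consumer draining: 2 enqueued, 3 dropped
}
```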

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69279ee and b7000cc.

⛔ Files ignored due to path filters (3)
  • Cargo.lock is excluded by !**/*.lock
  • d-engine-proto/src/generated/d_engine.client.rs is excluded by !**/generated/**
  • examples/rocksdb-cluster/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (25)
  • config/base/raft.toml (1 hunks)
  • d-engine-client/src/mock_rpc.rs (4 hunks)
  • d-engine-core/Cargo.toml (1 hunks)
  • d-engine-core/src/config/raft.rs (4 hunks)
  • d-engine-core/src/lib.rs (2 hunks)
  • d-engine-core/src/state_machine_handler/default_state_machine_handler.rs (7 hunks)
  • d-engine-core/src/state_machine_handler/default_state_machine_handler_test.rs (28 hunks)
  • d-engine-core/src/test_utils/mock/mock_rpc.rs (3 hunks)
  • d-engine-core/src/timer/timer_test.rs (1 hunks)
  • d-engine-core/src/watch/manager.rs (1 hunks)
  • d-engine-core/src/watch/manager_test.rs (1 hunks)
  • d-engine-core/src/watch/mod.rs (1 hunks)
  • d-engine-docs/src/docs/server_guide/mod.rs (2 hunks)
  • d-engine-docs/src/docs/server_guide/watch-feature.md (1 hunks)
  • d-engine-proto/proto/client/client_api.proto (1 hunks)
  • d-engine-server/benches/state_machine.rs (4 hunks)
  • d-engine-server/src/network/grpc/grpc_raft_service.rs (4 hunks)
  • d-engine-server/src/network/grpc/mod.rs (1 hunks)
  • d-engine-server/src/network/grpc/watch_dispatcher.rs (1 hunks)
  • d-engine-server/src/network/grpc/watch_handler.rs (1 hunks)
  • d-engine-server/src/node/builder.rs (6 hunks)
  • d-engine-server/src/node/mod.rs (3 hunks)
  • d-engine-server/src/test_utils/integration/mod.rs (1 hunks)
  • d-engine-server/src/test_utils/mock/mock_node_builder.rs (2 hunks)
  • d-engine-server/tests/watch_integration_test.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Lint and Format Check
  • GitHub Check: Lint and Format Check
🔇 Additional comments (20)
d-engine-server/src/test_utils/mock/mock_node_builder.rs (2)

306-307: LGTM: Watch fields correctly initialized for mock nodes.

The watch subsystem fields are appropriately set to None in the mock builder, as test fixtures don't require watch functionality.


329-329: LGTM: Arc-based config sharing implemented correctly.

The refactoring to use Arc<RaftNodeConfig> properly enables shared ownership between the Node struct and the RPC server task, which is necessary for crossing async boundaries.

Also applies to: 336-341

d-engine-core/src/timer/timer_test.rs (1)

124-137: LGTM! Improved test reliability.

The revised assertion logic fixes the flaky test by ensuring the new deadline is measured from the reset moment rather than comparing against a potentially stale old deadline. This approach is more robust for timing-sensitive tests.
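The "measure from the reset moment" pattern can be sketched as below. ElectionTimer here is a simplified stand-in, not the project's actual timer API:

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for the election timer: reset() sets a new
// deadline relative to *now*, which is the property under test.
struct ElectionTimer {
    deadline: Instant,
    timeout: Duration,
}

impl ElectionTimer {
    fn new(timeout: Duration) -> Self {
        ElectionTimer { deadline: Instant::now() + timeout, timeout }
    }
    fn reset(&mut self) {
        self.deadline = Instant::now() + self.timeout;
    }
    fn remaining(&self) -> Duration {
        self.deadline.saturating_duration_since(Instant::now())
    }
}

pub fn demo() -> bool {
    let mut timer = ElectionTimer::new(Duration::from_millis(50));
    std::thread::sleep(Duration::from_millis(10));
    timer.reset();
    // Robust assertion: after reset, nearly the full timeout remains,
    // measured against the reset moment rather than the stale deadline.
    timer.remaining() >= Duration::from_millis(40)
}
```

Asserting a minimum remaining duration tolerates scheduler jitter, whereas comparing old and new deadlines directly can fail when the two reads race with timer expiry.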

config/base/raft.toml (1)

105-131: LGTM! Well-documented configuration.

The watch configuration section is comprehensive with clear tuning guidelines, traffic-based recommendations, and memory impact estimates. This will help operators configure the system appropriately for their workload.

d-engine-core/src/lib.rs (1)

19-19: LGTM! Consistent module structure.

The watch module is added and re-exported following the existing crate organization pattern.

Also applies to: 28-28

d-engine-server/src/test_utils/integration/mod.rs (1)

238-238: LGTM! Appropriate test setup.

Passing None for the watch manager in test utilities is correct for tests that don't exercise the watch feature.

d-engine-docs/src/docs/server_guide/mod.rs (1)

10-10: LGTM! Documentation properly integrated.

The watch feature documentation is correctly added to the server guide following the established pattern.

Also applies to: 20-22

d-engine-docs/src/docs/server_guide/watch-feature.md (1)

1-271: Excellent comprehensive documentation!

The Watch feature documentation is thorough and well-organized. It effectively covers:

  • Clear quick start guide
  • Detailed configuration with tuning guidelines
  • Behavior guarantees and limitations
  • Performance characteristics with concrete numbers
  • Practical use cases and examples
  • Troubleshooting guide
  • Architecture overview

This will be very helpful for users implementing the Watch feature.

d-engine-server/src/network/grpc/mod.rs (1)

12-16: LGTM! Proper module organization.

The watch-related modules and re-exports are correctly integrated following the existing crate structure. Using pub(crate) visibility is appropriate for internal server components.

d-engine-core/Cargo.toml (1)

51-52: No action required—version specification is valid.

The latest stable version of crossbeam-channel is 0.5.15 (released April 8, 2025). The "0.5" specification in Cargo.toml uses semantic versioning and will automatically resolve to the latest 0.5.x release, including 0.5.15. The dependency is correctly specified.

d-engine-core/src/watch/manager_test.rs (1)

1-320: Comprehensive WatchManager unit coverage looks solid

The tests thoroughly exercise core behaviors (dispatch, multiple watchers, cleanup, isolation, overflow, sequencing, and lifecycle) and explicitly start/stop the manager, which should help catch regressions in the dispatcher logic.

d-engine-core/src/test_utils/mock/mock_rpc.rs (1)

11-12: Watch streaming stub in core mock is appropriate

The WatchStream alias and watch implementation (empty ReceiverStream) fit the gRPC streaming shape and provide a safe, no-op stub for tests that just need the RPC surface to exist.

Also applies to: 155-194

d-engine-core/src/state_machine_handler/default_state_machine_handler_test.rs (2)

47-55: Tests correctly opt out of watch via new_without_watch

Using DefaultStateMachineHandler::<MockTypeConfig>::new_without_watch in these tests keeps behavior identical to the pre-watch constructor while avoiding any dependency on the watch pipeline. This is a clean way to adapt tests to the new optional watch manager without changing expectations.


382-399: Shared snapshot test helper updated consistently to new_without_watch

Updating create_test_handler to construct DefaultStateMachineHandler via new_without_watch centralizes the “no watch in tests” behavior and keeps all snapshot-related tests off the watch path. This also means future changes to the main constructor won’t require touching each test.

d-engine-core/src/watch/mod.rs (1)

1-140: Watch module surface and documentation look solid

The module docs clearly describe the architecture, performance characteristics, and error semantics, and the re-exports (WatchManager, WatchEvent*, WatcherHandle*, WatchConfig) present a tidy public surface for consumers. No issues spotted in how the module is structured or exposed.

d-engine-proto/proto/client/client_api.proto (1)

90-123: Watch proto surface matches server design and is well-documented

The new WatchEventType, WatchRequest, WatchResponse, and Watch RPC are coherently modeled and align with the server-side watch implementation and documented semantics (single-key watch, PUT/DELETE events, possible drops under backpressure). Field choices and numbering are straightforward, so the addition looks safe from a compatibility and readability perspective.

Also applies to: 130-142

d-engine-core/src/watch/manager.rs (4)

1-62: Well-documented architecture and event model

The top-level docs and the WatchEventType / WatchEvent definitions are clear and match the stated performance and delivery guarantees. Using Bytes for keys/values and an explicit event type enum is a solid foundation for the rest of the design.


245-266: WatchManager::new construction path looks good

The split between bounded vs unbounded global event queues based on config.event_queue_size and the initialization of WatchManagerInner (DashMap + atomics + config) is straightforward and matches the documented behavior. No issues here from a correctness standpoint.


391-429: Dispatch and monitoring helpers are straightforward and correct

  • dispatch_event looks up watchers by key and does per-watcher try_send with cloning of the event, which is cheap enough given Bytes-backed payloads and consistent with non-blocking dispatch.
  • watcher_count and watched_key_count are simple introspection helpers that don’t affect the hot path and are useful for tests/metrics.

No functional issues spotted here.


317-344: The review comment is incorrect. Config-level validation already prevents this panic.

The WatchConfig struct in d-engine-core/src/config/raft.rs includes explicit validation that returns an error if watcher_buffer_size == 0 when watch is enabled. The validation occurs at configuration time (before WatchManager initialization), not at runtime in register(). This means any attempt to set watcher_buffer_size to 0 fails with a clear error message during config parsing, making it impossible for the register() method to be called with an invalid buffer size.

The review overlooks this existing safety mechanism and proposes adding a redundant guard in register(). No code changes are needed.

Likely an incorrect or invalid review comment.

Comment on lines +96 to +113
/// # Panics
///
/// Panics if the receiver has already been taken.
pub fn into_receiver(mut self) -> (u64, Bytes, mpsc::Receiver<WatchEvent>, WatcherHandleGuard) {
    let id = self.id;
    let key = self.key.clone();
    let receiver = self.receiver.take().expect("receiver already taken");

    let guard = WatcherHandleGuard {
        id,
        key: key.clone(),
        manager: self.manager.clone(),
    };

    // Prevent Drop from being called on self, since guard will handle cleanup
    std::mem::forget(self);

    (id, key, receiver, guard)

⚠️ Potential issue | 🟠 Major

WatcherHandle::into_receiver leaks key and manager via mem::forget(self)

std::mem::forget(self) prevents Drop from running on WatcherHandle, so the embedded Bytes key and Arc<WatchManagerInner> never have their refcounts decremented. Because you clone both (key.clone(), self.manager.clone()), every into_receiver call permanently leaks at least one Bytes allocation and one Arc<WatchManagerInner> that will never be dropped, even after all handles, guards, and managers are gone.

Consider restructuring so you can move, not clone, the cleanup state into the guard without mem::forget, for example:

  • Wrap cleanup fields (id, key, manager) in an Option/ManuallyDrop inside WatcherHandle and take() them in into_receiver, so Drop becomes a no-op when the inner state has been moved; or
  • Introduce a small shared WatcherRegistration inner struct held in an Arc that owns the cleanup logic, and let both WatcherHandle and WatcherHandleGuard reference that instead of cloning and forgetting.

That way you avoid double-cleanup while still allowing all heap-owned resources to be reclaimed.

🤖 Prompt for AI Agents
d-engine-core/src/watch/manager.rs around lines 96 to 113: into_receiver
currently calls std::mem::forget(self) after cloning key and manager which leaks
the cloned Bytes and Arc; instead, make the cleanup state movable into the guard
so you don't need mem::forget — wrap id, key and manager in Option or
ManuallyDrop inside WatcherHandle (or introduce a shared WatcherRegistration
behind an Arc) and call take() to move them into the WatcherHandleGuard returned
from into_receiver, implement Drop to be a no-op when the inner Option is None
(or let the shared Arc handle cleanup), and remove mem::forget and redundant
clones so the Bytes and Arc get properly dropped when no longer used.

Comment on lines +134 to +172
impl Drop for WatcherHandleGuard {
    fn drop(&mut self) {
        // Guard cleanup: remove watcher from registry
        if let Some(mut watchers) = self.manager.watchers.get_mut(&self.key) {
            watchers.retain(|w| w.id != self.id);

            if watchers.is_empty() {
                drop(watchers);
                self.manager.watchers.remove(&self.key);
            }
        }

        trace!(
            watcher_id = self.id,
            key = ?self.key,
            "Watcher unregistered via guard"
        );
    }
}

impl Drop for WatcherHandle {
    fn drop(&mut self) {
        // Automatic cleanup when client disconnects
        if let Some(mut watchers) = self.manager.watchers.get_mut(&self.key) {
            watchers.retain(|w| w.id != self.id);

            if watchers.is_empty() {
                drop(watchers);
                self.manager.watchers.remove(&self.key);
            }
        }

        trace!(
            watcher_id = self.id,
            key = ?self.key,
            "Watcher unregistered"
        );
    }
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Race condition in Drop impls can remove newly-registered watchers for the same key

Both WatcherHandleGuard::drop and WatcherHandle::drop do:

if let Some(mut watchers) = self.manager.watchers.get_mut(&self.key) {
    watchers.retain(|w| w.id != self.id);

    if watchers.is_empty() {
        drop(watchers);
        self.manager.watchers.remove(&self.key);
    }
}

Between drop(watchers) and self.manager.watchers.remove(&self.key), another thread can successfully register a new watcher for the same key (via entry(key).or_default().push(...)). In that case:

  1. The dropping thread empties the Vec and releases the bucket lock.
  2. A registering thread pushes a new watcher into that Vec.
  3. The dropping thread then calls remove(&self.key) and deletes the entire entry, including the newly-registered watcher.

Result: watchers for a key can disappear unexpectedly under concurrent register/drop.

A simple, safe fix is to keep the per-key entry even when the Vec becomes empty and avoid the separate remove call:

-        if let Some(mut watchers) = self.manager.watchers.get_mut(&self.key) {
-            watchers.retain(|w| w.id != self.id);
-
-            if watchers.is_empty() {
-                drop(watchers);
-                self.manager.watchers.remove(&self.key);
-            }
-        }
+        if let Some(mut watchers) = self.manager.watchers.get_mut(&self.key) {
+            watchers.retain(|w| w.id != self.id);
+            // Intentionally keep an empty Vec to avoid racing with concurrent registrations.
+        }

Apply the same pattern to both Drop impls. You trade a negligible number of empty Vecs for correct concurrent behavior.


I'll start by examining the actual code structure to verify the race condition claim.



🏁 Script executed:

#!/bin/bash
# First, find and examine the manager.rs file to understand data structures
find . -name "manager.rs" -path "*/watch/*" -type f | head -20

Length of output: 36


🏁 Script executed:

#!/bin/bash
# Look at the context around lines 134-172 to see the struct definitions and imports
fd "manager\.rs" --path "*/watch/*" -x cat {} | head -200

Length of output: 215


🏁 Script executed:

#!/bin/bash
# Search for the WatcherHandle and WatcherHandleGuard struct definitions
rg -n "struct WatcherHandle|struct WatcherHandleGuard" --type rs -A 5 -B 2

Length of output: 26


🏁 Script executed:

#!/bin/bash
# Check the manager struct definition to understand what 'watchers' is
rg -n "struct.*Manager|watchers\s*:" --type rs -C 3 | head -100

Length of output: 26


🏁 Script executed:

#!/bin/bash
# Look for DashMap imports and usage
rg -n "use.*DashMap|dashmap" --type rs -C 2

Length of output: 26


🏁 Script executed:

#!/bin/bash
# Examine the full Drop impl to understand the exact code
sed -n '130,180p' d-engine-core/src/watch/manager.rs

Length of output: 1290


🏁 Script executed:

#!/bin/bash
# Get broader context of the file structure
wc -l d-engine-core/src/watch/manager.rs
rg -n "impl Drop for WatcherHandleGuard|impl Drop for WatcherHandle" --type rs -A 20

Length of output: 65


🏁 Script executed:

#!/bin/bash
# Check if there are registration methods using entry API
rg -n "\.entry\(|or_default\(\)\.push" --type rs -B 2 -A 2

Length of output: 26


🏁 Script executed:

#!/bin/bash
# Look for synchronization primitives around watchers
rg -n "Mutex|RwLock|DashMap" d-engine-core/src/watch/manager.rs -C 2

Length of output: 1022


🏁 Script executed:

#!/bin/bash
# Get the exact lines 134-172 from the file
head -n 172 d-engine-core/src/watch/manager.rs | tail -n 39

Length of output: 1061


🏁 Script executed:

#!/bin/bash
# Check complete file to understand full context
cat d-engine-core/src/watch/manager.rs

Length of output: 12554


I need to verify DashMap's locking semantics to confirm the race condition claim.

Now let me also check if there are any existing tests or documentation in the codebase about concurrent behavior:



🏁 Script executed:

#!/bin/bash
# Search for tests related to Drop or concurrent operations
rg -n "test|#\[" d-engine-core/src/watch/manager.rs | grep -i "drop\|concurrent\|race\|thread"

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if there are test files for manager
find d-engine-core -name "*test*" -o -name "*tests*" | xargs grep -l "watch\|manager" 2>/dev/null || echo "No test files found"

Length of output: 201


🏁 Script executed:

#!/bin/bash
# Check Cargo.toml for dashmap version
grep -A 5 -B 5 "dashmap" d-engine-core/Cargo.toml || grep -r "dashmap" . --include="Cargo.toml" | head -5

Length of output: 300


🏁 Script executed:

#!/bin/bash
# Look for any synchronization or thread safety comments in manager.rs
rg -n "thread|concurrent|race|safe|lock" d-engine-core/src/watch/manager.rs

Length of output: 1696


🏁 Script executed:

#!/bin/bash
# Check if there are other usages of DashMap in codebase to see patterns
rg -n "DashMap" --type rust -B 2 -A 2 | head -40

Length of output: 2999


🏁 Script executed:

#!/bin/bash
# Check for entry() usage patterns in the codebase to understand registration flow
rg "\.entry\(" d-engine-core/src/watch/manager.rs -B 3 -A 3

Length of output: 210


🌐 Web query:

DashMap get_mut RefMut lock release behavior when dropped

💡 Result:

Short answer: RefMut returned by DashMap::get_mut holds the shard's write lock for the duration of the RefMut value; when the RefMut is dropped (goes out of scope), its lock is released and other threads can access that shard. The docs explicitly show get_mut returns a mapref::one::RefMut and the RefMut type provides the mutable access (and is the RAII guard). [1][2]

Notes/warnings:

  • While the RefMut is alive the shard is locked; calling other DashMap methods that require the same shard may deadlock. Use scoped access (drop the RefMut as soon as possible) or try_get_mut/try_get to avoid blocking. [2]

Sources:

  • dashmap::mapref::one::RefMut (struct docs). [1]
  • dashmap::DashMap::get_mut (method docs + locking behaviour). [2]

🌐 Web query:

DashMap entry or_default concurrent behavior between get_mut and remove

💡 Result:

Short answer: Entry::or_default inserts (if needed) and returns a RefMut that holds the map's mutable reference (a shard write lock) for that entry. While that RefMut is held, operations that need a mutable lock on the same shard (e.g., remove, get_mut, insert, iter_mut, remove_if_mut, etc.) will block or, if attempted from the same thread, may deadlock. In practice: calling entry(key).or_default() -> RefMut keeps the lock until the RefMut is dropped; a concurrent thread that calls remove(key) will wait for that RefMut to be dropped, and calling remove/get_mut while still holding the RefMut in the same thread can deadlock. [1][2][3]

References

  • DashMap docs: remove() and get_mut() locking behaviour. [1]
  • DashMap Entry::or_default returning RefMut (holds mutable reference). [2]
  • Community explanation / deadlock examples (StackOverflow). [3]

Sources
[1] DashMap API docs — remove(), get_mut() locking behaviour.
[2] dashmap::mapref::entry::Entry::or_default docs (returns RefMut).
[3] StackOverflow discussion on DashMap deadlocks.


🌐 Web query:

DashMap bucket locking semantics lock-free concurrent HashMap

💡 Result:

Short answer: DashMap is sharded and uses per-shard RwLocks (reader/writer locks) — not a lock‑free hash map. Each shard contains a hashbrown raw table; reads take a shard read-lock, writes take a shard write-lock, and many APIs return guard objects that hold those locks. Because it uses locks (sharding + RwLocks) it is fast and concurrent, but not lock‑free/wait‑free and can deadlock if you call locking APIs while holding a map reference. [1][2][3]

Sources:

  • DashMap source (shards + RwLock + RawTable implementation). [1]
  • DashMap docs (locking behaviour on get/insert/iter and deadlock warnings). [2]
  • DashMap API reference / docs site (summary: sharded RwLock approach). [3]

🌐 Web query:

Rust DashMap empty Vec accumulation impact performance memory

💡 Result:

Short answer

  • An empty Vec (Vec::new()) does not allocate heap memory, but the Vec value itself still takes space as the Vec struct (typically 24 bytes on 64-bit). [1][2]
  • Storing many empty Vecs inside a DashMap therefore won't cause lots of heap allocations immediately, but each map entry still has per-entry overhead (map bucket/metadata, key storage, and the Vec struct). That overhead can be significant if you have millions of keys. [3]
  • Performance cost shows up when you mutate those Vecs (pushes) because each push may allocate, and because DashMap iteration/lookup has the usual hashmap overhead vs a compact Vec-based layout. Use appropriate data layout if you care about hot-path throughput or memory density. [3][4]

Practical guidance

  • If you only need to represent “no items” vs “some items” and want to avoid storing the Vec struct for many keys, consider Option<Vec> — but note Option<Vec> usually does not save space over Vec thanks to Rust’s null-pointer optimization for Vec, so it often doesn’t help memory. [2]
  • If per-entry size/density matters: consider alternatives — e.g. store indices into a shared arena/Vec, use smallvec (inlined small buffer) for very small lists, or allocate the Vec lazily (only insert a value when you first need to push). These reduce per-key overhead and heap churn when lists remain empty. [4]
  • Benchmark your real workload (insertion/mutation/iteration patterns). Hashmap-like structures (including DashMap) have higher per-entry and iteration overhead than a compact Vec for dense data; but they give concurrent access and O(1) keyed lookup which may be required. [3][4]

Sources

  • DashMap docs (DashMap behavior / creation). [1]
  • Vec allocation / empty-vec behavior discussion (Vec::new() does not allocate; Option notes). [2]
  • Benchmarks and discussions comparing HashMap/DashMap overhead vs Vec and why structure choice matters. [3][4]

References
[1] DashMap documentation.
[2] StackOverflow / Rust answers on empty Vec allocation and Option.
[3] Rust community posts / benchmarks showing HashMap vs Vec performance differences.
[4] Blog posts and discussions on allocation patterns (smallvec, arena, lazy allocation).


🌐 Web query:

DashMap compare swap pattern for atomic remove operations

💡 Result:

Use DashMap's conditional remove APIs (remove_if / remove_if_mut) so the check-and-remove happens atomically under the shard lock instead of doing a separate load + remove (which races). Example:

  • Read-check-remove in one operation:
    • dashmap.remove_if(&key, |k, v| /* return true to remove */ );

Example snippet:
let map = dashmap::DashMap::new();
map.insert("x", 1);
// remove only if value == 1
map.remove_if(&"x", |_, &v| v == 1);

This is the intended compare-and-swap style pattern for DashMap: evaluate the predicate and remove under the same lock/shard. See the DashMap docs for remove_if/remove_if_mut. [1][2]

Sources:
[1] DashMap docs — remove_if / remove_if_mut. [2] DashMap crate source (implementation).


Let me check the test file to see if there are any existing tests that catch this race:



🏁 Script executed:

#!/bin/bash
# Examine the manager test file to see what's tested
cat d-engine-core/src/watch/manager_test.rs

Length of output: 10147


🏁 Script executed:

#!/bin/bash
# Search for any code that checks if a key exists (depends on entry removal)
rg "contains_key|is_empty\(\)" d-engine-core/src/watch/ -n

Length of output: 151


🏁 Script executed:

#!/bin/bash
# Check if watcher_count or watched_key_count are used in tests (which would be affected)
rg "watcher_count|watched_key_count" d-engine-core -n

Length of output: 831


Fix race condition in Drop impls, but address test breakage from proposed solution

The race condition is confirmed. DashMap's get_mut() releases its shard lock when the RefMut is dropped, creating a window where another thread can successfully insert a new watcher via entry().or_default().push() before this thread calls remove(&self.key), causing newly-registered watchers to be deleted.

The proposed fix (keeping empty Vecs) correctly eliminates the race window. However, it breaks test_watcher_auto_cleanup() (manager_test.rs:126-138), which asserts watched_key_count() == 0 after all watchers drop. With empty Vecs remaining in the DashMap, watched_key_count() will not reach zero.

Two options:

  1. Accept empty Vecs + update test: Update test_watcher_auto_cleanup() to remove assertions that watched_key_count() == 0, since entries will remain (with negligible overhead: ~24 bytes per Vec struct). Apply the proposed fix to both WatcherHandleGuard::drop and WatcherHandle::drop.

  2. Atomic removal: Use DashMap's remove_if_mut to make the check-and-remove atomic, avoiding the race without keeping empty entries. This is more complex but preserves current semantics.

Choose which approach aligns with your design intent, then apply the fix and update the corresponding test.

🤖 Prompt for AI Agents
d-engine-core/src/watch/manager.rs around lines 134 to 172: the current Drop
impls use get_mut() then remove(&self.key) which drops the shard lock before
removal and allows a race where a concurrent insert can re-create a watcher
vector that this Drop will then remove; instead, replace the get_mut() + retain
+ conditional remove logic with an atomic remove_if_mut (or equivalent DashMap
API) that performs the "if vec is empty then remove key" check and removal under
the same shard lock so no race window exists; apply this change to both
WatcherHandleGuard::drop and WatcherHandle::drop and keep the existing test
expectations (do not change test_watcher_auto_cleanup()).

Comment on lines +268 to +304
/// Start the background event dispatcher thread
///
/// This spawns a dedicated thread that reads from the event queue and
/// dispatches events to registered watchers.
pub fn start(&self) {
    if self.inner.running.swap(true, Ordering::SeqCst) {
        // Already running
        return;
    }

    let inner = self.inner.clone();
    let receiver = Arc::clone(&self.event_receiver);

    std::thread::spawn(move || {
        debug!("Watch dispatcher thread started");

        while inner.running.load(Ordering::Acquire) {
            match receiver.recv() {
                Ok(event) => {
                    Self::dispatch_event(&inner, event);
                }
                Err(_) => {
                    // Channel closed, exit thread
                    break;
                }
            }
        }

        debug!("Watch dispatcher thread stopped");
    });
}

/// Stop the background dispatcher thread
pub fn stop(&self) {
    self.inner.running.store(false, Ordering::Release);
}


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

start/stop lifecycle can leak dispatcher threads and has surprising semantics

Current behavior:

  • start() spawns a dispatcher thread the first time running transitions from false to true.
  • The dispatcher loops on receiver.recv() while running is true.
  • stop() just sets running = false and returns.

Two issues:

  1. If stop() is called when there are no further events and the channel stays open, the dispatcher thread will remain blocked in recv() indefinitely (it only observes running again after recv() returns).
  2. After stop(), running is false, so calling start() again spawns an additional dispatcher thread. The original thread may still be blocked on recv(), so repeated start/stop cycles can accumulate sleeping threads competing for the same queue.

Depending on how tests and server lifecycles are structured, this can cause thread leaks and harder-to-reason-about dispatch behavior.

Consider either:

  • Making WatchManager “start once, no-op stop” (and adjusting docs accordingly), or
  • Introducing an explicit shutdown mechanism (e.g., send a sentinel event or use a control channel / recv_timeout) and storing a JoinHandle so stop() can reliably terminate and optionally join the dispatcher.

I'll verify the review comment by examining the WatchManager implementation and understanding the actual threading behavior.


---

🏁 Script executed:

#!/bin/bash
# Find and examine the WatchManager struct definition and inner structure
rg -n "struct WatchManager" -A 15 d-engine-core/src/watch/

Length of output: 2240


🏁 Script executed:

#!/bin/bash
# Look for the inner struct or running field definition
rg -n "running.*Atomic\|Channel.*receiver" -B2 -A2 d-engine-core/src/watch/manager.rs

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for test cases involving start/stop cycles
rg -n "\.start\(\)|\.stop\(\)" -B3 -A3 d-engine-core/src/watch/ --type rust

Length of output: 11273


🏁 Script executed:

#!/bin/bash
# Look for how event_receiver is initialized and if there's cleanup
rg -n "event_receiver\|event_sender" -B2 -A2 d-engine-core/src/watch/manager.rs

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if there's a shutdown or drop implementation
rg -n "impl.*Drop\|fn drop\|fn shutdown" -A 10 d-engine-core/src/watch/manager.rs

Length of output: 0



---

🏁 Script executed:

#!/bin/bash
# Get the full new() method to see channel creation
rg -n "pub fn new" -A 30 d-engine-core/src/watch/manager.rs | head -50

Length of output: 1016


🏁 Script executed:

#!/bin/bash
# Look for any JoinHandle storage
rg -n "JoinHandle" d-engine-core/src/watch/

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for the channel type and any capacity/bounded info
rg -n "mpsc\|channel\|Sender\|Receiver" d-engine-core/src/watch/manager.rs | head -20

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check the test_multiple_start_stop test more carefully
sed -n '275,288p' d-engine-core/src/watch/manager_test.rs

Length of output: 380


🏁 Script executed:

#!/bin/bash
# Search for Drop implementation or cleanup in the file
cat d-engine-core/src/watch/manager.rs | tail -100

Length of output: 2698


Thread lifecycle has genuine restart and blocking issues

The review correctly identifies two real problems:

  1. Thread accumulation on restart cycles: Each start() called after stop() spawns a new dispatcher thread, because running.swap(true) returns false once stop() has reset the flag. The previous dispatcher thread remains blocked in receiver.recv() indefinitely, since the channel is never closed between cycles. Repeated start/stop/start patterns therefore accumulate sleeping threads.

  2. Indefinite blocking on stop(): The dispatcher loop only checks the running flag after recv() returns. If stop() is called when no events are pending and the channel remains open, the thread blocks indefinitely waiting for the next event, never observing that running is now false.

The channel (bounded() or unbounded()) is created once and never recreated, persisting across cycles with no closure mechanism. There is no JoinHandle storage, so threads cannot be tracked or joined. The test suite includes test_start_stop_idempotent which calls start() multiple times then stop() multiple times, but it does not test the problematic start()→stop()→start() restart scenario that triggers the accumulation issue.

For use cases requiring manager restart (e.g., graceful server restart, test fixtures), this creates observable thread leaks and unpredictable dispatch behavior.

🤖 Prompt for AI Agents
In d-engine-core/src/watch/manager.rs around lines 268 to 304, the dispatcher
thread accumulates and can block forever because start() spawns a new thread
each restart while stop() merely flips the running flag and never unblocks or
joins the existing thread or closes/recreates the channel; fix this by storing
the spawned thread's JoinHandle (Option<JoinHandle<()>>) on the manager, update
start() to be a no-op if already running and to recreate the channel/receiver if
it was closed, and change the loop to use recv_timeout (or try_recv with a short
sleep) so it regularly checks the running flag; implement stop() to set
running=false, send a shutdown sentinel or close the sender to wake the
receiver, and join the stored handle (clearing it) to ensure the previous thread
exits before allowing a new start.

Comment on lines +55 to +65
/// Helper to register multiple watchers
async fn register_watchers(
    manager: &WatchManager,
    count: usize,
    key_prefix: &str,
) {
    for i in 0..count {
        let key = format!("{key_prefix}{i}");
        let _ = manager.register(key.into()).await;
    }
}

⚠️ Potential issue | 🟠 Major

Apply-with-watchers benchmarks currently run with zero active watchers

register_watchers drops each WatcherHandle immediately:

for i in 0..count {
    let key = format!("{key_prefix}{i}");
    let _ = manager.register(key.into()).await; // handle dropped here
}

Since WatcherHandle cleanup runs on drop, by the time you call notify_put / notify_delete in the benchmarks there are no watchers registered. That means:

  • bench_apply_with_1_watcher
  • bench_apply_with_10_watchers
  • bench_apply_with_100_watchers

are effectively measuring the same “no-watchers” overhead as bench_apply_without_watch.

To actually benchmark apply-path cost with active watchers, keep the handles alive for the duration of each iteration, e.g.:

-use d_engine_core::watch::WatchManager;
+use d_engine_core::watch::{WatchManager, WatcherHandle};

-/// Helper to register multiple watchers
-async fn register_watchers(
-    manager: &WatchManager,
-    count: usize,
-    key_prefix: &str,
-) {
-    for i in 0..count {
-        let key = format!("{key_prefix}{i}");
-        let _ = manager.register(key.into()).await;
-    }
-}
+/// Helper to register multiple watchers and keep them alive
+async fn register_watchers(
+    manager: &WatchManager,
+    count: usize,
+    key_prefix: &str,
+) -> Vec<WatcherHandle> {
+    let mut handles = Vec::with_capacity(count);
+    for i in 0..count {
+        let key = format!("{key_prefix}{i}");
+        handles.push(manager.register(key.into()).await);
+    }
+    handles
+}

And in the benches:

-            // Register 1 watcher
-            register_watchers(&watch_manager, 1, "key_").await;
+            // Register 1 watcher and keep it alive
+            let _watchers = register_watchers(&watch_manager, 1, "key_").await;

(similarly for 10 and 100 watchers). Keeping _watchers in scope preserves the intended watcher counts while still avoiding unused-variable warnings.

Also applies to: 294-332, 337-373, 379-415

🤖 Prompt for AI Agents
In d-engine-server/benches/state_machine.rs around lines 55-65 (and also apply
same change to ranges 294-332, 337-373, 379-415), the helper currently drops
each WatcherHandle immediately so benchmarks run with zero active watchers;
change register_watchers to collect and return (or keep) the WatcherHandle
instances in a Vec (e.g. Vec<WatcherHandle>) and keep that Vec in scope for the
duration of the benchmark iteration (name it with a leading underscore like
_watchers to avoid unused-variable warnings) so the handles are not dropped
before notify_put/notify_delete are called. Ensure callers use or bind the
returned Vec (or hold the local Vec) for the lifetime of each bench iteration.

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Add explicit `option go_package` to all Protocol Buffer definitions to enable
proper Go code generation with correct import paths. This change establishes
d-engine-proto as a language-agnostic protobuf distribution that supports
Go clients and future language implementations.

Changes:
- Add `option go_package = "github.com/deventlab/d-engine/proto/{package}"`
  to all 7 proto files (common, error, client, server/*)
- Use organization-scoped package path (deventlab) instead of personal account
  to ensure stability and transferability of d-engine as open source project
- Add .gitignore to d-engine-proto to prevent accidental commit of generated code

Benefits:
- Enables Go gRPC client generation with correct import paths
- Establishes clear, organization-owned package naming for open source
- Supports future multi-language client support (Python, Rust, Java, etc.)
- Prevents generated code from polluting the source repository

Documentation:
- Added d-engine-docs/src/docs/client_guide/go-client.md with comprehensive
  Go client integration guide covering proto generation, best practices,
  error handling, and common patterns

Related: #170
Move ClientResponseExt tests from client_ext.rs to client_ext_test.rs
following project convention (howto.md #22) that requires unit tests
in separate files rather than inline with source code.
- Fix memory leak in WatcherHandle::into_receiver by using Option pattern
  instead of mem::forget, eliminating cloned resource leaks
- Fix race condition in Drop implementations using DashMap's remove_if_mut
  for atomic unregistration operations
- Fix thread lifecycle issues by adding JoinHandle tracking and shutdown
  channel, enabling clean start/stop/start cycles
- Fix benchmark watcher leaks by returning Vec<WatcherHandle> to keep
  watchers registered during measurements
- Add comprehensive unit tests covering all fixes with concurrent scenarios
@JoshuaChi JoshuaChi merged commit 56f1410 into develop Nov 25, 2025
3 of 4 checks passed
@JoshuaChi JoshuaChi deleted the feature/174_watch branch December 5, 2025 02:01
@coderabbitai coderabbitai bot mentioned this pull request Dec 7, 2025
4 tasks