Rust warn on dropped LCM packets#2200
Conversation
The per-input mpsc queue between the LCM recv loop and the user's handler had capacity 16. `TypedRoute::try_dispatch` was using `try_send` and silently dropping new messages when full. At 10 Hz publish + any handler slower than ~6 Hz, the queue filled in 1.6 s and every subsequent message was lost without any signal. This was the actual cause of the ~50 % "LCM drops" observed during the pgo_rust KITTI-360 benchmark — the kernel reported zero drops the whole time (UDP RcvbufErrors=0, lo drop=0), and a separate burst characterization confirmed pure LCM can do 42 MB/s @ 89 Hz at 0% loss. The drops were here, in dimos-module. Two changes: 1. INPUT_CHANNEL_CAPACITY 16 → 1024 (and PUBLISH_CHANNEL_CAPACITY 64 → 1024 for symmetry). At 10 Hz publish, 1024 gives ~100 s of headroom for handler stalls. Memory cost is bounded by the message types (rust mpsc only allocates entries as needed). 2. Per-route AtomicU64 `drops` counter on TypedRoute. Increments on `TrySendError::Full`. Logged via rate-limited eprintln at power-of-2 milestones (1, 2, 4, 8, 16, …). First drop fires immediately so operators see the first warning; subsequent frequency decays exponentially so a runaway handler doesn't spam stderr while still surfacing the failure. 4 new unit tests cover the behavior: - typed_route_drops_count_when_queue_full: 100 sent, capacity 4, expect exactly 96 drops. - typed_route_drops_counter_starts_at_zero: first sends under capacity must not increment drops. - typed_route_does_not_drop_when_queue_has_headroom: 1000 sent, capacity 2048, drops must be 0 (guards against the rate-limited log firing on the happy path). - typed_route_rate_limited_log_fires_on_power_of_two: 130 sent, capacity 1, expect 129 drops crossing the milestones 1, 2, 4, 8, 16, 32, 64, 128 — operator sees 8 log lines. All 22 dimos-module tests pass. Clippy clean.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Replaces the power-of-2 milestone trigger with a wall-clock throttle. New constant MAX_ERROR_LOG_RATE caps drop-warning log lines per route at N/sec (currently 1). The drops counter still increments on every dropped message; only the stderr line is throttled. Implementation: last_log: Mutex<Option<Instant>> on TypedRoute. First drop after a quiet period fires immediately; subsequent drops within 1/MAX_ERROR_LOG_RATE seconds are silent. 22/22 dimos-module tests pass.
Reduced the back-pressure test suite to a single test (typed_route_logs_error_on_drop) that fills the queue, forces one drop, and verifies (a) the counter increments and (b) last_log is set (proxying for "log fired"). Dropped the three other tests as redundant with the single-message case. Renamed the field `drops` → `drop_count` to read more naturally. 19/19 dimos-module tests pass.
Greptile SummaryThis PR adds a rate-limited drop warning to
Confidence Score: 4/5Safe to merge; the logic change is isolated to warning output and does not affect message delivery or control flow. The functional change — emitting a warning on dropped messages — is correct and well-tested for the happy path. The two findings are about warning message clarity and test coverage for rate limiting; neither affects message delivery or system behaviour. The larger channel capacities are a straightforward tuning change with no downstream breakage risk. native/rust/dimos-module/src/module.rs — the drop warning message and its test are the only areas worth a second look. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Transport recv loop] -->|channel, data| B{Route lookup}
B -->|No route| C[Ignore]
B -->|Route found| D[TypedRoute::try_dispatch]
D --> E{decode bytes}
E -->|Err| F[eprintln decode error]
E -->|Ok msg| G{sender.try_send}
G -->|Ok| H[Message delivered to Input]
G -->|Closed| I[Silently ignore]
G -->|Full - backpressure| J[drop_count.fetch_add]
J --> K{rate-limit check\nlast_log within 1s?}
K -->|Yes, suppress| L[Silent drop]
K -->|No, emit| M[eprintln warning\ndrop_count + topic + cap]
M --> N[Update last_log]
Reviews (1): Last reviewed commit: "dimos-module: simplify tests + rename dr..." | Re-trigger Greptile |
Backpressure from heavy faster-than-real-time lidar benchmarks is kind of evitable. Right now its a silent failure.
This includes a warning message, and some higher default rust buffer sizes to handle more throughput to match.
Tests added to make sure rust native modules aren't bottlenecking LCM.
Run the tests: