Skip to content

fix(tests): stop LCM thread leak in cross-wall planning tests#2068

Merged
jeff-hykin merged 10 commits into
mainfrom
jeff/fix/cross-wall-thread-leak
May 13, 2026
Merged

fix(tests): stop LCM thread leak in cross-wall planning tests#2068
jeff-hykin merged 10 commits into
mainfrom
jeff/fix/cross-wall-thread-leak

Conversation

@jeff-hykin
Copy link
Copy Markdown
Member

@jeff-hykin jeff-hykin commented May 13, 2026

Problem

pytest-slow failing

Solution

async instead of threads

  • I have read and approved the CLA.

`StatsMonitor` (enabled by `global_config(dtop=True)`) created an
`LCMResourceLogger` whose `pLCMTransport` spun up an LCM handler
thread that was never stopped, so `monitor_threads` reported a
leaked `_lcm_loop` thread after each cross-wall test.

- Add `stop()` to the `ResourceLogger` protocol; `LCMResourceLogger`
  shuts its transport down, `StructlogResourceLogger` no-ops.
- `StatsMonitor.stop()` now also calls `self._logger.stop()`.
- Rewrite `run_cross_wall_test` to drive LCM via
  `loop.add_reader(lcm.fileno(), ...)` and `asyncio.sleep` instead
  of a background thread, removing the test-local thread plumbing.
`StatsMonitor.stop()` now calls `self._logger.stop()`. The test's
inline `CapturingLogger` didn't implement it, breaking
`test_collect_stats` with `AttributeError`. Add a no-op `stop()`.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR fixes LCM thread leaks in cross-wall planning tests by replacing a daemon thread polling loop with asyncio.add_reader, and wires LCMResourceLogger.stop() into StatsMonitor.stop() to prevent the transport thread from surviving after the coordinator shuts down.

  • conftest.py: The _lcm_loop thread is replaced with loop.add_reader(lcm_fd, _on_lcm_readable), removing the need for a lock on shared odometry state; run_cross_wall_test drives the async function via asyncio.run(), and --numprocesses=auto is dropped from bin/pytest-slow so tests no longer interfere with each other over shared LCM multicast.
  • logger.py / monitor.py: stop() is added to the ResourceLogger Protocol, implemented on both StructlogResourceLogger (no-op) and LCMResourceLogger (delegates to pLCMTransport.stop()), and then called from StatsMonitor.stop() to close the transport thread.
  • bridge.py: visual_override type is corrected to dict[Glob | str, Callable[[Any], Archetype] | None], matching the existing None-means-suppress logic already in _visual_override_for_entity_path.

Confidence Score: 5/5

Safe to merge — the asyncio reader pattern is correct for single-process sequential tests, the transport thread leak is properly closed, and removing parallel process execution eliminates the LCM multicast collision root cause.

All shared mutable state in conftest.py is now accessed exclusively on the asyncio event loop thread, so dropping the lock is correct. The stop() wiring through the ResourceLogger protocol is complete across all implementations.

conftest.py — worth verifying that no future pytest-asyncio configuration introduces a running event loop that would conflict with asyncio.run() in run_cross_wall_test.

Important Files Changed

Filename Overview
dimos/navigation/nav_stack/tests/conftest.py Thread + lock replaced with asyncio event loop reader; locks removed since all odom state is now accessed on the single asyncio thread; cleanup moved to finally block with remove_reader. Logic is sound for sequential tests.
dimos/core/resource_monitor/monitor.py StatsMonitor.stop() now calls self._logger.stop() after joining the collection thread, closing the previously leaking pLCMTransport background thread.
dimos/core/resource_monitor/logger.py stop() added to ResourceLogger Protocol, StructlogResourceLogger (no-op), and LCMResourceLogger (delegates to pLCMTransport.stop()), fixing the transport thread leak.
bin/pytest-slow Removes --numprocesses=auto so slow tests run sequentially, preventing multiple test workers from colliding over shared LCM multicast channels.
dimos/visualization/rerun/bridge.py Type annotation for visual_override corrected to allow None values, matching the existing suppression logic already present in _visual_override_for_entity_path.

Sequence Diagram

sequenceDiagram
    participant Test as run_cross_wall_test
    participant AR as asyncio.run
    participant EL as event loop
    participant LCM as lcm fd reader
    participant Coord as ModuleCoordinator

    Test->>AR: call asyncio.run(_run_cross_wall_test)
    AR->>EL: create and run event loop
    EL->>Coord: ModuleCoordinator.build(blueprint)
    EL->>EL: add_reader(lcm_fd, _on_lcm_readable)
    EL->>EL: await asyncio.sleep (yields control)
    EL->>LCM: fd readable - calls _on_lcm_readable
    LCM->>LCM: lcm.handle - dispatches _odom_handler
    LCM-->>EL: odom_count / robot_x / robot_y updated
    EL->>EL: assert odom received and goals reached
    EL->>EL: finally: remove_reader, lcm.unsubscribe
    EL->>Coord: coordinator.stop
    AR-->>Test: return or propagate AssertionError
Loading

Reviews (6): Last reviewed commit: "-" | Re-trigger Greptile

Comment thread dimos/navigation/nav_stack/tests/conftest.py Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 13, 2026

Codecov Report

❌ Patch coverage is 21.05263% with 30 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
dimos/navigation/nav_stack/tests/conftest.py 9.67% 28 Missing ⚠️
...on/nav_stack/tests/test_cross_wall_planning_far.py 0.00% 1 Missing ⚠️
...nav_stack/tests/test_cross_wall_planning_simple.py 0.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

When `cyclonedds` is installed (via the `dds` or `unitree-dds` extra)
on a box that has ROS 2 Jazzy, importing the C extension fails with
`libddsc.so.0: cannot open shared object file` because the wheel's
loader resolves the SONAME against `/opt/ros/jazzy/lib/libddsc.so`
but the actual `.so.0` lives in the multiarch dir. Document the
LD_LIBRARY_PATH / setup.bash workarounds.
Comment thread dimos/core/resource_monitor/logger.py Outdated
@jeff-hykin jeff-hykin enabled auto-merge (squash) May 13, 2026 05:16
Dreamsorcerer and others added 4 commits May 13, 2026 08:17
Re-applies the cross-wall thread-leak fix that was reverted: without
`StatsMonitor.stop()` calling `self._logger.stop()`, the PickleLCM
thread inside `LCMResourceLogger`'s `pLCMTransport` lives past test
teardown and `monitor_threads` flags it as `_lcm_loop` leaked.

- `ResourceLogger` protocol regains `stop()`.
- `StructlogResourceLogger.stop()` is a no-op; `LCMResourceLogger.stop()`
  stops the transport.
- `StatsMonitor.stop()` calls `self._logger.stop()` after joining its
  loop thread.
- Test's `CapturingLogger` gets a no-op `stop()` to satisfy the protocol.
@jeff-hykin jeff-hykin merged commit 89ada29 into main May 13, 2026
20 checks passed
@jeff-hykin jeff-hykin deleted the jeff/fix/cross-wall-thread-leak branch May 13, 2026 10:43
jeff-hykin added a commit that referenced this pull request May 13, 2026
Merging origin/main pulled in PR #2068's fix for the leaked _lcm_loop
thread on teardown. The simple/far variants of cross-wall planning also
gained pytest.mark.self_hosted in that commit. Add the same marker to
the rtab variant for consistency.

Verified: 1 passed in 68.45s, no teardown error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants