Jeff/fix/native threading by jeff-hykin · Pull Request #1663 · dimensionalOS/dimos

jeff-hykin · 2026-03-25T07:03:34Z

Problem

Flaky test, and a history of flaky tests surrounding threads in modules.

Solution

Smart thread tooling with auto-cleanup to reduce the risk of bad cleanup and reduce bloat.

Breaking Changes

None, did more testing than usual because this touches core.

How to Test

# Thread utility tests (deadlocks, races, stress)
python -m pytest dimos/utils/test_thread_utils.py -v --noconftest

# NativeModule + MCP server tests
python -m pytest dimos/core/test_native_module.py dimos/agents/mcp/ -v --noconftest

# Full suite
python -m pytest --timeout=120 -q --ignore=dimos/perception/detection/type

Contributor License Agreement

I have read and approved the CLA.

- Add mod.stop() to test_process_crash_triggers_stop so watchdog, LCM, and event-loop threads are properly joined from the test thread
- Filter third-party daemon threads with generic names (Thread-\d+) in conftest monitor_threads to ignore torch/HF background threads that have no cleanup API

Convert test_process_crash_triggers_stop to use a fixture that calls mod.stop() in teardown. The watchdog thread calls self.stop() but can't join itself, so an explicit stop() from the test thread is needed to properly clean up all threads. Drop the broad conftest regex filter for generic daemon thread names per review feedback.

mod.stop() is a no-op when the watchdog already called it, so capture thread IDs before the test and join new ones in teardown.

greptile-apps · 2026-03-25T07:07:25Z

Greptile Summary

This PR introduces a suite of smart, auto-cleaning thread primitives (ThreadSafeVal, ModuleThread, AsyncModuleThread, ModuleProcess, safe_thread_map) in dimos/utils/thread_utils.py and wires them into Module, NativeModule, and McpServer to eliminate the root cause of recurring flaky tests: threads and processes that were not deterministically cleaned up, and a specific self-join deadlock where a watchdog thread's on_exit callback triggered module teardown that then tried to join the watchdog from within itself.

Key changes and their rationale:

ThreadSafeVal: uses RLock instead of Lock, which is critical — the with block + inner set()/get() pattern (used throughout module.py for mod_state) would deadlock with a plain Lock.
ModuleThread.stop(): skips join() when called from the managed thread itself (self._thread is not threading.current_thread()), fixing the self-join deadlock in the watchdog → on_exit → module.dispose() chain.
AsyncModuleThread: consolidates the event-loop-in-a-thread pattern that was duplicated across modules; the base Module now creates one via _async_thread, and McpServer and others reference it directly.
ModuleProcess: replaces ad-hoc subprocess.Popen + manual cleanup with SIGTERM→SIGKILL escalation, structured log piping, and automatic disposable registration.
ModState literal type: replaces stringly-typed state strings with a proper Literal for static analysis benefit.
Test coverage is thorough, including stress tests and explicit deadlock regression tests.
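
To make the RLock point concrete, here is a minimal sketch of a lock-protected value with context-manager support. The class name LockedValue and its methods are stand-ins chosen for illustration, not the PR's ThreadSafeVal implementation; the only assumption carried over from the summary is the "with block + inner set()" usage pattern.

```python
import threading
from typing import Generic, TypeVar

T = TypeVar("T")


class LockedValue(Generic[T]):
    """Hypothetical stand-in for ThreadSafeVal; not the PR's implementation."""

    def __init__(self, value: T) -> None:
        self._value = value
        # RLock lets the thread that already holds the lock acquire it again.
        self._lock = threading.RLock()

    def get(self) -> T:
        with self._lock:
            return self._value

    def set(self, value: T) -> None:
        with self._lock:
            self._value = value

    def __enter__(self) -> T:
        self._lock.acquire()
        return self._value

    def __exit__(self, *exc_info: object) -> None:
        self._lock.release()


state = LockedValue("init")
with state as current:        # first acquisition
    if current == "init":
        state.set("started")  # nested acquisition; a plain Lock would block here
final_state = state.get()
```

Swapping threading.RLock() for threading.Lock() in this sketch makes the inner set() wait on a lock its own thread already holds, which is the deadlock the summary describes.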

Confidence Score: 4/5

Safe to merge; all identified issues are non-blocking style/maintenance concerns in the test file and one dead-code guard in McpServer.
The core thread primitives are well-designed and the key deadlock fix (self-join prevention via current_thread check) is correct. ThreadSafeVal correctly uses RLock to allow nested acquisitions. Test coverage is comprehensive with explicit stress and deadlock regression tests. The only concerns are: a missing Callable import and a bottom-of-file ExceptionGroup import in the test file (both harmless at runtime due to from __future__ import annotations and late evaluation), a dead loop is not None guard in McpServer, and the accumulation of watchdog disposables on repeated ModuleProcess.start() calls. None of these affect production correctness.
dimos/utils/test_thread_utils.py (import ordering), dimos/agents/mcp/mcp_server.py (dead guard), dimos/utils/thread_utils.py (watchdog disposable accumulation)

Important Files Changed

Filename	Overview
dimos/utils/thread_utils.py	New core utility module introducing ThreadSafeVal (RLock-backed atomic wrapper), ModuleThread (managed thread with auto-disposable cleanup), AsyncModuleThread (managed async event loop in a daemon thread), ModuleProcess (managed subprocess with watchdog/log piping), and safe_thread_map (parallel map that waits for all tasks before raising). The self-join deadlock fix in ModuleThread.stop() is the key correctness change. Minor: each ModuleProcess.start() call appends a new watchdog disposable without removing the old one.
dimos/core/module.py	Replaces inline threading/asyncio boilerplate with AsyncModuleThread and ThreadSafeVal from thread_utils. Introduces ModState Literal type for type-safe state transitions. The _stop() method now relies on the disposable system to stop the async thread rather than explicit teardown, which is cleaner and consistent.
dimos/core/native_module.py	Replaces ad-hoc subprocess management with ModuleProcess, gaining automatic SIGTERM→SIGKILL escalation, watchdog crash detection, log piping, and disposable-based cleanup at no added complexity.
dimos/agents/mcp/mcp_server.py	Updated to use self._async_thread.loop (inherited from Module base) instead of the previously stored self._loop. The stop() method now has a dead loop is not None guard since AsyncModuleThread.loop always returns a non-None value.
dimos/utils/test_thread_utils.py	Comprehensive test suite covering deadlocks, races, idempotency, and stress scenarios for all new utilities. Two minor issues: ExceptionGroup is imported at line 888 (bottom) but used earlier in the file, and Callable is used as a return-type annotation on line 820 without being imported.
dimos/utils/typing_utils.py	New file providing Python-version compatibility shims for ExceptionGroup (polyfill for 3.10) and TypeVar (from typing_extensions on <3.13). Clean and minimal.
dimos/core/test_native_module.py	New test covering the watchdog crash→stop() path and blueprint autoconnect wiring with the refactored NativeModule.
dimos/core/test_core.py	Minor update to reflect new RPC method count after the ModState-related additions; otherwise unchanged.
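
The SIGTERM→SIGKILL escalation credited to ModuleProcess in the table can be sketched with the standard library alone. The helper name stop_process and the 2-second timeout are assumptions for illustration; the PR's actual timeouts and log piping are not shown here.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, timeout: float = 2.0) -> int:
    """Ask politely with SIGTERM, then force with SIGKILL if the timeout expires."""
    if proc.poll() is not None:  # already exited, nothing to do
        return proc.returncode
    proc.terminate()             # SIGTERM on POSIX
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()              # SIGKILL on POSIX
        return proc.wait()


# A child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(child)
```

The first branch makes the helper safe to call on a process that has already exited, which matters when a watchdog and a teardown path can both reach it.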

Sequence Diagram

sequenceDiagram
    participant Main as Main Thread
    participant MT as ModuleThread
    participant AT as AsyncModuleThread
    participant MP as ModuleProcess
    participant WD as Watchdog Thread
    participant Disp as CompositeDisposable

    Main->>AT: __init__(module)
    AT->>AT: new_event_loop()
    AT->>AT: thread.start() [runs loop.run_forever]
    AT->>Disp: add(Disposable(self.stop))

    Main->>MP: __init__(module)
    MP->>Disp: add(Disposable(self.stop))
    MP->>MP: subprocess.Popen(...)
    MP->>MT: ModuleThread(module, target=_watch)
    MT->>Disp: add(Disposable(self.stop))
    MT->>WD: thread.start()

    Note over WD: process exits naturally
    WD->>WD: proc.wait() returns
    WD->>WD: _stopped? No → call on_exit()
    WD->>Main: on_exit() → module.stop()
    Main->>Disp: dispose()
    Disp->>MP: stop() → _stopped=True, process=None
    Disp->>MT: stop() → _stop_event.set()
    MT->>MT: current_thread == watchdog? → skip join()
    Note over MT: No deadlock ✓

    Note over Main: Normal teardown path
    Main->>Disp: dispose()
    Disp->>AT: stop() → loop.call_soon_threadsafe(loop.stop)
    Disp->>MP: stop() → SIGTERM → wait → SIGKILL if needed
    Disp->>MT: stop() → join(timeout)
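
The "skip join()" step in the diagram can be sketched as follows. ManagedThread is a hypothetical miniature, not the PR's ModuleThread; it only illustrates the current_thread check. In CPython, Thread.join() called on the current thread raises RuntimeError, so the guard is needed whenever stop() can be reached from the managed thread itself.

```python
import threading


class ManagedThread:
    """Hypothetical miniature of a managed thread; names are illustrative."""

    def __init__(self, target) -> None:
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=target, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self, timeout: float = 2.0) -> bool:
        """Signal the thread; join only when called from a different thread.

        Returns True if a join was attempted.
        """
        self._stop_event.set()
        if self._thread is threading.current_thread():
            return False  # joining yourself is an error, so skip it
        self._thread.join(timeout=timeout)
        return True


joined_from_inside = []


def watchdog() -> None:
    # Simulates on_exit() -> module.stop() running on the watchdog thread itself.
    joined_from_inside.append(worker.stop())


worker = ManagedThread(watchdog)
worker.start()
joined_from_outside = worker.stop()  # called from the main thread, so it joins
```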

Reviews (1): Last reviewed commit: "misc improve"

greptile-apps · 2026-03-25T07:07:28Z

dimos/utils/test_thread_utils.py

+                assert done.wait(timeout=10), "Deadlock with slow ModuleThread.stop()"
+
+
+from dimos.utils.typing_utils import ExceptionGroup


ExceptionGroup imported at bottom of file, used earlier

ExceptionGroup is imported on line 888 but first used on line 750 inside TestSafeThreadMap methods. This works at runtime because the full module is loaded before any test runs, but it's confusing to readers: the symbol appears to be undefined at its use sites, and linters or static analysis tools may flag these uses as undefined names. The import should be moved to the top-level imports block alongside the other dimos.utils imports.

Suggested change

from dimos.utils.typing_utils import ExceptionGroup

(Move this to the top of the file alongside the other dimos.utils imports, and remove line 888.)

greptile-apps · 2026-03-25T07:07:30Z

dimos/utils/test_thread_utils.py

+    """
+
+    @staticmethod
+    def _make_fake_stop(mod: FakeModule, done: threading.Event) -> Callable:


Missing Callable import used in return-type annotation

Callable is referenced as a return-type annotation in _make_fake_stop but is never imported in this file. With from __future__ import annotations in effect, the annotation is stored as a string at definition time and won't raise a NameError at runtime. However, any call to typing.get_type_hints(_make_fake_stop) — including some test introspection tools — will fail with NameError: name 'Callable' is not defined.

Add to the imports at the top of the file:

from collections.abc import Callable
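
A small reproduction of the behaviour described above. The sketch uses a quoted annotation, which is stored as a string just like annotations under from __future__ import annotations; the function name make_callback is made up for the example.

```python
import typing


# The quoted annotation is stored as a plain string, as it would be under
# `from __future__ import annotations`. Callable is deliberately NOT imported.
def make_callback() -> "Callable":
    return lambda: None


# Defining and calling the function works: nothing evaluates the string.
callback = make_callback()
stored_annotation = make_callback.__annotations__["return"]

# Resolving the annotation is where the missing import surfaces.
try:
    typing.get_type_hints(make_callback)
    resolved = True
except NameError:
    resolved = False
```

Adding the suggested import makes get_type_hints succeed without changing anything else.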

dimos/agents/mcp/mcp_server.py

greptile-apps · 2026-03-25T07:07:31Z

dimos/utils/thread_utils.py

+        self._watchdog = ModuleThread(
+            module=self._module,
+            target=self._watch,
+            name=f"proc-{self._process.pid}-watchdog",
+        )


Each ModuleProcess.start() call adds a new ModuleThread disposable

Every time start() is called (line 388), a new ModuleThread is constructed for the watchdog. ModuleThread.__init__ immediately registers a Disposable(self.stop) in module._disposables (line 155). CompositeDisposable simply appends, so restarting the process accumulates stale disposables for watchdog threads that have already exited.

For the single-use lifecycle this is fine. But if start() is ever called more than once (e.g. after a failed first attempt, or the deferred-start path), the module's disposable list grows unboundedly, and on teardown each old watchdog's stop() is called even though it already finished, which — while idempotent — is surprising and hard to debug.

Consider either:

Explicitly removing the old watchdog disposable before creating a new one, or

Documenting clearly that start() is a one-shot operation and raising an error on re-entry.
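
The second option (one-shot start() that raises on re-entry) might look like the sketch below. OneShotStarter and its cleanups list are hypothetical stand-ins for ModuleProcess and the module's disposables, used only to show the guard.

```python
import threading


class OneShotStarter:
    """Hypothetical guard: start() may run at most once per instance."""

    def __init__(self) -> None:
        self._started = False
        self._lock = threading.Lock()
        self.cleanups: list[str] = []  # stand-in for a module's disposables

    def start(self) -> None:
        with self._lock:
            if self._started:
                raise RuntimeError("start() is one-shot; create a new instance instead")
            self._started = True
        # Registered exactly once, so the cleanup list cannot grow on restart.
        self.cleanups.append("watchdog.stop")


proc = OneShotStarter()
proc.start()
try:
    proc.start()
    second_start_failed = False
except RuntimeError:
    second_start_failed = True
```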

super().start() will throw if it's called more than once. We can/should assume start isn't being called multiple times AFAIK.

__setstate__ and __getstate__ are different though; start could be called after __setstate__, I believe

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

jeff-hykin · 2026-03-25T07:16:19Z

dimos/core/module.py

+        with self.mod_state as state:
+            if state == "stopped":
+                raise RuntimeError(f"{type(self).__name__} cannot be restarted after stop")
+            self.mod_state.set("started")


I know lots of modules don't call super().start(), but they also wouldn't be using mod_state because it's a new thing.

Different/off-topic discussion, but I think core2 should have ModuleBase as a class decorator instead of an inherited class (we can basically wrap methods instead of saying "please remember to call super").

jeff-hykin · 2026-03-25T07:27:17Z

dimos/core/module.py

-        loop = getattr(self, "_loop", None)
+        # dispose of things BEFORE making aspects like rpc and _tf invalid
+        if hasattr(self, "_disposables"):
+            self._disposables.dispose()  # stops _async_thread via disposable


I think it's important to move disposables up before the rpc stop and the tf stop

jeff-hykin · 2026-03-25T07:29:26Z

dimos/agents/mcp/mcp_server.py

        if self._uvicorn_server:
            self._uvicorn_server.should_exit = True
-            loop = self._loop
-            if loop is not None and self._serve_future is not None:


the loop is always there until super().stop() is called

jeff-hykin · 2026-03-25T07:32:56Z

dimos/agents/mcp/mcp_server.py

        server = uvicorn.Server(config)
        self._uvicorn_server = server
-        loop = self._loop
-        assert loop is not None


loop always there until stop is called

jeff-hykin · 2026-03-25T07:33:26Z

dimos/agents/mcp/test_mcp_server.py

+        return s.getsockname()[1]
+
+
+def test_mcp_server_lifecycle() -> None:


jeff-hykin · 2026-03-25T07:34:57Z

dimos/core/test_core.py

    assert hasattr(class_rpcs["start"], "__rpc__"), "start should have __rpc__ attribute"

-    nav._close_module()
+    nav._stop()


I'm trying to consolidate our naming to be "stop" instead of half "stop" half "close"

jeff-hykin · 2026-03-25T07:37:07Z

dimos/utils/thread_utils.py

+# ThreadSafeVal: a lock-protected value with context-manager support
+
+
+class ThreadSafeVal(Generic[T]):


this is my favorite util. I hate having _thing and _thing_lock and _thing2 and _thing2_lock, but I also hate seeing _thing being used in a method and thinking "hmm ... does _thing have a lock that's not being used?". This prevents ambiguity about which vals need locks and which vals don't

jeff-hykin · 2026-03-25T07:39:49Z

dimos/utils/thread_utils.py

+        self._thread.start()
+
+    def stop(self) -> None:
+        """Signal the thread to stop and join it.


this is probably the part that needs the most review

jeff-hykin · 2026-03-25T07:42:16Z

dimos/utils/thread_utils.py

+# safe_thread_map: parallel map that collects all results before raising
+
+
+def safe_thread_map(


Not used in this PR, but is used by the docker branch so getting it in here a bit early cause this is the util file it belongs in
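
For readers who have not seen the pattern, here is a rough sketch of "wait for every task, then raise". It is not the PR's safe_thread_map: map_wait_all and CollectedErrors are invented names, and the aggregate error is a plain custom exception rather than ExceptionGroup so the example runs on older interpreters.

```python
from concurrent.futures import ThreadPoolExecutor


class CollectedErrors(Exception):
    """Hypothetical aggregate error; the PR itself uses ExceptionGroup."""

    def __init__(self, errors: list) -> None:
        super().__init__(f"{len(errors)} task(s) failed")
        self.errors = errors


def map_wait_all(fn, items):
    """Run fn over items in threads; raise only after every task has finished."""
    items = list(items)
    if not items:
        return []
    with ThreadPoolExecutor(max_workers=len(items)) as pool:
        futures = [pool.submit(fn, item) for item in items]
    # Leaving the with-block waits for all futures, so none is still running.
    results, errors = [], []
    for future in futures:
        error = future.exception()
        if error is None:
            results.append(future.result())
        else:
            errors.append(error)
    if errors:
        raise CollectedErrors(errors)
    return results


def flaky(n: int) -> int:
    if n % 2:
        raise ValueError(f"odd input {n}")
    return n * 10


all_ok = map_wait_all(flaky, [0, 2, 4])
try:
    map_wait_all(flaky, [1, 2, 3])
    failure_count = 0
except CollectedErrors as exc:
    failure_count = len(exc.errors)
```

The point of collecting first is that no worker is left running in the background when the caller sees the exception.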

jeff-hykin · 2026-03-25T07:43:04Z

dimos/utils/typing_utils.py

+
+if sys.version_info < (3, 11):
+
+    class ExceptionGroup(Exception):  # type: ignore[no-redef]  # noqa: N818


I didn't want to repeat all this kludge so I put it here. Let me know if there's a better spot
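
The diff above is truncated, so as a rough idea of what a version-gated shim of this kind can look like, here is a sketch under assumptions: CompatExceptionGroup is a made-up name, and the fallback class only stores a message and the wrapped exceptions rather than reproducing the full builtin behaviour.

```python
import sys

if sys.version_info >= (3, 11):
    CompatExceptionGroup = ExceptionGroup  # the builtin, available from 3.11
else:

    class CompatExceptionGroup(Exception):
        """Minimal stand-in: stores a message and the wrapped exceptions."""

        def __init__(self, message: str, exceptions) -> None:
            super().__init__(message)
            self.message = message
            self.exceptions = tuple(exceptions)


group = CompatExceptionGroup("two failures", [ValueError("a"), KeyError("b")])
wrapped_types = [type(e).__name__ for e in group.exceptions]
```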

paul-nechifor · 2026-03-26T03:03:46Z

dimos/utils/thread_utils.py

+        if self._thread.is_alive() and self._thread is not threading.current_thread():
+            self._thread.join(timeout=self._close_timeout)
+
+    def join(self, timeout: float | None = None) -> None:


I don't think you need join since you're already join()-ing in stop.

paul-nechifor · 2026-03-26T03:04:26Z

dimos/utils/thread_utils.py

+        self._stopped = False
+        self._stop_lock = threading.Lock()


Why do you need _stopped and _stop_lock? You have _stop_event.

paul-nechifor · 2026-03-26T03:24:46Z

dimos/utils/thread_utils.py

+
+    def start(self) -> None:
+        """Start the underlying thread."""
+        self._stop_event.clear()


You don't need this. It's already off. If you want ModuleThread to be restartable, then you need to use another thread since threads aren't restartable.
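
The restartability point is easy to confirm with the standard library: a threading.Thread object can be started at most once, and a second start() raises RuntimeError.

```python
import threading

worker = threading.Thread(target=lambda: None)
worker.start()
worker.join()

try:
    worker.start()  # a Thread object can be started at most once
    restarted = True
except RuntimeError:
    restarted = False

# To run the target again, build a fresh Thread object.
replacement = threading.Thread(target=lambda: None)
replacement.start()
replacement.join()
```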

paul-nechifor · 2026-03-26T03:27:24Z

dimos/utils/thread_utils.py

+        if start:
+            self.start()


Noooooo, don't autostart in the constructor. 😭

😈 no boilerplate

But fr, how do you feel about ModuleThread().start()

paul-nechifor · 2026-03-26T03:29:22Z

dimos/utils/thread_utils.py

+                self._worker = ModuleThread(
+                    module=self,
+                    target=self._run_loop,
+                    name="my-worker",


It would be nice if ModuleThread used self.module.__class__.__name__ as the prefix so we can just leave name blank most of the time and it still produces a useful name for debugging.

paul-nechifor · 2026-03-26T03:29:45Z

dimos/utils/thread_utils.py

+        return f"ThreadSafeVal({self._value!r})"
+
+
+# ModuleThread: a thread that auto-registers with a module's disposables


Why add this if there's a docstring below?

cause AI loves redundancy
(I'll remove it, thanks for bringing attention)

paul-nechifor · 2026-03-26T04:09:13Z

dimos/core/module.py

-    def _close_module(self) -> None:
-        with self._module_closed_lock:
-            if self._module_closed:
+    def _stop(self) -> None:


_close_module is a remnant from when the Module class hierarchy was more complicated. Some classes were skipping Module.__init__ and didn't initialize self._disposables, for example. That's why I'm using hasattr(self, "_disposables") or hasattr(self, "_tf"). We didn't even have stop then.

I think it's not needed at all anymore. This could be deleted if you want and moved into def stop.

happily! I thought it was an rpc vs non-rpc thing

jeff-hykin · 2026-03-26T04:48:46Z

dimos/utils/thread_utils.py

+                self._worker = ModuleThread(
+                    module=self,
+                    target=self._run_loop,
+                    name="my-worker",


Suggested change

name="my-worker",

name=self.module.__class__.__name__+"_my_worker",

- Merge _stop() into stop() in ModuleBase (removes unnecessary indirection)
- Update all callers of _stop() to use stop() directly
- Add thread_start() convenience function that creates + starts a ModuleThread

AsyncModuleThread no longer spawns the event loop thread in __init__. The loop is created on the first call to start(), which ModuleBase.start() now calls. This means module construction no longer has side effects — no threads are spawned until the module is explicitly started.
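
A sketch of the lazy-start idea described in this commit message, assuming nothing about the real AsyncModuleThread beyond "no thread until start()". LazyLoopThread, run(), and the timeouts are hypothetical.

```python
import asyncio
import threading


class LazyLoopThread:
    """Hypothetical sketch: no thread exists until start() is called."""

    def __init__(self) -> None:
        self._loop = None
        self._thread = None

    def start(self) -> None:
        if self._thread is not None:
            return
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout: float = 5.0):
        """Schedule a coroutine on the loop thread and wait for its result."""
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result(timeout)

    def stop(self) -> None:
        if self._thread is None:
            return
        # loop.stop() must be requested from the loop's own thread.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5.0)
        self._loop.close()
        self._thread = None


async def add(a: int, b: int) -> int:
    await asyncio.sleep(0)
    return a + b


runner = LazyLoopThread()
spawned_on_init = runner._thread is not None  # construction spawned nothing
runner.start()
total = runner.run(add(2, 3))
runner.stop()
```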

SUMMERxYANG and others added 16 commits March 12, 2026 16:05

CI code cleanup

6bd7ad6

CI code cleanup

e316626

chore: retrigger CI

f13b2b3

fix(test): join threads directly in crash_module fixture

3197ad3

CI code cleanup

43d5434

fix(native_module): preserve watchdog reference so second stop() can join it

1ff8769

minimal fix

c202c57

fully ideal approach, untested

fe787bb

ideal approach, not tested

8ae6282

improve tests

b8fcd08

Merge branch 'jeff/fix/native_threading' of github.com:dimensionalOS/dimos into jeff/fix/native_threading

d930cb3

Merge branch 'dev' of github.com:dimensionalOS/dimos into jeff/fix/native_threading

fb8b40b

formatting

1d06db9

misc improve

75708ff

jeff-hykin marked this pull request as draft March 25, 2026 07:03

greptile-apps bot reviewed Mar 25, 2026

jeff-hykin and others added 4 commits March 25, 2026 00:09

Apply suggestions from code review

aff62bc

cleanup

d5ae028

Merge branch 'jeff/fix/native_threading' of github.com:dimensionalOS/dimos into jeff/fix/native_threading

66eef1f

-

4595390

jeff-hykin commented Mar 25, 2026

fix order of _disposables

3b5c4fd

jeff-hykin commented Mar 25, 2026

dimos/agents/mcp/test_mcp_server.py

return s.getsockname()[1]

def test_mcp_server_lifecycle() -> None:

jeff-hykin Mar 25, 2026

new test

jeff-hykin commented Mar 25, 2026

paul-nechifor reviewed Mar 26, 2026

jeff-hykin commented Mar 26, 2026

jeff-hykin added 3 commits March 25, 2026 23:19

pr feedback

f9b6d04

refactor: merge _stop into stop, add thread_start helper

e5d739f
