Skip to content

native module refinement#1875

Open
jeff-hykin wants to merge 1 commit intodevfrom
jeff/fix/native_rebuild3
Open

native module refinement#1875
jeff-hykin wants to merge 1 commit intodevfrom
jeff/fix/native_rebuild3

Conversation

@jeff-hykin
Copy link
Copy Markdown
Member

@jeff-hykin jeff-hykin commented Apr 17, 2026

Native module refinement (rebuild logic and logging) and keyboard interrupt improvement

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 17, 2026

Greptile Summary

This PR introduces NativeModule, a Module subclass that manages a native (C/C++) executable as a supervised subprocess with LCM topic wiring, watchdog thread, and graceful shutdown. It also adds signal.SIG_IGN for SIGINT in the Python worker entrypoint so workers don't race the coordinator during keyboard-interrupt shutdown.

  • P1 (native_module.py line 269): The second proc.wait(timeout=...) after proc.kill() has no exception handler — a subprocess.TimeoutExpired here propagates uncaught, leaving self._process/self._watchdog non-null and super().stop() never invoked, permanently breaking the module instance.

Confidence Score: 4/5

Safe to merge after addressing the unhandled TimeoutExpired after SIGKILL in stop().

One P1 defect: if proc.wait() after SIGKILL times out, stop() exits mid-cleanup leaving _process/_watchdog set and super().stop() never called, permanently corrupting module state. The remaining findings are P2 and don't block correctness on the happy path.

dimos/core/native_module.py — specifically the stop() method's SIGKILL fallback path (lines 263–269).

Security Review

  • shell=True in _maybe_build() (native_module.py line 381): build_command is a raw config string passed directly to the shell. If config is ever sourced from untrusted input, this is a command injection vector.

Important Files Changed

Filename Overview
dimos/core/native_module.py New NativeModule class wrapping native executables as managed subprocesses; P1 bug: unhandled TimeoutExpired after SIGKILL in stop() leaves module state inconsistent; also uses shell=True for build_command.
dimos/core/coordination/python_worker.py Adds SIGINT ignore in worker entrypoint for cleaner coordinator-orchestrated shutdown; minor: lock held for up to 5s during shutdown poll, and _suppress_console_output lacks context manager for devnull.
dimos/core/test_native_module.py New test file for NativeModule covering process crash/watchdog behavior and autoconnect blueprint wiring; tests look correct and cover the main code paths.

Sequence Diagram

sequenceDiagram
    participant C as Coordinator
    participant NM as NativeModule
    participant W as Watchdog Thread
    participant P as Native Process

    C->>NM: start()
    NM->>NM: _maybe_build()
    NM->>NM: _collect_topics()
    NM->>P: Popen(cmd, start_new_session=True)
    NM->>W: Thread(_watch_process).start()
    W->>P: proc.wait() [blocks]

    alt Process crashes unexpectedly
        P-->>W: exit(rc)
        W->>NM: stop() [watchdog-triggered]
    else Normal shutdown
        C->>NM: stop()
        NM->>NM: acquire _stop_lock, set _stopping=True
        NM->>P: send_signal(SIGTERM)
        NM->>P: proc.wait(timeout)
        alt SIGTERM timeout
            NM->>P: proc.kill()
            NM->>P: proc.wait(timeout) [unhandled TimeoutExpired bug]
        end
        NM->>W: watchdog.join(timeout)
        NM->>NM: _process=None, _watchdog=None
        NM->>C: super().stop()
    end

    note over NM,P: SIGINT ignored in python_worker entrypoint
    note over NM,P: so coordinator owns shutdown sequencing
Loading

Comments Outside Diff (2)

  1. dimos/core/coordination/python_worker.py, line 274-290 (link)

    P2 Lock held for up to 5 s during conn.poll() in shutdown()

    self._lock is acquired around conn.poll(timeout=5), so any thread still attempting an RPC (e.g., a stale daemon thread that hasn't noticed shutdown) will block for up to 5 seconds waiting for the lock. In practice shutdown should be quiescent, but consider releasing the lock before the poll or using a shorter timeout with a retry.

  2. dimos/core/coordination/python_worker.py, line 304-309 (link)

    P2 File descriptor leak on exception in _suppress_console_output()

    devnull is opened without a context manager; if either os.dup2() call raises, devnull.close() is never reached and the file descriptor leaks until GC. Use with open(...) to guarantee cleanup.

Reviews (1): Last reviewed commit: "native module refinement" | Re-trigger Greptile

Comment on lines 263 to +269
logger.warning(
"Native process did not exit, sending SIGKILL", pid=self._process.pid
"Native process did not exit, sending SIGKILL",
module=self._mod_label,
pid=proc.pid,
)
self._process.kill()
self._process.wait(timeout=5)
if self._watchdog is not None and self._watchdog is not threading.current_thread():
self._watchdog.join(timeout=DEFAULT_THREAD_JOIN_TIMEOUT)
self._watchdog = None
self._process = None
proc.kill()
proc.wait(timeout=self.config.shutdown_timeout)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Unhandled TimeoutExpired after SIGKILL leaves module in inconsistent state

After proc.kill(), the second proc.wait(timeout=self.config.shutdown_timeout) call has no exception handler. If subprocess.TimeoutExpired is raised (e.g., on a heavily-loaded system or in containers where zombie reaping is slow), the exception propagates out of stop(), leaving self._process and self._watchdog non-null and super().stop() never called — breaking any future restart attempt and leaking the watchdog thread.

Comment on lines 379 to 381
proc = subprocess.Popen(
self.config.build_command,
shell=True,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 security shell=True with a config-supplied string is a command-injection risk

build_command comes directly from NativeModuleConfig and is passed to Popen with shell=True. If the config is ever hydrated from an external source (environment variable, remote config store, user input), arbitrary shell commands could be injected. Prefer passing the command as a pre-split list and using shell=False, or at minimum document that build_command must only come from trusted sources.

logger = setup_logger()


class LogFormat(enum.Enum):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this killed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments on the PR I thought were saying to simplify the stdout/err and AFAIK it wasn't used.

If we're going to have JSON vs normal logging shouldn't that be for all modules? It doesn't make sense to me that that would be native module specific

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not native module specific, our py logging system supports rich metadata (dictionaries) attached to each log line, for simplicity for native modules we support just STDOUT/STDERR parsing into strings, but rich json log is our dimos wide default

"""
exe = Path(self.config.executable)
if exe.exists():
is_prod = os.environ.get("PROD")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are introducing magical env var, please use config system and global config

is_prod = os.environ.get("PROD")

if is_prod:
if not exe.exists():
Copy link
Copy Markdown
Contributor

@leshy leshy Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do want autobuild to work in prod on the first run, in case of nix this will let us distribute binaries but intelligently locally build when needed (just like pip does for py libs)


logger.info(
"Executable not found, running build",
"Building native module",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't by default run build system change detection (even to validate file changes) for all native modules all of the time in dev, this is expensive, toggle this per module if you are developing it. if you have 5 native modules, we don't want to run 5 nix instances every time someone is running your blueprint in dev

Comment thread dimos/core/native_module.py
if sys.platform.startswith("linux"):
import ctypes

_LIBC = ctypes.CDLL("libc.so.6", use_errno=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_LIBC = ctypes.CDLL("libc.so.6", use_errno=True)
from ctypes.util import find_library
_LIBC = ctypes.CDLL(find_library("c"), use_errno=True)

Copy link
Copy Markdown
Contributor

@leshy leshy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some changes

from dimos.core.module import Module, ModuleConfig
from dimos.utils.logging_config import setup_logger

# ctypes is only needed for the Linux child-preexec helper below. Hoisting
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is LLM generate code because it adds a lot of unnecessary explanations which obscures that it's actually needed for. This could just as easily be:

from ctypes.util import find_library
import ctypes

def set_process_to_die_when_parent_dies() -> None:
    if not sys.platform.startswith("linux"):
        return
    PR_SET_PDEATHSIG = 1
    libc = ctypes.CDLL(find_library("c"), use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM) != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"prctl(PR_SET_PDEATHSIG) failed: {os.strerror(err)}")


...

self._process = subprocess.Popen(
            ...
            preexec_fn=set_process_to_die_when_parent_dies,
        )

@jeff-hykin jeff-hykin mentioned this pull request Apr 17, 2026
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants