Skip to content

native rebuild changes#1780

Open
jeff-hykin wants to merge 13 commits intodevfrom
jeff/feat/native_rebuild2
Open

native rebuild changes#1780
jeff-hykin wants to merge 13 commits intodevfrom
jeff/feat/native_rebuild2

Conversation

@jeff-hykin
Copy link
Copy Markdown
Member

Problem

Editing source code doesnt cause native modules to rebuild. Very easy to forget and not friendly for ai edits.

Solution

Some generic utils:

  • dimos/utils/change_detect.py: Content-hash-based file change detection using xxhash.
  • NativeModuleConfig.rebuild_on_change: Optional list[str|Path|Glob]

Breaking Changes

None

How to Test

# Change detection tests
pytest dimos/utils/test_change_detect.py -v

# Native module rebuild tests
pytest dimos/core/test_native_rebuild.py -v

# Native module crash/thread leak test
pytest dimos/core/test_native_module.py::test_process_crash_triggers_stop -v

# LCM isolation tests
pytest dimos/protocol/pubsub/test_pattern_sub.py -v
pytest dimos/protocol/pubsub/impl/test_lcmpubsub.py -v

# Full fast suite
pytest -m 'not (tool or slow or mujoco)' dimos/

Contributor License Agreement

  • I have read and approved the CLA.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 12, 2026

Greptile Summary

This PR adds content-hash-based file change detection (dimos/utils/change_detect.py) and wires it into NativeModuleConfig.rebuild_on_change so that native modules automatically rebuild when their source files change.

  • _maybe_build cache-poisoning bug (native_module.py L400–407): did_change is called with the default update=True, writing the new hash to the cache before the build runs. If the build fails with RuntimeError, the cache already reflects the current source hash, so every subsequent _maybe_build() call sees no change and silently skips the rebuild — indefinitely. The fix is update=False on the pre-build check and update_cache(...) after confirmed success.
  • hash_paths None in f-string (mesh_utils.py L87): hash_paths returns str | None; using it directly in the f-string produces a literal "v3_None_…" cache directory for missing/unreadable URDF files.

Confidence Score: 4/5

Not safe to merge as-is: the cache-poisoning bug in _maybe_build means a single failed build silently suppresses all future rebuild attempts, which is the exact problem this PR is designed to solve.

One clear P1 defect in the primary feature path (_maybe_build updating the cache before build success) and one P1 in mesh_utils.py (None-in-f-string cache key). The change_detect utility itself is well-designed; the bug is in how it's consumed.

dimos/core/native_module.py (cache update ordering) and dimos/manipulation/planning/utils/mesh_utils.py (None guard for hash_paths).

Important Files Changed

Filename Overview
dimos/core/native_module.py Adds rebuild_on_change to NativeModuleConfig and _maybe_build change-detection logic; contains a P1 bug where did_change is called with update=True before the build runs, meaning a failed build permanently poisons the cache and suppresses all future rebuild attempts.
dimos/utils/change_detect.py New change-detection utility using xxhash with thread + flock cross-process locking; well-designed API with update=False / update_cache pattern; minor issue: clear_cache doesn't acquire locks before deleting the hash file.
dimos/manipulation/planning/utils/mesh_utils.py Migrates to hash_dict/hash_paths from change_detect; hash_paths return value is used directly in an f-string without a None guard, which silently produces a "v3_None_…" cache key for missing URDF files.
dimos/utils/test_change_detect.py Thorough tests for change_detect covering glob expansion, cache isolation, update=False semantics, relative paths, and the build workflow pattern.
dimos/core/test_native_rebuild.py Tests rebuild-on-change happy paths; does not cover the failed-build scenario that exposes the cache-poisoning bug in _maybe_build.
pyproject.toml Adds xxhash>=3.0.0 as a core dependency; appropriate since change_detect is used in core modules.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[NativeModule.start] --> B[_maybe_build]
    B --> C{rebuild_on_change\nset AND exe exists?}
    C -- No --> D{exe exists?}
    C -- Yes --> E["did_change(update=True ⚠️)\nWrites new hash to cache\nBEFORE build runs"]
    E -- False\nno change --> D
    E -- True\nchanged --> F[needs_rebuild = True]
    F --> G[Run build_command]
    D -- Yes, no rebuild --> H[return early / no build]
    D -- No, or needs rebuild --> G
    G -- returncode != 0 --> I["raise RuntimeError\n(cache already updated ⚠️)"]
    G -- success --> J[exe verified exists]
    J --> K["did_change again\n(redundant – cache already current)"]
    K --> L[return]
    I --> M["Next _maybe_build call:\ndid_change → False\n→ skips rebuild forever ⚠️"]

    style E fill:#ffcccc,stroke:#cc0000
    style I fill:#ffcccc,stroke:#cc0000
    style M fill:#ffcccc,stroke:#cc0000
Loading

Reviews (1): Last reviewed commit: "native rebuild changes" | Re-trigger Greptile

try:
# Include the path so additions/deletions/renames are detected
h.update(str(fpath).encode())
h.update(fpath.read_bytes())
Copy link
Copy Markdown
Contributor

@leshy leshy Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will load a full file into memory, ok for source files, but will for sure cause a weird issue for someone down the line. maybe easy hack for this is to check the file size and simply not hash if larger then 500k-1Mb or something

@leshy
Copy link
Copy Markdown
Contributor

leshy commented Apr 14, 2026

so not sure, haven't looked into details, but is there a reason we aren't relying nix build before each module start, since nix essentially does the above by itself?


# Override in subclasses to exclude fields from CLI arg generation
cli_exclude: frozenset[str] = frozenset()
cli_exclude: frozenset[str] = frozenset({"rebuild_on_change"})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this breaks subclass overriding right? since now all nativemodules need to explicitly include this value

_tail_size = 50

@property
def _mod_label(self) -> str:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cachedproperty better here

_stopping: bool = False
_last_stderr_lines: collections.deque[str]
_stderr_tail: collections.deque[str]
_stdout_tail: collections.deque[str]
Copy link
Copy Markdown
Contributor

@leshy leshy Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every line is already being passed to the logger.

So the tail buffers are keeping a duplicate copy of data that's already been logged just to re-log it again - why? I see also _last_stderr_lines was added previously.

IMO if we want more ergonomics with logging should fix the actual logger to allow you to filter, have per module log files you can tail in another term etc

# from terminal signals (SIGINT from the tty). preexec_fn is unsafe
# in the presence of threads (subprocess docs), so we only use it on
# Linux where prctl(PR_SET_PDEATHSIG) has no alternative.
def _child_preexec_linux() -> None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

extra_env: dict[str, str] = Field(default_factory=dict)
shutdown_timeout: float = 10.0
log_format: LogFormat = LogFormat.TEXT
rebuild_on_change: list[PathEntry] | None = None
Copy link
Copy Markdown
Contributor

@leshy leshy Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw this is the first actual setting that's different between prod and dev, we def don't want this on deployments, so probably if toggled True should be done not in code but the local config file level or via local env vars.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the difference between prod and dev? I don't follow

@jeff-hykin
Copy link
Copy Markdown
Member Author

jeff-hykin commented Apr 14, 2026

so not sure, haven't looked into details, but is there a reason we aren't relying nix build before each module start, since nix essentially does the above by itself?

Unfortunately yeah. I don't care much about speed, but rebuild_on_change is 1980x faster than nix's check on reboot, and merely 280x faster once nix has a "warm" cache. And that's for the mid360, which is absolutely tiny. For a native module like PathFollower being checked onboard the G1, nix takes 1037.89ms just to confirm nothing changed (rebuild_on_change runs in 5ms on the G1).

I'm not thrilled about rebuild_on_change, but it was important for development on the G1.

Now that I think about it though, we should probably have it always rebuild unless there is a rebuild_on_change, since its more or less an optimization

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants