Skip to content

Jeff/fix/rosnav8#1791

Draft
jeff-hykin wants to merge 81 commits intodevfrom
jeff/fix/rosnav8
Draft

Jeff/fix/rosnav8#1791
jeff-hykin wants to merge 81 commits intodevfrom
jeff/fix/rosnav8

Conversation

@jeff-hykin
Copy link
Copy Markdown
Member

@jeff-hykin jeff-hykin commented Apr 16, 2026

Pre-requisite PRs

Will be perpetually re-testing hardware after merges/changes to ^

631af1e tested for 2 hours on the g1 (entire battery! instantly responsive after 2hrs)
c54578a044f892fe7f231b468f2f3512adf4282a (later) hardware tested and sim tested.

Problem

Robot blueprints each duplicated viewer backend selection logic (foxglove vs rerun vs rerun-web vs rerun-connect), making it error-prone to add new backends or change behavior. The RerunBridge also lacked a WebSocket server for dimos-viewer keyboard/click events in --connect mode.

Solution

  • port over the whole nav stack

Breaking Changes

  • None

How to Test

git checkout jeff/fix/rosnav8

# last hardware tested commit
git checkout c54578a044f892fe7f231b468f2f3512adf4282a

# simulation
uv run dimos run unitree-g1-nav-sim

# real
uv run dimos run unitree-g1-nav-onboard

Contributor License Agreement

  • I have read and approved the CLA.

Twelve fixes from Paul-style code review of #1780:

native_module.py:
- _maybe_build: use did_change(update=False) + post-success update_cache so
  failed builds never poison the rebuild cache
- delete stray "Source files changed" log line that fired on every call
- stop(): per-instance _stop_lock; capture proc/watchdog refs under the
  lock, do signal/wait/join outside the lock to avoid the watchdog-self-
  join deadlock; second caller no-ops via _stopping
- _read_log_stream: receive pid as a parameter instead of reaching into
  self._process.pid (TOCTOU with stop())
- _child_preexec_linux + ctypes hoisted to module scope under a
  sys.platform guard; no more re-import per start()
- _clear_nix_executable: gracefully handle cwd=None (skip the walk
  entirely), use .resolve() comparison so a symlinked cwd still
  terminates the walk, refuse to walk past cwd's tree

change_detect.py:
- fold max_file_size into _hash_files digest so different thresholds
  against the same cache_name can't corrupt each other
- new _locked_cache_file context manager — flock the .hash file directly
  in "a+" mode; no more orphan .lock sidecars accumulating in the cache
  dir; did_change/update_cache/clear_cache all share the helper

Tests:
- new test_should_rebuild_true_bypasses_change_check for the explicit
  "always rebuild" path
- new test_failed_build_does_not_mark_cache_clean for the update=False
  retry semantics

Build system:
- tiledb darwin overlay: filter patches by name instead of dropping all
  of them, so a future security patch from nixpkgs survives
- livox_sdk_config.hpp: honor $TMPDIR on Darwin instead of hardcoding /tmp

Perf script:
- compute cache_name inline (using the same inspect-based source_file
  pattern) instead of calling the inlined _build_cache_name method

All 26 tests across test_change_detect.py + test_native_module.py +
test_native_rebuild.py pass. ruff + mypy clean. mid360_native and
fastlio2_native still build end-to-end on aarch64-darwin. Linux drvPaths
for libpqxx/tiledb/pcl/vtk verified unchanged by the overlay.
# (roof is at ~3 m+).
MAX_ALLOWED_Z = 2.0

lcm_url = os.environ.get("LCM_DEFAULT_URL", "udpm://239.255.76.67:7667?ttl=0")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use _DEFAULT_LCM_URL from lcmservice.py. No need to duplicate it here.

if robot_z > max_z:
max_z = robot_z

lc.subscribe(ODOM_TOPIC, _odom_handler)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to unsubscribe.

MAX_ALLOWED_Z = 2.0

lcm_url = os.environ.get("LCM_DEFAULT_URL", "udpm://239.255.76.67:7667?ttl=0")
lc = lcmlib.LCM(lcm_url)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a 20000 line PR. You're not saving disk space by using lc instead of lcm. :)

)

finally:
print("\n[test-simple] Stopping blueprint…")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add in conventions to not add print() in tests.

print(f"[test-simple] Odom online. Robot at ({robot_x:.2f}, {robot_y:.2f})")

print(f"[test-simple] Warming up for {WARMUP_SEC}s…")
time.sleep(WARMUP_SEC)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes sleep is unavoidable, but is there no way to pool for a condition than to wait 15 seconds every time?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually yeah, if your module start() has returned your module is assumed ready

Comment on lines +21 to +27
Test sequence:
p0 (-0.3, 2.5) — open corridor speed test
p1 (11.2, -1.8) — navigate with furniture
p2 ( 3.3, -4.9) — intermediate waypoint near doorway (explore lower area)
p3 ( 7.0, -5.0) — through the doorway into the right room
p4 (11.3, -5.6) — explore right room
p4→p1 (11.2, -1.8) — CRITICAL: must route through doorway, NOT wall
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Test sequence:
p0 (-0.3, 2.5) — open corridor speed test
p1 (11.2, -1.8) — navigate with furniture
p2 ( 3.3, -4.9) — intermediate waypoint near doorway (explore lower area)
p3 ( 7.0, -5.0) — through the doorway into the right room
p4 (11.3, -5.6) — explore right room
p4p1 (11.2, -1.8) — CRITICAL: must route through doorway, NOT wall

import lcm as lcmlib
import pytest

os.environ.setdefault("DISPLAY", ":1")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not change the system after the test. Why is this even needed?

from dimos.visualization.vis_module import vis_module

# -- Clear stale nav paths from previous runs -------------------------
paths_dir = Path(__file__).resolve().parents[3] / "data" / "smart_nav_paths"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this get_data("smart_nav_paths")? Why are you removing it? Everything in data should be static.

for f in paths_dir.iterdir():
f.unlink(missing_ok=True)

# -- Build blueprint (same composition as unitree_g1_nav_sim) ----------
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does look pretty identical. So why not use it instead of copying it here?

pytestmark = [pytest.mark.slow]


class TestCrossWallPlanning:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is pretty much identical to dimos/navigation/smart_nav/tests/test_cross_wall_planning_simple.py . Please extract to common parts (like the waypoint following) into fixtures.

I already have something similar as follow_points in dimos/e2e_tests/conftest.py. Can that not be used here?

Also, it might be better to move these tests to dimos/e2e_tests.

("result-tare-planner", "tare_planner"),
]
_HAS_BINARIES = all((_NATIVE_DIR / d / "bin" / name).exists() for d, name in _REQUIRED_BINARIES)
_IS_LINUX_X86 = platform.system() == "Linux" and platform.machine() in ("x86_64", "AMD64")
Copy link
Copy Markdown
Contributor

@paul-nechifor paul-nechifor Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You use _IS_LINUX_X86 in several places. Please move it to dimos/conftest.py as skipif_macos. But I don't understand why we can't build on Mac OS.

registered_scan: Out[PointCloud2]
odometry: Out[Odometry]

def __init__(self, **kwargs): # type: ignore[no-untyped-def]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __init__(self, **kwargs): # type: ignore[no-untyped-def]
def __init__(self, **kwargs):

no typechecking in tests.

Comment on lines +161 to +172
def __getstate__(self) -> dict[str, Any]:
state = super().__getstate__()
state.pop("_cmd_lock", None)
state.pop("_sensor_thread", None)
state.pop("_sim_thread", None)
return state

def __setstate__(self, state: dict[str, Any]) -> None:
super().__setstate__(state)
self._cmd_lock = threading.Lock()
self._sensor_thread = None
self._sim_thread = None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this is needed?

@rpc
def start(self) -> None:
super().start()
self.cmd_vel._transport.subscribe(self._on_cmd_vel)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please unsub


now = time.time()
quat = Quaternion.from_euler(Vector3(0.0, 0.0, self._yaw))
self.odometry._transport.publish(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why _transport?

lock: threading.Lock = field(default_factory=threading.Lock)


# Test
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Test

Comment on lines +269 to +276
from dimos.core.coordination.blueprints import autoconnect
from dimos.core.coordination.module_coordinator import ModuleCoordinator
from dimos.msgs.geometry_msgs.PointStamped import PointStamped
from dimos.msgs.nav_msgs.Path import Path as NavPath
from dimos.navigation.smart_nav.modules.local_planner.local_planner import LocalPlanner
from dimos.navigation.smart_nav.modules.path_follower.path_follower import PathFollower
from dimos.navigation.smart_nav.modules.tare_planner.tare_planner import TarePlanner
from dimos.navigation.smart_nav.modules.terrain_analysis.terrain_analysis import TerrainAnalysis
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to top.

Comment on lines +291 to +316
tare = coordinator.get_instance(TarePlanner)
planner = coordinator.get_instance(LocalPlanner)
follower = coordinator.get_instance(PathFollower)
coordinator.get_instance(MockVehicle)
terrain = coordinator.get_instance(TerrainAnalysis)

def _on_wp(msg: PointStamped) -> None:
with collector.lock:
collector.waypoints.append((msg.x, msg.y, msg.z))

def _on_terrain(msg: PointCloud2) -> None:
with collector.lock:
collector.terrain_maps.append(True)

def _on_path(msg: NavPath) -> None:
with collector.lock:
collector.paths.append(msg)

def _on_cmd(msg: Twist) -> None:
with collector.lock:
collector.cmd_vels.append((msg.linear.x, msg.linear.y, msg.angular.z))

tare.way_point._transport.subscribe(_on_wp)
planner.path._transport.subscribe(_on_path)
follower.cmd_vel._transport.subscribe(_on_cmd)
terrain.terrain_map._transport.subscribe(_on_terrain)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same pattern you're using in dimos/navigation/smart_nav/tests/test_waypoint_nav.py except you're using lock.acquire() instead of with lock.

You can have a fixture which starts the blueprint and records data. The test should use the fixture and assert that the data is what you expect.



class PathFollower(NativeModule):
"""Pure pursuit path follower with PID yaw control.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't use pure pursuit for holonomic robots, just for an issue for later


def _query_pose(self) -> tuple[float, float, float]:
"""Return (x, y, z) from the TF tree, falling back to cached values."""
for parent, child in self._TF_POSE_QUERIES:
Copy link
Copy Markdown
Contributor

@leshy leshy Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why aren't you simply using self.tf.get ? what is this for? I see cache mentioned but tf service does offer built in caching

return len(self._key_poses)


class PGO(Module):
Copy link
Copy Markdown
Contributor

@leshy leshy Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have DIY python only PGO? aren't we using some fastlio pgo stuff?

return [s, int((ts - s) * 1_000_000_000)]


class GraphNodes3D(Timestamped):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dimos.msgs structure 100% matches ros msgs, you shouldn't add new custom classes to this nav_msgs dir but create a dimos.msgs.something subdir for your custom messages.

they can later be upgraded to be a part of our actual dimos-lcm repo for multi lang support if we choose to, but shouldn't pollute ROS namespace

"rerun",
rerun_config={
"visual_override": {
"world/lidar": lambda grid: grid.to_rerun(voxel_size=voxel_size, mode="boxes"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot use lambdas in blueprints because they can't be pickled to be sent to modules. Are these values not sent to modules?

You have to use functions with names because they can be serialized. (This worked in dask because dask serializes functions by bytecode.)

pass
# Also grab addresses via DGRAM trick for interfaces without DNS
try:
import subprocess
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to top.

super().stop()
# Subscribe to our own odometry output so we can mirror each
# pose update into the TF tree as an odom→body transform.
self.odometry.transport.subscribe(self._on_odom_for_tf, self.odometry)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsubscribe.



class TestFarPlannerConfig:
"""Test FarPlanner configuration."""
Copy link
Copy Markdown
Contributor

@paul-nechifor paul-nechifor Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More tests for values. Please remove.

I don't see any code in dimos/navigation/smart_nav/modules/far_planner/far_planner.py . There's nothing to test.

@@ -0,0 +1,102 @@
# Copyright 2026 Dimensional Inc.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this file too.

Comment on lines +536 to +540
logger.debug(
"Global map published",
points=len(cloud_np),
keyframes=pgo.num_key_poses,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.debug(
"Global map published",
points=len(cloud_np),
keyframes=pgo.num_key_poses,
)

Comment on lines +416 to +417
self.odometry.subscribe(self._on_odom)
self.registered_scan.subscribe(self._on_scan)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsub

Comment thread dimos/utils/generic.py
import uuid


def get_local_ips() -> list[tuple[str, str]]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another _get_local_ips in dimos/hardware/sensors/lidar/fastlio2/module.py. Should this replace that? Also move the imports to the top.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants