feat(api): add porcelain api #1772

Closed
paul-nechifor wants to merge 1 commit into dev from paul/feat/porcelain-api

Conversation

@paul-nechifor
Contributor

Problem

Closes DIM-XXX

Solution

Breaking Changes

How to Test

from dimos import Dimos

app = Dimos(n_workers=8)                  # construct the porcelain API
app.run("unitree-go2-agentic")            # run a blueprint through the ModuleCoordinator
app.skills.relative_move(forward=2.0)     # call a skill through app.skills
print(app.ReplanningAStarPlanner._planner._safe_goal_clearance)  # attribute access returns an RPyC proxy to the running module
app.stop()                                # shut everything down

Contributor License Agreement

  • I have read and approved the CLA.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 11, 2026

Greptile Summary

This PR introduces a Dimos porcelain API class that wraps the ModuleCoordinator with a user-friendly interface for running blueprints, accessing skills, and getting RPyC proxies to running modules. It also refactors get_all_blueprints.py into a reusable module and adds a StartRpycRequest message type for on-demand RPyC server startup inside worker processes.

Confidence Score: 4/5

Safe to merge with awareness of the lock-contention hazard in __getattr__ when the RPyC server has not yet been started.

All findings are P2, but the lock-holding during rpyc.connect() can block stop() for up to 30 seconds on the first app.SomeModule access — a real reliability concern for concurrent usage. The global config mutation and stale skill-cache issues are edge cases. Score is 4 rather than 5 to prompt at least the lock fix before shipping.

dimos/porcelain.py — specifically _get_rpyc_proxy (lock contention), run() (global_config side-effect), and _SkillsProxy._build_cache() (silent error suppression + stale cache after restart).

Important Files Changed

| Filename | Overview |
| --- | --- |
| dimos/porcelain.py | Core porcelain API — has a lock-contention hazard (RPyC connect inside an RLock), global config mutation, and a non-thread-safe _SkillsProxy cache. |
| dimos/__init__.py | Lazy-import shim for Dimos — minimal and correct. |
| dimos/core/coordination/worker_messages.py | Adds the StartRpycRequest frozen dataclass to the worker message union — straightforward addition (see the sketch after this table). |
| dimos/core/coordination/python_worker.py | Adds start_rpyc() on Actor and a StartRpycRequest handler in the worker loop — logic is correct; the socket is bound before start(), so the port is valid immediately. |
| dimos/robot/get_all_blueprints.py | Extracts blueprint/module lookup helpers from the CLI into a reusable module — clean separation. |
| dimos/test_porcelain.py | Good test coverage of construction, lifecycle, skills, RPyC access, thread safety, and restart; slow tests are properly marked. |
| dimos/robot/test_get_all_blueprints.py | Tests for the name-resolution helpers — cover happy paths and the unknown-name error case. |
| dimos/test_no_init_files.py | Enforces no stray __init__.py files except the new porcelain shim — correctly exempts dimos/__init__.py. |
| dimos/robot/cli/dimos.py | CLI updated to import get_by_name_or_exit from the new shared module — no functional changes. |
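
The worker-message change itself isn't quoted in this thread. Purely as an illustrative sketch of the pattern that row describes (apart from StartRpycRequest and _start_rpyc_server, every name and field below is an assumption), a frozen-dataclass request plus a worker-loop handler would look roughly like this:

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class StartRpycRequest:
    """Ask a worker process to start its RPyC server and report the bound port."""
    module_id: str  # assumed field: which deployed module the server should expose


@dataclass(frozen=True)
class ShutdownRequest:
    """Sent by the coordinator on stop(), as in the sequence diagram below."""


WorkerMessage = Union[StartRpycRequest, ShutdownRequest]  # assumed union alias


def _start_rpyc_server() -> int:
    """Stand-in for the worker's real helper, which binds a socket and returns its port."""
    return 18861  # placeholder port


def handle_message(msg: WorkerMessage) -> int | None:
    """Sketch of the worker-loop dispatch: bind before start() so the port is valid at reply time."""
    if isinstance(msg, StartRpycRequest):
        return _start_rpyc_server()
    return None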

Sequence Diagram

sequenceDiagram
    participant U as User
    participant D as Dimos
    participant C as ModuleCoordinator
    participant R as RPCClient
    participant A as Actor
    participant W as WorkerProcess
    participant P as RPyCServer

    U->>D: run(unitree-go2-agentic)
    D->>C: build(blueprint)
    C->>W: DeployModuleRequest
    W-->>C: WorkerResponse(module_id)
    C-->>D: coordinator ready

    U->>D: app.skills.relative_move(forward=2.0)
    D->>R: get_skills() via LCM RPC
    R-->>D: list[SkillInfo]
    D->>R: relative_move via RpcCall
    R-->>U: result

    U->>D: app.SomeModule
    D->>A: start_rpyc()
    A->>W: StartRpycRequest via pipe
    W->>P: _start_rpyc_server()
    P-->>W: port bound
    W-->>A: WorkerResponse(port)
    D->>P: rpyc.connect(localhost, port)
    P-->>D: module_proxy
    D-->>U: RPyC module proxy

    U->>D: stop()
    D->>P: conn.close()
    D->>C: stop()
    C->>W: ShutdownRequest

Reviews (1): Last reviewed commit: "feat(api): add porcelain api"

Comment thread dimos/porcelain.py
Comment on lines +136 to +148
def _get_rpyc_proxy(self, module_class: type[ModuleBase], proxy: RPCClient) -> Any:
    """Get or create an RPyC proxy to a remote module instance."""
    if module_class in self._rpyc_cache:
        conn, module_proxy = self._rpyc_cache[module_class]
        if not conn.closed:
            return module_proxy

    actor = proxy.actor_instance
    port = actor.start_rpyc()
    conn = rpyc.connect("localhost", port, config={"sync_request_timeout": 30})
    module_proxy = conn.root.get_module(actor._module_id)
    self._rpyc_cache[module_class] = (conn, module_proxy)
    return module_proxy

P2 RPyC connect holds self._lock for up to 30 s

_get_rpyc_proxy is always called while self._lock is held (inside the with self._lock: block in __getattr__). The rpyc.connect() call on line 145 has sync_request_timeout=30, meaning any thread that calls stop(), run(), restart(), or skills will block for up to 30 seconds while a new RPyC connection is being established. Consider releasing the lock before the network call:

def _get_rpyc_proxy(self, module_class: type[ModuleBase], proxy: RPCClient) -> Any:
    with self._lock:
        if module_class in self._rpyc_cache:
            conn, module_proxy = self._rpyc_cache[module_class]
            if not conn.closed:
                return module_proxy

    # Establish connection outside the global lock
    actor = proxy.actor_instance
    port = actor.start_rpyc()
    conn = rpyc.connect("localhost", port, config={"sync_request_timeout": 30})
    module_proxy = conn.root.get_module(actor._module_id)

    with self._lock:
        self._rpyc_cache[module_class] = (conn, module_proxy)
    return module_proxy

Comment thread dimos/porcelain.py
Comment on lines +64 to +70
if self._coordinator is None:
    from dimos.core.coordination.module_coordinator import ModuleCoordinator
    from dimos.core.global_config import global_config

    if self._config_overrides:
        global_config.update(**self._config_overrides)
    self._coordinator = ModuleCoordinator.build(blueprint)

P2 global_config.update() is last-writer-wins across instances

global_config is a module-level singleton. If two Dimos instances are created with different overrides and both call run(), the second call overwrites the first's config. This affects any module that reads global_config after both coordinators are built (e.g., newly spawned workers). In test suites that create multiple Dimos instances this can cause subtle ordering-dependent failures.
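
A minimal sketch of the hazard (the blueprint names and worker counts below are made up; only Dimos(n_workers=...) and run() come from the PR):

from dimos import Dimos

app_a = Dimos(n_workers=2)    # overrides stored per instance at construction time
app_b = Dimos(n_workers=16)

app_a.run("blueprint-a")      # first run(): global_config.update(n_workers=2)
app_b.run("blueprint-b")      # second run(): global_config.update(n_workers=16), overwriting the first

# Any code that reads the module-level global_config after this point (for example a
# worker spawned later for app_a) sees n_workers=16, not the 2 that app_a was built with.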

Comment thread dimos/porcelain.py
Comment on lines +243 to +247
for cls, proxy in self._coordinator._deployed_modules.items():
    try:
        skills: list[SkillInfo] = proxy.get_skills()  # type: ignore[attr-defined]
    except Exception:
        continue

P2 Silent except Exception: continue hides module connectivity errors

If proxy.get_skills() fails (e.g. the module process crashed, the RPC channel is down, or the method simply doesn't exist), the module is silently skipped. The caller then sees AttributeError: No skill named 'foo' with no indication that some modules failed to report their skills. At minimum, log the error at WARNING level:

try:
    skills: list[SkillInfo] = proxy.get_skills()  # type: ignore[attr-defined]
except Exception as exc:
    logger.warning("Failed to get skills from module", module=cls.__name__, error=str(exc))
    continue

Comment thread dimos/porcelain.py
Comment on lines +237 to +252
def _build_cache(self) -> None:
    modules_key = frozenset(self._coordinator._deployed_modules.keys())
    if self._cache_key == modules_key and self._cache is not None:
        return

    skill_map: dict[str, list[tuple[type[ModuleBase], RPCClient, SkillInfo]]] = {}
    for cls, proxy in self._coordinator._deployed_modules.items():
        try:
            skills: list[SkillInfo] = proxy.get_skills()  # type: ignore[attr-defined]
        except Exception:
            continue
        for info in skills:
            skill_map.setdefault(info.func_name, []).append((cls, proxy, info))  # type: ignore[arg-type]

    self._cache = skill_map
    self._cache_key = modules_key

P2 _SkillsProxy cache not invalidated after restart()

_build_cache() keys its short-circuit on frozenset(_coordinator._deployed_modules.keys()). A restart() call replaces a module in-place (same class key, potentially new RPCClient proxy), so the frozenset is unchanged and the cache is NOT rebuilt. A stored _SkillsProxy will keep calling the old proxy, which may be connected to a dead worker. Because each app.skills access creates a fresh _SkillsProxy this only bites users who store the proxy (skills = app.skills; app.restart(...); skills.foo()). Consider keying the cache on a monotonic generation counter bumped on each restart.
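
A minimal sketch of that suggestion, assuming Dimos keeps a self._generation counter that restart() increments and that the skills proxy can read it (neither is in the diff):

def _build_cache(self) -> None:
    # Composite key: the deployed module classes plus the assumed generation counter,
    # so an in-place module replacement during restart() still invalidates the cache.
    cache_key = (
        frozenset(self._coordinator._deployed_modules.keys()),
        self._generation,
    )
    if self._cache_key == cache_key and self._cache is not None:
        return
    # ... the skill_map rebuild stays exactly as in the current code, ending with
    # self._cache = skill_map and self._cache_key = cache_key.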

@paul-nechifor marked this pull request as draft on April 12, 2026 at 00:20
@paul-nechifor
Contributor Author

Closed in favor of #1779
