Skip to content

test: add scalable fleet integration coverage#39

Merged
soupat merged 5 commits into
mainfrom
scalable-fleet-integration-tests
Jun 1, 2026
Merged

test: add scalable fleet integration coverage#39
soupat merged 5 commits into
mainfrom
scalable-fleet-integration-tests

Conversation

@soupat
Copy link
Copy Markdown
Collaborator

@soupat soupat commented May 21, 2026

Summary

  • Add 200-device NATS-backed large-fleet integration coverage for discovery, label histograms, truncation, drill-down schema expansion, heterogeneous fleets, invoke_many, ESTOP aliasing, broadcast, where, fire_at, and correlation subscriptions.
  • Add subscription semantics coverage and implementation for live top-level event(...) selectors versus snapshot device(...).event(...) selectors.
  • Add edge predicate hardening for startup warnings, bounded predicate evaluation, and fail-closed behavior.
  • Update public discovery docs, server CLI docs, and add a release-facing test summary for discovery and operations coverage.

Testing

  • git diff --check
  • python -m py_compile packages/device-connect-agent-tools/device_connect_agent_tools/tools.py packages/device-connect-edge/device_connect_edge/device.py packages/device-connect-edge/tests/test_device_where.py tests/tests/test_tools_broadcast.py tests/tests/test_tools_large_fleet_invoke.py tests/tests/test_tools_large_fleet_heterogeneous.py
  • pytest packages/device-connect-agent-tools/tests/test_subscribe.py packages/device-connect-agent-tools/tests/test_broadcast.py -q -> 32 passed in 0.35s
  • pytest packages/device-connect-edge/tests/test_device_where.py packages/device-connect-edge/tests/test_predicate.py -q -> 19 passed in 0.23s
  • pytest tests/tests/test_tools_broadcast.py --backend=nats -q --tb=short --log-cli-level=CRITICAL -> 12 passed, 12 skipped in 49.53s
  • pytest tests/tests/test_tools_invoke.py::test_scalable_fleet_discovery_and_invoke_many tests/tests/test_tools_large_fleet_discovery.py tests/tests/test_tools_large_fleet_invoke.py tests/tests/test_tools_large_fleet_broadcast.py tests/tests/test_tools_large_fleet_heterogeneous.py --backend=nats -q --tb=short --log-cli-level=CRITICAL -> 18 passed in 37.05s

Adds NATS-backed large-fleet integration coverage for discovery, schema compaction and drill-down, heterogeneous fleets, invoke_many, ESTOP aliasing, broadcast, where, fire_at, and correlation subscriptions.

Also documents release test coverage and addresses subscription and predicate behavior needed by the comprehensive scenarios.
@soupat soupat requested a review from atsyplikhin May 21, 2026 17:43
Large-fleet tests intentionally seed hundreds of registry rows. Clearing only before those tests let stale rows remain until TTL expiry and polluted later broad selector discovery tests in CI.
@soupat soupat requested a review from kavya-chennoju May 27, 2026 16:30
@kavya-chennoju
Copy link
Copy Markdown
Collaborator

Review addendum — additional medium-severity risks

Four more items to add to the risk table from the prior review. The first three sharpen points already raised; the fourth is a new correctness gap in the live event-subscription path.

Medium — Daemon thread per where evaluation leaks + blocks the event loop

_evaluate_where_with_timeout in device.py:1331-1346 spins up a threading.Thread(target=_run, daemon=True) and calls thread.join(timeout_s). Two compounding problems:

  • Thread leakage on timeout. When the join times out, the function returns False, but the underlying thread is abandoned and keeps running to completion. A barrage of slow or hostile predicates produces unbounded thread accumulation, with leak rate set by broadcast rate × P(eval > 50 ms) rather than by anything the edge controls. Edges run for weeks, so steady-state inflation is real even without an attacker.
  • Synchronous .join() blocks the asyncio loop. _evaluate_where is invoked from the broadcast dispatch path, which runs on the edge's main event loop. thread.join(0.05) parks that loop thread for up to 50 ms per evaluation. Under a broad where broadcast against a busy edge, this serializes event handling and queues up event/RPC traffic. Mitigation: hand evaluation off via loop.run_in_executor with a reusable single-worker executor, and treat cancellation properly (e.g. abandon-and-reuse with a generation counter) rather than spawn-per-call.

Medium — Malformed NATS subjects from event names

_event_subjects_for_selector at tools.py:1080-1083 interpolates the resolved event name directly:

f"device-connect.{conn.zone}.*.event.{name}"

NATS treats . as a token separator, * as a single-token wildcard, and > as a tail wildcard. Concrete failures:

  • Hierarchical names like motion.detected or sensor.temperature.high become 6-token subjects that match unintended siblings under event.motion.*.
  • A registered event name of * resolves to device-connect.<zone>.*.event.*, silently subscribing to every event in the zone.
  • A registered name of > resolves to device-connect.<zone>.*.event.>, subscribing to that subject and every descendant — particularly dangerous if the codebase later starts publishing on event.<name>.<subtopic> subjects.
  • Whitespace or empty names produce malformed subjects that fail at subscribe time or silently match nothing.

Required fix before production: validate name against ^[A-Za-z0-9_\-]+$ in _event_names_for_filter and return invalid_event_name for offenders, plus the same check at event-registration time on the edge so bad names never reach the registry.

Medium — Live event subscriptions miss events when registry is initially empty

This is the more interesting correctness gap in the new live-subscription design. _event_names_for_filter at tools.py:1064-1093 resolves event names by calling discover(selector) once, at subscription time. If no devices that emit the target event name are registered yet, discover returns [], the resulting subject list is empty, and _event_subjects_for_selector hands back [].

The subscription handle is then constructed with no subjects — every late-joining device that does emit object_detected afterwards is silently missed, even though the documented contract is that top-level event(...) includes late joiners. Concretely:

  • An agent boots before its target fleet, calls subscribe(\"event(object_detected)\"), and waits. Cameras come online a minute later. The agent receives nothing.
  • A test or operator restarts the registry, then re-attaches an event monitor before devices re-register. Same silent miss.
  • An agent subscribes to a brand-new event name that no currently-registered device exposes yet but a forthcoming firmware rollout will. The subscription is permanently inert until restarted.

The fix has to handle the "name unknown at sub-time" case. Two reasonable options:

  1. If the resolved name set is empty, subscribe to the full device-connect.{zone}.*.event.> wildcard and filter incoming messages client-side by name. Costlier, but matches the documented late-joiner semantics.
  2. Re-resolve the name set on a timer (or on registry-change notifications) and add new subjects to the live Subscription handle as devices appear. Cheaper steady-state, but requires Subscription to support incremental subject add — likely a follow-up.

The existing integration test test_subscribe_top_level_event_selector_includes_late_joiners passes only because itest-evlive-cam-1 registers before subscribe(\"event(object_detected)\") is called — it covers "new device, known event name" but not "subscribe before any matching device exists." Worth adding a regression that subscribes against an empty registry and then spawns a camera.

Low — Test duplication + unbounded asyncio.gather in fleet spawn

Reiterating from the prior review since it interacts with scale-test reliability:

@atsyplikhin
Copy link
Copy Markdown
Collaborator

Review — two findings in the current head not yet covered

The latest commit (fix: address PR 39 event and scale review feedback) resolves the earlier addendum's points: predicate eval is now executor-based and bounded, event names are validated before becoming subjects, the empty-registry late-joiner case uses the *.event.> wildcard + client-side filter, and fleet spawn is semaphore-bounded. Two issues remain that I didn't see raised elsewhere.

High — the new EventDef name validator breaks the MCP capability path

types.py adds a field_validator("name") rejecting anything not matching ^[A-Za-z0-9_-]+$. But DeviceConnectMCP.get_capabilities() still builds events with the event/ prefix:

# packages/device-connect-agent-tools/device_connect_agent_tools/mcp/device_connect_mcp.py:401
events.append(EventDef(name=f"event/{meta['name']}", ...))

Confirmed it raises at runtime against this branch:

ValidationError: Invalid event name 'event/objectDetected'. Must match '^[A-Za-z0-9_-]+$'.

So any MCP-adapter device exposing >=1 event now fails when capabilities are built (reached via _DefaultDriver.capabilities). No test covers get_capabilities() with events, so CI stays green while the adapter is broken.

Root cause is an inconsistency: enqueue_event and _RemoteInvoker.publish_event strip the event/ prefix before validating, but the EventDef validator rejects it. Fix options: pass meta['name'] (not f"event/{...}") at the MCP site, or have validate_event_name strip a leading event/ before checking. Either way, add a regression test that builds MCP capabilities with an event. (The surviving docstring example EventDef(name="event/objectDetected", ...) at types.py:130 also now documents a value the class rejects.)

Medium — live event(<name>, <label>:x) selectors silently drop the label constraint

The new EVENT_ONLY literal-name fast path in _event_subjects_for_selector:

if event_name_match and not _is_glob_name_match(event_name_match):
    return [f"device-connect.{conn.zone}.*.event.>"], None, {event_name_match}

The client-side filter (_message_event_name in {event_name_match}) keys on the name only. A Filter carries both name_match and key_filters, so a selector like event(motion_detected, safety:critical) takes this branch and delivers motion_detected from every device — the safety:critical constraint is dropped. On main this selector resolved through discover(), which is label-aware, so this is a behavior regression for name+label event selectors (the glob/no-name branch is fine — it resolves names via label-aware discovery). Either honor co-present label filters client-side, or explicitly reject name+label combos on the live path.

@soupat soupat merged commit 5ea5d30 into main Jun 1, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants