test: add scalable fleet integration coverage by soupat · Pull Request #39 · arm/device-connect

soupat · 2026-05-21T17:40:20Z

Summary

Add 200-device NATS-backed large-fleet integration coverage for discovery, label histograms, truncation, drill-down schema expansion, heterogeneous fleets, invoke_many, ESTOP aliasing, broadcast, where, fire_at, and correlation subscriptions.
Add subscription semantics coverage and implementation for live top-level event(...) selectors versus snapshot device(...).event(...) selectors.
Add edge predicate hardening for startup warnings, bounded predicate evaluation, and fail-closed behavior.
Update public discovery docs, server CLI docs, and add a release-facing test summary for discovery and operations coverage.

Testing

git diff --check
python -m py_compile packages/device-connect-agent-tools/device_connect_agent_tools/tools.py packages/device-connect-edge/device_connect_edge/device.py packages/device-connect-edge/tests/test_device_where.py tests/tests/test_tools_broadcast.py tests/tests/test_tools_large_fleet_invoke.py tests/tests/test_tools_large_fleet_heterogeneous.py
pytest packages/device-connect-agent-tools/tests/test_subscribe.py packages/device-connect-agent-tools/tests/test_broadcast.py -q -> 32 passed in 0.35s
pytest packages/device-connect-edge/tests/test_device_where.py packages/device-connect-edge/tests/test_predicate.py -q -> 19 passed in 0.23s
pytest tests/tests/test_tools_broadcast.py --backend=nats -q --tb=short --log-cli-level=CRITICAL -> 12 passed, 12 skipped in 49.53s
pytest tests/tests/test_tools_invoke.py::test_scalable_fleet_discovery_and_invoke_many tests/tests/test_tools_large_fleet_discovery.py tests/tests/test_tools_large_fleet_invoke.py tests/tests/test_tools_large_fleet_broadcast.py tests/tests/test_tools_large_fleet_heterogeneous.py --backend=nats -q --tb=short --log-cli-level=CRITICAL -> 18 passed in 37.05s

Adds NATS-backed large-fleet integration coverage for discovery, schema compaction and drill-down, heterogeneous fleets, invoke_many, ESTOP aliasing, broadcast, where, fire_at, and correlation subscriptions. Also documents release test coverage and addresses subscription and predicate behavior needed by the comprehensive scenarios.

Large-fleet tests intentionally seed hundreds of registry rows. Clearing only before those tests let stale rows remain until TTL expiry and polluted later broad selector discovery tests in CI.

kavya-chennoju · 2026-05-28T23:26:44Z

Review addendum — additional medium-severity risks

Four more items to add to the risk table from the prior review. The first three sharpen points already raised; the fourth is a new correctness gap in the live event-subscription path.

Medium — Daemon thread per `where` evaluation leaks + blocks the event loop

_evaluate_where_with_timeout in device.py:1331-1346 spins up a threading.Thread(target=_run, daemon=True) and calls thread.join(timeout_s). Two compounding problems:

Thread leakage on timeout. When the join times out, the function returns False, but the underlying thread is abandoned and keeps running to completion. A barrage of slow or hostile predicates produces unbounded thread accumulation, with leak rate set by broadcast rate × P(eval > 50 ms) rather than by anything the edge controls. Edges run for weeks, so steady-state inflation is real even without an attacker.
Synchronous .join() blocks the asyncio loop. _evaluate_where is invoked from the broadcast dispatch path, which runs on the edge's main event loop. thread.join(0.05) parks that loop thread for up to 50 ms per evaluation. Under a broad where broadcast against a busy edge, this serializes event handling and queues up event/RPC traffic. Mitigation: hand evaluation off via loop.run_in_executor with a reusable single-worker executor, and treat cancellation properly (e.g. abandon-and-reuse with a generation counter) rather than spawn-per-call.

Medium — Malformed NATS subjects from event names

_event_subjects_for_selector at tools.py:1080-1083 interpolates the resolved event name directly:

f"device-connect.{conn.zone}.*.event.{name}"

NATS treats . as a token separator, * as a single-token wildcard, and > as a tail wildcard. Concrete failures:

Hierarchical names like motion.detected or sensor.temperature.high become 6-token subjects that match unintended siblings under event.motion.*.
A registered event name of * resolves to device-connect.<zone>.*.event.*, silently subscribing to every event in the zone.
A registered name of > resolves to device-connect.<zone>.*.event.>, subscribing to that subject and every descendant — particularly dangerous if the codebase later starts publishing on event.<name>.<subtopic> subjects.
Whitespace or empty names produce malformed subjects that fail at subscribe time or silently match nothing.

Required fix before production: validate name against ^[A-Za-z0-9_\-]+$ in _event_names_for_filter and return invalid_event_name for offenders, plus the same check at event-registration time on the edge so bad names never reach the registry.

Medium — Live event subscriptions miss events when registry is initially empty

This is the more interesting correctness gap in the new live-subscription design. _event_names_for_filter at tools.py:1064-1093 resolves event names by calling discover(selector) once, at subscription time. If no devices that emit the target event name are registered yet, discover returns [], the resulting subject list is empty, and _event_subjects_for_selector hands back [].

The subscription handle is then constructed with no subjects — every late-joining device that does emit object_detected afterwards is silently missed, even though the documented contract is that top-level event(...) includes late joiners. Concretely:

An agent boots before its target fleet, calls subscribe(\"event(object_detected)\"), and waits. Cameras come online a minute later. The agent receives nothing.
A test or operator restarts the registry, then re-attaches an event monitor before devices re-register. Same silent miss.
An agent subscribes to a brand-new event name that no currently-registered device exposes yet but a forthcoming firmware rollout will. The subscription is permanently inert until restarted.

The fix has to handle the "name unknown at sub-time" case. Two reasonable options:

If the resolved name set is empty, subscribe to the full device-connect.{zone}.*.event.> wildcard and filter incoming messages client-side by name. Costlier, but matches the documented late-joiner semantics.
Re-resolve the name set on a timer (or on registry-change notifications) and add new subjects to the live Subscription handle as devices appear. Cheaper steady-state, but requires Subscription to support incremental subject add — likely a follow-up.

The existing integration test test_subscribe_top_level_event_selector_includes_late_joiners passes only because itest-evlive-cam-1 registers before subscribe(\"event(object_detected)\") is called — it covers "new device, known event name" but not "subscribe before any matching device exists." Worth adding a regression that subscribes against an empty registry and then spawns a camera.

Low — Test duplication + unbounded `asyncio.gather` in fleet spawn

Reiterating from the prior review since it interacts with scale-test reliability:

_wait_for_devices, _assert_compact_function_rows, _assert_expanded_function_rows, and _scale_fleet_size are copy-pasted across test_tools_large_fleet_discovery.py, test_tools_large_fleet_invoke.py, test_tools_large_fleet_broadcast.py, and test_tools_large_fleet_heterogeneous.py, with slightly different minimum= rules in _scale_fleet_size that will drift. Move to tests/fixtures/scale.py.
spawn_sensor_fleet in tests/fixtures/devices.py:121-130 fans out via a single unbounded asyncio.gather over count spawns. At the current default of 200 it works; anyone who bumps DC_SCALE_FLEET_SIZE will hit etcd/NATS connection storms and likely fd limits before registration completes. A semaphore-bounded gather (max ~32 in flight) keeps the test honest at higher sizes.

atsyplikhin · 2026-05-29T03:18:38Z

Review — two findings in the current head not yet covered

The latest commit (fix: address PR 39 event and scale review feedback) resolves the earlier addendum's points: predicate eval is now executor-based and bounded, event names are validated before becoming subjects, the empty-registry late-joiner case uses the *.event.> wildcard + client-side filter, and fleet spawn is semaphore-bounded. Two issues remain that I didn't see raised elsewhere.

High — the new `EventDef` name validator breaks the MCP capability path

types.py adds a field_validator("name") rejecting anything not matching ^[A-Za-z0-9_-]+$. But DeviceConnectMCP.get_capabilities() still builds events with the event/ prefix:

# packages/device-connect-agent-tools/device_connect_agent_tools/mcp/device_connect_mcp.py:401
events.append(EventDef(name=f"event/{meta['name']}", ...))

Confirmed it raises at runtime against this branch:

ValidationError: Invalid event name 'event/objectDetected'. Must match '^[A-Za-z0-9_-]+$'.

So any MCP-adapter device exposing >=1 event now fails when capabilities are built (reached via _DefaultDriver.capabilities). No test covers get_capabilities() with events, so CI stays green while the adapter is broken.

Root cause is an inconsistency: enqueue_event and _RemoteInvoker.publish_event strip the event/ prefix before validating, but the EventDef validator rejects it. Fix options: pass meta['name'] (not f"event/{...}") at the MCP site, or have validate_event_name strip a leading event/ before checking. Either way, add a regression test that builds MCP capabilities with an event. (The surviving docstring example EventDef(name="event/objectDetected", ...) at types.py:130 also now documents a value the class rejects.)

Medium — live `event(<name>, <label>:x)` selectors silently drop the label constraint

The new EVENT_ONLY literal-name fast path in _event_subjects_for_selector:

if event_name_match and not _is_glob_name_match(event_name_match):
    return [f"device-connect.{conn.zone}.*.event.>"], None, {event_name_match}

The client-side filter (_message_event_name in {event_name_match}) keys on the name only. A Filter carries both name_match and key_filters, so a selector like event(motion_detected, safety:critical) takes this branch and delivers motion_detected from every device — the safety:critical constraint is dropped. On main this selector resolved through discover(), which is label-aware, so this is a behavior regression for name+label event selectors (the glob/no-name branch is fine — it resolves names via label-aware discovery). Either honor co-present label filters client-side, or explicitly reject name+label combos on the live path.

soupat requested a review from atsyplikhin May 21, 2026 17:43

test: clear registry after isolated integration tests

c90fd94

Large-fleet tests intentionally seed hundreds of registry rows. Clearing only before those tests let stale rows remain until TTL expiry and polluted later broad selector discovery tests in CI.

soupat requested a review from kavya-chennoju May 27, 2026 16:30

fix: address PR 39 event and scale review feedback

83e3d58

soupat added 2 commits May 29, 2026 08:33

fix: address PR 39 event review feedback

dc094a5

docs: update discovery test summary

317fbb2

soupat merged commit 5ea5d30 into main Jun 1, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: add scalable fleet integration coverage#39

test: add scalable fleet integration coverage#39
soupat merged 5 commits into
mainfrom
scalable-fleet-integration-tests

soupat commented May 21, 2026

Uh oh!

kavya-chennoju commented May 28, 2026

Uh oh!

atsyplikhin commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

soupat commented May 21, 2026

Summary

Testing

Uh oh!

kavya-chennoju commented May 28, 2026

Review addendum — additional medium-severity risks

Medium — Daemon thread per where evaluation leaks + blocks the event loop

Medium — Malformed NATS subjects from event names

Medium — Live event subscriptions miss events when registry is initially empty

Low — Test duplication + unbounded asyncio.gather in fleet spawn

Uh oh!

atsyplikhin commented May 29, 2026

Review — two findings in the current head not yet covered

High — the new EventDef name validator breaks the MCP capability path

Medium — live event(<name>, <label>:x) selectors silently drop the label constraint

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Medium — Daemon thread per `where` evaluation leaks + blocks the event loop

Low — Test duplication + unbounded `asyncio.gather` in fleet spawn

High — the new `EventDef` name validator breaks the MCP capability path

Medium — live `event(<name>, <label>:x)` selectors silently drop the label constraint