Skip to content

Source-route firmware runner on JobSource.REMOTE for compile (7a-2b)#560

Merged
bdraco merged 9 commits into
mainfrom
7a-2b-source-routed-runner
May 11, 2026
Merged

Source-route firmware runner on JobSource.REMOTE for compile (7a-2b)#560
bdraco merged 9 commits into
mainfrom
7a-2b-source-routed-runner

Conversation

@bdraco
Copy link
Copy Markdown
Member

@bdraco bdraco commented May 11, 2026

What does this implement/fix?

Phase 7a-2b of issue #106 (transparent install). Branches FirmwareController._execute_job on the FirmwareJob.source field added in 7a-2a (#556) so a JobSource.REMOTE compile job actually dispatches to a paired receiver instead of sitting RUNNING with no work happening.

Pieces

  • controllers/firmware/controller.py — adds the early branch in _execute_job after JOB_STARTED fires and persist completes, before chip-verify: if job.source is JobSource.REMOTE: await self._execute_remote_job(job); return. Outer finally still runs the shared _current_job = None / _persist_jobs() cleanup so the local + remote paths share lifecycle bookkeeping.
  • controllers/firmware/remote_runner.py (new) — run_remote_compile_job:
    • Builds the YAML bundle via the existing helpers.config_bundle.build_yaml_bundle (the same esphome bundle subprocess wrapper remote_build/submit_job uses).
    • Looks up the live PeerLinkClient via RemoteBuildController._lookup_open_peer_link_client(job.source_pin_sha256). A missing pairing / closed session / unspawned client fails the job locally with the lookup's error text.
    • Subscribes to OFFLOADER_JOB_OUTPUT / OFFLOADER_JOB_STATE_CHANGED before dispatching submit_job so an immediate running / output frame can't outrace the subscription. Listener attach + detach lives in one with bus.listening(...) block so every early-return failure path releases the subscriptions.
    • Filters inbound frames by both pin_sha256 and job_id so stray frames from another in-flight remote job land in a different runner.
    • Dispatches client.submit_job(job_id=job.job_id, target="compile", bundle_bytes=...). The receiver echoes the offloader's job_id back on every fan-out frame per 5c contract, so the runner's filter matches without any side-channel correlation cache.
    • Translates inbound output / state-changed events into local JOB_OUTPUT / JOB_PROGRESS / JOB_<terminal> fires. follow_job and the firmware-tasks UI consume one event stream regardless of which CPU compiled the bytes.
    • Translates a local firmware/cancel flip (_cancel_requested) into a wire cancel_job send; the receiver's resulting job_state_changed{status: cancelled} finalises the local job through the existing _finalize_cancelled helper. Matches the local subprocess path's contract: if the user clicked Stop, the job is CANCELLED regardless of which terminal status the receiver eventually fired.

Receiver-supplied error_message rides into job.error on failed; the existing frontend error rendering shows it without any special-case code.

Scope

COMPILE-only this slice. UPLOAD / INSTALL land in 7a-3, which layers download_artifacts + the local flash step on top of the same dispatch path. A REMOTE job with a non-COMPILE job_type lands as FAILED with an explanatory error.

Tests

8 new tests in tests/controllers/firmware/test_remote_runner.py driving the runner against a real EventBus with AsyncMock shims for the bundle build + peer-link client surfaces:

  • Happy path: bundle build → submit_job accepts → output frames re-fire as JOB_OUTPUT → terminal completed fires JOB_COMPLETED.
  • Progress translation: a wire output line matching _PROGRESS_PATTERNS fires a local JOB_PROGRESS.
  • Stray-event filtering: frames for a different job_id don't leak in or terminate this runner.
  • Receiver failed translates with the supplied error_message.
  • submit_job rejection (accepted=False) finalises locally with the rejection reason.
  • Receiver unreachable (lookup error) finalises locally with the lookup's error text.
  • Non-COMPILE job_type for a REMOTE job is rejected at the runner's top.
  • Local cancel flips _cancel_requested → runner sends client.cancel_job(...) → receiver's cancelled frame finalises as CANCELLED.

340 tests in the firmware suite pass; full repo run shows only the three pre-existing ESP32-H2 no-wifi YAML-generation failures on main (unrelated).

Related issue or feature (if applicable):

Types of changes

  • Bugfix (non-breaking change which fixes an issue) — bugfix
  • New feature (non-breaking change which adds functionality) — new-feature
  • Enhancement to an existing feature — enhancement
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) — breaking-change
  • Refactor (no behaviour change) — refactor
  • Documentation only — docs
  • Maintenance / chore — maintenance
  • CI / workflow change — ci
  • Dependencies bump — dependencies

Frontend coordination

This slice is internal-only: no new WS commands, no wire-shape changes, no new frontend-visible fields beyond source / source_label already shipped in #556. The install-dialog "Building on {receiver_label}" sub-line lands with 7a-3 when install routes through pick_build_path.

  • No frontend change needed
  • Companion frontend PR: esphome/device-builder-dashboard-frontend#

Checklist

  • The code change is tested and works locally.
  • Pre-commit hooks pass (ruff, codespell, yaml/json/python checks).
  • Tests have been added or updated under tests/ where applicable.
  • components.json has not been hand-edited (regenerate via script/sync_components.py if a sync is needed).
  • Architecture-level changes are reflected in docs/ARCHITECTURE.md and/or docs/API.md.

The runner picked up the ``FirmwareJob.source`` field added in
7a-2a (#556) but still walked the local subprocess pipeline for
every job — so a REMOTE-source job sat marked RUNNING forever
with no work happening. This wires the source-routed branch:
``_execute_job`` now early-returns to ``_execute_remote_job``
when ``source is JobSource.REMOTE``, which lives in a new
``remote_runner.py`` and:

- builds the YAML bundle via the existing
  ``esphome bundle`` subprocess wrapper
- looks up the live ``PeerLinkClient`` through the remote-build
  controller (matching the pin_sha256 stored on the job)
- dispatches ``submit_job{target=compile}`` and waits for the
  receiver's terminal ``OFFLOADER_JOB_STATE_CHANGED``
- translates inbound ``OFFLOADER_JOB_OUTPUT`` /
  ``OFFLOADER_JOB_STATE_CHANGED`` events into local
  ``JOB_OUTPUT`` / ``JOB_PROGRESS`` / ``JOB_<terminal>``
  fires, so ``follow_job`` and the firmware-tasks UI consume
  one event stream regardless of which CPU compiled the bytes
- translates a local ``firmware/cancel`` flip into a wire
  ``cancel_job`` send; the receiver's resulting cancelled
  state-changed frame finalises the local job

Scope is COMPILE-only — UPLOAD / INSTALL go through 7a-3 which
layers ``download_artifacts`` + the local flash step on top.
Anything else lands as FAILED with an explanatory error.
Copilot AI review requested due to automatic review settings May 11, 2026 00:04
@github-actions github-actions Bot added the new-feature New feature label May 11, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 11, 2026

Merging this PR will not alter performance

✅ 12 untouched benchmarks


Comparing 7a-2b-source-routed-runner (daa26df) with main (ed9e994)

Open in CodSpeed

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.13%. Comparing base (ed9e994) to head (daa26df).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #560      +/-   ##
==========================================
+ Coverage   99.12%   99.13%   +0.01%     
==========================================
  Files          77       78       +1     
  Lines       10062    10219     +157     
==========================================
+ Hits         9974    10131     +157     
  Misses         88       88              
Flag Coverage Δ
py3.12 99.10% <100.00%> (+0.01%) ⬆️
py3.14 99.13% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
..._device_builder/controllers/firmware/controller.py 100.00% <100.00%> (ø)
...ome_device_builder/controllers/firmware/helpers.py 100.00% <100.00%> (ø)
...vice_builder/controllers/firmware/remote_runner.py 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the missing firmware-queue runner branch for FirmwareJob.source == JobSource.REMOTE, so REMOTE compile jobs are dispatched to a paired receiver via peer-link and translated back into the existing local JOB_* event stream.

Changes:

  • Branch FirmwareController._execute_job() to route REMOTE-source jobs into a dedicated remote runner.
  • Introduce controllers/firmware/remote_runner.py to bundle YAML, submit remote compile jobs, translate output/progress/terminal state, and forward cancel requests.
  • Add a comprehensive test suite covering success, progress parsing, stray-event filtering, and key failure/cancel paths.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
esphome_device_builder/controllers/firmware/controller.py Adds REMOTE-source branching and a small _execute_remote_job() wrapper that delegates to the remote runner.
esphome_device_builder/controllers/firmware/remote_runner.py Implements remote compile dispatch over peer-link plus event translation/cancel bridging.
tests/controllers/firmware/test_remote_runner.py New end-to-end-ish tests driving the remote runner against a real EventBus with mocked bundle/client surfaces.

Comment thread esphome_device_builder/controllers/firmware/remote_runner.py Outdated
Comment thread esphome_device_builder/controllers/firmware/remote_runner.py
Comment thread tests/controllers/firmware/test_remote_runner.py Outdated
bdraco added 4 commits May 10, 2026 19:15
PR review surfaced three improvements to the source-routed runner
branch:

1. **DRY** — the per-line bookkeeping (append → trim → fire
   JOB_OUTPUT → fire JOB_PROGRESS on advance) was duplicated
   between the local subprocess streaming loop and the remote
   runner's ``_on_output`` listener. Extracted to
   ``helpers._ingest_output_line(job, bus, line)`` and consumed
   from both paths. ``_check_error`` stays inline in the local
   path because it mutates nonlocal flags the post-exit
   handler reads, and the remote path doesn't need it (the
   receiver surfaces a structured ``failed`` status instead of
   stderr-scrape territory).

2. **Tighten exception narrowing** — ``_lookup_open_peer_link_client``
   raises only ``CommandError``, so swap the catch-all
   ``except Exception`` for ``except CommandError`` in both
   the dispatch and the cancel translation paths. A bare
   Exception catch was swallowing AttributeError / runtime
   regressions as "receiver not reachable: 'X' has no
   attribute 'Y'"; the narrower catch lets unexpected bugs
   propagate visibly.

3. **CancelledError → CANCELLED + re-raise** — mirror the
   local subprocess path's contract: a runner-task cancellation
   (controller stop) finalises the job as CANCELLED + fires
   ``JOB_CANCELLED`` before re-raising. Without this the
   shutdown path left REMOTE jobs in RUNNING with no terminal
   event for subscribers.

Also added a regression test for the
cancel-beats-receiver-completed race: if user Stop is registered
before the receiver's ``completed`` frame lands, the job
finalises as CANCELLED (user intent wins), matching the local
subprocess contract.

Test brittleness from the DRY refactor: one trim-spy test in
``test_execute_job_e2e.py`` was monkeypatching ``_trim_job_output``
on the ``controller`` module only; the mid-run call now resolves
through ``helpers`` (via ``_ingest_output_line``), so the spy
needs both surfaces patched to keep the mid-run-trim regression
check honest. Captured in a comment at the patch site.
Three Copilot points + the codecov gap:

1. **Session-loss detection.** Subscribe to
   ``OFFLOADER_PEER_LINK_CLOSED`` matching this receiver's
   ``pin_sha256`` and feed a sibling future the wait loop
   consults. Without it, a peer-link drop after ``submit_job``
   accepted but before a terminal frame landed wedged the
   runner forever — the firmware queue would refuse the next
   job until restart.

2. **Cancel intent wins on the failure path.** ``_fail_locally``
   now checks ``_cancel_requested`` and routes through
   ``_finalize_cancelled`` when the user already clicked Stop
   before the runner reached a wire failure (bundle build,
   lookup, dispatch, session loss). Mirrors the local
   subprocess path's contract — a Stop that happens to race a
   failure should not show up as a red FAILED badge.

3. **Stub ``submit_job`` ack echoes caller's job_id.** Fixture
   ``_make_client`` was hard-coding the ack's ``job_id`` to
   ``"remote-1"`` while some tests pass different ids; a
   future runner change validating ``ack["job_id"] == job.job_id``
   would have failed for stub-divergence reasons rather than
   real regressions.

4. **Coverage to 99%** with focused tests for the previously
   uncovered paths: missing ``source_pin_sha256``,
   ``FileNotFoundError`` from bundle build,
   ``BundleBuildError``, ``remote_build is None``, submit's
   ``PeerLinkNoSessionError``, peer-link-closed mid-build,
   cancel-on-missing-session, ``cancel_job`` raising
   ``PeerLinkNoSessionError``, runner-task shutdown via
   ``CancelledError``, receiver-initiated ``cancelled``
   without local cancel, cancel-during-bundle-build, and the
   ``_execute_job`` branch-wiring test. ``submit_job`` ack
   echo is pinned by ``test_submit_job_ack_echoes_caller_job_id``.

Plus small cleanups from review:

- Deduped ``_send_cancel_or_finalise``'s two near-identical
  ``except`` blocks into one ``(CommandError,
  PeerLinkNoSessionError)`` clause.

- Added a ``FirmwareController.bus`` public property so the
  remote runner stops reaching across ``controller._db.bus``
  for the canonical bus. ``@property`` (not ``@cached_property``)
  because the test fixtures swap ``_db.bus`` to install spies
  and a cached value would shadow the swap.
Two duplication cleanups surfaced on second-pass review:

1. **Listener filter helper.** Both ``_on_output`` and
   ``_on_state`` opened with the same
   ``data["pin_sha256"] != pin or data["job_id"] != target_job_id``
   filter. Extracted to ``_is_ours(data)`` closure so the
   two callbacks read as filter + delegate. The session-
   close callback stays inline because the wire payload
   doesn't carry a ``job_id`` (the close brings down every
   in-flight job on the link), so it's a different shape.

2. **Dead YAML seed lines.** ``patch_bundle`` mocks
   ``build_yaml_bundle`` to return fake bytes; the runner
   never reads the YAML off disk in any test, so the
   ``(tmp_path / "kitchen.yaml").write_text(...)`` line
   repeated 17 times across the file was vestigial copy-
   paste from an earlier draft that called the real
   subprocess. Removed all 17 (and the now-unused
   ``tmp_path`` / ``Path`` imports). Tests still cover the
   FileNotFoundError + BundleBuildError paths via
   ``monkeypatch.setattr(remote_runner, "build_yaml_bundle",
   AsyncMock(side_effect=...))``, so the on-disk shape
   isn't load-bearing on either side.

Net: 11 insertions, 44 deletions; same 354 firmware tests
pass, same 99% coverage on the firmware package.
Codecov flagged 3 missing lines:

- ``_on_session_closed``'s pin-mismatch early return — a peer-
  link close for an unrelated receiver must not finalise this
  job. Covered by
  ``test_remote_compile_ignores_session_closed_for_other_pin``.

- ``_send_cancel_or_finalise``'s ``remote_build is None``
  branch — a receiver-controller teardown race where the user
  clicks Stop while ``DeviceBuilder`` has already cleared
  ``self.remote_build``. Covered by
  ``test_remote_compile_cancel_after_remote_build_torn_down_finalises_locally``,
  which simulates the race by clearing ``_db.remote_build``
  while the runner is parked on the terminal wait, then
  registers a cancel.

The latter test pins runner readiness on
``client._lookup_open_peer_link_client.call_count`` rather than
a fixed ``await asyncio.sleep(0)`` count — the count check is a
deterministic synchronisation point (the lookup is the last
sync event before ``_await_terminal`` parks), where the sleep
loop was timing-fragile in the presence of preceding tests
that left coroutines scheduled on the loop.

Coverage on ``remote_runner.py`` now 100% (134/134).
Two related cleanups:

1. **Replace 0.5s cancel poll with an ``asyncio.Event``.**
   ``FirmwareController.cancel`` now signals a per-job
   ``asyncio.Event`` registered on
   ``controller._cancel_events`` by the runner, and the
   runner parks on
   ``asyncio.wait({terminal, session_lost, cancel_wait},
   return_when=FIRST_COMPLETED)`` instead of polling
   ``_cancel_requested`` every half second.

   Effect: user Stop click → runner wakes instantly (was up
   to 500 ms latency). The local subprocess path doesn't
   register an event — SIGTERM on the spawned subprocess is
   already the wake signal. ``asyncio.get_running_loop()
   .create_task`` is the modern wrapper for the event's
   ``wait()`` coroutine (``asyncio.ensure_future`` is the
   legacy form).

2. **Proper sync in tests, not sleep loops.** The pattern
   ``for _ in range(4): await asyncio.sleep(0)`` was used
   as a hand-tuned "wait until runner parks" everywhere —
   timing-fragile (the park depth depended on residual
   coroutines from prior tests), and a ~0.6 s
   ``await asyncio.sleep(0.6)`` was used to wait for the
   wire cancel.

   Replaced with two helpers in
   ``tests/controllers/firmware/test_remote_runner.py``:

   - ``_wait_until_dispatched(client)`` polls on
     ``AsyncMock.await_count`` for ``submit_job`` — returns
     the instant the runner moves past the submit await.
   - ``_wait_for_wire_cancel(client)`` polls on
     ``AsyncMock.await_count`` for ``cancel_job`` with a 50 ms
     granularity and a 1 s timeout — returns the instant
     the runner's wire cancel lands.

   Both raise ``AssertionError`` on timeout so a runner
   regression that fails to reach the expected state shows
   up as a clean failure rather than a hang.

   ``_request_remote_cancel(controller, job)`` mirrors the
   production cancel handler's two-side signal
   (``_cancel_requested.add`` + ``_cancel_events[job_id].set``)
   so tests don't have to know which signal the runner
   actually watches.

   Test suite runtime: 4 s → 0.2 s for
   ``test_remote_runner.py`` (no more 0.5 s cancel polls).
   ``_wire_remote_build`` now returns ``(remote_build,
   client)`` so callers can capture the client for sync.

Fixture-attribute updates required for the new
``_cancel_events`` field: the conftest factory's
``with_terminate=True`` kit + the ``_wire_real_queue``
helpers in ``test_execute_job_e2e.py`` and
``test_verify_chip_e2e.py`` (the cancel handler reads the
dict unconditionally so any test that exercises cancel
needs the field present, even when ``_cancel_events`` is
empty).

100% coverage on ``remote_runner.py``.
Codecov flagged ``cancel_event.set()`` in
``FirmwareController.cancel`` as uncovered — the existing
remote-runner tests bypass the real WS handler via the
``_request_remote_cancel`` helper, so the handler's
``_cancel_events[job_id].set()`` branch was never exercised.

Added ``test_firmware_cancel_handler_wakes_remote_runner_via_event``
which drives through the real handler:

- wires the job into ``_jobs`` + ``_current_job`` so the
  handler's RUNNING branch validates cleanly,
- awaits ``controller.cancel(job_id=...)`` for the real
  signal path, then
- asserts the wire ``cancel_job`` lands and the receiver's
  cancelled frame finalises the job locally.

Pins the wiring so a future refactor that drops the
``cancel_events`` lookup on the controller side surfaces as
a test failure, not as a silent half-second cancel-latency
regression. ``controller.py`` coverage now 100%.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread esphome_device_builder/controllers/firmware/remote_runner.py
Comment thread tests/controllers/firmware/test_remote_runner.py Outdated
bdraco added 2 commits May 10, 2026 19:54
The event-driven cancel landed an ordering bug: between
``_execute_job`` setting ``_current_job = job`` (sync) and
the runner reaching ``run_remote_compile_job`` to register
its ``cancel_event``, the WS cancel handler could run.
The handler ``_cancel_requested.add(job_id)``s, then looks
up ``_cancel_events[job_id]`` — which is still empty
because the runner hasn't arrived — and the ``set()`` is
skipped. The runner then parks on
``asyncio.wait({terminal, session_lost, cancel_wait})``
and hangs until the receiver eventually completes or the
peer-link heartbeat surfaces ``session_lost``.

The old polling shape was immune: each iteration read
``_cancel_requested`` directly, so the late-bind didn't
matter. The event-driven shape needs the registration
site to replay the missed signal:

    cancel_event = asyncio.Event()
    controller._cancel_events[job.job_id] = cancel_event
    if job.job_id in controller._cancel_requested:
        cancel_event.set()

Added ``test_remote_compile_cancel_before_runner_registers_event_still_fires``
as the regression: flips ``_cancel_requested`` before the
runner starts, asserts the runner still translates the
cancel onto the wire (rather than hanging on its newly-
created event).
Copilot review caught a leftover reference to the runner's
old polling shape: the helper's docstring described
``asyncio.wait(waiters, timeout=0.5)`` as the wake-up
mechanism, but ``_await_terminal`` is now fully event-driven
(parks on ``asyncio.wait({terminal, session_lost,
cancel_wait}, return_when=FIRST_COMPLETED)``). Rewritten
to describe what the runner actually does so a future
refactor can't read this and assume there's still a poll
loop to preserve.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

new-feature New feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants