Source-route firmware runner on JobSource.REMOTE for compile (7a-2b)#560
Merged
Conversation
The runner picked up the ``FirmwareJob.source`` field added in 7a-2a (#556) but still walked the local subprocess pipeline for every job — so a REMOTE-source job sat marked RUNNING forever with no work happening. This wires the source-routed branch: ``_execute_job`` now early-returns to ``_execute_remote_job`` when ``source is JobSource.REMOTE``, which lives in a new ``remote_runner.py`` and: - builds the YAML bundle via the existing ``esphome bundle`` subprocess wrapper - looks up the live ``PeerLinkClient`` through the remote-build controller (matching the pin_sha256 stored on the job) - dispatches ``submit_job{target=compile}`` and waits for the receiver's terminal ``OFFLOADER_JOB_STATE_CHANGED`` - translates inbound ``OFFLOADER_JOB_OUTPUT`` / ``OFFLOADER_JOB_STATE_CHANGED`` events into local ``JOB_OUTPUT`` / ``JOB_PROGRESS`` / ``JOB_<terminal>`` fires, so ``follow_job`` and the firmware-tasks UI consume one event stream regardless of which CPU compiled the bytes - translates a local ``firmware/cancel`` flip into a wire ``cancel_job`` send; the receiver's resulting cancelled state-changed frame finalises the local job Scope is COMPILE-only — UPLOAD / INSTALL go through 7a-3 which layers ``download_artifacts`` + the local flash step on top. Anything else lands as FAILED with an explanatory error.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #560 +/- ##
==========================================
+ Coverage 99.12% 99.13% +0.01%
==========================================
Files 77 78 +1
Lines 10062 10219 +157
==========================================
+ Hits 9974 10131 +157
Misses 88 88
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Contributor
There was a problem hiding this comment.
Pull request overview
Adds the missing firmware-queue runner branch for FirmwareJob.source == JobSource.REMOTE, so REMOTE compile jobs are dispatched to a paired receiver via peer-link and translated back into the existing local JOB_* event stream.
Changes:
- Branch
FirmwareController._execute_job()to route REMOTE-source jobs into a dedicated remote runner. - Introduce
controllers/firmware/remote_runner.pyto bundle YAML, submit remote compile jobs, translate output/progress/terminal state, and forward cancel requests. - Add a comprehensive test suite covering success, progress parsing, stray-event filtering, and key failure/cancel paths.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| esphome_device_builder/controllers/firmware/controller.py | Adds REMOTE-source branching and a small _execute_remote_job() wrapper that delegates to the remote runner. |
| esphome_device_builder/controllers/firmware/remote_runner.py | Implements remote compile dispatch over peer-link plus event translation/cancel bridging. |
| tests/controllers/firmware/test_remote_runner.py | New end-to-end-ish tests driving the remote runner against a real EventBus with mocked bundle/client surfaces. |
PR review surfaced three improvements to the source-routed runner branch: 1. **DRY** — the per-line bookkeeping (append → trim → fire JOB_OUTPUT → fire JOB_PROGRESS on advance) was duplicated between the local subprocess streaming loop and the remote runner's ``_on_output`` listener. Extracted to ``helpers._ingest_output_line(job, bus, line)`` and consumed from both paths. ``_check_error`` stays inline in the local path because it mutates nonlocal flags the post-exit handler reads, and the remote path doesn't need it (the receiver surfaces a structured ``failed`` status instead of stderr-scrape territory). 2. **Tighten exception narrowing** — ``_lookup_open_peer_link_client`` raises only ``CommandError``, so swap the catch-all ``except Exception`` for ``except CommandError`` in both the dispatch and the cancel translation paths. A bare Exception catch was swallowing AttributeError / runtime regressions as "receiver not reachable: 'X' has no attribute 'Y'"; the narrower catch lets unexpected bugs propagate visibly. 3. **CancelledError → CANCELLED + re-raise** — mirror the local subprocess path's contract: a runner-task cancellation (controller stop) finalises the job as CANCELLED + fires ``JOB_CANCELLED`` before re-raising. Without this the shutdown path left REMOTE jobs in RUNNING with no terminal event for subscribers. Also added a regression test for the cancel-beats-receiver-completed race: if user Stop is registered before the receiver's ``completed`` frame lands, the job finalises as CANCELLED (user intent wins), matching the local subprocess contract. Test brittleness from the DRY refactor: one trim-spy test in ``test_execute_job_e2e.py`` was monkeypatching ``_trim_job_output`` on the ``controller`` module only; the mid-run call now resolves through ``helpers`` (via ``_ingest_output_line``), so the spy needs both surfaces patched to keep the mid-run-trim regression check honest. Captured in a comment at the patch site.
Three Copilot points + the codecov gap: 1. **Session-loss detection.** Subscribe to ``OFFLOADER_PEER_LINK_CLOSED`` matching this receiver's ``pin_sha256`` and feed a sibling future the wait loop consults. Without it, a peer-link drop after ``submit_job`` accepted but before a terminal frame landed wedged the runner forever — the firmware queue would refuse the next job until restart. 2. **Cancel intent wins on the failure path.** ``_fail_locally`` now checks ``_cancel_requested`` and routes through ``_finalize_cancelled`` when the user already clicked Stop before the runner reached a wire failure (bundle build, lookup, dispatch, session loss). Mirrors the local subprocess path's contract — a Stop that happens to race a failure should not show up as a red FAILED badge. 3. **Stub ``submit_job`` ack echoes caller's job_id.** Fixture ``_make_client`` was hard-coding the ack's ``job_id`` to ``"remote-1"`` while some tests pass different ids; a future runner change validating ``ack["job_id"] == job.job_id`` would have failed for stub-divergence reasons rather than real regressions. 4. **Coverage to 99%** with focused tests for the previously uncovered paths: missing ``source_pin_sha256``, ``FileNotFoundError`` from bundle build, ``BundleBuildError``, ``remote_build is None``, submit's ``PeerLinkNoSessionError``, peer-link-closed mid-build, cancel-on-missing-session, ``cancel_job`` raising ``PeerLinkNoSessionError``, runner-task shutdown via ``CancelledError``, receiver-initiated ``cancelled`` without local cancel, cancel-during-bundle-build, and the ``_execute_job`` branch-wiring test. ``submit_job`` ack echo is pinned by ``test_submit_job_ack_echoes_caller_job_id``. Plus small cleanups from review: - Deduped ``_send_cancel_or_finalise``'s two near-identical ``except`` blocks into one ``(CommandError, PeerLinkNoSessionError)`` clause. - Added a ``FirmwareController.bus`` public property so the remote runner stops reaching across ``controller._db.bus`` for the canonical bus. ``@property`` (not ``@cached_property``) because the test fixtures swap ``_db.bus`` to install spies and a cached value would shadow the swap.
Two duplication cleanups surfaced on second-pass review: 1. **Listener filter helper.** Both ``_on_output`` and ``_on_state`` opened with the same ``data["pin_sha256"] != pin or data["job_id"] != target_job_id`` filter. Extracted to ``_is_ours(data)`` closure so the two callbacks read as filter + delegate. The session- close callback stays inline because the wire payload doesn't carry a ``job_id`` (the close brings down every in-flight job on the link), so it's a different shape. 2. **Dead YAML seed lines.** ``patch_bundle`` mocks ``build_yaml_bundle`` to return fake bytes; the runner never reads the YAML off disk in any test, so the ``(tmp_path / "kitchen.yaml").write_text(...)`` line repeated 17 times across the file was vestigial copy- paste from an earlier draft that called the real subprocess. Removed all 17 (and the now-unused ``tmp_path`` / ``Path`` imports). Tests still cover the FileNotFoundError + BundleBuildError paths via ``monkeypatch.setattr(remote_runner, "build_yaml_bundle", AsyncMock(side_effect=...))``, so the on-disk shape isn't load-bearing on either side. Net: 11 insertions, 44 deletions; same 354 firmware tests pass, same 99% coverage on the firmware package.
Codecov flagged 3 missing lines: - ``_on_session_closed``'s pin-mismatch early return — a peer- link close for an unrelated receiver must not finalise this job. Covered by ``test_remote_compile_ignores_session_closed_for_other_pin``. - ``_send_cancel_or_finalise``'s ``remote_build is None`` branch — a receiver-controller teardown race where the user clicks Stop while ``DeviceBuilder`` has already cleared ``self.remote_build``. Covered by ``test_remote_compile_cancel_after_remote_build_torn_down_finalises_locally``, which simulates the race by clearing ``_db.remote_build`` while the runner is parked on the terminal wait, then registers a cancel. The latter test pins runner readiness on ``client._lookup_open_peer_link_client.call_count`` rather than a fixed ``await asyncio.sleep(0)`` count — the count check is a deterministic synchronisation point (the lookup is the last sync event before ``_await_terminal`` parks), where the sleep loop was timing-fragile in the presence of preceding tests that left coroutines scheduled on the loop. Coverage on ``remote_runner.py`` now 100% (134/134).
Two related cleanups:
1. **Replace 0.5s cancel poll with an ``asyncio.Event``.**
``FirmwareController.cancel`` now signals a per-job
``asyncio.Event`` registered on
``controller._cancel_events`` by the runner, and the
runner parks on
``asyncio.wait({terminal, session_lost, cancel_wait},
return_when=FIRST_COMPLETED)`` instead of polling
``_cancel_requested`` every half second.
Effect: user Stop click → runner wakes instantly (was up
to 500 ms latency). The local subprocess path doesn't
register an event — SIGTERM on the spawned subprocess is
already the wake signal. ``asyncio.get_running_loop()
.create_task`` is the modern wrapper for the event's
``wait()`` coroutine (``asyncio.ensure_future`` is the
legacy form).
2. **Proper sync in tests, not sleep loops.** The pattern
``for _ in range(4): await asyncio.sleep(0)`` was used
as a hand-tuned "wait until runner parks" everywhere —
timing-fragile (the park depth depended on residual
coroutines from prior tests), and a ~0.6 s
``await asyncio.sleep(0.6)`` was used to wait for the
wire cancel.
Replaced with two helpers in
``tests/controllers/firmware/test_remote_runner.py``:
- ``_wait_until_dispatched(client)`` polls on
``AsyncMock.await_count`` for ``submit_job`` — returns
the instant the runner moves past the submit await.
- ``_wait_for_wire_cancel(client)`` polls on
``AsyncMock.await_count`` for ``cancel_job`` with a 50 ms
granularity and a 1 s timeout — returns the instant
the runner's wire cancel lands.
Both raise ``AssertionError`` on timeout so a runner
regression that fails to reach the expected state shows
up as a clean failure rather than a hang.
``_request_remote_cancel(controller, job)`` mirrors the
production cancel handler's two-side signal
(``_cancel_requested.add`` + ``_cancel_events[job_id].set``)
so tests don't have to know which signal the runner
actually watches.
Test suite runtime: 4 s → 0.2 s for
``test_remote_runner.py`` (no more 0.5 s cancel polls).
``_wire_remote_build`` now returns ``(remote_build,
client)`` so callers can capture the client for sync.
Fixture-attribute updates required for the new
``_cancel_events`` field: the conftest factory's
``with_terminate=True`` kit + the ``_wire_real_queue``
helpers in ``test_execute_job_e2e.py`` and
``test_verify_chip_e2e.py`` (the cancel handler reads the
dict unconditionally so any test that exercises cancel
needs the field present, even when ``_cancel_events`` is
empty).
100% coverage on ``remote_runner.py``.
Codecov flagged ``cancel_event.set()`` in ``FirmwareController.cancel`` as uncovered — the existing remote-runner tests bypass the real WS handler via the ``_request_remote_cancel`` helper, so the handler's ``_cancel_events[job_id].set()`` branch was never exercised. Added ``test_firmware_cancel_handler_wakes_remote_runner_via_event`` which drives through the real handler: - wires the job into ``_jobs`` + ``_current_job`` so the handler's RUNNING branch validates cleanly, - awaits ``controller.cancel(job_id=...)`` for the real signal path, then - asserts the wire ``cancel_job`` lands and the receiver's cancelled frame finalises the job locally. Pins the wiring so a future refactor that drops the ``cancel_events`` lookup on the controller side surfaces as a test failure, not as a silent half-second cancel-latency regression. ``controller.py`` coverage now 100%.
The event-driven cancel landed an ordering bug: between
``_execute_job`` setting ``_current_job = job`` (sync) and
the runner reaching ``run_remote_compile_job`` to register
its ``cancel_event``, the WS cancel handler could run.
The handler ``_cancel_requested.add(job_id)``s, then looks
up ``_cancel_events[job_id]`` — which is still empty
because the runner hasn't arrived — and the ``set()`` is
skipped. The runner then parks on
``asyncio.wait({terminal, session_lost, cancel_wait})``
and hangs until the receiver eventually completes or the
peer-link heartbeat surfaces ``session_lost``.
The old polling shape was immune: each iteration read
``_cancel_requested`` directly, so the late-bind didn't
matter. The event-driven shape needs the registration
site to replay the missed signal:
cancel_event = asyncio.Event()
controller._cancel_events[job.job_id] = cancel_event
if job.job_id in controller._cancel_requested:
cancel_event.set()
Added ``test_remote_compile_cancel_before_runner_registers_event_still_fires``
as the regression: flips ``_cancel_requested`` before the
runner starts, asserts the runner still translates the
cancel onto the wire (rather than hanging on its newly-
created event).
Copilot review caught a leftover reference to the runner's
old polling shape: the helper's docstring described
``asyncio.wait(waiters, timeout=0.5)`` as the wake-up
mechanism, but ``_await_terminal`` is now fully event-driven
(parks on ``asyncio.wait({terminal, session_lost,
cancel_wait}, return_when=FIRST_COMPLETED)``). Rewritten
to describe what the runner actually does so a future
refactor can't read this and assume there's still a poll
loop to preserve.
This was referenced May 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this implement/fix?
Phase 7a-2b of issue #106 (transparent install). Branches
FirmwareController._execute_jobon theFirmwareJob.sourcefield added in 7a-2a (#556) so aJobSource.REMOTEcompile job actually dispatches to a paired receiver instead of sitting RUNNING with no work happening.Pieces
controllers/firmware/controller.py— adds the early branch in_execute_jobafterJOB_STARTEDfires and persist completes, before chip-verify:if job.source is JobSource.REMOTE: await self._execute_remote_job(job); return. Outerfinallystill runs the shared_current_job = None/_persist_jobs()cleanup so the local + remote paths share lifecycle bookkeeping.controllers/firmware/remote_runner.py(new) —run_remote_compile_job:helpers.config_bundle.build_yaml_bundle(the sameesphome bundlesubprocess wrapperremote_build/submit_jobuses).PeerLinkClientviaRemoteBuildController._lookup_open_peer_link_client(job.source_pin_sha256). A missing pairing / closed session / unspawned client fails the job locally with the lookup's error text.OFFLOADER_JOB_OUTPUT/OFFLOADER_JOB_STATE_CHANGEDbefore dispatchingsubmit_jobso an immediaterunning/outputframe can't outrace the subscription. Listener attach + detach lives in onewith bus.listening(...)block so every early-return failure path releases the subscriptions.pin_sha256andjob_idso stray frames from another in-flight remote job land in a different runner.client.submit_job(job_id=job.job_id, target="compile", bundle_bytes=...). The receiver echoes the offloader'sjob_idback on every fan-out frame per 5c contract, so the runner's filter matches without any side-channel correlation cache.JOB_OUTPUT/JOB_PROGRESS/JOB_<terminal>fires.follow_joband the firmware-tasks UI consume one event stream regardless of which CPU compiled the bytes.firmware/cancelflip (_cancel_requested) into a wirecancel_jobsend; the receiver's resultingjob_state_changed{status: cancelled}finalises the local job through the existing_finalize_cancelledhelper. Matches the local subprocess path's contract: if the user clicked Stop, the job is CANCELLED regardless of which terminal status the receiver eventually fired.Receiver-supplied
error_messagerides intojob.erroronfailed; the existing frontenderrorrendering shows it without any special-case code.Scope
COMPILE-only this slice. UPLOAD / INSTALL land in 7a-3, which layers
download_artifacts+ the local flash step on top of the same dispatch path. A REMOTE job with a non-COMPILEjob_typelands as FAILED with an explanatory error.Tests
8 new tests in
tests/controllers/firmware/test_remote_runner.pydriving the runner against a realEventBuswithAsyncMockshims for the bundle build + peer-link client surfaces:JOB_OUTPUT→ terminalcompletedfiresJOB_COMPLETED._PROGRESS_PATTERNSfires a localJOB_PROGRESS.job_iddon't leak in or terminate this runner.failedtranslates with the suppliederror_message.submit_jobrejection (accepted=False) finalises locally with the rejection reason.job_typefor a REMOTE job is rejected at the runner's top._cancel_requested→ runner sendsclient.cancel_job(...)→ receiver's cancelled frame finalises as CANCELLED.340 tests in the firmware suite pass; full repo run shows only the three pre-existing ESP32-H2 no-wifi YAML-generation failures on main (unrelated).
Related issue or feature (if applicable):
Types of changes
bugfixnew-featureenhancementbreaking-changerefactordocsmaintenancecidependenciesFrontend coordination
This slice is internal-only: no new WS commands, no wire-shape changes, no new frontend-visible fields beyond
source/source_labelalready shipped in #556. The install-dialog "Building on {receiver_label}" sub-line lands with 7a-3 when install routes throughpick_build_path.Checklist
ruff,codespell, yaml/json/python checks).tests/where applicable.components.jsonhas not been hand-edited (regenerate viascript/sync_components.pyif a sync is needed).docs/ARCHITECTURE.mdand/ordocs/API.md.