Add per-device reachability subscription for the drawer#329
Conversation
Codecov Report

Coverage Diff:

| | main | #329 | +/- |
|---|---|---|---|
| Coverage | 98.57% | 98.60% | +0.02% |
| Files | 45 | 46 | +1 |
| Lines | 5127 | 5295 | +168 |
| Hits | 5054 | 5221 | +167 |
| Misses | 73 | 74 | +1 |
Pull request overview
This PR adds a per-device WebSocket stream that provides detailed “reachability” freshness data (mDNS / ping / MQTT last-seen + ping RTT) for the device drawer, backed by a new in-memory ReachabilityTracker that DeviceStateMonitor updates as observations occur.
Changes:
- Add `devices/subscribe_reachability` WS command to stream per-device reachability snapshots while the drawer is open.
- Introduce `ReachabilityTracker` and wire it through `DevicesController` ↔ `DeviceStateMonitor`, including capturing ping RTT (`icmplib.Host.min_rtt`) and clearing reachability on mDNS `Removed`.
- Add `ReachabilitySource` (`StrEnum`) and comprehensive unit/integration tests covering the tracker, state-monitor integration, and WS handler behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `esphome_device_builder/controllers/devices/controller.py` | Adds `ReachabilityTracker`, new WS subscribe handler, refresh loop, and bus-forwarding for reachability snapshots. |
| `esphome_device_builder/controllers/_device_state_monitor.py` | Wires optional reachability tracking into the apply/ping/mDNS-`Removed` paths; adds typed `priority_for` and an mDNS refresh helper. |
| `esphome_device_builder/controllers/_reachability_tracker.py` | New tracker for per-signal last-seen and ping RTT plus snapshot serialization. |
| `esphome_device_builder/models/devices.py` | Adds `ReachabilitySource` enum. |
| `esphome_device_builder/models/common.py` | Adds `EventType.DEVICE_REACHABILITY`. |
| `tests/controllers/devices/test_subscribe_reachability.py` | New end-to-end tests for the WS reachability subscription, filtering, cancellation, and refresh gating. |
| `tests/test_reachability_tracker.py` | New unit tests for tracker observe/clear/snapshot semantics and callback behavior. |
| `tests/test_state_monitor_reachability.py` | New integration tests for state-monitor ↔ tracker hand-offs, ping RTT capture, and mDNS `Removed` clearing. |
| `tests/test_state_monitor_lifecycle.py` | Updates the monitor fixture to include the `_reachability` slot. |
| `tests/test_probe_device.py` | Updates the monitor fixtures to include the `_reachability` slot. |
Adds a WS subscribe stream that pushes per-signal freshness for one device — mDNS / ICMP ping / MQTT last-seen plus the most recent ping RTT — so the dashboard's right-side drawer can show a Reachability section without bloating the broadcast event channel for every connected client. Reuses the existing per-message_id event tagging and the ``devices/stop_stream`` cancellation primitive; no new dispatcher work needed.

* New ``ReachabilityTracker`` class (``controllers/_reachability_tracker.py``) — owns four maps (mDNS / ping / MQTT last-seen, ping RTT) and the wire-shape snapshot serializer. (A sketch of its shape follows below.)
* ``DeviceStateMonitor`` delegates: ``apply()`` records an observation under the source on ``ONLINE``; ``_ping_device`` captures ``Host.min_rtt`` (previously discarded); the mDNS ``Removed`` branch clears the tracker.
* New ``ReachabilitySource`` ``StrEnum`` so ``priority_for()`` returns a typed value — typos in caller comparisons fail loudly instead of falling silently to ``UNKNOWN``.
* New ``EventType.DEVICE_REACHABILITY`` and ``DevicesController.subscribe_reachability`` WS command. Filters bus events by device name in a closure; schedules a 60s mDNS force-refresh task while subscribed AND the active source is mDNS.
* ``_build_reachability_snapshot`` consolidates the device-lookup + tracker-snapshot pairing shared between the initial seed and the per-event push.

Tests: 23 new unit + integration tests covering the tracker, state-monitor integration, and end-to-end WS subscribe. Full suite: 1438 passed, 3 skipped.
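A minimal sketch of the tracker's shape, reflecting the post-self-review behavior described further down (``record_ping_rtt`` stays silent) and the initial mDNS stamping that later commits replace with cache reads. Field names and the source dispatch are illustrative, not the PR's exact code:

```python
import time
from typing import Callable


class ReachabilityTracker:
    """Per-device last-seen per signal, plus the most recent ping RTT."""

    def __init__(self, on_observation: Callable[[str], None] | None = None) -> None:
        # The four maps the PR description mentions; names are illustrative.
        self._mdns_last_seen: dict[str, float] = {}
        self._ping_last_seen: dict[str, float] = {}
        self._mqtt_last_seen: dict[str, float] = {}
        self._ping_rtt_ms: dict[str, float] = {}
        self._on_observation = on_observation

    def observe(self, name: str, source: str) -> None:
        # Stamp the source's last-seen map, then fan out to the bus callback.
        maps = {
            "mdns": self._mdns_last_seen,
            "ping": self._ping_last_seen,
            "mqtt": self._mqtt_last_seen,
        }
        maps[source][name] = time.time()
        if self._on_observation is not None:
            self._on_observation(name)

    def record_ping_rtt(self, name: str, rtt_ms: float) -> None:
        # No callback here: the ping path always pairs this with an
        # apply(name, ONLINE, "ping") that fires the observation once.
        self._ping_rtt_ms[name] = rtt_ms

    def clear(self, name: str) -> None:
        # Pruned on mDNS Removed and on _on_scan_change(REMOVED).
        for m in (self._mdns_last_seen, self._ping_last_seen,
                  self._mqtt_last_seen, self._ping_rtt_ms):
            m.pop(name, None)

    def snapshot(self, name: str) -> dict[str, float | None]:
        # Wire-shape serialization for the reachability_state event.
        return {
            "mdns_last_seen": self._mdns_last_seen.get(name),
            "ping_last_seen": self._ping_last_seen.get(name),
            "mqtt_last_seen": self._mqtt_last_seen.get(name),
            "ping_rtt_ms": self._ping_rtt_ms.get(name),
        }
```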
Self-review pass on the new reachability subscription. Three issues caught + fixed:

* ``_on_scan_change(REMOVED)`` did not clear the tracker. The four per-signal maps would accumulate one entry per device ever in the catalog — the mDNS browser's ``Removed`` only catches the broadcast-gone case, not YAML deletion. Clear the entry alongside the other REMOVED-side cleanup.
* ``record_ping_rtt`` fired ``on_observation`` even though the state monitor's ping path always pairs it with an ``apply(name, ONLINE, "ping")`` call that fans out through ``observe`` and fires the same callback once. Net effect: two bus events for every successful ping. Drop the redundant fire.
* ``DevicesController._reachability`` was unset on bypass-init test controllers, so any test driving ``_on_scan_change`` hit ``AttributeError``. Wire ``ReachabilityTracker()`` into the shared ``make_controller`` fixture.

Coverage: 87.30% on the new lines → 99% (the six remaining lines are pre-existing in helper methods unrelated to this PR).

New tests:

* ``test_refresh_mdns_*`` — no-zeroconf no-op; successful resolve flips ONLINE + bumps mDNS last-seen; swallows resolve exceptions; empty-resolve no-op.
* ``test_dispatch_removed_event_clears_reachability_tracker`` — drives the actual mDNS browser dispatch closure (rather than replaying the branch by hand).
* ``test_subscribe_without_client_returns_silently`` — early-return guard on a missing client.
* ``test_observation_for_deleted_device_is_dropped`` — race between observation and deletion.
* ``test_observation_fires_bus_event_for_known_device`` — full production path through ``tracker.observe`` → ``_on_reachability_observation`` → ``bus.fire``.
* ``test_scan_change_removed_clears_tracker`` — the new cleanup branch.
* ``test_record_ping_rtt_sets_field_without_firing_callback`` — pins the no-fire contract (sketched below).
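Under the illustrative tracker shape sketched above, the no-fire contract from that last test pins down roughly like this (the device name is hypothetical):

```python
def test_record_ping_rtt_sets_field_without_firing_callback() -> None:
    fired: list[str] = []
    tracker = ReachabilityTracker(on_observation=fired.append)

    tracker.record_ping_rtt("garage-door", 4.2)

    assert tracker.snapshot("garage-door")["ping_rtt_ms"] == 4.2
    # apply(name, ONLINE, "ping") is the single callback source; the RTT
    # setter alone must stay silent or every ping double-fires the bus.
    assert fired == []
```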
Copilot caught the bug it was meant to: the new ``DEVICE_REACHABILITY`` event was being broadcast to every connected ``subscribe_events`` client because the handler still listened to ``list(EventType)``. Per-device freshness pings fire on every mDNS announce, every successful ICMP, every MQTT discover response — broadcasting them defeats the point of having a per-device subscription and would saturate the bounded queue under fleet load (the terminator triggers and tears the connection down). Filter ``DEVICE_REACHABILITY`` out of the broadcast event-types list with an explicit comment about why (see the sketch below). The ``devices/subscribe_reachability`` handler stays the only consumer.

Also fixes a handful of stale terminology references Copilot flagged: the actual command is ``devices/subscribe_reachability``, not ``subscribe_device_reachability``. Updated comments in ``_reachability_tracker.py``, ``devices/controller.py``, and ``models/common.py``. The ``ReachabilitySource`` docstring no longer claims priority is encoded by member declaration order — the actual precedence lives in ``_SOURCE_PRIORITY``.

New test pins the broadcast exclusion: fires a ``DEVICE_REACHABILITY`` into the bus, asserts the broadcast subscriber's ``send_event`` calls do not include it, plus a sanity check on ``DEVICE_UPDATED`` so a regression that drops everything still surfaces.
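A minimal sketch of the exclusion, assuming the broadcast handler builds its event-type list from ``EventType`` (the list name is an assumption):

```python
from esphome_device_builder.models.common import EventType

# DEVICE_REACHABILITY is deliberately NOT broadcast: it fires on every mDNS
# announce / ICMP reply / MQTT discover response, so fanning it out to every
# subscribe_events client would saturate the bounded queue under fleet load.
# devices/subscribe_reachability is its only consumer.
BROADCAST_EVENT_TYPES = [
    t for t in EventType if t is not EventType.DEVICE_REACHABILITY
]
```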
Surfaces the per-signal freshness the new device-builder backend pushes over ``devices/subscribe_reachability``: one row per channel the device has been observed on (mDNS / Ping / MQTT) with the localized "N seconds ago" and, for the active source, an "active" badge. The Ping row appends the round-trip time in ms.

The subscription is gated on ``(drawer open && device set && api ready)``: it opens when those line up and closes (sends ``devices/stop_stream``) when any of them flips. While subscribed, a 500ms ``setInterval`` ticks a state counter so the displayed relative time advances between server pushes; the backend force-refreshes mDNS every 60s, so the rendered age stays honest.

* New ``ReachabilityStateEvent`` / ``ReachabilitySource`` / ``ReachabilitySubscription`` types in ``src/api/types.ts``.
* New ``EsphomeApi.subscribeDeviceReachability(name, callback)`` returns a handle whose ``unsubscribe()`` sends ``devices/stop_stream``. Reuses the existing ``_eventSubscriptions`` dispatch path; no new transport plumbing.
* New ``src/util/relative-time.ts`` wraps ``Intl.RelativeTimeFormat`` (mirrors the Home Assistant frontend's ``relativeTime`` helper) plus an ``ageOf()`` that bridges the backend's snapshot time to the client's now-time, so the 500ms tick advances the display without needing a fresh push.
* Drawer content (``device-drawer-content.ts``) gets the Reachability section: a declarative row table for the three signals, so adding a future channel is one entry instead of another duplicated row block. A new ``drawerOpen`` property the parent threads from its ``open`` so the subscription closes when the drawer slides off.
* New localization keys for the Reachability section header, source labels, and "active" badge. The relative-time strings themselves come from ``Intl.RelativeTimeFormat``, so we don't ship per-language phrasings for "N seconds ago".

Tests: 17 new (relative-time formatter incl. clock-skew clamp and the seconds → minutes → hours → days bucket walk; subscribe/forward/cancel/error contract). Full suite: 658 passed.

Pairs with backend PR esphome/device-builder#329 (the WS command and event the drawer subscribes to).
Two callers (``refresh_mdns`` for the per-subscription 60s force-refresh, ``_resolve_non_api_mdns_targets`` for the batch non-API resolve sweep) ran the same "non-empty address list → ``apply(ONLINE, mdns, claim=True)`` + ``apply_ip_addresses``" branch. Moved it to a single private helper so the deliberate no-OFFLINE-on-empty-resolve rule can't drift between the two paths. (A sketch of the helper follows below.)

Drive-bys:

* Drop the defensive ``getattr(result, "min_rtt", 0.0)`` in the ping path — ``icmplib.Host`` declares ``min_rtt`` as a public attribute, so direct access is correct and reads cleaner.
* Expand the ``ReachabilityTracker`` field docstring to spell out what prunes the four maps (the mDNS browser's ``Removed`` and ``DevicesController._on_scan_change(REMOVED)``) and why an OFFLINE state from a ping/MQTT timeout deliberately does *not* clear them (the drawer still wants to surface "we last heard on MQTT 8 minutes ago").
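A sketch of the consolidated branch under stated assumptions: the helper name, the ``DeviceState`` enum, and the method placement are illustrative; ``apply`` and ``apply_ip_addresses`` are the calls the text names.

```python
def _apply_mdns_resolution(self, name: str, addresses: list[str]) -> None:
    """Resolve-success branch shared by refresh_mdns and
    _resolve_non_api_mdns_targets (helper name is hypothetical)."""
    if not addresses:
        # Deliberate no-op: an empty mDNS resolve must NOT flip the
        # device OFFLINE -- ping or MQTT may still be hearing from it.
        return
    self._state_monitor.apply(name, DeviceState.ONLINE, "mdns", claim=True)
    self.apply_ip_addresses(name, addresses)
```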
Reported live: opening the drawer for an online device shows "3 minutes ago" for mDNS even though the broadcast TTL is 120s. Two compounding issues:

* zeroconf only fires ``ServiceStateChange.Updated`` when a record's *content* changes. ESPHome devices re-announce at TTL/2 (~60s) but with the same SRV/A/PTR records, so the TTL refresh typically does NOT fire a callback — ``_mdns_last_seen`` stays stuck at the original ``Added`` timestamp.
* The per-subscription refresh loop slept 60s before its first tick. Combined with the above, the initial snapshot the drawer received showed whatever stale value was carried over from the dashboard's startup, and the first force-refresh that would bump it didn't run for 60s.

Flip the loop to refresh-first-then-sleep. ``async_resolve_host`` hits the zeroconf cache (warm in steady state) and returns within milliseconds; ``apply(ONLINE, "mdns", claim=True)`` bumps ``_mdns_last_seen`` so the next ``DEVICE_REACHABILITY`` event the subscription pushes carries a fresh "Last seen".

Test updated to match the new ordering: the first iteration's ``priority_for`` decides whether refresh runs *before* the first sleep, and the side-effect list grew an entry to cover the loop's third priority probe before its cancellation-raising sleep.
Reported: the drawer showed "3 minutes ago" for an online device on a TTL=120s announce schedule. Diagnosis: zeroconf only fires ``ServiceStateChange.Updated`` when a record's *content* changes, not on the same-content TTL refreshes ESPHome devices send every ~60s. We were stamping ``_mdns_last_seen`` at the (rare) callback firings, so the value lied — the cache itself had been silently updating with fresh ``DNSAddress.created`` timestamps all along.

Read ``DNSAddress.created`` directly from ``zeroconf.cache.get_all_by_details`` for the device's ``<name>.local.`` A and AAAA records. Use the most-recently-created across both as the truthful "last heard", and surface ``DNSAddress.get_remaining_ttl`` as a new ``mdns_ttl_remaining_seconds`` wire field so the drawer can show "TTL: 38s" beside the row.

Why A/AAAA (not SRV): the ``_esphomelib._tcp.local.`` SRV record only exists on devices running the ESPHome native API. A web_server / MQTT / OTA-only device still announces an A/AAAA under its hostname, which is what we care about for "is this host reachable on the LAN".

The tracker accepts an injected ``mdns_cache_reader`` callable so the data source is testable; ``observe(name, "mdns")`` is now a pure callback-fire path (no stamp). Tests pass ``dict.get`` as the reader directly — a bound method already matches the right shape, so no factory is needed.

Wiring: build the state monitor first, then the tracker with ``self._state_monitor.get_mdns_cache_info`` as a bound method, then ``state_monitor.set_reachability(tracker)`` to complete the loop. This avoids the wrapper lambda the previous draft needed.

Use zeroconf's ``millis_to_seconds`` helper instead of dividing by 1000 by hand, and ``operator.attrgetter`` instead of a lambda for ``max(records, key=...)``.
Pairs with backend esphome/device-builder#329's ``mdns_ttl_remaining_seconds`` wire field — the cached SRV/A record's remaining TTL according to zeroconf, useful as a diagnostic next to the freshness ("Last seen 90s ago · TTL: 30s" tells "due to re-announce" apart from "missed several windows already").

* Add ``mdns_ttl_remaining_seconds`` to ``ReachabilityStateEvent``.
* The drawer's ``ReachabilityRowSpec`` gets an optional ``ttlRemaining`` field; the mDNS row populates it from the event payload, the others leave it null.
* Render path: count down ``ttlRemaining`` between server pushes the same way ``ageOf`` counts up, clamping at zero so a stale snapshot doesn't surface negative seconds.
* New ``drawer_mdns_ttl_remaining`` localization key formatted through the existing ``Intl.NumberFormat`` so the integer rendering stays locale-correct.
Reported live: the drawer rendered "TTL: 0s" for every active mDNS row.

Cause: ``DNSAddress.get_remaining_ttl`` already returns seconds (the zeroconf docstring + impl divide by 1000 before returning), but ``get_mdns_cache_info`` ran the result through ``millis_to_seconds`` a second time — a 108-second TTL became 0.108 seconds and rendered as "TTL: 0s".

Drop the extra ``millis_to_seconds`` for the TTL path. The ``age_seconds`` calc still needs the conversion because ``current_time_millis() - record.created`` is plain millisecond arithmetic. Comment both call sites so a future refactor doesn't re-introduce the symmetry that looks right but is wrong (see the sketch below).

The test fixture is also corrected — passing 90_000 / 115_000 to the ``get_remaining_ttl`` mock was masking the bug because the double-divide cancelled out the wrong-unit input. Stub seconds now and assert the seconds round-trip.
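A sketch of the corrected helper with both unit comments, assuming zeroconf's private helper locations (``zeroconf._utils.time``, ``zeroconf.const``) keep their current shape; it's written as a free function over a ``Zeroconf`` instance for self-containment, while the PR implements it as a method:

```python
from operator import attrgetter

from zeroconf import Zeroconf
from zeroconf._utils.time import current_time_millis, millis_to_seconds
from zeroconf.const import _CLASS_IN, _TYPE_A, _TYPE_AAAA


def get_mdns_cache_info(zc: Zeroconf, name: str) -> tuple[float, float] | None:
    """Return (age_seconds, ttl_remaining_seconds) for <name>.local., or None."""
    local_name = f"{name}.local."
    records = [
        *zc.cache.get_all_by_details(local_name, _TYPE_A, _CLASS_IN),
        *zc.cache.get_all_by_details(local_name, _TYPE_AAAA, _CLASS_IN),
    ]
    if not records:
        return None
    now = current_time_millis()
    latest = max(records, key=attrgetter("created"))  # freshest announce wins
    # created and current_time_millis() are both milliseconds, so this
    # difference NEEDS millis_to_seconds...
    age_seconds = millis_to_seconds(now - latest.created)
    # ...but get_remaining_ttl already returns seconds (zeroconf divides by
    # 1000 internally) -- converting again turns 108s into 0.108s.
    return age_seconds, latest.get_remaining_ttl(now)
```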
Two more reported live:

* "Every reopen shows TTL: 120s · now" — caused by the refresh-then-sleep ordering firing an active mDNS resolve on every drawer open. With cache-based snapshot reads, the initial value already reflects the truth, so the eager resolve just clobbered the displayed age with a fresh announce on every subscribe. Flip back to sleep-then-refresh: the open reads the cache as-is, and the 60s tick afterwards keeps the A/AAAA records alive (``ServiceBrowser`` only auto-refreshes the PTR; A and AAAA decay on their own TTL otherwise).
* "TTL: 0s · 2 minutes ago" persisting on a powered-off device — zeroconf's cache reaper sweeps lazily, so an expired record sits in the cache for up to ~10s past its TTL. Filter out ``r.is_expired(now)`` records before picking the latest, so the row hides as soon as the TTL math says we're past expiry rather than waiting for the reaper.

New real-zeroconf integration test (``test_get_mdns_cache_info_against_real_zeroconf_record``): constructs a ``DNSAddress`` via the documented constructor and drops it into a ``Zeroconf.cache``, verifying the seconds/milliseconds contract end-to-end without depending on a stub matching the production assumption (which is how the earlier "TTL: 0s" double-divide bug went green for a full review cycle). Imports moved to top level in the test file; the mock-based ``picks_latest_record`` test got an ``is_expired = MagicMock(return_value=False)`` stub on each record so the new filter doesn't drop them.
Reported: "after 120s [the rows] all expire and switch to Reachability Waiting for first broadcast…". Confirmed: ESPHome devices are mDNS-silent except at startup — they only respond to probes, never volunteer announces. With no proactive querying the A records expire on their TTL (120s) and the drawer's mDNS row vanishes for every device that hasn't been queried recently. Fix: the per-subscription refresh tick now actively probes the wire when the cached A record is within 10s of TTL expiry (or already absent). Bypasses ``AsyncEsphomeZeroconf.async_resolve_host`` because that short-circuits on cache hit — for keeping the cache *alive* we need to skip the cache check entirely and call ``AddressResolver.async_request`` directly so the device's response refreshes the cache entry's ``created`` timestamp. Walkthrough (60s tick + 10s threshold + 120s A TTL): * t=0: subscribe; cache TTL=120 * t=60s: tick fires, TTL_remaining=60s > 10s → skip wire query * t=120s: tick fires, TTL_remaining ≤ 10s (or absent) → wire query, device responds, cache TTL extended to 120s The skip-when-fresh gate matters at fleet scale — each probe is a multicast query, so an unconditional every-tick probe would saturate the local segment for no gain. Tests rebuilt for the new ``AddressResolver`` path: stubs ``async_request`` + ``parsed_scoped_addresses``, asserts the threshold gate (skip when fresh, probe when near expiry / absent). The cache-info ``is_expired`` filter test fixture got an ``is_expired = MagicMock(return_value=False)`` on each record so the new filter doesn't drop them. Plus an integration-style test that constructs a real ``DNSAddress`` via the documented constructor and drops it into a real ``Zeroconf.cache`` — defends against the class of bug where a stub matches the production assumption and the wrong unit cancels out (the earlier "TTL: 0s" bug).
Reverts the just-pushed ``AddressResolver.async_request`` bypass — the user's question pinned down the actual zeroconf contract: ``async_request`` (inherited by ``AddressResolver``) calls ``_load_from_cache`` first and short-circuits if a complete record set is in cache. The bypass would query unconditionally, which defeats the cache and adds multicast traffic for already-fresh records. Better: schedule the loop so the wire query happens when the cache is *actually* expired, and let ``async_resolve_host``'s existing fall-through behaviour do the right thing.

New refresh loop shape (sketched below):

* Read the cached TTL_remaining for the device's A/AAAA records.
* If still alive (``> 0``): sleep ``ttl_remaining + padding`` and ``continue``. The loop wakes up just past expiry.
* If expired or absent: sleep just the padding (lets the subscription's initial snapshot land first), then call ``refresh_device_mdns`` — by now the cache is gone, the ``async_resolve_host`` cache check fails, and the wire query actually fires.

The recheck-after-sleep matters: an unrelated mDNS announce (another browser's query, a startup probe from somewhere on the LAN) can re-arm the cache during our sleep. The recheck spares us a redundant wire query in that case — we just sleep again for the new lifetime.

Also drops the ``is_expired`` filter in ``get_mdns_cache_info``. The drawer wants the truthful "last seen" age even when the A record has aged past its TTL: surfacing "Last seen 122s ago, TTL: 0s" is more accurate than hiding the row entirely just because the 120s TTL ran out a second before our refresh tick fired. The PTR record (4500s TTL) keeps the device's existence alive in mDNS-land regardless of A/AAAA expiry, and the row hides correctly once the reaper actually evicts the record.

Tests: new ``test_refresh_loop_sleeps_until_cache_expiry`` pins the ``ttl_remaining + padding`` arithmetic, and ``test_refresh_loop_skips_wire_query_when_recheck_finds_fresh_cache`` pins the recheck-don't-query gate. The threshold/bypass tests are gone.
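A sketch of that loop shape; ``_PADDING_S`` and ``_read_cached_ttl_remaining`` are illustrative stand-ins for however the PR reads the cached TTL (a later commit splits this out as ``get_mdns_a_record_ttl_remaining``), and ``refresh_device_mdns`` is the call the text names:

```python
import asyncio

_PADDING_S = 2.0  # illustrative: small buffer past expiry


async def _reachability_refresh_loop(self, name: str) -> None:
    while True:
        ttl_remaining = self._read_cached_ttl_remaining(name)
        if ttl_remaining is not None and ttl_remaining > 0:
            # Cache alive: wake just past expiry, then re-read at the top.
            # An unrelated announce may re-arm the cache during this sleep;
            # the recheck then spares us a redundant wire query.
            await asyncio.sleep(ttl_remaining + _PADDING_S)
            continue
        # Expired or absent: a short sleep lets the subscription's initial
        # snapshot land first, then resolve. The cache check inside
        # async_resolve_host misses now, so the wire query actually fires.
        await asyncio.sleep(_PADDING_S)
        await self._state_monitor.refresh_device_mdns(name)
```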
Marking as draft — getting late, can't finish testing tonight. Will resume tomorrow.
User feedback: the drawer's "Last seen" should reflect the most recent ``DNSRecord.created`` across ANY mDNS record we have for the device, not just the address records. Previously the row showed "Waiting for first broadcast…" the moment the A record expired (120s), even though the device's PTR was kept fresh by the ``ServiceBrowser`` and the device was plainly still around in mDNS-land — we kept showing the device as online while its reachability row hid.

``get_mdns_cache_info`` now reads:

* A and AAAA at ``<name>.local.``
* SRV and TXT at ``<name>._esphomelib._tcp.local.``
* PTR via the dedicated ``cache.current_entry_with_name_and_alias`` helper (built for exactly the "PTR-by-target" lookup we need).

It picks ``max(records, key=created)`` across the union. The PTR's ~4500s TTL means it stays cached far longer than the A's 120s, so it carries the row through brief A-expiry windows when the device went quiet between announces. The row only hides when the cache has nothing under any of the types (a brand-new device, never observed at all).

New integration test ``picks_latest_across_record_types``: a real ``DNSAddress`` (110s old, A) plus a real ``DNSPointer`` (5s old, PTR) in a real ``Zeroconf.cache``; asserts the helper returns the PTR's 5s age — pinning the multi-type lookup so a refactor that drops any of the type queries surfaces here. The earlier mock-based tests got the ``current_entry_with_name_and_alias`` stub they now need.
The previous commit's union-of-types ``get_mdns_cache_info`` broke the refresh loop: the PTR has a 4500s TTL, so picking ``ttl_remaining`` off the latest record (now usually the PTR) made the loop sleep ~75 minutes between probes — the A record would expire 73 minutes earlier and the drawer would silently lose its mDNS row in the gap.

Split the lookup (sketched below):

* ``get_mdns_cache_info`` keeps the union-of-types semantics for the drawer's "Last seen" display — it surfaces the freshest record across A / AAAA / SRV / TXT / PTR.
* New ``get_mdns_a_record_ttl_remaining`` returns the *minimum* remaining TTL across cached A/AAAA records (the ones that decay at 120s and that we need to keep alive). Used by the refresh loop to schedule its next probe — it sleeps until the address records actually need refreshing.

DRY-up: both helpers share a private ``_get_address_records`` that owns the ``cache.get_all_by_details(local_name, A/AAAA, IN)`` lookup. ``get_mdns_a_record_ttl_remaining`` returns ``min(remaining)`` (the smaller TTL covers the case where one address family expires before the other); ``get_mdns_cache_info`` adds SRV/TXT plus the dedicated ``cache.current_entry_with_name_and_alias`` PTR lookup.

Tests: new ``test_get_mdns_a_record_ttl_remaining_*`` (real ``DNSAddress`` records for both families, A = 90s and AAAA = 40s remaining; pins that the AAAA wins, plus a no-records null return). Loop tests in ``test_subscribe_reachability.py`` switched their stubs from ``get_mdns_cache_info.side_effect`` to the new method's plain-float ``side_effect``.
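A sketch of the split under the same zeroconf assumptions as the earlier sketch; ``get_mdns_cache_info`` additionally unions SRV/TXT and the PTR lookup as described, omitted here for brevity:

```python
from zeroconf import Zeroconf
from zeroconf._utils.time import current_time_millis
from zeroconf.const import _CLASS_IN, _TYPE_A, _TYPE_AAAA


def _get_address_records(zc: Zeroconf, name: str) -> list:
    """Single owner of the A/AAAA cache lookup both helpers share."""
    local_name = f"{name}.local."
    return [
        *zc.cache.get_all_by_details(local_name, _TYPE_A, _CLASS_IN),
        *zc.cache.get_all_by_details(local_name, _TYPE_AAAA, _CLASS_IN),
    ]


def get_mdns_a_record_ttl_remaining(zc: Zeroconf, name: str) -> float | None:
    """Minimum remaining TTL across cached A/AAAA records, or None.

    min() because the refresh loop must wake before the FIRST address
    family expires, not the last.
    """
    records = _get_address_records(zc, name)
    if not records:
        return None
    now = current_time_millis()
    return min(r.get_remaining_ttl(now) for r in records)
```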
What does this implement/fix?
Adds a per-device reachability subscription so the device drawer's right-side panel can show how we're hearing from a device — mDNS / ping / MQTT last-seen plus the most recent ping RTT — without bloating the broadcast `subscribe_events` channel for every other connected client. The wire exchange is sketched below.

Companion frontend PR: esphome/device-builder-frontend#165
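For orientation, a hedged sketch of the wire exchange as Python literals. The command names and event type come from this PR; the exact payload field names, the device name, and the stop-message shape are assumptions:

```python
# Client -> server: open the stream while the drawer is open.
subscribe = {
    "id": 42,
    "type": "devices/subscribe_reachability",
    "name": "garage-door",  # hypothetical device
}

# Server -> client: initial snapshot, then one push per observation.
event = {
    "id": 42,  # tagged with the subscribing message_id
    "event": "reachability_state",
    "data": {
        "name": "garage-door",
        "mdns_last_seen": 1712345678.9,
        "ping_last_seen": 1712345670.1,
        "mqtt_last_seen": None,
        "ping_rtt_ms": 4.2,
        "mdns_ttl_remaining_seconds": 38,  # added in a later commit
        "active_source": "mdns",
    },
}

# Client -> server: cancel via the existing stop primitive.
stop = {"id": 43, "type": "devices/stop_stream", "stream_id": 42}
```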
Highlights:
* `devices/subscribe_reachability` streams a `reachability_state` event for one device on subscribe and on every observation. Cancel via the existing `devices/stop_stream` (no new dispatcher work).
* `ReachabilityTracker` owns the per-signal maps (mDNS / ping / MQTT last-seen + ping RTT). Pure data; it lives in its own module so the state monitor stays focused on priority rules and browser callbacks.
* `DeviceStateMonitor` delegates: `apply()` records an observation under the source on `ONLINE`; `_ping_device` captures `Host.min_rtt` (previously discarded); the mDNS `Removed` branch clears the tracker.
* `ReachabilitySource` `StrEnum` typifies `priority_for()` — caller comparisons fail loudly on a typo instead of falling silently to `UNKNOWN`.
* A 60s mDNS force-refresh task runs while subscribed and `active_source == MDNS`. Ping/MQTT-source devices already get freshness from existing loops.
* `_on_scan_change(REMOVED)` now clears the tracker so deleted devices don't leave per-signal entries forever.

Related issue or feature (if applicable):
Types of changes

- [ ] bugfix
- [x] new-feature
- [ ] enhancement
- [ ] breaking-change
- [ ] refactor
- [ ] docs
- [ ] maintenance
- [ ] ci
- [ ] dependencies

Frontend coordination
Checklist

- [x] Local checks pass (`ruff`, `codespell`, yaml/json/python checks).
- [x] Tests added under `tests/` where applicable. (~30 new unit + integration tests covering the tracker, state-monitor integration, end-to-end WS subscribe + cancel + filter, force-refresh-only-when-mdns, deleted-device cleanup, and `refresh_mdns` paths.)
- [x] `components.json` has not been hand-edited.
- [x] `docs/ARCHITECTURE.md` and/or `docs/API.md` updated.

Out of scope
* `*_last_seen` is what matters for "is this device responsive"