Add per-device reachability subscription for the drawer #329

Merged: bdraco merged 12 commits into main from reachability-subscribe on May 5, 2026

Conversation

@bdraco bdraco commented May 5, 2026

What does this implement/fix?

Adds a per-device reachability subscription so the device drawer's right-side panel can show how we're hearing from a device — mDNS / ping / MQTT last-seen plus the most recent ping RTT — without bloating the broadcast subscribe_events channel for every other connected client.

Companion frontend PR: esphome/device-builder-frontend#165

Highlights:

  • New WS command devices/subscribe_reachability streams a reachability_state event for one device on subscribe and on every observation. Cancel via the existing devices/stop_stream (no new dispatcher work).
  • New ReachabilityTracker owns the per-signal maps (mDNS / ping / MQTT last-seen + ping RTT). Pure data, lives in its own module so the state monitor stays focused on priority rules and browser callbacks. A rough sketch follows this list.
  • DeviceStateMonitor delegates: apply() records an observation under the source on ONLINE; _ping_device captures Host.min_rtt (previously discarded); the mDNS Removed branch clears the tracker.
  • New ReachabilitySource StrEnum typifies priority_for() — caller comparisons fail loudly on a typo instead of falling silently to UNKNOWN.
  • 60s mDNS force-refresh task scoped to the subscription and gated on active_source == MDNS. Ping/MQTT-source devices already get freshness from existing loops.
  • _on_scan_change(REMOVED) now clears the tracker so deleted devices don't leave per-signal entries forever.
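For orientation, a minimal sketch of the tracker and enum the highlights describe; the field names, the callback signature, and the snapshot keys are illustrative guesses rather than the merged code:

```python
# Illustrative sketch only: names beyond those quoted in this PR are assumptions.
import time
from collections.abc import Callable
from enum import StrEnum


class ReachabilitySource(StrEnum):
    """Typed source names so caller comparisons fail loudly on a typo."""

    MDNS = "mdns"
    PING = "ping"
    MQTT = "mqtt"


class ReachabilityTracker:
    """Pure in-memory per-signal freshness data; no priority logic here."""

    def __init__(self, on_observation: Callable[[str], None] | None = None) -> None:
        self._on_observation = on_observation
        # One last-seen map per signal, plus the most recent ping RTT.
        self._last_seen: dict[ReachabilitySource, dict[str, float]] = {
            source: {} for source in ReachabilitySource
        }
        self._ping_rtt: dict[str, float] = {}

    def observe(self, name: str, source: ReachabilitySource) -> None:
        """Record that we just heard from *name* via *source* and notify."""
        self._last_seen[source][name] = time.time()
        if self._on_observation is not None:
            self._on_observation(name)

    def record_ping_rtt(self, name: str, rtt_ms: float) -> None:
        """Store the latest RTT; the paired apply()/observe() fires the callback."""
        self._ping_rtt[name] = rtt_ms

    def clear(self, name: str) -> None:
        """Drop every per-signal entry for a removed or deleted device."""
        for per_source in self._last_seen.values():
            per_source.pop(name, None)
        self._ping_rtt.pop(name, None)

    def snapshot(self, name: str) -> dict[str, float | None]:
        """Wire-shape dict the WS handler serializes into reachability_state."""
        data: dict[str, float | None] = {
            f"{source.value}_last_seen": self._last_seen[source].get(name)
            for source in ReachabilitySource
        }
        data["ping_rtt_ms"] = self._ping_rtt.get(name)
        return data
```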

Related issue or feature (if applicable):

  • Adds the backend support for the device-drawer Reachability section discussed in the corresponding frontend PR.

Types of changes

  • New feature (non-breaking change which adds functionality) — new-feature

Frontend coordination

Checklist

  • The code change is tested and works locally.
  • Pre-commit hooks pass (ruff, codespell, yaml/json/python checks).
  • Tests have been added or updated under tests/ where applicable. (~30 new unit + integration tests covering the tracker, state-monitor integration, end-to-end WS subscribe + cancel + filter, force-refresh-only-when-mdns, deleted-device cleanup, and refresh_mdns paths.)
  • components.json has not been hand-edited.
  • Architecture-level changes are reflected in docs/ARCHITECTURE.md and/or docs/API.md.

Out of scope

  • Persisting last-seen across restarts (in-memory; the next observation refills within seconds)
  • TTL info from the mDNS record — python-zeroconf doesn't expose cache-entry age through its public API; the receive-time *_last_seen is what matters for "is this device responsive"
  • Surfacing reachability in the table view (drawer-only for now)

Copilot AI review requested due to automatic review settings May 5, 2026 03:14
@bdraco bdraco added the new-feature New feature label May 5, 2026

codspeed-hq Bot commented May 5, 2026

Merging this PR will not alter performance

✅ 9 untouched benchmarks


Comparing reachability-subscribe (5f3f315) with main (62f9882)



codecov-commenter commented May 5, 2026

Codecov Report

❌ Patch coverage is 99.44134% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 98.60%. Comparing base (55e3a83) to head (5f3f315).
⚠️ Report is 3 commits behind head on main.

Files with missing lines                               | Patch %  | Lines
...evice_builder/controllers/_device_state_monitor.py  | 98.48%   | 1 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main     #329      +/-   ##
==========================================
+ Coverage   98.57%   98.60%   +0.02%     
==========================================
  Files          45       46       +1     
  Lines        5127     5295     +168     
==========================================
+ Hits         5054     5221     +167     
- Misses         73       74       +1     
Flag    | Coverage Δ
py3.12  | 98.54% <99.44%> (+0.02%) ⬆️
py3.14  | 98.60% <99.44%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines                               | Coverage Δ
...evice_builder/controllers/_reachability_tracker.py  | 100.00% <100.00%> (ø)
...e_device_builder/controllers/devices/controller.py  | 99.86% <100.00%> (+0.01%) ⬆️
esphome_device_builder/device_builder.py               | 94.66% <100.00%> (+0.02%) ⬆️
esphome_device_builder/models/common.py                | 100.00% <100.00%> (ø)
esphome_device_builder/models/devices.py               | 100.00% <100.00%> (ø)
...evice_builder/controllers/_device_state_monitor.py  | 98.62% <98.48%> (-0.07%) ⬇️


Copilot AI left a comment

Pull request overview

This PR adds a per-device WebSocket stream that provides detailed “reachability” freshness data (mDNS / ping / MQTT last-seen + ping RTT) for the device drawer, backed by a new in-memory ReachabilityTracker that DeviceStateMonitor updates as observations occur.

Changes:

  • Add devices/subscribe_reachability WS command to stream per-device reachability snapshots while the drawer is open.
  • Introduce ReachabilityTracker and wire it through DevicesController and DeviceStateMonitor, including capturing ping RTT (icmplib.Host.min_rtt) and clearing reachability on mDNS Removed.
  • Add ReachabilitySource (StrEnum) and comprehensive unit/integration tests for tracker + state monitor integration + WS handler behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Changed files:
  • esphome_device_builder/controllers/devices/controller.py: Adds ReachabilityTracker, new WS subscribe handler, refresh loop, and bus-forwarding for reachability snapshots.
  • esphome_device_builder/controllers/_device_state_monitor.py: Wires optional reachability tracking into apply/ping/mDNS-Removed paths; adds typed priority_for and mDNS refresh helper.
  • esphome_device_builder/controllers/_reachability_tracker.py: New tracker for per-signal last-seen and ping RTT plus snapshot serialization.
  • esphome_device_builder/models/devices.py: Adds ReachabilitySource enum.
  • esphome_device_builder/models/common.py: Adds EventType.DEVICE_REACHABILITY.
  • tests/controllers/devices/test_subscribe_reachability.py: New end-to-end tests for the WS reachability subscription, filtering, cancellation, and refresh gating.
  • tests/test_reachability_tracker.py: New unit tests for tracker observe/clear/snapshot semantics and callback behavior.
  • tests/test_state_monitor_reachability.py: New integration tests for state-monitor ↔ tracker hand-offs, ping RTT capture, and mDNS Removed clearing.
  • tests/test_state_monitor_lifecycle.py: Updates monitor fixture to include _reachability slot.
  • tests/test_probe_device.py: Updates monitor fixtures to include _reachability slot.


Comment thread esphome_device_builder/controllers/devices/controller.py
Comment thread esphome_device_builder/models/common.py
Comment thread esphome_device_builder/controllers/devices/controller.py Outdated
Comment thread esphome_device_builder/models/devices.py Outdated
Comment thread esphome_device_builder/controllers/_reachability_tracker.py Outdated
bdraco added 3 commits May 4, 2026 22:43
Adds a WS subscribe stream that pushes per-signal freshness for one
device — mDNS / ICMP ping / MQTT last-seen plus the most recent
ping RTT — so the dashboard's right-side drawer can show a
Reachability section without bloating the broadcast event channel
for every connected client. Reuses the existing per-message_id
event tagging and ``devices/stop_stream`` cancellation primitive;
no new dispatcher work needed.

* New ``ReachabilityTracker`` class
  (``controllers/_reachability_tracker.py``) — owns four maps
  (mDNS / ping / MQTT last-seen, ping RTT) and the wire-shape
  snapshot serializer.

* ``DeviceStateMonitor`` delegates: ``apply()`` records an
  observation under the source on ``ONLINE``; ``_ping_device``
  captures ``Host.min_rtt`` (previously discarded); the mDNS
  ``Removed`` branch clears the tracker.

* New ``ReachabilitySource`` ``StrEnum`` so ``priority_for()``
  returns a typed value — typos in caller comparisons fail loudly
  instead of falling silently to ``UNKNOWN``.

* New ``EventType.DEVICE_REACHABILITY`` and
  ``DevicesController.subscribe_reachability`` WS command.
  Filters bus events by device name in a closure; schedules a
  60s mDNS force-refresh task while subscribed AND the active
  source is mDNS.

* ``_build_reachability_snapshot`` consolidates the device-lookup
  + tracker-snapshot pairing shared between the initial seed and
  the per-event push.

Tests: 23 new unit + integration tests covering the tracker,
state-monitor integration, and end-to-end WS subscribe. Full
suite: 1438 passed, 3 skipped.
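A rough sketch of the handler shape this commit describes. The bus and connection objects and their methods (async_listen, send_event, register_cancel) are stand-ins for whatever the project actually uses, and the helper names are taken loosely from the commit text:

```python
# Hypothetical sketch; not the project's actual API surface.
import asyncio


async def subscribe_reachability(controller, connection, msg_id: int, device_name: str) -> None:
    """Stream reachability_state events for one device to one client."""

    def forward(event) -> None:
        # Per-device filter in a closure: only this device's observations
        # reach this subscriber; the broadcast channel never sees them.
        if event.data.get("name") == device_name:
            connection.send_event(msg_id, controller._build_reachability_snapshot(device_name))

    unsub = controller.bus.async_listen(EventType.DEVICE_REACHABILITY, forward)

    # Seed the drawer immediately, then push again on every observation.
    connection.send_event(msg_id, controller._build_reachability_snapshot(device_name))

    async def refresh_loop() -> None:
        # 60s mDNS force-refresh, only while mDNS is the active source.
        while True:
            await asyncio.sleep(60)
            if controller.state_monitor.priority_for(device_name) == ReachabilitySource.MDNS:
                await controller.state_monitor.refresh_mdns(device_name)

    task = asyncio.create_task(refresh_loop())
    # devices/stop_stream cancels by msg_id; tear down both pieces.
    connection.register_cancel(msg_id, lambda: (unsub(), task.cancel()))
```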
Self-review pass on the new reachability subscription. Three
issues caught + fixed:

* ``_on_scan_change(REMOVED)`` did not clear the tracker. The
  four per-signal maps would accumulate one entry per device
  ever in the catalog — the mDNS browser's ``Removed`` only
  catches the broadcast-gone case, not YAML deletion. Clear
  the entry alongside the other REMOVED-side cleanup.

* ``record_ping_rtt`` fired ``on_observation`` even though the
  state monitor's ping path always pairs it with an ``apply(name,
  ONLINE, "ping")`` call that fans out through ``observe`` and
  fires the same callback once. Net effect: two bus events for
  every successful ping. Drop the redundant fire.

* ``DevicesController._reachability`` was unset on bypass-init
  test controllers, so any test driving ``_on_scan_change`` hit
  ``AttributeError``. Wire ``ReachabilityTracker()`` into the
  shared ``make_controller`` fixture.

Coverage: 87.30% on the new lines → 99% (six remaining lines
are pre-existing in helper methods unrelated to this PR).

New tests:

* ``test_refresh_mdns_*`` — no-zeroconf no-op, successful resolve
  flips ONLINE + bumps mDNS last-seen, swallows resolve
  exceptions, empty-resolve no-op.
* ``test_dispatch_removed_event_clears_reachability_tracker`` —
  drives the actual mDNS browser dispatch closure (rather than
  replaying the branch by hand).
* ``test_subscribe_without_client_returns_silently`` — early-
  return guard on missing client.
* ``test_observation_for_deleted_device_is_dropped`` — race
  between observation and deletion.
* ``test_observation_fires_bus_event_for_known_device`` — full
  production path through ``tracker.observe`` →
  ``_on_reachability_observation`` → bus.fire.
* ``test_scan_change_removed_clears_tracker`` — the new cleanup
  branch.
* ``test_record_ping_rtt_sets_field_without_firing_callback`` —
  pins the no-fire contract.
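In the spirit of that last test, a pin-down written against the tracker sketch shown earlier in this thread (the on_observation constructor argument is an assumption):

```python
# Sketch of the no-fire contract test; constructor details are assumptions.
def test_record_ping_rtt_sets_field_without_firing_callback() -> None:
    fired: list[str] = []
    tracker = ReachabilityTracker(on_observation=fired.append)

    tracker.record_ping_rtt("living-room", 3.2)

    # apply(name, ONLINE, "ping") -> observe() is what fans out the bus event;
    # recording the RTT on its own must stay silent to avoid double-firing.
    assert fired == []
    assert tracker.snapshot("living-room")["ping_rtt_ms"] == 3.2
```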
Copilot caught the bug it was meant to catch: the new
``DEVICE_REACHABILITY`` event was being broadcast to every
connected ``subscribe_events`` client because the handler still
listened to ``list(EventType)``. Per-device freshness pings
fire on every mDNS announce, every successful ICMP, every MQTT
discover response — broadcasting them defeats the point of
having a per-device subscription and would saturate the
bounded queue under fleet load (the terminator triggers and
tears the connection down).

Filter ``DEVICE_REACHABILITY`` out of the broadcast event-types
list with an explicit comment about why. The
``devices/subscribe_reachability`` handler stays the only
consumer.

Also fixes a handful of stale terminology references Copilot
flagged: the actual command is ``devices/subscribe_reachability``,
not ``subscribe_device_reachability``. Updated comments in
``_reachability_tracker.py``, ``devices/controller.py``, and
``models/common.py``. The ``ReachabilitySource`` docstring no
longer claims priority is encoded by member declaration order
— the actual precedence lives in ``_SOURCE_PRIORITY``.

New test pins the broadcast exclusion: fires a
``DEVICE_REACHABILITY`` into the bus, asserts the broadcast
subscriber's ``send_event`` calls do not include it, plus a
sanity-check on ``DEVICE_UPDATED`` so a regression dropping
everything still surfaces.
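A minimal sketch of what that exclusion might look like; the list name and the surrounding handler wiring are assumptions:

```python
# DEVICE_REACHABILITY is intentionally excluded from the broadcast channel:
# it fires on every mDNS announce / ICMP reply / MQTT response, and only the
# per-device devices/subscribe_reachability handler should consume it.
BROADCAST_EVENT_TYPES = [
    event_type
    for event_type in EventType
    if event_type is not EventType.DEVICE_REACHABILITY
]
```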
@bdraco bdraco force-pushed the reachability-subscribe branch from 5499886 to b8e7bee Compare May 5, 2026 03:44
bdraco added a commit to esphome/device-builder-frontend that referenced this pull request May 5, 2026
Surfaces the per-signal freshness the new device-builder backend
pushes over ``devices/subscribe_reachability``: one row per
channel the device has been observed on (mDNS / Ping / MQTT) with
the localized "N seconds ago" and, for the active source, an
"active" badge. The Ping row appends the round-trip-ms.

The subscription is gated on ``(drawer open && device set &&
api ready)``: opens when those line up, closes (sends
``devices/stop_stream``) when any of them flips. While
subscribed, a 500ms ``setInterval`` ticks a state counter so the
displayed relative-time advances between server pushes; the
backend force-refreshes mDNS every 60s, so the rendered age stays
honest.

* New ``ReachabilityStateEvent`` / ``ReachabilitySource`` /
  ``ReachabilitySubscription`` types in ``src/api/types.ts``.
* New ``EsphomeApi.subscribeDeviceReachability(name, callback)``
  returns a handle whose ``unsubscribe()`` sends
  ``devices/stop_stream``. Reuses the existing ``_eventSubscriptions``
  dispatch path; no new transport plumbing.
* New ``src/util/relative-time.ts`` wraps
  ``Intl.RelativeTimeFormat`` (mirrors the Home Assistant
  frontend's ``relativeTime`` helper) plus an ``ageOf()`` that
  bridges the backend's snapshot time to client now-time so the
  500ms tick advances the display without needing a fresh push.
* Drawer-content (``device-drawer-content.ts``) gets the
  Reachability section: declarative row table for the three
  signals so adding a future channel is one entry instead of
  another duplicated row block. New ``drawerOpen`` property the
  parent threads from its ``open`` so the subscription closes
  when the drawer slides off.
* New localization keys for the Reachability section header,
  source labels, and "active" badge. The relative-time strings
  themselves come from ``Intl.RelativeTimeFormat`` so we don't
  ship per-language phrasings for "N seconds ago".

Tests: 17 new (relative-time formatter incl. clock-skew clamp,
seconds → minutes → hours → days bucket walk; subscribe/forward
/cancel/error contract). Full suite: 658 passed.

Pairs with backend PR esphome/device-builder#329 (the WS command
and event the drawer subscribes to).
bdraco added 3 commits May 4, 2026 22:48
Two callers (``refresh_mdns`` for the per-subscription 60s
force-refresh, ``_resolve_non_api_mdns_targets`` for the batch
non-API resolve sweep) ran the same "non-empty address list →
``apply(ONLINE, mdns, claim=True)`` + ``apply_ip_addresses``"
branch. Moved it to a single private helper so the deliberate
no-OFFLINE-on-empty-resolve rule can't drift between the two
paths.

Drive-bys:

* Drop the defensive ``getattr(result, "min_rtt", 0.0)`` in the
  ping path — ``icmplib.Host`` declares ``min_rtt`` as a public
  attribute, so direct access is correct and reads cleaner.

* Expand the ``ReachabilityTracker`` field docstring to spell
  out what prunes the four maps (mDNS browser ``Removed`` and
  ``DevicesController._on_scan_change(REMOVED)``) and why an
  OFFLINE state from ping/MQTT timeout deliberately does *not*
  clear (the drawer still wants to surface "we last heard on
  MQTT 8 minutes ago").
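A sketch of the shared branch described above; the helper name, the DeviceState enum, and the apply/apply_ip_addresses signatures are taken loosely from the commit text rather than the actual code:

```python
# Hypothetical shared helper; names are stand-ins.
def _apply_mdns_resolution(self, name: str, addresses: list[str]) -> None:
    """Shared tail of refresh_mdns and _resolve_non_api_mdns_targets."""
    if not addresses:
        # Deliberate rule both callers must keep: an empty resolve never
        # flips a device OFFLINE; no cached answer is not proof it's gone.
        return
    # DeviceState is a stand-in name for the project's state enum.
    self.apply(name, DeviceState.ONLINE, "mdns", claim=True)
    self.apply_ip_addresses(name, addresses)
```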
Reported live: opening the drawer for an online device shows
"3 minutes ago" for mDNS even though the broadcast TTL is 120s.
Two compounding issues:

* zeroconf only fires ``ServiceStateChange.Updated`` when a
  record's *content* changes. ESPHome devices re-announce at
  TTL/2 (~60s) but with the same SRV/A/PTR records, so the
  TTL refresh typically does NOT fire a callback —
  ``_mdns_last_seen`` stays stuck at the original ``Added``
  timestamp.

* The per-subscription refresh loop slept 60s before its first
  tick. Combined with the above, the initial snapshot the
  drawer received showed whatever stale value was carried over
  from the dashboard's startup, and the first force-refresh
  that would bump it didn't run for 60s.

Flip the loop to refresh-first-then-sleep. ``async_resolve_host``
hits the zeroconf cache (warm in steady state) and returns
within milliseconds; ``apply(ONLINE, "mdns", claim=True)``
bumps ``_mdns_last_seen`` so the next ``DEVICE_REACHABILITY``
event the subscription pushes carries a fresh "Last seen".

Test updated to match the new ordering: first iteration's
``priority_for`` decides whether refresh runs *before* the
first sleep, and the side-effect list grew an entry to cover
the loop's third priority probe before its
cancellation-raising sleep.
…seen)

Reported: drawer showed "3 minutes ago" for an online device on a
TTL=120s announce schedule. Diagnosis: zeroconf only fires
``ServiceStateChange.Updated`` when a record's *content* changes,
not on the same-content TTL refreshes ESPHome devices send every
~60s. We were stamping ``_mdns_last_seen`` at the (rare) callback
firings, so the value lied — the cache itself had been silently
updating with fresh ``DNSAddress.created`` timestamps all along.

Read ``DNSAddress.created`` directly from
``zeroconf.cache.get_all_by_details`` for the device's
``<name>.local.`` A and AAAA records. Use the most-recently-
created across both as the truthful "last heard" — and surface
``DNSAddress.get_remaining_ttl`` as a new
``mdns_ttl_remaining_seconds`` wire field so the drawer can show
"TTL: 38s" beside the row.

Why A/AAAA (not SRV): the ``_esphomelib._tcp.local.`` SRV record
only exists on devices running the ESPHome native API. A
web_server / MQTT / OTA-only device still announces an A/AAAA
under its hostname, which is what we care about for "is this
host reachable on the LAN".

Tracker accepts an injected ``mdns_cache_reader`` callable so
the data source is testable; ``observe(name, "mdns")`` is now a
pure callback-fire path (no stamp). Tests pass ``dict.get`` as
the reader directly — bound-method already matches the right
shape, no factory needed.

Wiring: build the state monitor first, then the tracker with
``self._state_monitor.get_mdns_cache_info`` as a bound method,
then ``state_monitor.set_reachability(tracker)`` to complete
the loop. Avoids a wrapper lambda the previous draft needed.
Use zeroconf's ``millis_to_seconds`` helper instead of dividing
by 1000 by hand; ``operator.attrgetter`` instead of a lambda
for ``max(records, key=...)``.
bdraco added a commit to esphome/device-builder-frontend that referenced this pull request May 5, 2026
Pairs with backend esphome/device-builder#329's
``mdns_ttl_remaining_seconds`` wire field — the cached SRV/A
record's remaining TTL according to zeroconf, useful as a
diagnostic next to the freshness ("Last seen 90s ago · TTL: 30s"
tells "due to re-announce" apart from "missed several windows
already").

* Add ``mdns_ttl_remaining_seconds`` to ``ReachabilityStateEvent``.
* The drawer's ``ReachabilityRowSpec`` gets an optional
  ``ttlRemaining`` field; the mDNS row populates it from the
  event payload, others leave it null.
* Render path: count down ``ttlRemaining`` between server pushes
  the same way ``ageOf`` counts up, clamping at zero so a stale
  snapshot doesn't surface negative seconds.
* New ``drawer_mdns_ttl_remaining`` localization key formatted
  through the existing ``Intl.NumberFormat`` so the integer
  decimal stays locale-correct.
bdraco added 4 commits May 4, 2026 23:27
Reported live: drawer rendered "TTL: 0s" for every active mDNS
row. Cause: ``DNSAddress.get_remaining_ttl`` already returns
seconds (the zeroconf docstring + impl divide by 1000 before
returning), but ``get_mdns_cache_info`` ran the result through
``millis_to_seconds`` a second time — a 108-second TTL became
0.108 seconds and rendered as "TTL: 0s".

Drop the extra ``millis_to_seconds`` for the TTL path. The
``age_seconds`` calc still needs the conversion because
``current_time_millis() - record.created`` is plain
millisecond arithmetic. Comment both call sites so a future
refactor doesn't re-introduce the symmetry that looks right
but is wrong.

Test fixture also corrected — passing 90_000 / 115_000 to the
``get_remaining_ttl`` mock was masking the bug because the
double-divide cancelled out the wrong-unit input. Stub
seconds now and assert the seconds round-trip.
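A sketch of the corrected unit handling, using the zeroconf helpers the commits name; the import paths and the surrounding function are assumptions:

```python
# Sketch only; method names follow the commit text, import locations may differ.
from operator import attrgetter

from zeroconf import DNSAddress
from zeroconf._utils.time import current_time_millis, millis_to_seconds


def newest_record_age_and_ttl(records: list[DNSAddress]) -> tuple[float, float]:
    now = current_time_millis()
    newest = max(records, key=attrgetter("created"))
    # record.created is in milliseconds, so the age calculation needs the
    # millis_to_seconds conversion...
    age_seconds = millis_to_seconds(now - newest.created)
    # ...but get_remaining_ttl() already returns seconds; converting it again
    # is the "TTL: 0s" double-divide this commit fixes.
    ttl_remaining_seconds = newest.get_remaining_ttl(now)
    return age_seconds, ttl_remaining_seconds
```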
Two more reported live:

* "Every reopen shows TTL: 120s · now" — caused by the
  refresh-then-sleep ordering firing an active mDNS resolve on
  every drawer-open. With cache-based snapshot reads the
  initial value already reflects truth, so the eager resolve
  just clobbered the displayed age with a fresh announce on
  every subscribe. Flip back to sleep-then-refresh: the open
  reads the cache as-is, the 60s tick afterwards keeps the
  A/AAAA records alive (``ServiceBrowser`` only auto-refreshes
  the PTR; A and AAAA decay on their own TTL otherwise).

* "TTL: 0s · 2 minutes ago" persisting on a powered-off
  device — zeroconf's cache reaper sweeps lazily, so an
  expired record sits in the cache for up to ~10s after its
  TTL. Filter ``r.is_expired(now)`` out before picking the
  latest, so the row hides as soon as TTL math says we're
  past expiry rather than waiting for the reaper.

New real-zeroconf integration test
(``test_get_mdns_cache_info_against_real_zeroconf_record``) —
constructs a ``DNSAddress`` via the documented constructor and
drops it into a ``Zeroconf.cache``, verifying the
seconds/milliseconds contract end-to-end without depending on
a stub matching the production assumption (which is how the
earlier "TTL: 0s" double-divide bug went green for a full
review cycle). Imports moved to top-level on the test file;
mock-based ``picks_latest_record`` test got an
``is_expired = MagicMock(return_value=False)`` stub on each
record so the new filter doesn't drop them.
Reported: "after 120s [the rows] all expire and switch to
Reachability Waiting for first broadcast…". Confirmed: ESPHome
devices are mDNS-silent except at startup — they only respond
to probes, never volunteer announces. With no proactive
querying the A records expire on their TTL (120s) and the
drawer's mDNS row vanishes for every device that hasn't been
queried recently.

Fix: the per-subscription refresh tick now actively probes
the wire when the cached A record is within 10s of TTL
expiry (or already absent). Bypasses
``AsyncEsphomeZeroconf.async_resolve_host`` because that
short-circuits on cache hit — for keeping the cache *alive*
we need to skip the cache check entirely and call
``AddressResolver.async_request`` directly so the device's
response refreshes the cache entry's ``created`` timestamp.

Walkthrough (60s tick + 10s threshold + 120s A TTL):

* t=0: subscribe; cache TTL=120
* t=60s: tick fires, TTL_remaining=60s > 10s → skip wire query
* t=120s: tick fires, TTL_remaining ≤ 10s (or absent) → wire
  query, device responds, cache TTL extended to 120s

The skip-when-fresh gate matters at fleet scale — each probe
is a multicast query, so an unconditional every-tick probe
would saturate the local segment for no gain.

Tests rebuilt for the new ``AddressResolver`` path: stubs
``async_request`` + ``parsed_scoped_addresses``, asserts
the threshold gate (skip when fresh, probe when near
expiry / absent). The cache-info ``is_expired`` filter test
fixture got an ``is_expired = MagicMock(return_value=False)``
on each record so the new filter doesn't drop them.

Plus an integration-style test that constructs a real
``DNSAddress`` via the documented constructor and drops it
into a real ``Zeroconf.cache`` — defends against the class
of bug where a stub matches the production assumption and
the wrong unit cancels out (the earlier "TTL: 0s" bug).
Reverts the just-pushed ``AddressResolver.async_request`` bypass —
the user's question pinned down the actual zeroconf contract:
``async_request`` (and inherited by ``AddressResolver``) calls
``_load_from_cache`` first and short-circuits if a complete
record set is in cache. The bypass would query unconditionally,
but that defeats the cache and adds multicast traffic for
already-fresh records. Better: schedule the loop so the wire
query happens when the cache is *actually* expired and
``async_resolve_host``'s existing fall-through behaviour does
the right thing.

New refresh loop shape:

* Read cached TTL_remaining for the device's A/AAAA records.
* If still alive (``> 0``): sleep ``ttl_remaining + padding`` and
  ``continue``. Wakes up just past expiry.
* If expired or absent: sleep just the padding (lets the
  subscription's initial snapshot land first), then call
  ``refresh_device_mdns`` — by now the cache is gone and the
  ``async_resolve_host`` cache-check fails, so the wire query
  actually fires.

The recheck-after-sleep matters: an unrelated mDNS announce
(another browser query, a startup probe from somewhere on the
LAN) can re-arm the cache during our sleep. The recheck spares
us a redundant wire query in that case — we just sleep again
for the new lifetime.

Also drops the ``is_expired`` filter in ``get_mdns_cache_info``.
The drawer wants the truthful "last seen" age even when A has
aged past its TTL: surfacing "Last seen 122s ago, TTL: 0s" is
more accurate than hiding the row entirely just because the
120s TTL ran out a second before our refresh tick fires. The
PTR record (4500s TTL) keeps the device's existence alive in
mDNS-land regardless of A/AAAA expiry. The row hides correctly
once the reaper actually evicts the record.

Tests: new ``test_refresh_loop_sleeps_until_cache_expiry`` pins
the ``ttl_remaining + padding`` arithmetic and
``test_refresh_loop_skips_wire_query_when_recheck_finds_fresh_cache``
pins the recheck-don't-query gate. Threshold / bypass tests
are gone.
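A sketch of the resulting loop shape; the padding value, helper names, and exact control flow are approximations of the commit text:

```python
# Approximate loop shape; not the merged implementation.
import asyncio

_PADDING = 1.0  # seconds past A/AAAA expiry before considering a probe


async def reachability_refresh_loop(state_monitor, device_name: str) -> None:
    while True:
        ttl_remaining = state_monitor.get_mdns_a_record_ttl_remaining(device_name)
        if ttl_remaining is not None and ttl_remaining > 0:
            # Cache still alive: wake up just past the address-record expiry.
            await asyncio.sleep(ttl_remaining + _PADDING)
            continue
        # Expired or absent: give the subscription's initial snapshot a
        # moment to land, then recheck; an unrelated announce may have
        # re-armed the cache while we slept.
        await asyncio.sleep(_PADDING)
        recheck = state_monitor.get_mdns_a_record_ttl_remaining(device_name)
        if recheck is not None and recheck > 0:
            continue
        # Cache is genuinely gone, so async_resolve_host's cache check fails
        # and the resolve actually goes out on the wire.
        await state_monitor.refresh_device_mdns(device_name)
```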
@bdraco bdraco marked this pull request as draft May 5, 2026 04:57

bdraco commented May 5, 2026

Marking as draft — getting late, can't finish testing tonight. Will resume tomorrow.

bdraco added 2 commits May 5, 2026 00:08
User feedback: the drawer's "Last seen" should reflect the
most recent ``DNSRecord.created`` across ANY mDNS record we
have for the device, not just the address records. Previously
the row showed "Waiting for first broadcast…" the moment the
A record expired (120s) even though the device's PTR was
kept fresh by the ``ServiceBrowser`` and the device was
plainly still around in mDNS-land — misleading: the dashboard
still showed the device as online, yet the drawer row hid.

``get_mdns_cache_info`` now reads:

* A and AAAA at ``<name>.local.``
* SRV and TXT at ``<name>._esphomelib._tcp.local.``
* PTR via the dedicated
  ``cache.current_entry_with_name_and_alias`` helper (built
  for exactly the "PTR-by-target" lookup we need).

Picks ``max(records, key=created)`` across the union. The
PTR's ~4500s TTL means it stays cached far longer than A's
120s, so it carries the row through brief A-expiry windows
when the device went quiet between announces. The row only
hides when the cache has nothing under any of the types
(brand-new device, never observed at all).

New integration test ``picks_latest_across_record_types``:
real ``DNSAddress`` (110s old, A) + real ``DNSPointer`` (5s
old, PTR) in a real ``Zeroconf.cache``, asserts the helper
returns the PTR's 5s age — pinning the multi-type lookup so
a refactor that drops any of the type queries surfaces here.
The earlier mock-based tests got the
``current_entry_with_name_and_alias`` stub they now need.
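A sketch of the lookup get_mdns_cache_info is described as doing, reduced to returning the freshest record; the zeroconf constant import locations and the argument order of current_entry_with_name_and_alias are assumptions:

```python
# Sketch only; constant locations and helper argument order are guesses.
from operator import attrgetter

from zeroconf import DNSCache, DNSRecord
from zeroconf.const import _CLASS_IN, _TYPE_A, _TYPE_AAAA, _TYPE_SRV, _TYPE_TXT

_ESPHOME_SERVICE = "_esphomelib._tcp.local."


def freshest_mdns_record(cache: DNSCache, name: str) -> DNSRecord | None:
    host = f"{name}.local."
    instance = f"{name}.{_ESPHOME_SERVICE}"
    records: list[DNSRecord] = []
    for type_ in (_TYPE_A, _TYPE_AAAA):
        records.extend(cache.get_all_by_details(host, type_, _CLASS_IN))
    for type_ in (_TYPE_SRV, _TYPE_TXT):
        records.extend(cache.get_all_by_details(instance, type_, _CLASS_IN))
    # PTR lives under the service type and points at the instance name.
    ptr = cache.current_entry_with_name_and_alias(_ESPHOME_SERVICE, instance)
    if ptr is not None:
        records.append(ptr)
    if not records:
        return None  # brand-new device, never observed at all
    # The freshest record of any type carries "Last seen"; the PTR's long
    # TTL bridges brief A/AAAA expiry windows between announces.
    return max(records, key=attrgetter("created"))
```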
The previous commit's union-of-types ``get_mdns_cache_info``
broke the refresh loop: PTR has a 4500s TTL, so picking
``ttl_remaining`` off the latest record (which is now usually
PTR) made the loop sleep ~75 minutes between probes — the
A record would expire 73 minutes earlier and the drawer would
silently lose its mDNS row in the gap.

Split the lookup:

* ``get_mdns_cache_info`` keeps the union-of-types semantics
  for the drawer's "Last seen" display — surfaces the freshest
  record across A / AAAA / SRV / TXT / PTR.
* New ``get_mdns_a_record_ttl_remaining`` returns the
  *minimum* remaining TTL across cached A/AAAA records (the
  ones that decay at 120s and we need to keep alive). Used by
  the refresh loop to schedule its next probe — sleeps until
  the address records actually need refreshing.

DRY-up: both helpers share a private ``_get_address_records``
that owns the ``cache.get_all_by_details(local_name, A/AAAA,
IN)`` lookup. ``get_mdns_a_record_ttl_remaining`` returns the
``min(remaining)`` (smaller TTL covers the case where one
family expires before the other); ``get_mdns_cache_info``
adds SRV/TXT plus the dedicated
``cache.current_entry_with_name_and_alias`` PTR lookup.

Tests: new ``test_get_mdns_a_record_ttl_remaining_*`` (real
``DNSAddress`` for both families, A=90s and AAAA=40s
remaining; pin AAAA wins, plus a no-records null-return).
Loop tests in ``test_subscribe_reachability.py`` switched
their stubs from ``get_mdns_cache_info.side_effect`` to the
new method's plain-float side_effect.
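A sketch of the split; import locations, return types, and the exact now/seconds handling are assumptions following the commit text:

```python
# Sketch of the shared address-record lookup plus the min-TTL helper.
from zeroconf import DNSCache, DNSRecord
from zeroconf.const import _CLASS_IN, _TYPE_A, _TYPE_AAAA
from zeroconf._utils.time import current_time_millis


def _get_address_records(cache: DNSCache, name: str) -> list[DNSRecord]:
    """The 120s-TTL A/AAAA records the refresh loop has to keep alive."""
    local_name = f"{name}.local."
    records: list[DNSRecord] = []
    for type_ in (_TYPE_A, _TYPE_AAAA):
        records.extend(cache.get_all_by_details(local_name, type_, _CLASS_IN))
    return records


def get_mdns_a_record_ttl_remaining(cache: DNSCache, name: str) -> float | None:
    """Smallest remaining TTL across both address families, or None."""
    records = _get_address_records(cache, name)
    if not records:
        return None
    now = current_time_millis()
    # min() so the loop wakes before either family has decayed.
    return min(record.get_remaining_ttl(now) for record in records)
```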
@bdraco bdraco marked this pull request as ready for review May 5, 2026 14:21
@bdraco bdraco merged commit 76db5e9 into main May 5, 2026
13 checks passed
@bdraco bdraco deleted the reachability-subscribe branch May 5, 2026 14:21