Stop dropping mDNS state on Device rebuilds; speed up apply_* lookups#75
Conversation
Atomic-write editors (vscode-on-macOS, etc.) can briefly remove a YAML mid-save; the scanner sees the file disappear, fires ``REMOVED``, then re-``ADDED`` on the next sweep with ``previous=None``. The rebuilt Device starts fresh (``deployed_config_hash=""``, ``api_encryption_active=None``, …) even though zeroconf still has the same TXT cached, so the expected re-population path is the eager probe + the next mDNS broadcast — same code that already runs on every announcement. The bug was that ``apply_ip`` / ``apply_version`` / ``apply_config_hash`` / ``apply_api_encryption`` deduped against their own monitor-side dicts. Those dicts persisted across the rebuild while the Device's fields had been wiped, so the next announcement matched the cache and short-circuited — leaving the drawer's "Deployed" hash stuck on an em-dash and the version / encryption indicators stale until the user re-flashed or otherwise forced a value change. Drop the dedupe dicts and instead compare against the configured devices' actual fields (the same shape ``apply()`` already uses for state). When any matching device's field differs from the broadcast value, fire — the per-device callback already filters no-op devices on its way through. Same semantics in the steady state, but a rebuilt Device is observable as drift and the next mDNS event repopulates it. Test side-effects updated to mirror production: the controller's real callbacks write the value back onto the device, and the new device-based dedupe relies on that write to settle. Without the side-effect a bare ``MagicMock`` callback would leave the field empty and every repeat call would (correctly) re-fire.
Previous shape walked the device list twice (list-comp to gather matches, then ``all(...)`` to dedupe). At fleet scale (~1000 devices, several mDNS broadcasts per second) that's measurable churn. Single-pass loop with an early break short-circuits the moment a stale match is found, and otherwise scans the list once per apply call.
There was a problem hiding this comment.
Pull request overview
Fixes an mDNS deduplication bug where rebuilt Device objects (e.g., after atomic-save scanner churn) could fail to re-apply cached mDNS TXT data because dedupe was keyed off monitor-side caches instead of the device model’s fields.
Changes:
- Remove monitor-local dedupe dictionaries for IP/version/config-hash/API-encryption and dedupe against the current
Devicefields instead. - Update mDNS-related tests to mirror production behavior where controller callbacks write observed values back onto the
Device. - Add a regression test covering the “device rebuild + cached TXT” scenario for
config_hash.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
esphome_device_builder/controllers/_device_state_monitor.py |
Switch dedupe logic to compare against live Device fields; remove monitor-side caches. |
tests/test_probe_device.py |
Adjust probe test setup to match removed monitor dedupe fields. |
tests/test_mdns_version.py |
Make test callback write back to Device.deployed_version to reflect production dedupe behavior. |
tests/test_mdns_config_hash.py |
Make test callback write back to Device.deployed_config_hash and add rebuild regression test. |
tests/test_mdns_api_encryption.py |
Make test callback write back to Device.api_encryption_active to reflect production dedupe behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The "no binary on disk → pending" check was firing first and overriding the hash comparison, so a device whose deployed and expected hashes both match still showed "Modified" in the dashboard whenever the local ``firmware.bin`` was absent. That combination is normal: - ``--only-generate`` writes ``build_info.json`` (giving an ``expected_config_hash``) but doesn't produce ``firmware.bin``. - A ``clean`` job wipes the build directory after a flash; the device keeps broadcasting its hash but the local artefact is gone. - A fresh checkout of the user's config_dir from version control has no compile output yet. When both hashes are known, that's authoritative — the running firmware was built from a config logically equivalent to what the YAML resolves to today, no matter whether we still have the local artefact. The bin / mtime fallback is only relevant when one or both hashes are missing.
mDNS / ping / MQTT observations key on the device's ``esphome.name`` and call the apply-* methods several times per broadcast. The previous shape walked ``self._scanner.devices`` on every call (O(N) per apply, O(N) on every dedupe check inside ``_any_matching_device_differs``); at fleet scale (~1000 devices, several broadcasts per second across the LAN) that's measurable churn for what is fundamentally a name-keyed lookup. Maintain ``_devices_by_name: dict[str, list[Device]]`` in the scanner alongside ``_devices``. The list values are by design — two YAMLs can share the same ``esphome.name`` (a config plus a ``foo (1).yaml`` copy, dashboard_import siblings, etc.) and a single broadcast must fan out to every match. Expose via ``get_by_name(name)``; wire ``DevicesController._devices_by_name`` and ``DeviceStateMonitor._find_device_by_name`` / ``_any_matching_device_differs`` to use it. Index maintenance is centralised in ``_set_device`` / ``_pop_device`` / ``_unindex_name`` so renames-in-YAML don't leave stale buckets. Tests that build a ``MagicMock`` scanner now also wire ``get_by_name`` so the lookup mirrors production. The monitor keeps a linear-scan fallback for callers that haven't passed ``get_devices_by_name`` (kept narrow so existing direct-construct tests don't need a parallel rewrite).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- ``DeviceScanner.get_by_name`` now returns a fresh list snapshot. Previously the method handed out a reference to the live bucket; a misbehaving caller could mutate it (``.clear()``, ``.sort()``) and quietly poison the name index for future lookups. Matches the semantics of the ``devices`` property. - ``_set_device`` keeps each bucket sorted by ``configuration`` filename. Insertion order used to depend on ``paths_to_load`` (a set), so ``bucket[0]`` consumers — including ``_find_device_by_name`` and the apply / dedupe path — saw a non-deterministic "first match" for duplicate-named YAMLs and could flip-flop across scans. - ``apply()`` now dedupes against every matching device's state, not just the first. With two YAMLs sharing an ``esphome.name``, one sibling could be in-sync while the other was rebuilt with state=UNKNOWN; the old "first device matches → bail" path left the stale sibling stuck. - Reworded the priority-2 docstring in ``compute_has_pending_changes`` to reflect the actual condition (no comparable hash pair AND no local binary), since with the reorder #2 can fire even when one hash side has been broadcast. - New regression tests: - ``test_get_by_name_returns_a_snapshot`` - ``test_index_bucket_order_is_deterministic`` - ``test_apply_state_repairs_stale_sibling_when_first_match_is_in_sync``
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Document the architecture conventions established by PR #75: the Device is the source of truth (apply-* methods dedupe against device fields, not a monitor-side cache), the scanner's name- keyed index, and apply-* dedupe over every matching device rather than ``bucket[0]``. - Add a "things that have bitten us" entry for atomic-save editor churn (REMOVED+re-ADDED with previous=None) and the matching test-helper requirement that mock callbacks mirror production's state mutation. - Note the workflow conventions surfaced this session: ``uv run pytest`` instead of bare ``python -m pytest``, and PowerShell- on-Windows-runners doesn't accept bash-style ``\`` line continuations. - Add a "Comparison with the legacy esphome dashboard" pointer plus a lessons-about-the-comparison section. ``compare_legacy.md`` is the working-document analysis of where legacy behaviour hasn't yet been ported; checked in so future sessions can read it instead of re-deriving the gaps.
The handler-level tests in ``tests/controllers/devices/`` all bypass ``__init__`` via ``__new__`` and stub the inner controllers individually. That left the wiring code itself uncovered: ``__init__`` (lines 81-129) constructing the scanner / state monitor / MQTT coordinator and threading their nine callbacks back to the controller, plus ``start`` / ``stop`` / ``poll`` (lines 142-164). 6 tests instantiate a real ``DevicesController`` against a ``tmp_path`` config dir and a thin stub ``DeviceBuilder``: - ``__init__`` constructs all three inner controllers, sets the documented default attrs, wires the scanner against ``tmp_path``. - State-monitor callbacks point back at ``self._on_*_change`` — pin the seven distinct wires (regression class from PR #75). - ``start`` resolves esphome cmd, loads ignored, scans, starts the state monitor, reconciles MQTT, registers the JOB_COMPLETED bus listener. - ``stop`` unsubscribes the bus listener and stops both monitors. - ``stop`` is idempotent without a started listener (cold process restart). - ``poll`` rescans + reconciles MQTT. Inner monitor lifecycle methods are patched as ``AsyncMock`` so ``start`` / ``stop`` don't try to open a zeroconf browser or connect to MQTT — those have their own dedicated test files.
The handler-level tests in ``tests/controllers/devices/`` all bypass ``__init__`` via ``__new__`` and stub the inner controllers individually. That left the wiring code itself uncovered: ``__init__`` (lines 81-129) constructing the scanner / state monitor / MQTT coordinator and threading their nine callbacks back to the controller, plus ``start`` / ``stop`` / ``poll`` (lines 142-164). 6 tests instantiate a real ``DevicesController`` against a ``tmp_path`` config dir and a thin stub ``DeviceBuilder``: - ``__init__`` constructs all three inner controllers, sets the documented default attrs, wires the scanner against ``tmp_path``. - State-monitor callbacks point back at ``self._on_*_change`` — pin the seven distinct wires (regression class from PR #75). - ``start`` resolves esphome cmd, loads ignored, scans, starts the state monitor, reconciles MQTT, registers the JOB_COMPLETED bus listener. - ``stop`` unsubscribes the bus listener and stops both monitors. - ``stop`` is idempotent without a started listener (cold process restart). - ``poll`` rescans + reconciles MQTT. Inner monitor lifecycle methods are patched as ``AsyncMock`` so ``start`` / ``stop`` don't try to open a zeroconf browser or connect to MQTT — those have their own dedicated test files.
The handler-level tests in ``tests/controllers/devices/`` all bypass ``__init__`` via ``__new__`` and stub the inner controllers individually. That left the wiring code itself uncovered: ``__init__`` (lines 81-129) constructing the scanner / state monitor / MQTT coordinator and threading their nine callbacks back to the controller, plus ``start`` / ``stop`` / ``poll`` (lines 142-164). 6 tests instantiate a real ``DevicesController`` against a ``tmp_path`` config dir and a thin stub ``DeviceBuilder``: - ``__init__`` constructs all three inner controllers, sets the documented default attrs, wires the scanner against ``tmp_path``. - State-monitor callbacks point back at ``self._on_*_change`` — pin the seven distinct wires (regression class from PR #75). - ``start`` resolves esphome cmd, loads ignored, scans, starts the state monitor, reconciles MQTT, registers the JOB_COMPLETED bus listener. - ``stop`` unsubscribes the bus listener and stops both monitors. - ``stop`` is idempotent without a started listener (cold process restart). - ``poll`` rescans + reconciles MQTT. Inner monitor lifecycle methods are patched as ``AsyncMock`` so ``start`` / ``stop`` don't try to open a zeroconf browser or connect to MQTT — those have their own dedicated test files.
The handler-level tests in ``tests/controllers/devices/`` all bypass ``__init__`` via ``__new__`` and stub the inner controllers individually. That left the wiring code itself uncovered: ``__init__`` (lines 81-129) constructing the scanner / state monitor / MQTT coordinator and threading their nine callbacks back to the controller, plus ``start`` / ``stop`` / ``poll`` (lines 142-164). 6 tests instantiate a real ``DevicesController`` against a ``tmp_path`` config dir and a thin stub ``DeviceBuilder``: - ``__init__`` constructs all three inner controllers, sets the documented default attrs, wires the scanner against ``tmp_path``. - State-monitor callbacks point back at ``self._on_*_change`` — pin the seven distinct wires (regression class from PR #75). - ``start`` resolves esphome cmd, loads ignored, scans, starts the state monitor, reconciles MQTT, registers the JOB_COMPLETED bus listener. - ``stop`` unsubscribes the bus listener and stops both monitors. - ``stop`` is idempotent without a started listener (cold process restart). - ``poll`` rescans + reconciles MQTT. Inner monitor lifecycle methods are patched as ``AsyncMock`` so ``start`` / ``stop`` don't try to open a zeroconf browser or connect to MQTT — those have their own dedicated test files.
The handler-level tests in ``tests/controllers/devices/`` all bypass ``__init__`` via ``__new__`` and stub the inner controllers individually. That left the wiring code itself uncovered: ``__init__`` (lines 81-129) constructing the scanner / state monitor / MQTT coordinator and threading their nine callbacks back to the controller, plus ``start`` / ``stop`` / ``poll`` (lines 142-164). 6 tests instantiate a real ``DevicesController`` against a ``tmp_path`` config dir and a thin stub ``DeviceBuilder``: - ``__init__`` constructs all three inner controllers, sets the documented default attrs, wires the scanner against ``tmp_path``. - State-monitor callbacks point back at ``self._on_*_change`` — pin the seven distinct wires (regression class from PR #75). - ``start`` resolves esphome cmd, loads ignored, scans, starts the state monitor, reconciles MQTT, registers the JOB_COMPLETED bus listener. - ``stop`` unsubscribes the bus listener and stops both monitors. - ``stop`` is idempotent without a started listener (cold process restart). - ``poll`` rescans + reconciles MQTT. Inner monitor lifecycle methods are patched as ``AsyncMock`` so ``start`` / ``stop`` don't try to open a zeroconf browser or connect to MQTT — those have their own dedicated test files.
Summary
Three connected fixes for the mDNS state-tracking layer, all surfaced by the same dashboard symptom: "I saved the YAML, hashes match, but the device still shows Modified / Deployed=—."
1. Dedupe
apply_*against device state, not a separate cacheAtomic-write editors (vscode-on-macOS, etc.) can briefly remove a YAML mid-save; the scanner sees the file disappear, fires
REMOVED, then re-ADDEDwithprevious=None. The rebuiltDevicestarts fresh (deployed_config_hash="",api_encryption_active=None, …) even though zeroconf still has the same TXT cached. The bug was thatapply_ip/apply_version/apply_config_hash/apply_api_encryptiondeduped against monitor-side dicts. Those dicts persisted across the rebuild while theDevice's fields had been wiped, so the next mDNS announcement matched the cache and short-circuited — leaving the drawer's "Deployed" hash stuck on an em-dash.Drop the dicts; compare against the configured devices' actual fields (the same shape
apply()already uses for state). When any matching device's field differs from the broadcast value, fire — the per-device callback already filters no-op devices on its way through. Single-pass scan with an early break in_any_matching_device_differs.2. Hash match beats missing
firmware.bininhas_pending_changesEven after #1, the dashboard still showed "Modified" when both hashes matched.
compute_has_pending_changeswas checking "no binary on disk → pending" before the hash comparison, so a device whose deployed and expected hashes both matched still came back pending whenever the localfirmware.binwas absent. That combination is normal:--only-generatewritesbuild_info.json(giving anexpected_config_hash) but doesn't producefirmware.bin.cleanjob wipes the build directory after a flash; the device keeps broadcasting its hash but the local artefact is gone.When both hashes are known, the comparison is authoritative — the running firmware was built from a config logically equivalent to what the YAML resolves to today, regardless of whether we still have the local artefact. The bin / mtime fallback only matters when one or both hashes are missing.
3. Name-keyed scanner index for O(1)
apply_*lookupsEvery mDNS broadcast triggers up to four
apply_*calls; each was walkingself._scanner.deviceslinearly. At fleet scale (~1000 devices, several broadcasts per second across the LAN) that's measurable churn for what is fundamentally a name-keyed lookup. The scanner now maintains_devices_by_name: dict[str, list[Device]]alongside_devices, exposed viaget_by_name(name). The list values are by design — two YAMLs can share the sameesphome.name(a config plus afoo (1).yamlcopy,dashboard_importsiblings, etc.) and a single broadcast must fan out to every match. Index maintenance is centralised in_set_device/_pop_device/_unindex_nameso renames-in-YAML don't leave stale buckets.Reproduction (pre-fix)
config_hash([mdns] Broadcast config_hash TXT record on _esphomelib._tcp esphome#16145).—and the card stays "Modified" forever.After this PR: the next mDNS announcement re-populates the rebuilt device, and the "Modified" pill clears as soon as the hashes match.
Test plan
uv run pytest tests/— 360 passed, 3 skippedtest_apply_config_hash_refires_after_device_rebuildexercises the rebuild scenariotest_in_sync_when_hashes_match_even_without_local_binarylocks down the priority change