Stop dropping mDNS state on Device rebuilds; speed up apply_* lookups by bdraco · Pull Request #75 · esphome/device-builder

bdraco · 2026-05-02T07:35:35Z

Summary

Three connected fixes for the mDNS state-tracking layer, all surfaced by the same dashboard symptom: "I saved the YAML, hashes match, but the device still shows Modified / Deployed=—."

1. Dedupe `apply_*` against device state, not a separate cache

Atomic-write editors (vscode-on-macOS, etc.) can briefly remove a YAML mid-save; the scanner sees the file disappear, fires REMOVED, then re-ADDED with previous=None. The rebuilt Device starts fresh (deployed_config_hash="", api_encryption_active=None, …) even though zeroconf still has the same TXT cached. The bug was that apply_ip / apply_version / apply_config_hash / apply_api_encryption deduped against monitor-side dicts. Those dicts persisted across the rebuild while the Device's fields had been wiped, so the next mDNS announcement matched the cache and short-circuited — leaving the drawer's "Deployed" hash stuck on an em-dash.

Drop the dicts; compare against the configured devices' actual fields (the same shape apply() already uses for state). When any matching device's field differs from the broadcast value, fire — the per-device callback already filters no-op devices on its way through. Single-pass scan with an early break in _any_matching_device_differs.

2. Hash match beats missing `firmware.bin` in `has_pending_changes`

Even after #1, the dashboard still showed "Modified" when both hashes matched. compute_has_pending_changes was checking "no binary on disk → pending" before the hash comparison, so a device whose deployed and expected hashes both matched still came back pending whenever the local firmware.bin was absent. That combination is normal:

--only-generate writes build_info.json (giving an expected_config_hash) but doesn't produce firmware.bin.
A clean job wipes the build directory after a flash; the device keeps broadcasting its hash but the local artefact is gone.
A fresh checkout of the user's config_dir from version control has no compile output yet.

When both hashes are known, the comparison is authoritative — the running firmware was built from a config logically equivalent to what the YAML resolves to today, regardless of whether we still have the local artefact. The bin / mtime fallback only matters when one or both hashes are missing.

3. Name-keyed scanner index for O(1) `apply_*` lookups

Every mDNS broadcast triggers up to four apply_* calls; each was walking self._scanner.devices linearly. At fleet scale (~1000 devices, several broadcasts per second across the LAN) that's measurable churn for what is fundamentally a name-keyed lookup. The scanner now maintains _devices_by_name: dict[str, list[Device]] alongside _devices, exposed via get_by_name(name). The list values are by design — two YAMLs can share the same esphome.name (a config plus a foo (1).yaml copy, dashboard_import siblings, etc.) and a single broadcast must fan out to every match. Index maintenance is centralised in _set_device / _pop_device / _unindex_name so renames-in-YAML don't leave stale buckets.

Reproduction (pre-fix)

Configure a device whose firmware broadcasts config_hash ([mdns] Broadcast config_hash TXT record on _esphomelib._tcp esphome#16145).
Wait for the dashboard to apply the broadcast — drawer's "Deployed" populates.
Edit the YAML in vscode (or any editor doing atomic-rename saves) and save.
Drawer's "Deployed" config hash flips back to — and the card stays "Modified" forever.

After this PR: the next mDNS announcement re-populates the rebuilt device, and the "Modified" pill clears as soon as the hashes match.

Test plan

uv run pytest tests/ — 360 passed, 3 skipped
New regression test test_apply_config_hash_refires_after_device_rebuild exercises the rebuild scenario
New regression test test_in_sync_when_hashes_match_even_without_local_binary locks down the priority change
Test side-effects updated to mirror production (controller callbacks now write the value back onto the device, since the new dedupe relies on that write to settle)

Atomic-write editors (vscode-on-macOS, etc.) can briefly remove a YAML mid-save; the scanner sees the file disappear, fires ``REMOVED``, then re-``ADDED`` on the next sweep with ``previous=None``. The rebuilt Device starts fresh (``deployed_config_hash=""``, ``api_encryption_active=None``, …) even though zeroconf still has the same TXT cached, so the expected re-population path is the eager probe + the next mDNS broadcast — same code that already runs on every announcement. The bug was that ``apply_ip`` / ``apply_version`` / ``apply_config_hash`` / ``apply_api_encryption`` deduped against their own monitor-side dicts. Those dicts persisted across the rebuild while the Device's fields had been wiped, so the next announcement matched the cache and short-circuited — leaving the drawer's "Deployed" hash stuck on an em-dash and the version / encryption indicators stale until the user re-flashed or otherwise forced a value change. Drop the dedupe dicts and instead compare against the configured devices' actual fields (the same shape ``apply()`` already uses for state). When any matching device's field differs from the broadcast value, fire — the per-device callback already filters no-op devices on its way through. Same semantics in the steady state, but a rebuilt Device is observable as drift and the next mDNS event repopulates it. Test side-effects updated to mirror production: the controller's real callbacks write the value back onto the device, and the new device-based dedupe relies on that write to settle. Without the side-effect a bare ``MagicMock`` callback would leave the field empty and every repeat call would (correctly) re-fire.

Previous shape walked the device list twice (list-comp to gather matches, then ``all(...)`` to dedupe). At fleet scale (~1000 devices, several mDNS broadcasts per second) that's measurable churn. Single-pass loop with an early break short-circuits the moment a stale match is found, and otherwise scans the list once per apply call.

Copilot

Pull request overview

Fixes an mDNS deduplication bug where rebuilt Device objects (e.g., after atomic-save scanner churn) could fail to re-apply cached mDNS TXT data because dedupe was keyed off monitor-side caches instead of the device model’s fields.

Changes:

Remove monitor-local dedupe dictionaries for IP/version/config-hash/API-encryption and dedupe against the current Device fields instead.
Update mDNS-related tests to mirror production behavior where controller callbacks write observed values back onto the Device.
Add a regression test covering the “device rebuild + cached TXT” scenario for config_hash.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
`esphome_device_builder/controllers/_device_state_monitor.py`	Switch dedupe logic to compare against live `Device` fields; remove monitor-side caches.
`tests/test_probe_device.py`	Adjust probe test setup to match removed monitor dedupe fields.
`tests/test_mdns_version.py`	Make test callback write back to `Device.deployed_version` to reflect production dedupe behavior.
`tests/test_mdns_config_hash.py`	Make test callback write back to `Device.deployed_config_hash` and add rebuild regression test.
`tests/test_mdns_api_encryption.py`	Make test callback write back to `Device.api_encryption_active` to reflect production dedupe behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The "no binary on disk → pending" check was firing first and overriding the hash comparison, so a device whose deployed and expected hashes both match still showed "Modified" in the dashboard whenever the local ``firmware.bin`` was absent. That combination is normal: - ``--only-generate`` writes ``build_info.json`` (giving an ``expected_config_hash``) but doesn't produce ``firmware.bin``. - A ``clean`` job wipes the build directory after a flash; the device keeps broadcasting its hash but the local artefact is gone. - A fresh checkout of the user's config_dir from version control has no compile output yet. When both hashes are known, that's authoritative — the running firmware was built from a config logically equivalent to what the YAML resolves to today, no matter whether we still have the local artefact. The bin / mtime fallback is only relevant when one or both hashes are missing.

mDNS / ping / MQTT observations key on the device's ``esphome.name`` and call the apply-* methods several times per broadcast. The previous shape walked ``self._scanner.devices`` on every call (O(N) per apply, O(N) on every dedupe check inside ``_any_matching_device_differs``); at fleet scale (~1000 devices, several broadcasts per second across the LAN) that's measurable churn for what is fundamentally a name-keyed lookup. Maintain ``_devices_by_name: dict[str, list[Device]]`` in the scanner alongside ``_devices``. The list values are by design — two YAMLs can share the same ``esphome.name`` (a config plus a ``foo (1).yaml`` copy, dashboard_import siblings, etc.) and a single broadcast must fan out to every match. Expose via ``get_by_name(name)``; wire ``DevicesController._devices_by_name`` and ``DeviceStateMonitor._find_device_by_name`` / ``_any_matching_device_differs`` to use it. Index maintenance is centralised in ``_set_device`` / ``_pop_device`` / ``_unindex_name`` so renames-in-YAML don't leave stale buckets. Tests that build a ``MagicMock`` scanner now also wire ``get_by_name`` so the lookup mirrors production. The monitor keeps a linear-scan fallback for callers that haven't passed ``get_devices_by_name`` (kept narrow so existing direct-construct tests don't need a parallel rewrite).

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- ``DeviceScanner.get_by_name`` now returns a fresh list snapshot. Previously the method handed out a reference to the live bucket; a misbehaving caller could mutate it (``.clear()``, ``.sort()``) and quietly poison the name index for future lookups. Matches the semantics of the ``devices`` property. - ``_set_device`` keeps each bucket sorted by ``configuration`` filename. Insertion order used to depend on ``paths_to_load`` (a set), so ``bucket[0]`` consumers — including ``_find_device_by_name`` and the apply / dedupe path — saw a non-deterministic "first match" for duplicate-named YAMLs and could flip-flop across scans. - ``apply()`` now dedupes against every matching device's state, not just the first. With two YAMLs sharing an ``esphome.name``, one sibling could be in-sync while the other was rebuilt with state=UNKNOWN; the old "first device matches → bail" path left the stale sibling stuck. - Reworded the priority-2 docstring in ``compute_has_pending_changes`` to reflect the actual condition (no comparable hash pair AND no local binary), since with the reorder #2 can fire even when one hash side has been broadcast. - New regression tests: - ``test_get_by_name_returns_a_snapshot`` - ``test_index_bucket_order_is_deterministic`` - ``test_apply_state_repairs_stale_sibling_when_first_match_is_in_sync``

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Document the architecture conventions established by PR #75: the Device is the source of truth (apply-* methods dedupe against device fields, not a monitor-side cache), the scanner's name- keyed index, and apply-* dedupe over every matching device rather than ``bucket[0]``. - Add a "things that have bitten us" entry for atomic-save editor churn (REMOVED+re-ADDED with previous=None) and the matching test-helper requirement that mock callbacks mirror production's state mutation. - Note the workflow conventions surfaced this session: ``uv run pytest`` instead of bare ``python -m pytest``, and PowerShell- on-Windows-runners doesn't accept bash-style ``\`` line continuations. - Add a "Comparison with the legacy esphome dashboard" pointer plus a lessons-about-the-comparison section. ``compare_legacy.md`` is the working-document analysis of where legacy behaviour hasn't yet been ported; checked in so future sessions can read it instead of re-deriving the gaps.

The handler-level tests in ``tests/controllers/devices/`` all bypass ``__init__`` via ``__new__`` and stub the inner controllers individually. That left the wiring code itself uncovered: ``__init__`` (lines 81-129) constructing the scanner / state monitor / MQTT coordinator and threading their nine callbacks back to the controller, plus ``start`` / ``stop`` / ``poll`` (lines 142-164). 6 tests instantiate a real ``DevicesController`` against a ``tmp_path`` config dir and a thin stub ``DeviceBuilder``: - ``__init__`` constructs all three inner controllers, sets the documented default attrs, wires the scanner against ``tmp_path``. - State-monitor callbacks point back at ``self._on_*_change`` — pin the seven distinct wires (regression class from PR #75). - ``start`` resolves esphome cmd, loads ignored, scans, starts the state monitor, reconciles MQTT, registers the JOB_COMPLETED bus listener. - ``stop`` unsubscribes the bus listener and stops both monitors. - ``stop`` is idempotent without a started listener (cold process restart). - ``poll`` rescans + reconciles MQTT. Inner monitor lifecycle methods are patched as ``AsyncMock`` so ``start`` / ``stop`` don't try to open a zeroconf browser or connect to MQTT — those have their own dedicated test files.

Copilot AI review requested due to automatic review settings May 2, 2026 07:35

bdraco added the bugfix Bug fix label May 2, 2026

Copilot started reviewing on behalf of bdraco May 2, 2026 07:36 View session

Copilot AI reviewed May 2, 2026

View reviewed changes

bdraco added 2 commits May 2, 2026 02:44

bdraco changed the title ~~Dedupe mDNS apply_* against device state, not a separate cache~~ Stop dropping mDNS state on Device rebuilds; speed up apply_* lookups May 2, 2026

bdraco requested a review from Copilot May 2, 2026 07:46

Copilot started reviewing on behalf of bdraco May 2, 2026 07:46 View session

Copilot AI reviewed May 2, 2026

View reviewed changes

bdraco requested a review from Copilot May 2, 2026 07:55

Copilot started reviewing on behalf of bdraco May 2, 2026 07:56 View session

Copilot AI reviewed May 2, 2026

View reviewed changes

bdraco merged commit ab2ceb8 into main May 2, 2026
13 checks passed

bdraco deleted the dedupe-against-device-state branch May 2, 2026 07:59

bdraco mentioned this pull request May 2, 2026

Capture session lessons in CLAUDE.md and add legacy-comparison notes #86

Merged

2 tasks

bdraco mentioned this pull request May 3, 2026

Cover DevicesController lifecycle end-to-end #195

Merged

2 tasks

bdraco mentioned this pull request May 6, 2026

Carry api_encryption_active across scanner reloads #365

Merged

16 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Stop dropping mDNS state on Device rebuilds; speed up apply_* lookups#75

Stop dropping mDNS state on Device rebuilds; speed up apply_* lookups#75
bdraco merged 5 commits into
mainfrom
dedupe-against-device-state

bdraco commented May 2, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

bdraco commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Dedupe apply_* against device state, not a separate cache

2. Hash match beats missing firmware.bin in has_pending_changes

3. Name-keyed scanner index for O(1) apply_* lookups

Reproduction (pre-fix)

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bdraco commented May 2, 2026 •

edited

Loading

1. Dedupe `apply_*` against device state, not a separate cache

2. Hash match beats missing `firmware.bin` in `has_pending_changes`

3. Name-keyed scanner index for O(1) `apply_*` lookups