Skip to content

Stop dropping mDNS state on Device rebuilds; speed up apply_* lookups#75

Merged
bdraco merged 5 commits into
mainfrom
dedupe-against-device-state
May 2, 2026
Merged

Stop dropping mDNS state on Device rebuilds; speed up apply_* lookups#75
bdraco merged 5 commits into
mainfrom
dedupe-against-device-state

Conversation

@bdraco
Copy link
Copy Markdown
Member

@bdraco bdraco commented May 2, 2026

Summary

Three connected fixes for the mDNS state-tracking layer, all surfaced by the same dashboard symptom: "I saved the YAML, hashes match, but the device still shows Modified / Deployed=—."

1. Dedupe apply_* against device state, not a separate cache

Atomic-write editors (vscode-on-macOS, etc.) can briefly remove a YAML mid-save; the scanner sees the file disappear, fires REMOVED, then re-ADDED with previous=None. The rebuilt Device starts fresh (deployed_config_hash="", api_encryption_active=None, …) even though zeroconf still has the same TXT cached. The bug was that apply_ip / apply_version / apply_config_hash / apply_api_encryption deduped against monitor-side dicts. Those dicts persisted across the rebuild while the Device's fields had been wiped, so the next mDNS announcement matched the cache and short-circuited — leaving the drawer's "Deployed" hash stuck on an em-dash.

Drop the dicts; compare against the configured devices' actual fields (the same shape apply() already uses for state). When any matching device's field differs from the broadcast value, fire — the per-device callback already filters no-op devices on its way through. Single-pass scan with an early break in _any_matching_device_differs.

2. Hash match beats missing firmware.bin in has_pending_changes

Even after #1, the dashboard still showed "Modified" when both hashes matched. compute_has_pending_changes was checking "no binary on disk → pending" before the hash comparison, so a device whose deployed and expected hashes both matched still came back pending whenever the local firmware.bin was absent. That combination is normal:

  • --only-generate writes build_info.json (giving an expected_config_hash) but doesn't produce firmware.bin.
  • A clean job wipes the build directory after a flash; the device keeps broadcasting its hash but the local artefact is gone.
  • A fresh checkout of the user's config_dir from version control has no compile output yet.

When both hashes are known, the comparison is authoritative — the running firmware was built from a config logically equivalent to what the YAML resolves to today, regardless of whether we still have the local artefact. The bin / mtime fallback only matters when one or both hashes are missing.

3. Name-keyed scanner index for O(1) apply_* lookups

Every mDNS broadcast triggers up to four apply_* calls; each was walking self._scanner.devices linearly. At fleet scale (~1000 devices, several broadcasts per second across the LAN) that's measurable churn for what is fundamentally a name-keyed lookup. The scanner now maintains _devices_by_name: dict[str, list[Device]] alongside _devices, exposed via get_by_name(name). The list values are by design — two YAMLs can share the same esphome.name (a config plus a foo (1).yaml copy, dashboard_import siblings, etc.) and a single broadcast must fan out to every match. Index maintenance is centralised in _set_device / _pop_device / _unindex_name so renames-in-YAML don't leave stale buckets.

Reproduction (pre-fix)

  1. Configure a device whose firmware broadcasts config_hash ([mdns] Broadcast config_hash TXT record on _esphomelib._tcp esphome#16145).
  2. Wait for the dashboard to apply the broadcast — drawer's "Deployed" populates.
  3. Edit the YAML in vscode (or any editor doing atomic-rename saves) and save.
  4. Drawer's "Deployed" config hash flips back to and the card stays "Modified" forever.

After this PR: the next mDNS announcement re-populates the rebuilt device, and the "Modified" pill clears as soon as the hashes match.

Test plan

  • uv run pytest tests/ — 360 passed, 3 skipped
  • New regression test test_apply_config_hash_refires_after_device_rebuild exercises the rebuild scenario
  • New regression test test_in_sync_when_hashes_match_even_without_local_binary locks down the priority change
  • Test side-effects updated to mirror production (controller callbacks now write the value back onto the device, since the new dedupe relies on that write to settle)

Atomic-write editors (vscode-on-macOS, etc.) can briefly remove a
YAML mid-save; the scanner sees the file disappear, fires
``REMOVED``, then re-``ADDED`` on the next sweep with
``previous=None``. The rebuilt Device starts fresh
(``deployed_config_hash=""``, ``api_encryption_active=None``, …)
even though zeroconf still has the same TXT cached, so the
expected re-population path is the eager probe + the next mDNS
broadcast — same code that already runs on every announcement.

The bug was that ``apply_ip`` / ``apply_version`` /
``apply_config_hash`` / ``apply_api_encryption`` deduped against
their own monitor-side dicts. Those dicts persisted across the
rebuild while the Device's fields had been wiped, so the next
announcement matched the cache and short-circuited — leaving the
drawer's "Deployed" hash stuck on an em-dash and the version /
encryption indicators stale until the user re-flashed or otherwise
forced a value change.

Drop the dedupe dicts and instead compare against the configured
devices' actual fields (the same shape ``apply()`` already uses
for state). When any matching device's field differs from the
broadcast value, fire — the per-device callback already filters
no-op devices on its way through. Same semantics in the steady
state, but a rebuilt Device is observable as drift and the next
mDNS event repopulates it.

Test side-effects updated to mirror production: the controller's
real callbacks write the value back onto the device, and the new
device-based dedupe relies on that write to settle. Without the
side-effect a bare ``MagicMock`` callback would leave the field
empty and every repeat call would (correctly) re-fire.
Copilot AI review requested due to automatic review settings May 2, 2026 07:35
@bdraco bdraco added the bugfix Bug fix label May 2, 2026
Previous shape walked the device list twice (list-comp to gather
matches, then ``all(...)`` to dedupe). At fleet scale (~1000
devices, several mDNS broadcasts per second) that's measurable
churn. Single-pass loop with an early break short-circuits the
moment a stale match is found, and otherwise scans the list once
per apply call.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an mDNS deduplication bug where rebuilt Device objects (e.g., after atomic-save scanner churn) could fail to re-apply cached mDNS TXT data because dedupe was keyed off monitor-side caches instead of the device model’s fields.

Changes:

  • Remove monitor-local dedupe dictionaries for IP/version/config-hash/API-encryption and dedupe against the current Device fields instead.
  • Update mDNS-related tests to mirror production behavior where controller callbacks write observed values back onto the Device.
  • Add a regression test covering the “device rebuild + cached TXT” scenario for config_hash.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
esphome_device_builder/controllers/_device_state_monitor.py Switch dedupe logic to compare against live Device fields; remove monitor-side caches.
tests/test_probe_device.py Adjust probe test setup to match removed monitor dedupe fields.
tests/test_mdns_version.py Make test callback write back to Device.deployed_version to reflect production dedupe behavior.
tests/test_mdns_config_hash.py Make test callback write back to Device.deployed_config_hash and add rebuild regression test.
tests/test_mdns_api_encryption.py Make test callback write back to Device.api_encryption_active to reflect production dedupe behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

bdraco added 2 commits May 2, 2026 02:44
The "no binary on disk → pending" check was firing first and
overriding the hash comparison, so a device whose deployed and
expected hashes both match still showed "Modified" in the
dashboard whenever the local ``firmware.bin`` was absent. That
combination is normal:
- ``--only-generate`` writes ``build_info.json`` (giving an
  ``expected_config_hash``) but doesn't produce ``firmware.bin``.
- A ``clean`` job wipes the build directory after a flash; the
  device keeps broadcasting its hash but the local artefact is
  gone.
- A fresh checkout of the user's config_dir from version control
  has no compile output yet.

When both hashes are known, that's authoritative — the running
firmware was built from a config logically equivalent to what the
YAML resolves to today, no matter whether we still have the local
artefact. The bin / mtime fallback is only relevant when one or
both hashes are missing.
mDNS / ping / MQTT observations key on the device's
``esphome.name`` and call the apply-* methods several times per
broadcast. The previous shape walked ``self._scanner.devices`` on
every call (O(N) per apply, O(N) on every dedupe check inside
``_any_matching_device_differs``); at fleet scale (~1000 devices,
several broadcasts per second across the LAN) that's measurable
churn for what is fundamentally a name-keyed lookup.

Maintain ``_devices_by_name: dict[str, list[Device]]`` in the
scanner alongside ``_devices``. The list values are by design —
two YAMLs can share the same ``esphome.name`` (a config plus a
``foo (1).yaml`` copy, dashboard_import siblings, etc.) and a
single broadcast must fan out to every match. Expose via
``get_by_name(name)``; wire ``DevicesController._devices_by_name``
and ``DeviceStateMonitor._find_device_by_name`` /
``_any_matching_device_differs`` to use it. Index maintenance is
centralised in ``_set_device`` / ``_pop_device`` /
``_unindex_name`` so renames-in-YAML don't leave stale buckets.

Tests that build a ``MagicMock`` scanner now also wire
``get_by_name`` so the lookup mirrors production. The monitor
keeps a linear-scan fallback for callers that haven't passed
``get_devices_by_name`` (kept narrow so existing direct-construct
tests don't need a parallel rewrite).
@bdraco bdraco changed the title Dedupe mDNS apply_* against device state, not a separate cache Stop dropping mDNS state on Device rebuilds; speed up apply_* lookups May 2, 2026
@bdraco bdraco requested a review from Copilot May 2, 2026 07:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread esphome_device_builder/helpers/device_yaml.py Outdated
Comment thread esphome_device_builder/controllers/_device_scanner.py Outdated
Comment thread esphome_device_builder/controllers/_device_scanner.py Outdated
Comment thread esphome_device_builder/controllers/_device_state_monitor.py
- ``DeviceScanner.get_by_name`` now returns a fresh list snapshot.
  Previously the method handed out a reference to the live bucket;
  a misbehaving caller could mutate it (``.clear()``, ``.sort()``)
  and quietly poison the name index for future lookups. Matches the
  semantics of the ``devices`` property.
- ``_set_device`` keeps each bucket sorted by ``configuration``
  filename. Insertion order used to depend on ``paths_to_load``
  (a set), so ``bucket[0]`` consumers — including
  ``_find_device_by_name`` and the apply / dedupe path — saw a
  non-deterministic "first match" for duplicate-named YAMLs and
  could flip-flop across scans.
- ``apply()`` now dedupes against every matching device's state,
  not just the first. With two YAMLs sharing an ``esphome.name``,
  one sibling could be in-sync while the other was rebuilt with
  state=UNKNOWN; the old "first device matches → bail" path left
  the stale sibling stuck.
- Reworded the priority-2 docstring in ``compute_has_pending_changes``
  to reflect the actual condition (no comparable hash pair AND no
  local binary), since with the reorder #2 can fire even when one
  hash side has been broadcast.
- New regression tests:
  - ``test_get_by_name_returns_a_snapshot``
  - ``test_index_bucket_order_is_deterministic``
  - ``test_apply_state_repairs_stale_sibling_when_first_match_is_in_sync``
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@bdraco bdraco merged commit ab2ceb8 into main May 2, 2026
13 checks passed
@bdraco bdraco deleted the dedupe-against-device-state branch May 2, 2026 07:59
bdraco added a commit that referenced this pull request May 2, 2026
- Document the architecture conventions established by PR #75: the
  Device is the source of truth (apply-* methods dedupe against
  device fields, not a monitor-side cache), the scanner's name-
  keyed index, and apply-* dedupe over every matching device
  rather than ``bucket[0]``.
- Add a "things that have bitten us" entry for atomic-save editor
  churn (REMOVED+re-ADDED with previous=None) and the matching
  test-helper requirement that mock callbacks mirror production's
  state mutation.
- Note the workflow conventions surfaced this session: ``uv run
  pytest`` instead of bare ``python -m pytest``, and PowerShell-
  on-Windows-runners doesn't accept bash-style ``\`` line
  continuations.
- Add a "Comparison with the legacy esphome dashboard" pointer
  plus a lessons-about-the-comparison section. ``compare_legacy.md``
  is the working-document analysis of where legacy behaviour
  hasn't yet been ported; checked in so future sessions can read
  it instead of re-deriving the gaps.
bdraco added a commit that referenced this pull request May 3, 2026
The handler-level tests in ``tests/controllers/devices/`` all
bypass ``__init__`` via ``__new__`` and stub the inner
controllers individually. That left the wiring code itself
uncovered: ``__init__`` (lines 81-129) constructing the
scanner / state monitor / MQTT coordinator and threading their
nine callbacks back to the controller, plus ``start`` / ``stop``
/ ``poll`` (lines 142-164).

6 tests instantiate a real ``DevicesController`` against a
``tmp_path`` config dir and a thin stub ``DeviceBuilder``:

- ``__init__`` constructs all three inner controllers, sets the
  documented default attrs, wires the scanner against
  ``tmp_path``.
- State-monitor callbacks point back at ``self._on_*_change`` —
  pin the seven distinct wires (regression class from PR #75).
- ``start`` resolves esphome cmd, loads ignored, scans, starts
  the state monitor, reconciles MQTT, registers the
  JOB_COMPLETED bus listener.
- ``stop`` unsubscribes the bus listener and stops both
  monitors.
- ``stop`` is idempotent without a started listener (cold
  process restart).
- ``poll`` rescans + reconciles MQTT.

Inner monitor lifecycle methods are patched as ``AsyncMock`` so
``start`` / ``stop`` don't try to open a zeroconf browser or
connect to MQTT — those have their own dedicated test files.
bdraco added a commit that referenced this pull request May 3, 2026
The handler-level tests in ``tests/controllers/devices/`` all
bypass ``__init__`` via ``__new__`` and stub the inner
controllers individually. That left the wiring code itself
uncovered: ``__init__`` (lines 81-129) constructing the
scanner / state monitor / MQTT coordinator and threading their
nine callbacks back to the controller, plus ``start`` / ``stop``
/ ``poll`` (lines 142-164).

6 tests instantiate a real ``DevicesController`` against a
``tmp_path`` config dir and a thin stub ``DeviceBuilder``:

- ``__init__`` constructs all three inner controllers, sets the
  documented default attrs, wires the scanner against
  ``tmp_path``.
- State-monitor callbacks point back at ``self._on_*_change`` —
  pin the seven distinct wires (regression class from PR #75).
- ``start`` resolves esphome cmd, loads ignored, scans, starts
  the state monitor, reconciles MQTT, registers the
  JOB_COMPLETED bus listener.
- ``stop`` unsubscribes the bus listener and stops both
  monitors.
- ``stop`` is idempotent without a started listener (cold
  process restart).
- ``poll`` rescans + reconciles MQTT.

Inner monitor lifecycle methods are patched as ``AsyncMock`` so
``start`` / ``stop`` don't try to open a zeroconf browser or
connect to MQTT — those have their own dedicated test files.
bdraco added a commit that referenced this pull request May 3, 2026
The handler-level tests in ``tests/controllers/devices/`` all
bypass ``__init__`` via ``__new__`` and stub the inner
controllers individually. That left the wiring code itself
uncovered: ``__init__`` (lines 81-129) constructing the
scanner / state monitor / MQTT coordinator and threading their
nine callbacks back to the controller, plus ``start`` / ``stop``
/ ``poll`` (lines 142-164).

6 tests instantiate a real ``DevicesController`` against a
``tmp_path`` config dir and a thin stub ``DeviceBuilder``:

- ``__init__`` constructs all three inner controllers, sets the
  documented default attrs, wires the scanner against
  ``tmp_path``.
- State-monitor callbacks point back at ``self._on_*_change`` —
  pin the seven distinct wires (regression class from PR #75).
- ``start`` resolves esphome cmd, loads ignored, scans, starts
  the state monitor, reconciles MQTT, registers the
  JOB_COMPLETED bus listener.
- ``stop`` unsubscribes the bus listener and stops both
  monitors.
- ``stop`` is idempotent without a started listener (cold
  process restart).
- ``poll`` rescans + reconciles MQTT.

Inner monitor lifecycle methods are patched as ``AsyncMock`` so
``start`` / ``stop`` don't try to open a zeroconf browser or
connect to MQTT — those have their own dedicated test files.
bdraco added a commit that referenced this pull request May 3, 2026
The handler-level tests in ``tests/controllers/devices/`` all
bypass ``__init__`` via ``__new__`` and stub the inner
controllers individually. That left the wiring code itself
uncovered: ``__init__`` (lines 81-129) constructing the
scanner / state monitor / MQTT coordinator and threading their
nine callbacks back to the controller, plus ``start`` / ``stop``
/ ``poll`` (lines 142-164).

6 tests instantiate a real ``DevicesController`` against a
``tmp_path`` config dir and a thin stub ``DeviceBuilder``:

- ``__init__`` constructs all three inner controllers, sets the
  documented default attrs, wires the scanner against
  ``tmp_path``.
- State-monitor callbacks point back at ``self._on_*_change`` —
  pin the seven distinct wires (regression class from PR #75).
- ``start`` resolves esphome cmd, loads ignored, scans, starts
  the state monitor, reconciles MQTT, registers the
  JOB_COMPLETED bus listener.
- ``stop`` unsubscribes the bus listener and stops both
  monitors.
- ``stop`` is idempotent without a started listener (cold
  process restart).
- ``poll`` rescans + reconciles MQTT.

Inner monitor lifecycle methods are patched as ``AsyncMock`` so
``start`` / ``stop`` don't try to open a zeroconf browser or
connect to MQTT — those have their own dedicated test files.
bdraco added a commit that referenced this pull request May 3, 2026
The handler-level tests in ``tests/controllers/devices/`` all
bypass ``__init__`` via ``__new__`` and stub the inner
controllers individually. That left the wiring code itself
uncovered: ``__init__`` (lines 81-129) constructing the
scanner / state monitor / MQTT coordinator and threading their
nine callbacks back to the controller, plus ``start`` / ``stop``
/ ``poll`` (lines 142-164).

6 tests instantiate a real ``DevicesController`` against a
``tmp_path`` config dir and a thin stub ``DeviceBuilder``:

- ``__init__`` constructs all three inner controllers, sets the
  documented default attrs, wires the scanner against
  ``tmp_path``.
- State-monitor callbacks point back at ``self._on_*_change`` —
  pin the seven distinct wires (regression class from PR #75).
- ``start`` resolves esphome cmd, loads ignored, scans, starts
  the state monitor, reconciles MQTT, registers the
  JOB_COMPLETED bus listener.
- ``stop`` unsubscribes the bus listener and stops both
  monitors.
- ``stop`` is idempotent without a started listener (cold
  process restart).
- ``poll`` rescans + reconciles MQTT.

Inner monitor lifecycle methods are patched as ``AsyncMock`` so
``start`` / ``stop`` don't try to open a zeroconf browser or
connect to MQTT — those have their own dedicated test files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants