Motivation
Two recent failure modes in operational use of talm apply motivate this:
- Declared but absent: values reference a host resource that doesn't exist on the target, such as a typo'd link name or a disk selector that matches zero disks. Apply succeeds; the failure surfaces on the next boot (link eth0_typo missing) or at install time (no disk matches model=Samsumg). Worst case: a silent install on the wrong disk.
- Declared but not realized: values declare just eth0 after migration from a previous topology that had eth1. After apply, the on-node state contains eth0 (correct) AND a leftover eth1 (from before). Or the Talos parser silently drops a doc with an unknown field. Or a controller reverts state post-apply. In every case the operator sees "apply: success" and never finds out that the on-node state diverged from intent.
A pair of apply-time gates closes both classes:
- Phase 1 — before apply, verify every declared host resource exists on the node.
- Phase 2 — before apply, preview the diff against on-node state; after apply, verify on-node state matches what was sent.
Today neither check exists. The version-mismatch warning in pkg/commands/preflight.go (preflightCheckTalosVersion) is the only pre-apply check at all.
Phase 1 — declared-resource existence
Walk the rendered MachineConfig per target node, collect every reference to a host-side resource, verify each against the node's COSI snapshot. Block apply on mismatch by default.
In scope
Network link references:
- v1.11: machine.network.interfaces[].interface
- v1.12 multi-doc: LinkConfig.name, BondConfig.bondLinks[], VLANConfig.link, BridgeConfig.bridgedLinks[], Layer2VIPConfig.link, plus addresses[].linkName / routes[].linkName in multi-doc network configs
- Validation: exact-match against metadata.id values from lookup "links".
Install disk:
- machine.install.disk (literal path): must exist in lookup "disks" candidates.
- machine.install.diskSelector (size / model / serial / wwid / type / busPath filter): must match >= 1 disk. Zero matches = blocker. Multiple matches = warning (install picks the first, which is fragile).
User volume / extra disks:
- UserVolumeConfig.provisioning.diskSelector (v1.12)
- machine.disks[].device (v1.11 extra-disk partitioning)
- Same rules as install disk.
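The matching rules above can be sketched in a few lines of Go. This is an illustration under stated assumptions, not talm code: the Disk type, checkLinks, and checkDiskSelector are hypothetical names, the link snapshot is reduced to a set of names, and the selector is reduced to an exact comparison on model (real selectors carry more fields and may match differently).

```go
package main

import (
	"fmt"
	"strings"
)

// Disk is a hypothetical, reduced view of one entry in the node's disk snapshot.
type Disk struct {
	DevPath string
	Model   string
}

// checkLinks returns one blocker per declared link name that is absent
// from the snapshot. Matching is exact, as the validation rule requires.
func checkLinks(declared []string, snapshot map[string]bool) []string {
	var blockers []string
	for _, name := range declared {
		if !snapshot[name] {
			blockers = append(blockers, fmt.Sprintf("link %q not found on node", name))
		}
	}
	return blockers
}

// checkDiskSelector classifies a model-only selector: zero matches is a
// blocker, more than one match is a warning, exactly one match is clean.
func checkDiskSelector(model string, disks []Disk) (blocker, warning string) {
	var matched []string
	for _, d := range disks {
		if d.Model == model {
			matched = append(matched, d.DevPath)
		}
	}
	switch {
	case len(matched) == 0:
		return fmt.Sprintf("diskSelector model=%s matches no disk", model), ""
	case len(matched) > 1:
		return "", fmt.Sprintf("diskSelector model=%s matches %d disks: %s",
			model, len(matched), strings.Join(matched, ", "))
	}
	return "", ""
}

func main() {
	links := map[string]bool{"eth0": true, "eth1": true}
	fmt.Println(checkLinks([]string{"eth0", "eth0_typo"}, links))

	disks := []Disk{{"/dev/sda", "Samsung"}, {"/dev/sdb", "Samsung"}}
	fmt.Println(checkDiskSelector("Samsumg", disks)) // typo: blocker
	fmt.Println(checkDiskSelector("Samsung", disks)) // two matches: warning
}
```

Returning blockers and warnings as separate values keeps the "multiple matches" case visible without failing the apply.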
Out of scope (initial phase)
- Talos extensions / image-bound config (validated at image build, not host state).
- Memory/CPU sizing — talm doesn't currently parameterize these against host capacity.
- PCI device addressing for passthrough — not a documented talm feature today.
Default behavior
Block by default. Collect every mismatch in one pass per node (no early abort within a single node's checks); surface them together; exit non-zero. --skip-resource-validation opts out for recovery scenarios (booting into maintenance with mismatched hardware, pre-staging values for future hardware).
Best-effort fallback
If COSI is unreachable, surface "could not validate; pass --skip-resource-validation to proceed at your own risk" — do NOT silently no-op.
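A minimal sketch of the gate's control flow for one node: collect every mismatch instead of stopping at the first, and refuse to pass silently when the snapshot cannot be read. The linksReader type and validateNode function are hypothetical; only the flag name and the "could not validate" wording come from this proposal.

```go
package main

import (
	"errors"
	"fmt"
)

// linksReader is a hypothetical function type that returns the link names
// present on a node, or an error when COSI cannot be reached.
type linksReader func(node string) ([]string, error)

// validateNode runs the link part of the Phase 1 gate for one node.
// skip mirrors --skip-resource-validation.
func validateNode(node string, declared []string, read linksReader, skip bool) error {
	if skip {
		return nil
	}
	present, err := read(node)
	if err != nil {
		// Unreachable snapshot: block and tell the operator how to opt out.
		return fmt.Errorf("node %s: could not validate; pass --skip-resource-validation to proceed at your own risk: %w", node, err)
	}
	known := make(map[string]bool, len(present))
	for _, name := range present {
		known[name] = true
	}
	var mismatches []error
	for _, name := range declared {
		if !known[name] {
			mismatches = append(mismatches, fmt.Errorf("node %s: link %q not found", node, name))
		}
	}
	// errors.Join returns nil when there is nothing to join.
	return errors.Join(mismatches...)
}

func main() {
	ok := func(string) ([]string, error) { return []string{"eth0"}, nil }
	down := func(string) ([]string, error) { return nil, errors.New("connection refused") }

	fmt.Println(validateNode("10.0.0.5", []string{"eth0_typo", "eth9"}, ok, false))
	fmt.Println(validateNode("10.0.0.5", []string{"eth0"}, down, false))
	fmt.Println(validateNode("10.0.0.5", []string{"eth0"}, down, true))
}
```

The caller would turn any non-nil result into a non-zero exit after all nodes have reported.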
Phase 2 — drift preview + post-apply state verification
Phase 2A: pre-apply preview
- Render the new config locally.
- Read current on-node MachineConfig via COSI: type MachineConfigs.config.talos.dev, namespace config, ID v1alpha1, package github.com/siderolabs/talos/pkg/machinery/resources/config. Use safe.StateGet with the same cosiPreflightContext shape as preflightCheckTalosVersion.
- Diff per (kind, name) pair:
    + LinkConfig{name: eth0}: addition
    - LinkConfig{name: eth1}: removal (the most important class, where stale leftovers hide)
    ~ LinkConfig{name: eth0}: field-level change
    = ResolverConfig: unchanged
- Display unified-diff-style before apply. Under a TTY: pause for y/N if the diff is non-empty. Non-interactive: print and proceed. --skip-drift-preview opts out.
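The per-pair classification can be sketched as a comparison of two maps keyed by (kind, name). Key, Doc, and classify are hypothetical names, and document bodies are assumed to be already parsed into generic maps; the position-based identity of the v1.11 nested form is not covered here.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Key identifies one config document. Singletons leave Name empty.
type Key struct{ Kind, Name string }

// Doc is a hypothetical parsed document body.
type Doc map[string]any

// classify maps every key seen on either side to "+", "-", "~" or "=".
func classify(current, desired map[Key]Doc) map[Key]string {
	out := make(map[Key]string)
	for k, want := range desired {
		have, ok := current[k]
		switch {
		case !ok:
			out[k] = "+"
		case reflect.DeepEqual(have, want):
			out[k] = "="
		default:
			out[k] = "~"
		}
	}
	for k := range current {
		if _, ok := desired[k]; !ok {
			out[k] = "-" // on the node but no longer declared
		}
	}
	return out
}

func main() {
	current := map[Key]Doc{
		{"LinkConfig", "eth1"}: {"up": true},
		{"HostnameConfig", ""}: {"hostname": "cp-old"},
		{"ResolverConfig", ""}: {"nameservers": []any{"1.1.1.1"}},
	}
	desired := map[Key]Doc{
		{"LinkConfig", "eth0"}: {"up": true},
		{"HostnameConfig", ""}: {"hostname": "cp-new"},
		{"ResolverConfig", ""}: {"nameservers": []any{"1.1.1.1"}},
	}
	ops := classify(current, desired)

	// Sort keys so the output order is stable across runs.
	keys := make([]Key, 0, len(ops))
	for k := range ops {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Kind != keys[j].Kind {
			return keys[i].Kind < keys[j].Kind
		}
		return keys[i].Name < keys[j].Name
	})
	for _, k := range keys {
		fmt.Printf("%s %s %s\n", ops[k], k.Kind, k.Name)
	}
}
```

The second loop is the part that catches the stale eth1: a key present on the node and absent from the rendered config.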
Phase 2B: post-apply state verification
- After ApplyConfiguration returns success, re-read on-node MachineConfig via COSI (after a short reconcile window: Talos converges quickly but not instantly; bound by a preflightCOSIReadTimeout-style cap).
- Structurally compare against the bytes we sent (parse both sides; whitespace and key ordering must not produce false positives).
- Block (exit non-zero) on any divergence with a list of expected-vs-actual per resource.
- --skip-post-apply-verify opts out.
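The "parse both sides" requirement can be illustrated with the standard library alone. The sketch below uses JSON as a stand-in so it stays self-contained; the real payload is YAML, so an actual implementation would swap in a YAML decoder. structurallyEqual is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// structurallyEqual parses both payloads and compares the resulting trees,
// so whitespace and key order cannot cause a false divergence.
func structurallyEqual(sent, onNode []byte) (bool, error) {
	var a, b any
	if err := json.Unmarshal(sent, &a); err != nil {
		return false, fmt.Errorf("parse sent config: %w", err)
	}
	if err := json.Unmarshal(onNode, &b); err != nil {
		return false, fmt.Errorf("parse on-node config: %w", err)
	}
	return reflect.DeepEqual(a, b), nil
}

func main() {
	sent := []byte(`{"kind":"LinkConfig","name":"eth0","mtu":1500}`)
	// Same content, different key order and whitespace.
	onNode := []byte("{\n  \"name\": \"eth0\",\n  \"mtu\": 1500,\n  \"kind\": \"LinkConfig\"\n}")

	same, err := structurallyEqual(sent, onNode)
	fmt.Println(same, err) // true <nil>
}
```

A byte-for-byte comparison of the same two inputs would report a divergence, which is exactly the false positive the rule above forbids.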
Auth path vs insecure (maintenance) path
The Talos MachineConfig resource is declared Sensitivity: meta.Sensitive. It is readable on the auth path (full credentials), and unreachable through the Reader role used by the insecure / maintenance connection in --insecure mode.
Consequence: Phase 2 is available on the auth path only. On the insecure path the gate prints one line — talm: drift verification unavailable on maintenance connection — and proceeds. Apply itself still works; Phase 1 still runs (its inputs, links and disks, are non-sensitive and reachable on both paths).
Resource kinds in scope
Identity is (kind, name) for v1.12 multi-doc, position-based for v1.11 nested form:
- LinkConfig / machine.network.interfaces[].interface: links by name
- BondConfig, VLANConfig, BridgeConfig: virtual links by name
- Layer2VIPConfig: VIPs by name (= address)
- HostnameConfig, ResolverConfig, TimeServerConfig: singletons (add/remove/update only)
- RouteConfig: static routes by destination
- UserVolumeConfig: v1.12 user volumes by name
- Top-level machine.install: singleton
Out of scope (initial phase)
- Kubernetes-level state (kubelet manifests, CNI behavior) — outside the MachineConfig boundary.
- Live kernel state (every interface, every route on the box) — that's a node-health audit, not a config-coherence check.
- Extension-emitted runtime state — would need a per-extension allowlist.
Shared infrastructure (both phases reuse)
- cosiPreflightContext in pkg/commands/preflight.go (singular node ctx key; apid's COSI router rejects plural nodes).
- The per-node iteration model from applyOneFileDirectPatchMode in pkg/commands/apply.go (lines 285-339).
- Function-type injection pattern from versionReader at pkg/commands/preflight.go:75, with new sibling types linksDisksReader and machineConfigReader so unit tests don't need a live Talos client.
- Output: stable, line-oriented, grep-friendly; identical --skip-* flag-namespace style.
- Multi-node independence: per node, surface all results, exit non-zero on any node's blocker.
Phased rollout
PRs may land Phase 1 and Phase 2 together or separately. Phase 1 is smaller in surface and works on both auth and insecure paths, so a Phase-1-first split is acceptable. Each phase ships with its own opt-out flag from day one.
Diff representation (Phase 2A output)
talm: drift preview (node: 10.0.0.5)
+ LinkConfig{name: eth0}
- LinkConfig{name: eth1}
~ HostnameConfig
    hostname: cp-old -> cp-new
= ResolverConfig (no change)
talm: 1 addition, 1 removal, 1 update, 1 unchanged.
For ~, show only changed leaves; for +/-, show kind+name only (full bodies are noisy — operators can run talm template for the full content). Optional --diff-output=json for CI consumers; JSON shape stable enough to script against.
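One way to produce both output forms from a single list of classified entries is sketched below. Entry, renderText, and renderJSON are hypothetical names, and the JSON field names are an assumption; this proposal only states that a JSON form should exist and stay stable.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Entry is one classified document in the preview.
type Entry struct {
	Op      string   `json:"op"` // "+", "-", "~" or "="
	Kind    string   `json:"kind"`
	Name    string   `json:"name,omitempty"`
	Changes []string `json:"changes,omitempty"` // changed leaves, "~" only
}

// label prints kind+name for named documents and kind alone for singletons.
func label(e Entry) string {
	if e.Name == "" {
		return e.Kind
	}
	return fmt.Sprintf("%s{name: %s}", e.Kind, e.Name)
}

// count formats "1 addition" / "2 additions".
func count(n int, noun string) string {
	if n == 1 {
		return fmt.Sprintf("%d %s", n, noun)
	}
	return fmt.Sprintf("%d %ss", n, noun)
}

// renderText builds the line-oriented preview and its summary line.
func renderText(node string, entries []Entry) string {
	var b strings.Builder
	c := map[string]int{}
	fmt.Fprintf(&b, "talm: drift preview (node: %s)\n", node)
	for _, e := range entries {
		c[e.Op]++
		line := e.Op + " " + label(e)
		if e.Op == "=" {
			line += " (no change)"
		}
		b.WriteString(line + "\n")
		for _, leaf := range e.Changes {
			b.WriteString("    " + leaf + "\n")
		}
	}
	fmt.Fprintf(&b, "talm: %s, %s, %s, %d unchanged.\n",
		count(c["+"], "addition"), count(c["-"], "removal"), count(c["~"], "update"), c["="])
	return b.String()
}

// renderJSON emits the same entries for CI consumers.
func renderJSON(entries []Entry) ([]byte, error) {
	return json.MarshalIndent(entries, "", "  ")
}

func main() {
	entries := []Entry{
		{Op: "+", Kind: "LinkConfig", Name: "eth0"},
		{Op: "-", Kind: "LinkConfig", Name: "eth1"},
		{Op: "~", Kind: "HostnameConfig", Changes: []string{"hostname: cp-old -> cp-new"}},
		{Op: "=", Kind: "ResolverConfig"},
	}
	fmt.Print(renderText("10.0.0.5", entries))
	out, err := renderJSON(entries)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping the marker in the first column of every entry line means a plain grep for lines starting with "- " finds all removals.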
Tests
TDD-first. Each new validation rule or diff classifier opens with a red test in its own commit; the implementation commit turns it green.
Engine-side walker tests in pkg/engine/contract_*_test.go (Go-side; helm-unittest can't reach the chart discovery branches because of talm's custom LookupFunc):
- Phase 1: declared link/disk reference vs fake lookup "links" / lookup "disks" snapshots; every reference field shape (v1.11 + v1.12) covered.
- Phase 2: differ unit tests for additions, removals, field-level updates, no-op; v1.11 nested and v1.12 multi-doc reduce to the same (kind, name) set.
- Reuse existing fixture builders in pkg/engine/render_test.go (e.g., multiNicWithVLANLookup, hetznerPublicNICWithPrivateVLANLookup).
Preflight-hook tests in pkg/commands/preflight_test.go:
- Mirror the existing versionReader injection pattern. Tests pass fake readers; production code builds them from *client.Client.
- Best-effort fallback test: simulate COSI unreachable; assert the "could not validate" message and the right exit code with / without the skip flag.
- Insecure-path test for Phase 2: fake reader returns a Sensitivity-style PermissionDenied; assert one informational line and no blocker on that path.
End-to-end: fixture node config with LinkConfig{eth1}, render config with LinkConfig{eth0} only — assert Phase 2A output contains + eth0 AND - eth1.
Critical files
- pkg/commands/preflight.go: home for the three new functions (preflightValidateResources, previewDrift, verifyAppliedState), mirroring preflightCheckTalosVersion.
- pkg/commands/apply.go: wiring sites are buildApplyClosure (auth path, lines 240-263) and applyOneFileDirectPatchMode (direct-patch path, lines 285-339). Per-node loop already exists at lines 322-325.
- pkg/engine/engine.go: SerializeConfiguration at line 235 produces the bytes the walker consumes.
- pkg/engine/contract_*_test.go and pkg/commands/preflight_test.go: test homes.
- pkg/engine/render_test.go: reusable COSI-fake fixture builders.
Open questions
- talm template --validate parity: useful for CI, lower stakes since template is read-only. Probably yes, separate flag.
- Reconcile window for Phase 2B: too short and Talos may not have applied yet; too long and apply turnaround feels sluggish. Start with the same preflightCOSIReadTimeout style cap; tune from feedback.
- Allowlist for "expected divergence": certain fields are Talos-mutated post-apply (generated cert hashes, timestamps). Bootstrap with a small allowlist of known-mutable paths; expand as encountered.
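The allowlist idea could start as a simple path-prefix filter applied to the list of divergent field paths. Everything in this sketch is hypothetical, including filterExpected and the two example paths, which are placeholders and not confirmed Talos-mutated fields.

```go
package main

import (
	"fmt"
	"strings"
)

// mutablePaths is a placeholder allowlist. The entries are illustrative
// only; the real list would be built from divergences observed in practice.
var mutablePaths = []string{
	"example.generated.certHash",
	"example.timestamps",
}

// filterExpected drops every divergent path that equals an allowlisted
// path or lies beneath it, and returns what is left to report.
func filterExpected(divergent, allow []string) []string {
	var kept []string
	for _, p := range divergent {
		expected := false
		for _, a := range allow {
			// The "." guard keeps "a.b" from swallowing "a.bc".
			if p == a || strings.HasPrefix(p, a+".") {
				expected = true
				break
			}
		}
		if !expected {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	divergent := []string{
		"example.timestamps.updated",
		"example.generated.certHash",
		"machine.network.hostname",
	}
	fmt.Println(filterExpected(divergent, mutablePaths))
}
```

Matching on whole path segments keeps an allowlisted parent from hiding an unrelated sibling field that happens to share a prefix.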
Supersedes
This umbrella supersedes #169 (declared-resource existence) and #171 (drift preview + post-apply verify), which will be closed with cross-references to this issue.