Apply-time safety gates: validate declared resources, preview & verify drift #172

@lexfrei

Motivation

Two recent failure modes in operational use of talm apply motivate this:

  1. Declared but absent: values reference a host resource that doesn't exist on the target, such as a typo'd link name or a disk selector that matches zero disks. Apply succeeds; the failure surfaces only on the next boot (link eth0_typo missing) or at install time (no disk matches model=Samsumg). Worst case: a silent install on the wrong disk.

  2. Declared but not realized: values declare only eth0 after migration from a previous topology that also had eth1. After apply, the on-node state contains eth0 (correct) AND a leftover eth1 (from before). Or the Talos parser silently drops a doc with an unknown field. Or a controller reverts state post-apply. In every case the operator sees "apply: success" and never finds out that the on-node state diverged from intent.

A pair of apply-time gates closes both classes:

  • Phase 1 — before apply, verify every declared host resource exists on the node.
  • Phase 2 — before apply, preview the diff against on-node state; after apply, verify on-node state matches what was sent.

Today neither check exists. The version-mismatch warning in pkg/commands/preflight.go (preflightCheckTalosVersion) is the only pre-apply check at all.

Phase 1 — declared-resource existence

Walk the rendered MachineConfig per target node, collect every reference to a host-side resource, verify each against the node's COSI snapshot. Block apply on mismatch by default.

In scope

Network link references:

  • v1.11: machine.network.interfaces[].interface
  • v1.12 multi-doc: LinkConfig.name, BondConfig.bondLinks[], VLANConfig.link, BridgeConfig.bridgedLinks[], Layer2VIPConfig.link, plus addresses[].linkName / routes[].linkName in multi-doc network configs
  • Validation: exact-match against metadata.id values from lookup "links".
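
The exact-match rule for link references can be sketched as follows. This is a minimal illustration, not the proposed implementation: the linkRef type and missingLinks function are hypothetical names, and the node's link IDs are passed in as a plain slice instead of being read from COSI.

```go
package main

import "fmt"

// linkRef is a hypothetical record of one declared link reference:
// the config field it came from and the link name it points at.
type linkRef struct {
	Source string // e.g. "VLANConfig.link"
	Name   string // e.g. "eth0"
}

// missingLinks returns every reference whose name is not an exact match
// for one of the link IDs reported by the node. It does not stop at the
// first miss, so all problems surface in one pass.
func missingLinks(refs []linkRef, nodeLinkIDs []string) []linkRef {
	known := make(map[string]struct{}, len(nodeLinkIDs))
	for _, id := range nodeLinkIDs {
		known[id] = struct{}{}
	}
	var missing []linkRef
	for _, ref := range refs {
		if _, ok := known[ref.Name]; !ok {
			missing = append(missing, ref)
		}
	}
	return missing
}

func main() {
	refs := []linkRef{
		{Source: "LinkConfig.name", Name: "eth0"},
		{Source: "VLANConfig.link", Name: "eth0_typo"},
	}
	for _, m := range missingLinks(refs, []string{"eth0", "lo"}) {
		fmt.Printf("link %q (declared in %s) not found on node\n", m.Name, m.Source)
	}
}
```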

Install disk:

  • machine.install.disk (literal path) — must exist in lookup "disks" candidates.
  • machine.install.diskSelector (size / model / serial / wwid / type / busPath filter) — must match >= 1 disk. Zero matches = blocker. Multiple matches = warning (install picks the first, fragile).

User volume / extra disks:

  • UserVolumeConfig.provisioning.diskSelector (v1.12)
  • machine.disks[].device (v1.11 extra-disk partitioning)
  • Same rules as install disk.
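
A rough sketch of the selector rule (zero matches blocks, multiple matches warn) might look like this. The disk and diskSelector types are hypothetical and cover only two of the listed fields, and the comparison is plain string equality, which is an assumption: the real selector may support richer matching.

```go
package main

import "fmt"

// disk is a hypothetical, trimmed-down view of one disk on the node.
type disk struct {
	DevPath string
	Model   string
	Serial  string
}

// diskSelector covers only two of the selector fields; an empty field
// matches any disk. Equality matching is an assumption of this sketch.
type diskSelector struct {
	Model  string
	Serial string
}

func (s diskSelector) matches(d disk) bool {
	if s.Model != "" && s.Model != d.Model {
		return false
	}
	if s.Serial != "" && s.Serial != d.Serial {
		return false
	}
	return true
}

// checkSelector returns the matching disks plus a verdict: "blocker" for
// zero matches, "warning" for more than one, "ok" for exactly one.
func checkSelector(s diskSelector, disks []disk) ([]disk, string) {
	var matched []disk
	for _, d := range disks {
		if s.matches(d) {
			matched = append(matched, d)
		}
	}
	switch {
	case len(matched) == 0:
		return matched, "blocker"
	case len(matched) > 1:
		return matched, "warning"
	default:
		return matched, "ok"
	}
}

func main() {
	disks := []disk{
		{DevPath: "/dev/nvme0n1", Model: "Samsung SSD 980", Serial: "S1"},
		{DevPath: "/dev/nvme1n1", Model: "Samsung SSD 980", Serial: "S2"},
	}
	selectors := []diskSelector{
		{Model: "Samsumg SSD 980"}, // typo: matches nothing
		{Model: "Samsung SSD 980"}, // ambiguous: matches both disks
		{Serial: "S2"},             // matches exactly one disk
	}
	for _, sel := range selectors {
		matched, verdict := checkSelector(sel, disks)
		fmt.Printf("%+v: %d match(es), %s\n", sel, len(matched), verdict)
	}
}
```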

Out of scope (initial phase)

  • Talos extensions / image-bound config (validated at image build, not host state).
  • Memory/CPU sizing — talm doesn't currently parameterize these against host capacity.
  • PCI device addressing for passthrough — not a documented talm feature today.

Default behavior

Block by default. Collect every mismatch in one pass per node (no early abort within a single node's checks); surface them together; exit non-zero. --skip-resource-validation opts out for recovery scenarios (booting into maintenance with mismatched hardware, pre-staging values for future hardware).

Best-effort fallback

If COSI is unreachable, surface "could not validate; pass --skip-resource-validation to proceed at your own risk" — do NOT silently no-op.
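
One way to express the "never silently no-op" rule, as a sketch under assumptions: errCOSIUnreachable and runResourceGate are hypothetical names, and the validation step is injected as a plain function so the behavior with and without the skip flag is easy to see.

```go
package main

import (
	"errors"
	"fmt"
)

// errCOSIUnreachable is a hypothetical sentinel for "the node's COSI API
// could not be reached"; real code would classify transport errors.
var errCOSIUnreachable = errors.New("COSI unreachable")

// runResourceGate runs validate unless the operator opted out. An
// unreachable COSI is reported as an error that names the opt-out flag;
// it is never treated as a successful validation.
func runResourceGate(skip bool, validate func() error) error {
	if skip {
		return nil
	}
	err := validate()
	if errors.Is(err, errCOSIUnreachable) {
		return fmt.Errorf("could not validate; pass --skip-resource-validation to proceed at your own risk: %w", err)
	}
	return err
}

func main() {
	unreachable := func() error { return errCOSIUnreachable }
	fmt.Println("without skip:", runResourceGate(false, unreachable))
	fmt.Println("with skip:   ", runResourceGate(true, unreachable))
}
```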

Phase 2 — drift preview + post-apply state verification

Phase 2A: pre-apply preview

  • Render the new config locally.
  • Read current on-node MachineConfig via COSI: type MachineConfigs.config.talos.dev, namespace config, ID v1alpha1, package github.com/siderolabs/talos/pkg/machinery/resources/config. Use safe.StateGet with the same cosiPreflightContext shape as preflightCheckTalosVersion.
  • Diff per (kind, name) pair:
    • + LinkConfig{name: eth0} — addition
    • - LinkConfig{name: eth1} — removal (most important class — where stale leftover hides)
    • ~ LinkConfig{name: eth0} — field-level change
    • = ResolverConfig — unchanged
  • Display unified-diff-style before apply. Under a TTY: pause for y/N if the diff is non-empty. Non-interactive: print and proceed. --skip-drift-preview opts out.
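
The (kind, name) classification above could be sketched like this. It is an illustration only: docKey, change, and diffDocs are hypothetical names, and each document body is reduced to an already-canonicalized string so that equality stands in for a field-level comparison.

```go
package main

import (
	"fmt"
	"sort"
)

// docKey identifies one config document; Name is empty for singletons.
type docKey struct {
	Kind string
	Name string
}

// change pairs a document with its classification: '+', '-', '~' or '='.
type change struct {
	Op  byte
	Key docKey
}

// diffDocs classifies every document present on either side. Bodies are
// assumed to be canonicalized strings, so != means a field-level change.
func diffDocs(current, desired map[docKey]string) []change {
	var out []change
	for k, want := range desired {
		have, exists := current[k]
		switch {
		case !exists:
			out = append(out, change{'+', k})
		case have != want:
			out = append(out, change{'~', k})
		default:
			out = append(out, change{'=', k})
		}
	}
	for k := range current {
		if _, ok := desired[k]; !ok {
			out = append(out, change{'-', k}) // stale leftover on the node
		}
	}
	// Sort by kind, then name, so the output is stable across runs.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Key.Kind != out[j].Key.Kind {
			return out[i].Key.Kind < out[j].Key.Kind
		}
		return out[i].Key.Name < out[j].Key.Name
	})
	return out
}

func main() {
	current := map[docKey]string{
		{Kind: "LinkConfig", Name: "eth1"}: "up: true",
		{Kind: "HostnameConfig"}:           "hostname: cp-old",
		{Kind: "ResolverConfig"}:           "nameservers: [1.1.1.1]",
	}
	desired := map[docKey]string{
		{Kind: "LinkConfig", Name: "eth0"}: "up: true",
		{Kind: "HostnameConfig"}:           "hostname: cp-new",
		{Kind: "ResolverConfig"}:           "nameservers: [1.1.1.1]",
	}
	for _, c := range diffDocs(current, desired) {
		fmt.Printf("%c %s %s\n", c.Op, c.Key.Kind, c.Key.Name)
	}
}
```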

Phase 2B: post-apply state verification

  • After ApplyConfiguration returns success, re-read on-node MachineConfig via COSI (after a short reconcile window: Talos converges quickly but not instantly, so the wait is bounded by a preflightCOSIReadTimeout-style cap).
  • Structurally compare against the bytes we sent (parse both sides; whitespace and key ordering must not produce false positives).
  • Block (exit non-zero) on any divergence with a list of expected-vs-actual per resource.
  • --skip-post-apply-verify opts out.
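
The "parse both sides" comparison might be sketched as below. Machine configs are YAML, but the Go standard library has no YAML parser, so this sketch uses JSON as a stand-in to stay self-contained; the idea (decode into generic values, then compare structurally) carries over. structurallyEqual is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// structurallyEqual decodes both payloads into generic values and compares
// the resulting trees, so whitespace and object key order cannot cause a
// false positive. JSON stands in for YAML in this sketch.
func structurallyEqual(sent, actual []byte) (bool, error) {
	var a, b any
	if err := json.Unmarshal(sent, &a); err != nil {
		return false, fmt.Errorf("parse sent config: %w", err)
	}
	if err := json.Unmarshal(actual, &b); err != nil {
		return false, fmt.Errorf("parse on-node config: %w", err)
	}
	return reflect.DeepEqual(a, b), nil
}

func main() {
	sent := []byte(`{"kind":"HostnameConfig","hostname":"cp-new"}`)
	actual := []byte(`{ "hostname": "cp-new",  "kind": "HostnameConfig" }`)
	same, err := structurallyEqual(sent, actual)
	fmt.Println("equal:", same, "err:", err)
}
```

Note that array order still matters in this comparison, which seems right for ordered lists such as bond members but would need revisiting for fields whose order is not significant.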

Auth path vs insecure (maintenance) path

The Talos MachineConfig resource is declared Sensitivity: meta.Sensitive. It is readable on the auth path (full credentials) but not through the Reader role that the maintenance connection (--insecure mode) uses.

Consequence: Phase 2 is available on the auth path only. On the insecure path the gate prints one line — talm: drift verification unavailable on maintenance connection — and proceeds. Apply itself still works; Phase 1 still runs (its inputs, links and disks, are non-sensitive and reachable on both paths).
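
A sketch of how the gate could degrade on the maintenance connection, under assumptions: errPermissionDenied is a hypothetical sentinel standing in for whatever error the real client returns, and this machineConfigReader signature is simplified (a real reader would presumably take a context).

```go
package main

import (
	"errors"
	"fmt"
)

// errPermissionDenied is a hypothetical stand-in for the error returned
// when a sensitive resource is read over the maintenance connection.
var errPermissionDenied = errors.New("permission denied")

// machineConfigReader is a simplified, hypothetical reader signature.
type machineConfigReader func() ([]byte, error)

const unavailableNotice = "talm: drift verification unavailable on maintenance connection"

// readForDrift returns the on-node config, or a one-line notice (and no
// error) when the config is not readable on this connection. Any other
// failure is returned as an error.
func readForDrift(read machineConfigReader) (cfg []byte, notice string, err error) {
	cfg, err = read()
	switch {
	case errors.Is(err, errPermissionDenied):
		return nil, unavailableNotice, nil
	case err != nil:
		return nil, "", fmt.Errorf("read machine config: %w", err)
	}
	return cfg, "", nil
}

func main() {
	insecure := func() ([]byte, error) { return nil, errPermissionDenied }
	_, notice, err := readForDrift(insecure)
	fmt.Println(notice, err)
}
```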

Resource kinds in scope

Identity is (kind, name) for v1.12 multi-doc, position-based for v1.11 nested form:

  • LinkConfig / machine.network.interfaces[].interface — links by name
  • BondConfig, VLANConfig, BridgeConfig — virtual links by name
  • Layer2VIPConfig — VIPs by name (= address)
  • HostnameConfig, ResolverConfig, TimeServerConfig — singletons (add/remove/update only)
  • RouteConfig — static routes by destination
  • UserVolumeConfig — v1.12 user volumes by name
  • Top-level machine.install — singleton

Out of scope (initial phase)

  • Kubernetes-level state (kubelet manifests, CNI behavior) — outside the MachineConfig boundary.
  • Live kernel state (every interface, every route on the box) — that's a node-health audit, not a config-coherence check.
  • Extension-emitted runtime state — would need a per-extension allowlist.

Shared infrastructure (both phases reuse)

  • cosiPreflightContext in pkg/commands/preflight.go (singular node ctx key — apid's COSI router rejects plural nodes).
  • The per-node iteration model from applyOneFileDirectPatchMode in pkg/commands/apply.go (lines 285-339).
  • Function-type injection pattern from versionReader at pkg/commands/preflight.go:75 — new sibling types linksDisksReader and machineConfigReader so unit tests don't need a live Talos client.
  • Output: stable, line-oriented, grep-friendly; identical --skip-* flag-namespace style.
  • Multi-node independence: per node, surface all results, exit non-zero on any node's blocker.
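
The injection and multi-node points can be illustrated together. In this sketch linksReader is a simplified stand-in for the proposed linksDisksReader (whose actual signature is not specified here), and validateNodes and nodeResult are hypothetical names. Every node is checked even when an earlier one fails.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a hypothetical error a fake reader can return.
var errUnreachable = errors.New("node unreachable")

// linksReader is a simplified, hypothetical function type: tests pass a
// fake, production code would build one from a live client.
type linksReader func(node string) ([]string, error)

type nodeResult struct {
	Node     string
	Blockers []string
}

// validateNodes checks every node independently and reports all results;
// failed is true if any node has at least one blocker.
func validateNodes(nodes, declared []string, read linksReader) (results []nodeResult, failed bool) {
	for _, node := range nodes {
		res := nodeResult{Node: node}
		links, err := read(node)
		if err != nil {
			res.Blockers = append(res.Blockers, fmt.Sprintf("could not validate: %v", err))
		} else {
			present := make(map[string]bool, len(links))
			for _, l := range links {
				present[l] = true
			}
			for _, name := range declared {
				if !present[name] {
					res.Blockers = append(res.Blockers, fmt.Sprintf("link %s missing", name))
				}
			}
		}
		if len(res.Blockers) > 0 {
			failed = true
		}
		results = append(results, res)
	}
	return results, failed
}

func main() {
	fake := func(node string) ([]string, error) {
		if node == "10.0.0.6" {
			return nil, errUnreachable
		}
		return []string{"eth0"}, nil
	}
	results, failed := validateNodes([]string{"10.0.0.5", "10.0.0.6"}, []string{"eth0"}, fake)
	for _, r := range results {
		fmt.Printf("node %s: %d blocker(s) %v\n", r.Node, len(r.Blockers), r.Blockers)
	}
	fmt.Println("exit non-zero:", failed)
}
```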

Phased rollout

PRs may land Phase 1 and Phase 2 together or separately. Phase 1 is smaller in surface and works on both auth and insecure paths, so a Phase-1-first split is acceptable. Each phase ships with its own opt-out flag from day one.

Diff representation (Phase 2A output)

talm: drift preview (node: 10.0.0.5)
  + LinkConfig{name: eth0}
  - LinkConfig{name: eth1}
  ~ HostnameConfig
      hostname: cp-old -> cp-new
  = ResolverConfig (no change)
talm: 1 addition, 1 removal, 1 update, 1 unchanged.

For ~, show only changed leaves; for +/-, show kind+name only (full bodies are noisy — operators can run talm template for the full content). Optional --diff-output=json for CI consumers; JSON shape stable enough to script against.
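
As an illustration of the text format only (the JSON variant is not covered), a renderer along these lines would reproduce the sample output above. The entry and leafChange types and the render function are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// leafChange is one changed leaf inside an updated document.
type leafChange struct{ Path, Old, New string }

// entry is one classified document; Name is empty for singletons and
// Leaves is only populated for "~" entries.
type entry struct {
	Op     string // "+", "-", "~" or "="
	Kind   string
	Name   string
	Leaves []leafChange
}

func label(e entry) string {
	if e.Name == "" {
		return e.Kind
	}
	return fmt.Sprintf("%s{name: %s}", e.Kind, e.Name)
}

func plural(n int, noun string) string {
	if n == 1 {
		return fmt.Sprintf("%d %s", n, noun)
	}
	return fmt.Sprintf("%d %ss", n, noun)
}

// render prints kind+name for every entry, changed leaves for updates,
// and a one-line summary at the end.
func render(node string, entries []entry) string {
	var b strings.Builder
	counts := map[string]int{}
	fmt.Fprintf(&b, "talm: drift preview (node: %s)\n", node)
	for _, e := range entries {
		counts[e.Op]++
		if e.Op == "=" {
			fmt.Fprintf(&b, "  = %s (no change)\n", label(e))
		} else {
			fmt.Fprintf(&b, "  %s %s\n", e.Op, label(e))
		}
		for _, l := range e.Leaves {
			fmt.Fprintf(&b, "      %s: %s -> %s\n", l.Path, l.Old, l.New)
		}
	}
	fmt.Fprintf(&b, "talm: %s, %s, %s, %d unchanged.\n",
		plural(counts["+"], "addition"), plural(counts["-"], "removal"),
		plural(counts["~"], "update"), counts["="])
	return b.String()
}

func main() {
	fmt.Print(render("10.0.0.5", []entry{
		{Op: "+", Kind: "LinkConfig", Name: "eth0"},
		{Op: "-", Kind: "LinkConfig", Name: "eth1"},
		{Op: "~", Kind: "HostnameConfig", Leaves: []leafChange{{"hostname", "cp-old", "cp-new"}}},
		{Op: "=", Kind: "ResolverConfig"},
	}))
}
```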

Tests

TDD-first. Each new validation rule or diff classifier opens with a red test in its own commit; the implementation commit turns it green.

Engine-side walker tests in pkg/engine/contract_*_test.go (Go-side; helm-unittest can't reach the chart discovery branches because of talm's custom LookupFunc):

  • Phase 1: declared link/disk reference vs fake lookup "links" / lookup "disks" snapshots — every reference field shape (v1.11 + v1.12) covered.
  • Phase 2: differ unit tests for additions, removals, field-level updates, no-op; v1.11 nested and v1.12 multi-doc reduce to the same (kind, name) set.
  • Reuse existing fixture builders in pkg/engine/render_test.go (e.g., multiNicWithVLANLookup, hetznerPublicNICWithPrivateVLANLookup).

Preflight-hook tests in pkg/commands/preflight_test.go:

  • Mirror the existing versionReader injection pattern. Tests pass fake readers; production code builds them from *client.Client.
  • Best-effort fallback test: simulate COSI unreachable; assert the "could not validate" message and the right exit code with / without the skip flag.
  • Insecure-path test for Phase 2: fake reader returns a Sensitivity-style PermissionDenied; assert one informational line and no blocker on that path.

End-to-end: fixture node config with LinkConfig{eth1}, render config with LinkConfig{eth0} only — assert Phase 2A output contains + eth0 AND - eth1.

Critical files

  • pkg/commands/preflight.go — home for the three new functions (preflightValidateResources, previewDrift, verifyAppliedState), mirroring preflightCheckTalosVersion.
  • pkg/commands/apply.go — wiring sites: buildApplyClosure (auth path, lines 240-263) and applyOneFileDirectPatchMode (direct-patch path, lines 285-339). Per-node loop already exists at lines 322-325.
  • pkg/engine/engine.go: SerializeConfiguration at line 235 produces the bytes the walker consumes.
  • pkg/engine/contract_*_test.go and pkg/commands/preflight_test.go — test homes.
  • pkg/engine/render_test.go — reusable COSI-fake fixture builders.

Open questions

  • talm template --validate parity: useful for CI, lower stakes since template is read-only. Probably yes, separate flag.
  • Reconcile window for Phase 2B: too short and Talos may not have applied yet; too long and apply turnaround feels sluggish. Start with the same preflightCOSIReadTimeout-style cap; tune from feedback.
  • Allowlist for "expected divergence": certain fields are Talos-mutated post-apply (generated cert hashes, timestamps). Bootstrap with a small allowlist of known-mutable paths; expand as encountered.
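
If the allowlist idea is adopted, filtering could be as simple as the sketch below. The dotted paths are placeholders, not real Talos field paths, and the function names are hypothetical. Matching is on whole path segments so that an allowlisted prefix does not accidentally cover a sibling field with a longer name.

```go
package main

import (
	"fmt"
	"strings"
)

// allowed reports whether path equals an allowlisted entry or lies below
// it. Matching is per dotted segment: "a.b" covers "a.b.c" but not "a.bc".
func allowed(path string, allowlist []string) bool {
	for _, prefix := range allowlist {
		if path == prefix || strings.HasPrefix(path, prefix+".") {
			return true
		}
	}
	return false
}

// unexpectedDivergences drops divergent paths that are known to be
// mutated after apply and returns only the ones that should block.
func unexpectedDivergences(paths, allowlist []string) []string {
	var out []string
	for _, p := range paths {
		if !allowed(p, allowlist) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Placeholder paths for illustration only.
	allowlist := []string{"example.generated"}
	diverged := []string{"example.generated.hash", "example.generatedOther", "example.hostname"}
	fmt.Println(unexpectedDivergences(diverged, allowlist))
}
```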

Supersedes

This umbrella supersedes #169 (declared-resource existence) and #171 (drift preview + post-apply verify), which will be closed with cross-references to this issue.

Labels: area/apply, kind/feature
