feat(apply): pre-flight Talos version check and decode-error hints#133
feat(apply): pre-flight Talos version check and decode-error hints#133
Conversation
Add github.com/cockroachdb/errors as a direct dependency and surface its WithHint annotations after the main error message in the entrypoint error printer. Existing fmt.Errorf-wrapped errors continue to render unchanged. New sites can now attach actionable hints that print on their own lines as 'hint: ...' without callers having to format them. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Pre-flight: read the running Talos version from the COSI Versions resource on the target node and compare it to the contract derived from templateOptions.talosVersion / --talos-version. When the configured contract is strictly newer than the running version, print a warning with a hint pointing at the maintenance image / contract mismatch. Best-effort: any read or parse failure returns silently and never blocks apply. Backstop: when c.ApplyConfiguration returns the maintenance strict decoder error 'unknown keys found during decoding: ...', attach the same hint via errors.WithHint so users see actionable guidance instead of a cryptic gRPC message. Both paths cover the template-rendering and direct-patch apply flows. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR adds Talos version preflight checking to the ChangesTalos Version Preflight Validation
Sequence DiagramsequenceDiagram
participant User as User/CLI
participant App as talm apply
participant Client as Talos Client
participant Node as Target Node
participant Config as Configuration Apply
User->>App: talm apply
loop For each node
App->>Client: Connect (node-scoped context)
Client->>Node: Read runtime.Version from COSI
Node-->>Client: Return running Talos version
App->>App: Evaluate version mismatch<br/>(configured vs running)
alt Version mismatch detected
App->>App: Print warning + hints to stderr
end
App->>Config: Apply configuration
Config-->>Config: Apply result
alt Apply error with unknown fields
App->>App: Annotate error with decode hint
App->>App: Extract and print all hints
end
end
Config-->>User: Success or error
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a pre-flight check to compare the configured Talos version against the version running on the node, providing warnings and hints if there's a mismatch. It also enhances error reporting by using the github.com/cockroachdb/errors package to display actionable hints for common configuration issues, such as unknown keys during decoding. Feedback suggests adding a timeout to the pre-flight network call to ensure it remains non-blocking and truly best-effort.
| "context" | ||
| "fmt" | ||
| "io" | ||
| "strings" |
There was a problem hiding this comment.
Added in b6e0618 — time is now imported and used by the new preflightCOSIReadTimeout constant in cosiVersionReader.
| res, err := safe.StateGet[*runtime.Version]( | ||
| ctx, | ||
| c.COSI, | ||
| resource.NewMetadata(runtime.NamespaceName, runtime.VersionType, "version", resource.VersionUndefined), | ||
| ) |
There was a problem hiding this comment.
The pre-flight check performs a network call to the Talos node via safe.StateGet. To ensure this check remains truly "best-effort" and does not block the apply process if a node is unresponsive or the network is slow, it should have a dedicated timeout. Adding a short timeout (e.g., 2 seconds) ensures that this informational check never significantly delays the actual configuration application.
| res, err := safe.StateGet[*runtime.Version]( | |
| ctx, | |
| c.COSI, | |
| resource.NewMetadata(runtime.NamespaceName, runtime.VersionType, "version", resource.VersionUndefined), | |
| ) | |
| ctx, cancel := context.WithTimeout(ctx, 2*time.Second) | |
| defer cancel() | |
| res, err := safe.StateGet[*runtime.Version]( | |
| ctx, | |
| c.COSI, | |
| resource.NewMetadata(runtime.NamespaceName, runtime.VersionType, "version", resource.VersionUndefined), | |
| ) |
There was a problem hiding this comment.
Done in b6e0618 — wrapped the safe.StateGet call inside cosiVersionReader with context.WithTimeout(ctx, preflightCOSIReadTimeout) where the constant is 2 * time.Second. Short enough to stay invisible on a healthy node, long enough to clear any expected roundtrip; on a hung node the reader returns ok=false and the existing silent-on-error contract takes over.
…flight per-node Pre-flight in pkg/commands/preflight.go had two gaps that exactly matched the reproduction cases from #132. 1. Empty configuredVersion was short-circuited to silent return. Machinery treats an unset contract as TalosVersionCurrent (a nil *VersionContract that compares as strictly greater than every concrete version), so it still injects machine.install.grubUseUKICmdline even when templateOptions.talosVersion is unset. Drop the early return: evaluateVersionMismatch now leaves configuredContract nil when configuredVersion is empty and lets contract.Greater carry the semantics. The warning prints 'configured talosVersion=current' (machinery's *VersionContract.String() for the nil contract) so the user can see what was inferred. 2. The direct-patch apply path called preflightCheckTalosVersion with the multi-node context withApplyClient produces. COSI does not support multi-node proxying — see the existing precedent in pkg/commands/rotate_ca_handler.go:317 — so the read fails silently and no warning ever surfaces for any direct-patch apply against more than one node. Iterate GlobalArgs.Nodes and invoke preflight per node with client.WithNode, matching the per-node fan-out the template-rendering path already inherits from applyTemplatesPerNode. The function now takes a versionReader so tests can drive the wired behavior without a live COSI server. cosiVersionReader wraps the production safe.StateGet[*runtime.Version] read at the apply call sites. preflight_test.go gains TestPreflightCheckTalosVersion with five cases (newer/empty-as-current/match/reader-error/unparseable) and the empty-configured row in TestEvaluateVersionMismatch flips from 'no warning' to 'warning'. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
The talosVersion / --talos-version setting was only described as an output-format selector. It is also (and more critically) the contract that decides which fields machinery injects into the generated config, which must match what the maintenance Talos parser on the node knows. A user who sets it from install.image (a different artifact) lands on 'unknown keys found during decoding: machine.install.grubUseUKICmdline' and has no doc-side guidance — see #132 and the reproduction in cozystack/cozystack#2442. Add a short paragraph that names the failure mode, points at the pre-flight warning the apply path now prints, and tells the user the two ways out (reboot into a maintenance image matching the contract, or lower the contract). Keep it next to the existing output-format note where readers are already looking at this setting. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Address review feedback from gemini-code-assist on pkg/commands/preflight.go:21,85: the pre-flight COSI read had no timeout, so a slow or unresponsive node could turn an informational best-effort check into a blocker for apply. Wrap the safe.StateGet call with context.WithTimeout (2s) inside cosiVersionReader. The timeout is short enough to stay invisible on a healthy node and long enough to clear any expected roundtrip; on a hung node, the reader returns ok=false and the existing silent-on-error contract takes over. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/commands/apply.go`:
- Around line 171-180: The code only iterates GlobalArgs.Nodes so when nodes are
resolved from the talosconfig context by wrapWithNodeContext those targets are
skipped; change the loop to iterate the actual resolved node list (e.g. obtain
nodes from wrapWithNodeContext or from the client/context after wrapping)
instead of GlobalArgs.Nodes so preflightCheckTalosVersion runs for every
resolved node and the earlier fmt.Printf shows the correct nodes; update the
call sites around cosiVersionReader, preflightCheckTalosVersion,
wrapWithNodeContext, GlobalArgs.Nodes and applyCmdFlags.talosVersion to use the
resolved nodes collection.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 41a55578-01a5-4de8-922c-15b8915303c9
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (6)
README.mdgo.modmain.gopkg/commands/apply.gopkg/commands/preflight.gopkg/commands/preflight_test.go
…itted Address review feedback from coderabbitai on pkg/commands/apply.go:178: the direct-patch closure iterated GlobalArgs.Nodes only, but wrapWithNodeContext fills ctx via client.WithNodes from the talosconfig context when --nodes is omitted without mutating GlobalArgs.Nodes itself. The preflight loop was a no-op in that path and the log line printed an empty node list. Mirror wrapWithNodeContext's resolution into a local targetNodes slice and use it for both the log line and the per-node preflight loop. With --nodes set, behavior is unchanged. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
The auth template-rendering apply path puts the target node under the plural "nodes" metadata key so helpers.ForEachResource and apid's machine-API backend resolver can read it. Reusing that ctx for the COSI version preflight breaks the preflight: Talos's apid director rejects every COSI method whose outgoing context carries the plural key, regardless of slice length, and cosiVersionReader swallows errors and returns ok=false on rejection. End user sees no version-mismatch warning even when the running Talos predates the configured talosVersion -- defeating the whole point of the preflight added in PR #133. Add cosiPreflightContext: clone the outgoing metadata, drop "nodes", attach "node" with the same target, hand the rebuilt context to the COSI caller. ApplyConfiguration keeps the original ctx unchanged. The helper is a noop on the insecure (maintenance) path that carries no node metadata at all and on multi-node ctx where the single-target shortcut would be unsafe. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
…riptions Source comments and tests had bug-number references (#77, PR #133) that violate the project's commit-message and code-comment standards: comments must be self-explanatory to a reader without access to the issue tracker. Replace each citation with a description of the bug class itself -- 'duplicate primitive-array entries per round-trip' instead of '#77', 'the version-mismatch warning that preflightCheckTalosVersion exists to surface' instead of 'PR #133'. The surrounding prose already says what the issue is, so the numbers came out without losing context. Pre-existing references in code this branch does not touch (preflight_test.go's #132 reproduction comments, engine_test.go's #66 fixture name) are left as-is -- out of scope here. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Why
talm apply -iagainst a node in maintenance mode can fail with a cryptic strict-decoder error like:The field is not in the user's
nodes/*.yamlor in any chart template — it is auto-injected by machinery during config generation when the contract derived fromtemplateOptions.talosVersionoutpaces what's compiled into the running maintenance binary. Real-world example:cozystack/cozystack#2442. Root cause for that user wasboot-to-talos -yesdefaulting to Talos v1.11.6 while the install image targeted v1.12.6, but the cryptic error gave them no clue.Scope is laid out in
cozystack/talm#132. This PR addresses the talm-side improvements from item 3 of that issue, plus a pre-flight detection complement.What
Two complementary improvements:
Pre-flight version check — before
applyactually sends the config, read the running Talos version from the COSIVersions.runtime.talos.dev/runtime/versionresource (NonSensitive, reachable via--insecureReader role) and compare it against the configured contract. When the configured contract is strictly newer than the running version, print a warning with a hint pointing at the maintenance image vs contract mismatch. Best-effort: any read or parse failure returns silently and never blocks apply.Backstop hint on the decode error — when
c.ApplyConfigurationreturns aunknown keys found during decoding:error, attach acockroachdb/errors.WithHintannotation pointing at the same mismatch. The custom error printer in the entrypoint renders hints ashint: ...lines under the main error text. Existingfmt.Errorferrors continue to render unchanged.Both apply paths are covered (template-rendering and direct-patch). Default scope is draft only in the sense that the warning is informational — pre-flight never aborts apply; the user may have a valid reason to pin a lower contract.
Implementation notes
github.com/cockroachdb/errorsforWithHint/GetAllHints. Minimal usage today; future sites can attach hints without callers having to format them.pkg/commands/preflight.gowithpreflightCheckTalosVersion,evaluateVersionMismatch,annotateApplyConfigError, and the two hint constants.pkg/commands/apply.gocalls into both helpers from each apply closure.main.go(entrypoint) renders hints to stderr after the main error message viaerrors.GetAllHints.SilenceErrors/SilenceUsagewere already set on the root cobra command.safe.StateGet[*runtime.Version]— the same pattern astalosctl/cmd/talos/support.goupstream. Usesc.COSI.Getsemantics (nostate.WithSkipProtobufUnmarshal) since theruntimepackage's protobuf types are imported directly.Tests
pkg/commands/preflight_test.go:TestEvaluateVersionMismatch— table-driven, 8 cases covering all comparison branches and unparseable inputs. Asserts that the warning carries a hint and mentions the running version.TestAnnotateApplyConfigError— 3 cases (nil, unrelated error, strict-decoder error). Asserts hint attachment viaerrors.GetAllHints.go test ./...— all pass.golangci-lint run ./...— 0 issues.Verification (manual)
End-to-end:
./talm apply -f nodes/node1.yaml -iagainst a node whereVersions.runtime.talos.devreports a version older thantemplateOptions.talosVersion. Expect apre-flight:warning on stderr with a hint, then apply continues.hint: ...line(s).Closes (or addresses item 3 of) #132.
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests