feat(charts,modeline,upgrade): operator extension points + modeline/upgrade fixes#211
Conversation
ParseModeline previously split the modeline body on the literal `, `
(comma+space) string, which is the separator GenerateModeline emits
between key-pairs. JSON arrays serialised by encoding/json carry no
space after each inner comma ("["a","b"]"), so the canonical
talm-generated form parsed back cleanly. But a hand-authored modeline
with whitespace inside the array ("nodes=["a", "b"]" — the
natural shape) was cut at the first inner comma. The leading half
("nodes=["a"") then failed JSON parsing with
"unexpected end of JSON input", with a hint that misled the operator
into thinking their JSON was malformed.
Replace the literal split with a depth-aware tokenizer that tracks
`[`/`]` nesting and JSON string state (including backslash
escapes) and only emits a key-pair boundary when it sees a `,` at
depth 0 outside a string literal. The canonical no-space form still
parses; the human-written form now parses; commas inside string
literals never split the value.
Update the contract test that pinned the old strict separator to
document the relaxation, and add cases for multi-element arrays with
various whitespace shapes plus comma-in-string-literal coverage.
Extend the manual-test-plan with section C6f covering the parser
behaviour on both anchor and side-patch slots.
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…eneric presets
Operators wanting to extend the Talos machine config rendered by the
cozystack or generic presets had to fork the chart or hand-edit the
generated per-node YAML against the autogen banner asking them to
keep template edits in values. Add four "extra*" values keys that
plug into the load-bearing sections of machine.* — kernel modules,
sysctls, kubelet.extraConfig, machine.files — without overriding the
preset's hardcoded defaults.
Cozystack preset (append / merge to existing built-ins):
extraKernelModules [] append to openvswitch / drbd / zfs / spl
/ vfio_pci / vfio_iommu_type1
extraKubeletExtraArgs {} merge into cpuManagerPolicy + maxPods;
last-key-wins on conflict (escape hatch)
extraSysctls {} merge into gc_thresh* + nr_hugepages
extraMachineFiles [] append to CRI customization + lvm.conf
Generic preset (emit only when set):
extraKernelModules [] emits machine.kernel.modules block
extraKubeletExtraArgs {} emits kubelet.extraConfig block
extraSysctls {} emits machine.sysctls block
extraMachineFiles [] emits machine.files block
Contract tests under pkg/engine pin append vs merge semantics for
every key on cozystack, on / off state on generic, and the default
generic render still emits no machine.kernel / sysctls / files /
extraConfig blocks (the existing NoCozystackOpinionsOnGeneric guard
stays green). Manual-test-plan B7 / B8 cover both presets with
forward-looking yq queries against the rendered output.
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…grade talm upgrade resolves the target installer image from values.yaml (canonical "raise the cluster's Talos version" workflow), but the node body's machine.install.image was never resynced afterwards. A follow-up talm apply re-rendered the chart against values.yaml and got the new image, then merged the stale body via MergeFileAsPatch and silently pinned the cluster back to the pre-upgrade image — the next A/B boot would roll back without warning. Point-patch machine.install.image in every -f node body after the upgrade RPC + post-upgrade verify return success. Uses a yaml.v3 node round-trip so the modeline, autogen banner, sibling keys, and per-key comments survive untouched; files without an install.image key (side-patches, orphans, partials) are silently skipped so the handler can blindly fan out over the full -f list. Verify failure intentionally leaves the body alone: the node is on the pre-upgrade image after auto-rollback, and the body must reflect what the node actually runs — not what talosctl was asked to install. Operator opt-out via --skip-post-upgrade-verify (or the --insecure / --stage paths that drop Phase 2C entirely) patches unconditionally on RPC success, trading the verify gate's safety for body-sync. Contract tests pin scalar swap, idempotency (no rewrite when already on target), silent-skip across three no-key shapes, structural-error surface for unexpected machine.install.image kinds, and file-list fan-out semantics. Manual-test-plan E4 covers the happy path, failure-path skip, and operator-opt-out path against a real cluster. Signed-off-by: Aleksei Sviridkin <f@lex.la>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (12)
📝 WalkthroughWalkthroughThis PR adds three orthogonal features: operator-driven Helm extension points for machine configuration ( ChangesOperator Extension Points for Machine Configuration
Modeline Parsing Robustness for JSON Arrays
Post-Upgrade Image Sync for Node Body Files
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces operator-supplied extension points (extraKernelModules, extraSysctls, extraKubeletExtraArgs, and extraMachineFiles) to the cozystack and generic Helm charts, enabling easier customization. It also implements a post-upgrade sync mechanism in talm upgrade that patches the machine.install.image in node configuration files after successful verification, maintaining consistency between local files and the running cluster. Furthermore, the modeline parser has been updated to be depth-aware, allowing for more flexible formatting of multi-element JSON arrays. Review feedback highlights the need for atomic file writes when updating node configurations and suggests collecting errors across all files during the sync process to provide a comprehensive overview of failures.
| } | ||
|
|
||
| //nolint:gosec // filePath is an operator-supplied -f argument; we must write to exactly that path | ||
| if err := os.WriteFile(filePath, out, mode); err != nil { |
There was a problem hiding this comment.
The use of os.WriteFile is not atomic because it truncates the file before writing the new content. If the process is interrupted (e.g., due to a crash or power failure) between truncation and completion of the write, the node configuration file could be left empty or corrupted. For critical configuration files, it is safer to write the content to a temporary file in the same directory and then use os.Rename to replace the original file atomically.
| for _, filePath := range files { | ||
| patched, err := writeBackInstallImageToNodeBody(filePath, newImage) | ||
| if err != nil { | ||
| return err | ||
| } | ||
|
|
There was a problem hiding this comment.
The function returns immediately on the first error encountered while patching node bodies. In a multi-node upgrade scenario, this could leave the local configuration files in an inconsistent state where some are updated and others are not, without the operator seeing errors for the remaining files. Consider collecting errors from all files in a slice and using errors.Join to return them at the end, providing a complete overview of any failures.
References
- Ensure that invalid inputs or states are safely handled in all cases.
Summary
Three independent improvements that surfaced while bringing up a fresh 3-node Talos cluster with Cozystack. Each commit stands alone and can be reverted independently.
fix(modeline): split modeline parts on top-level commas onlyThe
# talm: nodes=[...], endpoints=[...]parser usedstrings.SplitSeq(content, ", ")which cut the value at the first comma INSIDE a multi-element JSON array. Hand-authored shared side-patches withnodes=["a", "b"](space after comma) hiterror parsing JSON array for key nodes, value ["a". The talm-generated form happened to dodge it becausejson.Marshalof a string slice emits no space after the comma.Replace the literal split with a depth-aware tokenizer that splits on
,only at JSON-array depth 0, tracking string state and backslash escapes. Both the canonical no-space form and the human form parse cleanly. Comma inside string literals never splits. Empty arrays, trailing commas, and unbalanced brackets keep their existing semantics (test-pinned).feat(charts): expose Talos operator extension points in cozystack + generic presetsOperators wanting to extend the rendered Talos config (
machine.kernel.modules,machine.sysctls,machine.kubelet.extraConfig,machine.files) had to fork the chart or hand-edit the generatednodes/<n>.yamlagainst the autogen banner. Adds fourextra*values keys to both presets.On cozystack (defaults present): list-shaped keys append; map-shaped keys are collision-checked at template time and
failwith a precise hint if an operator key names a built-in. Built-ins are never overridable — yaml.v3 rejects duplicate map keys on decode, so a silent emit-both would produce a config that cannot round-trip.On generic (no defaults): each block emits only when the matching
extra*key is non-empty.fix(upgrade): point-patch node body install.image after successful upgradetalm upgraderesolves the target installer image fromvalues.yaml, but the per-nodenodes/<n>.yaml::machine.install.imagewas never resynced afterwards. A follow-uptalm applyrendered the chart againstvalues.yaml(got the new image), then merged the stale body viaMergeFileAsPatchand silently pinned the cluster back to the pre-upgrade image — the next A/B boot would roll back without warning.After successful RPC + post-upgrade verify, point-patch
machine.install.imagein every-fnode body to the applied image. Uses a yaml.v3 streaming decoder so the v1.12+ multi-doc shape (machine + cluster + RegistryMirrorConfig + LinkConfig + Layer2VIPConfig + VLANConfig) is preserved intact —yaml.Unmarshalreads only the first document and would silently erase the rest. Encoder pinned at 2-space indent so the diff stays a single line. Comments and modeline survive untouched.Verify-failure (auto-rollback) intentionally leaves the body alone: the node ran the old image, so the body must reflect that. Operator opt-out via
--skip-post-upgrade-verify(and--insecure/--stagepaths that drop Phase 2C entirely) patches unconditionally on RPC success.Verification
go test ./...: all packages green.golangci-lint run ./...: 0 issues.Docs sync
The new
extra*keys and the post-upgrade body-sync behaviour are operator-visible. A parallel update atcozystack/websitewill follow under a separate PR.Summary by CodeRabbit
New Features
Improvements