feat(charts,modeline,upgrade): operator extension points + modeline/upgrade fixes #211

Merged
lexfrei merged 3 commits into main from feat/talm-extension-points-and-upgrade-sync
May 15, 2026
Conversation

@lexfrei lexfrei commented May 15, 2026

Summary

Three independent improvements that surfaced while bringing up a fresh 3-node Talos cluster with Cozystack. Each commit stands alone and can be reverted independently.

fix(modeline): split modeline parts on top-level commas only

The `# talm: nodes=[...], endpoints=[...]` parser used `strings.SplitSeq(content, ", ")`, which cut the value at the first comma inside a multi-element JSON array. Hand-authored shared side-patches with `nodes=["a", "b"]` (space after the comma) failed with `error parsing JSON array for key nodes, value ["a"`. The talm-generated form happened to dodge this because `json.Marshal` of a string slice emits no space after the comma.

Replace the literal split with a depth-aware tokenizer that splits on `,` only at JSON-array depth 0, tracking string state and backslash escapes. Both the canonical no-space form and the human-written form parse cleanly. A comma inside a string literal never splits. Empty arrays, trailing commas, and unbalanced brackets keep their existing semantics (test-pinned).
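
For readers who want to see the idea in isolation, here is a minimal sketch of a depth-aware split. It is not the code from `pkg/modeline/modeline.go`; the function name `splitTopLevel` is hypothetical, and the sketch deliberately leaves trailing-comma and unbalanced-bracket handling to the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTopLevel splits s on commas that sit outside every [...] array and
// outside every double-quoted string. Hypothetical helper for illustration;
// it is not talm's tokenizer.
func splitTopLevel(s string) []string {
	var parts []string
	depth := 0        // current [...] nesting level
	inString := false // inside a "..." literal
	escaped := false  // previous byte was a backslash inside a string
	start := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case escaped:
			escaped = false // this byte is escaped, so it has no special meaning
		case inString && c == '\\':
			escaped = true
		case c == '"':
			inString = !inString
		case inString:
			// ordinary byte inside a string literal
		case c == '[':
			depth++
		case c == ']':
			if depth > 0 {
				depth--
			}
		case c == ',' && depth == 0:
			parts = append(parts, strings.TrimSpace(s[start:i]))
			start = i + 1
		}
	}
	return append(parts, strings.TrimSpace(s[start:]))
}

func main() {
	for _, p := range splitTopLevel(`nodes=["a", "b"], endpoints=["c"]`) {
		fmt.Printf("%q\n", p)
	}
}
```

Scanning bytes rather than runes is safe here because every delimiter is ASCII. A trailing top-level comma shows up as an empty final part, which a caller can accept or reject as it sees fit.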

feat(charts): expose Talos operator extension points in cozystack + generic presets

Operators wanting to extend the rendered Talos config (machine.kernel.modules, machine.sysctls, machine.kubelet.extraConfig, machine.files) had to fork the chart or hand-edit the generated nodes/<n>.yaml against the autogen banner. Adds four extra* values keys to both presets.

On cozystack (defaults present): list-shaped keys append; map-shaped keys are collision-checked at template time and fail with a precise hint if an operator key names a built-in. Built-ins are never overridable — yaml.v3 rejects duplicate map keys on decode, so a silent emit-both would produce a config that cannot round-trip.

On generic (no defaults): each block emits only when the matching extra* key is non-empty.
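
The cozystack semantics for map-shaped keys (merge, but fail on any key that names a built-in) can be pictured with a small Go sketch. The real check lives in the Helm templates, so this only models the behaviour; `mergeRejectingCollisions` and the sample keys and values are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeRejectingCollisions returns builtin plus extra, or an error naming
// every key in extra that already exists in builtin. Hypothetical helper
// that models the collision check; it is not the chart's template code.
func mergeRejectingCollisions(builtin, extra map[string]string) (map[string]string, error) {
	var clashes []string
	for k := range extra {
		if _, ok := builtin[k]; ok {
			clashes = append(clashes, k)
		}
	}
	if len(clashes) > 0 {
		sort.Strings(clashes) // deterministic error text
		return nil, fmt.Errorf("extra keys collide with built-in defaults: %v", clashes)
	}
	merged := make(map[string]string, len(builtin)+len(extra))
	for k, v := range builtin {
		merged[k] = v
	}
	for k, v := range extra {
		merged[k] = v
	}
	return merged, nil
}

func main() {
	// Placeholder built-in; the preset's actual keys and values may differ.
	builtin := map[string]string{"vm.nr_hugepages": "0"}

	merged, err := mergeRejectingCollisions(builtin, map[string]string{"net.ipv4.ip_forward": "1"})
	fmt.Println(merged, err)

	_, err = mergeRejectingCollisions(builtin, map[string]string{"vm.nr_hugepages": "1024"})
	fmt.Println(err)
}
```

List-shaped keys need no such guard: appending operator entries after the built-in ones cannot produce a duplicate map key.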

fix(upgrade): point-patch node body install.image after successful upgrade

talm upgrade resolves the target installer image from values.yaml, but the per-node nodes/<n>.yaml::machine.install.image was never resynced afterwards. A follow-up talm apply rendered the chart against values.yaml (got the new image), then merged the stale body via MergeFileAsPatch and silently pinned the cluster back to the pre-upgrade image — the next A/B boot would roll back without warning.

After successful RPC + post-upgrade verify, point-patch machine.install.image in every -f node body to the applied image. Uses a yaml.v3 streaming decoder so the v1.12+ multi-doc shape (machine + cluster + RegistryMirrorConfig + LinkConfig + Layer2VIPConfig + VLANConfig) is preserved intact — yaml.Unmarshal reads only the first document and would silently erase the rest. Encoder pinned at 2-space indent so the diff stays a single line. Comments and modeline survive untouched.

Verify-failure (auto-rollback) intentionally leaves the body alone: the node ran the old image, so the body must reflect that. Operator opt-out via --skip-post-upgrade-verify (and --insecure / --stage paths that drop Phase 2C entirely) patches unconditionally on RPC success.
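
The gating rule described above reduces to a three-input decision. The sketch below restates it in Go; the type and function names are hypothetical and the handler in `pkg/commands/upgrade_handler.go` may structure this differently.

```go
package main

import "fmt"

// upgradeOutcome holds the three facts that decide whether the node body is
// patched. Hypothetical type, for illustration only.
type upgradeOutcome struct {
	rpcOK         bool // the upgrade RPC returned success
	verifySkipped bool // --skip-post-upgrade-verify, --insecure, or --stage
	verifyOK      bool // post-upgrade verify passed (ignored when skipped)
}

// shouldPatchBody reports whether machine.install.image should be written
// back to the node body files.
func shouldPatchBody(o upgradeOutcome) bool {
	if !o.rpcOK {
		return false // nothing was applied, so the body stays as it is
	}
	if o.verifySkipped {
		return true // operator opted out of the gate: patch on RPC success
	}
	return o.verifyOK // verify failure means auto-rollback: leave the body alone
}

func main() {
	fmt.Println(shouldPatchBody(upgradeOutcome{rpcOK: true, verifyOK: true}))
	fmt.Println(shouldPatchBody(upgradeOutcome{rpcOK: true, verifyOK: false}))
	fmt.Println(shouldPatchBody(upgradeOutcome{rpcOK: true, verifySkipped: true}))
}
```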

Verification

  • go test ./...: all packages green.
  • golangci-lint run ./...: 0 issues.
  • Manual-test-plan extended in each commit (B7/B8 for extension points, C6f for multi-IP modeline, E4 for post-upgrade body sync).

Docs sync

The new extra* keys and the post-upgrade body-sync behaviour are operator-visible. A parallel update at cozystack/website will follow under a separate PR.

Summary by CodeRabbit

  • New Features

    • Added extension points for custom kernel modules, kubelet arguments, sysctls, and machine files.
    • Automatic post-upgrade synchronization of machine install image to preserve state.
    • Enhanced modeline parsing with improved JSON array and whitespace handling.
  • Improvements

    • Render-time validation prevents custom settings from overwriting built-in defaults.
    • Better YAML structure preservation during upgrades.

lexfrei added 3 commits May 15, 2026 22:36
ParseModeline previously split the modeline body on the literal `, `
(comma+space) string, which is the separator GenerateModeline emits
between key-pairs. JSON arrays serialised by encoding/json carry no
space after each inner comma (`["a","b"]`), so the canonical
talm-generated form parsed back cleanly. But a hand-authored modeline
with whitespace inside the array (`nodes=["a", "b"]`, the
natural shape) was cut at the first inner comma. The leading half
(`nodes=["a"`) then failed JSON parsing with
"unexpected end of JSON input", with a hint that misled the operator
into thinking their JSON was malformed.

Replace the literal split with a depth-aware tokenizer that tracks
`[`/`]` nesting and JSON string state (including backslash
escapes) and only emits a key-pair boundary when it sees a `,` at
depth 0 outside a string literal. The canonical no-space form still
parses; the human-written form now parses; commas inside string
literals never split the value.

Update the contract test that pinned the old strict separator to
document the relaxation, and add cases for multi-element arrays with
various whitespace shapes plus comma-in-string-literal coverage.
Extend the manual-test-plan with section C6f covering the parser
behaviour on both anchor and side-patch slots.

Signed-off-by: Aleksei Sviridkin <f@lex.la>
…eneric presets

Operators wanting to extend the Talos machine config rendered by the
cozystack or generic presets had to fork the chart or hand-edit the
generated per-node YAML against the autogen banner asking them to
keep template edits in values. Add four "extra*" values keys that
plug into the load-bearing sections of machine.* — kernel modules,
sysctls, kubelet.extraConfig, machine.files — without overriding the
preset's hardcoded defaults.

Cozystack preset (append / merge to existing built-ins):

  extraKernelModules     []  append to openvswitch / drbd / zfs / spl
                             / vfio_pci / vfio_iommu_type1
  extraKubeletExtraArgs  {}  merge into cpuManagerPolicy + maxPods;
                             last-key-wins on conflict (escape hatch)
  extraSysctls           {}  merge into gc_thresh* + nr_hugepages
  extraMachineFiles      []  append to CRI customization + lvm.conf

Generic preset (emit only when set):

  extraKernelModules     []  emits machine.kernel.modules block
  extraKubeletExtraArgs  {}  emits kubelet.extraConfig block
  extraSysctls           {}  emits machine.sysctls block
  extraMachineFiles      []  emits machine.files block

Contract tests under pkg/engine pin append vs merge semantics for
every key on cozystack, on / off state on generic, and the default
generic render still emits no machine.kernel / sysctls / files /
extraConfig blocks (the existing NoCozystackOpinionsOnGeneric guard
stays green). Manual-test-plan B7 / B8 cover both presets with
forward-looking yq queries against the rendered output.

Signed-off-by: Aleksei Sviridkin <f@lex.la>
…grade

talm upgrade resolves the target installer image from values.yaml
(canonical "raise the cluster's Talos version" workflow), but the
node body's machine.install.image was never resynced afterwards. A
follow-up talm apply re-rendered the chart against values.yaml and
got the new image, then merged the stale body via MergeFileAsPatch
and silently pinned the cluster back to the pre-upgrade image — the
next A/B boot would roll back without warning.

Point-patch machine.install.image in every -f node body after the
upgrade RPC + post-upgrade verify return success. Uses a yaml.v3
node round-trip so the modeline, autogen banner, sibling keys, and
per-key comments survive untouched; files without an install.image
key (side-patches, orphans, partials) are silently skipped so the
handler can blindly fan out over the full -f list.

Verify failure intentionally leaves the body alone: the node is on
the pre-upgrade image after auto-rollback, and the body must reflect
what the node actually runs — not what talosctl was asked to install.
Operator opt-out via --skip-post-upgrade-verify (or the --insecure /
--stage paths that drop Phase 2C entirely) patches unconditionally
on RPC success, trading the verify gate's safety for body-sync.

Contract tests pin scalar swap, idempotency (no rewrite when already
on target), silent-skip across three no-key shapes, structural-error
surface for unexpected machine.install.image kinds, and file-list
fan-out semantics. Manual-test-plan E4 covers the happy path,
failure-path skip, and operator-opt-out path against a real cluster.

Signed-off-by: Aleksei Sviridkin <f@lex.la>
coderabbitai Bot commented May 15, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between a0bd589 and c349a05.

📒 Files selected for processing (12)
  • charts/cozystack/templates/_helpers.tpl
  • charts/cozystack/values.yaml
  • charts/generic/templates/_helpers.tpl
  • charts/generic/values.yaml
  • docs/manual-test-plan.md
  • pkg/commands/contract_upgrade_image_writeback_test.go
  • pkg/commands/upgrade_handler.go
  • pkg/commands/upgrade_image_writeback.go
  • pkg/engine/contract_machine_test.go
  • pkg/modeline/contract_test.go
  • pkg/modeline/modeline.go
  • pkg/modeline/modeline_test.go

Walkthrough

This PR adds three orthogonal features: operator-driven Helm extension points for machine configuration (extra* values), robust modeline parsing for JSON arrays with flexible spacing, and post-upgrade synchronization of machine.install.image in node body files. All changes preserve backward compatibility through empty defaults and conditional rendering.

Changes

Operator Extension Points for Machine Configuration

| Layer | File(s) | Summary |
| --- | --- | --- |
| Cozystack Extension Points and Collision Guards | `charts/cozystack/templates/_helpers.tpl`, `charts/cozystack/values.yaml` | Cozystack template now validates and conditionally merges extraKubeletExtraArgs and extraSysctls into preset defaults while rejecting collisions, and adds optional extraKernelModules and extraMachineFiles rendering. Values file documents the four new extension points with empty-array/empty-map defaults. |
| Generic Chart Extension Points and Emit-Only Rendering | `charts/generic/templates/_helpers.tpl`, `charts/generic/values.yaml` | Generic template conditionally emits machine.kubelet.extraConfig, machine.sysctls, machine.kernel.modules, and machine.files blocks only when corresponding values are non-empty. Values file documents each extension point with empty defaults. |
| Contract Tests for Extension Points and Collision Handling | `pkg/engine/contract_machine_test.go` | Engine contract tests verify cozystack append semantics, collision rejection with render-time errors for overlapping keys, and generic preset conditional emission. YAML decoding helpers extract and validate configuration shapes from rendered multi-document output. |
| Manual QA for Extension Points | `docs/manual-test-plan.md` | Manual test scenarios document cozystack collision-rejection expectations and generic emit-only behavior validation. |

Modeline Parsing Robustness for JSON Arrays

| Layer | File(s) | Summary |
| --- | --- | --- |
| Depth-Aware Modeline Tokenization | `pkg/modeline/modeline.go` | New splitModelineParts tokenizer scans character-by-character tracking JSON array nesting and string escape sequences to split modeline parts on commas only at depth 0. Replaces fixed ", " delimiter, enabling flexible comma spacing inside arrays. |
| Contract Tests for Modeline Parsing | `pkg/modeline/contract_test.go` | Contract tests validate key-value separation with multiple comma/spacing variants, assert commas inside JSON arrays/strings do not split tokens, reject trailing commas at depth 0, and ensure unbalanced brackets produce errors without panic. |
| Table-Driven Tests for Modeline Parsing | `pkg/modeline/modeline_test.go` | Table-driven tests exercise edge cases: multi-element arrays with spaced commas, mixed spacing across key-pairs, commas inside JSON strings, and omitted trailing space in key separators. |
| Manual QA for Apply-Chain Modeline Parsing | `docs/manual-test-plan.md` | Manual test verifies multi-IP modeline JSON array parsing tolerates whitespace variations and that side-patch files carrying modelines are rejected by the modeline gate. |

Post-Upgrade Image Sync for Node Body Files

| Layer | File(s) | Summary |
| --- | --- | --- |
| Core Install-Image Writeback Implementation | `pkg/commands/upgrade_image_writeback.go` | New module implements multi-document YAML decode/encode with pinned indentation, locates and in-place-swaps the machine.install.image scalar in the first matching document, preserves file permissions from existing mode bits, and skips cleanly when key is absent or already on target. |
| Contract Tests for Install-Image Writeback | `pkg/commands/contract_upgrade_image_writeback_test.go` | Contract tests validate scalar swap with comment/modeline preservation, idempotent no-op behavior and stable mtime, silent skip when key missing, error handling for structural YAML issues, byte-for-byte indentation preservation, Unix file mode preservation, multi-document handling with --- separators and documents in any order, and multi-file fan-out. |
| Upgrade Handler Post-Upgrade Sync Integration | `pkg/commands/upgrade_handler.go` | Upgrade handler now sequences post-upgrade verify and calls writeBackInstallImageToFiles to patch node bodies, regardless of verify opt-out or skip-verify. Help text documents the new "Post-upgrade sync" step. Both verify-enabled and verify-skipped paths now perform the sync. |
| Manual QA for Post-Upgrade Image Sync | `docs/manual-test-plan.md` | Manual test verifies machine.install.image in project files is synced only after successful post-upgrade verification while preserving multi-document YAML count and structure. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • cozystack/talm#124: Modifies ParseModeline implementation for tokenizing modeline key/value parts; directly related to this PR's modeline parsing changes.
  • cozystack/talm#154: Adds contract tests pinning the same Helm machine extension-point rendering semantics and ParseModeline parsing behavior updated in this PR.
  • cozystack/talm#116: Refactors the talos.config.machine.common Helm helper template during the Talos v1.12 multi-document migration; overlaps at the same template code paths extended in this PR for extension-point rendering.

Suggested reviewers

  • IvanHunters
  • myasnikovdaniil

Poem

🐰 Three strands of change, all woven tight,
Extra fields that guard and gate just right,
Commas dancing deep in JSON arrays bright,
Images that sync when upgrades take their flight,
A rabbit's gift to keep configs just right! ✨

@lexfrei lexfrei marked this pull request as ready for review May 15, 2026 19:55
@lexfrei lexfrei merged commit 62f344c into main May 15, 2026
14 checks passed
@lexfrei lexfrei deleted the feat/talm-extension-points-and-upgrade-sync branch May 15, 2026 19:55
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces operator-supplied extension points (extraKernelModules, extraSysctls, extraKubeletExtraArgs, and extraMachineFiles) to the cozystack and generic Helm charts, enabling easier customization. It also implements a post-upgrade sync mechanism in talm upgrade that patches the machine.install.image in node configuration files after successful verification, maintaining consistency between local files and the running cluster. Furthermore, the modeline parser has been updated to be depth-aware, allowing for more flexible formatting of multi-element JSON arrays. Review feedback highlights the need for atomic file writes when updating node configurations and suggests collecting errors across all files during the sync process to provide a comprehensive overview of failures.

}

//nolint:gosec // filePath is an operator-supplied -f argument; we must write to exactly that path
if err := os.WriteFile(filePath, out, mode); err != nil {
medium

The use of os.WriteFile is not atomic because it truncates the file before writing the new content. If the process is interrupted (e.g., due to a crash or power failure) between truncation and completion of the write, the node configuration file could be left empty or corrupted. For critical configuration files, it is safer to write the content to a temporary file in the same directory and then use os.Rename to replace the original file atomically.

Comment on lines +295 to +300
for _, filePath := range files {
	patched, err := writeBackInstallImageToNodeBody(filePath, newImage)
	if err != nil {
		return err
	}

medium

The function returns immediately on the first error encountered while patching node bodies. In a multi-node upgrade scenario, this could leave the local configuration files in an inconsistent state where some are updated and others are not, without the operator seeing errors for the remaining files. Consider collecting errors from all files in a slice and using errors.Join to return them at the end, providing a complete overview of any failures.

References
  1. Ensure that invalid inputs or states are safely handled in all cases.
