fix: RotateKeys atomic rotation, debugPhase empty-patch guard, normalizeEndpoint IPv6 brackets#160
Conversation
Closes #151. The previous implementation called GenerateKey() expecting a fresh identity, but GenerateKey is a load-or-create helper: when talm.key already exists (the normal pre-rotation state) it loads the existing identity instead of generating a new one. Result: rotation was a silent no-op — the same key still encrypted every secret in secrets.encrypted.yaml after the alleged rotation. Fix: remove talm.key before calling GenerateKey so a fresh identity is guaranteed to land on disk. The encrypted file is then re-encrypted with the new key. Test: TestContract_Age_RotateKeys_ReplacesKeyAndPreservesPlaintext (was previously named ...PreservesPlaintextRoundTrip and contained a comment documenting the bug while NOT pinning the broken behaviour). Now asserts both invariants — public key MUST differ across rotation, AND plaintext must round-trip with the new key. The previous-name comment is removed since the bug is fixed. Signed-off-by: Aleksei Sviridkin <f@lex.la>
Closes #152. debugPhase indexed patch[0] without checking that patch was non-empty. A template that conditionally emits nothing legitimately produces "" in the patches slice; the runtime panic surfaced ONLY under --debug, exactly when the operator was using the flag to investigate a problem. Skip empty entries before the patch[0] dispatch. Surviving non-empty patches print as before. Tests in pkg/engine/contract_debugphase_test.go run the helper as a subprocess (it calls os.Exit(0) at the end, can't be exercised in-process) using the os.Args[0]+ENV pattern from Go's stdlib tests. Two cases: - TolerantOfEmptyPatch: empty entry sandwiched between two non-empty patches; surrounding entries still print, no panic. - HandlesAllEmpty: every entry empty; loop skips all of them, exit 0, no panic. Signed-off-by: Aleksei Sviridkin <f@lex.la>
Closes #155. normalizeEndpoint stripped IPv6 brackets via net.SplitHostPort and never re-added them, producing an unparseable URL like "https://2001:db8::1:6443" instead of the canonical bracketed form "https://[2001:db8::1]:6443" required by RFC 3986 §3.2.2 and Go's net/url.Parse. Use net.JoinHostPort for the assembly step — it bracket-wraps any host containing a colon, so the same code path handles IPv4 and IPv6 inputs symmetrically. The no-port branch strips an outer bracket pair (if present) so JoinHostPort can add exactly one pair back. Tests in pkg/commands/contract_helpers_test.go: - The previously-pinned-as-broken IPv6 case `[2001:db8::1]:6443` -> `https://2001:db8::1:6443` (FIXME(#155)) now expects the canonical bracketed form. - New no-explicit-port IPv6 case `[2001:db8::1]` -> `https://[2001:db8::1]:6443` exercises the strip-then-rebracket path. Signed-off-by: Aleksei Sviridkin <f@lex.la>
Address branch-review observation: the two IPv6 subtests previously keyed off the input string, so the output rendered '[2001:db8::1]' twice (with and without port) which read as duplicate at a glance. Switch to named subtests so each row is unambiguous in 'go test -v' output. Signed-off-by: Aleksei Sviridkin <f@lex.la>
Address branch-review blocker: the previous fix for #151 introduced a data-loss window between os.Remove(talm.key) and the final WriteFile of the re-encrypted secrets. Any crash in that window left the operator with an encrypted secrets file paired against either a missing key (unrecoverable) or a fresh-but-empty key (which cannot decrypt the still-old ciphertext). New shape: Phase 1 — read encrypted file, decrypt with old key, all in mem. Phase 2 — generate new identity, encrypt new ciphertext, all in mem. Phase 3 — atomic-rename originals aside as *.rotation-backup. Phase 4 — write new key + encrypted file via secureperm.WriteFile (atomic temp-then-rename, mode 0o600). Any failure here triggers a restore() helper that puts the backups back and returns an error citing the failure stage. Phase 5 — remove backups on success. Refuse to start if leftover backups from a previous interrupted run are present — the operator must inspect them before another rotation runs, otherwise the recovery state would be silently overwritten. Adjacent fix in the same function: secrets.encrypted.yaml is now written via secureperm.WriteFile so it lands at mode 0o600 (was 0o644 — pre-existing world-readable, defense-in-depth). Extracted formatKeyFile() helper so RotateKeys produces the same canonical age keygen layout as GenerateKey. New tests in pkg/age/contract_test.go: - BothFilesMode0600 — both files end up 0o600 after rotation. - NoLeftoverBackups — successful rotation removes both backup files, no clutter. - RefusesOnLeftoverBackup — second rotation refuses with a precise error pointing at the leftover path; the leftover is preserved for manual inspection. Existing ReplacesKeyAndPreservesPlaintext continues to pin the core invariant (oldPub != newPub AND plaintext round-trips). Closes #159 in addition to #151. Also picks up the docstring follow-up on normalizeEndpoint — adds IPv6 examples to the function comment so the contract is visible in godoc, not just in the implementation comment. Signed-off-by: Aleksei Sviridkin <f@lex.la>
Address branch-review blockers on the previous commit:
- Phase 5 cleanup of *.rotation-backup files now returns a non-nil
error if either os.Remove fails, instead of returning nil after
printing a stderr warning. Previously a successful rotation with
a failed cleanup left leftover backups on disk AND the function
returned nil — the next run would then refuse with a 'previous
interrupted run' message that contradicted the success of the
prior call. The new error explicitly states 'rotation committed
but cleanup of backup files failed' so operators see what
actually happened.
- Docstring rewritten:
* Recovery procedure now explains both origins of leftover
*.rotation-backup files (interrupted run vs successful run
with failed cleanup) and gives a way to distinguish via the
timestamp of talm.key vs the backup. The previous wording
advised 'rename them back into place' which was unsafe in the
cleanup-failed case (would overwrite the new key with the old
key while the encrypted file remained encrypted under the new,
now-discarded, key — unrecoverable).
* The contradiction between 'returns nil only after backups are
removed' and 'cleanup failure still returns nil' is resolved
in favour of the former.
- Phase 0 refusal message no longer claims 'interrupted run' (which
is wrong for the cleanup-failed origin); it now lists both
possible origins.
Tests:
- New CleanupFailureSurfacesError test pins that any rename-aside
failure during Phase 3 returns an error AND leaves the originals
untouched on disk (Phase 3 rollback). Failure injection: stage a
directory at the backup destination so os.Rename fails. Skipped
under euid 0 because the directory permissions used in the
injection are ignored by the kernel for root.
- RefusesOnLeftoverBackup tightened to assert that the error names
both possible origins ('interrupted' AND 'cleanup'), so the
operator does not need to consult two recovery procedures.
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Address branch-review observations on the previous commit: 1. The test previously named CleanupFailureSurfacesError did not actually exercise the Phase 5 cleanup path. Staging a directory at the backup destination triggers Phase 0 refusal (os.Stat returns nil error for directories), not a Phase 3 rename failure or a Phase 5 os.Remove failure. Renamed to RefusesWhenDirectoryBlocksBackupPath to match what it actually pins. Comment now explicitly states the Phase 5 path is not reachable through public-API fault injection — the contract there is documented in the function docstring. 2. The Phase 3 rollback rename (between failing-encrypted-rename and the error return) silently swallowed its own error. If the rollback failed the operator was left with keyBackup but no keyFile, and the caller-facing error said only 'failed to back up encrypted file' without flagging the partial state. Apply the same pattern as the restore() closure in Phase 4 — capture the rollback rename's error and include it in the wrapped message with manual-recovery instructions if non-nil. Signed-off-by: Aleksei Sviridkin <f@lex.la>
📝 WalkthroughWalkthroughThis PR implements three independent bug fixes: a complete rewrite of key rotation to actually generate and commit a fresh identity atomically with backup recovery, IPv6 endpoint normalization using RFC-3986-compliant bracket handling, and hardening debug output against empty patch strings. All changes include comprehensive contract test coverage. ChangesAtomic key rotation with backup recovery
IPv6 endpoint canonicalization
Debug output empty patch safety
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request improves the security and atomicity of the key rotation process, ensures correct IPv6 bracketing during endpoint normalization, and fixes a potential panic in the debug phase when encountering empty patches. Comprehensive contract tests were added to verify these changes. Feedback suggests extending the use of secure file permissions to other functions within the age package to ensure a consistent security posture across the codebase.
| if err := secureperm.WriteFile(keyFile, []byte(formatKeyFile(newIdentity, time.Now()))); err != nil { | ||
| return restore("write new key", err) | ||
| } | ||
| if err := secureperm.WriteFile(encryptedFile, encryptedDataNew); err != nil { |
There was a problem hiding this comment.
The use of secureperm.WriteFile here correctly ensures 0o600 permissions for the encrypted secrets file during rotation. However, other functions in this package that write the same file (specifically EncryptSecretsFile at line 231 and EncryptYAMLFile at line 717) still use os.WriteFile with 0o644. For a consistent security posture, consider applying this improvement to those locations as well.
There was a problem hiding this comment.
Right — the same file (secrets.encrypted.yaml) is written by three code paths (EncryptSecretsFile, RotateKeys, EncryptYAMLFile). RotateKeys was switched to secureperm.WriteFile earlier in this PR; the other two stayed on os.WriteFile with 0o644. Both now go through secureperm.WriteFile so every code path that writes encrypted secrets material agrees on mode 0o600. New tests pin the mode invariant for all three paths (TestContract_Age_EncryptSecretsFile_Mode0600, TestContract_Age_EncryptYAMLFile_Mode0600, plus the existing RotateKeys_BothFilesMode0600). See ef886b6.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/engine/contract_debugphase_test.go (1)
59-59: ⚡ Quick winUse
CommandContextfor subprocess invocation.The linter flags
exec.Commandcalls without context. Usingexec.CommandContextenables proper cancellation if the test times out and aligns with Go best practices.🔧 Proposed fix
For line 59:
- cmd := exec.Command(os.Args[0], "-test.run=^TestContract_DebugPhase_TolerantOfEmptyPatch$") + cmd := exec.CommandContext(context.Background(), os.Args[0], "-test.run=^TestContract_DebugPhase_TolerantOfEmptyPatch$")For line 96:
- cmd := exec.Command(os.Args[0], "-test.run=^TestContract_DebugPhase_HandlesAllEmpty$") + cmd := exec.CommandContext(context.Background(), os.Args[0], "-test.run=^TestContract_DebugPhase_HandlesAllEmpty$")Add the
contextimport at the top:import ( + "context" "os" "os/exec"Also applies to: 96-96
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/engine/contract_debugphase_test.go` at line 59, Replace exec.Command calls in TestContract_DebugPhase_TolerantOfEmptyPatch (and the other test at the similar site) with exec.CommandContext: add the context import, create a context (preferably from t.Context() or a context.WithTimeout/WithCancel and defer cancel()) and pass that ctx into exec.CommandContext instead of exec.Command; ensure you call cancel if you used WithTimeout/WithCancel and update both occurrences referenced (the one constructing cmd at TestContract_DebugPhase_TolerantOfEmptyPatch and the other similar test).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/age/age.go`:
- Around line 515-518: EncryptSecretsFile and EncryptYAMLFile currently write
encrypted outputs with os.WriteFile(…, 0o644) while RotateKeys uses
secureperm.WriteFile to enforce 0o600; update EncryptSecretsFile and
EncryptYAMLFile to use secureperm.WriteFile (or otherwise set file mode 0o600
atomically) when writing the encrypted outputs so new files receive the same
0600 hardening as RotateKeys, preserving the permission contract for initial
encrypt/init paths; reference EncryptSecretsFile, EncryptYAMLFile, RotateKeys,
and secureperm.WriteFile when making the change.
---
Nitpick comments:
In `@pkg/engine/contract_debugphase_test.go`:
- Line 59: Replace exec.Command calls in
TestContract_DebugPhase_TolerantOfEmptyPatch (and the other test at the similar
site) with exec.CommandContext: add the context import, create a context
(preferably from t.Context() or a context.WithTimeout/WithCancel and defer
cancel()) and pass that ctx into exec.CommandContext instead of exec.Command;
ensure you call cancel if you used WithTimeout/WithCancel and update both
occurrences referenced (the one constructing cmd at
TestContract_DebugPhase_TolerantOfEmptyPatch and the other similar test).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 34a11823-8284-4a12-88af-ca712b34d462
📒 Files selected for processing (6)
pkg/age/age.gopkg/age/contract_test.gopkg/commands/contract_helpers_test.gopkg/commands/talosctl_wrapper.gopkg/engine/contract_debugphase_test.gopkg/engine/engine.go
Windows CI failed the BothFilesMode0600 assertion: NTFS does not honour Unix permission bits, so os.FileInfo.Mode().Perm() always reports 0o666 regardless of the actual DACL. The secureperm package already has a Windows-side test that exercises the equivalent contract via security-descriptor APIs (pkg/secureperm/secureperm_windows_test.go) — skipping here avoids the duplicate platform-specific assertion in pkg/age. Signed-off-by: Aleksei Sviridkin <f@lex.la>
Address review feedback from gemini-code-assist on pkg/age/age.go:613: the same file (secrets.encrypted.yaml) is written by three code paths (EncryptSecretsFile, RotateKeys, EncryptYAMLFile). RotateKeys was switched to secureperm.WriteFile (mode 0o600) earlier in this PR; the other two still used os.WriteFile with 0o644, which made the on-disk mode depend on which code path last touched the file. Switch both to secureperm.WriteFile so every code path that writes encrypted secrets material agrees on the same defense-in-depth permission. Same rationale as the existing 0o600 on talm.key itself. Tests pin the mode invariant for all three paths: - EncryptSecretsFile_Mode0600 (new) - EncryptYAMLFile_Mode0600 (new) - RotateKeys_BothFilesMode0600 (existing) All three skip on Windows because NTFS does not honour Unix permission bits; secureperm has a Windows-side DACL test that covers the equivalent contract. Signed-off-by: Aleksei Sviridkin <f@lex.la>
Squash of the 10 commits produced by the pkg/age cleanup agent
(branch worktree-agent-a686a1f4e92a27bbd) into feat/strict-lint.
The agent's branch was based on a pre-feat/strict-lint commit so
cherry-picking individually conflicted; replacing pkg/age/* with
the agent's final state keeps the same end result with a single
parent.
Categories addressed inside pkg/age (verified by agent: scope lint
count == 0, build clean, race tests pass):
- godot, godoclint -> doc comments terminate with period
- err113, errorlint -> introduce ErrLeftoverRotationBackup sentinel,
switch to cockroachdb/errors throughout
- dupl, nestif, forcetypeassert -> extract shared helpers
encryptYAMLPair/decryptYAMLPair/
encryptYAMLMap/mergeAndEncryptYAMLMap
- wrapcheck -> wrap external errors via cockroachdb/errors,
errInternalInvariant for invariant violations
- noinlineerr -> hoist 'err :=' assignments out of if-init
(~70 sites in tests + age.go)
- nlreturn, wsl_v5 -> blank lines around terminating statements
- varnamelen -> typed (was v) in type-switches
- goconst -> goosWindows constant for the 'windows' literal
- gocognit -> mergeAndEncryptYAMLValues split into
mergeAndEncryptMap/Slice/String/Scalar +
tryReuseCiphertext
- gocyclo -> RotateKeys split into refuseIfLeftoverBackups/
prepareRotationCiphertext/swapOriginalsAside/
restoreFromBackups/removeRotationBackups
No semantic changes; the rotation atomicity contract from PR #160
is preserved (the helpers are pure refactors of the existing flow).
Three nolint:wrapcheck sites remain at errors.WithHint/WithHintf
boundaries — these are the project's operator-facing hint helpers
and cannot be wrapped further without changing semantics; comments
explain why on each.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Summary
Three Go-side bug fixes covering
RotateKeysatomicity (#151 + #159),debugPhasepanic on empty patch (#152), andnormalizeEndpointIPv6 bracketing (#155). Each fix replaces a previously-pinned-as-broken contract test with a real-behaviour pin, so a regression that reverts any fix surfaces in CI.Fixes
pkg/age.RotateKeys— closes #151 and #159The original implementation called
GenerateKey()expecting a fresh identity, butGenerateKeyis load-or-create: withtalm.keyalready on disk it loaded the existing identity instead of generating a new one. Rotation was a silent no-op — the same key encrypted every secret after the alleged rotation.The fix is atomic transaction with backup/restore:
*.rotation-backup.secureperm.WriteFile(atomic temp+rename, mode0o600). Failures here trigger arestoreclosure that puts the backups back.Phase 0 refuses to start when leftover
*.rotation-backupfiles are present, with an error that names both possible origins (interrupted run or successful run with failed cleanup) so the operator picks the right recovery path.Adjacent fix in the same function:
secrets.encrypted.yamlis now written viasecureperm.WriteFileso it lands at mode0o600(was0o644— pre-existing, defense-in-depth). The same switch is applied toEncryptSecretsFileandEncryptYAMLFileso all three code paths that write encrypted secrets material agree on the same on-disk mode.Tests pin: public key inequality, plaintext round-trip, both files at
0o600(with Windows skip — NTFS does not honour Unix permission bits), no leftover backups on success, refusal on leftover backups (with both-origins error wording), refusal when something occupies the backup destination.pkg/engine.debugPhase— closes #152debugPhaseindexedpatch[0]without a length guard. A template that conditionally emits nothing legitimately produces""in the slice; the runtime panic surfaced ONLY under--debug— exactly when the operator was using the flag to investigate something. Skip empty entries before the dispatch.Test runs the helper as a subprocess (it calls
os.Exit(0)) using theos.Args[0]+ env-sentinel pattern fromos/exec_test.go. Two cases: empty sandwiched between non-empty (asserts surrounding output preserved + no panic), all-empty (asserts header-only path).pkg/commands.normalizeEndpoint— closes #155net.SplitHostPortstrips IPv6 brackets and the oldfmt.Sprintfdid not re-add them, so[2001:db8::1]:6443became the unparseablehttps://2001:db8::1:6443. Switch tonet.JoinHostPortwhich auto-bracket-wraps any host containing a colon. RFC 3986 §3.2.2 compliant.Docstring updated to include IPv6 examples (now visible in godoc, not just in the implementation comment).
Test pinned-as-broken
FIXME(#155)case rewritten to assert canonicalhttps://[2001:db8::1]:6443; new no-port IPv6 case[2001:db8::1]exercises the strip-then-rebracket path. Subtests renamed to descriptive labels for unambiguous test output.