Skip to content

feat(encryption): sidecar crash-durable read/write (Stage 1)#722

Open
bootjp wants to merge 3 commits intofeat/encryption-foundationfrom
feat/encryption-sidecar
Open

feat(encryption): sidecar crash-durable read/write (Stage 1)#722
bootjp wants to merge 3 commits intofeat/encryption-foundationfrom
feat/encryption-sidecar

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 30, 2026

Summary

Stage 1 of the data-at-rest encryption rollout. Stacked on #719 (Stage 0) — base branch is feat/encryption-foundation. Once Stage 0 merges to main, this PR base will switch to main and the diff will collapse to just the Stage 1 files.

Implements the §5.1 keys.json sidecar layer of docs/design/2026_04_29_proposed_data_at_rest_encryption.md. Library-only; integration with FSM apply (§5.5 reconciliation) and admin commands (§6) lands in Stage 5+.

What is added

  • internal/encryption/sidecar.goSidecar struct mirroring §5.1 JSON exactly:

    • Version, RaftAppliedIndex, StorageEnvelopeActive, RaftEnvelopeCutoverIndex
    • ActiveKeys{Storage, Raft uint32}0 is the §5.1 not-bootstrapped sentinel
    • Keys map[string]SidecarKey — keyed by decimal-string form of key_id per §5.1
    • SidecarKey{Purpose, Wrapped, Created, LocalEpoch}Created is a string (not time.Time) so future timezone-format additions do not break old readers
  • WriteSidecar(path, *Sidecar) error implements the §5.1 crash-durable protocol:

    1. Build contents (Version is set automatically) and validate.
    2. Write to <path>.tmp at mode 0o600, then file.Sync().
    3. os.Rename(<path>.tmp, <path>).
    4. dir.Sync() on the parent so the rename is durable.

    Skipping any step is a data-loss-class bug per §10 lens 1; a power loss between the rename and the directory inode flush can roll back keys.json while the rotation Raft entry is already committed, stranding ciphertext under a wrap that has effectively disappeared. Dir-fsync errors propagate — §5.1 explicitly refuses to start on filesystems (NFS, some FUSE) that cannot guarantee durability of the rename.

  • ReadSidecar(path) (*Sidecar, error) — validates wire version, per-key purpose (storage / raft), and the decimal-uint32 form of every keys-map entry. Returns typed errors (ErrSidecarVersion, ErrSidecarPurpose, ErrSidecarKeyIDFormat, ErrSidecarReservedKeyID).

  • IsNotExist(err) convenience for the first-boot case.

Tests (sidecar_test.go, 22 cases)

  • Round-trip: Write → Read returns the same struct via reflect.DeepEqual.
  • No .tmp file left behind on success; tmp cleaned up on rename failure (target as directory).
  • Overwrite existing.
  • File mode 0o600 (no group/other read/write).
  • WriteSidecar sets Version automatically.
  • Validation: bad purpose / reserved key_id / non-numeric key_id at both Read and Write.
  • ReadSidecar: missing (IsNotExist), unknown version (4 values), corrupt JSON (4 cases), key_id overflowing uint32 (4294967296).
  • Constants pinned (storage / raft purpose strings).

Self-review (CLAUDE.md 5 lenses)

  1. Data loss — Crash-durable protocol matches §5.1; partial-write cleanup test confirms no orphaned .tmp. Validation symmetric at Read and Write so a malformed sidecar cannot land on disk via WriteSidecar.
  2. Concurrency — Sidecar I/O is not concurrent (one writer per data dir). No locking; future Stage 5+ admin layer will serialize at the FSM apply boundary.
  3. Performance — Single read at boot, single write per rotation/cutover entry. Marshal cost is negligible vs. the fsync syscalls.
  4. Data consistencyVersion field is the wire-format gate. Reader fails closed on unknown version. Future v2 will be additive JSON fields plus a bump.
  5. Test coverage — 22 sidecar cases on top of Stage 0 31 tests + 2 property tests + 3 benchmarks; package total now 47 tests, -race clean, 0 lint issues.

Test plan

  • go test ./internal/encryption/... (~7s)
  • go test -race ./internal/encryption/... (~33s)
  • golangci-lint run ./internal/encryption/... (0 issues)
  • Reviewer: confirm Sidecar JSON matches the §5.1 illustrative shape verbatim (field names, Wrapped as base64).
  • Reviewer: confirm crash-durable protocol order (write → file.Sync → rename → dir.Sync) is followed in WriteSidecar.
  • Reviewer: confirm validation is applied at both Read and Write.

Implements the §5.1 keys.json sidecar layer of the data-at-rest encryption design (docs/design/2026_04_29_proposed_data_at_rest_encryption.md). Stage 1 is library-only; integration with FSM apply (§5.5 reconciliation, §6 admin commands) lands in Stage 5+.

Sidecar struct mirrors §5.1 illustrative JSON exactly:

- Version (uint), bumped on layout change; reader rejects mismatch with ErrSidecarVersion so an older binary cannot silently drop fields it does not know about.

- RaftAppliedIndex / StorageEnvelopeActive / RaftEnvelopeCutoverIndex fields for §5.5 sidecar / Raft-log reconciliation.

- ActiveKeys (Storage / Raft uint32) -- 0 means "not bootstrapped" sentinel per §5.1.

- Keys map[string]SidecarKey keyed by decimal-string form of key_id (per §5.1: "JSON object keys must be strings, but the on-disk envelope and the in-memory keystore always work in the binary uint32 form"). Each entry carries Purpose ("storage"/"raft"), Wrapped ([]byte, JSON base64), Created (ISO-8601 string -- not time.Time so future timezone-format additions do not break older readers), and LocalEpoch (uint16, consumed by §4.1 nonce construction).

WriteSidecar implements the §5.1 crash-durable write protocol:

  1. Build new contents + ensure Version is set.

  2. Write to <path>.tmp at mode 0o600, then file.Sync().

  3. os.Rename(<path>.tmp, <path>).

  4. dir.Sync() on the parent so the rename is durable.

Skipping step 2 or 4 is treated as a data-loss-class bug per §10 lens 1: a power loss between the rename and the directory inode flush can roll back keys.json while the rotation Raft entry is already committed, stranding ciphertext under a wrap that has effectively disappeared. Dir-fsync errors propagate -- the design refuses to start on filesystems (NFS, some FUSE) that cannot guarantee durability of the rename.

Validation (applied symmetrically at Read and Write):

- ErrSidecarVersion: wire version != SidecarVersion.

- ErrSidecarPurpose: keys-map entry purpose outside {storage, raft}.

- ErrSidecarKeyIDFormat: keys-map key not a decimal uint32.

- ErrSidecarReservedKeyID: keys-map carries key_id 0 (reserved sentinel).

Tests (sidecar_test.go, 22 cases):

- Round-trip: Write then Read returns the same struct.

- No tmp file left behind on success.

- Overwrite existing.

- File mode 0o600 (no group/other read/write).

- WriteSidecar sets Version automatically.

- Rejects nil sidecar.

- Validation: bad purpose / reserved key_id / non-numeric key_id at Write time.

- ReadSidecar: missing file (IsNotExist), unknown version (multiple), corrupt JSON (4 cases), bad purpose, reserved key_id, non-numeric key_id, key_id overflowing uint32 (4294967296).

- Partial-write cleanup: tmp file is removed if rename fails (target path is a directory).

- Constants pinned (storage / raft purpose strings).

47 tests + 2 property tests + 3 benchmarks total under internal/encryption/, -race clean, 0 lint issues.

Stacked on Stage 0 (PR #719).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 30, 2026

@claude review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 30, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f549e54-0f94-43cf-82bf-3d316b04d229

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/encryption-sidecar

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 3m 14s —— View job


Code Review — PR #722 feat(encryption): sidecar crash-durable read/write (Stage 1)

  • Gather context and read changed files
  • Review sidecar.go — struct layout, crash-durable write protocol, validation
  • Review sidecar_test.go — coverage, edge cases, correctness
  • Cross-check against §5.1 of design doc
  • Post final review

Reviewer checklist items (from PR description)

  • Sidecar JSON matches §5.1 shape verbatim. Field names and ordering are exact: version, raft_applied_index, storage_envelope_active, raft_envelope_cutover_index, active.{storage,raft}, keys.<decimal_id>.{purpose,wrapped,created,local_epoch}. Wrapped []byte marshals to base64 automatically via encoding/json. ✓
  • Crash-durable protocol order confirmed. writeTmpAndFsyncos.RenamesyncDir. Matches §5.1 steps 2→3→4. ✓
  • Validation applied at both Read and Write. validateSidecar is called in WriteSidecar before marshalling and in ReadSidecar after unmarshalling. ✓

Issues found

1. Stale .tmp file on write-phase failure [Minor]

WriteSidecar's cleanup defer is registered only after writeTmpAndFsync returns successfully (sidecar.go:199). If the tmp file is created but f.Write or f.Sync inside writeTmpAndFsync fails, the .tmp file is left on disk and the defer never fires.

// sidecar.go:193–203 — defer is registered AFTER writeTmpAndFsync returns
if err := writeTmpAndFsync(tmpPath, data); err != nil {
    return err   // ← defer not yet registered; stale .tmp stays on disk
}
defer func() {
    if retErr != nil {
        _ = os.Remove(tmpPath)
    }
}()

Not a correctness bug in practice (O_TRUNC fixes it on the next call), but TestWriteSidecar_PartialWriteCleanup tests only the rename-failure path, so the PR description's statement "partial-write cleanup test confirms no orphaned .tmp" is only partially accurate. The write-failure path (disk full, fsync error on tmp) is untested.

The fix is small — add cleanup inside writeTmpAndFsync itself, or move the defer before calling it. Fix this →

2. ErrUnsupportedFilesystem not defined [Spec gap, Stage 1 acceptable]

docs/design/2026_04_29_proposed_data_at_rest_encryption.md §5.1 states:

On filesystems that lack dir.Sync() semantics (NFS, some FUSE mounts) the encryption package refuses to start with ErrUnsupportedFilesystem.

This error is not yet declared in errors.go. Currently syncDir propagates the raw os.Sync error, which makes it impossible for the caller to distinguish "NFS without dir-fsync" from "directory not found". Fine for this library-only Stage 1 (the process-start integration is Stage 5+), but ErrUnsupportedFilesystem should be defined in errors.go now so Stage 5+ can pattern-match on it without a separate PR. Fix this →

3. No cross-validation of Active IDs against the Keys map [Minor, defensive gap]

validateSidecar checks that each entry in Keys is well-formed, but does not verify that Active.Storage and Active.Raft (when non-zero) actually appear in Keys, nor that the referenced entry's Purpose matches the direction (storage/raft). A sidecar with Active.Storage = 999 but no "999" key passes validation and is written to disk, only failing at the higher-layer unwrap call.

This is an intentional Stage 1 boundary (the validator is format-only; semantic constraints belong to the FSM layer). Noting it here so Stage 5+ explicitly adds the cross-check before writing an active-key promotion.


Strengths

  • Crash-durable protocol is faithfully implemented. The four-step sequence in WriteSidecar (writeTmpAndFsyncos.RenamesyncDir) matches §5.1 exactly. syncDir correctly opens the directory and calls f.Sync() so the rename is durable even under a kernel that batches directory writes.
  • Error wrapping is consistent throughout. All boundary errors use pkgerrors.Wrapf, satisfying the wrapcheck linter. Sentinel errors (ErrSidecarVersion, ErrSidecarPurpose, ErrSidecarKeyIDFormat, ErrSidecarReservedKeyID) are chained correctly so callers can errors.Is-match on the cause while still seeing the full path context.
  • IsNotExist is correct. Unwrapping through the pkgerrors chain to fs.ErrNotExist works because os.ReadFileos.PathErrorfs.ErrNotExist and errors.Is traverses the chain. The first-boot case is handled cleanly.
  • File-mode security is correct. O_WRONLY|O_CREATE|O_TRUNC with 0o600 on both create and overwrite paths prevents a TOCTOU window where an attacker reads wrapped key material from a world-readable temp file.
  • WriteSidecar mutates the caller's Version field — this is documented and tested by TestWriteSidecar_SetsVersion. The behaviour is intentional and unambiguous.
  • Test suite is thorough for the paths it covers. 22 cases across round-trip, mode bits, overwrite, cleanup, constants pinning, unknown version (4 values), corrupt JSON (4 cases), and all three key-id error kinds at both Read and Write. -race clean per the PR.

Summary

The implementation is solid and faithful to §5.1. The crash-durable write order, error types, and JSON layout are all correct. The two actionable issues are:

  1. Stale .tmp on write-phase failure (minor but worth closing before Stage 5+ when write failures become load-bearing)
  2. ErrUnsupportedFilesystem should be defined now (small addition to errors.go, avoids a dependency on a later PR)

Issue 3 (cross-validation of Active IDs) is a known Stage 5+ concern, not a blocker.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the Sidecar management system for encryption keys, providing functionality to read and write the keys.json file using a crash-durable protocol. The implementation includes validation for key purposes and IDs, along with comprehensive unit tests. The review feedback suggests enhancing the validation logic to ensure active keys are present in the key map, avoiding side effects by not mutating input structs during write operations, and ensuring that errors from file closure are not silently ignored.

Comment on lines +134 to +141
func validateSidecar(sc *Sidecar) error {
for idStr, k := range sc.Keys {
if err := validateSidecarKey(idStr, k); err != nil {
return err
}
}
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

validateSidecar should verify that the Active keys (Storage and Raft) actually exist in the Keys map if they are non-zero. A sidecar with an active key ID that has no corresponding entry in the Keys map is in an inconsistent state.

func validateSidecar(sc *Sidecar) error {
	for idStr, k := range sc.Keys {
		if err := validateSidecarKey(idStr, k); err != nil {
			return err
		}
	}
	if sc.Active.Storage != ReservedKeyID {
		if _, ok := sc.Keys[strconv.FormatUint(uint64(sc.Active.Storage), 10)]; !ok {
			return pkgerrors.Errorf("encryption: active storage key %d not found in keys map", sc.Active.Storage)
		}
	}
	if sc.Active.Raft != ReservedKeyID {
		if _, ok := sc.Keys[strconv.FormatUint(uint64(sc.Active.Raft), 10)]; !ok {
			return pkgerrors.Errorf("encryption: active raft key %d not found in keys map", sc.Active.Raft)
		}
	}
	return nil
}

Comment thread internal/encryption/sidecar.go Outdated
Comment on lines +180 to +188
sc.Version = SidecarVersion
if err := validateSidecar(sc); err != nil {
return pkgerrors.Wrap(err, "encryption: validate before write")
}

data, err := json.MarshalIndent(sc, "", " ")
if err != nil {
return pkgerrors.Wrap(err, "encryption: marshal sidecar")
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

WriteSidecar mutates the Version field of the input Sidecar struct. This side effect might be unexpected for callers who reuse the struct. It is safer to perform this assignment on a local copy before marshaling.

Suggested change
sc.Version = SidecarVersion
if err := validateSidecar(sc); err != nil {
return pkgerrors.Wrap(err, "encryption: validate before write")
}
data, err := json.MarshalIndent(sc, "", " ")
if err != nil {
return pkgerrors.Wrap(err, "encryption: marshal sidecar")
}
scCopy := *sc
scCopy.Version = SidecarVersion
if err := validateSidecar(&scCopy); err != nil {
return pkgerrors.Wrap(err, "encryption: validate before write")
}
data, err := json.MarshalIndent(&scCopy, "", " ")
if err != nil {
return pkgerrors.Wrap(err, "encryption: marshal sidecar")
}

Comment thread internal/encryption/sidecar.go Outdated
Comment on lines +217 to +231
func writeTmpAndFsync(tmpPath string, data []byte) error {
f, err := os.OpenFile(tmpPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, sidecarFileMode)
if err != nil {
return pkgerrors.Wrapf(err, "encryption: open %q", tmpPath)
}
defer func() { _ = f.Close() }()

if _, err := f.Write(data); err != nil {
return pkgerrors.Wrapf(err, "encryption: write %q", tmpPath)
}
if err := f.Sync(); err != nil {
return pkgerrors.Wrapf(err, "encryption: fsync %q", tmpPath)
}
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error from f.Close() is silently ignored. According to the general rules, errors from Close() should be handled or logged to ensure resource leaks or cleanup issues are visible. Since this is a write operation, it is particularly important to ensure the file is closed successfully.

Suggested change
func writeTmpAndFsync(tmpPath string, data []byte) error {
f, err := os.OpenFile(tmpPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, sidecarFileMode)
if err != nil {
return pkgerrors.Wrapf(err, "encryption: open %q", tmpPath)
}
defer func() { _ = f.Close() }()
if _, err := f.Write(data); err != nil {
return pkgerrors.Wrapf(err, "encryption: write %q", tmpPath)
}
if err := f.Sync(); err != nil {
return pkgerrors.Wrapf(err, "encryption: fsync %q", tmpPath)
}
return nil
}
func writeTmpAndFsync(tmpPath string, data []byte) (err error) {
f, err := os.OpenFile(tmpPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, sidecarFileMode)
if err != nil {
return pkgerrors.Wrapf(err, "encryption: open %q", tmpPath)
}
defer func() {
if closeErr := f.Close(); closeErr != nil && err == nil {
err = pkgerrors.Wrapf(closeErr, "encryption: close %q", tmpPath)
}
}()
if _, err := f.Write(data); err != nil {
return pkgerrors.Wrapf(err, "encryption: write %q", tmpPath)
}
if err := f.Sync(); err != nil {
return pkgerrors.Wrapf(err, "encryption: fsync %q", tmpPath)
}
return nil
}
References
  1. Do not silently ignore errors from Close() methods on resources like network connections. Log them to ensure resource leaks or other cleanup problems are visible.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1def5d849

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/encryption/sidecar.go Outdated
// Helper exists so WriteSidecar does not nest its own defer-on-error
// inside the rename block.
func writeTmpAndFsync(tmpPath string, data []byte) error {
f, err := os.OpenFile(tmpPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, sidecarFileMode)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Force restrictive mode on existing sidecar temp file

WriteSidecar opens <path>.tmp with os.O_CREATE|os.O_TRUNC, but the 0o600 mode is only applied when the file is newly created; if keys.json.tmp already exists (e.g., leftover from older tooling or manually created with broader perms), its permissive mode is retained and then moved into place by os.Rename, making keys.json group/other readable despite the security contract. This is reproducible with a preexisting keys.json.tmp at 0666, which yields a final sidecar mode of 0644 under a typical umask.

Useful? React with 👍 / 👎.

P2 (codex line 218 -- security): writeTmpAndFsync used os.OpenFile with O_CREATE|O_TRUNC; if keys.json.tmp pre-existed at e.g. 0o666 (older tooling, manual poking, partial recovery), the existing mode was retained because O_CREATE only applies the mode arg on a fresh inode. os.Rename then moved the permissive mode into the production keys.json, defeating the wrapped-DEK file-mode guarantee. Fix: os.Remove any pre-existing tmp before O_EXCL-creating fresh at sidecarFileMode (0o600). New TestWriteSidecar_StaleTmpDoesNotLeakPermissiveMode pins the regression.

Medium (gemini line 141 + claude[bot] #3): validateSidecar did not check that Active.Storage / Active.Raft (when non-zero) actually appear in the Keys map. A sidecar with an active key_id but no corresponding wrapped-DEK entry is malformed input; rotation and bootstrap paths always write the two halves together. Added requireActiveKey check in validateSidecar (applied symmetrically at Read and Write); new ErrSidecarActiveKeyMissing typed error. Tests cover both Storage and Raft directions on Write and the Read path.

Medium (gemini line 188): WriteSidecar mutated the callers Sidecar.Version field. Operate on a shallow copy (scCopy := *sc) instead so callers reusing the struct see Version preserved. Updated TestWriteSidecar_OnDiskVersion to assert only the on-disk side; new TestWriteSidecar_DoesNotMutateCaller pins the new contract.

Medium (gemini line 231): writeTmpAndFsync silently ignored f.Close errors. Convert to a named-return retErr pattern with a defer that captures Close failures when the main path otherwise succeeded -- write-back failures are now surfaced.

Minor (claude[bot] #1): WriteSidecar registered the cleanup defer AFTER calling writeTmpAndFsync, so a write/fsync failure inside that helper would leave keys.json.tmp on disk. Move the defer above the call so any failure path -- including pre-rename ones -- triggers cleanup. New TestWriteSidecar_StaleTmpFileIsCleanedOnWriteFailure exercises the read-only-dir path.

Minor (claude[bot] #2): added ErrUnsupportedFilesystem typed error per §5.1. syncDir now wraps the underlying fsync error with this sentinel using pkgerrors.WithSecondaryError so Stage 5+ startup integration can errors.Is-match without parsing strings.

Caller audit per the loop prompt: WriteSidecar / ReadSidecar / syncDir all only have callers in sidecar_test.go for Stage 1 (library only); the syncDir name is shared with internal/raftengine/etcd/persistence.go syncDir but they live in different packages and do not collide. Semantic changes (non-mutating WriteSidecar, Close error propagation, ErrUnsupportedFilesystem wrapping) are contained.

Tests now cover 26 sidecar cases (was 22). Package total: 51 tests + 2 property tests + 3 benchmarks; -race clean, 0 lint issues.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 30, 2026

@claude review

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 2m 28s —— View job


Code Review — PR #722 feat(encryption): sidecar crash-durable read/write (Stage 1) (Round 2)

  • Gather context and read changed files
  • Check what was fixed from round-1 review
  • Review sidecar.go — crash-durable write, validation, error handling
  • Review sidecar_test.go — coverage gaps after fixes
  • Review errors.go — new error types
  • Cross-check against gemini/codex review comments
  • Post final review

Round-1 / Gemini / Codex resolution check

All 6 issues from round 1 are fully resolved in commit 95c04e4.

Issue Reporter Status
Defer registered after writeTmpAndFsync → stale tmp on write-phase failure claude round-1 #1 ✅ Fixed — defer moved above the call; TestWriteSidecar_StaleTmpFileIsCleanedOnWriteFailure added
ErrUnsupportedFilesystem undefined claude round-1 #2 ✅ Fixed — declared in errors.go:50; syncDir wraps raw fsync error with pkgerrors.WithSecondaryError(ErrUnsupportedFilesystem, err)
No cross-validation of Active IDs against Keys map claude round-1 #3 / gemini line 141 ✅ Fixed — requireActiveKey added to validateSidecar; ErrSidecarActiveKeyMissing declared; 3 tests added (Write × 2, Read × 1)
WriteSidecar mutated caller's Version field gemini line 188 ✅ Fixed — scCopy := *sc shallow copy; TestWriteSidecar_DoesNotMutateCaller pins the new contract
f.Close() error silently ignored in writeTmpAndFsync gemini line 231 ✅ Fixed — named-return retErr + defer; close errors propagate when write otherwise succeeded
Pre-existing .tmp mode carried through via os.Rename codex P2 ✅ Fixed — os.Remove then O_EXCL create; TestWriteSidecar_StaleTmpDoesNotLeakPermissiveMode pins the regression

Remaining observations (non-blocking)

1. requireActiveKey does not validate purpose direction [Intentional Stage 5+ deferral]

A sidecar with Active.Storage = 1 where key "1" has Purpose = "raft" passes validateSidecar. The cross-validation added in this round checks existence, not semantic direction. This was noted in round-1 as an intentional boundary: purpose-direction matching belongs to the FSM layer in Stage 5+. No action needed now, but Stage 5+ must add the check before writing an active-key promotion.

2. TestReadSidecar_RejectsActiveKeyMissing covers only the Storage direction [Minor asymmetry]

TestWriteSidecar_RejectsActiveKeyMissing tests both Active.Storage and Active.Raft missing-entry cases on the write path (sidecar_test.go:382–411). The read-path counterpart at sidecar_test.go:413–430 only exercises Active.Storage = 12345 with an empty keys map; the Active.Raft direction is untested on read. The code path (requireActiveKey is called for both) is correct and symmetric — this is just a test coverage gap. A two-line addition to the existing read test would close it. Fix this →

3. syncDir close error is silently ignored [Acceptable]

syncDir defers _ = f.Close() on a read-only directory fd. Since no write-back is in flight, a close failure on a directory fd carries no durability implication. This is idiomatic and correct.


Verification of the crash-durable protocol

WriteSidecar (sidecar.go:200–243) now follows the §5.1 four-step sequence correctly:

1. scCopy = shallow copy; scCopy.Version = SidecarVersion; validateSidecar(&scCopy)
2. defer cleanup(tmpPath) registered   ← BEFORE the following calls
3. writeTmpAndFsync(tmpPath, data)     → O_EXCL create → Write → Sync → Close (error propagated)
4. os.Rename(tmpPath, path)
5. syncDir(dir)                        → open dir → Sync → Close

Step 2 is now registered before step 3 (the round-1 fix), so any failure at the write or fsync inside writeTmpAndFsync triggers cleanup. The os.Remove-then-O_EXCL pattern in writeTmpAndFsync (the codex P2 fix) ensures the inode is always fresh at 0o600. The syncDir error wraps ErrUnsupportedFilesystem with WithSecondaryError so errors.Is-matching works without string parsing at Stage 5+.


Reviewer checklist (PR description)

  • Sidecar JSON matches §5.1 shape verbatim. Field names, JSON keys, and Wrapped []byte → base64 all confirmed. ✓
  • Crash-durable protocol order confirmed. Write → file.Syncos.Renamedir.Sync. ✓
  • Validation applied at both Read and Write. validateSidecar (including requireActiveKey) is called in both paths. ✓

Summary

Round 2 is clean. All round-1, gemini, and codex issues are resolved with appropriate tests. The two remaining observations are a test coverage asymmetry (Read-path Active.Raft direction) and an intentional Stage 5+ deferral. Neither is a blocker. This is ready to merge once the base feat/encryption-foundation branch lands.

…722 r2)

claude[bot] r2: TestReadSidecar_RejectsActiveKeyMissing only tested the Active.Storage direction; the Active.Raft direction was implicitly covered by the same requireActiveKey code path but had no explicit regression. Convert to a table-driven test with three cases (storage-only, raft-only, both) so a future change to the per-purpose validation is caught.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 30, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @bootjp's task in 2m 13s —— View job


Code Review — PR #722 feat(encryption): sidecar crash-durable read/write (Stage 1) (Round 3)

  • Gather context
  • Read current state of changed files (sidecar.go, sidecar_test.go, errors.go)
  • Check what changed since round-2 review (commit 0320384)
  • Review sidecar.go, sidecar_test.go, errors.go
  • Post final review

Round-2 follow-up resolution

The one actionable item from round 2 is closed in commit 0320384.

Issue Status
TestReadSidecar_RejectsActiveKeyMissing missing Raft direction on read path ✅ Fixed — 3 sub-cases now: storage active without entry, raft active without entry, both active without entries (sidecar_test.go:413–443)

Fresh pass — no new issues found

The full three-file diff was reviewed. Everything below is a verification of known-good properties; nothing actionable is new.

Crash-durable protocol (WriteSidecar, lines 200–243)

The §5.1 four-step sequence is faithfully preserved:

  1. Shallow copy (scCopy := *sc) so the caller's struct is not mutated — Version is stamped on the copy only.
  2. validateSidecar(&scCopy) runs before marshalling, so a malformed sidecar cannot reach disk.
  3. Defer for cleanup is registered before writeTmpAndFsync (round-1 fix) — any write or fsync failure cleans up the tmp.
  4. writeTmpAndFsyncos.RenamesyncDir in the correct order.

writeTmpAndFsync (lines 262–286)

  • os.Remove(tmpPath) then O_EXCL create (codex P2 fix) ensures the tmp inode is always fresh at 0o600, regardless of any pre-existing permissive file.
  • Named-return retErr propagates f.Close errors that occur after a successful Sync — not silently dropped.
  • Write-phase failures propagate to the caller's defer, which removes the (not-yet-existent) tmp cleanly via _ = os.Remove.

syncDir (lines 296–308)

Wraps the raw fsync error with pkgerrors.WithSecondaryError(ErrUnsupportedFilesystem, err) so Stage 5+ startup can errors.Is-match on ErrUnsupportedFilesystem without string parsing.

validateSidecar / requireActiveKey (lines 136–165)

  • Per-key format validated via validateSidecarKey (decimal uint32, non-zero, known purpose).
  • requireActiveKey checks existence for both Active.Storage and Active.Raft when non-zero.
  • Known Stage 5+ deferral (unchanged): requireActiveKey does not verify that the referenced key's Purpose matches the expected direction (e.g., Active.Storage pointing to a key with Purpose: "raft" passes). This is an intentional boundary — purpose-direction cross-validation belongs to the FSM layer at Stage 5+.

errors.go

All errors referenced in code and tests are defined: ErrUnsupportedFilesystem (line 50) and ErrSidecarActiveKeyMissing (line 57) added in round-1 fixes; all others from Stage 0 are present.

Test coverage summary

Function Sub-cases Directions covered
TestWriteSidecar_RejectsActiveKeyMissing 2 Storage ✓, Raft ✓
TestReadSidecar_RejectsActiveKeyMissing 3 Storage only ✓, Raft only ✓, Both ✓
TestWriteSidecar_StaleTmpDoesNotLeakPermissiveMode 1 Codex P2 regression ✓
TestWriteSidecar_StaleTmpFileIsCleanedOnWriteFailure 1 Pre-creation failure path ✓
TestWriteSidecar_DoesNotMutateCaller 1 Gemini caller-mutation fix ✓

Reviewer checklist (PR description)

  • Sidecar JSON matches §5.1 shape verbatim. Field names, JSON keys, Wrapped []byte → base64. ✓
  • Crash-durable protocol order confirmed. Write → file.Syncos.Renamedir.Sync. ✓
  • Validation applied at both Read and Write. validateSidecar (including requireActiveKey) runs in both paths. ✓

Summary

Round 3 is clean. The read-path Raft-direction test gap from round 2 is closed. All round-1 (claude), gemini, and codex issues remain resolved. The only open item is the intentional Stage 5+ deferral (purpose-direction cross-validation in requireActiveKey), which is not a blocker.

This PR is ready to merge once the base feat/encryption-foundation (#719) lands on main.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0320384045

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +160 to +163
if _, ok := sc.Keys[idStr]; !ok {
return pkgerrors.Wrapf(ErrSidecarActiveKeyMissing,
"active.%s=%d not present in keys map", purpose, id)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject active key IDs with mismatched purpose

validateSidecar only checks that active.storage/active.raft IDs exist in Keys, but it never verifies that the referenced entry has the matching purpose. A sidecar where active.storage points to a raft key (or vice versa) currently passes both WriteSidecar and ReadSidecar, even though this is malformed state and can route the wrong DEK to a purpose-specific encryption path after restart/rotation.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant