fix(raft): auto-repair truncated WAL tail on startup #613
Conversation
When the kernel OOM-killer SIGKILLs the app mid-WAL-write, the last preallocated WAL segment is left with a torn trailing record. `wal.ReadAll` then returns `io.ErrUnexpectedEOF` and our wrapper propagated it directly — the process failed to start and entered a crash-loop until an operator manually quarantined the bad segment. This is the same failure mode etcd's own server guards against with `wal.Repair`, which truncates the partial trailing record so the node can resume from the last fully-committed index and catch up from the leader via raft replication.

Change: extract the `wal.Open` + `ReadAll` sequence into `openAndReadWAL`. On `io.ErrUnexpectedEOF` from `ReadAll`, close the handle, log a warning, invoke `wal.Repair`, re-open, and retry `ReadAll` once. A second failure (or anything other than `ErrUnexpectedEOF`) surfaces as before. CRC mismatches and other genuine corruption are deliberately NOT caught — those are not torn-write artifacts and auto-truncating them would silently discard valid state.

Motivation: 2026-04-24 incident — a traffic spike caused kernel OOM on all 4 live nodes 22-169 times in 24h; one unlucky kill landed inside a WAL record and made the node un-startable for 5+ hours until manual recovery. With this fix, the same WAL state would have self-repaired on the next restart with no operator involvement.

Tests (internal/raftengine/etcd/wal_store_test.go):

- TestLoadWalStateRepairsTruncatedTail — seeds a WAL, chops the tail inside framing, asserts re-open succeeds and pre-truncation entries survive.
- TestLoadWalStateUnrepairableCRCMismatchReturnsError — flips a byte inside a record, asserts the error propagates (repair cannot mask real corruption).
- TestOpenAndReadWALSucceedsWithoutRepair — happy-path sanity check.

Full internal/raftengine/etcd test suite passes.
Code Review
This pull request introduces an auto-repair mechanism for etcd WAL files to handle torn trailing records caused by unexpected process termination. The implementation adds an openAndReadWAL helper that catches io.ErrUnexpectedEOF and invokes wal.Repair. Feedback indicates that the repair logic might be bypassed by a preceding call to loadPersistedSnapshot which also scans the WAL. Other suggestions include logging errors when closing the WAL during repair and refining the test case to ensure it truncates actual record framing rather than just zero-padding.
```diff
 }
-w, err := wal.Open(logger, walDir, walSnapshotFor(snapshot))
+w, hardState, entries, err := openAndReadWAL(logger, walDir, walSnapshotFor(snapshot))
```
The auto-repair logic introduced in openAndReadWAL is likely bypassed because loadPersistedSnapshot (called at line 111) also performs a full scan of the WAL via wal.ValidSnapshotEntries. If the WAL tail is torn, loadPersistedSnapshot will return io.ErrUnexpectedEOF and the process will exit before reaching this repair logic. Consider moving the repair mechanism to a higher level in loadWalState or ensuring loadPersistedSnapshot is also protected against torn trailing records.
```go
if err == nil {
	return w, hardState, entries, nil
}
_ = w.Close()
```
The error from w.Close() is silently ignored. Per the general rules, errors from Close() should be logged to ensure resource leaks or other cleanup problems are visible.
```go
if closeErr := w.Close(); closeErr != nil {
	logger.Warn("failed to close WAL before repair", zap.Error(closeErr))
}
```

References

- Do not silently ignore errors from Close() methods on resources like network connections. Log them to ensure resource leaks or other cleanup problems are visible.
```go
walPath := lastWALFile(t, filepath.Join(dir, walDirName))
info, err := os.Stat(walPath)
require.NoError(t, err)
require.NoError(t, os.Truncate(walPath, info.Size()-(64*1024+16)))
```
The truncation info.Size()-(64*1024+16) may not reliably trigger io.ErrUnexpectedEOF. Since etcd WAL files are preallocated (default 64MB), truncating from the end of the file typically only removes zero-padding, which the decoder treats as a clean EOF. To effectively test the repair logic, the truncation should occur within a record's framing rather than the preallocated padding.
… test

Address Gemini review feedback on the original auto-repair commit:

- loadWalState: wrap both loadPersistedSnapshot and openAndReadWAL in a single io.ErrUnexpectedEOF → wal.Repair → retry pass at the top level. The previous version only retried around openAndReadWAL, but loadPersistedSnapshot also scans the WAL via wal.ValidSnapshotEntries, so a torn snapshot-record could abort startup before the repair logic ran. Hoisting covers both paths with one repair attempt, one log line, and no double-repair.
- openAndReadWAL: log the Close() error on the failure path instead of `_ = w.Close()`. Likewise for the storage-init failure branch in tryLoadWalState. Resource cleanup errors should be visible.
- TestLoadWalStateRepairsTruncatedTail: introduce truncateInsideLastRecord which scans the preallocated 64 MiB WAL file backwards past the zero-padded tail to find the end of real record framing, then truncates 5 bytes inside it. The previous `info.Size() - (64*1024+16)` landed in zero padding: the decoder returned a clean EOF and the test was passing for the wrong reason (no ErrUnexpectedEOF → no repair exercised). With the new helper we actually cut mid-record and the assertion reflects real repair behaviour.

All three tests still pass; the full internal/raftengine/etcd suite runs green.
Addressed review feedback in f3f922d:
/gemini review
Code Review
This pull request introduces an auto-repair mechanism for the Write-Ahead Log (WAL) to handle partial trailing records, typically caused by OOM-SIGKILL events. The logic is implemented by retrying the WAL load after invoking wal.Repair if an io.ErrUnexpectedEOF is encountered. Feedback was provided regarding the error handling on the second load attempt; specifically, the error should be wrapped as "WAL unrepairable" to maintain consistency with the PR description and provide better context in logs.
```go
if !wal.Repair(logger, walDir) {
	return nil, errors.Wrap(err, "WAL unrepairable")
}
return tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
```
The implementation does not fully match the PR description regarding error wrapping on the second attempt. The description states: 'If ... the second ReadAll still fails, the error surfaces wrapped as "WAL unrepairable"'. However, the current code returns the error from the second tryLoadWalState call directly without wrapping it. For consistency and better context in logs/error reporting, consider wrapping the second failure as well.
```diff
 if !wal.Repair(logger, walDir) {
 	return nil, errors.Wrap(err, "WAL unrepairable")
 }
-return tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
+ds, err = tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
+if err != nil {
+	return nil, errors.Wrap(err, "WAL unrepairable")
+}
+return ds, nil
```
Codex Review: Didn't find any major issues. Another round soon, please!
…fter repair'

Gemini Medium feedback: the PR description said a second-attempt failure surfaces wrapped as 'WAL unrepairable', but the code returned it directly. Wrap the error so operators can grep for the exact marker and distinguish a failed-repair boot from any other loadWal failure mode.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 18007b3a0e
```go
// truncates that partial record once and is idempotent, so the
// retry is bounded.
ds, err := tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
if err == nil || !errors.Is(err, io.ErrUnexpectedEOF) {
```
Restrict WAL repair trigger to WAL read paths
loadWalState now retries/repairs whenever tryLoadWalState returns io.ErrUnexpectedEOF, but tryLoadWalState also includes restoreSnapshotState; if FSM restore returns io.ErrUnexpectedEOF (for example from a truncated or incompatible FSM snapshot payload), this path will run wal.Repair against an otherwise healthy WAL and then re-label the failure as "WAL unrepairable after repair". That misclassifies the fault domain and hides the actionable root cause for operators; the repair gate should only match EOFs coming from WAL scanning (ValidSnapshotEntries/ReadAll), not restore errors.
Code Review
This pull request introduces an automatic WAL repair mechanism to handle io.ErrUnexpectedEOF errors, which typically occur when a process is killed mid-write. It refactors loadWalState to include a retry loop that invokes wal.Repair upon encountering a truncated tail. Additionally, the PR improves error handling by logging failures during WAL closure and adds comprehensive unit tests to verify repair behavior and ensure that genuine corruption still results in errors. Feedback was provided to refine the error handling logic in loadWalState to more explicitly return nil for the state object when an error occurs, improving clarity.
```go
// truncates that partial record once and is idempotent, so the
// retry is bounded.
ds, err := tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
if err == nil || !errors.Is(err, io.ErrUnexpectedEOF) {
```
The check !errors.Is(err, io.ErrUnexpectedEOF) correctly identifies torn trailing records. However, if tryLoadWalState returns a different error (e.g., a CRC mismatch or permission error), this condition will return nil, err. Since ds is likely nil when err != nil, it is safer to explicitly return nil, err to avoid returning a typed nil interface if ds were ever non-nil in error cases.
```go
if err == nil {
	return ds, nil
}
if !errors.Is(err, io.ErrUnexpectedEOF) {
	return nil, err
}
```

Codex P2 + Gemini Medium: the round-2 refactor lifted the repair retry around tryLoadWalState, which also runs restoreSnapshotState (FSM snapshot decode). A truncated .fsm payload would surface as io.ErrUnexpectedEOF and incorrectly trigger wal.Repair against a perfectly-good WAL. The FSM snapshotter has its own on-disk CRC footer; wal.Repair does nothing for that failure mode. Split the retry into two per-path helpers:

- loadPersistedSnapshotWithRepair: wraps loadPersistedSnapshot (scans WAL via wal.ValidSnapshotEntries).
- openAndReadWALWithRepair: wraps openAndReadWAL.

restoreSnapshotState now runs between the two and its errors propagate without invoking repair.
Round 3 in d3fa038:
/gemini review
Code Review
This pull request implements an automatic WAL repair mechanism to handle partial trailing records caused by abrupt process termination. It introduces helper functions loadPersistedSnapshotWithRepair and openAndReadWALWithRepair that attempt to repair the WAL when an io.ErrUnexpectedEOF is encountered. Additionally, new tests simulate torn writes to verify the repair logic and ensure that actual data corruption still triggers errors. The review feedback suggests refactoring the snapshotter instantiation to avoid redundant allocations across these new helper functions.
```diff
 snapshot, err := loadPersistedSnapshotWithRepair(logger, walDir, snapDir)
 if err != nil {
 	return nil, err
 }
 if err := restoreSnapshotState(fsm, snapshot, fsmSnapDir); err != nil {
 	return nil, err
 }
-w, err := wal.Open(logger, walDir, walSnapshotFor(snapshot))
-if err != nil {
-	return nil, errors.WithStack(err)
-}
-_, hardState, entries, err := w.ReadAll()
+w, hardState, entries, err := openAndReadWALWithRepair(logger, walDir, walSnapshotFor(snapshot))
 if err != nil {
-	_ = w.Close()
-	return nil, errors.WithStack(err)
+	return nil, err
 }
 snapshotter := snap.New(logger, snapDir)
```
The snapshotter is now being instantiated twice: once inside loadPersistedSnapshotWithRepair and again here in loadWalState. It is more efficient and cleaner to create the snapshotter once at the beginning of loadWalState and pass it into the repair wrapper.
```diff
-snapshot, err := loadPersistedSnapshotWithRepair(logger, walDir, snapDir)
+snapshotter := snap.New(logger, snapDir)
+snapshot, err := loadPersistedSnapshotWithRepair(logger, walDir, snapshotter)
 if err != nil {
 	return nil, err
 }
 if err := restoreSnapshotState(fsm, snapshot, fsmSnapDir); err != nil {
 	return nil, err
 }
 w, hardState, entries, err := openAndReadWALWithRepair(logger, walDir, walSnapshotFor(snapshot))
 if err != nil {
 	return nil, err
 }
-snapshotter := snap.New(logger, snapDir)
```
```go
func loadPersistedSnapshotWithRepair(logger *zap.Logger, walDir, snapDir string) (raftpb.Snapshot, error) {
	snapshotter := snap.New(logger, snapDir)
	snapshot, err := loadPersistedSnapshot(logger, walDir, snapshotter)
```
Refactor this function to accept an existing *snap.Snapshotter instead of creating a new one from snapDir. This avoids redundant allocations and aligns with the suggested change in loadWalState.
```diff
-func loadPersistedSnapshotWithRepair(logger *zap.Logger, walDir, snapDir string) (raftpb.Snapshot, error) {
-	snapshotter := snap.New(logger, snapDir)
-	snapshot, err := loadPersistedSnapshot(logger, walDir, snapshotter)
+func loadPersistedSnapshotWithRepair(logger *zap.Logger, walDir string, snapshotter *snap.Snapshotter) (raftpb.Snapshot, error) {
+	snapshot, err := loadPersistedSnapshot(logger, walDir, snapshotter)
```
Codex Review: Didn't find any major issues. 🎉
- Gemini Medium (wal_store.go:132 and :160): loadWalState and loadPersistedSnapshotWithRepair both instantiated snap.New from the same snapDir. Create the snapshotter once in loadWalState and pass the pointer down so the helper does not re-allocate.
Round 4 in 261b47b:
Build + No other actionable feedback this round.

/gemini review
Code Review
This pull request introduces an automatic WAL repair mechanism to handle io.ErrUnexpectedEOF errors, which typically occur when a process is terminated during a write operation. The changes include new helper functions loadPersistedSnapshotWithRepair and openAndReadWALWithRepair that attempt a single wal.Repair before retrying. Additionally, unit tests were added to verify that truncated tails are correctly repaired while genuine CRC mismatches still result in errors. I have no feedback to provide.
…#617)

## Summary

Add two OOM-defense defaults to `scripts/rolling-update.sh`:

- `GOMEMLIMIT=1800MiB` (via new `DEFAULT_EXTRA_ENV`, merged into the existing `EXTRA_ENV` plumbing)
- `--memory=2500m` on the remote `docker run` (via new `CONTAINER_MEMORY_LIMIT`)

Both are env-var-controlled with empty-string opt-out (`${VAR-default}` so unset uses the default, but an explicit empty string disables it).

## Motivation

2026-04-24 incident: all 4 live nodes were kernel-OOM-SIGKILLed 22-169 times in 24h under a traffic spike. Each kill risked WAL-tail truncation and triggered election storms, cascading into p99 GET spikes to 6-8s. The runtime defense was applied by hand during the incident; this PR makes it the script default so future rollouts inherit it.

- `GOMEMLIMIT` — Go runtime GCs aggressively as heap approaches the limit, keeping RSS below the container ceiling.
- `--memory` (cgroup hard limit) — if Go can't keep up (e.g. non-heap growth), the kill is scoped to the container, not host processes like `qemu-guest-agent` or `systemd`.

## Behavior changes

| Variable | Default | Opt-out |
|----------|---------|---------|
| `DEFAULT_EXTRA_ENV` | `GOMEMLIMIT=1800MiB` | `DEFAULT_EXTRA_ENV=""` |
| `CONTAINER_MEMORY_LIMIT` | `2500m` | `CONTAINER_MEMORY_LIMIT=""` |

Operator-supplied `EXTRA_ENV` keys override matching keys in `DEFAULT_EXTRA_ENV` (e.g., `EXTRA_ENV="GOMEMLIMIT=3000MiB"` wins over the default).

## Related

Companion PRs (defense-in-depth):

- #612 `memwatch` — graceful shutdown before kernel OOM (prevents WAL corruption in the first place)
- #613 WAL auto-repair — recovers on startup when the above fails
- #616 rolling-update via GitHub Actions over Tailscale — consumes this script

## Test plan

- [x] `bash -n scripts/rolling-update.sh` passes
- [x] Deployed equivalents manually on all 4 live nodes during the incident (2026-04-24T07:44Z - 07:46Z); no OOM recurrence since
- [ ] Next rolling-update invocation should produce `docker run ... --memory=2500m ... -e GOMEMLIMIT=1800MiB ...` on each node

## Design doc reference

`docs/design/2026_04_24_proposed_resilience_roadmap.md` (item 1 — capacity/runtime defenses).
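The `${VAR-default}` opt-out semantics the summary relies on (unset → default, explicit empty string → disabled) can be checked in isolation. This is an illustrative sketch, not the script itself; only the variable name `CONTAINER_MEMORY_LIMIT` and its default are taken from the PR, and `resolve_limit` is invented for the demo.

```shell
#!/usr/bin/env bash
set -u

resolve_limit() {
  # "${VAR-default}" (note: "-", not ":-") substitutes the default only
  # when the variable is UNSET; an explicit empty string is kept, which
  # is what makes the empty-string opt-out work.
  local limit="${CONTAINER_MEMORY_LIMIT-2500m}"
  if [ -n "$limit" ]; then
    echo "--memory=$limit"
  fi
}

unset CONTAINER_MEMORY_LIMIT
resolve_limit                     # unset -> default: --memory=2500m

CONTAINER_MEMORY_LIMIT=""
resolve_limit                     # explicit empty -> flag omitted entirely

CONTAINER_MEMORY_LIMIT="3000m"
resolve_limit                     # operator override: --memory=3000m
```

Had the script used `${VAR:-default}` instead, the empty-string opt-out would silently re-apply the default, so the `-` form is the load-bearing detail here.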
… starvation (#619)

## Summary

Design doc (only — no code in this PR) for a four-layer workload-isolation model, prompted by the 2026-04-24 incident's afternoon phase.

**Problem:** Today, one client host with 37 connections running a tight XREAD loop consumed 14 CPU cores on the leader via `loadStreamAt → unmarshalStreamValue → proto.Unmarshal` (81% of CPU per pprof). Raft goroutines couldn't get CPU → step_queue_full = 75,692 on the leader (vs 0-119 on followers) → Raft commit p99 jumped to 6-10s, Lua p99 stuck at 6-8s. Follower replication was healthy (applied-index within 34 of leader); the damage was entirely CPU-scheduling on the leader.

**Gap:** elastickv has no explicit workload-class isolation. Go's scheduler treats every goroutine equally; a single heavy command path can starve unrelated paths (raft, lease, Lua, GET/SET).

## Four-layer defense model

- **Layer 1 — heavy-command worker pool**: gate XREAD / KEYS / SCAN / Lua onto a bounded pool (~`2 × GOMAXPROCS`); reply `-BUSY` when full. Cheap commands keep their own fast path.
- **Layer 2 — locked OS threads for raft**: `runtime.LockOSThread()` on the Ready loop + dispatcher lanes so the Go scheduler can't starve them. **Not v1** — only if measurement after Layer 1 + 4 still shows `step_queue_full > 0`.
- **Layer 3 — per-client admission control**: per-peer-IP connection cap (default 8). Extends, doesn't replace, roadmap item 6's global in-flight semaphore.
- **Layer 4 — XREAD O(N) → O(new)**: entry-per-key layout (`!redis|stream|<key>|entry|<id>`) with range-scan, dual-read migration fallback, legacy-removal gated on `elastickv_stream_legacy_format_reads_total == 0`. Hashes/sets/zsets share the same one-blob pattern and are called out as follow-up.

## Recommended sequencing

Layer 4 (correctness bug, concentrated change) → Layer 1 (generic defense for next unknown hotspot) → Layer 3 (reconcile with roadmap item 6) → Layer 2 (only if forced by measurement).

## Relationship to other in-flight work

- Complements (does not replace) `docs/design/2026_04_24_proposed_resilience_roadmap.md` item 6 (admission control). This doc's Layer 3 focuses on per-client fairness; the roadmap's item 6 is global in-flight capping. Both are needed.
- Consistent with memwatch (#612): Layer 3 admission threshold should fire **before** memwatch's shutdown threshold — flagged as an open question in the doc.
- Assumes WAL auto-repair (#613) and GOMEMLIMIT defaults (#617) are landed so the cluster survives long enough to matter.

## Open questions called out in the doc

- Static vs dynamic command classification (Layer 1)
- `-BUSY` backoff semantics — how do we avoid client retry spinning becoming the new hot loop?
- Number of locked OS threads on variable-core hosts (Layer 2)
- Stream migration soak window before removing legacy-format fallback (Layer 4, currently 30 days, arbitrary)

## Deliverable

`docs/design/2026_04_24_proposed_workload_isolation.md` — 446 lines, dated-prefix / `**Status: Proposed**` convention matching the rest of `docs/design/`. No code.

## Test plan

- [x] File paths and function references in the doc spot-checked against `origin/main`
- [x] Cross-references to `2026_04_24_proposed_resilience_roadmap.md` reconciled (complements, doesn't duplicate)
- [ ] Design review — decide on the open questions before implementing Layer 4 (which blocks Layer 1 on XREAD specifically)
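Layer 1's bounded pool with a `-BUSY` fallback can be sketched with a plain buffered-channel semaphore. This is a hypothetical illustration of the design, not elastickv code; all names are invented.

```go
package main

import "fmt"

// heavyGate bounds how many heavy commands (XREAD / KEYS / SCAN / Lua in
// the design doc) run concurrently. The channel's capacity is the pool
// size; a full channel means "reply -BUSY" instead of queueing, so a hot
// loop of heavy commands cannot monopolize the scheduler.
type heavyGate chan struct{}

func newHeavyGate(size int) heavyGate { return make(heavyGate, size) }

// tryAcquire is non-blocking: false means the caller should reply -BUSY.
func (g heavyGate) tryAcquire() bool {
	select {
	case g <- struct{}{}:
		return true
	default:
		return false
	}
}

func (g heavyGate) release() { <-g }

func main() {
	g := newHeavyGate(2)
	fmt.Println(g.tryAcquire(), g.tryAcquire()) // true true — pool has 2 slots
	fmt.Println(g.tryAcquire())                 // false — would reply -BUSY
	g.release()
	fmt.Println(g.tryAcquire()) // true — a slot freed up
}
```

The non-blocking `select` is the key design choice: rejecting at admission keeps the heavy path's backlog on the client side, which is also why the doc flags `-BUSY` retry-backoff semantics as an open question.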
Summary
When the kernel OOM-killer SIGKILLs the app mid-WAL-write, the last preallocated WAL segment is left with a torn trailing record.
`wal.ReadAll` returns `io.ErrUnexpectedEOF` and our wrapper propagated it directly — the process failed to start and entered a crash-loop until an operator manually quarantined the bad segment.

This PR invokes `wal.Repair` on startup when `ReadAll` returns `io.ErrUnexpectedEOF`, matching the recovery path that etcd's own server uses. The partial tail record is truncated, the node resumes from the last fully-committed index, and catches up from the leader via normal raft replication.

Motivation
2026-04-24 incident: a traffic spike caused kernel OOM-SIGKILL on all 4 live nodes 22-169 times in 24h. One kill landed inside a WAL record, making the node un-startable for 5+ hours until manual quarantine. With this fix the same state would self-repair on the next restart with no operator involvement.
Scope of repair
- Repair is attempted only for `io.ErrUnexpectedEOF` — the torn-trailing-record signature.
- If `wal.Repair` returns false, or the second `ReadAll` still fails, the error surfaces wrapped as "WAL unrepairable".
- The repair path logs `"WAL tail truncated, repairing"` with dir + original error.

Test plan
- `go build ./...`
- `go test ./internal/raftengine/etcd/...` — full suite passes (7.3s)
- TestLoadWalStateRepairsTruncatedTail — seeds a WAL, chops the tail inside framing, asserts re-open succeeds and pre-truncation entries survive
- TestLoadWalStateUnrepairableCRCMismatchReturnsError — flips a byte inside a record, asserts the error propagates (repair cannot mask real corruption)
- TestOpenAndReadWALSucceedsWithoutRepair — happy-path sanity check
- `kill -9` mid-write (e.g., under heavy `PROPOSE` load), observe auto-repair on restart, confirm catch-up from leader

Follow-ups (tracked, not in this PR)