
fix(raft): auto-repair truncated WAL tail on startup (#613)

Merged

bootjp merged 5 commits into main from ops/wal-auto-repair on Apr 24, 2026
Conversation

@bootjp (Owner) commented Apr 24, 2026

Summary

When the kernel OOM-killer SIGKILLs the app mid-WAL-write, the last preallocated WAL segment is left with a torn trailing record. wal.ReadAll returns io.ErrUnexpectedEOF and our wrapper propagated it directly — the process failed to start and entered a crash-loop until an operator manually quarantined the bad segment.

This PR invokes wal.Repair on startup when ReadAll returns io.ErrUnexpectedEOF, matching the recovery path that etcd's own server uses. The partial tail record is truncated, the node resumes from the last fully-committed index, and catches up from the leader via normal raft replication.

Motivation

2026-04-24 incident: a traffic spike caused the kernel OOM-killer to SIGKILL the app on all 4 live nodes, 22-169 times in 24h. One kill landed inside a WAL record, leaving the node unable to start for 5+ hours until manual quarantine. With this fix the same state would self-repair on the next restart with no operator involvement.

Scope of repair

  • Triggers only on io.ErrUnexpectedEOF — the torn-trailing-record signature.
  • CRC mismatches and other errors propagate unchanged. Those are genuine corruption, not in-flight-write artifacts; auto-truncating them would silently discard valid state.
  • One repair attempt only. If wal.Repair returns false, or the second ReadAll still fails, the error surfaces wrapped as "WAL unrepairable".
  • One log line emitted on repair: "WAL tail truncated, repairing" with dir + original error.

Test plan

  • go build ./...
  • go test ./internal/raftengine/etcd/... — full suite passes (7.3s)
  • TestLoadWalStateRepairsTruncatedTail — seeds a WAL, chops the tail inside framing, asserts re-open succeeds and pre-truncation entries survive
  • TestLoadWalStateUnrepairableCRCMismatchReturnsError — flips a byte inside a record, asserts the error propagates (repair cannot mask real corruption)
  • TestOpenAndReadWALSucceedsWithoutRepair — happy-path sanity check
  • Manual: reproduce the incident — kill a node with kill -9 mid-write (e.g., under heavy PROPOSE load), observe auto-repair on restart, confirm catch-up from leader

Follow-ups (tracked, not in this PR)

When the kernel OOM-killer SIGKILLs the app mid-WAL-write, the last
preallocated WAL segment is left with a torn trailing record.
wal.ReadAll then returns io.ErrUnexpectedEOF and our wrapper
propagated it directly — the process failed to start and entered a
crash-loop until an operator manually quarantined the bad segment.

This is the same failure mode etcd's own server guards against with
wal.Repair, which truncates the partial trailing record so the node
can resume from the last fully-committed index and catch up from the
leader via raft replication.

Change: extract the wal.Open + ReadAll sequence into openAndReadWAL.
On io.ErrUnexpectedEOF from ReadAll, close the handle, log a warning,
invoke wal.Repair, re-open, and retry ReadAll once. A second failure
(or anything other than ErrUnexpectedEOF) surfaces as before. CRC
mismatches and other genuine corruption are deliberately NOT caught
— those are not torn-write artifacts and auto-truncating them would
silently discard valid state.

Motivation: 2026-04-24 incident — a traffic spike caused kernel OOM
on all 4 live nodes 22-169 times in 24h; one unlucky kill landed
inside a WAL record and made the node un-startable for 5+ hours
until manual recovery. With this fix, the same WAL state would have
self-repaired on the next restart with no operator involvement.

Tests (internal/raftengine/etcd/wal_store_test.go):
- TestLoadWalStateRepairsTruncatedTail — seeds a WAL, chops the tail
  inside framing, asserts re-open succeeds and pre-truncation
  entries survive.
- TestLoadWalStateUnrepairableCRCMismatchReturnsError — flips a byte
  inside a record, asserts the error propagates (repair cannot mask
  real corruption).
- TestOpenAndReadWALSucceedsWithoutRepair — happy-path sanity check.

Full internal/raftengine/etcd test suite passes.
@coderabbitai Bot commented Apr 24, 2026

Warning: rate limit exceeded — @bootjp has exceeded the limit for the number of commits that can be reviewed per hour. A review can be re-triggered with the @coderabbitai review command after the wait time elapses, or by pushing new commits.

Review info: configuration used: defaults; review profile: CHILL; plan: Pro; run ID: d7598e2b-2347-44a0-a279-d0d1db40a773.
Reviewing files that changed from the base of the PR and between 99c0560 and 261b47b.
Files selected for processing (2):
  • internal/raftengine/etcd/wal_store.go
  • internal/raftengine/etcd/wal_store_test.go

@gemini-code-assist (Contributor) Bot left a comment

Code Review

This pull request introduces an auto-repair mechanism for etcd WAL files to handle torn trailing records caused by unexpected process termination. The implementation adds an openAndReadWAL helper that catches io.ErrUnexpectedEOF and invokes wal.Repair. Feedback indicates that the repair logic might be bypassed by a preceding call to loadPersistedSnapshot which also scans the WAL. Other suggestions include logging errors when closing the WAL during repair and refining the test case to ensure it truncates actual record framing rather than just zero-padding.

Comment thread internal/raftengine/etcd/wal_store.go Outdated
  }

- w, err := wal.Open(logger, walDir, walSnapshotFor(snapshot))
+ w, hardState, entries, err := openAndReadWAL(logger, walDir, walSnapshotFor(snapshot))

high

The auto-repair logic introduced in openAndReadWAL is likely bypassed because loadPersistedSnapshot (called at line 111) also performs a full scan of the WAL via wal.ValidSnapshotEntries. If the WAL tail is torn, loadPersistedSnapshot will return io.ErrUnexpectedEOF and the process will exit before reaching this repair logic. Consider moving the repair mechanism to a higher level in loadWalState or ensuring loadPersistedSnapshot is also protected against torn trailing records.

Comment thread internal/raftengine/etcd/wal_store.go Outdated
if err == nil {
return w, hardState, entries, nil
}
_ = w.Close()

medium

The error from w.Close() is silently ignored. Per the general rules, errors from Close() should be logged to ensure resource leaks or other cleanup problems are visible.

if closeErr := w.Close(); closeErr != nil {
	logger.Warn("failed to close WAL before repair", zap.Error(closeErr))
}
References
  1. Do not silently ignore errors from Close() methods on resources like network connections. Log them to ensure resource leaks or other cleanup problems are visible.

walPath := lastWALFile(t, filepath.Join(dir, walDirName))
info, err := os.Stat(walPath)
require.NoError(t, err)
require.NoError(t, os.Truncate(walPath, info.Size()-(64*1024+16)))

medium

The truncation info.Size()-(64*1024+16) may not reliably trigger io.ErrUnexpectedEOF. Since etcd WAL files are preallocated (default 64MB), truncating from the end of the file typically only removes zero-padding, which the decoder treats as a clean EOF. To effectively test the repair logic, the truncation should occur within a record's framing rather than the preallocated padding.

… test

Address Gemini review feedback on the original auto-repair commit:

- loadWalState: wrap both loadPersistedSnapshot and openAndReadWAL in
  a single io.ErrUnexpectedEOF → wal.Repair → retry pass at the
  top level. The previous version only retried around openAndReadWAL,
  but loadPersistedSnapshot also scans the WAL via
  wal.ValidSnapshotEntries, so a torn snapshot-record could abort
  startup before the repair logic ran. Hoisting covers both paths
  with one repair attempt, one log line, and no double-repair.
- openAndReadWAL: log the Close() error on the failure path instead
  of `_ = w.Close()`. Likewise for the storage-init failure branch
  in tryLoadWalState. Resource cleanup errors should be visible.
- TestLoadWalStateRepairsTruncatedTail: introduce
  truncateInsideLastRecord which scans the preallocated 64 MiB WAL
  file backwards past the zero-padded tail to find the end of real
  record framing, then truncates 5 bytes inside it. The previous
  `info.Size() - (64*1024+16)` landed in zero padding: the decoder
  returned a clean EOF and the test was passing for the wrong
  reason (no ErrUnexpectedEOF → no repair exercised). With the new
  helper we actually cut mid-record and the assertion reflects real
  repair behaviour.

All three tests still pass; the full internal/raftengine/etcd suite
runs green.
@bootjp (Owner, Author) commented Apr 24, 2026

Addressed review feedback in f3f922d:

  • HIGH: hoisted repair to loadWalState level so it also catches io.ErrUnexpectedEOF from loadPersistedSnapshot → wal.ValidSnapshotEntries (the previous version only retried around openAndReadWAL, leaving a window where a torn snapshot-record aborted startup before repair ran).
  • MEDIUM: w.Close() errors now logged instead of silently dropped in both error branches.
  • MEDIUM: replaced Size()-(64KB+16) with truncateInsideLastRecord which scans the preallocated tail backwards past the zero padding and truncates 5 bytes inside real framing. The previous truncation likely landed in zero padding → test was passing for the wrong reason (no ErrUnexpectedEOF). With the new helper the repair path is actually exercised.
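The backward scan that makes the test cut real framing can be sketched as pure Go on a byte slice (illustrative names; the real `truncateInsideLastRecord` works on the file via os APIs and this sketch ignores the corner case of a record legitimately ending in zero bytes):

```go
package main

// lastDataEnd returns the offset just past the last non-zero byte of a
// preallocated WAL segment image. Everything after it is zero padding,
// which the etcd WAL decoder treats as a clean EOF — truncating there
// exercises nothing.
func lastDataEnd(segment []byte) int {
	end := len(segment)
	for end > 0 && segment[end-1] == 0 {
		end--
	}
	return end
}

// tornOffset returns a truncation point 5 bytes inside the last record's
// framing (matching the helper described above), so re-reading the WAL
// yields io.ErrUnexpectedEOF instead of a clean EOF.
func tornOffset(segment []byte) int {
	end := lastDataEnd(segment)
	const cut = 5
	if end < cut {
		return 0
	}
	return end - cut
}
```

This is why the original `info.Size() - (64*1024+16)` truncation passed for the wrong reason: it landed inside the padding, never inside a record.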

/gemini review
@codex review

@gemini-code-assist (Contributor) Bot left a comment

Code Review

This pull request introduces an auto-repair mechanism for the Write-Ahead Log (WAL) to handle partial trailing records, typically caused by OOM-SIGKILL events. The logic is implemented by retrying the WAL load after invoking wal.Repair if an io.ErrUnexpectedEOF is encountered. Feedback was provided regarding the error handling on the second load attempt; specifically, the error should be wrapped as "WAL unrepairable" to maintain consistency with the PR description and provide better context in logs.

Comment thread internal/raftengine/etcd/wal_store.go Outdated
Comment on lines +125 to +128
if !wal.Repair(logger, walDir) {
return nil, errors.Wrap(err, "WAL unrepairable")
}
return tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)

medium

The implementation does not fully match the PR description regarding error wrapping on the second attempt. The description states: 'If ... the second ReadAll still fails, the error surfaces wrapped as "WAL unrepairable"'. However, the current code returns the error from the second tryLoadWalState call directly without wrapping it. For consistency and better context in logs/error reporting, consider wrapping the second failure as well.

Suggested change
- if !wal.Repair(logger, walDir) {
- 	return nil, errors.Wrap(err, "WAL unrepairable")
- }
- return tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
+ if !wal.Repair(logger, walDir) {
+ 	return nil, errors.Wrap(err, "WAL unrepairable")
+ }
+ ds, err = tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
+ if err != nil {
+ 	return nil, errors.Wrap(err, "WAL unrepairable")
+ }
+ return ds, nil

@chatgpt-codex-connector Bot commented

Codex Review: Didn't find any major issues. Another round soon, please!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…fter repair'

Gemini Medium feedback: the PR description said a second-attempt
failure surfaces wrapped as 'WAL unrepairable', but the code returned
it directly. Wrap the error so operators can grep for the exact
marker and distinguish a failed-repair boot from any other loadWal
failure mode.
@bootjp (Owner, Author) commented Apr 24, 2026

Addressed in 18007b3:

  • Gemini Medium: second tryLoadWalState failure is now wrapped as "WAL unrepairable after repair", matching the PR description.

/gemini review
@codex review

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 18007b3a0e


Comment thread internal/raftengine/etcd/wal_store.go Outdated
// truncates that partial record once and is idempotent, so the
// retry is bounded.
ds, err := tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
if err == nil || !errors.Is(err, io.ErrUnexpectedEOF) {

P2: Restrict WAL repair trigger to WAL read paths

loadWalState now retries/repairs whenever tryLoadWalState returns io.ErrUnexpectedEOF, but tryLoadWalState also includes restoreSnapshotState; if FSM restore returns io.ErrUnexpectedEOF (for example from a truncated or incompatible FSM snapshot payload), this path will run wal.Repair against an otherwise healthy WAL and then re-label the failure as "WAL unrepairable after repair". That misclassifies the fault domain and hides the actionable root cause for operators; the repair gate should only match EOFs coming from WAL scanning (ValidSnapshotEntries/ReadAll), not restore errors.


@gemini-code-assist (Contributor) Bot left a comment

Code Review

This pull request introduces an automatic WAL repair mechanism to handle io.ErrUnexpectedEOF errors, which typically occur when a process is killed mid-write. It refactors loadWalState to include a retry loop that invokes wal.Repair upon encountering a truncated tail. Additionally, the PR improves error handling by logging failures during WAL closure and adds comprehensive unit tests to verify repair behavior and ensure that genuine corruption still results in errors. Feedback was provided to refine the error handling logic in loadWalState to more explicitly return nil for the state object when an error occurs, improving clarity.

Comment thread internal/raftengine/etcd/wal_store.go Outdated
// truncates that partial record once and is idempotent, so the
// retry is bounded.
ds, err := tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
if err == nil || !errors.Is(err, io.ErrUnexpectedEOF) {

medium

The check !errors.Is(err, io.ErrUnexpectedEOF) correctly identifies torn trailing records. However, if tryLoadWalState returns a different error (e.g., a CRC mismatch or permission error), this condition will return nil, err. Since ds is likely nil when err != nil, it is safer to explicitly return nil, err to avoid returning a typed nil interface if ds were ever non-nil in error cases.

if err == nil {
	return ds, nil
}
if !errors.Is(err, io.ErrUnexpectedEOF) {
	return nil, err
}

Codex P2 + Gemini Medium: the round-2 refactor lifted the repair
retry around tryLoadWalState, which also runs restoreSnapshotState
(FSM snapshot decode). A truncated .fsm payload would surface as
io.ErrUnexpectedEOF and incorrectly trigger wal.Repair against a
perfectly-good WAL. The FSM snapshotter has its own on-disk CRC
footer; wal.Repair does nothing for that failure mode.

Split the retry into two per-path helpers:
- loadPersistedSnapshotWithRepair: wraps loadPersistedSnapshot
  (scans WAL via wal.ValidSnapshotEntries).
- openAndReadWALWithRepair: wraps openAndReadWAL.

restoreSnapshotState now runs between the two and its errors
propagate without invoking repair.
@bootjp (Owner, Author) commented Apr 24, 2026

Round 3 in d3fa038:

  • Codex P2 + Gemini Medium: scoped wal.Repair to WAL-only reads. The round-2 refactor wrapped tryLoadWalState which also ran restoreSnapshotState; a truncated .fsm payload returning io.ErrUnexpectedEOF would have triggered wal.Repair on an otherwise-healthy WAL. Split into two per-path helpers (loadPersistedSnapshotWithRepair, openAndReadWALWithRepair); FSM restore runs between them and its errors propagate without invoking repair.

/gemini review
@codex review

@gemini-code-assist (Contributor) Bot left a comment

Code Review

This pull request implements an automatic WAL repair mechanism to handle partial trailing records caused by abrupt process termination. It introduces helper functions loadPersistedSnapshotWithRepair and openAndReadWALWithRepair that attempt to repair the WAL when an io.ErrUnexpectedEOF is encountered. Additionally, new tests simulate torn writes to verify the repair logic and ensure that actual data corruption still triggers errors. The review feedback suggests refactoring the snapshotter instantiation to avoid redundant allocations across these new helper functions.

Comment thread internal/raftengine/etcd/wal_store.go Outdated
Comment on lines +120 to +132
  snapshot, err := loadPersistedSnapshotWithRepair(logger, walDir, snapDir)
  if err != nil {
  	return nil, err
  }
  if err := restoreSnapshotState(fsm, snapshot, fsmSnapDir); err != nil {
  	return nil, err
  }

- w, err := wal.Open(logger, walDir, walSnapshotFor(snapshot))
- if err != nil {
- 	return nil, errors.WithStack(err)
- }
-
- _, hardState, entries, err := w.ReadAll()
+ w, hardState, entries, err := openAndReadWALWithRepair(logger, walDir, walSnapshotFor(snapshot))
  if err != nil {
- 	_ = w.Close()
- 	return nil, errors.WithStack(err)
+ 	return nil, err
  }
  snapshotter := snap.New(logger, snapDir)

medium

The snapshotter is now being instantiated twice: once inside loadPersistedSnapshotWithRepair and again here in loadWalState. It is more efficient and cleaner to create the snapshotter once at the beginning of loadWalState and pass it into the repair wrapper.

Suggested change
- snapshot, err := loadPersistedSnapshotWithRepair(logger, walDir, snapDir)
+ snapshotter := snap.New(logger, snapDir)
+ snapshot, err := loadPersistedSnapshotWithRepair(logger, walDir, snapshotter)
  if err != nil {
  	return nil, err
  }
  if err := restoreSnapshotState(fsm, snapshot, fsmSnapDir); err != nil {
  	return nil, err
  }
  w, hardState, entries, err := openAndReadWALWithRepair(logger, walDir, walSnapshotFor(snapshot))
  if err != nil {
  	return nil, err
  }
- snapshotter := snap.New(logger, snapDir)

Comment thread internal/raftengine/etcd/wal_store.go Outdated
Comment on lines +158 to +160
func loadPersistedSnapshotWithRepair(logger *zap.Logger, walDir, snapDir string) (raftpb.Snapshot, error) {
snapshotter := snap.New(logger, snapDir)
snapshot, err := loadPersistedSnapshot(logger, walDir, snapshotter)

medium

Refactor this function to accept an existing *snap.Snapshotter instead of creating a new one from snapDir. This avoids redundant allocations and aligns with the suggested change in loadWalState.

Suggested change
- func loadPersistedSnapshotWithRepair(logger *zap.Logger, walDir, snapDir string) (raftpb.Snapshot, error) {
- 	snapshotter := snap.New(logger, snapDir)
- 	snapshot, err := loadPersistedSnapshot(logger, walDir, snapshotter)
+ func loadPersistedSnapshotWithRepair(logger *zap.Logger, walDir string, snapshotter *snap.Snapshotter) (raftpb.Snapshot, error) {
+ 	snapshot, err := loadPersistedSnapshot(logger, walDir, snapshotter)

@chatgpt-codex-connector Bot commented

Codex Review: Didn't find any major issues. 🎉


- Gemini Medium (wal_store.go:132 and :160): loadWalState and
  loadPersistedSnapshotWithRepair both instantiated snap.New from the
  same snapDir. Create the snapshotter once in loadWalState and pass
  the pointer down so the helper does not re-allocate.
@bootjp (Owner, Author) commented Apr 24, 2026

Round 4 in 261b47b:

  • Gemini Medium (wal_store.go:132 and :160): shared the snapshotter across loadWalState and loadPersistedSnapshotWithRepair — the helper now takes *snap.Snapshotter from the caller instead of instantiating its own. One allocation per open instead of two.

Build + go test ./internal/raftengine/etcd/... pass; golangci-lint clean on the changed package (the pre-existing goconst warning on wal_purge.go:140 is unrelated to this PR).

No other actionable feedback this round.

/gemini review
@codex review

@chatgpt-codex-connector Bot commented

Codex usage limits for code reviews reached; no Codex review this round.

@gemini-code-assist (Contributor) Bot left a comment

Code Review

This pull request introduces an automatic WAL repair mechanism to handle io.ErrUnexpectedEOF errors, which typically occur when a process is terminated during a write operation. The changes include new helper functions loadPersistedSnapshotWithRepair and openAndReadWALWithRepair that attempt a single wal.Repair before retrying. Additionally, unit tests were added to verify that truncated tails are correctly repaired while genuine CRC mismatches still result in errors. I have no feedback to provide.

bootjp added a commit that referenced this pull request Apr 24, 2026
…#617)

## Summary
Add two OOM-defense defaults to `scripts/rolling-update.sh`:

- `GOMEMLIMIT=1800MiB` (via new `DEFAULT_EXTRA_ENV`, merged into the
existing `EXTRA_ENV` plumbing)
- `--memory=2500m` on the remote `docker run` (via new
`CONTAINER_MEMORY_LIMIT`)

Both are env-var-controlled with empty-string opt-out (`${VAR-default}`
so unset uses the default, but an explicit empty string disables it).

## Motivation
2026-04-24 incident: all 4 live nodes were kernel-OOM-SIGKILLed 22-169
times in 24h under a traffic spike. Each kill risked WAL-tail truncation
and triggered election storms, cascading into p99 GET spikes to 6-8s.
The runtime defense was applied by hand during the incident; this PR
makes it the script default so future rollouts inherit it.

- `GOMEMLIMIT` — Go runtime GCs aggressively as heap approaches the
limit, keeping RSS below the container ceiling.
- `--memory` (cgroup hard limit) — if Go can't keep up (e.g. non-heap
growth), the kill is scoped to the container, not host processes like
`qemu-guest-agent` or `systemd`.

## Behavior changes

| Variable | Default | Opt-out |
|----------|---------|---------|
| `DEFAULT_EXTRA_ENV` | `GOMEMLIMIT=1800MiB` | `DEFAULT_EXTRA_ENV=""` |
| `CONTAINER_MEMORY_LIMIT`| `2500m` | `CONTAINER_MEMORY_LIMIT=""` |

Operator-supplied `EXTRA_ENV` keys override matching keys in
`DEFAULT_EXTRA_ENV` (e.g., `EXTRA_ENV="GOMEMLIMIT=3000MiB"` wins over
the default).
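The `${VAR-default}` opt-out and the key-wise override merge described above can be sketched in shell (variable names match the PR; the merge loop is illustrative, not a quote of `rolling-update.sh`):

```shell
# Defaults with empty-string opt-out: unset -> default, explicit "" -> disabled.
# (Note ${VAR-default}, not ${VAR:-default}: the latter would also replace "".)
DEFAULT_EXTRA_ENV=${DEFAULT_EXTRA_ENV-GOMEMLIMIT=1800MiB}
CONTAINER_MEMORY_LIMIT=${CONTAINER_MEMORY_LIMIT-2500m}

# Operator-supplied EXTRA_ENV keys win over DEFAULT_EXTRA_ENV: a default is
# dropped when EXTRA_ENV already sets the same KEY=.
merged=$EXTRA_ENV
for def in $DEFAULT_EXTRA_ENV; do
  case " $EXTRA_ENV " in
    *" ${def%%=*}="*) ;;         # key overridden by operator
    *) merged="$merged $def" ;;  # default survives
  esac
done

mem_flag=""
[ -n "$CONTAINER_MEMORY_LIMIT" ] && mem_flag="--memory=$CONTAINER_MEMORY_LIMIT"
```

With `EXTRA_ENV="GOMEMLIMIT=3000MiB"` the loop skips the default, so the operator value wins, as the PR describes.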

## Related
Companion PRs (defense-in-depth):
- #612 `memwatch` — graceful shutdown before kernel OOM (prevents WAL
corruption in the first place)
- #613 WAL auto-repair — recovers on startup when the above fails
- #616 rolling-update via GitHub Actions over Tailscale — consumes this
script

## Test plan
- [x] `bash -n scripts/rolling-update.sh` passes
- [x] Deployed equivalents manually on all 4 live nodes during the
incident (2026-04-24T07:44Z - 07:46Z); no OOM recurrence since
- [ ] Next rolling-update invocation should produce `docker run ...
--memory=2500m ... -e GOMEMLIMIT=1800MiB ...` on each node

## Design doc reference
`docs/design/2026_04_24_proposed_resilience_roadmap.md` (item 1 —
capacity/runtime defenses).
bootjp added a commit that referenced this pull request Apr 24, 2026
… starvation (#619)

## Summary
Design doc (only — no code in this PR) for a four-layer
workload-isolation model, prompted by the 2026-04-24 incident's
afternoon phase.

**Problem:** Today, one client host with 37 connections running a tight
XREAD loop consumed 14 CPU cores on the leader via `loadStreamAt →
unmarshalStreamValue → proto.Unmarshal` (81% of CPU per pprof). Raft
goroutines couldn't get CPU → step_queue_full = 75,692 on the leader (vs
0-119 on followers) → Raft commit p99 jumped to 6-10s, Lua p99 stuck at
6-8s. Follower replication was healthy (applied-index within 34 of
leader); the damage was entirely CPU-scheduling on the leader.

**Gap:** elastickv has no explicit workload-class isolation. Go's
scheduler treats every goroutine equally; a single heavy command path
can starve unrelated paths (raft, lease, Lua, GET/SET).

## Four-layer defense model
- **Layer 1 — heavy-command worker pool**: gate XREAD / KEYS / SCAN /
Lua onto a bounded pool (~`2 × GOMAXPROCS`); reply `-BUSY` when full.
Cheap commands keep their own fast path.
- **Layer 2 — locked OS threads for raft**: `runtime.LockOSThread()` on
the Ready loop + dispatcher lanes so the Go scheduler can't starve them.
**Not v1** — only if measurement after Layer 1 + 4 still shows
`step_queue_full > 0`.
- **Layer 3 — per-client admission control**: per-peer-IP connection cap
(default 8). Extends, doesn't replace, roadmap item 6's global in-flight
semaphore.
- **Layer 4 — XREAD O(N) → O(new)**: entry-per-key layout
(`!redis|stream|<key>|entry|<id>`) with range-scan, dual-read migration
fallback, legacy-removal gated on
`elastickv_stream_legacy_format_reads_total == 0`. Hashes/sets/zsets
share the same one-blob pattern and are called out as follow-up.

## Recommended sequencing
Layer 4 (correctness bug, concentrated change) → Layer 1 (generic
defense for next unknown hotspot) → Layer 3 (reconcile with roadmap item
6) → Layer 2 (only if forced by measurement).

## Relationship to other in-flight work
- Complements (does not replace)
`docs/design/2026_04_24_proposed_resilience_roadmap.md` item 6
(admission control). This doc's Layer 3 focuses on per-client fairness;
the roadmap's item 6 is global in-flight capping. Both are needed.
- Consistent with memwatch (#612): Layer 3 admission threshold should
fire **before** memwatch's shutdown threshold — flagged as an open
question in the doc.
- Assumes WAL auto-repair (#613), GOMEMLIMIT defaults (#617) are landed
so the cluster survives long enough to matter.

## Open questions called out in the doc
- Static vs dynamic command classification (Layer 1)
- `-BUSY` backoff semantics — how do we avoid client retry spinning
becoming the new hot loop?
- Number of locked OS threads on variable-core hosts (Layer 2)
- Stream migration soak window before removing legacy-format fallback
(Layer 4, currently 30 days, arbitrary)

## Deliverable
`docs/design/2026_04_24_proposed_workload_isolation.md` — 446 lines,
dated-prefix / `**Status: Proposed**` convention matching the rest of
`docs/design/`. No code.

## Test plan
- [x] File paths and function references in the doc spot-checked against
`origin/main`
- [x] Cross-references to `2026_04_24_proposed_resilience_roadmap.md`
reconciled (complements, doesn't duplicate)
- [ ] Design review — decide on the open questions before implementing
Layer 4 (which blocks Layer 1 on XREAD specifically)
@bootjp bootjp merged commit 2531aa8 into main Apr 24, 2026
8 checks passed
@bootjp bootjp deleted the ops/wal-auto-repair branch April 24, 2026 17:11