
fix(raft): auto-repair truncated WAL tail on startup (#613)

Merged

bootjp merged 5 commits into main from ops/wal-auto-repair on Apr 24, 2026
Conversation

@bootjp (Owner) commented Apr 24, 2026

Summary

When the kernel OOM-killer SIGKILLs the app mid-WAL-write, the last preallocated WAL segment is left with a torn trailing record. wal.ReadAll returns io.ErrUnexpectedEOF and our wrapper propagated it directly — the process failed to start and entered a crash-loop until an operator manually quarantined the bad segment.

This PR invokes wal.Repair on startup when ReadAll returns io.ErrUnexpectedEOF, matching the recovery path that etcd's own server uses. The partial tail record is truncated, the node resumes from the last fully-committed index, and catches up from the leader via normal raft replication.

Motivation

2026-04-24 incident: a traffic spike caused the kernel OOM-killer to SIGKILL the app on all 4 live nodes, 22-169 times in 24h. One kill landed inside a WAL record, leaving the node unable to start for 5+ hours until manual quarantine. With this fix the same state would self-repair on the next restart with no operator involvement.

Scope of repair

  • Triggers only on io.ErrUnexpectedEOF — the torn-trailing-record signature.
  • CRC mismatches and other errors propagate unchanged. Those are genuine corruption, not in-flight-write artifacts; auto-truncating them would silently discard valid state.
  • One repair attempt only. If wal.Repair returns false, or the second ReadAll still fails, the error surfaces wrapped as "WAL unrepairable".
  • One log line emitted on repair: "WAL tail truncated, repairing" with dir + original error.

Test plan

  • go build ./...
  • go test ./internal/raftengine/etcd/... — full suite passes (7.3s)
  • TestLoadWalStateRepairsTruncatedTail — seeds a WAL, chops the tail inside framing, asserts re-open succeeds and pre-truncation entries survive
  • TestLoadWalStateUnrepairableCRCMismatchReturnsError — flips a byte inside a record, asserts the error propagates (repair cannot mask real corruption)
  • TestOpenAndReadWALSucceedsWithoutRepair — happy-path sanity check
  • Manual: reproduce the incident — kill a node with kill -9 mid-write (e.g., under heavy PROPOSE load), observe auto-repair on restart, confirm catch-up from leader

Follow-ups (tracked, not in this PR)

When the kernel OOM-killer SIGKILLs the app mid-WAL-write, the last
preallocated WAL segment is left with a torn trailing record.
wal.ReadAll then returns io.ErrUnexpectedEOF and our wrapper
propagated it directly — the process failed to start and entered a
crash-loop until an operator manually quarantined the bad segment.

This is the same failure mode etcd's own server guards against with
wal.Repair, which truncates the partial trailing record so the node
can resume from the last fully-committed index and catch up from the
leader via raft replication.

Change: extract the wal.Open + ReadAll sequence into openAndReadWAL.
On io.ErrUnexpectedEOF from ReadAll, close the handle, log a warning,
invoke wal.Repair, re-open, and retry ReadAll once. A second failure
(or anything other than ErrUnexpectedEOF) surfaces as before. CRC
mismatches and other genuine corruption are deliberately NOT caught
— those are not torn-write artifacts and auto-truncating them would
silently discard valid state.

Motivation: 2026-04-24 incident — a traffic spike caused kernel OOM
on all 4 live nodes 22-169 times in 24h; one unlucky kill landed
inside a WAL record and made the node un-startable for 5+ hours
until manual recovery. With this fix, the same WAL state would have
self-repaired on the next restart with no operator involvement.

Tests (internal/raftengine/etcd/wal_store_test.go):
- TestLoadWalStateRepairsTruncatedTail — seeds a WAL, chops the tail
  inside framing, asserts re-open succeeds and pre-truncation
  entries survive.
- TestLoadWalStateUnrepairableCRCMismatchReturnsError — flips a byte
  inside a record, asserts the error propagates (repair cannot mask
  real corruption).
- TestOpenAndReadWALSucceedsWithoutRepair — happy-path sanity check.

Full internal/raftengine/etcd test suite passes.
@coderabbitai Bot commented Apr 24, 2026

Warning: rate limit exceeded — @bootjp has exceeded the limit for the number of commits that can be reviewed per hour. A review can be re-triggered with the @coderabbitai review command after the wait time elapses, or by pushing new commits.

Review info: configuration used: defaults; review profile: CHILL; plan: Pro; run ID: d7598e2b-2347-44a0-a279-d0d1db40a773.
Reviewing files that changed from the base of the PR and between 99c0560 and 261b47b.
Files selected for processing (2):
  • internal/raftengine/etcd/wal_store.go
  • internal/raftengine/etcd/wal_store_test.go

@gemini-code-assist (Contributor) Bot left a comment

Code Review

This pull request introduces an auto-repair mechanism for etcd WAL files to handle torn trailing records caused by unexpected process termination. The implementation adds an openAndReadWAL helper that catches io.ErrUnexpectedEOF and invokes wal.Repair. Feedback indicates that the repair logic might be bypassed by a preceding call to loadPersistedSnapshot which also scans the WAL. Other suggestions include logging errors when closing the WAL during repair and refining the test case to ensure it truncates actual record framing rather than just zero-padding.

Comment thread internal/raftengine/etcd/wal_store.go Outdated
  }

- w, err := wal.Open(logger, walDir, walSnapshotFor(snapshot))
+ w, hardState, entries, err := openAndReadWAL(logger, walDir, walSnapshotFor(snapshot))

high

The auto-repair logic introduced in openAndReadWAL is likely bypassed because loadPersistedSnapshot (called at line 111) also performs a full scan of the WAL via wal.ValidSnapshotEntries. If the WAL tail is torn, loadPersistedSnapshot will return io.ErrUnexpectedEOF and the process will exit before reaching this repair logic. Consider moving the repair mechanism to a higher level in loadWalState or ensuring loadPersistedSnapshot is also protected against torn trailing records.

Comment thread internal/raftengine/etcd/wal_store.go Outdated
if err == nil {
return w, hardState, entries, nil
}
_ = w.Close()

medium

The error from w.Close() is silently ignored. Per the general rules, errors from Close() should be logged to ensure resource leaks or other cleanup problems are visible.

if closeErr := w.Close(); closeErr != nil {
	logger.Warn("failed to close WAL before repair", zap.Error(closeErr))
}
References
  1. Do not silently ignore errors from Close() methods on resources like network connections. Log them to ensure resource leaks or other cleanup problems are visible.

walPath := lastWALFile(t, filepath.Join(dir, walDirName))
info, err := os.Stat(walPath)
require.NoError(t, err)
require.NoError(t, os.Truncate(walPath, info.Size()-(64*1024+16)))

medium

The truncation info.Size()-(64*1024+16) may not reliably trigger io.ErrUnexpectedEOF. Since etcd WAL files are preallocated (default 64MB), truncating from the end of the file typically only removes zero-padding, which the decoder treats as a clean EOF. To effectively test the repair logic, the truncation should occur within a record's framing rather than the preallocated padding.

… test

Address Gemini review feedback on the original auto-repair commit:

- loadWalState: wrap both loadPersistedSnapshot and openAndReadWAL in
  a single io.ErrUnexpectedEOF → wal.Repair → retry pass at the
  top level. The previous version only retried around openAndReadWAL,
  but loadPersistedSnapshot also scans the WAL via
  wal.ValidSnapshotEntries, so a torn snapshot-record could abort
  startup before the repair logic ran. Hoisting covers both paths
  with one repair attempt, one log line, and no double-repair.
- openAndReadWAL: log the Close() error on the failure path instead
  of `_ = w.Close()`. Likewise for the storage-init failure branch
  in tryLoadWalState. Resource cleanup errors should be visible.
- TestLoadWalStateRepairsTruncatedTail: introduce
  truncateInsideLastRecord which scans the preallocated 64 MiB WAL
  file backwards past the zero-padded tail to find the end of real
  record framing, then truncates 5 bytes inside it. The previous
  `info.Size() - (64*1024+16)` landed in zero padding: the decoder
  returned a clean EOF and the test was passing for the wrong
  reason (no ErrUnexpectedEOF → no repair exercised). With the new
  helper we actually cut mid-record and the assertion reflects real
  repair behaviour.

All three tests still pass; the full internal/raftengine/etcd suite
runs green.
@bootjp (Owner, Author) commented Apr 24, 2026

Addressed review feedback in f3f922d:

  • HIGH: hoisted repair to loadWalState level so it also catches io.ErrUnexpectedEOF from loadPersistedSnapshot → wal.ValidSnapshotEntries (the previous version only retried around openAndReadWAL, leaving a window where a torn snapshot-record aborted startup before repair ran).
  • MEDIUM: w.Close() errors now logged instead of silently dropped in both error branches.
  • MEDIUM: replaced Size()-(64KB+16) with truncateInsideLastRecord which scans the preallocated tail backwards past the zero padding and truncates 5 bytes inside real framing. The previous truncation likely landed in zero padding → test was passing for the wrong reason (no ErrUnexpectedEOF). With the new helper the repair path is actually exercised.
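The backward scan that makes the test cut real framing can be sketched as pure Go on a byte slice (illustrative names; the real `truncateInsideLastRecord` works on the file via os APIs and this sketch ignores the corner case of a record legitimately ending in zero bytes):

```go
package main

// lastDataEnd returns the offset just past the last non-zero byte of a
// preallocated WAL segment image. Everything after it is zero padding,
// which the etcd WAL decoder treats as a clean EOF — truncating there
// exercises nothing.
func lastDataEnd(segment []byte) int {
	end := len(segment)
	for end > 0 && segment[end-1] == 0 {
		end--
	}
	return end
}

// tornOffset returns a truncation point 5 bytes inside the last record's
// framing (matching the helper described above), so re-reading the WAL
// yields io.ErrUnexpectedEOF instead of a clean EOF.
func tornOffset(segment []byte) int {
	end := lastDataEnd(segment)
	const cut = 5
	if end < cut {
		return 0
	}
	return end - cut
}
```

This is why the original `info.Size() - (64*1024+16)` truncation passed for the wrong reason: it landed inside the padding, never inside a record.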

/gemini review
@codex review

@gemini-code-assist (Contributor) Bot left a comment

Code Review

This pull request introduces an auto-repair mechanism for the Write-Ahead Log (WAL) to handle partial trailing records, typically caused by OOM-SIGKILL events. The logic is implemented by retrying the WAL load after invoking wal.Repair if an io.ErrUnexpectedEOF is encountered. Feedback was provided regarding the error handling on the second load attempt; specifically, the error should be wrapped as "WAL unrepairable" to maintain consistency with the PR description and provide better context in logs.

Comment thread internal/raftengine/etcd/wal_store.go Outdated
Comment on lines +125 to +128
if !wal.Repair(logger, walDir) {
return nil, errors.Wrap(err, "WAL unrepairable")
}
return tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)

medium

The implementation does not fully match the PR description regarding error wrapping on the second attempt. The description states: 'If ... the second ReadAll still fails, the error surfaces wrapped as "WAL unrepairable"'. However, the current code returns the error from the second tryLoadWalState call directly without wrapping it. For consistency and better context in logs/error reporting, consider wrapping the second failure as well.

Suggested change
- if !wal.Repair(logger, walDir) {
- 	return nil, errors.Wrap(err, "WAL unrepairable")
- }
- return tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
+ if !wal.Repair(logger, walDir) {
+ 	return nil, errors.Wrap(err, "WAL unrepairable")
+ }
+ ds, err = tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
+ if err != nil {
+ 	return nil, errors.Wrap(err, "WAL unrepairable")
+ }
+ return ds, nil

@chatgpt-codex-connector Bot commented

Codex Review: Didn't find any major issues. Another round soon, please!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…fter repair'

Gemini Medium feedback: the PR description said a second-attempt
failure surfaces wrapped as 'WAL unrepairable', but the code returned
it directly. Wrap the error so operators can grep for the exact
marker and distinguish a failed-repair boot from any other loadWal
failure mode.
@bootjp (Owner, Author) commented Apr 24, 2026

Addressed in 18007b3:

  • Gemini Medium: second tryLoadWalState failure is now wrapped as "WAL unrepairable after repair", matching the PR description.

/gemini review
@codex review

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 18007b3a0e


Comment thread internal/raftengine/etcd/wal_store.go Outdated
// truncates that partial record once and is idempotent, so the
// retry is bounded.
ds, err := tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
if err == nil || !errors.Is(err, io.ErrUnexpectedEOF) {

P2: Restrict WAL repair trigger to WAL read paths

loadWalState now retries/repairs whenever tryLoadWalState returns io.ErrUnexpectedEOF, but tryLoadWalState also includes restoreSnapshotState; if FSM restore returns io.ErrUnexpectedEOF (for example from a truncated or incompatible FSM snapshot payload), this path will run wal.Repair against an otherwise healthy WAL and then re-label the failure as "WAL unrepairable after repair". That misclassifies the fault domain and hides the actionable root cause for operators; the repair gate should only match EOFs coming from WAL scanning (ValidSnapshotEntries/ReadAll), not restore errors.


@gemini-code-assist (Contributor) Bot left a comment

Code Review

This pull request introduces an automatic WAL repair mechanism to handle io.ErrUnexpectedEOF errors, which typically occur when a process is killed mid-write. It refactors loadWalState to include a retry loop that invokes wal.Repair upon encountering a truncated tail. Additionally, the PR improves error handling by logging failures during WAL closure and adds comprehensive unit tests to verify repair behavior and ensure that genuine corruption still results in errors. Feedback was provided to refine the error handling logic in loadWalState to more explicitly return nil for the state object when an error occurs, improving clarity.

Comment thread internal/raftengine/etcd/wal_store.go Outdated
// truncates that partial record once and is idempotent, so the
// retry is bounded.
ds, err := tryLoadWalState(logger, walDir, snapDir, fsmSnapDir, fsm)
if err == nil || !errors.Is(err, io.ErrUnexpectedEOF) {

medium

The check !errors.Is(err, io.ErrUnexpectedEOF) correctly identifies torn trailing records. However, if tryLoadWalState returns a different error (e.g., a CRC mismatch or permission error), this condition will return nil, err. Since ds is likely nil when err != nil, it is safer to explicitly return nil, err to avoid returning a typed nil interface if ds were ever non-nil in error cases.

if err == nil {
	return ds, nil
}
if !errors.Is(err, io.ErrUnexpectedEOF) {
	return nil, err
}

Codex P2 + Gemini Medium: the round-2 refactor lifted the repair
retry around tryLoadWalState, which also runs restoreSnapshotState
(FSM snapshot decode). A truncated .fsm payload would surface as
io.ErrUnexpectedEOF and incorrectly trigger wal.Repair against a
perfectly-good WAL. The FSM snapshotter has its own on-disk CRC
footer; wal.Repair does nothing for that failure mode.

Split the retry into two per-path helpers:
- loadPersistedSnapshotWithRepair: wraps loadPersistedSnapshot
  (scans WAL via wal.ValidSnapshotEntries).
- openAndReadWALWithRepair: wraps openAndReadWAL.

restoreSnapshotState now runs between the two and its errors
propagate without invoking repair.
@bootjp (Owner, Author) commented Apr 24, 2026

Round 3 in d3fa038:

  • Codex P2 + Gemini Medium: scoped wal.Repair to WAL-only reads. The round-2 refactor wrapped tryLoadWalState which also ran restoreSnapshotState; a truncated .fsm payload returning io.ErrUnexpectedEOF would have triggered wal.Repair on an otherwise-healthy WAL. Split into two per-path helpers (loadPersistedSnapshotWithRepair, openAndReadWALWithRepair); FSM restore runs between them and its errors propagate without invoking repair.

/gemini review
@codex review

@gemini-code-assist (Contributor) Bot left a comment

Code Review

This pull request implements an automatic WAL repair mechanism to handle partial trailing records caused by abrupt process termination. It introduces helper functions loadPersistedSnapshotWithRepair and openAndReadWALWithRepair that attempt to repair the WAL when an io.ErrUnexpectedEOF is encountered. Additionally, new tests simulate torn writes to verify the repair logic and ensure that actual data corruption still triggers errors. The review feedback suggests refactoring the snapshotter instantiation to avoid redundant allocations across these new helper functions.

Comment thread internal/raftengine/etcd/wal_store.go Outdated
Comment on lines +120 to +132
  snapshot, err := loadPersistedSnapshotWithRepair(logger, walDir, snapDir)
  if err != nil {
  	return nil, err
  }
  if err := restoreSnapshotState(fsm, snapshot, fsmSnapDir); err != nil {
  	return nil, err
  }

- w, err := wal.Open(logger, walDir, walSnapshotFor(snapshot))
- if err != nil {
- 	return nil, errors.WithStack(err)
- }
-
- _, hardState, entries, err := w.ReadAll()
+ w, hardState, entries, err := openAndReadWALWithRepair(logger, walDir, walSnapshotFor(snapshot))
  if err != nil {
- 	_ = w.Close()
- 	return nil, errors.WithStack(err)
+ 	return nil, err
  }
  snapshotter := snap.New(logger, snapDir)

medium

The snapshotter is now being instantiated twice: once inside loadPersistedSnapshotWithRepair and again here in loadWalState. It is more efficient and cleaner to create the snapshotter once at the beginning of loadWalState and pass it into the repair wrapper.

Suggested change
- snapshot, err := loadPersistedSnapshotWithRepair(logger, walDir, snapDir)
+ snapshotter := snap.New(logger, snapDir)
+ snapshot, err := loadPersistedSnapshotWithRepair(logger, walDir, snapshotter)
  if err != nil {
  	return nil, err
  }
  if err := restoreSnapshotState(fsm, snapshot, fsmSnapDir); err != nil {
  	return nil, err
  }
  w, hardState, entries, err := openAndReadWALWithRepair(logger, walDir, walSnapshotFor(snapshot))
  if err != nil {
  	return nil, err
  }
- snapshotter := snap.New(logger, snapDir)

Comment thread internal/raftengine/etcd/wal_store.go Outdated
Comment on lines +158 to +160
func loadPersistedSnapshotWithRepair(logger *zap.Logger, walDir, snapDir string) (raftpb.Snapshot, error) {
snapshotter := snap.New(logger, snapDir)
snapshot, err := loadPersistedSnapshot(logger, walDir, snapshotter)

medium

Refactor this function to accept an existing *snap.Snapshotter instead of creating a new one from snapDir. This avoids redundant allocations and aligns with the suggested change in loadWalState.

Suggested change
- func loadPersistedSnapshotWithRepair(logger *zap.Logger, walDir, snapDir string) (raftpb.Snapshot, error) {
- 	snapshotter := snap.New(logger, snapDir)
- 	snapshot, err := loadPersistedSnapshot(logger, walDir, snapshotter)
+ func loadPersistedSnapshotWithRepair(logger *zap.Logger, walDir string, snapshotter *snap.Snapshotter) (raftpb.Snapshot, error) {
+ 	snapshot, err := loadPersistedSnapshot(logger, walDir, snapshotter)

@chatgpt-codex-connector Bot commented

Codex Review: Didn't find any major issues. 🎉


- Gemini Medium (wal_store.go:132 and :160): loadWalState and
  loadPersistedSnapshotWithRepair both instantiated snap.New from the
  same snapDir. Create the snapshotter once in loadWalState and pass
  the pointer down so the helper does not re-allocate.
@bootjp (Owner, Author) commented Apr 24, 2026

Round 4 in 261b47b:

  • Gemini Medium (wal_store.go:132 and :160): shared the snapshotter across loadWalState and loadPersistedSnapshotWithRepair — the helper now takes *snap.Snapshotter from the caller instead of instantiating its own. One allocation per open instead of two.

Build + go test ./internal/raftengine/etcd/... pass; golangci-lint clean on the changed package (the pre-existing goconst warning on wal_purge.go:140 is unrelated to this PR).

No other actionable feedback this round.

/gemini review
@codex review

@chatgpt-codex-connector Bot commented

Codex usage limits for code reviews reached; no Codex review this round.

@gemini-code-assist (Contributor) Bot left a comment

Code Review

This pull request introduces an automatic WAL repair mechanism to handle io.ErrUnexpectedEOF errors, which typically occur when a process is terminated during a write operation. The changes include new helper functions loadPersistedSnapshotWithRepair and openAndReadWALWithRepair that attempt a single wal.Repair before retrying. Additionally, unit tests were added to verify that truncated tails are correctly repaired while genuine CRC mismatches still result in errors. I have no feedback to provide.

bootjp added a commit that referenced this pull request Apr 24, 2026
…#617)

## Summary
Add two OOM-defense defaults to `scripts/rolling-update.sh`:

- `GOMEMLIMIT=1800MiB` (via new `DEFAULT_EXTRA_ENV`, merged into the
existing `EXTRA_ENV` plumbing)
- `--memory=2500m` on the remote `docker run` (via new
`CONTAINER_MEMORY_LIMIT`)

Both are env-var-controlled with empty-string opt-out (`${VAR-default}`
so unset uses the default, but an explicit empty string disables it).

## Motivation
2026-04-24 incident: all 4 live nodes were kernel-OOM-SIGKILLed 22-169
times in 24h under a traffic spike. Each kill risked WAL-tail truncation
and triggered election storms, cascading into p99 GET spikes to 6-8s.
The runtime defense was applied by hand during the incident; this PR
makes it the script default so future rollouts inherit it.

- `GOMEMLIMIT` — Go runtime GCs aggressively as heap approaches the
limit, keeping RSS below the container ceiling.
- `--memory` (cgroup hard limit) — if Go can't keep up (e.g. non-heap
growth), the kill is scoped to the container, not host processes like
`qemu-guest-agent` or `systemd`.

## Behavior changes

| Variable | Default | Opt-out |
|----------|---------|---------|
| `DEFAULT_EXTRA_ENV` | `GOMEMLIMIT=1800MiB` | `DEFAULT_EXTRA_ENV=""` |
| `CONTAINER_MEMORY_LIMIT`| `2500m` | `CONTAINER_MEMORY_LIMIT=""` |

Operator-supplied `EXTRA_ENV` keys override matching keys in
`DEFAULT_EXTRA_ENV` (e.g., `EXTRA_ENV="GOMEMLIMIT=3000MiB"` wins over
the default).
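The `${VAR-default}` opt-out and the key-wise override merge described above can be sketched in shell (variable names match the PR; the merge loop is illustrative, not a quote of `rolling-update.sh`):

```shell
# Defaults with empty-string opt-out: unset -> default, explicit "" -> disabled.
# (Note ${VAR-default}, not ${VAR:-default}: the latter would also replace "".)
DEFAULT_EXTRA_ENV=${DEFAULT_EXTRA_ENV-GOMEMLIMIT=1800MiB}
CONTAINER_MEMORY_LIMIT=${CONTAINER_MEMORY_LIMIT-2500m}

# Operator-supplied EXTRA_ENV keys win over DEFAULT_EXTRA_ENV: a default is
# dropped when EXTRA_ENV already sets the same KEY=.
merged=$EXTRA_ENV
for def in $DEFAULT_EXTRA_ENV; do
  case " $EXTRA_ENV " in
    *" ${def%%=*}="*) ;;         # key overridden by operator
    *) merged="$merged $def" ;;  # default survives
  esac
done

mem_flag=""
[ -n "$CONTAINER_MEMORY_LIMIT" ] && mem_flag="--memory=$CONTAINER_MEMORY_LIMIT"
```

With `EXTRA_ENV="GOMEMLIMIT=3000MiB"` the loop skips the default, so the operator value wins, as the PR describes.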

## Related
Companion PRs (defense-in-depth):
- #612 `memwatch` — graceful shutdown before kernel OOM (prevents WAL
corruption in the first place)
- #613 WAL auto-repair — recovers on startup when the above fails
- #616 rolling-update via GitHub Actions over Tailscale — consumes this
script

## Test plan
- [x] `bash -n scripts/rolling-update.sh` passes
- [x] Deployed equivalents manually on all 4 live nodes during the
incident (2026-04-24T07:44Z - 07:46Z); no OOM recurrence since
- [ ] Next rolling-update invocation should produce `docker run ...
--memory=2500m ... -e GOMEMLIMIT=1800MiB ...` on each node

## Design doc reference
`docs/design/2026_04_24_proposed_resilience_roadmap.md` (item 1 —
capacity/runtime defenses).
bootjp added a commit that referenced this pull request Apr 24, 2026
… starvation (#619)

## Summary
Design doc (only — no code in this PR) for a four-layer
workload-isolation model, prompted by the 2026-04-24 incident's
afternoon phase.

**Problem:** Today, one client host with 37 connections running a tight
XREAD loop consumed 14 CPU cores on the leader via `loadStreamAt →
unmarshalStreamValue → proto.Unmarshal` (81% of CPU per pprof). Raft
goroutines couldn't get CPU → step_queue_full = 75,692 on the leader (vs
0-119 on followers) → Raft commit p99 jumped to 6-10s, Lua p99 stuck at
6-8s. Follower replication was healthy (applied-index within 34 of
leader); the damage was entirely CPU-scheduling on the leader.

**Gap:** elastickv has no explicit workload-class isolation. Go's
scheduler treats every goroutine equally; a single heavy command path
can starve unrelated paths (raft, lease, Lua, GET/SET).

## Four-layer defense model
- **Layer 1 — heavy-command worker pool**: gate XREAD / KEYS / SCAN /
Lua onto a bounded pool (~`2 × GOMAXPROCS`); reply `-BUSY` when full.
Cheap commands keep their own fast path.
- **Layer 2 — locked OS threads for raft**: `runtime.LockOSThread()` on
the Ready loop + dispatcher lanes so the Go scheduler can't starve them.
**Not v1** — only if measurement after Layer 1 + 4 still shows
`step_queue_full > 0`.
- **Layer 3 — per-client admission control**: per-peer-IP connection cap
(default 8). Extends, doesn't replace, roadmap item 6's global in-flight
semaphore.
- **Layer 4 — XREAD O(N) → O(new)**: entry-per-key layout
(`!redis|stream|<key>|entry|<id>`) with range-scan, dual-read migration
fallback, legacy-removal gated on
`elastickv_stream_legacy_format_reads_total == 0`. Hashes/sets/zsets
share the same one-blob pattern and are called out as follow-up.

## Recommended sequencing
Layer 4 (correctness bug, concentrated change) → Layer 1 (generic
defense for next unknown hotspot) → Layer 3 (reconcile with roadmap item
6) → Layer 2 (only if forced by measurement).

## Relationship to other in-flight work
- Complements (does not replace)
`docs/design/2026_04_24_proposed_resilience_roadmap.md` item 6
(admission control). This doc's Layer 3 focuses on per-client fairness;
the roadmap's item 6 is global in-flight capping. Both are needed.
- Consistent with memwatch (#612): Layer 3 admission threshold should
fire **before** memwatch's shutdown threshold — flagged as an open
question in the doc.
- Assumes WAL auto-repair (#613), GOMEMLIMIT defaults (#617) are landed
so the cluster survives long enough to matter.

## Open questions called out in the doc
- Static vs dynamic command classification (Layer 1)
- `-BUSY` backoff semantics — how do we avoid client retry spinning
becoming the new hot loop?
- Number of locked OS threads on variable-core hosts (Layer 2)
- Stream migration soak window before removing legacy-format fallback
(Layer 4, currently 30 days, arbitrary)

## Deliverable
`docs/design/2026_04_24_proposed_workload_isolation.md` — 446 lines,
dated-prefix / `**Status: Proposed**` convention matching the rest of
`docs/design/`. No code.

## Test plan
- [x] File paths and function references in the doc spot-checked against
`origin/main`
- [x] Cross-references to `2026_04_24_proposed_resilience_roadmap.md`
reconciled (complements, doesn't duplicate)
- [ ] Design review — decide on the open questions before implementing
Layer 4 (which blocks Layer 1 on XREAD specifically)
@bootjp bootjp merged commit 2531aa8 into main Apr 24, 2026
8 checks passed
@bootjp bootjp deleted the ops/wal-auto-repair branch April 24, 2026 17:11