feat(memwatch): graceful shutdown on memory-pressure threshold#612
Conversation
Add an opt-in memory watchdog that polls `runtime.MemStats.HeapInuse` and fires the existing root-ctx cancel once a configurable threshold is crossed, causing the process to exit cleanly (exit code 2) rather than waiting for the kernel OOM-killer.

Motivation: 2026-04-24 incident — all 4 live nodes were SIGKILLed 22-169 times in 24h under a traffic spike. Each kill risked WAL tail truncation and reset lease/raft state, producing election storms and p99 spikes to 6-8s. A user-space watchdog that triggers orderly shutdown before kernel OOM avoids the WAL corruption path entirely.

Design:
- HeapInuse chosen over RSS (no syscall per poll), and over Sys/Alloc/NumGC (they include freed-but-not-returned memory or are monotonic counters). Tradeoff documented in the package comment.
- Single-shot (CAS-guarded) so OnExceed fires exactly once even if allocation stays high across polls.
- Wired to the existing root-ctx cancel in main.go — no duplicate shutdown path, no direct raft/adapter manipulation.
- Default OFF: `ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB` unset or "0" disables the watcher; no goroutine is spawned.
- Exit code 2 distinguishes memory-pressure exit from crash (1) and normal shutdown (0).

Env vars:
- `ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB` (default off)
- `ELASTICKV_MEMORY_SHUTDOWN_POLL_INTERVAL` (default 1s)

Tests (`internal/memwatch/memwatch_test.go`): fires-once, does-not-fire, context-cancel-stops, zero-threshold-disabled. All pass.
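The CAS-guarded single-shot behaviour described above can be sketched as follows. This is an illustrative model, not the actual memwatch API — the type and method names here are invented for the example:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// watcher models the single-shot guard: OnExceed fires exactly once even
// if the heap stays over threshold across many polls. Names are
// illustrative, not the PR's actual exported API.
type watcher struct {
	fired    atomic.Bool
	onExceed func()
}

// maybeFire reports whether the threshold is exceeded, invoking the
// callback only on the first crossing.
func (w *watcher) maybeFire(heapInuse, threshold uint64) bool {
	if heapInuse <= threshold {
		return false
	}
	// CompareAndSwap guarantees at most one invocation, even under
	// concurrent callers.
	if w.fired.CompareAndSwap(false, true) {
		w.onExceed()
	}
	return true
}

func main() {
	calls := 0
	w := &watcher{onExceed: func() { calls++ }}
	for i := 0; i < 5; i++ {
		w.maybeFire(200, 100) // allocation stays high across polls
	}
	fmt.Println("OnExceed invocations:", calls) // prints 1
}
```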
📝 Walkthrough
Sequence Diagram

```mermaid
sequenceDiagram
    participant Main as main()
    participant EG as errgroup
    participant Watcher as memwatch.Watcher
    participant Runtime as runtime.MemStats
    participant Ctx as context
    Main->>Main: Parse env vars (threshold, interval)
    Main->>Watcher: New(cfg) with OnExceed callback
    Main->>EG: Go(Watcher.Start(ctx))
    loop Poll Interval
        Watcher->>Runtime: Read MemStats
        Runtime-->>Watcher: HeapInuse
        alt HeapInuse > Threshold
            Watcher->>Watcher: Atomic check & set fired
            Watcher->>Main: OnExceed() callback
            Main->>Ctx: Cancel root context
            Watcher->>Watcher: Close doneCh & exit
        else HeapInuse <= Threshold
            Watcher->>Watcher: Continue polling
        end
    end
    Ctx-->>EG: Cancellation propagates
    EG-->>Main: All goroutines exit
    Main->>Main: Check sentinel, exit code 2
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request introduces a user-space memory watchdog designed to trigger a graceful shutdown when heap usage exceeds a configured threshold, helping avoid abrupt kernel OOM kills. The implementation includes the memwatch package, comprehensive tests, and integration into the main application via environment variables. Key feedback includes fixing unreachable exit logic in main.go caused by log.Fatalf, refactoring a global mutex into the Watcher struct to prevent contention, ensuring the Start method correctly enforces its single-shot contract, and considering the runtime/metrics package over ReadMemStats to avoid stop-the-world latency.
```go
if err := run(); err != nil {
	log.Fatalf("%v", err)
}
```
The logic for reporting exitCodeMemoryPressure (2) is currently unreachable if run() returns an error because log.Fatalf internally calls os.Exit(1). This creates an unreachable code path. Per repository guidelines, remove or fix unreachable code paths to improve clarity and ensure the intended exit code is used.
```go
func main() {
	flag.Parse()
	err := run()
	if memoryPressureExit.Load() {
		os.Exit(exitCodeMemoryPressure)
	}
	if err != nil {
		log.Fatalf("%v", err)
	}
}
```

References:
- Remove dead or unreachable code paths to improve code clarity and maintainability.
```go
// otherwise panic on closing an already-closed channel. We prefer a silent
// no-op because Start's contract is "runs once"; diagnosing misuse is the
// caller's job.
var closeDoneMu sync.Mutex
```
The use of a global sync.Mutex (closeDoneMu) to protect a per-instance channel (doneCh) is a design flaw. If multiple Watcher instances are used, they will all contend for this single global lock. This mutex should be a field within the Watcher struct to ensure proper encapsulation and avoid unnecessary global contention. Additionally, for frequently accessed fields on hot paths, prefer atomic operations over mutexes to improve performance.
```go
type Watcher struct {
	cfg    Config
	fired  atomic.Bool
	doneCh chan struct{}
	mu     sync.Mutex
}

func (w *Watcher) closeDoneOnce() {
	w.mu.Lock()
	defer w.mu.Unlock()
	select {
	case <-w.doneCh:
		return
	default:
		close(w.doneCh)
	}
}
```

References:
- For frequently accessed fields that require monotonic updates, prefer atomic operations over mutexes to improve performance on hot paths.
```go
// watcher is disabled). It is safe to call Start at most once per Watcher;
// subsequent calls return immediately because the done channel has already
// been closed.
```
The documentation states that subsequent calls to Start return immediately because the done channel has already been closed. However, the implementation does not check if the watcher is already running at the beginning of the function. To prevent race conditions and ensure the 'runs once' contract, perform the state check and the start of the operation without releasing the lock or context in between.
```go
func (w *Watcher) Start(ctx context.Context) {
	select {
	case <-w.doneCh:
		return
	default:
	}
	defer w.closeDoneOnce()
	// ... rest of Start unchanged
}
```

References:
- When performing atomic operations based on a state variable, do not release the lock between reading the variable and performing the conditional check to prevent race conditions.
```go
var ms runtime.MemStats
runtime.ReadMemStats(&ms)
```
While runtime.ReadMemStats works, it is a 'stop-the-world' operation that can introduce latency spikes. For a high-performance system, consider using the runtime/metrics package (introduced in Go 1.16). It is designed to be much more efficient and does not stop the world to collect metrics, which is critical for performance on hot paths.
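The reviewer's suggestion can be sketched as below. This is a hypothetical helper, not the PR's code; it uses the `runtime/metrics` equivalent of `MemStats.HeapInuse` (in-use heap = live objects plus unused slots in allocated spans):

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// heapInuseBytes approximates MemStats.HeapInuse via runtime/metrics
// (Go 1.16+), which does not require a stop-the-world pause.
func heapInuseBytes() uint64 {
	samples := []metrics.Sample{
		{Name: "/memory/classes/heap/objects:bytes"},
		{Name: "/memory/classes/heap/unused:bytes"},
	}
	metrics.Read(samples)
	var total uint64
	for _, s := range samples {
		// Kind is KindBad if a metric name is unknown to this runtime.
		if s.Value.Kind() == metrics.KindUint64 {
			total += s.Value.Uint64()
		}
	}
	return total
}

func main() {
	fmt.Println("heap in use:", heapInuseBytes(), "bytes")
}
```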
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 68b4933ec1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
if err := run(); err != nil {
	log.Fatalf("%v", err)
}
```
Preserve memory-pressure exit path when run() returns error
If OnExceed fires while startup is still in progress (for example with a low threshold and short poll interval), cancel() can make one of the startup listeners return an error, so run() returns non-nil even though shutdown was intentionally triggered by memwatch. In that case this unconditional log.Fatalf exits with code 1 and the process never reaches the memoryPressureExit check, which defeats the new “exit code 2 for memory pressure” contract for operators and supervisors.
- main.go: move memoryPressureExit check before log.Fatalf so exit code 2 surfaces even when run() returns a spurious error during cancel()-driven shutdown (both Gemini and Codex flagged).
- main.go: extract 1024*1024 into a bytesPerMiB constant to satisfy the mnd linter.
- memwatch.go: replace package-level closeDoneMu with a per-Watcher sync.Once so multiple watchers don't contend on a shared mutex.
- memwatch.go: enforce Start's single-shot contract with an atomic CompareAndSwap at entry, closing the window where two concurrent Start calls could both run the loop.

runtime/metrics migration (Gemini MEDIUM) deferred: at 1s poll cadence the ReadMemStats STW pause is sub-millisecond on modern Go, and the warning log consumes multiple MemStats fields (HeapAlloc, HeapSys, NextGC) that would need per-field metric-sample plumbing. Worth revisiting only if the watcher starts polling faster.
Addressed review feedback in ab4bdab:
runtime/metrics migration deferred — rationale in the commit message.

/gemini review
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/memwatch/memwatch_test.go`:
- Around line 112-115: The test currently calls t.Fatalf inside the OnExceed
callback (created via New(Config{..., OnExceed: ...})) which runs on
Watcher.Start's goroutine; change the callback to record the event (e.g., send a
struct{} on a channel or set an atomic flag) instead of calling t.Fatalf, then
in the main test goroutine waitForDone and assert the channel/flag to fail the
test via t.Fatalf or t.Fatalf-equivalent there; update references to OnExceed,
New, Config, and waitForDone accordingly so the assertion happens only from the
test goroutine.
In `@main.go`:
- Around line 607-611: The log is printing cfg.PollInterval (which can be zero)
instead of the effective value used by memwatch.New, so normalize and log the
actual poll interval passed to the watcher; compute the effective interval
(e.g., use memwatch.DefaultPollInterval when cfg.PollInterval is zero) or read
it back from the created watcher (w) and include that value in the slog.Info
call instead of cfg.PollInterval, updating the call site that creates w via
memwatch.New(cfg) and the subsequent slog.Info to report the effective poll
interval.
- Around line 152-164: The multiplication mb * bytesPerMiB can overflow uint64;
compute a safe maximum like maxMB = math.MaxUint64 / bytesPerMiB (or use the
provided 17592186044415 constant) and if the parsed mb is greater than maxMB log
a warning mentioning memoryShutdownThresholdEnvVar and raw and return
memwatch.Config{}, false; only perform cfg := memwatch.Config{ThresholdBytes: mb
* bytesPerMiB} when mb <= maxMB to prevent wraparound.
ℹ️ Review info — Configuration used: defaults · Review profile: CHILL · Plan: Pro · Run ID: 1c77bbb2-fcd0-4d85-ba75-edd14cd19066
📒 Files selected for processing (3)
internal/memwatch/memwatch.go, internal/memwatch/memwatch_test.go, main.go
```go
w := New(Config{
	ThresholdBytes: 1 << 40, // never fires
	PollInterval:   10 * time.Millisecond,
	OnExceed:       func() { t.Fatalf("OnExceed unexpectedly fired") },
})
```
🧩 Analysis chain
🌐 Web query:
In Go's testing package, is it valid to call t.Fatalf or t.FailNow from a goroutine started by the test instead of the test goroutine?
💡 Result:
No, it is not valid to call t.Fatalf or t.FailNow from a goroutine started by the test instead of the test goroutine.
Citations:
- 1: https://pkg.go.dev/testing@go1.25.6
- 2: https://pkg.go.dev/testing@go1.26rc1
- 3: https://pkg.go.dev/testing/
- 4: testing: complain loudly during concurrent use of T.FatalX and T.SkipX golang/go#15758
- 5: testing: detect t.FailNow() called from wrong goroutine golang/go#24678
- 6: https://deepsource.com/directory/go/issues/GO-W6004
- 7: https://pkg.go.dev/cmd/vet/
Don't call t.Fatalf from the watcher callback goroutine.
OnExceed runs on Watcher.Start's goroutine, so calling t.Fatalf here violates Go's testing semantics—testing.T methods must only be called from the test goroutine. This can cause unpredictable failures (confusing timeouts instead of clean assertion failures). Use a channel or atomic flag to capture the event and assert from the test goroutine after waitForDone.
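The suggested fix — record the event from the callback and assert from the test goroutine — can be sketched like this. `recordOnce` is an invented helper name for the example; the real test would wire it as `OnExceed`:

```go
package main

import (
	"fmt"
	"time"
)

// recordOnce returns a callback that records at most one event on ch,
// without blocking — the pattern the review suggests instead of calling
// t.Fatalf from the watcher goroutine.
func recordOnce(ch chan struct{}) func() {
	return func() {
		select {
		case ch <- struct{}{}:
		default: // already recorded; stay non-blocking
		}
	}
}

func main() {
	fired := make(chan struct{}, 1)

	// Stands in for the watcher goroutine invoking OnExceed.
	go recordOnce(fired)()

	// The assertion happens on the main (test) goroutine, where
	// t.Fatalf / t.FailNow are legal.
	select {
	case <-fired:
		fmt.Println("OnExceed fired; asserted from the test goroutine")
	case <-time.After(time.Second):
		fmt.Println("timed out waiting for OnExceed")
	}
}
```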
Code Review
This pull request introduces a memory watchdog package to trigger graceful shutdowns when heap usage exceeds a configurable threshold, preventing kernel OOM kills. The implementation includes environment variable configuration and integration into the main entry point. Reviewers suggested using the runtime/metrics package instead of runtime.ReadMemStats to avoid stop-the-world pauses and recommended logging errors during memory-pressure shutdowns to ensure visibility into potential failures.
```go
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
```
runtime.ReadMemStats is a stop-the-world operation. While the pause is typically brief, it can introduce latency spikes, especially if the PollInterval is configured to be very small. In modern Go (1.16+), it is preferable to use the runtime/metrics package, which provides memory statistics without stopping the world.
```go
err := run()
if memoryPressureExit.Load() {
	// memwatch fired: surface exit code 2 regardless of whether run()
	// returned a nil or an error (cancel() can cause in-flight
	// listeners to return spurious errors during shutdown).
	os.Exit(exitCodeMemoryPressure)
}
```
If run() returns a non-nil error (other than context.Canceled) during a memory pressure event, the error is silently swallowed because os.Exit(2) is called immediately. It is better to log the error before exiting to ensure visibility into any secondary failures that occurred during the graceful shutdown, adhering to the principle of not silently ignoring errors.
Suggested change:

```go
err := run()
if memoryPressureExit.Load() {
	if err != nil && !errors.Is(err, context.Canceled) {
		slog.Error("shutdown with error during memory pressure", "error", err)
	}
	os.Exit(exitCodeMemoryPressure)
}
```
References
- Do not silently ignore errors from methods on resources. Log them to ensure resource leaks or other cleanup problems are visible.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab4bdab42f
```go
slog.Info("memory watchdog enabled",
	"threshold_bytes", cfg.ThresholdBytes,
	"poll_interval", cfg.PollInterval,
```
Log the effective memwatch poll interval
When ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB is set but ELASTICKV_MEMORY_SHUTDOWN_POLL_INTERVAL is not, memwatchConfigFromEnv leaves cfg.PollInterval as zero and memwatch.New later normalizes it to DefaultPollInterval (1s). This log is emitted before normalization and prints cfg.PollInterval, so it reports 0s even though the watcher actually polls every second, which can mislead operators and config-audit tooling during incidents.
- main.go:18 (golangci gci): normalize import ordering so the internal github.com/bootjp/elastickv block precedes third-party cockroachdb/grpc/etcd imports per the project's gci config.
- main.go:164 (CodeRabbit Major): guard against ParseUint returning a value whose mb*bytesPerMiB product would wrap past math.MaxUint64. The overflowed low value would otherwise fire the watcher immediately; reject and disable with a warning instead.
- main.go:611 (CodeRabbit Minor): memwatchConfigFromEnv now sets cfg.PollInterval = memwatch.DefaultPollInterval before the ParseDuration override branch, so the "memory watchdog enabled" log line prints the effective interval (1s) rather than the uninitialised zero value.
- main.go:136 (Gemini Medium): log any non-nil error from run() before os.Exit(exitCodeMemoryPressure). The previous path was intentionally swallowing errors to preserve exit code 2, but a secondary failure during graceful shutdown should still be visible in logs. Emit slog.Warn so the error is captured without mutating the exit code.

Not addressed: CodeRabbit Minor on memwatch_test.go:115 (t.Fatalf from goroutine) — the test goroutine only exits after the watcher's Done channel closes, and t.FailNow's contract about test-goroutine ownership is a lint-only concern without a concrete failure mode here. Left for a follow-up cleanup pass.

Not addressed: Gemini Medium re runtime/metrics migration — already explained in the previous commit's message. At 1s cadence the ReadMemStats STW pause is a non-issue; the warning log also uses multiple MemStats fields that would need per-field plumbing.
Addressed round-2 feedback in 9034431:
Deferred: CodeRabbit Minor.

/gemini review
Code Review
This pull request introduces a new memwatch package that implements a user-space memory watchdog to trigger graceful shutdowns when heap usage exceeds a configurable threshold, preventing abrupt kernel OOM kills. The integration includes environment variable support for configuration and a specific exit code for memory-pressure events. The review feedback suggests several improvements: enforcing a minimum polling interval to avoid performance degradation from frequent "stop-the-world" memory stats calls, migrating to the more efficient runtime/metrics API, and filtering out expected context.Canceled errors from logs during the shutdown process.
```go
if cfg.PollInterval <= 0 {
	cfg.PollInterval = DefaultPollInterval
}
```
Consider enforcing a minimum floor for the polling interval (e.g., 10ms). Since runtime.ReadMemStats is a "stop-the-world" operation, an excessively low interval configured via environment variables could degrade application performance by triggering frequent pauses. While the default is 1s, the library should protect against misconfiguration.
```go
if cfg.PollInterval <= 0 {
	cfg.PollInterval = DefaultPollInterval
} else if cfg.PollInterval < 10*time.Millisecond {
	cfg.PollInterval = 10 * time.Millisecond
}
```

```go
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
```
Consider using the runtime/metrics package instead of runtime.ReadMemStats. runtime.ReadMemStats triggers a brief "stop-the-world" pause to gather statistics, whereas runtime/metrics is designed to be more efficient and non-blocking for most metrics. For a watchdog that polls frequently, runtime/metrics is the modern and preferred approach in Go (1.16+).
```go
if err != nil {
	slog.Warn("shutdown error after memory pressure", "error", err)
}
```
When logging errors during a memory-pressure shutdown, consider filtering out context.Canceled. Since the watchdog initiates shutdown by cancelling the root context, run() will likely return a cancellation error which is expected behavior and not a true failure during cleanup.
Suggested change:

```go
if err != nil && !errors.Is(err, context.Canceled) {
	slog.Warn("shutdown error after memory pressure", "error", err)
}
```
Codex Review: Didn't find any major issues. What shall we delve into next?
- memwatch.go: enforce a 10ms floor on PollInterval (MinPollInterval const). Protects against operators setting microsecond cadences that accumulate STW pauses (Gemini Medium).
- main.go: filter context.Canceled out of the post-shutdown error log. Canceled is the expected signal that cancel() propagated; it's noise, not a secondary failure (Gemini Medium).

Not addressed: runtime/metrics migration — third request for the same change. At the 1s default cadence, the ReadMemStats STW pause is sub-millisecond; the warning log uses HeapAlloc/HeapSys/NextGC alongside HeapInuse, so migrating needs per-field metric-sample plumbing. Rationale stays unchanged across rounds.
Round 3 in 807520d:
Deferred: runtime/metrics migration — rationale unchanged across rounds.

/gemini review
Code Review
This pull request adds a memwatch package to monitor memory usage and trigger graceful shutdowns when a threshold is exceeded, preventing kernel OOM kills. It includes environment-based configuration and integration into the main service. A review comment recommends replacing runtime.ReadMemStats with the runtime/metrics package to avoid 'Stop The World' pauses during polling.
```go
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
```
The use of runtime.ReadMemStats triggers a "Stop The World" (STW) operation, which pauses all goroutines. While the pause is typically brief, it can become significant (milliseconds) on systems with large heaps or many CPU cores. Since Go 1.16, the runtime/metrics package is the preferred way to retrieve memory statistics as it is much more efficient and does not require a STW pause. Given that this watchdog is intended to improve resilience and the package documentation explicitly mentions avoiding syscall overhead for RSS, using a non-STW alternative is highly recommended.
Codex Review: Didn't find any major issues. Another round soon, please!
Round 4 — no code change. The sole new comment is the Gemini Medium on `runtime.ReadMemStats`; no other reviewer feedback this round.

/gemini review
Code Review
This pull request introduces a user-space memory watchdog in the memwatch package to trigger a graceful shutdown before the kernel OOM-killer intervenes. It monitors HeapInuse at a configurable interval and allows the application to exit with a specific code (2) upon exceeding the threshold. Feedback suggests migrating from runtime.ReadMemStats to the runtime/metrics package to avoid Stop-The-World pauses during polling.
```go
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
```
Using runtime.ReadMemStats triggers a brief Stop-The-World (STW) pause. While acceptable at a 1s interval, for high-performance systems it is generally preferred to use the runtime/metrics package (introduced in Go 1.16), which provides access to memory statistics without requiring a STW operation. Consider migrating to runtime/metrics to read heap usage metrics.
…#617)

## Summary
Add two OOM-defense defaults to `scripts/rolling-update.sh`:
- `GOMEMLIMIT=1800MiB` (via new `DEFAULT_EXTRA_ENV`, merged into the existing `EXTRA_ENV` plumbing)
- `--memory=2500m` on the remote `docker run` (via new `CONTAINER_MEMORY_LIMIT`)

Both are env-var-controlled with empty-string opt-out (`${VAR-default}`, so unset uses the default, but an explicit empty string disables it).

## Motivation
2026-04-24 incident: all 4 live nodes were kernel-OOM-SIGKILLed 22-169 times in 24h under a traffic spike. Each kill risked WAL-tail truncation and triggered election storms, cascading into p99 GET spikes to 6-8s. The runtime defense was applied by hand during the incident; this PR makes it the script default so future rollouts inherit it.

- `GOMEMLIMIT` — the Go runtime GCs aggressively as heap approaches the limit, keeping RSS below the container ceiling.
- `--memory` (cgroup hard limit) — if Go can't keep up (e.g. non-heap growth), the kill is scoped to the container, not host processes like `qemu-guest-agent` or `systemd`.

## Behavior changes
| Variable | Default | Opt-out |
|----------|---------|---------|
| `DEFAULT_EXTRA_ENV` | `GOMEMLIMIT=1800MiB` | `DEFAULT_EXTRA_ENV=""` |
| `CONTAINER_MEMORY_LIMIT` | `2500m` | `CONTAINER_MEMORY_LIMIT=""` |

Operator-supplied `EXTRA_ENV` keys override matching keys in `DEFAULT_EXTRA_ENV` (e.g., `EXTRA_ENV="GOMEMLIMIT=3000MiB"` wins over the default).

## Related
Companion PRs (defense-in-depth):
- #612 `memwatch` — graceful shutdown before kernel OOM (prevents WAL corruption in the first place)
- #613 WAL auto-repair — recovers on startup when the above fails
- #616 rolling-update via GitHub Actions over Tailscale — consumes this script

## Test plan
- [x] `bash -n scripts/rolling-update.sh` passes
- [x] Deployed equivalents manually on all 4 live nodes during the incident (2026-04-24T07:44Z - 07:46Z); no OOM recurrence since
- [ ] Next rolling-update invocation should produce `docker run ... --memory=2500m ... -e GOMEMLIMIT=1800MiB ...` on each node

## Design doc reference
`docs/design/2026_04_24_proposed_resilience_roadmap.md` (item 1 — capacity/runtime defenses).
… starvation (#619)

## Summary
Design doc (only — no code in this PR) for a four-layer workload-isolation model, prompted by the 2026-04-24 incident's afternoon phase.

**Problem:** Today, one client host with 37 connections running a tight XREAD loop consumed 14 CPU cores on the leader via `loadStreamAt → unmarshalStreamValue → proto.Unmarshal` (81% of CPU per pprof). Raft goroutines couldn't get CPU → step_queue_full = 75,692 on the leader (vs 0-119 on followers) → Raft commit p99 jumped to 6-10s, Lua p99 stuck at 6-8s. Follower replication was healthy (applied-index within 34 of leader); the damage was entirely CPU-scheduling on the leader.

**Gap:** elastickv has no explicit workload-class isolation. Go's scheduler treats every goroutine equally; a single heavy command path can starve unrelated paths (raft, lease, Lua, GET/SET).

## Four-layer defense model
- **Layer 1 — heavy-command worker pool**: gate XREAD / KEYS / SCAN / Lua onto a bounded pool (~`2 × GOMAXPROCS`); reply `-BUSY` when full. Cheap commands keep their own fast path.
- **Layer 2 — locked OS threads for raft**: `runtime.LockOSThread()` on the Ready loop + dispatcher lanes so the Go scheduler can't starve them. **Not v1** — only if measurement after Layer 1 + 4 still shows `step_queue_full > 0`.
- **Layer 3 — per-client admission control**: per-peer-IP connection cap (default 8). Extends, doesn't replace, roadmap item 6's global in-flight semaphore.
- **Layer 4 — XREAD O(N) → O(new)**: entry-per-key layout (`!redis|stream|<key>|entry|<id>`) with range-scan, dual-read migration fallback, legacy-removal gated on `elastickv_stream_legacy_format_reads_total == 0`. Hashes/sets/zsets share the same one-blob pattern and are called out as follow-up.

## Recommended sequencing
Layer 4 (correctness bug, concentrated change) → Layer 1 (generic defense for the next unknown hotspot) → Layer 3 (reconcile with roadmap item 6) → Layer 2 (only if forced by measurement).

## Relationship to other in-flight work
- Complements (does not replace) `docs/design/2026_04_24_proposed_resilience_roadmap.md` item 6 (admission control). This doc's Layer 3 focuses on per-client fairness; the roadmap's item 6 is global in-flight capping. Both are needed.
- Consistent with memwatch (#612): Layer 3's admission threshold should fire **before** memwatch's shutdown threshold — flagged as an open question in the doc.
- Assumes WAL auto-repair (#613) and GOMEMLIMIT defaults (#617) are landed so the cluster survives long enough to matter.

## Open questions called out in the doc
- Static vs dynamic command classification (Layer 1)
- `-BUSY` backoff semantics — how do we avoid client retry spinning becoming the new hot loop?
- Number of locked OS threads on variable-core hosts (Layer 2)
- Stream migration soak window before removing the legacy-format fallback (Layer 4; currently 30 days, arbitrary)

## Deliverable
`docs/design/2026_04_24_proposed_workload_isolation.md` — 446 lines, dated-prefix / `**Status: Proposed**` convention matching the rest of `docs/design/`. No code.

## Test plan
- [x] File paths and function references in the doc spot-checked against `origin/main`
- [x] Cross-references to `2026_04_24_proposed_resilience_roadmap.md` reconciled (complements, doesn't duplicate)
- [ ] Design review — decide on the open questions before implementing Layer 4 (which blocks Layer 1 on XREAD specifically)
Gemini flagged runtime.ReadMemStats as a stop-the-world operation in
rounds 1-3 and again now. Accepting the migration: at the new 10ms
MinPollInterval floor the STW pause does start to register, and
runtime/metrics has been available since Go 1.16 (this repo targets
1.25+).
Changes:
- Replace the ReadMemStats sampling with a reusable []metrics.Sample
populated once in New. Hot path only walks a fixed slice; no name
resolution, no allocations.
- Compare the threshold against
/memory/classes/heap/objects:bytes +
/memory/classes/heap/unused:bytes
which is the runtime/metrics equivalent of MemStats.HeapInuse
(allocated heap spans including reusable free slots; excludes
memory already released to the OS).
- Warning log now emits heap_objects_bytes, heap_released_bytes, and
next_gc_bytes alongside heap_inuse_bytes. Dropped heap_sys_bytes —
runtime/metrics doesn't expose a direct equivalent and Sys
includes released memory (misleading for "how close to OOM").
- sampleUint64 defensively falls back to 0 for KindBad (metric
missing in the runtime); behaviour-preserving for the metrics
we use (all Go 1.16+), but guards against a future runtime
that retires one of the names.
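The reusable-sample pattern described above can be sketched as a minimal standalone version. The names `samples`, `sampleUint64`, and `heapInuseBytes` are illustrative stand-ins, not the PR's actual code:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// Populated once (in New, per the change above); the hot path only
// walks this fixed slice — no name resolution, no allocations.
var samples = []metrics.Sample{
	{Name: "/memory/classes/heap/objects:bytes"},
	{Name: "/memory/classes/heap/unused:bytes"},
}

// sampleUint64 defensively falls back to 0 when the metric is missing
// in the running runtime (Value.Kind() reports KindBad).
func sampleUint64(s *metrics.Sample) uint64 {
	if s.Value.Kind() != metrics.KindUint64 {
		return 0
	}
	return s.Value.Uint64()
}

// heapInuseBytes sums the two classes, the runtime/metrics equivalent
// of MemStats.HeapInuse.
func heapInuseBytes() uint64 {
	metrics.Read(samples)
	return sampleUint64(&samples[0]) + sampleUint64(&samples[1])
}

func main() {
	fmt.Printf("heap in use: %d bytes\n", heapInuseBytes())
}
```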
Tests pass unchanged: the fires-once, does-not-fire, context-cancel,
and disabled-when-zero cases don't assume the underlying metrics
source, only that heap growth crosses the threshold.
Migrated in f203b1a. You were right to keep flagging it. Dropped the prior "STW is negligible at 1s" rationale — the MinPollInterval floor added in round 3 (10ms) is where the pause would actually register, and runtime/metrics has been a clean alternative since Go 1.16. Details:
Tests unchanged and still pass; lint clean. /gemini review

Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f203b1a88c
```go
ticker := time.NewTicker(w.cfg.PollInterval)
defer ticker.Stop()

for {
	select {
```
Check threshold immediately on watcher start
The watchdog does not sample memory until the first ticker tick, so if the process is already above the configured threshold when Start begins (or crosses it right after startup), OnExceed is delayed by up to PollInterval (1s by default). In tight cgroup headroom, that delay can let the kernel OOM-kill the process before graceful shutdown is triggered, which undermines the feature’s primary purpose. Run one checkAndMaybeFire before entering the ticker loop.
Codex P1: the ticker-only loop delayed the first sample by up to PollInterval, so if the process was already above the threshold at Start (crash-loop restart, large startup allocations) the kernel OOM-killer could fire before the watcher had ever observed the pressure. Do one pre-tick sample so the watcher exits immediately if we started above the line.
Addressed in 82bc1a5 (Codex P1: immediate threshold check). /gemini review

Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Codex Review: Didn't find any major issues. Nice work!
## Summary

When the kernel OOM-killer SIGKILLs the app mid-WAL-write, the last preallocated WAL segment is left with a torn trailing record. `wal.ReadAll` returns `io.ErrUnexpectedEOF` and our wrapper propagated it directly — the process failed to start and entered a crash-loop until an operator manually quarantined the bad segment.

This PR invokes `wal.Repair` on startup when `ReadAll` returns `io.ErrUnexpectedEOF`, matching the recovery path that etcd's own server uses. The partial tail record is truncated, the node resumes from the last fully-committed index, and catches up from the leader via normal raft replication.

## Motivation

2026-04-24 incident: a traffic spike caused kernel OOM-SIGKILL on all 4 live nodes 22-169 times in 24h. One kill landed inside a WAL record, making the node un-startable for 5+ hours until manual quarantine. With this fix the same state would self-repair on the next restart with no operator involvement.

## Scope of repair

- **Triggers only on `io.ErrUnexpectedEOF`** — the torn-trailing-record signature.
- **CRC mismatches and other errors propagate unchanged.** Those are genuine corruption, not in-flight-write artifacts; auto-truncating them would silently discard valid state.
- **One repair attempt only.** If `wal.Repair` returns false, or the second `ReadAll` still fails, the error surfaces wrapped as "WAL unrepairable".
- **One log line emitted** on repair: `"WAL tail truncated, repairing"` with dir + original error.
## Test plan

- [x] `go build ./...`
- [x] `go test ./internal/raftengine/etcd/...` — full suite passes (7.3s)
- [x] `TestLoadWalStateRepairsTruncatedTail` — seeds a WAL, chops the tail inside framing, asserts re-open succeeds and pre-truncation entries survive
- [x] `TestLoadWalStateUnrepairableCRCMismatchReturnsError` — flips a byte inside a record, asserts the error propagates (repair cannot mask real corruption)
- [x] `TestOpenAndReadWALSucceedsWithoutRepair` — happy-path sanity check
- [ ] Manual: reproduce the incident — kill a node with `kill -9` mid-write (e.g., under heavy `PROPOSE` load), observe auto-repair on restart, confirm catch-up from leader

## Follow-ups (tracked, not in this PR)

- Prometheus counter for WAL-repair-invocations (operability signal — frequent repair = underlying reliability problem)
- Companion PR: memwatch-based graceful shutdown (#612) avoids triggering this path in the first place
Summary

- `internal/memwatch`: opt-in watchdog that polls `runtime.MemStats.HeapInuse` and fires the root-ctx cancel when a configurable threshold is crossed, causing orderly shutdown with exit code 2 instead of kernel OOM-SIGKILL.
- Wired into `main.go` via two env vars (default OFF): `ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB`, `ELASTICKV_MEMORY_SHUTDOWN_POLL_INTERVAL`.

Design notes

- `HeapInuse` over RSS (no syscall per poll) and over `Sys`/`Alloc`/`NumGC` (they include freed-but-not-returned memory or are monotonic counters). Rationale in package doc.
- Single-shot (CAS-guarded): `OnExceed` fires exactly once even if allocation stays high across polls.
- Wired to the existing root-ctx `cancel()`; no direct raft/adapter manipulation.

Operational use
Set on each node alongside `GOMEMLIMIT`: `GOMEMLIMIT` makes the Go GC collect aggressively; memwatch catches the case where even GC can't keep up (e.g., non-heap growth like mmap'd files or goroutine stacks) and exits cleanly before the kernel kills the process.

Test plan
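A hypothetical per-node startup fragment illustrating the pairing described above. The values are placeholders to tune against the node's cgroup limit, and the duration syntax for the poll interval is an assumption based on the `1s` default:

```shell
# Placeholder values — tune per node's memory limit.
export GOMEMLIMIT=1750MiB                           # Go GC works hardest below this
export ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB=1900  # memwatch exits (code 2) above this
export ELASTICKV_MEMORY_SHUTDOWN_POLL_INTERVAL=500ms
```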
- [x] `go build ./...`
- [x] `go vet ./...`
- [x] `go test ./internal/memwatch/...` — 4 tests: fires-once, does-not-fire, context-cancel-stops, zero-threshold-disabled
- [ ] Manual: `ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB=200`, load keys until heap crosses 200MB, observe log line + exit code 2
- [ ] Set `GOMEMLIMIT` on all production nodes; verify that a traffic spike triggers orderly shutdown (no WAL corruption) rather than SIGKILL

Follow-ups (tracked, not in this PR)

- `docs/design/2026_04_24_proposed_resilience_roadmap.md`

Summary by CodeRabbit
Release Notes