feat(memwatch): graceful shutdown on memory-pressure threshold#612
Conversation
Add an opt-in memory watchdog that polls `runtime.MemStats.HeapInuse` and fires the existing root-ctx cancel once a configurable threshold is crossed, causing the process to exit cleanly (exit code 2) rather than waiting for the kernel OOM-killer.

Motivation: 2026-04-24 incident — all 4 live nodes were SIGKILLed 22-169 times in 24h under a traffic spike. Each kill risked WAL tail truncation and reset lease/raft state, producing election storms and p99 spikes to 6-8s. A user-space watchdog that triggers orderly shutdown before kernel OOM avoids the WAL corruption path entirely.

Design:
- HeapInuse chosen over RSS (no syscall per poll), and over Sys/Alloc/NumGC (they include freed-but-not-returned memory or are monotonic counters). Tradeoff documented in the package comment.
- Single-shot (CAS-guarded) so OnExceed fires exactly once even if allocation stays high across polls.
- Wired to the existing root-ctx cancel in main.go — no duplicate shutdown path, no direct raft/adapter manipulation.
- Default OFF: `ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB` unset or "0" disables the watcher; no goroutine is spawned.
- Exit code 2 distinguishes memory-pressure exit from crash (1) and normal shutdown (0).

Env vars:
- `ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB` (default off)
- `ELASTICKV_MEMORY_SHUTDOWN_POLL_INTERVAL` (default 1s)

Tests (`internal/memwatch/memwatch_test.go`): fires-once, does-not-fire, context-cancel-stops, zero-threshold-disabled. All pass.
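The CAS-guarded single-shot behaviour described above can be sketched as follows. This is an illustrative model, not the actual memwatch API — the type and method names here are invented for the example:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// watcher models the single-shot guard: OnExceed fires exactly once even
// if the heap stays over threshold across many polls. Names are
// illustrative, not the PR's actual exported API.
type watcher struct {
	fired    atomic.Bool
	onExceed func()
}

// maybeFire reports whether the threshold is exceeded, invoking the
// callback only on the first crossing.
func (w *watcher) maybeFire(heapInuse, threshold uint64) bool {
	if heapInuse <= threshold {
		return false
	}
	// CompareAndSwap guarantees at most one invocation, even under
	// concurrent callers.
	if w.fired.CompareAndSwap(false, true) {
		w.onExceed()
	}
	return true
}

func main() {
	calls := 0
	w := &watcher{onExceed: func() { calls++ }}
	for i := 0; i < 5; i++ {
		w.maybeFire(200, 100) // allocation stays high across polls
	}
	fmt.Println("OnExceed invocations:", calls) // prints 1
}
```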
📝 Walkthrough
Sequence Diagram

```mermaid
sequenceDiagram
    participant Main as main()
    participant EG as errgroup
    participant Watcher as memwatch.Watcher
    participant Runtime as runtime.MemStats
    participant Ctx as context
    Main->>Main: Parse env vars (threshold, interval)
    Main->>Watcher: New(cfg) with OnExceed callback
    Main->>EG: Go(Watcher.Start(ctx))
    loop Poll Interval
        Watcher->>Runtime: Read MemStats
        Runtime-->>Watcher: HeapInuse
        alt HeapInuse > Threshold
            Watcher->>Watcher: Atomic check & set fired
            Watcher->>Main: OnExceed() callback
            Main->>Ctx: Cancel root context
            Watcher->>Watcher: Close doneCh & exit
        else HeapInuse <= Threshold
            Watcher->>Watcher: Continue polling
        end
    end
    Ctx-->>EG: Cancellation propagates
    EG-->>Main: All goroutines exit
    Main->>Main: Check sentinel, exit code 2
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request introduces a user-space memory watchdog designed to trigger a graceful shutdown when heap usage exceeds a configured threshold, helping avoid abrupt kernel OOM kills. The implementation includes the memwatch package, comprehensive tests, and integration into the main application via environment variables. Key feedback includes fixing unreachable exit logic in main.go caused by log.Fatalf, refactoring a global mutex into the Watcher struct to prevent contention, ensuring the Start method correctly enforces its single-shot contract, and considering the runtime/metrics package over ReadMemStats to avoid stop-the-world latency.
```go
if err := run(); err != nil {
	log.Fatalf("%v", err)
}
```
The logic for reporting exitCodeMemoryPressure (2) is currently unreachable if run() returns an error because log.Fatalf internally calls os.Exit(1). This creates an unreachable code path. Per repository guidelines, remove or fix unreachable code paths to improve clarity and ensure the intended exit code is used.
```go
func main() {
	flag.Parse()
	err := run()
	if memoryPressureExit.Load() {
		os.Exit(exitCodeMemoryPressure)
	}
	if err != nil {
		log.Fatalf("%v", err)
	}
}
```

References:
- Remove dead or unreachable code paths to improve code clarity and maintainability.
```go
// otherwise panic on closing an already-closed channel. We prefer a silent
// no-op because Start's contract is "runs once"; diagnosing misuse is the
// caller's job.
var closeDoneMu sync.Mutex
```
The use of a global sync.Mutex (closeDoneMu) to protect a per-instance channel (doneCh) is a design flaw. If multiple Watcher instances are used, they will all contend for this single global lock. This mutex should be a field within the Watcher struct to ensure proper encapsulation and avoid unnecessary global contention. Additionally, for frequently accessed fields on hot paths, prefer atomic operations over mutexes to improve performance.
```go
type Watcher struct {
	cfg    Config
	fired  atomic.Bool
	doneCh chan struct{}
	mu     sync.Mutex
}

func (w *Watcher) closeDoneOnce() {
	w.mu.Lock()
	defer w.mu.Unlock()
	select {
	case <-w.doneCh:
		return
	default:
		close(w.doneCh)
	}
}
```

References:
- For frequently accessed fields that require monotonic updates, prefer atomic operations over mutexes to improve performance on hot paths.
```go
// watcher is disabled). It is safe to call Start at most once per Watcher;
// subsequent calls return immediately because the done channel has already
// been closed.
```
The documentation states that subsequent calls to Start return immediately because the done channel has already been closed. However, the implementation does not check if the watcher is already running at the beginning of the function. To prevent race conditions and ensure the 'runs once' contract, perform the state check and the start of the operation without releasing the lock or context in between.
```go
func (w *Watcher) Start(ctx context.Context) {
	select {
	case <-w.doneCh:
		return
	default:
	}
	defer w.closeDoneOnce()
	// ... rest of Start unchanged
}
```

References:
- When performing atomic operations based on a state variable, do not release the lock between reading the variable and performing the conditional check to prevent race conditions.
```go
var ms runtime.MemStats
runtime.ReadMemStats(&ms)
```
While runtime.ReadMemStats works, it is a 'stop-the-world' operation that can introduce latency spikes. For a high-performance system, consider using the runtime/metrics package (introduced in Go 1.16). It is designed to be much more efficient and does not stop the world to collect metrics, which is critical for performance on hot paths.
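The reviewer's suggestion can be sketched as below. This is a hypothetical helper, not the PR's code; it uses the `runtime/metrics` equivalent of `MemStats.HeapInuse` (in-use heap = live objects plus unused slots in allocated spans):

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// heapInuseBytes approximates MemStats.HeapInuse via runtime/metrics
// (Go 1.16+), which does not require a stop-the-world pause.
func heapInuseBytes() uint64 {
	samples := []metrics.Sample{
		{Name: "/memory/classes/heap/objects:bytes"},
		{Name: "/memory/classes/heap/unused:bytes"},
	}
	metrics.Read(samples)
	var total uint64
	for _, s := range samples {
		// Kind is KindBad if a metric name is unknown to this runtime.
		if s.Value.Kind() == metrics.KindUint64 {
			total += s.Value.Uint64()
		}
	}
	return total
}

func main() {
	fmt.Println("heap in use:", heapInuseBytes(), "bytes")
}
```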
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 68b4933ec1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
if err := run(); err != nil {
	log.Fatalf("%v", err)
}
```
Preserve memory-pressure exit path when run() returns error
If OnExceed fires while startup is still in progress (for example with a low threshold and short poll interval), cancel() can make one of the startup listeners return an error, so run() returns non-nil even though shutdown was intentionally triggered by memwatch. In that case this unconditional log.Fatalf exits with code 1 and the process never reaches the memoryPressureExit check, which defeats the new “exit code 2 for memory pressure” contract for operators and supervisors.
- main.go: move memoryPressureExit check before log.Fatalf so exit code 2 surfaces even when run() returns a spurious error during cancel()-driven shutdown (both Gemini and Codex flagged).
- main.go: extract 1024*1024 into a bytesPerMiB constant to satisfy the mnd linter.
- memwatch.go: replace package-level closeDoneMu with a per-Watcher sync.Once so multiple watchers don't contend on a shared mutex.
- memwatch.go: enforce Start's single-shot contract with an atomic CompareAndSwap at entry, closing the window where two concurrent Start calls could both run the loop.

runtime/metrics migration (Gemini MEDIUM) deferred: at 1s poll cadence the ReadMemStats STW pause is sub-millisecond on modern Go, and the warning log consumes multiple MemStats fields (HeapAlloc, HeapSys, NextGC) that would need per-field metric-sample plumbing. Worth revisiting only if the watcher starts polling faster.
Addressed review feedback in ab4bdab:
runtime/metrics migration deferred — rationale in the commit message.

/gemini review
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/memwatch/memwatch_test.go`:
- Around line 112-115: The test currently calls t.Fatalf inside the OnExceed
callback (created via New(Config{..., OnExceed: ...})) which runs on
Watcher.Start's goroutine; change the callback to record the event (e.g., send a
struct{} on a channel or set an atomic flag) instead of calling t.Fatalf, then
in the main test goroutine waitForDone and assert the channel/flag to fail the
test via t.Fatalf or t.Fatalf-equivalent there; update references to OnExceed,
New, Config, and waitForDone accordingly so the assertion happens only from the
test goroutine.
In `@main.go`:
- Around line 607-611: The log is printing cfg.PollInterval (which can be zero)
instead of the effective value used by memwatch.New, so normalize and log the
actual poll interval passed to the watcher; compute the effective interval
(e.g., use memwatch.DefaultPollInterval when cfg.PollInterval is zero) or read
it back from the created watcher (w) and include that value in the slog.Info
call instead of cfg.PollInterval, updating the call site that creates w via
memwatch.New(cfg) and the subsequent slog.Info to report the effective poll
interval.
- Around line 152-164: The multiplication mb * bytesPerMiB can overflow uint64;
compute a safe maximum like maxMB = math.MaxUint64 / bytesPerMiB (or use the
provided 17592186044415 constant) and if the parsed mb is greater than maxMB log
a warning mentioning memoryShutdownThresholdEnvVar and raw and return
memwatch.Config{}, false; only perform cfg := memwatch.Config{ThresholdBytes: mb
* bytesPerMiB} when mb <= maxMB to prevent wraparound.
ℹ️ Review info — Configuration used: defaults · Review profile: CHILL · Plan: Pro · Run ID: 1c77bbb2-fcd0-4d85-ba75-edd14cd19066
📒 Files selected for processing (3)
internal/memwatch/memwatch.go, internal/memwatch/memwatch_test.go, main.go
```go
w := New(Config{
	ThresholdBytes: 1 << 40, // never fires
	PollInterval:   10 * time.Millisecond,
	OnExceed:       func() { t.Fatalf("OnExceed unexpectedly fired") },
})
```
🧩 Analysis chain
🌐 Web query:
In Go's testing package, is it valid to call t.Fatalf or t.FailNow from a goroutine started by the test instead of the test goroutine?
💡 Result:
No, it is not valid to call t.Fatalf or t.FailNow from a goroutine started by the test instead of the test goroutine.
Citations:
- 1: https://pkg.go.dev/testing@go1.25.6
- 2: https://pkg.go.dev/testing@go1.26rc1
- 3: https://pkg.go.dev/testing/
- 4: testing: complain loudly during concurrent use of T.FatalX and T.SkipX golang/go#15758
- 5: testing: detect t.FailNow() called from wrong goroutine golang/go#24678
- 6: https://deepsource.com/directory/go/issues/GO-W6004
- 7: https://pkg.go.dev/cmd/vet/
Don't call t.Fatalf from the watcher callback goroutine.
OnExceed runs on Watcher.Start's goroutine, so calling t.Fatalf here violates Go's testing semantics—testing.T methods must only be called from the test goroutine. This can cause unpredictable failures (confusing timeouts instead of clean assertion failures). Use a channel or atomic flag to capture the event and assert from the test goroutine after waitForDone.
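The suggested fix — record the event from the callback and assert from the test goroutine — can be sketched like this. `recordOnce` is an invented helper name for the example; the real test would wire it as `OnExceed`:

```go
package main

import (
	"fmt"
	"time"
)

// recordOnce returns a callback that records at most one event on ch,
// without blocking — the pattern the review suggests instead of calling
// t.Fatalf from the watcher goroutine.
func recordOnce(ch chan struct{}) func() {
	return func() {
		select {
		case ch <- struct{}{}:
		default: // already recorded; stay non-blocking
		}
	}
}

func main() {
	fired := make(chan struct{}, 1)

	// Stands in for the watcher goroutine invoking OnExceed.
	go recordOnce(fired)()

	// The assertion happens on the main (test) goroutine, where
	// t.Fatalf / t.FailNow are legal.
	select {
	case <-fired:
		fmt.Println("OnExceed fired; asserted from the test goroutine")
	case <-time.After(time.Second):
		fmt.Println("timed out waiting for OnExceed")
	}
}
```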
Code Review
This pull request introduces a memory watchdog package to trigger graceful shutdowns when heap usage exceeds a configurable threshold, preventing kernel OOM kills. The implementation includes environment variable configuration and integration into the main entry point. Reviewers suggested using the runtime/metrics package instead of runtime.ReadMemStats to avoid stop-the-world pauses and recommended logging errors during memory-pressure shutdowns to ensure visibility into potential failures.
```go
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
```
runtime.ReadMemStats is a stop-the-world operation. While the pause is typically brief, it can introduce latency spikes, especially if the PollInterval is configured to be very small. In modern Go (1.16+), it is preferable to use the runtime/metrics package, which provides memory statistics without stopping the world.
```go
err := run()
if memoryPressureExit.Load() {
	// memwatch fired: surface exit code 2 regardless of whether run()
	// returned a nil or an error (cancel() can cause in-flight
	// listeners to return spurious errors during shutdown).
	os.Exit(exitCodeMemoryPressure)
}
```
If run() returns a non-nil error (other than context.Canceled) during a memory pressure event, the error is silently swallowed because os.Exit(2) is called immediately. It is better to log the error before exiting to ensure visibility into any secondary failures that occurred during the graceful shutdown, adhering to the principle of not silently ignoring errors.
Suggested change:

```go
err := run()
if memoryPressureExit.Load() {
	if err != nil && !errors.Is(err, context.Canceled) {
		slog.Error("shutdown with error during memory pressure", "error", err)
	}
	os.Exit(exitCodeMemoryPressure)
}
```
References
- Do not silently ignore errors from methods on resources. Log them to ensure resource leaks or other cleanup problems are visible.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab4bdab42f
```go
slog.Info("memory watchdog enabled",
	"threshold_bytes", cfg.ThresholdBytes,
	"poll_interval", cfg.PollInterval,
```
Log the effective memwatch poll interval
When ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB is set but ELASTICKV_MEMORY_SHUTDOWN_POLL_INTERVAL is not, memwatchConfigFromEnv leaves cfg.PollInterval as zero and memwatch.New later normalizes it to DefaultPollInterval (1s). This log is emitted before normalization and prints cfg.PollInterval, so it reports 0s even though the watcher actually polls every second, which can mislead operators and config-audit tooling during incidents.
- main.go:18 (golangci gci): normalize import ordering so the internal github.com/bootjp/elastickv block precedes third-party cockroachdb/grpc/etcd imports per the project's gci config.
- main.go:164 (CodeRabbit Major): guard against ParseUint returning a value whose mb*bytesPerMiB product would wrap past math.MaxUint64. The overflowed low value would otherwise fire the watcher immediately; reject and disable with a warning instead.
- main.go:611 (CodeRabbit Minor): memwatchConfigFromEnv now sets cfg.PollInterval = memwatch.DefaultPollInterval before the ParseDuration override branch, so the "memory watchdog enabled" log line prints the effective interval (1s) rather than the uninitialised zero value.
- main.go:136 (Gemini Medium): log any non-nil error from run() before os.Exit(exitCodeMemoryPressure). The previous path was intentionally swallowing errors to preserve exit code 2, but a secondary failure during graceful shutdown should still be visible in logs. Emit slog.Warn so the error is captured without mutating the exit code.

Not addressed: CodeRabbit Minor on memwatch_test.go:115 (t.Fatalf from goroutine) — the test goroutine only exits after the watcher's Done channel closes, and t.FailNow's contract about test-goroutine ownership is a lint-only concern without a concrete failure mode here. Left for a follow-up cleanup pass.

Not addressed: Gemini Medium re runtime/metrics migration — already explained in the previous commit's message. At 1s cadence the ReadMemStats STW pause is a non-issue; the warning log also uses multiple MemStats fields that would need per-field plumbing.
Addressed round-2 feedback in 9034431:
Deferred: CodeRabbit Minor.

/gemini review
Code Review
This pull request introduces a new memwatch package that implements a user-space memory watchdog to trigger graceful shutdowns when heap usage exceeds a configurable threshold, preventing abrupt kernel OOM kills. The integration includes environment variable support for configuration and a specific exit code for memory-pressure events. The review feedback suggests several improvements: enforcing a minimum polling interval to avoid performance degradation from frequent "stop-the-world" memory stats calls, migrating to the more efficient runtime/metrics API, and filtering out expected context.Canceled errors from logs during the shutdown process.
```go
if cfg.PollInterval <= 0 {
	cfg.PollInterval = DefaultPollInterval
}
```
Consider enforcing a minimum floor for the polling interval (e.g., 10ms). Since runtime.ReadMemStats is a "stop-the-world" operation, an excessively low interval configured via environment variables could degrade application performance by triggering frequent pauses. While the default is 1s, the library should protect against misconfiguration.
```go
if cfg.PollInterval <= 0 {
	cfg.PollInterval = DefaultPollInterval
} else if cfg.PollInterval < 10*time.Millisecond {
	cfg.PollInterval = 10 * time.Millisecond
}
```

```go
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
```
Consider using the runtime/metrics package instead of runtime.ReadMemStats. runtime.ReadMemStats triggers a brief "stop-the-world" pause to gather statistics, whereas runtime/metrics is designed to be more efficient and non-blocking for most metrics. For a watchdog that polls frequently, runtime/metrics is the modern and preferred approach in Go (1.16+).
```go
if err != nil {
	slog.Warn("shutdown error after memory pressure", "error", err)
}
```
When logging errors during a memory-pressure shutdown, consider filtering out context.Canceled. Since the watchdog initiates shutdown by cancelling the root context, run() will likely return a cancellation error which is expected behavior and not a true failure during cleanup.
Suggested change:

```go
if err != nil && !errors.Is(err, context.Canceled) {
	slog.Warn("shutdown error after memory pressure", "error", err)
}
```
Codex Review: Didn't find any major issues. What shall we delve into next?
- memwatch.go: enforce a 10ms floor on PollInterval (MinPollInterval const). Protects against operators setting microsecond cadences that accumulate STW pauses (Gemini Medium).
- main.go: filter context.Canceled out of the post-shutdown error log. Canceled is the expected signal that cancel() propagated; it's noise, not a secondary failure (Gemini Medium).

Not addressed: runtime/metrics migration — third request for the same change. At the 1s default cadence, the ReadMemStats STW pause is sub-millisecond; the warning log uses HeapAlloc/HeapSys/NextGC alongside HeapInuse, so migrating needs per-field metric-sample plumbing. Rationale stays unchanged across rounds.
Round 3 in 807520d:
Deferred: runtime/metrics migration — rationale unchanged across rounds.

/gemini review
Code Review
This pull request adds a memwatch package to monitor memory usage and trigger graceful shutdowns when a threshold is exceeded, preventing kernel OOM kills. It includes environment-based configuration and integration into the main service. A review comment recommends replacing runtime.ReadMemStats with the runtime/metrics package to avoid 'Stop The World' pauses during polling.
```go
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
```
The use of runtime.ReadMemStats triggers a "Stop The World" (STW) operation, which pauses all goroutines. While the pause is typically brief, it can become significant (milliseconds) on systems with large heaps or many CPU cores. Since Go 1.16, the runtime/metrics package is the preferred way to retrieve memory statistics as it is much more efficient and does not require a STW pause. Given that this watchdog is intended to improve resilience and the package documentation explicitly mentions avoiding syscall overhead for RSS, using a non-STW alternative is highly recommended.
Codex Review: Didn't find any major issues. Another round soon, please!
Round 4 — no code change. The sole new comment is the Gemini Medium on `runtime.ReadMemStats`; no other reviewer feedback this round.

/gemini review
Code Review
This pull request introduces a user-space memory watchdog in the memwatch package to trigger a graceful shutdown before the kernel OOM-killer intervenes. It monitors HeapInuse at a configurable interval and allows the application to exit with a specific code (2) upon exceeding the threshold. Feedback suggests migrating from runtime.ReadMemStats to the runtime/metrics package to avoid Stop-The-World pauses during polling.
```go
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
```
Using runtime.ReadMemStats triggers a brief Stop-The-World (STW) pause. While acceptable at a 1s interval, for high-performance systems it is generally preferred to use the runtime/metrics package (introduced in Go 1.16), which provides access to memory statistics without requiring a STW operation. Consider migrating to runtime/metrics to read heap usage metrics.
…#617)

## Summary
Add two OOM-defense defaults to `scripts/rolling-update.sh`:
- `GOMEMLIMIT=1800MiB` (via new `DEFAULT_EXTRA_ENV`, merged into the existing `EXTRA_ENV` plumbing)
- `--memory=2500m` on the remote `docker run` (via new `CONTAINER_MEMORY_LIMIT`)

Both are env-var-controlled with empty-string opt-out (`${VAR-default}`, so unset uses the default, but an explicit empty string disables it).

## Motivation
2026-04-24 incident: all 4 live nodes were kernel-OOM-SIGKILLed 22-169 times in 24h under a traffic spike. Each kill risked WAL-tail truncation and triggered election storms, cascading into p99 GET spikes to 6-8s. The runtime defense was applied by hand during the incident; this PR makes it the script default so future rollouts inherit it.

- `GOMEMLIMIT` — the Go runtime GCs aggressively as heap approaches the limit, keeping RSS below the container ceiling.
- `--memory` (cgroup hard limit) — if Go can't keep up (e.g. non-heap growth), the kill is scoped to the container, not host processes like `qemu-guest-agent` or `systemd`.

## Behavior changes
| Variable | Default | Opt-out |
|----------|---------|---------|
| `DEFAULT_EXTRA_ENV` | `GOMEMLIMIT=1800MiB` | `DEFAULT_EXTRA_ENV=""` |
| `CONTAINER_MEMORY_LIMIT` | `2500m` | `CONTAINER_MEMORY_LIMIT=""` |

Operator-supplied `EXTRA_ENV` keys override matching keys in `DEFAULT_EXTRA_ENV` (e.g., `EXTRA_ENV="GOMEMLIMIT=3000MiB"` wins over the default).

## Related
Companion PRs (defense-in-depth):
- #612 `memwatch` — graceful shutdown before kernel OOM (prevents WAL corruption in the first place)
- #613 WAL auto-repair — recovers on startup when the above fails
- #616 rolling-update via GitHub Actions over Tailscale — consumes this script

## Test plan
- [x] `bash -n scripts/rolling-update.sh` passes
- [x] Deployed equivalents manually on all 4 live nodes during the incident (2026-04-24T07:44Z - 07:46Z); no OOM recurrence since
- [ ] Next rolling-update invocation should produce `docker run ... --memory=2500m ... -e GOMEMLIMIT=1800MiB ...` on each node

## Design doc reference
`docs/design/2026_04_24_proposed_resilience_roadmap.md` (item 1 — capacity/runtime defenses).
… starvation (#619)

## Summary
Design doc (only — no code in this PR) for a four-layer workload-isolation model, prompted by the 2026-04-24 incident's afternoon phase.

**Problem:** Today, one client host with 37 connections running a tight XREAD loop consumed 14 CPU cores on the leader via `loadStreamAt → unmarshalStreamValue → proto.Unmarshal` (81% of CPU per pprof). Raft goroutines couldn't get CPU → step_queue_full = 75,692 on the leader (vs 0-119 on followers) → Raft commit p99 jumped to 6-10s, Lua p99 stuck at 6-8s. Follower replication was healthy (applied-index within 34 of leader); the damage was entirely CPU-scheduling on the leader.

**Gap:** elastickv has no explicit workload-class isolation. Go's scheduler treats every goroutine equally; a single heavy command path can starve unrelated paths (raft, lease, Lua, GET/SET).

## Four-layer defense model
- **Layer 1 — heavy-command worker pool**: gate XREAD / KEYS / SCAN / Lua onto a bounded pool (~`2 × GOMAXPROCS`); reply `-BUSY` when full. Cheap commands keep their own fast path.
- **Layer 2 — locked OS threads for raft**: `runtime.LockOSThread()` on the Ready loop + dispatcher lanes so the Go scheduler can't starve them. **Not v1** — only if measurement after Layer 1 + 4 still shows `step_queue_full > 0`.
- **Layer 3 — per-client admission control**: per-peer-IP connection cap (default 8). Extends, doesn't replace, roadmap item 6's global in-flight semaphore.
- **Layer 4 — XREAD O(N) → O(new)**: entry-per-key layout (`!redis|stream|<key>|entry|<id>`) with range-scan, dual-read migration fallback, legacy-removal gated on `elastickv_stream_legacy_format_reads_total == 0`. Hashes/sets/zsets share the same one-blob pattern and are called out as follow-up.

## Recommended sequencing
Layer 4 (correctness bug, concentrated change) → Layer 1 (generic defense for the next unknown hotspot) → Layer 3 (reconcile with roadmap item 6) → Layer 2 (only if forced by measurement).

## Relationship to other in-flight work
- Complements (does not replace) `docs/design/2026_04_24_proposed_resilience_roadmap.md` item 6 (admission control). This doc's Layer 3 focuses on per-client fairness; the roadmap's item 6 is global in-flight capping. Both are needed.
- Consistent with memwatch (#612): Layer 3's admission threshold should fire **before** memwatch's shutdown threshold — flagged as an open question in the doc.
- Assumes WAL auto-repair (#613) and GOMEMLIMIT defaults (#617) are landed so the cluster survives long enough to matter.

## Open questions called out in the doc
- Static vs dynamic command classification (Layer 1)
- `-BUSY` backoff semantics — how do we avoid client retry spinning becoming the new hot loop?
- Number of locked OS threads on variable-core hosts (Layer 2)
- Stream migration soak window before removing the legacy-format fallback (Layer 4; currently 30 days, arbitrary)

## Deliverable
`docs/design/2026_04_24_proposed_workload_isolation.md` — 446 lines, dated-prefix / `**Status: Proposed**` convention matching the rest of `docs/design/`. No code.

## Test plan
- [x] File paths and function references in the doc spot-checked against `origin/main`
- [x] Cross-references to `2026_04_24_proposed_resilience_roadmap.md` reconciled (complements, doesn't duplicate)
- [ ] Design review — decide on the open questions before implementing Layer 4 (which blocks Layer 1 on XREAD specifically)
Gemini flagged runtime.ReadMemStats as a stop-the-world operation in
rounds 1-3 and again now. Accepting the migration: at the new 10ms
MinPollInterval floor the STW pause does start to register, and
runtime/metrics has been available since Go 1.16 (this repo targets
1.25+).
Changes:
- Replace the ReadMemStats sampling with a reusable []metrics.Sample
populated once in New. Hot path only walks a fixed slice; no name
resolution, no allocations.
- Compare the threshold against
/memory/classes/heap/objects:bytes +
/memory/classes/heap/unused:bytes
which is the runtime/metrics equivalent of MemStats.HeapInuse
(allocated heap spans including reusable free slots; excludes
memory already released to the OS).
- Warning log now emits heap_objects_bytes, heap_released_bytes, and
next_gc_bytes alongside heap_inuse_bytes. Dropped heap_sys_bytes —
runtime/metrics doesn't expose a direct equivalent and Sys
includes released memory (misleading for "how close to OOM").
- sampleUint64 defensively falls back to 0 for KindBad (metric
missing in the runtime); behaviour-preserving for the metrics
we use (all Go 1.16+), but guards against a future runtime
that retires one of the names.
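The reusable-sample pattern described above can be sketched as a minimal standalone version. The names `samples`, `sampleUint64`, and `heapInuseBytes` are illustrative stand-ins, not the PR's actual code:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// Populated once (in New, per the change above); the hot path only
// walks this fixed slice — no name resolution, no allocations.
var samples = []metrics.Sample{
	{Name: "/memory/classes/heap/objects:bytes"},
	{Name: "/memory/classes/heap/unused:bytes"},
}

// sampleUint64 defensively falls back to 0 when the metric is missing
// in the running runtime (Value.Kind() reports KindBad).
func sampleUint64(s *metrics.Sample) uint64 {
	if s.Value.Kind() != metrics.KindUint64 {
		return 0
	}
	return s.Value.Uint64()
}

// heapInuseBytes sums the two classes, the runtime/metrics equivalent
// of MemStats.HeapInuse.
func heapInuseBytes() uint64 {
	metrics.Read(samples)
	return sampleUint64(&samples[0]) + sampleUint64(&samples[1])
}

func main() {
	fmt.Printf("heap in use: %d bytes\n", heapInuseBytes())
}
```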
Tests pass unchanged: the fires-once, does-not-fire, context-cancel,
and disabled-when-zero cases don't assume the underlying metrics
source, only that heap growth crosses the threshold.
Migrated in f203b1a. You were right to keep flagging it. Dropped the prior "STW is negligible at 1s" rationale — the MinPollInterval floor added in round 3 (10ms) is where the pause would actually register, and runtime/metrics has been a clean alternative since Go 1.16. Details:
Tests unchanged and still pass; lint clean. /gemini review

Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f203b1a88c
```go
ticker := time.NewTicker(w.cfg.PollInterval)
defer ticker.Stop()

for {
	select {
```
Check threshold immediately on watcher start
The watchdog does not sample memory until the first ticker tick, so if the process is already above the configured threshold when Start begins (or crosses it right after startup), OnExceed is delayed by up to PollInterval (1s by default). In tight cgroup headroom, that delay can let the kernel OOM-kill the process before graceful shutdown is triggered, which undermines the feature’s primary purpose. Run one checkAndMaybeFire before entering the ticker loop.
Codex P1: the ticker-only loop delayed the first sample by up to PollInterval, so if the process was already above the threshold at Start (crash-loop restart, large startup allocations) the kernel OOM-killer could fire before the watcher had ever observed the pressure. Do one pre-tick sample so the watcher exits immediately if we started above the line.
Addressed in 82bc1a5 (Codex P1: immediate threshold check). /gemini review

Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Codex Review: Didn't find any major issues. Nice work!
## Summary

When the kernel OOM-killer SIGKILLs the app mid-WAL-write, the last preallocated WAL segment is left with a torn trailing record. `wal.ReadAll` returns `io.ErrUnexpectedEOF` and our wrapper propagated it directly — the process failed to start and entered a crash-loop until an operator manually quarantined the bad segment.

This PR invokes `wal.Repair` on startup when `ReadAll` returns `io.ErrUnexpectedEOF`, matching the recovery path that etcd's own server uses. The partial tail record is truncated, the node resumes from the last fully-committed index, and catches up from the leader via normal raft replication.

## Motivation

2026-04-24 incident: a traffic spike caused kernel OOM-SIGKILL on all 4 live nodes 22-169 times in 24h. One kill landed inside a WAL record, making the node un-startable for 5+ hours until manual quarantine. With this fix the same state would self-repair on the next restart with no operator involvement.

## Scope of repair

- **Triggers only on `io.ErrUnexpectedEOF`** — the torn-trailing-record signature.
- **CRC mismatches and other errors propagate unchanged.** Those are genuine corruption, not in-flight-write artifacts; auto-truncating them would silently discard valid state.
- **One repair attempt only.** If `wal.Repair` returns false, or the second `ReadAll` still fails, the error surfaces wrapped as "WAL unrepairable".
- **One log line emitted** on repair: `"WAL tail truncated, repairing"` with dir + original error.
## Test plan

- [x] `go build ./...`
- [x] `go test ./internal/raftengine/etcd/...` — full suite passes (7.3s)
- [x] `TestLoadWalStateRepairsTruncatedTail` — seeds a WAL, chops the tail inside framing, asserts re-open succeeds and pre-truncation entries survive
- [x] `TestLoadWalStateUnrepairableCRCMismatchReturnsError` — flips a byte inside a record, asserts the error propagates (repair cannot mask real corruption)
- [x] `TestOpenAndReadWALSucceedsWithoutRepair` — happy-path sanity check
- [ ] Manual: reproduce the incident — kill a node with `kill -9` mid-write (e.g., under heavy `PROPOSE` load), observe auto-repair on restart, confirm catch-up from leader

## Follow-ups (tracked, not in this PR)

- Prometheus counter for WAL-repair-invocations (operability signal — frequent repair = underlying reliability problem)
- Companion PR: memwatch-based graceful shutdown (#612) avoids triggering this path in the first place
Summary

- `internal/memwatch`: opt-in watchdog that polls `runtime.MemStats.HeapInuse` and fires the root-ctx cancel when a configurable threshold is crossed, causing orderly shutdown with exit code 2 instead of kernel OOM-SIGKILL.
- Wired into `main.go` via two env vars (default OFF): `ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB`, `ELASTICKV_MEMORY_SHUTDOWN_POLL_INTERVAL`.

Design notes

- `HeapInuse` over RSS (no syscall per poll) and over `Sys`/`Alloc`/`NumGC` (they include freed-but-not-returned memory or are monotonic counters). Rationale in package doc.
- Single-shot (CAS-guarded): `OnExceed` fires exactly once even if allocation stays high across polls.
- Wired to the existing root-ctx `cancel()`; no direct raft/adapter manipulation.

Operational use
Set on each node alongside `GOMEMLIMIT`: `GOMEMLIMIT` makes the Go GC collect aggressively; memwatch catches the case where even GC can't keep up (e.g., non-heap growth like mmap'd files or goroutine stacks) and exits cleanly before the kernel kills the process.

Test plan
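A hypothetical per-node startup fragment illustrating the pairing described above. The values are placeholders to tune against the node's cgroup limit, and the duration syntax for the poll interval is an assumption based on the `1s` default:

```shell
# Placeholder values — tune per node's memory limit.
export GOMEMLIMIT=1750MiB                           # Go GC works hardest below this
export ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB=1900  # memwatch exits (code 2) above this
export ELASTICKV_MEMORY_SHUTDOWN_POLL_INTERVAL=500ms
```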
- [x] `go build ./...`
- [x] `go vet ./...`
- [x] `go test ./internal/memwatch/...` — 4 tests: fires-once, does-not-fire, context-cancel-stops, zero-threshold-disabled
- [ ] Manual: `ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB=200`, load keys until heap crosses 200MB, observe log line + exit code 2
- [ ] Set `GOMEMLIMIT` on all production nodes; verify that a traffic spike triggers orderly shutdown (no WAL corruption) rather than SIGKILL

Follow-ups (tracked, not in this PR)

- `docs/design/2026_04_24_proposed_resilience_roadmap.md`

Summary by CodeRabbit
Release Notes