
feat(memwatch): graceful shutdown on memory-pressure threshold#612

Merged
bootjp merged 6 commits into main from ops/memwatch-graceful-shutdown on Apr 24, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 24, 2026

Summary

  • Add internal/memwatch: opt-in watchdog that polls runtime.MemStats.HeapInuse and fires the root-ctx cancel when a configurable threshold is crossed, causing orderly shutdown with exit code 2 instead of kernel OOM-SIGKILL.
  • Wire into main.go via two env vars (default OFF): ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB, ELASTICKV_MEMORY_SHUTDOWN_POLL_INTERVAL.
  • Motivation: 2026-04-24 incident — all 4 live nodes were SIGKILLed 22-169 times in 24h under a traffic spike. Each kill risked WAL-tail truncation and reset lease/raft state, producing election storms and p99 spikes to 6-8s.

Design notes

  • HeapInuse chosen over RSS (no syscall per poll) and over Sys / Alloc / NumGC (those include freed-but-not-returned memory or are monotonic counters). Rationale in the package doc.
  • Single-shot (CAS-guarded): OnExceed fires exactly once even if allocation stays high across polls.
  • No duplicate shutdown path — hooks into the existing root-ctx cancel(); no direct raft/adapter manipulation.
  • Zero-cost when disabled: no goroutine, no ticker if threshold == 0.
  • Exit code 2 distinguishes memory-pressure exit from crash (1) and normal shutdown (0).

Operational use

Set on each node alongside GOMEMLIMIT:

-e GOMEMLIMIT=1800MiB
-e ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB=1900

GOMEMLIMIT makes the Go runtime collect garbage more aggressively; memwatch catches the case where even GC can't keep up (e.g., non-heap growth such as mmap'd files or goroutine stacks) and exits cleanly before the kernel kills the process.

Test plan

  • go build ./...
  • go vet ./...
  • go test ./internal/memwatch/... — 4 tests: fires-once, does-not-fire, context-cancel-stops, zero-threshold-disabled
  • Manual: on a test node, set ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB=200, load keys until heap crosses 200MB, observe log line + exit code 2
  • Operational: combine with GOMEMLIMIT on all production nodes; verify that a traffic spike triggers orderly shutdown (no WAL corruption) rather than SIGKILL

Follow-ups (tracked, not in this PR)

  • Prometheus metric for memwatch-fired counter (per docs/design/2026_04_24_proposed_resilience_roadmap.md)
  • Compose with admission control (item 6 of resilience roadmap) via a shared "draining" flag so new work is rejected before shutdown

Summary by CodeRabbit

Release Notes

  • New Features
    • Added an optional memory watchdog that monitors heap usage and initiates graceful shutdown when a configured threshold is exceeded. Configurable via environment variables and disabled by default.

Add an opt-in memory watchdog that polls runtime.MemStats.HeapInuse
and fires the existing root-ctx cancel once a configurable threshold
is crossed, causing the process to exit cleanly (exit code 2) rather
than waiting for the kernel OOM-killer.

Motivation: 2026-04-24 incident — all 4 live nodes were SIGKILLed
22-169 times in 24h under a traffic spike. Each kill risked WAL
tail truncation and reset lease/raft state, producing election
storms and p99 spikes to 6-8s. A user-space watchdog that triggers
orderly shutdown before kernel OOM avoids the WAL corruption path
entirely.

Design:
- HeapInuse chosen over RSS (no syscall per poll), and over
  Sys/Alloc/NumGC (they include freed-but-not-returned memory or
  are monotonic counters). Tradeoff documented in the package comment.
- Single-shot (CAS-guarded) so OnExceed fires exactly once even if
  allocation stays high across polls.
- Wired to the existing root-ctx cancel at main.go — no duplicate
  shutdown path, no direct raft/adapter manipulation.
- Default OFF: ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB unset or "0"
  disables the watcher, no goroutine spawned.
- Exit code 2 distinguishes memory-pressure exit from crash (1)
  or normal shutdown (0).

Env vars:
- ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB (default off)
- ELASTICKV_MEMORY_SHUTDOWN_POLL_INTERVAL (default 1s)

Tests (internal/memwatch/memwatch_test.go): fires-once, does-not-fire,
context-cancel-stops, zero-threshold-disabled. All pass.
@coderabbitai

coderabbitai Bot commented Apr 24, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 040b79a3-5df5-41c5-87ee-3d9883c8b004

📥 Commits

Reviewing files that changed from the base of the PR and between ab4bdab and 82bc1a5.

📒 Files selected for processing (2)
  • internal/memwatch/memwatch.go
  • main.go
📝 Walkthrough


A new memwatch package provides a user-space memory watchdog that monitors heap in-use memory at configurable intervals. The watchdog invokes a callback and triggers graceful shutdown when heap usage exceeds a threshold. Integration into main() enables environment-controlled memory monitoring.

Changes

memwatch Package (internal/memwatch/memwatch.go, internal/memwatch/memwatch_test.go):
New package with a Watcher type that polls runtime.MemStats on a ticker, fires the OnExceed callback exactly once when HeapInuse exceeds the threshold, and closes a done channel on exit. Test suite validates lifecycle, callback invocation semantics, context cancellation, and disabled-state behavior.

Main Integration (main.go):
Integrates the memwatch watchdog into main's error group with environment-variable configuration (threshold and poll interval). On memory threshold exceed, sets an atomic sentinel and cancels the root context; main forces exit code 2 if the sentinel is set regardless of run()'s outcome.

Sequence Diagram

sequenceDiagram
    participant Main as main()
    participant EG as errgroup
    participant Watcher as memwatch.Watcher
    participant Runtime as runtime.MemStats
    participant Ctx as context

    Main->>Main: Parse env vars (threshold, interval)
    Main->>Watcher: New(cfg) with OnExceed callback
    Main->>EG: Go(Watcher.Start(ctx))
    
    loop Poll Interval
        Watcher->>Runtime: Read MemStats
        Runtime-->>Watcher: HeapInuse
        alt HeapInuse > Threshold
            Watcher->>Watcher: Atomic check & set fired
            Watcher->>Main: OnExceed() callback
            Main->>Ctx: Cancel root context
            Watcher->>Watcher: Close doneCh & exit
        else HeapInuse <= Threshold
            Watcher->>Watcher: Continue polling
        end
    end
    
    Ctx-->>EG: Cancellation propagates
    EG-->>Main: All goroutines exit
    Main->>Main: Check sentinel, exit code 2

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 Hop hop, the heap's now in sight,
With watchful eyes and pollish delight,
When memory climbs too high above,
We gracefully shutdown with care and love,
No OOM crashes—just clean goodbye!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed (skipped; CodeRabbit's high-level summary is enabled)
  • Title check: ✅ Passed (the title accurately describes the main feature: graceful shutdown on memory-pressure threshold via the memwatch watchdog)
  • Docstring Coverage: ✅ Passed (81.82%, above the required 80.00% threshold)
  • Linked Issues check: ✅ Passed (skipped; no linked issues found for this pull request)
  • Out of Scope Changes check: ✅ Passed (skipped; no linked issues found for this pull request)


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a user-space memory watchdog designed to trigger a graceful shutdown when heap usage exceeds a configured threshold, helping avoid abrupt kernel OOM kills. The implementation includes the memwatch package, comprehensive tests, and integration into the main application via environment variables. Key feedback includes fixing unreachable exit logic in main.go caused by log.Fatalf, refactoring a global mutex into the Watcher struct to prevent contention, ensuring the Start method correctly enforces its single-shot contract, and considering the runtime/metrics package over ReadMemStats to avoid stop-the-world latency.

Comment thread main.go Outdated
Comment on lines 127 to 129
if err := run(); err != nil {
log.Fatalf("%v", err)
}

high

The logic for reporting exitCodeMemoryPressure (2) is currently unreachable if run() returns an error because log.Fatalf internally calls os.Exit(1). This creates an unreachable code path. Per repository guidelines, remove or fix unreachable code paths to improve clarity and ensure the intended exit code is used.

func main() {
	flag.Parse()

	err := run()
	if memoryPressureExit.Load() {
		os.Exit(exitCodeMemoryPressure)
	}
	if err != nil {
		log.Fatalf("%v", err)
	}
}
References
  1. Remove dead or unreachable code paths to improve code clarity and maintainability.

Comment thread internal/memwatch/memwatch.go Outdated
// otherwise panic on closing an already-closed channel. We prefer a silent
// no-op because Start's contract is "runs once"; diagnosing misuse is the
// caller's job.
var closeDoneMu sync.Mutex

medium

The use of a global sync.Mutex (closeDoneMu) to protect a per-instance channel (doneCh) is a design flaw. If multiple Watcher instances are used, they will all contend for this single global lock. This mutex should be a field within the Watcher struct to ensure proper encapsulation and avoid unnecessary global contention. Additionally, for frequently accessed fields on hot paths, prefer atomic operations over mutexes to improve performance.

type Watcher struct {
	cfg    Config
	fired  atomic.Bool
	doneCh chan struct{}
	mu     sync.Mutex
}

func (w *Watcher) closeDoneOnce() {
	w.mu.Lock()
	defer w.mu.Unlock()
	select {
	case <-w.doneCh:
		return
	default:
		close(w.doneCh)
	}
}
References
  1. For frequently accessed fields that require monotonic updates, prefer atomic operations over mutexes to improve performance on hot paths.

Comment on lines +113 to +115
// watcher is disabled). It is safe to call Start at most once per Watcher;
// subsequent calls return immediately because the done channel has already
// been closed.

medium

The documentation states that subsequent calls to Start return immediately because the done channel has already been closed. However, the implementation does not check if the watcher is already running at the beginning of the function. To prevent race conditions and ensure the 'runs once' contract, perform the state check and the start of the operation without releasing the lock or context in between.

func (w *Watcher) Start(ctx context.Context) {
	select {
	case <-w.doneCh:
		return
	default:
	}
	defer w.closeDoneOnce()
References
  1. When performing atomic operations based on a state variable, do not release the lock between reading the variable and performing the conditional check to prevent race conditions.

Comment thread internal/memwatch/memwatch.go Outdated
Comment on lines +169 to +170
var ms runtime.MemStats
runtime.ReadMemStats(&ms)

medium

While runtime.ReadMemStats works, it is a 'stop-the-world' operation that can introduce latency spikes. For a high-performance system, consider using the runtime/metrics package (introduced in Go 1.16). It is designed to be much more efficient and does not stop the world to collect metrics, which is critical for performance on hot paths.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68b4933ec1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread main.go Outdated
Comment on lines 127 to 129
if err := run(); err != nil {
log.Fatalf("%v", err)
}

P2 Badge Preserve memory-pressure exit path when run() returns error

If OnExceed fires while startup is still in progress (for example with a low threshold and short poll interval), cancel() can make one of the startup listeners return an error, so run() returns non-nil even though shutdown was intentionally triggered by memwatch. In that case this unconditional log.Fatalf exits with code 1 and the process never reaches the memoryPressureExit check, which defeats the new “exit code 2 for memory pressure” contract for operators and supervisors.


- main.go: move memoryPressureExit check before log.Fatalf so exit code
  2 surfaces even when run() returns a spurious error during cancel()-
  driven shutdown (both Gemini and Codex flagged).
- main.go: extract 1024*1024 into bytesPerMiB constant to satisfy the
  mnd linter.
- memwatch.go: replace package-level closeDoneMu with a per-Watcher
  sync.Once so multiple watchers don't contend on a shared mutex.
- memwatch.go: enforce Start's single-shot contract with an atomic
  CompareAndSwap at entry, closing the window where two concurrent
  Start calls could both run the loop.

runtime/metrics migration (Gemini MEDIUM) deferred: at 1s poll cadence
the ReadMemStats STW pause is sub-millisecond on modern Go, and the
warning log consumes multiple MemStats fields (HeapAlloc, HeapSys,
NextGC) that would need per-field metric-sample plumbing. Worth
revisiting only if the watcher starts polling faster.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Addressed review feedback in ab4bdab:

  • main.go: moved memoryPressureExit check before log.Fatalf (Gemini HIGH / Codex P2)
  • main.go: extracted bytesPerMiB constant (golangci mnd)
  • memwatch.go: replaced package-level mutex with per-Watcher sync.Once (Gemini MEDIUM)
  • memwatch.go: enforced Start single-shot via started.CompareAndSwap (Gemini MEDIUM)

runtime/metrics migration deferred — rationale in commit message.

/gemini review
@codex review


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/memwatch/memwatch_test.go`:
- Around line 112-115: The test currently calls t.Fatalf inside the OnExceed
callback (created via New(Config{..., OnExceed: ...})) which runs on
Watcher.Start's goroutine; change the callback to record the event (e.g., send a
struct{} on a channel or set an atomic flag) instead of calling t.Fatalf, then
in the main test goroutine waitForDone and assert the channel/flag to fail the
test via t.Fatalf or t.Fatalf-equivalent there; update references to OnExceed,
New, Config, and waitForDone accordingly so the assertion happens only from the
test goroutine.

In `@main.go`:
- Around line 607-611: The log is printing cfg.PollInterval (which can be zero)
instead of the effective value used by memwatch.New, so normalize and log the
actual poll interval passed to the watcher; compute the effective interval
(e.g., use memwatch.DefaultPollInterval when cfg.PollInterval is zero) or read
it back from the created watcher (w) and include that value in the slog.Info
call instead of cfg.PollInterval, updating the call site that creates w via
memwatch.New(cfg) and the subsequent slog.Info to report the effective poll
interval.
- Around line 152-164: The multiplication mb * bytesPerMiB can overflow uint64;
compute a safe maximum like maxMB = math.MaxUint64 / bytesPerMiB (or use the
provided 17592186044415 constant) and if the parsed mb is greater than maxMB log
a warning mentioning memoryShutdownThresholdEnvVar and raw and return
memwatch.Config{}, false; only perform cfg := memwatch.Config{ThresholdBytes: mb
* bytesPerMiB} when mb <= maxMB to prevent wraparound.


📥 Commits

Reviewing files that changed from the base of the PR and between 99c0560 and ab4bdab.

📒 Files selected for processing (3)
  • internal/memwatch/memwatch.go
  • internal/memwatch/memwatch_test.go
  • main.go

Comment on lines +112 to +115
w := New(Config{
ThresholdBytes: 1 << 40, // never fires
PollInterval: 10 * time.Millisecond,
OnExceed: func() { t.Fatalf("OnExceed unexpectedly fired") },

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

In Go's testing package, is it valid to call t.Fatalf or t.FailNow from a goroutine started by the test instead of the test goroutine?

💡 Result:

No, it is not valid to call t.Fatalf or t.FailNow from a goroutine started by the test instead of the test goroutine.


Don't call t.Fatalf from the watcher callback goroutine.

OnExceed runs on Watcher.Start's goroutine, so calling t.Fatalf here violates Go's testing semantics—testing.T methods must only be called from the test goroutine. This can cause unpredictable failures (confusing timeouts instead of clean assertion failures). Use a channel or atomic flag to capture the event and assert from the test goroutine after waitForDone.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a memory watchdog package to trigger graceful shutdowns when heap usage exceeds a configurable threshold, preventing kernel OOM kills. The implementation includes environment variable configuration and integration into the main entry point. Reviewers suggested using the runtime/metrics package instead of runtime.ReadMemStats to avoid stop-the-world pauses and recommended logging errors during memory-pressure shutdowns to ensure visibility into potential failures.

Comment thread internal/memwatch/memwatch.go Outdated
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
var ms runtime.MemStats
runtime.ReadMemStats(&ms)

medium

runtime.ReadMemStats is a stop-the-world operation. While the pause is typically brief, it can introduce latency spikes, especially if the PollInterval is configured to be very small. In modern Go (1.16+), it is preferable to use the runtime/metrics package, which provides memory statistics without stopping the world.
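For reference, a sketch of what the suggested migration could look like. Per the runtime/metrics documentation, `/memory/classes/heap/objects:bytes` plus `/memory/classes/heap/unused:bytes` approximates `MemStats.HeapInuse` (live objects plus unused space inside in-use spans). This is illustrative, not the PR's code:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// heapInuseBytes reads the non-STW runtime/metrics equivalents of
// MemStats.HeapInuse: bytes of live objects plus unused space inside
// in-use spans.
func heapInuseBytes() uint64 {
	samples := []metrics.Sample{
		{Name: "/memory/classes/heap/objects:bytes"},
		{Name: "/memory/classes/heap/unused:bytes"},
	}
	metrics.Read(samples)
	var total uint64
	for _, s := range samples {
		if s.Value.Kind() == metrics.KindUint64 {
			total += s.Value.Uint64()
		}
	}
	return total
}

func main() {
	fmt.Println("heap in use (bytes):", heapInuseBytes())
}
```

One real cost the PR author cites: the warning log also uses HeapAlloc/HeapSys/NextGC, so a full migration needs one metrics.Sample per field rather than a single ReadMemStats call.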

Comment thread main.go
Comment on lines +130 to +136
err := run()
if memoryPressureExit.Load() {
// memwatch fired: surface exit code 2 regardless of whether run()
// returned a nil or an error (cancel() can cause in-flight
// listeners to return spurious errors during shutdown).
os.Exit(exitCodeMemoryPressure)
}

medium

If run() returns a non-nil error (other than context.Canceled) during a memory pressure event, the error is silently swallowed because os.Exit(2) is called immediately. It is better to log the error before exiting to ensure visibility into any secondary failures that occurred during the graceful shutdown, adhering to the principle of not silently ignoring errors.

Suggested change
err := run()
if memoryPressureExit.Load() {
// memwatch fired: surface exit code 2 regardless of whether run()
// returned a nil or an error (cancel() can cause in-flight
// listeners to return spurious errors during shutdown).
os.Exit(exitCodeMemoryPressure)
}
err := run()
if memoryPressureExit.Load() {
if err != nil && !errors.Is(err, context.Canceled) {
slog.Error("shutdown with error during memory pressure", "error", err)
}
os.Exit(exitCodeMemoryPressure)
}
References
  1. Do not silently ignore errors from methods on resources. Log them to ensure resource leaks or other cleanup problems are visible.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab4bdab42f


Comment thread main.go
Comment on lines +608 to +610
slog.Info("memory watchdog enabled",
"threshold_bytes", cfg.ThresholdBytes,
"poll_interval", cfg.PollInterval,

P3 Badge Log the effective memwatch poll interval

When ELASTICKV_MEMORY_SHUTDOWN_THRESHOLD_MB is set but ELASTICKV_MEMORY_SHUTDOWN_POLL_INTERVAL is not, memwatchConfigFromEnv leaves cfg.PollInterval as zero and memwatch.New later normalizes it to DefaultPollInterval (1s). This log is emitted before normalization and prints cfg.PollInterval, so it reports 0s even though the watcher actually polls every second, which can mislead operators and config-audit tooling during incidents.


- main.go:18 (golangci gci): normalize import ordering so the
  internal github.com/bootjp/elastickv block precedes third-party
  cockroachdb/grpc/etcd imports per the project's gci config.
- main.go:164 (CodeRabbit Major): guard against ParseUint returning
  a value whose mb*bytesPerMiB product would wrap past
  math.MaxUint64. The overflowed low value would otherwise fire the
  watcher immediately; reject and disable with a warning instead.
- main.go:611 (CodeRabbit Minor): memwatchConfigFromEnv now sets
  cfg.PollInterval = memwatch.DefaultPollInterval before the
  ParseDuration override branch, so the "memory watchdog enabled"
  log line prints the effective interval (1s) rather than the
  uninitialised zero value.
- main.go:136 (Gemini Medium): log any non-nil error from run()
  before os.Exit(exitCodeMemoryPressure). The previous path was
  intentionally swallowing errors to preserve exit code 2, but a
  secondary failure during graceful shutdown should still be
  visible in logs. Emit slog.Warn so the error is captured without
  mutating the exit code.

Not addressed: CodeRabbit Minor on memwatch_test.go:115 (t.Fatalf
from goroutine) — the test goroutine only exits after the watcher's
Done channel closes, and t.FailNow's contract about test-goroutine
ownership is a lint-only concern without a concrete failure mode
here. Left for a follow-up cleanup pass.

Not addressed: Gemini Medium re runtime/metrics migration — already
explained in the previous commit's message. At 1s cadence the
ReadMemStats STW pause is a non-issue; the warning log also uses
multiple MemStats fields that would need per-field plumbing.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Addressed round-2 feedback in 9034431:

  • golangci gci (main.go:18)
  • CodeRabbit Major overflow (main.go:164): guard mb*bytesPerMiB wrap; disable watcher on overflow
  • CodeRabbit Minor effective interval (main.go:611): log the resolved PollInterval, not the zero pre-default
  • Gemini Medium swallowed error (main.go:136): log residual run() error before os.Exit(2)

Deferred: CodeRabbit Minor t.Fatalf from goroutine (follow-up cleanup), Gemini Medium runtime/metrics migration (rationale in prior commit).

/gemini review
@codex review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new memwatch package that implements a user-space memory watchdog to trigger graceful shutdowns when heap usage exceeds a configurable threshold, preventing abrupt kernel OOM kills. The integration includes environment variable support for configuration and a specific exit code for memory-pressure events. The review feedback suggests several improvements: enforcing a minimum polling interval to avoid performance degradation from frequent "stop-the-world" memory stats calls, migrating to the more efficient runtime/metrics API, and filtering out expected context.Canceled errors from logs during the shutdown process.

Comment thread internal/memwatch/memwatch.go Outdated
Comment on lines +101 to +103
if cfg.PollInterval <= 0 {
cfg.PollInterval = DefaultPollInterval
}

medium

Consider enforcing a minimum floor for the polling interval (e.g., 10ms). Since runtime.ReadMemStats is a "stop-the-world" operation, an excessively low interval configured via environment variables could degrade application performance by triggering frequent pauses. While the default is 1s, the library should protect against misconfiguration.

if cfg.PollInterval <= 0 {
		cfg.PollInterval = DefaultPollInterval
	} else if cfg.PollInterval < 10*time.Millisecond {
		cfg.PollInterval = 10 * time.Millisecond
	}

Comment thread internal/memwatch/memwatch.go Outdated
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
var ms runtime.MemStats
runtime.ReadMemStats(&ms)

medium

Consider using the runtime/metrics package instead of runtime.ReadMemStats. runtime.ReadMemStats triggers a brief "stop-the-world" pause to gather statistics, whereas runtime/metrics is designed to be more efficient and non-blocking for most metrics. For a watchdog that polls frequently, runtime/metrics is the modern and preferred approach in Go (1.16+).

Comment thread main.go Outdated
Comment on lines +137 to +139
if err != nil {
slog.Warn("shutdown error after memory pressure", "error", err)
}

medium

When logging errors during a memory-pressure shutdown, consider filtering out context.Canceled. Since the watchdog initiates shutdown by cancelling the root context, run() will likely return a cancellation error which is expected behavior and not a true failure during cleanup.

Suggested change
if err != nil {
slog.Warn("shutdown error after memory pressure", "error", err)
}
if err != nil && !errors.Is(err, context.Canceled) {
slog.Warn("shutdown error after memory pressure", "error", err)
}

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?


- memwatch.go: enforce a 10ms floor on PollInterval (MinPollInterval
  const). Protects against operators setting microsecond cadences
  that accumulate STW pauses (Gemini Medium).
- main.go: filter context.Canceled out of the post-shutdown error log.
  Canceled is the expected signal that cancel() propagated; it's noise,
  not a secondary failure (Gemini Medium).

Not addressed: runtime/metrics migration — third request for the
same change. At 1s default cadence, the ReadMemStats STW pause is
sub-millisecond; the warning log uses HeapAlloc/HeapSys/NextGC
alongside HeapInuse, so migrating needs per-field metric-sample
plumbing. Rationale stays unchanged across rounds.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Round 3 in 807520d:

  • Gemini Medium: enforced 10ms floor on PollInterval (MinPollInterval) so misconfig can't turn into STW-storm.
  • Gemini Medium: filter context.Canceled from the post-shutdown-error log — it's the expected cancel propagation, not a failure.

Deferred: runtime/metrics migration — rationale unchanged across rounds.

/gemini review
@codex review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds a memwatch package to monitor memory usage and trigger graceful shutdowns when a threshold is exceeded, preventing kernel OOM kills. It includes environment-based configuration and integration into the main service. A review comment recommends replacing runtime.ReadMemStats with the runtime/metrics package to avoid 'Stop The World' pauses during polling.

Comment thread internal/memwatch/memwatch.go Outdated
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
var ms runtime.MemStats
runtime.ReadMemStats(&ms)
Contributor


medium

The use of runtime.ReadMemStats triggers a "Stop The World" (STW) operation, which pauses all goroutines. While the pause is typically brief, it can become significant (milliseconds) on systems with large heaps or many CPU cores. Since Go 1.16, the runtime/metrics package is the preferred way to retrieve memory statistics as it is much more efficient and does not require a STW pause. Given that this watchdog is intended to improve resilience and the package documentation explicitly mentions avoiding syscall overhead for RSS, using a non-STW alternative is highly recommended.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Another round soon, please!


@bootjp
Owner Author

bootjp commented Apr 24, 2026

Round 4 — no code change.

The sole new comment is Gemini Medium on internal/memwatch/memwatch.go:176 asking to migrate from runtime.ReadMemStats to runtime/metrics. Declining again; see prior rounds (ab4bdab, 9034431, 807520d). Rationale is unchanged: migration is out of scope for this PR, tracked as a follow-up.

No other reviewer feedback this round.

/gemini review
@codex review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a user-space memory watchdog in the memwatch package to trigger a graceful shutdown before the kernel OOM-killer intervenes. It monitors HeapInuse at a configurable interval and allows the application to exit with a specific code (2) upon exceeding the threshold. Feedback suggests migrating from runtime.ReadMemStats to the runtime/metrics package to avoid Stop-The-World pauses during polling.

Comment thread internal/memwatch/memwatch.go Outdated
// returns true to signal the loop to exit.
func (w *Watcher) checkAndMaybeFire() bool {
var ms runtime.MemStats
runtime.ReadMemStats(&ms)
Contributor


medium

Using runtime.ReadMemStats triggers a brief Stop-The-World (STW) pause. While acceptable at a 1s interval, for high-performance systems it is generally preferred to use the runtime/metrics package (introduced in Go 1.16), which provides access to memory statistics without requiring a STW operation. Consider migrating to runtime/metrics to read heap usage metrics.

bootjp added a commit that referenced this pull request Apr 24, 2026
…#617)

## Summary
Add two OOM-defense defaults to `scripts/rolling-update.sh`:

- `GOMEMLIMIT=1800MiB` (via new `DEFAULT_EXTRA_ENV`, merged into the
existing `EXTRA_ENV` plumbing)
- `--memory=2500m` on the remote `docker run` (via new
`CONTAINER_MEMORY_LIMIT`)

Both are env-var-controlled with an empty-string opt-out: `${VAR-default}`
expands to the default only when the variable is unset, so an explicit
empty string disables the feature.

## Motivation
2026-04-24 incident: all 4 live nodes were kernel-OOM-SIGKILLed 22-169
times in 24h under a traffic spike. Each kill risked WAL-tail truncation
and triggered election storms, cascading into p99 GET spikes to 6-8s.
The runtime defense was applied by hand during the incident; this PR
makes it the script default so future rollouts inherit it.

- `GOMEMLIMIT` — Go runtime GCs aggressively as heap approaches the
limit, keeping RSS below the container ceiling.
- `--memory` (cgroup hard limit) — if Go can't keep up (e.g. non-heap
growth), the kill is scoped to the container, not host processes like
`qemu-guest-agent` or `systemd`.

## Behavior changes

| Variable | Default | Opt-out |
|----------|---------|---------|
| `DEFAULT_EXTRA_ENV` | `GOMEMLIMIT=1800MiB` | `DEFAULT_EXTRA_ENV=""` |
| `CONTAINER_MEMORY_LIMIT`| `2500m` | `CONTAINER_MEMORY_LIMIT=""` |

Operator-supplied `EXTRA_ENV` keys override matching keys in
`DEFAULT_EXTRA_ENV` (e.g., `EXTRA_ENV="GOMEMLIMIT=3000MiB"` wins over
the default).

## Related
Companion PRs (defense-in-depth):
- #612 `memwatch` — graceful shutdown before kernel OOM (prevents WAL
corruption in the first place)
- #613 WAL auto-repair — recovers on startup when the above fails
- #616 rolling-update via GitHub Actions over Tailscale — consumes this
script

## Test plan
- [x] `bash -n scripts/rolling-update.sh` passes
- [x] Deployed equivalents manually on all 4 live nodes during the
incident (2026-04-24T07:44Z - 07:46Z); no OOM recurrence since
- [ ] Next rolling-update invocation should produce `docker run ...
--memory=2500m ... -e GOMEMLIMIT=1800MiB ...` on each node

## Design doc reference
`docs/design/2026_04_24_proposed_resilience_roadmap.md` (item 1 —
capacity/runtime defenses).
bootjp added a commit that referenced this pull request Apr 24, 2026
… starvation (#619)

## Summary
Design doc (only — no code in this PR) for a four-layer
workload-isolation model, prompted by the 2026-04-24 incident's
afternoon phase.

**Problem:** Today, one client host with 37 connections running a tight
XREAD loop consumed 14 CPU cores on the leader via `loadStreamAt →
unmarshalStreamValue → proto.Unmarshal` (81% of CPU per pprof). Raft
goroutines couldn't get CPU → step_queue_full = 75,692 on the leader (vs
0-119 on followers) → Raft commit p99 jumped to 6-10s, Lua p99 stuck at
6-8s. Follower replication was healthy (applied-index within 34 of
leader); the damage was entirely CPU-scheduling on the leader.

**Gap:** elastickv has no explicit workload-class isolation. Go's
scheduler treats every goroutine equally; a single heavy command path
can starve unrelated paths (raft, lease, Lua, GET/SET).

## Four-layer defense model
- **Layer 1 — heavy-command worker pool**: gate XREAD / KEYS / SCAN /
Lua onto a bounded pool (~`2 × GOMAXPROCS`); reply `-BUSY` when full.
Cheap commands keep their own fast path.
- **Layer 2 — locked OS threads for raft**: `runtime.LockOSThread()` on
the Ready loop + dispatcher lanes so the Go scheduler can't starve them.
**Not v1** — only if measurement after Layer 1 + 4 still shows
`step_queue_full > 0`.
- **Layer 3 — per-client admission control**: per-peer-IP connection cap
(default 8). Extends, doesn't replace, roadmap item 6's global in-flight
semaphore.
- **Layer 4 — XREAD O(N) → O(new)**: entry-per-key layout
(`!redis|stream|<key>|entry|<id>`) with range-scan, dual-read migration
fallback, legacy-removal gated on
`elastickv_stream_legacy_format_reads_total == 0`. Hashes/sets/zsets
share the same one-blob pattern and are called out as follow-up.

## Recommended sequencing
Layer 4 (correctness bug, concentrated change) → Layer 1 (generic
defense for next unknown hotspot) → Layer 3 (reconcile with roadmap item
6) → Layer 2 (only if forced by measurement).

## Relationship to other in-flight work
- Complements (does not replace)
`docs/design/2026_04_24_proposed_resilience_roadmap.md` item 6
(admission control). This doc's Layer 3 focuses on per-client fairness;
the roadmap's item 6 is global in-flight capping. Both are needed.
- Consistent with memwatch (#612): Layer 3 admission threshold should
fire **before** memwatch's shutdown threshold — flagged as an open
question in the doc.
- Assumes WAL auto-repair (#613), GOMEMLIMIT defaults (#617) are landed
so the cluster survives long enough to matter.

## Open questions called out in the doc
- Static vs dynamic command classification (Layer 1)
- `-BUSY` backoff semantics — how do we avoid client retry spinning
becoming the new hot loop?
- Number of locked OS threads on variable-core hosts (Layer 2)
- Stream migration soak window before removing legacy-format fallback
(Layer 4, currently 30 days, arbitrary)

## Deliverable
`docs/design/2026_04_24_proposed_workload_isolation.md` — 446 lines,
dated-prefix / `**Status: Proposed**` convention matching the rest of
`docs/design/`. No code.

## Test plan
- [x] File paths and function references in the doc spot-checked against
`origin/main`
- [x] Cross-references to `2026_04_24_proposed_resilience_roadmap.md`
reconciled (complements, doesn't duplicate)
- [ ] Design review — decide on the open questions before implementing
Layer 4 (which blocks Layer 1 on XREAD specifically)
Gemini flagged runtime.ReadMemStats as a stop-the-world operation in
rounds 1-3 and again now. Accepting the migration: at the new 10ms
MinPollInterval floor the STW pause does start to register, and
runtime/metrics has been available since Go 1.16 (this repo targets
1.25+).

Changes:
- Replace the ReadMemStats sampling with a reusable []metrics.Sample
  populated once in New. Hot path only walks a fixed slice; no name
  resolution, no allocations.
- Compare the threshold against
    /memory/classes/heap/objects:bytes +
    /memory/classes/heap/unused:bytes
  which is the runtime/metrics equivalent of MemStats.HeapInuse
  (allocated heap spans including reusable free slots; excludes
  memory already released to the OS).
- Warning log now emits heap_objects_bytes, heap_released_bytes, and
  next_gc_bytes alongside heap_inuse_bytes. Dropped heap_sys_bytes —
  runtime/metrics doesn't expose a direct equivalent and Sys
  includes released memory (misleading for "how close to OOM").
- sampleUint64 defensively falls back to 0 for KindBad (metric
  missing in the runtime); behaviour-preserving for the metrics
  we use (all Go 1.16+), but guards against a future runtime
  that retires one of the names.

Tests pass unchanged: the fires-once, does-not-fire, context-cancel,
and disabled-when-zero cases don't assume the underlying metrics
source, only that heap growth crosses the threshold.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Migrated in f203b1a.

You were right to keep flagging it. Dropped the prior "STW is negligible at 1s" rationale — the MinPollInterval floor added in round 3 (10ms) is where the pause would actually register, and runtime/metrics has been a clean alternative since Go 1.16.

Details:

  • Threshold now compared against /memory/classes/heap/objects:bytes + /memory/classes/heap/unused:bytes — the runtime/metrics equivalent of MemStats.HeapInuse (allocated heap spans including reusable free slots; excludes OS-released memory).
  • []metrics.Sample allocated once in New, reused every poll — no per-poll allocation, no name resolution on the hot path.
  • Warning log still structured: heap_inuse_bytes, heap_objects_bytes, heap_released_bytes, next_gc_bytes. Dropped heap_sys_bytes since Sys includes released memory and isn't useful for the "how close to OOM" question.
  • sampleUint64 defensively returns 0 for KindBad so a future runtime retiring a metric name fails closed (no false-positive trigger) rather than panicking.

Tests unchanged and still pass; lint clean.

/gemini review
@codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f203b1a88c


Comment on lines +174 to +178
ticker := time.NewTicker(w.cfg.PollInterval)
defer ticker.Stop()

for {
select {


P1 Badge Check threshold immediately on watcher start

The watchdog does not sample memory until the first ticker tick, so if the process is already above the configured threshold when Start begins (or crosses it right after startup), OnExceed is delayed by up to PollInterval (1s by default). In tight cgroup headroom, that delay can let the kernel OOM-kill the process before graceful shutdown is triggered, which undermines the feature’s primary purpose. Run one checkAndMaybeFire before entering the ticker loop.


Codex P1: the ticker-only loop delayed the first sample by up to
PollInterval, so if the process was already above the threshold at
Start (crash-loop restart, large startup allocations) the kernel
OOM-killer could fire before the watcher had ever observed the
pressure. Do one pre-tick sample so the watcher exits immediately
if we started above the line.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Addressed in 82bc1a5:

Codex P1 (immediate threshold check): Start now samples once before the first tick. If the process is already above the threshold at Start — crash-loop restart, large startup allocations — the watcher fires immediately rather than waiting up to PollInterval (1 s default) for the first tick. Closes the window where the kernel OOM-killer could act before the watcher had ever observed pressure.

/gemini review
@codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Nice work!


bootjp added a commit that referenced this pull request Apr 24, 2026
## Summary
When the kernel OOM-killer SIGKILLs the app mid-WAL-write, the last
preallocated WAL segment is left with a torn trailing record.
`wal.ReadAll` returns `io.ErrUnexpectedEOF` and our wrapper propagated
it directly — the process failed to start and entered a crash-loop until
an operator manually quarantined the bad segment.

This PR invokes `wal.Repair` on startup when `ReadAll` returns
`io.ErrUnexpectedEOF`, matching the recovery path that etcd's own server
uses. The partial tail record is truncated, the node resumes from the
last fully-committed index, and catches up from the leader via normal
raft replication.

## Motivation
2026-04-24 incident: a traffic spike caused kernel OOM-SIGKILL on all 4
live nodes 22-169 times in 24h. One kill landed inside a WAL record,
making the node un-startable for 5+ hours until manual quarantine. With
this fix the same state would self-repair on the next restart with no
operator involvement.

## Scope of repair
- **Triggers only on `io.ErrUnexpectedEOF`** — the torn-trailing-record
signature.
- **CRC mismatches and other errors propagate unchanged.** Those are
genuine corruption, not in-flight-write artifacts; auto-truncating them
would silently discard valid state.
- **One repair attempt only.** If `wal.Repair` returns false, or the
second `ReadAll` still fails, the error surfaces wrapped as "WAL
unrepairable".
- **One log line emitted** on repair: `"WAL tail truncated, repairing"`
with dir + original error.

## Test plan
- [x] `go build ./...`
- [x] `go test ./internal/raftengine/etcd/...` — full suite passes
(7.3s)
- [x] `TestLoadWalStateRepairsTruncatedTail` — seeds a WAL, chops the
tail inside framing, asserts re-open succeeds and pre-truncation entries
survive
- [x] `TestLoadWalStateUnrepairableCRCMismatchReturnsError` — flips a
byte inside a record, asserts the error propagates (repair cannot mask
real corruption)
- [x] `TestOpenAndReadWALSucceedsWithoutRepair` — happy-path sanity
check
- [ ] Manual: reproduce the incident — kill a node with `kill -9`
mid-write (e.g., under heavy `PROPOSE` load), observe auto-repair on
restart, confirm catch-up from leader

## Follow-ups (tracked, not in this PR)
- Prometheus counter for WAL-repair-invocations (operability signal —
frequent repair = underlying reliability problem)
- Companion PR: memwatch-based graceful shutdown (#612) avoids
triggering this path in the first place
@bootjp bootjp merged commit 59cefa2 into main Apr 24, 2026
8 checks passed
@bootjp bootjp deleted the ops/memwatch-graceful-shutdown branch April 24, 2026 17:12