Skip to content

Conversation

@StephenButtolph
Copy link
Contributor

@StephenButtolph StephenButtolph commented Apr 10, 2023

Why this should be merged

Closes #1314

Nodes OOMing can cause extended downtime due to re-generating any non-committed state. This will more aggressively warn a node operator that the node is running out of memory.

How this works

Adds configs:

  1. This adds memory tracking into the process memory tracker.
  2. Registers a health check into the node to report unhealthy depending on the tracked memory profile.
  3. Adds configs to parameterize the node behavior.
  • --system-tracker-memory-required-available (defaults to 500 MiB)
  • --system-tracker-memory-warning-threshold-available (defaults to 1 GiB)

How this was tested

  • Locally (on Ubuntu 22.04) I intentionally introduced a memory leak and monitored the behavior of the node.
  • Locally (on MacOS)
  • Locally (on Windows) Windows handles swap a bit differently than mac / ubuntu... So I had some issues figuring out what the expected values were... but it seemed to be reasonable

@StephenButtolph StephenButtolph added enhancement New feature or request monitoring This primarily focuses on logs, metrics, and/or tracing incident response labels Apr 10, 2023
@StephenButtolph StephenButtolph added this to the v1.10.0 (Cortina) milestone Apr 10, 2023
@StephenButtolph StephenButtolph self-assigned this Apr 10, 2023
@StephenButtolph StephenButtolph linked an issue Apr 10, 2023 that may be closed by this pull request
times = &cpu.TimesStat{}
}

io, err := p.p.IOCounters()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't implemented on macos.

Copy link
Contributor

@gyuho gyuho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

availableMemoryBytes := n.resourceManager.AvailableMemoryBytes()

Does this work for Windows?

(this works on linux/mac)

@StephenButtolph StephenButtolph changed the title Add memory alerting into the health check Add available memory health check Apr 22, 2023
@StephenButtolph StephenButtolph marked this pull request as draft May 22, 2023 20:27
@StephenButtolph StephenButtolph added the DO NOT MERGE This PR must not be merged in its current state label May 22, 2023
@StephenButtolph
Copy link
Contributor Author

Fetching memory inside of K8s isn't currently supported - so this approach doesn't currently work. See: shirou/gopsutil#1320

@github-actions
Copy link

This PR has become stale because it has been open for 30 days with no activity. Adding the lifecycle/frozen label will cause this PR to ignore lifecycle events.

@github-actions
Copy link

This PR has become stale because it has been open for 30 days with no activity. Adding the lifecycle/frozen label will cause this PR to ignore lifecycle events.

@github-actions
Copy link

This PR has become stale because it has been open for 30 days with no activity. Adding the lifecycle/frozen label will cause this PR to ignore lifecycle events.

@github-project-automation github-project-automation bot moved this from Backlog 🗄️ to Done ✅ in avalanchego Mar 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

DO NOT MERGE This PR must not be merged in its current state enhancement New feature or request incident response lifecycle/stale monitoring This primarily focuses on logs, metrics, and/or tracing

Projects

Archived in project
Status: Frozen 🧊

Development

Successfully merging this pull request may close these issues.

Add memory usage health check

6 participants