Log cgroup memory.events on exit code 137 as dmesg fallback #126015
danmoseley wants to merge 4 commits into dotnet:main
When a test process exits with code 137 (SIGKILL/OOM) on Helix Linux containers, dmesg is already called but fails because CAP_SYSLOG is not available. Add a fallback that reads /sys/fs/cgroup/memory.events (cgroup v2), which requires no special privileges and confirms whether the OOM killer fired.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tagging subscribers to this area: @dotnet/runtime-infrastructure
Tested on Linux in a cgroup forced to go OOM. In that example it prints:
It's not much, but it's a clue: it tells us that an exit code of 137 (which could be the OOM killer, a Helix timeout, or something else) was almost surely OOM.
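For reference, a cgroup v2 `memory.events` file is a flat list of `<event> <count>` lines, and the `oom_kill` line counts processes in that cgroup killed by the OOM killer. A minimal sketch of pulling out just that counter follows. The helper name `oom_kill_count` is hypothetical and is not part of this PR, which prints the whole file instead.

```shell
# Hypothetical helper, not the PR's code: print the oom_kill counter
# from a cgroup v2 memory.events file, or nothing if it is unavailable.
oom_kill_count() {
    events_file="$1"
    # Bail out quietly when the file is missing or unreadable
    # (macOS, cgroup v1, restricted containers).
    [ -r "$events_file" ] || return 1
    # Each line is "<event> <count>". Compare the event name exactly so
    # that "oom" and "oom_group_kill" are not picked up by accident.
    awk '$1 == "oom_kill" { print $2 }' "$events_file"
}
```

Calling `oom_kill_count /sys/fs/cgroup/memory.events` after an exit code of 137 and getting a nonzero value is the kind of clue described above.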
Pull request overview
This PR improves Linux Helix test diagnostics for cases where test processes exit with code 137 (SIGKILL/OOM) by adding a non-privileged fallback to confirm OOM killer activity via cgroup v2 memory.events, since dmesg often fails in containers without CAP_SYSLOG.
Changes:
- On Helix work items with exit code 137, attempt to read and print `memory.events` from the root cgroup and then from the per-cgroup path derived from `/proc/self/cgroup`.
- Keep the existing `dmesg | tail -50` behavior, adding the cgroup dump as additional context.
@copilot address feedback
- Reword comment to clarify the cgroup check is a fallback, not conditional on dmesg failure.
- Replace grep -oP (PCRE) with awk for parsing /proc/self/cgroup, improving portability in minimal container images.
- Extract cgroup path into a variable for readability.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
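For context on the awk-based parsing mentioned in this commit: on cgroup v2, `/proc/self/cgroup` contains a line of the form `0::<path>`. A minimal sketch of extracting that path with plain awk (no PCRE needed) is below. The function name `cgroup_v2_path` and its optional file argument are assumptions for illustration, not the code in this PR.

```shell
# Hypothetical helper: print the cgroup v2 path of the current process.
# The optional argument exists only so the parser can be exercised
# against a sample file; it defaults to /proc/self/cgroup.
cgroup_v2_path() {
    cgroup_file="${1:-/proc/self/cgroup}"
    [ -r "$cgroup_file" ] || return 1
    # cgroup v2 entries look like "0::/some/path". cgroup v1 entries
    # (for example "12:memory:/docker/abc") do not start with "0::",
    # so they are skipped. substr keeps any ':' inside the path intact.
    awk '/^0::/ { print substr($0, 4); exit }' "$cgroup_file"
}
```

On a cgroup v1-only system the function prints nothing, which lets a caller fall through silently.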
- Add comments explaining /proc/self/cgroup format with example.
- Note silent no-op behavior on macOS / cgroup v1.
- Reverse loop order: check process-specific cgroup first, fall back to root (root always exists on cgroup v2, so old order would never reach the more informative specific path).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
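The reversed loop order can be sketched roughly as follows: try the process-specific cgroup first, then the root, and stop at the first readable file. This is a sketch under stated assumptions, not the script in this PR. The function name `dump_memory_events` and the `CGROUP_ROOT` / `PROC_CGROUP` override variables are hypothetical and exist only so the logic can be tried against a fake directory tree.

```shell
# Sketch only; all names here are hypothetical, not taken from the PR.
dump_memory_events() {
    cgroup_root="${CGROUP_ROOT:-/sys/fs/cgroup}"
    proc_cgroup="${PROC_CGROUP:-/proc/self/cgroup}"
    cgroup_path=""
    # Check that the file exists before parsing it, rather than
    # relying on awk failing quietly.
    if [ -r "$proc_cgroup" ]; then
        cgroup_path=$(awk '/^0::/ { print substr($0, 4); exit }' "$proc_cgroup")
    fi
    # Process-specific cgroup first, root cgroup second; first hit wins.
    for events_file in "$cgroup_root$cgroup_path/memory.events" \
                       "$cgroup_root/memory.events"; do
        if [ -r "$events_file" ]; then
            echo "=== $events_file ==="
            cat "$events_file"
            return 0
        fi
    done
    # Nothing readable: stay silent (macOS, cgroup v1, no /proc).
    return 1
}
```

With the root checked first, the loop would stop there whenever the root file exists, and the more specific file would never be printed; this ordering avoids that.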
Note
This comment was generated with Copilot.
Added readability comments (example of
Also fixed the loop order: the old code checked the root cgroup's
Move the file-existence check into the if condition instead of relying on awk's 2>/dev/null to silently handle a missing file.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Note
This PR was authored with Copilot assistance.
When a test process exits with code 137 (SIGKILL/OOM) on Helix Linux containers, `dmesg` is already called but fails because `CAP_SYSLOG` is not available in the container. This adds a fallback that reads `/sys/fs/cgroup/memory.events` (cgroup v2), which requires no special privileges and confirms whether the OOM killer fired (the `oom_kill` counter).

The script tries two paths (first match wins):
- `/sys/fs/cgroup/<cgroup-path>/memory.events` (process-specific cgroup, derived from `/proc/self/cgroup`)
- `/sys/fs/cgroup/memory.events` (root cgroup fallback)

This won't replace `dmesg`; a separate issue (dotnet/dnceng#6481) tracks enabling `CAP_SYSLOG` for richer OOM diagnostics. It does give immediate confirmation of OOM at zero infrastructure cost. Silent no-op on macOS, cgroup v1, or any system without `/proc/self/cgroup`.

Related issues: