Log cgroup memory.events on exit code 137 as dmesg fallback#126015

Open
danmoseley wants to merge 4 commits into dotnet:main from danmoseley:oom-memory-events-fallback
Conversation

@danmoseley
Member

@danmoseley danmoseley commented Mar 24, 2026

Note

This PR was authored with Copilot assistance.

When a test process exits with code 137 (SIGKILL/OOM) on Helix Linux containers, dmesg is already called but fails because CAP_SYSLOG is not available in the container. This adds a fallback that reads /sys/fs/cgroup/memory.events (cgroup v2), which requires no special privileges and confirms whether the OOM killer fired (the oom_kill counter).
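
For context on why 137 points at SIGKILL: shells such as bash and dash conventionally report a process killed by signal N as exit status 128 + N, and SIGKILL is signal 9. A minimal sketch of that arithmetic follows; the function name `signal_from_exit_code` is hypothetical and not part of this PR, and a process could in principle also call `exit 137` itself.

```shell
# Hypothetical helper: map a shell exit status to the number of the signal
# that terminated the process, or print nothing for statuses <= 128.
signal_from_exit_code() {
    code=$1
    # Assumed convention: death by signal N is reported as 128 + N.
    if [ "$code" -gt 128 ]; then
        echo $((code - 128))
    fi
}
```

`signal_from_exit_code 137` yields 9 (SIGKILL), which is the signal the OOM killer sends, though a timeout or a manual `kill -9` gives the same status.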

The script tries two paths (first match wins):

  1. /sys/fs/cgroup/<cgroup-path>/memory.events (process-specific cgroup, derived from /proc/self/cgroup)
  2. /sys/fs/cgroup/memory.events (root cgroup fallback)
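
The two-path lookup above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script in this PR: the function name `print_memory_events` is hypothetical, and its second parameter (the cgroup mount point, defaulting to /sys/fs/cgroup) exists only so the sketch can be tried against a temporary directory.

```shell
# Hypothetical helper: print the first readable memory.events file.
# $1 = cgroup path relative to the mount (e.g. /system.slice/foo.scope)
# $2 = cgroup mount point (assumed default: /sys/fs/cgroup)
print_memory_events() {
    cgroup_path=${1:-}
    root=${2:-/sys/fs/cgroup}
    # Process-specific cgroup first, then the root cgroup as a fallback.
    for f in "$root$cgroup_path/memory.events" "$root/memory.events"; do
        if [ -r "$f" ]; then
            echo "cgroup memory.events ($f):"
            cat "$f"
            return 0
        fi
    done
    # Neither file is readable (macOS, cgroup v1, ...): stay silent.
    return 0
}
```

Called as `print_memory_events /system.slice/foo.scope`, it prints the process-specific file when present and otherwise falls back to the root file; when neither exists it prints nothing and still returns success, so it cannot fail the surrounding script.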

This won't replace dmesg — a separate issue (dotnet/dnceng#6481) tracks enabling CAP_SYSLOG for richer OOM diagnostics — but it gives immediate confirmation of OOM at zero infrastructure cost. Silent no-op on macOS, cgroup v1, or any system without /proc/self/cgroup.

Related issues:

When a test process exits with code 137 (SIGKILL/OOM) on Helix Linux
containers, dmesg is already called but fails because CAP_SYSLOG is
not available. Add a fallback that reads /sys/fs/cgroup/memory.events
(cgroup v2) which requires no special privileges and confirms whether
the OOM killer fired.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 24, 2026 04:50
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

@danmoseley
Member Author

Tested on Linux in a cgroup forced to go OOM

   === Running inside cgroup: 0::/system.slice/run-r9f8eabb6405f4917a55147e7977f0422.scope ===

   === Child (PID 1069) exited with code 137 ===
   Exit code is 137 (SIGKILL) — checking cgroup memory.events...

   cgroup memory.events (/sys/fs/cgroup//system.slice/run-r9f8eabb6405f4917a55147e7977f0422.scope/memory.events):
   low 0
   high 0
   max 48
   oom 1
   oom_kill 1
   oom_group_kill 0

In this example, the counters mean:

  • max 48 — 48 times the cgroup hit its memory limit
  • oom 1 — 1 OOM condition triggered
  • oom_kill 1 — 1 process was killed by the OOM killer

It's not much, but it's a clue: exit code 137 on its own could mean the OOM killer, a Helix timeout, or something else, and a nonzero oom_kill counter tells us it was almost surely OOM.
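
As an illustration of how the oom_kill counter could be checked programmatically rather than by eye, here is a minimal sketch. The function name `oom_kill_count` is hypothetical and not part of this PR; it assumes the `key value` line format shown in the output above.

```shell
# Hypothetical helper: print the oom_kill counter from a memory.events file.
# Prints nothing if the file is unreadable or has no oom_kill line.
oom_kill_count() {
    [ -r "$1" ] || return 0
    # Exact match on the first field, so oom_group_kill is not picked up.
    awk '$1 == "oom_kill" { print $2 }' "$1"
}
```

A caller could compare the result against zero, for example `[ "$(oom_kill_count "$file")" -gt 0 ]` after checking that the result is non-empty. The counter is cumulative for the cgroup, so a nonzero value says an OOM kill happened in that cgroup at some point, which is why the conclusion stays "almost surely" rather than certain.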

Contributor

Copilot AI left a comment

Pull request overview

This PR improves Linux Helix test diagnostics for cases where test processes exit with code 137 (SIGKILL/OOM) by adding a non-privileged fallback to confirm OOM killer activity via cgroup v2 memory.events, since dmesg often fails in containers without CAP_SYSLOG.

Changes:

  • On Helix work items with exit code 137, attempt to read and print memory.events from the root cgroup and then from the per-cgroup path derived from /proc/self/cgroup.
  • Keep the existing dmesg | tail -50 behavior, adding the cgroup dump as additional context.

@danmoseley
Member Author

@copilot address feedback

danmoseley and others added 2 commits March 24, 2026 20:21
- Reword comment to clarify the cgroup check is a fallback, not
  conditional on dmesg failure.
- Replace grep -oP (PCRE) with awk for parsing /proc/self/cgroup,
  improving portability in minimal container images.
- Extract cgroup path into a variable for readability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add comments explaining /proc/self/cgroup format with example
- Note silent no-op behavior on macOS / cgroup v1
- Reverse loop order: check process-specific cgroup first,
  fall back to root (root always exists on cgroup v2, so old
  order would never reach the more informative specific path)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 25, 2026 02:47
@danmoseley
Member Author

Note

This comment was generated with Copilot.

Added readability comments (example of /proc/self/cgroup format, what the awk expression extracts, note that it's a silent no-op on macOS/cgroup v1).

Also fixed the loop order: the old code checked the root cgroup's memory.events first, but on cgroup v2 that file always exists, so break would prevent the more informative process-specific cgroup path from ever being checked. Reversed to check specific first, falling back to root.
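
For readers unfamiliar with the format, the path extraction described in this comment might look like the sketch below. It assumes the cgroup v2 entry in /proc/self/cgroup has the form `0::<path>`. The function name `cgroup_v2_path` and its optional file argument (for trying it on sample input) are hypothetical, and the awk expression is an illustration rather than the exact one in this PR.

```shell
# Hypothetical helper: print the cgroup v2 path of the current process.
# $1 = file to parse (assumed default: /proc/self/cgroup)
cgroup_v2_path() {
    file=${1:-/proc/self/cgroup}
    # Check readability up front; on macOS the file is absent and
    # nothing is printed.
    if [ -r "$file" ]; then
        # cgroup v2: "0::/system.slice/foo.scope" -> "/system.slice/foo.scope"
        # cgroup v1 lines such as "5:memory:/docker/abc" do not match.
        awk '/^0::/ { print substr($0, 4); exit }' "$file"
    fi
}
```

Using `substr` instead of splitting on `:` keeps the path intact even if it contains a colon, and the explicit readability test mirrors the idea of putting the file-existence check in the `if` condition rather than hiding errors with `2>/dev/null`.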

Move the file-existence check into the if condition instead of
relying on awk's 2>/dev/null to silently handle missing file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

2 participants