DRT: Added DRT operation for OOM kill #169406

Open
cpj2195 wants to merge 1 commit into cockroachdb:master from cpj2195:drt/add_OOM_operation

cpj2195 (Contributor) commented Apr 30, 2026

Summary

Add a new DRT operation (oom) that induces memory pressure on a
randomly selected node by filling /dev/shm with a ballast file,
leaving a configurable reserve (3 GiB) to keep the node alive long
enough to capture heap profiles.
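
Because /dev/shm is a RAM-backed tmpfs, a ballast file written there consumes memory directly. The sizing rule described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `ballastBytes` and `gib` are hypothetical names, and it assumes the reserve is subtracted from the memory currently available on the node.

```go
package main

import "fmt"

// gib is one gibibyte in bytes.
const gib = uint64(1) << 30

// ballastBytes (hypothetical helper) returns how many bytes of ballast to
// write so that `reserve` bytes remain available. It returns 0 when the
// available memory is already at or below the reserve, so the operation
// never pushes the node past the safety margin.
func ballastBytes(avail, reserve uint64) uint64 {
	if avail <= reserve {
		return 0
	}
	return avail - reserve
}

func main() {
	// Example figures only: 61 GiB available, with the 3 GiB reserve
	// named in the summary.
	avail := 61 * gib
	fmt.Printf("ballast: %d GiB\n", ballastBytes(avail, 3*gib)/gib)
}
```

Clamping at zero is a deliberate choice in this sketch: on a node that is already under pressure, writing no ballast is safer than underflowing the unsigned subtraction.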

  • Remounts /dev/shm to 90% of total RAM to allow sufficient fill
  • Monitors the node for 10 minutes: captures heap profiles, checks
    peer health, and queries for unavailable ranges every 30s
  • Cleanup removes the ballast, restores /dev/shm sizing, and
    restarts the node via cockroach.sh if it became unresponsive
  • Requires at least 2 live nodes; cannot run concurrently with other
    operations

Epic: none

trunk-io (Bot, Contributor) commented Apr 30, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

cockroach-teamcity (Member)

This change is Reviewable

cpj2195 force-pushed the drt/add_OOM_operation branch from f36e286 to 305c2a0 (April 30, 2026 06:21)
cpj2195 requested review from a team, shailendra-patel, and williamchoe3, and removed the request for a team (April 30, 2026 06:26)
cpj2195 force-pushed the drt/add_OOM_operation branch from 305c2a0 to e9ebbf2 (May 4, 2026 07:26)

blathers-crl (Bot) commented May 4, 2026

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)
