Add out-of-memory detection #7974

Merged
merged 1 commit into from
May 29, 2024

Conversation

Contributor

@berland berland commented May 24, 2024

Whenever a realization fails, the output of dmesg is checked for signs that any of the known pids have been killed by the operating system.

Issue
Resolves #7797

Approach
Simple string searches in dmesg output.
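
As a rough illustration of what such a string search could look like (not the code merged in this PR), the sketch below scans kernel log text for the "Out of memory: Killed process <pid>" wording that appears in the log excerpt posted in this conversation and intersects the result with a set of known pids. The function name `find_oom_killed_pids` and the regular expression are assumptions made for this example; other kernel versions may word the message differently.

```python
import re
from typing import Iterable, Set

# Assumed pattern, based on the kernel message wording quoted in this PR:
# "Out of memory: Killed process 2781107 (perl) total-vm:..."
OOM_KILL_PATTERN = re.compile(r"Out of memory: Killed process (\d+)")


def find_oom_killed_pids(dmesg_text: str, known_pids: Iterable[int]) -> Set[int]:
    """Return the subset of known_pids that the log text reports as OOM-killed.

    Hypothetical helper for illustration only.
    """
    # Collect every pid mentioned in an OOM kill message.
    killed = {int(pid) for pid in OOM_KILL_PATTERN.findall(dmesg_text)}
    # Only pids we started ourselves are of interest.
    return killed & set(known_pids)
```

Keeping the parsing separate from the call to dmesg means the search can be unit tested with canned log text, without needing to trigger a real out-of-memory situation.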

image

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure tests pass locally (after every commit!)

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@berland berland self-assigned this May 24, 2024
@berland berland added the release-notes:improvement Automatically categorise as improvement in release notes label May 24, 2024
@berland
Contributor Author

berland commented May 24, 2024

Information logged per realization:

$ grep OOM job-runner-log-2024-05-24T1448.txt 
2024-05-24 14:50:11,917 - _ert_forward_model_runner.job - WARNING - Found OOM trace in dmesg: [1818156.631036] Out of memory: Killed process 2781107 (perl) total-vm:88988304kB, anon-rss:61191332kB, file-rss:72kB, shmem-rss:0kB, UID:21417 pgtables:173784kB oom_score_adj:0[1818265.834033] Out of memory: Killed process 2783031 (perl) total-vm:88737412kB, anon-rss:61053644kB, file-rss:0kB, shmem-rss:0kB, UID:21417 pgtables:173300kB oom_score_adj:0, assuming OOM is the cause of realization kill.
$ cat memory-profile-2024-05-24T1448.csv 
timestamp,fm_step_id,fm_step_name,rss,max_rss,free,oom_score
2024-05-24T14:48:23.251441,0,poly_eval,339968,339968,62710861824,666
2024-05-24T14:48:23.385019,1,memory_hog,8192,8192,62696112128,666
2024-05-24T14:48:28.439066,1,memory_hog,9302544384,9302544384,53263364096,728
2024-05-24T14:48:33.493965,1,memory_hog,18611359744,18611359744,43843858432,790
2024-05-24T14:48:38.549775,1,memory_hog,28015267840,28015267840,34421346304,852
2024-05-24T14:48:43.604736,1,memory_hog,37322678272,37322678272,25096753152,914
2024-05-24T14:48:48.660139,1,memory_hog,46310809600,46310809600,16054931456,974
2024-05-24T14:48:53.715816,1,memory_hog,55389773824,55389773824,6964690944,1034
2024-05-24T14:48:58.804576,1,memory_hog,62403100672,62403100672,147828736,1080
2024-05-24T14:49:03.885224,1,memory_hog,62558224384,62558224384,119640064,1097
2024-05-24T14:49:08.944431,1,memory_hog,62590488576,62590488576,76648448,1128
2024-05-24T14:49:14.063756,1,memory_hog,62717767680,62717767680,85110784,1156
2024-05-24T14:49:19.158945,1,memory_hog,62618357760,62717767680,113479680,1183
2024-05-24T14:49:24.261354,1,memory_hog,62648082432,62717767680,74260480,1211
2024-05-24T14:49:29.347948,1,memory_hog,62632665088,62717767680,98037760,1238
2024-05-24T14:49:34.399316,1,memory_hog,62521217024,62717767680,103342080,1264
2024-05-24T14:49:39.560477,1,memory_hog,62371196928,62717767680,107094016,1266
2024-05-24T14:49:45.014429,1,memory_hog,62366171136,62717767680,101826560,1266
2024-05-24T14:49:51.663442,1,memory_hog,62372913152,62717767680,86609920,1266
2024-05-24T14:49:57.076573,1,memory_hog,62460080128,62717767680,306544640,1267
2024-05-24T14:50:06.953941,1,memory_hog,0,62717767680,3058823168,666
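
A profile like the one above can be summarised with the standard library alone. The sketch below is only an illustration: the helper name `peak_rss_per_step` is hypothetical, the column names are taken from the CSV header shown above, and the values are assumed to be in bytes.

```python
import csv
import io
from typing import Dict


def peak_rss_per_step(csv_text: str) -> Dict[str, int]:
    """Map each fm_step_name to the largest max_rss value seen for it.

    Hypothetical helper; assumes the column layout of the CSV shown above.
    """
    peaks: Dict[str, int] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["fm_step_name"]
        peaks[name] = max(peaks.get(name, 0), int(row["max_rss"]))
    return peaks


# Three rows copied from the profile above.
sample = """timestamp,fm_step_id,fm_step_name,rss,max_rss,free,oom_score
2024-05-24T14:48:23.251441,0,poly_eval,339968,339968,62710861824,666
2024-05-24T14:48:23.385019,1,memory_hog,8192,8192,62696112128,666
2024-05-24T14:50:06.953941,1,memory_hog,0,62717767680,3058823168,666
"""
print(peak_rss_per_step(sample))
# {'poly_eval': 339968, 'memory_hog': 62717767680}
```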

@berland
Contributor Author

berland commented May 24, 2024

image

@berland
Contributor Author

berland commented May 28, 2024

This seems more problematic on Ubuntu 22.04: the terminal window running pytest gets killed when running test_cli.py. Also, on Ubuntu, dmesg fails with Permission denied. journalctl gives a hint:

mai 28 07:51:10 eqbob systemd[1207]: vte-spawn-d5a86939-ebc9-42c6-aa1f-d70575478263.scope: systemd-oomd killed 22 process(es) in this unit.

but this would not be sufficient for this Ert feature.
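
Since dmesg may be unreadable for unprivileged users, any check built on it has to tolerate failure. The sketch below shows one defensive way to read the kernel log, returning None when the command is missing, hangs, or exits with an error. The helper name `read_kernel_log` and its parameters are assumptions for this example, not the implementation in this PR.

```python
import subprocess
from typing import Optional, Sequence


def read_kernel_log(
    command: Sequence[str] = ("dmesg",), timeout: float = 5.0
) -> Optional[str]:
    """Return the command's stdout, or None if it cannot be run or fails.

    Hypothetical helper for illustration only.
    """
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The command is missing, not executable, or did not finish in time.
        return None
    if result.returncode != 0:
        # For example "Permission denied" when unprivileged reads are restricted.
        return None
    return result.stdout
```

A caller would treat None as "no evidence available" and fall back to reporting the failure without an OOM diagnosis.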

@berland berland force-pushed the detect_oom branch 3 times, most recently from 25435a8 to db38f8f Compare May 28, 2024 07:44
@@ -138,6 +195,9 @@ def ensure_file_handles_closed():
try:
    exit_code = process.wait(timeout=self.MEMORY_POLL_PERIOD)
except TimeoutExpired:
    fm_step_pids |= set(
Contributor

I see, process.children gets updated every time :)

Contributor Author

Yes, that is intentional.
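
The pattern discussed in this thread, where the set of pids grows on every poll timeout, can be sketched with the standard library as follows. This is a simplified illustration rather than the code in the diff: `wait_and_collect_pids` and the injected `list_child_pids` callable are hypothetical names, and how child pids are actually enumerated is left to the caller.

```python
import subprocess
from typing import Callable, List, Set, Tuple


def wait_and_collect_pids(
    argv: List[str],
    list_child_pids: Callable[[int], Set[int]],
    poll_period: float = 0.1,
) -> Tuple[int, Set[int]]:
    """Run argv and, on every poll timeout, record the child pids seen.

    list_child_pids is a hypothetical callable mapping a parent pid to the
    pids of its current children. The set only ever grows, so children that
    have already exited are still remembered.
    """
    process = subprocess.Popen(argv)
    seen_pids: Set[int] = {process.pid}
    while True:
        try:
            exit_code = process.wait(timeout=poll_period)
            return exit_code, seen_pids
        except subprocess.TimeoutExpired:
            # Still running: merge in whatever children exist right now.
            seen_pids |= list_child_pids(process.pid)
```

Querying the children on every timeout matters because a forward model step may spawn and reap subprocesses over its lifetime; a single snapshot would miss any that were killed before or started after it was taken.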

@xjules
Contributor

xjules commented May 29, 2024

FAILED tests/unit_tests/forward_model_runner/test_job_dispatch.py::test_killed_by_oom - assert False
 +  where False = killed_by_oom({666, 667})
 +    where {666, 667} = set([666, 667])
= 1 failed, 1940 passed, 22 skipped, 4823 warnings, 1 rerun in 350.92s (0:05:50) =

Whenever a realization fails the output of dmesg is checked to see if there
are any signs of known pids having been killed by the operating system

Add an integration test that will exhaust all available memory (by
default, this test is not active) and assert that Ert picks up the cause
of the realization failing.
@berland
Contributor Author

berland commented May 29, 2024

FAILED tests/unit_tests/forward_model_runner/test_job_dispatch.py::test_killed_by_oom - assert False
 +  where False = killed_by_oom({666, 667})
 +    where {666, 667} = set([666, 667])
= 1 failed, 1940 passed, 22 skipped, 4823 warnings, 1 rerun in 350.92s (0:05:50) =

Fixed. The test must be skipped on macOS.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.94%. Comparing base (36979af) to head (29e89ff).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7974   +/-   ##
=======================================
  Coverage   85.93%   85.94%           
=======================================
  Files         381      381           
  Lines       23466    23466           
  Branches      628      635    +7     
=======================================
+ Hits        20165    20167    +2     
+ Misses       3228     3226    -2     
  Partials       73       73           


Contributor

@xjules xjules left a comment

Nice job! 🥇 🚀

@berland berland merged commit 7051849 into equinor:main May 29, 2024
38 checks passed
@berland berland deleted the detect_oom branch June 6, 2024 11:28
Successfully merging this pull request may close these issues.

Detect OOM realization kill
3 participants