Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add proxy script for qstat #3553

Merged
merged 1 commit into from
Aug 8, 2022
Merged

Add proxy script for qstat #3553

merged 1 commit into from
Aug 8, 2022

Conversation

berland
Copy link
Contributor

@berland berland commented Jun 21, 2022

Issue
Resolves #3541

Approach
This shell script is significantly faster to run compared to the
real qstat ("backend") as it only greps in a file on /tmp.
The file on /tmp acts as a cache, and will be updated whenever
it gets too old (~seconds).

The queue manager in ERT polls the queing system (that is qstat)
once pr. second for every realization. This can replicate a DDoS on
the cluster queue server, which we mitigate by using this proxy.

Blocked by #3537

Pre review checklist

  • Added appropriate release note label
  • PR title captures the intent of the changes, and is fitting for release notes.
  • Commit history is consistent and clean, in line with the contribution guidelines.

Adding labels helps the maintainers when writing release notes. This is the list of release note labels.

@codecov-commenter
Copy link

codecov-commenter commented Jun 21, 2022

Codecov Report

Merging #3553 (56611b7) into main (0079c8e) will decrease coverage by 0.05%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #3553      +/-   ##
==========================================
- Coverage   63.34%   63.28%   -0.06%     
==========================================
  Files         597      595       -2     
  Lines       46243    46198      -45     
  Branches     4144     4156      +12     
==========================================
- Hits        29293    29237      -56     
- Misses      15662    15667       +5     
- Partials     1288     1294       +6     
Impacted Files Coverage Δ
src/libres/lib/include/ert/res_util/string.hpp 65.78% <0.00%> (-1.96%) ⬇️
src/libres/lib/enkf/enkf_obs.cpp 64.22% <0.00%> (-0.86%) ⬇️
src/libres/lib/enkf/ensemble_config.cpp 42.47% <0.00%> (-0.25%) ⬇️
src/libres/lib/res_util/block_fs.cpp 67.75% <0.00%> (ø)
src/libres/lib/enkf/enkf_main_ensemble.cpp
src/libres/lib/enkf/enkf_main_manage_fs.cpp
src/libres/lib/enkf/ert_template.cpp 15.17% <0.00%> (+0.39%) ⬆️
src/libres/lib/enkf/enkf_main.cpp 37.54% <0.00%> (+4.08%) ⬆️

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@berland berland self-assigned this Jun 21, 2022
@berland berland added the release-notes:improvement Automatically categorise as improvement in release notes label Jun 21, 2022
@berland berland changed the title wip: qstat proxy shell script Add proxy script for qstat Jun 22, 2022
@berland berland force-pushed the qstat_proxy branch 2 times, most recently from 92ca9b0 to de8e3e5 Compare June 23, 2022 10:19
@berland
Copy link
Contributor Author

berland commented Jun 23, 2022

TODO:

  • Fails on macos due to diffent syntax for the stat command (solved with stat -f %B on mac)
  • Write tests provoking race conditions when qstat backend takes a second to answer
  • mv command is atomic (or very close). Use that to move qstat output into the correct place when it is done.
  • Approach is not portable to mac as flock does not exist for mac. shlock on mac is an option but it states it does not work on NFS.
  • Use cmake to install the proxy script for tests, setup.py for normal usage.
  • Verify that an "Ensemble experiment" on the poly-case on an Azure node with 1000 realizations runs smoothly.

@xjules xjules added the release-notes:bug-fix Automatically categorise as bug fix in release notes label Jun 24, 2022
@berland berland marked this pull request as ready for review June 27, 2022 10:22
Copy link
Contributor

@jondequinor jondequinor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose bash? flock (fcntl), stat, cat should all be supported by python stdlib.

Unless there's a really good reason to choose bash, we should choose the portable python route.

For example, try building a komodo distribution on macos. It will fail mostly because it assumes rename, rmdir to come from coreutils, which on macos they do not.

@berland
Copy link
Contributor Author

berland commented Jun 28, 2022

sh was chosen as I speculate it is significantly cheaper/faster to spawn compared to Python. As this is command will be called every second for every realization with the current queue manager, startup time can be critical. The file locking used by flock is in any case not portable to Mac.

@jondequinor jondequinor dismissed their stale review June 28, 2022 14:12

I buy the speed argument

@berland berland force-pushed the qstat_proxy branch 8 times, most recently from eb2062b to 11de59f Compare August 4, 2022 14:03
@@ -1,9 +1,12 @@
click
cmake-format
decorator
ecl_data_io
ecl_data_io
Copy link
Contributor

@xjules xjules Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2x the same

@xjules
Copy link
Contributor

xjules commented Aug 5, 2022

Just to recap our discussion: add one more test to check that proxyfile gets new content when timeout happens. Otherwise it's a solid piece of work! Nice @berland

@berland
Copy link
Contributor Author

berland commented Aug 8, 2022

Rebase pending reviewer @xjules

Copy link
Contributor

@xjules xjules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice @berland ! Very good job! 🚀

This shell script is significantly faster to run compared to the
real qstat ("backend") as it only greps in a file on /tmp.
The file on /tmp acts as a cache, and will be updated whenever
it gets too old (~seconds).

The queue manager in ERT polls the queing system (that is qstat)
once pr. second for every realization. This can replicate a DDoS on
the cluster queue server, which we mitigate by using this proxy.
@berland berland merged commit 78073e0 into equinor:main Aug 8, 2022
@berland berland deleted the qstat_proxy branch October 18, 2022 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes:bug-fix Automatically categorise as bug fix in release notes release-notes:improvement Automatically categorise as improvement in release notes
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Update Torque driver to handle thousands of realizations
4 participants