Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry qsub and qstat in case of failures #3537

Merged
merged 1 commit into from
Aug 4, 2022

Conversation

berland
Copy link
Contributor

@berland berland commented Jun 17, 2022

Issue
Resolves #405

Approach
Retry qstat and qsub commands in case of (intermittent) failures.

Based on (blocked by) #3518 and #3490

Pre review checklist

  • Added appropriate release note label
  • PR title captures the intent of the changes, and is fitting for release notes.
  • Commit history is consistent and clean, in line with the contribution guidelines.

Adding labels helps the maintainers when writing release notes. This is the list of release note labels.

@berland berland changed the title Torque testing Retry qsub and qstat in case if failures Jun 17, 2022
@berland berland changed the title Retry qsub and qstat in case if failures Retry qsub and qstat in case of failures Jun 17, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jun 17, 2022

Codecov Report

Merging #3537 (fcea14d) into main (636e1bc) will increase coverage by 0.00%.
The diff coverage is 61.11%.

@@           Coverage Diff           @@
##             main    #3537   +/-   ##
=======================================
  Coverage   64.70%   64.70%           
=======================================
  Files         592      592           
  Lines       46161    46187   +26     
  Branches     4161     4165    +4     
=======================================
+ Hits        29867    29885   +18     
- Misses      14950    14956    +6     
- Partials     1344     1346    +2     
Impacted Files Coverage Δ
src/libres/lib/job_queue/torque_driver.cpp 80.24% <61.11%> (-1.02%) ⬇️
src/libres/lib/res_util/block_fs.cpp 68.16% <0.00%> (ø)
src/ert/data/record/_transformation.py 89.04% <0.00%> (+0.47%) ⬆️

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@berland berland added the release-notes:bug-fix Automatically categorise as bug fix in release notes label Jun 17, 2022
@berland berland self-assigned this Jun 21, 2022
@berland berland force-pushed the torque_testing branch 4 times, most recently from 6b88200 to 670f5d8 Compare June 22, 2022 10:59
@berland berland marked this pull request as ready for review June 22, 2022 11:00
@berland berland force-pushed the torque_testing branch 4 times, most recently from c6c2ae0 to 61e2ce4 Compare June 27, 2022 10:20
@berland berland mentioned this pull request Jun 27, 2022
3 tasks
@berland berland force-pushed the torque_testing branch 2 times, most recently from c222754 to 7578bdd Compare June 27, 2022 12:36
 retry a couple of times with exponential sleep. */
int return_value = -1;
int sleep_time = 2;
int max_sleep_time = 2 * 2 * 2 * 2; /* max 4 attempts */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this like exponentially increasing sleep_time? If this is required then max_sleep_time could be just max_sleep_count=4, but I guess this doesn't require additional i++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exponentiality is covered by sleep_time *= 2. Testing sleep_time < max_sleep_time made the while-statement nicer as far as I remember, instead of having something like max_sleep_count = 4.

Copy link
Contributor

@xjules xjules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had some minor comments, but looks good nonetheless!

retry a couple of times with exponential sleep. ERT pings qstat
every second for every realization, thus the initial sleep time
is 2 seconds. */
int return_value = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this return_value parameter can go directly into the while loop; ie. int return_value = util_spawn_blocking(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed to bootstrap the while-statement.

@@ -97,6 +96,24 @@ def create_local_queue(
return job_queue


def create_job_queue_node(job_id=0, num_cpus=1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering why to have this helper function when it's used only once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

pytest.param(FLAKY_QSUB, FLAKY_QSTAT, id="all_flaky"),
],
)
def test_run_torque_job(tmpdir, qsub_script, qstat_script):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment (a few words) what / why is it that we test here? Especially regarding the behaviour of FLAKY versions :)

@jondequinor
Copy link
Contributor

There's a wrapper script provided by someone located in /prog/util/lib/pbswrap/. It does not do exponential backoff, but it will retry 5 times. Other than that, it's pretty bare bones, and I much prefer to call qsub/qstat directly. My vote is on keep working on this PR.

@berland berland force-pushed the torque_testing branch 4 times, most recently from ac5005b to a2b83cf Compare August 1, 2022 12:57
Copy link
Contributor

@xjules xjules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚀

@berland berland force-pushed the torque_testing branch 2 times, most recently from fcea14d to fa24126 Compare August 2, 2022 08:07
qstat is used to ask the queue server for torque for the
status of individual jobs. With ERTs ping rate at once pr sec
pr realization, we need to accept intermittent failures and retry.

Add tests for the torque driver
@berland berland merged commit 6f6b1d2 into equinor:main Aug 4, 2022
@berland berland deleted the torque_testing branch October 18, 2022 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes:bug-fix Automatically categorise as bug fix in release notes
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Torque driver unable to get job status
4 participants