Retry qsub and qstat in case of failures #3537
Codecov Report
@@           Coverage Diff           @@
##             main    #3537   +/-   ##
=======================================
  Coverage   64.70%   64.70%
=======================================
  Files         592      592
  Lines       46161    46187   +26
  Branches     4161     4165    +4
=======================================
+ Hits        29867    29885   +18
- Misses      14950    14956    +6
- Partials     1344     1346    +2
6b88200 to 670f5d8
c6c2ae0 to 61e2ce4
c222754 to 7578bdd
   retry a couple of times with exponential sleep. */
int return_value = -1;
int sleep_time = 2;
int max_sleep_time = 2 * 2 * 2 * 2; /* max 4 attempts */
Is this an exponentially increasing sleep_time? If that is required, then max_sleep_time could be just max_sleep_count=4, but I guess the current approach doesn't require an additional i++.
The exponentiality is covered by sleep_time *= 2. Testing sleep_time < max_sleep_time made the while-statement nicer as far as I remember, instead of having something like max_sleep_count = 4.
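For readers following this thread, here is a minimal Python sketch of the loop shape being discussed. It is an illustration only, not the C code from the PR: `run_with_retry`, `command`, and the injectable `sleep` parameter are hypothetical names, and the exact branch structure of the real loop is assumed.

```python
import time


def run_with_retry(command, sleep=time.sleep, initial_sleep=2,
                   max_sleep_time=2 * 2 * 2 * 2):
    """Call command() until it returns 0 or the sleep time reaches the cap.

    command is a hypothetical callable standing in for the blocking
    qsub/qstat invocation; it returns an exit code (0 means success).
    """
    return_value = -1  # non-zero start value bootstraps the while-statement
    sleep_time = initial_sleep
    while return_value != 0:
        return_value = command()
        if return_value != 0:
            if sleep_time < max_sleep_time:
                sleep(sleep_time)
                sleep_time *= 2  # exponential increase: 2, 4, 8
            else:
                break  # cap reached, give up and report the failure
    return return_value
```

With the defaults, a command that always fails is attempted four times with sleeps of 2, 4, and 8 seconds in between, which matches the "max 4 attempts" comment without needing a separate attempt counter.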
Just had some minor comments, but looks good nonetheless!
   retry a couple of times with exponential sleep. ERT pings qstat
   every second for every realization, thus the initial sleep time
   is 2 seconds. */
int return_value = -1;
I guess this return_value variable can go directly into the while loop, i.e. int return_value = util_spawn_blocking(...)
It is needed to bootstrap the while-statement.
@@ -97,6 +96,24 @@ def create_local_queue(
    return job_queue


def create_job_queue_node(job_id=0, num_cpus=1):
Just wondering why we have this helper function when it's used only once?
Fixed!
        pytest.param(FLAKY_QSUB, FLAKY_QSTAT, id="all_flaky"),
    ],
)
def test_run_torque_job(tmpdir, qsub_script, qstat_script):
Could you add a comment (a few words) on what we test here and why? Especially regarding the behaviour of the FLAKY versions :)
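To make the idea of a "flaky" command concrete, here is a small Python sketch of how such a stand-in might behave. This is an assumption for illustration: the actual FLAKY_QSUB and FLAKY_QSTAT scripts in the PR are not shown on this page, and `FLAKY_SCRIPT`, `run_flaky_command`, and the counter file are hypothetical. The sleeps between attempts are left out to keep the example fast.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a flaky qstat: exits non-zero on its first
# two invocations and zero afterwards. The call count lives in a file
# because every invocation is a separate process.
FLAKY_SCRIPT = """\
import sys
from pathlib import Path

counter = Path(sys.argv[1])
calls = int(counter.read_text()) if counter.exists() else 0
counter.write_text(str(calls + 1))
sys.exit(0 if calls >= 2 else 1)
"""


def run_flaky_command(max_attempts=4):
    """Invoke the flaky script until it succeeds or attempts run out.

    Returns a tuple (last exit code, number of attempts made).
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "flaky_qstat.py"
        script.write_text(FLAKY_SCRIPT)
        counter = Path(tmp) / "counter"
        exit_code = -1
        attempts = 0
        while exit_code != 0 and attempts < max_attempts:
            exit_code = subprocess.call(
                [sys.executable, str(script), str(counter)]
            )
            attempts += 1
        return exit_code, attempts
```

A test built on this kind of stand-in checks that a caller which retries eventually sees success, while a caller with too few attempts still reports the failure.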
There's a wrapper script provided by someone located in
ac5005b to a2b83cf
LGTM! 🚀
fcea14d to fa24126
qstat is used to ask the Torque queue server for the status of individual jobs. With ERT's ping rate of once per second per realization, we need to accept intermittent failures and retry. Add tests for the torque driver.
Issue
Resolves #405
Approach
Retry qstat and qsub commands in case of (intermittent) failures.
Based on (blocked by) #3518 and #3490
Pre review checklist
Adding labels helps the maintainers when writing release notes. This is the list of release note labels.