Cover previously unhandled job queue state JOB_QUEUE_DO_KILL_NODE_FAILURE #5667

valentin-krasontovitsch
Contributor

Issue
Possibly resolves #4809

Approach

  • Increase the timeout before a job is considered unconfirmed beyond recovery
  • Handle the unconfirmed state gracefully

Pre review checklist

  • Added appropriate release note label
  • PR title captures the intent of the changes, and is fitting for release notes.
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Not applicable - Updated documentation
  • Planned, with the help of users - Ensured new behaviour is tested

Adding labels helps the maintainers when writing release notes. This is the list of release note labels.

@valentin-krasontovitsch valentin-krasontovitsch added the release-notes:bug-fix label (Automatically categorise as bug fix in release notes) Jul 3, 2023
@codecov-commenter

Codecov Report

Merging #5667 (2b8e2b2) into main (3e167bf) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #5667      +/-   ##
==========================================
- Coverage   80.20%   80.18%   -0.02%     
==========================================
  Files         373      373              
  Lines       23280    23280              
  Branches     1066     1066              
==========================================
- Hits        18671    18667       -4     
- Misses       4333     4337       +4     
  Partials      276      276              
Impacted Files Coverage Δ
src/clib/lib/job_queue/job_node.cpp 23.29% <100.00%> (ø)
src/ert/job_queue/job_queue_node.py 90.05% <100.00%> (ø)

... and 1 file with indirect coverage changes

@valentin-krasontovitsch valentin-krasontovitsch marked this pull request as ready for review July 3, 2023 10:58
Contributor

@berland berland left a comment

Looks good to go, remember to squash the fixup commit

Possibly due to network issues, a job sometimes cannot be confirmed as
running based on the existence of the `STATUS` file, even though it is
running. We increase the timeout before deciding that the job is
unrecoverable, to be more tolerant of network delays.

Cf. equinor#4809

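The timeout idea in this commit message can be illustrated with a minimal sketch. It is not the actual ert implementation: the function name `confirm_running`, its parameters, and the assumption that `STATUS` sits directly in a per-job run directory are all hypothetical, chosen only to show why a longer timeout tolerates a slow filesystem.

```python
import time
from pathlib import Path


def confirm_running(run_path: Path, timeout: float, poll_interval: float = 0.05) -> bool:
    """Return True once run_path/STATUS exists, False if the timeout elapses first.

    Hypothetical helper: the name, parameters, and file location are
    assumptions for illustration, not ert's API.
    """
    deadline = time.monotonic() + timeout
    while True:
        # The job counts as confirmed as soon as the STATUS file is visible.
        if (run_path / "STATUS").exists():
            return True
        # Give up only after the full timeout has passed.
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

With a short `timeout`, a `STATUS` file that appears late (for example because a network filesystem is slow to propagate it) would be missed and the job treated as unconfirmed; a larger value gives the file more time to show up.
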
The state `JOB_QUEUE_DO_KILL_NODE_FAILURE` arises when a job has
been submitted and should be running, but cannot be confirmed as running
based on the `STATUS` file after a certain waiting period.

It was previously unhandled in the Python code and led to an unexpected
job status exception. In ESMDA, for example, this caused a node failure
when the run subsequently tried to continue with the realization in
question.

We remedy this by calling the exit callback, which should ensure that
there are no failures downstream.

Cf. equinor#4809
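
The handling described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from `job_queue_node.py`: the `JobStatus` enum and the `handle_end_status` function are hypothetical stand-ins, and only the behaviour (routing the node-failure state to the exit callback instead of raising on an unexpected status) is taken from the text above.

```python
from enum import Enum, auto
from typing import Callable


class JobStatus(Enum):
    """Hypothetical stand-in for the job queue status type."""

    DONE = auto()
    EXIT = auto()
    DO_KILL_NODE_FAILURE = auto()
    UNKNOWN = auto()


def handle_end_status(
    status: JobStatus,
    done_callback: Callable[[], None],
    exit_callback: Callable[[], None],
) -> None:
    """Dispatch on the final status of a job (illustrative only)."""
    if status == JobStatus.DONE:
        done_callback()
    elif status in (JobStatus.EXIT, JobStatus.DO_KILL_NODE_FAILURE):
        # An unconfirmed job is treated like a failed one, so downstream
        # code sees a regular failure rather than an exception.
        exit_callback()
    else:
        raise AssertionError(f"Unexpected job status: {status}")
```

Before such a branch exists, a status like `DO_KILL_NODE_FAILURE` falls through to the final `else` and raises, which matches the assertion error reported in the linked issue.
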
@valentin-krasontovitsch valentin-krasontovitsch merged commit 28ed038 into equinor:main Jul 11, 2023
34 checks passed
@valentin-krasontovitsch valentin-krasontovitsch deleted the 4809-job-queue-unhandled-state branch July 11, 2023 09:27
Labels
release-notes:bug-fix Automatically categorise as bug fix in release notes
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Assertion Error of unexpected job status can happen with JOB_QUEUE_DO_KILL_NODE_FAILURE
4 participants