Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wreck: check for invalid tasks to run on node and issue fatal error #901

merged 3 commits into from Nov 12, 2016


None yet
4 participants
Copy link

grondo commented Nov 11, 2016

This should resolve the specific issue from #900. A reproducer was added under t/issues and I've verified it fails before and passes after this PR is applied.

There may be other cases where unexpected values cause hangs during wreck job launch, but those are not tackled here.

grondo added some commits Nov 11, 2016

wreck: allow fatal kvs error without wrexecd exit
Allow a "fatalerror" to be written to kvs job "error log" without
actually causing wrexecd to immediately exit in wlog_fatal().

This is required for cases where we want to register and log a
fatal error, but allow wrexecd to continue startup protocol so
that barrier and kvs fence for other wrexecd processes can be
completed normally. The fatal kvs error will then be detected
and the job will terminate with a failed state instead of hang.
wreck: fatal error if nprocs <= 0
Register a fatal error in kvs if local number of processes to run
is detected as less than, or equal to, zero. This typically means
an "invalid specification" was inserted into the KVS for the lwj
in question, e.g. rank.X.cores == 0.

Do not, however, allow wrexecd to immediately terminate in
wlog_fatal(), instead allow all wrexecds to reach the "starting"
state transition kvs_fence(). After, wrexecd on rank 0 will notice
the fatal error in kvs and will abort the job with a "failed" state.

Fixes #900

@grondo grondo added the review label Nov 11, 2016


This comment has been minimized.

Copy link

coveralls commented Nov 11, 2016

Coverage Status

Coverage decreased (-48.2%) to 27.699% when pulling 374ca3c on grondo:issue#900 into da42f80 on flux-framework:master.

t/issues: add test for issue#900
Add test under the issues test driver to ensure wreck jobs abort
with invalid value under rank.X.cores kvs entry.

@grondo grondo force-pushed the grondo:issue#900 branch from 374ca3c to 3581d0e Nov 11, 2016


This comment has been minimized.

Copy link

codecov-io commented Nov 11, 2016

Current coverage is 72.38% (diff: 100%)

Merging #901 into master will increase coverage by 0.03%

@@             master       #901   diff @@
  Files           157        157          
  Lines         27041      27043     +2   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
+ Hits          19563      19574    +11   
+ Misses         7478       7469     -9   
  Partials          0          0          
Diff Coverage File Path
•••••••••• 100% src/modules/wreck/wrexecd.c

Powered by Codecov. Last update da42f80...3581d0e


This comment has been minimized.

Copy link

coveralls commented Nov 11, 2016

Coverage Status

Coverage increased (+0.04%) to 75.985% when pulling 3581d0e on grondo:issue#900 into da42f80 on flux-framework:master.


This comment has been minimized.

Copy link

garlick commented Nov 12, 2016

Looks good!

@garlick garlick merged commit 87f14cd into flux-framework:master Nov 12, 2016

4 checks passed

codecov/patch 100% of diff hit (target 72.34%)
codecov/project 72.38% (+0.03%) compared to da42f80
continuous-integration/travis-ci/pr The Travis CI build passed
coverage/coveralls Coverage increased (+0.04%) to 75.985%

@garlick garlick removed the review label Nov 12, 2016

@grondo grondo referenced this pull request Nov 28, 2016


Create 0.6.0 release notes #916

@grondo grondo deleted the grondo:issue#900 branch Dec 1, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.