Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wreck: add support for job epilogs #1513

Merged
merged 7 commits into from May 10, 2018

Conversation

Projects
None yet
5 participants
@grondo
Copy link
Contributor

grondo commented May 9, 2018

This is an experimental PR to address the request for job epilogs detailed in #1497.

Instead of choosing either an epilog that runs before the job is "complete" or after, this PR adds both, an epilog.pre that runs during a new "completing" state, and an epilog.post that runs after the job is "complete" (and thus resources have been returned to the system).

flux-wreckrun will now exit on completing state, since it doesn't need to wait for the epilog.pre to finish, but keeping the job in "completing" state is meant to hold resources during epilog execution. There is no concept of partial release of resources in wreck, so the job doesn't hit "complete" state until all epilog.pre scripts have finished.

As with the wreck job environment, epilog.pre and post may be set globally in lwj.epilog.{pre,post}, or per-job via new -x, --epilog=SCRIPT and -p, --postscript=SCRIPT options to wreckrun/submit. Per-job epilog overrides the globally set script.

Happy to make any changes requested here, this is just a first try to address the issue (and keep in mind this is only the proposed solution for wreck exec system, we'll have a different scheme for the replacement)

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 9, 2018

Coverage Status

Coverage decreased (-0.02%) to 78.815% when pulling 60595fc on grondo:wreck-epilog into dc91642 on flux-framework:master.

@grondo grondo force-pushed the grondo:wreck-epilog branch from 2122318 to a1a0a21 May 9, 2018

@grondo grondo force-pushed the grondo:wreck-epilog branch from a1a0a21 to 60595fc May 9, 2018

grondo added some commits May 8, 2018

wreck: add support for epilog.pre and epilog.post scripts
Add support for user-specific scripts run after all tasks have exited
via a new epilog.lua wrexecd plugin. Two types of epilog scripts are
supported:

 - An "epilog.pre" script is run after all tasks have exited but before
 the job is placed in the "complete" state. During this time, the job
 will be in the "completing" state.

 - An "epilog.post" script is run after the job state has been set
 to "complete". Note, this script may run after resources associated
 with the job have been returned to the system.

Similar to the wreck environment variables, global `lwj.epilog.pre` and
`lwj.epilog.post` may be set and will be inherited by all jobs run within
the current instance. Per-job epilogs set in the kvs directory of the
job will override the global settings.
wreck: allow wreckrun to exit at completing state
Once a job hits completing state and all task io has complated,
wreckrun can exit. There is no reason to wait for the job epilog
to finish.
wreck: add options to wreckrun and submit for epilogs
Add -x, --epilog=script and -p, --postscript=SCRIPT convenience options
for wreckrun and flux-submit to set per-job epilog.pre and epilog.post.
testsuite: add t2000-wreck-epilog.t
Add t2000-wreck-epilog.t to verify epilog.pre and post functionality.

@grondo grondo force-pushed the grondo:wreck-epilog branch from 60595fc to b5f6ddd May 9, 2018

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented May 9, 2018

Codecov Report

❗️ No coverage uploaded for pull request base (master@54fcc97). Click here to learn what that means.
The diff coverage is 100%.

@@           Coverage Diff            @@
##             master   #1513   +/-   ##
========================================
  Coverage          ?   78.6%           
========================================
  Files             ?     165           
  Lines             ?   30816           
  Branches          ?       0           
========================================
  Hits              ?   24223           
  Misses            ?    6593           
  Partials          ?       0
Impacted Files Coverage Δ
src/common/libjsc/jstatctl.c 73.66% <ø> (ø)
src/modules/wreck/wrexecd.c 74.62% <100%> (ø)
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 9, 2018

BTW, the change to add a "completing" state will need at least one patch to the t1000-jsc.t test in flux-sched.

diff --git a/t/t1000-jsc.t b/t/t1000-jsc.t
index 03fcd3a..1bab062 100755
--- a/t/t1000-jsc.t
+++ b/t/t1000-jsc.t
@@ -17,13 +17,15 @@ tr2="submitted->allocated"
 tr3="allocated->runrequest"
 tr4="runrequest->starting"
 tr5="starting->running"
-tr6="running->complete"
+tr6="running->completing"
+tr7="completing->complete"
 trans="$tr1
 $tr2
 $tr3
 $tr4
 $tr5
-$tr6"
+$tr6
+$tr7"

@dongahn, any other non-test related fallout from this potential change to flux-sched?

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented May 9, 2018

@dongahn, any other non-test related fallout from this potential change to flux-sched?

Yes, I think the sched's finite state machine needs to be adjusted for this change. Probably, here.

To begin with, we will probably want to make J_RUNNING -> J_COMPLETING transition a NOOP and then the J_COMPLETING -> J_COMPLETE transition inherit the current action for J_RUNNING->J_COMPLETE.

If we need a further action for J_RUNNING -> J_COMPLETING later, we can expand on that switch statement.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 9, 2018

Thanks for taking a look!

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented May 10, 2018

I can add those changes and look at how the sched behaves if you want. Just let me know.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 10, 2018

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented May 10, 2018

I agree we should always enhance our testing, but I suspect this particular case may just have worked "by accident" in the existing test cases other than JSC.

My guess is, J_COMPLETING in the job_state_t enumerator just inherited the value of the old J_COMPLETE. And shifting by one just worked out to be okay for sched.

Maybe this is a good time to assign an unique integer to each of the state in this enum?

This way, when a new state is added, we can do better checks...

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 10, 2018

Ah good point, we probably should have reserved a block of integers between RUNNING and COMPLETE that would be invisible states to the scheduler for now.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented May 10, 2018

@grondo: Thanks. As discussed, please propose the enum value change. I will adjust that in flux-sched and do a PR.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 10, 2018

Thanks @dongahn!

We can iterate on the enum values. In my first cut I regrouped the values into WRECK vs Scheduler states, and left (arbitrary) value holes in case we need to insert something in the near term:

diff --git a/src/common/libjsc/jstatctl.h b/src/common/libjsc/jstatctl.h
index 8b8993b..3874ccf 100644
--- a/src/common/libjsc/jstatctl.h
+++ b/src/common/libjsc/jstatctl.h
@@ -39,23 +39,38 @@ extern "C" {
  * please refer to README.md
  */
 typedef enum {
-    J_NULL = 1,  /*!< The state has yet to be assigned */
-    J_RESERVED,  /*!< Reserved by the program execution service */
-    J_SUBMITTED, /*!< Submitted to the system */
-    J_PENDING,   /*!< Pending */
-    J_SCHEDREQ,  /*!< Resources requested to be selected */
-    J_SELECTED,  /*!< Assigned to requested resource in RDL */
-    J_ALLOCATED, /*!< Got allocated/contained by the program executoin service 
-    J_RUNREQUEST,/*!< Requested to be executed */
-    J_STARTING,  /*!< Starting */
-    J_STOPPED,   /*!< Stopped *including init barrier hit for a tool) */
-    J_RUNNING,   /*!< Running */
-    J_CANCELLED, /*!< Cancelled */
-    J_COMPLETING,/*!< Completing */
-    J_COMPLETE,  /*!< Completed */
-    J_REAPED,    /*!< Reaped */
-    J_FAILED,    /*!< Failed */
-    J_FOR_RENT   /*!< Space For Rent */
+    J_NULL =       0,  /*!< The state has yet to be assigned */
+
+    /* WRECK job initial condition states:
+     */
+    J_RESERVED =   1, /*!< Reserved by the program execution service */
+    J_SUBMITTED =  2, /*!< Submitted to the system */
+
+    /* Scheduler internal states:
+     */
+    J_PENDING =   11, /*!< Pending */
+    J_SCHEDREQ =  12, /*!< Resources requested to be selected */
+    J_SELECTED =  13, /*!< Assigned to requested resource in RDL */
+    J_ALLOCATED = 14, /*!< Got alloc/contained by the program exec service */
+
+    /* WRECK job execution states:
+     */
+    J_RUNREQUEST= 21, /*!< Requested to be executed */
+    J_STARTING =  22, /*!< Starting */
+    J_STOPPED =   23, /*!< Stopped (including init barrier hit for a tool) */
+    J_RUNNING =   24, /*!< Running */
+    J_COMPLETING= 26, /*!< Completing (all tasks exited, epilog running) */
+
+    /* WRECK job terminal states:
+     */
+    J_CANCELLED = 51, /*!< Cancelled (before execution) */
+    J_COMPLETE =  52, /*!< Completed */
+    J_FAILED =    53, /*!< Failed (before exec) */
+
+    /* Scheduler post exec states:
+     */
+    J_REAPED =   101, /*!< Reaped */
+    J_FOR_RENT = 102, /*!< Space For Rent */
 } job_state_t;
 
 typedef int (*jsc_handler_f)(const char *base_jcb, void *arg, int errnum);

Unfortunately this breaks one of the t2001-jsc.t tests:

expecting success: 
    flux jstat update 1 state-pair '{"state-pair": {"ostate": 13, "nstate": 12}}' &&
    flux kvs get --json lwj.0.0.1.state > output.9.1 &&
    cat >expected.9.1 <<-EOF &&
cancelled
EOF
    test_cmp expected.9.1 output.9.1 

--- expected.9.1	2018-05-10 16:43:32.324534000 +0000
+++ output.9.1	2018-05-10 16:43:32.321534000 +0000
@@ -1 +1 @@
-cancelled
+schedreq
not ok 13 - jstat 9: update state-pair
#	
#	    flux jstat update 1 state-pair '{"state-pair": {"ostate": 13, "nstate": 12}}' &&
#	    flux kvs get --json lwj.0.0.1.state > output.9.1 &&
#	    cat >expected.9.1 <<-EOF &&
#	cancelled
#	EOF
#	    test_cmp expected.9.1 output.9.1 
#	

I fixed this by updating the integers in the json encoded "state-pair" update, but this made me worry -- are there other places where raw integers for states are being used? Is this fix ok?

diff --git a/t/t2001-jsc.t b/t/t2001-jsc.t
index 70c2cb0..ea2109c 100755
--- a/t/t2001-jsc.t
+++ b/t/t2001-jsc.t
@@ -207,7 +207,7 @@ test_expect_success 'jstat 8: query detects bad inputs' '
 '
 
 test_expect_success 'jstat 9: update state-pair' "
-    flux jstat update 1 state-pair '{\"state-pair\": {\"ostate\": 13, \"nstate\
+    flux jstat update 1 state-pair '{\"state-pair\": {\"ostate\": 24, \"nstate\
     flux kvs get --json $(flux wreck kvs-path 1).state > output.9.1 &&
     cat >expected.9.1 <<-EOF &&
 cancelled
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 10, 2018

One other thing I noticed while looking here is that JSC has a "stopped" state, but there is no equivalent in WRECK. WRECK sets the job state to "sync" when stop-children-in-exec is used and all tasks are stopped at exec(). Presuming this is the intent of the "stopped" state I can update wrexecd to use "stopped" as the state instead of "sync"

libjsc: hard-code enumeration values for job_state_t
Problem: Insertion of new job states into the job_state_t enumeration
may inadvertently cause re-enumeration of other state values, triggering
hard to find bugs and confusion for developers. Also, the states encoded
in job_state_t are a mixture of scheduler and execution system states.

Add explicit integers for job states encoded in the enumeration, and
regroup values so that it is easy for developers to see which states
are scheduler specific vs WRECK execution system states. Leave space
between groups of states for near-term additions.
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 10, 2018

Ok, I went ahead and pushed the current proposal.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented May 10, 2018

I like the grouping!

I fixed this by updating the integers in the json encoded "state-pair" update, but this made me worry -- are there other places where raw integers for states are being used? Is this fix ok?

Yes, this fix is okay. This is a testing purpose only and there should be no other places where raw integers are used. There is an JSC utility call that converts the state names into the raw integers but I'm not sure how to use it under sharness testers.

One other thing I noticed while looking here is that JSC has a "stopped" state, but there is no equivalent in WRECK. WRECK sets the job state to "sync" when stop-children-in-exec is used and all tasks are stopped at exec(). Presuming this is the intent of the "stopped" state I can update wrexecd to use "stopped" as the state instead of "sync"

There is currently no user of this state so it actually makes sense to change J_STOPPED to J_SYNC.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 10, 2018

There is currently no user of this state so it actually makes sense to change J_STOPPED to J_SYNC.

Ok, I will make that change as well here just to keep things moving along.

libjsc: rename J_STOPPED to J_SYNC
Problem: The libjsc J_STOPPED state ("stopped") is not actually set
by the wreck execution system. It is presumed this was meant to be
the "sync" state set by wreck when tasks are all running but stopped
in exec(2).

Update JSC to use "sync" and J_SYNC instead of "stopped"/J_STOPPED.
@garlick

This comment has been minimized.

Copy link
Member

garlick commented May 10, 2018

Just poking at this a little bit, and a couple of observations which may be OK but wanted to check:

  • epilog.pre missing - error logged to flux dmesg, but job status is unaffected
  • epilog.pre exits with nonzero exit code - ignored
  • epilog.pre emits stdout - ignored
  • epilog.pre emits stderr - output logged as error to flux dmesg, lines not split into separate log messages
@garlick

This comment has been minimized.

Copy link
Member

garlick commented May 10, 2018

By the way I concur, the comments and grouping of those job states really does add clarity!

@garlick

This comment has been minimized.

Copy link
Member

garlick commented May 10, 2018

Restarted a couple of builders. One had stalled out after

FAIL: t0016-cron-faketime.t 4 - flux-cron tab works for any day midnight

the other seems to have failed to completely shut down in the valgrind test, and thus the test failed due to various un-freed module memory.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 10, 2018

Just poking at this a little bit, and a couple of observations which may be OK but wanted to check:

epilog.pre missing - error logged to flux dmesg, but job status is unaffected

Unfortunately I'm not sure there is a valid state to transition to for failed epilog. The "job" is already done, the state is "completing". Perhaps we should just call these epilog scripts "best effort" for the wreck prototype?

Maybe something that could help is for flux-wreckrun and submit to generate an error if the path supplied to --epilog doesn't exist or isn't executable at runtime?

epilog.pre exits with nonzero exit code - ignored

Same answer I guess. The exit status was purposely ignored because nothing can really be done about it.

epilog.pre emits stdout - ignored

FYI, the stdout/err of epilog scripts are coming from wrexecd stdout/err, which is forwarded back to the job module via the cmb.exec service. Right now all stdout of wrexecd is ignored, and stderr is logged. This was meant to transition to the per-rank "job error log" kz file, but since kz writes are currently synchronous that work was delayed.

Should we also be logging stdout do you think? Maybe the stdout of wrexecd should explicitly be set to /dev/null

epilog.pre emits stderr - output logged as error to flux dmesg, lines not split into separate log messages

Hm, this one is more interesting. I overlooked the fact that a cmb.exec io message could contain more than one line. I guess we have to split the lines on the receiving side in the job module.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented May 10, 2018

Hm, this one is more interesting. I overlooked the fact that a cmb.exec io message could contain more than one line. I guess we have to split the lines on the receiving side in the job module.

I opened #1516 to suggest that flux_log should do it for us.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented May 10, 2018

I'm fine with all that. Shall I press the button when this turns green?

@garlick

This comment has been minimized.

Copy link
Member

garlick commented May 10, 2018

Failed again in valgrind. Restarted.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented May 10, 2018

Finally! Travis is touchy today!

@garlick garlick merged commit b531e07 into flux-framework:master May 10, 2018

4 checks passed

codecov/patch No report found to compare against
Details
codecov/project No report found to compare against
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls First build on master at 78.832%
Details
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 10, 2018

Thanks, sorry this one was such a pain!

@grondo grondo deleted the grondo:wreck-epilog branch May 10, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.