Add node exclusion support #305

Merged
merged 6 commits into flux-framework:master from dongahn:node-exclusion on Apr 10, 2018

Conversation

@dongahn
Contributor

dongahn commented Apr 3, 2018

This is an experimental PR. Please don't merge yet.

As requested in flux-framework/flux-core#1415 (comment), I added rudimentary support for excluding or including a node resource by the node's name.

To facilitate testing, this PR also adds node-exclusive scheduling support.

This PR has not been fully validated for backfill scheduling, and it has only been tested manually, with no additional test cases yet.

A node-exclusion command is valid only when the target node has nothing allocated or reserved.

This requires another experimental PR I will post to flux-core that includes flux wreck exclude and flux wreck include.
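
To make the exclusion rule above concrete, here is a minimal sketch, not code from this PR: exclude_node_byname and lookup_node_byname are hypothetical stand-ins, while resrc_state(), RESOURCE_IDLE, and the flux_log() message are taken from the review snippets further down. The rule is simply that the target node must exist and must have nothing allocated or reserved on it.

#include <errno.h>
#include <syslog.h>
#include <flux/core.h>
#include "resrc/resrc.h"   /* sched-internal resource API (resrc_t, resrc_state) */

/* Sketch only: lookup_node_byname() is a hypothetical helper; the real
 * handler lives in sched/sched.c and differs in detail. */
static int exclude_node_byname (flux_t *h, const char *hostname)
{
    resrc_t *node = lookup_node_byname (hostname);

    if (!node) {
        flux_log (h, LOG_DEBUG,
                  "attempt to exclude nonexistent node (%s).", hostname);
        errno = ENOENT;   /* surfaces as "No such file or directory" */
        return -1;
    }
    if (resrc_state (node) != RESOURCE_IDLE) {
        /* something is allocated or reserved on this node; refuse */
        errno = EINVAL;
        return -1;
    }
    /* mark the node excluded so subsequent scheduling passes skip it */
    return 0;
}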

@codecov-io

codecov-io commented Apr 3, 2018

Codecov Report

Merging #305 into master will decrease coverage by 0.13%.
The diff coverage is 64.48%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #305      +/-   ##
==========================================
- Coverage   74.12%   73.98%   -0.14%     
==========================================
  Files          49       49              
  Lines        9290     9511     +221     
==========================================
+ Hits         6886     7037     +151     
- Misses       2404     2474      +70
Impacted Files           Coverage           Δ
resrc/resrc.h            100% <ø>           (ø) ⬆️
sched/sched_fcfs.c       94.38% <100%>      (+0.06%) ⬆️
resrc/resrc_reqst.c      90.98% <100%>      (+0.07%) ⬆️
sched/sched_backfill.c   92.4% <100%>       (+1.95%) ⬆️
resrc/resrc_flow.c       41.05% <33.33%>    (-0.14%) ⬇️
sched/flux-waitjob.c     84.42% <44.44%>    (-0.7%) ⬇️
sched/sched.c            71.95% <55.94%>    (-2%) ⬇️
resrc/resrc.c            83.14% <79.26%>    (+0.02%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 722e241...5ed2f7c. Read the comment docs.

@coveralls

coveralls commented Apr 3, 2018

Coverage Status

Coverage decreased (-0.2%) to 75.675% when pulling 5ed2f7c on dongahn:node-exclusion into 722e241 on flux-framework:master.

@dongahn

Contributor Author

dongahn commented Apr 4, 2018

@SteVwonder: I would appreciate it if you could review the code for me at this point.

FYI -- the action trigger routine is getting complex, and I will need to refactor it in the near future.

@SteVwonder
Member

SteVwonder left a comment

LGTM. Thanks @dongahn for putting together these splash changes in sched!

My comments are not blockers for this PR. I am fine with merging as is, although I'm concerned about the applicability to the exact scenario outlined in flux-framework/flux-core#1415 (see my comments below about line 1891). @dongahn, just let me know if you would like this merged now or if you want to make any additional changes.

flux_log (h, LOG_DEBUG,
"attempt to exclude nonexistent node (%s).", hostname);
goto error;
} else if (resrc_state (node) != RESOURCE_IDLE

@SteVwonder

SteVwonder Apr 4, 2018

Member

So if the node currently has a job running on it, then this function will just log an error, correct? This seems reasonable for the case where the queue is empty and thus the node is idle. In the scenario that @trws is dealing with, I presume the queue is full and the node is continually having work scheduled on it. That really limits the window where the node can be excluded. Would it be possible to also mark an allocated node as excluded?

@SteVwonder

SteVwonder Apr 4, 2018

Member

Come to think of it, in a different scenario where a backfilling scheduler is loaded, then the resource will almost always have a reservation on it. This will potentially make excluding a node impossible. I am not saying we need to make the reservation case work to merge this PR, but it is something to keep in mind for later.

@dongahn

dongahn Apr 5, 2018

Author Contributor

Good point. This was meant to start simple for splash. If it doesn't work for that case, we should expand the support.

What should the semantics be, though, when the target node is allocated and/or reserved?

The allocated job can be killed before the node gets excluded.

For reservations, it seems we should release all reservations globally and rerun schedule_jobs?

@SteVwonder

SteVwonder Apr 5, 2018

Member

For allocated, I see two possible options. 1) We kill the currently running process on the node and then mark the node excluded. 2) We immediately mark the node excluded, wait for the running process to die (or the walltime limit to be reached), and then keep the state of the node excluded (i.e., do not flip it back to idle). I do not have a strong preference either way, but my current vote is for option 2. I think that option 2 is easier since it doesn't require interfacing with wreck to kill the running process. Additionally, if users want functionality similar to option 1, they can mark the node excluded and then issue a wreck kill to whatever job was running on that node. Maybe @trws can chime in with what functionality he would prefer for splash?

For reserved, I agree with you, @dongahn.
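
To make option 2 concrete, here is a minimal sketch of the release path it implies; resrc_state_set() is a hypothetical setter and this is not the PR's actual code. When a job finishes, a node only goes back to IDLE if it was not marked EXCLUDED in the meantime.

/* Sketch of option 2 only -- not this PR's code. */
static void release_node (resrc_t *node)
{
    if (resrc_state (node) == RESOURCE_EXCLUDED)
        return;                              /* stay excluded; do not flip back */
    resrc_state_set (node, RESOURCE_IDLE);   /* hypothetical setter */
}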

@dongahn

dongahn Apr 5, 2018

Author Contributor

Great. For allocated, maybe we can provide an option to the exclude command so that one can choose either semantics. I will see what can be done.

Thanks @SteVwonder!

goto error;
} else if (resrc_state (node) != RESOURCE_EXCLUDED
&& resrc_state (node) != RESOURCE_IDLE
&& resrc_state (node) != RESOURCE_INVALID) {

@SteVwonder

SteVwonder Apr 4, 2018

Member

Should the include RPC convert an INVALID node to IDLE? I was thinking that should be an error. Although, even if INVALID was removed from here, someone could still mark the node EXCLUDED and then flip it to IDLE via this function. So I guess the ultimate question is, should include and exclude be able to change the state of an INVALID node? Doesn't need to be addressed in this PR, but something we should think about.

@dongahn

dongahn Apr 5, 2018

Author Contributor

Yeah. I may be wrong, but for some reason the resrc vertex starts as INVALID, and I needed this to cover the case where the vertex has never been scheduled. I could have fixed the resrc code to set IDLE as the starting state, but I wasn't sure what the impact would be. I can certainly take a look at this when I have more time.

@SteVwonder

SteVwonder Apr 5, 2018

Member

Oh, right. I forgot that resources start as INVALID. Not now, but we should probably change that (I'll open a new ticket). We should also probably formally define somewhere what exactly these states mean. I was under the impression that INVALID was a kind of catch-all for "broken", but I couldn't find any documentation on states anywhere. Do you know of any documentation of that sort?
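
As a rough starting point for that documentation, here is a sketch of the kind of summary being asked for, limited to the states named in this thread; the actual enum in resrc.h has more members, and the type name here is illustrative.

/* Illustrative summary only -- not the actual resrc.h definitions. */
typedef enum {
    RESOURCE_INVALID,    /* initial state; the vertex has never been scheduled */
    RESOURCE_IDLE,       /* available: nothing allocated or reserved on it */
    RESOURCE_EXCLUDED,   /* administratively removed from scheduling by name */
    /* ... allocated/reserved and other states elided ... */
} resource_state_t;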

@dongahn

dongahn Apr 5, 2018

Author Contributor

Thanks @SteVwonder. Yes, we will definitely need a doc for resource states and transitions.

@dongahn

Contributor Author

dongahn commented Apr 5, 2018

Thanks @SteVwonder for reviewing! I still don't recommend merging this yet. I need to cover the backfill case for node-exclusive scheduling and add test cases.

But I wanted to get a second pair of eyes on this for @trws. Please see the additional semantics question above.

@dongahn

Contributor Author

dongahn commented Apr 5, 2018

@SteVwonder: I reworked the node-exclusive scheduling support and force-pushed. It's not perfect, but that's as much as I could do without changing the design too much: 85d6002

Once I add some test cases for node exclusion/inclusion and the different exclusion semantics options we discussed above, this PR should be ready for full review.

@SteVwonder

Member

SteVwonder commented Apr 6, 2018

The addition of the reason output parameter is really nice! I like that refactor a lot!

@dongahn dongahn force-pushed the dongahn:node-exclusion branch from 7e46744 to d1db626 Apr 9, 2018

@dongahn

Contributor Author

dongahn commented Apr 9, 2018

@SteVwonder, @grondo: I just forced a push that has test cases and I believe this is ready for a full review.

@dongahn

Contributor Author

dongahn commented Apr 9, 2018

FYI: I think the CI test fails because this PR needs flux-framework/flux-core#1418.

@dongahn dongahn force-pushed the dongahn:node-exclusion branch 2 times, most recently from a47656e to e2d0016 Apr 9, 2018

@SteVwonder

Member

SteVwonder commented Apr 10, 2018

Thanks @dongahn! The new tests LGTM. The additional integration with wreck to kill running jobs on excluded resources is very nice! I'm all for merging once we get the green light from Travis.

(Making a note here so it gets cross referenced in the relevant issues....) Once we have command line utilities to query the state of resources and get the hostlist for a running/completed job, we should revisit these tests.

@dongahn

Contributor Author

dongahn commented Apr 10, 2018

For some reason, one of the new test cases fails on Travis... Oh well...

@grondo

Contributor

grondo commented Apr 10, 2018

Here's output from the failing test in case it helps:

Found ./flux-sched-0.4.0-100-g145e627/_build/sub/t/t1007-exclude.output
expecting success: 
    adjust_session_info 4 &&
    flux hwloc reload ${excl_4N4B} &&
    flux module load sched sched-once=true node-excl=true &&
    flux wreck exclude cab1234 &&
    timed_wait_job 5 &&
    flux submit -N 1 sleep 0 &&
    flux submit -N 1 sleep 0 &&
    flux submit -N 1 sleep 0 &&
    flux submit -N 1 sleep 0 &&
    timed_wait_job 5 submitted &&
    state=$(flux kvs get -j $(job_kvs_path 4).state) &&
    test ${state} = "submitted"
submit: Submitted jobid 1
submit: Submitted jobid 2
submit: Submitted jobid 3
submit: Submitted jobid 4
flux-waitjob: 4 already started (submitted)
flux-waitjob: 4 already reached the target state
flux-waitjob: wait_job_complete: completion detected
ok 1 - excluding a node with no job allocated or reserved works
excluding a node with no job allocated or reserved works
expecting success: 
    flux wreck include cab1234 &&
    timed_wait_job 4 &&
    state=$(flux kvs get -j $(job_kvs_path 4).state) &&
    test ${state} = "complete"
flux-waitjob: 4 already started (runrequest)
not ok 2 - including an excluded node works
expecting success: 
    adjust_session_info 1 &&
    flux module remove sched &&
    flux module load sched sched-once=true node-excl=true &&
    flux submit -N 1 sleep 120 &&
    timed_wait_job 5 running &&
    flux wreck exclude cab1234 &&
    state=$(flux kvs get -j $(job_kvs_path 5).state) &&
    test ${state} = "running" &&
    flux wreck cancel -f 5 &&
    state=$(flux kvs get -j $(job_kvs_path 5).state) &&
    test ${state} = "complete" &&
    flux wreck include cab1234
flux-waitjob: waitjob_cb: completion notified
flux-waitjob: waitjob_cb: completion notified
submit: Submitted jobid 5
flux-waitjob: 5 already started (runrequest)
flux-waitjob: waitjob_cb: completion notified
wreck: Sending SIGKILL to 5
ok 3 - excluding a node does not kill the jobs
excluding a node does not kill the jobs
expecting success: 
    adjust_session_info 1 &&
    flux submit -N 1 sleep 120 &&
    timed_wait_job 5 running &&
    flux wreck exclude --kill cab1235 &&
    state=$(flux kvs get -j $(job_kvs_path 6).state) &&
    test ${state} = "complete" &&
    flux wreck include cab1235
submit: Submitted jobid 6
flux-waitjob: 6 already started (runrequest)
flux-waitjob: waitjob_cb: completion notified
ok 4 - -k option kills the jobs using the node
-k option kills the jobs using the node
expecting success: 
    adjust_session_info 4 &&
    flux module remove sched &&
    flux module load sched sched-once=true node-excl=true plugin=sched.backfill &&
    timed_wait_job 5 &&
    flux submit -N 4 sleep 0 &&
    flux submit -N 4 sleep 0 &&
    flux submit -N 4 sleep 0 &&
    flux submit -N 4 sleep 0 &&
    timed_wait_job 5 submitted &&
    flux wreck exclude -k cab1235 &&
    state=$(flux kvs get -j $(job_kvs_path 7).state) &&
    test ${state} = "complete"
submit: Submitted jobid 7
submit: Submitted jobid 8
submit: Submitted jobid 9
submit: Submitted jobid 10
flux-waitjob: 10 already started (submitted)
flux-waitjob: 10 already reached the target state
flux-waitjob: wait_job_complete: completion detected
ok 5 - excluding a node with reservations works
excluding a node with reservations works
expecting success: 
    flux module remove sched &&
    flux module load sched node-excl=true &&
    test_must_fail flux wreck exclude foo &&
    test_must_fail flux wreck exclude -k bar &&
    test_must_fail flux wreck include foo
flux-wreck: Unable to exclude node foo: No such file or directory
flux-wreck: Unable to exclude node bar: No such file or directory
flux-wreck: Unable to include node foo: No such file or directory
ok 6 - attempting to exclude or include an invalid node must fail
attempting to exclude or include an invalid node must fail
# failed 1 among 6 test(s)
1..6
flux-waitjob: error in flux_reactor_run: Success

@dongahn

Contributor Author

dongahn commented Apr 10, 2018

@grondo: Wow. How did you get this? This should be very helpful for debugging CI issues!

@dongahn

Contributor Author

dongahn commented Apr 10, 2018

flux-waitjob: error in flux_reactor_run: Success

Man... another of this kind... Will fix this.

@grondo

Contributor

grondo commented Apr 10, 2018

Wow. How did you get this? This should be very helpful for debugging CI issues!

At the bottom of the Travis logfile (when the CI session fails), the Travis script will cat all *.output and *.broker.log files it finds. Travis collapses these, so you have to know to look for them and click to expand (or look at the raw logfile).

Looking at the output I pasted above, it seems like maybe something got messed up in the cut and paste. I'll double check to make sure.

@grondo

Contributor

grondo commented Apr 10, 2018

E.g. try looking at this URL for the log: https://travis-ci.org/flux-framework/flux-sched/jobs/364374686#L5003

(Edit: not sure if the line # linking is working with Travis, just look for line 5003 in the output)

@dongahn dongahn force-pushed the dongahn:node-exclusion branch 2 times, most recently from 2b7f196 to b1d10b0 Apr 10, 2018

dongahn added some commits Apr 3, 2018

resrc: Add node exclusive scheduling
Add node-exclusive=[true|false] as a sched load option.

Add the exclusivity check in the resrc_tree_search function
so that we don't end up overscheduling under node-exclusive
scheduling mode.

The purpose of the else branch in this function, which is
the branch taken when the resource match test on the
visiting resource vertex fails, is to select the high-level
resources leading to each exclusively allocated resource.
This is needed when the job request is partially specified,
for example, when only 1 core is requested. It allows
for selecting the cluster, node, and socket that contain
this particular core.

For node-exclusive scheduling, however, this branch
must check exclusivity so that we stop recursing down
the hardware tree if the node vertex is already allocated
to another job. (An illustrative sketch of this check
follows these commit messages.)
sched: Add node exclusion/inclusion support
Add support for excluding or including a node resource by
the node's name.

Introduce two message handlers accordingly.

Emit a resource event when node inclusion succeeds.

Add support for multiple ranks running on the same node.

Semantics:

"flux wreck exclude nodename" leaves the jobs allocated
to the node running and releases all of the reservations
made on that node.

"flux wreck exclude --kill nodename" kills the jobs
allocated to the node and releases all reservations.
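
Referring back to the "resrc: Add node exclusive scheduling" commit above, here is a sketch of the idea behind the added check. It is illustrative only: the predicate and recursion helpers are stand-ins, and the real resrc_tree_search signature and control flow differ.

#include <stdbool.h>
#include "resrc/resrc.h"         /* sched-internal resource API */
#include "resrc/resrc_reqst.h"   /* assumed header for resrc_reqst_t */

/* When the visiting vertex does not itself match the request, we normally
 * keep recursing so that the ancestors of a deeper match (the cluster,
 * node, and socket of a requested core) can still be selected.  Under
 * node-exclusive scheduling, stop at a node vertex that is already
 * allocated to another job. */
static int visit (resrc_t *resrc, resrc_reqst_t *reqst, bool node_excl)
{
    if (resrc_matches (resrc, reqst))              /* stand-in predicate */
        return select_resrc (resrc, reqst);        /* stand-in selection */

    if (node_excl && is_node (resrc) && is_allocated (resrc))
        return 0;                                  /* prune this subtree */

    return visit_children (resrc, reqst, node_excl);  /* stand-in recursion */
}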

@dongahn dongahn force-pushed the dongahn:node-exclusion branch from b1d10b0 to f3ec928 Apr 10, 2018

dongahn added some commits Apr 9, 2018

test: Add node exclusion and inclusion tests
Add test cases for flux wreck exclude|include nodename.
sched: Add support for "failed" event in job state machine
Deal with a "failed" event within the action trigger routine,
which can be emitted from wreck.
waitjob: logging cleanup
In particular, when waitjob finishes successfully, flux-waitjob
prints out "flux-waitjob: error in flux_reactor_run: Success".

Use log_msg instead of log_err to fix it.
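
For context on the waitjob fix: log_err() in the logging helpers the commit refers to appends the errno string to the message, which is why a clean exit from flux_reactor_run() was reported as an error ending in ": Success", while log_msg() prints the message as-is. A before/after sketch (the exact message text is illustrative):

/* Before: errno is 0 on a normal exit, so log_err() printed
 * "flux-waitjob: error in flux_reactor_run: Success". */
log_err ("error in flux_reactor_run");

/* After: log_msg() carries no errno suffix. */
log_msg ("flux_reactor_run completed");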

@dongahn dongahn force-pushed the dongahn:node-exclusion branch from d2b6e10 to 5ed2f7c Apr 10, 2018

@dongahn

Contributor Author

dongahn commented Apr 10, 2018

@SteVwonder: I fixed the CI test failures and this is ready to go in.

@SteVwonder SteVwonder merged commit 38abd39 into flux-framework:master Apr 10, 2018

2 of 4 checks passed

codecov/patch: 64.48% of diff hit (target 74.12%)
codecov/project: 73.98% (-0.14%) compared to 722e241
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage decreased (-0.2%) to 75.675%
@SteVwonder

Member

SteVwonder commented Apr 10, 2018

Thanks @dongahn!
