Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wreck: Add support for node exclusion and inclusion #1418

Merged
merged 2 commits into from Apr 10, 2018

Conversation

Projects
None yet
4 participants
@dongahn
Copy link
Contributor

dongahn commented Apr 3, 2018

This is an experimental PR. Please don't merge yet.

Add support for exclusion and inclusion by node name.

Only supported when the scheduler is loaded.

This is required to test out the new experimental feature added in sched: #PR 305

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 3, 2018

Coverage Status

Coverage increased (+0.03%) to 79.037% when pulling 7954711 on dongahn:node-exclusion into c3b3868 on flux-framework:master.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 9, 2018

@grondo: I made some minor modifications + test cases. Could you take a quick look at this?

flux-framework/flux-sched#305 will need this.

@dongahn dongahn referenced this pull request Apr 9, 2018

Merged

Add node exclusion support #305

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Apr 9, 2018

Codecov Report

Merging #1418 into master will increase coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1418      +/-   ##
==========================================
+ Coverage    78.7%   78.73%   +0.02%     
==========================================
  Files         163      163              
  Lines       30206    30206              
==========================================
+ Hits        23775    23783       +8     
+ Misses       6431     6423       -8
Impacted Files Coverage Δ
src/common/libkvs/kvs_watch.c 89.69% <0%> (-0.43%) ⬇️
src/common/libflux/message.c 81.01% <0%> (-0.36%) ⬇️
src/common/libflux/handle.c 83.91% <0%> (+0.24%) ⬆️
src/common/libutil/base64.c 95.77% <0%> (+0.7%) ⬆️
src/broker/modservice.c 80.58% <0%> (+0.97%) ⬆️
src/common/libflux/request.c 88.46% <0%> (+1.28%) ⬆️
src/modules/connector-local/local.c 74.59% <0%> (+1.63%) ⬆️
@grondo
Copy link
Contributor

grondo left a comment

Just a couple minor comments inline.

Also, a couple more general questions:

  • Would it be useful to also allow nodes to be excluded by rank, for example for testing? I'm not sure what happens now in the scheduler when you are running multiple brokers per node and you try to exclude that node by name.

  • Is there a corresponding command to query which nodes are currently excluded?

local resp, err = f:rpc ("sched.exclude", { node = nodename, kill = toboolean (self.opt.k) })
if not resp then
if err == "Function not implemented" and not self.opt.f then
prog:die ("Node exclusion is not supported when scheduler not loaded")

This comment has been minimized.

@grondo

grondo Apr 9, 2018

Contributor

I think and not self.opt.f will always be true here and could be removed (if I'm not mistaken). Same for the include command below.

local resp, err = f:rpc ("sched.include", { node = nodename })
if not resp then
if err == "Function not implemented" and not self.opt.f then
prog:die ("Node inclusion is not supported when scheduler not loaded")

This comment has been minimized.

@grondo

grondo Apr 9, 2018

Contributor

Same as above, is and not self.opt.f needed here?

This comment has been minimized.

@dongahn

dongahn Apr 9, 2018

Author Contributor

Thanks @grondo. I am not really a lua type. Will fix.

usage = "NODENAME",
handler = function (self, arg)
local nodename = arg[1]
local resp, err = f:rpc ("sched.include", { node = nodename })

This comment has been minimized.

@grondo

grondo Apr 9, 2018

Contributor

Oops forgot to add this comment.

You might want to check to see if arg[1] is nil here, and emit a useful error if so, e.g.

local nodename = arg[1]
if not nodename then
    self:die ("Required argument NODENAME missing.")
end

Right now, in flux-core anyway I get:

$ flux wreck include
flux-wreck: Node inclusion is not supported when scheduler not loaded

If sched is loaded, probably EPROTO?

return not not X
end

local nodename = arg[1]

This comment has been minimized.

@grondo

grondo Apr 9, 2018

Contributor

Same comment as below about checking for nil nodename

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 9, 2018

Would it be useful to also allow nodes to be excluded by rank, for example for testing?

Yes, this should be useful. I'll look into this to see if this can also be quickly done.

I'm not sure what happens now in the scheduler when you are running multiple brokers per node and you try to exclude that node by name.

Urg... This will be pretty convoluted and I realize this case needs some more work.

two parts.

  1. I am using this node indexing scheme in resrc.
    https://github.com/flux-framework/flux-sched/blob/master/resrc/resrc.c#L511

So when you have multiple ranks managing that node, only the first rank (say FR) that reports that node will get into that hash table and excluded from being scheduled.

I am wondering if changing the value type of this hash to be a list will address the issue. This depends on other parts of resrc which use this table, though.

  1. Also because https://github.com/flux-framework/flux-sched/blob/master/sched/sched.c#L1217 code doesn't recognize the excluded rank, FR will still be used. I think augmenting rs2rank should do this.

Is there a corresponding command to query which nodes are currently excluded?

Yes this should be useful.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 9, 2018

Thanks, @dongahn. Let me know if you want to save some of these items for future PRs and get this in to solve the immediate use case though!

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 9, 2018

Thanks, @dongahn. Let me know if you want to save some of these items for future PRs and get this in to solve the immediate use case though!

Since I will be on travel and such this week, it would be good to defer the query capability to a future PR. I will try to get to the other cases and see if I can address them quickly.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 9, 2018

Since I will be on travel and such this week, it would be good to defer the query capability to a future PR. I will try to get to the other cases and see if I can address them quickly.

Sounds good! Probably handling multiple brokers per node is also only a "nice to have" and could be delayed indefinitely as well.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 9, 2018

@grondo: it turned out exclusion by rank is a bit more involved. I will try to tackle that when I circle back to the R_lite where I will have to touch that rank resolution pass anyway.

I added support for the multiple ranks case and forced a push to flux-framework/flux-sched#305

And I also forced a push here to incorporate your suggestions. (All good catches, thanks)

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 10, 2018

@dongahn, thanks! Looks like this maybe needs a rebase on current flux-core master?

dongahn added some commits Apr 3, 2018

wreck: Add support for node exclusion and inclusion
Add support for exclusion and inclusion by node name.

Only supported when the scheduler is loaded.
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 10, 2018

Sorry about that. Rebased and pushed.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 10, 2018

Ok, will merge once Travis reports. Thanks!

@grondo grondo merged commit 83861e2 into flux-framework:master Apr 10, 2018

4 checks passed

codecov/patch Coverage not affected when comparing c3b3868...7954711
Details
codecov/project 78.73% (+0.02%) compared to c3b3868
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.03%) to 79.037%
Details
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 10, 2018

Thanks @grondo!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.