Add R_lite support #321
Add this so that the upper layer can pass the resource types to affect the lightweight R emitting logic within resrc. Prepare for R_lite support.
Add resrc_tree_serialize_lite as the main function that traverses the hardware hierarchy and emits the resources in the lightweight R format. Add support so that the upper layer can pass in the resource types to emit and specify how they and their children (i.e., the next-level resource type to be emitted) should be emitted. This way, the resrc layer remains resource-type agnostic while the upper layer passes in the input that controls emit behavior.
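To illustrate the idea of a type-agnostic traversal driven by caller-supplied policy, here is a minimal sketch. It is not the resrc code from this PR: `toy_resrc`, `policy_entry`, `emit_policy`, and `toy_serialize_lite` are hypothetical names, and the meaning assigned to "gather" (emit each resource individually) and "reduce" (fold resources into a count) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical emit policies; illustrative only, not the resrc enum. */
enum emit_policy { EMIT_SKIP = 0, EMIT_GATHER, EMIT_REDUCE };

/* One caller-supplied rule: how a given resource type should be emitted. */
struct policy_entry {
    const char *type;         /* resource type name, e.g. "node" */
    enum emit_policy policy;
};

/* Toy stand-in for a node in the hardware hierarchy. */
struct toy_resrc {
    const char *type;
    int id;
    const struct toy_resrc *children;
    size_t nchildren;
};

struct emit_stats {
    int gathered;   /* resources emitted one by one (assumed gather form) */
    int reduced;    /* resources folded into a count (assumed reduce form) */
};

/* The traversal never hard-codes a type; it only consults the map. */
static enum emit_policy lookup_policy (const struct policy_entry *map,
                                       size_t n, const char *type)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp (map[i].type, type) == 0)
            return map[i].policy;
    return EMIT_SKIP;   /* types the caller did not mention are not emitted */
}

static void toy_serialize_lite (const struct toy_resrc *r,
                                const struct policy_entry *map, size_t n,
                                struct emit_stats *out)
{
    assert (r != NULL && out != NULL);
    switch (lookup_policy (map, n, r->type)) {
    case EMIT_GATHER: out->gathered++; break;
    case EMIT_REDUCE: out->reduced++; break;
    case EMIT_SKIP: break;
    }
    for (size_t i = 0; i < r->nchildren; i++)
        toy_serialize_lite (&r->children[i], map, n, out);
}
```

With a map of `{"node", EMIT_GATHER}` and `{"core", EMIT_REDUCE}`, a cluster containing one node with two cores yields one gathered entry and a reduced count of two, while the "cluster" level is skipped because the caller never listed it.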
@grondo: one thing I couldn't test was whether the scheduler-generated R_lite has led to correct task affinity. If you know how to test this easily, please let me know.
Restarted the build as flux-framework/flux-core#1485 was just merged.
Codecov Report
@@ Coverage Diff @@
## master #321 +/- ##
==========================================
+ Coverage 74.25% 74.43% +0.17%
==========================================
Files 49 49
Lines 9540 9638 +98
==========================================
+ Hits 7084 7174 +90
- Misses 2456 2464 +8
I can check this out tomorrow, but can you explain in more detail what you
mean by R_lite resulting in correct affinity?
As of now, wreck is not using R_lite for any affinity. The plan iirc was to
add it only if necessary.
On Tue, Apr 24, 2018, 4:09 PM Dong H. Ahn ***@***.***> wrote:
@grondo: one thing I couldn't test was
whether the scheduler-generated R_lite has led to correct task affinity.
If you know how to test this easily, please let me know.
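Oh sorry I didn't know that. I thought I saw cpuset etc from the new code and mistakenly thought it does do affinity. I certainly didn't take a deep dive. Just so that I understand, R_lite has info on which particular core or set of cores are allocated to the job. But wreck currently only retrieves the count from it? So essentially the same as the old rank.core scheme? (Still good to have some exercise in preparation for the new execution engine)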
On Tue, Apr 24, 2018, 6:58 PM Dong H. Ahn ***@***.***> wrote:
As of now, wreck is not using R_lite for any affinity. The plan iirc was to
add it only if necessary
Oh sorry I didn't know that. I thought I saw cpuset etc from the new code
and mistakenly thought it does do affinity. I certainly didn't take a deep
dive. Just so that I understand, R_lite has info on which particular core
or set of cores are allocated to the job. But wreck *currently* only
retrieves the count from it? So essentially the same as the old rank.core
scheme? (Still good to have some exercise in preparation for the new
execution engine)
Yes, that's right. Though it would be trivial for wrexecd to at least set
CPU affinity for all tasks on a node/rank to the set supplied by the
scheduler in rank:N core of R_lite (so at least it is possible, unlike only
a core count)
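As a rough illustration of why a core set is more useful than a bare core count, the sketch below turns a core list string into a bitmask. It is not wrexecd code: the helper name `core_list_to_mask` is hypothetical, and the "0-3,6" string syntax and the 64-core limit are assumptions made for the example. On Linux, a mask like this could be copied into a `cpu_set_t` and applied with `sched_setaffinity`, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a core list such as "0-3,6" into a bitmask (bit i set = core i).
 * The string format is an assumption for illustration.
 * Returns 0 on success, -1 on malformed input or a core id >= 64. */
static int core_list_to_mask (const char *s, uint64_t *mask)
{
    assert (s != NULL && mask != NULL);
    uint64_t m = 0;
    const char *p = s;

    if (*p == '\0')
        return -1;                      /* empty list is rejected */
    while (*p) {
        char *end;
        if (*p < '0' || *p > '9')
            return -1;                  /* each item must start with a digit */
        unsigned long lo = strtoul (p, &end, 10);
        unsigned long hi = lo;
        p = end;
        if (*p == '-') {                /* optional "lo-hi" range */
            p++;
            if (*p < '0' || *p > '9')
                return -1;
            hi = strtoul (p, &end, 10);
            p = end;
        }
        if (lo > hi || hi >= 64)
            return -1;
        for (unsigned long i = lo; i <= hi; i++)
            m |= (uint64_t)1 << i;
        if (*p == ',') {
            p++;
            if (*p == '\0')
                return -1;              /* trailing comma */
        } else if (*p != '\0') {
            return -1;                  /* unexpected character */
        }
    }
    *mask = m;
    return 0;
}
```

A count of "5 cores" cannot be turned into such a mask, whereas an explicit set can, which is the distinction drawn in the reply above.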
resrc_api_map_put (gmap, "node", (void *)(intptr_t)REDUCE_UNDER_ME);
resrc_api_map_put (rmap, "core", (void *)(intptr_t)NONE_UNDER_ME);
if (resrc_tree_serialize_lite (gat, red, job->resrc_tree, gmap, rmap)) {
I want to confirm that I'm reading this correctly: the red JSON object is not used directly by the scheduler, correct?
Yes, red (reduction) is only used by the serializer function, an unfortunate side effect of a recursive API. I could hide it by introducing a wrapper around resrc_tree_serialize_lite, if you want.
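For readers unfamiliar with the pattern, the wrapper idea mentioned here can be sketched as follows. This is a toy, not the resrc API: `toy_tree`, `toy_serialize_r`, and `toy_serialize` are hypothetical names, and a plain integer stands in for the reduction object.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a resource tree; each vertex contributes some cores. */
struct toy_tree {
    int ncores;
    const struct toy_tree *children;
    size_t nchildren;
};

/* Recursive worker: 'red' is scratch state threaded through the recursion,
 * which is why it shows up in the signature. */
static int toy_serialize_r (const struct toy_tree *t, int *red)
{
    assert (red != NULL);
    if (!t)
        return -1;
    *red += t->ncores;
    for (size_t i = 0; i < t->nchildren; i++)
        if (toy_serialize_r (&t->children[i], red) < 0)
            return -1;
    return 0;
}

/* Wrapper: owns the accumulator so that callers never see it. */
static int toy_serialize (const struct toy_tree *t, int *total)
{
    int red = 0;
    if (!total || toy_serialize_r (t, &red) < 0)
        return -1;
    *total = red;
    return 0;
}
```

The trade-off is that a wrapper like this takes the reduction output away from callers who want it directly, which is the concern raised later in this thread.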
The code LGTM. I like the flexibility provided by the reduce/gather maps. That was a really nice touch :)
I took it for a spin on a single-node Flux instance and the output of R_lite LGTM as well (I tried to play around with larger instances, but I'm getting some weird hangs when I try to run srun -N2 flux start).
Thanks @SteVwonder. I don't think I saw the multinode hang... Can you consistently reproduce it?
Well, on second thought, if the user only wants one resource type to be serialized in the reduced form, red should be used instead. So I think it is better to keep it this way...
I don't think the hangs are related to this PR. I suspect it is probably something silly on my end. I'll play around with it some more tomorrow and open a ticket if I can't figure it out. Leaving the interface the way it is makes sense given the use-case you describe.
@SteVwonder: I think this can go in unless @grondo disagrees. Once this gets merged, I will rebase my other PR: #323. Thanks.
Please note that this PR will fail in Travis CI until PR flux-framework/flux-core#1485 is merged into flux-core.
Add simple support APIs (e.g., resrc_api_map_put) to the resrc_api layer that allow the user to serialize only certain resource types in two different forms (i.e., gather form and reduce form).
Introduce resrc_tree_serialize_lite and resrc_to_json_lite to serialize an individual resrc object in these forms.
Then, in sched.c, use the above APIs to serialize only the node and core resource types into the agreed-upon R_lite format.
Finally, because resrc does not have rank information, introduce resolve_rank () to resolve the node/hostname into the corresponding rank as a post-processing step. This was discussed with @grondo and @garlick offline some time back.
Add some test cases.
References:
We discussed R_lite support in flux-framework/flux-core#1485 and flux-framework/flux-core#1378 (comment).
R_lite is already understood by wreck per flux-framework/flux-core#1399.
This should also resolve flux-framework/flux-core#1439.
It should also significantly reduce the memory growth issues with very large numbers of jobs.