Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add R_lite support #321

Merged
merged 5 commits into from Apr 27, 2018
Merged

Add R_lite support #321

merged 5 commits into from Apr 27, 2018

Conversation

dongahn
Copy link
Member

@dongahn dongahn commented Apr 24, 2018

Please note that this PR will fail in Travis CI until PR flux-framework/flux-core#1485 will be merged into flux-core.

  • Add simple support APIs (e.g., resrc_api_map_put) to the resrc_api layer that allows the user to serialize only certain resource types in two different forms (i.e., gather form and reduce form).

  • Introduce resrc_tree_serialize_lite and resrc_to_json_lite to serialize an individual resrc object in these forms.

  • Then, in sched.c, use the above APIs to only serialize node and core resource types into the agree-upon R_lite format.

  • Finally, because resrc does not have rank information, introduceresolve_rank () to resolve the node/hostname into the corresponding rank as a post processing step. This was discussed with @grondo and @garlick offline some time back.

  • Add some test cases.

References:
We discussed R_lite support in flux-framework/flux-core#1485 and flux-framework/flux-core#1378 (comment).

R_lite is already understood by wreck per flux-framework/flux-core#1399

This should also resolve flux-framework/flux-core#1439

It should also significantly reduce the memory growth issues with massive large numbers of jobs.

Add this so that the upper layer can pass the resource
types to effect the lightweight R emitting logic within
resrc.

Prepare for R_lite support.
Add resrc_tree_serialize_lite as the main function to
traverse the hardware hierarchy to emit the resources
in the lightweight R format.

Add support so that the upper layer can pass in
the resources types to emit and how they and
their children (i.e., the next level resource
type to be emitted) should be emitted.

This way, the resrc layer can still remain resource
type agnostic and the upper layer passes in
the input that control emit behavior.
@dongahn
Copy link
Member Author

dongahn commented Apr 24, 2018

@grondo: once thing I couldn't test was whether the scheduler-generated R_lite has led to correct tasks affinity. If you know how to test this easily, please let me know.

@dongahn
Copy link
Member Author

dongahn commented Apr 25, 2018

Restarted the build as flux-framework/flux-core#1485 was just meged.

@codecov-io
Copy link

codecov-io commented Apr 25, 2018

Codecov Report

Merging #321 into master will increase coverage by 0.17%.
The diff coverage is 92.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #321      +/-   ##
==========================================
+ Coverage   74.25%   74.43%   +0.17%     
==========================================
  Files          49       49              
  Lines        9540     9638      +98     
==========================================
+ Hits         7084     7174      +90     
- Misses       2456     2464       +8
Impacted Files Coverage Δ
resrc/resrc.h 100% <ø> (ø) ⬆️
resrc/resrc.c 83.31% <87.5%> (+0.16%) ⬆️
resrc/test/tresrc.c 96.36% <92.3%> (-0.2%) ⬇️
sched/sched.c 73.98% <93.93%> (+0.35%) ⬆️
resrc/resrc_tree.c 86.83% <96.77%> (+1.45%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ed8516...eec7e1c. Read the comment docs.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 76.113% when pulling eec7e1c on dongahn:light_R into 3ed8516 on flux-framework:master.

@grondo
Copy link
Contributor

grondo commented Apr 25, 2018 via email

@dongahn
Copy link
Member Author

dongahn commented Apr 25, 2018

As of now, wreck is not using R_lite for any affinity. The plan iirc was to
add it only if necessary

Oh sorry I didn't know that. I thought I saw cpuset etc from the new code and mistakenly thought it does do affinity. I certaintly didn't take a deep dive. Just so that I understand, R_lite has info on which particular core or set of cores are allocated to the job. But wreck currently only retrieves the count from it? So essecially same as the old rank.core scheme? (Still good to have some exercise in preparation for the new execution engine).

@grondo
Copy link
Contributor

grondo commented Apr 25, 2018 via email


resrc_api_map_put (gmap, "node", (void *)(intptr_t)REDUCE_UNDER_ME);
resrc_api_map_put (rmap, "core", (void *)(intptr_t)NONE_UNDER_ME);
if (resrc_tree_serialize_lite (gat, red, job->resrc_tree, gmap, rmap)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to confirm that I'm reading this correctly: the red JSON object is not used directly by the scheduler, correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, red (reduction) is only used by the serializer function, an unfortunate side effect of a recursive API. I could hide it by introducing a wrapper around resrc_tree_serialize_lite, if you want.

Copy link
Member

@SteVwonder SteVwonder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code LGTM. I like the flexibility provided by the reduce/gather maps. That was a really nice touch :)

I took it for a spin on a single-node Flux instance and the output of R_lite LGTM as well (I tried to play around with larger instances, but I'm getting some weird hangs when I try and srun -N2 flux start).

@dongahn
Copy link
Member Author

dongahn commented Apr 26, 2018

Thanks @SteVwonder. I don't think I saw thr multinode hang... Can you consistently reproduve it?

@dongahn
Copy link
Member Author

dongahn commented Apr 26, 2018

Well on a second thought, if he user only want one resource type to be serialized in the reduced form, red should be used instead. So I think it is better to keep it this way...

@SteVwonder
Copy link
Member

I don't think the hangs are related to this PR. I suspect it is probably something silly on my end. I'll play around with it some more tomorrow and open a ticket if I can't figure it out.

Leaving the interface the way it is makes sense given the use-case you describe.

@dongahn
Copy link
Member Author

dongahn commented Apr 27, 2018

@SteVwonder: I think this can go in unless @grondo disagrees. Once this gets merged, I will rebase my other PR: #323. Thanks.

@SteVwonder SteVwonder merged commit 908db96 into flux-framework:master Apr 27, 2018
@grondo grondo mentioned this pull request May 11, 2018
@dongahn dongahn deleted the light_R branch July 13, 2019 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

All ranks log an error when r_lite is missing, even when it is not to run on that broker/node
5 participants