Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add R_lite support #321

Merged
merged 5 commits into from Apr 27, 2018

Conversation

Projects
None yet
5 participants
@dongahn
Copy link
Contributor

dongahn commented Apr 24, 2018

Please note that this PR will fail in Travis CI until PR flux-framework/flux-core#1485 will be merged into flux-core.

  • Add simple support APIs (e.g., resrc_api_map_put) to the resrc_api layer that allows the user to serialize only certain resource types in two different forms (i.e., gather form and reduce form).

  • Introduce resrc_tree_serialize_lite and resrc_to_json_lite to serialize an individual resrc object in these forms.

  • Then, in sched.c, use the above APIs to only serialize node and core resource types into the agree-upon R_lite format.

  • Finally, because resrc does not have rank information, introduceresolve_rank () to resolve the node/hostname into the corresponding rank as a post processing step. This was discussed with @grondo and @garlick offline some time back.

  • Add some test cases.

References:
We discussed R_lite support in flux-framework/flux-core#1485 and flux-framework/flux-core#1378 (comment).

R_lite is already understood by wreck per flux-framework/flux-core#1399

This should also resolve flux-framework/flux-core#1439

It should also significantly reduce the memory growth issues with massive large numbers of jobs.

dongahn added some commits Mar 29, 2018

resrc: Add support for a simple map of opaque type
Add this so that the upper layer can pass the resource
types to effect the lightweight R emitting logic within
resrc.

Prepare for R_lite support.
resrc: Add lightweight R emitting support
Add resrc_tree_serialize_lite as the main function to
traverse the hardware hierarchy to emit the resources
in the lightweight R format.

Add support so that the upper layer can pass in
the resources types to emit and how they and
their children (i.e., the next level resource
type to be emitted) should be emitted.

This way, the resrc layer can still remain resource
type agnostic and the upper layer passes in
the input that control emit behavior.

@dongahn dongahn force-pushed the dongahn:light_R branch from f477f84 to eec7e1c Apr 24, 2018

@dongahn dongahn requested review from SteVwonder and grondo Apr 24, 2018

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 24, 2018

@grondo: once thing I couldn't test was whether the scheduler-generated R_lite has led to correct tasks affinity. If you know how to test this easily, please let me know.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 25, 2018

Restarted the build as flux-framework/flux-core#1485 was just meged.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Apr 25, 2018

Codecov Report

Merging #321 into master will increase coverage by 0.17%.
The diff coverage is 92.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #321      +/-   ##
==========================================
+ Coverage   74.25%   74.43%   +0.17%     
==========================================
  Files          49       49              
  Lines        9540     9638      +98     
==========================================
+ Hits         7084     7174      +90     
- Misses       2456     2464       +8
Impacted Files Coverage Δ
resrc/resrc.h 100% <ø> (ø) ⬆️
resrc/resrc.c 83.31% <87.5%> (+0.16%) ⬆️
resrc/test/tresrc.c 96.36% <92.3%> (-0.2%) ⬇️
sched/sched.c 73.98% <93.93%> (+0.35%) ⬆️
resrc/resrc_tree.c 86.83% <96.77%> (+1.45%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ed8516...eec7e1c. Read the comment docs.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 25, 2018

Coverage Status

Coverage increased (+0.2%) to 76.113% when pulling eec7e1c on dongahn:light_R into 3ed8516 on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 25, 2018

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 25, 2018

As of now, wreck is not using R_lite for any affinity. The plan iirc was to
add it only if necessary

Oh sorry I didn't know that. I thought I saw cpuset etc from the new code and mistakenly thought it does do affinity. I certaintly didn't take a deep dive. Just so that I understand, R_lite has info on which particular core or set of cores are allocated to the job. But wreck currently only retrieves the count from it? So essecially same as the old rank.core scheme? (Still good to have some exercise in preparation for the new execution engine).

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 25, 2018

@dongahn dongahn referenced this pull request Apr 25, 2018

Closed

Please document R_lite #121


resrc_api_map_put (gmap, "node", (void *)(intptr_t)REDUCE_UNDER_ME);
resrc_api_map_put (rmap, "core", (void *)(intptr_t)NONE_UNDER_ME);
if (resrc_tree_serialize_lite (gat, red, job->resrc_tree, gmap, rmap)) {

This comment has been minimized.

@SteVwonder

SteVwonder Apr 26, 2018

Member

I want to confirm that I'm reading this correctly: the red JSON object is not used directly by the scheduler, correct?

This comment has been minimized.

@dongahn

dongahn Apr 26, 2018

Author Contributor

Yes, red (reduction) is only used by the serializer function, an unfortunate side effect of a recursive API. I could hide it by introducing a wrapper around resrc_tree_serialize_lite, if you want.

@SteVwonder
Copy link
Member

SteVwonder left a comment

The code LGTM. I like the flexibility provided by the reduce/gather maps. That was a really nice touch :)

I took it for a spin on a single-node Flux instance and the output of R_lite LGTM as well (I tried to play around with larger instances, but I'm getting some weird hangs when I try and srun -N2 flux start).

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 26, 2018

Thanks @SteVwonder. I don't think I saw thr multinode hang... Can you consistently reproduve it?

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 26, 2018

Well on a second thought, if he user only want one resource type to be serialized in the reduced form, red should be used instead. So I think it is better to keep it this way...

@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Apr 26, 2018

I don't think the hangs are related to this PR. I suspect it is probably something silly on my end. I'll play around with it some more tomorrow and open a ticket if I can't figure it out.

Leaving the interface the way it is makes sense given the use-case you describe.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Apr 27, 2018

@SteVwonder: I think this can go in unless @grondo disagrees. Once this gets merged, I will rebase my other PR: #323. Thanks.

@SteVwonder SteVwonder merged commit 908db96 into flux-framework:master Apr 27, 2018

4 checks passed

codecov/patch 92.66% of diff hit (target 74.25%)
Details
codecov/project 74.43% (+0.17%) compared to 3ed8516
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.2%) to 76.113%
Details

@grondo grondo referenced this pull request May 11, 2018

Closed

Need 0.5.0 Release #340

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.