Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resource: stop collecting/reducing hwloc XML #4263

Merged
merged 2 commits into from
Apr 10, 2022

Conversation

garlick
Copy link
Member

@garlick garlick commented Apr 2, 2022

Now that fluxion has the Rv1 reader (flux-framework/flux-sched#921), it no longer requires the resource.get-xml RPC to fetch aggregated hwloc XML for the instance. This PR removes that RPC and leaves the hwloc XML out of the Rlocal ➡️ reduction, greatly reducing the amount of data that needs to be moved before the scheduler can be started when resources are dynamically discovered (e.g. under slurm).

This PR is based on top of #4262 (remove flux-hwloc)

@garlick
Copy link
Member Author

garlick commented Apr 2, 2022

Oh darn, looks like fluxion is failing a few tests:

2022-04-02T23:16:28.4623930Z FAIL: t1016-nest-namespace.t 3 - namespace: gpu id remapping works with hwloc (pol=hi)
2022-04-02T23:16:28.4624445Z FAIL: t1016-nest-namespace.t 4 - namespace: parent CUDA_VISIBLE_DEVICES has no effect
2022-04-02T23:16:28.4625912Z FAIL: t1016-nest-namespace.t 7 - namespace: gpu id remapping works with hwloc (pol=low)
2022-04-02T23:16:28.4626401Z FAIL: t1016-nest-namespace.t 8 - namespace: parent CUDA_VISIBLE_DEVICES has no effect
2022-04-02T23:16:28.4626873Z FAIL: t1016-nest-namespace.t 9 - namespace: removing resource and qmanager modules

@garlick
Copy link
Member Author

garlick commented Apr 7, 2022

I reran the sched test after sched merged the PR to disable the failing tests, but it looks like there may be another set that I missed before?

2022-04-02T23:17:03.0162163Z PASS: t4004-match-hwloc.t 1 - resource-query works on simple query using xml file
2022-04-02T23:17:03.0167276Z PASS: t4004-match-hwloc.t 2 - resource-query works on gpu query using xml (NVIDIA)
2022-04-02T23:17:03.0171870Z PASS: t4004-match-hwloc.t 3 - resource-query works on gpu query using xml (AMD GPU)
2022-04-02T23:17:03.0176261Z PASS: t4004-match-hwloc.t 4 - resource-query works on gpu query using xml (AMD RSMI GPU)
2022-04-02T23:17:03.0182307Z PASS: t4004-match-hwloc.t 5 - resource-query works on gpu type query using xml (MIXED)
2022-04-02T23:17:03.0187006Z PASS: t4004-match-hwloc.t 6 - resource-query works with allowlist
2022-04-02T23:17:03.0191367Z PASS: t4004-match-hwloc.t 7 - loading resource module with a tiny hwloc xml file works
2022-04-02T23:17:03.0195701Z PASS: t4004-match-hwloc.t 8 - match-allocate works with four one-core jobspecs
2022-04-02T23:17:03.0200026Z PASS: t4004-match-hwloc.t 9 - match-allocate fails when all resources are allocated
2022-04-02T23:17:03.0210644Z PASS: t4004-match-hwloc.t 10 - removing resource works
2022-04-02T23:17:03.0211263Z PASS: t4004-match-hwloc.t 11 - load test resources (4N4B)
2022-04-02T23:17:03.0211732Z PASS: t4004-match-hwloc.t 12 - loading resource module with default resource info source
2022-04-02T23:17:03.0212228Z FAIL: t4004-match-hwloc.t 13 - match-allocate works with four two-socket jobspecs
2022-04-02T23:17:03.0212704Z FAIL: t4004-match-hwloc.t 14 - match-allocate fails when all resources are allocated
2022-04-02T23:17:03.0213125Z FAIL: t4004-match-hwloc.t 15 - unload fluxion resource
2022-04-02T23:17:03.0213525Z PASS: t4004-match-hwloc.t 16 - load test resources (4N4B_sierra2)
2022-04-02T23:17:03.0213894Z PASS: t4004-match-hwloc.t 17 - load fluxion resource
2022-04-02T23:17:03.0214285Z FAIL: t4004-match-hwloc.t 18 - match allocate
2022-04-02T23:17:03.0214642Z FAIL: t4004-match-hwloc.t 19 - removing resource works
2022-04-02T23:17:03.0215014Z ERROR: t4004-match-hwloc.t - exited with status 1

@codecov
Copy link

codecov bot commented Apr 7, 2022

Codecov Report

Merging #4263 (4433365) into master (2de6b30) will increase coverage by 0.00%.
The diff coverage is n/a.

❗ Current head 4433365 differs from pull request most recent head 974ba5c. Consider uploading reports for the commit 974ba5c to get more accurate results

@@           Coverage Diff            @@
##           master    #4263    +/-   ##
========================================
  Coverage   83.54%   83.55%            
========================================
  Files         387      387            
  Lines       64867    64725   -142     
========================================
- Hits        54194    54081   -113     
+ Misses      10673    10644    -29     
Impacted Files Coverage Δ
src/modules/resource/inventory.c 83.09% <ø> (+1.45%) ⬆️
src/modules/resource/topo.c 69.40% <ø> (-0.92%) ⬇️
src/broker/exec.c 68.96% <0.00%> (-5.33%) ⬇️
src/common/libpmi/simple_client.c 85.38% <0.00%> (-1.83%) ⬇️
src/modules/job-archive/job-archive.c 60.66% <0.00%> (-1.48%) ⬇️
src/common/libpmi/pmi2.c 88.60% <0.00%> (-1.27%) ⬇️
src/modules/job-info/guest_watch.c 76.21% <0.00%> (-0.82%) ⬇️
src/common/libterminus/terminus.c 85.82% <0.00%> (-0.25%) ⬇️
src/common/libsubprocess/remote.c 71.11% <0.00%> (-0.25%) ⬇️
... and 16 more

@garlick
Copy link
Member Author

garlick commented Apr 7, 2022

Rebased on current master and added a commit that improves error responses for unknown service methods (noticed while tracking down fluxion's use of the deprecated resource.get-xml RPC).

This won't pass the sched CI test until flux-framework/flux-sched#929 is merged.

@garlick
Copy link
Member Author

garlick commented Apr 8, 2022

The sched PR was merged so kicking of CI again.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@garlick
Copy link
Member Author

garlick commented Apr 10, 2022

Thanks!

Problem: the resource module collects hwloc XML to rank 0 by tacking
it onto its Rlocal -> R reduction.  This was to support fluxion while
an Rv1 reader was being developed.  That has now been completed, so
this can be removed.

Don't aggregate hwloc XML at rank 0.
Drop the resource.get-xml RPC used by fluxion.

Update resource tests.
Problem: the message dispatcher returns a human readable error when
requests that don't match a handler, but the topic is not included,
which is inconvenient when diagnosing failures.

Add the topic to the error response.

Also, clean up message topic usage in this function:
- set topic to "unknown" if flux_msg_get_topic() fails to avoid uninitialized
  variable contents being passed to caliper on failure
- eliminate duplicate flux_msg_get_topic() in fprintf call that is
  used as a last resort when a message is going to be dropped
@mergify mergify bot merged commit 5473d54 into flux-framework:master Apr 10, 2022
@garlick garlick deleted the no_xml_collect branch April 17, 2022 21:28
garlick added a commit to garlick/flux-core that referenced this pull request Mar 1, 2024
Problem: a comment in topo.c states that XML is reduced
along with R_local, however this is not true.

XML is no longer reduced since flux-framework#4263 (in flux-core-0.39.0).

Drop that part of the comment.
grondo pushed a commit to garlick/flux-core that referenced this pull request Mar 4, 2024
Problem: a comment in topo.c states that XML is reduced
along with R_local, however this is not true.

XML is no longer reduced since flux-framework#4263 (in flux-core-0.39.0).

Drop that part of the comment.
garlick added a commit to garlick/flux-core that referenced this pull request Mar 5, 2024
Problem: a comment in topo.c states that XML is reduced
along with R_local, however this is not true.

XML is no longer reduced since flux-framework#4263 (in flux-core-0.39.0).

Drop that part of the comment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants