Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace resource-hwloc module with enhanced flux-hwloc command #1968

Merged
merged 17 commits into from Feb 2, 2019

Conversation

@grondo
Copy link
Contributor

commented Jan 29, 2019

The work here is not quite done, but I wanted to post a functional PR early since I'm sure there will be many comments...

This PR is an alternative approach to #1931. The resource-hwloc module itself is removed in favor of an enhanced flux-hwloc command. To make this possible, the following changes were made to flux-hwloc:

  • hwloc xml is fetched directly the kvs. This obviates the need for the resource-hwloc.topo service.
  • hwloc xml is loaded directly into the kvs by flux hwloc reload:
    • in the case a directory argument is supplied to reload, the flux-hwloc reload command just puts the XML strings directly under a single transaction.
    • in the case of system topology reload, a new plumbing command flux-hwloc aggregate-load is executed across the session to load local topology and insert it into the kvs (under a kvs_fence)
  • As in #1931, an aggregation is done by flux-hwloc aggregate-load, executed by flux-hwloc reload across the session.
  • Due to this work, flux hwloc topology and flux hwloc info now have --local and --rank options to allow operating on the local topology or a subset of ranks (default is --rank-"all")
  • The flux hwloc reload command should work the same as before.
  • the resource.hwloc.xml directories are now populated in rc1 by simple flux hwloc reload

This approach removes the possible deadlocks introduced in #1931 since modules can be loaded/unloaded from any rank at any time. Doing the hwloc reload from a command interface was just much easier to reason about.

In this PR the aggregate_* functions were split off into a new convenience library under src/common/libaggregate. This just kept the complexity of src/cmd/builtin/hwloc.c down.

Still TODO:

  • test updates
  • doc updates
@grondo grondo force-pushed the grondo:flux-hwloc branch from d644227 to 0d9d693 Jan 29, 2019
@codecov-io

This comment has been minimized.

Copy link

commented Jan 29, 2019

Codecov Report

Merging #1968 into master will increase coverage by 0.11%.
The diff coverage is 81.25%.

@@            Coverage Diff             @@
##           master    #1968      +/-   ##
==========================================
+ Coverage   80.07%   80.19%   +0.11%     
==========================================
  Files         195      195              
  Lines       35048    35077      +29     
==========================================
+ Hits        28064    28129      +65     
+ Misses       6984     6948      -36
Impacted Files Coverage Δ
src/common/liboptparse/optparse.c 90.75% <100%> (-0.07%) ⬇️
src/common/libaggregate/aggregate.c 80.89% <80.89%> (ø)
src/cmd/builtin/hwloc.c 80.44% <81.28%> (+0.44%) ⬆️
src/common/libutil/veb.c 95.42% <0%> (-3.43%) ⬇️
src/modules/barrier/barrier.c 76.55% <0%> (-2.07%) ⬇️
src/common/libflux/message.c 81.64% <0%> (+0.12%) ⬆️
src/modules/kvs-watch/kvs-watch.c 79.95% <0%> (+0.9%) ⬆️
src/cmd/flux-exec.c 78.48% <0%> (+4.06%) ⬆️
@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Jan 30, 2019

It occurred to me that flux-hwloc reload should just exec(2) flux-exec in order to execute flux hwloc aggregate-load across the session instead of attempting to re-implement most of flux-exec internally.

@grondo grondo force-pushed the grondo:flux-hwloc branch from a4ecc7c to 795ef25 Jan 30, 2019
@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Jan 30, 2019

Ok, I've added a few tests (and added jq(1) to our testenv docker images).

I just have to update documentation and run some scale tests and then this PR is probably ready for another look.

@grondo grondo force-pushed the grondo:flux-hwloc branch from b8d773a to 1159c4e Jan 30, 2019
@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Jan 31, 2019

In scale testing this PR fails for any instance >1024 brokers due to #1974. Once that fix is in I can resume some testing.

@grondo grondo force-pushed the grondo:flux-hwloc branch from 1159c4e to 45dd6fa Feb 1, 2019
@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Feb 1, 2019

Rebased and pushed. Also added flux hwloc reload -v to get some debug out of the reload process (handy while seeing how the new method scales)

I was able to do some testing at about 2K brokers on opal (across 32 real nodes), and sadly a flux hwloc reload in that environment now takes ~20s on average. This seems to be due to a combination of flux-exec overhead plus a larger than expected time in kvs_fence. However, perhaps for now it is ok. There is lots of room for future optimization (e.g. at some point we could reduce the XML, or generate an R from global XML instead of storing N XMLs in the kvs).

You can see below the difference between reloading local XML from all ranks vs just one

$ time flux hwloc reload -v
flux-hwloc: 0.013s: starting HWLOC reload on 2048 ranks (all)
flux-hwloc: 0.013s: executing aggregate-load across all ranks
flux-hwloc: Running flux exec -n -r all flux hwloc aggregate-load --rank=all --verbose --timeout=34.000 --unpack=resource.hwloc.by_rank --key=resource.hwloc.reload:0-46354
flux-hwloc aggregate-load: 0.006s: starting
flux-hwloc aggregate-load: 0.006s: pushing local xml
flux-hwloc aggregate-load: 3.445s: starting kvs fence
flux-hwloc aggregate-load: 7.096s: kvs fence complete
flux-hwloc aggregate-load: 7.096s: starting aggregate
flux-hwloc aggregate-load: 7.098s: aggregate push complete
flux-hwloc aggregate-load: 8.369s: aggregate_wait complete
flux-hwloc aggregate-load: 8.369s: done.

real	0m18.948s
user	0m13.035s
sys	0m0.432s

$ time flux hwloc reload -v -r 0
flux-hwloc: 0.004s: starting HWLOC reload on 1 ranks (0)
flux-hwloc: 0.004s: executing aggregate-load across all ranks
flux-hwloc: Running flux exec -n -r all flux hwloc aggregate-load --rank=0 --verbose --timeout=34.000 --unpack=resource.hwloc.by_rank --key=resource.hwloc.reload:0-47380
flux-hwloc aggregate-load: 0.173s: starting
flux-hwloc aggregate-load: 0.174s: pushing local xml
flux-hwloc aggregate-load: 0.322s: starting kvs fence
flux-hwloc aggregate-load: 0.561s: kvs fence complete
flux-hwloc aggregate-load: 0.561s: starting aggregate
flux-hwloc aggregate-load: 1.090s: aggregate push complete
flux-hwloc aggregate-load: 3.929s: aggregate_wait complete
flux-hwloc aggregate-load: 3.930s: done.

real	0m5.736s
user	0m1.863s
sys	0m0.372s

Also, I'm not sure if reloading XML from a directory via a single transaction in flux hwloc reload was the best idea, but since the reload from a directory is for testing only, for now perhaps we can leave it as is:

t$ flux hwloc reload -v ./t/hwloc-data/sierra2
flux-hwloc: 0.012s: starting HWLOC reload on 2048 ranks (all)
flux-hwloc: 0.012s: starting load of XML from /g/g0/grondo/git/flux-core.git/t/hwloc-data/sierra2
flux-hwloc: 25.150s: XML load complete
flux-hwloc: 25.150s: executing aggregate-load across all ranks
flux-hwloc: Running flux exec -n -r all flux hwloc aggregate-load --verbose --timeout=34.000 --unpack=resource.hwloc.by_rank --key=resource.hwloc.reload:0-49889
flux-hwloc aggregate-load: 0.189s: starting
flux-hwloc aggregate-load: 0.358s: starting aggregate
flux-hwloc aggregate-load: 0.777s: aggregate push complete
flux-hwloc aggregate-load: 3.914s: aggregate_wait complete
flux-hwloc aggregate-load: 3.914s: done.

$ flux kvs get resource.hwloc.by_rank
{"[0-2047]": {"NUMANode": 6, "Package": 2, "Core": 44, "PU": 176, "GPU": 4}}
@grondo grondo force-pushed the grondo:flux-hwloc branch 2 times, most recently from 3fda96f to b3aff34 Feb 1, 2019
@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Feb 1, 2019

Rebased and performed a bit of squashing and commit cleanup...

@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Feb 1, 2019

For comparison, the timing for flux hwloc reload on the branch for #1931 (where the resource-hwloc handles the reload via an event) is about 5x faster:

$ flux getattr size
2048
$ time flux hwloc reload

real	0m5.739s
user	0m0.014s
sys	0m0.012s

$ time flux hwloc reload -r 0

real	0m1.387s
user	0m0.012s
sys	0m0.016s
@dongahn

This comment has been minimized.

Copy link
Contributor

commented Feb 1, 2019

Cool!

@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Feb 1, 2019

Cool!

Actually, not so cool. I might have not been clear in my benchmarks, but the preferred approach in this PR (replacing the resource-hwloc module with flux hwloc + flux exec) is 5x slower at global HWLOC reload than the approach in #1931 which keeps the resource-hwloc module.

@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Feb 1, 2019

I think the weird timing of flux hwloc reload at medium scale is probably due to the effect described in #1983.

@grondo grondo force-pushed the grondo:flux-hwloc branch from b3aff34 to fda13f4 Feb 1, 2019
grondo added 13 commits Jan 28, 2019
Suppress uninitialized variable warnings from Valgrind under
optparse_add_doc() by ensuring the entire structure is initialized
instead of initializing individual fields.
Prepare t2005-hwloc.t test for removal of resource-hwloc module.
Add a set of convenience functions for aggregator client to
src/common/libaggregator.
Add libaggregate to src/cmd/Makefile.am in preparation for its
requirement by flux-hwloc subcommand.
No functions are used from sds.h in src/cmd/builtin/hwloc.c,
so remove it.
The --walk-topology flag for flux-hwloc reload is not used in
any real world scenario. Remove it.
Fixup whitespace in optparse_subcommand definitions.
Add ability for flux-hwloc to load local topology using the same
flags as used in the resource-hwloc module.

Add a "--local" option to the `info` and `topology` subcommands to
dump info or XML for the local system topology instead of the global
topology.
Add support for fetching hwloc topology XML directly from kvs instead
of the resource-hwloc.topo RPC. Add a `--rank` option to flux-hwloc
topology and info subcommands to allow dumping a subset of topology
information.
Convert cmd_lstopo() to use flux_hwloc_global_topology().

Drop code used to fetch hwloc topology with an rpc.
Add an undocument `aggregate-load` subcommand to flux-hwloc. This
subcommand is meant to be run across all ranks to aggregate a
JSON topology summary and optionally store it in the KVS. The
subcommand also optionally loads local hwloc topology XML into
the appropriate resource.hwloc.xml directory for the local rank.
This commit replaces the `reload` subcommand which uses the
resource-hwloc.reload RPC with a version that directly loads hwloc topology
XML to kvs.

When loading XML files from a directory path, the data is directly put
into the kvs under a single transaction from the `flux hwloc reload`
command. For reloading system topology, the reload command executes the
`flux hwloc aggregate-load` command to load XML on each rank.

In either case, `flux hwloc reload` executes `aggregate-load` across all
ranks to re-aggregate a topology summary JSON object, which is placed at
`resource.hwloc.by_rank` instead of the broken down per-rank directories.
Add a --verbose option to flux-hwloc reload and aggregate-load.
grondo added 4 commits Jan 29, 2019
Remove the resource-hwloc module since it now has no users.
Some tests will now use jq for parsing and validating JSON output
from commands, so include it in the 'testenv' docker images.
Update tests to validate new features of flux-hwloc builtin command.
Update the flux-hwloc(1) man page to match new options and behavior
of flux-hwloc subcommands.
@grondo grondo force-pushed the grondo:flux-hwloc branch from fda13f4 to 70babcf Feb 2, 2019
@garlick

This comment has been minimized.

Copy link
Member

commented Feb 2, 2019

Nice, thanks!

@garlick garlick merged commit fdc4a36 into flux-framework:master Feb 2, 2019
3 checks passed
3 checks passed
codecov/patch 81.25% of diff hit (target 80.07%)
Details
codecov/project 80.19% (+0.11%) compared to 5ed7f26
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@grondo grondo deleted the grondo:flux-hwloc branch Feb 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.