resource: load resources from config or R, and rework topo discovery #3265

Merged
merged 20 commits into flux-framework:master from resource_topo2 on Nov 4, 2020

Conversation

garlick
Member

@garlick garlick commented Oct 13, 2020

This PR takes a step towards the architecture proposed in #3238.

The "resource object is still hwloc.by_rank and the resource.acquire RPC is unchanged in this PR.

A TOML config option is added to read the resource object from a file, e.g.:

[resource]
path = "/etc/flux/resource.json"

Code is added to fetch R from the enclosing instance, although conversion to a usable resource object is stubbed out for now.

If the resource object is known by one of these static methods, or because resource.hwloc.by_rank exists in the KVS when the resource module is loaded, then each broker validates its local shard of the resource object using hwloc (directly, not calling out to flux hwloc). If validation fails, currently an error is logged and the resource module fails to load. This needs to be changed to drain the node before this is merged.

If the resource object could not be determined up front, then resources are "discovered" as before, except the resource module now reduces the shards to a full object itself, rather than calling out to flux hwloc reload. This may avoid some entanglements when flux hwloc is not finished when the instance exits as noted in #3235.

Resources can be faked by pre-populating resource.hwloc.by_rank in the KVS and loading the resource module with the noverify option. It is no longer necessary to post an event to make this work.
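A minimal sketch of that sequence, assuming a hypothetical fake-by-rank.json file containing the by_rank object (exact flux kvs put quoting may vary):

$ flux kvs put resource.hwloc.by_rank="$(cat fake-by-rank.json)"
$ flux module load resource noverify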

One side effect of not using flux hwloc reload is that resource.hwloc.xml is not populated. However, a resource.topo-get-xml RPC was added to allow the XML to be fetched directly from each rank on demand. (Fluxion currently fetches each rank's XML synchronously from the KVS in a loop, so it should be relatively easy to replace the KVS lookups with this RPC.)[1]

This is a WIP because it needs more tests, and it's still under discussion whether conversion to Rv1 should happen before or after this PR.

Edit: [1] Fluxion waits to fetch XML until resource.acquire returns. When resources are discovered, that implies that all ranks are up. However when resources are configured, acquire can return before all ranks are up, so this workaround for Fluxion may need more thought.

@lgtm-com

lgtm-com bot commented Oct 13, 2020

This pull request introduces 2 alerts when merging cd30eec into d6cddb3 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@garlick
Member Author

garlick commented Oct 14, 2020

OK, just forced a push after some refactoring and a rebase on current master.

All the by_rank specific stuff should now be localized in byrank.[ch].

The hwloc-specific stuff is relegated to rhwloc.[ch], which was plucked from the future (@grondo's forthcoming librlist work). One function was added to get XML from a hwloc_topology_t.

Resources are now read/written from resource.R in the KVS rather than the by_rank specific resource.hwloc.by_rank.

hwloc xml is collected at rank 0 and written to the usual place at resource.hwloc.xml.<rank>. If resources are dynamically discovered, this is guaranteed to be in place by the time the first resource.acquire response is sent. It will not be present if resources were configured from TOML or grabbed from R in the enclosing instance. Hopefully this provides a bridge for Fluxion in the situations where we are using it now.
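As an illustration, once discovery completes the stored objects can be inspected with something like:

$ flux kvs get resource.R
$ flux kvs get resource.hwloc.xml.0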

It should now be relatively straightforward to convert to Rv1: find the byrank_* function calls and replace that code, then drop rhwloc. This PR doesn't do anything with the acquire interface to move towards the proposed RFC 28 changes, but the byrank-specific bits there are minimal. For the most part the "resource object" is passed through without modification, except for some code that calls byrank_resobj_sub() to remove ranks that are excluded from the initial response.

Which reminded me - Fluxion does look at the resource object to extract the initial idset of ranks for a grow operation (which it then satisfies with the hwloc XML data). Uh oh.

@lgtm-com

lgtm-com bot commented Oct 14, 2020

This pull request introduces 2 alerts when merging e8e1429 into 761564a - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@garlick
Member Author

garlick commented Oct 15, 2020

Just posted a fixup to store the hwloc XML data as a JSON string rather than the bare XML string, since this is what flux hwloc reload does, and is what Fluxion expects when it parses the KVS values.

I'll post a small patch to Fluxion shortly that fixes up the test suite to track the changes in this PR.

@lgtm-com

lgtm-com bot commented Oct 15, 2020

This pull request introduces 2 alerts when merging 999f0d5 into 7f851e1 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@lgtm-com

lgtm-com bot commented Oct 16, 2020

This pull request introduces 2 alerts when merging 72a2086 into 755b415 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@garlick
Member Author

garlick commented Oct 18, 2020

Forced a push, adding flux resource reload [--xml] [--force] PATH.

This can be used to load test resources without reloading the resource module.

PATH can point to a file containing the JSON resource object, or if --xml is specified, a directory containing <rank>.xml files. If the latter, the resource object is derived from the hwloc XML.

The resource object may not contain ranks that are not valid in the current instance (e.g. >= size), unless --force is used.

Resources are not validated against the hwloc topology when loaded in this way.

The command completes after the new resource.R (and optionally resource.hwloc.xml) is committed to the KVS.
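For example (paths here are hypothetical; options as described above):

$ flux resource reload /tmp/R.json
$ flux resource reload --xml /tmp/xml-dir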

The scheduler still needs to be unloaded, so the sequence used in tests is:

$ flux module remove scheduler
$ flux resource reload [args]
$ flux module load scheduler

@lgtm-com

lgtm-com bot commented Oct 18, 2020

This pull request introduces 2 alerts when merging 9d126dd into 755b415 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@dongahn
Member

dongahn commented Oct 19, 2020

Which reminded me - Fluxion does look at the resource object to extract the initial idset of ranks for a grow operation (which it then satisfies with the hwloc XML data). Uh oh.

Yes, this is true. Fluxion does "not" use the overall resource object but it does need to fetch the idset.

@dongahn
Member

dongahn commented Oct 19, 2020

@garlick: If there are things that Fluxion does now that slow down this conversion, you shouldn't shy away from calling them out, even if that requires some substantial changes.

@dongahn
Member

dongahn commented Oct 19, 2020

Edit: [1] Fluxion waits to fetch XML until resource.acquire returns. When resources are discovered, that implies that all ranks are up. However when resources are configured, acquire can return before all ranks are up, so this workaround for Fluxion may need more thought.

Is this still a problem? So the acquire will return and hwloc xml files (or RPC equivalent) are not there?

@garlick
Member Author

garlick commented Oct 19, 2020

Edit: [1] Fluxion waits to fetch XML until resource.acquire returns. When resources are discovered, that implies that all ranks are up. However when resources are configured, acquire can return before all ranks are up, so this workaround for Fluxion may need more thought.

Is this still a problem? So the acquire will return and hwloc xml files (or RPC equivalent) are not there?

Since the above description, XML is now collected by the resource module and stored in the KVS, as before. There are some small modifications to Fluxion proposed in flux-framework/flux-sched#766 to track the changes in this PR. One change is to not load XML with KVS_FLAG_WAITCREATE, so if the XML is missing, it's a hard failure not a deadlock.

For the system instance (case 1) in the short short term, we can initially not create a resource configuration, and boot the system instance with all nodes up once. This will populate XML in the KVS. Then on subsequent restarts, it will be reloaded from there and down nodes can be tolerated.

We can improve upon that by embedding JGF in the opaque scheduler section of a resource configuration and bootstrap the system instance from that. Fluxion would need support for helping generate the resource configuration, and for bootstrapping from the JGF it gets back from resource.acquire.

For a sub-job (case 2) we need to discuss options - see #3228 (comment). In summary: The XML will not be there as of now. It would be nice if minimally there was a capability for Fluxion to bootstrap from Rv1 alone. We could also perhaps provide a hint to the resource module that it should collect XML in this case. Passing through JGF and reranking seems hard, and might not be worth it in Rv1 time frame (best to move on to Rv2).

@lgtm-com

lgtm-com bot commented Oct 20, 2020

This pull request introduces 2 alerts when merging 3206286 into 4fe56c7 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@lgtm-com

lgtm-com bot commented Oct 21, 2020

This pull request introduces 2 alerts when merging 009964f into 4fe56c7 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@garlick
Member Author

garlick commented Oct 22, 2020

Just pushed a change to drain nodes on R verification failure, and related tests.

@garlick garlick force-pushed the resource_topo2 branch 2 times, most recently from 2ae43d5 to b12965a on October 22, 2020 20:17
@garlick
Member Author

garlick commented Oct 22, 2020

Just got this failure in travis. I was not expecting this to be racy, but possibly the async disconnect message from unloading sched-simple is not getting there before resource.acquire when it's being loaded? Hmm, if that's going to be a problem in test then we might want to use an explicit cancel.

sched-simple: generate jobspec for simple test job

expecting success: 
	flux module unload sched-simple &&
	echo $hwloc_by_rank >R
	flux resource reload R &&
	flux module load sched-simple &&
	flux dmesg 2>&1 >reload.dmesg.log &&
	grep "ready:.*rank\[0-1\]/core\[0-1\]" reload.dmesg.log &&
	test_debug "echo result=\"$($query)\"" &&
	test "$($query)" = "rank[0-1]/core[0-1]"

2020-10-22T20:30:21.522533Z sched-simple.debug[0]: ready: 4 of 4 cores: rank[0-1]/core[0-1]
flux-resource: ERROR: [Errno 16] resources are busy (unload scheduler?)

@garlick
Member Author

garlick commented Oct 26, 2020

This was rebased on top of #3276, and everything converted to Rv1.

@garlick garlick added this to the flux-core 0.21.0 milestone Oct 26, 2020
Member

@chu11 chu11 left a comment

did a not-super-thorough skim and caught a few things, nothing big.

if (version != 1)
if (version != 1) {
if (errp)
sprintf (errp->text, "invalid version=%d", version);
Member

why not snprintf()? i think some checkers (perhaps not the ones we use since this wasn't flagged) will mark use of sprintf() bad.

Contributor

the buffer is known to be smaller than the maximum possible error string (even if version is 12 figures)

Contributor

Although, you're right, why not use snprintf? I'll update.


if (flux_request_unpack (msg, NULL, "{s:i}", "up", &up) < 0) {
if (flux_respond_error (h, msg, errno, NULL) < 0)
flux_log_error (h, "error resonding to monitor-waitup request");
Member

typo resonding

}
if (up == count) {
if (flux_respond (h, msg, NULL) < 0)
flux_log_error (h, "error resonding to monitor-waitup request");
Member

typo resonding

msg = zlist_first (monitor->waiters);
while (msg) {
if (notify_one_waiter (monitor->ctx->h, count, msg)) {
zlist_remove (monitor->waiters, (void *)msg);
Member

i don't think zlist_remove() is safe while iterating zlist.

Contributor

I asked about this as well, note that the function returns if any item is removed from the list, so this is safe. (I missed that as well, maybe a comment would help?)

Member

Ahh, i see. Yeah, a comment would be good.

@grondo
Contributor

grondo commented Nov 4, 2020

@garlick, I'll go ahead and force-push squashed fixups for the items @chu11 identified above.

Thanks @chu11!

grondo and others added 18 commits November 3, 2020 18:20
Problem: rlist_from_json() does not fill in the error text field
on a version mismatch, making it difficult for the caller to determine
what went wrong on an error due to version.

Add a short error message on version mismatch to rlist_from_json()
Problem: A segfault is triggered if rlist_to_R() is called with a NULL
argument.

Protect against this occurrence by checking for NULL rlist argument
and returning NULL instead of segfaulting.
Problem: if posting the drain event fails when only the reason
is being updated, the node could be left undrained.

Drop the incorrect attempt at remediation when this unlikely
error occurs.
Problem: adding RPCs to the resource module that are called
on ranks other than 0 will result in disconnect messages on
those ranks, but the disconnect handler will segfault if called
on rank > 0.

Don't call acquire_disconect() if ctx->acquire is NULL.
Problem: there is duplication of code between
resource.monitor-hello and -goodbye RPCs.

Consolidate into resource.monitor-reduce, containing up and down
idsets, to make the code more concise.

While consolidating, fix bug where quickly reloading the resource
module could result in goodbye, hello messages arriving out of order,
leaving rank marked down.
Replace rutil_idset_decode_add() with rutil_idset_decode_test()

Update unit tests.
Temporarily add local xml_topology_get() function while it's
missing from librlist/rhwloc.
Add some utility functions for reading file content as a string,
reading file content as JSON, and creating a JSON object from a directory
containing <rank>.xml files.

Add some unit tests.
Add drain_rank() function to be called on rank 0 only when
a node needs to be drained by another part of the resource module.
Add inventory.c, which is a container for resources known to the
resource module.  It adds methods for populating resources from
configuration (file pointed to from TOML), from R in the enclosing
instance, and from discovery.

The resource module commits R to the KVS as 'resource.R'.  This is used
on a restart unless the resources are defined via TOML config, in which
case it is ignored.

It expects R version 1 instead of the by_rank resource object format.
Other parts of the resource module such as acquire.c and discover.c are
adjusted accordingly.

Fake resources may be loaded in test using the resource.reload RPC.

Add a resource.get-xml RPC (rank 0 only) which fetches a fixed-length
array of XML strings for ranks 0...size-1.  If XML is not yet in
inventory, it blocks until it arrives.

Sched-simple is adjusted to acquire resources in Rv1 form.

Tests are updated.
Add 'flux resource reload' subcommand which allows dummy
resources to be loaded into the resource module for testing.

Usage: flux resource reload path [xmldir]
Add topo.c module which directly constructs R_local from the
hwloc topology.  A candidate R and the topology XML are reduced and
made available on rank 0.

If R is known at module load time (e.g. from config, enclosing instance,
or resource.R in the KVS) then it is verified against the topology,
and the rank is drained if there is a problem.  The 'noverify' module
option disables resource verification.

Drop discover.c which outsourced discovery to 'flux hwloc discover'.

Drop the now-unused callback from monitor.c.

Update sharness tests that set fake resources to load resource
with the noverify option.
Add an RPC that responds when the requested count of ranks
are online.
Problem: the 'sched-PUs' module option has no effect after
the conversion from by_rank to Rv1.

Drop the 'sched-PUs' option and update tests.
Split t2310-resource-module test script out into smaller scripts:
  t2310-resource-module.t
  t2311-resource-drain.t
  t2312-resource-exclude.t
  t2313-resource-acquire.t
  t2314-resource-monitor.t
  t2315-resource-system.t

Don't reload resource module after changing exclusions
because this is not needed.

Don't load aggregator module in the 'job' personality,
as resources are now dynamically discovered by the resource
module directly.

Use flux resource instead of cobbled-together shell functions
for drain/undrain.

Replace ad-hoc wait_event() shell function with newly
added 'flux kvs eventlog wait-event' command.

Add coverage for 'flux resource reload'.
Add coverage for topo validation failure.
Add coverage for configured R.
Add t/resource/get-xml-test.py to test get-xml RPC behavior
Add t/resource/waitup.py to test monitor-waitup RPC

Co-authored-by: Mark A. Grondona <mark.grondona@gmail.com>

testsuite: t2314-resource-monitor.t: add waitup.py test script

Add test for resource.monitor-waitup RPC
The t2701-mini-batch.t sharness test was hanging at one point during
development, but only in travis.  That problem is resolved but adding
run_timeout to several blocking calls will ensure that future travis
hangs are promoted to hard failures.
Problem: valgrind test occasionally fails with 30 retries
on the local connector.

The timeout is in rc1 on the first access to the broker.
Raise it to 40 retries.

N.B. the retry time starts at 0.016s and doubles each time until
it reaches 2s, then it's 2s each up to the maximum.  So times are:
  0.016, 0.032, 0.064, 0.128, 0.256, 0.512, 1.024, 2, 2, 2, ....
Therefore 30 retries is about 26s and 40 retries is about 46s.
@grondo
Contributor

grondo commented Nov 4, 2020

Ok, I've force-pushed an update addressing @chu11's comments.

@codecov-io

Codecov Report

Merging #3265 into master will decrease coverage by 0.06%.
The diff coverage is 75.83%.

@@            Coverage Diff             @@
##           master    #3265      +/-   ##
==========================================
- Coverage   81.84%   81.77%   -0.07%     
==========================================
  Files         300      301       +1     
  Lines       46539    46948     +409     
==========================================
+ Hits        38092    38394     +302     
- Misses       8447     8554     +107     
Impacted Files Coverage Δ
src/modules/resource/topo.c 63.49% <63.49%> (ø)
src/modules/sched-simple/sched.c 74.29% <66.66%> (+2.42%) ⬆️
src/modules/resource/acquire.c 66.66% <68.18%> (+0.44%) ⬆️
src/modules/resource/rutil.c 82.39% <75.75%> (+2.73%) ⬆️
src/modules/resource/inventory.c 77.52% <77.52%> (ø)
src/modules/resource/drain.c 75.89% <80.00%> (-0.75%) ⬇️
src/modules/resource/monitor.c 81.63% <80.00%> (-5.35%) ⬇️
src/modules/resource/resource.c 81.20% <86.66%> (+0.16%) ⬆️
src/cmd/flux-resource.py 88.67% <100.00%> (-0.05%) ⬇️
src/common/libidset/idset_decode.c 98.33% <100.00%> (-0.08%) ⬇️
... and 18 more

@garlick
Member Author

garlick commented Nov 4, 2020

Nice, thanks guys!

I restarted the flux-sched test since it appeared to be hanging

I'm not sure whether I noticed hangs before, but the flux-sched test should not be expected to pass until we merge flux-framework/flux-sched#766.

Think this is ready for MWP?

@grondo
Contributor

grondo commented Nov 4, 2020

Yeah, though MWP won't work since flux-sched checks won't pass. This will need a manual merge after removal of branch protections.

@grondo
Contributor

grondo commented Nov 4, 2020

I suppose I can go ahead and do that.

@grondo grondo merged commit 42c0237 into flux-framework:master Nov 4, 2020
@grondo
Contributor

grondo commented Nov 4, 2020

Ok, I removed branch protections and manually merged. Huzzah!

@garlick
Member Author

garlick commented Nov 4, 2020 via email

@garlick garlick deleted the resource_topo2 branch November 4, 2020 14:11