resource: load resources from config or R, and rework topo discovery #3265

Merged
merged 20 commits into flux-framework:master from resource_topo2 on Nov 4, 2020

Conversation

garlick
Member

@garlick garlick commented Oct 13, 2020

This PR takes a step towards the architecture proposed in #3238.

The "resource object is still hwloc.by_rank and the resource.acquire RPC is unchanged in this PR.

A TOML config option is added to read the resource object from a file, e.g.:

[resource]
path = "/etc/flux/resource.json"

Code is added to fetch R from the enclosing instance, although conversion to a usable resource object is stubbed out for now.

If the resource object is known by one of these static methods, or because resource.hwloc.by_rank exists in the KVS when the resource module is loaded, then each broker validates its local shard of the resource object using hwloc (directly, not calling out to flux hwloc). If validation fails, currently an error is logged and the resource module fails to load. This needs to be changed to drain the node before this is merged.

If the resource object could not be determined up front, then resources are "discovered" as before, except the resource module now reduces the shards to a full object itself, rather than calling out to flux hwloc reload. This may avoid some entanglements when flux hwloc is not finished when the instance exits as noted in #3235.

Resources can be faked by pre-populating resource.hwloc.by_rank in the KVS and loading the resource module with the noverify option. It is no longer necessary to post an event to make this work.
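A minimal sketch of that sequence, assuming a hypothetical fake-by-rank.json file containing the by_rank object (exact flux kvs put quoting may vary):

$ flux kvs put resource.hwloc.by_rank="$(cat fake-by-rank.json)"
$ flux module load resource noverify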

One side effect of not using flux hwloc reload is that resource.hwloc.xml is not populated. However, a resource.topo-get-xml RPC was added to allow the XML to be fetched directly from each rank on demand. (Fluxion currently fetches each rank's XML synchronously from the KVS in a loop, so it should be relatively easy to replace the KVS lookups with this RPC.)[1]

This is a WIP because it needs more tests, and it's still under discussion whether conversion to Rv1 should happen before or after this PR.

Edit: [1] Fluxion waits to fetch XML until resource.acquire returns. When resources are discovered, that implies that all ranks are up. However when resources are configured, acquire can return before all ranks are up, so this workaround for Fluxion may need more thought.

@lgtm-com

lgtm-com bot commented Oct 13, 2020

This pull request introduces 2 alerts when merging cd30eec into d6cddb3 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@garlick
Member Author

garlick commented Oct 14, 2020

OK, just forced a push after some refactoring and a rebase on current master.

All the by_rank specific stuff should now be localized in byrank.[ch].

The hwloc-specific stuff is relegated to rhwloc.[ch], which was plucked from the future (@grondo's forthcoming librlist work). One function was added to get XML from a hwloc_topology_t.

Resources are now read/written from resource.R in the KVS rather than the by_rank specific resource.hwloc.by_rank.

hwloc xml is collected at rank 0 and written to the usual place at resource.hwloc.xml.<rank>. If resources are dynamically discovered, this is guaranteed to be in place by the time the first resource.acquire response is sent. It will not be present if resources were configured from TOML or grabbed from R in the enclosing instance. Hopefully this provides a bridge for Fluxion in the situations where we are using it now.
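As an illustration, once discovery completes the stored objects can be inspected with something like:

$ flux kvs get resource.R
$ flux kvs get resource.hwloc.xml.0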

It should now be relatively straightforward to convert to Rv1: find the byrank_* function calls and replace that code, then drop rhwloc. This PR doesn't do anything with the acquire interface to move towards the proposed RFC 28 changes, but the byrank-specific bits there are minimal. For the most part the "resource object" is passed through without modification, except for some code that calls byrank_resobj_sub() to remove ranks that are excluded from the initial response.

Which reminded me - Fluxion does look at the resource object to extract the initial idset of ranks for a grow operation (which it then satisfies with the hwloc XML data). Uh oh.

@lgtm-com

lgtm-com bot commented Oct 14, 2020

This pull request introduces 2 alerts when merging e8e1429 into 761564a - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@garlick
Member Author

garlick commented Oct 15, 2020

Just posted a fixup to store the hwloc XML data as a JSON string rather than the bare XML string, since this is what flux hwloc reload does, and is what Fluxion expects when it parses the KVS values.

I'll post a small patch to Fluxion shortly that fixes up the test suite to track the changes in this PR.

@lgtm-com

lgtm-com bot commented Oct 15, 2020

This pull request introduces 2 alerts when merging 999f0d5 into 7f851e1 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@lgtm-com

lgtm-com bot commented Oct 16, 2020

This pull request introduces 2 alerts when merging 72a2086 into 755b415 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@garlick
Member Author

garlick commented Oct 18, 2020

Forced a push, adding flux resource reload [--xml] [--force] PATH.

This can be used to load test resources without reloading the resource module.

PATH can point to a file containing the JSON resource object, or if --xml is specified, a directory containing <rank>.xml files. If the latter, the resource object is derived from the hwloc XML.

The resource object may not contain ranks that are not valid in the current instance (e.g. >= size), unless --force is used.

Resources are not validated against the hwloc topology when loaded in this way.

The command completes after the new resource.R (and optionally resource.hwloc.xml) is committed to the KVS.
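For example (paths here are hypothetical; options as described above):

$ flux resource reload /tmp/R.json
$ flux resource reload --xml /tmp/xml-dir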

The scheduler still needs to be unloaded, so the sequence used in tests is:

$ flux module remove scheduler
$ flux resource reload [args]
$ flux module load scheduler

@lgtm-com

lgtm-com bot commented Oct 18, 2020

This pull request introduces 2 alerts when merging 9d126dd into 755b415 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@dongahn
Member

dongahn commented Oct 19, 2020

Which reminded me - Fluxion does look at the resource object to extract the initial idset of ranks for a grow operation (which it then satisfies with the hwloc XML data). Uh oh.

Yes, this is true. Fluxion does "not" use the overall resource object but it does need to fetch the idset.

@dongahn
Member

dongahn commented Oct 19, 2020

@garlick: If there are things that Fluxion does now that slow down this conversion, you shouldn't shy away from calling them out, even if that requires some substantial changes.

@dongahn
Member

dongahn commented Oct 19, 2020

Edit: [1] Fluxion waits to fetch XML until resource.acquire returns. When resources are discovered, that implies that all ranks are up. However when resources are configured, acquire can return before all ranks are up, so this workaround for Fluxion may need more thought.

Is this still a problem? So the acquire will return and hwloc xml files (or RPC equivalent) are not there?

@garlick
Member Author

garlick commented Oct 19, 2020

Edit: [1] Fluxion waits to fetch XML until resource.acquire returns. When resources are discovered, that implies that all ranks are up. However when resources are configured, acquire can return before all ranks are up, so this workaround for Fluxion may need more thought.

Is this still a problem? So the acquire will return and hwloc xml files (or RPC equivalent) are not there?

Since the above description, XML is now collected by the resource module and stored in the KVS, as before. There are some small modifications to Fluxion proposed in flux-framework/flux-sched#766 to track the changes in this PR. One change is to not load XML with KVS_FLAG_WAITCREATE, so if the XML is missing, it's a hard failure not a deadlock.

For the system instance (case 1) in the short short term, we can initially not create a resource configuration, and boot the system instance with all nodes up once. This will populate XML in the KVS. Then on subsequent restarts, it will be reloaded from there and down nodes can be tolerated.

We can improve upon that by embedding JGF in the opaque scheduler section of a resource configuration and bootstrap the system instance from that. Fluxion would need support for helping generate the resource configuration, and for bootstrapping from the JGF it gets back from resource.acquire.

For a sub-job (case 2) we need to discuss options - see #3228 (comment). In summary: The XML will not be there as of now. It would be nice if minimally there was a capability for Fluxion to bootstrap from Rv1 alone. We could also perhaps provide a hint to the resource module that it should collect XML in this case. Passing through JGF and reranking seems hard, and might not be worth it in Rv1 time frame (best to move on to Rv2).

@lgtm-com

lgtm-com bot commented Oct 20, 2020

This pull request introduces 2 alerts when merging 3206286 into 4fe56c7 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@lgtm-com

lgtm-com bot commented Oct 21, 2020

This pull request introduces 2 alerts when merging 009964f into 4fe56c7 - view on LGTM.com

new alerts:

  • 2 for FIXME comment

@garlick
Member Author

garlick commented Oct 22, 2020

Just pushed a change to drain nodes on R verification failure, and related tests.

@garlick garlick force-pushed the resource_topo2 branch 2 times, most recently from 2ae43d5 to b12965a on October 22, 2020 20:17
@garlick
Member Author

garlick commented Oct 22, 2020

Just got this failure in travis. I was not expecting this to be racy, but possibly the async disconnect message from unloading sched-simple is not getting there before resource.acquire when it's being loaded? Hmm, if that's going to be a problem in test then we might want to use an explicit cancel.

sched-simple: generate jobspec for simple test job

expecting success: 
	flux module unload sched-simple &&
	echo $hwloc_by_rank >R
	flux resource reload R &&
	flux module load sched-simple &&
	flux dmesg 2>&1 >reload.dmesg.log &&
	grep "ready:.*rank\[0-1\]/core\[0-1\]" reload.dmesg.log &&
	test_debug "echo result=\"$($query)\"" &&
	test "$($query)" = "rank[0-1]/core[0-1]"

2020-10-22T20:30:21.522533Z sched-simple.debug[0]: ready: 4 of 4 cores: rank[0-1]/core[0-1]
flux-resource: ERROR: [Errno 16] resources are busy (unload scheduler?)

@garlick
Member Author

garlick commented Oct 26, 2020

This was rebased on top of #3276, and everything converted to Rv1.

@garlick garlick added this to the flux-core 0.21.0 milestone Oct 26, 2020
Member

@chu11 chu11 left a comment

did a not-super-thorough skim and caught a few things, nothing big.

if (version != 1)
if (version != 1) {
if (errp)
sprintf (errp->text, "invalid version=%d", version);
Member

why not snprintf()? i think some checkers (perhaps not the ones we use since this wasn't flagged) will mark use of sprintf() bad.

Contributor

the buffer is known to be smaller than the maximum possible error string (even if version is 12 figures)

Contributor

Although, you're right, why not use snprintf? I'll update.


if (flux_request_unpack (msg, NULL, "{s:i}", "up", &up) < 0) {
if (flux_respond_error (h, msg, errno, NULL) < 0)
flux_log_error (h, "error resonding to monitor-waitup request");
Member

typo resonding

}
if (up == count) {
if (flux_respond (h, msg, NULL) < 0)
flux_log_error (h, "error resonding to monitor-waitup request");
Member

typo resonding

msg = zlist_first (monitor->waiters);
while (msg) {
if (notify_one_waiter (monitor->ctx->h, count, msg)) {
zlist_remove (monitor->waiters, (void *)msg);
Member

i don't think zlist_remove() is safe while iterating zlist.

Contributor

I asked about this as well, note that the function returns if any item is removed from the list, so this is safe. (I missed that as well, maybe a comment would help?)

Member

Ahh, i see. Yeah, a comment would be good.

@grondo
Contributor

grondo commented Nov 4, 2020

@garlick, I'll go ahead and force-push squashed fixups for the items @chu11 identified above.

Thanks @chu11!

grondo and others added 18 commits November 3, 2020 18:20
Problem: rlist_from_json() does not fill in the error text field
on a version mismatch, making it difficult for the caller to determine
what went wrong on an error due to version.

Add a short error message on version mismatch to rlist_from_json()
Problem: A segfault is triggered if rlist_to_R() is called with a NULL
argument.

Protect against this occurrence by checking for NULL rlist argument
and returning NULL instead of segfaulting.
Problem: if posting the drain event fails when only the reason
is being updated, the node could be left undrained.

Drop the incorrect attempt at remediation when this unlikely
error occurs.
Problem: adding RPCs to the resource module that are called
on ranks other than 0 will result in disconnect messages on
those ranks, but the disconnect handler will segfault if called
on rank > 0.

Don't call acquire_disconect() if ctx->acquire is NULL.
Problem: there is duplication of code between
resource.monitor-hello and -goodbye RPCs.

Consolidate into resource.monitor-reduce, containing up and down
idsets, to make the code more concise.

While consolidating, fix bug where quickly reloading the resource
module could result in goodbye, hello messages arriving out of order,
leaving rank marked down.
Replace rutil_idset_decode_add() with rutil_idset_decode_test()

Update unit tests.
Temporarily add local xml_topology_get() function while it's
missing from librlist/rhwloc.
Add some utility functions for reading file content as a string,
reading file content as JSON, and creating a JSON object from a directory
containing <rank>.xml files.

Add some unit tests.
Add drain_rank() function to be called on rank 0 only when
a node needs to be drained by another part of the resource module.
Add inventory.c, which is a container for resources known to the
resource module.  It adds methods for populating resources from
configuration (file pointed to from TOML), from R in the enclosing
instance, and from discovery.

The resource module commits R to the KVS as 'resource.R'.  This is used
on a restart unless the resources are defined via TOML config, in which
case it is ignored.

It expects R version 1 instead of the by_rank resource object format.
Other parts of the resource module such as acquire.c and discover.c are
adjusted accordingly.

Fake resources may be loaded in test using the resource.reload RPC.

Add a resource.get-xml RPC (rank 0 only) which fetches a fixed-length
array of XML strings for ranks 0...size-1.  If XML is not yet in
inventory, it blocks until it arrives.

Sched-simple is adjusted to acquire resources in Rv1 form.

Tests are updated.
Add 'flux resource reload' subcommand which allows dummy
resources to be loaded into the resource module for testing.

Usage: flux resource reload path [xmldir]
Add topo.c module which directly constructs R_local from the
hwloc topology.  A candidate R and the topology XML are reduced and
made available on rank 0.

If R is known at module load time (e.g. from config, enclosing instance,
or resource.R in the KVS) then it is verified against the topology,
and the rank is drained if there is a problem.  The 'noverify' module
option disables resource verification.

Drop discover.c which outsourced discovery to 'flux hwloc discover'.

Drop the now-unused callback from monitor.c.

Update sharness tests that set fake resources to load resource
with the noverify option.
Add an RPC that responds when the requested count of ranks
are online.
Problem: the 'sched-PUs' module option has no effect after
the conversion from by_rank to Rv1.

Drop the 'sched-PUs' option and update tests.
Split t2310-resource-module test script out into smaller scripts:
  t2310-resource-module.t
  t2311-resource-drain.t
  t2312-resource-exclude.t
  t2313-resource-acquire.t
  t2314-resource-monitor.t
  t2315-resource-system.t

Don't reload resource module after changing exclusions
because this is not needed.

Don't load aggregator module in the 'job' personality,
as resources are now dynamically discovered by the resource
module directly.

Use flux resource instead of cobbled-together shell functions
for drain/undrain.

Replace ad-hoc wait_event() shell function with newly
added 'flux kvs eventlog wait-event' command.

Add coverage for 'flux resource reload'.
Add coverage for topo validation failure.
Add coverage for configured R.
Add t/resource/get-xml-test.py to test get-xml RPC behavior
Add t/resource/waitup.py to test monitor-waitup RPC

Co-authored-by: Mark A. Grondona <mark.grondona@gmail.com>

testsuite: t2314-resource-monitor.t: add waitup.py test script

Add test for resource.monitor-waitup RPC
The t2701-mini-batch.t sharness test was hanging at one point during
development, but only in travis.  That problem is resolved but adding
run_timeout to several blocking calls will ensure that future travis
hangs are promoted to hard failures.
Problem: valgrind test occasionally fails with 30 retries
on the local connector.

The timeout is in rc1 on the first access to the broker.
Raise it to 40 retries.

N.B. the retry time starts at 0.016s and doubles each time until
it reaches 2s, then it's 2s each up to the maximum.  So times are:
  0.016, 0.032, 0.064, 0.128, 0.256, 0.512, 1.024, 2, 2, 2, ....
Therefore 30 retries is about 26s and 40 retries is about 46s.
@grondo
Contributor

grondo commented Nov 4, 2020

Ok, I've force-pushed an update addressing @chu11's comments.

@codecov-io

Codecov Report

Merging #3265 into master will decrease coverage by 0.06%.
The diff coverage is 75.83%.

@@            Coverage Diff             @@
##           master    #3265      +/-   ##
==========================================
- Coverage   81.84%   81.77%   -0.07%     
==========================================
  Files         300      301       +1     
  Lines       46539    46948     +409     
==========================================
+ Hits        38092    38394     +302     
- Misses       8447     8554     +107     
Impacted Files Coverage Δ
src/modules/resource/topo.c 63.49% <63.49%> (ø)
src/modules/sched-simple/sched.c 74.29% <66.66%> (+2.42%) ⬆️
src/modules/resource/acquire.c 66.66% <68.18%> (+0.44%) ⬆️
src/modules/resource/rutil.c 82.39% <75.75%> (+2.73%) ⬆️
src/modules/resource/inventory.c 77.52% <77.52%> (ø)
src/modules/resource/drain.c 75.89% <80.00%> (-0.75%) ⬇️
src/modules/resource/monitor.c 81.63% <80.00%> (-5.35%) ⬇️
src/modules/resource/resource.c 81.20% <86.66%> (+0.16%) ⬆️
src/cmd/flux-resource.py 88.67% <100.00%> (-0.05%) ⬇️
src/common/libidset/idset_decode.c 98.33% <100.00%> (-0.08%) ⬇️
... and 18 more

@garlick
Member Author

garlick commented Nov 4, 2020

Nice, thanks guys!

I restarted the flux-sched test since it appeared to be hanging

I'm not sure whether I noticed hangs before, but the flux-sched test should not be expected to pass until we merge flux-framework/flux-sched#766.

Think this is ready for MWP?

@grondo
Contributor

grondo commented Nov 4, 2020

Yeah, though MWP won't work since flux-sched checks won't pass. This will need a manual merge after removal of branch protections.

@grondo
Contributor

grondo commented Nov 4, 2020

I suppose I can go ahead and do that.

@grondo grondo merged commit 42c0237 into flux-framework:master Nov 4, 2020
@grondo
Contributor

grondo commented Nov 4, 2020

Ok, I removed branch protections and manually merged. Huzzah!

@garlick
Member Author

garlick commented Nov 4, 2020 via email

@garlick garlick deleted the resource_topo2 branch November 4, 2020 14:11