resource: load resources from config or R, and rework topo discovery #3265
Conversation
OK, just forced a push after some refactoring and a rebase on current master. All the … The hwloc-specific stuff is relegated to … Resources are now read/written from … hwloc XML is collected at rank 0 and written to the usual place at … This should be relatively straightforward now to convert to Rv1. Find the …

Which reminded me: Fluxion does look at the resource object to extract the initial idset of ranks for a grow operation (which it then satisfies with the hwloc XML data). Uh oh.
Just posted a fixup to store the hwloc XML data as a JSON string rather than the bare XML string, since this is what … I'll post a small patch to Fluxion shortly that fixes up the test suite to track the changes in this PR.
Forced a push, adding … This can be used to load test resources without reloading the resource module.

The resource object may not contain ranks that are not valid in the current instance (e.g. >= size), unless … Resources are not validated against the hwloc topology when loaded in this way. The command completes after the new … The scheduler still needs to be unloaded, so the sequence used in test is …
Yes, this is true. Fluxion does not use the overall resource object, but it does need to fetch the idset.
@garlick: If there are things that Fluxion does now that slow down this conversion, you shouldn't shy away from calling them out, even if that requires some substantial changes.
Is this still a problem? So the …
Since the above description, XML is now collected by the resource module and stored in the KVS, as before. There are some small modifications to Fluxion proposed in flux-framework/flux-sched#766 to track the changes in this PR. One change is to not load XML with KVS_FLAG_WAITCREATE, so if the XML is missing, it's a hard failure, not a deadlock.

For the system instance (case 1), in the short term, we can initially not create a resource configuration and boot the system instance with all nodes up once. This will populate XML in the KVS. Then on subsequent restarts it will be reloaded from there, and down nodes can be tolerated. We can improve upon that by embedding JGF in the opaque scheduler section of a resource configuration and bootstrapping the system instance from that. Fluxion would need support for helping generate the resource configuration, and for bootstrapping from the JGF it gets back from …

For a sub-job (case 2) we need to discuss options - see #3228 (comment). In summary: the XML will not be there as of now. It would be nice if, minimally, there was a capability for Fluxion to bootstrap from Rv1 alone. We could also perhaps provide a hint to the resource module that it should collect XML in this case. Passing through JGF and reranking seems hard, and might not be worth it in the Rv1 time frame (best to move on to Rv2).
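To illustrate the KVS_FLAG_WAITCREATE point above, here is a minimal, hypothetical sketch of a KVS lookup done without that flag, so a missing key fails immediately with ENOENT instead of blocking. The key path and helper name are assumptions for illustration only, not Fluxion's actual code:

```c
#include <stdio.h>
#include <errno.h>
#include <flux/core.h>

/* Hypothetical helper: look up one rank's XML without FLUX_KVS_WAITCREATE.
 * If the key does not exist, flux_kvs_lookup_get () fails right away
 * (errno == ENOENT) rather than waiting for the key to be created.
 */
static int lookup_rank_xml (flux_t *h, int rank)
{
    flux_future_t *f;
    const char *xml;
    char key[64];

    /* Key path is an assumption for illustration. */
    snprintf (key, sizeof (key), "resource.hwloc.xml.%d", rank);

    if (!(f = flux_kvs_lookup (h, NULL, 0, key)))   /* flags = 0: no WAITCREATE */
        return -1;
    if (flux_kvs_lookup_get (f, &xml) < 0) {
        fprintf (stderr, "lookup %s: %s\n", key, flux_strerror (errno));
        flux_future_destroy (f);
        return -1;
    }
    /* ... use xml (valid until the future is destroyed) ... */
    flux_future_destroy (f);
    return 0;
}
```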
Just pushed a change to drain nodes on R verification failure, and related tests.
Just got this failure in travis. I was not expecting this to be racy, but possibly the async disconnect message from unloading sched-simple is not getting there before resource.acquire when it's being loaded? Hmm, if that's going to be a problem in test then we might want to use an explicit cancel.
This was rebased on top of #3276, and everything converted to Rv1.
did a not-super-thorough skim and caught a few things, nothing big.
src/common/librlist/rlist.c
-    if (version != 1)
+    if (version != 1) {
+        if (errp)
+            sprintf (errp->text, "invalid version=%d", version);
why not snprintf()? I think some checkers (perhaps not the ones we use since this wasn't flagged) will mark use of sprintf() as bad.
the buffer is known to be larger than the maximum possible error string (even if version is 12 figures)
Although, you're right, why not use snprintf? I'll update.
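For the record, a self-contained sketch of the snprintf() form being suggested; the struct below is illustrative only, not the actual librlist error type:

```c
#include <stdio.h>

/* Illustrative error type; the real librlist error buffer may differ. */
struct example_error {
    char text[160];
};

static void set_version_error (struct example_error *errp, int version)
{
    if (errp)
        snprintf (errp->text, sizeof (errp->text),
                  "invalid version=%d", version);   /* bounded, unlike sprintf () */
}
```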
src/modules/resource/monitor.c
    if (flux_request_unpack (msg, NULL, "{s:i}", "up", &up) < 0) {
        if (flux_respond_error (h, msg, errno, NULL) < 0)
            flux_log_error (h, "error resonding to monitor-waitup request");
typo resonding
src/modules/resource/monitor.c
    }
    if (up == count) {
        if (flux_respond (h, msg, NULL) < 0)
            flux_log_error (h, "error resonding to monitor-waitup request");
typo resonding
    msg = zlist_first (monitor->waiters);
    while (msg) {
        if (notify_one_waiter (monitor->ctx->h, count, msg)) {
            zlist_remove (monitor->waiters, (void *)msg);
I don't think zlist_remove() is safe while iterating a zlist.
I asked about this as well; note that the function returns if any item is removed from the list, so this is safe. (I missed that as well; maybe a comment would help?)
Ahh, I see. Yeah, a comment would be good.
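For context, a sketch of how such a comment might read in the loop excerpted above. The trailing zlist_next() call and the early return are taken from the discussion (the function returns after any removal), not copied verbatim from the source:

```c
/* N.B. zlist_remove () is not generally safe while iterating a zlist,
 * but this loop returns immediately after removing an entry, so the
 * iteration cursor is never used after the removal.
 */
msg = zlist_first (monitor->waiters);
while (msg) {
    if (notify_one_waiter (monitor->ctx->h, count, msg)) {
        zlist_remove (monitor->waiters, (void *)msg);
        return;
    }
    msg = zlist_next (monitor->waiters);
}
```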
Problem: rlist_from_json() does not fill in the error text field on a version mismatch, making it difficult for the caller to determine what went wrong when the error is due to the version. Add a short error message on version mismatch to rlist_from_json().
Problem: A segfault is triggered if rlist_to_R() is called with a NULL argument. Protect against this occurrence by checking for NULL rlist argument and returning NULL instead of segfaulting.
Problem: if posting the drain event fails when only the reason is being updated, the node could be left undrained. Drop the incorrect attempt at remediation when this unlikely error occurs.
Problem: adding RPCs to the resource module that are called on ranks other than 0 will result in disconnect messages on those ranks, but the disconnect handler will segfault if called on rank > 0. Don't call acquire_disconnect() if ctx->acquire is NULL.
Problem: there is duplication of code between the resource.monitor-hello and -goodbye RPCs. Consolidate into resource.monitor-reduce, containing up and down idsets, to make the code more concise. While consolidating, fix a bug where quickly reloading the resource module could result in goodbye and hello messages arriving out of order, leaving the rank marked down.
Replace rutil_idset_decode_add() with rutil_idset_decode_test(). Update unit tests.
Temporarily add local xml_topology_get() function while it's missing from librlist/rhwloc.
Add some utility functions for reading file content as a string, reading file content as JSON, and creating a JSON object from a directory containing <rank>.xml files. Add some unit tests.
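As a rough, hypothetical illustration of the "read file content as JSON" case (the actual rutil helper names and signatures are not shown here), such a helper could be a thin wrapper over jansson:

```c
#include <stdio.h>
#include <jansson.h>

/* Hypothetical helper: read a file and parse it as JSON, logging parse errors.
 * Caller owns the returned reference (json_decref () when done).
 */
static json_t *read_json_file (const char *path)
{
    json_error_t error;
    json_t *o = json_load_file (path, 0, &error);

    if (!o)
        fprintf (stderr, "%s: line %d: %s\n", path, error.line, error.text);
    return o;
}
```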
Add drain_rank() function to be called on rank 0 only when a node needs to be drained by another part of the resource module.
Add inventory.c, which is a container for resources known to the resource module. It adds methods for populating resources from configuration (a file pointed to from TOML), from R in the enclosing instance, and from discovery. The resource module commits R to the KVS as 'resource.R'. This is used on a restart unless the resources are defined via TOML config, in which case it is ignored. It expects R version 1 instead of the by_rank resource object format. Other parts of the resource module such as acquire.c and discover.c are adjusted accordingly. Fake resources may be loaded in test using the resource.reload RPC. Add a resource.get-xml RPC (rank 0 only) which fetches a fixed-length array of XML strings for ranks 0...size-1. If XML is not yet in inventory, it blocks until it arrives. Sched-simple is adjusted to acquire resources in Rv1 form. Tests are updated.
Add 'flux resource reload' subcommand which allows dummy resources to be loaded into the resource module for testing. Usage: flux resource reload path [xmldir]
Add topo.c module which directly constructs R_local from the hwloc topology. A candidate R and topology XML are reduced and made available on rank 0. If R is known at module load time (e.g. from config, enclosing instance, or resource.R in the KVS) then it is verified against the topology, and the rank is drained if there is a problem. The 'noverify' module option disables resource verification. Drop discover.c which outsourced discovery to 'flux hwloc discover'. Drop the now-unused callback from monitor.c. Update sharness tests that set fake resources to load resource with the noverify option.
Add an RPC that responds when the requested number of ranks is online.
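As a concrete but hypothetical illustration of how a client might use this, based on the "up" field visible in the monitor-waitup handler excerpted earlier in the review (targeting rank 0 is an assumption):

```c
#include <flux/core.h>

/* Hypothetical client helper: block until at least 'nranks' ranks are up.
 * The topic string and "up" key follow the monitor-waitup handler shown
 * above; the target rank (0) is an assumption.
 */
static int wait_for_ranks (flux_t *h, int nranks)
{
    flux_future_t *f;
    int rc = -1;

    if (!(f = flux_rpc_pack (h, "resource.monitor-waitup", 0, 0,
                             "{s:i}", "up", nranks)))
        return -1;
    if (flux_rpc_get (f, NULL) == 0)   /* response arrives once enough ranks are up */
        rc = 0;
    flux_future_destroy (f);
    return rc;
}
```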
Problem: the 'sched-PUs' module option has no effect after the conversion from by_rank to Rv1. Drop the 'sched-PUs' option and update tests.
Split t2310-resource-module test script out into smaller scripts: t2310-resource-module.t, t2311-resource-drain.t, t2312-resource-exclude.t, t2313-resource-acquire.t, t2314-resource-monitor.t, t2315-resource-system.t. Don't reload resource module after changing exclusions because this is not needed. Don't load aggregator module in the 'job' personality, as resources are now dynamically discovered by the resource module directly. Use flux resource instead of cobbled-together shell functions for drain/undrain. Replace ad-hoc wait_event() shell function with newly added 'flux kvs eventlog wait-event' command. Add coverage for 'flux resource reload'. Add coverage for topo validation failure. Add coverage for configured R. Add t/resource/get-xml-test.py to test get-xml RPC behavior. Add t/resource/waitup.py to test monitor-waitup RPC. Co-authored-by: Mark A. Grondona <mark.grondona@gmail.com>

testsuite: t2314-resource-monitor.t: add waitup.py test script. Add test for resource.monitor-waitup RPC.
The t2701-mini-batch.t sharness test was hanging at one point during development, but only in travis. That problem is resolved but adding run_timeout to several blocking calls will ensure that future travis hangs are promoted to hard failures.
Problem: valgrind test occasionally fails with 30 retries on the local connector. The timeout is in rc1 on the first access to the broker. Raise it to 40 retries. N.B. the retry time starts at 0.016s and doubles each time until it reaches 2s, then it's 2s each up to the maximum. So times are: 0.016, 0.032, 0.064, 0.128, 0.256, 0.512, 1.024, 2, 2, 2, .... Therefore 30 retries is about 26s and 40 retries is about 46s.
Ok, I've force-pushed an update addressing @chu11's comments.
Codecov Report
@@ Coverage Diff @@
## master #3265 +/- ##
==========================================
- Coverage 81.84% 81.77% -0.07%
==========================================
Files 300 301 +1
Lines 46539 46948 +409
==========================================
+ Hits 38092 38394 +302
- Misses 8447 8554 +107
Nice, thanks guys!
I'm not sure if I noticed whether we had hangs or not before, but the flux-sched test should not be expected to pass until we merge flux-framework/flux-sched#766. Think this is ready for MWP?
Yeah, though MWP won't work since flux-sched checks won't pass. This will need a manual merge after removal of branch protections.
I suppose I can go ahead and do that.
Ok, I removed branch protections and manually merged. Huzzah!
Yay!!!
This PR takes a step towards the architecture proposed in #3238.
The "resource object is still
hwloc.by_rank
and theresource.acquire
RPC is unchanged in this PR.A TOML config option is added to read the resource object from a file, e.g.:
Code is added to fetch
R
from the enclosing instance, although conversion to a usable resource object is stubbed out for now.If the resource object is known by one of these static methods, or because
resource.hwloc.by_rank
exists in the KVS when the resource module is loaded, then each broker validates its local shard of the resource object using hwloc (directly, not calling out toflux hwloc
). If validation fails, currently an error is logged and the resource module fails to load. This needs to be changed to drain the node before this is merged.If the resource object could not be determined up front, then resources are "discovered" as before, except the resource module now reduces the shards to a full object itself, rather than calling out to
flux hwloc reload
. This may avoid some entanglements whenflux hwloc
is not finished when the instance exits as noted in #3235.Resources can be faked by pre-populating
resource.hwloc.by_rank
in the KVS and loading the resource module with thenoverify
option. It is no longer necessary to post an event to make this work.One side effect of not using
flux hwloc reload
is thatresource.hwloc.xml
is not populated. However, aresource.topo-get-xml
was added to allow it to be fetched directly from each rank on demand. (Fluxion currently fetches each rank synchronously from the KVS in a loop so it should be relatively easy to replace the KVS lookups with this RPC).[1]This is a WIP because it needs more tests, and it's still under discussion whether conversion to Rv1 should happen before or after this PR.
Edit: [1] Fluxion waits to fetch XML until
resource.acquire
returns. When resources are discovered, that implies that all ranks are up. However when resources are configured, acquire can return before all ranks are up, so this workaround for Fluxion may need more thought.