add flux.taskmap to PMI kvs for better cyclic task distribution scalability #4798
Conversation
Looks good, just some minor comments.
It's a little bit unfortunate but the broker doesn't use the PMI-1 API so it doesn't get a benefit here. It fetches PMI_process_mapping directly through an abstract interface offered by broker/pmiutil.c that supports various styles of access including pmix. Maybe pmiutil needs a function for retrieving a struct taskmap, and for starters, if the internal method is the simple client, it could just pull that through the aux container. Possibly another PR?
@@ -437,6 +437,22 @@ static int set_flux_instance_level (struct shell_pmi *pmi)
    return rc;
In commit message for "...PMI key to RFC 34 taskmap":
s/with/will/
s/cuases/causes/
src/common/libpmi/pmi2.h
/* PMI2_MAX_VALLEN of 1024 is a de-facto standard and should not be
 * increased. Experimentally increasing to 2048 was shown to cause
 * crashes with many MPI implementations.
Maybe "mpich and mvapich2" would be more accurate than "many MPI implementations".
src/common/libpmi/simple_server.h
/* Maximum size of a PMI KVS value. One might be tempted to increase
 * this number to hold larger values, for example to hold an encoded
 * PMI_process_mapping with a large count of tasks per node. However,
* experimentally, most MPI implementations do not handle a larger |
Maybe s/most/some/ and add a reference to #4796
Ok, sorry, all of the mpich implementations I tried had this issue, so I assumed only open-mpi does not have this limitation.
How about I use the same wording as suggested above, "mpich and mvapich2", if the superlative is bothersome?
Sure. I just thought 2 out of 3 (when one of those two is a derivative of the other) wasn't really "most" :-)
I was going off the definition of most as 'the majority of' and assumed 2 out of 3 was a majority. (But for all I know commercial MPIs don't have this problem so I can see how most was incorrect here)
Good point. OK, either way is fine with me.
Ah interesting, I did not know that and assumed the broker used PMI-1.
Ah, ok, I see what you are saying. It seems like a simple enough change that I'm ambivalent whether it should go in this PR or not.
We could save it for later. Not critical.
Yeah, it would be more complicated than I thought since we'd have to support all the bootstrap methods including simple_client, PMI-1, and pmix.
Ok, I opened #4800. I noted in there that the function to fetch a taskmap in simple_server would likely have to be exposed for this to work as proposed, since the broker currently doesn't (seem to) call any of the clique functions which internally cause the simple client to fetch and create the taskmap.
Actually we could just do this for now:

```diff
diff --git a/src/broker/boot_pmi.c b/src/broker/boot_pmi.c
index 2a5db68c6..bcbfb7949 100644
--- a/src/broker/boot_pmi.c
+++ b/src/broker/boot_pmi.c
@@ -94,12 +94,22 @@ static int set_broker_mapping_attr (struct pmi_handle *pmi,
     if (pmi_params->size == 1)
         val = strdup ("{\"version\":1,\"map\":[[0,1,1,1]]}");
     else {
+        /* First attempt to get flux.taskmap, falling back to
+         * PMI_process_mapping if this key is not available.
+         * This should be replaced when #4800 is fixed.
+         */
         if (broker_pmi_kvs_get (pmi,
                                 pmi_params->kvsname,
-                                "PMI_process_mapping",
+                                "flux.taskmap",
                                 buf,
                                 sizeof (buf),
-                                -1) == PMI_SUCCESS) {
+                                -1) == PMI_SUCCESS
+            || broker_pmi_kvs_get (pmi,
+                                   pmi_params->kvsname,
+                                   "PMI_process_mapping",
+                                   buf,
+                                   sizeof (buf),
+                                   -1) == PMI_SUCCESS) {
             val = pmi_mapping_to_taskmap (buf);
         }
     }
```
Oh duh 💡 that's a pretty simple solution.
Problem: When mapping tasks with cyclic task distribution and a high number of tasks per node, PMI_process_mapping will overflow the max PMI value length of 1024. This causes Flux's libpmi to be unable to calculate clique information. Add a flux.taskmap key to the job shell PMI kvs which encodes the task mapping in RFC 34 Task Map form, which can encode cyclic task mappings more compactly. When Flux is a PMI client, it will then be able to check for this key and thereby handle a wider range of mappings.
Problem: A PMI_process_mapping string for a cyclic task distribution of greater than about 126 tasks over more than one node will overflow the maximum PMI value length, and therefore clique information for task distributions of this type and size cannot be computed by libpmi. However, the Flux job shell also places an RFC 34 Task Map in the flux.taskmap key, which should be able to handle a wider variety of encodings. In the libpmi simple client, check for a flux.taskmap key and use that as the job taskmap instead of PMI_process_mapping when it exists.
Problem: For large task counts with a cyclic task mapping, the buffer used to encode clique ranks as a string is too small. Increase the size of the buffer so that the test utility will work for large numbers of tasks.
Problem: It is tempting to increase the PMI and PMI-2 maximum value lengths to fit large PMI_process_mapping values, but this has been shown to be futile since most, if not all, MPI implementations assume a 1024 value length, and encoding values that exceed this length has been shown to cause segfaults in MPI. Add comments that explain the situation and warn observers against changing the max PMI value length.
Problem: There is no test in the testsuite that ensures PMI clique information is available when running with cyclic task distribution and 128 tasks per node across more than a single node. Add a test to t3002-pmi.t which checks that clique ranks are as expected in this scenario.
Problem: The broker only uses PMI_process_mapping to set the broker.mapping attribute, but flux.taskmap is also available and will be less likely to overflow the maximum PMI value length for cyclic task distributions with lots of tasks per node. Check for flux.taskmap before PMI_process_mapping when attempting to set broker.mapping.
Ok, I rebased and made the suggested changes, including updating the broker to check for flux.taskmap first.
Cool, looks good!
Codecov Report
@@            Coverage Diff             @@
##           master    #4798      +/-   ##
==========================================
- Coverage   83.46%   83.46%   -0.01%
==========================================
  Files         415      415
  Lines       70753    70771      +18
==========================================
+ Hits        59056    59071      +15
- Misses      11697    11700       +3
Thanks! Setting MWP.
This PR adds the RFC 34 taskmap to the PMI kvs as flux.taskmap and changes libpmi to check for this value first before falling back to PMI_process_mapping, as suggested in #4796. This allows clique size and ranks to be calculated by libpmi even if PMI_process_mapping would overflow the 1K maximum value length, as is the case with 128 tasks per node across more than a single node.