
add flux.taskmap to PMI kvs for better cyclic task distribution scalability #4798

Merged
merged 6 commits into flux-framework:master from pmi-taskmap on Dec 7, 2022

Conversation

@grondo (Contributor) commented Dec 6, 2022

This PR adds the RFC 34 taskmap to the PMI kvs as flux.taskmap and changes libpmi to check for this value first before falling back to PMI_process_mapping as suggested in #4796. This allows clique size and ranks to be calculated by libpmi even if PMI_process_mapping would overflow the 1K maximum value length, as is the case with 128 tasks per node across more than a single node.
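For a sense of the size difference (the values below are illustrative, not taken from this PR): with 256 tasks distributed cyclically one at a time across 2 nodes, PMI_process_mapping must repeat a vector block for each round-robin pass, while the RFC 34 taskmap encodes the repetition count directly:

PMI_process_mapping:  (vector,(0,2,1),(0,2,1),(0,2,1),...)   <- one block per pass; 128 blocks overflow the 1K limit
flux.taskmap:         {"version":1,"map":[[0,2,1,128]]}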

@garlick (Member) left a comment:

Looks good, just some minor comments.

It's a little bit unfortunate but the broker doesn't use the PMI-1 API so it doesn't get a benefit here. It fetches PMI_process_mapping directly through an abstract interface offered by broker/pmiutil.c that supports various styles of access including pmix. Maybe pmiutil needs a function for retrieving a struct taskmap, and for starters, if the internal method is the simple client, it could just pull that through the aux container. Possibly another PR?
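For reference, one hypothetical shape such a pmiutil addition could take (the function name and comment below are illustrative only, not existing API):

/* Hypothetical addition to broker/pmiutil.h -- illustrative only.
 * Return the task map for this instance as a struct taskmap, or
 * NULL on failure.  When the internal method is the simple client,
 * the map could be pulled from the client's aux container; other
 * methods (dlopened libpmi, pmix) would need their own support.
 */
struct taskmap *broker_pmi_get_taskmap (struct pmi_handle *pmi);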

@@ -437,6 +437,22 @@ static int set_flux_instance_level (struct shell_pmi *pmi)
return rc;
Member:

In commit message for "...PMI key to RFC 34 taskmap":

s/with/will/
s/cuases/causes/


/* PMI2_MAX_VALLEN of 1024 is a de-facto standard and should not be
* increased. Experimentally increasing to 2048 was shown to cause
* crashes with many MPI implementations.
Member:

Maybe"mpich and mvapich2" would be more accurate than "many MPI implementations".

/* Maximum size of a PMI KVS value. One might be tempted to increase
* this number to hold larger values, for example to hold an encoded
* PMI_process_mapping with a large count of tasks per node. However,
* experimentally, most MPI implementations do not handle a larger
Member:

Maybe s/most/some/ and add a reference to #4796

Contributor Author:

Ok, sorry, all of the mpich implementations I tried had this issue, so I assumed only open-mpi does not have this limitation.
How about I use the same wording as suggested above, "mpich and mvapich2", if the superlative is bothersome?

Member:

Sure. I just thought 2 out of 3 (when one of those two is a derivative of the other) wasn't really "most" :-)

@grondo (Contributor Author) Dec 6, 2022:

I was going off the definition of "most" as "the majority of" and assumed 2 out of 3 was a majority. (But for all I know, commercial MPIs don't have this problem, so I can see how "most" was incorrect here.)

Member:

Good point. OK, either way is fine with me.

@grondo (Contributor Author) commented Dec 6, 2022

> It's a little bit unfortunate but the broker doesn't use the PMI-1 API so it doesn't get a benefit here. It fetches PMI_process_mapping directly through an abstract interface offered by broker/pmiutil.c that supports various styles of access including pmix.

Ah interesting, I did not know that and assumed the broker used PMI-1.

@grondo (Contributor Author) commented Dec 6, 2022

> Maybe pmiutil needs a function for retrieving a struct taskmap, and for starters, if the internal method is the simple client, it could just pull that through the aux container. Possibly another PR?

Ah, ok, I see what you are saying. It seems like a simple enough change that I'm ambivalent about whether it should go in this PR or not.

@garlick (Member) commented Dec 6, 2022

We could save it for later. Not critical.

@grondo (Contributor Author) commented Dec 6, 2022

Yeah, it would be more complicated than I thought since we'd have to support all the bootstrap methods including simple_client, PMI-1, and pmix.

@grondo (Contributor Author) commented Dec 6, 2022

Ok, I opened #4800. I noted in there that the function to fetch a taskmap in simple_server would likely have to be exposed for this to work as proposed, since the broker currently doesn't (seem to) call any of the clique functions which internally cause the simple client to fetch and create the taskmap.

@grondo (Contributor Author) commented Dec 6, 2022

Actually we could just do this for now:

diff --git a/src/broker/boot_pmi.c b/src/broker/boot_pmi.c
index 2a5db68c6..bcbfb7949 100644
--- a/src/broker/boot_pmi.c
+++ b/src/broker/boot_pmi.c
@@ -94,12 +94,22 @@ static int set_broker_mapping_attr (struct pmi_handle *pmi,
     if (pmi_params->size == 1)
         val = strdup ("{\"version\":1,\"map\":[[0,1,1,1]]}");
     else {
+        /* First attempt to get flux.taskmap, falling back to
+         * PMI_process_mapping if this key is not available.
+         * This should be replaced when #4800 is fixed.
+         */
         if (broker_pmi_kvs_get (pmi,
                                 pmi_params->kvsname,
-                                "PMI_process_mapping",
+                                "flux.taskmap",
                                 buf,
                                 sizeof (buf),
-                                -1) == PMI_SUCCESS) {
+                                -1) == PMI_SUCCESS
+            || broker_pmi_kvs_get (pmi,
+                                   pmi_params->kvsname,
+                                   "PMI_process_mapping",
+                                   buf,
+                                   sizeof (buf),
+                                   -1) == PMI_SUCCESS) {
             val = pmi_mapping_to_taskmap (buf);
         }
     }

@garlick (Member) commented Dec 6, 2022

Oh duh 💡 that's a pretty simple solution.

Problem: When mapping tasks with cyclic task distribution and a high
number of tasks per node, PMI_process_mapping will overflow the max
PMI value length of 1024. This causes Flux's libpmi to be unable to
calculate clique information.

Add a flux.taskmap key to the job shell PMI kvs which encodes the task
mapping in RFC 34 Task Map form, which can encode cyclic task mappings
more compactly. When Flux is a PMI client, it will then be able to
check for this key and thereby handle a wider range of mappings.

Problem: A PMI_process_mapping string for a cyclic task distribution
of greater than about 126 tasks over more than one node will overflow
the maximum PMI value length, and therefore clique information for
task distributions of this type and size cannot be computed by libpmi.
However, the Flux job shell also places an RFC 34 Task Map in the
flux.taskmap key, which should be able to handle a wider variety of
encodings.

In the libpmi simple client, check for a flux.taskmap key and use that
as the job taskmap instead of PMI_process_mapping when it exists.
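
A minimal sketch of that lookup order, expressed with the public PMI-1 API for illustration (the actual change is internal to libpmi's simple client and does not necessarily look like this):

#include <pmi.h>

/* Fetch a task map string into buf: prefer the compact RFC 34
 * encoding under flux.taskmap, then fall back to the traditional
 * PMI_process_mapping key.
 */
static int fetch_mapping (char *buf, int len)
{
    char kvsname[1024];

    if (PMI_KVS_Get_my_name (kvsname, sizeof (kvsname)) != PMI_SUCCESS)
        return -1;
    if (PMI_KVS_Get (kvsname, "flux.taskmap", buf, len) == PMI_SUCCESS
        || PMI_KVS_Get (kvsname, "PMI_process_mapping", buf, len) == PMI_SUCCESS)
        return 0;
    return -1;
}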

Problem: For large task counts with a cyclic task mapping, the
buffer used to encode clique ranks as a string is too small.

Increase the size of the buffer so that the test utility will work
for large numbers of tasks.

Problem: It is tempting to increase the PMI and PMI-2 max value
lengths in an attempt to fit large PMI_process_mapping values, but
this has been shown to be futile: most, if not all, MPI
implementations assume a 1024-byte value length, and encoding values
that exceed this length has been shown to cause segfaults in MPI.

Add comments that explain the situation and warn observers against
changing the max PMI value length.

Problem: There is no test in the testsuite that ensures PMI clique
information is available when running with cyclic task distribution
and 128 tasks per node across more than a single node.

Add a test to t3002-pmi.t which checks that clique ranks are as
expected in this scenario.

Problem: The broker only uses PMI_process_mapping to set the
broker.mapping attribute, but flux.taskmap is also available and will
be less likely to overflow the maximum PMI value length for cyclic
task distributions with lots of tasks per node.

Check for flux.taskmap before PMI_process_mapping when attempting to
set broker.mapping.

@grondo (Contributor Author) commented Dec 6, 2022

Ok, I rebased and made the suggested changes, including updating the broker to check for flux.taskmap. Not exactly sure if that change can be adequately tested.

@garlick (Member) left a comment:

Cool, looks good!

@codecov (bot) commented Dec 7, 2022

Codecov Report

Merging #4798 (6d330e7) into master (ec885f6) will decrease coverage by 0.00%.
The diff coverage is 95.23%.

@@            Coverage Diff             @@
##           master    #4798      +/-   ##
==========================================
- Coverage   83.46%   83.46%   -0.01%     
==========================================
  Files         415      415              
  Lines       70753    70771      +18     
==========================================
+ Hits        59056    59071      +15     
- Misses      11697    11700       +3     
Impacted Files Coverage Δ
src/shell/pmi/pmi.c 82.54% <92.85%> (+0.93%) ⬆️
src/broker/boot_pmi.c 68.07% <100.00%> (+0.12%) ⬆️
src/common/libpmi/simple_client.c 86.46% <100.00%> (+0.17%) ⬆️
src/common/libpmi/pmi.c 91.81% <0.00%> (-1.82%) ⬇️
src/cmd/builtin/proxy.c 79.38% <0.00%> (-0.88%) ⬇️
src/common/libsubprocess/local.c 84.28% <0.00%> (-0.48%) ⬇️
src/modules/cron/cron.c 82.47% <0.00%> (-0.45%) ⬇️
src/broker/overlay.c 85.58% <0.00%> (-0.41%) ⬇️
src/common/libflux/handle.c 83.49% <0.00%> (-0.39%) ⬇️
src/common/libsubprocess/subprocess.c 87.92% <0.00%> (-0.30%) ⬇️
... and 5 more

@grondo (Contributor Author) commented Dec 7, 2022

Thanks! setting MWP.

@mergify mergify bot merged commit 2a7597c into flux-framework:master Dec 7, 2022
@grondo grondo deleted the pmi-taskmap branch December 7, 2022 02:13