
WIP: export libpmi2.so.0.0.0 for MPICH derivatives configured --with-pm=slurm --with-pmi=pmi2 #2381

Merged: 13 commits from garlick:pmi2_emulate into flux-framework:master, Sep 19, 2019

Conversation

@garlick
Member

commented Sep 18, 2019

As discussed in #2347, this PR adds back the PMI-2 API implementation we used to have, in order for our existing mvapich2 configurations to continue to work with the new exec system. It's effectively a rewrite, and now includes unit tests for the API. This will need to be deployed along with an MPI "personality" that sets LD_LIBRARY_PATH such that our library gets picked up before Slurm's by dlopen() (see the sketch below).

Still todo: test that this actually works with our mvapich2 builds (the whole point!)
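To illustrate the dlopen() mechanism the personality relies on, here is a minimal sketch (not code from this PR; error handling trimmed): an MPICH-derived runtime resolves libpmi2.so.0 through the regular loader search order, so a directory placed first on LD_LIBRARY_PATH supplies the library.

```c
/* Minimal sketch (not from this PR): how an MPICH-style runtime binds
 * to a PMI-2 library at run time.  With LD_LIBRARY_PATH listing the
 * directory containing Flux's libpmi2.so.0 first, dlopen() resolves to
 * it instead of Slurm's copy in the system ld.so path. */
#include <dlfcn.h>
#include <stdio.h>

int main (void)
{
    void *dso = dlopen ("libpmi2.so.0", RTLD_NOW | RTLD_GLOBAL);
    if (!dso) {
        fprintf (stderr, "dlopen: %s\n", dlerror ());
        return 1;
    }
    /* standard PMI-2 entry point:
     * int PMI2_Init (int *spawned, int *size, int *rank, int *appnum) */
    int (*init) (int *, int *, int *, int *) =
        (int (*)(int *, int *, int *, int *))dlsym (dso, "PMI2_Init");
    int spawned, size, rank, appnum;
    if (init && init (&spawned, &size, &rank, &appnum) == 0)
        printf ("rank %d of %d\n", rank, size);
    dlclose (dso);
    return 0;
}
```

Built with `cc -ldl` (file name up to you), this can be run under Flux with and without the LD_LIBRARY_PATH override to confirm which library wins.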

@garlick garlick force-pushed the garlick:pmi2_emulate branch from 198aa60 to 7e0aac4 Sep 18, 2019
@garlick

Member Author

commented Sep 18, 2019

Some fixups squashed, and added commits to set the PMI "kvsname" to the Flux jobid in the shell instead of the string "pmi". The broker was already fetching the kvsname, so it no longer needs the flux.jobid key we added; I reverted that code.

@@ -324,7 +324,7 @@ int PMI_Spawn_multiple(int count,

 int PMI_Get_clique_ranks (int ranks[], int length)
 {
-    int result;
+    int result = PMI_ERR_INIT;

@chu11
Contributor
Sep 18, 2019

rebase / merge issue? It looks like you've added a useless initialization?

@garlick
Author Member
Sep 18, 2019

Uh oh. Thanks!

@garlick garlick force-pushed the garlick:pmi2_emulate branch 2 times, most recently from e4a2431 to 7a85e52 Sep 18, 2019
@garlick

Member Author

commented Sep 18, 2019

Fixed issue reported by @chu11 (and squashed). Don't know what happened there!

I was able to test on ipa with

$ # rm .notce and re-login
$ module load gcc
$ module load mvapich2
$ ./configure && make && make check # note failing t3000-mpi-basic.t
$ flux start -s4
bash-4.2$ ./flux srun -N 4 bash -c "LD_LIBRARY_PATH=../common/.libs:$LD_LIBRARY_PATH ../../t/mpi/hello"
[ipa15:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
0: completed MPI_Init in 0.260s.  There are 4 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.010s

so it worked but I need to run down the "Invalid my_local_id" error.

@garlick

Member Author

commented Sep 18, 2019

[smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast

It looks like this warning only occurs when we run >1 broker per node. It's not hard to imagine that hwloc based process binding code could become confused and give up in that case, since Flux doesn't attempt to subdivide resources across brokers.

It's not a problem with one broker per node across multiple nodes, so I think pmi2 is actually working as intended here.

garlick added 3 commits Sep 16, 2019
Add aux hash to pmi_simple_client, useful to the PMI2 API
implementation for caching the kvsname, since it is required
internally for kvs_get, but the PMI2 API doesn't allow it
to be passed in as PMI-1 does.
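A hedged sketch of the caching pattern this commit supports, written against the public PMI-1 calls for brevity (the real shim goes through pmi_simple_client and stores the name in the new aux hash):

```c
/* Illustrative sketch only: the PMI-1 interface needs a kvsname for
 * every get, but PMI-2 callers may pass NULL for the jobid, so the
 * shim fetches the name once at init time and reuses it. */
#include <stddef.h>

/* PMI-1 prototypes, normally from <pmi.h> */
int PMI_KVS_Get_my_name (char kvsname[], int length);
int PMI_KVS_Get (const char kvsname[], const char key[],
                 char value[], int length);

#define KVSNAME_MAX 64
static char cached_kvsname[KVSNAME_MAX];

int shim_init (void)
{
    /* fetch once; every later get reuses cached_kvsname */
    return PMI_KVS_Get_my_name (cached_kvsname, sizeof (cached_kvsname));
}

/* PMI-2 style get: callers may pass a NULL jobid, meaning "this job" */
int shim_kvs_get (const char *jobid, const char *key, char *val, int len)
{
    const char *name = (jobid != NULL) ? jobid : cached_kvsname;
    return PMI_KVS_Get (name, key, val, len);
}
```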
@garlick garlick force-pushed the garlick:pmi2_emulate branch from 7a85e52 to 960c748 Sep 19, 2019
@codecov-io


commented Sep 19, 2019

Codecov Report

Merging #2381 into master will decrease coverage by 0.01%.
The diff coverage is 81.81%.

@@            Coverage Diff             @@
##           master    #2381      +/-   ##
==========================================
- Coverage   80.94%   80.93%   -0.02%     
==========================================
  Files         221      222       +1     
  Lines       34819    34934     +115     
==========================================
+ Hits        28185    28273      +88     
- Misses       6634     6661      +27
Impacted Files                         Coverage Δ
src/common/libpmi/pmi.c                94.2%  <ø>      (-0.05%) ⬇️
src/broker/pmiutil.c                   71.79% <0%>     (+1.06%) ⬆️
src/shell/pmi.c                        76.64% <100%>   (-0.47%) ⬇️
src/broker/boot_pmi.c                  65.41% <100%>   (-0.26%) ⬇️
src/common/libpmi/simple_client.c      90%    <63.63%> (-1.72%) ⬇️
src/common/libpmi/pmi2.c               83.62% <83.62%> (ø)
src/modules/job-info/guest_watch.c     71.59% <0%>     (-2.28%) ⬇️
src/modules/connector-local/local.c    73.25% <0%>     (+0.14%) ⬆️
garlick added 9 commits Sep 16, 2019
Problem:  MPICH or derivatives configured --with-pm=slurm
--with-pmi=pmi2 will dlopen Slurm's libpmi2.so in the system
ld.so path even when launched by Flux.

The Slurm library notices PMI_FD set by Flux and attempts
wire protocol version negotiation.  It accepts Flux's request
to use protocol version 1, but then proceeds to use protocol
version 2 and becomes sad.

Provide a minimal libpmi2.so implemented on the pmi_simple_client
class that uses protocol version 1.  The strategy will be to
install it outside of the system ld.so path, then set LD_LIBRARY_PATH
in a shell "mpi personality" plugin (once we are ready to have
those) such that Slurm's libpmi2.so with the faulty version
negotiation is bypassed.

Note: going forward, it would be a better solution to support
protocol version 2 (and 1) in the pmi_simple_server class, but
that is a non-trivial undertaking not attempted here.
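For reference, a minimal sketch of the version 1 handshake this library performs on PMI_FD, with message formats taken from MPICH's "simple PMI" wire protocol (error handling trimmed):

```c
/* Hedged sketch: PMI-1 simple wire protocol version handshake over the
 * file descriptor advertised in PMI_FD.  A well-behaved server answers
 * "cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0"; Slurm's
 * libpmi2.so agrees to version 1 here but then speaks version 2
 * anyway, which is the bug being worked around. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int pmi_handshake (void)
{
    const char *s = getenv ("PMI_FD");
    if (!s)
        return -1;
    int fd = atoi (s);
    const char *req = "cmd=init pmi_version=1 pmi_subversion=1\n";
    if (write (fd, req, strlen (req)) < 0)
        return -1;
    char buf[256];
    ssize_t n = read (fd, buf, sizeof (buf) - 1);
    if (n <= 0)
        return -1;
    buf[n] = '\0';
    return strstr (buf, "rc=0") ? 0 : -1;
}
```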
As with libpmi.so, force the version to 0:0:0 since the intent
is to provide the stable, well-known libpmi2.so.0.0.0 ABI.

Add unit tests for the PMI-2 "canonical" API,
similar to the existing ones for the PMI-1 API.

Problem: the broker no longer uses the PMI "appnum", but
its pmiutil class still fetches it.

Drop broker code for fetching and accessing the appnum.
This will speed up bootstrap with PMI-1 ever so slightly.

Updated unit test.

Problem: kvsname is hardwired to the string "pmi".  Other PMI
implementations use the jobid or jobid.<step>.  Furthermore,
PMI-2 names this field "jobid".  Since we just added a 'flux.jobid'
local PMI key, perhaps we should conform and use this instead,
and avoid the extra request/response during Flux-in-flux bootstrap.

Use a string representation of the Flux jobid for the kvsname
when PMI is offered by the Flux shell.

Now that the PMI kvsname is known to be the Flux jobid
when Flux is the PMI provider, use it to populate the 'jobid'
attr instead of the special 'flux.jobid' KVS key, which can
now be removed.

Now that the broker can determine its jobid in the enclosing
instance from the PMI kvsname, drop the special PMI "flux.jobid"
local key.
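A hedged sketch of what these two commits make possible (the helper is hypothetical; the ltrace transcript later in this thread shows the kvsname is the decimal jobid string, e.g. "22920462073856"):

```c
#include <stdint.h>
#include <stdlib.h>
#include <errno.h>

/* PMI-1 prototype, normally from <pmi.h> */
int PMI_KVS_Get_my_name (char kvsname[], int length);

/* Hypothetical helper: derive the enclosing-instance jobid from the
 * kvsname, which the Flux shell now sets to the decimal jobid string. */
static int jobid_from_kvsname (uint64_t *idp)
{
    char name[64];
    char *end;

    if (PMI_KVS_Get_my_name (name, sizeof (name)) != 0)
        return -1;
    errno = 0;
    uint64_t id = strtoull (name, &end, 10);
    if (errno != 0 || end == name || *end != '\0')
        return -1;  /* kvsname is not a decimal Flux jobid */
    *idp = id;
    return 0;
}
```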
Problem: the pmi_info test program --library option no
longer works.

Our PMI client library no longer tries to dlopen foreign PMI
library DSOs, so this option should have been deprecated.
Drop the option.

Problem: the client-side PMI debug flag and the accompanying
DRETURN and DPRINTF macros in pmi.c are cumbersome, and
tracing is perhaps better left to tools like ltrace(1).

Drop the debug argument to pmi_simple_client_create(),
the ctx->debug flag in the client handle, and the macros
in pmi.c.

Update unit tests.

Note: the broker still prints some PMI debug output if the
FLUX_PMI_DEBUG env variable is set, such as whether it
is in singleton, dlopen, or wire proto mode.
@garlick garlick force-pushed the garlick:pmi2_emulate branch from 960c748 to 62889df Sep 19, 2019
@garlick

Member Author

commented Sep 19, 2019

Just forced a push that adds some minor cleanup tweaks and drops the client-side debugging macros, which were a mess and likely redundant with tools like ltrace(1). I'll go repeat my mvapich2 test with ltrace once I get to work and make sure we're not giving up any useful capability by dropping this.

@garlick

Member Author

commented Sep 19, 2019

Restarted builder that failed with

ERROR: lua/t0001-send-recv.t - exited with status 137 (terminated by signal 9?)

and another that failed in valgrind test with:

==22143== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
flux-start: 0 (pid 22142) Killed
not ok 1 - valgrind reports no new errors on 2 broker run

I'm guessing slow travis today?

@garlick

Member Author

commented Sep 19, 2019

I'll go repeat my mvapich2 test with ltrace once I get to work and make sure we're not giving up any useful capability by dropping this.

Yes, in addition to the server-side tracing that can currently be turned on by setting the shell verbose flag (it will eventually need to be a pmi plugin option, and probably log to a shell logging service), ltrace covers the builtin debugging more effectively than the macros that were in pmi.c. For example, with an appropriate ltrace.conf, a 2-node hello compiled with mvapich2 and running under Flux on ipa shows:

bash-4.2$ cat ../../ltrace-pmi.conf
int PMI2_Init (+int*, +int*, +int*, +int*);
int PMI2_Finalize ();

int PMI2_Job_GetId (+string, int);
int PMI2_Info_GetJobAttr (string, +string, int, +int*);

int PMI2_KVS_Put (string, string);
int PMI2_KVS_Get (string, int, string, +string, int, +int*);
int PMI2_KVS_Fence ();

bash-4.2$ flux srun -N2 bash -c "LD_LIBRARY_PATH=../common/.libs:$LD_LIBRARY_PATH ltrace -F ../../ltrace-pmi.conf -l 'libpmi2.so*' ../../t/mpi/hello"
libmpi.so.12->PMI2_Init(0, 2, 0, 0)              = 0
libmpi.so.12->PMI2_Job_GetId("22920462073856", 64) = 0
libmpi.so.12->PMI2_Job_GetId("22920462073856", 64) = 0
libmpi.so.12->PMI2_Init(0, 2, 1, 0)              = 0
libmpi.so.12->PMI2_Job_GetId("22920462073856", 64) = 0
libmpi.so.12->PMI2_Info_GetJobAttr("PMI_process_mapping", "(vector,(0,2,1))", 1024, 1) = 0
libmpi.so.12->PMI2_Job_GetId("22920462073856", 64) = 0
libmpi.so.12->PMI2_Info_GetJobAttr("PMI_process_mapping", "(vector,(0,2,1))", 1024, 1) = 0
libmpi.so.12->PMI2_KVS_Put("HOST-1", "-1463810704") = 0
libmpi.so.12->PMI2_KVS_Put("HOST-0", "-1463810960") = 0
libmpi.so.12->PMI2_KVS_Fence()                   = 0
libmpi.so.12->PMI2_KVS_Fence()                   = 0
libmpi.so.12->PMI2_KVS_Get("22920462073856", -1, "HOST-1", "-1463810704", 1024, 11) = 0
libmpi.so.12->PMI2_KVS_Get("22920462073856", -1, "HOST-0", "-1463810960", 1024, 11) = 0
libmpi.so.12->PMI2_KVS_Put("ip0", "63189-74492096") = 0
libmpi.so.12->PMI2_KVS_Put("ip1", "31319-91269312") = 0
libmpi.so.12->PMI2_KVS_Fence()                   = 0
libmpi.so.12->PMI2_KVS_Fence()                   = 0
libmpi.so.12->PMI2_KVS_Get("22920462073856", -1, "ip1", "31319-91269312", 1024, 14) = 0
libmpi.so.12->PMI2_KVS_Get("22920462073856", -1, "ip0", "63189-74492096", 1024, 14) = 0
libmpi.so.12->PMI2_KVS_Fence()                   = 0
libmpi.so.12->PMI2_KVS_Fence()                   = 0
0: completed MPI_Init in 0.687s.  There are 2 tasks
0: completed first barrier in 0.000s
libmpi.so.12->PMI2_KVS_Fence()                   = 0
libmpi.so.12->PMI2_KVS_Fence()                   = 0
libmpi.so.12->PMI2_KVS_Fence()                   = 0
libmpi.so.12->PMI2_KVS_Fence()                   = 0
libmpi.so.12->PMI2_Finalize()                    = 0
0: completed MPI_Finalize in 0.010s
libmpi.so.12->PMI2_Finalize()                    = 0
+++ exited (status 0) +++
+++ exited (status 0) +++
ltrace(1) allows function calls to shared libraries to be
traced.  To display function arguments, ltrace requires additional
configuration information for those functions.

This file permits PMI library calls to be traced, e.g.  in slurm, run:

    srun --label -N2 ltrace -tt -l 'libpmi*' \
	-F ltrace-pmi.conf t/mpi/hello
@grondo

Contributor

commented Sep 19, 2019

Nice! I poked at this and ran up to 192 tasks with no problems. (largest I could run on fluke)

0: completed MPI_Init in 11.770s.  There are 192 tasks
0: completed first barrier in 0.011s
0: completed MPI_Finalize in 0.295s
@grondo

Contributor

commented Sep 19, 2019

Restarting a builder that hit a spurious valgrind error. Not sure I've seen one like this before, but it's clearly unrelated to this PR:

2019-09-19T20:44:09.360796Z broker.crit[0]: child 1 idle for 3 heartbeats
2019-09-19T20:44:11.370030Z broker.crit[0]: child 1 idle for 4 heartbeats
2019-09-19T20:44:13.359992Z broker.crit[0]: child 1 idle for 5 heartbeats
2019-09-19T20:44:15.360229Z broker.crit[0]: child 1 idle for 6 heartbeats
2019-09-19T20:44:17.360244Z broker.crit[0]: child 1 idle for 7 heartbeats
2019-09-19T20:44:19.360240Z broker.crit[0]: child 1 idle for 8 heartbeats
lt-flux-broker: module 'kvs' was not cleanly shutdown
lt-flux-broker: module 'content-sqlite' was not cleanly shutdown
@grondo
grondo approved these changes Sep 19, 2019
Contributor

left a comment

Looks good, thanks for the necromancy Dr. Frankenstein!

I'll set merge-when-passing so this can be merged once travis completes.

@mergify mergify bot merged commit a494a7f into flux-framework:master Sep 19, 2019
2 checks passed:
Summary: 1 rule matches
continuous-integration/travis-ci/pr: The Travis CI build passed
@garlick

Member Author

commented Sep 19, 2019

Thanks!

@garlick garlick deleted the garlick:pmi2_emulate branch Sep 20, 2019
@garlick garlick referenced this pull request Sep 30, 2019