Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libpmi: add ability to bootstrap Flux using pmix #1580

Merged
merged 9 commits into from Jul 17, 2018

Conversation

Projects
None yet
5 participants
@garlick
Copy link
Member

garlick commented Jul 13, 2018

This PR adds code to Flux to enable it to bootstrap from PMIx, if built --with-pmix. I used the pmix PMI-1 compat library as a guide and pulled the code into flux-core's libpmi. The broker will use it to bootstrap if it finds PMIX_SERVER_URI or _URI2 set as discussed in #1555

This recipe works to build it on sierra:

$ LDFLAGS="-L/opt/ibm/spectrum_mpi/lib" \
CPPFLAGS="-I/opt/ibm/spectrum_mpi/include" \
./configure --with-pmix --disable-jobspec
$ make -j8
$ make -j8 check

I'm having some difficulty running anything on sierra at the moment but I'll get back to it and hopefully spin up some tests on monday. @dongahn provided some nice bsub/jsrun runes for me to cut and paste from in #1555 (thanks!)

I did notice that on my desktop, libpmix links against hwloc, and the hwloc on my system must link against cuda, because suddenly, but only when built against pmix, the valgrind test began failing with memory lost in the cuda library, so I'll need to run that down also.

Oh also, I had kind of stupidly implemented the different PMI client options as a case statement in each function, so there is a refactoring patch in here too to make the approach a little more scaleable.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Jul 13, 2018

@dongahn provided some nice bsub/jsrun runes for me to cut and paste from in #1555 (thanks!)

If that's not good enough, please use:
https://lc.llnl.gov/confluence/display/SIERRA/Using+Flux+to+schedule+and+run+jobs

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Jul 14, 2018

Thanks!

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Jul 14, 2018

Codecov Report

Merging #1580 into master will increase coverage by 0.05%.
The diff coverage is 48.14%.

@@            Coverage Diff             @@
##           master    #1580      +/-   ##
==========================================
+ Coverage   79.16%   79.21%   +0.05%     
==========================================
  Files         170      171       +1     
  Lines       31182    31336     +154     
==========================================
+ Hits        24684    24823     +139     
- Misses       6498     6513      +15
Impacted Files Coverage Δ
src/common/libpmi/pmix_client.c 0% <0%> (ø)
src/common/libpmi/wrap.c 88.08% <100%> (+44.29%) ⬆️
src/common/libpmi/single.c 97.77% <100%> (+14.44%) ⬆️
src/common/libpmi/simple_client.c 96.98% <100%> (+7.05%) ⬆️
src/common/libpmi/pmi.c 75.7% <88.88%> (+27.33%) ⬆️
src/common/libpmi/clique.c 96.72% <0%> (-3.28%) ⬇️
src/modules/connector-local/local.c 72.76% <0%> (-1.43%) ⬇️
src/common/libflux/message.c 81.02% <0%> (-0.12%) ⬇️
src/broker/overlay.c 74.13% <0%> (ø) ⬆️
... and 4 more
@coveralls

This comment has been minimized.

Copy link

coveralls commented Jul 14, 2018

Coverage Status

Coverage decreased (-0.06%) to 79.396% when pulling 45c04f7 on garlick:pmix_remap into 711a8e5 on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Jul 16, 2018

I did notice that on my desktop, libpmix links against hwloc, and the hwloc on my system must link against cuda, because suddenly, but only when built against pmix, the valgrind test began failing with memory lost in the cuda library, so I'll need to run that down also.

This problem went away for me after I uninstalled pmix and reinstalled from pmix-3.0.0. I was prevously using a pmix-1.x library I built in 2016, so maybe this is no longer a problem.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Jul 16, 2018

I'm having some difficulty running anything on sierra at the moment but I'll get back to it and hopefully spin up some tests on monday.

@SteVwonder and @trws educated me on some sierra arcana and I was finally able to run a test. It appears that these changes do work (I teseted up to a 512 way Flux instance on 4 nodes).

I've had two builders get stuck and restarted them. Interestingly I also just saw the valgrind test get stuck on my desktop for 20m (I finally killed it). Hopefully I've not introduced some new silly problem.

I don't think I`ll be able to get coverage for the pmix client in travis without significant effort, but I will try to work on the other missed coverage areas in the pmi library.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Jul 16, 2018

I don't think I`ll be able to get coverage for the pmix client in travis without significant effort

That seems reasonable. Maybe open an issue with "help wanted" tag for hooking in PMIx client testing to our CI system.

@garlick garlick force-pushed the garlick:pmix_remap branch from 808cb7b to ce460ba Jul 16, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Jul 17, 2018

I think this is probably close to ready. The main missing coverage is in pmix_client.c.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Jul 17, 2018

This is minor, but it might be good to leave one of the travis-ci builders without --with-pmix, just as a double check that building without that option works. Not really necessary IMO, but thought I'd mention it.

Otherwise, this PR is difficult to review/test, so I'd say if you are happy with it, I have no other comments and looks good to merge if you are ready.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Jul 17, 2018

Thanks, I'll make that suggested change and then I'm happy.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Jul 17, 2018

Restarted another hung builder. Python tests are the last thing in the output.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Jul 17, 2018

Want to squash the travis-ci: commits together? If you'd rather not go through another round of testing I'm fine with that and could merge as is.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Jul 17, 2018

Right, I'll do that. I can fix up an awkwardly worded commit message too. And I was thinking it might be slightly better to define unimplemented function pointers as NULL rather than implement functions that return PMI_FAIL, since that would have the same effect with less code.

If @tpatki's PR is ready, please merge and I'll rebase after I make the above changes.

@garlick garlick force-pushed the garlick:pmix_remap branch from 29546ec to 45c04f7 Jul 17, 2018

garlick added some commits Jul 12, 2018

libpmi: restructure with struct of function pointers
Problem: the "case statement" method of switching an
interface between methods does not scale, and we want
to add a new method for pmix.

Rework "wrap", "single", and "simple_client" implementations
to implement a single registration function that assigns
a struct of function pointers.

Update tests
build: add opt-in configure option --with-pmix
Add automake machinery to allow someone building Flux to
request that it be built with pmix support.  That is,
support for using PMIx to bootstrap flux.  If requested,
then look for libpmix and set some Makefile and config.h
variables.
libpmi: add pmix_client implementation
Extend Flux PMI client to allow PMIx to be used.
Support for this mode is only built when flux-core
is configured --with-pmix.

The purpose is to allow Flux to boot on systems
that provide libpmix.so but do not ship the pmix
PMI-1 compatibility libraries, such as IBM system
based on LSF/jsrun.

This code is based on the pmix project's PMI-1
compatibility library code, modified to fit Flux's
internal PMI library structure.

The new conditional dependency on pmix necessitates
the following build system changes:

- Link libpmi.la and libpmi2.la against libpmix.so.
- Drop libpmi.la from libflux-internal.la library.
- Link pmi servers (wrexecd, flux-start, tests)
  against src/common/libpmi/libpmi.la
- Link pmi clients (broker, tests)
  against src/common/libpmi.la.

Fixes #1581
travis-ci: configure flux-core --with-pmix
Download and install pmix-3.0.0, then configure
flux-core --with-pmix so this code gets compile
tested in travis.

One builder still builds without pmix to ensure
that configuration works too.
libpmi/wrap: add 'allow_self_wrap' param for test
Add a flag to disable the check that prevents the
PMI client from dlopening itself and recursing forever.

The flag will be used by a 'wrap' PMI implementation
unit test that explicitly wraps the flux pmi library,
and forces it to use the 'single' implementation.
libpmi/simple: implement abort call
Problem: the 'simple_client' pmi implementation does not
implement PMI_Abort().

Add a trivial local implementation which is better than
nothing.  Eventually this should call the abort server op.
@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Jul 17, 2018

Rebased on current master.

Right, I'll do that.

travis-ci commits were merged.

I can fix up an awkwardly worded commit message too

Done.

And I was thinking it might be slightly better to define unimplemented function pointers as NULL rather than implement functions that return PMI_FAIL, since that would have the same effect with less code.

Not done - not quite as simple as I was thinking and probably not worth it.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Jul 17, 2018

Ok, then this one is ready once Travis reports?

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Jul 17, 2018

Yep.

@grondo grondo merged commit 03a1ac9 into flux-framework:master Jul 17, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.06%) to 79.396%
Details
@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Jul 17, 2018

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.