implement PMI-1 wire protocol in wrexecd #706

Closed
garlick opened this Issue Jun 24, 2016 · 12 comments

garlick commented Jun 24, 2016

As noted in #398, all MPICH-derived MPIs support the PMI-1 wire protocol out of the box, unless it is disabled. Supporting the wire protocol in wrexecd should allow us to launch Intel MPI and mvapich without the hassle of redirecting them to our libpmi.so, dodging rpath entanglements, etc.

It turns out Slurm has this capability in its pmi2 MPI plugin (which can negotiate version 1). @adammoody recently backported this to the Slurm version we are installing on CTS-1 machines in TOSS3, and proposed that we make pmi2 the default MPI plugin instead of mvapich, and build mvapich to support it.

In addition, this is a win for testing, since the mpich package for Ubuntu supports this.

There is a rudimentary wire protocol server library in src/common/libpmi-server that flux-start uses to launch the broker. This could be adapted to run inside wrexecd for each job.

For more info on the wire protocol, see Flux RFC 13.

Note that we will also need to address the issue Tom pointed out in #665: there is no 1.1 wire protocol operation to get the "clique ranks". We need to pre-load some info into the KVS to make this work, or perhaps temporarily revert to 1.0.

garlick commented Jun 24, 2016

Just one note on the mapping of libpmi-server callbacks:

The kvs_get() callback should be mapped to Flux's kvs_get().

The kvs_put() callback should be mapped to Flux's kvs_put(), but there should be no kvs_commit().

The barrier() callback should be mapped to Flux's kvs_fence().
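
A rough sketch of what that glue in wrexecd might look like (the callback signatures and the pmi_ctx struct below are illustrative assumptions, not the actual libpmi-server interface; the Flux calls are the kvs_get()/kvs_put()/kvs_fence() API named above):

```c
/* Illustrative sketch only: the callback signatures and struct pmi_ctx
 * are assumptions, not the real libpmi-server interface. */
#include <flux/core.h>   /* old kvs_get()/kvs_put()/kvs_fence() API of this era */

struct pmi_ctx {
    flux_t h;             /* wrexecd's broker handle */
    int nprocs;           /* number of tasks participating in the fence */
    char fence_name[64];  /* unique per-job fence name */
};

static int cb_kvs_get (void *arg, const char *key, char **valp)
{
    struct pmi_ctx *ctx = arg;
    return kvs_get (ctx->h, key, valp);   /* PMI get -> Flux kvs_get() */
}

static int cb_kvs_put (void *arg, const char *key, const char *val)
{
    struct pmi_ctx *ctx = arg;
    return kvs_put (ctx->h, key, val);    /* PMI put -> Flux kvs_put(), no commit */
}

static int cb_barrier (void *arg)
{
    struct pmi_ctx *ctx = arg;
    /* the commit happens collectively as part of the fence */
    return kvs_fence (ctx->h, ctx->fence_name, ctx->nprocs);
}
```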

grondo commented Jun 24, 2016

> The barrier() callback should be mapped to Flux's kvs_fence().

I was just about to ask about this.

Doesn't kvs_fence() block?

garlick commented Jun 24, 2016

Yes, and I didn't think about it until just now: we can't block wrexecd in a fence, since the fence will never return until wrexecd has serviced the PMI barrier calls from the other ranks.

Maybe we will need to get a "split" kvs_fence call implemented: the first half calls flux_rpc(), the second half calls flux_rpc_get(), and the response can be handled asynchronously with a continuation callback.
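
Something like the following, roughly (the "kvs.fence" topic string and the make_fence_request() helper are assumptions about what kvs_fence() does internally, and the flux_rpc() family signatures are sketched from memory):

```c
/* Sketch: send the fence request now, handle the reply later from the
 * reactor via a continuation, so wrexecd never blocks.  The "kvs.fence"
 * topic and make_fence_request() helper are assumptions. */
#include <stdlib.h>
#include <flux/core.h>

static void fence_continuation (flux_rpc_t *rpc, void *arg)
{
    struct pmi_ctx *ctx = arg;   /* from the sketch above */
    if (flux_rpc_get (rpc, NULL, NULL) < 0) {
        /* fence failed: fail the PMI barrier for the local tasks */
    }
    /* ...complete the PMI barrier here (see discussion below)... */
    flux_rpc_destroy (rpc);
}

static int fence_start (struct pmi_ctx *ctx)
{
    char *req = make_fence_request (ctx->fence_name, ctx->nprocs); /* hypothetical */
    flux_rpc_t *rpc = flux_rpc (ctx->h, "kvs.fence", req, FLUX_NODEID_ANY, 0);
    free (req);
    if (!rpc || flux_rpc_then (rpc, fence_continuation, ctx) < 0)
        return -1;
    return 0;
}
```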

grondo commented Jun 24, 2016

Until then, would a global counter implemented via cmb.seq work? Once increment+fetch returns nprocs, kvs_commit() is called? Probably nowhere near as efficient as kvs_fence() but it should work for now.

How do barrier callers get notified at the point where pmi->barrier() would return 1 in the simple protocol?

grondo commented Jun 24, 2016

Although, looking at the kvs_fence() implementation, it seems trivial to create a split call, perhaps a kvs_fence_start() call that returns an rpc object?

grondo commented Jun 24, 2016

Hm, sorry for the noise.

The continuation would not work currently because there would be no way to return 1 from the pmi->barrier() callback. A change to the simple server interface would be required, perhaps changing barrier() to a "barrier enter" and then adding a pmi_simple_server_barrier_complete() call or similar.
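
i.e., something with roughly this shape (declarations here are illustrative only, not the final interface):

```c
/* Illustrative declarations only; see the commits referenced below for
 * what actually landed. */
struct pmi_simple_server;   /* opaque server handle */

struct pmi_simple_ops {
    int (*kvs_put) (void *arg, const char *kvsname,
                    const char *key, const char *val);
    int (*kvs_get) (void *arg, const char *kvsname,
                    const char *key, char *val, int len);
    /* was: barrier(), which returned 1 when the barrier completed;
     * now it only records entry into the barrier and returns 0 or -1: */
    int (*barrier_enter) (void *arg);
};

/* Called later (e.g. from a fence continuation) once the barrier is done,
 * so "barrier_out" responses can be sent to the waiting clients. */
int pmi_simple_server_barrier_complete (struct pmi_simple_server *pmi, int rc);
```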

garlick commented Jun 24, 2016

Exactly what I was going to propose! See #707

garlick commented Jun 24, 2016

Sorry, we cross-posted there. Yeah, that sounds good. I'll propose a PR for both of those things.

grondo commented Jun 24, 2016

Ok, that should work great.

garlick added a commit to garlick/flux-core that referenced this issue Jun 24, 2016

libpmi-server: add barrier completion call
Change barrier() callback to barrier_enter(), and have it
return 0 on success, -1 on failure.  It no longer returns
1 when the barrier is complete.

Add pmi_simple_server_barrier_complete() to complete the
barrier.

This allows the barrier completion to be asynchronous with
respect to barrier entry, which is needed for integrating
the pmi server code into a reactor loop, as discussed in flux-framework#706.

Update flux-start and pmi/simple test for the new function
signatures.
grondo commented Jun 24, 2016

In a reactor context, will the while (pmi_simple_server_response ()) ... loop need to be called manually after pmi_simple_server_barrier_complete()? Or am I doing something else wrong? (The program hangs here because nothing ever sends barrier exit responses to clients.)

garlick commented Jun 25, 2016

Yeah, I was thinking that would need to be called there.
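
i.e., in the fence continuation, roughly like this (the exact signature of pmi_simple_server_response() is assumed from the flux-start usage mentioned above):

```c
/* Sketch: complete the barrier, then drain queued responses so waiting
 * clients actually receive their barrier exit.  ctx->pmi is assumed to
 * hold the pmi_simple_server handle; pmi_simple_server_response()'s
 * signature is an assumption here. */
static void fence_continuation (flux_rpc_t *rpc, void *arg)
{
    struct pmi_ctx *ctx = arg;
    int rc = flux_rpc_get (rpc, NULL, NULL);   /* 0 on success, -1 on failure */
    pmi_simple_server_barrier_complete (ctx->pmi, rc);
    while (pmi_simple_server_response (ctx->pmi) > 0)
        ;   /* send queued barrier_out (and other) responses to clients */
    flux_rpc_destroy (rpc);
}
```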

garlick added a commit to garlick/flux-core that referenced this issue Jun 28, 2016

libpmi-server: add barrier completion call
Restructure the barrier callback.  If there isn't one,
barriers complete internally.  If there is one, it is called
once when all local processes have entered.

Once the barrier is complete, pmi_simple_server_barrier_complete()
should be called.

This allows the barrier completion to be asynchronous with
respect to barrier entry, which is needed for integrating
the pmi server code into a reactor loop, as discussed in flux-framework#706.

Rename the barrier callback to barrier_enter() and add an rc
parameter so it can fail.

Update users.

garlick commented Aug 9, 2016

This was completed with the merge of #709.

garlick closed this Aug 9, 2016
