
tighten up PMI implementation for OpenMPI #926

Merged: 5 commits from garlick:pmi-pkg into flux-framework:master, Dec 16, 2016

Conversation
Conversation

@garlick (Member) commented Dec 14, 2016

This PR fixes a couple more issues that came up while building openmpi support for flux.
Some general cleanup, environment cleanup as discussed in #923, implementing the clique functions, and suppressing a log for failed PMI_KVS_Get() since that's not necessarily a flux error.

All pretty minor.

@garlick added the review label Dec 14, 2016
@coveralls

Coverage Status: Coverage increased (+0.01%) to 76.323% when pulling d4430bd on garlick:pmi-pkg into e84a5b5 on flux-framework:master.

@codecov-io commented Dec 14, 2016

Current coverage is 76.01% (diff: 50.00%)

Merging #926 into master will increase coverage by 0.01%

@@             master       #926   diff @@
==========================================
  Files           149        149          
  Lines         26006      26010     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          19764      19772     +8   
+ Misses         6242       6238     -4   
  Partials          0          0          
Diff Coverage   File Path
  0%            src/modules/wreck/wrexecd.c
100%            src/common/libpmi/simple_client.c
100%            src/common/libpmi/pmi.c
100%            src/broker/broker.c

Powered by Codecov. Last update 82e5275...9d3b730

@coveralls

Coverage Status: Coverage increased (+0.02%) to 76.328% when pulling 03dbebe on garlick:pmi-pkg into e84a5b5 on flux-framework:master.

@grondo (Contributor) commented Dec 16, 2016

@garlick, this all seems pretty straightforward. I think you just need to rebase on current master if everything else is good.

Problem: the OpenMPI flux component calls the PMI clique functions,
but when it is acting as a simple v1 PMI client, they return PMI_ERR_INIT.

Change the code to fall through to emulation by parsing
PMI_process_mapping from the KVS.
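
For reference, the emulation path boils down to fetching the PMI_process_mapping key with PMI_KVS_Get() and deriving the set of co-located ranks from the mapping string. A minimal sketch, assuming the common single-block mapping form "(vector,(0,<nnodes>,<ppn>))" and a hypothetical helper name; this is not the actual flux-core code:

```c
#include <stdio.h>

/* Fill 'ranks' with the ranks co-located with 'rank' and return the
 * clique size, or -1 if the mapping string cannot be parsed.  Assumes
 * a block distribution described by a single (base, nnodes, ppn) triple;
 * a real implementation must also handle multi-block mappings. */
static int clique_from_mapping (const char *mapping, int rank,
                                int *ranks, int nranks)
{
    int base, nnodes, ppn, i;

    if (sscanf (mapping, "(vector,(%d,%d,%d))", &base, &nnodes, &ppn) != 3
            || ppn <= 0)
        return -1;
    (void)base;      /* base node offset and node count unused here */
    (void)nnodes;
    int first = (rank / ppn) * ppn;   /* first rank on this node */
    for (i = 0; i < ppn && i < nranks; i++)
        ranks[i] = first + i;
    return ppn;
}
```
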
Problem: a LOG_ERR message is generated when PMI_KVS_Get()
fails with ENOENT.  OpenMPI seems to like to get keys
that don't exist.

Suppress the log message if the error is ENOENT.
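
For reference, a KVS get that fails only because the key is absent is expected when OpenMPI probes for optional keys, not a Flux error. A self-contained sketch of the pattern, using stand-in function names rather than the flux-core ones:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the simple PMI client's kvs-get; the real code talks to
 * the PMI server.  Here it always reports a missing key. */
static int kvs_get (const char *key, char *val, int len)
{
    (void)key; (void)val; (void)len;
    errno = ENOENT;
    return -1;
}

int main (void)
{
    char val[64];

    if (kvs_get ("PMI_process_mapping", val, sizeof (val)) < 0) {
        if (errno != ENOENT)    /* suppress the log for missing keys */
            fprintf (stderr, "PMI_KVS_Get: %s\n", strerror (errno));
    }
    return 0;
}
```
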
Problem: when Flux launches Flux, FLUX_JOB_ID and other environment
variables are set in children of the second Flux, which makes it
hard to tell whether they are just children of the second Flux
or jobs spawned by the second Flux.

After PMI bootstrap, the broker unsets FLUX_JOB_ID, FLUX_JOB_SIZE,
and FLUX_JOB_NNODES.
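
A minimal sketch of what this amounts to, with a hypothetical function name rather than the broker's actual code:

```c
#include <stdlib.h>

/* After PMI bootstrap has consumed the enclosing instance's job
 * variables, drop them so processes spawned by this broker are not
 * mistaken for jobs of the parent Flux instance. */
static void scrub_enclosing_job_env (void)
{
    unsetenv ("FLUX_JOB_ID");
    unsetenv ("FLUX_JOB_SIZE");
    unsetenv ("FLUX_JOB_NNODES");
}
```
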
@garlick (Member, Author) commented Dec 16, 2016

Just rebased on current master.

The only thing I'm uncertain about is setting TMPDIR to scratch-directory-rank (to avoid shared memory name collisions when running multiple flux ranks per node). Think that has any negatives?

@coveralls

Coverage Status: Coverage increased (+0.007%) to 76.309% when pulling 29b7c58 on garlick:pmi-pkg into 82e5275 on flux-framework:master.

@grondo (Contributor) commented Dec 16, 2016

> The only thing I'm uncertain about is setting TMPDIR to scratch-directory-rank (to avoid shared memory name collisions when running multiple flux ranks per node). Think that has any negatives?

It does kind of seem strange to do this to all users' jobs by default, but I can't think of an actual problem with it. It would be nice if we could suggest to openmpi that they add node-rank to the shmem filename so that it works without outside intervention.

@garlick (Member, Author) commented Dec 16, 2016

Let's go with this for now. I've probably given Ralph enough headaches for one week.

@grondo (Contributor) commented Dec 16, 2016

Does scratch-directory-rank get removed after session exit? If so, it might be surprising to have TMPDIR disappear after exit?

I do notice from the faq:

  1. Where is the file that sm will mmap in?
    The file will be in the OMPI session directory, which is typically something like /tmp/openmpi-sessions-myusername@mynodename* . The file itself will have the name shared_mem_pool.mynodename. For example, the full path could be /tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.

To place the session directory in a non-default location, use the MCA parameter orte_tmpdir_base.

Can we try something like an env variable OMPI_mca_orte_tmpdir_base (I forget exactly how those MCA environment variables are named)?

@garlick (Member, Author) commented Dec 16, 2016

Excellent point! Yes it does get removed.

If something like the above suggestion works that is much better! Let me poke at that.

The new openmpi Flux component uses FLUX_PMI_LIBRARY_PATH
at runtime to dlopen our PMI library, so LD_LIBRARY_PATH
does not need to be set for it to work right in the
default configuration.  Therefore, don't change it
in the openmpi lua script.

Do set OMPI_MCA_orte_tmpdir_base to the rank-specific
directory so that shared memory segment names do not
collide when there are multiple Flux ranks on a (real)
node launching an MPI program with multiple MPI ranks
on a (flux) node.
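
The actual change lives in the wreck openmpi lua script, but conceptually the per-task environment setup looks like the C sketch below; the library path and the rank-directory naming are illustrative assumptions, not the real values.

```c
#include <stdio.h>
#include <stdlib.h>

/* Point OpenMPI's flux component at our PMI library so it can dlopen()
 * it without LD_LIBRARY_PATH, and give ORTE a rank-specific tmpdir base
 * so shared-memory file names don't collide when several Flux ranks
 * share a physical node. */
static void setup_openmpi_env (const char *scratch_dir, int rank)
{
    char tmpdir[4096];

    setenv ("FLUX_PMI_LIBRARY_PATH",
            "/usr/lib/flux/libpmi.so", 1);      /* illustrative path */

    snprintf (tmpdir, sizeof (tmpdir), "%s-%d", scratch_dir, rank);
    setenv ("OMPI_MCA_orte_tmpdir_base", tmpdir, 1);
}
```

With OMPI_MCA_orte_tmpdir_base set per rank, the ORTE session directory (and the shared_mem_pool file inside it, per the FAQ quoted above) lands in a distinct tree for each Flux rank on the node.
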
@garlick (Member, Author) commented Dec 16, 2016

OK just forced a push with that change. Seems to work!

@coveralls

Coverage Status: Coverage increased (+0.01%) to 76.316% when pulling 9d3b730 on garlick:pmi-pkg into 82e5275 on flux-framework:master.

@grondo (Contributor) commented Dec 16, 2016

Cool! Merging

@grondo merged commit 565ebf1 into flux-framework:master on Dec 16, 2016
@grondo removed the review label on Dec 16, 2016
@garlick deleted the pmi-pkg branch on Dec 18, 2016
@grondo mentioned this pull request on Mar 28, 2017