Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tighten up PMI implementation for OpenMPI #926

Merged
merged 5 commits into from Dec 16, 2016

Conversation

Projects
None yet
4 participants
@garlick
Copy link
Member

garlick commented Dec 14, 2016

This PR fixes a couple more issues that came up while building openmpi support for flux.
Some general cleanup, environment cleanup as discussed in #923, implementing the clique functions, and suppressing a log for failed PMI_KVS_Get() since that's not necessarily a flux error.

All pretty minor.

@garlick garlick added the review label Dec 14, 2016

@coveralls

This comment has been minimized.

Copy link

coveralls commented Dec 14, 2016

Coverage Status

Coverage increased (+0.01%) to 76.323% when pulling d4430bd on garlick:pmi-pkg into e84a5b5 on flux-framework:master.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Dec 14, 2016

Current coverage is 76.01% (diff: 50.00%)

Merging #926 into master will increase coverage by 0.01%

@@             master       #926   diff @@
==========================================
  Files           149        149          
  Lines         26006      26010     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          19764      19772     +8   
+ Misses         6242       6238     -4   
  Partials          0          0          
Diff Coverage File Path
0% src/modules/wreck/wrexecd.c
•••••••••• 100% src/common/libpmi/simple_client.c
•••••••••• 100% src/common/libpmi/pmi.c
•••••••••• 100% src/broker/broker.c

Powered by Codecov. Last update 82e5275...9d3b730

@garlick garlick referenced this pull request Dec 15, 2016

Merged

update schizo/flux component #2

@coveralls

This comment has been minimized.

Copy link

coveralls commented Dec 15, 2016

Coverage Status

Coverage increased (+0.02%) to 76.328% when pulling 03dbebe on garlick:pmi-pkg into e84a5b5 on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Dec 16, 2016

@garlick, this all seems pretty straightforward. I think you just need to rebase on current master if everything else is good.

garlick added some commits Dec 13, 2016

libpmi: implement clique functions
Problem: OpenMPI flux component is calling PMI clique functions,
but when acting as a simple v1 PMI client, they return PMI_ERR_INIT.

Change the code to fall through to emulation via parsing
PMI_process_mapping from the KVS.
wreck: suppress log message on PMI kvs get error
Problem: a LOG_ERR message is generated when PMI_KVS_Get()
fails with ENOENT.  OpenMPI seems to like to get keys
that don't exist.

Suppress the log message if the error is ENOENT.
broker: unset FLUX_JOB_* environment
Problem: when Flux launches Flux, FLUX_JOB_ID and other environment
variables are set in children of the second Flux, which makes it
hard to tell whether they are just children of the second Flux
or jobs spawned by the second Flux.

After PMI bootstrap, the broker unsets FLUX_JOB_ID, FLUX_JOB_SIZE,
and FLUX_JOB_NNODES.

@garlick garlick force-pushed the garlick:pmi-pkg branch from 03dbebe to 29b7c58 Dec 16, 2016

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Dec 16, 2016

Just rebased on current master.

The only thing I'm uncertain about is setting TMPDIR to scratch-directory-rank (to avoid shared memory name collisions when running multiple flux ranks per node). Think that has any negatives?

@coveralls

This comment has been minimized.

Copy link

coveralls commented Dec 16, 2016

Coverage Status

Coverage increased (+0.007%) to 76.309% when pulling 29b7c58 on garlick:pmi-pkg into 82e5275 on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Dec 16, 2016

The only thing I'm uncertain about is setting TMPDIR to scratch-directory-rank (to avoid shared memory name collisions when running multiple flux ranks per node). Think that has any negatives?

It does kind of seem strange to do this to all user's jobs by default, but I can't think of an actual problem with it. It would be nice if we could suggest to openmpi that they add node-rank to the shmem filename so that it works without outside intervention.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Dec 16, 2016

Let's go with this for now. I've probably given Ralph enough headaches for one week.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Dec 16, 2016

Does scratch-directory-rank get removed after session exit? If so, it might be surprising to have TMPDIR disappear after exit?

I do notice from the faq:

  1. Where is the file that sm will mmap in?
    The file will be in the OMPI session directory, which is typically something like /tmp/openmpi-sessions-myusername@mynodename* . The file itself will have the name shared_mem_pool.mynodename. For example, the full path could be /tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.

To place the session directory in a non-default location, use the MCA parameter orte_tmpdir_base.

Can we try something like an env variable OMPI_mca_orte_tmpdir_base (I forget exactly how those MCA environment variables are named)?

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Dec 16, 2016

Excellent point! Yes it does get removed.

If something like the above suggestion works that is much better! Let me poke at that.

wreck/openmpi.lua: drop LD_LIBRARY_PATH, set orte_tmpdir
The new openmpi Flux component uses FLUX_PMI_LIBRARY_PATH
at runtime to dlopen our PMI library, so LD_LIBRARY_PATH
does not need to be set for it to work right in the
default configuration.  Therefore, don't change it
in the openmpi lua script.

Do set OMPI_MCA_orte_tmpdir_base to the rank-specific
directory so that shared memory segment names do not
collide when there are multiple Flux ranks on a (real)
node launching an MPI program with multiple MPI ranks
on a (flux) node.

@garlick garlick force-pushed the garlick:pmi-pkg branch from 29b7c58 to 9d3b730 Dec 16, 2016

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Dec 16, 2016

OK just forced a push with that change. Seems to work!

@coveralls

This comment has been minimized.

Copy link

coveralls commented Dec 16, 2016

Coverage Status

Coverage increased (+0.01%) to 76.316% when pulling 9d3b730 on garlick:pmi-pkg into 82e5275 on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Dec 16, 2016

Cool! Merging

@grondo grondo merged commit 565ebf1 into flux-framework:master Dec 16, 2016

3 of 4 checks passed

codecov/patch 50.00% of diff hit (target 75.99%)
Details
codecov/project 76.01% (+0.01%) compared to 82e5275
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.01%) to 76.316%
Details

@grondo grondo removed the review label Dec 16, 2016

@garlick garlick deleted the garlick:pmi-pkg branch Dec 18, 2016

@grondo grondo referenced this pull request Mar 28, 2017

Closed

0.7.0 Release Notes #1019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.