PMI compatibility #665
[triage] Responding to the TLDR:
For next steps we should probably retry launching an mvapich program on the RHEL 7 based systems and see if any of the above are needed. Any mvapich specific environment should be set up in src/modules/wreck/lua.d/mvapich.lua
I think with TOSS3's mvapich these are no longer necessary. It was an issue with a specific version of MVAPICH that looked for slurm's PMI specifically, did not find it, and did not fall back to the simple PMI because of how it was built.
OK, then let's close this and if new problems arise, open bugs for those specifically.
The TLDR, we should do each of the following:
It turns out that version 2.1 of MVAPICH2 has some faulty logic for deciding how to handle its initial wireup. The core of it is that it assumes that if it's getting PMI, the provider is either SLURM's PMI or hydra's, so using 1.1 features should be safe as long as it's not slurm, and that if it isn't getting PMI from either of those, it should be using PMGR or some other mvapich specific launching mechanism. None of those things are true for flux, so we'll need to chip away at some of these assumptions. Explicitly telling it we're not using the MPIRUN_MAPPING interface gets us to par, but only because we have PMGR-like environment variables set by wreck, so at least IMO we should try to put the PMI_process_mapping value together. The extra trick is that this value has an undocumented format, so the only places to really find it are in hydra and in MPID.
hydra code to build such a string: https://github.com/adk9/hydra/blob/f0ce4451f04d26c55e2f59f12b59d222da838a2c/pm/pmiserv/pmiserv_utils.c#L114
MPID code to interpret it: http://fossies.org/dox/mvapich2-2.2rc1/ch3_2src_2mpid__vc_8c_source.html#l00983
Format description from the latter:
With this set, all of the versions I've tested work, and they use the value to do so. None of this does anything for OpenMPI, as far as I can tell, and I'm not sure what would. OpenMPI still doesn't work in either the opt or dotkit versions; I'll have to dig into that further to see why.
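As a rough illustration of putting a PMI_process_mapping value together, here is a minimal Python sketch. It assumes the "vector" layout that the MPID parser linked above appears to accept, `(vector,(start_node_id,num_nodes,procs_per_node),...)`, and it assumes ranks are assigned in consecutive blocks, node by node. The function name `build_process_mapping` and its input are hypothetical and are not part of flux or wreck; check the format against the hydra and MPID sources before relying on it.

```python
from itertools import groupby


def build_process_mapping(tasks_per_node):
    """Build a PMI_process_mapping string from per-node task counts.

    Hypothetical helper: assumes the "vector" format
    (vector,(start_node_id,num_nodes,procs_per_node),...) and a block
    distribution, i.e. ranks fill node 0 first, then node 1, and so on.
    """
    if not tasks_per_node:
        raise ValueError("need at least one node")
    if any(count <= 0 for count in tasks_per_node):
        raise ValueError("every node must run at least one task")

    blocks = []
    node_id = 0
    # Collapse runs of consecutive nodes that host the same number of
    # tasks into a single (start_node_id, num_nodes, procs_per_node) block.
    for count, run in groupby(tasks_per_node):
        num_nodes = len(list(run))
        blocks.append(f"({node_id},{num_nodes},{count})")
        node_id += num_nodes
    return "(vector," + ",".join(blocks) + ")"


if __name__ == "__main__":
    # Four nodes with two tasks each collapse into a single block.
    print(build_process_mapping([2, 2, 2, 2]))
    # Uneven layouts need one block per run of equal counts.
    print(build_process_mapping([4, 4, 2]))
```

Collapsing runs of equal counts keeps the string short for the common case of a uniform job, which matters if the value has to fit in a single PMI key-value entry.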