OpenMPI hangs when running one more than the number of brokers #924
Trace of MPI hello on two nodes, successful run:

[trace elided]
The hang doesn't reproduce for sessions with size=1 or size=2 (running MPI hello at sizes 2 and 3, respectively). However, it does for size=4 (MPI hello at 5). Here's a trace of one such hang:

[trace elided]
Only 3 of the 5 processes (ranks 0, 1, and 4) entered the second barrier. Working backwards to find out what happened to ranks 2 and 3: both were sent responses to their last kvs get requests, but their barrier enter requests were never received.
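For reference, the exchange described above follows the usual PMI-1 client pattern: put, commit, barrier, get, barrier. The sketch below is illustrative only; the key and value names are made up, and this is not the actual OpenMPI code path:

```c
/* Minimal sketch of the PMI-1 exchange described above: each rank puts a
 * value, commits, enters a barrier, gets the other ranks' values, then
 * enters a second barrier.  Key/value names here are illustrative only. */
#include <stdio.h>
#include <pmi.h>

int main(void)
{
    int spawned, size, rank;
    char kvsname[256], key[64], val[256];

    PMI_Init(&spawned);
    PMI_Get_size(&size);
    PMI_Get_rank(&rank);
    PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

    /* publish this rank's value, then synchronize */
    snprintf(key, sizeof(key), "card-%d", rank);
    snprintf(val, sizeof(val), "hello-from-%d", rank);
    PMI_KVS_Put(kvsname, key, val);
    PMI_KVS_Commit(kvsname);
    PMI_Barrier();                      /* first barrier */

    /* fetch every other rank's value (the "kvs get requests" above) */
    for (int i = 0; i < size; i++) {
        snprintf(key, sizeof(key), "card-%d", i);
        PMI_KVS_Get(kvsname, key, val, sizeof(val));
    }
    PMI_Barrier();                      /* second barrier - where ranks 2 and 3 never arrived */

    PMI_Finalize();
    return 0;
}
```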
Same run, different failure:

[trace elided]
It seems that setting OMPI_MCA_btl=self,tcp works around the hang.
Even with the OMPI_MCA setting, I'm still hitting some errors in the pmix:flux component. Also, it seems that these barrier failures are not fatal to the app. I'm wondering if maybe truncated responses are involved.
Oh duh. Our PMI library is of course not thread safe, so this thread shifting business may actually be the cause of the problem.
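One conventional mitigation, if PMI really must be reached from more than one thread, is to serialize every call behind a single lock. The wrapper below is a hypothetical sketch, not code from the flux or OpenMPI components:

```c
/* Sketch: serialize all calls into a non-thread-safe PMI library behind one
 * mutex.  pmi_barrier_locked() is a hypothetical wrapper for illustration. */
#include <pthread.h>
#include <pmi.h>

static pthread_mutex_t pmi_lock = PTHREAD_MUTEX_INITIALIZER;

static int pmi_barrier_locked(void)
{
    int rc;
    pthread_mutex_lock(&pmi_lock);
    rc = PMI_Barrier();               /* only one thread talks to PMI at a time */
    pthread_mutex_unlock(&pmi_lock);
    return rc;
}
```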
Replacing the nonblocking fence with a blocking one seems to have made the mangled responses go away, and I can launch like crazy using OMPI_MCA_btl=self,tcp. Still running into problems with the default shared memory, though.
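For context, the blocking vs. nonblocking distinction looks like the sketch below in terms of the standard PMIx client API. This is illustrative only; the actual change lives inside OpenMPI's pmix "flux" component, not in application code:

```c
/* Sketch of blocking vs. nonblocking fence using the standard PMIx client
 * API.  Passing NULL/0 for procs means all processes in the caller's
 * namespace participate. */
#include <pmix.h>

static void do_blocking_fence(void)
{
    /* Blocking: returns only after all participating procs have entered. */
    PMIx_Fence(NULL, 0, NULL, 0);
}

static void fence_done(pmix_status_t status, void *cbdata)
{
    /* Nonblocking completion callback -- may run on a progress thread,
     * which is exactly the "thread shifting" mentioned above. */
    (void)status;
    (void)cbdata;
}

static void do_nonblocking_fence(void)
{
    PMIx_Fence_nb(NULL, 0, NULL, 0, fence_done, NULL);
}
```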
The problem was shared memory segment naming collisions, fixed by PR #926.
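For context, this class of collision looks like the sketch below with POSIX shared memory: if the segment name doesn't incorporate something unique per job, concurrent jobs on a node open the same segment. The names here are hypothetical, not the ones OpenMPI actually uses (see PR #926 for the real fix):

```c
/* Sketch of the naming-collision hazard with POSIX shared memory.  The
 * segment name must include something unique per job (and per rank, if
 * needed); a fixed name would collide across concurrently running jobs.
 * Names are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static int open_job_segment(const char *jobid, int rank, size_t size)
{
    char name[128];

    /* Including jobid and rank avoids the collision. */
    snprintf(name, sizeof(name), "/shmem-%s-%d", jobid, rank);

    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd >= 0 && ftruncate(fd, size) < 0) {
        close(fd);
        shm_unlink(name);
        return -1;
    }
    return fd;
}
```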
With recent changes submitted to OpenMPI for flux integration, I can run MPI jobs compiled with OpenMPI. However, I noticed that if I try to "oversubscribe" by launching the MPI hello program with a size greater than the size of the enclosing instance, it hangs.
When I do the same with the PMI test program src/common/libpmi/test_kvstest, there is no hang.
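For reference, the "MPI hello" program being launched is along these lines (a minimal sketch; the actual test program may differ):

```c
/* Minimal MPI hello of the kind being launched above (a sketch, not
 * necessarily the exact test program used). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```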