mpirun doesn't exit when exception is thrown in some process #236
Yes, as you said, the expected behaviour of the MPI runtime is to kill all child processes and shut down with an error code. In the script you showed, the second … It seems that the failed-process detection and subprocess shutdown were improved in Open MPI 3.0, but that version brings another issue to ChainerMN (see #221 for details). On the long-term roadmap, we are working hard to bring fault tolerance to ChainerMN. Thanks!
Thank you for looking into this! Unfortunately, I tried Open MPI 3.0.1 and encountered the same behavior. I should note that this issue still occurs even if I do not create two communicators:

Please let me know if you make any progress. Thank you!
I looked into this some more. This trace shows that the mpi_comm operations (gather and scatter) cause execution to hang:
I've just found a hack to solve the issue. Note that the problem also happens without ChainerMN (func2 below uses only mpi4py):

```python
import sys


# Global error handler
def global_except_hook(exctype, value, traceback):
    sys.stderr.write("except_hook. Calling MPI_Abort().\n")
    # NOTE: mpi4py must be imported inside the exception handler, not globally.
    # In ChainerMN, the mpi4py import is carefully delayed, because mpi4py
    # automatically calls MPI_Init(), which can cause a crash in Infiniband
    # environments.
    import mpi4py.MPI
    mpi4py.MPI.COMM_WORLD.Abort(1)
    sys.__excepthook__(exctype, value, traceback)


sys.excepthook = global_except_hook


def func1():
    import chainermn
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')
    comm = chainermn.create_communicator('naive', mpi_comm)


def func2():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')
    mpi4py.MPI.COMM_WORLD.Barrier()


if __name__ == '__main__':
    d = {'func1': func1,
         'func2': func2}
    fname = sys.argv[1] if len(sys.argv) >= 2 else 'func1'
    d[fname]()
```
Improved version of the error handler:

```python
import sys


# Global error handler
def global_except_hook(exctype, value, traceback):
    import sys
    try:
        import mpi4py.MPI
        sys.stderr.write("\n*****************************************************\n")
        sys.stderr.write("Uncaught exception was detected on rank {}.\n".format(
            mpi4py.MPI.COMM_WORLD.Get_rank()))
        from traceback import print_exception
        print_exception(exctype, value, traceback)
        sys.stderr.write("*****************************************************\n\n\n")
        sys.stderr.write("\n")
        sys.stderr.write("Calling MPI_Abort() to shut down MPI processes...\n")
        sys.stderr.flush()
    finally:
        try:
            import mpi4py.MPI
            mpi4py.MPI.COMM_WORLD.Abort(1)
        except Exception as e:
            sys.stderr.write("*****************************************************\n")
            sys.stderr.write("Sorry, we failed to stop MPI, this process will hang.\n")
            sys.stderr.write("*****************************************************\n")
            sys.stderr.flush()
            raise e


sys.excepthook = global_except_hook


def func():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')
    mpi4py.MPI.COMM_WORLD.Barrier()


if __name__ == '__main__':
    func()
```
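As a side note, the `sys.excepthook` mechanism that both handlers rely on can be exercised without MPI at all. The following is a hypothetical minimal sketch (the child script and its messages are illustrative, not taken from the thread): it installs a custom hook in a child interpreter and confirms the hook runs on an uncaught exception, which is exactly the entry point the `MPI_Abort()` hack uses.

```python
import subprocess
import sys
import textwrap

# Child script: register a custom excepthook, then raise. If the hook is
# invoked, it writes a marker to stderr before the default traceback.
child_script = textwrap.dedent("""
    import sys

    def hook(exctype, value, tb):
        sys.stderr.write("custom hook fired\\n")
        sys.__excepthook__(exctype, value, tb)  # still print the traceback

    sys.excepthook = hook
    raise ValueError('failure!')
""")

proc = subprocess.run([sys.executable, "-c", child_script],
                      capture_output=True, text=True)
hook_fired = "custom hook fired" in proc.stderr
failed = proc.returncode != 0
print(hook_fired, failed)  # → True True
```

In the MPI case the hook body is where `mpi4py.MPI.COMM_WORLD.Abort(1)` goes, so a failure on one rank tears down the whole job instead of leaving the other ranks blocked in a collective.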
Great, thanks @keisukefukuda! I also contacted the mpi4py maintainer, who suggested using [mpi4py.run](http://mpi4py.readthedocs.io/en/stable/mpi4py.run.html), which also suffices (I believe mpi4py.run also calls MPI_Abort()).
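For reference, mpi4py.run is enabled by launching the script through the `-m mpi4py` interpreter option, as described in the linked docs; the launcher flags and script name here are illustrative:

```shell
# Run under mpi4py.run, which aborts all ranks on an uncaught exception
# (script name and rank count are placeholders).
mpirun -np 2 python -m mpi4py my_script.py
```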
Thanks. I will also test the mpi4py.run approach.
Currently mpi4py does not exit gracefully on an exception in a single rank, see: chainer/chainermn#236. This modifies sys.excepthook to make sure that all processes terminate.
Currently mpi4py does not always exit gracefully on an exception in a single rank, see: chainer/chainermn#236 and discussion in dials/dials#2311 (comment). The new decorator `mpi_abort_on_exception` catches unhandled exceptions and SystemExit and sends an abort to all ranks. It should be added to `run` methods where an unhandled exception means the program has failed.
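A decorator along those lines might look like the following. This is a hypothetical sketch, not the actual dials implementation; the single-rank guard is an added assumption so that local, non-MPI runs keep their normal traceback instead of aborting.

```python
import functools
import sys


def mpi_abort_on_exception(func):
    """Hypothetical sketch of an abort-on-exception decorator (not the
    actual dials implementation): if the wrapped function raises, call
    MPI_Abort() so the other ranks terminate instead of deadlocking."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except BaseException:  # also catches SystemExit
            try:
                import mpi4py.MPI
                comm = mpi4py.MPI.COMM_WORLD
                # Added assumption: only abort in a real multi-rank run,
                # so single-process debugging keeps the normal traceback.
                if comm.Get_size() > 1:
                    sys.stderr.write(
                        "Uncaught exception on rank {}; calling "
                        "MPI_Abort().\n".format(comm.Get_rank()))
                    comm.Abort(1)
            except ImportError:
                pass  # mpi4py unavailable: fall through to a normal crash
            raise
    return wrapper


@mpi_abort_on_exception
def run():
    raise ValueError('failure!')
```

Applied to a `run` method, an exception on any rank then either aborts the whole job (multi-rank) or re-raises normally (single process).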
Hi,
I'm encountering some strange behavior. When running chainermn scripts, I would like them to exit immediately if any process fails, and for mpirun to exit as well. Instead, I've run into two problems:
I think something is going wrong when communicators are made.
Running the following with OpenMPI 2.1.2, Chainer 3.5.0, and ChainerMN 1.2.0 reproduces problem 2 (and problem 1):
my_script.py:
The desired behavior is for mpirun to exit (and cause all spawned processes to exit) when any failure occurs. Is there any way I can achieve this? And what could be happening in create_communicator to cause this? I admit this example is contrived, but my own use case isn't, and it faces the same problem even when I create only one communicator (and use another condition to force a failure on certain processes).
Thank you