
mpirun doesn't exit when exception is thrown in some process #236

Closed
andremoeller opened this issue Apr 17, 2018 · 7 comments

andremoeller commented Apr 17, 2018

Hi,

I'm encountering some strange behavior. When running chainermn scripts, I would like them to exit immediately if any process fails, and for mpirun to exit as well. Instead, I've run into two problems:

  1. In some cases, even if execution fails in some processes, execution continues on other processes until they exit normally.
  2. (Even more pernicious) Even if execution fails in some processes, execution never finishes (the mpirun process never exits).

I think something is going wrong when communicators are made.

Running the following with OpenMPI 2.1.2, Chainer 3.5.0, and ChainerMN 1.2.0 reproduces problem 2 (and problem 1):

mpiexec -mca orte_abort_on_non_zero_status 1 -n 4 python my_script.py

my_script.py:

import chainermn


def main():
    comm = chainermn.create_communicator('naive')
    if comm.mpi_comm.rank != 0:
        raise ValueError('failure!')
    comm = chainermn.create_communicator('naive')

if __name__ == '__main__':
    main()

The desired behavior is for mpirun to exit (and cause all spawned processes to exit) when any failure occurs. Is there any way I can achieve this? And what could be happening in create_communicator to cause this? I admit this example is contrived, but my real use case isn't, and it hits the same problem even when I create only one communicator (and use a different condition to force a failure on certain processes).

Thank you

@andremoeller andremoeller changed the title mpirun execution doesn't end when exception is thrown mpirun doesn't exit when exception is thrown in some process Apr 17, 2018
@keisukefukuda keisukefukuda self-assigned this Apr 17, 2018
keisukefukuda (Member) commented

Yes, as you said, the expected behaviour of the MPI runtime is to kill all child processes and shut down with an error code.
We sometimes observe this issue as well; it's a known issue with Open MPI <= 2.1.

In the script you showed, the second create_communicator is supposed to raise an error.
I will investigate if there's any way or option to avoid the issue.

It seems that failed-process detection and subprocess shutdown were improved in Open MPI 3.0, but that version brings another issue to ChainerMN (see #221 for details).
We are trying to solve #221, but it will take time because it's not a ChainerMN issue.

In the long-term roadmap, we are working hard to add fault tolerance to ChainerMN.

Thanks

andremoeller commented Apr 17, 2018

@keisukefukuda ,

Thank you for looking into this!

Unfortunately, I tried it with Open MPI 3.0.1, but I encountered the same behavior.

I should note that even if I do not create two communicators, this issue still occurs:

import chainermn

def main():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')
    comm = chainermn.create_communicator('naive', mpi_comm)

if __name__ == '__main__':
    main()

Please let me know if you make any progress. Thank you!

andremoeller commented

I looked into this some more. It seems like mpi4py isn't handling Python exceptions correctly. I tried calling init_ranks with some timeout code that sends a SIGALRM signal after some time, but the handler never gets called.

mpirun -n 2 python -m trace --trace repro.py

This trace shows that mpi_comm operations (gather and scatter) cause execution to hang:

_communication_utility.py(32): global_names=mpi_comm.gather(mpi4py.MPI.Get_processor_name())
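
For context, here is a minimal sketch of the kind of SIGALRM timeout described above (the exact wrapper is an assumption; it is not shown in this thread). A likely reason the handler never gets called is that CPython only runs Python-level signal handlers between bytecode instructions, so a call that blocks inside the MPI C library never gives the interpreter a chance to run the handler:

# Hypothetical sketch of the SIGALRM-based timeout described above.
import signal

def _timeout_handler(signum, frame):
    raise TimeoutError('MPI call did not return in time')

signal.signal(signal.SIGALRM, _timeout_handler)
signal.alarm(30)  # ask the kernel to deliver SIGALRM in 30 seconds
try:
    pass  # the blocking call, e.g. mpi_comm.gather(...), would go here
finally:
    signal.alarm(0)  # cancel the pending alarm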

keisukefukuda commented Apr 26, 2018

I've just found a hack to solve the issue.
It works for the tiny script, but we need to check if it works for real-world applications.

Note that the problem happens even without chainermn.

import sys

# Global error handler
def global_except_hook(exctype, value, traceback):
    sys.stderr.write("except_hook. Calling MPI_Abort().\n")
    # NOTE: mpi4py must be imported inside the exception handler, not globally.
    # In chainermn, the mpi4py import is carefully delayed, because
    # mpi4py automatically calls MPI_Init() and can cause a crash in Infiniband environments.
    import mpi4py.MPI
    mpi4py.MPI.COMM_WORLD.Abort(1)
    sys.__excepthook__(exctype, value, traceback)
sys.excepthook = global_except_hook

def func1():
    import chainermn
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')
    comm = chainermn.create_communicator('naive', mpi_comm)

def func2():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')

    mpi4py.MPI.COMM_WORLD.Barrier()



if __name__ == '__main__':
    d = {'func1' : func1,
         'func2' : func2}

    fname = sys.argv[1] if len(sys.argv) >= 2 else 'func1'
    d[fname]()
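
Assuming the script above is saved as repro.py, it can be launched with, for example:

mpirun -n 2 python repro.py func2

(func1 exercises create_communicator; func2 reproduces the hang with a bare Barrier.)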

keisukefukuda commented

Improved version of the error handler:

import sys

# Global error handler
def global_except_hook(exctype, value, traceback):
    import sys
    try:
        import mpi4py.MPI
        sys.stderr.write("\n*****************************************************\n")
        sys.stderr.write("Uncaught exception was detected on rank {}. \n".format(
            mpi4py.MPI.COMM_WORLD.Get_rank()))
        from traceback import print_exception
        print_exception(exctype, value, traceback)
        sys.stderr.write("*****************************************************\n\n\n")
        sys.stderr.write("\n")
        sys.stderr.write("Calling MPI_Abort() to shut down MPI processes...\n")
        sys.stderr.flush()
    finally:
        try:
            import mpi4py.MPI
            mpi4py.MPI.COMM_WORLD.Abort(1)
        except Exception as e:
            sys.stderr.write("*****************************************************\n")
            sys.stderr.write("Sorry, we failed to stop MPI, this process will hang.\n")
            sys.stderr.write("*****************************************************\n")
            sys.stderr.flush()
            raise e

sys.excepthook = global_except_hook


def func():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')

    mpi4py.MPI.COMM_WORLD.Barrier()


if __name__ == '__main__':
    func()

andremoeller commented

Great, thanks @keisukefukuda!

I also contacted the mpi4py maintainer, who suggested using [mpi4py.run](http://mpi4py.readthedocs.io/en/stable/mpi4py.run.html), which also suffices (I believe mpi4py.run also calls MPI_Abort()).
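
For reference, running a script under mpi4py.run means launching it through the mpi4py package, e.g. something like:

mpiexec -n 4 python -m mpi4py my_script.py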

keisukefukuda commented

Thanks. I will also test whether mpi4py's recommended way works with ChainerMN.
I think I can now close the issue. Thanks for your contribution!

dwpaley added a commit to cctbx/cctbx_project that referenced this issue Jan 23, 2023
Currently mpi4py does not exit gracefully on an exception in a single
rank, see: chainer/chainermn#236. This modifies
sys.excepthook to make sure that all processes terminate.
dwpaley added a commit to cctbx/cctbx_project that referenced this issue Jan 31, 2023
Currently mpi4py does not always exit gracefully on an exception in a single
rank, see: chainer/chainermn#236 and discussion in
dials/dials#2311 (comment). The new
decorator `mpi_abort_on_exception` catches unhandled exceptions and SystemExit
and sends an abort to all ranks. It should be added to `run` methods where an 
unhandled exception means the program has failed.
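
For illustration, a minimal sketch of what such a decorator could look like (an assumption based on the commit message above, not the actual cctbx_project code):

import functools
import sys
import traceback

def mpi_abort_on_exception(func):
    # Illustrative sketch: if the wrapped function raises (including SystemExit),
    # print the traceback and abort all MPI ranks so mpirun can exit.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except BaseException:
            traceback.print_exc()
            sys.stderr.flush()
            import mpi4py.MPI
            mpi4py.MPI.COMM_WORLD.Abort(1)
            raise
    return wrapper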