Description
What happened:
In Python 3.8 tests, the dask_mpi/tests/test_core.py and dask_mpi/tests/test_no_exit.py tests execute processes that raise exceptions at shutdown and return 0 exit codes.
What you expected to happen:
I would expect any test that raises an exception to return a non-0 exit code.
Minimal Complete Verifiable Example:
With Python 3.8 on MacOS or Linux, run the command that is executed in the dask_mpi/tests/test_core.py test:
mpirun -l -np 4 python dask_mpi/tests/test_core.py
Note the raised CommClosedError and that the exit code is 0:
echo $?
Full Logs
[0] 2022-04-20 11:17:09,840 - distributed.scheduler - INFO - State start
[0] 2022-04-20 11:17:09,844 - distributed.scheduler - INFO - Clear task state
[0] 2022-04-20 11:17:09,844 - distributed.scheduler - INFO - Scheduler at: tcp://192.168.1.27:63569
[0] 2022-04-20 11:17:09,845 - distributed.scheduler - INFO - dashboard at: :63568
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Start worker at: tcp://192.168.1.27:63572
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Listening to: tcp://192.168.1.27:63572
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - dashboard at: 192.168.1.27:63573
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Start worker at: tcp://192.168.1.27:63570
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Listening to: tcp://192.168.1.27:63570
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Waiting to connect to: tcp://192.168.1.27:63569
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - dashboard at: 192.168.1.27:63574
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - -------------------------------------------------
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Waiting to connect to: tcp://192.168.1.27:63569
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Threads: 1
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Memory: 2.00 GiB
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Local Directory: /Users/kpaul/Software/Development/dask-mpi/dask_mpi/tests/dask-worker-space/worker-iqjsppe2
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - -------------------------------------------------
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Threads: 1
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Memory: 2.00 GiB
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Local Directory: /Users/kpaul/Software/Development/dask-mpi/dask_mpi/tests/dask-worker-space/worker-lb3qdkh6
[2] 2022-04-20 11:17:09,869 - distributed.worker - INFO - -------------------------------------------------
[3] 2022-04-20 11:17:09,869 - distributed.worker - INFO - -------------------------------------------------
[0] 2022-04-20 11:17:10,267 - distributed.scheduler - INFO - Receive client connection: Client-b626d33c-c0cd-11ec-b1b6-acde48001122
[0] 2022-04-20 11:17:10,270 - distributed.core - INFO - Starting established connection
[0] 2022-04-20 11:17:10,338 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://192.168.1.27:63570', name: 3, status: undefined, memory: 0, processing: 0>
[0] 2022-04-20 11:17:10,339 - distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.1.27:63570
[0] 2022-04-20 11:17:10,339 - distributed.core - INFO - Starting established connection
[3] 2022-04-20 11:17:10,339 - distributed.worker - INFO - Registered to: tcp://192.168.1.27:63569
[3] 2022-04-20 11:17:10,339 - distributed.worker - INFO - -------------------------------------------------
[3] 2022-04-20 11:17:10,340 - distributed.core - INFO - Starting established connection
[0] 2022-04-20 11:17:10,344 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://192.168.1.27:63572', name: 2, status: undefined, memory: 0, processing: 0>
[0] 2022-04-20 11:17:10,344 - distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.1.27:63572
[0] 2022-04-20 11:17:10,344 - distributed.core - INFO - Starting established connection
[2] 2022-04-20 11:17:10,345 - distributed.worker - INFO - Registered to: tcp://192.168.1.27:63569
[2] 2022-04-20 11:17:10,345 - distributed.worker - INFO - -------------------------------------------------
[2] 2022-04-20 11:17:10,346 - distributed.core - INFO - Starting established connection
[0] 2022-04-20 11:17:10,627 - distributed.scheduler - INFO - Remove client Client-b626d33c-c0cd-11ec-b1b6-acde48001122
[0] 2022-04-20 11:17:10,627 - distributed.scheduler - INFO - Remove client Client-b626d33c-c0cd-11ec-b1b6-acde48001122
[0] 2022-04-20 11:17:10,628 - distributed.scheduler - INFO - Close client connection: Client-b626d33c-c0cd-11ec-b1b6-acde48001122
[0] 2022-04-20 11:17:10,639 - distributed.scheduler - INFO - Receive client connection: Client-b69e5de4-c0cd-11ec-b1b6-acde48001122
[0] 2022-04-20 11:17:10,639 - distributed.core - INFO - Starting established connection
[0] 2022-04-20 11:17:10,642 - distributed.worker - INFO - Run out-of-band function 'stop'
[0] 2022-04-20 11:17:10,643 - distributed.scheduler - INFO - Scheduler closing...
[0] 2022-04-20 11:17:10,643 - distributed.scheduler - INFO - Scheduler closing all comms
[2] 2022-04-20 11:17:10,645 - distributed.worker - INFO - Stopping worker at tcp://192.168.1.27:63572
[3] 2022-04-20 11:17:10,644 - distributed.worker - INFO - Stopping worker at tcp://192.168.1.27:63570
[0] 2022-04-20 11:17:10,645 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://192.168.1.27:63570', name: 3, status: running, memory: 0, processing: 0>
[0] 2022-04-20 11:17:10,645 - distributed.core - INFO - Removing comms to tcp://192.168.1.27:63570
[0] 2022-04-20 11:17:10,645 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://192.168.1.27:63572', name: 2, status: running, memory: 0, processing: 0>
[0] 2022-04-20 11:17:10,645 - distributed.core - INFO - Removing comms to tcp://192.168.1.27:63572
[0] 2022-04-20 11:17:10,645 - distributed.scheduler - INFO - Lost all workers
[2] 2022-04-20 11:17:10,649 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://192.168.1.27:63576 remote=tcp://192.168.1.27:63569>
[2] Traceback (most recent call last):
[2] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/distributed/batched.py", line 94, in _background_send
[2] nbytes = yield self.comm.write(
[2] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
[2] value = future.result()
[2] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/distributed/comm/tcp.py", line 269, in write
[2] raise CommClosedError()
[2] distributed.comm.core.CommClosedError
[3] 2022-04-20 11:17:10,648 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://192.168.1.27:63577 remote=tcp://192.168.1.27:63569>
[3] Traceback (most recent call last):
[3] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/distributed/batched.py", line 94, in _background_send
[3] nbytes = yield self.comm.write(
[3] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
[3] value = future.result()
[3] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/distributed/comm/tcp.py", line 269, in write
[3] raise CommClosedError()
[3] distributed.comm.core.CommClosedError
Anything else we need to know?:
After some checking, I believe this comes from an exception being raised in the atexit function that dask_mpi.initialize() registers to shut down the scheduler and workers from the client process. One can test this easily with the following script:
import atexit


def exit_error():
    raise RuntimeError('ERROR')


atexit.register(exit_error)
Assuming this script is called exiterror.py, run it and check the exit code:
> python exiterror.py
Exception ignored in atexit callback: <function exit_error at 0x10f827e20>
Traceback (most recent call last):
File ".../exiterror.py", line 5, in exit_error
raise RuntimeError('ERROR')
RuntimeError: ERROR
> echo $?
0
This is a well-known "feature" of CPython, discussed in python/cpython#71222 and the references therein.
So, assuming that an error in an atexit handler cannot produce a non-zero return code, I think the best we can do is fix the errors that are actually occurring. With Python 3.8, this can be done by using the Client.shutdown() method in the atexit handler. That means changing:
Lines 153 to 160 in bb5e4a8:
async def stop(dask_scheduler):
    await dask_scheduler.close()
    await gen.sleep(0.1)
    local_loop = dask_scheduler.loop
    local_loop.add_callback(local_loop.stop)


with Client() as c:
    c.run_on_scheduler(stop, wait=False)
to
with Client() as c:
    c.shutdown()
With this change, the test above throws no exceptions at exit.
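For reference, a minimal sketch of what the atexit-registered shutdown could look like after this change (the handler name send_close_signal is illustrative, not necessarily the exact name used in dask_mpi/core.py):

import atexit

from distributed import Client


def send_close_signal():
    # Illustrative handler; dask-mpi registers something similar via atexit so the
    # scheduler and workers are torn down when the client process exits.
    with Client() as c:
        # Client.shutdown() asks the scheduler to close itself and its workers,
        # so no separate run_on_scheduler(stop) call or event-loop callback is needed.
        c.shutdown()


atexit.register(send_close_signal)

Because the scheduler closes its own workers as part of shutdown(), the worker-to-scheduler comms are no longer torn down out from under the batched send, which is what produced the CommClosedError tracebacks above.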
Environment:
- Dask version: 2022.4.0
- Python version: 3.8.13
- Operating System: MacOS, Linux
- Install method (conda, pip, source): conda