Description
What happened:
In Python 3.8 tests, the dask_mpi/tests/test_core.py and dask_mpi/tests/test_no_exit.py tests execute processes that raise exceptions at shutdown and return 0 exit codes.
What you expected to happen:
I would expect any test that raises an exception to return a non-0 exit code.
Minimal Complete Verifiable Example:
With Python 3.8 on MacOS or Linux, run the command that is executed in the dask_mpi/tests/test_core.py test:
mpirun -l -np 4 python dask_mpi/tests/test_core.py
Note the raised CommClosedError and that the exit code is 0:
echo $?
Full Logs
[0] 2022-04-20 11:17:09,840 - distributed.scheduler - INFO - State start
[0] 2022-04-20 11:17:09,844 - distributed.scheduler - INFO - Clear task state
[0] 2022-04-20 11:17:09,844 - distributed.scheduler - INFO - Scheduler at: tcp://192.168.1.27:63569
[0] 2022-04-20 11:17:09,845 - distributed.scheduler - INFO - dashboard at: :63568
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Start worker at: tcp://192.168.1.27:63572
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Listening to: tcp://192.168.1.27:63572
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - dashboard at: 192.168.1.27:63573
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Start worker at: tcp://192.168.1.27:63570
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Listening to: tcp://192.168.1.27:63570
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Waiting to connect to: tcp://192.168.1.27:63569
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - dashboard at: 192.168.1.27:63574
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - -------------------------------------------------
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Waiting to connect to: tcp://192.168.1.27:63569
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Threads: 1
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Memory: 2.00 GiB
[2] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Local Directory: /Users/kpaul/Software/Development/dask-mpi/dask_mpi/tests/dask-worker-space/worker-iqjsppe2
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - -------------------------------------------------
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Threads: 1
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Memory: 2.00 GiB
[3] 2022-04-20 11:17:09,868 - distributed.worker - INFO - Local Directory: /Users/kpaul/Software/Development/dask-mpi/dask_mpi/tests/dask-worker-space/worker-lb3qdkh6
[2] 2022-04-20 11:17:09,869 - distributed.worker - INFO - -------------------------------------------------
[3] 2022-04-20 11:17:09,869 - distributed.worker - INFO - -------------------------------------------------
[0] 2022-04-20 11:17:10,267 - distributed.scheduler - INFO - Receive client connection: Client-b626d33c-c0cd-11ec-b1b6-acde48001122
[0] 2022-04-20 11:17:10,270 - distributed.core - INFO - Starting established connection
[0] 2022-04-20 11:17:10,338 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://192.168.1.27:63570', name: 3, status: undefined, memory: 0, processing: 0>
[0] 2022-04-20 11:17:10,339 - distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.1.27:63570
[0] 2022-04-20 11:17:10,339 - distributed.core - INFO - Starting established connection
[3] 2022-04-20 11:17:10,339 - distributed.worker - INFO - Registered to: tcp://192.168.1.27:63569
[3] 2022-04-20 11:17:10,339 - distributed.worker - INFO - -------------------------------------------------
[3] 2022-04-20 11:17:10,340 - distributed.core - INFO - Starting established connection
[0] 2022-04-20 11:17:10,344 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://192.168.1.27:63572', name: 2, status: undefined, memory: 0, processing: 0>
[0] 2022-04-20 11:17:10,344 - distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.1.27:63572
[0] 2022-04-20 11:17:10,344 - distributed.core - INFO - Starting established connection
[2] 2022-04-20 11:17:10,345 - distributed.worker - INFO - Registered to: tcp://192.168.1.27:63569
[2] 2022-04-20 11:17:10,345 - distributed.worker - INFO - -------------------------------------------------
[2] 2022-04-20 11:17:10,346 - distributed.core - INFO - Starting established connection
[0] 2022-04-20 11:17:10,627 - distributed.scheduler - INFO - Remove client Client-b626d33c-c0cd-11ec-b1b6-acde48001122
[0] 2022-04-20 11:17:10,627 - distributed.scheduler - INFO - Remove client Client-b626d33c-c0cd-11ec-b1b6-acde48001122
[0] 2022-04-20 11:17:10,628 - distributed.scheduler - INFO - Close client connection: Client-b626d33c-c0cd-11ec-b1b6-acde48001122
[0] 2022-04-20 11:17:10,639 - distributed.scheduler - INFO - Receive client connection: Client-b69e5de4-c0cd-11ec-b1b6-acde48001122
[0] 2022-04-20 11:17:10,639 - distributed.core - INFO - Starting established connection
[0] 2022-04-20 11:17:10,642 - distributed.worker - INFO - Run out-of-band function 'stop'
[0] 2022-04-20 11:17:10,643 - distributed.scheduler - INFO - Scheduler closing...
[0] 2022-04-20 11:17:10,643 - distributed.scheduler - INFO - Scheduler closing all comms
[2] 2022-04-20 11:17:10,645 - distributed.worker - INFO - Stopping worker at tcp://192.168.1.27:63572
[3] 2022-04-20 11:17:10,644 - distributed.worker - INFO - Stopping worker at tcp://192.168.1.27:63570
[0] 2022-04-20 11:17:10,645 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://192.168.1.27:63570', name: 3, status: running, memory: 0, processing: 0>
[0] 2022-04-20 11:17:10,645 - distributed.core - INFO - Removing comms to tcp://192.168.1.27:63570
[0] 2022-04-20 11:17:10,645 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://192.168.1.27:63572', name: 2, status: running, memory: 0, processing: 0>
[0] 2022-04-20 11:17:10,645 - distributed.core - INFO - Removing comms to tcp://192.168.1.27:63572
[0] 2022-04-20 11:17:10,645 - distributed.scheduler - INFO - Lost all workers
[2] 2022-04-20 11:17:10,649 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://192.168.1.27:63576 remote=tcp://192.168.1.27:63569>
[2] Traceback (most recent call last):
[2] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/distributed/batched.py", line 94, in _background_send
[2] nbytes = yield self.comm.write(
[2] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
[2] value = future.result()
[2] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/distributed/comm/tcp.py", line 269, in write
[2] raise CommClosedError()
[2] distributed.comm.core.CommClosedError
[3] 2022-04-20 11:17:10,648 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://192.168.1.27:63577 remote=tcp://192.168.1.27:63569>
[3] Traceback (most recent call last):
[3] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/distributed/batched.py", line 94, in _background_send
[3] nbytes = yield self.comm.write(
[3] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
[3] value = future.result()
[3] File "/Users/kpaul/Software/miniconda3/envs/dask-mpi-38/lib/python3.8/site-packages/distributed/comm/tcp.py", line 269, in write
[3] raise CommClosedError()
[3] distributed.comm.core.CommClosedError
Anything else we need to know?:
After some checking, I believe this comes from an exception being raised in the atexit function that dask_mpi.initialize() registers to shut down the scheduler and workers from the client process. One can test this easily with the following script:
import atexit


def exit_error():
    raise RuntimeError('ERROR')


atexit.register(exit_error)
Assuming this script is called exiterror.py, run it and check the exit code:
> python exiterror.py
Exception ignored in atexit callback: <function exit_error at 0x10f827e20>
Traceback (most recent call last):
File ".../exiterror.py", line 5, in exit_error
raise RuntimeError('ERROR')
RuntimeError: ERROR
> echo $?
0
This is a well-known "feature" of CPython, discussed in python/cpython#71222 and the references therein.
So, assuming that an error in an atexit handler cannot produce a non-zero return code, I think the best we can do is fix the errors that are actually occurring. With Python 3.8, this can be done by using the Client.shutdown() method in the atexit handler. That means changing:
Lines 153 to 160 in bb5e4a8:
async def stop(dask_scheduler):
    await dask_scheduler.close()
    await gen.sleep(0.1)
    local_loop = dask_scheduler.loop
    local_loop.add_callback(local_loop.stop)


with Client() as c:
    c.run_on_scheduler(stop, wait=False)
to
with Client() as c:
    c.shutdown()
With this change, the test above throws no exceptions at exit.
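For reference, a minimal sketch of what the atexit-registered shutdown could look like after this change (the handler name send_close_signal is illustrative, not necessarily the exact name used in dask_mpi/core.py):

import atexit

from distributed import Client


def send_close_signal():
    # Illustrative handler; dask-mpi registers something similar via atexit so the
    # scheduler and workers are torn down when the client process exits.
    with Client() as c:
        # Client.shutdown() asks the scheduler to close itself and its workers,
        # so no separate run_on_scheduler(stop) call or event-loop callback is needed.
        c.shutdown()


atexit.register(send_close_signal)

Because the scheduler closes its own workers as part of shutdown(), the worker-to-scheduler comms are no longer torn down out from under the batched send, which is what produced the CommClosedError tracebacks above.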
Environment:
- Dask version: 2022.4.0
- Python version: 3.8.13
- Operating System: MacOS, Linux
- Install method (conda, pip, source): conda