
Raise an exception if the comm world is too small #107

Merged
merged 5 commits into dask:main on Sep 21, 2023

Conversation

@jacobtomlinson (Member) commented on Sep 18, 2023

When using the CLI to start a cluster with a scheduler, it is assumed there will be at least two ranks: one for the scheduler and at least one for workers. When using the initialize function, at least three ranks are assumed: one for the scheduler, one for the client, and at least one for workers.

This PR adds a runtime check for this: if there aren't enough processes in the MPI comm world, an exception is raised rather than hanging while waiting for processes that will never arrive.
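
As a rough sketch of the idea, assuming mpi4py (the exception name and error message are taken from the traceback below; check_comm_size is a hypothetical helper name, and the actual check in dask_mpi/cli.py may be structured differently), the check could look something like this:

    from mpi4py import MPI

    class WorldTooSmallException(RuntimeError):
        """Raised when the MPI comm world is too small to start a cluster."""

    def check_comm_size(min_ranks=2):
        # Number of ranks launched, e.g. via `mpirun -np N`
        world_size = MPI.COMM_WORLD.Get_size()
        if world_size < min_ranks:
            raise WorldTooSmallException(
                f"Not enough MPI ranks to start cluster, found {world_size}, "
                f"needs at least {min_ranks}, one each for the scheduler and a worker."
            )

For the CLI the minimum would be 2, while initialize() would perform the same check with a minimum of 3 (scheduler, client, and at least one worker).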

$ mpirun -np 1 dask-mpi        
...
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtomlinson/miniconda3/envs/dask/bin/dask-mpi", line 33, in <module>
    sys.exit(load_entry_point('dask-mpi', 'console_scripts', 'dask-mpi')())
  File "/home/jtomlinson/miniconda3/envs/dask/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/jtomlinson/miniconda3/envs/dask/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/jtomlinson/miniconda3/envs/dask/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jtomlinson/miniconda3/envs/dask/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/jtomlinson/Projects/dask/dask-mpi/dask_mpi/cli.py", line 103, in main
    raise WorldTooSmallException(
dask_mpi.exceptions.WorldTooSmallException: Not enough MPI ranks to start cluster, found 1, needs at least 2, one each for the scheduler and a worker.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[20199,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Review comment on dask_mpi/cli.py (outdated, resolved)
@jacobtomlinson (Member, Author) commented:

There have been no further comments, so I'm going to go ahead and merge this.

jacobtomlinson merged commit a8890e6 into dask:main on Sep 21, 2023
7 checks passed
jacobtomlinson deleted the small-world branch on September 21, 2023 at 13:58