I've been dealing with an issue that...well, I was convinced shouldn't be an issue, so I never said anything about it until dask/dask-blog#5. And after a discussion with @guillaumeeb, I was convinced that maybe I'm not as crazy (or as ill-informed) as I thought I was. So, here's the issue...
I've been trying to figure out a way of launching the Dask Scheduler, Workers, and the Client script in the same MPI environment. Currently, the way it works, you have to launch the Scheduler and Workers with dask-mpi under mpirun and then run your Client script separately, outside of the MPI environment.
I discussed one approach with @guillaumeeb that should work, something like the following:
```sh
# [PBS header info requesting N MPI processes]
mpirun -np N dask-mpi [dask-mpi options] &
python my_dask_script.py
```
However, this launches Scheduler/Worker processes on all N allocated MPI processes, and then the Client script runs alongside them as a separate, non-MPI process rather than on one of the MPI ranks.
What I was originally hoping for was a solution that allowed something more like this:
```sh
# [PBS header info requesting N MPI processes]
mpirun -np N dask-mpi [dask-mpi options] --script my_dask_script
```
But after thinking about it for a while, I found that what I really wanted was something that worked like this:
```sh
# [PBS header info requesting N MPI processes]
mpirun -np N python my_dask_mpi_script.py
```
At this point, I feel like I could write this myself...except that I don't know how to implement the "run the Client code on one rank and shut everything down when it finishes" part.
Any thoughts? Are there different solutions? Would you recommend something different?
Yes, something like this sounds like a great idea to me. I agree entirely with your design.
Yeah, that's not entirely trivial. In principle we want to do something like the following:
```python
if rank == 0:
    client = Client('SCHEDULER_ADDRESS')
    <user's code>
    client.sync(client.scheduler.terminate)
```
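To make that more concrete, here is a rough, runnable sketch of the full dispatch (not dask-mpi's actual implementation): rank 0 runs the client code as above, rank 1 shells out to the `dask-scheduler` CLI, and the remaining ranks shell out to `dask-worker`, with a scheduler file on a shared filesystem used to pass the address around. The rank assignments, file name, and exact flags are just assumptions for illustration.

```python
# Sketch only -- illustrates the dispatch idea, not how dask-mpi itself works.
import os
import subprocess
import time

from mpi4py import MPI
from dask.distributed import Client

SCHED_FILE = "scheduler.json"          # assumes a shared filesystem
rank = MPI.COMM_WORLD.Get_rank()


def wait_for(path):
    # crude wait for the scheduler file to appear
    while not os.path.exists(path):
        time.sleep(1)


if rank == 1:
    # run the scheduler until it is told to terminate
    subprocess.run(["dask-scheduler", "--scheduler-file", SCHED_FILE])
elif rank > 1:
    wait_for(SCHED_FILE)
    # --no-reconnect lets the worker exit once the scheduler goes away
    subprocess.run(["dask-worker", "--no-reconnect", "--scheduler-file", SCHED_FILE])
else:
    wait_for(SCHED_FILE)
    client = Client(scheduler_file=SCHED_FILE)
    # <user's code goes here>
    # shutting down the scheduler releases the other ranks as well
    client.sync(client.scheduler.terminate)
```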
Alternatively, @jacobtomlinson proposed another solution in #2346 where the scheduler would terminate automatically after 60s if no clients were connected. This was originally designed to clean up orphaned clusters, but could solve this problem as well.
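Conceptually, that amounts to a watchdog along these lines (purely illustrative; `have_connected_clients()` and `shutdown_cluster()` are hypothetical placeholders, not distributed APIs):

```python
import time

IDLE_TIMEOUT = 60  # seconds with no connected clients before shutting down
idle_since = None

while True:
    if have_connected_clients():      # hypothetical check
        idle_since = None
    elif idle_since is None:
        idle_since = time.time()
    elif time.time() - idle_since > IDLE_TIMEOUT:
        shutdown_cluster()            # hypothetical shutdown hook
        break
    time.sleep(1)
```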
As discussed with him, improving dask-mpi with what @kmpaul proposes seems really important.
That being said, I'm not sure which design to follow. I see the point of the second proposed design, but I think we should clarify how the user's script is expected to interact with dask-mpi.
Should the user script extend a dask-mpi class and implement a method? Should it only call something at the beginning of its main function?
```python
# User script
import dask_mpi
dask_mpi.initialize()

# client code continues
from dask.distributed import Client
client = Client()  # grabs address from dask.config.get('scheduler-address') automatically
```
```python
# Dask-mpi import file
# ... MPI Prelude

def start():
    if rank == 0:
        # start scheduler
        # wait until scheduler is finished
        sys.exit()
    elif rank == 1:
        return  # pass on to the client code coming next
    else:
        # start worker
        # wait until worker is finished
        sys.exit()
```
Then we execute with something like `mpirun -np N python my_dask_mpi_script.py`.
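Put together, a complete user script under this first pattern might look something like the following sketch (assuming `dask_mpi.initialize()` behaves as outlined above; the array computation is just a stand-in for real client code):

```python
# my_dask_mpi_script.py -- launched on every rank via:
#   mpirun -np N python my_dask_mpi_script.py
import dask.array as da
import dask_mpi
from dask.distributed import Client

# On the scheduler/worker ranks this call never returns; on the client rank
# it returns and the rest of the script runs as ordinary client code.
dask_mpi.initialize()

client = Client()  # picks up the scheduler address set during initialization

x = da.random.random((20000, 20000), chunks=(2000, 2000))
print(x.mean().compute())
```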
We could add a `--script` option to the dask-mpi command, which would import and run the user's script on one of the ranks:
```python
# dask-mpi.py
# ... MPI Prelude

if rank == 0:
    # start scheduler
    # wait for scheduler to finish
elif script and rank == 1:
    SCHEDULER_ADDRESS = get_scheduler_address(scheduler_file)  # search for scheduler_file in client.py
    with dask.config.set(scheduler_address=SCHEDULER_ADDRESS):
        importlib.import_module(script)
else:
    # start worker
    # wait for worker to finish
```
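For reference, the `get_scheduler_address` helper above could be as simple as polling for the scheduler file and reading the address out of it (a sketch; Dask scheduler files are small JSON documents with the contact address stored under the "address" key):

```python
import json
import os
import time


def get_scheduler_address(scheduler_file, timeout=60):
    """Wait for the scheduler file to appear and return the scheduler address."""
    deadline = time.time() + timeout
    while not os.path.exists(scheduler_file):
        if time.time() > deadline:
            raise TimeoutError("scheduler file %r never appeared" % scheduler_file)
        time.sleep(1)
    with open(scheduler_file) as f:
        return json.load(f)["address"]
```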
I like the first pattern. I think it will make more sense to "traditional" MPI users. I wonder how you would feel about dropping the explicit initialize() call and just doing the initialization when dask_mpi is imported?
I might have time (and I definitely have interest) in doing this today. I think the missing piece that @mrocklin provided for me was the `client.sync(client.scheduler.terminate)` trick for shutting everything down when the client code finishes.
I wouldn't mind pulling dask-mpi out of the distributed codebase if you're willing to make a new repository.
My guess is that initialization at import time might be difficult. For example
Oh! I like that idea. We could make
I'd be happy to make a new repository. If it's going to be maintained in the long term, I'll make it an NCAR repo.
This has now been completed in https://github.com/dask/dask-mpi with dask/dask-mpi#6. The PR implements the "functional initialization" enhancement and the "pulling dask-mpi out of the [distributed] codebase" request.
I will leave it to other dask developers to remove the dask-mpi code from distributed as they see fit.