
Non-blocking operations on scheduler #7233

Open · YJHMITWEB opened this issue Oct 29, 2022 · 7 comments

YJHMITWEB commented Oct 29, 2022

Hi, this is just a general discussion about the scheduler's behavior. I noticed that in Dask-MPI's initialize:

async def run_scheduler():
    async with Scheduler(
        interface=interface,
        protocol=protocol,
        dashboard=dashboard,
        dashboard_address=dashboard_address,
    ) as scheduler:
        comm.bcast(scheduler.address, root=0)
        comm.Barrier()
        await scheduler.finished()

The scheduler starts with only one thread, yet it has to maintain communication with all the workers and the client. I'm very curious whether, once there is an await call, the scheduler simply blocks there, and only after that await completes can it move on to other communications. If so, this seems like a huge communication overhead.

For example, in distributed.comm.ucx,

@log_errors
async def write(
    self,
    msg: dict,
    serializers: Collection[str] | None = None,
    on_error: str = "message",
) -> int:
    if self.closed():
        raise CommClosedError("Endpoint is closed -- unable to send message")
    try:
        if serializers is None:
            serializers = ("cuda", "dask", "pickle", "error")
        # msg can also be a list of dicts when sending batched messages
        logging.info("send msg={}".format(msg))
        frames = await to_frames(
            msg,
            serializers=serializers,
            on_error=on_error,
            allow_offload=self.allow_offload,
        )
        nframes = len(frames)
        cuda_frames = tuple(hasattr(f, "__cuda_array_interface__") for f in frames)
        sizes = tuple(nbytes(f) for f in frames)
        cuda_send_frames, send_frames = zip(
            *(
                (is_cuda, each_frame)
                for is_cuda, each_frame in zip(cuda_frames, frames)
                if nbytes(each_frame) > 0
            )
        )

        # Send metadata

        # Send close flag and number of frames (_Bool, int64)
        await self.ep.send(struct.pack("?Q", False, nframes))
        # Send which frames are CUDA (bool) and
        # how large each frame is (uint64)
        await self.ep.send(
            struct.pack(nframes * "?" + nframes * "Q", *cuda_frames, *sizes)
        )

        # Send frames

        # It is necessary to first synchronize the default stream before we
        # start sending. We synchronize the default stream because UCX is not
        # stream-ordered, and syncing the default stream will wait for other
        # non-blocking CUDA streams. Note this is only sufficient if the memory
        # being sent is not currently in use on non-blocking CUDA streams.
        if any(cuda_send_frames):
            synchronize_stream(0)

        for each_frame in send_frames:
            await self.ep.send(each_frame)
        return sum(sizes)
    except ucp.exceptions.UCXBaseException:
        self.abort()
        raise CommClosedError("While writing, the connection was closed")

There are several await self.ep.send calls. If, for example, this is a send from the scheduler to worker dask/dask-mpi#1, then even though all the other workers can perform computation in parallel, they still have to wait sequentially for their communication with the scheduler. In cases where communication is heavier than computation, the overhead will be significant.

I'm wondering if there is any way to perform non-blocking send/recv by giving the scheduler more threads.

kmpaul commented Oct 29, 2022

This seems to be a question more for the Distributed community. Dask-MPI is just the tool for launching a Dask cluster.

kmpaul commented Oct 30, 2022

Also, all communication between the scheduler and workers is asynchronous. So, nothing is blocking.

Is there a particular use case you are having problems with?
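
To illustrate this point, here is a toy sketch (not Distributed's actual code) of how awaited sends interleave on a single-threaded asyncio event loop; fake_send and its 0.1 s sleep are hypothetical stand-ins for a send whose latency is spent waiting on the network.

import asyncio
import time

async def fake_send(worker_id: int) -> None:
    # While this coroutine is suspended on the (simulated) network wait,
    # the event loop is free to run the other sends.
    await asyncio.sleep(0.1)
    print(f"sent to worker {worker_id}")

async def main() -> None:
    start = time.perf_counter()
    # Ten sends awaited concurrently finish in ~0.1 s, not ~1 s,
    # even though everything runs on a single thread.
    await asyncio.gather(*(fake_send(i) for i in range(10)))
    print(f"elapsed: {time.perf_counter() - start:.2f}s")

asyncio.run(main())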

YJHMITWEB (Author) commented

Oh thanks, I get it. But in the case where all the communication happens at the same time, even with async functions, since the scheduler has only one thread, it is still going to be sequential.

jacobtomlinson (Member) commented

@pentschev may have some thoughts

pentschev (Member) commented

On a Dask worker, communication occurs on a different thread than compute. On the scheduler, you're right that communication occurs on the same thread as the rest of the work, and that may in fact block at times; I don't think there's currently a mechanism to offload communications to one or more separate threads. However, under normal circumstances the messages transferred between the scheduler and workers are small, so that blocking time may not be substantial.

If you have the time, it would probably be an interesting experiment to offload communication to one or more threads on the scheduler and check whether there are any performance gains. As noted by @kmpaul, this discussion is anyway better suited for https://github.com/dask/distributed/issues, so it may be worth raising the question there to see what other people involved in Distributed think, or whether someone has already experimented with this.
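
For reference, a rough sketch of what such an experiment could look like using plain asyncio's run_in_executor; serialize and write_offloaded here are hypothetical stand-ins for illustration, not Distributed APIs.

import asyncio
import pickle
from concurrent.futures import ThreadPoolExecutor

# Pool dedicated to communication-related work; the size is arbitrary.
comm_pool = ThreadPoolExecutor(max_workers=4)

def serialize(msg: dict) -> bytes:
    # Hypothetical stand-in for frame serialization (e.g. to_frames):
    # CPU-bound work that would otherwise run on the event loop thread.
    return pickle.dumps(msg)

async def write_offloaded(msg: dict) -> bytes:
    loop = asyncio.get_running_loop()
    # run_in_executor moves serialize() onto a pool thread; the event
    # loop keeps servicing other coroutines while it runs.
    frames = await loop.run_in_executor(comm_pool, serialize, msg)
    # ...the actual ep.send() of `frames` would follow here...
    return frames

asyncio.run(write_offloaded({"op": "compute", "key": "x"}))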

jacobtomlinson transferred this issue from dask/dask-mpi Nov 1, 2022
jacobtomlinson (Member) commented

Agreed, transferring this issue to distributed.

pentschev (Member) commented Nov 1, 2022

Also, one important fact I should have mentioned is how async really works in UCX-Py. There are two modes: blocking (the default) and non-blocking (aka polling).

In blocking mode, UCX registers a file descriptor that UCX-Py keeps watching. Once there is an event on that file descriptor, the UCX-Py progress routine is awakened, and only then does UCX-Py truly block to complete communication; otherwise it allows other tasks in the event loop to proceed.

Non-blocking mode is much more straightforward: it tries to progress the UCX worker continuously. If there's any work to be done, it blocks until that completes; otherwise UCX-Py yields to the event loop and tries again once the event loop completes that iteration and arrives back at the progress routine.
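
A toy asyncio analogy of those two modes (not UCX-Py's actual implementation; assumes a Unix selector-based event loop): blocking mode waits on a file-descriptor event before progressing, while polling mode repeatedly attempts progress and yields between attempts.

import asyncio
import socket

async def blocking_style() -> None:
    # Blocking-mode analogue: register a file descriptor with the loop
    # and make progress only when an event arrives on it.
    rsock, wsock = socket.socketpair()
    loop = asyncio.get_running_loop()
    woken = loop.create_future()

    def on_event() -> None:
        # Awakened by the fd event; only now do we "progress" the work.
        woken.set_result(rsock.recv(1024))

    loop.add_reader(rsock.fileno(), on_event)
    wsock.send(b"wakeup")  # simulate an incoming event
    print("blocking style got:", await woken)
    loop.remove_reader(rsock.fileno())
    rsock.close()
    wsock.close()

async def polling_style() -> None:
    # Polling-mode analogue: repeatedly attempt to make progress,
    # yielding to the event loop between attempts.
    work_left = 3
    while work_left:
        work_left -= 1  # simulate progressing the UCX worker
        await asyncio.sleep(0)  # yield so other tasks can run

async def main() -> None:
    await blocking_style()
    await polling_style()

asyncio.run(main())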
