I'm trying to support the following scenario: We have multiple machines with multiple NVIDIA GPUs. On each machine, we run one worker per GPU. When a task gets scheduled on a worker, we want to make sure that any GPU routine invoked by the task runs on a GPU that is not in use by any other worker. Libraries such as TensorFlow and PyTorch schedule GPU work on the visible GPU with the lowest bus id, so by default, tasks running in parallel will all try to schedule on the same GPU if they have the same set of visible GPUs. CUDA supports masking the GPUs available to a process by setting the `CUDA_VISIBLE_DEVICES` environment variable. I want to ensure that if two tasks are running on workers on the same machine, the GPUs they can see (and hence the GPUs they can schedule on) do not overlap.
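For illustration, a minimal sketch of that masking mechanism (the device id here is arbitrary and just for the example):

```python
import os

# Masking must happen before the process initializes CUDA, i.e. before
# TensorFlow or PyTorch is imported or first touches a GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only physical GPU 1

import torch

# The process now sees a single GPU, which PyTorch reports as device 0.
print(torch.cuda.device_count())  # prints 1 on a machine with a GPU with id 1
```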
I can think of two possible approaches here:
- Assigning each worker a GPU to use at startup, by running `dask-worker` with `CUDA_VISIBLE_DEVICES` set such that each worker sees a unique GPU, and with `--resources "GPU=1"` (see the first sketch after this list).
- Adding a scheduler plugin that assigns tasks to available GPUs on the host of the worker they were scheduled on, and adding a preamble to each task's target that sets the environment variable appropriately (see the second sketch after this list).
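For the first approach, the worker launch itself happens at the shell level (one `dask-worker` process per GPU, each with a different `CUDA_VISIBLE_DEVICES`), and tasks would then be pinned to GPU-holding workers through Dask's resources mechanism. A rough sketch of the client side, assuming each worker was started with `--resources "GPU=1"` and a scheduler at a hypothetical address:

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

def run_gpu_task():
    # Framework code here sees only the single GPU exposed by this
    # worker's CUDA_VISIBLE_DEVICES, so it cannot collide with other workers.
    import torch
    return torch.cuda.device_count()  # always 1 under this scheme

# resources={"GPU": 1} restricts the task to workers that advertise a GPU
# resource, so it can never land on a worker without a dedicated GPU.
future = client.submit(run_gpu_task, resources={"GPU": 1})
print(future.result())
```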
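For the second approach, the scheduler-plugin side is the part I'm least sure how to write, but the preamble could look roughly like the decorator below. `acquire_free_gpu` and `release_gpu` are hypothetical helpers that would coordinate per-host GPU ownership (e.g., via lock files), not anything that exists in Dask:

```python
import os
from functools import wraps

def gpu_preamble(func):
    """Claim an unused GPU on this host before running the wrapped task."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Hypothetical helper: reserve a GPU no other worker on this host holds.
        gpu_id = acquire_free_gpu()
        # Note: this only takes effect if the worker process has not yet
        # initialized CUDA, since CUDA_VISIBLE_DEVICES is read once at
        # context creation.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        try:
            return func(*args, **kwargs)
        finally:
            release_gpu(gpu_id)  # hypothetical: return the GPU to the pool
    return wrapper
```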
I'm new to Dask, so I'm not sure if either of these is appropriate. Is there prior art on handling situations like this?