
Is it possible to parallelize computations across GPUs? #27

Closed
KeAWang opened this issue Oct 1, 2019 · 7 comments

@KeAWang

KeAWang commented Oct 1, 2019

Hi,

I'm trying to parallelize computations on multiple GPUs with KeOps, but it seems like the computation happens sequentially across the GPUs. What I'm doing is:

from gpytorch.kernels.keops import RBFKernel

# Instantiate a Module on every GPU
rbfs = [RBFKernel().to(d) for d in range(2)]

# Instantiate the tensors on every GPU
xs = [torch.randn(5000, 1).to(d) for d in range(2)]

# Create a wrapper around a keops.torch.LazyTensor on each device that carries out the kernel matrix multiplication 
lztsrs = [rbf.forward(x, x) for rbf, x in zip(rbfs, xs)]

# Get the actual PyTorch kernel tensors by multiplying by the identity matrix
res = [t.evaluate() for t in lztsrs]

However, according to the GPU usage in nvidia-smi, the matrix multiplications are happening sequentially since only one GPU has 100% utilization at a time.

On the other hand, in PyTorch, for example, the following will dispatch the computations in parallel and all GPUs will simultaneously show high usage:

import torch

xs = [torch.randn(30000, 30000, device=f"cuda:{i}") for i in range(2)]
res = [x @ x for x in xs]

Is there any way to do KeOps computations on each GPU in parallel in the same way?

@jeanfeydy
Contributor

Hi @KeAWang ,

I know that @benoitmartin88 and @bcharlier worked on it for the Deformetrica software ~6 months ago and got it to work (I think). They may be busy at the moment (as we're working on the R backends with deadlines on Tuesday + Wednesday), but they will probably be able to answer your question.

Best regards,
Jean

@fradav
Contributor

fradav commented Oct 9, 2019

Hello,

@bcharlier told me about the issue some time ago and I forgot to write down the answer I got. I think there is some confusion about the parallelization/asynchronous features at the language level. It is true that PyTorch is perfectly capable of distributing workloads asynchronously across multiple GPUs.
On the other hand, PyKeOps' kernel computations are launched at the Python level, and there is no way Python can manage them asynchronously across multiple GPUs by itself.

The solution is to instantiate the tensors directly on each GPU and to use Python's asynchronous/multiprocessing facilities, such as the multiprocessing module or async calls.

I'll provide a minimal example.
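Something along these lines, for instance (a rough sketch using concurrent.futures; the Gaussian kernel matrix-vector product is only an illustration and the sizes are arbitrary):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import torch
from pykeops.torch import LazyTensor

def gaussian_matvec(device_id):
    # The tensors live on this worker's GPU from the start.
    x = torch.randn(100000, 3, device=f"cuda:{device_id}")
    y = torch.randn(100000, 3, device=f"cuda:{device_id}")
    x_i = LazyTensor(x[:, None, :])   # (M, 1, 3) symbolic variable
    y_j = LazyTensor(y[None, :, :])   # (1, N, 3) symbolic variable
    K_ij = (-((x_i - y_j) ** 2).sum(-1)).exp()  # Gaussian kernel, never materialized
    return (K_ij @ y).cpu()           # reduction over j, i.e. a kernel matvec

if __name__ == "__main__":
    # "spawn" is the safe start method when the workers use CUDA.
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        results = list(executor.map(gaussian_matvec, range(2)))

Each worker owns exactly one device, so the reductions can run at the same time on their respective GPUs.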

@benoitmartin88

Hi @KeAWang,

@bcharlier and I took a look at your issue.
One easy and quick way to solve this is to use Python's multiprocessing module.

Here is a working example:

import torch
import torch.multiprocessing as mp
from pykeops.torch import Genred

def work(d):
    # Each worker builds its tensors directly on its own GPU and runs
    # the KeOps reduction there, returning the result on the CPU.
    _my_conv = Genred('SqNorm2(x-y)', ['x = Vi(3)', 'y = Vj(3)'])
    _x = torch.randn(1000000, 3, device=f"cuda:{d}")
    _y = torch.randn(2000000, 3, device=f"cuda:{d}")
    return _my_conv(_x, _y, device_id=d).cpu()

if __name__ == '__main__':
    # An attempt at asynchronous calls to KeOps: "spawn" is the safe
    # start method when subprocesses use CUDA.
    mp.set_start_method("spawn", force=True)
    pool = mp.Pool(processes=3)
    res = pool.map(work, range(3))

Please note that using Python's multiprocessing will cost you the overhead of spawning new Python processes.

@KeAWang
Author

KeAWang commented Oct 9, 2019

Thank you so much @benoitmartin88 and @fradav! I will try this out.

@KeAWang KeAWang closed this as completed Oct 9, 2019
@KeAWang
Author

KeAWang commented Oct 22, 2019

The code you provided works great! However, it seems that PyTorch is unable to maintain the autograd graph across processes.

Sadly, this means that when using multiple GPUs with PyKeOps through multiprocessing, you won't be able to use PyTorch's automatic differentiation...

@bcharlier
Member

Hi @KeAWang,

The multiprocessing package is used in Deformetrica to dispatch the computation load on several GPUs. Deformetrica needs the gradient to perform the estimation process... Well, I am not sure, but I would bet that it is still possible (with some work) to keep the autograd graph alive with the map function.

@benoitmartin88, can you confirm?

Best,

b.

@benoitmartin88

Hi @KeAWang,

As @bcharlier mentioned, we do use PyTorch tensors in a multiprocessing context within Deformetrica.
That said, we compute the gradients inside each sub-process and explicitly return them.
You might have to use a similar technique.
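Concretely, something along these lines should work (a rough sketch building on the Genred example above; the sizes are only illustrative). Each worker calls backward() locally and returns plain tensors (the loss value and the gradient) rather than the graph itself:

import torch
import torch.multiprocessing as mp
from pykeops.torch import Genred

def work_with_grad(d):
    # Build the graph, run the KeOps reduction and call backward()
    # entirely inside this worker; only plain tensors are returned.
    my_conv = Genred('SqNorm2(x-y)', ['x = Vi(3)', 'y = Vj(3)'])
    x = torch.randn(100000, 3, device=f"cuda:{d}", requires_grad=True)
    y = torch.randn(200000, 3, device=f"cuda:{d}")
    loss = my_conv(x, y, device_id=d).sum()
    loss.backward()
    return loss.item(), x.grad.cpu()

if __name__ == '__main__':
    mp.set_start_method("spawn", force=True)
    pool = mp.Pool(processes=2)
    results = pool.map(work_with_grad, range(2))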

I hope this helps.
