
Is it possible to parallelize computations across GPUs? #27

Closed
KeAWang opened this issue Oct 1, 2019 · 7 comments

@KeAWang

KeAWang commented Oct 1, 2019

Hi,

I'm trying to parallelize computations on multiple GPUs with KeOps, but it seems like the computation happens sequentially across the GPUs. What I'm doing is:

from gpytorch.kernels.keops import RBFKernel

# Instantiate a Module on every GPU
rbfs = [RBFKernel().to(d) for d in range(2)]

# Instantiate the tensors on every GPU
xs = [torch.randn(5000, 1).to(d) for d in range(2)]

# Create a wrapper around a keops.torch.LazyTensor on each device that carries out the kernel matrix multiplication 
lztsrs = [rbf.forward(x, x) for rbf, x in zip(rbfs, xs)]

# Get the actual PyTorch kernel tensors by multiplying by the identity matrix
res = [t.evaluate() for t in lztsrs]

However, according to the GPU usage in nvidia-smi, the matrix multiplications are happening sequentially since only one GPU has 100% utilization at a time.

On the other hand, in PyTorch, for example, the following will dispatch the computations in parallel and all GPUs will simultaneously show high usage:

import torch

xs = [torch.randn(30000, 30000, device=f"cuda:{i}") for i in range(2)]
res = [x @ x for x in xs]

Is there any way to do KeOps computations on each GPU in parallel in the same way?

@jeanfeydy
Contributor

Hi @KeAWang ,

I know that @benoitmartin88 and @bcharlier worked on it for the Deformetrica software ~6 months ago and got it to work (I think). They may be busy at the moment (as we're working on the R backends with deadlines on Tuesday + Wednesday), but they will probably be able to answer your question.

Best regards,
Jean

@fradav
Contributor

fradav commented Oct 9, 2019

Hello,

@bcharlier told me about the issue some time ago and I forgot to write down the answer I got. I think there is some confusion about the parallelization/asynchronous features at the language level. It is true that PyTorch is perfectly capable of distributing workloads asynchronously across multiple GPUs.
On the other hand, PyKeOps' kernel computations are launched at the Python level, and there is no way Python can manage them asynchronously across multiple GPUs by itself.

The solution is to instantiate the tensors directly on each GPU and to use Python's asynchronous/multiprocessing facilities, such as the multiprocessing module or async calls.

I'll provide a minimal example.
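Something along these lines, for instance (a rough sketch using concurrent.futures; the Gaussian kernel matrix-vector product is only an illustration and the sizes are arbitrary):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import torch
from pykeops.torch import LazyTensor

def gaussian_matvec(device_id):
    # The tensors live on this worker's GPU from the start.
    x = torch.randn(100000, 3, device=f"cuda:{device_id}")
    y = torch.randn(100000, 3, device=f"cuda:{device_id}")
    x_i = LazyTensor(x[:, None, :])   # (M, 1, 3) symbolic variable
    y_j = LazyTensor(y[None, :, :])   # (1, N, 3) symbolic variable
    K_ij = (-((x_i - y_j) ** 2).sum(-1)).exp()  # Gaussian kernel, never materialized
    return (K_ij @ y).cpu()           # reduction over j, i.e. a kernel matvec

if __name__ == "__main__":
    # "spawn" is the safe start method when the workers use CUDA.
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        results = list(executor.map(gaussian_matvec, range(2)))

Each worker owns exactly one device, so the reductions can run at the same time on their respective GPUs.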

@benoitmartin88

Hi @KeAWang,

@bcharlier and I took a look at your issue.
One easy and quick way to solve this is to use Python's multiprocessing module.

Here is a working example:

import torch
import torch.multiprocessing as mp
from pykeops.torch import Genred

def work(d):
    # Each worker builds its tensors directly on its own GPU and runs
    # the KeOps reduction there, returning the result on the CPU.
    _my_conv = Genred('SqNorm2(x-y)', ['x = Vi(3)', 'y = Vj(3)'])
    _x = torch.randn(1000000, 3, device=f"cuda:{d}")
    _y = torch.randn(2000000, 3, device=f"cuda:{d}")
    return _my_conv(_x, _y, device_id=d).cpu()

if __name__ == '__main__':
    # An attempt at asynchronous calls to KeOps: "spawn" is the safe
    # start method when subprocesses use CUDA.
    mp.set_start_method("spawn", force=True)
    pool = mp.Pool(processes=3)
    res = pool.map(work, range(3))

Please note that using Python's multiprocessing will cost you the overhead of spawning new Python processes.

@KeAWang
Author

KeAWang commented Oct 9, 2019

Thank you so much @benoitmartin88 and @fradav! I will try this out.

@KeAWang KeAWang closed this as completed Oct 9, 2019
@KeAWang
Author

KeAWang commented Oct 22, 2019

The code you provided works great! However, it seems that PyTorch is unable to maintain the autograd graph across processes.

Sadly, this means that when using multiple GPUs with PyKeOps through multiprocessing, you won't be able to use PyTorch's automatic differentiation...

@bcharlier
Member

Hi @KeAWang,

The multiprocessing package is used in Deformetrica to dispatch the computation load on several GPUs. Deformetrica needs the gradient to perform the estimation process... Well, I am not sure, but I would bet that it is still possible (with some work) to keep the autograd graph alive with the map function.

@benoitmartin88, can you confirm?

Best,

b.

@benoitmartin88

Hi @KeAWang,

As @bcharlier mentioned, we do use PyTorch tensors in a multiprocessing context within Deformetrica.
That said, we compute the gradients inside each sub-process and explicitly return them.
You might have to use a similar technique.
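Concretely, something along these lines should work (a rough sketch building on the Genred example above; the sizes are only illustrative). Each worker calls backward() locally and returns plain tensors (the loss value and the gradient) rather than the graph itself:

import torch
import torch.multiprocessing as mp
from pykeops.torch import Genred

def work_with_grad(d):
    # Build the graph, run the KeOps reduction and call backward()
    # entirely inside this worker; only plain tensors are returned.
    my_conv = Genred('SqNorm2(x-y)', ['x = Vi(3)', 'y = Vj(3)'])
    x = torch.randn(100000, 3, device=f"cuda:{d}", requires_grad=True)
    y = torch.randn(200000, 3, device=f"cuda:{d}")
    loss = my_conv(x, y, device_id=d).sum()
    loss.backward()
    return loss.item(), x.grad.cpu()

if __name__ == '__main__':
    mp.set_start_method("spawn", force=True)
    pool = mp.Pool(processes=2)
    results = pool.map(work_with_grad, range(2))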

I hope this helps.
