
Test and add a description of distributed computation #14

Closed
ar4 opened this issue Jul 14, 2018 · 6 comments

@ar4
Owner

ar4 commented Jul 14, 2018

PyTorch allows distributed computation, and this is usually necessary for realistic datasets. It should be tested to ensure that it works with Deepwave, and a description added to the documentation explaining to users how to do it.

@vkazei
Contributor

vkazei commented Mar 21, 2022

Hi Alan,
Could you share some thoughts on the easiest way to distribute shots?
Best,
Vladimir

@ar4
Owner Author

ar4 commented Mar 22, 2022 via email

@vkazei
Contributor

vkazei commented Mar 22, 2022

Thanks a lot for getting back. I am trying to distribute FWI and LSRTM within a single node. I tried wrapping the propagator with prop = nn.DataParallel(prop), which appears to be the most basic option.
Applying source_amplitudes.swapaxes(0,1).swapaxes(1,2) before the propagator and swapping back inside the propagator seems to distribute the inputs properly, but the model itself does not get replicated and stays on the "cuda:0" device, so even forward propagation does not work.
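
(For illustration, a minimal sketch of the kind of shot-first wrapper described above, assuming the old deepwave.scalar.Propagator call signature prop(source_amplitudes, x_s, x_r, dt) with source_amplitudes shaped [nt, num_shots, num_sources_per_shot]; the class name is hypothetical.)

import torch

class ShotFirstProp(torch.nn.Module):
    # Hypothetical wrapper: nn.DataParallel scatters and gathers tensors
    # along dim 0, so inputs and outputs are exchanged shots-first and
    # swapped back around the call to the underlying Deepwave propagator.
    def __init__(self, prop):
        super().__init__()
        self.prop = prop

    def forward(self, source_amplitudes, x_s, x_r, dt):
        # Caller passes [num_shots, num_sources_per_shot, nt]
        # (source_amplitudes.swapaxes(0, 1).swapaxes(1, 2));
        # restore [nt, num_shots, num_sources_per_shot] here.
        source_amplitudes = source_amplitudes.permute(2, 0, 1)
        receiver_amplitudes = self.prop(source_amplitudes, x_s, x_r, dt)
        # Return shots-first so the gather concatenates along the shot
        # axis instead of the time axis.
        return receiver_amplitudes.permute(1, 0, 2)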

@ar4
Owner Author

ar4 commented Mar 22, 2022

Hi Vladimir,

I think I see the problem. The model is not being registered as a parameter of the propagator, and so PyTorch doesn't know that it needs to copy it to the other devices. I don't have multiple GPUs to test it on at the moment, but does manually registering the parameters as below work?

# Register the velocity model as a parameter so DataParallel knows to
# replicate it to each device
prop = deepwave.scalar.Propagator({'vp': model}, dx)
prop.register_parameter('vp', torch.nn.Parameter(model))
prop = torch.nn.DataParallel(prop)

If so, I will fix it in the next release of Deepwave so that the parameters are automatically registered.
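
(For context on why registration matters, a minimal standalone sketch, not Deepwave code: nn.DataParallel only replicates tensors that a module exposes as parameters or buffers, so a tensor stored as a plain attribute is shared by reference and stays on its original device.)

import torch

class Toy(torch.nn.Module):
    def __init__(self, vp):
        super().__init__()
        # Assigning an nn.Parameter registers it, so DataParallel copies
        # it to every device; a plain tensor attribute (self.vp = vp)
        # would stay wherever it was created.
        self.vp = torch.nn.Parameter(vp)

    def forward(self, x):
        return x * self.vp.sum()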

@vkazei
Contributor

vkazei commented Mar 23, 2022

Hi Alan,

Registering the model as a parameter did not change the behavior. Manually sending it to the same device as the inputs inside the forward method lets nn.DataParallel run, but it looks like the forward propagation for the different GPUs runs sequentially.
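
(One hedged sketch of the usual alternative when nn.DataParallel ends up serializing work, and not necessarily what later closed this issue: one process per GPU via torch.multiprocessing and torch.distributed, each rank propagating its own subset of shots and all-reducing the model gradient. The old deepwave.scalar.Propagator API and the shapes discussed above are assumed, and the helper names are made up for illustration.)

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import deepwave

def run_rank(rank, world_size, model_cpu, dx, dt,
             source_amplitudes, x_s, x_r, observed):
    # One process per GPU; NCCL backend for the gradient all-reduce.
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:29500',
                            rank=rank, world_size=world_size)
    device = torch.device('cuda', rank)
    model = model_cpu.to(device).requires_grad_()
    prop = deepwave.scalar.Propagator({'vp': model}, dx)

    # Each rank takes every world_size-th shot (shot axis: dim 1 of
    # source_amplitudes/observed, dim 0 of x_s/x_r).
    shots = torch.arange(rank, x_s.shape[0], world_size)
    pred = prop(source_amplitudes[:, shots].to(device),
                x_s[shots].to(device), x_r[shots].to(device), dt)
    loss = torch.nn.functional.mse_loss(pred, observed[:, shots].to(device))
    loss.backward()

    # Sum the velocity-model gradient over ranks before an optimizer step.
    dist.all_reduce(model.grad)
    dist.destroy_process_group()

# Launch, e.g.:
# mp.spawn(run_rank, args=(n_gpus, model, dx, dt, src, x_s, x_r, obs),
#          nprocs=n_gpus)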

@ar4
Owner Author

ar4 commented Mar 23, 2022 via email

ar4 closed this as completed in ab95326 on Sep 3, 2022