
How to distribute the model to compute in multiple GPUs? #156

Closed
Adricu8 opened this issue Apr 14, 2022 · 7 comments

Comments

@Adricu8

Adricu8 commented Apr 14, 2022

I am trying to distribute the model computation across multiple GPUs using DataParallel from the PyTorch Geometric library.
I was trying to follow this example, but I am running into errors.
Is this the right way to do it, or should I look somewhere else?
Are there any examples out there for distributing models from the PyTorch Geometric Temporal library?
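
For reference, the multi-GPU pattern in the PyG DataParallel example looks roughly like this (only a sketch, not my actual code; Net and dataset are placeholders, and it assumes the torch_geometric.nn.DataParallel / DataListLoader API):

import torch
from torch_geometric.loader import DataListLoader
from torch_geometric.nn import DataParallel

# DataListLoader yields plain Python lists of Data objects instead of a single batch,
# so that DataParallel can scatter them across the available GPUs.
loader = DataListLoader(dataset, batch_size=32, shuffle=True)

model = DataParallel(Net())               # splits each list of Data objects across GPUs
model = model.to(torch.device('cuda:0'))

for data_list in loader:
    out = model(data_list)                # each replica receives its own chunk of the list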

@SherylHYX
Collaborator

Could you show your example explicitly, along with the error messages you got? Perhaps have a look at this tutorial?

@Adricu8
Author

Adricu8 commented Apr 21, 2022

This is how I am wrapping the model, similar to what is shown in the tutorial:

model = RecurrentGCN(node_features = n_features)
if torch.cuda.device_count() > 1:
    print("Available/CUDA_VISIBLE_DEVICES", os.environ["CUDA_VISIBLE_DEVICES"])
    print("Device count", torch.cuda.device_count())
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = DataParallel(model, device_ids=[0, 1])
model.to(device)

When using DataParallel from torch.nn I got the following error:

Exception has occurred: RuntimeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 50, in forward
h_0 = self.recurrent1(x, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 163, in forward
Z = self._calculate_update_gate(X, edge_index, edge_weight, H)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 120, in _calculate_update_gate
Z = self.conv_x_z(X, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 143, in forward
edge_index, norm = self.norm(edge_index, x.size(self.node_dim),
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 119, in norm
edge_weight = (2.0 * edge_weight) / lambda_max
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/_utils.py", line 434, in reraise
raise exception
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 173, in
y_hat = model(snapshot)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 268, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,

Moreover, after getting stuck on this, I tried using DataParallel from torch_geometric.nn, because it is the closest class I found that handles graph data. However, it uses the Data format from torch_geometric.

'tuple' object has no attribute 'num_nodes'
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/data_parallel.py", line 75, in
count = torch.tensor([data.num_nodes for data in data_list])
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/data_parallel.py", line 75, in scatter
count = torch.tensor([data.num_nodes for data in data_list])
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/data_parallel.py", line 67, in forward
inputs = self.scatter(data_list, self.device_ids)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 172, in
y_hat = model(snapshot)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 268, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,

It expects graph attributes such as num_nodes, which StaticGraphTemporalSignal does not provide. I tried passing them as kwargs, but that did not work either. I could try modifying the class, but I wanted to ask first: what should I try next?

@SherylHYX
Collaborator

SherylHYX commented Apr 21, 2022

The issue seems to be that your data and model are placed on both the GPU and the CPU: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" Could you try putting your data object on the GPU as well?

@Adricu8
Author

Adricu8 commented Apr 21, 2022

I showed an old error by mistake; this is what I get when the tensors are sent to the GPUs.

Exception has occurred: RuntimeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 50, in forward
h_0 = self.recurrent1(x, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 163, in forward
Z = self._calculate_update_gate(X, edge_index, edge_weight, H)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 120, in _calculate_update_gate
Z = self.conv_x_z(X, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 143, in forward
edge_index, norm = self.norm(edge_index, x.size(self.node_dim),
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 119, in norm
edge_weight = (2.0 * edge_weight) / lambda_max
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/_utils.py", line 434, in reraise
raise exception
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 174, in
y_hat = model(snapshot)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 268, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,

@SherylHYX
Collaborator

In the model definition, when doing the calculation, you could map the data objects to the same device, e.g. with "X = X.to(Y)".
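
For example, something along these lines inside the forward pass (only a sketch; the GConvGRU layer and sizes are guesses based on your traceback, not your exact model):

import torch
from torch_geometric_temporal.nn.recurrent import GConvGRU

class RecurrentGCN(torch.nn.Module):
    def __init__(self, node_features, filters=32):
        super().__init__()
        self.recurrent1 = GConvGRU(node_features, filters, K=2)
        self.linear = torch.nn.Linear(filters, 1)

    def forward(self, x, edge_index, edge_weight):
        # map the graph tensors to the same device as this replica's x ("X = X.to(Y)")
        edge_index = edge_index.to(x.device)
        edge_weight = edge_weight.to(x.device)
        h = torch.relu(self.recurrent1(x, edge_index, edge_weight))
        return self.linear(h)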

@Adricu8
Author

Adricu8 commented Apr 21, 2022

What do you mean by the model definition? By calculation, do you mean the forward pass?
In the tutorial, it seems that using x.to(device) creates the batches and sends them to the GPUs.
I'm doing this in my training loop:

for snapshot in train_dataset:
    snapshot = snapshot.to(device)
    y_hat = model(snapshot)

I'm not sure what I am missing.

@Adricu8
Author

Adricu8 commented Apr 26, 2022

I am still running into errors with this. It seems the previous error was caused by sending a custom object to the model instead of plain tensors; DataParallel does not support that, since only tensors can be split along their batch dimension.
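
For example, this is roughly what the default scatter does to the inputs (a small sketch, assuming two visible GPUs):

import torch
from torch.nn.parallel import scatter

x = torch.randn(20, 4, device='cuda:0')                      # node features [num_nodes, num_features]
edge_index = torch.randint(0, 20, (2, 50), device='cuda:0')  # graph connectivity [2, num_edges]

# torch.nn.DataParallel chunks every tensor argument along dim 0 across the devices.
x_chunks = scatter(x, [0, 1])             # two tensors of shape [10, 4], one per GPU
ei_chunks = scatter(edge_index, [0, 1])   # two tensors of shape [1, 50]: edge_index loses a row

print([t.shape for t in x_chunks])
print([t.shape for t in ei_chunks])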

Now, after solving the previous error, there is another error, this time triggered by PyTorch Geometric:

Exception has occurred: IndexError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 49, in forward
h_0 = self.recurrent1(x, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 163, in forward
Z = self._calculate_update_gate(X, edge_index, edge_weight, H)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 120, in _calculate_update_gate
Z = self.conv_x_z(X, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 143, in forward
edge_index, norm = self.norm(edge_index, x.size(self.node_dim),
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 110, in norm
edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/utils/loop.py", line 36, in remove_self_loops
mask = edge_index[0] != edge_index[1]
IndexError: index 1 is out of bounds for dimension 0 with size 1
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/_utils.py", line 434, in reraise
raise exception
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 176, in
y_hat = model(x, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 268, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,

Any idea what is causing this?
Thank you!
