
How to distribute the model to compute in multiple GPUs? #156

Closed
Adricu8 opened this issue Apr 14, 2022 · 7 comments

Comments

@Adricu8

Adricu8 commented Apr 14, 2022

I am trying to distribute the model computation across multiple GPUs using DataParallel from the PyTorch Geometric library.
I was trying to follow this example, but I am running into errors.
Is this the right way to do it, or should I look somewhere else?
Are there any examples out there for distributing models from the PyTorch Geometric Temporal library?
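
For reference, the multi-GPU pattern in the PyG DataParallel example looks roughly like this (only a sketch, not my actual code; Net and dataset are placeholders, and it assumes the torch_geometric.nn.DataParallel / DataListLoader API):

import torch
from torch_geometric.loader import DataListLoader
from torch_geometric.nn import DataParallel

# DataListLoader yields plain Python lists of Data objects instead of a single batch,
# so that DataParallel can scatter them across the available GPUs.
loader = DataListLoader(dataset, batch_size=32, shuffle=True)

model = DataParallel(Net())               # splits each list of Data objects across GPUs
model = model.to(torch.device('cuda:0'))

for data_list in loader:
    out = model(data_list)                # each replica receives its own chunk of the list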

@SherylHYX
Collaborator

Could you show your example explicitly, along with the error messages you got? Perhaps have a look at this tutorial?

@Adricu8
Author

Adricu8 commented Apr 21, 2022

This is how I am wrapping the model, similar to what is shown in the tutorial:

model = RecurrentGCN(node_features = n_features)
if torch.cuda.device_count() > 1:
    print("Available/CUDA_VISIBLE_DEVICES", os.environ["CUDA_VISIBLE_DEVICES"])
    print("Device count", torch.cuda.device_count())
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = DataParallel(model, device_ids=[0, 1])
model.to(device)

When using DataParallel from torch.nn I got the following error:

Exception has occurred: RuntimeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 50, in forward
h_0 = self.recurrent1(x, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 163, in forward
Z = self._calculate_update_gate(X, edge_index, edge_weight, H)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 120, in _calculate_update_gate
Z = self.conv_x_z(X, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 143, in forward
edge_index, norm = self.norm(edge_index, x.size(self.node_dim),
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 119, in norm
edge_weight = (2.0 * edge_weight) / lambda_max
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/_utils.py", line 434, in reraise
raise exception
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 173, in
y_hat = model(snapshot)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 268, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,

Moreover, after getting stuck on this, I tried using DataParallel from torch_geometric.nn, because it is the closest class I found that handles graph data. However, it uses the Data format from torch_geometric.

'tuple' object has no attribute 'num_nodes'
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/data_parallel.py", line 75, in
count = torch.tensor([data.num_nodes for data in data_list])
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/data_parallel.py", line 75, in scatter
count = torch.tensor([data.num_nodes for data in data_list])
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/data_parallel.py", line 67, in forward
inputs = self.scatter(data_list, self.device_ids)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 172, in
y_hat = model(snapshot)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 268, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,

It expects graph attributes such as num_nodes, which StaticGraphTemporalSignal does not provide. I tried passing them as kwargs, but that did not work either. I could try modifying the class, but I wanted to ask first: what should I try next?

@SherylHYX
Collaborator

SherylHYX commented Apr 21, 2022

The issue seems to be that your data and model are placed on both the GPU and the CPU: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" Could you try putting your data object on the GPU as well?

@Adricu8
Author

Adricu8 commented Apr 21, 2022

I showed an old error by mistake; this is what I get when the tensors are sent to the GPUs.

Exception has occurred: RuntimeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 50, in forward
h_0 = self.recurrent1(x, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 163, in forward
Z = self._calculate_update_gate(X, edge_index, edge_weight, H)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 120, in _calculate_update_gate
Z = self.conv_x_z(X, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 143, in forward
edge_index, norm = self.norm(edge_index, x.size(self.node_dim),
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 119, in norm
edge_weight = (2.0 * edge_weight) / lambda_max
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/_utils.py", line 434, in reraise
raise exception
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 174, in
y_hat = model(snapshot)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 268, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,

@SherylHYX
Collaborator

In the model definition, when doing the calculation, you could map the data objects to the same device, e.g. with "X = X.to(Y)".
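
For example, something along these lines inside the forward pass (only a sketch; the GConvGRU layer and sizes are guesses based on your traceback, not your exact model):

import torch
from torch_geometric_temporal.nn.recurrent import GConvGRU

class RecurrentGCN(torch.nn.Module):
    def __init__(self, node_features, filters=32):
        super().__init__()
        self.recurrent1 = GConvGRU(node_features, filters, K=2)
        self.linear = torch.nn.Linear(filters, 1)

    def forward(self, x, edge_index, edge_weight):
        # map the graph tensors to the same device as this replica's x ("X = X.to(Y)")
        edge_index = edge_index.to(x.device)
        edge_weight = edge_weight.to(x.device)
        h = torch.relu(self.recurrent1(x, edge_index, edge_weight))
        return self.linear(h)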

@Adricu8
Author

Adricu8 commented Apr 21, 2022

What do you mean by the model definition? By calculation, do you mean the forward pass?
In the tutorial, it seems that using x.to(device) creates the batches and sends them to the GPUs.
I'm doing this in my training loop:

for snapshot in train_dataset:
    snapshot = snapshot.to(device)
    y_hat = model(snapshot)

I'm not sure what I am missing.

@Adricu8
Author

Adricu8 commented Apr 26, 2022

I am still running into errors with this. It seems the previous error was caused by sending a custom object to the model instead of plain tensors; DataParallel does not support that, since only tensors can be split along their batch dimension.
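
For example, this is roughly what the default scatter does to the inputs (a small sketch, assuming two visible GPUs):

import torch
from torch.nn.parallel import scatter

x = torch.randn(20, 4, device='cuda:0')                      # node features [num_nodes, num_features]
edge_index = torch.randint(0, 20, (2, 50), device='cuda:0')  # graph connectivity [2, num_edges]

# torch.nn.DataParallel chunks every tensor argument along dim 0 across the devices.
x_chunks = scatter(x, [0, 1])             # two tensors of shape [10, 4], one per GPU
ei_chunks = scatter(edge_index, [0, 1])   # two tensors of shape [1, 50]: edge_index loses a row

print([t.shape for t in x_chunks])
print([t.shape for t in ei_chunks])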

Now, after solving the previous error, there is another error, this time triggered by PyTorch Geometric:

Exception has occurred: IndexError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 49, in forward
h_0 = self.recurrent1(x, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 163, in forward
Z = self._calculate_update_gate(X, edge_index, edge_weight, H)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 120, in _calculate_update_gate
Z = self.conv_x_z(X, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 143, in forward
edge_index, norm = self.norm(edge_index, x.size(self.node_dim),
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/nn/conv/cheb_conv.py", line 110, in norm
edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch_geometric/utils/loop.py", line 36, in remove_self_loops
mask = edge_index[0] != edge_index[1]
IndexError: index 1 is out of bounds for dimension 0 with size 1
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/_utils.py", line 434, in reraise
raise exception
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid0/users/acg384/workspace/code/LSTM_ddp2.py", line 176, in
y_hat = model(x, edge_index, edge_weight)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 268, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/acg384/miniconda3/envs/pytorch_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,

Any idea what is causing this?
Thank you!
