How to distribute the model computation across multiple GPUs? #156
Comments
Could you show your example explicitly and the error messages you got? Perhaps have a look at this tutorial?
This is how I am wrapping the model, similarly to what is shown in the tutorial:

```python
model = RecurrentGCN(node_features=n_features)

if torch.cuda.device_count() > 1:
    print("Available/CUDA_VISIBLE_DEVICES", os.environ["CUDA_VISIBLE_DEVICES"])
    print("Device count", torch.cuda.device_count())
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = DataParallel(model, device_ids=[0, 1])

model.to(device)
```

When using DataParallel from torch.nn I got the following error:
Moreover, after I got stuck with this, I tried using DataParallel from torch_geometric.nn, because it is the closest class for graph data that I found. But it uses the Data format from torch_geometric and expects graph attributes such as num_nodes that StaticGraphTemporalSignal does not provide. I tried passing them as kwargs, but that did not work either. I could try modifying the class, but I wanted to ask first: what should I try next?
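In case it helps to make the question concrete, this is roughly the direction I was experimenting with (a sketch, not working code): rebuilding each snapshot as a torch_geometric.data.Data object with num_nodes set explicitly, and batching the snapshots with DataListLoader, which is what torch_geometric.nn.DataParallel expects. The model's forward would presumably also need to be changed to accept a single data/batch argument instead of separate tensors.

```python
from torch_geometric.data import Data
from torch_geometric.loader import DataListLoader  # torch_geometric.data in older releases
from torch_geometric.nn import DataParallel

# Rebuild the temporal snapshots as plain Data objects carrying num_nodes,
# so torch_geometric.nn.DataParallel can scatter them across GPUs.
data_list = [
    Data(
        x=snapshot.x,
        edge_index=snapshot.edge_index,
        edge_attr=snapshot.edge_attr,
        y=snapshot.y,
        num_nodes=snapshot.x.size(0),
    )
    for snapshot in train_dataset  # train_dataset: StaticGraphTemporalSignal
]

loader = DataListLoader(data_list, batch_size=4, shuffle=False)
model = DataParallel(RecurrentGCN(node_features=n_features), device_ids=[0, 1]).to("cuda:0")

for batch in loader:      # batch is a Python list of Data objects
    y_hat = model(batch)  # DataParallel splits the list across the GPUs
```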
The issue seems to be that you are putting your data and model on different devices (GPU and CPU): "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" Could you try putting your data object on the GPU as well?
I showed an old error by mistake; this is what I got when the tensors are sent to the GPUs:
In the model definition, when doing the calculation, you could map the data objects onto the same device, e.g. `X = X.to(Y)`.
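A minimal sketch of what that could look like, assuming a tutorial-style RecurrentGCN built on a DCRNN layer (adjust to whatever recurrent layer you actually use):

```python
import torch
import torch.nn.functional as F
from torch_geometric_temporal.nn.recurrent import DCRNN

class RecurrentGCN(torch.nn.Module):
    def __init__(self, node_features):
        super().__init__()
        self.recurrent = DCRNN(node_features, 32, 1)
        self.linear = torch.nn.Linear(32, 1)

    def forward(self, x, edge_index, edge_weight):
        # Move the incoming tensors onto the device the layer's weights live on
        # (the "X = X.to(Y)" pattern mentioned above).
        target = next(self.parameters()).device
        x = x.to(target)
        edge_index = edge_index.to(target)
        edge_weight = edge_weight.to(target)
        h = F.relu(self.recurrent(x, edge_index, edge_weight))
        return self.linear(h)
```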
What do you mean by the model definition? By calculation, do you mean the forward pass?

```python
for snapshot in train_dataset:
    snapshot = snapshot.to(device)
    y_hat = model(snapshot)
```

I'm not sure what I am missing.
I am still running into errors regarding this. It seems the previous error was caused by sending a custom object to the model instead of tensors; DataParallel does not support that, since only tensors can be split along their batch dimension. After fixing that, there is another error, triggered by PyTorch Geometric:

Any idea what is causing this?
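For reference, the tensor-based call now looks roughly like this (a sketch; variable names follow the snippets above, and the edge weights are assumed to live in snapshot.edge_attr as in the tutorial dataset):

```python
model = torch.nn.DataParallel(RecurrentGCN(node_features=n_features), device_ids=[0, 1])
model = model.to(device)

for snapshot in train_dataset:
    # torch.nn.DataParallel scatters each positional tensor by chunking it along dim 0.
    x = snapshot.x.to(device)                    # [num_nodes, num_features]
    edge_index = snapshot.edge_index.to(device)  # [2, num_edges]
    edge_weight = snapshot.edge_attr.to(device)  # [num_edges]
    y_hat = model(x, edge_index, edge_weight)
```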
I am trying to distribute the model computation across multiple GPUs using DataParallel from the PyTorch Geometric library.
I was trying to follow this example, but I am running into errors.
Is this the way to do it, or should I look somewhere else?
Are there any examples out there of distributing models from the PyTorch Geometric Temporal library?