Having trouble moving a module from one GPU to another. #1191
Is there a more detailed traceback? Do you know if this works in PyTorch? Also -- just out of curiosity -- why move back and forth between GPUs?
Sorry for the massive block.
That's with trying to zero_grad the model. I had things running in PyTorch, but for some external reasons I wanted to switch over to C#. I had most of my code ported over before actually testing on the multi-GPU PC. Runs great on 1 GPU.
Okay... Stopping in the debugger just before calling backward() might show where things stand. Unfortunately, I'm not fortunate enough to have two GPUs that are usable for training, so I can't help debug, myself.
I've tried just about everything I could think of. aModel is the model that's training and aOutput is the loss output about to be .backward()'d. I didn't expand them all for the screenshot, but each _module and _internal_params was tagged with the expected cuda:1 device.
Right, but could you do a ...
Oh hey!! Thank you!
@shaltielshmid -- could this have anything to do with the recent changes? @Biotot -- does this work better on a version < 0.101.3?
Same issue in 0.101.0. Do you have an idea for a work-around that I could put together for now?
Not at the moment -- @shaltielshmid recently worked on fixing some stuff in that area.
BTW, tomorrow is my last day working this calendar year, so there's not likely to be another release before 2024.
If you call zero_grad() before moving the model, does that change anything?
No luck. Even calling it everywhere. It's zeroed (I'm assuming) but still on the original device.
Can you provide minimal sample code to replicate the issue? I thought I replicated it on my end, but the ...
I added a few extra zero_grads for fun. Excuse some lazy naming -- copy-pasting from tutorials and my own code.
System.Runtime.InteropServices.ExternalException: 'Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!'
The error triggers on the 2nd output.backward(), according to the added prints.
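The snippet itself didn't survive the formatting here, but a minimal sketch of the kind of repro being described -- illustrative names only, against the TorchSharp 0.101.x API -- would look roughly like this:

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

// Illustrative repro sketch -- model, Train, and the shapes are placeholders,
// not from the original post.
var model = nn.Linear(10, 1);
var loss_fn = nn.MSELoss();

void Train(Device device)
{
    model.to(device);
    using var opt = optim.SGD(model.parameters(), learningRate: 0.01);
    using var x = rand(new long[] { 8, 10 }, device: device);
    using var y = rand(new long[] { 8, 1 }, device: device);
    var output = loss_fn.forward(model.forward(x), y);
    opt.zero_grad();
    output.backward();   // the 2nd call threw: cuda:0 vs cuda:1 (pre-fix)
    opt.step();
}

Train(new Device("cuda:0"));
Train(new Device("cuda:1"));   // failed before the fix
```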
Okay, I see two things wrong here. 1] The gradients of a parameter should be copied during a move, as in PyTorch. I should have a fix ready for that in a few minutes. 2] When you call zero_grad(), the existing gradients should be cleared out entirely, not just zeroed in place.
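For anyone on an affected build who wants to see the mismatch directly, a small diagnostic sketch (continuing with the model from the repro above; in 0.101.x, Tensor.grad() was still a method rather than a property) is to print each parameter's device next to its gradient's device:

```csharp
// Dump each parameter's device alongside its gradient's device.
// In TorchSharp 0.101.x, grad() is a method; later releases made it a property.
foreach (var (name, p) in model.named_parameters())
{
    var g = p.grad();
    Console.WriteLine($"{name}: param on {p.device}, grad on " +
                      (g is null ? "none" : g.device.ToString()));
}
```

Before the fix, the parameters would report the new device while any surviving gradients still reported the old one.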
Awesome to have you involved with TorchSharp, @shaltielshmid!
@NiklasGustafsson awesome to be involved! Question for you: right now, with the fix of moving the gradients, the bug no longer occurs, but it raises a question -- should zero_grad() zero the gradients in place, or set them to none?
Update: it seems like libtorch's default behavior is to set to none as well: https://github.com/pytorch/pytorch/blob/main/torch/csrc/api/src/nn/module.cpp#L253 -- updating TorchSharp's behavior accordingly.
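Sketched out, the settled semantics look like this: after zero_grad(), a parameter simply reports no gradient at all, so nothing stale can linger on the old device:

```csharp
// Expected semantics after the change, matching libtorch's default:
// zero_grad() discards gradients instead of zeroing them in place.
model.zero_grad();
foreach (var (_, p) in model.named_parameters())
{
    System.Diagnostics.Debug.Assert(p.grad() is null);
}
```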
You guys are fantastic. Thanks for the quick turnaround.
If everything goes well, there should be a release today with this fix in it.
@Biotot Is the problem solved in version 0.101.5? |
It's working fantastic now. Thanks again. |
Discussed in #1190
Originally posted by Biotot December 19, 2023
I've been banging my head against this for a couple days and I'm still coming up empty.
I have multiple modules and multiple GPUs; however, this sequence continues to fail. I've narrowed it down to being a problem with the model: I can load it fresh from a file each loop and the error no longer exists.
(Pseudocode)
ModuleA.to(cuda:0)
TrainLoop
ModuleA.to(cpu)
ModuleA.to(cuda:1)
TrainLoop
Exception: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1
The error is consistently in the loss output.backward() call, if that helps.
This error doesn't happen if I load the module from a file each loop. Input data is not the issue; the model isn't correctly switching devices. I've tried many different combinations of code, including moving directly from cuda:0 to cuda:1, without luck.
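For reference, the reload-from-file workaround described above amounts to something like this sketch, where BuildModule() and the path are placeholders:

```csharp
// Workaround sketch (pre-fix): round-trip the weights through disk instead
// of moving the live module. "weights.dat" and BuildModule() are placeholders.
moduleA.save("weights.dat");
moduleA = BuildModule();               // hypothetical factory recreating the architecture
moduleA.load("weights.dat");
moduleA.to(new torch.Device("cuda:1")); // fresh module trains fine on the second GPU
```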
I'm not sure what's going wrong. I've been porting my code over from PyTorch and trying to get past this hurdle. Any help would be appreciated.
Running on TorchSharp-cuda-windows 0.101.4