Model loading fails for models trained with data parallelism #8
Comments
An explanation & solution for batch normalization when training on multiple GPUs:
I'll take a look at this. Also, here's the link for the model: https://drive.google.com/file/d/12Qr7SUhGTWugqJ9AvBEl4aDTDOvTEm-h/view?usp=sharing
#47 solves the optimizer issue when resuming training from a saved model.
This hasn't been fixed yet. The problem is not the "module" prefix in the multi-GPU state_dict keys. Models trained on multiple GPUs load without errors, but they don't perform as they should, most likely because batch normalization statistics are computed on each device separately and are never synchronized across devices. I tested this last week, and the earlier optimizer fix does not solve it.
A quick fix would be: |
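One common way to keep batch-norm statistics consistent across devices, sketched here as an assumption rather than the project's actual fix, is to convert the BatchNorm layers to torch.nn.SyncBatchNorm and train under DistributedDataParallel (the model below is only a placeholder):

```python
import torch
import torch.nn as nn

# Placeholder network; substitute the repository's actual model.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Replace every BatchNorm*d layer with SyncBatchNorm so running statistics
# are reduced across devices during training.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# SyncBatchNorm only takes effect under DistributedDataParallel (one process
# per GPU), not under nn.DataParallel, e.g.:
# model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank])
```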
I believe #53 has fixed the issue. When doing the evaluation, please load the model via:

This way the framework will wrap the model with nn.DataParallel instead of creating a matching key pattern to load the weights. I have tried it, and it seems to have fixed the issue, but feel free to reopen this if it does not solve your problem. Also, one downside of the current fix is that it might not be backward-compatible with previously multi-GPU-trained models.
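As a rough illustration of that loading path, here is a minimal sketch, assuming the checkpoint is a plain state_dict saved from an nn.DataParallel-wrapped model, and using a stand-in network and a hypothetical file name (the project's actual loading helper may differ):

```python
import torch
import torch.nn as nn

# Stand-in network; substitute the repository's actual model class.
model = nn.Sequential(nn.Linear(10, 10))

# A state_dict saved from an nn.DataParallel-wrapped model carries a
# "module." prefix on every key. Wrapping the fresh model the same way
# makes the keys line up, so load_state_dict succeeds without remapping.
model = nn.DataParallel(model)

state_dict = torch.load("checkpoint.pth", map_location="cpu")  # hypothetical path
model.load_state_dict(state_dict)

# Equivalent without DataParallel: strip the prefix instead.
# stripped = {k.replace("module.", "", 1): v for k, v in state_dict.items()}
```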
The new method does not distribute memory across multiple GPUs.
8b23932 should've fixed this issue:
When training on multiple GPUs, the model can only be loaded onto gpu:0, not gpu:1, and most of the time I still get OOM errors.
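A general workaround, sketched below with a stand-in network and a hypothetical checkpoint path, is to pass map_location to torch.load so the weights are deserialized onto the CPU (or a chosen GPU) instead of the device they were saved from; whether the repository's fix uses exactly this approach is an assumption:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10))  # stand-in for the actual network

# torch.save records the device each tensor lived on; a plain torch.load then
# tries to restore everything onto that device (typically cuda:0), which can
# OOM. map_location remaps the storages at load time before the weights are
# copied into the model.
state = torch.load("checkpoint.pth", map_location="cpu")        # hypothetical path
# state = torch.load("checkpoint.pth", map_location="cuda:1")   # or target another GPU
model.load_state_dict(state)
model.to("cuda:1")
```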
288b2ef fixes this issue: