Model loading failed when training in parallel #8

Closed
waynehuu opened this issue Oct 31, 2019 · 10 comments
@waynehuu (Collaborator)

No description provided.

waynehuu added the bug label on Nov 1, 2019
bohaohuang self-assigned this on Nov 1, 2019
@bohaohuang (Owner)

An explanation and a solution for batch normalization when training on multiple GPUs:
https://github.com/dougsouza/pytorch-sync-batchnorm-example
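
For reference, a minimal sketch of the synchronized-batchnorm setup the linked example describes, using PyTorch's built-in torch.nn.SyncBatchNorm with DistributedDataParallel (the function name and setup below are illustrative, not code from this repo):

    import torch
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel

    def wrap_for_sync_bn(model, local_rank):
        # Assumes torch.distributed.init_process_group() has already been called
        # with one process per GPU, as in the linked example.
        # Replace every BatchNorm layer with SyncBatchNorm so running statistics
        # are reduced across devices instead of kept per GPU.
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
        model = model.to(torch.device(f'cuda:{local_rank}'))
        return DistributedDataParallel(model, device_ids=[local_rank])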

@waynehuu (Collaborator, Author) commented Nov 1, 2019

> An explanation and a solution for batch normalization when training on multiple GPUs:
> https://github.com/dougsouza/pytorch-sync-batchnorm-example

I'll take a look at this. Also, here's the link to the model: https://drive.google.com/file/d/12Qr7SUhGTWugqJ9AvBEl4aDTDOvTEm-h/view?usp=sharing

waynehuu closed this as completed on Nov 1, 2019
waynehuu reopened this on Nov 1, 2019
@bohaohuang (Owner)

#47 solves the optimizer issue when resuming training of a model.

@waynehuu (Collaborator, Author) commented Mar 2, 2020

This hasn't been fixed yet.

The problem is not the "module" prefix in the multi-GPU state_dict keys. Multi-GPU-trained models can be loaded without errors, but they don't perform as they should, probably because batch normalization statistics are computed on each device separately and are not synchronized across devices. I tested this last week, and the previous optimizer fix does not solve it.
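
For context, the kind of prefix handling being ruled out here looks roughly like this (an illustrative sketch, not the repo's network_utils code; the checkpoint layout is assumed):

    import torch

    def load_multi_gpu_checkpoint(model, ckpt_path):
        # Strip the "module." prefix that nn.DataParallel adds to parameter names.
        # The keys then match a single-GPU model, so loading succeeds even though
        # the BN running statistics were accumulated per device during training.
        state_dict = torch.load(ckpt_path, map_location='cpu')
        state_dict = {k.replace('module.', '', 1): v for k, v in state_dict.items()}
        model.load_state_dict(state_dict)
        return model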

waynehuu reopened this on Mar 2, 2020
@bohaohuang (Owner) commented Mar 3, 2020

A quick fix would be:

    import torch.nn as nn  # model, ckpt_dir, and network_utils as in the training script

    # Wrap the sub-networks with DataParallel before loading the multi-GPU checkpoint.
    model.encoder = nn.DataParallel(model.encoder)
    model.decoder = nn.DataParallel(model.decoder)
    network_utils.load(model, ckpt_dir, disable_parallel=True)

But I think this is due to the DataParallel wrapping in the training process; let me investigate this a little bit.

@bohaohuang (Owner)

I believe #53 has fixed the issue.

When doing the evaluation, please load the model via:

    network_utils.load(model, ckpt_dir)

instead of:

    network_utils.load(model, ckpt_dir, disable_parallel=True)

This way the framework will try to wrap the model with nn.DataParallel instead of creating a matching key pattern to load the weights.

I have tried it and it seems to have fixed the issue, but feel free to reopen this if it does not solve your problem.

Also, one downside of the current fix is that it might not be backward-compatible with previously trained multi-GPU models.
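The wrap-then-load strategy looks roughly like this (a sketch under assumptions about how network_utils.load behaves, not its actual code); compare it with the key-rewriting approach sketched earlier in the thread:

    import torch
    import torch.nn as nn

    def load_by_wrapping(model, ckpt_path):
        # Wrap the sub-networks with nn.DataParallel first, so the "module."
        # prefix in the multi-GPU checkpoint matches the model's own key names
        # and no key rewriting is needed.
        model.encoder = nn.DataParallel(model.encoder)
        model.decoder = nn.DataParallel(model.decoder)
        model.load_state_dict(torch.load(ckpt_path, map_location='cpu'))
        return model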

@bohaohuang (Owner)

The new method could not distribute memory across multiple GPUs.

bohaohuang reopened this on Mar 4, 2020
@bohaohuang (Owner)

8b23932 should have fixed this issue:

  1. The encoder and decoder still need to be wrapped with DataParallel separately to enable memory distribution across GPUs.
  2. Model attributes need to be forwarded by a custom DataParallel class to avoid an OOM error at inference after loading the model; a sketch of this pattern follows below.
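
The attribute forwarding in point 2 can be done with a small DataParallel subclass along these lines (the class name is illustrative; the actual change in 8b23932 may differ):

    import torch.nn as nn

    class AttrDataParallel(nn.DataParallel):
        """DataParallel that falls back to the wrapped module for attribute access."""
        def __getattr__(self, name):
            try:
                # Parameters, buffers, and sub-modules registered on the wrapper itself.
                return super().__getattr__(name)
            except AttributeError:
                # Anything else (e.g. custom model attributes) comes from the wrapped module.
                return getattr(self.module, name)

    # Usage sketch: wrap encoder/decoder separately to spread memory across GPUs
    # while code written against the plain model keeps working.
    # model.encoder = AttrDataParallel(model.encoder)
    # model.decoder = AttrDataParallel(model.decoder)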

@bohaohuang (Owner)

When training on multiple GPUs, the model can only be loaded onto gpu:0, not gpu:1, and most of the time we still get OOM errors.

bohaohuang reopened this on Mar 6, 2020
@bohaohuang (Owner)

288b2ef fixes this issue: the GPU loading error is solved by setting the primary device properly for DataParallel; the OOM error appears to be a rarely occurring CUDA bug.
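
Setting the primary device explicitly looks roughly like this (a sketch; the actual change in 288b2ef may differ):

    import torch
    import torch.nn as nn

    def to_primary_gpu(model, device_ids=(1, 0)):
        # The first entry of device_ids acts as the primary/output device,
        # so the model is no longer tied to cuda:0.
        primary = torch.device(f'cuda:{device_ids[0]}')
        torch.cuda.set_device(primary)
        model = model.to(primary)
        model.encoder = nn.DataParallel(model.encoder, device_ids=list(device_ids),
                                        output_device=device_ids[0])
        model.decoder = nn.DataParallel(model.decoder, device_ids=list(device_ids),
                                        output_device=device_ids[0])
        return model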
