Model loading failed when training in parallel #8

Closed
waynehuu opened this issue Oct 31, 2019 · 10 comments
@waynehuu (Collaborator)

No description provided.

waynehuu added the bug label on Nov 1, 2019
bohaohuang self-assigned this on Nov 1, 2019
@bohaohuang (Owner)

An explanation and a solution for batch normalization when training on multiple GPUs:
https://github.com/dougsouza/pytorch-sync-batchnorm-example
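
For reference, a minimal sketch of the synchronized-batchnorm setup the linked example describes, using PyTorch's built-in torch.nn.SyncBatchNorm with DistributedDataParallel (the function name and setup below are illustrative, not code from this repo):

    import torch
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel

    def wrap_for_sync_bn(model, local_rank):
        # Assumes torch.distributed.init_process_group() has already been called
        # with one process per GPU, as in the linked example.
        # Replace every BatchNorm layer with SyncBatchNorm so running statistics
        # are reduced across devices instead of kept per GPU.
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
        model = model.to(torch.device(f'cuda:{local_rank}'))
        return DistributedDataParallel(model, device_ids=[local_rank])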

@waynehuu (Collaborator, Author) commented Nov 1, 2019

> An explanation and a solution for batch normalization when training on multiple GPUs:
> https://github.com/dougsouza/pytorch-sync-batchnorm-example

I'll take a look at this. Also, here's the link to the model: https://drive.google.com/file/d/12Qr7SUhGTWugqJ9AvBEl4aDTDOvTEm-h/view?usp=sharing

waynehuu closed this as completed on Nov 1, 2019
waynehuu reopened this on Nov 1, 2019
@bohaohuang (Owner)

#47 solves the optimizer issue when resuming training of a model.

@waynehuu (Collaborator, Author) commented Mar 2, 2020

This hasn't been fixed yet.

The problem is not the "module" prefix in the multi-GPU state_dict keys. Multi-GPU-trained models can be loaded without errors, but they don't perform as they should, probably because batch normalization statistics are computed on each device separately and are not synchronized across devices. I tested this last week, and the previous optimizer fix does not solve it.
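
For context, the kind of prefix handling being ruled out here looks roughly like this (an illustrative sketch, not the repo's network_utils code; the checkpoint layout is assumed):

    import torch

    def load_multi_gpu_checkpoint(model, ckpt_path):
        # Strip the "module." prefix that nn.DataParallel adds to parameter names.
        # The keys then match a single-GPU model, so loading succeeds even though
        # the BN running statistics were accumulated per device during training.
        state_dict = torch.load(ckpt_path, map_location='cpu')
        state_dict = {k.replace('module.', '', 1): v for k, v in state_dict.items()}
        model.load_state_dict(state_dict)
        return model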

waynehuu reopened this on Mar 2, 2020
@bohaohuang (Owner) commented Mar 3, 2020

A quick fix would be:

    import torch.nn as nn  # model, ckpt_dir, and network_utils as in the training script

    # Wrap the sub-networks with DataParallel before loading the multi-GPU checkpoint.
    model.encoder = nn.DataParallel(model.encoder)
    model.decoder = nn.DataParallel(model.decoder)
    network_utils.load(model, ckpt_dir, disable_parallel=True)

But I think this is due to the DataParallel wrapping in the training process; let me investigate this a little bit.

@bohaohuang (Owner)

I believe #53 has fixed the issue.

When doing the evaluation, please load the model via:

    network_utils.load(model, ckpt_dir)

instead of:

    network_utils.load(model, ckpt_dir, disable_parallel=True)

This way the framework will try to wrap the model with nn.DataParallel instead of creating a matching key pattern to load the weights.

I have tried it and it seems to have fixed the issue, but feel free to reopen this if it does not solve your problem.

Also, one downside of the current fix is that it might not be backward-compatible with previously trained multi-GPU models.
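The wrap-then-load strategy looks roughly like this (a sketch under assumptions about how network_utils.load behaves, not its actual code); compare it with the key-rewriting approach sketched earlier in the thread:

    import torch
    import torch.nn as nn

    def load_by_wrapping(model, ckpt_path):
        # Wrap the sub-networks with nn.DataParallel first, so the "module."
        # prefix in the multi-GPU checkpoint matches the model's own key names
        # and no key rewriting is needed.
        model.encoder = nn.DataParallel(model.encoder)
        model.decoder = nn.DataParallel(model.decoder)
        model.load_state_dict(torch.load(ckpt_path, map_location='cpu'))
        return model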

@bohaohuang (Owner)

The new method could not distribute memory across multiple GPUs.

bohaohuang reopened this on Mar 4, 2020
@bohaohuang (Owner)

8b23932 should have fixed this issue:

  1. The encoder and decoder still need to be wrapped with DataParallel separately to enable memory distribution across GPUs.
  2. Model attributes need to be forwarded by a custom DataParallel class to avoid an OOM error at inference after loading the model; a sketch of this pattern follows below.
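
The attribute forwarding in point 2 can be done with a small DataParallel subclass along these lines (the class name is illustrative; the actual change in 8b23932 may differ):

    import torch.nn as nn

    class AttrDataParallel(nn.DataParallel):
        """DataParallel that falls back to the wrapped module for attribute access."""
        def __getattr__(self, name):
            try:
                # Parameters, buffers, and sub-modules registered on the wrapper itself.
                return super().__getattr__(name)
            except AttributeError:
                # Anything else (e.g. custom model attributes) comes from the wrapped module.
                return getattr(self.module, name)

    # Usage sketch: wrap encoder/decoder separately to spread memory across GPUs
    # while code written against the plain model keeps working.
    # model.encoder = AttrDataParallel(model.encoder)
    # model.decoder = AttrDataParallel(model.decoder)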

@bohaohuang (Owner)

When training on multiple GPUs, the model can only be loaded onto gpu:0, not gpu:1, and most of the time we still get OOM errors.

bohaohuang reopened this on Mar 6, 2020
@bohaohuang (Owner)

288b2ef fixes this issue: the GPU loading error is solved by setting the primary device properly for DataParallel; the OOM error appears to be a rarely occurring CUDA bug.
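
Setting the primary device explicitly looks roughly like this (a sketch; the actual change in 288b2ef may differ):

    import torch
    import torch.nn as nn

    def to_primary_gpu(model, device_ids=(1, 0)):
        # The first entry of device_ids acts as the primary/output device,
        # so the model is no longer tied to cuda:0.
        primary = torch.device(f'cuda:{device_ids[0]}')
        torch.cuda.set_device(primary)
        model = model.to(primary)
        model.encoder = nn.DataParallel(model.encoder, device_ids=list(device_ids),
                                        output_device=device_ids[0])
        model.decoder = nn.DataParallel(model.decoder, device_ids=list(device_ids),
                                        output_device=device_ids[0])
        return model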
