It seems PyTorch Lightning can't be used? #36

Closed
langdaoliu opened this issue Jul 27, 2021 · 2 comments

langdaoliu commented Jul 27, 2021

I train SAM with PyTorch Lightning. When I use multi-GPU training, I get the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256]], which is output 73 of BroadcastBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
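As the hint in the traceback suggests, anomaly detection can help locate the offending in-place operation. A minimal sketch of how to turn it on (placing it at the top of the training script, before the Trainer is built, is just an assumption about a typical setup):

import torch

# Enable autograd anomaly detection so the backward error includes a
# traceback pointing at the in-place operation that broke the graph.
torch.autograd.set_detect_anomaly(True)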


Alicegaz commented Aug 7, 2021

This error arises when you run two consecutive forward passes and then try to backpropagate through them separately. I was able to solve this in the DDP case by merging training_step and training_step_end like this:

def __init__(self, hparams, data_module):
    super().__init__(hparams, data_module)
    # SAM needs two backward passes per batch, so switch to manual optimization.
    self.automatic_optimization = False

def training_step(self, batch, batch_idx, dataloader_idx=None):
    x, y = batch
    opt = self.optimizers()  # the SAM optimizer

    # First pass: compute the loss and take the SAM ascent step.
    self.enable_bn(self.model)
    out = self(x)
    loss, losses = self.criterion(out, y)
    self.manual_backward(loss, opt)
    opt.first_step(zero_grad=True)

    # Second pass on the same batch with BatchNorm statistics frozen,
    # followed by the actual parameter update.
    self.disable_bn(self.model)
    out_2 = self(x)
    loss_2, losses_2 = self.criterion(out_2, y)
    self.manual_backward(loss_2, opt)
    opt.second_step(zero_grad=True)

    # Keep the progress-bar loss in sync when automatic optimization is off.
    self.trainer.train_loop.running_loss.append(loss)
    return loss
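The enable_bn / disable_bn helpers referenced above are not shown in this thread. A minimal sketch, assuming they follow the common SAM recipe of freezing BatchNorm running statistics during the second pass (torch is assumed to be imported in the module):

def disable_bn(self, model):
    # Assumed helper: stop BatchNorm layers from updating their running
    # statistics a second time on the same batch.
    for module in model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            module.backup_momentum = module.momentum
            module.momentum = 0

def enable_bn(self, model):
    # Assumed helper: restore the original BatchNorm momentum before the first pass.
    for module in model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm) and hasattr(module, "backup_momentum"):
            module.momentum = module.backup_momentum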

However, note that in PyTorch Lightning versions below 1.0.7, setting automatic_optimization to False leads to logging bugs. If you see a NaN loss in the progress bar while its actual value is not NaN, upgrading PyTorch Lightning to 1.0.7 and adding self.trainer.train_loop.running_loss.append(loss) to training_step() should solve the problem.


stale bot commented Aug 28, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Aug 28, 2021
stale bot closed this as completed Sep 4, 2021