
[Bug] Error when backward with retain_graph=True #1046

Closed
maximillian91 opened this issue Nov 26, 2019 · 8 comments
Labels
bug:confirmed Something isn't working

Comments

@maximillian91

maximillian91 commented Nov 26, 2019

❓ Questions and Help

Any reason for clearing the ctx.backward_cache explicitly in every backward() method?

This is causing errors when calling loss.backward(retain_graph=True) twice:
TypeError: 'NoneType' object is not iterable

Here's an example:
typeerror-nonetype-object-is-not-iterable-in-loss-backward-in-pytorch
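
For reference, the failure pattern can be reproduced with a plain torch.autograd.Function that stashes state on ctx and clears it inside backward(), roughly the way the DGL kernels do. This is only a sketch to illustrate the pattern, not DGL's actual code:

import torch

class CachingMul(torch.autograd.Function):
    """Toy op that mimics the ctx.backward_cache pattern."""

    @staticmethod
    def forward(ctx, a, b):
        ctx.backward_cache = (a, b)  # stored outside save_for_backward
        return a * b

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.backward_cache    # on the 2nd backward this is None
        ctx.backward_cache = None    # explicit clearing
        return grad_out * b, grad_out * a

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
loss = CachingMul.apply(a, b).sum()
loss.backward(retain_graph=True)  # works
loss.backward()                   # TypeError about NoneType not being iterable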

@VoVAllen
Collaborator

Hi,

Why did you set retain_graph to True? Do you need second-order derivatives?

@maximillian91
Author

maximillian91 commented Nov 27, 2019

Hypothetically yes, but that is not how I ran into the problem. For me it was just a quick fix after mistakenly constructing the target with gradients required, so the call to backward() in the 2nd epoch raised the same error as in typeerror-nonetype-object-is-not-iterable-in-loss-backward-in-pytorch. In that question I think the DQN basically refers twice to the same tensor instances and thereby hits None in ctx.backward_cache on the second call. I'm not an expert on DQNs, so I might be wrong.

Anyway, I don't understand why the backward cache needs to be cleared explicitly with ctx.backward_cache = None. Could you clarify?

@VoVAllen
Collaborator

VoVAllen commented Nov 28, 2019

We found that if we didn't do this, the cache would not be properly cleared by PyTorch, which resulted in a memory leak.
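
Roughly speaking, state stashed directly on ctx is not managed by autograd, so it stays alive unless we drop it ourselves, while tensors registered via ctx.save_for_backward are released together with the graph. Here is a sketch of the second pattern, which survives retain_graph=True without any manual clearing (our cache also holds non-tensor graph state, so this is not a drop-in replacement):

import torch

class SavedMul(torch.autograd.Function):
    """Toy op whose cache is owned by autograd via save_for_backward."""

    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)  # freed together with the graph
        return a * b

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors     # still available on a retained graph
        return grad_out * b, grad_out * a

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
loss = SavedMul.apply(a, b).sum()
loss.backward(retain_graph=True)
loss.backward()                      # fine, nothing was cleared by hand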

@jermainewang
Member

@BarclayII is re-confirming the memory leak issue. The latest PyTorch may have already resolved it. Either way, this is likely a bug on our side; we never tested retain_graph=True.

@jermainewang jermainewang changed the title Double backward pass cause error in DGL. Why clearing the backward_cache in each backward() method? [Bug] Error when backward with retain_graph=True Nov 28, 2019
@jermainewang jermainewang added the bug:confirmed Something isn't working label Nov 28, 2019
@BarclayII
Collaborator

@maximillian91 Could you give a minimal example of retain_graph=True failing? The following seems to work for me on the master branch.

import dgl
import numpy as np
import scipy.sparse as ssp
from dgl.nn.pytorch import SAGEConv
import torch
import torch.nn as nn

# Random sparse graph whose node features go through a differentiable layer.
g = dgl.DGLGraph(ssp.random(20, 20, 0.2))
x = torch.randn(20, 10)
m = nn.Linear(10, 20)
g.ndata['x'] = x
g.ndata['w'] = m(x)

# Two reductions over the same differentiable features, then two backward
# passes through the shared part of the autograd graph.
g.update_all(dgl.function.copy_u('w', 'm'), dgl.function.sum('m', 'y'))
g.update_all(dgl.function.copy_u('w', 'm'), dgl.function.max('m', 'z'))
loss = g.ndata['y'].sum()
loss2 = g.ndata['z'].sum()
loss.backward(retain_graph=True)
loss2.backward()

As for second-order derivatives, I'm not sure if or when we would support them, since they need another kernel to compute such a derivative, and so far I'm not aware of any model that needs this functionality.

As for the memory leak, please see #1060, although I need to confirm a failing example of retain_graph=True first.

@maximillian91
Author

maximillian91 commented Dec 2, 2019

My "minimal" failing example became this based on my implementation of the Deep Tensor Neural Network K. Schütt 2017, where the error can be produced (and thereby also resolved) under 2 circumstances:

  1. The target requires gradients through the PolarGaussianExpansionLayer, so the second backward fails once the cache has been cleared (even with retain_graph=True).

  2. The forward pass pred = net(g) is not called again prior to the second backward pass.

Here are the last 20 lines of the failing code; the rest is in the .zip.

net = DTNN(
    num_node_feats=1,
    num_edge_feats=num_edge_feats,
    num_latent_feats=3,
    num_out=1,
    stat_radi=(mu_radi_rand, sigma2_radi_rand),
    stat_azim=(mu_azim_rand, sigma2_azim_rand)
)

pred = net(g)
loss = F.mse_loss(pred, target)

loss.backward(retain_graph=True)

# BUG! The 2nd call of backward causes an error if the forward pass of net is
# not called again, even though retain_graph=True. The same problem occurs
# when the loss is not computed again.
# pred = net(g)
loss = F.mse_loss(pred, target)

# BUG!
loss.backward()

with the following error message:

python debug_retain_graph.py
DGL Version: 0.4.1
PyTorch Version: 1.3.1
Traceback (most recent call last):
  File "debug_retain_graph.py", line 240, in <module>
    loss.backward()
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/dgl/backend/pytorch/tensor.py", line 396, in backward
    = ctx.backward_cache
TypeError: 'NoneType' object is not iterable

debug_retain_graph.py.zip

I know that my example is basically wrong and sloppy in the sense that I do not need to backprop gradients through the target, but I would still expect retain_graph=True to handle it, and the issue does not seem to be solved simply by directing the gradients correctly. I hope this helps.
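
For completeness, either of the two resolutions above looks roughly like this in code (same names as in the snippet, so this is only a sketch):

# (a) Construct the target so it does not require gradients, e.g. detach it,
#     so nothing has to flow back through the PolarGaussianExpansionLayer.
target = target.detach()

pred = net(g)
loss = F.mse_loss(pred, target)
loss.backward(retain_graph=True)

# (b) Alternatively, re-run the forward pass so the DGL backward cache is
#     repopulated before the second backward.
pred = net(g)
loss = F.mse_loss(pred, target)
loss.backward()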

@BarclayII
Collaborator

@maximillian91 Your code ran fine with the latest master branch. Currently we only have nightly builds for Linux. Could you install from source and try again? Thanks.

@maximillian91
Author

@BarclayII I did that now, and the issue is solved when building from source from the latest master branch on macOS. It seems backend.pytorch.tensor.CopyReduce has been rewritten since v0.4.1 (which I was using) so that it no longer clears the backward cache explicitly with ctx.backward_cache = None. Thanks!
