
[Bug] Error when backward with retain_graph=True #1046

Closed
maximillian91 opened this issue Nov 26, 2019 · 8 comments
Labels
bug:confirmed Something isn't working

Comments

@maximillian91

maximillian91 commented Nov 26, 2019

❓ Questions and Help

Any reason for clearing the ctx.backward_cache explicitly in every backward() method?

This is causing errors when calling loss.backward(retain_graph=True) twice:
TypeError: 'NoneType' object is not iterable

Here's an example:
typeerror-nonetype-object-is-not-iterable-in-loss-backward-in-pytorch
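
For reference, the failure pattern can be reproduced with a plain torch.autograd.Function that stashes state on ctx and clears it inside backward(), roughly the way the DGL kernels do. This is only a sketch to illustrate the pattern, not DGL's actual code:

import torch

class CachingMul(torch.autograd.Function):
    """Toy op that mimics the ctx.backward_cache pattern."""

    @staticmethod
    def forward(ctx, a, b):
        ctx.backward_cache = (a, b)  # stored outside save_for_backward
        return a * b

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.backward_cache    # on the 2nd backward this is None
        ctx.backward_cache = None    # explicit clearing
        return grad_out * b, grad_out * a

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
loss = CachingMul.apply(a, b).sum()
loss.backward(retain_graph=True)  # works
loss.backward()                   # TypeError about NoneType not being iterable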

@VoVAllen
Collaborator

Hi,

Why did you set retain_graph to True? Do you need second-order derivatives?

@maximillian91
Author

maximillian91 commented Nov 27, 2019

Hypothetically yes, but that is not how I ran into the problem. For me it was just a quick fix after mistakenly constructing the target with gradients required, so the call to backward() in the 2nd epoch raised the same error as in typeerror-nonetype-object-is-not-iterable-in-loss-backward-in-pytorch. In that question I think the DQN basically refers twice to the same tensor instances and thereby hits None in ctx.backward_cache on the second call. I'm not an expert on DQNs, so I might be wrong.

Anyway, I don't understand why the backward cache needs to be cleared explicitly with ctx.backward_cache = None. Could you clarify?

@VoVAllen
Collaborator

VoVAllen commented Nov 28, 2019

We found that if we didn't do this, the cache would not be properly cleared by PyTorch, which resulted in a memory leak.
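
Roughly speaking, state stashed directly on ctx is not managed by autograd, so it stays alive unless we drop it ourselves, while tensors registered via ctx.save_for_backward are released together with the graph. Here is a sketch of the second pattern, which survives retain_graph=True without any manual clearing (our cache also holds non-tensor graph state, so this is not a drop-in replacement):

import torch

class SavedMul(torch.autograd.Function):
    """Toy op whose cache is owned by autograd via save_for_backward."""

    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)  # freed together with the graph
        return a * b

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors     # still available on a retained graph
        return grad_out * b, grad_out * a

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
loss = SavedMul.apply(a, b).sum()
loss.backward(retain_graph=True)
loss.backward()                      # fine, nothing was cleared by hand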

@jermainewang
Member

@BarclayII is re-confirming the memory leak issue. The latest PyTorch may have already resolved it. Either way, this is likely a bug on our side; we never tested retain_graph=True.

@jermainewang jermainewang changed the title Double backward pass cause error in DGL. Why clearing the backward_cache in each backward() method? [Bug] Error when backward with retain_graph=True Nov 28, 2019
@jermainewang jermainewang added the bug:confirmed Something isn't working label Nov 28, 2019
@BarclayII
Collaborator

@maximillian91 Could you give a minimal example of retain_graph=True failing? The following seems to work for me on the master branch.

import dgl
import numpy as np
import scipy.sparse as ssp
from dgl.nn.pytorch import SAGEConv
import torch
import torch.nn as nn

# Random sparse graph whose node features go through a differentiable layer.
g = dgl.DGLGraph(ssp.random(20, 20, 0.2))
x = torch.randn(20, 10)
m = nn.Linear(10, 20)
g.ndata['x'] = x
g.ndata['w'] = m(x)

# Two reductions over the same differentiable features, then two backward
# passes through the shared part of the autograd graph.
g.update_all(dgl.function.copy_u('w', 'm'), dgl.function.sum('m', 'y'))
g.update_all(dgl.function.copy_u('w', 'm'), dgl.function.max('m', 'z'))
loss = g.ndata['y'].sum()
loss2 = g.ndata['z'].sum()
loss.backward(retain_graph=True)
loss2.backward()

As for second-order derivatives, I'm not sure if or when we would support them, since they need another kernel to compute such a derivative, and so far I'm not aware of any model that needs this functionality.

As for the memory leak, please see #1060, although I need to confirm a failing example of retain_graph=True first.

@maximillian91
Author

maximillian91 commented Dec 2, 2019

My "minimal" failing example became this based on my implementation of the Deep Tensor Neural Network K. Schütt 2017, where the error can be produced (and thereby also resolved) under 2 circumstances:

  1. The target requires gradients through the PolarGaussianExpansionLayer, so the second backward fails once the cache has been cleared (even with retain_graph=True).

  2. The forward pass pred = net(g) is not called again prior to the second backward pass.

Here are the last 20 lines of the failing code; the rest is in the .zip.

net = DTNN(
    num_node_feats=1,
    num_edge_feats=num_edge_feats,
    num_latent_feats=3,
    num_out=1,
    stat_radi=(mu_radi_rand, sigma2_radi_rand),
    stat_azim=(mu_azim_rand, sigma2_azim_rand)
)

pred = net(g)
loss = F.mse_loss(pred, target)

loss.backward(retain_graph=True)

# BUG! The 2nd call of backward causes an error if the forward pass of net is
# not called again, even though retain_graph=True. The same problem occurs
# when the loss is not computed again.
# pred = net(g)
loss = F.mse_loss(pred, target)

# BUG!
loss.backward()

with the following error message:

python debug_retain_graph.py
DGL Version: 0.4.1
PyTorch Version: 1.3.1
Traceback (most recent call last):
  File "debug_retain_graph.py", line 240, in <module>
    loss.backward()
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/dgl/backend/pytorch/tensor.py", line 396, in backward
    = ctx.backward_cache
TypeError: 'NoneType' object is not iterable

debug_retain_graph.py.zip

I know that my example is basically wrong and sloppy in the sense that I do not need to backprop gradients through the target, but I would still expect retain_graph=True to handle it, and the issue does not seem to be solved simply by directing the gradients correctly. I hope this helps.
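
For completeness, either of the two resolutions above looks roughly like this in code (same names as in the snippet, so this is only a sketch):

# (a) Construct the target so it does not require gradients, e.g. detach it,
#     so nothing has to flow back through the PolarGaussianExpansionLayer.
target = target.detach()

pred = net(g)
loss = F.mse_loss(pred, target)
loss.backward(retain_graph=True)

# (b) Alternatively, re-run the forward pass so the DGL backward cache is
#     repopulated before the second backward.
pred = net(g)
loss = F.mse_loss(pred, target)
loss.backward()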

@BarclayII
Collaborator

@maximillian91 Your code ran fine with the latest master branch. Currently we only have nightly builds for Linux. Could you install from source and try again? Thanks.

@maximillian91
Author

@BarclayII I did that now, and the issue is solved when building from source from the latest master branch on macOS. It seems backend.pytorch.tensor.CopyReduce has been rewritten since v0.4.1 (which I was using) so that it no longer clears the backward cache explicitly with ctx.backward_cache = None. Thanks!
