attach_grad of intermediate variables causes the gradient graph to be lost #11865

Open
szha opened this issue Jul 24, 2018 · 7 comments

Comments

@szha
Member

commented Jul 24, 2018

import mxnet as mx
from mxnet import gluon

net = gluon.model_zoo.vision.mobilenet0_25(pretrained=True)
loss = gluon.loss.SoftmaxCELoss()
with mx.autograd.record():
    output = net(mx.random.uniform(shape=(5,3,224,224)))
    output.attach_grad()
    l = loss(output, mx.nd.arange(5))
l.backward()
print(net.features[0].weight.grad()) # shouldn't be zeros
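
For comparison, here is a sketch of the same repro with the intermediate attach_grad removed; assuming the report above is accurate, the weight gradient is non-zero in this variant because the recorded graph is not cut:

import mxnet as mx
from mxnet import gluon

net = gluon.model_zoo.vision.mobilenet0_25(pretrained=True)
loss = gluon.loss.SoftmaxCELoss()
with mx.autograd.record():
    output = net(mx.random.uniform(shape=(5, 3, 224, 224)))
    # no output.attach_grad() here, so the graph from the weights to the loss stays intact
    l = loss(output, mx.nd.arange(5))
l.backward()
print(net.features[0].weight.grad())  # expected: non-zero gradients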

@szha szha added Bug Autograd labels Jul 24, 2018

@anirudhacharya

Contributor

commented May 1, 2019

To make sure I understand this right, in a scenario such as the following

import mxnet as mx

x = mx.nd.array([0, 7], ctx=mx.cpu())
x.attach_grad()
with mx.autograd.record():
    y = ((5 * (x**2)) + (13 * x) + 10)
    y.attach_grad()
    z = 2 * y
z.backward()
print(x.grad)

what you want is for x.grad to come out as non-zero values even though the intermediate variable y has been marked with attach_grad.

In the above example, would you also want the result of y.grad to be retained? That would be a bit different, and more of a feature request than a bug. It would probably involve storing the intermediate gradients of non-leaf variables in some sort of buffer, with a hook or function call that lets the user opt in, because storing them every time by default might waste memory.

Have I understood this right? Which of the two situations above is your issue pointing to?
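
For reference, a hand-worked check of what the non-zero gradients in the snippet above would be, assuming the graph were kept intact (z = 2*(5*x**2 + 13*x + 10)):

# Analytic check for the snippet above (not what MXNet currently returns when
# y.attach_grad() cuts the graph): z = 2*(5*x**2 + 13*x + 10).
x = [0.0, 7.0]
dz_dx = [2 * (10 * xi + 13) for xi in x]   # chain rule through y: [26.0, 166.0]
dz_dy = [2.0 for _ in x]                   # gradient at the intermediate y: [2.0, 2.0]
print(dz_dx, dz_dy)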

@szha

Member Author

commented May 1, 2019

would you also want the result of y.grad to be retained

Yes, that's what I'd like to have. In the current implementation, instead of marking y's gradient as one of the outputs, the code above discards the previous graph in which y resides.

@anirudhacharya

Contributor

commented May 1, 2019

It would probably involve storing the intermediate gradients of non-leaf variables in some sort of buffer, with a hook or function call that lets the user opt in, because storing them every time by default might waste memory.

@szha would you agree with the above, i.e. that these intermediate gradients should not be stored by default, and that we should instead provide a function call, something like persist_grad, which the user can call on a variable (e.g. y.persist_grad()) to enable storing its intermediate gradients?
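
A sketch of what usage of such an opt-in call might look like; persist_grad is purely hypothetical here and does not exist in MXNet:

import mxnet as mx

x = mx.nd.array([0, 7])
x.attach_grad()
with mx.autograd.record():
    y = 5 * x**2 + 13 * x + 10
    y.persist_grad()   # hypothetical: keep y in the graph but also store its gradient
    z = 2 * y
z.backward()
print(x.grad)  # leaf gradient, still computed through y
print(y.grad)  # intermediate gradient, retained only because of the opt-in call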

@szha

Member Author

commented May 1, 2019

Yes, there should be an explicit mechanism to mark new outputs. Whether it's reusing attach_grad or a new method is up for debate.

@anirudhacharya

Contributor

commented Jul 23, 2019

Here is another use case where using attach_grad() with intermediate variables gives erroneous results.

With the following example I would expect x.grad to be [10, 24, 42, 64], but using head gradients and the chain rule as per the autograd documentation gives me [5, 12, 21, 32]:

from mxnet import ndarray as nd
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
v = u.detach()
v.attach_grad()
z = v * x
ag.set_recording(False)
z.backward()
u.backward(v.grad)
print(x.grad, y.grad)
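
One reading of the discrepancy, assuming the default grad_req='write' makes each backward() call overwrite x.grad rather than accumulate into it: z = (x*y)*x depends on x both directly and through u, so the full gradient is the sum of the two contributions.

from mxnet import ndarray as nd
from mxnet import autograd as ag

x = nd.array([1, 2, 3, 4]); x.attach_grad()
y = nd.array([5, 6, 7, 8]); y.attach_grad()

ag.set_recording(True)
u = x * y
v = u.detach(); v.attach_grad()
z = v * x
ag.set_recording(False)

z.backward()
direct = x.grad.copy()     # dz/dx with v held constant = v = [5, 12, 21, 32]
u.backward(v.grad)         # overwrites x.grad with y * v.grad = [5, 12, 21, 32]
through_u = x.grad.copy()
print(direct + through_u)  # [10, 24, 42, 64] = 2*x*y, the full dz/dx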

But when I do it without head gradients, as follows, I get the correct gradients:

from mxnet import ndarray as nd
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
z = u * x
ag.set_recording(False)
z.backward()
print(x.grad, y.grad)

@anirudhacharya

Contributor

commented Jul 23, 2019

And as per the autograd documentation here: https://www.d2l.ai/chapter_crashcourse/autograd.html#attach-gradients-to-internal-variables

it would seem we are expecting the computation graph to be thrown away when we execute x.attach_grad(), because we are implicitly running detach() every time attach_grad() is called.

We need to get a clear understanding of what the expected behavior is here.
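
A minimal sketch of the implicit-detach reading described above; assuming attach_grad() on an intermediate behaves like a detach(), gradients above the cut come out as zeros while gradients below it are still computed:

import mxnet as mx

x = mx.nd.array([1.0, 2.0])
x.attach_grad()
with mx.autograd.record():
    y = 3 * x          # intermediate result
    y.attach_grad()    # per the current behavior, this acts like y = y.detach()
    z = y * y
z.backward()
print(x.grad)  # zeros today, because the graph above y was thrown away
print(y.grad)  # dz/dy = 2*y = [6., 12.], the part below the cut still works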

@larroy

Contributor

commented Jul 30, 2019

Unfortunately, this is how it's implemented. Why do you want to attach a gradient to the output again?

detach is not the cause, as far as I understand the code.
