In [1]:
import sys
import torch
from torch import nn
print(torch.__version__)
print(sys.version_info)

1.10.2
sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)


- compute loss
    - forward
- loss.backward() (or any objective.backward())
    - backward (compute grad)
- optimizer.step()
    - x = x - lr*x.grad
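
The three steps above can be sketched as one hand-written gradient-descent step. This is only a minimal sketch, not how `torch.optim` is implemented; the tensor `x`, the quadratic objective, and `lr = 0.1` are assumptions chosen for illustration. The update runs under `torch.no_grad()`, which is what allows an in-place update on a leaf tensor that requires grad.

```python
import torch

x = torch.tensor([1.0, -2.0], requires_grad=True)  # illustrative leaf tensor
lr = 0.1                                            # illustrative learning rate

loss = (x ** 2).sum()      # forward: compute the objective
loss.backward()            # backward: fills x.grad with d(loss)/dx = 2 * x

with torch.no_grad():      # the update itself must not be tracked by autograd
    x -= lr * x.grad       # x = x - lr * x.grad
x.grad.zero_()             # clear the gradient before the next step

print(x)
```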

## Two kinds of disallowed in-place operations

1. In-place operations are not allowed on a leaf tensor with requires_grad == True
    - all `Parameters` are leaf nodes and require grad
    - tensor.is_leaf == True
2. In-place operations are not allowed on a tensor whose value is needed during the backward pass
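
A quick way to see which tensors fall under rule 1 is to inspect `is_leaf` and `requires_grad`. The sketch below uses a throwaway `nn.Linear(2, 1)` layer (an arbitrary choice, not part of the notebook) to show that parameters are leaves that require grad, while the result of an operation on them is not a leaf.

```python
import torch
from torch import nn

layer = nn.Linear(2, 1)          # arbitrary small module, for illustration only
p = layer.weight                 # an nn.Parameter
y = p * 2                        # produced by an operation, so it has a grad_fn

print(p.is_leaf, p.requires_grad)   # True True
print(y.is_leaf, y.requires_grad)   # False True
print(y.grad_fn is not None)        # True
```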

### Leaf nodes

In [2]:
w = torch.FloatTensor(10) # w is a leaf tensor (uninitialized memory, hence the odd values below)
w.requires_grad = True    # set requires_grad to True

In [3]:
w

tensor([4.0725e-36, 1.4013e-45, 4.7817e-33, 1.4013e-45, 4.7732e-33, 1.4013e-45,
        4.7732e-33, 1.4013e-45, 4.7733e-33, 1.4013e-45], requires_grad=True)

In [4]:
w.is_leaf

True

In [5]:
# inplace operation
w.normal_()

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

In [6]:
w.data.requires_grad

False

In [9]:
w.data.normal_()

tensor([-1.6973, -0.1248, -1.1631,  0.6612,  0.2086, -0.1125, -0.5158,  0.2699,
         0.2380,  0.6100])

In [10]:
w.data

tensor([-1.6973, -0.1248, -1.1631,  0.6612,  0.2086, -0.1125, -0.5158,  0.2699,
         0.2380,  0.6100])
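
Writing through `w.data` works, but a commonly used alternative is to run the in-place operation inside `torch.no_grad()`. A minimal sketch, with the tensor shape and initial values chosen arbitrarily:

```python
import torch

w = torch.zeros(10, requires_grad=True)   # leaf tensor that requires grad

try:
    w.normal_()                            # tracked in-place op on a leaf: rejected
    rejected = False
except RuntimeError as err:
    rejected = True
    print("rejected:", err)

with torch.no_grad():                      # autograd does not record this block
    w.normal_()                            # in-place re-initialization is allowed here

print(w.is_leaf, w.requires_grad)
```

After the `no_grad` block, `w` is still a leaf that requires grad; only its values have changed.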

### Tensors needed during the backward pass (whether or not they are leaf nodes/variables/parameters)

In [11]:
x = torch.FloatTensor([[1., 2.]])
w1 = torch.FloatTensor([[2.], [1.]])
w2 = torch.FloatTensor([3.])
w1.requires_grad = True
w2.requires_grad = True

In [15]:
# x.is_leaf
# w1.is_leaf
w2.is_leaf

True

```
x
w1 -> d
      w2 -> f
```

In [17]:
d = torch.matmul(x, w1)
f = torch.matmul(d, w2)
d[:] = 0

f.backward()

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 1]], which is output 0 of torch::autograd::CopySlices, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

In [18]:
d = torch.matmul(x, w1)
d[:] = 0
f = torch.matmul(d, w2)

f.backward()

In [19]:
w2.grad

tensor([0.])

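- When f is computed, d holds a particular value, and the derivative of f with respect to w2 depends on that value of d.
- If d is changed after f has been computed, f.backward() would compute a wrong gradient for w2. To prevent this, PyTorch raises an error instead.
- The root cause is that when `f = torch.matmul(d, w2)` executes, autograd saves a reference to d for the later backward computation.
- In In [18], by contrast, d is zeroed before f is computed, so the saved value is already the modified one and backward succeeds, giving w2.grad = 0.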
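
The check behind this error can be observed through the tensor's version counter. The sketch below reads `d._version`, an internal attribute (so treat it as a debugging aid rather than a stable API), before and after the in-place write; the variable names `version_at_save`, `version_at_backward`, and `raised` are introduced here only for illustration.

```python
import torch

x = torch.tensor([[1.0, 2.0]])
w1 = torch.tensor([[2.0], [1.0]], requires_grad=True)
w2 = torch.tensor([3.0], requires_grad=True)

d = torch.matmul(x, w1)
f = torch.matmul(d, w2)          # autograd saves d here (needed for df/dw2)
version_at_save = d._version
d[:] = 0                         # the in-place write bumps d's version counter
version_at_backward = d._version
print(version_at_save, version_at_backward)

try:
    f.sum().backward()
    raised = False
except RuntimeError:
    raised = True                # saved version and current version differ
print("backward raised:", raised)
```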

## `.data` vs. `.detach()`

- detach
  - Returns a new Tensor, detached from the current graph.
  - The result will never require gradient.
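- The tensors returned by x.data and x.detach() have things in common and things that differ. What they share:
  - both share the same underlying data with x
  - both are unrelated to x's computation history
  - requires_grad = False
- The difference: an in-place change made through x.data raises no error, but the gradients computed afterwards are wrong (a silently planted bug);
    - an in-place change made through x.detach() makes backward raise an error (the safer choice for gradients).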
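
The comparison above can be sketched in one helper. The function name `zero_through` and its return values are hypothetical, introduced only for this illustration, and `_version` is an internal attribute used here just to make the difference visible: a tensor from `.detach()` shares the version counter of `out`, while one from `.data` does not.

```python
import torch

def zero_through(view_kind):
    """Zero sigmoid's output through `.data` or `.detach()`, then try backward."""
    a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    out = a.sigmoid()
    c = out.data if view_kind == "data" else out.detach()
    shares_memory = c.data_ptr() == out.data_ptr()
    c.zero_()                      # also zeroes out, since memory is shared
    version = out._version         # bumped only if c shares out's version counter
    try:
        out.sum().backward()
        grad = a.grad              # computed from the zeroed output: silently wrong
    except RuntimeError:
        grad = None                # autograd detected the modification
    return shares_memory, version, grad

for kind in ("data", "detach"):
    print(kind, zero_through(kind))
```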

In [None]:
a = torch.tensor([1, 2, 3.], requires_grad=True)

out = a.sigmoid()

c = out.data
# c = out.detach()

print(f'a.requires_grad: {a.requires_grad}, out.requires_grad: {out.requires_grad}, c.requires_grad: {c.requires_grad}')

print(out)
print(c)
c.zero_()

print(out)
print(c)

out.sum().backward()
print(a.grad, a.sigmoid()*(1-a.sigmoid()))

In [20]:
a = torch.tensor([1, 2, 3.], requires_grad=True)
out = a.sigmoid()

In [21]:
out

tensor([0.7311, 0.8808, 0.9526], grad_fn=<SigmoidBackward0>)

In [22]:
c = out.data

In [23]:
print(f'a.requires_grad:{a.requires_grad}, out.requires_grad: {out.requires_grad}, c.requires_grad: {c.requires_grad}')

a.requires_grad:True, out.requires_grad: True, c.requires_grad: False


In [24]:
out

tensor([0.7311, 0.8808, 0.9526], grad_fn=<SigmoidBackward0>)

In [25]:
c

tensor([0.7311, 0.8808, 0.9526])

In [26]:
c.zero_()

tensor([0., 0., 0.])

In [27]:
out

tensor([0., 0., 0.], grad_fn=<SigmoidBackward0>)

In [28]:
out.requires_grad

True

In [29]:
c.requires_grad

False

In [31]:
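# An in-place modification error would be expected here, but .data bypasses that check.
# (The RuntimeError shown below is a different one: backward had apparently already been run on this graph.)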
out.sum().backward()

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

In [33]:
a = torch.tensor([1, 2, 3.], requires_grad=True)

out = a.sigmoid()

In [35]:
out.sum().backward()

In [36]:
print(a.grad, a.sigmoid()*(1-a.sigmoid()))

tensor([0.1966, 0.1050, 0.0452]) tensor([0.1966, 0.1050, 0.0452], grad_fn=<MulBackward0>)


In [32]:
a = torch.tensor([1, 2, 3.], requires_grad=True)

out = a.sigmoid()

c = out.data
# c = out.detach()

print(f'a.requires_grad: {a.requires_grad}, out.requires_grad: {out.requires_grad}, c.requires_grad: {c.requires_grad}')

print(out)
print(c)
c.zero_()

print(out)
print(c)

out.sum().backward()
print(a.grad, a.sigmoid()*(1-a.sigmoid()))

a.requires_grad: True, out.requires_grad: True, c.requires_grad: False
tensor([0.7311, 0.8808, 0.9526], grad_fn=<SigmoidBackward0>)
tensor([0.7311, 0.8808, 0.9526])
tensor([0., 0., 0.], grad_fn=<SigmoidBackward0>)
tensor([0., 0., 0.])
tensor([0., 0., 0.]) tensor([0.1966, 0.1050, 0.0452], grad_fn=<MulBackward0>)


In [37]:
a = torch.tensor([1, 2, 3.], requires_grad=True)

out = a.sigmoid()

# c = out.data
c = out.detach()

print(f'a.requires_grad: {a.requires_grad}, out.requires_grad: {out.requires_grad}, c.requires_grad: {c.requires_grad}')

print(out)
print(c)
c.zero_()

print(out)
print(c)

out.sum().backward()
print(a.grad, a.sigmoid()*(1-a.sigmoid()))

a.requires_grad: True, out.requires_grad: True, c.requires_grad: False
tensor([0.7311, 0.8808, 0.9526], grad_fn=<SigmoidBackward0>)
tensor([0.7311, 0.8808, 0.9526])
tensor([0., 0., 0.], grad_fn=<SigmoidBackward0>)
tensor([0., 0., 0.])


RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3]], which is output 0 of SigmoidBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
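
If the goal is a scratch copy that can be modified freely, one option is to clone after detaching, so the copy no longer shares memory with `out`. A minimal sketch reusing the same `a` and `out` (the name `expected` is introduced here for the comparison):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a.sigmoid()

c = out.detach().clone()   # independent copy: own memory, no autograd history
c.zero_()                  # does not touch out

out.sum().backward()
expected = (a.sigmoid() * (1 - a.sigmoid())).detach()
print(a.grad, expected)
```

The cost is one extra copy of the tensor, which is usually acceptable when correctness of the gradients matters.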

## embedding

In [None]:
n, d, m = 3, 5, 7
# embedding = nn.Embedding(n, d, max_norm=True)
embedding = nn.Embedding(n, d, max_norm=1)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])
a = embedding.weight.clone() @ W.t()  # weight must be cloned for this to be differentiable
b = embedding(idx) @ W.t()  # modifies weight in-place
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()

e
f(w) -> a
                out => loss
e
g(w) -> b


In [38]:
n, d, m = 3, 5, 7
# embedding = nn.Embedding(n, d, max_norm=True)
embedding = nn.Embedding(n, d, max_norm=1)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])
a = embedding.weight @ W.t()  # weight is NOT cloned here, so backward fails below
b = embedding(idx) @ W.t()  # modifies weight in-place
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3, 5]] is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

In [39]:
n, d, m = 3, 5, 7
# embedding = nn.Embedding(n, d, max_norm=True)
embedding = nn.Embedding(n, d, max_norm=1)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])
b = embedding(idx) @ W.t()  # modifies weight in-place
a = embedding.weight @ W.t()  # computed after the in-place renormalization, so no clone is needed
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()

Because W requires gradients, the line computing a must save embedding.weight for the backward pass. The line computing b then calls embedding(idx), which renormalizes the looked-up rows of embedding.weight in place so that their norm does not exceed max_norm. Without the clone, this changes the tensor that was saved for computing W's gradient, and backward fails. Cloning embedding.weight in the line computing a preserves the values from before the renormalization. Computing b first, as in In [39], also works, because the weight is saved only after it has been renormalized.
If you don't use embedding.weight outside of the normal forward pass, you don't need to worry about any of this.
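
The in-place renormalization itself can be observed directly. In this sketch the weights are filled with ones (an assumption made only so that every row's norm, sqrt(5), is guaranteed to exceed `max_norm`), and the names `before` and `after` are introduced for the comparison.

```python
import torch
from torch import nn

embedding = nn.Embedding(3, 5, max_norm=1.0)
with torch.no_grad():
    embedding.weight.fill_(1.0)           # every row now has norm sqrt(5) > max_norm

before = embedding.weight.detach().clone()
idx = torch.tensor([1, 2])
_ = embedding(idx)                        # the lookup renormalizes rows 1 and 2 in place
after = embedding.weight.detach()

print(before.norm(dim=1))                 # norms before the lookup
print(after.norm(dim=1))                  # rows 1 and 2 shrunk, row 0 untouched
```

Only the rows that were actually looked up are rescaled, which is why the forward pass alone is enough to invalidate a previously saved `embedding.weight`.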