In [1]:
import torch
torch.manual_seed(42)

<torch._C.Generator at 0x7ff9ac026c70>

## Leaf node?

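- If a Tensor/Parameter/weight is a leaf node:
    - its parts/slices (`W[2,3]`, `W[:2, :3]`) are results of an indexing op, so they cannot be leaf nodes themselves
    - changing a tensor's `requires_grad` flag requires that tensor to be a leaf node
    - so, for a tensor with `requires_grad=True`, you cannot set `requires_grad` to `False` for only part of it
- How, then, can we compute gradients for only part of a tensor?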

In [2]:
W = torch.randn(4, 4, requires_grad=True)
W

tensor([[ 1.9269,  1.4873,  0.9007, -2.1055],
        [ 0.6784, -1.2345, -0.0431, -1.6047],
        [-0.7521,  1.6487, -0.3925, -1.4036],
        [-0.7279, -0.5594, -0.7688,  0.7624]], requires_grad=True)

In [3]:
W.is_leaf

True

In [4]:
W[2, 3].is_leaf

False

In [5]:
W[:2, :3].is_leaf

False

In [6]:
W[:2, :3].requires_grad_(False)

RuntimeError: you can only change requires_grad flags of leaf variables. If you want to use a computed variable in a subgraph that doesn't require differentiation use var_no_grad = var.detach().
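The error message points at `detach()`. As a minimal sketch (the names `block` and `frozen` are only illustrative), the snippet below reproduces the failure and shows what `detach()` returns. The detached slice is cut off from the graph entirely, so it does not solve the "gradient for part of `W`" problem by itself.

```python
import torch

W = torch.randn(4, 4, requires_grad=True)
block = W[:2, :3]            # produced by indexing, so it is not a leaf

try:
    block.requires_grad_(False)
    raised = False
except RuntimeError:
    raised = True            # flag can only be changed on leaf tensors

frozen = block.detach()      # shares storage with W, but has no grad history
print(raised, frozen.requires_grad, frozen.is_leaf)
```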

## Mask

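- How, then, can we compute gradients for only part of a tensor?
    - Take the Hadamard (element-wise) product with a mask matrix
    - `W*M` is the element-wise product, not matrix multiplication (`W@M` is matrix multiplication)
- As verified below, masking the weight in the forward pass gives the same gradient as running backward first and then masking the grad
    - this holds exactly here because the loss is `y.sum()`; in general the two forward passes produce different `y`, so the gradients agree only when $\frac{\partial L}{\partial y}$ is the same in both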

In [7]:
W

tensor([[ 1.9269,  1.4873,  0.9007, -2.1055],
        [ 0.6784, -1.2345, -0.0431, -1.6047],
        [-0.7521,  1.6487, -0.3925, -1.4036],
        [-0.7279, -0.5594, -0.7688,  0.7624]], requires_grad=True)

In [8]:
M = torch.bernoulli(torch.full(W.shape, 0.5))
M

tensor([[0., 0., 1., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.],
        [0., 1., 0., 1.]])

### Masking the weight in the forward pass

$$
\begin{split}
&y=Wx, L=f(y)\\
&\frac{\partial L}{\partial W}=\frac{\partial L}{\partial y}\frac{\partial y}{\partial W}=\frac{\partial L}{\partial y}x^T\\
\end{split}
$$
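
Applying the same chain rule to the masked forward pass shows why masked entries receive zero gradient. This is a sketch in the notation above, writing $\odot$ for the Hadamard product and treating $M$ as a constant:

$$
\begin{split}
&y=(W\odot M)x, L=f(y)\\
&\frac{\partial L}{\partial W_{ij}}=\frac{\partial L}{\partial y_i}M_{ij}x_j\\
&\frac{\partial L}{\partial W}=\left(\frac{\partial L}{\partial y}x^T\right)\odot M
\end{split}
$$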

In [9]:
x = torch.randn(4)
x

tensor([ 1.3221,  0.8172, -0.7658, -0.7506])

In [10]:
W_m = W*M
W_m

tensor([[ 0.0000,  0.0000,  0.9007, -0.0000],
        [ 0.6784, -1.2345, -0.0431, -0.0000],
        [-0.7521,  1.6487, -0.3925, -1.4036],
        [-0.0000, -0.5594, -0.0000,  0.7624]], grad_fn=<MulBackward0>)

In [11]:
y = W_m @ x
y.sum().backward()
W.grad

tensor([[ 0.0000,  0.0000, -0.7658, -0.0000],
        [ 1.3221,  0.8172, -0.7658, -0.0000],
        [ 1.3221,  0.8172, -0.7658, -0.7506],
        [ 0.0000,  0.8172, -0.0000, -0.7506]])

### Backward first, then masking the grad

In [12]:
W.grad.zero_()

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

In [13]:
y = W @ x
y.sum().backward()

In [15]:
W.grad * M

tensor([[ 0.0000,  0.0000, -0.7658, -0.0000],
        [ 1.3221,  0.8172, -0.7658, -0.0000],
        [ 1.3221,  0.8172, -0.7658, -0.7506],
        [ 0.0000,  0.8172, -0.0000, -0.7506]])
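
The two outputs above match by eye. As a standalone sketch, the comparison can also be done programmatically, and the gradient mask can be applied automatically with `Tensor.register_hook`. The hook variant is an addition for illustration, not something the cells above use, and the seed and variable names are arbitrary.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 4, requires_grad=True)
M = torch.bernoulli(torch.full(W.shape, 0.5))   # 0/1 mask, treated as a constant
x = torch.randn(4)

# (a) mask the weight in the forward pass
((W * M) @ x).sum().backward()
grad_forward_mask = W.grad.clone()

# (b) plain forward pass, mask the gradient afterwards
W.grad = None
(W @ x).sum().backward()
grad_masked_after = W.grad * M

# (c) plain forward pass, mask applied by a hook before the grad is stored
W.grad = None
handle = W.register_hook(lambda g: g * M)
(W @ x).sum().backward()
grad_hook = W.grad.clone()
handle.remove()

print(torch.allclose(grad_forward_mask, grad_masked_after))
print(torch.allclose(grad_forward_mask, grad_hook))
```

All three agree here because the loss is `y.sum()`, whose gradient with respect to `y` does not depend on `y`. With a loss that does depend on `y`, (a) changes the forward output while (b) and (c) do not, so the gradients can differ.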