- import `autograd`

In [1]:
from mxnet import autograd, nd
x = nd.arange(4).reshape((4,1))
print(x)


[[0.]
 [1.]
 [2.]
 [3.]]
<NDArray 4x1 @cpu(0)>


Attach a gradient to `x`
- It allocates memory to store the gradient, which has the same shape as `x`
- It also tells the system to compute the gradient with respect to `x`

In [2]:
x.attach_grad()
x.grad


[[0.]
 [0.]
 [0.]
 [0.]]
<NDArray 4x1 @cpu(0)>

For example, we compute
$$y = 2\mathbf{x}^{\top}\mathbf{x}$$


In [3]:
y = 2*nd.dot(x.T, x)
y


[[28.]]
<NDArray 1x1 @cpu(0)>

In [4]:
# calling backward without recording the computation raises an error
y.backward()

MXNetError: [10:22:54] C:\Jenkins\workspace\mxnet-tag\mxnet\src\imperative\imperative.cc:295: Check failed: !AGInfo::IsNone(*i): Cannot differentiate node because it is not in a computational graph. You need to set is_recording to true or use autograd.record() to save computational graphs for backward. If you want to differentiate the same graph twice, you need to pass retain_graph=True to backward.

**We must place the code inside a `with autograd.record()` block. MXNet will then build the corresponding computation graph.**

In [6]:
with autograd.record():
    y = 2*nd.dot(x.T, x)
# `y.backward()` is equivalent to `y.sum().backward()`
y.backward()
x.grad


[[ 0.]
 [ 4.]
 [ 8.]
 [12.]]
<NDArray 4x1 @cpu(0)>

- verify: for $y=2\mathbf{x}^\top\mathbf{x}$, the gradient is $\frac{\partial y}{\partial\mathbf{x}}=4\mathbf{x}$, so `x.grad - 4*x` should be all zeros

In [7]:
x.grad-4*x


[[0.]
 [0.]
 [0.]
 [0.]]
<NDArray 4x1 @cpu(0)>
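
The same identity can also be checked without autograd. The sketch below is plain Python (no MXNet) and compares a central-difference estimate against $4\mathbf{x}$; the helper names `y_fn` and `numerical_grad` are illustrative and not part of any library.

```python
# Plain-Python check that the gradient of y = 2 * x^T x is 4 * x.

def y_fn(x):
    """y = 2 * sum(x_i^2), the scalar computed in the cells above."""
    return 2.0 * sum(v * v for v in x)


def numerical_grad(fn, x, eps=1e-6):
    """Central-difference estimate of the gradient of fn at x (hypothetical helper)."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += eps
        minus[i] -= eps
        grad.append((fn(plus) - fn(minus)) / (2 * eps))
    return grad


x_vals = [0.0, 1.0, 2.0, 3.0]          # same values as nd.arange(4)
approx = numerical_grad(y_fn, x_vals)  # numerical estimate of dy/dx
analytic = [4.0 * v for v in x_vals]   # the claimed gradient 4 * x
max_err = max(abs(a - b) for a, b in zip(approx, analytic))
print(max_err < 1e-4)
```

Because $y$ is quadratic, the central difference is exact up to floating-point rounding, so the largest deviation should be tiny.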

- The `record` scope also switches autograd into training mode, on the assumption that gradients are only needed during training.

In [8]:
print(autograd.is_training())
with autograd.record():
    print(autograd.is_training())

False
True


- autograd also works with Python functions and control flow:

In [14]:
def f(a):
    b = 2*a
    while b.norm().asscalar() < 1000:
        b = 2*b
    if b.sum().asscalar() > 0:
        c = b
    else:
        c = 100 * b
    return c

a = nd.random.normal(shape=1)
print("a:", a)
a.attach_grad()
with autograd.record():
    d = f(a)
d.backward()
a.grad

a: 
[0.29956347]
<NDArray 1 @cpu(0)>



[4096.]
<NDArray 1 @cpu(0)>

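- verify: `f` is piecewise linear in its input `a`, so for a given `a` there exists a constant $g$ such that $f(a)=ga$, and therefore $\frac{\partial f}{\partial a}=g$. Dividing the output by the input should reproduce `a.grad`.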

In [15]:
d/a


[4096.]
<NDArray 1 @cpu(0)>
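
To see why the slope is a power of two, here is a scalar analogue of `f` in plain Python. It is only a sketch: it assumes a single-element input, so the norm reduces to an absolute value and the sum to the value itself, and the names `f_scalar` and `slope` are made up for illustration.

```python
def f_scalar(a):
    """Scalar analogue of f above (assumes a single non-zero element)."""
    b = 2 * a
    while abs(b) < 1000:   # for one element, the norm is the absolute value
        b = 2 * b
    return b if b > 0 else 100 * b


def slope(a):
    """The constant g with f(a) = g * a on the linear piece containing a."""
    return f_scalar(a) / a


a0 = 0.29956347            # the sampled value printed above
g = slope(a0)

# Central-difference estimate of df/da at a0, for comparison with g.
eps = 1e-6
fd = (f_scalar(a0 + eps) - f_scalar(a0 - eps)) / (2 * eps)
print(g, abs(fd - g) < 1e-3)
```

For this `a0` the loop doubles `b` until the total factor is $2^{12}=4096$, which agrees with the gradient reported by autograd. A negative input takes the other branch and picks up the extra factor of 100.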

## chain rule
- We can apply the chain rule manually. For $\frac{\partial z}{\partial x}=\frac{\partial z}{\partial y}\frac{\partial y}{\partial x}$, `y.backward()` only computes $\frac{\partial y}{\partial x}$. To get $\frac{\partial z}{\partial x}$, we first compute $\frac{\partial z}{\partial y}$ and then pass it as the head gradient to `y.backward`.

In [16]:
with autograd.record():
    y = x * 2
y.attach_grad()
with autograd.record():
    z = y*x
z.backward()
# y.grad = \partial z / \partial y
y.backward(y.grad)
# x.grad = \partial z / \partial x
x.grad


[[0.]
 [2.]
 [4.]
 [6.]]
<NDArray 4x1 @cpu(0)>

In [18]:
2*x


[[0.]
 [2.]
 [4.]
 [6.]]
<NDArray 4x1 @cpu(0)>

In [24]:
z.attach_grad()
with autograd.record():
    y = x * 2
    z = y*x
z.backward()
x.grad


[[ 0.]
 [ 4.]
 [ 8.]
 [12.]]
<NDArray 4x1 @cpu(0)>
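
As a worked check of the two results above (a sketch, assuming `*` acts elementwise so each component can be treated as a scalar): with $y = 2x$ and $z = yx$, the product rule gives

$$z = 2x^2, \qquad \frac{dz}{dx} = \underbrace{\frac{\partial z}{\partial y}\frac{\partial y}{\partial x}}_{x \cdot 2} + \underbrace{\left.\frac{\partial z}{\partial x}\right|_{y}}_{y \,=\, 2x} = 4x$$

Each term on its own equals $2x$, which is consistent with the `[0, 2, 4, 6]` output of the head-gradient cell. Recording both steps in a single block yields the full derivative $4x$, i.e. `[0, 4, 8, 12]`.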