# Automatic differentiation with autograd

This notebook is based on https://gluon-crash-course.mxnet.io/autograd.html

## 1. Basic usage

In [2]:
from mxnet import nd
from mxnet import autograd

We are interested in differentiating the function <br>

\begin{equation*}
f(x)=2x^2
\end{equation*}

with respect to the parameter x.

We start by assigning an initial value to x:

\begin{equation*}
\mathbf{x} = \begin{bmatrix}
1 & 2 \\
3 & 4 \\
\end{bmatrix}
\end{equation*}

In [6]:
x = nd.array([[1, 2], [3, 4]])

In [7]:
x


[[1. 2.]
 [3. 4.]]
<NDArray 2x2 @cpu(0)>

Once we compute the gradient of f(x) with respect to x, we’ll need a place to store it.<br>
In MXNet, we can tell an NDArray that we plan to store a gradient by invoking its attach_grad method.

In [8]:
x.attach_grad()

Now we’re going to define the function y=f(x).<br>
To let MXNet record the computation of y, so that we can compute gradients later,<br>
we need to put the definition inside an autograd.record() scope.

In [9]:
with autograd.record():
    y = 2 * x * x

Let’s invoke back propagation (backprop) by calling y.backward().<br>
When y has more than one entry, y.backward() is equivalent to y.sum().backward().

In [10]:
y.backward()

Since y=2x<sup>2</sup>, the derivative is
\begin{equation*}
\frac{dy}{dx}=4x,
\end{equation*} <br>
which for our x should be

\begin{equation*}
4\mathbf{x} = \begin{bmatrix}
4 & 8 \\
12 & 16 \\
\end{bmatrix}
\end{equation*}

In [11]:
x.grad


[[ 4.  8.]
 [12. 16.]]
<NDArray 2x2 @cpu(0)>
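
As a sanity check that does not depend on MXNet, the gradient above can be compared against a central finite-difference estimate. The sketch below is plain Python. The helper names f_sum and numerical_grad are made up for this illustration, and summing the entries mirrors the y.sum().backward() behaviour described earlier.

```python
def f_sum(x):
    """Sum of 2 * x_ij^2 over all entries, mirroring y.sum() for y = 2 * x * x."""
    return sum(2.0 * v * v for row in x for v in row)


def numerical_grad(func, x, eps=1e-4):
    """Central finite-difference estimate of d func / d x_ij for a nested list x."""
    grad = []
    for i, row in enumerate(x):
        grad_row = []
        for j, _ in enumerate(row):
            # Perturb a single entry up and down, leaving the others unchanged.
            plus = [list(r) for r in x]
            minus = [list(r) for r in x]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad_row.append((func(plus) - func(minus)) / (2 * eps))
        grad.append(grad_row)
    return grad


x_values = [[1.0, 2.0], [3.0, 4.0]]
approx = numerical_grad(f_sum, x_values)
exact = [[4.0 * v for v in row] for row in x_values]  # analytic gradient 4x
print(approx)
print(exact)
```

The two printed matrices should agree up to floating-point rounding, which is what x.grad reports as well.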

## 2. Using Python control flows

Sometimes we want to write dynamic programs where the execution path depends on values computed at run time. <br>MXNet will record the execution trace and compute the gradient as well.

Consider the following function f: it doubles the input until its norm reaches 1000. <br>
Then it selects one element depending on the sum of its elements.

In [3]:
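def f(a):
    b = a * 2
    # Keep doubling until the norm of b reaches 1000.
    while b.norm().asscalar() < 1000:
        b = b * 2
    # Select after the loop, so c is defined even if the loop body never runs.
    if b.sum().asscalar() >= 0:
        c = b[0]
    else:
        c = b[1]
    return c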

We start with a random input value.<br>
Before recording the trace, we must mark the variables that require gradients by attaching gradient buffers to them, which we do below with attach_grad.

In [4]:
a = nd.random.uniform(shape=2)
a


[0.5488135 0.5928446]
<NDArray 2 @cpu(0)>

For reference on the norm, see https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.linalg.norm.html. The p-norm of a vector x with n entries is

\begin{equation*}
\mathbf{L}_p = 
           \biggl(\sum_{i=1}^n \left\vert{x_i}\right\vert^p\biggr)^\frac{1}{p}
\end{equation*}

The value returned by a.norm() below corresponds to p = 2.

In [20]:
a.norm()


[0.80787444]
<NDArray 1 @cpu(0)>
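
The formula can be checked by hand against this output. The sketch below is plain Python rather than MXNet, and p_norm is a hypothetical helper written only for this illustration. It evaluates the L_p formula on the (rounded) entries of a printed earlier.

```python
def p_norm(values, p=2):
    """L_p norm of a flat sequence: (sum_i |x_i|^p) ** (1 / p)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


a_values = [0.5488135, 0.5928446]  # entries of a as printed above (rounded)
l2 = p_norm(a_values)        # p = 2, the Euclidean norm
l1 = p_norm(a_values, p=1)   # p = 1, the sum of absolute values
print(l2, l1)
```

With p = 2 the result agrees with the a.norm() output to within the rounding of the printed inputs.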

In [6]:
a.attach_grad()
with autograd.record():
    c = f(a)

In [7]:
c.backward()

In [8]:
print(a.grad)


[2048.    0.]
<NDArray 2 @cpu(0)>


c.backward() is equivalent to nd.sum(c).backward().

We know that b is a linear function of a, and c is chosen from b. <br>
The gradient with respect to a will therefore be either [c/a[0], 0] or [0, c/a[1]], <br>
depending on which element of b we picked. Let’s check the results:

In [16]:
[a.grad, c/a]

[
 [2048.    0.]
 <NDArray 2 @cpu(0)>, 
 [2048.     1895.8933]
 <NDArray 2 @cpu(0)>]
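
The same gradient can be reproduced without MXNet. The sketch below is a plain-Python stand-in for f (the name f_plain is made up here) that selects the element once the doubling loop has finished, followed by a central finite-difference estimate. It assumes the rounded values of a printed earlier and an L2 norm.

```python
import math


def f_plain(a):
    """Plain-Python stand-in for f: double until the L2 norm reaches 1000, then pick an element."""
    b = [2.0 * v for v in a]
    while math.sqrt(sum(v * v for v in b)) < 1000:
        b = [2.0 * v for v in b]
    return b[0] if sum(b) >= 0 else b[1]


a_values = [0.5488135, 0.5928446]  # entries of a as printed above (rounded)
eps = 1e-6
grad = []
for i in range(len(a_values)):
    # Perturb one input at a time; eps is small enough not to change the loop count.
    plus = list(a_values)
    minus = list(a_values)
    plus[i] += eps
    minus[i] -= eps
    grad.append((f_plain(plus) - f_plain(minus)) / (2 * eps))

c_value = f_plain(a_values)
print(grad)
print(c_value / a_values[0])
```

Because the control flow only decides how many times the input is doubled and which element is returned, the function is linear in a along the executed path, so the finite-difference estimate and c/a[0] should both be close to the first entry of a.grad.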

Here, y = g(x) and z = f(y) = f(g(x)), with y = 2 * x and z = 2 * x * x.

After running backprop with z.backward(), x.grad holds the gradient dz/dx:

dy/dx = 2, dz/dx = 4 * x
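
The value of dz/dx can be derived by hand. The text does not say how z is computed from y, so the derivation below assumes the elementwise product z = y · x, which is one choice consistent with z = 2 * x * x. Under that assumption the product rule gives

\begin{equation*}
\frac{dz}{dx} = \frac{d(y \cdot x)}{dx} = \frac{dy}{dx} \cdot x + y = 2x + 2x = 4x
\end{equation*}

which matches the gradient found for f(x)=2x<sup>2</sup> in section 1.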