Calculating derivatives is the crucial step in all the optimization algorithms that we will use to train deep networks.

Working them out by hand can be tedious and error-prone.
Fortunately, all modern deep learning frameworks take this work off our plates by offering automatic differentiation (often shortened to **autograd**).
As we pass data through each successive function, the framework builds a computational graph that tracks how each value depends on others.

In [1]:
import torch

### A Simple Function

In [2]:
x = torch.arange(4.0)
x.requires_grad = True
# We can also set this attribute true in the definition as:
# x = torch.arange(4.0, requires_grad = True)
x

tensor([0., 1., 2., 3.], requires_grad=True)

Setting the *requires_grad* attribute to True tells autograd to track operations on x and allocates a place, x.grad, to store the gradient of a later result with respect to x.
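As a small illustration of what the attribute changes, the sketch below inspects a freshly created tensor before any gradient has been computed. The name `x_demo` is hypothetical and is used only so that the notebook's `x` is left untouched.

```python
import torch

# Hypothetical tensor used only for this illustration.
x_demo = torch.arange(4.0, requires_grad=True)

print(x_demo.requires_grad)  # True: autograd will track operations on x_demo
print(x_demo.is_leaf)        # True: created directly by the user, not by an operation
print(x_demo.grad)           # None: nothing is stored until backward() has been called
```

The gradient slot stays empty (None) until a backward pass fills it in.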

In [3]:
y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)
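The `grad_fn` shown in the output above is the entry point into the recorded computational graph. The sketch below walks one step back through that graph using `grad_fn.next_functions`. The names `x_demo`, `y_demo`, and `parents` are hypothetical, and the exact node names printed may differ between PyTorch versions.

```python
import torch

# Hypothetical tensors mirroring the example above.
x_demo = torch.arange(4.0, requires_grad=True)
y_demo = 2 * torch.dot(x_demo, x_demo)

# Leaf tensors created by the user have no grad_fn.
print(x_demo.grad_fn)

# The result remembers the operation that produced it.
print(y_demo.grad_fn.name())

# Each node links to the nodes that produced its inputs;
# inputs that need no gradient (the constant 2) appear as None.
parents = [fn.name() for fn, _ in y_demo.grad_fn.next_functions if fn is not None]
print(parents)
```

Following these links from the result back to the leaf tensors is how `backward()` finds the operations it has to differentiate.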

Now we can take the gradient of y with respect to x by calling its backward method, and then access the gradient through x's grad attribute.

In [4]:
y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])

In [5]:
x.grad == 4*x

tensor([True, True, True, True])
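The comparison above holds because the gradient can be worked out by hand. Writing the function computed above as $y = 2\,\mathbf{x}^\top\mathbf{x}$:

$$
y = 2\sum_i x_i^2
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_i} = 4x_i
\quad\Longrightarrow\quad
\nabla_{\mathbf{x}}\, y = 4\mathbf{x}.
$$

For $\mathbf{x} = [0, 1, 2, 3]$ this gives $[0, 4, 8, 12]$, which matches the value stored in x.grad.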

Note that PyTorch doesn't automatically reset the gradient buffer when we record a new gradient. Instead, the new gradient is added to the already-stored gradient.

In [6]:
y = x.sum()
print(y)
y.backward()
x.grad

tensor(6., grad_fn=<SumBackward0>)


tensor([ 1.,  5.,  9., 13.])

To discard the previously accumulated gradient, we reset the buffer to zero before calling backward again.

In [7]:
x.grad.zero_()
y.backward()
x.grad

tensor([1., 1., 1., 1.])
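The accumulation behaviour can be seen in isolation in the sketch below. It uses a hypothetical tensor `w` and also shows setting `w.grad = None` as an alternative to `zero_()` for clearing the buffer.

```python
import torch

# Hypothetical tensor used only for this illustration.
w = torch.ones(3, requires_grad=True)

(2 * w).sum().backward()
first = w.grad.clone()        # tensor([2., 2., 2.])

(3 * w).sum().backward()      # added to the stored gradient, not overwritten
accumulated = w.grad.clone()  # tensor([5., 5., 5.])

w.grad = None                 # drop the buffer entirely instead of zeroing it in place
(3 * w).sum().backward()
fresh = w.grad.clone()        # tensor([3., 3., 3.])

print(first, accumulated, fresh)
```

Either way of clearing works here. Zeroing keeps the existing buffer, while assigning None lets the next backward pass create a new one.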

#### Another Example 

In [8]:
a = torch.arange(5.0, requires_grad = True)
b = torch.sum(a**2)

In [9]:
b.backward()
print(a)
print(a.grad)

tensor([0., 1., 2., 3., 4.], requires_grad=True)
tensor([0., 2., 4., 6., 8.])


### Backward for Non-Scalar Variables

Calling backward() with no arguments, as we did above, works only when the output (y) is a scalar.

When y is a vector, the most natural representation of the derivative of y with respect to a vector x is the **Jacobian matrix**, which contains the partial derivative of each component of y with respect to each component of x.

While Jacobian matrices are useful in some advanced machine learning techniques, more commonly we want to sum up the gradients of each component of y with respect to the full vector x, yielding a vector of the same shape as x.

In [10]:
x.grad.zero_()
y = x * x
y.sum().backward()
x.grad

tensor([0., 2., 4., 6.])
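Two related ways to get the same result are sketched below, using hypothetical names `v`, `out`, and `J`. Passing a vector of ones as the `gradient` argument of `backward()` is equivalent to summing first, and `torch.autograd.functional.jacobian` returns the full Jacobian matrix for comparison.

```python
import torch
from torch.autograd.functional import jacobian

# Hypothetical tensor mirroring x from the example above.
v = torch.arange(4.0, requires_grad=True)
out = v * v

# Weighting every component of out by 1 is the same as out.sum().backward().
out.backward(gradient=torch.ones_like(out))
print(v.grad)  # tensor([0., 2., 4., 6.])

# Full Jacobian: J[i, j] is the derivative of out[i] with respect to v[j].
J = jacobian(lambda t: t * t, torch.arange(4.0))
print(J.shape)  # torch.Size([4, 4])

# Summing over the output index recovers the gradient of the summed output.
col_sums = J.sum(dim=0)
print(col_sums)  # tensor([0., 2., 4., 6.])
```

Because each output here depends only on the matching input, the Jacobian is diagonal, and summing its rows gives back the vector stored in `v.grad`.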

### Detaching Computation

Sometimes we want to move some calculations outside of the recorded computational graph. We might have some intermediate terms for which we don't need to compute a gradient. 

In that case, we can detach the respective computational graph from the final result. 

In [11]:
x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
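# y is a non-leaf tensor, so autograd does not retain its gradient.
# Guard with is_leaf to avoid the warning raised by reading .grad on it.
if y.is_leaf and y.grad is not None:
    print(y.grad)
else:
    print("No y-gradients")

x.grad == u

No y-gradients

tensor([True, True, True, True])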

In [12]:
x.grad.zero_()
y = x * x
u = y.detach()
z = y * x  # use y itself (not the detached u), so gradients flow through y back to x

z.sum().backward()
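# y is a non-leaf tensor, so autograd does not retain its gradient.
# Guard with is_leaf to avoid the warning raised by reading .grad on it.
if y.is_leaf and y.grad is not None:
    print(y.grad)
else:
    print("No y-gradients")

x.grad == 3 * x ** 2

No y-gradients

tensor([True, True, True, True])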

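Detaching changes the result. With u detached, the gradient of z = u * x treats u as a constant, so x.grad equals u (that is, x * x) rather than the true derivative 3 * x ** 2 obtained when the graph is left intact.
In deep learning this is often exactly what we want: detaching is used when certain parts of the network should not contribute to backpropagation.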
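To make the effect of `detach()` more concrete, the sketch below inspects the detached tensor itself. The names `p`, `q`, and `q_const` are hypothetical.

```python
import torch

# Hypothetical tensors used only for this illustration.
p = torch.arange(4.0, requires_grad=True)
q = p * p
q_const = q.detach()

# The detached tensor holds the same values but is cut off from the graph.
print(torch.equal(q, q_const))                 # True
print(q.requires_grad, q_const.requires_grad)  # True False
print(q_const.grad_fn)                         # None

# Backpropagating through q_const * p therefore treats q_const as a constant.
(q_const * p).sum().backward()
print(p.grad)  # tensor([0., 1., 4., 9.])
```

Since `q_const` has no `grad_fn`, the backward pass has no path from it back to `p`, which is why only the direct dependence on `p` shows up in the gradient.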

### Gradients and Python Control Flow

One benefit of using automatic differentiation is that even if building the computational graph of a function requires passing through a maze of Python control flow (conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable.

In [13]:
def f(a):
    b = a * 2 
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

In [14]:
a = torch.randn(size=(), requires_grad = True)
print(a)
d = f(a)
d.backward()
a.grad == d / a

tensor(-1.2669, requires_grad=True)


tensor(True)
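The check above works because, for any given input, f only multiplies a by constants, so f(a) = k * a for some k that depends on which branches were taken, and the derivative is k = f(a) / a. The sketch below repeats the function and uses two fixed inputs (chosen here for illustration, in place of the random one) so that the value of k can be followed by hand.

```python
import torch

def f(a):
    # Same function as above, repeated so this block runs on its own.
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

# Positive input: 3 is doubled nine times in total (3 -> 1536), so k = 2**9 = 512.
a_pos = torch.tensor(3.0, requires_grad=True)
d_pos = f(a_pos)
d_pos.backward()
print(a_pos.grad)  # tensor(512.)

# Negative input: same doubling, then the else branch multiplies by 100, so k = 51200.
a_neg = torch.tensor(-3.0, requires_grad=True)
d_neg = f(a_neg)
d_neg.backward()
print(a_neg.grad)  # tensor(51200.)
```

For arbitrary random inputs, comparing with `torch.allclose(a.grad, d / a)` is a safer check than `==`, since the division can differ from the gradient by floating-point rounding.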

These are the basic steps for using autograd:
- Attach gradients to those variables with respect to which we desire derivatives
- Record the computation of the target value
- Execute the backpropagation function
- Access the resulting gradient
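
The four steps above can be sketched end to end as follows. The tensor `w` and the quantity `loss` are hypothetical and chosen only to keep the example short.

```python
import torch

# 1. Attach gradients to the variable we want derivatives for.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# 2. Record the computation of the target value.
loss = (w ** 2).sum()

# 3. Execute backpropagation.
loss.backward()

# 4. Access the resulting gradient (here 2 * w).
print(w.grad)  # tensor([2., 4., 6.])
```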