In [None]:
%matplotlib inline

<div class="alert alert-info"><h4>Further reading:</h4><p>This notebook is adapted from the <a href="https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html">PyTorch: A 60 Minute Blitz</a> tutorial on the PyTorch website. For documentation and more tutorials, visit <a href="https://pytorch.org">pytorch.org</a></p></div>


# Autograd

``torch.autograd`` is PyTorch’s automatic differentiation engine for training neural networks. This notebook will help you build a conceptual understanding of how autograd works.

## Background
Neural networks (NNs) are collections of nested functions that are executed on some input data. These functions are defined by *parameters* (consisting of weights and biases), which in PyTorch are stored in tensors (see the previous notebook for more on tensors).

Training an NN happens in two steps:

**Forward Propagation**: In forward prop, the NN runs the input data through each of its functions to generate its output. This output might be something like a guess for whether an input image is of a cat, a dog, etc.

**Backward Propagation**: In backprop, the NN adjusts its parameters proportionate to the error in its output. It does this by traversing backwards, starting from the output and moving toward the input, collecting the derivatives of the error with respect to the parameters of the functions (*gradients*), and optimizing the parameters using gradient descent. For a more detailed walkthrough of backprop, check out [this video](https://www.youtube.com/watch?v=tIeHLnjs5U8) from Grant Sanderson (3Blue1Brown).
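
The two steps can be sketched by hand for a toy model with a single parameter. This is only an illustration in plain Python: the model `prediction = w * x_in`, the squared error, the training example, and the learning rate are all assumptions made up for this sketch, not part of the original tutorial.

```python
# Toy "network" with a single parameter w: prediction = w * x_in
x_in, target = 2.0, 10.0   # hypothetical training example
w = 0.0                    # initial parameter value
learning_rate = 0.05       # hypothetical step size

for step in range(50):
    # Forward propagation: run the input through the function
    prediction = w * x_in
    sq_error = (prediction - target) ** 2

    # Backward propagation: derivative of the error with respect to w,
    # worked out by hand with the chain rule
    grad_w = 2 * (prediction - target) * x_in

    # Gradient descent: step against the gradient
    w -= learning_rate * grad_w

print(w)  # approaches 5.0, since 5.0 * 2.0 == 10.0
```

Deriving `grad_w` by hand is manageable for one parameter but not for a real network, which is the job ``autograd`` takes over.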

In [None]:
import torch

## Differentiation using Autograd
Let's take a look at how ``autograd`` collects gradients with a very simple example. We'll create a one-element tensor ``x`` (shape ``(1,)``) and set ``requires_grad=True``. This signals to ``autograd`` that every operation on ``x`` should be tracked.

In [None]:
x = torch.tensor([5.], requires_grad=True)
print(x)

Now let's make another tensor, ``y``, that's a function of ``x``:

$$y = x^3$$

In [None]:
y = x ** 3
print(y)

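Notice that ``y`` has a ``grad_fn`` attribute: it records the operation that produced ``y`` (here, a power), which is what ``autograd`` follows when it works backwards. The gradient here is just a derivative: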

$$ \frac{\partial y}{\partial x} = 3x^2 $$

And we can check whether autograd did its job correctly by calling ``.backward()`` on ``y``, which stores the gradient in ``x.grad``. We expect that to equal $3x^2$. Does it?

In [None]:
y.backward()
if x.grad == 3 * x ** 2:
    print("Gradients match!")
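
As an independent cross-check that does not rely on ``autograd`` at all, the same derivative can be approximated numerically. The sketch below uses a central finite difference in plain Python; the helper name `central_difference` and the step size `h` are assumptions chosen for illustration.

```python
def cube(value):
    """The function from the example above, y = x^3, on plain floats."""
    return value ** 3


def central_difference(func, point, h=1e-4):
    """Approximate d(func)/dx at `point` with a symmetric difference."""
    return (func(point + h) - func(point - h)) / (2 * h)


x0 = 5.0                                 # same input value as the tensor x
numeric = central_difference(cube, x0)   # numerical estimate
analytic = 3 * x0 ** 2                   # hand-derived 3x^2

print(numeric, analytic)
print(abs(numeric - analytic) < 1e-5)
```

The numerical estimate is only approximate, whereas ``autograd`` evaluates the derivative exactly (up to floating-point rounding).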

------

Now let's do a slightly more complex example: Let's say we have a tensor ``a``, which represents the second-to-last hidden layer of a neural net, ``b``, which represents the last hidden layer, and ``y_hat``, which represents the output. These will all be three-element tensors this time (rather than one-element tensors).

In [None]:
a = torch.tensor([1., 2., 3.], requires_grad=True)
b = a ** 3
b.retain_grad()  # b is an intermediate (non-leaf) tensor, so ask autograd to keep its gradient
y_hat = b ** 2

And we'll say for simplicity that the error is just the sum of the elements of $\hat{y}$.

In [None]:
error = torch.sum(y_hat)

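After calling ``.backward()`` on ``error``, its gradients with respect to ``a`` and ``b`` should be stored in ``a.grad`` and ``b.grad``. Because the error is a plain sum and each element of $\hat{y}$ depends only on the matching element of ``b``, these gradients are just the elementwise derivatives of $\hat{y}$. Are they what we'd expect?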

$$
\begin{align}
\frac{\partial \hat{y}}{\partial b} &= 2b \\
\frac{\partial \hat{y}}{\partial a} &= \frac{\partial \hat{y}}{\partial b} \frac{\partial b}{\partial a} = (2b)(3a^2) = 6ba^2
\end{align}
$$
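
To make the chain-rule expressions concrete, they can be evaluated by hand for the values used here. This is a plain-Python sketch; the list names `a_vals`, `b_vals`, `grad_b`, and `grad_a` are chosen for illustration so they don't collide with the tensors defined above.

```python
a_vals = [1.0, 2.0, 3.0]                   # same values as the tensor a
b_vals = [a_i ** 3 for a_i in a_vals]      # b = a^3

# d(y_hat)/db = 2b, evaluated element by element
grad_b = [2 * b_i for b_i in b_vals]

# d(y_hat)/da = 6 * b * a^2, evaluated element by element
grad_a = [6 * b_i * a_i ** 2 for a_i, b_i in zip(a_vals, b_vals)]

print(grad_b)  # [2.0, 16.0, 54.0]
print(grad_a)  # [6.0, 192.0, 1458.0]
```

These are the values that ``b.grad`` and ``a.grad`` should hold after the backward pass.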

In [None]:
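error.backward()  # populates a.grad and, because of retain_grad(), b.grad
# Compare autograd's results with the hand-derived expressions
if torch.allclose(a.grad, 6 * b * a ** 2) and torch.allclose(b.grad, 2 * b):
    print("Gradients match!")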
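
The ``b.retain_grad()`` call in the earlier cell matters because, by default, ``autograd`` only keeps ``.grad`` for leaf tensors such as ``a``. The sketch below contrasts an intermediate tensor with and without ``retain_grad()``. The names `a_demo`, `b_default`, `b_retained`, and `total` are made up for this illustration, and the `warnings` filter is there only because reading ``.grad`` on a non-leaf tensor usually emits a warning.

```python
import warnings

import torch

a_demo = torch.tensor([1., 2., 3.], requires_grad=True)  # leaf tensor

b_default = a_demo ** 3    # intermediate tensor, gradient not retained
b_retained = a_demo ** 3   # same computation...
b_retained.retain_grad()   # ...but ask autograd to keep its gradient

total = torch.sum(b_default ** 2) + torch.sum(b_retained ** 2)
total.backward()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # silence the non-leaf .grad warning
    default_grad = b_default.grad     # None: nothing was stored

print(a_demo.is_leaf, b_default.is_leaf)
print(default_grad)
print(b_retained.grad)                # equals 2 * b_retained
```

Gradients still flow through ``b_default`` to ``a_demo`` during the backward pass. The only difference is whether the intermediate value is kept around for inspection afterwards.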