
##Introduction
Training a neural network requires calculating the gradient of the loss with respect to the network's learnable parameters.

The gradient is calculated with the backpropagation algorithm, which is essentially the chain rule of calculus. The chain rule is applicable because neural networks involve functions of functions of functions. For example, the output of a neuron in the first layer is a function of the network's inputs, the output of a neuron in the second layer is a function of outputs of the first layer, and so on.

##The Chain Rule
Recall from calculus that for a function $h(x) = g_2(g_1(x))$,

$h'(x)=g'_2(g_1(x))g'_1(x)$,

where ${}'$ indicates the derivative. This can also be written,

$\frac{dh}{dx}=\frac{dg_2}{dg_1}\frac{dg_1}{dx}$.

This approach also extends to compositions of more than two functions.
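
As a quick sanity check, the chain rule can be compared against a numerical derivative. The sketch below assumes the illustrative choices $g_1(x)=x^2$ and $g_2(u)=\sin(u)$, which are not from the text above; any differentiable pair would do.

```python
import math

def g1(x):
    return x ** 2            # inner function (illustrative choice)

def g2(u):
    return math.sin(u)       # outer function (illustrative choice)

def h(x):
    return g2(g1(x))         # composition h(x) = g2(g1(x))

def h_prime(x):
    # Chain rule: g2'(g1(x)) * g1'(x) = cos(x^2) * 2x
    return math.cos(g1(x)) * 2 * x

x = 1.5
eps = 1e-6
# Central finite difference as an independent estimate of h'(x)
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)
print(h_prime(x), numeric)   # the two values should agree closely
```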

##The Idea of Backpropagation
The key idea of backpropagation is to re-use intermediate calculations that are needed for multiple derivatives (i.e. derivatives of the loss with respect to multiple network parameters). In a large network, this is much more efficient than using the chain rule independently for each parameter.

The diagram below shows a simple network with inputs $x_1$ and $x_2$, hidden-neuron outputs $h_1$ and $h_2$, an output neuron $\hat{y}$, and loss $L$. The diagram also shows the derivatives of the loss with respect to two of the network's weights, $w^h_{11}$ and $w^h_{12}$. In these expressions, $g$ refers to a neuron's nonlinearity, such as the ReLU function. These derivatives can be calculated with the chain rule. However, notice that the first three terms in these two derivatives are identical. The shared terms only need to be calculated once, and the result is then multiplied by each of the final terms (which differ).

<img src='https://github.com/bptripp/ai-course/blob/main/simple-network-derivatives.png?raw=true' width=500>

The first shared term, $\hat{y}-y$, is shared by all parameters in the network. It is calculated first. The second shared term, $w_1^y$, is shared by all parameters that affect $h_1$. It is calculated next, along with an analogous term shared by all parameters that affect $h_2$. Each of these terms is evaluated as a number rather than kept as a symbolic expression. In this way, the algorithm works its way backward through the network, accumulating shared terms as it goes.
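
The reuse of shared terms can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the loss is taken to be $L=\frac{1}{2}(\hat{y}-y)^2$ with a linear output neuron and no biases (consistent with the shared terms named above, but not confirmed by the text), $w^h_{1j}$ is taken to be the weight from input $x_j$ to hidden neuron 1, and all numeric values are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# Hypothetical values, for illustration only
x1, x2 = 0.5, -1.0           # inputs
y = 1.0                      # target
w_h = [[0.4, -0.3],          # w_h[i][j]: weight from input j+1 to hidden neuron i+1
       [0.2, 0.7]]
w_y = [1.5, -0.8]            # weights from hidden neurons to the output

# Forward pass
z1 = w_h[0][0] * x1 + w_h[0][1] * x2
z2 = w_h[1][0] * x1 + w_h[1][1] * x2
h1, h2 = relu(z1), relu(z2)
y_hat = w_y[0] * h1 + w_y[1] * h2

# Backward pass: each shared term is computed once, as a number
dL_dyhat = y_hat - y                 # shared by every parameter
dL_dh1 = dL_dyhat * w_y[0]           # shared by every parameter that affects h1
dL_dz1 = dL_dh1 * relu_grad(z1)      # product of the three shared terms

# Only the final factor differs between the two weights
dL_dw11 = dL_dz1 * x1
dL_dw12 = dL_dz1 * x2
print(dL_dw11, dL_dw12)
```

With many weights feeding each hidden neuron, the saving grows accordingly: the product `dL_dz1` is computed once and reused for every one of them.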

##Forward and Backward Pass
The derivative of a nonlinear function depends on the function's argument(s). For example, the derivative of the ReLU function is one if the input is greater than zero, but it is zero if the input is less than zero.

Because the derivative terms depend on their arguments, the first step is to propagate an input through the network, so that the inputs to all the nonlinear functions take on the right values. This is called "forward propagation". Then the derivatives are propagated backward, which is called "backpropagation". In a neural network, forward propagation and backpropagation involve similar equations. For example, forward propagation involves weighted sums of neurons' outputs, while backpropagation involves weighted sums of derivative terms wherever a neuron sends its output to multiple targets.

A practical consideration that affects larger networks is that the neurons' activations can't be discarded from memory once the network's output has been calculated. They must be kept until they are used to calculate the required derivatives.
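
The two passes can be sketched for a tiny network with one input, one ReLU hidden neuron, and two linear output neurons. The architecture, the squared-error loss, the function names, and the numeric values are all assumptions chosen for illustration. The forward pass returns a cache of the values that the backward pass later needs, and the derivative for the hidden neuron is a weighted sum because its output goes to two targets.

```python
def relu(z):
    return max(0.0, z)

def forward(x, w, v):
    """Forward pass; returns the outputs and the values cached for the backward pass."""
    z = w * x                       # hidden pre-activation
    h = relu(z)                     # hidden output, sent to both output neurons
    y_hat = [vk * h for vk in v]    # linear output neurons
    cache = {"x": x, "z": z, "h": h}
    return y_hat, cache

def loss_fn(y_hat, y):
    # Assumed loss: half the sum of squared errors
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))

def backward(y_hat, y, v, cache):
    """Backward pass; uses cached forward values to compute the derivatives."""
    d_yhat = [p - t for p, t in zip(y_hat, y)]        # dL/dy_hat_k
    d_v = [d * cache["h"] for d in d_yhat]            # needs the cached h
    d_h = sum(vk * d for vk, d in zip(v, d_yhat))     # weighted sum of derivative terms
    d_z = d_h * (1.0 if cache["z"] > 0 else 0.0)      # ReLU derivative needs the cached z
    d_w = d_z * cache["x"]
    return d_w, d_v

# Hypothetical values, for illustration only
x, w, v, y = 2.0, 0.5, [1.0, -2.0], [0.0, 1.0]
y_hat, cache = forward(x, w, v)
d_w, d_v = backward(y_hat, y, v, cache)
print(d_w, d_v)
```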

##Automatic Differentiation
Deep learning software takes care of these details mostly automatically. The software implementations of mathematical operations in a deep learning package have extra machinery for receiving and passing on derivative terms appropriately.

Below is an example of how this works in the PyTorch deep-learning package. PyTorch calls variables "tensors". This machinery is active for tensors that have a property called "requires_grad" set to True. The code below creates variables $a=2$, $b=0$, and $c=(a-b)^2$. The derivatives of $c$ with respect to $a$ and $b$ are then calculated by calling the method c.backward().

In [None]:
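import torch
# Operations on tensors created with requires_grad=True are tracked for differentiation
a = torch.tensor(2., requires_grad=True)
b = torch.tensor(0., requires_grad=True)
c = (a-b)**2   # c = (2 - 0)^2 = 4
c.backward()   # backpropagate from c, filling in a.grad and b.grad
print(a.grad)  # derivative of c with respect to a
print(b.grad)  # derivative of c with respect to b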

tensor(4.)
tensor(-4.)
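
The printed gradients can be checked by hand. Applying the chain rule to $c=(a-b)^2$ with $a=2$ and $b=0$ gives

$\frac{\partial c}{\partial a}=2(a-b)\cdot 1=4$,

$\frac{\partial c}{\partial b}=2(a-b)\cdot(-1)=-4$,

which matches the two values printed above.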
