<div style="float:center;width:100%;text-align: center;"><strong style="height:60px;color:darkred;font-size:40px;">The Backpropagation Algorithm</strong></div>

The backpropagation algorithm uses the chain rule to compute the partial derivatives of an output with respect to its inputs.<br>
$\qquad$ It is easy to explain using a simple example.

# Example

<div style="float:left;width:55%;">
Consider $f(x,y) = 5 x^2 -3 y, g(x,y)= 2 x + y -5$ and $h(x,y) = x y$.<br><br>
$\qquad$ Let's compute $\frac{\partial}{\partial x} \left. { h(f(x,y), g(x,y)) } \right|_{x=1,y=3}$<br><br>

Using the chain rule, we obtain:

$\qquad \begin{align} \frac{\partial}{\partial x} { h(f(x,y), g(x,y)) } = \;
      & \left.\frac{\partial h(f,g)}{\partial f}\right|_{f=f(x,y), g=g(x,y)} \frac{\partial f(x,y)}{\partial x} \\
    + & \left.\frac{\partial h(f,g)}{\partial g}\right|_{f=f(x,y), g=g(x,y)} \frac{\partial g(x,y)}{\partial x}
    \end{align}
$
<br>

where we have introduced variables $f$ and $g$ that will be evaluated as<br> $\qquad f = f(x,y)$ and $g = g(x,y)$.<br><br>

The graph representing the operations is shown on the right:<br>

Since we are interested in specific values $x=1$ and $y=3$,<br>
$\qquad$ each function and each derivative<br>$\qquad$ needs to be evaluated at these values.
</div><div style="float:left;width:45%;">
    <img src="Figs/backpropagation.svg" width="450">
</div>

The computation proceeds in two phases:
* phase 1: start at the bottom, and substitute the values for $x$ and $y$ in each of the functions as we work up toward $h$.<br>
$\qquad \left.
\left. \begin{align}
x=1, y=3 \\
f=5x^2-3y, g=2x+y-5
\end{align}\right\} \Rightarrow f = -4, g= 0 \right\} \Rightarrow h = f g = 0$

* phase 2: start at the top, and evaluate each of the derivatives as we move down to the variables $x$ and $y$<br>
using the values calculated in phase 1.

$ \qquad \begin{align}
& \frac{\partial h(f,g)}{\partial f} = g = 0,    & \frac{\partial h(f,g)}{\partial g} =& f = -4 & \\
& \frac{\partial f(x,y)}{\partial x} = 10 x =10, & \frac{\partial g(x,y)}{\partial x} =& \ 2  &\\
& \frac{\partial f(x,y)}{\partial y} = -3,       & \frac{\partial g(x,y)}{\partial y} =& \ 1,  & \frac{\partial h}{\partial x} = -8, \;\;\frac{\partial h}{\partial y} = -4\\
\end{align}
$

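**Remark:** phase 2 yields both partial derivatives by combining these values with the chain rule: $\frac{\partial h}{\partial x} = 0 \cdot 10 + (-4) \cdot 2 = -8$ and $\frac{\partial h}{\partial y} = 0 \cdot (-3) + (-4) \cdot 1 = -4$.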
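
The two phases can be checked numerically. The following is a minimal Python sketch of this example only, not a general implementation; the function names `forward` and `backward` are assumptions introduced here for illustration.

```python
# Worked example: h(f, g) = f * g with f = 5x^2 - 3y and g = 2x + y - 5.

def forward(x, y):
    """Phase 1: evaluate each node, working from the inputs up to h."""
    f = 5 * x**2 - 3 * y
    g = 2 * x + y - 5
    h = f * g
    return f, g, h

def backward(x, y, f, g):
    """Phase 2: evaluate the local derivatives, working from h down to x and y."""
    dh_df, dh_dg = g, f          # h = f * g
    df_dx, df_dy = 10 * x, -3    # f = 5x^2 - 3y
    dg_dx, dg_dy = 2, 1          # g = 2x + y - 5
    # Chain rule: sum the contributions of the paths through f and through g.
    dh_dx = dh_df * df_dx + dh_dg * dg_dx
    dh_dy = dh_df * df_dy + dh_dg * dg_dy
    return dh_dx, dh_dy

f, g, h = forward(1, 3)
dh_dx, dh_dy = backward(1, 3, f, g)
print(f, g, h)        # -4 0 0
print(dh_dx, dh_dy)   # -8 -4
```

Note that `backward` reuses the values of `f` and `g` stored during phase 1, which is why the forward evaluation has to come first.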

<div style="float:left;width:45%;text-align:center;">
<strong>Backpropagation Phase 1</strong>

<br><br>
<img src="Figs/backpropagation_phase_1.svg" width="450">
</div>
<div style="float:right;width:45%;text-align:center;">
<strong>Backpropagation Phase 2</strong>
<br><br>
<img src="Figs/backpropagation_phase_2.svg" width="450">
</div>

**Remark:** If we do not require $\frac{\partial h}{\partial y}$, we do not need to compute $\frac{\partial f}{\partial y}$ and $\frac{\partial g}{\partial y}.$

The term **backpropagation** comes from neural networks:<br>
$\qquad$ the variables $x,y$ are values at the input layer.<br>
$\qquad$ As we move upward in the dependency graph, we move toward the output layer.

$\qquad$ Phase 2 computes the partial derivatives at the output layer,<br>
$\qquad$ then propagates these values back toward the input layer<br>
$\qquad$ by making use of the chain rule.

# Forward Versus Backpropagation

Let us expand the example by one layer, and compute the partial derivatives by forward propagation as well as backpropagation.

<div style="float:left;width:45%;text-align:center;">
<strong>Forward Propagation</strong>
<img src="Figs/backpropagation_forward.svg"  width="450">

Using forward propagation, we start at the input layer and compute<br>
$\qquad$ the partial derivatives along each path<br>
$\qquad$ **with respect to the input variable,** e.g.,<br><br>
$\qquad\qquad \frac{\partial f}{\partial \color{red}{a}} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial a} +  \frac{\partial f}{\partial y} \frac{\partial y}{\partial a}$

$\qquad$ We obtain the partial derivatives of each node<br>$\qquad\qquad$ with respect to the input variable $a$.
</div>
<div style="float:right;width:45%;text-align:center;">
<strong>Backward Propagation</strong>
<img src="Figs/backpropagation_backward.svg"  width="450">

Using backward propagation, we start at the output layer and compute<br>
$\qquad$ the partial derivatives of the output variable along each path<br>
$\qquad$ **with respect to the current node,** e.g.,<br><br>
$\qquad\qquad \frac{\partial \color{red}{h}}{\partial x} = \frac{\partial h}{\partial f}\frac{\partial f}{\partial x} +  \frac{\partial h}{\partial g} \frac{\partial g}{\partial x}$

$\qquad$ We obtain the partial derivatives of the output $h$<br>$\qquad\qquad$ with respect to each node.
</div>

In machine learning, we are concerned with the partial derivatives of the outputs with respect to the inputs:
* backpropagation computes all of them for a given output in a single pass, e.g., $\frac{\partial h}{\partial x}$ and  $\frac{\partial h}{\partial y}$
* forward propagation requires a separate pass for each input variable.
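
The difference in the number of passes can be illustrated in code. The sketch below is an assumption-laden illustration: it uses the two-input example $h = f g$ from the first section rather than the expanded graph in the figures, and the names `forward_mode` and `reverse_mode` are hypothetical.

```python
def forward_mode(x, y, dx, dy):
    """One forward pass: each node carries its derivative w.r.t. the seeded input."""
    f, df = 5 * x**2 - 3 * y, 10 * x * dx - 3 * dy
    g, dg = 2 * x + y - 5, 2 * dx + dy
    h, dh = f * g, df * g + f * dg      # product rule
    return h, dh

def reverse_mode(x, y):
    """One forward evaluation, then one backward pass from h to every input."""
    f = 5 * x**2 - 3 * y
    g = 2 * x + y - 5
    h = f * g
    dh_df, dh_dg = g, f                 # derivatives of h w.r.t. its direct inputs
    dh_dx = dh_df * 10 * x + dh_dg * 2
    dh_dy = dh_df * (-3) + dh_dg * 1
    return h, (dh_dx, dh_dy)

# Forward propagation: one pass per input variable (seed x, then seed y).
_, dh_dx = forward_mode(1, 3, dx=1, dy=0)
_, dh_dy = forward_mode(1, 3, dx=0, dy=1)

# Backpropagation: a single backward pass yields both derivatives.
_, grads = reverse_mode(1, 3)

print(dh_dx, dh_dy)  # -8 -4
print(grads)         # (-8, -4)
```

With many inputs and a single output, as is typical for a loss function, the forward approach would need one pass per input while the backward approach still needs only one.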

____
A good explanation is given by [Chris Olah](https://colah.github.io/posts/2015-08-Backprop/).