<div style="float:center;width:100%;text-align: center;"><strong style="height:60px;color:darkred;font-size:40px;">The Backpropagation Algorithm</strong></div>

The backpropagation algorithm uses the chain rule to compute a set of partial derivatives.<br>
$\qquad$ It is easiest to explain with a simple example.

# Example

<div style="float:left;width:50%;">
Consider $f(x,y) = 5 x^2 -3 y, g(x,y)= 2 x + y -5$ and $h(x,y) = x y$.<br><br>
$\qquad$ Let's compute $\left. \frac{\partial}{\partial x} h(f(x,y), g(x,y)) \right|_{x=1,y=3}$.<br><br>

Using the chain rule, we obtain:

$\qquad \begin{align} \frac{\partial}{\partial x} h(f(x,y), g(x,y)) = \;
      & \left.\frac{\partial h(f,g)}{\partial f}\right|_{f=f(x,y), g=g(x,y)} \frac{\partial f(x,y)}{\partial x} \\
    + & \left.\frac{\partial h(f,g)}{\partial g}\right|_{f=f(x,y), g=g(x,y)} \frac{\partial g(x,y)}{\partial x}
    \end{align}
$
<br>

where we have introduced variables $f$ and $g$ that will be evaluated as<br> $\qquad f = f(x,y)$ and $g = g(x,y)$.<br><br>

The graph representing the operations is shown on the right.<br>

Since we are interested in the specific values $x=1$ and $y=3$,<br>
$\qquad$ each function and each derivative<br>$\qquad$ needs to be evaluated at these values.
</div><div style="float:left;width:50%;">
    <img src="Figs/backpropagation.svg">
</div>

The computation proceeds in two phases:
* phase 1: start at the bottom, and substitute the values for $x$ and $y$ in each of the functions as we work up toward $h$.<br>
$\qquad \left.
\left. \begin{align}
x=1, y=3 \\
f=5x^2-3y, g=2x+y-5
\end{align}\right\} \Rightarrow f = -4, g= 0 \right\} \Rightarrow h = f g = 0$

* phase 2: start at the top, and evaluate each of the derivatives as we move down to the variables $x$ and $y$<br>
using the values calculated in phase 1.

$ \qquad \begin{align}
& \frac{\partial h(f,g)}{\partial f} = g = 0,    & \frac{\partial h(f,g)}{\partial g} =& f = -4 & \\
& \frac{\partial f(x,y)}{\partial x} = 10 x =10, & \frac{\partial g(x,y)}{\partial x} =& \ 2  &\\
& \frac{\partial f(x,y)}{\partial y} = -3,       & \frac{\partial g(x,y)}{\partial y} =& \ 1,  & \frac{\partial h}{\partial x} = -8, \;\;\frac{\partial h}{\partial y} = -4\\
\end{align}
$

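**Remark:** a single pass of phase 2 yields the partial derivatives of $h$ with respect to every input: $\frac{\partial h}{\partial x} = 0 \cdot 10 + (-4) \cdot 2 = -8$ and $\frac{\partial h}{\partial y} = 0 \cdot (-3) + (-4) \cdot 1 = -4$.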
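
The two phases can be sketched in a few lines of Python. This is a minimal illustration for this specific example only; the function names `forward` and `backward` are hypothetical and not part of any library.

```python
def forward(x, y):
    """Phase 1: evaluate the graph bottom-up and keep the intermediate values."""
    f = 5 * x**2 - 3 * y
    g = 2 * x + y - 5
    h = f * g
    return f, g, h


def backward(x, y, f, g):
    """Phase 2: evaluate the derivatives top-down, reusing the phase 1 values."""
    dh_df, dh_dg = g, f            # h = f * g
    df_dx, df_dy = 10 * x, -3      # f = 5 x^2 - 3 y
    dg_dx, dg_dy = 2, 1            # g = 2 x + y - 5
    dh_dx = dh_df * df_dx + dh_dg * dg_dx
    dh_dy = dh_df * df_dy + dh_dg * dg_dy
    return dh_dx, dh_dy


f, g, h = forward(1, 3)
dh_dx, dh_dy = backward(1, 3, f, g)
print(f, g, h)       # -4 0 0
print(dh_dx, dh_dy)  # -8 -4
```

Note that `backward` never re-evaluates $f$ or $g$: it only consumes the values stored during phase 1.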

<div style="float:left;width:45%;">
<strong>Backpropagation Phase 1</strong>

<img src="Figs/backpropagation_phase_1.svg">
</div>
<div style="float:right;width:45%;">
<strong>Backpropagation Phase 2</strong>
<br><br>
<img src="Figs/backpropagation_phase_2.svg">
</div>

**Remark:** If we do not require $\frac{\partial h}{\partial y}$, we do not need to compute $\frac{\partial f}{\partial y}$ and $\frac{\partial g}{\partial y}.$

The **backpropagation** terminology is due to neural networks:<br>
$\qquad$ the variables $x,y$ are values at the input layer.<br>
$\qquad$ As we move upward in the dependency graph, we move toward the output layer.

$\qquad$ Phase 2 computes the partial derivatives at the output layer,<br>
$\qquad$ then propagates these values back toward the input layer<br>
$\qquad$ by making use of the chain rule.
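
The backward propagation described above can also be written as a loop over the dependency graph. The sketch below is an illustration under assumptions: the dictionary layout (`edges`, `adjoint`) and the hand-written node ordering are hypothetical choices, and the edge values are the derivatives of the example evaluated at $x=1$, $y=3$.

```python
# Local derivative on each edge (node -> input it depends on), at x=1, y=3.
edges = {
    "h": {"f": 0, "g": -4},   # dh/df = g = 0, dh/dg = f = -4
    "f": {"x": 10, "y": -3},  # df/dx = 10x, df/dy = -3
    "g": {"x": 2, "y": 1},    # dg/dx = 2,   dg/dy = 1
}

# Visit the output first, so a node's derivative is complete before it is used.
order = ["h", "f", "g"]

adjoint = {"h": 1}  # dh/dh = 1 at the output node
for node in order:
    for child, local in edges[node].items():
        # Chain rule: add this edge's contribution to dh/d(child).
        adjoint[child] = adjoint.get(child, 0) + adjoint[node] * local

print(adjoint["x"], adjoint["y"])  # -8 -4
```

Each edge is visited exactly once, and contributions arriving at the same node are summed before being passed further toward the inputs.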

____
<div style="float:left;width:60%;">

**Could we compute the derivatives starting from the input layer instead?**<br>
We certainly can compute the derivatives displayed on the edges of the graphs as before.

However, to then compute $\frac{\partial h}{\partial x}$ for example, we need to multiply the derivatives along each path<br>$\qquad$
connecting the input node $x$ to the output node, and then sum over these paths, i.e.,
$\;\;\frac{\partial h}{\partial x} = \frac{\partial h}{\partial f}  \frac{\partial f}{\partial x} +  \frac{\partial h}{\partial g} \frac{\partial g}{\partial x}$.

The problem with "summing over the paths" is that for large graphs we get a combinatorial explosion<br>
    $\qquad$ in the number of possible paths.
<br><br>
    
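To see what happens more clearly, let us **add another layer** to our example, and write $\alpha_i$, $\beta_j$, $\gamma_k$ for the derivatives on the edges of the graph shown on the right: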

$\qquad$ Starting from the input layer, we obtain

$\qquad\qquad
\frac{\partial h}{\partial a} = \alpha_1 \beta_1 \gamma_1 + \alpha_1 \beta_3 \gamma_2 + \alpha_2 \beta_2 \gamma_1 + \alpha_2 \beta_4 \gamma_2$

$\qquad$ Using backpropagation, the computation is

$\qquad\qquad
\frac{\partial h}{\partial a} = \left( \gamma_1 \beta_1 + \gamma_2 \beta_3 \right) \alpha_1 +
\left( \gamma_1 \beta_2 + \gamma_2 \beta_4 \right) \alpha_2$

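$\therefore$ The computation has been factored: here it takes 6 multiplications instead of 8. For large graphs the saving is substantial, since the cost of backpropagation grows with the number of edges, whereas the number of paths can grow exponentially with the number of layers.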
</div>
<div style="float:left;width:40%;"><img src="Figs/backpropagation_factorization.svg"></div>
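
The factorization can be checked numerically. In the sketch below, the integer values assigned to $\alpha_i$, $\beta_j$, $\gamma_k$ are arbitrary placeholders, and the helper functions `path_count` and `edge_count` are hypothetical; they assume a fully connected layered graph with one input node, one output node, and `layers` hidden layers of equal `width`.

```python
# Arbitrary placeholder values for the edge derivatives (assumption).
alpha = {1: 2, 2: 3}
beta = {1: 5, 2: 7, 3: 11, 4: 13}
gamma = {1: 17, 2: 19}

# Sum over the four paths: 8 multiplications.
path_sum = (alpha[1] * beta[1] * gamma[1] + alpha[1] * beta[3] * gamma[2]
            + alpha[2] * beta[2] * gamma[1] + alpha[2] * beta[4] * gamma[2])

# Factored (backpropagation) form: 6 multiplications.
factored = ((gamma[1] * beta[1] + gamma[2] * beta[3]) * alpha[1]
            + (gamma[1] * beta[2] + gamma[2] * beta[4]) * alpha[2])

print(path_sum, factored)  # 1686 1686


def path_count(width, layers):
    """Number of input-to-output paths in the assumed layered graph."""
    return width ** layers


def edge_count(width, layers):
    """Number of edges in the same graph: input fan-out, hidden-to-hidden, output fan-in."""
    return 2 * width + (layers - 1) * width ** 2


print(path_count(10, 5), edge_count(10, 5))  # 100000 420
```

Under these assumptions, five hidden layers of width 10 already give 100000 paths but only 420 edges, which is the gap the factored computation exploits.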

A good explanation is given by [Chris Olah](https://colah.github.io/posts/2015-08-Backprop/).