## Chapter 3: Derivatives and Automatic Differentiation

# 3.9 The reverse mode of automatic differentiation

The automatic differentiator we have discussed so far operates in the forward mode, meaning that all function and derivative/gradient calculations are done in a single sweep (the forward sweep) that traverses the computational graph forward, as shown in the left panel of Figure 1 for the quadratic function $w_1^2+w_2^2$. Notice, however, that in this case calculating the full gradient at each node is wasteful because some gradient entries will always be zero, regardless of how the AD is initialized. In this example, the second gradient entry at nodes $w_1$ and $a=w_1^2$, as well as the first gradient entry at nodes $w_2$ and $b=w_2^2$, are always zero, as depicted in the right panel of Figure 1.

<p><img src="../../mlrefined_images/calculus_images/forward_2.png" width="100%" height="auto"></p>

This computational waste in computing, storing, and propagating zeros forward through the graph becomes more pronounced for functions with a larger number of input variables. In Figure 2 we show the computational graph for another simple quadratic function, this time one that takes in four inputs, i.e., $w_1^2+w_2^2+w_3^2+w_4^2$. In this case more than half of all gradient entries need not be evaluated, as they are always zero.

<p><img src="../../mlrefined_images/calculus_images/forward_4.png" width="100%" height="auto"></p>
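
The count of wasted entries can be checked with a small sketch of the forward mode. Everything here is an illustrative assumption rather than the book's implementation: the class `FwdNode`, the helpers `square` and `add`, the evaluation point, and the choice to build the sum as a chain of binary additions (a different graph layout would change the exact count slightly).

```python
class FwdNode:
    """Forward-mode node: a value plus a full gradient (hypothetical design)."""
    def __init__(self, value, grad):
        self.value = value
        self.grad = grad  # one entry per input variable

def square(node, tape):
    # Chain rule: d(u^2) = 2u * du, applied to every gradient entry.
    out = FwdNode(node.value ** 2, [2 * node.value * g for g in node.grad])
    tape.append(out)
    return out

def add(u, v, tape):
    out = FwdNode(u.value + v.value, [gu + gv for gu, gv in zip(u.grad, v.grad)])
    tape.append(out)
    return out

def sum_of_squares_forward(w):
    n = len(w)
    tape, inputs = [], []
    for i, wi in enumerate(w):
        # Each input starts with a one-hot gradient.
        node = FwdNode(wi, [1.0 if j == i else 0.0 for j in range(n)])
        tape.append(node)
        inputs.append(node)
    squares = [square(x, tape) for x in inputs]
    total = squares[0]
    for s in squares[1:]:
        total = add(total, s, tape)  # chain of binary additions (assumed layout)
    return total, tape

out, tape = sum_of_squares_forward([1.0, 2.0, 3.0, 4.0])
zeros = sum(g == 0.0 for node in tape for g in node.grad)
entries = sum(len(node.grad) for node in tape)
print(out.value, out.grad)
print(zeros, "of", entries, "gradient entries are zero")  # 27 of 44
```

With this layout and nonzero inputs, 27 of the 44 gradient entries carried through the graph are zero, in line with the "more than half" observation above.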

To remedy this inefficiency, automatic differentiation can also be done in what is typically referred to as the reverse mode, which consists of a forward sweep and a backward sweep. The forward sweep of the reverse mode is similar to the forward mode of AD in that the computational graph is traversed once in the forward direction, i.e., from left (input side) to right (output side). The only difference is that we no longer compute the full gradient at each node as we move forward through the graph. Instead, during the forward sweep we only compute and store the partial derivatives of each parent node with respect to its children. Once the forward sweep is complete, we change course and go backwards through the graph, visiting all nodes once more in precisely the reverse order of the forward sweep, this time updating the partial derivatives stored at each node by multiplying each by the (relevant) partial derivative of its parent. When the backward sweep is complete, the derivative attribute of each input variable contains the corresponding entry of the full gradient.
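
The two sweeps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used elsewhere in this chapter: the names `Node`, `record`, `square`, `add`, and `backward` are hypothetical, and the backward sweep accumulates with `+=` so that a node feeding several parents would also be handled (a case that does not arise for the sum of squares used here).

```python
class Node:
    """One node of the computational graph (hypothetical minimal design)."""
    def __init__(self, value, children=(), partials=()):
        self.value = value
        self.children = children   # nodes this node was computed from
        self.partials = partials   # d(self)/d(child), stored during the forward sweep
        self.derivative = 0.0      # filled in during the backward sweep

def record(tape, value, children=(), partials=()):
    # The tape remembers the order in which nodes were visited going forward.
    node = Node(value, children, partials)
    tape.append(node)
    return node

def square(tape, u):
    return record(tape, u.value ** 2, (u,), (2 * u.value,))

def add(tape, u, v):
    return record(tape, u.value + v.value, (u, v), (1.0, 1.0))

def backward(tape):
    # Seed the output node, then visit nodes in reverse order of the forward sweep.
    tape[-1].derivative = 1.0
    for node in reversed(tape):
        for child, partial in zip(node.children, node.partials):
            child.derivative += node.derivative * partial

# Forward sweep for w1^2 + w2^2 + w3^2 + w4^2 at an arbitrary point.
tape = []
inputs = [record(tape, w) for w in (1.0, 2.0, 3.0, 4.0)]
squares = [square(tape, w) for w in inputs]
total = squares[0]
for s in squares[1:]:
    total = add(tape, total, s)

backward(tape)
gradient = [w.derivative for w in inputs]
print(total.value, gradient)  # 30.0 [2.0, 4.0, 6.0, 8.0]
```

Note that each node stores only one number per child plus a single `derivative` value, rather than a full gradient vector.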

#### <span style="color:#a50e3e;">Example 3. </span> Computing the gradient of a simple quadratic using the reverse mode

Starting the forward pass, we first visit node $a$ and compute the partial derivative of $a$ with respect to its only child, in this case $w_1$.

\begin{equation}
\frac{\partial}{\partial w_{1}}a = 2w_1
\end{equation}

<p><img src="../../mlrefined_images/calculus_images/R_f_0.png" width="55%" height="auto"></p>

Next, we move on to node $b$ and compute its partial derivative with respect to its only child, $w_2$. 

\begin{equation}
\frac{\partial}{\partial w_{2}}b = 2w_2
\end{equation}

<p><img src="../../mlrefined_images/calculus_images/R_f_1.png" width="55%" height="auto"></p>

Finally, we visit node $c=a+b$ where we compute its partial derivatives with respect to all its children, here $a$ and $b$.  

\begin{equation}
\begin{aligned}
\frac{\partial}{\partial a}c &= 1\\
\frac{\partial}{\partial b}c &= 1
\end{aligned}
\end{equation}

<p><img src="../../mlrefined_images/calculus_images/R_f_3.png" width="55%" height="auto"></p>

Once the forward pass is complete, we traverse the graph backwards, visiting each node in the reverse order.       

<p><img src="../../mlrefined_images/calculus_images/R_f_2.png" width="55%" height="auto"></p>

At every step of the process we update the partial derivative of each child by multiplying it by the partial derivative of the parent node with respect to that child. Starting from the last node in our forward sweep, i.e., node $c$, we observe that $c$ has two children: $a$ and $b$. Therefore we update the derivative at $a$ by (left) multiplying it by $\frac{\partial}{\partial a}c$, and similarly update the derivative at $b$ by (left) multiplying it by $\frac{\partial}{\partial b}c$.        

<p><img src="../../mlrefined_images/calculus_images/R_r_0.png" width="55%" height="auto"></p>

Following the reverse order of our backward sweep, the next node in the graph to visit is $w_2$, whose derivative (which is simply $1$) gets multiplied by the updated derivative of its parent $b$, giving

\begin{equation}
\frac{\partial}{\partial b}c \frac{\partial}{\partial w_2}b \times 1
\end{equation}

<p><img src="../../mlrefined_images/calculus_images/R_r_1.png" width="55%" height="auto"></p>

Notice that this is precisely the partial derivative of $c$ with respect to $w_2$, or in other words, the second element of our desired gradient. To finish the backward sweep we finally visit node $w_1$, at which the forward sweep started, where we multiply its derivative, i.e., $1$, by the updated derivative of its parent $a$

\begin{equation}
\frac{\partial}{\partial a}c \frac{\partial}{\partial w_1}a \times 1
\end{equation}

which is precisely $\frac{\partial}{\partial w_1}c$ or the first element of our desired gradient.  

<p><img src="../../mlrefined_images/calculus_images/R_r_2.png" width="55%" height="auto"></p>
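
The steps of this example can be traced numerically. The sketch below hard-codes the graph of $w_1^2+w_2^2$ and mirrors the two sweeps one line at a time; the function name `reverse_mode_quadratic`, the variable names, and the evaluation point $(3, -2)$ are illustrative assumptions.

```python
def reverse_mode_quadratic(w1, w2):
    # Forward sweep: evaluate each node and store its partials w.r.t. its children.
    a = w1 ** 2
    da_dw1 = 2 * w1
    b = w2 ** 2
    db_dw2 = 2 * w2
    c = a + b
    dc_da = 1.0
    dc_db = 1.0

    # Backward sweep: visit nodes in reverse order, multiplying by the parent's partial.
    deriv_a = dc_da * da_dw1      # derivative at a, updated by its parent c
    deriv_b = dc_db * db_dw2      # derivative at b, updated by its parent c
    deriv_w2 = deriv_b * 1.0      # w2 starts with derivative 1
    deriv_w1 = deriv_a * 1.0      # w1 starts with derivative 1
    return c, (deriv_w1, deriv_w2)

value, gradient = reverse_mode_quadratic(3.0, -2.0)
print(value, gradient)  # 13.0 (6.0, -4.0)
```

The result agrees with the analytic gradient $(2w_1, 2w_2)$ evaluated at the same point.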

## Which mode of automatic differentiation should one choose?

Although both the forward and reverse modes of automatic differentiation are valid for any cost function formed using elementary functions and operations, one of the two can be considerably more efficient than the other depending on the structure and the number of inputs/outputs of the cost function. In general, the cost of the forward mode of AD depends on the number of input variables, while that of the reverse mode depends on the number of output variables. Since the cost functions of interest in machine learning, and particularly deep learning, typically have a large number of input variables (sometimes in the hundreds of millions) and only one or a few output variables, it makes practical sense to use the reverse mode of AD to avoid wastefully carrying extremely large and sparse gradients through the computational graph.
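
To get a rough feel for this difference, one can count how many derivative values each mode has to store for a sum of $N$ squares, which has a single output. This is a back-of-the-envelope sketch only: the function `stored_derivative_counts` is hypothetical, storage is used as a crude proxy for cost, and the graph is assumed to be built from $N$ squaring nodes followed by a chain of $N-1$ binary additions.

```python
def stored_derivative_counts(n_inputs):
    """Derivative values stored by each mode for w_1^2 + ... + w_N^2 (assumed graph)."""
    # Nodes: N inputs, N squares, N - 1 binary additions.
    n_nodes = n_inputs + n_inputs + (n_inputs - 1)
    # Forward mode: every node carries a full gradient of length N.
    forward = n_nodes * n_inputs
    # Reverse mode: one local partial per edge, plus one derivative value per node.
    n_edges = n_inputs + 2 * (n_inputs - 1)
    reverse = n_edges + n_nodes
    return forward, reverse

for n in (2, 4, 1000, 1000000):
    forward, reverse = stored_derivative_counts(n)
    print(n, forward, reverse)
```

Under these assumptions the forward-mode count grows quadratically with the number of inputs, while the reverse-mode count grows only linearly.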