## Chapter 3: Derivatives and Automatic Differentiation

#### <span style="color:#a50e3e;">Example 1. </span> Computing the gradient of a simple quadratic using the forward mode

We first decompose the quadratic function $w_1^2+w_2^2$ into its computational graph.

<p><img src="../../mlrefined_images/calculus_images/quad_0.png" width="65%" height="auto"></p>

Starting from node $a$, we compute the partial derivatives of $a\left(w_1\right)=w_1^2$ with respect to both $w_1$ and $w_2$:

$$
\frac{\partial}{\partial w_{1}}a = 2w_1\\
\frac{\partial}{\partial w_{2}}a = 0
$$

<p><img src="../../mlrefined_images/calculus_images/quad_1.png" width="65%" height="auto"></p>

Similarly, we compute the partial derivatives of $b\left(w_2\right)=w_2^2$ with respect to $w_1$ and $w_2$ as

$$
\frac{\partial}{\partial w_{1}}b = 0\\
\frac{\partial}{\partial w_{2}}b = 2w_2
$$

<p><img src="../../mlrefined_images/calculus_images/quad_2.png" width="65%" height="auto"></p>

Finally, the partial derivatives of $c(a,b)=a+b$ with respect to $w_1$ and $w_2$ are computed via the chain rule as 

$$
\frac{\partial}{\partial w_{1}}c = \frac{\partial}{\partial a}c\,\frac{\partial}{\partial w_{1}}a+\frac{\partial}{\partial b}c\,\frac{\partial}{\partial w_{1}}b = 1 \times 2w_1 + 1 \times 0 = 2w_1\\
\frac{\partial}{\partial w_{2}}c = \frac{\partial}{\partial a}c\,\frac{\partial}{\partial w_{2}}a+\frac{\partial}{\partial b}c\,\frac{\partial}{\partial w_{2}}b = 1 \times 0 + 1 \times 2w_2 = 2w_2  
$$

<p><img src="../../mlrefined_images/calculus_images/quad_3.png" width="65%" height="auto"></p>
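The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a library implementation: the helper names `square`, `add`, and `quadratic_forward`, and the choice to represent each node as a `(value, gradient)` pair, are assumptions made for this example only.

```python
# Forward mode on g(w1, w2) = w1**2 + w2**2.
# Each node is a (value, gradient) pair, with gradient = [d/dw1, d/dw2].
# All helper names here are hypothetical, chosen only for this sketch.

def square(node):
    value, grad = node
    # chain rule: d(u**2) = 2u * du
    return value ** 2, [2 * value * g for g in grad]

def add(node1, node2):
    (v1, g1), (v2, g2) = node1, node2
    # sum rule: d(u + v) = du + dv
    return v1 + v2, [x + y for x, y in zip(g1, g2)]

def quadratic_forward(w1, w2):
    # input nodes carry the standard basis vectors as their gradients
    w1_node = (w1, [1.0, 0.0])
    w2_node = (w2, [0.0, 1.0])
    a = square(w1_node)   # gradient [2*w1, 0]
    b = square(w2_node)   # gradient [0, 2*w2]
    c = add(a, b)         # gradient [2*w1, 2*w2]
    return a, b, c

a, b, c = quadratic_forward(3.0, -2.0)
print(c)  # (13.0, [6.0, -4.0])
```

Note that every node carries a gradient with one entry per input variable, even when most of those entries are zero.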

#### <span style="color:#a50e3e;">Example 2. </span> Computing the gradient of a multi-input function using the forward mode

In this example we illustrate how to compute the gradient of $\text{tanh}\left(w_4+w_3\text{tanh}\left(w_2+w_1\right)\right)$, a function of four input variables, using its computational graph shown below.

<p><img src="../../mlrefined_images/calculus_images/2layer_0.png" width="75%" height="auto"></p>

Our goal here is to compute $\nabla \, e=\left[
\frac{\partial}{\partial w_{1}}e,
\frac{\partial}{\partial w_{2}}e,
\frac{\partial}{\partial w_{3}}e,
\frac{\partial}{\partial w_{4}}e
\right]$. 

Starting from node $a$, having nodes $w_1$ and $w_2$ as its children, we compute the partial derivatives of $a\left(w_1,w_2\right)=w_1+w_2$ with respect to all four input variables, giving

$$
\frac{\partial}{\partial w_{1}}a = 1\\
\frac{\partial}{\partial w_{2}}a = 1\\
\frac{\partial}{\partial w_{3}}a = 0\\
\frac{\partial}{\partial w_{4}}a = 0
$$

or, more compactly, $\nabla \, a=\left[
1,
1,
0,
0
\right]$. 

<p><img src="../../mlrefined_images/calculus_images/2layer_1.png" width="75%" height="auto"></p>

Moving forward to the next parent node $b(a) = \text{tanh}\left(a\right)$, we compute the partial derivatives of $b$ with respect to all input variables

$$
\frac{\partial}{\partial w_{1}}b = \frac{\partial}{\partial a}b\,\frac{\partial}{\partial w_{1}}a=\left(1-\text{tanh}^2(a)\right)\times 1=1-\text{tanh}^2(a)\\
\frac{\partial}{\partial w_{2}}b = \frac{\partial}{\partial a}b\,\frac{\partial}{\partial w_{2}}a=\left(1-\text{tanh}^2(a)\right)\times 1=1-\text{tanh}^2(a)\\
\frac{\partial}{\partial w_{3}}b = \frac{\partial}{\partial a}b\,\frac{\partial}{\partial w_{3}}a=\left(1-\text{tanh}^2(a)\right)\times 0=0\\
\frac{\partial}{\partial w_{4}}b = \frac{\partial}{\partial a}b\,\frac{\partial}{\partial w_{4}}a=\left(1-\text{tanh}^2(a)\right)\times 0=0
$$

represented more compactly via $\nabla \, b=\left[
1-\text{tanh}^2(a),
1-\text{tanh}^2(a),
0,
0
\right]$. 

<p><img src="../../mlrefined_images/calculus_images/2layer_2.png" width="75%" height="auto"></p>

Next, looking back at the graph, we see the next parent node is $c(b,w_3) = b\,w_3$.  

Writing out its partial derivatives with respect to $w_1$ through $w_4$ we have

$$
\frac{\partial}{\partial w_{1}}c = \frac{\partial}{\partial b}c\,\frac{\partial}{\partial w_{1}}b+\frac{\partial}{\partial w_3}c\,\frac{\partial}{\partial w_{1}}w_3=w_3 \times \left(1-\text{tanh}^2(a)\right) + b \times 0 = w_3 \left(1-\text{tanh}^2(a)\right) \\
\frac{\partial}{\partial w_{2}}c = \frac{\partial}{\partial b}c\,\frac{\partial}{\partial w_{2}}b+\frac{\partial}{\partial w_3}c\,\frac{\partial}{\partial w_{2}}w_3= w_3 \times \left(1-\text{tanh}^2(a)\right) + b \times 0 =w_3\left(1-\text{tanh}^2(a)\right)\\
\frac{\partial}{\partial w_{3}}c = \frac{\partial}{\partial b}c\,\frac{\partial}{\partial w_{3}}b+\frac{\partial}{\partial w_3}c\,\frac{\partial}{\partial w_{3}}w_3 = w_3 \times 0 + b \times 1 = b\\
\frac{\partial}{\partial w_{4}}c = \frac{\partial}{\partial b}c\,\frac{\partial}{\partial w_{4}}b+\frac{\partial}{\partial w_3}c\,\frac{\partial}{\partial w_{4}}w_3 = w_3 \times 0 + b \times 0 = 0   
$$

or, put together, $\nabla \, c=\left[
w_3 \left(1-\text{tanh}^2(a)\right),
w_3\left(1-\text{tanh}^2(a)\right),
b,
0
\right]$.  

<p><img src="../../mlrefined_images/calculus_images/2layer_3.png" width="75%" height="auto"></p>

Next, we have $d(c,w_4) = c+w_4$, with its gradient computed as 

$$
\frac{\partial}{\partial w_{1}}d = \frac{\partial}{\partial c}d\,\frac{\partial}{\partial w_{1}}c+\frac{\partial}{\partial w_4}d\,\frac{\partial}{\partial w_{1}}w_4=1 \times w_3 \left(1-\text{tanh}^2(a)\right) + 1 \times 0 = w_3 \left(1-\text{tanh}^2(a)\right) \\
\frac{\partial}{\partial w_{2}}d = \frac{\partial}{\partial c}d\,\frac{\partial}{\partial w_{2}}c+\frac{\partial}{\partial w_4}d\,\frac{\partial}{\partial w_{2}}w_4= 1 \times w_3 \left(1-\text{tanh}^2(a)\right) + 1 \times 0 =w_3\left(1-\text{tanh}^2(a)\right)\\
\frac{\partial}{\partial w_{3}}d = \frac{\partial}{\partial c}d\,\frac{\partial}{\partial w_{3}}c+\frac{\partial}{\partial w_4}d\,\frac{\partial}{\partial w_{3}}w_4 = 1 \times b + 1 \times 0 = b\\
\frac{\partial}{\partial w_{4}}d = \frac{\partial}{\partial c}d\,\frac{\partial}{\partial w_{4}}c+\frac{\partial}{\partial w_4}d\,\frac{\partial}{\partial w_{4}}w_4 = 1 \times 0 + 1 \times 1 = 1   
$$

and written compactly as $\nabla \, d=\left[
w_3 \left(1-\text{tanh}^2(a)\right),
w_3\left(1-\text{tanh}^2(a)\right),
b,
1
\right]$.  

<p><img src="../../mlrefined_images/calculus_images/2layer_4.png" width="75%" height="auto"></p>

Finally, we compute the gradient of the last parent node $e(d) = \text{tanh}\left(d\right)$ as

$$
\frac{\partial}{\partial w_{1}}e = \frac{\partial}{\partial d}e\,\frac{\partial}{\partial w_{1}}d=\left(1-\text{tanh}^2(d)\right)w_3 \left(1-\text{tanh}^2(a)\right)\\
\frac{\partial}{\partial w_{2}}e = \frac{\partial}{\partial d}e\,\frac{\partial}{\partial w_{2}}d=\left(1-\text{tanh}^2(d)\right)w_3\left(1-\text{tanh}^2(a)\right)\\
\frac{\partial}{\partial w_{3}}e = \frac{\partial}{\partial d}e\,\frac{\partial}{\partial w_{3}}d=\left(1-\text{tanh}^2(d)\right)b\\
\frac{\partial}{\partial w_{4}}e = \frac{\partial}{\partial d}e\,\frac{\partial}{\partial w_{4}}d=1-\text{tanh}^2(d)
$$

<p><img src="../../mlrefined_images/calculus_images/2layer_5.png" width="75%" height="auto"></p>

Substituting $a=w_1+w_2$, $b=\text{tanh}\left(w_1+w_2\right)$, and $d=w_4+w_3\text{tanh}\left(w_1+w_2\right)$, we can write the final gradient in terms of the input variables as 

$$\nabla \, e=\left[
\left(1-\text{tanh}^2\left(w_4+w_3\text{tanh}\left(w_1+w_2\right)\right)\right)w_3 \left(1-\text{tanh}^2(w_1+w_2)\right),
\left(1-\text{tanh}^2\left(w_4+w_3\text{tanh}\left(w_1+w_2\right)\right)\right)w_3\left(1-\text{tanh}^2(w_1+w_2)\right),
\left(1-\text{tanh}^2\left(w_4+w_3\text{tanh}\left(w_1+w_2\right)\right)\right)\text{tanh}\left(w_1+w_2\right),
1-\text{tanh}^2\left(w_4+w_3\text{tanh}\left(w_1+w_2\right)\right)
\right]$$
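A hand derivation of this length is easy to get wrong, so it can help to check it numerically. The sketch below propagates a full gradient through the nodes $a$ to $e$ in order and compares the result with central finite differences. The function names `forward_gradient`, `e_of_w`, and `numerical_gradient`, the test point, and the step size `h` are all assumptions made for this illustration.

```python
import math

def forward_gradient(w1, w2, w3, w4):
    """Forward-mode gradient of e = tanh(w4 + w3 * tanh(w1 + w2))."""
    # node a = w1 + w2
    a = w1 + w2
    grad_a = [1.0, 1.0, 0.0, 0.0]
    # node b = tanh(a), with d tanh(a)/da = 1 - tanh(a)**2
    b = math.tanh(a)
    grad_b = [(1 - b ** 2) * g for g in grad_a]
    # node c = b * w3 (product rule; the gradient of w3 is [0, 0, 1, 0])
    c = b * w3
    grad_w3 = [0.0, 0.0, 1.0, 0.0]
    grad_c = [w3 * gb + b * g3 for gb, g3 in zip(grad_b, grad_w3)]
    # node d = c + w4 (the gradient of w4 is [0, 0, 0, 1])
    d = c + w4
    grad_w4 = [0.0, 0.0, 0.0, 1.0]
    grad_d = [gc + g4 for gc, g4 in zip(grad_c, grad_w4)]
    # node e = tanh(d)
    e = math.tanh(d)
    grad_e = [(1 - e ** 2) * g for g in grad_d]
    return e, grad_e

def e_of_w(w1, w2, w3, w4):
    # the function itself, evaluated directly
    return math.tanh(w4 + w3 * math.tanh(w1 + w2))

def numerical_gradient(w, h=1e-6):
    # central finite differences, one input variable at a time
    grads = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grads.append((e_of_w(*w_plus) - e_of_w(*w_minus)) / (2 * h))
    return grads

w = (0.5, -0.3, 1.2, 0.1)  # arbitrary test point
e, grad_e = forward_gradient(*w)
print(grad_e)
print(numerical_gradient(w))
```

The two printed lists should agree to several decimal places. Because $w_1$ and $w_2$ enter only through their sum, the first two entries of the gradient are equal.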

#### <span style="color:#a50e3e;">Example 3. </span> Computing the gradient of a simple quadratic using the reverse mode

The reverse mode of automatic differentiation (AD) consists of two passes: a forward pass and a backward pass. The forward pass resembles the forward mode of AD seen earlier in that the computational graph is traversed once in the forward direction, i.e., from left (input side) to right (output side). The difference is that we no longer compute the full gradient at each node as we move forward through the graph (as was the case with the forward mode). Instead, during the forward pass we compute and store only the partial derivatives of each parent with respect to its children. 

Starting the forward pass, we first visit node $a$ and compute the partial derivative of $a$ with respect to its only child, in this case $w_1$.

$$
\frac{\partial}{\partial w_{1}}a = 2w_1
$$

<p><img src="../../mlrefined_images/calculus_images/R_f_0.png" width="55%" height="auto"></p>

Next, we move on to node $b$ and compute its partial derivative with respect to its only child, $w_2$. 

$$
\frac{\partial}{\partial w_{2}}b = 2w_2
$$

<p><img src="../../mlrefined_images/calculus_images/R_f_1.png" width="55%" height="auto"></p>

Finally, we visit node $c=a+b$ where we compute its partial derivatives with respect to all its children, here $a$ and $b$.  

$$
\frac{\partial}{\partial a}c = 1\\
\frac{\partial}{\partial b}c = 1
$$

<p><img src="../../mlrefined_images/calculus_images/R_f_3.png" width="55%" height="auto"></p>

Once the forward pass is complete, we traverse the graph backwards, visiting each node in the reverse order.       

<p><img src="../../mlrefined_images/calculus_images/R_f_2.png" width="55%" height="auto"></p>

At every step of the process we update the partial derivative of each child by multiplying it by the partial derivative of the parent node with respect to that child. Starting from the last node in our forward sweep, i.e., node $c$, we observe that $c$ has two children: $a$ and $b$. Therefore we update the derivative at $a$ by (left) multiplying it by $\frac{\partial}{\partial a}c$, and similarly update the derivative at $b$ by (left) multiplying it by $\frac{\partial}{\partial b}c$.        

<p><img src="../../mlrefined_images/calculus_images/R_r_0.png" width="55%" height="auto"></p>

Following the reverse order of our backward sweep, the next node in the graph to visit is $w_2$, whose derivative (which is simply $1$) is multiplied by the derivative of its parent $b$, giving 

$$
\frac{\partial}{\partial b}c \frac{\partial}{\partial w_2}b \times 1
$$

<p><img src="../../mlrefined_images/calculus_images/R_r_1.png" width="55%" height="auto"></p>

Notice that this is precisely the partial derivative of $c$ with respect to $w_2$, in other words the second element of our desired gradient. To finish the backward sweep we visit node $w_1$, where the forward sweep started, and multiply its derivative, i.e., $1$, by the derivative of its parent $a$

$$
\frac{\partial}{\partial a}c \frac{\partial}{\partial w_1}a \times 1
$$

which is precisely $\frac{\partial}{\partial w_1}c$ or the first element of our desired gradient.  

<p><img src="../../mlrefined_images/calculus_images/R_r_2.png" width="55%" height="auto"></p>
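The two passes described in this example can be sketched as follows. This is a minimal hand-unrolled illustration, not a general reverse-mode implementation: the function name `quadratic_reverse` and the `_bar` suffix for quantities accumulated during the backward pass are naming assumptions made here.

```python
def quadratic_reverse(w1, w2):
    """Reverse-mode gradient of c = w1**2 + w2**2, unrolled by hand."""
    # Forward pass: evaluate each node and record only the partial
    # derivatives of each parent with respect to its own children.
    a = w1 ** 2
    da_dw1 = 2 * w1
    b = w2 ** 2
    db_dw2 = 2 * w2
    c = a + b
    dc_da = 1.0
    dc_db = 1.0

    # Backward pass: visit the nodes in reverse order, multiplying the
    # recorded partials together. The output starts with dc/dc = 1.
    c_bar = 1.0
    a_bar = c_bar * dc_da      # dc/da
    b_bar = c_bar * dc_db      # dc/db
    w2_bar = b_bar * db_dw2    # dc/dw2 = 2*w2
    w1_bar = a_bar * da_dw1    # dc/dw1 = 2*w1
    return c, [w1_bar, w2_bar]

value, gradient = quadratic_reverse(3.0, -2.0)
print(value, gradient)  # 13.0 [6.0, -4.0]
```

In contrast to the forward mode, no node stores a full gradient vector during the forward pass. Each node keeps one number per child, and the gradient is assembled only during the backward pass.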

## Which mode of automatic differentiation should one choose? 

Although both the forward and reverse modes of AD are valid for any cost function built from elementary functions and operations, one can be considerably more efficient than the other depending on the structure of the cost function and its number of inputs and outputs. In general, the cost of the forward mode grows with the number of input variables, while that of the reverse mode grows with the number of output variables. Since the cost functions of interest in machine learning, and particularly deep learning, typically have a large number of input variables (sometimes in the hundreds of millions) and only one or a few output variables, it makes practical sense to use the reverse mode of AD and avoid wastefully carrying extremely large and sparse gradients through the computational graph.
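To make the point about large and sparse gradients concrete, the toy sketch below extends the quadratic from Examples 1 and 3 to $N$ inputs, $w_1^2+\cdots+w_N^2$, and counts how many numbers each mode stores at its intermediate nodes. It is a rough storage count under simplifying assumptions, not a benchmark, and the function names are hypothetical.

```python
def forward_mode_sum_of_squares(w):
    """Forward mode: every squaring node carries a full length-N gradient."""
    n = len(w)
    node_grads = []
    for i, wi in enumerate(w):
        grad = [0.0] * n          # gradient of a_i = w_i**2 w.r.t. all inputs
        grad[i] = 2 * wi          # the only nonzero entry
        node_grads.append(grad)
    # the final sum node adds the gradients of its children entry by entry
    total_grad = [sum(column) for column in zip(*node_grads)]
    stored = sum(len(g) for g in node_grads)   # N * N numbers, mostly zeros
    return total_grad, stored

def reverse_mode_sum_of_squares(w):
    """Reverse mode: every node records one partial per child."""
    local = [2 * wi for wi in w]       # d a_i / d w_i, one per squaring node
    sum_partials = [1.0] * len(w)      # d c / d a_i for the final sum node
    grad = [p * l for p, l in zip(sum_partials, local)]
    stored = len(local) + len(sum_partials)    # 2 * N numbers
    return grad, stored

w = [1.0, 2.0, 3.0, 4.0]
print(forward_mode_sum_of_squares(w))
print(reverse_mode_sum_of_squares(w))
```

Both functions return the same gradient, but in this count the forward mode stores $N^2$ numbers at the squaring nodes while the reverse mode stores $2N$, a gap that widens quickly as $N$ grows.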