## Chapter 3: Derivatives and Automatic Differentiation

#### <span style="color:#a50e3e;">Example 2. </span> Computing the gradient of a multi-input function using the forward mode

In this Example we illustrate how to compute the gradient of $\text{tanh}\left(w_4+w_3\text{tanh}\left(w_2+w_1\right)\right)$, a function with four input variables, using its computational graph shown below.

<p><img src="../../mlrefined_images/calculus_images/2layer_0.png" width="75%" height="auto"></p>

Our goal here is to compute $\nabla \, e=\left[
\frac{\partial}{\partial w_{1}}e,
\frac{\partial}{\partial w_{2}}e,
\frac{\partial}{\partial w_{3}}e,
\frac{\partial}{\partial w_{4}}e
\right]$. 

Starting from node $a$, whose children are the nodes $w_1$ and $w_2$, we compute the partial derivatives of $a\left(w_1,w_2\right)=w_1+w_2$ with respect to all four input variables, giving

\begin{equation}
\frac{\partial}{\partial w_{1}}a = 1\\
\frac{\partial}{\partial w_{2}}a = 1\\
\frac{\partial}{\partial w_{3}}a = 0\\
\frac{\partial}{\partial w_{4}}a = 0
\end{equation}


or, more compactly, $\nabla \, a=\left[
1,
1,
0,
0
\right]$. 
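As a minimal sketch of this first step, the Python snippet below stores each node as a value together with its gradient with respect to $(w_1,w_2,w_3,w_4)$. The input values and the variable names (`grad_w1`, `grad_a`, and so on) are assumptions chosen for illustration; they do not appear in the example itself. Because $a$ is a sum, its gradient is simply the sum of its children's gradients.

```python
import numpy as np

# Hypothetical input values, chosen only for illustration.
w1, w2, w3, w4 = 0.5, -1.0, 2.0, 0.3

# The gradient of each input with respect to (w1, w2, w3, w4)
# is a standard basis vector: the rows of the 4x4 identity.
grad_w1, grad_w2, grad_w3, grad_w4 = np.eye(4)

# Node a = w1 + w2: the sum rule adds the children's gradients.
a = w1 + w2
grad_a = grad_w1 + grad_w2

print("a =", a)
print("grad a =", grad_a)
```

Seeding the inputs with basis vectors is what lets every later node carry a full four-entry gradient forward through the graph.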

<p><img src="../../mlrefined_images/calculus_images/2layer_1.png" width="75%" height="auto"></p>

Moving forward to the next parent node $b(a) = \text{tanh}\left(a\right)$, we use the chain rule to compute the partial derivatives of $b$ with respect to all input variables

\begin{equation}
\frac{\partial}{\partial w_{1}}b = \frac{\partial}{\partial a}b\,\frac{\partial}{\partial w_{1}}a=\left(1-\text{tanh}^2(a)\right)\times 1=1-\text{tanh}^2(a)\\
\frac{\partial}{\partial w_{2}}b = \frac{\partial}{\partial a}b\,\frac{\partial}{\partial w_{2}}a=\left(1-\text{tanh}^2(a)\right)\times 1=1-\text{tanh}^2(a)\\
\frac{\partial}{\partial w_{3}}b = \frac{\partial}{\partial a}b\,\frac{\partial}{\partial w_{3}}a=\left(1-\text{tanh}^2(a)\right)\times 0=0\\
\frac{\partial}{\partial w_{4}}b = \frac{\partial}{\partial a}b\,\frac{\partial}{\partial w_{4}}a=\left(1-\text{tanh}^2(a)\right)\times 0=0
\end{equation}

represented more compactly via $\nabla \, b=\left[
1-\text{tanh}^2(a),
1-\text{tanh}^2(a),
0,
0
\right]$. 
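The chain rule step for a single-input node can be sketched numerically as follows. The value of `a` and its gradient are hypothetical inputs carried over from a sum node $a=w_1+w_2$ evaluated at $w_1=0.5$, $w_2=-1$; the only fact used is that the derivative of $\text{tanh}(a)$ is $1-\text{tanh}^2(a)$.

```python
import numpy as np

# Hypothetical value and gradient of node a = w1 + w2 at w1 = 0.5, w2 = -1.0.
a = -0.5
grad_a = np.array([1.0, 1.0, 0.0, 0.0])

# Node b = tanh(a): scale the child's gradient by the local derivative.
b = np.tanh(a)
grad_b = (1.0 - b**2) * grad_a

print("b =", b)
print("grad b =", grad_b)
```

Note that the entries for $w_3$ and $w_4$ stay at zero, since $b$ does not depend on those inputs.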

<p><img src="../../mlrefined_images/calculus_images/2layer_2.png" width="75%" height="auto"></p>

Next, looking back at the graph, we see the next parent node is $c(b,w_3) = b\,w_3$.  

Writing out its partial derivatives with respect to $w_1$ through $w_4$ we have


\begin{equation}
\frac{\partial}{\partial w_{1}}c = \frac{\partial}{\partial b}c\,\frac{\partial}{\partial w_{1}}b+\frac{\partial}{\partial w_3}c\,\frac{\partial}{\partial w_{1}}w_3=w_3 \times \left(1-\text{tanh}^2(a)\right) + b \times 0 = w_3 \left(1-\text{tanh}^2(a)\right) \\
\frac{\partial}{\partial w_{2}}c = \frac{\partial}{\partial b}c\,\frac{\partial}{\partial w_{2}}b+\frac{\partial}{\partial w_3}c\,\frac{\partial}{\partial w_{2}}w_3= w_3 \times \left(1-\text{tanh}^2(a)\right) + b \times 0 =w_3\left(1-\text{tanh}^2(a)\right)\\
\frac{\partial}{\partial w_{3}}c = \frac{\partial}{\partial b}c\,\frac{\partial}{\partial w_{3}}b+\frac{\partial}{\partial w_3}c\,\frac{\partial}{\partial w_{3}}w_3 = w_3 \times 0 + b \times 1 = b\\
\frac{\partial}{\partial w_{4}}c = \frac{\partial}{\partial b}c\,\frac{\partial}{\partial w_{4}}b+\frac{\partial}{\partial w_3}c\,\frac{\partial}{\partial w_{4}}w_3 = w_3 \times 0 + b \times 0 = 0 
\end{equation}

or, put together, $\nabla \, c=\left[
w_3 \left(1-\text{tanh}^2(a)\right),
w_3\left(1-\text{tanh}^2(a)\right),
b,
0
\right]$.  
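For a node with two children, such as the product $c = b\,w_3$, each child contributes one term to the gradient. The sketch below illustrates this under the same hypothetical input values as before ($w_1=0.5$, $w_2=-1$, $w_3=2$); the variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical input values, chosen only for illustration.
w1, w2, w3 = 0.5, -1.0, 2.0

# Values and gradients of the earlier nodes a = w1 + w2 and b = tanh(a).
a = w1 + w2
b = np.tanh(a)
grad_b = (1.0 - b**2) * np.array([1.0, 1.0, 0.0, 0.0])

# The input w3 has a basis-vector gradient with respect to (w1, w2, w3, w4).
grad_w3 = np.array([0.0, 0.0, 1.0, 0.0])

# Node c = b * w3: product rule, one term per child.
c = b * w3
grad_c = w3 * grad_b + b * grad_w3

print("c =", c)
print("grad c =", grad_c)
```

The third entry of `grad_c` equals `b`, matching $\frac{\partial}{\partial w_3}c = b$ in the derivation.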

<p><img src="../../mlrefined_images/calculus_images/2layer_3.png" width="75%" height="auto"></p>

Next, we have $d(c,w_4) = c+w_4$, with its gradient computed as 

\begin{equation}
\frac{\partial}{\partial w_{1}}d = \frac{\partial}{\partial c}d\,\frac{\partial}{\partial w_{1}}c+\frac{\partial}{\partial w_4}d\,\frac{\partial}{\partial w_{1}}w_4=1 \times w_3 \left(1-\text{tanh}^2(a)\right) + 1 \times 0 = w_3 \left(1-\text{tanh}^2(a)\right) \\
\frac{\partial}{\partial w_{2}}d = \frac{\partial}{\partial c}d\,\frac{\partial}{\partial w_{2}}c+\frac{\partial}{\partial w_4}d\,\frac{\partial}{\partial w_{2}}w_4= 1 \times w_3 \left(1-\text{tanh}^2(a)\right) + 1 \times 0 =w_3\left(1-\text{tanh}^2(a)\right)\\
\frac{\partial}{\partial w_{3}}d = \frac{\partial}{\partial c}d\,\frac{\partial}{\partial w_{3}}c+\frac{\partial}{\partial w_4}d\,\frac{\partial}{\partial w_{3}}w_4 = 1 \times b + 1 \times 0 = b\\
\frac{\partial}{\partial w_{4}}d = \frac{\partial}{\partial c}d\,\frac{\partial}{\partial w_{4}}c+\frac{\partial}{\partial w_4}d\,\frac{\partial}{\partial w_{4}}w_4 = 1 \times 0 + 1 \times 1 = 1   
\end{equation}


and written compactly as $\nabla \, d=\left[
w_3 \left(1-\text{tanh}^2(a)\right),
w_3\left(1-\text{tanh}^2(a)\right),
b,
1
\right]$.  

<p><img src="../../mlrefined_images/calculus_images/2layer_4.png" width="75%" height="auto"></p>

Finally, we compute the gradient of the last parent node $e(d) = \text{tanh}\left(d\right)$, giving


\begin{equation}
\frac{\partial}{\partial w_{1}}e = \frac{\partial}{\partial d}e\,\frac{\partial}{\partial w_{1}}d=\left(1-\text{tanh}^2(d)\right)w_3 \left(1-\text{tanh}^2(a)\right)\\
\frac{\partial}{\partial w_{2}}e = \frac{\partial}{\partial d}e\,\frac{\partial}{\partial w_{2}}d=\left(1-\text{tanh}^2(d)\right)w_3\left(1-\text{tanh}^2(a)\right)\\
\frac{\partial}{\partial w_{3}}e = \frac{\partial}{\partial d}e\,\frac{\partial}{\partial w_{3}}d=\left(1-\text{tanh}^2(d)\right)b\\
\frac{\partial}{\partial w_{4}}e = \frac{\partial}{\partial d}e\,\frac{\partial}{\partial w_{4}}d=\left(1-\text{tanh}^2(d)\right)
\end{equation}

<p><img src="../../mlrefined_images/calculus_images/2layer_5.png" width="75%" height="auto"></p>

Substituting $a=w_1+w_2$, $b=\text{tanh}\left(w_1+w_2\right)$, and $d=w_4+w_3\text{tanh}\left(w_1+w_2\right)$, we can write the final gradient in terms of the input variables as 

\begin{equation}
\nabla \, e=\left(1-\text{tanh}^2\left(w_4+w_3\text{tanh}\left(w_1+w_2\right)\right)\right)\left[
w_3 \left(1-\text{tanh}^2(w_1+w_2)\right),
w_3\left(1-\text{tanh}^2(w_1+w_2)\right),
\text{tanh}\left(w_1+w_2\right),
1
\right]
\end{equation}
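To make the whole procedure concrete, the sketch below propagates a (value, gradient) pair through the nodes $a$, $b$, $c$, $d$, and $e$ in order, then compares the result with the closed-form gradient of $\text{tanh}\left(w_4+w_3\text{tanh}\left(w_1+w_2\right)\right)$ and with a central-difference approximation. The function names (`forward_mode`, `closed_form`, `central_difference`), the test point, and the step size `h` are all assumptions made for illustration.

```python
import numpy as np

def forward_mode(w):
    """Return e(w) and its gradient, propagating (value, gradient) pairs node by node."""
    w1, w2, w3, w4 = w
    g1, g2, g3, g4 = np.eye(4)           # gradients of the four inputs

    a, ga = w1 + w2, g1 + g2             # sum node
    b = np.tanh(a)
    gb = (1.0 - b**2) * ga               # tanh node, chain rule
    c, gc = b * w3, w3 * gb + b * g3     # product node
    d, gd = c + w4, gc + g4              # sum node
    e = np.tanh(d)
    ge = (1.0 - e**2) * gd               # tanh node, chain rule
    return e, ge

def closed_form(w):
    """Gradient of tanh(w4 + w3*tanh(w1 + w2)) written directly in the inputs."""
    w1, w2, w3, w4 = w
    inner = np.tanh(w1 + w2)
    outer_deriv = 1.0 - np.tanh(w4 + w3 * inner) ** 2
    inner_deriv = 1.0 - inner**2
    return outer_deriv * np.array([w3 * inner_deriv, w3 * inner_deriv, inner, 1.0])

def central_difference(w, h=1e-6):
    """Numerical gradient, used only as an independent sanity check."""
    grad = np.zeros(4)
    for i in range(4):
        step = np.zeros(4)
        step[i] = h
        grad[i] = (forward_mode(w + step)[0] - forward_mode(w - step)[0]) / (2.0 * h)
    return grad

# Hypothetical evaluation point.
w = np.array([0.5, -1.0, 2.0, 0.3])
e_val, grad_fwd = forward_mode(w)

print("e(w)            =", e_val)
print("forward mode    =", grad_fwd)
print("closed form     =", closed_form(w))
print("central diff.   =", central_difference(w))
```

The forward pass touches each node once and never forms the closed-form expression, which is the practical appeal of the method: the gradient is assembled from the local derivative rules alone.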