In [1]:
import numpy as np
import tensorflow as tf

## Partial Derivatives, Jacobian Matrix and Gradient
Suppose $\mathbf{\hat{y}}$ is a vector of length $m$ that is a function of another vector $\mathbf{w}$ of length $n$ (i.e. $\mathbf{\hat{y}} = \psi(\mathbf{w})$, where $\psi: \mathbb{R}^n \to \mathbb{R}^m$). The __Jacobian matrix__ (the $m \times n$ matrix of first-order partial derivatives) of $\mathbf{\hat{y}}$ with respect to $\mathbf{w}$ is:

$$
\mathbf{\hat{y}}=[\hat{y}_{i}]
=\begin{bmatrix}
    \hat{y}_{1}\\
    \hat{y}_{2}\\
    \vdots\\
    \hat{y}_{m}
\end{bmatrix}, \;\;\;
\mathbf{w}=[w_{j}]
=\begin{bmatrix}
    w_{1}\\
    w_{2}\\
    \vdots\\
    w_{n}
\end{bmatrix}, \;\;\;
\mathbf{J}_\psi(\mathbf{w})=\frac{\partial\mathbf{\hat{y}}}{\partial\mathbf{w}}=\begin{bmatrix}
    \frac{\partial\hat{y}_1}{\partial w_1} & \frac{\partial\hat{y}_1}{\partial w_2} & \cdots & \frac{\partial\hat{y}_1}{\partial w_n}\\
    \frac{\partial\hat{y}_2}{\partial w_1} & \frac{\partial\hat{y}_2}{\partial w_2} & \cdots & \frac{\partial\hat{y}_2}{\partial w_n}\\
    \vdots & \vdots & \ddots & \vdots\\ 
    \frac{\partial\hat{y}_m}{\partial w_1} & \frac{\partial\hat{y}_m}{\partial w_2} & \cdots & \frac{\partial\hat{y}_m}{\partial w_n}
\end{bmatrix}
$$

For instance, let $\mathbf{\hat{y}}=\mathbf{Xw} + b$, where $\mathbf{X}$ is an $m \times n$ matrix independent of $\mathbf{w}$ and $b$ is a scalar added to every component:

$$
\mathbf{X}=[x_{i,j}] 
=\begin{bmatrix}
    x_{1,1} & x_{1,2} & \cdots & x_{1,n}\\
    x_{2,1} & x_{2,2} & \cdots & x_{2,n}\\
    \vdots  & \vdots  & \ddots & \vdots \\
    x_{m,1} & x_{m,2} & \cdots & x_{m,n}\\
    \end{bmatrix}, \;\;\;
\mathbf{\hat{y}}
=\begin{bmatrix}
    x_{1,1} & x_{1,2} & \cdots & x_{1,n}\\
    x_{2,1} & x_{2,2} & \cdots & x_{2,n}\\
    \vdots & \vdots & \ddots & \vdots \\
    x_{m,1} & x_{m,2} & \cdots & x_{m,n}\\
\end{bmatrix}
\begin{bmatrix}
    w_{1}\\
    w_{2}\\
    \vdots\\
    w_{n}\\
\end{bmatrix} + b 
=\begin{bmatrix}
    w_1x_{1,1} + w_2x_{1,2} + \cdots + w_nx_{1,n} + b\\
    w_1x_{2,1} + w_2x_{2,2} + \cdots + w_nx_{2,n} + b\\
    \vdots\\
    w_1x_{m,1} + w_2x_{m,2} + \cdots + w_nx_{m,n} + b\\
\end{bmatrix}
$$

The Jacobian matrix of $\mathbf{\hat{y}}$ with respect to $\mathbf{w}$ is:
$$
\mathbf{J}_{\mathbf{Xw}+b}(\mathbf{w})=\mathbf{\frac{\partial \hat{y}}{\partial w}}
=\begin{bmatrix}
    x_{1,1} & x_{1,2} & \cdots & x_{1,n}\\
    x_{2,1} & x_{2,2} & \cdots & x_{2,n}\\
    \vdots & \vdots & \ddots & \vdots\\
    x_{m,1} & x_{m,2} & \cdots & x_{m,n}\\
\end{bmatrix} = \mathbf{X}
$$
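
The result $\mathbf{J}=\mathbf{X}$ can be checked numerically. The following is a minimal sketch, not part of the original notebook: `numerical_jacobian` is a hypothetical helper that estimates each column of the Jacobian with central differences, and the shapes, seed and bias value are arbitrary assumptions.

```python
import numpy as np

def numerical_jacobian(func, w, eps=1e-4):
    """Central-difference estimate of the Jacobian of func at w (hypothetical helper)."""
    w = np.asarray(w, dtype=float)
    y0 = np.asarray(func(w))
    jac = np.zeros((y0.size, w.size))
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        # Column j holds the partial derivatives of every output with respect to w_j.
        jac[:, j] = (func(w + step) - func(w - step)) / (2 * eps)
    return jac

rng = np.random.default_rng(0)
m, n = 4, 5
X = rng.uniform(-1, 1, size=(m, n))
w = rng.uniform(-1, 1, size=n)
b = 0.5  # assumed scalar bias, broadcast to every component

J = numerical_jacobian(lambda v: X @ v + b, w)
print(J.shape)            # (4, 5): one row per output, one column per weight
print(np.allclose(J, X))  # the estimate should agree with X
```

Because $\mathbf{Xw}+b$ is linear in $\mathbf{w}$, the central difference is exact up to floating-point rounding, and the bias drops out of every difference.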

Suppose $f(\mathbf{\hat{y}})=\mathbf{\hat{y}}^T\mathbf{A}\mathbf{\hat{y}}$, where $\mathbf{A} \in \mathbb{R}^{m\times m}$ is a matrix independent of $\mathbf{\hat{y}}$. Expanding the product, $f(\mathbf{\hat{y}})$ is:

$$
\begin{align}
f(\mathbf{\hat{y}})
&=\begin{bmatrix}
\hat{y}_1 & \hat{y}_2 & \cdots & \hat{y}_m
\end{bmatrix}
\begin{bmatrix}
    a_{1,1} & a_{1,2} & \cdots & a_{1,m}\\
    a_{2,1} & a_{2,2} & \cdots & a_{2,m}\\
    \vdots & \vdots & \ddots & \vdots\\
    a_{m,1} & a_{m,2} & \cdots & a_{m,m}\\
\end{bmatrix}
\begin{bmatrix} 
    \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_m
\end{bmatrix}\\
&=\begin{bmatrix} 
    \sum\limits_{i=1}^{m}\hat{y}_ia_{i,1} & \sum\limits_{i=1}^{m}\hat{y}_ia_{i,2} & \cdots & \sum\limits_{i=1}^{m}\hat{y}_ia_{i,m}\\ 
\end{bmatrix}
\begin{bmatrix} 
    \hat{y}_1 \\  \hat{y}_2 \\ \vdots \\ \hat{y}_m\\ 
\end{bmatrix}\\
&=\sum\limits_{j=1}^{m}\hat{y}_j\sum\limits_{i=1}^{m}\hat{y}_ia_{i,j}\\
\end{align}
$$

The __gradient__ (the vector of partial derivatives of a function $f: \mathbb{R}^m \to \mathbb{R}$ with respect to each component of its length-$m$ argument) of the function $f(\mathbf{\hat{y}})$ above with respect to $\mathbf{\hat{y}}$ is:

$$
\begin{align}
\nabla_{\hat{y}}f
&=\begin{bmatrix}
    \frac{\partial f}{\partial\hat{y}_1} & \frac{\partial f}{\partial\hat{y}_2} & \cdots & \frac{\partial f}{\partial\hat{y}_m} 
\end{bmatrix}^T\\
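&=\begin{bmatrix}
    \sum\limits_{i=1}^{m}(a_{i,1} + a_{1,i})\hat{y}_i & \sum\limits_{i=1}^{m}(a_{i,2} + a_{2,i})\hat{y}_i & \cdots & \sum\limits_{i=1}^{m}(a_{i,m} + a_{m,i})\hat{y}_i
\end{bmatrix}^T\\
&=(\mathbf{A} + \mathbf{A}^T)\mathbf{\hat{y}}\\
&=2\mathbf{A}\mathbf{\hat{y}}=[2\mathbf{\hat{y}}^T\mathbf{A}]^T \quad \text{if } \mathbf{A} \text{ is symmetric } (\mathbf{A}=\mathbf{A}^T)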
\end{align}
$$
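
For a general (not necessarily symmetric) $\mathbf{A}$, the gradient of the quadratic form is $(\mathbf{A}+\mathbf{A}^T)\mathbf{\hat{y}}$, which reduces to $2\mathbf{A}\mathbf{\hat{y}}$ only when $\mathbf{A}$ is symmetric. The sketch below, with an arbitrary random $\mathbf{A}$ and an assumed step size `eps`, compares a central-difference gradient against that closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
A = rng.uniform(-1, 1, size=(m, m))  # deliberately not symmetric
y_hat = rng.uniform(-1, 1, size=m)

def f(v):
    # Quadratic form v^T A v
    return v @ A @ v

eps = 1e-4
# Perturb one coordinate at a time (rows of the identity are the unit vectors).
num_grad = np.array([
    (f(y_hat + eps * e) - f(y_hat - eps * e)) / (2 * eps)
    for e in np.eye(m)
])

general = (A + A.T) @ y_hat
print(np.allclose(num_grad, general))  # general closed form vs. numerical estimate

# Replacing A by its symmetric part leaves f unchanged, and for a symmetric
# matrix the gradient takes the familiar form 2 * A_sym @ y_hat.
A_sym = (A + A.T) / 2
print(np.allclose(2 * A_sym @ y_hat, general))
```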

Suppose we have another function that maps a vector to a scalar, $L(\mathbf{\hat{y}}) = (\mathbf{y} - \mathbf{\hat{y}})^T(\mathbf{y} - \mathbf{\hat{y}})$, where $\mathbf{y}$ is a vector of length $m$ that does not depend on $\mathbf{\hat{y}}$ ($L$ is the sum of squared deviations between $\mathbf{y}$ and $\mathbf{\hat{y}}$):

$$
\mathbf{y}=[y_{i}]=\begin{bmatrix}
    y_{1}\\
    y_{2}\\
    \vdots\\
    y_{m}\\
\end{bmatrix}, \;\;\;
\mathbf{\hat{y}}=[\hat{y}_{i}]=\begin{bmatrix}
    \hat{y}_{1}\\
    \hat{y}_{2}\\
    \vdots\\
    \hat{y}_{m}\\
\end{bmatrix}=\mathbf{Xw}+b, \;\;\;
L(\mathbf{\hat{y}})=(\mathbf{y} - \mathbf{\hat{y}})^T(\mathbf{y} - \mathbf{\hat{y}})=\sum\limits_{i=1}^{m}(y_i - \hat{y}_i)^2
$$

We can compute the gradient of this function with respect to $\mathbf{\hat{y}}$ by applying the chain rule to each component:

$$
\begin{align}
\nabla_{\hat{y}}L&=\begin{bmatrix} \frac{\partial L}{\partial\hat{y}_1} & \frac{\partial L}{\partial\hat{y}_2} & \cdots & \frac{\partial L}{\partial\hat{y}_m} \end{bmatrix}^T\\
&=\begin{bmatrix} 
    \frac{\partial L}{\partial(y_1 - \hat{y}_1)}\frac{\partial(y_1 - \hat{y}_1)}{\partial\hat{y}_1} & 
    \frac{\partial L}{\partial(y_2 - \hat{y}_2)}\frac{\partial(y_2 - \hat{y}_2)}{\partial\hat{y}_2} & 
    \cdots & 
    \frac{\partial L}{\partial(y_m - \hat{y}_m)}\frac{\partial(y_m - \hat{y}_m)}{\partial\hat{y}_m} \end{bmatrix}^T\\
&=\begin{bmatrix} 
    2(\hat{y}_1 - y_1) & 
    2(\hat{y}_2 - y_2) & 
    \cdots & 
    2(\hat{y}_m - y_m) \end{bmatrix}^T\\
&=2(\mathbf{\hat{y}} - \mathbf{y})
\end{align}
$$
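
As a quick sanity check of $\nabla_{\hat{y}}L = 2(\mathbf{\hat{y}} - \mathbf{y})$, the sketch below compares it with a central-difference estimate. The vectors are arbitrary random values and `squared_error` is a name introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
y = rng.uniform(-5, 5, size=m)
y_hat = rng.uniform(-5, 5, size=m)

def squared_error(pred):
    # L(pred) = (y - pred)^T (y - pred)
    residual = y - pred
    return residual @ residual

eps = 1e-4
num_grad = np.array([
    (squared_error(y_hat + eps * e) - squared_error(y_hat - eps * e)) / (2 * eps)
    for e in np.eye(m)
])

analytic = 2 * (y_hat - y)
print(np.allclose(num_grad, analytic))
```

Note the sign: the gradient points from $\mathbf{y}$ toward $\mathbf{\hat{y}}$, so moving $\mathbf{\hat{y}}$ against it reduces the loss.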

Moreover, we can compute the gradient of $L$ with respect to $\mathbf{w}$ using the multivariate chain rule:

$$
\begin{align}
\nabla_{\mathbf{w}}L&=\begin{bmatrix} \frac{\partial L}{\partial w_1} & \frac{\partial L}{\partial w_2} & \cdots & \frac{\partial L}{\partial w_n} \end{bmatrix}^T\\
&=\begin{bmatrix}
    \frac{\partial L}{\partial\mathbf{\hat{y}}}\frac{\partial\mathbf{\hat{y}}}{\partial w_1} & \frac{\partial L}{\partial\mathbf{\hat{y}}}\frac{\partial\mathbf{\hat{y}}}{\partial w_2} & \cdots &\frac{\partial L}{\partial\mathbf{\hat{y}}}\frac{\partial\mathbf{\hat{y}}}{\partial w_n} 
\end{bmatrix}^T\\
&=\begin{bmatrix}
    \sum\limits_{i=1}^{m}\frac{\partial L}{\partial\hat{y}_i}\frac{\partial\hat{y}_i}{\partial w_1} & \sum\limits_{i=1}^{m}\frac{\partial L}{\partial\hat{y}_i}\frac{\partial\hat{y}_i}{\partial w_2} & \cdots & \sum\limits_{i=1}^{m}\frac{\partial L}{\partial\hat{y}_i}\frac{\partial\hat{y}_i}{\partial w_n} 
\end{bmatrix}^T\\
&=[2(\mathbf{\hat{y}} - \mathbf{y})^T\frac{\partial\mathbf{\hat{y}}}{\partial\mathbf{w}}]^T\\
&=[2(\mathbf{\hat{y}} - \mathbf{y})^T\mathbf{X}]^T\\
&=2\mathbf{X}^T(\mathbf{\hat{y}} - \mathbf{y})
\end{align}
$$

In [2]:
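tf.random.set_seed(0)
# Random data: X is (m, n) = (4, 5), w is (n, 1), y is (m, 1).
X = tf.random.uniform((4, 5), minval=-1, maxval=1)
w = tf.Variable(tf.random.uniform((5, 1), minval=-1, maxval=1))
y = tf.random.uniform((4, 1), minval=-5, maxval=5)  # targets are not trained, so a plain tensor suffices

# Record the forward pass so TensorFlow can differentiate the loss with respect to w.
# The bias b is omitted here; it does not change the form of the gradient with respect to w.
with tf.GradientTape() as g:
    y_hat = tf.matmul(X, w)
    loss = tf.reduce_sum(tf.pow(y - y_hat, 2))
tf_autograd = g.gradient(loss, w)

# Gradient derived above: [2 (y_hat - y)^T X]^T
derived_grad = tf.transpose(2 * tf.matmul(tf.transpose(y_hat - y), X))

print('Are auto-computed gradient and self-derived gradient the same?')
print(f'{np.allclose(derived_grad.numpy(), tf_autograd.numpy())}')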

Are auto-computed gradient and self-derived gradient the same?
True
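
The TensorFlow check above leaves out the bias term. As a complementary sketch using only NumPy, the block below includes an assumed scalar bias `b` and compares $2\mathbf{X}^T(\mathbf{\hat{y}} - \mathbf{y})$ with a central-difference gradient of the loss. The seed, shapes and bias value are arbitrary, and `loss` is a helper introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 5
X = rng.uniform(-1, 1, size=(m, n))
w = rng.uniform(-1, 1, size=n)
y = rng.uniform(-5, 5, size=m)
b = 0.3  # assumed scalar bias

def loss(weights):
    # L(w) = (y - (X w + b))^T (y - (X w + b))
    residual = y - (X @ weights + b)
    return residual @ residual

# Closed form from the chain rule, with y_hat = X w + b.
analytic = 2 * X.T @ (X @ w + b - y)

eps = 1e-4
numeric = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(numeric, analytic))
```

The bias shifts $\mathbf{\hat{y}}$ and therefore the value of the gradient, but the formula itself is unchanged because $b$ does not depend on $\mathbf{w}$.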
