# Matrix Differentiation Rules

It is often convenient to express derivatives in matrix form and to use vectorized operations when updating the weights of a neural network, because vectorized operations (for example in `numpy`) are much faster than Python loops. This notebook summarizes a few useful rules for matrix differentiation.

:::{#def-m-from-vec-n}
Let $f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ be a function that maps an $n$-dimensional vector to an $m$-dimensional vector. Then the derivative of $f$ with respect to a vector $x$ is a matrix (called the Jacobian matrix of $f$) of shape $m \times n$ and is given by

$$
\frac{\partial f}{\partial x} = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{pmatrix}
$$
:::
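The layout in @def-m-from-vec-n (rows index outputs, columns index inputs) can be checked numerically. The following is a minimal sketch using central finite differences; the helper `numerical_jacobian` and the test function `g` are hypothetical names chosen for illustration and are not part of the original notebook.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f at x with central differences."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(f(x)).size
    J = np.zeros((m, x.size))
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        # Column k holds the partial derivatives with respect to x_k.
        J[:, k] = (np.asarray(f(x + step)) - np.asarray(f(x - step))) / (2 * eps)
    return J

# Hypothetical test function g: R^2 -> R^3 (an assumption for this sketch)
def g(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

x0 = np.array([1.0, 2.0])
J = numerical_jacobian(g, x0)
print(J.shape)  # (3, 2)
```

The loop over input coordinates is only meant for checking small cases; it is not a substitute for analytic or automatic differentiation in training code.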


:::{#exm-1-to-2}
## $f: \mathbb{R}^{1} \to \mathbb{R}^{2}$
As an example, let's look at a function $f: \mathbb{R}^{1} \rightarrow \mathbb{R}^{2}$ that maps a scalar to a 2-dimensional vector. The function is defined as

$$
f(x) = \begin{pmatrix} x^2 \\ x^3 \end{pmatrix}
$$

According to @def-m-from-vec-n the Jacobian matrix of $f$ is a $2 \times 1$ matrix and is given by

$$
\frac{\partial f}{\partial x} = \begin{pmatrix}
\frac{\partial f_1}{\partial x} \\ \frac{\partial f_2}{\partial x}
\end{pmatrix} = \begin{pmatrix}
2x \\ 3x^2
\end{pmatrix}
$$
:::
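As a quick sanity check of this example, the analytic $2 \times 1$ Jacobian can be compared against a finite-difference approximation. This is a sketch; the evaluation point `x0` and the step size `eps` are arbitrary choices made for illustration.

```python
import numpy as np

def f(x):
    # f maps a scalar to a 2-dimensional vector
    return np.array([x**2, x**3])

def jacobian_f(x):
    # Analytic Jacobian from the example, shape (2, 1)
    return np.array([[2 * x], [3 * x**2]])

x0 = 1.5    # arbitrary evaluation point (assumption)
eps = 1e-6  # finite-difference step (assumption)

# Central difference; reshape to match the 2 x 1 layout
numeric = ((f(x0 + eps) - f(x0 - eps)) / (2 * eps)).reshape(2, 1)
matches = np.allclose(numeric, jacobian_f(x0), atol=1e-5)
print(jacobian_f(x0).shape)  # (2, 1)
```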




:::{#def-matrix-to-scalar}

Let $f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}$ be a function that maps a matrix to a scalar (e.g. a loss function). Then the derivative of $f$ with respect to a matrix $W$ is a matrix of the same shape as $W$ and is given by

$$
\frac{\partial f}{\partial W} = \begin{pmatrix}
\frac{\partial f}{\partial W_{11}} & \cdots & \frac{\partial f}{\partial W_{1n}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f}{\partial W_{m1}} & \cdots & \frac{\partial f}{\partial W_{mn}}
\end{pmatrix}
$$
:::
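To illustrate @def-matrix-to-scalar, the sketch below uses a hypothetical squared-error loss $f(W) = \tfrac{1}{2}\lVert Wx - t \rVert^2$, whose gradient with respect to $W$ is $(Wx - t)x^T$. The loss, the vectors `x` and `t`, and the random seed are assumptions made for illustration. The code checks that the gradient has the same shape as `W` and agrees with entrywise finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
x = rng.normal(size=2)
t = rng.normal(size=3)

def loss(W):
    # Hypothetical scalar loss: 0.5 * ||W x - t||^2
    r = W @ x - t
    return 0.5 * r @ r

# Analytic gradient, same shape as W
grad_analytic = np.outer(W @ x - t, x)

# Entry (i, j) of the numeric gradient is d loss / d W_ij
grad_numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        grad_numeric[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)

print(grad_analytic.shape)  # (3, 2)
```

Because the gradient has the same shape as `W`, a gradient-descent step can be written as a single array expression such as `W - lr * grad_analytic`, where `lr` would be a learning rate.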



:::{#thm-matrix-vector-by-vector}
## Matrix Multiplication by Vector

Let $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^{n}$. Then, in the layout of @def-m-from-vec-n, the derivative of $Ax$ with respect to $x$ is $A$. (Under the transposed, denominator-layout convention the same result is written $A^T$.)

$$
\frac{\partial Ax}{\partial x} = A
$$
:::
:::{.proof}

Let $y = Ax$. Then $y_i = \sum_{j=1}^{n} A_{ij}x_j$. Only the term with $j = k$ depends on $x_k$, so the derivative of $y_i$ with respect to $x_k$ is

$$
\frac{\partial y_i}{\partial x_k} = A_{ik}
$$

Therefore, the entry in row $i$ and column $k$ of the Jacobian is $A_{ik}$, and the derivative of $Ax$ with respect to $x$ is

$$
\frac{\partial Ax}{\partial x} = \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix} = A
$$
:::
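The entrywise result $\partial y_i / \partial x_k = A_{ik}$ from the proof can be checked numerically. In the sketch below, the Jacobian is laid out as in @def-m-from-vec-n (rows index the outputs $y_i$, columns index the inputs $x_k$), so it has shape $m \times n$ and is compared against `A` directly; the transposed matrix would appear only under the opposite layout convention. The dimensions and random seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
A = rng.normal(size=(m, n))
x = rng.normal(size=n)

eps = 1e-6
J = np.zeros((m, n))
for k in range(n):
    e = np.zeros(n)
    e[k] = eps
    # Column k: change in y = A x when only x_k is perturbed
    J[:, k] = (A @ (x + e) - A @ (x - e)) / (2 * eps)

print(J.shape)  # (3, 4)
jacobian_matches_A = np.allclose(J, A, atol=1e-6)
```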



:::{#thm-vector-by-vector}
## Vector Derivative with Respect to Itself

Let $x \in \mathbb{R}^{n}$. Then the derivative of $x$ with respect to $x$ is an $n \times n$ matrix with 1s on the diagonal and 0s elsewhere.

$$
\frac{\partial x}{\partial x} = \begin{pmatrix}
\frac{\partial x_1}{\partial x_1} & \cdots & \frac{\partial x_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial x_n}{\partial x_1} & \cdots & \frac{\partial x_n}{\partial x_n}
\end{pmatrix} = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}
 = I_n
$$

where $I_n$ is the identity matrix of size $n \times n$.

:::

:::{#thm-elementwise-function-vector}
## Derivative of Elementwise Applied Function with Respect to Vector

Let $f: \mathbb{R} \to \mathbb{R}$ be a scalar function that is applied elementwise to a vector $x \in \mathbb{R}^{n}$, so that $y_i = f(x_i)$. Because $y_i$ depends only on $x_i$, all off-diagonal entries of the Jacobian vanish, and the derivative of $y = f(x)$ with respect to $x$ is a diagonal matrix with $\frac{\partial y_i}{\partial x_i} = f'(x_i)$ on the diagonal.

$$
\frac{\partial y}{\partial x} = \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \frac{\partial y_n}{\partial x_n}
\end{pmatrix}
$$

:::

The multiplication of a diagonal matrix with a vector is equivalent to elementwise multiplication of the vector with the diagonal elements of the matrix.
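The equivalence between multiplying by a diagonal matrix and elementwise multiplication can be seen in a few lines of `numpy`. This sketch uses `tanh` as a stand-in elementwise function (an assumption for illustration), whose derivative is $1 - \tanh^2(x)$; the vectors `x` and `v` are arbitrary.

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])
v = np.array([3.0, -2.0, 0.1])

d = 1.0 - np.tanh(x) ** 2   # f'(x_i) for f = tanh
J = np.diag(d)              # n x n diagonal Jacobian

dense = J @ v               # full matrix-vector product
elementwise = d * v         # same result without building J

same = np.allclose(dense, elementwise)
```

The elementwise form avoids storing the $n \times n$ matrix, which matters when $n$ is the width of a network layer.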


:::{#exm-relu-times-vec}
## ReLU Applied Elementwise to a Vector multiplied by a Vector

Let $f: \mathbb{R} \to \mathbb{R}$ be some scalar function (such as ReLU, which is differentiable everywhere except at $0$) that is applied elementwise to $x \in \mathbb{R}^{n}$, and let $\delta \in \mathbb{R}^{n}$. The derivative of $y = f(x)$ with respect to $x$ is a diagonal matrix with the derivative of $f$ at $x_i$ on the diagonal, so multiplying it by $\delta$ reduces to an elementwise product:

$$
\frac{\partial y}{\partial x} \delta = \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \frac{\partial y_n}{\partial x_n}
\end{pmatrix} \begin{pmatrix} \delta_1 \\ \vdots \\ \delta_n \end{pmatrix}
= \begin{pmatrix} \delta_1 \frac{\partial y_1}{\partial x_1} \\ \vdots \\ \delta_n \frac{\partial y_n}{\partial x_n} \end{pmatrix}
= \delta \odot \begin{pmatrix} \frac{\partial y_1}{\partial x_1} \\ \vdots \\ \frac{\partial y_n}{\partial x_n} \end{pmatrix}
$$

The symbol $\odot$ denotes elementwise multiplication.
:::
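For ReLU specifically, the product of the diagonal Jacobian with $\delta$ can be sketched as follows. The helper names `relu` and `relu_grad` and the sample vectors are hypothetical, and the derivative at exactly $0$ is set to $0$ here, which is a common convention rather than something the text above prescribes.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # 1 where x > 0, else 0; the value at x == 0 is a convention (assumption)
    return (x > 0).astype(float)

x = np.array([-2.0, 0.5, 3.0, -0.1])
delta = np.array([0.1, -0.4, 2.0, 1.5])

# Explicit diagonal Jacobian times delta
via_matrix = np.diag(relu_grad(x)) @ delta

# Equivalent elementwise (Hadamard) product
via_hadamard = delta * relu_grad(x)
```

Entries of `delta` at positions where `x` is negative are zeroed out, while the others pass through unchanged.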

