In [1]:
from sympy import *
U=IndexedBase("U")
V=IndexedBase("V")
X=IndexedBase("X")
Y=IndexedBase("Y")

x=IndexedBase("x")
u=IndexedBase("u")

i,j,k,l,m,n=symbols("i,j,k,l,m,n", integer=True)

<div style="float:center;width:100%;text-align: center;"><strong style="height:60px;color:darkred;font-size:40px;">Matrix Derivatives</strong></div>

# 1. Introduction 

**Convention:** all vectors are **column vectors.** Different fields use different conventions, so check which one a source uses before applying its results.

## 1.1 Einstein Summation Convention

One can easily write matrix expressions by writing sums over the entries of the matrices and vectors.

E.g., the matrix vector product $A x$, where $A = \left( a_{i j} \right)$ and $x = \left( x_i \right)$,<br>$\qquad$ can be written as

$$\left( A x \right)_i = \sum_{k=1}^{K} a_{i k} x_k$$

This normally results in a proliferation of summation symbols which can readily be omitted.

#### **Einstein Summation Convention:**

* if an **index variable** appears twice in a single term and is not otherwise defined,<br>
$\qquad$ it implies summation of that term over all the values of the index
* an index that is not summed over (a **free index**) should appear only once in each term.<br>
$\qquad$ It will generally appear as a free index in each term of an equation.

##### **Examples:**

* The term $a_{i k} x_k$ has a repeated index $k$ that is summed over.<br>
The index $i$ is free. This term thus stands for $\sum_k a_{i k} x_k$.<br>
It equals the $i^{th}$ entry of the matrix vector product $A x$.
* The equation $p_i = a_{i j} x_j - b_i$ in matrix vector notation is $p = A x - b$.
* The $trace(A) = a_{i i}$

##### **Kronecker Delta:**

$\qquad \delta_{i j} = \left\{ \begin{align} 1 \quad & i = j \\ 0 \quad &\text{otherwise}.\end{align}\right.$

* In a term with a Kronecker delta, a sum over a repeated index of $\delta_{i j}$ reduces to a single entry corresponding to the other index, e.g.<br>
$\qquad a_{i k} \delta_{ j k} = a_{ i j}$

* This is useful with partial derivatives, for example:<br><br>$\qquad$  $\frac{\partial x_i}{\partial x_j} = \delta_{i j}\quad$ and $\;\;$
 $\frac{\partial x_{i j}}{\partial x_{k l}} = \delta_{i k} \delta_{ j l }$ 
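The convention maps directly onto explicit loops: a repeated index becomes a summation loop, and a free index becomes an output axis. The following is a minimal sketch in plain Python; the values of `A`, `x`, `b` and the helper `delta` are assumptions chosen only for illustration.

```python
# Illustrative values; A, x, b and the helper names are assumptions for this sketch.
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 10]]
x = [1, -1, 2]
b = [1, 0, 2]
n = len(x)

def delta(i, j):
    """Kronecker delta."""
    return 1 if i == j else 0

# (A x)_i = a_{ik} x_k : i is free, k is summed
Ax = [sum(A[i][k] * x[k] for k in range(n)) for i in range(n)]

# p_i = a_{ij} x_j - b_i : the free index i appears once in every term
p = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]

# trace(A) = a_{ii} : the repeated index i is summed
trace_A = sum(A[i][i] for i in range(n))

# a_{ik} delta_{jk} = a_{ij} : the delta collapses the sum over k
contracted = [[sum(A[i][k] * delta(j, k) for k in range(n)) for j in range(n)]
              for i in range(n)]

print(Ax, p, trace_A, contracted == A)
```

Each list comprehension has one `for` per free index and one `sum` per repeated index, which is a quick way to check that an index expression is well formed.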


##### **Levi-Civita Density:**

$\qquad$ See [section 1.1.2 in 14_Determinants.ipynb](14_Determinants.ipynb)

* The Leibniz formula for the determinant is $\det(A) = \epsilon_{j_1 j_2\dots j_n}\; a_{1 j_1} a_{2 j_2} \dots  a_{n j_n}$.

## 1.2 Derivatives

The Einstein Summation Convention is useful for computing derivatives. For example

$\qquad
\begin{align}
\left( \frac{\partial}{\partial x_1} A x \right)_i &=\ \frac{\partial}{\partial x_1} a_{i j} x_j \\
                                &=\ a_{i j} \frac{\partial x_j}{\partial x_1} \\
                                &=\ a_{i j} \delta_{j 1} \\
                                &=\ a_{i 1}.
\end{align}$

Index notation is convenient and easy to use since we are dealing with scalar equations,<br>
$\qquad$ but reassembling the results into matrices provides a higher-level view.

**Remark:** There are a number of **vector and matrix layout conventions** in use.<br>$\qquad$ In the following, we will follow the [Numerator Layout Convention](https://en.wikipedia.org/wiki/Matrix_calculus):<br>
* vectors in numerators are laid out in columns
* vectors in denominators are laid out in rows

**Caveat Emptor:** When consulting texts or using software packages, it is **important to check the conventions** used!

Below are **definitions of the vector and matrix views** of common derivatives.<br>
$\qquad$ To emphasize the distinction between vectors and scalars, vectors will be written in bold face.<br>
$\qquad$ For a step by step introduction, see [Ritvik Kharkar](https://www.youtube.com/watch?v=e73033jZTCI)

### 1.2.1 Vector by Scalar Derivative

Given a vector $\mathbf{y} = \begin{pmatrix} y_1\\y_2\\ \dots \\ y_n \end{pmatrix} \in \mathbb{R}^n,\;\;$
define $\frac{\partial \mathbf{y}}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x} \\
                                                                 \frac{\partial y_2}{\partial x} \\
                                                                 \dots \\
                                                                 \frac{\partial y_n}{\partial x}  \end{pmatrix},\;\;$ a column vector.

### 1.2.2  Scalar by Vector Derivative

Given a vector $\mathbf{x} = \begin{pmatrix} x_1\\x_2\\ \dots \\ x_n \end{pmatrix} \in \mathbb{R}^n,\;\;$ define $\frac{\partial y}{\partial \mathbf{x}} =
\begin{pmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \dots & \frac{\partial y}{\partial x_n}  \end{pmatrix}.$

If we define the gradient of a function $f: \mathbf{x} \in \mathbb{R}^n \rightarrow \mathbb{R}$ to be a column vector,<br>
$\qquad$ we have $\nabla f(\mathbf{x}) = \left( \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right)^t$

### 1.2.3  Vector by Vector Derivative

Given column vectors $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{y} \in \mathbb{R}^m$

$\qquad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=
\left(\begin{matrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \dots & \frac{\partial y_1}{\partial x_n}\\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \dots & \frac{\partial y_2}{\partial x_n}\\
\dots                             & \dots                             & \dots & \dots                       \\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \dots & \frac{\partial y_m}{\partial x_n}
\end{matrix}\right)
$

### 1.2.3  Matrix by Scalar Derivative

Given matrix $\mathbf{\mathbf{Y}} \in \mathbb{R}^{m \times n}$ and a scalar $x$

$\qquad
\frac{\partial \mathbf{Y}}{\partial x}=
\left(\begin{matrix}
\frac{\partial y_{1 1}}{\partial x} & \frac{\partial y_{1 2}}{\partial x} & \dots & \frac{\partial y_{1 n}}{\partial x}\\
\frac{\partial y_{2 1}}{\partial x} & \frac{\partial y_{2 2}}{\partial x} & \dots & \frac{\partial y_{2 n}}{\partial x}\\
\dots                             & \dots                             & \dots & \dots                       \\
\frac{\partial y_{m 1}}{\partial x} & \frac{\partial y_{m 2}}{\partial x} & \dots & \frac{\partial y_{m n}}{\partial x}
\end{matrix}\right)
$

### 1.2.4  Scalar by Matrix Derivative

Given matrix $\mathbf{\mathbf{X}} \in \mathbb{R}^{m \times n}$ and a scalar $y$

$\qquad
\frac{\partial y}{\partial \mathbf{X}}=
\left(\begin{matrix}
\frac{\partial y}{\partial x_{1 1}} & \frac{\partial y}{\partial x_{2 1}} & \dots & \frac{\partial y}{\partial x_{m 1}}\\
\frac{\partial y}{\partial x_{1 2}} & \frac{\partial y}{\partial x_{2 2}} & \dots & \frac{\partial y}{\partial x_{m 2}}\\
\dots                             & \dots                             & \dots & \dots                       \\
\frac{\partial y}{\partial x_{1 n}} & \frac{\partial y}{\partial x_{2 n}} & \dots & \frac{\partial y}{\partial x_{m n}}
\end{matrix}\right)
$

Note the derivative has size $n \times m$.

### 1.2.5 The Differential of a Matrix

Given matrix $\mathbf{\mathbf{X}} \in \mathbb{R}^{m \times n}$ 

$\qquad
d\mathbf{X} =
\left(\begin{matrix}
dx_{1 1} & dx_{1 2} & \dots & dx_{1 n}\\
dx_{2 1} & dx_{2 2} & \dots & dx_{2 n}\\
\dots & \dots  & \dots & \dots \\
dx_{m 1} & dx_{m 2} & \dots & dx_{m n}
\end{matrix}\right)
$

### 1.2.6 Example

Let $f( x_{1 1}, x_{1 2}, x_{2 1}, x_{2 2}) = x_{1 1}+ 2 x_{1 2} - x_{2 1} x_{2 2}\;\;$ which we can think of as a function $f: \mathbb{R}^{2 \times 2} \rightarrow \mathbb{R},$<br>$\qquad$i.e., a function $f(\mathbf X)$ for $\mathbf X = \begin{pmatrix} x_{1 1} & x_{1 2}\\ x_{2 1} & x_{2 2} \end{pmatrix}$

Collecting all the partial derivatives in the numerator layout of the scalar by matrix derivative defined above gives<br>
$\qquad
\frac{\partial f(\mathbf X)}{\partial \mathbf X}
= \begin{pmatrix}
\frac{\partial f(\mathbf X)}{\partial x_{1 1}} & \frac{\partial f(\mathbf X)}{\partial x_{2 1}} \\ \frac{\partial f(\mathbf X)}{\partial x_{1 2}} & \frac{\partial f(\mathbf X)}{\partial x_{2 2}} \\ 
\end{pmatrix}
= \begin{pmatrix} 1 & -x_{2 2} \\ 2 & -x_{2 1}\end{pmatrix}
$
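The four partial derivatives of this example can be checked numerically. The sketch below uses exact rational arithmetic from the standard library; the evaluation point `X0`, the step `h`, and the helper `partial` are assumptions made for illustration. Because $f$ is linear in each individual entry, the central difference is exact here.

```python
from fractions import Fraction

def f(X):
    """f(X) = x11 + 2*x12 - x21*x22 for a 2x2 matrix stored as nested lists."""
    return X[0][0] + 2 * X[0][1] - X[1][0] * X[1][1]

def partial(func, X, r, c, h=Fraction(1, 2)):
    """Central difference of func in the entry (r, c) (0-based indices)."""
    Xp = [row[:] for row in X]
    Xm = [row[:] for row in X]
    Xp[r][c] += h
    Xm[r][c] -= h
    return (func(Xp) - func(Xm)) / (2 * h)

# Arbitrary evaluation point (an assumption of this sketch)
X0 = [[Fraction(3), Fraction(5)],
      [Fraction(7), Fraction(11)]]

# Keys use the 1-based indices of the text: partials[(r, c)] = df/dx_{r c}
partials = {(r + 1, c + 1): partial(f, X0, r, c)
            for r in range(2) for c in range(2)}
print(partials)
```

The dictionary is keyed by the entry $x_{r c}$ being varied, so it is independent of how the four values are later arranged into a matrix.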

### 1.2.7 Vector by Matrix, Matrix by Matrix, etc.

These derivatives require more than 2 indices (tensors!). While they can be vectorized, the various notations are unwieldy.<br>
$\quad$ When needed, we will confine ourselves to index notation and the Einstein Summation Convention.

# 2. Basic Formulae

For more detail, consult the [Matrix Cookbook](http://www2.imm.dtu.dk/pubdb/edoc/imm3274.pdf)

## 2.1 Sums and Products

Consider two matrices $\mathbf U$ and $\mathbf V$ with entries that are functions of a vector $\mathbf x$.

* Consider the derivatives of a sum: $\frac{\partial}{\partial x}  ( \mathbf{U + V})$<br><br>
$\qquad \left(
\frac{\partial}{\partial \mathbf x} ( \mathbf U +  \mathbf V )
\right)_{i j k}
= \frac{\partial (u_{i j} + v_{i j})}{\partial x_k}
= \frac{\partial u_{i j}}{\partial x_k}
+ \frac{\partial v_{i j}}{\partial x_k}
\;\;\Leftrightarrow\;\;
\frac{\partial}{\partial x}  ( \mathbf U +  \mathbf V)
= \frac{\partial  \mathbf U}
{\partial  \mathbf x} + \frac{\partial  \mathbf V}{\partial  \mathbf x}
$

* Next, look at the derivatives of a product $\frac{\partial (\mathbf{U V})}{\partial \mathbf x}$<br><br>
$\qquad \left( \frac{\partial (\mathbf{U V})}{\partial \mathbf x}\right)_{i j k}
= \frac{\partial (u_{i l} v_{l j})}{\partial x_k}
= \frac{\partial u_{i l}}{\partial x_k} v_{l j} + u_{i l} \frac{\partial v_{l j}}{\partial x_k}
\;\;\Leftrightarrow\;\;
\frac{\partial}{\partial \mathbf x}  (\mathbf{U V}) = \frac{\partial \mathbf U}{\partial \mathbf x} \mathbf V + \mathbf U \frac{\partial \mathbf V}{\partial \mathbf x}
$

## 2.2 The Chain Rule

### 2.2.1 Functions ${g: \mathbb{R}^n \rightarrow \mathbb{R}^m}$

$\qquad
\left( \frac{\partial f( g( \mathbf x))}{\partial \mathbf x}\right)_{i j}
=\ \frac{\partial \left( f( g( \mathbf x)) \right)_{i}}{\partial x_{j}}
=\ 
\left.\frac{\partial f(\mathbf u)_{i}}{\partial u_{k}}\right|_{\mathbf u= g(\mathbf x)}\ 
\frac{\partial g(\mathbf x)_{k}}{\partial x_{j}}
\;\;\Leftrightarrow \;\;
\large{
\frac{ \partial f(g(\mathbf x))}{\partial \mathbf  x} = \ 
\left. \frac{\partial f( \mathbf u )}{\partial \mathbf u}\right|_{\mathbf u = g( \mathbf x)}\ 
\frac{\partial g( \mathbf x )}{\partial \mathbf x}\;
}
$

**Remark:** Note the order of the matrices! In the numerator layout, the derivative of the outer function comes first.
____

Consider a simple example:

Let $\mathbf y = \mathbf B \mathbf A \mathbf x,$ where $\mathbf A$ and $\mathbf B$ are constant matrices.<br>
$\qquad$ Setting $g(\mathbf x) = \mathbf A \mathbf x$, and $f(\mathbf u) = \mathbf B \mathbf u,$ we have $\mathbf y = f(g(\mathbf x)),$<br>
$\qquad$ where $f$ and $g$ are functions transforming **vectors into vectors.**

We can compute the derivatives directly:

$\qquad
\frac{ \partial y_i}{\partial x_j} = \frac{ \partial b_{i k} a_{k l} x_l}{\partial x_j} = b_{i k} a_{k l} \delta_{l j} = b_{i k} a_{k j}\;\; \Leftrightarrow \frac{ \partial \mathbf y}{\partial  \mathbf x} =  \mathbf B  \mathbf A
$

We want to compare this to $\frac{\partial f( g( \mathbf x))}{\partial \mathbf x}$

$\qquad$ The chain rule calls for $\frac{\partial f( \mathbf u )}{\partial \mathbf u}$ and $\frac{\partial g( \mathbf x)}{\partial \mathbf x}$

$\qquad \left. \begin{align}
\left( \frac{\partial f( \mathbf u )}{\partial \mathbf u} \right)_{i j}
= \frac{\partial b_{i k} u_k}{\partial u_j} = b_{i k} \delta_{k j} = b_{i j} \Rightarrow \frac{\partial f( \mathbf u )}{\partial \mathbf u} = \mathbf B
\\
\left( \frac{\partial g( \mathbf x )}{\partial \mathbf x} \right)_{i j}
= \frac{\partial a_{i k} x_k}{\partial x_j} = a_{i k} \delta_{k j} = a_{i j} \Rightarrow  \frac{\partial g( \mathbf x )}{\partial \mathbf x} = \mathbf A
\end{align}\right\}
\;\;\Rightarrow\;\;
b_{i k} a_{k j} = 
\left( \frac{\partial f( \mathbf u )}{\partial \mathbf u} \right)_{i k}
\left( \frac{\partial g( \mathbf x )}{\partial \mathbf x}\right)_{k j} 
\;\;\Rightarrow\;\;
\mathbf B \mathbf A =\ 
\left. 
\frac{\partial f( \mathbf u )}{\partial \mathbf u}
\right|_{\mathbf u = g( \mathbf x)}\;
\frac{\partial g( \mathbf x )}{\partial \mathbf x}
$

$\qquad$ in agreement with the direct computation.
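The order of the factors can be checked with non-square matrices, where only one of the two products even has the right shape. The sketch below is an illustration under assumed values: the matrices `A` (3×2) and `B` (2×3), the point `x0`, and the helpers `matmul`, `matvec`, and `jacobian` are not part of the notebook. The Jacobian is laid out in numerator layout, with entry $(i, j)$ equal to $\partial y_i / \partial x_j$.

```python
from fractions import Fraction

def matmul(P, Q):
    """Matrix product (P Q)_{ij} = p_{ik} q_{kj}."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def matvec(P, v):
    """Matrix vector product (P v)_i = p_{ik} v_k."""
    return [sum(P[i][k] * v[k] for k in range(len(v))) for i in range(len(P))]

def jacobian(func, x, h=Fraction(1)):
    """Numerator layout: entry (i, j) is d func_i / d x_j.
    Forward differences are exact for the linear maps used here."""
    y0 = func(x)
    J = [[None] * len(x) for _ in y0]
    for j in range(len(x)):
        xh = x[:]
        xh[j] += h
        yh = func(xh)
        for i in range(len(y0)):
            J[i][j] = (yh[i] - y0[i]) / h
    return J

# Assumed example data: g maps R^2 -> R^3, f maps R^3 -> R^2
A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0, 2], [0, 1, -1]]
x0 = [Fraction(2), Fraction(-3)]

def g(x):
    return matvec(A, x)

def f(u):
    return matvec(B, u)

J_g = jacobian(g, x0)                      # equals A
J_f = jacobian(f, g(x0))                   # equals B
J_y = jacobian(lambda x: f(g(x)), x0)      # Jacobian of y = B A x

print(J_y == matmul(J_f, J_g))
```

With these shapes the Jacobian of $\mathbf y$ is 2×2, which matches the product `matmul(J_f, J_g)`; the product in the opposite order would be 3×3.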

### 2.2.2 Functions $g: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{k \times l}$

Apply the chain rule to $\frac{\partial f( g( \mathbf X))}{\partial \mathbf X}.$ We get<br>

$\qquad \large{
\left( \frac{\partial f( g( \mathbf X))}{\partial \mathbf X}\right)_{i j m n}
=\ \frac{\partial \left( f( g( \mathbf X)) \right)_{i j}}{\partial x_{m n}}
=\ 
\frac{\partial g(X)_{k l}}{\partial x_{m n}} \ 
\left.\frac{\partial f(\mathbf U)_{i j}}{\partial u_{k l}}\right|_{\;\mathbf U = g(\mathbf X)}\ 
}$

These are tensors! To fit these into matrices, we would have to vectorize them, resulting in forbidding-looking formulae.
____

#### Example

Let $g(\mathbf X) = \mathbf X + \mathbf X^t$ and $f(\mathbf{X}) = \mathbf X^2$, and consider $F(\mathbf X) = f( g( \mathbf X ))$

$\qquad$ We want to compute $\frac{\partial F(\mathbf X)}{\partial \mathbf X}$<br><br>

##### Direct Computation

Substituting, we find $F(\mathbf X) = \left( \mathbf X + \mathbf X^t \right)^2 = \mathbf X^2 + ( \mathbf X^t)^2 +  \mathbf X^t \mathbf X + \mathbf X \mathbf X^t.\;\;$

$\qquad \begin{align}
\left( \frac{\partial F(\mathbf X)}{\partial \mathbf X}\right)_{i j m n} 
=&\ \frac{\partial}{\partial x_{m n}}
\left( x_{i k} x_{k j} + x_{j k} x_{k i} + x_{i k} x_{j k} + x_{k i} x_{k j}\right) \\
=&\ x_{k j} \delta_{i m} \delta_{k n} + x_{i k} \delta_{k m}\delta_{j n}
  + x_{k i} \delta_{j m} \delta_{k n} + x_{j k} \delta_{k m}\delta_{i n}
  + x_{j k} \delta_{i m} \delta_{k n} + x_{i k} \delta_{j m}\delta_{k n}
  + x_{k j} \delta_{k m} \delta_{i n} + x_{k i} \delta_{k m}\delta_{j n} \\
=&\ x_{n j} \delta_{i m} + x_{i m} \delta_{j n}
  + x_{n i} \delta_{j m} + x_{j m} \delta_{i n}
  + x_{j n} \delta_{i m} + x_{i n} \delta_{j m}
  + x_{m j} \delta_{i n} + x_{m i} \delta_{j n} \\
=&\ 
   (x_{n j}+x_{j n}) \delta_{i m}
 + (x_{m j}+x_{j m}) \delta_{i n}
 + (x_{n i}+x_{i n}) \delta_{j m}
 + (x_{m i}+x_{i m}) \delta_{j n}
\end{align}$

##### Chain Rule Computation

The chain rule applied to $\frac{\partial f(g(\mathbf X))}{\partial \mathbf X}$ calls for $\frac{\partial f( \mathbf U )}{\partial \mathbf U}$ and $\frac{\partial g( \mathbf X)}{\partial \mathbf X}$

$\qquad\qquad \left. \begin{align}
&\left( \frac{\partial f( \mathbf U )}{\partial \mathbf U} \right)_{i j k l}
= \frac{\partial}{\partial u_{k l}}\left( u_{i m} u_{m j} \right) = u_{l j} \delta_{i k} + u_{i k} \delta_{j l}
\\
&\left( \frac{\partial g( \mathbf X )}{\partial \mathbf X} \right)_{k l m n}
= \frac{\partial \left(x_{k l} + x_{l k}\right)}{\partial x_{m n}} = \delta_{k m} \delta_{l n} + \delta_{l m} \delta_{k n}
\end{align}\right\}
\;\;\Leftrightarrow\;\; \left(\frac{\partial F(\mathbf X)}{\partial \mathbf X}\right)_{i j m n} =
\left. \left(\delta_{k m} \delta_{l n} + \delta_{l m} \delta_{k n}\right)\left( u_{l j} \delta_{i k} + u_{i k} \delta_{j l} \right)\right|_{\ \mathbf U = X+X^t}
$

$\qquad$ Simplifying, we get

$\qquad\qquad \left(\frac{\partial F(\mathbf X)}{\partial \mathbf X}\right)_{i j m n}\ 
\begin{align}=&\ \left(
u_{n j} \delta_{i m} +
u_{m j} \delta_{i n} +
u_{i m} \delta_{j n} +
u_{i n} \delta_{j m} \right)_{\ \mathbf U = X+X^t}\\
=&
   (x_{n j}+x_{j n}) \delta_{i m}
 + (x_{m j}+x_{j m}) \delta_{i n}
 + (x_{n i}+x_{i n}) \delta_{j m}
 + (x_{m i}+x_{i m}) \delta_{j n}
\end{align}$

$\qquad$ the same expression as before.
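The closed form can also be checked against a brute-force derivative over all index combinations. This is a sketch under stated assumptions: the test point `X0`, the step `h`, and the helper names are illustrative, and indices are 0-based in the code. Since each entry of $F(\mathbf X)$ is a quadratic polynomial in the entries of $\mathbf X$, a central difference in exact rational arithmetic reproduces the derivative exactly.

```python
from fractions import Fraction
from itertools import product

def F(X):
    """F(X) = (X + X^t)^2 for a square matrix stored as nested lists."""
    n = len(X)
    U = [[X[i][j] + X[j][i] for j in range(n)] for i in range(n)]
    return [[sum(U[i][k] * U[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def delta(a, b):
    """Kronecker delta."""
    return 1 if a == b else 0

def dF_formula(X, i, j, m, n):
    """Closed form for d F_ij / d x_mn obtained in the text."""
    return ((X[n][j] + X[j][n]) * delta(i, m)
            + (X[m][j] + X[j][m]) * delta(i, n)
            + (X[n][i] + X[i][n]) * delta(j, m)
            + (X[m][i] + X[i][m]) * delta(j, n))

def dF_numeric(X, i, j, m, n, h=Fraction(1, 3)):
    """Central difference in x_mn; exact because F_ij is quadratic."""
    Xp = [row[:] for row in X]
    Xm = [row[:] for row in X]
    Xp[m][n] += h
    Xm[m][n] -= h
    return (F(Xp)[i][j] - F(Xm)[i][j]) / (2 * h)

# Arbitrary 2x2 test point (an assumption of this sketch)
X0 = [[Fraction(1), Fraction(2)],
      [Fraction(-3), Fraction(5)]]

mismatches = [idx for idx in product(range(2), repeat=4)
              if dF_formula(X0, *idx) != dF_numeric(X0, *idx)]
print(mismatches)
```

An empty list means that all 16 entries of the 4-index derivative agree with the formula at the chosen point.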

##### Why so Many Indices?

Let's look at a matrix $\mathbf X$ of size $2\times 2$ for this example
____

In [7]:
# rows are indexed by r, columns by c, so that Xmatrix[r-1, c-1] == x[r, c]
Xmatrix = Matrix([[x[r,c] for c in range(1,3)] for r in range(1,3)])
Fmatrix = expand( (Xmatrix + Xmatrix.T)**2 )
print("The matrix F(X) has size 2x2, with entries made up of 4 variables")
print("F(X) =")
Fmatrix

The matrix F(X) has size 2x2, with entries made up of 4 variables
F(X) =


Matrix([
[                   4*x[1, 1]**2 + x[1, 2]**2 + 2*x[1, 2]*x[2, 1] + x[2, 1]**2, 2*x[1, 1]*x[1, 2] + 2*x[1, 1]*x[2, 1] + 2*x[1, 2]*x[2, 2] + 2*x[2, 1]*x[2, 2]],
[2*x[1, 1]*x[1, 2] + 2*x[1, 1]*x[2, 1] + 2*x[1, 2]*x[2, 2] + 2*x[2, 1]*x[2, 2],                    x[1, 2]**2 + 2*x[1, 2]*x[2, 1] + x[2, 1]**2 + 4*x[2, 2]**2]])

In [9]:
print( "We have a 2x2 matrix for EACH of the partial derivatives! That's four 2x2 matrices")
print( ".   e.g., the derivative with respect to ", x[1,1], "is given by")
print( ".   ∂F/∂x_1,1 =")
diff(Fmatrix, x[1,1])

We have a 2x2 matrix for EACH of the partial derivatives! That's four 2x2 matrices
.   e.g., the derivative with respect to  x[1, 1] is given by
.   ∂F/∂x_1,1 =


Matrix([
[            8*x[1, 1], 2*x[1, 2] + 2*x[2, 1]],
[2*x[1, 2] + 2*x[2, 1],                     0]])

____
When we make use of $\mathbf U = g(\mathbf X)$, we introduce an additional matrix $\mathbf U$, together with the 4 partial derivatives of $f(\mathbf U)$ with respect to its entries

In [10]:
def make_f_u():
    # rows are indexed by r, columns by c, so that um[r-1, c-1] == u[r, c]
    um = Matrix([[u[r,c] for c in range(1,3)] for r in range(1,3)])
    return expand(um*um)
def make_df_u(f_u):
    # df_u[r-1][c-1] holds the derivative of f(U) with respect to u[r, c]
    return [[diff(f_u, u[r,c]) for c in range(1,3)] for r in range(1,3)]

f_u  = make_f_u()
df_u = make_df_u(f_u)

print("f(U)")
display(f_u)
print("Diff f(U) with respect to u_1,1")
display(df_u[0][0])

f(U)


Matrix([
[     u[1, 1]**2 + u[1, 2]*u[2, 1], u[1, 1]*u[1, 2] + u[1, 2]*u[2, 2]],
[u[1, 1]*u[2, 1] + u[2, 1]*u[2, 2],      u[1, 2]*u[2, 1] + u[2, 2]**2]])

Diff f(U) with respect to u_1,1


Matrix([
[2*u[1, 1], u[1, 2]],
[  u[2, 1],       0]])

In [13]:
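def make_g_x():
    # rows are indexed by r, columns by c, so that gm[r-1, c-1] == x[r, c]
    gm = Matrix([[x[r,c] for c in range(1,3)] for r in range(1,3)])
    return gm+gm.T
def make_dg_x( g_x ):
    # dg_x[r-1][c-1] holds the derivative of g(X) with respect to x[r, c]
    return [[diff(g_x, x[r,c]) for c in range(1,3)] for r in range(1,3)]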

g_x  = make_g_x()
dg_x = make_dg_x( g_x)
print("g(X)= ")
display(g_x)
print("Diff g(X) with respect to x_2,1")
display(dg_x[1][0])

g(X)= 


Matrix([
[        2*x[1, 1], x[1, 2] + x[2, 1]],
[x[1, 2] + x[2, 1],         2*x[2, 2]]])

Diff g(X) with respect to x_2,1


Matrix([
[0, 1],
[1, 0]])

That makes for 16 matrices that need to be computed and assembled!

## 2.3 Some Common Derivatives

The following common derivatives are readily computed using the sum, product, and chain rules.<br>
In the following, $\alpha, a, A$ and $B$ are considered constant.
* $\frac{\partial}{\partial \mathbf x} \mathbf A \mathbf x = \mathbf A$
* $\frac{\partial}{\partial \mathbf x} (\mathbf x^t \mathbf A) = \mathbf A^t$
* $\frac{\partial}{\partial \mathbf x} (\mathbf x^t \mathbf A \mathbf x) = \mathbf x^t \left( \mathbf A + \mathbf A^t \right)$
* $\frac{d}{dt} \mathbf X^{-1} = - \mathbf X^{-1} \frac{d \mathbf X}{dt} \mathbf X^{-1}$, assuming $x_{i j} = x_{i j}(t)$.

The last identity is readily obtained by differentiating $\mathbf X^{-1} \mathbf X = \mathbf I$.

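Derivatives of Common Functions (written in the numerator layout, so derivatives of scalars with respect to $\mathbf x$ are row vectors)
* $\frac{\partial}{\partial \mathbf x} \Vert \mathbf x - \mathbf a \Vert^2 = 2 ( \mathbf x - \mathbf a )^t \Rightarrow \frac{\partial}{\partial \mathbf x} \Vert \mathbf x - \mathbf a \Vert =  \frac{ ( \mathbf x - \mathbf a )^t }{\Vert \mathbf x - \mathbf a \Vert }$ 
* $\frac{\partial}{\partial \mathbf x} \Vert \mathbf A \mathbf x \Vert^2 = 2 \mathbf x^t \mathbf A^t \mathbf A$
* $\frac{\partial}{\partial \mathbf X} trace(\mathbf X) = \mathbf I\quad$ ($\mathbf X$ is a square matrix)
* $\frac{\partial}{\partial \mathbf X} \Vert \mathbf X \Vert^2 = 2 \mathbf X^t\quad$ (Frobenius norm)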
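Identities of this kind are easy to spot-check numerically. As an illustration, the sketch below tests $\frac{\partial}{\partial \mathbf x} (\mathbf x^t \mathbf A \mathbf x) = \mathbf x^t ( \mathbf A + \mathbf A^t )$ for a deliberately non-symmetric matrix. The matrix `A`, the point `x0`, and the helper names are assumptions of this sketch; exact rational arithmetic is used so that the central difference of the quadratic form has no rounding or truncation error.

```python
from fractions import Fraction

# Illustrative, deliberately non-symmetric matrix and evaluation point (assumptions)
A = [[2, -1, 0],
     [4, 3, 1],
     [0, 5, -2]]
x0 = [Fraction(1), Fraction(2), Fraction(-1)]
n = len(x0)

def quad(x):
    """x^t A x = a_{ij} x_i x_j"""
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def row_derivative(func, x, h=Fraction(1, 2)):
    """Entries d func / d x_l by central differences (exact for quadratics)."""
    out = []
    for l in range(len(x)):
        xp, xm = x[:], x[:]
        xp[l] += h
        xm[l] -= h
        out.append((func(xp) - func(xm)) / (2 * h))
    return out

numeric = row_derivative(quad, x0)

# Closed form x^t (A + A^t): entry l is x_i (a_{il} + a_{li})
closed_form = [sum(x0[i] * (A[i][l] + A[l][i]) for i in range(n))
               for l in range(n)]

print(numeric == closed_form)
```

Using a non-symmetric `A` matters: with a symmetric matrix the check could not distinguish $\mathbf A + \mathbf A^t$ from $2 \mathbf A$.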

## 2.4 Differentials

As shown above, derivatives can be tediously computed via partials,<br>$\qquad$
but they instead can be computed directly with matrix manipulations.<br>$\qquad$
In the following, we use an article by [Thomas P. Minka](https://tminka.github.io/papers/matrix/minka-matrix.pdf)


The idea is to use differential notation: for a differentiable function $y = f(x)$,<br>$\qquad$
the function can be approximated by a tangent hyperplane.

Let $dx$ be a given step starting from some point $x$, and let $dy$ be the resulting change in $y$<br>$\qquad$
in the tangent hyperplane. Then

$\qquad y(x + dx) = y(x) + f'(x) dx + \text{(higher order terms)} \Rightarrow dy = f'(x) dx$

$\qquad$ This equation applies even when $y$ and $x$ are not scalars. E.g., <br>
$\qquad\qquad dy = \frac{ \partial f(x) }{ \partial x} dx, \;\;\;$
where $dx = \begin{pmatrix} dx_1\\ dx_2\\ \dots \\ dx_n \end{pmatrix}$.

#### **Computations of Derivatives**

To compute a given derivative,
* compute the differential using the expressions given below
* put the result in canonical form and read off the derivative as the coefficient of $dx$, $d\mathbf{x}$ or $d\mathbf{X}$.<br>
This can prove difficult! One way to address this is to "vectorize" the result, which yields equations that are hard to read and interpret. It may be easier to switch to index notation at this point. 

Let $\mathbf{X}$ and $\mathbf{Y}$ be variables (scalars, vectors and/or matrices), and $\alpha$ and $\mathbf{A}$ be constants.

The following differentials are easily established using the Einstein Summation Convention:
* $d\mathbf{A} = 0$
* $d(\alpha \mathbf{X}) = \alpha\ d\mathbf{X}$
* $d(\mathbf{X}+\mathbf{Y}) = d\mathbf{X} + d\mathbf{Y}$
* $d( \mathbf{X}\mathbf{Y} ) = d\mathbf{X}\ \mathbf{Y} + \mathbf{X}\ d\mathbf{Y}$
* $d\mathbf{X}^{-1} =  -\mathbf{X}^{-1}\ d\mathbf{X}\ \mathbf{X}^{-1}$
* $d( trace(\mathbf{X}) ) = trace( d\mathbf{X} )$
* $d( det\mathbf{X} ) = det\mathbf{X}\ trace\left( \mathbf{X}^{-1}\ d\mathbf{X} \right)$
* $d( log\left(det\left(\mathbf{X}\right)\right)) = trace\left(\mathbf{X}^{-1}\ d\mathbf{X}\right)$
* $d\mathbf{X}^{\star}= (d\mathbf{X})^{\star},\;\;\;$ where $\star$ is any operator that rearranges elements, e.g. transpose

A useful relationship to know is $trace(A B) = trace( B A)$
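As a spot check of one of these rules, the sketch below compares $d(\det \mathbf X) = \det \mathbf X \; trace( \mathbf X^{-1} d\mathbf X )$ with the first-order change of the determinant along a direction $d\mathbf X$. The 2×2 matrix `X0`, the direction `dX`, and the helper names are assumptions chosen for illustration. For a 2×2 matrix, $\det(\mathbf X + t\, d\mathbf X)$ is quadratic in $t$, so a central difference at $t = 0$ in exact arithmetic returns the first-order coefficient exactly.

```python
from fractions import Fraction

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d],
            [-M[1][0] / d, M[0][0] / d]]

def trace_of_product(P, Q):
    """trace(P Q) = p_{ik} q_{ki}"""
    return sum(P[i][k] * Q[k][i] for i in range(2) for k in range(2))

# Assumed example point and perturbation direction
X0 = [[Fraction(2), Fraction(1)], [Fraction(5), Fraction(4)]]
dX = [[Fraction(1), Fraction(-2)], [Fraction(3), Fraction(7)]]

def shifted(t):
    """X0 + t * dX"""
    return [[X0[i][j] + t * dX[i][j] for j in range(2)] for i in range(2)]

h = Fraction(1, 4)
d_det_numeric = (det2(shifted(h)) - det2(shifted(-h))) / (2 * h)
d_det_formula = det2(X0) * trace_of_product(inv2(X0), dX)

print(d_det_numeric, d_det_formula)
```

The helper `trace_of_product` also makes the identity $trace(A B) = trace(B A)$ visible: swapping its arguments only renames the two summed indices.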

___
Let's establish the formula for the derivative of $\mathbf X^{-1}$.<br>
$\qquad$ An easy way is to consider $\mathbf X \mathbf X^{-1} = \mathbf I$.

$\qquad$ By the product rule the differential is $d\mathbf X\ \mathbf X^{-1} + \mathbf X\ d\mathbf X^{-1} = 0$. Solving for $d\mathbf X^{-1}$,<br>
$\qquad$ we obtain $d\mathbf X^{-1} = - \mathbf X^{-1}\ d\mathbf{X}\ \mathbf X^{-1}$.

$\qquad$ If for example $\mathbf X$ is a function of a variable $t$, we have $d \mathbf X = \frac{ d \mathbf X}{dt} dt$,<br>
$\qquad\qquad$ so that $\frac{d \mathbf X^{-1}}{ dt } = - \mathbf X^{-1} \frac{d \mathbf X}{ dt }  \mathbf X^{-1}$
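The sign and the ordering of the factors can be checked numerically for a concrete matrix function. In the sketch below, the function `X(t)`, its derivative `dX_dt(t)`, the point `t0`, and the step `h` are all assumptions made for illustration. Since the entries of the inverse are rational functions of $t$, the finite difference is only approximate, and the comparison uses floating point with a tolerance.

```python
def X(t):
    """An assumed 2x2 matrix function of t (illustrative only)."""
    return [[2.0 + t, t * t],
            [1.0, 3.0 - t]]

def dX_dt(t):
    """Entrywise derivative of X(t)."""
    return [[1.0, 2.0 * t],
            [0.0, -1.0]]

def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate."""
    d = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / d, -M[0][1] / d],
            [-M[1][0] / d, M[0][0] / d]]

def matmul2(P, Q):
    """2x2 matrix product."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

t0, h = 0.5, 1e-5

# Central difference approximation of d(X^{-1})/dt at t0
inv_plus, inv_minus = inv2(X(t0 + h)), inv2(X(t0 - h))
numeric = [[(inv_plus[i][j] - inv_minus[i][j]) / (2 * h) for j in range(2)]
           for i in range(2)]

# Formula: -X^{-1} (dX/dt) X^{-1}
Xi = inv2(X(t0))
formula = [[-v for v in row] for row in matmul2(matmul2(Xi, dX_dt(t0)), Xi)]

max_err = max(abs(numeric[i][j] - formula[i][j])
              for i in range(2) for j in range(2))
print(max_err)
```

Dropping the minus sign in `formula` makes `max_err` of the same order as the entries themselves, which is a quick way to see that the sign is required.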

# 3. Examples

## 3.1. Neural Network Weight Updates

Neural network updates take the form $\mathbf{w} = \mathbf{X} \mathbf{v},$ where $\mathbf X$ contains weights that need to be updated.<br>
$\quad$ The weight update equations require $\frac{\partial \mathbf{w}}{\partial \mathbf{X}} = \frac{\partial (\mathbf{X v})}{\partial \mathbf{X}}$ 

In Einstein Summation Convention notation, this is

$\qquad
\left( \frac{\partial (\mathbf{X v})}{\partial \mathbf{X}}\right)_{i j k}
=\ \frac{ \partial}{\partial x_{j k}}
x_{i l} v_l
=\ \delta_{i j} \delta_{k l} v_l
=\ \delta_{i j} v_k
$

i.e.,<br>
$\qquad
\left( \frac{\partial (\mathbf{X v})}{\partial \mathbf{X}}\right)_{i j k}
= \left\{\begin{align}
v_k \quad & \text{ if } i = j \\
0 \quad & \text{otherwise}
\end{align}\right.
$
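The three-index result $\delta_{i j} v_k$ can be confirmed entry by entry. The sketch below assumes a 2×3 weight matrix `X0` and a vector `v` chosen only for illustration, with 0-based indices in the code. Because $\mathbf w = \mathbf X \mathbf v$ is linear in the entries of $\mathbf X$, a forward difference in exact arithmetic equals the derivative.

```python
from fractions import Fraction
from itertools import product

# Assumed example data: 2x3 weights and a length-3 input vector
v = [Fraction(2), Fraction(-1), Fraction(5)]
X0 = [[Fraction(1), Fraction(0), Fraction(3)],
      [Fraction(4), Fraction(-2), Fraction(1)]]

def w(X):
    """w_i = x_{il} v_l"""
    return [sum(X[i][l] * v[l] for l in range(len(v))) for i in range(len(X))]

def dw(i, j, k, h=Fraction(1)):
    """Forward difference of w_i in x_jk; exact because w is linear in X."""
    Xh = [row[:] for row in X0]
    Xh[j][k] += h
    return (w(Xh)[i] - w(X0)[i]) / h

# Compare with delta_{ij} v_k over all index combinations
mismatches = [(i, j, k)
              for i, j, k in product(range(2), range(2), range(3))
              if dw(i, j, k) != (v[k] if i == j else 0)]
print(mismatches)
```

Only the entries in row $j = i$ of the weight matrix influence the output $w_i$, which is why the derivative vanishes whenever the two row indices differ.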

## 3.2 Least Mean Squares Objective Function

The least mean squares solution of $A x = b$ is given by $argmin_x \Vert A x - b \Vert$. 

$\qquad$ Let $F(x) = \Vert A x - b \Vert^2 = x^t A^t A x - x^t A^t b - b^t A x + b^t b$<br>
$\qquad$ and compute the gradient $\nabla F(x)$

\begin{align}
\frac{\partial F(x)}{\partial x_l}
=&\ \frac{\partial}{\partial x_l}  \left( x_i A_{j i} A_{j k} x_k - x_i A_{j i} b_j - b_i A_{i j} x_j + b_i b_i \right) \\
=&\ A_{j i} A_{j k} \delta_{i l} x_k + A_{j i} A_{j k} \delta_{k l} x_i - A_{j i} b_j \delta_{i l} - A_{i j} b_i \delta_{j l} \\
=&\ A_{j l} A_{j k} x_k + A_{j k} A_{j l} x_k  - A_{j l} b_j - A_{j l} b_j \\ 
=&\ 2 A_{j l} \left( A_{j k} x_k - b_j \right)
\end{align}

Hence

$\qquad \nabla F(x) = \left( \frac{\partial}{\partial x} \Vert A x - b \Vert^2 \right)^t = 2 A^t (A x - b)$
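The index computation can be cross-checked against a numerical gradient. The data `A`, `b`, the point `x0`, and the helper names below are assumptions for illustration. Since $F$ is a quadratic polynomial in the entries of $x$, a central difference in exact rational arithmetic gives the partial derivatives without error.

```python
from fractions import Fraction

# Assumed example data: an overdetermined 3x2 system
A = [[1, 2], [3, 4], [5, 6]]
b = [1, 0, -1]
x0 = [Fraction(1, 2), Fraction(-2)]
rows, cols = len(A), len(A[0])

def residual(x):
    """r_j = A_{jk} x_k - b_j"""
    return [sum(A[j][k] * x[k] for k in range(cols)) - b[j] for j in range(rows)]

def F(x):
    """F(x) = ||A x - b||^2 = r_j r_j"""
    return sum(rj * rj for rj in residual(x))

def numeric_gradient(func, x, h=Fraction(1, 2)):
    """Central differences; exact because F is quadratic in x."""
    grad = []
    for l in range(len(x)):
        xp, xm = x[:], x[:]
        xp[l] += h
        xm[l] -= h
        grad.append((func(xp) - func(xm)) / (2 * h))
    return grad

grad_numeric = numeric_gradient(F, x0)

# Index form of the result: 2 A_{jl} (A_{jk} x_k - b_j)
r0 = residual(x0)
grad_formula = [2 * sum(A[j][l] * r0[j] for j in range(rows)) for l in range(cols)]

print(grad_numeric, grad_formula)
```

The summed index `j` runs over the rows of `A` while the free index `l` runs over its columns, which is exactly the contraction that turns into $A^t$ in matrix notation.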

## 3.3 The Rayleigh Quotient

We compute the derivative of the Rayleigh quotient $R( \mathbf x) = \frac{\mathbf x^t \mathbf A \mathbf x}{\mathbf x^t \mathbf x}$.

$\qquad \mathbf x^t \mathbf A \mathbf x = a_{i j} x_i x_j$, so $R( \mathbf x) = \frac{a_{i j} x_i x_j}{ x_k x_k }$

The numerator and denominator are scalars, so we can use the quotient rule.

* Derivative of the numerator: $\qquad\;\;\frac{\partial a_{i j} x_i x_j  }{\partial x_l} = { a_{i j} (\delta_{i l} x_j + \delta_{j l} x_i )} = a_{l j} x_j + a_{i l} x_i$
* Derivative of the denominator: $\qquad\frac{\partial x_k x_k}{\partial x_l} = 2 \delta_{k l} x_k = 2 x_l$

$\quad \therefore \;\; \frac{\partial R(\mathbf x)}{\partial x_l} = \frac{ (a_{l j} x_j + a_{i l} x_i) x_s x_s - 2 a_{i j} x_i x_j x_l}{(x_k x_k)^2}
\;\;\Leftrightarrow\;\; \frac{\partial R(\mathbf x)}{\partial \mathbf x} = \frac{ ( \mathbf A + \mathbf A^t) \mathbf x  - 2 R(\mathbf x) \mathbf x }{ \mathbf x^t \mathbf x }$

Since the matrix $\mathbf A$ for the Rayleigh quotient $R(\mathbf x)$ is symmetric, we finally have<br>
$\qquad \frac{\partial R(\mathbf x)}{\partial \mathbf x} = 2 \frac{A x}{x^t x} -2 R(x) \frac{x }{x^t x}$
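A numerical spot check of the general formula, before the symmetry assumption is used, might look as follows. The matrix `A` (deliberately non-symmetric), the point `x0`, and the step `h` are assumptions of this sketch. The quotient is a rational function, so the finite difference is approximate and the comparison is done in floating point with a tolerance.

```python
# Illustrative non-symmetric matrix and evaluation point (assumptions)
A = [[2.0, 1.0],
     [0.0, 3.0]]
x0 = [1.0, 2.0]
n = len(x0)

def rayleigh(x):
    """R(x) = a_{ij} x_i x_j / (x_k x_k)"""
    num = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return num / sum(xk * xk for xk in x)

def numeric_gradient(func, x, h=1e-6):
    """Central difference approximation of the partial derivatives."""
    grad = []
    for l in range(len(x)):
        xp, xm = x[:], x[:]
        xp[l] += h
        xm[l] -= h
        grad.append((func(xp) - func(xm)) / (2 * h))
    return grad

R0 = rayleigh(x0)
xtx = sum(xk * xk for xk in x0)

# Entry l of ((A + A^t) x - 2 R(x) x) / (x^t x)
formula = [(sum((A[l][j] + A[j][l]) * x0[j] for j in range(n)) - 2 * R0 * x0[l]) / xtx
           for l in range(n)]

numeric = numeric_gradient(rayleigh, x0)
max_err = max(abs(a - b) for a, b in zip(numeric, formula))
print(max_err)
```

For this data the formula evaluates to approximately $(-0.08,\ 0.04)$; replacing $\mathbf A + \mathbf A^t$ by $2 \mathbf A$ would only be valid if `A` were symmetric.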