# Natural Gradient

**Sources**:
1. [Natural Gradient Works Efficiently in Learning](https://direct.mit.edu/neco/article-abstract/10/2/251/6143/Natural-Gradient-Works-Efficiently-in-Learning?redirectedFrom=fulltext)

#### Description
A variant of the gradient that gives the steepest direction of a function when the parameter space is not Euclidean.
For example, the parameter space of a multilayer perceptron has a Riemannian metric structure, so its coordinate system is in general not orthonormal.
Under such a structure, the ordinary gradient does not necessarily point in the steepest direction.
The natural gradient takes the metric into account and therefore gives the steepest direction in this wider class of spaces.
It is defined as follows:

$$\tilde{\nabla} L(\mathbf{w}) = G^{-1} \nabla L(\mathbf{w})\textnormal{,}$$

where $G$ is the Riemannian metric tensor, $L$ is the loss function, and $\nabla$ is the ordinary gradient.
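
A minimal sketch of this definition in Python. The $2 \times 2$ metric `G`, the gradient `grad`, and the helper names are hypothetical and chosen only for illustration; none of them come from the source.

```python
def solve_2x2(G, g):
    """Solve G x = g for a 2x2 matrix G using the explicit inverse."""
    (a, b), (c, d) = G
    det = a * d - b * c
    if det == 0:
        raise ValueError("metric is singular")
    return [(d * g[0] - b * g[1]) / det, (-c * g[0] + a * g[1]) / det]


def natural_gradient(G, grad):
    """Return G^{-1} grad, the natural gradient under the metric G."""
    return solve_2x2(G, grad)


# Hypothetical metric: the first coordinate is stretched relative to the second.
G = [[4.0, 0.0], [0.0, 1.0]]
# Hypothetical ordinary gradient at some point w.
grad = [4.0, 1.0]

nat_grad = natural_gradient(G, grad)
print(nat_grad)  # [1.0, 1.0]
```

The ordinary gradient points mostly along the first coordinate, while the natural gradient weighs both coordinates equally once the metric is taken into account. With the identity metric the two coincide. In more than two dimensions one would typically solve the linear system $G\mathbf{x} = \nabla L(\mathbf{w})$ with a numerical library rather than form $G^{-1}$ explicitly.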

#### Derivation
Let $S := \{\mathbf{w} \in \mathbb{R}^n\}$ be a parameter space on which a function $L(\mathbf{w})$ is defined.
When $S$ is a Euclidean space with an orthonormal coordinate system, the squared length of a small incremental vector $d\mathbf{w} = (dw_1, \dots, dw_n)$ is:

$$|d\mathbf{w}|^2 = \sum_{i=1}^n (dw_i)^2\textnormal{.}$$

If the coordinate system is non-orthonormal, the squared length is instead given by the quadratic form:

$$|d\mathbf{w}|^2 = \sum_{i,j} g_{ij}(\mathbf{w})\, dw_i\, dw_j\textnormal{,}$$

where $G = (g_{ij})$ is the $n \times n$ *Riemannian metric tensor*, which in general depends on $\mathbf{w}$.

For example, in a Euclidean space with an orthonormal coordinate system, the Riemannian metric tensor reduces to the Kronecker delta:

$$g_{ij}(\mathbf{w}) = \delta_{ij} =
\begin{cases}
    1, &\text{if } i = j \\
    0, &\text{otherwise}
\end{cases}
$$
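
As a worked illustration (an assumed example, not taken from the source), consider the Euclidean plane described in polar coordinates $\mathbf{w} = (r, \theta)$, with $x = r\cos\theta$ and $y = r\sin\theta$. This coordinate system is not orthonormal, and the squared length of $d\mathbf{w} = (dr, d\theta)$ becomes:

$$|d\mathbf{w}|^2 = dx^2 + dy^2 = dr^2 + r^2\, d\theta^2\textnormal{,}$$

so the metric tensor depends on $\mathbf{w}$:

$$G(\mathbf{w}) = \begin{pmatrix} 1 & 0 \\ 0 & r^2 \end{pmatrix}\textnormal{.}$$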


Note that the latter definition of squared length also applies when $S$ is a curved manifold, on which no orthonormal coordinate system exists.
**In fact, in such a space the natural gradient, rather than the ordinary gradient, gives the steepest direction, which is why it works better in learning.**

Now, the steepest descent direction of a function $L(\mathbf{w})$ at $\mathbf{w}$ is the vector $d\mathbf{w}$ that minimizes $L(\mathbf{w} + d\mathbf{w})$.
The length of $d\mathbf{w}$ is fixed by the constraint $|d\mathbf{w}|^2 = \epsilon^2$, where $\epsilon$ is a sufficiently small constant.

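**Theorem.**
&emsp; The steepest descent direction of $L(\mathbf{w})$ in a Riemannian space is given by:

$$-\tilde{\nabla} L(\mathbf{w}) = -G^{-1} \nabla L(\mathbf{w})\textnormal{,}$$

where $G^{-1}$ is the inverse of the metric $G$ and $\nabla L(\mathbf{w})$ is the ordinary (conventional) gradient $\nabla L(\mathbf{w}) = \left( \partial_{w_1} L(\mathbf{w}), \dots, \partial_{w_n} L(\mathbf{w})\right)^\intercal$.

**Proof.**
&emsp; Let $d\mathbf{w} = \epsilon \mathbf{a}$.
To first order in $\epsilon$, we minimize $L(\mathbf{w} + d\mathbf{w}) \approx L(\mathbf{w}) + \epsilon \nabla L(\mathbf{w})^\intercal \mathbf{a}$ subject to:

$$|\mathbf{a}|^2 = \sum_{i,j} g_{ij}a_i a_j = 1\text{.}$$

By the method of Lagrange multipliers, we get:

$$\partial_\mathbf{a} \{L(\mathbf{w}) + \epsilon \nabla L(\mathbf{w})^\intercal \mathbf{a} - \lambda (\mathbf{a}^\intercal G \mathbf{a} - 1)\} = 0\text{,}$$
$$\epsilon \nabla L(\mathbf{w}) - 2 \lambda G \mathbf{a} = 0\text{,}$$
$$\mathbf{a} = \frac{\epsilon}{2\lambda}G^{-1} \nabla L(\mathbf{w})\text{.}$$

Substituting $\mathbf{a}$ into the constraint gives $\lambda = \pm \frac{\epsilon}{2}\sqrt{\nabla L(\mathbf{w})^\intercal G^{-1} \nabla L(\mathbf{w})}$.
Since $G$ is positive definite, the linear term $\epsilon \nabla L(\mathbf{w})^\intercal \mathbf{a} = \frac{\epsilon^2}{2\lambda} \nabla L(\mathbf{w})^\intercal G^{-1} \nabla L(\mathbf{w})$ is negative, and hence minimal, for the negative root (assuming $\nabla L(\mathbf{w}) \neq 0$).
Hence, $d\mathbf{w} = \epsilon \mathbf{a} = \frac{\epsilon^2}{2\lambda}G^{-1} \nabla L(\mathbf{w}) \propto -G^{-1} \nabla L(\mathbf{w}) = -\tilde{\nabla}L(\mathbf{w})$. \[Q.E.D.\]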
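
The result can also be checked numerically. The sketch below is an illustration only: the $2 \times 2$ metric `G`, the gradient `grad`, and all helper names are hypothetical and do not come from the source. It compares the first-order change $\nabla L(\mathbf{w})^\intercal \mathbf{a}$ along the normalized negative natural gradient with the change along sampled directions $\mathbf{a}$ that satisfy $\mathbf{a}^\intercal G \mathbf{a} = 1$.

```python
import math
import random


def solve_2x2(G, g):
    """Solve G x = g for a 2x2 matrix G using the explicit inverse."""
    (a, b), (c, d) = G
    det = a * d - b * c
    if det == 0:
        raise ValueError("metric is singular")
    return [(d * g[0] - b * g[1]) / det, (-c * g[0] + a * g[1]) / det]


def metric_norm(G, v):
    """Length of v under the metric G, i.e. sqrt(v^T G v)."""
    return math.sqrt(sum(G[i][j] * v[i] * v[j]
                         for i in range(2) for j in range(2)))


def steepest_descent_direction(G, grad):
    """Negative natural gradient, scaled to unit length under G."""
    nat = solve_2x2(G, grad)
    norm = metric_norm(G, nat)
    return [-x / norm for x in nat]


# Hypothetical symmetric positive-definite metric and ordinary gradient.
G = [[2.0, 1.0], [1.0, 2.0]]
grad = [1.0, -3.0]

a_star = steepest_descent_direction(G, grad)
best = sum(g * a for g, a in zip(grad, a_star))

# Sample directions on the constraint set a^T G a = 1 and record the
# smallest gap between their linearized change and that of a_star.
rng = random.Random(0)
smallest_gap = float("inf")
for _ in range(1000):
    theta = rng.uniform(0.0, 2.0 * math.pi)
    v = [math.cos(theta), math.sin(theta)]
    n = metric_norm(G, v)
    a = [x / n for x in v]
    value = sum(g * x for g, x in zip(grad, a))
    smallest_gap = min(smallest_gap, value - best)

# No sampled direction decreases the linearized loss more than a_star.
print(smallest_gap >= -1e-12)  # True
```

In this setup the minimal value of the linear term is $-\sqrt{\nabla L(\mathbf{w})^\intercal G^{-1} \nabla L(\mathbf{w})}$, which is the quantity stored in `best`.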