https://kvfrans.com/what-is-the-natural-gradient-and-where-does-it-appear-in-trust-region-policy-optimization/

# KL divergence
Kullback-Leibler (KL) divergence measures how much a probability distribution $q(x)$ differs from a reference distribution $p(x)$. It can compare distributions from different families (parameterized differently), as long as both are defined over the same variable $x$.
$$D_{KL}(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx$$
Unlike more conventional measures of difference, such as Euclidean distance, KL divergence is not symmetric:
$$D_{KL}(p||q) \neq D_{KL}(q||p)$$ 
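
The asymmetry is easy to check numerically. The sketch below uses the discrete analogue of the integral above (a sum over outcomes); the two distributions `p` and `q` are arbitrary illustrative values, not taken from the text.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence: sum over x of p(x) * log(p(x) / q(x))."""
    # Terms with p(x) = 0 contribute nothing, so they are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two illustrative distributions over three outcomes (assumed values).
p = [0.5, 0.4, 0.1]
q = [0.1, 0.3, 0.6]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print("D_KL(p||q) =", kl_pq)
print("D_KL(q||p) =", kl_qp)
```

The two printed values differ, while `kl_divergence(p, p)` is exactly zero.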


# Single variable. Relation of Fisher information to KL divergence

See [KL_Divergence.ipynb](KL_Divergence.ipynb) for background on KL divergence:

$$D_{KL}(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx $$

Suppose we want to measure the KL divergence between a probability distribution $p(x; \theta)$ parametrized by $\theta$ and the same distribution $p(x; \theta + \delta)$ with its parameter perturbed by a small $\delta$.

Plug these two distributions into the definition of KL divergence (the expectation is taken over $x \sim p(x; \theta)$):
$$D_{KL}\bigg(p(x; \theta) ||p(x; \theta + \delta) \bigg) = \int p(x; \theta) \log \frac{p(x; \theta)}{p(x; \theta + \delta) } dx = \mathbb{E}\bigg[  \log \frac{p(x; \theta)}{p(x; \theta + \delta) }\bigg]$$
or
$$D_{KL}\bigg(p(x; \theta) ||p(x; \theta + \delta) \bigg)  = \mathbb{E}\bigg[  \log p(x; \theta) - \log p(x; \theta + \delta)\bigg]$$
Now consider the second-order Taylor expansion of the log-probability in $\theta$:
$$\log p(x; \theta + \delta) \overset{\text{2nd order}}{\approx} \log p(x; \theta) + \delta \frac{\partial}{\partial \theta}\log p(x; \theta) + \frac{\delta^2}{2}  \frac{\partial^2}{\partial \theta^2}\log p(x; \theta) + \underbrace{\dots}_{\approx 0}$$
Plug it into the KL divergence:
$$D_{KL}\bigg(p(x; \theta) ||p(x; \theta + \delta) \bigg) \overset{\text{2nd order}}{\approx} \mathbb{E}\bigg[  \cancel{\log p(x; \theta) - \log p(x; \theta)} - \delta \frac{\partial}{\partial \theta}\log p(x; \theta) - \frac{\delta^2}{2}  \frac{\partial^2}{\partial \theta^2}\log p(x; \theta)\bigg]$$
We can split the expectation by linearity. The first-order term, as in the previous section, is zero because the expected score vanishes:
$$\mathbb{E}\bigg[\frac{\partial}{\partial \theta}\log p(x; \theta) \bigg] = \mathbb{E}\bigg[ \frac{\frac{\partial}{\partial \theta} p(x; \theta)}{p(x; \theta)} \bigg] = \int_{-\infty}^\infty \frac{\frac{\partial}{\partial \theta} p(x; \theta)}{p(x; \theta)} \cdot p(x; \theta) \ dx = \frac{\partial}{\partial \theta}\int_{-\infty}^\infty p(x; \theta) \ dx = \frac{\partial}{\partial \theta} 1 = 0$$
So
$$D_{KL}\bigg(p(x; \theta) ||p(x; \theta + \delta) \bigg) \overset{\text{2nd order}}{\approx} -\frac{\delta^2}{2}\mathbb{E}\bigg[\frac{\partial^2}{\partial \theta^2}\log p(x; \theta)\bigg]$$
The expectation of the second derivative of the log-probability equals minus the Fisher information, $\mathbb{E}\big[\frac{\partial^2}{\partial \theta^2}\log p(x; \theta)\big] = -I(\theta)$, so:
$$\boxed{D_{KL}\bigg(p(x; \theta) ||p(x; \theta + \delta) \bigg) \overset{\text{2nd order}}{\approx} \frac{\delta^2}{2} I(\theta)}$$
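
As a sanity check of the boxed result, the sketch below compares the exact KL divergence with $\frac{\delta^2}{2} I(\theta)$ for a Bernoulli distribution. The Bernoulli family is an assumption chosen for illustration (it is not used in the text); its Fisher information is $I(\theta) = \frac{1}{\theta(1-\theta)}$. The function names and the value of `theta` are likewise illustrative.

```python
import math

def kl_bernoulli(a, b):
    """Exact D_KL(Bernoulli(a) || Bernoulli(b))."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def fisher_bernoulli(theta):
    """Fisher information of a Bernoulli distribution with success probability theta."""
    return 1.0 / (theta * (1.0 - theta))

theta = 0.3  # illustrative parameter value
ratios = {}
for delta in (0.1, 0.01, 0.001):
    exact = kl_bernoulli(theta, theta + delta)
    approx = 0.5 * delta**2 * fisher_bernoulli(theta)
    ratios[delta] = exact / approx
    print(f"delta={delta}: exact={exact:.3e}, approx={approx:.3e}, ratio={ratios[delta]:.4f}")
```

The ratio should approach 1 as $\delta$ shrinks, since the neglected terms are of third order in $\delta$.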

# Multiple variables
## Definition (https://en.wikipedia.org/wiki/Fisher_information#Matrix_form)
Multi-variable Fisher information is defined as a matrix with entries $i,j$:
$$\boxed{[I(\theta)]_{i,j} = \mathbb{E}\bigg[\bigg(\frac{\partial}{\partial \theta_i}\log p(x; \theta) \bigg) \bigg(\frac{\partial}{\partial \theta_j}\log p(x; \theta)\bigg)\bigg| \theta\bigg]}$$
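
The definition can be tried out by Monte Carlo. The sketch below assumes, for illustration only, a Gaussian $p(x; \theta)$ with $\theta = (\mu, \sigma)$, for which the Fisher matrix is known in closed form: $\mathrm{diag}(1/\sigma^2,\ 2/\sigma^2)$. The helper names, sample count, and parameter values are hypothetical choices.

```python
import random

def score(x, mu, sigma):
    """Gradient of log N(x; mu, sigma^2) with respect to (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3
    return (d_mu, d_sigma)

def estimate_fisher(mu, sigma, n_samples=100_000, seed=0):
    """Average the outer product of the score over samples x ~ p(x; theta)."""
    rng = random.Random(seed)
    fisher = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_samples):
        g = score(rng.gauss(mu, sigma), mu, sigma)
        for i in range(2):
            for j in range(2):
                fisher[i][j] += g[i] * g[j] / n_samples
    return fisher

mu, sigma = 1.0, 2.0  # illustrative parameter values
fisher_mc = estimate_fisher(mu, sigma)
fisher_exact = [[1 / sigma**2, 0.0], [0.0, 2 / sigma**2]]
print("Monte Carlo:", fisher_mc)
print("Closed form:", fisher_exact)
```

The estimate is noisy, so it only matches the closed form up to sampling error.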
## Relation to KL divergence
We use the multi-variable Taylor expansion and replace the single-variable expression
$$\log p(x; \theta + \delta) \overset{\text{2nd order}}{\approx} \log p(x; \theta) + \delta \frac{\partial}{\partial \theta}\log p(x; \theta) + \frac{\delta^2}{2}  \frac{\partial^2}{\partial \theta^2}\log p(x; \theta) + \underbrace{\dots}_{\approx 0}$$
with the expression where $\theta$ and $\delta$ are vectors
$$\log p(x; \theta + \delta) \overset{\text{2nd order}}{\approx} \log p(x; \theta) + \delta^T \nabla_\theta \log p(x; \theta) + \frac{1}{2}\delta^T H \log p(x; \theta) \delta$$
where the Hessian is defined, as in [gradient_jacobian_hessian.ipynb](../symbolic/gradient_jacobian_hessian.ipynb), as the Jacobian of the gradient
$$\vec{J}(\nabla \cdot) = H (\cdot)$$
And we want to find
$$D_{KL}\bigg(p(x; \theta) ||p(x; \theta + \delta) \bigg)  = \mathbb{E}\bigg[  \log p(x; \theta) - \log p(x; \theta + \delta)\bigg]$$
By the same reasoning as before (the zeroth-order terms cancel and the expected gradient of the log-probability is zero), we are left with
$$D_{KL}\bigg(p(x; \theta) ||p(x; \theta + \delta) \bigg) = -\frac{1}{2}\delta^T \mathbb{E}\bigg[J(\nabla_\theta \log p(x; \theta))\bigg] \delta$$
***
Let's examine the term inside the expectation
$$ J(\nabla_\theta \log p(x; \theta))= J\bigg(\frac{\nabla_\theta p(x; \theta)}{p(x; \theta)}\bigg)$$
The entries of the gradient (a column vector) are
$$\vec{g}_i = \bigg[\frac{\nabla_\theta p(x; \theta)}{p(x; \theta)}\bigg]_i = \frac{\partial_i p(x; \theta)}{p(x; \theta)} $$
and the Jacobian differentiates each entry with respect to every parameter, so column $j$ applies $\partial_j = \frac{\partial}{\partial \theta_j}$.

Thus, taking the $j$-th derivative of the $i$-th gradient entry gives, via the quotient rule:


$$\partial_j \bigg(\frac{\partial_i p(x; \theta)}{p(x; \theta)}\bigg) = \frac{\partial_{ij}p(x; \theta)}{p(x; \theta)}- \frac{(\partial_i p(x; \theta))(\partial_j p(x; \theta))}{p(x; \theta)^2}$$
$$ = \frac{\partial_{ij}p(x; \theta)}{p(x; \theta)}- \frac{\partial_i p(x; \theta)}{p(x; \theta)}\frac{\partial_j p(x; \theta)}{p(x; \theta)}$$
$$ = \frac{\partial_{ij}p(x; \theta)}{p(x; \theta)}- \partial_i \log p(x; \theta) \cdot \partial_j \log p(x; \theta) $$
Or expressed as a matrix
$$H \log p(x; \theta) = \frac{ H p(x; \theta)}{p(x; \theta)} - \nabla_\theta \log p(x; \theta) \ \nabla_\theta \log p(x; \theta)^T$$
***
So
$$D_{KL}\bigg(p(x; \theta) ||p(x; \theta + \delta) \bigg) = -\frac{1}{2}\delta^T \mathbb{E}\bigg[\frac{ H p(x; \theta)}{p(x; \theta)} \bigg] \delta
+\frac{1}{2}\delta^T \mathbb{E}\bigg[\nabla_\theta \log p(x; \theta) \ \nabla_\theta \log p(x; \theta)^T\bigg] \delta$$
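The first term on the right is zero: writing the expectation as an integral, the factors of $p(x; \theta)$ cancel and the derivative operator can be brought outside the integral
$$ \mathbb{E}\bigg[\frac{ H p(x; \theta)}{p(x; \theta)} \bigg] = \int H p(x; \theta) \ dx = H \int p(x; \theta) \ dx = H\cdot 1 = 0 \text{ (zero matrix)}$$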

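$$\boxed{D_{KL}\bigg(p(x; \theta) ||p(x; \theta + \delta) \bigg) \overset{\text{2nd order}}{\approx} \frac{1}{2}\delta^T \mathbb{E}\bigg[\nabla_\theta \log p(x; \theta) \ \nabla_\theta \log p(x; \theta)^T\bigg] \delta}$$
The expectation on the right is exactly the Fisher information matrix $I(\theta)$ defined above, so the result can also be written as $\frac{1}{2}\delta^T I(\theta) \delta$.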
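
The quadratic form can be compared against an exact KL divergence. The sketch below again assumes a Gaussian with $\theta = (\mu, \sigma)$ purely for illustration, using the closed-form Gaussian KL divergence and the Fisher matrix $\mathrm{diag}(1/\sigma^2,\ 2/\sigma^2)$; the perturbation values are arbitrary.

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Exact D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2) ** 2) / (2 * sigma2**2)
            - 0.5)

def quadratic_form(delta, matrix):
    """Compute 0.5 * delta^T * matrix * delta for small dense inputs."""
    n = len(delta)
    return 0.5 * sum(delta[i] * matrix[i][j] * delta[j]
                     for i in range(n) for j in range(n))

mu, sigma = 1.0, 2.0  # illustrative parameters theta
fisher = [[1 / sigma**2, 0.0], [0.0, 2 / sigma**2]]

results = []
for scale in (1.0, 0.1):
    delta = (0.01 * scale, 0.02 * scale)  # small perturbation of (mu, sigma)
    exact = kl_gaussian(mu, sigma, mu + delta[0], sigma + delta[1])
    approx = quadratic_form(delta, fisher)
    results.append((exact, approx))
    print(f"delta={delta}: exact={exact:.6e}, approx={approx:.6e}")
```

The agreement should improve as the perturbation shrinks, as expected from a second-order approximation.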

***
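Alternatively, we can Taylor-expand the whole KL divergence as a function of $\theta$ around $\theta_{old}$, with $\delta = \theta - \theta_{old}$. The zeroth- and first-order terms vanish because the divergence is zero and at its minimum when $\theta = \theta_{old}$, which leaves
$$\boxed{D_{KL}\bigg(p(x; \theta) ||p(x; \theta + \delta) \bigg) \overset{\text{2nd order}}{\approx} \frac{1}{2}\delta^T H \delta = \frac{1}{2}(\theta - \theta_{old})^T H (\theta - \theta_{old}) }$$
where $H$ is the Hessian matrix of the KL divergence with respect to $\theta$, evaluated at $\theta_{old}$
$$H = \nabla^2 D_{KL}$$
This view is important because in tasks such as a _Hessian-vector product_ we can 'decompose' the Hessian into the Jacobian of a gradient. For a vector $\vec{v}$ that does not depend on $\theta$:
$$H \vec{v} = J\big(\nabla  D_{KL}\big) \vec{v} = \nabla\big(\nabla  D_{KL} \cdot \vec{v}\big) $$
This allows us to avoid computing the full Hessian and use gradients instead.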
[CG_Hessian_vector_trick.ipynb](../optimization/CG_Hessian_vector_trick.ipynb)
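
A minimal sketch of computing $H\vec{v}$ from gradients only, without forming $H$. Note the swap in technique: the identity above is normally evaluated with automatic differentiation, whereas this sketch uses a central finite difference of the gradient along $\vec{v}$, which needs only the standard library. The Gaussian KL divergence with $\theta = (\mu, \sigma)$, the analytic gradient, and all names and values are assumptions made for illustration.

```python
MU_OLD, SIGMA_OLD = 1.0, 2.0  # illustrative reference parameters theta_old

def grad_kl(theta):
    """Analytic gradient of D_KL(N(mu_old, sigma_old^2) || N(mu, sigma^2)) w.r.t. (mu, sigma)."""
    mu, sigma = theta
    d_mu = (mu - MU_OLD) / sigma**2
    d_sigma = 1.0 / sigma - (SIGMA_OLD**2 + (mu - MU_OLD) ** 2) / sigma**3
    return (d_mu, d_sigma)

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    """Approximate H v as the directional derivative of the gradient along v."""
    plus = grad_fn(tuple(t + eps * vi for t, vi in zip(theta, v)))
    minus = grad_fn(tuple(t - eps * vi for t, vi in zip(theta, v)))
    return tuple((a - b) / (2 * eps) for a, b in zip(plus, minus))

v = (1.0, -0.5)  # arbitrary direction
hv = hessian_vector_product(grad_kl, (MU_OLD, SIGMA_OLD), v)

# At theta_old the Hessian of this KL divergence is diag(1/sigma^2, 2/sigma^2),
# i.e. the Gaussian Fisher matrix, so H v is known in closed form here.
expected = (v[0] / SIGMA_OLD**2, 2 * v[1] / SIGMA_OLD**2)
print("H v (finite difference):", hv)
print("H v (closed form):      ", expected)
```

Each product costs two gradient evaluations regardless of the number of parameters, which is what makes this approach attractive inside iterative solvers such as conjugate gradient.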