# Gaussian Variational Inference

Gaussian variational inference is a family of variational inference algorithms that use a Gaussian variational distribution. We consider the Gaussian approximation $q(\theta) = \mathcal{N}(m, C)$ of a target density $\rho$, which in what follows is the posterior $\rho(\theta|y)$.


## Right Kullback–Leibler Gaussian approximation
The right KL Gaussian approximation minimizes, over $m$ and $C$,

$$
\begin{align*}
KL\bigl[\rho(\theta|y)\Vert q(\theta)\bigr] = \int \rho(\theta|y)\log \rho(\theta|y) d\theta - \mathbb{E}_{\rho}\bigl[-\frac{1}{2}(\theta - m)^TC^{-1}(\theta - m)\bigr] + \frac{1}{2}\log \det C + \textrm{const.}
\end{align*}
$$

Minimizing this objective over $m$ and $C$ leads to moment matching:

$$
\begin{align*}
m = \mathbb{E}_{\rho}[\theta] \qquad C =\mathbb{E}_{\rho}[(\theta - m)(\theta - m)^T]
\end{align*}
$$  
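
As an illustration of the moment-matching solution, here is a minimal sketch that estimates $m$ and $C$ from samples. It assumes draws from $\rho$ are available, which is often not the case in practice; the function name `moment_match` and the bimodal test target are hypothetical choices made for this example only.

```python
import numpy as np

def moment_match(samples):
    """Right-KL Gaussian fit: mean and covariance of rho estimated from samples.

    samples: array of shape (n_samples, dim), assumed to be draws from rho.
    """
    samples = np.asarray(samples, dtype=float)
    m = samples.mean(axis=0)
    centered = samples - m
    C = centered.T @ centered / samples.shape[0]  # E[(theta - m)(theta - m)^T]
    return m, C

# Hypothetical bimodal target: equal mixture of N(-4, 1) and N(4, 1).
rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=20_000)
samples = (4.0 * signs + rng.standard_normal(20_000))[:, None]
m, C = moment_match(samples)
```

For this target the fitted Gaussian is centered between the two modes with a variance close to $1 + 4^2 = 17$, so it covers both modes rather than selecting one.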

## Left Kullback–Leibler Gaussian approximation
The left KL Gaussian approximation minimizes, over $m$ and $C$,

$$
\begin{align*}
KL\bigl[q(\theta) \Vert \rho(\theta|y)\bigr] &= \int \bigl[q(\theta)\log q(\theta) - q(\theta)\log\rho(\theta|y)\bigr] d\theta\\
                                   &= -\frac{1}{2}\log \det C  - \int q(\theta)\log\rho(\theta|y) d\theta + \textrm{const.}
\end{align*}
$$

More specifically, we assume that $\rho(\theta|y) \propto \rho(y|\theta)\rho_{\rm prior}(\theta)$ with a Gaussian prior $\rho_{\rm prior} = \mathcal{N}(m_{\rm prior}, C_{\rm prior})$, and seek to solve

$$ \min_{m,C} \mathbb{E}_q[\log q - \log \rho_{\rm prior} - \log \rho(y|\theta)] \qquad q = \mathcal{N}(m, C)$$

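Because this objective penalizes $q$ for placing mass where $\rho(\theta|y)$ is small, the minimizer generally captures a single mode of a multimodal posterior distribution.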
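
The single-mode behaviour can be checked numerically in one dimension. The sketch below is only an illustration under stated assumptions: the bimodal target `log_rho`, the grid, and the helper names are hypothetical, and the integral is approximated by a plain Riemann sum. It compares the left KL divergence of a Gaussian sitting on one mode with that of the moment-matched Gaussian.

```python
import numpy as np

def log_normal_pdf(theta, m, C):
    """Log density of the 1D Gaussian N(m, C) with variance C."""
    return -0.5 * np.log(2.0 * np.pi * C) - 0.5 * (theta - m) ** 2 / C

def left_kl(m, C, log_rho, grid):
    """KL[q || rho] for q = N(m, C), by a Riemann sum on a uniform grid."""
    dx = grid[1] - grid[0]
    log_q = log_normal_pdf(grid, m, C)
    return np.sum(np.exp(log_q) * (log_q - log_rho(grid))) * dx

def log_rho(theta):
    """Hypothetical bimodal target: 0.5 N(-4, 1) + 0.5 N(4, 1)."""
    return np.log(0.5) + np.logaddexp(log_normal_pdf(theta, -4.0, 1.0),
                                      log_normal_pdf(theta, 4.0, 1.0))

grid = np.linspace(-30.0, 30.0, 60_001)
kl_single_mode = left_kl(4.0, 1.0, log_rho, grid)    # q sits on one mode
kl_moment_match = left_kl(0.0, 17.0, log_rho, grid)  # q matches mean and variance of rho
print(kl_single_mode, kl_moment_match)
```

For this target the single-mode Gaussian has the smaller left KL divergence (close to $\log 2$), which is why a left-KL minimizer prefers it over the mode-covering fit.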

## Recursive variational Gaussian approximation [1]
The fixed-point equations of the left Kullback–Leibler Gaussian approximation are:

$$
\begin{align*}
\textrm{order 2 form:}&\\
&m = m_{\rm prior} + C_{\rm prior} \mathbb{E}_q[\nabla_{\theta}\log \rho(y|\theta)]\\
&C^{-1} = C_{\rm prior}^{-1}  - \mathbb{E}_q[\nabla_{\theta}^2\log \rho(y|\theta)]\\
%
\textrm{order 1 form:}&\\
&m = m_{\rm prior} + C_{\rm prior} \nabla_{m} \mathbb{E}_q[\log \rho(y|\theta)]\\
&C^{-1} = C_{\rm prior}^{-1}  - 2\nabla_{C}\mathbb{E}_q[\log \rho(y|\theta)]
        = C_{\rm prior}^{-1}  - C^{-1}\mathbb{E}_q\bigl[(\theta - m)\nabla_{\theta} \log\rho(y|\theta)^T\bigr]\\
%
\textrm{order 0 form:}&\\
&m = m_{\rm prior} + C_{\rm prior} C^{-1} \mathbb{E}_q[(\theta - m)\log \rho(y|\theta)]\\
&C^{-1} = C_{\rm prior}^{-1}  - C^{-1}\mathbb{E}_q\Bigl[\bigl((\theta - m)(\theta - m)^T C^{-1} - \mathbb{I}\bigr)\log \rho(y|\theta)\Bigr]
\end{align*}
$$

When an explicit iterative scheme is applied, that is, when $m, C, q$ on the right-hand side are replaced by $m_{n}, C_{n}, q_n$ to define $m_{n+1}, C_{n+1}$ on the left-hand side, this becomes an iterative Gaussian posterior approximation algorithm.
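
A minimal sketch of such an explicit iteration for the order 2 form is given below. It is not the R-VGA algorithm of [1] itself, only a direct iteration of the fixed-point equations; the function names, the sampling-based expectation with antithetic standard-normal draws, and the linear-Gaussian test problem are all assumptions made for illustration. The iteration is not guaranteed to converge in general; the example uses a weak likelihood for which it is a contraction.

```python
import numpy as np

def gaussian_expectation(fn, m, C, z):
    """Approximate E_q[fn(theta)] for q = N(m, C) from standard-normal draws z."""
    L = np.linalg.cholesky(C)
    thetas = m + z @ L.T
    return np.mean([fn(theta) for theta in thetas], axis=0)

def explicit_fixed_point(grad_loglik, hess_loglik, m_prior, C_prior,
                         n_iter=50, n_samples=64, seed=0):
    """Iterate the order 2 equations with the right-hand side frozen at step n."""
    rng = np.random.default_rng(seed)
    half = rng.standard_normal((n_samples // 2, m_prior.size))
    z = np.vstack([half, -half])  # antithetic pairs: linear integrands average exactly
    C_prior_inv = np.linalg.inv(C_prior)
    m, C = m_prior.copy(), C_prior.copy()
    for _ in range(n_iter):
        g = gaussian_expectation(grad_loglik, m, C, z)  # E_{q_n}[grad log rho(y|theta)]
        H = gaussian_expectation(hess_loglik, m, C, z)  # E_{q_n}[hess log rho(y|theta)]
        m = m_prior + C_prior @ g
        C = np.linalg.inv(C_prior_inv - H)
    return m, C

# Hypothetical linear-Gaussian problem: y = G theta + noise, noise ~ N(0, 4 I).
G = np.array([[1.0, 0.5], [0.0, 1.0]])
y = np.array([1.0, -2.0])
noise_var = 4.0
m_prior, C_prior = np.zeros(2), np.eye(2)

m, C = explicit_fixed_point(
    lambda theta: G.T @ (y - G @ theta) / noise_var,
    lambda theta: -G.T @ G / noise_var,
    m_prior, C_prior,
)
```

In this linear-Gaussian case the posterior is itself Gaussian, so the iterates can be compared against the exact posterior mean and covariance.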

## Particle flow methods [2]

Consider an affine transformation of particles that represent the distribution $q_t$:

$$
\frac{d\theta_t}{dt} = f_t (\theta_t) = A_t(\theta_t - m_t) + b_t
$$

where $m_t = \mathbb{E}_{q_t}[\theta]$, and $A_t$ and $b_t$ are independent of $\theta_t$. Because the map is affine, if $q_0$ is Gaussian then $q_t$ remains Gaussian. The density evolves according to the corresponding Fokker-Planck equation (here without a diffusion term, i.e. the continuity equation)

$$
\frac{\partial q_t}{\partial t} = -\nabla_{\theta} \cdot ( q_t  f_t )
$$




To determine $b_t$ and $A_t$, we consider

$$
\begin{align*}
\frac{\partial KL\bigl[q_t \Vert \rho(\theta|y)\bigr]}{\partial t}
&= \int \nabla_{\theta} \cdot ( q_t  f_t ) \bigl(\log \rho(\theta | y) - \log q_t\bigr)\, d\theta \\
&= \mathbb{E}_{q_t}\bigl[-\nabla_{\theta} \cdot f_t - f_t^T  \nabla_{\theta}\log \rho(\theta | y)\bigr] \\
& = -\mathrm{tr}[A_t] - b_t^T \mathbb{E}_{q_t}\bigl[\nabla_{\theta}\log\rho(\theta|y)\bigr] - \mathrm{tr}\bigl[A_t \mathbb{E}_{q_t}[(\theta - m_t)\nabla_{\theta}\log\rho(\theta|y)^T]\bigr] \\
& = -\mathrm{tr}\Bigl[A_t\bigl(I + \mathbb{E}_{q_t}[(\theta - m_t)\nabla_{\theta}\log\rho(\theta|y)^T]\bigr)\Bigr] - b_t^T \mathbb{E}_{q_t}\bigl[\nabla_{\theta}\log\rho(\theta|y)\bigr]
\end{align*}
$$

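To ensure that $KL\bigl[q_t \Vert \rho(\theta|y)\bigr]$ decreases, we choose

$$
\begin{align*}
b_t =  \mathbb{E}_{q_t}\bigl[\nabla_{\theta}\log\rho(\theta|y)\bigr] \qquad A_t =  I + \mathbb{E}_{q_t}[\nabla_{\theta}\log\rho(\theta|y) (\theta - m_t)^T]
\end{align*}
$$

With this choice $A_t$ is the transpose of the matrix it multiplies inside the trace, so $\frac{\partial}{\partial t}KL\bigl[q_t \Vert \rho(\theta|y)\bigr] = -\Vert A_t\Vert_F^2 - \Vert b_t\Vert^2 \le 0$.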

Substituting $\log\rho(\theta|y) = \log\rho(y|\theta) + \log\rho_{\rm prior}(\theta) + \textrm{const.}$ leads to 

$$
\begin{align*}
b_t &=  \mathbb{E}_{q_t}\bigl[\nabla_{\theta}\log\rho(y|\theta)\bigr] - C^{-1}_{\rm prior}(m_t - m_{\rm prior})\\
A_t &=  I + \mathbb{E}_{q_t}[\nabla_{\theta}\log\rho(y|\theta) (\theta - m_t)^T] - C^{-1}_{\rm prior}C_t
\end{align*}
$$

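Setting $b_t = 0$ and $A_t = 0$ recovers the fixed-point equations of the left Kullback–Leibler Gaussian approximation: $b_t = 0$ is the mean equation, and multiplying $A_t = 0$ on the right by $C_t^{-1}$ and transposing gives the order 1 covariance equation.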
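
The flow can be sketched with a finite set of particles and explicit Euler steps, replacing the expectations under $q_t$ by particle averages. This is a minimal illustration and not the implementation of [2]; the function name `particle_flow`, the step size, the number of steps, and the Gaussian test target are assumptions. It uses the form of $b_t$ and $A_t$ written with the full posterior score $\nabla_\theta \log\rho(\theta|y)$.

```python
import numpy as np

def particle_flow(grad_log_post, thetas, dt=0.1, n_steps=500):
    """Explicit Euler integration of d theta/dt = A_t (theta - m_t) + b_t.

    grad_log_post: maps an (N, d) array of particles to their (N, d) posterior scores.
    thetas: initial particles of shape (N, d), with N > d so the covariance has full rank.
    """
    thetas = np.array(thetas, dtype=float)
    n, d = thetas.shape
    for _ in range(n_steps):
        m = thetas.mean(axis=0)
        dev = thetas - m
        scores = grad_log_post(thetas)
        b = scores.mean(axis=0)             # b_t = E[grad log rho]
        A = np.eye(d) + scores.T @ dev / n  # A_t = I + E[grad log rho (theta - m)^T]
        thetas = thetas + dt * (dev @ A.T + b)
    return thetas

# Hypothetical Gaussian target N(mu_star, P^{-1}), so the limit of the flow is known.
P = np.array([[1.25, 0.125], [0.125, 1.3125]])
mu_star = np.array([0.5, -0.3])
rng = np.random.default_rng(0)
particles = particle_flow(lambda th: -(th - mu_star) @ P, rng.standard_normal((50, 2)))
m_final = particles.mean(axis=0)
C_final = np.cov(particles.T, bias=True)
```

For a Gaussian target the stationary point $b_t = 0$, $A_t = 0$ corresponds to the particle mean and covariance matching the target mean and covariance, which gives a simple check of the scheme. The step size has to be small relative to the eigenvalues of the target precision for the Euler steps to be stable.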




## Natural gradient methods 

### Natural gradient
Consider optimization problems of the form

$$\min \mathcal{E}(\lambda)$$

where the objective involves a parametrized distribution $q_\lambda$, for example

$$
\mathcal{E}(\lambda) = KL\bigl[q_{\lambda}(\theta) \Vert \rho(\theta| y)\bigr]
$$

or learning problems

$$
\mathcal{E}(\lambda) = \mathbb{E}\Bigl[- \log q_{\lambda}(y , x) \Bigr]
$$

The natural gradient is defined as the gradient preconditioned by the inverse of the Fisher information matrix $F_\lambda$:

$$F_\lambda^{-1}\nabla_{\lambda} \mathcal{E}(\lambda) \qquad F_\lambda 
=  \mathbb{E}_{q_\lambda}\bigl[\nabla_\lambda \log q_{\lambda}(\cdot)  \nabla_\lambda \log q_{\lambda}(\cdot)^T \bigr]  
= -\mathbb{E}_{q_\lambda}\bigl[\nabla_\lambda^2 \log q_{\lambda}(\cdot) \bigr]$$

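* The ordinary gradient depends on the choice of coordinate system. Newton's method and natural gradient descent are invariant to affine coordinate transformations, and natural gradient descent is approximately invariant (to first order in $\Delta t$) to general smooth coordinate transformations. Let $\tilde\lambda = f(\lambda)$ be an invertible reparametrization with Jacobian $J_n = \nabla_\lambda f(\lambda_n)$. The updates in the two coordinate systems are
$$
\lambda_{n+1} = \lambda_{n} - \Delta t F_{\lambda_n}^{-1}\nabla_{\lambda} \mathcal{E}(\lambda_n)
$$
$$
\tilde\lambda_{n+1} \approx \tilde\lambda_{n} - \Delta t F_{\tilde\lambda_n}^{-1}\nabla_{\tilde\lambda} \mathcal{E}({\tilde\lambda_n})
$$
They agree to first order, since $\nabla_{\lambda}\mathcal{E} = J_n^T \nabla_{\tilde\lambda}\mathcal{E}$ and $F_{\lambda_n} = J_n^T F_{\tilde\lambda_n} J_n$ give
$$\tilde\lambda_{n+1} - \tilde\lambda_{n}  = f(\lambda_{n+1}) -f(\lambda_{n}) 
\approx J_n(\lambda_{n+1} -\lambda_{n}) 
= J_n\bigl( - \Delta tF_{\lambda_n}^{-1}\nabla_{\lambda} \mathcal{E}(\lambda_n) \bigr) 
= - \Delta t F_{\tilde\lambda_n}^{-1}\nabla_{\tilde\lambda} \mathcal{E}({\tilde\lambda_n})$$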

* It gives the largest change in the objective per unit change in KL divergence. Since 
$$KL\bigl[q_\lambda(\theta) \Vert q_{\lambda+\delta\lambda}(\theta)\bigr] \approx \frac{1}{2}\delta\lambda^T F_\lambda \delta\lambda$$
we have, to leading order,
$$\textrm{argmax}_{\delta \lambda: KL[q_\lambda \Vert q_{\lambda+\delta\lambda}]\leq \epsilon} \mathcal{E}(\lambda + \delta \lambda) \approx \textrm{argmax}_{\delta \lambda: \frac{1}{2}\delta\lambda^T F_\lambda \delta\lambda \leq \epsilon} \nabla_{\lambda}\mathcal{E}(\lambda)^T  \delta \lambda \propto F_\lambda^{-1} \nabla_{\lambda}\mathcal{E}(\lambda) $$
so the steepest descent direction for the minimization problem is $-F_\lambda^{-1} \nabla_{\lambda}\mathcal{E}(\lambda)$.
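
The second-order expansion $KL\bigl[q_\lambda \Vert q_{\lambda+\delta\lambda}\bigr] \approx \frac{1}{2}\delta\lambda^T F_\lambda \delta\lambda$ can be checked on a one-dimensional Gaussian, where both the KL divergence and the Fisher information are available in closed form. The sketch below is only a sanity check; the parametrization $\lambda = (m, s)$ by mean and standard deviation and the helper names are choices made for this example.

```python
import numpy as np

def kl_gauss_1d(m1, s1, m2, s2):
    """Closed-form KL[N(m1, s1^2) || N(m2, s2^2)]."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2.0 * s2**2) - 0.5

def fisher_gauss_1d(s):
    """Fisher information of N(m, s^2) in the coordinates lambda = (m, s)."""
    return np.array([[1.0 / s**2, 0.0], [0.0, 2.0 / s**2]])

m, s = 0.5, 2.0
delta = np.array([1e-3, 2e-3])  # small perturbation (delta m, delta s)
kl_exact = kl_gauss_1d(m, s, m + delta[0], s + delta[1])
kl_quadratic = 0.5 * delta @ fisher_gauss_1d(s) @ delta
print(kl_exact, kl_quadratic)
```

The two values agree up to terms of third order in the perturbation.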

### Natural gradient variational inference [3]

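Natural gradient variational inference updates the parameters $\lambda$ of the variational distribution by natural gradient descent on $\mathcal{E}(\lambda) = KL\bigl[q_{\lambda}(\theta) \Vert \rho(\theta| y)\bigr]$: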

$$
\lambda_{n+1} = \lambda_{n} - \Delta t F_\lambda^{-1} \nabla_{\lambda}\mathcal{E}(\lambda_n)
$$

For variational distributions in a regular exponential family,
$$
q_\lambda(\theta) = h(\theta) \exp\bigl[ \langle \phi(\theta), \lambda \rangle  - A(\lambda)\bigr]
$$
we denote the expectation parameter by
$$
\mu(\lambda) := \mathbb{E}_{q_\lambda}\bigl[\phi(\theta)\bigr] = \nabla_\lambda A(\lambda)
$$
The second equality follows by differentiating the normalization condition $\int  h(\theta) \exp\bigl[ \langle \phi(\theta), \lambda \rangle \bigr] d\theta = \exp\bigl[A(\lambda)\bigr]$ with respect to $\lambda$,
$$
\begin{align*}
\nabla_\lambda \int  h(\theta) \exp\bigl[ \langle \phi(\theta), \lambda \rangle \bigr] d\theta 
&= \nabla_\lambda \exp\bigl[A(\lambda)\bigr]\\
\int  h(\theta) \exp\bigl[ \langle \phi(\theta), \lambda \rangle \bigr] \phi(\theta) d\theta 
&= \exp\bigl[A(\lambda)\bigr] \nabla_\lambda A(\lambda)
\end{align*}
$$
and dividing both sides by $\exp\bigl[A(\lambda)\bigr]$.

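By the chain rule, the gradient with respect to $\lambda$ can be written as 

$$
\nabla_{\lambda}\mathcal{E}(\lambda) =  [\nabla_\lambda \mu^T] \nabla_{\mu}\mathcal{E}(\lambda(\mu)) =  [\nabla_\lambda^2 A] \nabla_{\mu}\mathcal{E}(\lambda(\mu)) =  F_\lambda \nabla_{\mu}\mathcal{E}(\lambda(\mu))
$$

where $F_\lambda = -\mathbb{E}_{q_\lambda}\bigl[\nabla_\lambda^2 \log q_{\lambda}\bigr] = \nabla_\lambda^2 A(\lambda)$. Therefore, the natural gradient is $F_\lambda^{-1}\nabla_{\lambda}\mathcal{E}(\lambda) = \nabla_{\mu}\mathcal{E}(\lambda(\mu))$, and the update becomes

$$
\lambda_{n+1} = \lambda_{n} - \Delta t  \nabla_{\mu}\mathcal{E}(\lambda(\mu))\bigr|_{\mu = \mu_n}
$$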

For the Gaussian family, written in exponential-family form with $h(\theta) = 1$, we have
$$
\phi(\theta) = [\theta;\, \theta\theta^T]\quad
\lambda = [C^{-1}m;\, -\frac{1}{2}C^{-1}]\quad
\mu = [m;\,mm^T + C]\quad
A(\lambda) = \frac{1}{2}m^TC^{-1}m + \frac{1}{2}\log(\det(2\pi C))
$$
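
As a sanity check of the relation $\mu = \nabla_\lambda A(\lambda)$ for the Gaussian family, the sketch below differentiates the log-partition function of a one-dimensional Gaussian by central finite differences. The helper names and the chosen values of $m$ and $C$ are hypothetical; in one dimension $\lambda = (m/C, -1/(2C))$ and $A(\lambda) = -\lambda_1^2/(4\lambda_2) + \frac{1}{2}\log(-\pi/\lambda_2)$.

```python
import numpy as np

def log_partition(lam):
    """A(lambda) for a 1D Gaussian with lambda = (m / C, -1 / (2 C))."""
    lam1, lam2 = lam
    return -lam1**2 / (4.0 * lam2) + 0.5 * np.log(-np.pi / lam2)

def finite_difference_grad(fn, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (fn(x + step) - fn(x - step)) / (2.0 * eps)
    return grad

m, C = 0.7, 1.5
lam = np.array([m / C, -0.5 / C])
mu_numeric = finite_difference_grad(log_partition, lam)
mu_exact = np.array([m, m**2 + C])  # expectation parameters [E theta, E theta^2]
print(mu_numeric, mu_exact)
```

The numerical gradient matches the expectation parameters $[m;\, m^2 + C]$ up to finite-difference error.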

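Let $\lambda_0 = [C_0^{-1}m_0;\, -\frac{1}{2}C_0^{-1}]$ denote the natural parameters of the Gaussian prior $\mathcal{N}(m_0, C_0)$ (written $\mathcal{N}(m_{\rm prior}, C_{\rm prior})$ above). Since $\nabla_{\mu}\mathbb{E}_{q}[\log q] = \lambda$ and $\nabla_{\mu}\mathbb{E}_{q}[\log \rho_{\rm prior}] = \lambda_0$, the natural gradient update becomes

$$
\lambda_{n+1} = \lambda_{n} + \Delta t \Bigl[ \nabla_{\mu} \mathbb{E}_{q_n}\log\rho(y|\theta) + \lambda_0 - \lambda_{n}\Bigr] = (1 - \Delta t)\lambda_{n} + \Delta t \Bigl[ \nabla_{\mu} \mathbb{E}_{q_n}\log\rho(y|\theta) + \lambda_0 \Bigr]
$$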

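Writing $\mu = [\mu_1;\, \mu_2] = [m;\, mm^T + C]$, the chain rule gives $\nabla_{\mu_1} = \nabla_m - 2[\nabla_C]m$ and $\nabla_{\mu_2} = \nabla_C$, which leads to 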
$$
\begin{align*}
C^{-1}_{n+1}m_{n+1} &= (1 - \Delta t)C^{-1}_{n}m_{n} + \Delta t  \Bigl[ \nabla_{m} \mathbb{E}_{q_n}\log\rho(y|\theta) - 2[\nabla_{C} \mathbb{E}_{q_n}\log\rho(y|\theta)] m_n + C^{-1}_{0}m_{0}
\Bigr]\\
-\frac{1}{2}C_{n+1}^{-1} &= -(1 - \Delta t)\frac{1}{2}C_{n}^{-1} + \Delta t  \Bigl[ \nabla_{C} \mathbb{E}_{q_n}\log\rho(y|\theta) -\frac{1}{2}C_{0}^{-1} \Bigr]
\end{align*}
$$

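Multiplying the second equation by $-2$ and using it to eliminate $(1 - \Delta t)C_n^{-1}m_n$ from the first, we finally have 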

$$
\begin{align*}
C^{-1}_{n+1}m_{n+1} &= C_{n+1}^{-1}m_{n} + \Delta t  \nabla_{m} \mathbb{E}_{q_n}\log\rho(y|\theta) + \Delta t C^{-1}_{0}(m_{0} - m_n)\\
C_{n+1}^{-1} &= (1 - \Delta t)C_{n}^{-1} - \Delta t  \Bigl[ 2\nabla_{C} \mathbb{E}_{q_n}\log\rho(y|\theta) -C_{0}^{-1} \Bigr]
\end{align*}
$$
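
The final update can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation of [3]: the function name `ngvi_gaussian` and its callable arguments are hypothetical, the expectations are supplied by the caller, and the identities $\nabla_m \mathbb{E}_q[\log\rho(y|\theta)] = \mathbb{E}_q[\nabla_\theta\log\rho(y|\theta)]$ and $2\nabla_C \mathbb{E}_q[\log\rho(y|\theta)] = \mathbb{E}_q[\nabla_\theta^2\log\rho(y|\theta)]$ are used to express the gradients.

```python
import numpy as np

def ngvi_gaussian(mean_grad, mean_hess, m0, C0, dt=0.5, n_steps=100):
    """Natural-gradient updates for q_n = N(m_n, C_n), started at the prior N(m0, C0).

    mean_grad(m, C): E_q[grad log rho(y|theta)], equal to grad_m E_q[log rho(y|theta)].
    mean_hess(m, C): E_q[hess log rho(y|theta)], equal to 2 grad_C E_q[log rho(y|theta)].
    """
    C0_inv = np.linalg.inv(C0)
    m, prec = m0.copy(), C0_inv.copy()
    for _ in range(n_steps):
        C = np.linalg.inv(prec)
        g, H = mean_grad(m, C), mean_hess(m, C)  # both evaluated at q_n
        prec = (1.0 - dt) * prec + dt * (C0_inv - H)               # C_{n+1}^{-1}
        m = m + dt * np.linalg.solve(prec, g + C0_inv @ (m0 - m))  # uses C_{n+1}
    return m, np.linalg.inv(prec)

# Hypothetical linear-Gaussian likelihood y = G theta + noise, noise ~ N(0, 4 I),
# for which both expectations are available in closed form.
G = np.array([[1.0, 0.5], [0.0, 1.0]])
y = np.array([1.0, -2.0])
noise_var = 4.0
m, C = ngvi_gaussian(
    lambda m, C: G.T @ (y - G @ m) / noise_var,
    lambda m, C: -G.T @ G / noise_var,
    m0=np.zeros(2), C0=np.eye(2),
)
```

For a linear-Gaussian likelihood the precision update is a convex combination of the current precision and the exact posterior precision, so the iterates converge geometrically for $0 < \Delta t \le 1$; with $\Delta t = 1$ and initialization at the prior, a single step already returns the exact posterior.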


# References
1. [The recursive variational Gaussian approximation (R-VGA)](https://link.springer.com/article/10.1007/s11222-021-10068-w)
2. [Flexible and efficient inference with particles for the variational Gaussian approximation](https://www.mdpi.com/1099-4300/23/8/990)
3. [Natural-Gradient Variational Inference](https://mlg-blog.com/2021/04/13/ngvi-bnns-part-1.html)