# Langevin Dynamics


We will assume that the target distribution $\rho_{\rm post}(\theta|y)$, defined on $R^{N_{\theta}}$, is given by

$$
\begin{align*}
\rho_{\rm post}(\theta|y) \propto \exp\Bigl(-\Phi(\theta; y)\Bigr)
\end{align*}
$$


##  Langevin diffusion
Consider the initial value Ito process that acts on a random variable $\theta_t \in R^{N_{\theta}}$

$$
\begin{align*}
d\theta_t = -\nabla_{\theta} \Phi(\theta_t; y) dt + \sqrt{2}dW_t
\end{align*}
$$

where $W_t \in R^{N_{\theta}}$ is a standard Wiener process. Langevin dynamics for sampling uses the property that $\rho_{\rm post}(\theta|y) \propto \exp\Bigl(-\Phi(\theta; y)\Bigr)$ is the stationary distribution of the diffusion process.
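In practice the diffusion is simulated with a time discretization. The following is a minimal sketch using the Euler-Maruyama scheme (often called the unadjusted Langevin algorithm), assuming a Gaussian target so that $\nabla_{\theta}\Phi$ is available in closed form. The function names, the step size `dt`, and the target parameters are illustrative choices, not part of the original text.

```python
import numpy as np

def grad_phi(theta, mean, cov_inv):
    # Gradient of Phi(theta) = 0.5 * (theta - mean)^T cov_inv (theta - mean),
    # evaluated row-wise for an array of shape (n_chains, N_theta).
    # (Assumed Gaussian target, chosen only for illustration.)
    return (theta - mean) @ cov_inv

def langevin_sample(grad, theta0, dt, n_steps, rng):
    # Euler-Maruyama discretization of d theta = -grad Phi dt + sqrt(2) dW.
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - dt * grad(theta) + np.sqrt(2.0 * dt) * noise
    return theta

rng = np.random.default_rng(0)
mean = np.array([1.0, -1.0])
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
cov_inv = np.linalg.inv(cov)

# Run 5000 independent chains in parallel; each row is one chain.
theta0 = np.zeros((5000, 2))
samples = langevin_sample(lambda th: grad_phi(th, mean, cov_inv),
                          theta0, dt=0.01, n_steps=2000, rng=rng)

sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)
```

The empirical mean and covariance of the final states should be close to those of the target, up to Monte Carlo error and a small bias introduced by the finite step size.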

Assume the random variable at each time $t$ has a probability density $\theta_t \sim \rho_t(\theta)$. The Fokker-Planck equation describing the evolution of the probability density $\rho_t(\theta)$ is:

$$
\begin{align*}
\frac{\partial \rho_t(\theta)}{\partial t} 
&= \nabla_{\theta} \cdot \bigl[\rho_t(\theta) \nabla_{\theta} \Phi(\theta; y)\bigr] + \Delta_{\theta} \rho_t(\theta)\\
&= \nabla_{\theta} \cdot \bigl[\rho_t(\theta) \bigl(\nabla_{\theta} \Phi(\theta; y) + \nabla_{\theta} \log \rho_t(\theta)\bigr)\bigr]
\end{align*}
$$

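Intuitively, the right-hand side vanishes when $\nabla_{\theta} \log \rho_t(\theta) = -\nabla_{\theta} \Phi(\theta; y)$, which suggests

$$\lim_{t \rightarrow +\infty} \log \rho_t(\theta) = -\Phi(\theta; y) + C$$

for some normalization constant $C$.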


Moreover, the $KL$-divergence between $\rho_t$ and the posterior is non-increasing in time:

$$
\begin{align*}
\frac{\partial}{\partial t}KL\Bigl[\rho_{t}(\theta) \Vert  \rho_{\rm post}(\theta | y)\Bigr]
&= \frac{\partial}{\partial t} \int \rho_{t}(\theta)  \log \frac{\rho_{t}(\theta)}{\rho_{\rm post}(\theta | y)} d\theta \\
&= \int \nabla_{\theta} \cdot \Bigl[\rho_t(\theta) \bigl(\nabla_{\theta} \Phi(\theta; y) + \nabla_{\theta} \log \rho_t(\theta)\bigr)\Bigr] \Bigl[ \log \frac{\rho_{t}(\theta)}{\rho_{\rm post}(\theta | y)} + 1 \Bigr] d\theta \\
&= -\int \rho_t(\theta) \bigl\lVert\nabla_{\theta} \Phi(\theta; y) + \nabla_{\theta} \log \rho_t(\theta)\bigr\rVert^2 d\theta
\end{align*}
$$
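The last equality follows from integration by parts (assuming the boundary terms vanish) together with $\nabla_{\theta} \log \frac{\rho_{t}(\theta)}{\rho_{\rm post}(\theta | y)} = \nabla_{\theta} \log \rho_t(\theta) + \nabla_{\theta} \Phi(\theta; y)$. The right-hand side is non-positive, so the $KL$-divergence never increases, which drives $\rho_t$ toward the posterior probability distribution.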
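As a sanity check of this decay, consider the one-dimensional case $\Phi(\theta) = \theta^2/2$ (a standard normal target, chosen here purely for illustration). Starting from a Gaussian $\mathcal{N}(m_0, s_0)$, the density stays Gaussian with $m_t = m_0 e^{-t}$ and variance $s_t = 1 + (s_0 - 1)e^{-2t}$, so both the $KL$-divergence and the dissipation integral are available in closed form. The sketch below compares them numerically; all names and initial values are assumptions made for the example.

```python
import numpy as np

def kl_to_standard_normal(m, s):
    # KL[ N(m, s) || N(0, 1) ] for scalar mean m and variance s.
    return 0.5 * (s + m**2 - 1.0 - np.log(s))

t = np.linspace(0.0, 5.0, 1001)
m0, s0 = 3.0, 0.5                       # illustrative initial mean and variance

# Exact moments of the Ornstein-Uhlenbeck process d theta = -theta dt + sqrt(2) dW.
m_t = m0 * np.exp(-t)
s_t = 1.0 + (s0 - 1.0) * np.exp(-2.0 * t)

kl_t = kl_to_standard_normal(m_t, s_t)

# Dissipation -E[(grad Phi + grad log rho_t)^2], evaluated in closed form
# for Gaussian rho_t = N(m_t, s_t) and Phi = theta^2 / 2.
dissipation = -((s_t - 1.0) ** 2 / s_t + m_t**2)

# Finite-difference time derivative of the KL-divergence.
dkl_dt = np.gradient(kl_t, t)
```

Away from the end points of the grid, `dkl_dt` should agree with `dissipation` up to finite-difference error, and `kl_t` should decrease monotonically toward zero.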

## General initial value Ito process [1]

Consider a general initial value Ito process that acts on a random variable $\theta_t \in R^{N_{\theta}}$

$$
\begin{align*}
d\theta_t = F(t, \theta_t) dt + \sigma(t, \theta_t)dW_t
\end{align*}
$$

where $F: R_{+}\times R^{N_{\theta}} \rightarrow R^{N_{\theta}}$ is the drift term, $\sigma:R_{+}\times R^{N_{\theta}}\rightarrow R^{N_{\theta}\times M}$ is the diffusion matrix, and $W_t \in R^{M}$ is a standard Wiener process. The corresponding Fokker-Planck equation can be written as 

$$
\begin{align*}
\frac{\partial \rho_t(\theta)}{\partial t}
&= -\nabla_{\theta}\cdot\bigl[\rho_t(\theta)F(t, \theta)\bigr] + \nabla_{\theta}\cdot \bigl[\nabla_{\theta}\cdot \bigl(D(t,\theta) \rho_{t}(\theta)\bigr)\bigr] \\
&= -\nabla_{\theta}\cdot \bigl[\rho_t(\theta) \bigl( 
F(t, \theta) - D(t,\theta)\nabla_{\theta}\log \rho_{t}(\theta) - d(t, \theta) \bigr)\bigr]
\end{align*}
$$

Here $D(t,\theta) = \frac{1}{2}\sigma(t,\theta)\sigma(t,\theta)^T\in R^{N_{\theta}\times N_{\theta}}$ and $d(t,\theta) = \nabla_{\theta} \cdot D(t,\theta) \in R^{N_{\theta}}$ is its divergence.
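The second form of the Fokker-Planck equation follows from the product rule applied to the inner divergence. As a short sketch, assuming $D$ and $\rho_t$ are smooth enough to differentiate and using $\nabla_{\theta}\rho_t = \rho_t \nabla_{\theta}\log\rho_t$:

$$
\begin{align*}
\nabla_{\theta}\cdot \bigl(D(t,\theta) \rho_{t}(\theta)\bigr)
&= \rho_t(\theta)\, \nabla_{\theta}\cdot D(t,\theta) + D(t,\theta)\nabla_{\theta}\rho_t(\theta) \\
&= \rho_t(\theta) \bigl( d(t,\theta) + D(t,\theta)\nabla_{\theta}\log \rho_{t}(\theta) \bigr)
\end{align*}
$$

Substituting this expression into the diffusion term and collecting everything under a single divergence gives the flux $\rho_t \bigl(F - D\nabla_{\theta}\log\rho_t - d\bigr)$.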

When the drift is chosen as

$$
\begin{align*}
F(t, \theta) = -A_t\nabla_{\theta} \Phi(\theta;y) + (D(t,\theta) - A_t)\nabla_{\theta}\log \rho_t(\theta) + d(t,\theta)
\end{align*}
$$

where $A_t$ is a positive definite matrix, the time derivative of the $KL$-divergence becomes

$$
\begin{align*}
\frac{\partial}{\partial t}KL\Bigl[\rho_{t}(\theta) \Vert  \rho_{\rm post}(\theta | y)\Bigr]
&= \frac{\partial}{\partial t} \int \rho_{t}(\theta)  \log \frac{\rho_{t}(\theta)}{\rho_{\rm post}(\theta | y)} d\theta \\
&= -\int \nabla_{\theta} \cdot \bigl[\rho_t(\theta) \bigl( 
F(t, \theta) - D(t,\theta)\nabla_{\theta}\log \rho_{t}(\theta) - d(t, \theta) \bigr)\bigr] \Bigl[ \log \frac{\rho_{t}(\theta)}{\rho_{\rm post}(\theta | y)} + 1 \Bigr] d\theta \\
&= -\int \rho_t(\theta) \bigl(\nabla_{\theta} \Phi(\theta; y) + \nabla_{\theta} \log \rho_t(\theta)\bigr)^TA_t \bigl(\nabla_{\theta} \Phi(\theta; y) + \nabla_{\theta} \log \rho_t(\theta)\bigr) d\theta
\end{align*}
$$
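The step between the last two lines can be sketched as follows, assuming that boundary terms vanish when integrating by parts. With this choice of drift, the $D$ and $d$ contributions cancel inside the flux:

$$
\begin{align*}
F(t, \theta) - D(t,\theta)\nabla_{\theta}\log \rho_{t}(\theta) - d(t, \theta)
&= -A_t \bigl(\nabla_{\theta} \Phi(\theta; y) + \nabla_{\theta} \log \rho_t(\theta)\bigr) \\
\nabla_{\theta} \log \frac{\rho_{t}(\theta)}{\rho_{\rm post}(\theta | y)}
&= \nabla_{\theta} \log \rho_t(\theta) + \nabla_{\theta} \Phi(\theta; y)
\end{align*}
$$

Integrating by parts moves the divergence onto the logarithm, which produces the quadratic form in $A_t$. Because $A_t$ is positive definite, the integrand is non-negative and the $KL$-divergence cannot increase.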

When $D(t,\theta) - A_t \neq 0$, the drift depends on the current density $\rho_t(\theta)$, which must therefore be estimated, for example by approximating it with a parametrized family of probability distributions. For this reason the method is generally used with an ensemble of particles (variational particle flow).

## Particle flow [1]

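Given $J$ particles $\{\theta^j\}_{j=1}^{J}$, the particle flow algorithm becomes the following system, where $\xi_j$ formally denotes the white-noise forcing $dW^j_t/dt$ acting on particle $j$: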

$$
\begin{align*}
\frac{d\theta^j_t}{dt} =  -A_t\nabla_{\theta} \Phi(\theta^j_t;y) + (D(t,\theta^j_t) - A_t)\nabla_{\theta}\log \rho_t(\theta^j_t) + d(t,\theta^j_t) + \sigma(t, \theta^j_t)\xi_j
\end{align*}
$$
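To make the particle update concrete, here is a minimal sketch of one special case: $A_t = I$ and $\sigma = 0$ (so $D = 0$ and $d = 0$), with $\rho_t$ approximated by a Gaussian fitted to the ensemble, so that $\nabla_{\theta}\log\rho_t(\theta) \approx -C_t^{-1}(\theta - m_t)$. These choices, the Gaussian target, the forward-Euler time stepping, and all function names are assumptions made for illustration; they are not taken from [1].

```python
import numpy as np

def gaussian_particle_flow(particles, grad_phi, dt, n_steps):
    # Forward-Euler integration of
    #   d theta^j / dt = -grad Phi(theta^j) - grad log rho_t(theta^j),
    # i.e. the particle flow with A_t = I and sigma = 0 (assumed special case),
    # where rho_t is replaced by a Gaussian fitted to the ensemble.
    theta = np.array(particles, dtype=float)          # shape (J, N_theta)
    for _ in range(n_steps):
        m = theta.mean(axis=0)                        # ensemble mean
        C = np.cov(theta, rowvar=False)               # ensemble covariance
        # Gaussian score: grad log rho_t(theta) = -C^{-1} (theta - m)
        score = -np.linalg.solve(C, (theta - m).T).T
        theta = theta + dt * (-grad_phi(theta) - score)
    return theta

# Illustrative Gaussian posterior: Phi = 0.5 (theta - mu)^T Sigma^{-1} (theta - mu).
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

rng = np.random.default_rng(1)
initial = rng.standard_normal((50, 2))                # J = 50 particles

final = gaussian_particle_flow(initial, lambda th: (th - mu) @ Sigma_inv,
                               dt=0.01, n_steps=3000)

ensemble_mean = final.mean(axis=0)
ensemble_cov = np.cov(final, rowvar=False)
```

For this Gaussian target the ensemble mean and covariance should relax to `mu` and `Sigma`. The sketch needs $J > N_{\theta}$ so that the ensemble covariance is invertible.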


Let $\Theta \in R^{J N_\theta}$ be the vertical concatenation of $\{\theta^j\}_{j=1}^{J}$. When we treat $\{\theta^j\}_{j=1}^{J}$ as a single high-dimensional system of interacting particles, the posterior distribution factorizes as

$$
\begin{align*}
\rho_{\rm post}(\Theta|y) =  \rho_{\rm post}(\theta^1|y)\rho_{\rm post}(\theta^2|y)\cdots\rho_{\rm post}(\theta^J|y)
\end{align*}
$$

The Ito process becomes

$$
\begin{align*}
d\Theta_t = F(t, \Theta_t) dt + \sigma(t, \Theta_t)dW_t
\end{align*}
$$

where $F: R_{+}\times R^{JN_{\theta}} \rightarrow R^{JN_{\theta}}$ is the drift term, $\sigma:R_{+}\times R^{JN_{\theta}}\rightarrow R^{JN_{\theta}\times M}$ is the diffusion matrix, and $W_t \in R^{M}$ is a standard Wiener process. The optimal drift for each particle reads 

$$
\begin{align*}
F(t, \theta^j, \Theta) 
&= -A_t\nabla_{\theta} \Phi(\theta^j;y) + (D(t,\theta^j) - A_t)\nabla_{\theta^j}\log \rho_t(\Theta) + d(t,\theta^j)\\
&= -A_t\nabla_{\theta} \Phi(\theta^j;y) + (D(t,\theta^j) - A_t)\nabla_{\theta}\log \rho_t(\theta^j) + d(t,\theta^j) 
+ (D(t,\theta^j) - A_t)\nabla_{\theta^j}\bigl(\log \rho_t(\Theta) - \log \rho_t(\theta^j)\bigr)
\end{align*}
$$

The additional term $(D(t,\theta^j) - A_t)\nabla_{\theta^j}\bigl(\log \rho_t(\Theta) - \log \rho_t(\theta^j)\bigr)$ pushes the particle away from the ensemble mean. However, this term is difficult to evaluate. The solution proposed in [1] is to minimize, instead of the $KL$-divergence, the regularized functional

$$
\begin{align*}
\widehat{KL}\Bigl[\rho_{t}(\theta) \Vert  \rho_{\rm post}(\theta | y)\Bigr] = {KL}\Bigl[\rho_{t}(\theta) \Vert  \rho_{\rm post}(\theta | y)\Bigr] + \beta I_t 
\end{align*}
$$

Here $I_t$ denotes the mutual information between particles; the smaller it is, the closer the particles are to being independent random variables:

$$
\begin{align*}
I_t = \int\int \rho_t(\theta, \hat{\theta})\log\Bigl(\frac{\rho_t(\theta, \hat{\theta})}{\rho_t(\theta)\rho_t(\hat{\theta})}\Bigr) d\theta d\hat{\theta}
\end{align*}
$$
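To illustrate how mutual information measures dependence, the sketch below evaluates the standard (non-negative) definition for a discrete joint distribution, which is an assumed stand-in for the continuous densities used here. The helper `mutual_information` and the two example tables are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    # I = sum_{a,b} p(a,b) * log( p(a,b) / (p(a) p(b)) ) for a discrete joint table.
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                  # normalize to a probability table
    pa = joint.sum(axis=1, keepdims=True)        # marginal of the first variable
    pb = joint.sum(axis=0, keepdims=True)        # marginal of the second variable
    mask = joint > 0                             # convention: 0 * log 0 = 0
    ratio = joint[mask] / (pa * pb)[mask]
    return float(np.sum(joint[mask] * np.log(ratio)))

# Independent pair: the joint table is the outer product of its marginals.
independent = np.outer([0.3, 0.7], [0.4, 0.6])
# Strongly dependent pair: most mass sits on the diagonal.
correlated = np.array([[0.45, 0.05], [0.05, 0.45]])

mi_independent = mutual_information(independent)
mi_correlated = mutual_information(correlated)
```

The independent table gives a value that is zero up to rounding, while the dependent table gives a strictly positive value, which is why penalizing $I_t$ favors independent particles.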

A heuristic formula for the regularization potential $\kappa$ can be used.
The goal is to move the particles closer to independent samples from the posterior.

## Gaussian approximation of Langevin dynamics

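Consider again the general initial value Ito process $d\theta_t = F(t, \theta_t) dt + \sigma(t, \theta_t)dW_t$ acting on a random variable $\theta_t \in R^{N_{\theta}}$, with $D(t,\theta) = \frac{1}{2}\sigma(t,\theta)\sigma(t,\theta)^T$. The corresponding Fokker-Planck equation can be written as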

$$
\begin{align*}
\frac{\partial \rho_t(\theta)}{\partial t}
&= -\nabla_{\theta}\cdot\bigl[\rho_t(\theta)F(t, \theta)\bigr] + \nabla_{\theta}\cdot \bigl[\nabla_{\theta}\cdot \bigl(D(t,\theta) \rho_{t}(\theta)\bigr)\bigr]
\end{align*}
$$

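Denote the mean and covariance of $\theta_t$ by

$$m_t = \mathbb{E} [\theta_t] \qquad C_t = \mathbb{E}[(\theta_t - m_t)(\theta_t - m_t)^T]$$

The mean and covariance matrix evolve according to

$$
\frac{d}{dt} m_t = \mathbb{E}_{\rho_t} [F(t, \theta)] \qquad 
\frac{d}{dt} C_t = \mathbb{E}_{\rho_t} [F(t, \theta)(\theta - m_t)^T] + \mathbb{E}_{\rho_t} [(\theta - m_t)F(t, \theta)^T] + 2\mathbb{E}_{\rho_t}[D(t,\theta)]
$$

The factor $2$ in the last term arises because the noise contributes $\sigma\sigma^T = 2D$ to the covariance. The Gaussian approximation closes these equations by evaluating the expectations under $\rho_t \approx \mathcal{N}(m_t, C_t)$.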
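As a sketch of how these moment equations can be integrated, consider a linear drift $F(\theta) = -\Sigma^{-1}(\theta - \mu)$ with $\sigma = \sqrt{2} I$, that is, Langevin dynamics for a Gaussian posterior $\mathcal{N}(\mu, \Sigma)$. In this assumed special case the expectations are available in closed form and the noise enters the covariance equation as $\sigma\sigma^T = 2D = 2I$. The function name, the forward-Euler time stepping, and the parameter values are illustrative only.

```python
import numpy as np

def gaussian_moment_flow(m0, C0, mu, Sigma, dt, n_steps):
    # Forward-Euler integration of the mean and covariance equations for
    # the linear drift F(theta) = -Sigma^{-1} (theta - mu), sigma = sqrt(2) I.
    Sigma_inv = np.linalg.inv(Sigma)
    m = np.array(m0, dtype=float)
    C = np.array(C0, dtype=float)
    sigma_sigma_T = 2.0 * np.eye(len(m))          # noise contribution, equal to 2 D
    for _ in range(n_steps):
        # E[F] = -Sigma^{-1} (m - mu)
        dm = -Sigma_inv @ (m - mu)
        # E[F (theta - m)^T] + E[(theta - m) F^T] = -Sigma^{-1} C - C Sigma^{-1}
        dC = -Sigma_inv @ C - C @ Sigma_inv + sigma_sigma_T
        m = m + dt * dm
        C = C + dt * dC
    return m, C

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

m_final, C_final = gaussian_moment_flow(m0=np.zeros(2), C0=np.eye(2),
                                        mu=mu, Sigma=Sigma,
                                        dt=0.01, n_steps=3000)
```

The stationary point of these equations is $m = \mu$ and $C = \Sigma$, so the integrated moments should approach the posterior mean and covariance.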



# Reference
1. [Ensemble Variational Fokker-Planck Methods for Data Assimilation](https://arxiv.org/abs/2111.13926)