# Wasserstein Gradient Flow

Define the KL divergence functional on the density space $\mathcal{P}$ as
$$
\begin{align*}
\mathcal{E}(\rho) = KL\Bigl[\rho \Big\Vert  \rho_{\rm post}\Bigr] = \int\rho\log \frac{\rho}{\rho_{\rm post}} d\theta
\end{align*}
$$

Its functional derivative is

$$
\begin{align*}
\frac{\delta \mathcal{E}}{\delta \rho} 
=  \log \rho  - \log \rho_{\rm post}  + \textrm{const.}
\end{align*}
$$

Define the **Wasserstein metric** $M(\rho)$ through its inverse, which acts on cotangent vectors $\psi$ as

$$
M(\rho)^{-1} \psi = - \nabla\cdot \bigl( \rho\nabla_{\theta}\psi \bigr), \qquad  \psi \in T_{\rho}^{*}\mathcal{P}
$$


The Wasserstein gradient flow of the KL divergence is

$$
\begin{align*}
\frac{\partial \rho_t(\theta)}{\partial t} 
&= - M(\rho_t)^{-1}\frac{\delta \mathcal{E}}{\delta \rho}\Big|_{\rho = \rho_t} =  -\nabla_{\theta}\cdot\Bigl(\rho_t \nabla_{\theta}\bigl(\log \rho_{\rm post} - \log \rho_{t} \bigr) \Bigr)
\end{align*}
$$

This is the Fokker-Planck equation of the Langevin dynamics 
$$
d\theta_t = \nabla_{\theta} \log \rho_{\rm post}(\theta_t) dt + \sqrt{2}dW_t,
$$
which can be used for sampling.
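
The sampling claim can be illustrated with a minimal sketch, assuming an Euler-Maruyama discretization of the Langevin dynamics (often called the unadjusted Langevin algorithm). The function name `langevin_samples`, the step size, and the one-dimensional Gaussian target $\mathcal{N}(1, 1)$ are illustrative assumptions, not part of the derivation above. The discretization introduces a small bias that shrinks with the step size.

```python
import math
import random

def langevin_samples(grad_log_post, theta0, step, n_steps, n_particles, seed=0):
    """Euler-Maruyama steps for d theta = grad log rho_post(theta) dt + sqrt(2) dW."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step)  # standard deviation of sqrt(2) dW over one step
    particles = [theta0] * n_particles
    for _ in range(n_steps):
        particles = [
            th + step * grad_log_post(th) + noise_scale * rng.gauss(0.0, 1.0)
            for th in particles
        ]
    return particles

# Hypothetical target: rho_post = N(1, 1), so grad log rho_post(theta) = -(theta - 1).
samples = langevin_samples(lambda th: -(th - 1.0), theta0=2.0, step=0.01,
                           n_steps=400, n_particles=2000)
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

All particles start from the same point, so the printed mean and variance approaching those of the target is a rough indication that the law of $\theta_t$ has relaxed toward $\rho_{\rm post}$.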



# Markov Semigroups

The Langevin dynamics $\{\theta_t\}_{t\geq0}$ is a Markov process, which can be studied as a semigroup $P = \{P_t\}_{t\geq 0}$

$$P_t f(\theta) = \mathbb{E}[f(\theta_t) | \theta_0 = \theta] := \int f(\theta') p_t(\theta, \theta') d\theta'$$

Here $p_t(\theta, \theta')$ is the associated transition probability kernel. By Jensen's inequality, for every convex function $\phi$, $P_t(\phi(f)) \geq \phi(P_t f)$.

By duality, the semigroup $\{P_t\}_{t\geq 0}$ also acts on measures $\nu$:
$$ \int P_t f d\nu_0 = \int f d \nu_t \qquad d\nu_t := dP_t^{*} \nu_0 $$
The invariant distribution $\mu$ satisfies 
$$ \int P_t f d\mu = \int f d \mu \qquad \textrm{i.e.} \quad P_t^{*} \mu = \mu. $$

Ergodicity refers to the convergence of $P_t f(\theta)$ to $\int fd\mu$ as $t\rightarrow \infty$. For the Langevin dynamics, this means that the law of $\theta_t$ converges to the invariant probability measure $\mu$ as $t\rightarrow \infty$.


Its **infinitesimal generator** $L$ is defined as $Lf = \lim_{t\rightarrow 0} \frac{P_t f - f}{t}$ and satisfies
$$\partial_t P_t = LP_t = P_t L$$
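For the Langevin dynamics, with $\rho_t$ denoting the density of $\nu_t$, 
\begin{align} 
\partial_t \int P_t f d\nu_0 
&
= \int f \partial_t\rho_t d \theta  = -\int f \nabla_{\theta}\cdot\Bigl(\rho_t \nabla_{\theta}\bigl(\log \rho_{\rm post} - \log \rho_{t} \bigr) \Bigr) d \theta \\
&
= \int \nabla_{\theta} f \cdot\Bigl(\rho_t \nabla_{\theta}\bigl(\log \rho_{\rm post} - \log \rho_{t} \bigr) \Bigr) d \theta 
= \int \Bigl( \rho_t \nabla_{\theta} f \cdot \nabla_{\theta}\log \rho_{\rm post} - \nabla_{\theta} f \cdot \nabla_{\theta}\rho_t \Bigr) d \theta \\
&
= \int \Bigl( \nabla_{\theta}\cdot\nabla_{\theta} f  + \nabla_{\theta}\log \rho_{\rm post} \cdot \nabla_{\theta} f \Bigr) \rho_t d \theta
\end{align}
The left-hand side also equals $\int Lf d\nu_t$. Therefore, we have
$$Lf = \nabla_{\theta}\cdot\nabla_{\theta} f  + \nabla_{\theta} \log \rho_{\rm post} \cdot \nabla_{\theta} f $$
where $\nabla_{\theta}\cdot\nabla_{\theta} f$ is the Laplacian of $f$.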

For the invariant distribution, we have 
$$\int Lf d\mu = 0$$
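
As a quick sanity check, here is a worked example under the assumption (not made in the text above) of a one-dimensional standard Gaussian posterior, $\log \rho_{\rm post}(\theta) = -\theta^2/2 + \textrm{const.}$ The generator then reduces to the Ornstein-Uhlenbeck operator, and for the test function $f(\theta) = \theta^2$ the invariance identity can be verified by hand:

$$
\begin{align*}
Lf &= f'' - \theta f' = 2 - 2\theta^2, \\
\int Lf d\mu &= 2 - 2\int \theta^2 d\mu = 2 - 2 = 0.
\end{align*}
$$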


The Markov semigroup $P$ is said to be symmetric with respect to the invariant measure $\mu$, or $\mu$ is reversible for $P$, if for all functions $f$ and $g$
$$\int f P_t g d\mu = \int g P_t f d\mu $$
This is the detailed balance condition $p_t(x,dy)\mu(dx) = p_t(y,dx)\mu(dy)$, and it implies $\int f L g d\mu = \int gL f d\mu $ (that is, $L$ is self-adjoint in $L^2(\mu)$).

# Local property

## Carre du Champ Operators


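The carré du champ operator is the symmetric bilinear map 
$$\Gamma(f,g) = \frac{1}{2}[L(fg) - fLg - gLf]$$
The inequality $P_t(f^2) \geq (P_t f)^2$ holds with equality at $t=0$, so differentiating it at $t = 0$ gives $L(f^2) \geq 2fLf$, that is, $\Gamma(f,f)\geq 0$.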

For the Langevin dynamics, it becomes
$$\Gamma(f,g) = \nabla_{\theta}f \cdot \nabla_{\theta}g$$
When the Markov semigroup is symmetric, we have
$$\int \Gamma(f,g)d\mu = -\int f Lg d\mu$$
And hence, $-L$ is positive semidefinite.
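
The identity $\Gamma(f,g) = \nabla_{\theta}f \cdot \nabla_{\theta}g$ can be checked numerically. The sketch below assumes a one-dimensional standard Gaussian target and approximates the generator by central finite differences; the helper names `generator` and `carre_du_champ`, the test functions, and the evaluation point are illustrative choices.

```python
import math

def generator(f, grad_log_post, h=1e-3):
    """Return x -> (L f)(x) = f''(x) + (log rho_post)'(x) f'(x), via central differences."""
    def Lf(x):
        d1 = (f(x + h) - f(x - h)) / (2.0 * h)
        d2 = (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)
        return d2 + grad_log_post(x) * d1
    return Lf

def carre_du_champ(f, g, grad_log_post):
    """Return x -> Gamma(f, g)(x) = (L(fg) - f Lg - g Lf)(x) / 2."""
    L_fg = generator(lambda x: f(x) * g(x), grad_log_post)
    L_f = generator(f, grad_log_post)
    L_g = generator(g, grad_log_post)
    return lambda x: 0.5 * (L_fg(x) - f(x) * L_g(x) - g(x) * L_f(x))

grad_log_post = lambda x: -x           # assumed target: standard Gaussian
f, g = math.sin, lambda x: x * x       # arbitrary smooth test functions
x0 = 0.7
gamma_numeric = carre_du_champ(f, g, grad_log_post)(x0)
gamma_exact = math.cos(x0) * 2.0 * x0  # f'(x0) * g'(x0)
print(gamma_numeric, gamma_exact)
```

The drift term cancels in $\Gamma$, which is why the result does not depend on $\rho_{\rm post}$.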

## Iterated Carre du Champ Operators
Replacing the product operation by $\Gamma$ leads to 
$$\Gamma_2(f,g) = \frac{1}{2}[L\Gamma(f,g) - \Gamma(f, Lg) - \Gamma(g,Lf)]$$
For the Langevin dynamics, it becomes
$$\Gamma_2(f,g) = \nabla_{\theta}\nabla_{\theta}f:\nabla_{\theta}\nabla_{\theta}g - \nabla_{\theta}f^T \nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post} \nabla_{\theta}g$$


## Curvature-dimension condition
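At the core of functional inequalities is the curvature-dimension condition $\textrm{CD}(\rho, n)$, for $\rho\in\mathbb{R}$ and $n\in[1,\infty]$. Here and below, $\rho$ without a subscript denotes the curvature constant (not a density), and $\Gamma(f)$, $\Gamma_2(f)$ abbreviate $\Gamma(f,f)$, $\Gamma_2(f,f)$: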

$$\Gamma_2(f) \geq \rho \Gamma(f) + \frac{1}{n}(Lf)^2$$
($\mu$-almost everywhere).
Under the curvature-dimension condition $\textrm{CD}(\rho,\infty)$, the inequality $\Gamma_2(g) \geq \rho \Gamma(g)$ applied to $g=af_1 + b f_2 f_3$ with arbitrary $a$ and $b$ yields a nonnegative quadratic form in $(a,b)$, and the nonnegativity of its determinant implies that 

$$4\Gamma(f)[\Gamma_2(f) - \rho\Gamma(f)] \geq \Gamma(\Gamma(f))$$
which is referred to as the reinforced $\textrm{CD}(\rho,\infty)$ inequality.

## Dirichlet form
For a reversible measure $\mu$, the Dirichlet form (not to be confused with the KL functional $\mathcal{E}(\rho)$ of the first section) is
$$\mathcal{E}(f,g) = \int \Gamma(f,g)d\mu = -\int fLg d\mu$$

# Gradient bound

For a reversible measure $\mu$, the condition $\mathrm{CD}(\rho,\infty)$ implies the gradient bounds

$$\Gamma(P_t f) \leq e^{-2 \rho t} P_t(\Gamma(f))\quad \textrm{and} \quad \sqrt{\Gamma(P_t f)} \leq e^{-\rho t} P_t(\sqrt{\Gamma(f)})$$


**Formal proof**. For the first inequality, define 

$$\Lambda(s) = P_s(\Gamma(P_{t-s}f)) \qquad s\in [0,t]$$

Then we have $\Lambda(0) = \Gamma(P_{t}f)$ and $\Lambda(t) = P_t(\Gamma(f))$. It suffices to prove that $\Lambda' \geq 2\rho \Lambda$, since Grönwall's inequality then gives $\Lambda(t) \geq e^{2\rho t}\Lambda(0)$. We have 

$$\Lambda'(s) = LP_s(\Gamma(P_{t-s}f)) - 2P_s\Gamma(P_{t-s}f, LP_{t-s}f)  $$

Rewriting this identity with $g = P_{t-s}f$,

$$\Lambda'(s) = LP_s(\Gamma(g)) - 2P_s\Gamma(g, Lg) = P_s(L\Gamma(g) - 2\Gamma(g, Lg)) = 2P_s(\Gamma_2(g))  \geq 2P_s(\rho\Gamma(g)) =  2\rho \Lambda(s)$$


For the second inequality, define 

$$\Lambda(s) = P_s(\sqrt{\Gamma(P_{t-s}f)}) \qquad s\in [0,t]$$

Then we have $\Lambda(0) = \sqrt{\Gamma(P_{t}f)}$ and $\Lambda(t) = P_t(\sqrt{\Gamma(f)})$. It suffices to prove that $\Lambda' \geq \rho \Lambda$, since Grönwall's inequality then gives $\Lambda(t) \geq e^{\rho t}\Lambda(0)$. We have 

\begin{align}
\Lambda'(s) 
&= LP_s(\sqrt{\Gamma(P_{t-s}f)}) - P_s\Bigl(\Gamma(P_{t-s}f)^{-1/2}\,\Gamma(P_{t-s}f, LP_{t-s}f)\Bigr) \\
&= P_s\Bigl( \Gamma(P_{t-s}f)^{-1/2} \bigl( \Gamma_2(P_{t-s}f) - \frac{\Gamma(\Gamma(P_{t-s}f))}{4\Gamma(P_{t-s}f)}\bigr)\Bigr)
\end{align}

Rewriting this identity with $g = P_{t-s}f$,

$$\Lambda'(s) = P_s\Bigl( \Gamma(g)^{-1/2} \bigl( \Gamma_2(g) - \frac{\Gamma(\Gamma(g))}{4\Gamma(g)}\bigr)\Bigr)\geq \rho P_s(\sqrt{\Gamma(g)}) = \rho \Lambda(s)$$

where we used the reinforced $\textrm{CD}(\rho, \infty)$ inequality.
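
Both bounds can be checked by hand in a simple case. The following worked example assumes a one-dimensional standard Gaussian invariant measure (so that $\textrm{CD}(1,\infty)$ holds) and uses the Mehler representation of the Ornstein-Uhlenbeck semigroup, $P_t f(\theta) = \mathbb{E}\bigl[f\bigl(e^{-t}\theta + \sqrt{1 - e^{-2t}} Z\bigr)\bigr]$ with $Z \sim \mathcal{N}(0,1)$; neither assumption appears in the proof above. For $f(\theta) = \theta^2$:

$$
\begin{align*}
P_t f(\theta) &= e^{-2t}\theta^2 + 1 - e^{-2t}, \qquad \Gamma(P_t f) = 4e^{-4t}\theta^2, \\
e^{-2t}P_t(\Gamma(f)) &= 4e^{-2t}\bigl(e^{-2t}\theta^2 + 1 - e^{-2t}\bigr) = 4e^{-4t}\theta^2 + 4e^{-2t}(1 - e^{-2t}) \geq \Gamma(P_t f), \\
e^{-t}P_t\bigl(\sqrt{\Gamma(f)}\bigr) &= 2e^{-t}\,\mathbb{E}\bigl\lvert e^{-t}\theta + \sqrt{1 - e^{-2t}} Z \bigr\rvert \geq 2e^{-2t}\lvert\theta\rvert = \sqrt{\Gamma(P_t f)}.
\end{align*}
$$

The last line uses Jensen's inequality $\mathbb{E}\lvert X \rvert \geq \lvert \mathbb{E} X \rvert$.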


# Ergodicity

In probability theory, ergodic properties usually relate to long-time behavior. For a Markov process $\{\theta_t ; t \geq 0\}$, one generally expects time averages to converge almost surely to the average under the invariant measure:
$$
\begin{align*}
\lim_{t \rightarrow \infty} \frac{1}{t}\int_0^t f(\theta_s) ds = \int f d\mu
\end{align*}
$$
In a less ambitious approach, ergodicity means that $P_t f(\theta)$ converges to $\int f d\mu$ for any $\theta$. Since $P_t = e^{tL}$, formally, $P_t f$ converges in $L^2(\mu)$ to the projection of $f$ onto the space of functions satisfying $Lf=0$. Hence $L$ is called ergodic if $Lf = 0$ implies that $f$ is constant.


# Logconcave density 
A strongly log-concave distribution satisfies $d\mu = \rho_{\rm post}(\theta) d\theta \propto e^{-W} d\theta$ with $\nabla_{\theta}\nabla_{\theta} W \geq \rho I$ for some $\rho > 0$. In this section, we study several inequalities for such invariant distributions. These inequalities underpin the convergence of the Langevin dynamics for sampling.
For the Langevin dynamics, since $-\nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post} = \nabla_{\theta}\nabla_{\theta}W$ and the Hessian term $\nabla_{\theta}\nabla_{\theta}f:\nabla_{\theta}\nabla_{\theta}f$ is nonnegative, the $\textrm{CD}(\rho, \infty)$ condition holds:
\begin{align*}
\Gamma_2(f,f) = \nabla_{\theta}\nabla_{\theta}f:\nabla_{\theta}\nabla_{\theta}f - \nabla_{\theta}f^T \nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post} \nabla_{\theta}f \geq
\nabla_{\theta}f^T \nabla_{\theta}\nabla_{\theta}W \nabla_{\theta}f \geq  \rho \nabla_{\theta}f^T \nabla_{\theta} f = \rho \Gamma(f,f)
\end{align*}
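
As an illustration, consider a Gaussian posterior $\mathcal{N}(m, \Sigma)$; this specific target is an assumption made only for the example. Then

$$
\begin{align*}
W(\theta) = \frac{1}{2}(\theta - m)^T\Sigma^{-1}(\theta - m), \qquad \nabla_{\theta}\nabla_{\theta}W = \Sigma^{-1} \geq \frac{1}{\lambda_{\max}(\Sigma)} I,
\end{align*}
$$

so $\textrm{CD}(\rho,\infty)$ holds with $\rho = 1/\lambda_{\max}(\Sigma)$. The constant cannot be improved: for the linear function $f(\theta) = v^T\theta$ with $v$ an eigenvector of $\Sigma$ for $\lambda_{\max}(\Sigma)$, the Hessian term vanishes and $\Gamma_2(f) = v^T\Sigma^{-1}v = \rho\, \Gamma(f)$.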



## Poincare Inequalities
Define the variance of a function $f$ in $L^2(\mu)$ as 
\begin{align*}
Var_{\mu}(f) = \int f^2 d\mu - \Bigl(\int f d\mu\Bigr)^2
\end{align*}
A Markov process with invariant measure $\mu$ is said to satisfy a Poincaré (or spectral gap) inequality with constant $C>0$ if, writing $\mathcal{E}(f) = \mathcal{E}(f,f)$ for the Dirichlet form, 
\begin{align*}
Var_{\mu}(f) \leq C\mathcal{E}(f)
\end{align*}


For the Langevin dynamics, when the invariant distribution satisfies $\textrm{CD}(\rho, \infty)$, the gradient bound gives

$$\Gamma(P_t f) \leq e^{-2 \rho t} P_t(\Gamma(f))$$

Consider $\Lambda(s) = P_s\bigl((P_{t-s}f)^2\bigr)$, $s \in [0,t]$. By the definition of $\Gamma$, we have

$$\Lambda'(s) =  LP_s\bigl((P_{t-s}f)^2\bigr) - P_s\bigl(2P_{t-s}f \, L P_{t-s} f\bigr) = 2P_s(\Gamma(P_{t-s} f))$$

Then we have

$$ P_t\bigl( f^2\bigr) - (P_tf)^2 = \Lambda(t) - \Lambda(0) =  \int_0^t \Lambda'(s) ds = \int_0^t 2P_s(\Gamma(P_{t-s}f)) ds  \leq \int_0^t 2 e^{-2 \rho (t-s)} P_s( P_{t-s}\Gamma(f)) ds  = \frac{1 - e^{-2\rho t}}{\rho} P_t(\Gamma(f))$$


Using ergodicity and letting $t \rightarrow \infty$ leads to the Poincaré inequality with constant $1/\rho$:

$$ \int  f^2 d\mu - \Bigl(\int f d\mu \Bigr)^2  \leq \frac{1 }{\rho} \int \Gamma(f) d\mu = \frac{1 }{\rho} \int \nabla f \cdot \nabla f d\mu $$
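
The Poincaré inequality can be checked numerically in a simple case. The sketch below assumes a one-dimensional Gaussian invariant measure $\mathcal{N}(0, \sigma^2)$, for which $\rho = 1/\sigma^2$, and evaluates both sides by trapezoid-rule quadrature; the helper `gaussian_expectation`, the value of $\sigma$, and the test function $f = \sin$ are illustrative choices.

```python
import math

def gaussian_expectation(func, sigma=1.0, half_width=10.0, n=4001):
    """Trapezoid-rule approximation of E[func(theta)] for theta ~ N(0, sigma^2)."""
    a, b = -half_width * sigma, half_width * sigma
    dx = (b - a) / (n - 1)
    total = 0.0
    for i in range(n):
        x = a + i * dx
        weight = 0.5 if i in (0, n - 1) else 1.0
        density = math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += weight * func(x) * density
    return total * dx

sigma = 1.5                     # assumed target N(0, sigma^2)
rho = 1.0 / sigma ** 2          # curvature constant: W'' = 1 / sigma^2
f, df = math.sin, math.cos      # test function and its derivative

variance = (gaussian_expectation(lambda x: f(x) ** 2, sigma)
            - gaussian_expectation(f, sigma) ** 2)
dirichlet = gaussian_expectation(lambda x: df(x) ** 2, sigma)  # int Gamma(f) d mu
poincare_rhs = dirichlet / rho
print(f"Var = {variance:.6f} <= {poincare_rhs:.6f} = (1/rho) * int Gamma(f) d mu")
```

For this test function the variance has the closed form $(1 - e^{-2\sigma^2})/2$, which gives an independent check of the quadrature.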


## Log Sobolev Inequalities

Define the entropy of a positive function $f$ with $\int f \lvert \log f \rvert d \mu < \infty$ as 

\begin{align*}
Ent_{\mu}(f) = \int f \log f d\mu - \int f d\mu  \log \Bigl( \int f d\mu \Bigr)
\end{align*}

A Markov process with invariant measure $\mu$ is said to satisfy a logarithmic Sobolev inequality with constant $C>0$ if 

\begin{align*}
Ent_{\mu}(f) \leq 2C\int \frac{\Gamma(f)}{f} d\mu
\end{align*}


For the Langevin dynamics, when the invariant distribution satisfies $\textrm{CD}(\rho, \infty)$, the strong gradient bound gives

$$\sqrt{\Gamma(P_t f)} \leq e^{-\rho t} P_t(\sqrt{\Gamma(f)})$$

Consider $\Lambda(s) = P_s\bigl(\psi(P_{t-s}f)\bigr)$, $s \in [0,t]$, with $\psi(r) = r\log r$. Since $\psi''(r) = 1/r$, we have

$$
\begin{align*}
\Lambda'(s) &=  LP_s\bigl(\psi(P_{t-s}f)\bigr) - P_s\bigl(\psi'(P_{t-s}f) L P_{t-s} f\bigr) = P_s\Bigl(\frac{\Gamma(P_{t-s}f)}{P_{t-s}f}\Bigr) \\
&\leq e^{-2\rho (t-s)} P_s\Bigl(\frac{\bigl(P_{t-s}\sqrt{\Gamma(f)}\bigr)^2}{P_{t-s}f}\Bigr) \leq e^{-2\rho (t-s)} P_s\Bigl(P_{t-s} \frac{\Gamma(f)}{f}\Bigr) = e^{-2\rho (t-s)} P_t\Bigl(\frac{\Gamma(f)}{f}\Bigr)
\end{align*}
$$

The last inequality follows from the Cauchy-Schwarz inequality $\bigl(P_{t-s}\sqrt{\Gamma(f)}\bigr)^2 \leq P_{t-s}\bigl(\frac{\Gamma(f)}{f}\bigr) P_{t-s}f$. Then we have

$$ P_t\bigl( \psi(f) \bigr) - \psi(P_tf) = \Lambda(t) - \Lambda(0) =  \int_0^t \Lambda'(s) ds \leq \int_0^t e^{-2\rho (t-s)} P_t\Bigl(\frac{\Gamma(f)}{f}\Bigr) ds    = \frac{1 - e^{-2\rho t}}{2\rho} P_t\Bigl(\frac{\Gamma(f)}{f}\Bigr)$$


Using ergodicity and letting $t \rightarrow \infty$ leads to the logarithmic Sobolev inequality

\begin{align*}
Ent_{\mu}(f) \leq \frac{1}{2\rho}\int \frac{\Gamma(f)}{f} d\mu
\end{align*}
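
The constant $\frac{1}{2\rho}$ is attained in a simple case. As a worked example, assume a one-dimensional standard Gaussian invariant measure (so $\rho = 1$) and take $f(\theta) = e^{a\theta}$ with $a \in \mathbb{R}$; both choices are illustrative. Using $\int e^{a\theta} d\mu = e^{a^2/2}$ and $\int \theta e^{a\theta} d\mu = a e^{a^2/2}$:

$$
\begin{align*}
Ent_{\mu}(f) &= a\int \theta e^{a\theta} d\mu - e^{a^2/2}\cdot\frac{a^2}{2} = \frac{a^2}{2}e^{a^2/2}, \\
\frac{1}{2\rho}\int \frac{\Gamma(f)}{f} d\mu &= \frac{1}{2}\int a^2 e^{a\theta} d\mu = \frac{a^2}{2}e^{a^2/2}.
\end{align*}
$$

Both sides coincide, so exponential functions saturate the inequality for the Gaussian measure.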

When $d\mu$ satisfies the log Sobolev inequality with constant $C$, $h\,d\mu$ with $\frac{1}{b} \leq h \leq b$ satisfies the log Sobolev inequality with constant $Cb^2$.

# Logconcave Sampling

Recall that the Wasserstein gradient flow of the KL divergence is

$$
\begin{align*}
\frac{\partial \rho_t(\theta)}{\partial t} 
&=  -\nabla_{\theta}\cdot\Bigl(\rho_t \nabla_{\theta}\bigl(\log \rho_{\rm post} - \log \rho_{t} \bigr) \Bigr)
\end{align*}
$$

Assume $-\nabla_{\theta}\nabla_{\theta}\log\rho_{\rm post} \geq \rho I$, and let $\mu$ denote the measure with density $\rho_{\rm post}$. Integrating by parts and applying the logarithmic Sobolev inequality to $\frac{\rho_t}{\rho_{\rm post}}$, the KL divergence satisfies

\begin{align*}
-\frac{\partial KL[\rho_t \Vert \rho_{\rm post}]}{\partial t} 
&= \int \rho_t \nabla_{\theta}\bigl(\log \rho_{\rm post} - \log \rho_{t} \bigr) \cdot \nabla_{\theta}\bigl(\log \rho_{\rm post} - \log \rho_{t} \bigr) d\theta \\
&= \int \rho_{\rm post }\frac{\nabla_{\theta} \frac{\rho_t}{\rho_{\rm post}} \cdot \nabla_{\theta} \frac{\rho_t}{\rho_{\rm post}}}{\frac{\rho_t}{\rho_{\rm post}}} d\theta \\
&\geq 2\rho Ent_{\mu}\Bigl(\frac{\rho_t}{\rho_{\rm post}}\Bigr) = 2\rho KL[\rho_t \Vert \rho_{\rm post}]
\end{align*}

Therefore, $KL[\rho_t \Vert \rho_{\rm post}] \leq e^{-2\rho t} KL[\rho_0 \Vert \rho_{\rm post}]$.
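
The exponential decay can be checked in closed form for a Gaussian case. The sketch below assumes a one-dimensional standard normal posterior (so $\rho = 1$) and a Gaussian initial density, for which the Langevin dynamics is an Ornstein-Uhlenbeck process whose marginals stay Gaussian; the helper names and the initial mean and variance are illustrative assumptions.

```python
import math

def kl_to_standard_normal(mean, var):
    """KL[N(mean, var) || N(0, 1)] in one dimension."""
    return 0.5 * (var + mean ** 2 - 1.0 - math.log(var))

def ou_marginal(mean0, var0, t):
    """Mean and variance of theta_t for d theta = -theta dt + sqrt(2) dW, theta_0 ~ N(mean0, var0)."""
    decay = math.exp(-t)
    return mean0 * decay, 1.0 + (var0 - 1.0) * decay ** 2

mean0, var0 = 2.0, 0.25   # hypothetical initial distribution rho_0
rho = 1.0                 # -(log rho_post)'' = 1 for the standard normal
kl0 = kl_to_standard_normal(mean0, var0)
times = [0.1 * k for k in range(51)]
kl_path = [kl_to_standard_normal(*ou_marginal(mean0, var0, t)) for t in times]
bound = [math.exp(-2.0 * rho * t) * kl0 for t in times]

for t, kl, b in zip(times[::10], kl_path[::10], bound[::10]):
    print(f"t = {t:.1f}: KL = {kl:.6f} <= bound = {b:.6f}")
```

When only the mean differs from the target ($\textrm{var}_0 = 1$), the bound holds with equality, so the rate $2\rho$ cannot be improved in general.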

# Reference
1. [Analysis and Geometry of Markov Diffusion Operators](https://link.springer.com/book/10.1007/978-3-319-00227-9)