---
title: "Wasserstein gradient flows"
author: "Dimitra Maoutsa"
date: "2022-04-04"
categories: [blog]
description: "Understanding the JKO scheme (adapted text from my PhD thesis)"
image: images/wasser.jpg
bibliography: references.bib
---


## What is a gradient flow in the probability space?


Given an energy functional $\mathcal{E}(\rho)$ on a probability space $\mathcal{P}(\Omega)$ equipped with a metric $\mathcal{G}(\rho)$, written $(\mathcal{P}(\Omega), \mathcal{G}(\rho))$,
a **gradient flow** is defined as **the negative of the inverse metric applied to the differential of the energy functional**
\begin{equation}
    \partial_t \rho_t = -\mathcal{G}(\rho_t)^{-1} \frac{\delta \mathcal{E}(\rho_t)}{\delta \rho_t}.
\end{equation}
Here, $\rho_t$ is a distribution at time $t$.

Intuitively, this means that the system follows the trajectory of steepest descent on the energy functional $\mathcal{E}(\rho)$. To define this steepest descent we need a notion of gradient, which depends on the geometry of the space and is computed with respect to the selected metric.
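To make the role of the metric concrete, here is a minimal finite-dimensional sketch of the same idea, $\dot{x} = -G^{-1} \nabla E(x)$, discretised with explicit Euler steps. The quadratic energy, the two metrics, and the helper name `gradient_flow` are illustrative assumptions and not part of the Wasserstein setting itself.

```python
import numpy as np

def gradient_flow(grad_E, G, x0, dt=0.01, n_steps=500):
    """Explicit Euler discretisation of dx/dt = -G^{-1} grad E(x)."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        # Solve G d = grad E(x) instead of forming the inverse explicitly.
        x = x - dt * np.linalg.solve(G, grad_E(x))
        path.append(x.copy())
    return np.array(path)

# Hypothetical quadratic energy E(x) = 0.5 * x^T A x.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
energy = lambda x: 0.5 * x @ A @ x
grad_energy = lambda x: A @ x

x0 = np.array([1.0, 1.0])
G_euclid = np.eye(2)   # Euclidean metric: ordinary gradient descent
G_other = A.copy()     # metric chosen equal to A: the path becomes a straight line

path_euclid = gradient_flow(grad_energy, G_euclid, x0)
path_other = gradient_flow(grad_energy, G_other, x0)

energies_euclid = np.array([energy(p) for p in path_euclid])
energies_other = np.array([energy(p) for p in path_other])
```

Both trajectories decrease the same energy, but they trace different curves: the metric decides which direction counts as "steepest".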

If we take the **Kullback-Leibler divergence** $D_{KL}$ as the energy functional and the **Wasserstein metric** $\mathcal{W}$ as the metric, the resulting gradient flow, known as the **Wasserstein gradient flow**, is the **Fokker-Planck equation**.

In this case the inverse metric acts on a function $\Phi$ as $\mathcal{G}(\rho_t)^{-1} \Phi = -\nabla \cdot (\rho_t \nabla \Phi)$, and we can derive the Fokker-Planck equation as follows:

\begin{align}
\partial_t \rho_t &= - \text{grad}^{\mathcal{W}} D_{KL}(\rho_t ||\rho_{ss})\\
&= \nabla \cdot \left(  \rho_t \nabla \left( f + \log \rho_t +1 \right)  \right)\\
&= \nabla \cdot \left( \rho_t \nabla f\right) + \nabla \cdot \nabla \rho_t\\
&= \nabla \cdot \left( \rho_t \nabla f\right) + \Delta \rho_t.
\end{align}

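In the above derivation we have assumed that the stationary density is given by $\rho_{ss} \propto e^{-f}$, so that the differential of the energy is $\frac{\delta \mathcal{E}(\rho_t)}{\delta \rho_t} = \log \rho_t + f + 1$, up to an additive constant that vanishes under the gradient. The third line uses the identity $\rho_t \nabla \log \rho_t = \nabla \rho_t$.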
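As a rough numerical check of this derivation, the sketch below integrates the one-dimensional Fokker-Planck equation $\partial_t \rho_t = \partial_x (\rho_t\, \partial_x f) + \partial_x^2 \rho_t$ with a simple explicit finite-difference scheme and tracks $D_{KL}(\rho_t || \rho_{ss})$ along the way. The potential $f(x) = x^2/2$, the grid, the time step, and the initial density are assumptions chosen only for illustration.

```python
import numpy as np

# Spatial grid and assumed potential f(x) = x^2 / 2, so grad f(x) = x.
x = np.linspace(-6.0, 6.0, 241)
dx = x[1] - x[0]
f = 0.5 * x**2
grad_f_mid = 0.5 * (x[1:] + x[:-1])   # grad f evaluated at cell interfaces

def normalise(r):
    return r / (r.sum() * dx)

def kl(p, q):
    """Discrete approximation of D_KL(p || q) on the grid."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

rho_ss = normalise(np.exp(-f))                              # stationary density ~ e^{-f}
rho = normalise(np.exp(-0.5 * ((x - 2.0) / 0.5) ** 2))      # assumed initial density

dt = 0.0005                     # satisfies the explicit stability bound dt <= dx^2 / 2
kl_history = [kl(rho, rho_ss)]
for step in range(10000):
    rho_mid = 0.5 * (rho[1:] + rho[:-1])
    # rho * grad(f + log rho) = rho * grad f + grad rho, evaluated at interfaces.
    flux = rho_mid * grad_f_mid + np.diff(rho) / dx
    # Zero flux through the domain boundaries keeps the total mass fixed.
    flux = np.concatenate(([0.0], flux, [0.0]))
    rho = rho + dt * np.diff(flux) / dx
    if (step + 1) % 1000 == 0:
        kl_history.append(kl(rho, rho_ss))

total_mass = rho.sum() * dx
```

Under these assumptions the recorded KL values shrink towards zero as $\rho_t$ relaxes to $\rho_{ss}$, which is the steepest-descent behaviour the gradient flow picture predicts.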

By considering the Benamou-Brenier formulation [@benamou2000computational; @ambrosio2003lecture] of the Fokker-Planck dynamics we can obtain a better understanding of how the selected geometry (and metric) of the probability space influences the gradient flow dynamics. According to the Benamou-Brenier formalism, the gradient flow dynamics for the Fokker-Planck equation has the following **optimal transport** interpretation:
it describes a search over all possible vector fields $v_t$ that transport probability mass from $\rho_0$ to $\rho_1$, with the Wasserstein distance capturing the minimum possible cost of this transfer. Given two probability distributions $\rho_0$ and $\rho_1$, we define the squared distance to be the minimum of the time integral of the squared norm of the vector field $v_t$
\begin{equation}
d^2_{OT} (\rho_0, \rho_1) = \inf \limits_{\rho_t, v_t} \int_0^1 \| v_t \|^2_{L^2(\rho_t)} dt = \mathcal{W}^2_2 (\rho_0,\rho_1),
\end{equation}
under the constraint that the transient probability distribution $\rho_t$ fulfils the continuity equation
\begin{equation}
    \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0,
\end{equation}
with the boundary conditions $\rho_{t=0} = \rho_0$ and $\rho_{t=1} = \rho_1$.
This constraint captures how the probability density $\rho_t$ evolves while being pushed along the time-dependent vector field $v_t$. The squared Wasserstein distance is the minimal kinetic energy cost of performing this transformation from $\rho_0$ to $\rho_1$.
This defines a metric on probability measures, and consequently it induces a geometry on the space of probability measures. (Here, the optimal $v_t$ is a gradient field, $v_t = \nabla \phi_t$ for some potential $\phi_t$.)
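In one dimension the optimal transport map is monotone, so the dynamic formulation can be illustrated with sorted particles. The sketch below represents two Gaussians by their quantiles, moves each particle along $x_i(t) = x_i^0 + s(t)\,(x_i^1 - x_i^0)$, and compares the kinetic energy $\int_0^1 \| v_t \|^2_{L^2(\rho_t)} dt$ of the constant-speed path $s(t) = t$ with that of a slower-starting path $s(t) = t^2$. The Gaussian parameters, the particle count, and the helper `kinetic_energy` are assumptions made for this example.

```python
import numpy as np
from statistics import NormalDist

# Represent each Gaussian by n equally weighted particles placed at its quantiles.
n = 2000
u = (np.arange(n) + 0.5) / n
z = np.array([NormalDist().inv_cdf(p) for p in u])   # standard normal quantiles

m0, s0 = 0.0, 1.0    # assumed mean and standard deviation of rho_0
m1, s1 = 3.0, 0.5    # assumed mean and standard deviation of rho_1
x0 = m0 + s0 * z     # already sorted, so particle i of rho_0 is matched
x1 = m1 + s1 * z     # with particle i of rho_1 (monotone coupling)

# Squared 2-Wasserstein distance from the monotone coupling, and the
# closed form for one-dimensional Gaussians for comparison.
w2_sq = np.mean((x1 - x0) ** 2)
w2_sq_closed_form = (m0 - m1) ** 2 + (s0 - s1) ** 2

def kinetic_energy(schedule, n_t=2000):
    """Approximate int_0^1 ||v_t||^2 dt for x_i(t) = x0_i + schedule(t) * (x1_i - x0_i)."""
    t = np.linspace(0.0, 1.0, n_t + 1)
    s = schedule(t)
    ds_dt = np.diff(s) / np.diff(t)                    # speed factor on each time interval
    speed_sq = np.mean((x1 - x0) ** 2) * ds_dt ** 2    # particle average of |v_t|^2
    return np.sum(speed_sq * np.diff(t))

energy_straight = kinetic_energy(lambda t: t)          # constant-speed path
energy_slow_start = kinetic_energy(lambda t: t ** 2)   # same endpoints, uneven speed
```

In this sketch the constant-speed path reproduces the squared Wasserstein distance, while the uneven path reaches the same endpoint at a higher cost, consistent with the infimum in the definition.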
