# STA663 Final Report: Sinkhorn Algorithm

### Authors: Congwei Yang, Yijia Zhang, Haoliang Zheng

## 1. Introduction

The optimal transport (OT) distance has important applications in data analysis and machine learning. Theoretically, OT is a metric that quantifies the distance between probability measures under a specified cost function. However, despite its potential applications and favorable properties, OT is infeasible in most problems because of its high computational cost and the curse of dimensionality. In 2013, Marco Cuturi proposed the Sinkhorn algorithm \cite{1}. It incorporates entropy regularization to reduce the computational burden and thus provides a much more efficient alternative to the original OT distance. However, the entropy regularization also introduces bias: the Sinkhorn result is never smaller than the exact OT distance, and is typically strictly larger. 

In this project, we build a Python package to compute the Sinkhorn distance. Moreover, the package provides functions that convert random samples from probability measures into the format required to compute the 1-Wasserstein distance, which is a special case of the OT distance. 

### 1.1 Theoretical Background and Algorithm Description

Before introducing the algorithm, we need to present definitions of the optimal transport problem and Wasserstein distance. 

Definition 1: (The Monge-Kantorovich Optimal Transport Problem) Denote the set of Borel probability measures on $\mathbb{R}^d$ by $\mathcal{P}(\mathbb{R}^d)$. Given a cost function $c(x, y)$, the optimal transport problem between probability measures $P, Q \in \mathcal{P}(\mathbb{R}^d)$ is defined as

\begin{equation}
d(P, Q) := \min_{\pi \in \Gamma(P, Q)}\int_{\mathbb{R}^d \times \mathbb{R}^d}{c(x, y) d\pi(x, y)}
\end{equation}

where $\Gamma(P, Q)$ is the set of couplings of $P$ and $Q$, i.e. joint distributions $\pi$ such that $(x, y) \sim \pi$ implies $x \sim P$ and $y \sim Q$. 

The $p$-Wasserstein distance is the special case of optimal transport with cost $c(x, y) = \lVert x - y \rVert^p$.

Definition 2: From \cite{2}, with $p \in \mathbb{N}^+$ and probability measures $P, Q \in \mathcal{P}(\mathbb{R}^d)$ with finite $p$-th moment, the $p$-th Wasserstein Distance is defined as

\begin{equation}
W_p(P, Q) := \bigg(\inf_{\pi \in \Gamma(P, Q)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \lVert x - y \rVert^p d\pi(x, y)\bigg)^{1/p}
\end{equation}

Since it is generally infeasible to work with the probability measures directly, we approximate them by empirical measures. Thus, we need to further specify the optimal transport problem for empirical measures. 
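As a rough illustration of this step, the sketch below turns two sets of samples into uniform empirical weights and a pairwise cost matrix. The function name `samples_to_measures` and its arguments are hypothetical and are not the package's API. It assumes uniform weights, the Euclidean norm, and inputs of shape `(n, dim)` and `(m, dim)`.

```python
import numpy as np

def samples_to_measures(x, y, p=1):
    """Hypothetical helper: uniform weights and cost matrix from two sample sets.

    x has shape (n, dim), y has shape (m, dim).
    Returns (r, c, M) with M[i, j] = ||x_i - y_j||^p (Euclidean norm).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Each sample carries equal mass (assumption: i.i.d. samples, no weights).
    r = np.full(x.shape[0], 1.0 / x.shape[0])
    c = np.full(y.shape[0], 1.0 / y.shape[0])
    # Broadcast to all pairwise differences, shape (n, m, dim).
    diff = x[:, None, :] - y[None, :, :]
    M = np.linalg.norm(diff, axis=2) ** p
    return r, c, M
```

With `p=1` the cost matrix holds plain Euclidean distances, which is the cost used by the 1-Wasserstein distance.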

Definition 3: From \cite{1}, let $r, c \in \{x \in \mathbb{R}^d_+ : x^T\mathbf{1}_d = 1\}$ be empirical probability measures, where $d$ now denotes the number of support points. We can define
\begin{equation}
    U(r, c) := \{ P \in \mathbb{R}^{d \times d}_+: P\mathbf{1}_d = r, P^T\mathbf{1}_d = c\}
\end{equation}
Then, the optimal transport distance between $r$ and $c$ given a cost matrix $M$ is defined as

\begin{equation}
    d_M(r, c) := \min_{P \in U(r, c)}\langle P, M\rangle
\end{equation}
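To make the definition concrete, the following minimal sketch checks whether a matrix $P$ lies in $U(r, c)$ and evaluates $\langle P, M\rangle$. The helper names and the small example are assumptions made for illustration only.

```python
import numpy as np

def in_transport_polytope(P, r, c, tol=1e-9):
    """Check P >= 0, P 1 = r and P^T 1 = c up to a tolerance."""
    P = np.asarray(P, dtype=float)
    return bool(
        np.all(P >= -tol)
        and np.allclose(P.sum(axis=1), r, atol=tol)
        and np.allclose(P.sum(axis=0), c, atol=tol)
    )

def transport_cost(P, M):
    """Frobenius inner product <P, M>."""
    return float(np.sum(np.asarray(P) * np.asarray(M)))

r = np.array([0.5, 0.5])
c = np.array([0.25, 0.75])
M = np.array([[0.0, 1.0], [1.0, 0.0]])

# The independent coupling r c^T is always feasible, but usually not the cheapest.
P_indep = np.outer(r, c)
# A cheaper feasible plan that keeps as much mass as possible on the diagonal.
P_cheaper = np.array([[0.25, 0.25], [0.0, 0.5]])
```

Both plans satisfy the marginal constraints, and `transport_cost` shows that the second one costs less; $d_M(r, c)$ is the minimum of this cost over all feasible plans.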

The cost matrix $M$ is a metric matrix. In other words, $M$ belongs to the cone of distance matrices: 

\begin{equation}
    \{M \in \mathbb{R}^{d \times d}_+: \forall i, j \leq d, m_{ij} = 0 \leftrightarrow i = j, \forall i, j, k \leq d, m_{ij} \leq m_{ik} + m_{kj} \}
\end{equation}
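The conditions defining this cone can be checked numerically. The sketch below is one possible check under a small tolerance; `is_metric_matrix` is a hypothetical name, and the function tests exactly the conditions listed above (it does not test symmetry, which the set above does not require).

```python
import numpy as np

def is_metric_matrix(M, tol=1e-12):
    """Check non-negativity, m_ij = 0 iff i = j, and the triangle inequality."""
    M = np.asarray(M, dtype=float)
    if M.ndim != 2 or M.shape[0] != M.shape[1]:
        return False
    d = M.shape[0]
    off_diag = ~np.eye(d, dtype=bool)
    nonneg = np.all(M >= 0)
    zero_iff_diag = np.all(np.abs(np.diag(M)) <= tol) and np.all(M[off_diag] > tol)
    # via[i, k, j] = M[i, k] + M[k, j]; compare with M[i, j] for every k.
    via = M[:, :, None] + M[None, :, :]
    triangle = np.all(M[:, None, :] <= via + tol)
    return bool(nonneg and zero_iff_diag and triangle)
```

For example, absolute differences between points on a line pass the check, while their squares generally violate the triangle inequality.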

With the definition of OT problem for empirical measures, we can proceed to state the definition of Sinkhorn distance. 

Definition 4: From \cite{1}, denote the entropies of $r$, $c$ and $P$ by $h(r)$, $h(c)$ and $h(P)$ respectively, where $h(r) = -\sum_i r_i \log r_i$ and $h(P) = -\sum_{i,j} p_{ij} \log p_{ij}$. Then we can introduce the convex set

\begin{equation}
    U_\alpha(r, c) = \{P \in U(r, c) \mid h(P) \geq h(r) + h(c) - \alpha\}
\end{equation}

Then, we can define the Sinkhorn distance by

\begin{equation}
    d_{M, \alpha}(r, c) := \min_{P \in U_\alpha(r, c)}\langle P, M\rangle
\end{equation}

In practice, we will consider a Lagrange multiplier for the entropy constraint of the Sinkhorn distance \cite{1}. For $\lambda > 0$, 

\begin{equation}
    d_M^{\lambda}(r, c) := \langle P^\lambda, M \rangle
\end{equation}

where $P^{\lambda} = \operatorname{argmin}_{P \in U(r, c)}\langle P, M\rangle - \frac{1}{\lambda}h(P)$ \cite{1}. 
By duality theory, each $\alpha$ corresponds to a $\lambda > 0$ such that $d_{M, \alpha}(r, c) = d_M^\lambda(r, c)$ holds for a fixed pair of $r$ and $c$ \cite{1}.

Our package implements the Sinkhorn algorithm proposed in \cite{1}, which computes $d_M^\lambda$ for a specified pair of empirical measures $r$ and $c$. The algorithm can be viewed as a matrix scaling problem, since the solution $P^\lambda$ has the form $P^\lambda = \mathbf{diag}(u)K\mathbf{diag}(v)$, where $u$ and $v$ are two non-negative vectors and $K := e^{-\lambda M}$ is the element-wise exponential of the matrix $-\lambda M$ \cite{1}. Thus, the algorithm can be implemented with iterated matrix-vector operations that alternately rescale $u$ and $v$ until the row and column sums of $P^\lambda$ match $r$ and $c$, as shown in the following figure. 

<img src="Mat_scl.png" width = "600">
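The matrix scaling described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the package's implementation: the name `sinkhorn`, the parameters `n_iter` and `tol`, and the stopping rule based on the row-marginal error are assumptions, and $r$ and $c$ are assumed to be strictly positive.

```python
import numpy as np

def sinkhorn(r, c, M, lam, n_iter=1000, tol=1e-9):
    """Hypothetical sketch: return (d_M^lambda, P^lambda) by matrix scaling."""
    r = np.asarray(r, dtype=float)
    c = np.asarray(c, dtype=float)
    M = np.asarray(M, dtype=float)
    K = np.exp(-lam * M)          # element-wise exponential
    u = np.ones_like(r)
    v = np.ones_like(c)
    for _ in range(n_iter):
        u = r / (K @ v)           # rescale rows to match r
        v = c / (K.T @ u)         # rescale columns to match c
        # Columns now match c exactly; stop once rows match r as well.
        row_err = np.abs(u * (K @ v) - r).sum()
        if row_err < tol:
            break
    P = u[:, None] * K * v[None, :]   # diag(u) K diag(v)
    return float(np.sum(P * M)), P
```

Because each iteration uses only matrix-vector products, the same code extends to several target measures at once by stacking them as columns, which is one reason this form is attractive in practice.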

The entropy regularization introduces bias: since $P^\lambda \in U(r, c)$, the Sinkhorn result $d_M^\lambda(r, c)$ is never smaller than the exact OT distance $d_M(r, c)$. 

## Numerical Instability of Sinkhorn Algorithm

The Sinkhorn algorithm brought a great leap in the computational efficiency of the optimal transport distance and relieved the curse of dimensionality. However, it suffers from numerical instability \cite{3}. When $\lambda$ is large, the matrix $K = e^{-\lambda M}$ may have extremely small entries, causing the subsequent matrix scaling process to produce extremely large or small values \cite{3}. This leads to inaccurate numerical outputs, and possibly to overflow or underflow in the algorithm \cite{3}. 
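The effect is easy to reproduce in double precision. The short sketch below, which uses a hypothetical helper and an illustrative $2 \times 2$ cost matrix, reports the smallest entry of $K$ for increasing values of $\lambda$.

```python
import numpy as np

M = np.array([[0.0, 1.0], [1.0, 0.0]])

def smallest_kernel_entry(M, lam):
    """Smallest entry of K = exp(-lam * M) in float64."""
    return float(np.exp(-lam * M).min())

# Map each lambda to the smallest kernel entry it produces.
small = {lam: smallest_kernel_entry(M, lam) for lam in (10.0, 100.0, 1000.0)}
```

For this matrix the off-diagonal entries are still positive at $\lambda = 100$ but underflow to exactly zero at $\lambda = 1000$, so $K$ becomes the identity and the scaling can no longer move mass between the two points.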

Fortunately, this numerical instability can be addressed by performing the matrix scaling in the log domain \cite{4}, where the numerical values in the computation stay within an acceptable range. Indeed, the log-domain Sinkhorn supports a wider range of $\lambda$. However, it also has disadvantages. The log-domain Sinkhorn requires many more exponential and logarithm operations, and it is no longer in the form of simple matrix-vector operations. Thus, the log-domain Sinkhorn has a heavier computational cost, and it cannot be easily parallelized over multiple pairs of empirical measures. In practice, we can try to avoid the range of $\lambda$ that causes numerical instability and use the original Sinkhorn algorithm as much as possible, turning to the log-domain Sinkhorn only when necessary. 
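One common way to write the log-domain updates is to iterate on $f = \log u$ and $g = \log v$ and to replace each matrix-vector product by a log-sum-exp reduction. The sketch below follows that idea; it is an assumption about the general technique and not necessarily the exact variant of \cite{4} or of our package. The names `logsumexp`, `sinkhorn_log`, `n_iter` and `tol` are hypothetical, and $r$ and $c$ are assumed to be strictly positive.

```python
import numpy as np

def logsumexp(a, axis):
    """Stable log(sum(exp(a))) along one axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.sum(np.exp(a - a_max), axis=axis))

def sinkhorn_log(r, c, M, lam, n_iter=5000, tol=1e-9):
    """Hypothetical sketch: Sinkhorn scaling carried out on log u and log v."""
    r = np.asarray(r, dtype=float)
    c = np.asarray(c, dtype=float)
    M = np.asarray(M, dtype=float)
    log_r, log_c = np.log(r), np.log(c)
    log_K = -lam * M                  # log K, never exponentiated on its own
    f = np.zeros_like(r)              # f = log u
    g = np.zeros_like(c)              # g = log v
    for _ in range(n_iter):
        f = log_r - logsumexp(log_K + g[None, :], axis=1)
        g = log_c - logsumexp(log_K + f[:, None], axis=0)
        # Row marginals of the current plan, computed in the log domain.
        log_rows = logsumexp(f[:, None] + log_K + g[None, :], axis=1)
        if np.abs(np.exp(log_rows) - r).sum() < tol:
            break
    P = np.exp(f[:, None] + log_K + g[None, :])
    return float(np.sum(P * M)), P
```

The extra `exp` and `log` calls in every iteration illustrate the additional cost mentioned above. Larger values of $\lambda$ may also need more iterations to converge, which is why the default `n_iter` in this sketch is generous.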