from IPython.display import display, HTML
display(HTML("<style>.container { width:75% !important; }</style>"))

<center>
    <h1>Adversarial Losses of GANs</h1>
</center>

|    Anish Shah    |    Deshana Desai    |Benjamin Ahlbrand|
|:----------------:|:-------------------:|:---------------:|
|shah.anish@nyu.edu|deshana.desai@nyu.edu| ba1404@nyu.edu  |

<center> 
    <h4>Abstract</h4>
We study a large class of loss functions used to train Generative Adversarial Networks (GANs), called “adversarial losses”. Besides the vanilla GAN, we review the loss functions of $f$-GAN, Maximum Mean Discrepancy (MMD) GAN, Wasserstein GAN and Energy-based GAN. We discuss the statistical properties of these distance measures that affect their behaviour and how they are employed in GANs. Further, we perform experiments and create simple visualizations to show how these distance measures affect the network's ability to cover all modes or to generate better samples by covering fewer modes, whether they lead to vanishing gradients or produce disentangled latent spaces, and how the variance of the cost values depends on the discriminator outputs. We also review how effective the distance measures are at producing samples, using metrics such as visual quality, smoothness of interpolations, and inception score on the LSUN dataset. We perform some of these experiments on smaller synthetic datasets because of hardware and computation-time constraints. Optimal transport (OT) theory provides a natural extension of our study of the distance between the generator distribution and the training data distribution and, separately, between the discriminator model's distribution and the training data distribution. Recently, GANs have been combined with techniques from OT by framing training as minimizing the cost of transporting one data distribution to another. We review and include some of these techniques in our discussion of distance measures.
</center>

## 1. Introduction

## 2. List of Notations

<table>
    <tr>
        <td>$D$</td>
        <td>The discriminator</td>
    </tr>
    <tr>
        <td>$\omega$</td>
        <td>parameter for our discriminator</td>
    </tr>
    <tr>
        <td>$f$</td>
        <td>A convex, lower-semicontinuous function satisfying $f(1) = 0$</td>
    </tr>
    <tr>
        <td>$f^{*}$</td>
        <td>Fenchel conjugate of $f$</td>
    </tr>
    <tr>
        <td>$V$</td>
        <td>$\mathcal{X} \mapsto \mathbb{R}$, output of discriminator without the activation function</td>
    </tr>
    <tr>
        <td>$g_f$</td>
        <td>$\mathbb{R} \mapsto \text{dom}_{f^{*}}$, output activation function which respects the domain $\text{dom}_{f^{*}}$</td>
    </tr>
    <tr>
        <td>$G$</td>
        <td>The generator</td>
    </tr>
    <tr>
        <td>$P$</td>
        <td>True or Target distribution</td>
    </tr>
    <tr>
        <td>$Q$</td>
        <td>Model or generated distribution</td>
    </tr>
    <tr>
        <td>$p(x)$</td>
        <td>probability density function of $P$</td>
    </tr>
    <tr>
        <td>$q(x)$</td>
        <td>probability density function of $Q$</td>
    </tr>
    <tr>
        <td>$\theta$</td>
        <td>parameter for model distribution or generator</td>
    </tr>
    <tr>
        <td>$z$</td>
        <td>The latent vector</td>
    </tr>
    <tr>
        <td>$\mathcal{Z}$</td>
        <td>The latent space</td>
    </tr>
    <tr>
        <td>$\mathcal{X}$</td>
        <td>The samples space</td>
    </tr>
    <tr>
        <td>$L$</td>
        <td>loss function of GAN </td>
    </tr>
</table>

## 3. Statistical Divergence Measures

A divergence measure is a function that quantifies how dissimilar two probability distributions are. A divergence need not be symmetric (in general, the divergence from $p$ to $q$ is not equal to the divergence from $q$ to $p$) and need not satisfy the triangle inequality \cite{wiki:xxx}.

### f-divergence

In statistics and probability theory, an $f$-divergence is a function $D_{f}\left( P \parallel Q \right)$ that measures the difference between two probability distributions $P$ and $Q$ \cite{csiszar2004information, liese2006divergences}. If $P$ and $Q$ are absolutely continuous with respect to a reference measure $dx$ on $\mathcal{X}$, with probability density functions $p$ and $q$ respectively, then we define the $f$-divergence,

\begin{align} \label{eq:fdiv}
    D_f(P \parallel Q) = \int_{\mathcal{X}} q(x) f \left( \frac{p(x)}{q(x)} \right) dx
\end{align}
    

where the _generator function_ $f: \mathbb{R}_{+} \mapsto \mathbb{R}$ is a convex, lower-semicontinuous function satisfying
$f(1) = 0$. Every convex, lower-semicontinuous function $f$ has a _convex conjugate_ function $f^{*}$, known as the _Fenchel conjugate_ \cite{hiriart2012fundamentals}. It is defined as $f^{*}(t) = \sup\limits_{u \in \text{dom}_{f}} \{ut -  f(u)\}$.

Because $f$ is convex and lower-semicontinuous, the conjugate of $f^{*}$ is $f$ itself, so $f(u) = \sup\limits_{t \in \text{dom}_{f^{*}}} \{tu - f^{*}(t)\}$. Using this representation in (\ref{eq:fdiv}),

\begin{align*}
    D_f(P \parallel Q) &= \int_{\mathcal{X}} q(x) \sup\limits_{t \in \text{dom}_{f^{*}} } \left\{ t \frac{p(x)}{q(x)} - f^{*}(t) \right\}  dx \\
    &\text{By Jensen's inequality,} \\
    &\geq \sup\limits_{T \in \mathcal{T}} \left( \int_{\mathcal{X}}p(x)T(x)dx - \int_{\mathcal{X}}q(x)f^{*}(T(x)) dx \right) \\
    &= \sup\limits_{T \in \mathcal{T}} \left( \mathbb{E}_{x \sim P} \left[T(x)\right] - \mathbb{E}_{x \sim Q} \left[f^{*}(T(x))\right] \right)
\end{align*}

where $\mathcal{T}$ is an arbitrary class of functions $T : \mathcal{X} \mapsto \mathbb{R}$.
The lower bound is tight for $T^{*}(x) = f^{'} \left( \frac{p(x)}{q(x)} \right)$ \cite{nguyen2010estimating} where $f'$ is the first order derivative of $f$.

If we have a model $q_{\theta}(x)$ that should match the true distribution $p(x)$, we need to adjust the parameter $\theta$ using gradient descent to minimize the $f$-divergence. Our goal will be to find the best parameter $\theta^{*}$ using 

\begin{align}
    \theta^{*} &= \underset{\theta}{\mathrm{argmin}}
 D_{f} \left( P \parallel Q_{\theta} \right) \nonumber \\
    &= \underset{\theta}{\mathrm{argmin}} \mathbb{E}_{x \sim P} \left[ f^{'} \left( \frac{p(x)}{q_{\theta}(x)} \right) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[f^{*}\left(f^{'} \left( \frac{p(x)}{q_{\theta}(x)} \right)\right)\right] \label{eq:theta-f-div}
\end{align}


Tables \ref{table:f-div} and \ref{table:f-div-loss} list the common $f$-divergences that we consider in this paper, together with their Fenchel conjugates $f^{*}$.

<table>
    <tr>
        <th>Name</th>
        <th>Generator $f(u)$</th>
        <th>Conjugate $f^{*}(t)$ </th>
        <th>$T^*(x)$</th>
    </tr>
    <tr>
        <td>Kullback-Leibler (KL)</td>
        <td>$u \log u$</td>
        <td>$\exp{(t - 1)}$ </td>
        <td>$1 + \log \frac{p(x)}{q(x)}$</td>
    </tr>
    <tr>
        <td>Reverse KL</td>
        <td>$-\log u$</td>
        <td>$-1-\log(-t)$ </td>
        <td>$-\frac{q(x)}{p(x)}$</td>
    </tr>
    <tr>
        <td>Jensen-Shannon</td>
        <td>$-(u+1)\log \frac{1 + u}{2} + u \log u$</td>
        <td>$-\log(2-\exp(t))$ </td>
        <td>$\log \frac{2p(x)}{p(x) + q(x)}$</td>
    </tr>
    <tr>
        <td>Pearson $\chi^2$</td>
        <td>$(u - 1)^2$</td>
        <td>$\frac{1}{4}t^2 + t$ </td>
        <td>$2 \left( \frac{p(x)}{q(x)} - 1 \right)$</td>
    </tr>
    <tr>
        <td>Total Variation</td>
        <td>$\frac{1}{2} | u - 1 |$</td>
        <td>$t$ </td>
        <td> $\frac{1}{2} \text{sign}\left( \frac{p(x)}{q(x)} - 1 \right)$</td>
    </tr>
</table>
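
As a sanity check on the table, the following minimal sketch evaluates three of its rows on a pair of small discrete distributions and confirms numerically that the variational lower bound is tight at $T^{*}(x) = f'\left(\frac{p(x)}{q(x)}\right)$. The distributions `p` and `q` and all function names are hypothetical choices made for illustration; sums replace the integrals because the sample space is finite.

```python
import numpy as np

# (f, f', f*) for three rows of the table above.
DIVERGENCES = {
    "kl": (lambda u: u * np.log(u),
           lambda u: 1.0 + np.log(u),
           lambda t: np.exp(t - 1.0)),
    "reverse_kl": (lambda u: -np.log(u),
                   lambda u: -1.0 / u,
                   lambda t: -1.0 - np.log(-t)),
    "pearson": (lambda u: (u - 1.0) ** 2,
                lambda u: 2.0 * (u - 1.0),
                lambda t: 0.25 * t ** 2 + t),
}

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) f(p(x) / q(x)) for strictly positive p and q."""
    return float(np.sum(q * f(p / q)))

def variational_bound(p, q, f_prime, f_conj):
    """E_P[T*] - E_Q[f*(T*)] evaluated at the optimal critic T* = f'(p / q)."""
    t_star = f_prime(p / q)
    return float(np.sum(p * t_star) - np.sum(q * f_conj(t_star)))

# Hypothetical target and model distributions on three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

for name, (f, f_prime, f_conj) in DIVERGENCES.items():
    direct = f_divergence(p, q, f)
    bound = variational_bound(p, q, f_prime, f_conj)
    print(f"{name:>10}: direct={direct:.6f}  bound at T*={bound:.6f}")
```

For each row the two printed numbers should agree, which is the tightness statement in discrete form.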


### Kullback Leibler Divergence

The KL divergence is an $f$-divergence that measures how a target probability distribution $p$ diverges from a model probability distribution $q$. For KL divergence, $f(u) = u \log u$ \cite{kldivwiki}.

\begin{align}
    D_{KL}(p \parallel q) = \int_x p(x) \log\frac{p(x)}{q(x)}dx
\end{align}

Substituting in (\ref{eq:theta-f-div}), we can find the best set of parameters $\theta^{*}$ that minimizes the KL divergence\cite{gananddiv}, 

\begin{align*}
    \theta^{*} &=  \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim P} \left[ 1 + \log \left( \frac{p(x)}{q_{\theta}(x)} \right) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[ \frac{p(x)}{q_{\theta}(x)} \right] \\
    &=  \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim P} \left[ 1 + \log \left( \frac{p(x)}{q_{\theta}(x)} \right) \right] - \mathbb{E}_{x \sim P} \left[ 1 \right] \\
    &=  \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim P} \left[ \log \left( \frac{p(x)}{q_{\theta}(x)} \right) \right] \\
    &=  \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim P} \left[ \log{p(x)} - \log{q_{\theta}(x)} \right] \\
    &=  \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim P} \left[ \log{p(x)} \right] - \mathbb{E}_{x \sim P} \left[ \log{q_{\theta}(x)} \right]
\end{align*}

The first term in the last line is simply the negative entropy of the true distribution $p(x)$. Since this quantity is independent of $\theta$, we get

\begin{align*}
    \theta^{*} &= \underset{\theta}{\mathrm{argmin}}\  - \mathbb{E}_{x \sim P} \left[ \log{q_{\theta}(x)} \right]  \\
    &= \underset{\theta}{\mathrm{argmax}}\  \mathbb{E}_{x \sim P} \left[ \log{q_{\theta}(x)} \right]
\end{align*}

The above objective seeks parameters $\theta$ such that samples from $p(x)$ are assigned the highest log probability under $q_\theta(x)$. This is exactly the maximum likelihood objective. As an example, take a dataset drawn from a mixture of two Gaussians. If we draw a sample from $p(x)$ that has low probability under $q_\theta(x)$, $\log q_\theta(x)$ goes to negative infinity, so the parameters are heavily penalized. On the other hand, if some input $x$ has low probability under $p(x)$ but is assigned high probability under $q_\theta(x)$, the maximum likelihood loss is hardly affected. As a result, the estimated model tries to cover the entire support of the true distribution, and in doing so ends up assigning probability mass to regions of space (between the two mixture components) that have low probability under $p(x)$.
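
The mode-covering behaviour described above can be reproduced in a few lines. This is a minimal sketch under assumed settings: the target is a hypothetical equal mixture of $\mathcal{N}(-3, 0.5^2)$ and $\mathcal{N}(3, 0.5^2)$, and the model $q_\theta$ is a single Gaussian, for which the maximum likelihood solution is the sample mean and sample standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: equal mixture of N(-3, 0.5^2) and N(+3, 0.5^2).
n = 20_000
component = rng.integers(0, 2, size=n)
samples = rng.normal(loc=np.where(component == 0, -3.0, 3.0), scale=0.5)

# For a single Gaussian q_theta, maximising the average log-likelihood has a
# closed-form solution: the sample mean and the (biased) sample std.
mu_mle = samples.mean()
sigma_mle = samples.std()

print(f"MLE fit: mu={mu_mle:.3f}, sigma={sigma_mle:.3f}")
```

The fitted mean sits between the two components and the fitted standard deviation is several times larger than that of either component, so the model places substantial mass in the low-probability gap between the modes.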

### Reverse KL Divergence

To get around the above issue, we instead calculate Reverse KL Divergence. For Reverse KL Divergence, $f(u) = -\log u$. Substituting in (\ref{eq:theta-f-div}), we can find the best set of parameters $\theta^{*}$ that minimizes the Reverse KL divergence\cite{gananddiv}, 

\begin{align*}
    \theta^{*} &=  \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim P} \left[ - \frac{q_{\theta}(x)}{p(x)} \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[ -1 - \log \left(\frac{q_{\theta}(x)}{p(x)}\right) \right] \\
    &=  \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim Q_{\theta}} \left[ -1 \right] + \mathbb{E}_{x \sim Q_{\theta}} \left[ 1 + \log \left(\frac{q_{\theta}(x)}{p(x)}\right) \right] \\
    &=  \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim Q_{\theta}} \left[\log \left(\frac{q_{\theta}(x)}{p(x)}\right) \right] \\
    &=  \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim Q_{\theta}} \left[\log q_{\theta}(x) - \log{p(x)}\right] \\
    &=  \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim Q_{\theta}} \left[\log q_{\theta}(x) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[ \log{p(x)}\right] \\
    &=  \underset{\theta}{\mathrm{argmax}}\ -\mathbb{E}_{x \sim Q_{\theta}} \left[\log q_{\theta}(x) \right] + \mathbb{E}_{x \sim Q_{\theta}} \left[ \log{p(x)}\right] \\
\end{align*}


The first term above is simply the entropy of $q_\theta(x)$, which encourages the probability mass to be as spread out as possible. The second term is the log probability of samples from $q_\theta(x)$ under the true distribution $p(x)$. This encourages the model to produce samples that are reasonably "close" to the true distribution $p(x)$. The first term is essential: without it, the model would simply assign all probability mass to a single sample that has high probability under $p(x)$. The implication of using such an objective function is that our model picks a single mode and models it well. The solution has reasonably high entropy, and any sample from the estimated distribution has a reasonably high probability under $p(x)$. The drawback is that many diverse modes may remain unexplored.

Another drawback of this method is that reverse KL provides no control over which output is chosen; it only ensures that the distribution learned by the model has high probability under the true distribution. In contrast, maximum likelihood can result in a "worse" output by trying to model every possible outcome despite being unable to do so well. Maximum likelihood could achieve better results if a model with sufficiently large capacity is trained. However, this might be too expensive a trade-off (as the recent example of the Glow approach \cite{kingma2018glow} suggests).
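
The mode-seeking behaviour of the reverse KL objective can be checked numerically. The sketch below is an illustration under assumed settings, not a training procedure: the target is a hypothetical equal mixture of $\mathcal{N}(-3, 0.5^2)$ and $\mathcal{N}(3, 0.5^2)$, the model is a single Gaussian, the integral is approximated by a Riemann sum on a grid, and the minimiser is found by brute-force search over $(\mu, \sigma)$.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Integration grid and hypothetical bimodal target density p(x).
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
p = 0.5 * normal_pdf(x, -3.0, 0.5) + 0.5 * normal_pdf(x, 3.0, 0.5)

def reverse_kl(mu, sigma):
    """Riemann-sum approximation of D_KL(q || p) for q = N(mu, sigma^2)."""
    q = normal_pdf(x, mu, sigma)
    mask = q > 0.0  # skip grid points where q underflows to zero
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))) * dx)

# Brute-force search over a grid of candidate parameters.
mus = np.linspace(-4.0, 4.0, 81)
sigmas = np.linspace(0.2, 4.0, 39)
values = np.array([[reverse_kl(m, s) for s in sigmas] for m in mus])
i, j = np.unravel_index(np.argmin(values), values.shape)
best_mu, best_sigma = mus[i], sigmas[j]

print(f"reverse-KL fit: mu={best_mu:.2f}, sigma={best_sigma:.2f}, value={values[i, j]:.4f}")
```

The minimiser locks onto one of the two components (its mean is close to $-3$ or $+3$ and its width matches that component), and the remaining divergence is close to $\log 2$, the price of ignoring the other half of the mass.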


### Aside: How is KL divergence equivalent to Maximum Likelihood Estimation?

We want to learn a distribution $Q$ that approximates $P$. We can attempt to do this by directly learning the probability density function $q(x)$ of $Q$. We optimize $Q$ through Maximum Likelihood Estimation. The MLE objective is:

\begin{align}
    \max_{\theta \in \mathbb{R}^d} \quad \frac{1}{m} \sum_{i=1}^{m} \log q_\theta(x^{(i)})
\end{align}

In the limit $m \to \infty$, the samples $x^{(i)}$ appear according to the true data distribution $P$, so the sample average converges to an expectation under $P$. This leads to:

\begin{align*}
    & \lim_{m \to \infty} \max_{\theta \in \mathbb{R}^d} \quad \frac{1}{m} \sum_{i=1}^{m} \log q_\theta(x^{(i)}) \\
    = & \max_{\theta \in \mathbb{R}^d} \quad  \int_x p(x) \log q_\theta(x) dx \\
    =& \min_{\theta \in \mathbb{R}^d} - \int_x p(x) \log q_\theta(x) dx \\
    =& \min_{\theta \in \mathbb{R}^d} \int_x p(x) \log p(x) dx - \int_x p(x) \log q_\theta(x) dx \\
    =& \min_{\theta \in \mathbb{R}^d} D_{KL}( p(x)||q_\theta(x))
\end{align*}

As discussed earlier, if $q_\theta(x) = 0$ at an $x$ where $p(x) > 0$, the KL divergence goes to $+ \infty$. Thus if  $q_\theta(x)$ has low dimensional support, then it is unlikely that all of $p(x)$ lies within the support. This will cause the KL divergence to explode on some data points.
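
The equivalence can be illustrated on a one-parameter family. In this minimal sketch, the true distribution is a hypothetical Bernoulli with success probability 0.3 and the model $q_\theta$ is a Bernoulli with parameter $\theta$; the average log-likelihood on samples and the exact KL divergence are both evaluated on a grid of $\theta$ values.

```python
import numpy as np

rng = np.random.default_rng(1)

p_true = 0.3                              # hypothetical true Bernoulli parameter
data = rng.random(50_000) < p_true        # samples x^(i) from P

thetas = np.linspace(0.01, 0.99, 99)      # candidate model parameters
k = data.mean()                           # empirical frequency of x = 1

# Average log-likelihood (1/m) * sum_i log q_theta(x^(i)) for each theta.
avg_loglik = k * np.log(thetas) + (1.0 - k) * np.log(1.0 - thetas)

# Exact D_KL(P || Q_theta) for each theta.
kl = (p_true * np.log(p_true / thetas)
      + (1.0 - p_true) * np.log((1.0 - p_true) / (1.0 - thetas)))

theta_mle = thetas[np.argmax(avg_loglik)]
theta_kl = thetas[np.argmin(kl)]
print(f"argmax log-likelihood: {theta_mle:.2f}, argmin KL: {theta_kl:.2f}")
```

With enough samples the two selected parameters coincide up to the grid resolution.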

### Jensen Shannon Divergence

The Jensen-Shannon divergence has a clear connection to the KL divergence and reverse KL divergence defined above. In the first case, we observed that the KL divergence heavily penalizes regions where $p(x)$ has non-zero mass but $q_{\theta}(x)$ has none. Conversely, it barely penalizes regions where $p(x)$ is small but $q_{\theta}(x)$ is high, so it does not behave well there.

For the reverse KL divergence, we saw that the objective heavily penalizes generated samples that have low log likelihood under the true data distribution. Its drawback is that it barely penalizes points $x$ that have a low likelihood under $q_{\theta}(x)$ but a high likelihood under $p(x)$, so it does not behave well at points where $q_{\theta}(x)$ is small.

The KL divergence is asymmetric: it is clearly not the same as the reverse KL divergence. One way to obtain a symmetrized version of the KL divergence is simply to sum the two KL divergences between the two distributions: $D_{KL}(p(x) \parallel q_\theta(x)) + D_{KL}(q_\theta(x)\parallel p(x))$

However, such a loss function would still penalize heavily whenever one of the two distributions has low mass in a region where the other does not. Instead of measuring the distance between the two distributions directly, the next natural solution is to measure the distance from each distribution to their average. This leads us to the symmetric Jensen-Shannon divergence \cite{jensen}:

\begin{align*}
    D_{JS}(P\parallel Q) = \frac{1}{2} D_{KL}(P \parallel M) + \frac{1}{2} D_{KL} (Q \parallel M)
\end{align*}

where M is the average of the individual probability distributions:

\begin{align*}
    M = \frac{1}{2}(P+Q)
\end{align*}


The Jensen-Shannon divergence does not suffer from this heavy penalization: it is well behaved when either $p(x)$ or $q_{\theta}(x)$ is small. It is also a symmetrized version of the KL divergence.


Jensen-Shannon Divergence is also a type of $f$-divergence where $f(u) = -(u+1)\log \frac{1 + u}{2} + u \log u$.

Substituting in (\ref{eq:theta-f-div}), we can find the best set of parameters $\theta^{*}$ that minimizes the JS  divergence,

\begin{align*}
    \theta^{*} &= \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim P} \left[ \log{\frac{2p(x)}{p(x) + q_{\theta}(x)}} \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[  - \log \left( 2 - \frac{2p(x)}{p(x) + q_{\theta}(x)} \right) \right] \\
    &= \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{x \sim P} \left[ \log{\frac{2p(x)}{p(x) + q_{\theta}(x)}} \right] + \mathbb{E}_{x \sim Q_{\theta}} \left[ \log \frac{2q_{\theta}(x)}{p(x) + q_{\theta}(x)}  \right]
\end{align*}
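
The properties claimed for the Jensen-Shannon divergence (symmetry, and finiteness where KL diverges) can be seen on small discrete distributions. This is a minimal sketch; the arrays `p`, `q` and `r` are hypothetical examples, and the helper names are chosen for illustration.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) on a finite sample space; returns inf if q = 0 where p > 0."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """D_JS(p || q) = 0.5 * KL(p || m) + 0.5 * KL(q || m) with m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])   # support disjoint from p
r = np.array([0.4, 0.4, 0.1, 0.1])   # overlaps with p

print("KL(p || q) =", kl(p, q))       # infinite: q has no mass where p does
print("JS(p, q)   =", js(p, q))       # finite, equal to log 2
print("JS(p, r)   =", js(p, r), " JS(r, p) =", js(r, p))
```

For disjoint supports the divergence saturates at $\log 2$ rather than diverging, which is the well-behaved case described above.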

### Total Variation Distance

Let $\mathcal{X}$  be a set of samples and let $\Sigma$ be the set of all Borel subsets of $\mathcal{X}$. Let Prob($\mathcal{X}$) denote the space of probability measures defined on $\mathcal{X}$.

The _Total Variation_ (TV) distance between two distributions $P, Q \in \text{Prob}(\mathcal{X})$ is

\begin{align*}
    \delta(P, Q) &= \sup_{A\in \Sigma} |P(A) - Q(A)| \\
    &= \sup_{A\in \Sigma} \left| \int_{A} (p(x) - q(x)) dx \right|
\end{align*}

Intuitively speaking, the total variation distance considers every event $A \in \Sigma$. For some events, the probability assigned by $P$ will be larger than the probability assigned to the same event by $Q$, and vice versa. We then search over all events and find the one on which the two probability assignments differ the most, i.e. where the gap between the two distributions is maximal.


$A_{*} = \{ x \in \mathcal{X} : p(x) \geq q(x) \}$ is a set at which the supremum in $\delta(P, Q)$ is attained. Since $P, Q \in \text{Prob}(\mathcal{X})$, we know that,

\begin{align*}
    \int_{\mathcal{X}} (p(x) - q(x)) dx = P(\mathcal{X}) - Q(\mathcal{X}) = 0 
\end{align*}


and


\begin{align} \label{eq:tv-0}
    \int_{A_{*}} (p(x) - q(x)) dx = \int_{\mathcal{X} \setminus A_{*}} (q(x) - p(x)) dx
\end{align}


Therefore, 


\begin{align}
    \int_{\mathcal{X}} | p(x) - q(x) | dx &= 2 \int_{A_{*}} (p(x) - q(x)) dx \label{eq:tv-1} \\
    &\leq 2 \sup_{A} \left| \int_{A} (p(x) - q(x)) dx \right| \label{eq:tv-2}
\end{align}


Conversely, for any $A \in \Sigma$,

\begin{align}
    \left| \int_{A} (p(x) - q(x)) dx \right| &= \max \left\{ \int_{A} (p(x) - q(x)) dx , \int_{A} (q(x) - p(x)) dx \right\} \nonumber \\
    &\leq \max \left\{ \int_{A_{*}} (p(x) - q(x)) dx , \int_{\mathcal{X}\setminus A_{*}} (q(x) - p(x)) dx \right\} \nonumber 
\end{align}

By (\ref{eq:tv-0}) and (\ref{eq:tv-1}), both terms in the last maximum equal $\frac{1}{2} \int_{\mathcal{X}} | p(x) - q(x) | dx$, so

\begin{align}
    \sup_{A} \left| \int_{A} (p(x) - q(x)) dx \right| &\leq \frac{1}{2} \int_{\mathcal{X}} | p(x) - q(x) | dx  \label{eq:tv-3}
\end{align}

From \ref{eq:tv-2} and \ref{eq:tv-3},

\begin{align*}
    \delta(P, Q) &= \frac{1}{2} \int_{\mathcal{X}} | p(x) - q(x) | dx
\end{align*}

Therefore, Total Variation Distance is also a type of $f$-divergence for $f(u) = \frac{1}{2} | u - 1|$. We can find the best set of parameters $\theta^{*}$ that minimizes the TV distance from equation \ref{eq:theta-f-div},

\begin{align*}
    \theta^{*} &= \underset{\theta}{\mathrm{argmin}}\  \mathbb{E}_{x \sim P} \left[ \frac{1}{2} \text{sign}\left( \frac{p(x)}{q_{\theta}(x)} - 1 \right) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[ \frac{1}{2} \text{sign} \left( \frac{p(x)}{q_{\theta}(x)} - 1 \right) \right]
\end{align*}
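
The identity $\delta(P, Q) = \frac{1}{2} \int_{\mathcal{X}} |p(x) - q(x)| dx$ derived above can be verified by brute force on a small finite sample space, where every subset is an event. This is a minimal sketch with hypothetical distributions `p` and `q`; the enumeration is exponential in the number of outcomes and only meant as a check.

```python
import itertools
import numpy as np

p = np.array([0.5, 0.3, 0.1, 0.1])   # hypothetical P
q = np.array([0.2, 0.2, 0.3, 0.3])   # hypothetical Q

def tv_by_sup(p, q):
    """sup_A |P(A) - Q(A)| by enumerating every subset A of the sample space."""
    best = 0.0
    for bits in itertools.product([False, True], repeat=len(p)):
        mask = np.array(bits)
        best = max(best, abs(p[mask].sum() - q[mask].sum()))
    return float(best)

def tv_by_l1(p, q):
    """Half the L1 distance between the probability vectors."""
    return float(0.5 * np.abs(p - q).sum())

print("sup over events:", tv_by_sup(p, q))
print("half L1 norm   :", tv_by_l1(p, q))
```

The maximising event is $A_{*} = \{x : p(x) \geq q(x)\}$, here the first two outcomes.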

### Pearson $\chi^2$ Divergence

Pearson $\chi^2$ Divergence is another type of $f$-divergence where $f(u) = (u - 1)^2$. Substituting $T^{*}$ and $f^{*}$ from the table into (\ref{eq:theta-f-div}) gives

\begin{align*}
    \theta^{*} &= \underset{\theta}{\mathrm{argmin}}\  \mathbb{E}_{x \sim P} \left[ 2 \left( \frac{p(x)}{q_{\theta}(x)} - 1 \right) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[\left( \frac{p(x)}{q_{\theta}(x)} - 1 \right)^2 + 2 \left( \frac{p(x)}{q_{\theta}(x)} - 1 \right)\right] 
\end{align*}

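### Maximum Mean Discrepancy Distance

Given two distributions $P$ and $Q$ and a kernel $k$, the square of the MMD distance \cite{li2017mmd} is defined as

\begin{align*}
    M_{k}(P, Q) = \| \mu_{P} - \mu_{Q} \|_{\mathcal{H}}^{2} = \mathbb{E}_{P} \left[ k(x, x') \right] - 2\mathbb{E}_{P,Q} \left[ k(x, y) \right] + \mathbb{E}_{Q} \left[ k(y, y') \right]
\end{align*}

where $\mathcal{H}$ is the reproducing kernel Hilbert space of $k$, $\mu_{P} = \mathbb{E}_{x \sim P}\left[ k(x, \cdot) \right]$ and $\mu_{Q} = \mathbb{E}_{y \sim Q}\left[ k(y, \cdot) \right]$ are the kernel mean embeddings of $P$ and $Q$, $x, x'$ are independent samples from $P$, and $y, y'$ are independent samples from $Q$.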
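
In practice the three expectations are replaced by sample averages. The sketch below shows the simple biased (V-statistic) estimate with a Gaussian kernel; the kernel choice, the bandwidth and the sample distributions are assumptions made for illustration and are not taken from the MMD GAN paper.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased estimate of MMD^2: mean k(x, x') - 2 mean k(x, y) + mean k(y, y')."""
    return float(gaussian_kernel(x, x, bandwidth).mean()
                 - 2.0 * gaussian_kernel(x, y, bandwidth).mean()
                 + gaussian_kernel(y, y, bandwidth).mean())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 2))    # samples from a hypothetical P
x2 = rng.normal(0.0, 1.0, size=(500, 2))   # fresh samples from the same P
y = rng.normal(2.0, 1.0, size=(500, 2))    # samples from a shifted Q

print("MMD^2(P, P) estimate:", mmd2_biased(x, x2))
print("MMD^2(P, Q) estimate:", mmd2_biased(x, y))
```

The estimate is near zero for two samples of the same distribution and clearly positive for the shifted one.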

### Wasserstein Distance

We look at two toy datasets for better understanding the problem of computing distances between probability distributions.

<figure class="image">
  <img src="dist1.png" alt="Toy Dataset" width="30%">
    <center><figcaption>Figure 1: Toy Dataset 1</figcaption></center>
</figure>

<figure class="image">
  <img src="dist2.png" alt="Toy Dataset" width="30%">
    <center><figcaption>Figure 2: Toy Dataset 2</figcaption></center>
</figure>


We sample points from three different probability distributions and color them red, green and blue respectively. Based on the samples above, it is unclear which two distributions are closer: are the red samples closer to the green ones, or are the blue samples? This is not obvious. The red samples have lower variance, so it is easier to be certain about their approximate locations; the blue points are highly scattered, yet the points closest to the green samples are blue. Similar to the Jensen-Shannon approach, one option is to look at the centers of mass of the distributions. In the second dataset, however, the centers of mass of the samples drawn from the red and green distributions may coincide, even though the two distributions clearly should have non-zero distance.

In the first dataset, the total variation distance would consider the worst case and conclude that the blue distribution is further away from the green than the red distribution is. These examples illustrate that defining a distance between probability distributions is an involved task.

The Wasserstein or "Earth mover's" distance treats each of these points as drawn from a discrete probability distribution with probability mass $p(x) = 1/|\Sigma|$ for each $x \in \Sigma$, where $\Sigma$ here denotes the finite set of sampled points. Each $x \in \Sigma$ corresponds to a pile of dirt whose height is its probability mass $p(x)$. The cost of moving a unit of dirt from $x$ to $y$ is the distance $d(x, y)$ between the points. The objective is to minimize the total cost of moving the dirt under certain constraints. The distance metric is defined as the L1 norm in this case.

The Wasserstein distance between the two distributions is thus defined as:

\begin{align*}
    W(P,Q) = \inf_{\gamma \in \Pi(P,Q)} \mathbb{E}_{(x,y)\sim\gamma}\left[ \| x-y \| \right]
\end{align*}

where $\Pi(P,Q)$ denotes the set of all joint distributions $\gamma(x,y)$ whose marginals are $P$ and $Q$ respectively. 
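
For intuition, there is one special case where the infimum over couplings has a simple closed form: on the real line, for two empirical distributions with the same number of equally weighted samples, the optimal plan matches the sorted samples in order. The sketch below relies on that one-dimensional fact only; it is not a general solver, and the sample arrays are hypothetical.

```python
import numpy as np

def wasserstein1_1d(x, y):
    """W_1 between two equal-size, equally weighted empirical distributions on R.

    In one dimension the optimal coupling pairs the i-th smallest point of x
    with the i-th smallest point of y.
    """
    x_sorted, y_sorted = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(x_sorted - y_sorted)))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + 2.0                                   # pure translation of x
z = rng.normal(loc=0.0, scale=3.0, size=1000)  # same centre, wider spread

print("W1(x, x + 2) =", wasserstein1_1d(x, y))
print("W1(x, z)     =", wasserstein1_1d(x, z))
```

A translation by 2 costs exactly 2 per unit of mass, and two distributions with the same center of mass but different spread still have non-zero distance.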

Unfortunately, computing this Wasserstein distance exactly is intractable. The Wasserstein GAN paper \cite{arjovsky2017wasserstein} therefore uses the **Kantorovich-Rubinstein duality** to reformulate the problem:

\begin{align*}
    W(P,Q) = \sup_{\| f \|_L \leq 1} \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)]
\end{align*}

where the supremum is over all the 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$. 

**Lipschitz Functions**: Let $d_X$ and $d_Y$ be distance functions on spaces $X$ and $Y$. A function $f: X \to Y$ is $K$-Lipschitz if, for all $x_1, x_2 \in X$:

\begin{align*}
    d_Y(f(x_1),f(x_2)) \leq K d_X(x_1,x_2)
\end{align*}

Intuitively, the slope of a $K$-Lipschitz function never exceeds $K$ in absolute value.

If we replace the supremum over 1-Lipschitz functions with the supremum over *K-Lipschitz functions*, then the supremum is $K \cdot W(P,Q)$ instead. (This holds because every $K$-Lipschitz function becomes a 1-Lipschitz function when divided by $K$, and the Wasserstein objective is linear.)

The supremum over $K$-Lipschitz functions is still intractable but easier to approximate. Suppose we have a parametrized function family $\{f_w\}_{w \in \mathcal{W}}$, where $w$ are the weights and $\mathcal{W}$ is the set of all possible weights. Also suppose that these functions are all $K$-Lipschitz for some $K$.

This gives us the following problem:
\begin{align*}
    \max_{w \in \mathcal{W}} \mathbb{E}_{x \sim P}[f_w(x)] - \mathbb{E}_{x \sim Q}[f_w(x)] &\leq \sup_{\| f \|_L \leq K} \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \\
    &= K \cdot W(P,Q)
\end{align*}

For optimization purposes, $K$ is unknown but fixed throughout the training process. The gradients with respect to the weights $w$ are scaled by $K$ but also by the learning rate $\alpha$, so $K$ is absorbed into the hyperparameter tuning. This entire derivation only works when the function family $\{f_w\}_{w \in \mathcal{W}}$ is $K$-Lipschitz. To guarantee this property, the paper uses weight clamping: the weights $w$ are constrained to lie within $[-c,c]$ by clipping them after every update.
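
The effect of clipping can be seen in the simplest possible setting. In the sketch below a linear critic $f_w(x) = w x$ on the real line stands in for the neural network (an assumption made purely for illustration), so the family is $K$-Lipschitz with $K = c$. The samples, the threshold `c` and the learning rate `lr` are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=5000)   # samples from P
fake = real + 1.5                                   # samples from Q, W(P, Q) = 1.5

c = 0.1      # clipping threshold, so |w| <= c and f_w is c-Lipschitz
lr = 0.05    # learning rate
w = 0.0      # single weight of the linear critic f_w(x) = w * x

for _ in range(200):
    # Gradient of E_P[f_w(x)] - E_Q[f_w(x)] with respect to w.
    grad = real.mean() - fake.mean()
    w += lr * grad                    # gradient ascent on the critic objective
    w = float(np.clip(w, -c, c))      # clip the weight after every update

estimate = w * (real.mean() - fake.mean())
print(f"w = {w:.3f}, critic objective = {estimate:.4f}, c * W = {c * 1.5:.4f}")
```

The weight saturates at the clipping boundary and the objective equals $c \cdot W(P,Q)$, a scaled version of the Wasserstein distance, which is why the unknown constant can be absorbed into the learning rate.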


### Convergence of measures

Intuitively, if we consider a sequence of measures $\mu_n$ on a space sharing a common collection of measurable sets, such a sequence may represent an attempt to construct 'better and better' approximations to a desirable measure $\mu$ that is difficult to obtain directly. 

Two notions of convergence of measures are strong convergence and weak convergence.

**Strong convergence**: 
For $(X,{\mathcal  {F}})$ a measurable space, a sequence of measures $\mu_n$ is said to converge strongly to a limit $\mu$ if

\begin{align*}
    \lim _{{n\to \infty }}\mu _{n}(A)=\mu (A)
\end{align*}

for every set $A\in\mathcal{F}$.

**Weak convergence**: Let $S$ be a metric space with its Borel $\sigma$-algebra $\Sigma$. A bounded sequence of positive probability measures $\mu_{n}\,(n=1,2,\dots )$ on $(S,\Sigma )$ is said to converge weakly to the finite positive measure $\mu$ (denoted $\mu_{n}\Rightarrow \mu$) if:

\begin{align*}
    \operatorname{E}_{\mu_n}[f] \to \operatorname{E}_{\mu}[f]
\end{align*}

for all bounded, continuous functions $f$.

**An Example**: Consider the sequence of Dirac delta distributions $\delta_{1/n}$, where $\delta_{x_0}$ denotes the delta distribution at $x_0 \in \mathbb{R}$, that is, $x = x_0$ with probability 1. The sequence $\delta_{1/n}$ converges weakly to the Dirac measure located at 0: since $1/n \to 0$, we have $f(1/n) \to f(0)$ for every bounded, continuous $f$. It does not converge strongly: for $A = \{0\}$, $\delta_{1/n}(A) = 0$ for every $n$, whereas $\delta_{0}(A) = 1$.

**Relative strength between adversarial divergences**: Let $\tau_1$ and $\tau_2$ be two adversarial divergences. If, for any sequence of probability measures ($\mu_n$) and any target probability measure $\mu^*$, $\tau_1({\mu^* \parallel \mu_n}) \to \inf_\mu \tau_1(\mu^* \parallel \mu)$ implies $\tau_2({\mu^* \parallel \mu_n}) \to \inf_\mu \tau_2(\mu^* \parallel \mu)$, then we say $\tau_1$ is stronger than $\tau_2$ and $\tau_2$ is weaker than $\tau_1$. We say they are equivalent if $\tau_1$ is both stronger and weaker than $\tau_2$. We say $\tau_1$ is strictly stronger (strictly weaker) than $\tau_2$ if $\tau_1$ is stronger (weaker) than $\tau_2$ but not equivalent. 

Thus the different distance measures discussed in Section 3 induce different sets of convergent sequences. In simpler terms, we say that a distance $\tau_1$ is weaker than a distance $\tau_2$ if every sequence that converges under $\tau_2$ also converges under $\tau_1$. The structure of these distance measures is summarized in the figure below.

<figure class="image">
  <img src="convergence.png" alt="loss" width="100%">
    <center><figcaption>Figure 3: Structure of adversarial divergences</figcaption></center>
</figure>


As we can observe, the Wasserstein distance is in the equivalence class of the weakest strict adversarial divergences. Arjovsky et al. \cite{arjovsky2017wasserstein} showed that the KL divergence is stronger than the JS divergence, which is equivalent to the total variation distance, which in turn is strictly stronger than the Wasserstein-1 distance.

## Why does this matter?

The Wasserstein GAN paper introduces a simple example to argue why we need to care about weak convergence. Consider probability distributions defined over $\mathbb{R}^2$. Let the true data distribution $P$ be the distribution of $(0, y)$ with $y$ sampled uniformly from $U[0,1]$. Now consider the family of distributions $Q$ of $(\theta, y)$, with $y$ also sampled from $U[0,1]$.

<figure class="image">
  <img src="distribution.png" alt="loss" width="40%">
    <center><figcaption>Figure 4: Real and Fake distribution</figcaption></center>
</figure>


We would like our optimization algorithm to learn to move the fake distribution to the target distribution at $x = 0$, that is, to drive $\theta$ to 0. As the training progresses, the distance should decrease. However, this does not occur for all distances:

- **Total Variation Distance**: For any $\theta \neq 0$, let $A = \{(0,y) : y \in [0,1]\}$. Then $P(A) = 1$ and $Q(A) = 0$, which gives us:

\begin{align*}
    \delta(P,Q) = \begin{cases}
    1,& \text{if } \theta \neq 0, \\
    0,& \text{if } \theta = 0.
    \end{cases}
\end{align*}    
    
- **KL Divergence and Reverse KL Divergence**: From section 3, we know that KL Divergence $D_{KL}(P \parallel Q) = + \infty$ if there is any point (x,y) where $P(x,y) > 0$ and $Q(x,y) = 0$. 

\begin{align*}
    D_{KL}(P \parallel Q) = D_{KL}(Q \parallel P) = \begin{cases}
    +\infty,& \text{if }\theta \neq 0, \\
    0,&  \text{if }\theta = 0
    \end{cases}
\end{align*}
    
   Thus, in both the KL divergence and the reverse KL divergence, there is a very heavy penalty at all points covered by one distribution but not by the other.
    
- **Jensen-Shannon divergence**: Consider the mixture $M = P/2 + Q/2$, then we have:

\begin{align*}
    D_{KL}(P \parallel M) = \int_{(x,y)} P(x,y) \log \frac{P(x,y)}{M(x,y)} dy dx
\end{align*}

   For $\theta \neq 0$ and any $(x,y)$ where $P(x,y) \neq 0$, $M(x,y) = \frac{1}{2}P(x,y)$, so this integral works out to $\log 2$. The same holds for the second term, $D_{KL}(Q \parallel M)$. This gives us:
    
\begin{align*}
    JS(P,Q) = \begin{cases}
    \log 2,& \text{if }\theta \neq 0, \\
    0,&  \text{if }\theta = 0
    \end{cases}
\end{align*}
    
- **Wasserstein-1 Distance**: Since the two distributions are simply translations of one another, the optimal transport plan moves mass in a straight line from $(0,y)$ to $(\theta, y)$. This gives us $W(P,Q) = |\theta|$.
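
The four cases above can be reproduced on a discretized version of the example. In this minimal sketch the uniform variable $y$ is replaced by `n` equally spaced points (an assumption made so that all quantities become finite sums). The Wasserstein distance is not obtained from a solver: it is bracketed by the cost of the translation plan (an upper bound) and by the Kantorovich-Rubinstein objective for the 1-Lipschitz critic $f(x, y) = -\text{sign}(\theta)\, x$ (a lower bound), which coincide.

```python
import math

def divergences_parallel_lines(theta, n=100):
    """Return (TV, JS, W1 upper bound, W1 lower bound) for P = (0, y), Q = (theta, y)."""
    ys = [i / (n - 1) for i in range(n)]
    p = {(0.0, y): 1.0 / n for y in ys}
    q = {(float(theta), y): 1.0 / n for y in ys}
    points = set(p) | set(q)

    tv = 0.5 * sum(abs(p.get(pt, 0.0) - q.get(pt, 0.0)) for pt in points)

    def kl_to_mixture(a, b):
        # KL(a || (a + b) / 2), summed over the support of a.
        return sum(m * math.log(m / (0.5 * (m + b.get(pt, 0.0)))) for pt, m in a.items())

    js = 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)

    # Upper bound: cost of the plan that moves (0, y) straight to (theta, y).
    primal = abs(theta)
    # Lower bound: dual objective for the critic f(x, y) = -sign(theta) * x.
    sign = (theta > 0) - (theta < 0)
    dual = (sum(m * (-sign * pt[0]) for pt, m in p.items())
            - sum(m * (-sign * pt[0]) for pt, m in q.items()))
    return tv, js, primal, dual

for theta in [1.0, 0.5, 0.1, 0.01, 0.0]:
    tv, js, primal, dual = divergences_parallel_lines(theta)
    print(f"theta={theta:<5} TV={tv:.3f}  JS={js:.4f}  W1 in [{dual:.3f}, {primal:.3f}]")
```

As $\theta$ shrinks, TV and JS stay at 1 and $\log 2$ until they jump to 0 at $\theta = 0$, while the Wasserstein bounds decrease smoothly with $|\theta|$.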

This example, although certainly very contrived, shows that there exist sequences of distributions that don't converge under JS, KL, Reverse KL or TV divergences but which do converge under the Wasserstein distance.

There is another important observation from this example. Since every distance other than the Wasserstein distance is constant for all $\theta \neq 0$, none of them provides a usable gradient there. Additionally, the resulting functions are not even continuous at $\theta = 0$. The paper points out that this situation is not uncommon: when the supports are low dimensional manifolds in high dimensional spaces, it is very easy for their intersection to have measure zero, which is enough to give similarly bad results.

This argument is strengthened by the following theorem from the paper: 

**Theorem**: Let $P$ be a fixed distribution over $\mathcal{X}$. Let $Z$ be a random variable over another space $\mathcal{Z}$. Let $g: \mathcal{Z} \times \mathbb{R}^d \to \mathcal{X}$ be a function, denoted by $g_\theta(z)$. Let $Q$ denote the distribution of $g_\theta(Z)$. Then:

- If $g$ is continuous in $\theta$, so is $W(P,Q)$.
- If $g$ is locally Lipschitz and satisfies some other assumptions, then $W(P,Q)$ is continuous everywhere and differentiable almost everywhere.
- Statements 1-2 are false for the Jensen-Shannon divergence $JS(P,Q)$ and all the KLs.

Thus out of KL, JS and Wasserstein distance, only the Wasserstein distance has guarantees of continuity and differentiability, which are both very desirable properties of loss functions.

From the above section on convergence measures, we also know that every sequence of distributions that converges under the KL, Reverse KL, TV and JS divergences also converges under the Wasserstein distance. Together, these properties make a strong argument in favor of using the Wasserstein distance.

## 4. Generative Adversarial Networks

In the above section, we used an objective function that has access to the true distribution. In reality, we only have access to samples from it, so we cannot explicitly compute the reverse KL divergence to optimize the parameters of our model. Instead, GANs use a discriminator as a critic function that differentiates between real samples and samples from the generator. We train the discriminator network $D$ to maximize the probability of assigning the correct label to both the training examples and the generated samples from $G$. We simultaneously train $G$ to minimize $\log(1-D(G(z)))$, where $z$ is an input noise variable drawn from a distribution $p_z(z)$. This leads to the following loss function for the discriminator (also known as the binary cross-entropy loss):

\begin{align*}
    L = - \mathbb{E}_{x \sim p(x)} [\log D_\omega(x)] - \mathbb{E}_{x \sim q_\theta(x)} [\log (1 - D_\omega(x))]
\end{align*}

For a fixed generator, this loss is minimized by the optimal discriminator $D^{*}_\omega(x) = \frac{p(x)}{p(x)+q_\theta(x)}$. Substituting it into $L$ gives:

\begin{align*}
    L &= - \mathbb{E}_{x \sim p(x)} \left[\log \frac{p(x)}{p(x)+q_\theta(x)}\right] - \mathbb{E}_{x \sim q_\theta(x)} \left[\log \frac{q_{\theta}(x)}{p(x)+q_\theta(x)}\right] \\
    &= - \mathbb{E}_{x \sim p(x)} \left[\log \frac{p(x)}{2 \cdot \frac{1}{2}(p(x)+q_\theta(x))}\right] - \mathbb{E}_{x \sim q_\theta(x)} \left[\log \frac{q_{\theta}(x)}{2 \cdot \frac{1}{2}(p(x)+q_\theta(x))}\right] \\
    &= - \mathbb{E}_{x \sim p(x)} \left[\log \frac{p(x)}{\frac{1}{2}(p(x)+q_\theta(x))}\right] - \mathbb{E}_{x \sim q_\theta(x)} \left[\log \frac{q_\theta(x)}{\frac{1}{2}(p(x)+q_\theta(x))}\right] + \log (2) + \log (2) \\
    &= \log(4) - D_{KL}\left(p(x) \parallel \frac{p(x) + q_\theta(x)}{2}\right) - D_{KL}\left(q_\theta(x) \parallel \frac{p(x)+q_\theta(x)}{2}\right)
\end{align*}

where $D_{KL}$ is the Kullback-Leibler divergence. This equation directly leads us to the Jensen-Shannon divergence:

\begin{align*}
    L = \log(4) - 2 D_{JS} (p(x) \parallel q_\theta(x))
\end{align*}

The generator plays against the discriminator, so it minimizes $-L = -\log(4) + 2 D_{JS}(p(x) \parallel q_\theta(x))$, that is, the Jensen-Shannon divergence between the data distribution and the model distribution.
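The link to the Jensen-Shannon divergence can be verified numerically. The sketch below assumes two small discrete distributions (the values of `p` and `q` are made up for illustration) and checks that the value $\mathbb{E}_{p}[\log D^{*}] + \mathbb{E}_{q_\theta}[\log(1 - D^{*})]$, i.e. the negative of the cross-entropy loss at the optimal discriminator $D^{*} = p / (p + q_\theta)$, equals $-\log 4 + 2 D_{JS}$. Function names are hypothetical.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def value_at_optimal_discriminator(p, q):
    """E_p[log D*] + E_q[log(1 - D*)] with D* = p / (p + q)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi + qi == 0.0:
            continue  # neither distribution puts mass here
        d_star = pi / (pi + qi)
        if pi > 0:
            total += pi * math.log(d_star)
        if qi > 0:
            total += qi * math.log(1.0 - d_star)
    return total


# Illustrative distributions over three outcomes (assumed values).
p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]

lhs = value_at_optimal_discriminator(p, q)
rhs = -math.log(4.0) + 2.0 * js(p, q)
print(lhs, rhs)
```

The two printed numbers agree up to floating-point error, and both reach $-\log 4$ when `p` and `q` coincide.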


We can generalize this framework of GANs to solving the following problem:

\begin{align*}
    \inf_\theta \sup_{f \in \mathcal{F}} \mathbb{E}_{x \sim p, y \sim q_\theta} [f(x,y)]
\end{align*}

where $\mathcal{F}$ is a class of functions. This process is considered adversarial since two networks compete to outsmart each other: the generator $G$ tries to imitate the true distribution $P$, while the adversary $f$ tries to distinguish between the true and generated data distributions. Thus the generator needs to find the optimal $\theta^*$ such that:

\begin{align*}
    \theta^* = \underset{\theta}{\mathrm{argmin}}\ \sup_{f \in \mathcal{F}} \mathbb{E}_{x \sim p, y \sim q_\theta} [f(x,y)]
\end{align*}

The objective function measures the distance between the target distribution $P$ and the current estimate $q_\theta$. Hence, minimizing this function can lead us to a good approximation of the target distribution. This leads us to the concept of _Adversarial Losses_.

## 5. Adversarial Losses

In practice, the function class $\mathcal{F}$ is usually a transformation of a simple function class indexed by $\Omega$, the set of discriminator parameters, as is common in the GAN literature.

### f-divergence

Nowozin et al. \cite{nowozin2016f} introduce the $f$-GAN objective function. $Q_\theta$ is our generative model, parameterized by $\theta$, which takes a random vector as input and outputs a sample of interest. $T_\omega$ is our discriminator function, parameterized by $\omega$, which takes a sample as input and returns a scalar.

\begin{align*}
    L(\theta, \omega) &= \mathbb{E}_{x \sim P} \left[T_{\omega}(x) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[f^{*}(T_{\omega}(x)) \right] \\
\end{align*}

The output of the discriminator $T_\omega(x)$ needs to respect the domain $\text{dom}_{f^{*}}$ of the conjugate function $f^{*}$. Therefore, we need an output activation function of the discriminator specific to the $f$-divergence used.

\begin{align*}
    L(\theta, \omega) &= \mathbb{E}_{x \sim P} \left[g_f(V_{\omega}(x)) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[f^{*}(g_f(V_{\omega}(x))) \right] \\
    &= \mathbb{E}_{x \sim P, \hat{x} \sim Q_{\theta}} \left[ g_f(V_{\omega}(x)) - f^{*}(g_f(V_{\omega}(\hat{x}))) \right] \\
\end{align*}

where $V_\omega : \mathcal{X} \mapsto \mathbb{R}$ without any range constraints on the output and $g_f : \mathbb{R} \mapsto \text{dom}_{f^{*}}$ is an output activation function of the discriminator specific to the $f$-divergence used.

Now, we find the saddle point of $L(\theta, \omega)$ by minimizing with respect to $\theta$ and maximizing with respect to $\omega$.
\begin{equation}
    \theta^{*} = \underset{\theta}{\mathrm{argmin}}\ \max_{\omega} L(\theta, \omega) \label{eq:loss-f-div}
\end{equation}

Table \ref{table:f-div-loss} lists the common $f$-divergences that we consider in this paper, together with their output activations $g_f$ and Fenchel conjugates $f^{*}$.

<table>
    <tr>
        <th>Name</th>
        <th>Output activation $g_f$</th>
        <th>Conjugate $f^{*}(t)$</th>
    </tr>
    <tr>
        <td>Kullback-Leibler (KL)</td>
        <td>$v$</td>
        <td>$\exp(t - 1)$</td>
    </tr>
    <tr>
        <td>Reverse KL</td>
        <td>$-\exp(v)$</td>
        <td>$-1 - \log(-t)$</td>
    </tr>
    <tr>
        <td>Jensen-Shannon</td>
        <td>$\log(2) - \log(1 + \exp(-v))$</td>
        <td>$-\log(2 - \exp(t))$</td>
    </tr>
    <tr>
        <td>Pearson $\chi^2$</td>
        <td>$v$</td>
        <td>$\frac{1}{4} t^2 + t$</td>
    </tr>
    <tr>
        <td>Total Variation</td>
        <td>$\frac{1}{2} \tanh(v)$</td>
        <td>$t$</td>
    </tr>
</table>


### Kullback-Leibler Divergence


Substituting in (\ref{eq:loss-f-div}), we get the loss function for Kullback-Leibler GAN,

\begin{align*}
    \theta^{*} &= \underset{\theta}{\mathrm{argmin}}\ \max_{\omega} \mathbb{E}_{x \sim P, \hat{x} \sim Q_{\theta}} \left[ V_{\omega}(x) - \exp(V_{\omega}(\hat{x})-1) \right]
\end{align*}

### Reverse KL Divergence

Substituting in (\ref{eq:loss-f-div}), we get the loss function for Reverse KL GAN

\begin{align*}
    \theta^{*} &= \underset{\theta}{\mathrm{argmin}}\ \max_{\omega} \mathbb{E}_{x \sim P, \hat{x} \sim Q_{\theta}} \left[ -\exp(V_{\omega}(x)) + 1 + V_{\omega}(\hat{x}) \right]
\end{align*}

### Jensen-Shannon Divergence

Substituting in (\ref{eq:loss-f-div}), we get the loss function for Jensen-Shannon GAN

\begin{align*}
    \theta^{*} &= \underset{\theta}{\mathrm{argmin}}\ \max_{\omega} \mathbb{E}_{x \sim P, \hat{x} \sim Q_{\theta}} \left[ \log{\frac{2}{1 + \exp{(-V_{\omega}(x))}}} + \log \left(2 - \frac{2}{1 + \exp{(-V_{\omega}(\hat{x}))}}\right)\right] \\
    &= \underset{\theta}{\mathrm{argmin}}\ \max_{\omega} \mathbb{E}_{x \sim P, \hat{x} \sim Q_{\theta}} \left[ \log{\frac{2}{1 + \exp{(-V_{\omega}(x))}}} + \log \left(\frac{2\exp{(-V_{\omega}(\hat{x}))}}{1 + \exp{(-V_{\omega}(\hat{x}))}}\right)\right] 
\end{align*}
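This loss function is very similar to the loss function of the original GAN: with $D_\omega(x) = 1 / (1 + \exp(-V_\omega(x)))$, the two terms equal $\log 2 + \log D_\omega(x)$ and $\log 2 + \log(1 - D_\omega(\hat{x}))$, so the two objectives differ only by the constant $\log 4$.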

### Total Variation Distance

Substituting in (\ref{eq:loss-f-div}), we get the loss function for Total Variation GAN

\begin{align*}
    \theta^{*} &= \underset{\theta}{\mathrm{argmin}}\ \max_{\omega} \mathbb{E}_{x \sim P, \hat{x} \sim Q_{\theta}} \left[ \frac{1}{2}\tanh{V_\omega(x)} - \frac{1}{2} \tanh{V_\omega(\hat{x})} \right] \\
\end{align*}

### Pearson $\chi^2$ Divergence

Substituting in (\ref{eq:loss-f-div}), we get the loss function for Pearson $\chi^2$ GAN

\begin{align*}
    \theta^{*} &= \underset{\theta}{\mathrm{argmin}}\ \max_{\omega} \mathbb{E}_{x \sim P, \hat{x} \sim Q_{\theta}} \left[ V_\omega(x) - \frac{1}{4} V_\omega(\hat{x}) ^ 2 - V_\omega(\hat{x}) \right] \\
\end{align*}
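A minimal sketch of the five objectives above in plain Python, assuming the discriminator's raw outputs $V_\omega(x)$ and $V_\omega(\hat{x})$ are already available as lists of floats. The dictionary and function names are hypothetical, and the conjugates $f^{*}$ are the ones implied by the substituted losses above.

```python
import math

# Output activations g_f, keyed by a short (hypothetical) divergence name.
ACTIVATION = {
    "kl": lambda v: v,
    "reverse_kl": lambda v: -math.exp(v),
    "js": lambda v: math.log(2.0) - math.log1p(math.exp(-v)),
    "pearson": lambda v: v,
    "tv": lambda v: 0.5 * math.tanh(v),
}

# Fenchel conjugates f*, evaluated on the activated output t = g_f(v).
CONJUGATE = {
    "kl": lambda t: math.exp(t - 1.0),
    "reverse_kl": lambda t: -1.0 - math.log(-t),
    "js": lambda t: -math.log(2.0 - math.exp(t)),
    "pearson": lambda t: 0.25 * t * t + t,
    "tv": lambda t: t,
}


def fgan_objective(name, v_real, v_fake):
    """Monte Carlo estimate of E_P[g_f(V(x))] - E_Q[f*(g_f(V(x_hat)))]."""
    g, f_star = ACTIVATION[name], CONJUGATE[name]
    real_term = sum(g(v) for v in v_real) / len(v_real)
    fake_term = sum(f_star(g(v)) for v in v_fake) / len(v_fake)
    return real_term - fake_term


def generator_term(name, v):
    """Part of the objective that depends on a generated sample: -f*(g_f(v))."""
    return -CONJUGATE[name](ACTIVATION[name](v))


# Evaluate the generator term on raw discriminator outputs from -4 to +4.
for name in ACTIVATION:
    curve = [generator_term(name, float(v)) for v in range(-4, 5)]
    print(name, [round(c, 3) for c in curve])
```

The discriminator would ascend `fgan_objective` in $\omega$ while the generator descends it in $\theta$; only `generator_term` carries gradient back to the generator.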


<figure class="image">
  <img src="fgan_gd.png" alt="loss" width="50%">
    <center><figcaption>Figure 5: We plot the loss to the generator for different values of the discriminator outputs ranging from -4 to +4 before these values are passed to their respective activation functions. </figcaption></center>
</figure>


### Maximum Mean Discrepancy Distance

For finite samples $\{x_1, \dots, x_n \} \sim P$ and $\{y_1, \dots, y_n\} \sim Q$, the empirical estimate of the MMD distance $M_k(P, Q)$ \cite{li2017mmd} is

\begin{align*}
    \hat{M}_k(P, Q) = \frac{1}{\binom{n}{2}} \sum_{i \neq i'} k(x_i, x_{i'}) - \frac{2}{\binom{n}{2}} \sum_{i \neq j} k(x_i, y_j) + \frac{1}{\binom{n}{2}} \sum_{j \neq j'} k(y_j, y_{j'})
\end{align*}

Because $\hat{M}_k(P, Q)$ is computed from finite samples, it is generally not exactly zero even if $P = Q$. Therefore, we say that $Q$ is indistinguishable from $P$ if $\hat{M}_k(P, Q) \leq c_\alpha$, where $c_\alpha > 0$ is some threshold. The generator $G_\theta$ can be trained to learn $P$ using a fixed kernel $k$ by minimizing $\hat{M}_k(P, Q_\theta)$ with respect to $\theta$.

\begin{align*}
    \theta^{*} = \underset{\theta}{\mathrm{argmin}}\  \hat{M}_k(P, Q_\theta)
\end{align*}
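The finite-sample behaviour can be illustrated with a short sketch. It assumes one-dimensional samples and a Gaussian kernel with an arbitrarily chosen bandwidth of 1, and it uses the standard unbiased normalization ($\frac{1}{n(n-1)}$ for the within-sample sums and $\frac{1}{nm}$ for the cross term), which may differ by constant factors from the formula above. Function names are illustrative.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars; the bandwidth is an assumed value."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """Unbiased estimate of the squared MMD between two samples."""
    n, m = len(xs), len(ys)
    # Within-sample sums skip the diagonal terms k(x_i, x_i).
    k_xx = sum(kernel(xs[i], xs[j]) for i in range(n) for j in range(n) if i != j)
    k_yy = sum(kernel(ys[i], ys[j]) for i in range(m) for j in range(m) if i != j)
    k_xy = sum(kernel(x, y) for x in xs for y in ys)
    return k_xx / (n * (n - 1)) + k_yy / (m * (m - 1)) - 2.0 * k_xy / (n * m)


rng = random.Random(0)
same_a = [rng.gauss(0.0, 1.0) for _ in range(200)]   # sample from P
same_b = [rng.gauss(0.0, 1.0) for _ in range(200)]   # second sample from P
shifted = [rng.gauss(2.0, 1.0) for _ in range(200)]  # sample from a shifted Q

print("P vs P:", mmd2_unbiased(same_a, same_b))
print("P vs Q:", mmd2_unbiased(same_a, shifted))
```

The estimate for two samples drawn from the same distribution is small but typically nonzero, which is why a threshold $c_\alpha$ is used instead of testing for exact equality with zero.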

Instead of a fixed kernel, we can also select a kernel $k \in \mathcal{K}$ such that we have a stronger signal where $P \neq Q_\theta$ to train $G_\theta$. Now, the objective for training is

\begin{align*}
    \min_\theta \max_{k \in \mathcal{K}} \hat{M}_k (P, Q_\theta)
\end{align*}

Since selecting a set of kernels $\mathcal{K}$ by hand is tedious, we introduce a family of injective functions $f_\phi$ parameterized by $\phi$ \cite{gretton2012kernel, gretton2012optimal} and compose it with the kernel $k$. $f_\phi$ learns a mapping from one space to another. The final kernel is $\tilde{k} = k \circ f_\phi$, where $\tilde{k}(x, x') = k(f_\phi(x), f_\phi(x'))$. The new objective is 

\begin{align*}
    \min_\theta \max_{\phi} \hat{M}_{k \circ f_\phi} (P, Q_\theta)
\end{align*}

Since $f_\phi$ must be injective, there must exist an inverse $f^{-1}$ such that $f^{-1}(f_\phi(x)) = x, \forall x \in \mathcal{X}$. Li et al. \cite{li2017mmd} use an autoencoder to approximate this. The discriminator is therefore parameterized by $\phi = \{ \phi_{e}, \phi_d \}$ and consists of an encoder $f_{\phi_e}$ and a decoder $f_{\phi_d} \approx f^{-1}$ that regularizes the training of $f$. The final objective is

\begin{align*}
    \min_\theta \max_{\phi} \hat{M}_{f_{\phi_e}} (P(\mathcal{X}), P(G_\theta(\mathcal{Z}))) - \lambda \mathbb{E}_{y \in \mathcal{X} \cup G(\mathcal{Z})} \|y - f_{\phi_d}(f_{\phi_e}(y)) \|^2
\end{align*}

### EBGAN

Energy-based GANs \cite{zhao2016energy} are similar to the vanilla GAN, but they are trained using a different loss function that involves a margin loss. The objective for the discriminator $D$ parameterized by $\omega$ under energy-based GANs is 

\begin{align*}
    \omega^{*} = \underset{\omega}{\mathrm{argmin}}\  \mathbb{E}_{x \sim P}[D_\omega(x)] + \mathbb{E}_{z \sim P_z}[[m - D_\omega(G_{\theta}(z))]^{+}]
\end{align*}

for some margin $m > 0$, where $[x]^{+} = \max(0,x)$. The objective of the generator network $G_{\theta}$ is 

\begin{align*}
    \theta^{*} = \underset{\theta}{\mathrm{argmin}}\  \mathbb{E}_{z\sim\mathbb{P}_z}[D_\omega(G_{\theta}(z))] -  \mathbb{E}_{x \sim P}[D_\omega(x)] 
\end{align*}

The essence of the energy based model is to map any point of the input space (the space of images) to a single scalar, which is called "energy". The learning phase involves using the training data to shape the energy surface in such a way that the desired configurations get assigned low energies while the incorrect ones get assigned high energies. 

In the context of generative adversarial networks, the output of the discriminator is treated as the "energy". The discriminator is trained to assign low energy values to the regions of high data density and higher energy values outside these regions. Conversely, the generator can be viewed as a trainable parameterized function that produces samples in regions of the space in which the discriminator assigns low energy. 
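The two EBGAN objectives can be written as plain functions of the energies. This is a sketch under the assumption that the discriminator outputs $D_\omega(x)$ and $D_\omega(G_\theta(z))$ have already been computed for a batch and are passed in as lists of floats; the function names and the example values are hypothetical.

```python
def hinge(x):
    """[x]^+ = max(0, x)."""
    return max(0.0, x)


def mean(values):
    return sum(values) / len(values)


def ebgan_discriminator_loss(real_energies, fake_energies, margin):
    """Low energy on real samples, energy of at least `margin` on generated ones."""
    return mean(real_energies) + mean([hinge(margin - e) for e in fake_energies])


def ebgan_generator_loss(real_energies, fake_energies):
    """Generator objective: energy of generated samples minus energy of real ones."""
    return mean(fake_energies) - mean(real_energies)


# Illustrative energies for a batch of two real and two generated samples.
real = [0.1, 0.3]
fake = [0.5, 2.0]
d_loss = ebgan_discriminator_loss(real, fake, margin=1.0)
g_loss = ebgan_generator_loss(real, fake)
print(d_loss, g_loss)
```

Generated samples whose energy already exceeds the margin (here the one with energy 2.0) contribute nothing to the discriminator loss, so the discriminator stops pushing their energy higher.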

Arjovsky et al. \cite{arjovsky2017wasserstein} prove that under an optimal discriminator, energy-based GANs optimize the total variation distance between the real and generated distributions. They show that, under an optimal discriminator $D^{*}$, the above generator objective reduces to a multiple of the total variation distance $\delta(P, Q_\theta)$:

\begin{align*}
    \theta^{*} &= \underset{\theta}{\mathrm{argmin}}\  \mathbb{E}_{z\sim\mathbb{P}_z}[D^{*}(G_{\theta}(z))] -  \mathbb{E}_{x \sim P}[D^{*}(x)] \\
    &= \underset{\theta}{\mathrm{argmin}}\ \frac{m}{2} \delta(P, Q_\theta)
\end{align*}

## 6. Experiments

## 7. Evaluations