<center>
    <h1>Adversarial Losses of GANs</h1>
</center>

|    Anish Shah    |    Deshana Desai    |Benjamin Ahlbrand|
|:----------------:|:-------------------:|:---------------:|
|shah.anish@nyu.edu|deshana.desai@nyu.edu| ba1404@nyu.edu  |

<center> 
    <h4>Abstract</h4>
We study a large class of loss functions used for training Generative Adversarial Networks (GANs), called “adversarial losses”. Besides the vanilla GAN, we review the loss functions in $f$-GAN, Maximum Mean Discrepancy (MMD) GAN, Wasserstein GAN and Energy-based GAN. We discuss the statistical properties of these distance measures that affect their behaviour and how they are employed in GANs. Further, we perform experiments and create simple visualizations to show how these distance measures affect the network's ability to cover all modes (or to generate better samples by covering fewer modes), whether they lead to vanishing gradients or produce disentangled latent spaces, and how the variance of the cost values depends on the discriminator outputs. We also review the effectiveness of the distance measures in producing samples, using metrics such as visual quality, smoothness of interpolations and inception score on the LSUN dataset. We perform some of these experiments on smaller synthetic datasets due to hardware and computational time constraints. A natural extension of our study of the distance between the generator distribution and the training data distribution, and separately between the discriminator model distribution and the training data distribution, is provided by optimal transport (OT) theory. Recently, GANs have been used in conjunction with techniques from OT by framing the problem as minimizing the cost of transporting one data distribution to another. We review and include some of these techniques in our discussion of distance measures.
</center>

## 1. Introduction

## 2. List of Notations

<table>
    <tr>
        <td>$D$</td>
        <td>The discriminator</td>
    </tr>
    <tr>
        <td>$\omega$</td>
        <td>parameter for our discriminator</td>
    </tr>
    <tr>
        <td>$f$</td>
        <td>A convex, lower-semicontinuous function satisfying $f(1) = 0$</td>
    </tr>
    <tr>
        <td>$f^{*}$</td>
        <td>Fenchel conjugate of $f$</td>
    </tr>
    <tr>
        <td>$V$</td>
        <td>$\mathcal{X} \mapsto \mathbb{R}$, output of discriminator without the activation function</td>
    </tr>
    <tr>
        <td>$g_f$</td>
        <td>$\mathbb{R} \mapsto \text{dom}_{f^{*}}$, output activation function which respects the domain $\text{dom}_{f^{*}}$</td>
    </tr>
    <tr>
        <td>$G$</td>
        <td>The generator</td>
    </tr>
    <tr>
        <td>$P$</td>
        <td>True or Target distribution</td>
    </tr>
    <tr>
        <td>$Q$</td>
        <td>Model or generated distribution</td>
    </tr>
    <tr>
        <td>$p(x)$</td>
        <td>probability density function of $P$</td>
    </tr>
    <tr>
        <td>$q(x)$</td>
        <td>probability density function of $Q$</td>
    </tr>
    <tr>
        <td>$\theta$</td>
        <td>parameter for model distribution or generator</td>
    </tr>
    <tr>
        <td>$z$</td>
        <td>The latent vector</td>
    </tr>
    <tr>
        <td>$\mathcal{Z}$</td>
        <td>The latent space</td>
    </tr>
    <tr>
        <td>$\mathcal{X}$</td>
        <td>The sample space</td>
    </tr>
    <tr>
        <td>$L$</td>
        <td>loss function of GAN </td>
    </tr>
</table>

## 3. Statistical Divergence Measures

A divergence measure is a function that quantifies the dissimilarity between two probability distributions. A divergence need not be symmetric (that is, in general the divergence from $p$ to $q$ is not equal to the divergence from $q$ to $p$), and need not satisfy the triangle inequality \cite{wiki:xxx}.

### f-divergence

In statistics and probability theory, an $f$-divergence is a function $D_{f}\left( P \parallel Q \right)$ that measures the difference between two probability distributions $P$ and $Q$ \cite{csiszar2004information, liese2006divergences}. If $P$ and $Q$ are absolutely continuous with respect to a reference measure $dx$ on $\mathcal{X}$, with probability density functions $p$ and $q$ respectively, then we define the $f$-divergence as

\begin{align} \label{eq:fdiv}
    D_f(P \parallel Q) = \int_{\mathcal{X}} q(x) f \left( \frac{p(x)}{q(x)} \right) dx
\end{align}
    

where the \textit{generator function} $f: \mathbb{R}_{+} \mapsto \mathbb{R}$ is a convex, lower-semicontinuous function satisfying
$f(1) = 0$. Every convex, lower-semicontinuous function $f$ has a \textit{convex conjugate} function $f^{*}$, also known as the \textit{Fenchel conjugate} \cite{hiriart2012fundamentals}, defined as $f^{*}(t) = \sup\limits_{u \in \text{dom}_{f}} \{ut -  f(u)\}$. Since $f$ is convex and lower-semicontinuous, the conjugate of $f^{*}$ is again $f$, so that $f(u) = \sup\limits_{t \in \text{dom}_{f^{*}}} \{tu - f^{*}(t)\}$.

Expressing $f$ through its Fenchel conjugate in (\ref{eq:fdiv}),

\begin{align*}
    D_f(P \parallel Q) &= \int_{\mathcal{X}} q(x) \sup\limits_{t \in \text{dom}_{f^{*}} } \left\{ t \frac{p(x)}{q(x)} - f^{*}(t) \right\}  dx \\
    \intertext{By Jensen's inequality,}
    &\geq \sup\limits_{T \in \mathcal{T}} \left( \int_{\mathcal{X}}p(x)T(x)dx - \int_{\mathcal{X}}q(x)f^{*}(T(x)) dx \right) \\
    &= \sup\limits_{T \in \mathcal{T}} \left( \mathbb{E}_{x \sim P} \left[T(x)\right] - \mathbb{E}_{x \sim Q} \left[f^{*}(T(x))\right] \right)
\end{align*}

where $\mathcal{T}$ is an arbitrary class of functions $T : \mathcal{X} \mapsto \mathbb{R}$.
The lower bound is tight for $T^{*}(x) = f^{'} \left( \frac{p(x)}{q(x)} \right)$ \cite{nguyen2010estimating}, where $f'$ is the first-order derivative of $f$.
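The Fenchel conjugate defined above can be sanity-checked numerically. The following is a minimal sketch, assuming the KL generator $f(u) = u \log u$, whose conjugate is $\exp(t - 1)$; the helper names `f_kl` and `conjugate_on_grid`, the grid bounds and the step size are illustrative choices, not part of the original text.

```python
import math

def f_kl(u):
    """KL generator f(u) = u log u, with the convention f(0) = 0."""
    return u * math.log(u) if u > 0 else 0.0

def conjugate_on_grid(f, t, u_max=20.0, step=1e-3):
    """Approximate f*(t) = sup_u {u t - f(u)} by brute force over a grid of u in [0, u_max]."""
    n = round(u_max / step)
    return max(i * step * t - f(i * step) for i in range(n + 1))

# Compare the brute-force supremum with the closed form exp(t - 1).
for t in (-1.0, 0.0, 1.0, 2.0):
    approx = conjugate_on_grid(f_kl, t)
    exact = math.exp(t - 1.0)
    print(f"t={t:+.1f}  grid sup={approx:.6f}  exp(t-1)={exact:.6f}")
```

The grid search only gives a lower estimate of the supremum, so the two columns should agree up to the grid resolution rather than exactly.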

If we have a model $q_{\theta}(x)$ that should match the true distribution $p(x)$, we need to adjust the parameter $\theta$ using gradient descent to minimize the $f$-divergence. Our goal will be to find the best parameter $\theta^{*}$ using 

\begin{align}
    \theta^{*} &= \argmin_{\theta} D_{f} \left( P \parallel Q_{\theta} \right) \nonumber \\
    &= \argmin_{\theta} \mathbb{E}_{x \sim P} \left[ f^{'} \left( \frac{p(x)}{q_{\theta}(x)} \right) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[f^{*}\left(f^{'} \left( \frac{p(x)}{q_{\theta}(x)} \right)\right)\right] \label{eq:theta-f-div}
\end{align}


Tables \ref{table:f-div} and \ref{table:f-div-loss} list the common $f$-divergences that we consider in this paper, together with their Fenchel conjugates $f^{*}$.

\begin{table}
\centering
\begin{tabular}{ l l l l  } 
 \hline
 Name & Generator $f(u)$ & Conjugate $f^{*}(t)$ & $T^*(x)$\\ 
 \hline
Kullback-Leibler (KL) & $u \log u$ & $\exp{(t - 1)}$ & $1 + \log \frac{p(x)}{q(x)}$ \\
Reverse KL & $-\log u$ & $-1-\log(-t)$ & $-\frac{q(x)}{p(x)}$\\
Jensen-Shannon & $-(u+1)\log \frac{1 + u}{2} + u \log u$ & $-\log(2-\exp(t))$ & $\log \frac{2p(x)}{p(x) + q(x)}$\\
Pearson $\chi^2$ & $(u - 1)^2$ & $\frac{1}{4}t^2 + t$ & $2 \left( \frac{p(x)}{q(x)} - 1 \right)$ \\
Total Variation & $\frac{1}{2} | u - 1 |$ & $t$ & $\frac{1}{2} \text{sign}\left( \frac{p(x)}{q(x)} - 1 \right)$ \\
 \hline
\end{tabular}
\caption{List of $f$-divergences that we consider in this paper}
\label{table:f-div}
\end{table}
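The rows of the table can be checked on a small discrete example. The sketch below assumes two hypothetical three-outcome distributions `P` and `Q` (chosen only for illustration) and compares the direct definition $\sum_x q(x) f(p(x)/q(x))$ with the variational expression $\mathbb{E}_{P}[T^{*}] - \mathbb{E}_{Q}[f^{*}(T^{*})]$ for every row.

```python
import math

# Each entry: (generator f(u), conjugate f*(t), optimal critic f'(u)), with u = p(x)/q(x).
DIVERGENCES = {
    "KL": (lambda u: u * math.log(u),
           lambda t: math.exp(t - 1.0),
           lambda u: 1.0 + math.log(u)),
    "Reverse KL": (lambda u: -math.log(u),
                   lambda t: -1.0 - math.log(-t),
                   lambda u: -1.0 / u),
    "Jensen-Shannon": (lambda u: -(u + 1.0) * math.log((1.0 + u) / 2.0) + u * math.log(u),
                       lambda t: -math.log(2.0 - math.exp(t)),
                       lambda u: math.log(2.0 * u / (1.0 + u))),
    "Pearson chi^2": (lambda u: (u - 1.0) ** 2,
                      lambda t: 0.25 * t * t + t,
                      lambda u: 2.0 * (u - 1.0)),
    "Total Variation": (lambda u: 0.5 * abs(u - 1.0),
                        lambda t: t,
                        lambda u: 0.5 * math.copysign(1.0, u - 1.0)),
}

# Hypothetical discrete distributions with full support (illustration only).
P = [0.1, 0.4, 0.5]
Q = [0.3, 0.3, 0.4]

def direct(f, p, q):
    """D_f(P || Q) = sum_x q(x) f(p(x) / q(x))."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def variational(f_conj, t_star, p, q):
    """E_P[T*(x)] - E_Q[f*(T*(x))], evaluated at the optimal critic T*."""
    e_p = sum(pi * t_star(pi / qi) for pi, qi in zip(p, q))
    e_q = sum(qi * f_conj(t_star(pi / qi)) for pi, qi in zip(p, q))
    return e_p - e_q

for name, (f, f_conj, t_star) in DIVERGENCES.items():
    print(f"{name:16s} direct={direct(f, P, Q):.6f}  variational={variational(f_conj, t_star, P, Q):.6f}")
```

The two columns should coincide for each row, which is the tightness of the lower bound at $T^{*}$.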

\subsubsection{Kullback Leibler Divergence}

The KL divergence is an $f$-divergence that measures how a target probability distribution $p$ diverges from a model probability distribution $q$. For the KL divergence, $f(u) = u \log u$ \cite{kldivwiki}.

\begin{equation}
    D_{KL}(p \parallel q) = \int_x p(x) \log\frac{p(x)}{q(x)}dx
\end{equation}

Substituting in (\ref{eq:theta-f-div}), we can find the best set of parameters $\theta^{*}$ that minimizes the KL divergence\cite{gananddiv}, 

\begin{align*}
    \theta^{*} &=  \argmin_{\theta} \mathbb{E}_{x \sim P} \left[ 1 + \log \left( \frac{p(x)}{q_{\theta}(x)} \right) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[ \frac{p(x)}{q_{\theta}(x)} \right] \\
    &=  \argmin_{\theta}\mathbb{E}_{x \sim P} \left[ 1 + \log \left( \frac{p(x)}{q_{\theta}(x)} \right) \right] - \mathbb{E}_{x \sim P} \left[ 1 \right] \\
    &=  \argmin_{\theta}\mathbb{E}_{x \sim P} \left[ \log \left( \frac{p(x)}{q_{\theta}(x)} \right) \right] \\
    &=  \argmin_{\theta}\mathbb{E}_{x \sim P} \left[ \log{p(x)} - \log{q_{\theta}(x)} \right] \\
    &=  \argmin_{\theta}\mathbb{E}_{x \sim P} \left[ \log{p(x)} \right] - \mathbb{E}_{x \sim P} \left[ \log{q_{\theta}(x)} \right] \\
    \intertext{The first term in the above equation is simply the negative entropy of the true distribution $p(x)$. Since this quantity is independent of $\theta$, we get:}
    &= \argmin_{\theta} - \mathbb{E}_{x \sim P} \left[ \log{q_{\theta}(x)} \right]  \\ 
    &= \argmax_{\theta} \mathbb{E}_{x \sim P} \left[ \log{q_{\theta}(x)} \right]
\end{align*}

This objective seeks parameters $\theta$ such that samples from $p(x)$ are assigned the highest log probability under $q_\theta(x)$, which is exactly the maximum likelihood objective. As an example, consider a dataset drawn from a mixture of two Gaussians. If we draw a sample from $p(x)$ that has low probability under $q_\theta(x)$, then $\log q_\theta(x)$ goes to negative infinity, and the parameters are heavily penalized. On the other hand, if some input $x$ has low probability under $p(x)$ but is assigned high probability under $q_\theta(x)$, the maximum likelihood loss is barely affected. As a result, the estimated model tries to cover the entire support of the true distribution, and in doing so ends up assigning probability mass to regions (for example, between the two mixture components) that have low probability under $p(x)$.

\subsubsection{Reverse KL Divergence}

To get around the above issue, we can instead minimize the reverse KL divergence, for which $f(u) = -\log u$. Substituting in (\ref{eq:theta-f-div}), we can find the best set of parameters $\theta^{*}$ that minimizes the reverse KL divergence \cite{gananddiv},

\begin{align*}
    \theta^{*} &=  \argmin_{\theta}\mathbb{E}_{x \sim P} \left[ - \frac{q_{\theta}(x)}{p(x)} \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[ -1 - \log \left(\frac{q_{\theta}(x)}{p(x)}\right) \right] \\
    &=  \argmin_{\theta}\mathbb{E}_{x \sim Q_{\theta}} \left[ -1 \right] + \mathbb{E}_{x \sim Q_{\theta}} \left[ 1 + \log \left(\frac{q_{\theta}(x)}{p(x)}\right) \right] \\
    &=  \argmin_{\theta}\mathbb{E}_{x \sim Q_{\theta}} \left[\log \left(\frac{q_{\theta}(x)}{p(x)}\right) \right] \\
    &=  \argmin_{\theta}\mathbb{E}_{x \sim Q_{\theta}} \left[\log q_{\theta}(x) - \log{p(x)}\right] \\
    &=  \argmin_{\theta}\mathbb{E}_{x \sim Q_{\theta}} \left[\log q_{\theta}(x) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[ \log{p(x)}\right] \\
    &=  \argmax_{\theta}-\mathbb{E}_{x \sim Q_{\theta}} \left[\log q_{\theta}(x) \right] + \mathbb{E}_{x \sim Q_{\theta}} \left[ \log{p(x)}\right]
\end{align*}


The first term above is simply the entropy of $q_\theta(x)$, which rewards spreading the probability mass out as much as possible. The second term is the log probability of samples from $q_\theta(x)$ under the true distribution $p(x)$. It encourages the model to produce samples that are reasonably "close" to the true distribution $p(x)$. The first term is essential: without it, the model would simply assign all probability mass to a single sample that has high probability under $p(x)$. The implication of using such an objective function is that our model picks a single mode and models it well. The solution has reasonably high entropy, and any sample from the estimated distribution has reasonably high probability under $p(x)$. The drawback is that many diverse modes may remain unexplored.

Another drawback of this method is that reverse KL provides no control over which mode is chosen; it only ensures that the distribution learned by the model has high probability under the true distribution. In contrast, maximum likelihood can result in a "worse" output because it tries to model every possible outcome even when the model cannot do so well. Maximum likelihood could achieve better results if a model with sufficiently large capacity is trained. However, this might be too expensive a trade-off (based on a recent example of the Glow approach \cite{kingma2018glow}).
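The contrast between the two objectives can be reproduced with a small grid search. The sketch below is an illustration under stated assumptions, not the experiment from this paper: the target is a hypothetical equal-weight mixture of $\mathcal{N}(-3, 1)$ and $\mathcal{N}(3, 1)$, the model is a single Gaussian $\mathcal{N}(\mu, \sigma^2)$, and both divergences are approximated by a Riemann sum on a fixed grid.

```python
import math

def log_normal(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_mixture(x):
    """Log density of the assumed target 0.5 N(-3, 1) + 0.5 N(3, 1), via log-sum-exp."""
    a = math.log(0.5) + log_normal(x, -3.0, 1.0)
    b = math.log(0.5) + log_normal(x, 3.0, 1.0)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

DX = 0.05
XS = [-15.0 + DX * i for i in range(601)]   # integration grid on [-15, 15]
LOG_P = [log_mixture(x) for x in XS]

def forward_kl(mu, sigma):
    """Riemann-sum approximation of KL(p || q_theta)."""
    return DX * sum(math.exp(lp) * (lp - log_normal(x, mu, sigma)) for x, lp in zip(XS, LOG_P))

def reverse_kl(mu, sigma):
    """Riemann-sum approximation of KL(q_theta || p)."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)
    return DX * total

# Candidate parameters: mu in {-4, -3.5, ..., 4}, sigma in {0.5, 1.0, ..., 4.0}.
GRID = [(-4.0 + 0.5 * i, 0.5 * j) for i in range(17) for j in range(1, 9)]
best_forward = min(GRID, key=lambda ms: forward_kl(*ms))
best_reverse = min(GRID, key=lambda ms: reverse_kl(*ms))
print("forward KL minimizer (mu, sigma):", best_forward)
print("reverse KL minimizer (mu, sigma):", best_reverse)
```

Under these assumptions, the forward KL minimizer should be a broad Gaussian centered between the two components, while the reverse KL minimizer should be a narrow Gaussian sitting on one of them. Which of the two components it selects is arbitrary, as the text notes.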


\subsubsection{Aside: How is KL divergence equivalent to Maximum Likelihood Estimation?}

We want to learn a distribution $Q$ that approximates $P$. We can attempt to do this by directly learning the probability density function $q(x)$ of $Q$. We optimize $Q$ through Maximum Likelihood Estimation. The MLE objective is:

\begin{equation}
    \max_{\theta \in \mathbb{R}^d} \quad \frac{1}{m} \sum_{i=1}^{m} \log q_\theta(x^{(i)})
\end{equation}

In the limit $m \to \infty$, the samples $x^{(i)}$ are distributed according to the true data distribution $P$, so the empirical average converges to an expectation under $P$. This leads to:

\begin{align*}
    & \lim_{m \to \infty} \max_{\theta \in \mathbb{R}^d} \quad \frac{1}{m} \sum_{i=1}^{m} \log q_\theta(x^{(i)}) \\
    = & \max_{\theta \in \mathbb{R}^d} \quad  \int_x p(x) \log q_\theta(x) dx \\
    =& \min_{\theta \in \mathbb{R}^d} - \int_x p(x) \log q_\theta(x) dx \\
    =& \min_{\theta \in \mathbb{R}^d} \int_x p(x) \log p(x) dx - \int_x p(x) \log q_\theta(x) dx \\
    =& \min_{\theta \in \mathbb{R}^d} D_{KL}( p(x)||q_\theta(x))
\end{align*}

As discussed earlier, if $q_\theta(x) = 0$ at an $x$ where $p(x) > 0$, the KL divergence goes to $+ \infty$. Thus if $q_\theta(x)$ has low-dimensional support, it is unlikely that the whole support of $p(x)$ lies within it. This will cause the KL divergence to explode on some data points.
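The equivalence derived in this aside can be illustrated with a toy model. The sketch below assumes a Bernoulli target with $P(x = 1) = 0.3$ and a Bernoulli model $q_\theta$ with $\theta$ restricted to a coarse grid; the sample size, the seed and the grid are arbitrary illustrative choices.

```python
import math
import random

P_TRUE = 0.3                               # assumed P(x = 1) under the true distribution
THETAS = [i / 20 for i in range(1, 20)]    # candidate parameters 0.05, 0.10, ..., 0.95

def avg_log_likelihood(theta, samples):
    """(1/m) * sum_i log q_theta(x_i) for a Bernoulli(theta) model."""
    ones = sum(samples)
    zeros = len(samples) - ones
    return (ones * math.log(theta) + zeros * math.log(1.0 - theta)) / len(samples)

def kl_bernoulli(p, theta):
    """KL(Bernoulli(p) || Bernoulli(theta))."""
    return p * math.log(p / theta) + (1.0 - p) * math.log((1.0 - p) / (1.0 - theta))

rng = random.Random(0)
samples = [1 if rng.random() < P_TRUE else 0 for _ in range(20000)]

# Maximizing the average log-likelihood and minimizing the KL divergence over the same grid.
theta_mle = max(THETAS, key=lambda th: avg_log_likelihood(th, samples))
theta_kl = min(THETAS, key=lambda th: kl_bernoulli(P_TRUE, th))
print("MLE estimate:", theta_mle, " KL minimizer:", theta_kl)
```

With a large sample, the two selected grid points should coincide, mirroring the limit $m \to \infty$ in the derivation.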

\subsubsection{Jensen Shannon Divergence}

The Jensen Shannon Divergence has a clear connection with the KL divergence and reverse KL divergence defined above. In the first case, we observed that the KL divergence abhors regions where $p(x)$ has non-zero mass but $q_{\theta}(x)$ has zero mass. It does not, however, penalize regions where $p(x)$ is small but $q_{\theta}(x)$ is large, so it does not behave well there.

In the reverse KL divergence, we saw that the objective abhors regions where the log-likelihood of the generated samples under the true data distribution is low. However, the drawback of the reverse KL divergence is that there may be some $x$ which have low likelihood under $q_{\theta}(x)$ but high likelihood under $p(x)$. Thus it does not behave well at points where $q_{\theta}(x)$ is small.

The KL divergence is asymmetric: it is clearly not the same as the reverse KL divergence. One way to obtain a symmetrised version is to simply sum the two KL divergences between the two distributions: $D_{KL}(p(x) \parallel q_\theta(x)) + D_{KL}(q_\theta(x)\parallel p(x))$

However, such a loss function would still heavily penalize regions where one of the two distributions has low mass while the other does not. Instead of measuring the distance from each distribution to the other, a natural alternative is to measure the distance from each distribution to the average of the two. This leads us to the symmetric Jensen Shannon Divergence \cite{jensen}:

\begin{align*}
    D_{JS}(P\parallel Q) = \frac{1}{2} D_{KL}(P \parallel M) + \frac{1}{2} D_{KL} (Q \parallel M)
\end{align*}

where M is the average of the individual probability distributions:

\begin{align*}
    M = \frac{1}{2}(P+Q)
\end{align*}


Unlike the sum of the two KL divergences, the Jensen-Shannon divergence is not heavily penalized when one distribution has low mass in a region where the other does not. It is well behaved when either $p(x)$ or $q_{\theta}(x)$ is small, and it is symmetric in $P$ and $Q$.


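Jensen-Shannon Divergence is also a type of $f$-divergence, with $f(u) = -(u+1)\log \frac{1 + u}{2} + u \log u$. With this generator, $D_f(P \parallel Q) = 2 D_{JS}(P \parallel Q)$; the constant factor does not affect the minimizer.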

Substituting in (\ref{eq:theta-f-div}), we can find the best set of parameters $\theta^{*}$ that minimizes the JS  divergence,

\begin{align*}
    \theta^{*} &= \argmin_{\theta}\mathbb{E}_{x \sim P} \left[ \log{\frac{2p(x)}{p(x) + q_{\theta}(x)}} \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[  - \log \left( 2 - \frac{2p(x)}{p(x) + q_{\theta}(x)} \right) \right] \\
    &= \argmin_{\theta}\mathbb{E}_{x \sim P} \left[ \log{\frac{2p(x)}{p(x) + q_{\theta}(x)}} \right] + \mathbb{E}_{x \sim Q_{\theta}} \left[ \log \frac{2q_{\theta}(x)}{p(x) + q_{\theta}(x)}  \right]
\end{align*}
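The difference in behaviour on low-mass regions is easiest to see when the two distributions do not overlap at all. The sketch below uses two hypothetical discrete distributions with disjoint supports; the helpers `kl` and `js` follow the definitions in this section, with the usual convention $0 \log 0 = 0$.

```python
import math

def kl(p, q):
    """Discrete KL(p || q), with 0 log 0 = 0 and +inf when q(x) = 0 < p(x)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical distributions with disjoint supports (illustration only).
P = [0.5, 0.5, 0.0, 0.0]
Q = [0.0, 0.0, 0.5, 0.5]

print("KL(P || Q) =", kl(P, Q))
print("KL(Q || P) =", kl(Q, P))
print("JS(P, Q)   =", js(P, Q), " log 2 =", math.log(2.0))
```

Both KL divergences are infinite here, while the Jensen-Shannon divergence stays finite and equals its maximum value $\log 2$.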

\subsubsection{Total Variation Distance}

Let $\mathcal{X}$  be a set of samples and let $\Sigma$ be the set of all Borel subsets of $\mathcal{X}$. Let Prob($\mathcal{X}$) denote the space of probability measures defined on $\mathcal{X}$.

The \textit{Total Variation} (TV) distance between two distributions $P, Q \in \text{Prob}(\mathcal{X})$ is
\begin{align*}
    \delta(P, Q) &= \sup_{A\in \Sigma} |P(A) - Q(A)| \\
    &= \sup_{A\in \Sigma} \left| \int_{A} (p(x) - q(x)) dx \right|
\end{align*}

Intuitively speaking, the total variation distance considers every event $A$ in $\Sigma$. For some events, the probability assigned by $P$ will be larger than the probability assigned to the same event by $Q$, and vice versa. We then search over all events and find the one for which the two probability assignments are most different, i.e. where the gap between the two distributions is maximal.


The supremum in $\delta(P, Q)$ is attained by the set $A_{*} = \{ x \in \mathcal{X} : p(x) \geq q(x) \}$. Since $P, Q \in \text{Prob}(\mathcal{X})$, we know that

\begin{align*}
    \int_{\mathcal{X}} (p(x) - q(x)) dx = P(\mathcal{X}) - Q(\mathcal{X}) = 0 
\end{align*}
and
\begin{align} \label{eq:tv-0}
    \int_{A_{*}} (p(x) - q(x)) dx = \int_{\mathcal{X} \setminus A_{*}} (q(x) - p(x)) dx
\end{align}

Therefore, 
\begin{align}
    \int_{\mathcal{X}} | p(x) - q(x) | dx &= 2 \int_{A_{*}} (p(x) - q(x)) dx \label{eq:tv-1} \\
    &\leq 2 \sup_{A} \left| \int_{A} (p(x) - q(x)) dx \right| \label{eq:tv-2}
\end{align}


\begin{align}
    \left| \int_{A} (p(x) - q(x)) dx \right| &= \max \left\{ \int_{A} (p(x) - q(x)) dx , \int_{A} (q(x) - p(x)) dx \right\} \nonumber \\
    &\leq \max \left\{ \int_{A_{*}} (p(x) - q(x)) dx , \int_{\mathcal{X} \setminus A_{*}} (q(x) - p(x)) dx \right\} \nonumber \\
    \intertext{From (\ref{eq:tv-0}) and (\ref{eq:tv-1}), }
    \sup_{A} \left| \int_{A} (p(x) - q(x)) dx \right| &\leq \frac{1}{2} \int_{\mathcal{X}} | p(x) - q(x) | dx  \label{eq:tv-3}
\end{align}

From \ref{eq:tv-2} and \ref{eq:tv-3},

\begin{align*}
    \delta(P, Q) &= \frac{1}{2} \int_{\mathcal{X}} | p(x) - q(x) | dx
\end{align*}
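For a finite sample space the identity above can be verified by brute force, since every subset is an event. The sketch below assumes two hypothetical distributions on four outcomes and compares the supremum over all subsets with half the $L_1$ distance; it also evaluates the gap on the set $A_{*}$.

```python
from itertools import combinations

# Hypothetical distributions on four outcomes (illustration only).
P = [0.1, 0.4, 0.2, 0.3]
Q = [0.25, 0.25, 0.25, 0.25]

def tv_by_supremum(p, q):
    """sup over all subsets A of |P(A) - Q(A)|, by enumerating every subset."""
    indices = range(len(p))
    best = 0.0
    for size in range(len(p) + 1):
        for subset in combinations(indices, size):
            gap = abs(sum(p[i] - q[i] for i in subset))
            best = max(best, gap)
    return best

def tv_half_l1(p, q):
    """(1/2) * sum_x |p(x) - q(x)|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# A_* = {x : p(x) >= q(x)} should attain the supremum.
a_star = [i for i in range(len(P)) if P[i] >= Q[i]]
gap_on_a_star = sum(P[i] - Q[i] for i in a_star)

print("sup over subsets:", tv_by_supremum(P, Q))
print("half L1 distance:", tv_half_l1(P, Q))
print("A_* =", a_star, " P(A_*) - Q(A_*) =", gap_on_a_star)
```

The enumeration is exponential in the number of outcomes, so it is only meant as a check of the identity on tiny examples.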

Therefore, Total Variation Distance is also a type of $f$-divergence for $f(u) = \frac{1}{2} | u - 1|$. We can find the best set of parameters $\theta^{*}$ that minimizes the TV distance from equation \ref{eq:theta-f-div},

\begin{align*}
    \theta^{*} &= \argmin_{\theta} \mathbb{E}_{x \sim P} \left[ \frac{1}{2} \text{sign}\left( \frac{p(x)}{q_{\theta}(x)} - 1 \right) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[ \frac{1}{2} \text{sign} \left( \frac{p(x)}{q_{\theta}(x)} - 1 \right) \right]
\end{align*}

\subsubsection{Pearson $\chi^2$ Divergence}

The Pearson $\chi^2$ divergence is another type of $f$-divergence, with $f(u) = (u - 1)^2$. Substituting in (\ref{eq:theta-f-div}), the best set of parameters $\theta^{*}$ is

\begin{align*}
    \theta^{*} &= \argmin_{\theta} \mathbb{E}_{x \sim P} \left[ 2 \left( \frac{p(x)}{q_{\theta}(x)} - 1 \right) \right] - \mathbb{E}_{x \sim Q_{\theta}} \left[\left( \frac{p(x)}{q_{\theta}(x)} - 1 \right)^2 + 2 \left( \frac{p(x)}{q_{\theta}(x)} - 1 \right)\right] 
\end{align*}

## 4. Generative Adversarial Networks

## 5. Adversarial Losses

## 6. Experiments

## 7. Evaluations