# Discrete Representation Learning

[Paper](https://proceedings.neurips.cc/paper/2017/hash/7a98af17e63a0ac09ce2e96d03992fbc-Abstract.html)

[Slides](https://danjacobellis.github.io/ITML/discrete_representation_learning.slides.html)



<center> <h1>
Discrete Representation Learning
</h1> </center>

&nbsp;

<center> <h3>
Dan Jacobellis
</h3> </center>

&nbsp;

\[1\] [Neural discrete representation learning.](https://proceedings.neurips.cc/paper/2017/hash/7a98af17e63a0ac09ce2e96d03992fbc-Abstract.html) Van Den Oord et al. NIPS 2017.

\[2\] [Generating Diverse High-Fidelity Images with VQ-VAE-2.](https://proceedings.neurips.cc/paper/2019/file/5f8e2fa1718d1bbcadf1cd9c7a54fb8c-Paper.pdf) Razavi et al. NeurIPS 2019.

\[3\] [Jukebox: A Generative Model for Music.](https://openai.com/blog/jukebox/) Dhariwal et al. OpenAI 2020.

\[4\] [Zero-Shot Text-to-Image Generation.](http://proceedings.mlr.press/v139/ramesh21a.html?ref=https://githubhelp.com) Ramesh et al. PMLR 2021.

## Discrete representations

* Abstract away noise and detail
* Make the representation discrete to match the real world
  * Language is naturally discrete
  * Speech and music can be represented by a sequence of symbols
  * Images can be described by composing objects

## Vector Quantization

* Suppose we have a randomly sampled vector $x$ from some distribution.
  * Example: $x$ is a digital audio recording, image, or video.
* Goal: create a codebook of $K$ vectors $c_1, \dots, c_K$ so that $x$ is close to at least one of the vectors in the codebook.

$$k^* = \text{arg} \min_{k}{\lVert x-c_k\rVert ^2}$$

$$x \approx c_{k^*}$$

![](img/vq.png)
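The nearest-code rule above can be sketched in a few lines of NumPy. The function name `quantize` and the toy 2-D codebook are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def quantize(x, codebook):
    """Return (k_star, c_k_star): index and value of the codebook vector nearest to x."""
    # Squared Euclidean distance from x to every codebook vector c_k.
    dists = np.sum((codebook - x) ** 2, axis=1)
    k_star = int(np.argmin(dists))
    return k_star, codebook[k_star]

# Hypothetical 2-D codebook with four vectors.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([0.9, 0.2])
k_star, x_hat = quantize(x, codebook)
print(k_star, x_hat)  # 1 [1. 0.]
```

Only the integer `k_star` needs to be stored or transmitted; the receiver recovers the approximation by indexing the shared codebook.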


## Vector Quantization

* Simple and effective form of lossy compression
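* Still used today for high-quality audio compression
* Finding a good large codebook is difficult: the objective below is non-convex, and the cost of nearest-code search grows with codebook size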

$$\{c_k^*\} := \arg\min_{\{c_k\}} \sum_{h=1}^N \min_k
\|x_h-c_k\|^2.$$
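A standard heuristic for this objective is Lloyd's algorithm (k-means), which alternates between assigning points to codes and re-centering codes. The slides do not name a specific solver, so the sketch below is an assumption; `lloyd` and the toy data are hypothetical, and the result is only a local minimum.

```python
import numpy as np

def lloyd(X, K, n_iters=20, seed=0):
    """Fit a K-vector codebook to data X of shape (N, D) with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    # Initialise the codebook with K distinct data points.
    codebook = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: nearest code for every x_h.
        d = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # Update step: move each code to the mean of its assigned points.
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    # Final distortion: the value of the objective for the fitted codebook.
    d = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook, d.min(axis=1).sum()

rng = np.random.default_rng(1)
# Toy data: two well-separated clusters in 2-D.
X = np.concatenate([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
codebook, distortion = lloyd(X, K=2)
print(codebook, distortion)
```

Codes that receive no points are left unchanged here, which is one reason large codebooks are hard to fit well.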

## Variational Autoencoders

* Encoder parameterises a posterior $q(z|x)$
* Decoder's distribution over input data is $p(x|z)$
* Prior $p(z)$ and posterior $q(z|x)$ are typically Gaussian with diagonal covariance
  * Allows us to take advantage of the reparameterization trick
  
![](img/VAE.png)
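As a minimal sketch of the reparameterization trick, a sample from a diagonal Gaussian posterior can be written as a deterministic function of the encoder outputs plus independent noise. The names `mu` and `log_var` and their values are assumptions for illustration.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)  # noise independent of the encoder parameters
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])       # hypothetical encoder outputs
log_var = np.array([0.0, -4.0])
z = np.stack([sample_latent(mu, log_var, rng) for _ in range(20000)])
print(z.mean(axis=0), z.std(axis=0))  # close to mu and exp(0.5 * log_var)
```

Because the randomness enters only through `eps`, gradients with respect to `mu` and `log_var` are well defined.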

## Autoregressive models

![](img/autoregressive.png)

## Autoregressive models

* Extremely powerful autoregressive models were introduced prior to this work

![](img/autoregressive_models.png)

## Posterior Collapse

* Optimizing the ELBO does not guarantee mutual information between input $x$ and latent $z$
* A strong autoregressive decoder may ignore latents $z$

$$q(z|x)\approx q(z)=\mathcal N(a,b)$$

* This was a significant issue in previous attempts to combine VAEs with autoregressive models
* Many different solutions have been proposed
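
To see where the collapse comes from, it may help to write out the standard ELBO, which the slides refer to but do not state:

$$\log p(x) \ge \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] - D_{\mathrm{KL}}\left(q(z|x)\,\Vert\,p(z)\right)$$

If the decoder is powerful enough that $p(x|z)\approx p(x)$, the first term no longer depends on $z$, and the bound is maximized by driving the KL term to zero. The posterior then matches the prior for every $x$ and carries no information about the input.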

## Vector-quantized autoencoder

* Discrete uniform prior instead of Gaussian
* Latent embedding space consists of a codebook: $e\in \mathbb{R}^{K\times D}$
* Posterior $q(z|x)$ is a categorical (one-hot) distribution over the $K$ codes
  
![](img/vqvae.png)

## Discrete latent variables

* Output of the encoder $z_e(x)$ is quantized to the nearest code
* Nearest neighbor look-up in the embedding space $e$

$$k=\text{arg}\min_{j}{\lVert z_e(x) - e_j \rVert_2}$$

$$q(z|x)=\begin{cases}1 & \text{for } z=k \\ 0 & \text{otherwise} \end{cases}$$

  
![](img/encoder.png)
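For a grid of encoder outputs, the look-up can be done for all positions at once. The sketch below expands the squared distance so that a single matrix product covers every pair; `vq_lookup`, the shapes, and the random values are assumptions for illustration.

```python
import numpy as np

def vq_lookup(z_e, e):
    """Quantize encoder outputs z_e of shape (..., D) against a codebook e of shape (K, D).

    Returns the integer codes k (shape ...) and the embeddings z_q = e[k] (shape (..., D)).
    """
    flat = z_e.reshape(-1, e.shape[1])
    # ||z - e_j||^2 = ||z||^2 - 2 z.e_j + ||e_j||^2, computed for all pairs at once.
    d2 = (flat ** 2).sum(1, keepdims=True) - 2.0 * flat @ e.T + (e ** 2).sum(1)
    k = d2.argmin(axis=1).reshape(z_e.shape[:-1])
    return k, e[k]

rng = np.random.default_rng(0)
K, D = 8, 4                        # hypothetical codebook size and dimension
e = rng.normal(size=(K, D))
z_e = rng.normal(size=(3, 3, D))   # e.g. a 3x3 grid of encoder outputs
k, z_q = vq_lookup(z_e, e)
print(k.shape, z_q.shape)  # (3, 3) (3, 3, 4)
```

The integer array `k` is the discrete latent; `z_q` is what the decoder receives.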

## Decoder

* Input to decoder is the embedding vector $z_q(x)=e_k$

* Less sensitive to posterior collapse, so a powerful autoregressive decoder can be used.

![](img/decoder.png)

## Learning

* The quantization step is not differentiable
  * Just copy the gradients from the decoder input to encoder output instead
  * Since the output of the encoder and the input to the decoder share the same $D$ dimensional space, there is still useful information in the gradient.
  * We will need additional terms in the loss function to learn the codebook
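
The gradient copy can be illustrated with a toy linear decoder and hand-written gradients. Everything here (the linear map `W`, the squared-error loss, the random values) is an assumption chosen to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
e = rng.normal(size=(5, D))   # codebook with K=5 codes, hypothetical values
W = rng.normal(size=(2, D))   # toy linear "decoder": x_hat = W @ z_q
x = rng.normal(size=2)        # reconstruction target
z_e = rng.normal(size=D)      # encoder output for this input

# Forward pass: quantize, then decode.
k = int(np.argmin(((e - z_e) ** 2).sum(axis=1)))
z_q = e[k]
x_hat = W @ z_q

# Backward pass for the reconstruction loss ||x_hat - x||^2.
grad_z_q = 2.0 * W.T @ (x_hat - x)   # gradient at the decoder input
# Straight-through: the argmin has no useful gradient, so the decoder-input
# gradient is passed to the encoder output unchanged.
grad_z_e = grad_z_q.copy()
print(grad_z_e)
```

In automatic-differentiation frameworks the same effect is commonly obtained by writing $z_q = z_e + \text{sg}[e_k - z_e]$, which equals $e_k$ in the forward pass and has identity gradient with respect to $z_e$.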

## Loss function

$$L = \underbrace{-\log p(x|z_q(x))}_{\text{reconstruction loss}}+\underbrace{\lVert \text{sg}[z_e(x)] - e \rVert_2^2}_{\text{codebook loss}}+\underbrace{\beta \lVert z_e(x) - \text{sg}[e] \rVert_2^2}_{\text{commitment loss}}$$

<center> (the stop-gradient operator $\text{sg}[\cdot]$ is the identity in the forward pass and has zero gradient, so its operand is treated as a non-updated constant) </center>

* Codebook loss ensures selected codes are close to the output of the encoder
* Commitment loss with hyperparameter $\beta$
  * Encourages the output of the encoder to stay close to the chosen codebook vector
  * Helps avoid fluctuating between code vectors when the encoder trains faster than the embeddings
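
A minimal NumPy sketch of the three terms follows. It assumes a Gaussian decoder with fixed variance, so the reconstruction term reduces to squared error up to constants; that choice, the function name, and the toy inputs are assumptions. The default `beta=0.25` follows the value reported in \[1\].

```python
import numpy as np

def vq_vae_loss_terms(x, x_hat, z_e, e_k, beta=0.25):
    """Loss value plus the gradients implied by the stop-gradient placement."""
    recon = np.sum((x - x_hat) ** 2)             # squared-error stand-in for -log p(x|z_q)
    codebook = np.sum((z_e - e_k) ** 2)          # ||sg[z_e] - e||^2
    commit = beta * np.sum((z_e - e_k) ** 2)     # beta * ||z_e - sg[e]||^2
    # sg[.] does not change the values, only where gradients flow:
    grad_e_k = 2.0 * (e_k - z_e)                 # codebook term updates the code only
    grad_z_e_commit = 2.0 * beta * (z_e - e_k)   # commitment term updates the encoder only
    return recon + codebook + commit, grad_e_k, grad_z_e_commit

x = np.array([1.0, 2.0])
x_hat = np.array([1.0, 1.0])
z_e = np.array([0.5, 0.5])
e_k = np.array([0.0, 0.5])
loss, g_e, g_z = vq_vae_loss_terms(x, x_hat, z_e, e_k)
print(loss)  # 1.3125
```

The two gradients point in opposite directions: the code is pulled toward the encoder output, and the encoder output is pulled toward the code with weight $\beta$.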

## Hierarchical VQ-VAE
![](img/audio_hvqvae.png)
![](img/image_hvqvae.png)

## Random restarts

* Although discretization helps with posterior collapse, it is not a complete fix: some codebook vectors may never be selected.

* One technique that has been used is random restarts.

* If a vector in the codebook is not being used, randomly reset it to one of the encoder outputs.
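
The reset rule can be sketched as follows. The usage threshold `min_count`, the per-batch bookkeeping, and the toy values are assumptions of this sketch rather than details given in the slides.

```python
import numpy as np

def random_restart(codebook, z_e, codes, min_count=1, rng=None):
    """Reset codebook vectors used fewer than min_count times in this batch.

    codebook: (K, D); z_e: (N, D) encoder outputs; codes: (N,) assigned indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    usage = np.bincount(codes, minlength=len(codebook))
    dead = np.flatnonzero(usage < min_count)
    new_codebook = codebook.copy()
    if len(dead) > 0:
        # Replace each unused code with a randomly chosen encoder output.
        new_codebook[dead] = z_e[rng.integers(0, len(z_e), size=len(dead))]
    return new_codebook, dead

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [50.0, 50.0]])  # last code is far from the data
z_e = np.array([[0.1, 0.0], [0.0, 0.2], [0.9, 1.0], [1.1, 1.0]])
codes = ((z_e[:, None, :] - codebook) ** 2).sum(-1).argmin(1)
codebook, dead = random_restart(codebook, z_e, codes, rng=np.random.default_rng(0))
print(dead)  # [2]
```

After the reset, the previously unused code sits on top of real encoder outputs and can be selected again in later batches.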

## Time-frequency loss

* When the loss function compares the input and output samples directly, there is a tendency to favor low frequencies

* A solution is to compute the loss on a time-frequency representation of the signal

* Even better if the loss is calculated on several time-frequency representations with different time and frequency resolutions
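
One way such a loss might look is sketched below with plain NumPy. The Hann window, the squared error on STFT magnitudes, the three `(n_fft, hop)` pairs, and the 16 kHz test signal are all assumptions for illustration, not settings taken from the slides.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT of a 1-D signal using a Hann window (only frames that fit entirely)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def multires_spectral_loss(x, x_hat, resolutions=((256, 64), (512, 128), (1024, 256))):
    """Mean squared difference of STFT magnitudes, averaged over several (n_fft, hop) pairs."""
    losses = [np.mean((stft_mag(x, n, h) - stft_mag(x_hat, n, h)) ** 2) for n, h in resolutions]
    return float(np.mean(losses))

t = np.arange(4096) / 16000.0                  # hypothetical 16 kHz sample rate
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.sin(2 * np.pi * 6000 * t)
x_lowpassed = np.sin(2 * np.pi * 440 * t)      # reconstruction that drops the high tone
print(multires_spectral_loss(x, x), multires_spectral_loss(x, x_lowpassed))
```

Short windows give fine time resolution and long windows give fine frequency resolution, so averaging over several settings penalizes errors that any single resolution would blur.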