# Discrete Representation Learning

[Paper](https://proceedings.neurips.cc/paper/2017/hash/7a98af17e63a0ac09ce2e96d03992fbc-Abstract.html)

[Slides](https://danjacobellis.github.io/ITML/discrete_representation_learning.slides.html)


<script>
    document.querySelector('head').innerHTML += '<style>.slides { zoom: 1.75 !important; }</style>';
</script>

<center> <h1>
Discrete Representation Learning
</h1> </center>

&nbsp;

<center> <h3>
Dan Jacobellis
</h3> </center>

&nbsp;

\[1\] [Neural discrete representation learning.](https://proceedings.neurips.cc/paper/2017/hash/7a98af17e63a0ac09ce2e96d03992fbc-Abstract.html) Van Den Oord et al. NIPS 2017.

\[2\] [Jukebox: A Generative Model for Music.](https://openai.com/blog/jukebox/) Dhariwal et al. OpenAI 2020.

\[3\] [Zero-Shot Text-to-Image Generation.](http://proceedings.mlr.press/v139/ramesh21a.html) Ramesh et al. PMLR 2021.

## Vector Quantization

* Suppose we have a randomly sampled vector $x$ from some distribution.
  * Example: $x$ is a digital audio recording, image, or video.
* Goal: create a codebook of $K$ vectors $c_1, \dots, c_K$ so that $x$ is close to at least one of the vectors in the codebook.

$$k^* = \arg\min_{k}{\lVert x-c_k\rVert ^2}$$

$$x \approx c_{k^*}$$

![](img/vq.png)
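
The nearest-codeword rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `quantize` and the toy codebook values are assumptions made for the example.

```python
import numpy as np

def quantize(x, codebook):
    """Return (k_star, c_k_star): index and value of the codeword nearest to x."""
    # Squared Euclidean distance from x to every codeword c_k, shape (K,).
    sq_dists = np.sum((codebook - x) ** 2, axis=1)
    k_star = int(np.argmin(sq_dists))
    return k_star, codebook[k_star]

# Toy codebook of K = 3 codewords in 2 dimensions (values chosen for illustration).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
x = np.array([0.9, 1.2])

k_star, x_hat = quantize(x, codebook)
print(k_star, x_hat)  # index of the nearest codeword and the approximation of x
```

Only the index $k^*$ needs to be stored or transmitted; the receiver recovers $c_{k^*}$ from its own copy of the codebook.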

## Vector Quantization

* Simple and effective form of lossy compression
* Still used today for high-quality audio compression
* How do we find the codebook? Minimize the total squared error over $N$ training samples $x_h$:

$$\{c_k^*\} := \arg\min_{\{c_k\}} \sum_{h=1}^N \min_k
\|x_h-c_k\|^2.$$
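
This objective is the k-means objective, and one common heuristic for it is Lloyd's algorithm, which alternates nearest-codeword assignment with a mean update. The slides do not say which method is used, so the sketch below is an assumption. The names `fit_codebook` and `distortion` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_codebook(data, num_codes, num_iters=20, seed=0):
    """Lloyd's algorithm (k-means): a local search that never increases the distortion."""
    rng = np.random.default_rng(seed)
    # Initialise the codebook with distinct, randomly chosen data points.
    codebook = data[rng.choice(len(data), size=num_codes, replace=False)].copy()
    for _ in range(num_iters):
        # Assignment step: nearest codeword for every x_h.
        sq_dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assign = sq_dists.argmin(axis=1)
        # Update step: move each codeword to the mean of its assigned points.
        for k in range(num_codes):
            members = data[assign == k]
            if len(members) > 0:  # leave unused codewords where they are
                codebook[k] = members.mean(axis=0)
    return codebook

def distortion(data, codebook):
    """Sum over h of min_k ||x_h - c_k||^2."""
    sq_dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

# Synthetic data: two well-separated 2-D clusters (illustration only).
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 0.1, (50, 2)),
                       rng.normal(5.0, 0.1, (50, 2))])

codebook = fit_codebook(data, num_codes=2)
print(codebook, distortion(data, codebook))
```

Lloyd's algorithm only finds a local minimum, so the result depends on the initialisation. In practice it is often restarted from several seeds.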

## Variational Autoencoders

* Encoder parameterises a posterior $q(z|x)$
* Decoder's distribution over input data is $p(x|z)$
* Prior distribution $p(z)$ (and typically the posterior $q(z|x)$) is Gaussian with diagonal covariance
  * Allows us to take advantage of the reparameterization trick
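
The reparameterization trick can be illustrated with a short NumPy sketch. It assumes the encoder outputs a mean `mu` and a log-variance `log_var` for a diagonal Gaussian posterior. Those names, the function `sample_latent`, and the numeric values are hypothetical.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as a deterministic function of (mu, log_var) plus noise."""
    eps = rng.standard_normal(mu.shape)  # noise that does not depend on the encoder
    sigma = np.exp(0.5 * log_var)        # standard deviation from log-variance
    return mu + sigma * eps              # differentiable with respect to mu and sigma

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])               # hypothetical encoder outputs
log_var = np.log(np.array([0.25, 4.0]))  # variances 0.25 and 4.0

samples = np.stack([sample_latent(mu, log_var, rng) for _ in range(20000)])
print(samples.mean(axis=0), samples.std(axis=0))
```

Because the randomness is isolated in `eps`, gradients of a loss on `z` can flow back to `mu` and `log_var` in an autodiff framework. NumPy is used here only to show the sampling step.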

## Vector-quantized autoencoder

* Discrete prior instead of Gaussian
* Latent embedding space consists of a codebook $e\in \mathbb{R}^{K\times D}$: $K$ embedding vectors $e_j$ of dimension $D$
* Posterior $q(z|x)$ is a one-hot categorical distribution over the $K$ codes
  
![](img/vqvae.png)

## Discrete latent variables

* Output of encoder is $z_e(x)$
* Nearest neighbor look-up in the embedding space $e$

$$k=\text{arg}\min_{j}{\lVert z_e(x) - e_j \rVert_2}$$

$$q(z|x)=\begin{cases}1 & \text{for } z=k \\ 0 & \text{otherwise} \end{cases}$$

$$z_q(x)=e_k$$
  
![](img/encoder.png)
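
The three equations above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function name `vq_lookup`, the toy codebook, and the stand-in encoder outputs are hypothetical, and no encoder network is modelled.

```python
import numpy as np

def vq_lookup(z_e, e):
    """Map encoder outputs z_e (N, D) to codes k (N,), one-hot q (N, K), and z_q (N, D)."""
    # Pairwise Euclidean distances ||z_e(x) - e_j||_2, shape (N, K).
    dists = np.linalg.norm(z_e[:, None, :] - e[None, :, :], axis=2)
    k = dists.argmin(axis=1)   # nearest-neighbor index for each encoder output
    q = np.eye(e.shape[0])[k]  # q(z|x): 1 at z = k, 0 otherwise
    z_q = e[k]                 # z_q(x) = e_k
    return k, q, z_q

# Toy codebook with K = 4 embeddings of dimension D = 3 (illustration only).
e = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
# Stand-in encoder outputs for two inputs.
z_e = np.array([[0.9, 0.1, 0.0],
                [0.1, 0.2, 0.8]])

k, q, z_q = vq_lookup(z_e, e)
print(k, q, z_q, sep="\n")
```

For image or audio models the encoder typically produces a grid or sequence of such vectors, and the same lookup is applied to each position independently.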

## Decoder

* Input to decoder is the embedding vector $e_k$

![](decoder.png)