# Explanation

OpenAI now applies the scaling laws to image generation models. In the DALL E series, they combine the high quality image generation demonstrated by state of the art generative models (VQ-VAE & diffusion) with the scaling power of transformers (ViT), and then train them on internet scale image datasets.

### DALL E

DALL E is their first attempt at applying the scaling laws to image generation. They train a 12B parameter transformer on a 250 million image-text pair dataset scraped from the internet - conveniently, because there are lots of labeled images on the internet (via alt text, captions, etc.) they can use a filtered down version of these (they go over their data collection approach in the paper) to train the model.

The key intuition of the architecture is to feed both text and image tokens to the transformer, allowing it to learn to attend to the relationships between different words and parts of the image. This conditions the model to understand how captions can describe images, allowing it to generate an image given a caption, similar to how a language model does next-token prediction.

Specifically, the transformer serves as the decoder for the model, and takes the encoded text caption as input along with image data converted into the image patch format used in the ViT architecture.

Instead of deriving these image patches directly from the image, DALL E uses a VQ-VAE to compress the image into a lower dimensional representation space, and then creates image patches from the representations output by the VQ-VAE encoder.

This works well since the VQ-VAE compresses the image into representations where each value is discrete (specifically, the DALL E VQ-VAE encoder encodes images into $32 \times 32$ tokens each with $8192$ possible values).

Given these two disjoint pieces, training happens in two steps:
1. First, the VQ-VAE is trained like a standard VQ-VAE to reconstruct the images, forcing the representation space to become robust.
2. Then, image-text pairs are fed to the transformer where text is encoded normally, and the images are encoded via the VQ-VAE encoder first before passing to the transformer.

This two part training is combined to maximize the likelihood that the model uses these encoded image-text pairs to generate the original image.

### DALL E 2

DALL E 2 continues with the image generation scaling laws and uses a similar intuition, but updates two parts of the model with the new state-of-the-art. We can think of DALL E as having a few key features:

1. **Image Compression** - The VQ-VAE acts as an encoder to compress the image into a lower dimensional representation with less noise.
2. **Text-Image Attention** - The transformer is used to understand the relationships between different text and image pairs.
3. **Image Generation** - Then, the transformer decoder is also used to regenerate the original images (and by extension learns to generate variations).

DALL E 2 uses these same components, but switches them out for improved models (which slightly changes the architecture).

It accomplishes both image compression & text-to-image attention in one step using OpenAI's recently trained CLIP model (which was released around the same time as DALL E).

Because CLIP maps both images and captions into the same embedding space (this is the whole point of CLIP), it removes the need for DALL E 2 to learn relationships between text and images itself - the CLIP embedding space already contains sufficient information about these relationships.

So instead of focusing on learning these relationships, DALL E 2 just trains a decoder on the CLIP embedding space. Instead of using a transformer for the decoder, they use a diffusion model, which has been shown to beat GANs in image synthesis quality before this paper, and is especially good at creating variations from the original dataset.

In practice, DALL E 2 trains 2 separate networks that work together to make the full architecture:

1. A diffusion encoder that maps the text caption for an image into possible embeddings in the CLIP embedding space.
2. A diffusion decoder that maps the CLIP embedding back up from the embedding space to images.

The diffusion encoder (the paper calls it the prior) may first seem confusing - the CLIP model itself already maps text captions into embeddings. But CLIP's text encoder produces a text embedding, while the decoder is conditioned on CLIP image embeddings. The diffusion encoder bridges this gap by generating a plausible CLIP image embedding for the caption, which gives the decoder the kind of input it was trained to turn back into an image.

Like DALL E, DALL E 2 then uses this setup to maximize the likelihood of the decoder regenerating an image given the text-image pair, and scales this training up to a large dataset.

# My Notes

## 📜 [Zero-Shot Text-to-Image Generation](https://arxiv.org/pdf/2102.12092)

> Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset.

> We describe a simple approach for this task based on a transformer that auto-regressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.

Instead of focusing on inductive bias to improve image modeling, they focus on data and scale - and as usual, it works!

> Recent advances fueled by large-scale generative models suggest a possible route for further improvements [in text-to-image modeling]. Specifically, when compute, model size, and data are scaled carefully, auto-regressive transformers have achieved impressive results in several domains such as text, images, and audio.

> Could dataset size and model size be the limiting factor of current approaches?

> In this work, we demonstrate that training a 12-billion parameter autoregressive transformer on 250 million image-text pairs collected from the internet results in a flexible, high fidelity generative model of images controllable through natural language.

> The resulting system achieves high quality image generation on the popular MS-COCO dataset zero-shot, without using any of the training labels.

They apply the same scaling hypothesis here to text-to-image models, and once again, get SoTA results with this hypothesis, creating a model that can perform well on previous datasets zero-shot without even training on them.

### Method

> Our goal is to train a transformer to auto-regressively model the text and image tokens as a single stream of data.

> However, using pixels directly as image tokens would require an inordinate amount of memory for high-resolution images.

> We address these issues by using a two-stage training procedure:

> **Stage 1:** We train a discrete variational auto-encoder (dVAE) to compress each 256×256 RGB image into a 32 × 32 grid of image tokens, each element of which can assume 8192 possible values. This reduces the context size of the transformer.
>
> **Stage 2:** We concatenate up to 256 BPE-encoded text tokens with the $32 \times 32$ image tokens, and train an autoregressive transformer to model the joint distribution over the text and image tokens.

They’re getting creative here: they use the strategies from VQ-VAE to compress the image further so the context is smaller, and then use the result to create image patch tokens, like with ViT, to send to the transformer - where the word and image tokens can all attend to each other!
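To make the savings concrete, here is a small arithmetic sketch using only the sizes quoted above. The variable names are my own, and counting each RGB channel value as one "pixel token" is an assumption about how the uncompressed alternative would be tokenized.

```python
# Sizes quoted in the paper.
image_side = 256        # input images are 256x256 RGB
channels = 3
grid_side = 32          # dVAE output is a 32x32 grid of tokens
max_text_tokens = 256   # BPE-encoded caption length limit

pixel_values = image_side * image_side * channels   # raw values per image
image_tokens = grid_side * grid_side                # tokens per image after the dVAE
compression = pixel_values // image_tokens          # reduction in context size
sequence_length = max_text_tokens + image_tokens    # full transformer input

print(pixel_values, image_tokens, compression, sequence_length)
# prints: 196608 1024 192 1280
```

So under this assumption the transformer sees at most 1280 positions per example instead of nearly 200 thousand.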

> We can model the overall procedure as maximizing the evidence lower bound (ELB) on the joint likelihood of the model distribution over images $x$, captions $y$ and the tokens $z$ for the encoded RGB image.

$$
\ln p_{\theta,\psi}(x,y) \geqslant \mathbb{E}_{z \sim q_{\phi}(z | x)} \big( \ln p_\theta(x|y,z) - \beta D_{KL} (q_\phi(y, z|x), p_{\psi}(y, z)) \big)
$$

Here, $p_{\theta,\psi}(x,y)$ is the joint likelihood of an image $x$ and its caption $y$ under the model. This is the quantity we want to maximize, and the ELB on the right-hand side is a lower bound on it that we can actually optimize.

The bound gets larger as the KL divergence term gets smaller. This term compares two distributions: $q_\phi(y, z|x)$, the probability of the caption $y$ and the tokens $z$ from the auto-encoder given the original image $x$ (this is the probability we have in training runs), and $p_\psi(y,z)$, the joint probability of a specific caption and image tokens appearing together under the model's prior, which is what the transformer learns.

In other words, we want the probability of given image tokens appearing with a caption given a specific image to be the same probability as just the tokens and caption appearing together (since the tokens should be a lossless representation of the image).

The second term with the KL divergence allows us to minimize the difference between these distributions, zeroing out the term, which will contribute to maximizing the ELB.

Similarly the $\ln p_\theta (x|y,z)$ term allows the model to maximize the probability of generating the correct image $x$ given the caption $y$ and compressed image representation $z$.

Critically, the expectation samples $z \sim q_\phi(z | x)$, the distribution over the most probable $z$ values given $x$ - so this entire ELB allows the VAE to improve the sampling of $z$ via this distribution, such that the KL divergence is minimized.
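As a rough illustration of what the KL term measures, the sketch below computes it for a single token position against a uniform prior over the 8192 codebook entries. I believe the paper starts stage one with a uniform prior, but treat that as an assumption here; the function name `kl_to_uniform` is hypothetical, and the real objective sums this over all grid positions and includes the caption.

```python
import numpy as np

K = 8192  # codebook size from the paper

def kl_to_uniform(q):
    """KL(q || uniform) for one categorical distribution q over len(q) tokens."""
    q = np.asarray(q, dtype=np.float64)
    nonzero = q > 0                      # 0 * log(0) is treated as 0
    return float(np.sum(q[nonzero] * np.log(q[nonzero] * len(q))))

uniform = np.full(K, 1.0 / K)            # encoder spreads mass over every token
one_hot = np.zeros(K)
one_hot[0] = 1.0                         # encoder is certain about one token

print(kl_to_uniform(uniform))   # 0.0, the encoder matches the prior exactly
print(kl_to_uniform(one_hot))   # log(8192), roughly 9.01
```

The term is zero when the encoder's distribution matches the prior and grows as the encoder becomes more confident, which is why its weight $\beta$ trades off against reconstruction quality.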

**1. Stage One: Learning the Visual Codebook**

> In the first stage of training, we maximize the ELB with respect to $\phi$ and $\theta$, which corresponds to training a dVAE on the images alone.

They first focus just on the distribution $q_\phi$ over the $32 \times 32$ image tokens generated by the dVAE encoder given the image $x$, and the distribution $p_\theta$ over the RGB images generated by the dVAE decoder given the image tokens.

In practice, this means focusing on optimizing the encoder & decoder stages to compress down and then re-generate the original images.

> The ELB now becomes difficult to optimize: as $q_\psi$ is a discrete distribution, and we cannot use the re-parameterization gradient to maximize it.

Because DALL E represents images with discrete rather than continuous data (it uses a grid of values which can assume exactly 8192 values), sampling from a continuous distribution between the encoder and the decoder as customarily done no longer works.

This is because using $\sigma$ and $\mu$ in this space would result in sampling jumps to different tokens, since variance in this subspace just implies skipping to different tokens (since values are discrete).

Given that this space is discrete, it’s also not differentiable, as fractional gradients have no meaning here.

> We instead use the gumbel-softmax relaxation, replacing the expectation over $q_\phi$ with one over $q_\phi^\tau$, where the relaxation becomes tight as the temperature $\tau \rightarrow 0$.

Instead of outputting a $\sigma$ and $\mu$ from the encoder to sample with, the model instead outputs a set of logits of the scores for each of the 8192 possible tokens at each position.

Then, Gumbel noise is added to these scores to simulate the randomness effect, and the softmax of these scores is taken with a temperature value $\tau$ to control the softening of this function.

This process, called the Gumbel-softmax relaxation, creates a continuous and differentiable function simulating sampling for our discrete tokens, which can be used in the VAE.
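A minimal NumPy sketch of this relaxation, assuming random logits in place of real encoder outputs. It shows the forward computation only (no gradients), and the function name `gumbel_softmax` and the specific temperatures are my own choices rather than anything from the paper.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample from categorical logits along the last axis."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))               # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau                # temperature controls sharpness
    y = y - y.max(axis=-1, keepdims=True)      # stabilize the exponentials
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8192))   # 4 grid positions, 8192 scores each (random stand-ins)

soft = gumbel_softmax(logits, tau=1.0, rng=rng)     # smooth weights over tokens
sharp = gumbel_softmax(logits, tau=0.01, rng=rng)   # low temperature, much peakier
print(soft.max(axis=-1), sharp.max(axis=-1))
```

Each row is a valid probability vector over the 8192 tokens. Lowering `tau` pushes the rows toward one-hot vectors, which is the sense in which the relaxation becomes tight as $\tau \rightarrow 0$.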

> The likelihood for $p_\theta$ is evaluated using the log-laplace distribution.

> We also found that increasing the KL weight to β = 6.6 promotes better codebook usage and ultimately leads to a _smaller_ reconstruction error at the end of training.

They increase the weight of the KL divergence term in the ELB, which they found encourages the auto-encoder to use more of the codebook, so each mapping of image → image tokens stays relatively unique and keeps the important information in the compression.

**2. Stage Two: Learning the Prior**

> In the second stage, we fix $\phi$ and $\theta$, and learn the prior distribution over the text and image tokens by maximizing the ELB with respect to $\psi$.

Now we fix the distributions of the image to image token compression, and the image tokens back to the image, and we focus on the joint distribution of image tokens with text.

> Given a text-image pair, we BPE-encode the lowercased caption using at most 256 tokens with vocabulary size $16,384$, and encode the image using $32 \times 32 = 1024$ tokens with vocabulary size $8192$.

> The image tokens are obtained using argmax sampling from the dVAE encoder logits, without adding any gumbel noise.

The gumbel noise was used during training, but is not actually needed during usage of the dVAE - the logits can just be used directly by picking the most likely token for each part of the image.
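In code, this tokenization step is just an argmax over the codebook axis. The sketch below uses random numbers as a stand-in for real dVAE encoder logits, and the row-by-row flattening order is an assumption on my part.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for dVAE encoder output: one score per codebook entry at each grid cell.
logits = rng.standard_normal((32, 32, 8192), dtype=np.float32)

# No Gumbel noise at this stage: take the highest-scoring token per cell.
token_grid = logits.argmax(axis=-1)          # shape (32, 32), values in [0, 8192)
image_tokens = token_grid.reshape(-1)        # flatten row by row into 1024 tokens
print(image_tokens.shape)
```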

> The transformer is a decoder-only model in which each image token can attend to all text tokens in any one of its 64 self-attention layers.

> Instead, we opt to learn a special padding token separately for each of the 256 text positions.

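They use padding tokens (which should carry no information) to fill out the remaining spots when a caption is shorter than the max length of 256, since every input sequence needs the same fixed length of 256 text positions plus 1024 image tokens. Per the quote, a separate padding token is learned for each of the 256 text positions.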
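One way to picture the resulting input stream is the sketch below. The vocabulary sizes and lengths come from the paper, but the id layout (where padding and image ids sit in a shared id space) and the function name `build_sequence` are hypothetical, chosen only to make the example concrete.

```python
import numpy as np

TEXT_VOCAB = 16384     # BPE vocabulary size from the paper
IMAGE_VOCAB = 8192     # dVAE codebook size from the paper
MAX_TEXT = 256         # text positions
N_IMAGE = 32 * 32      # image positions

def build_sequence(text_ids, image_ids):
    """Concatenate caption and image tokens into one fixed-length stream.

    Hypothetical id layout (not from the paper):
      [0, TEXT_VOCAB)                        caption BPE tokens
      [TEXT_VOCAB, TEXT_VOCAB + MAX_TEXT)    one padding id per text position
      [TEXT_VOCAB + MAX_TEXT, ...)           image tokens, shifted past the text ids
    """
    text_ids = list(text_ids)[:MAX_TEXT]
    # Start from "all padding", where position i uses its own padding id.
    text = TEXT_VOCAB + np.arange(MAX_TEXT)
    text[:len(text_ids)] = text_ids
    image = np.asarray(image_ids) + TEXT_VOCAB + MAX_TEXT
    return np.concatenate([text, image])

# A 3-token caption followed by a dummy image made of token 0 everywhere.
seq = build_sequence([17, 942, 3], np.zeros(N_IMAGE, dtype=np.int64))
print(seq.shape)
```

Every example ends up with 1280 positions regardless of caption length, and unused text position `i` always carries its own padding id.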

**3. Data Collection**

> To scale up to 12-billion parameters, we created a dataset of a similar scale to JFT-300M by collecting 250 million text-images pairs from the internet.

**4. Mixed-Precision Training**

> To save GPU memory and increase throughput, most parameters, Adam moments, and activations are stored in 16-bit precision.

First mention I’ve seen of low-level compute details including floating point precisions used.

> Getting the model to train in 16-bit precision past one billion parameters, without diverging, was the most challenging part of this project. We believe the root cause of this instability to be underflow in the 16-bit gradients.

Here we hit an actual engineering challenge discussed in the paper.
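The underflow problem is easy to reproduce in isolation. The sketch below uses a single fixed scale factor, which is a simplification: as I understand it, the paper scales gradients separately per resblock. The gradient value and scale are illustrative numbers I picked.

```python
import numpy as np

grad = 1e-8                      # a small gradient value (illustrative)
scale = 2.0 ** 16                # hypothetical fixed scale factor

naive = np.float16(grad)         # below float16's smallest positive value, so it becomes 0
scaled = np.float16(grad * scale)                    # the scaled value fits in float16
recovered = np.float32(scaled) / np.float32(scale)   # unscale in 32-bit precision

print(naive, recovered)
```

Stored directly, the gradient is rounded to zero and the update is lost. Scaling before the 16-bit cast and unscaling afterwards in 32-bit keeps it to within float16 rounding error.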

**5. Distributed Optimization**

> Our 12-billion parameter model consumes about 24 GB of memory when stored in 16-bit precision, which exceeds the memory of a 16 GB NVIDIA V100 GPU. We address this using parameter sharding.

> Parameter sharding allows us to almost completely hide the latency of the intra-machine communication by overlapping it with compute-intensive operations.
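The memory arithmetic behind these quotes, as a quick sketch. The shard count of 8 is my assumption (one shard per GPU on an 8-GPU machine), and this counts parameters only, ignoring optimizer state and activations.

```python
n_params = 12e9          # 12-billion parameters
bytes_per_param = 2      # 16-bit precision
gpu_memory_gb = 16       # NVIDIA V100 from the quote
n_shards = 8             # assumed number of GPUs sharing one copy of the parameters

total_gb = n_params * bytes_per_param / 1e9
per_gpu_gb = total_gb / n_shards

print(total_gb, per_gpu_gb)   # prints: 24.0 3.0
```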

**6. Sample Generation**

> We rerank the samples drawn from the transformer using a pre-trained contrastive model. Given a caption and a candidate image, the contrastive model assigns a score based on how well the image matches the caption.
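A minimal sketch of the reranking step, assuming the contrastive score is a cosine similarity between a caption embedding and each candidate image embedding. The embeddings here are random stand-ins, and the function name `rerank`, the embedding size, and the candidate count are all hypothetical.

```python
import numpy as np

def rerank(caption_emb, image_embs, top_k):
    """Return indices of the top_k images by cosine similarity to the caption."""
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = image_embs @ caption_emb            # one score per candidate
    order = np.argsort(-scores)                  # best match first
    return order[:top_k], scores

rng = np.random.default_rng(0)
caption_emb = rng.normal(size=64)                # stand-in caption embedding
image_embs = rng.normal(size=(16, 64))           # stand-ins for 16 sampled images
image_embs[5] = 3.0 * caption_emb                # plant one perfectly aligned candidate

best, scores = rerank(caption_emb, image_embs, top_k=4)
print(best)
```

The planted candidate at index 5 comes out first, since it points in exactly the same direction as the caption embedding.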

> Training the transformer on the tokens from the dVAE encoder allows us to allocate its modeling capacity to the low-frequency information that makes images visually recognizable to us.

> However, it also disadvantages the model, since the heavy compression renders it unable to produce high-frequency details.

### Experiments

**1. Quantitative Results**

> Given a caption, the sample from our model receives the majority vote for better matching the caption 93% of the time. It also receives the majority vote for being more realistic 90% of the time.

**2. Qualitative Results**

> We found that our model has the ability to generalize in ways that we did not originally anticipate. […] It has developed a rudimentary ability to compose unusual concepts at high levels of abstraction.

> Our model also appears to be capable of combinatorial generalization, such as when rendering text or when probed on sentences like “an illustration of a baby hedgehog in a Christmas sweater walking a dog.”

> To a limited degree of reliability, we also find our model to be capable of zero-shot image-to-image translation controllable by natural language.

Here’s the beginning of editing images with text - the model can update existing images/complete them with captions.

> This works with several other kinds of transformations.

### Conclusion

> We investigate a simple approach for text-to-image generation based on an autoregressive transformer, when it is executed at scale.

> Our findings suggest that improving generalization as a function of scale may be a useful driver for progress on this task.



## 📜 [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/pdf/2204.06125)

> We propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding.

> Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation.

> Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion.

The creation of CLIP has enabled far more robust text-to-image models by adding an image decoder that can convert from CLIP embeddings to an image.

> We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

Using diffusion models for both parts of the model appears to be the best approach.

> In this work, we combine these two approaches [CLIP and diffusion models] for the problem of text-conditional image generation. We first train a diffusion _decoder_ to invert the CLIP image _encoder_.

> One notable advantage of using the CLIP latent space (over GANs) is the ability to semantically modify images by moving in the direction of any encoded text vector, whereas discovering these directions in GAN latent space involves luck and diligent manual examination.

Because the CLIP latent space embeds text and images consistently, images can be manipulated by moving their embeddings in the direction of a text embedding, whereas finding such directions in a GAN latent space takes luck and manual search.

> To obtain a full generative model of images, we combine the CLIP image embedding decoder with a prior model, which generates possible CLIP image embeddings from a given text caption.

The prior model generates plausible CLIP image embeddings from the original text caption, so the decoder receives the kind of embedding it was trained to turn into images.

### Method

We can model the combined action of the _prior_ and _decoder_ as follows:

$$
P(x|y) = P(x, z_i|y) = P(x|z_i, y)P(z_i|y)
$$

Here, we model the distribution of the probability of an image $x$ given the caption $y$ by splitting it into the prior, which models the probability of a given image embedding $z_i$ given the caption $y$, and then the probability of an image $x$ given the image embedding $z_i$, and optionally, the caption $y$ as well.
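The first equality relies on $z_i$ being a deterministic function of $x$ (the CLIP image encoder), and the second is the chain rule. The toy check below verifies this numerically for one fixed caption, with made-up probabilities and a made-up two-valued "encoder" `f`.

```python
import numpy as np

# Toy setup for a single fixed caption y: 4 possible images, 2 possible embeddings.
p_x_given_y = np.array([0.1, 0.2, 0.3, 0.4])   # P(x | y), made-up numbers
f = np.array([0, 0, 1, 1])                     # z_i = f(x), a deterministic "encoder"

# P(x, z | y) is P(x | y) when z == f(x) and 0 otherwise.
p_xz = np.zeros((4, 2))
p_xz[np.arange(4), f] = p_x_given_y

p_z = p_xz.sum(axis=0)          # prior:   P(z | y)
p_x_given_z = p_xz / p_z        # decoder: P(x | z, y)

# Chain rule: P(x | z, y) * P(z | y), evaluated at z = f(x), recovers P(x | y).
recovered = (p_x_given_z * p_z)[np.arange(4), f]
print(np.allclose(recovered, p_x_given_y))     # True
```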

**1. Decoder**

> We use diffusion models to produce images conditioned on CLIP image embeddings.

> Specifically, we modify the architecture […] by projecting and adding CLIP embeddings to the existing time-step embedding, and by projecting CLIP embeddings into four extra tokens of context that are concatenated to the sequence of outputs from the GLIDE text encoder.

**2. Prior**

> For the diffusion prior, we train a decoder-only Transformer with a causal attention mask on a sequence consisting of, in order: the encoded text, the CLIP text embedding, an embedding for the diffusion time-step, the noised CLIP image embedding, and a final embedding whose output from the Transformer is used to predict the un-noised CLIP image embedding.

### Image Manipulations

> Our approach allows us to encode any given image $x$ into a bipartite latent representation $(z_i, x_T)$ that is sufficient for the decoder to produce an accurate reconstruction.

**1. Variations**

![Screenshot 2024-05-17 at 2.47.58 PM.png](../../images/Screenshot_2024-05-17_at_2.47.58_PM.png)

> Given an image $x$, we can produce related images that share the same essential content but vary in other aspects, such as shape and orientation.

**2. Interpolations**

![Screenshot 2024-05-17 at 2.48.14 PM.png](../../images/Screenshot_2024-05-17_at_2.48.14_PM.png)

> It is also possible to blend two images $x_1$ and $x_2$ for variations, traversing all of the concepts in CLIP’s embedding space that occur between them.
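As I understand the paper, the blending is done by spherical interpolation (slerp) between the two CLIP image embeddings, with each intermediate embedding then passed to the decoder. Here is a sketch of just the interpolation, with random vectors standing in for real CLIP embeddings and an embedding size of 512 picked arbitrarily.

```python
import numpy as np

def slerp(a, b, t):
    """Spherical interpolation between the directions of a and b, with t in [0, 1]."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a @ b, -1.0, 1.0))   # angle between the two embeddings
    if omega < 1e-6:                               # nearly identical: nothing to rotate
        return a
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(0)
z1 = rng.normal(size=512)     # stand-in for the CLIP embedding of image x_1
z2 = rng.normal(size=512)     # stand-in for the CLIP embedding of image x_2

# Five evenly spaced embeddings along the arc from z1 to z2.
path = [slerp(z1, z2, t) for t in np.linspace(0.0, 1.0, 5)]
print([round(float(np.linalg.norm(z)), 6) for z in path])
```

Unlike a straight linear blend, every point on the path stays on the unit sphere, so the intermediate embeddings keep the same norm as normalized CLIP embeddings.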

**3. Text Diffs**

![Screenshot 2024-05-17 at 2.48.28 PM.png](../../images/Screenshot_2024-05-17_at_2.48.28_PM.png)

> A key advantage of using CLIP compared to other models for image representations is that it embeds images and text to the same latent space, thus allowing us to apply language-guided image manipulations (text-diffs).
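My reading of the paper's text-diff procedure is: take the normalized difference between the text embedding of the target description and the text embedding of a caption for the current image, then slerp the image embedding toward that direction by a modest amount. The sketch below follows that reading with random stand-in embeddings. The names `z_i`, `z_t0`, `z_t`, `z_d` and the maximum `theta` of 0.5 are assumptions based on my recollection of the paper's notation.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def slerp(a, b, t):
    """Spherical interpolation between unit vectors a and b."""
    omega = np.arccos(np.clip(a @ b, -1.0, 1.0))
    if omega < 1e-6:
        return a
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(0)
z_i = normalize(rng.normal(size=512))    # stand-in: CLIP embedding of the image
z_t0 = normalize(rng.normal(size=512))   # stand-in: text embedding of a caption for the current image
z_t = normalize(rng.normal(size=512))    # stand-in: text embedding of the target description

z_d = normalize(z_t - z_t0)              # direction of the requested change
edited = [slerp(z_i, z_d, theta) for theta in np.linspace(0.0, 0.5, 4)]

# Similarity to the text diff direction grows as theta increases.
print([float(z @ z_d) for z in edited])
```

Each embedding in `edited` would then be handed to the decoder, giving a sequence of images that drift from the original toward the described change.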