# unit 6.1 - Generating images

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/culurciello/deep-learning-course-source/blob/main/source/lectures/61-generating-images.ipynb)

## From text and images to concepts

Generating images with neural networks usually means converting a text description of the desired image into actual pixels. Here we summarize the most recent techniques (as of 2024).

Since we want to use text to describe the desired output image, we need a way to connect images to their descriptions or captions. This is a form of data correlation, or joint embedding, in which an image and its caption are encoded into the same neural code. The neural code is just a short, compressed version of the text or image.

It helps to think of these neural codes as "concepts" in your mind. Ideas or concepts encode multi-modal data in a compact form that is used for reasoning and intelligent behavior. Our human brain also has concepts: for example, think of the word "cat", which tags the multi-modal concept 'cat' in our brain as a collection of visual appearance, tactile feeling, motions, and all the characteristics we know of 'cat'.

Here we show an example of joint modality encoding into a concept space. The text "a crab on black rocks" and an image of the same scene are encoded into the same 512-number vector.

![](images/generate-images1.png)
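
One common way to check whether a text code and an image code land on the same concept is to compare them with cosine similarity. The sketch below is only an illustration: the 4-number vectors are made-up stand-ins for real 512-number codes, and the names `text_code`, `image_code`, and `other_code` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-number codes standing in for real 512-number vectors.
text_code = [0.9, 0.1, 0.0, 0.4]    # caption: "a crab on black rocks"
image_code = [0.8, 0.2, 0.1, 0.5]   # image of a crab on black rocks
other_code = [0.0, 0.9, 0.7, 0.1]   # an unrelated image

match = cosine_similarity(text_code, image_code)
mismatch = cosine_similarity(text_code, other_code)
print(f"matching pair: {match:.3f}, unrelated pair: {mismatch:.3f}")
```

With well-trained encoders, a caption and its image should score close to 1, while unrelated pairs should score much lower.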


One popular technique is [CLIP](https://openai.com/research/clip). Another very easy way to do this is to add a trainable classifier to the text encoder and train it to match the encoding of images. For example, one can use the encoded image as the "concept" space and add a 2-layer classifier to the text encoder (a CNN, for example). Training makes the classifier output the same concept as the image encoder.
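
The idea of training a head on the text side to match frozen image codes can be sketched in plain Python. This is a deliberately simplified illustration, not the method used by CLIP: it uses a single linear layer instead of a 2-layer classifier, a mean squared error loss, and a tiny made-up dataset. The `pairs` data and all names are assumptions for the example.

```python
import random

random.seed(0)

# Hypothetical data: (text features, frozen image code) pairs.
# In practice both come from trained encoders and are much longer.
pairs = [
    ([1.0, 0.0, 0.0], [0.5, -0.2]),
    ([0.0, 1.0, 0.0], [-0.3, 0.8]),
    ([0.0, 0.0, 1.0], [0.1, 0.4]),
    ([1.0, 1.0, 0.0], [0.2, 0.6]),
]
in_dim, out_dim = 3, 2

# Trainable head: one linear layer mapping text features to the concept space.
weights = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)]
           for _ in range(out_dim)]

def project(text_features):
    """Map text features into the concept space defined by the image codes."""
    return [sum(w * x for w, x in zip(row, text_features)) for row in weights]

def mean_squared_error():
    total = 0.0
    for text_features, image_code in pairs:
        predicted = project(text_features)
        total += sum((p - t) ** 2 for p, t in zip(predicted, image_code))
    return total / len(pairs)

learning_rate = 0.1
loss_before = mean_squared_error()
for _ in range(500):
    for text_features, image_code in pairs:
        predicted = project(text_features)
        for i in range(out_dim):
            error = predicted[i] - image_code[i]
            for j in range(in_dim):
                # Gradient of the squared error with respect to weights[i][j].
                weights[i][j] -= learning_rate * 2.0 * error * text_features[j]
loss_after = mean_squared_error()

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.8f}")
```

Only the head is updated here; the image codes are treated as fixed targets, which mirrors using the encoded image as the concept space.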

After the text and image encoders have been trained, pixels are generated by taking the concept and projecting it into pixel space. The concept is aligned with both the textual content and its relation to images.

The image encoder path is not required at generation time, as it is only used to train the concept space. In actual pixel generation, the text encoder output, projected into the concept space, is the vehicle that drives image generation, as can be seen below.

![](images/generate-images2.png)

## From concepts to pixels?

How do you generate pixels from a concept? It requires an image decoder, trained from an auto-encoder perspective. What does this mean? Recall that we used an image encoder to encode images into the same concept space as their captions. The decoder is trained to do the reverse: given the concept, reconstruct the pixels of the original image.

The easiest way to do this is to train an image decoder as a stack of [upsampling (transposed) convolutions](https://d2l.ai/chapter_computer-vision/transposed-conv.html). These modules upsample images by creating new pixels between existing pixels.
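
To make "creating pixels between pixels" concrete, here is a minimal 1-D sketch of a transposed convolution in plain Python. It is an illustration under assumptions: real decoders work on 2-D feature maps with many channels, and the kernel below is hand-picked, whereas a trained decoder would learn its values. The function name `transposed_conv1d` is hypothetical.

```python
def transposed_conv1d(signal, kernel, stride=2):
    """Upsample a 1-D signal: each input value stamps a scaled copy of the
    kernel into the output, and overlapping stamps are summed."""
    out_len = (len(signal) - 1) * stride + len(kernel)
    output = [0.0] * out_len
    for i, value in enumerate(signal):
        for k, weight in enumerate(kernel):
            output[i * stride + k] += value * weight
    return output

row = [1.0, 3.0, 5.0]        # three input "pixels"
kernel = [0.5, 1.0, 0.5]     # hand-picked; a trained decoder would learn these
upsampled = transposed_conv1d(row, kernel, stride=2)
print(upsampled)  # [0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 2.5]
```

With this particular kernel, the original values 1, 3, 5 reappear at every other position and the new pixels between them (2 and 4) are averages of their neighbors. Stacking several such layers roughly doubles the resolution each time.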

But there are many other techniques to generate pixels, including [adversarial techniques](https://blog.ovhcloud.com/understanding-image-generation-beginner-guide-generative-adversarial-networks-gan/), which can be difficult to train.

Clearly it is a difficult task for one neural network to create large images in one shot. That is why these techniques have proven effective mainly at generating images with low pixel counts. Of course, we would like many more pixels in our images, many more than 512x512, ideally thousands of pixels per side. These methods are just not as effective at providing this level of detail.

What is needed is neural networks that can evolve an image over multiple steps. As of 2024, one of the most popular techniques for creating large images is [diffusion](https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/). These models learn to remove noise from an image in successive steps, commonly using a U-Net neural network architecture.
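
The multi-step idea behind diffusion can be sketched in plain Python. This is a toy under loud assumptions: the "image" has 4 pixels, the noise schedule is an assumed linear one, the update is a deterministic DDIM-style step chosen here for simplicity, and `predict_noise` is a stand-in that cheats by looking at the clean image. In a real model that function would be the trained U-Net, which never sees the answer.

```python
import math
import random

random.seed(0)

NUM_STEPS = 50
# Assumed linear noise schedule; real models tune this choice.
betas = [1e-4 + (0.2 - 1e-4) * t / (NUM_STEPS - 1) for t in range(NUM_STEPS)]
# alpha_bars[t]: fraction of signal variance left after t noising steps.
alpha_bars = [1.0]
for beta in betas:
    alpha_bars.append(alpha_bars[-1] * (1.0 - beta))

clean_image = [0.0, 0.5, 1.0, 0.5]   # hypothetical 4-pixel "image"

def levels(t):
    """Signal and noise scales at step t."""
    return math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])

def add_noise(image, t, noise):
    """Forward process used in training: blend the image with noise."""
    signal, sigma = levels(t)
    return [signal * x + sigma * n for x, n in zip(image, noise)]

def predict_noise(noisy, t):
    """Stand-in for the trained U-Net. It cheats by using clean_image."""
    signal, sigma = levels(t)
    return [(y - signal * x) / sigma for y, x in zip(noisy, clean_image)]

def denoise_step(noisy, t):
    """One deterministic (DDIM-style) step from noise level t to t - 1."""
    noise = predict_noise(noisy, t)
    signal, sigma = levels(t)
    estimate = [(y - sigma * n) / signal for y, n in zip(noisy, noise)]
    prev_signal, prev_sigma = levels(t - 1)
    return [prev_signal * x + prev_sigma * n for x, n in zip(estimate, noise)]

# Training would show the network noisy images like this one.
half_noisy = add_noise(clean_image, NUM_STEPS // 2,
                       [random.gauss(0.0, 1.0) for _ in clean_image])

# Generation starts from pure noise and removes it step by step.
image = [random.gauss(0.0, 1.0) for _ in clean_image]
for t in range(NUM_STEPS, 0, -1):
    image = denoise_step(image, t)

print([round(pixel, 3) for pixel in image])
```

Because the stand-in predictor is perfect, the loop recovers `clean_image` exactly. A learned network only approximates the noise, which is one reason many small steps tend to work better than one large one.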




## References

https://www.assemblyai.com/blog/how-dall-e-2-actually-works/


https://x.com/stanley_h_chan/status/1764827260115190075?s=20

