---
title: Draft - Stable diffusion using 🤗 Hugging Face - Looking under the hood.
author: Aayush Agrawal
date: "1999-11-04"
categories: [Stable Diffusion]
image: "underthehood.png"
format:
    html:
        code-fold: false
        number-sections: true
---

> An introduction to what goes on inside the `StableDiffusionPipeline` of the 🤗 [Hugging Face diffusers library](https://github.com/huggingface/diffusers).

This is the second post in my Stable diffusion series. If you haven't read the first one, you can find it here - <br>
1. **Part 1** - [Introduction to Stable diffusion using 🤗 Hugging Face](https://aayushmnit.com/posts/2022-11-02-StabeDiffusionP1/2022-11-02-StableDiffusionP1.html).


In this post, we will look at the basic components of a Stable diffusion pipeline and their purpose. Later, we will reconstruct the `StableDiffusionPipeline.from_pretrained` function using these components. Let's get started - 

<figure>
<img src="./underthehood.png" style="width:100%">
<figcaption align = "center">
        Fig. 1: This image was generated by 🤗 Stable diffusion model using "a scientist looking under the hood of a car realistic 4k image" prompt.
</figcaption>
</figure>

## Introduction

As seen in the previous post, diffusion models can generate high-quality images. Stable diffusion models are a special kind of diffusion model called the **Latent Diffusion** model, first proposed in the paper [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752). Running the diffusion process directly on full-resolution pixels consumes a lot of memory, so latent diffusion models run it in a lower-dimensional space called the `latent` space. At a high level, diffusion models are machine learning models trained to `denoise` random Gaussian noise step by step until they arrive at the result, i.e. an `image`. In `latent diffusion`, the model is trained to do this same process in the lower-dimensional latent space. <br>
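
To get a feel for how much smaller the latent space can be, here is a rough back-of-the-envelope sketch. The shapes are assumptions taken from the commonly used Stable Diffusion v1 setup (a 512x512 RGB image, a VAE that downsamples by a factor of 8 into 4 latent channels). They are not stated in this post, so treat the numbers as illustrative.

```python
# Assumed Stable Diffusion v1 shapes (illustrative, not taken from this post):
# a 512x512 RGB image, and a VAE that downsamples by 8 into 4 latent channels.
image_shape = (3, 512, 512)
downsample_factor = 8
latent_channels = 4

latent_shape = (
    latent_channels,
    image_shape[1] // downsample_factor,
    image_shape[2] // downsample_factor,
)


def num_values(shape):
    """Count the scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixel_values = num_values(image_shape)
latent_values = num_values(latent_shape)

print(f"Pixel space : {image_shape} -> {pixel_values} values")
print(f"Latent space: {latent_shape} -> {latent_values} values")
print(f"Reduction   : {pixel_values // latent_values}x fewer values")
```

Under these assumptions, the denoising model works on a tensor with far fewer values than the image itself, which is where the memory savings come from.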

There are three main components in latent diffusion - <br>

1. A text encoder, in this case [CLIP's text encoder](https://openai.com/blog/clip/) 
2. An autoencoder, in this case a Variational Autoencoder, also referred to as a VAE 
3. A [U-Net](https://arxiv.org/abs/1505.04597)

Let's dive into each of these components and understand their use in the diffusion process. The way I will be attempting to explain these components is by talking about them in the following three stages - <br>

1. **_The Basics: What goes in the component and what comes out of the component_** - This is a key part of the [top-down learning approach](https://www.fast.ai/posts/2016-10-08-teaching-philosophy.html) of understanding "the whole game"
2. **_Deeper explanation using 🤗 code._** - This part uses code to show in more detail what the model produces
3. **_What's their role in the Stable diffusion pipeline_** - This will build your intuition for how the component fits into the Stable diffusion process


## CLIP Text Encoder

### Basics - What goes in the component and what comes out of the component?

The CLIP (Contrastive Language–Image Pre-training) text encoder takes text as input and generates text embeddings that lie close, in latent space, to the embedding you would get by encoding a matching image with the CLIP image encoder.

<figure>
    <img src="./clip_image.png" style="width:100%">
<figcaption align = "center">
        Fig. 2: CLIP text encoder 
</figcaption>
</figure>

### Deeper explanation using 🤗 code

Machine learning models don't work on raw text directly. For a model to make use of text, we first need to convert it into numbers that capture its meaning, generally referred to as `embeddings`. Converting text into embeddings can be broken down into two parts - <br>
1. **_Tokenizer_** - Breaks each word into sub-words and then uses a lookup table to convert them into numbers <br>
2. **_Token_To_Embedding Encoder_** - Converts those numerical sub-words into vectors that capture the meaning of the text <br>
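
Before turning to the real library, the two steps can be sketched with a toy example. Everything here is hypothetical: the six-entry vocabulary, the whitespace splitting, and the random 4-dimensional embedding table are stand-ins chosen for illustration. The real CLIP tokenizer uses a much larger sub-word vocabulary, and the real text encoder passes the looked-up vectors through transformer layers rather than stopping at a table lookup.

```python
import random

# Hypothetical toy vocabulary; the real CLIP vocabulary is far larger
# and works on sub-words rather than whole words.
vocab = {"<start>": 0, "<end>": 1, "a": 2, "dog": 3, "wearing": 4, "hat": 5}
embedding_dim = 4  # tiny on purpose

# Random stand-in for a learned embedding table: one row per vocabulary entry.
random.seed(0)
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]


def toy_tokenize(text):
    """Step 1: split on whitespace and look up each piece's integer id."""
    word_ids = [vocab[word] for word in text.split()]
    return [vocab["<start>"]] + word_ids + [vocab["<end>"]]


def toy_embed(token_ids):
    """Step 2: replace each id with its row in the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]


ids = toy_tokenize("a dog wearing hat")
vectors = toy_embed(ids)
print(ids)
print(len(vectors), len(vectors[0]))
```

The result is one vector per token, which is the same overall shape pattern (tokens x embedding size) that the real encoder produces.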

Let's look at it through code. We will start by importing the relevant artifacts.

In [2]:
import torch, logging

## disable warnings
logging.disable(logging.WARNING)  

## Import the CLIP artifacts 
from transformers import CLIPTextModel, CLIPTokenizer

## Initiating tokenizer and encoder.
## The tokenizer only maps text to integer ids, so it does not need a torch_dtype.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16).to("cuda")

Let's initialize a prompt and tokenize it.

In [9]:
prompt = ["a dog wearing hat"]
tok = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
print(tok.input_ids.shape)
tok

torch.Size([1, 77])


{'input_ids': tensor([[49406,   320,  1929,  3309,  3801, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0]])}

The `tokenizer` returns a dictionary with two entries - <br>
1. **_`input_ids`_** - A tensor of size 1x77, since one prompt was passed and padded to the maximum length of 77. _`49406`_ is the start token, _`320`_ is the token for the word "a", _`1929`_ for the word "dog", _`3309`_ for the word "wearing", _`3801`_ for the word "hat", and _`49407`_ is the end-of-text token, repeated until the padded length of 77 is reached. <br>
2. **_`attention_mask`_** - `1` marks a real token (the start token, the four words, and the first end-of-text token) and `0` marks padding.
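
The relationship between the two entries can be reproduced by hand. The sketch below is a simplified illustration, not the tokenizer's actual implementation: `pad_with_mask` is a hypothetical helper, and its naive truncation does not re-append an end-of-text token the way a real tokenizer would for over-long prompts.

```python
def pad_with_mask(token_ids, max_length, pad_id):
    """Truncate or pad token_ids to max_length and build the matching mask.

    Hypothetical helper for illustration only; truncation here is naive.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Start token, "a", "dog", "wearing", "hat", end-of-text (ids from the output above).
real_tokens = [49406, 320, 1929, 3309, 3801, 49407]

# CLIP's tokenizer pads with the end-of-text id, as the output above shows.
input_ids, attention_mask = pad_with_mask(real_tokens, max_length=77, pad_id=49407)

print(len(input_ids), sum(attention_mask))
```

The mask sums to 6 because only the six real tokens are marked with `1`, matching the `attention_mask` printed by the tokenizer.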

In [43]:
for token in list(tok.input_ids[0,:7]): print(f"{token}:{tokenizer.convert_ids_to_tokens(int(token))}")

49406:<|startoftext|>
320:a</w>
1929:dog</w>
3309:wearing</w>
3801:hat</w>
49407:<|endoftext|>
49407:<|endoftext|>


Now let's look at the `Token_To_Embedding Encoder`, which takes the `input_ids` generated by the tokenizer and converts them into embeddings - 

In [48]:
emb = text_encoder(tok.input_ids.to("cuda"))[0].half()
print(f"Shape of embedding : {emb.shape}")
emb

Shape of embedding : torch.Size([1, 77, 768])


tensor([[[-0.3887,  0.0229, -0.0522,  ..., -0.4902, -0.3066,  0.0673],
         [ 0.0292, -1.3242,  0.3074,  ..., -0.5264,  0.9766,  0.6655],
         [-1.5928,  0.5063,  1.0791,  ..., -1.5283, -0.8438,  0.1597],
         ...,
         [-1.4688,  0.3113,  1.1670,  ...,  0.3755,  0.5366, -1.5049],
         [-1.4697,  0.3000,  1.1777,  ...,  0.3774,  0.5420, -1.5000],
         [-1.4395,  0.3137,  1.1982,  ...,  0.3535,  0.5400, -1.5488]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<NativeLayerNormBackward0>)

As we can see above, each tokenized input of size 1x77 has now been translated into an embedding of shape 1x77x768. In other words, each of the 77 token positions is represented by a vector in a 768-dimensional space.

### What's their role in the Stable diffusion pipeline 

Stable diffusion uses only the text encoder of a trained CLIP model to convert text into embeddings. These embeddings become one of the inputs to the U-Net. At a high level, CLIP trains an image encoder and a text encoder together so that matching images and texts get embeddings that are close in latent space. This similarity is defined more precisely through a [contrastive objective](https://arxiv.org/abs/1807.03748). For more information on how CLIP is trained, please refer to this [OpenAI blog](https://openai.com/blog/clip/).

<figure>
    <img src="./clip_contrastive.png" style="width:100%">
<figcaption align = "center">
        Fig. 3: CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. Credit - [OpenAI](https://openai.com/blog/clip/)
</figcaption>
</figure>
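
To make the contrastive idea a little more concrete, here is a minimal pure-Python sketch of a CLIP-style symmetric contrastive loss. It is an illustration under stated assumptions, not CLIP's actual training code: the 2-dimensional embeddings are made up, and the temperature is fixed at an assumed value of `0.07` instead of being learned.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed in a stable way."""
    largest = max(logits)
    log_sum = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair i of images and texts should match."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(image, text)) / temperature for text in texts]
        for image in images
    ]
    n = len(logits)
    # Each image should pick out its own text (rows) ...
    loss_image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # ... and each text should pick out its own image (columns).
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_image_to_text + loss_text_to_image) / 2


# Made-up embeddings: two images and two candidate sets of captions.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_text_embs = [[0.9, 0.1], [0.1, 0.9]]   # caption i points toward image i
swapped_text_embs = [[0.1, 0.9], [0.9, 0.1]]   # captions paired with the wrong image

loss_matched = contrastive_loss(image_embs, matched_text_embs)
loss_swapped = contrastive_loss(image_embs, swapped_text_embs)
print(loss_matched < loss_swapped)
```

The loss is lower when each caption's embedding points toward its own image, which is the pressure that pulls matching text and image embeddings together during training.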

## VAE - Variational Auto Encoder

To be written.

## U-Net

To be written.

## Stable Diffusion Process

To be written.

## Conclusion

To be written.

I hope you enjoyed reading it, and feel free to use my code and try it out for generating your own images. Also, if you have any feedback on the code or the blog post, feel free to reach out on [LinkedIn](https://www.linkedin.com/in/aayushmnit/) or email me at aayushmnit@gmail.com.