# A gentle introduction to Stable Diffusion: Part 2 - The CLIP text embedding model

Hello and welcome (back) to this explainer for Stable Diffusion, aimed at a non-technical audience. This will be a quick rundown of how *Latent Diffusion Models* (LDMs) work, focusing on Stable Diffusion v1.4, as it is probably the most intuitive to explain.

This explainer is split into multiple parts - first, we give a high-level overview of the model cascade itself, then we go into each component of the cascade, and finally, we put the model cascade together at the end!

This is the second section out of five: a look at the CLIP text embedder. The other sections are accessible [here](https://research.qut.edu.au/genailab/projects/unboxing-genai/).

1. Introduction to Latent Diffusion Models
2. The CLIP text embedding model
3. Variational Auto Encoders for image compression
4. Convolutional UNet de-noiser
5. Conclusion - putting it all together


## The Stable Diffusion Text Embedder

Let's start our deep dive with the text embedder. As previously mentioned, the text embedder here is a model called `CLIP-ViT-L/14`. This model was first introduced in the paper *Learning Transferable Visual Models From Natural Language Supervision* by researchers at OpenAI. It is designed specifically for embedding both text and images into the same vector space.

Essentially, when you use this model to embed words and images, the embeddings of words and images that are semantically similar should be close together in the vector embedding space. Because this model was trained on both images and text, the folks at Stability AI decided to use it for the word embeddings in Stable Diffusion v1.4.
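To make "close together in the vector embedding space" a little more concrete, here is a minimal sketch of cosine similarity, a common way to measure closeness between embeddings. The three-number vectors and their captions are made up for illustration; real CLIP embeddings have hundreds of dimensions and come out of the model, not out of a hand-written dictionary.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" (illustrative values only, not produced by CLIP)
toy_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a picture of a puppy": [0.8, 0.2, 0.1],
    "a bowl of soup": [0.0, 0.2, 0.9],
}

dog = toy_embeddings["a photo of a dog"]
puppy = toy_embeddings["a picture of a puppy"]
soup = toy_embeddings["a bowl of soup"]

sim_dog_puppy = cosine_similarity(dog, puppy)
sim_dog_soup = cosine_similarity(dog, soup)

print(f"dog vs puppy: {sim_dog_puppy:.3f}")  # high: similar meaning
print(f"dog vs soup:  {sim_dog_soup:.3f}")   # low: unrelated meaning
```

With a trained model, the hope is that captions (and images) with similar meanings score high in this way, while unrelated ones score low.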


In [3]:
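# First, we need to install the required libraries.
# Note: with this pinned version, pip may try to build the `tokenizers` package from source
# on newer Python versions (e.g. 3.12), which can fail. If that happens, try an older Python
# version or a more recent `transformers` release.

!pip install -q --upgrade transformers==4.25.1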


In [2]:
# Then, we initialize some imports and settings

import os
import torch

from pprint import pprint
from transformers import CLIPTextModel, CLIPTokenizer, logging

# Suppress some unnecessary warnings when loading the CLIPTextModel
logging.set_verbosity_error()

# Set device
torch_device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
if "mps" == torch_device: os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = "1"


Now let's load the model:


In [6]:
# Load the tokenizer and text encoder to tokenize and encode the text.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

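text_encoder = text_encoder.to(torch_device)

# Half precision (float16) saves memory on GPU-style devices; on CPU some float16
# operations may be unsupported, so we keep the default float32 there
if torch_device != "cpu":
    text_encoder = text_encoder.half()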

To turn a sentence into an embedding, we must first "tokenize" the sentence, which is why we've loaded a tokenizer.

A tokenizer splits your input string/sentence into subwords, and then, using a dictionary, converts each subword to an integer. Subwords can be as short as a single letter or as long as an entire word.

We turn the text into numbers first so that it becomes small chunks of data that are easier to operate on mathematically. Think of this stage as a simple code-book where we match certain chunks of words to numbers.
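The code-book idea can be sketched in a few lines of plain Python. Everything here is hypothetical: `toy_vocab`, `toy_tokenize` and `toy_detokenize` are invented for illustration, and the greedy "longest piece first" matching is a simplification. The real CLIP tokenizer uses byte-pair encoding with a learned vocabulary of tens of thousands of entries.

```python
# Hypothetical code-book: a handful of subwords mapped to integers
toy_vocab = {
    "<start>": 0, "<end>": 1,
    "hi": 2, "mum": 3, "comput": 4, "er": 5, "!": 6, "a": 7, "in": 8,
}

def toy_tokenize(sentence, vocab):
    """Greedy longest-match subword tokenizer over a tiny vocabulary."""
    ids = [vocab["<start>"]]
    for word in sentence.lower().split():
        pos = 0
        while pos < len(word):
            # Try the longest remaining piece first, then shrink until it is in the code-book
            for end in range(len(word), pos, -1):
                piece = word[pos:end]
                if piece in vocab:
                    ids.append(vocab[piece])
                    pos = end
                    break
            else:
                raise ValueError(f"no code-book entry covers {word[pos:]!r}")
    ids.append(vocab["<end>"])
    return ids

def toy_detokenize(ids, vocab):
    """Look each integer back up in the code-book."""
    id_to_piece = {i: piece for piece, i in vocab.items()}
    return [id_to_piece[i] for i in ids]

toy_ids = toy_tokenize("Hi mum! In a computer!", toy_vocab)
print(toy_ids)
print(toy_detokenize(toy_ids, toy_vocab))
```

Notice that "computer" is not in the code-book as a whole word, so it is split into the two subwords "comput" and "er". This is how a fixed-size dictionary can still cover words it has never seen in full.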

In [8]:

input_sentence = ["Hi mum! I'm in a computer!"]

tokenized_input = tokenizer(
    input_sentence,
    
    # 'padding' is to fill in "unused" tokens in the input, to make sure it goes into our embedding model properly
    padding="max_length",

    # 'max_length' is setting how many tokens the padder needs to pad the input out to
    max_length=tokenizer.model_max_length,
    
    # 'truncation' being true means that, if we put in too many tokens, we will just chop off the last tokens to make sure the input fits
    truncation=True,
    
    # 'return_tensors' set to 'pt' means that our datatype will be compatible with PyTorch
    return_tensors="pt"
)

pprint(tokenized_input)


{'input_ids': tensor([[49406,  1883,  4030,   256,   328,   880,   530,   320, 11639,   652,
           256, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0]])}


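You can see that by calling `tokenizer()`, the text was converted to a list of numbers (a tensor) in the `input_ids` field. The sequence starts with the start-of-text token (`49406`) and ends with the end-of-text token (`49407`), which is also used here to pad the input out to the maximum length. The `attention_mask` tells the model which input tokens are actually text (`1`) and which are padding that should be ignored (`0`).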
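The padding and masking step can be mimicked in plain Python. This is only a sketch of the idea, not what the library does internally: `pad_and_mask` is a hypothetical helper, and the token ids and the length of 77 are simply copied from the output above.

```python
END_ID = 49407      # the id that fills the unused slots in the output above
MAX_LENGTH = 77     # the number of entries in 'input_ids' in the output above

def pad_and_mask(token_ids, max_length=MAX_LENGTH, pad_id=END_ID):
    """Truncate or pad a list of token ids to a fixed length and build the matching mask."""
    kept = token_ids[:max_length]            # 'truncation': chop off anything past the limit
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad      # 'padding': fill the unused slots
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# The 12 "real" ids for our example sentence, copied from the output above
sentence_ids = [49406, 1883, 4030, 256, 328, 880, 530, 320, 11639, 652, 256, 49407]

input_ids, attention_mask = pad_and_mask(sentence_ids)
print(len(input_ids), sum(attention_mask))
```

The mask has as many `1`s as there are real tokens (including the start and end markers), and every slot after that is `0`.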

These tokenized outputs can easily be "reversed" to get our inputs back. The text comes back in lower case, because the CLIP tokenizer lower-cases its input.


In [9]:

tokenizer.decode(
    # We pass in the 'input_ids' field (we don't need the `attention_mask` field here).
    # The [0] is needed because we could have tokenized many text strings at once (we didn't though)
    tokenized_input['input_ids'][0],

    # 'skip_special_tokens' being true drops the special start-of-text, end-of-text and padding tokens from the decoded string
    skip_special_tokens=True
)


"hi mum! i'm in a computer!"


Now we will "embed" these tokens - or put them into a useful vector space for the model to "understand" what this sentence means.

We use the term "understand" loosely - really, this model is trained to keep points in vector embedding space close together if they're similar, and keep points far apart if they're not similar.

What is and isn't similar is not defined by humans hand-labelling data. Instead, CLIP is trained with *contrastive learning* on a very large collection of images paired with the text that accompanied them online (around 400 million pairs, according to the CLIP paper). During training, the model sees a batch of images and captions, and is rewarded for giving each image and its own caption similar embeddings, and for giving mismatched images and captions dissimilar embeddings. The pairing itself is the only supervision.

We won't go too deep into it, but essentially, we've converted the characters from this sentence into numbers in a vector space that can be used to represent language mathematically in a useful and general-purpose way.
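One common way to express "keep similar points close and dissimilar points far apart" as a training signal is a contrastive loss. The sketch below is a toy version in plain Python, not CLIP's actual training code: the similarity scores are made up, and `contrastive_loss` is a hypothetical helper that applies a symmetric cross-entropy to a small table of image-versus-caption scores.

```python
import math

def contrastive_loss(similarity):
    """Symmetric cross-entropy over a square table of image-caption similarity scores.

    similarity[i][j] is the score between image i and caption j.
    The matching pairs sit on the diagonal (image 0 with caption 0, and so on).
    """
    n = len(similarity)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            log_sum = math.log(sum(math.exp(score) for score in row))
            total += log_sum - row[i]    # equals -log(softmax(row)[i])
        return total / n

    columns = [list(col) for col in zip(*similarity)]
    # Average the "image picks its caption" and "caption picks its image" directions
    return 0.5 * (cross_entropy(similarity) + cross_entropy(columns))

# Made-up scores: matching pairs score high, everything else scores low
well_aligned = [
    [5.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
]
# Made-up scores: every image scores highest with the wrong caption
poorly_aligned = [
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
    [5.0, 0.0, 0.0],
]

loss_good = contrastive_loss(well_aligned)
loss_bad = contrastive_loss(poorly_aligned)
print(f"well aligned:   {loss_good:.3f}")
print(f"poorly aligned: {loss_bad:.3f}")
```

The loss is small when each image scores highest with its own caption and large when it does not, so lowering the loss pushes matching pairs together and mismatched pairs apart.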


In [10]:

# Move the tokenized text to the processing device (GPU or CPU etc.)
tokenized_input = tokenized_input.to(torch_device)

# Embed it using the CLIP text encoder (no gradients are needed, since we are not training)
with torch.no_grad():
    sentence_embedding = text_encoder(**tokenized_input)

pprint(sentence_embedding)



What you can see printed out is two items. The first item (`last_hidden_state`) is the "unpooled" output of the embedding model, and the bottom item (`pooler_output`) is the "pooled" output.

Essentially what's happening is that each token gets its own embedding, giving one vector per token position (77 vectors of 768 numbers each for this model). The 'pooled' output then condenses these into a single vector that represents the whole sentence; in this CLIP implementation it is taken from the end-of-text token's position rather than by averaging all the token embeddings. The unpooled, per-token output is the one we'll be putting into our UNet later.
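The difference between "unpooled" and "pooled" can be illustrated with a toy example in plain Python. The numbers are made up, and both helper functions are hypothetical. The second one reflects my understanding of how the Hugging Face CLIP text model chooses its pooled vector (the position of the end-of-text token, which has the largest id); treat that as an assumption to check against the library documentation.

```python
# Toy "unpooled" output: one small vector per token position (made-up numbers)
token_ids = [49406, 1883, 49407, 49407]   # start, one word, end, padding
per_token = [
    [0.0, 1.0],
    [2.0, 3.0],
    [4.0, 5.0],
    [6.0, 7.0],
]

def mean_pool(vectors):
    """Average the token vectors, dimension by dimension, into one vector."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

def end_token_pool(vectors, ids):
    """Pick the vector at the first position holding the largest token id."""
    position = ids.index(max(ids))
    return vectors[position]

averaged = mean_pool(per_token)
end_token = end_token_pool(per_token, token_ids)

print("unpooled:", len(per_token), "vectors")
print("mean-pooled:", averaged)
print("end-token-pooled:", end_token)
```

Either way, pooling turns many per-token vectors into a single vector for the whole sentence, which is why the pooled output is so much smaller than the unpooled one.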

Two things are worth noting here. First, it's now much trickier to decode this set of numbers (especially after they've been pooled) back into the input sentence. This is an encoder-only network, and typically the only way to turn this set of numbers back into a sentence is to use a decoder network trained to decode text from this particular embedding space. The CLIP embedding model doesn't include this component.

Second, this is just a set of numbers, so you can do arbitrary arithmetic on it to create whatever set of numbers you like! You could, for example, add `1.0` to all of the numbers in the vector, or multiply all the numbers by `pi`, or add another vector, or completely randomise them! This is potentially an interesting way of exercising control over the final output in an LDM - however, we will explore this more later.
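As a rough illustration of those four operations, here is a sketch that works on a plain Python list standing in for an embedding. The function names and the three-number `toy_embedding` are invented for this example; a real embedding from the text encoder is a tensor with many more numbers.

```python
import math
import random

def add_constant(vector, value=1.0):
    """Add the same number to every entry of the vector."""
    return [x + value for x in vector]

def scale(vector, factor=math.pi):
    """Multiply every entry of the vector by the same number."""
    return [x * factor for x in vector]

def add_vectors(a, b):
    """Add two vectors of the same length, entry by entry."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x + y for x, y in zip(a, b)]

def randomise(vector, seed=None):
    """Replace every entry with a random number (the original values are thrown away)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in vector]

toy_embedding = [0.5, -1.0, 2.0]   # made-up stand-in for a real embedding

print(add_constant(toy_embedding))
print(scale(toy_embedding, 2.0))
print(add_vectors(toy_embedding, [1.0, 1.0, 1.0]))
print(randomise(toy_embedding, seed=0))
```

On PyTorch tensors the same ideas are written directly as `embedding + 1.0`, `embedding * math.pi`, `embedding_a + embedding_b`, and `torch.randn_like(embedding)`.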

Below are some convenience functions to tokenize, detokenize, and embed strings, so you can try everything I just demonstrated with your own inputs. We will use these later in this notebook.


In [12]:

def tokenize(stringlist):
    """Convert a list of strings into a batch of tokenized strings"""
    tokenized = tokenizer(
        stringlist,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    return tokenized

def detokenizer(tokenized):
    """Decode every tokenized string in a batch, returning a list of strings"""
    return tokenizer.batch_decode(tokenized['input_ids'], skip_special_tokens=True)

def sentence_embedder(stringlist):
    """Tokenize, then embed a list of strings"""
    inputs = tokenize(stringlist)
    inputs = inputs.to(torch_device)
    # No gradients are needed, since we are only running the model, not training it
    with torch.no_grad():
        sentence_embeddings = text_encoder(**inputs)
    return sentence_embeddings
    

In [13]:

demo_strings = [
    'demo string! The meaning of life is 42 (and so am I in spirit - 42 years old. Time flies.)',
    'This is another text string!'
]

tokenized=tokenize(demo_strings)
decoded=detokenizer(tokenized)
embeddings=sentence_embedder(demo_strings)

print(demo_strings)
print('')
print(tokenized)
print('')
print(decoded)
print('')
print(embeddings)


['demo string! The meaning of life is 42 (and so am I in spirit - 42 years old. Time flies.)']

{'input_ids': tensor([[49406,  7380,  9696,   256,   518,  8342,   539,   970,   533,   275,
           273,   263,   537,   706,   687,   328,   530,  4071,   268,   275,
           273,  1151,   896,   269,   788,  8070,  5376, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 



Now that we understand the text embedder, let's move on to the Variational AutoEncoder.