# Stable Diffusion Notebook





#### ***Background of Stable Diffusion***

Stable Diffusion is a powerful [open-source](https://github.com/Stability-AI/stablediffusion) image-generation model.

It supports text-to-image, image-to-image, and inpainting (select part of the image to remove and regenerate).

Open question: why does noise allow the model to learn better?



#### Encoder

We build our VAE, which compresses the image into a much smaller latent representation, so the convolutional layers that follow have less computation to do.

After we run the latent through our U-Net, we pass the result through the VAE decoder to get back an image.
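To make the saving concrete, here is a small arithmetic sketch. The shapes are assumptions rather than values from this notebook: a 512x512 RGB image, and a 4-channel latent downsampled by a factor of 8 along each spatial dimension, as in the released Stable Diffusion v1 checkpoints. The helper `num_elements` is a hypothetical name used only for illustration.

```python
def num_elements(shape):
    """Number of values in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n

# Assumed shapes (channels, height, width); not taken from this notebook.
image_shape = (3, 512, 512)
downsample = 8          # assumed spatial downsampling factor of the VAE encoder
latent_channels = 4     # assumed number of latent channels

latent_shape = (
    latent_channels,
    image_shape[1] // downsample,
    image_shape[2] // downsample,
)
ratio = num_elements(image_shape) / num_elements(latent_shape)

print(latent_shape)  # (4, 64, 64)
print(ratio)         # 48.0
```

Under these assumptions the U-Net works on 48 times fewer values than it would in pixel space.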

In [1]:
''' Understanding how convolutional layers affect shape '''

# input is: (Batch, Channels, H, W)
def get_shape(input,out_channels,kernel_size,padding,stride):
    
    # floor division: a conv layer cannot output a fractional number of pixels
    h_out = ((input[2] + 2*padding - kernel_size) // stride) + 1
    w_out = ((input[3] + 2*padding - kernel_size) // stride) + 1
    
    output = [input[0],out_channels,h_out,w_out]
    
    print(f"Shape of conv layer is: {output}")


# 512 -> 255 with padding=0 (padding=1 would give exactly 256)
get_shape([5,3,512,512],out_channels=128,kernel_size=3,padding=0,stride=2)

# 256 -> 127 with padding=0 (padding=1 would give exactly 128)
get_shape([5,128,256,256],out_channels=256,kernel_size=3,padding=0,stride=2)
    

Shape of conv layer is: [5, 128, 255, 255]
Shape of conv layer is: [5, 256, 127, 127]
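As a sanity check, the size formula can be compared against a real `nn.Conv2d`, which rounds the result down to a whole number of pixels. This is a minimal sketch: the helper name `conv_out_size` and the small channel count of 8 are choices made here for illustration, not part of the original architecture.

```python
import torch
from torch import nn

def conv_out_size(size, kernel_size, padding, stride):
    """Spatial output size of a conv layer (dilation 1), rounded down."""
    return (size + 2 * padding - kernel_size) // stride + 1

x = torch.randn(1, 3, 512, 512)  # (batch, channels, height, width)

for padding in (0, 1):
    conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=padding)
    predicted = conv_out_size(512, kernel_size=3, padding=padding, stride=2)
    actual = conv(x).shape[-1]
    print(f"padding={padding}: predicted {predicted}, actual {actual}")
```

With `kernel_size=3` and `stride=2`, `padding=1` halves the resolution exactly (512 to 256), while `padding=0` loses one pixel (512 to 255).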


How did they come up with the architecture?

Most likely, the building blocks had already worked well in practice elsewhere, and the authors simply borrowed them from existing architectures.
- You must be aware of what is working at the cutting edge in neural network architectures!

### GroupNorm

What is group normalization?

Since our output has come from convolutions, each channel is a feature map. Group normalization splits the channels of each sample into 'groups' and normalizes every group with the mean and variance computed over that group's channels and all of their spatial positions. The statistics are not taken over the batch (as in batch norm), and not over all channels at once (as in layer norm).

So, we are ***making features in the same group share the same distribution***. This effectively makes it so that each feature is expressed relative to the other features in its group. For example, with 8 channels and 4 groups, channels 0 and 1 will be expressed relative to each other, and not to channels 2 through 7. Note that the grouping is over channels, not over spatial regions of the image.
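A small check of what `nn.GroupNorm` computes, written as a sketch: the tensor sizes and the number of groups are arbitrary choices for illustration, and `affine=False` turns off the learnable scale and shift so the result can be compared against a manual computation.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(2, 8, 4, 4)   # (batch, channels, height, width)
num_groups = 4                # 8 channels -> 4 groups of 2 channels each

gn = nn.GroupNorm(num_groups=num_groups, num_channels=8, affine=False)
out = gn(x)

# Manual version: per sample and per group, normalize over the group's
# channels and all spatial positions, using the biased variance.
B, C, H, W = x.shape
grouped = x.reshape(B, num_groups, -1)
mean = grouped.mean(dim=-1, keepdim=True)
var = grouped.var(dim=-1, unbiased=False, keepdim=True)
manual = ((grouped - mean) / torch.sqrt(var + gn.eps)).reshape(B, C, H, W)

print(torch.allclose(out, manual, atol=1e-5))
```

Each (sample, group) slice of `out` ends up with mean close to 0 and variance close to 1, regardless of what the other samples in the batch look like.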

In [2]:
import torch
from torch import nn
# playing with up-sampling
# batch,channels,height,width
test = torch.randn((3,3,4,4))
print(test[0,0]) # 1st channel, 1st batch

# nearest-neighbor upsampling (the default mode): grows the image by repeating each pixel!
ups = nn.Upsample(scale_factor=2)
test = ups(test)
print(test[0,0])

tensor([[ 0.3066,  1.0215,  2.2728, -0.5375],
        [-0.7986,  0.2427,  1.1274, -0.2242],
        [ 0.8298, -1.5215, -0.1798,  0.2187],
        [-0.4769, -0.3294, -1.2319,  0.8011]])
tensor([[ 0.3066,  0.3066,  1.0215,  1.0215,  2.2728,  2.2728, -0.5375, -0.5375],
        [ 0.3066,  0.3066,  1.0215,  1.0215,  2.2728,  2.2728, -0.5375, -0.5375],
        [-0.7986, -0.7986,  0.2427,  0.2427,  1.1274,  1.1274, -0.2242, -0.2242],
        [-0.7986, -0.7986,  0.2427,  0.2427,  1.1274,  1.1274, -0.2242, -0.2242],
        [ 0.8298,  0.8298, -1.5215, -1.5215, -0.1798, -0.1798,  0.2187,  0.2187],
        [ 0.8298,  0.8298, -1.5215, -1.5215, -0.1798, -0.1798,  0.2187,  0.2187],
        [-0.4769, -0.4769, -0.3294, -0.3294, -1.2319, -1.2319,  0.8011,  0.8011],
        [-0.4769, -0.4769, -0.3294, -0.3294, -1.2319, -1.2319,  0.8011,  0.8011]])
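The output above suggests that `nn.Upsample` in its default `"nearest"` mode simply copies every pixel into a 2x2 block. A minimal sketch that checks this against an explicit repeat along the height and width dimensions (the tiny 2x2 input is chosen here only to keep the printout readable):

```python
import torch
from torch import nn

x = torch.arange(4.0).reshape(1, 1, 2, 2)  # (batch, channels, height, width)

ups = nn.Upsample(scale_factor=2)          # default mode is "nearest"
y = ups(x)

# Repeat each row twice (dim=2), then each column twice (dim=3).
manual = x.repeat_interleave(2, dim=2).repeat_interleave(2, dim=3)

print(y[0, 0])
print(torch.equal(y, manual))
```

No new information is created by this layer; in a decoder it is typically followed by a convolution that smooths and refines the enlarged feature map.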


In [3]:
# testing CLIP loss w/ minimal implementation
import torch.nn as nn
import torch.nn.functional as F

# text and image embeddings, shape (B, Embd)
txt = torch.randn((12,100))
images = torch.randn((12,100))

# L2-normalizing each row to unit length, so dot products are cosine similarities
txt = F.normalize(txt) 
images = F.normalize(images)

t = 0.01 # temperature for sensitivity scaling (learned in CLIP; defined here but not applied below)

# get inner-products
corr_logits = images@txt.T # dot product of images and text: row=img, col=text

# cross-entropy: the matching pair for row i sits in column i
labels = torch.arange(corr_logits.shape[0])

# for each row, picking out target label (column of row)
criterion = nn.CrossEntropyLoss()
loss = criterion(corr_logits,labels)

print(loss)

tensor(2.4814)
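The cell above scores only one direction (each image against all texts) and leaves `t` unused. A fuller sketch of the contrastive objective divides the logits by the temperature and averages the image-to-text and text-to-image cross-entropies. The function name `clip_loss` is hypothetical, and the temperature is fixed at the notebook's value of 0.01 rather than learned.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.01):
    """Symmetric contrastive loss over a batch of matching (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Rows are images, columns are texts; a smaller temperature sharpens the softmax.
    logits = image_emb @ text_emb.T / temperature
    labels = torch.arange(logits.shape[0])

    loss_img = F.cross_entropy(logits, labels)    # each image picks its text
    loss_txt = F.cross_entropy(logits.T, labels)  # each text picks its image
    return (loss_img + loss_txt) / 2

torch.manual_seed(0)
images = torch.randn(12, 100)
txt = torch.randn(12, 100)

loss_random = clip_loss(images, txt)      # unrelated pairs
loss_aligned = clip_loss(images, images)  # perfectly matching pairs
print(loss_random.item(), loss_aligned.item())
```

Perfectly matching embeddings should give a loss near zero, while unrelated random embeddings give a much larger one.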


In [None]:
'''Full implementation of CLIP

We will not be focusing on the Encoder/Decoder for the image and the text;
we will abstract these away by downloading pre-trained models.

We will focus on the actual CLIP loss objective, and add a final linear layer
to project the outputs of our expressive encoders

'''

# downloading our tokenizer and pre-trained BERT text encoder
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

In [31]:
text = "io kl ioioio"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)


# now, we have a way to generate embeddings (one 768-dim vector per token) from our transformer!
print(output.last_hidden_state.shape)


torch.Size([1, 8, 768])
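One possible shape for the "final linear layer" mentioned in the docstring is sketched below. Everything here is an assumption for illustration: the class name `ProjectionHead`, the output size of 256, and the choice to summarize the sequence with its first token (BERT's `[CLS]` position). A random tensor stands in for `output.last_hidden_state`, so the sketch runs without downloading a model.

```python
import torch
from torch import nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Hypothetical linear projection from encoder states to an embedding."""

    def __init__(self, in_dim=768, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, hidden_states):
        # hidden_states: (batch, tokens, in_dim); keep only the first token.
        first_token = hidden_states[:, 0]
        # Unit-length output, so dot products are cosine similarities.
        return F.normalize(self.proj(first_token), dim=-1)

hidden = torch.randn(1, 8, 768)  # stand-in with the same shape as last_hidden_state
head = ProjectionHead()
emb = head(hidden)
print(emb.shape)  # torch.Size([1, 256])
```

A second head of the same kind on the image encoder would map both modalities to vectors of the same size, which is what the dot-product logits in the loss require.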
