# Stable Diffusion Notebook





#### ***Background of Stable Diffusion***

Stable Diffusion is a powerful [open-source](https://github.com/Stability-AI/stablediffusion) image-generation model.

It supports image-to-image, text-to-image, and inpainting (select part of an image to remove and fill in).

Open question: why does noise allow the model to learn better?



#### Encoder

We build our VAE, which compresses the image into a smaller latent representation, so the convolutional layers that follow have less computation to do.

After we run the latent through the U-Net, we pass the result through the decoder to get back an image.
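To get a feel for how much work the compression saves, here is a rough sketch of the arithmetic. The sizes are assumptions based on the usual Stable Diffusion v1 setup (a 512x512 RGB image, an 8x spatial downsample, 4 latent channels); they do not come from this notebook.

```python
# Assumed sizes (Stable Diffusion v1 defaults, not taken from this notebook).
IMAGE_SHAPE = (3, 512, 512)   # (channels, height, width) of the RGB input
DOWNSAMPLE_FACTOR = 8         # the VAE shrinks height and width by this factor
LATENT_CHANNELS = 4           # channels in the latent representation


def num_values(shape):
    """Count the scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


latent_shape = (
    LATENT_CHANNELS,
    IMAGE_SHAPE[1] // DOWNSAMPLE_FACTOR,
    IMAGE_SHAPE[2] // DOWNSAMPLE_FACTOR,
)
compression = num_values(IMAGE_SHAPE) / num_values(latent_shape)

print(f"image  {IMAGE_SHAPE}: {num_values(IMAGE_SHAPE)} values")
print(f"latent {latent_shape}: {num_values(latent_shape)} values")
print(f"the U-Net sees {compression:.0f}x fewer values")
```

Under these assumptions the U-Net works on a (4, 64, 64) latent instead of the full image, which is where the savings in the convolutional layers come from.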

In [1]:
''' Understanding how convolutional layers affect shape '''

# input_shape is: (batch, channels, H, W)
def get_shape(input_shape, out_channels, kernel_size, padding, stride):
    # A conv layer floors the division: floor((n + 2p - k) / s) + 1
    h_out = (input_shape[2] + 2 * padding - kernel_size) // stride + 1
    w_out = (input_shape[3] + 2 * padding - kernel_size) // stride + 1

    output = [input_shape[0], out_channels, h_out, w_out]

    print(f"Shape of conv layer is: {output}")


# 512 -> 255 with padding=0 (padding=1 would give exactly 256)
get_shape([5, 3, 512, 512], out_channels=128, kernel_size=3, padding=0, stride=2)

# 256 -> 127 with padding=0 (padding=1 would give exactly 128)
get_shape([5, 128, 256, 256], out_channels=256, kernel_size=3, padding=0, stride=2)


Shape of conv layer is: [5, 128, 255, 255]
Shape of conv layer is: [5, 256, 127, 127]
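The sizes above miss the clean 256 and 128 because `padding=0` loses a pixel at the border. As a small sketch, assuming `padding=1` and the floor division that `nn.Conv2d` applies to its output size, each stride-2 convolution halves the resolution exactly. The helper name `conv_out_size` and the choice of three layers are only for illustration.

```python
def conv_out_size(size, kernel_size, padding, stride):
    """Spatial output size of a convolution, floored as nn.Conv2d does."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 512
sizes = [size]
for _ in range(3):  # three stride-2 convolutions, an illustrative choice
    size = conv_out_size(size, kernel_size=3, padding=1, stride=2)
    sizes.append(size)

print(sizes)
```

Three such layers take a 512-pixel side down to 64, an 8x reduction per spatial dimension.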


How did they come up with the architecture?

Most likely it had already worked well in practice elsewhere: the authors borrowed building blocks from earlier architectures.
- You must be aware of what is working at the cutting edge in neural network architectures!

### GroupNorm

What is group normalization?

Group normalization splits the channels of each sample into groups and normalizes every group with its own mean and variance, computed over all channels in the group and all spatial positions. Unlike batch norm, it does not normalize across the batch, and unlike layer norm, it does not normalize across all channels of a sample at once.

So, we are ***making related features share the same distribution***. Since our output has come from convolutions, the features inside one group are expressed relative to each other and not relative to the features in other groups, which are assumed to be less related.
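To make the grouping concrete, here is a minimal pure-Python sketch of the computation for a single sample stored as nested lists `[C][H][W]`. It is meant to mirror what `nn.GroupNorm` does before its learnable scale and shift (which are left out here); the function name `group_norm` and the toy data are assumptions for illustration.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample shaped [C][H][W] (nested lists), group by group.

    Channels are split into `num_groups` contiguous groups. Each group is
    normalized with the mean and variance of all its values: every channel
    in the group and every spatial position.
    """
    num_channels = len(sample)
    assert num_channels % num_groups == 0, "channels must divide evenly into groups"
    per_group = num_channels // num_groups

    out = []
    for g in range(num_groups):
        channels = sample[g * per_group:(g + 1) * per_group]
        values = [v for ch in channels for row in ch for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
        scale = 1.0 / math.sqrt(var + eps)
        for ch in channels:
            out.append([[(v - mean) * scale for v in row] for row in ch])
    return out


# 4 channels of 2x2 values, split into 2 groups of 2 channels each.
# The second group is the first one multiplied by 100.
sample = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
    [[100.0, 200.0], [300.0, 400.0]],
    [[500.0, 600.0], [700.0, 800.0]],
]
normed = group_norm(sample, num_groups=2)
print(normed[0][0])
print(normed[2][0])
```

Both groups come out on nearly the same scale even though the second one started 100 times larger, because each group is normalized only against its own statistics.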

In [None]:
import torch
from torch import nn
# playing with up-sampling
# shape is (batch, channels, height, width)
test = torch.randn((3, 3, 4, 4))
print(test[0, 0])  # first sample, first channel

# nearest-neighbor upsampling (the default mode) grows the image
# by repeating each pixel scale_factor times along height and width
ups = nn.Upsample(scale_factor=2)
test = ups(test)
print(test[0, 0])

tensor([[-0.6566,  0.2413, -1.9425, -0.1324],
        [ 0.2438,  0.5191, -1.3799,  0.9107],
        [-0.6993,  0.7703, -1.5209, -1.1541],
        [-1.4260,  0.9945, -0.2328,  1.4681]])
tensor([[-0.6566, -0.6566,  0.2413,  0.2413, -1.9425, -1.9425, -0.1324, -0.1324],
        [-0.6566, -0.6566,  0.2413,  0.2413, -1.9425, -1.9425, -0.1324, -0.1324],
        [ 0.2438,  0.2438,  0.5191,  0.5191, -1.3799, -1.3799,  0.9107,  0.9107],
        [ 0.2438,  0.2438,  0.5191,  0.5191, -1.3799, -1.3799,  0.9107,  0.9107],
        [-0.6993, -0.6993,  0.7703,  0.7703, -1.5209, -1.5209, -1.1541, -1.1541],
        [-0.6993, -0.6993,  0.7703,  0.7703, -1.5209, -1.5209, -1.1541, -1.1541],
        [-1.4260, -1.4260,  0.9945,  0.9945, -0.2328, -0.2328,  1.4681,  1.4681],
        [-1.4260, -1.4260,  0.9945,  0.9945, -0.2328, -0.2328,  1.4681,  1.4681]])
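The repetition pattern in the printed tensors can be reproduced by hand. The following is a small pure-Python sketch of nearest-neighbor upsampling on a nested list; `upsample_nearest` is a hypothetical helper written for this note, not a PyTorch function.

```python
def upsample_nearest(grid, scale_factor):
    """Repeat every value `scale_factor` times along both axes."""
    out = []
    for row in grid:
        wide_row = [value for value in row for _ in range(scale_factor)]
        # copy the widened row so the repeated rows are independent lists
        out.extend(list(wide_row) for _ in range(scale_factor))
    return out


small = [[1, 2],
         [3, 4]]
for row in upsample_nearest(small, 2):
    print(row)
```

Each input value turns into a 2x2 block of copies, which matches the duplicated rows and columns in the output above.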
