# Yet Another Stable Diffusion Post

> Diving into the details of Diffusion Models.

- toc: true
- branch: master
- badges: true
- comments: true
- author: cck
- categories: [diffusion, deep learning]

In [None]:
#| default_exp diffusionIntroPost

## Background Info

This post goes over Diffusion Models and how they work. It is mainly a repository of personal notes on the topic for future reference, based on what I have learned along the way.

At the same time, the post covers many details and gaps that were either missing or assumed in other diffusion tutorials. To be clear: the content in those other write-ups is fantastic, and I would encourage everyone to read them as well. But, while trying to explain this subject to non-AI friends, I realized there are many tricky and unclear parts that are likely hard for an experienced person (or someone who works on this all day) to notice.

Hopefully this post gives a good introduction and covers those gaps, while being practically relevant and fun. It aims to be the blog post I wish I had when starting out.

The content comes mainly from the following excellent resources:
- The Annotated Diffusion Model
- Phil Wang’s code implementation
- Lilian Weng’s post
- Song’s compendium
- Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models
- Understanding Diffusion Models: A Unified Perspective

## Intro  

### The task: what are we trying to do?  

Before jumping into equations and definitions, we should take a step back and remember what these diffusion models are trying to do. What is the end goal behind all of this compute? 

The goal is to generate new images.  

In the spirit of artistry, let's say we are "creating" new images instead. Consider this example: we want to create a painting of a vase with sunflowers.  

As a person, it is easy to imagine the process of creating this painting. We would pick up a paintbrush and start painting a vase. Assuming we aren't trained artists, this first painting will likely be... rough, to put it gently. But, with practice, our paintings would steadily get better. After enough practice and effort, we would be able to create a more than respectable painting of a sunflower vase.

Creating the sunflower vase as described above would take a lot of effort and time. But there is a clear path for getting there. The same is true if we had instead asked for a photo or drawing of the sunflower vase. In that case, instead of practicing painting, we would have taken up photography or illustration, respectively.

In the real world, we would likely have bought or commissioned this sunflower art from someone else: an artist who has already put in the time and intense practice to improve their skills, saving us the trouble.

In any case, whether we chose to create or buy this sunflower art, there is a clear path for bringing it into the world.  


### Beyond sunflowers

The example above described a relatively simple scene: vase + flowers. What if we needed a painting of something more complex? For example, a fresco of a field of sunflowers as the dawn sun first cracks the horizon, perhaps with birds in the sky or animals roaming about.

If we were painting it ourselves, then we'd have to practice for a lot longer to nail this more complex scene and its subjects. Or, we'd need to find increasingly specialized artists who can handle the task.

This was an elaborate and drawn-out way of saying something quite obvious: there is a skills and time bottleneck in creating something new. Good old human effort and practice can ease this bottleneck. But surely there *must* be an easier way?

That is where computers come into play. Computers are incredible at rote, automatic tasks that would otherwise take a lot of time and brainpower from a human. But computers can famously only do exactly what they are told: nothing more and nothing less. How could we possibly get a computer to follow the process above, of human effort, practice, and refinement? In other words, how could a computer learn to paint a sunflower vase?


### Painting vs. Sculpting

Diffusion models are part of a broad family of approaches for creating new images with a computer. In a very loose analogy, they are trained and refined just like our aspiring painter in the previous section.

Let's revisit how a human would paint the sunflower vase. Then, we will see how a Diffusion Model would paint the vase.  

A human painter would start with a blank canvas. Then, through practice, the artist learns how to fill this canvas with brush strokes to create our sunflower painting.  
A Diffusion Model starts with a different kind of canvas. Instead of starting as blank, the Model's canvas is filled with noise. Then, through training, the model learns how to remove and carve out the noise to reveal our sunflower painting underneath.  

To link the two canvases: imagine if someone threw several buckets of paint at a blank canvas. The buckets themselves are of different sizes, and filled with every kind of color. Imagine someone has been throwing these buckets for a long time at random. Long enough for there to be hundreds or thousands of layers of paint, stacked in every combination of colors. 

Now, if an artist stopped by, and we asked them to turn this paint-drenched canvas into a sunflower vase, how would they do it? The easiest way would be to throw on another coat of all-white paint, wait for it to dry, and paint the sunflowers as requested. But, suppose this artist was a masochist (wild, right?), and wanted to do something much harder.  

If this artist knew the exact order in which the buckets were thrown, and where they were thrown, he could do something else: he could take a scalpel and carefully remove the layers of paint he didn't need.  

He could initially carve down to the first white coat of paint in every spot. This would give him a blank canvas, if maybe one that's a little uneven. Then, he could continue slicing out small strips of paint until he reached yellow for the sunflowers, green for the stems, etc. Remember, there are many layers of paint of every color. The artist simply has to know how far down to cut to reach his desired color for an area. After a long time of carving, the artist will have uncovered a sunflower painting in his specific vision.

If the artist follows the example above, he has "sculpted" the sunflower instead of "painting" it. This sculpting is exactly what a diffusion model learns to do. It starts with "noise" (aka the paint-soaked canvas with many layers of paint in every color), and learns to remove layers bit by bit, in order to reveal the masterpiece underneath.

The main question now becomes: how does a computer learn to sculpt new images out of noise? Diffusion Models are a formal answer to this question, where the learning can be automated by a computer. 

### Aside: Deep Learning  

Deep Learning has enabled computers to do things that were previously impossible (or at least incredibly difficult) for them. Some of these tasks include: finding specific objects in a photo, figuring out the exact words someone spoke, or writing a short story. As people, we find these tasks relatively easy. But to a computer, they are as hard as asking a human to calculate giant matrix multiplications in their head. One type of knowledge isn't objectively better than the other. The hardware (circuits vs. brains) has simply been specialized for different tasks.

### Formal digital sculpting

The original approach for this task actually came about in [Sohl-Dickstein et al., 2015]. That seems like an eternity ago, measured by the speed of Deep Learning progress. But it took more recent work [2019, 2020] and the follow-ups since to truly unlock the capabilities of these models.

This post takes a different approach: start with the forward diffusion process, and change the subscript notation so that everything reads left-to-right. First diffuse noise into the image, then undo that process to generate. This ordering felt more intuitive in talks with artist friends who are excited about these models and want to know more.

“Drawing denoised samples from a parametrized latent variable” makes sense to stats nerds and math people. “Making an image blurry, then learning how to de-blur it” is understood by normal people.

I’ve never liked how latent folks draw their arrows. Directions feel backwards. Be normal, left-to-right, pointing downwards (generating, from the AI heavens above). The latent is unobservable. We can always look down at the ground, but looking up at the sky burns our eyes. 

- The posts listed above are far more rigorous and show better mastery of the subject.
- This one is for my own future reference.
- Many typical diffusion details are unclear, or the lede is buried.
- Small details and changes in perspective that feel more natural to me.

Definitions and notation usually start with the backward (reverse) process, which seems like a random starting point: magically sampling out of thin air. A better order:
- First, define the forward diffusion process. It is very intuitive and people easily grok it.
- Then, show how we could reverse or undo this to recover the original image.
- Finally, show how we could use this exact process, but starting from different noise, to create a new image.
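To make the forward-then-undo ordering concrete, here is a minimal NumPy sketch. It is a toy under stated assumptions, not the implementation from any of the posts above: the 8x8 gradient "image", the step count `T`, and the linear `betas` schedule are all illustrative choices, and `forward_step` / `undo_step` are hypothetical helper names. The undo only works here because we kept every noise draw, just like the artist who knows the exact order the buckets were thrown. A real model does not get that list and has to learn to estimate the noise instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": an 8x8 grayscale gradient scaled to [-1, 1] (stand-in for real data).
x0 = np.linspace(-1.0, 1.0, 64).reshape(8, 8)

# Illustrative linear noise schedule (an assumption, not a recommended setting).
T = 200
betas = np.linspace(1e-4, 0.05, T)

def forward_step(x_prev, beta, noise):
    """One forward diffusion step: shrink the image slightly, then mix in noise."""
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * noise

def undo_step(x_next, beta, noise):
    """Exact inverse of forward_step, possible only if the noise draw is known."""
    return (x_next - np.sqrt(beta) * noise) / np.sqrt(1.0 - beta)

def corr(a, b):
    """Pearson correlation between two arrays, flattened."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

# Left to right: x_0 -> x_1 -> ... -> x_T, remembering every noise draw.
noises = [rng.standard_normal(x0.shape) for _ in range(T)]
trajectory = [x0]
for beta, noise in zip(betas, noises):
    trajectory.append(forward_step(trajectory[-1], beta, noise))

# Right to left: peel the layers off in reverse order.
x = trajectory[-1]
for beta, noise in zip(betas[::-1], noises[::-1]):
    x = undo_step(x, beta, noise)
recovered = x

print("correlation with x0 after 1 step: ", corr(x0, trajectory[1]))
print("correlation with x0 after T steps:", corr(x0, trajectory[-1]))
print("max error after undoing:          ", np.abs(recovered - x0).max())
```

After one step the image is barely changed, after `T` steps it is almost unrelated to the original, and undoing the steps with the remembered noise brings the original back up to floating-point error.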

We are not really adding noise; we are mixing it in. Pure adding would saturate. It is more like smoothing (link to Cold Diffusion, which seems to break from the notion of “adding Gaussian noise”).
- The nice property
- Notation and constants
- Derivations, and what they are doing
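The difference between mixing and adding can be checked numerically. The sketch below is a toy under assumptions: the uniform "pixels", the schedule, and the names `x_add`, `x_mix`, and `x_jump` are made up for illustration, and the notation (`betas`, `alpha_bar`) follows the common DDPM convention rather than anything fixed in this post. It also assumes "the nice property" means the usual closed form, where a noisy sample at step `t` can be drawn in one shot as `sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps`, with `alpha_bar_t` the running product of `1 - beta`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pixels" in [-1, 1]; a stand-in for real image data.
x0 = rng.uniform(-1.0, 1.0, size=10_000)

# Illustrative schedule (an assumption, not a recommended setting).
T = 200
betas = np.linspace(1e-4, 0.05, T)
alpha_bar = np.cumprod(1.0 - betas)  # running product of (1 - beta)

x_add = x0.copy()  # pure adding: noise piles up on top of the signal
x_mix = x0.copy()  # mixing: the signal is scaled down as noise comes in
for beta in betas:
    eps = rng.standard_normal(x0.shape)
    x_add = x_add + np.sqrt(beta) * eps
    x_mix = np.sqrt(1.0 - beta) * x_mix + np.sqrt(beta) * eps

# Closed form: jump straight from x0 to step T with a single noise draw.
eps = rng.standard_normal(x0.shape)
x_jump = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * eps

print("variance, pure adding:", x_add.var())
print("variance, mixing:     ", x_mix.var())
print("variance, one jump:   ", x_jump.var())
```

With pure adding the variance keeps growing, so values drift far outside the displayable range. With mixing the variance settles near 1, and the one-shot jump lands on the same spread as the step-by-step loop.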


The focus is broadly on score-matching and denoising models. The two have been shown to be equivalent or similar in some respects, and both offer useful and important points.

Score-matching is often introduced as a floating term. Matching what, and why? The score is the gradient of the log-density of the true underlying data distribution: at any point, it points toward more likely data. We want the noisy updates to march away from that distribution, and sampling to get progressively closer to it.
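To pin down what is being matched, here is a small sketch of the score for a case where it can be written by hand. It assumes a toy 1-D Gaussian as the "true distribution" (real image distributions have no such closed form, which is why a network has to learn the score). The values of `mu`, `sigma`, and `step` are arbitrary illustrative choices. The score is the gradient of the log-density with respect to the sample, and for a Gaussian that works out to `-(x - mu) / sigma**2`.

```python
import numpy as np

mu, sigma = 2.0, 0.5  # toy 1-D "data distribution": a single Gaussian

def log_density(x):
    """Log of the Gaussian density N(mu, sigma^2) evaluated at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def score(x):
    """Gradient of log_density with respect to x (analytic for a Gaussian)."""
    return -(x - mu) / sigma**2

# Sanity check: the analytic score agrees with a finite-difference gradient.
x = np.array([-1.0, 0.0, 1.5, 3.0])
h = 1e-5
numeric = (log_density(x + h) - log_density(x - h)) / (2.0 * h)

# Following the score moves a far-away sample toward high-density regions.
sample = -3.0
step = 0.05
path = [sample]
for _ in range(50):
    sample = sample + step * score(sample)
    path.append(sample)

print("analytic score:", score(x))
print("numeric score: ", numeric)
print("start, end of path:", path[0], path[-1])
```

This plain gradient ascent collapses onto the mode. Actual samplers also inject noise at each step so that they produce samples from the distribution rather than only its peak.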

Bad log-likelihoods can be explained by an increasing focus on imperceptible details. Nature has a sparse basis: an image can lose many components without affecting fidelity. But weighting the imperceptible differences tells us the “static” has a high-weight value. What could cause this sharp texture loss?

- Sample code with CIFAR-10
- A refactored version of the Annotated Diffusion Model, for myself

In [None]:
#| export
import os

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()