<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

> Making Classifier-free Guidance a dynamic process.

# Introduction

This notebook introduces dynamic Classifier-free Guidance (`dCFG`) for diffusion models.  

`dCFG` changes the Classifier-free Guidance scale at each timestep of the diffusion process. We cover why this might be important in the section below.  
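
As a rough sketch of the idea: standard Classifier-free Guidance combines the unconditional and the text-conditioned noise predictions using a single scale `g`, as `uncond + g * (cond - uncond)`. A dynamic version swaps that constant for a value that depends on the timestep. The function name, the toy predictions, and the scale values below are illustrative assumptions, not the `dynamic_cfg` API.

```python
from typing import List, Sequence


def guided_prediction(uncond: Sequence[float],
                      cond: Sequence[float],
                      scale: float) -> List[float]:
    """Standard CFG update, elementwise: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# toy stand-ins for the model's two noise predictions at a single timestep
uncond = [0.0, 1.0, -1.0]
cond = [1.0, 1.0, 0.0]

# static CFG: the same scale at every step
static_scales = [9.0, 9.0, 9.0, 9.0]
# dynamic CFG: a (hypothetical) scale that changes at every step
dynamic_scales = [3.0, 9.0, 7.0, 5.0]

for step, (g_static, g_dynamic) in enumerate(zip(static_scales, dynamic_scales)):
    static_out = guided_prediction(uncond, cond, g_static)
    dynamic_out = guided_prediction(uncond, cond, g_dynamic)
    print(f'step {step}: static={static_out} dynamic={dynamic_out}')
```

With a scale of `1` the output is just the conditional prediction, and with `0` it is the unconditional one. Larger scales push further along the direction from unconditional to conditional.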

## Previous work on `dCFG`

We previously ran an [exploratory series](https://enzokro.dev/blog/posts/2022-11-26-guidance-expts-8) on `dCFG` on the `v1` Stable Diffusion models. Then, we made a short introduction notebook on `dCFG` for the [Stable Diffusion v2.0 model](https://enzokro.dev/blog/posts/2022-11-28-sd-v2-schedules-1/).  

With the release of the new and improved [Stable Diffusion v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) model, it seems like a great time to take a step back, recap what we've learned so far, and put our approach on more solid footing.  

:::: {.callout-note}  
There are similar dynamic guidance approaches in the Imagen paper, and in applications for Text-to-Speech with diffusion.  
::::

# Overview of Guidance for Diffusion Models

This section goes over how to generate images based on a known, given input. We briefly recap the different ways of generating images.  

Specifically, we review unconditional image generation, then move on to classifier-guided generation, and finally close with classifier-free generation. This represents how people have gone from generating random photos to the incredible diffusion images floating around the web.  

## Unconditioned Generation

Unconditional image generation is the bedrock of generative models. Here, we are given a collection of training images. The goal is to learn and model the probability distribution that generated these images. If we can learn or estimate this distribution, then we can sample from it to create brand new images.  

Ideally, we would have a grand Oracle that models the distribution of every single possible image. This Oracle would then, in theory, be able to generate absolutely any image we can think of. Unfortunately creating this Oracle would require an almost infinite amount of data, assuming we could even gather it in the first place (we can't). The best we can do then is to gather a subset of the images we care the most about. For example, if we are trying to generate outdoor landscapes, we could gather images of nature. The more images we gather the better.  

The goal is to make our training image set large and diverse enough to represent the topic or subject (aka distribution) that we want to generate. Once we have this training image set, there is a wide range of Machine Learning approaches to both model and sample from its distribution. The most popular generation approaches are detailed in this [excellent blog post](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/) by Lilian Weng. These approaches include:  

- GANs  
- Flow-based models  
- Variational Auto-Encoders  
- Diffusion models  

Assuming our training set is large and representative enough, any of these approaches can learn to model and sample from its data-generating distribution.  
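
To make the "learn the distribution, then sample from it" loop concrete, here is a deliberately tiny sketch. It assumes the "images" are single numbers drawn from a Gaussian, so that "learning" reduces to estimating a mean and a standard deviation. Real generative models learn far richer distributions, but the two stages are the same.

```python
import random
import statistics

random.seed(0)

# "training set": 1-D samples standing in for images
train = [random.gauss(5.0, 2.0) for _ in range(10_000)]

# "learn" the data-generating distribution by estimating its parameters
mu = statistics.fmean(train)
sigma = statistics.stdev(train)


def sample(n: int) -> list:
    """Draws `n` brand new points from the learned distribution."""
    return [random.gauss(mu, sigma) for _ in range(n)]


new_samples = sample(5)
print(f'learned mean={mu:.2f}, std={sigma:.2f}')
print('new samples:', [round(x, 2) for x in new_samples])
```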

This is fantastic if we want to create new styles or variants of our data. For example, if the training data was made of fashion styles, then we could generate new or unique trends. Or if the data was some sort of asset like character sprites or objects in a video game, we could generate new and creative items.  

However, we often want to generate specific outputs. Online Stable Diffusion APIs are a perfect example: we want the model to generate an output based on the given input text. Or, tying it back to our earlier examples, maybe we want to create a new fashion trend that's inspired by specific styles. Likewise for the video game assets, maybe we want to create a new creature that's a blend of two existing monsters. This is where **guidance** comes into play. 

## Classifier Guidance

## Classifier-free Guidance

### Making it dynamic

## What does `dCFG` actually do?

# Python imports

We start with a few Python imports.

In [3]:
#| echo: false
#| include: false


# tensorflow is optional here: only quiet its logger when it is installed
try:
    import tensorflow as tf
    tf.get_logger().setLevel('INFO')
except ImportError:
    pass

In [2]:
import os
import gc
import random
from typing import Callable, List, Dict
from functools import partial

import torch
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

# runs dCFG
from dynamic_cfg.guidance import DynamicCFG

# to load Stable Diffusion pipelines
from dynamic_cfg.diffusion import MinimalDiffusion
# to plot generated images
from dynamic_cfg.utils import show_image, image_grid, plot_grid

# Default schedule parameters from the blog post
from dynamic_cfg.schedules import DEFAULT_SCHED_PARAMS, DEFAULT_T_PARAMS, get_cos_sched

## Seed for reproducibility

`seed_everything` makes sure that the results are reproducible across notebooks.

In [9]:
# set the seed for rng
SEED = 2863311530
def seed_everything(seed: int):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# for sampling the initial, noisy latents
seed_everything(SEED)

# Text prompt for image generations

Negative prompts appear to be very helpful in `v2`. At least, more helpful than they were for `v1.x` models.  

Below, we also borrow a prompt and negative-prompt format that's going around the Stable Diffusion Discord. It seems to be a good starting point as the community figures out the new prompt structures. 

In [None]:
# text prompt for image generations
prompt = "a futuristic metropolis collapsed by the beach on a caribbean island, dystopia, apocalyptic, sci-fi, disaster, art station, misery, cinematic, hdri, matte painting, concept art, soft render, highly detailed, cgsociety, octane render, trending on artstation, architectural HQ, 4k"
# prompt = "One Second Before Awakening From a Dream Provoked by the Flight of a Bee Around a Pomegranate"

# a good negative prompt
# neg_prompt = "!!!!!!text!!!!!!, watermark, bad art, deformed, blurry, strange colours, sketch, lacklustre, repetitive, cropped, lowres, deformed, old, childish"
neg_prompt = "(ugly, cartoon, bad anatomy, bad art, frame, deformed, disfigured, extra limbs, text, meme, low quality, mutated, ordinary, overexposed, pixelated, poorly drawn, signature, thumbnail, too dark, too light, unattractive, useless, watermark, writing, cropped:1.1)"

# Image and Sampler parameters

The images will be generated over $50$ diffusion steps at a `768 x 768` resolution.  

We are using the `DPM++ SDE Karras` sampler. This sampler seems to be working the best for high-quality outputs at the moment. The `2m Karras` sampler wins out on speed, however.  

If the generation is too slow on your machine, I'd suggest using the `k_dpmpp_2m` sampler instead.  

In [None]:
# number of diffusion steps
num_steps = 50    

# image dimensions
height = 768 # 768
width  = 768 # 768

# group the arguments for the generation function
gen_kwargs = {
    'height': height,
    'width': width, 
    'negative_prompt': neg_prompt, 
    'num_steps': num_steps,
}

# set the k-diffusion scheduler
sampler_kls = 'k_dpmpp_sde' # 'dpm_multi'

# whether to use the Karras sigma schedule
use_karras_sigmas = True

# group scheduler arguments
sampler_kwargs = {
    'scheduler_kls': sampler_kls,
    'use_karras_sigmas': use_karras_sigmas,
}
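
For context on the `use_karras_sigmas` flag, below is a minimal sketch of the noise-level spacing proposed by Karras et al. (2022). It interpolates between the largest and smallest noise levels in `sigma ** (1 / rho)` space, which concentrates steps at low noise. The `sigma_min` and `sigma_max` values are illustrative assumptions, and how `dynamic_cfg` wires this schedule into the sampler is not shown here.

```python
from typing import List


def karras_sigmas(n: int, sigma_min: float = 0.03,
                  sigma_max: float = 14.6, rho: float = 7.0) -> List[float]:
    """Returns `n` (n >= 2) decreasing noise levels from sigma_max to sigma_min."""
    min_inv = sigma_min ** (1 / rho)
    max_inv = sigma_max ** (1 / rho)
    # linear interpolation in the warped space, then warp back
    return [(max_inv + i / (n - 1) * (min_inv - max_inv)) ** rho
            for i in range(n)]


sigmas = karras_sigmas(50)
print('first three:', [round(s, 3) for s in sigmas[:3]])
print('last three: ', [round(s, 3) for s in sigmas[-3:]])
```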

# Gathering Stable Diffusion models

For now, the `k_diffusion` integration is only working with the full, `768-v` model. The plan is to eventually support the base model as well.

In [None]:
# to load Stable Diffusion v2-1 with our chosen sampler
model_name = 'stabilityai/stable-diffusion-2-1'
model_kwargs = {'unet_attn_slice': True,
                'schedule_kwargs': sampler_kwargs,}

# device and precision for the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dtype = torch.float16
revision = "fp16"

# Creating Guidance schedules

## Schedule parameters  

Given how much the prompts have changed in v2, we are back in exploration territory as to what are the best parameters. Exciting times!  

Overall, it seems that the Guidance range is broader in v2. Folks are getting good results with low CFGs (3-5) or with higher values (9+). This is likely highly dependent on both the prompt and negative-prompt. We should know more as the stability.ai team releases their guides and tips.  

The functions below quickly build different Guidance schedules. They are also re-used from the [previous notebooks](https://enzokro.dev/blog/posts/2022-11-26-guidance-expts-8/). 

## Static baselines

First we create the constant, baseline Guidances.  
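
A constant baseline is simply the same Guidance value repeated at every diffusion step. The helper below is a hypothetical stand-in to show the shape of such a schedule. It is not the `dynamic_cfg` implementation, and the value of `9` mirrors the `max_val` used for the baseline run later in this notebook.

```python
from typing import List


def constant_schedule(num_steps: int, value: float) -> List[float]:
    """Repeats the same Guidance `value` for every one of the `num_steps`."""
    return [float(value)] * num_steps


# one Guidance value per diffusion step
baseline = constant_schedule(num_steps=50, value=9)
print(len(baseline), baseline[:3])
```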

## Improving the baseline with scheduled Guidance

Now we build the most promising dynamic schedule: `Inverse kDecay` with a fast warmup.  
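
Here is a minimal sketch of what such a schedule could look like. It assumes a linear warmup up to the maximum Guidance value, followed by a cosine decay down to a minimum value, with the k-decay idea of raising the progress term to a power `k`. The function name, its arguments, and the default values are illustrative assumptions, not the `dynamic_cfg` implementation (the notebook imports `get_cos_sched` for the real one).

```python
import math
from typing import List


def cosine_guidance(num_steps: int, max_val: float = 9.0, min_val: float = 5.0,
                    k_decay: float = 1.0, warmup_steps: int = 0,
                    warmup_init: float = 0.0) -> List[float]:
    """Linear warmup to `max_val`, then a k-decay cosine down to `min_val`."""
    sched = []
    decay_steps = num_steps - warmup_steps
    for t in range(num_steps):
        if t < warmup_steps:
            # linear ramp from `warmup_init` toward `max_val`
            g = warmup_init + (max_val - warmup_init) * t / warmup_steps
        else:
            # fraction of the decay phase completed, warped by `k_decay`
            progress = ((t - warmup_steps) / max(1, decay_steps - 1)) ** k_decay
            g = min_val + 0.5 * (max_val - min_val) * (1 + math.cos(math.pi * progress))
        sched.append(g)
    return sched


plain = cosine_guidance(50, k_decay=1.0, warmup_steps=5)
inverse = cosine_guidance(50, k_decay=0.2, warmup_steps=5)
print('plain cosine :', [round(g, 2) for g in plain[::10]])
print('k_decay = 0.2:', [round(g, 2) for g in inverse[::10]])
```

In this sketch, values of `k_decay` below `1` make the progress term grow quickly, so the Guidance drops toward `min_val` earlier than with a plain cosine.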

# Function to run the experiments

The code below loads the v2 Stable Diffusion model. It's also our harness to easily run many different experiments. 

In [None]:
def load_sd_model(model_name, device, dtype, revision, model_kwargs=None):
    '''Loads the given `model_name` Stable Diffusion in `dtype` precision.  
    
    The model is placed on the `device` hardware. 
    Optional `model_kwargs` are passed to the model's load function.
    '''
    # avoid a shared mutable default argument
    model_kwargs = model_kwargs or {}
    pipeline = MinimalDiffusion(model_name, device, dtype, revision, **model_kwargs)
    pipeline.load()
    return pipeline


# load the current Diffusion model
pipeline = load_sd_model(model_name, device, dtype, revision, model_kwargs=model_kwargs)

In [None]:
def run(pipeline, prompt, cfg_runs, gen_kwargs=None,
        norm_name='', show_each=False, test_run=False):
    """Runs a dynamic Classifier-free Guidance experiment. 
    
    Generates an image for the text `prompt` for each config in `cfg_runs`.
    Uses the `DynamicCFG` Guidance Transformation class from `dynamic_cfg`.  
    Each config's own `norm_name` overrides the `norm_name` argument.
    Stores the output images with a matching title for plotting. 
    Optionally shows each image as it's generated.
    If `test_run` is true, it runs a single schedule for testing. 
    """
    # avoid a shared mutable default argument
    gen_kwargs = gen_kwargs or {}
    # store generated images and their title (the experiment name)
    images, titles = [], []
    
    # optionally run a single test schedule
    if test_run:
        print('Running a single schedule for testing.')
        cfg_runs = cfg_runs[:1]
        
    # run all schedule experiments
    for i,cfg in enumerate(cfg_runs):
        
        # parse out the title for the current run
        cur_title  = cfg['title']
        titles.append(cur_title)
        
        # create the guidance transformation 
        sched_name = cfg['sched_name']
        norm_name = cfg['norm_name']
        guide_tfm = DynamicCFG(norm_name, sched_name)
        # update its schedule parameters
        guide_tfm.update_sched_kwargs(cfg['params'])
        
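        # count the runs, not the keys of the current config dict
        print(f'Running generation [{i+1} of {len(cfg_runs)}]: {cur_title}...')
        # autocast expects a device type string ('cuda' or 'cpu'), not a torch.device
        with torch.no_grad(), torch.autocast(device.type):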
            img = pipeline.generate(prompt, dynamic_cfg=guide_tfm, **gen_kwargs)

        # store the generated image
        images.append(img)
        # optionally plot each generated image
        if show_each:
            show_image(img, scale=1)
            
    print('Done.')
    return {'images': images,
            'titles': titles}

# Generating the images

We put together all of the pieces above to generate images with Stable Diffusion v2, both with our static baseline Guidance and with the `Inverse-kDecay` schedules. 

In [None]:
# stores the generated images
outputs = {}

# schedules to run
baseline_expts = [
    {'sched_name': 'constant',
     'title': 'constant_guidance',
     'params': {'max_val': 9},
     'norm_name': 'no_norm',
    }
]

# different schedules
sched_expts = [

    {'sched_name': 'linear',
     'title': 'linear_guide_minVal_5',
     'params': {'min_val': 5},
     'norm_name': 'no_norm',
    },


    {'sched_name': 'cosine',
     'title': 'cos_guide_kdecay_01',
     'params': {'k_decay': 0.1, 'min_val': 5},
     'norm_name': 'no_norm',
    },


    {'sched_name': 'cosine',
     'title': 'cos_guide_kdecay_02',
     'params': {'k_decay': 0.2, 'min_val': 5},
     'norm_name': 'no_norm',
    },


]


# view some info about the run
print(f'Running model: {model_name}')
print(f'Generation kwargs: {gen_kwargs}')
print(f'Using prompt: {prompt}')


# run the baseline, static Guidance
baseline_res = run(pipeline, prompt, baseline_expts, gen_kwargs=gen_kwargs)
outputs[(model_name,'baseline')] = baseline_res

# run the scheduled Guidances
sched_res = run(pipeline, prompt, sched_expts, gen_kwargs=gen_kwargs)
outputs[(model_name,'scheduled')] = sched_res

                            
# cleanup GPU memory
del pipeline
gc.collect()
torch.cuda.empty_cache()

Running model: {'model_name': 'stabilityai/stable-diffusion-2-1', 'model_kwargs': {'unet_attn_slice': True, 'schedule_kwargs': {'scheduler_kls': 'k_dpmpp_sde', 'use_karras_sigmas': True}}}
Generation kwargs: {'height': 768, 'width': 768, 'negative_prompt': '(ugly, cartoon, bad anatomy, bad art, frame, deformed, disfigured, extra limbs, text, meme, low quality, mutated, ordinary, overexposed, pixelated, poorly drawn, signature, thumbnail, too dark, too light, unattractive, useless, watermark, writing, cropped:1.1)', 'num_steps': 50}
Using prompt: a futuristic metropolis collapsed by the beach on a caribbean island, dystopia, apocalyptic, sci-fi, disaster, art station, misery, cinematic, hdri, matte painting, concept art, soft render, highly detailed, cgsociety, octane render, trending on artstation, architectural HQ, 4k
Enabling default unet attention slicing.
Using k-diffusion sampler: <dynamic_cfg.kdiff.DPMPPSDESampler object at 0x7fe6dde93f40>
Using Guidance Normalization: no_norm
Ru

  0%|          | 0/50 [00:00<?, ?it/s]

Done.
Using Guidance Normalization: no_norm
Running experiment [1 of 2]: Param: "k_decay", val=0.1...
Using negative prompt: (ugly, cartoon, bad anatomy, bad art, frame, deformed, disfigured, extra limbs, text, meme, low quality, mutated, ordinary, overexposed, pixelated, poorly drawn, signature, thumbnail, too dark, too light, unattractive, useless, watermark, writing, cropped:1.1)
Using Karras sigma schedule


# Results

In [None]:
#| echo: false
# names of all the models we tried
model_names = [
    'stabilityai/stable-diffusion-2-1',
    
    ##TODO: support base model
    # 'stabilityai/stable-diffusion-2-base',
]

# plot dimensions
plot_height, plot_width = height, width
# for the grid layout
num_scheds = 3
num_rows = 1

def plot_all_results(model_name):
    types = [
        'baseline', 
        'scheduled',
    ]
    mres = [(outputs[(model_name,t)], t) for t in types]
    for i in range(num_scheds):
        image_grid(
            [mres[0][0]['images'][0]] + [o[0]['images'][i] for o in mres[1:]], 
            title=[mres[0][0]['titles'][0]] + [f"{o[0]['titles'][i]}_{o[1]}" for o in mres[1:]],
            rows=num_rows, width=plot_width, height=plot_height
        )
        plt.suptitle(f'Model: {model_name}')

## Stable Diffusion v2 images 

Here we plot all of the generated images.  

The image on the left is the baseline with a static, constant Guidance.\
The images on the right are the improvements with Guidance scheduling, specifically the `Inverse-kDecay` cosine schedules with different values of `k`.

In [None]:
plot_all_results(model_name)

# Conclusion

In this notebook we checked whether scheduling the Classifier-free Guidance improves the images generated by Stable Diffusion v2.  

At first glance, it seems that scheduling still helps! The scheduled generations have a lot more buildings and details. They also seem to follow the prompt better.   