# Latent Walk Diffusion

In this notebook, we take a look at walking through the latent space of a diffusion model. Generative models, like the ones used in Stable Diffusion (SD), learn a latent representation of the world: a low-dimensional vector space embedding of the data they were trained on. In the case of SD, this latent representation is learnt by training on text-image pairs. The representation is used to generate samples given a prompt and a random noise vector. The model tries to predict and remove noise from the random noise vector, while also aligning the result with the prompt. This gives the latent space some interesting properties, which we explore in this notebook.

Stable Diffusion models (at least, the models used here) learn two latent representations: one of the text space for prompts, and one of the image space. These latent representations are continuous. If we choose two vectors in the latent space to sample from, the resulting images are about as similar or as different as the chosen vectors are. This is the basis of latent walking. We can choose two vectors in the latent space and sample from points along the path between them. This results in a smooth transition between the two images.
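To make the idea of a "path between two vectors" concrete, here is a minimal NumPy sketch of a straight-line walk between two toy vectors. The helper name `linear_path` and the 4-dimensional vectors are made up for illustration and are not part of stablefused; real latents are much larger tensors.

```python
import numpy as np


def linear_path(start, end, num_steps):
    """Return num_steps points evenly spaced on the straight line from start to end."""
    weights = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - w) * start + w * end for w in weights]


# Two toy "latent" vectors (hypothetical stand-ins for real latents)
rng = np.random.default_rng(0)
start = rng.standard_normal(4)
end = rng.standard_normal(4)

path = linear_path(start, end, num_steps=5)

# Consecutive points are equally far apart, which is what makes the walk smooth
step_sizes = [np.linalg.norm(b - a) for a, b in zip(path[:-1], path[1:])]
print(np.round(step_sizes, 4))
```

Decoding every point on such a path with the same model is what produces a gradual transition from one image to the other.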

### Install and Import required packages

In [None]:
%pip install stablefused ipython

In [None]:
import numpy as np
import torch

from IPython.display import Video, display
from PIL import Image
from stablefused import LatentWalkDiffusion, TextToImageDiffusion
from stablefused.utils import image_grid, pil_to_video

### Initialize model and parameters

We use RunwayML's Stable Diffusion 1.5 checkpoint and initialize our Latent-Walk and Text-to-Image Diffusion models. Play around with different prompts and parameters, and see what you get! You can comment out the parts that use seeds to generate random images each time you run the notebook.

We use the following mechanisms to trade speed for a reduced memory footprint. They allow us to work with bigger images and larger batch sizes with only about 6 GB of GPU memory.
- U-Net Attention Slicing: Allows the internal U-Net model to perform computations for attention heads sequentially, rather than in parallel.
- VAE Slicing: Allows tensor slicing for the VAE decode step. This causes the VAE to split the input tensor and compute the decoding in multiple steps.
- VAE Tiling: Allows tensor tiling for the VAE. This causes the VAE to split the input tensor into tiles and compute the encoding/decoding in several steps.
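The attention slicing idea can be illustrated with a small NumPy sketch. This is not the code that runs inside the U-Net; it is a toy version, with made-up function names and tensor sizes, showing that processing heads one at a time gives the same result while only ever holding one head's score matrix in memory.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_all_heads(q, k, v):
    """All heads at once: the score tensor has shape (heads, tokens, tokens)."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def attention_sliced(q, k, v):
    """One head at a time: only a (tokens, tokens) score matrix exists at any moment."""
    out = np.empty_like(v)
    for h in range(q.shape[0]):
        scores = q[h] @ k[h].T / np.sqrt(q.shape[-1])
        out[h] = softmax(scores) @ v[h]
    return out


rng = np.random.default_rng(0)
heads, tokens, dim = 8, 64, 40  # toy sizes, chosen arbitrarily
q, k, v = (rng.standard_normal((heads, tokens, dim)) for _ in range(3))

full = attention_all_heads(q, k, v)
sliced = attention_sliced(q, k, v)
print(np.allclose(full, sliced))
```

The sliced version does the same arithmetic in a loop, so it is slower but its peak memory for attention scores is smaller by a factor equal to the number of heads.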

Also, notice how we are loading the same model twice. That should use twice the memory, right? Well, in most cases, users stick to using the same model checkpoints across different inference pipelines, and so it makes sense to share the internal models. StableFused maintains an internal model cache which allows all internal models to be shared, in order to save memory.
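As a rough illustration of how such sharing can work, here is a minimal sketch of a cache keyed by checkpoint and component name. The `ModelCache` class and the `load_unet` loader are hypothetical and are not StableFused's actual implementation; the point is only that a second request for the same key returns the already-loaded object.

```python
class ModelCache:
    """Hypothetical cache: load each (checkpoint, component) pair at most once."""

    def __init__(self):
        self._models = {}

    def get(self, key, loader):
        # Only call the (expensive) loader the first time a key is requested
        if key not in self._models:
            self._models[key] = loader()
        return self._models[key]


load_calls = []


def load_unet():
    # Stand-in for an expensive checkpoint load; records that it was called
    load_calls.append("unet")
    return object()


cache = ModelCache()
key = ("runwayml/stable-diffusion-v1-5", "unet")

first = cache.get(key, load_unet)   # loads the component
second = cache.get(key, load_unet)  # reuses the cached object

print(first is second, len(load_calls))
```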

In [None]:
model_id = "runwayml/stable-diffusion-v1-5"
lw_model = LatentWalkDiffusion(model_id=model_id, torch_dtype=torch.float16)

lw_model.enable_attention_slicing()
lw_model.enable_slicing()
lw_model.enable_tiling()

t2i_model = TextToImageDiffusion(model_id=model_id, torch_dtype=torch.float16)

Prompt Credits: https://mspoweruser.com/best-stable-diffusion-prompts/#6_The_Robotic_Baroque_Battle


In [None]:
prompt = "Large futuristic mechanical robot in the foreground of a baroque-style battle scene, photorealistic, high quality, 8k"
negative_prompt = "cartoon, unrealistic, blur, boring background, deformed, disfigured, low resolution, unattractive"
num_images = 4
seed = 44

torch.manual_seed(seed)
np.random.seed(seed)

In [None]:
# There seems to be a bug in stablefused that still needs to be investigated.
# It causes images created using latent walk to be very noisy, plain white,
# deformed, etc.
# This is why, instead of generating latents from an actual image, we need
# to create them ourselves. In the future, this notebook will be updated
# to allow latent walking for user-chosen images.

# filename = "the-robotic-baroque-battle.png"
# start_image = [Image.open(filename)] * num_images

# # This step is only required when loading model with torch.float16 dtype
# start_image = np.array(start_image, dtype=np.float16)

# latent = lw_model.image_to_latent(start_image)

image_height = 512
image_width = 512
shape = (
    1,
    lw_model.unet.config.in_channels,
    image_height // lw_model.vae_scale_factor,
    image_width // lw_model.vae_scale_factor,
)
single_latent = lw_model.random_tensor(shape)
latent = single_latent.repeat(num_images, 1, 1, 1)
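For reference, here is the shape arithmetic from the cell above worked through with plain numbers. It assumes the usual Stable Diffusion 1.5 values of 4 latent channels and a VAE scale factor of 8, and uses NumPy's `np.tile` as a stand-in for the `repeat` call on the tensor.

```python
import numpy as np

in_channels = 4        # assumed value of lw_model.unet.config.in_channels for SD 1.5
vae_scale_factor = 8   # assumed value of lw_model.vae_scale_factor for SD 1.5
image_height = 512
image_width = 512
num_images = 4

shape = (
    1,
    in_channels,
    image_height // vae_scale_factor,
    image_width // vae_scale_factor,
)
print(shape)  # (1, 4, 64, 64)

# One random latent, copied num_images times along the batch dimension
rng = np.random.default_rng(44)
single_latent = rng.standard_normal(shape).astype(np.float32)
latent = np.tile(single_latent, (num_images, 1, 1, 1))
print(latent.shape)  # (4, 4, 64, 64)
```

Because every row of the batch starts from the same latent, any differences between the generated images come from the latent walk itself rather than from different starting noise.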

### Latent Walking to generate similar images

Let's see what our base image for latent walking looks like.

In [None]:
t2i_model(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=20,
    guidance_scale=10.0,
    latent=single_latent,
)[0]

Next, we walk around the neighbourhood of our sampled latent vector and run diffusion on each nearby latent. This generates images similar to the base image. The degree of similarity can be controlled using the `strength` parameter (set between 0 and 1, defaults to 0.2). A lower strength leads to similar images with subtle differences. A higher strength can cause completely new ideas to be generated.
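To build some intuition for why a larger strength gives less similar results, here is a toy sketch that perturbs a latent with scaled Gaussian noise and measures how far it moves. The `perturb` helper and the additive-noise formula are assumptions made for illustration; stablefused may combine the latent and the noise differently.

```python
import numpy as np


def perturb(latent, noise, strength):
    """Hypothetical perturbation: add noise scaled by strength to the latent."""
    return latent + strength * noise


def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(44)
latent = rng.standard_normal((1, 4, 64, 64))
noise = rng.standard_normal((1, 4, 64, 64))

strengths = [0.1, 0.25, 0.5, 1.0]
similarities = [
    cosine_similarity(latent, perturb(latent, noise, s)) for s in strengths
]

for s, sim in zip(strengths, similarities):
    print(f"strength={s:.2f} cosine similarity={sim:.3f}")
```

Under this assumption, the perturbed latent stays close to the original for small strengths and drifts further away as the strength grows, which matches the behaviour described above.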

In [None]:
images = lw_model(
    prompt=[prompt] * num_images,
    negative_prompt=[negative_prompt] * num_images,
    latent=latent,
    strength=0.25,
    num_inference_steps=20,
)

In [None]:
image_grid(images, rows=2, cols=2)

### Generating Videos with Latent Walking

Here, we generate a video by walking the latent space of the model, using interpolation to generate the frames. An interpolation is a weighted combination of two embeddings, with the weights given by some interpolation function. By default, [Linear interpolation](https://en.wikipedia.org/wiki/Linear_interpolation) is used on the prompt embeddings and [Spherical Linear Interpolation](https://en.wikipedia.org/wiki/Slerp) is used on the latent embeddings. You can change the interpolation method by passing the `embedding_interpolation_type` or `latent_interpolation_type` parameters.
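The two interpolation schemes can be sketched in a few lines of NumPy. These `lerp` and `slerp` functions are written for this explanation and are not taken from stablefused, so the library's versions may differ in details such as edge-case handling.

```python
import numpy as np


def lerp(v0, v1, t):
    """Linear interpolation: a straight line from v0 (t=0) to v1 (t=1)."""
    return (1.0 - t) * v0 + t * v1


def slerp(v0, v1, t, eps=1e-7):
    """Spherical linear interpolation: move along the arc between v0 and v1."""
    a, b = v0.ravel(), v1.ravel()
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))  # angle between the vectors
    if np.sin(omega) < eps:
        # Vectors are (anti)parallel, so fall back to a straight line
        return lerp(v0, v1, t)
    w0 = np.sin((1.0 - t) * omega) / np.sin(omega)
    w1 = np.sin(t * omega) / np.sin(omega)
    return w0 * v0 + w1 * v1


rng = np.random.default_rng(123456)
v0 = rng.standard_normal((4, 64, 64))
v1 = rng.standard_normal((4, 64, 64))

endpoint_norm = 0.5 * (np.linalg.norm(v0) + np.linalg.norm(v1))
lerp_mid_norm = np.linalg.norm(lerp(v0, v1, 0.5))
slerp_mid_norm = np.linalg.norm(slerp(v0, v1, 0.5))

print(f"endpoints: {endpoint_norm:.1f}")
print(f"lerp midpoint: {lerp_mid_norm:.1f}")
print(f"slerp midpoint: {slerp_mid_norm:.1f}")
```

For two independent Gaussian noise tensors, the midpoint of the straight line has a noticeably smaller norm than either endpoint, while the slerp midpoint keeps roughly the same norm. That is a common reason for preferring slerp on noise latents, since the model expects inputs with the statistics of Gaussian noise.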

Note that stablefused is a toy library in its infancy. It is not optimized for speed and does not support many features. There are multiple bugs and issues that need to be addressed. Some things currently need to be implemented manually, but I hope to make the process easier in the future.

In [None]:
# Prompt credits: ChatGPT
story_prompt = [
    "A dog chasing a cat in a thrilling backyard scene, high quality and photorealistic",
    "A determined dog in hot pursuit, with stunning realism, octane render",
    "A thrilling chase, dog behind the cat, octane render, exceptional realism and quality",
    "The exciting moment of a cat outmaneuvering a chasing dog, high-quality and photorealistic detail",
    "A clever cat escaping a determined dog and soaring into space, rendered with octane render for stunning realism",
    "The cat's escape into the cosmos, leaving the dog behind in a scene, high quality and photorealistic style",
]

seed = 123456

torch.manual_seed(seed)
np.random.seed(seed)

In [None]:
# There seems to be a bug in stablefused that still needs to be investigated.
# It causes images created using latent walk to be very noisy, plain white,
# deformed, etc.
# This is why, instead of generating latents from an actual image, we need
# to create them ourselves. In the future, this notebook will be updated
# to allow latent walking for user-chosen images.

# t2i_images = t2i_model(
#     prompt = story_prompt,
#     negative_prompt = [negative_prompt] * len(story_prompt),
#     num_inference_steps = 20,
#     guidance_scale = 12.0,
# )

image_height = 512
image_width = 512
shape = (
    len(story_prompt),
    lw_model.unet.config.in_channels,
    image_height // lw_model.vae_scale_factor,
    image_width // lw_model.vae_scale_factor,
)
latent = lw_model.random_tensor(shape)

In [None]:
t2i_images = t2i_model(
    prompt=story_prompt,
    num_inference_steps=20,
    guidance_scale=15.0,
    latent=latent,
)

In [None]:
image_grid(t2i_images, rows=2, cols=3)

In [None]:
# Due to the bug mentioned above, this step is not required.
# We can directly use the latents we generated manually
# np_t2i_images = np.array(t2i_images, dtype = np.float16)
# t2i_latents = t2i_model.image_to_latent(np_t2i_images)

In [None]:
interpolation_steps = 24

# Since stablefused does not support batch processing yet, we need
# to do it manually. This notebook will be updated in the future
# to support batching internally to handle a large number of images

story_images = []
for i in range(len(story_prompt) - 1):
    current_prompt = story_prompt[i : i + 2]
    current_latent = latent[i : i + 2]
    imgs = lw_model.interpolate(
        prompt=current_prompt,
        negative_prompt=[negative_prompt] * len(current_prompt),
        latent=current_latent,
        num_inference_steps=20,
        interpolation_steps=interpolation_steps,
    )
    story_images.extend(imgs)

In [None]:
filename = "dog-chasing-cat-story.mp4"
pil_to_video(story_images, filename, fps=8)
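As a quick sanity check on the video length, the arithmetic below estimates the frame count and duration. It assumes that each `interpolate` call returns exactly `interpolation_steps` frames, which is not verified here; checking `len(story_images)` in the notebook gives the actual count.

```python
num_prompts = 6           # length of story_prompt
interpolation_steps = 24
fps = 8

# Consecutive prompt pairs: (0, 1), (1, 2), ..., (4, 5)
num_segments = num_prompts - 1

# Assumption: every segment contributes interpolation_steps frames
expected_frames = num_segments * interpolation_steps
duration_seconds = expected_frames / fps

print(expected_frames, duration_seconds)  # 120 15.0
```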

In [None]:
display(Video(filename, embed=True))