# Text to Image Diffusion

In this notebook, we use `stablefused` to generate images from text prompts with Stable Diffusion, and then visualize the denoising process as a video.

### Install and import required packages

In [None]:
!pip install stablefused ipython

In [None]:
import numpy as np
import torch

from IPython.display import Video, display
from stablefused import TextToImageDiffusion
from stablefused.utils import image_grid, pil_to_video

### Initialize model and parameters

We load RunwayML's Stable Diffusion 1.5 checkpoint into a `TextToImageDiffusion` model and set a few generation parameters. Play around with different prompts and see what you get! To generate new random images each time you run the notebook, comment out the two seeding lines (`torch.manual_seed` and `np.random.seed`).

In [None]:
model_id = "runwayml/stable-diffusion-v1-5"
model = TextToImageDiffusion(model_id=model_id)

In [None]:
prompt = "Cyberpunk cityscape with towering skyscrapers, neon signs, and flying cars."
negative_prompt = "cartoon, unrealistic, blur, boring background, deformed, disfigured, low resolution, unattractive"
num_inference_steps = 20
seed = 1337

torch.manual_seed(seed)
np.random.seed(seed)
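
To see what seeding buys you, here is a minimal sketch that uses NumPy only and is independent of `stablefused`. The helper `sample_noise` is hypothetical and exists only for this illustration: it re-seeds the global generator and draws Gaussian noise, standing in for the random starting noise of a diffusion run.

```python
import numpy as np


def sample_noise(seed, shape=(2, 3)):
    """Seed NumPy's global RNG, then draw Gaussian noise of the given shape."""
    np.random.seed(seed)
    return np.random.randn(*shape)


first = sample_noise(1337)
second = sample_noise(1337)  # same seed, so the same noise is drawn
other = sample_noise(42)     # different seed, so the noise differs

print(np.array_equal(first, second))  # True
print(np.array_equal(first, other))   # False
```

The same idea applies to `torch.manual_seed`: with a fixed seed the random draws repeat, so re-running the notebook should start from the same noise.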

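You can run Stable Diffusion inference either by calling the model directly, as in `model(...)`, or through its `.generate()` method. Refer to the documentation to see which parameters can be provided. This first call passes only the prompt and the number of inference steps, and we display the first returned image.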

In [None]:
model(prompt=prompt, num_inference_steps=num_inference_steps)[0]

### Visualizing the diffusion process

In [None]:
prompt = [
    "Gothic painting of an ancient castle at night, with a full moon, gargoyles, and shadows",
    "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k",
    "A close up image of a very beautiful woman, aesthetic, realistic, high quality",
    "Concept art for a post-apocalyptic world with ruins, overgrown vegetation, and a lone survivor",
]

We enable attention slicing for the UNet, which processes attention heads sequentially and so reduces peak memory use. We also enable slicing and tiling of the VAE to reduce the memory required to decode latents from latent space into image space.

In [None]:
model.enable_attention_slicing()
model.enable_slicing()
model.enable_tiling()
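
To illustrate the idea behind attention slicing, here is a minimal NumPy sketch. It is not the code that `stablefused` runs; the function names and tensor shapes are assumptions made for the illustration. It computes scaled dot-product attention for all heads at once and then a few heads at a time, and checks that both give the same result.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention over arrays shaped (heads, tokens, dim)."""
    # The score matrix has shape (heads, tokens, tokens); it dominates memory use.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def sliced_attention(q, k, v, slice_size=1):
    """Same computation, but only `slice_size` heads are processed at a time."""
    out = np.empty_like(v)  # q, k and v share one shape in this sketch
    for start in range(0, q.shape[0], slice_size):
        end = start + slice_size
        # Only a (slice_size, tokens, tokens) score matrix exists at any moment.
        out[start:end] = attention(q[start:end], k[start:end], v[start:end])
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16, 4)) for _ in range(3))

full = attention(q, k, v)
sliced = sliced_attention(q, k, v, slice_size=2)
print(np.allclose(full, sliced))  # True
```

The trade-off is speed for memory: the output is unchanged, but the heads are no longer processed in one batched operation.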

Run inference on the different text prompts. Because `prompt` is now a list, we repeat the negative prompt once per prompt. We pass `return_latent_history=True`, which returns the latents from every step of the denoising process rather than only the final result. These latents can then be decoded to images and turned into a video of the denoising process.

In [None]:
images = model(
    prompt=prompt,
    negative_prompt=[negative_prompt] * len(prompt),
    num_inference_steps=num_inference_steps,
    guidance_scale=10.0,
    return_latent_history=True,
)
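
For intuition about `guidance_scale`, here is a minimal sketch of classifier-free guidance as it is commonly formulated for Stable Diffusion. That `stablefused` combines its predictions in exactly this way is an assumption, and `guided_noise` is a hypothetical helper written for this illustration. The prediction conditioned on the negative (or empty) prompt is pushed toward the prediction conditioned on the text prompt.

```python
import numpy as np


def guided_noise(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional (or
    negative-prompt) noise prediction toward the text-conditioned one."""
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)


# Toy two-element "noise predictions", just to make the arithmetic visible.
noise_uncond = np.array([0.0, 1.0])
noise_text = np.array([1.0, 3.0])

print(guided_noise(noise_uncond, noise_text, 1.0))   # equals noise_text
print(guided_noise(noise_uncond, noise_text, 10.0))  # [10. 21.]
```

Under this formulation, a scale of 1 simply returns the text-conditioned prediction, while larger values push the result further away from whatever the negative prompt describes.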

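For each denoising step, we tile the four images (one per prompt) in a 2x2 grid. `zip(*images)` regroups the per-prompt image histories into one tuple of images per step.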

In [None]:
timestep_images = []
for imgs in zip(*images):
    img = image_grid(imgs, rows=2, cols=len(prompt) // 2)
    timestep_images.append(img)
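
The regrouping done by `zip(*images)` can be sketched with plain Python lists. The strings below are hypothetical stand-ins for images, and `grid_layout` is an illustrative helper that assumes a row-by-row fill order, which may or may not match how `image_grid` places its tiles.

```python
# Hypothetical stand-in: history[p][t] is the frame for prompt p at step t.
num_prompts, num_steps = 4, 3
history = [[f"p{p}_t{t}" for t in range(num_steps)] for p in range(num_prompts)]

# Transpose: per_step[t] holds one frame per prompt for denoising step t.
per_step = list(zip(*history))


def grid_layout(items, rows, cols):
    """Arrange a flat sequence row by row into a rows x cols layout."""
    assert len(items) == rows * cols, "grid must have exactly one cell per item"
    return [list(items[r * cols:(r + 1) * cols]) for r in range(rows)]


layout = grid_layout(per_step[0], rows=2, cols=num_prompts // 2)
print(len(per_step))  # 3
print(layout)         # [['p0_t0', 'p1_t0'], ['p2_t0', 'p3_t0']]
```

Note that `cols=len(prompt) // 2` with `rows=2` gives one cell per image only when the number of prompts is even.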

In [None]:
path = "text_to_image_diffusion.mp4"
pil_to_video(timestep_images, path, fps=5)

Tada!

In [None]:
display(Video(path, embed=True))