## Stable Diffusion on SPR with IPEX

This demo runs [Stable Diffusion with the Hugging Face API](https://huggingface.co/stabilityai) and uses the [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) (IPEX) to optimize the model pipeline on Intel's 4th generation Xeon platform (Sapphire Rapids, SPR).

The demo consists of the following steps:

1. Load and define the core SD model components from HF.
2. Set up and run a standard SD pipeline with the HF API, i.e., generate an image in FP32 precision.
3. Optimize SD with IPEX, using Auto Mixed Precision (BF16), and run the pipeline again.
4. Compare the inference latency of the two runs.
5. Run batched inference with the optimized SD pipeline.

**This demo is executed in a Conda\* environment.**

The environment is the latest [Intel® oneAPI AI Analytics Toolkit](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html) PyTorch* environment, which includes Intel® Optimizations for deep learning workflows. See [here](https://software.intel.com/content/www/us/en/develop/articles/installing-ai-kit-with-conda.html) for more installation information.

If Diffusers, Transformers, PyTorch, or IPEX are not yet installed or up to date, please install or update them before continuing. The cell below only upgrades tqdm, which renders the progress bars.

In [13]:
! pip install -U tqdm

Collecting tqdm
  Obtaining dependency information for tqdm from https://files.pythonhosted.org/packages/00/e5/f12a80907d0884e6dff9c16d0c0114d81b8cd07dc3ae54c5e962cc83037e/tqdm-4.66.1-py3-none-any.whl.metadata
  Using cached tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Using cached tqdm-4.66.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm
Successfully installed tqdm-4.66.1


In [14]:
import torch

# Only the classes used in this notebook are imported
from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

import matplotlib.pyplot as plt

import time

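# Define model ID for SD version.
# As written, this expects a local directory with that name. To download the
# weights from the Hugging Face Hub, use "stabilityai/stable-diffusion-2-1-base".
model_id = "stable-diffusion-2-1-base"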

pipe = StableDiffusionPipeline.from_pretrained(model_id)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)


Loading pipeline components...:   0%|          | 0/6 [00:00<?, ?it/s]

**Single image inference**

Next, we call the pipeline with a written description of the desired image, i.e., the text prompt, to generate an image.

The inference process can be made deterministic by setting the generator seed. The number of inference steps controls the trade-off between quality and speed: more steps generally give a better image, while fewer steps return results faster.

Please experiment with your own prompts!

In [15]:
# Define the prompt for the image generation
prompt = "Painting of a frog with hat on a bicycle cycling in New York City at a beautiful dusk with a traffic jam and moody people in the style of Picasso"

# Set the number of iterations for the image generation
n_inf_steps = 10

# Setting seed for deterministic output
seed = 701
generator = torch.Generator("cpu").manual_seed(seed)

# Simple timing of inference
start = time.time()
image = pipe(prompt, num_inference_steps=n_inf_steps, generator=generator).images[0]
end = time.time()
sd_fp32_t = end-start
print(f"Generating one FP32 image took {round(sd_fp32_t, 2)}s")

image.save("frog_test_FP32.png")

  0%|          | 0/10 [00:00<?, ?it/s]

Generating one FP32 image took 8.9s


**Optimization with IPEX**

The UNet is the component of the model architecture that uses the most computational resources during inference. Hence, with IPEX, we optimize it and put it in BF16 precision. The cell below applies the same optimization to the VAE and the text encoder, and then wraps all three components with `torch.compile`.
Please experiment with which components are optimized.
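
To make the precision trade-off concrete: BF16 keeps the sign bit and the 8 exponent bits of FP32 but only 7 of its 23 mantissa bits, so it covers the same numeric range with roughly two to three significant decimal digits. The following is a minimal, standard-library-only sketch that emulates this by rounding a Python float to the nearest BF16 value. The helper name `bf16_round` is hypothetical and not part of PyTorch or IPEX, and NaN and infinity handling is left out.

```python
import struct


def bf16_round(x: float) -> float:
    """Round x to the nearest BF16 value (ties to even) and return it as a float.

    Illustrative helper only; assumes x is finite and within FP32 range.
    """
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 is the upper 16 bits of FP32. Adding this bias before dropping the
    # lower 16 bits rounds to nearest, with ties going to an even mantissa.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 3.14159, 1.0078125):
    print(value, "->", bf16_round(value))
```

Here 3.14159 becomes 3.140625, while 1.0 and 1.0078125 (which is 1 + 2^-7) are exactly representable and stay unchanged. Rounding of this kind, applied to weights and activations, is the numerical price paid for the faster BF16 arithmetic.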

In [16]:
import intel_extension_for_pytorch as ipex

infer_dtype = torch.bfloat16

# Optimize the models w/ IPEX; .eval() puts each one in eval mode first
pipe.unet = ipex.optimize(pipe.unet.eval(), dtype=infer_dtype, inplace=True)
pipe.vae = ipex.optimize(pipe.vae.eval(), dtype=infer_dtype, inplace=True)
pipe.text_encoder = ipex.optimize(pipe.text_encoder.eval(), dtype=infer_dtype, inplace=True)

# Optimize with torch.compile
pipe.unet = torch.compile(pipe.unet)
pipe.vae = torch.compile(pipe.vae)
pipe.text_encoder = torch.compile(pipe.text_encoder)


In [None]:
generator = torch.Generator("cpu").manual_seed(seed)

# Simple timing of inference
start = time.time()
with torch.cpu.amp.autocast():
    image = pipe(prompt, num_inference_steps=n_inf_steps, generator=generator).images[0]
end = time.time()
sd_bf16_t = end-start
print(f"Generating one BF16 image took {round(sd_bf16_t, 2)}s")

image.save("frog_test_BF16.png")

[2023-10-13 16:56:01,133] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing forward
[2023-10-13 16:56:02,744] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo done tracing forward (RETURN_VALUE)
[2023-10-13 16:56:02,784] torch._dynamo.output_graph: [INFO] Step 2: calling compiler function debug_wrapper
[2023-10-13 16:56:10,951] torch._inductor.compile_fx: [INFO] Step 3: torchinductor compiling FORWARDS graph 0
[2023-10-13 16:56:11,032] torch._inductor.graph: [INFO] Creating implicit fallback for:
  target: torch_ipex.ipex_linear.default
  args[0]: TensorBox(StorageBox(
    ComputedBuffer(name='buf3', layout=FlexibleLayout('cpu', torch.bfloat16, size=[1, 77, 1024], stride=[78848, 1024, 1]), data=Pointwise(
      'cpu',
      torch.bfloat16,
      tmp0 = load(arg373_1, i1)
      tmp1 = load(arg0_1, i2 + 1024 * (tmp0))
      tmp2 = load(arg372_1, i1)
      tmp3 = load(arg1_1, i2 + 1024 * (tmp2))
      tmp4 = tmp1 + tmp3
      tmp5 = to_dtype(tmp4, torch.fl

  0%|          | 0/10 [00:00<?, ?it/s]

[2023-10-13 16:56:30,821] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing forward
[2023-10-13 16:56:33,915] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo done tracing forward (RETURN_VALUE)
[2023-10-13 16:56:33,956] torch._dynamo.output_graph: [INFO] Step 2: calling compiler function debug_wrapper


We compare the inference speeds between the two runs.

In [None]:
print(f"Inference with IPEX, using AMP+BF16, was {round(sd_fp32_t/sd_bf16_t, 2)}x faster.")

In [None]:
def plotter(outputdict):
    # Bar chart of inference time per configuration
    plt.figure(figsize=(10, 5))
    plt.bar(list(outputdict.keys()), list(outputdict.values()), color=['#ffd21e', '#0071c5'], width=0.4)
    plt.xlabel("Model")
    plt.ylabel("Inference time (seconds); lower is better")
    plt.show()

In [None]:
outputDict={"Full-precision":sd_fp32_t,"AMP (BF16)":sd_bf16_t}
plotter(outputDict)
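
Each timing above comes from a single call. Because the first call to a `torch.compile`-wrapped module generally triggers compilation (visible in the log output above), a one-shot measurement of the optimized pipeline may include that one-time cost. A minimal sketch of a fairer measurement follows, assuming a hypothetical helper `time_call` that is not part of any library: run a few untimed warm-up calls, then report the median of several timed calls.

```python
import statistics
import time


def time_call(fn, warmup=1, repeats=3):
    """Return the median wall-clock seconds of `repeats` calls to fn().

    `warmup` untimed calls run first so that one-time costs (e.g. compilation)
    are excluded. Hypothetical helper for illustration.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs on its own; in the notebook, fn would
# wrap the pipeline call instead.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"Median time: {elapsed:.6f}s")
```

In this notebook, `fn` could be something like `lambda: pipe(prompt, num_inference_steps=n_inf_steps)`, with the BF16 call placed inside the autocast context. Since the models are optimized in place, the FP32 measurement would have to be taken before the IPEX cell runs.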

**Batched inference**

Finally, we generate a batch of 3 images and report the total and per-image inference time.

In [10]:
from PIL import Image

def image_grid(imgs, rows, cols):
    """Paste equally sized PIL images into a rows x cols grid, row by row."""
    assert len(imgs) == rows*cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))

    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid
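
The `box` argument in `image_grid` places image `i` in row-major order: column `i % cols`, row `i // cols`. As a small illustration, the sketch below computes those top-left offsets without PIL. `grid_positions` is a hypothetical helper, and the 512-pixel tile size is an assumed example value.

```python
def grid_positions(n_images, cols, w, h):
    """Return the top-left (x, y) pixel offset of each image in a row-major grid."""
    return [(i % cols * w, i // cols * h) for i in range(n_images)]


# Three tiles in a single row, as in the batched example.
print(grid_positions(3, cols=3, w=512, h=512))
# Four tiles wrapped into two columns.
print(grid_positions(4, cols=2, w=512, h=512))
```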

In [12]:
num_images = 3

prompt = ["Painting of a frog with hat on a bicycle cycling in New York City at a beautiful dusk with a traffic jam and moody people in the style of Picasso"] * num_images

start = time.time()
with torch.cpu.amp.autocast():
    images = pipe(prompt, num_inference_steps=n_inf_steps).images
end = time.time()
sd_bbf16_t = end-start
print(f"Generating {num_images} BF16 images took {round(sd_bbf16_t, 2)}s. Per image inference time: {round(sd_bbf16_t/num_images, 2)}s.")

# One row with one column per generated image
grid = image_grid(images, rows=1, cols=num_images)

grid.save("frog_batch.png")

  0%|          | 0/10 [00:00<?, ?it/s]

Generating 3 BF16 images took 60.24s. Per image inference time: 20.08s.
