### Check runtime

In [None]:
!nvidia-smi

### Install packages

In [None]:
!pip install -q -U transformers accelerate peft diffusers==0.32.2

### Install PyTorch nightly for the latest and fastest kernels

In [None]:
!pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

### Clear pipeline cache function



In [None]:
import gc
import torch

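def clear_cache():
  # Free unreachable Python objects so their CUDA tensors can be released.
  gc.collect()
  # Return cached, unused GPU memory blocks to the driver.
  torch.cuda.empty_cache()
  # Restart peak-memory tracking for the next run.
  torch.cuda.reset_peak_memory_stats()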
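
Since `clear_cache()` resets the peak memory statistics, a small companion helper could read them back after each run. The sketch below is an illustration only: the name `peak_memory_gb` is hypothetical and not part of the original notebook, and it returns `0.0` on machines without a CUDA device.

```python
import torch

def peak_memory_gb():
  """Peak CUDA memory allocated since the last reset, in GB (hypothetical helper)."""
  if not torch.cuda.is_available():
    # No GPU present: nothing to report.
    return 0.0
  # max_memory_allocated() returns bytes; convert to GB.
  return torch.cuda.max_memory_allocated() / 1024**3

print('Peak memory is {:.2f} GB'.format(peak_memory_gb()))
```

Calling it right after an inference run, and before the next `clear_cache()`, would give a rough memory figure to compare alongside the timing.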

### Run with scaled dot product attention (SDPA)

SDPA is an optimized, memory-efficient attention implementation (similar to xFormers) that automatically enables several other optimizations depending on the model inputs and GPU type. SDPA is enabled by default if you’re using PyTorch 2.0 or later with a recent version of 🤗 Diffusers, so you don’t need to add anything to your code.
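
To make the idea concrete, here is a minimal sketch that compares a hand-written attention computation with PyTorch's fused `torch.nn.functional.scaled_dot_product_attention` on toy tensors. The tensor shapes and the seed are arbitrary assumptions for illustration. It runs on CPU and is not how Diffusers wires SDPA into the pipeline internally.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy tensors shaped (batch, heads, sequence length, head dim); sizes are arbitrary.
q = torch.randn(1, 2, 8, 16)
k = torch.randn(1, 2, 8, 16)
v = torch.randn(1, 2, 8, 16)

# Reference: softmax(Q K^T / sqrt(d)) V, materializing the full attention matrix.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
manual_out = torch.softmax(scores, dim=-1) @ v

# Fused: PyTorch selects an available backend based on the inputs and device.
sdpa_out = F.scaled_dot_product_attention(q, k, v)

max_diff = (manual_out - sdpa_out).abs().max().item()
print('Max abs difference is {:.2e}'.format(max_diff))
```

Both paths should agree up to floating-point rounding. The fused call differs in that it can avoid storing the full attention matrix when a suitable backend is available.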


In [None]:
from time import time
from diffusers import DiffusionPipeline
import torch

# Load the pipeline in bfloat16 and place its model components on CUDA.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16
).to("cuda")

prompt = "One horse in an aquarium, cold color palette, muted colors, detailed, 4k"
negative_prompt = "bad anatomy, bad proportions, missed legs, ugly"

start = time()
image = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=30).images[0]
print('Inference time is {}s'.format(time() - start))

del pipe

In [None]:
image

### Run pipeline with bfloat16 only (SDPA disabled)

There are several benefits of using reduced precision:

*   Using a reduced numerical precision (such as float16 or bfloat16) for inference generally doesn’t noticeably affect generation quality, but it significantly improves latency.

*   The benefits of using bfloat16 compared to float16 are hardware dependent, but modern GPUs tend to favor bfloat16.

*   bfloat16 is much more resilient than float16 when used with quantization, although more recent versions of the quantization library we used (torchao) no longer have numerical issues with float16.
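
One commonly cited difference between the two 16-bit formats is that bfloat16 keeps float32's exponent range while float16 trades range for extra mantissa bits. The sketch below uses `torch.finfo` to show this; the test value `70000.0` is an arbitrary choice just above float16's maximum. It illustrates the number formats only and is not a measurement of generation quality.

```python
import torch

# Compare the range (max) and precision (eps) of each floating-point format.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print('{:15} bits={:2d} max={:.3e} eps={:.3e}'.format(
        str(dtype), info.bits, info.max, info.eps))

# A value just above float16's maximum overflows to inf in float16,
# while bfloat16 keeps it finite (at reduced precision).
large = torch.tensor(70000.0)
fp16_value = large.to(torch.float16)
bf16_value = large.to(torch.bfloat16)
print('float16:', fp16_value.item(), '| bfloat16:', bf16_value.item())
```

The larger `eps` for bfloat16 shows the other side of the trade-off: it has a wider range but coarser precision than float16.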

In [None]:
clear_cache()

# Load the pipeline in bfloat16 and place its model components on CUDA.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16
).to("cuda")

# Run the attention ops without SDPA.
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()

start = time()
image = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=30).images[0]
print('Inference time is {}s'.format(time() - start))

In [None]:
image