# Stable Diffusion XL with Neuronx: LoRA adapters

`🤗 Optimum` extends `🤗 Diffusers` to support inference on the second generation of Neuron devices (powering Trainium and Inferentia 2). It aims to bring the ease of use of Diffusers to Neuron.

To get started, make sure you have [configured your inf2 / trn1 instance](https://huggingface.co/docs/optimum-neuron/installation) and installed `optimum` with the `neuronx` and `diffusers` extras:

In [None]:
pip install "optimum[neuronx, diffusers]" matplotlib
pip install -U peft

## Compilation

To deploy SDXL models, we first need to compile them. The following pipeline components can be exported to speed up inference:

* Text encoder
* Second text encoder
* U-Net (three times larger than the U-Net in the Stable Diffusion pipeline)
* VAE encoder
* VAE decoder

You can compile and export a Stable Diffusion XL checkpoint either via the CLI or with the `NeuronStableDiffusionXLPipeline` class.
In this tutorial, we will export [`stabilityai/stable-diffusion-xl-base-1.0`](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) with the API.

In [1]:
from optimum.neuron import NeuronStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
adapter_id = "lora"
input_shapes = {"batch_size": 1, "height": 1024, "width": 1024, "num_images_per_prompt": 1}

# Compile
pipe = NeuronStableDiffusionXLPipeline.from_pretrained(
    model_id,
    export=True,
    inline_weights_to_neff=True,  # Caveat: performance drops when the NEFF and the weights are kept separate; a future Neuron SDK release is expected to improve this.
    lora_model_ids=adapter_id,
    lora_weight_names="pytorch_lora_weights.safetensors",
    lora_adapter_names="sttirum",
    **input_shapes,
)

# Save locally or upload to the Hugging Face Hub
save_directory = "sd_neuron_xl/"

pipe.save_pretrained(save_directory) 

Keyword arguments {'subfolder': '', 'use_auth_token': None, 'trust_remote_code': False} are not expected by StableDiffusionXLImg2ImgPipeline and will be ignored.


Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]

Applying optimized attention score computation for sdxl.


2024-07-18 18:28:11.000134:  8647  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.13.66.0+6dfecc895/MODULE_66bfa5b81beb10729c0d not found in aws-neuron/optimum-neuron-cache: 404 Client Error. (Request ID: Root=1-66995ebb-5aea939c1328809365016e9e;ac699e2f-f6f7-4887-9d79-dc783f93e5b6)

Entry Not Found for url: https://huggingface.co/api/models/aws-neuron/optimum-neuron-cache/tree/main/neuronxcc-2.13.66.0%2B6dfecc895%2FMODULE_66bfa5b81beb10729c0d?recursive=True&expand=False.
neuronxcc-2.13.66.0+6dfecc895/MODULE_66bfa5b81beb10729c0d does not exist on "main" 
The model will be recompiled.
***** Compiling text_encoder *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16


2024-07-18T18:28:27Z Running DoNothing
2024-07-18T18:28:27Z DoNothing finished after 0.000 seconds
2024-07-18T18:28:27Z Running AliasDependencyInduction
2024-07-18T18:28:27Z AliasDependencyInduction finished after 0.003 seconds
2024-07-18T18:28:27Z Running CanonicalizeIR
2024-07-18T18:28:27Z CanonicalizeIR finished after 0.010 seconds
2024-07-18T18:28:27Z Running LegalizeCCOpLayout
2024-07-18T18:28:27Z LegalizeCCOpLayout finished after 0.010 seconds
2024-07-18T18:28:27Z Running ResolveComplicatePredicates
2024-07-18T18:28:27Z ResolveComplicatePredicates finished after 0.009 seconds
2024-07-18T18:28:27Z Running AffinePredicateResolution
2024-07-18T18:28:27Z AffinePredicateResolution finished after 0.010 seconds
2024-07-18T18:28:27Z Running EliminateDivs
2024-07-18T18:28:27Z EliminateDivs finished after 0.010 seconds
2024-07-18T18:28:27Z Running PerfectLoopNest
2024-07-18T18:28:27Z PerfectLoopNest finished after 0.010 seconds
2024-07-18T18:28:27Z Running Simplifier
2024-07-18T18:28:27Z S

[Compilation Time] 36.69 seconds.
***** Compiling text_encoder_2 *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16


2024-07-18T18:29:06Z Running DoNothing
2024-07-18T18:29:06Z DoNothing finished after 0.000 seconds
2024-07-18T18:29:06Z Running AliasDependencyInduction
2024-07-18T18:29:06Z AliasDependencyInduction finished after 0.007 seconds
2024-07-18T18:29:06Z Running CanonicalizeIR
2024-07-18T18:29:06Z CanonicalizeIR finished after 0.027 seconds
2024-07-18T18:29:06Z Running LegalizeCCOpLayout
2024-07-18T18:29:06Z LegalizeCCOpLayout finished after 0.027 seconds
2024-07-18T18:29:06Z Running ResolveComplicatePredicates
2024-07-18T18:29:06Z ResolveComplicatePredicates finished after 0.024 seconds
2024-07-18T18:29:06Z Running AffinePredicateResolution
2024-07-18T18:29:06Z AffinePredicateResolution finished after 0.026 seconds
2024-07-18T18:29:06Z Running EliminateDivs
2024-07-18T18:29:06Z EliminateDivs finished after 0.024 seconds
2024-07-18T18:29:06Z Running PerfectLoopNest
2024-07-18T18:29:06Z PerfectLoopNest finished after 0.025 seconds
2024-07-18T18:29:06Z Running Simplifier
2024-07-18T18:29:06Z S

[Compilation Time] 108.89 seconds.
***** Compiling unet *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16


2024-07-18T18:31:45Z Running DoNothing
2024-07-18T18:31:45Z DoNothing finished after 0.000 seconds
2024-07-18T18:31:45Z Running AliasDependencyInduction
2024-07-18T18:31:45Z AliasDependencyInduction finished after 0.021 seconds
2024-07-18T18:31:45Z Running CanonicalizeIR
2024-07-18T18:31:45Z CanonicalizeIR finished after 0.071 seconds
2024-07-18T18:31:45Z Running LegalizeCCOpLayout
2024-07-18T18:31:45Z LegalizeCCOpLayout finished after 0.077 seconds
2024-07-18T18:31:45Z Running ResolveComplicatePredicates
2024-07-18T18:31:45Z ResolveComplicatePredicates finished after 0.072 seconds
2024-07-18T18:31:45Z Running AffinePredicateResolution
2024-07-18T18:31:45Z AffinePredicateResolution finished after 0.072 seconds
2024-07-18T18:31:45Z Running EliminateDivs
2024-07-18T18:31:45Z EliminateDivs finished after 0.072 seconds
2024-07-18T18:31:45Z Running PerfectLoopNest
2024-07-18T18:31:45Z PerfectLoopNest finished after 0.072 seconds
2024-07-18T18:31:45Z Running Simplifier
2024-07-18T18:31:45Z S

[Compilation Time] 1007.33 seconds.
***** Compiling vae_encoder *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16


2024-07-18T18:47:33Z Running DoNothing
2024-07-18T18:47:33Z DoNothing finished after 0.000 seconds
2024-07-18T18:47:33Z Running AliasDependencyInduction
2024-07-18T18:47:33Z AliasDependencyInduction finished after 0.001 seconds
2024-07-18T18:47:33Z Running CanonicalizeIR
2024-07-18T18:47:33Z CanonicalizeIR finished after 0.007 seconds
2024-07-18T18:47:33Z Running LegalizeCCOpLayout
2024-07-18T18:47:33Z LegalizeCCOpLayout finished after 0.006 seconds
2024-07-18T18:47:33Z Running ResolveComplicatePredicates
2024-07-18T18:47:33Z ResolveComplicatePredicates finished after 0.005 seconds
2024-07-18T18:47:33Z Running AffinePredicateResolution
2024-07-18T18:47:33Z AffinePredicateResolution finished after 0.006 seconds
2024-07-18T18:47:33Z Running EliminateDivs
2024-07-18T18:47:33Z EliminateDivs finished after 0.005 seconds
2024-07-18T18:47:33Z Running PerfectLoopNest
2024-07-18T18:47:33Z PerfectLoopNest finished after 0.006 seconds
2024-07-18T18:47:33Z Running Simplifier
2024-07-18T18:47:33Z S

[Compilation Time] 304.23 seconds.
***** Compiling vae_decoder *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16


2024-07-18T18:52:35Z Running DoNothing
2024-07-18T18:52:35Z DoNothing finished after 0.000 seconds
2024-07-18T18:52:35Z Running AliasDependencyInduction
2024-07-18T18:52:35Z AliasDependencyInduction finished after 0.002 seconds
2024-07-18T18:52:35Z Running CanonicalizeIR
2024-07-18T18:52:35Z CanonicalizeIR finished after 0.008 seconds
2024-07-18T18:52:35Z Running LegalizeCCOpLayout
2024-07-18T18:52:35Z LegalizeCCOpLayout finished after 0.008 seconds
2024-07-18T18:52:35Z Running ResolveComplicatePredicates
2024-07-18T18:52:35Z ResolveComplicatePredicates finished after 0.007 seconds
2024-07-18T18:52:35Z Running AffinePredicateResolution
2024-07-18T18:52:35Z AffinePredicateResolution finished after 0.008 seconds
2024-07-18T18:52:35Z Running EliminateDivs
2024-07-18T18:52:35Z EliminateDivs finished after 0.007 seconds
2024-07-18T18:52:35Z Running PerfectLoopNest
2024-07-18T18:52:35Z PerfectLoopNest finished after 0.007 seconds
2024-07-18T18:52:35Z Running Simplifier
2024-07-18T18:52:35Z S

[Compilation Time] 679.65 seconds.
[Total compilation Time] 2136.79 seconds.


2024-07-18 19:03:52.000073:  8647  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


Model cached in: /var/tmp/neuron-compile-cache/neuronxcc-2.13.66.0+6dfecc895/MODULE_66bfa5b81beb10729c0d.
Loading only U-Net into both Neuron Cores...
Saving the ('text_encoder', 'text_encoder_2', 'unet', 'vae_encoder', 'vae_decoder')...
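
To see where the compilation time goes, the per-component timings printed in the log above can be tabulated. The snippet below is a small sketch that simply copies those five numbers by hand (it does not parse the log) and prints each component's share of the total. The helper name `share` is made up for this illustration. On this run, the U-Net and the VAE decoder together account for close to 80% of the roughly 36 minutes of compilation.

```python
# Per-component compile times in seconds, copied by hand from the log above.
compile_times = {
    "text_encoder": 36.69,
    "text_encoder_2": 108.89,
    "unet": 1007.33,
    "vae_encoder": 304.23,
    "vae_decoder": 679.65,
}

total = sum(compile_times.values())


def share(seconds, total_seconds):
    """Return the percentage of total_seconds taken by seconds."""
    return 100.0 * seconds / total_seconds


# Print the slowest components first.
for name, seconds in sorted(compile_times.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15} {seconds:8.2f} s  {share(seconds, total):5.1f} %")

print(f"{'total':<15} {total:8.2f} s  ({total / 60:.1f} min)")
```

These figures come from a single run on one instance, so they are only a rough guide to what to expect elsewhere.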



We recommend `inf2.8xlarge` or larger for compilation. You can also compile the models on a CPU-only instance *(needs ~92GB memory)* using the CLI with `--disable-validation`, which skips the validation of inference on Neuron devices.
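
As a rough sketch of the CLI route, the snippet below assembles an `optimum-cli export neuron` command from the same input shapes used in the API call and prints it, without running anything. The helper `build_export_command` is hypothetical. The `--task stable-diffusion-xl` value and the shape flag names are written here from memory of the Optimum Neuron CLI, so treat them as assumptions and check them against `optimum-cli export neuron --help` for your installed version. The LoRA-related options are deliberately left out for the same reason.

```python
import shlex


def build_export_command(model_id, output_dir, input_shapes, disable_validation=False):
    """Assemble (but do not run) an export command as a list of arguments.

    Flag names are assumptions to verify with `optimum-cli export neuron --help`.
    """
    cmd = [
        "optimum-cli", "export", "neuron",
        "--model", model_id,
        "--task", "stable-diffusion-xl",  # assumed task name for SDXL
    ]
    # One flag per static input shape, e.g. `--height 1024`.
    for name, value in input_shapes.items():
        cmd += [f"--{name}", str(value)]
    if disable_validation:
        # Skips the inference check on Neuron devices (for CPU-only hosts).
        cmd.append("--disable-validation")
    cmd.append(output_dir)
    return cmd


input_shapes = {"batch_size": 1, "height": 1024, "width": 1024, "num_images_per_prompt": 1}
cmd = build_export_command(
    "stabilityai/stable-diffusion-xl-base-1.0",
    "sd_neuron_xl/",
    input_shapes,
    disable_validation=True,
)
print(shlex.join(cmd))
```

Once the flags are confirmed, the printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)`.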

In the following section, we will run the pre-compiled model on Neuron devices. To reduce expenses, you can run inference on an `inf2.xlarge` instance.

## Text-to-image Inference

If you have pre-compiled Stable Diffusion XL models, you can load them directly to skip the compilation: 
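
Before loading, it can be worth checking that the saved directory looks complete, since recompiling takes a long time. The sketch below assumes that `save_pretrained` writes one subfolder per compiled component, named as in the "Saving the (...)" line of the compilation log. That layout is an assumption, and `missing_components` is a hypothetical helper, not part of `optimum`.

```python
from pathlib import Path

# Component names taken from the "Saving the (...)" log line; the assumption
# here is that each one is stored in a subfolder of the same name.
EXPECTED_COMPONENTS = ("text_encoder", "text_encoder_2", "unet", "vae_encoder", "vae_decoder")


def missing_components(save_directory, expected=EXPECTED_COMPONENTS):
    """Return the expected component names that have no subfolder in save_directory."""
    root = Path(save_directory)
    return [name for name in expected if not (root / name).is_dir()]


missing = missing_components("sd_neuron_xl")
if missing:
    print(f"Missing component folders: {missing}")
else:
    print("All expected component folders found.")
```

This only checks for the folders, not that the compiled artifacts inside them are valid.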

In [2]:
from optimum.neuron import NeuronStableDiffusionXLPipeline

stable_diffusion_xl = NeuronStableDiffusionXLPipeline.from_pretrained("sd_neuron_xl")  # Pass a local path or your repo id on the Hugging Face Hub.

Loading only U-Net into both Neuron Cores...


In [13]:
import torch

# Run pipeline
prompt = """
photo of <<TOK>> , 3d portrait, ultra detailed, gorgeous, 3d zbrush, trending on dribbble, 8k render
"""

negative_prompt = """
ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, blurry, bad anatomy, blurred, 
watermark, grainy, signature, cut off, draft, amateur, multiple, gross, weird, uneven, furnishing, decorating, decoration, furniture, text, poor, low, basic, worst, juvenile, 
unprofessional, failure, crayon, oil, label, thousand hands
"""

seed = 491057365
generator = [torch.Generator().manual_seed(seed)]

image = stable_diffusion_xl(
    prompt,
    num_inference_steps=50,
    guidance_scale=7,
    negative_prompt=negative_prompt,
    generator=generator,
).images[0]

  0%|          | 0/50 [00:00<?, ?it/s]