# Image generation with Stable Diffusion v3 and OpenVINO

Stable Diffusion v3 is the next generation of the Stable Diffusion family of latent diffusion image models. It outperforms state-of-the-art text-to-image generation systems in typography and prompt adherence, based on human preference evaluations. Unlike previous versions, it is based on the Multimodal Diffusion Transformer (MMDiT) text-to-image architecture, which features greatly improved performance in image quality, typography, complex prompt understanding, and resource efficiency.

![mmdit.png](https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/dd079427-89f2-4d28-a10e-c80792d750bf)

More details about the model can be found in the [model card](https://huggingface.co/stabilityai/stable-diffusion-3-medium), the [research paper](https://stability.ai/news/stable-diffusion-3-research-paper) and the [Stability.AI blog post](https://stability.ai/news/stable-diffusion-3-medium).
In this tutorial, we will consider how to convert Stable Diffusion v3 for running with OpenVINO. An additional part demonstrates how to run optimization with [NNCF](https://github.com/openvinotoolkit/nncf/) to speed up the pipeline.
If you want to run previous Stable Diffusion versions, please check our other notebooks:

* [Stable Diffusion](../stable-diffusion-text-to-image)
* [Stable Diffusion v2](../stable-diffusion-v2)
* [Stable Diffusion v3](../stable-diffusion-v3)
* [Stable Diffusion XL](../stable-diffusion-xl)
* [LCM Stable Diffusion](../latent-consistency-models-image-generation)
* [Turbo SDXL](../sdxl-turbo)
* [Turbo SD](../sketch-to-image-pix2pix-turbo)

#### Table of contents:

- [Prerequisites](#Prerequisites)
- [Build PyTorch pipeline](#Build-PyTorch-pipeline)
    - [Store the Configs](#Store-the-Configs)
- [Run FP Inference](#Run-FP-Inference)
- [Convert models to Torch FX](#Convert-models-to-Torch-FX)
- [Quantization](#Quantization)
    - [Collect Calibration Dataset](#Collect-Calibration-Dataset)
    - [Wrap the Models](#Wrap-the-Models)
    - [Inference for Compilation](#Inference-for-Compilation)


### Installation Instructions

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb" />


## Prerequisites
[back to top ⬆️](#Table-of-contents:)

In [None]:
%pip install -q "diffusers>=0.29.0" "gradio>=4.19" "torch>=2.1" "transformers" "nncf>=2.12.0" "datasets>=2.14.6" "opencv-python" "pillow" "peft>=0.7.0" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -qU "openvino>=2024.3.0"
%pip install git+https://github.com/openvinotoolkit/nncf.git

## Build PyTorch pipeline
[back to top ⬆️](#Table-of-contents:)

>**Note**: To run the model in this notebook, you need to accept the license agreement.
>You must be a registered user on the 🤗 Hugging Face Hub. Please visit the [HuggingFace model card](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers), carefully read the terms of usage and click the accept button. You will need to use an access token for the code below to run. For more information on access tokens, refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).
>You can log in to the Hugging Face Hub in the notebook environment using the following code:

In [None]:
# uncomment these lines to log in to the Hugging Face Hub and get access to the pretrained model

# from huggingface_hub import notebook_login, whoami

# try:
#     whoami()
#     print('Authorization token already provided')
# except OSError:
#     notebook_login()

In [None]:
from diffusers import StableDiffusion3Pipeline

import numpy as np

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", text_encoder_3=None, tokenizer_3=None)
pipe.to("cpu")

### Store the Configs

The model configs are stored here so that they can be used later, when the Torch FX models are wrapped and inserted back into the pipeline.

In [None]:
text_encoder_config = pipe.text_encoder.config
text_encoder_2_config = pipe.text_encoder_2.config
transformer_config = pipe.transformer.config
vae_config = pipe.vae.config

## Run FP Inference

In [None]:
import torch
import random
import numpy as np

torch.manual_seed(42)
random.seed(42)
np.random.seed(42)
latents = np.random.randn(1, 16, 128, 128).astype(np.float32)
latents = torch.from_numpy(latents).to("cpu")
generator = torch.Generator(device="cpu").manual_seed(42)
prompt = "valley in the Alps at sunset, epic vista, beautiful landscape, 4k, 8k"
# prompt = 'A raccoon trapped inside a glass jar full of colorful candies, the background is steamy with vivid colors'
with torch.no_grad():
    image = pipe(prompt=prompt, negative_prompt='', num_inference_steps=1, generator=generator, guidance_scale=5).images[0]
image.resize((512, 512,))

## Convert models to Torch FX
[back to top ⬆️](#Table-of-contents:)

This step converts the PyTorch models in the Hugging Face pipeline to a Torch FX representation using the `capture_pre_autograd_graph()` function.


The pipeline consists of three important parts:

* CLIP text encoders that create the conditioning for generating an image from a text prompt. The T5 text encoder is not loaded in this notebook (`text_encoder_3=None`).
* Transformer for step-by-step denoising of the latent image representation.
* Autoencoder (VAE) for decoding the latent space to an image.

In [None]:
from torch._export import capture_pre_autograd_graph
from nncf.torch.dynamic_graph.patch_pytorch import disable_patching

text_encoder_input = torch.ones((1, 77), dtype=torch.long)
text_encoder_kwargs = {}
text_encoder_kwargs['output_hidden_states'] = True

vae_encoder_input = torch.ones((1, 3, 128, 128))
vae_decoder_input = torch.ones((1, 16, 128, 128))

unet_kwargs = {}
unet_kwargs["hidden_states"] = torch.ones((2, 16, 128, 128))
unet_kwargs["timestep"] = torch.from_numpy(np.array([1,2], dtype=np.float32))
unet_kwargs["encoder_hidden_states"] = torch.ones((2, 154, 4096))
unet_kwargs["pooled_projections"] = torch.ones((2, 2048))
unet_kwargs["joint_attention_kwargs"] = None
unet_kwargs["return_dict"] = False

with torch.no_grad():
    with disable_patching():
        pipe.text_encoder = capture_pre_autograd_graph(pipe.text_encoder.eval(), args=(text_encoder_input,), kwargs=(text_encoder_kwargs))
        pipe.text_encoder_2 = capture_pre_autograd_graph(pipe.text_encoder_2.eval(), args=(text_encoder_input,), kwargs=(text_encoder_kwargs))
        pipe.vae.decoder = capture_pre_autograd_graph(pipe.vae.decoder, args=(vae_decoder_input,))
        pipe.vae.encoder = capture_pre_autograd_graph(pipe.vae.encoder, args=(vae_encoder_input,))
        pipe.transformer = capture_pre_autograd_graph(pipe.transformer.eval(), args=(), kwargs=(unet_kwargs))
del unet_kwargs
del vae_encoder_input
del vae_decoder_input
del text_encoder_input
del text_encoder_kwargs

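The dummy input shapes used for tracing above can be derived from the pipeline settings. The sketch below is only an illustration: the helper name `sd3_trace_shapes` is hypothetical, and the constants (a VAE downscale factor of 8, 16 latent channels, 77 CLIP tokens, a 77-token zero placeholder when T5 is not loaded, a joint embedding size of 4096, and CLIP pooled sizes of 768 and 1280) are assumptions about the SD3 medium configuration rather than values read from `pipe`.

```python
def sd3_trace_shapes(height=1024, width=1024, use_cfg=True):
    """Return example transformer input shapes (hypothetical helper, assumed constants)."""
    vae_scale_factor = 8          # assumption: image size / latent size
    latent_channels = 16          # assumption: SD3 VAE latent channels
    clip_tokens = 77              # assumption: CLIP tokenizer max length
    t5_placeholder_tokens = 77    # assumption: zero padding used when T5 is not loaded
    joint_dim = 4096              # assumption: joint attention embedding size
    pooled_dim = 768 + 1280       # assumption: pooled outputs of the two CLIP encoders
    # Classifier-free guidance runs the unconditional and conditional inputs as one batch
    batch = 2 if use_cfg else 1
    return {
        "hidden_states": (batch, latent_channels, height // vae_scale_factor, width // vae_scale_factor),
        "timestep": (batch,),
        "encoder_hidden_states": (batch, clip_tokens + t5_placeholder_tokens, joint_dim),
        "pooled_projections": (batch, pooled_dim),
    }


shapes = sd3_trace_shapes()
for name, shape in shapes.items():
    print(name, shape)
```

Under these assumptions the helper reproduces the shapes passed to `capture_pre_autograd_graph` above for a 1024x1024 image with guidance enabled.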
## Quantization
[back to top ⬆️](#Table-of-contents:)

[NNCF](https://github.com/openvinotoolkit/nncf/) enables post-training quantization by adding quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers. Quantized operations are executed in `INT8` instead of `FP32`/`FP16`, making model inference faster.

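To make the idea of running operations in `INT8` more concrete, here is a minimal sketch of asymmetric min-max quantization in plain Python. It is not NNCF's implementation: the function names `minmax_quantize` and `dequantize` are hypothetical, and the sample values are made up for illustration.

```python
def minmax_quantize(values, num_bits=8):
    """Map floats to unsigned integers using the observed min/max range."""
    levels = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    # Guard against a constant input, where the range would be zero
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero_point = round(-lo / scale)
    quantized = [min(levels, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


activations = [-1.0, -0.25, 0.0, 0.5, 3.0]  # made-up calibration values
q, scale, zero_point = minmax_quantize(activations)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - r) for a, r in zip(activations, restored))
print("quantized:", q)
print("scale:", scale, "zero point:", zero_point)
print("max round-trip error:", max_error)
```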
According to the `StableDiffusion3Pipeline` structure, the `transformer` model takes up a significant portion of the overall pipeline execution time. Now we will show you how to optimize the transformer part using [NNCF](https://github.com/openvinotoolkit/nncf/) to reduce computation cost and speed up the pipeline. Quantizing the rest of the pipeline does not significantly improve inference performance but can lead to a substantial degradation of accuracy. That's why we apply weight compression (`nncf.compress_weights`) to the rest of the pipeline to reduce the memory footprint.

Please select below whether you would like to run quantization to improve model inference speed.

> **NOTE**: Quantization is a time- and memory-consuming operation. Running the quantization code below may take some time.

In [None]:
from notebook_utils import quantization_widget
from sd3_quantization_helper import TRANSFORMER_INT8_PATH, TEXT_ENCODER_INT4_PATH, TEXT_ENCODER_2_INT4_PATH, TEXT_ENCODER_3_INT4_PATH, VAE_DECODER_INT4_PATH

to_quantize = quantization_widget()

to_quantize

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino


Checkbox(value=True, description='Quantization')

### Collect Calibration Dataset

In [4]:
%%skip not $to_quantize.value

import datasets
from tqdm.notebook import tqdm
from typing import Any, Dict, List

def disable_progress_bar(pipeline, disable=True):
    if not hasattr(pipeline, "_progress_bar_config"):
        pipeline._progress_bar_config = {'disable': disable}
    else:
        pipeline._progress_bar_config['disable'] = disable


class UNetWrapper(torch.nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet
        self.captured_args = []

    def forward(self, *args, **kwargs):
        if np.random.rand() <= 0.7:
            self.captured_args.append((*args, *tuple(kwargs.values())))
        return self.unet(*args, **kwargs)

def collect_calibration_data(ov_pipe, calibration_dataset_size: int, num_inference_steps: int) -> List[Any]:
    # Temporarily replace the transformer with a wrapper that records its inputs
    original_unet = ov_pipe.transformer
    wrapped_unet = UNetWrapper(original_unet)
    ov_pipe.transformer = wrapped_unet
    calibration_data = []
    disable_progress_bar(ov_pipe)

    dataset = datasets.load_dataset("google-research-datasets/conceptual_captions", split="train", trust_remote_code=True).shuffle(seed=42)

    # Run inference for data collection
    pbar = tqdm(total=calibration_dataset_size)
    for batch in dataset:
        prompt = batch["caption"]
        # Skip captions whose character count exceeds the tokenizer's maximum length
        if len(prompt) > ov_pipe.tokenizer.model_max_length:
            continue
        # Run the pipeline
        ov_pipe(prompt, num_inference_steps=num_inference_steps)
        calibration_data.extend(wrapped_unet.captured_args)
        wrapped_unet.captured_args = []
        pbar.update(len(calibration_data) - pbar.n)
        if pbar.n >= calibration_dataset_size:
            break

    # Restore the progress bar setting and the original transformer
    disable_progress_bar(ov_pipe, disable=False)
    ov_pipe.transformer = original_unet
    return calibration_data


if to_quantize.value:
    calibration_dataset_size = 300
    unet_calibration_data = collect_calibration_data(pipe, calibration_dataset_size=calibration_dataset_size, num_inference_steps=50)

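The calibration collection above relies on a simple pattern: temporarily replace a component with a wrapper that records a random subset of its inputs, run the pipeline, then restore the original component. A framework-free sketch of that pattern follows. `RecordingWrapper`, `ToyPipeline`, `toy_denoiser` and `collect` are hypothetical stand-ins and are not part of diffusers or NNCF.

```python
import random


class RecordingWrapper:
    """Forward calls to `fn` and keep a random subset of the inputs."""

    def __init__(self, fn, keep_probability=0.7, seed=0):
        self.fn = fn
        self.keep_probability = keep_probability
        self.rng = random.Random(seed)
        self.captured = []

    def __call__(self, *args, **kwargs):
        if self.rng.random() <= self.keep_probability:
            # Flatten positional and keyword inputs into one tuple per call
            self.captured.append((*args, *kwargs.values()))
        return self.fn(*args, **kwargs)


class ToyPipeline:
    """Stand-in pipeline that calls its denoiser once per step."""

    def __init__(self, denoiser):
        self.denoiser = denoiser

    def __call__(self, latent, num_steps):
        for step in range(num_steps):
            latent = self.denoiser(latent, timestep=step)
        return latent


def toy_denoiser(latent, timestep):
    return latent * 0.9


def collect(pipeline, num_runs, num_steps):
    original = pipeline.denoiser
    wrapper = RecordingWrapper(original)
    pipeline.denoiser = wrapper
    try:
        for run in range(num_runs):
            pipeline(float(run + 1), num_steps=num_steps)
    finally:
        # Always put the original component back, even if a run fails
        pipeline.denoiser = original
    return wrapper.captured


toy_pipe = ToyPipeline(toy_denoiser)
samples = collect(toy_pipe, num_runs=5, num_steps=4)
print(len(samples), "of", 5 * 4, "calls captured")
```

Sampling only a fraction of the calls keeps the calibration set varied across prompts and timesteps without storing every denoising step.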
In [None]:
%%skip not $to_quantize.value

import nncf
from nncf.quantization.advanced_parameters import AdvancedSmoothQuantParameters
from nncf.quantization.range_estimator import RangeEstimatorParametersSet

if to_quantize.value:
    with disable_patching():
        with torch.no_grad():
            # Weight compression for the text encoders and the VAE; keep the returned models
            pipe.text_encoder = nncf.compress_weights(pipe.text_encoder)
            pipe.text_encoder_2 = nncf.compress_weights(pipe.text_encoder_2)
            pipe.vae.encoder = nncf.compress_weights(pipe.vae.encoder)
            pipe.vae.decoder = nncf.compress_weights(pipe.vae.decoder)
            # Post-training quantization for the transformer
            pipe.transformer = nncf.quantize(
                model=pipe.transformer,
                calibration_dataset=nncf.Dataset(unet_calibration_data),
                subset_size=len(unet_calibration_data),
                model_type=nncf.ModelType.TRANSFORMER,
                ignored_scope=nncf.IgnoredScope(names=['conv2d']),
                advanced_parameters=nncf.AdvancedQuantizationParameters(
                    weights_range_estimator_params=RangeEstimatorParametersSet.MINMAX,
                    activations_range_estimator_params=RangeEstimatorParametersSet.MINMAX,
                ),
            )

In [5]:
%%skip not $to_quantize.value

pipe.text_encoder = torch.compile(pipe.text_encoder, backend='openvino')
pipe.text_encoder_2 = torch.compile(pipe.text_encoder_2, backend='openvino')
pipe.vae.encoder = torch.compile(pipe.vae.encoder, backend='openvino')
pipe.vae.decoder = torch.compile(pipe.vae.decoder, backend='openvino')
pipe.transformer = torch.compile(pipe.transformer, backend='openvino')

### Wrap the Models

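Before the Torch FX models are inserted back into the pipeline, they need to be wrapped in their original model classes, using the configs stored earlier.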

In [None]:
def wrap_model(pipe_model, base_class, config=None):
    # Subclass the original model class so the pipeline still sees the expected type and config
    class WrappedModel(base_class):
        def __init__(self, model, config):
            # A dict config is unpacked into keyword arguments, a config object is passed as-is
            if isinstance(config, dict):
                super().__init__(**config)
            else:
                super().__init__(config)
            if base_class.__name__ == "AutoencoderKL":
                # For the VAE only the encoder and decoder submodules are swapped
                self.encoder = model.encoder
                self.decoder = model.decoder
            else:
                self.model = model

        def forward(self, *args, **kwargs):
            return self.model(*args, **kwargs)

    return WrappedModel(pipe_model, config)

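The wrapping idea can be seen in isolation with plain Python classes. In the sketch below, `BaseModel`, `wrap_callable` and the `hidden_size` config key are hypothetical. They only illustrate how a dynamically created subclass keeps the expected type and config while delegating the actual computation to another callable.

```python
class BaseModel:
    """Stand-in for a library model class that a pipeline expects."""

    def __init__(self, hidden_size):
        self.config = {"hidden_size": hidden_size}

    def forward(self, x):
        raise NotImplementedError

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


def wrap_callable(fn, base_class, config):
    """Create an instance of a subclass of `base_class` that delegates to `fn`."""

    class Wrapped(base_class):
        def __init__(self, fn, config):
            super().__init__(**config)
            self.fn = fn

        def forward(self, *args, **kwargs):
            return self.fn(*args, **kwargs)

    return Wrapped(fn, config)


def compiled_fn(x):
    # Stand-in for a compiled or exported model
    return x * 2


wrapped = wrap_callable(compiled_fn, BaseModel, {"hidden_size": 8})
print(isinstance(wrapped, BaseModel), wrapped.config, wrapped(21))
```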
In [None]:
%%skip not $to_quantize.value

from diffusers.models.transformers.transformer_sd3 import SD3Transformer2DModel
from diffusers.models.autoencoders.autoencoder_kl import AutoencoderKL
from transformers.models.clip import CLIPTextModelWithProjection

pipe.transformer = wrap_model(pipe.transformer, SD3Transformer2DModel, dict(transformer_config))
pipe.vae = wrap_model(pipe.vae, AutoencoderKL, dict(vae_config))
pipe.text_encoder = wrap_model(pipe.text_encoder, CLIPTextModelWithProjection, text_encoder_config)
pipe.text_encoder_2 = wrap_model(pipe.text_encoder_2, CLIPTextModelWithProjection, text_encoder_2_config)

### Inference for Compilation

Run inference with a single step to trigger compilation of the models.

In [None]:
%%skip not $to_quantize.value

# Warmup the model for initial compile
prompt = "valley in the Alps at sunset, epic vista, beautiful landscape, 4k, 8k"
negative_prompt = "frames, borderline, text, character, duplicate, error, out of frame, watermark, low quality, ugly, deformed, blur"
num_steps = 1
with torch.no_grad():
    image = pipe(prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=num_steps, generator=generator).images[0]

In [None]:
%%skip not $to_quantize.value

with torch.no_grad():
    image = pipe(prompt=prompt, negative_prompt='', num_inference_steps=28, generator=generator, guidance_scale=5).images[0]
image.resize((512, 512,))

In [None]:
%%skip not $to_quantize.value

def get_model_size(models):
    """Return the total size of the parameters and buffers of the given models in MB."""
    total_size = 0
    for model in models:
        param_size = sum(param.nelement() * param.element_size() for param in model.parameters())
        buffer_size = sum(buffer.nelement() * buffer.element_size() for buffer in model.buffers())
        total_size += (param_size + buffer_size) / 1024**2
    return total_size


print("Transformer Size (MB):")
print(get_model_size([pipe.transformer]))
print("Pipeline Size (MB):")
print(get_model_size([pipe.transformer, pipe.vae.encoder, pipe.vae.decoder, pipe.text_encoder, pipe.text_encoder_2]))

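The size computation above is the element count multiplied by the bytes per element. The following sketch shows how that arithmetic changes when weights are stored in 8 bits. The layer dimensions, the per-row FP16 scale and the FP32 bias are assumptions chosen for illustration and do not describe the actual layout NNCF produces.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}


def size_mb(num_elements, dtype):
    """Size in MB of `num_elements` values stored as `dtype`."""
    return num_elements * BYTES_PER_ELEMENT[dtype] / 1024**2


# Hypothetical linear layer: 4096 x 4096 weight matrix plus a bias vector
rows, cols = 4096, 4096
weight_elements = rows * cols
bias_elements = rows

fp32_size = size_mb(weight_elements + bias_elements, "fp32")

# Assumed compressed layout: INT8 weights, one FP16 scale per output row, FP32 bias
compressed_size = (
    size_mb(weight_elements, "int8")
    + size_mb(rows, "fp16")
    + size_mb(bias_elements, "fp32")
)

ratio = fp32_size / compressed_size
print("FP32 size (MB):", fp32_size)
print("Compressed size (MB):", compressed_size)
print("Compression ratio:", ratio)
```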
In [None]:
%%skip not $to_quantize.value

from sd3_quantization_helper import visualize_results

opt_image = pipe(
    "A raccoon trapped inside a glass jar full of colorful candies, the background is steamy with vivid colors",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=5,
    height=1024,
    width=1024,
    generator=torch.Generator().manual_seed(141),
).images[0]

visualize_results(image, opt_image)