
# Model format

Peter Major edited this page Jun 16, 2023 · 2 revisions

Unpaint accepts Stable Diffusion models in the ONNX format. Each model must contain the following files:

| Path | Role |
| --- | --- |
| `feature_extractor/preprocessor_config.json` | Configuration for feature extraction |
| `safety_checker/model.onnx` | Model for the safety check |
| `scheduler/scheduler_config.json` | Configuration for the denoising scheduler |
| `text_encoder/model.onnx` | Model for text encoding |
| `tokenizer/merges.txt`, `special_tokens_map.json`, `tokenizer_config.json`, `vocab.json` | Configuration for text tokenization |
| `unet/model.onnx` | Model for denoising |
| `vae_decoder/model.onnx` | Model for VAE decoding |
| `vae_encoder/model.onnx` | Model for VAE encoding |
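As a quick sanity check before loading a model, the layout above can be validated on disk. The helper below is a hypothetical sketch (not part of Unpaint) that reports which required files are missing from a model folder:

```python
from pathlib import Path

# Files required by the table above (hypothetical validation helper,
# not Unpaint code).
REQUIRED_FILES = [
    "feature_extractor/preprocessor_config.json",
    "safety_checker/model.onnx",
    "scheduler/scheduler_config.json",
    "text_encoder/model.onnx",
    "tokenizer/merges.txt",
    "tokenizer/special_tokens_map.json",
    "tokenizer/tokenizer_config.json",
    "tokenizer/vocab.json",
    "unet/model.onnx",
    "vae_decoder/model.onnx",
    "vae_encoder/model.onnx",
]

def missing_files(model_dir):
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [f for f in REQUIRED_FILES if not (root / f).is_file()]
```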

## Safety checker

Checks whether an input image is safe for work (SFW).

Input:

- `float16 clip_input[batch, channels, height, width]`: the input image in color space, usually 224 × 224 pixels, colors scaled to the range 0..1 and then normalized, with planar color channels
- `float16 images[batch, height, width, channels]`: the input image in color space at its original size, colors scaled to the range 0..1

Output:

- `float16 out_images[batch, height, width, channels]`: the input image if it is safe, otherwise a black image
- `boolean has_nsfw_concepts[batch]`: one boolean per image in the batch; true if the image is unsafe, false otherwise
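The "scaled to 0..1 then normalized" step for `clip_input` can be sketched per pixel. The mean/std values below are the standard CLIP feature-extractor constants usually found in `preprocessor_config.json`; treat them as an assumption and verify them against your model's file:

```python
# Standard CLIP normalization constants (assumption: check your model's
# preprocessor_config.json).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb_byte):
    """Map one 8-bit RGB pixel to the normalized range expected by clip_input:
    scale each channel to 0..1, subtract the mean, divide by the std."""
    return tuple(
        (value / 255.0 - CLIP_MEAN[c]) / CLIP_STD[c]
        for c, value in enumerate(rgb_byte)
    )
```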

## Text encoder

Encodes tokenized text as an embedding.

Input:

- `int32 input_ids[batch, sequence]`: the input tokens; the shape is usually batch size × 77

Output:

- `float16 last_hidden_state[batch, sequence, 768]`: the text embedding (a.k.a. the last hidden state)
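Shaping `input_ids` to the usual sequence length of 77 (the CLIP context length) can be sketched as below. The BOS/EOS ids 49406/49407 are the standard CLIP tokenizer values and are an assumption here; check `tokenizer_config.json` for your model:

```python
# Standard CLIP special-token ids (assumption: verify in tokenizer_config.json).
BOS, EOS, MAX_LEN = 49406, 49407, 77

def build_input_ids(token_ids):
    """Truncate/pad a token sequence to the fixed [77] shape:
    BOS + up to 75 tokens + EOS, padded with EOS."""
    ids = [BOS] + list(token_ids[: MAX_LEN - 2]) + [EOS]
    ids += [EOS] * (MAX_LEN - len(ids))
    return ids
```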

## Denoiser (U-Net)

Denoises images.

Input:

- `float16 sample[batch, channels, height, width]`: the image to denoise in latent space
- `float16 timestep[batch]`: the timestep for denoising
- `float16 encoder_hidden_state[batch, sequence, 768]`: the text embedding (a.k.a. the last hidden state)

Output:

- `float16 out_sample[batch, channels, height, width]`: the denoised image in latent space
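These inputs and the output fit together in the iterative denoising loop. The sketch below is purely structural; `unet` and `scheduler_step` are hypothetical placeholders standing in for the ONNX U-Net session and the scheduler configured by `scheduler_config.json`, not Unpaint APIs:

```python
def denoise(sample, timesteps, text_embedding, unet, scheduler_step):
    """Run the denoising loop: at each timestep the U-Net predicts
    out_sample from (sample, timestep, text embedding), and the
    scheduler uses that prediction to produce the next latent."""
    for t in timesteps:
        noise_pred = unet(sample, t, text_embedding)
        sample = scheduler_step(noise_pred, t, sample)
    return sample
```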

## VAE decoder

Converts images from latent to color space.

Input:

- `float16 latent_sample[batch, channels, height, width]`: an image in latent space

Output:

- `float16 sample[batch, channels, height, width]`: an image in color space
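A sketch of the scaling that commonly surrounds the VAE decoder: latents are divided by the Stable Diffusion v1 scaling factor 0.18215 before decoding (an assumption; check your model's configuration), and the decoder's color-space output in the range −1..1 is mapped to 8-bit channel values:

```python
# Common Stable Diffusion v1 latent scaling factor (assumption: verify
# against your model's configuration).
VAE_SCALE = 0.18215

def to_decoder_input(latent_value):
    """Undo the latent scaling before feeding latent_sample to the decoder."""
    return latent_value / VAE_SCALE

def to_byte(color_value):
    """Map a decoder output in [-1, 1] to an 8-bit channel value."""
    scaled = (color_value + 1.0) / 2.0
    return max(0, min(255, round(scaled * 255)))
```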

## VAE encoder

Converts images from color to latent space.

Input:

- `float16 sample[batch, channels, height, width]`: an image in color space

Output:

- `float16 latent_sample[batch, channels, height, width]`: an image in latent space