
# Model format

Peter Major edited this page Jun 16, 2023 · 2 revisions

Unpaint accepts Stable Diffusion models in the ONNX format. Each model must contain the following files:

| Path | Role |
| --- | --- |
| `feature_extractor/preprocessor_config.json` | Configuration for feature extraction |
| `safety_checker/model.onnx` | Model for the safety check |
| `scheduler/scheduler_config.json` | Configuration for the denoising scheduler |
| `text_encoder/model.onnx` | Model for text encoding |
| `tokenizer/merges.txt`, `special_tokens_map.json`, `tokenizer_config.json`, `vocab.json` | Configuration for text tokenization |
| `unet/model.onnx` | Model for denoising |
| `vae_decoder/model.onnx` | Model for VAE decoding |
| `vae_encoder/model.onnx` | Model for VAE encoding |
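As a quick sanity check before loading a model, the layout above can be validated on disk. The helper below is a hypothetical sketch (not part of Unpaint) that reports which required files are missing from a model folder:

```python
from pathlib import Path

# Files required by the table above (hypothetical validation helper,
# not Unpaint code).
REQUIRED_FILES = [
    "feature_extractor/preprocessor_config.json",
    "safety_checker/model.onnx",
    "scheduler/scheduler_config.json",
    "text_encoder/model.onnx",
    "tokenizer/merges.txt",
    "tokenizer/special_tokens_map.json",
    "tokenizer/tokenizer_config.json",
    "tokenizer/vocab.json",
    "unet/model.onnx",
    "vae_decoder/model.onnx",
    "vae_encoder/model.onnx",
]

def missing_files(model_dir):
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [f for f in REQUIRED_FILES if not (root / f).is_file()]
```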

## Safety checker

Checks whether an input image is safe for work (SFW).

Input:

- `float16 clip_input[batch, channels, height, width]`: the input image in color space, usually 224 × 224 pixels, colors scaled to the range 0..1 and then normalized, with planar color channels
- `float16 images[batch, height, width, channels]`: the input image in color space at its original size, colors scaled to the range 0..1

Output:

- `float16 out_images[batch, height, width, channels]`: the input image if it is safe, otherwise a black image
- `boolean has_nsfw_concepts[batch]`: one boolean per image in the batch; true if the image is unsafe, false otherwise
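The "scaled to 0..1 then normalized" step for `clip_input` can be sketched per pixel. The mean/std values below are the standard CLIP feature-extractor constants usually found in `preprocessor_config.json`; treat them as an assumption and verify them against your model's file:

```python
# Standard CLIP normalization constants (assumption: check your model's
# preprocessor_config.json).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb_byte):
    """Map one 8-bit RGB pixel to the normalized range expected by clip_input:
    scale each channel to 0..1, subtract the mean, divide by the std."""
    return tuple(
        (value / 255.0 - CLIP_MEAN[c]) / CLIP_STD[c]
        for c, value in enumerate(rgb_byte)
    )
```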

## Text encoder

Encodes tokenized text as an embedding.

Input:

- `int32 input_ids[batch, sequence]`: the input tokens; the shape is usually batch size × 77

Output:

- `float16 last_hidden_state[batch, sequence, 768]`: the text embedding (a.k.a. the last hidden state)
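Shaping `input_ids` to the usual sequence length of 77 (the CLIP context length) can be sketched as below. The BOS/EOS ids 49406/49407 are the standard CLIP tokenizer values and are an assumption here; check `tokenizer_config.json` for your model:

```python
# Standard CLIP special-token ids (assumption: verify in tokenizer_config.json).
BOS, EOS, MAX_LEN = 49406, 49407, 77

def build_input_ids(token_ids):
    """Truncate/pad a token sequence to the fixed [77] shape:
    BOS + up to 75 tokens + EOS, padded with EOS."""
    ids = [BOS] + list(token_ids[: MAX_LEN - 2]) + [EOS]
    ids += [EOS] * (MAX_LEN - len(ids))
    return ids
```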

## Denoiser (U-Net)

Denoises images.

Input:

- `float16 sample[batch, channels, height, width]`: the image to denoise in latent space
- `float16 timestep[batch]`: the timestep for denoising
- `float16 encoder_hidden_state[batch, sequence, 768]`: the text embedding (a.k.a. the last hidden state)

Output:

- `float16 out_sample[batch, channels, height, width]`: the denoised image in latent space
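These inputs and the output fit together in the iterative denoising loop. The sketch below is purely structural; `unet` and `scheduler_step` are hypothetical placeholders standing in for the ONNX U-Net session and the scheduler configured by `scheduler_config.json`, not Unpaint APIs:

```python
def denoise(sample, timesteps, text_embedding, unet, scheduler_step):
    """Run the denoising loop: at each timestep the U-Net predicts
    out_sample from (sample, timestep, text embedding), and the
    scheduler uses that prediction to produce the next latent."""
    for t in timesteps:
        noise_pred = unet(sample, t, text_embedding)
        sample = scheduler_step(noise_pred, t, sample)
    return sample
```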

## VAE decoder

Converts images from latent to color space.

Input:

- `float16 latent_sample[batch, channels, height, width]`: an image in latent space

Output:

- `float16 sample[batch, channels, height, width]`: an image in color space
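A sketch of the scaling that commonly surrounds the VAE decoder: latents are divided by the Stable Diffusion v1 scaling factor 0.18215 before decoding (an assumption; check your model's configuration), and the decoder's color-space output in the range −1..1 is mapped to 8-bit channel values:

```python
# Common Stable Diffusion v1 latent scaling factor (assumption: verify
# against your model's configuration).
VAE_SCALE = 0.18215

def to_decoder_input(latent_value):
    """Undo the latent scaling before feeding latent_sample to the decoder."""
    return latent_value / VAE_SCALE

def to_byte(color_value):
    """Map a decoder output in [-1, 1] to an 8-bit channel value."""
    scaled = (color_value + 1.0) / 2.0
    return max(0, min(255, round(scaled * 255)))
```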

## VAE encoder

Converts images from color to latent space.

Input:

- `float16 sample[batch, channels, height, width]`: an image in color space

Output:

- `float16 latent_sample[batch, channels, height, width]`: an image in latent space