
Image-to-Image Conditioning

A picture is worth a thousand words: sometimes starting from a reference image is more efficient than trying your luck with simple text-to-image.

In this section we'll explore various image-to-image techniques.

If you want to follow along with the examples, be sure to download the content of the input directory of this repository and place it inside ComfyUI/input/.

The easiest of the image-to-image workflows is "drawing over" an existing image using a denoise value lower than 1 in the sampler.

The lower the denoise, the closer the composition will be to the original image.

We can of course augment the generation with proper prompting. In this workflow we use a base image of a portrait of a woman to create a similar image of a man.

img to img

👉 Note: We are using SDXL for this example. The latent size is 1024x1024 but the conditioning image is only 512x512. It is a good idea to always work with images of the same size, which is why in this example we are scaling the original image to match the latent. This applies to every image-to-image workflow, including ControlNets, especially if the aspect ratios differ.
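If you prefer to queue this kind of workflow programmatically, the same graph can be expressed in ComfyUI's API prompt format and posted to the /prompt endpoint of a running instance. The sketch below is only illustrative: the node class names are the core ComfyUI ones, but the node IDs, file names, seed and sampler settings are placeholder assumptions, so adjust them to your setup.

```python
import json
import urllib.request

# Minimal img2img graph in ComfyUI API format (file names, IDs and settings are assumed).
prompt = {
    # Load the SDXL checkpoint (file name is a placeholder).
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd_xl_base_1.0.safetensors"}},
    # Load the reference portrait from ComfyUI/input/.
    "2": {"class_type": "LoadImage", "inputs": {"image": "portrait.png"}},
    # Scale the 512x512 reference up to the 1024x1024 working resolution.
    "3": {"class_type": "ImageScale",
          "inputs": {"image": ["2", 0], "upscale_method": "bicubic",
                     "width": 1024, "height": 1024, "crop": "disabled"}},
    # Encode the pixels into a latent we can "draw over".
    "4": {"class_type": "VAEEncode",
          "inputs": {"pixels": ["3", 0], "vae": ["1", 2]}},
    # Positive and negative prompts.
    "5": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["1", 1], "text": "portrait of a man"}},
    "6": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["1", 1], "text": "blurry, lowres"}},
    # denoise < 1 keeps the composition of the source image.
    "7": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "seed": 1, "steps": 25, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal",
                     "positive": ["5", 0], "negative": ["6", 0],
                     "latent_image": ["4", 0], "denoise": 0.6}},
    # Decode and save the result.
    "8": {"class_type": "VAEDecode",
          "inputs": {"samples": ["7", 0], "vae": ["1", 2]}},
    "9": {"class_type": "SaveImage",
          "inputs": {"images": ["8", 0], "filename_prefix": "img2img"}},
}

# Queue the graph on a local ComfyUI instance.
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": prompt}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```

Lowering denoise (e.g. 0.4) keeps more of the original composition; raising it toward 1 behaves more and more like plain text-to-image.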

Sometimes you want to create an image based on the style of a reference picture. You are not painting over but taking inspiration from a source.

This can be done with unCLIP models. In this example we are using the sd21-unclip-h.ckpt checkpoint.

unclip

  1. We load the checkpoint with the unCLIPCheckpointLoader node. Note that it is based on SD2.1, so we use a 768x768 latent size, which is the resolution the model was trained for.

  2. We use a CLIP Vision Encode node to encode the reference picture for the model.

  3. The conditioning happens on the unCLIPConditioning node. noise_augmentation defines how close to the original the new image will be, with 0 being the most faithful. It is generally a good idea to set this value to 0.1-0.3 just to give some leeway to the sampler. strength is the conditioning strength relative to the other conditionings (in this example the text CLIP). It's like setting the weight of a piece of text inside a prompt, e.g. (red hat:1.2).

💡 Tip: You'll notice that there are two unCLIP models available: sd21-unclip-l.ckpt and sd21-unclip-h.ckpt. Generally, for a one-off image you want to use the -h variant, which is more accurate. The -l model was created for when resources are scarce or extreme speed is essential.
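In API terms, the key part of this graph is the unCLIPConditioning node sitting between the text conditioning and the sampler. The fragment below uses the same API-format convention as the earlier sketch, with placeholder IDs and file names, and only shows that sub-graph; exact widget names can vary slightly between ComfyUI versions.

```python
# unCLIP sub-graph (fragment, to be merged into a full API prompt).
unclip_nodes = {
    # The unCLIP checkpoint also provides a CLIP Vision model (output index 3).
    "10": {"class_type": "unCLIPCheckpointLoader",
           "inputs": {"ckpt_name": "sd21-unclip-h.ckpt"}},
    # The reference picture we take inspiration from.
    "11": {"class_type": "LoadImage", "inputs": {"image": "reference.png"}},
    # Encode the reference with CLIP Vision.
    "12": {"class_type": "CLIPVisionEncode",
           "inputs": {"clip_vision": ["10", 3], "image": ["11", 0]}},
    # Regular text prompt.
    "13": {"class_type": "CLIPTextEncode",
           "inputs": {"clip": ["10", 1], "text": "a painting of a castle"}},
    # noise_augmentation: 0 = most faithful; 0.1-0.3 gives the sampler some leeway.
    # strength: weight of the image conditioning relative to the text.
    "14": {"class_type": "unCLIPConditioning",
           "inputs": {"conditioning": ["13", 0], "clip_vision_output": ["12", 0],
                      "strength": 1.0, "noise_augmentation": 0.2}},
}
# ["14", 0] then feeds the KSampler positive input; the latent should be 768x768.
```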

The Style model works similarly to unCLIP, but it is a CLIP Vision conditioning and can be used with any SD1.5 model.

For this to work you need the CLIP Vision model in the ComfyUI/models/clip_vision directory and the style model itself in ComfyUI/models/style_models.

style model

  1. Load the CLIP Vision model.

  2. Encode the source image for the model to use.

  3. Load the Style model.

  4. Connect your prompt to the Apply style model node and then to the KSampler positive. Note that although the node doesn't offer a strength option, you can still fine-tune the effect with timestepping; check the experiments below for some examples.

👉 Note: The style model, like many of the "in the style of..." img2img techniques, can't apply the style of something it doesn't understand. A picture of a famous painting or of a person will be easy to process, but something more exotic, abstract, or unintelligible might lead to underwhelming results.
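As a rough sketch in the same API-format notation (core ComfyUI node class names; IDs and file names are assumptions), the style-model chain looks like this:

```python
# Style model sub-graph (fragment).
style_nodes = {
    # 1. Load the CLIP Vision model from ComfyUI/models/clip_vision.
    "20": {"class_type": "CLIPVisionLoader",
           "inputs": {"clip_name": "clip_vision_model.safetensors"}},
    # 2. Encode the source image.
    "21": {"class_type": "LoadImage", "inputs": {"image": "style_reference.png"}},
    "22": {"class_type": "CLIPVisionEncode",
           "inputs": {"clip_vision": ["20", 0], "image": ["21", 0]}},
    # 3. Load the style model from ComfyUI/models/style_models.
    "23": {"class_type": "StyleModelLoader",
           "inputs": {"style_model_name": "t2iadapter_style_sd14v1.pth"}},
    # 4. Apply it on top of the text conditioning; the result goes to the
    #    KSampler positive input. There is no strength widget on this node.
    "24": {"class_type": "StyleModelApply",
           "inputs": {"conditioning": ["5", 0],   # placeholder: text prompt conditioning
                      "style_model": ["23", 0],
                      "clip_vision_output": ["22", 0]}},
}
```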

IPAdapter is a series of very effective models for image conditioning. They can be used alone or in conjunction with text and ControlNets.

In this workflow we offer a simple image+text conditioning example. Also check the experiments for more use cases. We are using SDXL but models for SD1.5 are also available.

You need to download the pretrained models from Hugging Face and install the ComfyUI extension (from yours truly). Note that you need both the model and the image encoder. Follow the installation instructions on the extension page.

The workflow itself is very simple and similar to the style model.

💡 Tip: You can use multiple reference images using the ImageBatch node.

IPAdapter
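For the multi-image case, the references are simply batched before being encoded. The ImageBatch node is part of core ComfyUI, while the IPAdapter nodes themselves come from the extension and their exact names have changed between versions, so the fragment below (placeholder IDs and file names) only shows the batching step.

```python
# Batch two reference images into a single IMAGE output (fragment).
batch_nodes = {
    "30": {"class_type": "LoadImage", "inputs": {"image": "reference_a.png"}},
    "31": {"class_type": "LoadImage", "inputs": {"image": "reference_b.png"}},
    # ["32", 0] is then wired into the IPAdapter node's image input.
    "32": {"class_type": "ImageBatch",
           "inputs": {"image1": ["30", 0], "image2": ["31", 0]}},
}
```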

Stability AI released the Revision model, which is similar to the other methods we explored in this section but dedicated to SDXL.

Revision is used for image (including multiple images) and image+text conditioning, and it's also a rather effective tool for creating image variations.

Download the CLIP Vision model and place it in the ComfyUI/models/clip_vision directory.

The workflow is similar to unCLIP but the checkpoint is SDXL base. In this example we are merging the style of two images.

revision

As with unCLIP, noise_augmentation determines the closeness to the reference image (0 being the closest) and strength the weight of the conditioning.

👉 Note: To eliminate any interference from the text CLIP we also use ConditioningZeroOut. This is optional and used exclusively for the purpose of this example, which is meant to be a pure image+image conditioning without any other external influence. You can further alter the generation by eliminating the zero-out nodes and adding a custom prompt.
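In the same sketch notation (assumed IDs and file names), merging two images with Revision chains two unCLIPConditioning nodes on top of a zeroed-out text conditioning:

```python
# Revision sub-graph (fragment): two images, no text influence.
revision_nodes = {
    "40": {"class_type": "CLIPVisionLoader",
           "inputs": {"clip_name": "clip_vision_g.safetensors"}},
    "41": {"class_type": "LoadImage", "inputs": {"image": "style_a.png"}},
    "42": {"class_type": "LoadImage", "inputs": {"image": "style_b.png"}},
    "43": {"class_type": "CLIPVisionEncode",
           "inputs": {"clip_vision": ["40", 0], "image": ["41", 0]}},
    "44": {"class_type": "CLIPVisionEncode",
           "inputs": {"clip_vision": ["40", 0], "image": ["42", 0]}},
    # Silence the text prompt so only the images drive the generation.
    "45": {"class_type": "ConditioningZeroOut",
           "inputs": {"conditioning": ["5", 0]}},  # placeholder: a CLIPTextEncode node
    # Chain the two image conditionings one after the other.
    "46": {"class_type": "unCLIPConditioning",
           "inputs": {"conditioning": ["45", 0], "clip_vision_output": ["43", 0],
                      "strength": 1.0, "noise_augmentation": 0.1}},
    "47": {"class_type": "unCLIPConditioning",
           "inputs": {"conditioning": ["46", 0], "clip_vision_output": ["44", 0],
                      "strength": 1.0, "noise_augmentation": 0.1}},
}
# ["47", 0] goes to the KSampler positive; the checkpoint is SDXL base.
```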

Experiments

Use the unCLIP model to merge the style of two images.

This is a fun experiment. We fuse two images with unclip and then augment the resolution with the SDXL refiner.

In this SD1.5 workflow we first create an image with dreamshaper and then use IPAdapter to create 4 variations of that image with an additional textual conditioning.

This workflow uses a conditioning image for IPAdapter and adds a Canny control net to further enhance the composition.

For this example you need to download the Canny ControlNet and place it under ComfyUI/models/controlnet. Alternatively, you can use the ControlNet extension, which should take care of downloading any missing models.
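A hedged sketch of the ControlNet part follows, in the same API-format notation. The Canny preprocessor used here is the core ComfyUI node (an auxiliary preprocessor extension can be used instead), and the control net file name, IDs and thresholds are placeholders.

```python
# Canny ControlNet sub-graph (fragment).
controlnet_nodes = {
    "50": {"class_type": "LoadImage", "inputs": {"image": "composition.png"}},
    # Edge detection on the conditioning image.
    "51": {"class_type": "Canny",
           "inputs": {"image": ["50", 0],
                      "low_threshold": 0.3, "high_threshold": 0.7}},
    "52": {"class_type": "ControlNetLoader",
           "inputs": {"control_net_name": "control_v11p_sd15_canny.pth"}},
    # Apply the control net on top of the positive conditioning coming
    # from the IPAdapter/text chain (placeholder ID).
    "53": {"class_type": "ControlNetApply",
           "inputs": {"conditioning": ["5", 0],
                      "control_net": ["52", 0],
                      "image": ["51", 0],
                      "strength": 0.8}},
}
```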

Sometimes it seems impossible to fine-tune certain conditioning nodes, such as the Style model we've seen earlier.

With a little trick and timestepping, it is actually possible to fine-tune any conditioning node.

timestepping the style model

In this experiment we use one prompt as text conditioning, then connect a zeroed-out empty prompt to the style model. We can then timestep the two conditionings to easily calibrate the end result.
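A sketch of the timestepping trick in the same notation: the core ConditioningSetTimestepRange node limits each conditioning to a portion of the sampling schedule (0.0 = first step, 1.0 = last step), and ConditioningCombine merges the two back into a single positive input. The split point of 0.3 and the node IDs are assumed starting values, not a recipe.

```python
# Timestepped style conditioning (fragment, assumed node IDs and ranges).
timestep_nodes = {
    # The text prompt drives the early steps (rough composition).
    "60": {"class_type": "ConditioningSetTimestepRange",
           "inputs": {"conditioning": ["5", 0],   # placeholder: text conditioning
                      "start": 0.0, "end": 0.3}},
    # The style-model conditioning (built from a zeroed-out empty prompt)
    # takes over for the remaining steps.
    "61": {"class_type": "ConditioningSetTimestepRange",
           "inputs": {"conditioning": ["24", 0],  # placeholder: StyleModelApply output
                      "start": 0.3, "end": 1.0}},
    # Combine both ranges into one positive conditioning for the KSampler.
    "62": {"class_type": "ConditioningCombine",
           "inputs": {"conditioning_1": ["60", 0],
                      "conditioning_2": ["61", 0]}},
}
```

Moving the split point earlier gives the style conditioning more of the schedule and therefore more influence on the final image; moving it later favors the text prompt.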