<a href="https://www.kaggle.com/code/yunasheng/text-to-image?scriptVersionId=161403134" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

When you think of diffusion models, text-to-image is usually one of the first things that comes to mind. Text-to-image generates an image from a text description (for example, "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"), which is also known as a prompt.

At a very high level, a diffusion model takes a prompt and some random initial noise, and iteratively removes the noise to construct an image. The denoising process is guided by the prompt, and once it ends after a predetermined number of time steps, the image representation is decoded into an image.
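
The loop described above can be illustrated with a toy sketch. This is not a real diffusion model: the `target` list is a hypothetical stand-in for "what the prompt describes", and the subtraction replaces the neural network that would normally predict the noise. It only shows the shape of the process, starting from random noise and removing part of it over a fixed number of steps.

```python
import random


def toy_denoise(target, num_steps=50, seed=0):
    """Toy stand-in for a denoising loop (NOT a real diffusion model)."""
    rng = random.Random(seed)
    # Start from pure random noise, one value per "pixel".
    sample = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(num_steps):
        # A real model predicts the noise with a network conditioned on the
        # prompt. Here the "prediction" is simply the gap to a known target.
        predicted_noise = [s - t for s, t in zip(sample, target)]
        # Remove a growing share of the remaining noise so that the last
        # step (share = 1.0) lands on the target.
        share = 1.0 / (num_steps - step)
        sample = [s - share * n for s, n in zip(sample, predicted_noise)]
    return sample


result = toy_denoise([0.2, 0.8, 0.5], num_steps=10, seed=1)
print(result)
```

In an actual pipeline, the number of steps corresponds to the scheduler's time steps, and the final sample is a latent representation that still has to be decoded into pixels.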

We can generate images from a prompt in two steps:

1. Load a checkpoint into the AutoPipelineForText2Image class, which automatically detects the appropriate pipeline class to use based on the checkpoint:

In [None]:
!pip install diffusers==0.23.1

In [None]:
from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16").to("cuda")

2. Pass a prompt to the pipeline to generate an image:

In [None]:
image = pipeline(
    "stained glass of darth vader, blacklight, centered composition, masterpiece, photorealistic, 8k").images[0]
image

# Popular models

The most common text-to-image models are Stable Diffusion v1.5, Stable Diffusion XL (SDXL), and Kandinsky 2.2. There are also ControlNet models or adapters that can be used with text-to-image models for more direct control in generating images. The results from each model are slightly different because of their architecture and training process, but no matter which model you choose, their usage is more or less the same. Let's use the same prompt for each model and compare their results.

# Stable Diffusion v1.5

Stable Diffusion v1.5 is a latent diffusion model initialized from Stable Diffusion v1.4, and finetuned for 595k steps on `512x512` images from the LAION-Aesthetics V2 dataset. You can use this model like this:

In [None]:
from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16").to("cuda")
generator = torch.Generator("cuda").manual_seed(31)
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
image
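
Because Stable Diffusion v1.5 is a latent diffusion model, the denoising runs on a smaller latent tensor rather than directly on pixels. The sketch below shows the shape arithmetic under two assumptions that are not stated in this notebook: a VAE downsampling factor of 8 and 4 latent channels, which are the values commonly used with this checkpoint. The helper name `latent_shape` is hypothetical.

```python
def latent_shape(height, width, vae_scale_factor=8, latent_channels=4):
    """Return the (channels, height, width) of the latent for a pixel size.

    vae_scale_factor and latent_channels are assumed defaults, not values
    read from a loaded pipeline.
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError(
            f"height and width must be multiples of {vae_scale_factor}")
    return (latent_channels,
            height // vae_scale_factor,
            width // vae_scale_factor)


print(latent_shape(512, 512))  # (4, 64, 64)
```

With a loaded pipeline, the actual values can be checked on `pipeline.vae_scale_factor` and `pipeline.unet.config.in_channels` instead of relying on these defaults.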

# Stable Diffusion XL

SDXL is a much larger version of the previous Stable Diffusion models, and involves a two-stage model process that adds even more detail to an image. It also includes some additional *micro-conditionings* to generate high-quality images with centered subjects. Take a look at the more comprehensive SDXL guide to learn more about how to use it. In general, you can use SDXL like this:

In [None]:
from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16").to("cuda")
generator = torch.Generator("cuda").manual_seed(31)
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
image
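
The cell above only runs the SDXL base model. As a rough sketch of the two-stage process mentioned earlier, the base model can hand its partially denoised latents to a refiner model. The refiner checkpoint name, the `denoising_end` / `denoising_start` arguments, and the 0.8 split are assumptions taken from common SDXL usage, not from this notebook, and the function name is hypothetical. It needs a CUDA GPU and downloads both checkpoints when called.

```python
def generate_with_refiner(prompt, seed=31, num_inference_steps=40,
                          high_noise_frac=0.8):
    """Sketch: SDXL base for the first part of denoising, refiner for the rest."""
    # Imported lazily so that defining the function does not need a GPU.
    import torch
    from diffusers import DiffusionPipeline

    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16").to("cuda")
    # The refiner reuses the base model's second text encoder and VAE.
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        text_encoder_2=base.text_encoder_2, vae=base.vae,
        torch_dtype=torch.float16, variant="fp16").to("cuda")

    generator = torch.Generator("cuda").manual_seed(seed)
    # Stage 1: stop early and keep the result as latents.
    latents = base(
        prompt=prompt, num_inference_steps=num_inference_steps,
        denoising_end=high_noise_frac, output_type="latent",
        generator=generator).images
    # Stage 2: the refiner finishes the remaining denoising steps.
    return refiner(
        prompt=prompt, num_inference_steps=num_inference_steps,
        denoising_start=high_noise_frac, image=latents,
        generator=generator).images[0]
```

Calling `generate_with_refiner("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k")` would return a PIL image, at the cost of holding two large models in GPU memory.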

# Kandinsky 2.2

The Kandinsky model is a bit different from the Stable Diffusion models because it also uses an image prior model to create embeddings that are used to better align text and images in the diffusion model.

The easiest way to use Kandinsky 2.2 is:

In [None]:
from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda")
generator = torch.Generator("cuda").manual_seed(31)
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
image
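
AutoPipelineForText2Image hides the image prior stage. The sketch below shows roughly what running the two stages explicitly might look like. It assumes the `kandinsky-community/kandinsky-2-2-prior` checkpoint and the `KandinskyV22PriorPipeline` and `KandinskyV22Pipeline` classes from diffusers, none of which appear in this notebook, and the function name is hypothetical. It needs a CUDA GPU and downloads both checkpoints when called.

```python
def generate_kandinsky_two_stage(prompt, seed=31):
    """Sketch: run the Kandinsky 2.2 prior, then the decoder, by hand."""
    # Imported lazily so that defining the function does not need a GPU.
    import torch
    from diffusers import KandinskyV22Pipeline, KandinskyV22PriorPipeline

    prior = KandinskyV22PriorPipeline.from_pretrained(
        "kandinsky-community/kandinsky-2-2-prior",
        torch_dtype=torch.float16).to("cuda")
    decoder = KandinskyV22Pipeline.from_pretrained(
        "kandinsky-community/kandinsky-2-2-decoder",
        torch_dtype=torch.float16).to("cuda")

    generator = torch.Generator("cuda").manual_seed(seed)
    # Stage 1: the prior turns the text prompt into image embeddings.
    image_embeds, negative_image_embeds = prior(
        prompt, generator=generator).to_tuple()
    # Stage 2: the decoder denoises an image conditioned on those embeddings.
    return decoder(
        image_embeds=image_embeds,
        negative_image_embeds=negative_image_embeds,
        generator=generator).images[0]
```

Note that the decoder receives only embeddings and never sees the prompt text, which is the main structural difference from the Stable Diffusion pipelines.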

# ControlNet

ControlNet models are auxiliary models or adapters that are finetuned on top of text-to-image models, such as Stable Diffusion v1.5. Using ControlNet models in combination with text-to-image models offers diverse options for more explicit control over how to generate an image. With ControlNet, you add an additional conditioning input image to the model. For example, if you provide an image of a human pose (usually represented as multiple keypoints that are connected into a skeleton) as a conditioning input, the model generates an image that follows the pose of the image. Check out the more in-depth ControlNet guide to learn more about other conditioning inputs and how to use them.

In this example, let's condition the ControlNet with a human pose estimation image. Load the ControlNet model pretrained on human pose estimations:

In [None]:
from diffusers import ControlNetModel, AutoPipelineForText2Image
from diffusers.utils import load_image
import torch

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16, variant="fp16").to("cuda")
pose_image = load_image("https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png")

Pass the controlnet to the AutoPipelineForText2Image, and provide the prompt and pose estimation image:

In [None]:
pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16").to("cuda")
generator = torch.Generator("cuda").manual_seed(31)
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=pose_image, generator=generator).images[0]
image
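
The pose conditioning image is a rendering of keypoints connected into a skeleton. As a minimal illustration of that data structure, the sketch below stores a few keypoints and the pairs that connect them. The keypoint names and coordinates are hypothetical and much simpler than what a real pose estimator produces.

```python
# Hypothetical keypoints as (x, y) pixel coordinates.
KEYPOINTS = {
    "head": (64, 20),
    "neck": (64, 40),
    "left_hand": (34, 70),
    "right_hand": (94, 70),
    "hip": (64, 80),
    "left_foot": (49, 120),
    "right_foot": (79, 120),
}

# Pairs of keypoint names that are joined by a limb.
SKELETON = [
    ("head", "neck"),
    ("neck", "left_hand"),
    ("neck", "right_hand"),
    ("neck", "hip"),
    ("hip", "left_foot"),
    ("hip", "right_foot"),
]


def skeleton_segments(keypoints, skeleton):
    """Return one ((x1, y1), (x2, y2)) line segment per limb."""
    return [(keypoints[a], keypoints[b]) for a, b in skeleton]


for start, end in skeleton_segments(KEYPOINTS, SKELETON):
    print(start, "->", end)
```

Drawing these segments on a black canvas gives an image of the same kind as `pose_image`, which the ControlNet then uses to constrain the generated figure's pose.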

# Configure pipeline parameters

There are a number of parameters that can be configured in the pipeline that affect how an image is generated. You can change the image's size, specify a negative prompt to improve image quality, and more. This section dives deeper into how to use those parameters.

# Height and width

The height and width parameters control the height and width (in pixels) of the generated image. By default, the Stable Diffusion v1.5 model outputs `512x512` images, but you can change this to any size that is a multiple of `8`. For example, to create a rectangular image: