<a href="https://colab.research.google.com/github/gitmystuff/HuggingFace/blob/main/Diffusers_Starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Diffusers Starter

**What is the Hugging Face `diffusers` Library?**

The `diffusers` library is a Python library developed by Hugging Face that simplifies the process of working with diffusion models. It provides:

* **Pre-trained Models:** A collection of ready-to-use diffusion models for various tasks (image generation, audio generation, etc.).
* **Schedulers:** Implementations of different noise scheduling algorithms used in diffusion models.
* **Pipelines:** High-level abstractions that combine models and schedulers for easy inference.
* **Modularity:** The ability to access and manipulate individual components (models, schedulers) for custom diffusion system development.

**Key Components:**

1.  **Models:**
    * These are the neural networks that learn to denoise images (or other data).
    * They are trained to predict the noise added to an image at each step of the diffusion process.
    * The example uses `google/ddpm-celebahq-256`, which is a pre-trained model for generating celebrity images.
2.  **Schedulers:**
    * Schedulers define the noise schedule, which controls how much noise is added or removed at each step of the diffusion process.
    * Different schedulers can lead to different generation speeds and quality.
    * They are the logic that determines the amount of noise to add during the forward diffusion process, and the logic that determines how to remove the noise during the reverse diffusion process.
3.  **Pipelines:**
    * Pipelines bundle models and schedulers together, providing a simple interface for inference.
    * The `DDPMPipeline` is an example of a pipeline that uses a DDPM (Denoising Diffusion Probabilistic Model) model.
    * Pipelines are designed to make it very easy to go from zero to image generation, or other forms of generated data, very quickly.

**Understanding Diffusion Models:**

* **Forward Diffusion (Adding Noise):**
    * The process of gradually adding noise to an image until it becomes pure noise.
    * This process is typically performed in a series of discrete steps.
* **Reverse Diffusion (Denoising):**
    * The process of iteratively removing noise from a noisy image to generate a clean image.
    * This is the generative process, where the model learns to reverse the forward diffusion.
    * The model predicts the noise, and a scaled portion of that prediction is then subtracted from the noisy image.
* **Iterative Refinement:**
    * Diffusion models generate images through multiple steps, allowing for gradual refinement and high-quality results.
    * This is a key advantage over single-pass generative models like GANs.
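
The forward process described above can be sketched numerically. The snippet below is a minimal NumPy illustration, not code from `diffusers`: it assumes a linear noise schedule with 1,000 steps (the endpoints `1e-4` and `0.02` are the values from the original DDPM paper) and uses a random array as a stand-in for an image. The names `add_noise`, `betas`, and `alphas_cumprod` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear noise schedule (illustrative values).
num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = np.cumprod(1.0 - betas)  # share of the original signal variance left at each step

def add_noise(x0, noise, t):
    """Jump straight to step t of the forward process (closed form)."""
    signal_scale = np.sqrt(alphas_cumprod[t])
    noise_scale = np.sqrt(1.0 - alphas_cumprod[t])
    return signal_scale * x0 + noise_scale * noise

x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # stand-in for an image scaled to [-1, 1]
noise = rng.standard_normal(x0.shape)

x_early = add_noise(x0, noise, t=10)   # still mostly image
x_late = add_noise(x0, noise, t=999)   # almost pure noise

# If the noise were known exactly, the clean image could be recovered in one step.
# A trained model only predicts the noise, which is why generation is iterative.
t = 500
x_t = add_noise(x0, noise, t)
x0_recovered = (x_t - np.sqrt(1.0 - alphas_cumprod[t]) * noise) / np.sqrt(alphas_cumprod[t])
```

The closed form is what makes training practical: any noise level can be sampled directly instead of adding noise one step at a time.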

**Example Breakdown:**

1.  **Loading a Pre-Trained Model:**
    * The example uses `DDPMPipeline.from_pretrained("google/ddpm-celebahq-256")` to load a pre-trained DDPM model.
    * This model has been trained on the CelebA-HQ dataset, so it can generate realistic celebrity faces.
2.  **Using the Pipeline:**
    * The pipeline handles the entire diffusion process, from generating initial noise to denoising it into an image.
    * The pipeline hides the complexity of the for loop that is used to iteratively denoise the image.

**Deconstructing a Basic Pipeline:**

* The library allows you to access the individual components of a pipeline (model and scheduler).
* This enables you to customize the diffusion process, such as using a different scheduler or modifying the model.
* This ability to access the individual components is what gives the library its flexibility.
* This flexibility is what allows researchers and developers to create new and innovative diffusion based systems.
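
To make the model/scheduler/pipeline split concrete, here is a toy sketch in plain NumPy. None of these classes come from `diffusers`: `ToyModel`, `ToyScheduler`, and `ToyPipeline` are hypothetical stand-ins that only mirror the division of labor, where the model predicts noise, the scheduler decides how to use that prediction, and the pipeline owns the loop.

```python
import numpy as np

class ToyModel:
    """Stand-in for a trained network: 'predicts' the noise in a sample."""
    def __call__(self, sample, t):
        # A real model is a neural network. Here the target image is all zeros,
        # so everything in the sample counts as noise.
        return sample

class ToyScheduler:
    """Stand-in for a scheduler: decides how much predicted noise to remove."""
    def __init__(self, num_steps, step_fraction=0.5):
        self.timesteps = list(range(num_steps - 1, -1, -1))  # count down to 0
        self.step_fraction = step_fraction

    def step(self, noise_pred, t, sample):
        return sample - self.step_fraction * noise_pred

class ToyPipeline:
    """Bundles a model and a scheduler and hides the denoising loop."""
    def __init__(self, model, scheduler):
        self.model = model
        self.scheduler = scheduler

    def __call__(self, shape, seed=0):
        sample = np.random.default_rng(seed).standard_normal(shape)
        for t in self.scheduler.timesteps:
            noise_pred = self.model(sample, t)
            sample = self.scheduler.step(noise_pred, t, sample)
        return sample

pipe = ToyPipeline(ToyModel(), ToyScheduler(num_steps=10))
slow = pipe((4, 4))

# The components are plain attributes, so one can be swapped without touching the other.
pipe.scheduler = ToyScheduler(num_steps=10, step_fraction=0.9)
fast = pipe((4, 4))
```

Swapping the scheduler changes how quickly the sample converges while the model stays untouched, which is the kind of customization the bullets above describe.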

**In essence:**

The Hugging Face `diffusers` library provides a user-friendly and modular toolkit for working with diffusion models. It simplifies the process of loading pre-trained models, using pipelines for inference, and customizing diffusion systems. Its focus on modularity allows for both rapid prototyping and in-depth research.


In [None]:
# !nvidia-smi

In [None]:
# import os
# from PIL import Image

# def get_image_size(image_path):
#     """Gets the dimensions and file size of an image.

#     Args:
#         image_path: Path to the image file.
#     """
#     try:
#         img = Image.open(image_path)
#         width, height = img.size
#         file_size = os.path.getsize(image_path)

#         print(f"Dimensions: {width} x {height} pixels")
#         print(f"File size: {file_size} bytes")

#     except FileNotFoundError:
#         print(f"Error: Image not found at {image_path}")
#     except Exception as e:
#         print(f"An error occurred: {e}")

# image_file = "cat playing.jpg"
# get_image_size(image_file)

In [None]:
# from PIL import Image

# def resize_image(image_path, output_path, size=(1024, 1024)):
#     """Resizes an image to a specified size using PIL.

#     Args:
#         image_path: Path to the input image.
#         output_path: Path to save the resized image.
#         size: Tuple representing the desired width and height (e.g., (1024, 1024)).
#     """
#     try:
#         img = Image.open(image_path)
#         img_resized = img.resize(size, Image.Resampling.LANCZOS) # Use LANCZOS for high-quality downsampling
#         img_resized.save(output_path)
#         print(f"Image resized and saved to {output_path}")

#     except FileNotFoundError:
#         print(f"Error: Image not found at {image_path}")
#     except Exception as e:
#         print(f"An error occurred: {e}")

# # Example usage:
# input_image = "cat playing.jpg"  # Replace with your input image path
# output_image = "cat_playing_1024.jpg" #Replace with desired output path and filename.

# resize_image(input_image, output_image)

# # Example with PNG:
# # input_image_png = "input.png"
# # output_image_png = "output_1024x1024.png"

# # resize_image(input_image_png, output_image_png)

# Working with Images in NumPy

NumPy is a powerful library for numerical operations in Python, and it can also handle images stored as arrays. In this quick lecture, we'll cover how to work with images using NumPy. Although we'll primarily use OpenCV to open and view images, we'll revisit NumPy and Matplotlib later in the Deep Learning section.

## Loading and Displaying Images with PIL and NumPy

First, let's start by loading an image using the `PIL` library and converting it to a NumPy array.
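
As a minimal sketch of that round trip, the snippet below builds a small solid-color image in memory so that it runs without any file on disk (the size and color are arbitrary choices for illustration). With a real file, `Image.open("your_image.jpg")` would take the place of `Image.new`.

```python
import numpy as np
from PIL import Image

# Build a small solid-color RGB image in memory so the example needs no file.
img = Image.new("RGB", (64, 48), color=(255, 128, 0))  # size is (width, height)

arr = np.asarray(img)          # PIL image -> NumPy array
print(type(img), type(arr))
print(arr.shape, arr.dtype)    # rows (height), columns (width), channels

# Zero out the green and blue channels on a copy, then convert back to PIL.
red_only = arr.copy()
red_only[:, :, 1:] = 0
img_red = Image.fromarray(red_only)
```

Note the axis order: PIL reports size as (width, height), while the NumPy array is indexed as (row, column, channel).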

In [None]:
# code

In [None]:
# Check the type of the image


In [None]:
# Convert image to NumPy array


In [None]:
# show new pic


In [None]:
# Copy the array


In [None]:
# use cv2

In [None]:
# Convert BGR to RGB


In [None]:
# Load image in grayscale


In [None]:
# Resize image


## CNN Kernels

In [None]:
# import cv2
# import numpy as np
# import matplotlib.pyplot as plt

# def apply_convolution(image_path, kernel):
#     """Applies a convolution to an image using a given kernel.

#     Args:
#         image_path: Path to the input image.
#         kernel: NumPy array representing the convolution kernel.
#     """
#     try:
#         img = cv2.imread(image_path)
#         if img is None:
#             print(f"Error: Could not read image at {image_path}")
#             return

#         img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # Convert to RGB

#         convolved_img = cv2.filter2D(img_rgb, -1, kernel)  # Apply convolution

#         plt.figure(figsize=(10, 5))

#         plt.subplot(1, 2, 1)
#         plt.imshow(img_rgb)
#         plt.title("Original Image")
#         plt.axis('off')

#         plt.subplot(1, 2, 2)
#         plt.imshow(convolved_img)
#         plt.title("Convolved Image")
#         plt.axis('off')

#         plt.show()

#     except Exception as e:
#         print(f"An error occurred: {e}")

### Sharpening

In [None]:
# # Example usage:
# image_file = "cat_playing_1024.jpg" #replace with your image.

# # 1. Sharpening kernel:
# sharpening_kernel = np.array([[-1, -1, -1],
#                                [-1, 9.5, -1],
#                                [-1, -1, -1]])


# apply_convolution(image_file, sharpening_kernel)

**1. Understanding the Sharpening Kernel:**

```python
sharpening_kernel = np.array([[-1, -1, -1],
                               [-1, 9.5, -1],
                               [-1, -1, -1]])
```

* **NumPy Array:** This creates a 3x3 matrix (a kernel) using the NumPy library.
* **Kernel Structure:**
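    * The center value is `9.5`. This is the most important part: it gives the central pixel a large positive weight. Because the nine weights sum to 1.5 rather than 1, the filter also brightens the image overall; a center value of `9` would sharpen without changing overall brightness.
    * The surrounding values are all `-1`, so the neighboring pixels' values are subtracted from the weighted center value.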
* **How it Works (Convolution):**
    * When this kernel is applied to an image, it performs a weighted sum of the pixels in a 3x3 neighborhood.
    * The central pixel's value is multiplied by `9.5`, while the surrounding pixels' values are multiplied by `-1`.
    * This process enhances the differences between the central pixel and its neighbors.
    * Where there are sharp transitions (edges), the difference between the center and surrounding pixels will be large, and the result will be a larger value. This is why edges are enhanced.
    * Essentially, the kernel is subtracting a slightly blurred version of the image from the original image.
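
The weighted sum can be checked by hand on two tiny patches. This is a sketch in plain NumPy (the helper `weighted_sum` and the pixel values are made up for illustration); it computes the kernel's response at the center pixel only, which is what a filter such as `cv2.filter2D` does at every position.

```python
import numpy as np

sharpening_kernel = np.array([[-1, -1, -1],
                              [-1, 9.5, -1],
                              [-1, -1, -1]])

def weighted_sum(patch, kernel):
    """Response of the kernel at the center pixel of a 3x3 patch."""
    return float(np.sum(patch * kernel))

flat = np.full((3, 3), 100.0)      # uniform region, no edge
bright_center = flat.copy()
bright_center[1, 1] = 200.0        # center stands out from its neighbors

flat_response = weighted_sum(flat, sharpening_kernel)           # 9.5*100 - 8*100
edge_response = weighted_sum(bright_center, sharpening_kernel)  # 9.5*200 - 8*100

print(flat_response, edge_response)
print(sharpening_kernel.sum())
```

The flat patch comes out at 1.5 times its original value, while the patch with a bright center is amplified far more. On `uint8` images the results are saturated to the 0-255 range, so strong responses clip to white.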

**2. Why Sharpening (and Similar Filters) Are Used in CNNs:**

* **Feature Extraction:**
    * CNNs are designed to learn features from images. These features can be edges, corners, textures, and other patterns.
    * Convolutional layers in CNNs use kernels (filters) to extract these features.
    * Sharpening kernels, edge detection kernels, and blur kernels are examples of basic filters that can extract specific types of features.
* **Edge Enhancement:**
    * Edges are fundamental features in images. They define the boundaries of objects and provide crucial information for object recognition.
    * Sharpening kernels enhance edges, making them more distinct. This can improve the performance of CNNs in tasks like object detection and image classification.
* **Learning Kernels:**
    * In a typical CNN, the kernel values are not fixed. Instead, they are learned during the training process.
    * The network adjusts the kernel values to minimize the error between its predictions and the actual labels.
    * This allows the network to learn kernels that are optimal for the specific task.
    * While we manually set the sharpening kernel's values in our example, a CNN would learn what values are best.
* **Early Layers:**
    * Kernels that resemble sharpening and edge detection filters are often what the early layers of a CNN end up learning.
    * These layers are responsible for extracting low-level features, such as edges and textures.
    * The later layers of the network then combine these low-level features to form more complex representations of objects.
* **Data Preprocessing:**
    * Sharpening is sometimes used as a preprocessing step before the image is fed into the CNN, with the aim of making edges easier for the network to find.

**In summary:**

Sharpening kernels are used in CNNs (either as fixed filters or learned kernels) to enhance edges and extract relevant features from images. This improves the network's ability to recognize objects and patterns.


### Edge Detection

In [None]:
# # 2. Edge detection kernel:
# edge_detection_kernel = np.array([[-1, -1, -1],
#                                   [0, 0, 0],
#                                   [1, 1, 1]])


# apply_convolution(image_file, edge_detection_kernel)
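
To see what this kernel responds to, the sketch below applies it to a tiny synthetic image with a single horizontal edge. It uses plain NumPy rather than OpenCV; like `cv2.filter2D`, it slides the kernel over the image without flipping it. The 6x6 test image is invented for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

edge_detection_kernel = np.array([[-1, -1, -1],
                                  [ 0,  0,  0],
                                  [ 1,  1,  1]], dtype=float)

# 6x6 test image: dark top half, bright bottom half (one horizontal edge).
img = np.zeros((6, 6))
img[3:, :] = 255.0

# Slide the kernel over every 3x3 window (no padding, so the output is 4x4).
windows = sliding_window_view(img, (3, 3))  # shape (4, 4, 3, 3)
response = np.einsum("ijkl,kl->ij", windows, edge_detection_kernel)
print(response)
```

Because the kernel's weights sum to zero, flat regions give a response of zero; only the rows where the window straddles the dark-to-bright transition produce a non-zero value.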

### Blurring

In [None]:
# # 3. Blur kernel (average):
# blur_kernel = np.ones((15, 15), np.float32) / 225

# apply_convolution(image_file, blur_kernel)


## CNNs

* https://saturncloud.io/blog/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way/
* https://g.co/gemini/share/e9ebc46308d8  

In [None]:
# from diffusers import DDPMPipeline

# ddpm = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256").to("cuda")
# image = ddpm(num_inference_steps=30).images[0]
# image

When you see "DDPM," it refers to "Denoising Diffusion Probabilistic Models." Here's a breakdown of what that means in the context of image generation and the code above:

**Denoising Diffusion Probabilistic Models (DDPMs):**

* **Core Idea:**
    * DDPMs are a type of generative model that learn to create images (or other data) by reversing a process that gradually adds noise to the data.
    * Imagine starting with a clear image and progressively adding more and more noise until it becomes pure random noise.
    * A DDPM learns to reverse this process, starting with random noise and gradually removing the noise to reconstruct a clear image.
* **Key Processes:**
    * **Forward Diffusion:**
        * This is the process of adding noise to the image over a series of steps.
        * Each step adds a little more Gaussian noise, eventually turning the image into pure noise.
    * **Reverse Diffusion:**
        * This is the generative process.
        * The model learns to predict and remove the noise at each step, gradually revealing the underlying image.
        * This process is iterative, meaning it repeats many times to refine the image.

**Understanding `image = ddpm(num_inference_steps=30).images[0]`:**

* **`ddpm`:**
    * This represents a DDPM pipeline or model that has been loaded, likely from the Hugging Face `diffusers` library.
    * It encapsulates the model and the scheduler needed to perform the diffusion process.
* **`num_inference_steps=30`:**
    * This parameter specifies the number of steps to take during the reverse diffusion process (denoising).
    * A higher number of steps generally leads to higher-quality images but takes longer to generate.
    * In this case, the pipeline runs 30 denoising steps.
* **`.images[0]`:**
    * The `ddpm` pipeline, when executed, produces a set of generated images.
    * `.images` is the list of generated images (PIL images by default).
    * `[0]` selects the first image from that set.
* **`image`:**
    * The variable `image` now holds the generated image data, which can then be displayed or further processed.

**In simpler terms:**

The code tells the DDPM model to:

1.  Start with random noise.
2.  Perform 30 steps of denoising.
3.  Give me the resulting image.

Therefore, the variable "image" will contain the generated image that was created by the DDPM model.
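
The same call can also produce several images at once and be made repeatable. The sketch below wraps it in a hypothetical helper, `generate_faces` (not part of the notebook or the library), and relies on the `batch_size` and `generator` arguments of the pipeline call. It assumes a CUDA GPU and downloads the model weights on first use, so the call itself is left commented out.

```python
def generate_faces(num_images=2, steps=30, seed=0, device="cuda"):
    """Generate several images reproducibly; returns a list of PIL images."""
    import torch
    from diffusers import DDPMPipeline

    ddpm = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256").to(device)
    generator = torch.Generator(device="cpu").manual_seed(seed)  # fixed seed -> repeatable starting noise
    result = ddpm(batch_size=num_images, num_inference_steps=steps, generator=generator)
    return result.images  # one PIL image per batch element

# images = generate_faces()
# images[0]   # first image; images[1] is the second
```

With `num_images=2`, indexing `[0]` and `[1]` selects the two generated images, which makes the role of `.images[0]` in the original line clearer.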


Let's break down how a diffusion pipeline works, focusing on the `UNet2DModel` and `DDPMScheduler`, and then explain how to reconstruct that process manually.

**Understanding the Pipeline (Simplified):**

Imagine you have a blurry image that you want to sharpen. A diffusion pipeline does something similar, but it starts with *pure noise* and gradually turns it into a clear image.

The pipeline is a pre-built system that generates images by repeatedly removing noise. It uses two main components:

1.  **`UNet2DModel` (The 'Painter'):** This is the core neural network. It's like an artist who can look at a noisy image and figure out what the noise looks like. It essentially predicts the 'direction' we need to move in to get a less noisy image.
2.  **`DDPMScheduler` (The 'Guide'):** This component controls the denoising process. It tells the 'painter' how much noise to remove at each step. It's like a guide that says, 'Remove this much noise now, then this much next time.'

Here's how they work together:

* Start with random noise (an image with no recognizable content).
* The `UNet2DModel` analyzes the noisy image and estimates what the noise looks like (the 'noise residual').
* The `DDPMScheduler` uses this estimate to calculate a slightly less noisy version of the image.
* This process repeats many times. Each time, the image gets a little less noisy and more detailed.
* After a certain number of steps, the image is clear.

**In more technical terms:**

* The `UNet2DModel` is a specific type of neural network architecture designed for image processing. It's particularly good at finding patterns in noisy images.
* The `DDPMScheduler` implements the noise schedule, which is a key part of the diffusion process. It dictates how much noise is added or removed at each timestep.
* The term "noise residual" refers to the model's prediction of the noise that was added at that given timestep.
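
What the scheduler does with the noise residual can be sketched in a few lines of NumPy. This follows the update rule from the DDPM paper under an assumed linear schedule, with the added-noise variance set to `betas[t]` (one of the fixed choices discussed there). It is an illustration, not the library's `DDPMScheduler.step`, which differs in details such as its variance options and optional clipping. The function name `reverse_step` is made up here.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

def reverse_step(x_t, noise_pred, t, rng):
    """One DDPM denoising step: from the sample at step t to a sample one step earlier."""
    # Remove the scaled noise estimate, then rescale.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alphas_cumprod[t]) * noise_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # last step: no fresh noise is added
    fresh_noise = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * fresh_noise

# Sanity check with a "perfect" noise prediction at the final step.
x0 = rng.uniform(-1.0, 1.0, size=(3, 4, 4))
true_noise = rng.standard_normal(x0.shape)
x_noisy = np.sqrt(alphas_cumprod[0]) * x0 + np.sqrt(1.0 - alphas_cumprod[0]) * true_noise
x0_hat = reverse_step(x_noisy, true_noise, t=0, rng=rng)
```

With a perfect prediction at the last step, the clean image is recovered exactly. At every other step a little fresh noise is added back, which is why the process takes many small steps rather than one large one.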

**Recreating the Pipeline Manually:**

The example then shows how to do the same thing as the pipeline, but by loading the `UNet2DModel` and `DDPMScheduler` directly and writing the denoising loop yourself. This gives you more control over the process.

**Why do this?**

* **Understanding:** It helps you understand how the pipeline works under the hood.
* **Customization:** It allows you to modify the denoising process, such as using a different scheduler or changing the number of steps.
* **Research:** It's essential for developing new diffusion techniques.

**In essence:**

The pipeline simplifies the process of generating images with diffusion models. By manually reconstructing the pipeline, you gain a deeper understanding of the underlying mechanisms and unlock greater flexibility.


In [None]:
# from diffusers import UNet2DModel, DDPMScheduler

# repo_id = "google/ddpm-church-256"
# scheduler = DDPMScheduler.from_pretrained(repo_id)
# model = UNet2DModel.from_pretrained(repo_id).to("cuda")  # move the model to the GPU with .to()

In [None]:
# scheduler.set_timesteps(50)
# scheduler.timesteps

In [None]:
# import torch

# sample_size = model.config.sample_size
# noise = torch.randn((1, 3, sample_size, sample_size), device=model.device)  # keep the noise on the model's device

# for t in scheduler.timesteps:
#     with torch.no_grad():
#         noisy_residual = model(noise, t).sample

#     previous_noisy_sample = scheduler.step(noisy_residual, t, noise).prev_sample
#     noise = previous_noisy_sample

**Code Breakdown:**

```python
import torch

sample_size = model.config.sample_size
noise = torch.randn((1, 3, sample_size, sample_size), device=model.device)

for t in scheduler.timesteps:
    with torch.no_grad():
        noisy_residual = model(noise, t).sample

    previous_noisy_sample = scheduler.step(noisy_residual, t, noise).prev_sample
    noise = previous_noisy_sample
```

**Explanation:**

1.  **`import torch`:**
    * Imports the PyTorch library, which is essential for tensor operations and neural network computations.

2.  **`sample_size = model.config.sample_size`:**
    * Retrieves the size of the images that the model is designed to generate.
    * `model.config` accesses the configuration settings of the `UNet2DModel`.
    * `sample_size` stores the width and height of the image (assuming square images).

3.  **`noise = torch.randn((1, 3, sample_size, sample_size), device=model.device)`:**
    * Creates a tensor filled with random Gaussian noise.
    * `(1, 3, sample_size, sample_size)` defines the shape of the tensor:
        * `1`: Batch size (generating a single image).
        * `3`: Number of color channels (RGB).
        * `sample_size, sample_size`: Height and width of the image.
    * `device=model.device` creates the noise on the same device as the model, so the two can be used together in the forward pass.
    * This represents the starting point of the denoising process: a completely noisy image.

4.  **`for t in scheduler.timesteps:`:**
    * Starts a loop that iterates through the timesteps defined by the `DDPMScheduler`.
    * Each timestep `t` represents a specific level of noise in the diffusion process.
    * `scheduler.timesteps` is an array of timesteps that the scheduler uses.

5.  **`with torch.no_grad():`:**
    * Disables gradient calculations within the block.
    * This is done because we're performing inference (generating an image), not training the model. We don't need to compute gradients.
    * This saves memory and speeds up the process.

6.  **`noisy_residual = model(noise, t).sample`:**
    * Performs a forward pass through the `UNet2DModel`.
    * `noise`: The current noisy image.
    * `t`: The current timestep.
    * `model(noise, t)`: Passes the noise and the timestep to the model.
    * `.sample`: Retrieves the model's prediction of the noise residual (the estimated noise that was added at this timestep).

7.  **`previous_noisy_sample = scheduler.step(noisy_residual, t, noise).prev_sample`:**
    * Uses the `DDPMScheduler` to calculate the less noisy image at the previous timestep.
    * `scheduler.step(noisy_residual, t, noise)`:
        * `noisy_residual`: The model's noise prediction.
        * `t`: The current timestep.
        * `noise`: The current noisy image.
        * This function applies the denoising step, using the noise prediction and the scheduler's logic.
    * `.prev_sample`: Extracts the denoised image from the scheduler's output.

8.  **`noise = previous_noisy_sample`:**
    * Updates the `noise` variable with the denoised image.
    * This denoised image becomes the input for the next iteration of the loop.
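
The `scheduler.timesteps` array from step 4 can be reproduced by hand. The sketch below assumes 1,000 training timesteps (the `DDPMScheduler` default) and evenly spaced inference steps; the exact spacing is a scheduler configuration detail, so treat this as an illustration of `set_timesteps(50)` rather than its implementation.

```python
import numpy as np

num_train_timesteps = 1000   # assumed training length
num_inference_steps = 50

step = num_train_timesteps // num_inference_steps
timesteps = (np.arange(num_inference_steps) * step)[::-1]  # high noise first, ending at 0
print(timesteps[:3], timesteps[-3:])
```

The loop therefore visits every 20th training timestep, moving from the noisiest level down to 0.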

**In essence, this code:**

* Starts with pure noise.
* Iteratively removes noise using the `UNet2DModel` and `DDPMScheduler`.
* Repeats this process for each timestep, gradually generating a clearer image.
* The `for` loop is the reverse diffusion process.
* The model predicts the noise, and the scheduler uses that prediction to take a step backwards in the noise schedule.
* The final noise variable, after the for loop completes, will contain the generated image.


In [None]:
# from PIL import Image
# import numpy as np

# # Normalize the image data
# image = (noise / 2 + 0.5).clamp(0, 1).squeeze()
# # Change the shape and type for image conversion
# image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()

# image = Image.fromarray(image)
# image
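
The normalization in this cell is easier to follow on a tiny array. The sketch below mirrors the same arithmetic in NumPy instead of PyTorch; the helper name `to_uint8_image` and the test values are made up for illustration.

```python
import numpy as np

def to_uint8_image(sample):
    """Map a (channels, height, width) array in [-1, 1] to (height, width, channels) uint8."""
    scaled = np.clip(sample / 2 + 0.5, 0.0, 1.0)  # [-1, 1] -> [0, 1], out-of-range values clipped
    hwc = np.transpose(scaled, (1, 2, 0))         # channels-first -> channels-last
    return np.round(hwc * 255).astype(np.uint8)

sample = np.stack([
    np.full((2, 2), -1.0),   # channel 0: minimum value
    np.full((2, 2), 0.0),    # channel 1: midpoint
    np.full((2, 2), 1.7),    # channel 2: above range, will be clipped
])
pixels = to_uint8_image(sample)
```

The model works with values centered on zero, while image files expect integers from 0 to 255 with the channel axis last, so both the value range and the axis order have to change before `Image.fromarray` can be used.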

In [None]:
# from PIL import Image
# import torch
# from transformers import CLIPTextModel, CLIPTokenizer
# from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

# vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True)
# tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
# text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True)
# unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True)

This Python code snippet sets up the core components of a Stable Diffusion model, a popular text-to-image generation system. Let's break it down line by line:

**1. Importing Libraries:**

```python
from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
```

* **`from PIL import Image`**: Imports the `Image` module from the Pillow (PIL) library, used for working with image files (loading, saving, displaying).
* **`import torch`**: Imports the PyTorch library, a deep learning framework used for building and running neural networks.
* **`from transformers import CLIPTextModel, CLIPTokenizer`**: Imports specific classes from the Hugging Face `transformers` library:
    * `CLIPTextModel`: Represents the text encoder part of the CLIP (Contrastive Language–Image Pre-training) model, which converts text prompts into numerical embeddings.
    * `CLIPTokenizer`: Used to tokenize text prompts, breaking them down into smaller units (tokens) that the CLIP model can understand.
* **`from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler`**: Imports classes from the Hugging Face `diffusers` library, designed for diffusion models:
    * `AutoencoderKL`: Represents the variational autoencoder (VAE), which compresses images into a lower-dimensional latent space and reconstructs them.
    * `UNet2DConditionModel`: Represents the U-Net, the core neural network that iteratively denoises latent representations to generate images.
    * `PNDMScheduler`: Implements the PNDM (Pseudo Numerical Methods for Diffusion Models) scheduler, which controls the denoising process.

**2. Loading Pre-trained Models:**

```python
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True)
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True)
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True)
```

* **`from_pretrained("CompVis/stable-diffusion-v1-4", ...)`**: This is the key function from the Hugging Face libraries. It loads pre-trained model weights from the specified repository ("CompVis/stable-diffusion-v1-4" in this case).
* **`subfolder="vae"`, `subfolder="tokenizer"`, `subfolder="text_encoder"`, `subfolder="unet"`**: These arguments specify which subfolder within the repository to load the model from. This is how Stable Diffusion is split into its component parts.
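* **`use_safetensors=True`**: This is a security and efficiency measure. The safetensors format stores only tensor data, so loading a file cannot execute arbitrary code the way unpickling a traditional PyTorch checkpoint can. It has become the usual format for model weights on the Hugging Face Hub.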

**In essence, this code does the following:**

1.  **Imports necessary libraries** for image processing, deep learning, and diffusion models.
2.  **Downloads and loads the pre-trained weights** for the VAE, tokenizer, text encoder, and U-Net components of Stable Diffusion from the Hugging Face Model Hub.

These loaded components are the building blocks for generating images from text prompts. The subsequent steps in a Stable Diffusion pipeline would typically involve:

* Tokenizing and encoding the text prompt using the `tokenizer` and `text_encoder`.
* Generating a noisy latent representation.
* Iteratively denoising the latent representation using the `unet` and a scheduler (`PNDMScheduler` is imported here; the next cell uses `UniPCMultistepScheduler` instead).
* Decoding the final latent representation into an image using the `vae`.


In [None]:
# from diffusers import UniPCMultistepScheduler

# scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")

# torch_device = "cuda"
# vae.to(torch_device)
# text_encoder.to(torch_device)
# unet.to(torch_device)

In [None]:
# prompt = ["cartoon cat playing upright bass"]
# height = 512  # default height of Stable Diffusion
# width = 512  # default width of Stable Diffusion
# num_inference_steps = 25  # Number of denoising steps
# guidance_scale = 7.5  # Scale for classifier-free guidance

In [None]:
# batch_size = len(prompt)

# text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")

# with torch.no_grad():
#     text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

# max_length = text_input.input_ids.shape[-1]
# uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
# with torch.no_grad():
#     uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]

# text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

In [None]:
# from tqdm.auto import tqdm

# latents = torch.randn((batch_size, unet.config.in_channels, height // 8, width // 8), device=torch_device)
# latents = latents * scheduler.init_noise_sigma

# scheduler.set_timesteps(num_inference_steps)

# for t in tqdm(scheduler.timesteps):
#     latent_model_input = torch.cat([latents] * 2)
#     latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

#     with torch.no_grad():
#         noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

#     noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
#     noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

#     latents = scheduler.step(noise_pred, t, latents).prev_sample
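
The guidance line in this loop is plain arithmetic, which the NumPy sketch below isolates. The function name `apply_guidance` and the three-element "predictions" are invented for illustration; in the real loop these are full noise tensors produced by the U-Net.

```python
import numpy as np

def apply_guidance(noise_pred_uncond, noise_pred_text, guidance_scale):
    """Push the prediction away from the unconditional one, toward the text-conditioned one."""
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

uncond = np.array([0.0, 1.0, 2.0])  # toy "empty prompt" prediction
text = np.array([1.0, 1.0, 0.0])    # toy "with prompt" prediction

no_guidance = apply_guidance(uncond, text, guidance_scale=1.0)  # equals the text prediction
strong = apply_guidance(uncond, text, guidance_scale=7.5)       # exaggerates the difference
```

A scale of 1 simply returns the text-conditioned prediction, while larger values extrapolate past it. This is why the latents are duplicated with `torch.cat([latents] * 2)`: one copy is run with the empty prompt and one with the real prompt.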

In [None]:
# latents = 1 / 0.18215 * latents  # undo the latent scaling factor before decoding
# with torch.no_grad():
#     image = vae.decode(latents).sample

# image = (image / 2 + 0.5).clamp(0, 1).squeeze()
# image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
# image = Image.fromarray(image)
# image

## Key Concepts

* Diffusers Library: This is a library from Hugging Face designed to make it easier to work with diffusion models, including the text-to-image models used in this notebook.
   
* Image Data Representation: Images are fundamentally represented as arrays of pixels. Grayscale images use 2D arrays, with pixel values ranging typically from 0 to 1 or 0 to 255. Color images use the RGB model, where each pixel is defined by a combination of red, green, and blue values.
   
* Color Channels: The RGB color model uses red, green, and blue channels. Additionally, an optional fourth channel, alpha, is sometimes used to represent transparency.
   
* NumPy for Image Data: NumPy is a Python library that can be used to process image data, including converting images into arrays and manipulating color channels.
   
* OpenCV (cv2): OpenCV is a library used for image processing, capable of loading and displaying images. It's important to note that OpenCV uses BGR color order by default, whereas matplotlib uses RGB.
   
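* Image Generation Models: In the text-to-image setting covered here, these are models that take a text string as input and generate an image as output. Unconditional models, such as the DDPM examples above, generate images without any text input.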
   
* Diffusion Process: This is a technique used by image generation models to create images from noisy inputs by reversing a corruption process.
   
* Text Embedding: This refers to the creation of a vector representation of text, which is used in the early stages of image generation.
   
* CLIP (Contrastive Language-Image Pre-training): CLIP is a model designed to understand the relationship between text and images. It is trained to maximize similarity between correct image-caption pairs and minimize similarity between incorrect pairs.
   
* Decoder (Diffusion Model/unCLIP): This component generates an image from a prior embedding by reversing a noising process.
   
* Auto Pipelines: This is a feature of the Diffusers library that simplifies the process of loading and using pre-trained models by attempting to automatically detect the correct internal pipelines.
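
Several of the image-representation points above can be seen directly on a tiny array. The sketch below uses a hand-made 2x2 image, so the pixel values are arbitrary illustration data, and the grayscale conversion is a simple unweighted average rather than the weighted formula image libraries typically use.

```python
import numpy as np

# A 2x2 RGB image: red, green / blue, white.
rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

bgr = rgb[:, :, ::-1]      # reverse the channel axis: RGB <-> BGR
gray = rgb.mean(axis=2)    # unweighted average of the channels -> 2D array
alpha = np.full(rgb.shape[:2] + (1,), 255, dtype=np.uint8)
rgba = np.concatenate([rgb, alpha], axis=2)  # add a fully opaque alpha channel

print(rgb.shape, gray.shape, rgba.shape)
```

Reversing the last axis is all that separates OpenCV's BGR order from the RGB order Matplotlib expects, which is why an unconverted OpenCV image shows red and blue swapped.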

## Diffusion Process

The diffusion process, step by step:

* The diffusion model is trained to reverse a fixed corruption process.
   
* The corruption process involves adding small amounts of Gaussian noise to an image, gradually erasing information from it.
   
* By the final step of this process, the image becomes indistinguishable from pure noise.
   
* The diffusion model is trained to reverse this process, learning to regenerate the information that was erased at each step.
   
* In simpler terms, the model starts with a noisy image and works to remove the noise, aiming to produce an image that aligns with the given text embedding.
   
* In unCLIP-style text-to-image systems, there are two main stages:
   * The prior stage takes the text embedding of the caption and generates a CLIP image embedding that describes the gist of the image.
   * The decoder stage, also known as unCLIP, is the diffusion model that generates the image from this embedding.
   
* The unCLIP decoder receives a corrupted version of the image it’s trained to reconstruct, along with the CLIP image embedding of the clean image.
   
* After these two stages, up-sampling may be performed on the image to achieve higher resolution.
   
* Early versions of Stable Diffusion were trained on 512x512 pixel images, so higher-resolution outputs required a separate upscaling model.
   
* Newer models are trained on larger images, such as 1024x1024 pixels, and their outputs can be upscaled further, for example to 4K or 8K resolutions.