# StyleAligned: Zero-Shot Style Alignment among a Series of Generated Images via Attention Sharing

### **Authors**: ***Borgi Alessio***, ***Danese Francesco***

### **Abstract**
In this notebook we aim to reproduce and enhance **[StyleAligned](https://arxiv.org/abs/2312.02133)**, a novel technique introduced by **Google Research**, for achieving **Style Consistency** in large-scale Text-to-Image (T2I) generative models. While current T2I models excel in creating visually compelling images from textual descriptions, they often struggle to maintain a consistent style across multiple images. Traditional methods to address this require extensive fine-tuning and manual intervention. 

**StyleAligned** addresses this challenge by introducing minimal **Attention Sharing** during the **Diffusion Process**, ensuring **Style Alignment among generated images** without the need for optimization or fine-tuning (**Zero-Shot Inference**). The method can also apply the style of a reference image across the generated images through a straightforward inversion operation, while maintaining high-quality synthesis and fidelity to the provided text prompts.

### 0: SETTINGS & IMPORTS

#### 0.1: CLONE REPOSITORY AND GIT SETUP

In the following cell, we set up the code by cloning the repository and configuring Git. The commented-out commands for staging, committing, and pushing changes are kept for convenience.

In [None]:
# Clone the repository
!git clone https://github.com/alessioborgi/StyleAlignedDiffModels.git

# Change directory to the cloned repository
%cd StyleAlignedDiffModels
%ls

# Set up Git configuration
!git config --global user.name "Alessio Borgi"
!git config --global user.email "alessioborgi3@gmail.com"

# Stage the changes
#!git add .

# Commit the changes
#!git commit -m "Added some content to your-file.txt"

# Push the changes (replace 'your-token' with your actual personal access token)
#!git push origin main

#### 0.2: INSTALL AND IMPORT REQUIRED LIBRARIES

We then install and import the required libraries.

In [None]:
# Install the required packages
!pip install -r requirements.txt > /dev/null

In [None]:
from diffusers import StableDiffusionXLPipeline, DDIMScheduler
import torch
import mediapy
import sa_handler
from diffusers.utils import load_image
import inversion
import numpy as np

### 4: DDIM SCHEDULER

We then proceed to load the **SDXL (Stable Diffusion XL)** Model and configure the **DDIM (Denoising Diffusion Implicit Models) Scheduler**.

The **DDIM Scheduler** is the component used in diffusion models for generating high-quality samples from noise. It controls the denoising process by defining a schedule for adding and removing noise to and from the data. The scheduler is essential in determining how the model transitions from pure noise to a final, coherent image or other data form.

In particular, its parameters are:
- **beta_start (float)**: Starting value of beta, the variance of the noise added per step in the noise schedule.
- **beta_end (float)**: Ending value of beta.
- **beta_schedule (str)**: The type of schedule for beta. (Possible values for `DDIMScheduler`: "linear", "scaled_linear", "squaredcos_cap_v2").
- **clip_sample (bool)**: If True, the predicted samples are clipped to [-1, 1].
- **set_alpha_to_one (bool)**: If True, the final denoising step uses a cumulative alpha of 1 for the "previous" timestep; otherwise it uses the cumulative alpha of timestep 0.
- **num_train_timesteps (int)**: The number of diffusion steps used during training.
- **timestep_spacing (str)**: The method to space out the inference timesteps. (Possible values: "linspace", "leading", "trailing").
- **prediction_type (str)**: The quantity the model is trained to predict. (Possible values: "epsilon", "sample", "v_prediction". Default: "epsilon").
- **trained_betas (array-like or None)**: Optional array of betas passed directly to the scheduler, bypassing `beta_start` and `beta_end`. Default: None.
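
To make the `timestep_spacing` options more concrete, the sketch below computes the inference timesteps for the "linspace" and "leading" modes in plain Python. It is a simplified illustration that assumes the behaviour of `DDIMScheduler.set_timesteps` in diffusers (up to floating-point rounding); the helper names `linspace_timesteps` and `leading_timesteps` are hypothetical and not part of any library.

```python
def linspace_timesteps(num_train_timesteps, num_inference_steps):
    """Evenly spaced timesteps from T-1 down to 0, endpoints included.

    Assumes num_inference_steps >= 2.
    """
    last = num_train_timesteps - 1
    n = num_inference_steps
    ascending = [round(last * i / (n - 1)) for i in range(n)]
    return ascending[::-1]


def leading_timesteps(num_train_timesteps, num_inference_steps, steps_offset=0):
    """Multiples of an integer stride starting at 0, so T-1 is usually not reached."""
    step_ratio = num_train_timesteps // num_inference_steps
    ascending = [i * step_ratio + steps_offset for i in range(num_inference_steps)]
    return ascending[::-1]


lin = linspace_timesteps(1000, 50)
lead = leading_timesteps(1000, 50)
print(lin[0], lin[-1])    # 999 0
print(lead[0], lead[-1])  # 980 0
```

With 1000 training steps and 50 inference steps, "linspace" starts denoising at the noisiest timestep (999), while "leading" starts at 980.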


### Diffusion Process

The diffusion process adds Gaussian noise to the data over a series of timesteps. Each step of this forward process is described by:

\[ q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{\alpha_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}) \]

where:
- \( \beta_t \) is the noise variance at timestep \( t \), and \( \alpha_t = 1 - \beta_t \) is the corresponding scaling term.
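
As a minimal sketch of the forward step above, the following plain-Python snippet draws one sample from \( q(\mathbf{x}_t | \mathbf{x}_{t-1}) \) for a scalar value. It assumes the usual relation \( \alpha_t = 1 - \beta_t \); the function name `forward_step` is hypothetical, and a real pipeline would apply the same arithmetic to latent tensors.

```python
import random


def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t) for a scalar x_prev."""
    alpha_t = 1.0 - beta_t                 # scaling term (assumed alpha_t = 1 - beta_t)
    noise = rng.gauss(0.0, 1.0)            # standard Gaussian noise
    return alpha_t ** 0.5 * x_prev + beta_t ** 0.5 * noise


rng = random.Random(0)
x = 1.0
for beta_t in [0.01, 0.02, 0.03]:          # three noising steps with growing variance
    x = forward_step(x, beta_t, rng)
print(x)
```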

### Reverse Process

The reverse process aims to recover the data by denoising it, and is given by:

\[ p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_{\theta}(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I}) \]

where:
- \( \mu_{\theta}(\mathbf{x}_t, t) \) is the predicted mean.
- \( \sigma_t \) is the standard deviation of the noise at timestep \( t \).

### Beta Schedule

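The beta values are scheduled over the \( T \) training timesteps from `beta_start` to `beta_end`, with \( t = 0, \dots, T-1 \). Common schedules are:
- **Linear**: 

\[ \beta_t = \beta_{\text{start}} + \frac{t}{T-1} \left(\beta_{\text{end}} - \beta_{\text{start}}\right) \]

- **Scaled Linear** (linear in \( \sqrt{\beta} \), then squared):

\[ \beta_t = \left( \sqrt{\beta_{\text{start}}} + \frac{t}{T-1} \left(\sqrt{\beta_{\text{end}}} - \sqrt{\beta_{\text{start}}}\right) \right)^2 \]

- **Sigmoid** (offered by `DDPMScheduler` in diffusers, with \( s_t \) linearly spaced in \( [-6, 6] \)):

\[ \beta_t = \beta_{\text{start}} + (\beta_{\text{end}} - \beta_{\text{start}}) \cdot \text{sigmoid}(s_t) \]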
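
The linear and scaled-linear schedules can be sketched in a few lines of plain Python. This is an illustration only: it assumes that the scaled-linear schedule interpolates linearly in \( \sqrt{\beta} \) and then squares the result, which is how diffusers computes it, and the helper names `linear_betas` and `scaled_linear_betas` are hypothetical.

```python
def linear_betas(beta_start, beta_end, num_steps):
    """Betas evenly spaced between beta_start and beta_end (both included)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def scaled_linear_betas(beta_start, beta_end, num_steps):
    """Evenly space sqrt(beta) between the endpoints, then square each value."""
    roots = linear_betas(beta_start ** 0.5, beta_end ** 0.5, num_steps)
    return [r * r for r in roots]


linear = linear_betas(0.00085, 0.012, 1000)
scaled = scaled_linear_betas(0.00085, 0.012, 1000)
print(linear[500], scaled[500])
```

Both schedules share the same endpoints, but the scaled-linear betas stay smaller in between, so noise is added more gently at intermediate timesteps.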

### Inference with DDIM

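During inference, a single denoising step can be written as:

\[ \mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \mathbf{\epsilon}_{\theta}(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \mathbf{\epsilon}_{\theta}(\mathbf{x}_t, t) + \sigma_t \mathbf{z} \]

where:
- \( \bar{\alpha}_t = \prod_{s \le t} \alpha_s \) is the cumulative product of the per-step scaling terms \( \alpha_s = 1 - \beta_s \).
- \( \mathbf{\epsilon}_{\theta}(\mathbf{x}_t, t) \) is the noise predicted by the model.
- \( \sigma_t \) is the standard deviation of the fresh noise \( \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \) injected at timestep \( t \). Deterministic DDIM sampling sets \( \sigma_t = 0 \), so the last term vanishes.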
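
The update rule can be illustrated with a small sketch of one deterministic step, i.e. with \( \sigma_t = 0 \). It assumes that the alpha terms in the update are the cumulative products of the noise schedule (named `alpha_bar` here); the function name `ddim_step` is hypothetical and this is not the diffusers implementation.

```python
def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (sigma_t = 0).

    Works on floats; the same arithmetic applies element-wise to arrays or tensors.
    """
    # Estimate of the clean sample implied by the current noise prediction
    x0_pred = (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5
    # Re-noise that estimate to the (less noisy) previous timestep
    return alpha_bar_prev ** 0.5 * x0_pred + (1.0 - alpha_bar_prev) ** 0.5 * eps


# Toy check: build x_t from a known clean value and noise, then step back
x0, eps = 2.0, 0.5
alpha_bar_t, alpha_bar_prev = 0.5, 0.9
x_t = alpha_bar_t ** 0.5 * x0 + (1.0 - alpha_bar_t) ** 0.5 * eps
x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
print(x_prev)
```

When the noise prediction is exact, as in this toy check, stepping all the way to `alpha_bar_prev = 1.0` returns the clean value `x0`.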

### Example Initialization

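Here is an example initialization of the DDIMScheduler with the parameters described above. The values are the ones used in this notebook rather than the library defaults (for instance, `beta_schedule` defaults to "linear" and `clip_sample` to `True`):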

```python
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    beta_start=0.00085,                 # Starting value of beta
    beta_end=0.012,                     # Ending value of beta
    beta_schedule="scaled_linear",      # Type of schedule for beta values
    clip_sample=False,                  # Whether to clip samples to a specified range
    set_alpha_to_one=False,             # Whether to set alpha to one at the end of the process
    num_train_timesteps=1000,           # Number of diffusion steps used during training
    timestep_spacing="linspace",        # Method to space out timesteps
    prediction_type="epsilon",          # Type of prediction model used in the scheduler
    trained_betas=None                  # Optional pre-trained beta values
)
```

In [None]:
# init models

# Configure the DDIM scheduler (DDIMScheduler is already imported above)
scheduler = DDIMScheduler(
    beta_start=0.00085,                 # Starting value of beta
    beta_end=0.012,                     # Ending value of beta
    beta_schedule="scaled_linear",      # Type of schedule for beta values
    clip_sample=False,                  # Whether to clip samples to a specified range
    set_alpha_to_one=False,             # Whether to set alpha to one at the end of the process
    num_train_timesteps=1000,           # Number of diffusion steps used during training
    timestep_spacing="linspace",        # Method to space out timesteps
    prediction_type="epsilon",          # Type of prediction model used in the scheduler
    trained_betas=None                  # Optional pre-trained beta values
)

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", 
    torch_dtype=torch.float16, 
    variant="fp16", 
    use_safetensors=True,
    scheduler=scheduler
).to("cuda")

handler = sa_handler.Handler(pipeline)
sa_args = sa_handler.StyleAlignedArgs(share_group_norm=False,
                                      share_layer_norm=False,
                                      share_attention=True,
                                      adain_queries=True,
                                      adain_keys=True,
                                      adain_values=False
                                     )

handler.register(sa_args)

### 5: RUNNING STYLE-ALIGNED WITH A SET OF PROMPTS WITHOUT REFERENCE IMAGE

In [None]:
# Run StyleAligned on the whole batch of prompts (attention is shared within the batch)

sets_of_prompts = [
  "a toy train. macro photo. 3d game asset",
  "a toy airplane. macro photo. 3d game asset",
  "a toy bicycle. macro photo. 3d game asset",
  "a toy car. macro photo. 3d game asset",
  "a toy boat. macro photo. 3d game asset",
]
images = pipeline(sets_of_prompts).images
mediapy.show_images(images)

In [None]:
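# Generate each prompt separately to reduce peak GPU memory.
# Note: attention sharing acts within a batch, so images generated one at a
# time do not exchange style information with each other.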
sets_of_prompts = [
  "a toy train. macro photo. 3d game asset",
  "a toy airplane. macro photo. 3d game asset",
  "a toy bicycle. macro photo. 3d game asset",
  "a toy car. macro photo. 3d game asset",
  "a toy boat. macro photo. 3d game asset",
]
# sets_of_prompts = [
#   "a hot air balloon, simple wooden statue",
#   "a friendly robot, simple wooden statue",
#   "a bull, simple wooden statue",
# ]
images = []
for prompt in sets_of_prompts:
    # Generate image for each prompt individually
    image = pipeline([prompt]).images[0]
    images.append(image)
    # Clear CUDA cache to free memory
    torch.cuda.empty_cache()
    
    # Print Memory summary
    # print(torch.cuda.memory_summary(device=None, abbreviated=False))
    
mediapy.show_images(images)

### 6: STYLE-ALIGNED WITH REFERENCE IMAGE 

#### 6.1: LOADING AND INVERTING REFERENCE IMAGE  
Load a reference image and perform the inversion process to extract latent representations.
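
To give an intuition for what inversion does, here is a toy sketch of DDIM inversion on scalar values: the deterministic DDIM update is run in the opposite direction, from the clean sample toward noise, and every intermediate latent is kept. This is an illustration under stated assumptions, not the implementation behind `inversion.ddim_inversion`; the names `invert` and `predict_eps` are hypothetical, and the constant noise predictor stands in for the diffusion model.

```python
def invert(x0, alpha_bars, predict_eps):
    """Walk from a clean sample toward noise, keeping every intermediate latent.

    alpha_bars: cumulative alphas of the visited timesteps, in decreasing order.
    predict_eps: stand-in for the model's noise prediction at the current latent.
    """
    latents = [x0]
    x, a_cur = x0, 1.0                      # a clean sample corresponds to alpha_bar = 1
    for a_next in alpha_bars:
        eps = predict_eps(x)
        # Clean-sample estimate implied by the current latent and noise prediction
        x0_pred = (x - (1.0 - a_cur) ** 0.5 * eps) / a_cur ** 0.5
        # Move that estimate to the next, noisier timestep
        x = a_next ** 0.5 * x0_pred + (1.0 - a_next) ** 0.5 * eps
        latents.append(x)
        a_cur = a_next
    return latents


trajectory = invert(2.0, [0.9, 0.5, 0.1], predict_eps=lambda x: 0.5)
print(trajectory)
```

Keeping the whole trajectory, rather than only the final noisy latent, makes it possible to reuse the intermediate latents of the reference image during later denoising.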

In [None]:
src_style = "medieval painting"
src_prompt = f'Man laying in a bed, {src_style}.'
image_path = './imgs/medieval-bed.jpeg'

num_inference_steps = 50
x0 = np.array(load_image(image_path).resize((1024, 1024)))
zts = inversion.ddim_inversion(pipeline, x0, src_prompt, num_inference_steps, 2)
mediapy.show_image(x0, title="input reference image", height=256)