<a href="https://colab.research.google.com/github/davidricardocr/sdxl-lora-fine-tuning/blob/main/sdxl_lora_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Diffusion XL LoRA Fine-Tuning Guide

Welcome to this guide on fine-tuning **Stable Diffusion XL (SDXL) with LoRA (Low-Rank Adaptation)**. In this notebook, we will walk through setting up the environment, executing the fine-tuning script, and loading the resulting weights for inference.

**Objective**: Our goal is to customize the SDXL model to generate images with specific styles or themes. To achieve this, we use LoRA, a parameter-efficient fine-tuning technique, making the process computationally feasible even on limited hardware.

In this tutorial, we will fine-tune the model using a **Naruto-themed dataset**, [`lambdalabs/naruto-blip-captions`](https://huggingface.co/datasets/lambdalabs/naruto-blip-captions), enabling you to generate your own Naruto-inspired images.


### Getting Started

We will start by downloading the setup files (`build.sh` and `config.yaml`) directly from the GitHub repository. These files will help us configure the environment and set the necessary parameters for fine-tuning.



In [None]:
# Download the build.sh script
!wget https://github.com/davidricardocr/sdxl-lora-fine-tuning/raw/main/build.sh -O build.sh

# Download the config.yaml file
!wget https://github.com/davidricardocr/sdxl-lora-fine-tuning/raw/main/config.yaml -O config.yaml

## Environment Setup
To simplify the environment setup, we have created a shell script (`build.sh`) that installs the necessary dependencies for SDXL fine-tuning. This includes cloning the `diffusers` repository, installing the specific requirements for the SDXL examples, and configuring tools to optimize computation.

Simply run the following cell to execute the setup:

In [None]:
# Execute the setup script
!bash build.sh
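
After the setup script finishes, a quick check that the main packages are importable can catch a failed install before a long training run starts. The sketch below is a minimal example; the package list is an assumption based on the flags used later in this notebook, not a record of what `build.sh` installs, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this runtime."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list; adjust it to match what build.sh actually installs.
required = ["torch", "diffusers", "accelerate", "transformers", "bitsandbytes"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If anything is reported as missing, re-running the setup cell is usually the first thing to try.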

## Fine-Tuning with LoRA and `accelerate`

Now that the environment is ready, we can fine-tune SDXL with LoRA using the `train_text_to_image_lora_sdxl.py` example script from Hugging Face's `diffusers` repository. This script is designed specifically for text-to-image tasks and lets you fine-tune large models like Stable Diffusion XL.

The command below uses the `accelerate` library, which handles device placement, mixed precision, and distributed execution according to `config.yaml`. The script accepts a range of parameters that control model setup, data handling, optimization, and output management. Here is a brief overview of the ones used in this example:

- `--pretrained_model_name_or_path`: The base model to fine-tune. Here, we're using `stabilityai/stable-diffusion-xl-base-1.0`.
- `--dataset_name`: Specifies the dataset to use; in this case, a Naruto captioning dataset.
- `--caption_column`: The column in the dataset that provides the captions for image generation.
- `--resolution`: The resolution to which training images are resized. Higher values (e.g., 1024) preserve more detail at the cost of greater memory usage and longer training times. For resource-constrained environments, a resolution of 512 (used here) may be more practical, as it reduces memory demands while still providing acceptable quality.
- `--center_crop` & `--random_flip`: Crop each image around its center after resizing and randomly flip images horizontally, a basic augmentation that can improve model robustness.
- `--num_train_epochs`: The number of complete passes through the dataset. Increasing this value generally improves model performance by allowing it to learn more thoroughly, but excessive epochs may lead to overfitting, where the model performs well on training data but generalizes poorly to new inputs.
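- `--train_batch_size`: The number of samples per device in each training step. A batch size of 3 is used here to balance memory load and training speed.
- `--gradient_checkpointing` & `--gradient_accumulation_steps`: Memory-saving techniques. Gradient checkpointing recomputes activations during the backward pass instead of storing them, and gradient accumulation (4 steps here) sums gradients over several batches before each optimizer update.
- `--learning_rate`: The initial learning rate, set to 1e-4 here. A lower rate makes updates more conservative, which tends to stabilize fine-tuning at the cost of slower convergence.
- `--lr_scheduler`: The learning rate schedule. Here it is set to `cosine`, which gradually decays the learning rate over the course of training.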
- `--lr_warmup_steps`: The number of warmup steps for the learning rate. Here, it's set to 0, meaning no warmup phase.
- `--max_grad_norm`: Limits the maximum gradient norm, set to 1 to prevent unstable updates that could arise from large gradient values.
- `--output_dir`: Where fine-tuned weights will be saved.
- `--checkpointing_steps`: Frequency of saving model checkpoints (every 500 steps) to secure progress and allow for resuming if needed.
- `--validation_prompt`: A prompt for generating validation images during training, allowing for periodic checks on model output quality.
- `--mixed_precision="fp16"`: Enables 16-bit floating-point precision to reduce memory usage and accelerate processing.
- `--dataloader_num_workers`: Set to 8 to increase data loading efficiency during training.
- `--use_8bit_adam`: Uses the 8-bit Adam optimizer from `bitsandbytes`, which reduces optimizer memory usage, typically with little effect on training quality.
- `--seed`: Sets a random seed for reproducibility, ensuring that the training process can be replicated if needed.

[Full list of parameters](https://docs.google.com/spreadsheets/d/1NASo9Z_zmyQzcD1afpP6T2NhYy3sDxy_/edit?usp=drive_link&ouid=117205288755583780344&rtpof=true&sd=true)
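
To see how the batch size, gradient accumulation, epoch count, and checkpoint interval interact, the following sketch estimates the number of optimizer updates for a run. It is a rough approximation of how such training loops typically count steps, not a copy of the script's logic; `training_schedule` is a hypothetical helper, and the dataset size of 1200 is a placeholder, so check the dataset card for the real count.

```python
import math


def training_schedule(num_examples, train_batch_size, grad_accum_steps,
                      num_epochs, num_processes=1, checkpointing_steps=500):
    """Estimate optimizer updates for a run; all inputs are plain integers."""
    # Samples consumed per optimizer update across all processes
    effective_batch = train_batch_size * grad_accum_steps * num_processes
    # Batches each process sees per epoch (the last partial batch is kept)
    batches_per_epoch = math.ceil(num_examples / (train_batch_size * num_processes))
    # Optimizer updates per epoch once gradients are accumulated
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    total_updates = updates_per_epoch * num_epochs
    return {
        "effective_batch": effective_batch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        "checkpoints": total_updates // checkpointing_steps,
    }


# Placeholder dataset size for illustration only.
schedule = training_schedule(num_examples=1200, train_batch_size=3,
                             grad_accum_steps=4, num_epochs=20)
print(schedule)
```

With these placeholder numbers, each optimizer update sees 12 samples, so raising `--gradient_accumulation_steps` is one way to increase the effective batch size without using more memory per step.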

Now, we can run the following cell to start fine-tuning:


In [None]:
!accelerate launch --config_file config.yaml diffusers/examples/text_to_image/train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --dataset_name="lambdalabs/naruto-blip-captions" \
  --caption_column="text" \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --num_train_epochs=20 \
  --train_batch_size=3 \
  --gradient_checkpointing \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-04 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --max_grad_norm=1 \
  --output_dir="naruto-lora-weights" \
  --checkpointing_steps=500 \
  --validation_prompt="A man with blue eyes." \
  --mixed_precision="fp16" \
  --dataloader_num_workers=8 \
  --use_8bit_adam \
  --seed=42

## Loading LoRA Weights and Running Inference

Once fine-tuning is complete, we can load the LoRA weights into the Stable Diffusion XL pipeline and run inference. The `diffusers` library provides a simple way to do this with the `StableDiffusionXLPipeline`, which allows us to leverage the fine-tuned model for custom image generation.

The following code snippet loads the model and the weights.


In [None]:
from diffusers import StableDiffusionXLPipeline
import torch

# Path where the LoRA weights are saved
model_path = "naruto-lora-weights"

# Load the Stable Diffusion XL pipeline and set precision
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
)
pipe.to("cuda")  # Use GPU for faster inference

# Load the fine-tuned LoRA weights
pipe.load_lora_weights(model_path)
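
If loading fails, it may be because training did not finish and no final weights were written. The sketch below checks the output directory for a weights file. The filename `pytorch_lora_weights.safetensors` is an assumption based on the default used by recent `diffusers` releases, and `find_lora_weights` is a hypothetical helper, so adjust both if your setup differs.

```python
from pathlib import Path


def find_lora_weights(output_dir, filename="pytorch_lora_weights.safetensors"):
    """Return the path to the LoRA weights file in `output_dir`, or None if absent."""
    candidate = Path(output_dir) / filename
    return candidate if candidate.is_file() else None


# Same directory that was passed to --output_dir during training
weights = find_lora_weights("naruto-lora-weights")
print(weights if weights else "No LoRA weights found; has training finished?")
```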

Finally, you can generate an image using your model. Two key parameters in this process are `num_inference_steps` and `guidance_scale`:

* **num_inference_steps**: This parameter controls how many denoising steps the model takes to generate the image. Higher values typically lead to more detailed images, as the model has more iterations to refine the output, but generation time grows with the number of steps. Here, we've set it to 100, which favors image quality over speed.

* **guidance_scale**: This parameter influences how closely the generated image follows the prompt. A higher guidance scale means the model will adhere more strictly to the prompt details, though excessively high values can sometimes affect image coherence. In this case, a guidance scale of 10 helps ensure the image aligns well with the prompt while maintaining visual quality.
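
The effect of `guidance_scale` can be illustrated with the classifier-free guidance formula as it is commonly described: the final noise prediction starts from the unconditional prediction and moves toward the prompt-conditioned one, scaled by the guidance value. The sketch below uses plain Python lists with toy numbers; the real pipeline works on tensors, and `apply_guidance` is a hypothetical helper, not a `diffusers` function.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Combine unconditional and prompt-conditioned noise predictions."""
    # Move from the unconditional prediction toward the conditioned one;
    # a scale of 1.0 reproduces the conditioned prediction exactly.
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # toy values standing in for a noise prediction
cond = [1.0, 3.0]
for scale in (1.0, 5.0, 10.0):
    print(scale, apply_guidance(uncond, cond, scale))
```

Larger scales push the result further in the direction of the prompt, which is why very high values can overshoot and reduce image coherence.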


In [None]:
# Generate an image with the fine-tuned model
image = pipe(
    prompt="a man with dark hair and brown eyes",
    num_inference_steps=100,
    guidance_scale=10
).images[0]

# Display the generated image
image