# Scalable online preprocessing and model training pipeline for Stable Diffusion

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/stable-diffusion/end_to_end_architecture_v6.jpeg" width=1000px />

The preceding architecture diagram illustrates the online preprocessing and model training pipeline for Stable Diffusion. 

Ray Data loads the data from a remote storage system and streams it through the processing and training stages:
1. **Transformation**
   * Cropping and normalizing images.
   * Tokenizing the text caption using a CLIP tokenizer.
2. **Encoding**
   * Compressing images into a latent space using a VAE encoder.
   * Generating text embeddings using a CLIP model.
3. **Training**
   * Training a U-Net model on the image latents and text embeddings.
   * Generating model checkpoints and saving them to a remote storage system.
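
To make the streaming idea concrete, here is a minimal sketch of the three stages chained as plain Python generators. It is not the code in `end_to_end.py` and it does not use Ray Data. The function names, the toy "crop", the whitespace "tokenizer", and the arithmetic "encoders" are all stand-ins invented for illustration. The point it shows is that each row flows through transformation, encoding, and training without the whole dataset ever being materialized at once.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def load(num_rows: int) -> Iterator[Row]:
    """Stand-in for reading image/caption pairs from remote storage."""
    for i in range(num_rows):
        yield {"image": [i, i + 1, i + 2, i + 3], "caption": f"a photo number {i}"}


def transform(rows: Iterable[Row]) -> Iterator[Row]:
    """Stage 1 (toy): crop and normalize the image, tokenize the caption."""
    for row in rows:
        pixels = row["image"][:2]              # toy "crop": keep the first two values
        peak = max(pixels) or 1                # avoid dividing by zero
        yield {
            "image": [p / peak for p in pixels],   # toy normalization to [0, 1]
            "tokens": row["caption"].split(),      # toy whitespace tokenizer
        }


def _encode_batch(batch: List[Row]) -> Row:
    """Toy replacements for the VAE encoder and the CLIP text model."""
    return {
        "latents": [sum(r["image"]) / len(r["image"]) for r in batch],
        "text_embeddings": [float(len(r["tokens"])) for r in batch],
    }


def encode(rows: Iterable[Row], batch_size: int) -> Iterator[Row]:
    """Stage 2 (toy): group rows into batches and encode each batch."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield _encode_batch(batch)
            batch = []
    if batch:                                  # flush the final partial batch
        yield _encode_batch(batch)


def train(batches: Iterable[Row]) -> int:
    """Stage 3 (toy): consume encoded batches and count training steps."""
    steps = 0
    for batch in batches:
        assert len(batch["latents"]) == len(batch["text_embeddings"])
        steps += 1
    return steps


# Nothing runs until train() starts pulling batches through the chain.
steps = train(encode(transform(load(10)), batch_size=4))
print(f"training steps consumed: {steps}")
```

In the real pipeline, Ray Data plays the role of these generators across a cluster, so CPU stages and GPU stages can run on different workers while data streams between them.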

This notebook runs a self-contained module, `end_to_end.py`, that performs online preprocessing and training on a small subset of the full dataset of 2 billion samples to demonstrate the workload. You can parameterize the same module to process the full dataset. The **Running production-scale model training** section below summarizes the changes needed to scale the workload.

Run the following cell to perform the online preprocessing and model training. The script loads the data, transforms it, encodes it, and runs the model training. After the cell executes, view the generated model checkpoint files.

!python scripts/end_to_end.py

## Running production-scale model training

If you're looking to scale your Stable Diffusion pre-training with custom data, we're here to help 🙌!

👉 **[Check out this link](https://forms.gle/9aDkqAqobBctxxMa8) so we can assist you**.


The following table gives approximate guidance on the changes needed to scale the `end_to_end.py` script to the full dataset:

| Step | Change | Description |
| --- | --- | --- |
| 1 |  Raw data path | Point to the full dataset |
| 2 | Number of data loading workers | Increase to 192 CPUs |
| 3 | Number of transformation workers | Increase to 192 CPUs |
| 4 | Batch size | Set to 120 for 256x256 images, 40 for 512x512 images |
| 5 | Number of encoding workers | Increase to 48 A10G GPUs |
| 6 | Batch size | Set to 128 for 256x256 images, 32 for 512x512 images |
| 7 | Number of training workers | Increase to 32 A100-80GB GPUs |
| 8 | Model config | Use the full U-Net model |
| 9 | Distributed training strategy | Use FSDP, configured to run in `SHARD_GRAD_OP` mode |
| 10 | Output path | Change to a permanent path |
| 11 | Run the process script | Run as an Anyscale Job |
| 12 | Run the first phase of training | Use resolution 256x256 for a total of 550,000 steps |
| 13 | Run the second phase of training | Use resolution 512x512 for a total of 850,000 steps, loading the checkpoint from the first phase |
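
As an illustration of steps 10, 12, and 13, the sketch below captures the two training phases as a small configuration object in which the second phase resumes from the first phase's output. The class, field names, and the `s3://my-bucket/...` path are hypothetical and are not taken from `end_to_end.py`. The batch sizes are keyed by their step number in the table because the table does not say which stage each one applies to.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TrainingPhase:
    """Hypothetical container for one training phase from the table."""
    name: str
    resolution: int              # side length of the square images, in pixels
    total_steps: int             # step counts from table steps 12 and 13
    output_dir: str              # permanent checkpoint location (table step 10)
    resume_from: Optional[str]   # checkpoint directory to load, if any


# Batch sizes from table steps 4 and 6, indexed as [table step][resolution].
BATCH_SIZE_BY_TABLE_STEP: Dict[int, Dict[int, int]] = {
    4: {256: 120, 512: 40},
    6: {256: 128, 512: 32},
}


def build_phases(output_root: str) -> List[TrainingPhase]:
    """Build the two-phase schedule; phase 2 loads the phase 1 checkpoint."""
    phase_1 = TrainingPhase(
        name="phase_1",
        resolution=256,
        total_steps=550_000,
        output_dir=f"{output_root}/phase_1",
        resume_from=None,
    )
    phase_2 = TrainingPhase(
        name="phase_2",
        resolution=512,
        total_steps=850_000,
        output_dir=f"{output_root}/phase_2",
        resume_from=phase_1.output_dir,
    )
    return [phase_1, phase_2]


# The bucket path is a placeholder, not a real location.
phases = build_phases("s3://my-bucket/stable-diffusion")
for phase in phases:
    sizes = {step: by_res[phase.resolution]
             for step, by_res in BATCH_SIZE_BY_TABLE_STEP.items()}
    print(phase.name, phase.resolution, phase.total_steps, sizes, phase.resume_from)
```

Keeping both phases in one structure makes the dependency explicit: the second run cannot start until the first has written its checkpoint to the permanent path.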

For infrastructure, provision 48 `g5.2xlarge` instances and 4 `p4de.24xlarge` instances for the entire process, which together supply the 48 A10G and 32 A100-80GB GPUs listed in the table. Alternatively, use Anyscale's autoscaling capabilities to scale up and down as needed.
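
The instance counts can be checked against the table with a little arithmetic. The sketch below assumes the per-instance specifications published by AWS (one A10G GPU and 8 vCPUs per `g5.2xlarge`, eight A100-80GB GPUs and 96 vCPUs per `p4de.24xlarge`). The `InstanceType` class and variable names are made up for this example. How the script actually places CPU work across nodes is not stated in this notebook, so the vCPU comparison is only a rough capacity check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    """Hypothetical record of the per-instance resources assumed above."""
    name: str
    gpus: int
    vcpus: int


G5_2XLARGE = InstanceType("g5.2xlarge", gpus=1, vcpus=8)         # 1x A10G
P4DE_24XLARGE = InstanceType("p4de.24xlarge", gpus=8, vcpus=96)  # 8x A100-80GB

NUM_G5 = 48
NUM_P4DE = 4

a10g_gpus = NUM_G5 * G5_2XLARGE.gpus        # encoding workers (table step 5)
a100_gpus = NUM_P4DE * P4DE_24XLARGE.gpus   # training workers (table step 7)
g5_vcpus = NUM_G5 * G5_2XLARGE.vcpus        # CPUs available on the g5 fleet

# Table steps 2 and 3 each ask for 192 CPUs.
cpus_requested = 192 + 192

print(f"A10G GPUs: {a10g_gpus}")
print(f"A100-80GB GPUs: {a100_gpus}")
print(f"g5 vCPUs: {g5_vcpus} (loading + transformation request {cpus_requested})")
```

Under these assumptions the GPU counts match steps 5 and 7 exactly, and the `g5.2xlarge` fleet alone has as many vCPUs as the data loading and transformation stages request combined.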