From 8e9668bd190dd007314662813bc4877f915eae1d Mon Sep 17 00:00:00 2001 From: bghira Date: Fri, 12 Jan 2024 22:56:34 -0600 Subject: [PATCH 1/6] tutorial: update --- TUTORIAL.md | 63 ++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 43 insertions(+), 20 deletions(-) diff --git a/TUTORIAL.md b/TUTORIAL.md index f968305bd..4309fec8c 100644 --- a/TUTORIAL.md +++ b/TUTORIAL.md @@ -26,13 +26,13 @@ git clone --branch=release https://github.com/bghira/SimpleTuner ## Hardware Requirements -Ensure your hardware meets the requirements for the resolution and batch size you plan to use. High-end GPUs with more than 24G VRAM are generally recommended. +Ensure your hardware meets the requirements for the resolution and batch size you plan to use. High-end GPUs with more than 24G VRAM are generally recommended. For LoRA, 24G is more than enough - you can get by with a 12G or 16G GPU. More is better, but there's a threshold of diminishing returns around 24G for LoRA. -Although SimpleTuner has an option to `--fully_unload_text_encoder` and by default will unload the VAE during training, the base SDXL u-net consumes 12.5GB at idle. When the first forward pass runs, a 24G GPU will hit an Out of Memory condition, *even* with 128x128 training data. +**For full u-net tuning:** Although SimpleTuner has an option to `--fully_unload_text_encoder` and by default will unload the VAE during training, the base SDXL u-net consumes 12.5GB at idle. When the first forward pass runs, a 24G GPU will hit an Out of Memory condition, *even* with 128x128 training data. -This occurs with Adafactor, AdamW8Bit, Prodigy, and D-adaptation. +This occurs with Adafactor, AdamW8Bit, Prodigy, and D-adaptation due to a bug in PyTorch. Ensure you are using the **latest** 2.1.x release of PyTorch, which allows **full u-net tuning in ~22G of VRAM without DeepSpeed**. -40G GPUs can meaningfully train SDXL. +24G GPUs can meaningfully train SDXL, though 40G is the sweet spot for full fine-tune - an 80G GPU is pure heaven. ## Dependencies @@ -40,9 +40,9 @@ Install SimpleTuner as detailed in [INSTALL.md](/INSTALL.md) ## Training data -A publicly-available dataset is available [on huggingface hub](https://huggingface.co/datasets/ptx0/mj51-data). +A publicly-available dataset is available [on huggingface hub](https://huggingface.co/datasets/ptx0/pseudo-camera-10k). -Approximately 162GB of images are available in the `split_train` directory, although this format is not required by SimpleTuner. +Approximately 10k images are available in this repository with their caption as their filename, ready to be imported for use in SimpleTuner. You can simply create a single folder full of jumbled-up images, or they can be neatly organised into subdirectories. @@ -62,7 +62,7 @@ To summarise: - You want as high of a batch size as you can tolerate. - The larger you set `RESOLUTION`, the more VRAM is used, and the lower your batch size can be. -- A larger batch size requires more training data in each bucket, since each one **must** contain a minimum of that many images. +- A larger batch size requires more training data in each bucket, since each one **must** contain a minimum of that many images - a batch size of 8 means each bucket must have at least 8 images. - If you can't get a single iteration done with batch size of 1 and resolution of 128x128 on Adafactor or AdamW8Bit, your hardware just won't work. 
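To make the last two bullet points concrete, here is a minimal, illustrative Python sketch - it is not part of SimpleTuner, and the two-decimal aspect-ratio rounding is only a stand-in for the trainer's real bucketing logic. It walks a folder, groups images by approximate aspect ratio, and flags any bucket holding fewer images than your intended batch size. The `DATA_DIR` path and the extension list are assumptions for the example.

```python
# Illustrative only - not part of SimpleTuner. Groups images by rounded
# aspect ratio and reports buckets with fewer images than the batch size.
from collections import defaultdict
from pathlib import Path
from PIL import Image

DATA_DIR = Path("/path/to/data/tree")  # assumed location of your training images
BATCH_SIZE = 8                         # the per-GPU batch size you intend to use

buckets = defaultdict(list)
for path in DATA_DIR.rglob("*"):
    if not path.is_file() or path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
        continue
    with Image.open(path) as img:
        width, height = img.size
    # Round the aspect ratio to two decimals as a stand-in for bucket assignment.
    buckets[round(width / height, 2)].append(path)

for aspect, images in sorted(buckets.items()):
    marker = "OK       " if len(images) >= BATCH_SIZE else "TOO SMALL"
    print(f"{marker} bucket {aspect}: {len(images)} images")
```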
Which brings up the next point: **you should use as much high quality training data as you can acquire.**

@@ -70,23 +70,34 @@ Which brings up the next point: **you should use as much high quality training d

### Selecting images

- JPEG artifacts and blurry images are a no-go. The model **will** pick these up.
+- High-res images introduce their own problems - they must be downsampled to fit their aspect bucket, and this inherently damages their quality.
+- For high-quality photographs, grainy CMOS sensors can bake a lot of noise into the image in-camera. Too much of it and nearly every image your model produces will carry the same sensor noise.
- Same goes for watermarks and "badges", artist signatures. That will all be picked up effortlessly.
- If you're trying to extract frames from a movie to train from, you're going to have a bad time. Compression ruins most films - only the large 40+ GB releases are really going to be useful for improving image clarity.
+  - Using 1080p Blu-ray extractions really helps - 4k isn't absolutely required, but you're going to need to reduce expectations as to what kind of content will actually WORK.
+  - Anime content will generally work very well if it's minimally compressed, but live-action footage tends to look blurry.
- Image resolutions optimally should be divisible by 64.
  - This isn't **required**, but is beneficial to follow.
- Square images are not required, though they will work.
-  - The model might fail to generalise across aspect ratios if they are not seen during training. This means if you train on only square images, you might not get a very good widescreen effect when you are done.
-- The trainer will resize images so that the smaller side is equal to the value of `RESOLUTION`, while maintaining the aspect ratio.
-  - If your images all hover around a certain resolution, eg. `512x768`, `1280x720` and `640x480`, you might then set `RESOLUTION=640`, which would result in upscaling a minimal number of images during training time.
-  - If your images are all above a given base resolution, the trainer will downsample them to your base `RESOLUTION`
-- Your dataset should be **as varied as possible** to get the highest quality.
+  - If you train on ONLY square images or ONLY non-square images, you might not get a very good balance of capabilities in the resulting model.
- Synthetic data works great. This means AI-generated images, from either GAN upscaling or a different model entirely. Using outputs from a different model is called **transfer learning** and can be highly effective.
+  - Using ONLY synthetic data can harm the model's ability to generate more realistic details. A decent balance of regularisation images (eg. concepts that aren't your target) will help to maintain broad capabilities.
+- Your dataset should be **as varied as possible** to get the highest quality. It should be balanced across different concepts, unless heavily biasing the model is desired.

### Captioning

-SimpleTuner provides a [captioning](/toolkit/captioning/README.md) script that can be used to mass-rename files in a format that is acceptable to SimpleTuner.
+SimpleTuner provides multiple [captioning](/toolkit/captioning/README.md) scripts that can be used to mass-rename files in a format that SimpleTuner accepts.
+
+Options:
+
+- T5 Flan and BLIP2 produce mediocre captions; they can be very slow and resource-hungry.
+- LLaVA produces acceptable captions but misses subtle details.
+  - It is better than BLIP and can sometimes read text, but it invents details and speculates.
+  - Follows instruction templates better than CogVLM and BLIP.
+- CogVLM produces the best captions and requires the most time/resources.
+  - It still speculates, especially when given long instruct queries.
+  - It does not follow instruct queries very well.

-Currently, it uses T5 Flan and BLIP2 to produce high quality captions, though it can be very slow and resource hungry.

Other tools are available from third-party sources, such as Captionr.

@@ -101,7 +112,9 @@ Longer captions aren't necessarily better for training. Simpler, concise caption

Foundational models like Stable Diffusion are built using 10% caption drop-out, meaning the model is shown an "empty" caption instead of the real one, about 10% of the time. This ends up substantially improving the quality of generations, especially for prompts that involve subject matter that do not exist in your training data.

-In other words, caption drop-out will allow you to introduce a style or concept more broadly across the model. You might not want to use this at all if you really want to restrict your changes to just the captions you show the model during training.
+Disabling caption dropout can damage the model's ability to generalise to unseen prompts. Conversely, using too much caption dropout will damage the model's ability to adhere to prompts.
+
+A value of 25% seems to provide additional benefits, such as reducing the number of inference steps required on v-prediction models.

### Advanced Configuration

@@ -113,9 +126,17 @@ If `--report_to=wandb` is passed to the trainer (the default), it will ask on st

### Post-Training Steps

-You might not want to train all the way to the end once you realise your progress has been "good enough". At this point, it would be best to reduce `NUM_EPOCHS` to `1` and start another training run. This will in fact, not do any more training, but will simply export the model into the pipeline directory - assuming a single epoch has been hit yet. **This may not be the case for very large datasets**. You can switch to a small folder of files to force it to export.
+#### How do I end training early?
+
+You might not want to train all the way to the end.

-Once the training is complete, you can evaluate the model using [the provided evaluation script](/inference.py) or [other options in the inference toolkit](/toolkit/inference/inference_ddpm.py).
+At this point, reduce the `--max_train_steps` value to a number smaller than your current training step; this forces a pipeline export into your `output_dir`.
+
+#### How do I test the model without wandb (Weights & Biases)?
+
+You can evaluate the model using [the provided evaluation script](/inference.py) or [other options in the inference toolkit](/toolkit/inference/inference_ddpm.py).
+
+If you used `--push_to_hub`, the Hugging Face Diffusers SDXL example scripts will be usable with the same model name.

If you require a single 13GiB safetensors file for eg. AUTOMATIC1111's Stable Diffusion WebUI or for uploading to CivitAI, you should make use of the [SDXL checkpoint conversion script](/convert_sdxl_checkpoint.py):

@@ -151,6 +172,8 @@ This command will capture the output of your training run into `train.log`, loca

In each model checkpoint directory is a `tracker_state.json` file which contains the current epoch that training was on or the images it has seen so far.

+Each dataset will have its own tracking state documents in this directory as well.
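To go with the "How do I test the model without wandb" answer above, a minimal smoke test with the Diffusers library might look like the sketch below. It assumes the exported pipeline in your `OUTPUT_DIR` is a standard SDXL pipeline; the path, prompt, dtype, and sampler settings are placeholders to adjust for your own run.

```python
# Quick manual smoke test of an exported pipeline - adjust paths and prompt to taste.
import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "/path/to/output_dir",        # OUTPUT_DIR from your sdxl-env.sh
    torch_dtype=torch.bfloat16,   # or torch.float16, depending on your GPU
)
pipeline.to("cuda")

image = pipeline(
    prompt="a photograph of a cat girl wearing a raincoat",  # placeholder prompt
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("validation-sample.png")
```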
+ ### Example Environment File Explained @@ -185,8 +208,10 @@ Here's a breakdown of what each environment variable does: - `TRACKER_PROJECT_NAME` and `TRACKER_RUN_NAME`: Names for the tracking project on Weights and Biases. Currently, run names are non-functional. - `INSTANCE_PROMPT`: Optional prompt to append to each caption. This can be useful if you want to add a **trigger keyword** for your model's style to associate with. - Make sure the instance prompt you use is similar to your data, or you could actually end up doing harm to the model. + - Each dataset entry in `multidatabackend.json` can have its own `instance_prompt` set in lieu of using this main variable. - `VALIDATION_PROMPT`: The prompt used for validation. - Optionally, a user prompt library or the built-in prompt library may be used to generate more than 84 images on each checkpoint across a large number of concepts. + - See `--user_prompt_library` for more information. #### Data Locations @@ -194,8 +219,6 @@ Here's a breakdown of what each environment variable does: - `BASE_DIR` - Used for populating other variables, mostly. - `INSTANCE_DIR` - Where your actual training data is. This can be anywhere, it does not need to be underneath `BASE_DIR`. - `OUTPUT_DIR` - Where the model pipeline results are stored during training, and after it completes. -- `STATE_PATH`, `SEEN_STATE_PATH`: Paths for the training state and seen images. - - These can effectively be ignored, unless you want to make use of this data for integrations in eg. a Discord bot and need it placed in a particular location. #### Training Parameters @@ -210,4 +233,4 @@ Here's a breakdown of what each environment variable does: ## Additional Notes -For more details, consult the [INSTALL](/INSTALL.md) and [OPTIONS](/OPTIONS.md) documents. \ No newline at end of file +For more details, consult the [INSTALL](/INSTALL.md) and [OPTIONS](/OPTIONS.md) documents or the [DATALOADER](/documentation/DATALOADER.md) information page for specific details on the dataset config file. 
\ No newline at end of file From c143f21db2205f7f4a269afb41240b9079451ba3 Mon Sep 17 00:00:00 2001 From: bghira Date: Sat, 13 Jan 2024 12:03:36 -0600 Subject: [PATCH 2/6] diffusers/accelerate - improve support for torch.compile (huggingface/diffusers#6555) --- helpers/legacy/sd_files.py | 3 ++- helpers/training/wrappers.py | 7 +++++++ train_sdxl.py | 13 ++++++++++--- 3 files changed, 19 insertions(+), 4 deletions(-) create mode 100644 helpers/training/wrappers.py diff --git a/helpers/legacy/sd_files.py b/helpers/legacy/sd_files.py index ef8e7491a..f2c32ba4e 100644 --- a/helpers/legacy/sd_files.py +++ b/helpers/legacy/sd_files.py @@ -1,5 +1,6 @@ from transformers import PretrainedConfig from diffusers import UNet2DConditionModel +from helpers.training.wrappers import unwrap_model import os, logging, shutil @@ -56,7 +57,7 @@ def save_model_hook(models, weights, output_dir): shutil.rmtree(removing_checkpoint) sub_dir = ( "unet" - if isinstance(model, type(accelerator.unwrap_model(unet))) + if isinstance(model, type(unwrap_model(unet))) else "text_encoder" ) model.save_pretrained(os.path.join(output_dir, sub_dir)) diff --git a/helpers/training/wrappers.py b/helpers/training/wrappers.py new file mode 100644 index 000000000..b94cc903a --- /dev/null +++ b/helpers/training/wrappers.py @@ -0,0 +1,7 @@ +from diffusers.utils.torch_utils import is_compiled_module + + +def unwrap_model(accelerator, model): + model = accelerator.unwrap_model(model) + model = model._orig_mod if is_compiled_module(model) else model + return model diff --git a/train_sdxl.py b/train_sdxl.py index 44d0052fd..da378eaf4 100644 --- a/train_sdxl.py +++ b/train_sdxl.py @@ -25,6 +25,7 @@ from helpers.legacy.validation import prepare_validation_prompt_list, log_validations from helpers.training.state_tracker import StateTracker from helpers.training.deepspeed import deepspeed_zero_init_disabled_context_manager +from helpers.training.wrappers import unwrap_model from helpers.data_backend.factory import configure_multi_databackend from helpers.data_backend.factory import random_dataloader_iterator from helpers.caching.sdxl_embeds import TextEmbeddingCache @@ -256,12 +257,18 @@ def main(): # For mixed precision training we cast the text_encoder and vae weights to half-precision # as these models are only used for inference, keeping weights in full precision is not required. weight_dtype = torch.float32 - if accelerator.mixed_precision == "fp16": + if ( + accelerator.mixed_precision == "fp16" + and args.pretrained_vae_model_name_or_path is None + ): weight_dtype = torch.float16 logger.warning( f'Using "--fp16" with mixed precision training should be done with a custom VAE. Make sure you understand how this works.' ) - elif accelerator.mixed_precision == "bf16": + elif ( + accelerator.mixed_precision == "bf16" + and args.pretrained_vae_model_name_or_path is None + ): weight_dtype = torch.bfloat16 logger.warning( f'Using "--bf16" with mixed precision training should be done with a custom VAE. Make sure you understand how this works.' @@ -1376,7 +1383,7 @@ def main(): # Create the pipeline using the trained modules and save it. 
accelerator.wait_for_everyone() if accelerator.is_main_process: - unet = accelerator.unwrap_model(unet) + unet = unwrap_model(unet) if args.model_type == "lora": unet_lora_layers = convert_state_dict_to_diffusers( get_peft_model_state_dict(unet) From f730d9a015b7875b4d4dd70edfda64abf15031ff Mon Sep 17 00:00:00 2001 From: bghira Date: Sat, 13 Jan 2024 12:56:12 -0600 Subject: [PATCH 3/6] bghira/SimpleTuner#262 use_deepspeed_optimizer should tell us whether or not to save_state on all processes --- helpers/arguments.py | 6 +-- helpers/legacy/sd_files.py | 2 +- helpers/legacy/validation.py | 10 ++-- helpers/sdxl/save_hooks.py | 94 ++++++++++++++++++++++++------------ train_sdxl.py | 27 +++++------ 5 files changed, 87 insertions(+), 52 deletions(-) diff --git a/helpers/arguments.py b/helpers/arguments.py index 1dc95eb21..ac0abd05c 100644 --- a/helpers/arguments.py +++ b/helpers/arguments.py @@ -594,10 +594,10 @@ def parse_args(input_args=None): parser.add_argument( "--validation_torch_compile_mode", type=str, - default="reduce-overhead", - choices=["reduce-overhead", "default"], + default="max-autotune", + choices=["max-autotune", "reduce-overhead", "default"], help=( - "PyTorch provides different modes for the Torch Inductor when compiling graphs. reduce-overhead," + "PyTorch provides different modes for the Torch Inductor when compiling graphs. max-autotune," " the default mode, provides the most benefit." ), ) diff --git a/helpers/legacy/sd_files.py b/helpers/legacy/sd_files.py index f2c32ba4e..8cc906d9f 100644 --- a/helpers/legacy/sd_files.py +++ b/helpers/legacy/sd_files.py @@ -57,7 +57,7 @@ def save_model_hook(models, weights, output_dir): shutil.rmtree(removing_checkpoint) sub_dir = ( "unet" - if isinstance(model, type(unwrap_model(unet))) + if isinstance(model, type(unwrap_model(accelerator, unet))) else "text_encoder" ) model.save_pretrained(os.path.join(output_dir, sub_dir)) diff --git a/helpers/legacy/validation.py b/helpers/legacy/validation.py index bb86d5713..156b38e8c 100644 --- a/helpers/legacy/validation.py +++ b/helpers/legacy/validation.py @@ -1,8 +1,10 @@ import logging, os, torch, numpy as np from tqdm import tqdm from diffusers.utils import is_wandb_available +from diffusers.utils.torch_utils import is_compiled_module from helpers.image_manipulation.brightness import calculate_luminance from helpers.training.state_tracker import StateTracker +from helpers.training.wrappers import unwrap_model from helpers.prompts import PromptHandler from diffusers import ( AutoencoderKL, @@ -167,7 +169,7 @@ def log_validations( # The models need unwrapping because for compatibility in distributed training mode. 
pipeline = StableDiffusionXLPipeline.from_pretrained( args.pretrained_model_name_or_path, - unet=accelerator.unwrap_model(unet), + unet=unwrap_model(accelerator, unet), text_encoder=text_encoder_1, text_encoder_2=text_encoder_2, tokenizer=None, @@ -186,9 +188,11 @@ def log_validations( timestep_spacing=args.inference_scheduler_timestep_spacing, rescale_betas_zero_snr=args.rescale_betas_zero_snr, ) - if args.validation_torch_compile: + if args.validation_torch_compile and not is_compiled_module(pipeline.unet): pipeline.unet = torch.compile( - unet, mode=args.validation_torch_compile_mode, fullgraph=False + pipeline.unet, + mode=args.validation_torch_compile_mode, + fullgraph=False, ) pipeline = pipeline.to(accelerator.device) pipeline.set_progress_bar_config(disable=True) diff --git a/helpers/sdxl/save_hooks.py b/helpers/sdxl/save_hooks.py index 432cdecaa..03baebf5c 100644 --- a/helpers/sdxl/save_hooks.py +++ b/helpers/sdxl/save_hooks.py @@ -1,12 +1,16 @@ -from diffusers.training_utils import EMAModel +from diffusers.training_utils import EMAModel, _set_state_dict_into_text_encoder +from helpers.training.wrappers import unwrap_model from diffusers.loaders import LoraLoaderMixin -from diffusers.utils import convert_state_dict_to_diffusers -from peft import LoraConfig +from diffusers.utils import ( + convert_state_dict_to_diffusers, + convert_unet_state_dict_to_peft, +) +from peft import set_peft_model_state_dict from peft.utils import get_peft_model_state_dict from diffusers import UNet2DConditionModel, StableDiffusionXLPipeline from helpers.training.state_tracker import StateTracker -import os, logging, shutil, json +import os, logging, shutil, torch logger = logging.getLogger("SDXLSaveHook") logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL") or "INFO") @@ -14,7 +18,13 @@ class SDXLSaveHook: def __init__( - self, args, unet, ema_unet, text_encoder_1, text_encoder_2, accelerator + self, + args, + unet, + ema_unet, + text_encoder_1, + text_encoder_2, + accelerator, ): self.args = args self.unet = unet @@ -36,7 +46,7 @@ def save_model_hook(self, models, weights, output_dir): text_encoder_2_lora_layers_to_save = None for model in models: - if isinstance(model, type(self.accelerator.unwrap_model(self.unet))): + if isinstance(model, type(unwrap_model(self.accelerator, self.unet))): unet_lora_layers_to_save = convert_state_dict_to_diffusers( get_peft_model_state_dict(model) ) @@ -106,47 +116,71 @@ def load_model_hook(self, models, input_dir): if self.args.model_type == "lora": logger.info(f"Loading LoRA weights from Path: {input_dir}") unet_ = None - text_encoder_1_ = None - text_encoder_2_ = None + text_encoder_one_ = None + text_encoder_two_ = None while len(models) > 0: model = models.pop() - if isinstance(model, type(self.accelerator.unwrap_model(self.unet))): + if isinstance(model, type(unwrap_model(self.accelerator, self.unet))): unet_ = model elif isinstance( - model, type(self.accelerator.unwrap_model(self.text_encoder_1)) + model, type(unwrap_model(self.accelerator, self.text_encoder_one)) ): - text_encoder_1_ = model + text_encoder_one_ = model elif isinstance( - model, type(self.accelerator.unwrap_model(self.text_encoder_2)) + model, type(unwrap_model(self.accelerator, self.text_encoder_two)) ): - text_encoder_2_ = model + text_encoder_two_ = model else: raise ValueError(f"unexpected save model: {model.__class__}") lora_state_dict, network_alphas = LoraLoaderMixin.lora_state_dict(input_dir) - LoraLoaderMixin.load_lora_into_unet( - lora_state_dict, network_alphas=network_alphas, 
unet=unet_ - ) - text_encoder_state_dict = { - k: v for k, v in lora_state_dict.items() if "text_encoder." in k + unet_state_dict = { + f'{k.replace("unet.", "")}': v + for k, v in lora_state_dict.items() + if k.startswith("unet.") } - LoraLoaderMixin.load_lora_into_text_encoder( - text_encoder_state_dict, - network_alphas=network_alphas, - text_encoder=text_encoder_1_, + unet_state_dict = convert_unet_state_dict_to_peft(unet_state_dict) + incompatible_keys = set_peft_model_state_dict( + unet_, unet_state_dict, adapter_name="default" ) + if incompatible_keys is not None: + # check only for unexpected keys + unexpected_keys = getattr(incompatible_keys, "unexpected_keys", None) + if unexpected_keys: + logger.warning( + f"Loading adapter weights from state_dict led to unexpected keys not found in the model: " + f" {unexpected_keys}. " + ) - text_encoder_2_state_dict = { - k: v for k, v in lora_state_dict.items() if "text_encoder_2." in k - } - LoraLoaderMixin.load_lora_into_text_encoder( - text_encoder_2_state_dict, - network_alphas=network_alphas, - text_encoder=text_encoder_2_, - ) + if self.args.train_text_encoder: + # Do we need to call `scale_lora_layers()` here? + _set_state_dict_into_text_encoder( + lora_state_dict, + prefix="text_encoder.", + text_encoder=text_encoder_one_, + ) + + _set_state_dict_into_text_encoder( + lora_state_dict, + prefix="text_encoder_2.", + text_encoder=text_encoder_one_, + ) + + # Make sure the trainable params are in float32. This is again needed since the base models + # are in `weight_dtype`. More details: + # https://github.com/huggingface/diffusers/pull/6514#discussion_r1449796804 + if self.args.mixed_precision == "fp16": + models = [unet_] + if self.args.train_text_encoder: + models.extend([text_encoder_one_, text_encoder_two_]) + for model in models: + for param in model.parameters(): + # only upcast trainable parameters (LoRA) into fp32 + if param.requires_grad: + param.data = param.to(torch.float32) logger.info("Completed loading LoRA weights.") if self.args.use_ema: diff --git a/train_sdxl.py b/train_sdxl.py index da378eaf4..1adc41d1b 100644 --- a/train_sdxl.py +++ b/train_sdxl.py @@ -257,22 +257,18 @@ def main(): # For mixed precision training we cast the text_encoder and vae weights to half-precision # as these models are only used for inference, keeping weights in full precision is not required. weight_dtype = torch.float32 - if ( - accelerator.mixed_precision == "fp16" - and args.pretrained_vae_model_name_or_path is None - ): + if accelerator.mixed_precision == "fp16": weight_dtype = torch.float16 - logger.warning( - f'Using "--fp16" with mixed precision training should be done with a custom VAE. Make sure you understand how this works.' - ) - elif ( - accelerator.mixed_precision == "bf16" - and args.pretrained_vae_model_name_or_path is None - ): + if args.pretrained_vae_model_name_or_path is None: + logger.warning( + f'Using "--fp16" with mixed precision training should be done with a custom VAE. Make sure you understand how this works.' + ) + elif accelerator.mixed_precision == "bf16": weight_dtype = torch.bfloat16 - logger.warning( - f'Using "--bf16" with mixed precision training should be done with a custom VAE. Make sure you understand how this works.' - ) + if args.pretrained_vae_model_name_or_path is None: + logger.warning( + f'Using "--bf16" with mixed precision training should be done with a custom VAE. Make sure you understand how this works.' + ) StateTracker.set_weight_dtype(weight_dtype) # Load scheduler, tokenizer and models. 
tokenizer_1 = AutoTokenizer.from_pretrained( @@ -1327,6 +1323,7 @@ def main(): ) shutil.rmtree(removing_checkpoint) + if accelerator.is_main_process or use_deepspeed_optimizer: save_path = os.path.join( args.output_dir, f"checkpoint-{global_step}" ) @@ -1383,7 +1380,7 @@ def main(): # Create the pipeline using the trained modules and save it. accelerator.wait_for_everyone() if accelerator.is_main_process: - unet = unwrap_model(unet) + unet = unwrap_model(accelerator, unet) if args.model_type == "lora": unet_lora_layers = convert_state_dict_to_diffusers( get_peft_model_state_dict(unet) From 76337a647538c7b8834477cd74dbadb99862af25 Mon Sep 17 00:00:00 2001 From: bghira Date: Sat, 13 Jan 2024 12:58:44 -0600 Subject: [PATCH 4/6] bghira/SimpleTuner#262 resolve save popping weights when it should not --- helpers/sdxl/save_hooks.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/helpers/sdxl/save_hooks.py b/helpers/sdxl/save_hooks.py index 03baebf5c..e45694022 100644 --- a/helpers/sdxl/save_hooks.py +++ b/helpers/sdxl/save_hooks.py @@ -89,7 +89,8 @@ def save_model_hook(self, models, weights, output_dir): for model in models: model.save_pretrained(os.path.join(temporary_dir, "unet")) - weights.pop() # Pop the last weight + if weights: + weights.pop() # Pop the last weight # Copy contents of temporary directory to output directory for item in os.listdir(temporary_dir): From d79963561fdd6b24706f99ff6ebc8d3769fc68a4 Mon Sep 17 00:00:00 2001 From: bghira Date: Sat, 13 Jan 2024 12:59:27 -0600 Subject: [PATCH 5/6] add more example arguments to the multidatabackend example --- multidatabackend.example.json | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/multidatabackend.example.json b/multidatabackend.example.json index fab500008..68fe41461 100644 --- a/multidatabackend.example.json +++ b/multidatabackend.example.json @@ -4,14 +4,19 @@ "type": "local", "instance_data_dir": "/path/to/data/tree", "crop": false, + "crop_style": "random|center|corner", + "crop_aspect": "square|preserve", "resolution": 1.0, - "resolution_type": "area", + "resolution_type": "area|pixel", "minimum_image_size": 1.0, "prepend_instance_prompt": false, "instance_prompt": "cat girls", "only_instance_prompt": false, "caption_strategy": "filename", - "cache_dir_vae": "cache_prefix" + "cache_dir_vae": "/path/to/vaecache", + "vae_cache_clear_each_epoch": true, + "probability": 1.0, + "repeats": 5 }, { "id": "another-special-name-for-another-backend", @@ -22,6 +27,8 @@ "aws_access_key_id": "wpz-764e9734523434", "aws_secret_access_key": "xyz-sdajkhfhakhfjd", "aws_data_prefix": "", - "cache_dir_vae": "/path/to/cache/dir" + "cache_dir_vae": "/path/to/cache/dir", + "vae_cache_clear_each_epoch": true, + "repeats": 2 } ] \ No newline at end of file From acdfb09894a9b0b33ba3afe68792dbff70c3ec2e Mon Sep 17 00:00:00 2001 From: bghira Date: Sat, 13 Jan 2024 12:59:41 -0600 Subject: [PATCH 6/6] sdxl: use torch compile by default --- sdxl-env.sh.example | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sdxl-env.sh.example b/sdxl-env.sh.example index 4b9eea0d7..c21b17691 100644 --- a/sdxl-env.sh.example +++ b/sdxl-env.sh.example @@ -156,4 +156,4 @@ export ACCELERATE_EXTRA_ARGS="" # --multi_gpu or other # With Pytorch 2.1, you might have pretty good luck here. # If you're using aspect bucketing however, each resolution change will recompile. Seriously, just don't do it. 
-export TRAINING_DYNAMO_BACKEND='no' # or 'inductor' if you want to brave PyTorch 2 compile issues \ No newline at end of file +export TRAINING_DYNAMO_BACKEND='inductor' # or 'no' if you want to disable torch compile in case of performance issues or lack of support (eg. AMD) \ No newline at end of file
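As a closing note on the `torch.compile` changes in this series: the `unwrap_model()` helper added in PATCH 2/6 exists because `torch.compile()` wraps a module in an `OptimizedModule` that keeps the original module under `_orig_mod`, and it is the original module that save/export code should operate on. A small standalone sketch (plain PyTorch 2.x, independent of SimpleTuner) showing that behaviour:

```python
# Minimal demonstration of the _orig_mod unwrapping that helpers/training/wrappers.py relies on.
import torch


def unwrap_compiled(model: torch.nn.Module) -> torch.nn.Module:
    # torch.compile() returns an OptimizedModule whose original module lives in `_orig_mod`.
    return getattr(model, "_orig_mod", model)


net = torch.nn.Linear(4, 4)
compiled = torch.compile(net)

print(type(compiled).__name__)           # OptimizedModule
print(unwrap_compiled(compiled) is net)  # True - the original module is recovered
print(unwrap_compiled(net) is net)       # True - a no-op for uncompiled modules
```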