In [15]:
TRAINER = "diffusers/examples/dreambooth/train_dreambooth.py"
CONVERTER = "diffusers/scripts/convert_original_stable_diffusion_to_diffusers.py"
BACK_CONVERTER = "diffusers/scripts/convert_diffusers_to_original_stable_diffusion.py"

In [16]:
#@title Login to wandb to watch training process (Optional, keep key empty if you want to skip)
WANDB_KEY = ""
if WANDB_KEY != "":
  !wandb login $WANDB_KEY

In [17]:
originalModels = "/models"
convertModels = "/convertModels"

# your model folder name
targetModelName = "animefull-final-pruned"

# If you do not need a separate VAE, set vae_arg = "" instead
vae_arg = f"--vae_path {originalModels}/animevae.pt"

#--------default variable
SRC_PATH = originalModels + "/" + targetModelName
MODEL_NAME = convertModels + "/" + targetModelName

## Instance Prompt and Class Prompt

What your training set is about|Instance prompt must contain|Class prompt should describe
-|-|-
An object/person|`[V]`|The object's type and/or characteristics
An artist's style|`by [V]`|The common characteristics of the training set

Where:
* `[V]` is a *token* in CLIP's [vocabulary](https://huggingface.co/openai/clip-vit-large-patch14/raw/main/vocab.json) which is not meaningful to the model. `sks` is a great example.

A common pitfall: if you are training on a specific person named `[N]`, do NOT use `[N]` as `[V]`. Names are likely to be split (tokenized) into multiple tokens, which can cause problems.

In the end, `[V]` carries the new information learned by the model.
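
One rough way to screen a candidate `[V]` is to look it up in the vocabulary file linked above. The sketch below assumes that file is a JSON object mapping token strings to ids, with whole-word tokens carrying a `</w>` suffix. The helper name `is_single_token` and the tiny `toy_vocab` are made up for illustration. A lookup like this is only a heuristic; running the actual CLIP tokenizer on your prompt is the authoritative check.

```python
import json

def is_single_token(word: str, vocab: dict) -> bool:
    """Return True if `word` appears in the vocab as one whole-word token."""
    return (word.lower() + "</w>") in vocab

# Toy stand-in for vocab.json so the example runs offline (ids are made up).
toy_vocab = {"sks</w>": 0, "girl</w>": 1, "hat": 2, "sune</w>": 3}

print(is_single_token("sks", toy_vocab))      # True
print(is_single_token("hatsune", toy_vocab))  # False

# With the real file downloaded next to the notebook (assumed filename):
# with open("vocab.json", encoding="utf-8") as f:
#     vocab = json.load(f)
# print(is_single_token("sks", vocab))
```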

### Examples

Training on a female character:
* Instance prompt: `sks 1girl`
* Class prompt: `1girl`

Training on Hatsune Miku (don't do this, by the way; the model already knows her):
* Instance prompt: `masterpiece, best quality, sks 1girl, aqua eyes, aqua hair`
* Class prompt: `masterpiece, best quality, 1girl, aqua eyes, aqua hair`

Training on an artist's style of drawing female characters:
* Instance prompt: `1girl, by sks`
* Class prompt: `1girl`
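
The pattern in these examples can be written as a small helper. This is only an illustrative sketch: `build_prompts` is a hypothetical name, not part of the trainer, and it covers just the two simple cases from the table (subject vs. style).

```python
def build_prompts(class_desc: str, token: str = "sks", style: bool = False):
    """Return (instance_prompt, class_prompt) following the table above."""
    if style:
        # Artist's style: the instance prompt must contain `by [V]`.
        instance = f"{class_desc}, by {token}"
    else:
        # Object/person: the instance prompt must contain `[V]`.
        instance = f"{token} {class_desc}"
    return instance, class_desc

print(build_prompts("1girl"))              # ('sks 1girl', '1girl')
print(build_prompts("1girl", style=True))  # ('1girl, by sks', '1girl')
```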

In [18]:
# Declare the input and output folders

trainingFolder = "/train"
trainFolderName = "trcoot"

#--------default variable
INSTANCE_DIR = trainingFolder + "/" + trainFolderName +"/input"
CLASS_DIR = trainingFolder + "/" + trainFolderName +"/class"
OUTPUT_DIR =  trainingFolder + "/" + trainFolderName +"/output"

!mkdir -p $INSTANCE_DIR
!mkdir -p $CLASS_DIR
!mkdir -p $OUTPUT_DIR

print(f"[*] Weights will be saved at {OUTPUT_DIR}")

[*] Weights will be saved at /train/trcoot/output


In [19]:
INSTANCE_PROMPT = "masterpiece, best quality, sks 1girl"

CLASS_PROMPT = "masterpiece, best quality, 1girl"
CLASS_NEGATIVE_PROMPT = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry"
NUM_CLASS_IMAGES = 1

SAVE_SAMPLE_PROMPT = "masterpiece, best quality, sks 1girl, looking at viewer"
SAVE_SAMPLE_NEGATIVE_PROMPT = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry"

## Advanced

Use the table below to choose flags that fit your memory and speed requirements. The numbers were measured on a Tesla T4 GPU.

| `mixed_precision` | `train_batch_size` | `gradient_accumulation_steps` | `gradient_checkpointing` | `use_8bit_adam` | VRAM usage (GB) | Speed (it/s) |
| ---- | ------------------ | ----------------------------- | ----------------------- | --------------- | ---------- | ------------ |
| fp16 | 1                  | 1                             | TRUE                    | TRUE            | 9.92       | 0.93         |
| no   | 1                  | 1                             | TRUE                    | TRUE            | 10.08      | 0.42         |
| fp16 | 2                  | 1                             | TRUE                    | TRUE            | 10.4       | 0.66         |
| fp16 | 1                  | 1                             | FALSE                   | TRUE            | 11.17      | 1.14         |
| no   | 1                  | 1                             | FALSE                   | TRUE            | 11.17      | 0.49         |
| fp16 | 1                  | 2                             | TRUE                    | TRUE            | 11.56      | 1            |
| fp16 | 2                  | 1                             | FALSE                   | TRUE            | 13.67      | 0.82         |
| fp16 | 1                  | 2                             | FALSE                   | TRUE            | 13.7       | 0.83         |
| fp16 | 1                  | 1                             | TRUE                    | FALSE           | 15.79      | 0.77         |

If you are using a GPU better than a Tesla T4, remove `--gradient_checkpointing` from the arguments to improve training speed. Also consider increasing `train_batch_size` for less overhead and better regularization.

Remove the `--use_8bit_adam` flag to use a full-precision optimizer. This requires 15.79 GB with `--gradient_checkpointing` and 17.8 GB without it. It is rarely worth the cost, but it is an option if you run into issues with the reduced precision.
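
If you would rather not read the table by eye, the lookup can be sketched in a few lines. The rows are copied from the table above; the `Config` type and `fastest_config` helper are hypothetical names for this illustration. The numbers only apply to a Tesla T4, so treat the result as a starting point.

```python
from typing import NamedTuple, Optional

class Config(NamedTuple):
    mixed_precision: str
    train_batch_size: int
    gradient_accumulation_steps: int
    gradient_checkpointing: bool
    use_8bit_adam: bool
    vram_gb: float
    speed_it_s: float

# Rows copied from the table above (Tesla T4 measurements).
TABLE = [
    Config("fp16", 1, 1, True, True, 9.92, 0.93),
    Config("no", 1, 1, True, True, 10.08, 0.42),
    Config("fp16", 2, 1, True, True, 10.4, 0.66),
    Config("fp16", 1, 1, False, True, 11.17, 1.14),
    Config("no", 1, 1, False, True, 11.17, 0.49),
    Config("fp16", 1, 2, True, True, 11.56, 1.0),
    Config("fp16", 2, 1, False, True, 13.67, 0.82),
    Config("fp16", 1, 2, False, True, 13.7, 0.83),
    Config("fp16", 1, 1, True, False, 15.79, 0.77),
]

def fastest_config(vram_budget_gb: float) -> Optional[Config]:
    """Return the row with the highest it/s that fits the VRAM budget, or None."""
    fitting = [c for c in TABLE if c.vram_gb <= vram_budget_gb]
    return max(fitting, key=lambda c: c.speed_it_s, default=None)

best = fastest_config(12.0)
print(best.vram_gb, best.speed_it_s, best.gradient_checkpointing)  # 11.17 1.14 False
```

Note that it/s counts optimizer iterations, not images, so rows with a larger `train_batch_size` process more images per iteration than this simple ranking suggests.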

### Multiple Concepts

You can set up a `concepts_list.json` like:

```json
[
    {
        "instance_prompt":      "photo of a woman wearing sks dress",
        "class_prompt":         "photo of a woman wearing dress",
        "instance_data_dir":    "data/wrap_dress",
        "class_data_dir":       "data/dress"
    },
    {
        "instance_prompt":      "photo of sks woman",
        "class_prompt":         "photo of a woman",
        "instance_data_dir":    "data/woman1",
        "class_data_dir":       "data/woman_class"
    }
]
```

Pass it with `--concepts_list concepts_list.json` to let the model learn multiple concepts at the same time. This is currently not compatible with Variable Prompts.
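
The file can also be generated from Python instead of written by hand. A minimal sketch, assuming the four keys in the example above are the ones the trainer expects; `write_concepts` and `REQUIRED_KEYS` are hypothetical names for this illustration:

```python
import json
import tempfile
from pathlib import Path

# Assumption: these are the keys shown in the example above.
REQUIRED_KEYS = {"instance_prompt", "class_prompt", "instance_data_dir", "class_data_dir"}

def write_concepts(concepts, path):
    """Validate each concept and write the list as JSON to `path`."""
    for i, concept in enumerate(concepts):
        missing = REQUIRED_KEYS - concept.keys()
        if missing:
            raise ValueError(f"concept {i} is missing keys: {sorted(missing)}")
    path = Path(path)
    path.write_text(json.dumps(concepts, indent=4), encoding="utf-8")
    return path

concepts = [
    {
        "instance_prompt": "photo of a woman wearing sks dress",
        "class_prompt": "photo of a woman wearing dress",
        "instance_data_dir": "data/wrap_dress",
        "class_data_dir": "data/dress",
    },
    {
        "instance_prompt": "photo of sks woman",
        "class_prompt": "photo of a woman",
        "instance_data_dir": "data/woman1",
        "class_data_dir": "data/woman_class",
    },
]

# Written to a temporary folder here; in the notebook you would pass
# "concepts_list.json" directly.
with tempfile.TemporaryDirectory() as tmp:
    out = write_concepts(concepts, Path(tmp) / "concepts_list.json")
    loaded = json.loads(out.read_text(encoding="utf-8"))
    print(loaded[1]["instance_prompt"])  # photo of sks woman
```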

### Variable Prompts

For each image (`[X].png` / `[X].jpg`) in the data set, add a matching `[X].txt` containing that image's prompt. Then set `READ_PROMPT_FROM_TXT`. Both the training set and the class set support this.

The prompt read from the txt file, `[PX]`, is inserted into the prompt `[P]` that you set in the training arguments. By default, it is inserted as `[PX] [P]`.

With Variable Prompts enabled and prior preservation loss (`PRIOR_PRESERVATION`) disabled, training is effectively equivalent to standard fine-tuning.
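
The file layout can be prepared with a short script. The sketch below is illustrative only: `write_captions` and `combine` are hypothetical helpers, and `combine` simply mirrors the `[PX] [P]` order described above rather than the trainer's actual code.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg"}

def write_captions(data_dir, captions):
    """Write `[X].txt` next to each image whose stem `[X]` is a key of `captions`."""
    written = []
    for image in sorted(Path(data_dir).iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES and image.stem in captions:
            txt = image.with_suffix(".txt")
            txt.write_text(captions[image.stem], encoding="utf-8")
            written.append(txt.name)
    return written

def combine(px: str, p: str) -> str:
    """Default insertion order described above: `[PX] [P]`."""
    return f"{px} {p}"

with tempfile.TemporaryDirectory() as tmp:
    # Empty placeholder files stand in for real images.
    (Path(tmp) / "001.png").touch()
    (Path(tmp) / "002.jpg").touch()
    print(write_captions(tmp, {"001": "smile, outdoors", "002": "sitting"}))
    # ['001.txt', '002.txt']

print(combine("smile, outdoors", "sks 1girl"))  # smile, outdoors sks 1girl
```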

### Aspect Ratio Bucketing

NovelAI used this technique to train their model. In a nutshell, it varies the training resolution, which removes the need to manually crop the data set to 1:1 while still preserving image information well. It gives better results, especially when generating images with an aspect ratio ≠ 1.

The cost is slower training and slightly higher VRAM usage. It has been tested with batch size = 2 on a Tesla T4 without problems.
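
To make the idea concrete, here is a simplified sketch of bucketing. It is not NovelAI's or this trainer's implementation; the grid step, side limits, and area budget are assumptions chosen for illustration. It enumerates resolutions whose area stays within 512×512 and assigns each image to the bucket with the closest aspect ratio.

```python
def make_buckets(max_area=512 * 512, step=64, min_side=256, max_side=1024):
    """Enumerate (width, height) pairs on a `step` grid whose area fits `max_area`."""
    buckets = set()
    for w in range(min_side, max_side + 1, step):
        # Largest height on the grid that keeps w * h within the area budget.
        h = min(max_side, (max_area // w) // step * step)
        if h >= min_side:
            buckets.add((w, h))
            buckets.add((h, w))  # also keep the mirrored (portrait/landscape) bucket
    return sorted(buckets)

def assign_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))

buckets = make_buckets()
print(assign_bucket(1024, 1024, buckets))  # (512, 512)
print(assign_bucket(1200, 800, buckets))   # (640, 384)
print(assign_bucket(800, 1200, buckets))   # (384, 640)
```

Images in the same bucket are resized to that bucket's resolution and batched together, which is why batches no longer need square crops.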

### Optimizer

The default is int8 AdamW, which works well. If you want to use SGDM, be prepared for trouble, such as the model's output barely changing even after many steps.

### Storage Issue

To keep Colab from running out of storage, UNet weights are saved in FP16 by default (`--save_unet_half`).

If you enabled wandb, you can add `--wandb_artifact` to upload weights to wandb. Optionally, `--rm_after_wandb_saved` removes the local weights after uploading. (Because Colab tends to interfere with connections, this is disabled by default.)
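
For a rough sense of what saving in FP16 buys, the arithmetic is just parameter count times bytes per parameter. The figure of about 860 million UNet parameters is an approximate assumption for Stable Diffusion 1.x style models, not a number from this notebook, and `weights_size_gb` is a hypothetical helper.

```python
def weights_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of raw weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

unet_params = 860_000_000  # assumed, approximate

print(round(weights_size_gb(unet_params, 4), 2))  # FP32: 3.44
print(round(weights_size_gb(unet_params, 2), 2))  # FP16: 1.72
```

With a checkpoint written every `SAVE_INTERVAL` steps, halving each one adds up quickly on Colab's disk.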


In [25]:
#@title Advanced Parameters
MAX_TRAIN_STEPS = 200 #@param {type:"number"}
SAVE_INTERVAL = 100 #@param {type:"number"}
SEED = 114514 #@param {type:"number"}
#@markdown ## Data Processing
RESOLUTION = 512 #@param {type:"slider", min:64, max:2048, step:28}
ASPECT_RATIO_BUCKETING = False #@param {type:"boolean"}
READ_PROMPT_FROM_TXT = "instance" #@param ["no", "instance", "class", "both"] {allow-input: false}
#@markdown ## Forward Pass
TRAIN_BATCH_SIZE = 1 #@param {type:"slider", min:1, max:10, step:1}
GRADIENT_ACCUMULATION_STEPS = 1 #@param {type:"slider", min:1, max:10, step:1}
CLIP_SKIP = 2 #@param {type:"slider", min:1, max:6, step:1}
MIXED_PRECISION = "fp16" #@param ["no", "fp16", "bf16"] {allow-input: false}
#@markdown ## Optimizer / Backward Pass
OPTIMIZER = "adamw_8bit" #@param ["adamw", "adamw_8bit", "adamw_ds", "sgdm", "sgdm_8bit"] {allow-input: false}
LEARNING_RATE = 5e-6 #@param {type:"number"}
LR_SCHEDULER = "cosine_with_restarts"  #@param ["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"] {allow-input: false}
LR_WARMUP_STEPS = 100 #@param {type:"number"}
LR_CYCLES = 1 #@param {type:"number"}
LAST_EPOCH = -1 #@param {type:"number"}
SCALE_LR = True #@param {type:"boolean"}
PRIOR_PRESERVATION = True #@param {type:"boolean"}
PRIOR_LOSS_WEIGHT = 1 #@param {type:"slider", min:0, max:1, step:0.01}
#@markdown ## Inference (Class Set Generation / Sample Images Generation)
INFER_STEPS = 28 #@param {type:"integer"}
GUIDANCE_SCALE = 11 #@param {type:"integer"}
SAMPLE_N = 4  #@param {type:"integer"}
INFER_BATCH_SIZE = 2 #@param {type:"slider", min:1, max:10, step:1}

In [26]:
%%bash

mkdir -p ~/.cache/huggingface/accelerate

cat > ~/.cache/huggingface/accelerate/default_config.yaml <<- EOM
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: 'NO'
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
use_cpu: false
EOM

In [None]:
%cd /content
!mkdir -p $OUTPUT_DIR

wandb_arg = "--wandb" if WANDB_KEY != "" else ""
scale_lr_arg = "--scale_lr" if SCALE_LR else ""
ppl_arg = f"--with_prior_preservation --prior_loss_weight={PRIOR_LOSS_WEIGHT}" if PRIOR_PRESERVATION else ""
read_prompt_arg = f"--read_prompt_from_txt {READ_PROMPT_FROM_TXT}" if READ_PROMPT_FROM_TXT != "no" else ""
arb_arg = "--use_aspect_ratio_bucket --debug_arb" if ASPECT_RATIO_BUCKETING else ""


!accelerate launch $TRAINER \
  --instance_data_dir "{INSTANCE_DIR}" \
  --instance_prompt "{INSTANCE_PROMPT}" \
  --pretrained_model_name_or_path "{MODEL_NAME}" \
  --pretrained_vae_name_or_path "{MODEL_NAME}/vae" \
  --output_dir "{OUTPUT_DIR}" \
  --seed=$SEED \
  --resolution=$RESOLUTION \
  --optimizer "{OPTIMIZER}" \
  --train_batch_size=$TRAIN_BATCH_SIZE \
  --learning_rate=$LEARNING_RATE \
  --lr_scheduler=$LR_SCHEDULER \
  --lr_warmup_steps=$LR_WARMUP_STEPS \
  --lr_cycles=$LR_CYCLES \
  --last_epoch=$LAST_EPOCH \
  --max_train_steps=$MAX_TRAIN_STEPS \
  --save_interval=$SAVE_INTERVAL \
  --class_data_dir "{CLASS_DIR}" \
  --class_prompt "{CLASS_PROMPT}" --class_negative_prompt "{CLASS_NEGATIVE_PROMPT}" \
  --num_class_images=$NUM_CLASS_IMAGES \
  --save_sample_prompt "{SAVE_SAMPLE_PROMPT}" --save_sample_negative_prompt "{SAVE_SAMPLE_NEGATIVE_PROMPT}" \
  --n_save_sample=$SAMPLE_N \
  --infer_batch_size=$INFER_BATCH_SIZE \
  --infer_steps=$INFER_STEPS \
  --guidance_scale=$GUIDANCE_SCALE \
  --gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing \
  --save_unet_half \
  --mixed_precision "{MIXED_PRECISION}" \
  --clip_skip=$CLIP_SKIP \
  $wandb_arg $scale_lr_arg $ppl_arg $read_prompt_arg $arb_arg

# disabled: --not_cache_latents 

/content
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `8` to improve out-of-box performance
Caching latents: 100%|████████████████████████████| 1/1 [00:04<00:00,  4.04s/it]
Steps:  33%|▎| 66/200 [00:57<01:41,  1.32step/s, epoch=66, loss=0.0694, lr=3.3e-

## Convert weights to ckpt to use in web UIs like AUTOMATIC1111.

In [None]:
#@markdown Which step number to use.
use_checkpoint = '2000' #@param {type:"string"}
#@markdown Id of which run to use (empty = latest run).
run_id = '' #@param {type:"string"}

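from pathlib import Path

if not run_id:
  # Default to the most recently created run directory.
  runs = [d for d in Path(OUTPUT_DIR).iterdir() if d.is_dir()]
  # list.sort() accepts the sort key only as a keyword argument.
  runs.sort(key=lambda d: d.stat().st_ctime, reverse=True)
  run_id = runs[0].name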

ckpt_path = f'{OUTPUT_DIR}/{run_id}/{use_checkpoint}/model.ckpt'

# You can add --vae and --text_encoder if you want.
!python "{BACK_CONVERTER}" --model_path "{OUTPUT_DIR}/{run_id}/{use_checkpoint}" --checkpoint_path $ckpt_path \
  --unet_dtype fp16

print(f"[*] Converted ckpt saved at {ckpt_path}")

## Inference

In [None]:
!nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader

In [None]:
!git clone https://github.com/CCRcmcpe/diffusers.git

In [35]:
!echo "Converting model..."
!python $CONVERTER --checkpoint_path $SRC_PATH/model.ckpt --original_config_file $SRC_PATH/config.yaml $vae_arg --dump_path $MODEL_NAME --scheduler_type ddim
!echo "Done"

Converting model...
Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPTextModel: ['vision_model.encoder.layers.6.self_attn.q_proj.weight', 'vision_model.encoder.layers.19.self_attn.k_proj.bias', 'vision_model.encoder.layers.15.self_attn.v_proj.weight', 'vision_model.encoder.layers.4.self_attn.k_proj.weight', 'vision_model.encoder.layers.4.self_attn.q_proj.bias', 'vision_model.encoder.layers.5.layer_norm2.weight', 'vision_model.encoder.layers.18.self_attn.k_proj.weight', 'vision_model.encoder.layers.21.self_attn.out_proj.bias', 'vision_model.encoder.layers.9.self_attn.out_proj.bias', 'vision_model.encoder.layers.10.mlp.fc1.weight', 'vision_model.encoder.layers.3.self_attn.out_proj.weight', 'vision_model.encoder.layers.2.self_attn.v_proj.weight', 'vision_model.encoder.layers.2.mlp.fc2.weight', 'vision_model.encoder.layers.12.layer_norm1.weight', 'vision_model.encoder.layers.19.layer_norm2.bias', 'vision_model.encoder.layers.11.mlp.f

Convert succeed
Done
