# Anima LoRA Trainer — Quick README (v4)

> **GitHub repo:** [citronlegacy/citron-colab-anima-lora-trainer](https://github.com/citronlegacy/citron-colab-anima-lora-trainer)

## Purpose

This notebook trains a LoRA for the [Anima](https://huggingface.co/circlestone-labs/Anima) diffusion model using [kohya-ss/sd-scripts](https://github.com/kohya-ss/sd-scripts). It is designed to run entirely inside Google Colab.

## Dataset format (flat)

- Use a flat directory (no nested subfolders).
- Each image must have a caption file with the exact same basename and a `.txt` extension.
- Supported image formats: `.jpg`, `.jpeg`, `.png`, `.webp`, `.bmp`, `.gif`.

### Example structure

```
my_dataset/
  image001.png
  image001.txt    # caption: tag-based, comma-separated tags
  image002.jpg
  image002.txt
  image003.webp
  image003.txt
```
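
The pairing rule above can be checked before training starts. The following is a minimal sketch, not part of the notebook: `check_flat_dataset` is a hypothetical helper, and its extension set is copied from the supported-formats list above.

```python
from pathlib import Path

# Extensions copied from the supported-formats list in this README.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".gif"}


def check_flat_dataset(folder):
    """Return (image_names, images_missing_captions, subfolder_names).

    Hypothetical helper for illustration; the notebook does not call it.
    """
    entries = sorted(Path(folder).iterdir())
    # Nested folders violate the flat-directory rule.
    subfolders = [p.name for p in entries if p.is_dir()]
    images = [p for p in entries if p.is_file() and p.suffix.lower() in IMAGE_EXTS]
    # An image is complete only if <basename>.txt sits next to it.
    missing = [p.name for p in images if not p.with_suffix(".txt").is_file()]
    return [p.name for p in images], missing, subfolders
```

Calling `check_flat_dataset(image_directory)` and confirming that the second and third lists are empty is a quick sanity check before running the Training cell.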

### Tags format

- Use tag-style captions (comma-separated tags). No special tokens required.
- Example caption: `mycharname, 1girl, long blonde hair, blue eyes, high quality, detailed`

## Colab quick usage

1. (Optional) Tick the `mount_drive` checkbox in the Setup cell if your dataset or outputs live on Google Drive. Drive is mounted before the models download.
2. Run the Setup cell. It installs sd-scripts and its dependencies, then downloads the models to `/content/models/anima`.
3. (Optional) If your dataset is a zip file, run the "Unzip a dataset" cell.
4. Set `project_name`, `image_directory` (flat folder), `output_directory`, and hyperparameters in the Training Settings cell.
5. Run the Training cell to generate TOML configs and start training.

## ⚠️ Colab runtime limit — target < 1000 training steps

In testing, reaching 1000 training steps takes **over 4 hours** on Colab and often causes the session to disconnect before completion. Keep your total training steps under 1000 to finish reliably in a single session.

**Step calculation:**

$$steps\_per\_epoch = \left\lceil \frac{N_{images} \times repeats}{batch\_size \times grad\_accum} \right\rceil$$

$$total\_steps = steps\_per\_epoch \times epochs$$

The Training Settings cell will automatically calculate and warn you if your settings exceed 1000 steps.
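
As a worked example of the two formulas, the sketch below plugs in sample numbers and also reports the largest epoch count that stays at or below the limit. `step_budget` and its `limit` parameter are illustrative names, not part of the notebook.

```python
import math


def step_budget(num_images, repeats, epochs, batch_size=1, grad_accum=1, limit=1000):
    """Apply the two formulas above; also return the largest epoch count within `limit`."""
    steps_per_epoch = math.ceil((num_images * repeats) / (batch_size * grad_accum))
    total_steps = steps_per_epoch * epochs
    # Floor division so that steps_per_epoch * max_epochs never exceeds the limit.
    max_epochs = limit // steps_per_epoch if steps_per_epoch else 0
    return steps_per_epoch, total_steps, max_epochs


# 40 images, 5 repeats, 10 epochs, batch size 1, no gradient accumulation
print(step_budget(40, 5, 10))  # (200, 2000, 5)
```

Here 40 images at 5 repeats give 200 steps per epoch, so 10 epochs means 2000 steps, double the recommended limit. Dropping to 5 epochs brings the run to 1000 steps.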

## Behavior and defaults

- Single-project execution: the notebook trains one LoRA per run (no batch processing).
- No automatic resume: if training is interrupted, checkpoints remain in the output folder; you may manually resume via sd-scripts if desired.
- Defaults exposed to users: `Epochs=10`, `Dim=20`, `Alpha=20`, `Resolution=768`, `Repeats=5`, `Learning Rate=0.0001`, `Caption Dropout=0.1`.
- Models are downloaded to `/content/models/anima` on the Colab disk. That disk is not persistent, so the models are downloaded again in each new session.
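
Because there is no automatic resume, a manual restart first needs the newest checkpoint left in the output folder. The sketch below is one way to find it; `latest_checkpoint` is a hypothetical helper, and it assumes checkpoint file names begin with the project name and end in `.safetensors`.

```python
from pathlib import Path


def latest_checkpoint(output_dir, project_name):
    """Return the most recently modified '<project_name>*.safetensors' file, or None."""
    candidates = [
        p for p in Path(output_dir).glob(f"{project_name}*.safetensors") if p.is_file()
    ]
    if not candidates:
        return None
    # Modification time is used instead of parsing epoch numbers from file names.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

One possible use, assuming the Anima training script accepts the `network_weights` option that sd-scripts' other LoRA trainers provide, is to add `'network_weights': str(path)` to the training config. That would load the LoRA weights only; the optimizer state and epoch counter start fresh, so it is a warm start and not an exact resume.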

## Google Drive integration note

For consistent Google Drive integration, keep everything inside a `lora_training` directory at the root of your Drive. Use these three folders under it for predictable behavior and easy syncing:

- `lora_training/datasets`
- `lora_training/configs`
- `lora_training/output`

Enable the `mount_drive` checkbox in the Setup cell to mount Drive before model downloads. Then set `image_directory` and `output_directory` to paths under `/content/drive/MyDrive/`.
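
The folder layout can be created in one step once Drive is mounted. This is a minimal sketch: `ensure_layout` is a hypothetical helper, and the default root assumes the standard Colab mount point `/content/drive/MyDrive`.

```python
from pathlib import Path

# Folder names taken from the list above.
SUBFOLDERS = ("datasets", "configs", "output")


def ensure_layout(drive_root="/content/drive/MyDrive"):
    """Create lora_training/{datasets,configs,output} under drive_root; return the paths."""
    base = Path(drive_root) / "lora_training"
    paths = {name: base / name for name in SUBFOLDERS}
    for path in paths.values():
        # exist_ok makes the call safe to repeat in later sessions.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Run it only after the mount succeeds. If Drive is not mounted, the same call creates the folders on the temporary Colab disk instead.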

## Troubleshooting

- CUDA OOM: reduce `network_dim` or `resolution` (try `network_dim=8` and/or `resolution=512`).
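- NaN loss: ensure PyTorch >= 2.5 and lower the learning rate.
- "No image files found": check that `image_directory` points at the flat folder that holds the images themselves. The notebook counts every file that does not end in `.txt` as an image, so a folder containing only caption files triggers this error.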
- Too many steps / Colab disconnect: lower `repeats`, `max_train_epochs`, or use fewer images per run.
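
For the NaN-loss item, the PyTorch version can be compared against the 2.5 minimum without relying on string ordering. The sketch below uses only the standard library; `meets_min_version` is a hypothetical helper.

```python
import re


def meets_min_version(version, minimum=(2, 5)):
    """True if a version string such as '2.5.1+cu121' is at least `minimum` (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # Tuples compare numerically, so (2, 10) correctly ranks above (2, 5).
    return (int(match.group(1)), int(match.group(2))) >= tuple(minimum)
```

In Colab this would be called as `meets_min_version(torch.__version__)` after the Setup cell has run.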

## Saving and checkpoints

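- Trained LoRAs and epoch checkpoints are written to the `output_directory` you set. A checkpoint is saved every epoch (`save_every_n_epochs = 1`), and only the four most recent are kept (`save_last_n_epochs = 4`).
- The default `output_directory` is `/content/drive/MyDrive/lora_training/output`, and the Training cell creates it if it does not exist. If Drive is not mounted, that path lands on the temporary Colab disk and is lost when the session ends.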


In [None]:
#@title Setup Cell — Install sd-scripts, dependencies, and download Anima models
# Run this cell first in Colab.

#@markdown Mount Google Drive (optional). If enabled, the notebook will mount before downloading models.
mount_drive = False #@param {type:"boolean"}
if mount_drive:
    try:
        from google.colab import drive
        drive.mount('/content/drive')
        print("✓ Google Drive mounted at /content/drive")
    except Exception as e:
        print("⚠ Could not mount Google Drive:", e)

# Clone and install sd-scripts (idempotent)
!if [ -d /content/sd-scripts ]; then echo 'sd-scripts already cloned'; else git clone https://github.com/kohya-ss/sd-scripts.git /content/sd-scripts; fi
%cd /content/sd-scripts
!python -m pip install --upgrade pip
!pip install -r requirements.txt
!pip install toml

# Verify installation
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")



# Create model directories and download models to /content/models (idempotent)
!mkdir -p /content/models/anima/dit
!mkdir -p /content/models/anima/text_encoder
!mkdir -p /content/models/anima/vae

print("Downloading Anima DiT model (4.18 GB)...")
!wget -c --show-progress -O /content/models/anima/dit/anima-preview.safetensors \
  https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/diffusion_models/anima-preview.safetensors

print("Downloading Qwen3 text encoder (1.19 GB)...")
!wget -c --show-progress -O /content/models/anima/text_encoder/qwen_3_06b_base.safetensors \
  https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/text_encoders/qwen_3_06b_base.safetensors

print("Downloading Qwen-Image VAE (254 MB)...")
!wget -c --show-progress -O /content/models/anima/vae/qwen_image_vae.safetensors \
  https://huggingface.co/circlestone-labs/Anima/resolve/main/split_files/vae/qwen_image_vae.safetensors

print("\n✓ All setup steps finished. Proceed to the settings cell.")

In [None]:
#@title Unzip a dataset
zip_file_path = "/content/my_dataset.zip" #@param {type:"string"}

import os

if os.path.exists(zip_file_path):
    print(f"Unzipping {zip_file_path}...")
    destination_dir = os.path.dirname(zip_file_path)
    !unzip -o -q "{zip_file_path}" -d "{destination_dir}"
    print("Unzipping complete.")
else:
    print(f"Error: Zip file not found at {zip_file_path}")

In [None]:
# ============================================================
# USER SETTINGS - Modify these to configure your training
# ============================================================

#@title Training Settings
#@markdown In this cell you will define the training settings. The cell also estimates your total training steps and warns you if they exceed 1000 (which risks a Colab timeout). Training runs in the next cell.

#@markdown ## Project Configuration
project_name = "my_lora" #@param {type:"string"}
# Set `image_directory` to a path under /content/drive if you mounted Drive in the Setup cell.
image_directory = "/content/drive/MyDrive/lora_training/datasets/my_dataset" #@param {type:"string"}
output_directory = "/content/drive/MyDrive/lora_training/output" #@param {type:"string"}

#@markdown ## Training parameters
network_dim = 20 #@param {type:"integer"}
network_alpha = 20 #@param {type:"integer"}
learning_rate = 0.0001 #@param {type:"number"}
max_train_epochs = 10 #@param {type:"integer"}

#@markdown ## Dataset Settings
resolution = 768 #@param {type:"integer"}
repeats = 5 #@param {type:"integer"}
caption_dropout = 0.1 #@param {type:"slider", min:0, max:1, step:0.05}

#@markdown ---
#@markdown **Tip**: Higher `network_dim` = more capacity but requires more VRAM
#@markdown
#@markdown **Tip**: `repeats` × number of images = steps per epoch (batch size and gradient accumulation are fixed at 1)

# Model paths (should not need to change these)
DIT_MODEL = "/content/models/anima/dit/anima-preview.safetensors"
QWEN3_MODEL = "/content/models/anima/text_encoder/qwen_3_06b_base.safetensors"
VAE_MODEL = "/content/models/anima/vae/qwen_image_vae.safetensors"

# Display settings summary
print("="*60)
print("TRAINING CONFIGURATION")
print("="*60)
print(f"Project Name:      {project_name}")
print(f"Image Directory:   {image_directory}")
print(f"Output Directory:  {output_directory}")
print(f"Network Dim:       {network_dim}")
print(f"Network Alpha:     {network_alpha}")
print(f"Learning Rate:     {learning_rate}")
print(f"Max Epochs:        {max_train_epochs}")
print(f"Resolution:        {resolution}px")
print(f"Repeats:           {repeats}")
print(f"Caption Dropout:   {caption_dropout}")
print("="*60)

# ── Step estimator ──────────────────────────────────────────
import os as _os
import math as _math

def estimate_steps(image_dir=None, num_images=None, repeats=5, epochs=10,
                   batch_size=1, grad_accum=1):
    """Return (steps_per_epoch, total_steps, num_images)."""
    if num_images is None:
        if not image_dir:
            raise ValueError('Provide image_dir or num_images')
        if not _os.path.isdir(image_dir):
            # Raised so the caller can report a missing folder separately.
            raise FileNotFoundError(image_dir)
        # Every non-.txt file in the flat folder counts as an image.
        num_images = len([f for f in _os.listdir(image_dir)
                          if not f.lower().endswith('.txt')
                          and _os.path.isfile(_os.path.join(image_dir, f))])
    spe = _math.ceil((num_images * int(repeats)) / (int(batch_size) * int(grad_accum)))
    return spe, spe * int(epochs), num_images

print()
print("── Step Estimate ──────────────────────────────────────")
try:
    _spe, _tot, _n = estimate_steps(
        image_dir=image_directory,
        repeats=repeats,
        epochs=max_train_epochs
    )
    print(f"  Images found:        {_n}")
    print(f"  Steps per epoch:     {_spe}  ({_n} images × {repeats} repeats)")
    print(f"  Total steps:         {_tot}  ({_spe} × {max_train_epochs} epochs)")
    if _n == 0:
        print()
        print("  ⚠️  WARNING: No images found in image_directory.")
    elif _tot > 1000:
        print()
        print("  ⚠️  WARNING: Total steps exceed 1000!")
        print("  In testing, 1000 steps take 4+ hours and risk a Colab disconnect.")
        print("  Suggestions to reduce steps:")
        # Floor division keeps each suggestion at or below 1000 total steps.
        print(f"    • Lower epochs   (current: {max_train_epochs})  →  try {max(1000 // _spe, 1)}")
        print(f"    • Lower repeats  (current: {repeats})  →  try {max(1000 // (_n * max_train_epochs), 1)}")
        print("    • Use fewer images per run (split your dataset across sessions)")
    else:
        print("  ✓ Steps look good: within the recommended 1000-step limit.")
except FileNotFoundError:
    print(f"  (image_directory not found yet — steps will be checked at runtime)")
except Exception as _e:
    print(f"  Could not estimate steps: {_e}")
print("──────────────────────────────────────────────────────")

In [None]:
#@title Training Cell — Generate configs and run training
# Run this cell after the Setup cell and the Training Settings cell.
#
# Uses subprocess.Popen for real-time streaming output so progress bars
# and loss values appear live in Colab. On failure, prints recent dmesg
# to help diagnose OOM kills.

import os
import shlex
import toml
from datetime import datetime
import subprocess
import math

# Ensure terminal-like output for tqdm/rich progress bars
os.environ.setdefault('PYTHONUNBUFFERED', '1')
os.environ.setdefault('TERM', 'xterm')
os.environ.setdefault('FORCE_TQDM', '1')

# Pull model paths from settings cell (with safe fallbacks)
DIT_MODEL   = globals().get('DIT_MODEL',   '/content/models/anima/dit/anima-preview.safetensors')
QWEN3_MODEL = globals().get('QWEN3_MODEL', '/content/models/anima/text_encoder/qwen_3_06b_base.safetensors')
VAE_MODEL   = globals().get('VAE_MODEL',   '/content/models/anima/vae/qwen_image_vae.safetensors')

CONFIG_DIR = '/content/lora_training/configs'
os.makedirs(CONFIG_DIR, exist_ok=True)


def create_training_config(project_name, output_dir, dit_model_path, qwen3_model_path, vae_model_path,
                           network_dim=20, network_alpha=20, learning_rate=1e-4, max_train_epochs=10):
    os.makedirs(output_dir, exist_ok=True)
    current_date = datetime.now().strftime('%Y-%m-%d')
    config_filename = f"{project_name}_training_config_{current_date}.toml"
    config_path = os.path.join(CONFIG_DIR, config_filename)

    training_config = {
        'pretrained_model_name_or_path': dit_model_path,
        'qwen3': qwen3_model_path,
        'vae': vae_model_path,
        'network_module': 'networks.lora_anima',
        'network_dim': int(network_dim),
        'network_alpha': int(network_alpha),
        'network_train_unet_only': True,
        'learning_rate': float(learning_rate),
        'optimizer_type': 'AdamW8bit',
        'optimizer_args': ['weight_decay=0.1', 'betas=[0.9, 0.99]'],
        'lr_scheduler': 'cosine_with_restarts',
        'lr_scheduler_num_cycles': 1,
        'lr_warmup_steps': 100,
        'max_train_epochs': int(max_train_epochs),
        'train_batch_size': 1,
        'gradient_accumulation_steps': 1,
        'max_grad_norm': 1.0,
        'seed': 42,
        'timestep_sampling': 'sigmoid',
        'discrete_flow_shift': 1.0,
        'qwen3_max_token_length': 512,
        't5_max_token_length': 512,
        'mixed_precision': 'bf16',
        'gradient_checkpointing': True,
        'cache_latents': True,
        'cache_text_encoder_outputs': True,
        'vae_chunk_size': 64,
        'vae_disable_cache': True,
        'output_dir': output_dir,
        'output_name': project_name,
        'save_model_as': 'safetensors',
        'save_precision': 'bf16',
        'save_every_n_epochs': 1,
        'save_last_n_epochs': 4,
        'shuffle_caption': False,
        'caption_extension': '.txt',
        'noise_offset': 0.03,
        'multires_noise_discount': 0.3,
        'training_comment': f'Anima LoRA - {datetime.now().strftime("%Y-%m-%d")}',
    }

    with open(config_path, 'w') as f:
        toml.dump(training_config, f)

    print(f"\u2713 Created training config: {config_filename}")
    return config_path


def create_dataset_config(project_name, image_dir, resolution=768, repeats=10, caption_dropout_rate=0.1):
    if not os.path.exists(image_dir):
        raise FileNotFoundError(f"Image directory not found: {image_dir}")

    all_files = os.listdir(image_dir)
    image_files = [f for f in all_files
                   if not f.lower().endswith('.txt')
                   and os.path.isfile(os.path.join(image_dir, f))]
    if len(image_files) == 0:
        raise ValueError(f"No image files found in {image_dir}")

    current_date = datetime.now().strftime('%Y-%m-%d')
    config_filename = f"{project_name}_dataset_config_{current_date}.toml"
    config_path = os.path.join(CONFIG_DIR, config_filename)

    dataset_config = {
        'general': {
            'resolution': int(resolution),
            'enable_bucket': True,
            'bucket_no_upscale': False,
            'bucket_reso_steps': 64,
            'min_bucket_reso': 256,
            'max_bucket_reso': 4096,
        },
        'datasets': [
            {
                'resolution': int(resolution),
                'subsets': [
                    {
                        'num_repeats': int(repeats),
                        'image_dir': image_dir,
                        'caption_extension': '.txt',
                        'caption_dropout_rate': float(caption_dropout_rate),
                    }
                ]
            }
        ]
    }

    with open(config_path, 'w') as f:
        toml.dump(dataset_config, f)

    print(f"\u2713 Created dataset config: {config_filename}")
    return config_path


def train_lora_simple(cmd_list):
    """Run training command with real-time streaming output for Colab."""
    import sys
    print("Executing:", ' '.join(shlex.quote(c) for c in cmd_list))
    print()

    process = subprocess.Popen(
        cmd_list,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        bufsize=1
    )

    try:
        for line in iter(process.stdout.readline, ''):
            if line:
                print(line, end='', flush=True)
                sys.stdout.flush()
    except KeyboardInterrupt:
        process.kill()
        process.wait()
        raise

    process.stdout.close()
    exit_code = process.wait()

    if exit_code == 0:
        print("\n\u2713 Training completed successfully!")
        return True
    else:
        print(f"\n\u2717 Training failed (exit code: {exit_code})")
        try:
            print('\n--- Recent kernel messages (dmesg) ---')
            dmesg = subprocess.check_output(['dmesg', '-T'], stderr=subprocess.STDOUT, text=True)
            tail = '\n'.join(dmesg.splitlines()[-80:])
            print(tail)
            if any(t in tail for t in ('Out of memory', 'Killed process', 'oom_reaper', 'OOM')):
                print('\n\U0001f4a1 Hint: OOM detected. Try: network_dim=8, resolution=512')
        except Exception:
            pass
        return False


def main():
    g = globals()

    # Optionally mount Drive
    if g.get('mount_drive', False):
        try:
            from google.colab import drive
            drive.mount('/content/drive')
            print("\u2713 Google Drive mounted at /content/drive")
        except Exception as e:
            print("\u26a0 Could not mount Google Drive:", e)

    project_name     = g.get('project_name', 'my_lora')
    image_directory  = g.get('image_directory', '/content/my_training_images')
    output_directory = g.get('output_directory', '/content/lora_output')
    network_dim      = g.get('network_dim', 20)
    network_alpha    = g.get('network_alpha', 20)
    learning_rate    = g.get('learning_rate', 1e-4)
    max_train_epochs = g.get('max_train_epochs', 10)
    resolution       = g.get('resolution', 768)
    repeats          = g.get('repeats', 5)  # same default as the Training Settings cell
    caption_dropout  = g.get('caption_dropout', 0.1)

    # Validate image dir
    if not os.path.exists(image_directory):
        raise SystemExit(f"Image directory not found: {image_directory}")

    # Re-run step check at training time (catches changes since settings cell)
    all_files = os.listdir(image_directory)
    n_images = len([f for f in all_files
                    if not f.lower().endswith('.txt')
                    and os.path.isfile(os.path.join(image_directory, f))])
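    # Batch size and gradient accumulation are fixed at 1 in the training config,
    # so steps per epoch is simply images × repeats.
    spe = n_images * int(repeats)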
    total_steps = spe * int(max_train_epochs)
    print(f"\nStep check: {n_images} images × {repeats} repeats × {max_train_epochs} epochs = {total_steps} steps")
    if total_steps > 1000:
        print("  \u26a0\ufe0f  WARNING: Total steps exceed 1000 — this run risks a Colab timeout (4+ hours).")
        print("  Consider reducing epochs or repeats before continuing.")

    # Validate models
    print("\nValidating models...")
    ok = True
    for name, path in [('DiT', DIT_MODEL), ('Qwen3', QWEN3_MODEL), ('VAE', VAE_MODEL)]:
        if os.path.exists(path):
            print(f"  \u2713 {name} found")
        else:
            print(f"  \u2717 {name} missing: {path}")
            ok = False
    if not ok:
        raise SystemExit("Required models missing — run the setup cell first.")

    # Create configs
    train_cfg = create_training_config(
        project_name, output_directory, DIT_MODEL, QWEN3_MODEL, VAE_MODEL,
        network_dim=network_dim, network_alpha=network_alpha,
        learning_rate=learning_rate, max_train_epochs=max_train_epochs
    )
    dataset_cfg = create_dataset_config(
        project_name, image_directory,
        resolution=resolution, repeats=repeats, caption_dropout_rate=caption_dropout
    )

    # Build accelerate command
    cmd = [
        'accelerate', 'launch',
        '--num_cpu_threads_per_process', str(g.get('NUM_CPU_THREADS_PER_PROCESS', 1)),
        '/content/sd-scripts/anima_train_network.py',
        '--config_file', train_cfg,
        '--dataset_config', dataset_cfg
    ]

    success = train_lora_simple(cmd)

    print('\n=== Training Summary ===')
    print(f'Project: {project_name}')
    print('Status:', '\u2713 Success' if success else '\u2717 Failed')
    if success:
        print(f'Trained LoRA saved to: {output_directory}')


if __name__ == '__main__':
    try:
        main()
    except Exception as e:
        print('Fatal error:', e)
        import traceback
        traceback.print_exc()