v0.3.1 - Classifier-Free Guidance (CFG) Support + Critical Memory Fixes

@danny-mio danny-mio released this 13 Dec 02:27
v0.3.1
cb9e98c

🔄 Version Note

v0.3.1 is a coordinated release with fluxflow-core v0.3.1.
Note: v0.3.0 was never published to PyPI because of release coordination. This release contains all features originally planned for v0.3.0.


🚀 Major Features

Classifier-Free Guidance (CFG) Support

Training-time CFG implementation with dropout-based conditioning:

  • New cfg_dropout_prob parameter for CFG training (default: 0.0, which disables conditioning dropout)
  • Randomly drops text conditioning during training to enable CFG inference
  • Typical values: 0.10-0.15 for balanced guidance control
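
The dropout step described above can be sketched roughly as follows. This is an illustration, not the library's code: `maybe_drop_text` and `null_embedding` are hypothetical names, and the sketch assumes a dropped sample has its text embedding swapped for a fixed "null" embedding, a common choice that these notes do not confirm.

```python
import random


def maybe_drop_text(text_embeddings, null_embedding, cfg_dropout_prob, rng=random):
    """Return a copy of the batch where each sample's text embedding is
    replaced by `null_embedding` with probability `cfg_dropout_prob`.

    Hypothetical helper for illustration only; the real logic lives in
    should_drop_text_conditioning() and may differ.
    """
    if not 0.0 <= cfg_dropout_prob <= 1.0:
        raise ValueError("cfg_dropout_prob must be in [0, 1]")
    dropped = []
    for embedding in text_embeddings:
        # rng.random() is in [0, 1), so p=0.0 never drops and p=1.0 always drops.
        drop = rng.random() < cfg_dropout_prob
        dropped.append(null_embedding if drop else embedding)
    return dropped


# Usage: with p=0.10, roughly one sample in ten trains unconditionally.
batch = [[0.3, 0.7], [0.9, 0.1], [0.5, 0.5]]
mixed = maybe_drop_text(batch, [0.0, 0.0], cfg_dropout_prob=0.10)
```

Training on this mix of conditional and unconditional samples is what lets a single model produce both predictions needed for guidance at inference time.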

CFG inference utilities in cfg_inference.py:

  • generate_with_cfg() function for dual-pass sampling
  • guidance_scale parameter (1.0-15.0) to control conditioning strength
  • Negative prompts for better control over unwanted features

CFG helper functions in cfg_utils.py:

  • should_drop_text_conditioning() - dropout logic
  • create_cfg_latents() - batch preparation for dual-pass
  • apply_cfg_guidance() - noise prediction combination
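
For orientation, the combination step in the standard CFG formulation is a linear extrapolation from the unconditional prediction toward the conditional one. The sketch below shows that formula only. `combine_cfg` is a hypothetical name, and the actual signature of apply_cfg_guidance() may differ.

```python
def combine_cfg(pred_uncond, pred_cond, guidance_scale):
    """Standard classifier-free guidance combination.

    Works on anything that supports arithmetic (floats, NumPy arrays,
    PyTorch tensors). Illustrative only; not the signature of the
    library's apply_cfg_guidance().
    """
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)


# Plain floats stand in for noise-prediction tensors here.
uncond, cond = 1.0, 2.0
print(combine_cfg(uncond, cond, 1.0))  # 2.0: scale 1.0 reproduces the conditional prediction
print(combine_cfg(uncond, cond, 7.5))  # 8.5: larger scales push further along (cond - uncond)
```

With a negative prompt, the usual convention is to feed its embedding in place of the unconditional input, so guidance pushes away from it. Whether generate_with_cfg() does exactly this is an assumption here.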

Quality:

  • ✅ 212 comprehensive tests covering training, inference, and utilities
  • ✅ Memory validated: CFG adds negligible overhead (<1 MB)
  • ✅ Documentation: A+ grade after audit

🔥 CRITICAL FIXES (December 2025)

Memory Optimizations

CRITICAL FIX #1: LPIPS Gradient Checkpointing OOM

  • Issue: LPIPS perceptual loss used gradient checkpointing, causing OOM at 47.4GB on 48GB GPUs
  • Impact: Training would crash even on A6000 48GB with full config (GAN+LPIPS+SPADE)
  • Fix: Disabled gradient checkpointing in LPIPS (commit: 05196e7)
  • Result: Reduced LPIPS memory overhead by ~3-5GB

CRITICAL FIX #2: DataLoader Prefetch Memory Overhead

  • Issue: DataLoader prefetch_factor=2 pre-loaded batches into VRAM
  • Impact: Added ~4-8GB memory overhead, contributed to OOM
  • Fix: Set prefetch_factor=None (commit: 14a24b8)
  • Result: Immediate memory reduction, training more stable

CRITICAL FIX #3: Aggressive CUDA Cache Clearing

  • Clear cache before VAE backward pass
  • Clear cache after checkpoint save
  • Clear cache every 10 batches
  • Result: Prevents memory fragmentation, frees "reserved but unallocated" memory
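
A rough sketch of how such a clearing cadence might look in a PyTorch training loop. The helper names are hypothetical and the trainer's real hook points are not shown in these notes. Only `gc.collect()`, `torch.cuda.empty_cache()`, `torch.cuda.memory_reserved()`, and `torch.cuda.memory_allocated()` are real library calls.

```python
import gc

import torch

CLEAR_EVERY_N_BATCHES = 10  # cadence quoted in the notes above


def clear_cuda_cache():
    """Return cached, currently unused blocks to the CUDA driver.

    Memory held by live tensors is not freed. No-op on CPU-only machines.
    """
    gc.collect()  # drop unreachable Python objects that still hold tensors
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def reserved_but_unallocated_bytes():
    """Bytes held by PyTorch's caching allocator but not backing any tensor."""
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.memory_reserved() - torch.cuda.memory_allocated()


def should_clear(batch_idx, every=CLEAR_EVERY_N_BATCHES):
    """True on every `every`-th batch (batch_idx is zero-based)."""
    return (batch_idx + 1) % every == 0


# Usage inside a loop (training step omitted):
for batch_idx in range(30):
    if should_clear(batch_idx):
        clear_cuda_cache()
```

Clearing the cache is not free: later allocations must request memory from the driver again, so a fixed cadence trades some throughput for lower reserved memory.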

Gradient & Training Fixes

R1 Penalty Gradient Fix:

  • Issue: R1 penalty wasn't computing gradients correctly, causing memory leaks
  • Impact: Discriminator training unstable, memory usage grew over time
  • Fix: Proper torch.autograd.grad() usage with create_graph=True
  • Result: Stable discriminator training, no memory leaks
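
As a point of reference, an R1 penalty computed with `torch.autograd.grad()` typically looks like the sketch below. It assumes the common form (gamma / 2) * E[||grad_x D(x)||^2] on real images. The function name, the `gamma` default, and the toy discriminator are illustrative and not taken from the fluxflow-training source.

```python
import torch


def r1_penalty(discriminator, real_images, gamma=10.0):
    """R1 gradient penalty on real samples (illustrative sketch).

    gamma=10.0 is an assumed default, not a value from the release notes.
    """
    # Work on a detached copy so no graph from earlier ops is kept alive.
    real_images = real_images.detach().requires_grad_(True)
    scores = discriminator(real_images)
    # Summing over the batch is fine when each score depends only on its own input.
    (grads,) = torch.autograd.grad(
        outputs=scores.sum(),
        inputs=real_images,
        create_graph=True,  # keep the graph so the penalty can reach D's weights
    )
    per_sample = grads.pow(2).flatten(start_dim=1).sum(dim=1)
    return 0.5 * gamma * per_sample.mean()


# Usage with a toy discriminator on CPU:
disc = torch.nn.Linear(4, 1)
penalty = r1_penalty(disc, torch.randn(3, 4))
penalty.backward()  # gradients flow into disc.weight
```

Without `create_graph=True` the returned gradient is a constant with respect to the discriminator's parameters, so the penalty would contribute nothing to the update.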

📊 Empirical Measurements (A6000 48GB)

VRAM Usage by Configuration:

  • VAE only (no GAN): ~18-22GB VRAM
  • VAE + GAN: ~25-30GB VRAM
  • VAE + GAN + LPIPS: ~28-35GB VRAM ✅ (after fixes)
  • VAE + GAN + LPIPS + SPADE: ~35-42GB VRAM ✅ (after fixes)
  • Peak before fixes: 47.4GB → OOM ❌
  • Peak after fixes: ~42GB → stable ✅
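
To collect comparable peak figures on other hardware, PyTorch's own allocator counters are one option. The notes do not say which tool produced the numbers above, so the sketch below is only one possible method. `run_and_report_peak` and `bytes_to_gib` are hypothetical helper names.

```python
import torch


def bytes_to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 1024**3


def run_and_report_peak(step_fn):
    """Run `step_fn` once and return peak CUDA memory in GiB.

    Returns None on machines without CUDA (step_fn still runs).
    """
    if not torch.cuda.is_available():
        step_fn()
        return None
    torch.cuda.reset_peak_memory_stats()
    step_fn()
    torch.cuda.synchronize()  # make sure queued kernels have finished
    return {
        "peak_allocated_gib": bytes_to_gib(torch.cuda.max_memory_allocated()),
        "peak_reserved_gib": bytes_to_gib(torch.cuda.max_memory_reserved()),
    }


# Usage: wrap one training step (a no-op stands in for it here).
print(run_and_report_peak(lambda: None))
```

The reserved figure includes the allocator's cache and is the one closer to what causes an OOM. Numbers from external tools can read higher because they also count the CUDA context.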

📚 Documentation Upgrades

All critical docs upgraded to A+ grade (commit: 7043ccd):

  • README.md: C- → A+ (added memory requirements, OOM prevention)
  • PIPELINE_ARCHITECTURE.md: F → A+ (verified FULLY IMPLEMENTED, 1035 lines)
  • TRAINING_GUIDE.md: D+ → A+ (added memory section, hardware table)
  • CONTRIBUTING.md: B → A+ (added memory testing guide)
  • CHANGELOG.md: C → A+ (added Dec 2025 critical fixes)

🧪 CI Validation

Test Suite: 446 tests, 100% pass rate

  • Unit tests: 446/446 ✅
  • Integration tests: All passing ✅
  • Code quality: flake8 clean, black formatted ✅
  • Type checking: mypy clean ✅

📦 Installation

pip install fluxflow-training==0.3.1

Dependencies:

  • fluxflow>=0.3.1,<0.4.0 (updated from 0.3.0)

⚙️ What's Changed

Full Changelog: v0.2.1...v0.3.1