
v0.17.0

@github-actions github-actions released this 03 Jun 08:37
09d325b

Axolotl v0.17.0 Release Notes

Another packed release. ~84 commits since v0.16.1 last month, bringing Expert Parallelism for MoE training, BitNet 1.58-bit fine-tuning, remote training via Tinker, context parallelism for hybrid SSM models, MXFP4 ScatterMoE-LoRA, fused RMSNorm+RoPE kernels for the Qwen3 family, a systemic fix for multimodal loss masking, a uv-first install path, and a long tail of stability/perf work on FSDP2, Gemma 4, and DPO.


Highlights

Expert Parallelism (EP) via DeepEP

Distributed MoE training across ranks via DeepSeek's DeepEP all-to-all kernels, verified on 2×A100, 4×A100, and 8×H100, and in EP + FSDP composition. Hopper low-latency kernels and TP/CP composition are follow-ups. See the Expert Parallelism docs and DeepEP integration guide.

[Figures: qwen30b_h100_chart, sweep_speedup]

Train on Remote Compute via Tinker-compatible APIs

Run training against a Tinker / Hatchery API endpoint instead of local hardware, with example SFT and GRPO configs plus a math reward function for RL workflows.

BitNet 1.58-bit Fine-tuning

Full fine-tuning support for BitNet via the onebitllms library, with config-validation guards against incompatible LoRA setups and a setup guide for post-training conversion.

Q-GaLore Optimizer

New memory-efficient optimizer q_galore_adamw8bit based on Q-GaLore. Requires FSDP2, full fine-tuning, and bf16. Not compatible with adapters.

MoRA / ReMoRA Integration

Optional MoRA (Mixture-of-Rank-Adapters) support with ReMoRA restart scheduling, registered through a new plugin-based adapter system so future custom adapters can slot in the same way. A companion ReLoRA cleanup fixes the optimizer reset scope, adds a configurable relora_prune_method, and renames relora_steps → jagged_restart_steps (breaking, see Deprecations).

Multimodal Assistant-only Loss Masking

Fixes a long-standing bug where train_on_inputs / roles_to_train / train_on_eos were silently ignored in the multimodal collator: every model except Gemma 3n was training on the full sequence regardless of config. New per-template strategies for Gemma 4, Llama 3.2 Vision, Llama 4, Pixtral, and Mistral V7 Tekken, plus an opt-in cfg.role_boundaries override for unverified templates.

Heads up: if you were training multimodal models on assistant-only data, your loss values will change after upgrading. This is expected.
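To make the effect of assistant-only masking concrete, here is a minimal sketch of the general technique, not Axolotl's collator code. The `mask_labels` helper and the `(role, token_ids)` turn format are hypothetical; the only established convention used is the label value -100, which PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_labels(turns, roles_to_train=("assistant",)):
    """Build (input_ids, labels) from a list of (role, token_ids) turns.

    Hypothetical helper: tokens from roles outside roles_to_train are
    labeled IGNORE_INDEX so they contribute nothing to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role in roles_to_train:
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Made-up token ids for a two-turn conversation.
turns = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),
]
input_ids, labels = mask_labels(turns)
print(labels)  # [-100, -100, -100, 21, 22]
```

Before the fix, the multimodal path behaved as if every role were trainable, which in this sketch corresponds to `labels == input_ids`. That is why loss values shift after upgrading.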

DPO Loss Types & SimPO LoRA fix

New dpo_loss_type (list) and dpo_loss_weights config knobs expose the full TRL ≥ 0.29 loss menu and let users mix multiple DPO losses with custom weightings, restoring losses like RPO that broke after the TRL 0.29 refactor. Separately, rl: simpo + LoRA no longer raises ValueError: You passed a PeftModel instance together with a peft_config.

Deprecation: rl: ipo is deprecated. Use rl: dpo with dpo_loss_type: ["ipo"] instead.

Context Parallelism for Hybrid SSM Models

Context parallel support for hybrid attention + Mamba2 SSM models (Nemotron-H, Falcon-H1, Bamba, Granite MoE Hybrid, and Zamba2), plus seq_idx threading so SSM state resets correctly at packed-sequence boundaries. Uses an exact additive correction (exploiting SSM linearity) with one P2P round per layer, rather than the O(world_size) rounds of ring attention.
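The additive correction can be illustrated with a toy scalar recurrence. This is a sketch of the linearity argument only, assuming a state update of the form h_t = a_t * h_{t-1} + b_t; it is not Axolotl's implementation, and the two-rank split and all names are illustrative.

```python
def scan(a, b, h0=0.0):
    """Sequential linear recurrence h_t = a_t * h_{t-1} + b_t."""
    states, h = [], h0
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        states.append(h)
    return states


def cumulative_decay(a):
    """Running product of a_t: how much of an incoming state survives to step t."""
    out, p = [], 1.0
    for a_t in a:
        p *= a_t
        out.append(p)
    return out


a = [0.5, 0.25, 0.5, 0.5]
b = [1.0, 2.0, 3.0, 4.0]

# Reference: one device scans the whole sequence.
reference = scan(a, b)

# Toy context parallelism: each "rank" scans its own half from a zero state.
local0 = scan(a[:2], b[:2])
local1 = scan(a[2:], b[2:])

# Rank 0 passes its final state to rank 1 (standing in for the P2P transfer).
h_in = local0[-1]

# Because the recurrence is linear in h, rank 1 only needs to add the decayed
# incoming state to its zero-initialized results.
corrected1 = [h + d * h_in for h, d in zip(local1, cumulative_decay(a[2:]))]

print(local0 + corrected1 == reference)  # True
```

The values are chosen to be exactly representable in floating point, so the comparison is exact here. With more than two ranks, the state passed onward would be the corrected final state rather than the local one.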

uv-first installs and Docker images

uv is now the recommended package manager. New -uv Docker image variants ship with a generated lockfile and a migration guide from pip; minimum PyTorch is bumped to 2.9.1.

Gemma 4 Hybrid Attention + Fused RMSNorm/RoPE Kernels

Mixed FA + SDPA dispatch for Gemma 4 (FA2 on standard layers, SDPA on layers where head_dim=512 causes FA2 to run out of memory), plus a fused RMSNorm+RoPE Triton kernel. Enable via gemma4_hybrid_attn_impl: true. A VRAM leak in this path under activation checkpointing was also fixed.

Fused RMSNorm+RoPE Kernels for Qwen3 / Qwen3.X

Generalizes the Gemma 4 fused RMSNorm+RoPE Triton kernel to the Qwen3 family (Qwen3, Qwen3-MoE, Qwen3.5, Qwen3.6, Qwen3-VL) behind a new opt-in cfg.fused_attn_kernel, and auto-enables Liger's fused (m-)rope for the Qwen-VL models.

ScatterMoE-LoRA: MXFP4 Weights and Tiled-MLP for Long Context

Adds MXFP4-quantized expert weights to ScatterMoE-LoRA for memory-efficient MoE adapter training, plus a tiled-MLP path (FSDP2 reshard fix, grad-accumulator dtype fix, and a shard-count heuristic worth 3.2× at long context). An INT64 indices fix in the scatter2scatter Triton family resolves silent cuBLAS failures when routed-token offsets cross the 2³¹ boundary.
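The 2³¹ boundary matters because a signed 32-bit offset wraps to a negative value once it exceeds 2³¹ − 1. The arithmetic can be checked with a small sketch; the row count and hidden width below are made-up numbers, and `as_int32` is a hypothetical helper that emulates two's-complement wraparound.

```python
INT32_MAX = 2**31 - 1


def as_int32(x):
    """Wrap an integer the way two's-complement 32-bit arithmetic would."""
    return (x + 2**31) % 2**32 - 2**31


# Hypothetical sizes: number of routed token rows times the hidden width.
rows = 600_000
hidden = 4096

offset = rows * hidden        # 2_457_600_000 elements into the buffer
print(offset > INT32_MAX)     # True
print(as_int32(offset))       # -1837367296
```

An offset like this, held in a 32-bit index, would point outside the intended buffer, which is consistent with the fix of promoting the indices to INT64.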


Performance & Kernel Optimizations

  • Pre-cache eot token ids (#3594 by @winglian): avoids recomputing on every iteration.
  • DPO collation padding (#3601 by @winglian): new pad_to_multiple_of pads to buckets to avoid blowing up FLA autotune memory on Qwen 3.5 DPO/ORPO.
  • gc_steps → torch_empty_cache_steps (#3604 by @SuperMarioYL): splits cache-clear and Python GC into separate native knobs; gc_steps is auto-migrated.
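As a rough illustration of why padding to a multiple helps, the sketch below rounds sequence lengths up to a bucket size so that many raw lengths collapse into a few distinct shapes. The bucket size of 64 and the `padded_length` helper are illustrative assumptions, not Axolotl defaults.

```python
def padded_length(length, multiple):
    """Round length up to the nearest multiple (ceiling division)."""
    return -(-length // multiple) * multiple


# Made-up batch lengths: five distinct raw shapes.
lengths = [97, 101, 120, 130, 255]

buckets = sorted({padded_length(n, 64) for n in lengths})
print(buckets)  # [128, 192, 256]
```

Fewer distinct shapes means fewer shapes for an autotuner to benchmark and cache, at the cost of some extra padding tokens.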

New Features

  • fp32_norms for FSDP2 (#3670 by @winglian): keep RMSNorm/LayerNorm in fp32 while training in bf16/fp16. Required for models like AFMOE that declare fp32 norms.
  • excess_length_strategy for RL trainers (#3578 by @yurekami): DPO/IPO/ORPO/SimPO/KTO now respect drop / truncate / raise (previously, over-length samples were always dropped).
  • FineGrainedFP8Config quantization (#3587 by @madScientist10): loads FP8-quantized models with optional dequantize: true for FFT.
  • Skip redundant eval on checkpoint resume (#3575 by @joaquinhuigomez)
  • processor_kwargs YAML field (#3612 by @thad0ctor): overrides for e.g. Gemma 4 image_seq_length or Qwen-VL min_pixels.
  • field_messages support in multimodal collator (#3628 by @cyc00518)
  • Multimodal AutoProcessor support (#3656 by @ved1beta)
  • String-formatted messages field (#3607 by @brightwind26): JSON-string messages fields are now decoded transparently.
  • Trainable / masked spans in content and reasoning_content (#3592 by @winglian): per-part train / weight flags; replaces train_details.
  • Attention implementation refactor (#3602 by @winglian): unified attn_implementation (eager / flash / sdpa / xformers / flex / sage / s2 / fp8), with new FP8 attention backend. Legacy flags still honored.
  • Multi-GPU Qwen3.5 FFT example config (#3605 by @ved1beta)
  • DO_NOT_TRACK / AXOLOTL_DO_NOT_TRACK now properly honored (#3580 by @maximegmd)
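The three excess_length_strategy values can be pictured with a small sketch. This is an assumption-laden illustration of what drop, truncate, and raise conventionally mean for over-length samples; the `apply_excess_length_strategy` function is hypothetical, and the real trainers may treat prompt and completion fields separately.

```python
def apply_excess_length_strategy(samples, max_len, strategy="drop"):
    """Handle token sequences longer than max_len (hypothetical helper).

    drop:     discard over-length samples
    truncate: keep only the first max_len tokens
    raise:    fail loudly on the first over-length sample
    """
    if strategy not in ("drop", "truncate", "raise"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    kept = []
    for ids in samples:
        if len(ids) <= max_len:
            kept.append(ids)
        elif strategy == "truncate":
            kept.append(ids[:max_len])
        elif strategy == "raise":
            raise ValueError(f"sample of length {len(ids)} exceeds {max_len}")
        # strategy == "drop": skip the sample entirely
    return kept


samples = [[1, 2], [1, 2, 3, 4, 5]]
print(apply_excess_length_strategy(samples, 3, "drop"))      # [[1, 2]]
print(apply_excess_length_strategy(samples, 3, "truncate"))  # [[1, 2], [1, 2, 3]]
```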

Documentation

  • Easier doc discovery for agents (#3579 by @winglian): surfaces agent-optimized docs more prominently.
  • Document jinja2 file path support (#3588 by @NanoCode012): clarifies that chat templates can be loaded from a local jinja file.
  • Update Docker docs (#3623 by @NanoCode012): refreshed for the uv-first builds and the new image variants.
  • Security policy updated (#3645 by @NanoCode012): now uses email rather than Discord.
  • Cut Cross Entropy uninstall hint (#3583 by @floaty3): better message when an import mismatch is detected.

Model & Framework Support

Deprecations

  • relora_steps renamed to jagged_restart_steps (#3646 by @winglian): breaking config change, no migration shim. Existing ReLoRA configs need a one-line update.
  • rl: ipo deprecated in favor of rl: dpo, dpo_loss_type: ["ipo"] (#3566 by @BrownianNotion)
  • gc_steps deprecated in favor of torch_empty_cache_steps + gc_collect_steps (#3604 by @SuperMarioYL)
  • Legacy attention flags soft-deprecated by attn_implementation (#3602 by @winglian): still honored
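For configs loaded as plain dictionaries, the two manual changes above can be sketched as a small migration helper. The `migrate_config` function is hypothetical and not part of Axolotl; it only uses the key names given in these notes. It leaves gc_steps untouched because that key is described as auto-migrated.

```python
def migrate_config(cfg):
    """Return a copy of cfg with the v0.17.0 renames applied (hypothetical helper)."""
    cfg = dict(cfg)

    # relora_steps was renamed with no migration shim.
    if "relora_steps" in cfg:
        cfg["jagged_restart_steps"] = cfg.pop("relora_steps")

    # rl: ipo is deprecated in favor of rl: dpo with an explicit loss type.
    if cfg.get("rl") == "ipo":
        cfg["rl"] = "dpo"
        cfg.setdefault("dpo_loss_type", ["ipo"])

    return cfg


# Made-up example values.
old = {"relora_steps": 150, "rl": "ipo"}
new = migrate_config(old)
print(new)  # {'rl': 'dpo', 'jagged_restart_steps': 150, 'dpo_loss_type': ['ipo']}
```

The helper copies the input rather than mutating it, so the original config stays available for comparison.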

New Model Support

  • Mistral Medium 3.5 (#3633 by @NanoCode012): base config and QLoRA examples for text + vision; reasoning-trace compatible.

Dependency Updates


Bug Fixes


Infrastructure

  • Smaller pretrained models in CI (#3620 by @winglian)
  • modal run migrated to explicit module flag (#3668 by @winglian): aligns with the modal CLI's expected invocation form.
  • Skip scattermoe-LoRA tests on CUDA OOM under xdist contention (#3689 by @winglian): correctness bugs still surface as failures (typed exceptions, not OOM).

New Contributors


Full Changelog: v0.16.1...v0.17.0