Axolotl v0.17.0 Release Notes
Another packed release. ~84 commits since v0.16.1 last month, bringing Expert Parallelism for MoE training, BitNet 1.58-bit fine-tuning, remote training via Tinker, context parallelism for hybrid SSM models, MXFP4 ScatterMoE-LoRA, fused RMSNorm+RoPE kernels for the Qwen3 family, a systemic fix for multimodal loss masking, a uv-first install path, and a long tail of stability/perf work on FSDP2, Gemma 4, and DPO.
Highlights
Expert Parallelism (EP) via DeepEP
Distributed MoE training across ranks via DeepSeek's DeepEP all-to-all kernels, verified on 2×A100, 4×A100, 8×H100, and EP + FSDP composition. Hopper low-latency kernels and TP/CP composition are follow-ups. See the Expert Parallelism docs and DeepEP integration guide.
- Contributed by @NanoCode012 in #3632.
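As a rough illustration of what expert parallelism distributes, the sketch below groups routed tokens by the rank that owns their expert, which is the bookkeeping an all-to-all dispatch has to perform. The contiguous expert-to-rank layout and the helper name `plan_dispatch` are assumptions made for this example; they are not DeepEP's or Axolotl's API.

```python
from collections import defaultdict


def plan_dispatch(expert_ids, num_experts, world_size):
    """Group token indices by the rank that owns each token's expert.

    Assumes experts are split contiguously and evenly across ranks
    (a hypothetical layout chosen for this sketch).
    """
    if num_experts % world_size != 0:
        raise ValueError("num_experts must be divisible by world_size")
    experts_per_rank = num_experts // world_size
    send_lists = defaultdict(list)
    for token_idx, expert in enumerate(expert_ids):
        send_lists[expert // experts_per_rank].append(token_idx)
    return dict(send_lists)


# Six tokens routed to 8 experts that are sharded over 4 ranks.
plan = plan_dispatch([0, 7, 3, 2, 6, 1], num_experts=8, world_size=4)
print(plan)
```

Each rank would then send the listed tokens to the owning rank and receive the expert outputs back in a second exchange.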
Train on Remote Compute via Tinker-compatible APIs
Run training against a Tinker / Hatchery API endpoint instead of local hardware, with example SFT and GRPO configs plus a math reward function for RL workflows.
BitNet 1.58-bit Fine-tuning
Full fine-tuning support for BitNet via the onebitllms library, with config-validation guards against incompatible LoRA setups and a setup guide for post-training conversion.
- Contributed by @younesbelkada in #3634 and #3636.
Q-GaLore Optimizer
New memory-efficient optimizer q_galore_adamw8bit based on Q-GaLore. Requires FSDP2, full fine-tuning, and bf16. Not compatible with adapters.
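The constraints above can be sketched as a small pre-flight check. This is only an illustration: the helper `check_q_galore` is hypothetical, and the key names `fsdp_version`, `adapter`, and `bf16` are assumed here rather than taken from these notes, so check the config reference for the exact schema.

```python
def check_q_galore(cfg):
    """Return a list of problems for a config dict using q_galore_adamw8bit.

    Key names (optimizer, fsdp_version, adapter, bf16) are assumptions
    made for this sketch.
    """
    if cfg.get("optimizer") != "q_galore_adamw8bit":
        return []
    problems = []
    if cfg.get("fsdp_version") != 2:
        problems.append("requires FSDP2")
    if cfg.get("adapter"):
        problems.append("not compatible with adapters")
    if not cfg.get("bf16"):
        problems.append("requires bf16")
    return problems


ok = {"optimizer": "q_galore_adamw8bit", "fsdp_version": 2, "bf16": True}
bad = {"optimizer": "q_galore_adamw8bit", "adapter": "lora", "bf16": True}
print(check_q_galore(ok))
print(check_q_galore(bad))
```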
MoRA / ReMoRA Integration
Optional MoRA (Mixture-of-Rank-Adapters) support with ReMoRA restart scheduling, registered through a new plugin-based adapter system so future custom adapters can slot in the same way. A companion ReLoRA cleanup fixes the optimizer reset scope, adds a configurable relora_prune_method, and renames relora_steps → jagged_restart_steps (breaking, see Deprecations).
Multimodal Assistant-only Loss Masking
Fixes a long-standing bug where train_on_inputs / roles_to_train / train_on_eos were silently ignored in the multimodal collator: every model except Gemma 3n was training on the full sequence regardless of config. New per-template strategies for Gemma 4, Llama 3.2 Vision, Llama 4, Pixtral, and Mistral V7 Tekken, plus an opt-in cfg.role_boundaries override for unverified templates.
Heads up: if you were training multimodal models on assistant-only data, your loss values will change after upgrading. This is expected.
- Contributed by @thad0ctor in #3625.
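To make the effect of assistant-only masking concrete, here is a minimal sketch of what the labels look like once non-assistant tokens are excluded from the loss. The per-token `roles` list and the helper `mask_labels` are simplifications assumed for this example; the real collator works from template-specific role boundaries.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_labels(input_ids, roles, roles_to_train=("assistant",)):
    """Copy input_ids into labels, masking tokens from untrained roles.

    `roles` gives one role name per token; this per-token layout is an
    assumption made to keep the sketch short.
    """
    return [
        tok if role in roles_to_train else IGNORE_INDEX
        for tok, role in zip(input_ids, roles)
    ]


input_ids = [11, 12, 13, 21, 22]
roles = ["user", "user", "user", "assistant", "assistant"]
labels = mask_labels(input_ids, roles)
print(labels)
```

Before the fix, the equivalent of `labels == input_ids` was used for most multimodal models, which is why loss values shift after upgrading.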
DPO Loss Types & SimPO LoRA fix
New dpo_loss_type (list) and dpo_loss_weights config knobs expose the full TRL ≥ 0.29 loss menu and let users mix multiple DPO losses with custom weightings, restoring losses like RPO that broke after the TRL 0.29 refactor. Separately, rl: simpo + LoRA no longer raises ValueError: You passed a PeftModel instance together with a peft_config.
Deprecation:
rl: ipo is deprecated. Use rl: dpo with dpo_loss_type: ["ipo"] instead.
- Contributed by @BrownianNotion in #3566 and @ved1beta in #3665.
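The idea of mixing several DPO losses with weights can be sketched in plain Python. The example assumes the weights combine as a simple weighted sum and uses the commonly cited sigmoid (DPO) and IPO formulas on a scalar `logits` value (the policy log-ratio minus the reference log-ratio). The function names are hypothetical and this is not the TRL implementation.

```python
import math


def sigmoid_loss(logits, beta):
    """Standard DPO loss: -log(sigmoid(beta * logits))."""
    return math.log1p(math.exp(-beta * logits))


def ipo_loss(logits, beta):
    """IPO loss: squared distance of logits from 1 / (2 * beta)."""
    return (logits - 1.0 / (2.0 * beta)) ** 2


LOSSES = {"sigmoid": sigmoid_loss, "ipo": ipo_loss}


def mixed_loss(logits, beta, loss_types, loss_weights):
    """Weighted sum of the named losses (assumed combination rule)."""
    if len(loss_types) != len(loss_weights):
        raise ValueError("need one weight per loss type")
    return sum(
        weight * LOSSES[name](logits, beta)
        for name, weight in zip(loss_types, loss_weights)
    )


logits = 0.0  # policy and reference agree on the preference margin
print(mixed_loss(logits, 0.1, ["sigmoid", "ipo"], [0.5, 0.5]))
```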
Context Parallelism for Hybrid SSM Models
Context parallel support for hybrid attention + Mamba2 SSM models (Nemotron-H, Falcon-H1, Bamba, Granite MoE Hybrid, and Zamba2), plus seq_idx threading so SSM state resets correctly at packed-sequence boundaries. Uses an exact additive correction (exploiting SSM linearity) with one P2P round per layer, not the O(world_size) of ring attention.
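The additive correction can be illustrated with a toy scalar recurrence h_t = a_t * h_{t-1} + b_t. Because the recurrence is linear in the incoming state, a rank can scan its chunk from a zero state and later add the incoming state scaled by the running product of a_t. This is a sketch under that simplification: real Mamba2 states are matrix-valued, and the function names here are hypothetical.

```python
def scan(a, b, h0=0.0):
    """Sequential linear recurrence h_t = a_t * h_{t-1} + b_t."""
    states, h = [], h0
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        states.append(h)
    return states


def local_scan_with_decay(a, b):
    """Scan from a zero state, also returning the running product of a_t."""
    states = scan(a, b)
    decay, prod = [], 1.0
    for a_t in a:
        prod *= a_t
        decay.append(prod)
    return states, decay


a = [0.5, 0.25, 0.5, 0.5]
b = [1.0, 2.0, 3.0, 4.0]
reference = scan(a, b)  # single-device result

# "Rank 0" owns the first two steps, "rank 1" the last two.
states0, _ = local_scan_with_decay(a[:2], b[:2])
states1, decay1 = local_scan_with_decay(a[2:], b[2:])
h_in = states0[-1]  # the one value rank 1 must receive from rank 0
corrected = [s + d * h_in for s, d in zip(states1, decay1)]
print(states0 + corrected == reference)
```

Only the boundary state crosses ranks, which is where the communication saving over ring attention comes from.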
uv-first installs and Docker images
uv is now the recommended package manager. New -uv Docker image variants ship with a generated lockfile and a migration guide from pip; minimum PyTorch is bumped to 2.9.1.
- Contributed by @NanoCode012 in #3545.
Gemma 4 Hybrid Attention + Fused RMSNorm/RoPE Kernels
Mixed FA + SDPA dispatch for Gemma 4 (FA2 on standard layers, SDPA where head_dim=512 OOMs FA2), plus a fused RMSNorm+RoPE Triton kernel. Enable via gemma4_hybrid_attn_impl: true. A VRAM leak in this path under activation checkpointing was also fixed.
- Contributed by @winglian in #3598 and @thad0ctor in #3611.
Fused RMSNorm+RoPE Kernels for Qwen3 / Qwen3.X
Generalizes the Gemma 4 fused RMSNorm+RoPE Triton kernel to the Qwen3 family (Qwen3, Qwen3-MoE, Qwen3.5, Qwen3.6, Qwen3-VL) behind a new opt-in cfg.fused_attn_kernel, and auto-enables Liger's fused (m-)rope for the Qwen-VL models.
- Contributed by @thad0ctor in #3680.
ScatterMoE-LoRA: MXFP4 Weights and Tiled-MLP for Long Context
Adds MXFP4-quantized expert weights to ScatterMoE-LoRA for memory-efficient MoE adapter training, plus a tiled-MLP path (FSDP2 reshard fix, grad-accumulator dtype fix, and a shard-count heuristic worth 3.2× at long context). An INT64 indices fix in the scatter2scatter Triton family resolves silent cuBLAS failures when routed-token offsets cross the 2³¹ boundary.
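To show why the 2³¹ boundary matters, the sketch below emulates a 32-bit signed index with `ctypes` and compares it with a 64-bit one. The row and stride values are hypothetical, chosen only so that their product exceeds the int32 range; the real kernels compute offsets inside Triton.

```python
import ctypes

INT32_MAX = 2**31 - 1


def offset_as_int32(row, stride):
    """Element offset as a 32-bit signed index would store it (wraps)."""
    return ctypes.c_int32(row * stride).value


def offset_as_int64(row, stride):
    """Same offset held in a 64-bit signed index."""
    return ctypes.c_int64(row * stride).value


row, stride = 600_000, 4096  # hypothetical routed-token row and hidden size
true_offset = row * stride
print(true_offset > INT32_MAX)
print(offset_as_int32(row, stride))
print(offset_as_int64(row, stride) == true_offset)
```

The 32-bit value wraps to a negative number, so a kernel using it would address the wrong memory without raising an error at the point of the overflow.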
Performance & Kernel Optimizations
- Pre-cache eot token ids (#3594 by @winglian): avoids recomputing on every iteration.
- DPO collation padding (#3601 by @winglian): new pad_to_multiple_of pads to buckets to avoid blowing up FLA autotune memory on Qwen 3.5 DPO/ORPO.
- gc_steps → torch_empty_cache_steps (#3604 by @SuperMarioYL): splits cache-clear and Python GC into separate native knobs; gc_steps is auto-migrated.
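Padding to buckets can be sketched as rounding each batch length up to the next multiple. The example assumes the benefit comes from fewer distinct sequence lengths reaching shape-keyed autotuning; the helper name `padded_length` and the bucket size of 64 are illustrative, not Axolotl defaults.

```python
def padded_length(length, pad_to_multiple_of):
    """Round a sequence length up to the next multiple."""
    if pad_to_multiple_of <= 0:
        raise ValueError("pad_to_multiple_of must be positive")
    return -(-length // pad_to_multiple_of) * pad_to_multiple_of


batch_max_lengths = [130, 190, 250, 257, 300, 511]
raw_shapes = set(batch_max_lengths)
bucketed = {padded_length(n, 64) for n in batch_max_lengths}
print(len(raw_shapes), len(bucketed))
```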
New Features
- fp32_norms for FSDP2 (#3670 by @winglian): keep RMSNorm/LayerNorm in fp32 while training in bf16/fp16. Required for models like AFMOE that declare fp32 norms.
- excess_length_strategy for RL trainers (#3578 by @yurekami): DPO/IPO/ORPO/SimPO/KTO now respect drop/truncate/raise (previously always dropped).
- FineGrainedFP8Config quantization (#3587 by @madScientist10): loads FP8-quantized models with optional dequantize: true for FFT.
- Skip redundant eval on checkpoint resume (#3575 by @joaquinhuigomez)
- processor_kwargs YAML field (#3612 by @thad0ctor): overrides for e.g. Gemma 4 image_seq_length or Qwen-VL min_pixels.
- field_messages support in multimodal collator (#3628 by @cyc00518)
- Multimodal AutoProcessor support (#3656 by @ved1beta)
- String-formatted messages field (#3607 by @brightwind26): JSON-string messages fields are now decoded transparently.
- Trainable / masked spans in content and reasoning_content (#3592 by @winglian): per-part train/weight flags; replaces train_details.
- Attention implementation refactor (#3602 by @winglian): unified attn_implementation (eager / flash / sdpa / xformers / flex / sage / s2 / fp8), with new FP8 attention backend. Legacy flags still honored.
- Multi-GPU Qwen3.5 FFT example config (#3605 by @ved1beta)
- DO_NOT_TRACK / AXOLOTL_DO_NOT_TRACK now properly honored (#3580 by @maximegmd)
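The three excess_length_strategy values can be sketched on plain token lists as follows. This is an illustration only: the helper name is hypothetical, and real preference datasets carry prompt, chosen, and rejected fields rather than a single list per sample.

```python
def apply_excess_length_strategy(samples, max_len, strategy="drop"):
    """Handle token lists longer than max_len using drop, truncate, or raise."""
    if strategy not in ("drop", "truncate", "raise"):
        raise ValueError(f"unknown strategy: {strategy}")
    kept = []
    for tokens in samples:
        if len(tokens) <= max_len:
            kept.append(tokens)
        elif strategy == "truncate":
            kept.append(tokens[:max_len])
        elif strategy == "raise":
            raise ValueError(f"sample has {len(tokens)} tokens, limit is {max_len}")
        # strategy == "drop": skip the overlong sample
    return kept


samples = [[1, 2], [1, 2, 3, 4, 5]]
print(apply_excess_length_strategy(samples, 3, "drop"))
print(apply_excess_length_strategy(samples, 3, "truncate"))
```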
Documentation
- Easier doc discovery for agents (#3579 by @winglian): surfaces agent-optimized docs more prominently.
- Document jinja2 file path support (#3588 by @NanoCode012): clarifies that chat templates can be loaded from a local jinja file.
- Update Docker docs (#3623 by @NanoCode012): refreshed for the uv-first builds and the new image variants.
- Security policy updated (#3645 by @NanoCode012): now uses email rather than Discord.
- Cut Cross Entropy uninstall hint (#3583 by @floaty3): better message when an import mismatch is detected.
Model & Framework Support
Deprecations
- relora_steps renamed to jagged_restart_steps (#3646 by @winglian): breaking config change, no migration shim. Existing ReLoRA configs need a one-line update.
- rl: ipo deprecated in favor of rl: dpo, dpo_loss_type: ["ipo"] (#3566 by @BrownianNotion)
- gc_steps deprecated in favor of torch_empty_cache_steps + gc_collect_steps (#3604 by @SuperMarioYL)
- Legacy attention flags soft-deprecated by attn_implementation (#3602 by @winglian): still honored
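For configs kept as dictionaries, the first two renames listed above can be applied with a small helper like the one below. It is a hypothetical sketch and not part of Axolotl; it leaves gc_steps alone because these notes say that key is auto-migrated.

```python
def migrate_config(cfg):
    """Return a copy of a config dict with the two renames applied.

    Hypothetical helper for illustration; it is not part of Axolotl.
    """
    new = dict(cfg)
    if "relora_steps" in new:
        new["jagged_restart_steps"] = new.pop("relora_steps")
    if new.get("rl") == "ipo":
        new["rl"] = "dpo"
        # Keep an explicit dpo_loss_type if the user already set one.
        new.setdefault("dpo_loss_type", ["ipo"])
    return new


old = {"relora_steps": 150, "rl": "ipo"}
print(migrate_config(old))
```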
New Model Support
- Mistral Medium 3.5 (#3633 by @NanoCode012): base config and QLoRA examples for text + vision; reasoning-trace compatible.
Dependency Updates
- transformers 5.5.0 → 5.9.0 (#3593, #3603, #3696 by @winglian, #3650 by @NanoCode012)
- trl 0.29.0 → 1.5.1 (#3603, #3696 by @winglian)
- peft 0.18.1 → 0.19.1 (#3671 by @winglian)
- datasets 4.5.0 → 4.8.4
- kernels 0.12.2 → 0.13.0
- vLLM ≥ 0.19.0 for torch 2.10.0 (#3582 by @winglian)
- PyTorch 2.12 base images (#3621, #3697 by @winglian): adds 2.11/2.12 bases and prunes unused image variants.
- Blackwell Docker variant (#3641 by @ved1beta): CUDA 13.0 + PyTorch 2.9.1 with TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3 12.0+PTX" covering B100/B200/B300 and RTX 50-series.
- typer < 0.26.0 pinned (#3684 by @winglian): latest typer broke the HF CLI.
Bug Fixes
- Gemma 4 VRAM leak in hybrid FA2+SDPA path (#3611 by @thad0ctor): shared_kv_states captured by checkpointing partials under FSDP2 pinned K/V tensors across steps. Routed through TLS; VRAM flat across 200+ steps.
- FSDP FULL_STATE_DICT OOM during save (#3635 by @ved1beta)
- FSDP2: clone sharded param so full-size shard can be GC'd (#3597 by @winglian)
- Multimodal processor_kwargs regression (#3643 by @NanoCode012)
- Ring FA broken by transformers FA utils relocation (#3644 by @NanoCode012)
- Ray batch_size derivation, FSDP schema migration, FakeExperts/peft 0.19 compat (#3671 by @winglian)
- Probe GPU capabilities on Ray worker, not driver (#3619 by @zxuhan)
- CI fixes: FA2/Ray breakage (#3664 by @NanoCode012), EP test teardown (#3674 by @NanoCode012), MX under transformers 5.8.1 (#3679 by @winglian), flaky EP tests (#3683 by @winglian), test_rm_lora skip (#3669 by @ved1beta)
- cu130 LD_LIBRARY_PATH startup (#3648 by @winglian)
- Gemma 4 fixes: kernelize() crash on vision tower (#3687 by @winglian), latest chat template (#3686 by @winglian), DDP/FSDP (#3584 by @winglian), profiler & misc (#3591 by @winglian), regex for unfrozen language tower (#3586 by @NanoCode012)
- Refactor kernels patch to drop routing and inject into Expert (#3651 by @NanoCode012)
- KD trainer crash on transformers 5.x and silent wrong training on multimodal Gemma (#3661 by @roycho96): the injected liger fused loss had a signature mismatch with the stock loss path, and the patched forward never bound on XxxForConditionalGeneration classes. KD loss now computed inside compute_loss directly.
- KD liger chunked loss dropped CE gradient (#3660 by @roycho96): torch.func.grad_and_value(has_aux=True) only differentiated the soft loss, so CE-only and KD-mix runs silently learned nothing from CE.
- DoRA merge on Conv layers in Qwen 3.5 (#3599 by @winglian), qwen3_5.jinja list content on system messages (#3595 by @joaquinhuigomez)
- Async prefetch with NeMo Gym (#3606 by @winglian), prepare_context_parallel_inputs no-op (#3520 by @NanoCode012)
- Preserve split slices for local file datasets (#3627 by @cyc00518)
- Rename model → adapter_model for FSDP sharded final model (#3585 by @NanoCode012)
- Unsupported tensor for PTQ (#3581 by @ved1beta), MoE activation VRAM leak test (#3649 by @ved1beta)
- AssertionError: Original QKV code not found (#3657 by @ved1beta)
- Misc: warning order after overrides (#3589 by @NanoCode012), Docker build (#3622 by @NanoCode012), post-v0.16 cleanup (#3577 by @NanoCode012), CCE + Liger added to Nemotron-H example (#3573 by @NanoCode012)
Infrastructure
- Smaller pretrained models in CI (#3620 by @winglian)
- modal run migrated to explicit module flag (#3668 by @winglian): aligns with the modal CLI's expected invocation form.
- Skip scattermoe-LoRA tests on CUDA OOM under xdist contention (#3689 by @winglian): correctness bugs still surface as failures (typed exceptions, not OOM).
New Contributors
- @yurekami made their first contribution in #3578
- @SuperMarioYL made their first contribution in #3604
- @thad0ctor made their first contribution in #3611
- @cyc00518 made their first contribution in #3627
- @zxuhan made their first contribution in #3619
- @brightwind26 made their first contribution in #3607
- @floaty3 made their first contribution in #3583
Full Changelog: v0.16.1...v0.17.0