Axolotl v0.17.0 Release Notes

Another packed release. ~84 commits since v0.16.1 last month, bringing Expert Parallelism for MoE training, BitNet 1.58-bit fine-tuning, remote training via Tinker, context parallelism for hybrid SSM models, MXFP4 ScatterMoE-LoRA, fused RMSNorm+RoPE kernels for the Qwen3 family, a systemic fix for multimodal loss masking, a uv-first install path, and a long tail of stability/perf work on FSDP2, Gemma 4, and DPO.

Highlights

Expert Parallelism (EP) via DeepEP

Distributed MoE training across ranks via DeepSeek's DeepEP all-to-all kernels, verified on 2×A100, 4×A100, 8×H100, and EP + FSDP composition. Hopper low-latency kernels and TP/CP composition are follow-ups. See the Expert Parallelism docs and DeepEP integration guide.

Contributed by @NanoCode012 in #3632.

Train on Remote Compute via Tinker-compatible APIs

Run training against a Tinker / Hatchery API endpoint instead of local hardware, with example SFT and GRPO configs plus a math reward function for RL workflows.

Contributed by @winglian in #3614.

BitNet 1.58-bit Fine-tuning

Full fine-tuning support for BitNet via the onebitllms library, with config-validation guards against incompatible LoRA setups and a setup guide for post-training conversion.

Contributed by @younesbelkada in #3634 and #3636.

Q-GaLore Optimizer

New memory-efficient optimizer q_galore_adamw8bit based on Q-GaLore. Requires FSDP2, full fine-tuning, and bf16. Not compatible with adapters.

Contributed by @ved1beta in #3654.

MoRA / ReMoRA Integration

Optional MoRA (Mixture-of-Rank-Adapters) support with ReMoRA restart scheduling, registered through a new plugin-based adapter system so future custom adapters can slot in the same way. A companion ReLoRA cleanup fixes the optimizer reset scope, adds a configurable relora_prune_method, and renames relora_steps → jagged_restart_steps (breaking, see Deprecations).

Contributed by @winglian in #3647 and #3646.

Multimodal Assistant-only Loss Masking

Fixes a long-standing bug where train_on_inputs / roles_to_train / train_on_eos were silently ignored in the multimodal collator: every model except Gemma 3n was training on the full sequence regardless of config. New per-template strategies for Gemma 4, Llama 3.2 Vision, Llama 4, Pixtral, and Mistral V7 Tekken, plus an opt-in cfg.role_boundaries override for unverified templates.

Heads up: if you were training multimodal models on assistant-only data, your loss values will change after upgrading. This is expected.

Contributed by @thad0ctor in #3625.

DPO Loss Types & SimPO LoRA fix

New dpo_loss_type (list) and dpo_loss_weights config knobs expose the full TRL ≥ 0.29 loss menu and let users mix multiple DPO losses with custom weightings, restoring losses like RPO that broke after the TRL 0.29 refactor. Separately, rl: simpo + LoRA no longer raises ValueError: You passed a PeftModel instance together with a peft_config.

Deprecation: rl: ipo is deprecated. Use rl: dpo with dpo_loss_type: ["ipo"] instead.

Contributed by @BrownianNotion in #3566 and @ved1beta in #3665.

Context Parallelism for Hybrid SSM Models

Context parallel support for hybrid attention + Mamba2 SSM models (Nemotron-H, Falcon-H1, Bamba, Granite MoE Hybrid, and Zamba2), plus seq_idx threading so SSM state resets correctly at packed-sequence boundaries. Uses an exact additive correction (exploiting SSM linearity) with one P2P round per layer, not the O(world_size) of ring attention.

Contributed by @ved1beta in #3572.

uv-first installs and Docker images

uv is now the recommended package manager. New -uv Docker image variants ship with a generated lockfile and a migration guide from pip; minimum PyTorch is bumped to 2.9.1.

Contributed by @NanoCode012 in #3545.

Gemma 4 Hybrid Attention + Fused RMSNorm/RoPE Kernels

Mixed FA + SDPA dispatch for Gemma 4 (FA2 on standard layers, SDPA where head_dim=512 OOMs FA2), plus a fused RMSNorm+RoPE Triton kernel. Enable via gemma4_hybrid_attn_impl: true. A VRAM leak in this path under activation checkpointing was also fixed.

Contributed by @winglian in #3598 and @thad0ctor in #3611.

Fused RMSNorm+RoPE Kernels for Qwen3 / Qwen3.X

Generalizes the Gemma 4 fused RMSNorm+RoPE Triton kernel to the Qwen3 family (Qwen3, Qwen3-MoE, Qwen3.5, Qwen3.6, Qwen3-VL) behind a new opt-in cfg.fused_attn_kernel, and auto-enables Liger's fused (m-)rope for the Qwen-VL models.

Contributed by @thad0ctor in #3680.

ScatterMoE-LoRA: MXFP4 Weights and Tiled-MLP for Long Context

Adds MXFP4-quantized expert weights to ScatterMoE-LoRA for memory-efficient MoE adapter training, plus a tiled-MLP path (FSDP2 reshard fix, grad-accumulator dtype fix, and a shard-count heuristic worth 3.2× at long context). An INT64 indices fix in the scatter2scatter Triton family resolves silent cuBLAS failures when routed-token offsets cross the 2³¹ boundary.

Contributed by @winglian in #3663 (MXFP4), #3666 (tiled-MLP), and #3667 (INT64 indices).

Performance & Kernel Optimizations

Pre-cache eot token ids (#3594 by @winglian): avoids recomputing on every iteration.
DPO collation padding (#3601 by @winglian): new pad_to_multiple_of pads to buckets to avoid blowing up FLA autotune memory on Qwen 3.5 DPO/ORPO.
gc_steps → torch_empty_cache_steps (#3604 by @SuperMarioYL): splits cache-clear and Python GC into separate native knobs; gc_steps is auto-migrated.

New Features

fp32_norms for FSDP2 (#3670 by @winglian): keep RMSNorm/LayerNorm in fp32 while training in bf16/fp16. Required for models like AFMOE that declare fp32 norms.
excess_length_strategy for RL trainers (#3578 by @yurekami): DPO/IPO/ORPO/SimPO/KTO now respect drop / truncate / raise (previously always dropped).
FineGrainedFP8Config quantization (#3587 by @madScientist10): loads FP8-quantized models with optional dequantize: true for FFT.
Skip redundant eval on checkpoint resume (#3575 by @joaquinhuigomez)
processor_kwargs YAML field (#3612 by @thad0ctor): overrides for e.g. Gemma 4 image_seq_length or Qwen-VL min_pixels.
field_messages support in multimodal collator (#3628 by @cyc00518)
Multimodal AutoProcessor support (#3656 by @ved1beta)
String-formatted messages field (#3607 by @brightwind26): JSON-string messages fields are now decoded transparently.
Trainable / masked spans in content and reasoning_content (#3592 by @winglian): per-part train / weight flags; replaces train_details.
Attention implementation refactor (#3602 by @winglian): unified attn_implementation (eager / flash / sdpa / xformers / flex / sage / s2 / fp8), with new FP8 attention backend. Legacy flags still honored.
Multi-GPU Qwen3.5 FFT example config (#3605 by @ved1beta)
DO_NOT_TRACK / AXOLOTL_DO_NOT_TRACK now properly honored (#3580 by @maximegmd)

Documentation

Easier doc discovery for agents (#3579 by @winglian): surfaces agent-optimized docs more prominently.
Document jinja2 file path support (#3588 by @NanoCode012): clarifies that chat templates can be loaded from a local jinja file.
Update Docker docs (#3623 by @NanoCode012): refreshed for the uv-first builds and the new image variants.
Security policy updated (#3645 by @NanoCode012): now uses email rather than Discord.
Cut Cross Entropy uninstall hint (#3583 by @floaty3): better message when an import mismatch is detected.

Model & Framework Support

Deprecations

relora_steps renamed to jagged_restart_steps (#3646 by @winglian): breaking config change, no migration shim. Existing ReLoRA configs need a one-line update.
rl: ipo deprecated in favor of rl: dpo, dpo_loss_type: ["ipo"] (#3566 by @BrownianNotion)
gc_steps deprecated in favor of torch_empty_cache_steps + gc_collect_steps (#3604 by @SuperMarioYL)
Legacy attention flags soft-deprecated by attn_implementation (#3602 by @winglian): still honored

New Model Support

Mistral Medium 3.5 (#3633 by @NanoCode012): base config and QLoRA examples for text + vision; reasoning-trace compatible.

Dependency Updates

transformers 5.5.0 → 5.9.0 (#3593, #3603, #3696 by @winglian, #3650 by @NanoCode012)
trl 0.29.0 → 1.5.1 (#3603, #3696 by @winglian)
peft 0.18.1 → 0.19.1 (#3671 by @winglian)
datasets 4.5.0 → 4.8.4
kernels 0.12.2 → 0.13.0
vLLM ≥ 0.19.0 for torch 2.10.0 (#3582 by @winglian)
PyTorch 2.12 base images (#3621, #3697 by @winglian): adds 2.11/2.12 bases and prunes unused image variants.
Blackwell Docker variant (#3641 by @ved1beta): CUDA 13.0 + PyTorch 2.9.1 with TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3 12.0+PTX" covering B100/B200/B300 and RTX 50-series.
typer < 0.26.0 pinned (#3684 by @winglian): latest typer broke the HF CLI.

Bug Fixes

Gemma 4 VRAM leak in hybrid FA2+SDPA path (#3611 by @thad0ctor): shared_kv_states captured by checkpointing partials under FSDP2 pinned K/V tensors across steps. Routed through TLS; VRAM flat across 200+ steps.
FSDP FULL_STATE_DICT OOM during save (#3635 by @ved1beta)
FSDP2: clone sharded param so full-size shard can be GC'd (#3597 by @winglian)
Multimodal processor_kwargs regression (#3643 by @NanoCode012)
Ring FA broken by transformers FA utils relocation (#3644 by @NanoCode012)
Ray batch_size derivation, FSDP schema migration, FakeExperts/peft 0.19 compat (#3671 by @winglian)
Probe GPU capabilities on Ray worker, not driver (#3619 by @zxuhan)
CI fixes: FA2/Ray breakage (#3664 by @NanoCode012), EP test teardown (#3674 by @NanoCode012), MX under transformers 5.8.1 (#3679 by @winglian), flaky EP tests (#3683 by @winglian), test_rm_lora skip (#3669 by @ved1beta)
cu130 LD_LIBRARY_PATH startup (#3648 by @winglian)
Gemma 4 fixes: kernelize() crash on vision tower (#3687 by @winglian), latest chat template (#3686 by @winglian), DDP/FSDP (#3584 by @winglian), profiler & misc (#3591 by @winglian), regex for unfrozen language tower (#3586 by @NanoCode012)
Refactor kernels patch to drop routing and inject into Expert (#3651 by @NanoCode012)
KD trainer crash on transformers 5.x and silent wrong training on multimodal Gemma (#3661 by @roycho96): the injected liger fused loss had a signature mismatch with the stock loss path, and the patched forward never bound on XxxForConditionalGeneration classes. KD loss now computed inside compute_loss directly.
KD liger chunked loss dropped CE gradient (#3660 by @roycho96): torch.func.grad_and_value(has_aux=True) only differentiated the soft loss, so CE-only and KD-mix runs silently learned nothing from CE.
DoRA merge on Conv layers in Qwen 3.5 (#3599 by @winglian), qwen3_5.jinja list content on system messages (#3595 by @joaquinhuigomez)
Async prefetch with NeMo Gym (#3606 by @winglian), prepare_context_parallel_inputs no-op (#3520 by @NanoCode012)
Preserve split slices for local file datasets (#3627 by @cyc00518)
Rename model → adapter_model for FSDP sharded final model (#3585 by @NanoCode012)
Unsupported tensor for PTQ (#3581 by @ved1beta), MoE activation VRAM leak test (#3649 by @ved1beta)
AssertionError: Original QKV code not found (#3657 by @ved1beta)
Misc: warning order after overrides (#3589 by @NanoCode012), Docker build (#3622 by @NanoCode012), post-v0.16 cleanup (#3577 by @NanoCode012), CCE + Liger added to Nemotron-H example (#3573 by @NanoCode012)

Infrastructure

Smaller pretrained models in CI (#3620 by @winglian)
modal run migrated to explicit module flag (#3668 by @winglian): aligns with the modal CLI's expected invocation form.
Skip scattermoe-LoRA tests on CUDA OOM under xdist contention (#3689 by @winglian): correctness bugs still surface as failures (typed exceptions, not OOM).

New Contributors

@yurekami made their first contribution in #3578
@SuperMarioYL made their first contribution in #3604
@thad0ctor made their first contribution in #3611
@cyc00518 made their first contribution in #3627
@zxuhan made their first contribution in #3619
@brightwind26 made their first contribution in #3607
@floaty3 made their first contribution in #3583

Full Changelog: v0.16.1...v0.17.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v0.17.0

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Axolotl v0.17.0 Release Notes

Highlights

Expert Parallelism (EP) via DeepEP

Train on Remote Compute via Tinker-compatible APIs

BitNet 1.58-bit Fine-tuning

Q-GaLore Optimizer

MoRA / ReMoRA Integration

Multimodal Assistant-only Loss Masking

DPO Loss Types & SimPO LoRA fix

Context Parallelism for Hybrid SSM Models

uv-first installs and Docker images

Gemma 4 Hybrid Attention + Fused RMSNorm/RoPE Kernels

Fused RMSNorm+RoPE Kernels for Qwen3 / Qwen3.X

ScatterMoE-LoRA: MXFP4 Weights and Tiled-MLP for Long Context

Performance & Kernel Optimizations

New Features

Documentation

Model & Framework Support

Deprecations

New Model Support

Dependency Updates

Bug Fixes

Infrastructure

New Contributors

Contributors

Uh oh!