v0.9.1
Efficiency & Performance
Axolotl delivers demonstrably faster training thanks to improved sample packing and kernels, so you can accomplish more in less time. Our benchmarks show Axolotl outperforming the next-fastest trainer by 30% on real-world workloads, a gain that translates directly into higher productivity and faster time-to-results.
Axolotl delivers this performance advantage while consuming less VRAM and eliminating resource spikes throughout your training runs.
Get started on Google Colab
Fine-tune your own Qwen3-14B for free on Google Colab: https://colab.research.google.com/drive/1EscYgLM38dWMcG5IyJz1qHl3VO7a2hZz
🚨 Breaking Changes & Deprecations
PyTorch 2.4.1 Support Removed
- Support for torch==2.4.1 has been officially removed. Please upgrade to a newer version of PyTorch. (by @winglian in #2582)
🎉 New Features
Greatly Improved Sample Packing
- We've implemented an improved Parallel Bin Packing algorithm that achieves ~99% packing efficiency on most datasets. This can improve your workload throughput by up to 10%. (by @winglian in #2631)
- pad_to_sequence_len is now automatically enabled when sample packing is used, for better performance and stability. (by @winglian in #2607)
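To make "packing efficiency" concrete, here is a minimal sketch of sample packing using the classic first-fit-decreasing heuristic. It is not Axolotl's Parallel Bin Packing implementation from #2631; it only illustrates the metric, real tokens divided by the padded token budget, assuming every packed bin is padded to the full sequence length as pad_to_sequence_len does. The function names are hypothetical.

```python
from typing import List


def first_fit_decreasing(lengths: List[int], capacity: int) -> List[List[int]]:
    """Pack sequence lengths into bins of size `capacity` (first-fit decreasing).

    Returns a list of bins, each a list of indices into `lengths`.
    """
    bins: List[List[int]] = []
    remaining: List[int] = []  # free token slots left in each bin
    # Place the longest sequences first; this usually leaves less wasted space.
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True):
        length = lengths[idx]
        if length > capacity:
            raise ValueError(f"sequence {idx} ({length} tokens) exceeds capacity {capacity}")
        for b, free in enumerate(remaining):
            if length <= free:  # first existing bin with enough room wins
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:  # no bin had room, so open a new one
            bins.append([idx])
            remaining.append(capacity - length)
    return bins


def packing_efficiency(lengths: List[int], bins: List[List[int]], capacity: int) -> float:
    """Fraction of the padded token budget filled by real tokens.

    Assumes every bin is padded to `capacity`, as with pad_to_sequence_len.
    """
    used = sum(lengths[i] for b in bins for i in b)
    return used / (len(bins) * capacity)


if __name__ == "__main__":
    lengths = [700, 300, 512, 512, 900, 100, 250, 774]
    bins = first_fit_decreasing(lengths, capacity=1024)
    print(len(bins), f"{packing_efficiency(lengths, bins, 1024):.1%}")  # prints: 4 98.8%
```

Every token slot that ends up as padding is wasted compute, which is why higher packing efficiency shows up directly as throughput.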
Xformers Attention for Packed Sequences
Added support for using xformers optimized attention with packed sequences in fp16, boosting training speed even further. (by @winglian in #2619)
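Packed sequences need an attention mask that stops tokens from attending across sequence boundaries. The sketch below, in plain Python with hypothetical helper names, builds the cumulative sequence offsets (often called cu_seqlens) and the resulting causal block-diagonal mask. It shows the idea only: xformers represents this kind of mask with its own attention-bias objects rather than a dense boolean matrix, and this is not how Axolotl wires it up internally.

```python
from itertools import accumulate
from typing import List


def cu_seqlens(seqlens: List[int]) -> List[int]:
    """Cumulative offsets marking where each packed sequence starts and ends."""
    return [0, *accumulate(seqlens)]


def causal_block_diagonal_mask(seqlens: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    Tokens only see earlier (or same) positions inside their own sequence,
    so packed neighbours never leak into each other's attention.
    """
    total = cu_seqlens(seqlens)[-1]
    # Map every packed position to the index of the sequence it belongs to.
    seq_id = [i for i, n in enumerate(seqlens) for _ in range(n)]
    return [
        [seq_id[q] == seq_id[k] and k <= q for k in range(total)]
        for q in range(total)
    ]
```

For two packed sequences of length 2, the third token (the first token of the second sequence) can attend only to itself, even though it sits right after the first sequence in the packed row.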
Support for Fine-tuning Text-to-Speech Models with an LLM Backbone
Axolotl now supports training a Text-to-Speech (TTS) model on top of an LLM. (by @mhenrichsen in #2614)
Automatic LoRA Kernel Activation
LoRA kernels are now automatically enabled where possible, providing a hands-free performance boost for LoRA training. This feature is automatically disabled for RL training to ensure stability. (by @djsaunde in #2589 and @winglian in #2600)
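The auto-enable rule can be pictured as a config-normalization step. This is a hypothetical sketch: apart from the RL exclusion stated above, the eligibility conditions (a LoRA or QLoRA adapter and zero LoRA dropout) and the flag names are assumptions for illustration, so consult the Axolotl LoRA optimization docs for the actual rules. Flags the user set explicitly are left alone, so opting out still works.

```python
from typing import Any, Dict

# Assumed config keys for illustration; check the Axolotl docs for the real schema.
KERNEL_KEYS = ("lora_mlp_kernel", "lora_qkv_kernel", "lora_o_kernel")


def auto_enable_lora_kernels(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of cfg with LoRA kernel flags defaulted on when eligible.

    Hypothetical eligibility rules: a LoRA/QLoRA adapter, no LoRA dropout,
    and no RL trainer. Flags already present in cfg are never overridden.
    """
    out = dict(cfg)  # do not mutate the caller's config
    eligible = (
        out.get("adapter") in ("lora", "qlora")
        and not out.get("lora_dropout")  # None or 0.0 counts as "no dropout"
        and not out.get("rl")  # any RL trainer (e.g. "dpo") disables auto-enable
    )
    if eligible:
        for key in KERNEL_KEYS:
            out.setdefault(key, True)  # only fill in keys the user did not set
    return out
```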
CAME Optimizer Support
You can now use the CAME optimizer, a memory-efficient optimizer designed for large language models. (by @xzuyn in #2385)
General User Experience
- DeepSpeed Config Logging: Your DeepSpeed configuration is now automatically saved to Weights & Biases, making your runs easier to reproduce and debug. (by @winglian in #2593)
- Automatic Reasoning Dataset Splitting: Axolotl can now automatically split existing reasoning datasets to leverage new chat templates with reasoning/tool-use turns. (by @winglian in #2591)
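Splitting a reasoning trace typically means moving a leading <think>...</think> block out of the assistant's answer into its own field, so chat templates with dedicated reasoning turns can render it separately. A minimal sketch, assuming a <think> tag and a reasoning_content field name; Axolotl's split_thinking option may use different names or handle more dataset formats.

```python
import re
from typing import Dict

# Assumed tag and field names; Axolotl's actual split_thinking behaviour may differ.
THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*(.*)$", re.DOTALL)


def split_thinking(message: Dict[str, str]) -> Dict[str, str]:
    """Move a leading <think>...</think> block into a separate reasoning field."""
    if message.get("role") != "assistant":
        return dict(message)  # only assistant turns carry reasoning traces
    match = THINK_RE.match(message.get("content", ""))
    if match is None:
        return dict(message)  # no reasoning trace; leave the turn untouched
    reasoning, answer = match.groups()
    return {**message, "reasoning_content": reasoning.strip(), "content": answer}
```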
📦 Dependency Updates
- liger-kernel bumped to 0.5.9. (by @winglian in #2640)
- vllm bumped to 0.8.5 for Qwen3 support. (by @winglian in #2583)
- datasets and other Hugging Face libraries have been updated.
🔧 Major Fixes
Model and Training Stability
- Qwen3 Models: Fixed multiple issues with packing and kernel support for the Qwen3 and Qwen3-MoE model families, ensuring they train correctly. (by @winglian in #2588, #2612, #2622 and @NanoCode012 in #2596)
- Evaluation: Fixed a bug that could cause evaluation runs to fail. (by @djsaunde in #2586)
- DeepSpeed: Resolved an issue where the learning rate was passed as a tensor instead of a float, causing errors. (by @winglian in #2595) The fix was also contributed upstream in huggingface/transformers#37704 by @NanoCode012 and huggingface/transformers#37881 by @winglian.
- DPO Trainer: Fixed an issue where evaluation steps were incorrectly overridden. (by @winglian in #2628)
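The DPO fix in #2628 is described in the changelog as calling the grandparent's method instead of a broken super(). In Python, super(Parent, self) resumes the method lookup after Parent in the MRO, which skips an intermediate override. The class names below are hypothetical stand-ins, not TRL's or Axolotl's actual trainers.

```python
class BaseTrainer:
    """Stand-in for an upstream trainer (hypothetical, for illustration)."""

    def evaluate(self, eval_steps: int) -> str:
        return f"base evaluate, eval_steps={eval_steps}"


class CustomTrainer(BaseTrainer):
    """Intermediate class whose override ignores the caller's eval_steps."""

    def evaluate(self, eval_steps: int) -> str:
        return super().evaluate(eval_steps=10)  # clobbers the caller's value


class DPOTrainer(CustomTrainer):
    def evaluate(self, eval_steps: int) -> str:
        # super().evaluate() would route through CustomTrainer and lose eval_steps.
        # super(CustomTrainer, self) starts the lookup after CustomTrainer in the
        # MRO, i.e. it calls the grandparent (BaseTrainer) directly.
        return super(CustomTrainer, self).evaluate(eval_steps)
```

Using super(CustomTrainer, self) rather than hard-coding BaseTrainer.evaluate(self, ...) keeps the call correct if a mixin is later inserted between the two classes.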
Other Improvements
📚 Documentation & Examples
- Added documentation for the new split_thinking feature. (by @NanoCode012 in #2613)
- Corrected documentation for multimodal datasets and Llama 4 delinearization. (by @NanoCode012 in #2575, #2644)
⚙️ Internal & Plugin Enhancements
- Plugins can now provide their own custom lr_scheduler. (by @alexdremov in #2584)
- Plugins are now able to return their own fully processed datasets. (by @winglian in #2617)
- Improved error messaging when a dataset fails to load. (by @NanoCode012 in #2637)
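A plugin-provided LR scheduler usually follows a "first plugin that answers wins, otherwise use the default" pattern. The hook name create_lr_scheduler comes from the #2584 changelog entry, but its signature here, the BasePlugin class, and the step-to-multiplier schedule type are hypothetical simplifications; a real trainer would wrap an optimizer in a PyTorch scheduler instead.

```python
import math
from typing import Callable, List, Optional

Schedule = Callable[[int], float]  # maps training step -> learning-rate multiplier


class BasePlugin:
    """Hypothetical plugin base; the hook name follows the changelog (#2584)."""

    def create_lr_scheduler(self, num_training_steps: int) -> Optional[Schedule]:
        return None  # None means "no opinion, use the default"


class CosinePlugin(BasePlugin):
    """Example plugin that supplies a cosine decay from 1.0 to 0.0."""

    def create_lr_scheduler(self, num_training_steps: int) -> Optional[Schedule]:
        return lambda step: 0.5 * (1 + math.cos(math.pi * step / num_training_steps))


def linear_decay(num_training_steps: int) -> Schedule:
    """Default schedule: linear decay from 1.0 to 0.0."""
    return lambda step: max(0.0, 1 - step / num_training_steps)


def resolve_scheduler(plugins: List[BasePlugin], num_training_steps: int) -> Schedule:
    """Use the first plugin-provided scheduler, else fall back to linear decay."""
    for plugin in plugins:
        schedule = plugin.create_lr_scheduler(num_training_steps)
        if schedule is not None:
            return schedule
    return linear_decay(num_training_steps)
```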
Full changelog
- remove torch 2.4.1 CI as part of support deprecation by @winglian in #2582
- Post release fixes by @winglian in #2581
- set config on the PluginManager for callback access by @winglian in #2587
- Fix eval + add smoke test by @djsaunde in #2586
- support for qwen3 with lora kernels by @winglian in #2588
- bump vllm==0.8.5 for qwen3 support by @winglian in #2583
- fix(doc): update key used to point to url in multimodal doc by @NanoCode012 in #2575
- auto-enable lora kernels where possible by @djsaunde in #2589
- Plugins create_lr_scheduler support by @alexdremov in #2584
- patch to convert LR from tensor to float when using DS by @winglian in #2595
- feat: add qwen3 moe block for ds3 by @NanoCode012 in #2596
- upload the deepspeed json to wandb by @winglian in #2593
- Handle other reasoning trace dataset formats by @winglian in #2591
- only import vllm serve cli if it's being called by @winglian in #2597
- don't automatically enable lora kernels for RL training by @winglian in #2600
- ensure we pass axolotl extras to the Dockerfile so vllm is included in shipped images by @winglian in #2599
- replace zero_only with simpler if statement by @winglian in #2592
- additional args for grpo config/trainer by @winglian in #2598
- use latest hf-xet and don't install vllm for torch 2.7.0 by @winglian in #2603
- Add num_completions_to_print for trl and grpo by @dhruvmullick in #2604
- add missing init for lr monkeypatch fix by @winglian in #2609
- Logging config for colab by @winglian in #2611
- fix: run preview-docs only when md/qmd changes by @NanoCode012 in #2606
- automatically set pad_to_sequence_len when use packing by @winglian in #2607
- remove keys to incorporate changes for the trl update by @aitechguy0105 in #2616
- qwen3 and qwen3_moe support for liger kernels by @winglian in #2612
- setup hf transfer too and fix auto bf16 when fp16 enabled by @winglian in #2620
- include multipack support for qwen3 family by @winglian in #2622
- Fix logging deprecation warnings by @emmanuel-ferdman in #2623
- Adds example for training a TTS model on top of an LLM. by @mhenrichsen in #2614
- repop cache by @winglian in #2639
- make sure gc_steps is used for all trainers by @winglian in #2638
- fix dpo eval override to call grandparent instead of the broken super by @winglian in #2628
- Print axolotl art if train is called outside of cli by @winglian in #2627
- Update lr_scheduler options in config.qmd to include additional sched… by @mhenrichsen in #2636
- bump liger dep to 0.5.9 by @winglian in #2640
- feat(doc): add split_thinking docs by @NanoCode012 in #2613
- allow plugins to return their own dataset by @winglian in #2617
- Multipack parallel bin packing by @winglian in #2631
- xformers attention with packing by @winglian in #2619
- Add missing __init__.py file to cut_cross_entropy integration by @BitPhinix in #2642
- Configurable embeddings upcast by @winglian in #2621
- Fix: improve error message on failed dataset load by @NanoCode012 in #2637
- fix(doc): clarify instruction to delinearize llama4 similar to cli doc by @NanoCode012 in #2644
- Add CAME Optimizer by @xzuyn in #2385
- swap tinymodels that have safetensors for some ci tests by @winglian in #2641
New Contributors
- @alexdremov made their first contribution in #2584
- @rahul-tuli made their first contribution in #2479
- @aitechguy0105 made their first contribution in #2616
- @emmanuel-ferdman made their first contribution in #2623
- @BitPhinix made their first contribution in #2642
Full Changelog: v0.9.0...v0.9.1