v0.10.0
🚨 Breaking Changes & Deprecations
PyTorch 2.5 Future Deprecation Notice
- Support for
torch==2.5.1will be deprecated in a future release. We now recommend using PyTorch 2.6.0 or higher.
Internal Refactors
- The module for loading models has been refactored from
axolotl.modelstoaxolotl.loaders. This is an internal change and should not affect most users, but may impact those with custom scripts that import directly from these modules. (#2680)
Updated transformers to 4.52.3
Transformers has done a major refactor for vision language models and forward functions for other modeling code. There may be loading issues or patches (Liger / CCE) that do not work with this new version until we fix them.
🎉 New Features
Sparse Finetuning using LLMCompressor
Using LLMCompressor, the integration allows users to efficiently fine-tune models with structured/unstructured sparsity, recovering 99% accuracy or better for sparse models, and 3X faster inference. Learn how to use it here.
- Contributed by @rahul-tuli in #2479.
Quantization-Aware Training (QAT)
QAT simulates quantization during training to achieve higher quality post-training quantized (PTQ) models than from applying PTQ to models trained without QAT. Check the docs.
- Contributed by @salmanmohammadi in #2590 and @winglian in #2776.

Mistral Tokenizer and Improved Tool Calling
- Mistral Native Tokenizer: We now use
mistral-commonfor an official, robust implementation of the Mistral tokenizer. (by @NanoCode012 in #2780) - Enhanced Tool Calling: Added improved support for chat_template with a dedicated
toolscolumn in your dataset, enabling streamlined function-calling models. (by @NanoCode012 in #2774) chat_templatekwargs: Pass additional arguments to your chat templates for more flexible formatting control. (by @NanoCode012 in #2694)
Efficient Chunked Knowledge Distillation
We've added Liger-style chunking to efficiently calculate Knowledge Distillation (KD) loss and now support online distillation using logprobs from vllm/sglang. (by @winglian in #2700)
📦 Dependency & Build Updates
- Added CI and Docker images for CUDA 12.8 (for B200 GPUs). (by @winglian in #2683)
- Added base images for PyTorch 2.7.1. (by @winglian in #2764, #2784)
- Added support for
uvin base images and test tooling. (by @winglian in #2691, #2750) - Removed the
hqqdependency. (by @NanoCode012 in #2759)
📚 Documentation & Examples
- Added documentation for Group Relative Policy Optimization (GRPO) for RLHF. (by @mhenrichsen in #2748)
- Added new chat templates for Command-R+ and AYA-23 models. (by @hyeobiiii in #2731)
- Fixed the
lora_target_modulessyntax in the Qwen2-VL example config. (by @cummins-orgs in #2793)
🔧 Major fixes
Performance & Stability
- Slow Dataset Processing: Fixed a performance regression where dataset processing was slow by allowing
num_procto be configured. (by @michelyang in #2681) - Sample Packing: Limited the number of processes used by the multipack sampler to prevent resource exhaustion and slowdowns. (by @winglian in #2771)
- DeepSpeed: Fixed issues where distributed state was not initialized correctly ([#2737] by @djsaunde) and where the config was not being set for ZeRO Stage 3 ([#2754] by @NanoCode012).
- RL Training: Resolved an issue where RL plugins could overwrite the trainer class ([#2697] by @NanoCode012) and improved feature parity with base training ([#2133] by @NanoCode012).
- LoRA Kernels: Addressed a bug where a LoRA kernel pre-patch was being applied even when the corresponding post-patch was not. (by @NanoCode012 in #2772)
- Sequence Parallelism refactor: Simplify SP patching to patch accelerate instead of train / eval samplers (by @djsaunde in #2686)
Logging and Usability
- Rank 0-Only Logging: Cleaned up training logs by ensuring that logging only occurs on rank 0 in distributed setups. (by @salmanmohammadi in #2608)
- Suppressed Third-Party Logs: Reduced log spam by suppressing non-Axolotl logs unless they are at the
WARNINGlevel or higher. (by @NanoCode012 in #2724) - Dataset Name Logging: Axolotl will now print the dataset name before processing begins, improving visibility into the data pipeline. (by @xzuyn in #2668)
Full changes
- Add: Sparse Finetuning Integration with llmcompressor by @rahul-tuli in #2479
- fix: remove doc string imports in monkeypatches by @NanoCode012 in #2671
- Add ci and images for CUDA 12.8 for B200s by @winglian in #2683
- Add num_proc to fix data set slow processing issue by @michelyang in #2681
- Add missing init file to liger plugin by @BitPhinix in #2670
- Make Axolotl Print Dataset Name Before Processing by @xzuyn in #2668
- Fix: improve doc on merge/inference cli visibility by @NanoCode012 in #2674
- Fix: Make MLflow config artifact logging respect hf_mlflow_log_artifa… by @C080 in #2675
- Fix for setting
adam_beta3andadam_epsilon2for CAME Optimizer by @xzuyn in #2654 - GRPO fixes (peft) by @winglian in #2676
- SP dataloader patching + removing custom sampler / dataloader logic by @djsaunde in #2686
- feat(doc): clarify minimum pytorch and cuda to use blackwell by @NanoCode012 in #2704
- fix: plugin rl overwriting trainer_cls by @NanoCode012 in #2697
- feat: do not find turn indices if turn is not trainable by @NanoCode012 in #2696
- SP context manager update by @djsaunde in #2699
- Remove unused const by @djsaunde in #2714
- models.py -> loaders/ module refactor by @djsaunde in #2680
- update quarto for model loading refactor by @djsaunde in #2716
- Allow Liger with GraniteMoE by @xzuyn in #2715
- Fix quarto by @djsaunde in #2717
- no need to generate diff file by @djsaunde in #2728
- chore: update pre-commit hooks by @github-actions in #2729
- Fix(doc): clarify data loading for local datasets and splitting samples by @NanoCode012 in #2726
- feat(doc): note lora kernel incompat with RLHF by @NanoCode012 in #2706
- Add chat templates for command-a and aya-23-8B models by @hyeobiiii in #2731
- add two checks to handle legacy format interleaved multimodal ds by @sumo43 in #2721
- Fix Mistral chat template (mistral_v7_tekken) by @mashdragon in #2710
- feat(doc): add info on how to use dapo / dr grpo and misc doc fixes by @NanoCode012 in #2673
- feat(doc): add google analytics to docs by @NanoCode012 in #2708
- QAT by @salmanmohammadi in #2590
- Rank 0-only logging by @salmanmohammadi in #2608
- Lora kernels fix by @djsaunde in #2732
- fix dist state init before deepspeed setup by @djsaunde in #2737
- Add a few items to faq by @winglian in #2734
- Fix: RL base feature parity by @NanoCode012 in #2133
- fix(log): remove duplicate merge_lora param by @NanoCode012 in #2742
- fix: suppress non-axolotl logs unless it's warning or higher by @NanoCode012 in #2724
- add support for base image with uv by @winglian in #2691
- chore: update pre-commit hooks by @github-actions in #2745
- feat: add Group Relative Policy Optimization (GPRO) to RLHF documenta… by @mhenrichsen in #2748
- remove deprecated wandb env var by @djsaunde in #2751
- feat: add chat_template kwargs by @NanoCode012 in #2694
- feat(modal): update docker tag to use torch2.6 from torch2.5 by @NanoCode012 in #2749
- fix(deepspeed): deepspeed config not being set for z3 by @NanoCode012 in #2754
- bump hf deps by @winglian in #2735
- fix: remove hqq by @NanoCode012 in #2759
- remove unused field for chat_template.default for DPO training by @timofey in #2755
- add uv tooling for e2e gpu tests by @winglian in #2750
- add manual seed for flaky test_geglu_backward test by @winglian in #2763
- fix worker_init_fn signature handling by @winglian in #2769
- handle when unable to save optimizer state when using ao optimizer with FSDP by @winglian in #2773
- Fix the bug of position ids padding by @qywu in #2739
- Feat: add tool calling support via tools column by @NanoCode012 in #2774
- magistral small placeholder by @djsaunde in #2777
- Data loader refactor by @djsaunde in #2707
- build base images for torch 2.7.1 by @winglian in #2764
- build 2.7.1 images and ci by @winglian in #2784
- QAT docfix by @salmanmohammadi in #2778
- limit multipack sampler processes by @winglian in #2771
- feat(doc): update readme to include changelog and remove matrix by @NanoCode012 in #2775
- Fix logging import in evaluate.py (#2782) by @JZacaroli in #2783
- update loss value for flakey e2e test by @winglian in #2786
- Feat: Add Magistral and mistral-common tokenizer support by @NanoCode012 in #2780
- support for QAT w RL (DPO) by @winglian in #2776
- fix(doc): change grpo doc link by @NanoCode012 in #2788
- Fix: adding magistral fsdp config, fixing not eval with test_datasets, handle mllama attention by @NanoCode012 in #2789
- Fix: lora kernel pre-patch applied despite post-patch not applied by @NanoCode012 in #2772
- fixed the lora_target_modules syntax inside examples/qwen2-vl/lora-7b.yaml by @cummins-orgs in #2793
- KD fix w/ online distillation by @winglian in #2700
- feat: remove evalfirst callback with built-in trainer arg by @NanoCode012 in #2797
- release tag v0.10.0 by @winglian in #2799
New Contributors
- @rahul-tuli made their first contribution in #2479
- @michelyang made their first contribution in #2681
- @C080 made their first contribution in #2675
- @github-actions made their first contribution in #2729
- @hyeobiiii made their first contribution in #2731
- @sumo43 made their first contribution in #2721
- @timofey made their first contribution in #2755
- @qywu made their first contribution in #2739
- @JZacaroli made their first contribution in #2783
- @cummins-orgs made their first contribution in #2793
Full Changelog: v0.9.2...v0.10.0