v0.11.0
🚨 Breaking Changes
Upstream Patches for CCE, Phi3, Phi4
Our Cut-Cross-Entropy (CCE) patches have been moved to a dedicated upstream fork. This improves maintainability, makes it easier for the community to contribute new patches, and allows re-use across projects. This update includes:
- Updates to support transformers>=4.52.4.
- New patches for phi3 and phi4_multimodal.
- All patches have been sanity-tested for reliability.
Please make sure to install from our fork instead. We recommend using the provided script in the repo:
python scripts/cutcrossentropy_install.py | sh
- Contributed by @NanoCode012 in #2813.
Dropped support for PyTorch 2.5.1 and vLLM installation requirements
As PyTorch 2.8.0 is slated to be released later this month, we have now dropped support for 2.5.1. We recommend using torch==2.7.0 or 2.7.1.
Docker images now default to torch 2.7.1 when using main-latest tags.
vLLM is no longer included in Docker images for torch==2.6.0. The vLLM wheels use the incorrect ABI for 2.6.0, and the last version of vLLM that supports torch 2.6.0 is 0.8.5.post1. See vllm-project/vllm#13608 for more details.
Similarly, vLLM is only included in the torch==2.7.0 images, as it is pinned to that particular version and 2.7.1 support is still in review.
🎉 New features
Added Chunked cross entropy loss
We've introduced chunked_cross_entropy as an alternative to the default trainer loss function. This can help reduce peak memory usage during training, especially for models with large vocabularies.
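To illustrate why chunking helps, here is a minimal pure-Python sketch of the general idea. It is not Axolotl's implementation: the function names `cross_entropy_sum` and `chunked_cross_entropy` below are hypothetical and used only for illustration. The point is that the mean loss over all tokens can be accumulated one slice of tokens at a time, so only one slice's worth of vocabulary-sized intermediates needs to exist at once.

```python
import math


def cross_entropy_sum(logits, targets):
    """Sum of per-token negative log-likelihoods for a list of logit rows."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total


def chunked_cross_entropy(logits, targets, num_chunks=2):
    """Mean cross entropy, accumulated one chunk of tokens at a time.

    Hypothetical helper for illustration only; assumes at least one token.
    """
    n = len(targets)
    chunk_size = math.ceil(n / num_chunks)
    total = 0.0
    for start in range(0, n, chunk_size):
        end = start + chunk_size
        # Only this slice is processed at once. In a real trainer, this is
        # where the per-chunk intermediates would be created and released.
        total += cross_entropy_sum(logits[start:end], targets[start:end])
    return total / n


# Toy data: 4 tokens, vocabulary of 3.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [-0.5, 1.5, 0.0], [1.0, 1.0, 1.0]]
targets = [0, 2, 1, 1]

full = cross_entropy_sum(logits, targets) / len(targets)
chunked = chunked_cross_entropy(logits, targets, num_chunks=2)
print(abs(full - chunked) < 1e-9)
```

The chunked result matches the unchunked mean up to floating-point rounding, which is why chunking can reduce peak memory without changing the training objective. The savings grow with vocabulary size, since each token's intermediate has one entry per vocabulary item.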
Added Support for Falcon-h1
You can now fine-tune models from the Falcon-h1 family. Run one of the example configs.
- Contributed by @younesbelkada in #2811.
Added Support for Devstral Small
It is now possible to fine-tune Devstral models in Axolotl. Give it a try following our docs.
- Contributed by @NanoCode012 in #2880.
TiledMLP support
TiledMLP, authored by Arctic Long Sequence Training, reduces the activation footprint of long sequences in the MLP modules.
This currently works only with DeepSpeed ZeRO-1 through ZeRO-3; single GPU, DDP, and FSDP are not supported. Enable it via tiled_mlp: true. Follow the linked PR for more info.
DenseMixer integration
DenseMixer is a MoE post-training method that improves router gradient estimation in MoE training. Read our docs to learn more.
Flexible Evaluation Sequence Length
You can now set a different eval_sequence_len in your config. This allows you to train with one sequence length but run evaluations on a longer or shorter one, providing more flexibility for testing model capabilities.
Improved Merge LoRA on CPU for DPO
The --lora-on-cpu flag now correctly moves LoRA adapters to CPU, even for DPO. This is useful for saving VRAM when merging LoRA adapters on machines with limited GPU memory.
- Contributed by @kallewoof in #2766.
Other Feature Enhancements
- Log Configuration on Startup: Axolotl now logs the full, resolved configuration at the start of every run, making it much easier to verify your settings. (by @djsaunde in #2819)
- chat_template kwargs: Restored the ability to pass additional arguments to your chat templates for more flexible formatting. (by @NanoCode012 in #2837)
- Support Jinja2 template paths to chat_template_jinja and re-formatting string templates to files (by @winglian in #2795)
📦 Dependency Updates
- flash-attn upgraded to 2.8.0.post2. (by @winglian in #2828)
- accelerate upgraded to 1.8.1 and bitsandbytes to 0.46.0. (by @winglian in #2815)
- mistral-common upgraded to 1.6.3 to fix multiprocessing pickling issues. (by @NanoCode012 in #2790)
- transformers upgraded to 4.53.1. (by @winglian in #2844)
🔧 Major fixes
Reduced Startup Time for Sample Packing
A regression in a prior PR caused the trainer to start too many new processes when packing was enabled, which made it take longer to start. This has been fixed.
Distributed Training Fixes
- DeepSpeed Initialization: Resolved an issue where DeepSpeed would fail to initialize correctly after a recent refactor. (by @djsaunde in #2820)
- Sequence Parallelism VRAM: Addressed a high VRAM usage issue when using Sequence Parallelism with RL trainers. (by @djsaunde in #2829)
- FSDP / Device Mesh: Ensured that device mesh patching is correctly applied for FSDP training. (by @djsaunde in #2842)
Iterable Dataset Fixes
Resolved critical bugs affecting iterable datasets, improving their stability and usability:
- Fixed pickling errors that would prevent training from resuming. (by @winglian in #2831)
- Fixed a failure during preprocessing when sampling from an iterable dataset. (by @winglian in #2825)
Tokenization Stall Fixes
Addressed a tokenization stall with single long datasets that caused tokenization to take hours. (by @NanoCode012 in #2845)
General Stability Fixes
- Gemma3: Re-added the Gemma3 loss patch that was inadvertently removed, fixing training for these models. (#2817 by @NanoCode012)
- Train Sampler: Fixed a 'NoneType' object has no attribute 'column_names' error that could occur with the train data sampler. (#2822 by @NanoCode012)
- Packing: Added an assertion to the packing patch to prevent silent failures. (by @winglian in #2840)
Other Improvements
- fix: catch httperror from ratelimiting hf when checking user token by @NanoCode012 in #2827
- chore: update pre-commit hooks by @github-actions in #2821
- fix(doc): default messages example used wrong key by @NanoCode012 in #2832
- feat: replace old colab notebook with newer one by @NanoCode012 in #2838
- set a different triton cache for each test to avoid blocking writes to cache by @winglian in #2843
- feat(doc): update docker tag examples by @NanoCode012 in #2851
- fix nightlies to use correct cache by @winglian in #2848
- build fa2 from source for base image with torch2.6 and cu124 by @winglian in #2867
- respect shuffle_merged_datasets for single dataset too by @winglian in #2866
- don't use tokenizer parallelism when using packing by @winglian in #2862
- Fix: do not call preprocess in multimodal or pretraining case by @NanoCode012 in #2861
- setup defaults for dataloader to ensure GPU is kept busy by @winglian in #2632
- use latest version of cce fork for SP fix by @winglian in #2871
- feat(doc): add vllm and fa2 incompat error to faq by @NanoCode012 in #2877
- mark flaky geglu tests and add torch seed by @winglian in #2876
- chore: update pre-commit hooks by @github-actions in #2870
- Fix link in FSDP + QLoRA docs by @float-trip in #2879
- fix: set add_generation_prompt to False when apply chat template for multimodal by @NanoCode012 in #2859
- chore: update cce commit to include gemma3n fixes by @NanoCode012 in #2881
- Feat: add devstral model support by @NanoCode012 in #2880
- add 2.7.0 torch images back to support vllm by @winglian in #2885
- Print slowest durations for tests by @salmanmohammadi in #2887
- fix xformers version by @winglian in #2888
- release v0.11.0 by @winglian in #2875
New Contributors
- @float-trip made their first contribution in #2879
Full Changelog: v0.10.1...v0.11.0