Release v0.11.0 · axolotl-ai-cloud/axolotl

🚨 Breaking Changes

Upstream Patches for CCE, Phi3, Phi4

Our Cut-Cross-Entropy (CCE) patches have been moved to a dedicated upstream fork. This improves maintainability, makes it easier for the community to contribute new patches, and allows reuse across projects. This update includes:

Updates to support transformers>=4.52.4.
New patches for phi3 and phi4_multimodal.
All patches have been sanity-tested for reliability.

Please make sure to install CCE from our fork instead. We recommend using the script provided in the repo:

python scripts/cutcrossentropy_install.py | sh

Contributed by @NanoCode012 in #2813.

Dropped support for PyTorch 2.5.1 and vLLM installation requirements

As PyTorch 2.8.0 is slated to be released later this month, we have now dropped support for 2.5.1. We recommend using torch==2.7.0 or 2.7.1.

Docker images with main-latest tags now default to torch 2.7.1.

vLLM is no longer included in Docker images for torch==2.6.0. The vLLM wheels use the incorrect ABI for 2.6.0, and the last vLLM version that supports torch 2.6.0 is 0.8.5.post1. See vllm-project/vllm#13608 for more details.
Similarly, vLLM is only included in the torch==2.7.0 images, as it is pinned to that particular version and 2.7.1 support is still in review.
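
The version guidance in this section can be summarized as a small lookup. The following is a minimal sketch and not part of Axolotl: the helper names `parse_torch_version`, `classify_torch`, and `image_includes_vllm` are hypothetical, the tables only encode what these notes state, and treating every version up to 2.5.1 as dropped is an assumption.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int, int]

# Facts taken from these release notes; the helpers themselves are hypothetical.
LAST_DROPPED: Version = (2, 5, 1)      # support dropped in v0.11.0
RECOMMENDED = {(2, 7, 0), (2, 7, 1)}   # versions the notes recommend
VLLM_IN_IMAGE = {(2, 6, 0): False, (2, 7, 0): True, (2, 7, 1): False}


def parse_torch_version(version: str) -> Version:
    """Extract (major, minor, patch) from strings such as '2.7.1+cu126'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def classify_torch(version: str) -> str:
    """Label a torch version as 'dropped', 'recommended', or 'other'."""
    parsed = parse_torch_version(version)
    # Assumption: anything up to and including 2.5.1 counts as dropped.
    if parsed <= LAST_DROPPED:
        return "dropped"
    if parsed in RECOMMENDED:
        return "recommended"
    return "other"


def image_includes_vllm(version: str) -> Optional[bool]:
    """Whether the Docker image bundles vLLM; None if these notes do not say."""
    return VLLM_IN_IMAGE.get(parse_torch_version(version))
```

For example, `classify_torch("2.5.1+cu124")` returns `"dropped"`, and `image_includes_vllm("2.7.1")` returns `False`.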

🎉 New features

Added Chunked cross entropy loss

We've introduced chunked_cross_entropy as an alternative to the default trainer loss function. This can help reduce peak memory usage during training, especially for models with large vocabularies.
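
To illustrate the general idea, here is a minimal pure-Python sketch. It is not Axolotl's implementation: the function names are hypothetical, and it assumes the memory saving comes from processing only one slice of tokens at a time while accumulating the summed loss. The result matches the unchunked mean.

```python
import math
from typing import Sequence


def cross_entropy_sum(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Sum of per-token negative log-likelihoods, using a stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logits, targets):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[target]
    return total


def chunked_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    chunk_size: int = 2,
) -> float:
    """Mean cross entropy, accumulated over chunks of `chunk_size` tokens."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    if len(logits) != len(targets) or not targets:
        raise ValueError("logits and targets must be non-empty and equally long")
    total = 0.0
    for start in range(0, len(targets), chunk_size):
        end = start + chunk_size
        # Only this slice is processed at once; a tensor implementation would
        # hold its intermediate buffers for one chunk rather than all tokens.
        total += cross_entropy_sum(logits[start:end], targets[start:end])
    return total / len(targets)
```

Calling `chunked_cross_entropy(logits, targets, chunk_size=len(targets))` gives the unchunked reference value, which smaller chunk sizes reproduce up to floating-point rounding.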

Contributed by @winglian in #2625.

Added Support for Falcon-h1

You can now fine-tune models from the Falcon-h1 family. Run one of the example configs.

Contributed by @younesbelkada in #2811.

Added Support for Devstral Small

It is now possible to fine-tune Devstral models in Axolotl. Give it a try following our docs.

Contributed by @NanoCode012 in #2880.

TiledMLP support

TiledMLP, introduced by Arctic Long Sequence Training, reduces the activation memory footprint of long sequences in the MLP modules.

It currently works only with DeepSpeed ZeRO-1 through ZeRO-3; single GPU, DDP, and FSDP are not yet supported. Enable it with tiled_mlp: true. See the linked PR for more information.
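
The tiling idea can be sketched in plain Python. This is an illustration only, not the Arctic Long Sequence Training or Axolotl implementation: the names `mlp_block` and `tiled_mlp` are hypothetical, and the sketch assumes a simple ReLU MLP. Because the MLP acts on each token independently, the sequence can be split into tiles and processed one tile at a time with identical results.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Compute x @ w for a row vector x and a matrix w stored as a list of rows."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def mlp_block(tokens: Matrix, w_up: Matrix, w_down: Matrix) -> Matrix:
    """Position-wise MLP: ReLU(x @ w_up) @ w_down for every token."""
    outputs = []
    for x in tokens:
        hidden = [max(0.0, h) for h in matvec(x, w_up)]
        outputs.append(matvec(hidden, w_down))
    return outputs


def tiled_mlp(tokens: Matrix, w_up: Matrix, w_down: Matrix, tile_size: int) -> Matrix:
    """Apply the MLP tile by tile along the sequence dimension."""
    if tile_size < 1:
        raise ValueError("tile_size must be positive")
    outputs: Matrix = []
    for start in range(0, len(tokens), tile_size):
        # In a tensor implementation, the wide hidden activations would exist
        # for this tile only, not for the whole sequence at once.
        outputs.extend(mlp_block(tokens[start:start + tile_size], w_up, w_down))
    return outputs
```

With any positive `tile_size`, `tiled_mlp(tokens, w_up, w_down, tile_size)` returns the same values as `mlp_block(tokens, w_up, w_down)`; only the amount of work held in flight at once changes.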

Contributed by @winglian in #2811.

DenseMixer integration

DenseMixer is a MoE post-training method that improves router gradient estimation during training. Read our docs to learn more.

Contributed by @winglian in #2868.

Flexible Evaluation Sequence Length

You can now set a different eval_sequence_len in your config. This allows you to train with one sequence length but run evaluations on a longer or shorter one, providing more flexibility for testing model capabilities.

Contributed by @winglian in #2836.

Improved Merge LoRA on CPU for DPO

The --lora-on-cpu flag now correctly moves LoRA adapters to the CPU, even for DPO. This is useful for saving VRAM when merging LoRA adapters on machines with limited GPU memory.

Contributed by @kallewoof in #2766.

Other Feature Enhancements

Log Configuration on Startup: Axolotl now logs the full, resolved configuration at the start of every run, making it much easier to verify your settings. (by @djsaunde in #2819)
chat_template kwargs: Restored the ability to pass additional arguments to your chat templates for more flexible formatting. (by @NanoCode012 in #2837)
Jinja2 template paths: Added support for passing Jinja2 template paths to chat_template_jinja and for re-formatting string templates into files. (by @winglian in #2795)

📦 Dependency Updates

flash-attn upgraded to 2.8.0.post2. (by @winglian in #2828)
accelerate upgraded to 1.8.1 and bitsandbytes to 0.46.0. (by @winglian in #2815)
mistral-common upgraded to 1.6.3 to fix multiprocessing pickling issues. (by @NanoCode012 in #2790)
transformers upgraded to 4.53.1. (by @winglian in #2844)

🔧 Major fixes

Reduced Startup Time for Sample Packing

A regression in a prior PR caused the trainer to start too many new processes when packing was enabled, which slowed startup. This has been fixed.

Contributed by @winglian in #2830.

Distributed Training Fixes

DeepSpeed Initialization: Resolved an issue where DeepSpeed would fail to initialize correctly after a recent refactor. (by @djsaunde in #2820)
Sequence Parallelism VRAM: Addressed a high VRAM usage issue when using Sequence Parallelism with RL trainers. (by @djsaunde in #2829)
FSDP / Device Mesh: Ensured that device mesh patching is correctly applied for FSDP training. (by @djsaunde in #2842)

Iterable Dataset Fixes

Resolved critical bugs affecting iterable datasets, improving their stability and usability:

Fixed pickling errors that would prevent training from resuming. (by @winglian in #2831)
Fixed a failure during preprocessing when sampling from an iterable dataset. (by @winglian in #2825)

Tokenization Stall Fixes

Fixed a stall with single long datasets that caused tokenization to take hours. (by @NanoCode012 in #2845)

General Stability Fixes

Gemma3: Re-added the Gemma3 loss patch that was inadvertently removed, fixing training for these models. (by @NanoCode012 in #2817)
Train Sampler: Fixed a 'NoneType' object has no attribute 'column_names' error that could occur with the train data sampler. (by @NanoCode012 in #2822)
Packing: Added an assertion to the packing patch to prevent silent failures. (by @winglian in #2840)

Other Improvements

fix: catch httperror from ratelimiting hf when checking user token by @NanoCode012 in #2827
chore: update pre-commit hooks by @github-actions in #2821
fix(doc): default messages example used wrong key by @NanoCode012 in #2832
feat: replace old colab notebook with newer one by @NanoCode012 in #2838
set a different triton cache for each test to avoid blocking writes to cache by @winglian in #2843
feat(doc): update docker tag examples by @NanoCode012 in #2851
fix nightlies to use correct cache by @winglian in #2848
build fa2 from source for base image with torch2.6 and cu124 by @winglian in #2867
respect shuffle_merged_datasets for single dataset too by @winglian in #2866
don't use tokenizer parallelism when using packing by @winglian in #2862
Fix: do not call preprocess in multimodal or pretraining case by @NanoCode012 in #2861
setup defaults for dataloader to ensure GPU is kept busy by @winglian in #2632
use latest version of cce fork for SP fix by @winglian in #2871
feat(doc): add vllm and fa2 incompat error to faq by @NanoCode012 in #2877
mark flaky geglu tests and add torch seed by @winglian in #2876
chore: update pre-commit hooks by @github-actions in #2870
Fix link in FSDP + QLoRA docs by @float-trip in #2879
fix: set add_generation_prompt to False when apply chat template for multimodal by @NanoCode012 in #2859
chore: update cce commit to include gemma3n fixes by @NanoCode012 in #2881
Feat: add devstral model support by @NanoCode012 in #2880
add 2.7.0 torch images back to support vllm by @winglian in #2885
Print slowest durations for tests by @salmanmohammadi in #2887
fix xformers version by @winglian in #2888
release v0.11.0 by @winglian in #2875

New Contributors

@float-trip made their first contribution in #2879

Full Changelog: v0.10.1...v0.11.0
