Release v0.10.0 · axolotl-ai-cloud/axolotl

🚨 Breaking Changes & Deprecations

PyTorch 2.5 Future Deprecation Notice

Support for torch==2.5.1 will be deprecated in a future release. We now recommend using PyTorch 2.6.0 or higher.

Internal Refactors

The module for loading models has been refactored from axolotl.models to axolotl.loaders. This is an internal change and should not affect most users, but may impact those with custom scripts that import directly from these modules. (#2680)

Updated transformers to `4.52.3`

Transformers has done a major refactor for vision language models and forward functions for other modeling code. There may be loading issues or patches (Liger / CCE) that do not work with this new version until we fix them.

Contributed by @winglian in #2735

🎉 New Features

Sparse Finetuning using LLMCompressor

Using LLMCompressor, the integration allows users to efficiently fine-tune models with structured/unstructured sparsity, recovering 99% accuracy or better for sparse models, and 3X faster inference. Learn how to use it here.

Contributed by @rahul-tuli in #2479.

Quantization-Aware Training (QAT)

QAT simulates quantization during training to achieve higher quality post-training quantized (PTQ) models than from applying PTQ to models trained without QAT. Check the docs.

Contributed by @salmanmohammadi in #2590 and @winglian in #2776.

Mistral Tokenizer and Improved Tool Calling

Mistral Native Tokenizer: We now use mistral-common for an official, robust implementation of the Mistral tokenizer. (by @NanoCode012 in #2780)
Enhanced Tool Calling: Added improved support for chat_template with a dedicated tools column in your dataset, enabling streamlined function-calling models. (by @NanoCode012 in #2774)
chat_template kwargs: Pass additional arguments to your chat templates for more flexible formatting control. (by @NanoCode012 in #2694)

Efficient Chunked Knowledge Distillation

We've added Liger-style chunking to efficiently calculate Knowledge Distillation (KD) loss and now support online distillation using logprobs from vllm/sglang. (by @winglian in #2700)

📦 Dependency & Build Updates

Added CI and Docker images for CUDA 12.8 (for B200 GPUs). (by @winglian in #2683)
Added base images for PyTorch 2.7.1. (by @winglian in #2764, #2784)
Added support for uv in base images and test tooling. (by @winglian in #2691, #2750)
Removed the hqq dependency. (by @NanoCode012 in #2759)

📚 Documentation & Examples

Added documentation for Group Relative Policy Optimization (GRPO) for RLHF. (by @mhenrichsen in #2748)
Added new chat templates for Command-R+ and AYA-23 models. (by @hyeobiiii in #2731)
Fixed the lora_target_modules syntax in the Qwen2-VL example config. (by @cummins-orgs in #2793)

🔧 Major fixes

Performance & Stability

Slow Dataset Processing: Fixed a performance regression where dataset processing was slow by allowing num_proc to be configured. (by @michelyang in #2681)
Sample Packing: Limited the number of processes used by the multipack sampler to prevent resource exhaustion and slowdowns. (by @winglian in #2771)
DeepSpeed: Fixed issues where distributed state was not initialized correctly ([#2737] by @djsaunde) and where the config was not being set for ZeRO Stage 3 ([#2754] by @NanoCode012).
RL Training: Resolved an issue where RL plugins could overwrite the trainer class ([#2697] by @NanoCode012) and improved feature parity with base training ([#2133] by @NanoCode012).
LoRA Kernels: Addressed a bug where a LoRA kernel pre-patch was being applied even when the corresponding post-patch was not. (by @NanoCode012 in #2772)
Sequence Parallelism refactor: Simplify SP patching to patch accelerate instead of train / eval samplers (by @djsaunde in #2686)

Logging and Usability

Rank 0-Only Logging: Cleaned up training logs by ensuring that logging only occurs on rank 0 in distributed setups. (by @salmanmohammadi in #2608)
Suppressed Third-Party Logs: Reduced log spam by suppressing non-Axolotl logs unless they are at the WARNING level or higher. (by @NanoCode012 in #2724)
Dataset Name Logging: Axolotl will now print the dataset name before processing begins, improving visibility into the data pipeline. (by @xzuyn in #2668)

Full changes

Add: Sparse Finetuning Integration with llmcompressor by @rahul-tuli in #2479
fix: remove doc string imports in monkeypatches by @NanoCode012 in #2671
Add ci and images for CUDA 12.8 for B200s by @winglian in #2683
Add num_proc to fix data set slow processing issue by @michelyang in #2681
Add missing init file to liger plugin by @BitPhinix in #2670
Make Axolotl Print Dataset Name Before Processing by @xzuyn in #2668
Fix: improve doc on merge/inference cli visibility by @NanoCode012 in #2674
Fix: Make MLflow config artifact logging respect hf_mlflow_log_artifa… by @C080 in #2675
Fix for setting adam_beta3 and adam_epsilon2 for CAME Optimizer by @xzuyn in #2654
GRPO fixes (peft) by @winglian in #2676
SP dataloader patching + removing custom sampler / dataloader logic by @djsaunde in #2686
feat(doc): clarify minimum pytorch and cuda to use blackwell by @NanoCode012 in #2704
fix: plugin rl overwriting trainer_cls by @NanoCode012 in #2697
feat: do not find turn indices if turn is not trainable by @NanoCode012 in #2696
SP context manager update by @djsaunde in #2699
Remove unused const by @djsaunde in #2714
models.py -> loaders/ module refactor by @djsaunde in #2680
update quarto for model loading refactor by @djsaunde in #2716
Allow Liger with GraniteMoE by @xzuyn in #2715
Fix quarto by @djsaunde in #2717
no need to generate diff file by @djsaunde in #2728
chore: update pre-commit hooks by @github-actions in #2729
Fix(doc): clarify data loading for local datasets and splitting samples by @NanoCode012 in #2726
feat(doc): note lora kernel incompat with RLHF by @NanoCode012 in #2706
Add chat templates for command-a and aya-23-8B models by @hyeobiiii in #2731
add two checks to handle legacy format interleaved multimodal ds by @sumo43 in #2721
Fix Mistral chat template (mistral_v7_tekken) by @mashdragon in #2710
feat(doc): add info on how to use dapo / dr grpo and misc doc fixes by @NanoCode012 in #2673
feat(doc): add google analytics to docs by @NanoCode012 in #2708
QAT by @salmanmohammadi in #2590
Rank 0-only logging by @salmanmohammadi in #2608
Lora kernels fix by @djsaunde in #2732
fix dist state init before deepspeed setup by @djsaunde in #2737
Add a few items to faq by @winglian in #2734
Fix: RL base feature parity by @NanoCode012 in #2133
fix(log): remove duplicate merge_lora param by @NanoCode012 in #2742
fix: suppress non-axolotl logs unless it's warning or higher by @NanoCode012 in #2724
add support for base image with uv by @winglian in #2691
chore: update pre-commit hooks by @github-actions in #2745
feat: add Group Relative Policy Optimization (GPRO) to RLHF documenta… by @mhenrichsen in #2748
remove deprecated wandb env var by @djsaunde in #2751
feat: add chat_template kwargs by @NanoCode012 in #2694
feat(modal): update docker tag to use torch2.6 from torch2.5 by @NanoCode012 in #2749
fix(deepspeed): deepspeed config not being set for z3 by @NanoCode012 in #2754
bump hf deps by @winglian in #2735
fix: remove hqq by @NanoCode012 in #2759
remove unused field for chat_template.default for DPO training by @timofey in #2755
add uv tooling for e2e gpu tests by @winglian in #2750
add manual seed for flaky test_geglu_backward test by @winglian in #2763
fix worker_init_fn signature handling by @winglian in #2769
handle when unable to save optimizer state when using ao optimizer with FSDP by @winglian in #2773
Fix the bug of position ids padding by @qywu in #2739
Feat: add tool calling support via tools column by @NanoCode012 in #2774
magistral small placeholder by @djsaunde in #2777
Data loader refactor by @djsaunde in #2707
build base images for torch 2.7.1 by @winglian in #2764
build 2.7.1 images and ci by @winglian in #2784
QAT docfix by @salmanmohammadi in #2778
limit multipack sampler processes by @winglian in #2771
feat(doc): update readme to include changelog and remove matrix by @NanoCode012 in #2775
Fix logging import in evaluate.py (#2782) by @JZacaroli in #2783
update loss value for flakey e2e test by @winglian in #2786
Feat: Add Magistral and mistral-common tokenizer support by @NanoCode012 in #2780
support for QAT w RL (DPO) by @winglian in #2776
fix(doc): change grpo doc link by @NanoCode012 in #2788
Fix: adding magistral fsdp config, fixing not eval with test_datasets, handle mllama attention by @NanoCode012 in #2789
Fix: lora kernel pre-patch applied despite post-patch not applied by @NanoCode012 in #2772
fixed the lora_target_modules syntax inside examples/qwen2-vl/lora-7b.yaml by @cummins-orgs in #2793
KD fix w/ online distillation by @winglian in #2700
feat: remove evalfirst callback with built-in trainer arg by @NanoCode012 in #2797
release tag v0.10.0 by @winglian in #2799

New Contributors

@rahul-tuli made their first contribution in #2479
@michelyang made their first contribution in #2681
@C080 made their first contribution in #2675
@github-actions made their first contribution in #2729
@hyeobiiii made their first contribution in #2731
@sumo43 made their first contribution in #2721
@timofey made their first contribution in #2755
@qywu made their first contribution in #2739
@JZacaroli made their first contribution in #2783
@cummins-orgs made their first contribution in #2793

Full Changelog: v0.9.2...v0.10.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v0.10.0

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

🚨 Breaking Changes & Deprecations

PyTorch 2.5 Future Deprecation Notice

Internal Refactors

Updated transformers to `4.52.3`

🎉 New Features

Sparse Finetuning using LLMCompressor

Quantization-Aware Training (QAT)

Mistral Tokenizer and Improved Tool Calling

Efficient Chunked Knowledge Distillation

📦 Dependency & Build Updates

📚 Documentation & Examples

🔧 Major fixes

Performance & Stability

Logging and Usability

Full changes

New Contributors

Contributors

Uh oh!

Uh oh!

v0.10.0

🚨 Breaking Changes & Deprecations

PyTorch 2.5 Future Deprecation Notice

Internal Refactors

Updated transformers to 4.52.3

🎉 New Features

Sparse Finetuning using LLMCompressor

Quantization-Aware Training (QAT)

Mistral Tokenizer and Improved Tool Calling

Efficient Chunked Knowledge Distillation

📦 Dependency & Build Updates

📚 Documentation & Examples

🔧 Major fixes

Performance & Stability

Logging and Usability

Full changes

New Contributors

Contributors

Uh oh!

Updated transformers to `4.52.3`