Skip to content

v0.10.0

Choose a tag to compare

@github-actions github-actions released this 17 Jun 16:13
· 637 commits to main since this release
88c0e8d

🚨 Breaking Changes & Deprecations

PyTorch 2.5 Future Deprecation Notice

  • Support for torch==2.5.1 will be deprecated in a future release. We now recommend using PyTorch 2.6.0 or higher.

Internal Refactors

  • The module for loading models has been refactored from axolotl.models to axolotl.loaders. This is an internal change and should not affect most users, but may impact those with custom scripts that import directly from these modules. (#2680)

Updated transformers to 4.52.3

Transformers has done a major refactor for vision language models and forward functions for other modeling code. There may be loading issues or patches (Liger / CCE) that do not work with this new version until we fix them.

🎉 New Features

Sparse Finetuning using LLMCompressor

Using LLMCompressor, the integration allows users to efficiently fine-tune models with structured/unstructured sparsity, recovering 99% accuracy or better for sparse models, and 3X faster inference. Learn how to use it here.

Quantization-Aware Training (QAT)

QAT simulates quantization during training to achieve higher quality post-training quantized (PTQ) models than from applying PTQ to models trained without QAT. Check the docs.

Mistral Tokenizer and Improved Tool Calling

  • Mistral Native Tokenizer: We now use mistral-common for an official, robust implementation of the Mistral tokenizer. (by @NanoCode012 in #2780)
  • Enhanced Tool Calling: Added improved support for chat_template with a dedicated tools column in your dataset, enabling streamlined function-calling models. (by @NanoCode012 in #2774)
  • chat_template kwargs: Pass additional arguments to your chat templates for more flexible formatting control. (by @NanoCode012 in #2694)

Efficient Chunked Knowledge Distillation

We've added Liger-style chunking to efficiently calculate Knowledge Distillation (KD) loss and now support online distillation using logprobs from vllm/sglang. (by @winglian in #2700)

📦 Dependency & Build Updates

📚 Documentation & Examples

  • Added documentation for Group Relative Policy Optimization (GRPO) for RLHF. (by @mhenrichsen in #2748)
  • Added new chat templates for Command-R+ and AYA-23 models. (by @hyeobiiii in #2731)
  • Fixed the lora_target_modules syntax in the Qwen2-VL example config. (by @cummins-orgs in #2793)

🔧 Major fixes

Performance & Stability

  • Slow Dataset Processing: Fixed a performance regression where dataset processing was slow by allowing num_proc to be configured. (by @michelyang in #2681)
  • Sample Packing: Limited the number of processes used by the multipack sampler to prevent resource exhaustion and slowdowns. (by @winglian in #2771)
  • DeepSpeed: Fixed issues where distributed state was not initialized correctly ([#2737] by @djsaunde) and where the config was not being set for ZeRO Stage 3 ([#2754] by @NanoCode012).
  • RL Training: Resolved an issue where RL plugins could overwrite the trainer class ([#2697] by @NanoCode012) and improved feature parity with base training ([#2133] by @NanoCode012).
  • LoRA Kernels: Addressed a bug where a LoRA kernel pre-patch was being applied even when the corresponding post-patch was not. (by @NanoCode012 in #2772)
  • Sequence Parallelism refactor: Simplify SP patching to patch accelerate instead of train / eval samplers (by @djsaunde in #2686)

Logging and Usability

  • Rank 0-Only Logging: Cleaned up training logs by ensuring that logging only occurs on rank 0 in distributed setups. (by @salmanmohammadi in #2608)
  • Suppressed Third-Party Logs: Reduced log spam by suppressing non-Axolotl logs unless they are at the WARNING level or higher. (by @NanoCode012 in #2724)
  • Dataset Name Logging: Axolotl will now print the dataset name before processing begins, improving visibility into the data pipeline. (by @xzuyn in #2668)

Full changes

New Contributors

Full Changelog: v0.9.2...v0.10.0