fsspec integration for remote filesystem support. Checkpoints can be saved to and loaded from S3 via --checkpoint-dir s3://bucket/path/. Requires s3fs. (#1126)
New GlobalFileSystem replaces LocalFileSystem as default, dispatching to the appropriate backend based on URI scheme. (#1126)
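Scheme-based dispatch of this kind can be illustrated with a minimal sketch. This is not fairseq2's implementation: `BACKENDS` and `pick_backend` are hypothetical names, and the mapping is an assumption chosen only to show how a path such as `s3://bucket/path/` could be routed differently from a plain local path.

```python
from urllib.parse import urlparse

# Hypothetical scheme-to-backend table; the real GlobalFileSystem may differ.
BACKENDS = {
    "": "local",      # plain paths such as /checkpoints/run1 have no scheme
    "file": "local",  # explicit file:// URIs
    "s3": "s3fs",     # s3:// URIs, served by the s3fs package
}


def pick_backend(uri: str) -> str:
    """Return the backend name registered for the URI's scheme."""
    scheme = urlparse(uri).scheme
    try:
        return BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"no filesystem backend registered for scheme {scheme!r}")
```

Note that the sketch does not handle Windows drive letters, which `urlparse` would report as a one-letter scheme.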
PyTorch 2.9.1 and 2.10 (forward compatibility) are now supported. PyTorch 2.9 introduced breaking changes to LR scheduler return types, which have been addressed. (#1477, #1491, #1456)
New context managers for procedural programming: GangContext, DeviceContext, DataTypeContext, current_dtype. These eliminate the need to pass state through nested function calls. (#1474, #1473, #1464)
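The general pattern behind such context managers can be sketched with the standard library's `contextvars`. The names below (`dtype_scope`, `build_layer`, `build_model`) are hypothetical and do not reflect the signatures of GangContext, DeviceContext, or DataTypeContext; the sketch only shows how ambient state removes the need for an extra argument at every call level.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical ambient setting; a string stands in for a real dtype object.
_active_dtype: ContextVar[str] = ContextVar("active_dtype", default="float32")


@contextmanager
def dtype_scope(dtype: str):
    """Set the ambient dtype for the duration of the with-block."""
    token = _active_dtype.set(dtype)
    try:
        yield
    finally:
        # Restore the previous value even if the block raises.
        _active_dtype.reset(token)


def build_layer() -> str:
    # Reads the ambient value instead of taking a dtype parameter.
    return f"layer[{_active_dtype.get()}]"


def build_model() -> list[str]:
    # No dtype argument has to be threaded through this intermediate call.
    return [build_layer(), build_layer()]
```

Inside `with dtype_scope("bfloat16"):`, every nested `build_layer()` call sees `bfloat16`; once the block exits, the default applies again.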
CheckpointManager, Optimizer, and LRScheduler now exposed in RecipeContext. (#1461)
Synchronous asset loading across ranks for models and tokenizers. Use when all ranks need identical assets loaded simultaneously. (#1429, #1426)
CheckpointManager.register_save_hook allows custom logic during checkpoint saves. (#1439)
Config files now support ${env:<NAME>} to interpolate environment variables. (#1435)
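As an illustration of what `${env:<NAME>}` interpolation means, here is a minimal sketch written with the standard library. It is not the fairseq2 implementation: `interpolate_env` is a hypothetical helper, and both the accepted name syntax and the choice to raise an error for an unset variable are assumptions.

```python
import os
import re

# Matches ${env:NAME}, where NAME is assumed to be a conventional variable name.
_ENV_PATTERN = re.compile(r"\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate_env(text: str, environ=None) -> str:
    """Replace each ${env:NAME} in text with the value of variable NAME."""
    env = os.environ if environ is None else environ

    def _replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            # Assumption: an unset variable is an error rather than an empty string.
            raise KeyError(f"environment variable {name!r} is not set")
        return env[name]

    return _ENV_PATTERN.sub(_replace, text)
```

For example, with `OUT=/data` set, the string `dir: ${env:OUT}/ckpt` would become `dir: /data/ckpt`.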
--no-rich CLI flag disables rich text output, which makes logs easier to parse. (#1421)
Hugging Face export now runs in an isolated process, with the command line and logs saved for debugging. (#1459, #1458, #1437, #1434)
Improved support for gated Hugging Face models. (#1422)
get_family utility functions for detecting model families. (#1454)
Gemma3n model family (E2B/E4B) with text + audio inference and SFT training. (#1496)
Generic HuggingFace model integration: load, shard, and train any HuggingFace CausalLM model directly through HgCausalLMAdapter without requiring a native fairseq2 reimplementation. Includes FSDP sharding, HF tokenizer integration, and SFT recipe support. (#1479)
Fixed cross_entropy with reduction="mean" to properly exclude padding tokens from the denominator. (#1455)
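The effect of the denominator can be shown with a small pure-Python sketch. This is not fairseq2's `cross_entropy`: `mean_cross_entropy` is a hypothetical function, and returning 0.0 when every target is padding is an assumption. The point is only that the sum of per-token losses is divided by the number of non-padding tokens, not by the total sequence length.

```python
import math


def mean_cross_entropy(logits, targets, pad_idx):
    """Mean negative log-likelihood over the non-padding targets only."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == pad_idx:
            continue  # padding adds nothing to the numerator or the denominator
        # Numerically stable log-sum-exp over the row of logits.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    # Assumption: an all-padding batch yields a loss of 0.0.
    return total / count if count else 0.0
```

With two real tokens that each have loss `log 2` and one padding token, the result is `log 2`. Dividing by all three positions would instead give `(2/3) log 2`, understating the loss.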
Fixed Flash3SDPA to support the flash-attn-3 v3.0.0 package API (flash_attn_3._C / torch.ops.flash_attn_3) in addition to the legacy flash_attn_3_cuda module. (#1495)
Fixed data pipeline sampling bug when allow_repeats=False with many pipelines. (#1471)