v2.2.0
What's new
Added 🎉
- Added option to set LR scheduler based on tokens instead of steps (e.g. `--train_module.scheduler.units=tokens`).
- Added a "packed" numpy FSL variant that packs documents into sequences using the best-fit-decreasing bin packing algorithm following the work from Fewer Truncates Improve Language Modeling.
- Added module `olmo_core.testing`.
- Added an "interleaved" numpy FSL variant that interleaves several documents into sequences following the work from LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models.
- Added sliding window attention as a feature
- Added `BatchSizeSchedulerCallback` for setting a batch size schedule over the course of a training run.
- Added optional `TrainModule` method, `.pre_train()`, which runs right after `Callback.pre_train()`.
- The `BeakerCallback` will save the config and Python requirements to the results dataset.
- Added `from_file` method to `Config` class.
- Added in-loop evals for OLMES basic skills eval
- Added in-loop fast MCQA for in-loop evals and translated MBPP tasks
- Added in-loop few-shot HumanEval BPB
- Added `fast` and `full` in-loop recommendations, where `fast` is a roughly 2-3x faster subset of `full`
- Added support for converting to HF models in lower precisions.
- Added support for headwise QK norm.
- Add BOS token in in-loop evals, when specified by the tokenizer (`ai2-olmo-eval==0.8.4`)
- Add support for BOS token matching EOS token for intra-document masking in FSL numpy datasets.
- Added option to allow profiler to record on multiple ranks.
- Added support for accessing Google Cloud on non-Google clusters via auth with service account keys.
- Added support for revisions in `convert_checkpoint_from_hf.py` and the `load_hf_model` method of `olmo_core.nn.hf.checkpoint`.
- `foreach` support in `SkipStepAdamW`.
- Added `budget` mode for activation checkpointing configuration.
- Added `io.remove_file()` and `io.glob_directory` functions.
- Added ABF, PI, and YaRN rope scaling strategies.
- Added a script to compare two WandB runs
- Added `namespace` option to `nn.buffer_cache.BufferCache`.
- Added the option to configure `head_stride` for context parallelism with ring-flash-attn.
- Added the option to group multiple npy source files together for packing with the packed FSL dataset by setting `source_group_size` to an integer greater than 1.
- Added `load_optim_state: Optional[bool]` option to `Trainer.load_checkpoint()`.
- Added `GenerationModule` for OLMo-core native autoregressive generation with support for KV caching.
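
To illustrate the best-fit-decreasing bin packing idea behind the "packed" FSL variant, here is a minimal sketch in plain Python. It is not the OLMo-core implementation: the function name `pack_best_fit_decreasing` and its arguments are hypothetical, and it assumes documents longer than the sequence length have already been split.

```python
from typing import List


def pack_best_fit_decreasing(doc_lengths: List[int], sequence_length: int) -> List[List[int]]:
    """Group document indices into bins whose total length fits in sequence_length.

    Hypothetical helper for illustration only (not an OLMo-core API).
    """
    # Visit documents from longest to shortest ("decreasing").
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i], reverse=True)

    bins: List[List[int]] = []  # document indices assigned to each bin
    free: List[int] = []        # remaining capacity of each bin

    for i in order:
        n = doc_lengths[i]
        if n > sequence_length:
            raise ValueError("split documents longer than sequence_length first")

        # "Best fit": pick the bin with the least remaining space that still fits.
        best = None
        for b, space in enumerate(free):
            if space >= n and (best is None or space < free[best]):
                best = b

        if best is None:
            # No existing bin has room, so open a new one.
            bins.append([i])
            free.append(sequence_length - n)
        else:
            bins[best].append(i)
            free[best] -= n

    return bins


packed = pack_best_fit_decreasing([5, 3, 4, 2, 6], sequence_length=8)
print(packed)  # [[4, 3], [0, 1], [2]]
```

This linear scan over bins is quadratic in the worst case; a production implementation would typically index bins by remaining capacity to make the lookup faster.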
Changed ⚠️
- Output of `LMHead` when `labels` is passed as input is now a 4-tuple instead of a 3-tuple, with `(logits, loss, ce_loss, z_loss)`, where `loss` is the combined loss (`ce_loss + z_loss`).
- The `ConfigSaver` callback will automatically set the config to save for other callbacks (`WandBCallback`, `CometCallback`, and `BeakerCallback` as of now).
- Fixed bug causing slow evals in BPB/RC in-loop evals due to fast MC
- Changed default precision of converted HF models in `src/examples/huggingface/convert_checkpoint_to_hf.py` to bfloat16.
- Changed default cluster to `saturn` in `src/examples/llama/train_launch.py`.
- Made some beaker secrets optional for internal experiments.
- Changed `SlidingWindowAttentionConfig` to improve clarity.
- Changed the default Beaker budget
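
The new 4-tuple shape of the `LMHead` output can be pictured with a small pure-Python stand-in. This is only a sketch: `lm_head_losses` and `z_loss_multiplier` are hypothetical names, the z-loss is assumed to be the common squared log-partition penalty, and none of this is the actual `LMHead` code.

```python
import math
from typing import List, Tuple


def lm_head_losses(
    logits: List[List[float]],
    labels: List[int],
    z_loss_multiplier: float = 1e-4,  # hypothetical default, for illustration
) -> Tuple[List[List[float]], float, float, float]:
    """Return (logits, loss, ce_loss, z_loss), mirroring the 4-tuple layout."""
    ce_total = 0.0
    z_total = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp over the vocabulary dimension.
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        ce_total += lse - row[label]              # cross-entropy for this token
        z_total += z_loss_multiplier * lse ** 2   # assumed z-loss form
    ce_loss = ce_total / len(labels)
    z_loss = z_total / len(labels)
    loss = ce_loss + z_loss  # combined loss, as described in the note above
    return logits, loss, ce_loss, z_loss


# Callers that previously unpacked three values now unpack four.
_, loss, ce_loss, z_loss = lm_head_losses([[0.0, 0.0], [2.0, 0.0]], labels=[0, 0])
print(loss == ce_loss + z_loss)  # True
```

Code that unpacked the old 3-tuple will fail with a `ValueError` on unpacking, so call sites need to be updated when upgrading.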
Fixed ✅
- Modified `TokenizerConfig.from_hf()` to fall back to tokenizer_config.json if config.json is not found.
- Fixed loading checkpoints with missing keys from transformer train modules using torch 2.7.
- Made MoE load balancing loss more robust.
- Fixed a bug with `ReorderedNormTransformerBlock` when using fine-grained FSDP wrapping and activation checkpointing together.
- Fixed an issue preventing tensor parallelism from working with `LMHead` when using the "fused_linear" loss implementation.
- Fixed a bug with `LMHead` when using "fused_linear" loss implementation where the `ce_loss` output included the `z_loss` added to it.
- Fixed training on single GPU when using a `SkipStepOptimizer`.
- Fixed the initialization of the `CosWithWarmupAndLinearDecay` learning rate scheduler
- Ensured eval tasks are sorted to maintain the same order across ranks (the cookbook was configuring these in an unsorted way).
- W&B callback uses working directory instead of save folder for local cache.
- Reset speed monitor callback after changing batch size.
- Fixed parallelism compatibility between cp + tp and cp + pp and added test to catch regressions.
- Ensure sharded parameters are initialized differently on separate ranks.
- Fixed fingerprinting for FSL datasets
- Fixed bug where `step` state in `SkipStepAdamW` was not incremented, biasing the optimizer steps. Added option to restore the bug for backwards compatibility.
- Removed `sklearn` from upstream dependency `ai2-olmo-eval`.
- Made removing ephemeral checkpoints more robust.
- Made running bookkeeping operations more robust.
- Ensure RoPE modules with different settings use a unique sub-cache for their buffers.
- Fixed bug with context parallelism where every transformer block would use the same RoPE buffers even if their RoPE was configured differently.
- Fixed MFU computation to work with FSDP, corrected some device specs.
- Optimization: avoid redundant calls to `model.train()` in `TransformerTrainModule`.
- `NumpyDatasetConfig.expand_glob` now works with remote directories.
- Fixed Attention block sharding when TP and head-wise QK norm are both applied.
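
As a rough illustration of why a step counter that fails to advance can bias an Adam-style optimizer, the sketch below computes the standard Adam bias-correction scale for a given step. It assumes the textbook Adam formulation with hypothetical default betas; it is not the `SkipStepAdamW` code, and the exact effect of the original bug may differ.

```python
import math


def adam_bias_correction_scale(step: int, beta1: float = 0.9, beta2: float = 0.999) -> float:
    """Factor applied to the learning rate by textbook Adam bias correction.

    Hypothetical helper for illustration; assumes step >= 1.
    """
    bias_correction1 = 1.0 - beta1 ** step  # corrects the first-moment estimate
    bias_correction2 = 1.0 - beta2 ** step  # corrects the second-moment estimate
    return math.sqrt(bias_correction2) / bias_correction1


# If the counter were stuck at an early value, the scale would stay far from
# its long-run value of 1.0 even late in training.
stale = adam_bias_correction_scale(1)
correct = adam_bias_correction_scale(1_000_000)
print(f"stale step: {stale:.4f}, correct step: {correct:.4f}")
```

With these betas the scale is about 0.32 at step 1 and approaches 1.0 as the step grows, which is why a correctly incremented `step` matters for matching the intended update size.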
Commits
de89fbe (chore) prepare for release v2.2.0
54d3af0 run gpu tests with gantry (#357)
c82d13c GenerationModule with support for KV Caching (#324)
effdef3 Add option to Trainer.load_checkpoint() to ignore optim state (#351)
2cd5b82 Fix TP when headwise QK norm is applied (#353)
429054a Fix empty config.json output in convert_checkpoint_from_hf (#354)
e97f58d Fix skip step optimizer with TP (#352)
7f4b45f Option to group npy sources together for packing (#349)
9fb6366 Add io.glob_directory function (#348)
67854f9 Avoid redundant calls to model.train() (#345)
7e14a68 Fix CP bug with RoPE buffers (#341)
699971d Support configuring head_stride for ring-flash-attn (#344)
b0019cc Pull updates from olmo3 branches (#334)
026b882 Fix MFU Calculation (#343)
4162151 Add 'namespaces' to BufferCache to avoid collisions (#340)
085755c Port the WandB comparison tool from the old trainer (#338)
9e42a20 The Beaker default budget has changed (#337)
e395b82 Ensure LR scheduler's unit are tokens when using BZ scheduler (#335)
08340f9 fix off-by-one issue with SWA
abc12e5 Make async bookkeeping more robust (#333)
6d2f334 Add TrainModule.pre_train(), fix pain point with BZ scheduler (#332)
8fb85bc make Scheduler a subclass of Config (#331)
e1bac95 More Rope Scaling Implementations (PI, Yarn) (#330)
97a6f4e Make removing ephemeral checkpoints more robust (#329)
ebe0b19 loosen numpy requirement (#328)
5b924a6 Add missing return in init (#327)
34beda5 Bump ai2-olmo-eval==0.8.5 (#326)
f107e9c fix typo in release process
bddf65a fix initialization (again) (#319)
992a79e Memory budget strategy for activation checkpointing (#297)
0dda3ec DDP parameter dtype casting for 16-bit precision and flash attention support (#314)
26998de Improve clarity of SWA config (#301)
91630ea Add config option for to enable/disable the step-increment bugfix. (#317)
51e2049 "Better" sorting of Augusta ranks at runtime (#313)
fe50d9e Only check for beaker secrets in non-distributed settings (#302)
c41962a SkipStepAdamW foreach implementation; bug fix for step state in SkipStepAdamW (#309)
a04e93f ensure sharded parameters initialized different on different ranks (#307)
f7c394d Fix fingerprints for various FSLDatasets (#303)
5f93eed hot fix
20d833b Add support for revisions in conversion from HF (#304)
b0f1d9e use qualname instead of name
b744d9f Make async bookkeeping more robust (#305)
796c2dc Add support for accessing Google cloud on non-Google clusters (#299)
2bdb93c Allow profiler to record on multiple ranks (#298)
00f9b00 Fix for parallelism compatibilities in build_world_mesh (#293)
ef2845f Bump pytorch, ring-flash-attn, and liger-kernel versions (#295)
4cffd08 Add support for BOS token matching EOS token in documents (#291)
c5777a3 Make ruff line-length match black line-length (#290)
f2cc497 Reset speed monitor after changing batch size (#289)
9816e44 ai2-olmo-eval==0.8.4 (#288)
a42ce36 Add head QK norm support for Attention (#287)
e7d01b1 use work_dir instead of save_folder for W&B cache
2cd5a31 Update default cluster for src/examples/llama/train_launch.py (#286)
2b73817 Support lower precisions for conversion to HF (#277)
aa96c3a Bump ai2-olmo-eval==0.8.3 (RC/BPB speed fix) (#285)
9fac3c5 HF conversion improvements (#284)
db2b8a4 catch other types of error when importing liger-kernel (#283)
fbeaa97 Add "fast" in-loop task set (#282)
776778e Fast in-loop MCQA (#281)
c779ca5 Bump ai2-olmo-eval==0.7.2 (in-loop Basic Skills) (#279)
fd44f03 Update images to stable 2.7.0, use CUDA 12.8 by default (#280)
1acde9d Fix Numpy data loader indices (again) (#276)
b18e58a Ensure eval tasks are sorted for consistent order (#275)
e185944 Save metadata to Beaker results dir (#274)
490f03a make indices filename robust to change in batch size (#273)
80cf40e Add BatchSizeSchedulerCallback (#272)
cd3d995 Sliding Window Attention (#271)
53c28a4 use REST API to follow jobs again
ef28f2a update pins
1662d0d don't auto upgrade beaker-py for now
5cc16a2 Add Interleaved Numpy Dataset (#263)
1cb5add Scheduler init (#267)
2caadea Move test utilities to new submodule olmo_core.testing (#266)
5d4c7ec Fix singe-GPU training with SkipStepOptimizer (#265)
9a19a71 More fixes for LMHead with TP (#264)
5bd9006 minor improvements to log streaming
1d4bd1d Add a numpy FSL dataset variant with document packing (#260)
7835077 use new RPC interface to follow Beaker experiments (#262)
a57d089 Fixes for LMHead with fused linear loss (#261)
a16e47d fix bug with fine-grained FSDP + activation checkpointing
75b003e Fix issue loading checkpoints with missing keys (#259)
bc5be4e make LB loss with sigmoid more robust (#255)
fa412e4 Allow setting LR schedule based on tokens, not steps (#258)
9bb72b5 fix bug when compile disabled
54e734c fix doc build
0bef170 Some HF tokenizers only use tokenizer_config.json (#256)
0b1d73e bump cached_path to fix issue with S3 downloads (#257)
1c28c1d Add more documentation to the template train script
984a4db make finding open ports more robust in dist tests (#254)
b2ce138 [muP] Mini optimizer group building refactor (#245)