v2.2.0

@github-actions github-actions released this 26 Aug 16:44
· 241 commits to main since this release

What's new

Added 🎉

  • Added option to set LR scheduler based on tokens instead of steps (e.g. --train_module.scheduler.units=tokens).
  • Added a "packed" numpy FSL variant that packs documents into sequences using the best-fit-decreasing bin packing algorithm following the work from Fewer Truncations Improve Language Modeling.
  • Added module olmo_core.testing.
  • Added an "interleaved" numpy FSL variant that interleaves several documents into sequences following the work from LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models.
  • Added sliding window attention as a feature
  • Added BatchSizeSchedulerCallback for setting a batch size schedule over the course of a training run.
  • Added optional TrainModule method, .pre_train(), which runs right after Callback.pre_train().
  • The BeakerCallback will save the config and Python requirements to the results dataset.
  • Added from_file method to Config class.
  • Added in-loop evals for OLMES basic skills eval
  • Added fast MCQA for in-loop evals and translated MBPP tasks
  • Added in-loop few-shot HumanEval BPB
  • Added fast and full in-loop recommendations, where fast is a roughly 2-3x faster subset of full
  • Added support for converting to HF models in lower precisions.
  • Added support for headwise QK norm.
  • Added BOS token in in-loop evals, when specified by the tokenizer (ai2-olmo-eval==0.8.4)
  • Added support for BOS token matching EOS token for intra-document masking in FSL numpy datasets.
  • Added option to allow profiler to record on multiple ranks.
  • Added support for accessing Google Cloud on non-Google clusters via auth with service account keys.
  • Added support for revisions in convert_checkpoint_from_hf.py and the load_hf_model method of olmo_core.nn.hf.checkpoint.
  • Added foreach support in SkipStepAdamW.
  • Added budget mode for activation checkpointing configuration.
  • Added io.remove_file() and io.glob_directory() functions.
  • Added ABF, PI, and YaRN rope scaling strategies.
  • Added a script to compare two WandB runs
  • Added namespace option to nn.buffer_cache.BufferCache.
  • Added the option to configure head_stride for context parallelism with ring-flash-attn.
  • Added the option to group multiple npy source files together for packing with the packed FSL dataset by setting source_group_size to an integer greater than 1.
  • Added load_optim_state: Optional[bool] option to Trainer.load_checkpoint().
  • Added GenerationModule for OLMo-core native autoregressive generation with support for kv caching.
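
As an illustration of the best-fit-decreasing packing mentioned for the "packed" FSL variant above, here is a minimal sketch in plain Python. It is not the olmo_core implementation: the function name pack_best_fit_decreasing is hypothetical, it works on document lengths only, and it assumes any document longer than the sequence length has already been split upstream.

```python
from bisect import bisect_left, insort
from typing import List, Tuple


def pack_best_fit_decreasing(doc_lengths: List[int], seq_len: int) -> List[List[int]]:
    """Pack document lengths into sequences of capacity seq_len (hypothetical helper)."""
    bins: List[List[int]] = []
    # Sorted (remaining_capacity, bin_index) pairs, one per open bin.
    remaining: List[Tuple[int, int]] = []

    # "Decreasing": place the longest documents first.
    for length in sorted(doc_lengths, reverse=True):
        if length > seq_len:
            # Assumption: oversized documents are chunked before packing.
            raise ValueError(f"document of length {length} exceeds seq_len={seq_len}")

        # "Best fit": the bin with the smallest remaining capacity that still fits.
        i = bisect_left(remaining, (length, -1))
        if i < len(remaining):
            space, idx = remaining.pop(i)
            bins[idx].append(length)
            insort(remaining, (space - length, idx))
        else:
            # No open bin has room, so start a new sequence.
            bins.append([length])
            insort(remaining, (seq_len - length, len(bins) - 1))
    return bins


packed = pack_best_fit_decreasing([5, 4, 3, 3, 2, 1], seq_len=8)
print(packed)  # [[5, 3], [4, 3, 1], [2]]
```

Each inner list is one training sequence. No document is split across sequences, which is the property the packing is meant to preserve.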

Changed ⚠️

  • Output of LMHead when labels is passed as input is now a 4-tuple instead of a 3-tuple, with (logits, loss, ce_loss, z_loss), where loss is the combined loss (ce_loss + z_loss).
  • The ConfigSaver callback will automatically set the config to save for other callbacks (WandBCallback, CometCallback, and BeakerCallback as of now).
  • Fixed a bug that slowed down BPB/RC in-loop evals due to fast MC
  • Changed default precision of converted HF models in src/examples/huggingface/convert_checkpoint_to_hf.py to bfloat16.
  • Changed default cluster to saturn in src/examples/llama/train_launch.py.
  • Made some beaker secrets optional for internal experiments.
  • Changed SlidingWindowAttentionConfig to improve clarity.
  • Changed the default Beaker budget
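
The new 4-tuple layout of the LMHead output can be pictured with the following sketch, which handles a single token in plain Python. It is not the LMHead code: lm_losses and z_loss_multiplier are hypothetical names, and the z-loss formula (a multiplier times the squared log-partition) is an assumption based on the common definition. The release notes only state the tuple order and that loss = ce_loss + z_loss.

```python
import math
from typing import List, Tuple


def lm_losses(
    logits: List[float], label: int, z_loss_multiplier: float = 1e-4
) -> Tuple[List[float], float, float, float]:
    """Return (logits, loss, ce_loss, z_loss) for one token (illustrative only)."""
    # Numerically stable log-sum-exp over the vocabulary.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))

    ce_loss = lse - logits[label]           # cross-entropy for the target label
    z_loss = z_loss_multiplier * lse ** 2   # assumed form of the auxiliary z-loss
    loss = ce_loss + z_loss                 # combined loss, as described above
    return logits, loss, ce_loss, z_loss


# Callers that unpacked three values now need to unpack four.
logits, loss, ce_loss, z_loss = lm_losses([2.0, 0.5, -1.0], label=0)
assert abs(loss - (ce_loss + z_loss)) < 1e-12
```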

Fixed ✅

  • Modified TokenizerConfig.from_hf() to fall back to tokenizer_config.json if config.json is not found.
  • Fixed loading checkpoints with missing keys from transformer train modules using torch 2.7.
  • Made MoE load balancing loss more robust.
  • Fixed a bug with ReorderedNormTransformerBlock when using fine-grained FSDP wrapping and activation checkpointing together.
  • Fixed an issue preventing tensor parallelism from working with LMHead when using the "fused_linear" loss implementation.
  • Fixed a bug with LMHead when using "fused_linear" loss implementation where the ce_loss output included the z_loss added to it.
  • Fixed training on single GPU when using a SkipStepOptimizer.
  • Fixed the initialization of the CosWithWarmupAndLinearDecay learning rate scheduler
  • Ensured eval tasks are sorted to maintain the same order across ranks (the cookbook was configuring these in an unsorted way).
  • W&B callback uses working directory instead of save folder for local cache.
  • Reset speed monitor callback after changing batch size.
  • Fixed parallelism compatibility between cp + tp and cp + pp and added a test to catch regressions.
  • Ensured sharded parameters are initialized differently on separate ranks.
  • Fixed fingerprinting for FSL datasets
  • Fixed bug where step state in SkipStepAdamW was not incremented, biasing the optimizer steps. Added option to restore the bug for backwards compatibility.
  • Removed sklearn from upstream dependency ai2-olmo-eval.
  • Made removing ephemeral checkpoints more robust.
  • Made running bookkeeping operations more robust.
  • Ensured RoPE modules with different settings use a unique sub-cache for their buffers.
  • Fixed bug with context parallelism where every transformer block would use the same RoPE buffers even if their RoPE was configured differently.
  • Fixed MFU computation to work with FSDP and corrected some device specs.
  • Optimization: avoid redundant calls to model.train() in TransformerTrainModule.
  • NumpyDatasetConfig.expand_glob now works with remote directories.
  • Fixed Attention block sharding when TP and head-wise QK norm are both applied.
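
To see why a step counter that never advances can bias an Adam-style optimizer, such as in the SkipStepAdamW fix above, consider the standard Adam bias-correction terms. The sketch below is not the SkipStepAdamW code, and adam_bias_scale is a hypothetical helper. The release notes do not say what value the counter was stuck at, so step = 1 is used purely as an assumed example.

```python
import math


def adam_bias_scale(step: int, beta1: float = 0.9, beta2: float = 0.999) -> float:
    """Factor applied to lr * m / sqrt(v) by Adam's bias correction (eps ignored)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    bias_correction1 = 1.0 - beta1 ** step
    bias_correction2 = 1.0 - beta2 ** step
    return math.sqrt(bias_correction2) / bias_correction1


stuck = adam_bias_scale(1)          # counter never advances (assumed value)
healthy = adam_bias_scale(10_000)   # counter advances normally

print(f"stuck at step 1: {stuck:.3f}")        # about 0.316
print(f"after 10,000 steps: {healthy:.3f}")   # about 1.000
```

With a correctly incremented counter the factor approaches 1, so the correction fades out. With a frozen counter the same early-training correction is applied on every update.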

Commits

de89fbe (chore) prepare for release v2.2.0
54d3af0 run gpu tests with gantry (#357)
c82d13c GenerationModule with support for KV Caching (#324)
effdef3 Add option to Trainer.load_checkpoint() to ignore optim state (#351)
2cd5b82 Fix TP when headwise QK norm is applied (#353)
429054a Fix empty config.json output in convert_checkpoint_from_hf (#354)
e97f58d Fix skip step optimizer with TP (#352)
7f4b45f Option to group npy sources together for packing (#349)
9fb6366 Add io.glob_directory function (#348)
67854f9 Avoid redundant calls to model.train() (#345)
7e14a68 Fix CP bug with RoPE buffers (#341)
699971d Support configuring head_stride for ring-flash-attn (#344)
b0019cc Pull updates from olmo3 branches (#334)
026b882 Fix MFU Calculation (#343)
4162151 Add 'namespaces' to BufferCache to avoid collisions (#340)
085755c Port the WandB comparison tool from the old trainer (#338)
9e42a20 The Beaker default budget has changed (#337)
e395b82 Ensure LR scheduler's unit are tokens when using BZ scheduler (#335)
08340f9 fix off-by-one issue with SWA
abc12e5 Make async bookkeeping more robust (#333)
6d2f334 Add TrainModule.pre_train(), fix pain point with BZ scheduler (#332)
8fb85bc make Scheduler a subclass of Config (#331)
e1bac95 More Rope Scaling Implementations (PI, Yarn) (#330)
97a6f4e Make removing ephemeral checkpoints more robust (#329)
ebe0b19 loosen numpy requirement (#328)
5b924a6 Add missing return in init (#327)
34beda5 Bump ai2-olmo-eval==0.8.5 (#326)
f107e9c fix typo in release process
bddf65a fix initialization (again) (#319)
992a79e Memory budget strategy for activation checkpointing (#297)
0dda3ec DDP parameter dtype casting for 16-bit precision and flash attention support (#314)
26998de Improve clarity of SWA config (#301)
91630ea Add config option for to enable/disable the step-increment bugfix. (#317)
51e2049 "Better" sorting of Augusta ranks at runtime (#313)
fe50d9e Only check for beaker secrets in non-distributed settings (#302)
c41962a SkipStepAdamW foreach implementation; bug fix for step state in SkipStepAdamW (#309)
a04e93f ensure sharded parameters initialized different on different ranks (#307)
f7c394d Fix fingerprints for various FSLDatasets (#303)
5f93eed hot fix
20d833b Add support for revisions in conversion from HF (#304)
b0f1d9e use qualname instead of name
b744d9f Make async bookkeeping more robust (#305)
796c2dc Add support for accessing Google cloud on non-Google clusters (#299)
2bdb93c Allow profiler to record on multiple ranks (#298)
00f9b00 Fix for parallelism compatibilities in build_world_mesh (#293)
ef2845f Bump pytorch, ring-flash-attn, and liger-kernel versions (#295)
4cffd08 Add support for BOS token matching EOS token in documents (#291)
c5777a3 Make ruff line-length match black line-length (#290)
f2cc497 Reset speed monitor after changing batch size (#289)
9816e44 ai2-olmo-eval==0.8.4 (#288)
a42ce36 Add head QK norm support for Attention (#287)
e7d01b1 use work_dir instead of save_folder for W&B cache
2cd5a31 Update default cluster for src/examples/llama/train_launch.py (#286)
2b73817 Support lower precisions for conversion to HF (#277)
aa96c3a Bump ai2-olmo-eval==0.8.3 (RC/BPB speed fix) (#285)
9fac3c5 HF conversion improvements (#284)
db2b8a4 catch other types of error when importing liger-kernel (#283)
fbeaa97 Add "fast" in-loop task set (#282)
776778e Fast in-loop MCQA (#281)
c779ca5 Bump ai2-olmo-eval==0.7.2 (in-loop Basic Skills) (#279)
fd44f03 Update images to stable 2.7.0, use CUDA 12.8 by default (#280)
1acde9d Fix Numpy data loader indices (again) (#276)
b18e58a Ensure eval tasks are sorted for consistent order (#275)
e185944 Save metadata to Beaker results dir (#274)
490f03a make indices filename robust to change in batch size (#273)
80cf40e Add BatchSizeSchedulerCallback (#272)
cd3d995 Sliding Window Attention (#271)
53c28a4 use REST API to follow jobs again
ef28f2a update pins
1662d0d don't auto upgrade beaker-py for now
5cc16a2 Add Interleaved Numpy Dataset (#263)
1cb5add Scheduler init (#267)
2caadea Move test utilities to new submodule olmo_core.testing (#266)
5d4c7ec Fix singe-GPU training with SkipStepOptimizer (#265)
9a19a71 More fixes for LMHead with TP (#264)
5bd9006 minor improvements to log streaming
1d4bd1d Add a numpy FSL dataset variant with document packing (#260)
7835077 use new RPC interface to follow Beaker experiments (#262)
a57d089 Fixes for LMHead with fused linear loss (#261)
a16e47d fix bug with fine-grained FSDP + activation checkpointing
75b003e Fix issue loading checkpoints with missing keys (#259)
bc5be4e make LB loss with sigmoid more robust (#255)
fa412e4 Allow setting LR schedule based on tokens, not steps (#258)
9bb72b5 fix bug when compile disabled
54e734c fix doc build
0bef170 Some HF tokenizers only use tokenizer_config.json (#256)
0b1d73e bump cached_path to fix issue with S3 downloads (#257)
1c28c1d Add more documentation to the template train script
984a4db make finding open ports more robust in dist tests (#254)
b2ce138 [muP] Mini optimizer group building refactor (#245)