Release v2.2.0 · allenai/OLMo-core

What's new

Added 🎉

Added option to set LR scheduler based on tokens instead of steps (e.g. --train_module.scheduler.units=tokens).
Added a "packed" numpy FSL variant that packs documents into sequences using the best-fit-decreasing bin packing algorithm following the work from Fewer Truncates Improve Language Modeling.
Added module olmo_core.testing.
Added an "interleaved" numpy FSL variant that interleaves several documents into sequences following the work from LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models.
Added sliding window attention as a feature
Added BatchSizeSchedulerCallback for setting a batch size schedule over the course of a training run.
Added optional TrainModule method, .pre_train(), which runs right after Callback.pre_train().
The BeakerCallback will save the config and Python requirements to the results dataset.
Added from_file method to Config class.
Added in-loop evals for OLMES basic skills eval
Added fast MCQA for in-loop evals and translated MBPP tasks
Added in-loop few-shot HumanEval BPB
Added fast and full in-loop recommendations, where fast is a roughly 2-3x faster subset of full
Added support for converting to HF models in lower precisions.
Added support for headwise QK norm.
Added BOS token in in-loop evals, when specified by the tokenizer (ai2-olmo-eval==0.8.4)
Added support for BOS token matching EOS token for intra-document masking in FSL numpy datasets.
Added option to allow profiler to record on multiple ranks.
Added support for accessing Google Cloud on non-Google clusters via auth with service account keys.
Added support for revisions in convert_checkpoint_from_hf.py and the load_hf_model method of olmo_core.nn.hf.checkpoint.
Added foreach support in SkipStepAdamW.
Added budget mode for activation checkpointing configuration.
Added io.remove_file() and io.glob_directory() functions.
Added ABF, PI, and YaRN rope scaling strategies.
Added a script to compare two WandB runs
Added namespace option to nn.buffer_cache.BufferCache.
Added the option to configure head_stride for context parallelism with ring-flash-attn.
Added the option to group multiple npy source files together for packing with the packed FSL dataset by setting source_group_size to an integer greater than 1.
Added load_optim_state: Optional[bool] option to Trainer.load_checkpoint().
Added GenerationModule for OLMo-core native autoregressive generation with support for kv caching.
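
To illustrate the best-fit-decreasing packing mentioned above for the "packed" FSL variant, here is a minimal sketch of the general algorithm. It is not OLMo-core's implementation: the function name pack_best_fit_decreasing and its arguments are hypothetical, and it assumes every document already fits within one sequence (longer documents would need to be split first).

```python
from bisect import bisect_left, insort
from typing import List, Tuple


def pack_best_fit_decreasing(doc_lengths: List[int], sequence_length: int) -> List[List[int]]:
    """Group document indices into bins whose total length is <= sequence_length.

    Hypothetical helper for illustration only, not part of OLMo-core.
    """
    # "Decreasing": visit documents from longest to shortest.
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i], reverse=True)

    bins: List[List[int]] = []
    # Sorted list of (remaining_space, bin_index) so the tightest fit can be found by bisection.
    free: List[Tuple[int, int]] = []

    for i in order:
        length = doc_lengths[i]
        if length > sequence_length:
            raise ValueError(f"document {i} is longer than sequence_length")

        # "Best fit": the open bin with the least remaining space that still holds the document.
        pos = bisect_left(free, (length, -1))
        if pos < len(free):
            space, b = free.pop(pos)
            bins[b].append(i)
        else:
            # No open bin fits, so start a new one.
            b = len(bins)
            bins.append([i])
            space = sequence_length
        insort(free, (space - length, b))

    return bins


packed = pack_best_fit_decreasing([5, 3, 4, 2, 6], sequence_length=8)
print(packed)
```

Each inner list holds the indices of the documents that share one training sequence. Packing whole documents this way avoids cutting them at sequence boundaries, at the cost of some unused space in bins that cannot be filled exactly.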

Changed ⚠️

Output of LMHead when labels is passed as input is now a 4-tuple instead of a 3-tuple, with (logits, loss, ce_loss, z_loss), where loss is the combined loss (ce_loss + z_loss).
The ConfigSaver callback will automatically set the config to save for other callbacks (WandBCallback, CometCallback, and BeakerCallback as of now).
Fixed a bug where fast MC caused slow BPB/RC in-loop evals
Changed default precision of converted HF models in src/examples/huggingface/convert_checkpoint_to_hf.py to bfloat16.
Changed default cluster to saturn in src/examples/llama/train_launch.py.
Made some beaker secrets optional for internal experiments.
Changed SlidingWindowAttentionConfig to improve clarity.
Changed the default Beaker budget
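
The relationship between the LMHead outputs listed above (loss = ce_loss + z_loss) can be sketched in plain Python. This is an illustration under assumptions, not OLMo-core's code: the function lm_losses and the z_loss_multiplier default are hypothetical, and z-loss is taken here in its commonly used form, the multiplier times the squared log-partition value, averaged over tokens.

```python
import math
from typing import List, Tuple


def lm_losses(
    logits: List[List[float]],
    labels: List[int],
    z_loss_multiplier: float = 1e-4,  # assumed value, for illustration only
) -> Tuple[float, float, float]:
    """Return (loss, ce_loss, z_loss) with loss = ce_loss + z_loss."""
    ce_total = 0.0
    z_total = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp over the vocabulary dimension.
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        ce_total += lse - row[label]  # cross-entropy for this token
        z_total += lse ** 2           # z-loss penalizes a large log-partition value
    n = len(labels)
    ce_loss = ce_total / n
    z_loss = z_loss_multiplier * z_total / n
    return ce_loss + z_loss, ce_loss, z_loss


loss, ce_loss, z_loss = lm_losses([[0.0, 0.0], [2.0, 0.0]], labels=[0, 0])
print(loss, ce_loss, z_loss)
```

Code that unpacked the previous 3-tuple needs updating for the extra element; keeping ce_loss separate from the combined loss lets the two terms be logged independently.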

Fixed ✅

Modified TokenizerConfig.from_hf() to fall back to tokenizer_config.json if config.json is not found.
Fixed loading checkpoints with missing keys from transformer train modules using torch 2.7.
Made MoE load balancing loss more robust.
Fixed a bug with ReorderedNormTransformerBlock when using fine-grained FSDP wrapping and activation checkpointing together.
Fixed an issue preventing tensor parallelism from working with LMHead when using the "fused_linear" loss implementation.
Fixed a bug with LMHead when using "fused_linear" loss implementation where the ce_loss output included the z_loss added to it.
Fixed training on single GPU when using a SkipStepOptimizer.
Fixed the initialization of the CosWithWarmupAndLinearDecay learning rate scheduler
Ensured eval tasks are sorted to maintain the same order across ranks (the cookbook was configuring these in an unsorted way).
W&B callback uses working directory instead of save folder for local cache.
Reset speed monitor callback after changing batch size.
Fixed parallelism compatibility between cp + tp and cp + pp and added a test to catch regressions.
Ensured sharded parameters are initialized differently on separate ranks.
Fixed fingerprinting for FSL datasets
Fixed bug where step state in SkipStepAdamW was not incremented, biasing the optimizer steps. Added option to restore the bug for backwards compatibility.
Removed sklearn from upstream dependency ai2-olmo-eval.
Made removing ephemeral checkpoints more robust.
Made running bookkeeping operations more robust.
Ensured RoPE modules with different settings use a unique sub-cache for their buffers.
Fixed bug with context parallelism where every transformer block would use the same RoPE buffers even if their RoPE was configured differently.
Fixed MFU computation to work with FSDP, corrected some device specs.
Optimization: avoid redundant calls to model.train() in TransformerTrainModule.
NumpyDatasetConfig.expand_glob now works with remote directories.
Fixed Attention block sharding when TP and head-wise QK norm are both applied.
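
For context on the SkipStepAdamW step-state fix above, the sketch below shows why the step counter matters in a standard Adam-style optimizer: the moment estimates are divided by 1 - beta**step, so a counter that does not advance leaves that correction at an early-step value. This shows only the textbook correction; it is not SkipStepAdamW's code, the helper name is hypothetical, and these notes do not describe exactly how the stale counter entered the update.

```python
from typing import Tuple


def adam_bias_corrections(step: int, beta1: float = 0.9, beta2: float = 0.999) -> Tuple[float, float]:
    """Return the standard Adam bias-correction denominators for a given step (step >= 1)."""
    return 1.0 - beta1 ** step, 1.0 - beta2 ** step


# With a correctly incremented counter, both corrections approach 1 as training proceeds.
for step in (1, 10, 100, 1000):
    c1, c2 = adam_bias_corrections(step)
    print(f"step={step:5d}  1-beta1^t={c1:.4f}  1-beta2^t={c2:.4f}")
```

Because checkpoints saved before the fix carry optimizer state produced under the old behavior, an option to restore that behavior lets earlier runs continue consistently.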

Commits

de89fbe (chore) prepare for release v2.2.0
54d3af0 run gpu tests with gantry (#357)
c82d13c GenerationModule with support for KV Caching (#324)
effdef3 Add option to Trainer.load_checkpoint() to ignore optim state (#351)
2cd5b82 Fix TP when headwise QK norm is applied (#353)
429054a Fix empty config.json output in convert_checkpoint_from_hf (#354)
e97f58d Fix skip step optimizer with TP (#352)
7f4b45f Option to group npy sources together for packing (#349)
9fb6366 Add io.glob_directory function (#348)
67854f9 Avoid redundant calls to model.train() (#345)
7e14a68 Fix CP bug with RoPE buffers (#341)
699971d Support configuring head_stride for ring-flash-attn (#344)
b0019cc Pull updates from olmo3 branches (#334)
026b882 Fix MFU Calculation (#343)
4162151 Add 'namespaces' to BufferCache to avoid collisions (#340)
085755c Port the WandB comparison tool from the old trainer (#338)
9e42a20 The Beaker default budget has changed (#337)
e395b82 Ensure LR scheduler's unit are tokens when using BZ scheduler (#335)
08340f9 fix off-by-one issue with SWA
abc12e5 Make async bookkeeping more robust (#333)
6d2f334 Add TrainModule.pre_train(), fix pain point with BZ scheduler (#332)
8fb85bc make Scheduler a subclass of Config (#331)
e1bac95 More Rope Scaling Implementations (PI, Yarn) (#330)
97a6f4e Make removing ephemeral checkpoints more robust (#329)
ebe0b19 loosen numpy requirement (#328)
5b924a6 Add missing return in init (#327)
34beda5 Bump ai2-olmo-eval==0.8.5 (#326)
f107e9c fix typo in release process
bddf65a fix initialization (again) (#319)
992a79e Memory budget strategy for activation checkpointing (#297)
0dda3ec DDP parameter dtype casting for 16-bit precision and flash attention support (#314)
26998de Improve clarity of SWA config (#301)
91630ea Add config option for to enable/disable the step-increment bugfix. (#317)
51e2049 "Better" sorting of Augusta ranks at runtime (#313)
fe50d9e Only check for beaker secrets in non-distributed settings (#302)
c41962a SkipStepAdamW foreach implementation; bug fix for step state in SkipStepAdamW (#309)
a04e93f ensure sharded parameters initialized different on different ranks (#307)
f7c394d Fix fingerprints for various FSLDatasets (#303)
5f93eed hot fix
20d833b Add support for revisions in conversion from HF (#304)
b0f1d9e use qualname instead of name
b744d9f Make async bookkeeping more robust (#305)
796c2dc Add support for accessing Google cloud on non-Google clusters (#299)
2bdb93c Allow profiler to record on multiple ranks (#298)
00f9b00 Fix for parallelism compatibilities in build_world_mesh (#293)
ef2845f Bump pytorch, ring-flash-attn, and liger-kernel versions (#295)
4cffd08 Add support for BOS token matching EOS token in documents (#291)
c5777a3 Make ruff line-length match black line-length (#290)
f2cc497 Reset speed monitor after changing batch size (#289)
9816e44 ai2-olmo-eval==0.8.4 (#288)
a42ce36 Add head QK norm support for Attention (#287)
e7d01b1 use work_dir instead of save_folder for W&B cache
2cd5a31 Update default cluster for src/examples/llama/train_launch.py (#286)
2b73817 Support lower precisions for conversion to HF (#277)
aa96c3a Bump ai2-olmo-eval==0.8.3 (RC/BPB speed fix) (#285)
9fac3c5 HF conversion improvements (#284)
db2b8a4 catch other types of error when importing liger-kernel (#283)
fbeaa97 Add "fast" in-loop task set (#282)
776778e Fast in-loop MCQA (#281)
c779ca5 Bump ai2-olmo-eval==0.7.2 (in-loop Basic Skills) (#279)
fd44f03 Update images to stable 2.7.0, use CUDA 12.8 by default (#280)
1acde9d Fix Numpy data loader indices (again) (#276)
b18e58a Ensure eval tasks are sorted for consistent order (#275)
e185944 Save metadata to Beaker results dir (#274)
490f03a make indices filename robust to change in batch size (#273)
80cf40e Add BatchSizeSchedulerCallback (#272)
cd3d995 Sliding Window Attention (#271)
53c28a4 use REST API to follow jobs again
ef28f2a update pins
1662d0d don't auto upgrade beaker-py for now
5cc16a2 Add Interleaved Numpy Dataset (#263)
1cb5add Scheduler init (#267)
2caadea Move test utilities to new submodule olmo_core.testing (#266)
5d4c7ec Fix singe-GPU training with SkipStepOptimizer (#265)
9a19a71 More fixes for LMHead with TP (#264)
5bd9006 minor improvements to log streaming
1d4bd1d Add a numpy FSL dataset variant with document packing (#260)
7835077 use new RPC interface to follow Beaker experiments (#262)
a57d089 Fixes for LMHead with fused linear loss (#261)
a16e47d fix bug with fine-grained FSDP + activation checkpointing
75b003e Fix issue loading checkpoints with missing keys (#259)
bc5be4e make LB loss with sigmoid more robust (#255)
fa412e4 Allow setting LR schedule based on tokens, not steps (#258)
9bb72b5 fix bug when compile disabled
54e734c fix doc build
0bef170 Some HF tokenizers only use tokenizer_config.json (#256)
0b1d73e bump cached_path to fix issue with S3 downloads (#257)
1c28c1d Add more documentation to the template train script
984a4db make finding open ports more robust in dist tests (#254)
b2ce138 [muP] Mini optimizer group building refactor (#245)
