Release v2.3.0 · allenai/OLMo-core

What's new

Fixed ✅

Fixed parsing username+password git remote URLs in launch.beaker module.
Fixed bug with default setup steps in launch.beaker.BeakerLaunchConfig when a branch can't be resolved.
Updated cluster names to match Beaker's new cluster names.
Fixed mixture rounding error with SourceMixtureDataset, which was previously causing samples to be repeated at the end of training.
Large jobs no longer DDoS Beaker.
A configuration error is now raised if you pass in a URL for the trainer or dataset's working directory. Previously the URL would just get mangled into a local path, leading to unexpected behavior.
Fixed an issue where the ConsoleLoggerCallback would attempt to log before the first step.
Only call teardown_distributed_environment() when training ends cleanly to avoid a hang for the duration of the distributed backend's timeout when there's an error from one rank.
Fixed tensor parallelism issue with torch 2.8.
More fixes for Beaker cluster names.
Callback.post_train() will still be called even if the run is canceled before the dry-run batch.
GarbageCollectorCallback will restore gc settings even when Trainer.fit() exits on an error.
Make move_to_device blocking for MPS device to fix possible incorrect transfer of data from CPU to MPS.
Fixed bug where glob_directory() would fail to match certain glob patterns.
Added one more type of error to retry on when the Google Storage API throws it.
Perform a garbage collection after checkpointing to avoid running out of CPU memory.
Fixed an avoidable overflow error when using NumpyPackedFSLDataset.
Fixed issue with NumpyFSLDatasetMixture + SourceMixtureDataset where not all instances would have the same sequence length.
Attention backend will no longer default to flash in non-CUDA environments.
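
The working-directory fix listed above can be pictured with a small standalone check. This is only a sketch under stated assumptions, not OLMo-core's actual code: `ConfigurationError` and `validate_work_dir` are hypothetical names, and the rule "a scheme followed by `://` means URL" is an assumed heuristic.

```python
from urllib.parse import urlparse


class ConfigurationError(ValueError):
    """Hypothetical stand-in for the library's configuration error type."""


def validate_work_dir(work_dir: str) -> str:
    """Reject URL-like working directories instead of treating them as local paths."""
    parsed = urlparse(work_dir)
    # Assumed heuristic: a scheme followed by "://" (e.g. "gs://", "s3://") marks a URL.
    if parsed.scheme and "://" in work_dir:
        raise ConfigurationError(
            f"Working directory must be a local path, got URL: {work_dir!r}"
        )
    return work_dir
```

Failing fast like this surfaces the mistake at configuration time rather than after the URL has been silently turned into a local path.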

Changed ⚠️

The dir option to Trainer.maybe_load_checkpoint() is now optional and defaults to the save_folder.
Set fused_linear_cross_entropy_loss accum_dtype to fp32 in LMHead.
Increased NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS from 10 minutes to 30 minutes.
SlackNotifierCallback will now notify on checkpoint saved and post epoch events.
BeakerLaunchConfig.launch() will now send Slack notifications by default when follow=True if the env var SLACK_WEBHOOK_URL is set.
src/examples/llama/ has been renamed to src/examples/llm/.
Refactored eval task groups into task_groups.py.
The use_flash argument to the Attention classes is deprecated. Use backend="flash_2" instead.
Refactored NumpyDatasetConfig by splitting it into a separate config per underlying dataset class.
Refactored internal/experiment module to facilitate modifying datasets or supplying a fully custom ExperimentConfig.
Simplified SourceMixtureDatasetConfig by removing redundant sequence_length and dtype fields.
The model_id argument to convert_state_from_hf is deprecated. Conversion information is deduced from the model type.
Refactored the example conversion scripts to/from HF, including decreasing false failures in validation.
Small refactor to source_mixture.py to make it easier to define data mixes in yaml.
Reorganized/cleaned up internal training scripts.
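
For readers migrating off the deprecated use_flash argument, the following sketch shows one common way such a deprecation can be handled. It is illustrative only: `resolve_backend` is a hypothetical helper, not part of OLMo-core, and only the backend name "flash_2" and the use_flash argument come from the notes above.

```python
import warnings
from typing import Optional


def resolve_backend(
    use_flash: Optional[bool] = None, backend: Optional[str] = None
) -> Optional[str]:
    """Map the legacy use_flash flag onto a backend name (hypothetical helper)."""
    if use_flash is not None:
        warnings.warn(
            "'use_flash' is deprecated, use backend='flash_2' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # An explicitly chosen backend wins over the legacy flag.
        if use_flash and backend is None:
            backend = "flash_2"
    return backend
```

In calling code, the migration itself amounts to replacing `use_flash=True` with `backend="flash_2"`.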

Added 🎉

Added CLI script src/scripts/unshard.py for converting distributed checkpoints to regular PyTorch or safetensors format.
Added a custom block that does LayerNorm scaling.
Added OLMo-mix-0625-150Bsample data mix.
Added alias support to DataMix enum.
Added the HalfCos learning rate scheduler.
Added CONTRIBUTING.md guidelines.
Added a lightweight, gantry-like Beaker launch CLI: python -m olmo_core.launch.beaker.
Added Beaker images with torch 2.8. There are olmo-core-tch280cu128-2025-09-18 and olmo-core-tch280cu129-2025-09-18 for CUDA 12.8 and 12.9, respectively.
Added TransformerEngine to Docker images and a TransformerEngine attention backend.
Added Callback.close() method, which is always called when exiting Trainer.fit().
Added flash-attention 3 to Docker images and added a flash_3 attention backend.
Added support for sliding window attention to the Torch attention backend. Performance is not optimized, so other backends should be preferred.
Added RoPEScalingConfig.to_hf_config() for each RoPE scaling method to support automatic conversion to HuggingFace format.
Added a guide to dataset mixing in docs/source/guides/data_mixing.rst.
Added support for converting FlexOlmo models (with both dropless and default MoEs) between OLMo Core and HF formats.
Added olmo3_7B model config.
Added additional internal configuration tools.
Added a new named data mix that we used for the 32B run.
Added internal OLMo3 7B midtraining and long-context configs.
Added ability to convert OLMo3 models to/from HF format with support for rope scaling configs.
Added a script that can pull out a single training batch from a training job.
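
The guarantee that Callback.close() is always called when exiting Trainer.fit() can be illustrated with a try/finally loop. This is a minimal sketch, assuming nothing about the real Trainer: the `Callback`, `ManualGCCallback`, and `fit` definitions below are simplified, hypothetical stand-ins. The example also assumes that cleanup such as restoring gc settings would live in close().

```python
import gc
from typing import Callable, List


class Callback:
    """Simplified stand-in for a trainer callback (hypothetical)."""

    def pre_train(self) -> None:
        pass

    def close(self) -> None:
        pass


class ManualGCCallback(Callback):
    """Disables automatic GC during training and restores it in close()."""

    def __init__(self) -> None:
        self._was_enabled = gc.isenabled()
        self.closed = False

    def pre_train(self) -> None:
        self._was_enabled = gc.isenabled()
        gc.disable()

    def close(self) -> None:
        # Restore the previous setting whether or not training succeeded.
        if self._was_enabled:
            gc.enable()
        self.closed = True


def fit(callbacks: List[Callback], train_fn: Callable[[], None]) -> None:
    """Run training; the finally block guarantees close() on every exit path."""
    try:
        for cb in callbacks:
            cb.pre_train()
        train_fn()
    finally:
        for cb in callbacks:
            cb.close()
```

Because the cleanup sits in a finally block, an exception raised by `train_fn` still propagates to the caller, but only after every callback has been closed.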

Commits

5b32459 (chore) prepare for release v2.3.0
3f21b77 Reorganize internal scripts for Olmo3 (#423)
67dd1d6 Use exec to start torchrun (#422)
206d25e Cookbook migration part 2 - long context config (#404)
eabb869 Raise timeout error if training job doesn't start in 5 mins (#421)
cc69286 Script to dump training tokens (#418)
b5ba7be Script for unsharding (#420)
464d01e olmo3 conversion w/ support for rope scaling (#415)
3afa7dd Add Rope scaling configs to rope module's exports (#414)
3ef0c05 Cookbook migration part 1 - midtraining config (#403)
cdb7922 Add Dolma 3 sample, and a way to alias data mixes (#412)
77adc0b pull Slack webhook URL from secret if available (#413)
c8e32ab NumpyFSLDatasetMixture + SourceMixtureDataset fix (#411)
6ce62cc Fix OverflowError in pack_documents (#409)
e229c17 Cookbook migration part 0 - more setup (#397)
8eef6f3 Data mix for the 32B (#405)
9ba6154 Retry more on GS failures (#406)
73d11f8 Do GC after checkpointing (#407)
cdfd201 Typo
f477a8d Fix glob_directory bug (#402)
bc2a3f1 Support conversion of dropless MoE to FlexOlmo (#401)
7fc49de Support rope scaling configs for hf conversion (#394)
0f38dc6 HF Conversion Refactor (#390)
a53f825 source mixture dataset simplification and documentation (#399)
69bd9d2 MPS bug fixes (#395)
601d336 refactor internal experiment configuration assembler (#386)
874be5d Add flash_3 attention backend (#377)
3aac611 Remove old beaker refs (#393)
1aeb369 Add sliding window support for torch attention backend (#388)
7bebea3 Fully migrate to new cluster names, fix internal experiment launching on Augusta (#391)
1d4b67f Add flash-attn-3 install to dockerfile (#392)
04d5d5d Add Callback.close() method, other minor callback improvements (#389)
dcf3a0e Add attention backend abstraction and integrate transformer engine's attention (#384)
0356c1b Np Dataset Config Refactor (#381)
afb5827 Improve distributed error handling (#380)
45d2031 Consolidate CLI code in public scripts (#379)
cbcc735 Port task groups from cookbook to OlmoCore (#378)
71f0023 bump torch and other dependencies in our Docker build (#360)
b3c3b4e Add an all-in-one guide for researchers (#374)
65d411b CONTRIBUTING.md (#376)
e469561 Treat cordoned beaker hosts as occupied (#375)
75d4d3a Avoid DDOS-ing Beaker for real (#372)
258a7a8 fix
b068044 Set accum_grad to fp32 for fused_linear_cross_entropy_loss, bump liger-kernel version (#370)
b550f9b Add the ability to send Slack notifications from launch.beaker (#371)
801de4b Makes it possible to override the common config builder (#369)
59a2b2f Adds a new, highly specific LR scheduler (#368)
d4ad23f fix rounding error with mixing datasets (#316)
46f7211 [Feat] Add LNS training example (#320)
3d9e9cd redo base dir calculation for new cluster names (#364)
b852ec0 Up NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS to 30 minutes (#365)
72b7c1b Moved google-cloud-compute dependency from dev to beaker group. (#363)
c00d715 catch issues with dolma metadata files earlier (#361)
fa5a5dc Refine hostname constraints for beaker experiments on Google clusters (#355)
386b0a8 Fix parsing username+password git remote URLs (#356)
6726ffc make release process more robust
