v2.3.0
What's new
Fixed ✅
- Fixed parsing username+password git remote URLs in `launch.beaker` module.
- Fixed bug with default setup steps in `launch.beaker.BeakerLaunchConfig` when a branch can't be resolved.
- Cluster names in Beaker have changed.
- Fixed mixture rounding error with `SourceMixtureDataset`, which was previously causing samples to be repeated at the end of training.
- Don't DDOS Beaker from big jobs.
- A configuration error is now raised if you pass in a URL for the trainer or dataset's working directory. Previously the URL would just get mangled into a local path, leading to unexpected behavior.
- Fixed an issue where the `ConsoleLoggerCallback` would attempt to log before the first step.
- Only call `teardown_distributed_environment()` when training ends cleanly to avoid a hang for the duration of the distributed backend's timeout when there's an error from one rank.
- Fixed tensor parallelism issue with torch 2.8.
- More fixes for Beaker cluster names.
- `Callback.post_train()` will still be called even if the run is canceled before the dry-run batch.
- `GarbageCollectorCallback` will restore `gc` settings even when `Trainer.fit()` exits on an error.
- Make `move_to_device` blocking for MPS device to fix possible incorrect transfer of data from CPU to MPS.
- Fixed bug where `glob_directory()` would fail to match certain glob patterns.
- Added one more type of error to retry on when the Google Storage API throws it.
- Perform a garbage collection after checkpointing to avoid running out of CPU memory.
- Fixed an avoidable overflow error when using NumpyPackedFSLDataset.
- Fixed issue with NumpyFSLDatasetMixture + SourceMixtureDataset where not all instances would have the same sequence length.
- Attention backend will no longer default to flash in non-CUDA environments.
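
To illustrate the working-directory fix above, here is a minimal sketch of how a URL gets mangled when treated as a local path, and how a check can reject it up front. The `check_local_work_dir` helper and the `ConfigurationError` class are hypothetical names for this example only, not the library's actual API.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


class ConfigurationError(Exception):
    """Hypothetical stand-in for the library's configuration error."""


def check_local_work_dir(work_dir: str) -> PurePosixPath:
    """Return a local path, or raise if `work_dir` looks like a URL (hypothetical helper)."""
    scheme = urlparse(work_dir).scheme
    # A multi-character scheme followed by "://" marks a URL such as gs:// or s3://.
    if "://" in work_dir and len(scheme) > 1:
        raise ConfigurationError(f"working directory must be a local path, got URL: {work_dir!r}")
    return PurePosixPath(work_dir)


# Without the check, path handling silently collapses the double slash,
# so the URL turns into an odd relative local path.
mangled = str(PurePosixPath("gs://bucket/run"))
print(mangled)
```

Failing early like this surfaces the mistake at configuration time rather than after files have been written to an unexpected local directory.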
Changed ⚠️
- The `dir` option to `Trainer.maybe_load_checkpoint()` is now optional and defaults to the `save_folder`.
- Set `fused_linear_cross_entropy_loss accum_dtype` to fp32 in `LMHead`.
- Increased `NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS` from 10 minutes to 30 minutes.
- `SlackNotifierCallback` will now notify on checkpoint saved and post epoch events.
- `BeakerLaunchConfig.launch()` will now send Slack notifications by default when `follow=True` if the env var `SLACK_WEBHOOK_URL` is set.
- `src/examples/llama/` has been renamed to `src/examples/llm/`.
- Refactored eval task groups into `task_groups.py`.
- The `use_flash` argument to the `Attention` classes is deprecated. Use `backend="flash_2"` instead.
- Refactored `NumpyDatasetConfig` by splitting it into a separate config per underlying dataset class.
- Refactored `internal/experiment` module to facilitate modifying datasets or supplying a fully custom `ExperimentConfig`.
- Simplified `SourceMixtureDatasetConfig` by removing redundant `sequence_length` and `dtype` fields.
- The `model_id` argument to `convert_state_from_hf` is deprecated. Conversion information is deduced from the model type.
- Refactored the example conversion scripts to/from HF, including decreasing false failures in validation.
- Small refactor to `source_mixture.py` to make it easier to define data mixes in yaml.
- Reorganized/cleaned up internal training scripts.
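
For callers still passing `use_flash`, the deprecation above amounts to mapping the old flag onto the new `backend` argument. The sketch below shows one way such a shim could look. `resolve_attention_backend` is a hypothetical helper written for this example, not the library's implementation, and only the `use_flash` and `backend="flash_2"` names come from these notes.

```python
import warnings
from typing import Optional


def resolve_attention_backend(
    use_flash: Optional[bool] = None, backend: Optional[str] = None
) -> Optional[str]:
    """Map the deprecated `use_flash` flag onto `backend` (hypothetical helper)."""
    if use_flash is not None:
        warnings.warn(
            "'use_flash' is deprecated, use backend='flash_2' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # An explicitly chosen backend takes precedence over the old flag.
        if use_flash and backend is None:
            backend = "flash_2"
    return backend


# New-style call: no warning, the backend is passed through unchanged.
print(resolve_attention_backend(backend="flash_3"))
```

Returning `None` when neither argument is set leaves the default choice to the caller, which matters here because the default no longer resolves to flash in non-CUDA environments.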
Added 🎉
- Added CLI script `src/scripts/unshard.py` for converting distributed checkpoints to regular PyTorch or safetensors format.
- Added a custom block that does LayerNorm scaling.
- Added `OLMo-mix-0625-150Bsample` data mix.
- Added alias support to `DataMix` enum.
- Added the `HalfCos` learning rate scheduler.
- Added `CONTRIBUTING.md` guidelines.
- Added a lightweight, gantry-like Beaker launch CLI: `python -m olmo_core.launch.beaker`.
- Added Beaker images with torch 2.8. There are `olmo-core-tch280cu128-2025-09-18` and `olmo-core-tch280cu129-2025-09-18` for CUDA 12.8 and 12.9, respectively.
- Added TransformerEngine to Docker images and a TransformerEngine attention backend.
- Added `Callback.close()` method, which is always called when exiting `Trainer.fit()`.
- Added flash-attention 3 to Docker images, added `flash_3` attention backend.
- Added support for sliding window attention to the Torch attention backend. Performance is not optimized, so other backends should be preferred.
- Added `RoPEScalingConfig.to_hf_config()` for each RoPE scaling method to support automatic conversion to HuggingFace format.
- Guide to dataset mixing in `docs/source/guides/data_mixing.rst`.
- Added support for converting FlexOlmo models (with both dropless and default MoEs) between OLMo Core and HF formats.
- Added `olmo3_7B` model config.
- Added additional internal configuration tools.
- Added a new named data mix that we used for the 32B run.
- Added internal OLMo3 7B midtraining and long-context configs.
- Added ability to convert OLMo3 models to/from HF format with support for rope scaling configs.
- Added a script that can pull out a single training batch from a training job.
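
The guarantee that `Callback.close()` is always called when exiting `Trainer.fit()` can be pictured as a `try`/`finally` around the training loop. The toy classes below (`ToyCallback`, `toy_fit`) are illustrative stand-ins under that assumption, not the library's actual trainer code.

```python
from typing import Callable, List


class ToyCallback:
    """Toy callback that records which hooks were invoked (illustrative only)."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def post_train(self) -> None:
        self.events.append("post_train")

    def close(self) -> None:
        self.events.append("close")


def toy_fit(callbacks: List[ToyCallback], train_fn: Callable[[], None]) -> None:
    """Run `train_fn`, then always give every callback a chance to clean up."""
    try:
        train_fn()
        for cb in callbacks:
            cb.post_train()
    finally:
        # Runs on clean exit and on error alike, so resources get released.
        for cb in callbacks:
            cb.close()


def failing_step() -> None:
    raise RuntimeError("simulated training failure")


cb = ToyCallback()
try:
    toy_fit([cb], failing_step)
except RuntimeError:
    pass
print(cb.events)
```

A cleanup hook of this kind is a natural place to restore global state, which is the sort of thing the `GarbageCollectorCallback` fix above needs when `Trainer.fit()` exits on an error.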
Commits
5b32459 (chore) prepare for release v2.3.0
3f21b77 Reorganize internal scripts for Olmo3 (#423)
67dd1d6 Use exec to start torchrun (#422)
206d25e Cookbook migration part 2 - long context config (#404)
eabb869 Raise timeout error if training job doesn't start in 5 mins (#421)
cc69286 Script to dump training tokens (#418)
b5ba7be Script for unsharding (#420)
464d01e olmo3 conversion w/ support for rope scaling (#415)
3afa7dd Add Rope scaling configs to rope module's exports (#414)
3ef0c05 Cookbook migration part 1 - midtraining config (#403)
cdb7922 Add Dolma 3 sample, and a way to alias data mixes (#412)
77adc0b pull Slack webhook URL from secret if available (#413)
c8e32ab NumpyFSLDatasetMixture + SourceMixtureDataset fix (#411)
6ce62cc Fix OverflowError in pack_documents (#409)
e229c17 Cookbook migration part 0 - more setup (#397)
8eef6f3 Data mix for the 32B (#405)
9ba6154 Retry more on GS failures (#406)
73d11f8 Do GC after checkpointing (#407)
cdfd201 Typo
f477a8d Fix glob_directory bug (#402)
bc2a3f1 Support conversion of dropless MoE to FlexOlmo (#401)
7fc49de Support rope scaling configs for hf conversion (#394)
0f38dc6 HF Conversion Refactor (#390)
a53f825 source mixture dataset simplification and documentation (#399)
69bd9d2 MPS bug fixes (#395)
601d336 refactor internal experiment configuration assembler (#386)
874be5d Add flash_3 attention backend (#377)
3aac611 Remove old beaker refs (#393)
1aeb369 Add sliding window support for torch attention backend (#388)
7bebea3 Fully migrate to new cluster names, fix internal experiment launching on Augusta (#391)
1d4b67f Add flash-attn-3 install to dockerfile (#392)
04d5d5d Add Callback.close() method, other minor callback improvements (#389)
dcf3a0e Add attention backend abstraction and integrate transformer engine's attention (#384)
0356c1b Np Dataset Config Refactor (#381)
afb5827 Improve distributed error handling (#380)
45d2031 Consolidate CLI code in public scripts (#379)
cbcc735 Port task groups from cookbook to OlmoCore (#378)
71f0023 bump torch and other dependencies in our Docker build (#360)
b3c3b4e Add an all-in-one guide for researchers (#374)
65d411b CONTRIBUTING.md (#376)
e469561 Treat cordoned beaker hosts as occupied (#375)
75d4d3a Avoid DDOS-ing Beaker for real (#372)
258a7a8 fix
b068044 Set accum_grad to fp32 for fused_linear_cross_entropy_loss, bump liger-kernel version (#370)
b550f9b Add the ability to send Slack notifications from launch.beaker (#371)
801de4b Makes it possible to override the common config builder (#369)
59a2b2f Adds a new, highly specific LR scheduler (#368)
d4ad23f fix rounding error with mixing datasets (#316)
46f7211 [Feat] Add LNS training example (#320)
3d9e9cd redo base dir calculation for new cluster names (#364)
b852ec0 Up NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS to 30 minutes (#365)
72b7c1b Moved google-cloud-compute dependency from dev to beaker group. (#363)
c00d715 catch issues with dolma metadata files earlier (#361)
fa5a5dc Refine hostname constraints for beaker experiments on Google clusters (#355)
386b0a8 Fix parsing username+password git remote URLs (#356)
6726ffc make release process more robust