v2.0.0

github-actions released this 13 Mar 01:42

dfa8f2b

What's new

This major release introduces a few breaking changes. We've provided more information here: OLMo-core v2 design and upgrade guide.

Added 🎉

Added TrainModule abstraction with TransformerTrainModule implementation, which encapsulates both a model and optimizer.
Added namespace argument to Trainer.record_metric().
Added support for context parallelism.
Added support for expert parallelism with MoE models.
Added in-loop evals for Minerva, GSM, HumanEval, and MBPP (ai2-olmo-eval==0.7.0).
Added CosWithWarmupAndLinearDecay learning rate scheduler.
Added WSD learning rate scheduler.
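
For readers unfamiliar with the term, WSD usually stands for "warmup-stable-decay": the learning rate ramps up, holds at its peak, then decays at the end of training. The sketch below is a generic, dependency-free illustration of that shape. It is not the OLMo-core implementation; the function name wsd_lr, all of its parameters, and the choice of linear warmup and linear decay are assumptions made for this example.

```python
def wsd_lr(
    step: int,
    max_steps: int,
    peak_lr: float,
    warmup_steps: int,
    decay_steps: int,
    min_lr: float = 0.0,
) -> float:
    """Hypothetical warmup-stable-decay schedule (not OLMo-core's API).

    Linear warmup from 0 to peak_lr, a constant phase at peak_lr,
    then a linear decay to min_lr over the final decay_steps.
    """
    if warmup_steps + decay_steps > max_steps:
        raise ValueError("warmup_steps + decay_steps must not exceed max_steps")

    # Clamp so out-of-range steps map onto the schedule's endpoints.
    step = min(max(step, 0), max_steps)

    # Phase 1: linear warmup.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # Phase 2: stable phase at the peak learning rate.
    decay_start = max_steps - decay_steps
    if step <= decay_start:
        return peak_lr

    # Phase 3: linear decay from peak_lr down to min_lr.
    progress = (step - decay_start) / decay_steps
    return peak_lr + (min_lr - peak_lr) * progress
```

With max_steps=1000, warmup_steps=100, and decay_steps=200, the rate climbs for the first 100 steps, stays at peak_lr through step 800, and reaches min_lr at step 1000.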

Changed ⚠️

The Trainer now takes a TrainModule instead of a model and optimizer, and several configuration options have been moved to TransformerTrainModule, including rank_microbatch_size, fused_loss, compile_loss, z_loss_multiplier, and autocast_precision.
Several TransformerModelConfig options have been moved to TransformerTrainModule / TransformerTrainModuleConfig, including dp_config, tp_config, float8_config, and compile.
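
The shape of this change can be illustrated with a toy, dependency-free sketch: the trainer no longer receives a model and an optimizer separately, but a single object that owns both and knows how to run one training step. Everything below (ToyTrainModule, ToyTrainer, train_batch, fit) is hypothetical and exists only for this illustration; none of it is the OLMo-core API, so consult the upgrade guide for the real class signatures.

```python
from typing import Iterable, List, Sequence, Tuple

Batch = Sequence[Tuple[float, float]]  # (x, y) pairs


class ToyTrainModule:
    """Hypothetical stand-in that bundles a 'model' and its 'optimizer'.

    The model is a single weight w fitting y = w * x, and the optimizer
    is plain SGD. Step-level options (here just lr) live on this object,
    not on the trainer.
    """

    def __init__(self, weight: float, lr: float) -> None:
        self.weight = weight
        self.lr = lr

    def train_batch(self, batch: Batch) -> float:
        # Forward pass, mean squared error, and its gradient w.r.t. the weight.
        loss = 0.0
        grad = 0.0
        for x, y in batch:
            err = self.weight * x - y
            loss += err * err
            grad += 2.0 * err * x
        n = len(batch)
        # Optimizer step happens inside the module.
        self.weight -= self.lr * grad / n
        return loss / n


class ToyTrainer:
    """Hypothetical trainer that only drives the loop."""

    def __init__(self, train_module: ToyTrainModule, max_steps: int) -> None:
        self.train_module = train_module
        self.max_steps = max_steps

    def fit(self, batches: Iterable[Batch]) -> List[float]:
        losses = []
        for step, batch in enumerate(batches):
            if step >= self.max_steps:
                break
            losses.append(self.train_module.train_batch(batch))
        return losses
```

In this toy version, ToyTrainer(ToyTrainModule(weight=0.0, lr=0.1), max_steps=20).fit(batches) drives training without the trainer ever touching the weight or the update rule directly, which is the separation of concerns the note above describes.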

Removed 👋

Removed the following callbacks: MoEHandlerCallback, SchedulerCallback, MatrixNormalizerCallback, GradClipperCallback, and Float8HandlerCallback.
The functionality from all of those callbacks has been moved to the TransformerTrainModule class.
Removed the callback methods .pre_eval_batch() and .post_eval_batch().

Fixed ✅

Fixed the model ladder code when training on an mps or cpu device.

Commits

dfa8f2b (chore) prepare for release v2.0.0
95fb084 add work-around for pytorch/ao#1871 (#205)
3ce0c58 32B Documentation (#210)
41f8ddc Add a public "official" version of our 32B train script (#214)
7e58d12 Update data paths in example to public URLs (#213)
4327bb9 upload data to r2 and updated their paths (#208)
0e6ea23 Assorted improvements (#207)
9ceb1e4 Add CUDA 12.6 images (#209)
eda3afb guard against wrapping MoE modules for AC (#206)
6e5b16f Bump ai2-olmo-eval==0.7.0 (in-loop Minerva, GSM, HumanEval, MBPP) (#204)
eccdc00 Make it easier for external users to run train scripts (#203)
da33f5b fix entrypoint steps
947a293 clean up changelog
725adf3 V2 (#202)
