This is a big one, as I haven't released since March. Whoops! Also my last release as maintainer, as today is my last day at Ai2. I'll miss you all.
Added
- Add `minimax-m3` to `PRICE_PER_MILLION_TOKENS` in `open_instruct/judge_utils.py` for cost tracking on the latest MiniMax flagship model.
Changed
- Add minimal support for DPO-training the Olmo Hybrid 7B (GDN linear attention) model with `open_instruct/dpo.py` (OLMo-core): bump OLMo-core to a rev with the `olmo3_hybrid_7B` preset and bidirectional HF weight conversion, bump `flash-linear-attention` to 0.5.0 and add `tilelang` (correct GDN gradients on Hopper), add `selected_modules` activation checkpointing (forwarding `determinism_check="none"` through OLMo-core's activation-checkpointing config so `torch.compile` and the opaque `fla` kernels coexist), extend `ModelDims` with GDN FLOPs/params accounting, and add `scripts/train/olmo-hybrid/7b_instruct_dpo_sweep_olmo_core.sh` (#1715).
- Record a `_metrics_keepalive` metric on every rank every GRPO+OLMo-core step to keep `_metrics` non-empty, preventing OLMo-core's empty-skip in `_log_metrics` from desyncing the bookkeeping process group and deadlocking gloo for 30 minutes at save-time flushes (#1708).
- Expand type-checking coverage by replacing `# ty: ignore` directives with typed casts and fixing related type issues (#1688).
- Add TV divergence rho filtering for GRPO (#1681).
- Export `SETUPTOOLS_SCM_PRETEND_VERSION_FOR_OPEN_INSTRUCT=0.0.0+debug` in `scripts/train/debug/grpo.sh` and `grpo_fast.sh` (local Ray debug scripts that disable torch compile) so setuptools-scm can resolve the package version (#1696).
- Simplify GRPO clip fraction handling by returning the final policy loss and clip fraction directly from `compute_grpo_loss` (#1679).
- Bring `grpo.py` (OLMo-core GRPO) to feature parity with `grpo_fast.py`: add `EvalCallback`, `setup_eval` actor RPC, unconditional vLLM-sync callback, `ConstantWithWarmup` scheduler support, and `StepTimingCallback` end-to-end step timing (#1672).
- Remove references to deleted `ppo_vllm_thread_ray_gtrl.py` script: delete broken launch scripts (`scripts/train/debug/ppo.sh`, `scripts/train/rlvr/tulu_rlvr.sh`, `scripts/train/tulu3/ppo_8b.sh`) and add historical-reference notes to `docs/tulu3.md` and `docs/archived_dev_scripts/olmoe_0125.sh` pointing to the deletion commit. Also drop the dead `update_command_args.py` references: delete `scripts/train/benchmark.sh` and its section in `docs/get_started/ai2_internal_setup.md`, and update the README RLVR quickstart to launch `grpo_fast.py` via `scripts/train/build_image_and_launch.sh`.
- Bump vllm to >=0.19.1 (and refresh `uv.lock`, including compressed-tensors v0.14.0.1 → v0.15.0.1).
- Move `maybe_evaluate` from `grpo_fast.py` to `grpo_utils.py` and drop the duplicate `PolicyTrainerRayProcess.calculate_token_counts` method, routing both trainer paths through the shared `grpo_utils.calculate_token_counts` (#1669).
- Rename `time/trainer_idle_waiting_for_inference` to `time/trainer_waiting_for_data` and `time/generation_idle_waiting_for_trainer` to `time/generation_waiting_for_trainer`, and emit per-Group generation timing (`time/group_generation_{mean,max,min}` plus `batch/per_group_generation_times` histogram) so latency vs. throughput in the inference pipeline is legible from wandb (#1690).
- Add parameterized `combine_dataset` tests in `open_instruct/test_utils.py` against local jsonl fixtures (no network), covering varied fractional/sample-count weight combinations and split-count mismatch (would have caught the bug fixed in #1674). Extract the interleaved-list→dict parsing into a shared `utils.parse_dataset_mixer_list` helper (with its own parameterized unit tests) and tighten `combine_dataset`/`get_datasets` to accept dict-only `dataset_mixer`; the one external list-form caller (`rejection_sampling/generation.py`) now converts at the call site.
- Make `mason.py` `--output_dir`/`--checkpoint_state_dir` overrides idempotent via `replace_or_append_flag`, add `open_instruct/grpo.py` to `OPEN_INSTRUCT_COMMANDS`/`OPEN_INSTRUCT_RESUMABLES`, and wire OLMo-core checkpoint save/resume into `grpo.py` (`CheckpointerCallback` + `DataPreparationActorCheckpointCallback` + `LoadStrategy.if_available`) so resumable Beaker jobs actually resume (#1666).
- Make `--budget` optional in `mason.py` (falls back to the workspace's default budget) and drop the explicit `--budget` flag from launch scripts where it already matched the workspace default (#1673).
- Restore 🤡 to resample warnings and use `self.training_step` in `DataPreparationActor.run` (#1663).
- Add a unified `use_rho_correction` interface (clamp + mask, per-token or sequence-level) for the train/infer engine mismatch in GRPO loss; replaces `truncated_importance_sampling_ratio_cap` and the IcePop flags (#1650).
- Resample on filtered batches in `DataPreparationActor` instead of emitting empty `CollatedBatchData`, unifying the `grpo.py` and `grpo_fast.py` consumer paths and removing the now-dead empty-batch checks in `grpo_fast.py` (#1660).
- Update Beaker budget from `ai2/oe-omai` to `ai2/oe-other` across launch scripts and beaker configs.
- Update Beaker budget from `ai2/oe-adapt` to `ai2/oe-omai` across launch scripts and beaker configs to fix experiment launch failures from the retired budget (#1662).
- Log every filtered prompt in `accumulate_inference_batches` at INFO level with the zero/solved/nonzero breakdown, and add `batch/filtered_prompts_pct` to wandb so policy collapse / convergence is visible without spelunking debug logs (#1657).
- Aggregate prompt/response lengths across all DP ranks (deduplicating SP groups) when computing GRPO step token counts and utilization metrics, instead of using only rank 0 (#1659).
- Split `accumulate_inference_batches` into `process_single_result` and `combine_processed_results` for clarity (#1614).
- Match reference SFT run: `olmo_core_finetune.py` parity with pure olmo-core; default CP strategy switched to `ulysses` and ring-flash-attn dependency removed (#1620).
- Address review feedback on #1620: derive vocab size from the run's tokenizer (no longer hardcoded to dolma2), validate complete numpy artifacts before reusing the SFT cache, fold seed/max_seq_length into the cache directory, fix HF-vs-olmo-core checkpoint detection for relative local paths, and log which checkpoint format was detected (#1620).
- Stream SFT tokens/labels/boundaries directly to `_*.partial.bin` files and derive per-dataset stats at the end from disk, dropping the explicit `_checkpoint.json` file. `--resume` now works by truncating the partial files to a consistent sample boundary (#1631).
- Revert reapply of packaging fix from #1634 (#1637).
- Drop unused `data_types` import and inline `batch["batch"].to(device)` in `GRPOTrainModule` (#1635).
- Use incremental binary checkpoint for SFT tokenization resume, eliminating O(N²) re-serialization (#1633).
- Extract numpy SFT conversion helpers into `open_instruct.numpy_dataset_conversion` (#1622).
- Simplified model step tracking logic (#1616).
- Pass `attention_mask=None` in GRPO `forward_for_logprobs` calls; HF constructs the correct 3D intra-document mask from `position_ids` internally (#1617).
- Migrate GRPO trainer→vLLM weight sync to vLLM 0.16.0's native weight transfer API (`NCCLWeightTransferEngine`), replacing custom NCCL process-group and broadcast code (#1515).
- Extend pre-commit hook to also ban `nonlocal` keyword (#1613).
- Set checkpoint_state_freq default in data_loader.py, not mason.py (#1600).
- Inline data prep actor naming in `StreamingDataLoader` and GRPO, removing redundant helpers and parameter plumbing (#1326).
- Use local fixture for AceCode test instead of downloading from HuggingFace (#1593).
- `max_grad_norm` clipping is now disabled by setting it to None, not -1 (#1591).
- Inline GRPO utility functions and rename `ExperimentConfig` to `GRPOExperimentConfig` (#1578).
- Extract shared OLMo-core config classes and helpers into `olmo_core_utils.py`; refactor DPO to use shared configs (#1576).
- Decouple `mix_data.py` from `finetune.py` by replacing `FlatArguments` import with a lightweight `MixDataArguments` dataclass (#1573).
- Extracted shared `find_free_port` utility function (#1607).
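
To illustrate what "idempotent via `replace_or_append_flag`" means in the `mason.py` entry above, here is a minimal sketch. The signature and the token-list representation are assumptions made for this example, not the actual helper in `mason.py`; the point is only that applying the same override twice yields the same command.

```python
def replace_or_append_flag(command: list[str], flag: str, value: str) -> list[str]:
    """Return a copy of `command` in which `flag` is set to `value` exactly once.

    Hypothetical sketch: handles both "--flag value" and "--flag=value" forms.
    """
    result: list[str] = []
    replaced = False
    i = 0
    while i < len(command):
        token = command[i]
        if token == flag:
            # "--flag value" form: emit the new value once, drop any repeats.
            if not replaced:
                result.extend([flag, value])
                replaced = True
            has_value = i + 1 < len(command) and not command[i + 1].startswith("--")
            i += 2 if has_value else 1
        elif token.startswith(flag + "="):
            # "--flag=value" form: keep the same spelling.
            if not replaced:
                result.append(f"{flag}={value}")
                replaced = True
            i += 1
        else:
            result.append(token)
            i += 1
    if not replaced:
        # Flag was absent, so append it.
        result.extend([flag, value])
    return result


cmd = ["python", "train.py", "--output_dir", "/old"]
once = replace_or_append_flag(cmd, "--output_dir", "/new")
twice = replace_or_append_flag(once, "--output_dir", "/new")
```

Here `once` and `twice` are equal, which is the property a resumable job relies on when the launcher rewrites its own command more than once.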
Deprecated
- Add deprecation warning to `finetune.py` pointing users to the OLMo-core SFT implementation (#1574).
Changed
- Replace olmo-core's `save_hf_model` path with a direct `convert_state_to_hf` + HF `save_pretrained` flow; verify HF export works at startup in `dpo.py`/`grpo.py` (#1671).
Fixed
- Fix flaky `open_instruct/code_utils/test_api.py` server startup by using per-instance free ports, failing fast on uvicorn exits, and surfacing subprocess output in startup errors (#1721).
- Fix `detect_attn_implementation` for `flash-attn-2` (#1716).
- Pass packed-sequence `doc_lens`/`max_doc_lens` to OLMo-core models in `forward_for_logprobs` (instead of relying on `attention_mask`), so OLMo-core GRPO uses correct intra-document attention; bumps olmo-core to a commit that accepts these kwargs (#1670).
- Fix gpt-4o / gpt-4o-standard output pricing (was 10× too low) and restate `open_instruct/judge_utils.py` rates as dollars per 1M tokens (renamed `PRICE_PER_TOKEN` → `PRICE_PER_MILLION_TOKENS`); update the cost calculation in `open_instruct/ground_truth_utils.py` accordingly (supersedes #1618) (#1686).
- Use processed vLLM logprobs in GRPO rollouts so sampled-token logprobs include sampling transforms like temperature (#1678).
- Fix `_get_batch_logps` division-by-zero (NaN return) in `open_instruct/dpo_utils.py` when a sequence has every label masked (-100) and `average_log_prob=True`; clamp the denominator at 1 (supersedes #1625).
- Bundle IFEval correctness fixes: `validate_choice` operand direction and `validate_frequency_capital_words` "around" tolerance in `open_instruct/if_functions.py`; `IFEvalVerifier` `ZeroDivisionError` when the instruction list is empty; deduplicate by deleting the unused `scripts/eval_constraints/if_functions.py` copy (supersedes #1615, #1646, #1655).
- Fix NameError on `streaming_config`/`vllm_config` when importing from `grpo_fast.py` due to implicit global dependency (#1675).
- Fix `grpo.py` token-count inflation by emitting bool `response_masks` from `DataPreparationActor` (instead of int64 doc-id-valued masks) and dropping per-consumer `.bool()` coercions in `grpo_fast.py`, `grpo_utils.py`, and `olmo_core_train_modules.py`. Previously the OLMo-core path summed the doc-id-valued mask in `calculate_token_counts`, inflating `loss_denominator` by ~60× (#1668).
- Fix hardcoded project version (#1636) by using setuptools scm's automatic versioning.
- Fix CUDA illegal-memory-access in FSDP2 weight sync to vLLM by also unsharding the root FSDPModule (root-level params like model.norm and lm_head were producing local-shard buffers with global stride) (#1649).
- Fix weight sync on resume by initializing vLLM weight sync before the training loop and warming up the learner with a dummy forward so DeepSpeed Stage 3 params materialize before the first broadcast; accept IPC `update_info` dict in `LLMRayActor.update_weights`; replace toothless weight-sync tests with a real divergent-weight broadcast test (#1627).
- Fix `verify_sentence_constraint` not recognising `!` as a sentence terminator, causing IFEval sentence-count checks to undercount any response containing exclamations (#1612).
- Fix `DataPreparationActor` hanging on shutdown by killing the actor with `ray.kill()` during cleanup (#1611).
- Fix empty optimizer group error with torch 2.10 and DeepSpeed in `finetune.py`, `dpo_tune_cache.py`, and `utils.py` (#1598).
- Fix `DatasetTransformationCache.load_or_transform_dataset` return type to match expected tuple unpacking (#1598).
- Fix DeepSpeed gradient clipping in `grpo_fast.py` by passing `max_norm` to the DS config (#1598).
- Fix `dpo_tune_cache.py` logging on every rank (#1598).
- Fix `truncate_messages_to_fit_context` double-counting system tokens, which under-filled the judge context window by `system_tokens` worth of space (#1601).
- Fix `is_equiv` returning `None` instead of `False` when expression simplification raises `ValueError` (#1605).
- Fix off-by-one in `find_shared_text` so the full shared prefix is returned when one string is a prefix of another, and handle empty-input cases (#1604).
- Fix `PreferenceDatasetProcessor.filter` dropping the rejected-sequence length check, so over-long rejected completions were no longer filtered (#1597).
- Fix dataset validation logic that rejected `--dataset_name` as the sole dataset mechanism in DPO and finetuning configs (#1595).
- Improve GRPO vLLM timeout handling: retry `_check_health` on `TimeoutError` and ensure `set_should_stop` is always reset in the weight sync thread to prevent training hangs (#1532).
- Fix `Batch.__getitem__` handling of `active_tools` for int and list indexing (#1592).
- Fix `RepeatPhraseChecker.check_following` to validate all matched phrases differ by exactly one word and return a proper boolean instead of `None` (#1044).
- Fix incorrect hardcoded checkpoint state path for multi-GPU DeepSpeed resumption (#1589).
- Route GRPO LLM judge requests through the shared semaphore-guarded LiteLLM helper, preserving judge-specific retries and cost accounting while removing stale per-verifier client cleanup code (#1587).
- Harden `grpo_fast` startup with explicit Ray resource preflight checks and actionable learner placement-group timeout diagnostics for single-node runs (#1586).
- Fix shellcheck `$@` quoting in GRPO debug scripts (#1572).
- Add `--no_auto_dataset_cache` to GRPO and SFT integration test scripts to avoid HuggingFace 504 timeouts on CI runner (#1571).
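
The `_get_batch_logps` entry above describes clamping the denominator at 1 so that a fully masked sequence no longer produces NaN. The sketch below shows that idea on plain Python lists. It is not the code in `open_instruct/dpo_utils.py` (which operates on tensors); the function name and list-based inputs are assumptions made for illustration.

```python
IGNORE_INDEX = -100  # label value marking tokens excluded from the loss


def average_log_prob(token_logps: list[float], labels: list[int]) -> float:
    """Mean log-prob over unmasked tokens, safe when every label is masked.

    Hypothetical list-based stand-in for a tensor implementation.
    """
    kept = [lp for lp, label in zip(token_logps, labels) if label != IGNORE_INDEX]
    # Clamp the denominator at 1: with zero unmasked tokens the sum is 0,
    # so the result is 0.0 instead of a division by zero.
    denominator = max(len(kept), 1)
    return sum(kept) / denominator


partially_masked = average_log_prob([-1.0, -3.0], [5, IGNORE_INDEX])
fully_masked = average_log_prob([-1.0, -3.0], [IGNORE_INDEX, IGNORE_INDEX])
```

With one unmasked token the average is just that token's log-prob; with none, the clamp makes the result 0.0.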
Added
- Replace `scripts/submit_eval_jobs.py` with a new olmo-eval-internal launcher (Beaker v2, no gantry); the previous script is preserved as `scripts/submit_eval_jobs_old.py` and emits a `DeprecationWarning` (#1638).
- Add OLMo-core SFT implementation (#1579).
- Add DR-TULU replication script for Qwen 3.5 4B with evolving rubrics, per-tool pool size overrides, `vllm_qwen3_xml` parser, and `<answer>` tag extraction in rubric scoring (#1609).
- Add MiniMax provider support: register `minimax-m2.7` and `minimax-m2.7-highspeed` models in `PRICE_PER_TOKEN` for cost tracking and add cl100k_base encoding support in `context_window_checker` (#1602).
- Wire evolving rubric config flags into the GRPO training loop so `apply_evolving_rubric_reward` actually triggers rubric generation, buffer management, and ground-truth overrides during training (#1581).
- Add model step logging for GRPO/vLLM by propagating `model_step` through generation metadata/results, syncing vLLM engines to the latest training step after weight sync, and reporting `model_step_min/max/mean` reward metrics (#1508).
- Add Qwen3.5 VLM-as-CausalLM support for GRPO, SFT, and DPO: `language_model_only` for vLLM, param name mapping for weight sync, VLM config handling, liger-kernel bump to 0.7.0, pre-download model on rank 0 to avoid HF cache race conditions, update vllm to 0.19.0, and fix Ulysses SP for VLM models by passing the model object to `register_with_transformers` (#1568).
- Add OLMo-core sharding and parallelism documentation covering HSDP configuration across DPO, GRPO, and SFT (#1582).
- Add a vLLM-based teacher logit sampling pipeline for offline distillation, including `sample_logits_vllm.py`, distillkit sampling writer utilities, and a launch script for generating compressed parquet shards (#1534).
- Add user-focused documentation for tool use training, RL environments, parser selection, and rollout configuration (#1546).
- Adds support for flash attention 4, and changes attention implementation to FA2 (#1569).
- Add Git LFS documentation to README.md and CONTRIBUTING.md (#1570).
- Auto-detect attention implementation from model config, removing `use_flash_attn` and `attn_backend` flags; add `flash-attn` v2 fallback for Blackwell GPU support (#1567).
- Add hybrid model (Olmo-Hybrid) support: MambaSpec monkey-patch for vLLM dtype serialization, `trust_remote_code` pass-through to vLLM engines, `get_text_config()` for multimodal model support, dependency upgrades (vllm>=0.18.0, transformers>=5.3.0), `return_dict=False` for transformers 5.x compat, and hybrid test/production training scripts (#1425).
- Add Ulysses sequence parallelism support to SFT training via `--sequence_parallel_size`, using HF Accelerate's `ParallelismConfig` with the DeepSpeed Ulysses SP backend. Enables training with much longer context lengths by sharding sequences across GPUs. Includes SP-aware loss aggregation, batch collation (padding to divisible seq len, index column removal), LR scheduler correction, and a two-node integration test script (#1539).
- Added a GRPO implementation that uses OLMo-core with Ray-distributed FSDP2 training (#1389).
- Add the Qwen 3 4B DAPO math 32k training launch script under `scripts/train/qwen/` (#1536).
- Add Muon optimizer support to DPO training via OLMo-core's native MuonConfig (#1533).
- Add documentation for Slack alert integrations in GRPO and DPO training (#1529).
- Add `flash-attn-3` dependency for Flash Attention 3 support on H100/H800 GPUs. DPO training via olmo-core auto-detects FA3 at runtime (#1525).
- Tensor parallelism (TP) support for OLMo-core DPO training (#1467).
- Pulls out weight sync code from GRPO into a more generic function (#1411 (review))
- Adds callbacks for GRPO training with Olmo-core's trainer (#1397).
- Adds FSDP2 block-by-block weight gathering support for vLLM weight sync.
- OLMo-core GRPO actor with Ray-distributed FSDP2 training (#1398).
Fixed
- Refactor flash attention configuration: make `attn_implementation` configurable with auto-detect default, remove `use_flash_attn`/`attn_backend` flags, and unify attention backend detection across DPO, GRPO, and olmo-core models (#1563).
- Fix GPU test deadlock and make dataset transformation tests fully offline with local fixtures (#1563).
- Remove stale `VLLM_ATTENTION_BACKEND` from `DEFAULT_ENV_VARS`; vLLM 0.18+ auto-detects attention backends (#1564).
- Use `setup_zero_stage3_hooks()` for DeepSpeed 0.18+ compat in `add_hooks` (#1566).
- Remove the runtime `temperature` field from GRPO `ExperimentConfig` and pass streaming temperature explicitly, avoiding W&B config collisions with `StreamingDataLoaderConfig.temperature` (#1561).
- Log `val/tis_ratio` and `val/tis_clipfrac` in `grpo_fast` so truncated importance sampling diagnostics are visible during GRPO training (#1558).
- Fix SP double-shift bug: keep both `labels` and `shift_labels` in batch so `ForCausalLMLoss` uses pre-shifted labels (#1549).
- Fix `total_batch_size` logging to account for sequence parallelism (SP ranks share data, not independent) (#1542).
- Got Olmo-core GRPO running in single-gpu mode and added a grpo.py debug script (#1543).
- Batch vLLM weight sync broadcasts to reduce Ray RPCs from ~200+ to 1, fixing timeouts with 32k response lengths (#1535).
- Fix `wandb_tracker.run.url` `AttributeError` on non-main processes in multi-node SFT training by guarding accesses with `accelerator.is_main_process` checks (#1539).
- Fix `UnboundLocalError` for `beaker_config` in SFT tracking setup when `push_to_hub` is disabled (#1539).
- Pre-download HF model on main process before Ray actors spawn to avoid hitting HuggingFace rate limits (#1528).
- Fixed GPU test failures: DPO `get_num_tokens` attention mask matching, DPO forward pass logps computation, mock model interface in `test_dpo_utils_gpu.py`, patch target in `test_olmo_core_callbacks_gpu.py`, reference logprobs cache `drop_last`, and flaky streaming dataloader tool test (#1514).
- Extended CONTRIBUTING.md with documentation on running tests, CI workflows, Beaker experiments, GRPO/DPO test scripts, and environment variables.
Changed
- Add support for loading DeepSpeed universal checkpoints when resuming GRPO runs so checkpoints can be reused across different parallelisms and cluster sizes (#1517).
- Extract shared GRPO metric helpers into `grpo_utils.py` and align `grpo.py` metrics with `grpo_fast.py` (#1552).
- Add a configurable vLLM attention backend option and switch remaining `flash_attention_2` defaults/references to `flash_attention_3` (#1559).
- Switch back to CUDA 12.8.1, pin `flash-attn-3` to a direct x86_64 wheel URL to avoid flat-index drift to aarch64-only releases (#1560).
- Added GRPO local eval `pass@k` metrics, plus optional `eval_response_length` handling so eval generations can exceed rollout response length without undersizing vLLM `max_model_len` (#1464).
- Added other configs to wandb logging so all hyperparams are visible, set beaker name with RUN_NAME for grpo_fast.py (#1554).
- Updated vLLM to 0.17.1 and torch to 2.10+.
- Log `optim/grad_norm` in `grpo_fast`, including non-finite DeepSpeed values (nan/inf) when they occur (#1540).
- Update GRPO/DPO defaults to match Olmo 3 experiments (`async_steps=8`, `advantage_normalization_type=mean_std`, `inflight_updates=True`, `clip_higher=0.28`, `truncated_importance_sampling_ratio_cap=10.0`) and remove redundant flags from training scripts (#1547).
- Removed all Augusta cluster (`ai2/augusta`) references and GCP-cluster-specific code paths since the cluster has been decommissioned.
- Added GRPO fast idle wait-time metrics for trainer waiting on inference and generation waiting on trainer consumption (`time/trainer_idle_waiting_for_inference`, `time/generation_idle_waiting_for_trainer`) (#1516).
- Updated vLLM to 0.16.0 and fixed `ChatCompletionRequest` import path which moved to `vllm.entrypoints.openai.chat_completion.protocol` (#1510).
- Enable multiple active targets per rollout in RL training by unifying tool and environment dispatch in vLLM with upfront pool activation (no lazy tool acquisition), normalizing row-level `env_config` during preprocessing (`dict`/`list` forms -> canonical `{"env_configs": [...]}`), enforcing canonical-only runtime parsing, validating unknown configured targets early, and reporting per-target metrics while retaining text-environment handling (#1500).
- Bound async data preparation to stay within `async_steps` of training, preventing training data from getting too far out of sync with the trainer (#1496).
- Refactor Legacy and DRTulu tool parsers to use OpenAI-format `tool_definitions` instead of Ray `tool_actors`. Removes `import ray` from `parsers.py`, fixes DRTulu parser which was broken after the pool refactor, and fixes `--tool_parser_type` typo in dr_tulu debug script (#1491).
- Replaces lambda collators with a "single_example_collator" (#1472).
- Clarified `activation_memory_budget` guidance in DPO utils with a practical default (0.5) and memory/speed tradeoff notes (#1460).
- Let TransformerTrainModule handle FSDP parallelism instead of manual application in DPO (#1458).
- Refactored DPOTrainModule to inherit from TransformerTrainModule (#1456)
- Increased vLLM health check timeout from 30s to 600s (10 minutes) (#1452).
- Updated vllm version to 0.14.1 (#1433).
- Changed default wandb x-axis from `episode` to `training_step` for grpo_fast (#1437).
- Made a bunch of changes to `dpo.py` so it matches `dpo_tune_cache.py` perfectly (#1451).
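
For readers unfamiliar with the `pass@k` metrics mentioned in the GRPO local eval entry above, the following sketch computes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for `n` sampled completions of which `c` are correct. Whether the GRPO eval code uses exactly this estimator is an assumption; the changelog does not say, and the function name here is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k completions drawn without
    replacement from n samples (c of them correct) is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 4 samples per prompt, 1 correct.
p1 = pass_at_k(n=4, c=1, k=1)
p2 = pass_at_k(n=4, c=1, k=2)
```

In this example `p1` is 0.25 and `p2` is 0.5: drawing two of the four samples doubles the chance of including the single correct one.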
Fixed
- Fix GRPO data prep actor checkpoint resume so resumed runs restore data prep client state and continue from the next unseen learner step (#1523).
- Fixed dataset cache hashing to include chat template source/content and tokenizer template metadata so dataset caches invalidate when chat templates change (#1497).
- Fixed `dataset_mixer_list_splits` validation in `dataset_transformation` when multiple splits are provided, and prevent `combined_dataset` index-column conflicts by dropping an existing `index` column before adding a new one (#1494).
- Fixed GSM8K reward verification for signed final answers by preserving explicit `+` and `-` signs when extracting the last numeric prediction, including boxed negative answers (#1530).
- Exclude `CUDA_VISIBLE_DEVICES` and `ROCR_VISIBLE_DEVICES` from the Ray `runtime_env` so Ray can manage per-worker GPU visibility correctly on heterogeneous clusters and avoid invalid GPU assignments (#1519).
- Include tokenizer configuration in per-transform dataset cache fingerprints so rerunning transformations with a different tokenizer does not silently reuse stale cached outputs (#1518).
- Fixed `grpo_fast` local eval rounds enqueueing 0 prompts after the first run by resetting `eval_data_loader` after each eval pass (stateful `DataLoaderBase` requires reset after epoch exhaustion); also switched eval prompt ID prefix from constant `0` to `training_step` to avoid cross-round metadata key collisions in vLLM request tracking (#1493).
- Force `generation_config="vllm"` in vLLM engine kwargs to prevent model HF generation defaults from capping OpenAI request `max_tokens` (#1512).
- Avoided synchronous CUDA transfers when moving batches to device (#1443).
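
The GSM8K entry above is about keeping an explicit sign when pulling the last number out of a model response. A rough sketch of that behaviour follows; the regular expression and function name are assumptions for illustration and are not taken from the repository's verifier.

```python
import re
from typing import Optional

# Optional sign directly attached to the digits, optional thousands
# separators, optional decimal part. (Illustrative pattern only.)
NUMBER_PATTERN = re.compile(r"[+-]?\d[\d,]*(?:\.\d+)?")


def extract_last_number(text: str) -> Optional[str]:
    """Return the last number in `text`, keeping an explicit + or - sign."""
    matches = NUMBER_PATTERN.findall(text)
    if not matches:
        return None
    # Strip thousands separators but leave the sign untouched.
    return matches[-1].replace(",", "")


plain = extract_last_number("3 - 5 = -2")
boxed = extract_last_number(r"So the answer is \boxed{-12}")
```

A sign separated from the digits by a space, as in `3 - 5`, is treated as an operator, while `-2` and the boxed `-12` keep their signs.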
Removed
- Deletes some commented out code (#1537).