Release v1.1.0-rc0 · arcee-ai/NeMo-RL

Since v1.0.1:

Breaking Changes:

Megatron removed - no need for the megatron_cfg block anymore.
Legacy environments removed.
Legacy eval harness removed.
Dataset options other than dataset.shuffle have been removed.
"Max rollout turns" config option has been removed - implement this in your verifiers environments.
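
A minimal sketch of cleaning an old config before upgrading, in case it helps. Only megatron_cfg and dataset.shuffle are named in these notes. The helper name migrate_config, the max_rollout_turns key name, and the assumption that dataset is a top-level block are all hypothetical and should be checked against your own config files.

```python
import copy


def migrate_config(cfg, max_turns_key="max_rollout_turns"):
    """Return (new_cfg, removed_paths) with options removed in this release stripped.

    Assumptions (not confirmed by the release notes):
      * megatron_cfg may appear at any nesting depth.
      * the "max rollout turns" option is spelled `max_turns_key`.
      * dataset options live in a top-level `dataset` block.
    """
    cfg = copy.deepcopy(cfg)  # never mutate the caller's config
    removed = []

    def strip(node, path):
        if not isinstance(node, dict):
            return
        for key in list(node):
            here = f"{path}.{key}" if path else key
            if key in ("megatron_cfg", max_turns_key):
                del node[key]
                removed.append(here)
            else:
                strip(node[key], here)

    strip(cfg, "")

    # Keep only dataset.shuffle; every other dataset option is dropped.
    dataset = cfg.get("dataset")
    if isinstance(dataset, dict):
        for key in list(dataset):
            if key != "shuffle":
                del dataset[key]
                removed.append(f"dataset.{key}")

    return cfg, removed


old = {
    "policy": {"megatron_cfg": {"enabled": False}, "optimizer": {"name": "adamw"}},
    "grpo": {"max_rollout_turns": 4},
    "dataset": {"shuffle": True, "seed": 1},
}
new, removed = migrate_config(old)
print(removed)
```

The list of removed paths is returned so the dropped options can be reviewed; any turn limit that was dropped still has to be re-implemented inside the verifiers environment itself.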

Changelog:

Added grpo.interleave_rollouts. Set it to true to generate the next step's rollouts while training on the current step's data. This makes training one step off-policy, so consider enabling importance sampling to compensate.
Added checkpointing.hf_checkpoint. Set it to true to checkpoint directly to HF (slower than DCP).
Added new training path: examples/run_sft.py. See examples/configs/sft/afm_pocket_sft.yaml for full configuration.
Added support for Muon via dion. To use it, specify dion.MuonReference as your optimizer, and set policy.optimizer.scalar_optim to adamw to handle the parameters Muon does not apply to.
Renamed project to "RLKit".
Removed DPO training path.
Legacy evaluation and rollout-generation system removed.
Fixed a bug where train/approx_entropy would include entropy from masked-off tokens with no generation logprobs, causing NaNs to appear.
Fixed crash affecting sequence packing when responses are truncated.
Trust vLLM's tokenization over HuggingFace's, avoiding some off-policy training.
HF checkpointing is now required on systems where DCP checkpointing would fail due to PCIe communication issues.
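
To illustrate what interleaved rollouts mean for policy staleness, here is a toy sketch and not the project's implementation. generate_rollouts, train_step, and run_interleaved are hypothetical stand-ins; the "policy" is just an integer version counter so the one-step lag is visible.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_rollouts(policy_version, step):
    # Hypothetical stand-in for generation: records which weights produced the data.
    return {"step": step, "generated_with": policy_version}


def train_step(policy_version, batch):
    # Hypothetical stand-in for an optimizer update: bumps the weights version.
    return policy_version + 1


def run_interleaved(num_steps):
    """Train on step t's rollouts while step t+1's rollouts are generated."""
    policy_version = 0
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(generate_rollouts, policy_version, 0)
        for step in range(num_steps):
            batch = pending.result()
            if step + 1 < num_steps:
                # Launch the next generation with the weights as they are
                # BEFORE this step's update, which is what makes it off-policy.
                pending = pool.submit(generate_rollouts, policy_version, step + 1)
            log.append((step, batch["generated_with"], policy_version))
            policy_version = train_step(policy_version, batch)
    return log


for step, generated_with, trained_at in run_interleaved(3):
    print(f"step {step}: data from v{generated_with}, trained at v{trained_at}")
```

After the first step, each batch was generated by weights one version behind the weights being trained, which is the gap that importance sampling is meant to correct for.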

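The approx_entropy fix can be pictured with a small sketch. It assumes the metric is the mean negative log-probability of generated tokens, which these notes do not confirm, and the function names are hypothetical. The point is that multiplying by a zero mask does not neutralize a NaN, so masked-off positions have to be excluded before the arithmetic.

```python
import math


def approx_entropy_naive(logprobs, mask):
    # Multiplying by the mask keeps NaNs alive: nan * 0 is still nan.
    total = sum(-lp * m for lp, m in zip(logprobs, mask))
    return total / sum(mask)


def approx_entropy_masked(logprobs, mask):
    # Select unmasked tokens first, so missing logprobs never enter the sum.
    kept = [-lp for lp, m in zip(logprobs, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# The middle token is masked off and has no generation logprob (NaN).
logprobs = [-0.5, float("nan"), -1.5]
mask = [1, 0, 1]

print(math.isnan(approx_entropy_naive(logprobs, mask)))
print(approx_entropy_masked(logprobs, mask))
```
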