v1.1.0-rc0
Pre-release
Since v1.0.1:
Breaking Changes:
- Megatron removed - no need for the `megatron_cfg` block anymore.
- Legacy environments removed
- Legacy eval harness removed
- Dataset options other than `dataset.shuffle` have been removed.
- "Max rollout turns" config option has been removed - implement this in your verifiers environments.
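With the "max rollout turns" option gone, any turn cap has to live in the environment itself. The sketch below shows the idea in a library-agnostic way. `TurnCappedEnv`, `max_turns`, `step`, and `is_done` are hypothetical names chosen for illustration and are not the verifiers API, so adapt the logic to whatever base class your environment actually extends.

```python
from dataclasses import dataclass, field


@dataclass
class TurnCappedEnv:
    """Hypothetical environment that ends a rollout after max_turns turns."""

    max_turns: int = 4  # assumed default; pick what your task needs
    history: list = field(default_factory=list)

    def step(self, model_message: str) -> str:
        # Record the model's turn and return a placeholder environment reply.
        self.history.append(model_message)
        return f"env reply {len(self.history)}"

    def is_done(self) -> bool:
        # The rollout stops once the model has taken max_turns turns.
        return len(self.history) >= self.max_turns


env = TurnCappedEnv(max_turns=2)
turns = 0
while not env.is_done():
    env.step(f"model turn {turns}")
    turns += 1
```

Keeping the cap inside the environment means each task can choose its own limit instead of sharing one global setting.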
Changelog:
- Added `grpo.interleave_rolluts`. Set it to `true` to run one step off-policy (consider enabling importance sampling to compensate) and generate the next step's rollouts while you train on the current step's data.
- Added `checkpointing.hf_checkpoint`. Set it to `true` to checkpoint directly to HF (slower than DCP).
- Added new training path: `examples/run_sft.py`. See `examples/configs/sft/afm_pocket_sft.yaml` for full configuration.
- Added support for Muon via `dion`. To use it, specify `dion.MuonReference` as your optimizer, and specify `policy.optimizer.scalar_optim` as `adamw` for parameters that Muon does not apply to.
- Renamed project to "RLKit".
- Removed DPO training path.
- Legacy evaluation and rollout-generation system removed.
- Fixed a bug where `train/approx_entropy` would include entropy from masked-off tokens with no generation logprobs, causing NaNs to appear.
- Fixed crash affecting sequence packing when responses are truncated.
- Trust vLLM's tokenization over HuggingFace's, avoiding some off-policy training.
- Required HF checkpointing on systems where DCP checkpointing would fail due to PCIe comms issues.
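When `grpo.interleave_rolluts` is enabled, each batch was sampled by a policy one step older than the one being trained, which is why the changelog suggests importance sampling. As a rough illustration of the general technique (not RLKit's implementation), a per-token importance weight is the ratio of the token's probability under the training policy to its probability under the rollout policy, often truncated to limit variance. The function name and the cap value below are hypothetical.

```python
import math


def importance_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token weights pi_train / pi_rollout, truncated at `cap`.

    Hypothetical helper: the name and the cap value are illustrative only.
    """
    weights = []
    for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs):
        ratio = math.exp(lp_train - lp_rollout)
        weights.append(min(ratio, cap))
    return weights


# Log-probs of the sampled tokens under the current (training) policy and
# under the one-step-stale policy that generated them.
train_lp = [-1.0, -0.5, -2.0]
rollout_lp = [-1.0, -1.5, -1.0]
weights = importance_weights(train_lp, rollout_lp)
```

Tokens the two policies agree on get a weight of 1, while tokens the training policy now favors much more strongly are held at the cap.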
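The `train/approx_entropy` fix touches a common pitfall: if masked-off tokens carry NaN logprobs, multiplying them by a zero mask still yields NaN, so they have to be excluded before the reduction. The sketch below assumes entropy is approximated as the mean negative log-probability of the sampled tokens; that definition and the `approx_entropy` helper are assumptions made for illustration, not the project's actual code.

```python
import math


def approx_entropy(logprobs, mask):
    """Mean of -logprob over tokens where mask is 1.

    Assumption: entropy is approximated by the negative log-probability of
    the sampled tokens. Masked tokens are skipped rather than multiplied by
    zero, because NaN * 0 is still NaN.
    """
    kept = [-lp for lp, m in zip(logprobs, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


nan = float("nan")
logprobs = [-0.5, nan, -1.5, nan]  # NaN where no generation logprob exists
mask = [1, 0, 1, 0]

# Naive masking by multiplication lets the NaNs leak into the metric.
naive = sum(-lp * m for lp, m in zip(logprobs, mask)) / sum(mask)
# Selecting unmasked tokens first keeps the metric finite.
fixed = approx_entropy(logprobs, mask)
```

In tensor code the same effect is usually achieved by selecting or replacing masked entries before summing, instead of relying on multiplication by the mask.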