v1.1.0-rc0

Pre-release

@AutumnAurelium AutumnAurelium released this 15 Oct 03:23
· 10 commits to main since this release
88828a0

Since v1.0.1:

Breaking Changes:

  • Megatron removed - no need for the megatron_cfg block anymore.
  • Legacy environments removed
  • Legacy eval harness removed
  • Dataset options other than dataset.shuffle have been removed.
  • "Max rollout turns" config option has been removed - implement this in your verifiers environments.

Changelog:

  • Added grpo.interleave_rollouts. Set it to true to generate the next step's rollouts while training on the current step's data. This makes training one step off-policy, so consider enabling importance sampling to compensate.
  • Added checkpointing.hf_checkpoint. Set it to true to checkpoint directly to HF (slower than DCP).
  • Added new training path: examples/run_sft.py. See examples/configs/sft/afm_pocket_sft.yaml for full configuration.
  • Added support for Muon via dion. To use it, specify dion.MuonReference as your optimizer and set policy.optimizer.scalar_optim to adamw for the parameters Muon does not apply to.
  • Renamed project to "RLKit".
  • Removed DPO training path.
  • Legacy evaluation and rollout-generation system removed.
  • Fixed a bug where train/approx_entropy would include entropy from masked-off tokens with no generation logprobs, causing NaNs to appear.
  • Fixed crash affecting sequence packing when responses are truncated.
  • Trust vLLM's tokenization over HuggingFace's, avoiding some off-policy training caused by mismatches between the two tokenizations.
  • Required HF checkpointing on systems where DCP checkpointing would fail due to PCIe comms issues.
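
The train/approx_entropy fix above can be illustrated with a minimal sketch. It assumes the metric is a mean negative log-probability over generated tokens and that masked-off positions carry a NaN placeholder. Both are assumptions for illustration, and the function names are hypothetical, not RLKit's actual code. The point is that multiplying by a zero mask does not remove a NaN, so masked-off tokens have to be dropped before any arithmetic.

```python
import math

def masked_approx_entropy(logprobs, mask):
    """Mean negative log-probability over tokens where mask is 1.

    Hypothetical helper: masked-off tokens are filtered out before any
    arithmetic, so a NaN placeholder there cannot reach the result.
    """
    kept = [lp for lp, m in zip(logprobs, mask) if m]
    if not kept:
        return 0.0  # assumption: report 0 when every token is masked off
    return -sum(kept) / len(kept)

def naive_approx_entropy(logprobs, mask):
    """Buggy variant for comparison: NaN * 0 is still NaN."""
    total = sum(lp * m for lp, m in zip(logprobs, mask))
    return -total / max(sum(mask), 1)

# The last token is masked off and has no generation logprob.
logprobs = [-0.5, -1.5, float("nan")]
mask = [1, 1, 0]

print(masked_approx_entropy(logprobs, mask))
print(math.isnan(naive_approx_entropy(logprobs, mask)))
```

The same reasoning applies to tensor code: selecting the unmasked entries (or replacing masked entries with a finite value) before reducing avoids propagating NaNs, whereas a plain multiply-by-mask does not.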