
[Bug] Multiple stability issues in RLVR pipeline: LoRA synchronization failure and recovery crash #436

@Wangxiaoxiaoa

Description

Describe the bug

During large-scale reinforcement learning with verifiable rewards (RLVR) training, we identified several critical issues that affect the stability and correctness of the ROLL framework:

  1. LoRA Weight Inconsistency: When training with LoRA adapters, parameter updates are not correctly gathered and broadcast from the training nodes to the rollout workers, so the inference phase serves stale weights that lack the latest LoRA updates (see the sync sketch after this list).
  2. State Recovery Failure: When training is resumed, DynamicSamplingScheduler calls get_next_dataset_item() from __init__ before self.dataset_iter has been assigned, crashing with an AttributeError (traceback under Logs).
  3. DeepSpeed Group Initialization: DeepSpeed initialization fails when it receives an empty parameter group, which is common when base layers are frozen for LoRA (error under Logs).
  4. Ray Metadata Overflow: Long-running sessions can exhaust the system's /tmp partition, and there is currently no way to redirect Ray's temporary directory (see the note after Environment).
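
For issue 1, the missing step is, roughly, a gather-then-broadcast of the updated parameters after each optimizer step. A minimal sketch follows; the function name, the process group spanning trainer and rollout ranks, and the use of DeepSpeed's gather context are all assumptions, not ROLL's actual code:

```python
import deepspeed
import torch.distributed as dist

def sync_weights_to_rollout(model, src_rank: int = 0, group=None):
    """Hypothetical helper, not ROLL's actual API: broadcast the trained
    weights (base + LoRA) to the rollout workers after each update.

    Under ZeRO-3 each training rank owns only a shard of every parameter,
    so tensors must be gathered to full size before they can be broadcast.
    """
    for name, param in sorted(model.named_parameters()):
        # Gather the full tensor on all ranks (no-op outside ZeRO-3).
        with deepspeed.zero.GatheredParameters([param], modifier_rank=None):
            # `group` is assumed to span both trainer and rollout ranks.
            dist.broadcast(param.data, src=src_rank, group=group)
```

If the rollout side holds a plain (non-PEFT) model, the LoRA deltas would additionally need to be merged into the base weights first, e.g. via peft's merge_and_unload().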

Logs

(DynamicSamplingScheduler) Traceback (most recent call last):
  File ".../roll/distributed/scheduler/generate_scheduler.py", line 478, in __init__
    self.get_next_dataset_item()
  File ".../roll/distributed/scheduler/generate_scheduler.py", line 727, in get_next_dataset_item
    if self.dataset_iter is None:
AttributeError: 'DynamicSamplingScheduler' object has no attribute 'dataset_iter'
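
For issue 2, the expected fix is to define the attribute before anything can read it. A minimal sketch; only the class, method, and attribute names come from the traceback, the rest is assumed:

```python
class DynamicSamplingScheduler:
    def __init__(self, dataset):
        self.dataset = dataset
        # Assign dataset_iter before any call can reach it, so a resume
        # path that triggers get_next_dataset_item() early sees None
        # instead of raising AttributeError.
        self.dataset_iter = None
        self.get_next_dataset_item()

    def get_next_dataset_item(self):
        # Lazily build the iterator on first use.
        if self.dataset_iter is None:
            self.dataset_iter = iter(self.dataset)
        return next(self.dataset_iter)
```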

DeepSpeed Error:
ValueError: optimizer got an empty parameter list
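
For issue 3, dropping empty groups before constructing the optimizer avoids the crash. A hedged sketch; the decay/no-decay split is illustrative, not ROLL's actual grouping logic:

```python
def build_param_groups(model, weight_decay: float = 0.01):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:  # base weights frozen under LoRA
            continue
        (no_decay if name.endswith(".bias") else decay).append(param)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # With aggressive freezing a group can end up empty, which is what
    # trips the "optimizer got an empty parameter list" ValueError.
    return [g for g in groups if g["params"]]
```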

Environment:

  • Hardware: NVIDIA H200 Cluster
  • Backend: DeepSpeed + Ray
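
For issue 4, Ray itself can be pointed at a different temporary directory; what is missing is a way to set this from the framework's configuration. A minimal sketch, assuming the driver calls ray.init directly (the path is an example, not from this setup):

```python
import ray

# Redirect Ray's session files (logs, sockets, object spilling) away from
# the small /tmp partition.
ray.init(_temp_dir="/data/ray_tmp")
```

The CLI equivalent is ray start --head --temp-dir /data/ray_tmp.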
