
[Bug] Multiple stability issues in RLVR pipeline: LoRA synchronization failure and recovery crash #436

@Wangxiaoxiaoa

Description

Describe the bug

During large-scale reinforcement learning with verifiable rewards (RLVR) training, we identified several critical issues that affect the stability and correctness of the ROLL framework:

  1. LoRA Weight Inconsistency: When training with LoRA adapters, parameter updates are not correctly gathered and broadcast from the training nodes to the rollout workers, so the inference phase serves stale weights that lack the latest LoRA updates (see the sync sketch after this list).
  2. State Recovery Failure: When training is resumed, DynamicSamplingScheduler calls get_next_dataset_item() from __init__ before self.dataset_iter has been assigned, crashing with an AttributeError (traceback under Logs).
  3. DeepSpeed Group Initialization: DeepSpeed initialization fails when it receives an empty parameter group, which is common when base layers are frozen for LoRA (error under Logs).
  4. Ray Metadata Overflow: Long-running sessions can exhaust the system's /tmp partition, and there is currently no way to redirect Ray's temporary directory (see the note after Environment).
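
For issue 1, the missing step is, roughly, a gather-then-broadcast of the updated parameters after each optimizer step. A minimal sketch follows; the function name, the process group spanning trainer and rollout ranks, and the use of DeepSpeed's gather context are all assumptions, not ROLL's actual code:

```python
import deepspeed
import torch.distributed as dist

def sync_weights_to_rollout(model, src_rank: int = 0, group=None):
    """Hypothetical helper, not ROLL's actual API: broadcast the trained
    weights (base + LoRA) to the rollout workers after each update.

    Under ZeRO-3 each training rank owns only a shard of every parameter,
    so tensors must be gathered to full size before they can be broadcast.
    """
    for name, param in sorted(model.named_parameters()):
        # Gather the full tensor on all ranks (no-op outside ZeRO-3).
        with deepspeed.zero.GatheredParameters([param], modifier_rank=None):
            # `group` is assumed to span both trainer and rollout ranks.
            dist.broadcast(param.data, src=src_rank, group=group)
```

If the rollout side holds a plain (non-PEFT) model, the LoRA deltas would additionally need to be merged into the base weights first, e.g. via peft's merge_and_unload().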

Logs

(DynamicSamplingScheduler) Traceback (most recent call last):
  File ".../roll/distributed/scheduler/generate_scheduler.py", line 478, in __init__
    self.get_next_dataset_item()
  File ".../roll/distributed/scheduler/generate_scheduler.py", line 727, in get_next_dataset_item
    if self.dataset_iter is None:
AttributeError: 'DynamicSamplingScheduler' object has no attribute 'dataset_iter'
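
For issue 2, the expected fix is to define the attribute before anything can read it. A minimal sketch; only the class, method, and attribute names come from the traceback, the rest is assumed:

```python
class DynamicSamplingScheduler:
    def __init__(self, dataset):
        self.dataset = dataset
        # Assign dataset_iter before any call can reach it, so a resume
        # path that triggers get_next_dataset_item() early sees None
        # instead of raising AttributeError.
        self.dataset_iter = None
        self.get_next_dataset_item()

    def get_next_dataset_item(self):
        # Lazily build the iterator on first use.
        if self.dataset_iter is None:
            self.dataset_iter = iter(self.dataset)
        return next(self.dataset_iter)
```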

DeepSpeed Error:
ValueError: optimizer got an empty parameter list
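
For issue 3, dropping empty groups before constructing the optimizer avoids the crash. A hedged sketch; the decay/no-decay split is illustrative, not ROLL's actual grouping logic:

```python
def build_param_groups(model, weight_decay: float = 0.01):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:  # base weights frozen under LoRA
            continue
        (no_decay if name.endswith(".bias") else decay).append(param)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # With aggressive freezing a group can end up empty, which is what
    # trips the "optimizer got an empty parameter list" ValueError.
    return [g for g in groups if g["params"]]
```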

Environment:

  • Hardware: NVIDIA H200 Cluster
  • Backend: DeepSpeed + Ray
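
For issue 4, Ray itself can be pointed at a different temporary directory; what is missing is a way to set this from the framework's configuration. A minimal sketch, assuming the driver calls ray.init directly (the path is an example, not from this setup):

```python
import ray

# Redirect Ray's session files (logs, sockets, object spilling) away from
# the small /tmp partition.
ray.init(_temp_dir="/data/ray_tmp")
```

The CLI equivalent is ray start --head --temp-dir /data/ray_tmp.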
