
LR scheduler exhausts early in agentic training with AgentNativeStepEnvManager #407

@shamanez

Bug Description

When using AgentNativeStepEnvManager (step-level env manager) for agentic training, the LR scheduler exhausts its step budget far before all pipeline steps complete, causing the learning rate to drop to zero mid-training.

In a 200-step training run with lr_scheduler_type: "linear", the LR reached zero at pipeline step 123 — meaning 38.5% of training happened with zero learning rate and no learning.

Root Cause

PPOConfig.set_max_steps() computes the total optimizer steps for the LR scheduler using rollout_batch_size (number of trajectories):

https://github.com/alibaba/ROLL/blob/main/roll/configs/base_config.py#L701-L718

self.actor_train.training_args.max_steps = max(1, (
    max_steps
    * self.rollout_batch_size              # trajectories per rollout
    * self.actor_infer.generating_args.num_return_sequences
    * self.ppo_epochs
    // actor_backward_batch_size
))

With the config in agent_val_rock_swe_qwen35_2b.yaml:

max_steps=200, rollout_batch_size=4, num_return_sequences=1, ppo_epochs=1, backward_batch_size=4
→ scheduler total = 200 * 4 * 1 * 1 // 4 = 200 optimizer steps
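The arithmetic can be checked in isolation (a standalone sketch mirroring the `set_max_steps` formula, not ROLL code):

```python
# Recompute the scheduler budget from the values quoted above
# (agent_val_rock_swe_qwen35_2b.yaml), mirroring PPOConfig.set_max_steps().
max_steps = 200
rollout_batch_size = 4        # trajectories per rollout
num_return_sequences = 1
ppo_epochs = 1
actor_backward_batch_size = 4

scheduler_total = max(1, (
    max_steps
    * rollout_batch_size
    * num_return_sequences
    * ppo_epochs
    // actor_backward_batch_size
))
print(scheduler_total)  # → 200
```

So the scheduler assumes exactly one optimizer step per pipeline step.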

But the training batch contains chunks (one per agent turn), not trajectories. AgentNativeStepEnvManager.formulate_rollouts() creates one training sample per turn:

https://github.com/alibaba/ROLL/blob/main/roll/pipeline/agentic/env_manager/agent_native_env_manager.py#L248

for step, history in enumerate(rollout_cache.history):
    # ... one DataProto per turn ...
    samples.append(lm_input)
batch = DataProto.concat(samples)  # all turns as separate samples

So with 4 trajectories × ~10 turns each = ~40 training samples per pipeline step. With backward_batch_size=4, that's ~10 optimizer steps per pipeline step — not the 1 that the scheduler was budgeted for.
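The same arithmetic with per-turn samples shows the mismatch (the ~10 turns per trajectory is an assumption taken from the run, not a fixed config value):

```python
# Actual optimizer steps per pipeline step when the batch holds one
# training sample per agent turn rather than one per trajectory.
rollout_batch_size = 4
avg_turns_per_traj = 10   # assumed average; varies per rollout
backward_batch_size = 4

samples_per_pipeline_step = rollout_batch_size * avg_turns_per_traj
optimizer_steps = samples_per_pipeline_step // backward_batch_size
print(samples_per_pipeline_step, optimizer_steps)  # → 40 10
```

Ten optimizer steps per pipeline step against a budget of one exhausts a 200-step schedule roughly ten times too fast.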

Additionally, batch_adjust_mode: "random_sample" has a bifurcation that makes this worse:

  • When total_chunks % backward_batch_size == 0: keeps ALL chunks → many optimizer steps
  • When not divisible: subsamples to exactly backward_batch_size → 1 optimizer step

https://github.com/alibaba/ROLL/blob/main/roll/pipeline/agentic/agentic_pipeline.py (search for adjust_batch)

This creates wildly inconsistent optimizer step counts per pipeline step, making the scheduler exhaustion unpredictable.
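The bifurcation can be illustrated with a toy function (a hypothetical sketch of the described behavior, not ROLL's actual `adjust_batch` code):

```python
# Sketch of the divisibility bifurcation described above for
# batch_adjust_mode "random_sample".
def optimizer_steps_after_adjust(total_chunks: int, backward_batch_size: int) -> int:
    if total_chunks % backward_batch_size == 0:
        kept = total_chunks            # divisible: keep ALL chunks
    else:
        kept = backward_batch_size     # not divisible: subsample to one batch
    return kept // backward_batch_size

print(optimizer_steps_after_adjust(40, 4))  # divisible → 10 optimizer steps
print(optimizer_steps_after_adjust(39, 4))  # not divisible → 1 optimizer step
```

A one-chunk difference in turn count changes the optimizer-step count by 10×, which is why the exhaustion point is unpredictable.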

Evidence from Training Run

wandb run: https://wandb.ai/shamanework-pl/roll-agentic/runs/gvoe0mq8

Config: openreward_endless_terminals_IPA_qwen35_2b.yaml (same architecture, rollout_batch_size=16, backward_batch_size=16)

| Pipeline Step | Backward Steps | Cumulative Optimizer Steps | LR |
|---:|---:|---:|---:|
| 0 | 1 | 1 | 9.95e-7 |
| 2 | 15 | 17 | 9.15e-7 |
| 21 | 20 | 55 | 7.25e-7 |
| 64 | 11 | 118 | 4.10e-7 |
| 108 | 12 | 178 | 1.10e-7 |
| 123 | 6 | 205 | 0.00 |
| 199 | 1 | 313 | 0.00 |

Total: 313 optimizer steps across 200 pipeline steps, but scheduler budgeted for 200. LR hit zero at step 123.
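The logged LRs are consistent with a plain linear schedule over the 200-step budget and a base LR of 1e-6 (both assumed; warmup ignored). A minimal sketch:

```python
# Linear decay clamped at zero: LR vanishes once the cumulative optimizer
# step count passes the budgeted total.
def linear_lr(step: int, total: int, base_lr: float = 1e-6) -> float:
    return base_lr * max(0.0, 1.0 - step / total)

print(linear_lr(1, 200))    # → 9.95e-07, matches the table's first row
print(linear_lr(205, 200))  # → 0.0, pipeline step 123 in the table
print(linear_lr(313, 200))  # → 0.0, still zero at the final pipeline step
```

Everything past cumulative step 200 trains at zero LR, so the last 77 pipeline steps did no learning.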

Affected Configs

Any agentic config using AgentNativeStepEnvManager with a decaying LR scheduler (linear, cosine, etc.):

  • examples/agentic_demo/agent_val_rock_swe_qwen35_2b.yaml
  • Any similar agentic training config

Suggested Fix

Option A — Use constant LR (simplest, no code change):

actor_train:
  training_args:
    lr_scheduler_type: "constant_with_warmup"
    warmup_steps: 10
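Why this works, sketched with an assumed base LR of 1e-6: after warmup the LR never decays, so overshooting the budgeted step count cannot drive it to zero.

```python
# Sketch of a constant_with_warmup schedule (hypothetical helper, not the
# HF/ROLL implementation; base LR and warmup_steps=10 are assumptions).
def constant_with_warmup_lr(step: int, warmup_steps: int = 10, base_lr: float = 1e-6) -> float:
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear ramp during warmup
    return base_lr                            # constant forever after

print(constant_with_warmup_lr(5))    # mid-warmup: 5e-07
print(constant_with_warmup_lr(313))  # far past the 200-step budget: 1e-06
```

The trade-off is losing LR decay entirely; Option B keeps a decaying schedule.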

Option B — Fix set_max_steps for agentic training (proper fix):

Override set_max_steps in AgenticConfig to account for the actual number of chunks per trajectory:

# In AgenticConfig:
def set_max_steps(self, max_steps: int):
    actor_backward_batch_size = (
        self.actor_train.training_args.per_device_train_batch_size
        * self.actor_train.training_args.gradient_accumulation_steps
    )
    # Estimate chunks per trajectory (each turn = 1 training sample)
    estimated_avg_turns = self.max_actions_per_traj // 2  # conservative midpoint
    self.actor_train.training_args.max_steps = max(1, (
        max_steps
        * self.rollout_batch_size
        * self.actor_infer.generating_args.num_return_sequences
        * estimated_avg_turns
        * self.ppo_epochs
        // actor_backward_batch_size
    ))

Option C — Fix batch_adjust_mode (complementary):

Change random_sample to always subsample down to exactly backward_batch_size (in the divisible case too), so each pipeline step produces exactly 1 optimizer step and matches the current set_max_steps formula. Note that rounding down to a multiple of backward_batch_size (e.g. "delete" mode) makes the count consistent per step but still greater than 1, so that variant would still require Option B.
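A sketch of the deterministic adjustment (hypothetical helper, not ROLL's `adjust_batch`): always keep exactly one backward batch, so every pipeline step contributes exactly one optimizer step.

```python
import random

# Subsample any oversized chunk list down to one backward batch so the
# optimizer-step count per pipeline step is always 1.
def adjust_to_one_batch(samples: list, backward_batch_size: int) -> list:
    if len(samples) <= backward_batch_size:
        return samples
    return random.sample(samples, backward_batch_size)

batch = adjust_to_one_batch(list(range(40)), 4)
print(len(batch) // 4)  # → 1 optimizer step, regardless of turn count
```

This discards most per-turn samples, so Option B (budgeting for the real chunk count) preserves more data per rollout.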

Environment

  • ROLL version: main branch
  • Model: Qwen3.5-2B
  • Environment: SWE-bench / OpenReward EndlessTerminals
  • GPUs: 8× (TP=2, CP=2 for training, 8× vLLM inference)

/cc @shamanez
