Description
Bug
The reward normalization config in agent_val_rock_swe.yaml (L130) produces all-zero advantages, meaning no gradient signal reaches the policy.
```yaml
reward_normalization:
  grouping: traj_group_id
  method: mean
```
Why advantages are always zero
- SWE-bench gives terminal-only reward: `RockTBNativeEnv.step()` returns `self.reward = 0` for every intermediate step. `calculate_reward()` is only called in `check_terminated()`, producing `[0, 0, ..., 0, R]` where `R ∈ {0, 1}`.
- `compute_discounted_returns` with `gamma=1.0` propagates the terminal reward backwards uniformly: every step gets `step_rewards = R` (all identical).
- `agentic_reward_norm` with `grouping: traj_group_id, method: mean`: `method: mean` resolves to `norm_mean_type: "group", norm_std_type: None`, so it computes `score - group_mean`. Since all steps in a trajectory share the same `traj_group_id` and the same value `R`, the result is `R - R = 0` for every step.
- `group_size: 1` means each `traj_group_id` contains exactly one trajectory, so there is no cross-trajectory contrast either.
- The loss is `−log_probs × advantages = 0` → no gradients, no learning.
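The chain above can be reproduced in a few lines. This is a minimal sketch, not the repo's actual implementation: `compute_discounted_returns` and `group_mean_normalize` are illustrative stand-ins for the functions named above.

```python
import numpy as np

def compute_discounted_returns(rewards, gamma=1.0):
    # G_t = r_t + gamma * G_{t+1}; with gamma=1.0 the terminal reward
    # propagates backwards unchanged to every step.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def group_mean_normalize(returns):
    # grouping: traj_group_id + method: mean -> score - group_mean.
    # With group_size=1 the "group" is this single trajectory.
    return returns - returns.mean()

# Terminal-only SWE-bench reward: [0, 0, ..., 0, R]
rewards = np.array([0.0, 0.0, 0.0, 1.0])
returns = compute_discounted_returns(rewards, gamma=1.0)   # [1, 1, 1, 1]
advantages = group_mean_normalize(returns)                 # [0, 0, 0, 0]

log_probs = np.array([-0.5, -1.2, -0.3, -0.8])
loss = -(log_probs * advantages).sum()                     # 0.0, no gradient
```

Whatever `R` is, every step of the trajectory gets the same return, so subtracting the within-trajectory mean always yields zero.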
Possible fixes
- `grouping: batch` — normalize across the whole batch so successful vs failed trajectories contrast each other
- `method: identity` — skip normalization, use raw rewards directly
- `group_size: N` (N > 1) — collect multiple trajectories per prompt for GRPO-style contrast
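As YAML, the three options might look like the fragments below. These are sketches mirroring the keys from the snippet above; the exact option names accepted by the config schema are an assumption.

```yaml
# Option 1: contrast successes against failures across the whole batch
reward_normalization:
  grouping: batch
  method: mean

# Option 2: skip normalization, feed raw returns through as advantages
reward_normalization:
  grouping: traj_group_id
  method: identity

# Option 3: keep group-mean normalization, but sample multiple
# trajectories per prompt so the group mean is informative (GRPO-style)
group_size: 8
```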
Question
Is this example config intended to reflect the training setup from the Let It Flow paper / ROME model? If so, did the actual ROME training use a different `reward_normalization` config (e.g. `grouping: batch`, or the IPA-based chunk-level credit assignment described in the paper)?