reward_normalization in agent_val_rock_swe.yaml produces zero advantages for SWE-bench #397

@shamanez

Bug

The reward normalization config in agent_val_rock_swe.yaml (L130) produces all-zero advantages, meaning no gradient signal reaches the policy.

reward_normalization:
  grouping: traj_group_id
  method: mean

Why advantages are always zero

  1. SWE-bench gives a terminal-only reward: RockTBNativeEnv.step() returns self.reward = 0 for every intermediate step. calculate_reward() is only called in check_terminated(), so the per-step rewards are [0, 0, ..., 0, R] where R ∈ {0, 1}.

  2. compute_discounted_returns with gamma=1.0 propagates the terminal reward backwards uniformly: every step gets step_rewards = R (all identical).

  3. agentic_reward_norm is called with grouping: traj_group_id and method: mean. The mean method resolves to norm_mean_type: "group", norm_std_type: None, so it computes score - group_mean. Since all steps in a trajectory share the same traj_group_id and the same value R, the group mean is R and the result is R - R = 0 for every step.

  4. group_size: 1 means each traj_group_id contains exactly one trajectory, so there's no cross-trajectory contrast either.

  5. The policy-gradient loss, −log_probs × advantages, is therefore 0 at every step: no gradients, no learning.
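The chain of steps above can be reproduced in a few lines of NumPy. This is a minimal sketch, not the repository's code: `discounted_returns` and `group_mean_normalize` are hypothetical stand-ins for compute_discounted_returns and agentic_reward_norm, written only from the behavior described in this issue.

```python
import numpy as np

def discounted_returns(rewards, gamma=1.0):
    """Return-to-go for each step (hypothetical stand-in, not the repo's function)."""
    returns = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def group_mean_normalize(scores, group_ids):
    """Compute score - group_mean per group (assumed meaning of method: mean)."""
    scores = np.asarray(scores, dtype=float)
    group_ids = np.asarray(group_ids)
    out = np.empty_like(scores)
    for g in np.unique(group_ids):
        mask = group_ids == g
        out[mask] = scores[mask] - scores[mask].mean()
    return out

# One successful trajectory: terminal-only reward [0, 0, 0, R] with R = 1.
step_rewards = [0, 0, 0, 1]
returns = discounted_returns(step_rewards, gamma=1.0)

# group_size: 1, so every step of this trajectory is alone in its traj_group_id.
advantages = group_mean_normalize(returns, ["traj-0"] * len(step_rewards))

# A failed trajectory (R = 0) ends up in the same place.
failed_advantages = group_mean_normalize(
    discounted_returns([0, 0, 0, 0]), ["traj-1"] * 4
)

print(returns)
print(advantages)
```

With gamma=1.0 every entry of `returns` equals R, so subtracting the per-trajectory mean leaves zeros whether the episode succeeded or failed.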

Possible fixes

  • Set grouping: batch to normalize across the whole batch, so that successful and failed trajectories contrast with each other
  • Set method: identity to skip normalization and use the raw rewards directly
  • Set group_size: N (N > 1) to collect multiple trajectories per prompt for GRPO-style contrast
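To compare the options above, here is a small sketch under the same assumptions as the issue text: per-step returns are already constant within a trajectory, and mean normalization subtracts the mean of whatever group a step belongs to. `mean_normalize` and the group labels are hypothetical, chosen for illustration only.

```python
import numpy as np

def mean_normalize(scores, group_ids):
    """Subtract each group's mean from its members (illustrative helper)."""
    scores = np.asarray(scores, dtype=float)
    group_ids = np.asarray(group_ids)
    out = np.empty_like(scores)
    for g in np.unique(group_ids):
        mask = group_ids == g
        out[mask] = scores[mask] - scores[mask].mean()
    return out

# Per-step returns for two trajectories: a 3-step success (R = 1)
# followed by a 2-step failure (R = 0).
returns = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

# Current config: one trajectory per group, so everything cancels.
per_traj_adv = mean_normalize(returns, ["t0", "t0", "t0", "t1", "t1"])

# Shared group: models grouping: batch, or group_size: 2 when both
# trajectories come from the same prompt.
shared_adv = mean_normalize(returns, ["g"] * len(returns))

# method: identity: raw returns are used as advantages.
identity_adv = returns.copy()

print(per_traj_adv)
print(shared_adv)
print(identity_adv)
```

In this sketch the shared-group mean is taken over steps, so longer trajectories weigh more; whether the real implementation averages per step or per trajectory is not stated in this issue. Note also that with identity, failed trajectories still contribute a zero advantage, so only successes produce a gradient.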

Question

Is this example config intended to reflect the training setup from the Let It Flow paper / ROME model? If so, was the actual ROME training using a different reward_normalization config (e.g. grouping: batch or the IPA-based chunk-level credit assignment described in the paper)?
