Description
Bug
The reward normalization config in agent_val_rock_swe.yaml (L130) produces all-zero advantages, meaning no gradient signal reaches the policy.
```yaml
reward_normalization:
  grouping: traj_group_id
  method: mean
```
Why advantages are always zero
- SWE-bench gives terminal-only reward: `RockTBNativeEnv.step()` returns `self.reward = 0` for every intermediate step. `calculate_reward()` is only called in `check_terminated()`, producing `[0, 0, ..., 0, R]` where `R ∈ {0, 1}`.
- `compute_discounted_returns` with `gamma=1.0` propagates the terminal reward backwards uniformly: every step gets `step_rewards = R` (all identical).
- `agentic_reward_norm` with `grouping: traj_group_id, method: mean`: `method: mean` resolves to `norm_mean_type: "group", norm_std_type: None`, so it computes `score - group_mean`. Since all steps in a trajectory share the same `traj_group_id` and the same value `R`, the result is `R - R = 0` for every step.
- `group_size: 1` means each `traj_group_id` contains exactly one trajectory, so there is no cross-trajectory contrast either.
- The loss is `−log_probs × advantages = 0` → no gradients, no learning.
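The chain above can be reproduced in a few lines. This is a minimal sketch, not the repo's actual implementation: `compute_discounted_returns` and `group_mean_normalize` are illustrative stand-ins for the functions named above.

```python
import numpy as np

def compute_discounted_returns(rewards, gamma=1.0):
    # G_t = r_t + gamma * G_{t+1}; with gamma=1.0 the terminal reward
    # propagates backwards unchanged to every step.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def group_mean_normalize(returns):
    # grouping: traj_group_id + method: mean -> score - group_mean.
    # With group_size=1 the "group" is this single trajectory.
    return returns - returns.mean()

# Terminal-only SWE-bench reward: [0, 0, ..., 0, R]
rewards = np.array([0.0, 0.0, 0.0, 1.0])
returns = compute_discounted_returns(rewards, gamma=1.0)   # [1, 1, 1, 1]
advantages = group_mean_normalize(returns)                 # [0, 0, 0, 0]

log_probs = np.array([-0.5, -1.2, -0.3, -0.8])
loss = -(log_probs * advantages).sum()                     # 0.0, no gradient
```

Whatever `R` is, every step of the trajectory gets the same return, so subtracting the within-trajectory mean always yields zero.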
Possible fixes
- `grouping: batch` — normalize across the whole batch so successful vs failed trajectories contrast each other
- `method: identity` — skip normalization, use raw rewards directly
- `group_size: N` (N > 1) — collect multiple trajectories per prompt for GRPO-style contrast
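As YAML, the three options might look like the fragments below. These are sketches mirroring the keys from the snippet above; the exact option names accepted by the config schema are an assumption.

```yaml
# Option 1: contrast successes against failures across the whole batch
reward_normalization:
  grouping: batch
  method: mean

# Option 2: skip normalization, feed raw returns through as advantages
reward_normalization:
  grouping: traj_group_id
  method: identity

# Option 3: keep group-mean normalization, but sample multiple
# trajectories per prompt so the group mean is informative (GRPO-style)
group_size: 8
```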
Question
Is this example config intended to reflect the training setup from the Let It Flow paper / ROME model? If so, did the actual ROME training use a different `reward_normalization` config (e.g. `grouping: batch`, or the IPA-based chunk-level credit assignment described in the paper)?