Confused about the extract function in rlvr math_rule_reward_worker.

In `roll/pipeline/rlvr/rewards/math_rule_reward_worker.py`, which is the default reward worker class for RLVR training on the math dataset, the `_extract_after_last_end_think` function requires that the response not contain the opening think tag `<think>`; otherwise the reward is 0. This seems unreasonable, and nothing in the math training dataset `data/math_deepmath_deal.jsonl` indicates that a response should omit the opening think tag. This can make RLVR training invalid. I suggest using the `last_boxed_only_string` function instead; for reference, see the implementation in [slime](https://github.com/THUDM/slime/blob/bbb67c083430adf8e44b39a98baa0cd3d10185cc/slime/rollout/rm_hub/math_utils.py#L418).
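To make the difference concrete, here is a minimal sketch. Both functions are my own illustrations: `strict_extract_sketch` is a hypothetical stand-in that only mimics the behaviour described above (it is not the actual ROLL code), and `last_boxed_sketch` is a simplified brace-matching extractor in the spirit of `last_boxed_only_string`, not a copy of the slime implementation.

```python
from typing import Optional


def strict_extract_sketch(response: str) -> Optional[str]:
    """Hypothetical stand-in for the behaviour described in this issue.

    Assumption: any response containing the opening <think> tag is rejected,
    which the reward worker would then score as 0.
    """
    if "<think>" in response:
        return None
    # Keep only the text after the last closing think tag, if there is one.
    return response.rsplit("</think>", 1)[-1].strip()


def last_boxed_sketch(text: str) -> Optional[str]:
    """Return the last \\boxed{...} span in text, or None if there is none.

    Simplified sketch: braces are matched by counting nesting depth, so
    nested LaTeX such as \\frac{8}{2} inside the box is kept intact.
    """
    start = text.rfind("\\boxed")
    if start < 0:
        return None
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # unbalanced braces: no complete boxed answer


response = "<think>2 + 2 = 4</think>The answer is \\boxed{\\frac{8}{2}}."
strict_result = strict_extract_sketch(response)  # rejected because of <think>
boxed_result = last_boxed_sketch(response)       # finds the boxed answer
```

With the tag-independent extractor, the answer is found whether or not the model emits `<think>`, so the reward depends only on the final boxed answer.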


Confused about the extract function in rlvr math_rule_reward_worker. #281
