In roll/pipeline/rlvr/rewards/math_rule_reward_worker.py, which contains the default reward worker class for RLVR training on the math dataset, the _extract_after_last_end_think function requires that the response not contain the opening think tag <think>; otherwise the reward is 0. This seems unreasonable, and nothing in the math training dataset data/math_deepmath_deal.jsonl indicates that the response should omit the opening think tag. This invalidates RLVR training. I suggest using the last_boxed_only_string function instead; slime has one you can refer to.
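
For illustration, here is a minimal sketch of the kind of extraction I mean. It is modeled on the last_boxed_only_string helper as it commonly appears in MATH-style graders, and is not the exact code from slime or ROLL. It finds the last \boxed{...} in the response by brace matching, so it does not care whether <think> is present. The sample responses are made up.

```python
from typing import Optional


def last_boxed_only_string(text: str) -> Optional[str]:
    """Return the last \\boxed{...} span in text, or None if there is none.

    Sketch only: assumes the final answer is wrapped in \\boxed{...}.
    """
    start = text.rfind("\\boxed")
    if start < 0:
        return None

    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace of the \boxed argument found.
                return text[start : i + 1]
            if depth < 0:
                # Stray close brace before any open brace.
                return None
    # Braces never balanced (for example, a truncated response).
    return None


# Hypothetical responses: one with the opening <think> tag, one without.
with_open_tag = "<think>2 + 2 = 4</think> The answer is \\boxed{4}."
without_open_tag = "2 + 2 = 4</think> The answer is \\boxed{\\frac{8}{2}}."

print(last_boxed_only_string(with_open_tag))
print(last_boxed_only_string(without_open_tag))
```

Both responses yield a boxed answer here, so the presence of the opening think tag no longer decides the reward.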