
Conversation

@millioniron

Includes TIS and MASKing out samples (tokens, sequences) that exceed thresholds, as well as catastrophic filtering.

Added in base_worker / actor_worker / actor_pg_worker.

Tested: both rlvr and agentic run. The agentic VL-trajectory env manager does not run yet, though, because the re-encoding in its formulate_rollouts may shuffle the tokens(?). Recomputing the logprobs the way verl does might help there; in any case I did not change that part.
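
For intuition, here is a minimal sketch of the idea, with hypothetical names and example thresholds (the actual changes live in the worker files listed below): the per-token importance ratio between the training policy and the inference engine is truncated, and tokens whose ratio falls outside an allowed band are masked out.

import torch

def tis_correct(train_log_probs, infer_log_probs, response_mask,
                is_max=5.0, mask_min=0.1, mask_max=10.0):
    """Illustrative only: truncated importance sampling plus token-level masking."""
    # Per-token ratio pi_train / pi_infer between the two policies.
    ratio = torch.exp(train_log_probs - infer_log_probs)
    # Truncated IS weight that multiplies the policy-gradient loss.
    is_weight = ratio.clamp(max=is_max).detach()
    # Drop tokens whose train-infer mismatch exceeds the allowed band.
    keep = (ratio > mask_min) & (ratio < mask_max)
    return is_weight, response_mask * keep.to(response_mask.dtype)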

	new file:   examples/qwen2.5-infer_correction/agentic_webshop_infer_correction.yaml
	new file:   examples/qwen2.5-infer_correction/rlvr_infer_correction_config.yaml
	new file:   examples/qwen2.5-infer_correction/run_agentic_pipeline_webshop.sh
	new file:   examples/qwen2.5-infer_correction/run_rlvr_pipeline.sh
	modified:   roll/configs/base_config.py
	modified:   roll/configs/generating_args.py
	modified:   roll/distributed/scheduler/generate_scheduler.py
	modified:   roll/pipeline/agentic/env_manager/step_env_manager.py
	modified:   roll/pipeline/agentic/env_manager/traj_env_manager.py
	modified:   roll/pipeline/base_worker.py
	modified:   roll/pipeline/rlvr/actor_pg_worker.py
	modified:   roll/pipeline/rlvr/actor_worker.py
@millioniron millioniron changed the title from 添加训推修复功能 to 添加训推修复功能 (add feature train-infer-mismatch) Dec 3, 2025
@PanAndy
Collaborator

PanAndy commented Dec 3, 2025

Thank you very much for contributing the code. Let’s review this feature together. @wwxFromTju

@PanAndy PanAndy requested review from PanAndy and wwxFromTju December 3, 2025 06:39
@wwxFromTju
Collaborator

> Thank you very much for contributing the code. Let’s review this feature together. @wwxFromTju

ok, took a quick look. The community contributor's main change is adding the new "infer_correction" function; once the training results are verified, it can be merged directly with our new code.

@wwxFromTju
Collaborator

> Includes TIS and MASKing out samples (tokens, sequences) that exceed thresholds, as well as catastrophic filtering.
>
> Added in base_worker / actor_worker / actor_pg_worker.
>
> Tested: both rlvr and agentic run. The agentic VL-trajectory env manager does not run yet, though, because the re-encoding in its formulate_rollouts may shuffle the tokens(?). Recomputing the logprobs the way verl does might help there; in any case I did not change that part.

hey, hello, do you have any test results from running it? Let's set up a comparison.

@millioniron
Author

Hello, I do have some test comparisons on my side, but since they were run on some of our company's private data, I probably can't release them. All I can say is that it is indeed more stable and does improve performance.

@PanAndy
Collaborator

PanAndy commented Dec 3, 2025

> Hello, I do have some test comparisons on my side, but since they were run on some of our company's private data, I probably can't release them. All I can say is that it is indeed more stable and does improve performance.

Would you mind running the datasets in the repo? The 7B rlvr one, plus agentic FrozenLake or Sokoban.

@millioniron
Author

Sure, I'll get back to you tomorrow.

@tt0718

tt0718 commented Dec 5, 2025

@millioniron
👋 Hello from ROLL! Thank you for your interest in and contributions to our project!
🤝 We at ROLL are eager to build stronger connections with our community developers. Let's foster collaboration and grow together! Going forward, the project will also organize regular events and send gifts to active contributors.
📧 For further discussion, feel free to add me on WeChat: tt19960718tt
If you don't use WeChat, you can also reach me by email: tangtang.tt@alibaba-inc.com

@millioniron
Author

hi, it's been a while. In the meantime I ran experiments on the public repo's data; let me go over the results.

RLVR: Qwen2.5-7B on the math dataset.

[image (6)]

Whether or not infer-correction is enabled makes little difference here, though, because there is not much train-infer divergence to begin with; I suspect that is because there is no external feedback and the response length (10240) is not long enough.

Next:

agentic: trained on the LargerSokoban environment with Qwen3-4B-Thinking-2507; some training metric charts:
[image (7)]
[image (8)]

Without infer-correction, the model collapses partway through training, but perhaps because the game environment is fairly simple, it recovers on its own after the collapse.
[image (9)]
[image (10)]

With infer-correction enabled, a fairly large number of items exceed the threshold; I suspect that is caused by the environment's formatting and similar issues, which differ too much from the model's own input/output.
[image (11)]

I then also ran a code4math environment; so far it looks like this, and there the runs with infer-correction are quite stably better.

@millioniron
Author

millioniron commented Dec 8, 2025

The configuration is as follows:

defaults:
  - ../config/traj_envs@_here_
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "agentic_pipeline_none_correction"
seed: 42
logging_dir: ./output/logs/${exp_name}
output_dir: ./output
render_save_dir: ./output/render
system_envs:
  USE_MODELSCOPE: '1'

#track_with: wandb
#tracker_kwargs:
#  api_key:
#  project: roll-agentic
#  name: ${exp_name}_sokoban
#  notes: "agentic_pipeline"
#  tags:
#    - agentic
#    - roll
#    - baseline

track_with: tensorboard
tracker_kwargs:
  log_dir: ./tensorboard/roll_exp/${exp_name}

checkpoint_config:
  type: file_system
  output_dir: ./agentic/models/${exp_name}

num_gpus_per_node: 8

max_steps: 500
save_steps: 10000
logging_steps: 1
eval_steps: 20
resume_from_checkpoint: false

async_generation_ratio: 1

rollout_batch_size: 256
val_batch_size: 256
sequence_length: 24576

ppo_epochs: 1
adv_estimator: "grpo"
#pg_clip: 0.1
#dual_clip_loss: True
init_kl_coef: 0.0
whiten_advantages: true
entropy_loss_coef: 0
max_grad_norm: 1.0
add_token_level_kl: false
use_kl_loss: false


global_template: qwen3
pretrain: /Qwen3-4B-Thinking-2507
reward_pretrain: /Qwen3-4B-Thinking-2507

infer_correction: true 

infer_is_mode: None 
infer_is_threshold_min: 0.0
infer_is_threshold_max: 5.0     # 1.5~5.0

enable_token_reject: true
infer_token_mask_threshold_min: 0.1
infer_token_mask_threshold_max: 10.0 # 2~10

enable_catastrophic_reject: true
infer_catastrophic_threshold: 1e-4

enable_seq_reject: None

actor_train:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 32
    warmup_steps: 10
    lr_scheduler_type: cosine
  data_args:
    template: qwen3
  strategy_args:
#    strategy_name: deepspeed_train
#    strategy_config: ${deepspeed_zero3}
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      use_distributed_optimizer: true
      recompute_granularity: full
  device_mapping: list(range(0,4))
  infer_batch_size: 2
  # use_dynamic_batching_in_infer: true
  # max_tokens_per_microbatch_in_infer: 30732
  # sequence_length_round_in_infer: 128

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: 2048 # single-turn response length
    top_p: 1
    top_k: 100
    num_beams: 1
    temperature: 1
    num_return_sequences: 1
    logprobs: 1
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: sglang
    strategy_config:
      mem_fraction_static: 0.85
      load_format: auto
  device_mapping: list(range(4,8))
  # use_dynamic_batching_in_infer: true
  # max_tokens_per_microbatch_in_infer: 30732
  # sequence_length_round_in_infer: 128

reference:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: hf_infer
    strategy_config: ~
  device_mapping: list(range(0,4))
  infer_batch_size: 8
  # use_dynamic_batching_in_infer: true
  # max_tokens_per_microbatch_in_infer: 30732
  # sequence_length_round_in_infer: 128

reward_normalization:
  grouping: traj_group_id # group_by for computing reward/adv: tags (env_type) / traj_group_id (group) / batch (rollout_batch) / ...
  method: mean_std # asym_clip / identity / mean_std

train_env_manager:
  max_env_num_per_worker: 16
  num_env_groups: 32
  # under the same group, the env config and env seed are ensured to be equal
  group_size: 8
  tags: [LargerSokoban]
  num_groups_partition: [32] # If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation

val_env_manager:
  max_env_num_per_worker: 32
  num_env_groups: 256
  group_size: 1 # should be set to 1 because val temperature is set to 0 and same prompt leads to same output
  tags: [LargerSokoban]
  num_groups_partition: [256] # TODO: If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation

# Here, you can override variables defined in the imported envs. max_tokens_per_step: 128 in custom_env.SimpleSokoban, here replaced by 64
max_tokens_per_step: 2048
sokoban_format_penalty: 0.0
max_actions_per_traj: 8


custom_envs:
  SimpleSokoban:
    ${custom_env.SimpleSokoban}
  LargerSokoban:
    ${custom_env.LargerSokoban}
  SokobanDifferentGridVocab:
    ${custom_env.SokobanDifferentGridVocab}
  FrozenLake:
    ${custom_env.FrozenLake}
  FrozenLakeThink:
    ${custom_env.FrozenLakeThink}
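
As an aside for readers skimming the YAML: the keys added by this PR are the infer_correction, infer_is_*, enable_*_reject, and infer_*_threshold* entries near the top. A rough sketch of how they might be grouped and validated follows; the dataclass is hypothetical, not the actual roll/configs code, and the seq-mask defaults are assumptions:

from dataclasses import dataclass
from typing import Literal

@dataclass
class InferCorrectionArgs:
    # Hypothetical grouping of the new YAML keys shown above.
    infer_correction: bool = False
    infer_is_mode: Literal["token", "sequence", "geometric", "none"] = "token"
    infer_is_threshold_min: float = 0.0
    infer_is_threshold_max: float = 5.0           # 1.5~5.0 per the comment above
    enable_token_reject: bool = False
    infer_token_mask_threshold_min: float = 0.1
    infer_token_mask_threshold_max: float = 10.0  # 2~10 per the comment above
    enable_seq_reject: bool = False
    infer_seq_mask_threshold_min: float = 0.1     # assumed default, not in this config
    infer_seq_mask_threshold_max: float = 10.0    # assumed default, not in this config
    enable_catastrophic_reject: bool = False
    infer_catastrophic_threshold: float = 1e-4

    def __post_init__(self):
        if self.infer_is_threshold_min > self.infer_is_threshold_max:
            raise ValueError("infer_is_threshold_min must not exceed infer_is_threshold_max")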

@PanAndy
Collaborator

PanAndy commented Dec 8, 2025

#283 also includes some infer_logprobs changes, which introduce some conflicts into this PR; please rebase.
I'll also go run your code 🫡

@millioniron
Author

OK.

	new file:   examples/qwen2.5-infer_correction/agentic_webshop_infer_correction.yaml
	new file:   examples/qwen2.5-infer_correction/rlvr_infer_correction_config.yaml
	new file:   examples/qwen2.5-infer_correction/run_agentic_pipeline_webshop.sh
	new file:   examples/qwen2.5-infer_correction/run_rlvr_pipeline.sh
	modified:   roll/configs/base_config.py
	modified:   roll/configs/generating_args.py
	modified:   roll/distributed/scheduler/generate_scheduler.py
	modified:   roll/pipeline/agentic/env_manager/step_env_manager.py
	modified:   roll/pipeline/agentic/env_manager/traj_env_manager.py
	modified:   roll/pipeline/base_worker.py
	modified:   roll/pipeline/rlvr/actor_pg_worker.py
	modified:   roll/pipeline/rlvr/actor_worker.py
@millioniron
Author

hi, I've done the merge. Specifically, for the part of the new version that obtains the infer log-probs I followed the official version, but for the train-infer correction itself I kept my own revision.

✨ What's Changed

1. Core component refactor

  • Added an InferCorrectionHandler class (roll/utils/infer_correction.py): it focuses on IS correction + sample rejection, replacing the logic previously mixed into loss_func
    handler = InferCorrectionHandler(pipeline_config)
    weighted_loss, final_mask, metrics = handler(
        old_log_probs, infer_log_probs, response_mask, pg_loss
    )
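
To tie the pieces together, here is a sketch of what a handler with this call signature might look like inside (the real roll/utils/infer_correction.py may differ; only the token-level path is shown, and the config attribute names follow the example YAML above):

import torch

class InferCorrectionHandler:
    # Illustrative skeleton only; mirrors the signature quoted above.
    def __init__(self, config):
        self.config = config

    def __call__(self, old_log_probs, infer_log_probs, response_mask, pg_loss):
        ratio = torch.exp(old_log_probs - infer_log_probs)

        # IS weight, truncated to the configured band (section 3 below).
        is_weight = ratio.clamp(self.config.infer_is_threshold_min,
                                self.config.infer_is_threshold_max).detach()

        # Token-level rejection (section 2 below); sequence-level and
        # catastrophic masks would be ANDed in here as well.
        keep = (ratio > self.config.infer_token_mask_threshold_min) & \
               (ratio < self.config.infer_token_mask_threshold_max)
        final_mask = response_mask * keep.to(response_mask.dtype)

        weighted_loss = pg_loss * is_weight
        metrics = {"token_ratio_mean": ratio[response_mask.bool()].mean().item()}
        return weighted_loss, final_mask, metrics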

2. Three-level rejection strategy

| Strategy | Trigger condition | Goal | Key parameters |
| --- | --- | --- | --- |
| Token-level rejection | per-token IS ratio leaves the allowed range | prevent single-token gradient explosions | infer_token_mask_threshold_{min,max} |
| Sequence-level rejection | the sequence's overall IS ratio is abnormal | preserve sequence-level consistency | enable_seq_reject, infer_seq_mask_threshold_{min,max} |
| Catastrophic rejection | IS ratio < 1e-3 (exponentially large probability gap) | prevent complete training collapse | infer_catastrophic_threshold |
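
A sketch of how the three masks from the table could be combined (illustrative only; whether the sequence-level check uses the geometric mean or the sum of log-ratios is an assumption here):

import torch

def build_reject_masks(log_ratio, response_mask,
                       token_min=0.1, token_max=10.0,
                       seq_min=0.1, seq_max=10.0,
                       catastrophic=1e-4):
    """Illustrative only: token / sequence / catastrophic rejection masks."""
    ratio = torch.exp(log_ratio)                                   # [batch, seq_len]

    # Token-level: drop individual tokens whose IS ratio leaves the allowed band.
    token_keep = (ratio > token_min) & (ratio < token_max)

    # Sequence-level: compare a per-sequence aggregate ratio (geometric mean here) to a band.
    lengths = response_mask.sum(dim=-1, keepdim=True).clamp(min=1)
    seq_ratio = torch.exp((log_ratio * response_mask).sum(dim=-1, keepdim=True) / lengths)
    seq_keep = (seq_ratio > seq_min) & (seq_ratio < seq_max)       # [batch, 1], broadcasts

    # Catastrophic: reject any sequence containing a token with an exponentially small ratio.
    has_catastrophic = ((ratio < catastrophic) & response_mask.bool()).any(dim=-1, keepdim=True)

    keep = token_keep & seq_keep & ~has_catastrophic
    return response_mask * keep.to(response_mask.dtype)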

3. Smart importance sampling

  • Dynamic mode switching
    infer_is_mode: Literal["token", "sequence", "geometric", "none"]
    • token: conventional token-level IS (default)
    • sequence: total sequence-level log-ratio (stabilizes long-sequence training)
    • geometric: geometric-mean ratio (balances out extreme values)
    • none: IS disabled (for baselines)
  • Adaptive clipping
    is_weight = raw_is_weight.clamp(
        min=infer_is_threshold_min,
        max=infer_is_threshold_max,
    )
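
The non-trivial modes could be realized roughly as follows, assuming log_ratio = old_log_probs - infer_log_probs (a sketch, not the PR's exact code):

import torch

def compute_is_weight(log_ratio, response_mask, mode="token",
                      w_min=0.0, w_max=5.0):
    """Illustrative only: token / sequence / geometric IS modes with clipping."""
    if mode == "none":
        return torch.ones_like(log_ratio)

    if mode == "token":
        # Per-token ratio pi_train(t) / pi_infer(t).
        raw = torch.exp(log_ratio)
    elif mode == "sequence":
        # One weight per sequence from the total log-ratio, broadcast to tokens.
        raw = torch.exp((log_ratio * response_mask).sum(dim=-1, keepdim=True))
        raw = raw.expand_as(log_ratio)
    elif mode == "geometric":
        # Geometric mean of per-token ratios, which damps extreme values.
        lengths = response_mask.sum(dim=-1, keepdim=True).clamp(min=1)
        raw = torch.exp((log_ratio * response_mask).sum(dim=-1, keepdim=True) / lengths)
        raw = raw.expand_as(log_ratio)
    else:
        raise ValueError(f"unknown infer_is_mode: {mode}")

    return raw.clamp(min=w_min, max=w_max).detach()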

4. Industrial-grade diagnostics

  • StatsCollector manages the metrics centrally, in three groups:
    • basic distributions: token_ratio_mean/std/min/max
    • rejection analysis: token_reject_frac, seq_reject_frac, catastrophic_seq_frac
    • training health: inferkl (raw KL), inferkl_reject (KL after rejection)
  • Deferred computation to avoid frequent GPU-CPU synchronization:
    self.stats.add_tensor_stat("token_ratio", ratio, mask)  # register without computing immediately
    self.stats.compute_tensor_stats()  # compute everything in one batch
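
The deferred pattern could look roughly like this (a sketch; the PR's actual StatsCollector may differ): tensors are registered as-is, and the GPU-CPU syncs from .item() happen in a single pass at the end.

import torch

class StatsCollector:
    """Illustrative sketch: register tensors now, sync them to CPU scalars in one pass later."""

    def __init__(self):
        self._pending = []   # (name, tensor, mask) tuples, kept on device
        self.metrics = {}

    def add_tensor_stat(self, name, tensor, mask=None):
        # No .item() here, so registering a stat triggers no GPU-CPU sync.
        self._pending.append((name, tensor.detach(), mask))

    def compute_tensor_stats(self):
        for name, tensor, mask in self._pending:
            values = tensor[mask.bool()] if mask is not None else tensor.flatten()
            if values.numel() == 0:
                continue
            self.metrics[f"{name}_mean"] = values.mean().item()
            self.metrics[f"{name}_std"] = values.std().item() if values.numel() > 1 else 0.0
            self.metrics[f"{name}_min"] = values.min().item()
            self.metrics[f"{name}_max"] = values.max().item()
        self._pending.clear()
        return self.metrics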

@millioniron
Author

Because I've been keeping up with the latest version, this PR has accumulated quite a few commits. Should I re-submit it as a new PR?

@PanAndy
Collaborator

PanAndy commented Dec 8, 2025

> Because I've been keeping up with the latest version, this PR has accumulated quite a few commits. Should I re-submit it as a new PR?

You could squash it yourself and then push?

@millioniron millioniron closed this Dec 8, 2025
@millioniron
Author

I opened a new pull request: #288
