
Conversation

@millioniron

Includes TIS and MASKing out samples (tokens, sequences) that exceed thresholds, as well as catastrophic filtering.

Added in base_worker / actor_worker / actor_pg_worker.

Tested: both rlvr and agentic run. The agentic VL-trajectory env manager does not run yet, though, because the re-encoding in its formulate_rollouts may shuffle the tokens(?). Recomputing the logprobs the way verl does might help there; in any case I did not change that part.
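
For intuition, here is a minimal sketch of the idea, with hypothetical names and example thresholds (the actual changes live in the worker files listed below): the per-token importance ratio between the training policy and the inference engine is truncated, and tokens whose ratio falls outside an allowed band are masked out.

import torch

def tis_correct(train_log_probs, infer_log_probs, response_mask,
                is_max=5.0, mask_min=0.1, mask_max=10.0):
    """Illustrative only: truncated importance sampling plus token-level masking."""
    # Per-token ratio pi_train / pi_infer between the two policies.
    ratio = torch.exp(train_log_probs - infer_log_probs)
    # Truncated IS weight that multiplies the policy-gradient loss.
    is_weight = ratio.clamp(max=is_max).detach()
    # Drop tokens whose train-infer mismatch exceeds the allowed band.
    keep = (ratio > mask_min) & (ratio < mask_max)
    return is_weight, response_mask * keep.to(response_mask.dtype)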

	new file:   examples/qwen2.5-infer_correction/agentic_webshop_infer_correction.yaml
	new file:   examples/qwen2.5-infer_correction/rlvr_infer_correction_config.yaml
	new file:   examples/qwen2.5-infer_correction/run_agentic_pipeline_webshop.sh
	new file:   examples/qwen2.5-infer_correction/run_rlvr_pipeline.sh
	modified:   roll/configs/base_config.py
	modified:   roll/configs/generating_args.py
	modified:   roll/distributed/scheduler/generate_scheduler.py
	modified:   roll/pipeline/agentic/env_manager/step_env_manager.py
	modified:   roll/pipeline/agentic/env_manager/traj_env_manager.py
	modified:   roll/pipeline/base_worker.py
	modified:   roll/pipeline/rlvr/actor_pg_worker.py
	modified:   roll/pipeline/rlvr/actor_worker.py
@millioniron millioniron changed the title from 添加训推修复功能 to 添加训推修复功能 (add feature train-infer-mismatch) Dec 3, 2025
@PanAndy
Collaborator

PanAndy commented Dec 3, 2025

Thank you very much for contributing the code. Let’s review this feature together. @wwxFromTju

@PanAndy PanAndy requested review from PanAndy and wwxFromTju December 3, 2025 06:39
@wwxFromTju
Collaborator

> Thank you very much for contributing the code. Let’s review this feature together. @wwxFromTju

ok, took a quick look. The community contributor's main change is adding the new "infer_correction" function; once the training results are verified, it can be merged directly with our new code.

@wwxFromTju
Collaborator

> Includes TIS and MASKing out samples (tokens, sequences) that exceed thresholds, as well as catastrophic filtering.
>
> Added in base_worker / actor_worker / actor_pg_worker.
>
> Tested: both rlvr and agentic run. The agentic VL-trajectory env manager does not run yet, though, because the re-encoding in its formulate_rollouts may shuffle the tokens(?). Recomputing the logprobs the way verl does might help there; in any case I did not change that part.

hey, hello, do you have any test results from running it? Let's set up a comparison.

@millioniron
Author

Hello, I do have some test comparisons on my side, but since they were run on some of our company's private data, I probably can't release them. All I can say is that it is indeed more stable and does improve performance.

@PanAndy
Collaborator

PanAndy commented Dec 3, 2025

> Hello, I do have some test comparisons on my side, but since they were run on some of our company's private data, I probably can't release them. All I can say is that it is indeed more stable and does improve performance.

Would you mind running the datasets in the repo? The 7B rlvr one, plus agentic FrozenLake or Sokoban.

@millioniron
Author

Sure, I'll get back to you tomorrow.

@tt0718

tt0718 commented Dec 5, 2025

@millioniron
👋 Hello from ROLL! Thank you for your interest in and contributions to our project!
🤝 We at ROLL are eager to build stronger connections with our community developers. Let's foster collaboration and grow together! Going forward, the project will also organize regular events and send gifts to active contributors.
📧 For further discussion, feel free to add me on WeChat: tt19960718tt
If you don't use WeChat, you can also reach me by email: tangtang.tt@alibaba-inc.com

@millioniron
Author

hi, it's been a while. In the meantime I ran experiments on the public repo's data; let me go over the results.

RLVR: Qwen2.5-7B on the math dataset.

[image (6)]

Whether or not infer-correction is enabled makes little difference here, though, because there is not much train-infer divergence to begin with; I suspect that is because there is no external feedback and the response length (10240) is not long enough.

Next:

agentic: trained on the LargerSokoban environment with Qwen3-4B-Thinking-2507; some training metric charts:
[image (7)]
[image (8)]

Without infer-correction, the model collapses partway through training, but perhaps because the game environment is fairly simple, it recovers on its own after the collapse.
[image (9)]
[image (10)]

With infer-correction enabled, a fairly large number of items exceed the threshold; I suspect that is caused by the environment's formatting and similar issues, which differ too much from the model's own input/output.
[image (11)]

I then also ran a code4math environment; so far it looks like this, and there the runs with infer-correction are quite stably better.

@millioniron
Author

millioniron commented Dec 8, 2025

The configuration is as follows:

defaults:
  - ../config/traj_envs@_here_
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "agentic_pipeline_none_correction"
seed: 42
logging_dir: ./output/logs/${exp_name}
output_dir: ./output
render_save_dir: ./output/render
system_envs:
  USE_MODELSCOPE: '1'

#track_with: wandb
#tracker_kwargs:
#  api_key:
#  project: roll-agentic
#  name: ${exp_name}_sokoban
#  notes: "agentic_pipeline"
#  tags:
#    - agentic
#    - roll
#    - baseline

track_with: tensorboard
tracker_kwargs:
  log_dir: ./tensorboard/roll_exp/${exp_name}

checkpoint_config:
  type: file_system
  output_dir: ./agentic/models/${exp_name}

num_gpus_per_node: 8

max_steps: 500
save_steps: 10000
logging_steps: 1
eval_steps: 20
resume_from_checkpoint: false

async_generation_ratio: 1

rollout_batch_size: 256
val_batch_size: 256
sequence_length: 24576

ppo_epochs: 1
adv_estimator: "grpo"
#pg_clip: 0.1
#dual_clip_loss: True
init_kl_coef: 0.0
whiten_advantages: true
entropy_loss_coef: 0
max_grad_norm: 1.0
add_token_level_kl: false
use_kl_loss: false


global_template: qwen3
pretrain: /Qwen3-4B-Thinking-2507
reward_pretrain: /Qwen3-4B-Thinking-2507

infer_correction: true 

infer_is_mode: None 
infer_is_threshold_min: 0.0
infer_is_threshold_max: 5.0     # 1.5~5.0

enable_token_reject: true
infer_token_mask_threshold_min: 0.1
infer_token_mask_threshold_max: 10.0 # 2~10

enable_catastrophic_reject: true
infer_catastrophic_threshold: 1e-4

enable_seq_reject: None

actor_train:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 32
    warmup_steps: 10
    lr_scheduler_type: cosine
  data_args:
    template: qwen3
  strategy_args:
#    strategy_name: deepspeed_train
#    strategy_config: ${deepspeed_zero3}
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      use_distributed_optimizer: true
      recompute_granularity: full
  device_mapping: list(range(0,4))
  infer_batch_size: 2
  # use_dynamic_batching_in_infer: true
  # max_tokens_per_microbatch_in_infer: 30732
  # sequence_length_round_in_infer: 128

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: 2048 # single-turn response length
    top_p: 1
    top_k: 100
    num_beams: 1
    temperature: 1
    num_return_sequences: 1
    logprobs: 1
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: sglang
    strategy_config:
      mem_fraction_static: 0.85
      load_format: auto
  device_mapping: list(range(4,8))
  # use_dynamic_batching_in_infer: true
  # max_tokens_per_microbatch_in_infer: 30732
  # sequence_length_round_in_infer: 128

reference:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: hf_infer
    strategy_config: ~
  device_mapping: list(range(0,4))
  infer_batch_size: 8
  # use_dynamic_batching_in_infer: true
  # max_tokens_per_microbatch_in_infer: 30732
  # sequence_length_round_in_infer: 128

reward_normalization:
  grouping: traj_group_id # group_by for computing reward/adv: tags (env_type) / traj_group_id (group) / batch (rollout_batch) / ...
  method: mean_std # asym_clip / identity / mean_std

train_env_manager:
  max_env_num_per_worker: 16
  num_env_groups: 32
  # under the same group, the env config and env seed are ensured to be equal
  group_size: 8
  tags: [LargerSokoban]
  num_groups_partition: [32] # If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation

val_env_manager:
  max_env_num_per_worker: 32
  num_env_groups: 256
  group_size: 1 # should be set to 1 because val temperature is set to 0 and same prompt leads to same output
  tags: [LargerSokoban]
  num_groups_partition: [256] # TODO: If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation

# Here, you can override variables defined in the imported envs. max_tokens_per_step: 128 in custom_env.SimpleSokoban, here replaced by 64
max_tokens_per_step: 2048
sokoban_format_penalty: 0.0
max_actions_per_traj: 8


custom_envs:
  SimpleSokoban:
    ${custom_env.SimpleSokoban}
  LargerSokoban:
    ${custom_env.LargerSokoban}
  SokobanDifferentGridVocab:
    ${custom_env.SokobanDifferentGridVocab}
  FrozenLake:
    ${custom_env.FrozenLake}
  FrozenLakeThink:
    ${custom_env.FrozenLakeThink}
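
As an aside for readers skimming the YAML: the keys added by this PR are the infer_correction, infer_is_*, enable_*_reject, and infer_*_threshold* entries near the top. A rough sketch of how they might be grouped and validated follows; the dataclass is hypothetical, not the actual roll/configs code, and the seq-mask defaults are assumptions:

from dataclasses import dataclass
from typing import Literal

@dataclass
class InferCorrectionArgs:
    # Hypothetical grouping of the new YAML keys shown above.
    infer_correction: bool = False
    infer_is_mode: Literal["token", "sequence", "geometric", "none"] = "token"
    infer_is_threshold_min: float = 0.0
    infer_is_threshold_max: float = 5.0           # 1.5~5.0 per the comment above
    enable_token_reject: bool = False
    infer_token_mask_threshold_min: float = 0.1
    infer_token_mask_threshold_max: float = 10.0  # 2~10 per the comment above
    enable_seq_reject: bool = False
    infer_seq_mask_threshold_min: float = 0.1     # assumed default, not in this config
    infer_seq_mask_threshold_max: float = 10.0    # assumed default, not in this config
    enable_catastrophic_reject: bool = False
    infer_catastrophic_threshold: float = 1e-4

    def __post_init__(self):
        if self.infer_is_threshold_min > self.infer_is_threshold_max:
            raise ValueError("infer_is_threshold_min must not exceed infer_is_threshold_max")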

@PanAndy
Collaborator

PanAndy commented Dec 8, 2025

#283 also includes some infer_logprobs changes, which introduce some conflicts into this PR; please rebase.
I'll also go run your code 🫡

@millioniron
Author

OK.

	new file:   examples/qwen2.5-infer_correction/agentic_webshop_infer_correction.yaml
	new file:   examples/qwen2.5-infer_correction/rlvr_infer_correction_config.yaml
	new file:   examples/qwen2.5-infer_correction/run_agentic_pipeline_webshop.sh
	new file:   examples/qwen2.5-infer_correction/run_rlvr_pipeline.sh
	modified:   roll/configs/base_config.py
	modified:   roll/configs/generating_args.py
	modified:   roll/distributed/scheduler/generate_scheduler.py
	modified:   roll/pipeline/agentic/env_manager/step_env_manager.py
	modified:   roll/pipeline/agentic/env_manager/traj_env_manager.py
	modified:   roll/pipeline/base_worker.py
	modified:   roll/pipeline/rlvr/actor_pg_worker.py
	modified:   roll/pipeline/rlvr/actor_worker.py
@millioniron
Author

hi, I've done the merge. Specifically, for the part of the new version that obtains the infer log-probs I followed the official version, but for the train-infer correction itself I kept my own revision.

✨ What's Changed

1. Core component refactor

  • Added an InferCorrectionHandler class (roll/utils/infer_correction.py): it focuses on IS correction + sample rejection, replacing the logic previously mixed into loss_func
    handler = InferCorrectionHandler(pipeline_config)
    weighted_loss, final_mask, metrics = handler(
        old_log_probs, infer_log_probs, response_mask, pg_loss
    )
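
To tie the pieces together, here is a sketch of what a handler with this call signature might look like inside (the real roll/utils/infer_correction.py may differ; only the token-level path is shown, and the config attribute names follow the example YAML above):

import torch

class InferCorrectionHandler:
    # Illustrative skeleton only; mirrors the signature quoted above.
    def __init__(self, config):
        self.config = config

    def __call__(self, old_log_probs, infer_log_probs, response_mask, pg_loss):
        ratio = torch.exp(old_log_probs - infer_log_probs)

        # IS weight, truncated to the configured band (section 3 below).
        is_weight = ratio.clamp(self.config.infer_is_threshold_min,
                                self.config.infer_is_threshold_max).detach()

        # Token-level rejection (section 2 below); sequence-level and
        # catastrophic masks would be ANDed in here as well.
        keep = (ratio > self.config.infer_token_mask_threshold_min) & \
               (ratio < self.config.infer_token_mask_threshold_max)
        final_mask = response_mask * keep.to(response_mask.dtype)

        weighted_loss = pg_loss * is_weight
        metrics = {"token_ratio_mean": ratio[response_mask.bool()].mean().item()}
        return weighted_loss, final_mask, metrics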

2. Three-level rejection strategy

| Strategy | Trigger condition | Goal | Key parameters |
| --- | --- | --- | --- |
| Token-level rejection | per-token IS ratio leaves the allowed range | prevent single-token gradient explosions | infer_token_mask_threshold_{min,max} |
| Sequence-level rejection | the sequence's overall IS ratio is abnormal | preserve sequence-level consistency | enable_seq_reject, infer_seq_mask_threshold_{min,max} |
| Catastrophic rejection | IS ratio < 1e-3 (exponentially large probability gap) | prevent complete training collapse | infer_catastrophic_threshold |
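
A sketch of how the three masks from the table could be combined (illustrative only; whether the sequence-level check uses the geometric mean or the sum of log-ratios is an assumption here):

import torch

def build_reject_masks(log_ratio, response_mask,
                       token_min=0.1, token_max=10.0,
                       seq_min=0.1, seq_max=10.0,
                       catastrophic=1e-4):
    """Illustrative only: token / sequence / catastrophic rejection masks."""
    ratio = torch.exp(log_ratio)                                   # [batch, seq_len]

    # Token-level: drop individual tokens whose IS ratio leaves the allowed band.
    token_keep = (ratio > token_min) & (ratio < token_max)

    # Sequence-level: compare a per-sequence aggregate ratio (geometric mean here) to a band.
    lengths = response_mask.sum(dim=-1, keepdim=True).clamp(min=1)
    seq_ratio = torch.exp((log_ratio * response_mask).sum(dim=-1, keepdim=True) / lengths)
    seq_keep = (seq_ratio > seq_min) & (seq_ratio < seq_max)       # [batch, 1], broadcasts

    # Catastrophic: reject any sequence containing a token with an exponentially small ratio.
    has_catastrophic = ((ratio < catastrophic) & response_mask.bool()).any(dim=-1, keepdim=True)

    keep = token_keep & seq_keep & ~has_catastrophic
    return response_mask * keep.to(response_mask.dtype)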

3. Smart importance sampling

  • Dynamic mode switching
    infer_is_mode: Literal["token", "sequence", "geometric", "none"]
    • token: conventional token-level IS (default)
    • sequence: total sequence-level log-ratio (stabilizes long-sequence training)
    • geometric: geometric-mean ratio (balances out extreme values)
    • none: IS disabled (for baselines)
  • Adaptive clipping
    is_weight = raw_is_weight.clamp(
        min=infer_is_threshold_min,
        max=infer_is_threshold_max,
    )
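
The non-trivial modes could be realized roughly as follows, assuming log_ratio = old_log_probs - infer_log_probs (a sketch, not the PR's exact code):

import torch

def compute_is_weight(log_ratio, response_mask, mode="token",
                      w_min=0.0, w_max=5.0):
    """Illustrative only: token / sequence / geometric IS modes with clipping."""
    if mode == "none":
        return torch.ones_like(log_ratio)

    if mode == "token":
        # Per-token ratio pi_train(t) / pi_infer(t).
        raw = torch.exp(log_ratio)
    elif mode == "sequence":
        # One weight per sequence from the total log-ratio, broadcast to tokens.
        raw = torch.exp((log_ratio * response_mask).sum(dim=-1, keepdim=True))
        raw = raw.expand_as(log_ratio)
    elif mode == "geometric":
        # Geometric mean of per-token ratios, which damps extreme values.
        lengths = response_mask.sum(dim=-1, keepdim=True).clamp(min=1)
        raw = torch.exp((log_ratio * response_mask).sum(dim=-1, keepdim=True) / lengths)
        raw = raw.expand_as(log_ratio)
    else:
        raise ValueError(f"unknown infer_is_mode: {mode}")

    return raw.clamp(min=w_min, max=w_max).detach()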

4. Industrial-grade diagnostics

  • StatsCollector manages the metrics centrally, in three groups:
    • basic distributions: token_ratio_mean/std/min/max
    • rejection analysis: token_reject_frac, seq_reject_frac, catastrophic_seq_frac
    • training health: inferkl (raw KL), inferkl_reject (KL after rejection)
  • Deferred computation to avoid frequent GPU-CPU synchronization:
    self.stats.add_tensor_stat("token_ratio", ratio, mask)  # register without computing immediately
    self.stats.compute_tensor_stats()  # compute everything in one batch
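
The deferred pattern could look roughly like this (a sketch; the PR's actual StatsCollector may differ): tensors are registered as-is, and the GPU-CPU syncs from .item() happen in a single pass at the end.

import torch

class StatsCollector:
    """Illustrative sketch: register tensors now, sync them to CPU scalars in one pass later."""

    def __init__(self):
        self._pending = []   # (name, tensor, mask) tuples, kept on device
        self.metrics = {}

    def add_tensor_stat(self, name, tensor, mask=None):
        # No .item() here, so registering a stat triggers no GPU-CPU sync.
        self._pending.append((name, tensor.detach(), mask))

    def compute_tensor_stats(self):
        for name, tensor, mask in self._pending:
            values = tensor[mask.bool()] if mask is not None else tensor.flatten()
            if values.numel() == 0:
                continue
            self.metrics[f"{name}_mean"] = values.mean().item()
            self.metrics[f"{name}_std"] = values.std().item() if values.numel() > 1 else 0.0
            self.metrics[f"{name}_min"] = values.min().item()
            self.metrics[f"{name}_max"] = values.max().item()
        self._pending.clear()
        return self.metrics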

@millioniron
Author

Because I've been keeping up with the latest version, this PR has accumulated quite a few commits. Should I re-submit it as a new PR?

@PanAndy
Collaborator

PanAndy commented Dec 8, 2025

> Because I've been keeping up with the latest version, this PR has accumulated quite a few commits. Should I re-submit it as a new PR?

You could squash it yourself and then push?

@millioniron millioniron closed this Dec 8, 2025
@millioniron
Author

I opened a new pull request: #288
